Improved gene annotation of Z. tritici

RNA-seq data and detailed homology searches allowed us to generate an improved annotation of gene models in the IPO323 Z. tritici reference genome. Read data obtained from in planta and in vitro experiments were filtered (Table S1) and used for transcriptome assembly. Assemblies were based on two different but complementary approaches: (i) de novo, where the reads were assembled without a reference sequence, and (ii) genome-guided, where the IPO323 reference genome was used as a template for transcript assembly. In total, 68,653 transcripts were generated by the de novo method and 73,127 transcripts by the genome-guided method (Table 1). These transcripts were used to create a final database containing 13,847 putative gene models, which were then used to generate a training set for ab initio predictors and to correct predicted gene models. We generated a final training set of 1611 gene models that was used to train two ab initio gene predictors: Augustus (Stanke and Waack 2003) and GeneMark-hmm (Lukashin and Borodovsky 1998). The outputs of these programs were combined with 10,893 predicted gene models of another ab initio predictor, Fgenesh (Salamov and Solovyev 2000), obtained from the JGI annotation. We thereby obtained a total of 36,423 different gene models combining models from Augustus, GeneMark-hmm, and Fgenesh. These gene models were processed by EVidenceModeler (EVM) (Haas et al. 2008) to identify a consensus model for each gene. Finally, corrections were applied to the gene models using PASA (Haas et al. 2003) and the previously reconstructed RNA-seq-based transcripts.

Table 1. Annotation features for the four members of the Zymoseptoria species complex

| Feature | Z. tritici (previous annotation) | Z. tritici (new annotation) | Z. pseudotritici | Z. ardabiliae | Z. brevis |
|---|---|---|---|---|---|
| Assembly | | | | | |
| Assembly size (Mb) | 39.7 | 39.7 | 32.7 | 31.5 | 31.9 |
| No. of scaffolds | 21 | 21 | 1164 | 868 | 6116 |
| Transcriptome reconstruction | | | | | |
| No. of transcripts (de novo) | - | 68,653 | 16,056 | 16,353 | 17,870 |
| No. of transcripts (genome-guided) | - | 73,127 | 19,076 | 20,193 | 15,331 |
| No. of gene models | - | 13,847 | 12,027 | 12,719 | 10,649 |
| Gene annotation | | | | | |
| No. of predicted gene models | 10,952 | 11,839 | 11,044 | 10,787 | 10,557 |
| No. of complete gene models | 9397 | 11,795 | 10,957 | 10,686 | 10,342 |
| No. of partial gene models | 1555 | 44 | 87 | 101 | 215 |
| No. of gene models with RNA-seq support^a | 9423 | 10,048 | 7618 | 8297 | 9939 |
| Average gene length (bp) | 1599.8 | 1620.9 | 1594.3 | 1584.9 | 1592.8 |
| Average transcript length (bp) | 1388.8 | 1462.1 | 1459.4 | 1440.9 | 1462.5 |
| Average protein length (aa) | 436.6 | 487.8 | 488.0 | 482.0 | 487.5 |
| No. of exons | 28,309 | 30,068 | 26,699 | 26,231 | 25,367 |
| Average exon length (bp) | 505.7 | 575.2 | 604.7 | 593.7 | 608.1 |
| Average no. of exons per gene | 2.59 | 2.54 | 2.42 | 2.43 | 2.40 |
| No. of introns | 17,357 | 18,226 | 15,653 | 15,445 | 14,809 |
| Average intron length (bp) | 124.1 | 91.6 | 90.9 | 98.4 | 94.6 |
| Average no. of introns per gene | 2.27 | 2.27 | 2.16 | 2.16 | 2.16 |
| No. of genes with introns | 7654 | 8044 | 7234 | 7165 | 6883 |
| Gene density (genes/Mb) | 276.0 | 298.3 | 338.2 | 330.3 | 331.1 |
| Functional annotation | | | | | |
| % of predicted proteins | - | 22.9 | 18.8 | 19.2 | 18.1 |
| % of hypothetical proteins | - | 31.3 | 31.2 | 31.4 | 32.8 |
| % of similar to proteins | - | 45.8 | 50 | 49.4 | 49.1 |
| No. of secreted proteins | 970^b | 874 | 838 | 965 | 700 |
| No. of small secreted proteins (<300 aa) | 441^b | 441 | 399 | 540 | 331 |

^a Based on alignment with reconstructed transcripts (e-value < 1e-5).
^b Extracted from Morais do Amaral et al. 2012.

The final gene annotation of IPO323 consists of 11,839 gene models (Table 1). The distributions of gene models along the chromosomes for the annotation presented here and for the previous JGI annotation are shown in Figure S1. In our new annotation, only 44 gene models were incomplete (i.e., without a start and/or stop codon) compared to 1555 incomplete gene models in the previous annotation (Table 1). To identify shared and unique gene models, we compared the two annotations with a BLAST search (e-value cutoff of 1e-10).
We found 4707 gene models that were identical between the two annotations. Furthermore, we found 442 gene models uniquely predicted by the JGI pipeline and 1200 models uniquely predicted by the pipeline used here. Gene models of our annotation have an average length of 1621 bp; they have longer exons (mean exon length 575 vs. 505 bp) and encode longer proteins (488 vs. 437 amino acids) than those of the annotation generated by the JGI pipeline (Table 1). A BLAST comparison at the nucleotide level of our final gene models against the reconstructed transcripts showed that 10,048 of the 11,839 predicted genes are supported by the RNA-seq data. In the previous annotation, 9423 predicted genes were supported by the transcripts reconstructed in this study, which underlines the ability of the new annotation to predict biologically relevant gene models.

In silico functional annotation based on protein signatures and homologies allowed us to assign a function to 43% of the 11,839 predicted proteins. A total of 2714 sequences (22.9%) had no homologs in the NCBI nonredundant protein database and did not include any known protein domain. These represent either species-specific genes of Z. tritici or incorrectly predicted genes. Given that 55% of these have RNA-seq support, we considered the majority of these novel genes to be correctly predicted. Characterization of the Z. tritici secretome yielded results comparable to the previously reported Z. tritici secretome (Morais do Amaral et al. 2012): 874 secreted proteins, including 441 small secreted proteins (SSPs), i.e., proteins shorter than 300 amino acids (Table 1). SSPs of Z. tritici have, on average, 2.8 times more cysteine residues than the whole proteome (data not shown), supporting an extracellular role (Fass 2012).
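
The RNA-seq support counts above come from a BLAST comparison of gene models against the reconstructed transcripts (e-value < 1e-5, Table 1). As an illustration only, and not the pipeline used in this study, the following minimal Python sketch shows how such a filter could be applied. It assumes the BLAST results are in the standard 12-column tabular format, where the query ID is the first column and the e-value the eleventh. The function names (`queries_with_hits`, `split_by_support`) and the gene and transcript identifiers are hypothetical.

```python
from typing import Iterable, Set, Tuple


def queries_with_hits(blast_lines: Iterable[str], max_evalue: float) -> Set[str]:
    """Return query IDs with at least one hit whose e-value is below max_evalue.

    Assumes tab-separated BLAST tabular output with the query ID in
    column 1 and the e-value in column 11 (an assumption of this sketch).
    """
    supported = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue < max_evalue:  # strict threshold, as in the Table 1 footnote
            supported.add(query_id)
    return supported


def split_by_support(all_ids: Iterable[str], supported: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Partition gene model IDs into those with and without a supporting hit."""
    ids = set(all_ids)
    return ids & supported, ids - supported


# Hypothetical example: three BLAST hits for four gene models.
sample_hits = [
    "geneA\ttx1\t99.5\t1200\t3\t0\t1\t1200\t1\t1200\t0.0\t2100",
    "geneB\ttx7\t85.0\t300\t40\t2\t1\t300\t5\t304\t1e-3\t55.0",
    "geneC\ttx9\t97.0\t800\t20\t1\t1\t800\t1\t800\t2e-50\t900",
]
supported = queries_with_hits(sample_hits, max_evalue=1e-5)
with_support, without_support = split_by_support(
    ["geneA", "geneB", "geneC", "geneD"], supported
)
print(sorted(with_support), sorted(without_support))
```

In this toy input, geneB has a hit but its e-value is above the threshold, and geneD has no hit at all, so both are counted as unsupported. The same filter with a threshold of 1e-10 would correspond to the cutoff used for the comparison between the two annotations.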