PMC:4502367 / 7234-18643


    2_test

    {"project":"2_test","denotations":[{"id":"25917918-21695235-43386532","span":{"begin":119,"end":123},"obj":"21695235"},{"id":"25917918-21994252-43386533","span":{"begin":228,"end":232},"obj":"21994252"},{"id":"25917918-21994252-43386534","span":{"begin":896,"end":900},"obj":"21994252"},{"id":"25917918-23587118-43386535","span":{"begin":1316,"end":1320},"obj":"23587118"},{"id":"25917918-24920004-43386536","span":{"begin":1798,"end":1802},"obj":"24920004"},{"id":"25917918-23935534-43386537","span":{"begin":2358,"end":2362},"obj":"23935534"},{"id":"25917918-16169926-43386538","span":{"begin":3039,"end":3043},"obj":"16169926"},{"id":"25917918-24920004-43386539","span":{"begin":3413,"end":3417},"obj":"24920004"},{"id":"25917918-22059117-43386540","span":{"begin":3544,"end":3548},"obj":"22059117"},{"id":"25917918-21572440-43386541","span":{"begin":3694,"end":3698},"obj":"21572440"},{"id":"25917918-23845962-43386542","span":{"begin":3712,"end":3716},"obj":"23845962"},{"id":"25917918-15728110-43386543","span":{"begin":4281,"end":4285},"obj":"15728110"},{"id":"25917918-14500829-43386544","span":{"begin":4566,"end":4570},"obj":"14500829"},{"id":"25917918-18757608-43386545","span":{"begin":4712,"end":4716},"obj":"18757608"},{"id":"25917918-2231712-43386546","span":{"begin":4942,"end":4946},"obj":"2231712"},{"id":"25917918-17379688-43386547","span":{"begin":5013,"end":5017},"obj":"17379688"},{"id":"25917918-14534192-43386548","span":{"begin":5703,"end":5707},"obj":"14534192"},{"id":"25917918-9461475-43386549","span":{"begin":5965,"end":5969},"obj":"9461475"},{"id":"25917918-10779491-43386550","span":{"begin":6073,"end":6077},"obj":"10779491"},{"id":"25917918-21695235-43386551","span":{"begin":6149,"end":6153},"obj":"21695235"},{"id":"25917918-18190707-43386552","span":{"begin":6311,"end":6315},"obj":"18190707"},{"id":"25917918-15123596-43386553","span":{"begin":6564,"end":6568},"obj":"15123596"},{"id":"25917918-21513511-43386554","span":{"begin":7188,"end":7192},"obj":"2151
3511"},{"id":"25917918-17446895-43386555","span":{"begin":8056,"end":8060},"obj":"17446895"},{"id":"25917918-11152613-43386556","span":{"begin":8239,"end":8243},"obj":"11152613"},{"id":"25917918-17446895-43386557","span":{"begin":8373,"end":8377},"obj":"17446895"},{"id":"25917918-25306241-43386558","span":{"begin":8694,"end":8698},"obj":"25306241"},{"id":"25917918-15980438-43386559","span":{"begin":8765,"end":8769},"obj":"15980438"},{"id":"25917918-22096229-43386560","span":{"begin":9310,"end":9314},"obj":"22096229"},{"id":"25917918-25306241-43386561","span":{"begin":9412,"end":9416},"obj":"25306241"},{"id":"25917918-21304975-43386562","span":{"begin":10271,"end":10275},"obj":"21304975"},{"id":"25917918-2231712-43386563","span":{"begin":10421,"end":10425},"obj":"2231712"},{"id":"25917918-23329690-43386564","span":{"begin":10468,"end":10472},"obj":"23329690"},{"id":"25917918-16093699-43386565","span":{"begin":10837,"end":10841},"obj":"16093699"},{"id":"25917918-21109532-43386566","span":{"begin":11134,"end":11138},"obj":"21109532"},{"id":"25917918-17984973-43386567","span":{"begin":11289,"end":11293},"obj":"17984973"}],"text":"Materials and Methods\n\nGenome data\nFor the gene annotation, the genomes of Z. tritici (isolate IPO323) (Goodwin et al. 2011), Z. pseudotritici (isolate ST04IR_2.2.1), and Z. ardabiliae (isolate ST04IR_1.1.1) (Stukenbrock et al. 2011) were used. For the annotation of Z. brevis, the genome of the isolate Zb18110 was sequenced. Genomic DNA was isolated using a standard phenol-chloroform protocol (Sambrook and Russell 2001), and sequencing of 100-bp paired-end reads was performed using a HiSeq2000 Illumina platform (AROS Applied Biotechnology, Denmark). A de novo assembly of the Z. brevis Illumina reads was generated using the CLC Genomics Workbench version 5 (CLC, Aarhus, Denmark) with standard settings for paired-end read assembly. This assembly is available under the NCBI BioProject (PRJNA273516) on Genbank. The assembled genomes of Z. 
pseudotritici, Z. ardabiliae (Stukenbrock et al. 2011), and Z. brevis used for the gene annotation contained a very low number of repetitive DNA. Therefore, new assemblies better representing the repeat content of these species genomes were constructed using Illumina sequences obtained from the isolates Z. pseudotritici ST04IR_5.5, Z. ardabiliae ST11IR_6.1.1, and Z. brevis Zb163. Assemblies of these new Illumina genomes were generated using SOAPdenovo2 (Luo et al. 2012) with optimized k-mer values allowing the inclusion of repeats in the assemblies. These three assemblies are available under the NCBI BioProject (PRJNA274679) on Genbank.\n\nRNA-seq data\nTwo previously published RNA-seq datasets were used for de novo transcript assembly of Z. tritici IPO323 including one dataset obtained from RNA extracted from infected Triticum aestivum (cultivar Obelisk) seedlings and one dataset obtained from axenically grown fungal cells (Kellner et al. 2014). For Z. pseudotritici, Z. ardabiliae, and Z. brevis, RNA-seq data were obtained from axenically grown cultures. Total RNA was extracted from fungal cells grown in YMS (4 g yeast extract, 4 g malt extract, 4 g sucrose, 20 g bacto agar, 1 liter H2O) agar in a shaking incubator at 200 rpm at 18° using the TRIZOL reagent (Invitrogen, Darmstadt, Germany) following the protocol of the manufacturer. Illumina RNA-seq libraries for two axenic culture replicates per species were prepared from an input of 10 µg total purified polyA RNA (Palma-Guerrero et al. 2013). Libraries were quantified by fluorometry, immobilized, and processed onto a flow cell with a cBot (Illumina), followed by sequencing-by-synthesis with TruSeq v3 chemistry on a HiSeq2000 at the Max Planck Genome Center (Cologne, Germany). RNA-seq data for Z. pseudotritici, Z. ardabiliae, and Z. brevis are respectively available in the NCBI BioProjects (PRJNA277173, PRJNA277174, PRJNA277175). 
For data processing, the sequence read quality was first evaluated using FASTQC (www.bioinformatics.babraham.ac.uk/projects/fastqc/). Subsequently, reads were filtered using tools from the Galaxy server, including grooming, trimming, filtering, and masking steps (Giardine et al. 2005). Reads with an overall quality score less than 20 were removed. For the remaining reads, all nucleotides with a quality score less than 20 were masked with Ns. For Z. tritici, reads from the host (T. aestivum) transcriptome were initially filtered out using fastq_screen v0.4.1 (www.bioinformatics.babraham.ac.uk/projects/fastq_screen) as described by Kellner et al. (2014).\n\nGene annotation\nProtein-coding genes were identified using the Fungal Genome Annotation pipeline described by Haas et al. (2011). The individual steps of the pipeline are described below.\n\nTranscript reconstruction:\nTranscript reconstruction using Trinity (Grabherr et al. 2011; Haas et al. 2013) was carried using de novo and genome-guided methods. By the de novo method, RNA-seq reads were first assembled into unique sequences of transcripts (contigs) using the Inchworm module within Trinity. Contigs were clustered by the Chrysalis module, and corresponding De Bruijn graphs that represent the possible different isoforms were constructed. In the final step, De Bruijn graphs were processed by the Butterfly module to produce full-length transcripts. In the genome-guided method, RNA-seq reads were first aligned to the genome using GMAP (Wu and Watanabe 2005). Based on these aligned-read clusters, the Chrysalis and Butterfly modules were consecutively executed to produce the final reconstructed transcripts. Transcripts generated by these two methods were combined using the “Program to Assemble Spliced Alignments” (PASA) (Haas et al. 2003) pipeline to build a complete set of unique transcripts corresponding to gene models.\n\nTraining:\nFirst, GeneMark-ES (Ter-Hovhannisyan et al. 
2008) was used for ab initio predictions, because its self-training algorithm allowed the identification of high-quality gene models. Next, evidence from homology searches using tBLASTn (e-value cut-off of 1e-10) (Altschul et al. 1990) against a nonredundant protein database (UniRef90) (Suzek et al. 2007) and from reconstructed transcripts were used to filter the ab initio predicted gene models. Only the complete gene models predicted by GeneMark-ES with a support from the homology-based comparison (100% of coverage for each exon) and with the exact same exon–intron boundaries as in the reconstructed transcripts were selected for the training and testing of ab initio gene predictors. The 2693 selected gene models were divided into two sets: one for the training (training set) containing 1611 sequences (60%) and one for assessing prediction accuracy (test set) containing 1082 sequences (40%). The evaluation of the training process was performed using Augustus (Stanke and Waack 2003). For all species, better performance was obtained using datasets of Z. tritici. We used the Z. tritici training set to train the ab initio gene callers for all the species.\n\nGene prediction and annotation:\nThe trained GeneMark.hmm (Lukashin and Borodovsky 1998) and Augustus programs were used to predict gene models. For Z. tritici, Fgenesh (Salamov and Solovyev 2000) gene models obtained from the first annotation by JGI (Goodwin et al. 2011) were also included. Gene models obtained from the trained gene predictors were evaluated and combined using the EVidenceModeler (EVM) software (Haas et al. 2008) to create a weighted consensus of the gene structures. Weights of 3, 5, and 7 were used for ab initio predictions, homology-based predictions, and transcript evidence, respectively. Homology-based gene models obtained from GeneWise (Birney et al. 2004) were only used for Z. tritici as reliable gene models; they could not be obtained in the much more fragmented genomes of Z. 
pseudotritici, Z. ardabiliae, and Z. brevis. Based on transcript alignments, the PASA pipeline was used to correct the consensus gene model structure and to resolve conflicts between different isoforms.\n\nGene comparison\nBased on the four predicted proteomes of the Zymoseptoria species, sequence comparisons were performed with BLASTp (e-value cut-off of 1e-5). Obtained pairwise protein alignments were processed using the software SiLiX to build families of homologous proteins (Miele et al. 2011). Proteins were clustered together into families if they shared at least 55% of sequence identity over at least 60% of sequence coverage. The building of these families was divided in two steps. First, only complete protein sequences were used to create families. Second, using a semi-bipartite graph, partial protein sequences were added to the existing families.\n\nSecretome prediction\nPlant pathogens secrete proteins to interfere with host immune defenses. Genes encoding secreted proteins are therefore of particular interest due to their potential role in infection and host–pathogen interaction. We screened predicted proteins for the presence of signal peptides and categorized proteins as secreted under the following conditions: (i) if a signal peptide was predicted by both Neural-Network and HMM methods by the software SignalP 3.0 (Emanuelsson et al. 2007); (ii) if zero or one transmembrane domain was present in the protein (the domain having at least 30% overlap with the signal peptide) using the software TMHMM 2.0 (Krogh et al. 2001); and (iii) if the protein was predicted to be targeted to the secretory pathway by the software TargetP 1.1 (Emanuelsson et al. 2007). Secreted proteins with a size of 300 amino acids or less were considered as small secreted proteins (SSPs).\n\nFunctional annotation\nAutomated functional annotations of the predicted proteins of Z. tritici, Z. pseudotritici, Z. ardabiliae, and Z. 
brevis were performed as previously described by Grandaubert et al. (2014). Using a combination of BLAST and InterProScan (Quevillon et al. 2005), predicted proteins of each of the four species were categorized into three classes. The first class included proteins with no significant BLAST hit and with no known protein domain identified by InterProScan. These proteins were classified as “predicted proteins,” i.e., predictions with no functional support. This class contained an excess of species-specific genes. The second class, termed “hypothetical proteins,” included proteins with at least one domain identified by InterProScan in the InterPro or Pfam databases (Hunter et al. 2012; Coggill et al. 2008) or that fulfilled the BLAST result criteria defined by Grandaubert et al. (2014) and for which the description indicated “hypothetical protein” in more than 90% of the BLAST hits. Globally, this class contained predictions with poor functional evidence likely corresponding to conserved proteins among several organisms but with no defined functions. The third class included predictions that fulfilled the BLAST result conditions with at least one domain from the InterPro or Pfam database corroborating by the consensus BLAST description. This class was termed the “similar to function” and included well-conserved proteins with defined functions in many organisms.\n\nRepeats and transposable elements identification\nRepetitive DNA was identified and annotated in SOAP assembled genomes of the four isolates described above to create a repertoire of repeats specific to the Zymoseptoria genus using the REPET pipeline (Flutre et al. 2011). To obtain more complete elements, repeat families in each species were clustered using Blastclust from the NCBI-BLAST package (Altschul et al. 1990), aligned using Mafft (Katoh and Standley 2013), and new consensuses were then created. 
These steps were iterated with decreasing values of identity percentage (from 100% to 75%) and coverage (from 100% to 30%) until there was no more clustering of the sequences. Next, the sequences were classified (TEclassifier.py script from REPET) using tBLASTx and BLASTx against the Repbase Update database (Jurka et al. 2005) and by the identification of structural features such as long terminal repeats (LTRs) or terminal inverted repeats (TIRs). The sequences were additionally translated into the six reading frames to perform a protein domain search on the conserved domain database (CDD) (Marchler-Bauer et al. 2011) using RPS-BLAST. Transposable element (TE) families of each strain were classified and named according to the nomenclature defined by Wicker et al. (2007). The 497 repeat families identified in this study are available in FASTA format in Supporting Information, File 1."}
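
The record above pairs a `text` string with a list of `denotations`, each carrying a `span` (`begin`, `end`) and an `obj`. The following is a minimal sketch of how such spans can be resolved against the text, not a reference implementation. It assumes the offsets are zero-based character positions with an exclusive `end`, which is consistent with the first span (119 to 123) landing on the year "2011" in "Goodwin et al. 2011". It also assumes that `obj` is the PubMed ID of the cited paper. The `record_json` value is a shortened, hypothetical copy of the record (text cut after the first citation, first denotation only), and `resolve_spans` is an illustrative helper name.

```python
import json

# Shortened, hypothetical copy of the record above: the text is cut right
# after the first citation and only the first denotation is kept.
record_json = r'''
{"project": "2_test",
 "denotations": [{"id": "25917918-21695235-43386532",
                  "span": {"begin": 119, "end": 123},
                  "obj": "21695235"}],
 "text": "Materials and Methods\n\nGenome data\nFor the gene annotation, the genomes of Z. tritici (isolate IPO323) (Goodwin et al. 2011)"}
'''


def resolve_spans(record):
    """Return (id, obj, covered_text) for every denotation in the record.

    Assumption: begin/end are zero-based character offsets into the decoded
    text, with `end` exclusive (the same convention as a Python slice).
    """
    text = record["text"]
    resolved = []
    for denotation in record["denotations"]:
        begin = denotation["span"]["begin"]
        end = denotation["span"]["end"]
        resolved.append((denotation["id"], denotation["obj"], text[begin:end]))
    return resolved


record = json.loads(record_json)
resolved = resolve_spans(record)
for denotation_id, obj, covered in resolved:
    print(denotation_id, obj, repr(covered))
```

Because every span is an absolute offset into `text`, inserting or deleting even one character earlier in the text shifts all later spans, so the text and the offsets have to be updated together.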
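
The "Secretome prediction" paragraph in the text above lists three conditions for calling a protein secreted, plus a length cut-off for small secreted proteins (SSPs). The sketch below expresses that rule as a predicate, under stated assumptions. The `ProteinPrediction` class and its field names are hypothetical summaries of already-parsed SignalP, TMHMM, and TargetP results, not the output format of those tools. The reading that a single transmembrane domain is tolerated only when at least 30% of it overlaps the signal peptide is one interpretation of the parenthetical in the text.

```python
from dataclasses import dataclass
from typing import Optional

SSP_MAX_LENGTH = 300       # "300 amino acids or less" (from the text)
MIN_TM_SP_OVERLAP = 0.30   # "at least 30% overlap with the signal peptide"


@dataclass
class ProteinPrediction:
    """Hypothetical per-protein summary of parsed tool results."""
    length: int                      # protein length in amino acids
    signalp_nn: bool                 # signal peptide called by SignalP neural network
    signalp_hmm: bool                # signal peptide called by SignalP HMM
    tm_domains: int                  # number of TMHMM transmembrane domains
    tm_sp_overlap: Optional[float]   # fraction of the single TM domain that
                                     # overlaps the signal peptide; None if unknown
    targetp_secretory: bool          # TargetP predicts the secretory pathway


def is_secreted(p: ProteinPrediction) -> bool:
    # (i) both SignalP methods must predict a signal peptide
    if not (p.signalp_nn and p.signalp_hmm):
        return False
    # (ii) zero or one transmembrane domain; assumed reading: a single domain
    # must overlap the signal peptide by at least 30%
    if p.tm_domains > 1:
        return False
    if p.tm_domains == 1:
        if p.tm_sp_overlap is None or p.tm_sp_overlap < MIN_TM_SP_OVERLAP:
            return False
    # (iii) TargetP must assign the protein to the secretory pathway
    return p.targetp_secretory


def is_ssp(p: ProteinPrediction) -> bool:
    # Small secreted protein: secreted and at most 300 amino acids long
    return is_secreted(p) and p.length <= SSP_MAX_LENGTH


example = ProteinPrediction(length=250, signalp_nn=True, signalp_hmm=True,
                            tm_domains=0, tm_sp_overlap=None,
                            targetp_secretory=True)
print(is_secreted(example), is_ssp(example))
```

Keeping the three conditions as separate early returns makes it easy to change one threshold or swap in a different reading of condition (ii) without touching the others.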