Nucleic_Acids

PMC:3964958

P-MITE: a database for plant miniature inverted-repeat transposable elements

Abstract

Miniature inverted-repeat transposable elements (MITEs) are prevalent in eukaryotic species including plants. MITE families vary dramatically and usually cannot be identified based on homology. In this study, we identified MITEs de novo from 41 plant species, using the computer programs MITE Digger, MITE-Hunter and/or Repetitive Sequence with Precise Boundaries (RSPB). MITEs were found in all species but one (Cyanidioschyzon merolae). Combined with the MITEs identified previously from the rice genome, >2.3 million sequences from 3527 MITE families were obtained from 41 plant species. In general, higher plants contain more MITEs than lower plants, with a few exceptions such as papaya, which has only 538 elements. The largest number of MITEs is found in apple, with 237 302 MITE sequences. The number of MITE sequences in a genome is significantly correlated with genome size. A series of databases (plant MITE databases, P-MITE), available online at http://pmite.hzau.edu.cn/django/mite/, was constructed to host all MITE sequences from the 41 plant genomes. The databases support sequence similarity searches (BLASTN), and MITE sequences can be downloaded by family or by genome. The databases can be used to study the origin and amplification of MITEs, MITE-derived small RNAs and the roles of MITEs in gene and genome evolution.

INTRODUCTION

Miniature inverted-repeat transposable elements (MITEs) are prevalent in eukaryotic genomes, and are believed to be deletion derivatives of DNA transposons (1,2). Like autonomous DNA transposons, MITEs usually have terminal inverted repeats (TIRs), flanked by short direct repeats [also called target site duplications (TSDs)]. Compared with autonomous DNA transposons, MITEs are often short (<800 bp) and do not encode transposases. MITEs are often located in gene-rich euchromatic regions and are associated with genes (3,4).
Several pieces of evidence suggest that MITEs may affect the expression of nearby genes. The MITE Kiddo in rice was shown to upregulate the expression of Ubiquitin2 when inserted in its promoter region (5). However, in other cases, MITE insertions downregulate the expression of nearby genes (6,7). Such downregulation most likely occurs through small RNAs derived from MITE sequences (6,8). MITE transpositions generate considerable genetic diversity within a species (9–11). Considering the effects of MITEs on gene expression and the variation of MITE insertions among genotypes, MITEs may contribute to considerable phenotypic diversity as well (12).

The first MITE families were discovered through sequence analysis (i.e. identification of TIR and TSD sequences) of insertions of 100–600 bp (13,14). Recently, computer programs were developed to systematically identify MITEs from sequence datasets such as genome sequences (6,15–19). Among them, the most successful are MITE Digger, MITE-Hunter and RSPB, which identified the vast majority of MITEs in the sequenced genome of rice (6,18,19). The recently reported program MITE Digger is the most efficient for de novo MITE identification, particularly in large genomes (19). RSPB is better at identifying MITE families with atypical structures, such as MITEs with no TSD or with short/diverse TIR sequences. Unfortunately, RSPB requires computing capacity not available in most laboratories.

We predicted that combining MITE Digger, MITE-Hunter and RSPB would allow the detection of the vast majority of, if not all, MITE families in a genome, with no prior information required. With the availability of the three MITE-detecting programs and the genome sequences of many plant species, MITEs in several genomes can be readily identified and compared to further our understanding of MITE origin and evolution.

MITEs, as repetitive sequences, were included in other databases such as The Institute for Genomic Research (TIGR) Plant Repeat Databases and Repbase (20,21).
However, MITEs vary dramatically and usually cannot be identified through homology searches between distantly related species; consequently, only a small proportion of MITE families have been identified and included in these databases. In this study, MITEs were identified de novo from 41 plant species using the computer programs MITE Digger, MITE-Hunter and/or RSPB. Each MITE family was annotated manually. All verified MITE families were stored in a database, P-MITE (for plant MITE). A BLASTN search function was added to the database, and MITE sequences from each genome can be downloaded. P-MITE will be helpful for the annotation of genes and genomic sequences. It can also be used to study the origin and amplification of MITEs, to compare MITEs between species, and to study MITE-derived small RNAs and the roles of MITEs in gene and genome evolution.

MATERIALS AND METHODS

Plant genomes used in this study

Forty-one sequenced and published genomes of plant species, including six lower plant species, were included in this study for MITE identification. Information on the 41 species and the Web sites for their genome sequences are listed in Supplementary Table S1. The MITEs from rice were identified and annotated in a previous study (6).

De novo identification of MITEs using MITE Digger, MITE-Hunter and RSPB

MITEs from 41 genomes were identified de novo using the programs MITE Digger, MITE-Hunter and/or RSPB (6,18,19). First, MITE-Hunter was run on the sequences of each genome. The resulting groups of potential MITEs were manually checked for TSD and TIR sequences. Groups with no precise boundaries (terminals) or no TIR sequences were not considered MITEs. The confirmed MITEs from MITE-Hunter were put into a database (the ‘MITE-Hunter database’). To save running time, RSPB was slightly modified so that it skipped the confirmed MITE sequences in the ‘MITE-Hunter database’.
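The manual check for TIR and TSD sequences described above can be illustrated with a small sketch. This is not the procedure used by MITE-Hunter, MITE Digger or RSPB; it is a minimal illustration under simplifying assumptions (exact matching only, no mismatches or gaps). The function names `tir_length` and `tsd_length` and the `max_len` parameter are hypothetical, and the sequences are made up for the example.

```python
# Illustrative only: exact-match TIR/TSD checks on a candidate element.
# Function names and parameters are hypothetical, not taken from any MITE tool.

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def tir_length(element: str) -> int:
    """Length of the perfect terminal inverted repeat of `element`.

    The 5' end is compared base by base with the reverse complement of the
    3' end; counting stops at the first mismatch. The result is capped at
    half the element length so the two arms cannot overlap.
    """
    element = element.upper()
    rc = reverse_complement(element)
    n = 0
    for a, b in zip(element, rc):
        if a != b:
            break
        n += 1
    return min(n, len(element) // 2)


def tsd_length(left_flank: str, right_flank: str, max_len: int = 10) -> int:
    """Longest direct repeat shared by the end of the left flank and the
    start of the right flank (a candidate target site duplication)."""
    left, right = left_flank.upper(), right_flank.upper()
    best = 0
    for k in range(1, min(max_len, len(left), len(right)) + 1):
        if left[-k:] == right[:k]:
            best = k
    return best


if __name__ == "__main__":
    # Made-up element: 10-bp arms that are reverse complements of each other.
    candidate = "GGCCAGTCAC" + "AAAAC" + "GTGACTGGCC"
    print(tir_length(candidate))          # 10
    print(tsd_length("CCGTA", "TATGG"))   # 2 (the flanks share "TA")
```

Real elements often carry imperfect TIRs, so a practical check would tolerate mismatches; the exact-match version is kept here only to make the definitions of TIR and TSD concrete.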
New groups of repetitive sequences with precise boundaries were reported and checked manually for TSDs and TIRs (Supplementary Figure S1). No TSD or TIR information is required to run RSPB, which identifies repetitive sequences with precise boundaries. In subsequent manual annotation, only repetitive sequences shorter than 800 bp with TSD/TIR features similar to those of known MITE superfamilies were retained. RSPB could not be run successfully on five species with large genomes or too many short contigs. MITE Digger, released recently, was also run on some genomes, including genomes >800 Mb. Statistics for the MITE families identified in this study are shown in Supplementary Table S2. The number of MITE families that were detected by RSPB, but not by MITE-Hunter, is shown in Supplementary Table S3.

Classification of MITE superfamily and family

A Perl script was written to cluster the MITEs identified above into a family if they had significant sequence similarity (BLASTN e < 10⁻¹⁰) (6). MITE families were assigned to superfamilies based on their TIR and TSD sequences. Each MITE family in a genome was named code_Abc#, where Ab is the first two letters of the genus name, c is the first letter of the species name and # is a consecutive number. Different superfamilies are represented by different codes: DTT for Tc1/Mariner, DTM for Mutator, DTA for hAT, DTC for CACTA, DTH for PIF/Harbinger, DTP for P, DTN for Novosib and DTx for unknown (21–23). MITEs with ambiguous TSD and/or TIR features were annotated as unknown superfamily (DTx). MITE families preferentially inserted into simple tandem repeats (microsatellites) were considered an independent group, MiM (MITEs inserted in microsatellite). A ‘representative’ element was chosen for each family; where possible, the representative element has good TIR and perfect TSD sequences. A MITE sequence was considered a full-length element when its terminals were no more than 3 bp shorter than the representative sequence.
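The code_Abc# naming scheme described above can be made concrete with a short sketch that builds and parses family names. The superfamily codes are those listed in the text; the function names, the regular expression and the returned dictionary layout are assumptions made for illustration, and the sketch does not attempt to cover how MiM families are named, which the text does not specify.

```python
import re

# Superfamily codes as listed in the text.
SUPERFAMILY_CODES = {
    "DTT": "Tc1/Mariner",
    "DTM": "Mutator",
    "DTA": "hAT",
    "DTC": "CACTA",
    "DTH": "PIF/Harbinger",
    "DTP": "P",
    "DTN": "Novosib",
    "DTx": "unknown",
}

# Hypothetical pattern for code_Abc#: 3-letter code, underscore, two letters
# of the genus, one letter of the species, then a consecutive number.
_NAME_RE = re.compile(
    r"^(?P<code>[A-Za-z]{3})_(?P<genus>[A-Z][a-z])(?P<species>[a-z])(?P<number>\d+)$"
)


def make_family_name(code: str, genus: str, species: str, number: int) -> str:
    """Build a family name such as 'DTM_Mad25' from its components."""
    if code not in SUPERFAMILY_CODES:
        raise ValueError(f"unknown superfamily code: {code!r}")
    return f"{code}_{genus[:2].capitalize()}{species[0].lower()}{number}"


def parse_family_name(name: str) -> dict:
    """Split a family name into superfamily, genus prefix, species initial
    and consecutive number. Raises ValueError for names that do not fit."""
    match = _NAME_RE.match(name)
    if match is None or match["code"] not in SUPERFAMILY_CODES:
        raise ValueError(f"not a code_Abc# family name: {name!r}")
    return {
        "superfamily": SUPERFAMILY_CODES[match["code"]],
        "code": match["code"],
        "genus_prefix": match["genus"],
        "species_initial": match["species"],
        "number": int(match["number"]),
    }


if __name__ == "__main__":
    # 'DTM_Mad25' is a family name that appears in the Results.
    print(parse_family_name("DTM_Mad25"))
    print(make_family_name("DTT", "Sorghum", "bicolor", 24))  # DTT_Sob24
```

Parsing names this way makes it straightforward to group downloaded family names by superfamily or by genome prefix.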
To identify all MITE elements in a genome, including diverged and/or partial ones, a library of the representative elements from every family was used as query sequences to search the entire genome sequence using RepeatMasker v3.2.9 (http://www.repeatmasker.org/).

RESULTS AND DISCUSSION

De novo identification of MITEs in 41 plant genomes

MITE-Hunter was applied to 41 plant genomes for genome-wide de novo identification of MITEs. RSPB was also run on all but five genomes, which are either >800 Mb or have too many contigs. MITE Digger was used to search some genomes, including four skipped by RSPB. The MITE sequences obtained in this study were used as queries in a BLASTN search of Repbase, the database most frequently used for repetitive sequences (21). More than 70% of the MITE families identified in this study were not included in Repbase (BLASTN e < 10⁻¹⁰). The maize genome was searched with MITE Digger and MITE-Hunter, but not RSPB, because the genome is too large. A total of 252 MITE families were obtained from maize, including 97 novel families not covered by the maize TE database. However, 61 MITE families listed in the maize TE database were not identified by either MITE Digger or MITE-Hunter. The computing process of RSPB needs to be improved before it can be applied to large genomes, such as maize, to identify more novel MITE families.

The majority of MITEs were classified into five superfamilies: Tc1/Mariner, PIF/Harbinger, CACTA, hAT and Mutator. Two superfamilies, P and Novosib, were detected in the genomes of lower plants, although these genomes lack Tc1/Mariner, CACTA and Mutator MITEs. Sixteen MITE families were unclassified owing to ambiguous TSD and/or TIR features. MiM is the least frequent group in plant genomes (Supplementary Table S2). The MiM group is present in only 10 of the 41 genomes, with 41 893 elements from 33 families. The strawberry genome contains 14 MiM families, whereas the others have no more than four MiM families.
Most elements of these MiM families, including the Micron family in rice (24), were inserted in (TA)n repeats, with only a few exceptions that were inserted into (CA)n/(GT)n repeats. Elements from the MiM group have poor TIR sequences, and no conserved nucleotides were found in their terminals among different families. It remains unclear whether different MiM families belong to the same superfamily, i.e. are activated by the same type of transposase. In contrast to the scarce MiM group, the Mutator superfamily has 852 390 elements in the 41 genomes included in this study, an average of ~20 790 elements per genome.

MITEs with significant nucleotide identities (BLASTN e < 10⁻¹⁰) were grouped into a family. The largest MITE family is DTM_Mad25 from the apple genome, with 18 904 elements. The smallest MITE families, DTT_Sob24 and DTH_Sob33 from the sorghum genome, have only one element each.

The number of MITEs varies dramatically among species. In general, the genomes of lower plants have relatively few MITEs (Table 1). No MITEs were detected in the genome of Cyanidioschyzon merolae using either MITE-Hunter or RSPB, and the genome of Selaginella moellendorffii harbors only 73 MITE elements. The number of MITEs also varies considerably among the genomes of higher plants. For example, only one MITE family with 538 elements was detected in the papaya genome, whereas 237 302 elements from 180 MITE families are present in the apple genome. Large variations in the total number of MITE elements also occur between closely related species. For example, the Arabidopsis thaliana genome has only 3245 MITE elements, whereas its close relative, Arabidopsis lyrata, contains 18 039 MITE-related sequences. Similarly, the number of MITEs in the genome of watermelon (94 314 MITE elements) is about seven times that in the genome of melon (12 991 MITE elements).

Table 1. MITEs in 41 plant genomes
a The MITE sequences from rice were retrieved from Lu et al. (25).
The number of MITEs in a genome is significantly correlated with its genome assembly size (r = 0.72, P < 0.01; Table 1; Figure 1). A similar correlation coefficient (r = 0.68, P < 0.01) was obtained when the six lower plants were excluded from the analysis. Nevertheless, several striking exceptions were observed. For example, the rice genome is only 373 Mb but has the third largest number (179 415) of MITEs among all species studied, whereas papaya, with a genome size (342 Mb) similar to that of rice, has only 538 elements from one MITE family (Table 1).

Figure 1. Strong correlation between the number of MITEs and genome assembly size. Genomes with disproportionately low copy numbers (➁ papaya and ➂ Physcomitrella patens) and high copy numbers (➀ rice and ➃ apple) of MITEs are indicated.

The construction and the use of plant MITE database, P-MITE

A total of 2.3 million sequences from 3527 MITE families were obtained from 41 plant genomes (including the rice genome). A series of databases containing MITEs from the 41 plant genomes was constructed. Elements from each of the 3527 MITE families were checked and annotated manually, and one element with better TSD and/or TIR features was chosen as the representative of the family. A database containing all representative elements was constructed, which can be used to study the structure of MITEs, such as their TSD and TIR features. These databases are collectively named P-MITE (for plant MITE) and are available at http://pmite.hzau.edu.cn/django/mite. The database is searchable using the BLASTN algorithm. MITE sequences and representative elements can be downloaded by family or by genome.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online, including [26–66].

FUNDING

This work was supported by the ‘973’ National Key Basic Research Program [2009CB119000]; the National Natural Science Foundation of China (NSFC) [30921002]; and the Fundamental Research Funds for the Central Universities [2012ZYTS035].
Funding for open access charge: The Fundamental Research Funds for the Central Universities [2012ZYTS035].

Conflict of interest statement. None declared.

Document structure show

Title	P-MITE: a database for plant miniature inverted-repeat transposable elements
Abstract	Miniature inverted-repeat transposable elements (MITEs) are prevalent in eukaryotic species including plants. MITE families vary dramatically and usually cannot be identified based on homology. In this study, we de novo identified MITEs from 41 plant species, using computer programs MITE Digger, MITE-Hunter and/or Repetitive Sequence with Precise Boundaries (RSPB). MITEs were found in all, but one (Cyanidioschyzon merolae), species. Combined with the MITEs identified previously from the rice genome, >2.3 million sequences from 3527 MITE families were obtained from 41 plant species. In general, higher plants contain more MITEs than lower plants, with a few exceptions such as papaya, with only 538 elements. The largest number of MITEs is found in apple, with 237 302 MITE sequences. The number of MITE sequences in a genome is significantly correlated with genome size. A series of databases (plant MITE databases, P-MITE), available online at http://pmite.hzau.edu.cn/django/mite/, was constructed to host all MITE sequences from the 41 plant genomes. The databases are available for sequence similarity searches (BLASTN), and MITE sequences can be downloaded by family or by genome. The databases can be used to study the origin and amplification of MITEs, MITE-derived small RNAs and roles of MITEs on gene and genome evolution.
Body	INTRODUCTION Miniature inverted-repeat transposable elements (MITEs) are prevalent in eukaryotic genomes, and are believed to be deletion derivatives of DNA transposons (1,2). Like autonomous DNA transposons, MITEs usually have terminal inverted repeats (TIR), flanked by short direct repeats [also called target site duplication (TSD)]. Compared with autonomous DNA transposons, MITEs are often short (<800 bp) and do not encode transposases. MITEs are often located in gene-rich euchromatic regions and are associated with genes (3,4). Several pieces of evidence suggest that MITEs may affect the expression of nearby genes. MITE Kiddo in rice was shown to upregulate the expression of Ubiquitin2 when inserted in its promoter region (5). However, in other cases, MITE insertions downregulate the expression of nearby genes (6,7). Such downregulation is most likely through small RNAs derived from MITE sequences (6,8). MITE transpositions generate much genetic diversity for a species (9–11). Considering the effects of MITEs on gene expression and variation of MITE insertions in different genotypes, MITEs may contribute to considerable phenotypic diversity as well (12). The first MITE families were discovered through sequence analysis (i.e. identification of TIR and TSD sequences) of insertions of 100–600 bp (13,14). Recently, computer programs were developed to systematically identify MITEs from a database such as genome sequences (6,15–19). Among them, the most successful ones are MITE Digger, MITE-Hunter and RSPB, which identified the vast majority of MITEs in the sequenced genome of rice (6,18,19). The recently reported program MITE Digger is most efficient for de novo MITE identification, particularly in large genomes (19). RSPB is better at identifying MITE families with atypical structures such as MITEs with no TSD or short/diverse TIR sequences. Unfortunately, RSPB requires high computer capacity not found in most laboratories. 
We predicted that combining MITE Digger, MITE-Hunter and RSPB would allow the detection of a vast majority of, if not all, MITE families in a genome, with no prior information required. With the availability of the three MITE detecting programs and the genome sequences of many plant species, MITEs in several genomes can be readily identified and compared to further our understanding of MITE origin and evolution. MITEs, as repetitive sequences, were included in other databases such as the The Institute for Genomic Research (TIGR) Plant Repeat Databases and Repbase (20,21). However, MITEs vary dramatically and usually cannot be identified through homology search between distantly related species, and consequently, only a small proportion of MITE families have been identified and included in these databases. In this study, MITEs were de novo identified from 41 plant species using computer programs MITE Digger, MITE-Hunter and/or RSPB. Each MITE family was annotated manually. All verified MITE families were stored in a database, P-MITE (for plant MITE). BLASTN search function was appended into the database. MITE sequences from each genome were downloadable. P-MITE will be helpful for the annotation of genes and genomic sequences. It can also be used to study the origin and amplification of MITEs, the comparative analysis between different species, the MITE-derived small RNAs and the roles of MITEs on gene and genome evolution, etc. MATERIALS AND METHODS Plant genomes used in this study Forty-one sequenced and published genomes of plant species, including six lower plant species, were included in this study for MITE identification. The information of the 41 species and the Web sites for their genome sequences are listed in Supplementary Table S1. The MITEs from rice were identified and annotated in a previous study (6). 
De novo identification of MITEs using MITE Digger, MITE-Hunter and RSPB MITEs from 41 genomes were de novo identified using program MITE Digger, MITE-hunter and/or RSPB (6,18,19). First, program MITE-Hunter was used to run the sequences of each genome. The resulting groups of potential MITEs were manually checked for TSD and TIR sequences. Groups with no precise boundaries (terminals) or no TIR sequences were not considered as MITEs. The confirmed MITEs from MITE-Hunter were put into a database (MITE-Hunter database). To save running time, program RSPB was slightly modified so that the confirmed MITE sequences in the ‘MITE-Hunter database’ were skipped by RSPB. New groups of repetitive sequences with precise boundaries were reported and checked manually for TSDs and TIRs (Supplementary Figure S1). No TSD and TIR information is required to run RSPB, which identifies repetitive sequences with precise boundaries. In subsequent manual annotation, only repetitive sequences <800 bp and TSD/TIR features similar to known MITE superfamilies were maintained. Five species with large genomes or too many short contigs were not successful using RSPB. MITE Digger, released recently, was also used to run some genomes, including genomes >800 Mb. The statistics of MITE families identified in this study is shown in Supplementary Table S2. The number of MITE families that were detected by RSPB, but not by MITE Hunter, is shown in Supplementary Table S3. Classification of MITE superfamily and family A Perl script was written to cluster MITEs identified above into a family if they had significant sequence similarity (BLASTN e < 10−10) (6). MITE families were assigned into superfamilies based on their TIR and TSD sequences. Each MITE family in a genome was named as code_Abc#, where Ab is the first two letters from its genus name, c the first letter from its species name and # a consecutive number. 
Different superfamilies are represented by different codes, with DTT for Tc1/Mariner, DTM for Mutator, DTA for hAT, DTC for CACTA, DTH for PIF/Harbinger, DTP for P, DTN for Novosib and DTx for unknown (21–23). MITEs with ambiguous TSD and/or TIR features were annotated as unknown superfamily (DTx). MITE families preferentially inserted into simple tandem repeats (microsatellites) were considered as an independent group, MiM (MITEs inserted in microsatellite). A ‘representative’ element was chosen for each family, and the representative elements should have good TIR and perfect TSD sequences if possible. A MITE sequence was considered as a full-length element when its terminals were no more than 3 bp shorter than the representative sequence. To identify all MITE elements, including diverse and/or partial ones, in a genome, a library of all representative elements from each family was used as query sequences to search the entire genome sequence using RepeatMasker v3.2.9 (http://www.repeatmasker.org/). RESULTS AND DISCUSSION De novo identification of MITEs in 41 plant genomes Program MITE-Hunter was applied to 41 plant genomes for genome-wide de novo identification of MITEs. RSPB was also used to run all but five genomes that are either >800 Mb or with too many contigs. MITE Digger was used to search some genomes, including four skipped by RSPB. The MITE sequences obtained from this study were used to execute a BLASTN search of the Repbase, the database most frequently used for repetitive sequences (21). More than 70% of MITE families identified from this study were not included in Repbase (< 10−10), MITE-Hunter, but not RSPB, due to too large genome. A total of 252 MITE families were obtained from maize, which include 97 novel families not covered by maize TE database. However, 61 MITE families listed in maize TE database were not identified by either MITE Digger or MITE-Hunter. 
The computing process of RSPB needs to be mended before it can be applied to large genomes, such as maize, to identify more novel MITE families. The majority of MITEs were classified into five superfamilies, including Tc1/Mariner, PIF/Harbinger, CACTA, hAT and Mutator. Two superfamilies, P and Novosib, were detected in the genomes of lower plants, although they do not have Tc1/Mariner, CACTA and Mutator. Sixteen MITE families were unclassified owing to ambiguous TSD and/or TIR features. MiM is the least frequent in plant genomes (Supplementary Table S2). The MiM group is present in only 10 of the 41 genomes, with 41 893 elements from 33 families. The strawberry genome contains 14 MiM families, whereas the others have no more than four MiM families. Most elements of these MiM families, including the Micron family in rice (24), were inserted in (TA)n repeats, with only a few exceptions, in which they were inserted into (CA)n/(GT)n repeats. Elements from the MiM group have poor TIR sequences, and no conserved nucleotides were found in their terminals among different families. It remains unclear whether different MiM families belong to the same superfamily, i.e. activated by the same type of transposase. In contrast to the scarce MiM group, the Mutator superfamily has 852 390 elements in the 41 genomes included in this study, with an average of >20 790 elements per genome. MITEs with significant nucleotide identities (BLASTN e < 10−10) were grouped into a family. The largest MITE family is the DTM_Mad25 from the apple genome, with 18 904 elements. The smallest MITE families, DTT_Sob24 and DTH_Sob33 from the Sorghum genome, have only one element. The number of MITEs varies dramatically in different species. In general, the genomes of lower plants have relatively few MITEs (Table 1). No MITEs were detected in the genome of Cyanidioschyzon merolae using either MITE-Hunter or RSPB, and the genome of Selaginella moellendorffii harbors only 73 MITE elements. 
The number of MITEs also varies considerably among the genomes of higher plants. For example, only one MITE family with 538 elements was detected in the papaya genome, whereas 237 302 elements from 180 MITE families are present in the apple genome. Large variations in total number of MITE elements also occur between closely related species. For example, the Arabidopsis thaliana genome has only 3245 MITE elements, whereas its close relative, Arabidopsis lyrata, contains 18 039 MITE-related sequences. Similarly, the number of MITEs in the genome of watermelon (with 94 314 MITE elements) is seven times as much as in the genome of melon (with 12 991 MITE elements). Table 1. MITE in 41 plant genomes aThe MITE sequences from rice were retrieved from Lu et al. (25). The number of MITEs in a genome is significantly correlated with its genome assembly size (r = 0.72, P < 0.01; Table 1; Figure 1). A similar correlation coefficient (r = 0.68, P < 0.01) was obtained when the six lower plants were excluded from the analysis. Nevertheless, several striking exceptions were observed. For example, the rice genome is only 373 Mb but has the third largest number (179 415) of MITEs among all species studied, whereas papaya with genome size (342 Mb) similar to that of rice, has only 538 elements of one MITE family (Table 1). Figure 1. Strong correlation between the number of MITEs and genome assembly size. Genomes with disproportionately low copy (➁ papaya and ➂ Physcomitrella patens) and high copy (➀ rice and ➃ apple) of MITEs are indicated. The construction and the use of plant MITE database, P-MITE A total of 2.3 million sequences of 3527 MITE families were obtained from 41 (including the rice genome) plant genomes. A series of databases containing MITEs from the 41 plant genomes was constructed. Elements from each of the 3527 MITE families were checked and annotated manually, and one element with better TSD and/or TIR features was chosen as a representative of the family. 
A database containing all representative elements was constructed, which can be used to study the structure of MITEs, such as their TSD and TIR features. The aforementioned databases are collectively named as P-MITE (for plant MITE), and can be found in http://pmite.hzau.edu.cn/django/mite. The database is searchable using BLASTN algorithm. MITE sequences and representative elements can be downloaded by family or by genome. SUPPLEMENTARY DATA Supplementary Data are available at NAR Online, including [26–66]. FUNDING This work was supported by the ‘973’ National Key Basic Research Program [2009CB119000]; The National Natural Science Foundation of China (NSFC) [30921002]; and the Fundamental Research Funds for the Central Universities [2012ZYTS035]. Funding for open access charge: The Fundamental Research Funds for the Central Universities [2012ZYTS035]. Conflict of interest statement. None declared.
Section	INTRODUCTION Miniature inverted-repeat transposable elements (MITEs) are prevalent in eukaryotic genomes, and are believed to be deletion derivatives of DNA transposons (1,2). Like autonomous DNA transposons, MITEs usually have terminal inverted repeats (TIR), flanked by short direct repeats [also called target site duplication (TSD)]. Compared with autonomous DNA transposons, MITEs are often short (<800 bp) and do not encode transposases. MITEs are often located in gene-rich euchromatic regions and are associated with genes (3,4). Several pieces of evidence suggest that MITEs may affect the expression of nearby genes. MITE Kiddo in rice was shown to upregulate the expression of Ubiquitin2 when inserted in its promoter region (5). However, in other cases, MITE insertions downregulate the expression of nearby genes (6,7). Such downregulation is most likely through small RNAs derived from MITE sequences (6,8). MITE transpositions generate much genetic diversity for a species (9–11). Considering the effects of MITEs on gene expression and variation of MITE insertions in different genotypes, MITEs may contribute to considerable phenotypic diversity as well (12). The first MITE families were discovered through sequence analysis (i.e. identification of TIR and TSD sequences) of insertions of 100–600 bp (13,14). Recently, computer programs were developed to systematically identify MITEs from a database such as genome sequences (6,15–19). Among them, the most successful ones are MITE Digger, MITE-Hunter and RSPB, which identified the vast majority of MITEs in the sequenced genome of rice (6,18,19). The recently reported program MITE Digger is most efficient for de novo MITE identification, particularly in large genomes (19). RSPB is better at identifying MITE families with atypical structures such as MITEs with no TSD or short/diverse TIR sequences. Unfortunately, RSPB requires high computer capacity not found in most laboratories. 
We predicted that combining MITE Digger, MITE-Hunter and RSPB would allow the detection of a vast majority of, if not all, MITE families in a genome, with no prior information required. With the availability of the three MITE detecting programs and the genome sequences of many plant species, MITEs in several genomes can be readily identified and compared to further our understanding of MITE origin and evolution. MITEs, as repetitive sequences, were included in other databases such as the The Institute for Genomic Research (TIGR) Plant Repeat Databases and Repbase (20,21). However, MITEs vary dramatically and usually cannot be identified through homology search between distantly related species, and consequently, only a small proportion of MITE families have been identified and included in these databases. In this study, MITEs were de novo identified from 41 plant species using computer programs MITE Digger, MITE-Hunter and/or RSPB. Each MITE family was annotated manually. All verified MITE families were stored in a database, P-MITE (for plant MITE). BLASTN search function was appended into the database. MITE sequences from each genome were downloadable. P-MITE will be helpful for the annotation of genes and genomic sequences. It can also be used to study the origin and amplification of MITEs, the comparative analysis between different species, the MITE-derived small RNAs and the roles of MITEs on gene and genome evolution, etc.
Title	INTRODUCTION
Title	MATERIALS AND METHODS
Section	Plant genomes used in this study Forty-one sequenced and published genomes of plant species, including six lower plant species, were included in this study for MITE identification. The information of the 41 species and the Web sites for their genome sequences are listed in Supplementary Table S1. The MITEs from rice were identified and annotated in a previous study (6).
Title	Plant genomes used in this study
Section	De novo identification of MITEs using MITE Digger, MITE-Hunter and RSPB MITEs from 41 genomes were identified de novo using the programs MITE Digger, MITE-Hunter and/or RSPB (6,18,19). First, MITE-Hunter was run on the sequences of each genome. The resulting groups of potential MITEs were manually checked for TSD and TIR sequences. Groups with no precise boundaries (termini) or no TIR sequences were not considered MITEs. The confirmed MITEs from MITE-Hunter were put into a database (the ‘MITE-Hunter database’). To save running time, RSPB was slightly modified so that it skipped the confirmed MITE sequences in the ‘MITE-Hunter database’. RSPB identifies repetitive sequences with precise boundaries and requires no TSD or TIR information. The new groups of repetitive sequences with precise boundaries that it reported were checked manually for TSDs and TIRs (Supplementary Figure S1). In subsequent manual annotation, only repetitive sequences <800 bp with TSD/TIR features similar to those of known MITE superfamilies were retained. RSPB could not be run successfully on five species with large genomes or too many short contigs. MITE Digger, released recently, was also run on some genomes, including genomes >800 Mb. The statistics of the MITE families identified in this study are shown in Supplementary Table S2. The number of MITE families that were detected by RSPB, but not by MITE-Hunter, is shown in Supplementary Table S3.
Title	De novo identification of MITEs using MITE Digger, MITE-Hunter and RSPB
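The manual criteria described in this subsection (a length below 800 bp and the presence of terminal inverted repeats) can be illustrated with a minimal Python sketch. It is not the code of MITE Digger, MITE-Hunter or RSPB, and it does not check TSDs. The function names and the parameters `max_mismatch`, `max_scan` and `min_tir` are hypothetical values chosen only for illustration.

```python
# Illustrative sketch only: not the algorithm of MITE-Hunter, RSPB or MITE Digger.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def tir_length(seq: str, max_mismatch: int = 1, max_scan: int = 30) -> int:
    """Length of the terminal stretch in which the 5' end matches the
    reverse complement of the 3' end, tolerating a few mismatches.

    max_mismatch and max_scan are hypothetical thresholds.
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    limit = min(max_scan, len(seq) // 2)
    mismatches = 0
    best = 0
    for i in range(limit):
        if seq[i] == rc[i]:
            best = i + 1  # extend the TIR only on a matching position
        else:
            mismatches += 1
            if mismatches > max_mismatch:
                break
    return best


def looks_like_mite(seq: str, min_tir: int = 10, max_len: int = 800) -> bool:
    """Apply the two sequence-level criteria: length < 800 bp and a TIR.

    min_tir is an assumed cutoff; the source gives no numeric TIR threshold.
    """
    return len(seq) < max_len and tir_length(seq) >= min_tir


# Toy element: a 15-bp TIR flanking an A-rich internal region.
left = "GGCCAGTCACAATGG"
toy = left + "A" * 50 + reverse_complement(left)
```

In this toy example `tir_length(toy)` reports the 15-bp repeat, and the same element padded beyond 800 bp would be rejected on length alone.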
Section	Classification of MITE superfamily and family A Perl script was written to cluster the MITEs identified above into a family if they had significant sequence similarity (BLASTN e < 10⁻¹⁰) (6). MITE families were assigned to superfamilies based on their TIR and TSD sequences. Each MITE family in a genome was named code_Abc#, where code denotes the superfamily, Ab is the first two letters of the genus name, c is the first letter of the species name and # is a consecutive number. Different superfamilies are represented by different codes: DTT for Tc1/Mariner, DTM for Mutator, DTA for hAT, DTC for CACTA, DTH for PIF/Harbinger, DTP for P, DTN for Novosib and DTx for unknown (21–23). MITEs with ambiguous TSD and/or TIR features were annotated as unknown superfamily (DTx). MITE families preferentially inserted into simple tandem repeats (microsatellites) were considered an independent group, MiM (MITEs inserted in microsatellite). A ‘representative’ element was chosen for each family; where possible, the representative element has good TIR and perfect TSD sequences. A MITE sequence was considered a full-length element when its termini were no more than 3 bp shorter than those of the representative sequence. To identify all MITE elements in a genome, including divergent and/or partial ones, a library of the representative elements from all families was used as query sequences to search the entire genome sequence using RepeatMasker v3.2.9 (http://www.repeatmasker.org/).
Title	Classification of MITE superfamily and family
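The naming convention and the similarity-based grouping described in this subsection can be sketched as follows. This is an illustrative Python sketch, not the Perl script used in the study. The single-linkage grouping, the function names and the `(query, subject, e_value)` hit tuples are assumptions made for the example, and the MiM group is not covered because its naming is not specified here.

```python
# Illustrative sketch only; the original clustering was done with a Perl script.
SUPERFAMILY_CODES = {
    "DTT": "Tc1/Mariner",
    "DTM": "Mutator",
    "DTA": "hAT",
    "DTC": "CACTA",
    "DTH": "PIF/Harbinger",
    "DTP": "P",
    "DTN": "Novosib",
    "DTx": "unknown",
}


def family_name(code: str, genus: str, species: str, number: int) -> str:
    """Build a name of the form code_Abc#, e.g. DTM_Mad25."""
    if code not in SUPERFAMILY_CODES:
        raise ValueError(f"unknown superfamily code: {code}")
    return f"{code}_{genus[:2].capitalize()}{species[0].lower()}{number}"


def cluster_by_hits(ids, hits, max_evalue=1e-10):
    """Group sequences connected by BLASTN hits with e-value below the cutoff.

    Assumes single-linkage grouping (union-find); the source does not state
    which linkage rule the original script applied.
    """
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for query, subject, e_value in hits:
        if e_value < max_evalue:
            parent[find(query)] = find(subject)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical hits: m1-m2 and m2-m3 pass the cutoff, m3-m4 does not.
example_hits = [("m1", "m2", 1e-30), ("m2", "m3", 1e-12), ("m3", "m4", 1e-5)]
example_families = cluster_by_hits(["m1", "m2", "m3", "m4"], example_hits)
```

For instance, `family_name("DTM", "Malus", "domestica", 25)` yields `DTM_Mad25`, the form of name used for the apple family reported in this study, and the toy hits group m1, m2 and m3 into one family while leaving m4 on its own.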
Title	RESULTS AND DISCUSSION
Section	De novo identification of MITEs in 41 plant genomes MITE-Hunter was applied to 41 plant genomes for genome-wide de novo identification of MITEs. RSPB was also run on all but five genomes, which are either >800 Mb or have too many contigs. MITE Digger was used to search some genomes, including four skipped by RSPB. The MITE sequences obtained in this study were used as queries in a BLASTN search of Repbase, the database most frequently used for repetitive sequences (21). More than 70% of the MITE families identified in this study were not included in Repbase (BLASTN e < 10⁻¹⁰). The maize genome was searched using MITE Digger and MITE-Hunter, but not RSPB, because the genome is too large. A total of 252 MITE families were obtained from maize, including 97 novel families not covered by the maize TE database. However, 61 MITE families listed in the maize TE database were not identified by either MITE Digger or MITE-Hunter. The computing process of RSPB needs to be improved before it can be applied to large genomes, such as maize, to identify more novel MITE families. The majority of MITEs were classified into five superfamilies: Tc1/Mariner, PIF/Harbinger, CACTA, hAT and Mutator. Two superfamilies, P and Novosib, were detected in the genomes of lower plants, although these genomes do not have Tc1/Mariner, CACTA or Mutator MITEs. Sixteen MITE families were unclassified owing to ambiguous TSD and/or TIR features. The MiM group is the least frequent in plant genomes (Supplementary Table S2). It is present in only 10 of the 41 genomes, with 41 893 elements from 33 families. The strawberry genome contains 14 MiM families, whereas the others have no more than four MiM families each. Most elements of these MiM families, including the Micron family in rice (24), were inserted in (TA)n repeats, with only a few exceptions that were inserted into (CA)n/(GT)n repeats. Elements from the MiM group have poor TIR sequences, and no conserved nucleotides were found in their termini among different families. 
It remains unclear whether different MiM families belong to the same superfamily, i.e. are activated by the same type of transposase. In contrast to the scarce MiM group, the Mutator superfamily has 852 390 elements in the 41 genomes included in this study, an average of about 20 790 elements per genome. MITEs with significant nucleotide identity (BLASTN e < 10⁻¹⁰) were grouped into a family. The largest MITE family is DTM_Mad25 from the apple genome, with 18 904 elements. The smallest MITE families, DTT_Sob24 and DTH_Sob33 from the Sorghum genome, have only one element each. The number of MITEs varies dramatically among species. In general, the genomes of lower plants have relatively few MITEs (Table 1). No MITEs were detected in the genome of Cyanidioschyzon merolae using either MITE-Hunter or RSPB, and the genome of Selaginella moellendorffii harbors only 73 MITE elements. The number of MITEs also varies considerably among the genomes of higher plants. For example, only one MITE family with 538 elements was detected in the papaya genome, whereas 237 302 elements from 180 MITE families are present in the apple genome. Large variations in the total number of MITE elements also occur between closely related species. For example, the Arabidopsis thaliana genome has only 3245 MITE elements, whereas its close relative, Arabidopsis lyrata, contains 18 039 MITE-related sequences. Similarly, the genome of watermelon (94 314 MITE elements) contains about seven times as many MITEs as the genome of melon (12 991 MITE elements). The number of MITEs in a genome is significantly correlated with its genome assembly size (r = 0.72, P < 0.01; Table 1; Figure 1). A similar correlation coefficient (r = 0.68, P < 0.01) was obtained when the six lower plants were excluded from the analysis. Nevertheless, several striking exceptions were observed. 
For example, the rice genome is only 373 Mb but has the third largest number (179 415) of MITEs among all species studied, whereas papaya, with a genome size (342 Mb) similar to that of rice, has only 538 elements from one MITE family (Table 1).
Title	De novo identification of MITEs in 41 plant genomes
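The summary figures quoted in this subsection can be rechecked with a few lines of arithmetic. The Python sketch below is only an illustration: it uses the counts and genome sizes stated in the text, and it includes a plain Pearson correlation function of the kind that could be applied to the genome sizes and MITE counts of Table 1. Table 1 itself is not reproduced here, so the reported r = 0.72 is not recomputed, and the name `pearson_r` is a hypothetical helper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Figures quoted in the text.
mutator_mean = 852_390 / 41            # Mutator elements per genome: 20790.0
watermelon_vs_melon = 94_314 / 12_991  # fold difference between close relatives

# MITE elements per Mb for the two similarly sized genomes named as exceptions.
rice_density = 179_415 / 373
papaya_density = 538 / 342
```

With genome sizes in one list and MITE counts in another, `pearson_r(sizes, counts)` would give the coefficient for whichever set of genomes is supplied. The two densities show why rice and papaya stand out as exceptions despite their similar genome sizes.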
Table caption	Table 1. MITE in 41 plant genomes. ᵃThe MITE sequences from rice were retrieved from Lu et al. (25).
Figure caption	Figure 1. Strong correlation between the number of MITEs and genome assembly size. Genomes with disproportionately low copy (➁ papaya and ➂ Physcomitrella patens) and high copy (➀ rice and ➃ apple) of MITEs are indicated.
Section	The construction and the use of plant MITE database, P-MITE A total of 2.3 million sequences from 3527 MITE families were obtained from 41 plant genomes (including the rice genome). A series of databases containing the MITEs from the 41 plant genomes was constructed. Elements from each of the 3527 MITE families were checked and annotated manually, and one element with good TSD and/or TIR features was chosen as the representative of the family. A database containing all representative elements was also constructed, which can be used to study the structure of MITEs, such as their TSD and TIR features. These databases are collectively named P-MITE (for plant MITE) and can be found at http://pmite.hzau.edu.cn/django/mite. The database is searchable using the BLASTN algorithm. MITE sequences and representative elements can be downloaded by family or by genome.
Title	The construction and the use of plant MITE database, P-MITE
Section	SUPPLEMENTARY DATA Supplementary Data are available at NAR Online, including [26–66].
Title	SUPPLEMENTARY DATA
Section	FUNDING This work was supported by the ‘973’ National Key Basic Research Program [2009CB119000]; The National Natural Science Foundation of China (NSFC) [30921002]; and the Fundamental Research Funds for the Central Universities [2012ZYTS035]. Funding for open access charge: The Fundamental Research Funds for the Central Universities [2012ZYTS035]. Conflict of interest statement. None declared.
Title	FUNDING
