COG, GO and KEGG classification

The Clusters of Orthologous Groups (COGs) of proteins were generated by comparing the protein sequences of complete genomes. Each cluster contains proteins or groups of paralogs from at least three lineages [38]. The current COG database contains both prokaryotic and eukaryotic clusters [39]. We aligned the unigenes to the COG database to find homologous genes and classify the possible functions of the unigenes (Figure 2). A total of 14,035 unigenes (10.4% of the total) had a match in the COG database with an E value <1e-10 (Table 3). The possible functions of 11,771 unigenes (83.87% of those with a COG match) were classified and subdivided into 24 COG categories (Table S2). The largest group was ‘General function prediction only’ (2,241, 19.04%), followed by ‘Post-translational modification, protein turnover, chaperones’ (1,527, 12.97%) and ‘Translation, ribosomal structure and biogenesis’ (908, 7.71%).

Figure 2. COG classification of the unigenes. Possible functions of 11,771 unigenes were classified and subdivided into 24 COG categories. doi:10.1371/journal.pone.0079516.g002

Gene Ontology (GO) is an international standardized gene functional classification system that covers three domains: cellular component, molecular function and biological process. The InterPro domains were annotated by InterProScan Release 27.0, and functional assignments were mapped onto the GO structures. In total, 20,686 unigenes were matched to a GO annotation (Table 3). We used WEGO to perform the GO classifications and draw the GO tree, which facilitated the classification of the C. fluminea transcripts into putative functional groups. In total, 20,286 unigenes were assigned GO terms in 46 functional groups and three categories (Table S3), including 19,167 unigenes at the cellular component level, 25,414 unigenes at the molecular function level and 26,279 unigenes at the biological process level (Figure 3).
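The molecular function and biological process totals reported for the GO classification (25,414 and 26,279) are larger than the 20,286 unigenes that received GO terms. One plausible reading, which is an assumption here and is not stated in the text, is that each domain total sums the unigene counts of its functional groups, so a unigene placed in several groups is counted more than once. The following Python sketch illustrates that tallying on made-up data; the function name `summarize_go`, the record layout and the unigene identifiers are hypothetical and do not reproduce the WEGO output format.

```python
from collections import defaultdict


def summarize_go(assignments):
    """Tally GO assignments given (unigene, domain, functional_group) records.

    Returns (number of distinct annotated unigenes,
             per-domain totals summed over functional groups,
             per-group distinct unigene counts).
    """
    annotated = set()
    group_members = defaultdict(set)
    for unigene, domain, group in assignments:
        annotated.add(unigene)
        group_members[(domain, group)].add(unigene)

    group_counts = {key: len(members) for key, members in group_members.items()}

    # Assumed tallying rule: a domain total is the sum of its group counts,
    # so a unigene in two groups of one domain contributes twice.
    domain_totals = defaultdict(int)
    for (domain, _group), count in group_counts.items():
        domain_totals[domain] += count

    return len(annotated), dict(domain_totals), group_counts


# Made-up example records (not data from the study).
demo_assignments = [
    ("u1", "cellular component", "cell"),
    ("u1", "cellular component", "cell part"),
    ("u1", "molecular function", "binding"),
    ("u1", "molecular function", "catalytic activity"),
    ("u2", "molecular function", "binding"),
    ("u2", "molecular function", "catalytic activity"),
    ("u3", "biological process", "metabolic process"),
]

demo_n_annotated, demo_domain_totals, demo_group_counts = summarize_go(demo_assignments)
```

In this toy case three unigenes are annotated, yet the molecular function total is four, which mirrors how a domain total can exceed the number of annotated unigenes under the assumed rule.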
Within the cellular component category, cell (6,447) and cell part (6,447) were the most highly represented groups. Binding (13,252) and catalytic activity (9,019) were the most abundant groups within the molecular function category. A total of 22 GO functional groups were assigned to the biological process category, among which metabolic process (9,021) and cellular process (7,726) were the most highly represented.

Figure 3. Classification of C. fluminea sequences based on predicted Gene Ontology (GO) terms. In total, 20,286 unigenes were assigned GO terms in 46 functional groups and three categories, including 19,167 unigenes at the cellular component level, 25,414 unigenes at the molecular function level and 26,279 unigenes at the biological process level. doi:10.1371/journal.pone.0079516.g003

Based on comparative analyses using the KEGG database, 32,042 unigenes (23.8% of the total) had a match with an E value <1e-10 using BLASTx (Table 3). We used a Perl script to retrieve KO information from the BLAST results, establish pathway associations between the unigenes and the database, and match these 32,042 sequences to 253 different KEGG pathways (Table S4). Of these 32,042 sequences with KEGG annotation, 10,389 were classified into metabolism groups, most of them involved in amino acid, carbohydrate, lipid and energy metabolism. Among the remaining categories, the greatest number of sequences was classified into genetic information processing (9,373), followed by human diseases (6,036), cellular processes (4,862) and environmental information processing (3,199).

Overall, the possible functions of the assembled unigenes were assessed by similarity matches against the COG, GO and KEGG databases. The results of these database searches help us better understand the biological features of C. fluminea. The functional classification patterns found for C. fluminea in this study were similar to those reported for other organisms [23,30,31,50].
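The KEGG annotation step described above (filter BLASTx hits at E value <1e-10, retrieve KO information, then associate unigenes with pathways) can be sketched as follows. This is a minimal Python illustration and not the authors' Perl script. It assumes the BLAST results are in the standard 12-column tabular layout, with the E value in the eleventh column, and that subject-to-KO and KO-to-pathway lookup tables are already available as dictionaries. The function names, the lookup tables and all identifiers in the demo data are hypothetical placeholders.

```python
import csv
import io
from collections import defaultdict

EVALUE_CUTOFF = 1e-10  # threshold reported in the text


def best_hits(blast_tab, cutoff=EVALUE_CUTOFF):
    """Return {query: (subject, evalue)} with the lowest-E-value hit below cutoff.

    blast_tab is an iterable of tab-separated lines in the standard
    12-column BLAST tabular layout (assumed, not confirmed by the source).
    """
    best = {}
    for row in csv.reader(blast_tab, delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue >= cutoff:
            continue  # keep only hits with E value strictly below the cutoff
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


def pathway_counts(best, subject_to_ko, ko_to_pathways):
    """Count distinct unigenes per pathway via subject -> KO -> pathway lookups."""
    members = defaultdict(set)
    for query, (subject, _evalue) in best.items():
        ko = subject_to_ko.get(subject)
        if ko is None:
            continue  # hit has no KO assignment
        for pathway in ko_to_pathways.get(ko, ()):
            members[pathway].add(query)
    return {pathway: len(queries) for pathway, queries in members.items()}


# Made-up demo input: placeholder identifiers, not real KEGG entries.
demo_blast = io.StringIO(
    "unigene1\tsubjA\t90.0\t100\t5\t0\t1\t300\t1\t100\t1e-30\t200\n"
    "unigene1\tsubjB\t80.0\t100\t10\t0\t1\t300\t1\t100\t1e-12\t120\n"
    "unigene2\tsubjB\t85.0\t90\t8\t0\t1\t270\t1\t90\t1e-20\t150\n"
    "unigene3\tsubjC\t40.0\t50\t30\t0\t1\t150\t1\t50\t1e-5\t40\n"
)
demo_subject_to_ko = {"subjA": "KO_A", "subjB": "KO_B"}
demo_ko_to_pathways = {"KO_A": ["path_X", "path_Y"], "KO_B": ["path_X"]}

demo_best = best_hits(demo_blast)
demo_counts = pathway_counts(demo_best, demo_subject_to_ko, demo_ko_to_pathways)
```

Counting distinct unigenes per pathway means that a unigene whose KO belongs to several pathways contributes to each of them, so pathway or category counts need not sum to the number of annotated sequences.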