Genome sequence of the Japanese oak silk moth, Antheraea yamamai: the first draft genome in the family Saturniidae

Abstract

Background

 Antheraea yamamai, also known as the Japanese oak silk moth, is a wild species of silk moth. Silk produced by A. yamamai, referred to as tensan silk, shows different characteristics such as thickness, compressive elasticity, and chemical resistance compared with common silk produced from the domesticated silkworm, Bombyx mori. Its unique characteristics have led to its use in many research fields including biotechnology and medical science, and the scientific as well as economic importance of the wild silk moth continues to gradually increase. However, no genomic information for the wild silk moth, including A. yamamai, is currently available.

Findings

In order to construct the A. yamamai genome, a total of 147G base pairs using Illumina and Pacbio sequencing platforms were generated, providing 210-fold coverage based on the 700-Mb estimated genome size of A. yamamai. The assembled genome of A. yamamai was 656 Mb (>2 kb) with 3675 scaffolds, and the N50 length of assembly was 739 Kb with a 34.07% GC ratio. Identified repeat elements covered 37.33% of the total genome, and the completeness of the constructed genome assembly was estimated to be 96.7% by Benchmarking Universal Single-Copy Orthologs v2 analysis. A total of 15 481 genes were identified using Evidence Modeler based on the gene prediction results obtained from 3 different methods (ab initio, RNA-seq-based, known-gene-based) and manual curation.

Conclusions

Here we present the genome sequence of A. yamamai, the first genome sequence of the wild silk moth. These results provide valuable genomic information, which will help enrich our understanding of the molecular mechanisms relating to not only specific phenotypes such as wild silk itself but also the genomic evolution of Saturniidae.

Data Description

 Antheraea yamamai (NCBI Taxonomy ID: 7121), also known as the Japanese oak silk moth, is a wild silk moth species belonging to the Saturniidae family (Fig. 1). Silk moths can be categorized into 2 families—Bombycidae and Saturniidae. Saterniidae has been estimated to contain approximately 1861 species, with 162 genera, and is known as the largest family in the Lepidoptera. Among the many species in family Saturniidae, only a few species, including A. yamamai, can be utilized for silk production. Previous phylogenetic studies have shown that the family Saturniidae shares common ancestors with the family Sphingidae, including the hawk moth (Macroglossum stellatarum), and the Bombycidae family, including the most representative silkworm, Bombyx mori. The estimated divergence time between A. yamamai and B. mori is 84 million years ago (MYA), making it similar to the 88 MYA estimated divergence time between the human and mouse.

Photograph of Antheraea Yamamai. From left: larva, cocoon, and adult A. yamamai, respectively. Green color is one of the representative characteristics of tensan silk.

 A. yamamai produces a specific silk called tensan silk, which shows distinctive characteristics compared with common silk from B. mori, including characteristics such as thickness, bulkiness, compressive elasticity, and resistance to dyeing chemicals. These characteristics have received the attention of researchers as a new biomaterial for use in various fields. Additionally, tensan silk also has been studied for its applications to human health. However, despite the potential importance of the wild silk moth in research and economic fields, no whole genomic information is currently available for this or any other species from the family Saturniidae.

In this study, we present the annotated genome sequence of A. yamamai, the first published genome in the family Saturniidae, with transcriptome datasets collected from 10 different body organ tissues. These data will be a fundamental resource for future studies and provide more insight into the genome evolution and molecular phylogeny of the family Saturniidae.

Sample preparation and sequencing

For whole-genome sequencing, we selected 1 male sample (Ay-7-male1) from a breeding line (Ay-7) of A. yamamai raised at the National Academy of Agricultural Science, Rural Development Administration, Korea. In lepidopterans, males are homogametic (ZZ), and selecting a male sample can reduce the complexity of assembly from excessive repeats on the W chromosome in females. For genomic library construction, we removed the guts of A. yamamai to prevent contamination of genomes from other organisms such as gut microbes and oak, the main food source of A. yamamai. Details of the sample preparation process used in this study are presented in the Supplementary Data. Genomic DNA was extracted using a DNeasy Animal Mini Kit (Qiagen, Hilden, Germany), and the quality of extracted DNA was checked using trenean, picogreen assay, and gel electrophoresis (1% agarose gel/40 ng loading). After quality control processing, we were left with a total of 61.5 ug of A. yamamai DNA for genome sequencing. Using standard Illumina whole-genome shotgun (WGS) sequencing protocol (paired-end and mate-pair), we added 2 long read sequencing platforms, Moleculo (Illumina synthetic long read) and RS II (Pacific Bioscience). Tables 1–3 show a summary of generated data for each library used in this study. RNA-seq libraries were also constructed for 10 different tissues (hemocyte, malpighian tube, midgut, fat body, anterior-middle/silk gland, posterior/silk gland, head, integument, testis, ovary) with 3 biological replicates following standard manufacturer protocol (Illumina, San Diego, CA, USA). For this, more than 100 individual A. yamamai samples in 5 instar stages from the same breeding line were used for tissue anatomy, and 3 samples from each tissue were selected based on the quality of extracted RNA. Details of transcriptome library construction are shown in the Supplementary Data. Information of libraries and generated data is provided in Table 4, and a total of 147 Gb of genomic data and 76 Gb of transcriptomic data was generated for this study.

Summary statistics of generated whole-genome shotgun sequencing data using Illumina Nextseq 500

Library name	Library type	Insert size	Platform	Read length	No. of reads	Total base, bp	Reads retained after trimming	 	350 bp	Paired-end	350 bp	Nextseq500	151	293 176 268	44 269 616 468	291 070 362	 	700 bp	Paired-end	700 bp	Nextseq500	151	246 945 900	37 288 830 900	244 698 580	 	3 Kbp	Mate-pair	3 Kbp	Nextseq500	76	284 204 762	21 599 561 912	195 095 164	 	6 Kbp	Mate-pair	6 Kbp	Nextseq500	76	246 238 370	18 714 116 120	152 496 372	 	9 Kbp	Mate-pair	9 Kbp	Nextseq500	76	239 919 538	18 233 884 888	148 612 724	 	Total					1 310 484 838	140 106 010 288	1 031 973 202	 	

Summary statistics of generated Illumina synthetic long read (Moleculo) library

	500–1499 bp	≥1500 bp	 	No. of assembled reads	302 132	342 738	 	No. of bases in assembled read	268 853 717	1 205 349 082	 	N50 length of assembled read	960	4031	 	

Summary statistics of generated long reads data using Pacbio RS II system

No. of reads	1005,571	 	Total bases	5836 969 225	 	Length of longest (shortest) read	50 132 (50)	 	Average read length	5804.63	 	

Summary statistics of generated transcriptome data obtained from 6 organ tissues using Illumina platform

Tissue	Sample name	Read length	Read count	Total base, bp	 	Hemocyte	Hemocyte_1	76	20 815 674	1 581 991 224	 		Hemocyte_2	76	26 704 666	2 029 554 616	 		Hemocyte_2	76	53 068 562	4 033 210 712	 	Malpighian tube	Malpighi_1	76	22 635 428	1 720 292 528	 		Malpighi_2	76	24 893 788	1 891 927 888	 		Malpighi_3	76	45 213 164	3 436 200 464	 	Midgut	Midgut_1	76	23 350 138	1 774 610 488	 		Midgut_2	76	24 597 972	1 869 445 872	 		Midgut_3	76	50 949 986	3 872 198 936	 	Head	Head_1	76	26 526 276	2 015 996 976	 		Head_2	76	26 581 124	2 020 165 424	 		Head_3	76	40 900 456	3 108 434 656	 	Integument	Skin_1	76	24 592 846	1 869 056 296	 		Skin_2	76	42 775 430	3 250 932 680	 		Skin_3	76	35 043 570	2 663 311 320	 	Fat body	Fat Body_1	76	24 637 810	1 872 473 560	 		Fat Body_2	76	24 037 494	1 826 849 544	 		Fat Body_3	76	40 817 582	3 102 136 232	 	Anterior-middle/silk gland	AM/Silk Gland_1	76	21 399 638	1 626 372 488	 		AM/Silk Gland_2	76	24 292 386	1 846 221 336	 		AM/Silk Gland_3	76	37 331 530	2 837 196 280	 	Posterior/silk gland	P/Silk Gland_1	76	27 359 580	2 079 328 080	 		P/Silk Gland_2	76	23 300 962	1 770 873 112	 		P/Silk Gland_3	76	39 421 430	2 996 028 680	 	Testis	Testis_1	76	40 890 404	3 107 670 704	 		Testis_2	76	45 733 846	3 475 772 296	 		Testis_3	76	44 985 224	3 418 877 024	 	Ovary	Ovary_1	76	40 797 628	3 100 619 728	 		Ovary_2	76	40 409 752	3 071 141 152	 		Ovary_3	76	42 417 892	3 223 759 792	 	

Genome assembly and evaluation

Before conducting genome assembly, we conducted k-mer distribution analysis using a 350-bp paired-end library in order to estimate the size and characteristics of the A. yamamai genome. The quality of our generated raw data was checked using FASTQC (FastQC, RRID:SCR_014583). Sequencing artifacts such as adapter sequences and low-quality bases were removed using Trimmomatic (Trimmomatic, RRID:SCR_011848). Jellyfish was used to count the k-mer frequency for estimation of the genome size of A. yamamai. Figure 2 shows the 19-mer distribution of the A. yamamai genome using a 350-bp paired-end library. In the 19-mer distribution, the second peak, at approximately half the coverage value (x-axis) of the main peak, indicates heterozygosity. Although the inbred line used in this study was the single pair sib-mating maintained for more than 10 generations, high heterozygosity still remains. This phenomenon has been observed in a previous genomic study of the Diamondback moth (Plutella xylostella), and sustained heterozygosity as an important genomic characteristic was hypothesized to be a result of environmental adaption. The underlying mechanism of the sustained heterozygosity is unclear, but associative overdominance can be one of the candidate explanations of this phenomenon. Based on the result of 19-mer distribution analysis, the genome size of A. yamamai was estimated to be 709 Mb. However, this size might be larger than the real genome size of A. yamamai because high heterozygosity could affect the estimation of genome size based on the K-mer distribution. Next, we conducted error correction on Illumina paired-end libraries using the error correction module of Allpaths-LG before the initial contig assembly process (ALLPATHS-LG, RRID:SCR_010742). After error correction, initial contig assembly with 350-bp and 700-bp libraries was conducted using SOAP denovo2 with the parameter option set at K = 19; this approach showed the best assembly statistics compared with other assemblers and parameters (SOAPdenovo2, RRID:SCR_014986). Quality control processing for mate-pair libraries and scaffolding was conducted using Nxtrim and SSPACE (SSPACE, RRID:SCR_011848), respectively. At each scaffolding step, SOAP Gapcloser with -l 155 and -p 31 parameters was repeatedly used to close the gaps within each scaffold. In order to obtain a higher-quality genome assembly of A. yamamai, we employed several long read scaffolding strategies using SSPACE-LongRead. First, we used an Illumina synthetic long read sequencing platform called Moleculo, which has been proven valuable for the study of highly heterozygous genomes in previous studies. After scaffolding was performed using SSPACE-LongRead with Illumina synthetic long read data, the total number of assembled scaffolds was effectively reduced from 398 446 to 24 558. The average scaffold length was also extended from 1.7 Kb to 24.8 Kb. However, there was no impressive improvement in N50 length (approximately 91 Kb to 112 Kb) of assembled scaffolds. Therefore, we employed another type of long read data generated from 10 cells of Pacbio RS II system with P6-C4 chemistry. After final scaffolding processing using Pacbio long reads, the number of scaffolds was reduced to 3675, and N50 length was effectively extended from 112 Kb to 739 Kb. Summary statistics of the assembled A. yamamai genome are provided in Table 5. Final assembly of the A. yamamai genome was 656 Mb (>2 kb) long with 3675 scaffolds, and the N50 length of assembly was 739 Kb with a 34.07% GC ratio. To evaluate the quality of the assembled genome, we conducted Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis using BUSCO v2.0 with insecta_odb9 including 1658 BUSCOs from 42 species (BUSCO, RRID:SCR_015008). From BUSCO analysis, 96.7% of BUSCOs were completely detected in the assembled genome (1576: complete and single-copy; 27: complete and duplicated) among 1658 tested BUSCOs. The numbers of fragmented and missing BUSCOs were 21 and 34, respectively. Based on the result of BUSCO analysis, the genome of A. yamamai presented here was considered properly constructed for downstream analysis.

19-mer distribution of A. yamamai genome using jellyfish with 350-bp paired-end whole genome sequencing data.

Summary statistics of the A. yamamai genome (>2 kb)

Assembled genome		 	Size, 1n	656 Mb	 	GC level	34.07	 	No. of scaffolds	3675	 	N50 of scaffolds, bp	739 388	 	No. of bases in scaffolds, %	19 257 439 (2.93)	 	Longest (shortest) scaffolds, bp	3 156 949 (2003)	 	Average scaffold length, bp	178 657.53	 	

Repeat identification and comparative repeat analysis

To identify repeat elements of the A. yamamai genome, a custom repeat library was constructed using RepeatModeler with RECON, RepeatScout, and TRF. The resulting constructed custom repeat library for A. yamamai was further curated using the CENSOR search, and the curated library was employed in RepeatMasker with Repbase. RepeatMasker (RepeatMasker, RRID:SCR_012954) was conducted with RMBlast and the “no_is” option for skipping bacterial insertion element check. Table 6 summarizes the proportion of identified mobile elements in the A. yamamai genome. The most prevalent repeat elements in the A. yamamai genome were LINE elements (101 Mb, 15.31% of total genome), and total repeat elements accounted for 37.33% of the total genome. In order to compare the repeat elements of A. yamamai with those of other genomes, we conducted the same process for 7 public genomes that are close neighbors of A. yamamai—Aedes aegypti, Bombyx mori, Danaus plexippus, Drosophila melanogaster, Heliconius Melpomene, Melitaea cinxia, and Plutella xylostella. Figure 3 displays the amount and proportion of identified repeat elements from the 8 species. Despite the small genome size of B. mori, the total amount of identified SINE elements in the B. mori genome was 5.77 times larger than that of A. yamamai. The top 5 expanded repeat elements in the A. yamamai genome were DNA/RC, LINE/L2, LINE/RTE-BovB, DNA/TcMar-Mariner, and LINE/CR1. Among these, DNA/TcMar-Mariner was the specifically expanded repeat element in A. yamamai among the 8 species. In B. mori, SINE/tRNA-CR1, LINE/Jockey, DNA/RC, LINE/CR1-Zenon, and LINE/RTE-BovB were the top 5 expanded repeat elements. When comparing the repeat elements of A. yamamai with those of B. mori, which are both producers of the same type of silk, repeat elements showed family and species-specific patterns in the 2 silk moth linages. Particularly, we found that the mariner repeat element, which was found specifically expanded in the A. yamamai genome, was also included in the fibroin gene. A previous sequencing study also showed that the mariner repeat element was inserted in the 5’-end of the fibroin gene of A. yamamai. Fibroin is the core component of the silk protein found in the silk moth, and the physical characteristics of silk mainly depend on the types and unique repeat motifs of the fibroin. This gene is known to have hundreds of tandem repeat motifs, and these kinds of tandem repeats can be derived through transposable elements. This indicates that the mariner repeat element, specifically expanded in the A. yamamai genome, may play an important role in the development of the unique silk of A. yamamai, and the lineage-specific repeat elements may be one of the candidate evolution forces related to host-specific phenotype during genome evolution.

Amount and proportion of identified repeat element from 8 species including A. yamamai. A) Absolute amount of repeat element classified into 8 different categories. B) Proportion of each repeat element in identified total repeat element.

Summary of identified repeat elements in the A. yamamai genome

Repeat element	No. of elements	Length, %	 	SINE	59 968	8 615 338 (1.30)	 	LINE	426 522	101 251 176 (15.31)	 	LTR element	53 977	4 552 386 (0.69)	 	DNA element	512 760	69 071 227 (10.44)	 	Small RNA	43 645	6 691 619 (1.01)	 	Simple repeat	135 989	6 256 839 (0.95)	 	Low complexity	19 937	932 829 (0.14)	 	Unclassified	294 190	54 552 009 (8.25)	 	

Gene prediction and annotation

Three different algorithms were used for gene prediction of the A. yamamai genome: ab initio, RNA-seq transcript-based, and protein homology-based approaches. For ab initio gene prediction, Augustus (Augustus: Gene Prediction, RRID:SCR_008417), Geneid, and GeneMarks-ET were employed. Augustus was trained using known genes of A. yamamai in the NCBI database, and mapping information of RNA-seq data obtained from Tophat (TopHat, RRID:SCR_013035) was also utilized for gene prediction. Geneid was used with predefined parameters for Drosophila melanogaster. GeneMarks-ET was employed using junction information of genes from transcriptome data alignment. For RNA-seq transcript-based prediction, generated transcriptome data from 10 organ tissues of A. yamamai were aligned to the assembled genome and gene information was predicted using Cufflinks (Cufflinks, RRID:SCR_014597). The longest CDS sequences were identified from Cufflinks results using Transdecoder. For the homology-based approach, all known genes of the order Lepidoptera in the NCBI database were aligned using PASA. Table 7 shows the gene prediction results from each method. Gene prediction results from different prediction algorithms were combined using Evidence Modeler, and a consensus gene set of the A. yamamai genome was created. Manual curation was performed based on the 5 types of evidence (3 in silico, known protein, and RNA-seq) using IGV and BLASTP (BLASTP, RRID:SCR_001010). Using IGV with each gene evidence and comparing results with known genes via BLASTP, we mainly focused on removing false positively predicted genes that don’t have enough evidence. Merged and spliced gene structured were corrected by comparing the gene structure with known exon structure in the NCBI NR database. In addition, fibroin and sericin genes, which couldn’t be properly predicted because of their high repeat motif, were also manually identified with previously known sequences with RNA-seq data. The final gene set of the A. yamamai genome contains 15 481 genes. Summary statistics for the consensus gene set are provided in Table 8. The average gene length was 11 016.34 bp with a 34.38% GC ratio, and the number of exons per gene was 5.64. In order to identify the function of predicted genes in A yamamai, 3 nonredundant sequence databases (Swiss-Prot, Uniref100, and NCBI NR) as well as the gene information of 2 species (B. mori and D. melanogaster) were used for target databases using BLASTP. Additionally, protein domain searches were conducted on the consensus gene set using InterproScan5 (InterProScan, RRID:SCR_005829). Figure S1 shows the top 20 identified terms from 7 different InterproScan5 analyses. Among the various analyses conducted using InterproScan5, gene ontology analysis with the Pfam database showed that a large proportion of genes in the A. yamamai genome were related with the function of molecular binding, catalytic activity, internal component of membrane, metabolic process, oxidation-reduction process, and transmembrane transport.

Summary statistics of ab initio, RNA-seq-based, and homology-based gene prediction results

Evidence type	Programs	Element	Total count	Exon/gene	Total length, bp	Mean length, bp	 			Gene	14 576		142 415 318	9770.53	 		Augustus			4.85			 			Exon	70 733		14 736 668	208.34	 			Gene	10 946		46 119 402	4213.35	 	ab_initio	Geneid			2.25			 			Exon	24 686		3 925 563	159.01	 			Gene	27 754		273 745 951	9863.29	 		GeneMarks-ET			5.50			 			Exon	152 660		30 847 503	202.06	 			Gene	36 213		840 429 061	23 207.94	 	RNA-seq	Cufflinks Transdecoder			7.03			 			Exon	254 770		201 721 675	791.77	 	Known gene (NCBI lepidoptera)	PASA (gmap)		44 561		22 484 151	504.57	 	

Summary statistics for the consensus gene set of the A. yamamai genome

Element	No. of elements	Exon/gene	Avg. length	Total length	Genome coverage, %	 	Gene	15 481		11 016.34	170 543 958	25.78	 			5.64				 	Exon	87 346		1346.23	20 840 925	3.31	 	

Comparative genome analysis

We used OrthoMCL and Reciprocal Best Hit within BLASTP for identification of gene family clusters and 1:1 orthologous gene sets. Gene information of 7 taxa (A. aegypti, B. mori, D. plexippus, D. melanogaster, H. melpomene, M. cinxia, and P. xylostella), the same taxa used in repeat analysis, was employed for OrthoMCL with A. yamamai. A total of 17 406 gene family clusters were constructed, and 3586 1:1 orthologous genes were identified. Before conducting comparative genome analysis, we constructed phylogenetic trees for the 8 species. In order to build the phylogenetic tree, multiple sequence alignment for the 1:1 orthologous genes of all 8 species was conducted using PRANK, and Gblocks was used to obtain conserved blocks for the phylogenetic tree. Conserved block sequences were sequentially concatenated to obtain 1 consensus sequence for each species. MEGA was used for constructing Neighbor-Joining Trees (bootstrap 1000, maximum composite likelihood, transitions + transversions, and gamma distributed option), and MrBayes (MrBayes, RRID:SCR_012067) was employed for the construction of Bayesian inference trees. To select the best evolution model for our data, Modeltest was conducted and the GTR-based invariant model was chosen based on the AIC value of Modeltest. Figure 4 shows the constructed phylogenetic tree of the 8 species using 3586 orthologous genes. The bootstrap value and Bayesian poster probability value of all nodes were 100 and 1, respectively. The closest neighbor of A. yamamai was B. mori, which is included in Bombycidae family; this result is consistent with that of previous studies. Three butterfly species (D. plexippus, M. cinxia, and H. meplmene) included in Nymphalidae family were also shown to share a common ancestor with the families Saturniidae and Bombycidae.

Constructed phylogenetic tree and comparative gene family analysis. Node values indicate Bayesian posterior probability, bootstrap and gene expansion, and contraction value. Orange and blue colors indicate expansion and contraction, respectively. Bar chart indicates the number of genes cauterized into 4 groups (Specific, 1:Multi, Multi:Multi, and 1:1) using OrthoMCL.

Based on the constructed phylogenetic tree, gene family expansion and contraction analysis were conducted using a 2-parameter model in CAFÉ and the gene tree was constructed using protein sequence via MEGA. Figure 4 shows the result of gene family expansion and contraction analysis of 8 species; 938 and 1987 gene families of A. yamamai and 567 and 715 gene families of B. mori were estimated to be expanded and contracted from the common ancestors, respectively. Among these, 15 gene families in A. yamamai were estimated to be under rapid expansion during the evolution process. Functions of genes in the rapidly expanded gene families of A. yamamai were transposase, fatty acid synthase, zinc finger protein, chorion (eggshell protein), reverse transcriptase, prostaglandin dehydrogenase, RNA-directed DNA polymerase, gag like protein, juvenile hormone acid methyltransferase, facilitated trehalose transporter, and glucose dehydrogenase. Figure 5 shows the gene tree of 2 chorion gene (chorion class A and B) family clusters rapidly expanded in the A. yamamai genome. Chorion, called eggshell protein, composes the surface of the egg and protects the embryo from environmental threats such as desiccation, flooding, freezing, infection of microorganisms, and physical destruction. It also provides channels, such as aeropyle, that enable gas exchange and maintain proper conditions for diapause of the egg. These diverse functions of eggshell are implemented by the specific eggshell structure, and the surface structure of eggshell varies between species for the adaptation in a different environment. The ancestor of Antherea has the unique aeropyle structure called “aerophyle crown” on the eggshell surface. This unique structure is formed by the circular vertical projection of lamellar chorion from the follicle cell, and it surrounds the aeropyles near the end of oogenesis. Acquiring this kind of de novo complex structure requires numerous genetic changes, and a previous study about Antheraea polyphemus has shown that more than 100 chorion-specific polypeptides were involved in this unique ultra-structure. Therefore, the specific rapid expansion of the chorion class A and B gene family in the A. yamamai genome might be one of the convincing molecular explanations for acquiring this unique ultrastructure in the eggshell surface of Antheraea genus. However, this unique ultrastructure tends to be reduced during the current evolution process of the Antheraea genus. Types of eggshell structures in the Antheraea genus can be categorized into multiple classes based on the morphology and regional distribution of the aeropyle. The shape of the aeropyle in the A. yamamai egg is known to be converted to a mound shape from the crown shape, and these aeropyle mounds only exist in the narrow band surrounding the micropyle region. Only a very few, small aeropyle crowns remain, and it is entirely different from the ancestral form of eggshell surface mostly covered by aeropyle crowns. These regional differences were known to be adjusted by regional differences of filler genes during choriogenesis, and the additional regulations of related genes for choriogenesis have to be considered. This indicates that the specifically expanded chorion gene families of A. yamamai may be one of the remaining evolutionary tracks in the genome of the Antheraea genus. However, further functional studies must be conducted to resolve the limited understanding about the relationship between these expanded chorion gene families and the current eggshell surface formation of A. yamamai.

Expansion of chorion gene in the A. yamamai genome. A, B) The gene trees of chorion A and B in the rapid expanded gene family cluster, respectively. Color of terminal node indicates each taxon identified in the gene family cluster.

The constructed genome of A. yamamai presented here is the first announced genome in the family Saturniidae, and the karyotyping analysis using gamete in the metaphase showed that the genome of A. yamamai consists of 31 chromosomes (Fig. 6). This constructed genome information provides more insight into the genome evolution and phylogeny of the family Saturniidae, which contains the largest number of species in Lepidoptera. For example, although 2 silk moths, A. yamamai and B. mori, appear similar, comparative genome analysis showed the significant differences in the genome size and the specific expansion of repeat elements and gene families between the families Saturniidae and Bombycidae. In case of molecular phylogeny, most previous phylogenetic studies were limited to few genes due to the lack of genomic information on the family Saturniidae. We expect that our study and resulting constructed genome will resolve some limitations of molecular phylogenetic and ecological research on the Saturniidae species. Additionally, constructed genome information will help researchers better understand the molecular background of wild silk and its production. Silk produced by A. yamamai, referred to as tensan silk, shows unique characteristics that have made it valuable in various fields. However, A. yamamai has not been completely domesticated compared with B. mori, making mass production of tensan silk infeasible. Understanding of the molecular mechanisms behind the tensan silk production process is essential for mass production using biotechnology, and this genome sequence with manually curated gene information is a fundamental resource for related research and industrial improvement. Additionally, the transcriptome data of 10 different organ tissues with the 3 biological replications presented here may be also useful resources for uncovering the molecular mechanisms related to specific phenotypes of A. yamamai and the family Saturniidae.

Karyotype of A. yamamai using a gamete of testis in metaphase.

Availability of supporting data

The generated genome sequence and gene information of A. yamamai are available in GigaDB, and generated raw data are available under project accession PRJNA383008 and PRJNA383025 of the NCBI database.

Additional file

Additional file 1: Figure S1. Top 20 terms in each 7 Interproscan5 analysis (CDD, Pfam, PRINTS, ProSitePatterns, ProSiteProfiles, TIGRF, Gene Ontology).

Additional file 1: Supplementrary_information2.xlsx.

Competing interests

All authors report no competing interests.

Abbreviations

BUSCO: Benchmarking Universal Single-Copy Orthologs; MYA: million years ago.

Author contributions

Sampling: Kee-Young Kim, Su-Bae Kim; sequencing: Kwang-Ho Choi, Seong-Wan Kim genome assembly: Seong-Ryul Kim, Woori Kwak, Jae-Sam Hwang, Seung-Won Park repeat element analysis: Seong-Ryul Kim, Woori Kwak, Seung-Won Park gene prediction: Seong-Ryul Kim, Woori Kwak, Hyaekang Kim, Jae-Sam Hwang comparative genome analysis: Seong-Ryul Kim, Woori Kwak, Min-Jae Kim, Kelsey Caetano-Anolles funding and experimental design: Seong-Ryul Kim, Seung-Won Park.

Supplementary Material