PMC:7299288 JSONTXT

Annnotations TAB JSON ListView MergeView

    2_test

    {"project":"2_test","denotations":[{"id":"30590436-19667210-16856599","span":{"begin":2077,"end":2078},"obj":"19667210"},{"id":"30590436-25090999-16856600","span":{"begin":2133,"end":2134},"obj":"25090999"},{"id":"30590436-11006331-16856601","span":{"begin":2547,"end":2548},"obj":"11006331"},{"id":"30590436-16304599-16856602","span":{"begin":3039,"end":3041},"obj":"16304599"},{"id":"30590436-28096351-16856603","span":{"begin":3273,"end":3275},"obj":"28096351"},{"id":"30590436-28284542-16856604","span":{"begin":3354,"end":3356},"obj":"28284542"},{"id":"30590436-28985356-16856605","span":{"begin":3358,"end":3360},"obj":"28985356"},{"id":"30590436-19808877-16856606","span":{"begin":3978,"end":3980},"obj":"19808877"},{"id":"30590436-30213855-16856607","span":{"begin":4143,"end":4145},"obj":"30213855"},{"id":"30590436-30185430-16856608","span":{"begin":4147,"end":4149},"obj":"30185430"},{"id":"30590436-25951770-16856609","span":{"begin":4216,"end":4218},"obj":"25951770"},{"id":"30590436-23104886-16856610","span":{"begin":5158,"end":5160},"obj":"23104886"},{"id":"30590436-19289445-16856611","span":{"begin":5171,"end":5173},"obj":"19289445"},{"id":"30590436-23413433-16856612","span":{"begin":5182,"end":5184},"obj":"23413433"},{"id":"30590436-25994148-16856613","span":{"begin":5186,"end":5188},"obj":"25994148"},{"id":"30590436-20147302-16856614","span":{"begin":5201,"end":5203},"obj":"20147302"},{"id":"30590436-27043002-16856615","span":{"begin":5248,"end":5250},"obj":"27043002"},{"id":"30590436-25146293-16856616","span":{"begin":5327,"end":5329},"obj":"25146293"},{"id":"30590436-24423873-16856617","span":{"begin":5435,"end":5436},"obj":"24423873"},{"id":"30590436-25951770-16856618","span":{"begin":5451,"end":5453},"obj":"25951770"},{"id":"30590436-23450226-16856619","span":{"begin":5579,"end":5581},"obj":"23450226"},{"id":"30590436-29697369-16856620","span":{"begin":5597,"end":5599},"obj":"29697369"},{"id":"30590436-24227677-16856621","span":{"begin":6687,"end":6689},"obj":"24227677"},{"id":"30590436-19754506-16856622","span":{"begin":8517,"end":8519},"obj":"19754506"},{"id":"30590436-30254374-16856623","span":{"begin":8593,"end":8594},"obj":"30254374"},{"id":"30590436-29032409-16856624","span":{"begin":8711,"end":8713},"obj":"29032409"},{"id":"30590436-27671113-16856625","span":{"begin":8886,"end":8888},"obj":"27671113"},{"id":"30590436-30254374-16856626","span":{"begin":8945,"end":8946},"obj":"30254374"},{"id":"30590436-26656231-16856627","span":{"begin":9164,"end":9166},"obj":"26656231"},{"id":"30590436-27413047-16856628","span":{"begin":9623,"end":9625},"obj":"27413047"},{"id":"30590436-24423873-16856629","span":{"begin":9954,"end":9955},"obj":"24423873"},{"id":"30590436-20149088-16856630","span":{"begin":11303,"end":11305},"obj":"20149088"},{"id":"30590436-26813401-16856631","span":{"begin":13975,"end":13977},"obj":"26813401"},{"id":"30590436-23104886-16856632","span":{"begin":14080,"end":14082},"obj":"23104886"},{"id":"30590436-23413433-16856633","span":{"begin":14101,"end":14103},"obj":"23413433"},{"id":"30590436-25994148-16856634","span":{"begin":14105,"end":14107},"obj":"25994148"},{"id":"30590436-24227677-16856635","span":{"begin":14156,"end":14158},"obj":"24227677"},{"id":"30590436-27043002-16856636","span":{"begin":15604,"end":15606},"obj":"27043002"},{"id":"30590436-24423873-16856637","span":{"begin":16377,"end":16378},"obj":"24423873"},{"id":"30590436-29697369-16856638","span":{"begin":16813,"end":16815},"obj":"29697369"},{"id":"30590436-25516281-16856639","span":{"begin":21407,"end":21409},"obj":"25516281"},{"id":"30590436-24423873-16856640","span":{"begin":26937,"end":26938},"obj":"24423873"},{"id":"30590436-22199392-16856641","span":{"begin":28442,"end":28444},"obj":"22199392"},{"id":"30590436-27413047-16856642","span":{"begin":30858,"end":30860},"obj":"27413047"},{"id":"30590436-11006331-16856643","span":{"begin":47945,"end":47946},"obj":"11006331"},{"id":"30590436-25976240-16856644","span":{"begin":48374,"end":48376},"obj":"25976240"},{"id":"30590436-26594309-16856645","span":{"begin":48378,"end":48380},"obj":"26594309"},{"id":"30590436-22124482-16856646","span":{"begin":48514,"end":48516},"obj":"22124482"},{"id":"30590436-22021898-16856647","span":{"begin":48518,"end":48520},"obj":"22021898"},{"id":"30590436-25240000-16856648","span":{"begin":48570,"end":48572},"obj":"25240000"},{"id":"30590436-28484260-16856649","span":{"begin":49123,"end":49125},"obj":"28484260"},{"id":"30590436-25421600-16856650","span":{"begin":49694,"end":49696},"obj":"25421600"}],"text":"Homeolog expression quantification methods for allopolyploids \n\nAbstract\nAbstract\nGenome duplication with hybridization, or allopolyploidization, occurs in animals, fungi and plants, and is especially common in crop plants. There is an increasing interest in the study of allopolyploids because of advances in polyploid genome assembly; however, the high level of sequence similarity in duplicated gene copies (homeologs) poses many challenges. Here we compared standard RNA-seq expression quantification approaches used currently for diploid species against subgenome-classification approaches which maps reads to each subgenome separately. We examined mapping error using our previous and new RNA-seq data in which a subgenome is experimentally added (synthetic allotetraploid Arabidopsis kamchatica) or reduced (allohexaploid wheat Triticum aestivum versus extracted allotetraploid) as ground truth. The error rates in the two species were very similar. The standard approaches showed higher error rates (\u003e10% using pseudo-alignment with Kallisto) while subgenome-classification approaches showed much lower error rates (\u003c1% using EAGLE-RC, \u003c2% using HomeoRoq). Although downstream analysis may partly mitigate mapping errors, the difference in methods was substantial in hexaploid wheat, where Kallisto appeared to have systematic differences relative to other methods. Only approximately half of the differentially expressed homeologs detected using Kallisto overlapped with those by any other method in wheat. In general, disagreement in low-expression genes was responsible for most of the discordance between methods, which is consistent with known biases in Kallisto. We also observed that there exist uncertainties in genome sequences and annotation which can affect each method differently. Overall, subgenome-classification approaches tend to perform better than standard approaches with EAGLE-RC having the highest precision.\n\nIntroduction\nGenome duplication, termed polyploidization, is widespread in plants with up to 35% of land plants being recent polyploids [1]. Many crop species in particular are allopolyploids [2], which involves the hybridization of two different species with genome duplication. Thus, there is much interest in the study of genome duplication and the advantages or disadvantages this phenomenon may convey. To explore the underlying mechanisms that may provide the raw genetic material for adaptation, many gene expression studies have been conducted on both natural and synthetic allopolyploid species [3–8]. Allopolyploid species have traditionally been difficult to analyze at the whole genome scale because of the large size of their genomes and the high levels of sequence similarity between duplicated chromosomes. These duplicated gene copies, called homeologs, are in general highly similar and pose challenges to gene expression analyses. However, these homeologs and bias in their expression are of great interest because they potentially contribute to adaptation in polyploid species [9–11].\nRecent improvements in long sequencing read technologies and assembly strategies [12–14] have allowed for breakthroughs in polyploid genome assembly. In plant biology especially, recent allopolyploid species such as bread wheat [15] as well as many other agriculturally important plant species have benefited [16, 17]. Now that de novo genome assembly for allopolyploids is no longer as formidable as it once was, a large number of polyploid reference genomes are expected to become available in the near future to facilitate genome-wide studies. Accordingly, there is a need to evaluate expression quantification methods given the presence of homeologs in allopolyploids.\nHigh levels of sequence similarity pose many challenges to read mapping and consequently, expression quantification as well as other types of sequence analysis. Mapping bias was shown to be an issue in numerous studies on diploid allele specific expression [18–21], which is analogous to the analysis of homozygous tetraploids. In polyploid studies, sequence ambiguity between homeologs has been shown to affect read mapping [22, 23], especially if the assembly or annotation quality is asymmetric [24].\nIn this study we pose the question ‘How do subgenome-classification approaches perform in comparison to traditional RNA-seq methods in allopolyploids?’ and evaluate different approaches and methods for homeolog expression quantification in allopolyploids. To evaluate methods in polyploids, analysis on genetic materials with and without subgenomes is highly valuable as a form of ground truth. Here we used synthetic allotetraploid Arabidopsis kamchatica (Fisch. ex DC.) K. Shimizu \u0026 Kudoh and performed tests with its two direct parental accessions of Arabidopsis halleri and Arabidopsis lyrata. For hexaploid wheat Triticum aestivum Chinese Spring [25] with subgenomes ABD, we performed tests with tetra-Chinese Spring (subgenomes AB) where subgenome D was experimentally removed [26] as well as Aegilops tauschii the diploid progenitor of wheat subgenome D.\nAlthough genome-alignment tools designed for diploid species such as STAR [27], TopHat [28], LAST [29, 30] and GSNAP [31] or pseudo-alignment-based method Kallisto [32] have often been used for RNA-seq analysis in allopolyploid species [25, 33–37], bioinformatic algorithms specific for allopolyploid species have been developed. Among them, HomeoRoq [7] and PolyDog [24] represent similar subgenome-classification approaches, in which alignment quality to each subgenome is considered. PolyCat [38] and EAGLE-RC [39] identify and utilize explicit genotype differences between subgenomes. Among them, PolyCat requires picking one subgenome as the reference and constructing a homeolog SNP-index as opposed to read alignment to all subgenomes, while EAGLE-RC identifies and utilizes explicit genotype differences between subgenomes to calculate the read sequence likelihood for each subgenome alignment. HyLiTE [40] requires sequencing data from all parental species as well as the child polyploid species; thus, it was not suitable for the analysis of hexaploid wheat as the subgenome B progenitor is still unknown. A systematic comparison between subgenome-classification methods would be necessary to evaluate them, though there are commonalities in their approach, and is outside the scope of this review.\nWe test, to the best of our knowledge, all general approaches to quantify expression in polyploids: (i) A standard genome-alignment-based RNA-seq analysis on the full allopolyploid reference genome with two different alignment tools STAR and LAST, with read counts obtained using featureCounts [41].(ii) A pseudo-alignment-based method with Kallisto on the full allopolyploid transcriptome.(iii) A subgenome-classification approach with HomeoRoq, which maps read sequences to each subgenome separately, with read counts obtained using featureCounts.(iv) A subgenome-classification approach with EAGLE-RC, which maps read sequences to each subgenome separately and also explicitly uses genotype variations that discriminate between homeologs as constraints in analysis, with read counts obtained using featureCounts.\nOur results show that EAGLE-RC had the lowest error rate (A. kamchatica, 0.40%; hexaploid wheat, 0.49%) for alignments to the correct subgenome, while Kallisto had the highest error rate (A. kamchatica, 12.43%; hexaploid wheat, 13.44%). LAST and STAR had similar error rates in A. kamchatica, but LAST was more precise in hexaploid wheat. In general, performance between methods was comparable for A. kamchatica but not for hexaploid wheat. We also observed systematic differences in low-expression genes that impacted the homeolog expression bias results. Other concerns include uncertainty in the completeness and accuracy of the genome sequence and annotation, which affected each method differently due to differences in constraints. This may be especially relevant as polyploid species have only recently begun to be sequenced and assembled in large numbers and the gene annotations are in their 1st iterations. In the face of this uncertainty, EAGLE-RC is the most precise in our evaluations.\n\nMethods\nWe evaluated methods on two allopolyploid species, tetraploid A. kamchatica and hexaploid wheat T. aestivum. All codes used in this study are described in Supplemental Methods, given as a series of shell script commands.\n\nRNA sequence data and reference genomes\nThe natural species A. kamchatica [42, 43] was derived from two diploid species A. halleri and A. lyrata recently [8, 44]. It is a model polyploid with a broad distribution range, self-compatibility and transformation technique [44, 45]. To construct synthetic polyploids, we used two highly homozygous parental accessions used for genome assembly: A. halleri Tada mine W302 (ver 2.2, scaffolds N50 712 kb) [46] and A. lyrata lyrpet4 (ver 2.2, scaffolds N50 1.2 Mb) [8]. The two genotypes were crossed then the genome doubling was induced by colchicine treatment. Although synthetic polyploidization may occasionally activate transposable elements or induce chromosomal rearrangements [47], the subgenomes of the synthetic polyploid were derived from the merging of the two parental genomes and are highly or completely identical, providing a unique opportunity to evaluate RNA-seq methods in allopolyploids.\nTo assess classification accuracy in synthetic A. kamchatica, we used data from the parental species so that the ground truth of a read’s subgenome origin is known. A. halleri subsp. gemmifera and A. lyrata subsp. patraea RNA-seq data [48] were obtained under DDBJ accession DRP003263 submission DRA004364. Briefly, this data set consists of four samples each for A. halleri and A. lyrata with 2 x 100 bp paired-end reads for a total of ~20 and ~21 Gb, respectively. For differential homeolog expression analysis, synthetic allotetraploid A. kamchatica RNA-seq data [7] was obtained under DDBJ accession DRP01140. Briefly, this data set consists of three biological replicates of A. kamchatica before and after cold stress with 2 x 100 bp paired-end reads for a total of ~12 and ~10 Gb, respectively.\nThe whole genome assembly and annotation for hexaploid wheat T. aestivum with subgenomes ABD were obtained from the International Wheat Genome Sequencing Consortium [25] (assembly ver 1.0 and annotation ver 1.0, N50 22.8 Mb). The assembly quality is at the chromosome level and the reference genome was split into subgenomes A, B and D allowing for separate read mapping in subgenome-classification methods.\nTo assess classification accuracy in wheat, we generated new tetra-Chinese Spring and A. tauschii data to use in our evaluations. We utilized tetra-Chinese Spring in which the subgenomes AB were extracted by removing subgenome D by repeated backcrossing [26]. Thus, the genome sequence of tetra-Chinese Spring must be very close to the subgenomes AB of hexaploid Chinese Spring. Because there is no genetic material of extracted D genomes from hexaploid wheat, we used A. tauschii KU-2076 (resource in Kyoto University, collected in Iran). Subgenome D of hexaploid wheat is known to be derived from this species, thus it should be highly similar though there is divergence due to within-species variations [49]. We constructed RNA-seq data sets of tetra-Chinese Spring and A. tauschii in triplicate for a total of 2.1 and 2.8 Gb, respectively, under DDBJ run accessions DRR155268–DRR155273. For differential homeolog expression analysis, T. aestivum RNA-seq data were obtained from National Center for Biotechnology Information (BioProject PRJEB12358) with SRA accessions ERR1201752-ERR1201754 and ERR1201770-ERR1201772 describing samples, in triplicate, 24 h after inoculation of fungal pathogen Fusarium graminearum and mock inoculation for a total of 17.3 Gb and 16.1 GB, respectively.\n\nPlant growth, RNA isolation and sequencing\nTetra-Chinese Spring and A. tauschii were grown at 16°C in 8 h light/16 h dark cycle with 60% relative humidity for 2 weeks, and leaf tissues were harvested. RNA was extracted from each tissue using RNeasy Plant Mini Kits (QIAGEN, Hilden, Germany) in combination with DNase I treatment (QIAGEN). Illumina sequencing libraries were made by TruSeq Stranded mRNA Library Prep Kit. RNA-seq was conducted using Illumina HiSeq 4000 at the Functional Genomics Centre, Zurich.\n\nHomeolog identification\nTo annotate homeologs in A. kamchatica, we constructed RNA transcripts from the gene models in the A. halleri and A. lyrata gene annotations using gffread (v0.9.12). Homeologs were then identified based on reciprocal best hit for each subgenome’s transcripts using LAST (v809). We required hits to have an E-value less than 10−10 with at least 200 aligned bases in both transcripts, resulting in 24329 homeolog pairs identified. This method of identifying homeologs is simple but efficient and also allows for the enumeration of genotype differences between subgenomes, a necessary resource for EAGLE-RC.\nTo annotate homeologs in T. aestivum, we constructed RNA transcripts from high-confidence gene models belonging to subgenomes A, B and D, including the UTR regions. We then identified pair-wise homeologs through reciprocal best hit for combinations AB, AD and BD (26556, 27360 and 27311, respectively) and then triple copy homeologs by checking for genes in AB that share the same hit for D in their respective AD and BD hits, resulting in 21196 triple copy homeologs identified. Though these gene models were not used in this study, for reference, we discovered 3977, 4781 and 4814 two-copy homeologs in AB, AD and BD, respectively. For singletons, there were 13 459, 14 093 and 11 551 out of 44 796, 45 463 and 43 754 total gene models in subgenomes A, B and D, respectively.\n\nStandard RNA-seq analysis\nWe tested a standard genome-alignment-based RNA-seq expression quantification approach (Figure 1A) aimed at differential expression analysis [50] that is often used for diploids: (i) Map reads to the allopolyploid reference genome—STAR (v2.5.2b) [27] and LAST (v809) [29, 30].(ii) Count reads using featureCounts (v1.5.1) [41] at the transcript level.(iii) Extract homeolog specific read counts.\nFigure 1 We compare ‘four different’ approaches for quantifying homeolog expression: (a) a standard genome alignment-based RNA-seq analysis on an allopolyploid (concatenated) reference genome using alignment tools STAR and LAST. (b) A pseudo-alignment workflow using Kallisto that is performed on the (concatenated) transcripts of the allopolyploid. (c) A subgenome-classification analysis ‘using HomeoRoq’ that performs alignment on each subgenome’s reference separately then performs read classification based on number of mismatches. (d) A subgenome-classification analysis ‘using EAGLE-RC’ that performs read alignment on each subgenome’s reference separately then performs read classification based on the likelihood of the read to the genotype, discarding common reads. Other miscellaneous sequence-processing tasks such as indexing and sorting were done using samtools (v1.5.2).\nTo construct the reference genome of A. kamchatica, we concatenated the reference genomes of its two parental species, A. halleri and A. lyrata, to obtain an allopolyploid reference. For LAST, we also filtered out read alignments with MAPQ scores less than 20, while STAR MAPQ scores are not as useful for thresholding. In T. aestivum, we excluded all reads that mapped to chrUn.\nWe tested a pseudo-alignment RNA expression quantification workflow (Figure 1B) using Kallisto (v0.43.1) [32], which is also often used for diploids. The built-in expression quantification in Kallisto was used to count reads (est_counts) at the transcript level. When evaluating classification accuracy, we used the pseudobam option in Kallisto to output read assignments and evaluate the proportion which were misassigned.\n\nSubgenome-classification analysis\nWe tested a subgenome-classification approach where, in contrast to the standard approach, the sequencing reads are mapped separately to each subgenome of an allopolyploid’s reference genome (Figures 1C and D) using STAR. Then we utilized a read classifier to assign reads to their subgenome origin, if possible. We tested read classifiers methods HomeoRoq (last updated 11 August 2014) and EAGLE-RC (v1.0.0).\nHomeoRoq [7] (Figure 1C) classifies reads based on the number of mismatches, up to a maximum of 10, between the read and the genome sequence of each subgenome, where reads must be mappable to both subgenomes to be considered. In contrast with EAGLE-RC, which requires computing the subgenome-discriminating variants, HomeoRoq does not require comparative analysis between different subgenomes in advance of classification.\nThe basic EAGLE model [39] is a generative model for read sequences used to calculate the likelihood of a read given a reference genotype hypothesis and an alternative genotype hypothesis. In EAGLE-RC (Figure 1D), the basic model was extended to perform read classification using the variants that discriminate between homeologs. During homeolog identification, subgenome-specific genotype differences (i.e. variants) that discriminate between homeologs are determined. For A. kamchatica, reads are mapped to the reference genomes of A. halleri (H), and A. lyrata (L), separately. EAGLE-RC then calculates the probability for read r given each subgenome as the reference hypothesis \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}${G}_{\\mathrm{ref}}$\\end{document} and the other subgenome as the alternative hypothesis \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}${G}_{\\mathrm{alt}}$\\end{document}: \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$$P\\left[r\\in {G}_{\\mathrm{ref}}\\right]=\\frac{P\\left[r\\ \\right|{G}_{\\mathrm{ref}}]}{P\\left[r\\ \\right|{G}_{\\mathrm{ref}}\\left]+P\\left[r\\ \\right|{G}_{\\mathrm{alt}}\\right]},$$\\end{document} where the classification is determined by the reference with the highest probability, requiring the winning hypothesis to be at least probability 0.95 with marginal probability at least 0.51; otherwise it is ‘unknown’, where the marginal probability is the proportion of the winning hypothesis over the sum of the probabilities for all subgenome reference hypotheses.\nFor hexaploid T. aestivum subgenome classification, we performed a bottom-up workflow using a series of pair-wise classifications (Figure 2). For HomeoRoq, we determined, if possible, the consensus classification from pair-wise classifications. For example, a read is classified as A if there is a consensus A classification in both AB and AD pair-wise classification comparisons.\nFigure 2 We perform read classification on hexaploid T. aestivum using a bottom-up approach from a series of pair-wise classifications with subgenomes A, B and D. A final classification from pair-wise analysis is obtained via consensus for HomeoRoq and via highest probability for EAGLE-RC. For hexaploid subgenome classification with EAGLE-RC, the pair-wise likelihoods per read were calculated, where for each reference hypothesis \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}${G}_{\\mathrm{ref}}$\\end{document}, there are two alternative genome hypotheses, \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}${G}_{\\mathrm{alt}1}$\\end{document} and \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}${G}_{\\mathrm{alt}2}$\\end{document}. The probability of a read belonging to a given reference is then as follows: \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$$P\\left[r\\in {G}_{\\mathrm{ref}}\\right]=\\frac{P\\left[r\\ \\right|{G}_{\\mathrm{ref}}]}{P\\left[r\\ \\right|{G}_{\\mathrm{ref}}\\left]+P\\left[r\\ \\right|{G}_{\\mathrm{alt}1}\\right]+P\\left[r\\ \\right|{G}_{\\mathrm{alt}2}]},$$\\end{document} where the classification is determined by the reference with the highest probability, requiring the winning hypothesis to be at least probability 0.95 with marginal probability at least 0.51, otherwise it is ‘unknown’.\n\nDifferentially expressed homeologs\nTo identify differentially expressed homeologs (DEHs) in A. kamchatica, we used DESeq2 (v1.20.0) [51] in the R environment (v3.5.1) on the read count data for the A. halleri-derived subgenome and A. lyrata-derived subgenome separately. We required a homeolog to be differentially expressed with 0.05 or better false discovery rate (Benjamini–Hochberg FDR) in at least one of the subgenomes. We then performed Fisher’s exact test in R on the combined read counts to determine significant differential homeolog expression ratio (P \u003c 0.05 and fold change ≥ 2). This definition of differential expression requires significant change in gene expression in at least one subgenome according to DESeq2 along with significant change in the homeolog expression ratio between subgenomes according to Fisher’s exact test, under different conditions. There may be a multiple-testing issue with the Fisher’s exact tests for selected homeologs but we did not perform further FDR corrections.\nTo find DEH in hexaploid T. aestivum, we used DESeq2 on the read count data for each of subgenomes A, B and D separately, requiring a gene to be differentially expressed with 0.05 or better FDR in at least one of the subgenomes across three pair-wise tests. We then performed a series of Fisher’s exact tests on the combined read counts (A versus BD, B versus AD and D versus AB) to test for differential homeolog expression ratio. A homeolog is differentially expressed if at least one test has P \u003c 0.05/3 (Bonferroni correction) and fold change ≥ 2. Similar to the tetraploid analysis, we did not perform further FDR corrections.\nTable 1 Classification performance for A. kamchatica. Results are averaged across eight samples, four each for A. halleri and A. lyrata. The percent mapped refers to the number of reads that were mappable and in the case of subgenome classification, the number of reads mappable to any subgenome. The classification error refers to the proportion of mapped reads which were assigned to the wrong subgenome. The number of expressed homeologs with more than one read in any sample is also shown for each subgenome as a percentage of the set of 24 329 homeologs\n\nResults\n\nA. kamchatica read classification\nGenerally, many types of sequence analysis rely on how accurately read sequences can be mapped to a reference. In the case of allopolyploids, read alignment must deal with a higher degree of repetitiveness than in diploids due to homeologs.\nWe tested the standard genome alignment approach using the widely used RNA-seq read mapping tool, STAR. Though this tool is often used in diploid read mapping, it is unclear whether it will be suitable for allopolyploids. Unfortunately, the mapping quality score from STAR is not suitable for thresholding ``uniqueness” due to how it assigns scores in an almost binary manner. Thus, we also tested LAST, a general sequence alignment tool that can estimate the probability that an alignment represents the genomic source of the read. For example, a given read aligns to a single location with no mismatches, but aligns to five other locations with one mismatch each. This read may be deemed a unique best hit, but there may be a reasonable probability, depending on read length, that it came from any of the other alignments with a single base-calling error. LAST allows us to set a degree of uniqueness (i.e. 0.05 mismap probability) as a cut-off threshold that is convenient for handling this type of uncertainty.\nWe examined tetraploid A. kamchatica first because it may be less complicated than hexaploid wheat, which we describe later in this study. Here, we evaluated the accuracy of each tool and each approach by how well they assigned reads to the correct subgenome. For A. kamchatica, the ground truth is known by testing with pure A. halleri and A. lyrata RNA-seq data. Table 1 shows the classification performance of each approach for A. kamchatica. One point to note is that the mapping rate is affected by alignments to non-homeologs, thus we also showed the number of homeologs with detected expression in our evaluations.\nWe calculated the precision by using classification error rate as the criterion. It is clear that subgenome-classification approaches (HomeoRoq and EAGLE-RC) performed better than the standard alignment-based approaches (STAR, LAST and Kallisto) using a concatenated genome or pseudo-alignment on a concatenated transcriptome. In the standard alignment-based approach using a concatenated reference genome, LAST’s mismap probability model was seen to be beneficial to read classification showing a much lower classification error rate than STAR, with only a slightly lower number of expressed homeologs detected.\nKallisto showed the lowest precision among all methods, though it showed the highest number of expressed homeologs detected. That Kallisto’s performance was the most affected by the presence of homeologs is perhaps due to a reduction in the number of unique kmers, relative to diploid analysis, which is essential for the method to find unique read-to-gene associations. To quantify the reduction in the number of unique kmers in a tetraploid reference relative to diploid reference, we performed a simulation with 1000 trials of 100 randomly selected A. halleri genes with randomly assigned SNPs at varying degrees of sequence divergence (Supplementary data, Table S1). This analysis shows that there is a large reduction in the number of unique kmers given one extra gene copy depending on the pair-wise sequence divergence. For a point of reference, A. kamchatica is estimated to have ~2–3% divergence between homeologs [7].\nEAGLE-RC showed the highest precision in read classification though it had a fewer expressed homeologs detected. HomeoRoq was less precise than EAGLE-RC while having a similar number of expressed homeologs detected. The main difference in methodology between EAGLE-RC and HomeoRoq is that EAGLE-RC utilizes genotype information explicitly while HomeoRoq relies on comparing the number of mismatches, implicitly comparing genotype differences. However, simply comparing the number of mismatches is susceptible to spurious mappings, because a read alignment with 9 versus 10 mismatches favors the one with 9 mismatches to the same degree as it would favor an alignment with 0 versus 1 mismatch.\nAnother point to consider is that the quality of the reference may differ between subgenomes and there may be uncertainty from missing regions or erroneous annotations. HomeoRoq requires reads to be mappable to both references, which constrains read counting to genome regions that exist in both subgenomes. However, this comes with the disadvantage that more divergent regions may be excluded due to reads being mappable to only one homeolog. EAGLE-RC requires reads to be mapped to regions that are different between homeologs and thus can be classified based on those differences. This constrains analysis to regions in the homeologs’ gene models that can be pair-wise aligned to compute the genotype differences between homeologs. We performed a simulation analysis where reads were simulated using ART [52] from annotated gene models with no divergence from the reference and calculated the classification performance (Supplementary data, Table S2). The results show that even in this ideal scenario, there were reads which could not be classified with certainty by LAST, HomeoRoq and EAGLE-RC due to reads mapping equally well to both subgenomes. It also shows that in ideal conditions, STAR and Kallisto, despite higher error rates, were excellent in their true positive rates. However, the ideal condition of data with no divergence from the reference, no sequences outside of annotated genes, and perfectly reflective of gene models is not realistic in practice.\n\nA. kamchatica homeolog expression quantification\nRead mapping is the 1st step in many types of sequence analysis and is the foundation for many methods that rely on processing read alignments. However, read-counting methods and differential expression analysis may potentially be able to correct for artifacts or ambiguity in read alignments. We revisited classification error using the quantified read counts for homeologs in A. kamchatica (Table 2), although we suggest some caution in the interpretation of the number of reads quantified and thus the classification error rate in this result as HomeoRoq and EAGLE-RC have additional expression quantification processes to count subgenome-common reads. For Kallisto, we used the estimated read count from its output while for all other methods, we obtained read counts using featureCounts.\nTable 2 Error rate for A. kamchatica using quantified read counts, averaged across three samples each for A. halleri (H) and A. lyrata (L) Here, error is represented as a proportion of quantified reads rather than the proportion of mappable reads in the earlier analysis. Our results show that error decreased in STAR, LAST and Kallisto as the quantification process accounted for ambiguously mapped reads in some fashion (dropped by featureCounts by default and distributed by Kallisto). HomeoRoq’s and EAGLE-RC’s error rates increased slightly due to these methods’ discarding reads that are deemed unclassifiable. However, a large number of unclassifiable reads are due to equivalent alignments to both subgenomes, deemed subgenome common, which can be used to estimate expression levels such as RPKM by distributing proportionally to each subgenome. This has been demonstrated previously for HomeoRoq [48] and is also applicable to EAGLE-RC. Thus, the number of reads quantified for subgenome-classification approaches in Table 2 may not be directly comparable to standard genome alignment approaches. It turns out that for tetraploid A. kamchatica, all methods had comparable error rates, where EAGLE-RC was the most precise.\nNext, we examined the ratio of homeologous pairs. We are interested in any shifts in expression between homeologs across conditions, so we analyzed RNA-seq data from A. kamchatica and quantified the A. halleri read count proportion \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$\\widehat{p}$\\end{document} as follows: \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$$\\widehat{p}=\\frac{\\mathit{\\mathsf{A}}.\\mathit{\\mathsf{halleri}}\\ \\mathrm{reads}}{\\mathrm{A}.\\mathrm{halleri}\\ \\mathrm{reads}+\\mathit{\\mathsf{A}}.\\mathit{\\mathsf{lyrata}}\\ \\mathrm{reads}}$$\\end{document}\nWhile we do not know the ground truth expression levels, we can evaluate the concordance between the different methods to describe how results might differ depending on which approach was used. To compare the results between different methods, we calculated the pair-wise root–mean–squared distance (RMSD) and coefficient of determination (r2) between the results for all methods. The RMSD is a measure of the average distance between two sets X and Y: \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$$\\mathrm{RMSD}=\\sqrt{\\frac{\\sum_{i=1}^n{\\left({x}_i-{y}_i\\right)}^2}{n}}$$\\end{document}\nThe r2 describes how well one variable X can be used to predict another variable Y by calculating the proportion of variability that can be explained: \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$${r}^2=\\frac{\\Big({\\sum}_{i=1}^n\\Big(\\big({x}_i-\\overline{x}\\big)\\left({y}_i-\\overline{y}\\right)\\Big){}^2}{\\sum_{i=1}^n{\\left({x}_i-\\overline{x}\\right)}^2{\\sum}_{i=1}^n{\\left({y}_i-\\overline{y}\\right)}^2}$$\\end{document}\nThough both measures describe the similarity between the results of different methods, the RMSD quantifies the magnitude of difference with units while the r2 quantifies the proportion of similar elements.\nWe examined the ratio of homeologous pairs using RNA-seq data from A. kamchatica samples before and after cold stress. The r2 (Table 3) and RMSD (Supplementary data, Table S3) for \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$\\widehat{p}$\\end{document} show that in general, the results from the standard genome-alignment-based approach were concordant (average r2 = 0.9505, RMSD of 6.96%). HomeoRoq was also concordant (average r2 = 0.9084) with STAR and LAST. This is consistent with the error rates in the classification results.\nTable 3 r 2 of the proportion of reads derived from the A. halleri subgenome between different quantification approaches for homeologs in A. kamchatica The r2 values between methods showed that though read mapping precision can vary greatly, downstream read counting methods were somewhat able to correct for artifacts or ambiguity in read alignment to arrive at a similar quantified expression. Kallisto was discordant from all other methods with an average RMSD of 18.88%. This may be due to its built-in read-counting method using expectation–maximization (EM) whereas all other methods used featureCounts, which does not consider multi-mapped reads. There were also issues with low-expression genes where 420 and 447 A. halleri and A. lyrata homeologs, respectively, were reported to have near-zero expression (\u003c10 reads) by all other methods in all samples but Kallisto reported more than 100 reads. The homeolog expression ratio scatter plots reflect this potential artifact (Supplementary data, Figure S1) where a slight clustering at the edges of the plots can be seen for Kallisto. Though the cause is unclear, Kallisto tends to overestimate some low-expression genes compared to other methods.\nEAGLE-RC also showed less concordance in r2 and RMSD than HomeoRoq to standard genome-alignment approaches, where it reported some high-expression genes as low expression compared to other methods. More detailed examination revealed that because EAGLE-RC required reads to cross genotype differences between homeologs, it excluded reads in exons which were not pair-wise aligned using LAST during the homeolog identification process. There may exist uncertainty in the annotation, as there are 552 homeologs that had at least a 40% difference in the proportion of the gene model aligned between A. halleri and A. lyrata. The majority of these cases were A. halleri homeologs having a smaller proportion of alignment than A. lyrata homeologs due to being much longer. The exclusion of regions which were not aligned between homeologs accounted for a large portion of the difference in EAGLE-RC. For example, the homeolog annotated as AT4G25110 in A. kamchatica with 12 exons in A. halleri and 5 exons in A. lyrata and an aligned region between homeologs of 46 and 61% in A. halleri and A. lyrata, respectively, had almost all (99.89%) of the reads assigned to A. halleri mapped to regions that were not aligned between homeologs (Supplementary data, Figure S2), leading to a large difference between EAGLE-RC and other methods (Supplementary data, Table S4).\nThe subgenome-classification approach with HomeoRoq or EAGLE-RC tries to account for potentially missing reference genome regions in one subgenome, through consideration of read mappability or explicitly utilizing genotype variation, respectively. For example, the homeolog annotated as AT5G45850 in A. kamchatica had reads that mapped to the A. lyrata reference that did not map at all to the A. halleri reference and was thus discarded by HomeoRoq. Similarly, missing reference genome regions in homeologs cannot be pair-wise aligned, thus EAGLE-RC discarded reads that mapped to these regions. In other methods, a naive counting of reads could be reference biased (Supplementary data, Table S5). In this case, the average \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$\\widehat{p}$\\end{document} for a sample with the subgenome-classification approach was 0.90 compared to 0.44 with STAR, LAST and Kallisto.\nNext, we examined DEH using RNA-seq data from A. kamchatica samples before and after cold stress. The results of different methods (Figure 3) showed that, in general, there was a high overlap between the different methods tested where Kallisto was the least concordant among the methods tested. This is consistent with the results when we examined the ratio of homeologous pairs.\nFigure 3 Overlap of differentially expressed homeologs from different methods in A. kamchatica. (a) The overlap between traditional methods. (b-d) The overlap between subgenome-classification methods HomeoRoq and EAGLE-RC with each traditional method. Intuitively, genes with lower expression are expected to be more affected by uncertainties due to read count differences having more of an effect on the proportion (Table 4). Indeed, our results show that the concordance between methods was lower for genes with lower expression with a significant increase (paired t-test P-value, 6.767 × 10−06) in RMSD. This may be due to a higher proportion of reads being ambiguous and each method’s difference in handling these ambiguously mapped reads. This may have large implications if the goal is to determine homeolog expression bias.\nTable 4 RMSD of the proportion of reads derived from the A. halleri subgenome for homeologs with read counts of fewer than or equal to 100 (lower triangular matrix) versus more than 200 (upper triangular matrix) in A. kamchatica\nTable 5 Classification performance for T. aestivum. Results are averaged across three samples for each of diploid A. tauschii (D) and tetraploid Chinese Spring (AB). The percent mapped refers to the number of reads that was mappable and, in the case of subgenome classification, the number of reads mappable to any subgenome. The classification error refers to the proportion of classified reads which were assigned to the wrong subgenome. The number of expressed homeologs with more than one read in any sample is also shown for each subgenome as a percentage of the set of 21 196 homeologs\nFigure 4 Overall classification error rate in tetraploid A. kamchatica and hexaploid wheat T. aestivum for all methods.\n\nHexaploid wheat read classification\nFor hexaploid wheat T. aestivum, the three homeologous copies further complicate read mapping and also requires a more complex workflow for subgenome classification. To test classification performance, we used RNA-seq data of a tetraploid wheat line (tetra-Chinese Spring) with subgenomes AB and a diploid A. tauschii line (accession KU-2076) with subgenome D as the ground truth. Table 5 shows the classification performance of each approach.\nOur results show that the misclassification rates of all methods were strikingly similar for allotetraploid A. kamchatica and allohexaploid wheat (Figure 4). This suggests that it was mainly the difference in methods rather than taxon-specific features that affected the error rate where subgenome-classification methods showed a higher precision than standard methods using the concatenated genome. The highest precision was obtained by EAGLE-RC followed by HomeoRoq, LAST, STAR and Kallisto in that order. Kallisto detected the most expressed homeologs by count, which may be spurious because Kallisto tended to overestimate low-expression genesrelative to other methods as discussed above. Among the other four methods, the subgenome-classification methods showed a slightly lower number of expressed homeologs detected than the standard genome alignment approach, though we do not know which one is closer to the ground truth. EAGLE-RC still maintained an error rate under 1%, the best among all methods, though it deemed an average of \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$\\sim$\\end{document}20% of mapped reads as unclassified due to low confidence resulting from ambiguity between homeolog pairs. There was also a higher cost in computation time for the higher precision (Supplementary data, Table S6).\nAnother point of interest is that the more divergent D diploid line had much higher error rates than the AB line directly extracted from Chinese Spring, which is the reference genome line (Table 5). In this case, STAR and Kallisto performed quite poorly compared to other methods. Divergence from the reference genome appears to be a multiplier for error rate, thus less precise methods will be more affected.\n\nHexaploid wheat homeolog expression quantification\nWe evaluated the classification error using the quantified read counts for homeologs in T. aestivum (Table 6). As described in the previous section, note that the number of reads quantified is not directly comparable among different methods. Again, for Kallisto we used the estimated read count from its output while for all other methods we obtained read counts using featureCounts.\nTable 6 Classification performance for T. aestivum using quantified read counts. Results are averaged across three samples for each of diploid A. tauschii (D) and tetraploid Chinese Spring (AB)\nTable 7 r 2 of the proportion of reads derived from subgenome A between different quantification approaches for homeologs in T. aestivum\nFigure 5 (a–d) Homeolog expression scatter plots for Kallisto versus other methods in hexaploid wheat, quantified as the expression proportion of subgenome A over the total (A + B + D) per homeolog. (e) Homeolog expression scatter plots for STAR versus LAST. (f) Homeolog expression scatter plots for EAGLE-RC versus HomeoRoq. As expected, the error rate was higher in hexaploid wheat than in A. kamchatica though with similar trends. In AB classification error, STAR had over \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$\\sim$\\end{document}200% higher rate of error, LAST and Kallisto had \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$\\sim$\\end{document}60% higher error and HomeoRoq and EAGLE-RC were the most robust with \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$\\sim$\\end{document}20% higher error. Similar to the alignment error analysis, the more distant A. tauschii (D) reads had much higher error rates, propagated from errors in the alignment. Again, STAR and Kallisto were the most affected by the increased distance in these samples.\nNext, we examined homeolog expression shifts across conditions and quantified the homeolog expression ratio in T. aestivum by calculating the proportion of subgenome A reads \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$\\widehat{p}$\\end{document} as follows: \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$$\\widehat{p}=\\frac{\\mathrm{chrA}\\ \\mathrm{reads}}{\\mathrm{chrA}\\ \\mathrm{reads}+\\mathrm{chrB}\\ \\mathrm{reads}+\\mathrm{chrD}\\ \\mathrm{reads}}$$\\end{document}\nWe examined the homeolog expression ratio using RNA-seq data from T. aestivum samples 24 h after fungal inoculation and after mock inoculation. The r2 (Table 7) and RMSD (Table S7) show that, relative to other methods, Kallisto exhibited a large drop in concordance in hexaploid wheat compared to A. kamchatica. The homeolog expression ratios (Figure 5a–d) show that the discordance was largely due to systematic differences in low-expression homeologs, which was also a trend for other methods (Figure 5e and f and Supplementary data, Figures S3–S5) though not to the degree of Kallisto. Also similar to A. kamchatica, there was some discordance between EAGLE-RC and other methods. In hexaploid wheat, there are 1689 AB, 1549 BD and 1661 AD pair-wise homeologs that have at least a 40% difference in the proportion of the gene model aligned, which may account for much of the discordance between EAGLE-RC and other methods.\nThen, we examined differentially expressed homeologs using RNA-seq data from T. aestivum samples 24 h after fungal inoculation and after mock inoculation. The DEH results for hexaploid wheat between different methods (Figure 6) show that there was less concordance between methods compared to tetraploid A. kamchatica. Notably, only 51% of 304 DEHs detected by Kallisto were supported by any of the other four methods (47% by STAR, 45% by LAST, 38% by HomeoRoq and 41% by EAGLE-RC). The results here reiterate that discordance was systematic, where Kallisto’s tendency to overestimate low-expression genes led to significant differences in the homeolog expression ratio.\nFigure 6 Overlap of differentially expressed homeologs from different methods in T. aestivum. (a) The overlap between traditional methods. (b-d) The overlap between subgenome-classification methods HomeoRoq and EAGLE-RC with each traditional method.\n\nDiscussion\nRecent improvements in sequencing technology have reduced the difficulty in constructing allopolyploid reference genomes [12–14] and there should be a corresponding increase in genome-wide studies for allopolyploids [3–8]. As such, there is a need to evaluate expression quantification methods given the presence of homeologs in allopolyploids. Unlike expression quantification in diploids, homeolog expression quantification evaluates multiple highly similar gene copies concurrently and it would be ideal if all copies are at a similar level of completeness. This includes the genome annotation, which may be a non-trivial source of uncertainty [53, 54].\nIt is well known that the presence of repetitive sequences in diploids (paralogs) present technical challenges for read alignment [55, 56] and can bias RNA-seq expression quantification [57]. For polyploids, with the presence of homeologs, there are even more repetitive sequences due to an increase in the number of gene copies. In this study, we saw that applying a standard diploid RNA-seq workflow to allopolyploids may have issues in terms of assigning reads to the wrong subgenome, particularly in hexaploid wheat. In general, low-expression genes accounted for most of the discordance between methods. Kallisto, especially, often overestimated the number of reads in low-expression genes which has been observed in previous studies [58]. If the goal is to determine homeolog expression bias, then accurate quantification of low-expression genes is important because small errors can result in large shifts in the expression ratio between homeologs. In addition, the accuracy of expression levels in such genes is especially important if we wish to identify ‘on’ and ‘off’ states of gene expression in response to stimuli in time course analyses.\nThere has been a long discussion and dispute about the bias of expression between homeologs and subgenomes because different species show different patterns [59]. In this study we found that different methods showed different biases, depending on the types of uncertainty they consider. In particular, gene annotation can affect the detection of homeolog expression bias because the exon regions of homeologs are typically annotated using RNA-seq reads on each copy separately and thus may be annotated differently, especially when the expression level of one of the copies is low. To obtain general conclusions on the biased expression in polyploids, we would suggest that analysis be performed with comparable and accurate methods in corresponding exonic regions. EAGLE-RC attempts to do this by considering only corresponding exonic regions between homeologs.\nIn our evaluations, we presented results both at the read mapping stage and after expression quantification. It is difficult to be completely fair when evaluating the accuracy of read mapping due to differences in each of the four approaches. Kallisto does not attempt to identify primary versus secondary alignments, thus a primary alignment only evaluation is not suitable. The standard genome alignment approach may have secondary alignments due to the presence of homeologs, which can be indistinguishable from the primary alignment in terms of alignment score, in which case a primary alignment is picked randomly. The subgenome-classification approach of mapping reads to each subgenome separately has inherently fewer secondary alignments, which is an advantage of this approach. Thus, we evaluated the classification accuracy with all alignments, as secondary alignment errors are a factor in all approaches though to different degrees. The results after read counting then present the performance of each approach after potentially accounting for ambiguous alignments. In this way we show the performance at both major steps in the RNA-seq expression quantification process, though there are many other read counting methods that we were not able test. Read counting methods may also benefit from the higher precision of the subgenome-classification approach by utilizing classified reads in lieu of uniquely mapped reads and distributing ambiguous reads through EM.\n\nConclusion\nIn this study, we evaluated methods for homeolog expression quantification in tetraploid A. kamchatica and hexaploid wheat T. aestivum using RNA-seq. We examined the standard genome alignment based approach with STAR and LAST, the subgenome-classification approach with HomeoRoq and EAGLE-RC and a pseudo-alignment approach with Kallisto.\nThe presence of homeologs had the largest effect on STAR and Kallisto, resulting in higher read classification error. We observed that discordance was systematic, occurring mostly in low-expression genes and can result in large shifts in the homeolog expression ratio. The explicit use of genotype differences between homeologs in EAGLE-RC seems to be a factor in reducing uncertainties in the reference genome and annotation. Our results show that EAGLE-RC was the most precise method in both tetraploid A. kamchatica and hexaploid wheat.\n\nSupplementary Material\nSupp_bby121 Click here for additional data file. "}