> top > docs > PMC:5123383 > spans > 0-86

PMC:5123383 / 0-86 JSONTXT

A fuzzy method for RNA-Seq differential expression analysis in presence of multireads Abstract Background When the reads obtained from high-throughput RNA sequencing are mapped against a reference database, a significant proportion of them - known as multireads - can map to more than one reference sequence. These multireads originate from gene duplications, repetitive regions or overlapping genes. Removing the multireads from the mapping results, in RNA-Seq analyses, causes an underestimation of the read counts, while estimating the real read count can lead to false positives during the detection of differentially expressed sequences. Results We present an innovative approach to deal with multireads and evaluate differential expression events, entirely based on fuzzy set theory. Since multireads cause uncertainty in the estimation of read counts during gene expression computation, they can also influence the reliability of differential expression analysis results, by producing false positives. Our method manages the uncertainty in gene expression estimation by defining the fuzzy read counts and evaluates the possibility of a gene to be differentially expressed with three fuzzy concepts: over-expression, same-expression and under-expression. The output of the method is a list of differentially expressed genes enriched with information about the uncertainty of the results due to the multiread presence. We have tested the method on RNA-Seq data designed for case-control studies and we have compared the obtained results with other existing tools for read count estimation and differential expression analysis. Conclusions The management of multireads with the use of fuzzy sets allows to obtain a list of differential expression events which takes in account the uncertainty in the results caused by the presence of multireads. Such additional information can be used by the biologists when they have to select the most relevant differential expression events to validate with laboratory assays. Our method can be used to compute reliable differential expression events and to highlight possible false positives in the lists of differentially expressed genes computed with other tools. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1195-2) contains supplementary material, which is available to authorized users. Background The advent of High-performance Next-Generation Sequencing (NGS) technologies has improved the analysis of differential expression of genes, by investigating both the nature and the quantity of expressed mRNAs and increasing the spectrum of applications of sequencing. Increasingly faster mapping algorithms and more complex statistical analyses have been developed to address all the issues introduced by RNA-Seq data, but the discussion about the best workflows is still open. A typical differential expression (DE) analysis workflow is composed by three main steps: (1) read mapping, (2) gene expression computation and (3) identification of noticeable differences between the samples. The results are generally output as a list of genes showing a statistically significant variation in expression between different experimental conditions. The confidence of the result is usually defined through fold change values and p-values obtained with statistical hypothesis testing. Despite the continuous improvement of the sequencing techniques, the results of DE analyses are not yet fully reliable. The variations in gene expression discovered from RNA-Seq data must be confirmed with laboratory assays, such as qPCR and these validations often reveal the presence of false positives in the bioinformatics analysis results [1]. In order to obtain more accurate results, the second and third steps of DE analysis workflow have been deeply examined in the last years, also taking advantages from the experience gained with the use of microarrays. Several normalization techniques [2–4] and DE analysis models [5–7] have been designed for sequence count data. More recently some attention has been focused on an issue that arises in the first step of DE analysis workflow, i.e. the problem of multireads, which are those reads that map to more than one transcript/genomic location in the reference sequences and cause uncertainty in gene counts. The main source of mapping ambiguity is the presence of genes with similar sequences (i.e., paralogous gene families) and, since the sequenced reads do not span entire transcripts, alignment algorithms are sometimes unable to uniquely determine the gene from which they are derived. In addition, polymorphisms, inaccurate reference sequences and sequencing errors require mismatches and indels to be allowed in read alignments, lowering out the quality of the mapping process. When the number of multireads is small, many researchers simply choose to exclude such reads from the analysis, counting only uniquely mapping reads [8], but this option always leads to an underestimation of the read counts and it is not possible to a priori know if the removed reads were relevant for the DE results. Another approach is to estimate the real number of read counts: the simplest way is to assign multireads to genes proportionally to the expression of uniquely mapping reads (named Rescue Method) [9]; some more complex techniques compute an estimation of the read counts using probabilistic models, such as IsoEM [10], RSEM [11], Rcount [12], TEtranscript [13], MMR [14]. These methods, starting from some assumptions on the distribution of the data, model the generation of multireads and estimate the final read counts. From each method we obtain different estimated counts, and their confidence value, when provided, is not considered by the actual DE analysis tools and it cannot be used to evaluate the results. In a recent work, Robert and Watson [15] propose a two stage analysis: in stage 1, reads are assigned uniquely to genes; in stage 2, reads that map to multiple genes are assigned uniquely to multi-map groups, but 30 % of such groups contains two or more genes. The expression variation of a single gene could either be hidden in the group, or it could require a laboratory validation of all the genes in the group. In this paper, starting from an approach for dealing with multireads that preserves uncertainty information, we present a novel DE analysis workflow. The main aim of our work is to extract the results with high possibility of being true positives, and hence to highlight those results that risk to be false positives, since their count values are influenced by the presence of multireads. The whole workflow we describe is based on fuzzy set theory. Fuzzy sets were introduced by Zadeh as a mean to model imprecisely defined concepts in 1965 [16], by handling partial truth with the concept of gradual membership to a set. Fuzzy logic is a generalization of classical logic based on fuzzy set theory. A fuzzy set could be used to restrict the possible values a variable can assume, thus providing a possibility distribution for the variable [17]. Possibility theory is the theory of handling possibility distributions; it was introduced by Zadeh and further developed by Dubois and Prade [18]. Possibility theory and probability theory are only loosely related, and the former seems more suitable to model uncertainty deriving from imprecision or lack of information. The approach used in this paper is compliant with the work of Zadeh [17], who proposed possibility distributions as suitable interpretations of fuzzy sets. The possibility measure used in this paper follows the notation introduced by Pedrycz [19]. The method has been tested on public RNA-Seq data and discussed by comparing the obtained results with other bioinformatics tools. The read count estimations have been computed with TopHat and Bowtie mappings and with RSEM. The DE analyses were performed on the counts with cuffdiff, DESeq2, edgeR and with Fisher’s Exact test adjusted by false discovery rate (FDR). Methods Our workflow starts with the quantification of gene expressions in presence of multireads through the definition of fuzzy read count. It applies fuzzy sets to build a granular representation of read counts for each gene. A whole new workflow is then introduced to extend DE analysis to this expression measure. For this purpose, we define the fuzzy fold change, the possibility of over or under-expression and the possibility of same-expression, that is useful to evaluate the presence of false positives in the DE events. The output of the workflow is a list of DE events with the additional information about uncertainty correlated to multiread presence. Fuzzy description of count data When the NGS reads are compared to a database of reference sequences such as genes or transcripts, we obtain a list of results as summarized by Table 1. The mappings among reads and genes are provided one by row and the multiread events are represented by multiple rows identified by the same read. If mismatches and indels are allowed in the mapping results we can obtain, for the same read, matches that differ in mapping accuracy. Table 1 Example of mapping results with multireadsFor gene-1: one uniquely mapping read (read-1); two reads having only gene-1 as best match (read-1, read-2); three reads having gene-1 as best match (read-1, read-2, read-3, the latter having also gene-2 as best match); five reads mapping on gene-1, even if not as best match (from read-1 to read-5) In our approach, we start from the assumption that the higher is the accuracy of the mapping, the higher is the possibility that read actually originates from such gene. Therefore, if a read maps to two genes with the same accuracy, we consider – without any statistical assumption – that it is equally possible for the read to originate from any of such genes. On the other hand, if a read is more similar to a gene than to another, then it is more possible that the read originates from the former than from the latter, yet without excluding this eventuality (because a low similarity can be due to errors in sequencing or variations in the genome of the sample). We easily introduce these concepts in our method by exploiting the possibility distribution emerging from the representation of read count through fuzzy sets. If we can define a numerical score to evaluate the accuracy of the mapping results and if we can scale it between 0 and 1, we can exploit it as description of the possibility of reconstructing the right mapping. For example, in NCBI Blast results [20], the numerical score could be defined by the identity of the mapping or by the product of identity and coverage (bit score); in a SAM file the 5th column, which represents the mapping quality between 0 and 255, can be scaled and used as a possibility value. After this computation of possibilities, we obtain a table like Table 2, in which a possibility value is associated to each mapping. Table 2 Possibility values of mappingsIn this example the possibility values are computed by scaling the product of Identity and Coverage obtained in Table 1 in a value from 0 to 1 For each gene, we can now compute the possibility of having a given read count. For example, the possibility of having a read count = 1 for gene-1 is the best possibility of having one read as true match and all the others as false match. In this way, we compute for gene-1 the possibility distribution shown in Fig. 1. This figure is a fuzzy set, and its interpretation is that the real read count for the gene can possibly be 2 or 3, but it has also an intermediate possibility of being 1, 4 or 5. This is just a representation of what we observe from mapping results, without assumptions on the shape of the distribution and without probability evaluation. The computation of such a fuzzy set is cumbersome; however some general considerations can be drawn to obtain an effective approximation. Fig. 1 Discrete possibility distribution of read counts. For each gene (a) can be represented with a trapezoidal fuzzy set (b) In fact, we can observe that the lowest possible value – denoted with an A in Fig. 1(a) – is the number of reads uniquely mapping to gene-1 and the highest value (D) is the total number of reads that map to that gene (counts outside this interval have null possibility). The plateau with highest possibility is delimited by two other values: the first (B) is the number of reads having the gene as unique best match, the second (C) is the number of reads having that gene as non unique best match (for example read-3 in Table 1 maps both to gene-1 and gene-2 with highest similarity). We also observe that the possibility value always increases between A and B and decreases between C and D. We can use these four parameters to obtain a simplified representation of the granular read count through a trapezoidal fuzzy set, exemplified in Fig. 1(b) and formally defined as:1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \mathrm{T}\mathrm{r}\left[\mathrm{A}\hbox{'},\mathrm{B},\mathrm{C},\mathrm{D}\hbox{'}\right](x)=\left\{\begin{array}{c}\hfill \begin{array}{cc}\hfill 0,\ \hfill & \hfill \mathrm{x}\le \mathrm{A}\hbox{'}\vee \mathrm{x}\ge \mathrm{D}\hbox{'}\hfill \\ {}\hfill \frac{\mathrm{x}\hbox{-} \mathrm{A}\hbox{'}}{\mathrm{B}\hbox{-} \mathrm{A}\hbox{'}},\hfill & \hfill \kern1.9em \mathrm{A}\hbox{'}<\mathrm{x}\le \mathrm{B}\hfill \end{array}\hfill \\ {}\hfill \begin{array}{cc}\hfill 1,\hfill & \hfill \kern1.9em \mathrm{B}<\mathrm{x}<\mathrm{C}\hfill \\ {}\hfill \frac{\mathrm{x}\hbox{-} \mathrm{D}\hbox{'}}{\mathrm{C}\hbox{-} \mathrm{D}\hbox{'}},\hfill & \hfill \kern1.9em \mathrm{C}\le \mathrm{x}<\mathrm{D}\hbox{'}\hfill \end{array}\hfill \end{array}\right. $$\end{document}TrA',B,C,D'x=0,x≤A'∨x≥D'x‐A'B‐A',A' 0.5 and evaluated the uncertainty of the results with same-expression fuzzy set. The results are summarized by Fig. 8 and Table 3. The figure plots the possibility of same-expression for the genes evaluated as differentially expressed by each analysis. The higher the possibility of same-expression, the more uncertain is the result of the gene, because of the presence of multireads. Fig. 8 Same-expression possibility in the results of the First Trial. In this trial the results obtained with TopHat show a rapidly increasing same-expression possibility. They have an high possibility of representing false positive results. The plot shows a wide presence of uncertainty in all the results and the grey line of same-expression possibility, by lying below all the other lines, reveals the presence of other differentially expressed genes less influenced by multireads, which have not been discovered by the other tools Table 3 Comparison of the results obtained in First TrialEach cell represents the number of DE genes in common between two methods specified in the corresponding row and column. Despite the Fisher’s test with FDR extracts a very large list of DE genes, all the results do not agree on at least one gene. The values on the diagonal (in bold) are the results obtained from each method The uncertainty highlighted in the differentially expressed genes is confirmed by an overall lack of concordance among the DE analyses results, summarized by Table 3. In this table, each cell represents the number of DE genes selected by the two methods specified in the corresponding row and column. Table 4 lists some examples of certain and uncertain results. For each gene in the table we list all the read counts estimations (RPKM for TopHat) and the results of the applied methods. Table 4 Examples of results of the First TrialThe first gene, HLA-DRB5, despite the abundant presence of multireads in its read counts, shows a certain result, with an maximum under-expression possibility and very low same-expression possibility. This results is not confirmed by cuffdiff, evenif it computes a high FC. The second gene, TTTY15 result is not reliable because it has low read counts, but it is considered as a DE result by all the tools. Our method higlights the result as false positive, with a high same-expression value, because of the low mean expression values obtained. The third example, the IGJ gene, is a possible false positive. Its counts are quite low and its fuzzy read counts are mostly overlapping. The pvalues obtained by the other tools are confirmed by a high under-expression possibility, but the risk of having a false positive is pointed out by a high same-expression possibility Second trial The same test described for a case-control study with just one sample per condition has been performed introducing biological replicates, downloaded from the same SRA study. We have used four replicates for case and four replicates of healthy controls. In this case the application of DESeq2 to the data gave no warnings and significant DE results, so we have used it for the p-value computation of RSEM and rescue-like estimations. Also edgeR has been applied, thanks to the presence of replicates. The mapping identified 27,328 genes, 35 % of reads were multireads and 88 % of gene counts were influenced by multireads. The results of the analyses are summarized by Fig. 9 and Table 5. In this case, the presence of biological replicates increases the same-expression possibility of the selected results, and despite the presence of biological replicates, there is still a very low concordance among the various DE analysis results. Multiple Sclerosis is a complex and multi-factorial disease and the introduction of biological variation in this experimentation makes it difficult to extract significant differential expression events. The list of genes selected by at least two workflows, their possibility values and their estimated read counts are showed in Supplementary information (Additional file 1). Fig. 9 Same-expression possibility in the results of the Second Trial. In this case, there is less difference in the same-expression evaluation of the results obtained with all the tools. The lower value of same-expression obtained on the best result is 0.53 Table 5 Comparisons of the results obtained in Second TrialEven in this case, the comparison of DE analyses show a significant disagreement on the selection of DE genes. The values on the diagonal (in bold) are the results obtained from each method Simulation study Since it is impossible to find experiments with information about the real varied and unchanged reads, we have set a simple simulation study to compare the proposed method with other existing tools. We have used wgsim by Samtools [25] to simulate 2 experimental conditions with 2 biological replicates, by introducing 10 over-expressed (with log2FC = +1) and 10 under-expressed genes (with log2FC = −1). The simulated FASTQ files have been processed with TopHat, Cuffdiff, RSEM, DeSEQ2, edgeR and with our fuzzy method. Some examples of the read counts obtained with RSEM and our method are showed in the Supplementary information (Additional file 1). The results obtained are summarized by Table 6. Cuffdiff with TopHat and DESeq2 applied on the centroids of the fuzzy trapezoids output no significant results (the adjusted p-values are > 0.05). DESeq2 applied on RSEM estimation allows to obtain the best sensitivity, while the best recall is obtained by edgeR applied on the centroids. Since our method only ranks the genes, we cannot compare its results in terms of sensitivity and specificity (or precision and recall), and then we will compare the different ranking obtained by all the methods. Anyway, it is interesting to see that, while only 10 genes have a same-expression possibility < 0.5 and all the listed genes have an over- or under-expression possibility very close to 1, the fold change obtained on the centroid values is useful to further evaluate the results. In fact, the expected log2 fold changes of about +1 (for over expressed genes) and about −1 (for under-expressed genes) is correctly computed only by our method and by edgeR. DESeq2 always underestimates the fold change. Also Cuffdiff computes a correct fold change for our varied genes, even if it does not consider them significant. The last evaluation is made by comparing the gene ordering given by the same-expression possibility with the sorting of results obtained by the p-values of the other workflows. The aim of this evaluation is to see which workflows are able to put the most reliable results in the top of their gene ranking (see Table 7). Table 6 Comparisons of the results obtained on simulated dataThis table lists the results obtained on 10 over-expressed and 10 under-expressed genes. Five more genes are involved as false positives. Some workflows select the genes as significant DE events only by the p-value (*), while others compute also the correct FC (**). The first two workflows are not able to select significant p-values. Ten genes have a low uncertainty (same-expression possibility) and ten genes have a high uncertainty due to their multireads. The five false positive events have a same-expression possibility close to 1 and a non significant FC. Even in this case, there is an overall disagreement in the selection of DE genes. The last 7 rows of the table show some evaluation metrics about the results (Bold) Table 7 Rankings of the genes for the simulated dataIn order to highlight the true positives, the other genes are hidden by a “false positive” label The worst result is obtained by Cuffdiff on TopHat’s mappings, because it puts a false positive gene at the first position in the ranking. The best result is surprisingly obtained by DESeq2 applied on the centroids of the fuzzy read counts. This workflow did not output significant results but it properly orders the genes, and it puts the first false positive in the 20th position. The second best result is obtained while ranking by same-expression possibility. Here the first false positive result is in the 16th position. Anyway, the good performance obtained by the centroids with DESeq2 can be exploited also to enhance the ranking of the method proposed in this paper. In fact, if we filter the results by fold changes (of centroids) > 0.8, we cleanse the list and we obtain the first false positive in the 20th position. Discussion The tests performed on the selected NGS experiments revealed a high percentage of multireads – 36 % of reads were multireads – that influenced over 83 % of expressed genes. The different approaches used to estimate the read counts have influenced the DE analysis, and all the obtained lists of differentially expressed genes showed some significant differences in the results. In this situation, it is very difficult to perform the choice of a subset of DE events to be confirmed with laboratory assays, such as qPCR. Moreover, since it is impossible to establish which is the right mapping for a multiread, it is also difficult to evaluate the better choice for the estimation of read counts. Our method is not designed to propose a further estimation of read counts, but it provides a solution for managing the uncertainty in read counts and in DE analysis results due to the presence of multireads. Furthermore, the quantification of under/same/over-expression possibility can be useful in the selection of the more reliable DE events, in order to improve the following laboratory work. The three possibility values can be used just to evaluate the uncertainty of the results of a DE analysis tool, or they can be used in place of it. In fact, a gene with a relevant change in expression will show a high possibility of over or under-expression, and when this result is accompanied by a null possibility of same-expression, we are sure that the multireads are not influencing the result. The possibility values can be used to properly sort the genes, and we do not propose to apply some thresholds on them, but for an easier interpretation of the results, one could consider an appropriate cut for the discretization of the possibility values and the selection of significant DE events. For instance, by applying a cut of possibility = 0.75 in the results of the first trial of Multiple Sclerosis dataset, we could select 37 DE genes, with the worst same-expression possibility < 0.34. In the analysis of biological replicates of the same datasest, the same-expression possibility is quite high for all the genes. As highlighted by Fig. 9, the lowest same-expression possibility is 0.53. This result can be due to the large amount of multireads found in the experiment, but it might be also caused by the rule that we have chosen to merge the fuzzy read counts of the same gene coming from biological replicates – described in the Methods section – that will be matter of subsequent studies. Conclusions In this paper we have presented a method for dealing with the problem of multireads without statistical assumptions and probability estimations. The method exploits fuzzy sets to describe the presence of uncertainty in read counts estimation and in DE analysis results, whenever the evaluation of the expression of a gene is influenced by multireads. We have described the trapezoidal fuzzy read counts and fuzzy fold change, and we have modeled the DE events with three fuzzy concepts: over-expression, same-expression and under-expression. The result of the computation is a list of DE events that can be sorted by the possibility of having a change in expression. The DE events are enriched with an evaluation of the possibility of same-expression, that describes the uncertainty of the result due to the presence of multireads mapping to each gene. In fact, the main aim of this work was to highlight the most reliable DE events and to warn about the more uncertain others, in order to exclude possible false positives from the laboratory assays that follow the bioinformatics analyses. We have tested the method on some public datasets, and the results revealed that 36 % of the mapped reads were multireads and over 83 % of expressed genes were influenced in their read count by the mapping uncertainty. We have not proved how many DE events that showed high uncertainty are really false positives, because it is very difficult to obtain public dataset in which all DE events have been validated via qPCR, especially because false positive results are not published, usually. Some studies have compared RNA-Seq data with microarray data, but none of the two technologies have been showed to be more reliable than the other on all the genes [26, 27]. In order to give an example of how the multireads influence the discovery of true positive DE events, we have tested the method on a simulated dataset with 10 over-expressed and 10 under-expressed genes. The results showed an overall disagreement in the selection of DE genes, but the proposed method is able to properly rank the most reliable results. The problem of multireads is frequent in high throughput data analysis and even with longer reads it does not decrease as expected, because there is a high overlapping across genetic portions. We have noticed that in a gene reference database, 50 % of the genes have at most one read that covers a portion of 45 % of their length. The same problem is present in other studies involving read mapping against a reference database. In this paper we have applied our method to gene expression evaluation, but the same concepts can be applied to other studies, such as isoform expression evaluation, ncRNA identification and genomic or metagenomic classification of DNA-Seq reads. As a future development, we plan to improve the merging of biological replicates step and we will compare our method to other emerging tools. For example, several alignment-free transcript quantification pipelines (e.g., kallisto [28], Salmon [29]) have recently been proposed. These incorporate multi-mapping reads and provide uncertainty estimates for gene counts using bootstrap sampling. Moreover, the same-expression possibility could be directly compared to the output of the DE analysis tool sleuth [30], which evaluates the DE events by considering also the confidence values computed by kallisto.

Document structure show

projects that have annotations to this span

There is no project