PMC:5563921 / 15299-26558
Annotations
2_test
{"project":"2_test","denotations":[{"id":"28414515-21724842-58539382","span":{"begin":105,"end":109},"obj":"21724842"},{"id":"28414515-23618408-58539383","span":{"begin":537,"end":541},"obj":"23618408"},{"id":"28414515-21572440-58539384","span":{"begin":4635,"end":4639},"obj":"21572440"},{"id":"28414515-9254694-58539385","span":{"begin":4979,"end":4983},"obj":"9254694"},{"id":"28414515-21571633-58539386","span":{"begin":6054,"end":6058},"obj":"21571633"},{"id":"28414515-23618408-58539387","span":{"begin":6368,"end":6372},"obj":"23618408"},{"id":"28414515-22383036-58539388","span":{"begin":6416,"end":6420},"obj":"22383036"},{"id":"28414515-9360842-58539389","span":{"begin":8514,"end":8518},"obj":"9360842"},{"id":"28414515-12672026-58539390","span":{"begin":8537,"end":8541},"obj":"12672026"}],"text":"3. Results\nIn our experiments, we used the prostate cancer data set utilized in the study by Kim et al. (2011). The data set is publicly available in NCBI Gene Expression Omnibus (GEO) under Accession No. GSE29155. It contains 11 samples in total, where 7 of them belong to tumor tissues and the remaining 4 samples are benign. We measured the GC content and the number of ambiguous bases of the outcomes of each method, and then aligned the results of both methods to the human genome using Tophat2 as the alignment method (Kim et al., 2013).\nDUST takes a value that ranges from 0 and 100 as the complexity threshold, while Zseq takes a z-score value as a complexity threshold, which shows how many standard deviations the normalized uniqueness score of the read is away from the mean. For the DUST method, we chose the value 5 as the threshold, which means that the value of the complexity of the read has to be greater than or equal to 5 to be selected; otherwise, DUST will ignore the read. For Zseq, we have chosen −1.5 as the value of the threshold, which makes the read good to be selected if the z-score of that read is greater than or equal to −1.5. 
The reason behind selecting these two thresholds is that both methods filter almost the same number of reads in each sample. The filtered reads using Zseq have less GC content than the filtered reads using DUST. It also has smaller standard deviation, which makes the reads centered more around the mean than DUST. Figures 4 and 5 show the GC-content distributions for both methods applied on the same sample set (SRR202058).\nFIG. 4. Percentage of GC content for all filtered reads using the Zseq histogram with \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\mu = 52.63 \\%$$ \\end{document} and \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\sigma = 12.08 \\%$$ \\end{document}.\nFIG. 5. 
Percentage of GC content for all filtered reads using the DUST histogram with \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\mu = 53.09 \\%$$ \\end{document} and \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\sigma = 12.36 \\%$$ \\end{document}. Zseq shows a slight improvement in reducing the GC content, mapping rate, and mapping time, while dropping the number of ambiguous bases drastically in comparison with DUST. Table 1 shows that the number of ambiguous bases, N, in the filtered reads using Zseq has drastically decreased compared with the ambiguous bases that have been filtered out using DUST in all samples. For example, the number of occurrences of N in sample SRR202054 for filtered reads by DUST is 19,177, while there are only 11,135 filtered reads using Zseq for the same sample. The results indicate that Zseq slightly shrunk the GC-content percentage distribution and reduced the mean of the GC-content percentage. For sample SRR202055, the mean of the GC content is 52.48% ± 12.10% using Zseq, which is less than the 52.91% ± 12.38% obtained using the DUST method. Zseq also shows better mapping alignment for the filtered reads than DUST for most of the samples. 
For example, in sample SRR202061, the reads filtered by Zseq have 79.20% mapping rate, which is greater than 77.90% mapping rate for reads filtered by DUST, the only exception is sample SRR202062, which shows a similar mapping rate of 71.30% for both DUST and Zseq.\nTable 1. Comparison of the Results of Applying Zseq on Samples from the Prostate Cancer Data Set as a Result of Applying DUST on the Same Samples\n\n3.1. De novo sequence validation\nUsing Trinity de novo assembler (Grabherr et al., 2011), transcripts have been reconstructed for the original reads of sample SRR202058, reads that have been filtered by DUST and reads that have been filtered by Zseq. In the next step, all three sets of constructed transcripts were evaluated by searching the assembled transcripts with the human genome sequences using BLAST (Altschul et al., 1997). The set of the reconstructed transcript using the filtered reads by Zseq contains a higher number of long sequences in comparison with the other two sets. Figures 6, 7, and 8 show the meaningful sequences for each set. Some of the sequences, which were built using the reads filtered by Zseq, have a length of 1000 bp or more along with a high alignment score, while the sequence length is slightly more than 300 bp using the reads filtered by DUST and 200 bp for the original reads without filtering.\nFIG. 6. Biologically meaningful human genomic sequences found using BLAST. De novo assembled transcripts using original reads.\nFIG. 7. Biologically meaningful human genomic sequences found using BLAST. De novo assembled transcripts using reads filtered by DUST.\nFIG. 8. Biologically meaningful human genomic sequences found using BLAST. De novo assembled transcripts using reads filtered by Zseq.\n\n3.2. Machine learning validation\nIn another experiment, we used an independent data set containing 12 samples (six tumors and six matched normal) (Kannan et al., 2011). 
Using these samples, three data sets were generated, one from the original reads, one by applying DUST on the reads, and the third one by applying Zseq on the reads for all samples. In the next step, all reads corresponding to each data set have been aligned to human genome hg19 using Tophat2 (Kim et al., 2013) and Cufflinks assembler (Trapnell et al., 2012) with default parameters to assemble the transcripts to the human genome and estimate their abundance, which is measured by FPKM value (fragments per kilo bases of exons for per million mapped reads). Table 2 shows the average mapping rate of reads filtered by each method.\nTable 2. Average Mapping Rate of Transcripts Using the Data Set Generated by the Original Reads, Reads Filtered by DUST, and Reads Filtered by Zseq Each generated data set using filtered reads has 43,497 features (transcripts) with FPKM values. Also, each of the 12 samples was labeled as cancer or matched benign. The FPKM value equals 0 if the transcript has not been presented in that sample. We measured the number of transcripts that can individually separate all cancer samples from normal samples perfectly, with 100% accuracy. In other words, we want to compute the number of transcripts generated using filtered reads by each method, in such a way that the FPKM values corresponding to cancer samples can be separated from those of FPKM of normal samples. Figure 9 depicts two transcripts; transcript a has clearly separable FPKM values, while in transcript b, the FPKM values cannot be separated accurately.\nFIG. 9. An example of two transcripts, one with separable FPKM values (a), and other transcript with inseparable FPKM values (b). Table 3 shows the number of transcripts that contain separable FPKM values. These results indicate that applying Zseq influences the alignment tool and assembler to quantify more meaningful transcripts that can discriminate cancer and normal samples in comparison with the DUST method and original reads.\nTable 3. 
The Number of Discriminative Transcripts For Each of the Three Data Sets Moreover, using chi2 (Liu and Setiono, 1995) statistical test on the 231 discriminative transcripts from Zseq data set, the NM_001145410 transcript corresponding to NONO gene was the most significant transcript among all other transcripts in all three data sets. NONO is known to regulate in different types of cancers such as breast and prostate cancer (Traish et al., 1997; Ishiguro et al., 2003). Next, a support vector machine (SVM) with a linear kernel was applied to the three data sets using this transcript as the feature. An SVM is a supervised learning method that seeks an optimal separating hyperplane between classes (Cortes and Vapnik, 1995). Using a leave-two-out cross-validation scheme, classification accuracy was 100% for the Zseq data set and 91.66% for the DUST data set, and dropped to 83.33% for the original-read data set.\n\n3.3. Result of estimated cutoff point Zseq\nThe results of the estimated cutoff point Zseq (EC-Zseq), shown in Tables 4 and 5, suggest that the method does not find the optimal cutoff point. Zseq with the threshold $\\theta$ = −1.5, applied to the prostate cancer data set in the previous section, outperformed EC-Zseq. 
Despite having a better mapping rate, EC-Zseq falls short of Zseq with $\\theta$ in mean GC content, of both DUST and Zseq with $\\theta$ in the number of ambiguous nucleotides, and of Zseq with $\\theta$ in the number of decisive transcripts. However, EC-Zseq still gives better results than the original data set or the data set preprocessed with the DUST method.\nTable 4. Some Artifact Measurements of Prostate Cancer Samples That Were Preprocessed by EC-Zseq\nTable 5. The Number of Decisive Transcripts for the Data Set That Was Preprocessed by EC-Zseq\n\n4. C"}