PMC:5563921 / 5259-15298


    2_test

    {"project":"2_test","denotations":[{"id":"28414515-12707371-58539379","span":{"begin":110,"end":114},"obj":"12707371"},{"id":"28414515-16056220-58539380","span":{"begin":134,"end":138},"obj":"16056220"},{"id":"28414515-21724842-58539381","span":{"begin":2615,"end":2619},"obj":"21724842"}],"text":"2. Methods\nThe z-score measurement has been used in different applications in bioinformatics (Cheadle et al., 2003; Margulies et al., 2005). Chopping sequence into k-mers is an essential technique in read assembly. We present the Zseq algorithm that uses the z-score measurement based on uniqueness scores of all reads. The uniqueness score is the normalized number of unique k-mers in each read that takes low-complex regions into account. Figure 1 depicts the process of finding reads with improved quality. Each module is explained in detail in the next few paragraphs.\nFIG. 1. Schematic representation of the process for filtering reads using the Zseq method. In the first step, Zseq scans all the reads and calculates the uniqueness score for all reads. The uniqueness score corresponding to each read is equal to the number of unique k-mers in that read. Zseq considers the default k-mer size, w, as 4-mers, which makes the vocabulary of four nucleotides (A,T,C,G) to be \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $${4^4} = 256$$ \\end{document} words. As the long reads may contain thousands of nucleotides, the 3-mer size is not sufficient to measure the complexity of the reads. This is because a 3-mer word can exist many times in the same read without being considered as unique, even when it is associated with different nucleotides each time. 
Zseq excludes the 5-mers of the low-complex/biased artifacts, such as ambiguous bases (N), PolyA/T, and GC content, from being unique by decreasing the unique score of the reads by one for each \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$2w$$ \\end{document} to reduce the chances of selecting this sequence later. The uniqueness score of each read is then normalized by dividing it by the length of the read. The normalized uniqueness scores of all reads are stored in a vector with the same order of the read in the input file.\nFigure 2 shows the distribution of the normalized uniqueness scores for all reads for sample SRR202054 from the prostate cancer data set used in the study of Kim et al. (2011). The x-axis shows the normalized uniqueness scores, while the y-axis shows the number of reads. As shown in the figure, the penalized sequences have a very small score down to −30. These are sequences that have been generated using reads that contain long PolyA/T sequences, very high GC content, or very high number of ambiguous nucleotides (N).\nFIG. 2. Distribution of the normalized uniqueness scores for all reads in sample (SRR202054) (\\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\mu = 25.8169 , \\sigma = 7.1681$$ \\end{document}). In the next step, Zseq calculates the mean and standard deviation for the normalized uniqueness scores. 
The mean of the normalized uniqueness scores of all reads is calculated in the first loop. The variance is also calculated linearly using a naive algorithm to reduce the cost of this step. The standard deviation is calculated from the variance of the vector of the normalized uniqueness scores.\nNext, for each normalized uniqueness score, we calculate the z-score using the mean, \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\mu$$ \\end{document}, and the standard deviation, \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\sigma$$ \\end{document}, as follows: \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} \\begin{align*} z = ( s - \\mu ) / \\sigma . 
\\tag{1} \\end{align*} \\end{document}\nThe z-score represents how many standard deviations the normalized uniqueness score of the read is away from the mean \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\mu$$ \\end{document} for all normalized uniqueness scores. In other words, if a read has a z-score of 0, it means that the read has the normalized uniqueness score of \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\mu$$ \\end{document}, while a z-score of value 1 means that the normalized uniqueness score is away exactly one standard deviation from the \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\mu$$ \\end{document}. Figure 3 shows the z-scores for all reads in the sample (SRR202054), where the x-axis is the z-score of the normalized uniqueness scores, while the y-axis indicates how many reads a particular z-score has in the sample.\nFIG. 3. Distribution of the z-scores of the normalized uniqueness scores corresponding to each read for sample (SRR202054). 
Finally, the user-adjustable threshold \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\theta$$ \\end{document} is used to determine whether or not to select the reads: if the z-score of the normalized uniqueness score of a read is greater than or equal to \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\theta$$ \\end{document}, the read will be selected; otherwise, it will be filtered out.\n\n2.1. Estimating the cutoff point\nA data-driven method based on the labeling rules is used to filter out the reads with low uniqueness score. The method automatically determines the cutoff point c to compensate \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\theta$$ \\end{document} in the histogram of reads' uniqueness scores and removes those reads whose uniqueness score is less than c. The labeling rules model calculates the first quartile q1 and third quartile q3 using mean and standard deviation, both of which are computed in the first loop through the reads. 
The cutoff point is calculated as follows: \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} \\begin{align*} c = q1 - g ( q3 - q1 ) , \\tag{2} \\end{align*} \\end{document}\nwhere g is the g-factor that can be calculated as follows: \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} \\begin{align*} g = ( h - q1 ) / h , \\tag{3} \\end{align*} \\end{document}\nwith h being the highest value in the histogram of reads' uniqueness scores. After calculating the cutoff point c, the method sweeps again throughout the reads and selects those that have \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\usepackage{upgreek}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$uniquenessscore \u003e = c$$ \\end{document}.\n"}
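
The scoring steps described in the record above (normalized uniqueness score, z-score from Equation 1, and cutoff from Equations 2 and 3) can be sketched in Python. This is a minimal illustration under stated assumptions, not the Zseq implementation: all function names are hypothetical, the low-complexity penalty is omitted, q1 and q3 are derived from the mean and standard deviation by assuming normally distributed scores, and h is taken to be the largest normalized uniqueness score.

```python
import math
from statistics import NormalDist


def uniqueness_score(read, k=4):
    """Number of distinct k-mers in the read, divided by read length.

    Assumption: the penalty for N, PolyA/T, and GC-rich k-mers is NOT applied.
    """
    if len(read) < k:
        return 0.0
    kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    return len(kmers) / len(read)


def mean_and_std(scores):
    """Single pass over the scores using the naive sum / sum-of-squares formula."""
    n = len(scores)
    total = 0.0
    total_sq = 0.0
    for s in scores:
        total += s
        total_sq += s * s
    mu = total / n
    # Clamp at zero: the naive formula can go slightly negative in floating point.
    variance = max(0.0, total_sq / n - mu * mu)
    return mu, math.sqrt(variance)


def z_scores(scores):
    """Equation 1: z = (s - mu) / sigma for every score."""
    mu, sigma = mean_and_std(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def cutoff_point(scores):
    """Equations 2 and 3: c = q1 - g * (q3 - q1), with g = (h - q1) / h.

    Assumptions: quartiles come from a normal approximation, and h is the
    maximum score (must be non-zero).
    """
    mu, sigma = mean_and_std(scores)
    z75 = NormalDist().inv_cdf(0.75)  # about 0.6745
    q1 = mu - z75 * sigma
    q3 = mu + z75 * sigma
    h = max(scores)
    g = (h - q1) / h
    return q1 - g * (q3 - q1)


def select_reads(reads, k=4):
    """Keep the reads whose normalized uniqueness score is at least c."""
    scores = [uniqueness_score(r, k) for r in reads]
    c = cutoff_point(scores)
    return [r for r, s in zip(reads, scores) if s >= c]
```

For example, `uniqueness_score("ACGTACGT")` counts the four distinct 4-mers ACGT, CGTA, GTAC, and TACG and divides by the read length of 8. A filter driven by the threshold θ instead of c would compare the output of `z_scores` against θ in the same way.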