|
Copy Number Studies in Noisy Samples
Abstract
System noise was analyzed in 77 Affymetrix 6.0 samples from a previous clinical study of copy number variation (CNV). Twenty-three samples were classified as eligible for CNV detection, 29 samples as ineligible and 25 were classified as being of intermediate quality. New software (“noise-free-cnv”) was developed to visualize the data and reduce system noise. Fresh DNA preparations were more likely to yield eligible samples (p < 0.001). Eligible samples had higher rates of successfully genotyped SNPs (p < 0.001) and lower variance of signal intensities (p < 0.001), yielded fewer CNV findings after Birdview analysis (p < 0.001), and showed a tendency to yield fewer PennCNV calls (p = 0.053). The noise-free-cnv software visualized trend patterns of noise in the signal intensities across the ordered SNPs, including a wave pattern of noise, being co-linear with the banding pattern of metaphase chromosomes, as well as system deviations of individual probe sets (per-SNP noise). Wave noise and per-SNP noise occurred independently and could be separately removed from the samples. We recommend a two-step procedure of CNV validation, including noise reduction and visual inspection of all CNV calls, prior to molecular validation of a selected number of putative CNVs.
1. Introduction
Genomic copy number variation (CNV) was associated with a variety of clinical phenotypes [1,2,3,4,5,6]. Hence, the study of CNV is of diagnostic importance. CNV identification from high-density SNP-microarrays may be unreliable, particularly in noisy data [7,8,9]. Therefore, extensive validation of CNV findings is needed. Since CNV detection software may identify hundreds of putative CNVs in each sample and since validation of CNV findings by qPCR, or by other molecular methods, is laborious, we searched for simple strategies to evaluate large numbers of CNV findings.
Rigorous studies revealed that several components of system error occur in copy number data [10,11,12,13]. Here we focus on two major types of noise and present the noise-free-cnv software package for the visualization of copy number data and for the reduction of noise. This software enables large-scale inspection of CNV findings (produced by PennCNV [14], Birdview [15,16], or other specialized software packages). For illustration, we used 77 microarrays from a previous study of patients with cervical artery dissection from Switzerland and Southern Germany (age: 42.5 ± 9.8 years; 31 (40.3%) women) [17]. DNA was isolated from peripheral blood samples (no DNA from lymphoblastoid cell lines was used). DNA extraction, array hybridization, and array scanning were performed according to the manufacturer’s instructions [17]. The LRR and BAF values were obtained from the CEL files with the Affymetrix Power Tools software (APT). The quantile normalization was done in APT. The LRR and BAF can be then imported to PennCNV, to other CNV detections software packages (QuantiSNP, MAD), or to noise-free-cnv.
The Affymetrix 6.0 microarrays used for CNV detection contain a total of 906,600 single nucleotide polymorphisms (SNPs) and 946,000 non-polymorphic copy number probes (CNPs) covering all human chromosomes. In the present article, the notion of SNP is used for all analyzed probe sets (SNPs as well as CNPs).
2. Noise Components
Figure 1 shows two samples (visualized by noise-free-cnv), displaying signal intensity (LRR—upper panel) and B-allele frequency (BAF—lower panel) of all SNPs ordered along the chromosomes. The Log R Ratio (LRR) is a normalized measure of the total signal intensity for two alleles of the SNP. The B-Allele Frequency (BAF) is a normalized measure of the allelic intensity ratio of two alleles [18]. Signal intensities in sample ID 2355 show larger variance than in ID 1022. Moreover, a prominent pattern of waves is apparent in sample ID 2355. In many samples, we observed similar wave patterns. The noise-free-cnv software identified waves using a Gaussian filter with a large standard deviation, for instance comprising 1,000 SNPs. This filter “blurs” the values as shown in Figure 2(G,H). We called the resulting wave data the wave component of the LRR values. The variance of the blurred LRR values is a measure for the prominence of waves, the wave variance.
Figure 1 Signal strength (LRR) and B-allele frequency (BAF) of samples from two male patients (ID 2355 and ID 1022). SNPs were visualized in increasing position along the chromosomes. LRR values of patient ID 2355 have larger variance and show pronounced wave noise.
Figure 2 Wave noise. Ideograms of pro-metaphase (A) and metaphase (B) chromosome 7 were compared with signal intensities of SNPs of chromosome 7 of two patients (C,D) and with a human prometaphase (E) and metaphase (F) chromosome 7. Signal intensities shown in C and D were smoothed (noise-free-cnv software, function “blur” across 1,000 probe sets) to visualize genomic waves (G,H). This wave pattern was compared with the banding pattern of metaphase chromosomes (Figure 2). Human metaphase chromosomes were stained with the Giemsa-trypsine procedure, which induces a banding pattern. AT-rich regions are more frequent in Giemsa-dark bands than in Giemsa-light bands [19,20]. In our study samples, Giemsa-dark bands corresponded to genomic regions with reduced probe set signals. This pattern of noise was described by others as “genomic waves” or “CG-waves” [10,11,12,13]. The co-linearity of genomic waves with Giemsa bands illustrates that genomic waves follow a similar pattern in all samples.
After subtraction of the wave component, the resulting LRR values follow an approximately normal distribution around zero. We called the resulting values per-SNP component and their variance the per-SNP variance. The decomposition of system noise in wave component and per-SNP component is shown for one sample in Figure 3. Wave variance and per-SNP variance components were calculated for all samples in Table A1.
Figure 3 Noise components. LRR values of a noisy sample (A), split up in wave component (B) and per-SNP component (C). All SNPs of chromosomes 1–3 were shown (chromosomes indicated on top of panel A). The system deviations of individual SNP signal intensities are strongly correlated across samples (Figure 4). To quantify the correlation of the noise (variance) components between different samples, we computed two additional data series: for each SNP the median through all 77 per-SNP components was computed and saved as the per-SNP profile. For the wave profile the same procedure was applied to the wave components. We then computed, for each sample, the correlation between the wave profile and the (individual) wave component as well as the correlation between the per-SNP profile and the (individual) per-SNP component. Details of the algorithm are described in Appendix. The high correlations found in our 77 samples confirmed that wave noise and per-SNP noise are system noise, i.e., follow highly non-random patterns. On average, the correlation was 0.843 for the wave component and 0.568 for the per-SNP component.
3. Factors Associated with Quality of Copy Number Data
The resolution of a classical chromosome study depends on the quality of the chromosomes and is expressed as the total number of visible cytogenetic bands (400 bands: low to moderate quality; 850 bands: excellent quality). According to our knowledge, no comparable quality metric for molecular karyotyping exists. Quality control in most copy number studies consists of rejecting samples with outlier numbers of CNV findings. A quality metric for the resolution of a CNV study (relating the size of a CNV and the likelihood of its detection) has not yet been defined.
Figure 4 per-SNP system noise. Signal intensities in genomic region 2: 189766706–189891527 shown for four patients (ID 1020; ID 1022; ID1026; ID 1028). The lower panel shows the per-SNP median profile (median signal intensities) of all samples (n = 77). Arrows and arrowheads indicate SNPs with LRR values far above and below the mean. In the current study we propose a preliminary quality metric based on the median number of SNPs per chromosome with copy number state (CN) ≠ 2 (numbers/chromosome for all cases are shown in Table A1). Copy Number state of each SNP was determined by the Affymetrix Power Tools software package (APT). SNPs located in common CNVs were excluded from this analysis. To identify SNPs located in common CNVs, we analyzed 403 control samples without visible waves and with highest genotype call rates selected from a large German population (PopGen [21]), as described before [17]. The median number of SNPs with CN ≠ 2 per chromosome was considered as a preliminary quality metric. The quality of a sample was related to the chromosomal background of SNPs with abnormal copy number (Figure 5). We defined deliberate quality categories: samples were classified as eligible, if the median number of SNPs per chromosome with CN ≠ 2 was zero, those with >100 SNPs with CN ≠ 2 were classified as ineligible.
Figure 5 Quality of copy number samples. Number of SNPs with CN ≠ 2 per chromosome were scored. Sample ID 715 is eligible for CNV studies (most chromosomes without SNPs with CN ≠ 2). Accumulation of aberrant SNPs in chromosome 7 and 18 indicates presence of rare CNVs. Sample ID 50 is of intermediate quality. Sample ID 062 was classified as ineligible for CNV studies (>100 SNPs with CN ≠ 2 in most chromosomes). Samples were classified according to the defined quality categories in Table 1. The use of freshly prepared DNA (compared to DNA samples that were used since years and had been thawed and frozen repeatedly) was a significant determinant of eligible samples (p < 0.001). Samples with high call rate (rate of successfully genotyped SNPs) were more likely to be suitable for copy number studies than those with lower call rates (p < 0.001). Low levels of wave variance as well as per-SNP variance were associated with eligibility for CNV analysis (p < 0.001). Eligibility for CNV studies was not significantly associated with the median number of calls by PennCNV (p = 0.053). However, eligible samples had between 63 and 165 calls, while the range of calls was much broader in ineligible samples. Birdview yielded significantly more calls in ineligible samples (p < 0.001). The proportion of putative false positive Birdview calls increased with decreasing confidence rates: The number of CNV findings with confidence below 2.5 was most strongly elevated.
microarrays-02-00284-t001_Table 1 Table 1 Characteristics of 77 analyzed samples, classified according to eligibility for copy number variation (CNV) analysis. Numbers indicate mean values and range (lowest–highest value). Mean values were compared between groups with the Chi-2 test or the Kruskal-Wallis test. Figure 6 summarizes salient aspects of system noise in SNP microarrays. Figure 6(A) plots for each sample the variances of wave component and per-SNP component. Wave variance and per-SNP variance seem to occur independently from each other: the observed correlation between both noise components (r = 0.124) was not significant (p = 0.401). Figure 6(B) illustrates the relation between sample eligibility and noise components in the eligible (n = 23) and ineligible (n = 29) cases. Eligible samples (i.e., those that are supposed to be excellent for copy number studies) have low levels of per-SNP variance. Samples with high wave variance are inappropriate for copy number studies.
Figure 6 Wave variance and per-SNP variance. (A) Noise components in all 77 samples and (B) in samples of low (O) and high (●) quality (samples of intermediate quality were not included in (B)).
4. Noise Reduction in Copy Number Samples
The noise-free-cnv software package permits the visualization of samples, the isolation of noise components and the subtraction of isolated noise components. The next two examples (Figure 7 and Figure 8) illustrate noise reduction by comparing a test sample with a reference sample. We finally demonstrate the use of the noise-free-cnv-filter algorithm for the evaluation of CNVs.
Figure 7 shows a deletion in chromosome 20 of patient ID 1091, which was detected by PennCNV and Birdview analysis. Due to strong waves, reduced signal intensities in the region of the putative deletion are not easily seen. Visual inspection of the LRR values of chromosome 20 after subtraction of a reference sample (A–B) suggested the presence of a true deletion in this patient.
Figure 7 Signal intensities (y-axis: LRR values) of all SNPs from chromosome 18q up to chromosome 22. (A) Patient ID 1091; (B) reference sample ID 2355. After subtraction of the samples, a deletion in chromosome 20 became apparent (arrow). Figure 8 illustrates the analysis of a mosaic deletion. Although sample ID D62 was classified as ineligible for CNV studies, analysis of SNPs with CN ≠ 2 per chromosome revealed significant clustering on chromosome 5 (Table A1; Figure 5). Neither PennCNV nor Birdsuite identified a large CNV on chromosome 5. After noise reduction, LRR and BAF values were suggestive for the presence of a mosaic deletion [22,23,24] (Figure 8(B,D)). To confirm the diagnosis of a mosaic deletion, a conventional chromosome analysis was performed: Some rare 5q chromosomes were observed amongst a majority of normal chromosome sets. Interestingly, it was recently demonstrated that the identification of mosaic abnormalities by microarray analysis is unreliable [25].
We developed the noise-free-cnv-filter algorithm for optimized noise reduction (Appendix). In the samples of our study population, noise-free-cnv-filter analysis resulted in an average reduction of the wave variance by 74.2%, of per-SNP variance by 35.3% and of the overall variance by 38.1%. Noise-reduction according to this algorithm supports the evaluation of CNV findings, in particular when the putative CNVs are small (Figure 9).
In patient ID 715, both Birdview and PennCNV identified a deletion on chromosome 18 (green bar in Figure 9). Noise-free-cnv-filter analysis of the sample (ID 715 nf) suggested that the deletion was true. Subsequent molecular analysis confirmed the finding: the joining segment of the deletion was identified by a case-specific PCR and the breakpoints of the deletion were identified by DNA sequencing following standard procedures [17,26]. Two putative duplications in patients ID 412 were evaluated after noise-free-cnv-filter analysis. We considered the duplication in chromosome 1 (region 222 Mb) as spurious (red bar), but the duplication in chromosome 9 as probably true. As a consequence, this putative duplication is a candidate for further validation by molecular methods.
Figure 8 Sample with mosaic large deletion in chromosome 5q. (A,B) LRR- and BAF-values of SNPs of chromosomes 5 and 6 of patient. (C) LRR values of reference sample. (D) Signal intensities after subtraction of reference sample. Arrows indicate region with reduced LRR values. (E) LRR values after application of noise-free-cnv blur over 2,000 SNPs. (Bottom panel) Chromosome analysis of cultured peripheral blood lymphocytes from patient (courtesy of Johannes W.G. Janssen, Department of Human Genetics, University of Heidelberg). Arrow points to 5q-minus chromosome.
Figure 9 Validation of CNV findings. Left panels show crude LRR values, left panels show LRR values after noise-free-cnv-filter analysis. Samples were renamed with suffix “nf” after noise-free-cnv-filter analysis. Bars indicate putative CNV findings.
5. Conclusions—Proposal of a Two-Step Procedure for the Validation of CNV Findings
Our analysis had the following key findings: (1) Copy number samples may be noisy, which interferes—above a certain level of noise—with reliable identification of CNVs; (2) Eligible copy number samples were more likely when fresh DNA was used for microarray hybridization; (3) wave component and per-SNP component of noise are independent; (4) noise-free-cnv software enables noise reduction by subtracting wave and per-SNP noise components from samples; and (5) noise-free-cnv software supports the quality control of copy number data and the validation of copy number findings.
The current noise-free-cnv version was developed for the analysis of SNP microarray samples and was not designed for noise reduction in array based comparative genomic hybridization samples. The present study highlighted the value of noise reduction for large scale CNV validation (after software-assisted CNV detection). However, the value of noise reduction before software-assisted CNV detection is to be analyzed in future studies.
Based on our analysis of noise in real-life copy number samples we suggested a two-step procedure of CNV validation. As a first step of preliminary CNV validation we proposed large-scale inspection of CNV findings after noise reduction, to select putative candidate CNVs and reject false positive findings. In a second stage, this selection of putative CNV calls is analyzed further by independent molecular methods for final validation [17,26].
|