=============Title==========
Copy Number Variations in the Human Genome: Potential Source for Individual Diversity and Disease Association Studies.
=============Cor Author==========
*Corresponding author: E-mail yejun@catholic.ac.kr, Tel +82-2-590-1214, Fax +82-2-596-8969. Accepted 11 March 2008
===========Author==========
Tae-Min Kim1, Seon-Hee Yim2 and Yeun-Jun Chung1,2*
1Department of Microbiology, 2Integrated Research Center for Genome Polymorphism, The Catholic University of Korea, Seoul 137-701, Korea
===========Keywords==========
Keywords: array-CGH, Copy number variation (CNV), Genome-wide association study (GWAS)
===========Sub Heading==========
Abstract
Introduction
The definition of CNV
The identification of CNVs using different platforms
Clinical implications of CNVs and disease association study
Conclusion
==========Minor Heading===========
A
===========Main Text==========
Abstract.
The widespread presence of large-scale genomic variations, termed copy number variations (CNVs), has recently been recognized in phenotypically normal individuals.
Judging by the growing number of reports on CNVs, it is now evident that these variants contribute significantly to genetic diversity in the human genome.
Like single nucleotide polymorphisms (SNPs), CNVs are expected to serve as potential biomarkers for disease susceptibility or drug responses.
However, technical and practical concerns remain to be addressed.
In this review, we examine the current status of CNV databases and research, including the ongoing efforts to screen for CNVs in the human genome.
We also discuss the characteristics of platforms that are available at the moment and suggest the potential of CNVs in clinical research and application.
Introduction.
Traditionally, large-scale genomic variants that are visible by conventional karyotyping have been thought to be associated with early-onset, highly penetrant genetic disorders and to be absent from normal, disease-free individuals (Lupski, 1998; Stankiewicz and Lupski, 2002).
The construction of the 'reference genome' by the human genome sequencing project was based on the belief that human genome sequences are virtually identical across individuals, except for well-known single nucleotide polymorphisms (SNPs) and size variants of tandem repeats such as mini- or microsatellites (variable number of tandem repeats, or VNTRs) (Przeworski et al., 2000).
This traditional concept has been recently challenged by the discovery that large structural variations are more prevalent than previously presumed (Check, 2005).
Using high-resolution whole-genome scanning technologies such as array-based comparative genomic hybridization (array-CGH), two pioneering groups identified widespread copy number variations (CNVs) in apparently healthy, normal individuals (Iafrate et al., 2004; Sebat et al., 2004).
These findings suggest that our genome is more diverse than previously recognized, and subsequent studies have identified up to 11,000 CNVs across the whole genome (Tuzun et al., 2005; Hinds et al., 2006; Mills et al., 2006; McCarroll et al., 2006; Conrad et al., 2006; Sharp et al., 2005; Wong et al., 2007; de Smith et al., 2007).
Although the current understanding of CNVs is still too limited for practical use and technical challenges remain, recent studies have already demonstrated potential associations of CNVs with various diseases, suggesting plausible functional significance and highlighting the promising utility of CNVs.
The genomic coverage of known CNVs (approximately 600 Mb, comprising 12% of the human genome) has already exceeded that of SNPs and is still increasing (Cooper et al., 2007).
These large-scale structural variants, in addition to SNPs, will serve as powerful resources for understanding human genetic variation and differences in susceptibility to various diseases.
This paper reviews the current knowledge and future perspectives of CNVs.
The definition of CNV.
Structural variations that involve large DNA segments can take various forms, such as duplication, deletion, insertion, inversion, and translocation.
Among them, DNA copy number variations larger than 1 kb are collectively termed CNVs.
Fig. 1 illustrates the concept of CNV.
Although CNVs can include large, microscopically visible genomic variations, the term generally indicates a submicroscopic structural variation that is hardly detectable by conventional karyotyping (3-5 Mb) (Freeman et al., 2006).
Smaller variations such as small insertion-deletion (indel) polymorphisms are not included in CNVs, although they comprise another large collection of over 400,000 variants in the human genome (Mills et al., 2006); insertional polymorphisms of mobile elements such as Alu or L1 elements are likewise not considered CNVs.
At the beginning stages of CNV discovery, a number of terms were proposed to define them, e.g., large-scale copy number variants (LCV) (Iafrate et al., 2004), copy number polymorphism (CNP) (Sebat et al., 2004), and intermediate-sized variants (ISV) (Tuzun et al., 2005).
The current definition of CNV is also operational and can be modified as scanning resolution and coverage advance and as allele frequencies in a given population become available.
The identification of CNVs using different platforms.
Various scanning platforms and quality control methods have been used to identify CNV calls.
Because the choice of platforms has a great effect on the results, it is worth reviewing the characteristics of platforms to improve the understanding of CNVs.
The presence of CNVs in normal individuals was first reported in 2004, independently, by two groups led by Lee C. and Wigler M. (Iafrate et al., 2004; Sebat et al., 2004).
Both studies used two-dye array-CGH techniques, based on either bacterial artificial chromosome (BAC) clones or oligonucleotides (representational oligonucleotide microarray analysis, or ROMA).
They independently reported about 250 and 80 loci with copy number changes in 39 and 20 normal individuals, respectively.
Fig. 2 illustrates the general concept of CNV detection based on two-dye array-CGH.
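As a rough illustration of the two-dye principle, the sketch below converts per-probe test and reference intensities into log2 ratios and labels each probe as a gain, a loss, or neutral. The intensity values, the function names, and the ±0.3 thresholds are hypothetical choices made for this example only; they are not taken from the cited studies.

```python
import math

def log2_ratios(test, reference):
    """Per-probe log2(test / reference) hybridization intensity ratios."""
    return [math.log2(t / r) for t, r in zip(test, reference)]

def call_state(ratio, gain=0.3, loss=-0.3):
    """Classify one log2 ratio; the thresholds are illustrative assumptions."""
    if ratio >= gain:
        return "gain"
    if ratio <= loss:
        return "loss"
    return "neutral"

# Hypothetical intensities for five probes (test dye vs. reference dye).
test_intensity = [1000.0, 1500.0, 500.0, 1020.0, 980.0]
ref_intensity = [1000.0, 1000.0, 1000.0, 1000.0, 1000.0]

ratios = log2_ratios(test_intensity, ref_intensity)
calls = [call_state(r) for r in ratios]
print(calls)
```

In practice, neighboring probes are usually evaluated together by a segmentation step rather than one at a time; the per-probe threshold here only conveys the basic idea.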
Although the average numbers of CNVs per individual genome were similar in the two studies (about 12 CNVs per genome), it should be noted that there was little overlap between the results.
This discrepancy between studies was possibly due to the use of different platforms and experimental conditions in different populations.
However, it is also probable that there are still large numbers of structural variants that have yet to be discovered (Buckley et al., 2005; Eichler, 2006).
A subsequent study that provided evidence of the widespread presence of large-scale structural variations in the human genome was based solely on in silico analysis (Tuzun et al., 2005).
The sequence-level comparison of two independent genome sequences, i.e., one derived from a human genome reference assembly and the other from fosmid clones of a genomic library, revealed about 300 structural variations, including inversions.
This method can detect various types of structural variants, including inversion, which is not detectable by conventional array-CGH platforms.
Indeed, the results of Tuzun et al. (2005) can be used as a validated control for primary verification or for parameter tuning in the development of CNV-detection platforms or algorithms.
Although the use of this method is currently limited by the scarcity of sequence data, ongoing efforts to sequence individual human genomes and to develop cost-effective sequencing platforms (Bennett et al., 2005) should facilitate sequence-level genome comparisons and the identification of high-quality structural variants in the near future.
Two studies, by McCarroll et al. and Conrad et al., focused on the identification of deletion variants (McCarroll et al., 2006; Conrad et al., 2006) and used genotyping data for 1.2 million SNPs from the International HapMap Consortium (International HapMap Consortium, 2005).
They assumed that allelic deletions cause probes to be discarded in SNP genotyping.
For example, runs of consecutive probes with null genotype calls, or runs of SNP genotypes whose allele frequencies deviate from Hardy-Weinberg equilibrium expectations or from expected Mendelian inheritance patterns, might indicate deleted loci.
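Two of these signatures can be sketched in a few lines: a scan for runs of consecutive null genotype calls, and a check for parent-child genotypes that violate Mendelian inheritance. This is a minimal sketch under stated assumptions, not the procedure of either cited study; the function names, the minimum run length of three probes, and the toy genotypes are all hypothetical.

```python
def null_runs(calls, min_run=3):
    """Return inclusive (start, end) index pairs for runs of at least
    `min_run` consecutive null genotype calls (represented as None)."""
    runs = []
    start = None
    for i, call in enumerate(calls):
        if call is None:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last probe.
    if start is not None and len(calls) - start >= min_run:
        runs.append((start, len(calls) - 1))
    return runs

def mendelian_inconsistent(father, mother, child):
    """True if the child's genotype cannot be formed by taking one
    allele from each parent (genotypes are two-letter strings)."""
    possible = {"".join(sorted(a + b)) for a in father for b in mother}
    return "".join(sorted(child)) not in possible

# Hypothetical genotype calls along one chromosome; None marks a failed call.
genotypes = ["AA", "AG", None, None, None, "GG", None, "AG",
             None, None, None, None]
candidate_regions = null_runs(genotypes)
print(candidate_regions)

# A child typed as AA with an AA father and a GG mother is inconsistent,
# which would be expected if the mother transmitted a deleted allele.
print(mendelian_inconsistent("AA", "GG", "AA"))
```

Candidate regions found this way would still need to be distinguished from ordinary genotyping failures, which is one reason independent validation matters.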
They independently reported about 600 potential deletions, some of which were smaller than 100 bp.
The relatively small size of the identified variants, compared with those found by array-CGH, is due to the high resolution of the platform.
An SNP-centric array platform can also be used to identify linkage disequilibrium (LD) between structural variants and nearby SNPs in a given population.
However, discrepancies between the deletions identified in the two studies were also noted, despite the use of similar HapMap populations and identification methods (Eichler, 2006).
Recently, a comprehensive CNV analysis was reported based on two high-resolution array platforms: Whole Genome TilePath (WGTP), which used 26,000 large-insert clones, and the Affymetrix GeneChip Human Mapping 500K early access array, which used 500,000 SNP oligonucleotides.
The study identified about 1500 copy number variable regions (CNVRs), genomic segments consisting of overlapping CNVs, from 269 HapMap individuals (Redon et al., 2006).
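The notion of a region built from overlapping CNVs can be illustrated with a simple interval merge. The sketch below assumes that any overlap between calls on the same chromosome places them in the same region; the exact merging criteria used in the cited study may differ, and the coordinates and function name are hypothetical.

```python
def merge_into_cnvrs(cnvs):
    """Merge overlapping (chrom, start, end) CNV calls into regions.

    Calls are sorted by chromosome and start position; a call that starts
    at or before the end of the current region extends that region."""
    regions = []
    for chrom, start, end in sorted(cnvs):
        if regions and regions[-1][0] == chrom and start <= regions[-1][2]:
            last = regions[-1]
            regions[-1] = (chrom, last[1], max(last[2], end))
        else:
            regions.append((chrom, start, end))
    return regions

# Hypothetical CNV calls pooled from several individuals.
cnv_calls = [
    ("chr1", 100_000, 250_000),
    ("chr1", 200_000, 400_000),
    ("chr1", 900_000, 950_000),
    ("chr2", 100_000, 180_000),
    ("chr1", 380_000, 420_000),
]
cnvrs = merge_into_cnvrs(cnv_calls)
print(cnvrs)
```

Because a region grows with every overlapping call, its boundaries can be wider than any single CNV it contains, which should be kept in mind when comparing region counts and sizes across studies.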
The results from the two platforms are worth comparing because they provide the highest currently achievable resolution and are often selected as primary platforms in many other studies.
Firstly, the CNVs that are identified from BAC-based array-CGH are generally larger than those from oligonucleotide-based arrays (230 kb and 80 kb of median size, respectively).
This overestimation of CNV size by BAC-based array-CGH, which has been frequently reported, is due to the large-insert clones that are used (Iafrate et al., 2006).
Secondly, the actual boundaries of structural variants cannot be determined through BAC-based array-CGH.
On the other hand, a more accurate determination of variant boundaries can be achieved through SNP-centric oligonucleotide-based arrays that have an extensive number of oligonucleotides.
The SNP-centric platform has the additional advantages of providing SNP genotype information, a potential source of variation that can be combined with large structural variants, and of detecting loss of heterozygosity (LOH) or segmental uniparental disomy (Bruce et al., 2005; Mei et al., 2000).
However, the SNP-centric platform also has its disadvantages.
In spite of the advanced resolution, the relatively low signal-to-noise ratio of oligonucleotide-based hybridization intensity, compared with large insert clone array, might result in higher false-positive rates.
Because most CNVs are subtle changes, this makes the results prone to misclassification of signal intensities and, consequently, to statistical errors.
It has also been pointed out that the SNP-centric array was originally designed for allelic discrimination and is not appropriate for CNV detection because of the biased genomic distribution and sequence composition of its spotted probes (McCarroll and Altshuler 2007d).
Recently proposed oligonucleotide-based array platforms have been designed for CNV detection specifically without sacrificing the advantage of high resolution, which can be a promising solution for CNV detection in the near future (Barrett et al., 2004).
In identifying CNVs in normal populations, one of the fundamental problems is the lack of a reference genome from which diploid states of sample DNA can be inferred.
Unlike array-CGH-based tumor studies, in which the normal DNA of the same individual can be used as a reference genome, no single DNA source can represent a standardized, universal genome in variant analysis.
Often, the pooled genome of several individuals has been used to represent the average genome, but the heterogeneity of the pooled population might affect the copy number inference step, as exemplified by the X chromosome.
Notably, Redon et al. and Komura et al. adopted pairwise comparison for accurate inference of copy number states at individual loci (Redon et al., 2006; Komura et al., 2006).
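The idea behind pairwise comparison can be sketched as follows: each sample is compared with every other sample in turn, and the resulting log2 ratios are summarized per probe. The use of the median as the summary, the sample names, and the intensity values are assumptions made for illustration and are not necessarily what the cited studies used.

```python
import math
from statistics import median

def pairwise_log2(intensities, sample):
    """For one sample, return the per-probe median of its log2 ratios
    against every other sample in the set."""
    others = [name for name in intensities if name != sample]
    n_probes = len(intensities[sample])
    profile = []
    for p in range(n_probes):
        ratios = [
            math.log2(intensities[sample][p] / intensities[other][p])
            for other in others
        ]
        profile.append(median(ratios))
    return profile

# Hypothetical intensities for three probes in four samples.
# s2 has a loss at the second probe; s4 has a gain at the third probe.
intensities = {
    "s1": [1000.0, 1000.0, 1000.0],
    "s2": [1000.0, 500.0, 1000.0],
    "s3": [1000.0, 1000.0, 1000.0],
    "s4": [1000.0, 1000.0, 2000.0],
}
profile_s2 = pairwise_log2(intensities, "s2")
print(profile_s2)
```

In this toy example the gain carried by s4 does not distort the profile of s2 at the third probe, because a single variant reference is outvoted by the other comparisons.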
In pairwise comparison, the hybridization intensities of one sample are compared with those of all other remaining samples as one large reference, and the diploid states of loci can be more accurately inferred from the multiple comparison results.
Clinical implications of CNVs and disease association study.
Despite recent technological developments in disease association studies based on genetic polymorphisms, little is still known about the effects of genetic polymorphisms on common complex diseases.
One of the ultimate goals in exploring CNVs is to systematically assess the association between such variants and disease.
Although it is unlikely that all CNVs in the human genome are associated with diseases, evidence of the association of CNVs and a wide spectrum of human diseases has rapidly accumulated.
Table 1 summarizes the CNVs that have been reported to be associated with diseases.
CNVs can affect disease susceptibility or individual differences in responses to drugs through alteration of gene expression.
The reports of Stranger et al. and Heidenblad et al. consistently showed positive correlations between DNA copy number and gene expression level (Stranger et al., 2007; Heidenblad et al., 2005).
If a CNV region contains transcriptional regulatory elements rather than protein-coding genes, it can still affect gene expression levels by changing transcriptional regulation or heterochromatin spread (Reymond et al., 2007).
Conclusion.
The genomic fraction that is occupied by CNVs is now estimated to be about 600 Mb, already exceeding that of single base-level variants.
It is likely that the number of CNVs and the genomic fraction that is affected by structural variants will continue to expand, and many of them will be used for more practical purposes, including disease association or population studies.
However, it should be remembered that the current CNV entries are plagued by substantial amounts of false-positive and false-negative results.
Only a small portion of them have been validated by independent methods.
To overcome this, it is necessary to improve scanning platforms, including optimizing experimental conditions and developing more reliable CNV calling algorithms.
In the meantime, individual researchers need to understand the characteristics of the available platforms and analytical techniques in order to use them properly and to interpret published results correctly.