@ewha-bio:239 / 33054-33057
=============Title==========
Copy Number Variations in the Human Genome: Potential Source for Individual Diversity and Disease Association Studies.
=============Cor Author==========
*Corresponding author: E-mail yejun@catholic.ac.kr, Tel +82-2-590-1214, Fax +82-2-596-8969. Accepted 11 March 2008
===========Author==========
Tae-Min Kim1, Seon-Hee Yim2 and Yeun-Jun Chung1,2*
1Department of Microbiology, 2Integrated Research Center for Genome Polymorphism, The Catholic University of Korea, Seoul 137-701, Korea
===========Keywords==========
Keywords: array-CGH, Copy number variation (CNV), Genome-wide association study (GWAS)
Keywords: chromosome, genome-wide linkage search, heritability, HDL cholesterol
Keywords: inbreeding coefficient, Mongolian population, STR, HWE, PIC
Keywords: haplotype, HapMap, Korean, LD, populations, SNP
===========Sub Heading==========
Abstract Introduction The definition of CNV The identification of CNVs using different platforms Clinical implications of CNVs and disease association study Conclusion Introduction Methods Results and Discussion Introduction Methods Results Discussion Introduction Methods Methods
==========Minor Heading===========
ASubjects, medical histories, genotyping, and measurement of HDL cholesterol Statistical analyses, heritability estimation, and variance component linkage analysis Participants Genotyping Estimating Hardy-Weinberg Equilibrium (HWE), Information Contents and Inbreeding Coefficients ASNP Selection DNA Samples Genotyping Statistical Analysis A
===========Main Text==========
Abstract.
The widespread presence of large-scale genomic variations, termed copy number variations (CNVs), has recently been recognized in phenotypically normal individuals.
Judging by the growing number of reports on CNVs, it is now evident that these variants contribute significantly to genetic diversity in the human genome.
Like single nucleotide polymorphisms (SNPs), CNVs are expected to serve as potential biomarkers for disease susceptibility or drug responses.
However, technical and practical concerns remain to be addressed.
In this review, we examine the current status of CNV databases and research, including the ongoing efforts of CNV screening in the human genome.
We also discuss the characteristics of platforms that are available at the moment and suggest the potential of CNVs in clinical research and application.
Introduction.
Traditionally, large-scale genomic variants that are visible in conventional karyotyping have been thought to be associated with early-onset, highly penetrant genetic disorders and to be incompatible with a normal, disease-free phenotype (Lupski, 1998; Stankiewicz and Lupski, 2002).
The construction of the 'reference genome' by the human genome sequencing project is based on the belief that human genome sequences are virtually identical, even in different individuals, except for well-known single nucleotide polymorphisms (SNPs) or size variants of tandem repeats such as mini- or microsatellites (variable number of tandem repeats or VNTR) (Przeworski et al., 2000).
This traditional concept has been recently challenged by the discovery that large structural variations are more prevalent than previously presumed (Check, 2005).
Using high-resolution whole-genome scanning technologies such as array-based comparative genomic hybridization (array-CGH), two groups of pioneering scientists have identified widespread copy number variations (CNVs) in apparently healthy, normal individuals (Iafrate et al., 2004; Sebat et al., 2004).
These findings suggest that our genome is more diverse than has ever been recognized, and subsequent studies have identified up to 11,000 CNVs across the whole genome (Tuzun et al., 2005; Hinds et al., 2006; Mills et al., 2006; McCarroll et al., 2006; Conrad et al., 2006; Sharp et al., 2005; Wong et al., 2007; de Smith et al., 2007).
Although the current understanding of CNVs is still limited for practical use and technical challenges still remain to be tackled, recent studies already have demonstrated the potential association of CNVs with various diseases, suggesting plausible functional significances and highlighting the promising utility of CNVs.
The current coverage of CNVs in the human genome (approximately 600 Mb, comprising 12% of the human genome) has already exceeded that of SNPs and is still increasing (Cooper et al., 2007).
These large-scale structural variants, in addition to SNPs, will serve as powerful resources for understanding human genetic variation and differences in susceptibility to various diseases.
This paper reviews the current knowledge and future perspectives of CNVs.
The definition of CNV.
Structural variations that involve large DNA segments can take various forms, such as duplication, deletion, insertion, inversion, and translocation.
Among them, DNA copy number variations larger than 1 kb are collectively termed CNVs.
Fig. 1 illustrates the concept of CNV.
Although a CNV can be a large, microscopically visible genomic variation, the term generally indicates a submicroscopic structural variation that is hardly detectable by conventional karyotyping (<3-5 Mb) (Freeman et al., 2006).
Smaller variations, such as small insertion-deletion (indel) polymorphisms, are not included in CNVs, although they comprise another large collection of over 400,000 variants in the human genome (Mills et al., 2006); neither are insertional polymorphisms of mobile elements such as Alu or L1 elements considered CNVs.
In the early stages of CNV discovery, a number of terms were proposed to describe them, e.g., large-scale copy number variants (LCV) (Iafrate et al., 2004), copy number polymorphism (CNP) (Sebat et al., 2004), and intermediate-sized variants (ISV) (Tuzun et al., 2005).
The current definition of CNV is also operational and can be modified with the advance of scanning resolution and coverage, and the availability of allele frequencies in a given population.
The identification of CNVs using different platforms.
Various scanning platforms and quality control methods have been used to identify CNV calls.
Because the choice of platforms has a great effect on the results, it is worth reviewing the characteristics of platforms to improve the understanding of CNVs.
The presence of CNVs in normal individuals was reported for the first time in 2004 independently by two groups led by Lee C. and Wigler M. (Iafrate et al., 2004; Sebat et al., 2004).
Both studies used two-dye array-CGH techniques that used clones of bacterial artificial chromosomes (BAC) or oligonucleotides (representational oligonucleotide microarray analysis, or ROMA).
They independently reported about 250 and 80 loci showing changes in copy number from 39 and 20 normal individuals, respectively.
Fig. 2 illustrates the general concept of CNV detection based on two-dye array-CGH.
Although the average numbers of CNVs per individual genome were similar in the two studies (about 12 CNVs per genome), it should be noted that there was little overlap between the results.
This discrepancy between studies was possibly due to the use of different platforms and experimental conditions in different populations.
However, it is also probable that there are still large numbers of structural variants that have yet to be discovered (Buckley et al., 2005; Eichler, 2006).
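The two-dye array-CGH principle referred to above can be illustrated with a minimal sketch. It assumes that each probe yields one test and one reference intensity and that copy number change is judged from the log2 ratio of the two. The function names and the thresholds of +0.3 and -0.3 are illustrative assumptions, not values taken from the cited studies.

```python
import math


def log2_ratio(test_intensity, reference_intensity):
    """Log2 ratio of test to reference hybridization intensity for one probe."""
    return math.log2(test_intensity / reference_intensity)


def call_copy_number(ratio, gain_threshold=0.3, loss_threshold=-0.3):
    """Classify one log2 ratio; the thresholds are hypothetical defaults."""
    if ratio >= gain_threshold:
        return "gain"
    if ratio <= loss_threshold:
        return "loss"
    return "neutral"


def call_probes(test, reference):
    """Classify every probe from paired test and reference intensities."""
    return [call_copy_number(log2_ratio(t, r)) for t, r in zip(test, reference)]


if __name__ == "__main__":
    # Three hypothetical probes: increased, unchanged and decreased signal.
    print(call_probes([150.0, 100.0, 50.0], [100.0, 100.0, 100.0]))
```

In practice, calls are usually made on segments of consecutive probes rather than on single probes, which is one reason why results depend on the platform and the calling procedure.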
A subsequent study that provided evidence of the widespread presence of large-scale structural variations in the human genome was based solely on in silico analysis (Tuzun et al., 2005).
The sequence-level comparison of two independent genome sequences, i.e., one derived from a human genome reference assembly and the other from fosmid clones of a genomic library, revealed about 300 structural variations, including inversions.
This method can detect various types of structural variants, including inversion, which is not detectable by conventional array-CGH platforms.
Indeed, the results of Tuzun et al. (2005) can be used as a validated control for primary verification or for parameter tuning in the development of CNV-detection platforms or algorithms.
Although the use of this method is currently limited by the unavailability of sequence data, ongoing efforts to sequence individual human genomes and to develop cost-effective sequencing platforms (Bennett et al., 2005) should facilitate sequence-level genome comparisons and the identification of high-quality structural variants in the near future.
Two studies by McCarroll et al. and Conrad et al., which focused on the identification of deletion variants (McCarroll et al., 2006; Conrad et al., 2006), used 1.2 million SNP genotyping data from The International HapMap Consortium (International HapMap Consortium, 2005).
They assumed that an allelic deletion causes the affected probes to fail, or to be discarded, in SNP genotyping.
For example, runs of consecutive probes with null genotype calls, or runs of SNP genotypes that deviate from expected Hardy-Weinberg equilibrium ratios or expected Mendelian inheritance patterns, might indicate deleted loci.
The two studies independently reported about 600 potential deletions, some smaller than 100 bp.
The relatively small size of the identified variants, compared with those found by the array-CGH method, is due to the high resolution of the platforms.
An SNP-centric array platform can also be used to identify linkage disequilibrium (LD) between structural variants and nearby SNPs in a given population.
However, discrepancies between the deletions identified in the two studies were also noted, despite the use of similar HapMap populations and identification methods (Eichler 2006).
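One of the signals described above, a run of consecutive probes with null genotype calls, can be located with a simple scan. The following is a minimal sketch, not the procedure used by McCarroll et al. or Conrad et al.: it assumes that genotype calls are ordered by genomic position, that a failed call is represented by `None`, and that a run of at least `min_run` failed calls is worth reporting. The function name and the default of three probes are hypothetical.

```python
def find_null_runs(calls, min_run=3):
    """Return (start, end) index pairs, inclusive, for runs of failed calls.

    `calls` is a position-ordered list of genotype calls in which None marks
    a probe with no genotype call. Runs shorter than `min_run` are ignored.
    """
    runs = []
    start = None
    for i, call in enumerate(calls):
        if call is None:
            if start is None:
                start = i  # a new run of null calls begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last probe.
    if start is not None and len(calls) - start >= min_run:
        runs.append((start, len(calls) - 1))
    return runs


if __name__ == "__main__":
    calls = ["AA", None, None, None, "AB", None, "BB", None, None, None, None]
    print(find_null_runs(calls))
```

A candidate region found in this way would still need support from other evidence, such as Mendelian inconsistencies within families, because isolated failed calls also arise from ordinary genotyping noise.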
Recently, a comprehensive CNV analysis was reported based on two high-resolution array platforms: Whole Genome TilePath (WGTP), which used 26,000 large insert clones, and the Affymetrix GeneChip Human Mapping 500K early access array, which used 500,000 SNP oligonucleotides.
The authors identified about 1500 genomic segments as copy number variable regions (CNVRs), each consisting of overlapping CNVs, from 269 HapMap individuals (Redon et al., 2006).
The results from the two platforms are worth comparing because they provide the highest currently achievable resolution and are often selected as primary platforms in many other studies.
Firstly, the CNVs that are identified from BAC-based array-CGH are generally larger than those from oligonucleotide-based arrays (230 kb and 80 kb of median size, respectively).
This overestimation of CNV size by BAC-based array-CGH, which has been frequently reported, is due to the large insert clones that are used (Iafrate et al., 2006).
Secondly, the actual boundaries of structural variants cannot be determined through BAC-based array-CGH.
On the other hand, a more accurate determination of variant boundaries can be achieved through SNP-centric oligonucleotide-based arrays that have an extensive number of oligonucleotides.
The SNP-centric platform has the additional advantages of providing SNP genotype information as a potential variant source alongside large structural variants, and of detecting loss of heterozygosity (LOH) or segmental uniparental disomy (Bruce et al., 2005; Mei et al., 2000).
However, the SNP-centric platform also has its disadvantages.
In spite of the advanced resolution, the relatively low signal-to-noise ratio of oligonucleotide-based hybridization intensity, compared with large insert clone array, might result in higher false-positive rates.
Because most CNVs are subtle changes, this makes the results prone to misclassification of signal intensities and, consequently, to statistical errors.
It has also been pointed out that the SNP-centric array was originally designed for allelic discrimination and is not appropriate for CNV detection because of the biased genomic distribution and sequence composition of the spotted probes (McCarroll and Altshuler 2007d).
Recently proposed oligonucleotide-based array platforms have been designed for CNV detection specifically without sacrificing the advantage of high resolution, which can be a promising solution for CNV detection in the near future (Barrett et al., 2004).
In identifying CNVs in normal populations, one of the fundamental problems is the lack of a reference genome from which diploid states of sample DNA can be inferred.
Unlike the array-CGH-based tumor study in which the normal DNA of the same individual can be used as a reference genome, no single DNA source can present the standardized and universal genome in variant analysis.
Often, the pooled genome of several individuals has been used to represent the average genome, but the heterogeneity of the pooled population might affect the copy number inference step, as shown, for example, for the X chromosome.
Notably, Redon et al. and Komura et al. adopted pairwise comparison for accurate inference of copy number states at individual loci (Redon et al., 2006; Komura et al., 2006).
In pairwise comparison, the hybridization intensities of one sample are compared with those of all the remaining samples as one large reference, and the diploid states of loci can be inferred more accurately from the multiple comparison results.
Clinical implications of CNVs and disease association study.
Despite recent technological developments in genetic polymorphism-based disease association studies, little is still known about the effects of genetic polymorphisms on common complex diseases.
One of the ultimate goals in exploring CNVs is to systematically assess the association between such variants and disease.
Although it is unlikely that all CNVs in the human genome are associated with diseases, evidence of the association of CNVs and a wide spectrum of human diseases has rapidly accumulated.
Table 1 summarizes the CNVs that have been reported to be associated with diseases.
CNVs can affect disease susceptibility or individual differences in responses to drugs through alteration of gene expression.
The reports of Stranger et al. and Heidenblad et al. consistently showed positive correlations between DNA copy number and gene expression level (Stranger et al., 2007; Heidenblad et al., 2005).
If a CNV region contains transcriptional regulatory elements rather than protein-coding genes, it can still affect gene expression levels by changing transcriptional regulation or heterochromatin spread (Reymond et al., 2007).
Conclusion.
The genomic fraction that is occupied by CNVs is now estimated to be about 600 Mb, already exceeding that of single base-level variants.
It is likely that the number of CNVs and the genomic fraction that is affected by structural variants will continue to expand, and many of them will be used for more practical purposes, including disease association or population studies.
However, it should be remembered that the current CNV entries are plagued by substantial amounts of false-positive and false-negative results.
Only a small portion of them have been validated by independent methods.
To overcome this, it is necessary to improve scanning platforms, including optimizing experimental conditions and developing more reliable CNV calling algorithms.
In the meantime, individual researchers need to understand the characteristics of the available platforms and analytical techniques in order to use them and to interpret published results properly.
We found peak evidence of linkage (LOD score=1.88) for HDL cholesterol level on chromosome 6 (nearest marker D6S1660) and potential evidence of linkage on chromosomes 1, 12 and 19 with LOD scores of 1.32, 1.44 and 1.14, respectively.
These results should pave the way for the discovery of the relevant genes by fine mapping and association analysis.
Introduction.
Cholesterol is a major part of cell membranes.
Cholesterol is carried in the blood by chylomicrons, very low density lipoproteins (VLDL), high density lipoproteins (HDL) and low density lipoproteins (LDL) (Dastani et al. 2006).
HDL cholesterol is inversely associated with cardiovascular disease, and is more tightly controlled by genetic factors than other lipoproteins such as LDL, VLDL and chylomicrons.
Environmental factors including chronic alcoholism, estrogen replacement therapy, and exercise influence the levels of HDL cholesterol.
Several families with strikingly elevated HDL cholesterol levels have been identified.
HDL cholesterol levels are higher in blacks than in whites, and higher in females than in males (Barcat et al. 2006; Brousseau et al. 2004; Yamashita et al. 2000; Imperatore et al. 2000).
Candidate gene analysis using population-based case-control studies has been used to test the association between SNPs and HDL cholesterol levels.
Among the candidate genes selected mainly from lipid metabolism pathways, the ApoA-I gene is the most intensively studied (Inazu et al. 1994; Kuivenhoven et al. 1997).
Genome-wide linkage analysis can identify susceptibility genes even when they are not candidates based on lipid metabolism.
Genome-wide linkage scans are conducted by use of microsatellite markers to identify genetic determinants affecting the traits (Wang and Paigen 2005).
Using HDL cholesterol levels as either a discrete or a quantitative trait, several linkage studies on genetic determinants of HDL cholesterol have been reported (Yancey et al. 2003).
Genetic effects on the variations in HDL cholesterol were studied mainly in Caucasians and Africans thus far, and little attention has been focused in this regard on Asian populations.
Here we report suggestive evidence of linkage for HDL cholesterol on chromosomes 6, 1, 12 and 19, found as part of the GENDISCAN study, a large epidemiological study of complex traits in geographically, culturally and genetically isolated large Mongolian families in Dornod, Mongolia.
Methods.
We analyzed data from 1002 Mongolian individuals from 95 large extended families.
Informed consent was obtained from all subjects prior to participation and the protocol was approved by the Institutional Review Board at Seoul National University.
Potentially confounding variables were assessed for each participant along with overall medical history.
Information on age, gender and anthropometry (height, weight, waist circumference, hip circumference and body fat content) was obtained for each individual.
Height in centimeters (cm) and weight in kilograms (kg) were measured using an automatic measuring instrument (IMI 1000, Immanuel Elec., Korea).
Body mass index (BMI) was calculated in kg/m².
Waist circumference was measured to the nearest centimeter at the level of the umbilicus, and hip circumference was measured at the level of the maximal circumference of the gluteus.
All other variables were collected through interviews performed by trained interviewers.
Information about alcohol consumption and smoking was also obtained from all the participants.
All the subjects were asked to fast for 12 hours before their visit.
Blood samples were collected from an antecubital vein into vacutainer tubes containing EDTA.
Blood samples were centrifuged at 3,000 rpm for 10 minutes and then stored at -70°C.
DNA was isolated from lymphocytes for polymerase chain reaction (PCR) and automated genotyping.
A 10 ml blood sample was collected from each participating individual for genomic DNA extraction.
DNA was extracted from peripheral lymphocytes using the PUREGENE DNA Purification Kit for whole blood (Gentra Systems Inc, USA).
For genotyping, the deCODE mapping set of 1,000 microsatellite markers (deCODE genetics, USA) was used, covering the genome at an average density of 3 centimorgans (cM).
HDL cholesterol was measured by the enzymatic method using Cholestest-N-HDL kit (DAICHI, JAPAN) and HITACHI 7600-210 & HITACHI 7180 instruments.
Extensive quality control procedures ensured the validity and reproducibility of the measurements.
Multiple linear regression analysis was performed with PC SAS version 8.2 and PC SPSS version 12 to account for the effects of confounding variables.
Pedigree data were managed with PedSys (Southwest Foundation for Biomedical Research, San Antonio, Texas, USA).
Nonpaternity was examined using PEDCHECK (McPeek and Sun 2000), and relationships other than paternity were checked using the average IBD-based method implemented in PREST.
After pedigree errors and Mendelian errors had been corrected, non-Mendelian errors were examined and corrected using SimWalk.
Identity-by-descent (IBD) matrices for every relative pair within each family, including single-marker IBD matrices, were calculated with SOLAR (Sequential Oligogenic Linkage Analysis Routines, version 2.1.4).
Multipoint IBD matrices were computed at 1 cM intervals using the Markov chain Monte Carlo method implemented in LOKI (Heath 1997).
Genetic components of selected phenotypes were estimated in terms of heritability.
Narrow sense heritability, defined as the proportion of total phenotypic variation due to additive genetic effects, was calculated.
The heritability of HDL cholesterol was estimated after adjustment for age, gender, age squared, the product of age and gender, the product of age squared and gender, systolic BP, smoking and alcohol.
A variance component linkage analysis was then carried out with SOLAR, which uses maximum likelihood methods to estimate variance components for the polygenic genetic effect and random individual environmental effects.
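For reference, the quantities named above can be written out under the standard variance component model. This is a sketch in generic notation rather than a formula given in the original text: the symbols $\sigma^2_q$, $\sigma^2_g$, $\sigma^2_e$, $\hat{\Pi}$, $\Phi$ and $I$ are introduced here for illustration.

```latex
% Narrow-sense heritability: additive genetic variance over total phenotypic variance
h^2 = \frac{\sigma^2_g}{\sigma^2_P}, \qquad \sigma^2_P = \sigma^2_g + \sigma^2_e

% Phenotypic covariance matrix within a pedigree under the linkage model:
% \hat{\Pi} = estimated IBD sharing at the tested locus, \Phi = kinship matrix,
% I = identity matrix
\Omega = \hat{\Pi}\,\sigma^2_q + 2\Phi\,\sigma^2_g + I\,\sigma^2_e

% LOD score comparing the linkage model with the polygenic (null) model, \sigma^2_q = 0
\mathrm{LOD} = \log_{10}\frac{L(\hat{\sigma}^2_q,\hat{\sigma}^2_g,\hat{\sigma}^2_e)}{L(\sigma^2_q = 0,\hat{\sigma}^2_g,\hat{\sigma}^2_e)}
```

Covariates such as age and gender enter through the mean of the trait, so the heritability estimated here refers to the variance remaining after covariate adjustment.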
Results and Discussion.
The mean age of the 1002 individuals was 31 years and 54.5% of them were female.
Demographic and pedigree characteristics of the study sample are shown in Table 1.
The family size had a mean of 16.
Table 2 includes information on 2546 pairs of first-degree relatives (1812 parent-offspring pairs and 734 full-sib pairs), 2485 pairs of second-degree relatives (395 half-sibling pairs, 1202 grandparent-grandchild pairs, and 888 avuncular pairs), and 598 first-cousin pairs.
Means of their total cholesterol, HDL cholesterol, LDL cholesterol, and triglyceride were 159.82 mg/dl, 55.19 mg/dl, 90.51 mg/dl, and 63.30 mg/dl, respectively.
Table 3 shows correlation between HDL cholesterol and covariates such as age, gender, systolic blood pressure, alcohol consumption status, and smoking status.
These parameters were used as covariates in the variance component analysis which provided multivariable adjusted heritability estimates for HDL cholesterol of 0.45 (Table 4).
The peak multipoint LOD score was 1.88 on 6p21 (nearest marker D6S1660) and a secondary peak (LOD score of 1.44) was found on 12q23 (nearest marker D12S354).
We identified other potential evidence of linkage, with a LOD score of 1.32 on 1q24 (nearest marker D1S412) and a LOD score of 1.14 on 19p13 (nearest marker D19S884) (Fig. 1, 2).
Table 5 presents all LOD scores ≥1.0 for HDL cholesterol.
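To put the reported LOD scores on a more familiar scale, a LOD score can be converted to an approximate pointwise p-value. The sketch below relies on a standard asymptotic result for variance component linkage tests, which is an assumption added here and not a calculation described in the text: the likelihood ratio statistic 2 ln(10) x LOD is treated as a 50:50 mixture of a point mass at zero and a chi-square distribution with one degree of freedom. The function names are hypothetical.

```python
import math


def lod_to_chi2(lod):
    """Likelihood ratio statistic corresponding to a LOD score."""
    return 2.0 * math.log(10.0) * lod


def lod_to_pvalue(lod):
    """Approximate pointwise p-value for a non-negative LOD score.

    Uses the one-sided mixture approximation: half of the upper tail
    probability of a chi-square variable with 1 degree of freedom, where
    P(chi2_1 > x) = erfc(sqrt(x / 2)).
    """
    if lod < 0:
        raise ValueError("LOD score must be non-negative")
    chi2 = lod_to_chi2(lod)
    return 0.5 * math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    for lod in (1.14, 1.32, 1.44, 1.88):
        print(f"LOD {lod:.2f} -> pointwise p ~ {lod_to_pvalue(lod):.4f}")
```

These are pointwise values and do not account for the number of positions tested across the genome, which is why peaks of this size are described as potential or suggestive evidence.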
We identified potential evidence of linkage on several chromosomes.
In other genome scans, weak linkage signals for HDL cholesterol were observed in regions that overlapped slightly with the regions identified herein.
Klos et al. reported a linkage peak on chromosome 12q in a European American population (Klos et al. 2001) (Table 6).
We found evidence of link-
the population isolates used in the GENDISCAN study would not present significant inflation of type I errors from inbreeding effects in its gene discovery analysis.
Introduction.
The GENDISCAN (Gene Discovery for Complex traits in Asian population of Northeast area) study was launched in 2002 in order to elucidate genetic causes of complex diseases.
This study attempted to incorporate designs that detect genetic signals with increased efficiency.
These included using a genetically homogeneous population, recruiting large families, and considering quantitative phenotypes as well as disease outcomes (Peltonen et al., 2001; Merikangas et al., 2003).
Large extended families still remaining in Northeast Asia enabled the project to adopt these designs.
Although there is no doubt that gene discovery for common complex diseases is a research priority, successful results have been very limited (Grant et al., 2006).
The difficulty of replication across studies mandates the use of internally valid study designs and proper methodologies.
Using population isolates generally confers the advantage of increasing genetic homogeneity.
However, population isolates might have inbreeding structures that violate the basic assumptions of Hardy-Weinberg equilibrium (HWE).
The presence of significant inbreeding necessitates modifications in genetic estimations using the population.
Therefore, we attempted to estimate the status of HWE, and inbreeding coefficients in two ethnic groups of Mongolia using genome-wide short tandem repeat (STR) genetic markers.
Compatibility with the basic assumptions of population genetics can support the methodological validity of the overall GENDISCAN study.
Methods.
The GENDISCAN study included non-selected families in Mongolia.
The People's Republic of Mongolia (not including the Chinese territory) has a population of 2.6 million, comprising more than 20 ethnic groups.
The Orkhontuul area in Selenge Imag (an Imag is an administrative district unit in Mongolia, corresponding to a state in the United States) and the Dashbalbar area in Dornod Imag were selected.
The Orkhontuul area has a population of 3,760 people, mainly of the Khalkha tribe, and maintains a semi-urban lifestyle.
The Dashbalbar area is inhabited mainly by about 4,000 people of Buryat ethnicity and has a more traditional nomadic lifestyle.
Many large extended families that fit the purposes of the GENDISCAN study still remain in both areas.
Genomic DNA was extracted from peripheral leukocytes.
The Orkhontuul samples (2004, n=1,080) were genotyped using the Applied Biosystems Inc. platform (ABI Prism Linkage Mapping Set version 2.5 medium density, 400 markers) with an average resolution of 10 cM, and the Dashbalbar samples (2006, n=1,020) were genotyped using the deCODE 1,000 STR marker platform with an average resolution of 3 cM.
For the Orkhontuul participants, markers on chromosome 14 were analyzed.
For the Orkhontuul data, markers with a low call rate (49 markers), markers with genotype error rates above 1% (16 markers) and markers on the X chromosome (18 markers) were excluded.
For the Dashbalbar genotype data, the 1,000 STR marker platform originally provided 1,097 markers; we excluded markers on the X chromosome (49 markers) and markers with a low call rate and genotype error rates above 1% (4 markers).
All participants provided informed consent.
HWE and the degree of inbreeding were assessed using the founders of each pedigree.
Non-founders were excluded because their genotypes are dependent on those of the founders.
HWE was estimated by comparing the expected and observed genotype frequencies.
Expected genotype frequency was calculated from allele frequency.
Chi-square goodness of fit test was used to determine whether HWE assumption was met.
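The chi-square statistic ($\chi^2$) for a multi-allelic locus is defined in Equation 1, with k(k-1)/2 degrees of freedom, where k is the total number of alleles.
$$\chi^2 = \sum_{u} \frac{(n_{uu} - n p_u^2)^2}{n p_u^2} + \sum_{u<v} \frac{(n_{uv} - 2 n p_u p_v)^2}{2 n p_u p_v} \quad \text{(Equation 1)}$$
where $n_{uu}$ and $n_{uv}$ denote the observed counts of homozygous and heterozygous genotypes, $n$ is the total number of genotyped individuals, and $p_u$ and $p_v$ denote the frequencies of alleles u and v.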
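The goodness-of-fit test for HWE at a multi-allelic locus can be sketched as follows. This is an illustrative implementation under stated assumptions and not the SAS/Genetics procedure used in the study: genotypes are given as pairs of allele labels, allele frequencies are estimated from the same sample, and the function name `hwe_chi_square` is hypothetical. The function returns the statistic and its degrees of freedom, k(k-1)/2 for k alleles.

```python
from collections import Counter
from itertools import combinations_with_replacement


def hwe_chi_square(genotypes):
    """Chi-square goodness-of-fit statistic for HWE at one locus.

    `genotypes` is a list of (allele, allele) pairs, one per individual.
    Returns (chi2, degrees_of_freedom).
    """
    n = len(genotypes)
    allele_counts = Counter(a for g in genotypes for a in g)
    freqs = {a: c / (2 * n) for a, c in allele_counts.items()}
    # Store each genotype with its alleles sorted so (A, B) and (B, A) match.
    observed = Counter(tuple(sorted(g)) for g in genotypes)

    chi2 = 0.0
    for u, v in combinations_with_replacement(sorted(freqs), 2):
        if u == v:
            expected = n * freqs[u] ** 2          # homozygote: n * p_u^2
        else:
            expected = 2 * n * freqs[u] * freqs[v]  # heterozygote: 2n * p_u * p_v
        chi2 += (observed.get((u, v), 0) - expected) ** 2 / expected

    k = len(freqs)
    return chi2, k * (k - 1) // 2


if __name__ == "__main__":
    # Hypothetical sample with no heterozygotes at all.
    sample = [("A", "A")] * 50 + [("B", "B")] * 50
    print(hwe_chi_square(sample))
```

For STR markers with many rare alleles, some expected counts become very small, so pooling rare alleles or using an exact or permutation test may be preferable to the asymptotic chi-square distribution.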
Information contents of the genetic markers were estimated as polymorphism information content (PIC), heterozygosity and allelic diversity.
PIC is an index of the amount of information, which modifies the simple heterozygosity index by adjusting for the chance of mating between the same heterozygotic genotypes.
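PIC was calculated from Equation 2.
$$\mathrm{PIC} = 1 - \sum_{u=1}^{k} p_u^2 - \sum_{u=1}^{k-1} \sum_{v=u+1}^{k} 2 p_u^2 p_v^2 \quad \text{(Equation 2)}$$
where $p_u$ and $p_v$ denote the frequencies of alleles u and v (Czika, 2005).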
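As a worked illustration of how PIC relates to simple heterozygosity, the sketch below computes both from a list of allele frequencies, using the standard definition PIC = 1 - sum(p_u^2) - sum over pairs u<v of 2 p_u^2 p_v^2. The function names are hypothetical and the example frequencies are not taken from the study.

```python
from itertools import combinations


def expected_heterozygosity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content for one marker.

    Subtracts from the heterozygosity the probability of matings between
    identical heterozygotes, which are uninformative.
    """
    correction = sum(2.0 * p * p * q * q for p, q in combinations(freqs, 2))
    return expected_heterozygosity(freqs) - correction


if __name__ == "__main__":
    # A hypothetical marker with four equally frequent alleles.
    freqs = [0.25, 0.25, 0.25, 0.25]
    print(expected_heterozygosity(freqs), pic(freqs))
```

PIC is always lower than the heterozygosity of the same marker, and the gap narrows as the number of alleles increases, which is why multi-allelic STR markers tend to have high PIC values.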
Inbreeding was estimated by the deviation from the assumption that each founder shares no Identity by descent (IBD).
Generally, the genotype frequencies of a bi-allelic locus with allele frequencies p and q are predicted to be $p^2$, $2pq$ and $q^2$, respectively, under HWE.
However, if founders share alleles IBD with probability F, the above predictions can be rewritten as Equation 3.
$$p^2 + pqF, \quad 2pq(1 - F), \quad q^2 + pqF \quad \text{(Equation 3)}$$
where F denotes the inbreeding coefficient (Gillespie et al., 2004).
In brief, inbreeding is characterized by an excess of homozygotes over the expected level.
The inbreeding coefficient can be estimated as Equation 4 by solving Equation 3.
$$F = 1 - \frac{H}{2pq} \quad \text{(Equation 4)}$$
where H denotes the observed proportion of heterozygotes, and 2pq denotes the proportion of heterozygotes expected from the allele frequencies (Hart et al., 2000).
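The relationship between heterozygote deficit and the inbreeding coefficient can be checked numerically. The sketch below is illustrative only: it generates genotype frequencies for a bi-allelic locus under a given F (p^2 + pqF, 2pq(1 - F), q^2 + pqF) and then recovers F as 1 - H/H_expected, where the expected heterozygosity is generalized to 1 - sum(p_u^2) so that multi-allelic markers can be handled. The function names are hypothetical.

```python
def genotype_frequencies(p, f):
    """Genotype frequencies at a bi-allelic locus with inbreeding coefficient f.

    Returns (homozygote 1, heterozygote, homozygote 2).
    """
    q = 1.0 - p
    return (p * p + p * q * f, 2.0 * p * q * (1.0 - f), q * q + p * q * f)


def inbreeding_coefficient(observed_het, allele_freqs):
    """Estimate F as 1 - H_observed / H_expected.

    For two alleles, H_expected equals 2pq.
    """
    expected_het = 1.0 - sum(p * p for p in allele_freqs)
    if expected_het == 0.0:
        raise ValueError("monomorphic marker: expected heterozygosity is zero")
    return 1.0 - observed_het / expected_het


if __name__ == "__main__":
    # Simulate a locus with p = 0.3 and F = 0.1, then recover F.
    hom1, het, hom2 = genotype_frequencies(0.3, 0.1)
    print(inbreeding_coefficient(het, [0.3, 0.7]))
```

In a genome-wide setting, per-marker estimates of this kind are typically averaged over many markers, since a single marker gives a noisy estimate.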
HWE tests and estimates of expected and observed heterozygosity were obtained using the SAS/Genetics program.
Results.
The demographic characteristics of the genotyped subjects are shown in Table 1.
There were 280 founders (99 men and 181 women) in the Orkhontuul population and 142 founders (90 men and 52 women) in the Dashbalbar population.
Non-founders' genotypes were excluded, since they do not independently contribute to the gene pool.
The information content of single markers in terms of PIC ranged between 0.2 and 0.9, as shown in Fig. 1.
The average PIC was 0.72 and 0.71 for the Orkhontuul and Dashbalbar populations, respectively, which is relatively high for single-marker information content.
There was no significant difference in PIC across the chromosomes or populations.
The high PIC level enabled accurate estimation of other population genetic parameters.
HWE was satisfied for 88.6% and 94.2% of all markers in the Orkhontuul and Dashbalbar populations, respectively, at a significance level of 0.05.
At a significance level of 0.01, 90.5% and 95.3% of all markers were in HWE.
All the markers, including those that were not in HWE, were used for estimating the inbreeding coefficients.
The inbreeding coefficient was estimated to be 0.0023 and 0.0021 in the Orkhontuul and Dashbalbar populations, respectively.
Discussion.
Population isolates are generally considered to be among the most suitable populations for genetic studies (Pajukanta et al., 2003; Arcos-Burgos et al., 2002; Escamilla et al., 2001).
However, possible inbreeding can cause deviation from general assumptions on which most analyses depend.
The presence of inbreeding can be problematic because, if it exists, the genetic relationships between unrelated as well as related persons could be underestimated.
This underestimation of IBD can result in inflation of type I errors in linkage analysis (Hossjer et al., 2006; Nomura et al., 2005), linkage disequilibrium estimations and haplotype reconstructions (Zhang et al., 2004).
The inbreeding coefficient found in this study (about 0.2% in each population) does not necessitate any adjustment for genetic analyses such as IBD calculation, classic or non-parametric linkage analysis, and variance component-based linkage analysis.
In terms of the last common ancestor, an inbreeding coefficient of 0.2% corresponds to 10 or 11 generations (Jensen-Seaman et al., 2001; Santos-Lopes et al., 2007).
In this study, both ABI and deCODE STR markers were genotyped with a standardized procedure, and any markers with genotype error rates above 1% were discarded.
The genotype errors were confirmed within the pedigree structure.
Any Mendelian inconsistency was deleted and markers with possible double-recombination were also deleted.
Generally, genotyping in family-based study is more accurate than in studies using individuals only.
Thus, it is unlikely that genotype errors biased our findings.
In conclusion, we have estimated inbreeding coefficients in two population isolates in Mongolia.
We found that they fall within a negligible range, allowing related genetic studies to be performed without any modification or adjustment for possible inbreeding effects.
This finding validates the ability of the GENDISCAN study to add to the growing body of evidence that associates specific genetic variations with complex disorders.
% (6.4 of 34.5 Mb) of chromosome 22 with 757 tagSNPs and 815 haplotypes (frequency 5.0%).
Of 3430 common SNPs genotyped in all five populations, 514 were monomorphic in Koreans.
The CHB + JPT samples have more than a 72% overlap with the monomorphic SNPs in Koreans, while the CEU + YRI samples have less than a 38% overlap.
The patterns of hot spots and LD blocks were dispersed throughout chromosome 22, with some common blocks among populations, highly concordant between the three Asian samples.
Analysis of the distribution of chimpanzee-derived allele frequency (DAF), a measure of genetic differentiation, Fst levels, and allele frequency difference (AFD) among Koreans and the HapMap samples showed a strong correlation between the Asians, while the CEU and YRI samples showed a very weak correlation with Korean samples.
Relative distance as a quantitative measurement based upon DAF, Fst, and AFD indicated that all three Asian samples are very proximate, while CEU and YRI are significantly remote from the Asian samples.
Comparative genome-wide LD studies provide useful information on the association studies of complex diseases.
Introduction.
Vast amounts of information on single nucleotide polymorphisms (SNPs) and progress in high-throughput genotyping technology have generated a great deal of interest in establishing genome-wide linkage disequilibrium (LD) maps for genetic studies of complex traits (Chakravarti 2001; The International HapMap Consortium 2003; Myers and Bottolo 2005).
LD is known to occur in a block-like structure across the genome, with conserved haplotype blocks of tens to hundreds of kilobases punctuated by "hot spots" of recombination (Daly et al. 2001).
Since the concept of whole-genome association studies using SNPs was introduced (Risch and Merikangas 1996), the optimal number of SNPs required for association studies has been the center of extensive debate (Kruglyak 1999).
Initial studies have focused on average LD levels and the variability in processes that generate LD (Cardon and Abecasis 2003).
Although a single chromosome could carry many haplotypes in LD blocks, recent studies suggest that haplotypic variation may be much lower than previously imagined (Jeffreys et al. 2001; Patil et al. 2001; Gabriel et al. 2002).
Patil's group identified haplotype blocks on chromosome 21 for which over 80% of chromosomes were represented by a few common haplotypes (Patil et al. 2001).
In the analysis of human chromosome 22 with a marker density of one SNP per 15 kb, Dawson's group reported a highly variable pattern of LD along the chromosome, in which extensive regions of complete LD of up to 804 kb in length were interspersed with regions of no detectable LD (Dawson et al. 2002).
Although differences in LD patterns between populations have been reported (Abecasis et al. 2002; Reich et al. 2001; Zavattari et al. 2002), little information is available on the haplotype structure in different populations other than the recent study by Gabriel et al. (2002).
On the other hand, haplotype analysis has been widely employed in linkage studies for narrowing down the location of disease susceptibility genes (Zhang et al. 2004; Park 2007).
The International HapMap Project was launched to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation among four population samples: 30 trios from Yoruba in Ibadan, Nigeria (YRI), 45 unrelated Japanese in Tokyo, Japan (JPT), 45 unrelated Han Chinese in Beijing, China (CHB), and 30 trios in a Utah, US population with Northern and Western European ancestry (CEU) from the CEPH collection (The International HapMap Consortium 2003; 2004; 2007).
As the International HapMap Project releases a validated SNP map of 1 marker per kb for the HapMap samples, the general applicability of the HapMap data needs to be confirmed in samples from related populations.
Recent comparative studies of LD patterns have shown a high degree of concordance among various populations (Gabriel et al. 2002; Shifman et al. 2003; Stenzel et al. 2004; Mueller et al. 2005).
As the HapMap samples include Japanese and Chinese, it was our interest to test whether significant differences in LD exist between Koreans and the two other Asian samples.
In this paper, we measured the LD pattern along chromosome 22 in Korean samples and compared the Korean data with those of the four HapMap samples.
We were interested in exploring how the HapMap data could be used to estimate the genomic structure of Koreans.
We expect that this study will contribute to the development of proper strategies for association studies of common complex diseases in Koreans using the HapMap data.
Methods.
A total of 111,448 reference SNPs from chromosome 22 in the dbSNP (http://www.ncbi.nlm.nih.gov/SNP, build 116) were collected.
To maximize cost effectiveness of genotyping, SNPs were selected based on the following criteria: 1) markers with even spacing, 2) verified SNPs, 3) coding SNPs.
The SNPs were scored for the selection of the study using the following strategies.
First, it was most important in mapping chromosomal LD blocks to have relatively equal spaces between SNP markers.
Second, verified SNP markers (validation status was scored as 0 to 4 in the dbSNP) that had higher scores were chosen to prevent or reduce genotyping failure.
Also, repeated sequence regions were excluded by repeat masking with Primer3 software (Rozen and Skaletsky 2000).
Third, protein-coding SNPs were given higher scores so that the marker set would be useful for further study.
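The three selection criteria can be sketched as a simple windowed scoring procedure. This is only an illustration under stated assumptions: the window size, the one-point bonus for coding SNPs, and all names (`Snp`, `select_snps`, `coding_bonus`) are hypothetical, and the actual weighting used for marker selection is not given in the text.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    rs_id: str
    position: int      # chromosomal position in bp
    validation: int    # dbSNP validation status, scored 0 to 4
    coding: bool       # True for protein-coding SNPs


def select_snps(snps, window_bp, coding_bonus=1):
    """Pick the highest-scoring SNP in each fixed-size window.

    Fixed windows approximate even spacing; the score favors
    validated SNPs and adds an assumed bonus for coding SNPs.
    """
    best = {}
    for snp in snps:
        window = snp.position // window_bp
        score = snp.validation + (coding_bonus if snp.coding else 0)
        if window not in best or score > best[window][0]:
            best[window] = (score, snp)
    # Return the chosen SNPs in chromosomal order.
    return [best[w][1] for w in sorted(best)]


# Hypothetical candidates, not real dbSNP records.
candidates = [
    Snp("a", 100, 2, False),
    Snp("b", 900, 2, True),
    Snp("c", 1500, 4, False),
]
chosen = select_snps(candidates, window_bp=1000)
```

In this toy input, the coding SNP wins the first window because of the assumed bonus, and the only SNP in the second window is kept.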
A total of 12,674 genotyping experiments were conducted by four Genotyping Centers, and a final set of 4681 markers passed the stringent quality control procedure (The International HapMap Consortium 2003).
Genomic DNA from 90 unrelated Korean individuals without family histories of major diseases was obtained from the Genomic Research Center in the Korean National Institute of Health (KNIH).
The KNIH samples were collected as part of an epidemiological project and represent urban and rural regions in the south of Seoul.
The sex ratio was 0.5 and the mean age was 50.
Informed consent from all participating subjects was obtained through KNIH, and research approval came from the relevant ethical committees.
DNA was isolated from peripheral blood leukocytes according to standard procedures with proteinase K-RNase digestion, followed by phenol-chloroform extraction.
For each SNP, we chose a set of three primers: two PCR primers to amplify a product of 100-200 bp under standard conditions and an optimized extension primer complementary to the sequence immediately adjacent to the SNP site.
For genotyping, we employed three platforms: 6063 SNP genotypings were done using the Orchid Bioscience SNP-IT assay (Princeton, NJ), 984 SNP genotypings using the PerkinElmer Life Sciences FP-TDI assay (Boston, MA), and 5627 SNP genotypings using the Sequenom MassARRAY (San Diego, CA).
A genotype frequency for each SNP was checked for consistency between the observed values and those expected from the Hardy-Weinberg equilibrium test in each assay.
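A consistency check of this kind is commonly carried out as a chi-square goodness-of-fit test with one degree of freedom. The sketch below assumes a biallelic SNP and genotype counts as input; the function name is hypothetical, and the text does not state which test statistic or threshold was actually applied.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed genotype counts with
    those expected under Hardy-Weinberg equilibrium (1 df)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype is required")
    # Allele frequencies estimated from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic markers).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

A statistic above 3.84 corresponds to p < 0.05 at one degree of freedom; for small samples or rare alleles an exact test is generally preferred.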
Haploview version 3.2 (Barrett et al. 2004), based on the expectation-maximization (EM) method (Excoffier and Slatkin 1995), was used to infer haplotype phase and population frequency and to estimate Lewontin's coefficient D' (Lewontin 1998), LOD, and the correlation coefficient r (Hill and Robertson 1968).
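For reference, the two pairwise LD measures named above can be computed directly once haplotype frequencies are known. The sketch below assumes two biallelic loci and already-phased frequencies (in practice these come from the EM step); it is not Haploview's implementation, and the function and parameter names are hypothetical.

```python
import math


def pairwise_ld(p_ab, p_a, p_b):
    """Return (D, |D'|, r) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1
          and allele B at locus 2
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    # D' scales D by its maximum attainable magnitude given the
    # allele frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    r = d / denom if denom > 0 else 0.0
    return d, d_prime, r
```

Complete LD gives |D'| = 1, while r reaches 1 in magnitude only when the two loci also have matching allele frequencies, which is why the two measures are usually reported together.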
PHASE v2.1 was used to estimate the recombination parameters (Li and Stephens 2003; Crawford et al. 2004) and assess the statistical significance of haplotype profile differences and individual haplotype fre-
2006).
Because it has been suggested that the functional significance of IL-1B-3737 might depend on a broader haplotype, we used the three SNPs for haplotype analysis.
Haplotypes were reconstructed by PHASE version 2.1, using previously produced genotype data (Lee et al., 2004).
Of the eight possible haplotypes, three common ones accounted for 98% of the estimated haplotypes in the Korean population.
Table 1 shows the haplotype frequency estimation in each population.
The potentially more inflammatory IL-1B-511T/-31C haplotype represented 53.5% of the Korean haplotypes, compared with 33.7% of the Caucasian haplotypes.
So far, in many previous association studies, the individual SNP approach, most frequently using IL-1B-511 and IL-1B-31, has been adopted.
To our knowledge, we were the first to report that the IL-1B-1464 polymorphism has allele-specific differences in nuclear protein binding and is associated with a clinical disease (Lee et al., 2004).
The biological implication of this polymorphism was supported by in vivo studies by Chen et al., which showed that the IL-1B-1464 polymorphism has substantial allele-specific effects when IL-1B-511 and IL-1B-31 carried alleles T and C, respectively (Chen et al., 2006).
The more informative haplotype 1 (GTC), containing the IL-1B-1464 polymorphism, which shows the highest transcriptional activity, represents 9.3% and 6.0% of Korean and Caucasian haplotypes, respectively, whereas haplotype 3 (GCT), with the lowest activity, had a higher frequency in Caucasians (64.8%) when compared with Koreans (44.2%) (Table 1).
The difference in IL-1B promoter haplotype frequency between the Korean and Caucasian populations was statistically significant (χ²=20.6, p=0.000), and the allele frequencies of the IL-1B-1464 polymorphism (rs#1143623) were also significantly different between the two populations (IL-1B-1464 G allele frequencies for Korean and HapMap European=0.548 and 0.672, respectively) (χ²=6.38, p=0.01).
It has been suggested that genes that are involved in immune function may be under selective pressure in direct interaction with the environment (Sawyer et al., 2004; Kim et al., 2005).
The genes that influence a phenotypic variation between populations are expected to show high Fst values.
Compared with the Fst value for the Caucasian-vs-Asian comparison, the Fst values for the African-vs-Asian or -Caucasian comparisons were remarkably high (Fig. 1).
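To make the Fst comparison concrete, a heterozygosity-based estimate for one biallelic SNP in two populations can be sketched as follows. This assumes equal weighting of the two populations and ignores sample-size corrections; the estimator actually used for Fig. 1 is not specified in the text, and the function name is hypothetical.

```python
def fst_two_populations(p1, p2):
    """Fst = (Ht - Hs) / Ht for one biallelic SNP.

    p1, p2: frequency of the same allele in each population.
    Hs is the mean within-population expected heterozygosity and
    Ht is the expected heterozygosity of the pooled population.
    """
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    if h_t == 0:
        return 0.0  # allele fixed or absent in both populations
    return (h_t - h_s) / h_t


# IL-1B-1464 G allele frequencies quoted in the text
# (Korean 0.548, HapMap European 0.672).
fst_il1b_1464 = fst_two_populations(0.548, 0.672)
```

Identical allele frequencies give an Fst of 0, and alternately fixed alleles give an Fst of 1, so larger values indicate stronger differentiation between the two populations.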
Previously, we reported that the IL-1B-1464 polymorphism contributes to the development of intestinal-type gastric cancer among Koreans (Lee et al., 2004).
As a curious finding in our report, the editor pointed out that carriers of IL-1B-1464 G tend to have a decreased risk of diffuse-type gastric cancer, which is the opposite of the finding for intestinal-type gastric cancer, although both intestinal and diffuse types of gastric cancer are related to Helicobacter pylori-induced gastritis (Furuta et al., 2004).
Our results showed that most IL-1B-1464 C alleles are linked to the IL-1B-511T/-31C haplotype (Table 1).
Considering the level of promoter activity of haplotype 2 (CTC), we cannot exclude the possible association between this haplotype and the risk of diffuse-type gastric cancer, especially depending on interactions with other regulatory factors (Lee et al., 2007).
Association studies that use individual SNPs appear to be insufficient, and the understanding of functional haplotype structure of populations could provide potential explanations for IL-1B-related controversies and ethnic-specific associations.
Therefore, we believe that these Korean haplotype data will be useful for future association studies between IL-1B SNPs and disease risk.
nted domains, including the human imprinted gene cluster that contains IGF2, H19, KCNQ1, ASCL2, and CDKN1C (Rapkins et al., 2006).
If, as has been suggested, imprinted genes are intimately connected with the acquisition of parental resources, we would not anticipate the existence of such genes in chickens, which leave their offspring to their own resources after conception.
Phylogenetic analyses show that human and mouse are more closely related to each other than either is to chicken.
Similarly, the relationship between zebrafish and chicken is quite distant (Shah et al., 2004).
Nonetheless, we assumed that chickens have imprinted genes due to the existence of common ancestral genomic regions that have evolved on a similar basis in each of the aforementioned species.
The purpose of this study was to identify candidate imprinted genes in chicken based on an analysis of orthologous genes in human, mouse, zebrafish, and chicken using the HomoloGene database.
ols for the clinical oncology to determine the prognosis of patients (Lossos et al., 2004; Pomeroy et al., 2002), the molecular diagnosis (Golub et al., 1999) as well as the responsiveness to therapeutics (Snyder and Morgan, 2004).
There have been many reports on the molecular pattern analysis using microarray to understand the chemo- and radio-resistance in cervical cancer (Achary et al., 2000; Tewari et al., 2005; Wong et al., 2006), rectal cancer (Kim et al., 2007) and esophageal cancer (Fukuda et al., 2004).
Most of these studies aim to identify genes that are differentially expressed in patients with different clinical outcomes, which can then be used to evaluate prognosis more accurately.
Although the conventional parameters like tumor stage and grade can be used to decide optimal cancer therapy, molecular markers would provide valuable information to make clinical decisions (Klopp and Eifel, 2006).
Genome-wide analysis on gene expression can predict the clinical consequences more accurately.
In addition, the information from gene expression profiling can facilitate the development of biological targets for therapeutics by identifying pathways and determining steps contributing to the phenotype.
In this study, we examined the expression profiles of two lung cancer cell lines, which showed differential re-
1995).
In the inactive form, the pseudosubstrate domain is bound to the catalytic domain of PKC (Orr et al, 1994).
Upon stimulation, PKC translocates to the plasma membrane where the C1 and C2 domains interact with DAG and phosphatidylserine, respectively.
This interaction causes the pseudosubstrate domain to dissociate from the catalytic domain, which results in activation of PKC.
Inactive PKC is not freely distributed throughout the cytoplasm but appears to be localized to specific sites within the cell.
Association of PKC with scaffolding proteins such as AKAP79 (A Kinase-Anchoring Protein 79) (Klauck et al, 1996) and Gravin (Nauert et al, 1996) facilitates localization.
Streptomycetes are ubiquitous soil bacteria, and they play a key role in the global carbon cycle by degrading the insoluble remains of other organisms.
More clues to the development of the PKC super family come from the study of the bacterium Streptomyces coelicolor.
S. coelicolor has a large collection of enzymes and can metabolize many diverse nutrients.
This extremely simple organism has a genome of approximately 8,667,507 bp, yet it has a complex life cycle exhibiting mycelial growth and spore formation (Bentley, 2002) and is notable for its production of pharmaceutically useful anti-tumor compounds.
Of the predicted genes, an unprecedented proportion carries out regulatory functions in the cell (Winstead, 2002).
More than twelve percent of the genome is involved in facilitating biological processes, such as the bacterium's
s reduce implementation time and increase the likelihood of eliminating bugs and localizing code modifications when a change in implementation is required.
In the initial version of the interface, all of the classes got tangled with each other and corrupted the concept of object-oriented programming.
However, they have been completely redesigned, as shown in Table 1.
This table summarizes the recent modifications of our system, and the interfaces for each class are documented, similar to Fig. 2.
The refactored version is now composed of 3851 lines, compared with the initial version, which was composed of 2765 lines of code.
By importing the five packages, an exemplary software system called J3dPSV 1.0, shown in Fig. 3, has been developed for viewing 3D structures of proteins from the Protein Data Bank for demonstrational purposes.
J3dPSV supports visualization of proteins for educational purposes by simulating simple molecular graphics.
In addition, J3dPSV interactively displays a molecule on the screen in a variety of color schemes, molecular representations, and animation features.
The molecular model can be changed by selecting the list (cartoon tubes, backbone, protein, cylinder, or line) in
useful suggestions for genotype information.
Compared to the current genotyping tools, GTVseq has several unique and useful features in the following aspects:
* GTVseq uses two different scoring schemes and the results are reported separately. One of the scoring schemes is similar to that of NCBI, while the other is particularly useful for viral sequences with new or complicated genotypes (vide infra).
* GTVseq offers an easy and interactive web-based user interface, with intuitive reports for genotyping results.
* GTVseq can be used for genotyping many important viruses such as HIV-1, HIV-2, HBV, HCV, HTLV-1, HTLV-2, poliovirus, enterovirus, flavivirus, Hantavirus, and rotavirus, thus permitting the most comprehensive genotyping of viral genomes to date.
Methods.
For genotyping of viral genome sequences, we need to establish 'reference sequences' for each genotype.
We have downloaded the reference sequence database collections from NCBI (http://www.ncbi.nlm.nih.gov/projects/genotyping), for HIV-1, HIV-2, HBV, HCV, HTLV-1, HTLV-2, and poliovirus.
For HIV-1 reference sequences, GTVseq also provides several different collections of reference databases such as HIV-1 (2004) & CRF, HIV-1 (2005), HIV-1 (2005) & CRF.
For enterovirus, flavivirus, Hantavirus, and rotavirus, the reference sequences were
combination of databases and interactive web pages for manipulating and displaying annotations on genomes.
In other words, GBrowse is a web-based application tool that is developed for navigating and visualizing the genomic features and annotations interactively for users.
Through it, users can view a certain region of the desired genomes and search for genetic biomarkers.
They may conduct a full-text search for most features of the genomes.
They also can download SNP assay, genotype, and allele frequency information and generate customized sets of tag-SNPs for their association studies (Thorisson et al., 2005).
GBrowse utilizes a web-based display that can be used to show arbitrary features of a nucleotide or protein sequence and can accommodate genome-scale sequences that are megabases in length.
The GBrowse system consists of various kinds of software modules and systems, such as web servers, database systems, and Perl libraries.
At present, many biological websites that provide genomic variants or portal services have been developed using GBrowse, including the following: the UCSC Genome Browser (Kuhn et al., 2007), the International HapMap Project (Thorisson et al., 2005), PlasmoDB (The