@ewha-bio:260 / 41791-41814
=============Title==========
Copy Number Variations in the Human Genome: Potential Source for Individual Diversity and Disease Association Studies.
=============Cor Author==========
*Corresponding author: E-mail yejun@catholic.ac.kr, Tel +82-2-590-1214, Fax +82-2-596-8969. Accepted 11 March 2008
===========Author==========
Tae-Min Kim1, Seon-Hee Yim2 and Yeun-Jun Chung1,2*
1Department of Microbiology, 2Integrated Research Center for Genome Polymorphism, The Catholic University of Korea, Seoul 137-701, Korea
===========Keywords==========
Keywords: array-CGH, Copy number variation (CNV), Genome-wide association study (GWAS)
Keywords: chromosome, genome-wide linkage search, heritability, HDL cholesterol
Keywords: inbreeding coefficient, Mongolian population, STR, HWE, PIC
Keywords: haplotype, HapMap, Korean, LD, populations, SNP
===========Sub Heading==========
Abstract Introduction The definition of CNV The identification of CNVs using different platforms Clinical implications of CNVs and disease association study Conclusion Introduction Methods Results and Discussion Introduction Methods Results Discussion Introduction Methods Methods Construction of Adenoviral Vector for hTERT-specific Group I Intron Major Functionalities of CGHscape Methods Features and Results Personal Genomics Polymorphism and Mutation Databases Methods
==========Minor Heading===========
Subjects, medical histories, genotyping, and measurement of HDL cholesterol Statistical analyses, heritability estimation, and variance component linkage analysis Participants Genotyping Estimating Hardy-Weinberg Equilibrium (HWE), Information Contents and Inbreeding Coefficients SNP Selection DNA Samples Genotyping Statistical Analysis Datasets The dataset
===========Main Text==========
Abstract.
The widespread presence of large-scale genomic variations, termed copy number variations (CNVs), has recently been recognized in phenotypically normal individuals.
Judging by the growing number of reports on CNVs, it is now evident that these variants contribute significantly to genetic diversity in the human genome.
Like single nucleotide polymorphisms (SNPs), CNVs are expected to serve as potential biomarkers for disease susceptibility or drug responses.
However, technical and practical concerns remain to be addressed.
In this review, we examine the current status of CNV databases and research, including the ongoing efforts to screen for CNVs in the human genome.
We also discuss the characteristics of platforms that are available at the moment and suggest the potential of CNVs in clinical research and application.
Introduction.
Traditionally, large-scale genomic variants that are visible in conventional karyotyping have been thought to be associated with early-onset, highly penetrant genetic disorders and to be incompatible with a normal, disease-free phenotype (Lupski, 1998; Stankiewicz and Lupski, 2002).
The construction of the 'reference genome' by the human genome sequencing project is based on the belief that human genome sequences are virtually identical, even in different individuals, except for well-known single nucleotide polymorphisms (SNPs) or size variants of tandem repeats such as mini- or microsatellites (variable number of tandem repeats or VNTR) (Przeworski et al., 2000).
This traditional concept has been recently challenged by the discovery that large structural variations are more prevalent than previously presumed (Check, 2005).
Using high-resolution whole-genome scanning technologies such as array-based comparative genomic hybridization (array-CGH), two pioneering groups identified widespread copy number variations (CNVs) in apparently healthy, normal individuals (Iafrate et al., 2004; Sebat et al., 2004).
These findings suggest that our genome is more diverse than had previously been recognized, and subsequent studies have identified up to 11,000 CNVs across the whole genome (Tuzun et al., 2005; Hinds et al., 2006; Mills et al., 2006; McCarroll et al., 2006; Conrad et al., 2006; Sharp et al., 2005; Wong et al., 2007; de Smith et al., 2007).
Although the current understanding of CNVs is still too limited for practical use and technical challenges remain, recent studies have already demonstrated potential associations of CNVs with various diseases, suggesting plausible functional significance and highlighting the promising utility of CNVs.
The current coverage of CNVs in the human genome has already exceeded that of SNPs (approximately 600 Mb, comprising 12% of the human genome) and is still increasing (Cooper et al., 2007).
These large-scale structural variants, in addition to SNPs, will serve as powerful sources to help our understanding of human genetic variation and of differences in disease susceptibility for various diseases.
This paper reviews the current knowledge and future perspectives of CNVs.
The definition of CNV.
Structural variations that involve large DNA segments can take various forms, such as duplication, deletion, insertion, inversion, and translocation.
Among them, DNA copy number variations larger than 1 kb are collectively termed CNVs.
Fig. 1 illustrates the concept of CNV.
Although CNVs can include large, microscopically visible genomic variations, the term generally indicates a submicroscopic structural variation that is hardly detectable by conventional karyotyping (<3-5 Mb) (Freeman et al., 2006).
Smaller variations such as small insertion-deletion (indel) polymorphisms are not included in CNVs, although they comprise another large collection of over 400,000 variants in the human genome (Mills et al., 2006); neither are insertional polymorphisms of mobile elements such as Alu or L1 elements considered CNVs.
In the early stages of CNV discovery, a number of terms were proposed to describe them, e.g., large-scale copy number variants (LCV) (Iafrate et al., 2004), copy number polymorphism (CNP) (Sebat et al., 2004), and intermediate-sized variants (ISV) (Tuzun et al., 2005).
The current definition of CNV is also operational and can be modified with the advance of scanning resolution and coverage, and availability of allele frequency in a determined population.
The identification of CNVs using different platforms.
Various scanning platforms and quality control methods have been used to identify CNV calls.
Because the choice of platforms has a great effect on the results, it is worth reviewing the characteristics of platforms to improve the understanding of CNVs.
The presence of CNVs in normal individuals was reported for the first time in 2004 independently by two groups led by Lee C. and Wigler M. (Iafrate et al., 2004; Sebat et al., 2004).
Both studies used two-dye array-CGH techniques, based either on clones of bacterial artificial chromosomes (BAC) or on oligonucleotides (representational oligonucleotide microarray analysis, or ROMA).
They independently reported about 250 and 80 loci with copy number changes in 39 and 20 normal individuals, respectively.
Fig. 2 illustrates the general concept of CNV detection based on two-dye array-CGH.
Although the average numbers of CNVs per individual genome were similar in the two studies (about 12 CNVs per genome), it should be noted that there was little overlap between the results.
This discrepancy was possibly due to the use of different platforms and experimental conditions in different populations.
However, it is also probable that there are still large numbers of structural variants that have yet to be discovered (Buckley et al., 2005; Eichler, 2006).
A subsequent study that provided evidence for the widespread presence of large-scale structural variations in the human genome was based solely on in silico analysis (Tuzun et al., 2005).
The sequence-level comparison of two independent genome sequences, i.e., one derived from a human genome reference assembly and the other from fosmid clones of a genomic library, revealed about 300 structural variations, including inversions.
This method can detect various types of structural variants, including inversions, which are not detectable by conventional array-CGH platforms.
Indeed, the results of Tuzun et al. (2005) can be used as a validated control for primary verification or for parameter tuning in the development of CNV-detection platforms or algorithms.
Although the use of this method is currently limited by the scarcity of sequence data, ongoing efforts to sequence individual human genomes and to develop cost-effective sequencing platforms (Bennett et al., 2005) should facilitate sequence-level genome comparisons and the identification of high-quality structural variants in the near future.
Two studies by McCarroll et al. and Conrad et al., which focused on the identification of deletion variants (McCarroll et al., 2006; Conrad et al., 2006), used genotype data for 1.2 million SNPs from the International HapMap Consortium (International HapMap Consortium, 2005).
They assumed that a deletion causes the SNP probes within it to fail or to be discarded during genotyping.
For example, runs of consecutive probes with null genotype calls, or runs of SNP genotypes whose frequencies deviate from the expected Hardy-Weinberg equilibrium ratios or from the expected Mendelian inheritance patterns, might indicate deleted loci.
They independently reported about 600 potential deletions, some smaller than 100 bp.
The relatively small size of the identified variants, compared with those from the array-CGH method, is due to the high resolution of the platforms.
An SNP-centric array platform can also be used to identify linkage disequilibrium (LD) between structural variants and nearby SNPs in a given population.
However, discrepancies between the deletions identified in the two studies were also noted, despite the use of similar HapMap populations and identification methods (Eichler, 2006).
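The null-call heuristic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure used by McCarroll et al. or Conrad et al.: the function name `find_null_runs`, the `min_run` threshold, and the toy genotype list are all assumptions made for the example. Such runs would be expected mainly for homozygous deletions; hemizygous deletions are more likely to appear as Mendelian or Hardy-Weinberg inconsistencies.

```python
from typing import List, Optional, Tuple


def find_null_runs(calls: List[Optional[str]], min_run: int = 3) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs for runs of consecutive
    missing genotype calls (None) that span at least `min_run` SNPs."""
    runs = []
    start = None  # index where the current run of null calls began
    for i, call in enumerate(calls):
        if call is None:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the chromosome segment.
    if start is not None and len(calls) - start >= min_run:
        runs.append((start, len(calls) - 1))
    return runs


# Hypothetical genotype calls for one individual, ordered by SNP position.
genotypes = ["AA", "AG", None, None, None, None, "GG", None, "AG", None, None, None]
print(find_null_runs(genotypes, min_run=3))
```

A single failed call (index 7 in the toy data) is ignored, because isolated genotyping failures are common and are not evidence of a deletion on their own.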
Recently, a comprehensive CNV analysis was reported based on two high-resolution array platforms: Whole Genome TilePath (WGTP), which used 26,000 large insert clones, and the Affymetrix GeneChip Human Mapping 500K early access array, which used 500,000 SNP oligonucleotides.
The authors identified about 1500 copy number variable regions (CNVRs), i.e., genomic segments consisting of overlapping CNVs, from 269 HapMap individuals (Redon et al., 2006).
The results from the two platforms are worth comparing because they provide the highest currently achievable resolution and are often selected as primary platforms in many other studies.
Firstly, the CNVs that are identified from BAC-based array-CGH are generally larger than those from oligonucleotide-based arrays (230 kb and 80 kb of median size, respectively).
This overestimation of CNV size by BAC-based array-CGH, which has been frequently reported, is due to the large insert clones that are used (Iafrate et al., 2006).
Secondly, the actual boundaries of structural variants cannot be determined through BAC-based array-CGH.
On the other hand, a more accurate determination of variant boundaries can be achieved through SNP-centric oligonucleotide-based arrays that have an extensive number of oligonucleotides.
The SNP-centric platform has the additional advantages of providing SNP genotype information as a further source of variation alongside large structural variants, and of detecting loss of heterozygosity (LOH) or segmental uniparental disomy (Bruce et al., 2005; Mei et al., 2000).
However, the SNP-centric platform also has its disadvantages.
In spite of the advanced resolution, the relatively low signal-to-noise ratio of oligonucleotide-based hybridization intensity, compared with that of large insert clone arrays, might result in higher false-positive rates.
Because most CNVs are subtle changes, this makes the results prone to misclassification of signal intensities and, consequently, to statistical errors.
It has also been pointed out that the SNP-centric array was originally designed for allelic discrimination and is not well suited to CNV detection because of the biased genomic distribution and sequence composition of the spotted probes (McCarroll and Altshuler 2007d).
Recently proposed oligonucleotide-based array platforms have been designed for CNV detection specifically without sacrificing the advantage of high resolution, which can be a promising solution for CNV detection in the near future (Barrett et al., 2004).
In identifying CNVs in normal populations, one of the fundamental problems is the lack of a reference genome from which the diploid state of sample DNA can be inferred.
Unlike array-CGH-based tumor studies, in which the normal DNA of the same individual can be used as a reference genome, no single DNA source can represent a standardized and universal genome in variant analysis.
Often, the pooled genome of several individuals has been used to represent the average genome, although the heterogeneity of the pooled population might affect the copy number inference step, as shown, for example, for the X chromosome.
Redon et al. and Komura et al. notably adopted pairwise comparison for accurate inference of copy number states at individual loci (Redon et al., 2006; Komura et al., 2006).
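As a rough illustration of pairwise comparison, the sketch below compares each probe intensity of one sample against the same probe in every other sample and summarizes the resulting log2 ratios by their median. The median summary, the function name `pairwise_log2_ratios`, and the toy intensities are assumptions made for this example; they are not taken from Redon et al. or Komura et al., whose actual procedures may differ.

```python
import math
from statistics import median
from typing import Dict, List


def pairwise_log2_ratios(intensities: Dict[str, List[float]], sample: str) -> List[float]:
    """For each probe, return the median log2 ratio of `sample`
    against every other sample in `intensities`."""
    others = [name for name in intensities if name != sample]
    summary = []
    for j, value in enumerate(intensities[sample]):
        ratios = [math.log2(value / intensities[other][j]) for other in others]
        summary.append(median(ratios))
    return summary


# Hypothetical hybridization intensities for three probes in four samples.
intensities = {
    "s1": [100.0, 100.0, 50.0],   # probe 3 looks like a single-copy loss
    "s2": [100.0, 100.0, 100.0],
    "s3": [100.0, 100.0, 100.0],
    "s4": [100.0, 200.0, 100.0],  # probe 2 looks like a gain
}
print(pairwise_log2_ratios(intensities, "s1"))
print(pairwise_log2_ratios(intensities, "s4"))
```

Because the summary is taken over many reference samples, a variant carried by one reference individual (the gain in `s4` at probe 2) does not distort the inferred state of `s1` at that probe, which is the weakness of a single-reference design.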
In pairwise comparison, the hybridization intensities of one sample are compared with those of all the remaining samples, which together serve as one large reference, and the diploid states of loci can be inferred more accurately from the multiple comparison results.
Clinical implications of CNVs and disease association study.
In spite of recent technological developments in polymorphism-based disease association studies, little is yet known about the effects of genetic polymorphisms on common complex diseases.
One of the ultimate goals in exploring CNVs is to systematically assess the association between such variants and disease.
Although it is unlikely that all CNVs in the human genome are associated with diseases, evidence of the association of CNVs and a wide spectrum of human diseases has rapidly accumulated.
Table 1 summarizes the CNVs that have been reported to be associated with diseases.
CNVs can affect disease susceptibility or individual differences in responses to drugs through alteration of gene expression.
The reports of Stranger et al. and Heidenblad et al. consistently showed positive correlations between DNA copy number dosage and gene expression level (Stranger et al., 2007; Heidenblad et al., 2005).
If a CNV region contains transcriptional regulatory elements rather than protein-coding genes, it can still affect gene expression levels by changing transcriptional regulation or heterochromatin spread (Reymond et al., 2007).
Conclusion.
The genomic fraction that is occupied by CNVs is now estimated to be about 600 Mb, already exceeding that of single base-level variants.
It is likely that the number of CNVs and the genomic fraction that is affected by structural variants will continue to expand, and many of them will be used for more practical purposes, including disease association or population studies.
However, it should be remembered that the current CNV entries are plagued by substantial amounts of false-positive and false-negative results.
Only a small portion of them have been validated by independent methods.
To overcome this, it is necessary to improve scanning platforms, including optimizing experimental conditions and developing more reliable CNV calling algorithms.
In the meantime, individual researchers need to know the characteristics of the available platforms and analytical techniques in order to use them and to interpret the published results properly.
We found peak evidence of linkage (LOD score=1.88) for HDL cholesterol level on chromosome 6 (nearest marker D6S1660) and potential evidence for linkage on chromosomes 1, 12 and 19 with LOD scores of 1.32, 1.44 and 1.14, respectively.
These results should pave the way for the discovery of the relevant genes by fine mapping and association analysis.
Introduction.
Cholesterol is a major part of cell membranes.
Cholesterol is carried in the blood by chylomicrons, very low density lipoproteins (VLDL), high density lipoproteins (HDL) and low density lipoproteins (LDL) (Dastani et al. 2006).
HDL cholesterol is inversely associated with cardiovascular disease, and is more tightly controlled by genetic factors than the other lipoproteins such as LDL, VLDL and chylomicrons.
Environmental factors including chronic alcoholism, estrogen replacement therapy, and exercise influence the levels of HDL cholesterol.
Several families with strikingly elevated HDL cholesterol levels have been identified.
HDL cholesterol levels are higher in blacks than in whites, and higher in females than in males (Barcat et al. 2006; Brousseau et al. 2004; Yamashita et al. 2000; Imperatore et al. 2000).
Candidate gene analysis using population-based case-control studies has been used to test the association between SNPs and HDL cholesterol levels.
Among the candidate genes, selected mainly from lipid metabolism pathways, the ApoA-I gene has been the most intensively studied (Inazu et al. 1994; Kuivenhoven et al. 1997).
Genome-wide linkage analysis can identify susceptibility genes even when they are not candidates on the basis of lipid metabolism.
Genome-wide linkage scans use microsatellite markers to identify genetic determinants affecting the traits (Wang and Paigen 2005).
Using HDL cholesterol levels as either a discrete or a quantitative trait, several linkage studies on genetic determinants of HDL cholesterol have been reported (Yancey et al. 2003).
Genetic effects on the variation in HDL cholesterol have so far been studied mainly in Caucasians and Africans, and little attention has been paid to Asian populations.
Here we report suggestive evidence of linkage for HDL cholesterol on chromosomes 6, 1, 12 and 19, found as part of the GENDISCAN study, a large epidemiological study of complex traits in geographically, culturally and genetically isolated large Mongolian families in Dornod, Mongolia.
Methods.
We analyzed data from 1002 Mongolian individuals from 95 large extended families.
Informed consent was obtained from all subjects prior to participation and the protocol was approved by the Institutional Review Board at Seoul National University.
Potentially confounding variables were assessed for each participant along with overall medical history.
Information on age, gender and anthropometry (height, weight, waist circumference, hip circumference and body fat content) was obtained for each individual.
Height in centimeters (cm) and weight in kilograms (kg) were measured using an automatic measuring instrument (IMI 1000, Immanuel Elec., Korea).
Body mass index (BMI) was calculated in kg/m².
Waist circumference was measured to the nearest centimeter at the level of the umbilicus, and hip circumference was measured at the level of the maximal circumference of the gluteus.
All other variables were collected through interviews performed by trained interviewers.
Information about alcohol consumption and smoking was also obtained from all the participants.
All the subjects were asked to fast for 12 hours before their visit.
Blood samples were collected from an antecubital vein into vacutainer tubes containing EDTA.
Blood samples were centrifuged at 3000 rpm for 10 minutes and then stored at -70°C.
DNA was isolated from lymphocytes for polymerase chain reaction (PCR) and automated genotyping.
A 10 ml blood sample was collected from each participating individual for genomic DNA extraction.
DNA was extracted from peripheral lymphocytes using the PUREGENE DNA Purification Kit for whole blood (Gentra Systems Inc, USA).
For genotyping, the deCODE mapping set of 1000 microsatellite markers (deCODE genetics, USA) was used, covering the genome at an average density of 3 centimorgans (cM).
HDL cholesterol was measured by the enzymatic method using Cholestest-N-HDL kit (DAICHI, JAPAN) and HITACHI 7600-210 & HITACHI 7180 instruments.
Extensive quality control procedures ensured the validity and reproducibility of the measurements.
Multiple linear regression analysis was performed with PC SAS version 8.2 and PC SPSS version 12 to account for the effects of confounding variables.
Pedigree data were managed with PedSys (Southwest Foundation for Biomedical Research, San Antonio, Texas, USA).
Nonpaternity was examined using PEDCHECK (McPeek and Sun 2000), and relationships other than paternity were checked using the average IBD-based method in PREST.
After pedigree errors and Mendelian errors had been corrected, non-Mendelian errors were examined and corrected using SimWalk.
The identity by descent (IBD) matrix between all relative pairs within each family, and the IBD matrix for each single marker, were calculated with SOLAR (Sequential Oligogenic Linkage Analysis Routines software version 2.1.4).
Multipoint IBD matrices were computed at every 1 cM using a Markov chain Monte Carlo method in LOKI (Heath 1997).
Genetic components of selected phenotypes were estimated in terms of heritability.
Narrow sense heritability, defined as the proportion of total phenotypic variation due to additive genetic effects, was calculated.
Heritability of HDL cholesterol was estimated after adjustment for age, gender, age squared, the product of age and gender, the product of age squared and gender, systolic BP, smoking and alcohol.
A variance component linkage analysis was then carried out with SOLAR, which uses maximum likelihood methods to estimate variance components for the polygenic genetic effect and for random individual environmental effects.
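For orientation, the quantities named above can be written out in the standard variance component formulation. The notation here is an assumption introduced for illustration ($\sigma_A^2$ for additive genetic variance, $\sigma_E^2$ for individual environmental variance, $\sigma_q^2$ for the variance due to a putative quantitative trait locus, $\Phi$ for the kinship matrix, $\hat{\Pi}$ for the estimated IBD-sharing matrix at a map position); it is not taken from the article itself.

```latex
% Narrow-sense heritability under a purely polygenic model
h^2 = \frac{\sigma_A^2}{\sigma_P^2}, \qquad \sigma_P^2 = \sigma_A^2 + \sigma_E^2

% Covariance of the covariate-adjusted trait among pedigree members
% polygenic (null) model:
\Omega_0 = 2\Phi\,\sigma_A^2 + I\,\sigma_E^2
% linkage model at a given map position:
\Omega_1 = \hat{\Pi}\,\sigma_q^2 + 2\Phi\,\sigma_A^2 + I\,\sigma_E^2

% LOD score comparing the maximized likelihoods of the two models
\mathrm{LOD} = \log_{10}\frac{L(\Omega_1)}{L(\Omega_0)}
```

Under this formulation, linkage at a position is tested by asking whether allowing $\sigma_q^2 > 0$ improves the likelihood over the polygenic model.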
Results and Discussion.
The mean age of the 1002 individuals was 31 years and 54.5% of them were female.
Demographic and pedigree characteristics of the study sample are shown in Table 1.
The mean family size was 16.
Table 2 includes information on 2546 pairs of first-degree relatives (1812 parent-offspring pairs and 734 full-sib pairs), 2485 pairs of second-degree relatives (395 half-sibling pairs, 1202 grandparent-grandchild pairs, and 888 avuncular pairs), and 598 first-cousin pairs.
The mean levels of total cholesterol, HDL cholesterol, LDL cholesterol, and triglyceride were 159.82 mg/dl, 55.19 mg/dl, 90.51 mg/dl, and 63.30 mg/dl, respectively.
Table 3 shows correlation between HDL cholesterol and covariates such as age, gender, systolic blood pressure, alcohol consumption status, and smoking status.
These parameters were used as covariates in the variance component analysis which provided multivariable adjusted heritability estimates for HDL cholesterol of 0.45 (Table 4).
The peak multipoint LOD score was 1.88 on 6p21 (nearest marker D6S1660), and a secondary peak (LOD score of 1.44) was found on 12q23 (nearest marker D12S354).
We identified other potential evidence for linkage, with a LOD score of 1.32 on 1q24 (nearest marker D1S412) and a LOD score of 1.14 on 19p13 (nearest marker D19S884) (Fig. 1, 2).
Table 5 presents all LOD scores ≥1.0 for HDL cholesterol.
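To put LOD scores of this size in perspective, a LOD score can be converted to an approximate pointwise p-value. The sketch below assumes the usual asymptotic result for variance component linkage tests, in which the likelihood ratio statistic 2 ln(10) × LOD follows a 50:50 mixture of a point mass at zero and a chi-square distribution with one degree of freedom. That assumption, and the function names, are ours and are not stated in the article.

```python
import math


def lod_to_chi2(lod: float) -> float:
    """Convert a LOD score (log10 likelihood ratio) to a likelihood ratio statistic."""
    return 2.0 * math.log(10.0) * lod


def lod_to_pvalue(lod: float) -> float:
    """Approximate pointwise p-value, assuming a 50:50 mixture of a
    point mass at 0 and a chi-square distribution with 1 df."""
    if lod <= 0:
        return 1.0
    chi2 = lod_to_chi2(lod)
    # For 1 df, P(chi2_1 > x) = erfc(sqrt(x / 2)); halve it for the mixture.
    return 0.5 * math.erfc(math.sqrt(chi2 / 2.0))


for lod in (1.14, 1.32, 1.44, 1.88, 3.0):
    print(f"LOD {lod:.2f} -> approximate p = {lod_to_pvalue(lod):.5f}")
```

These are pointwise values and do not account for testing many positions across the genome, which is why the peaks reported here are described only as suggestive or potential evidence.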
We identified potential evidence of linkage on several chromosomes.
In another genome scan, a weak linkage signal for HDL cholesterol was observed in regions that overlapped slightly with the regions identified here.
Klos et al. reported a peak on chromosome 12q in a European American population (Klos et al. 2001) (Table 6).
We found evidence of link-
the population isolates used in GENDISCAN study would not present significant inflation of type I errors from inbreeding effects in its gene discovery analysis.
Introduction.
The GENDISCAN (Gene Discovery for Complex traits in Asian population of Northeast area) study was launched in 2002 in order to elucidate genetic causes of complex diseases.
This study attempted to incorporate designs that detect genetic signals with increased efficiency.
These included using a genetically homogeneous population, recruiting large families, and considering quantitative phenotypes as well as disease outcomes (Peltonen et al., 2001; Merikangas et al., 2003).
Large extended families still remaining in Northeast Asia enabled the project to adopt these designs.
Although there is no doubt that gene discovery for common complex diseases is a research priority, successful results have been very limited (Grant et al., 2006).
The difficulty of replication across studies mandates the use of internally valid study designs and proper methodologies.
Using population isolates generally confers the advantage of increased genetic homogeneity.
However, population isolates might have inbreeding structures, which violate the basic assumptions of HWE.
The presence of significant inbreeding necessitates modifications in genetic estimations using the population.
Therefore, we estimated the status of HWE and the inbreeding coefficients in two ethnic groups of Mongolia using genome-wide short tandem repeat (STR) genetic markers.
Compatibility with the basic assumptions of population genetics can support the methodological validity of the overall GENDISCAN study.
Methods.
The GENDISCAN study included non-selected families in Mongolia.
The People's Republic of Mongolia (not including the Chinese territory) has 2.6 million people, comprising more than 20 ethnic groups.
The Orkhontuul area in Selenge Imag (an Imag is an administrative district unit in Mongolia, corresponding to a state in the United States) and the Dashbalbar area in Dornod Imag were selected.
The Orkhontuul area has a population of 3,760 people, mainly of the Khalkha tribe, and maintains a semi-urban lifestyle.
The Dashbalbar area is inhabited mainly by about 4,000 people of Buryat ethnicity and has a more traditional nomadic lifestyle.
Many large extended families, which fit the purposes of the GENDISCAN study, still remain in both areas.
Genomic DNA was extracted from peripheral leukocytes.
The Orkhontuul samples (2004, n=1,080) were genotyped using the Applied Biosystems Inc. platform (ABI Prism Linkage Mapping Set version 2.5 medium density, 400 markers) with average 10 cM resolution, and Dashbalbar samples (2006, n=1,020) were genotyped using the deCODE 1,000 STR marker platform with average of 3 cM resolution.
For the Orkhontuul participants, markers on chromosome 14 were analyzed.
For the Orkhontuul data, markers with a low call rate (49 markers), markers with genotype error rates above 1% (16 markers) and markers on the X chromosome (18 markers) were excluded.
For the Dashbalbar genotype data, the 1,000 STR marker platform originally provided 1097 markers; however, we excluded markers on the X chromosome (49 markers) and markers with a low call rate and genotype error rates above 1% (4 markers).
All participants provided informed consent.
HWE and degree of inbreeding were assessed using the founders of each pedigree.
Non-founders were excluded because their genotypes are dependent on those of the founders.
HWE was estimated by comparing the expected and observed genotype frequencies.
Expected genotype frequency was calculated from allele frequency.
Chi-square goodness of fit test was used to determine whether HWE assumption was met.
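The chi-square statistic ($\chi^2$) for a multi-allelic locus is defined as in Equation 1, with $k(k-1)/2$ degrees of freedom, where $k$ is the total number of alleles.
$$\chi^2 = \sum_{u} \frac{(n_{uu} - n p_u^2)^2}{n p_u^2} + \sum_{u<v} \frac{(n_{uv} - 2 n p_u p_v)^2}{2 n p_u p_v} \qquad \text{(Equation 1)}$$
where $n_{uu}$ and $n_{uv}$ denote the observed counts of homozygous and heterozygous genotypes, $n$ is the number of genotyped founders, and $p_u$ and $p_v$ denote the frequencies of alleles $u$ and $v$.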
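The goodness-of-fit test described above can be sketched as follows. The study itself used the SAS/Genetics program, so this Python function is only an illustrative stand-in: `hwe_chi_square` and the toy genotype sample are assumptions made for the example. Expected counts are n·p_u² for homozygotes and 2n·p_u·p_v for heterozygotes, with allele frequencies estimated from the same sample, and the statistic has k(k-1)/2 degrees of freedom for k alleles.

```python
from collections import Counter
from itertools import combinations_with_replacement
from typing import List, Tuple


def hwe_chi_square(genotypes: List[Tuple[str, str]]) -> Tuple[float, int]:
    """Chi-square goodness-of-fit statistic for HWE at one (multi-allelic)
    locus, returned together with its degrees of freedom, k(k-1)/2."""
    n = len(genotypes)
    allele_counts = Counter(allele for g in genotypes for allele in g)
    freqs = {allele: count / (2 * n) for allele, count in allele_counts.items()}
    # Store genotypes with sorted alleles so that (A, B) and (B, A) are pooled.
    observed = Counter(tuple(sorted(g)) for g in genotypes)

    chi2 = 0.0
    for u, v in combinations_with_replacement(sorted(freqs), 2):
        if u == v:
            expected = n * freqs[u] ** 2          # homozygote u/u
        else:
            expected = 2 * n * freqs[u] * freqs[v]  # heterozygote u/v
        chi2 += (observed[(u, v)] - expected) ** 2 / expected

    k = len(freqs)
    return chi2, k * (k - 1) // 2


# Hypothetical founders at a bi-allelic marker with a heterozygote deficit.
sample = [("A", "A")] * 40 + [("A", "B")] * 20 + [("B", "B")] * 40
chi2, df = hwe_chi_square(sample)
print(chi2, df)
```

With sparse genotype classes, as often happens for highly polymorphic STR markers, the chi-square approximation can be poor, so this sketch should be read as a statement of the formula rather than a recommended testing procedure.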
Information contents of the genetic markers were estimated as polymorphism information content (PIC), heterozygosity and allelic diversity.
PIC is an index of the amount of information, which modifies the simple heterozygosity index by adjusting for the chance of mating between the same heterozygotic genotypes.
PIC was calculated from Equation 2.
$$\mathrm{PIC} = 1 - \sum_{u} p_u^2 - \sum_{u<v} 2 p_u^2 p_v^2 \qquad \text{(Equation 2)}$$
where $p_u$ and $p_v$ denote the frequencies of alleles $u$ and $v$ (Czika, 2005).
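A small worked example may help to show how PIC relates to simple heterozygosity. The sketch below uses the standard definition, PIC = 1 - Σ p_u² - Σ_{u<v} 2 p_u² p_v²; the function names and the allele frequencies are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from typing import Sequence


def heterozygosity(freqs: Sequence[float]) -> float:
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs: Sequence[float]) -> float:
    """Polymorphism information content: expected heterozygosity minus a
    correction term summed over all pairs of distinct alleles."""
    correction = sum(2.0 * (p * q) ** 2 for p, q in combinations(freqs, 2))
    return heterozygosity(freqs) - correction


# Hypothetical allele frequencies: a bi-allelic marker and a four-allele STR.
for freqs in ([0.5, 0.5], [0.25, 0.25, 0.25, 0.25]):
    print(len(freqs), "alleles:", heterozygosity(freqs), pic(freqs))
```

PIC is always at most the expected heterozygosity, and the gap narrows as the number of alleles grows, which is one reason multi-allelic STR markers reach PIC values around 0.7.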
Inbreeding was estimated from the deviation from the assumption that founders share no alleles identical by descent (IBD).
In general, the genotype frequencies at a bi-allelic locus with allele frequencies $p$ and $q$ are predicted to be $p^2$, $2pq$ and $q^2$, respectively, under HWE.
However, if founders share alleles IBD with probability $F$, these predictions can be rewritten, respectively, as Equation 3.
$$p^2 + Fpq, \qquad 2pq(1 - F), \qquad q^2 + Fpq \qquad \text{(Equation 3)}$$
where $F$ denotes the inbreeding coefficient (Gillespie et al., 2004).
In brief, inbreeding is characterized by an excess of homozygotes over the expected level.
The inbreeding coefficient can be estimated as Equation 4 by solving Equation 3 for $F$.
$$F = 1 - \frac{H}{2pq} \qquad \text{(Equation 4)}$$
where $H$ denotes the observed proportion of heterozygotes, and $2pq$ denotes the heterozygote proportion expected from the allele frequencies (Hart et al., 2000).
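The estimator can be illustrated with a short sketch. For a bi-allelic locus it computes F = 1 - H / (2pq), where H is the observed heterozygote proportion; for multi-allelic markers such as STRs, the sketch replaces 2pq with the expected heterozygosity 1 - Σ p_u², which is an assumption on our part about how the formula generalizes. The function name and toy data are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def inbreeding_coefficient(genotypes: List[Tuple[str, str]]) -> float:
    """Estimate F as 1 - (observed heterozygosity / expected heterozygosity)."""
    n = len(genotypes)
    allele_counts = Counter(allele for g in genotypes for allele in g)
    freqs = [count / (2 * n) for count in allele_counts.values()]

    h_obs = sum(1 for a, b in genotypes if a != b) / n
    h_exp = 1.0 - sum(p * p for p in freqs)  # equals 2pq when there are two alleles
    if h_exp == 0.0:
        raise ValueError("monomorphic marker: F is undefined")
    return 1.0 - h_obs / h_exp


# Hypothetical founders: the first sample has a heterozygote deficit,
# the second is in exact Hardy-Weinberg proportions.
deficit = [("A", "A")] * 40 + [("A", "B")] * 20 + [("B", "B")] * 40
balanced = [("A", "A")] * 25 + [("A", "B")] * 50 + [("B", "B")] * 25
print(inbreeding_coefficient(deficit))
print(inbreeding_coefficient(balanced))
```

A genome-wide estimate would combine many markers, for example by averaging per-marker values; how the study pooled markers is not stated in the text, so that step is left out of the sketch.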
HWE and estimations of expected and observed heterozygosity frequencies were obtained using SAS/Genetics program.
Results.
The demographic characteristics of the genotyped subjects are shown in Table 1.
There were 280 (99 men and 181 women) and 142 (90 men and 52 women) founders in the Orkhontuul and Dashbalbar populations, respectively.
Non-founders' genotypes were excluded, since they do not contribute independently to the gene pool.
The information content of single markers in terms of PIC ranged between 0.2 and 0.9, as shown in Fig. 1.
The average PIC was 0.72 and 0.71 for the Orkhontuul and Dashbalbar populations, respectively, which is relatively high for single-marker information content.
There was no significant difference in PIC across the chromosomes or populations.
The high PIC level enabled accurate estimation of other population genetic parameters.
HWE was satisfied for 88.6% and 94.2% of all markers in the Orkhontuul and Dashbalbar populations, respectively (p-value > 0.05).
Under a criterion of p-value > 0.01, 90.5% and 95.3% of all markers were in HWE.
All the markers, including those that were not in HWE, were used for estimating the inbreeding coefficients.
The inbreeding coefficient was estimated to be 0.0023 and 0.0021 in the Orkhontuul and Dashbalbar populations, respectively.
Discussion.
Population isolates are generally considered to be among the most suitable populations for genetic study (Pajukanta et al., 2003; Arcos-Burgos et al., 2002; Escamilla et al., 2001).
However, possible inbreeding can cause deviation from the general assumptions on which most analyses depend.
The presence of inbreeding can be problematic because, if it exists, the genetic relationships between unrelated as well as related persons could be underestimated.
This underestimation of IBD can result in inflation of type I errors in linkage analysis (Hossjer et al., 2006; Nomura et al., 2005), linkage disequilibrium estimation and haplotype reconstruction (Zhang et al., 2004).
The inbreeding coefficient found in this study (about 0.2% in each population) does not necessitate any adjustment for genetic analyses such as IBD calculation, classic or non-parametric linkage analysis, and variance component-based linkage analysis.
In terms of the estimated last common ancestor, an inbreeding coefficient of 0.2% corresponds to 10 or 11 generations (Jensen-Seaman et al., 2001; Santos-Lopes et al., 2007).
In this study, both ABI and deCODE STR markers were genotyped with standardized procedure and any markers with more than 1% of genotype errors were discarded.
The genotype errors were confirmed within the pedigree structure.
Any Mendelian inconsistency was deleted and markers with possible double-recombination were also deleted.
Generally, genotyping in family-based studies is more accurate than in studies using unrelated individuals only.
Thus, it is unlikely that genotype errors have biased our findings.
In conclusion, we have estimated inbreeding coefficients in two population isolates in Mongolia.
We found that they fall in a negligible range, allowing related genetic studies to be performed without any modification or adjustment for possible inbreeding effects.
This finding validates the ability of the GENDISCAN study to add to the growing body of evidence which associates specific genetic variations with complex disorders.
% (6.4 of 34.5 Mb) of chromosome 22 with 757 tagSNPs and 815 haplotypes (frequency 5.0%).
Of 3430 common SNPs genotyped in all five populations, 514 were monomorphic in Koreans.
The CHB + JPT samples have more than a 72% overlap with the monomorphic SNPs in Koreans, while the CEU + YRI samples have less than a 38% overlap.
The patterns of hot spots and LD blocks were dispersed throughout chromosome 22, with some common blocks among populations, highly concordant between the three Asian samples.
Analysis of the distribution of chimpanzee-derived allele frequency (DAF), a measure of genetic differentiation, Fst levels, and allele frequency difference (AFD) among Koreans and the HapMap samples showed a strong correlation between the Asians, while the CEU and YRI samples showed a very weak correlation with Korean samples.
Relative distance as a quantitative measurement based upon DAF, Fst, and AFD indicated that all three Asian samples are very proximate, while CEU and YRI are significantly remote from the Asian samples.
Comparative genome-wide LD studies provide useful information on the association studies of complex diseases.
Introduction.
Vast amounts of information on single nucleotide polymorphisms (SNPs) and progress in high-throughput genotyping technology have generated a great deal of interest in establishing genome-wide linkage disequilibrium (LD) maps for genetic studies of complex traits (Chakravarti 2001; The International HapMap Consortium 2003; Myers and Bottolo 2005).
LD is known to occur in a block-like structure across the genome, with conserved haplotype blocks of tens to hundreds of kilobases punctuated by "hot spots" of recombination (Daly et al. 2001).
Since the concept of whole genome association studies using SNPs was introduced (Risch and Merikangas 1996), the optimal number of SNPs required for association studies has been the center of extensive debate (Kruglyak 1999).
Initial studies have focused on average LD levels and the variability in processes that generate LD (Cardon and Abecasis 2003).
Although a single chromosome could carry many haplotypes in LD blocks, recent studies suggest that haplotypic variation may be much lower than previously imagined (Jeffreys et al. 2001; Patil et al. 2001; Gabriel et al. 2002).
Patil's group identified haplotype blocks on chromosome 21 for which over 80% of chromosomes were represented by a few common haplotypes (Patil et al. 2001).
In the analysis of human chromosome 22 with a marker density of one SNP per 15 kb, Dawson's group reported a highly variable pattern of LD along the chromosome, in which extensive regions of complete LD of up to 804 kb in length were interspersed with regions of no detectable LD (Dawson et al. 2002).
Although differences in LD patterns between populations have been reported (Abecasis et al. 2002; Reich et al. 2001; Zavattari et al. 2002), little information is available on the haplotype structure in different populations other than the recent study by Gabriel et al. (2002).
On the other hand, haplotype analysis has been widely employed in linkage studies for narrowing down the location of disease susceptibility genes (Zhang et al. 2004; Park 2007).
The International HapMap Project was launched to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation among four population samples: 30 trios from Yoruba in Ibadan, Nigeria (YRI), 45 unrelated Japanese in Tokyo, Japan (JPT), 45 unrelated Han Chinese in Beijing, China (CHB), and 30 trios in a Utah, US population with Northern and Western European ancestry (CEU) from the CEPH collection (The International HapMap Consortium 2003; 2004; 2007).
As the International HapMap Project releases a validated SNP map of 1 marker per kb for the HapMap samples, the general applicability of the HapMap data needs to be confirmed in samples from related populations.
Recent comparative studies of LD patterns have shown a high degree of concordance among various populations (Gabriel et al. 2002; Shifman et al. 2003; Stenzel et al. 2004; Mueller et al. 2005).
As the HapMap samples include Japanese and Chinese, it was our interest to test whether significant differences in LD exist between Koreans and the two other Asian samples.
In this paper, we measured the LD pattern along chromosome 22 in Korean samples and compared the Korean data with those of the four HapMap samples.
We were interested in exploring how the HapMap data could be used to estimate the genomic structure of Koreans.
We expect that this study will contribute to the development of proper strategies for association studies of common complex diseases in Koreans using the HapMap data.
Methods.
A total of 111,448 reference SNPs from chromosome 22 in the dbSNP (http://www.ncbi.nlm.nih.gov/SNP, build 116) were collected.
To maximize cost effectiveness of genotyping, SNPs were selected based on the following criteria: 1) markers with even spacing, 2) verified SNPs, 3) coding SNPs.
The SNPs were scored for the selection of the study using the following strategies.
First, it was most important in mapping chromosomal LD blocks to have relatively equal spaces between SNP markers.
Second, verified SNP markers (validation status was scored as 0 to 4 in the dbSNP) that had higher scores were chosen to prevent or reduce genotyping failure.
Also, repeated sequence regions were excluded by repeat masking with Primer3 software (Rozen and Skaletsky 2000).
Third, to be useful for a further study, protein coding SNPs had higher scores.
A total of 12,674 genotyping experiments were conducted by four Genotyping Centers, and a final set of 4681 markers passed the stringent quality control procedure (The International HapMap Consortium 2003).
Genomic DNA from 90 unrelated Korean individuals without family histories of major diseases was obtained from the Genomic Research Center in the Korean National Institute of Health (KNIH).
The KNIH samples were collected as part of an epidemiological project and represent urban and rural regions in the south of Seoul.
The sex ratio was 0.5 and the mean age was 50.
Informed consent from all participating subjects was obtained through KNIH, and research approval came from the relevant ethical committees.
DNA was isolated from peripheral blood leukocytes according to standard procedures with proteinase K-RNase digestion, followed by phenol-chloroform extraction.
For each SNP, we chose a set of three primers: two PCR primers to amplify a product of 100-200 bp under standard conditions and an optimized extension primer complementary to the sequence immediately adjacent to the SNP site.
For genotyping, we employed three platforms: 6063 SNP genotyping experiments were performed using the Orchid Bioscience SNP-IT assay (Princeton, NJ), 984 using the PerkinElmer Life Sciences FP-TDI assay (Boston, MA), and 5627 using the Sequenom MassARRAY (San Diego, CA).
A genotype frequency for each SNP was checked for consistency between the observed values and those expected from the Hardy-Weinberg equilibrium test in each assay.
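As an illustration of this consistency check, the following minimal Python sketch computes a chi-square goodness-of-fit statistic against Hardy-Weinberg proportions for one biallelic SNP. The function name, the one-degree-of-freedom chi-square test, and the genotype counts are assumptions made for illustration; the text does not state which test statistic or rejection threshold was applied in each assay.

```python
from math import erfc, sqrt


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic SNP.

    Takes the observed genotype counts (AA, AB, BB) and returns (chi2, p_value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNPs).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value
```

For hypothetical counts of 25, 50, and 25 the statistic is 0, whereas counts of 50, 0, and 50 (no heterozygotes) give a statistic of 100 and a vanishingly small p value.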
Haploview version 3.2 (Barrett et al. 2004), based on the expectation-maximization (EM) method (Excoffier and Slatkin 1995), was used to infer haplotype phase and population frequency and to estimate Lewontin's coefficient D' (Lewontin 1998), LOD, and the correlation coefficient r (Hill and Robertson 1968).
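The pairwise LD measures named here can be sketched for two biallelic SNPs as follows. This is the generic textbook calculation from four haplotype frequencies and is not Haploview's implementation; the function name and the example frequencies are hypothetical, and the correlation is returned as r squared.

```python
def ld_from_haplotypes(f_AB, f_Ab, f_aB, f_ab):
    """Return (D, |D'|, r^2) from the four two-locus haplotype frequencies."""
    total = f_AB + f_Ab + f_aB + f_ab
    if abs(total - 1.0) > 1e-6:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = f_AB + f_Ab  # frequency of allele A at the first SNP
    p_B = f_AB + f_aB  # frequency of allele B at the second SNP
    d = f_AB - p_A * p_B
    # D is scaled by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2
```

With hypothetical frequencies of 0.6, 0, 0.2, and 0.2, only three of the four haplotypes are present, so |D'| is 1 while r squared is 0.375; this shows why the two measures are usually reported together.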
PHASE v2.1 was used to estimate the recombination parameters (Li and Stephens 2003; Crawford et al. 2004) and assess the statistical significance of haplotype profile differences and individual haplotype fre-
2006).
Because it has been suggested that the functional significance of IL-1B-3737 might depend on a broader haplotype, we used the three SNPs for haplotype analysis.
Haplotypes were reconstructed by PHASE version 2.1, using previously produced genotype data (Lee et al., 2004).
Of the possible eight haplotypes, three common ones accounted for 98% of the estimated haplotypes in the Korean population.
Table 1 shows the haplotype frequency estimation in each population.
The potentially more inflammatory IL-1B-511T/-31C haplotype represented 53.5% of the Korean haplotypes, compared with 33.7% of the Caucasian haplotypes.
So far, in many previous association studies, the individual SNP approach, most frequently using IL-1B-511 and IL-1B-31, has been adopted.
To our knowledge, we were the first to report that the IL-1B-1464 polymorphism has allele-specific differences in nuclear protein binding and is associated with a clinical disease (Lee et al., 2004).
The biological implication of this polymorphism was supported by in vivo studies by Chen et al., which showed that the IL-1B-1464 polymorphism has substantial allele-specific effects when IL-1B-511 and IL-1B-31 carry alleles T and C, respectively (Chen et al., 2006).
The more informative haplotype 1 (GTC), containing the IL-1B-1464 polymorphism, which shows the highest transcriptional activity, represents 9.3% and 6.0% of Korean and Caucasian haplotypes, respectively, whereas haplotype 3 (GCT), with the lowest activity, had a higher frequency in Caucasians (64.8%) when compared with Koreans (44.2%) (Table 1).
The difference in IL-1B promoter haplotype frequency between the Korean and Caucasian populations was statistically significant (χ² = 20.6, p < 0.001), and the allele frequencies of the IL-1B-1464 polymorphism (rs1143623) were also significantly different between the two populations (IL-1B-1464 G allele frequencies for Korean and HapMap European = 0.548 and 0.672, respectively) (χ² = 6.38, p = 0.01).
It has been suggested that genes that are involved in immune function may be under selective pressure in direct interaction with the environment (Sawyer et al., 2004; Kim et al., 2005).
The genes that influence a phenotypic variation between populations are expected to show high Fst values.
Compared with the Fst value for the Caucasian-vs-Asian comparison, the Fst values for the African-vs-Asian or -Caucasian comparisons were remarkably high (Fig. 1).
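For readers unfamiliar with Fst, the following sketch shows one common way to compute it for a single biallelic SNP from the allele frequencies of two populations. It assumes equal weighting of the two populations and the simple heterozygosity-based definition (H_T - H_S) / H_T; the estimator actually used for Fig. 1 is not specified in the text, and the function name is hypothetical.

```python
def fst_two_populations(p1, p2):
    """Wright's Fst for one biallelic SNP from two allele frequencies."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:
        return 0.0  # SNP is monomorphic in both populations
    return (h_total - h_within) / h_total
```

Applied to the IL-1B-1464 G allele frequencies quoted above (0.548 and 0.672), this definition gives an Fst of about 0.016.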
Previously, we reported that the IL-1B-1464 polymorphism contributes to the development of intestinal-type gastric cancer among Koreans (Lee et al., 2004).
As a curious finding in our report, the editor pointed out that carriers of IL-1B-1464 G tend to have a decreased risk of diffuse-type gastric cancer, which is the opposite of the finding for intestinal-type gastric cancer, although both intestinal and diffuse types of gastric cancer are related to Helicobacter pylori-induced gastritis (Furuta et al., 2004).
Our results showed that most IL-1B-1464 C alleles are linked to the IL-1B-511T/-31C haplotype (Table 1).
Considering the level of promoter activity of haplotype 2 (CTC), we cannot exclude the possible association between this haplotype and the risk of diffuse-type gastric cancer, especially depending on interactions with other regulatory factors (Lee et al., 2007).
Association studies that use individual SNPs appear to be insufficient, and the understanding of functional haplotype structure of populations could provide potential explanations for IL-1B-related controversies and ethnic-specific associations.
Therefore, we believe that these Korean haplotype data will be useful for future association studies between IL-1B SNPs and disease risk.
nted domains, including the human imprinted gene cluster that contains IGF2, H19, KCNQ1, ASCL2, and CDKN1C (Rapkins et al., 2006).
If, as has been suggested, imprinted genes are intimately connected with the acquisition of parental resources, we would not anticipate the existence of such genes in chicken, which leave their offspring to their own heritance after conception.
Phylogenetic analyses show that the relationship between human and mouse is closer than that between human, mouse, and chicken.
Similarly, the relationship between zebrafish and chicken is quite distant (Shah et al., 2004).
Nonetheless, we assumed that chicken have imprinted genes due to the existence of common ancestral genomic regions that have evolved on a similar basis in each of the aforementioned species.
The purpose of this study was to identify candidate imprinted genes in chicken based on an analysis of orthologous genes in human, mouse, zebrafish, and chicken using the HomoloGene database.
ols for the clinical oncology to determine the prognosis of patients (Lossos et al., 2004; Pomeroy et al., 2002), the molecular diagnosis (Golub et al., 1999) as well as the responsiveness to therapeutics (Snyder and Morgan, 2004).
There have been many reports on the molecular pattern analysis using microarray to understand the chemo- and radio-resistance in cervical cancer (Achary et al., 2000; Tewari et al., 2005; Wong et al., 2006), rectal cancer (Kim et al., 2007) and esophageal cancer (Fukuda et al., 2004).
Most of these studies aim to identify genes that are differentially expressed in patients with different clinical outcomes, which can make the evaluation of prognosis more accurate.
Although the conventional parameters like tumor stage and grade can be used to decide optimal cancer therapy, molecular markers would provide valuable information to make clinical decisions (Klopp and Eifel, 2006).
Genome-wide analysis on gene expression can predict the clinical consequences more accurately.
In addition, the information from gene expression profiling can facilitate the development of biological target for therapeutics by identifying pathways and determining steps contributing to the phenotype.
In this study, we examined the expression profiles of two lung cancer cell lines, which showed differential re-
1995).
In the inactive form, the pseudosubstrate domain is bound to the catalytic domain of PKC (Orr et al., 1994).
Upon stimulation, PKC translocates to the plasma membrane where the C1 and C2 domains interact with DAG and phosphatidylserine, respectively.
This interaction causes the pseudosubstrate domain to dissociate from the catalytic domain, which results in activation of PKC.
Inactive PKC is not freely distributed throughout the cytoplasm but appears to be localized to specific sites within the cell.
Association of PKC with scaffolding proteins such as AKAP79 (A Kinase-Anchoring Protein 79) (Klauck et al., 1996) and Gravin (Nauert et al., 1996) facilitates localization.
Streptomycetes are ubiquitous soil bacteria, and they play a key role in the global carbon cycle by degrading the insoluble remains of other organisms.
More clues to the development of the PKC super family come from the study of the bacterium Streptomyces coelicolor.
S. coelicolor has a large collection of enzymes and can metabolize many diverse nutrients.
This extremely simple organism has a genome of approximately 8,667,507 bp, yet it has a complex life cycle exhibiting mycelial growth and spore formation (Bentley, 2002) and is notable for the production of pharmaceutically useful anti-tumor compounds.
Of the predicted genes, an unprecedented proportion carries out regulatory functions in the cell (Winstead, 2002).
More than twelve percent of the genome is involved in facilitating biological processes, such as the bacterium's
s reduce implementation time and increase the likelihood of eliminating bugs and localizing code modifications when a change in implementation is required.
In the initial version of the interface, all of the classes were tangled with one another, which undermined the object-oriented design.
However, they have been completely redesigned, as shown in Table 1.
This table summarizes the recent modifications of our system, and the interfaces for each class are documented, similar to Fig. 2.
The refactored version is composed of 3851 lines of code, compared with 2765 lines in the initial version.
By importing the five packages, an exemplary software system called J3dPSV 1.0, shown in Fig. 3, has been developed for viewing 3D structures of proteins from the Protein Data Bank for demonstrational purposes.
J3dPSV supports visualization of proteins for educational purposes by simulating simple molecular graphics.
In addition, J3dPSV interactively displays a molecule on the screen in a variety of color schemes, molecular representations, and animation features.
The molecular model can be changed by selecting the list (cartoon tubes, backbone, protein, cylinder, or line) in
useful suggestions for genotype information.
Compared to the current genotyping tools, GTVseq has several unique and useful features:
* GTVseq uses two different scoring schemes, and the results are reported separately. One of the scoring schemes is similar to that of NCBI, while the other is particularly useful for viral sequences with new or complicated genotypes (vide infra).
* GTVseq offers an easy and interactive web-based user interface, with intuitive reports for genotyping results.
* GTVseq can be used for genotyping many important viruses such as HIV-1, HIV-2, HBV, HCV, HTLV-1, HTLV-2, poliovirus, enterovirus, flavivirus, Hantavirus, and rotavirus, thus permitting the most comprehensive genotyping of viral genomes to date.
Methods.
For genotyping of viral genome sequences, we need to establish 'reference sequences' for each genotype.
We downloaded the reference sequence database collections from NCBI (http://www.ncbi.nlm.nih.gov/projects/genotyping) for HIV-1, HIV-2, HBV, HCV, HTLV-1, HTLV-2, and poliovirus.
For HIV-1 reference sequences, GTVseq also provides several different collections of reference databases such as HIV-1 (2004) & CRF, HIV-1 (2005), HIV-1 (2005) & CRF.
For enterovirus, flavivirus, Hantavirus, and rotavirus, the reference sequences were
combination of databases and interactive web pages for manipulating and displaying annotations on genomes.
In other words, GBrowse is a web-based application tool that is developed for navigating and visualizing the genomic features and annotations interactively for users.
Through it, users can view a certain region of the desired genomes and search for genetic biomarkers.
They may conduct a full-text search for most features of the genomes.
They also can download SNP assay, genotype, and allele frequency information and generate customized sets of tag-SNPs for their association studies (Thorisson et al., 2005).
GBrowse utilizes a web-based display that can be used to show arbitrary features of a nucleotide or protein sequence and can accommodate genome-scale sequences that are megabases in length.
The GBrowse system consists of various kinds of software modules and systems, such as web servers, database systems, and Perl libraries.
At present, many biological websites that provide genomic variants or portal services have been developed using GBrowse, including the following: the UCSC Genome Browser (Kuhn et al., 2007), the International HapMap Project (Thorisson et al., 2005), PlasmoDB (The
is database is free for non-commercial purposes.
The KRDD is visualized using a web-based graphical view, and anonymous users can query and browse the data using the search function.
The KRDD homepage is shown in Fig. 1, and the stored data are visualized using a web-based graphical view.
It has four major menus of web pages: (i) a Blast Search of a mutant line; Blast from rice Ds-tagging mutant lines; (ii) a primer design tool to identify genotypes of Ds insertion lines; (iii) a Phenotype menu for Ds lines, searching by gene name and phenotype characteristics among specific Ds lines; and (iv) a Management menu for Ds lines.
The Blast Search is searchable by selecting specific databases, consisting of DS Sequence, Indica Core, Japonica Core, Indica EST, Japonica EST, Indica Genome, Japonica Genome, Indica GSS, and Japonica GSS in Oryza sativa.
The KRDD uses several reference databases to facilitate a comprehensive analysis of the genome sequence.
These include the Entrez nucleotide database of the National Center for Biotechnology
entative biological pathway database, now provides the KEGG Metabolism Atlas (Okuda et al., 2008) by manually combining about 120 existing metabolic pathway maps, as shown in Fig. 1.
However, the static approach to representing metabolic pathway diagrams offers no flexibility.
On the other hand, our initial attempts to visualize all information automatically in a single atlas map resulted in a confusing diagram that was difficult to interpret, as shown in Fig. 2.
It should be noted that Fig. 2 differs in many aspects from Fig. 1 or conventional drawings in biochemistry textbooks.
For this reason, we designed a new metabolic atlas viewing tool called J2dpathway, which has node-abstracting features.
When J2dpathway is initially executed, a window frame appears, as shown in Fig. 3.
The screen consists of views and editors.
The tool-bar menu at the top lists various tool icons, including zoom-in, zoom-out, cliques, highly connected nodes, and obtaining cycles.
The Map Repository View on the left side lists a preinstalled data source that has many example pathways to explore, which are arranged in a tree view of the components of
of the HIF1ODD domain with multiple partner proteins, such as ARD1 (Jeong et al., 2002), prolyl hydroxylase (PHD) (Schofield and Ratcliffe, 2004), and p53 (Fels and Koumenis, 2005; Sanchez-Puig et al., 2005), also have been reported.
However, the molecular basis for the multiple binding specificity of the HIF1ODD domain has not been understood yet.
The detailed characterization of the correlation between the binding sequence motifs in the ODD domain and its binding to multiple target proteins is necessary for understanding the versatile function of the HIF1ODD domain.
Two functionally independent sequence motifs, the N-terminal and C-terminal ODD (NODD and CODD), in the HIF1ODD domain were shown to bind to the DNA-binding domain (DBD) of p53 (Hansson et al., 2002).
The crystal structure of the CODD motif in complex with pVHL was determined to
ncluding 1p36, 5q31, and 21q22, by whole-genome linkage analysis (genome-wide association studies), and many polymorphisms also have been identified at these loci (Suzuki et al., 2003; Tokuhiro et al., 2003).
Human leucine-rich alpha-2-glycoprotein 1 (LRG1) was first identified as a trace protein in human serum (Haupt & Baudner, 1977).
The LRG1 gene is located on chromosome 19p13.3, and the primary sequence of LRG includes repeated leucine residues and also has putative membrane-binding domains.
Serum LRG1 is the first extracellular ligand for cytochrome c (Cyt c).
Cyt c is a ubiquitous, heme-containing protein that normally resides in the space between the inner and outer mitochondrial membranes (Newmeyer et al., 2003).
Extracellular Cyt c may play a role in inflammation, as it has been reported to cause arthritis when it is injected into mice.
Its levels in RA patients' sera are significantly lower than those of healthy controls (Pullerits et al., 2005).
At least eight repeating 24-amino acid segments that have a notable consensus sequence were identified in a large family of LRG proteins.
The function of LRG has not been elucidated, although the functions of many of the other members of the LRR (leucine-rich repeat)-containing superfamily are known (Kobe & Deisenhofer, 1994; Buchanan & Gay, 1996).
Plasma LRG expression levels are lower in liver cancer patients who are treated with radiofrequency ablation (Kawakami
eutic agent for cancer is the in vivo specificity of cancer cell regression.
For such a specificity, target RNA-independent and nonspecific transgene induction by the group I intron should be avoided.
In other words, mis-spliced products should not be generated by the group I intron.
In this study, to evaluate the therapeutic feasibility of the hTERT-specific group I intron, we assessed the target RNA specificity of trans-splicing by the intron in mice that had been intraperitoneally xenografted with human cancer cells.
Construction of Adenoviral Vector for hTERT-specific Group I Intron.
The expression vector that encodes for the hTERT-specific trans-splicing group I intron was constructed as previously described (Kwon et al., 2005; Song et al., 2006).
In brief, the Rib21AS group I intron, which recognizes uridine at position 21 (U21) of hTERT RNA, was generated to harbor an extended internal guide sequence, which includes an internal guide sequence (IGS, 5'-GGCAGG-3'), an extension of the P1 helix, an additional 6-nt-long P10 helix, and a 325-nt-long antisense sequence that is complementary to the downstream region (30 to 354 residues) of the targeted U21 of hTERT RNA.
In addition, a cDNA encoding the lacZ gene was inserted as a 3' exon downstream of the modified group I intron expression construct (Fig. 1A).
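The coordinates given for the antisense arm imply a length of 354 - 30 + 1 = 325 nt. A minimal sketch of deriving such an antisense sequence from a target RNA string is shown below. The function name, the 1-based inclusive coordinate convention, and the placeholder target are assumptions for illustration; the real hTERT sequence and the numbering convention used by the authors are not reproduced here.

```python
# Watson-Crick complement table for RNA.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def antisense_rna(target, start, end):
    """Reverse complement of target[start..end] (1-based, inclusive)."""
    if not 1 <= start <= end <= len(target):
        raise ValueError("coordinates fall outside the target sequence")
    region = target[start - 1:end]
    return region.translate(RNA_COMPLEMENT)[::-1]


# Placeholder target used only to check the arm length.
placeholder_target = "ACGU" * 100
arm = antisense_rna(placeholder_target, 30, 354)
```

Here `len(arm)` is 325, matching the length stated for the antisense sequence.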
using PHRAP (http://www.phrap.org/), it does not ensure correct assembly because the quality scores that are generated from 454 data are not compatible with those from Sanger reads.
Further, PHRAP has problems with handling massive reads (usually hundreds of thousands from an SFF file).
A recent report has demonstrated that GS assembler programs (gsAssembler for de novo assembly and gsMapping for reference-guided assembly; http://www.454.com/enabling-technology/the-software.asp) that are supplied by Roche Applied Science are ideal for correct assembly of 454 data that are short and inherently error-rich (Chaisson and Pevzner, 2008).
Recent versions (1.1.02.15 and later) of GS assembler programs support mixed assembly with Sanger-type reads, but their performance is not well known at present.
Moreover, because pre-existing assembly software such as PHRAP and CelAsm (Huson et al., 2001) do not directly support data that are produced by 454 machines, 454-derived contigs (GS contigs) should be used as if they were individual reads or be shredded to generate many overlapping 'pseudoreads' (Goldberg et al., 2006).
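The shredding step can be sketched as a sliding window over each contig. This is a minimal illustration and not the procedure of Goldberg et al.; the function name and the step of 100 bp are assumptions, while the read length of 600 bp follows the approximate Sanger read size mentioned in the text.

```python
def shred_contig(contig, read_len=600, step=100):
    """Cut a contig into overlapping pseudoreads.

    Returns a list of (start, sequence) pairs with 0-based start positions.
    """
    if read_len <= 0 or step <= 0:
        raise ValueError("read_len and step must be positive")
    if len(contig) <= read_len:
        return [(0, contig)]  # short contigs are kept as a single read
    last_start = len(contig) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the contig end is covered
    return [(s, contig[s:s + read_len]) for s in starts]
```

With these settings a 1,050 bp contig yields six pseudoreads of 600 bp each, the last of which is shifted so that it ends exactly at the contig end. A smaller step gives deeper pseudoread coverage at the cost of more reads for the downstream assembler.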
Pseudoreads, made from GS contigs to emulate the read size of standard Sanger data (ca. 600 bp), are virtual reads whose stepping between consecutive
dertaken as a collaboration between Korean funding agencies (Ministry of Education, Science and Technology and Korean National Institute of Health), experimental academia (Ulsan Medical Institute, SungKyunKwan Medical Institute, and Korea Advanced Institute of Science and Technology), and corporations (DNA Link, SNP-Genetics, and Samsung Advanced Institute of Technology) (Yoo et al., 2006; Lee et al., 2008).
As a result of the project, a Korean SNP and haplotype database system was developed to help researchers who study high-frequency, complex Korean diseases and changes in ethnic global migratory variants.
In the project, we tried to accomplish a number of goals.
First, the system should be able to provide essential information that is needed for gene discovery of complex Korean diseases.
Second, the system should contain basic and advanced tools that may apply to applications such as diagnostics, treatment, and prevention of diseases.
Third, the database system should provide Korean-specific SNPs and haplotype information that are common in the Korean population.
We have developed a series of software programs for association studies as well as for the comparison and analysis of Korean HapMap data with four other populations (Yorubans in Ibadan, Nigeria; Centre d'Etude du
are involved in lipogenesis, such as SREBF1, suggesting that PTP1B may play a role in the enlargement of adipocyte energy storage (Rondinone et al., 2002).
The human PTPN1 gene maps to chromosome 20q13.13, a syntenic region of the distal arm of mouse chromosome 2 that harbors quantitative trait loci for body fat and body weight (Lembertas et al., 1997).
The PTPN1 gene consists of 10 exons, spanning 74 kb, and the first intron is longer than 50 kb.
In humans, several linkage signals with type 2 diabetes mellitus (T2DM) (Bowden et al., 1997), BMI (Hunt et al., 2001), fat mass, and energetic intake (Collaku et al., 2004; Dong et al., 2003; Lembertas et al., 1997) were reported at this locus in different populations, further supporting the candidacy of PTPN1 involvement in T2DM and obesity.
In Poland, a family-based linkage study of T2DM showed the highest logarithm of the odds score (Ji et al., 1997; Klupa et al., 2000).
This locus also showed evidence of linkage with early onset T2DM (onset ≤45 years) in a subset of 55 French families (Zouali et al., 1997).
prostaglandin and is associated with biologic events such as injury, inflammation, and proliferation (Hla and Neilson, 1992; Tazawa et al., 1994).
PTGS2-mediated prostanoids play an important role in maintaining blood pressure (Anderson et al., 1976; Daniels et al., 1967).
Specifically, cortical PTGS2-derived prostaglandin I2 participates in the pathogenesis of renal vascular hypertension by stimulating renal renin synthesis and release (Hao and Breyer, 2008).
Clinical studies as well as animal studies also demonstrate important roles for PTGS2 in maintaining cardiovascular homeostasis (Zewde and Mattson, 2004; Zhang et al., 2006).
PTGS2 is upregulated in animal models of cardiac failure (Abassi et al., 2001; Adderley and Fitzgerald, 1999), and its expression has been detected in heart failure in humans (Wong et al., 1998).
The PTGS2 gene is located on chromosome 1q25.2-q25.3 (Hla and Neilson, 1992), and its cDNA encodes a 604 amino acid protein.
Recently, a large-scale association study in a Japanese population revealed the association of PTGS2 poly-
c blood pressures and heart rate (Eric Colman, 2005).
Phendimetrazine has also been widely prescribed as an anorectic for the treatment of obesity, and has been reported to have properties similar to methamphetamine, which is known to suppress appetite by activating catecholaminergic neurotransmission (Seiden et al., 1993; Chen et al., 2001).
Methamphetamine is known to primarily block dopamine transporter, which inhibits dopamine reuptake, indicating that dopamine up-regulation has an anorectic effect (Mackler et al., 1993).
Because phendimetrazine and methamphetamine stimulate the central nervous system to produce euphoria, probably via the activation of dopaminergic systems in the brain (Noailles et al., 2003), these drugs are restricted to short-term use (a few weeks) and prominently labeled to warn against the risk of addiction.
However, although many anorectics are available, evidence is still lacking concerning their efficacy, safety, and molecular mechanisms.
Recently, cDNA microarray studies on gene expression profile changes by amphetamine have been reported (Noailles et al., 2003; Yamamoto et al., 2005), but no such report has been issued on other anorectics.
In this study, we employed
gned for the identification and visual representation of CNAs using genome-wide array-CGH profiles.
CNAs can be directly identified from log2 ratio profiles that can be obtained from array-CGH datasets with minimal modifications.
A data smoothing option is also provided to cope with the noise level of the data for reliable detection of CNAs.
The identification of CNAs is based on the SW-ARRAY algorithm, which ensures fast and robust detection of chromosomal alterations.
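The idea behind Smith-Waterman-style segment detection can be sketched as follows: subtract a threshold from each log2 ratio and search for the contiguous run of probes with the largest cumulative score. This is a simplified illustration, not the CGHscape or SW-ARRAY code; the function names, the threshold of 0.2, and the three-probe median window are assumptions.

```python
from statistics import median


def moving_median(values, window=3):
    """Smooth a profile with a centered moving median (edges use shorter windows)."""
    half = window // 2
    return [median(values[max(0, i - half):i + half + 1])
            for i in range(len(values))]


def best_gain_segment(log2_ratios, threshold=0.2):
    """Find the highest-scoring contiguous run of probes.

    Returns (start, end, score) with end exclusive, or None when no probe
    exceeds the threshold. Negate the input to search for losses instead.
    """
    best = None
    best_score = 0.0
    score = 0.0
    start = 0
    for i, ratio in enumerate(log2_ratios):
        if score == 0.0:
            start = i  # a new candidate segment begins here
        score = max(0.0, score + ratio - threshold)
        if score > best_score:
            best_score = score
            best = (start, i + 1, score)
    return best
```

For a hypothetical profile `[0.0, 0.1, -0.1, 0.8, 0.9, 0.7, 0.0, -0.1]`, the sketch reports probes 3 to 5 as the best gain segment. A full implementation would also assess the significance of each segment and iterate to find further segments, which this sketch omits.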
The identified CNAs are exported into Excel-compatible outputs or illustrated through a graphical user interface.
Relatively easy operability as well as the fast processing of overall procedures is the major advantage of our software over the conventional ones.
The CGHscape software package is freely available and provides a comprehensive environment for the investigation of tumor genomes and genomic variants.
Major Functionalities of CGHscape.
(1) CGHscape was designed as a standalone program compatible with Microsoft Windows environments.
Compiled codes of CGHscape can be easily installed.
The interpreter- or web-based methods have the advantage
al., 2006).
Genealogical relationships among haplotypes in an 8.4 kb region of chromosome 2 without obligate recombination events were demonstrated using the CEU samples only (The International HapMap Consortium, 2005).
If other population samples such as YRI, CHB, and JPT had been included, the haplotype blocks would have been fragmented due to a number of historical recombination events, and phylogenetic studies with such a small block would not have been informative.
In this study, instead of conventional tree-based phylogeny, principal coordinate analysis (PCoA) (Higgins, 1992) was employed using the haplotype data on a region encompassing multiple blocks.
Because PCoA, albeit distance-based, is useful for grasping the major trends among sequences, it is worth examining how PCoA performs with such a dataset.
For illustrative purposes, a 200 kb region in chromosome Xq28, which is about 1 Mb away from the pseudoautosomal region (PAR2) at the tip of the X chromosome long arm, was chosen, and the haplotype structures of three ethnic groups that showed apparent recombination events were compared.
This region of the human genome harbors several important disease genes such as glucose-6-phosphate dehydrogenase (G6PD), cancer/testis antigens (CTAG1B, CTAG2), and Gab3 protein (GAB3).
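As a rough illustration of how PCoA can be applied to haplotype data, the sketch below computes pairwise distances between haplotypes and then performs classical principal coordinate analysis by eigendecomposition of the double-centered squared distance matrix. The distance measure (proportion of differing sites), the function names, and the toy haplotypes are assumptions made for the example; they are not taken from the study.

```python
import numpy as np

def hamming_distances(haplotypes):
    """Pairwise proportion of differing sites between equal-length haplotype strings."""
    h = np.array([list(s) for s in haplotypes])
    return (h[:, None, :] != h[None, :, :]).mean(axis=2)

def pcoa(dist, n_axes=2):
    """Classical principal coordinate analysis of a symmetric distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Gower's double-centered matrix of squared distances.
    gower = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gower)
    # Keep the axes with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_axes]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    return eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))

# Hypothetical haplotypes forming two groups.
haplotypes = ["AAAA", "AAAT", "GGGG", "GGGC"]
coords = pcoa(hamming_distances(haplotypes))
```

Each row of `coords` gives the position of one haplotype on the leading axes, so haplotypes from the same group should fall close together on the first axis.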
oaches should help identify biomarkers to classify specific diseases based on high-throughput data.
However, when a patient’s sample is evaluated to determine his/her disease status using more than one experimental condition relative to a determined biomarker set, correct prediction becomes impossible.
Furthermore, methods to predict the disease status of a patient using biomarkers that initially are identified under different conditions than those that are used for the patient analysis have not been developed.
This study suggests a method that can accurately predict the disease status of a patient using a predetermined biomarker that is developed on a different platform.
Specifically, we performed a two-step discretization of gene expression values by their rank, which were processed in both the biomarker selection and prediction stages.
Methods.
To evaluate our proposed method, we used two different datasets: the NCI dataset (Lee, et al., 2003) and the colon cancer dataset (Kim, et al., 2007; Notterman, et al., 2001).
Both of these datasets include gene expression information that was determined experimentally using two different microarray platforms (oligonucleotide-based and cDNA-based).
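The idea of discretizing expression values by rank, so that samples measured on different platforms become comparable, can be sketched as below. The excerpt does not specify the two steps, so this example assumes that the first step ranks genes within a sample and the second step maps ranks onto a small number of levels. The function name, the number of levels, and the sample values are hypothetical.

```python
import numpy as np

def rank_discretize(expression, n_levels=3):
    """Discretize one sample's expression values by rank.

    Step 1 (assumed): replace each value by its within-sample rank (0 = lowest).
    Step 2 (assumed): map ranks onto n_levels equal-sized rank bins.
    Ties are broken by position, which is a simplification.
    """
    values = np.asarray(expression, dtype=float)
    ranks = np.argsort(np.argsort(values))       # step 1
    return (ranks * n_levels) // len(values)     # step 2

# Hypothetical measurements of the same six genes on two platforms.
oligo_sample = [5.1, 7.3, 9.8, 6.0, 12.4, 8.8]    # intensity-like values
cdna_sample = [-1.2, 0.1, 0.9, -0.4, 2.0, 0.5]    # log ratio-like values

oligo_levels = rank_discretize(oligo_sample)
cdna_levels = rank_discretize(cdna_sample)
```

Because only the ordering of genes within each sample is used, the two samples above receive identical levels even though their raw scales differ.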
There are a large number a loss of function of the oxidase (Tosha et al., 2004).
Recently, the I47A/I54V protease mutant in complex with Lopinavir showed that mutation affects the strain of the bound inhibitor in the protease-binding cleft (Grantz Saskova et al., 2008).
In previous studies, the mutation of specific sites has been shown to have an effect on the function and structure of proteins that cause disease.
It is well known that there is a correlation between mutated proteins and disease.
Also, there are bioinformatic tools to predict the correlation between mutation and disease, such as SIFT (Steven Henikoff et al., 2003) and PolyPhen (Vasily Ramensky et al., 2002).
However, these tools are based only on sequence homology.
In this study, we conducted a large-scale structural and sequence mutational analysis of amino acids that could have a direct effect on protein function.
Because we collected the largest number of 3D structural changes in proteins, such as pockets, we named the dataset the structural mutatome.
The number of such structural mutations will increase continuously, and mapping the mutations to function and to disease will play a critical role in understanding the precise disease mechanisms that are caused by 3D mutations.
We classified mutated proteins by their structural properties (distance of pocket residue and mutation, pocket size, surface size, and stability) and physico-chemical properties (weight, instability, isoelectric point, and GRAVY cations that are related to the comparison and translation of various XML languages and parsers (Funahashi et al., 2004; Strömbäck et al., 2005; Choi et al., 2008).
However, because they were mostly aimed at producing relatively small-scale drawings that unify only a few pathways, systematic analyses of shared and duplicated compounds between pathway maps were not necessary.
Thus, to the best of our knowledge, KGML analysis tools for drawing a large-scale pathway such as the KEGG Atlas have rarely been addressed in the literature from a graph-theoretical perspective.
As a preliminary step in providing automatic graph layout techniques to the genome-scale flow of metabolism, analyzing KEGG XML files is crucial for software developers.
Thus, in this paper, we provide shared and duplicate compound information, using our XML analysis tool, to provide valuable information for automatic layout research in the area of systems biology.
These kinds of analyses that are based on graph-theoretical perspectives can be extremely useful when drawing a global pathway map in which edge crossing arises as a crucial issue.
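A minimal sketch of this kind of analysis is given below, assuming that compounds appear in KGML as `entry` elements with `type="compound"` and a `name` such as `cpd:C00022`. It reports compounds that occur in more than one pathway map (shared) and compounds drawn more than once within a single map (duplicated). The function names and the pathway identifiers are hypothetical, and this is not the authors' tool.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

def compound_counts(kgml_text):
    """Count how often each compound name occurs among the entries of one KGML map."""
    root = ET.fromstring(kgml_text)
    return Counter(
        entry.get("name")
        for entry in root.iter("entry")
        if entry.get("type") == "compound"
    )

def shared_and_duplicated(maps):
    """Analyze a dict of pathway id -> KGML text.

    Returns (shared, duplicated):
      shared:     compound -> sorted pathway ids, for compounds in more than one map
      duplicated: pathway id -> sorted compounds drawn more than once in that map
    """
    pathways_by_compound = defaultdict(set)
    duplicated = {}
    for pathway_id, text in maps.items():
        counts = compound_counts(text)
        for compound in counts:
            pathways_by_compound[compound].add(pathway_id)
        repeats = sorted(c for c, n in counts.items() if n > 1)
        if repeats:
            duplicated[pathway_id] = repeats
    shared = {c: sorted(p) for c, p in pathways_by_compound.items() if len(p) > 1}
    return shared, duplicated
```

Shared compounds are candidates for connecting maps in a global layout, while duplicated compounds indicate nodes that a layout algorithm might merge or keep separate to reduce edge crossings.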
ulting in a vast amount of genetic and pathway information with regard to the etiology of cerebrovascular disease.
These genes were annotated to access information on transcription, translation, structural function, and relatedness to the disease.
In addition to in silico data mining, 320 250K Affymetrix SNP chips (GeneChip Human Mapping 250K Nsp Array, Affymetrix, Inc., CA) were utilized for a case/control association study to generate experimentally associated markers of cerebrovascular disease.
The associated genes from the SNP chips and the genes that were retrieved from in silico data mining systems were compared and analyzed.
A protein-protein network diagram that showed the integrated markers and their relationships was constructed in order to analyze the network characteristics and produce hub genes.
It was found that the PPI network that was associated with cerebrovascular disease follows a power-law degree distribution, as other biological networks do (Peri et al., 2003).
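One simple way to check whether a network's degree distribution resembles a power law, P(k) ~ k^(-gamma), is to tabulate the fraction of nodes with each degree and fit a straight line on log-log axes. The sketch below does this with a least-squares fit. It is an informal check under stated assumptions (undirected edge list, hypothetical function names) and not necessarily the procedure used in the study; a rigorous fit would require a dedicated statistical method.

```python
import numpy as np
from collections import Counter

def degree_distribution(edges):
    """Return (degrees k, fraction of nodes P(k)) for an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    counts = Counter(degree.values())
    ks = sorted(counts)
    k = np.array(ks, dtype=float)
    p = np.array([counts[d] for d in ks], dtype=float) / len(degree)
    return k, p

def power_law_exponent(k, p):
    """Estimate gamma in P(k) ~ k**(-gamma) by least squares on log-log axes."""
    slope, _intercept = np.polyfit(np.log(k), np.log(p), 1)
    return -slope
```

For a network with hub genes, a few nodes have very high degree while most have low degree, which appears as an approximately linear decline on the log-log plot.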
The PathwayStudio 5.0 program (Ariadne, Inc., MD, USA) was utilized to process the natural text mining of PubMed abstracts; the use of PathwayStudio resulted in a gene-disease association network.
The etiology of the disease and its related genes, which were extracted from in silico data mining and network analy-Transitional DTD standard and does not use technologies that are dependent on specific web browsers.
This is one way to make a web alignment tool more compatible with many web browsers.
We have developed a user-friendly web-based alignment tool based on the ClustalW-MPI program.
It is standard and easy to maintain.
This web tool will help researchers carry out multiple sequence alignment with a large number of input sequences, accompanied by a viewer and an editing function.
It also enables users to download the results and perform basic analyses such as building trees and sequence clustering.
Features and Results.
In order to use alignment tools, most advanced users use UNIX or Linux commands and options directly in a console window.
This approach is inconvenient and can cause frequent mistakes.
A web alignment tool can be executed through a GUI environment on the web page by selecting commands and options.
Our web alignment tool has the following features: input, downloadable output, and visualization.
Users input multiple sequences in the web alignment ty.
The speed of sequencing is advancing manyfold per year, much faster than the development cycle of semiconductor chips in the computer industry.
Also, genome sequencing technology is becoming an everyday technology, much as computer CPUs are universally used.
Experts predict that in five years' time, everyone in developed nations will be able to have his or her own genome information.
Because of its far-reaching consequences in medicine, health, biology, nanotechnology, and information technology, DNA sequencing will become the most important industrial technology developed during the next decades.
Personal Genomics.
In 2009, genome sequencing technologies will achieve one person's whole genome per day in terms of DNA fragments sequenced.
Personal genomics is a new term for the field that utilizes such fast sequencers.
In 2008, the cost for one personal genome is less than $350,000 USD.
If the cost falls below $1,000 USD, personal genomics is predicted to have the largest impact that biology has ever had on ordinary people's lives.
Reflecting this technological advancement in society is the PGP (Personal Genome Project), a project to sequence as many people as possible at the lowest possible cost (Church, 2005).
At ited human diseases.
In addition, many computational programs have been created to predict the functional effects of unknown CVs (Ng et al., 2006; Care et al., 2007).
Database searches and bioinformatic predictions can be useful in prioritizing novel CVs for further analysis.
In this review, we summarize the databases that are most helpful in interpreting the functional effects of CVs.
We perform an extensive survey of existing in silico prediction methods and compare their performance.
Finally, we introduce a combination method as a promising approach to improve prediction performance.
Polymorphism and Mutation Databases.
Several databases that are helpful in assessing the functional effects of CVs or their relevance to disease phenotype are listed in Table 1.
Each of two broad-category mutation databases, general mutation databases (GMDBs) and locus-specific mutation databases (LSDBs), has unique strengths and weaknesses (Porter et al., 2000).
Because polymorphism and mutation databases have been developed for different uses, they complement each other.
d to the successful identification of specific positively selected genes, including human olfactory genes and human leukocyte antigen (HLA) loci (Salamon et al., 1999; Gilad et al., 2000).
Therefore, the NS/S ratio test is a recognized tool for the effective detection of types of natural selection in protein-coding genes.
Under conditions of no selection, we would expect an NS/S ratio of 1.
In the case of negative selection, NS/S < 1, and with positive selection, NS/S > 1 (Biswas & Akey, 2006).
Furthermore, the availability of large SNP datasets allowed us to determine where natural selection (either negative or positive) has affected variation in humans (Nielsen et al., 2007).
In this study, we investigated natural selection on the human genes by comparing the simple ratios of nonsynonymous and synonymous coding SNPs (cSNPs) in individual protein-coding genes.
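The simple ratio comparison can be sketched as follows, using the usual convention that a ratio below 1 suggests negative selection and a ratio above 1 suggests positive selection. The function names, the tolerance parameter, and the example counts are hypothetical. The sketch applies no statistical test, so the labels are indicative only.

```python
def ns_s_ratio(n_nonsynonymous, n_synonymous):
    """Simple ratio of nonsynonymous to synonymous cSNP counts; None if undefined."""
    if n_synonymous == 0:
        return None
    return n_nonsynonymous / n_synonymous

def selection_label(ratio, tolerance=0.0):
    """Label a gene by its NS/S ratio relative to the neutral expectation of 1."""
    if ratio is None:
        return "undefined"
    if ratio < 1.0 - tolerance:
        return "negative"
    if ratio > 1.0 + tolerance:
        return "positive"
    return "neutral"

# Hypothetical per-gene counts: (nonsynonymous cSNPs, synonymous cSNPs).
gene_counts = {"geneA": (3, 6), "geneB": (8, 4), "geneC": (5, 5), "geneD": (2, 0)}
labels = {
    gene: selection_label(ns_s_ratio(ns, s))
    for gene, (ns, s) in gene_counts.items()
}
```

Genes with no synonymous cSNPs are reported as undefined rather than assigned an infinite ratio.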
Methods.
We downloaded and analyzed all coding SNPs (cSNPs) with a validation code greater than 2 from the public dbSNP (build 125, http://www.ncbi.nlm.nih.gov/SNP/).
Where necessary, we additionally used genotype data generated from the International HapMap Project with