Methods

Study population

The initial study population included 200,595 individuals, 20-77 years of age, who visited 16 health promotion centers nationwide from April 2004 to December 2007 in the KCPS-II. Of these, there were 325 confirmed cases of CRC [26]; 132 cases with a CRC onset age of ≥55 years were excluded to restrict the analysis to early-onset CRC. Controls were recruited from the Korean Metabolic Syndrome Research Initiative study, a part of KCPS-II, in Seoul, initiated in December 2005. A total of 9,128 individuals were recruited in 2006, and an additional 17,569 individuals were recruited in 2007, so the total Seoul cohort included 26,697 volunteers. Volunteers from the first round had routine health examinations at the Health Promotion Center in university hospitals between January 2006 and December 2007. From this total, 1,004 individuals were genotyped using Affymetrix Genomewide Human SNP Array 5.0 (Affymetrix, Santa Clara, CA, USA). However, 10 of the 1,004 individuals were removed because of low genotyping call rates (<95%), and 4 individuals were found to be biological relatives, so one member of each pair was excluded. Eleven and 2 individuals were also excluded as a result of gender mismatches [27]. An additional 6 cases and 1 control were excluded due to missing anthropometric measurements (height, weight, BMI, waist circumference [WC], and blood pressure [BP]) or missing self-reported questionnaire information (smoking status and alcohol consumption). A detailed description of the KCPS-II study design and of the methods used to select controls in this study is published elsewhere [27]. In total, 1,163 participants (men, 687; women, 476) were included in this study: 187 cases (men, 133; women, 54) and 976 controls (men, 554; women, 422). A written consent form was signed by all study participants, and the Institutional Review Board of Yonsei University approved the study protocol.
Genotyping

DNA samples were isolated from the peripheral blood of participants and genotyped using Affymetrix Genomewide Human SNP Array 5.0 (Affymetrix Inc.) at DNA Link Inc. (Seoul, Korea). Internal quality control (QC) measures were employed to ensure accuracy of the data. The QC call rate (dynamic model algorithm) was ≥95%, and heterozygosity of X chromosome markers identified the gender for each sample. Genotype calling was performed with the Birdseed (v2) algorithm. Chromosome Y was not analyzed. A total of 1,163 individuals were genotyped via this platform in the analysis. PLINK (v1.07) was used to estimate identity by state (IBS) over all SNPs [28]. A default set of 426,019 SNPs was used for further analysis, as recommended by Affymetrix. In the quality assurance screening, we flagged SNPs with genotype call rates < 95%, SNPs with minor allele frequencies < 0.01, and SNPs showing deviation from Hardy-Weinberg equilibrium (HWE) at p < 0.0000001. The final set of acceptable markers included 312,506 autosomal SNPs. Accuracy of the genotyping was calculated by Bayesian robust linear modeling using the Mahalanobis distance (BRLMN) algorithm [29].

Chemistry and anthropometric measurements

Serum, separated from peripheral venous blood, was obtained from each participant after a 12-h fast and then stored at -70℃ until analyzed. For anthropometric measurements, WC was measured on the exposed waist, midway between the lower rib and the iliac crest, using a measuring tape. For difficult cases, WC was measured at 3 cm above the navel. Weight and height were measured while participants were wearing light clothing. BMI was calculated as weight (kg) divided by height squared (m²). Both systolic and diastolic BP were measured after a 15-min rest. In addition, each participant was interviewed using a structured questionnaire to collect information on smoking and alcohol consumption as well as demographic characteristics, such as age, gender, and family and past history of clinical diseases.
Cigarette smoking was classified into never smokers, ex-smokers, and current smokers. Alcohol consumption was divided into nondrinkers and current drinkers. Regular physical activity was recorded as either "yes" or "no".

SNP selection and GRS calculation

In the analysis of association between SNPs and CRC, the SNPs with p < 10⁻⁵ in Korean men were: rs17391002 (CXCL12), rs9549448 (SOX1), rs254833 (MYO10), rs2553614 (TMEM71), rs13153032 (NSUN2), rs2288073 (FLJ30851), rs9604214 (SOX1), rs9865670 (OPA1), rs17186320 (KIAA1009), rs1509497 (RFX8), rs235428 (PHF20L1), rs9845920 (OPA1), rs9846212 (OPA1), rs6763744 (OPA1), rs4128317 (ALK), rs7646304 (OPA1), rs17047306 (SPATA17), rs1490338 (SPATA17), rs902351 (SPATA17), and rs2543662 (ITSN2) (Supplementary Table 1). The SNPs with p < 10⁻⁵ in Korean women were: rs10083736 (GOT2), rs16987827 (DHX35), rs8046516 (GOT2), rs9926182 (GOT2), rs17523778 (FAM174B), rs4974411 (TPRA1), rs1834902 (H2AFY), rs16895308 (MAST4), rs8032832 (FAM174B), rs6901560 (PD6), rs11025480 (PRMT3), rs3814110 (BNC2), rs16895307 (MAST4), rs7089063 (MARCH8), rs16893688 (IBTK), rs6861487 (MAST4), rs9613463 (MN1), rs11242237 (H2AFY), rs11150094 (WWOX), and rs9625253 (MN1) (Supplementary Table 2). Each SNP in this study was assumed to be associated with risk according to an additive genetic model, which performs well even when the true genetic model is unknown or incorrectly specified [30]. A GRS was calculated on the basis of reproducible tagging of SNP-associated loci reaching genomewide levels of significance. In this study, the GRS was calculated with the 3 SNPs in Korean men and the 5 SNPs in Korean women showing the strongest association with CRC (p < 10⁻⁶). The GRS was created by two methods: a simple count method (count GRS) and a weighted method (weighted GRS) [31, 32]. Both methods assume that each SNP is independently associated with risk.
We assumed an additive genetic model for each SNP, applying a linear weighting of 0, 1, or 2 to genotypes containing 0, 1, or 2 risk alleles, respectively. This model is known to perform well, even when the true genetic model is unknown or wrongly specified [30]. The count model assumes that each SNP in the panel contributes equally to the risk for CRC, and the count GRS was calculated by summing the values for each of the SNPs. The weighted GRS was calculated by multiplying each beta-coefficient by the number of corresponding risk alleles (0, 1, 2) and summing the products across SNPs.

Outcome classification

The principal outcome variables were prevalence (n = 165) and incidence rates (n = 22), based on national cancer registry and hospitalization records. Although Korea has a national cancer registry, reporting was not complete during the time of follow-up, and consequently, hospital admission files were used to identify first admission events for CRC. An incident CRC event was coded as occurring on the basis of either a positive report from the national cancer registry or a hospital admission for a cancer diagnosis [33]. According to the International Classification of Diseases, Tenth Revision (ICD-10), CRC was coded as C18-C20 [34].

Statistical analysis

All analyses were conducted using PLINK version 1.06 (Free Software Foundation, Inc., Boston, MA, USA) and SAS statistical software version 9.0 (SAS Institute Inc., Cary, NC, USA). All statistical tests were two-sided, and statistical significance was determined as p < 0.05. To evaluate general characteristics of the study population, means and standard deviations (SD) were calculated, and the frequencies of cigarette smoking, alcohol consumption, and physical activity were determined. Paired t-tests were performed to assess the differences between case participants and control participants for both men and women. A χ² goodness-of-fit test was used to assess whether SNPs were in HWE and to determine differences in genotype frequencies between CRC cases and controls.
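As an illustration of the goodness-of-fit test for HWE, the following is a minimal sketch, assuming a biallelic SNP and a 1-degree-of-freedom chi-square approximation. It is not the PLINK or SAS routine used in this study; the function names (`hwe_chi_square`, `passes_hwe`) and the example genotype counts are hypothetical. The default threshold mirrors the QC cut-off of p < 0.0000001 given in the Genotyping subsection.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit statistic and p-value (1 df) for HWE.

    n_aa, n_ab, n_bb are observed genotype counts for a biallelic SNP.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip classes with zero expected count (monomorphic SNPs).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def passes_hwe(n_aa, n_ab, n_bb, threshold=1e-7):
    """True if the SNP is NOT flagged for HWE deviation at the threshold."""
    _, p_value = hwe_chi_square(n_aa, n_ab, n_bb)
    return p_value >= threshold


if __name__ == "__main__":
    # Hypothetical genotype counts, for illustration only.
    print(hwe_chi_square(25, 50, 25))
    print(passes_hwe(50, 0, 50))
```

With few expected genotypes in a class, the chi-square approximation becomes unreliable and an exact test is generally preferred; the sketch only illustrates the observed-versus-expected comparison.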
The GRS was categorized into quartiles. The CRC risk associated with genotype was estimated as odds ratios (ORs) and 95% confidence intervals (CIs), computed using logistic regression with an additive genetic model. We also used receiver operating characteristic (ROC) curve analysis and calculated the area under the curve (AUC; also known as the C statistic) to evaluate the discrimination power of the model. In addition, the internal validity of each model was checked using bootstrapping [35], while 10-fold cross-validation was used for the external validity of each model (Supplementary Tables 3 and 4) [36].
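The count GRS, weighted GRS, quartile categorization, and AUC described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SAS or PLINK code used in this study. All function names, genotype vectors, and beta values are hypothetical. The AUC is computed with the rank-based (Mann-Whitney) formulation rather than from a fitted logistic model, and the quartile cut points come from Python's `statistics.quantiles` with its default method.

```python
import bisect
import statistics


def count_grs(risk_allele_counts):
    """Count GRS: sum of risk-allele counts (0, 1, or 2) across SNPs."""
    return sum(risk_allele_counts)


def weighted_grs(risk_allele_counts, betas):
    """Weighted GRS: sum of beta_i * risk-allele count_i across SNPs."""
    if len(risk_allele_counts) != len(betas):
        raise ValueError("one beta-coefficient is required per SNP")
    return sum(b * g for g, b in zip(risk_allele_counts, betas))


def quartile_labels(scores):
    """Assign each score to quartile 1..4 using sample quartile cut points."""
    cuts = statistics.quantiles(scores, n=4)  # three cut points
    return [bisect.bisect_left(cuts, s) + 1 for s in scores]


def auc(scores, labels):
    """AUC (C statistic): probability that a case outscores a control.

    labels: 1 for case, 0 for control. Ties count as one half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


if __name__ == "__main__":
    # Hypothetical 3-SNP panel and beta-coefficients, for illustration only.
    genotypes = [0, 1, 2]
    betas = [0.5, 0.2, 0.1]
    print(count_grs(genotypes))
    print(weighted_grs(genotypes, betas))
    print(quartile_labels([1, 2, 3, 4, 5, 6, 7, 8]))
    print(auc([0, 1, 2, 3], [0, 0, 1, 1]))
```

The pairwise AUC loop is quadratic in sample size, which is acceptable for a cohort of this size; a rank-sum implementation would be the usual choice for larger data.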