Methods

Whole-exome sequencing data analysis
A total of 20 DNA samples were obtained from 10 pairs of bone marrow (tumor) and germline (normal) cells collected from 10 NK-AML patients. The workflow for whole-exome sequencing data analysis is illustrated in Fig. 1. The raw reads in FASTQ format were mapped to the human reference genome from the 1000 Genomes Project (human_g1k_v37.fasta) using the Burrows-Wheeler Aligner (BWA v0.6.1), which generates SAM format files [13]. The SAM files were converted into binary format (BAM) files with samtools v0.2.6, which reduces the file size and improves computing efficiency [14]. Read group information for the sequencing machine was added to the header of the BAM files. The aligned sequences were sorted in chromosomal order with Picard v1.79 (http://picard.sourceforge.net) and then underwent PCR duplicate marking, which enables the Genome Analysis Toolkit (GATK) to ignore duplicates in subsequent processing [15]. Finally, the BAM files were indexed with bamtools v2.2.0 [16].
We performed local realignment prior to recalibration, an order that gives the most accurate quality scores for each sample. Local realignment around known indel sites (Mills_and_1000G_gold_standard.indels.b37.vcf, 1000G_phase1.indels.b37.vcf) can be run on each individual separately and so does not require multiple-sample realignment, which demands extreme computational power. However, we realigned each matched tumor and normal sample pair together to prevent misalignment due to differences between the two tissue types. Recalibration was performed with multiple known sites (dbsnp_137.b37.vcf, Mills_and_1000G_gold_standard.indels.b37.vcf, and 1000G_phase1.indels.b37.vcf), which may increase recalibration accuracy. We reduced the BAM files to about 1/100 of their original size using GATK, which shortened variant calling time without losing any essential information.
We used the UnifiedGenotyper of GATK for variant calling, followed by variant recalibration with known sites (hapmap_3.3.b37.vcf, 1000G_omni2.5.b37.vcf, dbsnp_137.b37.vcf, and Mills_and_1000G_gold_standard.indels.b37.vcf), and annotated the resulting variants using snpEff v2.0.57 [17].
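The order of the steps above, and the distinction between per-sample processing and paired tumor/normal realignment, can be sketched as a small job planner. This is only an illustrative outline and not the pipeline code used in the study: the stage names, the patient identifiers, and the `plan_jobs` helper are hypothetical. Running variant calling, variant recalibration, and annotation across all samples at once is likewise an assumption, since the text does not state how samples were grouped for those steps.

```python
from collections import namedtuple

# One workflow step: a label, the tool named in the text, and the unit it runs on.
Stage = namedtuple("Stage", ["name", "tool", "scope"])

# Order follows the description in the text. "pair" means a matched
# tumor/normal pair is processed together; "cohort" is an assumption.
STAGES = [
    Stage("align", "BWA", "sample"),
    Stage("sam_to_bam", "samtools", "sample"),
    Stage("add_read_groups", "unspecified", "sample"),  # tool not named in the text
    Stage("sort", "Picard", "sample"),
    Stage("mark_duplicates", "Picard", "sample"),
    Stage("index", "bamtools", "sample"),
    Stage("local_realignment", "GATK", "pair"),
    Stage("base_recalibration", "GATK", "sample"),
    Stage("reduce_bam", "GATK", "sample"),
    Stage("call_variants", "GATK UnifiedGenotyper", "cohort"),  # scope assumed
    Stage("variant_recalibration", "GATK", "cohort"),            # scope assumed
    Stage("annotate", "snpEff", "cohort"),                       # scope assumed
]

TISSUES = ("tumor", "normal")


def plan_jobs(patients):
    """Return an ordered list of (stage name, tuple of sample ids) jobs."""
    jobs = []
    for stage in STAGES:
        if stage.scope == "sample":
            # One job per sample: two per patient.
            for patient in patients:
                for tissue in TISSUES:
                    jobs.append((stage.name, (f"{patient}_{tissue}",)))
        elif stage.scope == "pair":
            # One job per patient, covering both tissues together.
            for patient in patients:
                jobs.append((stage.name, tuple(f"{patient}_{t}" for t in TISSUES)))
        else:
            # One job over every sample in the study.
            everyone = tuple(f"{p}_{t}" for p in patients for t in TISSUES)
            jobs.append((stage.name, everyone))
    return jobs


if __name__ == "__main__":
    # Hypothetical patient identifiers for the 10 NK-AML patients.
    for name, samples in plan_jobs([f"P{i:02d}" for i in range(1, 11)]):
        print(name, len(samples))
```

With 10 patients, this plan gives one realignment job per patient, each covering both tissues, while recalibration is still scheduled once for each of the 20 samples.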

Statistical analysis
We performed logistic regression analyses between the somatic mutations and NK-AML using PLINK/SEQ v0.08 (http://atgu.atgu.mgh.harvard.edu/plinkseq), which provides utilities for analyzing whole-exome and whole-genome data in variant call format (VCF). We then verified the odds ratios and p-values estimated by PLINK/SEQ using Stata v11.2 (Stata Corp., College Station, TX, USA).
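For a single binary predictor (variant present or absent), the odds ratio from a logistic regression equals the cross-product ratio of the 2x2 table, and its Wald test uses the standard error of the log odds ratio. The sketch below shows that calculation as one way such estimates can be cross-checked by hand. It does not reproduce the PLINK/SEQ or Stata procedures; the function name and the counts in the usage lines are hypothetical.

```python
import math


def odds_ratio_wald(a, b, c, d):
    """Odds ratio, 95% Wald interval, and two-sided Wald p-value for a 2x2 table.

    a: cases carrying the variant      b: cases not carrying it
    c: controls carrying the variant   d: controls not carrying it
    """
    if min(a, b, c, d) <= 0:
        # The log odds ratio and its standard error are undefined with empty cells.
        raise ValueError("all four cells must be positive for this estimate")
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = log_or / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    ci = (math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se))
    return math.exp(log_or), ci, p_value


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    estimate, interval, p = odds_ratio_wald(6, 4, 2, 8)
    print(estimate, interval, p)
```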
We selected the somatic nsSNVs with complete call rates and evaluated GRS models composed of the variants associated with NK-AML. The GRS was calculated for each individual by summing the number of risk alleles (0, 1, or 2) across the SNVs. We created stepwise GRS models, composed of the selected SNVs, by adding SNVs according to their significance level; if the significance level was equal between two or more SNVs, we selected the SNVs in the order of their chromosomal position. In addition, we evaluated a GRS model that consisted of gene variants reported in previous leukemia studies [9, 10, 18, 19]. We compared the area under the receiver operating characteristic curve (AUC) of each GRS model using the "roctab" and "roccomp" commands in Stata.
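The GRS construction described above can be sketched in a few lines: rank the SNVs by significance with chromosomal position as the tie-breaker, sum risk-allele counts over the first k SNVs, and compute an AUC for each nested model. This is a minimal illustration under stated assumptions, not the Stata analysis. The AUC here is the rank-based (Mann-Whitney) estimate, with ties counted as one half. The SNV names, p-values, positions, and genotypes are invented for the example, and chromosomes are assumed to be coded as integers.

```python
def rank_snvs(snv_info):
    """Order SNV ids by p-value, then chromosome, then position.

    snv_info maps an SNV id to a (p_value, chromosome, position) tuple.
    """
    return sorted(snv_info, key=lambda snv: snv_info[snv])


def grs(genotype, snvs):
    """Sum the risk-allele counts (0, 1, or 2) over the SNVs in the model."""
    return sum(genotype[snv] for snv in snvs)


def auc(case_scores, control_scores):
    """Rank-based AUC: the fraction of case/control pairs in which the case
    scores higher, counting tied pairs as one half."""
    wins = 0.0
    for x in case_scores:
        for y in control_scores:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def stepwise_aucs(ranked_snvs, cases, controls):
    """AUC of each nested model built from the first k ranked SNVs."""
    results = []
    for k in range(1, len(ranked_snvs) + 1):
        model = ranked_snvs[:k]
        case_scores = [grs(g, model) for g in cases]
        control_scores = [grs(g, model) for g in controls]
        results.append((k, auc(case_scores, control_scores)))
    return results


if __name__ == "__main__":
    # Hypothetical SNVs: (p-value, chromosome, position).
    info = {"snv_a": (0.01, 2, 500), "snv_b": (0.01, 1, 900), "snv_c": (0.03, 5, 100)}
    # Hypothetical risk-allele counts per individual.
    cases = [
        {"snv_a": 1, "snv_b": 1, "snv_c": 0},
        {"snv_a": 0, "snv_b": 1, "snv_c": 1},
        {"snv_a": 1, "snv_b": 0, "snv_c": 0},
    ]
    controls = [
        {"snv_a": 0, "snv_b": 0, "snv_c": 0},
        {"snv_a": 0, "snv_b": 0, "snv_c": 0},
        {"snv_a": 0, "snv_b": 1, "snv_c": 0},
    ]
    for k, value in stepwise_aucs(rank_snvs(info), cases, controls):
        print(k, round(value, 3))
```

Sorting on the full (p-value, chromosome, position) tuple makes the tie-breaking rule explicit: two SNVs with the same p-value are ordered by where they lie in the genome.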