GSA-SNP and i-GSEA4GWAS
The GO database was downloaded from the Gene Ontology Consortium (12,902 terms) [11], and the KEGG pathway database was downloaded from KEGG (311 terms) [12].
For both databases, only gene sets with 10-200 member genes were used (3,534 GO and 211 KEGG terms). For both GSA-SNP and i-GSEA4GWAS, 20-kb padding was added to both ends of each gene. Methods of this kind, such as i-GSEA4GWAS, usually take the best SNP p-value within the gene boundary and assign it to the encompassing gene. In contrast, GSA-SNP lets the user choose which SNP p-value is assigned to each gene: either the best or the second best p-value within the gene boundary. Choosing the second best p-value has been recommended, as it may reduce false positive associations with little loss of sensitivity [4].
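The SNP-to-gene assignment described above can be sketched as follows. This is a minimal illustration, not the code of GSA-SNP or i-GSEA4GWAS: the record layouts, the function name `assign_gene_pvalues`, and the example coordinates are hypothetical, and skipping genes that have fewer SNPs than the requested rank is an assumption that the text does not specify.

```python
from typing import Dict, List, Tuple

PADDING = 20_000  # 20-kb padding on each side of a gene, as stated in the text

# Hypothetical record layouts, chosen only for this illustration.
Gene = Tuple[str, str, int, int]  # (gene_id, chromosome, start, end)
Snp = Tuple[str, int, float]      # (chromosome, position, p_value)


def assign_gene_pvalues(
    genes: List[Gene],
    snps: List[Snp],
    rank: int = 2,
    padding: int = PADDING,
) -> Dict[str, float]:
    """Assign each gene the rank-th smallest SNP p-value in its padded window.

    rank=1 gives the best p-value, rank=2 the second best.
    Assumption: a gene with fewer than `rank` SNPs gets no p-value.
    """
    assigned: Dict[str, float] = {}
    for gene_id, chrom, start, end in genes:
        lo, hi = start - padding, end + padding
        # Collect p-values of SNPs on the same chromosome inside the window.
        hits = sorted(p for c, pos, p in snps if c == chrom and lo <= pos <= hi)
        if len(hits) >= rank:
            assigned[gene_id] = hits[rank - 1]
    return assigned


# Made-up example data (not from the study).
example_genes: List[Gene] = [
    ("G1", "1", 100_000, 120_000),
    ("G2", "1", 500_000, 510_000),
]
example_snps: List[Snp] = [
    ("1", 85_000, 0.03),    # inside the padded window of G1
    ("1", 110_000, 0.001),  # inside G1 itself
    ("1", 139_000, 0.2),    # inside the padded window of G1
    ("1", 141_000, 1e-8),   # just outside the padded window of G1
    ("1", 505_000, 0.04),   # inside G2
    ("2", 110_000, 1e-9),   # different chromosome, ignored
]

best = assign_gene_pvalues(example_genes, example_snps, rank=1)
second_best = assign_gene_pvalues(example_genes, example_snps, rank=2)
```

In this toy input, the strong signal at position 141,000 lies outside the padded window and does not affect G1, and switching from the best to the second best p-value replaces a single outlying SNP with the next-strongest one, which is the intuition behind the recommendation in [4].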
For GSA-SNP, we downloaded the standalone program (as of Jan. 2011) and executed it locally. For i-GSEA4GWAS, we used the web server version and uploaded the SNP p-values. GSA-SNP offers several approaches for evaluating gene set significance. While the other approaches require p-values from permutation tests, Z-standardization does not. The score of a gene is defined as the -log of the p-value assigned to the gene. For each gene set, the scores of its member genes are averaged, and the Z-statistic of this average is used to estimate significance under the assumption of a normal distribution. Multiple testing is corrected by the false discovery rate (FDR) method [13]. i-GSEA4GWAS, on the other hand, compares the distribution of the member gene scores of a gene set with that of all genes using the Kolmogorov-Smirnov (K-S) statistic and corrects for multiple testing using an FDR based on SNP permutation tests. Variation in the number of member genes among gene sets is accounted for by multiplying the statistic by the ratio of the proportion of 'highly significant' genes in a gene set to that among all genes. Here, 'highly significant' genes are defined as genes mapped with at least one of the top 5% of all SNPs in the dataset.
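The Z-standardization step can be illustrated with a short sketch. Several details here are assumptions rather than statements from the text: the Z-statistic is written as Z = (mean of member scores - mean of all gene scores) / (SD of all gene scores / sqrt(n)), the logarithm is taken to base 10 (the choice of base rescales every score equally and so cancels in Z), the p-value is one-sided, and the FDR correction is taken to be the Benjamini-Hochberg step-up procedure. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple


def gene_scores(gene_pvalues: Dict[str, float]) -> Dict[str, float]:
    """Score each gene as -log10 of its assigned p-value (base 10 is assumed)."""
    return {gene: -math.log10(p) for gene, p in gene_pvalues.items()}


def z_test(scores: Dict[str, float], gene_set: Iterable[str]) -> Tuple[float, float]:
    """Return (Z, one-sided p-value) for one gene set under a normal assumption."""
    members = [scores[g] for g in gene_set if g in scores]
    if not members:
        raise ValueError("gene set has no scored member genes")
    mu = mean(scores.values())       # mean score over all genes
    sigma = pstdev(scores.values())  # population SD over all genes
    z = (mean(members) - mu) / (sigma / math.sqrt(len(members)))
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up gene-level p-values (not from the study).
toy_scores = gene_scores({"A": 0.1, "B": 0.01, "C": 0.001, "D": 0.0001})
toy_z, toy_p = z_test(toy_scores, ["C", "D"])
toy_fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Because no permutations are involved, the cost of this approach is a single pass over the gene sets, which is why it needs only the gene-level p-values as input.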