Big Data Mining and Analytics


SNP interactions, SNP combinations, GWAS, case-control study, disease association analysis, cross-phenotype association studies


Every person differs from every other in physical appearance, susceptibility to disease, response to medications, and so on, yet 99.9 percent of human DNA is the same. The small fraction that differs is therefore well worth studying. Single-Nucleotide Polymorphisms (SNPs) are the simplest form and most common source of genetic polymorphism. SNPs have been used successfully to identify defective genes that cause Mendelian diseases. However, most common human diseases are complex and are caused by multiple SNPs, each of which explains only a small fraction of the genetic cause. Tested individually, such SNPs may show no detectable effect on a complex disease. Pathogenesis is complicated, and it is difficult to correctly identify the multiple SNPs involved. The analysis of SNP data is thus a critical task in the study of genetic diseases. In this paper, we divide the methods for genome-wide SNP data analysis into two categories: single-trait Genome-Wide Association Studies (GWAS), in which pathology is mined from the data of a single phenotype, and multiple-trait GWAS, which identify cross-phenotype associations. For single-trait GWAS, we review methods ranging from the simple to the complex, including TEAM, BOOST, AntEpiSeeker, SNPRuler, EDCF, HiSeeker, ORF, MLR-tagging, MSCD, and MIC. For multiple-trait GWAS, we describe methods according to the regression models, dimension-reduction methods, and meta-analysis methods they employ. We also list the advantages and disadvantages of these methods. Finally, we discuss future directions of SNP data analysis for genome-wide association.


Tsinghua University Press