Developing a Unified Statistical Method for Analyzing Large-Scale Related and Unrelated Exome-Chip Data

Ethan Heinzen
University of Minnesota

Ethan Heinzen¹, Min He¹,²
¹Center for Human Genetics, and ²Biomedical Informatics Research Center

Research area: Genetics, Biomedical Informatics 

Background: When the study population of a genome-wide association study (GWAS) contains both related and unrelated individuals, it is often split into two designs, family-based and case-control, which can reduce overall power. Exome-chip data, from a platform intermediate between exome sequencing and genotyping arrays, were available for 10,016 individuals from the Personalized Medicine Research Project (PMRP) at the Marshfield Clinic Research Foundation. The exome-chip population has a mixed structure, with over 40% of samples related. To analyze the full dataset appropriately, we developed an improved principal component analysis (iPCA) for related and unrelated exome-chip data that eliminates effects caused by population stratification (unrelated samples) and family structure (related samples).

Method: We excluded variants that deviated from Hardy-Weinberg equilibrium (p < 10^-6), had a minor allele frequency below 5%, lay in regions of high linkage disequilibrium (LD), or had high missingness (>10%). We identified related individuals using identity by descent (IBD) and retained as many unrelated individuals as possible. Next, we calculated the eigenvectors of a covariance matrix built from the unrelated samples only. We then calculated principal components for all samples, related and unrelated, by multiplying each sample's genotype deviation from the mean by these eigenvectors. We tested each variant for association with a chosen phenotype using a score-based test, with covariates including BMI, smoking status, and the top 10 principal components to adjust for sample structure. Finally, we compared the results with those of a generalized estimating equation (GEE) model, which accounts for clustering within the population.
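
The eigenvector and projection steps described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the R implementation used in the study. The function name `ipca_project`, the 0/1/2 allele-count coding, the absence of missing genotypes, and the use of the unrelated-sample mean for centering are all assumptions made for the example.

```python
import numpy as np


def ipca_project(genotypes, unrelated_mask, n_components=10):
    """Hypothetical sketch: eigenvectors from unrelated samples, PCs for all.

    genotypes      : (n_samples, n_variants) allele counts coded 0/1/2,
                     assumed already filtered and free of missing values.
    unrelated_mask : boolean array, True for samples in the unrelated set.
    """
    G = np.asarray(genotypes, dtype=float)
    mask = np.asarray(unrelated_mask, dtype=bool)

    # Assumption: "the mean" is the per-variant mean over unrelated samples.
    ref = G[mask]
    mean = ref.mean(axis=0)
    centered_ref = ref - mean

    # Variant-by-variant covariance from unrelated samples only.
    # For illustration; this matrix is large when there are many variants.
    cov = centered_ref.T @ centered_ref / (ref.shape[0] - 1)

    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    loadings = eigvecs[:, order]

    # Project every sample (related and unrelated) onto the eigenvectors.
    pcs = (G - mean) @ loadings
    return pcs, eigvals[order]


# Toy usage with simulated genotypes (not PMRP data).
rng = np.random.default_rng(0)
toy_genotypes = rng.integers(0, 3, size=(30, 50))
toy_unrelated = np.zeros(30, dtype=bool)
toy_unrelated[:20] = True
toy_pcs, toy_eigvals = ipca_project(toy_genotypes, toy_unrelated, n_components=3)
```

Because the eigenvectors are estimated without the related samples, family clusters cannot dominate the leading components, while every sample still receives coordinates that can be used as covariates in the association test.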

Results: When implemented in R, the iPCA took 4.6 hours to calculate the top 10 principal components and 9.1 hours to calculate the top 100 principal components. The phenotype tested was asthma; the score-based test results are still pending.

Conclusion: The iPCA is a promising way to eliminate effects caused by population stratification and family structure, but it is computationally intensive. A more efficient implementation is needed.