EIGENSOFT: Frequently Asked Questions

Question: Is a Mac or Windows version of EIGENSOFT available?
Answer: No. Due to our limited resources we are only able to support Linux at this time.

Question: Is documentation available for EIGENSOFT?
Answer: Yes. See the main README file included in the software release, and the additional documentation files referenced therein.

Question: What’s inside the 3 directories CONVERTF, POPGEN, and EIGENSTRAT?
Answer: The CONVERTF directory contains documentation and examples of our convertf program for converting file formats. The POPGEN directory contains documentation and examples of our smartpca program for running PCA. The EIGENSTRAT directory contains documentation and examples of correcting for population stratification in disease studies using the EIGENSTRAT approach, as well as a Perl wrapper, smartpca.perl, for running the smartpca program.

Question: I tried running EIGENSOFT but the code crashes. What should I do?
Answer: This is probably a systems issue. Try running the pcatoy program; if this trivial program also crashes, contact your system administrator for help in tracking down the problem. See documentation for details.

Question: Can I run EIGENSOFT on very large data sets?
Answer: Yes. We currently support GWAS data sets up to 8 billion genotypes. For data sets between 2 billion and 8 billion genotypes, some care is required. See documentation for details.
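
As a rough illustration of the size ranges above, the following Python sketch classifies a data set by its genotype count. It assumes that "genotypes" means number of samples times number of SNPs; the function names, the category labels, and the handling of the exact boundary values are illustrative choices, not part of EIGENSOFT.

```python
# Thresholds taken from the answer above; everything else is illustrative.
CARE_THRESHOLD = 2_000_000_000   # above this, "some care is required"
MAX_SUPPORTED = 8_000_000_000    # largest supported data set

def genotype_count(num_samples, num_snps):
    """Assumption: total genotypes = samples x SNPs."""
    return num_samples * num_snps

def size_category(num_samples, num_snps):
    """Return a rough label for the data set size."""
    total = genotype_count(num_samples, num_snps)
    if total > MAX_SUPPORTED:
        return "over the supported limit"
    if total >= CARE_THRESHOLD:
        return "supported, with care"
    return "supported"

# 5,000 samples x 500,000 SNPs = 2.5 billion genotypes
print(size_category(5_000, 500_000))
```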

Question: I am running with outlier removal on a large data set and the number of outliers removed seems too large, not only in the first iteration but in subsequent iterations as well. What should I do?
Answer: The outlier removal approach we have implemented is a heuristic that seems to work well on data sets of up to 1000 samples. For larger data sets, we recommend either increasing the #sdev threshold (-s flag in smartpca.perl, outliersigmathresh parameter in smartpca) above 6.0, or combining your data set with HapMap data and removing only samples with unusual continental ancestry along the top two axes. [If you choose the latter, you can reduce running time by computing PCs using HapMap populations only. See documentation for details on how to do this.]

Question: Can I use EIGENSTRAT in studies of quantitative traits?
Answer: Yes. See the README file in the EIGENSTRAT directory.

Question: Can I use EIGENSTRAT in studies involving imputed SNPs?
Answer: At the moment, our code does not support the probabilistic genotypes that imputation programs may produce. Adding this support is algorithmically straightforward, but due to our limited resources it may be a while before we can provide this upgrade. In the meantime, a possible solution is to first run PCA on non-imputed SNPs (this will indicate whether there are ancestry differences between cases and controls) and then run EIGENSTRAT to compute disease association statistics for all SNPs, sampling integer-valued genotypes for the imputed SNPs.
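
The sampling step in this workaround could look something like the Python sketch below. It assumes the imputation program reports, for each sample at a SNP, a triple of probabilities for genotypes 0, 1 and 2; the input layout and function names are hypothetical, and writing the sampled genotypes back into an EIGENSOFT input format is not shown.

```python
import random

def sample_genotype(probs, rng):
    """Draw one integer genotype (0, 1 or 2) from a probability triple."""
    return rng.choices([0, 1, 2], weights=probs, k=1)[0]

def sample_genotypes(prob_rows, seed=0):
    """Sample one genotype per sample; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return [sample_genotype(probs, rng) for probs in prob_rows]

# Hypothetical imputation output for one SNP:
# one (P(0), P(1), P(2)) triple per sample.
imputed = [(1.0, 0.0, 0.0), (0.1, 0.8, 0.1), (0.0, 0.0, 1.0)]
print(sample_genotypes(imputed))
```

Samples with a certain call always receive that genotype; uncertain calls vary with the seed.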

Question: How long does the code take to run?
Answer: See the README file in the EIGENSTRAT directory.

Question: The code takes a long time to run on my huge data set. Isn’t a fast eigenvector approximation possible?
Answer: Yes, in theory it is possible to greatly reduce computation time of top eigenvectors using a fast eigenvector approximation. Unfortunately, due to our limited resources, we have yet to implement this.

Question: I’m running on an extremely large number of samples and the software runs out of memory. Why?
Answer: The software uses memory proportional to the square of the number of samples. In the case of an extremely large number of samples (e.g. >10,000), the software may run out of memory. The fast eigenvector approximation described above would actually solve this problem, but is not yet implemented.
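
To get a feel for the quadratic growth, the Python sketch below estimates the size of a single dense samples-by-samples matrix. It assumes 8 bytes per entry (double precision) and counts only one such matrix; the software's actual memory use may differ, for example if several matrices are held at once.

```python
BYTES_PER_ENTRY = 8  # assumption: one double-precision value per entry

def matrix_gib(num_samples):
    """Approximate size in GiB of one dense num_samples x num_samples matrix."""
    return num_samples ** 2 * BYTES_PER_ENTRY / 2 ** 30

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} samples: {matrix_gib(n):10.2f} GiB")
```

Doubling the number of samples quadruples the estimate, which is why memory becomes the limiting factor well before running time does.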

Question: When I run I get an error message about “idnames too long”. What should I do?
Answer: The software supports sample ID names of at most 39 characters. Longer sample ID names must be shortened. In addition, if your data is in PED format, the default is to concatenate the family ID and sample ID names, so their combined length must meet this limit; however, you can set “familynames: NO” so that only the sample ID name is used and only it must meet the 39-character limit.
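
Before running, it can help to list the IDs that would exceed the limit. The Python sketch below does this for PED or fam style lines, where the first two whitespace-separated columns are the family ID and the sample ID. The function name is hypothetical, and the ":" separator used when joining the two IDs is an assumption; check the documentation for how the names are actually concatenated.

```python
MAX_ID_LENGTH = 39  # limit stated in the answer above

def too_long_ids(ped_lines, use_family_names=True, separator=":"):
    """Return the ID names that exceed MAX_ID_LENGTH.

    separator is an assumption about how family and sample IDs are joined.
    """
    bad = []
    for line in ped_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        family_id, sample_id = fields[0], fields[1]
        if use_family_names:
            name = family_id + separator + sample_id
        else:
            name = sample_id
        if len(name) > MAX_ID_LENGTH:
            bad.append(name)
    return bad

# Hypothetical PED lines (genotype columns omitted for brevity).
lines = [
    "FAM1 SAMPLE1 0 0 1 1",
    "A_VERY_LONG_FAMILY_IDENTIFIER_0001 A_LONG_SAMPLE_ID_01 0 0 2 1",
]
print(too_long_ids(lines))                          # family and sample IDs joined
print(too_long_ids(lines, use_family_names=False))  # sample IDs only
```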

Question: Is it possible to obtain the SNP weights for each SNP along each top eigenvector?
Answer: Yes. See snpweightoutname parameter documented in POPGEN/README.

Question: Can I run EIGENSOFT on microsatellite data?
Answer: Yes. The way to do this is to make a fake SNP for each microsatellite allele (this is redundant, but that is OK). For details, see p. 2076 of Patterson et al. 2006 PLoS Genet.
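
One way to picture this recoding is the Python sketch below, which turns a single microsatellite locus into one fake SNP per observed allele. It assumes each fake SNP records how many copies (0, 1 or 2) of that allele a sample carries; this encoding and the function name are illustrative, missing data is not handled, and the cited paper remains the reference for the actual procedure.

```python
def microsatellite_to_fake_snps(genotypes):
    """Recode one microsatellite locus as one fake SNP per allele.

    genotypes: list of (allele_a, allele_b) pairs, one pair per sample.
    Returns a dict mapping each allele to a list of copy counts (0, 1 or 2).
    """
    alleles = sorted({allele for pair in genotypes for allele in pair})
    return {
        allele: [pair.count(allele) for pair in genotypes]
        for allele in alleles
    }

# Hypothetical repeat lengths for three samples at one locus.
locus = [(120, 124), (124, 124), (120, 128)]
for allele, counts in microsatellite_to_fake_snps(locus).items():
    print(allele, counts)
```

The redundancy mentioned above is visible here: for every sample the counts across the fake SNPs add up to 2, so any one fake SNP is determined by the others.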

Question: Why does the normalization subtract off SNP means? Wouldn’t it be theoretically correct to subtract off sample means instead?
Answer: The decision to subtract off SNP means instead of sample means was an empirical choice to ensure that results are invariant to flipping the SNP alleles (i.e. changing 0/1/2 to 2/1/0).
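
The invariance can be checked on a toy example. The Python sketch below centers a small genotype matrix either by SNP means or by sample means, flips the alleles of one SNP, and compares the resulting sample-by-sample inner products. It covers only the mean-subtraction step, and all names and data are illustrative.

```python
def center_by_snp(genos):
    """genos[sample][snp]: subtract each SNP's mean over samples."""
    n = len(genos)
    means = [sum(row[j] for row in genos) / n for j in range(len(genos[0]))]
    return [[g - m for g, m in zip(row, means)] for row in genos]

def center_by_sample(genos):
    """Subtract each sample's mean over SNPs."""
    return [[g - sum(row) / len(row) for g in row] for row in genos]

def sample_products(centered):
    """Sample-by-sample matrix of inner products of the centered rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered]
            for r1 in centered]

def flip_snp(genos, j):
    """Swap the alleles of SNP j, i.e. change 0/1/2 to 2/1/0."""
    return [[2 - g if k == j else g for k, g in enumerate(row)] for row in genos]

def same(m1, m2, tol=1e-9):
    """True if two matrices agree entry by entry within a tolerance."""
    return all(abs(a - b) <= tol for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))

genos = [[0, 1, 2], [1, 1, 0], [2, 0, 1]]   # toy data: 3 samples x 3 SNPs
flipped = flip_snp(genos, 0)

print(same(sample_products(center_by_snp(genos)),
           sample_products(center_by_snp(flipped))))      # unchanged by the flip
print(same(sample_products(center_by_sample(genos)),
           sample_products(center_by_sample(flipped))))   # changed by the flip
```

With SNP means subtracted, flipping a SNP only changes the sign of that SNP's centered column, so every product of two samples is unchanged. With sample means subtracted, the flip shifts each sample's mean and the products change.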

Question: How do I compute principal components using only a subset of populations and project other populations onto those principal components?
Answer: Use the -w flag of smartpca.perl (see EIGENSTRAT/README), or the poplistname parameter of smartpca (see POPGEN/README).

Question: Should regions of long-range LD in the genome be removed prior to PCA?
Answer: Yes. Ideally, such regions should be removed to avoid principal components that are artifacts of long-range LD. See Table 1 of Price et al. 2008 AJHG. However, EIGENSTRAT can subsequently be run to compute disease association statistics using the full set of SNPs.

Question: How do I compute principal components using only a subset of SNPs but then run EIGENSTRAT using the full set of SNPs?
Answer: If using smartpca.perl, run smartpca.perl on the reduced data set and then run EIGENSTRAT on the full data set. If using smartpca, run smartpca on the reduced data set, then run evec2pca.perl to produce a .pca file, then run EIGENSTRAT on the full data set. See POPGEN/README and EIGENSTRAT/README for details.

Question: convertf decides to “ignore” all my samples. Why?
Answer: A likely reason is that you are using a “fam” or “ped” file with a funny value (0, 9 or -9) in column 6. Try setting column 6 to 1.
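
The suggested fix can be scripted. The Python sketch below rewrites column 6 of fam or ped style lines when it holds one of the values listed above. The function name is hypothetical, fields are re-joined with single spaces (so the original spacing is not preserved), and it is best run on a copy of the file.

```python
IGNORED_VALUES = {"0", "9", "-9"}  # column 6 values named in the answer above

def fix_column6(lines, replacement="1"):
    """Return the lines with listed column 6 values replaced."""
    fixed = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 6 and fields[5] in IGNORED_VALUES:
            fields[5] = replacement
            fixed.append(" ".join(fields))
        else:
            fixed.append(line.rstrip("\n"))  # leave other lines as they are
    return fixed

# Hypothetical fam lines: family ID, sample ID, father, mother, sex, column 6.
fam_lines = ["FAM1 S1 0 0 1 -9", "FAM1 S2 0 0 2 2"]
print(fix_column6(fam_lines))
```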