Question: Is a Mac or Windows version of EIGENSOFT available?
Answer: No. Due to our limited resources we are only able to support
Linux at this time.
Question: Is documentation available for EIGENSOFT?
Answer: Yes. See the main README file included in the software release, and
the additional documentation files referenced therein.
Question: What's inside the 3 directories CONVERTF, POPGEN, and EIGENSTRAT?
Answer: The CONVERTF directory contains documentation and examples of our
convertf program for converting file formats. The POPGEN directory contains
documentation and examples of our smartpca program for running PCA.
The EIGENSTRAT directory contains documentation and examples of correcting for
population stratification in disease studies using the EIGENSTRAT approach,
as well as a PERL wrapper smartpca.perl for running the smartpca program.
Question: I tried running EIGENSOFT but the code crashes. What should I do?
Answer: This is probably a systems issue. Try running the pcatoy program;
if this trivial program also crashes, contact your system administrator for
help in tracking down the problem. See documentation for details.
Question: Can I run EIGENSOFT on very large data sets?
Answer: Yes. We currently support GWAS data sets up to 8 billion genotypes.
For data sets between 2 billion and 8 billion genotypes, some care is
required. See documentation for details.
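As a rough way to see where a data set falls relative to these limits, here is
a minimal Python sketch. It assumes that "genotypes" means the number of
samples times the number of SNPs; the function names are hypothetical and are
not part of EIGENSOFT.

```python
# Thresholds quoted in the answer above.
CARE_THRESHOLD = 2_000_000_000   # above this, "some care is required"
MAX_SUPPORTED = 8_000_000_000    # largest data set size currently supported


def genotype_count(n_samples: int, n_snps: int) -> int:
    """Total genotypes, assuming one genotype per (sample, SNP) pair."""
    return n_samples * n_snps


def size_category(n_samples: int, n_snps: int) -> str:
    """Classify a data set against the limits stated in this FAQ."""
    total = genotype_count(n_samples, n_snps)
    if total > MAX_SUPPORTED:
        return "unsupported"
    if total >= CARE_THRESHOLD:
        return "needs care"
    return "ok"


if __name__ == "__main__":
    print(size_category(5_000, 500_000))
```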
Question: I am running with outlier removal on a large data set and the
number of outliers removed seems too large, not only in the first iteration but
in subsequent iterations as well. What should I do?
Answer: The outlier removal approach we have implemented is a heuristic that
seems to work well on data sets up to 1000 samples. For larger data sets,
we recommend either increasing the #sdev threshold (-s flag in smartpca.perl,
outliersigmathresh parameter in smartpca) above 6.0, or combining your data
set with HapMap data and just removing samples with unusual continental
ancestry along the top two axes. [If choosing the latter, you can reduce
running time by computing PCs using HapMap populations only. See documentation
for details on how to do this.]
Question: Can I use EIGENSTRAT in studies of quantitative traits?
Answer: Yes. See README file in EIGENSTRAT directory.
Question: Can I use EIGENSTRAT in studies involving imputed SNPs?
Answer: At the moment, our code does not support probabilistic genotypes that
may be produced by imputation programs. Adding this is algorithmically
straightforward, but due to our limited resources it may
be a while before we can provide this upgrade. In the meantime, a possible
solution is to first run PCA on non-imputed SNPs (this will indicate whether
there are ancestry differences between cases and controls) and then run
EIGENSTRAT to compute disease association statistics for all SNPs by sampling
integer-valued genotypes in the case of imputed SNPs.
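The sampling step mentioned above could be sketched in Python as follows. This
is an illustration only, not EIGENSOFT code: it assumes each imputed genotype
comes as three probabilities (for 0, 1 and 2 copies of the reference allele),
and the function names are hypothetical.

```python
import random


def sample_genotype(probs, rng):
    """Draw an integer genotype (0, 1 or 2) from probabilities (p0, p1, p2)."""
    if len(probs) != 3 or any(p < 0 for p in probs) or sum(probs) <= 0:
        raise ValueError("expected three non-negative probabilities, not all zero")
    # random.choices normalizes the weights, so they need not sum exactly to 1.
    return rng.choices((0, 1, 2), weights=probs, k=1)[0]


def sample_snp(prob_rows, seed=0):
    """Sample one integer genotype per sample for a single imputed SNP."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [sample_genotype(probs, rng) for probs in prob_rows]


if __name__ == "__main__":
    imputed = [(0.9, 0.1, 0.0), (0.1, 0.7, 0.2), (0.0, 0.05, 0.95)]
    print(sample_snp(imputed))
```

The sampled integers can then be written out in whatever genotype format you
normally pass to EIGENSTRAT.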
Question: How long does the code take to run?
Answer: See README file in EIGENSTRAT directory.
Question: The code takes a long time to run on my huge data set. Isn't a
fast eigenvector approximation possible?
Answer: Yes, in theory it is possible to greatly reduce computation time
of top eigenvectors using a fast eigenvector approximation. Unfortunately,
due to our limited resources, we have yet to implement this.
Question: I'm running on an extremely large number of samples and the software
runs out of memory. Why?
Answer: The software uses memory proportional to the square of the number of
samples. In the case of an extremely large number of samples (e.g. >10,000),
the software may run out of memory. The fast eigenvector approximation
described above would actually solve this problem, but is not yet implemented.
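To get a feel for the quadratic growth, here is a back-of-the-envelope Python
sketch. It assumes a single samples-by-samples matrix stored at 8 bytes per
entry; that is an illustrative assumption, and the software's actual memory
use may differ.

```python
def pairwise_matrix_bytes(n_samples: int, bytes_per_entry: int = 8) -> int:
    """Bytes for one n_samples x n_samples matrix (8-byte entries assumed)."""
    return n_samples * n_samples * bytes_per_entry


def to_gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    for n in (1_000, 10_000, 50_000):
        print(n, "samples:", round(to_gib(pairwise_matrix_bytes(n)), 2), "GiB")
```

Doubling the number of samples quadruples the estimate.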
Question: When I run I get an error message about "idnames too long".
What should I do?
Answer: The software supports sample ID names up to a maximum of 39 characters.
Longer sample ID names must be shortened. In addition, if your data is in
PED format, the default is to concatenate the family ID and sample ID names
so that their total length must meet this limit; however, you can set
"familynames: NO" so that only the sample ID name will be used and
must meet the 39 character limit.
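A quick pre-check for over-long names might look like the Python sketch below.
The ":" separator used when joining family ID and sample ID is an assumption
made for illustration (check the documentation for the exact form), and the
function names are hypothetical.

```python
MAX_ID_LENGTH = 39  # limit stated in the answer above


def effective_id(family_id, sample_id, familynames=True, sep=":"):
    """Name that must fit the limit; `sep` is an assumed separator."""
    return f"{family_id}{sep}{sample_id}" if familynames else sample_id


def too_long_ids(rows, familynames=True):
    """Return the effective names that exceed MAX_ID_LENGTH.

    rows: iterable of (family_id, sample_id) pairs, e.g. columns 1-2 of a PED file.
    """
    names = (effective_id(fam, sid, familynames) for fam, sid in rows)
    return [name for name in names if len(name) > MAX_ID_LENGTH]


if __name__ == "__main__":
    rows = [("FAM1", "S1"), ("F" * 20, "S" * 20)]
    print(too_long_ids(rows))                     # concatenated names
    print(too_long_ids(rows, familynames=False))  # sample IDs only
```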
Question: Is it possible to obtain the SNP weights for each SNP
along each top eigenvector?
Answer: Yes. See snpweightoutname parameter documented in POPGEN/README.
Question: Can I run EIGENSOFT on microsatellite data?
Answer: Yes. The way to do this is to make a fake SNP for each microsatellite
allele (this is redundant, but that is OK). For details, see p.2076
of Patterson et al. 2006 PLoS Genet.
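One way to picture the encoding is the Python sketch below. It assumes
diploid samples with no missing data, and that each fake SNP records how many
copies (0, 1 or 2) of one particular allele a sample carries; the function
name is hypothetical and the paper remains the reference for details.

```python
def microsatellite_to_fake_snps(genotypes):
    """Turn one microsatellite into one fake SNP per observed allele.

    genotypes: list of (allele_a, allele_b) tuples, one per sample.
    Returns {allele: [copies of that allele in each sample]}.
    """
    alleles = sorted({allele for pair in genotypes for allele in pair})
    return {
        allele: [pair.count(allele) for pair in genotypes]
        for allele in alleles
    }


if __name__ == "__main__":
    # Three samples typed at a marker with repeat lengths 10, 12 and 14.
    marker = [(10, 12), (12, 12), (10, 14)]
    for allele, counts in microsatellite_to_fake_snps(marker).items():
        print(allele, counts)
```

The redundancy is visible here: for every sample the counts across the fake
SNPs add up to 2, so any one fake SNP is determined by the others.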
Question: Why does the normalization subtract off SNP means? Wouldn't it be
theoretically correct to subtract off sample means instead?
Answer: The decision to subtract off SNP means instead of sample means was an
empirical choice to ensure that results are invariant to flipping the
SNP alleles (i.e. changing 0/1/2 to 2/1/0).
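The invariance can be checked numerically with a small Python sketch. It only
subtracts SNP means and forms sample-by-sample cross products; it is a
simplified stand-in for illustration, not the full smartpca normalization, and
the function names are hypothetical.

```python
def center_by_snp(genotypes):
    """genotypes[snp][sample] -> same shape, with each SNP's mean subtracted."""
    centered = []
    for row in genotypes:
        mean = sum(row) / len(row)
        centered.append([g - mean for g in row])
    return centered


def sample_covariance(genotypes):
    """Unscaled sample-by-sample cross-product matrix of SNP-centered data."""
    centered = center_by_snp(genotypes)
    n = len(genotypes[0])
    return [
        [sum(row[i] * row[j] for row in centered) for j in range(n)]
        for i in range(n)
    ]


def flip_snp(genotypes, index):
    """Recode one SNP from 0/1/2 to 2/1/0."""
    return [
        [2 - g for g in row] if k == index else list(row)
        for k, row in enumerate(genotypes)
    ]


if __name__ == "__main__":
    data = [[0, 1, 2, 1], [2, 2, 0, 1], [0, 0, 1, 2]]
    print(sample_covariance(data) == sample_covariance(flip_snp(data, 1)))
```

Flipping a SNP negates its centered values, and the sign cancels in every
product of two samples, so the matrix that PCA operates on is unchanged.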
Question: How do I compute principal components using only a subset of
populations and project other populations onto those principal components?
Answer: Use the -w flag to smartpca.perl (see EIGENSTRAT/README), or use the
poplistname parameter to smartpca (see POPGEN/README).
Question: Should regions of long-range LD in the genome be removed prior to
PCA?
Answer: Yes, to avoid principal components that are artifacts of long-range LD
it is ideal to remove such regions. See Table 1 of Price et al. 2008 AJHG.
However, EIGENSTRAT can subsequently be run to compute disease association
statistics using the full set of SNPs.
Question: How do I compute principal components using only a subset of SNPs but
then run EIGENSTRAT using the full set of SNPs?
Answer: If using smartpca.perl, run smartpca.perl on the reduced data set and
then run EIGENSTRAT on the full data set. If using smartpca, run smartpca on
the reduced data set, then run evec2pca.perl to produce a .pca file, then run
EIGENSTRAT on the full data set. Details are in POPGEN/README and
EIGENSTRAT/README.
Question: convertf decides to "ignore" all my samples. Why?
Answer: A likely reason is that you are using a "fam" or "ped" file with an
unexpected value (0, 9 or -9) in column 6 (the phenotype column). Try setting
column 6 to 1.
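The column 6 fix can be scripted; the Python sketch below is one hedged way to
do it. It assumes whitespace-separated fields, rewrites them with single
spaces, and uses a hypothetical function name. Work on a copy of your file,
since every flagged value is overwritten with 1.

```python
def fix_phenotype_column(lines, bad_values=("0", "9", "-9"), replacement="1"):
    """Replace flagged values in column 6 of fam/ped lines.

    lines: iterable of text lines with whitespace-separated fields.
    Returns the lines with fields re-joined by single spaces.
    """
    fixed = []
    for line in lines:
        fields = line.split()
        # Column 6 is index 5; leave shorter lines untouched.
        if len(fields) >= 6 and fields[5] in bad_values:
            fields[5] = replacement
        fixed.append(" ".join(fields))
    return fixed


if __name__ == "__main__":
    example = ["F1 S1 0 0 1 -9", "F1 S2 0 0 2 2"]
    print("\n".join(fix_phenotype_column(example)))
```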