dChip: Filter genes

 

Percentile fold change filtering

 

We are often interested in genes that show large variation across samples or that are present in most samples. These genes can be used for unsupervised gene or sample clustering, so that clustering results are not affected by noise from absent or unchanged genes. Select menu “Analysis/Filter genes”:

 

The dialog provides several criteria for filtering genes. The current “Array list file” specifies the samples used in the filtering and “pooling replicate arrays” will be performed before the filtering if needed.

 

Criterion (1) requires that the ratio of the standard deviation to the mean of a gene’s expression values across all samples be greater than a certain threshold. This ratio is also known as the Coefficient of Variation (CV). The more variable a gene is across samples, the larger the ratio is. But if a gene is mostly absent across samples, this ratio can be large because the mean is small; in this case we can also use criterion (2). The default upper limit of 1000 is a reasonably large number that is usually satisfied, but it can be lowered to obtain genes that are not variable across samples.
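As a minimal sketch of the CV criterion described above (not dChip's own code), the following Python computes the ratio of standard deviation to mean for each gene and keeps genes whose CV lies between a lower and an upper limit. The lower threshold of 0.5, the use of the sample (n-1) standard deviation, the strict inequalities, and the gene names are all assumptions made for illustration.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Return standard deviation / mean for one gene's expression values."""
    m = mean(values)
    if m == 0:
        # CV is undefined for a zero mean; report infinity so the
        # upper limit excludes such a gene.
        return float("inf")
    return stdev(values) / m  # sample (n-1) standard deviation: an assumption

def filter_by_cv(expression, lower=0.5, upper=1000.0):
    """Keep genes with lower < CV < upper (hypothetical thresholds)."""
    return [gene for gene, values in expression.items()
            if lower < coefficient_of_variation(values) < upper]

# Hypothetical expression values for three genes across four samples.
expression = {
    "gene_variable": [100.0, 400.0, 900.0, 50.0],
    "gene_flat": [500.0, 510.0, 495.0, 505.0],
    "gene_absent": [2.0, 30.0, 1.0, 3.0],
}
selected = filter_by_cv(expression)
print(selected)
```

In this toy data the mostly absent gene also passes, because its small mean inflates the ratio; this is the situation in which the text suggests adding criterion (2).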

 

If a group of samples contains some outlier samples that drive gene filtering and clustering (e.g. samples 29 and 11 in the left figure below; data courtesy of Tao Lu and Bruce Yankner), we can log-transform expression values by checking "Open group/Options/Log x transform". Afterwards, gene filtering criterion (1) can be set to use "Standard deviation (for logged data)" instead of CV, owing to the variance stabilization property of the log transformation (right figure below).
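The logged-data variant of criterion (1) can be sketched in the same way. This is an illustration only: the base of the log transform (base 2 here), the threshold of 1.0, and the gene names are assumptions, and expression values are assumed to be strictly positive.

```python
import math
from statistics import stdev

def log_sd(values):
    """Standard deviation of log2-transformed expression values.

    Base 2 is an assumption; values must be strictly positive.
    """
    return stdev(math.log2(v) for v in values)

def filter_by_log_sd(expression, threshold=1.0):
    """Keep genes whose SD on the log scale exceeds a hypothetical threshold."""
    return [gene for gene, values in expression.items()
            if log_sd(values) > threshold]

# Hypothetical expression values for two genes across five samples.
expression = {
    "gene_broad": [50.0, 200.0, 800.0, 100.0, 400.0],
    "gene_flat": [100.0, 110.0, 95.0, 105.0, 100.0],
}
selected = filter_by_log_sd(expression)
print(selected)
```

On the log scale a fixed fold change contributes the same amount regardless of the absolute expression level, which is why a plain standard deviation can replace the CV once the data are logged.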

            

 

Criterion (2) requires that a gene be called “Present” in more than a specified fraction of the arrays in the array list file. Please see the “Handling replicate arrays” section for criterion (3). Criterion (4) selects genes whose expression values are larger than a threshold in more than a specified percentage of samples. If the expression values have been log-transformed by checking "Open group/Options/Log x transform", the expression value threshold should also be specified in log scale. Also note that criterion (2) operates at the array level (replicate arrays are not pooled), whereas criterion (4) operates at the sample level (replicate arrays are pooled to compute the mean expression for each sample).
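The difference between the array-level criterion (2) and the sample-level criterion (4) can be sketched as follows. This is a hedged illustration for a single gene, not dChip's implementation: the call codes ("P", "A"), the treatment of any non-"P" call as not present, the thresholds (20%, 100, 40%), and the sample names are assumptions.

```python
from statistics import mean

def pool_replicates(replicate_values):
    """Average replicate array values into one value per sample."""
    return {sample: mean(values) for sample, values in replicate_values.items()}

def fraction_present(calls):
    """Fraction of arrays on which the gene is called Present ("P")."""
    return sum(1 for call in calls if call == "P") / len(calls)

def fraction_above(values, threshold):
    """Fraction of values strictly larger than the expression threshold."""
    return sum(1 for v in values if v > threshold) / len(values)

# Criterion (2): one presence call per array, replicates NOT pooled.
array_calls = ["P", "A", "P", "P", "A"]
passes_presence = fraction_present(array_calls) > 0.20

# Criterion (4): replicate arrays pooled to one mean value per sample first.
replicates = {
    "sample1": [20.0, 40.0],
    "sample2": [150.0],
    "sample3": [300.0, 340.0],
    "sample4": [80.0],
}
pooled = pool_replicates(replicates)
passes_expression = fraction_above(list(pooled.values()), 100.0) > 0.40

print(passes_presence, passes_expression)
```

If the data were log-transformed, the value 100.0 passed to `fraction_above` would need to be replaced by its logged equivalent, as the text notes.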

 

The filtering can be restricted to an existing gene list or its complement set, if a “Gene list file” (a tab-delimited text file with the first column of each row being the probe set name) is specified via the “Filter on gene list” button. Click the button multiple times to switch between “using all genes”, “using gene list” and “excluding gene list”.
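The gene list file format and the three modes described above can be sketched like this. The function names, the mode strings ("all", "use", "exclude"), and the assumption that the file has no header row are hypothetical; only the "first column is the probe set name" layout comes from the text.

```python
import io

def read_gene_list(handle):
    """Read probe set names from the first tab-delimited column of each row."""
    names = set()
    for line in handle:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank rows
        names.add(line.split("\t")[0].strip())
    return names

def restrict_genes(all_genes, gene_list, mode="all"):
    """Apply one of the three modes: all genes, gene list only, or its complement."""
    if mode == "all":
        return list(all_genes)
    if mode == "use":
        return [g for g in all_genes if g in gene_list]
    if mode == "exclude":
        return [g for g in all_genes if g not in gene_list]
    raise ValueError("mode must be 'all', 'use' or 'exclude'")

# An in-memory stand-in for a gene list file (illustrative content).
gene_list_file = io.StringIO("1007_s_at\tsome annotation\n117_at\tother annotation\n")
gene_list = read_gene_list(gene_list_file)

all_genes = ["1007_s_at", "1053_at", "117_at", "121_at"]
print(restrict_genes(all_genes, gene_list, "use"))
print(restrict_genes(all_genes, gene_list, "exclude"))
```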

 

By default the “Analysis/Filter genes” and “Analysis/Compare samples” functions ignore the Affymetrix control genes (probe set names starting with “AFFX-”), since their changes are generally not interesting. To include Affymetrix control genes in the filtering or comparison, click “More options/Analysis” or select from menu “Tools/Options/Analysis/”, and uncheck “Omit Affymetrix control probe set at filtering or comparison”.
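The default handling of control genes amounts to a prefix test on the probe set name. A minimal sketch, with hypothetical function names and an `omit_controls` flag standing in for the option described above:

```python
def is_affymetrix_control(probe_set_name):
    """True for probe set names starting with the "AFFX-" prefix."""
    return probe_set_name.startswith("AFFX-")

def drop_control_genes(probe_set_names, omit_controls=True):
    """Remove control probe sets unless omit_controls is switched off."""
    if not omit_controls:
        return list(probe_set_names)
    return [name for name in probe_set_names
            if not is_affymetrix_control(name)]

# Illustrative probe set names.
names = ["AFFX-BioB-5_at", "1007_s_at", "AFFX-HUMGAPDH/M33197_5_at", "117_at"]
print(drop_control_genes(names))
print(drop_control_genes(names, omit_controls=False))
```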

 

The genes satisfying the filtering criteria will be exported to the “Gene list file” specified at the “Filtered gene list” button. This gene list file can be used in analysis functions such as hierarchical clustering, or in “Analysis/Model-based expression/Export” to export expression data for these genes. Once the gene list file is saved, its directory and name are stored in the Windows clipboard automatically, so one can use Control+V to paste the file name into other places such as “Analysis/Hierarchical clustering”.

 

Other gene filtering functions

 

ANOVA filtering can use sample category information to select genes that vary among sample groups in a supervised way.

 

[Use version 10/1/05+] We can also use the "Tools/Percentile filtering" function to select genes by their fold change between a high and a low percentile across samples. The filtered gene list can be clustered and viewed as usual. This gene list file also contains the percentile-standardized data, with the two percentiles for each gene linearly scaled to the +/- displaying range (specified at “Options/Clustering”). To view the percentile-standardized data, first use 'Get external data' to read in this exported data file, and then run “Analysis/Clustering” with “Options/Standardize rows” unchecked.
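The percentile fold change and the percentile standardization can be sketched as below. This is one plausible reading of the description, not dChip's actual computation: the linear-interpolation percentile, the 25th/75th percentiles, the display range of 3, and the choice not to clip values outside the two percentiles are all assumptions.

```python
def percentile(values, p):
    """Percentile by linear interpolation between order statistics (an assumption)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (p / 100.0) * (len(ordered) - 1)
    lower = int(rank)                       # floor, since rank >= 0
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

def percentile_fold_change(values, low_p, high_p):
    """Ratio of the high percentile to the low percentile for one gene."""
    return percentile(values, high_p) / percentile(values, low_p)

def percentile_standardize(values, low_p, high_p, display_range):
    """Map the low percentile to -display_range and the high one to +display_range.

    Assumes the two percentiles differ; values outside them are not clipped.
    """
    lo = percentile(values, low_p)
    hi = percentile(values, high_p)
    return [-display_range + 2.0 * display_range * (v - lo) / (hi - lo)
            for v in values]

# Hypothetical expression values for one gene across five samples.
values = [100.0, 200.0, 400.0, 800.0, 1600.0]
fold = percentile_fold_change(values, 25, 75)
scaled = percentile_standardize(values, 25, 75, 3.0)
print(fold)
print(scaled)
```

A gene would then be kept when `fold` exceeds a chosen fold change cutoff, and the `scaled` row is what a clustering view with row standardization switched off would display.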