Past Short Courses and Tutorials

 

2014-2015 Short Courses


Organizer: Han Chen

The Program in Quantitative Genomics (PQG) offers a monthly short course series on statistical and computational methods for analyzing genomic and 'omic data, genomic databases and resources, and data processing and analysis software, throughout the school year.

 

Tuesday, November 18, 2014
12:30-2:00 PM
Building 2, Room 426 - Biostatistics Conference Room
a pizza lunch will be provided

Allan Just
Research Fellow
Department of Environmental Health


Pre-processing and analyzing DNA methylation microarrays

DNA methylation, an epigenetic modification that may modulate gene expression, can be influenced by the environment and may also be associated with subsequent health risks. The Illumina 450K methylation microarray, covering over 480,000 methylation sites, is a popular platform for measuring these marks in epidemiologic studies. This short course will present challenges and opportunities in using this platform, with a focus on the role of technical variability and pre-processing for epigenome-wide association studies (EWAS). Examples will include a large dataset from the Normative Aging Study and the use of free software tools (R and Bioconductor packages) as well as online data repositories.
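
As a taste of the kind of tooling the course covers, here is a minimal sketch of a typical 450K preprocessing workflow using the Bioconductor minfi package (one of several suitable packages; the idat directory, sample sheet, and detection-p-value cutoff are hypothetical placeholders, not specifics from the course):

    # Minimal 450K preprocessing sketch using minfi (Bioconductor).
    # "idat_dir" and the 0.01 detection cutoff are illustrative choices.
    library(minfi)

    targets <- read.metharray.sheet("idat_dir")       # sample sheet for the arrays
    rgset   <- read.metharray.exp(targets = targets)  # raw red/green intensities

    mset <- preprocessNoob(rgset)                     # background/dye-bias correction

    # Keep probes detected well above background in every sample
    detp <- detectionP(rgset)
    keep <- rownames(detp)[rowSums(detp < 0.01) == ncol(detp)]
    mset <- mset[intersect(rownames(mset), keep), ]

    beta <- getBeta(mset)   # per-probe methylation proportions for EWAS modeling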

 


Tuesday, September 23, 2014
12:30-2:00 PM
Building 2, Room 426 - Biostatistics Conference Room

Chunyu Liu, Ph.D.
NHLBI’s Framingham Heart Study


Analysis of the Mitochondrial Genome in Relation to Complex Traits

Mitochondria are important organelles in human cells. Through a process called oxidative phosphorylation (OXPHOS), mitochondria produce the ATP used for all cellular activities; they also regulate a number of other cellular processes. Although mitochondria are largely controlled by the nuclear genome, they possess their own genome (mtDNA), which encodes genes used in the OXPHOS process. mtDNA is inherited only through the mother. Mitochondria have been implicated in several human diseases, including rare disorders and common phenotypes such as metabolic and immunological traits and aging.

I will focus on two topics in my talk. First, I will discuss association testing of mtDNA variants using family data. Because of maternal inheritance, the effects of both the nuclear and mitochondrial genomes need to be considered when testing mitochondrial variants for association with traits in pedigrees. Second, I will discuss annotation and association analysis of the 226 mtDNA variants on the Illumina Human Exome BeadChip.
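
The talk presents dedicated methodology, but the general idea (an mtDNA variant as a fixed effect, nuclear relatedness absorbed by a polygenic random effect) can be sketched with a generic mixed model using the kinship2 and coxme packages. This is not the speaker's method; the data frame dat and all its columns are hypothetical:

    # Generic sketch: test one mtDNA variant in pedigrees while modeling
    # nuclear relatedness via a kinship matrix.  'dat' with columns id,
    # father, mother, sex, age, trait, mt_genotype is a hypothetical
    # pedigree data set.
    library(kinship2)
    library(coxme)

    ped  <- with(dat, pedigree(id = id, dadid = father, momid = mother, sex = sex))
    kmat <- kinship(ped)   # expected nuclear allele sharing between relatives

    # Linear mixed model: fixed mtDNA genotype effect + polygenic random effect
    fit <- lmekin(trait ~ mt_genotype + age + sex + (1 | id),
                  data = dat, varlist = list(2 * kmat))
    fit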

 

2013-2014 Short Courses


Organizer: Han Chen

 

Tuesday, April 8, 2014
12:30-2:00, Building 2-Room 426


Alkes Price
Assistant Professor of Statistical Genetics, Depts. of Epidemiology & Biostatistics, HSPH

Missing and Hidden Heritability of Complex Traits

In this short course, I will first review the concepts of missing and hidden heritability of complex traits, focusing on recent work on estimating the heritability explained by genotyped SNPs (h2g) that distinguishes these two concepts. I will then describe ongoing efforts to provide a functional interpretation of hidden heritability, using annotations of coding and regulatory regions. Finally, I will describe recent and ongoing endeavors to resolve the heritability that is still missing.

 


Tuesday, March 25, 2014
12:30-2:00, Building 2-Room 426


Kimberly Glass
Research Fellow, Department of Biostatistics and Computational Biology, DFCI & HSPH

Using Integrative Networks as a framework for Identifying Disease Mechanisms

Rapidly evolving genomic technologies have allowed us to begin to develop a more unified understanding of how many different types of interactions at multiple and vastly different scales can influence patient health. In recent years we have come to appreciate that in many cases a single gene or pathway cannot fully characterize changes in cellular states. Rather, these states are better characterized by shifts in regulatory networks, with structures that are altered between different disease states. Here I will describe the application of PANDA (Passing Attributes between Networks for Data Assimilation) to integrate emerging multi-omic data through the reconstruction of comprehensive cellular regulatory networks, show how these network models can be used to make specific, testable hypotheses about the mechanisms mediating differences in disease response and phenotype, and explore how network modeling can help guide the development of more personalized drug therapies.

 



Tuesday, March 4, 2014

12:30-2:00, Building 2-Room 426


Eric Franzosa
Postdoc with Curtis Huttenhower

A Tour of the BioBakery: Computational Tools for Microbial Community Analysis

Microbial ecology is one of many fields that have benefitted greatly from technical advances in DNA sequencing. In particular, low-cost culture-independent sequencing has made metagenomic and metatranscriptomic ("meta'omic") surveys of microbial communities practical, including bacteria, archaea, viruses, and fungi associated with the human body, other hosts, and the environment. In this talk, we will introduce the tools our lab has developed to transform raw meta'omic sequencing data into biological insights. In the first part of the talk, we will discuss tools for taxonomic and functional profiling of microbial communities. These tools interpret mappings of DNA and RNA reads to microbial reference genomes in order to answer (i) "Which species are present in a community (and in what proportions)?" as well as (ii) "What are they doing?" The resulting profiles are compositional in nature, and are characterized by sparse features with high dynamic range. In the second part of the talk, we will discuss the statistical methods we have developed to explore these unique data, including tools to quantify (i) co-occurrence relationships between species and (ii) associations between meta'omic features and clinically important metadata. We will conclude with an overview of the lab's Galaxy web server and VM-based solutions for providing easy access to the tools introduced in the talk.



Tuesday, February 4, 2014
12:30-2:00, Building 2-Room 426

Guocheng Yuan
Associate Professor of Computational Biology and Bioinformatics
Department of Biostatistics
Dana-Farber Cancer Institute
Harvard School of Public Health

Mapping Cellular Hierarchy Through Single-cell Analysis

How do cells make decisions? Despite the large amount of publicly available microarray and RNA-seq data, this question remains poorly understood. The major problem is that traditional genomic approaches can only measure the average behavior over a large population of cells, whereas cellular responses are highly heterogeneous even within the same cell type. To overcome this limitation, new technologies are being developed at a rapid pace to profile gene expression at single-cell resolution. Such technologies will have a profound impact on how we understand biology. In this talk I will present two recent studies from our group. In the first, we used single-cell analysis to map the cellular hierarchy in the blood system, providing new insights into the differentiation pathway of hematopoietic stem cells. In the second, we developed a new computational approach, based on bifurcation analysis, to identify initiation events associated with cell differentiation. If time allows, I will also present ongoing work on the dynamic evolution of cellular heterogeneity.

 


Tuesday, October 8, 2013
12:30-2:30, Building 2-Room 426


Chaolong Wang
Research Fellow
Department of Biostatistics

Statistical Methods to Estimate Individual Ancestry with Applications to Disease Gene Mapping


Knowledge of individual ancestry is important for correcting population stratification in disease association studies and avoiding spurious association signals. However, estimating individual ancestry is challenging for data generated by some modern cost-effective technologies. For example, in target sequencing experiments, targeted regions often include too few variants to accurately represent global ancestry, and off-target regions are covered too poorly for accurate genotype estimation. To address these challenges, I developed LASER, a new method that skips genotype calling and directly analyzes sequence reads from off-target regions to estimate individual ancestry in a principal components ancestry space. Using simulations and real data, we showed that the method can accurately infer worldwide continental ancestry, as well as fine-scale ancestry within Europe, from modest amounts of sequence data. Based on a similar pipeline, we also implemented a software program called TRACE, which can trace individual ancestry using small amounts of genotype data when sequence reads are not available. Combined with a genotype imputation approach, we showed that TRACE can accurately estimate fine-scale population structure for European samples genotyped on the ExomeChip. These methods enabled us to introduce additional ancestry-matched controls from public resources to augment the sample size of a target sequencing study of age-related macular degeneration, leading to the discovery of a rare variant significantly associated with increased risk of the disease.

 

2012-2013 Short Courses


Organizer: Chaolong Wang

The Program in Quantitative Genomics (PQG) offers a monthly short course series on statistical and computational methods for analyzing genomic and 'omic data, genomic databases and resources, and data processing and analysis software, throughout the school year.


 

Tuesday, April 23, 2013
Building 2, Room 426
12:30-2:00


Peter Kraft
Professor of Epidemiology
Department of Epidemiology
Department of Biostatistics
Harvard School of Public Health

Challenges in the study of gene-environment interactions in genetic epidemiology

Many chronic diseases are caused by a complex interplay of multiple genes and multiple clinical, social and environmental risk factors. Epidemiology can help identify the key players in this network and how they interact. However, untangling the causal web from observational data can be quite difficult.

This lecture will introduce the methodological, statistical, and practical challenges in the study of gene-environment interactions. It will also review what is known about the epidemiology of gene-environment interactions in several complex diseases, and summarize the recommendations from a recent “Gene-Environment Think Tank” sponsored by the National Cancer Institute.



Tuesday, March 26, 2013
Building 2, Room 426
12:30-2:00


Shamil Sunyaev
Associate Professor of Medicine and Health Sciences and Technology
Harvard Medical School and Brigham & Women's Hospital

Computational and Statistical Methods for Sequencing Studies in Mendelian and Complex Trait Genetics

Whole exome and whole genome sequencing studies are rapidly emerging as leading approaches in both Mendelian and complex trait genetics. Sequencing studies aim to identify causal allelic variants directly, rather than finding statistical proxies that serve as genetic markers. Thus, these studies require new statistical methods that may benefit from evolutionary and functional considerations. I will discuss the population genetics foundations and statistical aspects of sequencing studies. I will describe existing computational methods for predicting the functional effect of human allelic variants and the problems currently complicating the improvement of these methods. I will also present two examples of recent sequencing studies together with their first results.


Tuesday, February 26, 2013
Building 2, Room 426
12:30-2:00


Xiaole Shirley Liu
Professor of Biostatistics and Computational Biology, Harvard School of Public Health
Director, Center for Functional Cancer Epigenetics, Dana-Farber Cancer Institute
Associate Member, Broad Institute

Chromatin Dynamics and Cancer Gene Regulation

The application of ChIP-seq and DNase-seq in recent years has greatly expedited the mechanistic understanding of transcriptional and epigenetic gene regulation in development and disease. Although epigenetic profiles are often considered a reflection of overall transcriptional activity, their dynamics can be effectively mined to reveal novel transcriptional mechanisms. I will introduce computational methods for analyzing ChIP-seq data, approaches that use the dynamics of ChIP-seq and DNase-seq to infer in vivo transcription factor binding, and novel functions of chromatin factors. I will also discuss applications to liver metabolism, gut differentiation, and prostate and breast cancer cell progression, which reveal distinct modes of transcription factor binding and chromatin dynamics.


Tuesday, November 20, 2012
12:30-2:00, Building 2, Room 426


Liming Liang
Assistant Professor of Statistical Genetics
Department of Epidemiology
Department of Biostatistics

Statistical analysis of methylation quantitative trait loci and epigenetic networks underlying complex traits

Epigenetic variation in the methylation of DNA at CpG islands (CGI) is related to the regulation of transcription. Abnormalities of DNA methylation are well recognized in single-gene disorders and in cancer, and it is postulated that epigenetic changes in methylation may be important in common human diseases. Recent technological advances have enabled high-throughput DNA methylation data to be collected in large-scale epidemiological investigations. This short course is intended to give an overview of analyses of high-throughput DNA methylation data, in particular from the Illumina HumanMethylation27 and HumanMethylation450 BeadChips. The course will use real data and hands-on experience to discuss topics including preprocessing, normalization, batch effect adjustment, association analyses, and network analyses (time permitting).
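
As one hedged illustration of the batch-effect-adjustment step mentioned above, ComBat from the Bioconductor sva package is a common choice (the beta matrix, pheno annotation, and example probe ID below are hypothetical placeholders):

    # Sketch: batch-effect adjustment of methylation data with sva::ComBat.
    # 'beta' is a probes-x-samples matrix; 'pheno' (columns batch, exposure)
    # is a hypothetical sample annotation table.
    library(sva)

    m <- log2(beta / (1 - beta))   # M-values behave better in linear models

    mod   <- model.matrix(~ exposure, data = pheno)  # preserve the signal of interest
    m_adj <- ComBat(dat = m, batch = pheno$batch, mod = mod)

    # Probe-wise association after adjustment, shown for one example CpG:
    summary(lm(m_adj["cg00000029", ] ~ pheno$exposure))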

 

2011-2012 Short Courses


Organizers: Lin Li & Seunggeun Lee

The Program in Quantitative Genomics (PQG) offers a monthly short course series on statistical and computational methods for analyzing genomic and 'omic data, genomic databases and resources, and data processing and analysis software, throughout the school year.


Tuesday, April 10, 2012
12:30-1:30, FXB G13



Alkes Price, Ph.D.
Assistant Professor of Statistical Genetics
Department of Epidemiology
Department of Biostatistics


Disease mapping in admixed populations

Admixed populations such as African Americans and Hispanic Americans are often medically underserved and bear a disproportionately high burden of disease. Owing to the diversity of their genomes, these populations have both advantages and disadvantages for genetic studies of complex phenotypes. Advances in statistical methodologies that can infer chromosomal segments from ancestral populations may substantially enhance disease mapping efforts in these populations.

 


 

Tuesday, March 6, 2012
12:30-1:30, FXB G13

W. Evan Johnson, Ph.D.
Assistant Professor
Department of Statistics
Boston University

The Universal Probability Code (UPC): Platform-independent preprocessing of expression profiling data for personalized medicine workflows

The development of personalized treatment regimes is an active area of current research in genomics. During the last decade, several technological advances such as microarrays and RNA sequencing have allowed for the quantification of transcription levels in research settings. This has enabled researchers to use gene expression as a tool for developing personalized therapies. However, in a typical research study, biological specimens representing multiple conditions are often processed in batches, after which researchers must apply normalization techniques to correct for non-biological artifacts. In processing any particular sample, many normalization methods borrow information from other samples to estimate probe-level effects and to standardize variances across the arrays. However, as the personalized-medicine era evolves, a greater emphasis must be placed on techniques that can process individual samples, so that diagnostic and prognostic models can be derived and subsequent samples can be compared against the models serially. For this reason we have developed a series of complementary methods that require no ancillary samples and can be applied to single-channel arrays, two-channel arrays, and next-generation RNA sequencing data. Our approach, the Universal Probability Code (UPC), combines sophisticated background modeling approaches with a two-component mixture model, which classifies genes as either inactive or active. Importantly, downstream analyses, including those that evaluate differential expression, can interpret UPC values consistently, irrespective of the technology used to measure gene expression. In addition, because samples are processed individually, output values can be incorporated into personalized-medicine workflows which call for genomic samples to be processed serially rather than in batches.
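
The UPC method itself is more sophisticated, but the core idea of classifying genes as active or inactive within a single sample can be sketched with a generic two-component mixture model. The code below uses the mclust package and a hypothetical intensities vector; it is a toy illustration, not the UPC implementation:

    # Toy sketch of the active/inactive idea behind UPC (not the real method):
    # fit a two-component mixture to one sample's log intensities and report
    # each gene's posterior probability of belonging to the "active" component.
    library(mclust)

    logx <- log2(intensities + 1)                 # one sample, hypothetical input
    fit  <- Mclust(logx, G = 2, modelNames = "V") # two normals, unequal variances

    active   <- which.max(fit$parameters$mean)    # higher-mean component = "active"
    upc_like <- fit$z[, active]                   # per-gene probability of expression
    head(upc_like)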

 


Tuesday, February 7, 2012
12:30-1:30, FXB G13



Soumya Raychaudhuri, M.D., Ph.D.
Divisions of Genetics & Rheumatology
Brigham & Women's Hospital
Harvard Medical School
The Broad Institute


Fine-mapping genetic loci: Application to the CFH gene in Age-Related Macular Degeneration

In this talk we describe strategies to take SNP association hits from GWAS data sets and localize the association signal. Ideally such efforts will lead to variants that are functional and act to cause disease. We will emphasize strategies to phase SNP data and apply conditional haplotype analysis. As a primary example we will focus on the Complement Factor H (CFH) gene, known to have common variants associated with age-related macular degeneration (AMD). This gene has been associated with AMD since 2005, as one of the earliest hits from genome-wide association studies. We will demonstrate how high-density SNP genotyping can be used to localize the association signal to the gene, and to mitigate the possibility that the association results from a neighboring gene in linkage disequilibrium. Then we will apply conditional haplotype analysis to identify a rare haplotype that confers a high degree of risk for AMD. Sequencing individuals with that haplotype identified a rare R1210C variant that confers a high risk of AMD (OR ~20) and accounts for roughly 1% of cases. The variant has been shown to be functional and, intriguingly, in separate studies has been shown to confer dominant risk of atypical hemolytic uremic syndrome, a familial pediatric disease.

 


 

Tuesday, January 24, 2012
12:30-1:30, FXB G13



Tianxi Cai, Sc.D.
Associate Professor of Biostatistics
Department of Biostatistics
Harvard School of Public Health


Risk Prediction with High Dimensional Genomic Markers

The complexity of the genetic architecture of human health and disease makes it difficult to identify genomic markers associated with disease risk or to construct accurate genetic risk prediction models. Accurate risk assessment is further complicated by the availability of a large number of markers that may be predominately unrelated to the outcome or may explain a relatively small amount of genetic variation. The standard approach to identifying important markers often assesses the marginal effects of individual markers on a phenotype of interest. When multiple markers relate to the phenotype simultaneously via a complex structure, such a type of marginal analysis may not be effective. In this talk, I will discuss various testing and estimation procedures that can be used to efficiently construct parsimonious risk prediction models.
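
One common route to the parsimonious models mentioned above is penalized regression; the following is a minimal lasso sketch with the glmnet package, where X, y, and X_new are hypothetical genotype and phenotype inputs:

    # Sketch: sparse genetic risk prediction via lasso-penalized logistic
    # regression.  X (n x p genotypes), y (0/1 phenotype), and X_new are
    # hypothetical placeholders.
    library(glmnet)

    cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1: lasso

    coefs    <- coef(cvfit, s = "lambda.min")
    selected <- rownames(coefs)[coefs[, 1] != 0]  # markers kept in the sparse model

    # Predicted disease risk for new individuals
    risk <- predict(cvfit, newx = X_new, s = "lambda.min", type = "response")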


 

Tuesday, December 20, 2011
12:30-1:30, FXB G13



Seunggeun Lee, Ph.D.
Research Fellow
Department of Biostatistics
Harvard School of Public Health


Rare Variant Association Testing Using the Sequence Kernel Association Test (SKAT)

Sequencing studies are increasingly being conducted to identify rare variants associated with complex traits. In this talk, I will introduce the sequence kernel association test (SKAT) and its recent extensions. In addition, I will show how SKAT can be used for data analysis using the SKAT R package. SKAT is a kernel machine based regression method to test for association between genetic variants (common and rare) in a region and a continuous or dichotomous trait, while easily adjusting for covariates. As a score-based variance component test, SKAT can quickly calculate p-values analytically by fitting the null model containing only the covariates, and so can easily be applied to genome-wide data. We also provide analytic power and sample size calculations to help design candidate gene, whole exome, and whole genome sequence association studies.
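
Since the abstract mentions the SKAT R package directly, here is a minimal usage sketch; the genotype matrix Z and the variables y, age, and sex are hypothetical placeholders:

    # Minimal SKAT sketch: y is a 0/1 phenotype, age/sex are covariates, and
    # Z is an n-x-m genotype matrix for one region (all hypothetical).
    library(SKAT)

    # Fit the null model once, with covariates only ("D" = dichotomous trait)
    obj <- SKAT_Null_Model(y ~ age + sex, out_type = "D")

    # Region-based test; default Beta(1,25) weights upweight rarer variants
    SKAT(Z, obj)$p.value

    # SKAT-O, which combines burden and variance-component tests
    SKAT(Z, obj, method = "optimal.adj")$p.value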


Tuesday, November 8, 2011
12:30-2:00, FXB G13

a pizza lunch will be provided

W. Evan Johnson, Ph.D.
Assistant Professor
Department of Statistics
Boston University


Low-level Processing and Visualization of Whole Exome Sequencing Data


The advent of massively parallel sequencing technologies has revolutionized our ability to identify rare variants that underlie familial disease risk. Many groups are actively engaged in whole exome sequencing projects to identify rare variants associated with disease susceptibility, and many tools have recently been developed for identifying disease-related SNPs in next-generation sequencing (NGS) data. We have developed our own NGS variant identification algorithm, which uses a probabilistic pair hidden Markov model (PHMM) for base calling and SNP detection that incorporates base-call uncertainty; it also propagates uncertainty from multiple sub-optimal alignments of each read. We will be extending this algorithm to identify other genomic changes, including insertions, deletions, inversions, epigenetic modifications, copy number variation, allele-specific expression, and splice variants. We will also survey other tools currently used for this purpose, and present results from our whole-exome studies, which include breast cancer cohorts, a large ADHD pedigree, and a small pedigree affected by a severe, rare, X-linked disease.


 

Tuesday, October 11, 2011
12:30-2:00, FXB G13

a pizza lunch will be provided

Christoph Bock, Ph.D.
Broad Institute of MIT and Harvard, Cambridge, MA, USA; Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA; Max Planck Institute for Informatics, Saarbrücken, Germany.

 

Bioinformatic Methods and Software for Epigenome-wide Association Studies (EWAS) and Biomarker Development

 

Epigenome mapping has played an important role in establishing the prevalence of altered DNA methylation in cancer cells, and it is increasingly applied to investigate diseases other than cancer. Indeed, epigenetic events could provide a tractable link between the genome and the environment, with the epigenome emerging as a biochemical record of relevant life events.

Systematic investigation of these topics requires powerful, accurate and cost-efficient methods for epigenome profiling of human samples. We have recently benchmarked four methods for genome-wide DNA methylation mapping in terms of their accuracy and power to detect DNA methylation differences (Bock et al. 2010 Nat Biotechnol). This technology comparison was designed to mimic the setup of a typical EWAS, suggesting that our bioinformatic and experimental approach could provide a blueprint for designing and executing large-scale EWAS investigating the epigenetic basis of human diseases. Furthermore, we have developed bioinformatic tools for epigenetic biomarker development, which facilitate the adaptation of disease-specific epigenetic alterations for clinical diagnostics.
My talk will discuss this ongoing work and highlight practical implications for conducting epigenome association studies and performing epigenetic biomarker development.

The described work was in part funded by NIH grant U01ES017155 (Roadmap Epigenome Mapping Center) and the Broad Institute’s Epigenome Initiative.

 

 

2010-2011 Short Courses


Tuesday, April 26, 2011
12:30-2:00, Kresge G2

a pizza lunch will be provided

Eric Tchetgen Tchetgen, Ph.D.
Assistant Professor of Epidemiology
Department of Epidemiology
Department of Biostatistics
Harvard School of Public Health


An overview of some modern statistical methods for the study of gene-environment interaction


This talk gives an overview of some modern statistical techniques for making inferences about gene-environment interaction. Some emphasis is given to issues of confounding adjustment, the impact of model mis-specification, and strategies for improving estimation efficiency. Methodologies for prospective and retrospective study designs are discussed, including standard linear and logistic outcome regression approaches, semiparametric estimators, case-only methods, the profile likelihood approach, empirical Bayes techniques, and a retrospective regression framework. Software implementation is provided and data examples are presented for illustration.
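
As a concrete anchor for two of the approaches listed, here is a hedged sketch of the standard prospective logistic regression with a product term and of the case-only estimator; the data frame dat and its columns are hypothetical:

    # Sketch: two standard gene-environment interaction analyses for
    # case-control data.  'dat' with columns case (0/1), g (0/1/2 genotype),
    # e (binary exposure), and age is hypothetical.

    # 1) Prospective logistic regression with a multiplicative interaction term
    fit <- glm(case ~ g * e + age, family = binomial, data = dat)
    summary(fit)$coefficients["g:e", ]

    # 2) Case-only estimator: G-E association among cases.  More efficient,
    #    but valid only if G and E are independent in the source population.
    fit_co <- glm(e ~ g + age, family = binomial, data = subset(dat, case == 1))
    summary(fit_co)$coefficients["g", ]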


 

Tuesday, March 22, 2011
12:30-2:00, Kresge G2

a pizza lunch will be provided

Christoph Lange
Associate Professor of Biostatistics
Harvard School of Public Health


Large-scale Association Studies in Family-based Design


Large-scale genetic association studies in family-based designs provide a unique opportunity to gain a better understanding of the pathways underlying complex diseases. The modeling of phenotypes and the multiple comparisons problem pose serious statistical challenges. We will discuss how these issues can be addressed in family-based designs and implemented in the statistical analysis.

 


 

Tuesday, February 22, 2011
12:30-2:00, Kresge G2

a pizza lunch will be provided

**PLEASE BRING YOUR LAPTOP COMPUTER FOR THIS TUTORIAL**

PQG DATAVERSE - SOFTWARE DOWNLOADS

The short course will use both the Predictive Networks web application (https://compbio.dfci.harvard.edu/predictivenetworks/) and the R programming language (http://www.r-project.org/).

Attendees should bring a laptop with R installed (http://cran.r-project.org/), along with the pgfSweave, penalized, minet, catnet, and network packages (see Packages).

Those who want to take a look at the software used during the short course can use these two links:

https://compbio.dfci.harvard.edu/predictivenetworks
https://github.com/bhaibeka/predictionet

Attendees who are interested in bringing their own gene expression dataset and list of genes of interest can do so and run the tutorial on their own data (though this may not be trivial).

Benjamin Haibe-Kains
Department of Biostatistics and Computational Biology
Center for Cancer Computational Biology
Dana-Farber Cancer Institute and Harvard School of Public Health

Predictive Networks: A New Framework for Inferring Robust Networks from Gene Expression Data

DNA microarrays and other high-throughput omics technologies provide large datasets that often include hidden patterns of correlation between genes, reflecting the complex processes that underlie cellular function. The challenge in analyzing large-scale expression data has been to extract biologically meaningful inferences regarding these processes, often represented as networks, in an environment where the datasets are complex and noisy. Although many techniques have been developed in an attempt to address these issues, to date their ability to extract meaningful and predictive network relationships has been limited. In this PQG short course, I will introduce the problem of network inference and present a platform developed in John Quackenbush's lab that enables inference of reliable gene interaction networks from prior biological knowledge, in the form of biomedical literature and structured databases, and from gene expression profiling data. The preliminary version of our analytical pipeline is accessible both through the Predictive Networks web application and through the predictionet R package. Using real data, I will show the benefit of using prior biological knowledge to infer networks and how to quantitatively assess the quality of such networks. Access to the Predictive Networks web application and the latest version of the predictionet package will be provided. The material for this course will consist of a Sweave file including all the R commands needed to infer networks and analyze them.

ACKNOWLEDGEMENTS

This work is supervised by John Quackenbush and has been done in collaboration with Christopher Bouton, Erik Bakke, James Hardwick, Catharina Olsen, Gianluca Bontempi, Amira Djebbari, Niall Prendergast, and Renee Rubio.  

 


 

January 25, 2011
12:30-2:00, Kresge G2

a pizza lunch will be provided

Adnan Derti and Yair Benita
Merck and Co., Inc.

Practical Challenges of Next-generation Sequencing: Applications to Genome, Exome and Transcriptome Sequencing

Next-generation sequencing (NGS) has taken center stage in biomedical research. The technology is affordable and has immense potential; however, data management and analysis pose a tremendous challenge. In this talk we will discuss the challenges facing the pharmaceutical industry as it transitions to NGS as its primary profiling tool for various projects. We will discuss in detail our approaches to building de novo genome assemblies, profiling tumors for oncology research, and transcriptome discovery and quantitation with known and novel sequencing methods.


Tuesday, October 12, 2010
12:30-2:00 PM, FXB G13


Liming Liang, Ph.D.
Assistant Professor of Statistical Genetics
Departments of Epidemiology & Biostatistics
Harvard School of Public Health

Current findings from the 1000 Genomes Project

The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. The pilot-phase projects, designed to develop and compare different strategies for genome-wide sequencing on high-throughput sequencing platforms, have been completed. In this presentation, while not attempting to cover every aspect of the project, I will give a general introduction to the 1000 Genomes Project and focus on current findings. It will be useful to those who want to know more about the project and learn about the findings and available data that can be applied to their own research.


2009-2010 Tutorials


May 10, 2010
3:30-5:30 PM, Kresge LL6

Joshua Korn
The Broad Institute

Copy-Number Variation and Disease: Studying CNVs in Scans for Whole Genome Association

Heritable phenotypes have a wide variety of genetic causes, running the gamut from Mendelian diseases such as sickle-cell anemia and Down's syndrome (caused exclusively by a single point mutation and a whole-chromosome duplication, respectively), to complex diseases such as Crohn's disease and diabetes (both of which have at least dozens of associated polymorphisms spread throughout the genome, each with small effect). Recently, large segments of the genome, ranging in size from 1 kb to 1 Mb, were discovered to vary in copy number between healthy samples as well. Integrating the analysis of SNPs and CNVs allows for a more complete picture of human genetic variation. Additionally, while rare point mutations can only be found via sequencing techniques, whole genome association studies are well-powered to detect rare copy-number variation affecting segments at least 20 kb long, allowing an early look at the relative contribution of rare, de novo, and common copy number variants to various phenotypes. An example of each of these types of variants has recently been associated with disease, and we will explore the different methods required to do such analyses for future studies.

 


Monday, March 22, 2010
3:30-5:30 PM, Kresge LL6


Michael Reich
Director of Cancer Informatics Development
Broad Institute of MIT and Harvard

Tools for Integrative Genomics

Integrative genomics provides unprecedented power to increase our understanding of basic biological processes and determine the mechanisms of disease. This approach, which combines evidence from multiple data modalities such as gene expression, copy number, epigenetic, and mutation data to find the genomic causes of a disease state, has resulted in the identification of novel mutations, the discovery of causal relationships between genomic aberrations and clinical pathologies, and other important insights in the short time it has been in practice. To take advantage of this wealth of data, new tools are needed that can span data modalities and support the very large datasets characteristic of integrative efforts. The Broad Institute has produced a number of software tools to facilitate integrative genomics investigations, including GenePattern, a suite of over 120 tools for the analysis of gene expression, copy number, proteomics, flow cytometry, and other data, along with extensive capabilities for combining these tools to create complex, reproducible methodologies; and the Integrative Genomics Viewer (IGV), a flexible, scalable, high-performance tool for the concurrent visualization of multiple large-scale datasets. These freely available tools are used by tens of thousands of researchers worldwide to improve our understanding of cancer, immunology, microbial genomics, stem cell biology, and other fields.

Participants will learn the major features and benefits of GenePattern and IGV and will understand how these tools may be applied in their own research.

 


Monday, January 25, 2010
3:30-5:30 PM, Kresge LL6

Saumyadipta Pyne, Ph.D.
Department of Medical Oncology
Dana-Farber Cancer Institute

Marc-Danie Nazaire
Cancer Program
Broad Institute of MIT and Harvard

"FLAME: A Platform for Automated High-dimensional Flow Cytometric Data Analysis"

Flow cytometry is a popular platform in both clinical and research applications for rapid single-cell interrogation of surface and intracellular markers. Multiparametric flow platforms, which allow an increasing number of markers to be measured in parallel, have challenged the traditional technique of identifying cell populations by manual analysis. We present a new computational platform, FLAME, based on novel high-dimensional parametric modeling strategies, that addresses the complexities of multiparametric flow data with rigor and robustness, and without any need to project the data to lower (one or two) dimensions. We demonstrate FLAME's ability to detect rare cell populations, to model robustly in the presence of noisy and skewed subpopulations, and to perform the critical task of registering cell populations across samples, which enables comparison of cohorts across different time points and phenotypes. FLAME has been incorporated into the Broad Institute's GenePattern package to facilitate the pipelining of flow data analysis with standard bioinformatic applications such as high-dimensional visualization, subject classification, and outcome prediction. We demonstrate how FLAME can facilitate the application of flow cytometry to complex biological and clinical problems.


Monday, December 14, 2009
3:30-5:30 PM, Kresge LL6

Pierre R. Bushel, Ph.D.
Staff Scientist
Biostatistics Branch
NIEHS

"State-of-the-art Biological Processes Enrichment Using Gene Ontology"

The Gene Ontology (GO) is a biological resource containing annotation, in the form of a controlled vocabulary, of the molecular characteristics of genes and gene products. This resource has been extremely useful for research investigators seeking insight into the molecular pathways that govern biological conditions. However, the topology of GO poses challenges for reliable enrichment of biological processes.

This tutorial will present an overview of GO, touch on the limitations of a typical method for performing gene set enrichment and then address key considerations for improved overrepresentation of GO terms in a data set. The tutorial will conclude with a short demonstration of GOEAST, a web-based tool that performs gene set enrichment analysis but with the inherent GO hierarchical structure considered.
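
For reference, the "typical method" being critiqued is usually a one-sided hypergeometric (Fisher) over-representation test applied to each GO term independently, which ignores the hierarchical structure. A minimal sketch with made-up counts:

    # Typical GO over-representation test for a single term, applied
    # independently per term (the hierarchy-blind approach the tutorial
    # critiques).  All counts below are made up for illustration.
    N <- 20000  # annotated genes in the universe
    K <- 150    # universe genes annotated to this GO term
    n <- 400    # genes in the study set (e.g., differentially expressed)
    k <- 12     # study-set genes annotated to the term

    # P(X >= k) under sampling n genes without replacement
    phyper(k - 1, K, N - K, n, lower.tail = FALSE)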


 


Monday, November 23, 2009
3:30-5:30 PM, Kresge LL6

Paul Bain
Topic: Genome Browsing with Ensembl

Ensembl provides unified access to genomic information and annotation for more than 50 eukaryotic species. Learn how to find gene and genomic-related information, from splice sites to SNPs and more. We'll also explore the data mining tool BioMart that provides access to Ensembl data in bulk, with hands-on exercises.
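
For R users, the same bulk queries can be scripted with the Bioconductor biomaRt package; a small sketch (the chromosome filter is just an example query):

    # Sketch: a bulk BioMart query from R with biomaRt, mirroring what the
    # hands-on exercises do through the web interface.
    library(biomaRt)

    mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

    # All genes on chromosome 21, with symbols and coordinates
    genes <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol",
                                  "chromosome_name", "start_position",
                                  "end_position"),
                   filters = "chromosome_name", values = "21", mart = mart)
    head(genes)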

 


Monday, October 26, 2009
3:30-5:30 PM Kresge LL6

Ross Lazarus, MBBS MPH
Associate Professor, Harvard Medical School
Director of Bioinformatics, Channing Laboratory

"An Introduction to the Galaxy Genomics Workbench"

Galaxy is a popular and flexible open-source analysis platform, designed to support and encourage persistent, reproducible, and shareable genomics research. Requiring only a standards-compliant web browser, the free public server at Penn State offers transparent integration with familiar model-organism and public human data sources such as BioMart and the UCSC Genome Browser, together with a flexible set of text-file manipulation tools and specialized tools such as intersection operations on interval data. While a skilled informatician can readily program these steps using a scripting language, Galaxy has the advantage of requiring minimal user training and, more importantly, of organizing and retaining every step, parameter setting, and intermediate dataset from every analysis in a persistent, shareable, and reproducible 'history'. Some of these basic and widely applicable features will be illustrated in this tutorial/demonstration, and a preview of tools currently under development for QC and analysis of commodity whole-genome statistical genetics and microarray expression data will be provided.

By the end of the session, attendees will be familiar with some simple, but widely applicable basic Galaxy tools and features that may prove to be useful in their own research, and will be ready to begin to explore some of the more specialized Galaxy capabilities such as the creation and sharing of complex workflows, and the rapidly growing suite of next-gen sequencing tools.