Automated Phenotyping of Patient EMR Data: Feature Extraction and Selection

Faculty Mentor: Dr. Tianxi Cai
Graduate Student Mentor: Sheng Yu
Program Participants: Rolando Acosta, William Artman, Cassandra Burdziak

In this project, students got hands-on experience with important steps in developing automated phenotyping algorithms for patient classification in the electronic medical records (EMR) system. Students used natural language processing (NLP) tools to extract medical concepts from articles about diseases from online sources such as Wikipedia. Students assembled the findings from multiple websites and prepared the list of relevant concepts for parsing free-text clinical notes from the EMR. The NLP software produces the number of times each concept is mentioned in the notes for each patient. Students summarized the distribution of the NLP concepts and performed association testing on the NLP data to identify the final list of concepts relevant to the phenotype of interest.
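
The association-testing step can be illustrated with a minimal sketch. The abstract does not say which test the students used, so the chi-square test on a 2x2 table below is an assumption, and the per-patient counts and phenotype labels are hypothetical toy values, not project data.

```python
from collections import Counter


def two_by_two(counts, labels):
    """Cross-tabulate 'concept mentioned at least once' against phenotype status.

    counts: per-patient number of mentions of one NLP concept (hypothetical input)
    labels: per-patient phenotype indicator, 1 = has phenotype, 0 = does not
    """
    table = Counter()
    for n_mentions, has_phenotype in zip(counts, labels):
        table[(n_mentions > 0, bool(has_phenotype))] += 1
    a = table[(True, True)]    # mentioned, has phenotype
    b = table[(True, False)]   # mentioned, no phenotype
    c = table[(False, True)]   # not mentioned, has phenotype
    d = table[(False, False)]  # not mentioned, no phenotype
    return a, b, c, d


def chi_square(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a margin is empty, so the table carries no information
    return n * (a * d - b * c) ** 2 / denom


# Toy data for eight hypothetical patients.
counts = [3, 0, 5, 1, 0, 0, 2, 0]
labels = [1, 0, 1, 0, 0, 1, 1, 0]
table = two_by_two(counts, labels)
statistic = chi_square(*table)
print(table, statistic)
```

Concepts whose statistic exceeds a chosen threshold would be kept as candidate features. With many concepts tested at once, some adjustment for multiple comparisons would normally be needed.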

Differential Outcomes by SES in Children Undergoing Treatment for Acute Lymphoblastic Leukemia

Faculty Mentor: Dr. Donna Neuberg
Other Mentors: Traci Blonquist, Joey Antonelli
Program Participants: Sergio Barrera, Randy Davila, Emily Roberts

The project reviewed efficacy and toxicity outcomes from a recent clinical trial to identify possible differences in treatment outcomes. A simple investigation of race and ethnicity does not identify such differences, but an initial look at socioeconomic status (SES) suggests subtle differences. Participants learned about clinical trials and about acute lymphoblastic leukemia, which is the most common pediatric malignancy. They also learned about statistical data analysis and the process of preparing a report reflecting that analysis for a clinician.

The Effects of Environmental Factors and the BRCA Genetic Mutation on Ovarian Cancer Risk

Faculty Mentor: Dr. Eric Tchetgen Tchetgen
Graduate Student Mentor: Kathy Evans
Program Participants: Andrea Lane, Jennifer Osei, Nathalie Quiroz

Students considered a population-based case-control study based on all ovarian cancer patients identified in Israel between 1 March 1994 and 30 June 1999 (Modan et al. 2001). The main objective of the study was to examine the interplay of the BRCA1/2 genes and known reproductive/gynecological risk factors for ovarian cancer. To test for interactions between reproductive risk factors and BRCA1/2 in their effects on the risk of ovarian cancer, students performed a number of analyses, ranging from standard logistic regression to more efficient methods such as case-only analysis, which exploits an assumption of gene-environment independence in the population. Their primary aim was to test for genetic effects in the presence of interactions between the dichotomous variable representing a person's BRCA1/2 mutation status and her use of oral contraceptives and parity.
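
The efficiency gain from case-only analysis can be sketched with 2x2 tables. If the gene and the exposure are independent in the population (and the disease is rare), the gene-exposure odds ratio among cases alone estimates the multiplicative interaction, so the sampling variability of the control table drops out. The counts below are hypothetical and are not taken from Modan et al.; the function names are illustrative only.

```python
import math


def odds_ratio(n11, n10, n01, n00):
    """Odds ratio between gene (G) and exposure (E) in a 2x2 table.

    n11: G+ E+, n10: G+ E-, n01: G- E+, n00: G- E-
    """
    return (n11 * n00) / (n10 * n01)


def se_log_or(*cells):
    """Large-sample standard error of a log odds ratio (sum of 1/cell over all cells)."""
    return math.sqrt(sum(1.0 / n for n in cells))


# Hypothetical counts, for illustration only.
cases = (20, 40, 100, 100)     # G+E+, G+E-, G-E+, G-E- among cases
controls = (10, 10, 200, 200)  # same layout among controls

# Case-only estimate: valid as an interaction estimate only under G-E independence.
case_only = odds_ratio(*cases)
se_case_only = se_log_or(*cases)

# Case-control estimate: ratio of the case and control odds ratios.
case_control = odds_ratio(*cases) / odds_ratio(*controls)
se_case_control = se_log_or(*cases, *controls)

print(case_only, se_case_only)
print(case_control, se_case_control)
```

In this toy example the two point estimates agree because the control odds ratio is exactly 1, while the case-only standard error is smaller. If the independence assumption fails, the case-only estimate is biased, which is the price of the added efficiency.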

The Effects of Probe Sequencing On Microarray Gene Expression Measurements

Faculty Mentor: Dr. Rafa Irizarry
Graduate Mentor: Ryan Sun
Program Participants: Kevin Kupiec, Randy Williams, Pedro Agrinsoni Munoz

The number of fields for which freely available data can be found online has greatly increased during the last decade. Discoveries and prediction algorithms can now be built by downloading, parsing, and analyzing data. Examples include using online movie reviews to build a recommendation system or tweets to predict stock prices. Students learned to download data from Twitter and to process English sentences into numerical summaries, and used these summaries to build a prediction algorithm in the R language.
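
One common way to turn English sentences into numerical summaries is a bag-of-words count matrix. The sketch below is written in Python rather than the R used in the project, and the tokenizer, function names, and example sentences are all assumptions made for illustration, not the students' actual pipeline.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(texts):
    """Return a sorted vocabulary and one row of word counts per input text."""
    vocab = sorted({word for text in texts for word in tokenize(text)})
    rows = []
    for text in texts:
        word_counts = Counter(tokenize(text))
        rows.append([word_counts[word] for word in vocab])
    return vocab, rows


# Hypothetical example sentences.
texts = ["Great movie, great cast", "Bad movie"]
vocab, rows = bag_of_words(texts)
print(vocab)
print(rows)
```

Each row can then serve as the feature vector for one sentence in a prediction model, with the outcome of interest (for example, a rating) as the response.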