Robust Profile Clustering

Paper | Code

Adapted from a local partitioning framework, this model breaks the global clustering assumption, allowing subjects and variables to cluster at two levels: (1) globally, where subjects are assigned to an overall population-level cluster via an overfitted finite mixture model, and (2) locally, where deviations from population-level behaviors are accommodated via a beta-Bernoulli process that depends on identifiable subpopulation differences. The overfitted finite mixture model allows the number of clusters to remain unknown a priori.
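To make the two-level structure concrete, the following is a minimal simulation sketch of a generative process in this spirit, not the published implementation. All names and defaults (`simulate_rpc`, `alpha`, the Beta(1, 1) prior, and the simplification of one local profile per subpopulation) are illustrative assumptions.

```python
import numpy as np

def simulate_rpc(n=200, p=6, d=3, K=10, S=4, alpha=0.1, seed=0):
    """Simulate categorical data with global and local (subpopulation) structure.

    n subjects, p categorical variables with d levels each, K mixture
    components (deliberately more than needed), S subpopulations.
    """
    rng = np.random.default_rng(seed)

    # Known subpopulation membership for each subject (e.g. state of residence).
    s = rng.integers(0, S, size=n)

    # Overfitted finite mixture: a small Dirichlet concentration (alpha < 1)
    # pushes most of the K component weights toward zero.
    pi = rng.dirichlet(np.full(K, alpha))
    c = rng.choice(K, size=n, p=pi)  # global cluster label per subject

    # Response probabilities: one profile per global cluster, and
    # (simplifying assumption) one local profile per subpopulation.
    theta_global = rng.dirichlet(np.ones(d), size=(K, p))  # shape (K, p, d)
    theta_local = rng.dirichlet(np.ones(d), size=(S, p))   # shape (S, p, d)

    # Beta-Bernoulli allocation: nu[s, j] is the probability that variable j
    # follows the global profile in subpopulation s.
    nu = rng.beta(1.0, 1.0, size=(S, p))
    g = rng.random((n, p)) < nu[s]  # True = global, False = local deviation

    # Pick the probability vector for every (subject, variable) pair.
    probs = np.where(g[..., None], theta_global[c], theta_local[s])

    # Inverse-CDF draw of one category per (subject, variable) pair.
    u = rng.random((n, p, 1))
    x = (u > probs.cumsum(axis=-1)).sum(axis=-1)
    x = np.minimum(x, d - 1)  # guard against floating-point round-off
    return x, c, s, g

x, c, s, g = simulate_rpc()
print("occupied global clusters:", np.unique(c).size, "of 10")
print("share of responses following the global profile:", g.mean().round(2))
```

With a small `alpha`, only a few of the `K` components are typically occupied, which is how an overfitted mixture lets the data decide the effective number of clusters.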

Using dietary consumption data from the National Birth Defects Prevention Study (NBDPS), we defined subpopulations by the mother's state of residence and derived profiles from multivariate categorical data collected with a food frequency questionnaire (FFQ) that measured dietary intake during the previous 12 months. The model was able to discriminate consumption patterns shared among all mothers in the United States and to identify food items with higher or lower consumption depending on the mother’s state of residence.

Addressing Generalizability for Survey-sampled Study Populations

Paper | Code

The Hispanic Community Health Study/Study of Latinos (HCHS/SOL) is a multi-center epidemiologic study of Hispanic/Latino populations in the United States, examining a diverse sample of approximately 16,000 participants aged 18-74. Multi-stage area probability sampling was adopted for the study. To account for the complex survey sampling design, sampling weights are applied to each participant in the analysis so that inference can be made about the full target population. Our goal is to extend Bayesian nonparametric dimension reduction techniques to incorporate survey sample weighting within the estimation sampling algorithm.
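As a simple illustration of why sampling weights matter, the sketch below compares unweighted and weighted estimates of cluster prevalence. It is a generic weighted-proportion calculation, not the Bayesian nonparametric method described above, and the function name, labels, and weights are made up for the example.

```python
import numpy as np

def weighted_proportions(labels, weights, n_categories):
    """Estimate category prevalence in the target population.

    Each subject contributes their sampling weight (roughly, the number of
    people in the target population they represent) rather than a count of 1.
    """
    labels = np.asarray(labels)
    weights = np.asarray(weights, dtype=float)
    totals = np.bincount(labels, weights=weights, minlength=n_categories)
    return totals / weights.sum()

# Hypothetical cluster labels and sampling weights for six participants.
labels = [0, 0, 0, 0, 1, 1]
weights = [10, 10, 10, 10, 80, 80]

unweighted = weighted_proportions(labels, np.ones(len(labels)), 2)
weighted = weighted_proportions(labels, weights, 2)
print("unweighted:", unweighted)
print("weighted:  ", weighted)
```

In this toy example, the second cluster is a minority of the sample but represents most of the target population once the weights are applied, so ignoring the design would misstate its prevalence.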

Generating Inference from Cluster-based Models


Supervised Robust Profile Clustering is a joint predictive clustering model that builds on the information derived from Robust Profile Clustering (RPC), integrating it into a regression model to assess its association with an outcome of interest. Using the discriminating strength of the RPC model, researchers can focus attention on behaviors attributed to the overall population versus a specific subpopulation. With the NBDPS focus on national dietary behaviors associated with the risk of oral cleft defects, the model was applied to estimate the effect of population-based maternal diet on the risk of an oral cleft defect in offspring.

Addressing Disparities in Health Access and Quality of Care

We are using model-based clustering and classification techniques to examine how a patient’s demographics (e.g., race, socioeconomic status, geography) can affect care, treatment, and survival among cancer patients, leading to health inequities in at-risk populations.

Addressing Reproducibility in Model-based Clustering

Dimension reduction methods are frequently applied to high-dimensional data of varying types. However, these techniques tend to yield results specific to the study population. Under a Bayesian framework, clustering models are sensitive to hyperparameter selection and to the estimation technique employed, which can greatly affect results in latent variable models. My research continues to explore ways to improve the sampling and identifiability of these models for practical application to population-based biostatistical problems.