A recent article in Harvard Medicine magazine describes the fruitful cross-pollination between computation, human language, and biology in the face of a shared challenge: the need to parse large amounts of multidimensional data in messy, multilayered contexts. Tools developed in the field of natural language processing are increasingly being applied to derive meaning from repeated sequences within large data sets, be they word phrases in biomedical literature or sequences of base pairs in DNA.
One example of the utility of the approach is a project overseen by Biostats alumnus and Harvard Medical School Professor of Biomedical Informatics Peter Park and led by postdoctoral fellow Doga Gulhan. The project uses a statistical model derived from natural language processing to identify the specific causes of mutations in the genomes of cancer patients. Building on studies that link certain causal factors to specific mutations, the team scans trillions of DNA base pairs in tumor genome sequences to identify mutation signatures that point to factors such as smoking and UV exposure. This information lets them look past apparent differences in tumor type to potentially similar underlying mechanisms, and so identify the right drugs for treatment.
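The analogy to language can be made concrete with a small sketch. It treats a tumor's mutations the way a text model treats words: count how often each mutation class occurs, then compare that profile to reference profiles. This is only an illustration under stated assumptions, not the model used by Park and Gulhan. The names `TOY_SIGNATURES`, `mutation_profile`, and `closest_signature` are hypothetical, and the signature values are invented for the example (they reflect only the well-known tendencies of smoking toward C>A changes and UV exposure toward C>T changes).

```python
from collections import Counter
from math import sqrt

# The six substitution classes conventionally used for single-base mutations.
CLASSES = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

# Hypothetical toy profiles, invented for illustration. NOT real signature values.
TOY_SIGNATURES = {
    "smoking-like": [0.60, 0.10, 0.10, 0.10, 0.05, 0.05],
    "uv-like": [0.05, 0.05, 0.75, 0.05, 0.05, 0.05],
}


def mutation_profile(mutations):
    """Turn a list of mutation labels into class frequencies (a 'bag of mutations')."""
    counts = Counter(mutations)
    total = sum(counts[c] for c in CLASSES)
    if total == 0:
        return [0.0] * len(CLASSES)
    return [counts[c] / total for c in CLASSES]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_signature(mutations, signatures=TOY_SIGNATURES):
    """Return the name of the most similar reference profile and all similarity scores."""
    profile = mutation_profile(mutations)
    scores = {name: cosine(profile, sig) for name, sig in signatures.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # A made-up tumor dominated by C>T changes.
    tumor = ["C>T"] * 70 + ["C>A"] * 10 + ["T>C"] * 20
    best, scores = closest_signature(tumor)
    print(best, scores)
```

Real analyses work with far finer mutation categories and with statistical models that allow several signatures to contribute to one tumor at once; the sketch only shows the counting-and-comparing idea that the language analogy rests on.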
At the other end of the spectrum are projects that use natural language processing to glean information from biomedical literature. One example is a software platform created by John Bachman and Benjamin Gyori that uses machine reading to parse scientific publications for phrases of interest. The platform, named INDRA (Integrated Network and Dynamical Reasoning Assembler), cross-references raw phrases against each other in a manner analogous to sequence alignment to identify the mechanisms that connect them, enabling researchers to carry out complex tasks such as building biochemical network models. A similar application uses natural language processing to analyze electronic medical records (EMRs). Guergana Savova, an HMS associate professor, is working to build systems that can read and analyze plain text within millions of EMRs, allowing for a ‘deep phenotyping’ of cancer that will move researchers closer to the goal of precision medicine.
Natural language processing is not a panacea, given the omissions, variations, and errors inherent in language data. Still, its applications are promising enough that establishing data and curation standards is a crucial next step, one that would allow researchers to get to the point, according to Alexa McCray, Professor of Medicine at HMS, “where we can compare apples to apples across biomedicine.”