Olga Troyanskaya takes a deep dive into data

Olga Troyanskaya
Peter Murphy

But for the light clatter of keyboards and the low chatter of co-workers, it’s quiet among the rows of computer desks in Room 270 in the Carl Icahn Laboratory. Here, along the southern edge of Princeton’s campus, a cluster of graduate students and postdoctoral researchers work with Professor Olga Troyanskaya at the Lewis-Sigler Institute for Integrative Genomics. Yet below the surface of this computational space shared by several laboratories, a subfloor bears the brunt of a digital maelstrom. There, unimaginably large streams of genetic data are transformed into binary code and shipped out. Data packets zip along fiber-optic lines buried beneath Washington Road, race under Route 1, and plop at the High-Performance Computing Research Center on the Forrestal Campus. Once processed, the data, now sorted by algorithms calculating probabilities of “right” answers (principally, previously unidentified genes or proteins associated with a disease or disorder), pings back.

Troyanskaya, a professor in the computer science department and the Lewis-Sigler Institute, has broad ambitions: nothing less than devising the digital tools needed to usher in the long-dreamed-of era of precision medicine, a term that broadly describes an approach in which physicians will someday routinely select treatments for patients based on an understanding of their genes. In pursuing that goal, Troyanskaya and her team are harnessing the power of technologies like artificial intelligence to solve problems in biology on an immense scale. And they are making significant progress in understanding the genetic basis of disorders like autism, as well as of many diseases, including cancer and Alzheimer’s disease.

“We are trying, in a practical, applied way, to figure out whether we can enable this promise of precision medicine,” Troyanskaya says. “Scientists may have sequenced the human genome. But we still don’t know the answers to most of our questions about what genes mean for human health. What we are looking to do is change genomic data to biological knowledge.”

Precision medicine stands in stark contrast to much of today’s medical treatment, still often a one-size-fits-all approach in which treatment, medication, and preventive practices are developed for the average person, with less consideration for the differences among individuals. Precision medicine, instead, is decidedly individualistic — aimed at developing cures and treatments tailored to the unique characteristics of each person.

Though the concept of personalized or precision medicine has been dreamed of for decades, the prospect became more realistic with the completion of the $3 billion Human Genome Project, a 13-year effort that in 2000 led to the first draft of the human genetic code. In a White House ceremony celebrating the achievement, President Bill Clinton predicted the knowledge contained within each person’s DNA would “revolutionize the diagnosis, prevention, and treatment of most, if not all, human diseases.”

Over the years, the scientific community has made tremendous strides in the field by plucking out useful information about individual genes, Troyanskaya says, noting the development of genetic tests and the resulting “targeted” drugs that take individual genetic differences into account. These can help a small subset of patients, as in some breast and lung cancers. However, such approaches target only uncommon forms of diseases, identifying rare variants in single disease-causing genes.

But common ailments like diabetes, cancer, and heart disease, scientists are finding, are more genetically complex than anticipated, involving hundreds of genes playing small roles — instead of a single, errant gene bearing the brunt of causation. What’s more, scientists have discovered that many diseases lack trademark patterns — the same disease can be caused by different combinations of genes in different people. Adding to the difficulty is the slew of genetic data available. With decades of research programs focused on genomic material, as well as the growing popularity of private DNA collections by companies like 23andMe and Ancestry.com, the sheer volume of information available to scientists is daunting.

With an extensive background in computer science and molecular biology, the 41-year-old Troyanskaya is unfazed. She believes there are deeper, broader ways to look at the vast amount of human genetic information and harness that knowledge to defeat disease — for example, by employing the most advanced techniques in machine learning to make “predictions” about constellations of genes. Where others may see a disconcertingly large mountain of data, Troyanskaya looks for the connections within it.

In an address before the World Economic Forum in Switzerland in 2017, she spelled out her vision. Standing before a giant TV screen emblazoned with images of gene sequences, Troyanskaya explained to a crowd of world business leaders that mutations are everything. “One might mean the difference between my curly hair and someone’s straight hair,” she said. “One might mean getting cancer or not.”

Then came her main point: “The key challenge here is that we don’t observe most disease-causing mutations,” she said. “And a disease might be caused by different mutations or even combinations of multiple mutations. We need a method that can tell us what a mutation will mean, whether we see it or not.”

In David Botstein’s lab at Stanford, Troyanskaya combined bioinformatics with traditional biology research.
Courtesy Calico Life Sciences, LLC

What Troyanskaya and her team represent is the dawn of a new approach: the development of a predictive science in which genetic-risk profiles may indicate a proclivity to a disease, the effectiveness of a given drug or treatment, and whether changes in lifestyle might be especially beneficial.

There are many steps that need to occur between the pinpointing of myriad interconnecting genes tied to a disorder and a cure, Troyanskaya notes, including developing and then testing a treatment. Such a process could take years, even decades. But, Troyanskaya says, at least now, medical researchers are being given solid clues to work with. This could be the start, possibly, of something big. “It is hard to be over the top when talking about Olga,” says Alex Lash, chief informatics officer at the Simons Foundation in New York, where Troyanskaya also serves as deputy director for genomics in its Flatiron Institute, which focuses on basic research. “Her mind moves at 100 miles an hour. She is doing seminal work, and investigators in the future will build on it.”

Even by Princeton standards, Troyanskaya, who is married with two young children, is seen by her colleagues as unusually busy and productive. She wears an Apple Watch and relies on a silent, ever-so-slight tap on her wrist to keep her appointments. As she sits in her second-floor office overlooking an airy atrium in the modernistic Icahn Lab, where the Lewis-Sigler Institute is housed, Troyanskaya gazes at her bookshelves, crammed with textbooks and her students’ theses, and reflects on the role of lucky breaks over time. “I am amazed at how many serendipitous things there are in my life,” says Troyanskaya. “It’s actually both humbling and terrifying.”

She was born in Moscow in 1977. Her parents were intellectuals — her father a civil engineer, her mother an electrical engineer. Troyanskaya was an only child. An avid reader, she would often get into trouble for reading under blankets at night, illuminating books with a flashlight. Her mother had given birth to her at age 39 after several failed pregnancies. As difficult as the miscarriages were for her mother, Troyanskaya says, the situation meant that their home was littered with biology books, specifically books that covered genetics and anatomy, that her mother had pored over in attempts to understand her condition. “That’s where my interest in biology started,” Troyanskaya says.

She attended a specialized language high school and graduated by age 16. She came to the United States through an exchange program that placed her with a family in Woodbridge, Va., a suburb of Washington, D.C., where she lived with Caryn Collier, a law-office manager; her husband, Andy Collier, a career employee of the U.S. Department of Commerce; and the youngest of the Colliers’ three children. The Collier family worked to make sure Troyanskaya would succeed at Woodbridge Senior High School. They took her shopping for American-style clothes to substitute for the running suits and traditional Russian dresses she had brought from home. At school, Troyanskaya challenged herself, acing her load of Advanced Placement classes, even though she was still perfecting her English. She became known as a student who liked to ask questions. “It drove the other students crazy,” Caryn Collier says. “She was always asking her teachers questions. ... Olga messed up the curve in every class.”

Troyanskaya wanted to stay in the United States and study math and science at the college level, but she knew her parents could not afford that. With few resources and no sense of the college landscape, she took the advice of her host parents, who encouraged her to apply to the University of Richmond. After much back and forth, with offers from members of the Colliers’ church to pay for her college room and board, she won a full scholarship there. “My host parents drove me to Richmond and took me to the admissions office and asked to speak with the head so I could tell my story,” Troyanskaya recalls. The Colliers “made sure that the scholarship came through,” she adds.

This illustration shows a network of genes in the brain that are relevant to Alzheimer’s disease. Uncovering network relationships among genes can help scientists understand complex neurodegenerative diseases.
Aaron K. Wong *15

In college, she continued to excel. She double-majored in computer science and biology, with a minor in mathematics. As a freshman, she conducted a research project, working under one of her professors, and started to think about how a summer internship might aid her entry to graduate school. “I thought there’s got to be a field that connects computer science and biology,” Troyanskaya says. “It turned out it existed, but no one knew about it yet — bioinformatics.” She made a list of schools known for excellence in biology and another for schools renowned for strength in math and computer science. She worked her way down the list, looking at the websites of faculty members who may have been combining the fields. “I came up with a list of eight or nine people,” Troyanskaya says. “I sort of bulk-emailed them and asked if I could do research with them over the summer.”

One person responded. Steven Salzberg, then a computer science professor at Johns Hopkins University, offered her a research spot based on her letter, thinking she had already graduated. Even after he realized she would be only a sophomore, he stood by his offer. “I’m still amazed by that,” Troyanskaya says of Salzberg, who would go on to be a leader in the Human Genome Project and other sequencing efforts. “We’ve been friends and colleagues ever since.”

She spent the next few summers in Salzberg’s lab, both at Hopkins and at the Institute for Genomic Research. Troyanskaya wrote computer programs predicting the effects of specialized genetic structures on the creation of proteins and drank in the culture of a modern research laboratory. As valedictorian at her graduation ceremony in 1999, Troyanskaya carried the University of Richmond’s mace, with her name inscribed on it, as the school’s most outstanding student.

She was off to Stanford University for her graduate degree to work with David Botstein, the chair of Stanford’s genetics department, and Russ Altman, a renowned biostatistician. Botstein would serve as a powerful mentor. In 2003 he became director of the Lewis-Sigler Institute at Princeton (he’s now emeritus), and Troyanskaya would come to the University as an assistant professor shortly after receiving her Ph.D.

When Troyanskaya joined his lab at Stanford, Botstein was known for employing DNA microarray technology in a novel way, identifying potentially lethal tumors by combing through genetic material for genes that were activated. Microarrays provide “snapshots” of a given moment in a cell, showing which genes are turned on and off. Such information can offer fundamental insights into the biochemistry of cell growth and guide clinical decisions based on patients’ responses to medication. “Most of bioinformatics up to then had been focused on the genome and its sequence,” Troyanskaya says. “But how do you put all those sequences together and predict where the genes are? That’s how David made a huge impact.” Botstein could predict the biological location where the genes could be found, she explains.

Research papers based on revelations from microarray techniques poured out of the Botstein laboratory — on genes involved in breast cancer, lung cancer, and blood-vessel tumors. “It was so exciting,” Troyanskaya says. “Everyone was coming together. Computer scientists, molecular biologists, and biostatisticians would argue it out and find the best way.”

She wrote a paper about developing technical tool kits for genome analysis. “I developed the first method for this idea of using heterogeneous data from microarrays — basically, the kitchen sink — everything that tells you what proteins do,” she says, “and how they interact and how they are regulated to be able to predict these functional networks of how proteins work in the cell.” Her research combined bioinformatics with the more traditional biology research in a laboratory setting housing chemical and biological specimens, efforts she calls “biology interactions.” She came to understand that, to be useful, everything she did had to be rooted in biology. “In addition to learning all the technical things, a lot of what I learned is how critical it is for people in my field — people developing tools — to be closely involved with biology and biomedical research,” she explains. “We’re not just developing tools to apply to data. It’s about solving real challenges in biology using computational methods.”

Jian Zhou *17
Ruth Dannenfelser

At Princeton, Troyanskaya and her collaborators and students work to find meaning in the vast array of databases worldwide. She doesn’t cull the data. She keeps piling it on, petabyte (one quadrillion bytes of computer storage) by petabyte. For several years, Troyanskaya and her team have used deep learning, a machine-learning technique that teaches computers to learn by example, to understand which patterns in DNA are important.

Now a postdoc at Flatiron, Jian Zhou *17 pioneered the lab’s first deep-learning program, DeepSEA, while working as a graduate student with Troyanskaya at Princeton. The professor, he says, is “always looking for something that will have a big impact,” and his idea was to build a computer program that could scour genomes for mutations and assess whether a given mutation might have a serious biological effect. He attacked the problem by employing artificial neural networks — algorithms inspired by the human brain. The technique allows machines to solve complex problems, even when using data sets that are diverse, unstructured, and interconnected. Over months, Zhou taught DeepSEA to “learn” associations between different parts of a genome and how they are important for molecules that interact with DNA, and then to use those associations to pinpoint the effects of any mutation.

Like its designer, DeepSEA turned out to be an A student, nimbly sifting through a network of more than 100 million gene interactions to draw out information. In an August 2016 paper in Nature Neuroscience, Troyanskaya and colleagues described how they employed DeepSEA and another deep-learning program, Seqweaver, to analyze mutations across the entire human genome, a breakthrough raising the number of genes linked to autism-spectrum disorder from 65 to 2,500. The program “learned” characteristics that indicate a connection to autism as it went along, honing the quality of predictions as it proceeded. The tool could be used to explore the genetic basis of any complex disease: “There are a hundred mutations in any genome, and DeepSEA can predict which ones of those are actually disease-causing and predict specific molecular consequences on each mutation,” Troyanskaya says.

Next came ExPecto, another machine-learning framework developed by Troyanskaya and her team. This software program, named after the powerful, but notoriously difficult, defensive Patronus charm from the “Harry Potter” series of books (“Expecto Patronum!”), can predict, for any given mutation, whether that mutation disrupts the expression of a gene — its turning “on” or “off.”

“ExPecto is more specific” than DeepSEA, explains Aaron Wong *15, who is now a data scientist and project leader in genomics at Flatiron, where he uses the software to study autism. ExPecto incorporates DeepSEA but also contains “tissue-specific” information about how genes operate in cells. As a result, Wong says, scientists using ExPecto can look at a person’s genome and learn that a single mutation may prevent a specialized biomolecule from binding to an important brain protein and activating it.

The results of one investigation using ExPecto, published July 16 in Nature Genetics, were calculated by assessing the genetic ramifications of more than 140 million mutations in different tissues. The study points to those mutations potentially responsible for increasing the risk of several immune-related diseases, such as Crohn’s disease and chronic hepatitis B virus infection.

With millions of predictions about genes available, Troyanskaya and her team have tried to make it easier for other researchers to join in the work. Their interactive web server, HumanBase (http://hb.flatironinstitute.org/), offers a portal to ExPecto and returns predictions and information about human genes, pathways, and disorders when given certain genetic information as an input. The free service employs sophisticated computational analysis to produce information that predicts how genes are expressed and how they interact.

Troyanskaya is engaged in collaborations across many research fronts, including one with University of Michigan researchers to find ways to predict which genes play a role in kidney disease. Another investigation, with Nobel laureate Paul Greengard at Rockefeller University in New York, has implications for Alzheimer’s and Parkinson’s diseases. Troyanskaya also is doing potentially important work on cancer immunotherapy.

“To really enable the promise of precision medicine, we need to not only be able to predict which mutations are disease-causing but be able to understand the precise effects of each mutation, put it in a cellular context, then be able to integrate this across genes and more biomolecules,” Troyanskaya says. “Then we need to put this together into a picture that includes multiple cell types, multiple tissues, multiple organisms, and integrated, detailed models to truly be able to transfer the information about molecular biology to the whole organism.”

Ever the unrelenting student, she’s still asking questions, still pressing to understand. 

Kitta MacPherson is an award-winning science writer who has worked in daily newspapers and at Princeton University.