Two brothers create powerful new tool to sift through big datasets

Brothers David (right) and Yakir (left) Reshef developed the new statistical tools under the guidance of professors from Harvard University and the Broad Institute. Photo by ChieYu Lin, courtesy of Pardis Sabeti

It’s an unusual starting point for a high-profile paper in a leading science journal: two brothers, students just a year apart at universities down the Charles River from one another, decide to work together on a summer project. The research unfolds through ideas scribbled on the walls of a laboratory, insights gained during downtime working as an EMT, and brainstorms shared at a fraternity house in Boston.

Today, the fruits of that labor were published in the journal Science: a powerful tool to rapidly flag patterns and identify correlations in huge databases, from sports statistics, to online social networks, to the genomes being churned out by science laboratories.

While it’s unusual for two brothers in their mid-20s to share credit as the lead authors of a paper, the achievement demonstrates how creativity often arises from the back-and-forth of a team. In this case, David and Yakir Reshef have been a team since childhood.

“I think in some sense, David and I have been roping each other into things for our entire lives,” said Yakir, now 24 and on a Fulbright scholarship at the Weizmann Institute of Science in Israel.

The summer after his senior year at the Massachusetts Institute of Technology, David began working with Pardis Sabeti, a biologist at the Broad Institute with an interest in global health. David was developing an approach to sift through large, international health data sets, highlighting potential relationships between demographic information and the incidence of infectious diseases, such as cholera or HIV.

“We just wanted a simple way to figure out what was in the datasets. At first we thought we would go find some methods that existed. It turned out to be a much more complicated question to answer,” said David, now 25 and earning a joint medical and doctoral degree at Harvard and MIT.

David began to get excited. He saw a potentially big opportunity to develop tools that could rapidly and effectively identify all sorts of complex patterns hidden in data, ranging from the rise and fall of flu cases depending on the season to the swooping curve of female obesity when graphed against average income. He turned to his longtime collaborator, Yakir, a Harvard undergraduate who was working that summer on an ambulance crew based in Arlington, a job that he hoped would help prepare him for medical school.

The two had been extremely close intellectual partners throughout their grade school years, but even the short commute between Harvard and MIT had allowed them to grow apart a bit.

“In some ways, this was a return to the good old days,” Yakir said. “This is a pattern with us: I was a little bit skeptical at the beginning. I credit him with having the thinking, ‘Even though it’s crazy, we should try it.’ … He hatches hare-brained schemes.”

So Yakir began printing out scientific articles and bringing them to his summer job. He would sit on the ambulance and think about the statistics and large datasets between calls, a unique vantage point that combined a bird’s-eye view of broad health data with medicine in action.

They found that to solve the problem, they had to draw from all corners, expanding beyond global health. They worked with computer scientist Michael Mitzenmacher at Harvard and one of his students, mathematician Hilary Finucane, who recently got engaged to Yakir.

Over the years, academic scholarships put the key team members temporarily on different continents. The collaboration intensified over Skype, eventually resulting in a new statistical tool presented today that rapidly and effectively mines data for relationships and ranks the strongest ones, without any preconceived notion of what those might be. The tool can’t answer the question of whether one thing caused another, but by finding the strongest correlations, it can help scientists generate new hypotheses and questions to explore.

The brothers — both baseball fans, although Yakir says David is the better athlete — decided to try their tool out on statistics and salaries from Major League Baseball. They found that hits, total bases, and a statistical measure of offensive performance were most strongly correlated with salary.

They also tested their tool on data from the World Health Organization, yeast gene activity, and genomic data describing the bacteria present in the human gut, and found relationships that were not picked up with older methods.

Eli Upfal, a professor of computer science at Brown University who was not involved in the study, said the new tool had a solid mathematical foundation and worked well on real data, but its ultimate impact will be seen over time.

“It’s not like someone solved an open problem in mathematics and you can check the proof,” Upfal said. “This is more: here is a tool and here’s some very good mathematical justification for the tool, but the proof eventually is in it being adopted and shown to be practical. … This is the first step.”

Sabeti said that the team plans to extend and build on the tool, for example finding ways to look for complex relationships between more than two pieces of data.

“Every field is ripe for a tool like this with the data deluge,” Sabeti said. “Everywhere we go and give talks about it, somebody says, ‘I have a dataset for you to look at’: finance, sports, statistics.”

The researchers are making the tool available on a website. The brothers said they hope to work together again soon.
