N3C, on the other hand, is auditable by, and accountable to, thousands of researchers at hundreds of participating institutions, with a strong focus on transparency and reproducibility. Everything users do within the interface, which uses Palantir’s GovCloud platform, is carefully preserved, so anyone with access can retrace their steps.
“This isn’t rocket science, and it isn’t really new. It’s just hard work. It’s tedious, it has to be done carefully, and we have to validate every step,” says Christopher Chute, a professor of medicine at Johns Hopkins who also co-leads N3C. “The worst thing we could do is methodically transform data into garbage that would give us wrong answers.”
Haendel points out that these efforts haven’t come easy. “The diversity in expertise that it took to make this happen, the perseverance, dedication, and, frankly, brute force, is just unprecedented,” she says.
That brute force has come from many different fields, many of them not traditionally part of medical research.
“Having everyone on board from all aspects of science really helped. During covid people were much more willing to collaborate,” says Mary Boland, a professor of informatics at the University of Pennsylvania. “You could have engineers, you could have computer scientists, physicists, all these people who might not normally participate in public health research.”
Boland is part of a group using the N3C data to investigate whether covid increases irregular bleeding in women with polycystic ovarian syndrome. Outside of covid, most researchers have to use insurance claims data to get a large enough database for population-level analyses, she says.
Claims data can answer some questions about how well drugs work in the real world, for instance. But those databases are missing huge amounts of information, including lab results, what symptoms people are reporting, and even whether patients die.
Collecting and cleaning
Outside of insurance claims databases, most health data collaboratives in the US use a federated model. Participants in these studies all agree to structure their own datasets in a common format and then run the collective’s queries, such as one for the proportion of serious covid cases by age group, against their own data. Several international covid research collectives, including the Observational Health Data Sciences and Informatics program (OHDSI, pronounced “Odyssey”), operate this way, avoiding legal and political problems with cross-border patient data.
OHDSI, which was founded in 2014, has researchers from 30 countries, who together hold records for 600 million patients.
“That allows each institution to keep their data behind their own firewalls, with their own data protections in place. It doesn’t require any patient data to move back and forth,” says Boland. “That’s comforting for a lot of places, especially with all the hacking that’s been going on lately.”
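To make the federated idea concrete, here is a minimal sketch in Python of how such a query might work. It is an illustration only, not OHDSI's or any collaborative's actual software: the two sites, their records, and the function names `local_counts` and `combine` are all hypothetical. Each site tallies its own cases behind its own firewall, and only the aggregate counts are shared and combined.

```python
from collections import Counter

# Hypothetical records held at two institutions, already in a shared
# format: (age_group, is_serious_case). Real records never leave the site.
SITE_A = [("18-39", False), ("40-64", True), ("65+", True), ("65+", False)]
SITE_B = [("18-39", False), ("18-39", True), ("40-64", False), ("65+", True)]


def local_counts(records):
    """Runs inside one institution and returns only aggregate counts."""
    total, serious = Counter(), Counter()
    for age_group, is_serious in records:
        total[age_group] += 1
        if is_serious:
            serious[age_group] += 1
    return {"total": dict(total), "serious": dict(serious)}


def combine(site_results):
    """Runs at the coordinating center, which sees counts, not patients."""
    total, serious = Counter(), Counter()
    for result in site_results:
        total.update(result["total"])      # Counter.update adds counts
        serious.update(result["serious"])
    return {group: serious[group] / total[group] for group in sorted(total)}


# Proportion of serious cases by age group across both sites.
proportions = combine([local_counts(SITE_A), local_counts(SITE_B)])
print(proportions)
```

The point of the design is that `combine` only ever receives per-group totals, so the answer can be computed without any individual patient record crossing an institutional boundary.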