Monthly Archives: May 2017

Privacy in genomic databases

Genome-wide association studies, which try to find correlations between particular genetic variations and disease diagnoses, are a staple of modern medical research.

But because they depend on databases that contain people’s medical histories, they carry privacy risks. An attacker armed with genetic information about someone — from, say, a skin sample — could query a database for that person’s medical data. Even without the skin sample, an attacker who was permitted to make repeated queries, each informed by the results of the last, could, in principle, extract private data from the database.

In the latest issue of the journal Cell Systems, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory and Indiana University at Bloomington describe a new system that permits database queries for genome-wide association studies but reduces the chances of privacy compromises to almost zero.

It does that by adding a little bit of misinformation to the query results it returns. That means that researchers using the system could begin looking for drug targets with slightly inaccurate data. But in most cases, the answers returned by the system will be close enough to be useful.

And an instantly searchable online database of genetic data, even one that returned slightly inaccurate information, could make biomedical research much more efficient.

“Right now, what a lot of people do, including the NIH, for a long time, is take all their data — including, often, aggregate data, the statistics we’re interested in protecting — and put them into repositories,” says Sean Simmons, an MIT postdoc in mathematics and first author on the new paper. “And you have to go through a time-consuming process to get access to them.”

That process involves a raft of paperwork, including explanations of how the research enabled by the repositories will contribute to the public good, which requires careful review. “We’ve waited months to get access to various repositories,” says Bonnie Berger, the Simons Professor of Mathematics at MIT, who was Simmons’s thesis advisor and is the corresponding author on the paper. “Months.”

Bring the noise

Genome-wide association studies generally rely on genetic variations called single-nucleotide polymorphisms, or SNPs (pronounced “snips”). A SNP is a variation of one nucleotide, or DNA “letter,” at a specified location in the genome. Millions of SNPs have been identified in the human population, and certain combinations of SNPs can serve as proxies for larger stretches of DNA that tend to be conserved among individuals.

The new system, which Berger and Simmons developed together with Cenk Sahinalp, a professor of computer science at Indiana University, implements a technique called “differential privacy,” which has been a major area of cryptographic research in recent years. Differential-privacy techniques add a little bit of noise, or random variation, to the results of database searches, to confound algorithms that would seek to extract private information from the results of several, tailored, sequential searches.

Practical applications for non native English speakers

After thousands of hours of work, MIT researchers have released the first major database of fully annotated English sentences written by non-native speakers.

The researchers who led the project had already shown that the grammatical quirks of non-native speakers writing in English could be a source of linguistic insight. But they hope that their dataset could also lead to applications that would improve computers’ handling of spoken or written language of non-native English speakers.

“English is the most used language on the Internet, with over 1 billion speakers,” says Yevgeni Berzak, a graduate student in electrical engineering and computer science, who led the new project. “Most of the people who speak English in the world or produce English text are non-native speakers. This characteristic is often overlooked when we study English scientifically or when we do natural-language processing for English.”

Most natural-language-processing systems, which enable smartphone and other computer applications to process requests phrased in ordinary language, are based on machine learning, in which computer systems look for patterns in huge sets of training data. “If you want to handle noncanonical learner language, in terms of the training material that’s available to you, you can only train on standard English,” Berzak explains.

Systems trained on nonstandard English, on the other hand, could be better able to handle the idiosyncrasies of non-native English speakers, such as tendencies to drop or add prepositions, to substitute particular tenses for others, or to misuse particular auxiliary verbs. Indeed, the researchers hope that their work could lead to grammar-correction software targeted to native speakers of other languages.

Diagramming sentences

The researchers’ dataset consists of 5,124 sentences culled from exam essays written by students of English as a second language (ESL). The sentences were drawn, in approximately equal distribution, from native speakers of 10 languages that are the primary tongues of roughly 40 percent of the world’s population.

Every sentence in the dataset includes at least one grammatical error. The original source of the sentences was a collection made public by Cambridge University, which included annotation of the errors, but no other grammatical or syntactic information.