Over time, English has swirled into dialects so different that speakers from the same country cannot always understand each other. Similarly, linguists – as they have catalogued words, spellings, pronunciations, and meanings – have tailored their individual academic databases to the needs of their own research.

In an age of computational linguistics, that can be a problem. Computers offer vastly improved capabilities for finding patterns and connections. But while human brains are good at smoothing over minor inconsistencies, computers tend to be very literal.

And data that can’t be understood can’t be part of the conversation. “Because of the large quantities of data that can be brought to bear on a problem, for many studies occasional data quality issues are not fatal,” explains SFI Professor Tanmoy Bhattacharya, who leads SFI’s linguistics program. But, he says, “the next advance in linguistics will need to understand weak signals or complicated histories deep in the data, and in these situations data issues will be very important. We will need to understand how the data being used are selected, curated, and presented.”

Further, language databases will need to adopt coding conventions that allow them to talk to one another. “We need to develop a lingua franca for all linguistics databases to speak,” he says. “Whatever way databases organize their own data, or speak their own internal dialect, we should be able to translate them all into something universally understandable and answer queries using the same code all others use.”

Bhattacharya, SFI Distinguished Fellow Murray Gell-Mann, and longtime SFI collaborator George Starostin are hosting an invitation-only working group this week at SFI to address this challenge. Conventional and computational linguists will evaluate existing relevant online and offline databases, explore optimal data formats, and discuss – perhaps even establish – the most useful programmed analysis tools for historical linguistics research.

“What is going to come of this is the preparation to enable the next big advance in computational linguistics,” Bhattacharya says.
