Murray Cox, Sean Downey, Brian Hallmark, J. Lansing, Peter Norquest

Paper #: 07-08-021

Historical relationships among languages are used as a proxy for social history in many non-linguistic settings, including the fields of cultural and molecular anthropology. Linguists have traditionally assembled this information using the standard comparative method. While this approach provides extremely nuanced linguistic information, it is time-consuming and labor-intensive. Computational approaches, by contrast, are appreciably quicker but can introduce significant error. Furthermore, current methods often use cognate sets that were themselves coded by historical linguists, which reduces the benefit of a computational approach. Here we develop a method, based on the ALINE distance, to extract feature-sensitive relationships from paired glosses: datasets that require minimal contribution from trained linguists beyond transcription from primary sources. We validate our results by comparison with data generated independently via the comparative method, and quantify error rates using consistency indices. To showcase our method’s utility and to demonstrate its robustness at local and regional scales, we apply it to two language datasets from eastern Indonesia. As linguistic datasets proliferate, scalable computational methods that mimic historical linguistic reconstruction will become increasingly necessary. Although at present we cannot disentangle all the processes driving linguistic change (e.g., lexical borrowing), our method provides a robust and accurate alternative to manual linguistic analysis. The feature-sensitive method adopted here accurately and automatically identifies emergent patterns hidden in traditional word lists by analyzing critical phonetic information that is discarded (or required as a prerequisite) by many current cognate-based computational methods.
This approach is not intended to supplant manual linguistic analysis, but it has an important role in quickly generating robust data for non-linguistic fields or interdisciplinary projects that require formal quantitative analysis of historical linguistic relationships. It provides a workable approximate phylogeny in cases where a trained linguist is unavailable, and otherwise significantly reduces the time and effort required for manual classification.
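
As a rough illustration of how paired glosses can be turned into between-language distances, the following Python sketch averages a normalized string distance over the glosses that two word lists share. It is not the authors' implementation: a plain character-level edit distance stands in for the feature-sensitive ALINE distance, and the function names (`edit_distance`, `normalized_distance`, `language_distances`) and the input layout (a dictionary mapping each language to its gloss-to-transcription dictionary) are assumptions made for this example only.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance; a stand-in here, NOT the ALINE distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Scale the edit distance to [0, 1] by the longer form's length."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))


def language_distances(wordlists: dict) -> dict:
    """Average the normalized distance over glosses shared by each language pair.

    `wordlists` maps language name -> {gloss: transcription} (assumed layout).
    Pairs with no shared glosses are skipped.
    """
    distances = {}
    for lang_a, lang_b in combinations(sorted(wordlists), 2):
        shared = wordlists[lang_a].keys() & wordlists[lang_b].keys()
        if not shared:
            continue
        total = sum(
            normalized_distance(wordlists[lang_a][g], wordlists[lang_b][g])
            for g in shared
        )
        distances[(lang_a, lang_b)] = total / len(shared)
    return distances
```

Calling `language_distances` on a dictionary of word lists returns one averaged distance per language pair; a matrix built from these values could then be passed to a standard tree-building procedure. A feature-sensitive measure such as ALINE would replace `edit_distance`, so that phonetically similar segments are penalized less than dissimilar ones.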