Belikov, Alexander V.; Andrey Rzhetsky and James Evans

The explosive growth of scientists, scientific journals, articles and findings in recent years [1,2] exponentially increases the difficulty scientists face in navigating prior knowledge and collectively reasoning over it to drive future advance [3,4]. This challenge is exacerbated by uncertainty about the reproducibility of published findings [5-8]. The availability of massive digital archives, machine reading and extraction tools on the one hand, and automated high-throughput experiments on the other, allow us to evaluate these challenges at scale and identify novel opportunities for accelerating scientific advance [9]. Here we demonstrate a Bayesian calculus that enables the positive prediction of robust, replicable scientific claims with findings automatically extracted from published literature on gene interactions. We matched these findings, filtered by science, with unfiltered gene interactions measured by the massive LINCS L1000 high-throughput experiment to identify and counteract sources of bias. Our calculus is built on easily extracted publication meta-data regarding the position of a scientific claim within the web of prior knowledge, and its breadth of support across institutions, authors and communities, revealing that scientifically focused but socially and institutionally independent research activity is most likely to replicate. This contrasts with the ineffectiveness of alternative strategies like "follow the leader"—trusting top journals and top scientists—which do not predict robust findings. These findings recommend policies that go against the common practice of channeling biomedical research funding into centralized research consortia and institutes rather than dispersing it more broadly. Our results demonstrate that robust scientific findings hinge upon a delicate balance of shared focus and independence, and that this complex pattern can be computationally exploited to decode bias and predict the replicability of published findings.
These insights provide guidance for scientists navigating the research literature and for science funders seeking to improve it. Moreover, our project models an entirely machine-driven research pipeline, from machine reading to evaluation, that intelligent algorithms could incorporate to augment scientific search, recommending fruitful research hypotheses to accelerate cumulative scientific advance.
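To make the idea of a Bayesian calculus over publication meta-data concrete, the sketch below shows a toy naive-Bayes update for the probability that a claim replicates, given binary meta-data signals such as support from independent institutions and author groups. The feature names, prior, and conditional probabilities are illustrative assumptions, not values estimated in the paper; a top-journal feature is deliberately set to be uninformative, mirroring the paper's finding that "follow the leader" strategies do not predict robustness.

```python
import math

# Hypothetical meta-data features for one published gene-interaction claim.
# All names and values are illustrative assumptions, not data from the paper.
features = {
    "independent_institutions": True,   # supported by socially independent labs
    "independent_author_groups": True,  # supported by non-overlapping author teams
    "top_journal": False,               # appeared in a high-prestige venue
}

# Assumed conditionals: (P(feature | replicates), P(feature | fails)).
likelihoods = {
    "independent_institutions": (0.7, 0.4),
    "independent_author_groups": (0.6, 0.3),
    "top_journal": (0.5, 0.5),  # uninformative, per the paper's finding
}


def posterior_replication(features, likelihoods, prior=0.5):
    """Naive-Bayes update: combine meta-data signals in log-odds space."""
    log_odds = math.log(prior / (1 - prior))
    for name, present in features.items():
        p_rep, p_fail = likelihoods[name]
        if present:
            log_odds += math.log(p_rep / p_fail)
        else:
            log_odds += math.log((1 - p_rep) / (1 - p_fail))
    return 1 / (1 + math.exp(-log_odds))


print(round(posterior_replication(features, likelihoods), 3))  # → 0.778
```

With these made-up numbers, the two independence signals multiply the prior odds by 1.75 and 2.0 while the journal feature contributes nothing, yielding odds of 3.5 and a posterior of about 0.78. The actual calculus in the paper is fit to extracted literature features and LINCS L1000 outcomes; this block only illustrates the form of such an update.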