A novel, freely available online dataset enables users to identify overlooked individuals, institutions and connections in the history of genomic science. It comprises more than 13 million records and was compiled over more than two years by the ERC-funded project Medical Translation in the History of Modern Genomics (TRANSGENE). The dataset documents the institutions that submitted yeast, human and pig DNA sequences to the European Nucleotide Archive and other open-access databases between 1980 and 2015, indicating for each institution the number of nucleotides submitted and the year of submission. It also lists the PubMed ID, authors and publication year of the articles that first described these sequences in the scientific literature. The source code of the software used to compile the data can also be downloaded without restrictions.
The data collection process involved 30 million automated searches in the European Nucleotide Archive, Europe PubMed Central and Scopus. The search results had to be structured in a new way to interlink 13.4 million sequence submissions with 29,560 fully indexed publications; more than 75% of these records relate to human sequencing. A data note describing the search strategy and cleaning protocol, as well as the design and structure of the dataset, has been published on F1000Research, an open-access life sciences platform with open peer review.
The TRANSGENE team, which includes several researchers from STIS, is now analysing a number of co-authorship networks derived from the data. These analyses are being combined with historical knowledge that the project has drawn from oral histories and archival searches. The results of this mixed-methods approach will be published in a history of science journal in 2020 or early 2021.
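As a rough illustration of how a co-authorship network can be derived from the author lists attached to each publication, the sketch below counts how often each pair of authors appears on the same article. It is not the project's own software: the function name `coauthorship_edges`, the input layout (one list of author names per publication) and the sample names are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def coauthorship_edges(author_lists):
    """Count how often each pair of authors appears on the same article.

    `author_lists` is assumed to be an iterable of author-name lists,
    one list per publication (a hypothetical layout, not the dataset's).
    """
    edges = Counter()
    for authors in author_lists:
        # Deduplicate and sort so each pair has one canonical ordering.
        unique = sorted(set(authors))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical author lists, one per publication record.
records = [
    ["Author A", "Author B", "Author C"],
    ["Author A", "Author B"],
    ["Author C", "Author D"],
]
edges = coauthorship_edges(records)
```

Each key in `edges` is a pair of authors and each value is the number of shared publications, which can serve as the edge weight in a network analysis.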