Transforming the Medical Subject Headings into Linked Data: a new article in the Journal of Library Metadata

This article reviews the pilot project to convert the Medical Subject Headings from XML to linked data/RDF. The article examines the collaborative process, the technical and organizational issues tackled, and the future of linked data at the National Library of Medicine.


In February 2014, the National Library of Medicine formed the Linked Data Infrastructure Working Group to investigate the potential for publishing linked data (LD), to determine best practices for publishing LD, and to prioritize LD projects. The first of these projects was an LD pilot: transforming the Medical Subject Headings (MeSH), a hierarchically organized terminology used to index and catalog biomedical information for the MEDLINE/PubMed database.

LD lies at the heart of what the Semantic Web (a Web of Data, as opposed to a mere collection of datasets) is all about: large-scale integration of, and reasoning on, data and datasets on the Web. These datasets are available in a standard common format, RDF (based on triple statements), and are reachable and manageable by Semantic Web (SW) technologies (RDF, OWL, SKOS, GRDDL, POWDER, RDFa, R2RML, RIF, SPARQL, etc.). Together these provide an environment where applications can convert existing databases (relational, XML, HTML, etc.) or access them on the fly, query that data, draw inferences using vocabularies, and so on.

The library community has embraced LD and other SW technologies as a means to better expose data about their collections and encourage the reuse of library data on the Web. In 2011, the World Wide Web Consortium (W3C) released a set of recommendations on linked data for libraries. Like other national libraries, including the Library of Congress, the National Agricultural Library, and the British Library, the National Library of Medicine (NLM) has shown an increasing desire to participate in LD to ensure that users can link and reuse consistent, permanent, and authoritative NLM data.

In 2013, NLM conducted an environmental scan of LD at peer institutions and a survey of NLM’s “datascape” in order to generate recommendations on how NLM could participate in the SW. In early 2014, researchers at NLM’s Lister Hill National Center for Biomedical Communications submitted a paper to the 2014 American Medical Informatics Association (AMIA) Meeting analyzing six different versions of MeSH in RDF. Subsequently, a beta version of PubChem RDF was released by the National Center for Biotechnology Information at NLM.

Afterwards, the NLM LD Infrastructure Working Group chose MeSH as a pilot project for transforming, storing, and publishing NLM data as RDF/LD. In particular, the Working Group produced a draft set of triple statements by adapting NLM’s existing internal prototype, an eXtensible Stylesheet Language Transformation (XSLT) that generates RDF data from the existing MeSH XML.
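The XML-to-triples step can be illustrated with a minimal sketch. It uses Python rather than the XSLT the Working Group adapted, and it is not NLM's actual transformation. The sample record is a heavily simplified stand-in for MeSH XML, and the namespace URIs and the `Descriptor` class name are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative record; real MeSH XML carries many more elements.
SAMPLE_XML = """
<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>D000001</DescriptorUI>
    <DescriptorName><String>Calcimycin</String></DescriptorName>
  </DescriptorRecord>
</DescriptorRecordSet>
"""

# Assumed namespaces and class name for this sketch.
MESH = "http://id.nlm.nih.gov/mesh/"
MESHV = "http://id.nlm.nih.gov/mesh/vocab#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def descriptors_to_ntriples(xml_text):
    """Return a list of N-Triples lines, two per descriptor record."""
    root = ET.fromstring(xml_text)
    lines = []
    for record in root.iter("DescriptorRecord"):
        ui = record.findtext("DescriptorUI")
        name = record.findtext("DescriptorName/String")
        subject = f"<{MESH}{ui}>"
        # One triple for the type of the resource, one for its label.
        lines.append(f"{subject} <{RDF_TYPE}> <{MESHV}Descriptor> .")
        lines.append(f'{subject} <{RDFS_LABEL}> "{escape_literal(name)}" .')
    return lines


if __name__ == "__main__":
    for line in descriptors_to_ntriples(SAMPLE_XML):
        print(line)
```

Each output line is one subject-predicate-object statement, which is the unit the draft set of triples is built from.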

To coincide with the AMIA meeting, NLM released (on November 17, 2014) the initial beta version of MeSH RDF to facilitate data sharing and linking using SW standards and technologies.

The types and relations (i.e., semantics) in the MeSH RDF model are defined in the ontology as MeSH RDF classes (and subclasses) and predicates accompanied by their definitions. The Working Group discussed the advantages and disadvantages of utilizing existing ontologies, such as SKOS, to represent MeSH in RDF but eventually decided to develop a specific MeSH RDF vocabulary to define the types and relationships expressed in the XML files. The vocabulary was prepared manually using Protégé, a widely used ontology editor.

In addition to modeling MeSH RDF data (represented as a URI reference or encoded as a literal string), the Working Group established a Web presence for MeSH RDF that includes a read-only SPARQL endpoint (to query MeSH RDF directly using the SPARQL query language), a SPARQL query editor, a browsable interface, a RESTful interface for URIs, documentation, and GitHub repositories. The interface is powered by a stack that includes OpenLink’s open source Virtuoso RDF server and an open source front-end Java application deployed in Tomcat.
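As a rough sketch of how a client might address a read-only SPARQL endpoint, the snippet below encodes a query as an HTTP GET URL using the `query` parameter defined by the SPARQL protocol. The endpoint address and the sample query are assumptions for illustration and are not taken from the article. No request is sent, so the sketch runs offline.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint address for this sketch.
ENDPOINT = "https://id.nlm.nih.gov/mesh/sparql"

# Illustrative query: list a few resources together with their labels.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?descriptor ?label
WHERE { ?descriptor rdfs:label ?label . }
LIMIT 5
""".strip()


def build_request_url(query):
    """Encode a SPARQL query as an HTTP GET URL for the endpoint."""
    return f"{ENDPOINT}?{urlencode({'query': query})}"


if __name__ == "__main__":
    url = build_request_url(QUERY)
    print(url)
    # Decode the URL again to confirm the query survived encoding intact.
    assert parse_qs(urlsplit(url).query)["query"] == [QUERY]
```

A real client would then fetch this URL and usually request a result format through the HTTP `Accept` header.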

NLM utilized WebTrends software to conduct analytics on the MeSH RDF webpages and discovered a number of organizations utilizing the site that were not included in the known set of beta partners. As a result of feedback received from the beta partners, language tags (with multilingual capability) were added for all strings with literal values.
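To make the language-tag change concrete, here is a small sketch of how a string literal with and without a language tag is written in N-Triples syntax. The helper name `ntriples_literal` and the sample values are hypothetical and do not describe NLM's implementation.

```python
def ntriples_literal(value, lang=None):
    """Format a string as an N-Triples literal, optionally language-tagged."""
    # Escape backslashes first, then double quotes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    literal = f'"{escaped}"'
    if lang:
        # The tag follows the closing quote, separated by "@".
        literal += f"@{lang}"
    return literal


if __name__ == "__main__":
    print(ntriples_literal("Calcimycin"))        # plain literal
    print(ntriples_literal("Calcimycin", "en"))  # English-tagged literal
```

Tagging each literal lets consumers distinguish labels in different languages that are attached to the same resource.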

In addition to producing another release of MeSH RDF in sync with the XML (an updated beta of MeSH RDF was released around November 20, 2015), NLM will establish consistent policies and procedures to publish LD (so far the publication of MeSH RDF is not truly LD) and incorporate MeSH RDF into other LD activities.


Source: Barbara Bushman, David Anderson & Gang Fu, Transforming the Medical Subject Headings into Linked Data: Creating the Authorized Version of MeSH in RDF, Journal of Library Metadata, Special Issue: Controlled Vocabularies and the Semantic Web, Volume 15, Issue 3-4, 2015, pp. 157-176, http://www.tandfonline.com/doi/abs/10.1080/19386389.2015.1099967?journalCode=wjlm20#.VrYxQfnhCM8

We would be pleased to hear from you about your ideas and experiences in Linked Data publishing and consumption (bibliographic data, terminologies, controlled vocabularies and more)!