Building a Culture of Data Citation ... with Persistent Identifiers

“DATA CITATION has been advocated across and within many research enterprises globally. Individual researchers have adopted data citation as part of their work and an increasing number of publishers and funders are now encouraging or requiring some level of data citation. The benefits of data citation are clear: besides increasing the visibility of data resources, improving the integrity of research and publications, there is a general trend of acknowledgment and accreditation being associated with data citation. Researchers are beginning to see the value in the citation of their data to be as important as citation of their other outputs”.

Session Title: Measuring the Impact of Data Citation Practices in Research, at SciDataCon, part of International Data Week 2018 (Gaborone, Botswana, 5-8 November 2018)

(Image source: ANDS Data Citation)

In Data Management practice, citing data by means of Persistent Identifiers (PIDs) is increasingly recognized as a lynchpin of the evolving cyber-infrastructure: it facilitates the proliferation of quality data in distributed, trusted digital environments and in information- and data-intensive collaborative research.

This entry spotlights some sources that focus on the importance of using PIDs / DOIs for data citation.


Persistent Identifiers & Digital Object Identifiers: Why do they matter?

A Persistent Identifier (PID) is an association between a character string and an object (files, parts of files, persons, organizations, abstractions, etc.). Once a resource has been registered with a PID, the only location information relevant for this resource from then on should be that identifier. PIDs need to be managed and kept current over time (see: Persistent identifiers, in Digital Preservation Handbook).

The use of PIDs in citations for non-data set assets such as software and projects is an emerging area of focus, and builds on the success of data citation efforts (see: Citing & publishing software: Publishing research software, MIT).

In the field sciences (e.g., geology, ecology, and archaeology), where each study is temporally (and often spatially) unique, PIDs are core elements of digital metadata records, and are especially useful when space or resource requirements complicate storage of the samples described (see: Liberating field science samples and data).

The paper “Identifiers for Earth Science Data Sets: Where We Have Been and Where We Need to Go”, by Goldstein, J.C., Mayernik, M.S. & Ramapriyan, H.K. (2017), Data Science Journal, 16, p. 23, explores the adoption of DOIs for Earth Science data sets, outlines successes, and identifies some remaining challenges.

The Digital Object Identifier (DOI) is by far the most widely used identifier system, with 130 million PIDs assigned to date (International DOI Foundation, IDF).

The DOI system is built on the Handle System, but goes beyond it by providing a resolution system for identifiers and by requiring semantic interoperability, among other things. One of the main purposes of assigning DOI names is to separate the location information from any other metadata about a resource.
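As a minimal sketch of that separation, a citation can carry only the DOI name, while a resolver supplies the current location at the time of access. The snippet below assumes the public resolver at https://doi.org/; the function names and the sample DOI are hypothetical and only meant as illustration.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver (proxy) address

# Common ways a DOI appears in citations; stripped before building the link.
_KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Return the bare DOI name (e.g. '10.1234/abc') from a cited form."""
    doi = raw.strip()
    for prefix in _KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # A DOI name has a prefix starting with '10.' and a suffix after '/'.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI name: {raw!r}")
    return doi


def doi_to_url(raw: str) -> str:
    """Build a resolver link; the data's actual location is never stored."""
    return DOI_RESOLVER + quote(normalize_doi(raw), safe="/")


if __name__ == "__main__":
    # Hypothetical DOI, used only as an example.
    print(doi_to_url("doi:10.1234/example.dataset.v1"))
```

Because only the identifier is recorded, the data holder can move the data set and update the registered location without invalidating citations that already exist.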

The blog entry “Guidelines for the optimal use of Digital Object Identifiers for germplasm samples” introduces the International Treaty on Plant Genetic Resources for Food and Agriculture (FAO of the UN), which raises a number of legal issues that require better identification of and access to germplasm samples.

Assigning PIDs / DOIs to cite data / data patterns unquestionably contributes to the reputation of data citation, while:

* Enabling improved tracking of data set use and re-use

* Providing credit for data producers

* Aiding reproducibility efforts through associating research with the exact data set(s) used

By linking publications with the resources/data underlying the scientific findings, PIDs / DOIs help:

* Uniquely identify research data collections and thus maintain provenance of the source of data. Provenance is important for understanding and using scientific data sets, and critical for independent confirmation of scientific results

* Ensure scientific integrity

* Enhance the searchability, discovery of and access to data

* Improve data management practices

* Strengthen scientific communication

* Facilitate the extraction of citation metrics for a given organization, for a given group of data sets, or even for a single data set, which would otherwise involve a considerable amount of manual effort (see: How to Track the Impact of Research Data with Metrics).

Some good Data Citation practices

Recently, many communities of practice, committed to supporting data science and data management activities and projects, have selected PIDs and DOIs as their identifiers of choice to cite their data ...

“Working with other organizations and groups focused on Data Citation, scholarly communication and enabling data infrastructures, we can make substantial progress”, - CODATA-ICSTI Data Citation Standards and Practices.

EUDAT Services for Data Preservation 

Each Digital Object and Digital Collection stored in EUDAT B2SAFE has a registered PID that can then be used to cite the item in publications and/or to find it.
B2SAFE is a robust, safe and highly available service that allows community and departmental repositories to implement Data Management policies on their research data across multiple administrative domains in a trustworthy manner.


DataCite

DataCite is a leading global non-profit organisation that provides persistent identifiers (DOIs) for research data. DataCite's goal is to help the research community locate, identify, and cite research data with confidence.
The DataCite Metadata Schema for the Publication and Citation of Research Data describes elements that could potentially be included in a citation.

CODATA-ICSTI Task Group on Data Citation Standards and Practices

“A data citation needs an identifier that is both unique and persistent. Persistent Identifiers are categorized as descriptive metadata. However, their role is so critical that they may be put in their own category … For scientific literature, persistent identifiers such as DOIs have often resolved to a landing page containing at least bibliographic metadata”, - The Current State of Practice, Policy, and Technology for the Citation of Data.

UK Digital Curation Centre (DCC)

“Unique identifiers, and metadata describing the data, and its disposition, should persist – even beyond the lifespan of the data they describe”, - How to Cite Datasets and Link to Publications.

FORCE-11 FAIR (findable, accessible, interoperable and re-usable) Data Citation Implementation Group

“A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community”, - The Joint Declaration of Data Citation Principles.

To date, these FORCE-11 principles have been endorsed by about 400 entities worldwide, including journal publishers and professional science societies.

Research Data Alliance (RDA) Working Group (WG) on Data Citation

“The WG recommends solving this challenge by … identifying data sets by storing and assigning PIDs to timestamped queries that can be re-executed against the timestamped data store”, - Recommendations of the Working Group on Data Citation.

“By assigning PIDs to the query, the process is very lightweight and scales with increasing amounts of data. It preserves the subset creation process and thus contributes to the reproducibility of an experiment. Provenance details and metadata about the data set are collected”, - Identification of Reproducible Subsets for Data Citation, Sharing and Re-Use.
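The idea in the two quotes above can be sketched in a few lines. This is an illustration under stated assumptions, not the Working Group's reference implementation: it uses an in-memory SQLite table as the timestamped data store, integer timestamps, and a hash-derived string as a stand-in for a real PID. All table, function, and variable names are hypothetical.

```python
import hashlib
import sqlite3


def make_store():
    """Create an in-memory data store whose rows carry validity timestamps."""
    conn = sqlite3.connect(":memory:")
    # valid_to stays NULL while a row is the current version.
    conn.execute(
        "CREATE TABLE obs (station TEXT, value REAL, "
        "valid_from INTEGER, valid_to INTEGER)"
    )
    return conn


def insert(conn, station, value, ts):
    conn.execute("INSERT INTO obs VALUES (?, ?, ?, NULL)", (station, value, ts))


def retire(conn, station, ts):
    """Mark current rows as superseded at ts instead of deleting them."""
    conn.execute(
        "UPDATE obs SET valid_to = ? WHERE station = ? AND valid_to IS NULL",
        (ts, station),
    )


def run_as_of(conn, where_sql, params, ts):
    """Run a subset query against the state of the store at time ts.

    where_sql is a trusted SQL fragment in this sketch; values go in params.
    """
    sql = (
        "SELECT station, value FROM obs WHERE (" + where_sql + ") "
        "AND valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY station, value"  # stable order so results can be hashed
    )
    return conn.execute(sql, (*params, ts, ts)).fetchall()


query_registry = {}  # stand-in for a persistent query store


def register_query(conn, where_sql, params, ts):
    """Store the query with its timestamp and a result hash; return its PID."""
    rows = run_as_of(conn, where_sql, params, ts)
    result_hash = hashlib.sha256(repr(rows).encode()).hexdigest()
    key = f"{where_sql}|{params}|{ts}".encode()
    pid = "hypothetical-pid:" + hashlib.sha256(key).hexdigest()[:12]
    query_registry[pid] = (where_sql, params, ts, result_hash)
    return pid


def resolve(conn, pid):
    """Re-execute the stored query and check the subset is unchanged."""
    where_sql, params, ts, result_hash = query_registry[pid]
    rows = run_as_of(conn, where_sql, params, ts)
    if hashlib.sha256(repr(rows).encode()).hexdigest() != result_hash:
        raise ValueError("re-executed subset differs from the cited subset")
    return rows
```

The design choice worth noting is that the PID points at the query and its timestamp, not at a stored copy of the subset, which is why the approach stays lightweight as the data grows. Later corrections to the data do not change what `resolve` returns for an earlier citation, because superseded rows are retired rather than deleted.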

Australian National Data Service (ANDS) - Citation and identifiers

ANDS provides or connects with several identifier services and also uses identifiers in ANDS systems. ANDS encourages use of identifiers for (1) research data, such as DOIs or handles; (2) people and organisations, such as an ORCID ID or ISNI ID; and (3) research projects, activities or grants, such as ANDS purl IDs for ARC and NHMRC grants.

ICPSR (Institute for Social Research, University of Michigan) – Data Citations

The ICPSR recommends straightforward citations that include the elements of title, author, date, version, and PID.

In addition to these elements, ICPSR also recommends the addition of fixity information, such as a checksum or Universal Numeric Fingerprint (UNF), which are definitive ways to establish provenance of the data.
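A minimal sketch of how these elements and a piece of fixity information might be combined is shown below. The citation layout, the function names, and all sample values are assumptions made for illustration, not ICPSR's prescribed format. A plain SHA-256 checksum of the file bytes is used here; it is not a Universal Numeric Fingerprint, which is computed differently.

```python
import hashlib


def sha256_checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the exact bytes being cited."""
    return hashlib.sha256(data).hexdigest()


def format_citation(author, date, title, version, pid, checksum=None):
    """Join the recommended elements into one string (layout is an assumption)."""
    citation = f"{author} ({date}). {title}. Version {version}. {pid}"
    if checksum:
        # Fixity information lets a reader confirm they hold the cited bytes.
        citation += f" [SHA-256: {checksum}]"
    return citation


if __name__ == "__main__":
    # Hypothetical data set content and metadata, for illustration only.
    data = b"station,value\nA,1.0\nB,2.0\n"
    print(
        format_citation(
            author="Doe, J.",
            date="2018",
            title="Example Survey Data",
            version="2",
            pid="https://doi.org/10.1234/example.v2",
            checksum=sha256_checksum(data),
        )
    )
```

A reader who downloads the cited file can recompute the checksum and compare it with the one in the citation to confirm that the data are unchanged.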

ICSU World Data System (ICSU-WDS) accreditation for Trusted Data Services for Global Science

“All who produce, share, and use data and metadata are stewards of those data, and have responsibility for ensuring that the authenticity, quality, and integrity of the data are preserved, and respect for the data source is maintained by ensuring privacy where appropriate, and encouraging appropriate citation of the dataset and original work and acknowledgement of the data repository”, - Data Sharing Principles.

EZID - a service to create and manage long-term globally unique IDs for data and other sources

The California Digital Library and Purdue University are adopting a new strategic direction for their EZID DOI services to support DataCite’s long-term sustainability and to improve DOI services for the broader community… In addition to DOIs, EZID supports ARK identifiers (see: EZID DOI Service is Evolving).

European Biodiversity Observation Network (EU BON) Portal

The Data Policy Recommendations for Biodiversity Data, developed for use in the EU BON portal, emphasize data citation as one of their important components.

Earth Science Information Partners (ESIP)

The Federation of ESIP has developed Data Citation Guidelines for Data Providers and Archives, which meet all the purposes of Earth science data citation and have been adopted by over 180 participating member organizations.

The Global Change Information System (GCIS)

The GCIS uses PIDs to link findings from the US National Climate Assessment with their underlying data, and to trace indicators found within Integrated Ecosystem Assessments.
