Find the DATA You need ... more easily with Google Dataset Search!

 

Data sets and related information tend to be spread across multiple data repositories on the web. Governments, scientific publishers, researchers and data providers (both individual providers and data repositories) publish data in fields ranging from agriculture, climate science, life science and social science to high-energy physics and more.

In many cases, information about these data sets is neither linked nor indexed by search engines, which makes data discovery frustrating or, in some cases, impossible.

Easy access to data sets and to their provenance on the web is critical for facilitating the reproducibility of research results (thus enabling scientists to build on others’ work) and for boosting returns on investments traceable in different directions.

To make datasets universally accessible and more discoverable through a single interface, Google launched a beta version of GOOGLE DATASET SEARCH in September 2018. It is now available alongside Google’s other specialized search engines.

Google DATASET SEARCH aims to create a data sharing ecosystem that will encourage data publishers and users to follow best practices for producing, storing, consuming, citing and discovering datasets.

SOME TECHNICAL ASPECTS...

To provide a confederated search point

… for the millions of web pages that host datasets, the DATASET SEARCH function relies on structured data embedded in web sites.
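As a rough illustration of what "structured data embedded in web sites" can look like from a consumer's side, the sketch below pulls JSON-LD blocks out of a page and keeps those typed as a schema.org Dataset. It assumes the metadata is embedded as JSON-LD in a script element, which is one common way to embed schema.org markup. The class name, function name and sample page are hypothetical, and this is not a description of how Google's crawler actually works.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed metadata instead of failing the whole page.


def find_datasets(html):
    """Return the JSON-LD objects on the page whose @type is "Dataset"."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [b for b in parser.blocks
            if isinstance(b, dict) and b.get("@type") == "Dataset"]


# Hypothetical landing page, for illustration only.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example rainfall measurements"}
</script>
</head><body><p>Landing page for a dataset.</p></body></html>
"""

datasets = find_datasets(SAMPLE_PAGE)
print([d["name"] for d in datasets])
```

A page without any such markup simply yields an empty list, which is why unmarked datasets stay invisible to this kind of search.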

To embed metadata within the coding of each web page 

… that offers data, Google has adopted schema.org, the open standard for structured data, whose dataset vocabulary builds on an effort recently standardized at W3C (the Data Catalog Vocabulary), and which includes such dataset descriptions as:

  • who created the data
  • when it was created
  • terms of use, etc.

[Developers can contribute to expanding schema.org metadata for datasets, providing domain-specific vocabularies, as well as working on tools and applications that consume this rich metadata].
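To make the kind of description mentioned above more concrete, here is a minimal sketch of a schema.org Dataset record covering who created the data, when it was created and its terms of use. The property names (name, description, creator, dateCreated, license) are schema.org properties; the helper functions and all values are hypothetical and serve only as an illustration, not as a complete or recommended record.

```python
import json


def dataset_jsonld(name, description, creator, date_created, license_url):
    """Build a minimal schema.org Dataset description as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "creator": {"@type": "Organization", "name": creator},  # who created the data
        "dateCreated": date_created,                             # when it was created
        "license": license_url,                                  # terms of use
    }


def as_script_tag(metadata):
    """Wrap the metadata in the script element a web page would embed."""
    body = json.dumps(metadata, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# All values below are made up for illustration.
example = dataset_jsonld(
    name="Example crop yield survey",
    description="Hypothetical survey of crop yields, used for illustration only.",
    creator="Example Research Institute",
    date_created="2018-09-01",
    license_url="https://creativecommons.org/licenses/by/4.0/",
)
print(as_script_tag(example))
```

The printed script element is what a data provider would place in the HTML of the dataset's landing page.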

To help data providers describe their datasets 

… in a structured way, Google Search has published new guidelines. These enable Google and others to link this structured metadata with information describing locations, scientific publications, or even the Knowledge Graph, facilitating data discovery for others.

“… search engines improve most quickly when a critical mass of users is there to provide data on what they’re doing” (Google launches new search engine to help scientists find the datasets they need | The Verge).

Nevertheless, before search for data becomes as seamless as it should be, a number of technical challenges remain, such as:

Defining and identifying more consistently what constitutes a dataset:

  • Is a single table a dataset?
  • What about a collection of related tables?
  • What about a set of images and an API that provides access to data?
  • Is a URL for the metadata page a good identifier; can there be multiple identifiers?

 

Describing content of datasets:

Relating datasets to each other and propagating metadata among related datasets:

  • What if an aggregator provides more metadata about the same dataset or cleans the data in some useful way?
  • How much of the metadata (e.g., provenance information) can be propagated among related datasets?

Dive deeper: 

Keep up to date by signing up for AIMS News and following @AIMS_Community on Twitter.

And thanks again for your interest!