What and Why Technology is important to push the development of AGROVOC Thesaurus

04.09.2017

Dear AIMS Readers!

AIMS is pleased to share with you an interview with Andrea Turbati, who is responsible for managing the maintenance of AGROVOC technical facilities.
The multilingual agricultural thesaurus AGROVOC is a system of terms, concepts and relations that presents knowledge in all areas of interest of the FAO, including food, nutrition, agriculture, fisheries, forestry, environment etc.
_________________________________________________________________________________________________

Andrea Turbati, Ph.D., is a Research Associate at the University of Rome “Tor Vergata”. His research interests span Knowledge Representation and Knowledge Based Systems. He is the author of ~20 publications in the Semantic Web area. During his Ph.D. he worked on ontology learning and population from unstructured content. His Ph.D. thesis was about the design and development of CODA (Computer-aided Ontology Development Architecture), an architecture and a framework that extends the unstructured information management framework UIMA to support the generation of RDF data. He is one of the developers of Semantic Turkey, a platform for Knowledge Acquisition and Management, and of VocBench (mainly covering its interaction with Semantic Turkey) - a collaborative web-based, multilingual, editing and workflow tool that manages thesauri, authority lists and glossaries using SKOS-XL. He has also contributed to the EU-funded project SemaGrow.
_________________________________________________________________________________________________

AIMS : Your research team (the ART Group at the University of Rome Tor Vergata) is engaged in the development and maintenance of VocBench (VB) - a free and open source RDF modelling web environment for editing thesauri. While VB2 is being used for the creation of the AGROVOC thesaurus, VB3 will be released soon. What are the expected revamps and improvements in VB3, and how will end users benefit from the new VB version?

Andrea Turbati (A.T.) : VB2 is used to edit and manage thesauri, but it does not have all the functionalities provided by an Ontology Editor tool. For example, in VB2 it is not possible to add a new property directly (you need to import a vocabulary having the desired property), nor is it possible to edit a property. VB3 can be considered a complete Ontology Editor tool. It can therefore be used to edit either an ontology or a thesaurus, providing all the desired functionalities (adding/editing concepts/schemes/classes/instances/properties, managing SKOS collections, writing restrictions for OWL classes, etc.).

AIMS : Is AGROVOC moving to VB3 soon?

A.T. : Not in the coming months, but we will keep you informed.

AIMS : What is the technical capacity of VB to manage a large-scale RDF dataset such as the AGROVOC URIs in SKOS-XL? More precisely, how many million AGROVOC RDF triples does VB currently manage? What do the statistics say?

A.T. : Both VB2 and VB3 adopt the Sesame2 (now called RDF4J) standard. This means that they are able to connect to dedicated and well-known RDF triplestores (for example GraphDB), relying on their performance in dealing with large RDF repositories. At the moment, AGROVOC is managed by VB2 and consists of more than 6,260,000 triples.

AIMS : VB is designed to support the Linked Data (LD) approach. Could you please explain the mechanism behind it? What standards, concept schemes and protocols is VB based on to publish controlled vocabularies as LD?

A.T. : VB (and in particular VB3) offers editors the possibility to link a specific concept to a concept present in a different ontology. At the moment (in VB2), this matching between a concept in a thesaurus managed by VB and a remote one is done by specifying the exact URL of the second concept. In the next release, VB3, a dedicated GUI will guide the editor in finding the most appropriate concept to state the desired link. These links among concepts can then be exported alongside the RDF data, since they follow the RDF standards regarding the matching between RDF resources.
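As an illustration of what such an exported link might look like, here is a minimal Python sketch that writes one concept-to-concept link as an N-Triples line. The interview does not name the linking properties; the sketch assumes the SKOS mapping properties (skos:exactMatch and its siblings), and both concept URIs and the helper name mapping_triple are hypothetical placeholders.

```python
# Namespace of the SKOS core vocabulary (W3C).
SKOS = "http://www.w3.org/2004/02/skos/core#"

# SKOS mapping properties; assuming these are the link types in use.
MAPPING_RELATIONS = {
    "exactMatch", "closeMatch", "broadMatch", "narrowMatch", "relatedMatch",
}


def mapping_triple(local_concept: str, remote_concept: str,
                   relation: str = "exactMatch") -> str:
    """Return one N-Triples line linking two concept URIs.

    Hypothetical helper: no escaping is done, so both URIs are
    assumed to be well-formed absolute IRIs.
    """
    if relation not in MAPPING_RELATIONS:
        raise ValueError(f"unsupported SKOS mapping relation: {relation}")
    return f"<{local_concept}> <{SKOS}{relation}> <{remote_concept}> ."


# Placeholder URIs, not real AGROVOC or external identifiers.
line = mapping_triple(
    "http://example.org/thesaurus/c_1",
    "http://example.org/other/maize",
)
print(line)
```

Because the link is an ordinary RDF triple, it can travel with the rest of the exported data and be loaded by any RDF-aware tool.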

AIMS : Could you tell us how many vocabularies the AGROVOC dataset is currently aligned to through the LD mechanism, and which ones they are?

A.T. : In the current release, there are 18 external datasets to which AGROVOC is linked. The ones having more than 1200 links are: the Chinese Agricultural Thesaurus maintained by CAAS, the National Agricultural Library's Agricultural Thesaurus (NALT), DBpedia, the SWD of the German National Library (DNB), the BNCF thesaurus of the National Central Library of Florence, the Aquatic Sciences and Fisheries Abstracts (ASFA) Thesaurus, the EARTh thesaurus and the EUROVOC thesaurus. More information can be found in the AGROVOC VoID file.

AIMS : Could you briefly tell us what the AGROVOC VoID description stands for and what function it serves?

A.T. : Every time an AGROVOC release is ready, a VoID (Vocabulary of Interlinked Datasets) file is provided as well. A VoID file contains all the metadata about a specific ontology. In the AGROVOC case, the VoID file contains: the number of triples, concepts and terms for the current release; when the VoID file was generated; the list of languages having at least one term; for every language, the number of terms (both prefLabel and altLabel); and the datasets to which AGROVOC is linked (with the number of links).

AIMS : Would you be so kind as to explain how you technically detect the altLabel and prefLabel of a given AGROVOC concept in a given language?

A.T. : Since AGROVOC is serialized in SKOS-XL, there are two distinct properties (skosxl:prefLabel and skosxl:altLabel) to specify a label for a given concept (or a scheme). The language is contained in the label itself, so, using an ad-hoc SPARQL query, it is possible, given a specific concept, to obtain all its prefLabels and altLabels (with all the data associated with them) and filter them according to the desired language.
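The kind of ad-hoc query described here could look roughly like the one built by the Python sketch below. It is not the query actually used for AGROVOC: the function name labels_query, the variable names and the concept URI are illustrative assumptions. The sketch relies only on the SKOS-XL vocabulary, in which each label is a resource whose text and language tag sit on skosxl:literalForm.

```python
# Namespace of the SKOS-XL vocabulary (W3C).
SKOSXL = "http://www.w3.org/2008/05/skos-xl#"


def labels_query(concept_uri: str, lang: str) -> str:
    """Build a SPARQL SELECT returning the preferred and alternative
    labels of one concept in one language.

    Hypothetical helper: inputs are interpolated without escaping,
    so they are assumed to come from a trusted source.
    """
    return f"""PREFIX skosxl: <{SKOSXL}>
SELECT ?kind ?label ?form WHERE {{
  VALUES (?prop ?kind) {{
    (skosxl:prefLabel "prefLabel")
    (skosxl:altLabel "altLabel")
  }}
  <{concept_uri}> ?prop ?label .
  ?label skosxl:literalForm ?form .
  FILTER (LANG(?form) = "{lang}")
}}"""


# Placeholder concept URI, not a real AGROVOC identifier.
query = labels_query("http://example.org/thesaurus/c_1", "en")
print(query)
```

The FILTER compares the language tag exactly; langMatches could be used instead if regional variants such as "en-GB" should also be returned.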

AIMS : It is said that "VocBench is not only a tool for editing of multilingual thesauri, but is multilingual itself". It would be interesting to know how VB is technically designed to cope with this "Babylon's task" and how many languages it can support in its architecture, considering also the various “alignment experiences” among different languages inside VB.

A.T. : VB is multilingual itself since it is possible to change the display language. This can help users manage and browse the current thesaurus. In VB2 it is possible to select one of the following languages: English, Spanish, Dutch and Thai. In VB3, since the GUI has been deeply changed, we are still deciding which languages will be supported for the User Interface.

AIMS : Could you briefly explain how the AGROVOC SPARQL endpoint is "activated" and what its main function is?

A.T. : Inside VB there is a dedicated tab to execute arbitrary SPARQL queries. Since such queries can be extremely problematic for several reasons (they can be computationally intensive; a SPARQL UPDATE could bypass the checks performed by VB, etc.), only selected users are allowed to execute SPARQL queries or updates. The main function of providing a SPARQL endpoint inside VB (even for a small group of users) is to be able to obtain precise answers to complex RDF questions (such as "how many concepts have more than one altLabel in a given language").
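To make the example question concrete, the Python sketch below builds one possible SPARQL query that counts the concepts having more than one altLabel in a given language. It is only an assumption about how such a question could be phrased: the function name multi_alt_label_count_query and the variable names are illustrative, and the query presumes that concepts are typed as skos:Concept and labelled through SKOS-XL.

```python
def multi_alt_label_count_query(lang: str) -> str:
    """Build a SPARQL query counting concepts that have more than
    one skosxl:altLabel in the given language.

    Hypothetical helper: `lang` is interpolated without escaping,
    so it is assumed to be a trusted language tag such as "en".
    """
    return f"""PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
SELECT (COUNT(?concept) AS ?total) WHERE {{
  {{
    SELECT ?concept WHERE {{
      ?concept a skos:Concept ;
               skosxl:altLabel ?label .
      ?label skosxl:literalForm ?form .
      FILTER (LANG(?form) = "{lang}")
    }}
    GROUP BY ?concept
    HAVING (COUNT(DISTINCT ?label) > 1)
  }}
}}"""


count_query = multi_alt_label_count_query("en")
print(count_query)
```

The inner SELECT keeps only the concepts with at least two distinct altLabels in the language, and the outer SELECT counts them. Grouping over every concept is exactly the sort of work that can become expensive on a dataset of several million triples, which is consistent with restricting the tab to selected users.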

AIMS : Could you explain how you are going to transfer the old AGROVOC backup data from its previous version to the new one (i.e. from VB2 to VB3, once it is released)? It seems to be a rather tricky task ...

A.T. : VB2 and VB3 have two different architectures. The main difference (regarding how the data is managed and stored) is that in VB2 a DB was needed to store all the information about users, their actions and how the RDF data changed through time (the history and validation metadata), while in VB3 all this metadata is stored in a dedicated ontology.
The main data (AGROVOC) is still saved in a triple store. This means that when passing from VB2 to VB3, the AGROVOC data (all its concepts, schemes, properties, etc.) is immediately available (it can be exported and then loaded with the appropriate functions) and can be edited and displayed right away, but the information about users, the history and the validation is lost.

AIMS : Thank you so much for taking time out of your busy schedule to answer AIMS questions and to share part of your knowledge and experience with AIMS Readers!
