AgroTagger classifier. Leveraging the knowledge encoded in AGROVOC for AGRIS items


Did you hear about AgroTagger? It is a  JAVA Command Line application to assign semantic terms to textual content. At a high level of abstraction, AgroTagger can be considered as a keyword extractor that uses the AGROVOC thesaurus, to extract keywords from a set of web URLs. The AgroTagger is also used to enrich data in different digital information environments (AGRIS is among them).


Given the dynamic influence of numerous social media (web 2.0 tools and other media) on the world of scientific publications, information systems should:

  • not only facilitate access to information about research outputs, but also
  • learn (be regularly assessed and updated) to make research outcomes enhanced by both topical and contextual semantic data collected by means of web crawling/discovering and indexing, -

to "come across" what is actually largely meaningful to the topic of interest.

To bring support to the aforementioned scenario, this entry introduces you to AgroTagger, which adoption is described in a F1000Research article : - 

* Celli F, Keizer J, Jaques Y et al. Discovering, Indexing and Interlinking Information Resources. F1000Research 2015, 4:432 (doi: 10.12688/f1000research.6848.2)

Supportive quotes from the article: 

“In this context, we believe that it is important for AGRIS users – especially for researchers – to have access to those valuable pieces of information that are neither exposed in a database nor accessible via web service”

“In fact, it is not only important to discover web links, but also to process them in a way that allows reuse in multiple scenarios

“... it is possible to apply semantic enrichment to crawled web resources and to use this semantic knowledge to enhance the AGRIS web portal...our work leverages Semantic Web technologies and the knowledge encoded in the AGROVOC thesaurus in order to recommend web resources that are relevant to a given AGRIS bibliographic item”.

“… we discuss crawling and analysing web resources to populate our “Crawler Database”; a SPARQL endpoint with AGROVOC annotations of web resources identified by the URL from which they were crawled”. 

The entire process we discuss in this paper has already been implemented and integrated in the AGRIS website”. 

AgroTagger classifier

In the aforementioned paper, the AgroTagger tool is presented in the incremental process (developed within SemaGrow project) to discover web resources in the domain of agricultural science and technology. Basically, the paper describes this process in three steps: 

Step 1

After web crawling was done, starting from a set of URLs from “trusted” and valuable websites ... 

Step 2

AgroTagger was applied to give a meaning to discovered resources. AGROVOC URIs/keywords were assigned to any web page, PDF, and word documents discovered during the crawling phase...

Step 3

A recommender system was run, to integrate discovered resources with AGRIS, in order to enrich the user experience of the AGRIS website. 
This was possible thanks to the use of Linked Data methodologies, with help of which a wide array of custom-crawled resources were interlinked with the AGRIS bibliographic database.

The paper also discusses the SemaGrow Stack open-source software, a query federation and data integration infrastructure used to estimate the semantic distance between crawled web resources and AGRIS.  The SemaGrow Stack is used as part of the recommender system, - a JAVA component that computes meaningful combinations between the Crawler Database and the AGRIS database, and generates a new triplestore: the “Recommender Database”. 

NOTE: The workflow and the components described in this paper can be used in any domain, so they are not restricted to agriculture; one can simply use another thesaurus to annotate web resources and populate the Crawler Database composed of triples generated by AgroTagger.

P.S. "Workflow tools can provide a mechanism for moving into earlier stages of the research lifecycle, adding value and ensuring an ongoing pipeline of outstanding scholarship", - Strategy & Integration Among Workflow Providers (The ScholarlyKitchen, 2017)

Enjoy the article !

More about AgroTagger

As a FAO initiative, the AgroTagger JAVA software was developed in the context of a couple of EU projects, such as agINFRA and SemaGrow. The application is language independant, but the model created is based on AGROVOC in English. If one is able to build a new MAUI model, it can support all the languages.

AgroTagger is open source: this means that anyone can build a website or a web service on top of it. On Github: there is the source code, which means that anyone in the world can take the code and add new features.

If you want to use AgroTagger to index your documents with AGROVOC, it is recommended to rely on a programmer or someone who knows how to run JAVA applications from the command line. Learn more: AgroTagger : How to use an updated AGROVOC thesaurus

Text mining in agriculture: The AgroTagger keyword extractor (AgroKnow)

KARWOWSKI, W., WRZECIONO, P., Methods of Automatic Topic Mining in Publications in Agriculture Domain, Information Systems in Management (2017) Vol. 6 (3) 192−202.

AGROTAGS: a contribution towards improved digital information management in agricultural research, 2010,  by T.V. Prabhakar, Lavanya Kiran Nelam, V.Balaji, in Annals of Library and Information Studies

More about AGROVOC and AGRIS

AGROVOC multilingual Thesaurus is “the first FAO resource to embrace the Semantic Web" (The AGROVOC Linked Open Data). In particular, it is an SKOS-XL concept scheme and a Linked Open Data (LOD) set aligned with 18 other multilingual knowledge organization systems related to agriculture. Beginning in April 2017, AGROVOC updates its content on a monthly basis.

AGRIS (the International System for Agricultural Science and Technology), is a collection of more than 9 million multilingual bibliographic resources. The system’s goal is to make agricultural research globally discoverable.