OpenMinTeD: new EU project for text and data mining. Behind the scenes.
OpenMinTeD aspires to enable the creation of an infrastructure that fosters and facilitates the use of text mining technologies in the scientific publications world, builds on existing text mining tools and platforms, and renders them discoverable and interoperable through appropriate registries and a standards-based interoperability layer.
Text and data mining in the OpenMinTeD project
Text and data mining (TDM) refers to the process or practice of examining large collections of written resources in order to generate new information. It applies specialized software, algorithms and techniques to existing textual information (at multiple levels and in several dimensions) so that machines can read and analyze it and extract meaningful (hidden and new) information, data and knowledge for humans. TDM is a complex process that combines techniques from areas such as information retrieval, natural language processing, information extraction and data mining into a single workflow. However, text mining solutions are not easy to discover and use, nor can end users easily combine them.
Aiming to tackle a number of text mining challenges, the Horizon 2020 project OpenMinTeD: Open Mining INfrastructure for TExt and Data (2015-2018):
- aims at rendering text mining tools and platforms discoverable and interoperable through appropriate registries and a standards-based interoperability layer (interoperable framework), towards achieving the goals of open science;
- aspires to enable the creation of a European infrastructure that fosters and facilitates the discovery and use of text mining technologies in the scientific publications world.
OpenMinTeD text mining tools, services and associated resources will run on the cloud, requiring an in-depth optimization of service deployment and execution via scalable VM-based service distribution and use of distributed storage.
Through its infrastructural foresight activities, OpenMinTeD’s vision is to make operational a virtuous cycle in which:
To achieve these goals, OpenMinTeD brings together different stakeholders: content providers and scientific communities, text mining and infrastructure builders, legal experts, data and computing centers, industrial players, and SMEs.
The scholars and experts from the aforementioned communities are involved in the following activities:
- gathering requirements and charting the respective fields as to TDM usage and practices as well as tools, resources and standards used,
- defining prototype applications serving the corresponding scientific communities via the OpenMinTeD infrastructure,
- evaluating these applications in relation to the infrastructure.
All activities are organized into the nine Work Packages (WPs) of the project:
- WP1 - PROJECT MANAGEMENT
- WP2 - COMMUNITY ENGAGEMENT AND SUSTAINABILITY
- WP3 - SUPPORT AND TRAINING
- WP4 - COMMUNITY DRIVEN REQUIREMENTS AND EVALUATION
- WP5 - INTEROPERABILITY FRAMEWORK (see: Outcomes Of The OpenMinteD Interoperability Workshop)
- WP6 - PLATFORM DESIGN AND IMPLEMENTATION
- WP7 - PLATFORM INTEGRATION, TESTING AND DEPLOYMENT
- WP8 - OPERATION AND MAINTENANCE
- WP9 - COMMUNITY DRIVEN APPLICATIONS IMPLEMENTATION
OpenMinTeD supports training of text mining users and developers alike and demonstrates the merits of the approach through several use cases identified by scholars and experts from different scientific areas:
Agriculture & Biodiversity (two agriculture-related institutions, Agroknow and INRA, are involved in the activities of this cluster):
- Enrich agricultural databases to assist food- and water-borne disease outbreak alerts and product recalls
- Image, figure and dataset discovery in the AGRIS FAO online service
Generic Scholarly Communication:
- Semantic search and discovery of open scientific outcomes
- Map of academia – scholarly communication network
- Research monitoring and analytics
Life Sciences:
- Text mining assisted curation of the EMBL-EBI chemical databases
- Curation of the neurosciences resources KnowledgeBase and Neurolex
Social Sciences:
- Develop and evaluate methods for the automatic detection and linking of named entities, citation traces and intentions in social science scientific publications.
The role of Agroknow in the project
Agroknow's task is to gather TDM requirements from other stakeholders (OpenMinTeD’s future platform users and contributors), so that OpenMinTeD builds a TDM platform that meets the requirements of all stakeholders as well as possible.
In particular, Agroknow leads WP4, which is responsible for collecting research communities’ requirements relevant to TDM. Through an iterative design process and close interaction with a number of research communities, Agroknow’s activities in this WP are to: (1) identify typical use cases and applications; (2) record and chart the profiles of end users/researchers who are involved in or use TDM.
The Agroknow team needs to ensure that the methodology can extract requirements from different stakeholder groups with different needs and applications, including but not limited to:
- data providers (such as institutional repository managers and private publishers),
- e-infra & aggregator operators (such as AGRIS, OpenAIRE and META-SHARE),
- text mining researchers and
- researcher application developers.
The different needs of all ‘Personas’ (stakeholders/users with common characteristics) participating in OpenMinTeD will have to be carefully collected and organized (in a Persona Graph), as well as analyzed, validated and then visualized as envisaged user interfaces and meaningful workflows, before being transformed into technical/functional requirements that the technical partners of the project can use to work on the corresponding solutions.
(Source: Figure 1: A persona graph for the agriculture/wheat use case)
All these processes involve defining different methods for extracting requirements from different user types, organizing events and interviews, identifying the most appropriate methods for validating the requirements, and passing meaningful specifications to the technical partners.
Agroknow is also responsible for the agri-food community involvement, in particular through the AGRIS use case. The provided requirements will enrich the existing AGRIS portal with text mining functionalities that will facilitate access to research outcomes currently available through AGRIS.
The role of INRA in the project
Ranked the number one agricultural institute in Europe and number two in the world, INRA carries out mission-oriented research for high-quality and healthy foods, competitive and sustainable agriculture and a preserved and valorized environment.
The first mission of INRA is to produce knowledge and make it accessible to the international community of researchers and practitioners in agriculture, as well as to policy makers and society. Another INRA mission is to develop innovations and know-how of service to society. These practical applications contribute to developing agricultural, industrial or service companies.
To serve both these missions, several INRA entities conduct research and develop services in text and data mining on agriculture- and biology-related material. Two of INRA’s entities are involved in the OpenMinTeD project: the Scientific and Technical Information Department (DIST), and the Bibliome team from the research laboratory Applied Mathematics and Computer Science from Genomes to the Environment (MaIAGE).
The DIST works transversely to support INRA's scientific strategy and provide innovative services within the area of information access, management and diffusion. The Bibliome team develops new methods in NLP and Machine Learning for the acquisition and the annotation of semantic knowledge expressed in natural language in scientific and technical domains such as Biology. Its technologies are used in several online applications. It has organized many shared tasks under the umbrella of LLL’05 and the BioNLP Shared Task 2011, 2013 and 2016.
The DIST and the MaIAGE-Bibliome team lead the agriculture use cases from requirements to implementation in WP4 and WP9, participate in all working groups for the interoperability framework in WP5 and are actively involved in community engagement and training activities in WP2 and WP3.
OpenMinTeD project and Scholarly Communication
Text mining can be widely applied in a number of research contexts. In particular, computers can parse and analyze huge amounts of text and retrieve the parts of interest to researchers, saving them enormous amounts of time and effort as the quantity of published research data grows exponentially. Text mining can be used for species disambiguation, extraction of domain-specific concepts from texts, text normalization, annotation with tags etc.
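Two of the tasks mentioned above, text normalization and annotation with tags, can be sketched in a few lines of Python. This is only an illustrative sketch and not OpenMinTeD code: the lexicon, its tag names and the `normalize` and `annotate` helpers are all assumptions made up for the example.

```python
import re
import unicodedata

# Hypothetical mini-lexicon (an assumption for this sketch): surface form -> tag.
LEXICON = {
    "triticum aestivum": "SPECIES",
    "wheat": "CROP",
    "salmonella": "PATHOGEN",
}


def normalize(text):
    """Lowercase the text, strip accents and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", no_accents).strip().lower()


def annotate(text):
    """Return sorted (start, end, matched_text, tag) spans in the normalized text."""
    norm = normalize(text)
    spans = []
    for term, tag in LEXICON.items():
        # Whole-word match of each lexicon entry.
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", norm):
            spans.append((match.start(), match.end(), match.group(), tag))
    return sorted(spans)


if __name__ == "__main__":
    for span in annotate("Wheat  (Triticum   aestivum) and Salmonella"):
        print(span)
```

Real annotation services rely on far richer resources (ontologies, trained models, disambiguation), but the basic shape is the same: normalize the text, then attach tags to the spans that match known concepts.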
Text mining applications in various disciplines aim at making existing information more meaningful and accessible to everyone, as well as more discoverable through links that would be impossible to notice with manual searches. In this view, the OpenMinTeD Horizon 2020 project aims to “enable the creation of an infrastructure that fosters and facilitates the use of text and data mining technologies in the scientific publications world and beyond, by both application domain users and text-mining experts.”
The OpenMinTeD platform will consist of the following services: REGISTRY SERVICE, WORKFLOW SERVICE and ANNOTATION SERVICE.
In this frame OpenMinTeD will provide:
Innovative services for topic modeling
OpenMinTeD will publish innovative services for topic modeling that could potentially be used by a wider range of research communities. These services will be enriched with language detection and NLP mechanisms.
Innovative indexing approach
Finding publications relevant to a search (also across individual publications) requires that these publications are well indexed and classified, so that relevance to a search query can be ascertained at least at a coarse level of granularity. Moreover, text mining technology typically focuses on finding sets of individual publications, leaving it up to the user to integrate and synthesize the knowledge (the content of the documents), which is largely lost in conventional indexing approaches. OpenMinTeD will enable semantic search and discovery of open scientific outcomes in (subject and institutional) repositories.
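To make the contrast concrete, the following minimal sketch shows a conventional keyword index (an inverted index) in Python. It illustrates the kind of indexing the paragraph calls conventional and is not the OpenMinTeD approach; the sample documents and the `tokenize`, `build_index` and `search` helpers are assumptions made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    # Hypothetical sample documents (an assumption for this sketch).
    docs = {
        "d1": "Wheat rust disease outbreak in Europe",
        "d2": "Soil quality and wheat yield",
        "d3": "Foodborne disease alerts and product recalls",
    }
    index = build_index(docs)
    print(sorted(search(index, "wheat disease")))
```

Such an index only records which documents contain which words. It returns a set of individual publications and knows nothing about the concepts or relations they describe, which is the gap that semantic search is meant to close.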
Semantic metadata extraction mechanisms and search
OpenMinTeD will allow content providers (publishers and repositories) and service providers (OpenAIRE, Europe PMC and others) to use text-mining services to incorporate semantic metadata extraction mechanisms and provide semantic search mechanisms.
Research Analytics
Research analysis is an increasingly important area for funders and institutions alike. Text mining of scientific topics can track shifts in research topics over time, discover and compare hidden research trends, and model time-evolving networks, groups and communities. OpenMinTeD TDM mechanisms could be applied to research outcomes from different European funders and institutions. In this way, research analytics could conceivably produce comparative analyses that cover the whole European Research Area.
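As a very rough illustration of tracking a topic over time, the sketch below computes, for each year, the share of publication records that mention a keyword. It is a simplified stand-in for real topic modeling, not an OpenMinTeD mechanism; the sample records and the `yearly_share` helper are assumptions made up for the example.

```python
from collections import Counter


def yearly_share(records, keyword):
    """Return {year: fraction of that year's records mentioning the keyword}."""
    totals = Counter()
    hits = Counter()
    needle = keyword.lower()
    for year, text in records:
        totals[year] += 1
        if needle in text.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


if __name__ == "__main__":
    # Hypothetical (year, title) records (an assumption for this sketch).
    records = [
        (2014, "Soil erosion study"),
        (2014, "Wheat genomics"),
        (2015, "Wheat yield forecasting"),
        (2015, "Wheat disease monitoring"),
    ]
    print(yearly_share(records, "wheat"))
```

A rising or falling share across years hints at a shift in research attention. Comparing such curves across funders or institutions is one simple form of the comparative analysis described above.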
For any questions related to the OpenMinTeD project, contact the OpenMinTeD team.
Source: OpenMinTeD Project
You might also be interested in:
OpenMinTeD’s sister project FutureTDM
National Center for Text Mining NaCTeM as an excellent source of text mining tools
Text and Data Mining: challenges and solutions from the publishers’ perspective
Towards efficient sharing and discovery of foodborne diseases information
OpenMinteD: making sense of large volumes of data
Save the date!
23 May 2016
The 10th Language Resources and Evaluation Conference (LREC 2016)
Grand Hotel Bernardin Conference Center
Portorož (Slovenia)