Cross-Lingual Analysis of Textual and Extra-Linguistic Resources

FP7 ICT-2011.4.2b STREP Research Proposal (18 Jan 2011)
Project Duration: 3 Years; EU Funding: about 3 to 3.5 million Euros


University of Sheffield, Department of Computer Science (Administrative Coordinator)
Named Investigators: Hamish Cunningham, Diana Maynard, Danica Damljanovic
Core Competencies: Natural Language Processing, Machine Learning, Named Entity Recognition

MODUL University Vienna, Department of New Media Technology (Scientific Coordinator)
Named Investigators: Arno Scharl, Marta Sabou, Karl Wöber
Core Competencies: Semantic Web, Sentiment Detection, Crowdsourcing, Information Visualisation

Vienna University of Economics and Business, Research Institute for Computational Methods
Named Investigators: Kurt Hornik, Albert Weichselbraun, Gerhard Wohlgenannt
Core Competencies: Text Mining, Ontology Learning, Pattern Detection, Distributed Computing, Statistics

ECTRL Solutions SRL (to be confirmed)
Named Investigators: Adriano Venturini, Marisa Nones, Elena Not
Core Competencies: Recommender Systems, CMS, User Profiling, Tourism Portals, Mobile Applications

ECM – European Cities Marketing
Named Investigators: John Heeley, Flavie Baudot
Core Competencies: Tourism Marketing, Destination Management, Knowledge Transfer

Food and Agriculture Organization of the United Nations
Named Investigators: Johannes Keizer
Core Competencies: Fisheries and Aquaculture Information, Resource Consumption and Conservation

1. Rationale and Problem Area

As consumers are being exposed to increasing amounts of multilingual, multimodal and multisource information, delivered to them via a range of devices, they are finding it increasingly difficult to identify what is personally most relevant. Similarly, for companies the ability to capture, annotate and retrieve this heterogeneous information has become of vital importance in many sectors of the economy ranging from health care to tourism, manufacturing, publishing, and the media.

This proposal is about developing the necessary innovative intelligent technologies which will provide individual users and companies with their own personalised and contextualised lens through which to explore web content that is relevant, up-to-the-minute, and important. In order to achieve this, the project will bring about new developments across several research areas: personalised, multilingual information extraction; harnessing crowd-sourcing for creation and evaluation of language resources; interactive visualisation of large document spaces; and personalised information delivery on mobile and web platforms.

From a scientific perspective, this project is motivated by the need for personalised and multilingual information access, and for adaptive text mining technology. Information Extraction (IE) is a key enabling language technology, but IE research so far has mostly taken a one-size-fits-all approach. Tailoring IE applications across domains or across languages is generally time-consuming and requires significant linguistic skills. Our goal is to take a new approach in which automatically derived models of users' interests, goals, and current work context are used to guide and calibrate what information needs to be extracted from diverse information sources, e.g., web pages, blogs, tweets. This is what we call “personalised information extraction”.

In order to further lower adaptation costs for information extraction methods, this project will harness the wisdom of the crowds for resource collection (e.g., gather sentiment dictionaries in multiple languages), for continuously refining the underlying user models, and for evaluating the accuracy and coverage of the extracted information (in the tradition of games-with-a-purpose, to be delivered via social networking platforms such as Facebook).

The project will advance the state of the art in multilingual information extraction by going beyond simple named entities to a domain-specific and context-aware handling of complex events and sentiment terms. The incorporation of this information with temporal and geo-location information, combined with advanced information visualisation techniques, can be used to present targeted information to a user based on individual requirements, including mobile platform delivery. A particularly innovative element will be the use of relevant linguistic and semantic knowledge from Linked Open Data resources such as OpenCyc, GeoNames, WordNet, Freebase, and DBpedia. We also expect major advances in the area of context-aware and cross-lingual sentiment detection.

Propagating the identified features (annotations) between documents via latent constructs – e.g., associated concepts from multilingual ontologies or embedded images – will provide a contextualised information space spanning multiple feature dimensions. The contextual features themselves will serve as training data for identifying relations between multilingual information objects. CATER will analyse the relations between the extracted concepts by merging the various feature dimensions into an integrated model, and use a combination of statistical and semantic methods for combining redundant or partially incorrect information in a consistent manner. The processed information will be made available as Linked Open Data, thus facilitating its reuse outside CATER, by third parties.
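As a rough illustration of the statistical side of such data fusion, the sketch below resolves conflicting values for the same attribute by source-weighted voting. It is a minimal sketch under stated assumptions, not the project's method: the function name `fuse_values`, the tuple layout, and the per-source reliability weights are all hypothetical, and the semantic side (e.g., consistency checks against an ontology) is not covered.

```python
from collections import defaultdict


def fuse_values(observations, source_weight, default_weight=0.5):
    """Pick one value per (entity, attribute) by weighted voting.

    observations: iterable of (entity, attribute, value, source) tuples.
    source_weight: hypothetical reliability score per source, in [0, 1].
    Sources without a score fall back to default_weight.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for entity, attribute, value, source in observations:
        votes[(entity, attribute)][value] += source_weight.get(source, default_weight)

    fused = {}
    for key, tally in votes.items():
        total = sum(tally.values())
        # The value with the largest accumulated weight wins;
        # on a tie, the value seen first is kept.
        value, weight = max(tally.items(), key=lambda item: item[1])
        share = weight / total if total else 0.0
        fused[key] = (value, share)  # winning value and its share of the vote
    return fused
```

Calling `fuse_values` with two well-weighted sources asserting one value and a low-weighted source asserting another returns the majority value together with its vote share, which could serve as a simple confidence indicator when the fused statement is published.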

2. Use Cases and Exploitation

CATER will develop generic, cross-lingual and adaptive language technologies suitable for a wide range of applications. The use cases will demonstrate and evaluate CATER’s capabilities to capture and structure large multilingual knowledge repositories in a scalable manner, and to uncover and manage flows of relevant information. More specifically, CATER will investigate and optimize the information flows between stakeholders in two distinct use cases with different research challenges, functionality and target groups:

(i) Domain-Specific Personalised Tourism Search. The first use case targets individual tourists and, as such, provides intuitive access mechanisms that hide the complexity of the necessary pre-processing and the underlying information space. It requires cross-lingual, context-sensitive and adaptive methods to track and present the most relevant tourism information acquired from a variety of sources in a user-friendly manner, according to individual needs and interests (e.g., disabled visitors, families, newlyweds). We will derive the required user models automatically by extracting information from the user’s emails, social media activities, and other content on their computer. The system will respect user privacy and require appropriate authorisation. From a macro-level perspective, the identified patterns in the multilingual information space in conjunction with aggregated records of user behaviour from the system’s log files will provide important insights for destination management organisations and business analysts, particularly when these indicators are embedded into real-time decision making tools.

(ii) Science and Innovation Tracker for Sustainable Agriculture. The second use case will address the problem of alerting scientists and policy makers to new developments in key scientific areas, especially breaking news of early stage research reported in press releases, tweets, and scientific blogs. The extracted information needs to be tailored to the specific user in terms of which sources are included, which geographical areas, topics, and languages are covered, and which presentation format and level of detail is chosen. In contrast to the first use case, here we target domain experts such as scientists and policy makers who must keep track of a complex, multilingual and highly dynamic body of knowledge, which often needs to be accessed via mobile devices during field trips or from disaster zones.

Building upon the methods and technology of CATER, both use cases will be pursued in close collaboration with the project’s industry and associate partners, and will incorporate advanced visualisations synchronised via multiple coordinated view technology.

2. Objectives and Impacts

2.1. Key Scientific and Technological Innovations

·        Personalised Information Extraction

o   Personalised and targeted information extraction guided by automatically derived and continuously updated user models.

o   Cross-lingual and context-aware sentiment detection to identify relevant information sources and measure the perceptions of different stakeholders.

o   Crowdsourcing in the tradition of games-with-a-purpose to derive multilingual language resources (e.g., sentiment dictionaries) and for evaluating the accuracy and coverage of the extracted information.

·        Multidimensional Annotation, Data Fusion and Relation Detection

o   Using latent constructs (e.g. embedded image data, annotated ontology references) to propagate linguistic annotations and other forms of metadata across languages.

o   Develop hybrid techniques that use both semantic and statistical approaches to combine the extracted data while addressing data redundancy and incompleteness.

·        Context-Aware (Visual) Access Mechanisms

o   Visualise relations and patterns in multilingual digital content in real time, following a multiple coordinated views approach.

o   Provide personalised access to relevant content via Web clients and mobile devices.

2.2. Scientific, Technical and Socio-Economic Impact [Marta Sab2]

·        Scientific. The project will develop a new integrated approach to information access and mining, combining methods for personalised IE, crowd-sourcing and extra-linguistic knowledge from Linked Open Data repositories. It will deliver novel algorithms for advanced personalised cross-lingual information extraction, focusing on identifying events, time and sentiment. A key focus will be Portuguese, which is spoken not only in the EU, but also in Brazil and by other key EU trading partners.

·        Technical. The project will deliver technical infrastructure for large-scale, cross-lingual and personalised information extraction and presentation, and will support knowledge transfer from academic partners to the participating SMEs and content providers. It will provide advanced and scalable information services supporting four European languages (English, French, German, Portuguese).

·        Socio-economic. CATER will encourage communication and collaboration between stakeholders often divided by differing goals and agendas. The project will harness and amplify the collective resources of these stakeholders. It will improve European competitiveness in the areas of multilingual content extraction and presentation via web-based and mobile platforms. The SMEs involved in the project (ECTRL, ECM) will be in a position to deliver innovative personalised multilingual services to consumers and companies in diverse domains.

2.3. Public Results [admin3]

·         Open source tools and application development frameworks for

o   English, French, German and Portuguese information extraction;

o   Crowd-sourcing multilingual language resources.

·         Continuously updated and refined Linked Open Data repository

·         Suite of media monitoring and Web intelligence tools (for partners and clients)

·         Two domain-specific personalised search engines for the use case domains of tourism and sustainable agriculture

3. Tentative Work Package Structure

·        Project Management (USFD)

·        Data Collection, Crowdsourcing and Evaluation (MOD)

o   Unstructured Evidence Sources – Crawled Digital Content

o   Structured Evidence Sources – Linked Open Data

o   Social Evidence Sources – Crowdsourcing

§ Social Application Framework (Facebook, OpenSocial, Android, iOS, etc.)
§ Game with a Purpose (Mechanics, Tasks, Design, Incentive Structure)
§ Crowdsourcing Tasks

·        Data Collection

·        Collection of multilingual language resources

·        Quantitative Evaluation

·        Personalised, Multilingual Information Extraction (USFD)

o   IE for building personal, contextual user requirement profiles: extracting information from personal resources such as email, tweets and Facebook posts to build a private, contextual model of each user’s information needs, which is then used to guide event recognition from public sources

o   Robust and adaptable Event Recognition, including temporal and geo-location information

o   Context-Aware, Multilingual Sentiment Detection (MOD) applying the crowd-sourced multilingual dictionaries

·        Linked Open Data Repository (VEB)

o   Contextual Feature Propagation

§ Cross-Lingual (English, German, French, Portuguese) – USFD
Rationale for Language Selection: key EU and international trading languages; Use Cases e.g. North Atlantic Stakeholders, Impact for European Tourism; Available Resources (Corpora, Ontologies, Critical Mass of Users); Balance of Germanic and Romance Languages; Core Expertise of the Consortium

§ Cross-Modal (Image, Text) -- MOD
§ Cross-Source (News Media, Blogs, Deep Web Resources, etc.) -- VEB

o   Data Fusion and Cross-Validation

o   LOD Services

·        Adaptive User Interaction and Visualization (ECTRL)

o   Recommender Systems

o   User Profiling, Personalisation of the user interaction

o   Mobile Applications

·        System Integration (MOD) -- mention webLyzard spinoff

o   Service Architecture (Acquisition, Extraction, Propagation, Reasoning)

o   User Interface

§ Web Platform
§ Mobile Applications

o   Distributed Computing and STS Heuristics to Handle the Load

o   Linked Open Data Repository

·        Use Case: Domain-Specific Personalised Tourism Search (ECM)

o   Target Group: Tourists, Individuals

o   Associate Partners: UNWTO, Cities, Ricci/Trento

·        Use Case: Science and Innovation Tracker for Sustainable Agriculture (FAO)

o   Target Group: Researchers, policy makers, government organisations, NGOs working in disaster areas

o   Associate Partner: NOAA

·        Dissemination and Exploitation (VEB)

4. Advisory Board [admin4] [kalina5]

National Oceanic and Atmospheric Administration (NOAA)
Contact: David Herring
Core Competencies: Marine Resource Management, Policy Development, Stakeholder Communication

A1 Telekom Austria
Contact: Hannes Ametsreiter, CEO
Core Competencies: Mobile Applications and Network Services

Contact: Richard Benjamins, Director of User Modeling
Core Competencies: Data Mining, User Profiling, Mobile Applications and Services

Free University of Bozen-Bolzano
Contact: Francesco Ricci

United Nations World Tourism Organization (UNWTO)
Contact: […]
Core Competencies: Tourism Policy

Brazilian Agricultural Research Corporation (EMBRAPA)
Contact: Ivo Pierozzi Jr., Laboratory of Organization and Management of Electronic Information
Core Competencies: Agriculture Research

+ City Representatives of our Target Languages

For later inclusion into the actual proposal


CATER will advance technology in the area of cross-lingual information search and retrieval. The project will rely on cross-lingual text processing to identify core concepts, facts, events, relations as well as sentiment in digital content of two use case domains: tourism (e.g., documents published by destination management organisations, reviews and recommendations, user-provided tags and ratings) and sustainable agriculture (e.g. press releases, research papers, scientific blogs, tweets). Connections and similarities between the extracted information objects will be identified and coupled with third-party knowledge contained in external semantic sources and linked open datasets. Access to the information archive will be facilitated through context-aware personalisation services as well as innovative and device-specific visualisation methods. [MS6]


Tourism is of core importance to Europe, but is hampered by poor information flow between tourism service providers and consumers. On the one hand, tourists rely on generic search engines (typically controlled by organisations with their headquarters outside Europe) and therefore struggle to get the latest tourism offerings in a personalised manner. On the other hand, tourism marketers have difficulties in both disseminating tourism information effectively and in collecting data about tourists and their activities for the purpose of further improving their marketing and planning.

A major challenge lies in the inherently multilingual nature of the tourism domain. Tourism offers and user-generated content in the form of reviews and recommendations are typically available in a variety of languages, connecting tourists and destination managers from various ethnic backgrounds who use their native language to access and manage online information. It is estimated that about 75% of Internet users worldwide browse non-English Web documents, and that more than 98% of them use their native language when searching online.

At the same time [Marta Sab7], the need for tourist information often depends on the particular context of an interaction such as the stage of the trip (planning phase vs. the actual trip), the available or preferred access device (desktop vs. mobile), and the location (home vs. point-of-interest). The use of generic search engines does not allow customising the search results to the current needs of tourists. Finally, tourists are not only interested in information about a given point-of-interest, but more often about semantically related events that match their preferences, take place during a planned trip and are located at a reasonable geographic distance. Generic search engines are unable to retrieve such cross-links between events and places unless these are explicitly stated in a Web document, and even then the co-referenced events might not fit a tourist’s particular context and preference structure.
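The three criteria named above (matching preferences, falling within the planned trip, lying at a reasonable distance) can be made concrete with a small filter. This is only a sketch under stated assumptions: the `Event` record, the topic-overlap test standing in for "semantically related", and the 25 km default radius are all hypothetical choices for illustration, not part of the proposal.

```python
from dataclasses import dataclass
from datetime import date
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Event:
    name: str
    topics: frozenset  # hypothetical topic labels, e.g. {"music"}
    day: date
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def relevant_events(events, interests, trip_start, trip_end, poi, max_km=25.0):
    """Keep events that share a topic with the user's interests, fall
    inside the trip window, and lie within max_km of the point of interest."""
    poi_lat, poi_lon = poi
    return [
        e for e in events
        if e.topics & interests
        and trip_start <= e.day <= trip_end
        and haversine_km(poi_lat, poi_lon, e.lat, e.lon) <= max_km
    ]
```

In a real system the topic overlap would be replaced by a semantic relatedness measure over the extracted concepts, and the radius would depend on context such as the access device and the available means of transport.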

--------- Impact Snippet ---------

The most immediate target is tourism: 1) providing citizens with personalised access to tourism information in multiple European languages, via mobile and other web-enabled devices; 2) allowing destination managers to monitor opinions expressed in various languages reflecting the image of their location [Diana May8]; 3) opening up tourism data as Linked Data, thus facilitating its reuse outside CATER, by third parties. [Diana May9]


Example 1: An institute in Africa is working on Integrated Pest Management for sugar cane and wants to screen new developments in Brazil. In collaboration with EMBRAPA, the Tracker constantly monitors press releases from Brazilian research institutions, conference proceedings, tweets from researchers, and science blogs, and maps this content to the knowledge map of the Tracker, where the scientist in Africa can find the new developments. This will add further links from Brazil to collaborating institutions in Africa. It will exploit the US VIVO systems to track expertise in US institutions.

More ideas for specific examples: sugar cane, viticulture, cash crops, cotton (India)

Example 2: North American Fisheries Management, Stakeholders e.g. Portugal, UK, France, US; collaboration with NOAA

--------- Impact Snippet ---------


a)   Automating the AGRIS infrastructure with text mining would make FAO’s workflow and the work of the associated AGRIS centres much more efficient. At the moment the system is based on human monitoring and human-produced metadata. This forces the network to prioritise quantity in order to achieve measurable coverage, which reduces the scope for creating high-quality metadata on scientific and technological material. With an automated crawling and annotation process, human input can be limited to a smaller amount of high-quality input that feeds the automated process.

b)   The impact on the global community will be considerable:

a.    At the moment there is no predominant single access point to scientific and technological knowledge in agriculture and related areas, as exists for physics and the life sciences. Scientists, advisers and decision makers have to jump from one database and website to the next to assemble a cumbersome agglomeration of resources. With the rise of blogs and other “Web2” means of communication, this situation has become even more complicated. Any attempt to resolve this problem by centralization through human intervention was doomed to fail, and such attempts have failed. The project will address the problem of accessing and using distributed resources. The combination of targeted crawling and semantic annotation based on a human-made knowledge structure is the sustainable solution to the problem.

b.   The creation of a huge and dynamic linked open data repository will also encourage other players (research institutes, publishers, ministries and so on) to publish their data in linked data format so that it is easily reachable through the AGRIS platform. The snowball effect of targeted crawling and semantic annotation on a growing repository will thus induce the formation of other easily accessible data pools around the world.

c. With Agriadne and AgrisenZ being open source products available to everyone, infrastructures similar to the AGRIS infrastructure can be built for specific topics or by specific players.

   Effectively, xxxx will deliver a key contribution to realizing a semantic web in the area of agriculture at large.


 [kalina1]MYNE (both personalisation and mining);

Personalised Mining and Presentation of Multilingual Content


Trying to emphasize the new bits. Only crowd-sourcing is missing

 [Marta Sab2]Rewrite in line with the impacts stated in the call:


• Improved European competitive position in a multilingual digital market through the provision of better services to citizens and businesses.

• Novel forms of partnership between new programme entrants and established players, reduced development costs and shorter time-to-market, thus stimulating innovation and expanding markets.

• Result-driven knowledge transfer between research centres (and their spin-offs) and progressive technology providers (especially SMEs), data brokers/aggregators and content providers.


 [admin3][kalina] Another area to think about and list in the proposal are standards. GATE is ISO compliant (the Lang Tech standards, got the numbers for later), as well as a member of the OASIS UIMA group (again, standard text we’ll put in the proposal, when we write it). Are there other ones to consider? Perhaps somebody like J. Hendler or another big Linked Data name from across the pond and also in Europe? Any tourism-related *ML language?

 [admin4][kalina] Perhaps we can also bring in AB members from MONNET and META.NET and 1-2 other key ongoing projects relevant here. We don’t necessarily need to approach them now (as many are busy writing proposals for the same call and we don’t want to share ideas), just list that we’re planning to invite them and thus coordinate the efforts.


 [kalina5]I would also add some people from the MONNET, META.NET, and 1-2 other key related projects, as I think this will strengthen our credibility.

 [MS6]This description of the project differs from what has been said so far, by putting the focus on the semantic processing rather than on the personalised IE, as has been done before. Might this confuse the reviewers?

 [Marta Sab7]This part does not make sense if recommendation will not become a core pillar of the project.

 [Diana May8]it just sounded a bit weird the other way round

 [Diana May9]Do we also need to say something about the other use case here?

