Implementing the FAO Open Archive based on Fedora Commons and FRBR

Summary

The Food and Agriculture Organization of the United Nations (FAO) has more than 50 years of experience in the collection, production and diffusion of information on agriculture and related sciences. To facilitate access to FAO publications, the Organization has implemented two document repositories: the FAO Online Catalogue (FAODOC) and the FAO Corporate Document Repository (CDR).

FAODOC provides high-quality metadata for both electronic and printed documents produced by FAO since 1945. The CDR contains full-text publications and uses a workflow system based on the Electronic Information Management System (EIMS) to collect (minimal) metadata through the course of the document/publication production process.

Both systems cover the same group of documents: FAO publications. The overlap creates inconsistencies that harm the proper dissemination of FAO publications, and it means that efforts are duplicated in cataloguing and maintaining different databases. Therefore it was decided to merge the content of the CDR-EIMS and FAODOC into one repository: the FAO Open Archive (FAO OA), a digital, open repository to collect, manage, maintain and disseminate all material published by FAO.

This page describes the process of merging the two systems, each with a different structure and workflow procedure, into the FAO OA. The first step towards creating the FAO OA was to analyze the CDR-EIMS and FAODOC in order to find similarities and differences. Then a workflow integrating electronic publishing and cataloguing was established, and an overall architecture was designed that covers all the features previously managed through CDR-EIMS and FAODOC.

Another substantial part of the process consisted of identifying open source software that would meet the requirements at both the FAO OA and the FAO organizational level. An evaluation of open source software packages was carried out and Fedora Commons emerged as the best candidate. Subsequently the FAO Content Model was defined, based on the Fedora Digital Object and the Functional Requirements for Bibliographic Records (FRBR).

The FAO OA project provided the possibility to clearly identify FAO organizational requirements for the storage, dissemination and preservation of documents and bibliographic metadata. At the same time it provided the occasion to evaluate state-of-the-art tools for the management of digital repositories and to identify the most appropriate vocabulary and metadata standards.

1. Introduction

The Food and Agriculture Organization of the United Nations (FAO) has more than 50 years of experience in the production and dissemination of information, both through its headquarters-based regular programme and through its field projects. The collection, analysis, interpretation and dissemination of information relating to nutrition, food and agriculture is one of FAO’s functions. The World Wide Web has proven to be a powerful means for FAO to disseminate multilingual information. FAO currently maintains a number of different document and document metadata repositories:

The FAO Online Catalogue (FAODOC) is the online catalogue for documents and publications produced by FAO since 1945. FAODOC catalogues and indexes both electronic and printed documents. The FAODOC records have been created by professional cataloguers and contain high quality descriptive metadata.
The FAO Corporate Document Repository (CDR) contains full-text publications produced by FAO technical departments. The CDR disseminates full text documents and a minimal set of metadata. The CDR uses a workflow system based on the Electronic Information Management System (EIMS) to collect metadata through the course of the document/publication production process. The objective of EIMS is to have authors or producers of documents provide necessary administrative and descriptive metadata.

There is a lack of integration within the different bibliographical metadata repositories. Overlapping content creates inconsistencies that may affect the proper dissemination of the FAO publications. In addition, FAO duplicates efforts in cataloguing and maintaining different databases.

The FAO OA project aims to merge the content of the EIMS-CDR and the FAODOC to create one coherent digital repository offering a solid foundation for the collection, management, maintenance and timely dissemination of material published by FAO.

FAODOC is a multilingual, on-line catalogue of documents and publications produced by FAO since 1945. The system uses UNESCO's CDS/ISIS software. The Web search interface was developed by the Institute for Computer and Information Engineering, University of Warsaw, and the AGRIS/CARIS and Documentation Group of FAO. Since its inception FAODOC has invested resources in the production of high quality bibliographic records.

The FAO Web site was released in 1995. The first electronic publishing workflow began in 1998 with the EIMS. As of June 2010, more than 58 000 resources (full text documents and multimedia) are managed by the EIMS (see Table 1). Photos, videos and audio files are made available in a variety of ways on the FAO Web site. The CDR was created as the online digital library for the dissemination of FAO documents and publications, as well as selected non-FAO material, in electronic format. At present more than 40 100 full text documents are available through the CDR.

Resource type	Number of records
full text documents	40 100
photos	9 500
videos	7 600
audio files	1 200

Table 1. Resources at FAO

The EIMS-CDR was developed to manage the FAO electronic publishing workflow. Both the CDR and the EIMS run on the Microsoft Windows platform on top of an Oracle 9 database. The software is written in ASP, with some ad hoc modules and functionalities developed in ASP.NET. The EIMS architecture results from the interaction of several modules managing different aspects of the overall workflow. All modules connect to a single database, which stores the records’ descriptive metadata as well as detailed workflow information.

The purposes of EIMS-CDR and FAODOC are different. The main focus of FAODOC is the management of bibliographic data of FAO documents; the focus of EIMS is managing the electronic publication of full text documents on the Web. CDR is focused on the dissemination of FAO documents archived through EIMS. In 2004, the two systems were linked, connecting FAODOC records to the full text documents archived in EIMS-CDR.

The process of merging the two existing databases, each with a different structure and workflow procedures, is a challenging task. The first step towards creating the FAO OA was to analyze the EIMS-CDR and FAODOC in order to find similarities and differences.

A comparative analysis of FAODOC and EIMS-CDR data was carried out in 2007 to assess how much the two databases overlapped. EIMS-CDR and FAODOC have not always applied the same document selection criteria, particularly regarding meeting documents, which are mostly stored in CDR while FAODOC only has the final reports. Moreover, EIMS-CDR and FAODOC have different cataloguing policies. In the CDR each record corresponds to one publication (for example a book). In FAODOC a publication and its chapters are often catalogued separately. Thus, in FAODOC a book can correspond to more than one record.

The analysis showed that the percentage of records duplicated in EIMS-CDR and FAODOC is high and has increased over the years: 72% of all records created in 2006 in EIMS-CDR are also present in FAODOC.
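The source does not say how records were matched between the two databases. As a rough illustration only, the sketch below assumes that records can be compared on a shared key (a hypothetical job number) and computes the share of EIMS-CDR records that also appear in FAODOC. The function names and sample keys are invented for the example.

```python
from typing import Iterable


def normalize(key: str) -> str:
    """Normalize a matching key (assumed here to be a job number)."""
    return key.strip().upper()


def overlap_percentage(eims_keys: Iterable[str], faodoc_keys: Iterable[str]) -> float:
    """Percentage of EIMS-CDR records that are also present in FAODOC."""
    eims = {normalize(k) for k in eims_keys}
    faodoc = {normalize(k) for k in faodoc_keys}
    if not eims:
        return 0.0
    return 100.0 * len(eims & faodoc) / len(eims)


# Invented sample keys, not real FAO job numbers.
eims_2006 = ["A0001E", "a0002e ", "A0003E", "A0004E"]
faodoc_2006 = ["A0001E", "A0002E", "A0004E", "X9999E"]

print(overlap_percentage(eims_2006, faodoc_2006))  # 75.0
```

Because FAODOC often catalogues a publication and its chapters separately, a key-based comparison like this one would need to collapse chapter records onto their parent publication before counting.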

2. FAO OA architecture

The overall architecture of the FAO OA will integrate all the features that were previously managed through EIMS-CDR and FAODOC. Electronic publishing and cataloguing are managed through the same system and share the same digital repository, covering the whole process from document creation through cataloguing, indexing and conversion to a suitable electronic format up to dissemination on the Web. Figure 1 illustrates all FAO OA components and the way they will interact. The new system is assembled from different components, briefly described in the following sections.

Figure 1. FAO OA architecture and workflow.

Data input for FAO units

Currently FAO units have different customizations of the EIMS input module. Each customization supports a basic internal workflow that can vary from a one-step to a multiple-step approval process. FAO units are responsible for the data entry and the minimal description of documents. In the FAO OA the units continue to provide data through EIMS, describing the document with a minimal set of metadata. With the new system, electronic publishing and cataloguing share a common data entry point.

Fedora Commons is going to be used only at the end of the workflow. The current EIMS code base is reliable because it has been stable for several years. EIMS software remains much as it is, with a separate webservice publishing only 'public' data to the repository, triggered by parts in the EIMS code.

Electronic publishing

The publication of FAO documents in electronic format will continue to be managed through the existing EIMS-CDR modules:

Core module for electronic publishing; this module is used to review the information from FAO units and to manage the conversion of full text documents into the required electronic format (HTML, PDF, etc.).
Scanning requests managing module; this module is directly connected with the core module for electronic publishing and it is used to track the work assigned to internal resources, or track the work orders sent for document scanning and/or conversion to external companies.

At present the synchronization with the Fedora Commons repository is carried out at the Core module level, at the final step of the electronic publishing workflow.

In the near future a new EIMS-CDR system will be released. The new system will be based on Fedora Commons and the central repository will always be updated in real time.

Cataloguing

Cataloguing is managed through a new module that is used by cataloguers to manage the information that must be released in the FAO OA. The main characteristics are:

Detailed cataloguing procedure
Use of authority control system webservices for the indexing of subjects and data entry for journal, series, corporate body, project and conference fields. This information is currently maintained on the Workbench Concept Server (WCS) (See Section 5)

The implementation of the cataloguing interface has been offshored to an external company. It will be installed and maintained in the FAO Java programming environment.

Output interfaces

The FAO OA search engine should be simple and intuitive for as many different types of users as possible, yet powerful. Specialized users, such as librarians and information management specialists, will be able to access intermediate and advanced search functions. The intermediate search will have additional metadata fields, whereas the advanced search will support all combinations of metadata fields and all Boolean operators, and will provide the option to integrate stemming and synonym search as well as full text search. Another type of search will be the search by topic, optimized to provide a significant list of results based on AGROVOC and Multidisciplinary Areas (when appropriate).

3. Workflow procedures

The workflow of the FAO OA will integrate two main activities: electronic publishing and cataloguing. Below is a short description of the most important workflow steps (See Figure 6):

FAO units initiate a record by inserting metadata. Only minimal information is required to initiate a record: author, title, year and/or FAO job number. The system verifies whether the record already exists in the database. A simple validation workflow within the input system ensures that the records inserted are eligible to be published in the open archive.
The electronic publishing administrators and the cataloguing administrators are notified of the addition of the new record. They can take action simultaneously on the full text and the metadata of the records.
If the document received is already in electronic format, it requires validation and possible conversion to the most suitable format. This task can be carried out in-house or by an external company. If the document is not in electronic format, it requires digitization (scanning by an external company may be required).
Using the initial minimal set of metadata and a link to the full text, the documents are catalogued and indexed by FAO cataloguers.
Validated records are disseminated through FAO Web sites. Search engines, services providers and digital libraries will harvest the record’s metadata, enhancing access to FAO documents.
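The first workflow step (minimal metadata, then a check for an existing record) can be sketched as follows. This is only an illustration under stated assumptions: the minimal set is read here as author and title plus at least one of year or job number, and duplicates are detected on the job number or, failing that, on title and year. The class and function names are hypothetical and do not reflect the EIMS code.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class RecordInput:
    author: Optional[str] = None
    title: Optional[str] = None
    year: Optional[int] = None
    job_number: Optional[str] = None


def is_valid(rec: RecordInput) -> bool:
    """Assumed rule: author and title, plus year and/or job number."""
    return bool(rec.author and rec.title and (rec.year or rec.job_number))


def dedup_key(rec: RecordInput) -> Tuple:
    """Assumed duplicate key: job number if present, else title and year."""
    if rec.job_number:
        return ("job", rec.job_number.strip().upper())
    return ("title-year", rec.title.strip().lower(), rec.year)


class Archive:
    def __init__(self) -> None:
        self._records: Dict[Tuple, RecordInput] = {}

    def submit(self, rec: RecordInput) -> str:
        if not is_valid(rec):
            return "rejected: incomplete metadata"
        key = dedup_key(rec)
        if key in self._records:
            return "rejected: duplicate"
        self._records[key] = rec
        # In the described workflow, both administrator groups are notified here.
        return "accepted: notify e-publishing and cataloguing administrators"


archive = Archive()
first = RecordInput(author="FAO", title="Rice irrigation", year=2009)
print(archive.submit(first))   # accepted: ...
print(archive.submit(first))   # rejected: duplicate
```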

4. Open Source Software

An evaluation of open source software packages was carried out, to identify the software that would meet the requirements at FAO OA and FAO organizational level. The products to be evaluated were selected with the merging of the FAODOC and EIMS-CDR in mind. The objective was to identify a tool that would better support the storage, dissemination and preservation of documents and bibliographic metadata.

Fedora Commons emerged through the evaluation as the best candidate with respect to the project and organizational requirements for document publishing and cataloguing.

Fedora Commons

Fedora Commons is a system that was designed to serve as a digital repository for a variety of uses, including institutional repositories. The system offers a series of services and tools supporting the long term preservation of digital objects, the content versioning and the management of distributed repositories. An important emphasis is on the ease of integration with other systems, through the use of an Application Programming Interface (API). The main features of Fedora Commons are:

A suite of webservices for creating, managing, publishing, sharing and preserving digital content.
Service architecture divided into four areas: repository services, preservation services, semantic services and enterprise services
Highly scalable and configurable
Support for very advanced content representation mechanisms through its digital object model (data objects).

Fedora Commons is a complex system that requires technical skill for its installation and management. Fedora Commons does not have a user interface but, thanks to its versatile web services, it can easily interoperate with other systems and user interfaces.

The modular, service-oriented architecture of Fedora Commons can support incremental merging of FAODOC and EIMS-CDR, as well as incremental deployment of advanced features to the FAO user communities. A staged implementation plan mitigates many of the risks involved in developing, deploying and maintaining the FAO OA, because it offers a high degree of flexibility throughout the project lifecycle. Fedora Commons allows the possibility to develop – and release – different modules at different times according to the strategic decisions of the FAO OA.

The Next-Generation Architecture for Format-Aware Characterization (JHOVE2)

JHOVE2 is an open source tool providing services for the identification, feature extraction, validation and assessment of digital objects. JHOVE2 has a well-defined API that makes it easy to integrate into the appropriate module of the FAO OA. The integration of JHOVE2 is foreseen in a second phase of the project. The system will be used for file format validation procedures and for the extraction of metadata that will be used in a long term preservation strategy, and it will allow policy-based assessment in the FAO OA.

5. FAO OA Content Model

Once the software was selected, the FAO Content Model was defined. The two main requirements were to avoid any loss of granularity in the existing legacy data from FAODOC and CDR metadata sets, and to provide a way for managing multilingual collections of digital objects by the use of relationships such as language translations, different versions or partitive relationships, among others. The Fedora Digital Object and the Functional Requirements for Bibliographic Records (FRBR) have played a crucial role in it.

Without going into deep technical detail on Fedora Commons, we provide a short description of what a Fedora digital object is, how the FAO OA has defined its digital object structure and what metadata standards are used. In addition, the FAO OA content model implements an adaptation of FRBR, which has been possible thanks to the flexibility of Fedora Commons in establishing complex and rich relationships between different digital objects.

Fedora Digital Object

The strength of Fedora Commons is the generic digital object model used to offer the essential information required for the management of digital content such as documents, images, e-learning objects, metadata and many others.

In a Fedora repository, all content is managed as data objects, each of which is composed of components ("datastreams") that contain either the content or metadata about it. Each datastream can be either managed directly by the repository or left in an external, web-accessible location to be delivered through the repository as needed. A data object can have any number of data and metadata components, mixing the managed and external datastreams in any pattern desired (http://www.fedora-commons.org/confluence/display/FCR30/Getting+Started+with+Fedora)

The basic components of a Fedora digital object are: the PID (a persistent, unique identifier for the digital object); object properties (a set of system-defined descriptive properties that are necessary to manage and track the object in the repository); and datastream(s) (element in a Fedora digital object that represents a content item). A Fedora Commons digital object can have one or more datastreams. The following four datastreams are reserved by Fedora:

DC. Unqualified Dublin Core, used to contain metadata about the object;
AUDIT. Audit trail of all changes made to the object, controlled by the system only;
RELS-EXT. Relationships to other digital objects;
RELS-INT. Internal relationships from digital object datastreams.

Figure 2. The basic components of a Fedora digital object

http://www.fedora-commons.org/documentation/3.0b1/userdocs/digitalobjects/objectModel.html

The Fedora digital object may contain more custom datastreams to represent user-defined content.
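The structure described above (PID, object properties, reserved and custom datastreams) can be pictured with a small in-memory model. This is a sketch for orientation only, not the Fedora Commons API: the class names, the example PID and the rule that blocks writes to AUDIT are assumptions made for the illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

# Datastream identifiers reserved by Fedora, as listed above.
RESERVED_DATASTREAMS = ("DC", "AUDIT", "RELS-EXT", "RELS-INT")


@dataclass
class Datastream:
    dsid: str
    mime_type: str
    content: bytes = b""
    external_url: str = ""  # set when the content stays at an external location


@dataclass
class DigitalObject:
    pid: str  # persistent, unique identifier
    properties: Dict[str, str] = field(default_factory=dict)
    datastreams: Dict[str, Datastream] = field(default_factory=dict)

    def add_datastream(self, ds: Datastream) -> None:
        # The audit trail is controlled by the system only.
        if ds.dsid == "AUDIT":
            raise ValueError("AUDIT is managed by the repository")
        self.datastreams[ds.dsid] = ds

    def custom_datastreams(self) -> list:
        """Identifiers of user-defined (non-reserved) datastreams."""
        return sorted(d for d in self.datastreams if d not in RESERVED_DATASTREAMS)


obj = DigitalObject(pid="demo:1", properties={"label": "Example object"})
obj.add_datastream(Datastream("DC", "text/xml", b"<dc/>"))
obj.add_datastream(Datastream("MODS", "text/xml", b"<mods/>"))
print(obj.custom_datastreams())  # ['MODS']
```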

The following six datastreams have been added to the FAO OA digital object:

Bibliographic description: this datastream accommodates any bibliographic data recording and identifying a digital object. Due to the granularity of the existing FAODOC legacy data, MODS (Metadata Object Description Schema) was chosen as the most suitable standard to represent the bibliographic metadata. MODS is an XML-based bibliographic description schema developed by the Library of Congress, designed as a compromise between the simplicity of Dublin Core metadata and the complexity of the MARC format used by libraries.
Agricultural metadata: this datastream is dedicated to any other public metadata not included in standardized descriptive procedures for digital objects. An example is compliance with AGRIS AP, the application profile used in the context of agricultural information management standards and the AGRIS Search Engine. AGRIS is a global public domain database with nearly 3 million structured bibliographical records on agricultural science and technology. The content is provided by more than 150 participating institutions from 65 countries. If a data provider wants to be harvested by AGRIS, the data must be exposed in AGRIS AP. Most of the fields required for this export are accommodated in the MODS datastream; however, two of them, ARN and AGRIS Categories, are not included due to their particular format. Therefore a new datastream was created to accommodate this type of non-standardized bibliographic information.
Internal cataloguing metadata: this datastream stores only internal data used for the cataloguing work, such as the id of the cataloguer or internal notes.
Electronic publishing metadata: this datastream contains the data related to the E-publishing activity. Most of the fields are used internally.
Multilingual controlled vocabulary metadata: this datastream is dedicated to metadata related to subject and geographic information storing URIs and labels in the six FAO official languages.
Preservation Metadata: this datastream accommodates Preservation Metadata: Implementation Strategies (PREMIS), an international standard maintained by the Library of Congress. It is applicable to a wide range of digital preservation activities. The PREMIS data model defines a number of properties for the preservation of digital objects, such as events, agents, rights and permissions, and the relationships between these entities. Most PREMIS metadata can be generated automatically when a digital resource is inserted into Fedora Commons. This standard was selected to define preservation metadata for the digital resources (HTML, PDF, TIFF, etc.) that will be stored in the FAO OA.
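To give an idea of what the bibliographic description datastream might hold, the sketch below builds a very small MODS record with the Python standard library. The element names and the namespace come from the MODS schema; the choice of fields, the function name and the sample values are assumptions for illustration and do not represent the FAO OA profile.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the MODS namespace."""
    return f"{{{MODS_NS}}}{tag}"


def build_mods(title: str, corporate_author: str, year: int, language_code: str) -> ET.Element:
    mods = ET.Element(q("mods"))

    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = title

    name = ET.SubElement(mods, q("name"), {"type": "corporate"})
    ET.SubElement(name, q("namePart")).text = corporate_author

    origin = ET.SubElement(mods, q("originInfo"))
    ET.SubElement(origin, q("dateIssued")).text = str(year)

    language = ET.SubElement(mods, q("language"))
    term = ET.SubElement(
        language, q("languageTerm"), {"type": "code", "authority": "iso639-2b"}
    )
    term.text = language_code
    return mods


# Sample values are invented for the example.
record = build_mods("Example publication", "FAO", 2009, "eng")
print(ET.tostring(record, encoding="unicode"))
```

A record like this would be stored as one datastream of the digital object, next to the reserved DC datastream.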

FAO OA Content Model based on FRBR

What is FRBR?

The Functional Requirements for Bibliographic Records (FRBR) is a conceptual entity-relationship model in which an entity is an object that can be identified unequivocally, attributes are used to describe an entity and relationships are inherent associations between entities. FRBR was developed by the International Federation of Library Associations and Institutions (IFLA).

The FRBR model is structured in three entity groups:

Group 1. Product entities are Work, Expression, Manifestation and Item. They represent the products of intellectual or artistic activities;
Group 2. Responsibility entities are persons and corporate bodies related to Group 1 entities through specific relationships;
Group 3. Subject entities such as concepts, objects, events and places, or any of the Group 1 or Group 2 entities.

Each of the entities is described through a set of characteristics or attributes. Examples of attributes for each entity are the following:

Work: title, form or date among others
Expression: title, form, date or language among others
Manifestation: title, statement of responsibility, edition, place of publication, publisher or date of publication among others
Item: identifier, access restrictions or condition among others

Relationships serve as links between one entity and another. There are different types of relationships:

Relationships between Work, Expression, Manifestation and Item, where a Work is expressed as an Expression, an Expression is manifested as a Manifestation and a Manifestation is available as an Item.
Other relationships between Group 1 entities like a work is Supplement Of another work, an expression is Part Of another expression, an expression is Translation Of another expression or an expression is Revision Of another expression.
Relationships to Persons and Corporate Bodies linked to the first group by four relationship types:
the created by relationship to work;
the realized by relationship to expression;
the produced by relationship to manifestation;
and the owned by relationship to item
The entities in all three groups are connected to the work entity by a subject relationship. The has as subject relationship indicates that any of the entities in the model, including work itself, may be the subject of a work.

Figure 3. FRBR model adapted to the requirements of the FAO OA data collection.

Why FRBR?

Given the special nature of the FAO collection, which often includes each of its publications in all six official languages, the FRBR entity Expression streamlines the creation of efficient relationships among all of the language variants. This helps end-users with enhanced searches, easier access to the documents, and better delivery of the proper content in the proper language. FRBR helps to reduce the time consumed during cataloguing (e.g. adding subjects only to Work) and create richer relationships among documents.
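The two benefits mentioned above, subjects recorded once at Work level and easy navigation between language variants, can be illustrated with a toy model. The identifiers, the dictionary layout and the function names are invented for the example and say nothing about how the FAO OA stores these links.

```python
# Parent links follow the Group 1 chain: item -> manifestation -> expression -> work.
PARENT = {
    "item:1": "manif:en-pdf",
    "manif:en-pdf": "expr:en",
    "manif:fr-pdf": "expr:fr",
    "expr:en": "work:1",
    "expr:fr": "work:1",
}

# Subjects are catalogued only once, on the Work.
SUBJECTS = {"work:1": ["rice", "irrigation"]}


def work_of(entity_id: str) -> str:
    """Walk up the chain until the Work is reached."""
    while entity_id in PARENT:
        entity_id = PARENT[entity_id]
    return entity_id


def subjects_of(entity_id: str) -> list:
    """Any expression, manifestation or item inherits the subjects of its Work."""
    return SUBJECTS.get(work_of(entity_id), [])


def language_variants(expression_id: str) -> list:
    """Other expressions of the same Work, for example translations."""
    work = PARENT[expression_id]
    return sorted(e for e, w in PARENT.items() if w == work and e != expression_id)


print(subjects_of("manif:fr-pdf"))   # ['rice', 'irrigation']
print(language_variants("expr:en"))  # ['expr:fr']
```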

What is the use of FRBR in the FAO OA?

The FRBR entity-relationship model has been translated to the Fedora Commons Digital Object in the following way:

A FRBR entity is represented by a Fedora digital object;
The Attributes are the PID, the object properties and the metadata in the specific datastreams; and
The Relationships are available in the RELS-EXT and RELS-INT datastreams

Figure 4. Set of relationships between entities

Figure 3 represents the FRBR model adapted to the requirements of the FAO OA data collection. All the Group 1 entities (work, expression, manifestation and item) are used, and each of them is represented by a different digital object. A specific set of relationships between entities was also defined: is Expressed As, is Manifested As, is Available As, is Successor Of, is Supplement Of, is Part Of, is Translation Of, is Revision Of and is Alternate Of (Figure 4). Group 2 is represented by corporate bodies and conferences. They are only related to the entity Work, through the created by relationship. From Group 3 only two entities were selected, concept and place, and linked to the entity Work. Additional entities were added: projects, series and journals. While projects are linked to the entity Work, series and journals are related to the Manifestation (Figure 5).
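In Fedora, relationships between digital objects are written as RDF/XML in the RELS-EXT datastream, with objects addressed as info:fedora/ followed by the PID. The sketch below shows how the relationships listed above could be serialized in that form. The relationship namespace, the camel-case predicate names and the PIDs are hypothetical; the source does not give the actual FAO OA ontology.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FAO_REL_NS = "http://example.org/faooa/relations#"  # hypothetical namespace

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("faorel", FAO_REL_NS)

# Predicate names derived from the relationship list in the text (spelling assumed).
ALLOWED = {
    "isExpressedAs", "isManifestedAs", "isAvailableAs", "isSuccessorOf",
    "isSupplementOf", "isPartOf", "isTranslationOf", "isRevisionOf", "isAlternateOf",
}


def rels_ext(subject_pid: str, relations: list) -> str:
    """Serialize (predicate, target PID) pairs as a RELS-EXT style RDF/XML document."""
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{subject_pid}"},
    )
    for predicate, target_pid in relations:
        if predicate not in ALLOWED:
            raise ValueError(f"unknown relationship: {predicate}")
        ET.SubElement(
            description,
            f"{{{FAO_REL_NS}}}{predicate}",
            {f"{{{RDF_NS}}}resource": f"info:fedora/{target_pid}"},
        )
    return ET.tostring(rdf, encoding="unicode")


# Example PIDs are invented.
print(rels_ext("faooa:work-1", [("isExpressedAs", "faooa:expr-en"),
                                ("isExpressedAs", "faooa:expr-fr")]))
```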

The FAO Open Archive Authority Control

The FAO Authority Description Concept Scheme focuses on the implementation of authority control for corporate bodies, conferences, projects, journals and series. Authority control can be defined as a technique or process that provides the assignment of a unique form and the use of cross-references from obsolete and related forms. Selecting a single form brings together in one place all the works related to a concept. Authority control is essential for effective system searching. These objectives are achieved through the use of rules for the creation of authorized forms based on international standards and the adoption of URIs. The FAO Authority Control System does not aim to be an all-inclusive and comprehensive system for describing the entities. It limits itself to describing the information available in the current legacy data, taking into consideration the requirements of the FAO OA.

The objectives of the FAO Open Archive Authority Control are to:

provide consistency, reliability, standardization and simplification of the FAO OA authority control management;
convert the flat authority lists into a concept-based system (RDF/XML – SKOS/AOS);
represent each concept by all the forms, preferred and non-preferred, in all languages, associated with it;
provide URIs to each concept;
create concept-to-concept relationships; and
add datatype properties such as geographical information, other identifiers, acronyms, etc.

The FAO Authority Description Concept Scheme for corporate bodies, conferences, projects, journals and series aims to provide more efficient management of the many forms of names (for example multilingual forms), the use of official FAOTerm forms and, above all, the use of hierarchical and historical relationships, such as links between outdated and new authorized forms. This is achieved through the use of rules for the creation of authorized forms based on international standards and the adoption of URIs. It does not aim to be an all-inclusive and comprehensive system for describing the entities. It limits itself to describing the information available in the current legacy data, taking note of the requirements of FAODOC and CDR.

Figure 5. FAO OA Entities

The content model of the authority control system for bibliographic data is based on a concept-based system. A concept is represented by all the forms, preferred and non-preferred, in all available languages. Therefore, the entire representation of a concept often includes many forms. A form is a word (simple term) or a multiword expression (complex term) that designates a particular concept. Many forms, in a number of languages, can represent a single concept.
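A concept with its preferred and non-preferred forms in several languages can be sketched as a small data structure. This is an illustration of the idea only: the class names, the URI and the fallback rule are assumptions and are not taken from the FAO Authority Control System.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Form:
    text: str
    lang: str
    preferred: bool = False


@dataclass
class Concept:
    uri: str
    forms: List[Form] = field(default_factory=list)

    def preferred_form(self, lang: str, fallback: str = "en") -> Optional[str]:
        """Preferred form in the requested language, else in the fallback language."""
        for code in (lang, fallback):
            for form in self.forms:
                if form.lang == code and form.preferred:
                    return form.text
        return None


def find_concept(concepts: List[Concept], text: str) -> Optional[Concept]:
    """Resolve any form, preferred or not, to its concept."""
    needle = text.strip().casefold()
    for concept in concepts:
        if any(form.text.casefold() == needle for form in concept.forms):
            return concept
    return None


fao = Concept(
    uri="http://example.org/authority/corporate-body/1",  # hypothetical URI
    forms=[
        Form("Food and Agriculture Organization of the United Nations", "en", True),
        Form("FAO", "en"),
        Form("Organisation des Nations Unies pour l'alimentation et l'agriculture", "fr", True),
    ],
)

match = find_concept([fao], "fao")
print(match.uri)
print(match.preferred_form("fr"))
print(match.preferred_form("es"))  # falls back to the English preferred form
```

Searching on a non-preferred form such as an acronym still leads to the single concept URI, which is what brings together all the works linked to that concept.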

The Authority Control System currently contains the following information:

Types: Corporate Bodies, Conference Series, Conferences, Projects, Journals and Series
Names: in more than 15 languages
International codes: ISSN, AGROVOC
Library Internal information: Call numbers, holding libraries
Geographic information: City, State, Country
And Relations: Bibliographic, Instrumental, Partitive, Temporal

Conclusions

The merging of CDR and FAODOC into a sustainable digital repository is an important goal for FAO. It will strengthen FAO's role as a knowledge organization and at the same time place FAO in a community of open source users who can provide support and feedback.

The FAO OA project provided the possibility to clearly identify FAO organizational requirements for the storage, dissemination and preservation of documents and bibliographic metadata. At the same time it provided the occasion to evaluate state-of-the-art tools for the management of digital repositories and to identify the most appropriate international standards.

More information

Nicolai, Claudia; Subirats, Imma; Katz, Steve (2007). The FAO Open Archive: Enhancing the Access to FAO Publications Using International Standards and Exchange Protocols. ELPUB 2007 (Vienna, Austria).

Subirats, Imma; Nicolai, Claudia; Katz, Steve (2008). An open archive for the Food and Agriculture Organization of the United Nations: disseminating enriched metadata and full text documents. OAI6 (Geneva, Switzerland).

Subirats, Imma; Bagdanov, Andy; Katz, Steve; Nicolai, Claudia (2009). Fedora Commons 3.0 versus DSpace 1.5: Selecting an Enterprise-Grade Repository System for FAO of the United Nations. Poster at Open Repositories 2009 (Atlanta, US).

AIMS