Proposal for a Drupal Workbench Module

[To be revised. The Agrovoc or generic SKOS browser functionality can be separated from the scope of this module and is covered by the SKOS browser module that will be developed for Agrovoc and for the BioTech Glossary]

 
After a first experiment with an Agrovoc search/index Drupal module based on the Agrovoc web services, we are now thinking of a more advanced Workbench module that implements similar search/index functionalities but:
  1. calls the new Workbench web services and stores RDF records;
  2. supports searches of both Agrovoc concepts and Authority Files records (for the moment, just journals, but it should be extensible);
  3. stores URIs and labels (and some additional values) in referenced nodes (see attached file) instead of taxonomy fields: this is more compatible with the RDF approach that for instance DERI uses.
This would basically add support for the integration of external authority files both for subject indexing and for journals (for the moment).

 
Below is a detailed proposal of two possible ways of implementing this new module. It's a long read, only for those really interested, but feedback on which way to go would be appreciated: if you like, you can just look at the advantages and disadvantages paragraphs.
 
If there is no strong opinion on which way to go, I would opt for the first one. One question would be whether we want to invest in such a module now, considering that in a year or so we should be able to implement these functionalities using Drupal RDF support and a foreseeable SPARQL engine for the Concept Server.
 
However, implementing the module in the way(s) suggested below is fully compatible with the Drupal RDF approach. It has the big advantage that a module can easily be distributed to any Drupal user, including those who are not RDF/SPARQL experts. Implementing all functionalities using only the Drupal RDF modules would instead require writing the SPARQL queries, defining the mapping to CCK fields, setting up a dynamic interface for the SPARQL query, and finding a way to run the query and store the records at the moment of indexing a node.
So, do we all agree that we should go ahead with this module?

Proposal for a Drupal Workbench Module

This module would incorporate the following functionalities:
  • Workbench Agrovoc search/index
  • Workbench Authority Files search/index
  • Workbench Agrovoc navigator (search/browse/hierarchy) (especially for AIMS)
The Workbench Agrovoc search/index functionality will have features and an interface similar to those of the basic Agrovoc search/index module already implemented, but it will retrieve its data from the Agrovoc Concept Scheme, be RDF-based, and support suggesting new terms.
 
The Workbench Authority Files search/index functionality will work like the one above, but will only look up records from a specific authority file.
 
In addition, both functionalities above require a specific workflow for storing new values locally while proposing them to the Workbench, getting temporary URIs and periodically checking for the final URIs of approved values (see below, point 1.3).
 
The Workbench Agrovoc search/browse/viewer will implement functionalities similar to those now implemented in AIMS: it will allow the creation of pages with specific Agrovoc views (e.g. search, browse, hierarchy).
 
The functionalities to implement are many, so I would distinguish between a version 1 that includes all essential functionalities and a version 2 with additional features. The Workbench Agrovoc search/browse/viewer could be implemented in version 2.
The proposals below focus on the first two functionalities.
 
Two proposals for the implementation (although the first one is probably to be preferred at the moment):
 
1. Implementation based on the Workbench web services
 
This implementation has a workflow very similar to that of the basic Agrovoc search/index module already developed. Only the underlying technology would change; queries on the Authority Files triples and a specific workflow for storing local/suggested values would be added.

 
When the module is enabled:

The module automatically creates a specific content type for each entity that it expects to retrieve from the Workbench (first version: Agrovoc concept and Journal). These content types will have fields that store the URI and the essential properties of the concept:
  • the title, which will be the URI;
  • the labels in all languages enabled in the Drupal installation;
  • the description;
  • a “temporary” field to indicate if the concept is an approved one (final URI) or a suggested one (temporary URI);
  • the relations with other concepts. It can be decided whether all relations will be stored as generic relations or whether the types of relations will also be stored (version 1: only generic relations).
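To make the proposed structure concrete, here is a minimal sketch of the record that one such node would hold. It is written in Python purely for illustration (the module itself would be Drupal code), and the class name `ConceptNode` and its attribute names are assumptions, not existing Drupal or Workbench identifiers.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ConceptNode:
    """Hypothetical stand-in for one node of the 'Agrovoc concept' content type."""

    uri: str                                              # node title; the only mandatory field
    labels: Dict[str, str] = field(default_factory=dict)  # language code -> label
    description: str = ""
    temporary: bool = False                               # True for a suggested concept (temporary URI)
    related: List[str] = field(default_factory=list)      # URIs of related concepts (generic relations)

    def has_labels(self) -> bool:
        """A node with no labels is only a URI stub created as someone's related concept."""
        return bool(self.labels)


# Example with made-up URIs: a fully retrieved concept and a URI-only stub.
rice = ConceptNode(
    uri="http://example.org/concept/rice",
    labels={"en": "rice", "fr": "riz"},
    related=["http://example.org/concept/cereals"],
)
cereals_stub = ConceptNode(uri="http://example.org/concept/cereals")
```

A journal content type could follow the same shape, with the same mandatory URI and optional remaining fields.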

 
Relations between concepts will be implemented as node references and the only mandatory field for a concept will be its URI, so that new nodes can be added as referenced nodes on the fly, by just giving the URI, and the remaining information will be added when that concept is actually retrieved by a user to index something. The URI being the title of the node, node references among concepts will reproduce the original relations between concepts.
(This storing of RDF results into a content type structure is in line with the Drupal RDF modules developed by DERI. One of them, the RDF SPARQL Proxy module, stores results into CCK nodes by mapping the fields returned by a SPARQL query to the CCK structure; see a description of this module and this approach in http://openspring.net/sites/openspring.net/files/corl-etal-2009iswc.pdf).
In this way, concepts and journals can be referenced by any node and can be managed with Views, so that only concepts and journals that actually have labels will be displayed.
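The "URI-only node created on the fly" idea can be sketched as a get-or-create lookup keyed by URI. This is an illustration in Python under stated assumptions: the in-memory `store` dictionary stands in for the Drupal node table, and the function names are hypothetical.

```python
from typing import Dict, Iterable

# In-memory stand-in for the node store, keyed by URI (the node title).
store: Dict[str, dict] = {}


def get_or_create_stub(uri: str) -> dict:
    """Return the node for `uri`, creating a URI-only stub if none exists yet."""
    if uri not in store:
        store[uri] = {"uri": uri, "labels": {}, "description": "", "related": []}
    return store[uri]


def save_concept(uri: str, labels: Dict[str, str], description: str,
                 related_uris: Iterable[str]) -> dict:
    """Store a retrieved concept, filling in an existing stub rather than duplicating it."""
    node = get_or_create_stub(uri)
    node["labels"].update(labels)
    node["description"] = description
    for related_uri in related_uris:
        get_or_create_stub(related_uri)        # related concepts start life as stubs
        if related_uri not in node["related"]:
            node["related"].append(related_uri)
    return node
```

Because the URI is the key, retrieving a concept that already exists as a stub completes that stub instead of creating a second node.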

 
The module should also automatically create two Views (one for each type of entity, concept and journal) that show the URI, description and labels of ONLY the records that have labels. Records that only have URIs exist only as related concepts and are not to be considered until a user actually uses them for indexing and therefore retrieves them. These Views will be used by the node reference field to first check whether a concept is already in the system before looking up Agrovoc.
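The filtering rule that these Views apply can be sketched as follows. This is only an illustration in Python; in Drupal it would be a Views filter, and the dictionary layout and function names here are assumptions.

```python
from typing import Dict, List


def labelled_records(nodes: List[Dict]) -> List[Dict]:
    """Keep only nodes that have at least one label (URI-only stubs are hidden)."""
    return [node for node in nodes if node.get("labels")]


def find_local(nodes: List[Dict], text: str, lang: str = "en") -> List[Dict]:
    """Case-insensitive substring search on the label in one language,
    restricted to labelled records, as a local check before querying Agrovoc."""
    needle = text.lower()
    return [
        node for node in labelled_records(nodes)
        if needle in node["labels"].get(lang, "").lower()
    ]
```

If `find_local` returns nothing, the user would fall back to searching the remote Concept Scheme.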

 
When a user creates a content type where he wants to include a field for Agrovoc concepts and/or a field for journal:

The user should select a field of type “node reference” with the “reference” option only (not “create and reference”) and with cardinality configured as “unlimited”, and select the appropriate content type among those created by the module (Agrovoc concept or journal). The module should automatically intercept the selection of such types of node reference and include a “Search Concept Scheme” link above the node reference field.


When a user creates a new node of the above content types:

The user can either first search among local concepts (those already retrieved from the Concept Scheme and stored) through the standard node reference field, or just click on the “Search Concept Scheme” link and open the search popup.

The popup window allows the user to search existing Concept Server concepts by label, and also to locally create and suggest new concepts by calling the corresponding web service and getting a temporary URI. Concepts created in this way are listed together with the selected ones in the popup for final confirmation or removal before the user clicks the “Import and reference” button, which closes the popup.



(Version 2: the search interface will also exploit RDF to allow navigation among concepts and / or suggestion of related concepts). Selected and suggested concepts are then stored in nodes of the corresponding type (Agrovoc concept or journal) and automatically referenced in the multiple node reference field (which can only reference existing nodes, not create new ones on the fly, otherwise users could try to type new values in it without checking for valid URIs).



As per the normal functioning of node reference fields, selecting a concept that is already in the system just creates the reference to that node, so the retrieval of a concept that is already in the system does not create duplicates.



The new suggested concepts (temporary URI + proposed label(s) + optional proposed description) will be stored like the others but the “temporary” field will be checked. The module also has to implement a procedure to periodically synchronize temporary URIs of suggested concepts with the final URIs assigned if and when a concept is approved (Imma can give details regarding the appropriate web services and related workflow).
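The periodic synchronization could look roughly like the sketch below. It is written in Python for illustration only. The `resolve_final_uri` callable is a hypothetical stand-in for whatever Workbench web service reports the final URI of an approved suggestion (its real name and signature are not known here), and the `store` dictionary stands in for the local nodes.

```python
from typing import Callable, Dict, Optional


def sync_temporary_uris(
    store: Dict[str, dict],
    resolve_final_uri: Callable[[str], Optional[str]],
) -> int:
    """Replace temporary URIs with final ones for approved suggestions.

    `resolve_final_uri` is assumed to return the final URI for an approved
    concept, or None while the suggestion is still pending.
    Returns the number of nodes updated.
    """
    updated = 0
    # Take a snapshot of the keys first, because the store is modified in the loop.
    temporary_uris = [uri for uri, node in store.items() if node.get("temporary")]
    for temp_uri in temporary_uris:
        final_uri = resolve_final_uri(temp_uri)
        if final_uri is None:
            continue  # not approved yet; try again on the next run
        # Assumption: the final URI is not already stored locally.
        node = store.pop(temp_uri)
        node["uri"] = final_uri
        node["temporary"] = False
        store[final_uri] = node
        updated += 1
    return updated
```

In Drupal the references between nodes go through node IDs, so changing the title (URI) of a node would not break existing references; a routine like this could run from cron.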

 
Advantages: it can be implemented immediately; it allows calling the web services for suggesting new terms in the Authority Files, which is not possible with solution 2.
 
Disadvantages: it uses the Workbench web services, so it is tied to them: only triples available through these web services can be queried and stored.
 
2. Implementation based on Drupal RDF SPARQL modules
 
The module would do the same as the implementation above but instead of calling the Workbench web services it would query the Concept Server through SPARQL queries. This implementation would make full use of the DERI RDF SPARQL Proxy module (http://drupal.org/project/rdfproxy and http://openspring.net/sites/openspring.net/files/corl-etal-2009iswc.pdf): it would define different mappings between the Concept Server schema (and potentially any other schema) and the local content types (concept, journal, but potentially many others which may store values from other RDF stores).



This requires the availability of the Concept Server behind a SPARQL engine, which I think is foreseen anyway. 
When the module is enabled:

All as in implementation 1, but this module would also automatically create the SPARQL Proxy mappings between the Concept Scheme schema and the created content types (version 2: it would also allow users to create new content types and corresponding mappings that can be used against other RDF stores).
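As an illustration of what such a mapping amounts to, the sketch below maps RDF properties to content type fields and applies the mapping to a set of triples. It is written in Python and does not reproduce the actual configuration format of the RDF SPARQL Proxy module. The use of SKOS properties, the content type name `agrovoc_concept` and the field names are all assumptions.

```python
from typing import Dict, Iterable, Tuple

SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical mapping: RDF property URI -> field name on the content type.
CONCEPT_MAPPING = {
    "content_type": "agrovoc_concept",
    "fields": {
        SKOS + "prefLabel": "labels",
        SKOS + "definition": "description",
        SKOS + "related": "related",
    },
}


def triples_to_fields(subject_uri: str,
                      triples: Iterable[Tuple[str, str, str]],
                      mapping: Dict) -> Dict:
    """Build a node-like dict for `subject_uri` from (subject, predicate, object)
    triples, keeping only predicates that the mapping knows about."""
    node = {"type": mapping["content_type"], "uri": subject_uri}
    for subject, predicate, obj in triples:
        if subject != subject_uri:
            continue
        field_name = mapping["fields"].get(predicate)
        if field_name is None:
            continue  # property not mapped to any field
        node.setdefault(field_name, []).append(obj)
    return node
```

A second mapping with a different `content_type` and property list would cover journals or any other RDF source.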


When a user creates a content type where he wants to include a field for Agrovoc concepts and/or a field for journal (or link to any other node defined in the step above):

All as in implementation 1 (addition of the “Search Concept Scheme” link).



When a user creates a new node of the above content types:

All as in implementation 1, but the popup window should load the appropriate mapping (“proxy”) according to the content type selected for the node reference, first run the corresponding SPARQL query without storing the resulting triples, and then store only the triples that the user has selected.
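The "query first, store only what was selected" step can be sketched as below, in Python for illustration. The query assumes that concepts expose `skos:prefLabel` and that the endpoint supports SPARQL 1.1 string functions (`CONTAINS`, `LCASE`); neither is confirmed for the Concept Server. The function names are hypothetical and no network call is made.

```python
from typing import Dict, Iterable, List


def build_label_search_query(text: str, lang: str = "en", limit: int = 20) -> str:
    """Build a SPARQL SELECT query that finds concepts by label substring."""
    if not lang.replace("-", "").isalnum():
        raise ValueError("unexpected language tag: %r" % lang)
    # Minimal escaping so the search text stays inside the string literal.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?concept ?label WHERE {\n"
        "  ?concept skos:prefLabel ?label .\n"
        f'  FILTER(LANG(?label) = "{lang}" && '
        f'CONTAINS(LCASE(STR(?label)), LCASE("{escaped}")))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def store_selected(results: List[Dict[str, str]],
                   selected_uris: Iterable[str],
                   store: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Persist only the result rows whose concept URI the user selected."""
    selected = set(selected_uris)
    for row in results:
        if row["concept"] in selected:
            store[row["concept"]] = row
    return store
```

The results of the query would be shown in the popup, and only after the user confirms the selection would `store_selected` (or its Drupal equivalent) write anything locally.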

The workflow for suggesting terms and synchronizing temporary / approved URIs should still be implemented through web services, unless the Concept Scheme SPARQL engine allows for insert queries. 


Advantages: it is a more general approach. Since different mappings to different RDF sources (schemas) can be configured through the RDF SPARQL Proxy module, the module could dynamically load different mappings and search and store triples from different sources (e.g., using the same module one could have a geographic indexing field in a content type that queries the geo-political ontology and stores triples from there in a referenced node).
 
Disadvantages: it cannot be implemented until the Concept Server is available behind a SPARQL engine; implementing the workflow for suggesting terms and synchronizing temporary/approved URIs can be more difficult or impossible through SPARQL queries, while there are dedicated Workbench web services for this (Imma can give more details).