Data Policy - Harvest of the metadata from DFID




1.     BEN: dc:creator – we cannot currently split the authors accurately into different single tags or distinguish people from institutions.  We are working towards splitting out Author/Editor/Publisher but at the moment they are all in one field in the database, with some organisations mixed in as well.

Stefano: this is probably the most critical problem, since if they cannot distinguish people from institutes, we will probably end up mixing people and organisations in one tag (which one?...) – additionally, the authors are all in the same field (this “split” can also be done in house anyway..)


Yves: multiple authors: looking at the records, it is easy to split on the comma. This would be a task of the import/export tool (BTW, we should seriously look at Google Refine for this). personal/institutional: this should be a best effort named entity recognition task, again Google Refine has excellent tools for this. It is anyway semantically correct to assign the dc:creator property to either a person or an institution. Our internal model should support this.

Fabrizio: I like the idea of using Google Refine. Anyway, we could have compatibility problems if we try to generalize this issue. In fact, while it will probably be easy to separate an author from a corporate body, it will be really difficult to split authors in the same field. In this situation (DFID) it looks very easy to split authors on the comma. In AGRIS we have a similar issue, but sometimes the delimiter is a comma, other times a semicolon, or a dot, or a mixture of commas and semicolons.... Not easy to generalize



Stefano: agreed, splitting is relatively easy, as long as the separator, as Fab says, is the same.
The other thing is how to separate authors from institutions - dcterms:creator is semantically correct for dc, yes, we simply need to find a way to proceed now, with this case, since this is the first time we face this issue. How do we?
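The two steps discussed above (splitting the combined field, then a best-effort person/institution guess) could be sketched roughly as follows. This is only an illustration: the function names and the list of institution hint words are assumptions, not part of any existing import/export tool, and a real run would need the hint list tuned against the actual records. Note that splitting on the comma only works while names are not written as "Surname, Initials".

```python
import re

# Hypothetical hint words suggesting a creator is an institution, not a person.
INSTITUTION_HINTS = {
    "university", "institute", "ministry", "department",
    "centre", "center", "organisation", "organization",
}

def split_creators(raw, delimiters=",;"):
    """Split a combined creator field on any of the given delimiter characters."""
    pattern = "[" + re.escape(delimiters) + "]"
    return [part.strip() for part in re.split(pattern, raw) if part.strip()]

def classify_creator(name):
    """Best-effort guess: 'institution' if any hint word appears, else 'person'."""
    words = set(re.findall(r"[a-z]+", name.lower()))
    return "institution" if words & INSTITUTION_HINTS else "person"

creators = split_creators("J. Smith, A. Jones; University of Leeds")
labelled = [(name, classify_creator(name)) for name in creators]
```

The delimiter set is a parameter precisely because, as noted for AGRIS, different sources use commas, semicolons or a mixture; the dot is left out by default because it also appears in initials.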


 


2.     BEN:   dc:coverage is stored in our database using UN Country and Region codes (http://unstats.un.org/unsd/methods/m49/m49regin.htm) so yes this is a controlled vocabulary.
Stefano: OK

Yves: import/export component should replace these with AGROVOC URIs. Also, would be better to have the property as dcterms:spatial

Fabrizio: very easy to do. I am actually doing it with freekeyword during the process of RDF-ization

Stefano: I would tend to leave the term as provided by them - if they use a published encoding scheme, we should use that one. Anyway, this is yours Yves, and yes, it can be made more granular with, as you say, dcterms:spatial, a property which refines dc:coverage
 


3.     BEN:   dc:subject is based on a controlled vocabulary in our database, applied by our editors and agreed with DFID – based on a Research Topic Parent and Child relationship (the details of which can be seen here in the attached spreadsheet).

In the past this was also tied to DFID Programme but last year we decided to disconnect this relationship so that we could apply multiple Research Topics  to 1 Output (as you can see the data in the Gateway still just assigns 1 Topic but this will be amended in future).

The spreadsheet should also help you to decide which bits of the R4D database are appropriate for inclusion in AGRIS.

Stefano: OK

Fabrizio: we could also try to migrate to Agrovoc. For each keyword, we can check if it is in Agrovoc and use the Agrovoc URI. If a keyword is not in Agrovoc, we could decide to analyze it, probably we could add new keywords to our ontology....

Stefano: as 3. I would prefer not to convert the original source and leave the task to the solr index. In the past I asked Gudrun on several occasions if she wanted to analyze all the non-AGROVOC terms existing in the AGRIS collection (and we have thousands of them), but she always said that this activity is very time consuming.
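The lookup Fabrizio describes (use the vocabulary URI when a keyword matches, set the rest aside for analysis) might look like the sketch below. The vocabulary here is a placeholder dictionary with a made-up example.org URI; a real implementation would load the labels and URIs from AGROVOC itself, and the function name is hypothetical.

```python
def map_keywords(keywords, vocabulary):
    """Return (matched, unmatched).

    matched maps each original keyword to its vocabulary URI;
    unmatched lists the keywords left for manual analysis.
    """
    matched = {}
    unmatched = []
    for keyword in keywords:
        uri = vocabulary.get(keyword.strip().lower())
        if uri is None:
            unmatched.append(keyword)
        else:
            matched[keyword] = uri
    return matched, unmatched

# Placeholder vocabulary: this URI is invented for illustration only.
SAMPLE_VOCABULARY = {"water": "http://example.org/vocab/water"}

matched, unmatched = map_keywords(["Water", "Urbanisation"], SAMPLE_VOCABULARY)
```

Keeping the original keyword as the dictionary key means the source term is never lost, which fits the preference for not converting the original data.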
 


4.     BEN:   dc:description – we cannot remove the HTML at present – the html tags should be character encoded in the source anyway.

Stefano: … if we harvest and publish their abstracts (even if they contain the CDATA sections) we would not get the proper characters.. for example, open their data with Firefox and see the descriptions (abstracts)..

Yves: Another task for the import/export component. Looking at the data it's just a bit of escaped HTML, mostly line breaks. We should experiment a bit, but I would tend to strip it completely -- an easy task with a regular expression.

Fabrizio: or probably we could simply add a CDATA section so that parsers do not crash while analyzing the XML. Sometimes there could be entities that are not correctly defined...

Stefano: This is a very recurrent and well-known XML problem, especially when the cataloger or data input person copies and pastes from a PDF or Word document and drags in unwanted text. I am in favor of collecting all the possible occurrences/cases and feeding them to the import parser, so that it replaces each one with the relevant replacement char.. we have been doing this manually for such a long time..
The CDATA section is also a solution (implemented, btw, by the WebAGRIS system export scripts), but sometimes, as in this case, it does not work.. Look at the attachment, as saved from the URL: 
http://www.dfid.gov.uk/r4d/Gateway/?verb=ListRecords&metadataPrefix=oai_dc
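As a rough idea of the "strip it completely" option, the sketch below first decodes the character-encoded markup and then removes the tags with regular expressions. It assumes the descriptions contain only simple escaped HTML such as line breaks; the function name is hypothetical, and a regex stripper like this would also eat legitimate text that happens to sit between "<" and ">".

```python
import html
import re

def clean_description(text):
    """Decode escaped HTML in a dc:description and strip the tags."""
    unescaped = html.unescape(text)
    # Turn line-break tags into spaces before removing the remaining tags.
    unescaped = re.sub(r"(?i)<br\s*/?>", " ", unescaped)
    without_tags = re.sub(r"<[^>]+>", "", unescaped)
    # Collapse runs of whitespace left behind by the removed markup.
    return re.sub(r"\s+", " ", without_tags).strip()

cleaned = clean_description("First line.&lt;br /&gt;Second &lt;b&gt;line&lt;/b&gt;.")
```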
 


5. BEN:   dc:date – this denotes the year (defaulted to 01/01 for day/month) of publication in most cases (28244 out of 29545 currently).  For the remaining 1301, we have used the date when the publication was added to the R4D database.

Stefano: OK

Fabrizio: I don't like this idea, since in AGRIS the submission date is intrinsically contained in the URL of the resource; anyway, we can also accept this situation.

Stefano: Fab, maybe you did not get exactly what Ben wrote: dc:date is here exactly what is in AGRIS dateIssued, the year (date) of publication. Also here, anyway, we need to normalize it (at the source or in the solr index, as Fab does), since they appear as <dc:date>2007-11-20T00:00:00</dc:date>
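The normalization itself is small. A minimal sketch, assuming every value follows the ISO-style pattern shown above and that only the year is wanted (the function name is hypothetical):

```python
from datetime import datetime

def publication_year(dc_date):
    """Return the year of an ISO-style dc:date value, or None if it cannot be parsed."""
    try:
        return datetime.fromisoformat(dc_date.strip()).year
    except ValueError:
        return None

year = publication_year("2007-11-20T00:00:00")
```

Returning None rather than raising lets the import step collect the unparseable values for a later check instead of stopping the harvest.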
 


6.     BEN:   dc:identifier – the first identifier record is the Output/document abstract record and the next is the full text.

Stefano: these URLs are a mixture of PDF and other URLs; unfortunately sometimes there are three, sometimes two and sometimes one, and the latter can be either the pdf or another page… All the links that I accessed are broken.. maybe you are luckier than me..:-) – anyway, it is quite straightforward to pick up only the .pdf file.. still another thing that should be done here.. 

Yves: Yes this would be a great feature for the import/export component -- do a basic check on all URIs to make sure they exist. Also, would be great to start keeping a local full-text copy for preservation while publicly pointing to the URL provided.

Fabrizio: I accessed 5/5 broken links. We need a preliminary check - as Yves suggested - to understand if URLs are working or not.

Stefano: on multiple occasions (Scielo is one example), in AGRIS, the URLs were wrong because the scripts the providers implemented to write the export to a file add commas, dots or other things that break the real URL. Other times, they simply changed DNS, server or whatever and did not update the metadata.. we need to ask more of data providers.
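A possible shape for these checks, as a sketch only: trim the stray punctuation that export scripts append, pick the identifier that points to a PDF, and test whether a URL answers at all. The function names are hypothetical, and the trimming rule assumes a real URL never ends in a dot, comma or semicolon.

```python
from urllib.parse import urlparse
from urllib.request import Request, urlopen

def clean_url(url):
    """Remove surrounding whitespace and trailing dots, commas or semicolons."""
    return url.strip().rstrip(".,;")

def pick_pdf(identifiers):
    """Return the first cleaned identifier whose path ends in .pdf, else None."""
    for url in map(clean_url, identifiers):
        if urlparse(url).path.lower().endswith(".pdf"):
            return url
    return None

def url_exists(url, timeout=10):
    """Basic liveness check with a HEAD request; False on any error.

    Some servers reject HEAD, so a False here means "needs a closer look",
    not necessarily "broken".
    """
    try:
        request = Request(url, method="HEAD")
        with urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        return False

pdf = pick_pdf(["http://example.org/abstract?id=1", "http://example.org/doc.PDF."])
```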
 


7.     BEN:   dc:relation – the Project abstract record is referenced here – a Project may have several Outputs which we are making available in the Gateway and it is useful for this relationship to be flagged (the Project gives the wider framework and context for the publication/output).

Stefano: OK

Yves: not sure what we do with this, maybe just toss it away?

Fabrizio: I agree, we don't have this information in AGRIS and probably we don't need it....

Stefano: Actually we have dc:relation in several AGRIS resources (especially those from the Library, but not only), especially when the repository accurately describes whether a resource is part of another resource. Ben also states that this is useful information... But, also here, we need to make sure with them that the URLs are correct, since they are broken, too..

==========


8. Content

An additional issue for this dataset, but one that naturally recurs with all the "new data" coming from heterogeneous sources, such as R4D, is the type of data that they are disseminating for our harvest. We normally take for granted that a data owner who wants to publish data in AGRIS knows the main domain areas that our DB handles. Accordingly, they should filter and expose only the data that we should index. With OAI-PMH repositories this is relatively easy, simply by using sets (setSpec), but sets are not implemented in all repositories. In this specific case, analyzing briefly the dc:subject they used (very very poor indexing.. only one veeery broad term for each resource...), these are the three most used (or only) terms:
Urbanisation, Water, Transport

If we are to publish the data as it is now, we basically also need to include all the resources that have Water as their only term.. In short, for this and other sources, an analysis of the content is also required.
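To make the two options concrete, here is a small sketch. The first function builds a selective ListRecords request using the standard OAI-PMH "set" argument; the set name "agriculture" is invented, since we do not know which sets (if any) the Gateway exposes. The second flags records whose subjects are all among the broad terms listed above, so that they can be sent to a content check. Both function names are hypothetical.

```python
from urllib.parse import urlencode

BROAD_TERMS = {"urbanisation", "water", "transport"}

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords URL, optionally restricted to one set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

def needs_content_review(subjects, broad_terms=BROAD_TERMS):
    """True when every subject is a broad term (or there are no subjects at all)."""
    return all(subject.strip().lower() in broad_terms for subject in subjects)

# "agriculture" is a made-up setSpec used only to show the URL shape.
url = list_records_url("http://www.dfid.gov.uk/r4d/Gateway/", set_spec="agriculture")
```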


 

