WUR harvesting (April 2012)
WUR data has been missing from AGRIS since 2004 (the last year of input). The WUR harvest was run with the following parameters:
- Base URL: http://library.wur.nl/oai
- Date Range: Jan. 2004 – Dec. 2011
- Set: not specified
- Metadata format: agris_ap
- Number of records: 123,409
- Full-text links: 123,409
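As a rough illustration of how these parameters map onto an OAI-PMH request, the sketch below builds a ListRecords URL. The verb and argument names (verb, metadataPrefix, from, until, set) come from the OAI-PMH protocol; the helper name list_records_url and the exact day-level dates are assumptions, since the notes only give "Jan. 2004 – Dec. 2011".

```python
from urllib.parse import urlencode

BASE_URL = "http://library.wur.nl/oai"


def list_records_url(base_url, metadata_prefix, from_date=None,
                     until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        # Left out for this harvest, because no set was specified.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Assumed day-level bounds for "Jan. 2004 – Dec. 2011".
url = list_records_url(BASE_URL, "agris_ap", "2004-01-01", "2011-12-31")
print(url)
```

A harvest of this size will normally come back in several pages, so a real harvester would also have to follow the resumptionToken returned with each response; that part is not shown here.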
Crosscheck of two agris ap records
The first check was done to identify the potential aggregation of duplicated data, that is, data harvested from WUR that is already present in the AGRIS index.
The last set of data was submitted by WUR in 2004, and the harvest was done starting from the end of 2004.
Since the ARN is compiled differently in the AGRIS records and in the harvested records, it cannot be used as a matching key. We should see how we can dedup the data using a string match on titles, but then we face the problem of metadata quality. The data as indexed in AGRIS is richer, but the new WUR OAI-PMH data contains the link to the full text. Maybe the two could be merged; this should be studied.
- What the already AGRIS-indexed data contains that is not in the newly harvested metadata: AGROVOC descriptors, right subject categories (ASC), language of the resource, complete set of authors, publisher name and place, pagination, ISBN, other notes, etc.
- What the newly harvested metadata contains that is not present in the AGRIS-indexed data: URL full text
- dc:medium should be dcterms:medium -- to replace all occurrences with the valid namespace
- xml:lang AND dc:language= “und” -- to remove all such entries
- dc:source -- when present, it should respect the DTD sequence below. It is now stored in the middle of the metadata record, after dc:type and before dc:language -- it should be changed at the source
<!ELEMENT ags:resource (dc:title+, dc:creator*, dc:publisher*, dc:date+, dc:subject*, dc:description*, dc:identifier*, dc:type*, dc:format*, dc:language*, dc:relation*, agls:availability*, dc:source*, dc:coverage*, dc:rights*, ags:citation*)>
Fixes 1 and 2 can be done in house; fix 3 has to come from the data provider.
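A minimal sketch of how the two in-house fixes could be applied to a single harvested record, using Python's standard xml.etree.ElementTree. The dc and dcterms namespace URIs are the standard Dublin Core ones; the ags namespace URI, the sample record and the function name clean_record are assumptions made for illustration. It also assumes that "removing" an xml:lang="und" entry means dropping the attribute while keeping the element, and that a dc:language element whose value is "und" is dropped entirely.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def clean_record(root):
    """Apply fixes 1 and 2 in place to one parsed record (hypothetical helper)."""
    # Materialise the traversal first so that removals do not disturb it.
    for parent in list(root.iter()):
        # Fix 2: drop xml:lang="und" attributes, keeping the element itself.
        if parent.get(XML_LANG) == "und":
            del parent.attrib[XML_LANG]
        for child in list(parent):
            if child.tag == f"{{{DC}}}medium":
                # Fix 1: move dc:medium into the dcterms namespace.
                child.tag = f"{{{DCTERMS}}}medium"
            elif (child.tag == f"{{{DC}}}language"
                  and (child.text or "").strip() == "und"):
                # Fix 2: drop dc:language elements whose value is "und".
                parent.remove(child)
    return root


# Made-up sample record; the ags namespace URI is an assumption.
sample = """<ags:resource xmlns:ags="http://purl.org/agmes/1.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title xml:lang="und">Example title</dc:title>
  <dc:format><dc:medium>text/html</dc:medium></dc:format>
  <dc:language>und</dc:language>
</ags:resource>"""

cleaned = clean_record(ET.fromstring(sample))
print(ET.tostring(cleaned, encoding="unicode"))
```

Fix 3 (the position of dc:source) is deliberately left out, since it has to be corrected at the source.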
Ideally we should update the WUR records already indexed in AGRIS to add the URL full text metadata info. But if this is possible, it can be done at a later stage. Given the importance of the harvest, we should index the data that is not duplicated immediately.
Indexed in AGRIS
Thanks! We shall wait until Frank returns, you open a new case, and that case is cleared. No hurry.
For dc:source, I had a closer look at the data and I noticed that this is practically repeating the information already exposed inside the ags:citation container (the serials information on the resource), so we’ll probably end up removing this redundancy.
For the time being we can go ahead and index the data that is not indexed in AGRIS, as I mention below, and implement ourselves all the changes to the WUR metadata.
Before doing so, anyway, I need your confirmation regarding the set that AGRIS will need to harvest.
I noticed that currently your repository defines four Sets: Publicly available objects; Publicly available objects delivered from our repository; Distributed Africana Repositories Community; Publicly available dissertations Wageningen UR.
So, basically, you have defined four groups of metadata, and you say that they may have some relevance. Last week I did a full harvest without specifying any of these sets, following the advice you gave some time ago.
Based on your great knowledge of the AGRIS and WUR collections, would you recommend filtering out some of the data, or just indexing the entire data from the WUR repository for the last eight years?
Thanks and regards,
From: Besemer, Hugo [mailto:Hugo.Besemer@wur.nl]
Sent: Tuesday, April 10, 2012 9:27 AM
To: Anibaldi, Stefano (OEKC)
Cc: Celli, Fabrizio (OEKC); Keizer, Johannes (OEKC); Jaques, Yves (OEKC); AGRIS-Input
Subject: RE: can you have a quick look?
Hope you had a good Easter.
With regard to the requested fixes as you mention them: they are new issues that have to be made into a new case, and it may take some time (Frank is getting married, surprisingly in Nicaragua).
With regard to sets: the sets as we are using them now may have some relevance, especially set=public (only openly accessible objects). The previous sets that we had in WaY for subjects did not work.
Regarding the overlap: this is probably the cleanest solution, although you may lose some unique material this way.
The way forward
The harvest was rerun with Metadata Format = agris_ap and date range = 2000-01-01 to 2012-01-01 --> Number of records: 114,229
These are the primary requirements for the new indexing of WUR data:
- Dedup the data and filter out all the ags:resources whose title is already indexed in AGRIS. The new metadata is far poorer in content but contains the link to the full text.
- dc:source is not in the right sequence and is preventing validation of the whole set. As far as I could check in the data, this element only duplicates information already present in the record, so it can be removed completely.
- We need to remove “und” from dc:language and from all the xml:lang attributes.
- dc:medium should be changed to dcterms:medium
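The title-based dedup in the first requirement could be sketched as follows. This is only one possible approach, not the procedure actually used: it assumes an exact match on normalised titles (case, accents, punctuation and extra whitespace ignored), and the names normalize_title and filter_new_records, as well as the record layout (a dict with a "title" key), are hypothetical.

```python
import re
import unicodedata


def normalize_title(title):
    """Reduce a title to a comparison key (hypothetical normalisation)."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, then collapse punctuation and whitespace to single spaces.
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def filter_new_records(harvested, indexed_titles):
    """Keep harvested records whose title is not yet indexed in AGRIS."""
    seen = {normalize_title(t) for t in indexed_titles}
    fresh = []
    for record in harvested:
        key = normalize_title(record["title"])
        if not key:
            # No usable title: keep the record rather than guess.
            fresh.append(record)
        elif key not in seen:
            fresh.append(record)
            seen.add(key)  # also drops repeats within the harvest itself
    return fresh


# Made-up titles, for illustration only.
indexed = ["Soil erosion in the Netherlands"]
harvested = [
    {"title": "Soil  erosion in the Netherlands."},
    {"title": "A new study"},
]
print(filter_new_records(harvested, indexed))
```

An exact match on normalised titles will miss records whose titles differ by more than punctuation or case, and it can wrongly merge distinct resources that share a generic title, so the result would need spot checks given the metadata quality concerns noted earlier.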