WUR harvesting (April 2012)

Background

AGRIS has no WUR data after 2004, the last year in which WUR submitted input. The WUR harvest was run with the following parameters:

  1. Base URL: http://library.wur.nl/oai
  2. Date Range: Jan. 2004 – Dec. 2011
  3. Set: not specified
  4. Metadata format: agris_ap
  5. Number of records: 123,409
  6. Full-text links: 123,409
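
The parameters above map onto a standard OAI-PMH ListRecords request. The following is a minimal sketch that only builds the request URL and does no network access. The day-level dates (2004-01-01 to 2011-12-31) are an assumption derived from the month range given above, and the helper name list_records_url is illustrative.

```python
from urllib.parse import urlencode

BASE_URL = "http://library.wur.nl/oai"


def list_records_url(base_url, metadata_prefix, from_date, until_date, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": from_date,
        "until": until_date,
    }
    if set_spec:
        # The harvest described above did not restrict to a set.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Assumed day granularity for "Jan. 2004 - Dec. 2011".
url = list_records_url(BASE_URL, "agris_ap", "2004-01-01", "2011-12-31")
print(url)
```

A real harvester would also have to follow the resumptionToken returned with each partial response until the list is complete.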

Crosscheck of two agris_ap records

The first check was done to identify potential duplicates, that is, records harvested from WUR that are already present in the AGRIS index.

WUR submitted its last set of data in 2004, and the harvest was run starting from the end of 2004.

The ARN is compiled differently in AGRIS and in the harvested data, so it cannot be used for matching. We should see whether we can dedup the data with a string match on titles, but that runs into the problem of metadata quality. The data indexed in AGRIS is richer, while the new WUR OAI-PMH data contains the link to the full text. Merging the two may be an option and should be studied.

  • What the already AGRIS-indexed data contains that is not in the newly harvested metadata: AGROVOC descriptors, correct subject categories (ASC), language of the resource, complete set of authors, publisher name and place, pagination, ISBN, other notes, etc.
  • What the newly harvested metadata contains that is not present in the AGRIS-indexed data: the URL of the full text
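
To illustrate why an exact title match is not enough, here is a minimal sketch of a fuzzy title comparison using the two Record #1 titles quoted further down. They differ in punctuation ("grazing;" vs "grazing:") and in one word ("systems" vs "system"). The normalization rules and the 0.95 threshold are assumptions for illustration, not a tested AGRIS setting.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def title_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def is_probable_duplicate(a, b, threshold=0.95):
    # The threshold is an assumed value and would need tuning on real data.
    return title_similarity(a, b) >= threshold


wur_title = ("Voluntary automatic milking in combination with grazing; "
             "visits to the automatic milking systems and behaviour")
agris_title = ("Voluntary automatic milking in combination with grazing: "
               "visits to the automatic milking system and behaviour")
other_title = ("Performance levels in food traceability and the impact on "
               "chain design: results of an international benchmark study")

print(normalize_title(wur_title) == normalize_title(agris_title))
print(is_probable_duplicate(wur_title, agris_title))
print(is_probable_duplicate(wur_title, other_title))
```

Comparing every harvested title against every indexed title this way is quadratic, so a real run would probably first group candidates by a cheap key such as year of issue.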

OTHER ISSUES

  1. dc:medium should be dcterms:medium -- replace all occurrences with the valid namespace
  2. xml:lang and dc:language = “und” -- remove all such entries
  3. dc:source -- when present, it should respect the DTD sequence below. It is currently placed in the middle of the metadata record, after dc:type and before dc:language -- this should be changed at the source

<!ELEMENT ags:resource (dc:title+, dc:creator*, dc:publisher*, dc:date+, dc:subject*, dc:description*, dc:identifier*, dc:type*, dc:format*, dc:language*, dc:relation*, agls:availability*, dc:source*, dc:coverage*, dc:rights*, ags:citation*)>

Items 1 and 2 can be done in house; item 3 has to be fixed by the data provider.
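
The two in-house fixes could be sketched as follows with the Python standard library. The namespace URIs are assumed to be the ones used by the AGRIS AP (the pasted records omit their declarations), and the sample record, including the dc:medium value, is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the records in this note do not show them.
NS = {
    "ags": "http://purl.org/agmes/1.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def clean_record(resource):
    """Apply fixes 1 and 2 in place to one ags:resource element."""
    # Fix 1: move every dc:medium element into the dcterms namespace.
    for elem in list(resource.iter("{%s}medium" % NS["dc"])):
        elem.tag = "{%s}medium" % NS["dcterms"]
    # Fix 2a: drop xml:lang="und" attributes.
    for elem in resource.iter():
        if elem.get(XML_LANG) == "und":
            del elem.attrib[XML_LANG]
    # Fix 2b: drop dc:language elements whose value is "und".
    lang_tag = "{%s}language" % NS["dc"]
    for parent in list(resource.iter()):
        for child in list(parent):
            if child.tag == lang_tag and (child.text or "").strip() == "und":
                parent.remove(child)
    return resource


# Hypothetical record, shaped like the harvested ones.
SAMPLE = """<ags:resource xmlns:ags="http://purl.org/agmes/1.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/" ags:ARN="NL200610115">
  <dc:title xml:lang="und">Sample title</dc:title>
  <dc:format><dc:medium>text</dc:medium></dc:format>
  <dc:language scheme="dcterms:ISO639-2">und</dc:language>
</ags:resource>"""

record = clean_record(ET.fromstring(SAMPLE))
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)  # keep readable prefixes on output
print(ET.tostring(record, encoding="unicode"))
```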

OUTCOME

Ideally we should update the WUR records already indexed in AGRIS to add the full-text URL. If this turns out to be possible, it can be done at a later stage. Given the importance of the harvest, we should index the non-duplicated data immediately.
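
If the later merge is attempted, one possible approach is to copy the dc:identifier carrying the full-text URL from the harvested record into the matching AGRIS record, placing it where the DTD sequence quoted above expects it. This is only a sketch. The namespace URIs are assumptions, the two records are trimmed versions of Record #2 below, and pairing the records is taken as already done.

```python
import copy
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the records in this note do not show them.
NS = {
    "ags": "http://purl.org/agmes/1.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "agls": "http://www.naa.gov.au/recordkeeping/gov_online/agls/1.2",
}
# Child order of ags:resource, copied from the DTD line quoted above.
DTD_ORDER = ["dc:title", "dc:creator", "dc:publisher", "dc:date", "dc:subject",
             "dc:description", "dc:identifier", "dc:type", "dc:format",
             "dc:language", "dc:relation", "agls:availability", "dc:source",
             "dc:coverage", "dc:rights", "ags:citation"]


def qname(prefixed):
    prefix, local = prefixed.split(":")
    return "{%s}%s" % (NS[prefix], local)


RANK = {qname(name): i for i, name in enumerate(DTD_ORDER)}


def insert_in_dtd_order(resource, new_elem):
    """Insert new_elem before the first child that must come after it."""
    new_rank = RANK[new_elem.tag]
    for index, child in enumerate(resource):
        if RANK.get(child.tag, len(RANK)) > new_rank:
            resource.insert(index, new_elem)
            return
    resource.append(new_elem)


def add_fulltext_link(agris_record, harvested_record):
    """Copy the harvested URI identifier into the AGRIS record."""
    for ident in harvested_record.findall("dc:identifier", NS):
        if ident.get("scheme") == "dcterms:URI":
            insert_in_dtd_order(agris_record, copy.deepcopy(ident))
            return True
    return False


def parse(body, arn):
    xmlns = " ".join('xmlns:%s="%s"' % item for item in NS.items())
    return ET.fromstring('<ags:resource %s ags:ARN="%s">%s</ags:resource>'
                         % (xmlns, arn, body))


harvested = parse(
    '<dc:date><dcterms:dateIssued>2004</dcterms:dateIssued></dc:date>'
    '<dc:identifier scheme="dcterms:URI">'
    'http://library.wur.nl/WebQuery/wurpubs/337577</dc:identifier>',
    "NL200537577")
agris = parse(
    '<dc:date><dcterms:dateIssued>2004</dcterms:dateIssued></dc:date>'
    '<dc:format><dcterms:extent>p. 175-183</dcterms:extent></dc:format>'
    '<dc:language scheme="ags:ISO639-1">En</dc:language>',
    "NL2004731571")

merged = add_fulltext_link(agris, harvested)
print(merged, [child.tag.split("}")[1] for child in agris])
```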

Record #1

WUR OAI-PMH
 
<ags:resource ags:ARN="NL200610115">

          <dc:title xml:lang="und">Voluntary automatic milking in combination with grazing; visits to the automatic milking systems and behaviour</dc:title>

          <dc:creator><ags:creatorPersonal>Ketelaar-de Lauwere, C.C.</ags:creatorPersonal></dc:creator>

          <dc:creator><ags:creatorPersonal>Ipema, A.H.</ags:creatorPersonal></dc:creator>

          <dc:creator><ags:creatorPersonal>Metz, J.H.M.</ags:creatorPersonal></dc:creator>

          <dc:date><dcterms:dateIssued>1999</dcterms:dateIssued> </dc:date>

          <dc:subject>

                  <ags:subjectClassification scheme="ags:ASC">A00</ags:subjectClassification>

          </dc:subject>

          <dc:identifier scheme="dcterms:URI">http://library.wur.nl/WebQuery/wurpubs/310115</dc:identifier>

          <dc:type>Article in monograph or in proceedings</dc:type>

          <dc:language scheme="dcterms:ISO639-2">und</dc:language>

          <agls:availability>

            <ags:availabilityLocation>Library Wageningen University and Research Centre, Postbus …</ags:availabilityLocation>

            <ags:availabilityNumber>310115</ags:availabilityNumber>

          </agls:availability>

</ags:resource>

 

Record #1

indexed in AGRIS
<ags:resource ags:ARN="NL2001002706">

<dc:title xml:lang="en">Voluntary automatic milking in combination with grazing: visits to the automatic milking system and behaviour</dc:title>

                <dc:creator>

                                <ags:creatorPersonal>Ketelaar-de Lauwere, C.C.</ags:creatorPersonal>

                                <ags:creatorPersonal>Ipema, A.H.</ags:creatorPersonal>

                                <ags:creatorPersonal>Metz, J.H.M.</ags:creatorPersonal>

                                <ags:creatorCorporate>IMAG-DLO, Wageningen (Netherlands)</ags:creatorCorporate>

                </dc:creator>

                <dc:publisher>

                                <ags:publisherName>IMAG</ags:publisherName>

                                <ags:publisherPlace>Wageningen (Netherlands)</ags:publisherPlace>

                </dc:publisher>

                <dc:date><dcterms:dateIssued>1999</dcterms:dateIssued></dc:date>

                <dc:subject>

                                <ags:subjectClassification scheme="ags:ASC">N20</ags:subjectClassification>

                                <ags:subjectThesaurus scheme="ags:AGROVOC" xml:lang="en">machine milking</ags:subjectThesaurus>

                                <ags:subjectThesaurus scheme="ags:AGROVOC" xml:lang="en">robots</ags:subjectThesaurus>

                                <ags:subjectThesaurus scheme="ags:AGROVOC" xml:lang="en">grazing</ags:subjectThesaurus>

                                <ags:subjectThesaurus scheme="ags:AGROVOC" xml:lang="en">behaviour</ags:subjectThesaurus>

                </dc:subject>

                <dc:type> Non-Conventional</dc:type>

                <dc:format>

                                <dcterms:extent>p. 27-30</dcterms:extent>

                </dc:format>

                <dc:language scheme="ags:ISO639-1">En</dc:language>

                <dc:relation>

                                <dcterms:isPartOf scheme="ags:ISBN">9054061731</dcterms:isPartOf>

                </dc:relation>

                <dc:source>Dutch-Japanese workshop on precision dairy farming, Klooster, C.E. van 't.- Wageningen (Netherlands): IMAG, 1999.- ISBN 9054061731. 157 p.</dc:source>

</ags:resource>


Record #2

from WUR OAI-PMH
<ags:resource ags:ARN="NL200537577">

                <dc:title xml:lang="eng">Performance levels in food traceability and the impact on chain design: results of an international benchmark study</dc:title>

                <dc:creator>

                                <ags:creatorPersonal>Vorst, J.G.A.J. van der</ags:creatorPersonal>

                </dc:creator>

                <dc:date>

                                <dcterms:dateIssued>2004</dcterms:dateIssued>

                </dc:date>

                <dc:subject>

                                <ags:subjectClassification scheme="ags:ASC">E15</ags:subjectClassification>

                </dc:subject>

                <dc:identifier scheme="dcterms:URI">http://library.wur.nl/WebQuery/wurpubs/337577</dc:identifier>

                <dc:type>Article in monograph or in proceedings</dc:type>

                <dc:format>

                                <dcterms:extent>175 - 183</dcterms:extent>

                </dc:format>

                <dc:language scheme="dcterms:ISO639-2">eng</dc:language>

                <agls:availability>

                                <ags:availabilityLocation>Library Wageningen University and Research</ags:availabilityLocation>

                                <ags:availabilityNumber>337577</ags:availabilityNumber>

                </agls:availability>

</ags:resource>

Record #2

Indexed in AGRIS

<ags:resource ags:ARN="NL2004731571">

                <dc:title xml:lang="en">Performance levels in food traceability and the impact on chain design: results of an international benchmark study</dc:title>

                <dc:creator>

                                <ags:creatorPersonal>Vorst, J.G.A.J. van der</ags:creatorPersonal>

                                <ags:creatorPersonal>Bremmers, H.J.</ags:creatorPersonal>

                </dc:creator>

                <dc:date>

                                <dcterms:dateIssued>2004</dcterms:dateIssued>

                </dc:date>

                <dc:subject>

                                <ags:subjectClassification scheme="ags:ASC">E20</ags:subjectClassification>

                                <ags:subjectThesaurus xml:lang="en" scheme="ags:AGROVOC">foods</ags:subjectThesaurus>

                                <ags:subjectThesaurus xml:lang="en" scheme="ags:AGROVOC">tracer</ags:subjectThesaurus>

                                <ags:subjectThesaurus xml:lang="en" scheme="ags:AGROVOC">performance</ags:subjectThesaurus>

                </dc:subject>

                <dc:description>

                                <ags:descriptionNotes>Bremmers H.J. (ed.) Dynamics in chains and networks : proceedings of the sixth international conference on chain and network management in agribusiness and the food industry (Ede, 27-28 May 2004). Wageningen Academic Press : Wageningen, 2004. - 630 p. ISBN 907699840X; paper</ags:descriptionNotes>

                </dc:description>

                <dc:format>

                                <dcterms:extent>p. 175-183</dcterms:extent>

                </dc:format>

                <dc:language scheme="ags:ISO639-1">En</dc:language>

                <agls:availability>

                                <ags:availabilityLocation>*Library Wageningen University and Research Centre</ags:availabilityLocation>

                                <ags:availabilityNumber>2004731571</ags:availabilityNumber>

                </agls:availability>

</ags:resource>

Correspondence


Hi Hugo,

 

Thanks! We shall wait until Frank returns, you open a new case, and that case is cleared. No hurry.

For dc:source, I had a closer look at the data and I noticed that it practically repeats the information already exposed inside the ags:citation container (the serials information on the resource), so we’ll probably end up removing this redundancy.

For the time being we can go ahead and index the data that is not yet indexed in AGRIS, as I mention below, and implement all the changes to the WUR metadata ourselves.

 

Before doing so, anyway, I need your confirmation regarding the set that AGRIS will need to harvest.

I noticed that your repository currently defines four sets: Publicly available objects, Publicly available objects delivered from our repository, Distributed Africana Repositories Community, Publicly available dissertations Wageningen UR.

So, basically, you have defined four groups of metadata and you are saying that they may have some relevance. Last week I did a full harvest without specifying any of these sets, following the advice you gave some time ago.

Based on your great knowledge of the AGRIS and WUR collections, would you recommend filtering out some of the data, or just indexing the entire data from the WUR repository for the last eight years?

 

Thanks and regards,

 

Stefano

 

 

 

From: Besemer, Hugo

Sent: Tuesday, April 10, 2012 9:27 AM

To: Anibaldi, Stefano (OEKC)

Cc: Celli, Fabrizio (OEKC); Keizer, Johannes (OEKC); Jaques, Yves (OEKC); AGRIS-Input

Subject: RE: can you have a quick look?

 

Hi Stefano

 

Hope you had a good Easter.

With regard to the requested fixes as you mention them: they are new issues, they have to be made a new case, and it may take some time (Frank is getting married, surprisingly in Nicaragua).

With regard to sets: the sets as we are using them now may have some relevance, especially set=public (only openly accessible objects). The previous sets that we had in WaY for subjects did not work.

Regarding the overlap: this is probably the cleanest solution, although you may lose some unique material this way.

 

 

Best

 

Hugo







========================================================

The way forward


The harvest was rerun with Metadata Format = agris_ap and date range = 2000-01-01 to 2012-01-01 --> Number of records: 114,229



These are the primary requirements for the new indexing of WUR data:

  1. Dedup the data and filter out every ags:resource whose title is already indexed in AGRIS. The new metadata is far poorer in content but contains the link to the full text.
  2. dc:source is not in the right sequence and prevents validation of the whole set. As far as I could check in the data, this element contains duplicated information, so it can be removed completely.
  3. We need to remove “und” from dc:language and from all xml:lang attributes.
  4. dc:medium should be changed to dcterms:medium
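
Requirement 2 could be sketched as below: remove dc:source, then check that the remaining children follow the order of the ags:resource DTD sequence quoted earlier. The check only looks at element order and is not a full DTD validation (it ignores cardinality such as dc:title+). The namespace URIs are assumptions, and the sample record, including its dc:source text, is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the records in this note do not show them.
NS = {
    "ags": "http://purl.org/agmes/1.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "agls": "http://www.naa.gov.au/recordkeeping/gov_online/agls/1.2",
}
# Child order of ags:resource, copied from the DTD line quoted earlier.
DTD_ORDER = ["dc:title", "dc:creator", "dc:publisher", "dc:date", "dc:subject",
             "dc:description", "dc:identifier", "dc:type", "dc:format",
             "dc:language", "dc:relation", "agls:availability", "dc:source",
             "dc:coverage", "dc:rights", "ags:citation"]
RANK = {"{%s}%s" % (NS[name.split(":")[0]], name.split(":")[1]): i
        for i, name in enumerate(DTD_ORDER)}


def drop_source(resource):
    """Remove every dc:source child and return how many were removed."""
    removed = 0
    for child in resource.findall("dc:source", NS):
        resource.remove(child)
        removed += 1
    return removed


def follows_dtd_sequence(resource):
    """True if all children are known and appear in DTD order."""
    ranks = [RANK.get(child.tag) for child in resource]
    if None in ranks:
        return False
    return ranks == sorted(ranks)


# Hypothetical record with dc:source between dc:type and dc:language.
SAMPLE = """<ags:resource xmlns:ags="http://purl.org/agmes/1.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:title>Sample title</dc:title>
  <dc:date><dcterms:dateIssued>2004</dcterms:dateIssued></dc:date>
  <dc:type>Article in monograph or in proceedings</dc:type>
  <dc:source>Sample proceedings</dc:source>
  <dc:language scheme="dcterms:ISO639-2">eng</dc:language>
</ags:resource>"""

record = ET.fromstring(SAMPLE)
before = follows_dtd_sequence(record)
removed = drop_source(record)
after = follows_dtd_sequence(record)
print(before, removed, after)
```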