AGRIS Minutes (wrap-up from meetings 2012/01/17 - 2012/01/27)

DOAJ
issues
-- duplicates because they also harvest Scielo
-- their schema is custom (but we already have a conversion XSL)

recommendation
-- harvest, convert and dedupe (approx. 100k records) as a one-off (do not build a configurable multi-use de-duping component)

-- Use Sweden (DOAJ is there) for provenance in building the ARN

-- in future, stop harvesting from Scielo directly, since its records are already in DOAJ (the same applies to other providers)
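
A minimal sketch of what the one-off de-duplication pass could look like, in Python. It assumes the converted records are available as dictionaries with "title", "issn" and "year" fields; those field names, and the choice of matching on a normalized title plus ISSN and year, are illustrative assumptions and not something decided in the meeting.

```python
import re
import unicodedata


def dedupe_key(record):
    """Build a comparison key from the (assumed) title, issn and year fields."""
    # Strip accents so "Produção" and "Producao" compare equal.
    title = unicodedata.normalize("NFKD", record.get("title", ""))
    title = "".join(ch for ch in title if not unicodedata.combining(ch))
    # Lowercase and collapse punctuation/whitespace runs into single spaces.
    title = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    issn = record.get("issn", "").replace("-", "").upper()
    return (title, issn, str(record.get("year", "")))


def dedupe(records):
    """Keep the first record seen for each key; return (kept, dropped)."""
    seen = set()
    kept, dropped = [], []
    for record in records:
        key = dedupe_key(record)
        if key in seen:
            dropped.append(record)
        else:
            seen.add(key)
            kept.append(record)
    return kept, dropped
```

Because the first record seen wins, feeding the Scielo-harvested records before the DOAJ ones (or the reverse) decides which copy survives. The dropped list can be reviewed by hand before anything is discarded.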

AGRIS CD
Will we publish one in 2012? (Johannes to decide)

OpenAGRIS
-- need to get other kinds of data in there before May
-- Stefano will research suitable APIs, whether SPARQL or another kind

Data processing
-- Stefano is documenting the AGRIS business process so we can better evaluate possible ARIADNE cooperation
-- Fabrizio is investigating the state of the code and documentation in Ariadne

Import/export
-- Team is working on a document that defines a set of principles and contains numerous real user scenarios with recommended courses of action derived from those principles
issues
-- Does LODE-BD recommend following DC encoding guidelines? Example: DFID data has multiple creators in the same element, which is not recommended DC practice. Can we therefore reject such data and keep it out of AGRIS, or do we accept it because LODE-BD does not specify it as a bad practice? (Note that by this rule we should also not accept AGRIS AP, as it nests elements, another DC bad practice.)

-- If we accept such data, should we then try to fix it by separating the values? We think not.
 
-- If we have a dc:creator element that may have an undifferentiated personal or corporate creator in it, can we loosen the AGRIS-AP DTD so that mixed content is permitted, i.e. either a text string or ags:creatorCorporate/Personal elements? We think yes; Johannes, can you confirm?

-- Should we accept any keywords in dc:subject elements? We think yes.

-- If the keywords are controlled, we will extend the AGRIS AP to accommodate their identification.

-- Should we inform providers of poor practice in their data export? What is the limit after which we will not accept their data? Examples:
-- Messed up encoding. Happens frequently and we spend a lot of time fixing it. We suggest simply warning providers and leaving data as-is.
-- Incorrect use of CDATA blocks. DFID encodes dc:description in CDATA blocks, so the content remains unparsed and HTML markup inside it, such as <br/>, is displayed literally on the web. We suggest warning them but otherwise displaying the data in AGRIS as it is encoded, even though it appears incorrectly.
-- Broken dc:identifier links. DFID again: the links are broken. We suggest warning providers and adding a script at import time that checks whether each link actually resolves to a resource.
-- Domain specificity. We feel a provider must at least be able to guarantee that its records fit the AGRIS domain, which is described on the website. DFID, again, is providing records that have nothing to do with agriculture, such as "road building". We think a provider should be rejected in this case.
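
A rough sketch of the import-time check for dc:identifier links, in Python using only the standard library. The function names, the 10-second timeout and the use of HEAD requests are assumptions for illustration; the team has not settled on how such a script would work.

```python
import urllib.error
import urllib.request


def link_resolves(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx or 3xx status."""
    # Only http(s) identifiers can be checked this way.
    if not url.lower().startswith(("http://", "https://")):
        return False
    # HEAD avoids downloading the resource body.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # Covers HTTP errors (404, 500, ...), DNS failures and timeouts.
        return False


def broken_links(identifiers, check=link_resolves):
    """Return the identifiers that do not resolve, for the provider warning."""
    return [url for url in identifiers if not check(url)]
```

Some servers refuse HEAD requests, so a link reported as broken may deserve a second attempt with GET before the provider is warned. The opener and check parameters exist only so the functions can be exercised without network access.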

AgHarvest
-- Stefano is preparing a list of repositories and websites for INFN

