VALA2014 Session 2 Balnaves

Complex harvesting for content from public sources and email

VALA2014 CONCURRENT SESSION 2: It’s All About the Data
Tuesday 4 February 2014, 12:00 – 12:30
Persistent URL:

Edmund Balnaves

Prosentient Systems, NSW

Please tag your comments, tweets, and blog posts about this session: #vala14 and #s6

VALA Peer Reviewed


This paper presents the results of a project to build a complex harvesting system that draws on web and email sources and is integrated with open source platforms, with the aim of improving discovery of information about, or relevant to, the organisation from public internet sources. The paper discusses methods of harvesting, drawing on a mix of RSS, Google API search and simple web parsing. It also presents the results of automated metadata allocation and subsequent manual curation. The project highlights the need to use multiple web scanning techniques that are exhaustive enough to catch relevant references, yet specific enough to avoid an unduly large number of false positive candidates for selection.
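As an illustration only, the idea of combining several harvesting techniques and then narrowing the candidates could be sketched as follows. This is not the project's implementation. The function names, the sample feed and page, and the keyword lists are all hypothetical, and the Google API search step is left out because its request format is not described here. The sketch merges candidates from an RSS feed and from simple HTML link parsing, removes duplicate URLs, and applies include and exclude terms to limit false positives.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def rss_candidates(rss_xml):
    """Return (url, text) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    out = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        title = (item.findtext("title") or "").strip()
        if link:
            out.append((link, title))
    return out


class _LinkParser(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def html_candidates(html):
    """Return (url, text) pairs for every link found by simple HTML parsing."""
    parser = _LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def select(candidates, include_terms, exclude_terms=()):
    """Deduplicate by URL, then keep items that match an include term
    and no exclude term (hypothetical relevance rule)."""
    seen = set()
    kept = []
    for url, text in candidates:
        key = url.rstrip("/")
        if key in seen:
            continue
        seen.add(key)
        lowered = text.lower()
        if any(t in lowered for t in include_terms) and not any(
            t in lowered for t in exclude_terms
        ):
            kept.append((url, text))
    return kept


# Hypothetical sample inputs standing in for a fetched feed and web page.
RSS = """<rss version="2.0"><channel>
<item><title>Acme Library opens new branch</title><link>http://example.org/a</link></item>
<item><title>Weather update</title><link>http://example.org/b</link></item>
</channel></rss>"""
HTML = (
    '<p><a href="http://example.org/a/">Acme Library opens new branch</a> '
    '<a href="http://example.org/c">Acme jobs board</a></p>'
)

candidates = rss_candidates(RSS) + html_candidates(HTML)
selected = select(candidates, include_terms=["acme"], exclude_terms=["jobs"])
print(selected)
```

In this sketch the broad step (gathering every feed item and every link) favours exhaustiveness, while the include and exclude terms favour specificity. Anything that passes would still be a candidate for manual curation, not a final selection.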

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial License.