Building collections in IRs from external data sources
Abstract
This presentation will explore the process of building research publication collections in a DSpace-based institutional repository (IR) using external data sources such as PubMed, IEEE Xplore, and Web of Science. Obtaining data from external sources serves as an alternative to author self-deposit, which has yet to become common practice in institutional repositories. This approach aligns with the current trend in metadata cataloging, which emphasizes a shift from item-by-item cataloging to batch processing of metadata, repurposing metadata across different systems and communities, and providing value-added data services to students and faculty through the IR.
The presentation will discuss methods for batch transforming, enhancing, and transferring over 720 student and faculty publications from Medline format in PubMed to Dublin Core (DC) format in DSpace. In the PubMed-DSpace project, PubMed XML data is mapped and transformed into DCXML, exported and enriched in Excel, divided into separate departmental collections, and batch-loaded into the DSpace server. Topics covered will include project planning, workflow management, and record prototype creation based on user needs.
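As a rough illustration of the mapping step, the sketch below converts one PubMed XML record into a `dublin_core.xml` fragment. It is not the project's actual transform: the field mapping, the sample record, and the helper names (`add_dcvalue`, `pubmed_article_to_dc`) are all assumptions made for this example, and the output layout assumes the `dcvalue` element used by the DSpace Simple Archive Format.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed PubMed record (placeholder data, not a real article).
SAMPLE_PUBMED_XML = """\
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <Journal>
          <Title>Example Journal</Title>
          <JournalIssue><PubDate><Year>2008</Year></PubDate></JournalIssue>
        </Journal>
        <ArticleTitle>An example article title.</ArticleTitle>
        <Abstract><AbstractText>Example abstract text.</AbstractText></Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
        </AuthorList>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Example Term</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def add_dcvalue(root, element, qualifier, text):
    """Append one <dcvalue> child, skipping empty or missing values."""
    if text is None or not text.strip():
        return
    node = ET.SubElement(root, "dcvalue", {"element": element, "qualifier": qualifier})
    node.text = text.strip()


def pubmed_article_to_dc(article):
    """Map a <PubmedArticle> element to a <dublin_core> element.

    The choice of DC elements and qualifiers below is an assumed mapping.
    """
    citation = article.find("MedlineCitation")
    dc = ET.Element("dublin_core")
    add_dcvalue(dc, "title", "none", citation.findtext("Article/ArticleTitle"))
    for author in citation.findall("Article/AuthorList/Author"):
        last = author.findtext("LastName")
        fore = author.findtext("ForeName")
        if last:
            add_dcvalue(dc, "contributor", "author", f"{last}, {fore}" if fore else last)
    add_dcvalue(dc, "date", "issued",
                citation.findtext("Article/Journal/JournalIssue/PubDate/Year"))
    add_dcvalue(dc, "description", "abstract",
                citation.findtext("Article/Abstract/AbstractText"))
    add_dcvalue(dc, "relation", "ispartof", citation.findtext("Article/Journal/Title"))
    for descriptor in citation.findall("MeshHeadingList/MeshHeading/DescriptorName"):
        add_dcvalue(dc, "subject", "mesh", descriptor.text)
    pmid = citation.findtext("PMID")
    if pmid:
        add_dcvalue(dc, "identifier", "other", f"PMID:{pmid}")
    return dc


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_PUBMED_XML)
    for article in root.findall("PubmedArticle"):
        print(ET.tostring(pubmed_article_to_dc(article), encoding="unicode"))
```

Real PubMed records carry more structure than this sample (multi-part abstracts, collective author names, month and day in the publication date), so a production mapping would need to handle those cases as well.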
Technical details will include selecting metadata fields, mapping Medline to Dublin Core, conducting name authority checks, enriching content with additional DOIs and links, adding descriptions, copyright information, and peer-review status, normalizing data, and performing accuracy and consistency checks. The presentation will also detail the implementation and customization of an add-on to facilitate batch data import into DSpace.
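The enrichment and checking steps can be sketched in a few lines. The example below is only an assumption about how such checks might look, not the project's actual tooling: the authority list, the record layout, and the function names are hypothetical, and the DOI pattern is a common heuristic rather than a full validator.

```python
import re

# Heuristic pattern for a DOI embedded in free text (assumption, not a full validator).
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")

# Hypothetical name authority list: known variant -> preferred form.
NAME_AUTHORITY = {
    "doe j": "Doe, Jane",
    "doe, j.": "Doe, Jane",
    "doe, jane": "Doe, Jane",
}


def normalize_doi(raw):
    """Return the bare DOI in lowercase, or None if no DOI is found."""
    if not raw:
        return None
    match = DOI_PATTERN.search(raw.strip())
    if not match:
        return None
    # Drop trailing punctuation picked up from citations, then lowercase.
    return match.group(0).rstrip(".,;").lower()


def doi_link(doi):
    """Build a resolver link for a bare DOI."""
    return f"https://doi.org/{doi}"


def authorized_name(name):
    """Return the preferred form of a name, or None if it needs manual review."""
    key = " ".join(name.lower().split())
    return NAME_AUTHORITY.get(key)


def check_record(record):
    """Return a list of problems found in one record (a plain dict)."""
    problems = []
    if not record.get("title", "").strip():
        problems.append("missing title")
    for name in record.get("authors", []):
        if authorized_name(name) is None:
            problems.append(f"unverified author: {name}")
    doi = record.get("doi")
    if doi and normalize_doi(doi) is None:
        problems.append(f"unparseable DOI: {doi}")
    return problems
```

Records that return an empty list pass these basic checks; anything else would go back to a cataloger for review before loading.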
The presentation will address challenges associated with adding institutional research output through this approach, including weighing the pros and cons of different methods for transforming Medline to DC, acquiring data and recruiting content, balancing metadata granularity with generality, selecting appropriate subject types and identifiers, enhancing content, and ensuring copyright compliance. It will also cover collecting data from IEEE Xplore and Web of Science, enhancing it in spreadsheets, and batch loading it into separate departmental collections in DSpace. Finally, the potential for adding data from other external databases and the open web to the IR will be discussed.
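Splitting spreadsheet data into departmental batches might look like the following sketch. It assumes the enriched spreadsheet has been exported as CSV with a `department` column; the column names, the sample rows, and the `Unassigned` fallback are hypothetical choices for this example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export of an enriched spreadsheet (placeholder rows).
SAMPLE_CSV = """\
title,department,doi
Example paper A,Biology,10.1000/a
Example paper B,Chemistry,10.1000/b
Example paper C,Biology,10.1000/c
"""


def split_by_department(csv_text, column="department"):
    """Group CSV rows by department so each group can be loaded separately."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Rows with no department go to a review bucket instead of being dropped.
        department = (row.get(column) or "").strip() or "Unassigned"
        groups[department].append(row)
    return dict(groups)


if __name__ == "__main__":
    for department, rows in split_by_department(SAMPLE_CSV).items():
        print(department, [row["title"] for row in rows])
```

Each group could then be written out as its own batch and loaded into the matching departmental collection.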