Tutorial: Catmandu – a (meta)data toolkit

Full day tutorial, Monday, September 14, 2015, 9:00 AM – 5:00 PM.

Aims, scope and learning objectives of the tutorial

Catmandu provides an ETL suite of software modules to ease the import, storage, retrieval, export and transformation of (meta)data records. After a short introduction to Catmandu and its features, we will present the command line interface (CLI) and the domain specific language (DSL). Participants will be guided to get data from different sources via APIs, to transform data records to a common data model, to store and index them in ElasticSearch or MongoDB, to query data from stores and to export it into different formats. The aim of this tutorial is to generate linked data from library sources.

Full description

Catmandu is a data processing toolkit developed as part of the LibreCat project. Catmandu provides a command line client and a suite of Perl modules to ease the import, storage, retrieval, export and transformation of data.

With Catmandu we want to make it easier to extract and transform data in formats and over protocols such as JSON, YAML, CSV, MARC, MAB, OAI-PMH, Z39.50, SRU, RDF and many more. A small transformation language called Fix simplifies the communication between domain specialists and programmers about how data should be transformed.
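To illustrate why a small, declarative transformation language helps, here is a minimal Python sketch. It is not Fix syntax and not part of Catmandu: the rule names (move_field, upcase, remove_field) and the sample record are hypothetical, chosen only to show how a transformation can be written as a short list of readable rules that a domain specialist can review without reading the underlying program.

```python
# Illustrative sketch only: the rule names below are hypothetical
# and do not reproduce the Fix language or any Catmandu API.
from copy import deepcopy


def move_field(old, new):
    """Build a rule that renames a field if it is present."""
    def rule(record):
        if old in record:
            record[new] = record.pop(old)
        return record
    return rule


def upcase(field):
    """Build a rule that upper-cases a string field."""
    def rule(record):
        if isinstance(record.get(field), str):
            record[field] = record[field].upper()
        return record
    return rule


def remove_field(field):
    """Build a rule that drops a field if it is present."""
    def rule(record):
        record.pop(field, None)
        return record
    return rule


def apply_rules(record, rules):
    """Apply the rules in order to a copy of the record."""
    result = deepcopy(record)
    for rule in rules:
        result = rule(result)
    return result


# The transformation itself reads as a short list of steps.
RULES = [
    move_field("dc_title", "title"),
    upcase("language"),
    remove_field("internal_note"),
]

record = {"dc_title": "Catmandu", "language": "en", "internal_note": "draft"}
print(apply_rules(record, RULES))
```

The list of rules is the part a domain specialist and a programmer would discuss; the helper functions stay the same from one project to the next.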

Combine Catmandu modules with web application frameworks such as PSGI/Plack, document stores such as MongoDB and full-text indexes such as ElasticSearch to create a rapid development environment for digital library services such as institutional repositories and search engines.

Where do we use it?

Catmandu is used in the LibreCat project for several applications. We have more than 60 Catmandu projects available in the LibreCat organization on GitHub.

Why do we use it?

When you create a search engine, one of your first tasks will be to import data from various sources, map the fields to a common data model and post the result to a full-text search engine. Modules such as WebService::Solr or ElasticSearch provide easy access to your favorite document stores, but a lot of boilerplate code needs to be written to create the connections, massage the incoming data into the correct format, and validate, upload and index the data in the database. Next morning you are asked to provide a fast dump of records into an Excel worksheet. After some fixes are applied you are asked to upload it into your database. Again you hit Emacs or Vi and provide an ad-hoc script.

In our LibreCat group we saw this workflow over and over. We tried to abstract this problem into a set of tools which can work with library data formats such as MARC, Dublin Core and EndNote, protocols such as OAI-PMH and SRU, and repositories such as DSpace and Fedora. In data warehouses these processes are called ETL (Extract, Transform, Load). Many tools currently exist for ETL processing but none address typical library data models and services.
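The three ETL steps can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not Catmandu code: the CSV sample, the column names and the target fields (_id, title, creator) are invented for the example, and the "load" step simply writes one JSON document per line to an in-memory buffer instead of talking to a real search engine.

```python
import csv
import io
import json

# Hypothetical source data; a real import would read a file or call an API.
CSV_SOURCE = """id,Title,Author
1,Moby Dick,Herman Melville
2,Walden,Henry David Thoreau
"""


def extract(text):
    """Extract: read CSV rows as dictionaries keyed by column name."""
    return csv.DictReader(io.StringIO(text))


def transform(rows):
    """Transform: map source columns onto an assumed common data model."""
    for row in rows:
        yield {
            "_id": row["id"],
            "title": row["Title"].strip(),
            "creator": [row["Author"].strip()],
        }


def load(records, out):
    """Load: write one JSON document per line and return the record count."""
    count = 0
    for record in records:
        out.write(json.dumps(record) + "\n")
        count += 1
    return count


buffer = io.StringIO()
loaded = load(transform(extract(CSV_SOURCE)), buffer)
print(loaded, "records written")
print(buffer.getvalue(), end="")
```

Each ad-hoc script of the kind described above repeats some variant of these three functions; a toolkit replaces them with reusable importers, transformations and exporters.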

With an international group of systems librarians the open source toolkit Catmandu was created to provide best practices with a common set of tools. Many digital library environments and library vendors currently use Catmandu to transform library data as input for search engines, institutional repositories, data cleaning and data publication.

Target audience and prerequisite knowledge

Audience: systems librarians, data managers. Introductory level: basic knowledge of the Linux command line.

In this tutorial a VirtualBox virtual machine will be provided containing all the required tools. Participants need to have an installation of Oracle VirtualBox available on their laptop before attending the tutorial.

Presenters

Nicolas Steenlant is working as a programmer and institutional repository manager at Ghent University Library. He is the lead developer of the LibreCat project and of Catmandu.

Patrick Hochstenbach is working as a digital architect at Ghent University Library. In the past he worked at Los Alamos National Laboratory and Lund University. Patrick was involved in library standards such as OpenURL, OAI-PMH and MPEG-21 and has created many library applications such as the SFX linking server (currently marketed by Ex Libris) and the OAI Static Repository gateway.