Tutorial: Dynamic Data Citation – Enabling Reproducibility in Evolving Environments
Half day tutorial, Monday, September 14, 2015, 1:30 PM – 5:00 PM.
Aims, scope and learning objectives of the tutorial
Reliably and efficiently identifying entire datasets, or subsets of them, in large and dynamically growing or changing data collections is a significant challenge for a range of research domains. To repeat an earlier study, or to apply data from an earlier study to a new model, we need to be able to identify precisely the subset of data that was used. Verbal descriptions of how the subset was created (e.g. by providing selected attribute ranges and time intervals) are hardly precise enough and do not support automated handling, while keeping redundant copies of the data in question does not scale to the big data settings encountered in many disciplines today. Furthermore, we need to be able to handle situations where new data is added or existing data is corrected or otherwise modified over time. Conventional approaches, such as assigning persistent identifiers to entire data sets, individual subsets or data items, are thus not sufficient.
In this half-day tutorial we will review the challenges identified above and discuss solutions that are currently being developed within the working group of the Research Data Alliance (RDA) on Data Citation: Making Dynamic Data Citeable. The approach is based on versioned and time-stamped data sources, with persistent identifiers being assigned to the time-stamped queries/expressions that are used for creating the subset of data. We will present and work with a prototype solution implementing this approach for CSV data. We will further review results from the first pilots evaluating the approach in different domains.
Introduction to data citation: This introduction will cover the basics of data citation in general and dynamic data citation in particular. It will explain why data citation is crucial for reproducible science and an essential building block of trust in scientific environments. We will describe how data citation is currently handled and explain why a new approach is needed. The relationships between data producers, consumers and repositories, and their expectations and responsibilities with respect to data, will be explained, as well as the interrelations of the different research areas within the scope of digital libraries.
Classical Data Citation: Data forms the basis of the results of many research publications, and thus needs to be referenced with the same accuracy as bibliographic data; only if data can be identified with high precision can it be reused, validated, verified and reproduced. Citing a specific data set is not trivial, however: it exists in a vast plurality of specifications and instances, can be potentially huge in size, and its location might change. We will provide an overview of existing approaches to overcoming these challenges.
Dynamic Data Citation: Researchers need to be able to specify precisely which data subset they used in an experiment, even if the data has changed in the meantime. The approaches currently used for referencing and citing data cannot be applied directly to dynamic data for several reasons. We will explain in detail why the research workflows used in various disciplines require a more flexible approach and show how dynamic data citation overcomes the limitations of classical data citation. In addition, the audience will learn about the current research questions in the area and how existing challenges are being tackled. This will lead to the presentation of the recommendations elaborated by the Working Group on Data Citation of the Research Data Alliance (RDA), which are based on time-stamped and versioned data, with citations realized via time-stamped queries.
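The idea of citing a time-stamped query against versioned data can be illustrated with a minimal Python sketch. It is not the prototype presented in the tutorial: the table layout, the integer logical timestamps, the in-memory query store and the "pid:" identifier scheme are all assumptions made for illustration only.

```python
import hashlib
import sqlite3

# Hypothetical versioned table: rows are never overwritten, only closed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings "
    "(station TEXT, temp REAL, valid_from INTEGER, valid_to INTEGER)"
)

def insert(station, temp, now):
    conn.execute("INSERT INTO readings VALUES (?, ?, ?, NULL)", (station, temp, now))

def update(station, temp, now):
    # Close the current version at time `now`, then add the new version.
    conn.execute(
        "UPDATE readings SET valid_to = ? WHERE station = ? AND valid_to IS NULL",
        (now, station),
    )
    insert(station, temp, now)

def run_query(where, params, as_of):
    # Restrict the user's filter to the versions that were valid at `as_of`.
    # `where` is trusted text in this sketch; do not pass untrusted input.
    sql = (
        f"SELECT station, temp FROM readings WHERE ({where}) "
        "AND valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY station"
    )
    return conn.execute(sql, (*params, as_of, as_of)).fetchall()

def digest(rows):
    return hashlib.sha256(repr(rows).encode()).hexdigest()

query_store = {}  # assumed stand-in for a persistent query store

def cite(where, params, as_of):
    # Store the query, its timestamp and a hash of the result under an identifier.
    key = f"{where}|{params}|{as_of}".encode()
    pid = "pid:" + hashlib.sha256(key).hexdigest()[:12]  # illustrative scheme only
    query_store[pid] = {
        "where": where,
        "params": params,
        "as_of": as_of,
        "digest": digest(run_query(where, params, as_of)),
    }
    return pid

def resolve(pid):
    # Re-execute the stored query as of its timestamp and verify the result.
    entry = query_store[pid]
    rows = run_query(entry["where"], entry["params"], entry["as_of"])
    if digest(rows) != entry["digest"]:
        raise ValueError("re-executed subset does not match the stored hash")
    return rows

insert("A", 20.5, now=1)
insert("B", 25.0, now=1)
pid = cite("temp > ?", (20.0,), as_of=2)
update("A", 19.0, now=3)  # a later correction
insert("C", 30.0, now=4)  # newly added data

print(resolve(pid))                            # the subset as originally cited
print(run_query("temp > ?", (20.0,), as_of=5)) # the same filter on current data
```

Because only the query and its timestamp are stored, no copy of the subset has to be kept; the cited subset is reconstructed on demand from the versioned data.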
Hands-on exercise: Based on a concrete use case we will demonstrate how existing data can be made citable even if the data is dynamic. The audience will have the opportunity to create citable subsets of their own CSV / Excel files or even use data stored in SQL databases. The prototype which we provide allows retrieving specific versions of a subset exactly as they existed at a given point in time. In addition we will demonstrate how versions of defined subsets can be retrieved at different points in time and compared with each other, thus allowing researchers to measure the effects of data changes on their specific subsets directly.
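Comparing the same subset at two points in time can be sketched as follows. This is a simplified illustration in plain Python rather than the tutorial's prototype; the `Version` record, the integer timestamps and the sample values are hypothetical.

```python
from typing import NamedTuple, Optional

class Version(NamedTuple):
    key: str
    value: float
    valid_from: int
    valid_to: Optional[int]  # None means the version is still current

def subset_as_of(history, predicate, t):
    # Select the versions that were valid at time t and match the filter.
    return {
        v.key: v.value
        for v in history
        if v.valid_from <= t
        and (v.valid_to is None or v.valid_to > t)
        and predicate(v)
    }

def diff_subsets(old, new):
    # Report which records entered, left, or changed within the subset.
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

def in_range(v):
    return v.value >= 8.0

history = [
    Version("s1", 10.0, 1, 5),     # corrected at t=5
    Version("s1", 12.0, 5, None),
    Version("s2", 8.0, 1, None),
    Version("s3", 15.0, 3, 6),     # deleted at t=6
    Version("s4", 9.0, 6, None),   # added at t=6
]

before = subset_as_of(history, in_range, 4)
after = subset_as_of(history, in_range, 7)
print(diff_subsets(before, after))
```

The diff shows which records changed between the two timestamps for one fixed subset definition, which is the kind of comparison the exercise is meant to support.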
Implications for existing systems: In the final block we will present our experience from different use cases and explain the steps needed to adopt the approach. This will allow the audience to learn how data citation could be implemented within their own workflows and to understand the implications of implementing dynamic citation mechanisms. This will lead to a discussion on how specific data citation challenges can be addressed in the participants’ various settings.
Expected learning outcomes
The tutorial participants will gain an understanding of the importance of data citation in general, and especially when dealing with highly dynamic data. Topics covered include:
- Motivations and challenges of dynamic data citation.
- Involved stakeholders and their views on dynamic data citation.
- How data is cited today: the Joint Declaration of Data Citation Principles, requirements, guidelines, metadata, locators and persistent identifiers, approaches to naming schemes and properties.
- Necessary steps to render existing data citable and provide citable subsets in dynamic environments.
- Hands-on experience of creating a citable subset from a dynamic data source, using own or provided data.
- The implications of rendering dynamic data citable.
Tutorial Level: introductory.
Intended audience: The aim of this tutorial is to raise awareness of the need to reliably cite dynamic data and to provide hands-on experience with the advantages of a dynamic citation approach. It provides an overview of existing approaches, discusses the advantages and disadvantages of current standards, and includes a hands-on demo of a dynamic data citation prototype provided as an open-source solution.
Stefan Pröll is a researcher at SBA Research. His primary research focus lies on digital preservation, especially on data citation, security aspects of digital archives and scientific processes. Further areas of interest are databases and information systems. Stefan has been working on the FP7 projects APARSEN, TIMBUS and SCAPE, focusing on data citation of dynamic data sets, security and workflow preservation. Before he joined SBA in April 2011, he was working in international organizations in the area of Web development, Linux server and database administration.
Andreas Rauber is Associate Professor at the Department of Software Technology and Interactive Systems (ifs) at the Vienna University of Technology (TU-Wien). He is furthermore president of AARIT, the Austrian Association for Research in IT, and a Key Researcher at Secure Business Austria (SBA-Research). He is co-chairing the RDA Working Group on Data Citation together with Ari Asmi and Dieter van Uytvanck. He received his MSc and PhD in Computer Science from the Vienna University of Technology in 1997 and 2000, respectively. In 2001 he joined the National Research Council of Italy (CNR) in Pisa as an ERCIM Research Fellow, followed by an ERCIM Research position at the French National Institute for Research in Computer Science and Control (INRIA), at Rocquencourt, France, in 2002. From 2004 to 2008 he was also head of the iSpaces research group at the eCommerce Competence Center (ec3).