Technical Solutions to Advance Evaluation and Replication in the Social Sciences

What’s New, What's Next

Wednesday, August 29, 2018, Boston, MA

Agenda

Overview

Journal editors are increasingly encouraging or requiring authors to make their research transparent, and to share the data that underpin their articles. Given the diverse types of data that social scientists collect and generate through their research -- including those derived from interactions with human participants -- authors will need increasingly sophisticated solutions to share their data and other supplemental materials.

Domain repositories, in collaboration with software developers, data scientists, and others, are developing new and powerful ways to improve data sharing -- techniques and technologies with which it will be useful for journal editors to be familiar. Examples include: improved data citation infrastructure; better methods to provide and preserve code and ensure its continuing usability; and more sophisticated functions to allow for the safe sharing of sensitive data. This workshop will provide information about some of these new developments.

The workshop is the fourth in a series on "Developing and Implementing Data Policies: Conversations Between Journals and Data Repositories." The series is designed to promote discussion among social science journal editors, personnel from data repositories, data librarians, and other relevant constituencies about current approaches to data citation, management, and archiving.

As with previous events in the series, the workshop is being organized and led by various members of the Data Preservation Alliance for the Social Sciences (Data-PASS, www.data-pass.org), a consortium of social science data repositories: the Institute for Quantitative Social Science (IQSS) at Harvard University; the Howard W. Odum Institute for Research in Social Science at the University of North Carolina-Chapel Hill; the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan; and the Qualitative Data Repository (QDR) at Syracuse University.

Light Breakfast (9:00 – 9:30am)
Introductions (9:30 – 10:00am)

Introductions, workshop outcomes (reinforcing points above -- assumptions and principles under which we’re operating), initial questions, distribution of attendees list (for networking and continued discussion)

Integrating Replication Tools with Data Repositories (10:00 – 10:45am)

Presentation by Matthew K. Lau and Mercè Crosas, Harvard University – Slides

Recent findings of low levels of reproducibility in research have been a wake-up call to scientists. In addition to the challenges of making study details, data, and metadata available and accessible, the rapid rise of custom analytical software (such as R, MATLAB, and Python scripts) is quickly becoming a significant challenge as well. The analytical scripts used in scientific research are often informal and written without following software best practices, which is leading to a proliferation of irreproducible software. Given the realities of the demands placed on scientists, we have investigated the use of "data provenance" (i.e., a formalized record of a computational process) to produce tools that help researchers improve the transparency and reproducibility of the analytical software associated with a research project. This talk will present the concept of data provenance and how it has been applied to create tools, such as an automatic project "capsule" creation program (encapsulator) and a code cleaning package for R (Rclean), to aid in the process of sharing research through public project repositories like Dataverse.

Coffee Break (10:45 – 11:00am)
Capsule Model for Open Science with Restricted Data (11:00 – 11:45am)

Presentation by Beth Plale, Indiana University Bloomington – Slides

Open science is yielding active efforts to make data from research available for broader use. But data often carry restrictions (privacy or sensitivity restrictions, whether imposed by statute or otherwise) that can limit how broadly they can be made available. In this talk we propose that there are alternative approaches along the spectrum of data sharing options that offer more control over data than full sharing yet are more contributory than no sharing at all. We offer the controlled compute environment, or capsule, as a viable new approach for computational analysis of data that have restrictions. The compute environment increases the range of possibilities for facilitating science through data reuse, an objective of open science. This talk frames the capsule and reports on experience with one such capsule used in HathiTrust for research with copyrighted materials.

New Developments at ICPSR for Improving Replication and Evaluation (11:45am – 12:30pm)

Presentation by Jared Lyle, Inter-university Consortium for Political and Social Research – Slides

This presentation will report on three recent activities at ICPSR to improve replication and evaluation in the social sciences. The first activity is facilitating and opening access to restricted-use data. The second activity is improving discovery of data-related publications. The third activity is building communities around archived data by enabling users to comment and contribute.

Lunch (12:30 – 1:30pm)
The Confirmable Reproducible Research (CoRe2) Environment: Linking Tools to Promote Computational Reproducibility (1:30 – 2:15pm)

Presentation by Thu-Mai Christian, H. W. Odum Institute for Research in Social Science – Slides

Over the past three years, the Odum Institute has worked with journals to incorporate data curation and verification into their manuscript publication workflow as part of the implementation and enforcement of data replication policies. The additional steps required to execute such a workflow, along with the associated expertise, tools, and labor hours, raise the question of whether adopting a data replication policy is feasible for journals that operate in tightly resourced environments. This question has prompted the Odum Institute to examine the existing integrated manuscript publication and data curation and verification process and to identify opportunities for streamlining the workflow. The result is the development of the Confirmable Reproducible Research (CoRe2) environment, which will connect and coordinate systems, standards, and stakeholders to reduce or eliminate the encumbrances within the current workflow and yield gains in efficiency. This presentation will describe how CoRe2 addresses the challenges of integrating manuscript publication and data verification workflows and offers opportunities to promote computationally reproducible research.

Annotation for Transparent Inquiry (ATI) (2:15 – 3:00pm)

Presentation by Colin Elman and Diana Kapiszewski, Qualitative Data Repository – Slides

The Qualitative Data Repository (QDR), the software non-profit Hypothesis (https://hypothes.is/), and Cambridge University Press have partnered to develop and test a new approach to achieving transparency in qualitative and multi-method research: Annotation for Transparent Inquiry (ATI). Building on “active citation,” an earlier approach developed by Andrew Moravcsik, ATI employs “open annotation,” which allows for the generation, sharing, and discovery of digital annotations across the web. ATI enables scholars to develop data supplements that can be linked directly to digital publications on multiple platforms. An ATI Data Supplement contains a set of digital annotations comprising "analytic notes" that discuss data generation and analysis, along with excerpts from underlying data sources. Authors may also include the data sources themselves. The presentation will briefly describe ATI and report on preliminary results from the first part of a pilot study funded by the Robert Wood Johnson Foundation.