Persistent Identifiers (PIDs) and Data Management

Experiences and Use Cases

The PIDs and Data Management initiative at D2I focuses on defining emerging PID use cases and on developing and testing solutions that benefit specific research communities. The team collaborates with the working and interest groups of the Research Data Alliance (RDA) and with many collaborators around the world. The following use cases were tested with the PID framework.

Streaming Sensor Data 

This use case focused on air sensor data collected from devices installed mostly in Taiwan by the air quality microsensing project (https://pm25.lass-net.org/), a collaboration between Taiwanese academic institutions, the computer industry, and the Taipei city government. The streaming data for our use case was collected in 2017.

To enable easy referencing and re-use from a repository, PIDs were assigned to daily sensor feeds of the data.
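The idea of one PID per daily feed can be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout, the function names, and the prefix `20.5000.EXAMPLE` are hypothetical, and the deterministic suffix is a local stand-in for a PID that would in practice be minted by a PID service.

```python
from collections import defaultdict
from datetime import datetime
import uuid

# Hypothetical prefix used only for this sketch.
HYPOTHETICAL_PREFIX = "20.5000.EXAMPLE"


def group_by_day(readings):
    """Group sensor readings into daily feeds keyed by ISO date."""
    feeds = defaultdict(list)
    for reading in readings:
        day = datetime.fromisoformat(reading["timestamp"]).date().isoformat()
        feeds[day].append(reading)
    return dict(feeds)


def mint_daily_pids(readings, prefix=HYPOTHETICAL_PREFIX):
    """Return {day: pid} with one identifier per daily feed.

    The suffix is derived from the day so the sketch is repeatable;
    a real deployment would ask a PID service for the identifier.
    """
    pids = {}
    for day in sorted(group_by_day(readings)):
        suffix = uuid.uuid5(uuid.NAMESPACE_URL, f"airbox-feed/{day}").hex[:20]
        pids[day] = f"{prefix}/{suffix}"
    return pids
```

Calling `mint_daily_pids` on a list of readings whose `timestamp` values span two days returns two identifiers, one for each day.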

The use case relies on an RDA collection API. A collection API is software that manages collections and their member objects in a specified data structure and provides flexible search functionality over the data objects. Metadata is captured and made available via the Data Type Registry, a component of the PID framework developed in collaboration with RDA.

For more information about the collection API:
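A collection with member objects and a simple search can be sketched in Python as follows. The class and field names are hypothetical and kept in memory for illustration; they are not the interface defined by the RDA Collections specification.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """One data object in a collection, identified by its PID."""
    pid: str
    metadata: dict


@dataclass
class Collection:
    """A collection with its own PID and a list of member objects."""
    pid: str
    description: str
    members: list = field(default_factory=list)

    def add(self, member):
        self.members.append(member)

    def search(self, **criteria):
        """Return members whose metadata equals every given criterion."""
        return [
            m for m in self.members
            if all(m.metadata.get(key) == value for key, value in criteria.items())
        ]
```

For example, after adding members whose metadata carries a `date` key, `collection.search(date="2017-03-01")` returns only the members for that day.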

https://github.com/RDACollectionsWG/specification

For more information, data and code visit:

https://github.com/Data-to-Insight-Center/SEADTrain

Rice Genomics Galaxy Workflows

The PRAGMA-RDA Data Service Galaxy application brings persistent IDs and registration to data objects generated by scientific analysis that is carried out on cloud virtual machines (VMs) in PRAGMA (http://www.pragma-grid.net/). The objective of the project is to improve sharing of data objects, specifically those from genomic analyses by the International Rice Research Institute (http://irri.org/) community. This service is designed to be reusable in other cases where VMs are used for analysis and PIDs are used to enhance sharing and reusability of results.

As part of our research at IU, we explored storing provenance as part of the PID Kernel Information record. PID Kernel Information is a small amount of information stored at the resolver (a Local Handle Server) in the PID record. The provenance trace below left shows the provenance DAG (directed acyclic graph, a finite directed graph with no directed cycles) as published in the PID Kernel Information. The provenance trace below right shows the full provenance trace of one Galaxy workflow execution. The PID backbone provenance trace emphasizes the derivation history among published data objects (DOs) to strengthen their trust and reusability.
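The derivation history among published DOs can be sketched in Python as a mapping from each PID to the PIDs it was derived from. The PIDs and function names here are hypothetical; the sketch only shows how such links form a DAG that can be ordered and walked, using `graphlib` from the standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical derivation links: each PID maps to the PIDs it was derived from.
derived_from = {
    "pid/raw-reads": [],
    "pid/aligned": ["pid/raw-reads"],
    "pid/variants": ["pid/aligned"],
    "pid/report": ["pid/variants", "pid/aligned"],
}


def derivation_order(links):
    """Return PIDs so that every object appears after its sources.

    Raises graphlib.CycleError if the links are not a DAG.
    """
    return list(TopologicalSorter(links).static_order())


def ancestors(pid, links):
    """Return every PID that the given object was directly or indirectly derived from."""
    seen = set()
    stack = list(links.get(pid, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(links.get(parent, []))
    return seen
```

Because only links between published objects are recorded, the graph stays much smaller than a full workflow trace while still answering the question of what a given object was derived from.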

   

For more information, data and code visit:

GitHub code:

https://github.com/Data-to-Insight-Center/RDA-PRAGMA-Data-Service/tree/master

https://github.com/Data-to-Insight-Center/PRAGMA-Data-Repository

GitHub wiki:

https://github.com/Data-to-Insight-Center/RDA-PRAGMA-Data-Service/wiki/Welcome-to-PRAGMA-Data-Service-Prototype

https://github.com/Data-to-Insight-Center/RDA-PRAGMA-Data-Service/wiki/RDA-PRAGMA-Data-Identity-Service-API-Documents

RPID Testbed

The RPID testbed is intended to stimulate and enable evaluation of powerful new complementary outputs of the Research Data Alliance (RDA) in PID-oriented data management. The testbed is responsive to data-driven priorities in science and education, specifically as part of the cyberinfrastructure ecosystem that accelerates a broad spectrum of data-intensive research. We believe the advancements developed and tested here have the transformative magnitude to stimulate an entire ecosystem of new discovery services for research data. The testbed is open for research, education, non-profit, or pre-competitive use.

The RPID testbed consists of the following services and software: the Data Type Registry, the Handle System, and the PID Kernel Information framework.

The Data Type Registry (DTR) provides a way to register detailed and structured descriptions of data objects and encourages the use of established data types. A registered description includes fields such as Identifier, name, description, standards, issuer, provenance, contributors, creationDate, lastModificationDate, representationsAndSemantics, and properties.

The Handle System is used in the RPID testbed to assign PIDs (Handles) to data objects and to resolve those PIDs to the resources they identify.

PID Kernel Information is a small amount of information stored in a PID record. Such minimal information can help make data objects FAIR (Findable, Accessible, Interoperable, Reusable) and less dependent on the repository system. We take the FAIR principles as an inspiration and a guide, and explore how far PID Kernel Information can aid in implementing FAIR.
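A kernel record of this kind can be sketched in Python as a small dictionary. The field names, the example values, and the 1024-byte size limit are assumptions made for illustration; they are not the profile recommended by the PID Kernel Information working group.

```python
import json


def make_kernel_record(pid, object_type_pid, location, checksum, created):
    """Build a small, illustrative kernel record for one data object.

    All field names are hypothetical stand-ins for a real kernel profile.
    """
    return {
        "pid": pid,
        "digitalObjectType": object_type_pid,
        "digitalObjectLocation": location,
        "checksum": checksum,
        "dateCreated": created,
    }


def is_minimal(record, max_bytes=1024):
    """Check that the serialized record stays under an arbitrary size limit."""
    return len(json.dumps(record).encode("utf-8")) <= max_bytes
```

A client that resolves the PID could read the type and location from such a record and decide how to handle the object before contacting the repository.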

Find the Data Type Registry service below.

http://rpid-dtr.grid.iu.edu:8080/

Find the RPID handle service URL below. 

http://rpid-handle.grid.iu.edu:8080/

See also the case statement from the RDA Data Type Registry working group below.

https://www.rd-alliance.org/sites/default/files/case_statement/DTR2%20Case%20Statement_Final.pdf

Find the PID Kernel Information WG Case Statement below.

https://www.rd-alliance.org/group/pid-kernel-information-wg/case-statement/pid-kernel-information-wg-case-statement 

For more information, data and code visit https://rpidproject.github.io/rpid or email rpid-l@iu.edu.

Data Management Studies

Data repositories that evolved in support of the applications are described below.

MongoDB databases

In the PRAGMA project, the data repository is designed to manage scientific data objects across the boundaries between different domains. Our data repository presents a convenient and clearly defined interface that can host both long-tail data objects and large data sets.

The repository is implemented with MongoDB, which provides a sharding feature that distributes the database across different machines while maintaining replicas on other machines. In addition, with MongoDB as the backend, we use a single framework to store both metadata and data, and we let users decide which information they want to keep as data object metadata. For more details about our PRAGMA data repository, see the GitHub page below.

https://github.com/Data-to-Insight-Center/PRAGMA-Data-Repository/tree/master

Forthcoming study of DOIP API interface

DOIP (Digital Object Interface Protocol) is a protocol and authentication mechanism used to interact with digital objects through a connection to the digital object server. Our extended PID system supports two types of transport protocol: DOIP and HTTP (Hypertext Transfer Protocol). Most messages using the DOIP protocol are represented as sets of key-value pairs with a single UTF-8 text label. Key-value pairs are ASCII text strings delimited by ampersands (&). Each pair is split into a key and a value by an equals sign (=). All non-ASCII characters, equals signs, and ampersands in the keys or values are UTF-8 encoded and %-escaped. The message grammar, taken from the standard documentation for the protocol, is shown in the listing below.

{
<message> := <messagetype> ':' <segment> <newline>
<segment> :=
<segment> := <kvpair>
<segment> := <segment> '&' <kvpair>
<kvpair> := <key>
<kvpair> := <key> '=' <value>
<messagetype> := <encodedtoken>
<key> := <encodedtoken>
<value> := <encodedtoken>
}
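The encoding rules described above can be sketched in Python as follows. This is a minimal illustration, not the protocol's reference implementation: the function names and the `CREATE` message type used in the usage note are hypothetical, and `urllib.parse.quote` with `safe=""` escapes more characters (for example spaces and colons) than the three classes named in the text.

```python
from urllib.parse import quote, unquote


def encode_token(text):
    """UTF-8 encode and %-escape a token, including '=', '&' and ':'."""
    return quote(text, safe="")


def encode_message(message_type, pairs):
    """Build '<messagetype>:<segment>\\n' from (key, value) pairs.

    A value of None produces a bare key with no '=' sign.
    """
    segment = "&".join(
        encode_token(key) if value is None
        else f"{encode_token(key)}={encode_token(value)}"
        for key, value in pairs
    )
    return f"{encode_token(message_type)}:{segment}\n"


def decode_message(line):
    """Parse one encoded line back into (message_type, [(key, value), ...])."""
    message_type, _, segment = line.rstrip("\n").partition(":")
    pairs = []
    if segment:
        for kvpair in segment.split("&"):
            key, sep, value = kvpair.partition("=")
            pairs.append((unquote(key), unquote(value) if sep else None))
    return unquote(message_type), pairs
```

For example, `encode_message("CREATE", [("title", "rice & genome")])` escapes the ampersand inside the value so that it cannot be confused with the pair delimiter, and `decode_message` reverses the encoding.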

Because our extended PID system supports both DOIP and HTTP, we need to determine which protocol is better for our use case. That is the purpose of this study. We ran the experiments described below and chose the protocol based on the findings.

Experimental methods and findings:

As part of the baseline study, we measured the network behavior of both the HTTP and DOIP protocols. Network behavior is measured as the overall response time minus the service time, for both send and receive. The measurement uses a 5,000-event workload. Over the entire workload, HTTP has mean = 7.42 ms and stdev = 4.19 ms, and DOIP has mean = 3.92 ms and stdev = 3.76 ms. On this measure the DOIP protocol is almost twice as fast as HTTP, and it shows more stability given its smaller standard deviation. On the other hand, as a custom protocol, DOIP may be more problematic for widespread adoption.
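The measurement described above, response time minus service time summarized by mean and standard deviation, can be sketched in Python as follows. The function names are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def network_times(response_ms, service_ms):
    """Return per-event network time: response time minus service time."""
    if len(response_ms) != len(service_ms):
        raise ValueError("response and service lists must have the same length")
    return [response - service for response, service in zip(response_ms, service_ms)]


def summarize(samples):
    """Return (mean, sample standard deviation) of the samples in ms."""
    return mean(samples), stdev(samples)
```

Running `summarize(network_times(response_ms, service_ms))` once per protocol over the same workload gives the two (mean, stdev) pairs that are compared in the findings.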

Common Interfaces

We have developed a few user interfaces for the PRAGMA Data Service Prototype that help users understand the PRAGMA services easily. The access points are listed below.

Home page for IRRI data products:

http://202.90.159.39:8079/dataidentity/irri-index.html

The Genomics Analysis Tool Galaxy workflow:

http://202.90.159.39:8080/

PRAGMA Data Repository Search GUI:  http://202.90.159.39:8079/dataidentity/irri-search.html?DataTypePID=20.5000.347/1af9b7467412d3982998&DataTypeName=IRRI%20Rice%20Genomes%20tassel%20workflow

IU SEAD Cloud Discover Interface for Airbox data:

http://d2i-dev.d2i.indiana.edu:8081/iusc-azure-search/search.html

Data Publishing Workflow Services

Metadata and PID assignment

We used metadata and PID assignment in the “Streaming Sensor Data” and “Rice Genomics Galaxy Workflows” use cases. In both use cases, the RPID Testbed at Indiana University is used to create PIDs and to improve sharing and interoperability of scientific data objects by embedding the minimum metadata associated with each data object (called PID Kernel Information) in its PID record.

A reliable database can keep each PID together with the data object's other metadata (beyond the Kernel Information) to support large-scope searching, which lets PIDs adapt to users’ requirements. This database hosts and collects heterogeneous scientific DOs with associated metadata from varied disciplines. It is a single framework that stores both metadata and data and lets users decide which information they want to keep as data object metadata.

To make it easier to search for PIDs, we built a frontend server called the Discovery User Interface. On the search page, users can filter the DO list by criteria such as publication date, creation date, and title. The right side of the page shows a list of results with minimal metadata, including a PID URL. Find the Discovery User Interfaces below.

Streaming Sensor Data: 

http://d2i-dev.d2i.indiana.edu:8081/iusc-azure-search/search.html

Rice Genomics Galaxy Workflow: 

http://202.90.159.39:8079/dataidentity/irri-search.html?DataTypePID=20.5000.347/1af9b7467412d3982998&DataTypeName=IRRI%20Rice%20Genomes%20tassel%20workflow
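The kind of filtering offered on the search page can be sketched in Python as follows. The record layout, field names, and function name are hypothetical; the sketch only illustrates filtering by title and publication date and returning minimal metadata with a PID URL, not the code behind the interfaces above.

```python
from datetime import date


def filter_objects(records, title_contains=None, published_from=None, published_to=None):
    """Filter data object records and return minimal metadata for each hit.

    Each record is assumed to be a dict with 'title', 'publication_date'
    (a datetime.date) and 'pid_url' keys.
    """
    results = []
    for record in records:
        if title_contains and title_contains.lower() not in record["title"].lower():
            continue
        if published_from and record["publication_date"] < published_from:
            continue
        if published_to and record["publication_date"] > published_to:
            continue
        results.append({"title": record["title"], "pid_url": record["pid_url"]})
    return results
```

For example, `filter_objects(records, title_contains="rice", published_from=date(2017, 1, 1))` keeps only records published in 2017 or later whose title mentions rice.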

Provenance extraction

Provenance extraction is implemented in the “Rice Genomics Galaxy Workflows” use case. Our design employs a hands-off technique (data provenance capture) to gather information about a researcher’s rice genomics analysis while the analysis is running. With this technique, the information acquired during the analysis is compiled and combined with pre-analysis information that is available at the beginning of the analysis workflow. Such information includes who performed the analysis, when it was performed, and under what conditions.

The end goal of our design is to advance open access by making Rice Galaxy consistent with open access policy. To do this, we focus on each piece of data and valuable information that emerges from workflow runs and is deemed important. This data and information must be retained and shared with others, subject to reasonable restrictions. This is a highly selective approach to provenance capture, and one that makes our work unique. We briefly outline the solution here and identify resources for those interested in pursuing the topic in more detail.

For more information about the data publishing workflow services, data, and code, visit the links below.

Streaming Sensor Data Metadata and PID assignment:

GitHub code: 

https://github.com/Data-to-Insight-Center/SEADTrain/tree/master/sead-client

GitHub Data:

https://github.com/Data-to-Insight-Center/SEADTrain/tree/master/data

GitHub wiki:

https://github.com/Data-to-Insight-Center/SEADTrain/wiki/PID-Creation

https://github.com/Data-to-Insight-Center/SEADTrain/wiki/SEADTrain-Data-Description

Rice Genomics Galaxy Workflow Metadata and PID assignment:

GitHub code:

https://github.com/Data-to-Insight-Center/RDA-PRAGMA-Data-Service/tree/master/pragmapit-ext

GitHub wiki:

https://github.com/Data-to-Insight-Center/RDA-PRAGMA-Data-Service/wiki/RDA-PRAGMA-Data-Identity-Service-API-Documents