
Page 1

Content-based Comparison for Collections Identification

Weijia Xu1, Ruizhu Huang1, Maria Esteva1, Jawon Song1, Ramona Walls2

1 Texas Advanced Computing Center, University of Texas at Austin 2 Cyverse.org

IEEE BigData’16, Workshop on Computational Archival Science

Dec 8, 2016 @ Washington D.C.

1

Page 2

Panorama
•  Bio data collections evolve in a redundant, unstable, and distributed research environment
•  Data is big, has multiple components at different stages of completion, and may be stored across repositories
•  Pre- and post-publication events are difficult to document
•  Auto-archiving is now a ubiquitous practice for research data
•  Metadata is not enough to establish data uniqueness

2

Page 3

Identifier Services (IDS) Research Project
•  Automated lifecycle identifier management
•  Use-case driven
•  Focus on genomics data
•  IDS services to:
   •  Bind dispersed data objects
   •  Track and represent provenance
   •  Validate data location and integrity over time
   •  Aid data identity
•  Cyberinfrastructure to:
   •  Deal with large data/metadata
   •  Deal with evolving data over time
   •  Deal with big data tasks in a distributed environment

DATA AUTHENTICITY

3

Page 4

IDS Architecture

[Architecture diagram: users interact with the IDS web front end (identifierservices.org), which talks to an Agave tenant exposing the Metadata, Systems, Files, Apps, and Jobs APIs; data apps run on an HPC system or public cloud and connect to registered data repositories.]

1. User registers repository data, providing access mechanism(s). User selects files from a registered storage system.
2. IDS queues the corresponding Agave apps.
3. The data app pulls data from the repository and computes the analysis on a high performance computing system.
4. IDS updates the metadata with the data analysis results.
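A minimal sketch of how two of these workflow steps might look from Python against an Agave tenant. The base URL, token, storage system, app id, and metadata schema below are placeholders, and the endpoint paths are assumptions based on the Agave v2 REST API rather than the actual IDS code.

```python
import requests

# Placeholders: tenant URL, OAuth token, and registered storage system are assumptions.
BASE = "https://agave.example.org"
HEADERS = {"Authorization": "Bearer <ACCESS_TOKEN>"}

# Step 1: list files on a registered storage system so the user can select inputs.
listing = requests.get(
    f"{BASE}/files/v2/listings/system/my-storage-system/data/collectionA",
    headers=HEADERS).json()
print(len(listing.get("result", [])), "files registered for comparison")

# Step 2: queue the comparison as an Agave job (app id and inputs are illustrative).
job_definition = {
    "name": "ids-content-comparison",
    "appId": "ids-compare-0.1",
    "inputs": {
        "collectionA": "agave://my-storage-system/data/collectionA",
        "collectionB": "agave://my-storage-system/data/collectionB",
    },
}
job = requests.post(f"{BASE}/jobs/v2", headers=HEADERS, json=job_definition).json()

# Step 4: record the queued job in the dataset's metadata so results can be attached later.
metadata = {"name": "ids.comparison", "value": {"jobId": job.get("result", {}).get("id")}}
requests.post(f"{BASE}/meta/v2/data", headers=HEADERS, json=metadata)
```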

4

Page 5

Content-based Comparison Service: Motivation
•  To identify changes, connections, and differences between datasets
•  Infer issues of provenance
•  Establish data identity
•  Promote data reuse

•  How is the uniqueness of the datasets determined?
   •  Data curators mostly use manual processes
   •  Rely on metadata
      •  Data description
      •  File checksums/fixity

5

Page 6

Questions
•  If two datasets share the same metadata, are they the same dataset?

•  If two similar or identical datasets have different metadata, are they two different datasets?

•  In either of the cases above, how can curators apply globally unique identifiers and corresponding metadata?

6

Page 7

Content-based Data Comparison
•  Goals:
   •  Provide automated methods to verify data identity
   •  Provide additional information regarding two or more data collections
   •  Provenance: documentation of data origin and changes
   •  Increase data reuse
•  Challenges:
   •  Diverse data formats, e.g. FASTA vs. FASTQ vs. SRA formats (see the format-sniffing sketch below)
   •  Different naming conventions/identifier structures
   •  Performance and scalability
      •  Tens of gigabytes on disk
      •  Millions of data records per dataset (all pairwise record comparisons are not feasible)
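Handling mixed formats starts with picking the right record reader. Below is a minimal, assumed sketch that sniffs FASTA vs. FASTQ from the first non-empty line and then parses with Biopython; the slides only say that format detection uses the Biopython API, so the heuristic and function names are illustrative (a GFF reader would be added analogously).

```python
from Bio import SeqIO  # Biopython


def sniff_format(path):
    """Guess the sequence file format from its first non-empty line (illustrative heuristic)."""
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            break
    raise ValueError(f"Unrecognized sequence format: {path}")


def read_records(path):
    """Yield (record ID, sequence string) pairs using the detected format."""
    fmt = sniff_format(path)
    for rec in SeqIO.parse(path, fmt):
        yield rec.id, str(rec.seq)
```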

7

Page 8

Algorithm Overview
•  Determine the composition of the collections to be compared
•  Decide record readers: gff, fastq, fasta
•  List record pairs for comparison
   •  3 to 40 GB file sizes; ~19 million to 165 million records to compare
•  Compare all the records from the lists using the Spark framework (a sketch follows below)
•  Report results

[Workflow diagram: Collection A and Collection B → collections analysis (determine records in each collection) → records list in A and records list in B → records analysis (determine best pairs of records for comparison) → pairs of records from each collection → records comparison (compare records for each pair) → comparison report]
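A rough sketch of how the Spark stage of this pipeline could be wired together, assuming records arrive as (ID, value) pairs and a normalized record-level score; pairing records that share an ID via a join is one simple way to avoid all pairwise comparisons, not necessarily the project's exact strategy, and all names below are illustrative.

```python
from Bio import SeqIO
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ids-content-comparison").getOrCreate()
sc = spark.sparkContext


def read_records(path, fmt="fastq"):
    """(record ID, sequence) pairs; format detection is sketched on an earlier slide."""
    return [(rec.id, str(rec.seq)) for rec in SeqIO.parse(path, fmt)]


def score(a, b):
    # Illustrative placeholder: exact match; Hamming, edit, or prefix distance could be used.
    return 1.0 if a == b else 0.0


# Collections analysis: load each collection as an RDD of (record ID, value) pairs.
# For a sketch the records are read on the driver; a real run would read them in parallel.
records_a = sc.parallelize(read_records("collectionA.fastq"))
records_b = sc.parallelize(read_records("collectionB.fastq"))

# Records analysis: keep only candidate pairs that share a record ID.
candidate_pairs = records_a.join(records_b)          # (id, (value_a, value_b))

# Records comparison: score each candidate pair.
scores = candidate_pairs.mapValues(lambda pair: score(pair[0], pair[1]))

# Comparison report: counts plus the distribution of match scores.
report = {
    "records_in_A": records_a.count(),
    "records_in_B": records_b.count(),
    "matched_keys": candidate_pairs.count(),
    "score_histogram": scores.values().histogram(10),
}
print(report)
```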

8

Page 9

Algorithm
•  The comparison works as follows (see the sketch below):
   •  Convert each list of records into (ID, value) pairs
   •  Sort the records based on the IDs
   •  Start with a, b as pointers to the heads of lists A and B, respectively, and advance them according to a score function:
      •  If score(a, b) <= 0, record the result and move a forward
      •  If score(a, b) > 0, record the result and move b forward
•  Different distance functions can be used to compare records: Hamming distance, edit distance, prefix distance, etc.
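A minimal sketch of this single-pass comparison over two ID-sorted record lists. Treating score(a, b) as an ID comparison (as in a merge join) and reporting unmatched IDs per collection are one reading of the slide, not code from the project.

```python
def compare_sorted(records_a, records_b, distance):
    """Compare two lists of (ID, value) records that are already sorted by ID."""
    matched, only_a, only_b = [], [], []
    a, b = 0, 0
    while a < len(records_a) and b < len(records_b):
        id_a, val_a = records_a[a]
        id_b, val_b = records_b[b]
        score = (id_a > id_b) - (id_a < id_b)   # -1, 0, or 1, comparing the IDs
        if score <= 0:
            if score == 0:
                matched.append((id_a, distance(val_a, val_b)))
            else:
                only_a.append(id_a)             # id_a cannot appear later in sorted B
            a += 1                              # move pointer a forward
        else:
            only_b.append(id_b)                 # id_b cannot appear later in sorted A
            b += 1                              # move pointer b forward
    only_a.extend(i for i, _ in records_a[a:])  # leftovers after one list is exhausted
    only_b.extend(i for i, _ in records_b[b:])
    return matched, only_a, only_b


def hamming(x, y):
    """One of the pluggable record distances; edit or prefix distance plugs in the same way."""
    return sum(c1 != c2 for c1, c2 in zip(x, y)) + abs(len(x) - len(y))
```

With records converted to sorted (ID, value) lists, compare_sorted(list_a, list_b, hamming) makes a single pass over both collections, so the cost grows with the number of records rather than the number of record pairs.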

9

Page 10

Presentation of Results for Decision-making
•  Number of records identified and compared
•  Number of matched keys (id) and/or values (sequence) between collections
•  Number of records that exist in only one of the collections (A or B)
•  Degree of similarity of records, shown as a histogram of the match score distribution
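These figures can be assembled directly from the output of a comparison pass like the sketch two slides back, assuming each matched record pair carries a match score normalized to [0, 1] (as the prefix match score in case 3 is); the field names below are illustrative, not the service's actual report schema.

```python
from collections import Counter


def build_report(matched, only_a, only_b, bins=10):
    """Summarize a comparison run; matched, only_a, only_b come from compare_sorted-style output."""
    scores = [s for _, s in matched]
    binned = Counter(min(int(s * bins), bins - 1) for s in scores)
    return {
        "records_compared": len(matched) + len(only_a) + len(only_b),
        "matched_keys": len(matched),
        "matched_values": sum(1 for s in scores if s == 1.0),  # identical sequences
        "only_in_A": len(only_a),
        "only_in_B": len(only_b),
        # Histogram of match scores, e.g. {"0.0-0.1": 3, ...}
        "score_histogram": {
            f"{i / bins:.1f}-{(i + 1) / bins:.1f}": binned.get(i, 0) for i in range(bins)
        },
    }
```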

10

Page 11

Case 1
•  The dataset:
   •  Rice genomic variations
   •  Two copies (105 gff files) available at:
      •  CyVerse Data Commons
         •  Direct connection to HPC resources for data analysis
      •  Dryad
         •  Integrates the data with the journal publication
•  At a glance:
   •  Each repository has different functionality
   •  Metadata does not match exactly
   •  Different formats: compressed file vs. individual files
•  Per the content-based comparison, both datasets are identical
•  The metadata record in CyVerse will be updated to reflect the relationship between the datasets

11

Page 12

Case 2
•  A research group wants to publish a complete dataset resulting from the analysis of five maize lines
•  The input data, whole-genome bisulfite sequencing FASTQ files, have been published via SRA (the Sequence Read Archive)
•  At a glance:
   •  The working copy consists of two FASTQ format files
   •  Researchers think that the working copy is also available from SRA, 20.7 GB (http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR850328)

12

Page 13

Results from Case 2
Set A: 165 million sequences from the working copy
Set B: 167 million sequences from the data archived with SRA

•  From the results, the researchers were able to infer that the working copy had been processed by adaptive trimming (provenance)
•  They concluded that the difference between the datasets was not significant; both the trimmed and un-trimmed collections could be considered the same “work”
•  The datasets will have different unique identifiers, clarifications in their metadata, and the identifiers will be related in the metadata

13

Page 14

Case 3
•  Two datasets published with almost identical metadata
•  The datasets are referenced at:
   •  http://www.onekp.com/samples/single.php?id=DDEV
   •  http://www.onekp.com/samples/single.php?id=IFCJ
•  At a glance:
   •  Same dataset organization
   •  12 of 14 metadata fields contain identical information
   •  No provenance information or stated relationship between the datasets

14

Page 15

Results from Case 3
Set A: Solexa reads of DDEV from 1KP (13 million records)
Set B: Solexa reads of IFCJ from 1KP (18 million records)
Prefix match score, e.g. for abcde and abdc the common prefix is ab, so the match score is 2*2/(5+4) = 0.44
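A direct transcription of this score into code; only the formula and the abcde/abdc example come from the slide, the function name is illustrative.

```python
def prefix_match_score(a, b):
    """Prefix match score: 2 * (length of the common prefix) / (len(a) + len(b))."""
    k = 0
    for x, y in zip(a, b):
        if x != y:
            break
        k += 1
    return 2 * k / (len(a) + len(b))


print(prefix_match_score("abcde", "abdc"))  # common prefix "ab": 2*2/(5+4) ≈ 0.44
```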

[Histogram: distribution of prefix match scores between set A and set B]

•  Researchers could infer that:
   •  One of the plant samples may be contaminated
   •  The match in the first 20% of the sequence segments may be due to the use of the same sequencing primer

15

Page 16

Performance and Scalability

•  All computations were conducted on Wrangler, the data-intensive computing system at TACC
•  Implemented using the Spark data processing framework
•  Data format detection uses the Biopython package API
•  For comparison, mpiBLAST takes hours instead of minutes

[Bar chart: execution time in seconds (0–120) for data reading, records match, and prefix check, on 4 nodes vs. 8 nodes. Execution time for comparing the two sequencing files in use case 3.]

16

Page 17

Conclusions
•  The identity and provenance of data have to be managed over time as copies and versions of a dataset are generated, reused, and published
•  Metadata alone may not be enough to determine the uniqueness of the data
•  Metadata needs to be updated, and relationships between datasets need to be clarified
•  Determining data uniqueness requires content-based comparison
•  Comparison results help curators understand the reasons for differences and similarities
•  More automated, scalable computation services are needed for data curation

17

Page 18

Thanks & Questions?

Acknowledgement
This work is supported through funding provided by the National Science Foundation for the following projects:
•  Evaluating Identifier Services for the Lifecycle of Biological Data (#26100741)
•  The iPlant Collaborative: Cyberinfrastructure for the Life Sciences (#1265383, data and use cases)
•  Wrangler: A Transformational Data Intensive Resource for the Open Science Community (#1341711, computational resource support)

18