HIEDS: A Generic and Efficient Approach to Hierarchical Dataset Summarization

Gong Cheng, Cheng Jin, Yuzhong Qu

National Key Laboratory for Novel Software Technology, Nanjing University, China

Websoft



Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/


Scenario: browsing a dataset in an open data portal

https://data.europa.eu/euodp/en/data/dataset/dgt-translation-memory

I need some insight into the contents, not just metadata.


Meeting the challenge with a dataset summary

i.e., an automatically generated, small, high-level abstraction of the data that summarizes the contents of a dataset for quick inspection.


Expected features of a dataset summary

• To provide multigranular abstraction of data to be incrementally explored

• To preserve the structural nature of a dataset

• To be comprehensible


Constitution of a dataset summary

• An example

• A hierarchical grouping of entities

• Relations connecting sibling groups

A property-value pair differentiates a group of entities from sibling groups.
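The idea that a property-value pair defines a group, and that subgroups are formed by further pairs, can be illustrated with a minimal sketch. The toy entities, the function names `candidate_groups` and `subgroups`, and the rule for skipping pairs are assumptions made for illustration; they are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical toy dataset: entity -> set of (property, value) pairs.
ENTITIES = {
    "e1": {("type", "Person"), ("occupation", "Actor")},
    "e2": {("type", "Person"), ("occupation", "Actor")},
    "e3": {("type", "Person"), ("occupation", "Director")},
    "e4": {("type", "Film")},
}


def candidate_groups(entities, members):
    """Map each property-value pair to the members that carry it."""
    groups = defaultdict(set)
    for entity in members:
        for pair in entities[entity]:
            groups[pair].add(entity)
    return dict(groups)


def subgroups(entities, members, used_pairs):
    """Candidate subgroups of `members`.

    Skips pairs already used higher in the hierarchy and pairs that
    do not split the group (assumed rule, for illustration only).
    """
    result = {}
    for pair, group in candidate_groups(entities, members).items():
        if pair not in used_pairs and len(group) < len(members):
            result[pair] = group
    return result


# Top level: every property-value pair yields a candidate group.
top = candidate_groups(ENTITIES, set(ENTITIES))
persons = top[("type", "Person")]

# Second level: sibling subgroups of the Person group.
children = subgroups(ENTITIES, persons, {("type", "Person")})
```

Here the two children are sibling groups distinguished by different values of the same property, which is the kind of differentiation the slide describes.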


Quality of a dataset summary

• Coverage of data
  • large subgroups, frequent relations

• Height of hierarchy
  • moderate-sized subgroups

• Cohesion within groups
  • informative (i.e., less frequent) property-value pairs

• Overlap between groups
  • controllable overlap

• Homogeneity of groups
  • different values of the same property


Problem formulation: multidimensional knapsack problem (MKP)

maximizing moderateness of each subgroup

maximizing cohesion within each subgroup

disallowing large overlap between subgroups

selecting ≤k subgroups

(optionally) disallowing different properties
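For reference, the textbook 0-1 MKP can be written as below. The symbols are the standard ones and do not come from the slides: $x_j$ indicates whether candidate $j$ is selected, $p_j$ is its profit, $w_{ij}$ is its weight in constraint $i$, and $c_i$ is the capacity of constraint $i$. One plausible reading, offered as an assumption, is that the profit encodes moderateness and cohesion while the constraints encode the overlap limits and the bound of $k$ subgroups (the latter with $w_{ij} = 1$ and $c_i = k$).

```latex
\begin{aligned}
\max \quad & \sum_{j=1}^{n} p_j x_j \\
\text{s.t.} \quad & \sum_{j=1}^{n} w_{ij} x_j \le c_i, \qquad i = 1, \dots, m, \\
& x_j \in \{0, 1\}, \qquad j = 1, \dots, n.
\end{aligned}
```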


Problem solution

• A greedy strategy is used (sorting candidates by [formula lost in extraction]), but its efficient implementation is non-trivial.
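A generic greedy heuristic for the 0-1 MKP can be sketched as follows. The sorting criterion used by HIEDS is not preserved in this transcript, so the sketch assumes a common one: profit divided by the sum of capacity-normalised weights. The function name `greedy_mkp` and the example numbers are hypothetical, and this naive version makes no attempt at the efficient implementation the slide refers to.

```python
def greedy_mkp(profits, weights, capacities):
    """Greedy heuristic for the 0-1 multidimensional knapsack problem.

    profits[j]    : profit of item j
    weights[i][j] : weight of item j in dimension i
    capacities[i] : capacity of dimension i (assumed > 0)
    Returns the indices of the chosen items, in selection order.
    """
    n, m = len(profits), len(capacities)

    def efficiency(j):
        # Assumed criterion: profit per unit of capacity-normalised weight.
        load = sum(weights[i][j] / capacities[i] for i in range(m))
        return profits[j] / load if load > 0 else float("inf")

    used = [0.0] * m
    chosen = []
    # Visit items from most to least efficient; keep each one that
    # still fits in every dimension.
    for j in sorted(range(n), key=efficiency, reverse=True):
        if all(used[i] + weights[i][j] <= capacities[i] for i in range(m)):
            chosen.append(j)
            for i in range(m):
                used[i] += weights[i][j]
    return chosen


# Hypothetical example: four candidate subgroups, two constraints.
profits = [10, 7, 6, 3]
weights = [
    [1, 1, 1, 1],  # cardinality: each subgroup counts once (select <= k)
    [5, 2, 4, 1],  # a second, made-up resource
]
capacities = [2, 6]  # k = 2, and a budget of 6 on the second resource
selected = greedy_mkp(profits, weights, capacities)
```

In this example the item with the highest profit (index 0) is skipped because it would exceed the second capacity once index 1 has been taken, which shows how a greedy pass trades optimality for speed.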


Experiments

• Baseline: LODeX (ISWC’14)
  • flat grouping
  • biased towards coverage (e.g., Type:Person)
  • redundant information (e.g., Type:Person and Type:Chair)

• Advantages of HIEDS
  • hierarchical grouping
  • trade-off between coverage and cohesion (e.g., Type:Actor)
  • controllable overlap


Details can be found in our poster!