HIEDS: A Generic and Efficient Approach to Hierarchical Dataset Summarization
Gong Cheng, Cheng Jin, Yuzhong Qu
National Key Laboratory for Novel Software Technology, Nanjing University, China
Websoft
Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
Scenario: browsing a dataset in an open data portal
https://data.europa.eu/euodp/en/data/dataset/dgt-translation-memory
I need some insight into the contents, not just metadata.
Meeting the challenge with a dataset summary
i.e., an automatically generated, small-sized, high-level abstraction of the data that summarizes the contents of a dataset for quick inspection.
Expected features of a dataset summary
• To provide a multigranular abstraction of the data that can be explored incrementally
• To preserve the structural nature of a dataset
• To be comprehensible
Constitution of a dataset summary
• An example
A hierarchical grouping of entities
Relations connecting sibling groups
A property-value pair differentiates a group of entities from sibling groups.
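The constitution described above can be sketched as a small data structure. This is only an illustrative sketch, not the authors' implementation: the class name `Group`, its fields, the helper `height`, and the example entities and the `actedWith` relation are all hypothetical; only the labels Type:Person and Type:Actor appear in the slides.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Group:
    """One node of a hierarchical summary (hypothetical structure)."""
    label: tuple[str, str]            # property-value pair that distinguishes the group
    entities: set[str]                # entities covered by the group
    subgroups: list[Group] = field(default_factory=list)
    # Relations between sibling subgroups, stored on the parent as
    # (index of source subgroup, relation name, index of target subgroup).
    sibling_relations: list[tuple[int, str, int]] = field(default_factory=list)


def height(group: Group) -> int:
    """Number of levels in the hierarchy rooted at `group`."""
    return 1 + max((height(s) for s in group.subgroups), default=0)


# Made-up example data, for illustration only.
actors = Group(("Type", "Actor"), {"e1", "e2"})
chairs = Group(("Type", "Chair"), {"e3"})
root = Group(
    ("Type", "Person"),
    {"e1", "e2", "e3"},
    subgroups=[actors, chairs],
    sibling_relations=[(0, "actedWith", 0)],
)
```

Storing relations on the parent keeps each relation local to one set of siblings, which matches the slide's description of relations connecting sibling groups.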
Quality of a dataset summary
• Coverage of data
  • large subgroups, frequent relations
• Height of hierarchy
  • moderate-sized subgroups
• Cohesion within groups
  • informative (i.e., less frequent) property-value pairs
• Overlap between groups
  • controllable overlap
• Homogeneity of groups
  • different values of the same property
Problem formulation: multidimensional knapsack problem (MKP)
maximizing moderateness of each subgroup
maximizing cohesion within each subgroup
disallowing large overlap between subgroups
selecting ≤k subgroups
(optionally) disallowing different properties
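For reference, the textbook form of a multidimensional knapsack problem is shown below. The symbols are generic and are assumptions here, not the notation of the original slides: $x_j$ indicates whether candidate subgroup $j$ is selected, $p_j$ is its profit, and each of the $m$ constraints has weights $w_{ij}$ and capacity $c_i$.

```latex
\begin{align}
\max_{x} \quad & \sum_{j=1}^{n} p_j x_j \\
\text{s.t.} \quad & \sum_{j=1}^{n} w_{ij} x_j \le c_i, \quad i = 1, \dots, m, \\
& x_j \in \{0, 1\}, \quad j = 1, \dots, n.
\end{align}
```

One plausible reading of the items above, stated as an assumption: $p_j$ combines the moderateness and cohesion of subgroup $j$, a constraint with $w_{ij} = 1$ and $c_i = k$ enforces the limit of at most $k$ subgroups, and the remaining constraints bound the overlap between selected subgroups.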
Problem solution
• A greedy strategy is used (sorting candidates by )
but its efficient implementation is non-trivial.
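A greedy selection of this kind might look roughly like the sketch below. It is a naive illustration under stated assumptions, not the efficient implementation the slide refers to: the sorting criterion is not given in this text, so a single precomputed `score` per candidate stands in for it; Jaccard similarity is an assumed overlap measure; and the names `Candidate`, `jaccard`, and `greedy_select` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    prop: str             # property of the distinguishing property-value pair
    value: str            # value of the distinguishing property-value pair
    entities: frozenset   # entities in the candidate subgroup
    score: float          # assumed stand-in for the (unspecified) sorting criterion


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity, used here as an assumed overlap measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_select(candidates, k, max_overlap, same_property=False):
    """Pick at most k candidates in descending score order.

    A candidate is skipped if its overlap with any already chosen subgroup
    exceeds max_overlap, or (optionally) if its property differs from the
    property of the first chosen subgroup.
    """
    chosen = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if len(chosen) == k:
            break
        if same_property and chosen and cand.prop != chosen[0].prop:
            continue
        if all(jaccard(cand.entities, s.entities) <= max_overlap for s in chosen):
            chosen.append(cand)
    return chosen


# Made-up candidates, for illustration only.
person = Candidate("Type", "Person", frozenset({1, 2, 3, 4}), 0.9)
actor = Candidate("Type", "Actor", frozenset({1, 2, 3}), 0.8)
film = Candidate("Type", "Film", frozenset({5, 6}), 0.5)
picked = greedy_select([actor, film, person], k=2, max_overlap=0.5)
```

This naive version compares every candidate against every chosen subgroup, so it costs a pairwise overlap computation per check; avoiding that kind of cost is presumably part of what makes an efficient implementation non-trivial.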
Experiments
• Baseline: LODeX (ISWC’14)
  • flat grouping
  • biased towards coverage (e.g., Type:Person)
  • redundant information (e.g., Type:Person and Type:Chair)
• Advantages of HIEDS
  • hierarchical grouping
  • trade-off between coverage and cohesion (e.g., Type:Actor)
  • controllable overlap
Details can be found in our poster!