Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions
Aparna Varde, Elke Rundensteiner, Carolina Ruiz, David Brown, Mohammed Maniruzzaman and Richard Sisson Jr.
Worcester Polytechnic Institute
Worcester, MA, USA
ACM CIKM 2006, Arlington, VA, USA
Introduction
Clustering often groups data with mixed attributes
• Numeric
• Categorical
• Ordinal
Examples: PDAs, Web Pages, Scientific Experiments
Cluster Representatives: depictions of each cluster
Randomly selected representatives are not enough for
• Capturing cluster information
• Providing ease of interpretation
• Incorporating different user interests
Need for Designing Cluster Representatives
Motivating Example
Scientific experiments clustered based on results
Clustering criteria learned based on input conditions
Representative of conditions used to characterize a cluster
Problem with randomly selected representative
• Distinct combinations of conditions could lead to a given cluster
[Figure: Decision tree learning the clustering criteria (Heat Treating of Materials)]
Goals
Need to design Semantics-Preserving Cluster Representatives that
• Capture relevant information in cluster
• Avoid visual clutter and are easy to interpret
• Take into account various user interests in targeted applications
Proposed Approach: DesCond
Given: Clusters of experiments, conditions leading to clusters
Define notion of distance for conditions incorporating domain semantics
Build candidate representatives with increasing levels of detail
Compare candidates using MDL-based encoding capturing user interests
Return candidate with lowest encoding as best for each cluster
Main Tasks in DesCond
• Defining a notion of distance for the input conditions
• Obtaining suitable candidate representatives for each cluster
• Proposing an encoding to compare candidates and find a winner
Notion of Distance
Example: Heat Treating of Materials
• Quenchant: Cooling medium
• Part: The material being treated
• Probe: Characterizes shape, dimension
• Oxide: Thickness of oxide on surface
• Agitation: Extent of agitation of cooling medium
• Quenchant Temperature: Starting temperature of cooling medium
Define domain-specific distance metric for conditions incorporating
• Data types of attributes
• Distance between attribute values
• Weights of the attributes
[Figure: experimental apparatus with labeled parts: pneumatic cylinder, furnace, oil beaker, pneumatic on/off switch, K-type thermocouple, probe tip, connecting rod, computer with data acquisition card, thermocouple for oil temperature]
Data Types of the Attributes
Categorical
• Characters or strings with descriptive information
• E.g., Quenchant Name, Part Material, Probe Type
Numerical
• Integers or real numbers
• E.g., Quenchant Temperature
Ordinal
• Where order matters
• E.g., Oxide Layer, Agitation Level
Distance Between the Attribute Values
Categorical
• Different = 1
• Same = 0
Numerical
• Absolute difference between values, or between mean values of ranges
Ordinal
• Map values to integers, e.g., Oxide Layer: none = 0, thin = 1, thick = 2
• Absolute difference between mapped values
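The three per-attribute rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the tuple encoding of a numeric range are assumptions made here, and only the Oxide Layer mapping comes from the slide.

```python
from statistics import mean

# Ordinal mapping taken from the slide's Oxide Layer example.
OXIDE_LEVELS = {"none": 0, "thin": 1, "thick": 2}

def categorical_distance(a, b):
    """Categorical values: 0 if the same, 1 if different."""
    return 0 if a == b else 1

def numerical_distance(a, b):
    """Numerical values: absolute difference.

    A value may be a plain number or a (low, high) range; a range is
    reduced to its mean first (the tuple encoding is an assumption).
    """
    def as_point(v):
        return mean(v) if isinstance(v, (tuple, list)) else v
    return abs(as_point(a) - as_point(b))

def ordinal_distance(a, b, levels):
    """Ordinal values: absolute difference between mapped integers."""
    return abs(levels[a] - levels[b])

# Small usage example with made-up values.
d_cat = categorical_distance("oil", "water")
d_num = numerical_distance((20, 40), 50)
d_ord = ordinal_distance("none", "thick", OXIDE_LEVELS)
```

Each function returns a non-negative number, so the per-attribute results can later be combined using attribute weights.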
Weights of the Attributes
Attribute has higher weight if it
• Is at higher level in tree
• Belongs to a shorter path
• Has more experiments in its corresponding cluster
Decision Tree Weight Heuristic
$W_i = \frac{1}{P} \sum_{j=1}^{P} \frac{H_{i,j}}{H_j} \cdot G_j$
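A short sketch of how the heuristic might be evaluated. The slide does not define its symbols, so the readings used here are assumptions inferred from the three bullets: P is the number of root-to-leaf paths, H_{i,j} is the height of attribute i on path j (0 if the attribute does not appear on that path), H_j is the height of path j, and G_j is the fraction of experiments in the cluster at the leaf of path j. The dictionary layout and all numbers are hypothetical.

```python
def attribute_weight(paths, attribute):
    """Evaluate W_i = (1/P) * sum_j (H_ij / H_j) * G_j for one attribute.

    Each path is a dict with (assumed readings of the slide's symbols):
      'heights'     -> {attribute: H_ij}, height of the attribute on the path
      'path_height' -> H_j, height of the whole path
      'g'           -> G_j, fraction of experiments in the leaf's cluster
    """
    total = 0.0
    for path in paths:
        h_ij = path["heights"].get(attribute, 0)  # absent attribute adds 0
        total += (h_ij / path["path_height"]) * path["g"]
    return total / len(paths)

# Hypothetical two-path decision tree; the numbers are illustrative only.
paths = [
    {"heights": {"Quenchant": 2, "Agitation": 1}, "path_height": 2, "g": 0.6},
    {"heights": {"Quenchant": 3, "Part": 2, "Agitation": 1}, "path_height": 3, "g": 0.4},
]
w_quenchant = attribute_weight(paths, "Quenchant")
w_agitation = attribute_weight(paths, "Agitation")
w_part = attribute_weight(paths, "Part")
```

Under these readings, the root attribute (Quenchant) gets the largest weight, which agrees with the first bullet.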
Candidate Representatives in Levels of Detail
Level 1: Single Conditions Representative (SCR)
• One set of conditions preserving cluster information
Level 2: Multiple Conditions Representative (MCR)
• Summary of information in cluster
Level 3: All Conditions Representative (ACR)
• All information in cluster abstracted suitably
Single Conditions Representative
Return set of conditions closest to all others in cluster
Notion of distance: Domain-specific distance metric for conditions
[Figure: Input conditions in Cluster A; SCR for Cluster A]
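One way to read "closest to all others" is as the item with the smallest total distance to the rest of the cluster (a medoid). The sketch below takes that reading. The weighted-sum distance, the attribute names, the weights, and the toy cluster are all hypothetical stand-ins for the domain-specific metric.

```python
# Hypothetical attribute weights and ordinal mapping for the toy example.
WEIGHTS = {"quenchant": 0.6, "agitation": 0.4}
AGITATION_LEVELS = {"absent": 0, "low": 1, "high": 2}

def toy_distance(x, y):
    """Weighted sum of per-attribute distances (combination rule assumed)."""
    d_quenchant = 0 if x["quenchant"] == y["quenchant"] else 1
    d_agitation = abs(AGITATION_LEVELS[x["agitation"]]
                      - AGITATION_LEVELS[y["agitation"]])
    return WEIGHTS["quenchant"] * d_quenchant + WEIGHTS["agitation"] * d_agitation

def single_conditions_representative(cluster, distance):
    """Return the set of conditions with the smallest total distance
    to all sets of conditions in the cluster."""
    return min(cluster,
               key=lambda c: sum(distance(c, other) for other in cluster))

# Toy cluster: each dict is one set of input conditions.
cluster_a = [
    {"quenchant": "oil-1", "agitation": "absent"},
    {"quenchant": "oil-1", "agitation": "low"},
    {"quenchant": "oil-1", "agitation": "high"},
    {"quenchant": "oil-2", "agitation": "low"},
]
scr = single_conditions_representative(cluster_a, toy_distance)
```

Here the second item is selected: it shares the majority quenchant and sits at the middle agitation level, so its total distance to the other items is the lowest.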
Multiple Conditions Representative
Build sub-clusters of conditions using domain knowledge
Return nearest sub-cluster representatives
Sort them
[Figure: Cluster A; sub-clusters within Cluster A; MCR for Cluster A]
All Conditions Representative
Return all sets of conditions
Sort them in ascending order
[Figure: Cluster A; ACR for Cluster A]
DesCond Encoding to Compare Candidates
Analogous to Minimum Description Length (MDL)
• Theory: representative, Examples: sets of conditions in cluster
Complexity of representative (ease of interpretation)
• $\text{Complexity} = \log_2 (AV)$
• A = number of attributes, V = number of values for each attribute
Distance of all items from representative (information loss)
• $\text{Distance} = \log_2 \left( \frac{1}{s} \sum_{i=1}^{s} D(R, S_i) \right)$
• D: domain-specific distance metric for conditions
• s: total number of items (sets of conditions) in cluster
• $S_i$: each individual item
• R: representative set of conditions
DesCond Encoding
• $\text{Effectiveness} = \text{UBC} \cdot \text{Complexity} + \text{UBD} \cdot \text{Distance}$
• UBC, UBD: User bias % weights for complexity and distance
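The encoding can be sketched as follows, under stated assumptions: the distances D(R, S_i) are precomputed and passed in as a list, the user bias weights are written as fractions rather than percentages, and the values of A, V and the distances per candidate are invented for illustration. The logarithm requires a positive average distance; the slide does not say how a zero average is treated, so the sketch leaves that case unhandled.

```python
import math

def complexity(num_attributes, num_values):
    """Complexity = log2(A * V)."""
    return math.log2(num_attributes * num_values)

def distance_term(distances):
    """Distance = log2 of the mean of D(R, S_i) over the s items.

    The mean must be positive, otherwise math.log2 raises ValueError.
    """
    return math.log2(sum(distances) / len(distances))

def effectiveness(ubc, ubd, num_attributes, num_values, distances):
    """Effectiveness = UBC * Complexity + UBD * Distance (lower is better)."""
    return (ubc * complexity(num_attributes, num_values)
            + ubd * distance_term(distances))

def best_candidate(candidates, ubc, ubd):
    """Return the name of the candidate with the lowest encoding."""
    return min(candidates,
               key=lambda name: effectiveness(ubc, ubd, *candidates[name]))

# Hypothetical candidates: name -> (A, V, distances of items from R).
candidates = {
    "SCR": (4, 1, [1.0, 2.0, 3.0]),
    "MCR": (4, 2, [0.25, 0.5, 0.75]),
}
balanced_winner = best_candidate(candidates, ubc=0.5, ubd=0.5)
complexity_heavy_winner = best_candidate(candidates, ubc=0.9, ubd=0.1)
```

With equal bias the more detailed candidate has the lower encoding in this toy data, while a strong bias toward low complexity favors the single-conditions candidate.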
Evaluation of DesCond with Domain Expert Interviews
Evaluated with real data in Heat Treating
• User Bias weights in Encoding reflect interests in targeted applications
• Different data sets and number of clusters
For each data set, score calculated as follows
• Consider winning candidate for each cluster, based on DesCond Encoding
• Score: Number of clusters in which candidate is winner
Example: Dataset of size 25 with 5 clusters
• If SCR wins for 2 clusters, ACR for 3
• Score: SCR=2, ACR=3
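The scoring rule amounts to counting, per candidate type, the clusters it wins. A minimal sketch reproducing the example above; the order of winners in the list is made up, only the counts come from the slide.

```python
from collections import Counter

def score_candidates(winners_per_cluster):
    """Count the number of clusters in which each candidate is the winner."""
    return Counter(winners_per_cluster)

# One winner per cluster for the 5-cluster example (ordering is arbitrary).
winners = ["SCR", "ACR", "SCR", "ACR", "ACR"]
scores = score_candidates(winners)
```

A `Counter` returns 0 for a candidate that never wins, so `scores["MCR"]` is 0 here.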
Evaluation Results
Details
• Data Set Size = 400, Number of Clusters = 20
• Experts provide UBC / UBD values in Encoding
Observations
• Overall winner is MCR
• As weight for complexity increases, SCR wins
• Designed better than Random
Evaluation with Formal User Surveys
DesCond used to design representatives for a trademarked estimation tool [ref CHTE: Center for Heat Treating Excellence]
Formal user surveys conducted in different applications of the system
Evaluation Process
• Compare estimation with real data in test set
• If they match, estimation is accurate
Evaluation Results
• Different winners in different applications
• Results of surveys tally with those of Encoding-based evaluation
• Estimation Accuracy: 90 to 94% (better than earlier versions of tool)
[Figure panels: Parameter Selection Applications; Simulation Tool Applications; Decision Support Applications; Intelligent Tutoring Applications]
Related Work
Image Rating: [HH-01]
• User intervention involved in manual rating
Semantic Fish Eye Views: [JP-04]
• Display multiple objects in small space, no representatives
PDA Displays in Levels of Detail: [BGMP-01]
• Do not evaluate different types of representatives
Conclusions
Contributions of this work
• Designing cluster representatives for scientific input conditions in levels of detail
• Defining a domain-specific distance metric for conditions
• Proposing an encoding to compare representatives
• Conducting evaluation using encoding with real data from Heat Treating
• Assessing use of representatives in applications of a CHTE trademarked estimation tool
Results
• Designed representatives better than random
• Different designed representatives suit different applications
• DesCond enhances accuracy of estimation tool