Presentation for Data Exploration
1
Leveraging Collaborative Tagging for Web Item Design
Mahashweta Das, Gautam Das, Vagelis Hristidis
04/12/2023
Presenter: Ajith C Ajjarani [1000-727269]
Outline: Organization of the Presentation
• Motivation & problem definition
• Tag Maximization: NP-complete (under the Naïve Bayes classifier)
• Exact two-tier top-k algorithm (moderate instances)
• Approximation algorithm (larger instances)
• Experiments & result tabulation
Motivation
Can I design a new camera that attracts and maximizes desirable tags?
Let's define this opportunity as a problem!
Problem Construction
Attributes are the product definition; tags are user-defined.
Given a subset of subjective "desired" tags, predict a new item (a combination of attribute values). Extend this to a "top-k" version: the k potential items with the highest expected number of desirable tags.

Training Data
• Given a database of tagged products, the task is to design k new products (attribute-value combinations) that are likely to attract the maximum number of desirable tags. Tag desirability is just one aspect of product design consideration.
• Applications: electronics, autos, apparel; musical artists, bloggers
Problem Statement
Example camera attributes: Resolution? Zoom? Flash? Shooting mode? Light sensitivity?
Tag Maximization
Technically challenging, as complex dependencies exist between tags and items: it is difficult to determine the combination of attribute values that maximizes the expected number of desirable tags.
A Naïve Bayes classifier is used for tag prediction. Even for this classifier, with its simplistic conditional-independence assumption, the tag maximization problem is NP-complete. The authors did not resort to heuristics; instead, they developed principled algorithms.
Proposed Solution
• Exact two-tier top-k algorithm (ETT): performs significantly better than the naïve brute-force algorithm, since it need not compute all possible products. It applies rank-join and the TA top-k algorithm in a two-tier architecture; in the worst case it may still have exponential running time. Suited for moderate instances.
• Approximation algorithm: a polynomial-time approximation scheme (PTAS) with provable error bounds. Its running time is exponential only in the (constant) size of the attribute groups, and can be reduced to polynomial time. Suited for large datasets.
Problem Framework
• D = {o1, o2, ..., on}: the database of items
• A = {A1, A2, ..., Am}: Boolean attributes
• T = {T1, T2, ..., Tr}: tags
Each item is thus a Boolean vector of size (m + r).
• Such a dataset is used as a training set to build Naive Bayes Classifiers (NBC) and compute Pr(Tag | Attributes).
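As a concrete sketch of the training step, the per-tag NBC statistics can be estimated by simple counting over the Boolean dataset. This is a minimal illustration, not the paper's implementation; the Laplace smoothing is an added assumption:

```python
def train_nbc(items, m, r):
    """Estimate Naive Bayes statistics from a Boolean dataset.

    items: list of 0/1 vectors of length m + r (m attributes, then r tags).
    Returns, per tag j: priors[j] = Pr(Tj), cond[j][i] = Pr(Ai=1 | Tj),
    and cond_not[j][i] = Pr(Ai=1 | not Tj), Laplace-smoothed.
    """
    n = len(items)
    priors, cond, cond_not = [], [], []
    for j in range(r):
        pos = [o for o in items if o[m + j] == 1]   # items carrying tag Tj
        neg = [o for o in items if o[m + j] == 0]   # items without tag Tj
        priors.append(len(pos) / n)
        cond.append([(sum(o[i] for o in pos) + 1) / (len(pos) + 2)
                     for i in range(m)])
        cond_not.append([(sum(o[i] for o in neg) + 1) / (len(neg) + 2)
                         for i in range(m)])
    return priors, cond, cond_not
```

With these tables, Pr(Tag | Attributes) for any candidate item follows from the conditional-independence assumption.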
Derived Results
The probability that a new item o is annotated by tag Tj (writing Rj for the Naive Bayes odds ratio, introduced for convenience):

  Rj = [Pr(Tj) · ∏i Pr(o.Ai | Tj)] / [Pr(T̄j) · ∏i Pr(o.Ai | T̄j)]
  Pr(Tj | o) = Rj / (1 + Rj)

The probability Pr(T̄j | o) of item o not having tag Tj:

  Pr(T̄j | o) = 1 − Pr(Tj | o)

Given the desirable tags Td = {T1, ..., Tz} ⊆ T, the expected number of desirable tags a new item o is annotated with:

  E(o) = Σ_{j=1..z} Pr(Tj | o)
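These formulas can be evaluated directly; a minimal sketch (the probability values below are made-up toy numbers, not from any real dataset):

```python
def tag_prob(o, prior, p_pos, p_neg):
    """Pr(Tj | o) for a Boolean item o via the odds ratio Rj.

    prior = Pr(Tj); p_pos[i] = Pr(Ai=1 | Tj); p_neg[i] = Pr(Ai=1 | not Tj).
    """
    num, den = prior, 1.0 - prior
    for i, v in enumerate(o):
        num *= p_pos[i] if v else 1.0 - p_pos[i]
        den *= p_neg[i] if v else 1.0 - p_neg[i]
    r = num / den                     # the odds ratio Rj
    return r / (1.0 + r)              # Pr(Tj | o) = Rj / (1 + Rj)

def expected_desirable_tags(o, tag_params):
    """E(o) = sum over desirable tags Tj of Pr(Tj | o)."""
    return sum(tag_prob(o, *params) for params in tag_params)

# Toy parameters for a single desirable tag (hypothetical numbers):
params = [(0.5, [0.75, 0.5], [0.25, 0.5])]
```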
Exact Algorithm
• Naïve brute force: consider all 2^m possible products and compute the expected score for each possible product. Exponential complexity.
• Exact two-tier top-k (ETT): applies rank-join and the TA top-k algorithm in a two-tier architecture; does not need to compute all possible products.
• ETT performs significantly better than naïve brute force. It works well for moderate data instances but does not scale to larger data, and in the worst case may have exponential running time.
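The brute-force baseline is essentially a one-liner over all 2^m bit vectors; this sketch uses a pluggable score function standing in for the NBC expected-tag score:

```python
import heapq
from itertools import product

def brute_force_top_k(m, k, score):
    """Enumerate all 2^m candidate products (exponential!) and return
    the k with the highest expected number of desirable tags."""
    return heapq.nlargest(k, product((0, 1), repeat=m), key=score)
```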
ETT: Two-Tier Architecture
z desirable tags; the m attributes are split into l groups of size m′ = m / l.
Tier 1: determine the "best" item for each tag (T1, T2, ..., Tz).
Tier 2: match these items to compute the global best product across all tags.
ETT Algorithm (Example)
• Database: attributes {A1, A2, A3, A4}, tags {T1, T2}; a top-1 query.
• Partition the attributes into 2 groups, {A1, A2} and {A3, A4}, to form 2 lists of partial products per tag.
• Each list has 2^m′ = 2^2 = 4 entries (partial products).
• Run the NBC to compute a score for each partial product under each tag, and sort each list in descending order.
Tier-1 sorted lists of partial products (value, score), one pair of lists per tag:

  T1:  L11 (A1 A2): 10 → 1.97, 00 → 0.84, 11 → 0.84, 01 → 0.36
       L12 (A3 A4): 10 → 1.97, 00 → 0.84, 11 → 0.84, 01 → 0.36
  T2:  L21 (A1 A2): 11 → 2.76, 01 → 1.18, 10 → 1.18, 00 → 0.51
       L22 (A3 A4): 11 → 4.57, 10 → 2.53, 01 → 0.91, 00 → 0.51

Definitions:
  MPFS: maximum possible future score of a tag's rank-join
  MUS: sum of the last seen scores from all GetNext() streams
  Buffer: top-k products with their actual (complete) scores

Iteration 1:
  GetNext(T1) = 1010 (rank-join partial score 0.95, MPFS 0.95)
  GetNext(T2) = 1111 (rank-join partial score 0.93, MPFS 0.93)
  Tier-2 buffer (complete scores): 1111 → 1.75, 1010 → 1.70
  MinK (1.75) <= MUS (1.88), so return to tier 1.
Iteration 2 (tier-1 lists unchanged):
  GetNext(T1) = 1011 (partial score 0.92, MPFS 0.92)
  GetNext(T2) = 1110 (partial score 0.88, MPFS 0.88)
  Tier-2 buffer: 1110 → 1.77, 1011 → 1.76
  MinK (1.77) <= MUS (1.79), so return to tier 1.
Iteration 3 (tier-1 lists unchanged):
  GetNext(T1) = 0010 (partial score 0.89, MPFS 0.89)
  GetNext(T2) = 0111 (partial score 0.84, MPFS 0.84)
  Tier-2 buffer: 0111 → 1.77, 0010 → 1.76
  MinK (1.77) >= MUS (1.74): ETT terminates.
Thus, ETT returns the best item (0111 or 1110) after just 6 item look-ups.
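The tier-2 threshold loop can be sketched as follows. Here the tier-1 rank-joins are replaced by pre-materialized streams carrying the example's numbers, and complete scores come from a lookup table rather than the NBC; both are simplifications for illustration (k = 1):

```python
def ett_top1(streams, complete_score):
    """Tier-2 loop of ETT for k = 1.

    streams[j]: (product, partial_score) pairs in descending score order,
    standing in for tag Tj's tier-1 rank-join GetNext() stream.
    complete_score(p): the exact expected-desirable-tags score of p.
    Assumes termination before the streams exhaust, as in the example.
    """
    its = [iter(s) for s in streams]
    best = (float("-inf"), None)
    while True:
        mus = 0.0
        for it in its:
            p, sj = next(it)              # one GetNext() per tag stream
            mus += sj                     # MUS: sum of last seen scores
            c = complete_score(p)
            if c > best[0]:
                best = (c, p)             # update the top-1 buffer
        if best[0] >= mus:                # MinK >= MUS: safe to terminate
            return best

# The slide's example: two tags, streams and complete scores as shown above.
streams = [
    [("1010", 0.95), ("1011", 0.92), ("0010", 0.89)],   # T1 rank-join
    [("1111", 0.93), ("1110", 0.88), ("0111", 0.84)],   # T2 rank-join
]
scores = {"1010": 1.70, "1111": 1.75, "1011": 1.76,
          "1110": 1.77, "0010": 1.76, "0111": 1.77}
```

Running `ett_top1(streams, scores.get)` terminates in the third iteration, matching the 6 item look-ups above.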
Approximation Algorithm
• Partition the z desirable tags into z/z′ subgroups of z′ tags each ({T1, T2, ..., Tz′}, {T3, T4, ..., Tz′}, ...).
• For each subgroup, compute its top-k items: O11, O12, ..., O1k for the first subgroup, O21, O22, ..., O2k for the second, and so on.
• Combine these per-subgroup results into the overall top-k items O1, O2, ..., Ok.
Each subgroup is solved using a PTAS in polynomial time, defined for approximation factor ε = 2σm, where σ is the compression factor.
PTAS Algorithm Design
Consider a single subgroup with z = z′ desirable tags (T1, T2, ..., Tz′), top-k with k = 1, and ε > 0.
Let Oa be the item returned by the PTAS and Og the optimal item.
The PTAS must run in polynomial time and satisfy the invariant:

  ExactScore(Oa) >= (1 − ε) · ExactScore(Og)
PTAS Algorithm Design
A simple exponential-time exact top-1 algorithm for the sub-problem is constructed first, and then reduced to a PTAS. Given m Boolean attributes and z′ tags, the exponential-time algorithm makes m iterations:
• Initial step: produce the set S0 consisting of the single item {0^m}, along with its z′ scores, one for each tag.
• First iteration: produce the two-item set S1 = {0^m, 10^(m−1)}, each item accompanied by its z′ scores.
• i-th iteration: produce the set Si = {0,1}^i × {0^(m−i)}, along with the z′ scores of each item.
• The final set Sm contains all 2^m items with their exact scores, from which the top-1 item is returned.
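The set-growing loop above can be sketched directly (the score function is a stand-in for the per-tag NBC scores summed over the subgroup's tags):

```python
def exact_top1(m, score):
    """Exponential-time exact top-1: grow Si = {0,1}^i x {0^(m-i)} one
    attribute at a time; the final set Sm holds all 2^m items."""
    S = {(0,) * m}                                    # S0 = {0^m}
    for i in range(m):                                # i-th iteration
        S |= {o[:i] + (1,) + o[i + 1:] for o in S}    # free up bit i
    return max(S, key=score)                          # scan Sm for the top-1
```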
PTAS Algorithm Design
Consider the earlier example table with z = z′ = 2, σ = 0.5, m = 4, so ε = 2σm = 4:
  Og = {1110}, score 1.77 = 0.89 + 0.88
  Oa = {1111}, score 1.75 = 0.82 + 0.93
PTAS Algorithm Design
Compression step: items are clustered by score and only one representative per cluster is kept; the cluster's surviving item's exact score must be close to any deleted item's exact score.
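One plausible reading of that compression step, sketched with a single aggregate score instead of the paper's per-tag score vectors (the exact bucketing scheme here is an assumption):

```python
import math

def compress(items, score, sigma):
    """Keep one representative per multiplicative score bucket: every
    deleted item has a survivor whose score is within a (1 + sigma)
    factor, so the error introduced per compression stays bounded."""
    reps = {}
    for o in items:
        s = score(o)
        key = math.floor(math.log(s, 1.0 + sigma)) if s > 0 else -1
        reps.setdefault(key, o)     # first item seen represents its bucket
    return list(reps.values())
```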
Experiment
Synthetic and real datasets are used for quantitative and qualitative analysis of the proposed algorithms.
Quantitative performance indicators:
• Efficiency of the proposed exact and approximation algorithms.
• The approximation factor actually achieved by the approximation algorithm.
Qualitative evaluation: an Amazon Mechanical Turk user study to assess the results of the algorithms.
Experiment
Real camera dataset: a crawled dataset of 100 cameras listed on Amazon. The listed cameras contain technical details (attributes) and the tags customers associate with each camera. The tags are sanitized to remove synonyms and unintelligible or undesirable tags such as "Nikon coolpix", "quali", "bad", etc.
Synthetic dataset: a Boolean matrix of dimension 10,000 (items) × 100 (50 attributes + 50 tags).
• The 50 independently distributed attributes fall into 4 groups, where a value is set to 1 with probability 0.75, 0.15, 0.10, or 0.05 respectively.
• The 50 tags follow predefined relations, built by randomly picking a set of correlated attributes for each tag.
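The synthetic dataset can be generated along these lines; the exact group sizes and the tag-attribute correlation scheme are not fully specified above, so both (roughly equal groups, tag = AND of its picked attributes) are assumptions:

```python
import random

def make_synthetic(n=10_000, seed=0):
    """Boolean matrix of n items x (50 attributes + 50 tags)."""
    rng = random.Random(seed)
    # 4 groups of attributes with Pr(value = 1) = 0.75, 0.15, 0.10, 0.05
    probs = [0.75] * 13 + [0.15] * 13 + [0.10] * 12 + [0.05] * 12
    # each tag correlated with a small random set of attributes
    deps = [rng.sample(range(50), rng.randint(1, 3)) for _ in range(50)]
    rows = []
    for _ in range(n):
        attrs = [1 if rng.random() < p else 0 for p in probs]
        tags = [int(all(attrs[i] for i in d)) for d in deps]
        rows.append(attrs + tags)
    return rows
```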
Quantitative: Performance
Exact algorithm: synthetic dataset with 1,000 items, 16 attributes, and 8 tags (naïve brute force vs. ETT).
Quantitative: Performance
ETT becomes extremely slow beyond m = 16 attributes, while the approximation algorithm (PA) with approximation factor 0.5 continues to return guaranteed results in reasonable time as the number of attributes m increases.
Quantitative: Performance
Execution time and obtained approximation factor on the synthetic dataset (1,000 items, 20 attributes, 8 tags); the top-1 item is considered.
Qualitative: User Study
First part: the PA algorithm with approximation factor 0.5 was run on tag sets corresponding to compact cameras and SLR cameras respectively. Four new cameras were built (2 digital compact and 2 digital SLR) with PA (ε = 0.5) and pitted against 4 existing popular cameras. 65% of users chose the new cameras.
Qualitative: User Study
Second part: 6 new cameras were designed for three user groups (1. young students, 2. old/retired, 3. professional photographers), with 2 potential new cameras per group. When users were asked to assign at least five tags each, the majority correctly classified the six cameras into the three groups.
Conclusion
• Defined the Tag Maximization problem and investigated its computational complexity.
• Proposed 2 novel algorithms and showed their practicability.
• This work is a preliminary look at a very novel area of research and promises exciting directions for future research.
• Future work: use decision tree, SVM, and regression tree classifiers and repeat the experiments.
Questions?
Thank You