Lecture 9 - Data Mining Case Study 1


  • Data Mining

  • More data is generated: bank, telecom, and other business transactions; scientific data (astronomy, biology, etc.); Web, text, and e-commerce data. More data is captured: storage technology is faster and cheaper, and DBMSs are capable of handling bigger databases.

  • We have large amounts of data stored in one or more databases. We are starving for the new information hidden within those data (for research use, competitive edge, etc.). We want to identify patterns or rules (trends and relationships) in those data. We know that certain data exist inside a database, but what are the consequences of that data's existence?

  • There is often information hidden in the data that is not readily evident. Human analysts may take weeks to discover useful information, and much of the data is never analyzed at all. This is the data gap: the amount of new disk storage shipped each year grows far faster than the number of analysts available to examine it.

    [Chart: The Data Gap - total new disk capacity shipped per year grows from roughly 105 PB in 1995 to about 13,000 PB in 2003, while the number of science and engineering Ph.D.s (a proxy for analysts) stays roughly flat at about 23,000 to 27,000 per year over 1990-1999.]

  • Data mining is a process of extracting previously unknown, valid, and actionable information from large databases and then using that information to make crucial business decisions (Cabena et al., 1998). Data mining is the discovery of interesting patterns from large amounts of data (Han and Kamber, 2001).

  • Definition from [Connolly, 2005]: the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions. The common thread of data mining: it is all about finding patterns or rules in data extracted from large databases in order to obtain new information that can lead to new knowledge.

  • What is data mining? Discovering that certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area); grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest vs. Amazon.com). What is not data mining? Looking up a phone number in a phone directory; querying a Web search engine for information about Amazon.

  • Database queries vs. data mining tasks:

    Database: find all customers who have purchased milk; find all credit applicants with the last name Smith; identify customers who have purchased more than $10,000 in the last month.

    Data mining: find all items that are frequently purchased with milk (association rules); find all credit applicants who are poor credit risks (classification); identify customers with similar buying habits (clustering).

  • Data mining in the knowledge discovery process: Raw Data → (Selection & Cleaning) → Target Data → (Integration into a Data Warehouse, Transformation) → Transformed Data → (Data Mining) → Patterns → (Interpretation & Evaluation, Understanding) → Knowledge.

  • Applications: marketing (customer profiling and retention, identifying potential customers, market segmentation); fraud detection (identifying credit card fraud, intrusion detection); scientific data analysis; text and web mining; any application that involves a large amount of data.

  • [Figure: a Web page with two data regions, each containing repeated data records.]

  • [Example Web data records: four 17-inch LCD monitor listings (EN7410, AL1714, SyncMaster 712n, etc.), each with a price, "Add to Cart", delivery/pick-up, and compare links - the kind of repeated structured records that web data extraction targets.]

  • Word-of-mouth on the Web: the Web has dramatically changed the way consumers express their opinions. One can post reviews of products at merchant sites, Web forums, discussion groups, and blogs, and techniques are being developed to exploit these sources. Benefits of review analysis: a potential customer no longer needs to read many reviews; a product manufacturer gains market intelligence and product benchmarking.

  • Opinion-mining tasks: (1) extracting the product features (called opinion features) that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing and comparing the results.
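    As a very rough illustration of the second task only (this is not the technique used in the lecture), a toy lexicon-based polarity check; the word lists below are invented for the example:

    POSITIVE = {"great", "amazing", "good", "excellent", "love"}
    NEGATIVE = {"bad", "poor", "blurry", "hazy", "terrible"}

    def sentence_polarity(sentence):
        """Classify a sentence as positive/negative/neutral by counting lexicon hits."""
        words = {w.strip(".,!?'\"").lower() for w in sentence.split()}
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    print(sentence_polarity("The pictures coming out of this camera are amazing."))  # positive
    print(sentence_polarity("The pictures come out hazy if your hands shake."))      # negative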

  • Example review - "GREAT Camera", Jun 3, 2004, reviewer jprice174 from Atlanta, Ga.: "I did a lot of research last year before I bought this camera... It kinda hurt to leave behind my beloved nikon 35mm SLR, but I was going to Italy, and I needed something smaller, and digital. The pictures coming out of this camera are amazing. The 'auto' feature takes great pictures most of the time. And with digital, you're not wasting film if the picture doesn't come out."

    Feature-based summary:

    Feature 1: picture. Positive: 12 (e.g., "The pictures coming out of this camera are amazing."; "Overall this is a good camera with a really good picture clarity."). Negative: 2 (e.g., "The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture."; "Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.")

    Feature 2: battery life, ...

  • [Figure: visual summary of the reviews of digital camera 1 across features (picture, battery, size, weight, zoom), with positive and negative opinion counts, and a side-by-side comparison of the reviews of digital camera 1 and digital camera 2.]

  • Data mining methods: clustering, classification, association rules. Other methods: outlier detection, sequential patterns, prediction, trends and analysis of changes, and methods for special data types (e.g., spatial data mining, web mining).

  • Association rules try to find associations between items in a set of transactions. For example, for associations between items bought by supermarket customers: 90% of transactions that purchase bread and butter also purchase milk.

    Output - antecedent: bread and butter; consequent: milk; confidence factor: 90%.

  • A transaction is a set of items T = {i_a, i_b, ..., i_t}, with T ⊆ I, where I = {i_1, i_2, ..., i_n} is the set of all possible items. D, the task-relevant data, is a set of transactions (a database of transactions). Example: items sold by a supermarket (the itemset I): {sugar, parsley, onion, tomato, salt, bread, olives, cheese, butter}; a transaction by one customer: {sugar, onion, salt}; a database D: {T1 = {salt, bread, olives}, T2 = {sugar, onion, salt}, T3 = {bread}, T4 = {cheese, butter}, T5 = {tomato}, ...}.

  • An association rule has the form P ⇒ Q, where P ⊂ I, Q ⊂ I, and P ∩ Q = ∅.

    Example: {bread} ⇒ {butter, cheese}; {onion, tomato} ⇒ {salt}.

  • Support of a rule P ⇒ Q: s_D(P ⇒ Q) = s_D(P ∪ Q), the percentage of transactions in D containing both P and Q (the number of transactions containing P and Q divided by the cardinality of D).

    Confidence of a rule P ⇒ Q: c_D(P ⇒ Q) = s_D(P ∪ Q) / s_D(P), the percentage of transactions that also contain Q among the transactions that already contain P.
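    These two measures can be computed directly from a transaction database. A minimal sketch in Python (illustrative only; the function names support and confidence are not from any library), where each transaction and each itemset is a Python set:

    # Support and confidence computed directly from a list of transactions.
    def support(transactions, itemset):
        """Fraction of transactions containing every item of `itemset`."""
        count = sum(1 for t in transactions if itemset <= t)
        return count / len(transactions)

    def confidence(transactions, antecedent, consequent):
        """c(P => Q) = support(P union Q) / support(P)."""
        return (support(transactions, antecedent | consequent)
                / support(transactions, antecedent))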

  • Thresholds: minimum support (minsup) and minimum confidence (minconf). A frequent itemset P is an itemset whose support is at least the minimum support. A strong rule P ⇒ Q (c%) is a rule where (P ∪ Q) is frequent and the confidence c is at least the minimum confidence.

  • Example (minimum support 50%, minimum confidence 50%):

    Transaction ID   Items Bought
    2000             A, B, C
    1000             A, C
    4000             A, D
    5000             B, E, F

    Frequent Itemset   Support
    {A}                75%
    {B}                50%
    {C}                50%
    {A, C}             50%

    For the rule {A} ⇒ {C}: support = support({A, C}) = 50%; confidence = support({A, C}) / support({A}) = 66.6%. For the rule {C} ⇒ {A}: support = support({A, C}) = 50%; confidence = support({A, C}) / support({C}) = 100%.
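    Applying the sketch above to this four-transaction database reproduces the numbers in the table (a usage example under the same assumptions):

    D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    print(support(D, {"A", "C"}))       # 0.5   -> 50%
    print(confidence(D, {"A"}, {"C"}))  # 0.666 -> 66.6%
    print(confidence(D, {"C"}, {"A"}))  # 1.0   -> 100%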

  • Input: a database of transactions, where each transaction is a list of items (e.g., items purchased by a customer in one visit). Task: find all strong rules that associate the presence of one set of items with that of another set of items. Example: 98% of people who purchase tires and auto accessories also get automotive services done. There are no restrictions on the number of items in the head or body of a rule. The most famous algorithm is Apriori.

  • Find the frequent itemsets: the sets of items that have minimum support. Every subset of a frequent itemset must also be frequent (the Apriori property): if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets. Iteratively find frequent itemsets of cardinality 1 to k (k-itemsets), then use the frequent itemsets to generate association rules; a sketch of this level-wise search follows. Source: [Sunysb, 2009]
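    A compact sketch of the level-wise search (illustrative, not the lecture's pseudocode; min_support is an absolute count, and the join step is simplified to generating all k-combinations of the items appearing in frequent (k-1)-itemsets before pruning):

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return {frozenset(itemset): support_count} for all frequent itemsets."""
        transactions = [frozenset(t) for t in transactions]
        counts = {}
        for t in transactions:                      # C1 / L1: frequent 1-itemsets
            for item in t:
                s = frozenset([item])
                counts[s] = counts.get(s, 0) + 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result = dict(frequent)
        k = 2
        while frequent:
            items = sorted(set().union(*frequent))  # items still in play
            candidates = [frozenset(c) for c in combinations(items, k)]
            # Prune: every (k-1)-subset of a candidate must itself be frequent.
            candidates = [c for c in candidates
                          if all(frozenset(s) in frequent
                                 for s in combinations(c, k - 1))]
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            frequent = {c: n for c, n in counts.items() if n >= min_support}
            result.update(frequent)
            k += 1
        return result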

  • Worked example: consider a database D consisting of 9 transactions. Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 ≈ 22%) and the minimum confidence required is 70%. We first find the frequent itemsets using the Apriori algorithm; association rules are then generated using the minimum support and minimum confidence.

    TID    List of Items
    T100   I1, I2, I5
    T200   I2, I4
    T300   I2, I3
    T400   I1, I2, I4
    T500   I1, I3
    T600   I2, I3
    T700   I1, I3
    T800   I1, I2, I3, I5
    T900   I1, I2, I3

  • Step 1: generating the frequent 1-itemsets. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. Scan D for the count of each candidate and compare it with the minimum support count; the set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.

    Itemset   Sup. Count
    {I1}      6
    {I2}      7
    {I3}      6
    {I4}      2
    {I5}      2

    (Here L1 equals C1, since every candidate meets the minimum support count of 2.)

  • Step 2: generating the frequent 2-itemsets. Generate the candidate set C2 from L1, scan D for the count of each candidate, and compare each candidate's support count with the minimum support count.

    C2 candidates: {I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5}

    C2 with support counts: {I1,I2} 4, {I1,I3} 4, {I1,I4} 1, {I1,I5} 2, {I2,I3} 4, {I2,I4} 2, {I2,I5} 2, {I3,I4} 0, {I3,I5} 1, {I4,I5} 0

    L2 (candidates meeting min_sup = 2): {I1,I2} 4, {I1,I3} 4, {I1,I5} 2, {I2,I3} 4, {I2,I4} 2, {I2,I5} 2

  • To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 join L1 to generate a candidate set of 2-itemsets, C2. Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in the middle table). The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support. Note: we haven't used the Apriori property yet.

  • Step 3: generating the frequent 3-itemsets. Generate the candidate set C3 from L2, scan D for the count of each candidate, and compare each candidate's support count with the minimum support count.

    C3 candidates (after pruning): {I1,I2,I3}, {I1,I2,I5}

    Support counts: {I1,I2,I3} 2, {I1,I2,I5} 2 - both meet the minimum support, so L3 = {{I1,I2,I3}, {I1,I2,I5}}.

  • The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property. To find C3, we compute L2 join L2: C3 = L2 join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.

  • Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. For example, take {I1, I2, I3}: its 2-item subsets are {I1, I2}, {I1, I3}, and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3. Now take {I2, I3, I5}, which shows how the pruning is performed: its 2-item subsets are {I2, I3}, {I2, I5}, and {I3, I5}.

  • But {I3, I5} is not a member of L2, so {I2, I3, I5} cannot be frequent; it violates the Apriori property and has to be removed from C3. Therefore C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the join result for pruning (see the sketch below). Now the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
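    The pruning check can be written mechanically. A small illustrative helper (the name survives_prune is mine, not from the lecture):

    from itertools import combinations

    def survives_prune(candidate, frequent_prev):
        """Keep a k-itemset only if all of its (k-1)-subsets are frequent."""
        return all(frozenset(s) in frequent_prev
                   for s in combinations(candidate, len(candidate) - 1))

    L2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                                 ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}
    print(survives_prune(("I1", "I2", "I3"), L2))  # True  -> kept in C3
    print(survives_prune(("I2", "I3", "I5"), L2))  # False -> pruned: {I3, I5} not in L2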

  • Step 4: generating the frequent 4-itemsets. The algorithm uses L3 join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned because its subset {I2, I3, I5} is not frequent. Thus C4 = ∅ and the algorithm terminates, having found all of the frequent itemsets. This completes the Apriori run. What's next? These frequent itemsets will be used to generate strong association rules (rules that satisfy both minimum support and minimum confidence).

  • Step 5: generating association rules from the frequent itemsets. Procedure: for each frequent itemset I, generate all nonempty proper subsets of I; for every nonempty proper subset s of I, output the rule s ⇒ (I - s) if support_count(I) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
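    A sketch of this procedure (assuming the support counts of all frequent itemsets are available, e.g. the dictionary returned by the Apriori sketch earlier; by the Apriori property every antecedent s is itself frequent, so its count is present):

    from itertools import combinations

    def generate_rules(support_counts, min_conf):
        """Return (antecedent, consequent, confidence) for every strong rule."""
        rules = []
        for itemset, count in support_counts.items():
            if len(itemset) < 2:
                continue
            for r in range(1, len(itemset)):        # nonempty proper subsets
                for antecedent in map(frozenset, combinations(itemset, r)):
                    conf = count / support_counts[antecedent]
                    if conf >= min_conf:
                        rules.append((set(antecedent), set(itemset - antecedent), conf))
        return rules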

  • In our example, L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}. Take I = {I1, I2, I5}; its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.

  • Let the minimum confidence threshold be, say, 70%. The resulting association rules are listed below, each with its confidence.
    R1: {I1,I2} ⇒ {I5}, confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%. R1 is rejected.
    R2: {I1,I5} ⇒ {I2}, confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%. R2 is selected.
    R3: {I2,I5} ⇒ {I1}, confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%. R3 is selected.

  • R4: {I1} ⇒ {I2,I5}, confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%. R4 is rejected.
    R5: {I2} ⇒ {I1,I5}, confidence = sc{I1,I2,I5}/sc{I2} = 2/7 = 29%. R5 is rejected.
    R6: {I5} ⇒ {I1,I2}, confidence = sc{I1,I2,I5}/sc{I5} = 2/2 = 100%. R6 is selected.
    In this way, we have found three strong association rules.
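    The six rules R1-R6 and their verdicts can be reproduced with a short loop over the nonempty proper subsets of {I1, I2, I5} (a usage sketch; the support counts sc are taken from the tables above):

    from itertools import combinations

    sc = {frozenset(k): v for k, v in {
        ("I1",): 6, ("I2",): 7, ("I5",): 2,
        ("I1", "I2"): 4, ("I1", "I5"): 2, ("I2", "I5"): 2,
        ("I1", "I2", "I5"): 2}.items()}

    full = frozenset(["I1", "I2", "I5"])
    for r in (1, 2):
        for p in map(frozenset, combinations(sorted(full), r)):
            conf = sc[full] / sc[p]
            verdict = "selected" if conf >= 0.70 else "rejected"
            print(sorted(p), "=>", sorted(full - p), f"{conf:.0%}", verdict)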

  • Classification: learn a method for predicting the instance class from pre-labeled (classified) instances. Many approaches exist: statistics, decision trees, neural networks, ...

  • Prepare a collection of records (the training set); each record contains a set of attributes, one of which is the class. Find a model for the class attribute as a function of the values of the other attributes (a decision tree, a neural network, etc.). Prepare a test set to determine the accuracy of the model: usually the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it. Once you are happy with the accuracy, use the model to classify new instances.

  • Training set used to learn a classifier (Refund and Marital Status are categorical attributes, Taxable Income is continuous, Cheat is the class):

    Tid  Refund  Marital Status  Taxable Income  Cheat
    1    Yes     Single          125K            No
    2    No      Married         100K            No
    3    No      Single          70K             No
    4    Yes     Married         120K            No
    5    No      Divorced        95K             Yes
    6    No      Married         60K             No
    7    Yes     Divorced        220K            No
    8    No      Single          85K             Yes
    9    No      Married         75K             No
    10   No      Single          90K             Yes

    New (unlabeled) records to classify:

    Refund  Marital Status  Taxable Income  Cheat
    No      Single          75K             ?
    Yes     Married         50K             ?
    No      Married         150K            ?
    Yes     Divorced        90K             ?
    No      Single          40K             ?
    No      Married         80K             ?

  • Model: decision tree learned from the training data above (splitting attributes Refund, MarSt, TaxInc):

    Refund?
      Yes → NO
      No  → MarSt?
              Married          → NO
              Single, Divorced → TaxInc?
                                   < 80K → NO
                                   > 80K → YES

  • There could be more than one tree that fits the same data. An alternative tree for the same training set shown above:

    MarSt?
      Married          → NO
      Single, Divorced → Refund?
                           Yes → NO
                           No  → TaxInc?
                                   < 80K → NO
                                   > 80K → YES

  • Applying the model to test data: start from the root of the tree.

    Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
    Following the tree (Refund = No, then MarSt = Married), the record is assigned the class Cheat = No.
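    A minimal sketch (assuming pandas and scikit-learn are available; the lecture does not prescribe any library) that fits a decision tree to the training set above and classifies this test record:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    train = pd.DataFrame({
        "Refund": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
        "MarSt":  ["Single", "Married", "Single", "Married", "Divorced",
                   "Married", "Divorced", "Single", "Married", "Single"],
        "TaxInc": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in thousands
        "Cheat":  ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
    })
    X = pd.get_dummies(train[["Refund", "MarSt"]]).assign(TaxInc=train["TaxInc"])
    y = train["Cheat"]
    clf = DecisionTreeClassifier(random_state=0).fit(X, y)

    # Test record: Refund = No, Marital Status = Married, Taxable Income = 80K
    test = pd.DataFrame({"Refund": ["No"], "MarSt": ["Married"], "TaxInc": [80]})
    X_test = pd.get_dummies(test[["Refund", "MarSt"]]).assign(TaxInc=test["TaxInc"])
    X_test = X_test.reindex(columns=X.columns, fill_value=0)
    print(clf.predict(X_test))   # ['No'], matching the walk down the tree above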

  • Direct marketing. Goal: reduce the cost of a mailing by targeting the set of consumers likely to buy a new cell-phone product. Approach: use the data for a similar product introduced before; we know which customers decided to buy and which decided otherwise, and this {buy, don't buy} decision forms the class attribute. Collect various demographic, lifestyle, and company-interaction related information about all such customers (type of business, where they live, how much they earn, etc.) and use this information as input attributes to learn a classifier model. From [Berry & Linoff] Data Mining Techniques, 1997.

  • Fraud detection. Goal: predict fraudulent cases in credit card transactions. Approach: use credit card transactions and the information on the account holder as attributes (when the customer buys, what he buys, how often he pays on time, etc.). Label past transactions as fraudulent or fair; this forms the class attribute. Learn a model for the class of the transactions, and use this model to detect fraud by observing credit card transactions on an account.

  • Customer attrition/churn. Goal: predict whether a customer is likely to be lost to a competitor. Approach: use detailed records of transactions with each of the past and present customers to find attributes (how often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.). Label the customers as loyal or disloyal, and find a model for loyalty. From [Berry & Linoff] Data Mining Techniques, 1997.

  • Clustering helps users understand the natural grouping or structure in a data set. A cluster is a collection of data objects that are similar to one another and can therefore be treated collectively as one group. Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters. It is unsupervised classification: there are no predefined classes.

  • Find the natural grouping of instances given unlabeled data.

  • A good clustering method will produce high-quality clusters in which the intra-cluster similarity (within a cluster) is high and the inter-cluster similarity (between clusters) is low. The quality of a clustering result also depends on the similarity measure used by the method and its implementation, on the method's ability to discover some or all of the hidden patterns, and on the definition and representation of cluster chosen.

  • Partitioning algorithms: construct various partitions and then evaluate them by some criterion. Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion; there is an agglomerative (bottom-up) approach and a divisive (top-down) approach.

  • Partitioning method: given a number k, partition a database D of n objects into a set of k clusters so that a chosen objective function is minimized (e.g., the sum of distances of objects to the centers of their clusters). Finding the global optimum would require exhaustively enumerating all partitions, which is too expensive; instead, heuristic methods based on iterative refinement of an initial partition are used, as in the sketch below.
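    The lecture does not name a specific partitioning algorithm here; below is a minimal k-means-style sketch of iterative refinement (pure NumPy; the function name and defaults are my own):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Partition the rows of X into k clusters by iteratively refining centers."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: each point joins the cluster of its nearest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each center moves to the mean of its assigned points.
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers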

  • Hierarchical clustering: a hierarchical decomposition of the data set (with respect to a given similarity measure) into a set of nested clusters. The result is represented by a so-called dendrogram, whose nodes represent possible clusters; it can be constructed bottom-up (agglomerative approach) or top-down (divisive approach).

    A clustering is obtained by cutting the dendrogram at a desired level: each connected component then forms a cluster.

  • Single link: cluster similarity = similarity of the two most similar members. (+) fast; (-) potentially long and skinny clusters.

  • [Figure: three steps of single-link agglomerative merging of five example points.]

  • Complete link: cluster similarity = similarity of the two least similar members. (+) tight clusters; (-) slow.

  • [Figure: three steps of complete-link agglomerative merging of the same five points.]

  • Dendrogram (hierarchical clustering): a clustering is obtained by cutting the dendrogram at a desired level; each connected component forms a cluster.
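    A small usage sketch (assuming SciPy; the five 2-D points are invented, standing in for objects 1-5 in the figures above) that builds single-link and complete-link clusterings and cuts the dendrogram into two clusters:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [3.0, 3.0]])
    for method in ("single", "complete"):
        Z = linkage(X, method=method)                     # bottom-up (agglomerative) merges
        labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
        print(method, labels)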
