QB Students Dm


8/8/2019 QB Students Dm

    NMIT, Bangalore Data Mining Question Bank Dept of ISE

    NITTE MEENAKSHI INSTITUTE OF TECHNOLOGY

    (AN AUTONOMOUS INSTITUTION)

    (AFFILIATED TO VISVESVARAYA TECHNOLOGICAL UNIVERSITY, BELGAUM, APPROVED BY AICTE & GOVT. OF KARNATAKA)

    DATA MINING (ISE751)    Sem: 7th    Credits: 3

    Dept: ISE

    UNIT-I

    1. What is Data Mining? Explain the process of Knowledge Discovery in Databases (KDD) with a

    diagram

    2. What are the different motivating challenges faced by Data Mining Algorithms? Explain each of them

    3. Explain the origins of Data Mining with diagram

    4. What is predictive modeling? Explain with example

    5. Discuss Association Analysis and Cluster Analysis with examples

    6. What are the different types of attributes? Explain with a table

    7. In the case of record data, what are transaction or market basket data, the Data Matrix and the Sparse Data Matrix? Explain with examples.

    8. In the case of ordered data, explain Sequential Data, Sequence Data, Time Series Data and Spatial Data with examples

    9. What do you mean by Data Preprocessing? Explain Aggregation and Sampling in this respect

    10.Explain Dimensionality reduction in Data Preprocessing

    11.What are the different variations of Graph Data? Explain with diagrams

    12.What is Feature Subset Selection? What are the different approaches for doing this? Explain the

    architecture of Feature subset selection with a diagram

    13.In the case of Feature Creation, explain the following with examples:

    i) Feature Extraction

    ii) Mapping Data to new space


    iii) Feature Construction

    14.What do you mean by Binarization? Explain the conversion of a categorical attribute to 3 binary attributes. What is its drawback? How is it overcome?

    15.How is Discretization of Continuous Attributes done? In this regard, explain unsupervised and supervised Discretization.

    16.What is variable transformation? In this regard, explain

    i) Simple Functional Transformation

    ii) Normalization/Standardization

    17.Explain the following terms:

    i) Outliers (ii) Precision (iii) Accuracy (iv) Bias

    18. Explain Data Mining Tasks in detail with examples

    19.Define and explain the terms:

    i) Attribute (ii) Measurement (iii) Data Set (iv) Sparsity

    20.What are Discrete and Continuous Attributes? Explain the term resolution.

    21.What is the curse of Dimensionality? Explain Data Quality issues related to applications

    UNIT-II

    1. Give the formal definition of classification. What is a classification model? Explain with diagram

    2. With a diagram, explain the general approach for building a classification model

    3. For the Nodes N1 & N2 given below, calculate the Gini Index, Entropy and Classification Error.

    Based on this, mention which node is suitable for splitting

    Node N1    Count
    Class=0    0
    Class=1    6

    4. What is a confusion matrix? Explain the confusion matrix for a 2-class problem with an example. In this regard, explain the accuracy and error rate of prediction with appropriate formulas

    5. Write Hunt's Algorithm. Explain it with an example

    6. Compare rule-based ordering and class-based ordering schemes with examples

    7. Explain different methods for expressing Attribute test conditions

    8. Explain in detail the characteristics of Decision Tree Induction.

    9. Write and explain the algorithm for Decision Tree Induction.

    10. What is Gain Ratio? Explain with formula.

    11. Calculate the Gini Index for Attributes A and B given below and specify which attribute is better for

    splitting.

    Where C0, C1 stand for Class 0 and Class 1 respectively.

    12.What is rule based classifier? Explain how it works with an example. In this regard, also define

    accuracy and coverage

    13.Consider a training set that contains 60 positive examples and 100 negative examples. Suppose two

    rules are given:

    R1: covers 50 positive examples and 5 negative examples

    R2: covers 2 positive examples and no negative examples

    For the above two rules, calculate Laplace, accuracy, coverage and likelihood ratio.
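Questions 13, 19 and 20 all use the same rule-evaluation measures. A minimal Python sketch (function and variable names are mine; the likelihood ratio statistic is computed with log base 2, and the Laplace measure assumes k = 2 classes):

```python
import math

def rule_metrics(p, n, P, N, k=2):
    """Evaluate a rule covering p positive and n negative examples,
    drawn from a training set with P positives and N negatives in total."""
    total = P + N
    covered = p + n
    accuracy = p / covered
    coverage = covered / total
    laplace = (p + 1) / (covered + k)          # k = number of classes
    # Likelihood ratio statistic: R = 2 * sum_i f_i * log2(f_i / e_i),
    # where e_i is the count expected under the class priors.
    e_pos = covered * P / total
    e_neg = covered * N / total
    r = 0.0
    for f, e in ((p, e_pos), (n, e_neg)):
        if f > 0:
            r += f * math.log2(f / e)
    return accuracy, coverage, laplace, 2 * r

# Question 13: 60 positive and 100 negative training examples
acc1, cov1, lap1, lr1 = rule_metrics(50, 5, 60, 100)   # R1
acc2, cov2, lap2, lr2 = rule_metrics(2, 0, 60, 100)    # R2
```

For R1 this gives accuracy ≈ 0.91, coverage ≈ 0.34, Laplace ≈ 0.89 and a likelihood ratio statistic of about 99.9; for R2 the likelihood ratio is about 5.66.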

    Node N2    Count
    Class=0    1
    Class=1    5
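The three impurity measures in Question 3 can be sketched as follows (a small helper of my own; entropy uses log base 2):

```python
import math

def impurity(counts):
    """Gini index, entropy and classification error for the class counts at a node."""
    total = sum(counts)
    probs = [c / total for c in counts]
    gini = 1 - sum(p * p for p in probs)
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    error = 1 - max(probs)
    return gini, entropy, error

# Question 3: N1 has class counts (0, 6); N2 has (1, 5)
g1, e1, c1 = impurity([0, 6])
g2, e2, c2 = impurity([1, 5])
```

For N1 all three measures are 0 (a pure node); for N2 the Gini index is 1 − ((1/6)² + (5/6)²) ≈ 0.278, entropy ≈ 0.65 and classification error 1/6.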

    Tables for Question 11:

    Attribute A:
             Node N1   Node N2
      C0        4         2
      C1        3         3

    Attribute B:
             Node N1   Node N2
      C0        1         5
      C1        4         2


    14.Explain characteristics of Rule-Based classifier

    15. How can a decision tree be converted into classification rules? Explain with example.

    16. Write and explain the k-nearest neighbor classification algorithm

    17.What are the characteristics of Nearest-neighbor classifier? Explain

    18.Explain 1-nearest-neighbor, 2-nearest-neighbor and 3-nearest-neighbor with examples.

    19.Consider a training set that contains 100 positive examples and 400 negative examples. For each of

    the following candidate rules

    R1: A -> + (Covers 4 positive and one negative example)

    R2: B -> + (Covers 30 positive and 10 negative examples)

    R3: C -> + (Covers 100 positive and 90 negative examples)

    Determine which is the best and worst candidate rules according to:

    i) Rule Accuracy (ii) Laplace measure (iii) Likelihood ratio statistic

    20. Consider a training set that contains 29 positive examples and 21 negative examples. For each of the

    following candidate rules

    R1: A -> + (Covers 12 positive and 3 negative examples)

    R2: B -> + (Covers 7 positive and 3 negative examples)

    R3: C -> + (Covers 8 positive and 4 negative examples)

    Determine which is the best and worst candidate rules according to:

    i) Rule Accuracy (ii) Laplace measure (iii) Likelihood ratio statistic

    21. For the following confusion matrix, calculate the Accuracy and Error rate:

                            Predicted Class
                            Class=1   Class=0
      Actual    Class=1        15        10
      Class     Class=0        20        11
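Question 21 reduces to a few lines once the matrix cells are named (TP, FN, FP, TN; names are mine):

```python
def accuracy_error(tp, fn, fp, tn):
    """Accuracy and error rate from a 2-class confusion matrix."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total    # correct predictions over all predictions
    return acc, 1 - acc

# Question 21: actual Class=1 row is (15, 10), actual Class=0 row is (20, 11)
acc, err = accuracy_error(tp=15, fn=10, fp=20, tn=11)
```

Here accuracy = 26/56 ≈ 0.464 and error rate = 30/56 ≈ 0.536.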

    22. Consider the following table with attributes A, B, C and two class labels +, -

                     Number of Instances
    A   B   C          +        -
    T   T   T          5        0
    F   T   T          0       20
    T   F   T         20        0
    F   F   T          0        5
    T   T   F          0        0
    F   T   F         25        0
    T   F   F          0        0
    F   F   F          0       25

    According to the classification error rate, which attribute would be chosen as the best splitting attribute?

    23. Explain the measures for selecting the best split

    UNIT-III

    1. How is market basket data represented in a binary format? Explain with example. In this case explain

    the terms itemset, association rule, support count, support and confidence

    2. What is the use of support and confidence? Explain

    3. Discuss association rule mining problem. Explain

    4. What is frequent itemset generation? Generate candidate 3-itemsets for the following data by applying the Apriori principle, taking a minimum support threshold of 60%

    TID Items

    1 {Bread, Milk}

    2 {Bread, A, B, C}

    3 {Milk, A, B, D}

    4 {Bread, Milk, A, B}

    5 {Bread, Milk, A, D}
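A brute-force, level-wise sketch of frequent itemset generation for Question 4 (this is illustrative Python, not the book's pseudocode: candidates are built by unioning pairs of frequent (k−1)-itemsets and then applying the Apriori subset prune):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "A", "B", "C"},
    {"Milk", "A", "B", "D"},
    {"Bread", "Milk", "A", "B"},
    {"Bread", "Milk", "A", "D"},
]
minsup = 0.6 * len(transactions)   # 60% support threshold -> count >= 3

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
frequent = {1: [frozenset([i]) for i in items
                if support_count(frozenset([i])) >= minsup]}
k = 2
while frequent[k - 1]:
    prev = set(frequent[k - 1])
    # candidate k-itemsets from unions of frequent (k-1)-itemsets
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Apriori prune: every (k-1)-subset of a candidate must be frequent
    candidates = [c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))]
    frequent[k] = [c for c in candidates if support_count(c) >= minsup]
    k += 1
```

With a 60% threshold (count ≥ 3), the only candidate 3-itemset surviving the prune is {A, Bread, Milk}, and its support count of 2 falls below the threshold.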


    5. Write the algorithm for Frequent itemset generation of the Apriori algorithm

    6. How is support counting done using a Hash tree? Explain with example

    7. How are candidates generated using lexicographic ordering? Explain with example.

    8. What is candidate generation and pruning? Explain Fk-1 x F1 and Fk-1 x Fk-1 methods of candidate

    generation with examples.

    9. What are the factors that affect the computation complexity of the Apriori algorithm? Explain

    10.Explain rule generation in Apriori algorithm with example.

    11.Write the Apriori algorithm for rule generation

    12.What is maximal frequent itemset? Explain with example.

    13.Discuss closed Frequent itemsets with example

    14.What are the alternative methods for generating frequent itemsets? Explain

    15.Explain relationships among frequent, maximal frequent and closed frequent itemsets with diagram

    16.Explain the DFS and BFS methods of generating frequent itemsets with examples.

    17.What are the two ways in which a transaction data set can be represented? Explain with example

    18.For the following data set:

    TransID Items Bought

    0001 {a,d,e}

    0024 {a,b,c,e}

    0012 {a,b,d,e}

    0031 {a,c,d,e}

    0015 {b,c,e}

    0022 {b,d,e}

    0029 {c,d}

    0040 {a,b,c}

    0030 {a,d,e}

    0038 {a,b,e}

    i) Compute the support count for itemsets {e}, {b,d} and {b,d,e}

    ii) Compute the support and confidence for association rules:

    {b,d} -> {e} and {e} -> {b, d}

    Is confidence a symmetric measure?
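Question 18's counts can be checked directly (a small sketch; the transaction IDs are used as dictionary keys):

```python
transactions = {
    "0001": {"a", "d", "e"}, "0024": {"a", "b", "c", "e"},
    "0012": {"a", "b", "d", "e"}, "0031": {"a", "c", "d", "e"},
    "0015": {"b", "c", "e"}, "0022": {"b", "d", "e"},
    "0029": {"c", "d"}, "0040": {"a", "b", "c"},
    "0030": {"a", "d", "e"}, "0038": {"a", "b", "e"},
}

def sigma(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in transactions.values() if itemset <= t)

s_e, s_bd, s_bde = sigma({"e"}), sigma({"b", "d"}), sigma({"b", "d", "e"})
conf_bd_e = s_bde / s_bd    # confidence of {b,d} -> {e}
conf_e_bd = s_bde / s_e     # confidence of {e} -> {b,d}
```

σ(e) = 8 and σ(b,d) = σ(b,d,e) = 2, so c({b,d} → {e}) = 1.0 while c({e} → {b,d}) = 0.25: confidence is not a symmetric measure.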

    19.For the market based transactions given below:

    Trans ID Items Bought

    1  {a,b,d,e}
    2  {b,c,d}
    3  {a,b,d,e}
    4  {a,c,d,e}
    5  {b,c,d,e}
    6  {b,d,e}
    7  {c,d}
    8  {a,b,c}
    9  {a,d,e}
    10 {b,d}

    i) What is the maximum number of association rules that can be extracted from this data?

    ii) What is the maximum number of frequent itemsets that can be extracted (including null set)?

    iii) Generate candidate 1-itemsets, 2-itemsets and 3-itemsets assuming a support threshold of 60%, using the Apriori algorithm
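Parts i) and ii) of Question 19 follow from standard counting results for d items (here d = 5, items a through e): the number of possible association rules is 3^d − 2^(d+1) + 1, and the number of possible itemsets including the null set is 2^d. As a one-line sketch:

```python
d = 5  # number of distinct items: a, b, c, d, e

# Standard result for rules X -> Y with X, Y non-empty and disjoint
max_rules = 3**d - 2**(d + 1) + 1

# All subsets of d items, including the empty (null) set
max_itemsets = 2**d
```

This gives 180 possible rules and 32 possible itemsets.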

    20.Write short notes on the following:

    i. Equivalence classes

    ii. Breadth First and Depth First Search

    iii. General-to-Specific Vs Specific-to-General

    UNIT-IV

    1. What is an FP-Tree? Explain its construction with example

    2. How are frequent itemsets generated using FP-Tree Algorithm? Explain with example.

    3. What are contingency tables? Explain their contents.

    4. Explain the limitations of Support and Confidence Framework by taking an example.

    5. For the following tables, calculate the Interest Factor, φ-correlation coefficient and IS Measure:

              p      p̄    Total
    q       880     50      930
    q̄        50     20       70
    Total   930     70     1000
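The three measures in Question 5 can be computed from the table's cells (a sketch assuming the usual definitions: Interest = N·f11/(f1+·f+1), φ = (f11·f00 − f10·f01)/√(f1+·f0+·f+1·f+0), IS = f11/√(f1+·f+1); the function name is mine):

```python
import math

def measures(f11, f10, f01, f00):
    """Interest Factor, phi-correlation coefficient and IS Measure
    from the four cells of a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    r1, r0 = f11 + f10, f01 + f00     # row totals
    c1, c0 = f11 + f01, f10 + f00     # column totals
    interest = n * f11 / (r1 * c1)
    phi = (f11 * f00 - f10 * f01) / math.sqrt(r1 * r0 * c1 * c0)
    is_measure = f11 / math.sqrt(r1 * c1)
    return interest, phi, is_measure

# Question 5's {p, q} table: f11=880, f10=50, f01=50, f00=20
interest, phi, is_m = measures(880, 50, 50, 20)
```

For this table the measures come out to Interest ≈ 1.02, φ ≈ 0.232 and IS ≈ 0.946.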

    6. How can Objective Measures be extended beyond pairs of Binary Variables? Explain with contingency table

    7. What are the properties of Objective Measures? Explain in detail

    8. Calculate the φ-correlation coefficient, IS Measure, Interest Factor and Confidence for the rule {Tea} -> {Coffee} for the following table:

                Coffee   No Coffee   Total
    Tea           150         50       200
    No Tea        650        150       800
    Total         800        200      1000

    9. What is a Sequence? Explain with examples

    10.What is Simpson's Paradox? Explain with example.

    11.For the following contingency tables, compute the support, the interest measure and the φ-correlation coefficient for the association pattern {A,B}. Also compute the confidence of the rules A -> B and B -> A. Is confidence a symmetric measure?

    (Second table for Question 5:)

               r      r̄    Total
    s         20     50       70
    s̄         50    880      930
    Total     70    930     1000



    12.What are Subsequences? Explain with example

    13.What is meant by Cross-Support Patterns? How are they eliminated?

    14.For the following two-way contingency table

    Calculate:

    i) Confidence for the rules {HDTV=Yes} -> {Exercise Machine=Yes} and

    {HDTV=No}-> {Exercise Machine=Yes}

    ii) φ-correlation coefficient, IS Measure and Interest Factor

    Tables for Question 11:

             B     B̄                B     B̄
      A      9     1         A     89     1
      Ā      1    89         Ā      1     9

                         Buy Exercise Machine
      Buy HDTV           Yes     No    Total
      Yes                 99     81      180
      No                  54     66      120
      Total              153    147      300

    iii) Explain the inversion and scaling properties of Objective Measures with examples

    15.Consider the three-way contingency table below:

                                       Buy Exercise Machine
    Customer Group      Buy HDTV       Yes     No    Total
    College Students    Yes              1      9       10
                        No               4     30       34
    Working Adult       Yes             98     72      170
                        No              50     36       86

    Compute:

    i) φ-correlation coefficient, IS Measure and Interest Factor when Customer Group = College Students

    ii) φ-correlation coefficient, IS Measure and Interest Factor when Customer Group = Working Adult

    iii) Confidence for the rules {HDTV=Yes} -> {Exercise Machine=Yes} and {HDTV=No} -> {Exercise Machine=Yes}, first when Customer Group = College Students and then when Customer Group = Working Adult

    16.Construct FP-Tree for the following Transaction Data Set:

    Transaction ID Items Bought

    1 {a,b,d,e}

    2 {b,c,d}

    3 {a,b,d,e}

    4 {a,c,d,e}

    5 {b,c,d,e}

    6 {b,d,e}

    7 {c,d}

    8 {a,b,c}

    9 {a,d,e}

    10 {b,d}

    17.Identify the frequent itemsets in the above transactions using FP-Tree Algorithm

    18.What is Sequential Pattern Discovery? Explain with example

    19.For the following contingency table, compute:

    i) φ-correlation coefficient, IS Measure and Interest Factor when C=0

    ii) φ-correlation coefficient, IS Measure and Interest Factor when C=1

                  A=1    A=0
    C=0   B=1       0     15
          B=0      15     30
    C=1   B=1       5      0
          B=0       0     15

    20. What do you mean by Timing Constraints with regard to Sequential Patterns?

    21.Draw contingency tables for the rules {b} -> {c} and {a} -> {d} using the transactions shown

    below:

    Transaction ID Items Bought

    1 {a,b,d,e}

    2 {b,c,d}

    3 {a,b,d,e}

    4 {a,c,d,e}

    5 {b,c,d,e}

    6 {b,d,e}

    7 {c,d}

    8 {a,b,c}

    9 {a,d,e}

    10 {b,d}

    Using these contingency tables, compute the φ-correlation coefficient, IS Measure, Interest Factor and Confidence for the two rules.

    22.Write the Apriori-like Algorithm for Sequential Pattern Discovery.

    UNIT-V

    1. What is Cluster Analysis? Explain

    2. What are the different types of Clustering? Explain with diagrams

    3. Discuss different types of Clusters

    4. Write and explain the basic K-Means Algorithm

    5. With respect to the K-Means algorithm, explain how points are assigned to the closest centroid and how SSE is used as the objective for centroid computation
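Questions 4 and 5 can be illustrated with a bare-bones K-Means (my own sketch, using squared Euclidean distance on 2-D points; SSE is the sum of squared distances of each point to its closest centroid):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def centroid(cluster):
    """Mean point of a non-empty cluster."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def kmeans(points, k, iters=100, seed=0):
    """Basic K-Means: assign each point to its closest centroid, then
    recompute each centroid as the mean of its cluster; stop when stable.
    Returns the centroids and the final SSE."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        new = [centroid(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    sse = sum(min(dist2(p, c) for c in centroids) for p in points)
    return centroids, sse

pts = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
cents, sse = kmeans(pts, k=2)
```

Neither the assignment step nor the centroid-recomputation step can increase the SSE, which is why the loop reaches a stable configuration.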

    6. In the K-Means algorithm, how are initial centroids chosen? Explain with diagram

    7. Give a table listing common choices for Proximity, Centroids and Objective Functions with

    respect to K-Means Algorithm


    8. Comment on Time and Space Complexity of K-Means Algorithm

    9. What are the additional issues in K-Means algorithm? Explain

    10.Write and explain the Bisecting K-Means Algorithm

    11.What are Strengths and Weaknesses of K-Means Algorithm

    12.Write and explain Basic Agglomerative Hierarchical Clustering Algorithm. How is proximity

    between clusters defined?

    13.Comment on the Time and Space Complexity of Agglomerative Hierarchical Clustering

    algorithm.

    14.Discuss the key issues in Hierarchical Clustering

    15.What are the Strengths and Weaknesses of Hierarchical Clustering

    16.Explain the Single Link or MIN method of Hierarchical Clustering with example

    17.Explain Complete Link or MAX method of Hierarchical Clustering with example

    18.Discuss Group Average Version of Hierarchical Clustering with example
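The single-link (MIN) scheme of Questions 12 and 16 can be sketched as a naive O(n³) agglomeration (names are mine; complete link (MAX) would replace min with max in the cluster-distance line, and group average would take the mean):

```python
def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def single_link(points, target_clusters):
    """Agglomerative hierarchical clustering with single-link (MIN) proximity:
    repeatedly merge the two clusters whose closest pair of points is nearest."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: distance between the closest pair of points
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
out = single_link(pts, 2)
```

On this toy data the two tight pairs merge first, leaving one cluster around the origin and one around (5, 5).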

    19.How are points classified according to center-based density in the DBSCAN algorithm? Explain with diagrams and example

    20.Write and Explain DBSCAN algorithm
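For Questions 19 and 20, the center-based density classification at the heart of DBSCAN can be sketched as follows (illustrative only: the full DBSCAN then links core points within Eps of one another into clusters and attaches border points to them):

```python
def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def classify_points(points, eps, min_pts):
    """Label each point core, border or noise using center-based density:
    core if its eps-neighbourhood (including itself) holds at least min_pts
    points; border if it lies in some core point's neighbourhood; noise
    otherwise."""
    def neighbours(p):
        return [q for q in points if dist(p, q) <= eps]

    cores = [p for p in points if len(neighbours(p)) >= min_pts]
    labels = {}
    for p in points:
        if p in cores:
            labels[p] = "core"
        elif any(dist(p, c) <= eps for c in cores):
            labels[p] = "border"
        else:
            labels[p] = "noise"
    return labels

pts = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (2.0, 0.0), (9.0, 9.0)]
labels = classify_points(pts, eps=1.1, min_pts=3)
```

With Eps = 1.1 and MinPts = 3, the three left-most points are core, (2, 0) is a border point (dense enough only through its core neighbour), and the isolated (9, 9) is noise.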

    21.Comment on Time and Space Complexity of DBSCAN algorithm

    22.What are strengths and weaknesses of DBSCAN algorithm

    23.What is Cluster Evaluation? Explain overview of Cluster Evaluation
