3
International Journ Internat ISSN No: 245 @ IJTSRD | Available Online @ www A Compression Based Rajendra Chouh 1 M.Tech Scholar, Department of Computer Scie ABSTRACT Data mining is not new. People who fi how to start fire and that the earth discovered knowledge which is the mai mining. Data Mining, also called Discovery in Database, is one of the l area, which has emerged in response to data or the flood of data, world is facing has taken up the challenge to develop t can help humans to discover useful patte data. One such important technique is fr mining. This paper will present an comp technique for mining frequent ite transaction data set. Keyword: Data Mining, KDD Proce Pattern Mining, Minimum sup Compression. 1. INTRODUCTION The term data mining refers to the ‘mining’ of valuable knowledge from of data, in analogy to industrial mining sets of valuable nuggets (e.g. gold) are e a great deal of raw material (e.g. rocks). It is the combination of Multiple Discip shows the different disciplines that tak mining. Figure 1 Show the Multiple Disciplin mining nal of Trend in Scientific Research and De tional Open Access Journal | www.ijtsr 56 - 6470 | Volume - 2 | Issue – 6 | Sep w.ijtsrd.com | Volume – 2 | Issue – 6 | Sep-Oct d Methodology to Mine All Fre han 1 , Khushboo Sawant 2 , Dr. Harish Patid , 2 Assistant Professor, 3 HOD, Associate Professo ence and Engineering, LNCT, Indore, Madhya P irst discovered is round also in idea of Data d knowledge latest research o the Tsunami g nowadays. It techniques that erns in assive requent pattern pression based ems from a cess, Frequent pport, Data extraction or large amounts g where small extracted from plines. Figure 1 ke part in data nes for data Even before technologies wer statisticians were using prob techniques to model histor technology allows to capture a of data. Finding and summariz and anomalies in these data challenges in today’s inform unprecedented growth-rate a collected [2] and stored electr all fields of human endeavor, of useful information from becoming an increasing scie massive economic need” [3]. In Data Mining process, selec are forms of preprocessing, w the complete database and po a certain form required for th cleaning and data integration initial phase of data preparatio the input for the data mining results in discovered patterns. are then presented to the user be stored as new knowledge they can, in turn, again be c another knowledge discovery which have support no less min sup value are mined as fre Figure 2: Frequent P evelopment (IJTSRD) rd.com p – Oct 2018 2018 Page: 266 equent Items dar 3 or Pradesh, India re used for Data mining, bability and regressing rical data [1]. Today and store vast quantities zing the patterns, trends, sets is one of the big mation age. “With the at which data is being ronically today in almost , the efficient extraction the data available is entific challenge and a ction and transformation where one selects part of ossibly transforms it into he next step. Often data n are also part of this on. The resulting data is phase, which in its turn The interesting patterns r. As these patterns can e in a knowledge base, considered as input for process [4]. All patterns than the user-specified equent patterns. Pattern Mining

A Compression Based Methodology to Mine All Frequent Items

  • Upload
    ijtsrd

  • View
    1

  • Download
    0

Embed Size (px)

DESCRIPTION

Data mining is not new. People who first discovered how to start fire and that the earth is round also discovered knowledge which is the main idea of Data mining. Data Mining, also called knowledge Discovery in Database, is one of the latest research area, which has emerged in response to the Tsunami data or the flood of data, world is facing nowadays. It has taken up the challenge to develop techniques that can help humans to discover useful patterns in assive data. One such important technique is frequent pattern mining. This paper will present an compression based technique for mining frequent items from a transaction data set. Rajendra Chouhan | Khushboo Sawant | Dr. Harish Patidar "A Compression Based Methodology to Mine All Frequent Items" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-2 | Issue-6 , October 2018, URL: https://www.ijtsrd.com/papers/ijtsrd18423.pdf Paper URL: http://www.ijtsrd.com/engineering/computer-engineering/18423/a-compression-based-methodology-to-mine-all-frequent-items/rajendra-chouhan

Citation preview

Page 1: A Compression Based Methodology to Mine All Frequent Items

International Journal of Trend in

International Open Access Journal

ISSN No: 2456

@ IJTSRD | Available Online @ www.ijtsrd.com

A Compression Based MethodologyRajendra Chouhan

1M.Tech Scholar,Department of Computer Science

ABSTRACT Data mining is not new. People who first discovered how to start fire and that the earth is round also discovered knowledge which is the main idea of Data mining. Data Mining, also called knowledge Discovery in Database, is one of the latest research area, which has emerged in response to the Tsunami data or the flood of data, world is facing nowadays. It has taken up the challenge to develop techniques that can help humans to discover useful patternsdata. One such important technique is frequent pattern mining. This paper will present an compression based technique for mining frequent items from a transaction data set. Keyword: Data Mining, KDD Process, Frequent Pattern Mining, Minimum support, Data Compression. 1. INTRODUCTION The term data mining refers to the extraction or ‘mining’ of valuable knowledge from large amounts of data, in analogy to industrial mining where small sets of valuable nuggets (e.g. gold) are extracted from a great deal of raw material (e.g. rocks). It is the combination of Multiple Disciplines. Figure 1 shows the different disciplines that take part in data mining.

Figure 1 Show the Multiple Disciplines for data mining

International Journal of Trend in Scientific Research and Development (IJTSRD)

International Open Access Journal | www.ijtsrd.com

ISSN No: 2456 - 6470 | Volume - 2 | Issue – 6 | Sep

www.ijtsrd.com | Volume – 2 | Issue – 6 | Sep-Oct 2018

Compression Based Methodology to Mine All Frequent Items

Rajendra Chouhan1, Khushboo Sawant2, Dr. Harish Patidar, 2Assistant Professor, 3HOD, Associate Professor

Computer Science and Engineering, LNCT, Indore, Madhya Pradesh

Data mining is not new. People who first discovered how to start fire and that the earth is round also discovered knowledge which is the main idea of Data mining. Data Mining, also called knowledge

n Database, is one of the latest research area, which has emerged in response to the Tsunami data or the flood of data, world is facing nowadays. It has taken up the challenge to develop techniques that

ver useful patterns in assive data. One such important technique is frequent pattern mining. This paper will present an compression based technique for mining frequent items from a

Data Mining, KDD Process, Frequent Pattern Mining, Minimum support, Data

The term data mining refers to the extraction or ‘mining’ of valuable knowledge from large amounts of data, in analogy to industrial mining where small sets of valuable nuggets (e.g. gold) are extracted from

great deal of raw material (e.g. rocks).

It is the combination of Multiple Disciplines. Figure 1 shows the different disciplines that take part in data

Figure 1 Show the Multiple Disciplines for data

Even before technologies were used for Data mining, statisticians were using probability and regressing techniques to model historical data [1]. Today technology allows to capture and store vast quantities of data. Finding and summarizing the patterns, trends, and anomalies in these data sets is one of the big challenges in today’s information age. “With the unprecedented growth-rate at which data is being collected [2] and stored electronically today in almost all fields of human endeavor, the efficient extraction of useful information from the data available is becoming an increasing scientific challenge and a massive economic need” [3]. In Data Mining process, selection and transformation are forms of preprocessing, where one selects part of the complete database and possibly transforms it into a certain form required for the next step. Often data cleaning and data integration are also part of this initial phase of data preparation. The resulting data is the input for the data mining phase, which in its turn results in discovered patterns. The interesting patterns are then presented to the user. As these patterns can be stored as new knowledge in a knowledge base, they can, in turn, again be considered as input for another knowledge discovery process [4].which have support no less than the usermin sup value are mined as frequent patterns.

Figure 2: Frequent Pattern Mining

Research and Development (IJTSRD)

www.ijtsrd.com

6 | Sep – Oct 2018

Oct 2018 Page: 266

ll Frequent Items Harish Patidar3

Associate Professor Madhya Pradesh, India

e technologies were used for Data mining, statisticians were using probability and regressing techniques to model historical data [1]. Today technology allows to capture and store vast quantities of data. Finding and summarizing the patterns, trends,

nomalies in these data sets is one of the big challenges in today’s information age. “With the

rate at which data is being collected [2] and stored electronically today in almost all fields of human endeavor, the efficient extraction f useful information from the data available is

becoming an increasing scientific challenge and a

In Data Mining process, selection and transformation are forms of preprocessing, where one selects part of

e and possibly transforms it into a certain form required for the next step. Often data cleaning and data integration are also part of this initial phase of data preparation. The resulting data is the input for the data mining phase, which in its turn

lts in discovered patterns. The interesting patterns are then presented to the user. As these patterns can be stored as new knowledge in a knowledge base, they can, in turn, again be considered as input for another knowledge discovery process [4]. All patterns which have support no less than the user-specified

sup value are mined as frequent patterns.

Figure 2: Frequent Pattern Mining

Page 2: A Compression Based Methodology to Mine All Frequent Items

International Journal of Trend in Scientific Research and Development (IJTSRD) ISSN:

@ IJTSRD | Available Online @ www.ijtsrd.com

Frequent pattern mining is used to prune the search space and limit the number of association rules being generated. Many algorithms discussed in the literature use “single min sup framework” to discover the complete set of frequent patterns. The reason forpopular usage of “single min sup framework” is that frequent patterns discovered with this framework satisfy downward closure property, i.e., all nonsubsets of a frequent pattern must also be frequent. The downward closure property makes associarule mining practical in real-world applications [[8]. The two popular algorithms to discover frequent patterns are: Apriori [6] and Frequent Pattern(FP-growth) [9] algorithms. The Apriori algorithm employs breadth-first search (or candidaand-test) technique to discover the complete set of frequent patterns. The FP-growth algorithm employs depth-first search (or pattern-growth) technique to discover the complete set of frequent patterns. It has been shown in the literature that FP-growth algorithm is relatively efficient than the Apriori algorithm [

2. RELATED WORK As a result additional interestingness measures, such as lift, correlation and all-confidence, have been proposed in the literature to address the interestingness of an association rule [10measure has its own selection bias that justifies the rationale for preferring a set of association rules over another [11]. As a result, selecting a right interestingness measure for mining association rules is a tricky problem. To confront this problem, a framework has been suggested in for selecting a right measure. In this framework, authors have discussed various properties of a measure and suggested to choose a measure depending on the properties interesting to the user.

Generally, frequent-pattern mining results in a huge number of patterns of which most can be found to be insignificant according to application and/or user requirements. As a result, there have been efforts in the literature to mine constraint-based andinterest based frequent patterns [12], [13In recent times, temporal periodicity of frequent patterns has been used as an interestingness criterion to discover a class of user-interest based frequent patterns, called periodic-frequent patterns [

3. PROPOSED METHOD STEP1: START STEP2: INPUT TRANSACTION DATA SET &

MINIMUM SUPPORT THRESHOLD

International Journal of Trend in Scientific Research and Development (IJTSRD) ISSN:

www.ijtsrd.com | Volume – 2 | Issue – 6 | Sep-Oct 2018

Frequent pattern mining is used to prune the search space and limit the number of association rules being generated. Many algorithms discussed in the literature

sup framework” to discover the complete set of frequent patterns. The reason for the

framework” is that frequent patterns discovered with this framework satisfy downward closure property, i.e., all non-empty subsets of a frequent pattern must also be frequent. The downward closure property makes association

world applications [7] . The two popular algorithms to discover frequent

] and Frequent Pattern-growth ] algorithms. The Apriori algorithm

first search (or candidate-generate-test) technique to discover the complete set of

growth algorithm employs growth) technique to

discover the complete set of frequent patterns. It has growth algorithm

is relatively efficient than the Apriori algorithm [9].

As a result additional interestingness measures, such confidence, have been

proposed in the literature to address the an association rule [10]. Each

measure has its own selection bias that justifies the sociation rules over

]. As a result, selecting a right interestingness measure for mining association rules is

blem. To confront this problem, a for selecting a right

measure. In this framework, authors have discussed various properties of a measure and suggested to choose a measure depending on the properties

pattern mining results in a huge number of patterns of which most can be found to be insignificant according to application and/or user requirements. As a result, there have been efforts in

based and/or user-[13], [14], [15].

In recent times, temporal periodicity of frequent patterns has been used as an interestingness criterion

interest based frequent frequent patterns [16].

INPUT TRANSACTION DATA SET & MINIMUM SUPPORT THRESHOLD

STEP3: FIRST THE PROPOSED ALGORITHM SCANS THE TRANSACTION DATA BASE AND CALULATES THE SUPPORT OF EACH SINGLE SIZE ITEM.

STEP4: IN THIS STEP A LIST OF FREQUENT ITEM AND INFREQUENT ITEM IS PREPARED ON THE BASIS OF MINIMUM SUPPORT THRESHOLD. IF AN ITEM IS HAVING SUPPORT GREATER THAN THE MINIMUM SUPPORT THRESHOLD THEN ITEM IS PLACED IN FREQUENT ITEM LIST AND ALSO IN EXPANSION LIST.OTHERWISE IT IS PLACED IN INFREQUENT ITEM LIST

STEP5: REMOVE THE TRANSACTION WHICH DOES NOT CONTAIN ANY FREQUENT ITEM

STEP 6: IN THIS STEP, ALL THE MEMBERS OF THE INFREQUENT ITEM LIST ARE REMOVED FROM THE TRANSACTION DATA BASE BECAUSE THEY WILL NOT APPEAR IN ANY FREQUENT ITEM SET. IN THIS WAY, THE ORIGINAL TRANSACTION DATA BASE IS CONVERTED INTO REDUCED SIZE DATA BASE. NOW THIS REDUCED DATA BASE WILL BE USED IN THE CALCULATION OF LARGER SIZE FREQUENT ITEM SETS.

STEP7: WHILE EXPANSION LIST IS NOT EMPTY PERFORM LEFT EXPANSION OF SMALLER SIZE ITEMS TO GENERATE HIGHER SIZE ITEMS AND THEN REPEAT STEP 4 FOR THEM

OR PERFORM RIGHT EXPANSION OF ELEMENTS AND THEN REPEAT STEP4 FORTHEM.

STEP8: WRITE THE LIST OF FREQUENT ITEM SETS

STEP9: STOP 4. RESULT The proposed and existing algorithmimplemented in Java. The retail data set is used as the input data set. The time consumed by both algorithms (For minimum support 40 percent)

International Journal of Trend in Scientific Research and Development (IJTSRD) ISSN: 2456-6470

Oct 2018 Page: 267

FIRST THE PROPOSED ALGORITHM SCANS THE TRANSACTION DATA BASE AND CALULATES THE SUPPORT OF EACH SINGLE SIZE ITEM. IN THIS STEP A LIST OF FREQUENT ITEM AND INFREQUENT ITEM IS PREPARED ON THE BASIS OF MINIMUM SUPPORT THRESHOLD.

IF AN ITEM IS HAVING SUPPORT GREATER THAN THE MINIMUM SUPPORT THRESHOLD THEN ITEM IS PLACED IN FREQUENT ITEM LIST AND ALSO IN EXPANSION LIST. OTHERWISE IT IS PLACED IN INFREQUENT ITEM LIST

REMOVE THE TRANSACTION WHICH DOES NOT CONTAIN ANY FREQUENT

STEP 6: IN THIS STEP, ALL THE MEMBERS OF THE INFREQUENT ITEM LIST ARE REMOVED FROM THE TRANSACTION DATA BASE BECAUSE THEY WILL

R IN ANY FREQUENT ITEM SET. IN THIS WAY, THE ORIGINAL TRANSACTION DATA BASE IS CONVERTED INTO REDUCED SIZE DATA BASE. NOW THIS REDUCED DATA BASE WILL BE USED IN THE CALCULATION OF LARGER SIZE FREQUENT ITEM SETS. WHILE EXPANSION LIST IS NOT

RFORM LEFT EXPANSION OF SMALLER SIZE ITEMS TO GENERATE HIGHER SIZE ITEMS AND THEN REPEAT STEP 4 FOR THEM

PERFORM RIGHT EXPANSION OF ELEMENTS AND THEN REPEAT STEP4

WRITE THE LIST OF FREQUENT ITEM

proposed and existing algorithms are implemented in Java. The retail data set is used as the

e consumed by both algorithms For minimum support 40 percent) is shown below

Page 3: A Compression Based Methodology to Mine All Frequent Items

International Journal of Trend in Scientific Research and Development (IJTSRD) ISSN:

@ IJTSRD | Available Online @ www.ijtsrd.com

Figure 3: Time Consumption Comparison

5. CONCLUSION: Frequent pattern mining has a wide range of real world applications. That’s why it is one of the most favorite topic of research. Frequent mining helps in mining of items which are worthy. This paper proposed an updated method to find sets from a transaction data set. The proposed method makes use of data compression for data reductionUseless items are eliminated in the initial stage of the mining process. REFERENCES 1. Groth Robert. "Data Mining: A Hands

Approach for Business Professionals". Prentice Hall PTR, 1998.

2. Witten Ian and Eibe. Frank, "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations". San Francisco: Morgan Kaufmann Publishers, 2000.

3. Zaki Mohammed and Ho Ching-Scale Parallel Data Mining". Berlin: Springer, 2000.

4. From Data Mining to Knowledge Discovery in Databases Usama Fayyad, Gregory PiatetskyShapiro, and Padhraic Smyth.

5. Omiecinski E., "Alternative interest mining associations in databases", IEEE Trans. Knowl. Data Eng., 15(1):57–69, 2003.

6. Agrawal R., Imielinski T., and Swami A. N., "Mining association rules between sets of items in

International Journal of Trend in Scientific Research and Development (IJTSRD) ISSN:

www.ijtsrd.com | Volume – 2 | Issue – 6 | Sep-Oct 2018

Figure 3: Time Consumption Comparison

requent pattern mining has a wide range of real world applications. That’s why it is one of the most

mining helps in mining of items which are worthy. This paper proposed an updated method to find frequent item

m a transaction data set. The proposed method data compression for data reduction.

Useless items are eliminated in the initial stage of the

Robert. "Data Mining: A Hands-on Approach for Business Professionals". Prentice

Witten Ian and Eibe. Frank, "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations". San Francisco:

hers, 2000.

-Tien, "Large-Scale Parallel Data Mining". Berlin: Springer,

From Data Mining to Knowledge Discovery in Databases Usama Fayyad, Gregory Piatetsky-

Omiecinski E., "Alternative interest measures for mining associations in databases", IEEE Trans.

69, 2003.

Agrawal R., Imielinski T., and Swami A. N., "Mining association rules between sets of items in

large databases", In SIGMOD Conference, pages 207–216, 1993.

7. Chen Chun-Hao , Hong TzungVincent S. , “Genetic-fuzzy mining with multiple minimum supports based on fuzzy clustering”, 2319-2333, Springer 2011.

8. Weimin Ouyang , Qinhua Huang , “Mining direct and indirect fuzzy association rules with multipleminimum supports in large transaction databases”, 947 – 951, IEEE 2011.

9. Han J., Pei J., Yin Y., and Mao R., "Mining frequent patterns without candidate generation: A frequent-pattern tree approach", Data Min. Knowledge. Discovery. 8(1):53

10. Haiying Ma,Dong Gang, “Generalized association rules and decision tree”, Page(s) 4600 IEEE, 2011.

11. Kim Hyea Kyeong, Kim Jae Kyeong , “A product network analysis for extending the market basket analysis”, Pages 7403–7410, Elsevier 2012.

12. Tan P.-N., Kumar V., and Srivastava J. "Selecting the right interestingness measure for association patterns". In KDD, pages 32

13. Vaillant B., Lenca P. , and Lallich S. "A clustering of interestingness measures". Pages 290Springer, 2004.

14. Hu T., Sung S. Y., Xion"Discovery of maximum length frequent itemsets". Inf. Sci., 178:69

15. Schmidt Jana and Kramer Augmented Itemset Tree: A Data Structure for Online Maximum Frequent Pattern Mining”, pp 277-291 Springer 2011.

16. Tanbeer S. K., Ahmed C. F., Jeong B.Y.-K., "Discovering periodictransactional databases". In PAKDD ’09: Proceedings of the 13th Pacificon Advances in Knowledge Discovery and Data Mining, pages 242–253, BerliSpringer-Verlag 2009.

International Journal of Trend in Scientific Research and Development (IJTSRD) ISSN: 2456-6470

Oct 2018 Page: 268

large databases", In SIGMOD Conference, pages

Hao , Hong Tzung-Pei and Tseng fuzzy mining with multiple

minimum supports based on fuzzy clustering”, 2333, Springer 2011.

Weimin Ouyang , Qinhua Huang , “Mining direct and indirect fuzzy association rules with multiple minimum supports in large transaction databases”,

Han J., Pei J., Yin Y., and Mao R., "Mining frequent patterns without candidate generation: A

pattern tree approach", Data Min. Knowledge. Discovery. 8(1):53–87, 2004.

g Ma,Dong Gang, “Generalized association rules and decision tree”, Page(s) 4600 – 4603,

Kim Hyea Kyeong, Kim Jae Kyeong , “A product network analysis for extending the market basket

7410, Elsevier 2012.

, and Srivastava J. "Selecting the right interestingness measure for association patterns". In KDD, pages 32–41, 2002.

Vaillant B., Lenca P. , and Lallich S. "A clustering of interestingness measures". Pages 290–297.

Hu T., Sung S. Y., Xiong H., and Fu Q., "Discovery of maximum length frequent itemsets". Inf. Sci., 178:69–87, January 2008.

Kramer Stefan, “The Augmented Itemset Tree: A Data Structure for Online Maximum Frequent Pattern Mining”, pp

beer S. K., Ahmed C. F., Jeong B.-S., and Lee K., "Discovering periodic-frequent patterns in

transactional databases". In PAKDD ’09: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data

253, Berlin, Heidelberg,