
[IEEE 2005 International Conference on Neural Networks and Brain - Beijing, China (13-15 Oct. 2005)] 2005 International Conference on Neural Networks and Brain - A Distributed Algorithm


A Distributed Algorithm Based on Competitive Neural Network for Mining Frequent Patterns

Yihong Dong
Institute of Artificial Intelligence, Zhejiang University, Hangzhou 310027, China
Institute of Computer Science and Technology, Ningbo University, Ningbo 315211, China
E-mail: [email protected]

Abstract—Although the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm, it is unrealistic to construct a memory-based FP-tree when the dataset is huge, because the FP-tree is too large to be held in memory entirely. In this study, we propose a novel method named the Competitive-Network-based FP-growth method (CNFP), which combines a competitive neural network with FP-growth to mine frequent patterns. In competitive learning, similar patterns are grouped by the network and represented by a single neuron; this grouping is done automatically based on data correlations, so a huge database is divided into sets of similar data. After competitive learning, the neurons in the competitive layer are taken as roots to construct FP-sub-trees, in which the transactions are similar to each other. Frequent patterns are then mined from the FP-sub-trees, decomposing the mining task into a set of smaller tasks and dramatically reducing the search space. CNFP can mine frequent patterns in web log files and discover association rules among the URL pages users access. Not only can it help discover user access patterns effectively, but it also provides valid decision support for the webmaster to design a personalized web site. Our experiments on a large real data set show that the approach is efficient and practical for mining association rules on website pages.

I. INTRODUCTION

Association rules [1], which are widely used in targeted advertising, catalog design, and customer segmentation, are an important research field in data mining. Many Apriori-like approaches [2-4] have achieved good performance. However, it is costly to handle a large number of candidate sets, and the database must be scanned repeatedly. FP-growth [5], which avoids the costly generation of a huge number of candidate sets, is efficient and scalable for mining frequent patterns, and it runs faster than Apriori-like algorithms.

Although the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm, it is unrealistic to construct a memory-based FP-tree when the dataset is huge, because the FP-tree is too large to be held in memory entirely.

Xiaoying Tai, Jieyu Zhao
Institute of Computer Science and Technology, Ningbo University, Ningbo 315211, China

Besides these studies on sequential data mining techniques, algorithms for parallel mining of association rules have been proposed recently. In the PDM [6] and CD [7] algorithms, not only must a great many candidate sets be transmitted, but each iteration must also be synchronized. FDM [8] and DDDM [9] adopt heuristic methods to prune the local candidate sets and reduce the traffic. However, all the above methods create so many local candidate sets on large databases that handling them costs far more time.

In this paper, the FP-tree structure is applied in a distributed environment. Combining a competitive neural network with FP-growth, we propose a novel distributed method to mine frequent patterns, named the Competitive-Network-based FP-growth method (CNFP). In competitive learning, similar patterns are grouped by the network and represented by a single neuron; this grouping is done automatically based on data correlations, so a huge database is divided into sets of similar data. After competitive learning, the neurons in the competitive layer are taken as roots to construct FP-sub-trees, in which the transactions are similar to each other. Frequent patterns are mined from the FP-sub-trees, decomposing the mining task into a set of smaller tasks and dramatically reducing the search space.

The remainder of the paper is organized as follows:

Section II introduces the proposed CNFP algorithm, which extends the existing FP-growth method. Web log files are mined using the CNFP method in Section III. Section IV summarizes our study and points out some future research issues.

II. CNFP MODEL

A. Network Structure
CNFP is a two-layer directed graph with the following characteristics:
(1) The nodes of the directed graph are binary neurons;
(2) The directed graph is divided into two layers: an input layer and a competitive layer;

0-7803-9422-4/05/$20.00 ©2005 IEEE


(3) A fully connected structure is adopted between the input layer and the competitive layer. The connection weight W_ij describes the strength of the connection between the two layers.

There is one winner in the competitive layer for each input datum.

The fully connected structure between the input layer and the competitive layer forms a basic competitive network. The neurons in the competitive layer primarily represent classifications, and the connection weight vectors from the input layer to the competitive layer give the centroids of the sub-clusters.

B. Learning Algorithm
Definition 1 (Transaction): Suppose that I = {a1, a2, ..., am} is a set of items, and a transaction database DB = {T1, T2, ..., Tn}, where Ti (i ∈ [1, n]) is a transaction containing a set of items in I.

Definition 2 (Frequent pattern): The support of a pattern P is the number of transactions containing P in DB. P is a frequent pattern if P's support is no less than a predefined minimum support threshold ξ.

Given a transaction database DB and a minimum support threshold ξ, the task is to find the complete set of frequent patterns.
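To illustrate Definitions 1 and 2, the following sketch naively enumerates frequent itemsets by counting supports directly (the toy transactions and item names are invented for illustration; FP-growth exists precisely to avoid this exponential candidate enumeration):

```python
from itertools import combinations

def frequent_patterns(db, min_support):
    """Naive illustration of the definitions: count how many transactions
    contain each candidate itemset and keep those whose support is no
    less than the minimum support threshold."""
    items = sorted({a for t in db for a in t})
    result = {}
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            support = sum(1 for t in db if set(cand) <= t)
            if support >= min_support:
                result[cand] = support
    return result

# toy transaction database
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}, {"b", "c"}]
print(frequent_patterns(db, min_support=2))
```

With the threshold ξ = 2, the singleton {d} (support 1) is excluded, while {a, b} and {b, c} (support 2 each) survive.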

CNFP learns and works by competition in the competitive network and by constructing FP-sub-trees rooted at the competitive neurons. Each input vector of the neural network, namely a transaction, is presented to the network and compared with the prototype vector it most closely matches. If no prototype vector matches the input pattern, a new prototype is generated to hold the input vector. After training has finished, DB is rescanned to construct the FP-sub-trees: the neuron in the competitive layer activated by an input transaction is treated as the root of an FP-sub-tree, so the transactions that activate the same neuron are placed in the same FP-sub-tree. CNFP mining then mines each FP-sub-tree to discover the frequent patterns.

There are five steps in the learning period: initialization, competition, FP-sub-tree construction, CNFP-growth, and global FP generation.

Initialization: Three parameters are initialized in CNFP: {W_ij}, λ, and ξ, where W_ij is the weight between the input layer and the competitive layer, and λ and ξ are thresholds. λ, a win threshold whose value lies between 0 and 1, represents the required match level with a prototype vector. ξ is the minimum support threshold for mining frequent patterns.

Competition: The competitive process discovers the matching prototype. The winning neuron g is determined by choosing the maximal similarity S_g after the similarity between the input pattern and every prototype vector has been computed.

    S_j = T_k · W_j = Σ_{i=1}^{N} w_ij · t_i,   j = 1, 2, ..., M

The similarity S_j computed from the formula above denotes the degree of match between the input vector T_k and the weight vector W_j; S_j = 1 means a complete match between T_k and W_j. A gap between prototype and input vector is allowed, because the classification clusters similar objects and a complete match is not required. The allowed difference is determined by the threshold λ. If S_j ≥ λ, the recognition is accepted, since the match is within range: the output of the corresponding neuron in the competitive layer is set to 1, and the connection weight vector is revised simultaneously. If S_j < λ, the recognition is rejected, since the match is inadequate, and a new prototype is generated as a new classification.
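The competition step can be sketched as follows. This is an illustrative reading of the text, not the authors' code: it uses the normalized (cosine) form of the similarity, accepts the winner when S_g ≥ λ, and otherwise appends the input as a new prototype.

```python
import math

def similarity(t, w):
    """Cosine similarity between a binary transaction vector t and a
    prototype (weight) vector w, as in the competition step."""
    dot = sum(ti * wi for ti, wi in zip(t, w))
    norm = math.sqrt(sum(ti * ti for ti in t)) * math.sqrt(sum(wi * wi for wi in w))
    return dot / norm if norm else 0.0

def compete(t, prototypes, lam):
    """Return the index of the winning prototype if its similarity reaches
    the win threshold lambda; otherwise create a new prototype for t."""
    if prototypes:
        sims = [similarity(t, w) for w in prototypes]
        g = max(range(len(sims)), key=sims.__getitem__)
        if sims[g] >= lam:
            return g
    prototypes.append(list(t))   # rejected: new prototype, new classification
    return len(prototypes) - 1
```

For example, presenting the same binary pattern twice activates the same neuron, while an orthogonal pattern (similarity 0 < λ) creates a second prototype.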

FP-sub-tree Construction: In the second scan of the database, the winning neuron in the competitive layer is treated as the root of an FP-sub-tree. The frequent items in each transaction T are selected and sorted in order. When a branch is generated for a transaction, the counts of the nodes along the shared prefix path are incremented by 1, and each new node after the prefix path is created with a count of 1.
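The insertion rule above (shared prefixes increment counts; the remainder creates nodes with count 1) can be sketched as follows; this is a minimal illustration of standard FP-tree insertion, not the authors' implementation:

```python
class FPNode:
    """Node of an FP-sub-tree: an item, a count, and children keyed by item."""
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

def insert_transaction(root, sorted_items):
    """Insert one transaction whose frequent items are already sorted in
    support-descending order. Nodes on the shared prefix path have their
    counts incremented; new nodes start at count 1."""
    node = root
    for item in sorted_items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item)           # new node after the prefix path
            node.children[item] = child
        child.count += 1                   # increment along the path
        node = child
```

Inserting ["f", "c", "a"] and then ["f", "c", "b"] under the same root yields counts f:2 and c:2 on the shared prefix, with sibling leaves a:1 and b:1.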

CNFP-growth: For each competitive-neuron root, the conditional pattern base of an initial suffix pattern is constructed from the length-1 frequent patterns. Its conditional FP-tree is then constructed and mined recursively. Pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from the conditional FP-tree. This step produces many local frequent patterns.

Global FP Generation: Local frequent itemsets may not be frequent patterns of the whole DB, but every frequent itemset of DB must appear in at least one partition as a local frequent itemset. The set of local frequent itemsets forms the global frequent itemsets of DB.

The algorithm is as follows:
(1) Initialize: U is the learning count; each weight in {W_ij} is given an arbitrary value between 0 and 1.
(2) A new input pattern T_k = (t_1, t_2, ..., t_N) is presented. Collect the set of frequent items F and their supports.
(3) The similarity between the input pattern and each prototype vector is computed:

    S_j = (T_k · W_j) / (||T_k|| ||W_j||) = Σ_{i=1}^{N} w_ij · t_i / (√(Σ_{i=1}^{N} t_i²) · √(Σ_{i=1}^{N} w_ij²)),   j = 1, 2, ..., M

(4) Select the winner: S_g = MAX_{j=1,...,M} [S_j].
(5) Judge whether S_g > λ holds for the winner g. If it does, go to step (6), because g is a genuine winner; otherwise go to step (8).
(6) The discrimination result is accepted, and the connection weights are adjusted as follows:

    w_ig = w_ig + Δw_ig,   Δw_ig = α (t_i − w_ig),   i = 1, 2, ..., N   (0 < α < 1)

where α is a learning rate.
(7) Return to step (2) for the next input pattern.
(8) A new prototype p = M + 1 is created. Initialize w_ip = t_i, i = 1, 2, ..., N.
(9) Return to step (2) until all p learning samples have been trained.
(10) Let u = u + 1; return to step (2) until u = U.
(11) Sort F in support-descending order as L, the list of frequent items.
(12) Mine the local frequent itemsets with the FP-growth algorithm.
(13) The set of local frequent itemsets forms the global frequent itemsets of DB.
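The final global-FP-generation step can be sketched as below. This is an assumption-laden illustration: it supposes each partition reports its local frequent itemsets with their local support counts, sums the counts over partitions, and applies the global threshold (a complete system would recount candidates in partitions where they were not locally frequent; the site dictionaries here are invented).

```python
from collections import Counter

def global_frequent(local_results, min_support):
    """Union the local frequent itemsets reported by each partition,
    sum their local supports, and keep those meeting the global
    minimum support threshold."""
    total = Counter()
    for site in local_results:
        for itemset, support in site.items():
            total[itemset] += support
    return {s: c for s, c in total.items() if c >= min_support}

# hypothetical per-partition results: itemset -> local support
site1 = {("a",): 4, ("a", "b"): 3}
site2 = {("a",): 2, ("b",): 3}
print(global_frequent([site1, site2], min_support=5))
```

Here only ("a",), with combined support 6, meets the global threshold of 5.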

III. WEB LOG MINING: EXPERIMENTAL RESULT

Web mining [10]-[12], the discovery of knowledge on the Web, has become an interesting research area. Web log mining finds patterns in the usage of the Web. An important topic in Web log mining is the relation among the pages accessed by users: for example, 40% of all users who follow stock pages also pay attention to national economy pages. By analyzing the characteristics of these relations, webmasters can understand Web usage better and provide more suitable, customized services for users. Association rules on the Web provide a very useful tool for many applications in education and e-commerce.

In our experiment, the proposed CNFP is used to mine association rules based on browsing activities, or access patterns, on the Web. The server log data are first pre-processed [12] to identify the sessions.

A. Session Extraction
Web log files contain records of user accesses; each record represents a page request from a Web user. A typical record contains the IP address of the client, the date and time of the request, the URL of the page, the protocol, the return code of the server, and the size of the page if the request is successful. Above all, the users' browsing patterns need to be extracted from the server's log files. A session, which describes the set of URL pages a client browses in a period of time, is used to describe a user's browsing pattern; it is made up of URL pages and the time consumed on those pages.

Definition 3 (Session P): A session P is an element of the form

    P = <sid_i, user_j, {(url_id_k, time_k)}_{k=1}^{n}>

where sid_i is the session id, user_j is user j, url_id_k is the k-th URL the user visits, and time_k is the time the user spends on the web page url_id_k.

Irrelevant information for Web usage mining, such as background images and unsuccessful requests, is ignored. Users are identified by the IP addresses in the log, and all requests from the same IP address and the same operating system are regarded as one user. Two thresholds are used to recognize the sessions. One is min_time, which excludes noise in the data: if the time spent on a page is less than min_time, the page is assumed to be uninteresting to the user, or an index page, and is discarded. The other is max_idle_time: if the elapsed time between two consecutive requests from the same user exceeds max_idle_time, the requests are regarded as belonging to two sessions.

The server log is scanned to recognize the sessions, and a new IP address implies a new session. Subsequent requests from the same IP address are added to its session as long as the elapsed time between two consecutive requests does not exceed max_idle_time; otherwise, the current session is finished and a new session is created.
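The two-threshold session-splitting rule can be sketched for a single user's time-ordered requests as follows. This is an illustrative reading, not the authors' pre-processing code: the time spent on a page is approximated by the gap to the next request, and the dwell time of the last page (which is unknowable from the log) is assumed to pass the min_time filter.

```python
def extract_sessions(records, min_time, max_idle_time):
    """Split one user's time-ordered (timestamp_seconds, url) requests
    into sessions: drop pages viewed for less than min_time, and start
    a new session when the gap between consecutive requests exceeds
    max_idle_time."""
    sessions, current, last_ts = [], [], None
    for i, (ts, url) in enumerate(records):
        # time on page = gap to the next request (unknown for the last page)
        nxt = records[i + 1][0] if i + 1 < len(records) else None
        dwell = (nxt - ts) if nxt is not None else min_time
        if last_ts is not None and ts - last_ts > max_idle_time:
            if current:
                sessions.append(current)   # idle too long: close the session
            current = []
        if dwell >= min_time:
            current.append((url, dwell))   # keep only pages viewed long enough
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

With min_time = 5 s and max_idle_time = 1800 s, a request stream [(0, "/a"), (10, "/b"), (12, "/c"), (3000, "/d")] yields two sessions: "/b" is dropped (viewed 2 s) and the 2988 s gap before "/d" starts a new session.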

B. Mining Frequent Patterns
Our approach has been implemented using VC++ on a 1.7 GHz Pentium IV PC with 256 megabytes of main memory, running Windows 2000. The experimental data set was obtained from "www.nbol.net"; the server logs, taken in November 2004, are approximately 50M in size. After pre-processing the Web log files, with min_time chosen as 5 seconds and max_idle_time chosen as 30 minutes, we extracted sessions and pages from different sets of data, ignoring background images and unsuccessful requests.

Table I shows the results of our algorithm on different sub-datasets when λ is 0.75 and the minimum support is 1%. There are 5 datasets in Table I. In the 10M dataset, there are about 72589 records and 775 extracted sessions, and the URL page number is 147; through the competitive network of CNFP, about 65 neurons are created in the competitive layer. Table I also shows that the number of sessions is linear in the number of records in the datasets, and that the page number increases with the size of the datasets.

Analyzing the frequent itemsets, we find that some interesting association rules have been mined: for example, 46.3% of all users who follow stock pages also pay attention to national economy pages, and 52.7% of those who like basketball also pay attention to football. These useful relationships between URL pages may help webmasters understand web usage better and build adaptive websites that provide more suitable, customized services for users.
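Percentages like the 46.3% above are rule confidences, computed from the mined itemset supports as confidence(A → B) = support(A ∪ B) / support(A). A one-line sketch (the session counts are invented for illustration):

```python
def confidence(support_ab, support_a):
    """Confidence of the association rule A -> B from itemset supports."""
    return support_ab / support_a

# hypothetical counts: 1000 sessions visit the stock page,
# 463 of those also visit the national-economy page
print(f"{confidence(463, 1000):.1%}")  # -> 46.3%
```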


Table I
Datasets in experiment

SIZE  RECORD_NUM  SESSION_NUM  PAGE_NUM  COMP_NEURON_NUM
10M   72589       775          147       65
20M   148722      1505         172       124
30M   214104      2146         211       172
40M   287048      2848         234       227
50M   354312      3556         247       295

C. Comparison of CNFP and Apriori
Because the datasets are so large that it is difficult for FP-growth to hold the FP-tree in memory entirely, we selected Apriori for comparison with our algorithm CNFP. We implemented Apriori to the best of our knowledge, based on the published reports, on the same machine, and compared the two algorithms in the same running environment. Figure 1 shows the relation between runtime and the minimum support threshold on the 50M dataset with λ = 0.75. From the figure we can see that the execution times of both CNFP and Apriori increase as the minimum support is reduced, because the total number of frequent itemsets grows. CNFP performed better than Apriori, as expected. This shows that CNFP scales well as the minimum support threshold is reduced: although the number of frequent itemsets grows exponentially, the run time of CNFP increases in a much more conservative way.

[Figure omitted: runtime vs. minimum support threshold for CNFP and Apriori]
Fig. 1. Scalability with threshold

IV. CONCLUSION

We have proposed a novel algorithm combining a competitive network with FP-growth to mine association rules. CNFP has several advantages. (1) CNFP divides a huge dataset into sets of similar data to construct FP-sub-trees, which solves the memory-cost problem from which FP-growth suffers. (2) In the competitive learning of CNFP, similar patterns are grouped by the network and represented by a single neuron; this grouping is done automatically based on data correlations. This partition, unlike other arbitrary partition methods, is based on clustering, so it makes deriving the global frequent itemsets from the local frequent itemsets easy. Experiments on web log files, discovering the relations between the pages users access, show that the CNFP approach is efficient and practical for Web mining applications.

ACKNOWLEDGEMENT

This work is partially supported by the Scientific Research Fund of the Zhejiang Provincial Education Department of China (20030485) to Yihong Dong and by the Natural Science Foundation of China (NSFC 60472099) to Xiaoying Tai.

REFERENCES

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB'94, pp. 487-499.
[2] R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. In SIGMOD'98, pp. 13-24.
[3] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98, pp. 343-354.
[4] R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In KDD'97, pp. 67-73.
[5] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'00), pp. 1-12, Dallas, TX, May 2000.
[6] J. S. Park, M. S. Chen, and P. S. Yu. Efficient parallel data mining for association rules. The 4th Intl. Conf. on Information and Knowledge Management, Baltimore, Maryland, 1995.
[7] R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Trans. on Knowledge and Data Engineering, 1996, 8(6):962-969.
[8] W. Cheung, J. W. Han, V. T. Ng, et al. A fast distributed algorithm for mining association rules. IEEE 4th Intl. Conf. on Parallel and Distributed Information Systems, Miami Beach, Florida, 1996.
[9] Schuster and R. Wolff. Communication efficient distributed mining of association rules. The 2001 ACM SIGMOD Intl. Conf. on Management of Data, Santa Barbara, California, 2001.
[10] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. San Mateo, CA: Morgan Kaufmann, 2000.
[11] Fu, Y. and Shih, M. A Framework for Personal Web Usage Mining. International Conference on Internet Computing (IC'2002), Las Vegas, NV, 2002.
[12] Fu, Y., Shih, M., et al. Reorganizing Web Sites Based on User Access Patterns. International Journal of Intelligent Systems in Accounting, Finance and Management, 11(1), 2002.
