8/13/2019 advance db
1/13
Assignment Cover Sheet
Research Report on Clustering in Data Mining
Student Id No: 12783348
Student Name: Fazal Din
Name of Subject: Advance Databases and Applications
Name of Lecturer: Dr cue
Table of Contents:
Cluster
Introduction
Ideas of Clustering
Clustering Methods
Clustering in Oracle Data Mining
Association Models in Oracle Data Mining
Clustering Algorithm Applications
Conclusion
Cluster
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped together in one cluster, and dissimilar objects are placed in other clusters.
Introduction:
Data clustering is a process in which we form clusters of objects that are similar in their characteristics. The criterion used to check the similarity between objects is implementation dependent.
Clustering is sometimes confused with classification, but there is a major difference between the two. In classification the objects are assigned to pre-defined classes, whereas in clustering the classes themselves also have to be defined. Data clustering is also a storage method in which data that is logically similar is physically stored together. To increase the efficiency of a database, the number of disk accesses must be minimized. In clustering, objects with similar properties are placed in one class of objects, and a single access to the disk makes the entire class available.
Ideas of Clustering
To illustrate the concept, take the example of a library system. A library holds a large variety of books on many topics. These books are kept in clusters: books that have some kind of similarity among them are put in one cluster. For example, books on databases are kept on one shelf, and books on systems are placed in another cupboard. To reduce the complexity further, books that cover the same kind of topic are placed on the same shelf, and the shelves and cupboards are then labeled with the relevant names. When a user wants a book on a specific topic, he or she only has to go to that shelf and check for the book, rather than searching the entire library.
Clustering Method Types:
There are many clustering methods, and each may produce a different grouping of the same dataset. The choice of a particular method depends on the type of output desired, the known performance of the method with particular types of data, the hardware and software facilities available, and the size of the dataset. In general, clustering methods can be divided into two types based on the cluster structure they produce: non-hierarchical and hierarchical. The non-hierarchical methods divide a dataset of N objects into M clusters, with or without overlap.
These methods are sometimes divided into partitioning methods, in which the classes are mutually exclusive, and less common methods in which overlap is allowed. Each object is a member of the cluster with which it is most similar; however, the threshold of similarity has to be defined. The hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains. The hierarchical methods can be further divided into agglomerative and divisive methods.
In agglomerative methods, the hierarchy is built up in a series of N-1 agglomerations, or fusions, of pairs of objects, beginning with the un-clustered dataset. The less common divisive methods begin with all objects in a single cluster and at each of N-1 steps divide some cluster into two smaller clusters, until each object resides in its own cluster.
Important data clustering methods are as follows:
Partitioning Method
Hierarchical Agglomerative Method
The Single Link Method
The Complete Link Method
The Group Average Method
Text Based Method
Partitioning Method:
A partitioning method normally produces a set of M clusters, with each object belonging to exactly one cluster. Each cluster may be represented by a centroid or cluster representative; this is a summary description of all the objects contained in the cluster. The precise form of this description depends on the type of object being clustered. Where real-valued data is available, the arithmetic mean of the attribute vectors of all objects within a cluster provides an appropriate representative; other types of centroid may be required in other cases. For instance, a cluster of documents can be represented by a list of the keywords that occur in at least some minimum number of documents within the cluster. If the number of clusters is large, the centroids can themselves be clustered to produce a hierarchy within the dataset. A special case of this method, called single pass, is described as follows.
Single Pass:
A simple partitioning method, which works through the following steps:
1. Make the first object the centroid of the first cluster.
2. For the next object, calculate its similarity, S, with each existing cluster centroid, using some similarity coefficient.
3. If the highest calculated S is greater than some specified threshold value, add the object to the corresponding cluster and re-determine the centroid; otherwise, use the object to start a new cluster. If any objects remain to be clustered, return to step 2.
As the name suggests, this method needs only one pass through the dataset; the time requirements are typically of order O(N log N) for order O(log N) clusters. This makes it a very efficient clustering method for a serial processor. A drawback is that the resulting clusters depend on the order in which the documents are processed, with the first clusters formed usually being larger than those created later in the clustering run.
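The single-pass steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: objects are numeric vectors, the similarity coefficient is cosine similarity, and the centroid is the arithmetic mean of the members. The names `single_pass`, `cosine` and `mean` are hypothetical and do not come from the report.

```python
import math

def cosine(a, b):
    """Cosine similarity between two numeric vectors (assumed coefficient)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Arithmetic mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def single_pass(objects, threshold, similarity=cosine):
    """Cluster objects in one pass; the result depends on the input order."""
    clusters = []  # each cluster: {"members": [...], "centroid": [...]}
    for obj in objects:
        # Step 2: similarity S with every existing cluster centroid.
        best, best_s = None, -1.0
        for c in clusters:
            s = similarity(obj, c["centroid"])
            if s > best_s:
                best, best_s = c, s
        # Step 3: join the most similar cluster or start a new one.
        if best is not None and best_s > threshold:
            best["members"].append(obj)
            best["centroid"] = mean(best["members"])
        else:
            clusters.append({"members": [obj], "centroid": list(obj)})
    return clusters

clusters = single_pass([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], threshold=0.8)
print([len(c["members"]) for c in clusters])
```

Feeding the same objects in a different order can give different clusters, which is the order dependence mentioned above.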
Hierarchical Agglomerative method:
This type of clustering method is commonly used because of its high accuracy. The construction of a hierarchical agglomerative classification can be understood through the following general algorithm.
1. Find the 2 closest objects and merge them into a cluster.
2. Find and merge the next two closest points, where a point is either an individual object or a cluster of objects.
3. If more than one cluster remains, return to step 2.
Individual methods are characterized by the definition used to identify the closest pair of points, and by the means used to describe the new cluster when two clusters are merged.
There are several approaches to implementing this algorithm; two of them, the stored matrix and stored data approaches, are explained below.
In the stored matrix approach, an N*N matrix containing all pairwise distance values is first created, and is updated as new clusters are formed. This approach has at least an O(N^2) time requirement, rising to O(N^3) if a simple serial scan of the matrix is used to identify the points that need to be fused in each agglomeration, a serious limitation for large N.
The stored data approach requires the pairwise dissimilarity values to be recalculated for each of the N-1 agglomerations; its O(N) space requirement is therefore achieved at the expense of an O(N^3) time requirement.
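As a rough illustration of the general algorithm with a stored distance matrix, the sketch below precomputes all pairwise distances and then performs N-1 merges with a simple serial scan, which is the slow O(N^3)-style variant described above. It assumes objects are numeric points compared by Euclidean distance (a dissimilarity, so "closest" means smallest value). The name `agglomerate` and its `linkage` parameter are hypothetical.

```python
import math
from itertools import combinations

def agglomerate(points, linkage=min):
    """Naive agglomeration; returns the list of merges (a, b, distance).

    linkage=min behaves like single link, linkage=max like complete link.
    """
    # Stored matrix: pairwise distances between the original objects.
    dist = {(i, j): math.dist(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)}
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Serial scan over all cluster pairs to find the closest pair.
        best = None
        for a, b in combinations(clusters, 2):
            d = linkage(dist[min(i, j), max(i, j)] for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges

for a, b, d in agglomerate([(0, 0), (0, 1), (5, 0), (5, 1.5)]):
    print(a, b, round(d, 3))
```

The merge list has exactly N-1 entries, one per agglomeration, and read in order it describes the nested hierarchy.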
The Single Link Method:
The single link method is the best known of the hierarchical methods. It operates by joining, at each step, the two most similar objects that are not yet in the same cluster. The name single link refers to the joining of pairs of clusters by the single shortest link between them.
The Complete Link Method:
The complete link method is similar to the single link method, except that it uses the least similar pair between two clusters to determine the inter-cluster similarity, so that every cluster member is more like the furthest member of its own cluster than the furthest item in any other cluster. This method is characterized by small, tightly bound clusters.
The Group Average Method
This method relies on the average value of the pairwise similarities within a cluster, rather than on the maximum or minimum similarity as with the single link or complete link methods. Since all the objects in a cluster contribute to the inter-cluster similarity, each object is, on average, more like every other member of its own cluster than the objects in any other cluster.
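The three linkage definitions above differ only in how they reduce the set of pairwise values between two clusters. The sketch below shows this on distances (so the single link takes the minimum and the complete link the maximum); it assumes Euclidean distance between numeric points, and it takes the group average over pairs drawn from the two clusters, which is one common reading of the method. All function names are hypothetical.

```python
import math

def pairwise(a, b):
    """All distances between a member of cluster a and a member of cluster b."""
    return [math.dist(p, q) for p in a for q in b]

def single_link(a, b):
    """Distance of the closest (most similar) pair."""
    return min(pairwise(a, b))

def complete_link(a, b):
    """Distance of the furthest (least similar) pair."""
    return max(pairwise(a, b))

def group_average(a, b):
    """Mean distance over all cross-cluster pairs."""
    d = pairwise(a, b)
    return sum(d) / len(d)

a = [(0,), (1,)]
b = [(4,), (6,)]
print(single_link(a, b), complete_link(a, b), group_average(a, b))
# prints: 3.0 6.0 4.5
```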
Text Based Documents Method:
For text-based documents, clusters are formed according to similarity in the key words that occur most frequently in the documents. When a query arrives for a particular word, instead of the entire database only those clusters are scanned that have that word in their list of key words. The order of the results depends on the number of times the key word appears in each document.
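A minimal sketch of the keyword lookup described above, assuming the clusters already exist and each document is a plain string of space-separated words. The helper names `top_keywords` and `search`, the cluster names, and the choice of `k` are all hypothetical.

```python
from collections import Counter

def top_keywords(docs, k=2):
    """The k most frequent words across a cluster's documents."""
    counts = Counter(w for d in docs for w in d.lower().split())
    return {w for w, _ in counts.most_common(k)}

def search(clusters, word, k=2):
    """Scan only clusters whose keyword list contains the query word."""
    word = word.lower()
    hits = []
    for docs in clusters.values():
        if word in top_keywords(docs, k):
            for d in docs:
                n = d.lower().split().count(word)
                if n:
                    hits.append((n, d))
    # Rank by how often the key word appears in each document.
    hits.sort(key=lambda h: -h[0])
    return [d for _, d in hits]

clusters = {
    "db": ["index index query", "query index table"],
    "net": ["packet router packet", "router link"],
}
print(search(clusters, "index"))
```

In practice the keyword lists would be computed once and stored with each cluster rather than recomputed per query; they are recomputed here only to keep the sketch short.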
Clustering in Oracle Data Mining
Clustering is a useful tool for exploring data. It is most useful when there are many cases and no obvious natural groupings; clustering data mining tools can then be used to find whatever natural groups may exist. Clustering identifies clusters embedded in the data. A cluster is a collection of objects that are similar to one another. A good clustering method creates high-quality clusters, ensuring that the inter-cluster similarity is low and the intra-cluster similarity is high; in other words, members of a single cluster are more similar to each other than they are to members of a different cluster. Clustering can also serve as a useful data pre-processing step to identify homogeneous groups on which to build predictive models. Clustering models differ from predictive models in that the process is not guided by known outputs; there is no target attribute. Predictive models predict values for a target attribute, and an error rate between the target and predicted values can be calculated to guide model building. Clustering models, on the other hand, uncover natural groupings in the data. The model can then be used to assign grouping labels to data points. In Oracle Data Mining (ODM), a cluster is characterized by its centroid, its attribute histograms, and its place in the clustering model tree. ODM performs clustering using an enhanced version of k-means and O-Cluster, an Oracle proprietary algorithm. The clusters discovered by these algorithms are then used to create rules that capture the main characteristics of the data assigned to each cluster. The rules represent the hyperboxes that envelop the data in the clusters discovered by the clustering algorithm. The antecedent of each rule describes the cluster's bounding box. The consequent encodes the cluster ID for the cluster described by the rule. For instance, for a data set with two attributes, AGE and HEIGHT, the following rule represents most of the data assigned to a cluster:
AGE >= 30 and AGE <= … and HEIGHT >= 6.0ft and HEIGHT <= …
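To make the idea of a rule as a bounding box concrete, the sketch below derives a per-attribute range from the records assigned to one cluster and renders it in the same form as the rule above. It is an illustration only and not Oracle's implementation: the records, the cluster ID and the function names are hypothetical, and this box covers all of the given records, whereas the rule in the text covers most of them.

```python
def bounding_box(records):
    """Per-attribute (low, high) range over the records of one cluster."""
    attrs = records[0].keys()
    return {a: (min(r[a] for r in records), max(r[a] for r in records))
            for a in attrs}

def matches(box, record):
    """True if the record lies inside the box on every attribute."""
    return all(lo <= record[a] <= hi for a, (lo, hi) in box.items())

def as_rule(box, cluster_id):
    """Render the box as an antecedent and the cluster ID as the consequent."""
    conds = " and ".join(f"{a} >= {lo} and {a} <= {hi}"
                         for a, (lo, hi) in box.items())
    return f"IF {conds} THEN CLUSTER = {cluster_id}"

# Hypothetical records assigned to a hypothetical cluster 7.
members = [{"AGE": 31, "HEIGHT": 6.1}, {"AGE": 38, "HEIGHT": 6.3}]
box = bounding_box(members)
print(as_rule(box, 7))
print(matches(box, {"AGE": 35, "HEIGHT": 6.2}))
```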
The clusters are also used to generate a Bayesian probability model, which is used during scoring to assign data points to clusters.
The two clustering algorithms supported by the ODM interfaces are:
Enhanced k-means
Orthogonal partitioning clustering (O-Cluster)
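For orientation, the sketch below shows plain k-means (the textbook assign-and-update loop), not ODM's enhanced version, whose details are not given here. It assumes numeric points, Euclidean distance, and caller-supplied starting centroids; the function name and parameters are hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then update."""
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # New centroid = mean of the group; an empty group keeps its centroid.
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids, groups

centroids, groups = kmeans([(1, 1), (1, 2), (8, 8), (9, 8)],
                           centroids=[(0, 0), (10, 10)])
print(centroids)
# prints: [(1.0, 1.5), (8.5, 8.0)]
```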
Association Models in Oracle Data Mining
The association model is most often linked with market basket analysis, which is used to discover relationships between items. It is widely used in data analysis for marketing prediction and for other business decision-making processes. A typical association rule of this kind asserts, for instance, that 80 percent of the people who buy wine and sauce also buy garlic bread. Association models capture the co-occurrence of objects or events in large volumes of customer transaction information. Because of progress in bar-code technology, it is now easy for retail organizations to gather and store huge amounts of sales data, known as basket data. Association models were first defined for marketing, although they are usable in several other applications. Finding all such rules is valuable for marketing and mail promotions, but there are other applications as well: catalogue design, sales, storage, segmentation, web pages, and target marketing.
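The "80 percent" figure in a rule like the one above is the rule's confidence: among the baskets that contain the left-hand items, the fraction that also contain the right-hand item. A small sketch follows, using made-up baskets chosen so that the result is 0.8; the function names and the data are hypothetical.

```python
def support(transactions, items):
    """Fraction of all baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the baskets containing the antecedent, the share that also
    contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_a = [t for t in transactions if antecedent <= set(t)]
    if not with_a:
        return 0.0
    both = sum(1 for t in with_a if consequent <= set(t))
    return both / len(with_a)

# Hypothetical basket data: 5 baskets with wine and sauce, 4 of them
# also containing garlic bread, plus one unrelated basket.
baskets = ([{"wine", "sauce", "garlic bread"}] * 4
           + [{"wine", "sauce"}, {"milk"}])
print(confidence(baskets, {"wine", "sauce"}, {"garlic bread"}))
# prints: 0.8
```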
Clustering Algorithm Usage:
A clustering algorithm can be evaluated on a real-valued data set. First we take samples of cancerous and non-cancerous data and label each sample. We then mix the two sets of samples and apply different clustering algorithms to the mixed data set; we call this the learning phase of clustering. We then check how many samples each algorithm places correctly. Because we know the correct results beforehand, we can find the percentage of correct results. For any other sample of the data set, if we use the same algorithm, we can expect roughly the same percentage of correct results as we obtained during the learning phase for that algorithm. On this basis we search for the most suitable clustering algorithm for our data samples.
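One simple way to compute the "percentage of correct results" from the learning phase is to treat the majority label in each cluster as that cluster's answer and count how many samples agree with it (often called purity). This is an assumed scoring rule, since the text does not say exactly how correctness is measured, and the sample labels below are made up.

```python
from collections import Counter

def purity(cluster_ids, labels):
    """Share of samples whose label matches the majority label of their cluster."""
    by_cluster = {}
    for c, label in zip(cluster_ids, labels):
        by_cluster.setdefault(c, Counter())[label] += 1
    correct = sum(max(counts.values()) for counts in by_cluster.values())
    return correct / len(labels)

# Hypothetical output of one clustering algorithm on six labeled samples.
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = ["cancerous", "cancerous", "non-cancerous",
          "non-cancerous", "non-cancerous", "non-cancerous"]
print(round(purity(cluster_ids, labels), 3))
```

Running the same scoring for each candidate algorithm gives a single number per algorithm, which can then be compared to pick the most suitable one.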
Clustering Algorithm in Wireless Networks:
Clustering algorithms can be used in wireless sensor networks. One application where they can be used is land detection. The clustering algorithm plays the role of finding the cluster heads (or cluster centers), each of which collects all the data in its cluster.
CONCLUSION
In this report I have tried to present the major concepts of clustering in data mining, first by defining clustering and then by describing some related algorithms. I gave some examples to clarify the concept of clustering, and then explained different approaches to data clustering, discussing some algorithms and how those approaches can be implemented. The hierarchical and partitioning methods of clustering were also explained. The applications of clustering were illustrated with examples such as a medical data set and data mining using data clustering.
In this way I have tried to show the importance of clustering in every area of our subject, Advance Databases and Applications. I have also tried to show that clustering is not only typical of databases but also has a lot of applications in fields such as networking and image processing.