advance db

8/13/2019 advance db


Assignment Cover Sheet

Research Report on Clustering in Data Mining

Student Id No: 12783348

Student Name: Fazal Din

Name of Subject: Advance Databases and Applications

Name of Lecturer: Dr cue



Table of Contents:

Cluster
Introduction
Ideas of Clustering
Clustering Method Types
Clustering in Oracle Data Mining
Association Models in Oracle Data Mining
Clustering Algorithm Usage
Conclusion
References



Cluster

A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped together in one cluster, and dissimilar objects are placed in other clusters.

Introduction:

Data clustering is the process of grouping objects that are similar in their characteristics. There must be a criterion for measuring the similarity between objects, and the choice of criterion depends on the implementation. Clustering is sometimes confused with classification, but there is a major difference between the two. In classification the objects are assigned to pre-defined classes, whereas in clustering the classes themselves also have to be defined.

In the database sense, data clustering is also a method in which data that is logically similar is physically stored together. To increase efficiency in the database, the number of accesses to the storage hardware must be minimized. Because objects with the same properties are placed in one class, a single access to the disk makes the entire class available.

Ideas of Clustering

To illustrate the concept, take the example of a library system. A library holds a large variety of books on many topics, and these books are usually kept in clusters. Books that have some kind of similarity are put in one cluster. For instance, books on databases are kept on one shelf, and books on systems are placed in another cupboard. To reduce complexity further, books on the same topic are grouped on the same shelf, and the shelves and cupboards are then labeled with the relevant names. When a reader wants a book on a specific topic, he or she only has to go to that shelf and look for the book there, instead of searching the whole library.



Clustering Method Types:

There are many clustering methods, and each of them may produce a different grouping of the same dataset. The choice of a particular method depends on the type of result desired, the known performance of the method with particular types of data, the hardware and software facilities available, and the size of the dataset. In simple terms, clustering methods can be divided into two types based on the cluster structure they produce: non-hierarchical and hierarchical.

The non-hierarchical methods divide a dataset of N objects into M clusters, with or without overlap. These methods are sometimes divided into partitioning methods, in which the classes are mutually exclusive, and a less common kind of method in which overlap is allowed. Each object is a member of the cluster with which it is most similar; however, the threshold of similarity has to be defined.

The hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains. The hierarchical methods can be further divided into agglomerative and divisive methods.

In agglomerative methods, the hierarchy is built up in a series of N-1 agglomerations, or fusions, of pairs of objects, beginning with the un-clustered dataset. The less common divisive methods begin with all objects in a single cluster and at each of N-1 steps divide some cluster into two smaller clusters, until each object resides in its own cluster.

The important data clustering methods are as follows:

Partitioning Method
Hierarchical Agglomerative Method
The Single Link Method
The Complete Link Method
The Group Average Method
Text Based Method



Partitioning Method:

The partitioning method normally results in a set of M clusters, with each object belonging to exactly one cluster. Every cluster is represented by a centroid or cluster representative, which is a summary description of all the objects contained in the cluster. The precise form of this description depends on the type of object being clustered. Where real-valued data is available, the arithmetic mean of the attribute vectors of all objects within the cluster is an appropriate representative. Other types of centroid may be required in other cases; for instance, a cluster of documents can be represented by a list of the keywords that occur in at least some minimum number of documents within the cluster. If the number of clusters is large, the centroids can themselves be clustered to produce a hierarchy within the dataset. A special case of this method, called single pass, is described next.
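For real-valued data, the centroid described above can be sketched in a few lines of Python. This is only an illustration: the function name `centroid` and the sample points are assumptions made for the example and do not come from any particular library.

```python
from typing import List, Sequence


def centroid(points: Sequence[Sequence[float]]) -> List[float]:
    """Return the attribute-wise arithmetic mean of a non-empty cluster."""
    if not points:
        raise ValueError("a cluster must contain at least one object")
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


# Hypothetical cluster of three objects with two attributes each.
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
print(centroid(cluster))  # [3.0, 4.0]
```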

Single Pass:

A simple partitioning method, which works as follows:

1. Make the first object the centroid of the first cluster.
2. For the next object, calculate the similarity, S, with each existing cluster centroid, using some similarity coefficient.
3. If the highest calculated value of S is greater than some specified threshold, add the object to the corresponding cluster and recompute the centroid; otherwise, use the object to start a new cluster. If any objects remain to be clustered, return to step 2.
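The three steps above can be sketched as follows. The report does not fix a similarity coefficient, so this sketch assumes S = 1 / (1 + Euclidean distance); the names `single_pass` and `similarity`, the threshold value, and the sample data are likewise assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed coefficient: 1 / (1 + Euclidean distance), a value in (0, 1]."""
    return 1.0 / (1.0 + math.dist(a, b))


def mean(points: Sequence[Point]) -> List[float]:
    """Attribute-wise arithmetic mean, used as the cluster centroid."""
    return [sum(col) / len(points) for col in zip(*points)]


def single_pass(objects: Sequence[Point], threshold: float) -> List[List[Point]]:
    clusters: List[List[Point]] = []
    centroids: List[List[float]] = []
    for obj in objects:
        # Step 2: similarity of the object to every existing centroid.
        best, best_s = None, -1.0
        for i, c in enumerate(centroids):
            s = similarity(obj, c)
            if s > best_s:
                best, best_s = i, s
        if best is not None and best_s > threshold:
            # Step 3a: join the most similar cluster and recompute its centroid.
            clusters[best].append(obj)
            centroids[best] = mean(clusters[best])
        else:
            # Step 1 / 3b: the object starts a new cluster.
            clusters.append([obj])
            centroids.append(list(obj))
    return clusters


data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (0.0, 0.5)]
for cluster in single_pass(data, threshold=0.3):
    print(cluster)
```

With this data and threshold the sketch forms two clusters, one around the origin and one around (10, 10.5). Feeding the same objects in a different order can change the result, because each decision depends on the centroids that exist at that moment.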



As the name suggests, this method needs only one pass through the dataset; the time requirements are typically of order O(N log N) for order O(log N) clusters. This makes it a very efficient clustering method for a serial processor. A drawback is that the resulting clusters depend on the order in which the documents are processed, with the first clusters formed usually being larger than those created later in the clustering run.



Hierarchical Agglomerative method:

This type of clustering method is commonly used because of its high accuracy. The construction of a hierarchical agglomerative classification can be understood from the following general algorithm.

1. Find the two closest objects and merge them into a cluster.
2. Find and merge the next two closest points, where a point is either an individual object or a cluster of objects.
3. If more than one cluster remains, return to step 2.

Individual methods are characterized by the definition used to identify the closest pair of points, and by the means used to describe the new cluster when two clusters are merged.
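The general algorithm can be sketched as below. The definition of "closest" is passed in as a function, since that is what distinguishes the individual methods; the default used here (distance between the two nearest members) and the names `agglomerate` and `closest_pair_distance` are assumptions made for this example. The sketch recomputes every pairwise distance at each step, so it is only suitable for small N.

```python
import math
from itertools import combinations


def closest_pair_distance(a, b):
    """Assumed definition of closeness: distance between the two nearest members."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(objects, cluster_distance=closest_pair_distance):
    """Merge the two closest clusters until one remains.

    Returns the N-1 merges as (left_cluster, right_cluster, distance) tuples,
    in the order in which they were performed.
    """
    clusters = [[tuple(o)] for o in objects]  # every object starts alone
    merges = []
    while len(clusters) > 1:
        # Steps 1 and 2: find the closest pair of points (objects or clusters).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        d = cluster_distance(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)  # Step 3: repeat while more than one cluster remains
    return merges


for left, right, d in agglomerate([(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]):
    print(left, "+", right, "at distance", d)
```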

There are several approaches to implementing this algorithm; two of them, the stored matrix and stored data approaches, are explained here.

In the stored matrix approach, an N*N matrix containing all pairwise distance values is first created, and then updated as new clusters are formed. This approach has at least an O(N^2) time requirement, rising to O(N^3) if a simple serial scan of the matrix is used to identify the points that need to be fused in each agglomeration, which is a serious limitation for large N.

The stored data approach requires the pairwise dissimilarity values to be recalculated for each of the N-1 agglomerations, so its O(N) space requirement is achieved at the expense of an O(N^3) time requirement.



The Single Link Method:

The single link method is the best known of the hierarchical methods. It operates by joining, at each step, the two most similar objects that are not yet in the same cluster. The name single link refers to the joining of pairs of clusters by the single shortest link between them.

The Complete Link Method:

The complete link method is similar to the single link method, except that it uses the least similar pair between two clusters to determine the inter-cluster similarity, so that every cluster member is more like the furthest member of its own cluster than the furthest item in any other cluster. This method is characterized by small, tightly bound clusters.

The Group Average Method:

This method relies on the average value of the pairwise similarities within a cluster, rather than on the maximum or minimum similarity as in the single link and complete link methods. Since all objects in a cluster contribute to the inter-cluster similarity, each object is, on average, more like every other member of its own cluster than the objects in any other cluster.
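The difference between the three methods can be seen by computing each measure for the same pair of clusters. The sketch below works with Euclidean distance rather than similarity (a small distance means a high similarity), and it takes the group average over all pairs drawn from the two clusters; both choices, and the sample clusters, are assumptions made for illustration.

```python
import math


def pairwise_distances(a, b):
    """All distances between a member of cluster a and a member of cluster b."""
    return [math.dist(p, q) for p in a for q in b]


def single_link(a, b):
    """Most similar pair: the shortest link between the two clusters."""
    return min(pairwise_distances(a, b))


def complete_link(a, b):
    """Least similar pair: the longest link between the two clusters."""
    return max(pairwise_distances(a, b))


def group_average(a, b):
    """Average over every pair, so all members contribute."""
    distances = pairwise_distances(a, b)
    return sum(distances) / len(distances)


# Hypothetical one-dimensional clusters.
a = [(0.0,), (2.0,)]
b = [(5.0,), (9.0,)]
print(single_link(a, b), complete_link(a, b), group_average(a, b))
```

For these clusters the pairwise distances are 5, 9, 3 and 7, so the single link value is the smallest of them, the complete link value is the largest, and the group average lies in between.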

Text Based Documents Method:

For text based documents, clusters are formed according to the keywords that occur most often in the documents. When a query arrives with particular words, then instead of the entire database only the clusters that have those words in their keyword lists are scanned. The order of the results depends on the number of times the keywords appear in each document.
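A minimal sketch of this idea is given below. It assumes that the documents have already been grouped into clusters, that a cluster's keyword list is simply its k most frequent words, and that a query is a single word; the names `top_keywords` and `search` and the sample documents are hypothetical.

```python
from collections import Counter


def top_keywords(documents, k):
    """The k most frequent words across the documents of one cluster."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    return {word for word, _ in counts.most_common(k)}


def search(word, clusters, k=3):
    """Scan only clusters whose keyword list contains the word.

    Matching documents are ranked by how often the word occurs in them.
    """
    word = word.lower()
    hits = []
    for documents in clusters.values():
        if word not in top_keywords(documents, k):
            continue  # this cluster is skipped without reading its documents
        for doc in documents:
            occurrences = doc.lower().split().count(word)
            if occurrences:
                hits.append((occurrences, doc))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [doc for _, doc in hits]


clusters = {
    "databases": ["index query index", "query table index"],
    "networks": ["router packet router", "packet switch"],
}
print(search("index", clusters, k=2))
# ['index query index', 'query table index']
```

A real system would compute the keyword lists once and store them with the clusters instead of rebuilding them for every query, as this sketch does for brevity.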



Clustering in Oracle Data Mining

Clustering is a useful tool for exploring data. It is particularly useful when there are many cases and no obvious natural groupings; clustering data mining algorithms can then be used to find whatever natural groupings may exist. Clustering analysis identifies clusters embedded in the data. A cluster is a collection of data objects that are similar to one another. A good clustering method produces high-quality clusters, ensuring that the inter-cluster similarity is low and the intra-cluster similarity is high; in other words, members of a cluster are more like each other than they are like members of a different cluster. Clustering can also serve as a useful data preprocessing step to identify homogeneous groups on which to build predictive models.

Clustering models differ from predictive models in that the outcome of the process is not guided by a known result; there is no target attribute. Predictive models predict values for a target attribute, and an error rate between the actual and predicted values can be calculated to guide model building. Clustering models, on the other hand, uncover natural groupings in the data. The model can then be used to assign grouping labels (cluster IDs) to data points.

In Oracle Data Mining (ODM) a cluster is characterized by its centroid, its attribute histograms, and its place in the clustering model tree. ODM performs clustering using an enhanced version of the k-means algorithm and O-Cluster, an Oracle proprietary algorithm. The clusters discovered by these algorithms are then used to create rules that capture the main characteristics of the data assigned to each cluster. The rules represent the hyperboxes (bounding boxes) that envelop the data in the clusters discovered by the clustering algorithm. The antecedent of each rule describes the cluster's bounding box, and the consequent encodes the cluster ID for the cluster described by the rule. For instance, for a data set with two attributes, AGE and HEIGHT, the following rule represents most of the data assigned to one cluster:

AGE >= 30 and AGE <= … and HEIGHT >= 6.0ft and HEIGHT <= …
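To make the idea of a bounding-box rule concrete, the sketch below derives such a rule from the records assigned to one cluster. It is not Oracle's implementation: it simply takes the minimum and maximum of every attribute, so the box covers all of the records rather than most of them. The "If ... then CLUSTER = ..." wording, the function name `bounding_box_rule`, the cluster ID and the sample records are assumptions made for this example.

```python
def bounding_box_rule(cluster_id, rows):
    """Build a rule whose conditions enclose every record of the cluster.

    rows is a non-empty list of dicts mapping attribute name to a number.
    """
    conditions = []
    for attribute in sorted(rows[0]):
        values = [row[attribute] for row in rows]
        conditions.append(
            f"{attribute} >= {min(values)} and {attribute} <= {max(values)}"
        )
    return "If " + " and ".join(conditions) + f" then CLUSTER = {cluster_id}"


# Hypothetical records assigned to a hypothetical cluster 7.
records = [
    {"AGE": 30, "HEIGHT": 6.0},
    {"AGE": 35, "HEIGHT": 6.2},
    {"AGE": 41, "HEIGHT": 6.1},
]
print(bounding_box_rule(7, records))
# If AGE >= 30 and AGE <= 41 and HEIGHT >= 6.0 and HEIGHT <= 6.2 then CLUSTER = 7
```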



The clusters are also used to generate a Bayesian probability model, which is used during scoring to assign data points to clusters.

The two clustering algorithms supported by the ODM interfaces are:

Enhanced k-means
Orthogonal partitioning clustering (O-Cluster)
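For reference, the basic k-means procedure on which the first algorithm builds can be sketched as follows. This is the plain textbook version, not Oracle's enhanced variant, and it assumes that the first k points are used as the starting centroids so that the result is repeatable; the function name `k_means` and the sample points are hypothetical.

```python
import math


def k_means(points, k, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then update.

    Returns (assignment, centroids), where assignment[i] is the index of the
    cluster that points[i] belongs to.
    """
    centroids = [list(p) for p in points[:k]]  # assumed initialisation
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 8.0)]
assignment, centroids = k_means(points, k=2)
print(assignment)  # [0, 0, 1, 1]
print(centroids)   # [[1.0, 1.5], [9.0, 8.5]]
```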

Association Models in Oracle Data Mining

The association model is most often associated with market basket analysis, which is used to discover relationships between items. It is widely used in data analysis for marketing predictions and for other business decision-making processes. A typical association rule of this kind asserts, for instance, that 80 percent of the people who buy wine and sauce also buy garlic bread. Association models capture the co-occurrence of items or events in large volumes of customer transaction data. Because of progress in bar-code technology, it is now easy for retail organizations to collect and store huge amounts of sales data, referred to as basket data. Association models were first defined for marketing, even though they are applicable in several other areas. Finding all such rules is valuable for marketing and mail promotions, but there are other applications as well: catalogue design, sales, storage, segmentation, web pages, and target marketing.
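The percentage attached to such a rule is usually called its confidence, and the share of all baskets that contain every item of the rule is called its support. The sketch below computes both for the wine, sauce and garlic bread rule; the basket data is invented for the example and is chosen so that the confidence comes out at 80 percent.

```python
def count_containing(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    with_antecedent = count_containing(antecedent, transactions)
    if with_antecedent == 0:
        return 0.0
    both = count_containing(set(antecedent) | set(consequent), transactions)
    return both / with_antecedent


# Invented basket data: 10 baskets, 5 with wine and sauce, 4 of those with garlic bread.
baskets = (
    [{"wine", "sauce", "garlic bread"}] * 4
    + [{"wine", "sauce"}]
    + [{"wine", "milk"}, {"milk", "garlic bread"}, {"bread"}, {"milk"}, {"sauce"}]
)
print(confidence({"wine", "sauce"}, {"garlic bread"}, baskets))  # 0.8
print(support({"wine", "sauce", "garlic bread"}, baskets))       # 0.4
```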



Clustering Algorithm Usage:

A clustering algorithm can be evaluated on a real-valued data set. First, we take samples of cancerous and non-cancerous data and label them. We then mix both sets of samples and apply different clustering algorithms to the mixed data; this is called the learning phase. Because we know the correct label of every sample in advance, we can check the results and compute the percentage of samples that were grouped correctly. For a new sample of the data, if we use the same algorithm we can expect roughly the same percentage of correct results as in the learning phase. On this basis we can search for the most suitable clustering algorithm for our data samples.
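One simple way to compute the percentage of correct results is sketched below. It assumes that each cluster is given the label that is most common among its members and that a sample counts as correct when its own label matches; the function name `clustering_accuracy` and the sample labels are hypothetical.

```python
from collections import Counter


def clustering_accuracy(cluster_ids, true_labels):
    """Fraction of samples whose label matches the majority label of their cluster."""
    labels_per_cluster = {}
    for cluster_id, label in zip(cluster_ids, true_labels):
        labels_per_cluster.setdefault(cluster_id, Counter())[label] += 1
    correct = sum(
        counts.most_common(1)[0][1] for counts in labels_per_cluster.values()
    )
    return correct / len(true_labels)


# Hypothetical learning-phase data: 8 labeled samples and the cluster each one fell into.
labels = ["cancerous"] * 4 + ["non-cancerous"] * 4
clusters = [0, 0, 0, 1, 1, 1, 1, 1]
print(clustering_accuracy(clusters, labels))  # 0.875
```

Running the same function on the output of several algorithms gives one number per algorithm, which is the basis for the comparison described above.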

Clustering Algorithm in Wireless Networks:

Clustering algorithms can be used in wireless sensor networks. One application where they can be used is land detection. The clustering algorithm plays an important role in finding the cluster heads (or centers), each of which collects all the data in its cluster.
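As an illustration only, one possible way to pick a cluster head is to choose the sensor node that lies closest to the centre of its cluster. The report does not say how heads are selected, so this rule, the function name `choose_cluster_heads`, and the node names and coordinates are all assumptions.

```python
import math


def choose_cluster_heads(clusters):
    """For each cluster, pick the node nearest to the cluster's centre.

    clusters maps a cluster name to a dict of node name -> (x, y) position.
    """
    heads = {}
    for cluster_name, nodes in clusters.items():
        cx = sum(x for x, _ in nodes.values()) / len(nodes)
        cy = sum(y for _, y in nodes.values()) / len(nodes)
        heads[cluster_name] = min(
            nodes, key=lambda name: math.dist(nodes[name], (cx, cy))
        )
    return heads


# Hypothetical sensor positions.
network = {
    "A": {"n1": (0.0, 0.0), "n2": (2.0, 0.0), "n3": (1.0, 1.0)},
    "B": {"n4": (10.0, 10.0), "n5": (11.5, 11.0), "n6": (11.0, 14.0)},
}
print(choose_cluster_heads(network))  # {'A': 'n3', 'B': 'n5'}
```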



CONCLUSION

In this report I have tried to present the major concepts of clustering in data mining, first by defining clustering and then by describing some related algorithms. I gave some examples to make the concept of clustering clear, and then explained different approaches to data clustering, along with some algorithms and how those approaches can be implemented. The hierarchical and partitioning methods of clustering were also explained. The applications of clustering were illustrated with examples such as a medical data set and data mining using data clustering.

I have tried to show the importance of clustering in every area of our subject, Advance Databases and Applications. I have also tried to show that, although clustering is typical of databases, it has a lot of applications in fields like networking and image processing.


