Chap. 10 Mining Object, Spatial, Multimedia, Text, and Web Data

Data Mining

Mining Complex Types of Data
  - Mining spatial data
  - Mining image data
  - Mining text data
  - Mining the Web

Mining Spatial Databases
  - Spatial database
      - Space-related data: maps, VLSI layouts, …
      - Topological and distance information, organized by spatial indexing structures
  - Spatial data warehousing
      - Issue: different representations and structures
      - Dimensions
          - Nonspatial: e.g., 25-30 degrees generalizes to “hot”
          - Spatial-to-nonspatial: e.g., “New York”, “western provinces”
          - Spatial-to-spatial: e.g., an equi-temperature region such as the 0-5 degree region
      - Measures
          - Numerical
          - Spatial: a collection of spatial pointers (e.g., to the 0-5 degree regions)

Example: BC Weather Pattern Analysis
  - Input
      - A map with about 3,000 weather probes scattered across B.C.
      - Daily data for temperature, wind velocity, etc.
      - Concept hierarchies for all attributes
  - Output
      - A map that reveals patterns: merged (similar) regions
  - Goals
      - Interactive analysis (drill-down, slice, dice, pivot, roll-up)
      - Fast response time
      - Minimal storage space
  - Challenge
      - A merged region may contain hundreds of “primitive” regions (polygons)

Spatial Merge

Precomputing: too much storage space

On-line merge: very expensive

Spatial Association Analysis
  - Spatial association rule: A ⇒ B [s%, c%]
      - A and B are sets of spatial or nonspatial predicates
          - Topological relations: intersects, overlaps, disjoint, etc.
          - Spatial orientations: left_of, west_of, under, etc.
          - Distance information: close_to, within_distance, etc.
      - s% is the support and c% is the confidence of the rule
      - Example: is_a(x, “school”) ∧ close_to(x, “sports_center”) ⇒ close_to(x, “park”) [7%, 85%]
  - Progressive refinement
      - First search for a rough relationship (e.g., g_close_to, which generalizes close_to, touch, and intersect) using a rough evaluation (e.g., minimum bounding rectangles, MBRs)
      - Then apply the detailed test only to those objects that have passed the rough test
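The two-step progressive refinement can be sketched as a filter-and-refine loop. This is a minimal illustration, not the method of any particular system: the object type, the helper names, and the use of minimum vertex-to-vertex distance as the "exact" test are all assumptions made for the example. The MBR gap never exceeds the true distance, so the rough test cannot discard a pair that the exact test would accept.

```python
from math import hypot
from typing import NamedTuple, Tuple


class SpatialObject(NamedTuple):
    name: str
    points: Tuple[Tuple[float, float], ...]  # vertices of the object


def mbr(obj):
    """Minimum bounding rectangle as (xmin, ymin, xmax, ymax)."""
    xs = [p[0] for p in obj.points]
    ys = [p[1] for p in obj.points]
    return min(xs), min(ys), max(xs), max(ys)


def mbr_distance(a, b):
    """Rough test: gap between the two MBRs (0 if they overlap)."""
    ax1, ay1, ax2, ay2 = mbr(a)
    bx1, by1, bx2, by2 = mbr(b)
    dx = max(bx1 - ax2, ax1 - bx2, 0.0)
    dy = max(by1 - ay2, ay1 - by2, 0.0)
    return hypot(dx, dy)


def exact_distance(a, b):
    """Stand-in for the expensive test (assumption): min vertex-to-vertex distance."""
    return min(hypot(px - qx, py - qy)
               for px, py in a.points for qx, qy in b.points)


def close_pairs(objects, threshold):
    """Return (number of rough candidates, pairs that pass the exact test)."""
    candidates = [(a, b)
                  for i, a in enumerate(objects)
                  for b in objects[i + 1:]
                  if mbr_distance(a, b) <= threshold]      # rough g_close_to
    refined = [(a.name, b.name) for a, b in candidates
               if exact_distance(a, b) <= threshold]       # detailed close_to
    return len(candidates), refined


objects = [
    SpatialObject("school", ((0, 0), (2, 0), (2, 2), (0, 2))),
    SpatialObject("park", ((2.5, 0), (4.5, 0), (4.5, 2), (2.5, 2))),
    SpatialObject("lot", ((6, 2.5), (0, 8.5), (6, 8.5))),
    SpatialObject("mall", ((10, 10), (11, 10), (11, 11))),
]
print(close_pairs(objects, threshold=1.0))
```

In this made-up data, three pairs survive the MBR filter but only the school and the park pass the detailed test, so the expensive computation runs on three pairs instead of six.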

Spatial Classification
  - Analyze spatial objects to derive classification schemes, such as decision trees, in relevance to spatial properties
  - Example
      - Classify regions into rich vs. poor
      - Properties: containing a university, containing a highway, near the ocean, etc.

Spatial Cluster Analysis
  - Constraint-based clustering
      - Selection of relevant objects before clustering
      - Parameters as constraints
          - K-means, density-based: radius, minimum points
      - Clustering with obstructed distance
  - Spatial data with obstacles
      - [Figure: clustering without taking obstacles into consideration; clusters C1-C4 on a map with a mountain, a river, and a bridge]
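A toy sketch of why obstructed distance matters for clustering. The geometry is an assumption made for illustration only: the river is the vertical line x = 0 and a single bridge point is the only crossing. Two points on opposite banks are far apart under obstructed distance even when their Euclidean distance is small.

```python
from math import hypot

RIVER_X = 0.0          # hypothetical: the river is the line x = 0
BRIDGE = (0.0, 10.0)   # hypothetical: the only place to cross


def euclidean(p, q):
    return hypot(p[0] - q[0], p[1] - q[1])


def obstructed_distance(p, q):
    """Euclidean distance on the same bank, otherwise the detour via the bridge."""
    same_bank = (p[0] - RIVER_X) * (q[0] - RIVER_X) >= 0
    if same_bank:
        return euclidean(p, q)
    return euclidean(p, BRIDGE) + euclidean(BRIDGE, q)


def nearest(point, candidates, distance):
    """Candidate closest to point under the given distance function."""
    return min(candidates, key=lambda c: distance(point, c))


west, east = (-4.0, 4.0), (1.0, 0.0)
p = (-1.0, 0.0)
print(nearest(p, [west, east], euclidean))            # the point across the river
print(nearest(p, [west, east], obstructed_distance))  # the point on the same bank
```

A distance-based clustering method that uses `obstructed_distance` in place of `euclidean` would therefore keep `p` with the western point rather than grouping it across the river.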

Mining Image Data - Retrieval
  - Description-based retrieval systems
      - Retrieval based on image descriptions, such as keywords, captions, size, etc.
      - Labor-intensive, poor quality
  - Content-based retrieval systems
      - Retrieval based on the image content (features), such as color histogram, texture, shape, and wavelet transforms
      - Sample-based queries
          - Find all of the images whose features are similar to those of a given image
      - Feature specification queries
          - Specify or sketch image features like color, texture, or shape, which are translated into a feature vector
  - Combining searches
      - Search for “blue sky” (top layout grid is blue)
      - Search for “airplane in blue sky” (top layout grid is blue and keyword = “airplane”)
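A small sketch of a sample-based query and a combined search, assuming images are H x W x 3 `uint8` NumPy arrays. The choice of a coarse joint RGB histogram, histogram intersection as the similarity, and the "blue dominates the top third" rule are illustrative assumptions, not the features of any specific retrieval system.

```python
import numpy as np


def color_histogram(image, bins=4):
    """Normalized joint RGB histogram (bins**3 cells) used as the feature vector."""
    pixels = image.reshape(-1, 3).astype(float)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / pixels.shape[0]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return float(np.minimum(h1, h2).sum())


def top_region_is_blue(image, rows=3):
    """Layout test (assumption): blue is the strongest channel in the top grid row."""
    top = image[: image.shape[0] // rows]
    r, g, b = top.reshape(-1, 3).mean(axis=0)
    return bool(b > r and b > g)


def airplane_in_blue_sky(image, keywords):
    """Combined search: content feature plus keyword description."""
    return top_region_is_blue(image) and "airplane" in keywords


# Tiny synthetic images: blue band over green, and a plain red image.
sky = np.zeros((6, 6, 3), dtype=np.uint8)
sky[:2, :, 2] = 255
sky[2:, :, 1] = 255
red = np.zeros((6, 6, 3), dtype=np.uint8)
red[:, :, 0] = 255

print(histogram_intersection(color_histogram(sky), color_histogram(red)))
print(airplane_in_blue_sky(sky, {"airplane", "clouds"}))
```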

Classification of Image Data
  - Classification
      - Decision tree
      - Based on descriptive features
      - Based on content features
  - Feature extraction
      - Extract features for classification from the raw image
      - Various image analysis techniques are required: data transformation, edge detection, etc.
  - Example
      - Classify sky images to recognize galaxies, stars, etc., using properties obtained from image analysis

Mining Text Databases
  - Text databases (document databases)
      - Large collections of documents from various sources: news articles, research papers, books, e-mail messages, and Web pages
      - Data stored is usually semi-structured
      - Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data
  - Information retrieval
      - Information is organized into documents
      - Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents

Basic Measures for IR
  - Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)
  - Recall: the percentage of documents relevant to the query that were in fact retrieved

$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$

$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$
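As a quick check of the two definitions, here is a minimal sketch that computes both measures from sets of document identifiers. The function name and the sample identifiers are made up for the example; returning 0.0 for an empty set is a convention chosen here, not part of the definitions.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)          # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four relevant documents; the system returns three, two of them relevant.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d5"})
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were found (recall 1/2).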

Keyword-Based Retrieval
  - A document is represented by a set of keywords
  - Retrieval by keyword matching
      - Queries may use expressions of keywords: (car AND accessory), (C++ OR Java)
  - Major difficulties
      - Synonymy: same meaning but different word
          - Example: the query is “software”; a document about programming does not contain the keyword
      - Polysemy: same word but different meaning
          - Example: the query is “mining”; a document about gold mining contains the keyword
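Keyword matching with AND/OR expressions can be sketched with an inverted index. This is a minimal illustration with made-up documents and helper names; whitespace tokenization is a simplifying assumption.

```python
from collections import defaultdict


def build_index(docs):
    """Map each keyword to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def query_and(index, *terms):
    """Documents containing every term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def query_or(index, *terms):
    """Documents containing at least one term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


DOCS = {  # hypothetical collection
    1: "car accessory shop",
    2: "car rental",
    3: "Java and C++ programming",
    4: "gold mining",
}
index = build_index(DOCS)
print(query_and(index, "car", "accessory"))  # {1}
print(query_or(index, "C++", "Java"))        # {3}
print(query_and(index, "software"))          # set(): the synonymy problem
print(query_and(index, "mining"))            # {4}: the polysemy problem
```

The last two queries reproduce the difficulties listed above: document 3 is about software but lacks the word, and document 4 contains "mining" in a different sense.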

Similarity-Based Retrieval
  - A document is represented as a keyword vector
  - Retrieval by similarity computing
  - Basic techniques
      - Stop list: a set of words that are frequent but irrelevant, e.g., a, the, of, for, with, …
      - Stemming: use a common word stem, e.g., drug, drugs, drugged → drug
      - Weighting: count frequency (term frequency, inverse document frequency, …)
  - Similarity metrics
      - Measure the closeness of a document to a query
      - Cosine similarity:

$$\text{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}$$
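The cosine formula can be tried on small keyword vectors. In this sketch the stop list is the one given above, raw term counts serve as weights, and stemming is omitted; the function names and sample texts are assumptions for illustration.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "the", "of", "for", "with"}


def keyword_vector(text):
    """Sparse keyword vector: term -> count, with stop words removed."""
    return Counter(t for t in text.lower().split() if t not in STOP_WORDS)


def cosine_similarity(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|); 0.0 if either vector is empty."""
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


query = keyword_vector("mining the web")
doc_a = keyword_vector("mining of the web data")
doc_b = keyword_vector("gold price")
print(round(cosine_similarity(query, doc_a), 3))
print(cosine_similarity(query, doc_b))
```

After stop-word removal the query has two keywords and `doc_a` has three, two of them shared, so the similarity is 2 / (sqrt(2) * sqrt(3)); `doc_b` shares no keyword and scores 0.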

TF-IDF Weighting
  - TF (term frequency)
      - TF = f(t, d): how many times term t appears in document d
      - More frequent → more relevant to the topic
      - Normalization: document length varies, so relative frequency is preferred
  - IDF (inverse document frequency)
      - IDF = 1 + log(n / k), based on how many documents term t appears in
          - n: total number of documents
          - k: number of documents in which term t appears (the document frequency)
      - Less frequent among documents → more discriminative
  - TF-IDF weighting
      - weight(t, d) = TF(t, d) * IDF(t)
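A minimal sketch of the weighting above. Two choices are assumptions made here, since the slide leaves them open: TF is the relative frequency (count divided by document length), and the logarithm is the natural log.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    counts = [Counter(doc) for doc in docs]
    # k for each term: number of documents in which the term appears.
    doc_freq = Counter(term for c in counts for term in c)
    weights = []
    for doc, c in zip(docs, counts):
        length = len(doc)
        weights.append({
            term: (freq / length) * (1 + math.log(n / doc_freq[term]))
            for term, freq in c.items()
        })
    return weights


docs = [
    ["data", "mining", "data"],
    ["web", "mining"],
    ["text", "mining"],
]
for w in tf_idf(docs):
    print({t: round(x, 3) for t, x in w.items()})
```

"mining" occurs in every document, so its IDF is 1 + log(1) = 1 and its weight is just its relative frequency; "data" occurs in one of three documents and gets the larger factor 1 + log(3).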

Latent Semantic Indexing
  - Reduce the dimension of the keyword matrix
      - To resolve the synonymy problem and the size problem
  - Uses the singular value decomposition (SVD) technique
  - Example: document-keyword matrix (rows are documents D1-D6, columns are keywords)

          universe  rocket  moon  car  truck
    D1       1        0      1     1     0
    D2       0        1      1     0     0
    D3       1        0      0     0     0
    D4       0        0      0     1     1
    D5       0        0      0     1     0
    D6       0        0      0     0     1

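SVD (Singular Value Decomposition)
  - Decompose the m × n matrix A:

$$A_{m \times n} = U_{m \times m}\, S_{m \times n}\, (V_{n \times n})^T$$

  - Reduce dimension: select the largest k singular values

$$A'_{m \times n} = U_{m \times k}\, S_{k \times k}\, (V_{n \times k})^T$$

  - Projection of A into k dimensions

$$A'_{m \times n}\, V_{n \times k} = U_{m \times k}\, S_{k \times k}$$

  - Computing similarity

$$A A^T = U S V^T (U S V^T)^T = U S V^T V S^T U^T = (US)(US)^T$$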

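SVD example: A = U S V^T for the 6 × 5 document-keyword matrix (rows of U correspond to D1-D6; V^T is not shown)

$$U = \begin{pmatrix}
-0.75 & -0.29 & 0.28 & 0.00 & -0.53 \\
-0.28 & -0.53 & -0.75 & 0.00 & 0.29 \\
-0.20 & -0.19 & 0.45 & 0.58 & 0.63 \\
-0.45 & 0.63 & -0.20 & 0.00 & 0.19 \\
-0.33 & 0.22 & 0.12 & -0.58 & 0.41 \\
-0.12 & 0.41 & -0.33 & 0.58 & -0.22
\end{pmatrix}, \qquad
S = \mathrm{diag}(2.16,\ 1.59,\ 1.28,\ 1.00,\ 0.39)$$

Projection into k = 2 dimensions (the first two columns of U scaled by the two largest singular values):

$$A V_2 = U S_2 = \begin{pmatrix}
-1.62 & -0.46 \\
-0.60 & -0.84 \\
-0.44 & -0.30 \\
-0.97 & 1.00 \\
-0.70 & 0.35 \\
-0.26 & 0.65
\end{pmatrix}$$

Document similarity in the reduced space, (U S_2)(U S_2)^T, with each row of U S_2 normalized to unit length:

          D1     D2     D3     D4     D5     D6
    D1   1.00   0.78   0.95   0.47   0.74   0.11
    D2          1.00   0.94  -0.18   0.16  -0.53
    D3                 1.00   0.18   0.49  -0.20
    D4                        1.00   0.94   0.93
    D5                               1.00   0.75
    D6                                      1.00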
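The decomposition, the rank-k projection, and the similarity computation can be reproduced with NumPy. This sketch assumes the six-document example with keywords ordered universe, rocket, moon, car, truck, k = 2, and cosine normalization of the projected rows. The signs of individual singular vectors returned by `numpy.linalg.svd` are arbitrary, but the similarities do not depend on them.

```python
import numpy as np

TERMS = ["universe", "rocket", "moon", "car", "truck"]
A = np.array([
    [1, 0, 1, 1, 0],   # D1
    [0, 1, 1, 0, 0],   # D2
    [1, 0, 0, 0, 0],   # D3
    [0, 0, 0, 1, 1],   # D4
    [0, 0, 0, 1, 0],   # D5
    [0, 0, 0, 0, 1],   # D6
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
docs_k = U[:, :k] * s[:k]          # U_k S_k: documents in k dimensions
projected = A @ Vt[:k].T           # A V_k, the same coordinates

# Cosine similarity between documents in the reduced space.
unit = docs_k / np.linalg.norm(docs_k, axis=1, keepdims=True)
similarity = unit @ unit.T

print(np.round(s, 2))
print(np.round(similarity, 2))
```

D2 (rocket, moon) and D3 (universe) share no keyword, so their similarity in the original keyword space is 0, yet they end up close together in the two-dimensional space. That is the sense in which the reduction addresses synonymy.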

Automatic Document Classification
  - Motivation
      - Automatic classification for the tremendous number of on-line text documents (Web pages, e-mails, etc.)
  - A classification problem
      - Training set: human experts generate a training data set
      - Classification (learning): the system discovers the classification rules
  - Methods
      - Extract keywords and weights from documents
          - Documents are represented as (keyword, weight) pairs
      - Classify training documents into classes
      - Apply a classification algorithm
          - Decision tree, Bayesian, neural network, etc.
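As one instance of the Bayesian option, here is a minimal multinomial naive Bayes text classifier with add-one smoothing. It is a sketch on made-up training data, with raw keyword counts as the weights; the function names and labels are assumptions for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: iterable of (token list, class label) from human experts."""
    class_docs = Counter()                 # documents per class
    term_counts = defaultdict(Counter)     # keyword counts per class
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_docs, term_counts, vocabulary


def classify(model, tokens):
    """Return the class with the highest log posterior for the token list."""
    class_docs, term_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)            # class prior
        for t in tokens:
            if t in vocabulary:                          # ignore unseen keywords
                score += math.log((term_counts[label][t] + 1)
                                  / (total_terms + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


TRAINING = [  # hypothetical expert-labeled documents
    (["goal", "match", "team"], "sports"),
    (["team", "coach", "win"], "sports"),
    (["stock", "market", "price"], "finance"),
    (["market", "bank", "rate"], "finance"),
]
model = train_naive_bayes(TRAINING)
print(classify(model, ["team", "win", "match"]))
print(classify(model, ["bank", "stock"]))
```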

Mining the World-Wide Web
  - The WWW provides rich sources for data mining
      - Contents information
      - Hyperlink information
      - Usage information
  - Challenges
      - Too huge for effective data warehousing and data mining
      - Too complex and heterogeneous
      - Growing and changing very rapidly

Web Search Engines
  - Index-based
      - Search the Web, collect Web pages, index them, and build and store huge keyword-based indices
      - Locate sets of Web pages containing certain keywords
  - Deficiencies
      - A topic of any breadth may easily contain hundreds of thousands of documents
      - Many documents that are highly relevant to a topic may not contain the keywords defining it (synonymy), and documents that contain the keywords may use them in another sense (polysemy)

Web Contents Mining - Classification
  - Web page/site classification
      - Assign a class label to each Web page from a set of predefined topic categories
      - Based on a set of examples of preclassified documents
  - Example
      - Use Yahoo!'s taxonomy and its associated documents as training and test sets
      - Derive a Web document classification model
      - Use the model to classify new Web documents by assigning categories from the same taxonomy
  - Methods
      - Keyword-based classification, use of hyperlink information, statistical models, …

Web Structure Mining
  - Finding authoritative Web pages
      - Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic
  - Hyperlinks can be used to infer the notion of authority
      - A hyperlink pointing to another Web page can be considered the author's endorsement of that page
  - Problems
      - Not every hyperlink represents an endorsement
      - One authority will seldom point to its rival authority
      - Authoritative pages are seldom particularly descriptive
  - Hub
      - A set of Web pages that provides collections of links to authorities

HITS (Hyperlink-Induced Topic Search)
  - Method
      1. Use an index-based search engine to form the root set
      2. Expand the root set into a base set
           - Include all of the pages that the root-set pages link to, and all of the pages that link to a page in the root set
      3. Apply weight propagation
           - Determines numerical estimates of hub and authority weights
      4. Output a list of the pages
           - Those with large hub weights and those with large authority weights for the given search topic
  - Systems based on link analysis
      - Clever (based on HITS) and Google (based on PageRank, a related link-analysis algorithm)
      - Achieve better quality search results than AltaVista and Yahoo!
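The weight-propagation step (step 3) can be sketched as the usual mutually reinforcing update: a page's authority weight is the sum of the hub weights of pages linking to it, and its hub weight is the sum of the authority weights of pages it links to, with normalization after each round. The link graph, the page names, and the fixed iteration count below are assumptions for illustration; building the root and base sets is not shown.

```python
from math import sqrt


def hits(links, iterations=50):
    """links: {page: [pages it links to]}. Returns (hub, authority) weight dicts."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub weights of the pages that link in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: sum of authority weights of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both weight vectors to unit length.
        for weights in (auth, hub):
            norm = sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth


BASE_SET = {  # hypothetical base set: h* are link collections, a* are content pages
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
    "a1": [],
    "a2": [],
}
hub, auth = hits(BASE_SET)
print(sorted(auth, key=auth.get, reverse=True)[:2])  # strongest authorities
print(sorted(hub, key=hub.get, reverse=True)[:2])    # strongest hubs
```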

Web Usage Mining
  - Mining Web log records
      - Discover user access patterns
      - Typical Web log entry: the URL requested, the IP address from which the request originated, a timestamp, etc.
  - OLAP on the Weblog database
      - Find the top N users, top N accessed Web pages, most frequently accessed time periods, etc.
  - Data mining on Weblog records
      - Find association patterns, sequential patterns, and trends of Web accessing
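A small sketch of both kinds of analysis on Web log records: top-N counts (the OLAP-style queries) and page-to-page transitions as a very simple sequential pattern. The log entries, the field names, and the choice to treat one IP address as one user are assumptions made for the example.

```python
from collections import Counter, defaultdict, namedtuple

LogEntry = namedtuple("LogEntry", "ip timestamp url")

LOG = [  # hypothetical log records
    LogEntry("10.0.0.1", "2004-05-01T10:00:00", "/index.html"),
    LogEntry("10.0.0.1", "2004-05-01T10:01:00", "/products.html"),
    LogEntry("10.0.0.2", "2004-05-01T10:02:00", "/index.html"),
    LogEntry("10.0.0.1", "2004-05-01T10:03:00", "/cart.html"),
    LogEntry("10.0.0.2", "2004-05-01T10:04:00", "/products.html"),
    LogEntry("10.0.0.3", "2004-05-01T11:00:00", "/index.html"),
]


def top_n(entries, field, n):
    """Most frequent values of one field, e.g. top N users or top N pages."""
    return Counter(getattr(e, field) for e in entries).most_common(n)


def page_transitions(entries):
    """Count consecutive page pairs within each user's time-ordered requests."""
    by_user = defaultdict(list)
    # ISO 8601 timestamps sort correctly as plain strings.
    for e in sorted(entries, key=lambda e: e.timestamp):
        by_user[e.ip].append(e.url)
    pairs = Counter()
    for urls in by_user.values():
        pairs.update(zip(urls, urls[1:]))
    return pairs


print(top_n(LOG, "url", 2))
print(top_n(LOG, "ip", 1))
print(page_transitions(LOG).most_common(1))
```

Frequent transitions such as index page to products page are the kind of pattern that could feed page prefetching or caching decisions.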

Web Usage Mining Applications
  - Target potential customers for electronic commerce
  - Identify potential prime advertisement locations
  - Enhance the quality and delivery of Internet information services to the end user
  - Improve Web server system performance
      - Web caching, Web page prefetching, and Web page swapping
