Chap. 10 Mining Object, Spatial, Multimedia, Text, and Web Data
Data Mining
Mining Complex Types of Data
- Mining spatial data
- Mining image data
- Mining text data
- Mining the Web
Mining Spatial Databases
- Spatial database
  - Space-related data: maps, VLSI layouts, …
  - Topological and distance information organized by spatial indexing structures
- Spatial data warehousing
  - Issue: different representations and structures
  - Dimensions
    - Nonspatial: 25-30 degrees generalizes to “hot”
    - Spatial-to-nonspatial: “New York” generalizes to “western provinces”
    - Spatial-to-spatial: equi-temperature region generalizes to 0-5 degree region
  - Measures
    - Numerical
    - Spatial: collection of spatial pointers (e.g., to a 0-5 degree region)
Example: BC Weather Pattern Analysis
- Input
  - A map with about 3,000 weather probes scattered in B.C.
  - Daily data for temperature, wind velocity, etc.
  - Concept hierarchies for all attributes
- Output
  - A map that reveals patterns: merged (similar) regions
- Goals
  - Interactive analysis (drill-down, slice, dice, pivot, roll-up)
  - Fast response time
  - Minimizing storage space used
- Challenge
  - A merged region may contain hundreds of “primitive” regions (polygons)
Spatial Merge
- Precomputing: too much storage space
- On-line merge: very expensive
Spatial Association Analysis
- Spatial association rule: A => B [s%, c%]
  - A and B are sets of spatial or nonspatial predicates
    - Topological relations: intersects, overlaps, disjoint, etc.
    - Spatial orientations: left_of, west_of, under, etc.
    - Distance information: close_to, within_distance, etc.
  - s% is the support and c% is the confidence of the rule
- Example
  - is_a(x, “school”) ^ close_to(x, “sports_center”) => close_to(x, “park”) [7%, 85%]
- Progressive refinement
  - First search for a rough relationship (e.g., g_close_to, which generalizes close_to, touch, and intersect) using a rough evaluation (e.g., MBR, the minimum bounding rectangle)
  - Then apply the detailed test only to those objects that have passed the rough test
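The two-step progressive refinement can be sketched as follows. This is a minimal illustration, not the method of any particular system: the object names, coordinates, and threshold are made up, and the "detailed" test is a simple vertex-to-vertex distance standing in for a true polygon distance. Because the MBR distance never exceeds the distance between the objects themselves, the rough filter cannot discard a true match.

```python
from math import hypot

def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a list of vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def mbr_distance(a, b):
    """Smallest distance between two axis-aligned rectangles (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return hypot(dx, dy)

def vertex_distance(p, q):
    """Detailed test (assumption): smallest vertex-to-vertex distance."""
    return min(hypot(x1 - x2, y1 - y2) for x1, y1 in p for x2, y2 in q)

def close_to(objects, target, threshold):
    target_mbr = mbr(target)
    # Step 1: rough evaluation on MBRs (the g_close_to filter).
    candidates = [name for name, pts in objects.items()
                  if mbr_distance(mbr(pts), target_mbr) <= threshold]
    # Step 2: detailed test only on the objects that passed step 1.
    matches = [name for name in candidates
               if vertex_distance(objects[name], target) <= threshold]
    return candidates, matches

# Hypothetical data: one park and three schools given as polygons.
park = [(0, 0), (1, 0), (1, 1), (0, 1)]
schools = {
    "school_a": [(2, 0), (3, 0), (3, 1), (2, 1)],
    "school_b": [(1.5, 5), (6, 5), (6, 0.5)],
    "school_c": [(10, 10), (11, 10), (11, 11)],
}
candidates, matches = close_to(schools, park, threshold=1.5)
print(candidates)  # ['school_a', 'school_b']
print(matches)     # ['school_a']
```

Here school_b passes the cheap MBR test only because its bounding rectangle reaches toward the park; the detailed test then rejects it.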
Spatial Classification
- Spatial classification
  - Analyze spatial objects to derive classification schemes, such as decision trees, in relevance to spatial properties
- Example
  - Classify regions into rich vs. poor
  - Properties: containing university, containing highway, near ocean, etc.
Spatial Cluster Analysis
- Constraint-based clustering
  - Selection of relevant objects before clustering
  - Parameters as constraints
    - K-means, density-based: radius, min points
  - Clustering with obstructed distance
[Figure: spatial data with obstacles (a mountain and a river with a bridge), and the clusters C1 to C4 obtained when clustering without taking obstacles into consideration]
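To make the idea of obstructed distance concrete, here is a small sketch under strong simplifying assumptions that are not in the slides: the river is modeled as the infinite line x = 0, and it can be crossed only at a single bridge point. A clustering algorithm would use `obstructed_distance` in place of the straight-line distance.

```python
from math import hypot

# Assumption for illustration: a river runs along x = 0 and the only
# crossing is a bridge at this point.
BRIDGE = (0.0, 0.0)

def euclidean(p, q):
    return hypot(p[0] - q[0], p[1] - q[1])

def obstructed_distance(p, q, bridge=BRIDGE):
    """Distance between p and q when the river x = 0 is crossable only at the bridge."""
    same_side = (p[0] < 0) == (q[0] < 0)
    if same_side:
        return euclidean(p, q)
    # Opposite banks: the path must detour through the bridge.
    return euclidean(p, bridge) + euclidean(bridge, q)

west = (-1.0, 4.0)
east = (1.0, 4.0)
print(euclidean(west, east))            # 2.0
print(obstructed_distance(west, east))  # about 8.25, via the bridge
```

Two points that look close across the river end up far apart, which is why clustering that ignores the obstacle can merge points that are not mutually reachable.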
Mining Image Data - Retrieval
- Description-based retrieval systems
  - Retrieval based on image descriptions, such as keywords, captions, size, etc.
  - Labor-intensive, poor quality
- Content-based retrieval systems
  - Retrieval based on the image content (features), such as color histogram, texture, shape, and wavelet transforms
  - Sample-based queries
    - Find all of the images that are similar to the features of a given image
  - Feature specification queries
    - Specify or sketch image features like color, texture, or shape, which are translated into a feature vector
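A sample-based query on color histograms might look like the sketch below. The quantization into 4 bins per channel, the histogram-intersection measure, and the tiny pixel lists are all assumptions chosen for illustration; real systems combine several features.

```python
def color_histogram(pixels, bins=4):
    """Normalized histogram over quantized (r, g, b) values; pixels are 0-255 triples."""
    step = 256 // bins
    hist = {}
    for r, g, b in pixels:
        key = (r // step, g // step, b // step)
        hist[key] = hist.get(key, 0) + 1
    total = len(pixels)
    return {key: count / total for key, count in hist.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: the mass the two normalized histograms share."""
    return sum(min(v, h2.get(key, 0.0)) for key, v in h1.items())

def most_similar(sample_pixels, database):
    """Return the name of the database image whose histogram best matches the sample."""
    q = color_histogram(sample_pixels)
    return max(database,
               key=lambda name: histogram_intersection(q, color_histogram(database[name])))

# Hypothetical images, each reduced to four pixels.
database = {
    "sky": [(30, 60, 220)] * 3 + [(250, 250, 250)],
    "sunset": [(240, 120, 30)] * 4,
}
sample = [(20, 50, 230)] * 4
print(most_similar(sample, database))  # sky
```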
Mining Image Data - Retrieval
- Combining searches
  - Search for “blue sky” (top layout grid is blue)
  - Search for “airplane in blue sky” (top layout grid is blue and keyword = “airplane”)
Classification of Image Data
- Classification
  - Decision tree
    - Based on descriptive features
    - Based on content features
- Feature extraction
  - Extract features for classification from the raw image
  - Various image analysis techniques are required
    - Data transformation, edge detection, etc.
- Example
  - Classify sky images to recognize galaxies, stars, etc.
  - By using properties obtained from image analysis
Mining Text Databases
- Text databases (document databases)
  - Large collections of documents from various sources
    - News articles, research papers, books, e-mail messages, and Web pages
  - Data stored is usually semi-structured
  - Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data
- Information retrieval
  - Information is organized into documents
  - Information retrieval problem
    - Locating relevant documents based on user input, such as keywords or example documents
Basic Measures for IR
- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)
\[
\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}
\]
- Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
\[
\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}
\]
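As a quick illustration of the two measures, the sketch below computes them from sets of document identifiers. The identifiers are invented, and returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 2 of the 3 retrieved documents are relevant,
# and 2 of the 4 relevant documents were found.
p, r = precision_recall(relevant={"d1", "d2", "d3", "d4"},
                        retrieved={"d1", "d2", "d5"})
print(p, r)
```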
Keyword-Based Retrieval
- A document is represented by a set of keywords
- Retrieval by keyword matching
- Queries may use expressions of keywords
  - (Car and accessory), (C++ or Java)
- Major difficulties
  - Synonymy: same meaning but different word
    - Ex: the query is “software”; a document about programming is relevant but does not contain the keyword
  - Polysemy: same word but different meaning
    - Ex: the query is “mining”; a document about gold mining is irrelevant but contains the keyword
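Keyword matching with AND/OR expressions reduces to set operations, as in this minimal sketch. The four documents and their keyword sets are made up; the last query shows the synonymy problem, since no document carries the literal keyword.

```python
# Hypothetical collection: each document is represented by its keyword set.
docs = {
    "d1": {"car", "accessory", "price"},
    "d2": {"car", "engine"},
    "d3": {"java", "programming"},
    "d4": {"c++", "programming"},
}

def match_all(keywords):
    """AND query: documents containing every keyword."""
    wanted = set(keywords)
    return sorted(name for name, kws in docs.items() if wanted <= kws)

def match_any(keywords):
    """OR query: documents containing at least one keyword."""
    wanted = set(keywords)
    return sorted(name for name, kws in docs.items() if wanted & kws)

print(match_all(["car", "accessory"]))  # ['d1']
print(match_any(["c++", "java"]))       # ['d3', 'd4']
print(match_any(["software"]))          # [] although d3 and d4 are about programming
```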
Similarity-Based Retrieval
- A document is represented as a keyword vector
- Retrieval by similarity computing
- Basic techniques
  - Stop list: set of words that are frequent but irrelevant
    - Ex: a, the, of, for, with, …
  - Stemming: use a common word stem
    - Ex: drug, drugs, drugged are all reduced to drug
  - Weighting: count frequency
    - Term frequency, inverse document frequency, …
- Similarity metrics
  - Measure the closeness of a document to a query
  - Cosine similarity:
\[
\text{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}
\]
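The cosine measure can be computed directly from two keyword vectors of equal length. The sketch below is a plain-Python illustration; returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero.

```python
from math import sqrt

def cosine_similarity(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|) for equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = sqrt(sum(a * a for a in v1))
    norm2 = sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm1 * norm2)

query = [3, 4]
document = [4, 3]
print(cosine_similarity(query, document))  # 24 / 25, i.e. about 0.96
```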
TF-IDF Weighting
- TF (Term Frequency)
  - TF = f(t, d): how many times term t appears in doc d
  - More frequent means more relevant to the topic
  - Normalization: document length varies, so relative frequency is preferred
- IDF (Inverse Document Frequency)
  - IDF = 1 + log(n / k): decreases with the number of documents in which term t appears
    - n: total number of docs
    - k: number of docs in which term t appears (the document frequency)
  - Less frequent among documents means more discriminative
- TF-IDF weighting
  - weight(t, d) = TF(t, d) * IDF(t)
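The weighting scheme can be sketched in a few lines. The slides do not say which logarithm base is meant, so the natural logarithm is an assumption here, as are the two toy documents; TF is taken as the relative frequency, following the normalization note.

```python
from math import log

def tf_idf(docs):
    """docs: dict name -> list of tokens. Returns name -> {term: weight}."""
    n = len(docs)
    df = {}  # document frequency k of each term
    for tokens in docs.values():
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    weights = {}
    for name, tokens in docs.items():
        weights[name] = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)  # relative term frequency
            idf = 1 + log(n / df[term])            # IDF = 1 + log(n / k), natural log assumed
            weights[name][term] = tf * idf
    return weights

# Hypothetical two-document collection.
docs = {
    "d1": ["data", "mining", "data"],
    "d2": ["gold", "mining"],
}
weights = tf_idf(docs)
print(weights["d1"])
```

In this example "mining" appears in both documents, so its IDF is 1 and its weight in d1 is just its relative frequency, while "data" is both more frequent in d1 and rarer across documents.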
Latent Semantic Indexing
- Reduce the dimension of the keyword matrix
  - To resolve the synonym problem and the size problem
- Uses the singular value decomposition (SVD) technique
- Example: document-term matrix A (rows are documents D1 to D6, columns are terms)

        universe  rocket  moon  car  truck
  D1       1        0      1     1     0
  D2       0        1      1     0     0
  D3       1        0      0     0     0
  D4       0        0      0     1     1
  D5       0        0      0     1     0
  D6       0        0      0     0     1
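SVD
- Singular Value Decomposition
  - Decompose the matrix A (m x n):
\[
A_{m \times n} = U_{m \times m}\, S_{m \times n}\, (V_{n \times n})^T
\]
- Reduce dimension
  - Select the largest k singular values:
\[
A'_{m \times n} = U_{m \times k}\, S_{k \times k}\, (V_{n \times k})^T
\]
- Projection of A into k dimensions
\[
A'_{m \times n}\, V_{n \times k} = U_{m \times k}\, S_{k \times k}
\]
- Computing similarity
\[
A A^T = U S V^T (U S V^T)^T = U S V^T V S^T U^T = (US)(US)^T
\]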
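SVD
- Decomposition of the example matrix, A = U S V^T (rows of U correspond to D1 to D6):
\[
U = \begin{pmatrix}
-0.75 & -0.29 & 0.28 & 0.00 & -0.53 \\
-0.28 & -0.53 & -0.75 & 0.00 & 0.29 \\
-0.20 & -0.19 & 0.45 & 0.58 & 0.63 \\
-0.45 & 0.63 & -0.20 & 0.00 & 0.19 \\
-0.33 & 0.22 & 0.12 & -0.58 & 0.41 \\
-0.12 & 0.41 & -0.33 & 0.58 & -0.22
\end{pmatrix}
\qquad
S = \begin{pmatrix}
2.16 & 0.00 & 0.00 & 0.00 & 0.00 \\
0.00 & 1.59 & 0.00 & 0.00 & 0.00 \\
0.00 & 0.00 & 1.28 & 0.00 & 0.00 \\
0.00 & 0.00 & 0.00 & 1.00 & 0.00 \\
0.00 & 0.00 & 0.00 & 0.00 & 0.39
\end{pmatrix}
\qquad
V^T = \dots
\]
- Projection into k = 2 dimensions (rows correspond to D1 to D6):
\[
U S_2 = \begin{pmatrix}
-1.62 & -0.46 \\
-0.60 & -0.84 \\
-0.04 & -0.30 \\
-0.97 & 1.00 \\
-0.71 & 0.35 \\
-0.26 & 0.65
\end{pmatrix}
\]
- Document similarity from (US)(US)^T, with each row of US scaled to unit length (rows and columns D1 to D6; the matrix is symmetric, so only the lower triangle is shown):
\[
\begin{pmatrix}
1.00 & & & & & \\
0.78 & 1.00 & & & & \\
0.40 & 0.88 & 1.00 & & & \\
0.47 & -0.18 & -0.62 & 1.00 & & \\
0.74 & 0.16 & -0.32 & 0.94 & 1.00 & \\
0.10 & -0.54 & -0.87 & 0.93 & 0.74 & 1.00
\end{pmatrix}
\]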
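As a worked check of one similarity entry, take the two-dimensional projections of D1 and D4 to be (-1.62, -0.46) and (-0.97, 1.00). The signs are an assumption here, chosen so that the columns of U are orthogonal and the result agrees with the tabulated value; the similarity is computed as the cosine of the two projected vectors.

```latex
\[
\text{sim}(D_1, D_4)
= \frac{(-1.62)(-0.97) + (-0.46)(1.00)}
       {\sqrt{1.62^2 + 0.46^2}\;\sqrt{0.97^2 + 1.00^2}}
= \frac{1.11}{1.68 \times 1.39}
\approx 0.47
\]
```

D1 and D4 share only the term "car", so a moderate positive similarity is what one would expect in the reduced space.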
Automatic Document Classification
- Motivation
  - Automatic classification for the tremendous number of on-line text documents (Web pages, e-mails, etc.)
- A classification problem
  - Training set: human experts generate a training data set
  - Classification (learning): the system discovers the classification rules
- Methods
  - Extract keywords and weights from documents
    - Documents are represented as (keyword, weight) pairs
  - Classify training documents into classes
  - Apply a classification algorithm
    - Decision tree, Bayesian, neural network, etc.
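One way to realize the Bayesian option is a multinomial naive Bayes classifier over keyword counts, sketched below. This is only one of the listed alternatives; the training documents and labels are invented, raw counts stand in for the keyword weights, and add-one (Laplace) smoothing is an assumption of this sketch.

```python
from collections import Counter, defaultdict
from math import log

def train(examples):
    """examples: list of (tokens, label) pairs prepared by human experts."""
    class_docs = Counter()              # number of training documents per class
    term_counts = defaultdict(Counter)  # keyword counts per class
    vocab = set()
    for tokens, label in examples:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab

def classify(model, tokens):
    """Return the class with the highest log-probability for the given keywords."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label in class_docs:
        score = log(class_docs[label] / total_docs)  # class prior
        total_terms = sum(term_counts[label].values())
        for term in tokens:
            if term in vocab:  # keywords never seen in training are ignored
                # Add-one smoothing keeps unseen (class, term) pairs nonzero.
                score += log((term_counts[label][term] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set.
training_set = [
    (["goal", "match", "team"], "sports"),
    (["team", "coach", "win"], "sports"),
    (["stock", "market", "price"], "finance"),
    (["market", "bank", "rate"], "finance"),
]
model = train(training_set)
print(classify(model, ["team", "win", "match"]))  # sports
print(classify(model, ["bank", "market"]))        # finance
```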
Mining the World-Wide Web
- WWW provides rich sources for data mining
  - Contents information
  - Hyperlink information
  - Usage information
- Challenges
  - Too huge for effective data warehousing and data mining
  - Too complex and heterogeneous
  - Growing and changing very rapidly
Web Search Engines
- Index-based
  - Search the Web, collect Web pages, index Web pages, and build and store huge keyword-based indices
  - Locate sets of Web pages containing certain keywords
- Deficiencies
  - A topic of any breadth may easily contain hundreds of thousands of documents
  - Many documents that are highly relevant to a topic may not contain the keywords defining it (synonymy, polysemy)
Web Contents Mining - Classification
- Web page/site classification
  - Assign a class label to each Web page from a set of predefined topic categories
  - Based on a set of examples of preclassified documents
- Example
  - Use Yahoo!'s taxonomy and its associated documents as training and test sets
  - Derive a Web document classification model
  - Use the model to classify new Web documents by assigning categories from the same taxonomy
- Methods
  - Keyword-based classification, use of hyperlink information, statistical models, …
Web Structure Mining
- Finding authoritative Web pages
  - Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic
- Hyperlinks can be used to infer the notion of authority
  - A hyperlink pointing to another Web page can be considered the author's endorsement of that page
- Problems
  - Not every hyperlink represents an endorsement
  - One authority will seldom point to its rival authority
  - Authoritative pages are seldom particularly descriptive
- Hub
  - A set of Web pages that provides collections of links to authorities
HITS (Hyperlink-Induced Topic Search)
- Method
  1. Use an index-based search engine to form the root set
  2. Expand the root set into a base set
     - Include all of the pages that the root-set pages link to, and all of the pages that link to a page in the root set
  3. Apply weight propagation
     - Determines numerical estimates of hub and authority weights
  4. Output a list of the pages
     - Those with large hub weights and large authority weights for the given search topic
- Systems based on the HITS algorithm
  - Clever, and Google (based on a similar link-analysis principle)
  - Achieve better-quality search results than AltaVista, Yahoo!
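The weight-propagation step (step 3) can be sketched as an iterative update on a small link graph. The page names and links below are hypothetical, the base set is assumed to be already built, and the fixed iteration count and Euclidean normalization are choices made for this sketch.

```python
from math import sqrt

def hits(links, iterations=20):
    """links: dict page -> list of pages it links to. Returns (hub, authority) weights."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub weight: sum of the authority weights of the pages it points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the weights stay bounded across iterations.
        a_norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical base set: three link collections pointing at two content pages.
links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
    "a1": [],
    "a2": [],
}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # a1
```

Pages a1 and a2 receive all the authority weight because only they are pointed to, and h1 and h2 outrank h3 as hubs because they link to both authorities.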
Web Usage Mining
- Mining Web log records
  - Discover user access patterns
  - Typical Web log entry: URL requested, the IP address from which the request originated, timestamp, etc.
- OLAP on the Weblog database
  - Find the top N users, top N accessed Web pages, most frequently accessed time periods, etc.
- Data mining on Weblog records
  - Find association patterns, sequential patterns, and trends of Web accessing
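A top-N query over Web log records amounts to grouping and counting, as in this small sketch. The log entries are invented and reduced to (IP address, URL, hour) triples; a real log would first need parsing and cleaning.

```python
from collections import Counter

# Hypothetical, already-parsed log records: (ip, url, hour of day).
log_entries = [
    ("10.0.0.1", "/index.html", 9),
    ("10.0.0.2", "/index.html", 9),
    ("10.0.0.1", "/products.html", 10),
    ("10.0.0.3", "/index.html", 14),
    ("10.0.0.1", "/cart.html", 10),
]

FIELDS = {"ip": 0, "url": 1, "hour": 2}

def top_n(entries, field, n):
    """Return the n most frequent values of the given field with their counts."""
    index = FIELDS[field]
    return Counter(entry[index] for entry in entries).most_common(n)

print(top_n(log_entries, "ip", 1))   # [('10.0.0.1', 3)]
print(top_n(log_entries, "url", 1))  # [('/index.html', 3)]
```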
Web Usage Mining Applications
- Target potential customers for electronic commerce
- Identify potential prime advertisement locations
- Enhance the quality and delivery of Internet information services to the end user
- Improve Web server system performance
  - Web caching, Web page prefetching, and Web page swapping