Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
Data Management and Data ExplorationProf. Dr. Thomas Seidl
The Future of Multimedia DatabasesThe Future of Multimedia Databases and Multimedia Data Miningg
Thomas SeidlRWTH Aachen University, Germany
MMKM 2008Milton Keynes, Feb. 14th, 2008
Data Management and Data ExplorationProf. Dr. Thomas Seidl
OverviewOverview
• Content-based similarity search– Complex models: quadratic forms, Earth Movers' distance– Efficient algorithms: approximations and indexing
• New interaction models (change of use)– Incremental search, relevance feedback, anytime querying
• From retrieval to new data mining tasksS b l t i tli d t ti t d t i i– Subspace clustering, outlier detection, stream data mining
Thomas Seidl 2MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
OverviewOverview
• Content-based similarity search– Complex models: quadratic forms, Earth Movers' distance– Efficient algorithms: approximations and indexing
• New interaction models (change of use)– Incremental search, relevance feedback, anytime querying
• From retrieval to new data mining tasksS b l t i tli d t ti t d t i i– Subspace clustering, outlier detection, stream data mining
Thomas Seidl 3MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Content based Similarity SearchContent-based Similarity Search3
Retrieval task: WhichRetrieval task: Which images in the archive are similar to the example?
4
1
2 Justification for hits: Similar color frequencies
Thomas Seidl 4MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Example: Mosaic PosterExample: Mosaic Poster
Bild: Hendrik Brixius
Thomas Seidl 5MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Result: Mosaic made from 2 500 ImagesResult: Mosaic made from 2.500 Images
Thomas Seidl 6MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Result: Mosaic made from 2 500 ImagesResult: Mosaic made from 2.500 Images
Thomas Seidl 7MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Description of Image ContentsDescription of Image Contents
• Categories: colors, textures, shapes, etc.• Description by histogramsDescription by histograms
Image Color histogram Texture histogram …Image Color histogram Texture histogram …
#123673 …
#5436430
1
2
Aal
Abend
Amerika
Anblick
Anzahl
Architekt …
Zaun
Zitrone
Zwischen
Zypresse
2
#543643 …
#363273 …0
1
Aal
Abend
Amerika
Anblick
Anzahl
Architekt …
Zaun
Zitrone
Zwischen
Zypresse
0
1
2
Aal
Abend
Amerika
Anblick
Anzahl
Architekt …
Zaun
Zitrone
Zwischen
Zypresse
• Specification of similarity by formal models
Thomas Seidl 8MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Adaptable Similarity ModelsAdaptable Similarity Models
• Euclidean DistanceNeglects cross-bin similarities
• Quadratic FormsLinear AlgebraLinear Algebra
( ) ( )tA qpAqpqpd −⋅⋅−=),(q
• Earth Mover‘s Distancej12
Linear Programming
⎪⎪⎬
⎫⎪⎪⎨
⎧
∀
≥∀∀
∑∑ ∑n n
nij
ij pfifji
fc
qpEMD :0:
min)(
i
20201
0 2 1 2 1 0
⎪⎪⎭
⎬
⎪⎪⎩
⎨
=∀
=∀= ∑∑∑∑
= =
=
=i j n
i jij
j iijijj
C
qfj
pfifm
qpEMD1 1
1
1
:
:,min),(0 2 1 2 1 0
Thomas Seidl 9MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Mosaic Videos Require Faster RetrievalMosaic Videos, Require Faster Retrieval …
Demo AttentionAttractor[Assent Krieger Seidl: ICDE 2007][Assent, Krieger, Seidl: ICDE 2007]
Thomas Seidl 10MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Acceleration by Filter Refine ArchitecturesAcceleration by Filter-Refine Architectures
Usage of filters– Filter step for fast pruning– Filter step for fast pruning
– Refinement step for exact result[GEMINI: Faloutsos 1996; KNOP: Seidl&Kriegel 1998]
Quality of filters: ICESI I Filter is supported by index structures
C Filter does not miss qualifying objects
I ndex-enabled
C omplete
E Filter distance is calculated efficiently
S Filter yields only small candidate set
E fficient
S elective[Assent, Wenning, Seidl: ICDE 2006]
Thomas Seidl 11MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
A Filter for the EMDuse instead
of
A Filter for the EMD
⎬⎫
⎨⎧
∀∀≥∀∀∑∑ ∑∑n n nn
ij fjfifjifc
EMD 0i)(
≥ ⊆⎭⎬
⎩⎨ =∀=∀≥∀∀= ∑∑ ∑∑
= = ==i j íjij
jiijijij
jC yfjxfifjif
myxEMD
1 1 11
:,:,0:,min),(
⎫⎧∑∑ ∑
n n nijc
≥ ⊆Constraint Relaxation
⎭⎬⎫
⎩⎨⎧
≤∀∀=∀≥∀∀= ∑∑ ∑= = =i j
jijj
iijijijij
IM yfijxfifjifmc
yxLB1 1 1
:,:,0:,min),(
∑ ∑ ∑= = = ⎭
⎬⎫
⎩⎨⎧
≤∀=≥∀=n
i
n
jjij
n
jiijijij
ij yfjxffjfmc
1 1 1:,,0:,min
I +++C +++E +++S +++S +++
Filter: [Assent, Wenning, Seidl: ICDE 2006]Index: [Assent, Wichterich, Meisen, Seidl: ICDE 2008]Thomas Seidl 12
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Scalability w r t Database SizeScalability w.r.t. Database Size
Selectivity Response Time
LB_Max LB_Avg 3DLB M LB M LB IM 3D10 00%
100.0
LB_Man LB_Man+LB_IM 3DLB_IM LB_Avg+LB_IM 3D
10.00%
10.0
1.00%
LB_Max
1.00.10%
LB_ManLB_AvgLB_IM
0.125k 50k 75k 100k 125k 150k 175k 200k
0.01%25k 50k 75k 100k 125k 150k 175k 200k
Data bases with 25,000 to 200,000 images, 64d color histograms, 10NN, log. scales
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Scalability w r t DimensionalityScalability w.r.t. Dimensionality
Selectivity Response Time10.000% 1000.0
EMD LB Avg 3D
1.000% 100.0
EMD LB_Avg 3DLB_Man LB_Man+LB_IM 3DLB_IM LB_Avg+LB_IM 3D
0.100%
LB_ManLB_AvgLB IM
10.0
0.010%
LB_IM
1.0
0.001% 0.116 28 40 52 64
Data base with 200,000 images, color histograms of dimensionalities 16 to 64, 10NN, log. scales
16 28 40 52 64 16 28 40 52 64
Data Management and Data ExplorationProf. Dr. Thomas Seidl
OverviewOverview
• Content-based similarity search– Complex models: quadratic forms, Earth Movers' distance– Efficient algorithms: approximations and indexing
• New interaction models (change of use)– Incremental search, relevance feedback, anytime querying
• From retrieval to new data mining tasksS b l t i tli d t ti t d t i i– Subspace clustering, outlier detection, stream data mining
Thomas Seidl 15MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Change of Interaction ModelsChange of Interaction Models
• Incremental Retrieval– From ranges over nearest neighbors to incremental o a ges o e ea est e g bo s to c e e ta
retrieval
• Relevance Feedback– From weights over covariances to ground distances
• Anytime Search– From blind waiting over progress estimation to
progress monitoring
Thomas Seidl 16MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Incremental Nearest Neighbor SearchIncremental Nearest Neighbor Search
{ }o DB d o q∈ ≤( ) ε
Range Queries k-nn Queries
d r DBq q DB-Ranking :ℑ →
4
{ }o DB d o q∈ ≤( , ) ε i i d r i q d r i qq q( ( ), ) ( ( ), )≤ ⇒ ≤1 2 1 2
1
23
4
lt t lt k t i hb (ℑ )no results too many results k-nearest neighbors: rq(ℑk)
Incremental Search: Give-me-more“Thomas Seidl MMKM 2008, Milton Keynes 17
Incremental Search: „Give me more
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Relevance Feedback: MindReaderRelevance Feedback: MindReader
Regard cross-bin similarities (covariance matrix)Y. Ishikawa, R. Subramanya, C. Faloutsos: MindReader: Querying Databases Through Multiple Examples. Proc. 24th Int. Conf. on Very Large Data Bases, 1998.
Thomas Seidl 18MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Relevance Feedback: HistoryRelevance Feedback: History
Incorporate History of FeedbackExponential aging of former feedback influenceExponential aging of former feedback influence
Thomas Seidl 19MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Relevance Feedback: ForesightRelevance Feedback: Foresight
Two PhasesTwo Phases1. Approach fast to relevant area2. Adjust similarity model at destination areaj y
Thomas Seidl 20MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Relevance Feedback: Earth Mover‘s DistanceRelevance Feedback: Earth Mover‘s Distance
From weights over covariances (QF) to ground distances (EMD)
Thomas Seidl 21MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Anytime Data Mining (example: classification)Anytime Data Mining (example: classification)
Algor. 1Algor. 2Al 3Algor. 3Algor. 4
• No predefined budget but query can be interrupted at any timeExpectation: increasing classification accuracy• Expectation: increasing classification accuracy– The later the interruption, the better the classication accuracy should be
Yang Y G I Webb K Korb and K M Ting (2007) Classifying under Computational Resource Constraints: AnytimeYang, Y., G.I. Webb, K. Korb, and K-M. Ting (2007). Classifying under Computational Resource Constraints: Anytime Classification Using Probabilistic Estimators. Machine Learning 69(1). Netherlands: Springer, pp. 35-53.
Thomas Seidl 22MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
OverviewOverview
• Content-based similarity search– Complex models: quadratic forms, Earth Movers' distance– Efficient algorithms: approximations and indexing
• New interaction models (change of use)– Incremental search, relevance feedback, anytime querying
• From retrieval to new data mining tasksS b l t i tli d t ti t d t i i– Subspace clustering, outlier detection, stream data mining
Thomas Seidl 23MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
From Information Retrieval to Data MiningFrom Information Retrieval to Data Mining
• Multimedia Information Retrieval– Driven by queries („query by example“)– Adaptable models for content-based similarity (QF, EMD)– Interactive usage: incremental search, relevance feedback
• Multimedia Data MiningN bj t (Whi h t b it? D b i h l ?)– No query objects (Which to submit? Does browsing help?)
– Reveal patterns which are hidden in vast amounts of data: regularities, irregularitiesg , g
– Typical tasks: clustering, mining association rules, aggregation / generalization of data
Thomas Seidl MMKM 2008, Milton Keynes 24
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Subspace ClusteringSubspace Clustering
27
28
2929
30
• Cluster structures are hidden by noisy dimensions• Separate relevant and irrelevant dimensions locally• Identify clusters and their respective subspacesIdentify clusters and their respective subspacesThomas Seidl 25MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Subspace Outlier DetectionSubspace Outlier Detection
• New challenge: how to define outliers wrt. subspace clusters?• Definition: clusters = dense areas outliers = singularities• Definition: clusters = dense areas, outliers = singularities• Question: Outliers = noise … or outliers = objects of interest ?• Task: Outlier detection = complementary task to clusteringTask: Outlier detection complementary task to clusteringThomas Seidl 26MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Stream Aggregation and GeneralizationStream Aggregation and Generalization
aggr1
Data Aggregation
aggr2
aggr
3Data
Data
Limitations: battery power; bandwidth; monitoring attention
Data Integration
Thomas Seidl MMKM 2008, Milton Keynes 27
Data Management and Data ExplorationProf. Dr. Thomas Seidl
ConclusionConclusion
• Content-based similarity search– Complex models: quadratic forms, Earth Movers' distance– Efficient algorithms: approximations and indexing
• New interaction models (change of use)– Incremental search, relevance feedback, anytime querying
• From retrieval to new data mining tasksS b l t i tli d t ti t d t i i– Subspace clustering, outlier detection, stream data mining
Thomas Seidl 28MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Selected Publications• Assent I., Krieger R., Glavic B., Seidl T.: Clustering of Multidimensional Spatial Sequences. In: Int. Journal on Knowledge and Information
Selected Publications
Systems (KAIS), 2008.• Seidl T.: Nearest Neighbor Classification. In: Liu L., Özsu M. T. (Eds.): Encyclopedia of Database Systems. Springer, 2008. (to appear)• Wichterich M., Assent I., Kranen P., Seidl T.: Efficient EMD-based Similarity Search in Multimedia Databases via Flexible Dimensionality
Reduction. Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD), Vancouver, Canada, 2008.• Assent I., Wichterich M., Meisen T., Seidl T.: Efficient similarity search using the Earth Mover's Distance for large multimedia databases.
Proc. IEEE Int. Conf. on Data Engineering (ICDE), Cancun, Mexico, 2008.• Assent I., Krieger R., Afschari F., Seidl T.: The TS-Tree: Efficient Time Series Search and Retrieval. Proc. Int. Conf. on Extending Data
Base Technology (EDBT), Nantes, France, 2008.• Assent I., Krieger R., Welter P., Herbers J., Seidl T.: SubClass: Classification of Multidimensional Noisy Data Using Subspace Clusters.
Proc. Int. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Osaka, Japan, Springer LNCS-LNAI, 2008.• Wichterich M., Beecks C., Seidl T.: Ranking Multimedia Databases via Relevance Feedback with History and Foresight Support. Proc. 2nd
Int. Workshop on Ranking in Databases (DBRank) in conj. with IEEE 24th Int. Conf. on Data Engineering (ICDE), Cancun, Mexico, 2008. • Müller E., Assent I., Steinhausen U., Seidl T.: OutRank: ranking outliers in high dimensional data. Proc. 2nd Int. Workshop on Ranking in
Databases (DBRank) in conjunction with IEEE 24th Int. Conf. on Data Engineering (ICDE), Cancun, Mexico, 2008.• Zaki M.J., Peters M., Assent I., Seidl T.: Clicks: An effective algorithm for mining subspace clusters in categorical datasets. Data &
Knowledge Engineering (DKE) 60 (1): 51 70 2007Knowledge Engineering (DKE) 60 (1): 51-70, 2007.• Assent I., Krieger R., Müller E., Seidl T.: VISA: Visual Subspace Clustering Analysis. In ACM SIGKDD Explorations Special Issue on
Visual Analytics, Vol. 9(2), Dec. 2007.• Assent I., Krieger R., Müller E., Seidl T.: DUSC: Dimensionality Unbiased Subspace Clustering. Proc. IEEE Int. Conf. on Data Mining
(ICDM), Omaha, Nebraska, USA, 2007.Assent I Krieger R Seidl T : AttentionAttractor: efficient video stream similarity query processing in real time (Demo) Proc IEEE 23rd Int• Assent I., Krieger R., Seidl T.: AttentionAttractor: efficient video stream similarity query processing in real time (Demo). Proc. IEEE 23rd Int. Conf. on Data Engineering (ICDE), Istanbul, Turkey, 2007.
• Assent I., Wenning A., Seidl T.: Approximation Techniques for Indexing the Earth Mover's Distance in Multimedia Databases. Proc. IEEEInt. Conf. on Data Engineering (ICDE), Atlanta, Georgia, USA, 2006.
• Assent I., Wichterich M., Seidl T.: Adaptable Distance Functions for Similarity-based Multimedia Retrieval. In: Datenbank-Spektrum Nr. 19: 23-31 200619: 23-31, 2006.
Thomas Seidl MMKM 2008, Milton Keynes 29
Data Management and Data ExplorationProf. Dr. Thomas Seidl
AbstractAbstract
Multimedia data archives, databases, and web services grow from day to day with a high speed. In order to cope with the huge amount of data, new exploration models and scalable algorithms need to be developed Theexploration models and scalable algorithms need to be developed. The expected changes of use also demand changes on the technology level. Starting with content-based retrieval based on complex distance functions including quadratic forms and Earth Movers' distance interaction modelsincluding quadratic forms and Earth Movers distance, interaction models such as incremental search and relevance feedback are discussed. New data mining approaches including subspace clustering, outlier mining, stream data mining and anytime classification also will be applied tostream data mining, and anytime classification also will be applied to multimedia databases in the future. Following this trend, the underlying retrieval and mining technologies are highly demanded to be extended. F t d l t ill i ld l i ti i d i t h iFuture developments will yield novel approximations, indexing techniques, and multi-step query processing in order to provide efficiency and scalability.
Thomas Seidl 30MMKM 2008, Milton Keynes
Data Management and Data ExplorationProf. Dr. Thomas Seidl
Bio of Thomas SeidlBio of Thomas Seidl
Thomas Seidl is a full professor for Computer Science and head of the Data Management and Data Exploration group at RWTH Aachen University, Germany, where he currently advises eight PhD students.
His research interests include data mining and data management in multimedia and spatio-temporal databases for applications from computational biology, medical imaging, mechanical engineering, computer graphics, etc. whith a focus on content, h t t f l bj t i l d t b C t j t i t f tshape or structure of complex objects in large databases. Current projects aim at fast
content-based multimedia retrieval, relevance feedback, subspace clustering, outlier detection, stream data minining, and anytime mining algorithms. His research in the field of relational indexing aims at exploiting the robustness and high performance ofof relational indexing aims at exploiting the robustness and high performance of relational database systems for complex indexing tasks.
Having finished his MS in 1992 at the Technische Universität München, Thomas received his Ph.D. in 1997 and his venia legendi in 2001 from the University of Munich, g y ,Germany. In 2001, he was a guest lecturer at the University of Augsburg and from 2001 to 2002, he held a substitute professorship for Databases, Data Mining, and Visualization at the University of Constance, Germany.
Thomas Seidl 31MMKM 2008, Milton Keynes