Ferret: A Toolkit for Content-Based Similarity Search of Feature-Rich Data. Qin Lv, Princeton University. Joint work with William Josephson, Zhe Wang, Moses Charikar, and Kai Li.


  • Princeton University 1

    Ferret: A Toolkit for Content-Based Similarity Search of Feature-Rich Data

    Qin Lv

    Joint work with: William Josephson, Zhe Wang, Moses Charikar, and Kai Li

  • Slide 2

    Motivations

    Digital data is everywhere
    • Increases exponentially

    Feature-rich digital data dominates data volume
    • Audio, video, digital photos, scientific sensor data

    Systems support for managing feature-rich data
    • Techniques for text data do not apply
    • Feature-rich data are noisy and high-dimensional
    • Domain efforts limited to small datasets

  • Slide 3

    Current Search Techniques

    Search capability is becoming an integral part of modern operating systems
    • Mac OS X Tiger: Spotlight
    • Windows Vista

    Limited to text-based search
    • Web search engines: Google, Yahoo, Microsoft, …
    • Desktop search: Google, Yahoo, MSN, …
    • Text-based documents
      • Emails, word documents, PDF files, instant messages, …
    • Text-based annotations and attributes
      • Image annotations, music (title, artist, lyrics), …

  • Slide 4

  • Slide 5

    Ferret Toolkit: Design Goals

    Works with multiple feature-rich data types
    • Image, audio, 3D shape model, gene expression data

    High performance
    • Search quality, search speed, memory usage

    [Figure: layered system diagram with the labels "basic file system", "storage layer", "attribute-based search", "content-based search: text", and "content-based search: feature-rich"]

  • Slide 6

    Outline

    • Motivations
    • Ferret toolkit architecture design
    • Similarity search problem
    • Core similarity search engine
    • Using the Ferret toolkit
    • Evaluation results
    • Conclusion & future work

  • Slide 7

    Ferret Toolkit Architecture Design

  • Slide 8

    Similarity Search Problem

    Similarity search
    • Given a query object, find similar objects (i.e., objects containing similar features)

    Distance function d(X, Y)
    • Between objects

    Nearest neighbor search
    • K-nearest neighbor (KNN)
    • Approximate nearest neighbor (ANN)

    Hard problem for high-dimensional search
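As a reference point for the problem statement above, here is a minimal sketch of exact K-nearest-neighbor search by linear scan. The names `l1_distance` and `knn` are illustrative and not part of the Ferret toolkit, and ℓ1 distance is only one possible choice of d(X, Y).

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def l1_distance(x: Vector, y: Vector) -> float:
    """l1 (Manhattan) distance, used here as an example d(X, Y)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def knn(query: Vector, dataset: List[Vector], k: int,
        d: Callable[[Vector, Vector], float] = l1_distance
        ) -> List[Tuple[float, int]]:
    """Return the k (distance, index) pairs closest to the query."""
    scored = ((d(query, obj), i) for i, obj in enumerate(dataset))
    return heapq.nsmallest(k, scored)


# Toy dataset (made up for illustration).
dataset = [(0.0, 0.0), (1.0, 1.0), (0.2, 0.1), (5.0, 5.0)]
nearest = knn((0.0, 0.1), dataset, k=2)
print([index for _, index in nearest])
```

The scan evaluates d once per stored object, so its cost grows with both the dataset size and the dimensionality, which is the cost that approximate methods try to avoid.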

  • Slide 9

    Object Representation & Distance Function

    Multi-feature representation

    Distance function
    • E.g., Earth Mover’s Distance (EMD)
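To make EMD concrete, the sketch below handles only a simplified special case: both objects have the same number of segments and every segment carries equal weight. Under that assumption an optimal flow is a one-to-one matching, so a brute-force search over matchings gives the exact value. General EMD with unequal weights needs a transportation solver, which is not shown. The function names are hypothetical.

```python
from itertools import permutations
from typing import Sequence

Segment = Sequence[float]


def l1(a: Segment, b: Segment) -> float:
    """Ground distance between two segment feature vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))


def emd_uniform(xs: Sequence[Segment], ys: Sequence[Segment]) -> float:
    """EMD for two objects with equally many, equally weighted segments.

    Assumption: uniform weights, so the best flow is a perfect matching.
    Brute force over all matchings; only practical for a few segments.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("this sketch needs equal, non-zero segment counts")
    n = len(xs)
    best = min(
        sum(l1(xs[i], ys[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return best / n  # each segment carries weight 1/n


X = [(0.0, 0.0), (1.0, 1.0)]
Y = [(1.0, 1.0), (0.0, 0.0)]  # same segments, different order
Z = [(0.0, 1.0), (1.0, 1.0)]
print(emd_uniform(X, Y), emd_uniform(X, Z))
```

Segment order does not matter for EMD, which is why X and Y are at distance zero in this example.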

  • Slide 10

    Core Similarity Search Engine

    [Figure: block diagram of the core similarity search engine. Labels on the input path: input data objects, segmentation & feature extraction, sketch construction, sketch database. Labels on the query path: query data object, segmentation & feature extraction, sketch construction, filtering, similarity ranking, results.]

  • Slide 11

    Sketch Construction

    Sketches
    • Compact data structures, estimate properties of original data

    Sketch distance
    • Hamming distance between bit vectors
    • Estimates distance between objects

    [Figure: a complex object is reduced to a sketch (a vector of 0/1 bits); the diagram also shows two example vectors, x = (x1, x2, x3, x4) and y = (y1, y2, y3, y4)]
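The slide does not spell out how the sketch bits are computed. As one possible illustration, the sketch below uses a common random-threshold construction: each bit compares one randomly chosen dimension against a random threshold, so two vectors differ in a bit only when the threshold falls between their values in that dimension. With equal value ranges per dimension, the expected Hamming distance is then proportional to the ℓ1 distance. This is an assumption made for illustration, not a statement of Ferret's actual algorithm, and `make_sketcher` is a hypothetical name.

```python
import random
from typing import Callable, List, Sequence, Tuple


def make_sketcher(lo: Sequence[float], hi: Sequence[float],
                  n_bits: int, seed: int = 0) -> Callable[[Sequence[float]], int]:
    """Build a function that maps a feature vector to an n_bits-bit sketch."""
    rng = random.Random(seed)
    # Each bit tests one randomly chosen dimension against a random threshold.
    tests: List[Tuple[int, float]] = []
    for _ in range(n_bits):
        dim = rng.randrange(len(lo))
        tests.append((dim, rng.uniform(lo[dim], hi[dim])))

    def sketch(x: Sequence[float]) -> int:
        bits = 0
        for dim, threshold in tests:
            bits = (bits << 1) | (1 if x[dim] > threshold else 0)
        return bits

    return sketch


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two sketches stored as integers."""
    return bin(a ^ b).count("1")


sketch = make_sketcher(lo=[0.0, 0.0], hi=[1.0, 1.0], n_bits=64)
sx, sy, sz = sketch((0.1, 0.1)), sketch((0.2, 0.2)), sketch((0.9, 0.9))
print(hamming(sx, sy), hamming(sx, sz))
```

Because (0.2, 0.2) lies between (0.1, 0.1) and (0.9, 0.9) in every dimension, its sketch can never be farther from the first sketch than the third one is, whatever thresholds are drawn.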

  • Slide 12

    Filtering for Similarity Search

    Multi-feature representation
    • Computing object distance is expensive

    Filtering
    • Scans through the entire dataset
    • Uses a much faster distance function to filter out “bad” answers
      • Hamming distance of sketches
    • Computes object distance for a much smaller candidate set

    Criteria in picking candidate objects
    • Has at least one segment that is close enough to one of the major segments of the query object
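The filter-then-rank idea above can be sketched as follows. The threshold `max_bits`, the dictionary layout of the sketch database, and the function names are all assumptions chosen for illustration; the slides do not describe Ferret's actual data structures or parameters.

```python
from typing import Callable, Dict, List, Sequence


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded sketches."""
    return bin(a ^ b).count("1")


def filter_candidates(query_major: Sequence[int],
                      sketch_db: Dict[str, List[int]],
                      max_bits: int) -> List[str]:
    """Keep objects with at least one segment sketch close to a major query segment."""
    candidates = []
    for obj_id, seg_sketches in sketch_db.items():  # scans the entire dataset
        if any(hamming(q, s) <= max_bits
               for q in query_major for s in seg_sketches):
            candidates.append(obj_id)
    return candidates


def search(query_major: Sequence[int], sketch_db: Dict[str, List[int]],
           max_bits: int, object_distance: Callable[[str], float],
           top_k: int) -> List[str]:
    """Filter cheaply, then rank only the candidates with the expensive distance."""
    candidates = filter_candidates(query_major, sketch_db, max_bits)
    return sorted(candidates, key=object_distance)[:top_k]


# Toy data: object id -> segment sketches, plus a stand-in for the
# expensive object distance (e.g. EMD) to the query.
sketch_db = {"a": [0b0000, 0b1111], "b": [0b1110], "c": [0b0111, 0b0011]}
expensive = {"a": 0.7, "b": 0.1, "c": 0.2}
print(search([0b0001], sketch_db, max_bits=1,
             object_distance=expensive.__getitem__, top_k=2))
```

In the toy data, object "b" has the smallest expensive distance but is never ranked because none of its segment sketches passes the filter. This shows the trade-off: filtering saves distance computations at the risk of dropping a good answer.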

  • Slide 13

    Using the Ferret Toolkit

    Can the Ferret toolkit be applied to multiple data types?
    • Image data?
    • Audio data?
    • 3D shape models?
    • Gene expression data?

  • Slide 14

    Image Similarity Search

    Segmentation
    • JSEG segmentation tool from UCSB

    Feature extraction
    • 14-d features: 9-d color moments and 5-d bounding box
    • Segment weight: square root of segment size

    Distance functions
    • Segment distance: weighted ℓ1 distance
    • Object distance: EMD
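A small sketch of the two per-segment quantities named above: square-root segment weights and a weighted ℓ1 distance. Normalizing the weights to sum to 1 and the particular per-dimension weights are assumptions for illustration; the slide gives neither.

```python
import math
from typing import List, Sequence


def segment_weights(sizes: Sequence[float]) -> List[float]:
    """Weight each segment by the square root of its size.

    Assumption: weights are normalized to sum to 1 (convenient for EMD).
    """
    roots = [math.sqrt(s) for s in sizes]
    total = sum(roots)
    return [r / total for r in roots]


def weighted_l1(x: Sequence[float], y: Sequence[float],
                dim_weights: Sequence[float]) -> float:
    """Weighted l1 distance between two segment feature vectors."""
    return sum(w * abs(a - b) for w, a, b in zip(dim_weights, x, y))


# Three segments covering 100, 400, and 2500 pixels (made-up sizes).
print(segment_weights([100, 400, 2500]))
print(weighted_l1((1.0, 2.0), (2.0, 4.0), dim_weights=(0.5, 0.25)))
```

Taking the square root keeps one very large segment, such as a background region, from dominating the object distance.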

  • Slide 15

    Image Similarity Search

  • Slide 16

    Audio Similarity Search

    Segmentation
    • Utterance-level segmenter, human-marked word boundaries

    Feature extraction
    • 32 windows x 6 MFCC parameters = 192 features
    • Segment weight: proportional to segment length

    Distance functions
    • Segment distance: ℓ1 distance
    • Object distance: EMD

  • Slide 17

    Audio Similarity Search

  • Slide 18

    3D Shape Similarity Search

    Segmentation
    • 32 decomposing spheres

    Feature extraction
    • Spherical harmonic descriptor (SHD)
    • 32 x 17 = 544 dimensions

    Distance functions
    • Segment distance: ℓ1 distance
    • Object distance: same as segment distance

  • Slide 19

    3D Shape Similarity Search

  • Slide 20

    Gene Expression Similarity Search

    Segmentation
    • Gene expression microarray data: one gene per row

    Feature extraction
    • Gene expression values

    Distance function
    • Pearson correlation, Spearman correlation, ℓ1 distance
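For reference, a minimal pure-Python sketch of the two correlation measures listed above, applied to rows of expression values. Converting a correlation r into a distance as 1 - r is an assumption made here; the slide does not say how Ferret does the conversion.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_distance(r: float) -> float:
    """Assumed conversion: perfectly correlated rows get distance 0."""
    return 1.0 - r


gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [1.0, 4.0, 9.0, 16.0]  # monotone in gene_a but not linear
print(round(pearson(gene_a, gene_b), 3), round(spearman(gene_a, gene_b), 3))
```

Spearman correlation depends only on the ordering of the values, so it treats the monotone but non-linear pair above as perfectly correlated while Pearson does not.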

  • Slide 21

    Gene Expression Similarity Search

  • Slide 22

    Evaluations

    Can systems built with the Ferret toolkit achieve high-quality similarity search results at high speed?

    How small can the sketches be, given that they serve as the metadata of the similarity search systems?

    How much benefit can we get by using sketching and filtering?

  • Slide 23

    Benchmarks

    Search quality benchmark suite
    • VARY image: 10,000 images, 32 sets
    • TIMIT audio: 6,300 sentences, 450 sets
    • PSB shape: 1,814 3D shape models, 92 sets

    Search speed benchmark suite
    • Mixed image dataset: 660,000 images
    • TIMIT audio: 6,300 sentences
    • Mixed shape dataset: 40,000 3D shape models

  • Slide 24

    Search Quality Metrics

    Given a query q with k similar objects:

    1st-tier recall
    • Percentage of similar objects returned within rank k

    2nd-tier recall
    • Percentage of similar objects returned within rank 2k

    Average precision

    $$\text{Average Precision} = \frac{1}{k} \sum_{i=1}^{k} \frac{i}{\mathrm{rank}_i}$$

    where rank_i is the rank at which the i-th similar object is returned.

    Example: k = 5, return 4 good results ranked at 1, 2, 5, 10
    Average precision = (1/1 + 2/2 + 3/5 + 4/10) / 5 = 0.6
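The three metrics can be computed directly from the ranks at which the similar objects come back. In this sketch a similar object that is never returned simply has no rank and contributes nothing to the sums, which is how the slide's example with 4 of 5 results found is treated. The function names are illustrative.

```python
from typing import Sequence


def tier_recall(good_ranks: Sequence[int], k: int, tier: int = 1) -> float:
    """Fraction of the k similar objects returned within rank tier * k."""
    return sum(1 for r in good_ranks if r <= tier * k) / k


def average_precision(good_ranks: Sequence[int], k: int) -> float:
    """(1/k) * sum of i / rank_i over the similar objects that were returned."""
    ordered = sorted(good_ranks)
    return sum(i / r for i, r in enumerate(ordered, start=1)) / k


# The slide's example: k = 5, four good results at ranks 1, 2, 5, 10.
good_ranks = [1, 2, 5, 10]
print(round(tier_recall(good_ranks, k=5, tier=1), 3))
print(round(tier_recall(good_ranks, k=5, tier=2), 3))
print(round(average_precision(good_ranks, k=5), 3))
```

For the slide's example this gives a 1st-tier recall of 3/5, a 2nd-tier recall of 4/5, and the same average precision of 0.6.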

  • Slide 25

    Search Quality & Search Speed

    Search quality:

    Dataset        Method       Average Precision   1st-tier   2nd-tier   Vector Size (bits)   Size Ratio
    VARY Image     Ferret       0.59                0.54       0.63       96                   5:1
    VARY Image     SIMPLIcity   0.41                0.41       0.47       264
    TIMIT Audio    Ferret       0.44                0.42       0.49       600                  10:1
    PSB 3D Shape   Ferret       0.32                0.30       0.41       800                  22:1
    PSB 3D Shape   SHD          0.33                0.32       0.43       17472

    Search speed:

    Dataset          #Data Objects   #Vectors / Object   Search Time (s)
    Mixed Image      660,000         10.8                2.0
    TIMIT Audio      6,300           8.6                 0.09
    Mixed 3D Shape   40,000          1                   0.01

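  • Slide 26

    Search Quality vs. Sketch Size

    [Figure: plots of search quality against sketch size (x-axis: Sketch Size) for each benchmark, with the following sketch sizes marked; ratios in parentheses as shown on the slide]
    • VARY Image: 64 bits (7:1), 88 bits (5:1)
    • TIMIT Audio: 250 bits (6:1), 450 bits (3:1)
    • PSB 3D Shape: 200 bits (87:1), 600 bits (29:1)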

  • Slide 27

    Brute-Force, Sketching, Filtering

    BruteForceOriginal
    • Linear scan using original feature vectors

    BruteForceSketch
    • Linear scan using segment sketches

    Filtering
    • Filtering using segment sketches

  • Slide 28

    Conclusion & Future Work

    Ferret toolkit for content-based similarity search
    • Used for image, audio, 3D shape, genomic data

    Achieves high search quality at reasonably high search speed

    Using sketches greatly reduces metadata size with minimal quality degradation

    Future work
    • Integrate with attribute-based search
    • Indexing techniques
    • More effective and efficient distance functions and corresponding sketching techniques
    • More data types: video, sensor data

  • Slide 29

    Thanks!

    CASS: Content-Aware Search Systems
    http://www.cs.princeton.edu/cass

    Try our image similarity search tool for Windows
    • http://www.cs.princeton.edu/cass/software