The Flamingo Software Package on Approximate String Queries

Chen Li, UC Irvine and Bimaple

http://flamingo.ics.uci.edu/

Personal Journey: 2001 …


Data Integration Problems?

Talking to medical doctors…

Example

Table R

Name            SSN            Addr
Jack Lemmon     430-871-8294   Maple St
Harrison Ford   292-918-2913   Culver Blvd
Tom Hanks       234-762-1234   Main St
…               …              …

Table S

Name            SSN            Addr
Ton Hanks       234-162-1234   Main Street
Kevin Spacey    928-184-2813   Frost Blvd
Jack Lemon      430-817-8294   Maple Street
…               …              …

Find records from different datasets that could be the same entity

Another Example

- P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40 (1981)

- Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981

Challenges

How to define good similarity functions?

- Many functions proposed (edit distance, cosine similarity, …)

- Domain knowledge is critical
  - Names: “Wall Street Journal” and “LA Times”
  - Address: “Main Street” versus “Main St”

How to do matching efficiently?

- Nested-loop? Not desirable for large data sets

- 5 hours for 30K strings! (in 2002)
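The nested-loop baseline mentioned above can be sketched as follows. This is a minimal illustration, assuming edit distance as the similarity function and a threshold of 1; the function names `edit_distance` and `nested_loop_join` are hypothetical and are not part of the Flamingo package. The names come from Tables R and S.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nested_loop_join(r, s, delta):
    """Compare every pair: |r| * |s| edit-distance computations."""
    return [(x, y) for x in r for y in s if edit_distance(x, y) <= delta]


names_r = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
names_s = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
print(nested_loop_join(names_r, names_s, 1))
```

Every pair is compared, so the cost grows with the product of the two table sizes, which is why this approach does not scale to large data sets.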

Our first attempt (DASFAA 2003)

- Map strings into a high-dimensional Euclidean space

- Do a similarity join in the Euclidean space

Metric space → Euclidean space

Use data set 1 (54K names) as an example: k=2, d=20

- Use k’=5.2 to differentiate similar and dissimilar pairs.

Can it preserve distances?

2nd Problem: Selectivity Estimation

A bag of strings

Input: fuzzy string predicate P(q, δ)

star SIMILARTO ’Schwarrzenger’

Output: # of strings s that satisfy dist(s,q) <= δ

SEPIA: Intuition (VLDB 2005)

[Figure: a cluster with pivot p and string s, and a query string q; probability against v2ed(p,s) = 1, 2, 3, with labels 44%, 28%, and 100%]

- 1M strings in 1ms

- 10M strings in 10ms

Story of “1-1-10-10”

String Grams

q-grams, for example 2-grams:

u n i v e r s a l

(un),(ni),(iv),(ve),(er),(rs),(sa),(al)
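The decomposition above can be reproduced with a short sliding-window sketch. The helper name `qgrams` is an assumption made for illustration, and no padding characters are added at the string boundaries.

```python
def qgrams(s: str, q: int = 2) -> list:
    """Return the overlapping substrings of length q, in positional order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
```

A string of length n yields n - q + 1 grams, so "universal" (9 characters) yields 8 2-grams.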

Inverted lists

Convert strings to gram inverted lists

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

2-gram  ids
at      4
ch      0, 2
ck      1, 3
ic      0, 1, 2, 4
ri      0
st      1, 2, 3, 4
ta      4
ti      1, 2, 4
tu      3
uc      3
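Building the gram inverted lists for the five example strings could look like the sketch below. The function name `build_gram_index` is hypothetical, and the sketch assumes each string id appears at most once per list even if a gram repeats inside a string.

```python
from collections import defaultdict


def build_gram_index(strings, q=2):
    """Map each q-gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        grams = {s[i:i + q] for i in range(len(s) - q + 1)}  # distinct grams of s
        for gram in grams:
            index[gram].append(sid)  # ids arrive in ascending order
    return dict(index)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_gram_index(strings)
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are appended in increasing order, every list is already sorted, which the list-merging algorithms rely on.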

Main Example

Query: stick

Grams: (st,ti,ic,ck)

Data:

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

Inverted lists: st: 1,2,3,4; ic: 0,1,2,4

ed(s,q)≤1 requires count >=2

Candidates: strings that share at least 2 grams with the query
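The count filter in this example can be sketched as below. It assumes the usual bound that one edit operation destroys at most q grams, so a string within edit distance k of the query must share at least (number of query grams) - k*q grams with it; for "stick" with q=2 and k=1 that gives 4 - 2 = 2. All function names are hypothetical, and grams are treated as sets.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)
    return index


def count_filter(strings, query, k, q=2):
    """Return ids of strings sharing enough grams with the query to be candidates."""
    index = build_index(strings, q)
    query_grams = set(qgrams(query, q))
    threshold = len(query_grams) - k * q
    if threshold <= 0:
        return list(range(len(strings)))  # the filter cannot prune anything
    counts = Counter()
    for gram in query_grams:
        for sid in index.get(gram, []):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= threshold)


strings = ["rich", "stick", "stich", "stuck", "static"]
print(count_filter(strings, "stick", k=1))
```

The filter only produces candidates; each surviving id still has to be verified against the actual edit distance.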

Problem definition:

Find elements whose occurrences ≥ T

Lists are in ascending order

Example: T = 4

Lists (partial): 1, 3, 5, 10, 13 and 10, 13, 15

Result: 13
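A straightforward way to solve this problem is to merge the sorted lists and count how often each element appears. The sketch below uses Python's `heapq.merge` for the k-way merge; the function name `t_occurrence` and the example lists are illustrative assumptions, and each list is assumed to be ascending without duplicates.

```python
import heapq
from itertools import groupby


def t_occurrence(lists, T):
    """Return elements that appear in at least T of the ascending input lists."""
    merged = heapq.merge(*lists)  # lazily yields all elements in sorted order
    return [x for x, run in groupby(merged) if sum(1 for _ in run) >= T]


# Hypothetical lists chosen so that only 13 occurs four times.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13]]
print(t_occurrence(lists, T=4))
```

This baseline touches every element of every list, which is the cost that skipping-based merge algorithms try to avoid.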

Five Merge Algorithms (ICDE 2008)

Previous:

- HeapMerger [Sarawagi, SIGMOD …]

- MergeOpt [Sarawagi, SIGMOD …]

New:

- ScanCount

- MergeSkip

- DivideSkip
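As an illustration of the skipping idea, the sketch below follows one reading of MergeSkip: keep the frontier of every list in a heap, and when the smallest element occurs fewer than T times, pop enough additional lists that T-1 are out of the heap, then jump each of them forward to the new heap top with a binary search. This is a simplified sketch under those assumptions, not the Flamingo C++ implementation; the name `merge_skip` and the example lists are hypothetical.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Elements occurring in at least T ascending lists, skipping hopeless ones."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    while heap:
        t = heap[0][0]
        popped = []
        while heap and heap[0][0] == t:          # all frontiers equal to t
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            results.append(t)
            for _, i, pos in popped:             # advance each list by one
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        while heap and len(popped) < T - 1:      # take T-1 lists out in total
            popped.append(heapq.heappop(heap))
        if not heap:
            break                                # fewer than T lists remain
        t_next = heap[0][0]
        # Anything below t_next lives only in the T-1 popped lists, so skip it.
        for _, i, pos in popped:
            j = bisect_left(lists[i], t_next, pos)
            if j < len(lists[i]):
                heapq.heappush(heap, (lists[i][j], i, j))
    return results


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13]]
print(merge_skip(lists, T=4))
```

With T = 4, the first step already jumps three lists directly to 13 without visiting 3, 5, 7, or 10 individually.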

Next: VGRAM

Story of “1-1-10-10”

Observation 1: dilemma of choosing “q”

Increasing “q” causes:

- Longer grams, shorter lists

- Smaller # of common grams of similar strings


Observation 2: skew distributions of gram frequencies

- DBLP: 276,699 article titles

- Popular 5-grams: ation (>114K times), tions, ystem, catio

VGRAM: Main idea

Grams with variable lengths (between qmin and qmax)

- zebra: ze(123)

- corrasion: co(5213), cor(859), corr(171)

Advantages

- Reduce index size

- Reduce running time

- Adoptable by many algorithms

Challenges

- Generating variable-length grams?

- Constructing a high-quality gram dictionary?

- Relationship between string similarity and their gram-set similarity?

- Adopting VGRAM in existing algorithms?
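One way to generate variable-length grams from a given gram dictionary is sketched below. It assumes a greedy rule: at each position take the longest dictionary gram of length between qmin and qmax, fall back to the qmin-gram if none matches, and drop any gram whose span is already covered by an earlier one. The function name `vgrams` and the dictionary contents are hypothetical, and a plain set stands in for whatever dictionary structure a real implementation would use.

```python
def vgrams(s, dictionary, qmin, qmax):
    """Return (position, gram) pairs of variable-length grams for s."""
    result = []
    covered_end = 0  # exclusive end of the furthest-reaching gram emitted so far
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]                      # default: the qmin-gram
        for length in range(min(qmax, len(s) - p), qmin, -1):
            if s[p:p + length] in dictionary:     # longest dictionary match wins
                gram = s[p:p + length]
                break
        end = p + len(gram)
        if end > covered_end:                     # skip grams subsumed by an earlier one
            result.append((p, gram))
            covered_end = end
    return result


# Illustrative dictionary; a real one would be built from gram frequencies.
gram_dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgrams("universal", gram_dictionary, qmin=2, qmax=4))
```

Compared with the eight fixed 2-grams of "universal", this produces four grams, which is the source of the smaller index.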

—Challenge: large index size

Story of “1-1-10-10”

Contributions (ICDE 2009)

Proposed two lossy compression techniques

- Answer queries exactly

- Index fits into a space budget

- Queries faster on the compressed indexes

- Flexibility to choose space / time tradeoff

- Existing list-merging algorithms: re-use + compression-specific optimizations

Intuition of compression techniques

Find elements whose occurrences ≥ T

Ascending order

Content of Flamingo Package

- List mergers

- SEPIA

- Stringmap

- Location-based fuzzy search

- PartEnum (fuzzy join)

- Fuzzy join using MapReduce

- …

Development of Flamingo

- C++

- Contributors: 9 people (different times)

- Four releases

- Well received by various communities

Making an impact?

UCI People Search

PSearch

Other systems built

- iPubmed: http://ipubmed.ics.uci.edu

- Location-based instant search

- …

- Started a company: Bimaple

Lessons learned

Hands-on experiences …

Lessons learned: Research management

- Software development: code sharing

- Tools: svn, wiki, etc.

- Team environment

- Research continuity

Lessons learned

- Impact

- Outreach activities

Thank you!

http://flamingo.ics.uci.edu/
