66
Exact data mining from in-exact data Nick Freris Qualcomm, San Diego October 10, 2013

Exact data mining from in-exact data

  • Upload
    jolene

  • View
    33

  • Download
    4

Embed Size (px)

DESCRIPTION

Exact data mining from in-exact data. Nick Freris Qualcomm, San Diego October 10, 2013. (1). Introduction. Information retrieval is a large industry.. Biology, finance, engineering, marketing, vision/graphics, etc. ..but data are hardly ever maintained in original form. Rights protection. - PowerPoint PPT Presentation

Citation preview

Page 1: Exact data mining from in-exact data

Exact data mining from in-exact data

Nick Freris

Qualcomm, San DiegoOctober 10, 2013

Page 2: Exact data mining from in-exact data

Exact data mining on in-exact data

IntroductionInformation retrieval is a large industry..

Biology, finance, engineering, marketing, vision/graphics, etc...but data are hardly ever maintained in original form

(1)

Original Quantized

Compression Rights protection

Watermarking

Page 3: Exact data mining from in-exact data

Exact data mining on in-exact data

Preservation of the mining outcome

this talk: provable guarantees for distanced-based mining

(2)

K-means preservationOptimal distance estimation

D=11

Page 4: Exact data mining from in-exact data

Optimal distance estimation

This work presents optimal estimation of Euclidean distance in the compressed domain. This result cannot be further improved

Our approach is applicable on any orthonormal data compression basis (Fourier, Wavelets, Chebyshev, PCA, etc.)

Our method allows up to – 57% better distance estimation– 80% less computation effort– 128 : 1 compression efficiency

(3)

D = 11.5

Page 5: Exact data mining from in-exact data

Motivation/BenefitsTime-series data are customarily compressed in order to:

Save storage spaceReduce transmission bandwidthAchieve faster processing / data analysisRemove noise

Distance estimation has various data analytics applicationsClustering / ClassificationAnomaly detectionSimilarity search

Now we can do all this very efficiently directly on the compressed data!

(4)

Page 6: Exact data mining from in-exact data

Previous work vs current approach

Our approachApproach 1 Approach 2:(only one compressed)

First coeffs + errorBest coeffs + errorBest coeffs + error + constraints Optimal (our method)

(5)

Page 7: Exact data mining from in-exact data

Exact data mining on in-exact data

applicationsSimilarity Search/Grouping/Clustering

find semantically similar searches, deduce associations

QUERY: Supply ChainResults with similar query patterns

integrated supply chainSmarter supply chain

Business consultant supply chain management

dynamic inventory optimization

Five key supply chain areas

IBM’s dynamic inventory

Chief supply chain officer

Retail Supply Management

(6)

Page 8: Exact data mining from in-exact data

Exact data mining on in-exact data

applications

Query: business

analytics

Query: business services

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Requests

Similarity Search/Grouping/ClusteringAdvertising/recommendationsBuying keywords that are cheaper but exhibit same demand pattern

(6)

Page 9: Exact data mining from in-exact data

Exact data mining on in-exact data

applications

Query: Acquisition

Jan Feb Mar Apr May Jun Jul Aug Sep Okt Nov Dec

Requests May ‘10: Sterling Commerce Acquisition (1.4B)

Feb ‘10: Intelliden Inc. Acquisition

Burst Detection Discovery of important events

(6)

Page 10: Exact data mining from in-exact data

Exact data mining on in-exact data

underlying operation: k-NN searchK-Nearest Neighbor (k-NN) similarity searchIssues that arise:

How to compress data?How to speedup search

We have to estimate tight bounds on the distance metric using just the compressed representation

(7)

Page 11: Exact data mining from in-exact data

Exact data mining on in-exact data

similarity search problemquery

D = 7.3

D = 10.2

D = 11.8

D = 17

D = 22

Distance

Objective: Compare the query with all sequences in DB and return the k most similar sequences to the query.

Linear Scan:

(8)

Page 12: Exact data mining from in-exact data

Exact data mining on in-exact data

speeding up similarity searchoriginalDB

Candidate Superset

Verify against original DB

simplifiedDB

keyword

simplifiedquery

Final Answer set

keyword 1

keyword 2

keyword 3

Upper / lower bounds on distance

(8)

Page 13: Exact data mining from in-exact data

Exact data mining on in-exact data

compressing weblog data Use Euclidean distance to match time-series. But how should we compress the

data?

Query: “analytics and optimization”

Query: “consulting services”

1 year span

The data are highly periodic, so we can use Fourier

decomposition.

Instead of using the first Fourier coefficients we can

use the best ones instead.

(9)

Page 14: Exact data mining from in-exact data

Discrete Fourier Transform (DFT) decomposition of time-series

into sinusoids

-0.4446

-0.9864

-0.3254

-0.6938

-0.1086

-0.3470

0.5849

1.5927

-0.9430

-0.3037

-0.7805

-0.1953

-0.3037

0.2381

2.8389

-0.7046

-0.5529

-0.6721

0.1189

0.2706

-0.0003

1.3976

-0.4987

-0.2387

-0.7588

x(n)

-0.3633

-0.6280 + 0.2709i

-0.4929 + 0.0399i

-1.0143 + 0.9520i

0.7200 - 1.0571i

-0.0411 + 0.1674i

-0.5120 - 0.3572i

0.9860 + 0.8043i

-0.3680 - 0.1296i

-0.0517 - 0.0830i

-0.9158 + 0.4481i

1.1212 - 0.6795i

0.2667 + 0.1100i

0.2667 - 0.1100i

1.1212 + 0.6795i

-0.9158 - 0.4481i

-0.0517 + 0.0830i

-0.3680 + 0.1296i

0.9860 - 0.8043i

-0.5120 + 0.3572i

-0.0411 - 0.1674i

0.7200 + 1.0571i

-1.0143 - 0.9520i

-0.4929 - 0.0399i

-0.6280 - 0.2709i

X(f)

“Every signal can be represented as a superposition of sines and cosines”

“Every signal can be represented as a superposition of sines and cosines”

Jean Baptiste Fourier (1768-1830)

(10)

Page 15: Exact data mining from in-exact data

spectrum

Keep just some frequencies of periodogram

Sequence:

Periodogram:

-0.4446

-0.9864

-0.3254

-0.6938

-0.1086

-0.3470

0.5849

1.5927

-0.9430

-0.3037

-0.7805

-0.1953

-0.3037

0.2381

2.8389

-0.7046

-0.5529

-0.6721

0.1189

0.2706

-0.0003

1.3976

-0.4987

-0.2387

-0.7588

-0.3633

-0.6280 + 0.2709i

-0.4929 + 0.0399i

-1.0143 + 0.9520i

0.7200 - 1.0571i

-0.0411 + 0.1674i

-0.5120 - 0.3572i

0.9860 + 0.8043i

-0.3680 - 0.1296i

-0.0517 - 0.0830i

-0.9158 + 0.4481i

1.1212 - 0.6795i

0.2667 + 0.1100i

0.2667 - 0.1100i

1.1212 + 0.6795i

-0.9158 - 0.4481i

-0.0517 + 0.0830i

-0.3680 + 0.1296i

0.9860 - 0.8043i

-0.5120 + 0.3572i

-0.0411 - 0.1674i

0.7200 + 1.0571i

-1.0143 - 0.9520i

-0.4929 - 0.0399i

-0.6280 - 0.2709i

0.1320

0.4677

0.2445

1.9352

1.6359

0.0297

0.3897

1.6190

0.1522

0.0096

1.0395

1.7187

0.0832

0.0832

1.7187

1.0395

0.0096

0.1522

1.6190

0.3897

0.0297

1.6359

1.9352

0.2445

0.4677

x(n)X(f)P(f)

f

(10)

Page 16: Exact data mining from in-exact data

spectrum

Keep just some frequencies of periodogram

Sequence:

Periodogram:

-0.4446

-0.9864

-0.3254

-0.6938

-0.1086

-0.3470

0.5849

1.5927

-0.9430

-0.3037

-0.7805

-0.1953

-0.3037

0.2381

2.8389

-0.7046

-0.5529

-0.6721

0.1189

0.2706

-0.0003

1.3976

-0.4987

-0.2387

-0.7588

-0.3633

-0.6280 + 0.2709i

-0.4929 + 0.0399i

-1.0143 + 0.9520i

0.7200 - 1.0571i

-0.0411 + 0.1674i

-0.5120 - 0.3572i

0.9860 + 0.8043i

-0.3680 - 0.1296i

-0.0517 - 0.0830i

-0.9158 + 0.4481i

1.1212 - 0.6795i

0.2667 + 0.1100i

0.2667 - 0.1100i

1.1212 + 0.6795i

-0.9158 - 0.4481i

-0.0517 + 0.0830i

-0.3680 + 0.1296i

0.9860 - 0.8043i

-0.5120 + 0.3572i

-0.0411 - 0.1674i

0.7200 + 1.0571i

-1.0143 - 0.9520i

-0.4929 - 0.0399i

-0.6280 - 0.2709i

0.1320

0.4677

0.2445

1.9352

1.6359

0.0297

0.3897

1.6190

0.1522

0.0096

1.0395

1.7187

0.0832

0.0832

1.7187

1.0395

0.0096

0.1522

1.6190

0.3897

0.0297

1.6359

1.9352

0.2445

0.4677

x(n)X(f)P(f)

f

0 50 100 150 200 250 300 350

7 day period

(10)

Page 17: Exact data mining from in-exact data

spectrum

Keep just some frequencies of periodogram

Sequence:

Periodogram:

-0.4446

-0.9864

-0.3254

-0.6938

-0.1086

-0.3470

0.5849

1.5927

-0.9430

-0.3037

-0.7805

-0.1953

-0.3037

0.2381

2.8389

-0.7046

-0.5529

-0.6721

0.1189

0.2706

-0.0003

1.3976

-0.4987

-0.2387

-0.7588

-0.3633

-0.6280 + 0.2709i

-0.4929 + 0.0399i

-1.0143 + 0.9520i

0.7200 - 1.0571i

-0.0411 + 0.1674i

-0.5120 - 0.3572i

0.9860 + 0.8043i

-0.3680 - 0.1296i

-0.0517 - 0.0830i

-0.9158 + 0.4481i

1.1212 - 0.6795i

0.2667 + 0.1100i

0.2667 - 0.1100i

1.1212 + 0.6795i

-0.9158 - 0.4481i

-0.0517 + 0.0830i

-0.3680 + 0.1296i

0.9860 - 0.8043i

-0.5120 + 0.3572i

-0.0411 - 0.1674i

0.7200 + 1.0571i

-1.0143 - 0.9520i

-0.4929 - 0.0399i

-0.6280 - 0.2709i

0.1320

0.4677

0.2445

1.9352

1.6359

0.0297

0.3897

1.6190

0.1522

0.0096

1.0395

1.7187

0.0832

0.0832

1.7187

1.0395

0.0096

0.1522

1.6190

0.3897

0.0297

1.6359

1.9352

0.2445

0.4677

x(n)X(f)P(f)

f

0 50 100 150 200 250 300 350

30 day period

(10)

Page 18: Exact data mining from in-exact data

spectrum

Keep just some frequencies of periodogram

Sequence:

Periodogram:

-0.4446

-0.9864

-0.3254

-0.6938

-0.1086

-0.3470

0.5849

1.5927

-0.9430

-0.3037

-0.7805

-0.1953

-0.3037

0.2381

2.8389

-0.7046

-0.5529

-0.6721

0.1189

0.2706

-0.0003

1.3976

-0.4987

-0.2387

-0.7588

-0.3633

-0.6280 + 0.2709i

-0.4929 + 0.0399i

-1.0143 + 0.9520i

0.7200 - 1.0571i

-0.0411 + 0.1674i

-0.5120 - 0.3572i

0.9860 + 0.8043i

-0.3680 - 0.1296i

-0.0517 - 0.0830i

-0.9158 + 0.4481i

1.1212 - 0.6795i

0.2667 + 0.1100i

0.2667 - 0.1100i

1.1212 + 0.6795i

-0.9158 - 0.4481i

-0.0517 + 0.0830i

-0.3680 + 0.1296i

0.9860 - 0.8043i

-0.5120 + 0.3572i

-0.0411 - 0.1674i

0.7200 + 1.0571i

-1.0143 - 0.9520i

-0.4929 - 0.0399i

-0.6280 - 0.2709i

0.1320

0.4677

0.2445

1.9352

1.6359

0.0297

0.3897

1.6190

0.1522

0.0096

1.0395

1.7187

0.0832

0.0832

1.7187

1.0395

0.0096

0.1522

1.6190

0.3897

0.0297

1.6359

1.9352

0.2445

0.4677

x(n)X(f)P(f)

fReconstruction:

(10)

Page 19: Exact data mining from in-exact data

Exact data mining on in-exact data

similarity Search Approximate the Euclidean distance using DFT

0.4326

2.0981

1.9728

1.6851

2.8316

1.6407

0.4515

0.4892

0.1619

0.0128

0.1739

0.5518

0.0365

2.1467

2.0103

2.1243

3.1910

3.2503

3.1547

2.3223

2.6167

1.2805

1.9949

3.6184

2.9266

3.7846

5.0386

3.4449

2.0039

2.5751

2.1752

2.8652

65.0630

0.4721 +20.2755i

4.6455 - 0.7204i

-11.6083 - 5.8807i

1.0588 - 3.9980i

0.3014 + 3.4492i

-1.3769 + 3.3947i

0.5637 + 3.4632i

-2.7711 + 0.3124i

-3.6572 - 1.4000i

0.1078 - 6.0011i

-1.9892 + 2.8143i

-2.6120 + 3.4182i

-3.9104 - 0.4136i

-1.2362 + 3.5155i

-2.2397 - 0.6038i

-2.7173

-2.2397 + 0.6038i

-1.2362 - 3.5155i

-3.9104 + 0.4136i

-2.6120 - 3.4182i

-1.9892 - 2.8143i

0.1078 + 6.0011i

-3.6572 + 1.4000i

-2.7711 - 0.3124i

0.5637 - 3.4632i

-1.3769 - 3.3947i

0.3014 - 3.4492i

1.0588 + 3.9980i

-11.6083 + 5.8807i

4.6455 + 0.7204i

0.4721 -20.2755i

x(n) X(f)

11.5517

7.9234

First 5 Coefficients +symmetric ones

(11)

Page 20: Exact data mining from in-exact data

Exact data mining on in-exact data

similarity Search Approximate the Euclidean distance using DFT

0.4326

2.0981

1.9728

1.6851

2.8316

1.6407

0.4515

0.4892

0.1619

0.0128

0.1739

0.5518

0.0365

2.1467

2.0103

2.1243

3.1910

3.2503

3.1547

2.3223

2.6167

1.2805

1.9949

3.6184

2.9266

3.7846

5.0386

3.4449

2.0039

2.5751

2.1752

2.8652

65.0630

0.4721 +20.2755i

4.6455 - 0.7204i

-11.6083 - 5.8807i

1.0588 - 3.9980i

0.3014 + 3.4492i

-1.3769 + 3.3947i

0.5637 + 3.4632i

-2.7711 + 0.3124i

-3.6572 - 1.4000i

0.1078 - 6.0011i

-1.9892 + 2.8143i

-2.6120 + 3.4182i

-3.9104 - 0.4136i

-1.2362 + 3.5155i

-2.2397 - 0.6038i

-2.7173

-2.2397 + 0.6038i

-1.2362 - 3.5155i

-3.9104 + 0.4136i

-2.6120 - 3.4182i

-1.9892 - 2.8143i

0.1078 + 6.0011i

-3.6572 + 1.4000i

-2.7711 - 0.3124i

0.5637 - 3.4632i

-1.3769 - 3.3947i

0.3014 - 3.4492i

1.0588 + 3.9980i

-11.6083 + 5.8807i

4.6455 + 0.7204i

0.4721 -20.2755i

x(n) X(f)

11.5517

11.1624

Best 5 Coefficients + symmetric ones

(11)

Page 21: Exact data mining from in-exact data

Exact data mining on in-exact data

objective

Calculate the tightest possible upper/lower boundsusing the coefficients with the highest energy

This will result in better pruning of the search space ->faster search

(12)

Page 22: Exact data mining from in-exact data

Exact data mining on in-exact data

0.0000

-12.0861 + 4.4812i

7.0708 - 3.3545i

0.8045 + 4.2567i

0.2386 + 2.0592i

-1.6003 + 6.7154i

-2.6539 + 0.8595i

5.8845 + 3.7689i

-2.0182 + 3.9356i

-3.9484 + 0.2234i

-1.2440 + 2.4538i

-1.6268 + 1.0393i

-1.0458 + 4.0774i

-4.2436 + 0.5988i

-6.4020 - 0.9529i

-3.3663 - 1.0825i

-2.9264

-3.3663 + 1.0825i

-6.4020 + 0.9529i

-4.2436 - 0.5988i

-1.0458 - 4.0774i

-1.6268 - 1.0393i

-1.2440 - 2.4538i

-3.9484 - 0.2234i

-2.0182 - 3.9356i

5.8845 - 3.7689i

-2.6539 - 0.8595i

-1.6003 - 6.7154i

0.2386 - 2.0592i

0.8045 - 4.2567i

7.0708 + 3.3545i

-12.0861 - 4.4812i

0.0000

-2.4756 + 1.7973i

-2.8455 - 0.8510i

-5.5245 - 2.4452i

-2.4342 - 1.9897i

-2.1082 + 1.6303i

-3.6438 - 1.9424i

-1.5989 - 0.1514i

1.6186 - 2.3544i

15.2057 - 6.0515i

-3.6746 + 0.3413i

-0.0532 + 0.6724i

-1.8331 - 0.6000i

0.1642 + 1.3907i

-7.3632 - 5.7938i

-2.2362 - 1.7287i

-5.1339

-2.2362 + 1.7287i

-7.3632 + 5.7938i

0.1642 - 1.3907i

-1.8331 + 0.6000i

-0.0532 - 0.6724i

-3.6746 - 0.3413i

15.2057 + 6.0515i

1.6186 + 2.3544i

-1.5989 + 0.1514i

-3.6438 + 1.9424i

-2.1082 - 1.6303i

-2.4342 + 1.9897i

-5.5245 + 2.4452i

-2.8455 + 0.8510i

-2.4756 - 1.7973i

x X Q q -1.7313

-0.7221

-1.2267

-0.4194

-0.4194

-0.0158

-0.8231

-1.2267

-0.3185

-0.3185

-0.1167

-0.5203

0.0851

0.6906

0.1861

0.7915

0.7915

1.2961

2.3052

1.9016

0.1861

-0.0158

0.5897

0.8924

0.1861

-1.7313

-0.3185

-0.9240

-0.5203

-0.4194

-0.3185

2.2043

-1.3356

0.6182

-1.6135

1.0842

0.7981

0.2667

-1.2048

0.7654

0.6182

-1.4500

-0.2892

0.6264

0.2503

-1.0740

0.7163

0.7654

-1.5073

1.1987

0.7735

0.1277

-1.2865

0.6509

0.5283

-1.0413

0.9207

1.2150

0.4302

-1.2211

1.0678

1.0351

-1.4337

-1.0004

(13)

Page 23: Exact data mining from in-exact data

Exact data mining on in-exact data

Find best k coefficients of X (k=4)

0.0000

-12.0861 + 4.4812i

7.0708 - 3.3545i

0.8045 + 4.2567i

0.2386 + 2.0592i

-1.6003 + 6.7154i

-2.6539 + 0.8595i

5.8845 + 3.7689i

-2.0182 + 3.9356i

-3.9484 + 0.2234i

-1.2440 + 2.4538i

-1.6268 + 1.0393i

-1.0458 + 4.0774i

-4.2436 + 0.5988i

-6.4020 - 0.9529i

-3.3663 - 1.0825i

-2.9264

-3.3663 + 1.0825i

-6.4020 + 0.9529i

-4.2436 - 0.5988i

-1.0458 - 4.0774i

-1.6268 - 1.0393i

-1.2440 - 2.4538i

-3.9484 - 0.2234i

-2.0182 - 3.9356i

5.8845 - 3.7689i

-2.6539 - 0.8595i

-1.6003 - 6.7154i

0.2386 - 2.0592i

0.8045 - 4.2567i

7.0708 + 3.3545i

-12.0861 - 4.4812i

0.0000

12.8901

7.8261

4.3321

2.0729

6.9035

2.7897

6.9880

4.4229

3.9548

2.7512

1.9304

4.2094

4.2856

6.4725

3.5360

2.9264

3.5360

6.4725

4.2856

4.2094

1.9304

2.7512

3.9548

4.4229

6.9880

2.7897

6.9035

2.0729

4.3321

7.8261

12.8901

0.0000

-2.4756 + 1.7973i

-2.8455 - 0.8510i

-5.5245 - 2.4452i

-2.4342 - 1.9897i

-2.1082 + 1.6303i

-3.6438 - 1.9424i

-1.5989 - 0.1514i

1.6186 - 2.3544i

15.2057 - 6.0515i

-3.6746 + 0.3413i

-0.0532 + 0.6724i

-1.8331 - 0.6000i

0.1642 + 1.3907i

-7.3632 - 5.7938i

-2.2362 - 1.7287i

-5.1339

-2.2362 + 1.7287i

-7.3632 + 5.7938i

0.1642 - 1.3907i

-1.8331 + 0.6000i

-0.0532 - 0.6724i

-3.6746 - 0.3413i

15.2057 + 6.0515i

1.6186 + 2.3544i

-1.5989 + 0.1514i

-3.6438 + 1.9424i

-2.1082 - 1.6303i

-2.4342 + 1.9897i

-5.5245 + 2.4452i

-2.8455 + 0.8510i

-2.4756 - 1.7973i

||X|| X Q 0

3.0592

2.9700

6.0414

3.1439

2.6650

4.1292

1.6061

2.8571

16.3656

3.6904

0.6745

1.9288

1.4004

9.3694

2.8265

5.1339

2.8265

9.3694

1.4004

1.9288

0.6745

3.6904

16.3656

2.8571

1.6061

4.1292

2.6650

3.1439

6.0414

2.9700

3.0592

||Q||Magnitude

vector

(13)

Page 24: Exact data mining from in-exact data

Exact data mining on in-exact data

Find best 4 coefficients of T

0.0000

-12.0861 + 4.4812i

7.0708 - 3.3545i

0.8045 + 4.2567i

0.2386 + 2.0592i

-1.6003 + 6.7154i

-2.6539 + 0.8595i

5.8845 + 3.7689i

-2.0182 + 3.9356i

-3.9484 + 0.2234i

-1.2440 + 2.4538i

-1.6268 + 1.0393i

-1.0458 + 4.0774i

-4.2436 + 0.5988i

-6.4020 - 0.9529i

-3.3663 - 1.0825i

-2.9264

-3.3663 + 1.0825i

-6.4020 + 0.9529i

-4.2436 - 0.5988i

-1.0458 - 4.0774i

-1.6268 - 1.0393i

-1.2440 - 2.4538i

-3.9484 - 0.2234i

-2.0182 - 3.9356i

5.8845 - 3.7689i

-2.6539 - 0.8595i

-1.6003 - 6.7154i

0.2386 - 2.0592i

0.8045 - 4.2567i

7.0708 + 3.3545i

-12.0861 - 4.4812i

0.0000

12.8901

7.8261

4.3321

2.0729

6.9035

2.7897

6.9880

4.4229

3.9548

2.7512

1.9304

4.2094

4.2856

6.4725

3.5360

2.9264

3.5360

6.4725

4.2856

4.2094

1.9304

2.7512

3.9548

4.4229

6.9880

2.7897

6.9035

2.0729

4.3321

7.8261

12.8901

0.0000

-2.4756 + 1.7973i

-2.8455 - 0.8510i

-5.5245 - 2.4452i

-2.4342 - 1.9897i

-2.1082 + 1.6303i

-3.6438 - 1.9424i

-1.5989 - 0.1514i

1.6186 - 2.3544i

15.2057 - 6.0515i

-3.6746 + 0.3413i

-0.0532 + 0.6724i

-1.8331 - 0.6000i

0.1642 + 1.3907i

-7.3632 - 5.7938i

-2.2362 - 1.7287i

-5.1339

-2.2362 + 1.7287i

-7.3632 + 5.7938i

0.1642 - 1.3907i

-1.8331 + 0.6000i

-0.0532 - 0.6724i

-3.6746 - 0.3413i

15.2057 + 6.0515i

1.6186 + 2.3544i

-1.5989 + 0.1514i

-3.6438 + 1.9424i

-2.1082 - 1.6303i

-2.4342 + 1.9897i

-5.5245 + 2.4452i

-2.8455 + 0.8510i

-2.4756 - 1.7973i

||X|| X Q 0

3.0592

2.9700

6.0414

3.1439

2.6650

4.1292

1.6061

2.8571

16.3656

3.6904

0.6745

1.9288

1.4004

9.3694

2.8265

5.1339

2.8265

9.3694

1.4004

1.9288

0.6745

3.6904

16.3656

2.8571

1.6061

4.1292

2.6650

3.1439

6.0414

2.9700

3.0592

||Q||

(13)

Page 25: Exact data mining from in-exact data

Exact data mining on in-exact data

Identify smallest magnitude ( power)

0.0000

-12.0861 + 4.4812i

7.0708 - 3.3545i

0.8045 + 4.2567i

0.2386 + 2.0592i

-1.6003 + 6.7154i

-2.6539 + 0.8595i

5.8845 + 3.7689i

-2.0182 + 3.9356i

-3.9484 + 0.2234i

-1.2440 + 2.4538i

-1.6268 + 1.0393i

-1.0458 + 4.0774i

-4.2436 + 0.5988i

-6.4020 - 0.9529i

-3.3663 - 1.0825i

-2.9264

-3.3663 + 1.0825i

-6.4020 + 0.9529i

-4.2436 - 0.5988i

-1.0458 - 4.0774i

-1.6268 - 1.0393i

-1.2440 - 2.4538i

-3.9484 - 0.2234i

-2.0182 - 3.9356i

5.8845 - 3.7689i

-2.6539 - 0.8595i

-1.6003 - 6.7154i

0.2386 - 2.0592i

0.8045 - 4.2567i

7.0708 + 3.3545i

-12.0861 - 4.4812i

minPower

6.9035

0.0000

12.8901

7.8261

4.3321

2.0729

6.9035

2.7897

6.9880

4.4229

3.9548

2.7512

1.9304

4.2094

4.2856

6.4725

3.5360

2.9264

3.5360

6.4725

4.2856

4.2094

1.9304

2.7512

3.9548

4.4229

6.9880

2.7897

6.9035

2.0729

4.3321

7.8261

12.8901

0.0000

-2.4756 + 1.7973i

-2.8455 - 0.8510i

-5.5245 - 2.4452i

-2.4342 - 1.9897i

-2.1082 + 1.6303i

-3.6438 - 1.9424i

-1.5989 - 0.1514i

1.6186 - 2.3544i

15.2057 - 6.0515i

-3.6746 + 0.3413i

-0.0532 + 0.6724i

-1.8331 - 0.6000i

0.1642 + 1.3907i

-7.3632 - 5.7938i

-2.2362 - 1.7287i

-5.1339

-2.2362 + 1.7287i

-7.3632 + 5.7938i

0.1642 - 1.3907i

-1.8331 + 0.6000i

-0.0532 - 0.6724i

-3.6746 - 0.3413i

15.2057 + 6.0515i

1.6186 + 2.3544i

-1.5989 + 0.1514i

-3.6438 + 1.9424i

-2.1082 - 1.6303i

-2.4342 + 1.9897i

-5.5245 + 2.4452i

-2.8455 + 0.8510i

-2.4756 - 1.7973i

||X|| X Q 0

3.0592

2.9700

6.0414

3.1439

2.6650

4.1292

1.6061

2.8571

16.3656

3.6904

0.6745

1.9288

1.4004

9.3694

2.8265

5.1339

2.8265

9.3694

1.4004

1.9288

0.6745

3.6904

16.3656

2.8571

1.6061

4.1292

2.6650

3.1439

6.0414

2.9700

3.0592

||Q||

(13)

Page 26: Exact data mining from in-exact data

Exact data mining on in-exact data

The remaining powers are less than minPower

0.0000

-12.0861 + 4.4812i

7.0708 - 3.3545i

0.8045 + 4.2567i

0.2386 + 2.0592i

-1.6003 + 6.7154i

-2.6539 + 0.8595i

5.8845 + 3.7689i

-2.0182 + 3.9356i

-3.9484 + 0.2234i

-1.2440 + 2.4538i

-1.6268 + 1.0393i

-1.0458 + 4.0774i

-4.2436 + 0.5988i

-6.4020 - 0.9529i

-3.3663 - 1.0825i

-2.9264

-3.3663 + 1.0825i

-6.4020 + 0.9529i

-4.2436 - 0.5988i

-1.0458 - 4.0774i

-1.6268 - 1.0393i

-1.2440 - 2.4538i

-3.9484 - 0.2234i

-2.0182 - 3.9356i

5.8845 - 3.7689i

-2.6539 - 0.8595i

-1.6003 - 6.7154i

0.2386 - 2.0592i

0.8045 - 4.2567i

7.0708 + 3.3545i

-12.0861 - 4.4812i

minPower

6.9035

0.0000

12.8901

7.8261

4.3321

2.0729

6.9035

2.7897

6.9880

4.4229

3.9548

2.7512

1.9304

4.2094

4.2856

6.4725

3.5360

2.9264

3.5360

6.4725

4.2856

4.2094

1.9304

2.7512

3.9548

4.4229

6.9880

2.7897

6.9035

2.0729

4.3321

7.8261

12.8901

0.0000

-2.4756 + 1.7973i

-2.8455 - 0.8510i

-5.5245 - 2.4452i

-2.4342 - 1.9897i

-2.1082 + 1.6303i

-3.6438 - 1.9424i

-1.5989 - 0.1514i

1.6186 - 2.3544i

15.2057 - 6.0515i

-3.6746 + 0.3413i

-0.0532 + 0.6724i

-1.8331 - 0.6000i

0.1642 + 1.3907i

-7.3632 - 5.7938i

-2.2362 - 1.7287i

-5.1339

-2.2362 + 1.7287i

-7.3632 + 5.7938i

0.1642 - 1.3907i

-1.8331 + 0.6000i

-0.0532 - 0.6724i

-3.6746 - 0.3413i

15.2057 + 6.0515i

1.6186 + 2.3544i

-1.5989 + 0.1514i

-3.6438 + 1.9424i

-2.1082 - 1.6303i

-2.4342 + 1.9897i

-5.5245 + 2.4452i

-2.8455 + 0.8510i

-2.4756 - 1.7973i

||X|| X Q 0

3.0592

2.9700

6.0414

3.1439

2.6650

4.1292

1.6061

2.8571

16.3656

3.6904

0.6745

1.9288

1.4004

9.3694

2.8265

5.1339

2.8265

9.3694

1.4004

1.9288

0.6745

3.6904

16.3656

2.8571

1.6061

4.1292

2.6650

3.1439

6.0414

2.9700

3.0592

||Q||

(13)

Page 27: Exact data mining from in-exact data

Exact data mining on in-exact data

We keep also the sum of squares of the remaining powers

0.0000

-12.0861 + 4.4812i

7.0708 - 3.3545i

0.8045 + 4.2567i

0.2386 + 2.0592i

-1.6003 + 6.7154i

-2.6539 + 0.8595i

5.8845 + 3.7689i

-2.0182 + 3.9356i

-3.9484 + 0.2234i

-1.2440 + 2.4538i

-1.6268 + 1.0393i

-1.0458 + 4.0774i

-4.2436 + 0.5988i

-6.4020 - 0.9529i

-3.3663 - 1.0825i

-2.9264

-3.3663 + 1.0825i

-6.4020 + 0.9529i

-4.2436 - 0.5988i

-1.0458 - 4.0774i

-1.6268 - 1.0393i

-1.2440 - 2.4538i

-3.9484 - 0.2234i

-2.0182 - 3.9356i

5.8845 - 3.7689i

-2.6539 - 0.8595i

-1.6003 - 6.7154i

0.2386 - 2.0592i

0.8045 - 4.2567i

7.0708 + 3.3545i

-12.0861 - 4.4812i

0.0000

12.8901

7.8261

4.3321

2.0729

6.9035

2.7897

6.9880

4.4229

3.9548

2.7512

1.9304

4.2094

4.2856

6.4725

3.5360

2.9264

3.5360

6.4725

4.2856

4.2094

1.9304

2.7512

3.9548

4.4229

6.9880

2.7897

6.9035

2.0729

4.3321

7.8261

12.8901

0.0000

-2.4756 + 1.7973i

-2.8455 - 0.8510i

-5.5245 - 2.4452i

-2.4342 - 1.9897i

-2.1082 + 1.6303i

-3.6438 - 1.9424i

-1.5989 - 0.1514i

1.6186 - 2.3544i

15.2057 - 6.0515i

-3.6746 + 0.3413i

-0.0532 + 0.6724i

-1.8331 - 0.6000i

0.1642 + 1.3907i

-7.3632 - 5.7938i

-2.2362 - 1.7287i

-5.1339

-2.2362 + 1.7287i

-7.3632 + 5.7938i

0.1642 - 1.3907i

-1.8331 + 0.6000i

-0.0532 - 0.6724i

-3.6746 - 0.3413i

15.2057 + 6.0515i

1.6186 + 2.3544i

-1.5989 + 0.1514i

-3.6438 + 1.9424i

-2.1082 - 1.6303i

-2.4342 + 1.9897i

-5.5245 + 2.4452i

-2.8455 + 0.8510i

-2.4756 - 1.7973i

||X|| X Q 0

3.0592

2.9700

6.0414

3.1439

2.6650

4.1292

1.6061

2.8571

16.3656

3.6904

0.6745

1.9288

1.4004

9.3694

2.8265

5.1339

2.8265

9.3694

1.4004

1.9288

0.6745

3.6904

16.3656

2.8571

1.6061

4.1292

2.6650

3.1439

6.0414

2.9700

3.0592

||Q||

e2

(13)

Page 28: Exact data mining from in-exact data

Exact data mining on in-exact data

Calculate distance from k known coeffs

0.0000

-12.0861 + 4.4812i

7.0708 - 3.3545i

0.8045 + 4.2567i

0.2386 + 2.0592i

-1.6003 + 6.7154i

-2.6539 + 0.8595i

5.8845 + 3.7689i

-2.0182 + 3.9356i

-3.9484 + 0.2234i

-1.2440 + 2.4538i

-1.6268 + 1.0393i

-1.0458 + 4.0774i

-4.2436 + 0.5988i

-6.4020 - 0.9529i

-3.3663 - 1.0825i

-2.9264

-3.3663 + 1.0825i

-6.4020 + 0.9529i

-4.2436 - 0.5988i

-1.0458 - 4.0774i

-1.6268 - 1.0393i

-1.2440 - 2.4538i

-3.9484 - 0.2234i

-2.0182 - 3.9356i

5.8845 - 3.7689i

-2.6539 - 0.8595i

-1.6003 - 6.7154i

0.2386 - 2.0592i

0.8045 - 4.2567i

7.0708 + 3.3545i

-12.0861 - 4.4812i

0.0000

12.8901

7.8261

4.3321

2.0729

6.9035

2.7897

6.9880

4.4229

3.9548

2.7512

1.9304

4.2094

4.2856

6.4725

3.5360

2.9264

3.5360

6.4725

4.2856

4.2094

1.9304

2.7512

3.9548

4.4229

6.9880

2.7897

6.9035

2.0729

4.3321

7.8261

12.8901

0.0000

-2.4756 + 1.7973i

-2.8455 - 0.8510i

-5.5245 - 2.4452i

-2.4342 - 1.9897i

-2.1082 + 1.6303i

-3.6438 - 1.9424i

-1.5989 - 0.1514i

1.6186 - 2.3544i

15.2057 - 6.0515i

-3.6746 + 0.3413i

-0.0532 + 0.6724i

-1.8331 - 0.6000i

0.1642 + 1.3907i

-7.3632 - 5.7938i

-2.2362 - 1.7287i

-5.1339

-2.2362 + 1.7287i

-7.3632 + 5.7938i

0.1642 - 1.3907i

-1.8331 + 0.6000i

-0.0532 - 0.6724i

-3.6746 - 0.3413i

15.2057 + 6.0515i

1.6186 + 2.3544i

-1.5989 + 0.1514i

-3.6438 + 1.9424i

-2.1082 - 1.6303i

-2.4342 + 1.9897i

-5.5245 + 2.4452i

-2.8455 + 0.8510i

-2.4756 - 1.7973i

||X|| X Q 0

3.0592

2.9700

6.0414

3.1439

2.6650

4.1292

1.6061

2.8571

16.3656

3.6904

0.6745

1.9288

1.4004

9.3694

2.8265

5.1339

2.8265

9.3694

1.4004

1.9288

0.6745

3.6904

16.3656

2.8571

1.6061

4.1292

2.6650

3.1439

6.0414

2.9700

3.0592

||Q||

(13)

Page 29: Exact data mining from in-exact data

Exact data mining on in-exact data

Optimize distance from remaining coefficients

0.0000

-12.0861 + 4.4812i

7.0708 - 3.3545i

0.8045 + 4.2567i

0.2386 + 2.0592i

-1.6003 + 6.7154i

-2.6539 + 0.8595i

5.8845 + 3.7689i

-2.0182 + 3.9356i

-3.9484 + 0.2234i

-1.2440 + 2.4538i

-1.6268 + 1.0393i

-1.0458 + 4.0774i

-4.2436 + 0.5988i

-6.4020 - 0.9529i

-3.3663 - 1.0825i

-2.9264

-3.3663 + 1.0825i

-6.4020 + 0.9529i

-4.2436 - 0.5988i

-1.0458 - 4.0774i

-1.6268 - 1.0393i

-1.2440 - 2.4538i

-3.9484 - 0.2234i

-2.0182 - 3.9356i

5.8845 - 3.7689i

-2.6539 - 0.8595i

-1.6003 - 6.7154i

0.2386 - 2.0592i

0.8045 - 4.2567i

7.0708 + 3.3545i

-12.0861 - 4.4812i

0.0000

12.8901

7.8261

4.3321

2.0729

6.9035

2.7897

6.9880

4.4229

3.9548

2.7512

1.9304

4.2094

4.2856

6.4725

3.5360

2.9264

3.5360

6.4725

4.2856

4.2094

1.9304

2.7512

3.9548

4.4229

6.9880

2.7897

6.9035

2.0729

4.3321

7.8261

12.8901

0.0000

-2.4756 + 1.7973i

-2.8455 - 0.8510i

-5.5245 - 2.4452i

-2.4342 - 1.9897i

-2.1082 + 1.6303i

-3.6438 - 1.9424i

-1.5989 - 0.1514i

1.6186 - 2.3544i

15.2057 - 6.0515i

-3.6746 + 0.3413i

-0.0532 + 0.6724i

-1.8331 - 0.6000i

0.1642 + 1.3907i

-7.3632 - 5.7938i

-2.2362 - 1.7287i

-5.1339

-2.2362 + 1.7287i

-7.3632 + 5.7938i

0.1642 - 1.3907i

-1.8331 + 0.6000i

-0.0532 - 0.6724i

-3.6746 - 0.3413i

15.2057 + 6.0515i

1.6186 + 2.3544i

-1.5989 + 0.1514i

-3.6438 + 1.9424i

-2.1082 - 1.6303i

-2.4342 + 1.9897i

-5.5245 + 2.4452i

-2.8455 + 0.8510i

-2.4756 - 1.7973i

||X|| X Q 0

3.0592

2.9700

6.0414

3.1439

2.6650

4.1292

1.6061

2.8571

16.3656

3.6904

0.6745

1.9288

1.4004

9.3694

2.8265

5.1339

2.8265

9.3694

1.4004

1.9288

0.6745

3.6904

16.3656

2.8571

1.6061

4.1292

2.6650

3.1439

6.0414

2.9700

3.0592

||Q||

minPower

6.9035

Optimization

Generalization

(13)

Page 30: Exact data mining from in-exact data

Exact data mining on in-exact data

water-filling

X = -α Q

Q

d1

d2

(14)

Page 31: Exact data mining from in-exact data

Exact data mining on in-exact data

water-filling

X = -α Q

Q

d1

d2

Boundary condition on total energy:

d12 + d2

2 < e2 e

(14)

Page 32: Exact data mining from in-exact data

Exact data mining on in-exact data

water-filling

X = -α Q

Q

minP

d1

d2

minP

Boundary condition on each dimension:

d1 < minPower, d2 < minPowere

(14)

Page 33: Exact data mining from in-exact data

Exact data mining on in-exact data

water-filling

X = -α Q

Q

minP

d1

d2

minP

We hit one of our constraints

e

(14)

Page 34: Exact data mining from in-exact data

Exact data mining on in-exact data

water-filling

X = -α Q

Q

d1

d2

We start moving on the other dimension

until we use all our energyeminP

minP

(14)

Page 35: Exact data mining from in-exact data

Exact data mining on in-exact data

in N-dimensionsXQ

Available EnergyminPower

(14)

Page 36: Exact data mining from in-exact data

Exact data mining on in-exact data

in N-dimensionsXQ

Available EnergyminPower

(14)

Page 37: Exact data mining from in-exact data

Exact data mining on in-exact data

in N-dimensionsXQ

Available EnergyminPower

(14)

Page 38: Exact data mining from in-exact data

Exact data mining on in-exact data

in N-dimensionsXQ

Available EnergyminPower

(14)

Page 39: Exact data mining from in-exact data

Exact data mining on in-exact data

in N-dimensionsXQ

Available EnergyminPower

(14)

Page 40: Exact data mining from in-exact data

Exact data mining on in-exact data

in N-dimensionsXQ

Available EnergyminPower

(14)

Page 41: Exact data mining from in-exact data

Exact data mining on in-exact data

in N-dimensionsXQ

Available EnergyminPower

(14)

Page 42: Exact data mining from in-exact data

Exact data mining on in-exact data

in N-dimensionsXQ

Available EnergyminPower

(14)

Page 43: Exact data mining from in-exact data

Exact data mining on in-exact data

in N-dimensional spaceXQ

Available EnergyminPower

(14)

Page 44: Exact data mining from in-exact data

Exact data mining on in-exact data

in N-dimensional spaceXQ

Available EnergyminPower

(14)

Page 45: Exact data mining from in-exact data

Exact data mining on in-exact data

single waterfilling

When both sequences are compressed the optimal distance estimation can be solved using a double waterfilling process.

double waterfilling

(15)

Page 46: Exact data mining from in-exact data

Exact data mining on in-exact data

the bounds are optimally tight Theorem (Freris et al. 2012): The computation of lower and upper bounds given the aforementioned compressed representations can be solved exactly using double water-filling problem. The lower and upper bounds are optimally tight; no tighter solutions can be provided

Sketch of proof: - convex re-parametrization - KKT conditions - properties of optimal solutions

(16)

Page 47: Exact data mining from in-exact data

Double Water-filling algorithm

Intuition:

Water-fill for the discarded

coefficients of the two vectors

separately..

..using the appropriate energy

allocation

(17)

Zero-overhead algorithm

Complexity: Θ(nlong)

Page 48: Exact data mining from in-exact data

Exact data mining on in-exact data

Experiments

Unica database. IBM web traffic for year of 2010

BUSINESS DYNAMICS IBM EINSURANCE CUSTOMER EXPERIENCE.

BUSINESS CONSULTING

YIN YANG OF FINANCIAL DISRUPTION

AMERICA MEDIA PLAYER INDUSTRY

IBM GLOBAL BUSINESS ANDREW STEVENS

GLENN FINCH IBM STRATEGIE ENTREPRISE RENTABILIT

Marketing/Adwords recommendation

– Analysis/Storage of weblog queries (1TB of data per month)

– GBS: Scheduling advertising campaigns / pricing

(18)

Page 49: Exact data mining from in-exact data

Exact data mining on in-exact data

our analytic solution is 300x faster than convex optimizers

(19)

Page 50: Exact data mining from in-exact data

Our algorithm converges very fast: in 2-3 iterations irrespective of the compression rate

(20)

Page 51: Exact data mining from in-exact data

Exact data mining on in-exact data

the derived LB/UB are 20% tighter than state-of-art

(21)

Page 52: Exact data mining from in-exact data

Exact data mining on in-exact data

when searching in a DB this results in up to 80% data pruning

the previous (10-20%) improvement in distance estimation can significantly reduce the search space when searching for k-NN results

We retrieve 20%-80% fewer sequences than other approaches

(22)

Page 53: Exact data mining from in-exact data

Exact data mining on in-exact data

Conclusions

Increasing data sizes is a perennial problem for data analysis It is important to support mining directly on the compressed data

We have shown an exact algorithm for obtaining optimally tight Euclidean bounds on compressed representations Mining directly on compressed data with provable guarantees

Many generalizations:Cosine Similarity (text documents):

cos(x,y) = 1 - L2(x,y)2/2Correlation (financial analysis):

corr(x,y) = 1 - L2(x,y)2/2 (for normalized signals x,y)Dynamic Time Warping (flexible similarity metric):

(23)

Page 54: Exact data mining from in-exact data

Exact data mining on in-exact data

Compression with provable K-means preservation

Page 55: Exact data mining from in-exact data

Exact data mining on in-exact data

Scope of this work Multi-bit compression and invertible transformation of high-dimensional data

with provable K-means preservation

K-means

cluster 1 cluster 2 cluster 3

K-means

cluster 1 cluster 2 cluster 3

identical clustering

results

original data Original data quantized data Compressed data

(24)

Page 56: Exact data mining from in-exact data

Exact data mining on in-exact data

Applications Reduced storage requirements

Reduced processing time Data hiding / anonymization

Original values are hidden via an invertible cluster contraction Encoder-decoder scheme

Shape preservation High-quality data reconstruction Extends applicability to other mining operations such as Nearest-Neighbor search, data visualization, etc.

Tunable compression level based on number of allocated bits in order to achieve good storage/quality trade-off

(25)

Page 57: Exact data mining from in-exact data

Exact data mining on in-exact data

Overview1. Pre-clustering2. Quantization per cluster and dimension (one or multiple bits)3. Data contraction transformation 4. Post-quantization clustering 5. Decoding using key (contraction parameter)

(26)

Page 58: Exact data mining from in-exact data

Exact data mining on in-exact data

Our method effectively compresses the data1. MMSE quantization per cluster and dimension (one or multiple bits)

Preserves mean and reduces variance and dynamic range Example: 1-bit MMSE quantization

Quantizer:

(27)

Page 59: Exact data mining from in-exact data

Exact data mining on in-exact data

These are the points for one dimension and for one cluster of objects.

Process is repeated for all dimensions and for all clusters. We have one quantizer per class

Dimension d (or time instance )d)

How does our scheme apply to high-dimensional data?(28)

Page 60: Exact data mining from in-exact data

Exact data mining on in-exact data

We can also support Data-Hiding with our scheme2. Data contraction transformation

Invertible Preserves centroids and reduces variance and dynamic range

Better cluster separation

(29)

Page 61: Exact data mining from in-exact data

Exact data mining on in-exact data

Results Theoretical guarantee on cluster preservation

for sufficiently contracted clusters (sufficiently small a) Analytical calculation of optimal contraction coefficient (a)

K(K-1) calculations, K: number of clusters

Lower distortion (MSE) than MPQ which decreases with the number of allocated bits

Multi-bit allocation (per cluster) Generalization of MMSE quantizers for multiple bits Efficient algorithm for optimal allocation given storage constraints

(30)

Page 62: Exact data mining from in-exact data

Exact data mining on in-exact data

We preserve 100% the clustering structure Stock values

2100 stocks during a period of three years Clustered in 3,5,8 clusters

Cluster preservation was always 100% Shape preservation

(31)

Page 63: Exact data mining from in-exact data

Exact data mining on in-exact data

Multi-Bit quantization improves quality up to 96%! MSE due to quantization

decreases with number of bits and number of clusters

37%, 70% and 96% over MPQ using MMSE with 1, 2, 4 bits 76%, 84% using 5, 8 clusters over 3 clusters

(32)

Page 64: Exact data mining from in-exact data

Exact data mining on in-exact data

Compress > 8x without loss in clustering accuracy Compression efficiency:

1.8x – 8x reduction

(33)

Page 65: Exact data mining from in-exact data

Exact data mining on in-exact data

Conclusions Efficient K-means preserving compression of high-dimensional data

Data hiding (encoder-decoder data transformation) Tunable compression level (storage / distortion trade-off)

Theoretical guarantees Our scheme is the first one which always guarantees 100% cluster preservation Efficient bit allocation algorithm for optimal distortion given storage constraints

Experiments 100% cluster preservation Excellent shape preservation even for 1 bit quantizers High compression efficiency

Extensions Our approach works with any quantization scheme

Example: Minimum Absolute Error Quantization (MAE)

(34)

Page 66: Exact data mining from in-exact data

Exact data mining on in-exact data

Thank youQuestions…