
Redis - for duplicate detection on a real-time stream




whoami(1)
15 years of experience, proud to be a programmer
Writes software for information extraction, NLP, opinion mining (@scale), and a lot of other buzzwords
Implements scalable architectures
Member of the JUG-Torino coordination team

[email protected]
github.com/robfrank
twitter.com/robfrankie
linkedin.com/in/robfrank
http://www.celi.it
http://www.blogmeter.it

Agenda
What is it?
Main features
Caching
Counters
Scripting
How we use it

Who uses it
Twitter
Github
Youporn
Pinterest
Groupon

...

Clients in every known language

Articles, books, presentations

On High Scalability every other day

Ecosystem

Architecture

Single-threaded server

Yes: a single-threaded server

Remember that when you need to scale

Single Linux server can handle 500k req/s

Main features

In-memory K/V store
But with durable persistence
Master-slave async replication
Transactions
Pub/Sub
Server-side Lua scripting

K/V store
Key-value (KV) stores use the associative array (also known as a map or dictionary) as their fundamental data model. In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection. (Wikipedia)
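The associative-array model quoted above can be sketched in a few lines of plain Python. This is only an illustration: a dict stands in for the server's keyspace, and the function names are invented for the example.

```python
# A toy key-value store: a dict is the same data model Redis exposes,
# minus networking, persistence, and the rich value types.
store = {}

def kv_set(key, value):
    store[key] = value          # each key appears at most once

def kv_get(key):
    return store.get(key)       # None if the key is absent

kv_set("user:1", "frank")
kv_set("user:1", "rob")         # overwrites: keys are unique
print(kv_get("user:1"))         # → rob
```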

K/V store

Key: "plain text"

name → rob
surname → frank

[diagram: elements A C E D B F / A B C D E F]

Strings/blobs/bitmaps
Hash tables: objects
Linked lists
Sets

Persistence
Configurable, two flavors:
RDB: perfect for backups
AOF: append-only log, replayed at startup

Use AOF + RDB for rock-solid persistence
Automatic cache warm-up at startup!!
RAM only: switch off persistence
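The AOF idea, log every write and replay the log at startup to rebuild state, can be sketched like this. The file name and line format are illustrative only; Redis's real AOF uses its own binary-safe command encoding.

```python
import os

AOF_PATH = "toy.aof"  # illustrative path; not Redis's actual file layout

if os.path.exists(AOF_PATH):
    os.remove(AOF_PATH)         # start the demo from a clean log

def aof_set(store, key, value):
    store[key] = value
    with open(AOF_PATH, "a") as f:          # append-only log of writes
        f.write(f"SET {key} {value}\n")

def aof_replay():
    """Rebuild in-memory state by replaying the log at startup."""
    store = {}
    if os.path.exists(AOF_PATH):
        with open(AOF_PATH) as f:
            for line in f:
                _, key, value = line.rstrip("\n").split(" ", 2)
                store[key] = value
    return store

store = {}
aof_set(store, "user:1", "frank")
restarted = aof_replay()        # "startup": state is warm again
print(restarted)                # {'user:1': 'frank'}
```

This is also why the slide can promise automatic cache warm-up: the replayed log leaves the keyspace exactly as it was before the restart.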

Common use cases
Cache
Queue
Session replication
In-memory indexes
Centralized ID generation

Basics
SET user:1 frank
GET user:1 → frank
EXISTS user:1 → 1

EXPIRE user:1 3600

INCR count:1
GET count:1 → 1

Basics
KEYS user:* → user:1, user:2
MSET user:1 frank user:2 coder
MGET user:1 user:2 → frank, coder

HMSET userdetail:3 name rob surname frank
HGETALL userdetail:3 → name: rob, surname: frank

Transactions
MULTI
INCR counter:1
INCR counter:2
EXEC
> 1
> 1

WATCH counter:3
val = GET counter:3
val = val + 1
MULTI
SET counter:3 $val
EXEC
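The WATCH pattern is optimistic locking: read, compute, and commit only if the watched key was untouched in the meantime. A minimal in-memory simulation of that check-and-set loop (version numbers stand in for Redis's dirty-key tracking; all names are invented for the sketch):

```python
store = {"counter:3": 0}
versions = {"counter:3": 0}   # bumped on every write, like WATCH's dirty flag

def cas_incr(key):
    """Retry loop: the commit succeeds only if nobody wrote in between."""
    while True:
        seen_version = versions[key]          # WATCH
        val = store[key] + 1                  # GET + compute
        if versions[key] == seen_version:     # EXEC aborts if the key changed
            store[key] = val
            versions[key] += 1
            return val
        # else: another writer won the race; loop and retry

print(cas_incr("counter:3"))   # → 1
```

In real Redis the EXEC simply returns nil when the watched key changed, and it is the client's job to loop exactly like this.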

Atomic counters
Operators for key increment

INCR counter:1
GET counter:1 → 1

INCRBY counter:1 9
GET counter:1 → 10
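INCR is atomic because the single-threaded server processes one command at a time. In a multi-threaded client-side sketch, a lock plays that role; without it, concurrent `+= 1` updates could be lost. Everything here is illustrative, not Redis code:

```python
import threading

counter = {"counter:1": 0}
lock = threading.Lock()   # stands in for Redis's single-threaded command loop

def incr(key):
    with lock:                      # one "command" at a time → atomic
        counter[key] += 1
        return counter[key]

threads = [threading.Thread(target=lambda: [incr("counter:1") for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter["counter:1"])         # 4000: no lost updates
```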

Lua scripting
Server-side Lua scripting
A "sort of" stored procedure
Scripts are sandboxed
Atomic execution ← bear in mind

SCRIPT LOAD "return {KEYS[1],KEYS[2]}"
"3905aac1828a8f75707b48e446988eaaeb173f13"
EVALSHA 3905aac1828a8f75707b48e446988eaaeb173f13 2 user:1 user:2
1) "user:1"
2) "user:2"

Lua scripting

Caching: server level

Configure Redis as a cache

maxmemory 1024mb
maxmemory-policy allkeys-lru

all the keys will be evicted using an approximated LRU algorithm
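The allkeys-lru policy can be illustrated with an exact (not approximated) LRU cache built on OrderedDict, where `maxsize` plays the role of `maxmemory`. This is a sketch of the eviction idea, not Redis's sampling-based implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Exact LRU eviction; Redis approximates this by sampling keys."""
    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.data = OrderedDict()

    def set(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)            # mark as most recently used
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)     # evict the least recently used

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)        # a read also refreshes recency
            return self.data[key]
        return None

cache = LRUCache(maxsize=2)
cache.set("a", 1); cache.set("b", 2)
cache.get("a")                 # touch "a", so "b" becomes the LRU entry
cache.set("c", 3)              # over capacity: evicts "b"
print(cache.get("b"))          # None
```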

Caching: TTL on key

Set a timeout on a key
SET doc:1 "mydoc.txt"
EXPIRE doc:1 10

Or:
SETEX doc:1 10 "mydoc.txt"
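The per-key TTL behaviour can be sketched by storing an expiry deadline next to each value and checking it lazily on read, much like Redis's passive expiration (the clock source and function names here are illustrative):

```python
import time

store = {}   # key → (value, deadline)

def setex(key, ttl_seconds, value):
    store[key] = (value, time.monotonic() + ttl_seconds)

def get(key):
    if key in store:
        value, deadline = store[key]
        if time.monotonic() < deadline:
            return value
        del store[key]           # lazy expiry on access
    return None

setex("doc:1", 10, "mydoc.txt")
print(get("doc:1"))              # mydoc.txt
setex("doc:2", 0, "gone.txt")    # expires immediately
print(get("doc:2"))              # None
```

Redis also runs an active sweep in the background, so expired keys are reclaimed even if nobody reads them again.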

Demo

Caching + Atomic counters + Atomic Lua scripting

Duplicate detection
Real-time stream of documents from the Internet
20% to 50% of documents are duplicated

DUPLICATES ARE EVIL

And customers don’t pay for that :(

Basic Scenario

[diagram: Producers (5M docs) → Duplicates detector → NLP (3M) → Storage (3M)]

Avoid duplicated documents
Acting on producers was TOO HARD
Filter them out before heavy document analysis (NLP)

Documents
"Documents" come from:
twitter
facebook
gplus
instagram
forums
blogs

Documents
Each kind of document has its own natural id:
twitter: status id
facebook: post id
forum: URL
blog: URL
We don't want these IDs inside our system

Duplicate and id generation

[diagram: multiple Producers (5M docs total) → Duplicate detector / ID generation → Analysis → Storage (3M)]

Map external keys to internal UIDs
Generate an ID for each document
IDs are generated using daily named counters:

INCR day:20141028 → 12576
INCR day:20141010 → 23412576

Cache the generated ID:
tw_1234578688 → day:20141028;12576

Map external keys to internal UIDs
Documents are internally stored on different storage systems with their generated id
globalId → 20141028:3456789

Operations
Natural keys are cached with a TTL
Documents out of time are parked in a staging area
Duplicated documents are usually dropped

LRU cache, counters and Lua
Lua scripts are executed atomically
We wrote a simple script to:
return the previously mapped id
or generate an id and store key and id in the cache

EVALSHA "sha" 2 20141028 tw_1234566 → 20141028:123
GET tw_1234566 → 20141028:123
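The script's logic, return the cached mapping if one exists, otherwise take the next value from the day's counter and cache the new mapping, can be simulated outside Redis like this. All names are illustrative; in production this runs as a single Lua script so the check and the INCR are atomic:

```python
cache = {}      # external key → internal id (in Redis: the LRU-managed keyspace)
counters = {}   # daily counters (in Redis: INCR day:YYYYMMDD)

def map_id(day, external_key):
    """Return the existing internal id, or mint and cache a new one."""
    if external_key in cache:                   # duplicate: reuse the mapping
        return cache[external_key]
    counters[day] = counters.get(day, 0) + 1    # INCR day:<day>
    internal_id = f"{day}:{counters[day]}"
    cache[external_key] = internal_id
    return internal_id

first = map_id("20141028", "tw_1234566")   # new document → fresh id
again = map_id("20141028", "tw_1234566")   # duplicate → same id
print(first == again)                      # True
```

Because the cached keys carry a TTL, a duplicate arriving long after the original is no longer detectable, which is why late documents are parked in a staging area.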

Demo

Deployment
Pre-production phase
Single server
70M keys in 10GB of RAM
In production with a simple M/S setup

Alternatives
PostgreSQL:
sequence(s)
table OR hstore
Hazelcast (we are Java-based):
in memory
write your own persistence

Q/A

References

http://redis.io/
http://redis.io/commands
http://stackoverflow.com/questions/tagged/redis
http://try.redis.io/