Basho Technologies Rusty Klophaus Querying Riak Just Got Easier Secondary Indices in Riak OSCON Data Portland, Oregon · July 2011 [email protected] twitter: rustyio

Querying Riak Just Got Easier - Introducing Secondary Indices

DESCRIPTION

This presentation introduces new Riak KV functionality called Secondary Indices. Secondary Indices allow a developer to retrieve data by attribute value, rather than by primary key. Currently, a developer who needs access patterns beyond Riak’s key/value lookups must maintain their own indexes into the data using links, other Riak objects, or external systems. This is straightforward for simple use cases, but can add substantial coding and data modeling effort for complex applications. By formalizing an approach and building index support directly into Riak KV, we remove this burden from the application developer while preserving Riak’s core benefits, including scalability and tolerance of hardware failure and network partitions. The presentation covers usage, capabilities, limitations, and lessons learned.

Basho Technologies
Rusty Klophaus

Querying Riak Just Got Easier
Secondary Indices in Riak

OSCON Data
Portland, Oregon · July 2011

[email protected]
twitter: rustyio

tl;dr:

Secondary Indices fundamentally change data modeling in Riak.

Model one-to-one, one-to-many, or many-to-many relationships simply and efficiently.

But first, a little bit about tradeoffs and NoSQL.

Which one would you choose?

It depends on their abilities.
It also depends on the quest.

Fools!

Rule #1
Your character will fare better in certain environments depending on his (or her) abilities.

Rule #2
There are always tradeoffs.

Databases are like RPG character classes.

They focus on certain abilities.

Database Abilities (1/2)

Schema: Flexible Schema ⟩ Pre-Defined Schema ⟩ Typed Fields ⟩ Untyped Fields ⟩ Blob
Operation Skew: Mostly Writes ⟩ 50/50 ⟩ Mostly Reads
Disk Persistence: Every Operation ⟩ Delayed Batch ⟩ Flush ⟩ Never
Transactions: Global ⟩ Table ⟩ Object ⟩ None
Data Relations: Ad-Hoc ⟩ Pre-Defined ⟩ None
Operation Order: Random ⟩ Sequential

Database Abilities (2/2)

Secondary Queries: Ad-Hoc ⟩ Pre-Determined ⟩ None
Native Data Types: Tables ⟩ XML ⟩ JSON ⟩ Text ⟩ Blob
Scalable: Data Center ⟩ Cluster ⟩ Single Machine
Failure-Tolerance: Data Center ⟩ Network ⟩ Machine ⟩ Disk ⟩ Sector ⟩ None
Stability: Predictable Latency ⟩ Variable Latency
Performance: Ops Per Second

And so on...

For a long time, industry focused on just one manifestation of database abilities:

«Relational Database»

SELECT * FROM Quests

The World Has Changed

Global Internet: “Everyone is connected...”
Mobile Computing: “...all the time...”
Social Networking: “...producing more data (in more varieties) than ever.”

NoSQL is about alternatives.

Focus on different abilities, environments, and tradeoffs.

As a database consumer, you need to understand the basic tradeoffs of each solution.

In turn, database producers should strive to make those tradeoffs clear.

There are always tradeoffs.

Databases that claim to do everything well are lying.

Where Does Riak Fit?

Have you ever been burned by:

hardware failure?
overloaded servers?
emergencies at 2am?

Does your quest involve:

SLAs that mention uptime or latency?
data that you can’t afford to lose?

If so, Riak’s tradeoffs will make sense.

If not, they won’t.

Riak KV - Some Tradeoffs (1/3)

Amazon’s Dynamo Architecture
✔ Distributed, scalable, no single point of failure.
✘ No transactions; trade strong consistency for eventual consistency.

Riak KV - Some Tradeoffs (2/3)

Extremely Focused on Operations
✔ Simple to install, manage, connect a cluster.
✔ Has been called “plumbing”, ie: it just works.
✘ Historically, developer-facing features lagged behind.

This is rapidly changing.

# Scale out...
riak-admin join nodename@hostname

# Scale back in...
riak-admin leave

Riak KV - Some Tradeoffs (3/3)

Key/Value Model
✔ Simple, straightforward, content-type agnostic.
✘ More difficult to discover your data. (Queryability.)

Let’s dive deeper into “queryability.”

Current Options for Querying Riak

MapReduce
Provide set of starting keys, filter via map.
MapReduce is meant for calculations/aggregations, not queries.

Riak Search
Full-text search in Riak.
Opinionated, assumes your document is prose.

Roll Your Own Indices
Difficult to get right.
More code to maintain.
Often introduces SPOFs.

New Feature: Secondary Indices

What are Secondary Indices?

Goals
Provide *simple* indexing on Riak objects.
Maintain Riak’s operational advantages.
Make a developer’s life easier.

How Does It Work?
At write time, tag your data with key/value metadata.
Query the metadata, get matching objects.
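
The tag-at-write, query-the-metadata idea can be pictured with a toy in-memory model. This is only a sketch: `TinyIndexedStore` and its methods are hypothetical names, not part of Riak, and the model ignores distribution, replication, and persistence. It assumes range queries include both endpoints.

```python
from collections import defaultdict


class TinyIndexedStore:
    """Toy single-process model of one bucket with secondary indices.

    Hypothetical illustration only; this is not how Riak stores data.
    """

    def __init__(self):
        self.objects = {}               # key -> opaque value
        self.tags = {}                  # key -> index entries written with it
        self.index = defaultdict(set)   # (field, value) -> set of keys

    def put(self, key, value, index_entries=()):
        # Remove entries left over from an earlier version of this key.
        for entry in self.tags.get(key, ()):
            self.index[entry].discard(key)
        self.objects[key] = value
        self.tags[key] = list(index_entries)
        for entry in self.tags[key]:
            self.index[entry].add(key)

    def query_eq(self, field, value):
        # Equality query: keys tagged with exactly this field/value pair.
        return sorted(self.index.get((field, value), ()))

    def query_range(self, field, start, end):
        # Range query: keys whose value for the field lies in [start, end].
        matches = set()
        for (f, v), keys in self.index.items():
            if f == field and start <= v <= end:
                matches |= keys
        return sorted(matches)


store = TinyIndexedStore()
store.put("gauntlet24", "Gauntlets of Shininess",
          [("category_bin", "armor"), ("price_int", 400)])
```

Note that the value itself stays opaque: queries only ever consult the metadata supplied at write time, and they return keys rather than objects.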

For example...

BUCKET: loot
KEY: gauntlet24
VALUE: “Gauntlets of Shininess”
INDEX METADATA: category: armor, price: 400, ...

Index an Object

# Store an object with:
#   Bucket: loot
#   Key: gauntlet24
#   Fields:
#     - category: armor
#     - price: 400
curl \
  -X PUT \
  -d "OPAQUE_VALUE" \
  -H "x-riak-index-category_bin: armor" \
  -H "x-riak-index-price_int: 400" \
  http://127.0.0.1:8098/riak/loot/gauntlet24
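
For readers working from a script rather than a shell, the same request can be assembled with Python's standard library. This is a minimal sketch: the URL layout and the `x-riak-index-*` header names are copied from the curl example, while `build_indexed_put` is a hypothetical helper. The request is only built, not sent.

```python
import urllib.request


def build_indexed_put(base_url, bucket, key, value, index_fields):
    """Build (but do not send) a PUT that tags an object with index metadata.

    The URL layout and x-riak-index-* header names are copied from the curl
    example; build_indexed_put itself is a hypothetical helper.
    """
    headers = {
        "x-riak-index-" + field: str(field_value)
        for field, field_value in index_fields.items()
    }
    url = "{}/riak/{}/{}".format(base_url, bucket, key)
    return urllib.request.Request(
        url, data=value.encode("utf-8"), headers=headers, method="PUT"
    )


request = build_indexed_put(
    "http://127.0.0.1:8098", "loot", "gauntlet24", "OPAQUE_VALUE",
    {"category_bin": "armor", "price_int": 400},
)
# Sending it requires a running node:
#   urllib.request.urlopen(request)
```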

Query the Index

# Query for category_bin = "armor"
curl \
  http://127.0.0.1:8098/buckets/loot/index/category_bin/armor

{"keys":["gauntlet24"]}

# Query for price_int between 300 and 500
curl \
  http://127.0.0.1:8098/buckets/loot/index/price_int/300/500

{"keys":["gauntlet24"]}

Query Syntax

/buckets/$BUCKET/index/$FIELDNAME/$VALUE
/buckets/$BUCKET/index/$FIELDNAME/$START/$END

$BUCKET
Bucket to query.

$FIELDNAME
Must end with “_bin” for binaries, “_int” for integers.
Special field $key for key range lookups.

$VALUE / $START / $END
Equality or range queries.
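
The two path shapes can be generated by a small helper, sketched here under stated assumptions: `index_query_path` is a hypothetical name, the suffix check mirrors the `$FIELDNAME` rule on this slide, and leaving `$` unescaped in the path is an assumption made for readability.

```python
from urllib.parse import quote


def index_query_path(bucket, field, start, end=None):
    """Return the URL path for an equality (end=None) or range query."""
    if field != "$key" and not field.endswith(("_bin", "_int")):
        raise ValueError("field name must end with _bin or _int, or be $key")
    parts = ["buckets", bucket, "index", field, str(start)]
    if end is not None:
        parts.append(str(end))
    # Percent-encode each segment; "$" is kept so "$key" stays readable.
    return "/" + "/".join(quote(part, safe="$") for part in parts)


equality_path = index_query_path("loot", "category_bin", "armor")
range_path = index_query_path("loot", "price_int", 300, 500)
```

Rejecting unsuffixed field names on the client side surfaces a naming mistake before any request is made.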

Data Modeling with Secondary Indices

Key/Value Lookups

Use Case: Retrieve a user’s session.
Generic Case: Retrieve an object by key.

Object
Key: sessions/8b6cfaa
Value: { foo: "bar", ... }

Alternate Keys / One-to-One Relationships

Retrieve a user by username or by email address.
An object has multiple names.

users/rusty
{
  username: "rusty",
  email: "[email protected]",
  twitter: "rustyio",
  ...
}

emails/[email protected]
"rusty"

Alternate Keys

Retrieve a user by username or by email address.
An object has multiple names.

users/rusty
{
  username: "rusty",
  email: "[email protected]",
  twitter: "rustyio",
  ...
}

Indexes (on users/rusty):
  email_bin: [email protected]
  twitter_bin: rustyio

With Secondary Indices!

Ownership / One-to-Many Relationships

A person has many cars.
Parent has a *small* number of children.

people/frank
{
  cars: [
    { plate: "ET7-B928", color: "red", type: "corvette" },
    ...
  ]
}

Ownership / One-to-Many Relationships

A person has many cars.
Parent has a *small* number of children.

people/frank
{
  cars: [
    { plate: "ET7-B928", color: "red", type: "corvette" },
    ...
  ]
}

Indexes:
  cars_plate_bin: ET7-B928
  cars_plate_bin: BUB-7911
  ...

With Secondary Indices!
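
The point of this slide is that one object can carry several values for the same index field, one per child. A minimal sketch of deriving those entries from the embedded list follows; `plate_index_entries` is a hypothetical helper, and the second car carries only the plate because that is all the slide shows for it.

```python
def plate_index_entries(person):
    """Return one ("cars_plate_bin", plate) entry per car the person owns."""
    return [("cars_plate_bin", car["plate"]) for car in person.get("cars", [])]


frank = {
    "cars": [
        {"plate": "ET7-B928", "color": "red", "type": "corvette"},
        # Only the plate of the second car appears on the slide.
        {"plate": "BUB-7911"},
    ]
}
entries = plate_index_entries(frank)
```

With these entries attached to people/frank, an equality query on cars_plate_bin for either plate would lead back to the same owner.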

Ownership / One-to-Many Relationships

A user has many status updates.
Parent has a *large* number of children.

users/rustyio
{
  status_updates: [ "18258713", "87187597", "71117389", ... ]
}

statuses/18258713
{
  author: "rustyio",
  reply_to: "barackobama",
  text: "Sorry, can't hang out now, I'm speaking at OSCON Data."
}

Ownership / One-to-Many Relationships

A user has many status updates.
Parent has a *large* number of children.

users/rustyio
{ ... }

statuses/18258713
{
  author: "rustyio",
  reply_to: "barackobama",
  text: "Sorry, can't hang out now, I'm speaking at OSCON Data."
}

Indexes:
  author_bin: rustyio
  reply_to_bin: barackobama

With Secondary Indices!

Membership / Many-to-Many Relationships

A user joins one or more clubs, a club has users.
A group has many members.

users/rusty
{ clubs: [ "dc_larping_club", ... ] }

clubs/dc_larpers
{ users: [ "rusty", "frank" ] }

Membership / Many-to-Many Relationships

A user joins one or more clubs, a club has users.
A group has many members.

users/rusty
{ clubs: [ "dc_larping_club", "nascar_fans", ... ] }

Indexes:
  club_bin: dc_larping_club
  club_bin: nascar_fans

clubs/dc_larpers
...

With Secondary Indices!
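
Tagging only the user objects is enough to answer both directions of the relationship: a user's clubs are read from the user object, and a club's members come from an equality query on `club_bin`. The sketch below imitates that second lookup in plain Python; `members_by_club` is a hypothetical helper, and frank's membership is an assumption taken from the clubs/dc_larpers object on the previous slide.

```python
from collections import defaultdict


def members_by_club(users):
    """Invert each user's club list into club -> sorted list of user keys."""
    members = defaultdict(set)
    for user_key, user in users.items():
        for club in user["clubs"]:
            members[club].add(user_key)
    return {club: sorted(keys) for club, keys in members.items()}


users = {
    "rusty": {"clubs": ["dc_larping_club", "nascar_fans"]},
    # Assumption: frank's membership is taken from the clubs/dc_larpers object.
    "frank": {"clubs": ["dc_larping_club"]},
}
club_index = members_by_club(users)
```

Because membership lives in one place, joining or leaving a club is a single write to the user object instead of coordinated updates to two objects.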

What Were The Challenges?

Challenge: Ambitious Prototyping (1/5)

Why Difficult?
Early prototypes contained support for:
• A SQL-like query language (RQL), with compound queries
• Sorting and pagination, with intelligent caching
• Inline map/reduce
• Extensible data types
Arguably too clever.
Allowed developers to shoot selves in foot.

How Solved?
Ruthlessly cut features & simplify.

Challenge: Data Types (2/5)

Why Difficult?
What type is a given field? Naming convention, or global dictionary?
What if the user wants to change the type?
What if the user provides a value of the wrong type?

How Solved?
Field type determined by suffix. (field1_bin, field2_int)
Different type == different field name.
Pre-commit hook to validate data types.
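
The suffix rule is simple enough to sketch in a few lines. This is an illustration only: the slides do not show the actual pre-commit hook, so `coerce_index_value` is a hypothetical Python stand-in for the kind of check such a hook might perform.

```python
def coerce_index_value(field, raw_value):
    """Validate raw_value against the type implied by the field's suffix."""
    text = str(raw_value).strip()
    if field.endswith("_int"):
        return int(text)  # raises ValueError for non-integer input
    if field.endswith("_bin"):
        return text
    raise ValueError("unknown index type for field: " + field)


price = coerce_index_value("price_int", "400")
category = coerce_index_value("category_bin", "armor")
```

Since the type is part of the name, changing a field's type means writing under a new field name, and the old and new fields never collide.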

Challenge: Disk Based Storage (3/5)

Why Difficult?
Disk is slow. (http://highscalability.com/numbers-everyone-should-know)
Need data structures that are both read and write efficient, for data of unpredictable sizes and shapes.

How Solved?
Leverage merge_index (data engine from Riak Search).
Investigate LevelDB (library from Google).

Challenge: Atomicity (4/5)

Why Difficult?
Need to keep the index synchronized with the object value.
Account for eventual consistency, siblings, handoff, replication, etc.
What happens during node failure / partial cluster situations?

How Solved?
Make the KV object the authoritative data.
Indexed data is discarded if the object moves.

Challenge: System is Distributed (5/5)

Why Difficult?
Index is split over many different partitions.
*Don’t* need to query every partition.
*Do* need to be smart about which partitions to query.

How Solved?
Extend riak_core (distribution layer) with ability to broadcast command to covering set of partitions.
(h/t to Kelly McLaughlin, @_klm)
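
Why a covering set can be much smaller than the full ring is easiest to see in a toy model. The sketch below assumes each key-space segment is replicated on `n_val` consecutive partitions, so every `n_val`-th partition together sees every segment. The slides do not describe riak_core's actual coverage planner, so `holders`, `covering_set`, and the ring parameters are all hypothetical.

```python
def holders(segment, ring_size, n_val):
    """Partitions holding a replica of the given key-space segment (toy model)."""
    return {(segment + i) % ring_size for i in range(n_val)}


def covering_set(ring_size, n_val):
    """Pick every n_val-th partition; together they see every segment."""
    return list(range(0, ring_size, n_val))


ring_size, n_val = 64, 3
plan = covering_set(ring_size, n_val)
# Every segment must have at least one replica inside the plan.
covered = all(holders(s, ring_size, n_val) & set(plan) for s in range(ring_size))
```

In this model roughly one third of the partitions answer the query. A real planner also has to avoid counting a segment twice when two chosen partitions both hold it (as happens at the wrap-around here) and has to route around unavailable nodes.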

Indexing

[Diagram: path of a PUT request through the cluster. Nesting: Cluster ⟩ Node ⟩ VNode (Virtual Node).]

Components:
PUT Request
Client API
Write Coordinator: riak_kv_put_fsm
VNode Coordinator: riak_core_vnode_master
Core VNode: riak_core_vnode
KV VNode: riak_kv_vnode
Backend: riak_kv_index_backend

Annotations:
User tags the object with metadata.
Validate that metadata parses.
Stores object in bitcask, index metadata in merge_index.

Querying

[Diagram: path of a query request through the cluster. Nesting: Cluster ⟩ Node ⟩ VNode (Virtual Node).]

Components:
Query Request
Client API
Query Coordinator: riak_index_query_fsm
Coverage Logic: riak_core_coverage_fsm
VNode Coordinator: riak_core_vnode_master
Core VNode: riak_core_vnode
KV VNode: riak_kv_vnode
Backend: riak_kv_index_backend

Annotations:
User issues a query against metadata.
Runs query against index, replies with results.

Next Steps

Publish
We are open source. Code is available now (for the foolhardy.)
Beta version soon (for the adventurous.)
Included in Riak version 1.0 (for the masses.)

API design is like sex: Make one mistake and support it for the rest of your life.
- @joshbloch

(Everything is subject to change.)

About Basho Technologies

<plug type=“shameless”>
• Distributed company: ~30 people
• Cambridge, MA / San Francisco, CA / Reston, VA
• “We hate downtime, we hate overtime.”
• Riak KV (Open Source)
• Riak Support, Services, and Enterprise Features ($)
</plug>

Thanks! / Questions?

Rusty [email protected]
@rustyio

Mark [email protected]
@pharkmillups

Ryan [email protected]
@rzezeski

Tony [email protected]
@antonyfalco

Also at OSCON...
