Sqrrl October Webinar: Data Modeling and Indexing


This webinar provides a technical deep dive into the NoSQL database Apache Accumulo. Sqrrl extends Accumulo with additional security, analytical, and data modeling tools. Topics include data modeling techniques, secondary indexes, and JSON and graph capabilities for Accumulo.


Securely explore your data

DATA MODELING AND INDEXING FOR APACHE ACCUMULO

Sqrrl Webinar Series, October 2013
Adam Fuchs, CTO
Sqrrl Data, Inc.

RECAP

In our September Webinar: Sqrrl, Apache Accumulo, and Cell-Level Security

1.  Introduction to Sqrrl and Accumulo
2.  Security In The Wild
3.  Sqrrl and Accumulo Technology
4.  The Data-Centric Security Ecosystem

Sqrrl Data, Inc. Confidential and Proprietary

TODAY’S DISCUSSION

1.  Sqrrl and Accumulo Technology Review
2.  Table Designs
    1.  Dynamic Documents
    2.  Graphs
    3.  Inverted Indexes
3.  Putting It All Together with Sqrrl

Data Modeling and Indexing for Apache Accumulo


LAYERED ARCHITECTURE
Turtles all the way down...

[Figure: layer stack, top to bottom]
Application
Sqrrl API over Apache Thrift RPC (JSON, Graph, Aggregation, Search, etc.)
Sqrrl Enterprise
Accumulo RPC (Sorted Key/Value I/O)
Hadoop RPC (File I/O)


ACCUMULO DATA FORMAT

An Accumulo key is a 5-tuple, consisting of:

•  Row: Controls Atomicity
•  Column Family: Controls Locality
•  Column Qualifier: Controls Uniqueness
•  Visibility Label: Controls Access
•  Timestamp: Controls Versioning

Accumulo Key/Value Example

Row        Col. Fam.      Col. Qual.      Visibility    Timestamp   Value
John Doe   Notes          PCP             PCP_JD        20120912    Patient suffers from an acute …
John Doe   Test Results   Cholesterol     JD|PCP_JD     20120912    183
John Doe   Test Results   Mental Health   JD|PSYCH_JD   20120801    Pass
John Doe   Test Results   X-Ray           JD|PHYS_JD    20120513    1010110110100…
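As a rough illustration of how the 5-tuple key drives ordering and access, the sketch below models the example entries in plain Python. It is not Accumulo code: the `Key` class, `visible`, and `scan` are hypothetical helpers, and the visibility check assumes only flat `A|B` labels like the ones in the example (real labels also allow `&` and parentheses).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int

    def sort_key(self):
        # Row, family, qualifier, and visibility sort ascending; the timestamp
        # sorts descending so the newest version of a cell is read first.
        return (self.row, self.family, self.qualifier, self.visibility, -self.timestamp)


def visible(expression, authorizations):
    # Simplified assumption: only flat "A|B" disjunctions are handled.
    # Real visibility labels also support "&" and parentheses.
    if not expression:
        return True
    return any(token in authorizations for token in expression.split("|"))


entries = [
    (Key("John Doe", "Test Results", "X-Ray", "JD|PHYS_JD", 20120513), "1010110110100…"),
    (Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912), "Patient suffers from an acute …"),
    (Key("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801), "Pass"),
    (Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912), "183"),
]

# The table is always kept in key order, whatever the insertion order was.
table = sorted(entries, key=lambda entry: entry[0].sort_key())


def scan(table, authorizations):
    # Return only the entries whose visibility label the reader satisfies.
    return [(k, v) for k, v in table if visible(k.visibility, authorizations)]


patient_view = scan(table, {"JD"})      # the patient cannot read the PCP's notes
pcp_view = scan(table, {"PCP_JD"})      # the PCP sees the notes and cholesterol only
```

Because every entry carries its own label, the same table can serve both readers without maintaining separate copies of the data.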


THE ACCUMULO CLIENT API

Instance: new ZooKeeperInstance(...) or new MockInstance()
  getConnector(...) -> Connector
    TableOperations
    InstanceOperations
    SecurityOperations
    createScanner(...) -> Scanner
    createBatchScanner(...) -> BatchScanner
      Range
      IteratorOption
      iterator() -> Map.Entry (Key, Value)
    createBatchWriter(...) -> BatchWriter
      addMutation(...) <- Mutation


ACCUMULO TECHNOLOGY

Strengths
•  Shared-Nothing => Scalability
•  Micro-Batching for Efficient Random I/O
•  High Concurrency, Low Latency for Denormalized Data
•  Sparse, Flexible Schema supports dynamic and diverse data models
•  Cell-level Security promotes sharing

Weaknesses
•  Sorting induces write multiplication factor
•  Sparse schema support induces additional storage overhead

[Figure: Tablet Data Flow. Labels: Writes, Reads, Scan, Iterator Tree (x3), In-Memory Map, Write Ahead Log (For Recovery), Minor Compaction, Sorted, Indexed File (x3), Merging / Major Compaction.]

[Figure: cluster diagram. Labels: Application (x3), Tablet Server with Tablet (x3), ZooKeeper (x3), Master, HDFS. Arrows: Read/Write, Store/Replicate, Assign/Balance, Delegate Authority (x2).]
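The tablet data flow can be sketched as a toy log-structured store: writes land in a write-ahead log and an in-memory map, a minor compaction flushes the map to a sorted file, and scans merge all sorted sources. The `ToyTablet` class below is a hypothetical stand-in written for this illustration, not Accumulo's implementation; the flush threshold and the `entries_written_to_files` counter are assumptions added to make the write multiplication visible.

```python
import heapq


class ToyTablet:
    """Hypothetical, simplified model of a tablet's data flow."""

    def __init__(self, flush_threshold=2):
        self.write_ahead_log = []          # appended on every write, for recovery
        self.memory_map = {}               # key -> (sequence, value)
        self.files = []                    # sorted runs of (key, sequence, value)
        self.flush_threshold = flush_threshold
        self.sequence = 0
        self.entries_written_to_files = 0  # counts file writes, including rewrites

    def write(self, key, value):
        self.sequence += 1
        self.write_ahead_log.append((key, self.sequence, value))
        self.memory_map[key] = (self.sequence, value)
        if len(self.memory_map) >= self.flush_threshold:
            self.minor_compact()

    def minor_compact(self):
        # Flush the in-memory map to a new sorted, immutable run.
        run = sorted((k, s, v) for k, (s, v) in self.memory_map.items())
        if run:
            self.files.append(run)
            self.entries_written_to_files += len(run)
        self.memory_map = {}
        self.write_ahead_log = []          # flushed entries no longer need the log

    def major_compact(self):
        # Merge all runs into one, keeping only the newest version of each key.
        newest = {}
        for key, seq, value in heapq.merge(*self.files):
            newest[key] = (seq, value)
        merged = [(k, s, v) for k, (s, v) in newest.items()]
        self.files = [merged] if merged else []
        self.entries_written_to_files += len(merged)

    def scan(self):
        # Merge the sorted files with the in-memory map; for equal keys the
        # higher sequence number arrives later and overwrites the older value.
        memory_run = sorted((k, s, v) for k, (s, v) in self.memory_map.items())
        newest = {}
        for key, seq, value in heapq.merge(*self.files, memory_run):
            newest[key] = value
        return list(newest.items())


tablet = ToyTablet(flush_threshold=2)
for key, value in [("b", 1), ("a", 2), ("c", 3), ("a", 4), ("d", 5)]:
    tablet.write(key, value)

before = tablet.scan()
files_before = len(tablet.files)
tablet.major_compact()
after = tablet.scan()
```

In this run five logical writes cause seven entries to be written to files once the major compaction has rewritten the flushed data, which is the write multiplication listed under Weaknesses.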


TODAY’S DISCUSSION

1.  Sqrrl and Accumulo Technology Review
2.  Table Designs
    1.  Dynamic Documents
    2.  Graphs
    3.  Inverted Indexes
3.  Putting It All Together with Sqrrl

Data Modeling and Indexing for Apache Accumulo


PROXY/NETFLOW EXAMPLE

Source     Destination    Port    Bytes In      Bytes Out     Protocol
10.1.2.3   google.com     80      73,824        15,632        http
10.1.2.4   facebook.com   443     10,328        13,284,129    https
10.1.2.4   google.com     80      623,249       93,125        http
10.1.2.3   abcd1234.ru    31337   158           523,698,104   unknown
10.1.2.3   netflix.com    443     434,855,357   1,392,994     https
10.1.2.4   google.com     443     23,084        583,331       https
10.1.2.3   10.1.2.5       22      204           158           ssh


INDEXES AND QFDS

Logs/Observations (Input)
•  Immutable
•  Append-Only
•  Real-Time

Transformation

Indexes and Question-Focused Datasets
•  Online
•  Sorted
•  Grouped
•  Aggregated


QFD KEY GENERATION

Source     Destination   Port   Bytes In   Bytes Out   Protocol
10.1.2.3   google.com    80     73,824     15,632      http

Key                        ->   Value
10.1.2.3, Bytes In         ->   +73,824
10.1.2.3, Bytes Out        ->   +15,632
10.1.2.3, Ports Used       ->   +{80}
10.1.2.3, Protocols Used   ->   +{http}

Hosts QFD: 0x00 ... 0xFF
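A minimal sketch of the key generation step, assuming each flow record arrives as a dictionary: one record is turned into a list of (key, contribution) pairs for the source host. The function name `qfd_contributions` and the dictionary field names are hypothetical; only the four keys and values come from the example.

```python
def qfd_contributions(flow):
    """Turn one flow record into (key, contribution) pairs for its source host."""
    host = flow["source"]
    return [
        ((host, "Bytes In"), flow["bytes_in"]),          # numeric: to be summed
        ((host, "Bytes Out"), flow["bytes_out"]),        # numeric: to be summed
        ((host, "Ports Used"), {flow["port"]}),          # set: to be unioned
        ((host, "Protocols Used"), {flow["protocol"]}),  # set: to be unioned
    ]


flow = {
    "source": "10.1.2.3",
    "destination": "google.com",
    "port": 80,
    "bytes_in": 73824,
    "bytes_out": 15632,
    "protocol": "http",
}

contributions = qfd_contributions(flow)
```

Each pair is an increment rather than a final value, so records can be processed in any order and merged later.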


HOSTS QFD WITH AGGREGATION

IP         Ports Used             Protos Used                   Total Bytes In   Total Bytes Out   Ports Hosted   Protos Hosted
10.1.2.3   {22, 80, 443, 31337}   {http, https, ssh, unknown}   434,929,543      525,106,888       -              -
10.1.2.4   {80, 443}              {http, https}                 656,661          13,960,585        -              -
10.1.2.5   -                      -                             158              204               {22}           {ssh}

New Contribution: (10.1.2.5, Total Bytes In -> +3,215)

158 + 3,215 = 3,373
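The aggregation step can be sketched as a merge function applied per key: numeric contributions are summed and set contributions are unioned. In Accumulo this kind of merge would typically run server-side in a combiner iterator; the plain-Python `combine` and `apply_contributions` helpers below are hypothetical and only mimic that behavior on a dictionary.

```python
def combine(current, contribution):
    """Merge a new contribution into the stored value for one key."""
    if current is None:
        return contribution
    if isinstance(contribution, (set, frozenset)):
        return current | contribution   # set-valued columns: union
    return current + contribution       # numeric columns: sum


def apply_contributions(qfd, contributions):
    for key, contribution in contributions:
        qfd[key] = combine(qfd.get(key), contribution)
    return qfd


# Existing state for 10.1.2.5, taken from the aggregated table.
hosts_qfd = {
    ("10.1.2.5", "Total Bytes In"): 158,
    ("10.1.2.5", "Total Bytes Out"): 204,
    ("10.1.2.5", "Ports Hosted"): {22},
    ("10.1.2.5", "Protos Hosted"): {"ssh"},
}

apply_contributions(hosts_qfd, [
    (("10.1.2.5", "Total Bytes In"), 3215),  # the new contribution from the slide
    (("10.1.2.5", "Ports Hosted"), {22}),    # repeating a set member changes nothing
])

total_bytes_in = hosts_qfd[("10.1.2.5", "Total Bytes In")]  # 158 + 3215 = 3373
```

Because both operations are associative and commutative, partial results can be merged at any point without changing the final aggregate.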


CONNECTIVITY GRAPH

[Figure: connectivity graph. Nodes: 10.1.2.3, 10.1.2.4, 10.1.2.5, facebook.com, google.com, abcd1234.ru, netflix.com.]


Row        Col. Fam.   Col. Qual.     Val.
10.1.2.3   Contacts    10.1.2.5       -
10.1.2.3   Contacts    abcd1234.ru    -
10.1.2.3   Contacts    google.com     -
10.1.2.3   Contacts    netflix.com    -
10.1.2.4   Contacts    facebook.com   -
10.1.2.4   Contacts    google.com     -

Row            Col. Fam.   Col. Qual.   Val.
10.1.2.5       Serves      10.1.2.3     -
abcd1234.ru    Serves      10.1.2.3     -
facebook.com   Serves      10.1.2.4     -
google.com     Serves      10.1.2.3     -
google.com     Serves      10.1.2.4     -
netflix.com    Serves      10.1.2.3     -
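One way to read these edge tables: every observed connection is written twice, once under the source row ("Contacts") and once under the destination row ("Serves"), so a neighbor lookup in either direction becomes a short range scan over sorted keys. The sketch below assumes both edge directions live in a single sorted list and uses hypothetical helper names; it is an illustration, not Accumulo or Sqrrl code.

```python
import bisect

# (source, destination) pairs taken from the proxy/netflow example.
flows = [
    ("10.1.2.3", "google.com"),
    ("10.1.2.4", "facebook.com"),
    ("10.1.2.4", "google.com"),
    ("10.1.2.3", "abcd1234.ru"),
    ("10.1.2.3", "netflix.com"),
    ("10.1.2.4", "google.com"),
    ("10.1.2.3", "10.1.2.5"),
]


def edge_entries(flows):
    """Emit a forward and a reverse entry per connection, deduplicated and sorted."""
    entries = set()
    for source, destination in flows:
        entries.add((source, "Contacts", destination))
        entries.add((destination, "Serves", source))
    return sorted(entries)


def neighbors(table, row, family):
    """Range scan: seek to the first entry for (row, family), read until it changes."""
    start = bisect.bisect_left(table, (row, family, ""))
    result = []
    for entry_row, entry_family, qualifier in table[start:]:
        if (entry_row, entry_family) != (row, family):
            break
        result.append(qualifier)
    return result


graph_table = edge_entries(flows)
contacted_by_3 = neighbors(graph_table, "10.1.2.3", "Contacts")
clients_of_google = neighbors(graph_table, "google.com", "Serves")
```

Storing the reverse edge doubles the writes but avoids a full-table scan when asking which hosts talked to a given destination.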

INVERTED INDEXING

Table:              Forward Index   Inverted Index
Row:                <UUID>          <Field>
Column Family:      <Type>          <Term>
Column Qualifier:   <Field>         <UUID>
Value:              <Term>          <Digest of Event>
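The forward and inverted layouts can be sketched with two sorted lists of tuples. Everything beyond the column layout is an assumption made for illustration: the event UUIDs (`e1` to `e3`), the `netflow` type, the digest format, and the helper names are hypothetical.

```python
import bisect


def index_event(uuid, event_type, fields, digest):
    """Build forward and inverted entries for one event."""
    # Forward:  row=<UUID>,  family=<Type>, qualifier=<Field>, value=<Term>
    forward = [(uuid, event_type, field, term) for field, term in fields.items()]
    # Inverted: row=<Field>, family=<Term>, qualifier=<UUID>,  value=<Digest>
    inverted = [(field, term, uuid, digest) for field, term in fields.items()]
    return forward, inverted


events = {
    "e1": {"source": "10.1.2.3", "destination": "google.com", "protocol": "http"},
    "e2": {"source": "10.1.2.4", "destination": "facebook.com", "protocol": "https"},
    "e3": {"source": "10.1.2.4", "destination": "google.com", "protocol": "http"},
}

forward_index, inverted_index = [], []
for uuid, fields in events.items():
    digest = f'{fields["source"]} -> {fields["destination"]}'  # assumed summary format
    forward, inverted = index_event(uuid, "netflow", fields, digest)
    forward_index.extend(forward)
    inverted_index.extend(inverted)
forward_index.sort()
inverted_index.sort()


def lookup(inverted, field, term):
    """Range scan over the inverted index for one field/term pair."""
    start = bisect.bisect_left(inverted, (field, term, "", ""))
    hits = []
    for entry_field, entry_term, uuid, digest in inverted[start:]:
        if (entry_field, entry_term) != (field, term):
            break
        hits.append((uuid, digest))
    return hits


google_hits = lookup(inverted_index, "destination", "google.com")

# A conjunction is the intersection of the UUID sets of two lookups.
from_4 = {uuid for uuid, _ in lookup(inverted_index, "source", "10.1.2.4")}
to_google = {uuid for uuid, _ in google_hits}
both = sorted(from_4 & to_google)
```

Keeping a digest in the inverted entry lets a query show a short summary of each hit without going back to the forward index.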



ADVANCED INDEXING

Table:                       Shard Table
Row:                         <Partition ID>
Column Family:               "Docs"             "Inv. Index"      "Field Index"           "Geo"
Column Qualifier (Tuples):   <UUID>, <Field>    <Term>, <UUID>    <Field:Term>, <UUID>    <Hash>, <UUID>
Value:                       <Value>
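A rough sketch of the shard-table idea: each document is assigned to a partition, and its document entries and index entries are stored under that same partition row, so a multi-term query can be evaluated partition by partition. The qualifier tuple layout used here, the CRC32-based partitioning, the partition count, and the sample events are all assumptions for illustration, not the layout Sqrrl actually uses.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed; a real deployment would size this to the cluster


def partition_id(uuid):
    # Deterministic hash so a document always lands in the same partition.
    return zlib.crc32(uuid.encode("utf-8")) % NUM_PARTITIONS


def build_shard_table(events):
    entries = set()
    for uuid, fields in events.items():
        row = partition_id(uuid)
        for field, term in fields.items():
            entries.add((row, "Docs", (uuid, field), term))
            entries.add((row, "Field Index", (f"{field}:{term}", uuid), ""))
    return sorted(entries)


def search(table, **criteria):
    """Intersect the field-index postings inside each partition, then merge."""
    if not criteria:
        return []
    postings = defaultdict(lambda: defaultdict(set))
    for row, family, qualifier, _ in table:
        if family == "Field Index":
            field_term, uuid = qualifier
            postings[row][field_term].add(uuid)
    hits = set()
    for row, index in postings.items():
        wanted = [index.get(f"{field}:{term}", set()) for field, term in criteria.items()]
        hits |= set.intersection(*wanted)  # evaluated locally, per partition
    return sorted(hits)


events = {
    "e1": {"source": "10.1.2.3", "destination": "google.com", "protocol": "http"},
    "e2": {"source": "10.1.2.4", "destination": "facebook.com", "protocol": "https"},
    "e3": {"source": "10.1.2.4", "destination": "google.com", "protocol": "http"},
}

shard_table = build_shard_table(events)
http_to_google = search(shard_table, destination="google.com", protocol="http")
from_4_to_google = search(shard_table, source="10.1.2.4", destination="google.com")
```

Because a document and its index entries share a row, the intersection never has to cross partitions; the trade-off is that every query fans out to all partitions.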


TODAY’S DISCUSSION

1.  Sqrrl and Accumulo Technology Review
2.  Table Designs
    1.  Dynamic Documents
    2.  Graphs
    3.  Inverted Indexes
3.  Putting It All Together with Sqrrl

Data Modeling and Indexing for Apache Accumulo


SQRRL ENTERPRISE

•  Dynamic Documents
   •  JSON I/O support
   •  Cell-level Security and Efficient Aggregation Extensions
•  Dynamic Graphs
   •  Co-partitioned with Documents for Integrated Search and Discovery
•  Search
   •  Lucene Query Syntax
   •  Accumulo Indexes Preserve Security Model
•  Processing
   •  SQL-Like Language for Transforming and Aggregating Results
   •  Parallel Slicing and Extraction


Simple API for Advanced Accumulo Usage

REAL-TIME OPERATIONAL APPS


Contact us for a demo


HOW TO LEARN MORE

Download our White Paper
•  www.sqrrl.com/whitepaper

Watch a video
•  www.sqrrl.com/downloads#videos

Request a demo or one-on-one workshop
•  www.sqrrl.com/contact

Come meet us
•  Accumulo Meetup (October 28, New York)
•  Strata + Hadoop World (October 28-30, New York)
•  IBM IOD (November 4-7, Las Vegas)
•  SC13 (November 18-21, Denver)


THANK YOU

Thanks for attending!

To keep up to date with Sqrrl, check out our social media sites:
www.twitter.com/sqrrl_inc
www.linkedin.com/company/sqrrl

