From PostgreSQL to Cassandra, In Four Easy Steps (Axel Eirola and Jarrod Creado, LabDev, F-Secure). In this presentation Axel and Jarrod tell the tale of our Network Reputation System live migration (PostgreSQL to Cassandra). The F-Secure Network Reputation System is a core element of the protection we provide to our customers. It consists of URLs and other network-related metadata, used to make fast assessments of their reputation. Currently the Network Reputation System database contains hundreds of millions of URLs. More info about Cassandra @ F-Secure? http://www.planetcassandra.com/blog/post/apache-cassandra-at-f-secure
From Postgres to Cassandra
In four easy steps
Axel Eirola and Jarrod Creado (F-Secure)
Agenda
1. Postgres
2. Cassandra
3. ???
4. Profit
0. Context
Categorizing the internet
Hundreds of millions of URLs
Data size in the terabytes
Reputation metadata:
Categories: adult, gambling, …
Safety: malicious, safe, …
Automatic processing
(Re)processing hundreds of thousands of URLs per day
Computation divided among multiple services, each with multiple instances
Downtime not an option
Manual research
Data mining capabilities
Researching (aimlessly poking around)
Reporting
1. Postgres
BCNF up in this
Planned for storage, not queries
Highly normalized
Stiff schema, hard to add more fields
Sharding like a boss
Segmenting the URL keyspace
One (or more) box for each segment
Difficult to add more capacity
We got eight single points of failure
Upgrading means downtime
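To make the capacity problem concrete, here is a minimal sketch of one way a URL keyspace could be segmented. The slides do not say how the segments were assigned, so the hash-modulo rule, the function names `shard_for` and `moved_fraction`, and the shard count of eight (borrowed from the "eight single points of failure" remark) are all assumptions for illustration only.

```python
import hashlib

NUM_SHARDS = 8  # assumed: taken from the "eight single points of failure" remark


def shard_for(url_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a URL key to a shard index (hypothetical hash-modulo routing)."""
    digest = hashlib.sha1(url_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys that land on a different shard after resizing."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_shards) != shard_for(key, new_shards)
    )
    return moved / len(keys)


if __name__ == "__main__":
    sample = [f"url-{i}" for i in range(1000)]
    # Under this rule, adding a ninth box relocates most keys.
    print(moved_fraction(sample, 8, 9))
```

Under a rule like this, going from eight to nine boxes moves roughly eight keys in nine in expectation, which is one plausible reading of why adding capacity was difficult.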
Index all the things
Building queries is hard due to the structure of the schema
Managing indices for those queries is hard
The mess needs to be abstracted away from the user; this is also hard
2. Cassandra
Easy management
Easy scaling up as more data is stored
Out of the box:
Replication
Pagination
Load balancing
Less downtime during upgrades
TTL
Mapping data
Structure of our data is suitable for NoSQL
Mostly based around single URLs
Given a URL, fetch metadata
Got queries?
Cassandra schema designed for fixed pattern access performed by automation
Human free-form searches offloaded to Elasticsearch
Load on one doesn't affect the other
Denormalize
Provide fixed pattern access for automation
Relations become ranges in the column namespace
This is pre-CQL, so we are doing collections the old-school way
Minimize the amount of read-then-write scenarios
Postgres
Url: key, url
Category: key, name
Url_Category: url_key, category_key, timestamp
Cassandra
Url
row_key    url    (c)_<category_name>
<url_key>  <url>  <timestamp>
Category
row_key          <url_key>
<category_name>  <empty>
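The mapping above can be sketched with plain Python dictionaries standing in for the two Cassandra layouts. This is a toy model, not the actual client code from the talk: the names `url_cf`, `category_cf`, `add_category`, `categories_of`, and `urls_in` are hypothetical, and only the row and column layout is taken from the slide.

```python
from collections import defaultdict

# Toy stand-ins for the two layouts: row key -> {column name: value}.
url_cf = defaultdict(dict)       # row key: <url_key>
category_cf = defaultdict(dict)  # row key: <category_name>

CATEGORY_PREFIX = "(c)_"  # column-name prefix shown on the slide


def add_category(url_key, url, category_name, timestamp):
    """Blind write: nothing is read before either row is written."""
    url_cf[url_key]["url"] = url
    url_cf[url_key][CATEGORY_PREFIX + category_name] = timestamp
    # The column name carries the data; the value stays empty.
    category_cf[category_name][url_key] = ""


def categories_of(url_key):
    """The relation is the range of columns starting with '(c)_'."""
    row = url_cf.get(url_key, {})
    return sorted(
        name[len(CATEGORY_PREFIX):]
        for name in row
        if name.startswith(CATEGORY_PREFIX)
    )


def urls_in(category_name):
    """Reverse lookup through the Category rows."""
    return sorted(category_cf.get(category_name, {}))


if __name__ == "__main__":
    add_category("k1", "http://example.com/", "gambling", 100)
    add_category("k1", "http://example.com/", "adult", 101)
    print(categories_of("k1"), urls_in("adult"))
```

Both directions of the old Url_Category join table are answered by a single row lookup, at the cost of writing each relation twice.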
3. ???
Going into production
before going into production
DAL (data access layer) abstracts away the split databases
Implement new features in Cassandra only
Get a feel of Cassandra before taking it into full use
A tale of two databases
Run both databases in parallel
Writes:
New data, and updates, into both databases
Blind writes make it easy to do partial updates
Reads:
Reads from both databases, cross-validate responses
Easy to move responsibilities from one database to another
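A minimal sketch of the dual-write, cross-validated-read idea, with two dictionaries standing in for the databases. The class name `DualStoreDAL`, its methods, and the `mismatches` list are hypothetical; the slides describe the behaviour but not the DAL's interface.

```python
import logging

log = logging.getLogger("dal")


class DualStoreDAL:
    """Hypothetical data access layer in front of two stores."""

    def __init__(self, primary, secondary):
        self.primary = primary      # store whose answer is returned
        self.secondary = secondary  # store used for cross-validation
        self.mismatches = []

    def write(self, url_key, fields):
        # Blind partial update: merge fields into both stores without reading.
        for store in (self.primary, self.secondary):
            store.setdefault(url_key, {}).update(fields)

    def read(self, url_key):
        answer = self.primary.get(url_key)
        other = self.secondary.get(url_key)
        if answer != other:
            self.mismatches.append(url_key)
            log.warning("stores disagree for %s", url_key)
        return answer


if __name__ == "__main__":
    postgres, cassandra = {}, {}
    dal = DualStoreDAL(primary=postgres, secondary=cassandra)
    dal.write("k1", {"url": "http://example.com/"})
    dal.write("k1", {"safety": "safe"})
    print(dal.read("k1"), dal.mismatches)
```

In this sketch, moving responsibility from one database to the other amounts to swapping which store is passed as `primary`.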
Migration boiled down to this
1. Dump URL keys from Postgres into batches
2. Custom migration script to chew a batch; for each URL in batch:
2.1. Read data from Postgres
2.2. Delete the Cassandra row for that URL
2.3. Write fresh data from Postgres into Cassandra
3. Log failing URLs
4. Cross-validate on reads for a while to ensure successful migration
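The steps above can be sketched as a small loop, again with dictionaries standing in for the two databases. The function names `batches`, `migrate_batch`, and `migrate` are hypothetical, and a real script would call the respective database clients where this sketch uses dictionary operations.

```python
def batches(keys, size):
    """Step 1: split the dumped URL keys into fixed-size batches."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def migrate_batch(batch, postgres, cassandra, failed):
    """Step 2: chew one batch, recording failures instead of aborting."""
    for url_key in batch:
        try:
            data = postgres[url_key]         # 2.1 read data from Postgres
            cassandra.pop(url_key, None)     # 2.2 delete the Cassandra row
            cassandra[url_key] = dict(data)  # 2.3 write fresh data
        except Exception as exc:
            failed.append((url_key, repr(exc)))  # step 3: log failing URLs


def migrate(keys, postgres, cassandra, batch_size=1000):
    failed = []
    for batch in batches(keys, batch_size):
        migrate_batch(batch, postgres, cassandra, failed)
    return failed


if __name__ == "__main__":
    postgres = {"k1": {"url": "http://a.example/"}, "k2": {"url": "http://b.example/"}}
    cassandra = {"k1": {"stale": True}}
    # "k3" was dumped but no longer exists in Postgres.
    print(migrate(["k1", "k2", "k3"], postgres, cassandra, batch_size=2))
```

Deleting the row before rewriting it means stale columns cannot survive the migration, and it makes a batch safe to run again.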
4. Profit
Bro-tips
Decide what you don't want to migrate
Dry run while testing, keep an eye on the performance
Start in small batches, and verify the results before proceeding
Parallelize the batches, if you need to speed it up
Keep an eye on performance, throttle if necessary
Not everything goes as planned, so make it easy to repeat the migration
Make sure the cluster is prepared for the migration, reserve time to tweak if not
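As one possible reading of the parallelize and throttle tips, the sketch below runs batches on a thread pool and optionally pauses between submissions. The function name `run_batches` and the `workers` and `pause_seconds` parameters are assumptions; the slides do not describe how the actual migration was parallelized or throttled.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_batches(batches, migrate_batch, workers=4, pause_seconds=0.0):
    """Run migrate_batch over each batch in parallel.

    pause_seconds is a crude throttle: the delay between submissions.
    Results come back in the same order as the input batches.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        for batch in batches:
            futures.append(pool.submit(migrate_batch, batch))
            if pause_seconds:
                time.sleep(pause_seconds)
        return [future.result() for future in futures]


if __name__ == "__main__":
    # len stands in for a real per-batch migration function.
    print(run_batches([["k1", "k2"], ["k3"]], len, workers=2))
```

Starting with a low `workers` value and a non-zero pause, then loosening both while watching cluster performance, matches the "start small, verify, then speed up" order of the tips.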
Kiitos (thank you)