
Data Design for Scaling
Karthik Kumar Viswanathan

Agenda

• What is Scalability?

• Small Talk on the Big O

• Hashes or Trees?

• The I/O Hierarchy: Cost vs Speed

• Latency

• ACID vs BASE

Agenda

• Let the Query be your Guide

• Concurrency 101: Collision Avoidance

• Questions?

What is Scalability?

• Scalability is always bottom-up

• The ability to keep increasing resources along with need without hitting roadblocks

• Adding new users as quickly as possible

• Adding new functionality without significant overhead

• Keeping it going after your team grows to 10+ developers!

Small Talk on the Big O

• The most complex parts of any algorithm are often O(n²), O(n), O(log n) or O(k)

• Most production systems work with those four. They almost always avoid O(n²) unless n is a very small number.

• Very seldom do TopCoder-style algorithms make it into high-level code. Most of them are abstracted away as libraries.

• Most production algorithms work with approximations.

Small Talk on the Big O

• O(n²) - Arbitrary relationships between things

• O(n) - A list of things

• O(log n) - One among an ordered list of things

• O(k) - One thing among an unordered list of things
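
A minimal Python sketch of these four classes, mapping each to a familiar operation (my own illustration, not from the deck):

    from bisect import bisect_left

    items = list(range(1_000_000))        # a list of things
    index = {v: True for v in items}      # a hash (dict): key -> present

    # O(n): scan an unindexed list of things
    found = 930_000 in items

    # O(log n): binary-search one among an ordered list of things
    pos = bisect_left(items, 930_000)

    # O(k): hash lookup; k is the load factor, near-constant in practice
    found = 930_000 in index

    # O(n^2): arbitrary relationships between things; fine only for tiny n
    small = items[:100]
    pairs = [(a, b) for a in small for b in small if a < b]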

Hashes or Trees?

• If your data set requires processing individual elements: Hashes. O(k), where k is the load factor. (You may be calling it a Dict/Map, but technically it could be a Hash, a Tree, or anything else.)

• If your data set requires processing elements that can be ordered relative to each other (less than or greater than): Trees. O(log n), with the log base set by the fanout.

• Now, we know this; this stuff is typically taught in school.

• But what we don’t usually realize is that ….

Hashes or Trees?

• Our NoSQL databases are mostly abstractions of Hashes.

• Our Relational SQL databases are mostly abstractions of Trees.

• Hash Indexes and NoSQL databases are great for the = query.

• Relational SQL databases with B-Trees/R-Trees/Skip lists are great for the </> (range) query.

Hashes or Trees?

• If you are doing an = query on a B-Tree/R-Tree/Skip list, it could be a bad idea. Hint: not if your load factor k is greater than your log n.

• If you are doing a </> range query with a Hash Index or NoSQL: does not compute!

• If you are into some higher level OOD frameworks, there IS a way to specify index type.

Hashes or Trees?

[Figure: CouchDB's flat B-Tree.] Notice how scanning for values between 1 and 12 can be done quickly: i >= 1 && i <= 12
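
A rough Python analogue of that range scan, using a sorted list plus binary search in place of a B-Tree (my own sketch, not CouchDB's API):

    from bisect import bisect_left, bisect_right

    keys = sorted([3, 1, 14, 7, 12, 5, 9, 20, 2])

    lo = bisect_left(keys, 1)     # O(log n) descent to the first match
    hi = bisect_right(keys, 12)   # O(log n) descent past the last match
    print(keys[lo:hi])            # in-order scan of the range
    # -> [1, 2, 3, 5, 7, 9, 12]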

Hashes or Trees?

• Here is what Google tells me a Hash is!

• Wait. Chopped Food? Mmmm…. Food!

Hashes or Trees?

• What does that have to do with this talk?

• When you chop a bigger piece of food into smaller bits, you hash, or hacher, it.

• When you chop a large list into smaller buckets, you hash it. (Buckets: A, C, E, G, I, K, M, O, Q)

Hashes or Trees?

• You utilise a Hash Function and move an element into a sublist, called a bucket.

• And what exactly did you do there? If there were n elements and k buckets, you reduced the time to find an element to n/k, the load factor of the Hash.

• When k = 6, let's say h = crc32(str) % 6; then str goes into the h-th bucket, as sketched below.
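
A minimal runnable sketch of that scheme (the bucket contents and keys here are hypothetical):

    from zlib import crc32

    k = 6                                 # number of buckets
    buckets = [[] for _ in range(k)]

    def put(s):
        h = crc32(s.encode()) % k         # h = crc32(str) % 6
        buckets[h].append(s)              # str goes into the h-th bucket

    def contains(s):
        h = crc32(s.encode()) % k
        return s in buckets[h]            # scan only ~n/k items

    for word in ["alpha", "beta", "gamma", "delta"]:
        put(word)
    print(contains("gamma"))              # True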

Hashes or Trees?

• But wait, what if the function is skewed? What if crc32 were to put most of YOUR strings in the 3rd bucket?

• That makes the Big O move away from O(k) to O(n). Ouch!

• So if you really care about performance, vet the hash function yourself. If you feed it enough 'production values', the time taken to search for all of them should remain fairly constant: all buckets hold roughly the same number of values!
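
One quick way to vet a hash function against production-like values (the key format here is a hypothetical stand-in for your real keys):

    from collections import Counter
    from zlib import crc32

    k = 6
    production_values = [f"user:{i}" for i in range(60_000)]   # hypothetical keys

    counts = Counter(crc32(v.encode()) % k for v in production_values)
    print(counts)   # healthy: all six buckets hold roughly 10,000 values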

Hashes or Trees?

• And why is THAT important?

• The moment you implement a Distributed Hash Table yourself, treating each bucket as a computer serving data over TCP/IP, it becomes just obvious!

• Assuming all of the buckets are connected to your client with the same network latency, you don't want a few get()s to do very well while many perform terribly!

The I/O Hierarchy: Cost vs Speed

• All memories are equal. Some memories are more equal than others.

• L1 Caches are extremely fast. Then L2, then L3…

The I/O Hierarchy: Cost vs Speed

• Then RAM… about 65x slower than the Caches.

The I/O Hierarchy: Cost vs Speed

• Then SSDs… about 10x slower than RAM.

The I/O Hierarchy: Cost vs Speed

• Then HDDs… about 8x slower than SSDs.

The I/O Hierarchy: Cost vs Speed

• Okay, but don't expect 1GB CPU Caches or 6TB RAM modules anytime soon.

• It’s all about the price!

• So, if you can pay for it and call yourself scalable: sure, that's one way to look at it.

• A more objective way is to measure transactions/$ shelled out. That is a great measure of scalability.

The I/O Hierarchy: Cost vs Speed

• Beyond a point, scaling vertically (adding more RAM) is just prohibitively expensive.

• A key factor realised by a certain company as early as 1998.

• So the best way is to go vertical for as long as your transactions/$ stays high, then scale out horizontally.

Hashes or Trees?

• A Distributed Hash Table is a great way to scale data horizontally, and data is the hardest part to scale. To perform any operation, deduce a server s from a hash function, then perform the underlying hash operation on that server.
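
A toy sketch of that idea, with in-process dicts standing in for networked servers (addresses are hypothetical; a production DHT would typically use consistent hashing rather than a plain modulo, so keys don't all move when a server joins or leaves):

    from zlib import crc32

    servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # hypothetical nodes
    store = {addr: {} for addr in servers}           # stand-ins for TCP/IP calls

    def server_for(key):
        return servers[crc32(key.encode()) % len(servers)]

    def put(key, value):
        store[server_for(key)][key] = value          # a remote put in real life

    def get(key):
        return store[server_for(key)].get(key)       # a remote get in real life

    put("profile:42", "picture-bytes")
    print(get("profile:42"))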

Latency

• Latency is the time it takes to do any operation

• A get() on a Hash

• A read from a file on Disk

• TCP/IP Connection Establishment Time

• If you don’t make the hardware, you don’t control it.

• If you don’t operate the hardware, you don’t control it.

Latency

• But you can measure it. And prepare for it mentally. 😀

• The more you avoid transactions going to disk, the better your life will be.

• So it's a balancing act. Figure out your working set in RAM. Keep some RAM buffered for those occasional disk I/Os from the tail end.

• And if you can get cheap SSDs, go for it! Use them for DB logs and other fast disk data. All other data, push to RAID-ed cheap HDDs.

Latency

• Latency is what leads to a large queue.

• Latency is what leads to heavy load averages.

• The only other way latency gets introduced into the system is a context switch. Context switches are latencies indeed! Avoid as many of them as possible (Greenlets, anyone?). Again, if you can cache everything in in-process memory, sweet!

• “What was never lost can be served from RAM” - Zen Master

ACID vs BASE

• Going BASE is all about handling Latency

ACID vs BASE

• If sending old profile pictures were taboo for Facebook, it would have closed shop in 2009.

• If serving old product descriptions were not part of Amazon, it wouldn't be very successful today.

• A profile picture a few seconds stale can reside completely in RAM for as long as many people want it!

• And sometimes you just cannot send any old data. That's completely fine, but it is your job to minimise how many of those cases there are.

ACID vs BASE

• ACID works well for financial data, like a billing invoice. It isn’t expected to scale, but it is very critical for your users, and that is a headache you must bear.

• For everything else, there is BASE. Eventually, your data will be consistent, and everyone will see the reflected changes. The point is: the longer you can tolerate old cached data, the better your performance will be (see the sketch after this slide).

• And why so? Because you are _____________________
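
A minimal sketch of that BASE trade-off: a TTL cache that serves possibly stale values from RAM and only reloads after expiry (function names and the loader are my own, not from the talk):

    import time

    cache = {}   # key -> (value, expiry)

    def cached_get(key, loader, ttl):
        hit = cache.get(key)
        if hit and hit[1] > time.monotonic():
            return hit[0]                    # possibly stale, but served from RAM
        value = loader(key)                  # fall through to the source of truth
        cache[key] = (value, time.monotonic() + ttl)
        return value

    # hypothetical loader standing in for a database read
    pic = cached_get("pic:42", lambda k: "fresh-bytes", ttl=30)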

Let the Query be your Guide

• I need a profile picture on my App. Cache for ______

• I need to get a shared playlist from my friend Gulsher. Cache for ______

• I need to add my song to that shared playlist. Cache for ______

• I need to show the most selling products on my retail portal. Cache for _______

• I am chatting with this person. Cache for ________

• I am posting my comment on 9GAG. Cache for ________

Concurrency 101: Collision Avoidance

• The Problem Statement: How would Google notify the 1,000,000th Chrome downloader of a $10k prize, assuming 20,000 users were downloading it every second?

Concurrency 101: Collision Avoidance

• Avoiding Collisions can happen in Space and Time

• The idea is to have a manageable number of collisions, and have everything else fanned out.

• Reconcile periodically. MVCC fans out based on a counter.
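
One possible sketch of fanning out in space: each worker reserves a block of ticket numbers under one short lock, then hands out numbers contention-free, reconciling with the shared counter only when its block runs out. (Block size and structure are my own assumptions, not Google's actual scheme.)

    import threading

    BLOCK = 1_000            # how many ticket numbers a worker reserves at once
    _lock = threading.Lock()
    _next_start = 1          # the one shared, contended counter

    def reserve_block():
        global _next_start
        with _lock:          # collisions confined to this short critical section
            start = _next_start
            _next_start += BLOCK
        return start

    class Worker:
        def __init__(self):
            self.n = self.end = 0
        def next_ticket(self):
            if self.n >= self.end:           # block exhausted: reconcile
                self.n = reserve_block()
                self.end = self.n + BLOCK
            t, self.n = self.n, self.n + 1
            return t                         # ticket 1,000,000 wins the prize

    w = Worker()
    print(w.next_ticket())   # 1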

Questions?

Thank You!

• If you have any further questions

• If you want me to give a talk on something

• If you have suggestions for me

@kvisw