Friday, January 2, 2009
Intro to Caching, Caching Algorithms and Caching Frameworks, Part 1
A lot of us have heard the word cache, and when you ask people about caching they can give you a perfect answer, but they don't know how a cache is built, or on which criteria they should favor one caching framework over another, and so on. In this article we are going to talk about caching, caching algorithms, and caching frameworks, and which is better than the other.
"Caching is a temp location where I store data in (data that I need frequently) as the original data is expensive to fetch, so I can retrieve it faster."
That is what programmer 1 answered in the interview (one month ago he submitted his resume to a company who wanted a Java programmer with strong experience in caching and caching frameworks and extensive data manipulation).
Programmer 1 did make his own cache implementation using a hashtable, and that is all he knows about caching; his hashtable contains about 150 entries, which he considers extensive data (caching = hashtable, load the lookups into the hashtable and everything will be fine, nothing else). So let's see how the interview goes.
Interviewer: Nice, and based on what criteria do you choose your caching solution?
Programmer 1: huh, (thinking for 5 minutes), mmm, based on, on, on the data (coughing)
Interviewer: excuse me! Could you repeat what you just said again?
Programmer 1: data?!
Interviewer: oh I see, ok list some caching algorithms and tell me which is used for what
Programmer 1: (staring at the interviewer and making strange expressions with his face, expressions that no one knew a human face could make :D )
Interviewer: ok, let me ask it in another way: how will a cache behave if it reaches its capacity?
Programmer 1: capacity? Mmm (thinking: a hashtable is not limited by capacity, I can add what I want and it will extend its capacity) (that was in programmer 1's mind, he didn't say it)
The interviewer thanked programmer 1 (the interview only lasted 10 minutes). After that a woman came and said: oh, thanks for your time, we will call you back, have a nice day. This was the worst interview programmer 1 ever had (he didn't read that there was a part in the job description which stated that the candidate should have a strong caching background; in fact he only saw the line talking about the excellent package ;) )
Talk the talk and then walk the walk
After programmer 1 left, he wanted to know what the interviewer was talking about and what the answers to his questions were, so he started to surf the net. Programmer 1 didn't know anything about caching except: when I need a cache I will use a hashtable. After using his favorite search engine he was able to find a nice caching article and started to read.
Why do we need cache?
A long time ago, before the caching age, a user would request an object and this object was fetched from a storage place, and as the object grew bigger and bigger, the user had to spend more time to fulfill his request. It really made the storage place suffer a lot because it had to be working the whole time. This made both the user and the db angry, and there was one of 2 possibilities:
1- The user will get upset and complain and even won't use the application again (that was always the case)
2- The storage place will pack up its bags and leave your application, and that makes big problems (no place to store data) (happened in rare situations)
Caching is a godsend:
After a few years, researchers at IBM (in the 60s) introduced a new concept and named it Cache.
What is Cache?
Caching is a temp location where I store data in (data that I need frequently) as the original data is expensive to fetch, so I can retrieve it faster.
A cache is made of a pool of entries, and these entries are a copy of real data which is in storage (a database for example), and each entry is tagged with a tag (key identifier) value for retrieval. Great, so programmer 1 already knows this, but what he doesn't know is caching terminology, which is as follows:
When the client invokes a request (let's say he wants to view product information) and our application gets the request, it will need to access the product data in our storage (database), so it first checks the cache.
If an entry can be found with a tag matching that of the desired data (say product Id), the entry is used instead. This is known as a cache hit (the cache hit is the primary measurement for caching effectiveness; we will discuss that later on). The percentage of accesses that result in cache hits is known as the hit rate or hit ratio of the cache.
On the contrary, when the tag isn't found in the cache (no match was found), this is known as a cache miss; a hit to the back storage is made and the data is fetched back and placed in the cache, so future lookups will find it and make a cache hit.
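To make the hit/miss idea concrete, here is a tiny sketch (all the names here, like HitRatioDemo, are made up for illustration, and the "database fetch" is just a stand-in string, not a real storage call):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a cache lookup that counts hits and misses
// so we can compute the hit ratio afterwards.
public class HitRatioDemo {
    static Map<String, String> cache = new HashMap<>();
    static int hits = 0, misses = 0;

    static String get(String productId) {
        if (cache.containsKey(productId)) {        // tag matches -> cache hit
            hits++;
            return cache.get(productId);
        }
        misses++;                                  // cache miss -> go to back storage
        String data = "data-for-" + productId;     // stand-in for a database fetch
        cache.put(productId, data);                // place it so future lookups hit
        return data;
    }

    public static void main(String[] args) {
        get("p1"); get("p1"); get("p2"); get("p1");
        double hitRatio = (double) hits / (hits + misses);
        System.out.println("hits=" + hits + " misses=" + misses
                + " hitRatio=" + hitRatio);   // hits=2 misses=2 hitRatio=0.5
    }
}
```

Out of 4 accesses, "p1" and "p2" each miss once and "p1" hits twice, so the hit ratio is 2/4 = 0.5.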
If we encounter a cache miss, there can be one of two scenarios:
First scenario: there is free space in the cache (the cache didn't reach its limit), so in this case the object that caused the cache miss will be retrieved from our storage and inserted into the cache.
Second scenario: there is no free space in the cache (the cache reached its capacity), so the object that caused the cache miss will be fetched from storage, and then we will have to decide which object in the cache we need to remove in order to place our newly fetched object. This is done by the replacement policy (caching algorithms) that decides which entry will be removed to make more room, which will be discussed below.
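The two scenarios can be sketched with a small capacity-bounded map. This is only an illustration: it uses java.util.LinkedHashMap's removeEldestEntry hook as a stand-in replacement policy that evicts the oldest inserted entry when the cache is full.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative bounded cache: insert freely while there is room (scenario 1),
// evict the eldest entry when capacity is exceeded (scenario 2).
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public BoundedCache(int capacity) {
        super(16, 0.75f, false);   // false = keep entries in insertion order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // scenario 2: full, so evict one entry
    }

    public static void main(String[] args) {
        BoundedCache<String, String> cache = new BoundedCache<>(2);
        cache.put("p1", "a");      // scenario 1: free space, just insert
        cache.put("p2", "b");
        cache.put("p3", "c");      // scenario 2: "p1" (the eldest) is evicted
        System.out.println(cache.keySet());   // prints [p2, p3]
    }
}
```

Evicting the oldest insertion is only one possible policy; the algorithms below make smarter choices.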
When a cache miss occurs, data will be fetched from the back storage, loaded and placed in the cache. But how much space does the data we just fetched take in the cache memory? This is known as the Storage cost.
And when we need to load the data, we need to know how much it takes to load it. This is known as the Retrieval cost.
When the object that resides in the cache is updated in the back storage, it needs to be updated in the cache too; keeping the cache up to date is known as Invalidation. The entry will be invalidated from the cache and fetched again from the back storage to get the updated version.
When a cache miss happens (and we don't have enough room), the cache ejects some other entry in order to make room for the previously uncached data. The heuristic used to select the entry to eject is known as the replacement policy.
Optimal Replacement Policy:
The theoretically optimal page replacement algorithm (also known as OPT or Belady's optimal page replacement policy) tries to achieve the following: when a new object needs to be placed in the cache, the cache algorithm should replace the entry which will not be used for the longest period of time.
For example, a cache entry that is not going to be used for the next 10 seconds will be replaced before an entry that is going to be used within the next 2 seconds.
Thinking about the optimal replacement policy, we can say it is impossible to achieve, but some algorithms do near-optimal replacement based on heuristics. So everything is based on heuristics, so what makes one algorithm better than another? And what do they use for their heuristics?
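We can't know the future in a real system, but OPT can be simulated offline when the whole access sequence is known in advance: on a miss with a full cache, evict the entry whose next use is farthest away (or that is never used again). A minimal sketch (the sequence and names here are made up for illustration):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative offline simulation of Belady's OPT replacement policy.
public class BeladyDemo {
    static int simulateMisses(int[] accesses, int capacity) {
        Set<Integer> cache = new HashSet<>();
        int misses = 0;
        for (int i = 0; i < accesses.length; i++) {
            int page = accesses[i];
            if (cache.contains(page)) continue;        // cache hit
            misses++;
            if (cache.size() == capacity) {
                // pick the cached page whose next use is farthest in the future
                int victim = -1, farthest = -1;
                for (int p : cache) {
                    int next = Integer.MAX_VALUE;      // MAX_VALUE = never used again
                    for (int j = i + 1; j < accesses.length; j++) {
                        if (accesses[j] == p) { next = j; break; }
                    }
                    if (next > farthest) { farthest = next; victim = p; }
                }
                cache.remove(victim);
            }
            cache.add(page);
        }
        return misses;
    }

    public static void main(String[] args) {
        int[] seq = {1, 2, 3, 1, 2, 4, 1};
        System.out.println(simulateMisses(seq, 2));    // prints 5
    }
}
```

No online algorithm can beat this miss count, which is why OPT is used as the yardstick when comparing the practical algorithms below.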
Nightmare at Java Street:
While reading the article, programmer 1 fell asleep and had a nightmare (the scariest nightmare one can ever have).
Programmer 1: nihahha I will invalidate you. (Talking in a mad way)
Cached Object: no no please let me live, they still need me, I have children.
Programmer 1: all cached entries say that before they are invalidated, and since when do you have children? Never mind, now vanish forever.
Buhaaahaha, laughed programmer 1 in a scary way. Silence took over the place for a few minutes, and then a police siren broke the silence; the police caught programmer 1, and he was accused of invalidating an entry that was still needed by a cache client, and he was sent to jail.
Programmer 1 woke up really scared; he started to look around and realized that it was just a dream, then he continued reading about caching and tried to get rid of his fears.
No one can talk about caching algorithms better than the caching algorithms themselves
Least Frequently Used (LFU):
I am Least Frequently Used; I count how often an entry is needed by incrementing a counter associated with each entry.
I remove the entry with the lowest use counter first. I am not that fast, and I am not that good at adaptive actions (which means keeping the entries which are really needed and discarding the ones that aren't needed for the longest period, based on the access pattern, or in other words the request pattern).
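A minimal LFU sketch (illustrative only, not a production cache; a real implementation would use a smarter structure than a linear scan to find the victim):

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Illustrative LFU cache: each entry carries a use counter, and when the
// cache is full, the entry with the smallest counter is evicted first.
public class LfuCache<K, V> {
    private final int capacity;
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Integer> counts = new HashMap<>();

    public LfuCache(int capacity) { this.capacity = capacity; }

    public V get(K key) {
        if (!values.containsKey(key)) return null;   // cache miss
        counts.merge(key, 1, Integer::sum);          // count every use
        return values.get(key);
    }

    public void put(K key, V value) {
        if (!values.containsKey(key) && values.size() == capacity) {
            // linear scan for the least frequently used entry
            K victim = Collections.min(counts.keySet(),
                    Comparator.comparingInt(counts::get));
            values.remove(victim);
            counts.remove(victim);
        }
        values.put(key, value);
        counts.merge(key, 1, Integer::sum);
    }

    public boolean contains(K key) { return values.containsKey(key); }

    public static void main(String[] args) {
        LfuCache<String, String> cache = new LfuCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");                  // "a" now used twice, "b" once
        cache.put("c", "3");             // "b" has the lowest count -> evicted
        System.out.println(cache.contains("b"));   // prints false
        System.out.println(cache.contains("a"));   // prints true
    }
}
```

Note the weakness mentioned above: "a" keeps its high counter forever, so an entry that was popular once but is no longer needed can linger in the cache.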
Least Recently Used (LRU):
I am the Least Recently Used cache algorithm; I remove the least recently used items first, the ones that weren't used for the longest time.
I require keeping track of what was used when, which is expensive if one wants to make sure I always discard the least recently used item. Web browsers use me for caching. New items are placed at the top of the cache. When the cache exceeds its size limit, I discard items from the bottom. The trick is that whenever an item is accessed, I place it at the top.
So items which are frequently accessed tend to stay in the cache. There are two ways to implement me, either an array or a linked list (which will