
Journal of International Academic Research for Multidisciplinary

ISSN 2320 -5083

A Scholarly, Peer Reviewed, Monthly, Open Access, Online Research Journal

Impact Factor – 1.393

VOLUME 1 ISSUE 11 DECEMBER 2013

A GLOBAL SOCIETY FOR MULTIDISCIPLINARY RESEARCH

www.jiarm.com

A GREEN PUBLISHING HOUSE

Editorial Board

Dr. Kari Jabbour, Ph.D Curriculum Developer, American College of Technology, Missouri, USA.

Er.Chandramohan, M.S System Specialist - OGP ABB Australia Pvt. Ltd., Australia.

Dr. S.K. Singh Chief Scientist Advanced Materials Technology Department Institute of Minerals & Materials Technology Bhubaneswar, India

Dr. Jake M. Laguador Director, Research and Statistics Center, Lyceum of the Philippines University, Philippines.

Prof. Dr. Sharath Babu, LLM Ph.D Dean. Faculty of Law, Karnatak University Dharwad, Karnataka, India

Dr.S.M Kadri, MBBS, MPH/ICHD, FFP Fellow, Public Health Foundation of India Epidemiologist Division of Epidemiology and Public Health, Kashmir, India

Dr.Bhumika Talwar, BDS Research Officer State Institute of Health & Family Welfare Jaipur, India

Dr. Tej Pratap Mall Ph.D Head, Postgraduate Department of Botany, Kisan P.G. College, Bahraich, India.

Dr. Arup Kanti Konar, Ph.D Associate Professor of Economics Achhruram, Memorial College, SKB University, Jhalda,Purulia, West Bengal. India

Dr. S.Raja Ph.D Research Associate, Madras Research Center of CMFR , Indian Council of Agricultural Research, Chennai, India

Dr. Vijay Pithadia, Ph.D, Director - Sri Aurobindo Institute of Management Rajkot, India.

Er. R. Bhuvanewari Devi, M.Tech, MCIHT Highway Engineer, Infrastructure, Ramboll, Abu Dhabi, UAE

Sanda Maican, Ph.D. Senior Researcher, Department of Ecology, Taxonomy and Nature Conservation, Institute of Biology of the Romanian Academy, Bucharest, Romania

Dr. Reynalda B. Garcia Professor, Graduate School & College of Education, Arts and Sciences, Lyceum of the Philippines University, Philippines

Dr. Damarla Bala Venkata Ramana Senior Scientist, Central Research Institute for Dryland Agriculture (CRIDA), Hyderabad, A.P, India

Prof. Dr. S.V. Kshirsagar, M.B.B.S, M.S Head, Department of Anatomy, Bidar Institute of Medical Sciences, Karnataka, India

Dr. Asifa Nazir, M.B.B.S, MD Assistant Professor, Dept. of Microbiology, Government Medical College, Srinagar, India

Dr. Amita Puri, Ph.D Officiating Principal, Army Institute of Education, New Delhi, India

Dr. Shobana Nelasco, Ph.D Associate Professor, Fellow of Indian Council of Social Science Research (On Deputation), Department of Economics, Bharathidasan University, Trichirappalli, India

M. Suresh Kumar, Ph.D Assistant Manager, Godrej Security Solution, India

Dr. T. Chandrasekarayya, Ph.D Assistant Professor, Dept. of Population Studies & Social Work, S.V. University, Tirupati, India

JOURNAL OF INTERNATIONAL ACADEMIC RESEARCH FOR MULTIDISCIPLINARY Impact Factor 1.393, ISSN: 2320-5083, Volume 1, Issue 11, December 2013


ANALYSIS OF CACHE OBLIVIOUS ALGORITHMS

KORDE P.S* *Dept. of Computer Science, Shri Shivaji College, Parbhani (M.S.) India

ABSTRACT

We propose cache-oblivious algorithms for better memory utilization. The design of a cache-oblivious algorithm requires a deep understanding of program structure and operation, and familiarity with the memory architecture. To make the performance benefits of cache-oblivious structures available to the average programmer, it is necessary to design techniques that reduce the cache miss rate and achieve significant performance improvements on real programs. We have developed such techniques and report their detailed performance in comparison with cache-aware algorithms, varying both the parameter values in the cache-oblivious algorithms and the hardware platforms.

KEYWORDS: Cache Oblivious; Cache Miss; Cache Hit; Multiplication; Transposition; Sorting; B-tree

1. INTRODUCTION

Cache memory is a high-speed, relatively small memory that plays a very important role in a computer system. It is used to bridge the speed gap between the CPU and main memory, and it has a smaller access time than main memory. At present there are three types of cache:

1. Fully Associative

2. Direct Mapped

3. Set Associative

In a fully associative cache there is no restriction on the mapping from memory to cache, but the tag search is expensive, so it is feasible only for small caches. In a direct mapped cache a memory block maps to exactly one cache line, so there is no need for an expensive search. In a set associative cache a predefined set of lines is used for mapping from memory to cache. Of these types, the direct mapped cache is the fastest. With advances in computer hardware it is possible to have a high-speed CPU, but CPU performance remains restricted by the performance of cache memory access. Various approaches are available for effective utilization of cache memory: Least Recently Used (LRU), Most Recently Used (MRU), Pseudo-LRU (PLRU), Segmented LRU, Set Associative Direct Map, Least Frequently Used (LFU) and Multi Queue (MQ).
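As an illustration of one such policy, an LRU cache can be sketched with a doubly linked list plus a hash map (a generic software sketch for exposition only; hardware caches implement LRU approximations quite differently):

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// Minimal LRU cache: the most recently used keys sit at the front of the list.
class LRUCache {
    std::size_t capacity;
    std::list<std::pair<int, int>> items;  // (key, value), MRU first
    std::unordered_map<int, std::list<std::pair<int, int>>::iterator> pos;
public:
    explicit LRUCache(std::size_t cap) : capacity(cap) {}

    // Returns true and fills value on a hit; moves the entry to the front.
    bool get(int key, int& value) {
        auto it = pos.find(key);
        if (it == pos.end()) return false;               // cache miss
        items.splice(items.begin(), items, it->second);  // mark as MRU
        value = it->second->second;
        return true;
    }

    // Inserts or updates a key; evicts the least recently used entry if full.
    void put(int key, int value) {
        auto it = pos.find(key);
        if (it != pos.end()) {
            it->second->second = value;
            items.splice(items.begin(), items, it->second);
            return;
        }
        if (items.size() == capacity) {                  // evict the LRU entry
            pos.erase(items.back().first);
            items.pop_back();
        }
        items.emplace_front(key, value);
        pos[key] = items.begin();
    }
};
```

The list keeps recency order and the hash map gives O(1) lookup, so both operations run in constant time.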


2. Utilization of Cache Memory Algorithms

Program transformations are among the most valuable compiler techniques for improving data locality. However, restructuring compilers have a hard time coping with data dependences. A typical solution is to focus on program parts where the dependences are simple enough to enable cache obliviousness; for more complex problems, only the question of whether a transformation is legal is addressed.

There are two types of processing algorithms:

a) Cache-oblivious algorithms.

b) Cache-conscious algorithms.

Cache Oblivious Algorithms

A cache-oblivious algorithm is designed to perform well, without modification, on multiple machines with different cache sizes, or on a memory hierarchy whose levels of cache have different sizes. The idea of cache-oblivious algorithms was conceived by Charles E. Leiserson [12] as early as 1996 and first published by Harald Prokop [12] in his master's thesis at the Massachusetts Institute of Technology in 1999. Optimal cache-oblivious algorithms are known for the Cooley-Tukey FFT, matrix multiplication, sorting, matrix transposition and several other problems. The goal of cache-oblivious algorithms is to reduce the amount of tuning that is required. They are analyzed using an idealized model of the cache known as the cache-oblivious (ideal-cache) model.

Cache Conscious Algorithms

Microprocessor performance has improved by about 60% each year for almost two decades. Yet over the same period, memory access time has decreased by less than 10% per year [3]. These trends appear likely to persist barring an unforeseen technology breakthrough, because the primary driving force behind memory technology is storage capacity rather than speed: current technology delivers high-capacity memory, not fast access times.

The unfortunate but inevitable consequence is an ever-increasing processor-memory performance gap. Memory caches are the ubiquitous hardware solution to this problem. They are small, fast memories that store recently accessed data items and attempt to intercept and satisfy data requests without accessing main memory. In the beginning a single level of cache sufficed, but the increasing performance gap (now two orders of magnitude) requires two levels of cache today and will require three in the near future.


3. Cache Oblivious Algorithms

We first discuss the structure of basic cache oblivious algorithms and its optimization

techniques.

3.1 Cache oblivious model

The basic notion of the cache-oblivious model is simple. A modern computer has a multi-level storage hierarchy in which the data being manipulated by the CPU moves back and forth between the various levels of storage:

- in registers on the CPU;

- in a specialized cache memory (often called level-1 or level-2 cache);

- in the computer's main memory;

- on disk.

The CPU has caches of different sizes. The size of the cache matters, and specifically the block size of the cache is important: the block is the unit of transfer and replacement.

We demonstrate standard cache-oblivious algorithms: cache-oblivious matrix transposition, cache-oblivious matrix multiplication and cache-oblivious dynamic programming. These have good cache performance in the cache-aware sense as well. We first tested matrix transposition, matrix multiplication and dynamic programming, measuring their cache miss ratios.

3.1.1 Matrix Transposition

Matrix transposition is a fundamental operation in linear algebra and in fast Fourier transforms, and has applications in numerical analysis, image processing and graphics. The simplest hack for transposing an N×N square matrix in C++ could be:

for (i = 0; i < N; i++)
    for (j = i+1; j < N; j++)
        swap(A[i][j], A[j][i]);

In C++ matrices are stored in "row-major" order, i.e. the rightmost dimension varies the fastest. In the above case, the number of cache misses the code incurs is Θ(N²). The optimal cache-oblivious matrix transposition incurs Θ(1 + N²/B) cache misses [12]. Before we go into the divide-and-conquer algorithm for matrix transposition that is cache-oblivious, let us look at some experimental results.

We measure the general, tuning-free cache-oblivious performance: we consider 3×3, 4×4 and 5×5 matrices and find the cache miss and cache hit ratios. Fig 1 shows graphically how the behaviour varies with the matrix size.


Fig 1: Cache oblivious of matrix transposition
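The divide-and-conquer transposition can be sketched as follows (a minimal illustration of the recursive scheme, not the exact code used in our experiments; the constant LEAF merely ends the recursion and is not tuned to any cache):

```cpp
#include <vector>

// Transpose the sub-block rows [r0, r1) x cols [c0, c1) of an n x n matrix A
// into B, always recursing on the longer side. Every base case is nearly
// square, so at some recursion depth the working set fits in each cache
// level -- no cache parameters appear anywhere.
void transposeRec(const std::vector<double>& A, std::vector<double>& B,
                  int n, int r0, int r1, int c0, int c1) {
    const int LEAF = 16;                      // small base case, not cache-tuned
    if (r1 - r0 <= LEAF && c1 - c0 <= LEAF) {
        for (int i = r0; i < r1; ++i)
            for (int j = c0; j < c1; ++j)
                B[j * n + i] = A[i * n + j];  // B = A^T on this block
    } else if (r1 - r0 >= c1 - c0) {          // split the rows
        int rm = (r0 + r1) / 2;
        transposeRec(A, B, n, r0, rm, c0, c1);
        transposeRec(A, B, n, rm, r1, c0, c1);
    } else {                                  // split the columns
        int cm = (c0 + c1) / 2;
        transposeRec(A, B, n, r0, r1, c0, cm);
        transposeRec(A, B, n, r0, r1, cm, c1);
    }
}

std::vector<double> transpose(const std::vector<double>& A, int n) {
    std::vector<double> B(n * n);
    transposeRec(A, B, n, 0, n, 0, n);
    return B;
}
```

Because the recursion halves the larger dimension, the blocks eventually fit into any cache of any size, which is the source of the Θ(1 + N²/B) bound cited above.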

3.1.2 Matrix Multiplication

Matrix multiplication is one of the most studied computational problems: given two matrices of size m×n and n×p, we want to compute their m×p product. In this section we take m = n = p = N, although the results extend easily to the case when they are not equal. Thus we are given two N×N matrices x = (x_{i,j}) and y = (y_{i,j}), and we wish to compute their product z; there are N² outputs, where the (i, j) output is

    z_{i,j} = Σ_{k=1}^{N} x_{i,k} · y_{k,j}

The algorithm breaks the three matrices x, y, z into four submatrices of size N/2 × N/2, rewriting the equation z = x·y as

    [ r  s ]   [ a  c ]   [ e  g ]
    [ t  u ] = [ b  d ] × [ f  h ]

We again measure the tuning-free cache-oblivious performance, considering 3×3, 4×4 and 5×5 matrix multiplications; fig 2 shows how the computation proceeds.

Fig 2: Cache oblivious of matrix multiplication
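The recursive block scheme above can be sketched in C++ (a hedged illustration that assumes N is a power of two for simplicity; the general algorithm handles arbitrary sizes):

```cpp
#include <vector>

// z += x * y on N x N row-major matrices, recursing on 2 x 2 quadrants.
// Each quadrant product is again computed recursively, so at some depth the
// working set fits in every cache level -- no cache parameters appear.
void multiplyRec(const std::vector<double>& x, const std::vector<double>& y,
                 std::vector<double>& z, int n,
                 int xi, int xj, int yi, int yj, int zi, int zj, int size) {
    if (size == 1) {
        z[zi * n + zj] += x[xi * n + xj] * y[yi * n + yj];
        return;
    }
    int h = size / 2;
    for (int a = 0; a < 2; ++a)          // row block of z
        for (int b = 0; b < 2; ++b)      // column block of z
            for (int k = 0; k < 2; ++k)  // inner block index
                multiplyRec(x, y, z, n,
                            xi + a * h, xj + k * h,
                            yi + k * h, yj + b * h,
                            zi + a * h, zj + b * h, h);
}

std::vector<double> multiply(const std::vector<double>& x,
                             const std::vector<double>& y, int n) {
    std::vector<double> z(n * n, 0.0);   // n must be a power of two here
    multiplyRec(x, y, z, n, 0, 0, 0, 0, 0, 0, n);
    return z;
}
```

Each of the eight recursive calls computes one of the quadrant products in the 2×2 block equation above, accumulating into the corresponding quadrant of z.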


3.1.3 Dynamic Programming

Dynamic programming is a method for solving complex problems by breaking them into simpler subproblems. It is applicable to problems exhibiting overlapping subproblems, which are only slightly smaller, and optimal substructure. It takes less time than naive methods.

A cache-oblivious implementation of the classic dynamic programming algorithms, using value-transferring and value-storing methods, continues to run in Θ(m+n) space and performs Θ(mn/(BM)) block transfers. Experiments show these algorithms to be faster than widely used alternatives.

We again consider 3×3, 4×4 and 5×5 matrices and find the cache miss and cache hit ratios; graphically, the miss ratio rises with the length, as shown in fig 3.

Fig 3: Cache oblivious of Dynamic Algorithms

For 2×2 and 3×3 matrices the cache miss ratio grows roughly from n to n+1, but as the size and length of the array increase, the cache miss ratio also rises, as shown in fig 3.
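As one concrete example of the kind of space-efficient dynamic program discussed here, the length of a longest common subsequence can be computed while keeping only two rows of the classic table (a standard linear-space illustration, not our cache-oblivious variant):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Length of the longest common subsequence of a and b in O(|b|) extra space:
// row i of the classic (m+1) x (n+1) table depends only on row i-1.
int lcsLength(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1, 0), cur(b.size() + 1, 0);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            if (a[i - 1] == b[j - 1])
                cur[j] = prev[j - 1] + 1;              // extend a common character
            else
                cur[j] = std::max(prev[j], cur[j - 1]);
        }
        std::swap(prev, cur);                          // row i becomes row i-1
    }
    return prev[b.size()];
}
```

The full quadratic table is never materialized, which is what makes the linear-space bound possible.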

4. Analysis of Cache Oblivious Algorithms

We designed the following cache-oblivious algorithms with a new approach:

1) Cache Oblivious Matrix Multiplication

2) Cache Oblivious Matrix Transposition

3) Cache Oblivious Sorting

4) Cache Oblivious B-Tree

4.1 Cache Oblivious Matrix Multiplication

We have to perform two steps:

1) Initialize single-dimension arrays a, b, c, and transfer the elements of matrix a in row-major order and the elements of matrix b in column-major order into the single-dimension arrays.

2) Compute with array a and array b and store the result in array c, as in Algorithm 1.


Algorithm 1: Multiplication of a 3-by-3 matrix

1. Set x = 0
2. Repeat step 3 for i = 1 to 3
   a) If (x > 9) then Set x = 0 [end of if]
3. Repeat for j = i to 9
   a) Repeat for k = i to 9
      i) Set c[x] = c[x] + a[j]*b[k]
      ii) Set x = x + 1
      iii) Set k = k + 1
      [end of k loop]
   b) Set j = j + 2
   [end of j loop]
   [end of i loop]

The multiplication scheme presented can easily be extended to 5-by-5, 7-by-7 matrices and so on, and can be used for any matrix multiplication as long as the same approach is applied to the matrix dimensions. For a large matrix, the matrix can be divided into submatrices.

We tested the cache-oblivious performance, considering 3×3, 4×4 and 5×5 matrices, and found the cache miss and cache hit ratios as shown in fig 4.

Fig 4: Cache Oblivious matrix multiplication

Implementation

We implement the matrix multiplication scheme developed in the previous section. The algorithm uses a counter variable to convert the two-dimensional matrices into a single dimension; an additional single-dimension array is required to store the elements. In a programming language such as C or C++, with the matrices a and b and the counter, the scheme can be implemented directly.
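A C++ sketch of the two-step scheme might look like this (our reading of the method; the array names a, b, c follow the text, and flatMultiply is a hypothetical helper name):

```cpp
#include <vector>

// Multiply two n x n matrices by first flattening A in row-major order and
// B in column-major order, so each output element is a dot product of two
// contiguous runs of length n -- good spatial locality in both scans.
std::vector<std::vector<int>>
flatMultiply(const std::vector<std::vector<int>>& A,
             const std::vector<std::vector<int>>& B) {
    int n = (int)A.size();
    std::vector<int> a(n * n), b(n * n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            a[i * n + j] = A[i][j];   // row-major copy of A
            b[j * n + i] = B[i][j];   // column-major copy of B
        }
    std::vector<std::vector<int>> C(n, std::vector<int>(n, 0));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            int sum = 0;
            for (int k = 0; k < n; ++k)   // contiguous scans of a and b
                sum += a[i * n + k] * b[j * n + k];
            C[i][j] = sum;
        }
    return C;
}
```

Because row i of a and column j of b both occupy consecutive memory, the inner loop touches each cache block of the operands only once per dot product.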


4.2 Cache Oblivious Matrix Transposition

We reformulate the matrix transposition algorithm into the following form. In Algorithm 2 we use a one-dimensional array and rely on the execution order of the main loop: in the first pass the values of matrix a are copied out in column-major order, and in the second pass the values are transferred back into the matrix. The processing of the matrix elements is shown in fig 5.

Fig 5: Operation of elements of matrix

We simply rearrange the matrix elements using a single-dimension array. This gives very good temporal locality in the access pattern of the matrix elements; thus the key requirements for good cache performance are satisfied.

4.3 Cache Oblivious Sorting

In this technique we use an ordering of the minimum elements of the array: a sequential sorting method that totally avoids the transposition (swapping) process.

Algorithm 2: Matrix Transposition

1. Set counter = 0
2. Repeat steps for i = 1 to n
   a) Repeat for j = 1 to n
      i) Set b[counter] = a[j][i]
      ii) Set counter = counter + 1
      [end of j loop]
   [end of i loop]
3. Set counter = 0
4. Repeat steps for i = 1 to n
   a) Repeat for j = 1 to n
      i) Set a[i][j] = b[counter]
      ii) Set counter = counter + 1
      [end of j loop]
   [end of i loop]
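In C++, Algorithm 2 might be rendered as follows (our interpretation, with the second pass writing the buffered values back into the matrix):

```cpp
#include <vector>

// Transpose an n x n matrix in place via a 1-D buffer: copy the matrix out
// in column-major order, then write the buffer back in row-major order.
void transposeViaBuffer(std::vector<std::vector<int>>& a) {
    int n = (int)a.size();
    std::vector<int> b(n * n);
    int counter = 0;
    for (int i = 0; i < n; ++i)        // first pass: column-major copy-out
        for (int j = 0; j < n; ++j)
            b[counter++] = a[j][i];
    counter = 0;
    for (int i = 0; i < n; ++i)        // second pass: sequential write-back
        for (int j = 0; j < n; ++j)
            a[i][j] = b[counter++];
}
```

Both passes scan the buffer b strictly sequentially, which is where the locality benefit described in the text comes from.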


Fig 6: Cache Oblivious Processing Sorting

First we rearrange the algorithm in the following form, as in fig 6 and fig 7.

Fig 7 Array 2

In this process we use two arrays of the same length. At every phase we find the minimum value and store it in the second array. This method shows better locality of element access and can benefit from the presence of cache memory.

In Bubble Sort we make passes from left to right over the permutation to be sorted, and always move the currently largest element right by exchanging it with its right adjacent element if that one is smaller. We make at most n-1 passes, since after moving all but one element into the correct place, the single remaining element must also be in its correct place. The total number of exchanges is obviously at most n², so we only need to consider the lower bound. Let B be a Bubble Sort algorithm. For a permutation π of the elements 1, 2, ..., n, we can describe the total number of exchanges by the number of inversions

    inv(π) = |{ (i, j) : i < j and π(i) > π(j) }|

In our sorting we move the minimum number out of array 1 and store it in array 2, repeating this process n times. This method eliminates the process of swapping and exchanging elements.


Sorting requires Θ((N/B) log_{M/B}(N/B)) block transfers and permuting an array requires Θ(min{N, (N/B) log_{M/B}(N/B)}) block transfers. These lower bounds hold for the cache-oblivious model, and they immediately give a lower bound of Ω((N/B) log_{M/B}(N/B)) block transfers for cache-oblivious sorting. The matching upper bounds cannot be applied to the cache-oblivious setting, since those algorithms make explicit use of B and M.

In sequential processing we transfer the value of min to array 2 while processing, and reformulate the method as Algorithm 3.

Algorithm 3: Sequential Sorting

1. Set min = max value
2. Repeat steps 3 to 5 for i = 1 to n
3. Repeat for j = 1 to n
   a) If (a[j] <= min) and (a[j] <> 0) then
      i) Set min = a[j]
      ii) Set loc = j
      [end of if]
   [end of j loop]
4. Set b[i] = min and Set a[loc] = 0
5. Set min = max value
   [end of i loop]
6. Exit

In this process we reformulate the Bubble Sort method using two arrays, array 1 and array 2. In the first array we find the location of the minimum value, replace it with a null value, and copy the value into array 2. The process continues up to the last element. At every phase we must reset min to the maximum value; this is compulsory for every iteration of the external loop. We make at most n passes, since after moving all but one element into the correct place, the single remaining element must also be in its correct place. The total number of exchanges is obviously at most n², so we only need to consider the lower bound. Using the same technique we may design Algorithm 4, cache-oblivious sequential sorting in descending order.

Implementation: Algorithm 3 is the simple scheme developed in the previous sections. The algorithm takes n phases, min 1, min 2, ..., min n, to carry out the sorting process. In a programming language such as C, C++ or Java these schemes can be used directly.
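A direct C++ rendering of Algorithm 3 might look like this (our interpretation; removed elements are marked with the sentinel 0 instead of a null value, so the inputs are assumed positive):

```cpp
#include <limits>
#include <vector>

// Sort by repeatedly extracting the minimum from array a into array b.
// Extracted positions are zeroed out, so elements are assumed positive.
std::vector<int> sequentialSort(std::vector<int> a) {
    int n = (int)a.size();
    std::vector<int> b(n);
    for (int i = 0; i < n; ++i) {
        int min = std::numeric_limits<int>::max();  // reset min each phase
        int loc = -1;
        for (int j = 0; j < n; ++j) {
            if (a[j] != 0 && a[j] <= min) {         // skip removed slots
                min = a[j];
                loc = j;
            }
        }
        b[i] = min;   // copy the minimum into the second array
        a[loc] = 0;   // mark it as removed in the first array
    }
    return b;
}
```

Array a is only read sequentially and array b only written sequentially, so no swapping or exchanging of elements takes place, as the text describes.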

4.4 Cache Oblivious B-tree

A binary tree T is in memory. The algorithm does a DFS (preorder) traversal of T, applying

an operation PROCESS to each of its nodes. An array stack is used to temporarily hold the

addresses of nodes as in algorithm 4.


Algorithm 4: Preorder Traversal

1. Set TOP = 1, STACK[1] = NULL and PTR = ROOT
2. Repeat steps 3 to 5 while PTR <> NULL
3. Apply PROCESS to INFO[PTR]
4. [Right child?] If RIGHT[PTR] <> NULL then [push on STACK]
      Set TOP = TOP + 1 and STACK[TOP] = RIGHT[PTR]
   [End of if structure]
5. [Left child?] If LEFT[PTR] <> NULL then
      Set PTR = LEFT[PTR]
   Else [pop from STACK]
      Set PTR = STACK[TOP] and TOP = TOP - 1
   [End of if structure]
   [End of step 2 loop]
6. Exit
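Algorithm 4 can be written in C++ as follows (a sketch that mirrors the parallel-array representation described below, using index 0 in the role of NULL):

```cpp
#include <vector>

// Parallel-array binary tree: node k holds INFO[k]; LEFT[k] and RIGHT[k]
// are child indices, with 0 playing the role of NULL.
// Returns the INFO values in preorder (DFS) order.
std::vector<int> preorder(const std::vector<int>& INFO,
                          const std::vector<int>& LEFT,
                          const std::vector<int>& RIGHT, int root) {
    std::vector<int> out;
    std::vector<int> stack;
    stack.push_back(0);                   // sentinel NULL at the bottom
    int ptr = root;
    while (ptr != 0) {
        out.push_back(INFO[ptr]);         // PROCESS the current node
        if (RIGHT[ptr] != 0)
            stack.push_back(RIGHT[ptr]);  // remember the right child
        if (LEFT[ptr] != 0) {
            ptr = LEFT[ptr];              // descend left
        } else {
            ptr = stack.back();           // pop a saved right child (or NULL)
            stack.pop_back();
        }
    }
    return out;
}
```

The sentinel 0 at the bottom of the stack terminates the loop exactly as STACK[1] = NULL does in Algorithm 4.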

Consider the worst-case number of memory transfers during a top-down traversal of a root-to-leaf path, using the above layout schemes and assuming each block stores B nodes. With the BFS layout, the topmost ⌊log(B+1)⌋ levels of the tree are contained in at most two blocks, whereas each of the following blocks read contains only one node of the path; the total number of memory transfers is therefore Θ(log(n/B)). For the DFS and in-order layouts we get the worst-case bound on the path to the rightmost leaf, since the first ⌈log(n+1)⌉ - ⌊log B⌋ nodes have distance at least B in memory, whereas the last ⌊log(B+1)⌋ nodes are stored in at most two blocks. Prokop observed that in the van Emde Boas layout there are at most Θ(log_B N) memory transfers. Only the van Emde Boas layout matches the asymptotically optimal bound achieved by the B-tree. In our sequential accessing process we change the sequence in which the binary tree is stored in memory.

Linked Representation of Binary Tree

A binary tree T is represented in memory by three parallel arrays INFO, LEFT and RIGHT and a pointer variable ROOT, as follows. Each node N of T corresponds to a location K such that, as in fig 8 and fig 9:

1) INFO[K] contains the data at node N.

2) LEFT[K] contains the location of the left child of node N.

3) RIGHT[K] contains the location of the right child of node N.

Fig 8: A Binary Tree T


Fig 9: Memory Representation of Sequential Processing

In this process we divide the binary tree T into a left subtree and a right subtree which are linked with each other for sequential processing. When the four memory layouts (DFS, in-order, BFS, van Emde Boas) are processed, Algorithm 5 is converted into Algorithm 6. The cache-oblivious B-tree executes on the elements in linking order, as left subtree and right subtree; in the sequential access method it processes the data structure in linking order.

We perform two steps:

1) It executes in sequential order.

2) The same approach is used for all four layouts.

Algorithm 5 (DFS Preorder)

1. Set PTR = ROOT
2. While (PTR <> NULL)
   a) Apply PROCESS to INFO[PTR]
   b) Set SAVE = PTR
   c) Set PTR = LEFT[PTR]
   [End of loop]
3. Set PTR = RIGHT[SAVE]
4. While (PTR <> NULL)
   a) Apply PROCESS to INFO[PTR]
   b) Set PTR = RIGHT[PTR]
   [End of loop]
5. Exit

The cache-oblivious B-tree uses two loop phases, repeating until PTR becomes NULL. In a programming language such as C, C++ or Java these schemes can be used directly.

5. Conclusions

To summarize, we have studied cache-oblivious algorithms for matrix multiplication, matrix transposition, dynamic programming, sorting and binary search trees. The divide-and-conquer design paradigm is used heavily in both parallel and external memory algorithms. Cache-oblivious algorithms make heavy use of this paradigm, and many seemingly simple algorithms based on it are already cache-oblivious: for instance, Strassen's matrix multiplication, quicksort, merge sort, closest pair, convex hulls and median selection are all cache-oblivious, though not all of them are optimal in this model. This means that they might be cache-oblivious but could be modified to incur fewer cache misses than they do in their current form. In the section on matrix multiplication, we saw that Strassen's matrix multiplication algorithm is already optimal in the cache-oblivious sense.

A divide-and-conquer algorithm splits the instance of the problem to be solved into several subproblems such that each subproblem can be solved independently. Since the algorithm recurses on the subproblems, at some point the subproblems fit inside M, and subsequent levels of recursion fit the subproblems into B. For instance, consider the average number of cache misses in randomized sorting algorithms: such an algorithm is quite cache-friendly, if not cache-optimal. In the cache-oblivious setting, we have presented the problem of optimizing cache memory through the implementation of optimal cache-oblivious matrix multiplication. Our algorithm uses only two types of recursive blocks; all the elements are accessed sequentially and there are no jumps at all. The number of cache misses is of the order of O(N³/(L√M)).


Our matrix transposition has excellent spatial and temporal locality features. Using the ideal-cache model, the number of cache misses is of the order of O(N²/L), which is asymptotically optimal for any algorithm based on recursive block transposition.

We have presented a sorting algorithm with better locality features. Using the ideal-cache model, sorting incurs Θ((N/B) log_{M/B}(N/B)) block transfers and permuting an array requires Θ(min{N, (N/B) log_{M/B}(N/B)}) block transfers; this is asymptotically optimal for any algorithm based on sorting. It totally avoids the need for swapping and for address arithmetic. While this fact is not fully exploited on standard computers, it may be a considerable advantage for hardware implementations of sorting techniques.

The basic idea of the data structure is to maintain a binary tree of height log n + O(1) using existing methods. In sequentially accessed cache-oblivious binary processing we also investigate the practicality of cache obliviousness in the area of trees by comparing different methods for laying out a tree in memory. One further observation is that the effects of the space saving and the reduction in size caused by the implicit layout are notable.


6. References

1. A. Aggarwal and J. Vitter, "The Input/Output Complexity of Sorting and Related Problems," Comm. ACM, vol. 31, pp. 1116-1127, 1988.
2. Bingsheng He and Qiong Luo, "Cache-Oblivious Nested Loop Joins," in CIKM '06, ACM, 2006.
3. M. A. Bender, Z. Duan, J. Iacono, and J. Wu, "A locality-preserving cache-oblivious dynamic dictionary," in Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002, pp. 29-38.
4. D. Coppersmith and S. Winograd, "Matrix multiplication via arithmetic progressions," Journal of Symbolic Computation, vol. 9, pp. 251-280, 1990.
5. D. E. Knuth, The Art of Computer Programming, Vol. 1: Fundamental Algorithms, 3rd ed., Addison-Wesley Longman, Boston, 1997.
6. D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 2nd ed., Morgan Kaufmann, San Francisco, 1998.
7. E. D. Demaine, "Cache-oblivious algorithms and data structures," in Lecture Notes from the EEF Summer School on Massive Data Sets, BRICS, University of Aarhus, Denmark, 2002.
8. Erik D. Demaine, "Cache-Oblivious Algorithms and Data Structures," Lecture Notes in Computer Science, BRICS, University of Aarhus, Denmark, June 27-July 1, 2002, Springer.
9. Joon-Sang Park, "Optimizing Graph Algorithms for Improved Cache Performance," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 9, Sept. 2004.
10. G. S. Brodal and R. Fagerberg, "Cache oblivious distribution sweeping," in Proc. 29th International Colloquium on Automata, Languages, and Programming, LNCS vol. 2380, pp. 426-438, Springer, Berlin, 2002.
11. F. G. Gustavson, M. Mehl, M. Pögl, and C. Zenger, "A cache-aware algorithm for PDEs on hierarchical data structures based on space-filling curves," SIAM Journal of Scientific Computing, 1999.
12. M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, "Cache-Oblivious Algorithms," in Proc. 40th Annual Symposium on Foundations of Computer Science, Oct. 1999.
13. M. Bader and C. Zenger, Institut für Informatik der TU München, Boltzmannstr. 3, 85748 Garching, Germany; preprint submitted to Elsevier Science, 30 December 2004 (http://www.springerlink.com/content/p088315m7t778650).
14. M. Sniedovich, Dynamic Programming, Marcel Dekker, Inc., New York, NY, USA, 1992.
15. M. A. Furis, "Cache Miss Analysis of Walsh-Hadamard Transform Algorithms," Master's thesis; M. A. Bender, G. S. Brodal, R. Fagerberg, D. Ge, S. He, H. Hu, J. Iacono, and A. López-Ortiz, "The cost of cache-oblivious searching," in FOCS 2003, pp. 271-282.
16. M. A. Bender, E. Demaine, and M. Farach-Colton, "Cache-oblivious B-trees," in Proc. 41st Annual Symposium on Foundations of Computer Science, pp. 399-409, IEEE Computer Society Press, 2000.
17. N. Rahman, R. Cole, and R. Raman, "Optimized predecessor data structures for internal memory," in WAE 2001, 5th International Workshop on Algorithm Engineering, LNCS vol. 2141, pp. 67-78, Springer, 2001.
18. Piyush Kumar, "Cache Oblivious Algorithms," Department of Computer Science, University of New York, NSF (CCR-973220, CCR-0098172).
19. L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, and J. I. Munro, "Cache-oblivious priority queue and graph algorithm applications," in Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC '02), ACM Press, 2002, pp. 268-276.
20. R. Bellman, Dynamic Programming, Princeton University Press, Princeton, New Jersey, 1957.
21. Kazushige Goto and Robert van de Geijn, "On Reducing TLB Misses in Matrix Multiplication," ACM TOMS, under revision (preprint at http://www.cs.utexas.edu/users/flame/pubs.html).
22. S. Chatterjee and S. Sen, Department of Computer Science, Carolina, NC 27599-3175; DARPA DABT63-98-1-0001, NSF grants DA07-2637.
23. Steven Huss-Lederman, Elaine M. Jacobson, and Jeremy R. Johnson, "Implementation of Strassen's Algorithm," 0-89791-854-1/1996, IEEE Xplore.
24. T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed., MIT Press.
25. Tobias Johnson, "Cache-Oblivious of an Array's Pair," May 7, 2007.
26. R. Chowdhury and V. Ramachandran, "Cache-Oblivious Dynamic Programming," NSF Grant CISE Research Infrastructure, NSF Grant CCF-0514871.
27. R. Chowdhury and V. Ramachandran, "Cache-Oblivious Dynamic Programming for Bioinformatics," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 3, July-Sept. 2010.
28. Z. Kasheff, "Cache-oblivious dynamic search trees," M.S. thesis, Massachusetts Institute of Technology, Cambridge, MA, 2004.