
A Survey of Machine Learning Methods Applied to Computer Architecture

Balaji

Paul

Introduction
Architecture Simulation
K-means Clustering
Design Space Exploration
Coordinated Resource Management on Multiprocessors
Artificial Neural Networks
Hardware Predictors
Decision Tree Learning
Learning Heuristics for Instruction Scheduling
Other Machine Learning Methods
Online Hardware Reconfiguration
GPU
Data Layout
Emulate Highly Parallel Systems
References


    Introduction

Machine learning is the subfield of artificial intelligence concerned with the design and development of data-based algorithms that improve in performance over time. A major focus of machine learning research is to automatically induce models, such as rules and patterns, from data. In computer architecture, many resources interact with each other, and building an exact model can be very difficult even for a simple processor. Hence, machine learning methods can be applied to automatically induce such models. In this paper we look at ways in which machine learning has been applied to various aspects of computer architecture and analyze the current and future influence of machine learning in this field.

    Taxonomy of ML algorithms

Machine learning algorithms are organized into a taxonomy based on the desired outcome of the algorithm. The following is a list of the common algorithm types used in this paper.

Supervised learning - in which the algorithm generates a function that maps inputs to desired outputs. One standard formulation of the supervised learning task is the classification problem: the learner is required to learn (to approximate) the behavior of a function that maps a vector into one of several classes by looking at several input-output examples of the function. It may be difficult to get properly labeled data in many scenarios. Also, if the training data is corrupted, the algorithm may not learn the correct function, so the learning algorithm needs to be robust to noise in the training data. Examples: artificial neural networks and decision trees.

Unsupervised learning - in which the algorithm models a set of inputs for which labeled examples are not available. In this case, the inputs are grouped into clusters based on some relative similarity measure. The performance may not be as good as in the supervised case, but it is much easier to obtain unlabeled examples than labeled data. Example: k-means clustering.

Semi-supervised learning - which combines both labeled and unlabeled examples to generate an appropriate function or classifier.

Reinforcement learning - in which the algorithm learns a policy of how to act given an observation of the world. Every action has some impact on the environment, and the environment provides feedback that guides the learning algorithm.

Architecture Simulation

Architecture simulators typically model, in software, each cycle of a specific program on a given hardware design. This modeling is used to gain information about a hardware design, such as the average CPI and cache miss rates; it can be a time-consuming process, taking days or weeks just to run a single simulation. It is common for a suite of programs to be tested against a set of architectures. This is a problem, since a single test can take weeks and several such tests need to be performed, taking months.

SPEC (Standard Performance Evaluation Corporation) provides one of many industry-standard benchmark suites that allow the performance of different architectures to be compared. The SPEC suite referenced here consists of 26 programs: 12 integer and 14 floating point.

SimpleScalar is a standard industry simulator, used here as the baseline against which SimPoint, a machine-learning-based approach to simulation, is compared. It simulates each cycle of the running program and records CPI, cache miss rates, branch mispredictions and power consumption.

SimPoint is a machine learning approach to architecture simulation that uses k-means clustering. It exploits the structured way in which an individual program's behavior changes over time. It selects a set of samples, called simulation points, that represent every type of behavior in the program. These samples are then weighted by the amount of behavior they represent.

Definitions:
Interval - a slice of the overall program. The program is divided into equal-sized intervals; SimPoint usually selects intervals of around 100 million instructions.
Similarity - a metric that represents the similarity in behavior of two intervals of a program's execution.
Phase (Cluster) - a set of intervals in a program that have similar behavior, regardless of temporal location.

    K-means Clustering

K-means clustering takes a set of data points that each have n features and uses a similarity (distance) measure, which can be complex and must be defined beforehand, to group the data into K clusters. K is not necessarily known ahead of time, and some tests need to be run to figure out a good value of K, since too low a value of K will under-fit the data and too high a value will over-fit it.


    This is an example of K-means clustering applied to two dimensional data points where K = 4.

Assume each point in the example above represents the (x, y) location of a house that a mailman needs to travel to to make a delivery. The distance could be the straight-line distance between locations or some kind of street-block distance. Then, in order to assign each mailman to a group of houses, k-means clustering would take K as the number of available mailmen and build clusters of the houses that are closest together, i.e. have the highest similarity.
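To make the procedure concrete, the following is a minimal k-means sketch in Python (NumPy only). The house coordinates and the choice of Euclidean distance as the similarity measure are illustrative assumptions, not taken from SimPoint or any cited work.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: return cluster centers and each point's cluster assignment."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]  # pick k starting centers
    for _ in range(iters):
        # Assignment step: each point joins its nearest center (Euclidean distance as similarity).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of the points assigned to it.
        new_centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical "house locations" for the mailman example: four loose groups of houses, K = 4 mailmen.
rng = np.random.default_rng(1)
houses = np.vstack([rng.normal(c, 0.5, size=(25, 2)) for c in [(0, 0), (6, 0), (0, 6), (6, 6)]])
centers, labels = kmeans(houses, k=4)
print(centers)   # one "depot" per mailman: the centroid of each cluster of houses
```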

    SimPoint Design

SimPoint uses an architecture-independent metric to classify phases. It clusters intervals based on the program's behavior in each interval. This means that, when using a benchmark suite such as SPEC, the clustering can be done once over all 26 programs, and when an architecture is later tested on those programs the same clustering of phases is reused. Since the clustering is independent of architecture features such as cache miss rate, there is no need to recompute it for each architecture, saving a great deal of time.
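The following is a rough sketch of this idea under simplifying assumptions: each interval is summarized by an architecture-independent basic block vector (synthetic random data here), the intervals are clustered with k-means, and one representative interval per phase is kept along with a weight. The real SimPoint tool [1] adds machinery omitted here, such as projecting the vectors to lower dimension and searching over the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: one row per 100M-instruction interval, one column per basic block.
# Each entry counts how often that basic block executed in the interval (architecture independent).
rng = np.random.default_rng(0)
bbv = rng.poisson(5.0, size=(300, 64)).astype(float)
bbv /= bbv.sum(axis=1, keepdims=True)              # normalize each interval's vector

k = 6                                              # number of phases (SimPoint searches over k)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(bbv)

sim_points, weights = [], []
for phase in range(k):
    members = np.flatnonzero(km.labels_ == phase)
    dist = np.linalg.norm(bbv[members] - km.cluster_centers_[phase], axis=1)
    sim_points.append(int(members[dist.argmin()]))   # simulate only the interval nearest the centroid
    weights.append(len(members) / len(bbv))          # weight = fraction of execution in this phase

# Detailed simulation is then run only on `sim_points`; whole-program metrics are the weighted sum.
```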


(Figure) Error of SimPoint compared with naive sampling that skips the first billion instructions and samples the rest of the program; the white bars are the error associated with SimPoint.

The overall error rate is important, but what matters even more, given a significantly high error rate, is that the bias of the error is the same from one architecture to another. If the bias of the error is the same between architectures, then regardless of the magnitude of the error the architectures can be compared fairly without having to run a reference trial.

Machine learning has the potential to cut simulation running time from months to days or even hours. This is a significant time savings for development and has the potential to become the method of choice in industry; SimPoint is already used in industry by companies such as Intel [1].

    Design Space Exploration

As multi-core processor architectures with tens or even hundreds of cores, not all of them necessarily identical, become common, the current processor design methodology that relies on large-scale simulations will not scale well because of the number of possibilities to be considered. In the previous section, we saw how time consuming it can be to evaluate the performance of a single processor. Performance evaluation can be even trickier with multicore processors. Consider the design of a k-core chip multiprocessor where each core can be chosen from a library of n cores. There are n^k possible designs. If n = 100 and k = 4, there are 100 million possibilities. The design space explodes even for very small n and k, so we clearly need a smart way to choose the "best" of these n^k designs, i.e. intelligent and efficient techniques to navigate the processor design space. There are two approaches to tackling this problem:

1. Reduce the simulation time for a single design configuration. Techniques like SimPoint can be used to approximately predict the performance.

2. Reduce the number of configurations tested, i.e. prune the search space so that only a small number of configurations are evaluated. At each point, the algorithm moves to a new configuration in the direction that increases performance by the maximum amount. This can be thought of as a steepest-ascent hill climbing algorithm. The algorithm may get stuck at a local maximum. To overcome this, one may employ hybrid-start hill climbing, wherein steepest-ascent hill climbing is initiated from several initial points; each initial point converges to a local maximum, and the overall answer is the best of these local maxima (see the sketch after this list). Other search techniques such as genetic algorithms or ant colony optimization may also be applied.
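The pruned-search idea in point 2 can be sketched as follows. The configuration encoding and the toy performance function are hypothetical stand-ins for a real (slow) simulator, and this is generic hybrid-start hill climbing rather than the specific algorithm of [2].

```python
import random

def hybrid_hill_climb(n_cores, k_slots, perf, restarts=10, seed=0):
    """Steepest-ascent hill climbing restarted from several random configurations.
    A configuration is a tuple of k core choices, each drawn from a library of n cores."""
    rng = random.Random(seed)
    best_cfg, best_perf = None, float("-inf")
    for _ in range(restarts):
        cfg = tuple(rng.randrange(n_cores) for _ in range(k_slots))
        while True:
            # Neighbors: configurations that differ from `cfg` in exactly one core slot.
            neighbors = [cfg[:i] + (c,) + cfg[i + 1:]
                         for i in range(k_slots) for c in range(n_cores) if c != cfg[i]]
            best_nbr = max(neighbors, key=perf)
            if perf(best_nbr) <= perf(cfg):        # no neighbor improves: local maximum reached
                break
            cfg = best_nbr                         # steepest ascent: take the best improving move
        if perf(cfg) > best_perf:
            best_cfg, best_perf = cfg, perf(cfg)   # keep the best local maximum seen so far
    return best_cfg, best_perf

# Hypothetical performance model standing in for a slow simulation of one configuration.
perf = lambda cfg: -sum((core - 42) ** 2 for core in cfg)
print(hybrid_hill_climb(n_cores=100, k_slots=4, perf=perf))   # converges to (42, 42, 42, 42)
```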

In reality, the n^k configurations may not all be very different from each other, so we can group processors based on relative similarity. One simple method is k-tuple tagging. Each processor is characterized by the following parameters (k = 5 here):


Simple
D-cache intensive
I-cache intensive
Execution-units intensive
Fetch-width intensive

So a processor suitable for D-cache-intensive applications would be tagged as (0, 1, 0, 0, 0). These tags are treated as feature vectors, and clustering is then employed to find different categories of processors. If we have M clusters, the design space is M^k instead of n^k. Assume we had n = 100 and M = 10: the number of possibilities drops from 100^4 to 10^4. Apart from tagging the cores, we can also tag the different benchmarks to get even more speedup. Based on some performance criterion, one may evaluate the performance of the processors in the M clusters and then cluster the different benchmarks; i.e., if a benchmark performs best on a D-cache-intensive processor, it is more likely that the benchmark contains many D-cache-intensive instructions. Tag information is highly useful in the design of application-specific multi-core processors.

Coordinated Resource Management on Multiprocessors

Efficient sharing of system resources is critical to obtaining high utilization and enforcing system-level performance objectives on chip multiprocessors (CMPs). Although several proposals that address the management of a single micro-architectural resource have been published in the literature, coordinated management of multiple interacting resources on CMPs remains an open problem. Global resource allocation can be formulated as a machine learning problem. At runtime, the resource management scheme monitors the execution of each application and learns a predictive model of system performance as a function of allocation decisions. By learning each application's performance response to different resource distributions, this approach makes it possible to anticipate the system-level performance impact of allocation decisions with little runtime overhead. As a result, it becomes possible to make reliable comparisons among different points in a vast and dynamically changing allocation space, allowing the allocation decisions to adapt as applications undergo phase changes.

The key observation is that an application's demands on the various resources are correlated: if the allocation of a particular resource changes, the application's demands on the other resources also change. For example, increasing an application's cache space can reduce its off-chip bandwidth demand. Hence, the optimal allocation of one resource type depends in part on the allocated amounts of the other resources, which is the basic motivation for a coordinated resource management scheme.


The above figure shows an overview of the resource allocation framework, which comprises per-application hardware performance models as well as a global resource manager. Shared system resources are periodically redistributed between applications at fixed decision-making intervals, allowing the global manager to respond to dynamic changes in workload behavior. Longer intervals amortize higher system reconfiguration overheads and enable more sophisticated (but also more costly) allocation algorithms, whereas shorter intervals permit faster reaction to dynamic changes. At the end of every interval, the global manager searches the space of possible resource allocations by repeatedly querying the application performance models. To do this, the manager presents each model with a set of state attributes summarizing recent program behavior, plus another set of attributes indicating the allocated amount of each resource type. In turn, each performance model responds with a performance prediction for the next interval. The global manager then aggregates these predictions into a system-level performance prediction (e.g., by calculating the weighted speedup across all applications). This process is repeated for a fixed number of query-response iterations on different candidate resource distributions, after which the global manager installs the configuration estimated to yield the highest aggregate performance.

Successfully managing multiple interacting system resources in a CMP environment presents several challenges. The number of ways a system can be partitioned among different applications grows exponentially with the number of resources under control, leading to over one billion possible system configurations in a quad-core setup with three independent resources. Moreover, as a result of context switches and application phase behavior, workloads can exert drastically different demands on each resource at different points in time. Hence, optimizing system performance requires us to quickly determine high-performance points in a vast allocation space, as well as to anticipate and respond to dynamically changing workload demands.
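The query-and-aggregate loop just described might be sketched as follows. Everything concrete here (the resource totals, the random candidate generator, weighted speedup as the aggregate) is an illustrative assumption; [3] defines its own search procedure and model interface, and `app_models` merely stands in for the per-application performance models.

```python
import random
from typing import Callable, List, Sequence

def random_partition(total: int, n_apps: int, rng: random.Random) -> List[int]:
    """Randomly split `total` units of one resource among n_apps applications."""
    cuts = sorted(rng.randrange(total + 1) for _ in range(n_apps - 1))
    return [b - a for a, b in zip([0] + cuts, cuts + [total])]

def choose_allocation(app_models: Sequence[Callable], states, baseline_perf,
                      totals=(16, 100, 100), queries=200, seed=0):
    """Query the models on candidate resource distributions and keep the best one found."""
    rng = random.Random(seed)
    best_cfg, best_speedup = None, float("-inf")
    for _ in range(queries):                       # fixed number of query-response iterations
        # One candidate distribution of each resource (cache ways, bandwidth %, power %) over all apps.
        per_resource = [random_partition(t, len(app_models), rng) for t in totals]
        cfg = list(zip(*per_resource))             # cfg[i] = (cache, bw, power) for application i
        # Query every per-application model, then aggregate into a system-level weighted speedup.
        preds = [model(state, alloc) for model, state, alloc in zip(app_models, states, cfg)]
        speedup = sum(p / b for p, b in zip(preds, baseline_perf))
        if speedup > best_speedup:
            best_cfg, best_speedup = cfg, speedup
    return best_cfg, best_speedup                  # configuration installed for the next interval

# Toy usage with two applications and dummy models that simply reward more cache ways.
models = [lambda s, a: 1.0 + 0.05 * a[0], lambda s, a: 1.0 + 0.02 * a[0]]
print(choose_allocation(models, states=[None, None], baseline_perf=[1.0, 1.0]))
```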


Artificial Neural Networks

Artificial Neural Networks (ANNs) are machine learning models that automatically learn to approximate a target function (application performance in this case) based on a set of inputs.

The above figure shows an example ANN consisting of 12 input units, four hidden units, and an output unit. In a fully connected feed-forward ANN, an input unit passes the data presented to it to all hidden units via a set of weighted edges. Hidden units operate on this data to generate the inputs to the output unit, which in turn calculates the ANN's predictions. Hidden and output units form their results by first taking a weighted sum of their inputs based on the edge weights, and then passing this sum through a non-linear activation function.

Increasing the number of hidden units in an ANN leads to better representational power and the ability to model more complex functions, but increases the amount of training data and time required to arrive at accurate models. ANNs are among the most powerful machine learning models for non-linear regression; their representational power is high enough to model multi-dimensional functions involving complex relationships among variables.

Each network takes as input the amount of L2 cache space, off-chip bandwidth, and power budget allocated to its application. In addition, the networks are given nine attributes describing recent program behavior and the current L2-cache state. These nine attributes are: the number of (1) read hits, (2) read misses, (3) write hits, and (4) write misses in the L1 d-cache over the last 20K instructions; the number of (5) read hits, (6) read misses, (7) write hits, and (8) write misses in the L1 d-cache over the last 1.5M instructions; and (9) the fraction of cache ways allocated to the modeled application that are dirty.

The first four attributes are intended to capture the program's phase behavior in the recent past, whereas the next four attributes summarize program behavior over a longer time frame. Summarizing program execution at multiple granularities allows accurate predictions for applications whose behaviors change at different speeds. Using L1 d-cache metrics as inputs allows the application's demands on the memory system to be tracked without relying on metrics that are affected by resource allocation decisions. The ninth attribute is intended to capture the amount of write-back traffic that the application may generate; an application typically generates more write-back traffic if it holds a larger number of dirty cache blocks.
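As an illustration of the network shape described above (12 inputs, four hidden units, one output), here is a minimal NumPy forward pass. The weights are random placeholders rather than trained values, and all input numbers are invented, so this shows only the structure, not the trained models of [3].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PerfANN:
    """Fully connected feed-forward net: 12 inputs -> 4 hidden units -> 1 output.
    Weights are random placeholders; in practice they would be trained from observed
    (allocation, recent behavior) -> performance samples."""

    def __init__(self, n_in=12, n_hidden=4, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.1, size=(n_hidden, 1))
        self.b2 = np.zeros(1)

    def predict(self, x):
        h = sigmoid(x @ self.w1 + self.b1)              # hidden units: weighted sum + non-linearity
        return float(sigmoid(h @ self.w2 + self.b2))    # single output unit: predicted performance

# Input vector: 3 allocation attributes + 9 recent-behavior attributes (all values invented).
alloc = [8, 0.25, 0.25]                 # L2 ways, off-chip bandwidth share, power share
behavior = [0.60, 0.10, 0.20, 0.05,     # L1 d-cache read/write hits and misses, last 20K instructions
            0.55, 0.12, 0.22, 0.06,     # the same four counts over the last 1.5M instructions
            0.30]                       # fraction of allocated cache ways that are dirty
print(PerfANN().predict(np.array(alloc + behavior)))
```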

    Results

The above figure shows an example of performance loss due to uncoordinated resource management on a CMP where three resources (cache, bandwidth, and power) are shared. A four-application, desktop-style multiprogrammed workload is executed on a quad-core CMP with an associated DDR2-800 memory subsystem. Performance is measured in terms of weighted speedup (the ideal weighted speedup here is 4, which corresponds to all four applications executing as if each had all the resources to itself). Configurations that dynamically allocate one or more of the resources in an uncoordinated fashion (Cache, BW, Power, and combinations of them) are compared to a static, fair-share allocation of the resources (Fair-Share), as well as an unmanaged sharing scenario (Unmanaged), where all resources are fully accessible by all applications at all times. We see that uncoordinated management of all three resources (Cache, BW, Power) is still worse than the static fair-share allocation. However, if we build models of each application's resource allocation profile, we can expect coordinated dynamic resource allocation to perform better.

    Hardware Predictors

Hardware predictors are used to make quick predictions of some value that would otherwise take much longer to compute, wasting clock cycles. If a predictor has a high enough prediction accuracy, the expected time saved by using it can be significant. There are many uses for predictors in computer architecture, including branch predictors, value predictors, memory address predictors and dependency predictors. These predictors all work in hardware, in real time, to improve performance.

Despite the fact that current table-based branch predictors can achieve upward of 98% prediction accuracy, research is still being done to analyze and improve upon current methods. Recently some machine learning methods have been applied, specifically decision tree learning. We found a paper [6] that uses decision-tree-based machine learning to predict values based on smaller subsets of the overall feature space. The methods used in that paper could be applied to other types of hardware predictors and, at the same time, improved upon by using some sort of hybrid approach with classic table-based predictors.

Current table-based predictors do not scale well, so the number of features is limited. This means that although the average prediction rate is higher, there are some behaviors that the low-feature table-based predictors cannot handle. A table-based predictor typically has a small set of features because with n features there are 2^n possible feature vectors, each of which must be represented in memory. This means that the table size increases exponentially with the number of features.

Previous papers have shown that prediction using a subset of features is nearly as good if the features are carefully chosen. In one study, predictions were computed using a large set of features; then a human chose the most promising subset of features for each branch and predictions were computed again. The branch predictions were nearly as good as when using all the features. This means that by intelligently choosing a subset of features from a larger set, the number of candidate features can be greatly increased and the feature set does not need to be known ahead of time.

Definitions:
Target bit - the bit to be predicted.
Target outcome - the value that bit will eventually have.
Feature vector - the set of bits used to predict the target bit.


    Decision Tree Learning

Decision trees are used to predict outcomes given a set of features, known as the feature vector. Typically in machine learning the data set consists of hundreds or thousands of feature vector/target outcome pairs and is processed to create a decision tree. That tree is then used to predict future outcomes. It is almost always the case that the observed feature vectors are a small subset of the total number of possible feature vectors; otherwise one could simply compare a new feature vector to an old one and copy the outcome.

This figure illustrates the relationship between binary data and a binary decision tree. The blue boxes represent positive values and the red boxes negative values.

In the figure above, an example data set of four feature vector/outcome bit pairs is given. Using this data, a tree can be created that splits the data on any of the features. It can be seen that F1 splits the data between red and blue without any mixing (this is ideal). The better a feature is, the more information is gained by dividing the outcomes based on that feature's values. It can also be seen that F2 and F3 can be used together, as a larger tree, to segregate all the data elements into groups containing only identical values.
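To show how a splitting feature is chosen, here is a small information-gain calculation over a hypothetical four-example data set in the spirit of the figure: F1 separates the outcomes perfectly, while F2 and F3 do not individually (though together they do). The data values are invented for illustration.

```python
import math

def entropy(labels):
    """Entropy of a list of 0/1 outcomes."""
    p = sum(labels) / len(labels)
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def info_gain(feature_col, labels):
    """Information gained by splitting the labels on one binary feature."""
    gain = entropy(labels)
    for value in (0, 1):
        subset = [y for f, y in zip(feature_col, labels) if f == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Hypothetical 4-example data set: columns F1, F2, F3 and the outcome bit.
data = [
    (1, 1, 1, 1),
    (1, 0, 0, 1),
    (0, 1, 0, 0),
    (0, 0, 1, 0),
]
features = list(zip(*[row[:3] for row in data]))
outcomes = [row[3] for row in data]
for i, col in enumerate(features, start=1):
    # F1 splits perfectly (gain 1.0); F2 and F3 alone gain nothing, but jointly separate the data.
    print(f"F{i}: gain = {info_gain(col, outcomes):.3f}")
```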

Noise can be introduced into the data by having two examples with the same feature vector but different outcomes. This can happen if the recorded features are not representative of all the features that actually influence the outcome.


    Dynamic Decision Tree (DDT)

The hardware implementation of a decision tree has some issues that need to be dealt with. In hardware prediction there may not be a nice data set to start from, so the predictor needs to start predicting right away and update its tree on the fly. One design for a DDT used for branch prediction stores a counter for each feature and updates that counter as feature vector/outcome pairs arrive: a feature's counter is incremented when the feature agrees with the outcome and decremented otherwise (see the figure below).

This figure shows how the outcome bit is logically XORed against each feature value, updating the counter for each of those features.

When the most desirable features are being chosen, the absolute value of each feature's counter is used, because a feature that is always wrong becomes always correct by simply flipping its bit, and so can still be a very good feature.

This figure shows how the best feature is selected by taking the maximum absolute value over all the feature counters.

The dynamic predictor has two modes. In prediction mode it takes in a feature vector and returns a prediction. In update mode it takes in a feature vector and the target outcome and updates its internal state. It alternates between the two modes: it first predicts an outcome, then, when the real outcome is known, it updates. The figure below shows a high-level view of the predictor. The tree is a fixed size in memory and thus can only use a small number of features, but since it selects those features from a larger set held in a table that grows only linearly with the number of features, it does not need to be very large.


High-level view of the DDT hardware prediction logic for branch prediction of a single branch.
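The feature-selection machinery can be sketched with per-feature counters, as below. This is a deliberately simplified stand-in for the DDT of [6]: it keeps one signed counter per feature, updates it on agreement between feature bit and outcome bit (the XOR step in the figures above), and predicts with the single feature whose counter has the largest absolute value, rather than building the full tree. The feature vectors in the usage example are invented.

```python
class FeatureCounterPredictor:
    """Simplified sketch of dynamic feature selection: one signed counter per feature,
    bumped up when the feature bit agrees with the outcome bit and down otherwise;
    prediction follows the feature whose counter has the largest absolute value."""

    def __init__(self, n_features, max_count=32):
        self.counts = [0] * n_features
        self.max_count = max_count              # saturate counters, as a hardware register would

    def predict(self, features):
        best = max(range(len(self.counts)), key=lambda i: abs(self.counts[i]))
        bit = features[best]
        # A feature that is "always wrong" (negative counter) is used with its bit flipped.
        return bit if self.counts[best] >= 0 else 1 - bit

    def update(self, features, outcome):
        for i, bit in enumerate(features):
            delta = 1 if bit == outcome else -1     # agreement test (XNOR of feature and outcome)
            self.counts[i] = max(-self.max_count, min(self.max_count, self.counts[i] + delta))

# Usage: alternate predict/update as outcomes resolve (the history bits below are invented).
p = FeatureCounterPredictor(n_features=8)
samples = [([1, 0, 1, 1, 0, 0, 1, 0], 1),
           ([1, 1, 0, 1, 0, 1, 1, 0], 1),
           ([0, 0, 1, 0, 1, 0, 0, 1], 0)]
for fv, outcome in samples:
    guess = p.predict(fv)        # prediction mode
    p.update(fv, outcome)        # update mode, once the real outcome is known
```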

Experimentally, the decision-tree branch prediction method compares well with some current table-based predictors. It does better in some situations and worse in others, and overall does almost as well in the experiments performed. Since machine learning usually assumes plenty of data is available, and in this case the predictor starts off with very limited data, it takes a while for the predictions to become highly accurate, but they eventually do very well.

There is some added hardware complexity in using a decision tree in hardware at each branch condition rather than a table, and getting the learner to act online within tight time limits can be a challenge. However, the size of the hardware can remain relatively small and grows only linearly with the number of features added. This approach could be useful as part of a hybrid predictor or in other hardware predictors.

    Learning Heuristics for Instruction Scheduling

The execution speed of programs on modern computer architectures is sensitive, by a factor of two or more, to the order in which instructions are presented to the processor. To realize potential execution efficiency, it is now customary for an optimizing compiler to employ a heuristic algorithm for instruction scheduling. These algorithms are painstakingly hand-crafted, which is expensive and time-consuming. The instruction scheduling problem can instead be formulated as a learning task, so that the heuristic scheduling algorithm is obtained automatically. As discussed in the introduction, supervised learning requires a sufficient number of correctly labeled examples. If we ...


    Dependency Graph

    Two Possible Schedules with Different Costs

One can view this as learning a relation over triples (P, Ii, Ij), where P is the partial schedule (the total order of what has been scheduled, and the partial order of what remains), and Ii, Ij are instructions from the set from which the next selection is to be made. Triples that belong to the relation define pairwise preferences in which the first instruction is considered preferable to the second; each triple that does not belong to the relation represents a pair in which the first instruction is not better than the second. The representation used here takes the form of a logical relation, in which known examples and counter-examples of the relation are provided as triples. It is then a matter of constructing or revising an expression that evaluates to TRUE if (P, Ii, Ij) is a member of the relation, and FALSE if it is not. If (P, Ii, Ij) is considered to be a member of the relation, then it is safe to infer that (P, Ij, Ii) is not a member. For any representation of preference, one needs to represent features of a candidate instruction and of the partial schedule. The authors used the features described in the table below.


The choice of features is fairly intuitive: the critical-path feature indicates that another instruction is waiting for the result of this instruction, and delay refers to the latency associated with a particular instruction.
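As a sketch of how a pairwise preference over candidate instructions drives scheduling, the following uses a hand-written preference function over the two features just mentioned inside a greedy list scheduler. In the work surveyed, this preference function is what gets learned (and it also sees the partial schedule P, which is ignored here); the instructions and feature values below are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    """A schedulable instruction described by the kind of features in the table above:
    whether another instruction is waiting on its result, and its latency (delay)."""
    name: str
    on_critical_path: bool
    delay: int

def prefer(a: Candidate, b: Candidate) -> bool:
    """Hand-written stand-in for the learned preference relation over (P, Ii, Ij):
    True means `a` should be scheduled before `b`."""
    if a.on_critical_path != b.on_critical_path:
        return a.on_critical_path          # unblock waiting instructions first
    return a.delay > b.delay               # otherwise start long-latency instructions earlier

def greedy_schedule(ready: List[Candidate]) -> List[Candidate]:
    """List scheduling driven purely by pairwise preferences among ready instructions."""
    order, remaining = [], list(ready)
    while remaining:
        best = remaining[0]
        for cand in remaining[1:]:
            if prefer(cand, best):
                best = cand
        order.append(best)
        remaining.remove(best)
    return order

ready = [Candidate("load", False, 3), Candidate("add", True, 1), Candidate("mul", False, 5)]
print([c.name for c in greedy_schedule(ready)])   # add, mul, load under this heuristic
```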

The authors chose the Digital Alpha 21064 as the architecture for the instruction scheduling problem. The 21064 implementation of the instruction set is interestingly complex, having two dissimilar pipelines and the ability to issue two instructions per cycle (dual issue) if a complicated collection of conditions holds. Instructions take from one to many tens of cycles to execute. SPEC95 is a standard benchmark suite commonly used to evaluate CPU execution time and the impact of compiler optimizations. It consists of 18 programs: 10 written in FORTRAN, which tend to use floating-point calculations heavily, and 8 written in C, focusing more on integers, character strings, and pointer manipulations. These were compiled with the vendor's compiler, set at the highest level of optimization offered, which includes compile- or link-time instruction scheduling; the resulting schedules are called the "Orig" schedules for the blocks. The resulting collection has 447,127 basic blocks, composed of 2,205,466 instructions. DEC refers to the performance of the DEC heuristic scheduler (hand-crafted, and the best performer). Different supervised learning techniques were employed; even though they were not as good as the hand-crafted heuristic, they perform reasonably well. ITI refers to a decision tree induction program, TLU to table lookup, and NN to an artificial neural network.


The cycle counts are tested under two different conditions. In the first case ("Relevant blocks"), only the relevant basic blocks are considered for testing. In the second case ("All blocks"), even blocks of length > 10 are included. Even though blocks of length > 10 were not included during training, we can see that the learning algorithm still performs reasonably well.

Other Machine Learning Methods

Online Hardware Reconfiguration

Online hardware reconfiguration is similar to the coordinated resource management discussed earlier in the paper. The difference is that the resources are managed at a higher level (the operating system) rather than low down in hardware. This higher-level management is useful for domains such as web servers, where large, powerful servers split their resources into several logical machines. Some configurations are more efficient than others depending on the workload of each logical machine, and reconfiguring dynamically using machine learning can be beneficial despite the reconfiguration costs.

    GPU

The graphics processing unit (GPU) can be exploited for machine learning tasks. Since the GPU is designed for image processing, which takes in a large amount of similar data and processes it in parallel, it is well suited to machine learning workloads that need to process large amounts of data.

There is also potential to apply machine learning methods to graphics processing itself. Machine learning can be used to reduce the amount of data that needs to be processed by the GPU at the cost of some error, which can be justified if the difference in image quality is not noticeable to the human eye.

    Data Layout

Memory in most computers is organized hierarchically, from small and very fast cache memories to large and slower main memories. Data layout is an optimization problem whose goal is to minimize the execution time of software by transforming the layout of data structures to improve spatial locality. Automatic data layout performed by the compiler is currently attracting much attention, as significant speed-ups have been reported. The problem is known to be NP-complete, so machine learning methods may be employed to identify good heuristics and improve overall speedup.

    Emulate Highly Parallel Systems

The efficient mapping of program parallelism to multi-core processors is highly dependent on the underlying architecture. Applications can either be written from scratch in a parallel manner or, given the large legacy code base, converted from an existing sequential form. In [15], the authors assume that program parallelism is expressed in a suitable language such as OpenMP. Although the available parallelism is largely program dependent, finding the best mapping is highly platform or hardware dependent. There are many decisions to be made when mapping a parallel program to a platform, including how much of the potential parallelism should be exploited, the number of processors to use, and how the parallelism should be scheduled. The right choice depends on the relative costs of communication, computation and other hardware costs, and varies from one multicore to the next. This mapping can be performed manually by the programmer or automatically by the compiler or run-time system. Given that the number and type of cores is likely to change from one generation to the next, finding the right mapping for an application may have to be repeated many times throughout the application's lifetime, making machine-learning-based approaches attractive.
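A minimal sketch of what learning such a mapping predictor could look like, under invented assumptions: a decision tree is trained offline on (program features, best thread count) pairs measured on one machine and then queried for new programs. [15] uses its own feature set and learning models; the features, numbers and labels here are purely illustrative.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: per-program features a compiler might extract (e.g. parallel
# loop iteration count, computation-to-communication ratio) paired with the thread count
# that performed best when the program was measured offline on one particular multicore.
X_train = [
    [1_000_000, 8.0],
    [    2_000, 0.5],
    [  500_000, 4.0],
    [    1_000, 0.2],
    [  800_000, 6.0],
]
y_train = [8, 1, 4, 1, 8]                   # best-performing number of threads per program

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# At compile time or run time, the learned model predicts a mapping for an unseen parallel region.
print(model.predict([[750_000, 5.0]]))      # e.g. array([8])
```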

    References

1. Greg Hamerly, Erez Perelman, Jeremy Lau, Brad Calder and Timothy Sherwood. Using Machine Learning to Guide Architecture Simulation. Journal of Machine Learning Research 7, 2006.

2. Sukhun Kang and Rakesh Kumar. Magellan: A Framework for Fast Multi-core Design Space Exploration and Optimization Using Search and Machine Learning. Proceedings of the Conference on Design, Automation and Test in Europe, 2008.

3. R. Bitirgen, E. İpek, and J. F. Martínez. Coordinated Management of Multiple Resources in Chip Multiprocessors: A Machine Learning Approach. In Intl. Symp. on Microarchitecture, Lake Como, Italy, Nov. 2008.

4. Moss, Utgoff, et al. Learning to Schedule Straight-Line Code. NIPS, 1997.

5. Malik, Russell, et al. Learning Heuristics for Basic Block Instruction Scheduling. Journal of Heuristics, Volume 14, Issue 6, December 2008.


6. Alan Fern, Robert Givan, Babak Falsafi, and T. N. Vijaykumar. Dynamic Feature Selection for Hardware Prediction. Journal of Systems Architecture 52(4), 213-234, 2006.

7. Alan Fern and Robert Givan. Online Ensemble Learning: An Empirical Study. Machine Learning Journal (MLJ), 53(1/2), pp. 71-109, 2003.

8. Jonathan Wildstrom, Peter Stone, Emmett Witchel, Raymond J. Mooney and Mike Dahlin. Towards Self-Configuring Hardware for Distributed Computer Systems. ICAC, 2005.

9. Jonathan Wildstrom, Peter Stone, Emmett Witchel and Mike Dahlin. Machine Learning for On-Line Hardware Reconfiguration. IJCAI, 2007.

10. Jonathan Wildstrom, Peter Stone, Emmett Witchel and Mike Dahlin. Adapting to Workload Changes Through On-The-Fly Reconfiguration. Technical Report, 2006.

11. Tejas Karkhanis. Automated Design of Application-Specific Superscalar Processors. University of Wisconsin-Madison, 2006.

12. Sukhun Kang and Rakesh Kumar. Magellan: A Framework for Fast Multi-core Design Space Exploration and Optimization Using Search and Machine Learning. Design, Automation and Test in Europe, 2008.

13. Matthew Curtis-Maury et al. Identifying Energy-Efficient Concurrency Levels Using Machine Learning. Green Computing, 2007.

14. Mike O'Boyle. Machine Learning for Automating Compiler/Architecture Co-design. Presentation slides, Institute for Computing Systems Architecture, School of Informatics, University of Edinburgh.

15. Zheng Wang et al. Mapping Parallelism to Multi-cores: A Machine Learning Based Approach. Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009.

    16. Peter Van Beek. http://ai.uwaterloo.ca/~vanbeek/research.html.

    17. Wikipedia. http://en.wikipedia.org/wiki/Machine_learning.