
Page 1: PREDICTING MICROBIAL ACTIVITY FOR COMPOSTING USING MACHINE LEARNING TECHNIQUES by REIMAN L. RABBANI (Under the Direction of Khaled Rasheed)

PREDICTING MICROBIAL ACTIVITY FOR COMPOSTING

USING MACHINE LEARNING TECHNIQUES

by

REIMAN L. RABBANI

(Under the Direction of Khaled Rasheed)

ABSTRACT

In this thesis, several Machine Learning (ML) methods, along with newly developed hybrid

algorithms, are applied for the first time to predict microbial activity during composting.

Modeling biological activities for an inadequately understood domain is a difficult task. This

thesis evaluates, compares, and analyzes the improved results of the models created and the

methods used.

The results indicate with statistical significance that hybridizing an eager learner with a

lazy one improves learning performance in this domain. Lazy-eager hybrids can form complex,

irregular hypotheses. They are suitable because the expressive power of the eager learner is

significantly enhanced by the hybrid’s ability to represent the target function by combining several

complex, locally approximated hypotheses. The study also showed hybrid rule-based methods

and trees to be good performers.

INDEX WORDS: Composting, Microbial Activity, Machine Learning, Lazy Eager Learner,

Model Tree, Hybrid Learner, Biosolids, Municipal Solid Waste, MSW, Ethanol, Modeling


PREDICTING MICROBIAL ACTIVITY FOR COMPOSTING

USING MACHINE LEARNING TECHNIQUES

by

REIMAN L. RABBANI

B.S. in Computer Science, King College, 2002

A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment

of the Requirements for the Degree

MASTER OF SCIENCE

ATHENS, GEORGIA

2006


© 2006

Reiman L. Rabbani

All Rights Reserved


PREDICTING MICROBIAL ACTIVITY FOR COMPOSTING

USING MACHINE LEARNING TECHNIQUES

by

REIMAN L. RABBANI

Major Professor: Khaled Rasheed

Committee: Walter D. Potter Ronald W. McClendon

Electronic Version Approved: Maureen Grasso Dean of the Graduate School The University of Georgia August 2006


DEDICATION

To my beloved parents and my two wonderful brothers.


ACKNOWLEDGEMENTS

I would like to express my deepest gratitude and respect to Dr. Rasheed for providing me

with guidance, ideas, encouragement and patience. I have thoroughly enjoyed his classes and

teaching style, and he has always been available for my many questions and regular visits to his

office. I took my first class and first research course at UGA with Dr. Potter, who

passionately and diligently introduced me to the realm of AI, where it was a pleasure to hunt for

snakes on many sleepless nights. Thank you for your sincerity in teaching and for having high

expectations from your students. I am very thankful to Dr. McClendon, who introduced me to

this project, and has been instrumental in guiding me in developing this thesis. Dr. Potter’s and Dr.

McClendon’s class on Computational Intelligence is probably the best two-for-one class package

at UGA and I am thankful to them for being on my committee. Many thanks to Dr. K. C. Das

from the Agricultural Engineering Department at UGA for his suggestions and help with the

composting experiment and data. Heartfelt thanks to Dr. Arabnia, without whom the UGA CS

program wouldn’t have been the same – thank you for believing in me. His charisma, positive

attitude, and genuine care for students have made so many of our UGA experiences possible. I

am appreciative of Dr. John Miller for his great teaching style and helpful attitude. I would also

like to thank Boseon Byeon who has helped me with parts of this research in a class project.

Thanks to the friendly and jovial staff of the CS department, especially Claudia, Jean and

Elizabeth, who were always eager to help with a smile. My warmest thanks go to my family -

you guys have always been there for me in every aspect of my life. Last but not least, it was a

pleasure to have made so many wonderful friends at UGA, thank you all.


TABLE OF CONTENTS

Page

ACKNOWLEDGEMENTS.............................................................................................................v

CHAPTER

1 INTRODUCTION .........................................................................................................1

1.1 MACHINE LEARNING...................................................................................2

1.2 MUNICIPAL SOLID WASTE, BIOSOLIDS, AND ETHANOL....................3

1.3 COMPOSTING AND ITS MODELING ..........................................................5

1.4 THESIS MOTIVATION...................................................................................6

1.5 RELATED WORK............................................................................................8

1.6 THESIS OBJECTIVES...................................................................................10

2 ANALYZING THE DATA .........................................................................................11

2.1 COMPOSTING EXPERIMENT ....................................................................11

2.2 DIFFICULTIES IN LEARNING....................................................................12

3 MACHINE LEARNING METHODOLOGIES ..........................................................19

3.1 BACKGROUND.............................................................................................19

3.2 EAGER AND LAZY LEARNERS.................................................................21

3.3 LINEAR REGRESSION.................................................................................23

3.4 k-NEAREST NEIGHBOR..............................................................................24

3.5 LOCALLY WEIGHTED LINEAR REGRESSION (LWR) ..........................26

3.6 RADIAL BASIS FUNCTION NETWORKS (RBF) ......................................27


3.7 SUPPORT VECTOR MACHINES (SVM) ....................................................28

3.8 REGRESSION TREES ...................................................................................29

3.9 MODEL TREES..............................................................................................31

3.10 ARTIFICIAL NEURAL NETWORKS (ANN) .............................................33

3.11 HYBRID METHODS ....................................................................................35

3.12 ADAPTIVE NEURO-FUZZY INFERENCING SYSTEM (ANFIS) ...........36

3.13 COMBINING MODELS - ENSEMBLE APPROACHES............................36

4 IMPLEMENTATION DETAILS...................................................................39

4.1 LAZY METHODS ..........................................................................................39

4.2 EAGER METHODS .......................................................................................40

4.3 HYBRID METHODS .....................................................................................43

4.4 HYBRID NEURO-FUZZY SYSTEM............................................................43

5 EVALUATION, RESULTS AND ANALYSIS.............................................................45

5.1 MODEL EVALUATION METRICS .............................................................46

5.2 MODEL DEVELOPMENT & EVALUATION .............................................47

5.3 EAGER LEARNING METHOD RESULTS..................................................51

5.4 LAZY LEARNING METHOD RESULTS ....................................................66

5.5 COMBINING MODELS - ENSEMBLE RESULTS......................................70

5.6 HYBRID METHOD RESULTS .....................................................................72

5.7 EVALUATING MACHINE LEARNING SCHEMES ..................................75

5.8 ANALYSIS .....................................................................................................84

6 CONCLUSION AND FUTURE WORK ....................................................................87

REFERENCES ..............................................................................................................................90


CHAPTER 1

INTRODUCTION

The widespread availability of computational resources at the present time has led to the

deployment of computers to help in every aspect of modern society. The next natural progression

of this technology would be to enable it to automatically learn ways to help us move towards our

goals. This is where Machine Learning (ML) comes into play. It is a relatively young field in

computer science that has had far-reaching practical and commercial implications. According to

Tom Mitchell (1997), ML is a broad multidisciplinary field drawing on concepts from Artificial

Intelligence (AI), probability, statistics, information theory, philosophy, biology, cognitive

science, computational complexity and many other disciplines. It entails the study of algorithms

or techniques that allow the computer to “learn”, i.e., automatically improve its performance

through experience and/or knowledge. Machine Learning has a wide spectrum of applications,

for example, some very successful areas are search engines, bioinformatics, stock market

analysis, 3D object, speech and handwriting recognition, game playing, and robot locomotion

just to mention a few. ML methods are well suited for poorly understood domains where humans lack

the knowledge needed to develop an algorithm (e.g., biological processes); for domains where

the program must dynamically adapt to changing conditions (e.g., adapting to the changing

interests of individuals); and for automatically finding valuable hidden patterns in large

databases (Mitchell 1997). This thesis concentrates on the application and subsequent

comparison of several ML methods and their hybrids for modeling biosolids composting, which

is an inadequately understood biological process.


1.1 MACHINE LEARNING

Machine Learning algorithms can be thought of as search algorithms that traverse the set

of all possible hypotheses H (or concepts to be learned) in order to find the unknown concept

underlying the training instances. The concept c to be learned by the machine is commonly

referred to as the target concept, which can be a skill, pattern, model, relation, function, rule or

some other form of knowledge. In other words, ML focuses on inducing general

functions/patterns/rules from specific training examples. For example, a simple target concept

involving only two descriptors/attributes to identify a car could be, “If object has 4 wheels and

can move, then object is a car”, which could also be expressed as a 3-tuple, (numWheels = 4,

canMove = true, object = car). The most basic algorithm searches through the space of all

possible hypotheses H to find one that is consistent with the training data/knowledge presented.

A hypothesis h is consistent with a set of training examples D, denoted as Consistent(h, D), if

and only if h(x) = c(x) for each example <x, c(x)> in D. If our goal

is to consider the number of wheels and the ability to move of an object to identify whether it is a

car, then the hypothesis space is quite small. In this case, given a few positive and negative

examples, it would take a simple algorithm, e.g. List-Then-Eliminate (Mitchell 1997) described

below, to learn c.

The List-Then-Eliminate Algorithm is as follows:

1. S = set of all hypotheses in H

2. For each training example, <x, c(x)>

Remove from S any hypothesis h for which h(x) ≠ c(x)

3. Output S, which now contains the list of consistent hypotheses.
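As a concrete illustration, the List-Then-Eliminate steps above can be sketched in Python for the toy car concept. This is a sketch only: the attribute values, training examples, and hypothesis encoding are invented for illustration, not taken from the thesis.

```python
from itertools import product

# Step 1: S = set of all hypotheses in H. Here a hypothesis is encoded as a
# (numWheels, canMove) pair, labeling an instance a car iff it matches the pair.
WHEELS = (2, 4)
MOVES = (True, False)
S = list(product(WHEELS, MOVES))

def h_of(hyp):
    """Return the classifier h(x) encoded by the hypothesis tuple."""
    return lambda x: x == hyp

# Training examples <x, c(x)>: (numWheels, canMove) -> is the object a car?
D = [((4, True), True), ((2, True), False), ((4, False), False)]

# Step 2: remove from S any hypothesis h for which h(x) != c(x).
for x, cx in D:
    S = [hyp for hyp in S if h_of(hyp)(x) == cx]

# Step 3: output S, the list of consistent hypotheses.
print(S)  # [(4, True)] -- "has 4 wheels and can move => car"
```

Only the hypothesis matching the target concept survives the elimination pass, which mirrors how the algorithm converges when the training data rules out all competing hypotheses.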


The above sample algorithm illustrates learning to classify discrete objects; however,

other algorithms can be designed to learn real/discrete functions, patterns, rules etc. In this thesis,

learners are grouped based on lazy and eager learning characteristics. Lazy learners delay the

processing of training examples until they must label a new query instance, thereby creating

local models on the fly based on the query, while eager learners create a global approximation

from the training data and must use that previously committed model to classify new instances. There are

pros and cons to both types of learners which are further discussed in Chapter 3 of this thesis.
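To make the distinction concrete, here is a minimal hand-rolled sketch (not code from the thesis): the eager learner commits to one global linear model at training time and discards the data, while the lazy learner keeps the raw data and answers each query locally (here with a 1-nearest-neighbor lookup). The toy dataset is invented for illustration.

```python
# Eager: fit a single global line y = a*x + b once, then discard the data.
def fit_linear(data):
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    a = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
    b = my - a * mx
    return lambda q: a * q + b  # the committed global model, used for every query

# Lazy: keep the raw data; build a local answer per query (1-nearest neighbor).
def lazy_1nn(data):
    return lambda q: min(data, key=lambda xy: abs(xy[0] - q))[1]

data = [(0, 0.0), (1, 1.0), (2, 4.0), (3, 9.0)]  # y = x^2, a non-linear target
eager = fit_linear(data)
lazy = lazy_1nn(data)
print(round(eager(2.0), 2), lazy(2.0))  # global line overshoots (5.0); local lookup gives 4.0
```

On a non-linear target, the single global line misses points the local learner matches exactly, which is the trade-off the hybrid methods in Chapter 3 try to balance.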

A branch of Artificial Intelligence, Fuzzy Logic (FL), is an extension of Boolean logic to

describe partial facts so that it can model uncertainties of inadequately understood domains

(Engelbrecht 2002). It generates human readable rules in simple linguistic terms and can take

into account vagueness, uncertainty and partial descriptions. Although fuzzy systems have enjoyed

widespread application, they require domain experts to initially define the linguistic terms. In

this thesis, FL has been used in a neural network hybrid method developed by Jang (1993) called

ANFIS, which is further discussed in Chapter 3.
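For intuition only, here is a generic fuzzy-membership sketch. This is not the ANFIS implementation used later in the thesis; the linguistic terms ("warm", "hot") and their breakpoints are invented for illustration.

```python
def triangular(a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

# Illustrative linguistic terms for a temperature variable (deg C).
warm = triangular(20.0, 35.0, 50.0)
hot = triangular(40.0, 55.0, 70.0)

t = 43.0  # a crisp temperature reading
print(round(warm(t), 2), round(hot(t), 2))  # the reading is partially warm AND partially hot
```

A crisp value can thus belong to several linguistic terms to different degrees, which is what lets fuzzy rules express the partial facts described above.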

1.2 MUNICIPAL SOLID WASTE, BIOSOLIDS, AND ETHANOL

Municipal solid waste (MSW), more commonly known as garbage or trash, is an

unpleasant but inevitable byproduct of modern society. The Environmental Protection Agency

(EPA 2003) estimated an average of 4.5 pounds of waste produced per person per day, which

amounts to about 236 million tons of total waste generated per year. Non-hazardous

biodegradable organic material constitutes almost half of the total waste generated in the US;

other constituents are shown in Figure 1.1.


Figure 1.1: Composition of the 229.2 million tons of MSW generated in the US in 2001 (EPA 2003).

Another byproduct of modern society is biosolids: the solid, dehydrated organic

matter produced by wastewater treatment plants, with an estimated annual production of 7.5

million tons for 2005 (EPA 1999). While landfills, ocean/river dumping, and combustion are

traditional methods of handling such a high volume of MSW and biosolids, they are merely

temporary solutions and have highly adverse long- and short-term effects on the environment.

Composting is the safest and most environmentally friendly method of disposing of MSW and

biosolids. Composting MSW and biosolids go hand in hand because biosolids are best

composted in combination with various organic components of MSW such as yard trimmings

(e.g. leaves, grass clippings, brush), paper and chipped wood debris (EPA 1999), which are also

called bulking agents.

Ethanol, or ethyl alcohol, has been dubbed the fuel of the future by the President of the

United States in his 2006 State of the Union Speech (Bush 2006). It can be produced in two

ways, from petrochemicals by the hydration of ethylene or biologically from the fermentation of


various sugars using microbes. Sugars can be found in carbohydrates from agricultural crops, or

from inexpensive, abundant waste sources like residues of crops, grass, wood and possibly

MSW. However, extracting the carbohydrate content from residues of crops and MSW is

complicated, cost-inefficient, and underdeveloped both commercially and academically.

Composting is the suggested method that can be used to break down and free the sugar contents

of crop residues and more ambitiously MSW (Gray 1999). The reader is referred to Roehr (2001)

for details on the current and future state of the ethanol production process.

1.3 COMPOSTING AND ITS MODELING

Composting is the controlled natural biodegradation of organic waste matter into more

stable and naturally useful organic substances. Biodegradation is a biological process in which

various types of microorganisms perform the aerobic decomposition of organic materials.

Composting is a viable, high-throughput solution for waste reduction in wide use across the US and

the world. A high percentage of MSW can be composted into stable and less toxic material, which

reduces the load on landfills, river/ocean dumping, and combustion. This in turn decreases

pollution of water, air and soil, and thereby conserves the natural resources by processing these

organic wastes into a soil-building substance. As mentioned in Section 1.2, composting can also

play a vital role for the production of ethanol from cellulose containing crop residues, wood

chips, grass and stalks.

The dynamics of composting can be compared to the widely known simple cellular

automaton called the “Game of Life” - it would be a more complex version of the game with

various colonies of multiple species of microbes in a continually changing ecosystem. It is a

perpetual survival game involving several types of microbes thriving at different stages of the


compost, with various factors affecting them such as time, temperature, nitrogen,

carbon/nitrogen ratio, substrate/bulking agent, pH, moisture, oxygen level (aeration), particle

size, type of microbes, and nutrient balance. The physical and chemical environment

surrounding the microbes is constantly changing, primarily as a result of consumption of oxygen

by respiration, increase in temperature, crowding, and accumulation of metabolic products and

byproducts (Liang et al. 2003a). This makes it difficult to create a unified model. In order to

predict composting, a flexible yet expressive high-dimensional modeling mechanism is required;

this is where machine learning becomes applicable.

While composting depends on various chemical and physical factors, according to Miller

(1992), the most important parameters for the microorganisms are temperature, moisture,

oxygen, pH, and substrate/bulking agent composition. Research conducted by others agrees with

these parameters but places different emphasis on each. However, it is generally agreed that the most

dominant factors for microbial activity growth are moisture, temperature, and oxygen (aeration)

which should all be varied with time (McCartney 1998; Liang et al. 2003b; Rosso 1993) to

achieve maximal microbial activity. The aforementioned are the main attributes considered in

this research.

1.4 THESIS MOTIVATION

In spite of a 350% increase in the number of composting facilities in the US in the last 15

years (Goldstein and Gray 1999), composting only accounts for less than 30% of MSW and 21%

of biosolids recovered. Other means of managing MSW, such as incineration/

combustion, landfills, and river/ocean dumping, waste natural resources and have detrimental

long- and short-term effects on the environment. According to a 2001 EPA report (EPA 2003) on MSW, it


is imperative to invest in composting, recycling and source-reduction practices to sustain the

economic growth of the nation and society. However, most of the commercial composting

processes in operation today are in primitive stages; there is limited understanding of the

biological, chemical, and physical interactions during composting (Liang et al. 2003a; EPA

1995), processing efficiency is low, and costs are relatively high. Compost prices range from

$26 per ton for landscape mulch to more than $100 per ton for high-grade compost, which is

bagged and sold at the retail level (EPA 2006a).

Ethanol, considered the environmentally friendly renewable fuel source of the future, is in

the early stages of development in the United States. Brazil has created an infrastructure to use

ethanol produced from cane sugar to supply almost 40% of the country’s automobile fuel needs.

It is apparent that ethanol is a feasible solution to the energy crisis in the US. It is only a matter

of time before cost-effective composting methods are developed to produce it from abundant

residue, waste crops and MSW.

While there has been research into developing mathematical models (deterministic,

stochastic etc.), empirical rule-based models and mechanistic models, these models do not work

well for accurate process control. ML techniques are attractive for modeling such complex

biological processes due to their ability to dynamically modify their behavior, store experimental

knowledge, and make that knowledge available for modeling. ML methods can also point out the

importance of specific descriptors and implicit relations among them that may prompt further

investigation. This thesis compares and studies several ML methods and their hybrids applied to

the composting domain and shows favorable modeling results in hopes of broadening the

understanding of composting models, facilitating composting applications and improving process

control.


1.5 RELATED WORK

Modeling composting is related to biodegradation modeling. Many sophisticated models

have been developed for commercial and academic purposes to predict certain material’s

biodegradability. EnviroSim offers a popular commercial package, BIOWIN for the prediction of

the rate of aerobic microbial biodegradation of various mixed substances using linear and

non-linear regression models and a comprehensive data store. Baker et al. (2004) present an

overview of various Machine Learning methods that have been applied to model biodegradation,

namely artificial neural networks, partial least squares discriminant analysis, inductive rule- and

knowledge-based learning systems, and Bayesian analysis. Although these software packages and models

could be used to predict the rate of biodegradation of some substances, they have not been used

in tandem for composting process control.

Several researchers have modeled aerobic composting since the early 1990s using

deterministic, stochastic, mechanistic, or steady-state models. Physically based process models

introduced by Haug (1993) and further studied by Das et al. (1998) using non-linear

mathematical functions have been used by industry and municipal-scale bioconversion centers.

While the former are based solely on physical environmental factors, Stombaugh and Nokes

(1996) created a simple model based on physical-microbiological parameters in the reactor using

differential equations that considered microbial, substrate, and oxygen concentrations, moisture,

temperature and vessel size. Mechanistic approaches incorporating the non-linearity in microbial

degradation have also been developed (Liang et al. 2003a,b; Xi et al. 2005) that consider the

biological component of composting and kinetics at a particle level. These models have the

practical benefit of being able to optimize certain operational parameters, but in general they are

too simple and often are not able to model the dynamics involved. There are also some software


simulators for basic composting modeling that do not employ Machine Learning methodologies,

e.g. STELLA, ORWARE (2002) etc.

Machine Learning methods have been successfully applied to a broad range of biological

process modeling over the past decade, with applications of the discipline running far ahead of

the theory. Liang et al. (2003a) applied ANNs to predict biosolids composting from a pilot-scale

experiment and evaluated the model against a different experiment to achieve promising results.

Based on their earlier research, the composting problem was treated as a black box dependent on

several environmental variables. A primary drawback with ANN is that it does not yield any

human-readable rules that may be used by other systems, and it is almost impossible to

interpret the model created. Furthermore, the Backpropagation algorithm employed by most

ANNs does not produce stable models, i.e., different training runs using the same dataset will

produce different models. Morris (2005) used an evolutionary method for Fuzzy learning called

FISSION in order to model composting and achieved results similar to Liang et al. (2003a).

This thesis has studied and compared the use of various other ML methods like Support

Vector Machines, Radial Basis Function Networks, Instance Based Learning, Model Trees, and

Regression Trees. It has also explored ensemble methods and hybrid ML methods, namely

a Neuro-Fuzzy System and lazy-eager learners (e.g., Lazy SVM and Lazy Model Trees). In the past,

others have used hybrid schemes: boosted lazy decision trees (Brodley 2003) for classification,

hybrid schemes to reduce computation and memory load (Zhou et al. 1999), etc. The hybrid

algorithms developed here were able to improve upon the results of published research.


1.6 THESIS OBJECTIVES

This project seeks to apply several machine learning algorithms and develop suitable

hybrid algorithms to model and predict composting dynamically with the intention of real-world

applicability. Results are analyzed and statistical tests are performed to understand the

comparative pros and cons of each method. This research is performed on the same composting

dataset so that it can be compared to published results (Liang et al. 2003; Morris 2005).

The modeling goal is to predict the oxygen uptake rate, which is an indicator of

microbial activity and is directly proportional to the rate of composting. The most important

descriptors in predicting composting are moisture content, temperature, and time. The data were

taken from the research of Liang et al. (2003b).

Specifically, the objectives of this thesis are to:

1) Analyze the data from the pilot experiment and pre-process it for use with various

ML schemes.

2) Study and explore ML algorithms and their hybrids suitable for this domain.

3) Develop and implement proposed ML algorithms as required.

4) Evaluate and compare the predictive accuracy of the ML models developed on the

basis of their ability to predict unseen data from another composting experiment.

5) Study the performance, robustness, and applicability of the ML schemes using

statistical significance tests and uncertainties in the data.
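Schematically, the modeling task stated above is learning a mapping from the three descriptors to the oxygen uptake rate. The sketch below uses a 1-nearest-neighbor stand-in and made-up readings purely to show the shape of the task; it is not one of the models evaluated in this thesis, and the numeric values are invented.

```python
# Hypothetical rows: (moisture %, temperature deg C, time h) -> O2 uptake rate.
rows = [
    ((70, 36, 24), 1.10),
    ((70, 36, 48), 0.85),
    ((50, 29, 24), 0.60),
]

def predict(query):
    """1-NN over the stored rows; a placeholder for the ML models of Chapters 3-5."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(rows, key=lambda r: dist(r[0], query))[1]

print(predict((70, 36, 30)))  # nearest stored row is (70, 36, 24) -> 1.1
```

Any of the learners compared later (model trees, SVMs, ANNs, hybrids) can be substituted for `predict` without changing the shape of the task: descriptors in, predicted oxygen uptake rate out.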


CHAPTER 2

ANALYZING THE DATA

The advantage of applying ML methods is that minimal domain knowledge is required

and they can discover relations/patterns among descriptors. In order to learn the composting

function, we can use the traditional empirical methods currently employed by engineers and

scientists to design a full-scale composting reactor. According to Das et al. (1997), a few steps

are ideally performed in designing a full scale composting model for a previously untested

substance. These are: (1) substance characterization, (2) pilot scale process evaluation to provide

essential data, (3) product testing, and (4) design parameter evaluation, and (5) scale up of

system. The data gathered from the pilot scale experiment can be used to train the ML system,

which can be then tested on a separate composting run.

2.1 COMPOSTING EXPERIMENT

The composting data used in this experiment was provided by the University of

Georgia’s Biological and Agricultural Engineering Department. Research conducted by Liang et

al. (2003a, b) on biosolids mixed with pine sawdust and water provided an excellent source of

composting data. Six batches of composting experiments under 6 discrete temperature settings

(22°C, 29°C, 36°C, 43°C, and 50°C) were run over a period of eight months using the same

material. Each batch simultaneously incubated two replicates of 5 composting chambers under

the same temperature but 5 different moisture settings (30%, 40%, 50%, 60%, and 70%),

providing 146 observations per replicate over a 10-day period. This resulted in 6T × 5M × 2R ×


146 readings, yielding a total of 8760 training instances. To provide for an independent

evaluation of the model, separate experiments were also conducted under a different temperature,

34°C, which was not present in the training data. Figure 2.1 shows the behavior of microbial

activity with time over a period of 240 hours at 36°C temperature and 70% moisture setting.
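As a quick arithmetic check, the factorial design above accounts for every training instance:

```python
# Factorial design of the composting experiments: 6 temperature settings,
# 5 moisture settings, 2 replicates, and 146 readings per replicate.
temperatures, moistures, replicates, readings = 6, 5, 2, 146
total_instances = temperatures * moistures * replicates * readings
print(total_instances)  # 8760, matching the count reported above
```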

Figure 2.1: Plot of microbial activity (O2 uptake rate) over 240 hours at 36°C temperature and 70% moisture content. (Axes: x, time in hours; y, O2 uptake rate, an indicator of microbial activity.)

2.2 DIFFICULTIES IN LEARNING

There are no pre-defined rules that govern the methodology of collecting data for ML

algorithms; each algorithm has its strengths and weaknesses in dealing with various aspects of a

dataset. In the remainder of this chapter, we hypothesize on some issues that may contribute to

the difficulty in modeling composting for ML methods.

2.2a LIMITED DISCRETE VALUES IN DESCRIPTORS

Each training example has one of 5 discrete values for moisture, one of 6 discrete values

for temperature and one of 146 discrete values for the time elapsed. These limited discrete

attribute values in the training data may pose a complication to the learning algorithm. Because


there are complex non-linear relationships among the descriptors, it is not merely a matter of

interpolating the values in between. In other words, there are several empty regions of the input

space with no indication of the nature of the hypercurve for those regions (e.g., there are 1752

instances for each of the moisture settings of 40% and 50%, but none for 45%).

2.2b POTENTIAL DESCRIPTOR INADEQUACY

When two experiments are conducted under the same settings, one would expect similar composting behavior to be exhibited, but the behaviors are often significantly different. Due to the biological nature of composting, two instances with the same attribute values are likely to have different target class values. For example, two instances with descriptors having identical values of M40%, T22°C, and a time of 150 hrs have substantially different values for the oxygen uptake rate: 0.5038 and 0.0452 mg/gH. This poses a problem for the learner, as it introduces a degree of inconsistency into the problem domain. This is not an isolated occurrence affecting a few instances but a pervasive feature of this biological process. This

discrepancy is shown in the graphs of Figures 2.2 through 2.10; ideally the graphs should be the

same. Notice that the graphs of the 30% (M30%) and 40% (M40%) moisture settings usually

exhibit dramatically different behaviors; referring to Figure 2.6, we observe that for one graph

(M40% R2) composting does not start in the specified time frame, while in the other graph

(M40% R1), composting takes a long time to initialize. It should also be noted that, for the

M30% runs, composting ultimately fails to occur in both replicates at several temperatures

(22°C, 29°C, 36°C, and 57°C). Therefore, high discrepancy in composting behavior is shown at

certain settings. This may suggest the need for additional descriptors that are not considered in

the experiment.


Previous research has shown that moisture is possibly the most important factor in

composting (Miller 1992; Liang et al. 2003b) and our data analysis and models imply the same.

The three-dimensional plot of microbial activity against moisture and temperature shown in

Figure 5.43 also indicates that moisture has more effect than temperature. The algorithm that

generates rule sets from Model Trees (Holmes et al. 1999), M5Rules, was also able to learn this

observation as shown by the rules inferred by the algorithm; Figure 5.18 shows some sample

rules.

2.2c UNCERTAINTY IN TARGET VALUES

Due to the extremely low microbial activity readings at the 30% and 40% moisture settings (see Figures 2.2, 2.3, and 2.6), there is a possibility of uncertainty in the target values introduced by the measuring instruments. Two inevitable aspects of experimentation are: 1)

noise introduced by sensory devices, and 2) difficulty in maintaining environmental settings.

The Oxygen sensor had a sensitivity level of 0.0065 mg(O2)/g Hr, and a histogram shows that

2919 of the 8760 training examples had target values in the range of 0 to 0.065 mg/g Hr of

Oxygen. Although the oxygen sensor was a high precision device, it is likely that uncertainty had

been introduced into the target values. The target class values had a mean of 0.445 mg/gH and a

standard deviation of 0.391 for the training data (8760 instances). Combined with the effect of unknown descriptors involved in composting and the difficulty of controlling experimental settings, it is clear that a high level of uncertainty is involved.
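The screening described above can be sketched by counting readings that fall within roughly ten times the oxygen sensor's stated sensitivity (0 to 0.065 mg/g Hr); the sample readings are illustrative, not the thesis data.

```python
import statistics

# Band of ~10x the sensor's stated sensitivity of 0.0065 mg(O2)/g Hr.
SENSOR_SENSITIVITY = 0.0065
THRESHOLD = 10 * SENSOR_SENSITIVITY  # 0.065 mg/g Hr, the range cited above

readings = [0.002, 0.050, 0.120, 0.445, 0.910, 0.030, 0.700]  # illustrative
uncertain = [r for r in readings if r <= THRESHOLD]

print(len(uncertain), round(statistics.mean(readings), 3))
```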


Figure 2.2: Discrepancy in microbial activity between two identical experiments (replicates) at 22°C and 30% moisture. (Oxygen uptake rate vs. time for M30% R1 and M30% R2.)

Figure 2.3: Discrepancy in microbial activity between two identical experiments (replicates) at 22°C temperature and 40% moisture. (Oxygen uptake rate vs. time for M40% R1 and M40% R2.)

Figure 2.4: Discrepancy in microbial activity between two identical experiments (replicates) at 22°C temperature and 50% moisture. (Oxygen uptake rate vs. time for M50% R1 and M50% R2.)


Figure 2.5: Discrepancy in microbial activity between two identical experiments (replicates) at 22°C temperature and 70% moisture. (Oxygen uptake rate vs. time for M70% R1 and M70% R2.)

Figure 2.6: Discrepancy in microbial activity between two identical experiments (replicates) at 29°C temperature and 40% moisture. (Oxygen uptake rate vs. time for M40% R1 and M40% R2.)

Figure 2.7: Discrepancy in microbial activity between two identical experiments (replicates) at 43°C temperature and 30% moisture. (Oxygen uptake rate vs. time for M30% R1 and M30% R2.)


Figure 2.8: Discrepancy in microbial activity between two identical experiments (replicates) at 43°C temperature and 40% moisture. (Oxygen uptake rate vs. time for M40% R1 and M40% R2.)

Figure 2.9: Discrepancy in microbial activity between two identical experiments (replicates) at 57°C temperature and 40% moisture. (Oxygen uptake rate vs. time for M40% R1 and M40% R2.)

Figure 2.10: Discrepancy in microbial activity between two identical experiments (replicates) at 57°C temperature and 50% moisture. (Oxygen uptake rate vs. time for M50% R1 and M50% R2.)


These observations can help us decide on the ML algorithm, provide guidance for

parameter tuning, and suggest any necessary data pre-processing. In summary, the data contains

certain inconsistencies, noise, possibly insufficient descriptors for the domain, and too narrow a range of discrete values (6 for temperature, 5 for moisture, 146 for time) to provide adequate coverage of the distribution of the real-valued attributes. Approaches to minimize their effects during learning can include 1) taking the mean of the output class values of the two replicates, 2) using more attributes, 3) using multiple models to predict, or 4) pruning noisy exemplars, i.e., stored instances used for classification that have been identified as problematic or as outliers (Aha 1992; Witten 2005).
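Approach (1) above can be sketched as a pointwise mean of the two replicate series; the arrays are illustrative stand-ins for the R1/R2 oxygen-uptake readings.

```python
def average_replicates(r1, r2):
    """Pointwise mean of two replicate time series of equal length."""
    if len(r1) != len(r2):
        raise ValueError("replicates must have the same number of readings")
    return [(a + b) / 2.0 for a, b in zip(r1, r2)]

# Illustrative oxygen-uptake readings for replicates R1 and R2.
r1 = [0.50, 0.42, 0.10]
r2 = [0.04, 0.40, 0.30]
averaged = average_replicates(r1, r2)
print(averaged)
```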


CHAPTER 3

MACHINE LEARNING METHODOLOGIES

3.1 BACKGROUND

Machine Learning algorithms can be thought of as search algorithms that traverse the set

of all possible hypotheses H (or concepts to be learned) in order to find a hypothesis h that can

be used to describe the training instances. Let us refer back to Section 1.1 to continue the definition and description of the simple List-Then-Eliminate algorithm. We have correctly assumed that, given the right examples, this learner will successfully learn our intended target concept: (numWheels = 4, canMove = true, object = car). This type of learner is able to classify unknown instances due to an inherent bias that guides it during the search of H, and this methodology is referred to as Inductive Learning (Mitchell 1997). The set of assumptions made

in an algorithm to guide its search is called the Inductive Bias; for this algorithm it is the

assumption that the target concept exists in the hypothesis space (c ∈ H). Occam’s Razor is a

very famous inductive bias employed by ML algorithms like ID3 for Decision Trees (Quinlan

1986), which states that simpler hypotheses should be preferred over complex ones.

The List-Then-Eliminate algorithm is not pragmatic, as the computational complexity of exhaustively searching the hypothesis space (which may be incomplete) is of the order:

O(n · ∏_{i=1}^{m} |a_i|) = O(n · |H|) ....................................................................................(3.1)

where a_i is the ith attribute of an instance, |a_i| is the cardinality of the discrete-valued attribute a_i, n is the number of instances, m is the number of attributes, and |H| is the size of the hypothesis space. A problem arises

when a target concept is too complex to be represented by the learner’s expressive power. This


can happen if the List-Then-Eliminate algorithm encounters an instance like (numWheels = 4, canMove = true, object = shoppingCart). If it is a query instance, the learner will misclassify it as a car; conversely, if it is a training instance, the learner will fail to learn the concept. A solution to this situation would be to enrich the concept space, for example by adding more descriptors instead of merely using two attributes to classify a car. Another way to enrich the concept space is to use a more expressive representation; for example, by using disjunctions as well as conjunctions in the representation, we can express any finite discrete-valued function (Mitchell 1997). Another predicament appears when the training instances presented to the learner contain erroneous information, for example (numWheels = 0, canMove = true, object = car). Such an instance would either 1) lead to the learning of a false concept, or 2) cause the algorithm to fail to converge upon a concept (and not learn anything)

learner, erroneous data, and other factors have to be taken into account when considering and

developing Machine Learning algorithms.
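The cost in Equation (3.1) can be illustrated on the toy car concept above; the attribute cardinalities below are assumed example sizes, not values from the text.

```python
from math import prod

# Two attributes, as in the car example; the cardinalities are assumed.
cardinalities = {"numWheels": 5, "canMove": 2}
H_size = prod(cardinalities.values())   # |H| = product of the |a_i|
n_instances = 100
comparisons = n_instances * H_size      # O(n * |H|) work for the search
print(H_size, comparisons)  # 10 1000
```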

The hypothesis space H explodes when the target class and/or attributes are not discrete but real-valued; referring back to Equation (3.1), |a_i| = ∞, thereby making |H| = ∞. The simple List-Then-Eliminate algorithm will not work, since it is not computationally feasible to traverse an infinite H. The task is then no longer a classification problem but rather a regression or

function approximation problem. Fortunately, there are ML methods for real-valued prediction

or modeling problems often based on principles drawn from classification methods and other

fields like biology, statistics, mathematics etc. Although techniques from other disciplines like

regression analysis, multivariate spline regression, extrapolation, interpolation, etc. are available

to approximate the target function, ML methods have proven to be very successful and efficient.


The next sub-sections provide an overview of the ML methods used in this thesis: Artificial Neural Networks (ANN), Support Vector Regression, Radial Basis Function Networks, rule-inferring systems, Model and Regression Trees, and Instance-Based Learning methods.

3.2 EAGER AND LAZY LEARNERS

Eager learners are those algorithms that use the training instances to create an a posteriori

model which is later used to classify query instances. These learners explicitly commit to a single

hypothesis, which is a global approximation of the target function that covers the entire instance

space. They cannot subsequently alter their global hypothesis until retrained. In most domains where the concepts do not change frequently (e.g., games, object recognition, locomotion), eager learners are highly suitable and successful. However, composting behavior changes radically based on various factors that cannot be taken into consideration; a model created with a certain substance under a specific temperature is not likely to be applicable to a different substance and/or temperature.

Instance-Based Learners, also known as lazy learners, do not explicitly commit to a global approximation; instead, they create a hypothesis when given a query instance. This increases their expressive ability by creating many local approximations to the target function, consequently making them more flexible by being able to modify parts of their hypothesis on the fly. These properties of lazy learners make them enticing for this domain; as a result, we have considered the k-Nearest Neighbor (k-NN) and Locally Weighted Regression (LWR) algorithms. They do not

perform any computation initially with the training instances; they merely store the instances for

use during classification, which is when most of the computation is performed. Using a distance function (e.g., Euclidean, Hamming, or Levenshtein distance), the nearest neighbors to the


query instance are selected and used to construct a hypothesis. Applying the locally created hypothesis, the learner then performs classification/prediction.

To illustrate the benefit of lazy learners, let us consider a hypothesis space consisting of

linear functions: the eager learner will learn a single linear function to cover the entire training

instance space and future instances. Conversely, although a lazy learner may be using linear

regression, it effectively uses a richer hypothesis space because it uses many different local

linear functions to form its implicit global hypothesis of the target concept.

It is computationally expensive to compute the distance for every single instance during

classification, which is of the order O(mn), where n is number of instances, m is number of

attributes. However, efficient data structures have been developed to quickly retrieve nearby

instances based on the distance function, e.g., kd-trees (Bentley 1975; Witten & Frank 2005), which are further discussed in Section 4.1.

The following is a list of the various notations used in the subsequent sections:

n – the total number of instances.
m – the number of attributes in an instance.
x_i – the ith instance, which can be thought of as a vector of m dimensions.
a_r^(i) – the rth attribute of the instance x_i.
w_j – the weight corresponding to the jth attribute.
x^(i) – the actual output value of the target class for the ith instance.
f(x) – the target function that is to be learned.
f̂(x) – the learned function, which is an approximation of the target function.
f̂(x_i) – the predicted output value of the target class for the ith instance.


3.3 LINEAR REGRESSION

Linear Regression is one of the simplest function approximation techniques that can be

used when all the attributes are numeric and the concept to be learned is fairly simple –

specifically if there is a linear relationship between the attributes and the target class. It can be

used to classify or predict and is presented here as a building block for other algorithms

described in this section. If the instance vector x_i from the instance space is <a_1^(i), a_2^(i), a_3^(i), ..., a_k^(i)>, then the learned function f̂(x), which is an approximation of the target function f(x), after regression is of the form:

f̂(x) = w_0·a_0 + w_1·a_1 + ... + w_k·a_k = Σ_{j=0}^{k} w_j·a_j (where a_0 = 1) ...............................(3.2)

The goal in this learning is to choose the weights w_0, w_1, ..., w_k for the above equation that minimize the sum squared error E given below (this is also the inductive bias that guides the search through the space of all possible weight values):

E = Σ_{i=1}^{n} ( x^(i) − Σ_{j=0}^{k} w_j·a_j^(i) )² ...........................................................(3.3)

Using Equation (3.2), we can rewrite Equation (3.3) as:

E = Σ_{i=1}^{n} ( x^(i) − f̂(x_i) )² ..........................................................................(3.4)

where x^(i) is the actual output class value of the ith instance, and n is the total number of instances. LWR approximates the target function using k neighbors; therefore, in Equation (3.4), n is replaced by the k instances. The gradient descent algorithm can be used to find the weights of Equation (3.2). A shortcoming of this method is that it can only find hypotheses of linear expressiveness; if the target concept is nonlinear, then regression will yield poor results.
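As a sketch of the gradient descent step mentioned above, the following fits the weights of Equation (3.2) for a single attribute plus bias by descending the squared error of Equation (3.3); the learning rate and epoch count are illustrative choices.

```python
def fit_linear(xs, ys, lr=0.01, epochs=2000):
    """Batch gradient descent on E = sum_i (y_i - (w0 + w1*x_i))^2."""
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Partial derivatives of the (averaged) squared error w.r.t. w0, w1.
        g0 = sum(-2 * (y - (w0 + w1 * x)) for x, y in zip(xs, ys)) / n
        g1 = sum(-2 * (y - (w0 + w1 * x)) * x for x, y in zip(xs, ys)) / n
        w0 -= lr * g0
        w1 -= lr * g1
    return w0, w1

# Data generated from y = 1 + 2x, so the weights should approach (1, 2).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = fit_linear(xs, ys)
```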


3.4 k-NEAREST NEIGHBOR

This is a popular and thoroughly studied method that has been useful for a wide range of

practical problems in spite of its simplicity. The lazy learner, the k-nearest neighbor algorithm (k-NN), finds the k nearest neighbors of the query instance using a distance function (e.g., Euclidean, Hamming, or Levenshtein), then another function is used to return a discrete or real output. The Euclidean distance function in an m-dimensional space (where m is the number of attributes) is described in Equation (3.5). Using the notations described in the beginning of this chapter, the Euclidean distance with respect to two instance vectors x_i and x_j is:

Euclidean Distance = d(x_i, x_j) = √( Σ_{r=1}^{m} (a_r^(i) − a_r^(j))² ) ......................................(3.5)

The value of k is usually determined empirically or through cross-validation techniques. Once the neighbors are selected, there are several ways to compute the output value, for example (k is the number of neighbors, x_q is the query instance, and x_1, ..., x_k are the k instances nearest to x_q):

1) For a discrete target function f : ℝ^n → V:

f̂(x_q) = argmax_{v∈V} Σ_{i=1}^{k} δ(v, f(x_i)) ...............................................................(3.6)

where δ(a, b) = 1 if a = b, and 0 otherwise. A majority voting scheme can also be used, which makes the algorithm similar to the Naïve Bayes method.

2) For a real-valued target function f : ℝ^n → ℝ:

f̂(x_q) = ( Σ_{i=1}^{k} f(x_i) ) / k ..........................................................................(3.7)


3) Weighting the contribution of each neighbor based on distance:

The weight of the ith neighboring instance can be calculated as w_i = 1 / d(x_q, x_i)².

Using w_i in Equations (3.6) and (3.7), we get, respectively:

f̂(x_q) = argmax_{v∈V} Σ_{i=1}^{k} w_i·δ(v, f(x_i))   and   f̂(x_q) = ( Σ_{i=1}^{k} w_i·f(x_i) ) / ( Σ_{i=1}^{k} w_i )
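The distance-weighted real-valued prediction above can be sketched as follows; the training pairs (attribute vector, target) are illustrative.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(instances, xq, k=3):
    """Distance-weighted k-NN regression: w_i = 1 / d(x_q, x_i)^2."""
    nearest = sorted(instances, key=lambda inst: euclidean(inst[0], xq))[:k]
    num, den = 0.0, 0.0
    for x, y in nearest:
        d = euclidean(x, xq)
        if d == 0:            # exact match: return its target directly
            return y
        w = 1.0 / d ** 2
        num += w * y
        den += w
    return num / den

# Illustrative (temperature, moisture) -> O2-rate pairs.
train = [([22, 30], 0.02), ([22, 50], 0.60), ([36, 70], 0.90), ([36, 50], 0.75)]
pred = knn_predict(train, [30, 55], k=3)
```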

In spite of its simple yet effective nature, this algorithm is susceptible to pitfalls: the curse of dimensionality caused by too many irrelevant attributes can confuse the algorithm; query classification can take a long time, causing slow performance on large training sets; and it does not create explicit generalizations about the data that can be easily inspected. Aha et al. (1991) developed techniques to overcome some shortcomings of the k-NN algorithm: 1) noisy exemplar pruning to better handle noisy instances, 2) attribute weighting to minimize the effect of irrelevant attributes, and 3) a framework for inferring the hypothesis learned.

Naturally, one may ask: if k-NN does not perform any training-time computation to build a model, what hypothesis has it learned? It builds one using the k neighbors at query time to classify; this has the implicit effect of splitting the instance space into many distinct regions. One way to view the hypothesis of the k-NN algorithm is by extracting rules that bound each region; in this manner rule sets can be inferred. It can be shown that when k and n, the number of instances, both become infinite in such a way that k/n → 0, the probability of error approaches the theoretical minimum for the dataset (Witten & Frank 2005). In that case k-NN approaches the Bayes optimal classifier, as it takes a weighted vote among all members of the instance space.


3.5 LOCALLY WEIGHTED LINEAR REGRESSION (LWR)

The concepts of Linear Regression and k-NN are applied in LWR, which attempts to

approximate a real valued target function by selecting some adjacent points in the hyperspace

and fitting them with a local approximation (e.g. a line using Linear Regression). These regional

piecewise approximations can then be combined to successfully approximate the entire target

function. The local approximations are made using linear, quadratic, Gaussian, etc. functions, which are called kernels. In order to use LWR effectively, it is necessary to decide upon distance and weighting functions. Similar to k-NN, the Euclidean distance can be used, and there are several choices for the weighting function, e.g., linear, Epanechnikov, tricube, Gaussian, or a simple constant (Atkeson 1997). Figure 3.1 illustrates a linear piecewise approximation to the neighbors

of a query point x in 2-dimensional space. The choice of kernel can affect the weight given to

each neighbor, as well as the number of neighbors selected. Gaussian kernels are portrayed

below.

Figure 3.1: The nearest neighbors to the query point x are used to fit a linear equation.

The number of neighbors considered and the weight given to them can be determined by a kernel function (the shaded area under the curve above).
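A minimal one-dimensional sketch of LWR as described above: each training point is weighted by a Gaussian kernel centred on the query, and a weighted least-squares line is solved locally. The bandwidth tau and the sample points are illustrative choices.

```python
import math

def lwr_predict(xs, ys, xq, tau=1.0):
    """Gaussian-weighted local linear fit, evaluated at the query xq."""
    w = [math.exp(-((x - xq) ** 2) / (2 * tau ** 2)) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw      # weighted means
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    var = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    cov = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = cov / var if var > 0 else 0.0
    return my + slope * (xq - mx)   # local line evaluated at the query

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]   # y = x^2: nonlinear, but locally linear
pred = lwr_predict(xs, ys, 2.0, tau=0.8)
```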


3.6 RADIAL BASIS FUNCTION NETWORKS (RBF)

RBF networks represent an interesting blend of instance-based and neural network learning algorithms (Broomhead and Lowe 1988). In this approach, the learned hypothesis is a function that is a linear combination of m kernel functions K_u[d(x_u, x)], often called basis functions or centers (Mitchell 1997). It is of the following form:

f̂(x) = w_0 + Σ_{u=1}^{m} w_u·K_u[d(x_u, x)] ..................................................................(3.8)

The ability of RBF networks stems from the freedom to choose different values for the weights and kernel functions. Each basis function is defined so that it decreases as the distance d(x_u, x) increases. Although f̂(x) is a global approximation to the target function, each of the m basis functions contributes only to a region near the point x_u in the instance space. Like LWR, this has the effect of forming a smooth piecewise fit of the target function, as illustrated in Figure 3.2. Usually the basis functions are Gaussian functions centered at a point x_u with some variance σ_u², although many others exist. The weights are usually found using the global error criterion defined in Equation (3.4); other measures also exist.

RBF networks give every attribute the same weight since all are treated equally in the

distance computation. This renders them ineffective in dealing with irrelevant attributes, unlike

multi-layer perceptrons which can automatically handle this. When the training example set is

too large and/or the target function is too complex, more centers (basis functions) should be used

and/or strategically placed, otherwise the RBF network may not form a good approximation as

illustrated in Figure 3.2. To alleviate this situation clustering algorithms can be applied to the

instances to suitably place basis functions in the instance space. RBF networks require less

computational resources to train compared to ANN’s feedforward backpropagation algorithm.
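Equation (3.8) can be sketched directly: a prediction is w_0 plus a weighted sum of Gaussian kernels centred at points x_u. The centres, weights, and width below are illustrative; in practice the centres may be placed by clustering, as noted above.

```python
import math

def rbf_predict(x, centers, weights, w0, sigma=0.5):
    """f(x) = w0 + sum_u w_u * exp(-d(x_u, x)^2 / (2*sigma^2)), 1-D input."""
    total = w0
    for xu, wu in zip(centers, weights):
        d = abs(x - xu)                      # 1-D distance d(x_u, x)
        total += wu * math.exp(-d ** 2 / (2 * sigma ** 2))
    return total

centers = [0.0, 1.0, 2.0]   # assumed basis-function centres
weights = [0.3, 1.0, 0.3]   # assumed trained weights
pred = rbf_predict(1.0, centers, weights, w0=0.1)
```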


Figure 3.2: Illustration of fitting training points using an RBF network with 10 Gaussian kernels (centers or basis functions). The width of the Gaussian Kernels is the standard deviation (0.5 in

this case). Observing the approximated graph we notice that 10 kernels are not enough for a good approximation of this many instances (100). Created with RBF Java Applet (Hong 1996)

3.7 SUPPORT VECTOR MACHINES (SVM)

SVM originated from research in statistical learning theory (Vapnik 1999). SVM uses

linear models to implement nonlinear classification boundaries by transforming the input space

into a higher dimensional space using a non-linear mapping. Using principles of computational

learning theory, in simplified terms we can state that, a hyperplane in the new higher

dimensional space is likely to be a complex curvy hyperplane in the lower dimensional space,

illustrated in Figure 3.3. This principle is exploited by SVM to classify complex instance spaces.

SVM principles can also be extended to approximate real-valued target functions; this is called Support Vector Regression (SVR), and both terms are used synonymously in this thesis.

SVM are also called kernel machines; they can use different functions to compute the distance between two instance vectors. Due to the increased number of calculations involved in the kernel computations, SVM are inherently slow; however, a very effective algorithm, Sequential Minimal Optimization (SMO), developed by Platt (1998) and Smola (1998), can be used to aid the implementation, as it runs in O(n²). SVM makes use of the maximal margin hyperplane


and a regression method using the ε-tube, which strongly prevents it from overfitting the data. This

creates an excessive smoothing effect as shown in the graph of Figure 5.12.

Figure 3.3: SVM principle illustrated, using dimensionality transformation,

from 2D to 3D to find a hyper-plane that separates the instances.

Recently, SVM have been stealing much of the limelight enjoyed by ANN, probably due to the fact that they require much less parameter tuning, thereby making them easier to use. SVM

share the same problem as RBF networks in dealing with irrelevant attributes. In fact, support

vector machines with Gaussian kernels (RBF kernels) are a particular type of RBF network, in

which basis functions are centered on certain instances, and the outputs are combined linearly by

computing the maximum margin hyperplane (Witten and Frank 2005).

3.8 REGRESSION TREES

Regression trees, based on decision trees and proposed by Breiman et al. (1984), are able to approximate real-valued functions. Decision tree learning is a powerful eager ML method

capable of learning any finite discrete-valued function by searching a powerfully expressive


hypothesis space which is complete. They learn disjunctive expressions and are robust to noisy

data, missing values, and errors. The regression tree building algorithm generates a decision tree

in which the leaves are marked with real values and the nodes with tests for real/discrete

attributes as illustrated in Figure 3.4. Because the tree is greedily grown as training instances are

encountered, this may cause the tree to over-fit the data, consequently resulting in poor

performance for unseen examples. To overcome this problem, pruning methods use validation

sets and cost-complexity measures. In spite of pruning, regression trees often yield trees that are

cumbersome and difficult to interpret. In contrast to a single global regression equation, this

method implicitly builds many regression functions to approximate the target function. The

decisions made at each internal node can be thought of as dividing the instance space and

discovering patterns in the data. These properties make it perform much better than simple

regression and allow it to handle non-linear relationships.

Figure 3.4: An example of a regression tree predicting real values.


3.9 MODEL TREES

Composting facilities in operation today use empirical and rule-based models that have

been created from pilot experimental runs (Das et al. 1998). This implies that algorithms

producing rules would be highly applicable to this domain. Hence, we consider Regression and

Model Trees and a variant algorithm which uses trees to find rules, called M5PRules (Frank &

Witten 1998). Model trees are decision trees with linear regression equations at the leaves

instead of terminal output values as shown by Figure 3.5. Thus they offer a more sophisticated

approach for predicting numeric values and are appropriate when the attributes are also numeric.

The M5 algorithm proposed by J. R. Quinlan (1992) for constructing model trees is described

below.

Figure 3.5: An example of a model tree (subsection) for composting

showing 7 linear regressions at each leaf node; LM44-LM50.

Model trees are constructed by a divide-and-conquer approach similar to decision trees.

The set of training examples, T, is either associated with a leaf, or some test is chosen that splits T into subsets based on the test outcome, and the same process is applied recursively to the subsets. The


first step in building the tree is computing the standard deviation of the target values of the

training examples, which is used as a measure of the error at that node. Then the expected error

reduction is calculated as a result of testing an attribute at that node. Using a greedy approach,

the attribute which maximizes the expected error reduction is chosen for splitting at that node.

The expected error reduction, or Standard Deviation Reduction (SDR), is calculated by

SDR = sd(T) − Σ_i ( |T_i| / |T| ) × sd(T_i) ..............................................................(3.9)

where T1, T2, …, Tn are the sets produced from splitting the node according to the chosen attribute. The splitting process terminates when the target values of the instances that reach a node vary only slightly, or when just a few instances reach the node.
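As an illustration (this is not code from the thesis), the SDR computation of Equation 3.9 can be sketched in Java as follows; the parent set and candidate split used here are hypothetical.

```java
import java.util.Arrays;

// Sketch of the Standard Deviation Reduction (SDR) of Equation 3.9:
// SDR = sd(T) - sum_i (|T_i| / |T|) * sd(T_i)
class SdrExample {
    // Population standard deviation of a set of target values.
    static double sd(double[] t) {
        double mean = Arrays.stream(t).average().orElse(0.0);
        double ss = Arrays.stream(t).map(v -> (v - mean) * (v - mean)).sum();
        return Math.sqrt(ss / t.length);
    }

    // Expected error reduction for one candidate split of the parent set.
    static double sdr(double[] parent, double[][] subsets) {
        double reduction = sd(parent);
        for (double[] ti : subsets) {
            reduction -= ((double) ti.length / parent.length) * sd(ti);
        }
        return reduction;
    }

    public static void main(String[] args) {
        double[] t = {1, 1, 2, 9, 10, 10};
        // A good split: each subset has low internal spread, so SDR is large.
        double[][] split = {{1, 1, 2}, {9, 10, 10}};
        System.out.printf("SDR = %.4f%n", sdr(t, split));
    }
}
```

The M5 tree-building step would evaluate this quantity for every candidate attribute test and pick the one maximizing it.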

When a model tree is used to predict the target value for an unseen instance, the tree is

followed down to a leaf as in a decision tree, using the instance’s attribute values to make routing

decisions at each node. The leaf contains a linear model based on some of the attribute values,

and this is evaluated for the test instance to yield a predicted value. However, instead of using this value directly, a smoothing process is used to compensate for the sharp discontinuities that will inevitably occur between adjacent linear models at the leaves of the pruned tree, particularly for

some models constructed from a smaller number of training examples.

Smoothing can be done by producing linear equations for each internal node, as well as the leaf nodes, during tree building. A linear model is created at each node, using

standard regression techniques considering only the attributes that are tested in the sub-tree

below this node. Once the leaf model is used to obtain the raw predicted value for a test instance,

that value is filtered along the path back to the root; at each node it is combined with the value predicted by that node's linear model. The smoothing calculation used to combine the values at all


nodes is (Frank et al. 2005): p' = (np + kq) / (n + k), where p' is the prediction passed up to the next higher node, p is the prediction passed to this node from below, q is the value predicted by the model at this node, n is the number of training instances that reach the node below, and k is a smoothing constant. Empirical results show that smoothing substantially increases the accuracy of predictions.
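A minimal sketch of one smoothing step, with hypothetical values for p, q, n, and k (the value k = 15 used below is an assumption; WEKA's M5 implementation is commonly cited with that default):

```java
// Sketch of the model-tree smoothing formula p' = (n*p + k*q) / (n + k),
// applied at one node on the path from the leaf back to the root.
class SmoothingExample {
    // Combine the prediction passed up from below (p) with this node's
    // own linear-model prediction (q).
    static double smooth(double p, double q, int n, double k) {
        return (n * p + k * q) / (n + k);
    }

    public static void main(String[] args) {
        double p = 5.0;   // raw prediction passed up from the node below (hypothetical)
        double q = 7.0;   // prediction of the linear model at this node (hypothetical)
        int n = 10;       // training instances reaching the node below
        double k = 15.0;  // smoothing constant (assumed default)
        System.out.printf("smoothed = %.3f%n", smooth(p, q, n, k));
    }
}
```

Note that as n grows, the leaf's own prediction dominates; smoothing mainly matters for leaves built from few training examples, as the text observes.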

3.10 ARTIFICIAL NEURAL NETWORKS (ANN)

An ANN is based on biological neurons and has been highly successful in capturing

complex non-linear relationships among many variables. There is much literature on ANN and it

has been successfully applied to many fields, including our domain of study using the same

dataset. Previously, ANNs gave good results in modeling composting (Liang et al. 2003b), and hence they are considered in this thesis for comparative purposes. A brief overview of ANN is presented in this sub-section, but the reader is referred to the publication by Liang et al. (2003a) for a more thorough exposition on the application of ANN to this domain.

An artificial neuron (AN), also called a perceptron, is essentially a function of all the

weighted inputs that come into it. It uses an activation function based on the net input signal

(usually a summation or a product) and a bias to output a real value. There are several types of

activation functions which usually have the property of being a monotonically increasing

mapping, e.g. the popular Sigmoid function. The goal of a single perceptron is to learn the

weights of the various inputs that come into it.
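The weighted-sum-and-activation computation of a single neuron described above can be sketched as follows (the weights, bias, and inputs are hypothetical, not learned values from this research):

```java
// Sketch of a single artificial neuron (perceptron) with a sigmoid
// activation function, as described in the text.
class NeuronExample {
    // The popular Sigmoid activation: a monotonically increasing mapping to (0, 1).
    static double sigmoid(double net) {
        return 1.0 / (1.0 + Math.exp(-net));
    }

    // Output = sigmoid(bias + sum_i w_i * x_i)
    static double fire(double[] w, double[] x, double bias) {
        double net = bias;
        for (int i = 0; i < w.length; i++) net += w[i] * x[i];
        return sigmoid(net);
    }

    public static void main(String[] args) {
        double[] w = {0.4, -0.6, 0.2};   // hypothetical learned weights
        double[] x = {1.0, 0.5, 2.0};    // one input instance
        System.out.printf("output = %.4f%n", fire(w, x, 0.1));
    }
}
```

Training (e.g. Backpropagation) adjusts w and the bias to minimize the squared error over all instances; the forward computation itself is just this weighted sum passed through the activation.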

Mimicking the structure of the brain, we can create an interconnected network of

perceptrons which is known as an artificial neural network. The weights are learnt subject to

minimizing an error criterion, which is usually the sum of the squared errors of all the instances.


If we think of the hypersurface produced by the error as a function of all the weights, then our

goal is to find the lowest point on that surface. Therefore the learning process is to find the

weight vector that minimizes the error. The possible number of weights is infinite, so the

hypothesis space is quite vast; however, it can be proved that a single perceptron can learn any linear function using the perceptron training rule, provided the learning rate is sufficiently small

(Minsky & Papert 1969; Mitchell 1997).

It has been proved that Feedforward Neural Networks (FFNN) with monotonically

increasing differentiable functions can approximate any continuous function with one hidden

layer, provided that the hidden layer has enough hidden neurons (Hornik 1989). However, due to the nature of the Backpropagation algorithm, training time is usually slow and can take up to

several thousands of iterations. Furthermore, the computational complexity in training an

exponential number of neurons is impractical; therefore, often we must consider the ratio of

approximation quality to the time required. A major concern in using ANN is that it is a black

box approach – there are no human understandable indicators to show what the network has

learnt. Also, it is not possible to encode prior knowledge into an ANN. Although ANNs might perform very well on the training data and in some test cases, there is always the chance they may fail for some unknown cases. There are many advantages to using this powerful ML tool, but in

order to make a wise decision about the algorithm a comparative table is presented below in

Table 3.1.


Table 3.1: Pros and Cons of an Artificial Neural Network

Advantages:
- Very fast query performance (classification).
- Robust to noise.
- Robust to missing values.
- Can handle irrelevant attributes.
- Can handle a large number of attributes.
- Can deal with sensory data that is not human readable.

Disadvantages:
- Training time is considerably slow.
- Creates a black box model with no human readability; cannot be ported.
- Cannot assimilate existing domain knowledge.
- Different training runs using the same settings and data will create different models due to the nature of the Backpropagation algorithm.
- Creates an excessive smoothing effect. Refer to Figure 5.4 and Section 5.3 for details.
- Many parameters need to be set, requiring strong familiarity with ANN; the structure of the network is an important consideration.
- Cannot modify the model once it is already trained.

3.11 HYBRID METHODS

Lazy versions of eager learning methods are designed to improve performance and prediction accuracy (Zhou et al. 1999). The k training instances nearest to a testing instance are selected, and these neighbors are then used to build a model at testing time. The k-nearest variants of model trees and support vector machine regression were developed for this research and coded in the Java programming language. Although model trees and support vector machine regression are eager methods, in these variants the models are created at testing time. Because these

learning methods are lazy, model building time is non-existent, but testing time may be longer

than that of the original eager versions. These methods can sometimes avoid over-fitting because

they use only k-nearest neighbors to build models. These methods are more robust to training

noise as well (Mitchell 1997). One of the key advantages of these methods is their flexibility to

change the model created during query time. In an environment like composting this has proven

to be successful.
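The lazy-eager idea above can be sketched as follows. This is an illustrative simplification, not the thesis's actual code: a simple local average stands in for the eager learner (in this research, WEKA model trees or SVR were trained on the neighbors instead).

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of a lazy-eager hybrid: at query time, select the k nearest
// training instances and fit a model to those neighbors only.
class LazyEagerSketch {
    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static double predict(double[][] xs, double[] ys, double[] query, int k) {
        Integer[] idx = new Integer[xs.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Lazy step: rank training instances by distance to the query.
        Arrays.sort(idx, Comparator.comparingDouble(i -> euclidean(xs[i], query)));
        // "Eager" step on the k neighbors; a real hybrid would train a
        // model tree or SVR model here instead of averaging.
        double sum = 0;
        for (int i = 0; i < k; i++) sum += ys[idx[i]];
        return sum / k;
    }

    public static void main(String[] args) {
        double[][] xs = {{1, 1}, {1, 2}, {8, 8}, {9, 9}};  // hypothetical training data
        double[] ys = {1.0, 1.2, 8.0, 9.0};
        System.out.println(predict(xs, ys, new double[]{1, 1.5}, 2));
    }
}
```

Because the local model is rebuilt for every query, there is no training phase, but query time grows with the cost of fitting the eager learner on each neighborhood.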


3.12 ADAPTIVE NEURO-FUZZY INFERENCING SYSTEM (ANFIS)

This is a true hybrid machine learning method with a Fuzzy Inferencing System (FIS) at

the core and an adaptive ANN to learn the parameters of the FIS. This method has the advantage

of using minimal domain knowledge and producing human-readable rules; it was developed by Jang (1993). Other ways of learning the parameters also exist, e.g. Morris (2005)

used an Evolutionary Algorithm to evolve the membership functions. Traditionally, an FIS requires some prior knowledge, usually in the form of human domain expertise, to create

linguistic variables, which may not always be available. ANN is capable of handling complex

relationships among many variables, and FIS can model vagueness, produce human-readable

rules and is transparent. Thus, the neuro-fuzzy method combines the best of both paradigms

where minimal prior knowledge is required and the powerful learning ability of ANN can be

used to create human readable rules/knowledge. An implementation of Jang’s algorithm is

available from Matlab’s package called ANFIS, which was used in this research. ANFIS

exhibited excellent performance using 216 rules to describe the composting process and

improved upon the results of previous ANN models.

3.13 COMBINING MODELS - ENSEMBLE APPROACHES

The idea behind ensemble approaches is straightforward: they try to make predictions more reliable by combining the outputs of several different models. The most prominent methods

are called bagging, boosting and stacking (Witten & Frank 2005). The remainder of this

subsection describes the ensemble approaches that were used, followed by a discussion of their

usability.


Bagging uses the same learning scheme to create multiple models from different

samplings of the training data. The models are trained on randomly sampled bags from the

training data with replacement and then their predictions are combined by a weighted average.

This may appear to be redundant but it allows several hypotheses to be developed concurrently

by each of those learning methods, and surprisingly those hypotheses turn out to be quite

different from each other due to the stochastic nature of ML. This is useful when a single model

cannot converge on a hypothesis to describe most of the training data or when there is the

possibility of the search algorithm becoming stuck in a local optimum. Thinking in terms of the search problem, bagging allows several methods to explore the hypothesis space simultaneously. With multiple searches, it is unlikely that two methods will take the same search path and create the same model.
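The bagging procedure described above can be sketched as follows; a trivial mean predictor stands in for the base learning scheme, and a simple (unweighted) average combines the models.

```java
import java.util.Random;

// Sketch of bagging: draw bootstrap samples with replacement, "train" one
// model per bag, and combine the models' predictions by averaging.
class BaggingSketch {
    // Stand-in base learner: the mean target value of one bootstrap sample.
    static double trainOnBag(double[] ys, Random rng) {
        double sum = 0;
        for (int i = 0; i < ys.length; i++) sum += ys[rng.nextInt(ys.length)];
        return sum / ys.length;
    }

    static double baggedPrediction(double[] ys, int numBags, long seed) {
        Random rng = new Random(seed);  // seeded so runs are reproducible
        double sum = 0;
        for (int b = 0; b < numBags; b++) sum += trainOnBag(ys, rng);
        return sum / numBags;  // combine the ensemble's outputs
    }

    public static void main(String[] args) {
        double[] ys = {1, 2, 3, 4, 5};  // hypothetical target values
        System.out.println(baggedPrediction(ys, 50, 42L));
    }
}
```

Each bag sees a different random resample of the data, so each model differs slightly; averaging them reduces the variance of the combined prediction.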

Boosting can be thought of as an enhanced version of bagging, where the same machine

learning algorithm is used to create multiple models, but in a way that complements each other

(Witten & Frank 2005). Boosting, unlike bagging, does not build the models separately, but

rather learns from the performance of previously built models. It encourages new models to

become experts for training instances misclassified by preceding models (it does so by assigning

weight values to instances). Additive regression, based on boosting principles, allows boosting to be applied to regression models rather than classification. For a detailed explanation of this

method please refer to Witten & Frank (2005).

Stacking, unlike bagging, is built by combining different learning algorithms. It is less popular than the former two due to the difficulty of analyzing it and the lack of generally accepted best practices; there are too many variations of the basic idea (Witten & Frank 2005). It uses a

metalearner (level-1 learner) to replace the voting and weighted average mechanism used in


bagging. The metalearner learns a functional relationship between the outputs of the base

models, also called the level-0 models. Because multiple learners can be combined in a

sophisticated manner, it can take the best from each algorithm. However, it is very slow and each

learner has to be adjusted, requiring specific understanding. Furthermore, the wide choice of

level-0 learners also presents a problem because there is no rule about the selection of level-0

learners, other than performing several empirical runs using a combination of learners based on

user expertise.

In most cases combining multiple models has the effect of improving predictive

performance, although the reverse can also happen. Improvements are not guaranteed; theoretical analyses and degenerate cases exist that show failure. A major disadvantage of these

ensemble methods is that they are difficult to analyze, especially with incomprehensible models

(Witten & Frank 2005). When they perform well, it is not easy to understand what factors are

contributing to the improved decisions. Some researchers cast doubt if aggregate methods really

have anything to offer over traditional ones if properly used (Seeger 2001). Situations may arise

where the increase in accuracy versus computational cost of training and classifying may not be

worth the additional effort.


CHAPTER 4

IMPLEMENTATION DETAILS

4.1 LAZY METHODS

The lazy learning algorithms used in this thesis, k-NN and Locally Weighted Regression,

were implementations from the WEKA package (WEKA 2006; Witten & Frank 2005). WEKA, which stands for Waikato Environment for Knowledge Analysis, is an open-source software package

for data mining, written in Java. It is powerful, flexible, and expandable, allowing machine

learning schemes written in Java to be incorporated into the package. It has a graphical user

interface (GUI), command line interface, scripting capability and sophisticated full-featured API,

which is one of the primary reasons for using WEKA in this thesis.

The best implementations of k-NN use a kd-tree (Bentley 1975; Witten & Frank 2005) to store the training instances, which provides O(log n) time for insertion, deletion, and query, and O(n log n) time for creating the tree. For k-NN, values of k (the number of neighbors to use)

were varied in certain intervals from 3 to 50 to find the best number. Cross validation can also be

used to find the best value of k, although it is computationally expensive. An inverse-distance weighting function was used to penalize more distant neighbors. Due to the small number of attributes in this

dataset, we did not have to apply techniques to deal with irrelevant ones. The range of values in

the attributes (moisture: 30%-70%, temperature: 22°C-57°C) was also comparable, so no

normalization techniques were used. We did not put an upper limit on the window size for

training instances stored in memory.
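The inverse-distance weighting mentioned above can be sketched as follows; the neighbor distances and target values here are hypothetical, and the distances are assumed to be precomputed.

```java
// Sketch of inverse-distance weighting for k-NN regression: each neighbor
// contributes its target value weighted by 1/distance, so closer neighbors
// count more toward the prediction.
class InverseDistanceExample {
    static double weighted(double[] dist, double[] target) {
        double num = 0, den = 0;
        for (int i = 0; i < dist.length; i++) {
            double w = 1.0 / dist[i];  // assumes dist[i] > 0
            num += w * target[i];
            den += w;
        }
        return num / den;  // weighted average of the neighbors' targets
    }

    public static void main(String[] args) {
        double[] d = {1.0, 2.0, 4.0};     // hypothetical neighbor distances
        double[] y = {10.0, 20.0, 40.0};  // their target values
        System.out.println(weighted(d, y));
    }
}
```

A query point coinciding with a training instance (distance zero) would need special handling, e.g. returning that instance's target directly.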


The LWR implementation in WEKA is based on Atkeson's (1997) algorithm. The k

nearest neighbors are found based on Euclidean distance, using values of k which performed best in cross validation (between 20 and 30). The Gaussian weighting kernel is then applied to

the k neighbors, whose values are then combined in a decision stump to generate the output.

Often, several runs are required to obtain the best parameters; Java programs were written to automate this process (a screenshot is shown in Figure 4.3).

4.2 EAGER METHODS

Eager method implementations required for this project were available from the WEKA

package: Support Vector Machines, Artificial Neural Network, RBF network, linear regression,

model trees and regression trees. Figure 4.1 provides the graphical interface of WEKA

Explorer’s classification panel.

Figure 4.1: WEKA Explorer’s GUI Classification Panel.


WEKA’s SVM implementation used the Sequential Minimal Optimization (SMO)

algorithm developed by Smola and Scholkopf (1998) for training a support vector regression

model. Although not required for our dataset, it can globally replace all missing values and

transform nominal attributes into binary ones. All attributes were normalized, and the complexity

factor of SVR was varied from 0.1 to 1000 in uneven steps. It is necessary to carefully set the

value of epsilon (the amount of deviation tolerated when fitting the curve, a parameter in the SVM equation) because it partly controls the extent of overfitting. When using the

polynomial kernels, we did not feel it was necessary to transform the input space by an exponent

higher than 3, as cubic transformations will suffice for most complex surfaces in the feature

space (Isozaki 2002). Additionally, RBF kernels were also used, and the gamma for the kernel

was adjusted accordingly.

ANNs and RBF networks are types of feedforward networks because they do not contain

any cycles and the network’s output depends only on the current input instances. The RBF

network implementation from WEKA used normalized Gaussian kernels with the k-means

clustering algorithm to provide the basis functions. It standardizes all numeric attributes to zero

mean and unit variance, and next applies the clustering algorithm to generate clusters, and then

uses symmetric multivariate Gaussians to fit the data from each cluster. The number of clusters

(for k-means to generate) and the ridge value of the linear regression were adjusted empirically

through cross validation. The ANN implementation in WEKA provides a GUI panel (shown in

Figure 4.2) in which the nodes, structure, and connections of the neural net can be customized; however, a major drawback of WEKA's neural network is the inability to save and load

a network structure once created. Because ANNs were not a significant focal part of this research

we did not use structural variations of networks, e.g. Ward networks, Jordan-Elman networks


etc. The WEKA implementation uses a sigmoid activation function, has abilities to decay the

learning rate per epoch, can use a seed to randomly initialize the weights, and has other standard

options. However, it has no heuristics for automatically suggesting the number of hidden nodes.

Figure 4.2: Sample structure of an ANN used for prediction with

a single hidden layer consisting of 20 neurons. (WEKA)

For Model Trees, a modified implementation of Quinlan's (1992) M5 algorithm was used

which is capable of post-pruning to reduce overfitting and smoothing to compensate for sharp

discontinuities that will likely occur between adjacent linear models at the leaves of the pruned

tree (Witten & Frank 2005). To build a regression tree in WEKA, a decision tree induction

algorithm is used to build an initial tree using information gain/variance in a greedy manner.

Afterwards it uses reduced-error pruning with backfitting to generate regression equations along

the path back up. It produces stable models, which means given the same training set, it will

output the same model. It can handle noise, missing attribute values and irrelevant attributes.


4.3 HYBRID METHODS

The four-dimensional instance space represented by time, moisture, temperature, and oxygen uptake rate appears to have seemingly random patterns in areas that could not be captured by experimentation. In order to interpolate those complex areas of hyper-curves, we developed an

algorithm that uses powerful eager learners like SVM and Model Trees to build a model during

query time from several selected nearest neighbors. Writing our own hybrid method instead of

using the Knowledge Flow feature of WEKA to graphically design customized hybrid schemes

gave us more control. It was written in Java 5.0 programming language using the powerful API

features provided by WEKA keeping in mind integration with the GUI. The lazy-eager methods

used k-NN with Euclidean-distance as the lazy component and one of 3 other eager methods

namely parameter-tuned SVM, Model Trees, and RBF Networks. We did not use a weighting kernel function for the k neighbors found, because we felt a powerful eager learner would be able to implicitly weight each instance during training.

4.4 HYBRID NEURO-FUZZY SYSTEM

We continued our hybrid learning scheme experimentation with Matlab’s implementation

of the neuro-fuzzy scheme developed by Jang (1993) in a package called ANFIS (Adaptive

Neuro Fuzzy Inferencing System). ANFIS provides a platform based on Matlab to

programmatically or graphically use the learning scheme. Several utility functions were written

in Matlab’s programming language for partitioning and manipulating the data for testing and

training. An example structure of the adaptive network is shown in Figure 4.4.


Figure 4.3: Screenshot of Lazy-Eager hybrid algorithm running in batch mode.

Figure 4.4: The structure of an adaptive neuro-fuzzy system created by ANFIS.


CHAPTER 5

EVALUATION, RESULTS AND ANALYSIS

In this thesis, we have employed various types of ML methods to understand the

applicability of ML schemes to this domain and extract an accurate model. Hence, if one model

performs better than another, can we state that it is a better model? The difference in model

accuracy based on one dataset may be due to estimation errors and should not be the sole basis to

conclude on comparative model performance. Alternatively, how do we know if one ML scheme

is actually better than another for this domain? The difference in method performance may be a

chance effect (Witten & Frank 2005) in the estimation process. This chapter attempts to answer

these questions based on experimental results and statistical tests. It presents quantitative

measurements of performance, followed by graphical presentation of how well each model

approximated the evaluation dataset. Finally, statistical significance tests are also performed to

compare the various learning methods at a confidence level of at least 95%.

Generalization is the primary goal of Machine Learning algorithms, so the more data available, the better the chances of learning the correct pattern; however, limited data poses a

great challenge to the learning process. Estimating the accuracy of a model or ML scheme on the

training data is quite simple, but due to bias and variance in the estimate, we are unsure of the

accuracy of the model and method on future unseen data with unknown distribution (Mitchell

1997). Therefore, when only limited data is available, numerous metrics for comparing the

accuracy of two models and statistical tests for comparing the accuracy of two ML methods were

considered. Although there is ongoing debate regarding the best way to learn and compare


hypotheses from limited data, it is generally accepted that the metrics presented in this chapter

are the best ways (Witten & Frank 2005; Mitchell 1997) to evaluate models and ML methods for

continuous prediction.

5.1 MODEL EVALUATION METRICS

In order to evaluate the performance of the various models obtained from a learning

scheme, we used an independent test set, which has not only been absent from training but also is

the product of a separate composting experiment performed at a different temperature setting

(contains 1460 instances at 34°C). We felt that this approach would have the best objectivity

given the limited data and would simulate the real-world situation. The ML methods are trained on the 8760 patterns of training samples, and subsequently tested on the independent 1460-instance test set.

Several measures of the error estimates are used to remain objective in our evaluations,

namely Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Correlation

Coefficient (CC or r2), which are described below. The following notations are used in the

descriptions: pi – the predicted value of the ith instance, ai – the actual/observed value of the ith instance, n – the number of instances used in testing, and ā – the mean of the actual values, ā = (1/n) Σ aj for j = 1 to n.

• RMSE is the most commonly used measure, RMSE = sqrt( (1/n) Σ (pi − ai)² ) for i = 1 to n ..................(5.1)

• MAE is another common measure, MAE = ( Σ |pi − ai| ) / n for i = 1 to n ........................................(5.2)


• The Correlation Coefficient (r2) measures the statistical correlation between the actual

and predicted target class values. It has values between −1 and 1, where 1 indicates

perfect correlation, and 0 indicates no correlation. This measure is scale independent,

meaning that the difference between pi and ai does not matter as long as they move in

the same direction. A high r2 indicates good correlation between ai and pi implying

good performance.
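As a concrete illustration (this is not the evaluation code used in the thesis), the error measures of Equations 5.1 and 5.2 can be computed on hypothetical predictions as follows:

```java
// Sketch of the error measures of Section 5.1 on toy predictions.
class MetricsExample {
    // Equation 5.1: root mean square error.
    static double rmse(double[] p, double[] a) {
        double s = 0;
        for (int i = 0; i < p.length; i++) s += (p[i] - a[i]) * (p[i] - a[i]);
        return Math.sqrt(s / p.length);
    }

    // Equation 5.2: mean absolute error.
    static double mae(double[] p, double[] a) {
        double s = 0;
        for (int i = 0; i < p.length; i++) s += Math.abs(p[i] - a[i]);
        return s / p.length;
    }

    public static void main(String[] args) {
        double[] actual    = {1.0, 2.0, 3.0};  // hypothetical observed values
        double[] predicted = {1.1, 1.9, 3.3};  // hypothetical model outputs
        System.out.printf("RMSE = %.4f, MAE = %.4f%n",
                rmse(predicted, actual), mae(predicted, actual));
    }
}
```

Because RMSE squares the residuals before averaging, it penalizes large errors more heavily than MAE, which is why both are reported side by side in the tables that follow.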

5.2 MODEL DEVELOPMENT & EVALUATION

The three input variables (descriptors) that were used for all model developments were

temperature, moisture content and time during composting. In order to objectively and fairly

compare our results with those that have been published, we used a model development

methodology similar to Liang et al. (2003a). Model development for each method had two

phases: 1) the first stage consisted of selecting the preferred parameter settings for that method

based on minimizing the errors (RMSE, MAE) and maximizing r2, and 2) the second stage

consisted of evaluating the model created using the preferred parameters against the separate

evaluation data set (this data set was not used during model development). The model

development dataset consisted of 8760 patterns for six temperatures and five moisture contents,

which was solely used to determine the preferred parameter settings for each method. Once the

preferred parameter settings were selected, the final model was built and then evaluated on the

evaluation data set of 1460 data patterns. The evaluation data set contained patterns at 34°C

temperature (this temperature setting was not present in the development data set) with five

moisture contents (30%, 40%, 50%, 60%, and 70%). For a more complete description, please

refer to the source of the data set, Liang et al. (2003a,b).


Different methods need different forms of data partitions, e.g. ANN, ANFIS use three

partitions while SVM, k-NN, RBF etc. use two. The ANN used the following: 1) a training set to

adjust the weights (i.e. learn) during the training process, 2) a validation set to evaluate the

accuracy of ANN models during training in order to determine when to stop training and avoid

overfitting, and 3) a testing set to evaluate the resulting model created by ANN. Other methods

like SVM, RBF do not make use of a validation set; they simply use training and testing sets as

previously described. Although literature may refer to these sets by different names, they are

essentially the same. The validation set and testing set as defined in this thesis are synonymous

to the testing set and production set respectively as used by Liang et al. (2003a). For all methods,

one replicate of 43°C (730 instances) was set aside as the testing set, while the remaining data

(8030 instances) was allocated for training. When required, the training set partitioning strategy

to create the validation set was chosen based on how the model created by a method performed

against the remaining test set (730 instances); the effects of choosing different training set

partitions for post-pruning in the Model Tree algorithm are shown in Table 5.1 (the 87% partitioning

strategy gave the lowest errors and is thus preferred). Next, the preferred parameters for the ML

method were selected based on the performance on the chosen testing set. An example of how

the preferred parameters were chosen for ANNs during the model development phase is shown

by Tables 5.2 and 5.3, which tabulate the effects of using different learning rates and number of

hidden nodes. Table 5.4 shows the effect of different k-values (number of nearest neighbors) for

models created by two hybrid lazy-eager learners based on the training-testing split mentioned

before. The shaded green cell indicates the parameter value chosen for final model training.
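The partition sizes described above can be sketched as follows. This is an illustrative reconstruction (class and method names are hypothetical), not the WEKA-based code used in the thesis:

```java
// Illustrative sketch of the data partitioning described above: one 43°C
// replicate (730 instances) is held out as the testing set, the remainder is
// the training pool, and for methods that need a validation set, 87% of the
// pool stays in training while the rest becomes validation.
public class Partitioning {
    public static int[] sizes(int total, int testSize, double trainFraction) {
        int trainPool = total - testSize;                        // 8760 - 730 = 8030
        int train = (int) Math.round(trainPool * trainFraction); // 87% of the pool
        int validation = trainPool - train;
        return new int[] { train, validation, testSize };
    }

    public static void main(String[] args) {
        int[] s = sizes(8760, 730, 0.87);
        System.out.println("train=" + s[0] + " validation=" + s[1] + " test=" + s[2]);
    }
}
```

With the counts from the text (8,760 instances total, one 730-instance replicate held out), the 87% strategy yields 6,986 training, 1,044 validation, and 730 testing instances.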


Table 5.1: Effect of different training set partitioning strategies for Model Trees on the testing set. The 87% partitioning, which gave the lowest errors, was chosen.

Data partitioning (% in training set)   CC       MAE      RMSE
50%                                     0.9432   0.0831   0.1303
66%                                     0.9441   0.0815   0.1294
70%                                     0.9473   0.0800   0.1263
75%                                     0.9462   0.0811   0.1287
80%                                     0.9488   0.0794   0.1268
82%                                     0.9472   0.0805   0.1286
85%                                     0.9495   0.0770   0.1247
87%                                     0.9507   0.0767   0.1233
90%                                     0.9478   0.0778   0.1263

Table 5.2: Effect of different learning rates (with decay) on model errors for ANN on the testing set during the model development phase. The value of 0.5 was chosen since it gave the lowest errors.

learning   momentum   training set size   validation set size   nodes in       CC       MAE      RMSE
rate                  (% of all data)     (% of training set)   hidden layer
0.2        0.1        90%                 20%                   10             0.8957   0.1577   0.1985
0.3        0.1        90%                 20%                   10             0.8570   0.1489   0.2029
0.4        0.1        90%                 20%                   10             0.8539   0.1501   0.2051
0.5        0.1        90%                 20%                   10             0.8961   0.1216   0.1750
0.6        0.1        90%                 20%                   10             0.8952   0.1295   0.1837
0.8        0.1        90%                 20%                   10             0.8866   0.1342   0.1837

Table 5.3: Effect of hidden node numbers on ANN model errors (using the testing set) during the model development phase. The value of 22 was chosen.

nodes in hidden layer   learning rate   momentum   CC       MAE      RMSE
15                      0.5             0.1        0.8824   0.1302   0.1854
18                      0.5             0.1        0.8895   0.1276   0.1801
20                      0.5             0.1        0.8966   0.1242   0.1812
22                      0.5             0.1        0.8971   0.1211   0.1731
25                      0.5             0.1        0.8899   0.1297   0.1799
30                      0.5             0.1        0.8873   0.1281   0.1818


Table 5.4: Effect of different k-values on MAE for lazy-eager hybrid learners on testing set during the model development phase. Values of 15 and 20 were chosen for the two methods.

k-value Lazy model tree Lazy SVM

5 0.07822373 0.078247598

10 0.07768349 0.077152532

15 0.07538994 0.074708158

20 0.07689541 0.071553716

25 0.07684661 0.074557203

30 0.07719097 0.074925701

50 0.07885324 0.083115533

To aid in finding the preferred parameter settings during model development, programs

were written in Java (Java 1.5 2005) (Figure 4.3) using the WEKA API to automate parts of this

lengthy process. When applicable, literature recommendations were also used for parameter

tuning. The results of various runs are tabulated in Tables 5.1 through 5.4. When required by an

algorithm, preprocessing was performed, for example, normalizing for SVM, standardizing for

RBF, etc.
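The preprocessing mentioned above can be sketched as follows; this is a minimal illustration of min-max normalization and z-score standardization, not the WEKA filters actually used:

```java
// Minimal sketches of the two preprocessing steps mentioned above:
// min-max normalization to [0, 1] (applied here before SVM training) and
// z-score standardization (applied before RBF network training).
public class Preprocess {
    public static double[] normalize(double[] x) {
        double min = x[0], max = x[0];
        for (double v : x) { min = Math.min(min, v); max = Math.max(max, v); }
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - min) / (max - min);
        return out;
    }

    public static double[] standardize(double[] x) {
        double mean = 0, var = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        for (double v : x) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / x.length); // population standard deviation
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - mean) / sd;
        return out;
    }
}
```

For the five moisture levels (30% to 70%), normalize maps them to 0, 0.25, 0.5, 0.75, and 1.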

Table 5.5 shows the performance of the final models created by each method against the

evaluation dataset. The results indicate that models created by hybrid lazy-eager learners have

the best performances, followed closely by models created by lazy, tree based, and rule based

ML schemes. Kernel based methods like SVM, ANN and RBF network, usually exhibiting

superior results in other domains, surprisingly did not produce the best models.
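For reference, the error metrics reported throughout this chapter can be computed as in the sketch below (CC is the Pearson correlation coefficient; the r2 column of Table 5.5 corresponds to its square, assuming the usual convention):

```java
// Minimal sketch of the error metrics used in Tables 5.1 through 5.7:
// mean absolute error (MAE), root mean squared error (RMSE), and the
// Pearson correlation coefficient (CC) between observed and predicted values.
public class Metrics {
    public static double mae(double[] y, double[] p) {
        double s = 0;
        for (int i = 0; i < y.length; i++) s += Math.abs(y[i] - p[i]);
        return s / y.length;
    }

    public static double rmse(double[] y, double[] p) {
        double s = 0;
        for (int i = 0; i < y.length; i++) s += (y[i] - p[i]) * (y[i] - p[i]);
        return Math.sqrt(s / y.length);
    }

    public static double cc(double[] y, double[] p) {
        double my = 0, mp = 0;
        for (int i = 0; i < y.length; i++) { my += y[i]; mp += p[i]; }
        my /= y.length; mp /= p.length;
        double num = 0, dy = 0, dp = 0;
        for (int i = 0; i < y.length; i++) {
            num += (y[i] - my) * (p[i] - mp);
            dy += (y[i] - my) * (y[i] - my);
            dp += (p[i] - mp) * (p[i] - mp);
        }
        return num / Math.sqrt(dy * dp);
    }
}
```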


Table 5.5: Summary of preferred models (after finding suitable parameters during model development phase) applied to evaluation data set (1460 instances). *Result from Liang et al.

(2003a), **Result from Morris (2005), - indicates not available, yellow shade indicates tree based method, green shade indicates rule-based method, bold red indicates best results.

Eager Learners                                      MAE      RMSE     r2
SVM regression                                      0.1523   0.1864   0.8847
linear regression                                   0.2639   0.3687   0.7100
RBF Network                                         0.1230   0.1636   0.9329
ANN (1 hidden layer)                                0.1368   0.1769   0.9226
M5Rules                                             0.0756   0.1180   0.9624
Model Tree                                          0.0740   0.1190   0.9657
Regression Tree                                     0.0856   0.1448   0.9114
ANN (Ward network with 3 slabs in hidden layer)*    0.110*   -        0.9280

Lazy Learners
k-NN (k=20)                                         0.0770   0.1282   0.9615
LW linear regression (k=30)                         0.0764   0.1246   0.9624

Ensemble – Multiple Models
Model Tree w/bagging                                0.0751   0.1211   0.9583
Regression Tree w/bagging                           0.0798   0.1315   0.9474
k-NN w/bagging                                      0.0772   0.1289   0.9361

Hybrids
lazy SVM (k=20)                                     0.0709   0.1187   0.9661
lazy Model tree (k=15)                              0.0755   0.1252   0.9616
lazy M5Rules (k=20)                                 0.0753   0.1244   0.9624
ANFIS                                               0.0771   0.1161   -
FISSION**                                           0.1095   0.1411   0.9028

5.3 EAGER LEARNING METHOD RESULTS

The performances of models created by eager methods against the model evaluation set

of 1460 data patterns are graphically presented in this section. For each moisture setting, the

performance of the model was compared to the two sets of observed values, which is plotted in

the following graphs. The blue and green lines indicate observed values (two replicates) during

the experiment, while the red line shows the values predicted by the model. Good performance is

indicated by how well it is able to fit the observed plots.

The performance of the ANN model is shown in the Figures 5.1 to 5.4. As discussed

previously in Section 3.10, ANNs use the gradient descent algorithm in Backpropagation to learn

the target function by searching for the best weight vector. The possible weight values represent


the space of learnable functions. Given two positive examples with no negative examples

between them, Backpropagation will tend to label points in between as positive. This smooth

interpolation between data points is the inductive bias of Backpropagation learning algorithm

(Mitchell 1997). Using this analysis, we can observe that for a highly irregular complex instance

space, ANN will tend to smooth unknown parts of the hypercurve. Recall that we

hypothesized an issue with the training data – it contained a few distinct attribute values spaced

at regular intervals. Due to the unavailability of points in between the discrete intervals, ANN

would tend to smooth out irregularities of the hyperspace, which might have been an important

part of the model. This effect is exhibited in the graph of Figure 5.4 as excessive smoothing.

[Chart: ANN modeling performance - 40%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.1: Comparison between experimentally observed values (replicate 1 & 2) and

ANN predictions. (Patterns at 34°C and 40% moisture content).


[Chart: ANN modeling performance - 50%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.2: Comparison between experimentally observed values (replicate 1 & 2) and

ANN predictions. (Patterns at 34°C and 50% moisture content).

[Chart: ANN modeling performance - 60%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.3: Comparison between experimentally observed values (replicate 1 & 2) and

ANN predictions. (Patterns at 34°C and 60% moisture content).


[Chart: ANN modeling performance - 70%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.4: Comparison between experimentally observed values (replicate 1 & 2) and

ANN predictions. (Patterns at 34°C and 70% moisture content).

The performance of the RBF network models is plotted in the graphs shown in Figures 5.5 to 5.8.

The models in this method are built eagerly from local approximations centered around the

training examples, or around clusters of training examples, in which the RBF centers are placed.

A high number of kernel centers was used to follow the contour of the hypercurve; however, beyond a certain number of kernels, the model started overfitting, which resulted in significantly lower testing

performance. Too many Gaussian kernels with small standard deviations would cause a good fit

of the samples, but would also render the jagged appearance as shown in the graphs of the RBF

network model performance. This adverse side-effect of closely fitting the testing set possibly

accounted for the average evaluation set performance.


[Chart: RBF Network modeling performance - 40%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.5: Comparison between experimentally observed values (replicate 1 & 2) and

RBF Network predictions. (Patterns at 34°C and 40% moisture content).

[Chart: RBF Network modeling performance - 50%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.6: Comparison between experimentally observed values (replicate 1 & 2) and

RBF Network predictions. (Patterns at 34°C and 50% moisture content).


[Chart: RBF Network modeling performance - 60%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.7: Comparison between experimentally observed values (replicate 1 & 2) and

RBF Network predictions. (Patterns at 34°C and 60% moisture content).

[Chart: RBF Network modeling performance - 70%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.8: Comparison between experimentally observed values (replicate 1 & 2) and

RBF Network predictions. (Patterns at 34°C and 70% moisture content).


It was expected that SVM would perform well because it is a widely acclaimed and

powerful eager method for classification and function approximation; however, this was not the

case. The model fits shown in Figures 5.9 to 5.12 are poor and overly smooth, unable to capture the irregularities and complexities of the target function, leading

to poor performance. The smoothness is possibly caused by SVM's mechanism of using

maximum margin hyperplanes based on support vectors to find classification boundaries (Witten

& Frank 2005). Once a non-linear mapping transforms the input space, the algorithm identifies

important points called support vectors (SVs) in the new transformed feature space that define

the classification boundaries (important points on the convex hull). The SVs are then used to find

a regression function which can tolerate up to a certain error value, ε. This essentially forms a

tube (Figure 5.13) around the target curve thereby creating a smoothing effect on the hypercurve

which is exhibited in the SVM graphs. In later experiments (please refer to Section 5.7), we

attempted to reduce this oversmoothing effect by minimizing the width of the ε-tube, but this

resulted in all training instances becoming support vectors (10,220 and 8,760) and minimal

predictive improvement. In the worst case scenario, if no error is tolerated, then the algorithm

will simply perform least-absolute-error regression (Witten & Frank 2005). Vapnik et al. (1997)

state that for even moderate approximation quality, the number of SVs can be considerably

high.
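The ε-insensitive loss behind this tube can be sketched as follows; an illustrative fragment, not WEKA's SMO-based implementation:

```java
// Sketch of the epsilon-insensitive loss used in SVM regression: errors
// inside the epsilon-tube cost nothing, while errors outside it grow
// linearly. Shrinking epsilon narrows the tube, so more points fall
// outside it and become support vector candidates.
public class EpsilonLoss {
    public static double loss(double actual, double predicted, double epsilon) {
        return Math.max(0.0, Math.abs(actual - predicted) - epsilon);
    }

    // A point can become a support vector only if it lies outside the tube.
    public static boolean outsideTube(double actual, double predicted, double epsilon) {
        return Math.abs(actual - predicted) > epsilon;
    }
}
```

This makes the trade-off observed above concrete: a narrow tube fits the irregular hypercurve more closely, but at the cost of nearly every training instance becoming a support vector.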


[Chart: SVM modeling performance - 40%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.9: Comparison between experimentally observed values (replicate 1 & 2) and SVM regression predictions. (Patterns at 34°C and 40% moisture content).

[Chart: SVM modeling performance - 50%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.10: Comparison between experimentally observed values (replicate 1 & 2) and SVM regression predictions. (Patterns at 34°C and 50% moisture content).


[Chart: SVM modeling performance - 60%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.11: Comparison between experimentally observed values (replicate 1 & 2) and SVM regression predictions. (Patterns at 34°C and 60% moisture content).

[Chart: SVM modeling performance - 70%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.12: Comparison between experimentally observed values (replicate 1 & 2)

and SVM regression predictions. (Patterns at 34°C and 70% moisture content).


Figure 5.13: The ε-tube in Support Vector Regression. ε is the precision (error margin) to which the approximated curve should fit the instance points; ξ is a slack variable that allows the curve to deal with erroneous instance points. (Smola & Scholkopf 2003)

The following graphs in Figures 5.14 through 5.17 show the performance of the models

created by Model Trees on the evaluation data set. According to the error metrics in Table 5.5

and the following figures, the Model Tree was one of the best performing methods for this domain.

[Chart: Model Tree modeling performance - 40%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.14: Comparison between experimentally observed values (replicate 1 & 2)

and Model tree predictions. (Patterns at 34°C and 40% moisture content).


[Chart: Model Tree modeling performance - 50%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.15: Comparison between experimentally observed values (replicate 1 & 2)

and Model tree predictions. (Patterns at 34°C and 50% moisture content).

[Chart: Model Tree modeling performance - 60%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.16: Comparison between experimentally observed values (replicate 1 & 2)

and Model tree predictions. (Patterns at 34°C and 60% moisture content).


[Chart: Model Tree modeling performance - 70%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.17: Comparison between experimentally observed values (replicate 1 & 2)

and Model tree predictions. (Patterns at 34°C and 70% moisture content).

The model created by the M5Rules algorithm was able to successfully handle this domain, as shown by the error metrics in Table 5.5 and the graphs in Figures 5.19 through 5.22. The M5Rules algorithm infers rules using a divide-and-conquer approach by iteratively building model trees. In each iteration, a model tree is built using the M5 model tree building algorithm and the best leaf is selected as a rule. The M5Rules graphs illustrate the effect of rules for numeric prediction. Unlike other methods, M5Rules does not produce smooth curves to approximate a

function, but rather a series of connected straight lines, which are a result of the rule firing

process. The advantage of using rule inferring algorithms is that they generate human readable

rules. Typically, the number of rules generated for this domain ranged from 20 to 200 based on

the parameter setting. Figure 5.18 shows two rules generated from the M5Rules algorithm. The

rules generated by the M5Rules algorithm placed moisture as the parent node and also as parents

for several other subtrees, indicating that moisture has the highest information gain and is

possibly the most significant attribute.


Rule 14:
  IF moisture <= 45 AND time > 67.5 AND moisture <= 35 AND temp > 46.5
  THEN o2rate = 0.0007 * moisture + 0 * time - 0.0004 * temp + 0.0151  [427/2.687%]

Rule 15:
  IF moisture <= 45 AND moisture <= 35 AND temp <= 39.5 AND time <= 47.5
  THEN o2rate = 0.0002 * moisture + 0 * time + 0 * temp + 0.0023  [1142/2.46%]

Figure 5.18: Sample rules extracted from the M5Rules algorithm, which is based on model trees. Across the rules generated, moisture and time appear predominantly at parent levels, indicating their importance relative to the other factors.
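The rule firing process can be sketched as follows, reusing the coefficients of Rule 14 from Figure 5.18; the ordered rule list and the zero-valued fallback are illustrative assumptions:

```java
// Sketch of M5Rules-style rule firing: rules are tried in order, the first
// rule whose conditions all hold fires, and its linear model produces the
// prediction. The first rule below copies the coefficients of Rule 14 in
// Figure 5.18; the zero default when no rule fires is hypothetical.
public class RuleFiring {
    public static double predict(double moisture, double time, double temp) {
        // Rule 14 from Figure 5.18
        if (moisture <= 45 && time > 67.5 && moisture <= 35 && temp > 46.5) {
            return 0.0007 * moisture + 0.0 * time - 0.0004 * temp + 0.0151;
        }
        // ... the remaining rules of the list would follow here ...
        return 0.0; // illustrative fallback when no rule fires
    }
}
```

For an instance with moisture 30, time 70, and temperature 50, Rule 14 fires and the prediction is its linear model evaluated at those values.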

Although the use of model and regression trees for solving complex real-valued regression problems has been declining, trees performed exceptionally well for this dataset, as shown by

the M5Rules and Model tree graphs. Intuitively, it seems that model trees should be limited in

their expressiveness since they merely use a set of linear equations to describe a complex

hypercurve. However, by splitting the instance space, and then applying necessary regression

equations to fit those subsets, trees are possibly able to capture the phases involved in

composting using a linear yet highly relevant approximation (this is further discussed in Section

5.8 and illustrated in Figure 5.44).
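The space-splitting step is usually driven by the standard deviation reduction (SDR) criterion of the M5 algorithm: the split that most reduces the weighted standard deviation of the target in the resulting subsets is chosen. A minimal sketch:

```java
// Sketch of the standard deviation reduction (SDR) split criterion commonly
// used by M5-style model trees: SDR = sd(parent) - sum(|subset|/|parent| * sd(subset)).
// A larger SDR means the split yields purer (lower-variance) subsets.
public class Sdr {
    static double sd(double[] y) {
        double m = 0;
        for (double v : y) m += v;
        m /= y.length;
        double s = 0;
        for (double v : y) s += (v - m) * (v - m);
        return Math.sqrt(s / y.length);
    }

    public static double sdr(double[] parent, double[] left, double[] right) {
        int n = parent.length;
        return sd(parent) - (left.length / (double) n) * sd(left)
                          - (right.length / (double) n) * sd(right);
    }
}
```

A split that perfectly separates two target levels, for instance, recovers the full standard deviation of the parent node.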


[Chart: M5Rules modeling performance - 40%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.19: Comparison between experimentally observed values (replicate 1 & 2)

and M5Rules predictions. (Patterns at 34°C and 40% moisture content).

[Chart: M5Rules modeling performance - 50%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.20: Comparison between experimentally observed values (replicate 1 & 2)

and M5Rules predictions. (Patterns at 34°C and 50% moisture content).


[Chart: M5Rules modeling performance - 60%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.21: Comparison between experimentally observed values (replicate 1 & 2) and M5Rules predictions. (Patterns at 34°C and 60% moisture content).

[Chart: M5Rules modeling performance - 70%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.22: Comparison between experimentally observed values (replicate 1 & 2) and M5Rules predictions. (Patterns at 34°C and 70% moisture content).


5.4 LAZY LEARNING METHOD RESULTS

Lazy learning methods produced remarkably accurate models that performed well on the

independent test data. This is in contrast to many domains where powerful eager learners like

ANN and SVM typically perform best. k-NN achieved a best MAE value of 0.0770 with a correlation coefficient of 96.15%; Locally Weighted Linear Regression (LWR) achieved a similar MAE of 0.0764 and a correlation coefficient of 96.24%. Both of these methods showed

approximately 30% improvement over previously published models using the same evaluation

set and training/testing partitions. The following graphs in Figures 5.23 through 5.26 show the

model performance of LWR on the evaluation data set. The graphs in Figures 5.27 through 5.30

show k-NN model performance.
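The lazy prediction step can be sketched as inverse-distance-weighted k-NN regression; this is an illustrative sketch, not the WEKA implementation used for the experiments:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of inverse-distance-weighted k-NN regression: the k closest
// training instances vote on the prediction, each weighted by 1/distance,
// so nothing is generalized until a query arrives.
public class KnnRegression {
    public static double predict(double[][] X, double[] y, double[] q, int k) {
        Integer[] idx = new Integer[X.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(X[i], q)));
        double num = 0, den = 0;
        for (int j = 0; j < k; j++) {
            double d = dist(X[idx[j]], q);
            double w = 1.0 / (d + 1e-9); // small constant avoids division by zero
            num += w * y[idx[j]];
            den += w;
        }
        return num / den;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```

Because each prediction is a purely local average, irregular regions of the hypercurve are followed instance by instance rather than smoothed away, which matches the behavior observed above.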

[Chart: LW Linear Regression modeling performance - 40%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.23: Comparison between experimentally observed values (replicate 1 & 2) and Locally

Weighted Linear Regression predictions. (Patterns at 34°C and 40% moisture content).


[Chart: LW Linear Regression modeling performance - 50%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.24: Comparison between experimentally observed values (replicate 1 & 2) and Locally

Weighted Linear Regression predictions. (Patterns at 34°C and 50% moisture content).

[Chart: LW Linear Regression modeling performance - 60%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.25: Comparison between experimentally observed values (replicate 1 & 2) and Locally

Weighted Linear Regression predictions. (Patterns at 34°C and 60% moisture content).


[Chart: LW Linear Regression modeling performance - 70%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.26: Comparison between experimentally observed values (replicate 1 & 2) and Locally

Weighted Linear Regression predictions. (Patterns at 34°C and 70% moisture content).

[Chart: kNN modeling performance - 40%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.27: Comparison between experimentally observed values (replicate 1 & 2)

and k-NN predictions. (Patterns at 34°C and 40% moisture content).


[Chart: kNN modeling performance - 50%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.28: Comparison between experimentally observed values (replicate 1 & 2)

and k-NN predictions. (Patterns at 34°C and 50% moisture content).

[Chart: kNN modeling performance - 60%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.29: Comparison between experimentally observed values (replicate 1 & 2) and k-NN predictions. (Patterns at 34°C and 60% moisture content).


[Chart: kNN modeling performance - 70%M. X-axis: Time (hrs), 0–250; Y-axis: Oxygen Uptake Rate, 0.00–2.00; series: Observed1, Observed2, Predicted.]

Figure 5.30: Comparison between experimentally observed values (replicate 1 & 2) and k-NN predictions. (Patterns at 34°C and 70% moisture content).

5.5 COMBINING MODELS – ENSEMBLE RESULTS

As expected, our experimentation confirms that the eager learning schemes can

benefit from bagging, boosting and stacking, provided they are designed properly. Bagging

produced some improvements as shown in Table 5.6 using eager learners, but failed to do so for

lazy learners. Eager schemes already form their hypothesis before query time, resulting in an

inflexible model, which may be a disadvantage for certain domains. Bagging can provide more

diversity as a result of forming many hypotheses by 1) using different versions of the training set

by randomly sampling the entire data set and then training the algorithm, or 2) by relying on the

inherent instability of some learners, e.g. ANN, randomly seeded RBF network, etc.

When bagging is applied to lazy learners, it does not harness the power of using multiple

models because lazy learners only create their models during query time. During query time, a

bagged lazy learner will produce a multiple of k neighbors, which is essentially the same lazy


learning algorithm with a higher k-value. A higher k-value does not necessarily produce better

prediction results due to the possible inclusion of irrelevant or noisy instances (this effect was observed during parameter tuning and is shown in Table 5.4). Therefore, it is not surprising to

observe that bagging did not produce any improvement for lazy learners.
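Bagging for regression can be sketched as below, with base models represented abstractly as prediction functions (an illustrative simplification):

```java
import java.util.List;
import java.util.Random;
import java.util.function.Function;

// Sketch of bagging for regression: each base model is trained on a bootstrap
// resample of the training data, and predictions are averaged at query time.
// A "model" here is just a prediction function; the model trees and k-NN
// learners discussed above would take its place.
public class Bagging {
    // One bootstrap resample: n indices drawn with replacement.
    public static int[] bootstrapIndices(int n, Random rng) {
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) idx[i] = rng.nextInt(n);
        return idx;
    }

    // Average the bagged models' predictions for a single query point.
    public static double baggedPredict(List<Function<double[], Double>> models, double[] q) {
        double sum = 0;
        for (Function<double[], Double> m : models) sum += m.apply(q);
        return sum / models.size();
    }
}
```

For a lazy learner, every bagged copy selects roughly the same neighbors at query time, so averaging them behaves like a single k-NN with a larger k, which is why bagging helps only the eager, unstable learners.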

Because bagging stores the numerous models built in main memory, performance and

hardware can be an issue, as encountered during the experimentation. Model trees can grow to

enormous sizes with large training sets, and with many unpruned trees used in bagging, we

often ran out of main memory (1GB).

Table 5.6: Summary of MAE values for several ML algorithms with and without bagging (ibk # = instance-based learning, i.e. k-NN using # neighbors).

ML Scheme                             MAE (no bagging)   MAE (bagging)
ibk 1                                 0.0792             0.0798
ibk 5                                 0.0782             0.0786
ibk 10                                0.0776             0.0776
ibk 15                                0.0771             0.0772
ibk 20                                0.0770             0.0772
ibk 25                                0.0775             0.0774
ibk 30                                0.0783             0.0787
ibk 100                               0.0932             0.0942
linear regression function            0.2639             0.2638
regression tree-pruning, unsmooth     0.0843             0.0794
regression tree-pruning, smooth       0.0937             0.0905
model tree-pruning, unsmooth          0.0768             0.0751
model tree-unpruned, unsmooth         0.0796             out of memory
model tree-pruning, smooth            0.0778             0.0783
model tree-unpruned, smooth           0.0794             out of memory

Boosting methods like additive regression slightly improved the results of already well

performing learners, but were unable to help the others. Stacking should, in theory, perform no worse than its worst base learner; this was confirmed, as strong base learners produced good performance and weak ones poor performance, as shown in Table 5.7. A major drawback of

stacking is the complexity and amount of time required for training.
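Additive regression fits each new base model to the residuals of the ensemble built so far and sums the predictions. The sketch below uses a trivial mean predictor as the base learner so it stays self-contained; the decision stumps and model trees of Table 5.7 would take its place:

```java
// Sketch of additive regression (the boosting variant used above): each
// round fits a base model to the current residuals and adds it to the
// ensemble. The base learner here is a constant mean predictor, chosen only
// to keep the sketch self-contained; real base learners are more expressive.
public class AdditiveRegression {
    public static double[] residuals(double[] y, double[] predictions) {
        double[] r = new double[y.length];
        for (int i = 0; i < y.length; i++) r[i] = y[i] - predictions[i];
        return r;
    }

    public static double mean(double[] x) {
        double s = 0;
        for (double v : x) s += v;
        return s / x.length;
    }

    // Run several boosting rounds; the ensemble prediction is the sum of the
    // base models, and residuals shrink toward zero round by round.
    public static double fit(double[] y, int rounds) {
        double ensemble = 0;
        double[] r = y.clone();
        for (int t = 0; t < rounds; t++) {
            double m = mean(r);          // base model of this round
            ensemble += m;               // add it to the ensemble
            for (int i = 0; i < r.length; i++) r[i] -= m; // update residuals
        }
        return ensemble;
    }
}
```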


Table 5.7: Sample results of Stacking and Boosting (additive regression).

Method                                                          CC       MAE      RMSE     time (s)
Stacking (Meta: Model Tree, Base: k-NN, Model Tree, RBF, LWR)   0.9640   0.0751   0.1234   1634
Stacking (Meta: ANN, Base: RBF, ANN, SVM)                       0.9281   0.1221   0.1682   6295
Boosting: Additive Regression: Decision Stump X10               0.8230   0.2179   0.2618   8
Boosting: Additive Regression: Model Trees X3                   0.9628   0.0777   0.1257   385
Boosting: Additive Regression: Model Trees X10                  0.9624   0.0825   0.1316   1245
Boosting: Additive Regression: SMOreg X2                        0.6511   0.2502   0.3406   1502

5.6 HYBRID METHOD RESULTS

Hybrid methods performed the best in this research, closely followed by rule based and

lazy learners. Based on unpromising results from SVM, ANN, and RBF, lazy learners were

applied, yielding marked improvements. This inspired us to develop hybrid versions of some

eager learners. Eager methods other than trees did not perform satisfactorily on the dataset.

Therefore, we studied the effect of hybridizing an eager algorithm with a lazy learner. In

experiments described in Section 5.7, where we compared the performances of several

algorithms and effects of hybridization, we noted that the same eager learners (e.g. the kernel-

based methods) were able to substantially boost their performance once hybridized and perform

on par with other algorithms. Figure 5.31 shows their comparative improvement in performance

(please refer to Section 5.7 for details). Table 5.4 summarizes the effect of using different nearest

neighbor values on the hybrid algorithms during the model development phase.

The graphs in Figures 5.32 through 5.35 present the performance of the model created by

lazySVM on the evaluation dataset, which exhibits superior function approximation as well as

low error values.
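The lazy-eager hybrid idea can be sketched as follows: at query time, select the k nearest training instances and fit an eager learner on that local subset only. A one-dimensional least-squares line stands in here for the SVM or model tree (an illustrative simplification):

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of a lazy-eager hybrid: defer learning until query time, select the
// k nearest neighbors, and fit an eager learner on that local subset. A 1-D
// ordinary-least-squares line stands in for the SVM / model tree base learner.
public class LazyEager {
    public static double predict(double[] x, double[] y, double q, int k) {
        Integer[] idx = new Integer[x.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> Math.abs(x[i] - q)));
        // Fit y = a + b*x on the k nearest neighbors (ordinary least squares).
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int j = 0; j < k; j++) {
            double xi = x[idx[j]], yi = y[idx[j]];
            sx += xi; sy += yi; sxx += xi * xi; sxy += xi * yi;
        }
        double b = (k * sxy - sx * sy) / (k * sxx - sx * sx);
        double a = (sy - b * sx) / k;
        return a + b * q; // evaluate the local model at the query point
    }
}
```

Each query gets its own locally fitted model, so the combined hypothesis can be complex and irregular globally while remaining simple locally, which is the expressive-power argument made for these hybrids.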


[Chart: Improvements in Hybridizing Eager Learners. X-axis: ML Scheme (Model Tree, Lazy Model Tree, SVM, Lazy SVM, ANN, Lazy ANN, Regr. Tree, Lazy Regr. Tree); Y-axis: Error Metric, 0.05–0.25; series: mean RMSE, mean MAE.]

Figure 5.31: Effect of hybridizing an eager method is illustrated in terms of mean MAE and RMSE with standard deviations (over 30 bootstrapped runs, refer to Section 5.7 for details)

[Figure: line plot "lazySVM modeling performance - 40%M"; y-axis: Oxygen Uptake Rate (0.00-2.00), x-axis: Time (0-250 hrs); series: Observed1, Observed2, Predicted.]

Figure 5.32: Comparison between experimentally observed values (replicate 1 & 2)

and lazySVM predictions. (Patterns at 34°C and 40% moisture content).


[Figure: line plot "lazySVM modeling performance - 50%M"; y-axis: Oxygen Uptake Rate (0.00-2.00), x-axis: Time (0-250 hrs); series: Observed1, Observed2, Predicted.]

Figure 5.33: Comparison between experimentally observed values (replicate 1 & 2)

and lazySVM predictions. (Patterns at 34°C and 50% moisture content).

[Figure: line plot "lazySVM modeling performance - 60%M"; y-axis: Oxygen Uptake Rate (0.00-2.00), x-axis: Time (0-250 hrs); series: Observed1, Observed2, Predicted.]

Figure 5.34: Comparison between experimentally observed values (replicate 1 & 2)

and lazySVM predictions. (Patterns at 34°C and 60% moisture content).


[Figure: line plot "lazySVM modeling performance - 70%M"; y-axis: Oxygen Uptake Rate (0.00-2.00), x-axis: Time (0-250 hrs); series: Observed1, Observed2, Predicted.]

Figure 5.35: Comparison between experimentally observed values (replicate 1 & 2)

and lazySVM predictions. (Patterns at 34°C and 70% moisture content).

5.7 EVALUATING MACHINE LEARNING SCHEMES

When two models need to be compared, it is usually sufficient to evaluate them on an

unseen evaluation set or by performing k-fold cross-validation. However, it is more

complicated to compare machine learning methods against each other; for example, how would

one justify the statement that algorithm A will perform better than algorithm B in this domain? The

problem arises due to several reasons: 1) there is a limited amount of evaluation data, 2) the

underlying distribution of the unseen instances is unknown, and 3) there is uncertainty involved

in target and attribute values. An algorithm performing better on the evaluation set is not enough

justification to claim superiority, because the difference in performance may be due to estimation

errors (Mitchell 1997), e.g. the training set or testing set may not be representative of the real

world.

A solution to comparing algorithms with limited data is to use n-fold cross-validation;

however, statisticians have developed the bootstrap, which many consider a better estimation


method (Efron 1997). It is a strong statistical procedure for estimating the sampling distribution

of the data (Mitchell 1997), which can then be used to obtain the confidence levels of the

difference in true prediction error of the two methods. For the 0.632 bootstrap method, a new

training set is created by sampling the original dataset (which contains n instances) n times with

replacement. Because some elements in this second dataset will almost certainly be repeated, the

remaining instances that have not been picked are put in the test set (Witten & Frank 2005),

which would on average be 36.8% of the data. Although the bootstrap is an estimation method,

theoretical studies show that given enough samples, it approaches the true sample means and true

underlying distribution of the domain (Witten & Frank 2005).
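The 0.632-bootstrap sampling step described above can be sketched as follows. This is an illustrative Python sketch of the sampling procedure only (not the implementation used in this work); the dataset size and seed are arbitrary.

```python
import random

def bootstrap_split(n, seed=None):
    """One 0.632-bootstrap partition of a dataset with n instances.

    The training set is drawn by sampling the instance indices n times
    with replacement; every index never drawn goes to the test set. The
    chance an instance is never picked is (1 - 1/n)^n -> 1/e as n grows,
    so the test set holds on average about 36.8% of the data.
    """
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    test = sorted(set(range(n)) - set(train))
    return train, test

train_idx, test_idx = bootstrap_split(1000, seed=42)
out_of_bag_fraction = len(test_idx) / 1000  # close to 0.368
```

Repeating this split 30 times, as done in this thesis, yields 30 paired train/test folds on which every scheme can be run.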

In this thesis, the total number of instances is bootstrapped 30 times, resulting in 30 folds

of training and testing sets. Each machine learning scheme is run on each fold, for a total of 30

runs. To make a fair comparison and ensure more accurate results for the paired t-tests, the same

folds are used for each machine learning scheme. A total of 300 runs (30 folds x 10 schemes)

were performed and results are shown in Table 5.8. The mean performance measured in MAE

and RMSE over 30 folds for each algorithm is plotted in Figure 5.36 and the standard deviations

of the means are shown in Table 5.9. Figure 5.37 shows the average correlation coefficient (r²) of

the schemes considered; in this case, performance is directly proportional to r².


Table 5.8: Results gathered from 30 runs of the 0.632 bootstrap using all the instances. A different partition was drawn for each of the 30 runs, but exactly the same bootstrapped partitions/folds were used for all schemes in order to make a fair comparison. * best eager performer, ** best hybrid performer, ~ worst hybrid performer

ML algorithm            mean CC   mean MAE   mean RMSE
RBF                     0.9236    0.1106     0.1531
SVM                     0.8896    0.1373     0.1809
ANN                     0.9101    0.1254     0.1700
regression tree         0.9512    0.0698     0.1235
model tree*             0.9576    0.0695     0.1150
k-NN                    0.9518    0.0688     0.1225
lazy SVM                0.9469    0.0638     0.1290
lazy ANN                0.9595    0.0617     0.1126
lazy model tree**       0.9629    0.0597     0.1077
lazy regression tree~   0.9523    0.0673     0.1122

Table 5.9: Mean MAE and RMSE along with the standard deviations over the 30 runs.

ML scheme         mean RMSE   σ RMSE   mean MAE   σ MAE
Model Tree        0.1151      0.0027   0.0698     0.0016
Lazy Model Tree   0.1077      0.0022   0.0597     0.0012
SVM               0.2409      0.0027   0.1373     0.0018
Lazy SVM          0.1290      0.0035   0.0638     0.0015
ANN               0.1700      0.0092   0.1254     0.0077
Lazy ANN          0.1126      0.0026   0.0612     0.0012
RBF               0.1531      0.0026   0.1106     0.0018
Regr. Tree        0.1236      0.0037   0.0701     0.0019
Lazy Regr. Tree   0.1201      0.0030   0.0671     0.0015
k-NN              0.1226      0.0022   0.0682     0.0011

Although the graphs in Figures 5.36 and 5.37, along with Tables 5.8 and 5.9, provide a

good indication of relative algorithm performance, they are not sufficient to conclude that one

method is better than another. It is therefore necessary to establish statistical significance to

provide confidence levels. The next several figures, starting at Figure 5.38, show the results of the

statistical paired t-tests and present some interesting observations. The same 30 folds of data that

were generated from bootstrapping were given to each ML scheme. A method is considered better

than another if it achieved at least a 95% confidence level on the paired t-tests using the 30 paired

values.
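This decision rule can be sketched as below. The sketch is illustrative Python with hypothetical per-fold error values, not the thesis's actual results; the critical value 2.04523 corresponds to 29 degrees of freedom (30 paired folds) at the 95% confidence level.

```python
import math

T_CRIT_95 = 2.04523  # critical t value, df = 29 (30 paired folds)

def paired_t_stat(errors_a, errors_b):
    """t statistic over paired per-fold errors of two ML schemes.

    A positive value means scheme A's mean error exceeds scheme B's; the
    difference is deemed significant when |t| exceeds the critical value.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-fold MAE values for two schemes over the same 30 folds.
mae_a = [0.070 + 0.001 * math.sin(i) for i in range(30)]
mae_b = [0.060 + 0.001 * math.cos(i) for i in range(30)]
t = paired_t_stat(mae_a, mae_b)
b_is_better = t > T_CRIT_95
```

Using the same folds for both schemes is what makes the pairing valid: each difference comes from an identical train/test partition, removing fold-to-fold variance from the comparison.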


[Figure: bar chart "Performance Comparison of ML Schemes"; y-axis: performance metric (mean MAE and mean RMSE, 0.05-0.17); x-axis: RBF, SVM, ANN, regression tree, model tree, kNN, lazy SVM, lazy ANN, lazy model tree, lazy regression tree.]

Figure 5.36: Performance comparison of different ML schemes on the composting dataset.

Values are average of 30 runs on the same 30 folds of bootstrapped data.

[Figure: bar chart "Performance Comparison of ML Schemes"; y-axis: correlation coefficient (0.88-0.97); x-axis: RBF, SVM, ANN, regression tree, model tree, kNN, lazy SVM, lazy ANN, lazy model tree, lazy regression tree.]

Figure 5.37: Performance comparison based on Correlation Coefficient of different ML schemes.

Values are average of 30 runs on the same 30 folds of bootstrapped data.


[Figure: bar charts "mean MAE Comparison of Algorithms" and "mean RMSE Comparison of Algorithms": Model Trees vs. k-NN, with standard deviations.]

(a) Comparison of mean MAE values showing standard deviations.

(b) Comparison of mean RMSE values showing standard deviations.

                                    Model Trees   k-NN
mean MAE                            0.06966       0.06820
stdDev MAE                          0.00135       0.00115
t-Stat (MAE)                        6.83399
mean RMSE                           0.12299       0.12255
stdDev RMSE                         0.00291       0.00217
t-Stat (RMSE)                       1.07112
t Critical Value (95% confidence)   2.04523

Figure 5.38: Comparison of the best eager learner to the worst lazy learner: Model Trees vs. k-NN. Although k-NN has lower mean values for MAE and

RMSE, only one measure (MAE) shows with a 95% confidence level that k-NN performs better than Model Trees. The performance difference based on RMSE is statistically insignificant, as t-Stat (RMSE) is lower than the critical t value (for 95% confidence). Therefore, it is statistically uncertain whether k-NN outperforms Model Trees.


[Figure: bar charts "mean MAE Comparison of Algorithms" and "mean RMSE Comparison of Algorithms": Model Tree vs. Lazy Model Tree, with standard deviations.]

(a) Comparison of mean MAE values showing standard deviations.

(b) Comparison of mean RMSE values showing standard deviations.

                                    Model Tree   Lazy Model Tree
mean MAE                            0.06984      0.05972
stdDev MAE                          0.00164      0.00119
t-Stat (MAE)                        23.72485
mean RMSE                           0.11513      0.10771
stdDev RMSE                         0.00267      0.00218
t-Stat (RMSE)                       10.70025
t Critical Value (95% confidence)   2.04523

(The original stdDev MAE entry for Lazy Model Tree duplicated its mean MAE; the value above follows Figure 5.41 and Table 5.9.)

Figure 5.39: Comparison of the hybrid version to the original version: Model Trees vs. Lazy Model Trees. With a 95% confidence level we can say

that Lazy Model Trees perform better than Model Trees on both error metrics; t-Stat > critical t value for the 95% confidence level.


[Figure: bar charts "mean MAE Comparison of Algorithms" and "mean RMSE Comparison of Algorithms": SVM vs. lazy SVM, with standard deviations.]

(a) Comparison of mean MAE values showing standard deviations.

(b) Comparison of mean RMSE values showing standard deviations.

                                    SVM         lazy SVM
mean MAE                            0.13731     0.06378
stdDev MAE                          0.00181     0.00146
t-Stat (MAE)                        164.57321
mean RMSE                           0.18093     0.12897
stdDev RMSE                         0.00273     0.00345
t-Stat (RMSE)                       62.07635
t Critical Value (95% confidence)   2.04523

Figure 5.40: Comparison of the hybrid version to the original version: SVM vs. lazy SVM. With a 95% confidence level we can state that

lazySVM outperforms SVM on both error metrics; t-Stat > critical t value for the 95% confidence level.


[Figure: bar chart "mean MAE Comparison of Algorithms": lazy SVM vs. lazy Model Tree, with standard deviations.]

(a) Comparison of mean MAE values showing standard deviations.

[Figure: bar chart "mean RMSE Comparison of Algorithms": lazy SVM vs. lazy Model Tree, with standard deviations.]

(b) Comparison of mean RMSE values showing standard deviations.

                                    lazy SVM   lazy Model Tree
mean MAE                            0.06378    0.05972
stdDev MAE                          0.00146    0.00119
t-Stat (MAE)                        29.12331
mean RMSE                           0.12897    0.10771
stdDev RMSE                         0.00345    0.00218
t-Stat (RMSE)                       50.18343
t Critical Value (95% confidence)   2.04523

Figure 5.41: Comparison of learning methods: lazy SVM vs. lazy Model Tree. With a 95% confidence level we can state that lazy Model Trees perform better than lazy SVM

on both error metrics; t-Stat > critical t value for the 95% confidence level.


[Figure: bar charts "mean MAE Comparison of Algorithms" and "mean RMSE Comparison of Algorithms": ANN vs. Lazy ANN, with standard deviations.]

(a) Comparison of mean MAE values showing standard deviations.

(b) Comparison of mean RMSE values showing standard deviations.

                                    ANN        Lazy ANN
mean MAE                            0.12539    0.06166
stdDev MAE                          0.00774    0.0012
t-Stat (MAE)                        44.79261
mean RMSE                           0.16999    0.11257
stdDev RMSE                         0.00924    0.00257
t-Stat (RMSE)                       31.80514
t Critical Value (95% confidence)   2.04523

(The original stdDev MAE entry for Lazy ANN duplicated its mean MAE; the value above follows Table 5.9.)

Figure 5.42: Comparison of the hybrid version to the original version: ANN vs. Lazy ANN. With a 95% confidence level we can state that

lazyANN outperforms ANN on both error metrics; t-Stat > critical t value for the 95% confidence level.


5.8 ANALYSIS

It appears from our research that lazy learners, hybrid learners (i.e. lazy versions of eager

learners), trees, and rule-based systems (e.g. neuro-fuzzy systems and the M5Rules algorithm)

are highly successful in modeling this dataset. Statistical significance tests confirm the relative

method performance for this domain. The remainder of this section explores the observations and

provides possible explanations and further discussions.

The training patterns obtained from the experiments were carefully analyzed due to the

unexpected behavior of the machine learning algorithms. As described in Chapter 2 of this thesis,

it was observed from the data that the composting behavior was not steady. Even though the

experiments were conducted in the same facility under similar conditions, the microbial activity

often changed erratically at certain temperatures and/or moisture contents. This behavior is

exhibited in the graphs in section 2.2.

Most microbial activity is pronounced near the high moisture area as illustrated in the 3D

plot in Figure 5.43. Low moisture contents show minimal activity regardless of temperature. This

indicates that moisture plays the more important role in composting. This was also confirmed by

the rule sets generated by the M5Rules algorithm (refer to Figure 5.18). Liang et al.

(2003b) and Miller (1992) confirm this observation about the significance of moisture content as

well. The peak of the graph indicates that a temperature of 30°-35°C and a moisture content of

50%-60% are most conducive to composting, based on these two variables only.


Figure 5.43: Reduced to 3 dimensions, this graph plots microbial activity (output) against

moisture (input2) and temperature (input3). It should be noted that this graph changes over the time required for composting a substance.

Figure 5.44: (left) The three phases in composting: A (initial or mesophilic phase), B (thermophilic or high-rate phase), and C (curing or cooling phase), as shown in

the literature (EPA 1995; Haug 1993; Das et al. 1998). (Left graph from EPA.) (right) Composting behavior created by our model based on the training data.

Composting literature identifies three phases in composting, as shown in Figure 5.44

(left); each phase is characterized by the prevalence of certain microbes and by chemical,

physical, and biological factors such as temperature, pH, and particle density/size. Some

researchers go further and insert other phases between the aforementioned ones; one such phase is the lag phase,


where microbial activity recedes temporarily until the optimal balance for it to thrive is restored. In

Figure 5.44 (right), we can clearly see how our models roughly follow the experimental patterns,

and also show the possibility of more phases. It is clear that composting obeys certain patterns,

and the overall progress is governed by phases. Trees and rule-based systems shatter the instance

space into many parts based on the tests at the node or the rules, and then a function is used to fit

each shattered part of the instance space. Similar principles are employed by the hybrid learners;

by only considering the k-nearest instances, they are implicitly shattering the instance space.

Subsequently, function approximation methods, like ANN, SVM, regression, etc. are used to fit

each shattered space. Essentially, the hybrid learner and trees (rule-based systems) are using a

collection of locally approximated functions (one implicitly, the other explicitly via node tests or

rules) to enhance their expressive power and approximate the overall complex target function.

This analysis can help us understand why trees, rule-based systems and lazy learners performed

very well in this domain. Our work in modeling has revealed that many phases are involved in

composting, and with the help of appropriate ML methods (data mining), we may identify more

phases, patterns, and rules.
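This shattering idea can be illustrated with a minimal sketch. The following is illustrative Python on a hypothetical two-phase target (real regression trees choose split points automatically; here the split is given): partition the instance space, then fit a simple local function to each part.

```python
import numpy as np

def piecewise_fit(x, y, split):
    """Fit one constant per region of a single-split partition -- the
    simplest shattered instance space a regression tree can produce."""
    return float(y[x < split].mean()), float(y[x >= split].mean())

def piecewise_predict(x, split, c_left, c_right):
    """Predict with the local function belonging to each region."""
    return np.where(x < split, c_left, c_right)

# Synthetic two-phase target, loosely analogous to composting phases.
x = np.linspace(0, 10, 100)
y = np.where(x < 5, 1.0, 3.0)
c_left, c_right = piecewise_fit(x, y, split=5.0)
mse = float(np.mean((piecewise_predict(x, 5.0, c_left, c_right) - y) ** 2))
# A single global constant (or line) would leave a large residual; the
# two local constants fit each phase exactly.
```

Trees and rules make this partitioning explicit, while a lazy-eager hybrid achieves the same effect implicitly through its k-nearest neighborhoods.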


CHAPTER 6

CONCLUSION AND FUTURE WORK

In this research, we have successfully used ML techniques and hybrid

algorithms to yield highly accurate prediction results. From this study, we have also gained a

better understanding of the applicability of machine learning schemes to this domain. The effects

of hybridizing eager learners with a lazy approach have been studied using the 0.632

bootstrapping method. We have shown with 95% confidence levels or higher that the hybrid

versions of eager algorithms performed better than the originals. Furthermore, the worst hybrid

method was able to outperform the best eager method with a 95% confidence level (refer to

Figure 6.1). Finally, the best model developed using the hybrid scheme had errors of MAE =

0.0709, RMSE = 0.1187, and a correlation coefficient of 0.9661 on the evaluation dataset, which is

more than a 35% improvement in MAE over published results. We conclude that hybridizing an

eager learner using a lazy approach can significantly boost performance in this domain.

This research further indicates that the following types of algorithms can perform well for

this domain: 1) rule-generating algorithms, e.g. neuro-fuzzy system (ANFIS), and M5Rules

algorithm, 2) algorithms that can learn complex functions and can change parts of their model on

the fly, e.g. hybrid eager-lazy learners and lazy methods, and 3) tree-based methods, e.g. regression &

model trees. Rules inferred from the M5Rules algorithm confirmed previous research findings

about the importance of moisture (Liang et al. 2003b; Miller 1992).


[Figure: bar chart "mean MAE Comparison of Algorithms": Model Tree vs. lazyRegTree, with standard deviations.]

Figure 6.1: Performance comparison of the best eager method with the worst hybrid method: Model Tree vs. lazy Regression Tree. With a 95% confidence

level, the errors of the lazy regression tree were lower than those of the model tree; t-Stat (9.68) > critical t value for the 95% confidence level (2.04523).

ML methods are highly applicable to real-world composting facilities, where 1) the

composting substance does not have to be analyzed beforehand, 2) data from pilot scale

experiments can be used, 3) physical, chemical, empirical rules from research can be assimilated

and 4) minimal understanding of the composting process is required. Composting runs can be

used to obtain the data, which can then be used to train machine learning algorithms to yield an

accurate model. The model created can subsequently be used in a support system to control the

composting process and boost its efficiency.

While this work has met its goal, we have identified several areas that merit attention for

further research. Because composting has several phases, a meta-learner could be used to

identify the phases and then apply suitable base-learners to each phase. Further research could be

conducted to identify the most suitable algorithm for each phase of the composting process. This

phase-based stacking approach may lead to more dynamic, robust and accurate models.


It would be of interest to create a model tree that generates the membership functions for

a Fuzzy Inferencing System (FIS) rather than regression leaves. These generated membership

functions may be able to automatically capture the rules of this domain without domain experts.

Another ML scheme that may be appropriate for this domain is a hybrid learning classifier,

where the M5Rules algorithm is used to generate rules, and then a Genetic Algorithm is used to

find the best population of rules to classify a given query instance.


REFERENCES

Aha, D.W., Kibler, D., Albert, M.K. (1991). Instance-based learning algorithms. Machine Learning,

Vol.6, pp. 37-66.

Atkeson, C., A. Moore, and S. Schaal (1997). Locally weighted learning. AI Review, Vol.11, 11-73.

Baker, J.R. et al. (2004). Evaluation of Artificial Intelligence based models for chemical biodegradability

prediction. Molecules, Vol.9, 989-1004.

Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Commun.

ACM 18, 9 (Sep.), 509-517

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and Regression Trees.

Chapman & Hall, New York.

Brodley, C.E., Fern, X. Z. (2003). Boosted lazy decision trees. The Twentieth International Conference on

Machine Learning.

Broomhead, D.S., Lowe, D. (1988). Multivariable functional interpolation and adaptive networks.

Complex Systems, Vol.2, 321-355.

Bush, G.W. (2006). State of the Union Address by the President. Available from

http://www.whitehouse.gov/stateoftheunion/2006/index.html . Internet; accessed June 2006.

Das, K.C., Tollner, E.W. (1998). Development and preliminary validation of a compost process

simulation model. Proceedings Composting in the Southeast Conference and Expo. Sept. 9-11, 1998.

Efron, B. and Tibshirani, R.J. (1997), Improvements on cross-validation: The .632+ bootstrap method. J.

of the American Statistical Association, 92, 548-560.

Engelbrecht, A.P. (2002). An Introduction to Computational Intelligence. John Wiley & Sons Inc.

EPA (1995). Decision-Makers’ Guide to Solid Waste Management. EPA Publication Vol.2, EPA530-R-

95-023, 7-12.


EPA (1999). Biosolids Generations, use and disposal in the United States. EPA Report 1999, EPA530-R-

99-009, p.12-15, 27-35.

EPA (2003). Municipal Solid Waste in The United States: 2001 Facts and Figures. EPA Report 2003,

EPA530-R-03-011.

EPA (2006a). Environmental Protection Agency. Composting – Basic Information. Available from

http://www.epa.gov/epaoswer/non-hw/composting/basic.htm. Internet; accessed May 2006.

EPA (2006b). Municipal Sold Waste (MSW) – Basic Information. Available from

http://www.epa.gov/epaoswer/non-hw/muncpl/facts.htm. Internet; accessed May 2006.

Frank, E., Witten, I.H. (1998). Generating accurate rule sets without global optimization. Proc. of the 15th

International Conference on Machine Learning. Morgan Kaufmann 144-151

Goldstein, N., Gray, K. (1999). Biosolids composting in the United States. 1999 BioCycle 40 (1), 63-75.

Gray, K. (1999). MSW and biosolids become feedstocks for ethanol. BioCycle, Vol. 40, No.8, 37-38.

Haug, R.T. (1993). The Practical Handbook of Compost Engineering. Boca Raton, FL, Lewis Publishers.

Holmes, G., Hall, M., and Frank, E. (1999). Generating rule sets from model trees. Proc 12th Australian

Joint Conference on Artificial Intelligence, Sydney, Australia, Springer, 1-12.

Hong, J. W. (1996). RBF Java Applet. MIT. Available from

http://diwww.epfl.ch/mantra/tutorial/english/rbf/html/index.html . Internet; accessed June 2006.

Hornik, K. (1989). Multilayer Feedforward Networks are Universal Approximators. Neural Networks,

Vol.2, 359-366.

Isozaki, H., Kazawa, H. (2002). Efficient Support Vector Classifiers for Named Entity Recognition.

Proceedings of the 19th International Conference on Computational Linguistics (COLING'02),

Taipei, Taiwan, 390-396.

Jang, J.S.R. (1993). ANFIS: Adaptive-network-based fuzzy inference system. IEEE Transactions on

Systems, Man and Cybernetics, Volume 23, Issue 3, 665–685


Liang, C., Das, K. C., McClendon, R. W. (2003a). Prediction of Microbial Activity during biosolids

composting using Artificial Neural Networks. American Society of Agricultural Engineers, 2003,

ISSN 0001-2351, Vol. 46(6): 1713-1719.

Liang, C., Das, K. C., McClendon, R. W. (2003b). The influence of temperature and moisture contents

regimes on the aerobic microbial activity of a biosolids composting blend. Bioresource Technology

86, 2003, 131-137, Elsevier Science Ltd.

McCartney, D., Tingley, J. (1998). Development of a rapid moisture content method for compost

materials. Compost Science and Utilization, 6(3), 14-25.

Miller, F.C. (1992). Composting as a process based on the control of ecologically selective factors. Soil

Microbial Ecology: Applications in Agriculture Environment Management, 515-544, F.Blaine-

Metting, ed. N.Y.:Marcel Dekker.

Mitchell, T. (1997). Machine Learning. McGraw-Hill.

Morris, E. (2005). FISSION: An Evolutionary Method for Fuzzy Learning. M.S. Thesis, Computer

Science, University of Georgia, Athens.

Nakasaki, K. and Akihito, O. (2002). A simple numerical model for predicting organic matter

decomposition in a Fed-Batch composting operation. J.Environ. Qual.,31, 997-1003

ORWARE (2001). ORWARE – A simulation tool for waste management. Eriksson, O., Frostell, B. et

al., Royal Institute of Technology at Stockholm.

Platt, J. (1998). Fast Training of Support Vector Machines using Sequential Minimal Optimization.

Advances in Kernel Methods - Support Vector Learning, B. Schoelkopf, C. Burges, and A. Smola,

eds., MIT Press.

Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, Vol.1, 81-106.

Quinlan, J.R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

Roehr, M. (2001). The Biotechnology of Ethanol: Classical and Future Applications. Wiley-VCH


Rosso, L., Lobry, J.R., Flandrois, J.P., (1993). An unexpected correlation between cardinal temperatures

of microbial growth highlighted by a new model. J.Theor. Biol. 162, 447-463.

Smola, A.J., Scholkopf, B. (1998). A Tutorial on Support Vector Regression. NeuroCOLT2 Technical

Report Series - NC2-TR-1998-030.

Stombaugh, D.P., Nokes, S.E. (1996). Development of a biologically based aerobic composting

simulation model. Trans. ASAE 39, 239-250.

Vapnik, V. (1999). The nature of statistical learning theory. 2nd Ed., Springer-Verlag, NY.

WEKA (2006). Waikato Environment for Knowledge Analysis. University of Waikato, NZ.

http://www.cs.waikato.ac.nz/ml/weka/

Witten, I., Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd Ed.,

Elsevier Inc.

Xi, B., Wei, Z., Liu, H. (2005). Dynamic Simulation for Domestic Solid Waste Composting Processes.

The Journal of American Science, 1(1).

Zhou, Y. and Brodley, C. E. (1999). A hybrid lazy-eager approach to reducing the computation and

memory requirements of local parametric learning algorithms. The 16th International Conference on

Machine Learning, June 27-30, 503-512.
