RIGA TECHNICAL UNIVERSITY
Faculty of Computer Science and Information Technology
Institute of Information Technology
Jurijs Čižovs
Doctoral student of the Management Information Technology programme
DEVELOPMENT AND STUDY OF A
CONTROLLED MARKOV DECISION MODEL
OF A DYNAMIC SYSTEM BY MEANS OF
DATA MINING TECHNIQUES
Ph.D. Thesis Summary
Scientific supervisor
Dr.habil.sc.comp., Professor
A. BORISOVS
Riga 2012
UDK 519.857(043.2)
Či 958 d
Čižovs J. Development and study of a controlled Markov
decision model of a dynamic system by means of data mining
techniques. Ph.D. Thesis Summary.-R.: RTU, 2012.-38 p.
Printed according to the decision of the RTU Information
Technology Institute Council meeting, January 10, 2012,
protocol No. 12-02.
This work has been partly supported by the European Social Fund within
the National Programme „Support for the carrying out doctoral study
programme’s and post-doctoral researches” project „Support for the
development of doctoral studies at Riga Technical University”.
ISBN 978-9934-10-272-1
THE PH.D. THESIS
IS NOMINATED AT RIGA TECHNICAL UNIVERSITY FOR OBTAINING
A DOCTOR'S DEGREE IN ENGINEERING SCIENCE
Defence of the Ph.D. Thesis for obtaining a doctor’s degree in Engineering
Science will take place on 12 March, 2012 at Riga Technical University, Department of
Computer Science and Information Technology, 1/3 Meža Street, Room 202.
OFFICIAL REVIEWERS
Professor, Dr.math. Kārlis Šadurskis
Riga Technical University, Latvia
Professor, Dr.habil.sc.ing. Jevgeņijs Kopitovs
Transport and Telecommunication Institute, Latvia
Professor, Dr. rer. nat. habil. Juri Tolujew
Otto-von-Guericke University Magdeburg, Germany
DECLARATION
I, Jurijs Čižovs, declare that I have written this Thesis, which is submitted for
review at Riga Technical University for obtaining a doctor's degree in Engineering
Science.
Jurijs Čižovs ………………………………..
signature
Date: 3 February, 2012
The Ph.D. Thesis is written in Latvian. It contains an introduction, 6 chapters, a
conclusion, a list of references, 4 appendices, 70 figures, 15 tables and 51 formulae. The
list of references contains 83 entries. There are 137 pages in total.
GENERAL DESCRIPTION OF THE THESIS
Introduction
With the development of electronic management of production, trade, finance, etc., new
possibilities for storing structured data that reflect the economic activities of an enterprise have
appeared. The analysis of available data on the enterprise's past activity, aimed at developing
relevant management decisions, is one of the mechanisms that determine efficient enterprise
management. Since the enterprise's activity is observed over time, the data are
multidimensional time series. Thus, there arises the problem of decision making under
uncertainty with data that are multidimensional time series. The uncertainty stems from the
fact that it is technically impossible to capture all the internal and external factors affecting the
observed parameters.
Topicality of the problem
The mathematical framework of the Markov Decision Process (MDP) has been used
successfully to find optimal management strategies in discrete stochastic processes developing
over time. There are a number of modifications and enhancements aimed at solving tasks with
continuous parameters, partially observable environments, etc. However, the issues related to
building an MDP-model from data represented as time series remain open for research. The
complexity of the model building stems from the requirements that the MDP framework
imposes on the structure of the data under study. For the observed parameters, realisations of
time series of a certain type must be extracted from the relational data and converted into the
structure of the MDP.
The extension of the framework for working with time series allows one to take
advantage of the standard MDP framework to make decisions on economic problems in online
mode.
Goal of the research
The goal of the doctoral thesis is to develop the decision making framework based on the
Markov Decision Process for the dynamic systems in which the data are represented as time
series. To achieve the goal stated, the following tasks have to be solved:
1. To review the current state of the MDP framework application to the problems expressed in
terms of multidimensional time series and to explore the existing approaches.
2. To develop a method based on data mining to build time series, to process and transform
them into the structures that satisfy the MDP requirements.
3. To implement the new method in software based on an agent-oriented architecture, i.e., as an
intelligent system for decision making and support.
4. To consider the possibility of improving the method by the decision space approximation
implemented by means of Artificial Neural Networks.
5. To formulate the Dynamic Pricing Policy problem as a dynamic programming task (in the
context of Markov Decision Processes) in order to obtain an object for practical
experiments.
6. To test the obtained intelligent decision making agent system on the Dynamic Pricing
Policy problem in order to assess the effectiveness of the developed method on real-world
problems.
Object of the research
The object of the research is the advanced decision making method based on Markov
Decision Process. The scope of the framework application is oriented towards dynamic
programming tasks which contain the data expressed as time series.
Research hypotheses
The study puts forward the following hypotheses:
1. the task in which the data are represented as time series can be viewed as a task of dynamic
programming and expressed in terms of Markov Decision Process;
2. the approximation of the state space and decision space of Markov Decision Process can be
performed with the help of the artificial neural network approach.
Methods of the research
The central subject of the Ph.D. Thesis is the advanced decision making method based on
the Markov Decision Process. The maximum-likelihood technique, a statistical method for
estimating unknown parameters, is used to construct the probabilistic model within the
framework of the apparatus. Data mining techniques, including tools for data normalization,
clustering and classification, are employed. The methods of computational intelligence,
Reinforcement Learning and Artificial Neural Networks, are used. An agent-oriented
architecture is used for the software systems under development.
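The maximum-likelihood estimation of the transition probabilities mentioned above can be sketched as follows (a minimal illustration; the function name and the dictionary-based encoding are assumptions, not the thesis software):

```python
from collections import defaultdict

def estimate_transition_probs(transitions):
    """Maximum-likelihood estimate of the MDP transition function:
    P(s' | s, a) = count(s, a, s') / count(s, a).
    transitions: observed (state, action, next_state) triples.
    Returns a dict mapping (s, a) -> {s_next: probability}."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    probs = {}
    for sa, nxt in counts.items():
        total = sum(nxt.values())
        probs[sa] = {sn: c / total for sn, c in nxt.items()}
    return probs
```

The estimate converges to the true probabilities as the number of observed transitions grows, which is why the model must be updated as new data arrive.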
Scientific novelty
The decision making method based on Markov Decision Process is of scientific interest.
The main characteristic that distinguishes it from a standard Markov Decision Process is the
possibility to use it in the tasks with multidimensional time series.
The approach of the decision space approximation by means of Artificial Neural
Networks has been demonstrated for the task of Dynamic Pricing Policy (which is characterized
by multidimensional time series).
Besides, a new approach to building the agent-based system architecture is provided. This
approach allows one to avoid a conflict between the definition of the goal and the agent's
environment in cases where the agent does not interact directly with the object of the task being solved.
Practical use of the Thesis and approbation
The developed decision making method based on Markov Decision Process is designed
for tasks in which the state of the system is described by parameters that change over time and
not by static parameters.
The practical application of the intelligent agent system based on the Markov Decision
Process was demonstrated on the Dynamic Pricing Policy task. The testing data are actual
sales records from the real manufacturing and trade management system 1C:Enterprise v7. The data
cover a two-year period of manufactured food product sales.
The development of intelligent agent system based on Markov Decision Process was
implemented in 1C: Enterprise v7 framework, which allowed a direct access to sales data.
This work includes a series of experiments with several sub-systems (Artificial Neural
Networks, Markov Decision Process) on toy problems. Besides, a series of experiments on the
Dynamic Pricing Policy task was carried out in order to numerically evaluate the effectiveness
of the improved MDP framework.
Certain stages of the work and its results were presented at the following scientific
conferences:
1. Chizhov J. An Agent-based Approach to the Dynamic Price Problem, 5th
International KES Symposium Agent and Multi-agent Systems, Agent-Based Optimization
KES-AMSTA/ABO'2011, 29 June – 1 July, 2011, Manchester, United Kingdom. indexed in: SpringerLink, Scopus, ACM DL, DBLP, Io-Port.
2. Chizhov J., Kuleshova G., Borisov A. Manufacturer – Wholesaler System Study Based on
Markov Decision Process, 9th International Conference on Application of Fuzzy Systems
and Soft Computing, ICAFS 2010, 26 – 27 August, 2010, Prague, Czech Republic.
3. Chizhov J., Kuleshova G., Borisov A. Time Series Clustering Approach for Decision
Support, 16th
International Multi-conference on Advanced Computer Systems ACS-AISBIS
2009, 14 – 16 October, 2009, Miedzyzdroje, Poland. indexed in: Scopus, Web of Science
4. Chizhov J., Zmanovska T., Borisov A. Temporal Data Mining for Identifying Customer
Behaviour Patterns, Data Mining in Marketing DMM’ 2009, 9th Industrial Conference,
ICDM 2009, 22 – 24 July, 2009, Leipzig, Germany. indexed in: DBLP, Io-port.net.
5. Chizhov J., Borisov A. Applying Q-Learning to Non-Markovian Environments, First
International Conference on Agents and Artificial Intelligence (ICAART 2009), 19 – 21
January, 2009, Porto, Portugal. indexed in: Engineering Village2, ISI WEB of KNOWLEDGE, SCOPUS, DBLP, Io-port.net.
6. Chizhov J., Zmanovska T., Borisov A. Ambiguous States Determining in Non-Markovian
Environments, RTU 49th
International Scientific Conference, Subsection “Information
Technology and Management Science”. 13 October, 2008, Riga, Latvia. indexed in: EBSCO
7. Chizhov J. Particulars of Neural Networks Applying in Reinforcement Learning, 14th
International Conference on Soft Computing “MENDEL 2008”, 18 – 20 June, 2008, Brno
University of Technology, Brno, Czech Republic. indexed in: ISI Web of Knowledge, INSPEC
8. Chizhov J. Reinforcement Learning with Function Approximation: Survey and Practice
Experience, International Conference on Modeling of Business, Industrial and Transport
Systems “MBITS’08”, 7 – 10 May, 2008, Transport and Telecommunication Institute, Riga,
Latvia. indexed in: ISI Web of Knowledge
9. Chizhov J. Software Agent Developing: a Practical Experience, RTU 48th
International
Scientific Conference, Subsection “Information Technology and Management Science”, 12
October, 2007, Riga Technical University, Riga.
10. Chizhov J., Borisov A. Increasing the Effectiveness of Reinforcement Learning by
Modifying the Procedure of Q-table Values Update, Fourth International Conference on Soft
Computing, Computing with Words and Perceptions in System Analysis, Decision and
Control “ICSCCW – 2007”, 27 – 28 August, 2007, Antalya, Turkey.
11. Chizhov J. Agent Control in World with Non-Markovian Property, EWSCS’07: Estonian
Winter School in Computer Science. 4 – 9 March, 2007, Palmse, Estonia.
Publications
Several fragments of the Ph.D. thesis, as well as its results, are published in 11 scientific
articles. Most of the publications are indexed by international digital libraries (Springer, ISI
WEB, SCOPUS, DBLP, Io-port.net). The list of publications is included in the complete list of
literature provided at the end of this summary.
Main results of the Ph.D. Thesis
The decision making method based on the MDP, ensuring MDP-model building in
tasks that contain data presented as multidimensional time series, was developed as part of
the doctoral work. The method was tested in a series of experiments. As a result, numerical
estimates were obtained which allow us to conclude that the method is able to build an MDP-model
that adequately reflects the learning sample. The following tasks were solved and the
results obtained:
1. The review of mathematical methods based on Markov Decision Processes allows one to
conclude that MDP and Reinforcement Learning can be considered an effective method for
modelling the dynamic systems; some key problems of their use in tasks that contain data as
multidimensional time series were defined.
2. The analysis of several computational intelligence techniques (ANN, RL, agent-based
systems, data mining methods, etc.) made it possible to define the main features of the
developed method (a pipeline organization for transforming the data, timely updates of the MDP-model
and so on).
3. A special agent-based architecture was developed to avoid an incorrect description of the
interaction of the intelligent system with the environment.
4. The approximation of the state space and decision space of the MDP using an Artificial
Neural Network was implemented. The efficiency of the approach was demonstrated on a
toy problem.
5. The intermediate structure (the profile of a studied value’s behaviour) for storing and
processing the identified patterns of the investigated time series was formulated. The
behaviour profiles of the studied values are used to create the MDP-model.
6. An approach using different criteria for clustering time series (Euclidean distance and
shape-based similarity) according to the semantic load of each studied variable was
suggested.
7. For the purpose of testing the method, the problem of Dynamic Pricing Policy within the
MDP framework was formulated, and the software for implementing the experiments was
developed.
8. A series of experiments to quantify the effectiveness of the MDP-model building was carried
out. The assessment was based on the comparison of the resulting model and the training
sample as well as the application of the model on data outside the training set.
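The two clustering criteria mentioned in item 6 can be sketched as distance functions (a minimal illustration; the correlation-based form of the shape similarity is an assumption, since the summary does not specify the exact measure used):

```python
import math

def euclidean_distance(x, y):
    """Point-wise distance: sensitive to the absolute level of the series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def shape_distance(x, y):
    """1 - Pearson correlation: two series with the same shape but a
    different scale or offset are considered close."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)
```

For a sales level, the Euclidean distance is appropriate; for a demand trend, two series scaled differently but moving in the same way would be grouped together by the shape-based distance while remaining far apart in the Euclidean sense.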
Structure and contents of the thesis
Ph.D. Thesis consists of an introduction, 6 chapters, a conclusion, a list of references and
4 appendices. Ph.D. Thesis comprises 137 pages, it includes 70 figures and 15 tables. There are
83 sources in the list of references.
The structure of the Ph.D. Thesis is the following:
INTRODUCTION – the terminology used in the research process is introduced, and the subject
of the research, the objective of the paper and the tasks are formulated.
1 CHAPTER: THE MULTISTEP PROCESS OF DECISION MAKING IN THE DYNAMIC
SYSTEMS – the analysis of the multistep decision making method based on the
Markov Decision Processes is provided in this chapter. The key advantages, as well
as disadvantages that determine the development of an improved MDP apparatus are
described.
2 CHAPTER: THE REVIEW OF THE COMPUTATIONAL INTELLIGENCE METHODS
WHEN APPLIED TO THE DYNAMIC SYSTEMS – the chapter contains the
analysis of the variety of methods from the field of computational intelligence in the
aspect of their use in the developing MDP apparatus.
3 CHAPTER: THE DEVELOPMENT OF THE DATA MINING BASED SYSTEM FOR THE
DYNAMIC SYSTEM’S MDP MODEL BUILDING – the central chapter is devoted
to the development of an MDP-based method able to generate a model based on data
represented by multidimensional time series.
4 CHAPTER: THE APPLICATION OF AN MDP MODEL BUILDING USING METHODS
OF DATA MINING IN THE PROBLEM OF DYNAMIC PRICING POLICY – in
this chapter the task of Dynamic Pricing Policy in the context of Markov Decision
Process for performance of experiments is formulated.
5 CHAPTER: THE PERFORMANCE OF EXPERIMENTS REGARDING THE MODEL
BUILDING IN THE PROBLEM OF DYNAMIC PRICING POLICY – the
experiments aimed at getting the numerical results of the efficiency of a new method
are described in the chapter. The description of the developed software is presented.
6 CHAPTER: THE ANALYSIS OF THE RESULTS AND CONCLUSIONS – the final chapter
is devoted to the analysis of the findings. The directions of further research are
defined.
APPENDIX – includes the structures of intermediate data, fragments of MDP model in XML
format and algorithms used in the research.
SUMMARY OF THESIS CHAPTERS
Chapter 1 (Multistep process of decision making in the dynamic systems)
The first chapter deals with methods of solving dynamic tasks whose solution is
achieved by the consecutive performance of operations aimed at the result. Dynamic
Programming, being a fundamental apparatus, underlies the Markov Decision Process
framework, which in its turn has given rise to its own family of methods.
However, certain difficulties arise when such methods are used in modern economic
applications. The main problem lies in developing the model that serves as the environment
in which the multistep decision making methods operate. To solve this model development
problem, the possibility of using regularity mining procedures in dynamic tasks is investigated.
Optimal decision making in modern real management problems cannot be reduced to
gaining a short-term or incidental profit. Efficient management in an economic application
thus implies achieving the maximum total value of the observed parameter (for example,
profit) within a limited or unlimited number of phases. Thereby, there exists a class of
tasks whose solution is achieved not at once but gradually, step by step. In other
words, decision making is considered not as a single act, but as a process consisting of many
phases [76].
Dynamic Programming (DP) is a mathematical apparatus enabling the optimal planning
of multistep controlled processes and of processes that depend on time [78]. A plan is a finite
sequence of decisions made. The fundamental property of DP is that the decisions under
development are not isolated from each other [40] but are coordinated with each other in
order to reach the goal state.
The Markov Decision Process (MDP) [17, 36, 54] extends the Markov chain by
introducing the concept of a controlling action. The probability of transition from one state to
another is conditioned on the chosen action (decision). Figure 1 shows an example of the
transition graph of a Markov Decision Process in the classical toy problem of garbage
disposal by a robot. Each black dot is an action available in the corresponding state. The taken
action determines the possible further transitions.
Figure 1. Example of Markov Decision Process transition graph
In fact, the matrices of transition probabilities P and rewards (or reinforcements) R
describe the “physics” of the process, from which an appropriate policy of garbage
disposal is calculated. Thus, even minor changes of individual matrix values may lead to
fundamentally different policies. In this chapter the formal definition of MDP and
the structure of its tuple are described by (1).
(The figure depicts the robot’s states Charge, Wait and Clean, the actions available in each
state, and the transition probability p and reinforcement value r on each edge, e.g. p = 0.9,
r = 0 for continuing to wait. For the given model it is necessary to find a garbage disposal
policy that gains the maximum reward and does not discharge the power supply of the
mechanical robot.)
⟨S, A, P, R⟩ ,  (1)
where S – the finite set of discrete states, S = {s1, s2, ... , s|S|}. Each state st reflects the current
value of a vector of observable parameters, st = (x1, x2, ..., xn). Thus, the state is all the
information available on the dynamic system at a certain moment of time.
A – the finite set of controlling impacts (actions), A = {a1, a2, ... , a|A|}. Usually an action ai
immediately changes one or several parameters xi of state st.
P – the state transition function determines the probability that action a, taken in state s at time
moment t, will transfer the system to state s' at time moment t+1. It is a mapping of the kind
P: S × A × S → [0, 1].
R – the reward function determines the expected reward obtained immediately after the
transition to state s' from state s as a result of action a. It is a mapping of the kind
R: S × A × S → ℝ. In fact, it determines the goal state that is to be achieved.
The solution of the MDP is the optimal action policy π* that defines for each state st the
appropriate action ai. In this case, the mapping is π: S → A.
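The computation of such a policy for a small MDP of the form (1) can be sketched with standard value iteration (a minimal illustration; the state and action names and the dictionary-based encoding are assumptions, not the thesis software):

```python
def value_iteration(S, A, P, R, gamma=0.9, tol=1e-6):
    """Find an optimal policy for a finite MDP <S, A, P, R>.
    P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the
    expected immediate reward. Returns (policy, state-value function)."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            best = max(R[s][a] + gamma * sum(p * V[sn] for p, sn in P[s][a])
                       for a in A)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {s: max(A, key=lambda a: R[s][a]
                     + gamma * sum(p * V[sn] for p, sn in P[s][a]))
              for s in S}
    return policy, V

# A two-state example: from s0 the action "go" leads to s1, where "stay"
# keeps collecting reward 1, so the optimal policy is go from s0, stay in s1.
S, A = ["s0", "s1"], ["stay", "go"]
P = {"s0": {"stay": [(1.0, "s0")], "go": [(1.0, "s1")]},
     "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s0")]}}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 1.0, "go": 0.0}}
policy, V = value_iteration(S, A, P, R)
```

The algorithm converges because the update is a contraction with factor γ; this is the global convergence property referred to below.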
The key advantages of MDP are convergence towards the globally optimal policy, as
well as the simple structure of the model. Explaining the obtained action policy π*
(the actual solution of the task) is not complicated, in contrast to ANN solutions. The
policy can also be expressed using different methods of knowledge representation, for example,
decision trees or decision tables [67].
The significant disadvantage of MDP is the absence of mechanisms for automatic
model building. The solution, a policy maximizing the expected discounted sum of rewards,
can be obtained only if the transition matrix P and the reward function R, which form the base
of the model, are known. Their construction is rather difficult in real tasks.
On the basis of the study and analysis of MDP and RL models, Table 1 presents their main
characteristics in the problems of studying stochastic processes and building an appropriate
model. The standard approach to working with non-Markov systems is to increase the
memory so as to deal with the prehistory of transitions. Based on this principle, the approach
of introducing states built from time series for creating the Markov model is considered.
Table 1
Advantages and disadvantages of models based on MDP

Markov Decision Process
Advantages:
– global convergence;
– the policy is built taking into account delayed rewards;
– simple methods of calculating the policy.
Disadvantages:
– prior knowledge of the system model is needed;
– the method is complex to implement in non-Markov systems.

Reinforcement Learning
Advantages:
– global convergence;
– a system model is not needed (unsupervised learning case);
– the ability to work in systems that do not possess the Markov property;
– the policy is built taking into account delayed rewards.
Disadvantages:
– the Exploitation-Exploration trade-off exists;
– developing the model through exploration is not allowed in a number of practical tasks;
– complexity of application in non-Markov systems.
One of the disadvantages of MDP is also the complexity of policy search in so-called
non-Markov systems: dynamic systems that do not satisfy the Markov property. The
development of a process in a non-Markov system depends not only on the current state but also
on the sequence of states that occurred in the past. Solving non-Markov problems is
possible by describing the process with a mechanism of memory. It is also
implied that the statistical properties in the future depend on the character of the process
evolution in the past. Such an approach complicates the solution and practically deprives it of
applicability in real tasks.
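The memory mechanism described above can be sketched as history augmentation: each state is replaced by a tuple of the last k observations, so that a process whose future depends on a bounded prehistory again satisfies the Markov property over the augmented states (a minimal sketch; the function name and fixed window are illustrative assumptions):

```python
from collections import deque

def history_states(observations, k=3):
    """Replace each observation with the tuple of the last k observations.
    The augmented state carries the bounded prehistory explicitly, at the
    cost of a state space that grows exponentially in k."""
    window = deque(maxlen=k)
    states = []
    for obs in observations:
        window.append(obs)
        if len(window) == k:
            states.append(tuple(window))
    return states
```

The exponential growth of the augmented state space in k is precisely why this standard approach becomes impractical for real tasks.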
The review of the mathematical apparatus based on Markov processes enables the
conclusion that Markov Decision Processes and Reinforcement Learning can be considered
effective methods for modelling dynamic systems.
The research into the contemporary state of MDP and RL methods detected the key
problems of their application in real tasks of economics, management, etc.
The analysis of the detected problems made it possible to formulate an approach
based on data mining methods that improves the MDP framework and enables its use in
modern tasks described in the setting of a non-Markov dynamic system.
Chapter 2 (The review of the Computational Intelligence methods when applied to the dynamic
systems)
This chapter substantiates the necessity of applying Computational Intelligence (CI)
methods in order to develop an effective decision making system. Different architectures of
agent systems are analyzed in order to develop an agent architecture specific to the dynamic
system [5]. The agent-based approach, in its turn, makes it possible to represent the software
system and its interaction with the real-world task being solved in a way that is natural for a
human. Experiments on model dynamic systems involving Artificial Neural Networks, with
the aim of approximating the decision spaces, are performed as well.
There are several formal definitions of Computational Intelligence. The concept of CI is
defined in [40] as a set of computational models and tools providing intelligent adaptation:
immediate perception of primary sensor data, their processing with parallelization and task
transfer, and the creation of safe and timely responsive systems with a high level of resiliency.
Usually the immediate processing of “raw” data by intelligent software instruments is
impossible. This is determined by the stringent requirements of the algorithms on the data
structure. Thus, for instance, MDP models work with a fixed data structure defining the state. A
mediator of some kind is therefore needed between the physical data carrier and any
intelligent method [5]. This, in its turn, ensures a pipelined organization of the task. An
example of a system providing direct interaction of intelligent tools with the task is
represented in Figure 2.
Figure 2. The way of immediate interaction
(The figure shows raw data flowing from the database through preprocessing into structured
data; the intelligence tools then produce a decision, which reaches the expert through the
interface, and the expert acts on the database.)
The Ph.D. thesis also considers an indirect impact of the intelligent system on the physical
source of a task (in this case, a database), which is typical for tasks in which erroneous
decisions incur high expenses of any kind.
Plenty of Computational Intelligence methods are involved in the developed approach
to dynamic systems. The methods relevant to this research and their position in the family of
Computational Intelligence methods [10] are represented in Figure 3, based on the
classification given in [40].
Figure 3. A fragment of the Computational Intelligence family tree
In this particular work we take as a basis the definition of an agent suggested in [28]:
“An autonomous agent is a system situated within and a part of an environment that
senses that environment and acts on it, over time, in pursuit of its own agenda and so as to effect
what it senses in the future”.
There are plenty of agent types that meet this definition either partly or fully.
Depending on the properties agents possess, several classes of agents are distinguished [11].
The most typical among them are: programmable agents (reactive agents, reflexive agents [58]),
learning agents and planning agents [37]. The properties that an agent of some class can
possess [28] are given in Table 2.
Table 2
The properties of software agents

Reactive (sensing and acting) – responds in a timely fashion to changes in the environment
Autonomous – exercises control over its own actions
Goal-oriented – does not simply act in response to the environment
Temporally continuous – is a continuously running process
Communicative – communicates with other agents, perhaps including people
Learning – changes its behaviour based on its previous experience
Mobile – able to transport itself from one machine to another
Flexible – actions are not scripted
Character – believable “personality” and emotional state
(The family tree in Figure 3 shows Computational Intelligence branching into Granular
Computing, Evolutionary Computing, Artificial Life and Neuro-computing; the latter covers
supervised learning, unsupervised learning and Reinforcement Learning, including ANN, RL
and LCS.)
One of the disadvantages of Reinforcement Learning is the exponential growth of the
problem space with each new dimension [58, 62]. Further, the most common methods of dealing
with this problem (known as “the curse of dimensionality”) are considered.
Two main approaches to working with a large number of states are examined in the
chapter: the approximation of the value function and policy gradient methods. One of the
methods belonging to nonlinear approximation is the framework of Artificial Neural
Networks (ANN).
To analyse ANNs as function approximators, a multilayer perceptron trained by the
error back-propagation method is implemented in this thesis. The existing commercial and
freely distributed software implementations of Artificial Neural Networks are reviewed too.
The plan of experiments includes the following tasks:
1. to implement the author’s own ANN and to study the efficiency of the network on the
example of approximating a one-dimensional stochastic process;
2. to compare the obtained approximation results with the results of existing ANN
packages;
3. to approximate the state space in Reinforcement Learning, using a toy
problem for demonstration purposes.
Within the framework of the first experiment, the Artificial Neural Network demonstrates
high learning quality. The application of three hidden layers with 70 neurons in each ensures a
mean-square error of ems = 0.0013 (see Figure 4). Such a level of error ensures sufficient
precision for modelling a one-dimensional stochastic process consisting of 30 observations.
Figure 4. Function approximated by ANN having three hidden layers
The comparison of the results with the two most common ANN packages
(NeuroSolutions 6.0 and Multiple Back-Propagation v.2.2.2) was performed within the
framework of the second experiment. The NeuroSolutions 6.0 software, for the simplest network
architecture, approximates with a mean-square error of ems = 0.00943. The Multiple
Back-Propagation package converges to an error of ems = 0.0012. The comparison
allows us to conclude that the achieved precision matches that of the third-party packages. This
allows us to use our own implementation of a neural network in the subsequent experiments.
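A back-propagation training loop of the kind used in these experiments can be sketched as follows (a deliberately small sketch: one hidden layer of 20 tanh units instead of the 3 × 70 architecture reported above, a sinusoid standing in for the stochastic process, and all names being illustrative assumptions):

```python
import math
import random

random.seed(0)

# Training data: 30 samples of a one-dimensional process (here a sinusoid).
data = [(i / 29.0, math.sin(2 * math.pi * i / 29.0)) for i in range(30)]

H = 20                                           # hidden tanh units
W1 = [random.gauss(0, 0.5) for _ in range(H)]    # input-to-hidden weights
B1 = [0.0] * H
W2 = [random.gauss(0, 0.5) for _ in range(H)]    # hidden-to-output weights
b2 = 0.0
lr = 0.05

def forward(t):
    h = [math.tanh(W1[j] * t + B1[j]) for j in range(H)]
    return h, sum(W2[j] * h[j] for j in range(H)) + b2

for _ in range(10000):                           # stochastic gradient descent
    t, y = random.choice(data)
    h, out = forward(t)
    err = out - y                                # gradient of 0.5 * err^2
    for j in range(H):
        dh = err * W2[j] * (1 - h[j] ** 2)       # back-propagated through tanh
        W2[j] -= lr * err * h[j]
        W1[j] -= lr * dh * t
        B1[j] -= lr * dh
    b2 -= lr * err

mse = sum((forward(t)[1] - y) ** 2 for t, y in data) / len(data)
```

Even this minimal network drives the mean-square error well below the variance of the target series, illustrating why the larger architectures above reach errors of the order of 0.001.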
Figure 5. 3-step algorithm of approximated model building
(The figure shows the three steps: rough RL producing a tabular Q-function; its
approximation by an ANN; and final learning with RL + ANN, yielding the approximated
Q-function.)
The idea of approximating the Q-function in Reinforcement Learning with an Artificial
Neural Network is realized in the third experiment. The key problems of implementing the
method are presented; to overcome them, a new ANN learning approach (embedded in RL) is
suggested in the Ph.D. thesis (see Figure 5).
To test the approach we use the mountain car toy problem [9]. In this problem, the
gravitational force exceeds the engine power, so it is impossible to ‘climb up’
directly from the state of rest. The only solution is to develop a strategy of rolling from
one slope to the other in order to accumulate supplementary momentum. The problem
demonstrates the necessity of repeatedly moving away from the peak in order to reach it in
the future. The available actions are: engine inactive (0), acceleration forward (+1) and
acceleration backward (-1).
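The tabular Q-learning step at the core of this experiment can be sketched with the parameters listed below for Figure 6 (α = 0.3, γ = 0.99); the dictionary-based state encoding is an illustrative assumption:

```python
def q_update(Q, s, a, r, s_next, alpha=0.3, gamma=0.99):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    Q maps each (discretized) state to a dict of action values."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]
```

Applied over thousands of episodes on the discretized (position, speed) grid, updates of this form produce the optimal Q-function surface shown in Figure 6.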
The optimal policy of the toy problem is obtained using discrete RL with a tabular
Q-function. The surface of the Q-function over the state space is represented in Figure 6. The
action dimension is omitted; the value of the optimal action at each point of the space is
shown.
Scaling factor of the Q axis: 0.1
Space size: 70 × 80
Episodes count: ≈ 8 000
ε: 0.1 (probability of a random action)
γ: 0.99 (discount-rate parameter)
α: 0.3 (learning rate)
λ: 0.92 (trace-decay parameter)
Figure 6. Example of the optimal policy Q*(s) = maxa Qt(s, a)
It is necessary to obtain a similar surface as the result of the three-step algorithm. The first step is the development of the first approximation (a rough policy). It is experimentally established that a discretization of the space into a grid of 20 x 20 cells is sufficient for this objective. The second step is the transfer of the intermediate Q-function into the ANN. Now the precision of the function depends on the "capacity" of the network. On the basis of a series of experiments, it is identified that 6 hidden layers containing 110 neurons each are sufficient to reproduce the surface of the Q-function [9].
When the changes of the network coefficients approach zero (∆eij→0), we can move on to the third step – training the network in the mode of interaction with the environment through the RL framework. The experiments demonstrate that after ten learning iterations the network "forgets" the previously learned examples unless they are supplied for training continuously together with the current training examples. Taking the prior steps into consideration, the rough policy is a matrix of reference points. Such a matrix supports the "memory" of the neural network, not allowing it to forget the reaction to states rarely encountered during learning in the environment [9]. As a result of learning, the surface of the Q-function demonstrated in Figure 7 is obtained in the third step.
The obtained policy looks substantially smoother than its tabular analogue. On the one hand, this reduces precision; on the other hand, it allows working in environments with continuous parameters.
The experiments with the toy problem showed that the algorithm avoids the problem of an absent initial training set and permits functioning in continuous environments [9]. However, the learning time increases.
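The "matrix of reference points" idea from the third step can be sketched as follows: each training batch for the network mixes fresh RL transitions with anchor examples sampled from the rough tabular Q-function, so that rarely visited states are not forgotten. This is a minimal sketch of the replay scheme only; `rough_q`, `fresh` and `make_batch` are illustrative names, and the actual ANN update is outside the fragment.

```python
import random

def make_batch(fresh, rough_q, batch_size=32, anchor_frac=0.5):
    """Mix fresh RL samples with reference points from the rough policy.

    fresh   : list of ((state, action), q_target) pairs from recent interaction
    rough_q : dict mapping (state, action) -> q value (the tabular first stage)
    """
    n_anchor = int(batch_size * anchor_frac)
    # anchor examples keep the network's "memory" of the rough policy
    anchors = random.sample(list(rough_q.items()), min(n_anchor, len(rough_q)))
    recent = random.sample(fresh, min(batch_size - len(anchors), len(fresh)))
    batch = [((s, a), q) for (s, a), q in anchors] + recent
    random.shuffle(batch)
    return batch
```

Each call returns a shuffled batch in which roughly half of the examples come from the reference-point matrix, which is one simple way to realize the continuous re-supply of prior examples described above.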
Figure 7. Q-function for the action "throttle", obtained in the third step
It is possible to conclude that ANN learning by the error back-propagation method provides a powerful tool for approximating functions given in tabular form [34]. The performed experiments on the toy problem demonstrate that this property can successfully be used for approximating the Q-function in the Reinforcement Learning algorithm.
The consideration of the concepts of Computational Intelligence and the appropriate methods described in this chapter allows us to draw the following conclusions:
1. the analysis of some concepts of Computational Intelligence, as well as of its individual methods (ANN, RL, the agent approach, methods of data mining and others), allowed us to describe the main features of the architecture under development (conveyor organization of the data-transformation flow, timely response to events, etc.), as well as the applied methods of intelligent computing;
2. the review of expressing a Markov Decision Process through software agents demonstrates a number of sufficient methods and a wide range of solvable tasks; at the same time, problems linked to the conveyorization (pipelining) of the problem require the development of a special agent-system architecture;
3. following the principle of method synergy, Artificial Neural Networks are considered in this work as an approach to approximating the state space of a Markov Decision Process.
Chapter 3 (The development of the Data Mining based system for the dynamic system’s MDP
model building)
The third chapter of the research develops and describes the mathematical basis of an intelligent system for a decision-making task whose data are expressed using multidimensional time series structures.
The solution is based on finding hidden regularities between the so-called families of time series. A series of Data Mining technologies is used for that purpose. The concepts of classes and profiles of the dynamic behaviour of observed variables are introduced for storing and operatively processing the mined data. The behaviour profiles are interpreted as single-step Markov transitions, which are used to create the model of the process under investigation.
Within the framework of this chapter, it is necessary to perform the following (in order to
solve the set task):
1. a review of the current situation in the field of applying the method to tasks expressed through multidimensional time series, as well as research on possible existing approaches;
2. the development of data-mining-based methods to build the time series, process them and convert them into data structures (states of the system) compatible with the Markov Decision Process.
The research and development of the Markov model building method for data expressed through time series provides the solution to the following problems:
– cleaning of the raw data of the observed process, construction of the time series;
– formalization of the time series as an environment needed for building the state space;
– composition of a transition graph reflecting the generalized dynamic behaviour of the observed variable values;
– development of optimal policies and their use.
Thereby, the approach is based on the generalization of individual observations, the building of the state space and the further development of the model. In general, the functioning of a system based on the methods being developed has to contain the following steps:
– to "acquire" the existing observations regarding the development of the process under investigation:
o to identify the regularities of transitions into one or another state under the appropriate influence;
o to build the state transition graph representing the general model of the process being researched;
– to build a particular realization of the Markov Decision Process for the sought solution at the current parameters on the basis of the model;
– to develop a policy of actions for the particular realization of the MDP;
– to invite the expert to perform the action prescribed by the developed policy at the given parameters;
– to accomplish the permanent renewal of the environment, the transition graph and the policy upon acquiring new data on the actual process progress.
The chapter provides the details of the development of the MDP model building method. The method of transforming data expressed through time series into the structure of a Markov Decision Process is suggested and described. For that purpose, the concept of the behaviour profile of observed variable values is introduced. The criteria for comparing time series are reviewed.
Description of the approach. One of the directions of using Markov Decision Process models is research on the dependence of the dynamics of one variable on the others. The final objective is the application of the obtained model for timely decision making concerning the performed activity, with the aim of achieving the desired indicators of the dynamic system. The main problem, as with most mathematical models, is building it.
As in [33], the whole set of observations is considered a source of behaviour patterns (profiles); but, unlike [33], in this work the model is based not on a particular realization of a time series, but on many realizations corresponding to different combinations of the time series parameter values. Clustering the realizations of time series allows us to consider new operations on transition models: the minimally needed supplementing of a transition model using fragments of other analogous models. This operation allows us to extend the existing transition model in such a way that, with a certain probability, the model will allow the system to move into a state that was not stipulated during the learning phase (model building).
In general terms, the building of a dynamic system model presupposes the investigation of the particular regularities that took place in the development of the processes under observation, their generalization and their expression in some structure [4,7].
The dynamic system model building method being developed follows the described approach. The mathematical framework of the Markov Decision Process is considered as the model. The
state transition graph is used as a graphical expression of the model. The method of model building (see Figure 8) is divided into the following main stages [6]:
1) to process the raw data and build the time series T;
2) to identify the general regularities of the development of the processes and build the behaviour profiles П of the observed variables;
3) to find general transitions in the dynamic behaviour profiles and build the model of the processes' evolution (the transition graph).
Figure 8. The main stages of building of the Markov transition models
The key moment of transforming time series into the MDP structure is the interpretation of the dynamic behaviour profile in terms of forming the state set S [6]. In the context of the Markov Decision Process, the concept of a state is connected with the theory of intelligent software agents [62]: all the information available to an agent, gained by the agent's sensors from the environment at a certain time moment, is called a state.
For a task whose data are represented by time series, a special approach to determining the states of the dynamic system is offered [6]. The key difference is the consideration not of static variable values, but of the corresponding time series; in other words, the evolution dynamics of the variables being researched is considered. The following interpretation of the dynamic behaviour profile meets these aims. Let the behaviour profile, gained as a result of time series clustering, consist of two centroids: v1 and v2 (see Figure 9).
Figure 9. Sample of behaviour profile of volumes and sale prices
Thus, we identify a structure in the profile that includes three elements: a) the dynamics of variables v1 and v2 before event ei; b) the event ei, happening at the time moment ta; c) the dynamics of variables v1 and v2 after event ei. The area of variable evolution observed before the event is interpreted as the initial state s0 ∈ S (see Figure 9, area t = [1; ta)), the event ei – as the action a ∈ A, and the area of variable evolution observed after the event – as the transition state s1 ∈ S (area t = [ta; tmax]). The dynamic behaviour profile П can then be considered as a single deterministic Markov transition (see Figure 10) [6]. Taking into account that the transition is built on actual observations, the value of the transition probability p(s1|s0, a0) is, for the time being, equal to one.
The states of the set S are marked with white circles; the grey circle marks the action a0, as a result of which the deterministic transition happens. Once the corresponding state identifier is assigned to each fragment of each profile, it becomes possible to work not with the time series but with the corresponding states, which is a necessary condition for working with a Markov Decision Process.
Figure 10. Interpretation of sales profile
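The interpretation above can be sketched directly in code: everything before the event time ta becomes the initial state, the event becomes the action, and everything from ta onward becomes the transition state. This is an illustrative sketch only; the function and field names are not from the thesis software.

```python
def profile_to_transition(v1, v2, t_a, event):
    """Interpret one behaviour profile (two centroid series) as a single
    deterministic Markov transition (s0, a, s1)."""
    s0 = (tuple(v1[:t_a]), tuple(v2[:t_a]))   # dynamics before the event, t = [1; ta)
    s1 = (tuple(v1[t_a:]), tuple(v2[t_a:]))   # dynamics after the event, t = [ta; tmax]
    # the transition is built from an actual observation, so its
    # probability p(s1 | s0, a) is, for the time being, equal to one
    return {"s0": s0, "a": event, "s1": s1, "p": 1.0}
```

Aggregating many such atomic transitions (and normalizing their counts, as in formula (3) below) turns the deterministic probability of 1.0 into proper transition probabilities.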
Description of the goal state and the problem. Defining the goal state at the stage of building the state set is not necessary. However, as soon as the model is built, the initial and the goal states should be defined in order to build an action policy. Accordingly, we describe how the goal state of the Markov Decision Process is interpreted in a problem expressed through time series.
The goal state is the appropriate proportion of the observed variables vi meeting the requirements of an expert. For example, consider the observed variables in the problem of car sales: v1 – the volume of sales and v2 – the sales price. Then, for some fixed parameters of the space Ψ, a state can be considered a goal state if the time series of sale volumes v1 and sale prices v2 meet the given values.
The action A and reward R matrices. When creating the model of the researched system, the action matrix may be obtained using several methods, for example:
1) the set of actions A contains only the actions present in the set of profiles П;
2) the set of actions A contains all the allowed values of change of the controlled variable vi within some range, irrespective of the presence of the specific value in the profiles П.
Analogously to the goal states, without detailing the method of building the reward matrix R, let us consider its definition for a problem expressed through time series. The reward matrix R is commensurate with the number of states and, for each state, keeps the value of the reward gained by the system (or agent) upon reaching the current state si. For example, let the reward take the value 1.0 if the goal state sgoal is achieved, and the value -0.04 in the opposite case. The matrix R can then be presented by the function:
ri = R(si) = { 1.0, if si ∈ Sgoal; -0.04, if si ∉ Sgoal }.   (2)
In this way, to provide the system with the reward matrix R, it is enough to determine the set of goal states Sgoal.
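Formula (2) translates directly into code; only the set of goal states has to be supplied. The function name is illustrative.

```python
def reward(state, goal_states):
    """Reward function (2): 1.0 in a goal state, -0.04 otherwise."""
    return 1.0 if state in goal_states else -0.04
```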
Transition probabilities graph P. A central element of the Markov Decision Process is the transition graph, in which it is necessary to find the optimal action policy π*. The building of the transition graph is the most complicated part of expressing the task being researched in terms of MDP.
In the case of many economic tasks the transition model cannot be expressed analytically so as to be correct for all states, but it can be obtained in the form of a table based on the generalization of actual transitions. The table representation makes it possible to describe the resulting state for each state-action combination. Thus, the transition graph is built for the whole space Ψ (based on the definition of a state).
The transition graph is built in the process of generalizing atomic transitions (the dynamic behaviour profiles П). The generalization is the calculation of the probability of a transition from state s into state s' when performing action a, and it is based on the number of actual observations of this transition (3). Since the environment is fully observable (a training set is available), it is reasonable to use a statistical approach to evaluate the unknown parameters when calculating the transition matrix P. We consider one of the simplest approaches in this research – maximum likelihood estimation. Thus, the number of factual observations of each transition relative to the total number of transitions from the considered state is expressed as follows:
P(s'|s, a) = N(s, a, s') / Σs'' N(s, a, s''),   Σs' P(s'|s, a) = 1,   (3)
where s' – the target state; s'' – any state of the set S into which a transition from state s under action a is possible (i.e. appropriate factual observations exist); N(s, a, s') – the number of factual transitions from state s into state s' when performing action a; Σs'' N(s, a, s'') – the total number of transitions from state s under action a into any possible state.
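The maximum likelihood estimate (3) amounts to counting and normalizing the observed transitions. A minimal sketch, with illustrative names:

```python
from collections import Counter, defaultdict

def estimate_transitions(observations):
    """Formula (3): observations is an iterable of (s, a, s_next) triples;
    returns a dict mapping (s, a, s_next) -> P(s_next | s, a)."""
    counts = Counter(observations)          # N(s, a, s')
    totals = defaultdict(int)               # sum over s'' of N(s, a, s'')
    for (s, a, _), n in counts.items():
        totals[(s, a)] += n
    return {(s, a, s2): n / totals[(s, a)] for (s, a, s2), n in counts.items()}
```

By construction the probabilities for each (s, a) pair sum to one, matching the normalization condition in (3).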
The development of the clustering procedure. The clustering procedures are the central generalizing mechanisms for the actual observations in the process of MDP model building. The precision of the prognosis and the "adequacy" of the future model to the learning data depend on the correct choice of clustering criteria with respect to the clustered data and on the accepted clustering parameter values.
In this research, agglomerative hierarchical clustering of time series is applied, which requires forming a symmetric distance matrix. Since the objects of clustering are time series, it is first of all necessary to define a metric that allows us to quantitatively compare the similarity of one time series with another.
To calculate the distance, we use two simple criteria: the Euclidean distance and a shape-based criterion. The latter allows us to compare not the absolute values, but the shapes of the curves of the corresponding time series (4).
S(a, b) = Σi=1..N-1 |Δai − Δbi|,   Δai = ai+1 − ai,   Δbi = bi+1 − bi.   (4)
The Euclidean distance makes it possible to group time series close to each other by distance, while the shape-based criterion groups them by the outline (profile) of their curves (see Figure 11).
Figure 11. Two criteria for evaluation of time series proximity
This kind of approach allows us to obtain clusters of time series close in shape, in groups homogeneous regarding the scale of distribution. The clustering of the time series of sale prices is made only according to the shape-based criterion.
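The two proximity criteria can be sketched as follows; `shape_distance` is a direct reading of formula (4), comparing step-to-step increments rather than absolute values.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def shape_distance(a, b):
    """Shape-based criterion (4): S(a, b) = sum_i |delta_a_i - delta_b_i|,
    where delta_x_i = x[i+1] - x[i]."""
    da = [a[i + 1] - a[i] for i in range(len(a) - 1)]
    db = [b[i + 1] - b[i] for i in range(len(b) - 1)]
    return sum(abs(x - y) for x, y in zip(da, db))
```

Note that two series with identical shape but shifted levels have a shape distance of zero while their Euclidean distance is large, which is exactly the distinction illustrated in Figure 11.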
MDP-model building method representation in the form of an intelligent system. The basis of dynamic programming methods is the principle of making consequent interim decisions leading to the objective. The technique of decision search thus acquires iterative features. As a consequence, it becomes preferable to realize the method in the form of a software tool. It is necessary to develop the architecture of the software tool, which will predetermine the structure of the intelligent system. The created system ensures the interaction among the dynamic programming methods presented below, as well as the interaction with the database (the problem's data source) and with an expert (operator).
Figure 12. Diagram of intellectual system functioning and its interaction with the environment
The architecture of the intelligent system has to ensure the interaction with the user database, the expert and the operator. The architecture of a system meeting all the mentioned demands is given in Figure 12. The elements of the intellectual system are marked with a double outline, including the processing modules, the time series storage and the profile storage. The intellectual system possesses a visual interface for interaction with managers (operators) and a programming interface for interaction with the user database.
The directions of the data streams are marked with dotted lines, the control streams – with solid lines. The system being designed possesses characteristic properties such as autonomous and uninterrupted functioning, adaptation to changing data (in other words, learning of some kind), goal-orientation (the presence of a goal of optimizing some parameters of the process being researched by means of model building), communicativeness (interaction with the users and the database), and others. This allows us to consider the software system as an intelligent software agent.
Chapter 4 (The application of MDP model building using methods of Data Mining in the
problem of Dynamic Pricing Policy)
The practice of sales administration points out the necessity of a Dynamic Pricing Policy to increase competitiveness. Markov models are effective when decision making involves uncertainty in the chronicle of events and crucial events can take place repeatedly. The aim of this chapter is to demonstrate, using a test example, the applicability of the Markov Decision Process to solving the Dynamic Pricing Policy problem. Respectively, the tasks of this chapter are the following:
– to formulate the problem of Dynamic Pricing Policy as a task of dynamic programming;
– to formulate and describe the method of dynamic control of the pricing policy based on the Markov Decision Process;
– to formulate the method of MDP model building on the basis of regularities detected, using Data Mining tools, in the factual sales data.
The basic reasons for choosing Dynamic Pricing Policy as the experimental problem are:
– the presence of a time dimension, determining the possibility of representing the sales data as time series;
– the possibility of generalization over a number of observed variables (for example, wholesale customer, goods and others) with the aim of building the model;
– the possibility of considering the price correction process as a system that is, at every moment of time, in a certain state, possesses a controlling mechanism (change of state) and has the conception of a goal state.
The aforementioned reasons are important because they allow us to demonstrate the features of the Markov Decision Process application approach being developed in the Dynamic Pricing Policy problem. Finally, the topicality of Dynamic Pricing Policy, dictated by the rapid evolution of internet technologies in modern business, also determines the choice of this task.
The source of sales information, the database and the method development platform in the problem under consideration is the enterprise resource planning system 1С:Enterprise v7.
The task of Dynamic Pricing Policy has several definitions. Here we consider the following one: dynamic pricing is the operative adjustment of prices to customers depending on the value that the customers associate with the product or service [49] (the definition, in its turn, is based on [56]). As the value that the customers associate with the product or service, we consider the three forms of price differentiation offered in this work [69].
A decision-making system is developed; it aims at the long-term maximization of sales sums, which is achieved through the correction of existing sale prices taking into account the available factors of the ERP system (see Figure 13) and the goal state set by an expert.
Figure 13. Interaction of price generator module and module of price correction
The mechanism of price correction is realized through the use of the MDP model. Building the price correction model includes finding the regularities of the sales evolution process in the past and generalizing them (see Chapter 3). In other words, the solution is reduced to the analysis of past changes and the creation of an appropriate model of the price evolution process.
The general objective of Markov controlled processes with discounting of incomes is the choice of a system-managing vector so as to gain the maximum profit over the horizon of the system's functioning [80]. Due to this property, the MDP framework is appropriate for the Dynamic Pricing Policy problem.
This work originally stipulates the processing of multidimensional data, including data on wholesale buyers and product names; each dimension of the space contains some hundreds of values. The methods of Data Mining are applied in order to discover the regularities that describe the outcomes of pricing policy actions in the past. A software tool (based on an intelligent agent) for online tracking of new sales data (which come from the managers and operators of the ERP system) is realized. Such tracking makes it possible to update the model in a timely manner by inserting new data on sales and the outcomes of the price corrections into it.
Data model. The price correction problem is a discrete process; in other words, the behaviour of a wholesale customer–goods system can be expressed with a finite number of states. At any discrete time moment ti the system is in one of the possible states sj ∈ S. The sales process is observed over time for fixed values of the appropriate parameters (dimensions of the space). Each state of the system is determined using two vector values (time series):
– p – the time series of prices, representing the dynamics of price changes;
– v – the time series of sales volumes, representing the dynamics of sales volumes.
Figure 14 (on the right) shows a sample system with the fixed value "Light" for the dimension Buyer and "Led" for the dimension Goods. The sale price pij and the sale volume vij are the values observed within the framework of the system. Accordingly, the dimensions of the space are the observed variables Customer, Goods and Time. Each point of the hypercube (Figure 14, left) is determined by two static values: price and sales volume.
Figure 14. Data hypercube (on the left) and time series of the state (on the right)
Apart from the dimensions "Goods" and "Customer", other dimensions also exist; however, they are omitted in this study for the sake of clarity. Let us also suppose that the data have passed all the stages of cleaning and pre-processing. The description of the data cleaning procedures is presented in [4].
The price correction problem in terms of dynamic programming. The task is stated in the context of dynamic programming, which means that its solution has to be represented as some sequence of actions – in other words, as the solution of the stated subtasks. The price correction problem possesses such a property: for example, the price p of some goods Gi may be transferred sequentially (over some period of time) to the desired value without a significant (given) loss of sales volume, while an immediate change of price could cause the loss of wholesale customers. The multistep nature of price correction makes the use of Dynamic Programming appropriate in the Dynamic Pricing Policy problem.
The price correction problem in MDP terms. We express the problem described above through the recursive Bellman equation. In terms of MDP, the equation below expresses the value of the expected reward gained for transitions from the current state s into states s' according to a certain policy π [62]:
Vπ(s) = R(s) + γ Σs' P(s'|s, π(s)) · Vπ(s').   (5)
The expression for the optimal policy is known as the Bellman optimality equation in terms of MDP. It describes the current reward for taking the action entailing the maximal expected reward in the future [62]:
V*(s) = R(s) + γ · maxa Σs' P(s'|s, a) · V*(s'),   (6)
where s – the state of the system, determined by the vectors p and v: s = {p; v}; R(s) – the reinforcement gained in the current state s.
The goal state, like any state, is determined through its own values of reinforcement. In the general case, the reward function determines how good or bad it is to stay in the current state (similar to "pleasure" or "pain" in a biological context). In the case of price correction, the reinforcement represents the local amount gained from the wholesale customer, and it is determined for state s in the following way:
R(s) = Στ=1..|s| p(τ) · v(τ),   (7)
where |s| – the length of the time series contained in state s (thus the reward is the total income over all the days observed within state s); τ – the time index of the researched time series.
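Formula (7) is simply the dot product of the price and volume series of the state; a minimal sketch with an illustrative function name:

```python
def state_reward(prices, volumes):
    """Reward (7): income accumulated over all observed days of a state,
    i.e. the sum of price(tau) * volume(tau) over the state's time series."""
    return sum(p * v for p, v in zip(prices, volumes))
```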
Let us continue the consideration of the variables of equation (6): P(s'|s, a) is the probability of the system's transition into state s' from state s when performing action a. The calculation of the probability matrix P is based on counting the factual observations of each transition relative to the total number of transitions from the state being considered (3).
Since the function V*(·) in expression (6) is present both in the left and in the right part, the calculation is performed in a recurrent way, i.e. by decomposing the whole task into subtasks. There exist two main algorithms for solving these equations: Value Iteration and Policy Iteration.
The Dynamic Pricing Policy problem is formulated in this chapter in terms of the Markov Decision Process. The model building method is offered; the obtained model is considered as the environment in which the MDP functions. The method is based on finding regularities and generalizing them. The methodology of model building is demonstrated in the framework of an elementary system including one wholesale customer and one unit of goods. The analysis of the solved problems allows us to formulate the following conclusions:
1. a dynamic system whose development is represented through time series can be expressed through a finite number of states and actions, and can have a transition model and a reward function;
2. the interaction of the decision making and support module with the wholesale customer–goods system, as well as the necessity of continuously updating the model, determine the application of an agent-oriented approach.
Chapter 5 (The performance of experiments regarding the model building in the problem of
dynamic pricing policy)
To approbate the suggested model building method, a software platform for carrying out the experiments is developed, the plan of the experiments is drawn up, and the process of their implementation is described.
Software development for the execution of experiments. The program modules for performing the experiments are realized on the platform 1С:Enterprise version 7.7. The choice is determined by the availability, for this platform, of data on purchases of products by a real enterprise over two years. The realized modules can also serve for the creation of final software designed to operate in the background mode and to perform decision making concerning price correction.
Besides, modules for carrying out the experiments that are not associated directly with the subject sphere are realized: these are the programs for working with the Artificial Neural Network and the Markov Decision Process in the toy problems.
The involvement of the development environment Borland Delphi 7 ensures and accelerates the processing of massive data whose volume exceeds the possibilities of the platform 1С:Enterprise v7.7; for example, the creation of a distance matrix is restricted to 5 000 elements in each dimension.
Plan of the experiments. To evaluate the workability and the efficiency of the developed MDP model building method, a plan of experiments is developed (see Table 3). Two series of experiments are included in the plan. By comparing the model and the actual development of the processes, the first series of experiments allows us to evaluate how well the MDP model fits the learning data. The aim of the second series of experiments is to research the quality of the MDP model created through the approximation of the space by an Artificial Neural Network.
Table 3
The plan of experiments concerning the building and application of the MDP model in the Dynamic Pricing Policy problem

Series Nr | Description | The experiment aim
I. | Using numerical characteristics, to evaluate the similarity of the built MDP model with respect to the factual processes | To evaluate the correctness of the MDP model building algorithms
I. | Using numerical characteristics, to compare the efficiency of the MDP model solution with the factual solutions on testing data | To evaluate the efficiency of the MDP model in the exploitation mode
II. | Using numerical characteristics, to estimate the similarity of the built MDP + ANN model to the factual processes | To evaluate the correctness of the algorithms of MDP model building with an approximated space
The criterion for evaluating model quality is the proportion of the number of successfully modelled transitions to the number of actual transitions, as well as the evaluation of profit expressed in conventional units.
The pricing policy model building and exploitation. Actual data on the sales of products made by the Latvian food industry are used to create the pricing policy model in this experiment. The observed data cover three months: May, June, and July. The sales data are represented by electronic documents of the ERP system „1C:Enterprise v7.7”; each electronic document contains the date of the deal, the name of the wholesale customer, the register of goods, the sales volume, and the prices. Approximately 28.5 thousand sale documents, 1 725 wholesale customers, and 1 725 product names are present within the period being researched.
The obtained transition model (see Table 4) represents the Markov decision process. It consists of 1192 transitions over 294 states.
Table 4
A transition model fragment

Initial state s   Action a               Transition state s'   Transition probability P(s'|s,a)
Id_7              p̃2 ~ (0.0009, 0.05]   Cl_11                 1.0
Cl_66             p̃2 ~ (0.0009, 0.05]   Cl_54                 0.25
Cl_3              p̃1 ~ (0.05, 0.1]      Id_11                 0.667
Id_14             p̃1 ~ (0.05, 0.1]      Id_15                 0.4
Id_16             p̃3 ~ (0.0, 0.01]      Id_17                 0.667
Cl_66             p̃4 ~ (0.01, 0.015]    Cl_54                 0.333
Cl_3              p̃4 ~ (0.01, 0.015]    Id_23                 1.0
Id_24             p̃5 ~ (0.015, 0.02]    Id_25                 0.75
…                 …                      …                     …
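A transition model of the kind fragmentarily shown in Table 4 can be estimated from observed (s, a, s') triples by frequency counting; the sketch below (state and action names invented) normalizes the counts into P(s'|s, a):

```python
from collections import Counter, defaultdict

def build_transition_model(observations):
    """Estimate P(s' | s, a) from a list of observed (s, a, s') triples."""
    counts = defaultdict(Counter)
    for s, a, s_next in observations:
        counts[(s, a)][s_next] += 1
    model = {}
    for (s, a), nexts in counts.items():
        total = sum(nexts.values())
        model[(s, a)] = {s_next: n / total for s_next, n in nexts.items()}
    return model

obs = [("Cl_66", "p2", "Cl_54"), ("Cl_66", "p2", "Id_9"),
       ("Cl_66", "p2", "Id_9"), ("Cl_66", "p2", "Id_9")]
model = build_transition_model(obs)
print(model[("Cl_66", "p2")])  # {'Cl_54': 0.25, 'Id_9': 0.75}
```

Each row of the estimated table then sums to one over the successor states.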
Using the graphical package «yEd», a visual representation of the model is possible. A fragment of the model, based on the XML file generated by the experimental platform, is given in Figure 15. The light ovals designate states, the dark ones designate actions, and the values over the edges give the corresponding transition probabilities. The resulting graph represents the pricing model for the given time period. A fragment of the sales volume development for one observed combination of wholesale customer and goods is given in the lower part of Figure 15. The states not accompanied by any price change (in other words, transitions with a zero price change) are outlined with a dotted line, while the transitions caused by a certain price change are outlined with a dashed line. The straight arrows show the position of each graph element on the real segment of the sales process.
Figure 16 shows the number (as a percentage of the total of 7000) of time series possessing the corresponding value of the numerical estimation X(C;G), which describes the correspondence of the model and its separate parts to the actual transitions (for details see paragraph 5.2 of the Ph.D. Thesis). We note the absence of marks with a value below 0.5: every time series is at least half described by the model and has a partial sequence of events. A low mark (below 0.9) is associated in 97% of cases with a discontinuity of the model (this effect can be observed in Figure 15, above, for single transitions). Nevertheless, 70% of the observations have the evaluation X(C;G) = 0.8, which characterizes the model as able to reproduce the majority of the processes.
Figure 15. The fragment of the transition graph (above); dynamics of sales volume for a fixed wholesale customer-goods combination (below)
Having the transition model, it becomes possible to use the well-studied policy search algorithms for Markov decision processes (Policy Iteration, Value Iteration), for which it is necessary to determine the goal state. In general, the exploitation of the model for the process being researched includes the following stages: (a) match the current state of the process with one of the model states, which determines the initial state s0; (b) determine the goal state s_target; (c) build the policy π* of price corrections; (d) track the current state and update the policy as the process evolves.
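Stages (a)-(c) rely on standard tabular dynamic programming. A compact Value Iteration sketch (the two-state model at the end is purely illustrative, not thesis data) could look as follows:

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """P[(s, a)] -> {s': probability}; R[(s, a, s')] -> immediate reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q = [sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                     for s2, p in P[(s, a)].items())
                 for a in actions if (s, a) in P]
            v_new = max(q) if q else 0.0          # states without actions stay terminal
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < eps:
            break
    # greedy policy extraction with respect to the converged values
    policy = {}
    for s in states:
        qs = {a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                     for s2, p in P[(s, a)].items())
              for a in actions if (s, a) in P}
        if qs:
            policy[s] = max(qs, key=qs.get)
    return V, policy

# illustrative two-state model: action a2 earns the larger reward
states, actions = ["s0", "goal"], ["a1", "a2"]
P = {("s0", "a1"): {"goal": 1.0}, ("s0", "a2"): {"goal": 1.0}}
R = {("s0", "a1", "goal"): 1.0, ("s0", "a2", "goal"): 5.0}
V, policy = value_iteration(states, actions, P, R)
print(policy["s0"])  # a2
```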
Figure 16. Distribution of value X(C;G)
The price changes made come into force, and after the specified period the system makes the transition into the next state s_{t+1}. The transition is characterized by the changed value of the price and by the reaction of demand. Such interaction of the decision making module with the external wholesale customer-goods system represents the typical interaction of an intelligent system with its environment.
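This closed loop of acting and observing can be sketched as follows; the deterministic toy environment, the action names, and the states are invented for the example:

```python
def run_episode(env_step, policy, s0, s_target, max_steps=50):
    """Apply the policy and track the visited states until the goal is reached."""
    s, trajectory = s0, [s0]
    for _ in range(max_steps):
        if s == s_target:
            break
        a = policy.get(s)
        if a is None:            # the current state is unknown to the model
            break
        s = env_step(s, a)       # the environment answers with the next state s_{t+1}
        trajectory.append(s)
    return trajectory

# invented deterministic environment
transitions = {("s0", "raise_price"): "s1", ("s1", "raise_price"): "goal"}
policy = {"s0": "raise_price", "s1": "raise_price"}
print(run_episode(lambda s, a: transitions[(s, a)], policy, "s0", "goal"))
# ['s0', 's1', 'goal']
```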
Exploitation of the obtained policy on testing data. The aim of this experiment is to evaluate the possibility of applying the model to data not included in the learning sample. In fact, the experiment reflects the initial task posed to the intelligent system. Unlike the previous experiment, this one includes the phase of exploiting the optimal policy π*, which makes it possible to evaluate the efficiency of the model numerically in terms of profit. When building the policy π*, all available states are considered goal states. The strategy is calculated based on the gained profit (7) and the corresponding actions. This allows covering all variants of process evolution and choosing the optimal one in terms of discounted profit.
The most demonstrative case of exploiting the developed policy on testing data is the example given in Figure 17. According to the MDP model, on the transition from the state Cl_438 into Cl_636, two alternatives appear: the action correcting the price within the range p̃27 ~ (0.15, 0.1625], or the action correcting the price within the range p̃24 ~ (0.1125, 0.125]. Depending on the chosen action, the terminal states can be Id_2181, Cl_683, or Id_734. Since the local reward values are known (the total income at the state), it becomes possible to calculate the policies reaching each terminal state and to choose the optimal one.
Figure 17. The fragment of the transition graph and the optimal policy π* (above); dynamics of sales values and prices for a particular wholesale customer-goods combination (below)
Although the terminal state Cl_683 does not possess the maximum value of local reward, it is the most “attractive” one in terms of discounted reward, constituting the value V(Cl_683 | p̃24) = 405.585. The maximal discounted reward determines the optimal policy π*, represented in Figure 17 by the light grey bold arrows.
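The comparison of terminal states by discounted reward reduces to a short computation; the reward sequences and the discount factor 0.95 below are hypothetical stand-ins, not the thesis values:

```python
def discounted_return(rewards, gamma=0.95):
    """Discounted sum of local rewards along a path: sum of gamma**t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# hypothetical reward paths toward two candidate terminal states
paths = {"Cl_683": [218, 247], "Id_734": [195, 153]}
best = max(paths, key=lambda s: discounted_return(paths[s]))
print(best)  # Cl_683
```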
Ultimately, the precision depends on which general features of the sales process evolution are extracted from the individual observed cases when the MDP model is built. To evaluate the effectiveness of agent functioning over the 5000 combinations, Figure 18 gives histograms reflecting the distribution of combinations according to the value of the comparison of the modelled process with the actual process.
According to Figure 18, it can be concluded that the model predominantly contains combinations (77.8%) for which the price corrections lead to positive results (i.e., profit values greater than zero). A negative evaluation value means that the price corrections offered by the model turned out to be less effective than the solutions provided by the expert.
Figure 18. Distribution of customer-goods combinations according to profit (left) and according to distance (right)
The presence of customer-goods combinations for which the price corrections cause losses is explained by the insufficient number of individual observations found for such combinations to create a valid transition graph.
Experiments on MDP space approximation by means of ANN. This experiment carries out the practical research into the possibility of approximating the Dynamic Pricing Policy decision space with Artificial Neural Networks. The details of the method and its application to the toy mountain car problem are provided in subsection 2.3.
The approximation of the transition space is needed to obtain transition probabilities for states in which the system has never been before but which it can potentially reach. In such cases it is advisable to have at least an estimated value of the transition probabilities.
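The thesis trains multilayer configurations such as “22-22-22-1” on the real data; purely as an illustration of the idea (toy dimensions, invented training data, plain stochastic gradient descent, a sigmoid output squashing the estimate into [0, 1]), a minimal one-hidden-layer approximator might look like this:

```python
import math
import random

def approx_transition_net(samples, hidden=8, lr=0.5, epochs=3000, seed=0):
    """Tiny one-hidden-layer sigmoid MLP mapping a state-action feature
    vector to a transition-probability estimate in [0, 1]."""
    rnd = random.Random(seed)
    n_in = len(samples[0][0])
    W1 = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rnd.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def sig(z):
        return 1.0 / (1.0 + math.exp(-z))

    def forward(x):
        h = [sig(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
        y = sig(sum(w * hi for w, hi in zip(W2, h)) + b2)
        return h, y

    for _ in range(epochs):
        for x, target in samples:
            h, y = forward(x)
            dy = (y - target) * y * (1.0 - y)          # output-layer error signal
            for j in range(hidden):
                dh = dy * W2[j] * h[j] * (1.0 - h[j])  # hidden-layer error signal
                W2[j] -= lr * dy * h[j]
                for i in range(n_in):
                    W1[j][i] -= lr * dh * x[i]
                b1[j] -= lr * dh
            b2 -= lr * dy
    return lambda x: forward(x)[1]

# invented training set: feature vector -> observed transition frequency
samples = [([0.0, 0.0], 0.1), ([0.0, 1.0], 0.5),
           ([1.0, 0.0], 0.5), ([1.0, 1.0], 0.9)]
net = approx_transition_net(samples)
print(round(net([1.0, 1.0]), 2), round(net([0.0, 0.0]), 2))
```

The trained function can then be queried for feature vectors of state-action pairs never observed in the learning sample.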
The convergence curves for various numbers of hidden layers and of neurons per hidden layer of the ANN are given in Figure 19. The configuration “22-66-66-1” (two hidden layers, 66 neurons in each) can be marked as the “quickest” network configuration by the number of learning iterations used, while the “slowest” is the configuration «22-88-1». If the time spent on one learning iteration is taken into consideration, the configuration “22-22-22-1” turns out to be the quickest: it achieves a result comparable to “22-66-66-1” over a larger number of iterations, but in a shorter time. For this reason, the configuration “22-22-22-1” is used in the further experiments.
To evaluate the quality of the approximation, a cross-validation test is performed. For that purpose the whole learning sample is divided into 10 blocks, each block being compiled from records taken from the learning sample at a given interval. The convergence plots for the 10 cross validations are represented in Figure 20. All validations are performed on the network with the configuration “22-22-22-1”.
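The described split, taking records at a fixed interval for each block, can be sketched as:

```python
def interval_blocks(records, k=10):
    """Split a sample into k blocks by taking records at a step of k."""
    return [records[i::k] for i in range(k)]

blocks = interval_blocks(list(range(20)), k=10)
print(blocks[0], blocks[1])  # [0, 10] [1, 11]
```

Every record lands in exactly one block, so the blocks jointly cover the sample.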
Figure 19. Convergence of ANN for different parameters
We note that the convergence on the test sample does not approach zero asymptotically, as it does on the learning sample, but stays at a certain level. At the same time, the root mean square error (RMSE) on the test sample has the same scale of values as the error on the learning sample. When analyzing the approximation result of one of the cross-validation blocks (see Figure 21), one can note that the network is able to reproduce the main traits of the test function (the RMSE value is 0.17).
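RMSE here is the usual root mean square error between the network outcome and the desired value; as a reminder of the computation (toy numbers):

```python
import math

def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

print(rmse([0.2, 0.5, 0.9], [0.0, 0.5, 1.0]))  # about 0.129
```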
Figure 20. Convergence of the 10 cross validations
Such an error value allows us to use the approximation of the model for building the MDP action policy. The policy building algorithm now uses not the tabular representation of the transition probability function but its approximation by the ANN. To obtain a precision measure comparable with the evaluations of the previous experiments, we use expression (5.2) of the Ph.D. Thesis, which for each wholesale customer-goods combination (C;G) calculates the numerical evaluation X(C;G) of the correspondence of the model (or its separate parts) to the actual transitions.
Figure 21. Fragment of test sample approximation
Figure 22 represents the number (as a percentage of the total of 5000) of time series having the corresponding value of the numerical estimation X(C;G). Compared with the results given in Figure 16, the approximated model has lower precision. The important advantage of the approach, however, is that it becomes possible to make decisions in states in which the system has not been before.
Figure 22. Distribution of the estimation value X(C;G) for the approximated model
In practice, taking into account the relatively high error, such decisions can be offered to the expert for consideration as possible price correction strategies.
Chapter 6 (Analysis of results and conclusions)
It is shown that a dynamic system characterized by the presence of time series can be expressed through finite sets of states and actions and can be given a transition model and a reward function. Most attention was concentrated on the method of building the model considered as the environment in which the MDP apparatus functions. The method is based on finding regularities in the evolution of the observed variables over time and on generalizing the detected regularities. The employed agent-oriented architecture ensured an interaction of the MDP model with the environment in which permanent learning of the MDP model and transformation of the policy within a dynamic environment can be performed. The offered experimental platform allowed us to validate the developed method of model building under the circumstances of a real task. The platform allows the expert to choose the desired goal state. At the same time, the goal state
can be selected automatically from a subset of states with the maximum value of reward, or as any other terminal state whose achievement yields the maximum discounted income.
The method of fully automated building of the price evolution model is demonstrated within a real system including hundreds of customers and product names. Validation of the model showed acceptable precision. To increase the precision further, appropriate experiments (a search for parameters, a revision of the data structure) are needed, as well as improvement of certain algorithms. The main results of this Ph.D. Thesis are the following:
- the current state of the problem of building a Markov model of a dynamic system represented by time series in a multidimensional space of observed variables is researched;
- such areas of Computational Intelligence as intelligent agent systems, Data Mining procedures, Artificial Neural Networks, etc. are researched with the aim of developing a method for building the MDP model of a dynamic system and applying it to testing data;
- a new multistep approach to building the MDP model in the case of approximating the state value table with an Artificial Neural Network is developed and validated;
- a method of MDP model building based on searching for regularities in the evolution of the observed variables and transforming them into a set of states, a set of actions, and a transition probability function is developed;
- approaches for transforming the data of a real problem into a structure meeting the MDP framework are offered: methods and a data structure (the behaviour profile of observed variables) are developed to transform multidimensional time series into the states of a Markov decision process, methods of building the action set are suggested, and a method of searching for goal states is offered;
- to validate the suggested method of MDP model building, an experimental software platform and a series of accompanying software tools are developed;
- in the course of experiments based on real sales data, numerical evaluations of the closeness of the MDP model to the actual processes under investigation, as well as evaluations of the functioning of the agent system on testing data, are obtained.
MAIN RESULTS OF THE THESIS
A decision making method based on the MDP and ensuring the building of an MDP model in problems whose data are presented as multidimensional time series was developed as part of the doctoral work. The method was tested in a series of experiments. As a result, numerical estimates were obtained that allow us to conclude that the method is able to build an MDP model which adequately reflects the learning sample. The following tasks have been solved and results obtained.
1. The review of mathematical methods based on Markov decision processes allows us to conclude that MDP and Reinforcement Learning can be considered an effective means of modelling dynamic systems; some key problems of their use in tasks containing data in the form of multidimensional time series were defined.
2. The analysis of several computational intelligence techniques (ANN, RL, agent based systems, methods of Data Mining, etc.) allowed us to describe the main features of the developed method (pipeline organization of the data transformation, timely updates of the MDP model, and so on).
3. A special agent based architecture was developed to avoid an incorrect description of the interaction of the intelligent system with the environment.
4. The approximation of the state space and decision space of the MDP with the use of an Artificial Neural Network was implemented. The efficiency of the approach was demonstrated on a toy problem.
5. An intermediate structure (the behaviour profile of a studied value) for storing and processing the identified patterns of the investigated time series was formulated. The behaviour profiles of the values being researched are used for MDP model creation.
6. An approach using different clustering criteria for time series (Euclidean distance and shape-based similarity) according to the semantic load of each studied variable was offered.
7. For the purpose of testing the method, the problem of the Dynamic Pricing Policy within the
MDP framework was formulated, and the software for implementing the experiments was
developed.
8. A series of experiments enabling one to quantify the effectiveness of the MDP model building was carried out. The assessment was based on the comparison of the resulting model with the training sample, and also on the application of the model to data outside the training set.
Registered for printing on 31.01.2012. Registration Certificate No. 2-0282. Format 60x84/16. Offset paper. 2.25 printing sheets, 1.78 author's sheets. Print run: 30 copies. Order No. 10. Printed and bound at the RTU Printing House, 1 Kalku Street, Riga, LV-1658, Latvia.