RIGA TECHNICAL UNIVERSITY
Faculty of Computer Science and Information Technology
Institute of Information Technology
Jurijs Čižovs
Doctoral student of the Management Information Technology programme
DEVELOPMENT AND STUDY OF A
CONTROLLED MARKOV DECISION MODEL
OF A DYNAMIC SYSTEM BY MEANS OF
DATA MINING TECHNIQUES
Ph.D. Thesis Summary
Scientific supervisor
Dr.habil.sc.comp., Professor
A. BORISOVS
Riga 2012
UDK 519.857(043.2)
Či 958 d
Čižovs J. Development and study of a controlled Markov
decision model of a dynamic system by means of data mining
techniques. Ph.D. Thesis Summary.-R.: RTU, 2012.-38 p.
Printed according to the decision of the RTU Information
Technology Institute Council meeting, January 10, 2012,
protocol No. 12-02.
This work has been partly supported by the European Social Fund within
the National Programme „Support for the carrying out doctoral study
programme’s and post-doctoral researches” project „Support for the
development of doctoral studies at Riga Technical University”.
ISBN 978-9934-10-272-1
THE PH.D. THESIS
IS NOMINATED AT RIGA TECHNICAL UNIVERSITY FOR OBTAINING
A DOCTOR'S DEGREE IN ENGINEERING SCIENCE
Defence of the Ph.D. Thesis for obtaining a doctor’s degree in Engineering
Science will take place on 12 March, 2012 at Riga Technical University, Department of
Computer Science and Information Technology, 1/3 Meža Street, Room 202.
OFFICIAL REVIEWERS
Professor, Dr.math. Kārlis Šadurskis
Riga Technical University, Latvia
Professor, Dr.habil.sc.ing. Jevgeņijs Kopitovs
Transport and Telecommunication Institute, Latvia
Professor, Dr. rer. nat. habil. Juri Tolujew
Otto-von-Guericke University Magdeburg, Germany
DECLARATION
I, Jurijs Čižovs, declare that I have written this Thesis, which is submitted for
review at Riga Technical University for obtaining a doctor's degree in Engineering
Science.
Jurijs Čižovs ………………………………..
signature
Date: 3 February, 2012
The Ph.D. Thesis is written in Latvian. It contains an introduction, 6 chapters, a
conclusion, a list of references, 4 appendices, 70 figures, 15 tables and 51 formulae. The
list of references contains 83 entries. There are 137 pages in total.
GENERAL DESCRIPTION OF THE THESIS
Introduction
With the development of electronic management of production, trade, finance, etc., new
possibilities for storing structured data that reflect the economic activities of an enterprise have
appeared. The analysis of available data on the enterprise's past activity, aimed at developing
relevant management decisions, is one of the mechanisms that determine efficient enterprise
management. Since the enterprise's activity is observed over time, the data are
multidimensional time series. Thus, there arises the problem of decision making under
uncertainty with data that are multidimensional time series. The uncertainty stems from the
fact that it is technically impossible to capture all the internal and external factors affecting the
observed parameters.
Topicality of the problem
The mathematical framework of the Markov Decision Process (MDP) has been used
successfully to find optimal management strategies in discrete stochastic processes developing
over time. There are a number of modifications and enhancements aimed at solving tasks with
continuous parameters, partially observable environments, etc. However, the issues related to
building an MDP-model from data represented as time series remain open for research. The
complexity of the model building stems from the requirements that the MDP framework
imposes on the structure of the data under study. For the observed parameters, realisations of
time series of a certain type must be extracted from the relational data and converted into the
structure of the MDP.
The extension of the framework for working with time series allows one to take
advantage of the standard MDP framework to make decisions on economic problems in online
mode.
Goal of the research
The goal of the doctoral thesis is to develop the decision making framework based on the
Markov Decision Process for the dynamic systems in which the data are represented as time
series. To achieve the goal stated, the following tasks have to be solved:
1. To review the current state of the MDP framework application to the problems expressed in
terms of multidimensional time series and to explore the existing approaches.
2. To develop a method based on data mining to build time series, to process and transform
them into the structures that satisfy the MDP requirements.
3. To implement the new method in software based on an agent-oriented architecture, i.e., as an
intelligent system for decision making and support.
4. To consider the possibility of improving the method by the decision space approximation
implemented by means of Artificial Neural Networks.
5. To formulate the Dynamic Pricing Policy problem as a dynamic programming task (in the
context of Markov Decision Processes) in order to obtain an object for practical
experiments.
6. To test the obtained intelligent decision making agent system on the Dynamic Pricing
Policy problem in order to assess the effectiveness of the developed method on real-world
problems.
Object of the research
The object of the research is the advanced decision making method based on Markov
Decision Process. The scope of the framework application is oriented towards dynamic
programming tasks which contain the data expressed as time series.
Research hypotheses
The study puts forward the following hypotheses:
1. the task in which the data are represented as time series can be viewed as a task of dynamic
programming and expressed in terms of Markov Decision Process;
2. the approximation of the state space and decision space of Markov Decision Process can be
performed with the help of the artificial neural network approach.
Methods of the research
The central subject of the Ph.D. Thesis is the advanced decision making method based on
the Markov Decision Process. The maximum-likelihood technique, a statistical method for
estimating unknown parameters, is used to construct the probabilistic model within the
framework of the apparatus. Data mining techniques, including tools for data normalization,
clustering and classification, are employed. The methods of computational intelligence,
Reinforcement Learning and Artificial Neural Networks, are used. An agent-oriented
architecture is used for the software systems under development.
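The maximum-likelihood estimation of the transition probabilities mentioned above can be sketched as follows (a minimal illustration; the function name and the dictionary-based encoding are assumptions, not the thesis software):

```python
from collections import defaultdict

def estimate_transition_probs(transitions):
    """Maximum-likelihood estimate of the MDP transition function:
    P(s' | s, a) = count(s, a, s') / count(s, a).
    transitions: observed (state, action, next_state) triples.
    Returns a dict mapping (s, a) -> {s_next: probability}."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    probs = {}
    for sa, nxt in counts.items():
        total = sum(nxt.values())
        probs[sa] = {sn: c / total for sn, c in nxt.items()}
    return probs
```

The estimate converges to the true probabilities as the number of observed transitions grows, which is why the model must be updated as new data arrive.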
Scientific novelty
The decision making method based on Markov Decision Process is of scientific interest.
The main characteristic that distinguishes it from a standard Markov Decision Process is the
possibility to use it in the tasks with multidimensional time series.
The approach of the decision space approximation by means of Artificial Neural
Networks has been demonstrated for the task of Dynamic Pricing Policy (which is characterized
by multidimensional time series).
Besides, a new approach to building the agent-based system architecture is provided. This
approach allows one to avoid a conflict between the definition of the goal and the agent's
environment in cases where the agent does not interact directly with the object of the task being solved.
Practical use of the Thesis and approbation
The developed decision making method based on Markov Decision Process is designed
for tasks in which the state of the system is described by parameters that change over time and
not by static parameters.
The practical application of the intelligent agent system based on the Markov Decision
Process was demonstrated on the Dynamic Pricing Policy task. The testing data are actual
sales records from the real manufacturing and trade management system 1C:Enterprise v7. The data
cover a two-year period of manufactured food product sales.
The development of intelligent agent system based on Markov Decision Process was
implemented in 1C: Enterprise v7 framework, which allowed a direct access to sales data.
This work includes a series of experiments with several sub-systems (Artificial Neural
Networks, Markov Decision Process) on toy problems. Besides, a series of experiments on the
Dynamic Pricing Policy task was carried out in order to numerically evaluate the effectiveness
of the improved MDP framework.
Certain stages of the work and its results were presented at the following scientific
conferences:
1. Chizhov J. An Agent-based Approach to the Dynamic Price Problem, 5th
International KES Symposium Agent and Multi-agent Systems, Agent-Based Optimization
KES-AMSTA/ABO'2011, 29 June – 1 July, 2011, Manchester, United Kingdom. indexed in: SpringerLink, Scopus, ACM DL, DBLP, Io-Port.
2. Chizhov J., Kuleshova G., Borisov A. Manufacturer – Wholesaler System Study Based on
Markov Decision Process, 9th International Conference on Application of Fuzzy Systems
and Soft Computing, ICAFS 2010, 26 – 27 August, 2010, Prague, Czech Republic.
3. Chizhov J., Kuleshova G., Borisov A. Time Series Clustering Approach for Decision
Support, 16th
International Multi-conference on Advanced Computer Systems ACS-AISBIS
2009, 14 – 16 October, 2009, Miedzyzdroje, Poland. indexed in: Scopus, Web of Science
4. Chizhov J., Zmanovska T., Borisov A. Temporal Data Mining for Identifying Customer
Behaviour Patterns, Data Mining in Marketing DMM’ 2009, 9th Industrial Conference,
ICDM 2009, 22 – 24 July, 2009, Leipzig, Germany. indexed in: DBLP, Io-port.net.
5. Chizhov J., Borisov A. Applying Q-Learning to Non-Markovian Environments, First
International Conference on Agents and Artificial Intelligence (ICAART 2009), 19 – 21
January, 2009, Porto, Portugal. indexed in: Engineering Village2, ISI WEB of KNOWLEDGE, SCOPUS, DBLP, Io-port.net.
6. Chizhov J., Zmanovska T., Borisov A. Ambiguous States Determining in Non-Markovian
Environments, RTU 49th
International Scientific Conference, Subsection “Information
Technology and Management Science”. 13 October, 2008, Riga, Latvia. indexed in: EBSCO
7. Chizhov J. Particulars of Neural Networks Applying in Reinforcement Learning, 14th
International Conference on Soft Computing “MENDEL 2008”, 18 – 20 June, 2008, Brno
University of Technology, Brno, Czech Republic. indexed in: ISI Web of Knowledge, INSPEC
8. Chizhov J. Reinforcement Learning with Function Approximation: Survey and Practice
Experience, International Conference on Modeling of Business, Industrial and Transport
Systems “MBITS’08”, 7 – 10 May, 2008, Transport and Telecommunication Institute, Riga,
Latvia. indexed in: ISI Web of Knowledge
9. Chizhov J. Software Agent Developing: a Practical Experience, RTU 48th
International
Scientific Conference, Subsection “Information Technology and Management Science”, 12
October, 2007, Riga Technical University, Riga.
10. Chizhov J., Borisov A. Increasing the Effectiveness of Reinforcement Learning by
Modifying the Procedure of Q-table Values Update, Fourth International Conference on Soft
Computing, Computing with Words and Perceptions in System Analysis, Decision and
Control “ICSCCW – 2007”, 27 – 28 August, 2007, Antalya, Turkey.
11. Chizhov J. Agent Control in World with Non-Markovian Property, EWSCS’07: Estonian
Winter School in Computer Science. 4 – 9 March, 2007, Palmse, Estonia.
Publications
Several fragments of the Ph.D. thesis, as well as its results, are published in 11 scientific
articles. Most of the publications are indexed by international digital libraries (Springer, ISI
WEB, SCOPUS, DBLP, Io-port.net). The list of publications is included in the complete list of
literature provided at the end of this summary.
Main results of the Ph.D. Thesis
The decision making method based on the MDP, ensuring MDP-model building in
tasks that contain data presented as multidimensional time series, was developed as part of
the doctoral work. The method was tested in a series of experiments. As a result, numerical
estimates were obtained which allow us to conclude that the method is able to build an MDP-model
that adequately reflects the learning sample. The following tasks were solved and the
results obtained:
1. The review of mathematical methods based on Markov Decision Processes allows one to
conclude that MDP and Reinforcement Learning can be considered an effective method for
modelling the dynamic systems; some key problems of their use in tasks that contain data as
multidimensional time series were defined.
2. The analysis of several computational intelligence techniques (ANN, RL, agent-based
systems, data mining methods, etc.) made it possible to define the main features of the
developed method (a pipeline organization for transforming the data, timely updates of the MDP-model
and so on).
3. A special agent-based architecture was developed to avoid an incorrect description of the
interaction of the intelligent system with the environment.
4. The approximation of the state space and decision space of the MDP using an Artificial
Neural Network was implemented. The efficiency of the approach was demonstrated on a
toy problem.
5. The intermediate structure (the profile of a studied value’s behaviour) for storing and
processing the identified patterns of the investigated time series was formulated. The
behaviour profiles of the studied values are used to create the MDP-model.
6. An approach using different criteria for clustering time series (Euclidean distance and
shape-based similarity) according to the semantic load of each studied variable was
suggested.
7. For the purpose of testing the method, the problem of Dynamic Pricing Policy within the
MDP framework was formulated, and the software for implementing the experiments was
developed.
8. A series of experiments to quantify the effectiveness of the MDP-model building was carried
out. The assessment was based on the comparison of the resulting model and the training
sample as well as the application of the model on data outside the training set.
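The two clustering criteria mentioned in item 6 can be sketched as distance functions (a minimal illustration; the correlation-based form of the shape similarity is an assumption, since the summary does not specify the exact measure used):

```python
import math

def euclidean_distance(x, y):
    """Point-wise distance: sensitive to the absolute level of the series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def shape_distance(x, y):
    """1 - Pearson correlation: two series with the same shape but a
    different scale or offset are considered close."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)
```

For a sales level, the Euclidean distance is appropriate; for a demand trend, two series scaled differently but moving in the same way would be grouped together by the shape-based distance while remaining far apart in the Euclidean sense.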
Structure and contents of the thesis
Ph.D. Thesis consists of an introduction, 6 chapters, a conclusion, a list of references and
4 appendices. Ph.D. Thesis comprises 137 pages, it includes 70 figures and 15 tables. There are
83 sources in the list of references.
The structure of the Ph.D. Thesis is the following:
INTRODUCTION – the terminology used in the research process is introduced, and the subject
of the research, the objective of the paper and the tasks are formulated.
1 CHAPTER: THE MULTISTEP PROCESS OF DECISION MAKING IN THE DYNAMIC
SYSTEMS – the analysis of the multistep decision making method based on the
Markov Decision Processes is provided in this chapter. The key advantages, as well
as disadvantages that determine the development of an improved MDP apparatus are
described.
2 CHAPTER: THE REVIEW OF THE COMPUTATIONAL INTELLIGENCE METHODS
WHEN APPLIED TO THE DYNAMIC SYSTEMS – the chapter contains the
analysis of the variety of methods from the field of computational intelligence in the
aspect of their use in the developing MDP apparatus.
3 CHAPTER: THE DEVELOPMENT OF THE DATA MINING BASED SYSTEM FOR THE
DYNAMIC SYSTEM’S MDP MODEL BUILDING – the central chapter is devoted
to the development of an MDP-based method able to generate a model based on data
represented by multidimensional time series.
4 CHAPTER: THE APPLICATION OF AN MDP MODEL BUILDING USING METHODS
OF DATA MINING IN THE PROBLEM OF DYNAMIC PRICING POLICY – in
this chapter the task of Dynamic Pricing Policy in the context of Markov Decision
Process for performance of experiments is formulated.
5 CHAPTER: THE PERFORMANCE OF EXPERIMENTS REGARDING THE MODEL
BUILDING IN THE PROBLEM OF DYNAMIC PRICING POLICY – the
experiments aimed at getting the numerical results of the efficiency of a new method
are described in the chapter. The description of the developed software is presented.
6 CHAPTER: THE ANALYSIS OF THE RESULTS AND CONCLUSIONS – the final chapter
is devoted to the analysis of the findings. The directions of further research are
defined.
APPENDIX – includes the structures of intermediate data, fragments of MDP model in XML
format and algorithms used in the research.
SUMMARY OF THESIS CHAPTERS
Chapter 1 (Multistep process of decision making in the dynamic systems)
The first chapter deals with methods of solving dynamic tasks whose solution is
achieved by the consecutive performance of operations aimed at the result. Dynamic
Programming, being a fundamental apparatus, underlies the Markov Decision Process
framework, which in its turn has given rise to its own family of methods.
However, certain difficulties arise when such methods are used in modern economic
applications. The main problem lies in developing the model that serves as the environment
in which the multistep decision making methods operate. To solve this model development
problem, the possibility of using regularity mining procedures in dynamic tasks is investigated.
Optimal decision making in modern real management problems cannot be reduced to
gaining a short-term or incidental profit. Efficient management in an economic application
thus implies achieving the maximum total value of the observed parameter (for example,
profit) within a limited or unlimited number of phases. Thereby, there exists a class of
tasks whose solution is achieved not at once but gradually, step by step. In other
words, decision making is considered not as a single act, but as a process consisting of many
phases [76].
Dynamic Programming (DP) is a mathematical apparatus enabling the optimal planning
of multistep controlled processes and of processes that depend on time [78]. A plan is a finite
sequence of decisions made. The fundamental property of DP is that the decisions under
development are not isolated from each other [40] but are coordinated with each other in
order to reach the goal state.
The Markov Decision Process (MDP) [17, 36, 54] extends the Markov chain by
introducing the concept of a controlling action. The probability of transition from one state to
another is conditioned on the chosen action (decision). Figure 1 shows an example of the
transition graph of a Markov Decision Process in the classical toy problem of garbage
disposal by a robot. Each black dot is an action available in the corresponding state. The taken
action determines the possible further transitions.
Figure 1. Example of Markov Decision Process transition graph
In fact, the matrices of transition probabilities P and rewards (or reinforcements) R
describe the “physics” of the process, from which an appropriate policy of garbage
disposal is calculated. Thus, even minor changes of individual matrix values may lead to
fundamentally different policies. In this chapter the formal definition of MDP and
the structure of its tuple are described by (1).
(The figure depicts the robot’s states Charge, Wait and Clean, the actions available in each
state, and the transition probability p and reinforcement value r on each edge, e.g. p = 0.9,
r = 0 for continuing to wait. For the given model it is necessary to find a garbage disposal
policy that gains the maximum reward and does not discharge the power supply of the
mechanical robot.)
⟨S, A, P, R⟩ ,  (1)
where S – the finite set of discrete states, S = {s1, s2, ... , s|S|}. Each state st reflects the current
value of a vector of observable parameters, st = (x1, x2, ..., xn). Thus, the state is all the
information available on the dynamic system at a certain moment of time.
A – the finite set of controlling impacts (actions), A = {a1, a2, ... , a|A|}. Usually an action ai
immediately changes one or several parameters xi of state st.
P – the state transition function determines the probability that action a, taken in state s at time
moment t, will transfer the system to state s' at time moment t+1. It is a mapping of the kind
P: S × A × S → [0, 1].
R – the reward function determines the expected reward obtained immediately after the
transition to state s' from state s as a result of action a. It is a mapping of the kind
R: S × A × S → ℝ. In fact, it determines the goal state that is to be achieved.
The solution of the MDP is the optimal action policy π* that defines for each state st the
appropriate action ai. In this case, the mapping is π: S → A.
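The computation of such a policy for a small MDP of the form (1) can be sketched with standard value iteration (a minimal illustration; the state and action names and the dictionary-based encoding are assumptions, not the thesis software):

```python
def value_iteration(S, A, P, R, gamma=0.9, tol=1e-6):
    """Find an optimal policy for a finite MDP <S, A, P, R>.
    P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the
    expected immediate reward. Returns (policy, state-value function)."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            best = max(R[s][a] + gamma * sum(p * V[sn] for p, sn in P[s][a])
                       for a in A)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {s: max(A, key=lambda a: R[s][a]
                     + gamma * sum(p * V[sn] for p, sn in P[s][a]))
              for s in S}
    return policy, V

# A two-state example: from s0 the action "go" leads to s1, where "stay"
# keeps collecting reward 1, so the optimal policy is go from s0, stay in s1.
S, A = ["s0", "s1"], ["stay", "go"]
P = {"s0": {"stay": [(1.0, "s0")], "go": [(1.0, "s1")]},
     "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s0")]}}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 1.0, "go": 0.0}}
policy, V = value_iteration(S, A, P, R)
```

The algorithm converges because the update is a contraction with factor γ; this is the global convergence property referred to below.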
The key advantages of MDP are convergence towards the globally optimal policy, as
well as the simple structure of the model. Explaining the obtained action policy π*
(the actual solution of the task) is not complicated, in contrast to ANN solutions. The
policy can also be expressed using different methods of knowledge representation, for example,
decision trees or decision tables [67].
The significant disadvantage of MDP is the absence of mechanisms for automatic
model building. The solution, a policy maximizing the expected discounted sum of rewards,
can be obtained only if the transition matrix P and the reward function R, which form the base
of the model, are known. Their construction is rather difficult in real tasks.
On the basis of the study and analysis of MDP and RL models, Table 1 presents their main
characteristics in the problems of studying stochastic processes and building an appropriate
model. The standard approach to working with non-Markov systems is to increase the
memory so as to deal with the prehistory of transitions. Based on this principle, the approach
of introducing states built from time series for creating the Markov model is considered.
Table 1
Advantages and disadvantages of models based on MDP

Markov Decision Process
Advantages:
– global convergence;
– the policy is built taking into account delayed rewards;
– simple methods of calculating the policy.
Disadvantages:
– prior knowledge of the system model is needed;
– the method is complex to implement in non-Markov systems.

Reinforcement Learning
Advantages:
– global convergence;
– a system model is not needed (unsupervised learning case);
– the ability to work in systems that do not possess the Markov property;
– the policy is built taking into account delayed rewards.
Disadvantages:
– the Exploitation-Exploration trade-off exists;
– developing the model through exploration is not allowed in a number of practical tasks;
– complexity of application in non-Markov systems.
One of the disadvantages of MDP is also the complexity of policy search in so-called
non-Markov systems: dynamic systems that do not satisfy the Markov property. The
development of a process in a non-Markov system depends not only on the current state but also
on the sequence of states that occurred in the past. Solving non-Markov problems is
possible by describing the process with a mechanism of memory. It is also
implied that the statistical properties in the future depend on the character of the process
evolution in the past. Such an approach complicates the solution and practically deprives it of
applicability in real tasks.
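The memory mechanism described above can be sketched as history augmentation: each state is replaced by a tuple of the last k observations, so that a process whose future depends on a bounded prehistory again satisfies the Markov property over the augmented states (a minimal sketch; the function name and fixed window are illustrative assumptions):

```python
from collections import deque

def history_states(observations, k=3):
    """Replace each observation with the tuple of the last k observations.
    The augmented state carries the bounded prehistory explicitly, at the
    cost of a state space that grows exponentially in k."""
    window = deque(maxlen=k)
    states = []
    for obs in observations:
        window.append(obs)
        if len(window) == k:
            states.append(tuple(window))
    return states
```

The exponential growth of the augmented state space in k is precisely why this standard approach becomes impractical for real tasks.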
The review of the mathematical apparatus based on Markov processes enables the
conclusion that Markov Decision Processes and Reinforcement Learning can be considered
effective methods for modelling dynamic systems.
The research into the contemporary state of MDP and RL methods detected the key
problems of their application in real tasks of economics, management, etc.
The analysis of the detected problems made it possible to formulate an approach
based on data mining methods that improves the MDP framework and enables its use in
modern tasks described in the setting of a non-Markov dynamic system.
Chapter 2 (The review of the Computational Intelligence methods when applied to the dynamic
systems)
This chapter substantiates the necessity of applying Computational Intelligence (CI)
methods in order to develop an effective decision making system. Different architectures of
agent systems are analyzed in order to develop an agent architecture specific to the dynamic
system [5]. The agent-based approach, in its turn, makes it possible to represent the software
system and its interaction with the real-world task being solved in a way that is natural for a
human. Experiments on model dynamic systems involving Artificial Neural Networks, with
the aim of approximating the decision spaces, are performed as well.
There are several formal definitions of Computational Intelligence. The concept of CI is
defined in [40] as a set of computational models and tools providing intelligent adaptation:
immediate perception of primary sensor data, their processing with parallelization and task
transfer, and the creation of safe and timely responsive systems with a high level of resiliency.
Usually the immediate processing of “raw” data by intelligent software instruments is
impossible. This is determined by the stringent requirements of the algorithms on the data
structure. Thus, for instance, MDP models work with a fixed data structure defining the state. A
mediator of some kind is therefore needed between the physical data carrier and any
intelligent method [5]. This, in its turn, ensures a pipelined organization of the task. An
example of a system providing direct interaction of intelligent tools with the task is
represented in Figure 2.
Figure 2. The way of immediate interaction
(The figure shows raw data flowing from the database through preprocessing into structured
data; the intelligence tools then produce a decision, which reaches the expert through the
interface, and the expert acts on the database.)
The Ph.D. thesis also considers an indirect impact of the intelligent system on the physical
source of a task (in this case, a database), which is typical for tasks in which erroneous
decisions incur high expenses of any kind.
Plenty of Computational Intelligence methods are involved in the developed approach
to dynamic systems. The methods relevant to this research and their position in the family of
Computational Intelligence methods [10] are represented in Figure 3, based on the
classification given in [40].
Figure 3. A fragment of the Computational Intelligence family tree
In this particular work we take as a basis the definition of an agent suggested in [28]:
“An autonomous agent is a system situated within and a part of an environment that
senses that environment and acts on it, over time, in pursuit of its own agenda and so as to effect
what it senses in the future”.
There are plenty of agent types that meet this definition either partly or fully.
Depending on the properties agents possess, several classes of agents are distinguished [11].
The most typical among them are: programmable agents (reactive agents, reflexive agents [58]),
learning agents and planning agents [37]. The properties that an agent of some class can
possess [28] are given in Table 2.
Table 2
The properties of software agents

Reactive (sensing and acting) – responds in a timely fashion to changes in the environment
Autonomous – exercises control over its own actions
Goal-oriented – does not simply act in response to the environment
Temporally continuous – is a continuously running process
Communicative – communicates with other agents, perhaps including people
Learning – changes its behaviour based on its previous experience
Mobile – able to transport itself from one machine to another
Flexible – actions are not scripted
Character – believable “personality” and emotional state
(The family tree in Figure 3 shows Computational Intelligence branching into Granular
Computing, Evolutionary Computing, Artificial Life and Neuro-computing; the latter covers
supervised learning, unsupervised learning and Reinforcement Learning, including ANN, RL
and LCS.)
One of the disadvantages of Reinforcement Learning is the exponential growth of the
problem space with each new dimension [58, 62]. Further, the most common methods of dealing
with this problem (known as “the curse of dimensionality”) are considered.
Two main approaches to working with a large number of states are examined in the
chapter: the approximation of the value function and policy gradient methods. One of the
methods belonging to nonlinear approximation is the framework of Artificial Neural
Networks (ANN).
To analyse ANNs as function approximators, a multilayer perceptron trained by the
error back-propagation method is implemented in this thesis. The existing commercial and
freely distributed software implementations of Artificial Neural Networks are reviewed too.
The plan of experiments includes the following tasks:
1. to implement the author’s own ANN and to study the efficiency of the network on the
example of approximating a one-dimensional stochastic process;
2. to compare the obtained approximation results with the results of existing ANN
packages;
3. to approximate the state space in Reinforcement Learning, using a toy
problem for demonstration purposes.
Within the framework of the first experiment, the Artificial Neural Network demonstrates
high learning quality. The application of three hidden layers with 70 neurons in each ensures a
mean-square error of ems = 0.0013 (see Figure 4). Such a level of error ensures sufficient
precision for modelling a one-dimensional stochastic process consisting of 30 observations.
Figure 4. Function approximated by ANN having three hidden layers
The comparison of the results with the two most common ANN packages
(NeuroSolutions 6.0 and Multiple Back-Propagation v.2.2.2) was performed within the
framework of the second experiment. The NeuroSolutions 6.0 software, for the simplest network
architecture, approximates with a mean-square error of ems = 0.00943. The Multiple
Back-Propagation package converges to an error of ems = 0.0012. The comparison
allows us to conclude that the achieved precision matches that of the third-party packages. This
allows us to use our own implementation of a neural network in the subsequent experiments.
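A back-propagation training loop of the kind used in these experiments can be sketched as follows (a deliberately small sketch: one hidden layer of 20 tanh units instead of the 3 × 70 architecture reported above, a sinusoid standing in for the stochastic process, and all names being illustrative assumptions):

```python
import math
import random

random.seed(0)

# Training data: 30 samples of a one-dimensional process (here a sinusoid).
data = [(i / 29.0, math.sin(2 * math.pi * i / 29.0)) for i in range(30)]

H = 20                                           # hidden tanh units
W1 = [random.gauss(0, 0.5) for _ in range(H)]    # input-to-hidden weights
B1 = [0.0] * H
W2 = [random.gauss(0, 0.5) for _ in range(H)]    # hidden-to-output weights
b2 = 0.0
lr = 0.05

def forward(t):
    h = [math.tanh(W1[j] * t + B1[j]) for j in range(H)]
    return h, sum(W2[j] * h[j] for j in range(H)) + b2

for _ in range(10000):                           # stochastic gradient descent
    t, y = random.choice(data)
    h, out = forward(t)
    err = out - y                                # gradient of 0.5 * err^2
    for j in range(H):
        dh = err * W2[j] * (1 - h[j] ** 2)       # back-propagated through tanh
        W2[j] -= lr * err * h[j]
        W1[j] -= lr * dh * t
        B1[j] -= lr * dh
    b2 -= lr * err

mse = sum((forward(t)[1] - y) ** 2 for t, y in data) / len(data)
```

Even this minimal network drives the mean-square error well below the variance of the target series, illustrating why the larger architectures above reach errors of the order of 0.001.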
Figure 5. 3-step algorithm of approximated model building
(The figure shows the three steps: rough RL producing a tabular Q-function; its
approximation by an ANN; and final learning with RL + ANN, yielding the approximated
Q-function.)
The idea of approximating the Q-function in Reinforcement Learning with an Artificial
Neural Network is realized in the third experiment. The key problems of implementing the
method are presented; to overcome them, a new ANN learning approach (embedded in RL) is
suggested in the Ph.D. thesis (see Figure 5).
To test the approach we use the mountain car toy problem [9]. In this problem, the
gravitational force exceeds the engine power, so it is impossible to ‘climb up’
directly from the state of rest. The only solution is to develop a strategy of rolling from
one slope to the other in order to accumulate supplementary momentum. The problem
demonstrates the necessity of repeatedly moving away from the peak in order to reach it in
the future. The available actions are: engine inactive (0), acceleration forward (+1) and
acceleration backward (-1).
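The tabular Q-learning step at the core of this experiment can be sketched with the parameters listed below for Figure 6 (α = 0.3, γ = 0.99); the dictionary-based state encoding is an illustrative assumption:

```python
def q_update(Q, s, a, r, s_next, alpha=0.3, gamma=0.99):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    Q maps each (discretized) state to a dict of action values."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]
```

Applied over thousands of episodes on the discretized (position, speed) grid, updates of this form produce the optimal Q-function surface shown in Figure 6.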
The optimal policy of the toy problem is obtained using discrete RL with a tabular
Q-function. The surface of the Q-function over the state space is represented in Figure 6. The
action dimension is omitted; the value of the optimal action at each point of the space is
shown.
Scaling factor of the Q axis: 0.1
Space size: 70 × 80
Episodes count: ≈ 8 000
ε: 0.1 (probability of a random action)
γ: 0.99 (discount-rate parameter)
α: 0.3 (learning rate)
λ: 0.92 (trace-decay parameter)
Figure 6. Example of the optimal policy Q*(s) = maxa Qt(s, a)
It is necessary to obtain a similar surface as the result of the three-step algorithm. The first step is the development of the first approximation (a rough policy). It is experimentally established that a discretization of the space into a grid of 20 x 20 cells is sufficient for this objective. The second step is the transfer of the intermediate Q-function into the ANN. Now the precision of the function depends on the "capacity" of the network. On the basis of a series of experiments, it is identified that 6 hidden layers containing 110 neurons each are sufficient to reproduce the surface of the Q-function [9].
When the changes of the network coefficients approach zero (∆eij→0), we can move on to the third step – training the network in the mode of interaction with the environment through the RL framework. The experiments demonstrate that after ten learning iterations the network "forgets" the previously learned examples unless they are supplied for training continuously together with the current training examples. Taking the prior steps into consideration, the rough policy is a matrix of reference points. Such a matrix supports the "memory" of the neural network, not allowing it to forget the reaction to states rarely encountered during learning in the environment [9]. As a result of learning, the surface of the Q-function demonstrated in Figure 7 is obtained in the third step.
The obtained policy looks substantially smoother than its tabular analogue. On the one hand, this reduces precision; on the other hand, it allows working in environments with continuous parameters.
The experiments with the toy problem showed that the algorithm avoids the problem of an absent initial training set and permits functioning in continuous environments [9]. However, the learning time increases.
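The "matrix of reference points" idea from the third step can be sketched as follows: each training batch for the network mixes fresh RL transitions with anchor examples sampled from the rough tabular Q-function, so that rarely visited states are not forgotten. This is a minimal sketch of the replay scheme only; `rough_q`, `fresh` and `make_batch` are illustrative names, and the actual ANN update is outside the fragment.

```python
import random

def make_batch(fresh, rough_q, batch_size=32, anchor_frac=0.5):
    """Mix fresh RL samples with reference points from the rough policy.

    fresh   : list of ((state, action), q_target) pairs from recent interaction
    rough_q : dict mapping (state, action) -> q value (the tabular first stage)
    """
    n_anchor = int(batch_size * anchor_frac)
    # anchor examples keep the network's "memory" of the rough policy
    anchors = random.sample(list(rough_q.items()), min(n_anchor, len(rough_q)))
    recent = random.sample(fresh, min(batch_size - len(anchors), len(fresh)))
    batch = [((s, a), q) for (s, a), q in anchors] + recent
    random.shuffle(batch)
    return batch
```

Each call returns a shuffled batch in which roughly half of the examples come from the reference-point matrix, which is one simple way to realize the continuous re-supply of prior examples described above.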
Figure 7. Q-function for the action "throttle", obtained in the third step
It is possible to conclude that ANN learning by the error back-propagation method provides a powerful tool for approximating functions given in tabular form [34]. The performed experiments on the toy problem demonstrate that this property can successfully be used for approximating the Q-function in the Reinforcement Learning algorithm.
The consideration of the concepts of Computational Intelligence and the appropriate methods described in this chapter allows us to draw the following conclusions:
1. the analysis of some concepts of Computational Intelligence, as well as of its individual methods (ANN, RL, the agent approach, methods of data mining and others), allowed us to describe the main features of the architecture under development (conveyor organization of the data-transformation flow, timely response to events, etc.), as well as the applied methods of intelligent computing;
2. the review of expressing a Markov Decision Process through software agents demonstrates a number of sufficient methods and a wide range of solvable tasks; at the same time, problems linked to the conveyorization (pipelining) of the problem require the development of a special agent-system architecture;
3. following the principle of method synergy, Artificial Neural Networks are considered in this work as an approach to approximating the state space of a Markov Decision Process.
Chapter 3 (The development of the Data Mining based system for the dynamic system’s MDP
model building)
The third chapter of the research develops and describes the mathematical basis of an intelligent system for a decision-making task whose data are expressed using multidimensional time series structures.
The solution is based on finding hidden regularities between the so-called families of time series. A series of Data Mining technologies is used for that purpose. The concepts of classes and profiles of the dynamic behaviour of observed variables are introduced for storing and operatively processing the mined data. The behaviour profiles are interpreted as single-step Markov transitions, which are used to create the model of the process under investigation.
Within the framework of this chapter, it is necessary to perform the following (in order to
solve the set task):
1. a review of the current situation in the field of applying the method to tasks expressed through multidimensional time series, as well as research on possible existing approaches;
2. the development of data-mining-based methods to build the time series, process them and convert them into data structures (states of the system) compatible with the Markov Decision Process.
The research and development of the Markov model building method for data expressed through time series provides the solution to the following problems:
– cleaning of the raw data of the observed process, construction of the time series;
– formalization of the time series as an environment needed for building the state space;
– composition of a transition graph reflecting the generalized dynamic behaviour of the observed variable values;
– development of optimal policies and their use.
Thereby, the approach is based on the generalization of individual observations, the building of the state space and the further development of the model. In general, the functioning of a system based on the methods being developed has to contain the following steps:
– to "acquire" the existing observations regarding the development of the process under investigation:
o to identify the regularities of transitions into one or another state under the appropriate influence;
o to build the state transition graph representing the general model of the process being researched;
– to build a particular realization of the Markov Decision Process for the sought solution at the current parameters on the basis of the model;
– to develop a policy of actions for the particular realization of the MDP;
– to invite the expert to perform the action prescribed by the developed policy at the given parameters;
– to accomplish the permanent renewal of the environment, the transition graph and the policy upon acquiring new data on the actual process progress.
The chapter provides the details of the development of the MDP model building method. The method of transforming data expressed through time series into the structure of a Markov Decision Process is suggested and described. For that purpose, the concept of the behaviour profile of observed variable values is introduced. The criteria for comparing time series are reviewed.
Description of the approach. One of the directions of using Markov Decision Process models is research on the dependence of the dynamics of one variable on the others. The final objective is the application of the obtained model for timely decision making concerning the performed activity, with the aim of achieving the desired indicators of the dynamic system. The main problem, as with most mathematical models, is building it.
As in [33], the whole set of observations is considered a source of behaviour patterns (profiles); but, unlike [33], in this work the model is based not on a particular realization of a time series, but on many realizations corresponding to different combinations of the time series parameter values. Clustering the realizations of time series allows us to consider new operations on transition models: the minimally needed supplementing of a transition model using fragments of other analogous models. This operation allows us to extend the existing transition model in such a way that, with a certain probability, the model will allow the system to move into a state that was not stipulated during the learning phase (model building).
In general terms, the building of a dynamic system model presupposes the investigation of the particular regularities that took place in the development of the processes under observation, their generalization and their expression in some structure [4,7].
The dynamic system model building method being developed follows the described approach. The mathematical framework of the Markov Decision Process is considered as the model. The
state transition graph is used as a graphical expression of the model. The method of model building (see Figure 8) is divided into the following main stages [6]:
1) to process the raw data and build the time series T;
2) to identify the general regularities of the development of the processes and build the behaviour profiles П of the observed variables;
3) to find general transitions in the dynamic behaviour profiles and build the model of the processes' evolution (the transition graph).
Figure 8. The main stages of building of the Markov transition models
The key moment of transforming time series into the MDP structure is the interpretation of the dynamic behaviour profile in terms of forming the state set S [6]. In the context of the Markov Decision Process, the concept of a state is connected with the theory of intelligent software agents [62]: all the information available to an agent, gained by the agent's sensors from the environment at a certain time moment, is called a state.
For a task whose data are represented by time series, a special approach to determining the states of the dynamic system is offered [6]. The key difference is the consideration not of static variable values, but of the corresponding time series; in other words, the evolution dynamics of the variables being researched is considered. The following interpretation of the dynamic behaviour profile meets these aims. Let the behaviour profile, gained as a result of time series clustering, consist of two centroids: v1 and v2 (see Figure 9).
Figure 9. Sample of behaviour profile of volumes and sale prices
Thus, we identify a structure in the profile that includes three elements: a) the dynamics of variables v1 and v2 before event ei; b) the event ei, happening at the time moment ta; c) the dynamics of variables v1 and v2 after event ei. The area of variable evolution observed before the event is interpreted as the initial state s0 ∈ S (see Figure 9, area t = [1; ta)), the event ei – as the action a ∈ A, and the area of variable evolution observed after the event – as the transition state s1 ∈ S (area t = [ta; tmax]). The dynamic behaviour profile П can then be considered as a single deterministic Markov transition (see Figure 10) [6]. Taking into account that the transition is built on actual observations, the value of the transition probability p(s1|s0, a0) is, for the time being, equal to one.
The states of the set S are marked with white circles; the grey circle marks the action a0, as a result of which the deterministic transition happens. Once the corresponding state identifier is assigned to each fragment of each profile, it becomes possible to work not with the time series but with the corresponding states, which is a necessary condition for working with a Markov Decision Process.
Figure 10. Interpretation of sales profile
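The interpretation above can be sketched directly in code: everything before the event time ta becomes the initial state, the event becomes the action, and everything from ta onward becomes the transition state. This is an illustrative sketch only; the function and field names are not from the thesis software.

```python
def profile_to_transition(v1, v2, t_a, event):
    """Interpret one behaviour profile (two centroid series) as a single
    deterministic Markov transition (s0, a, s1)."""
    s0 = (tuple(v1[:t_a]), tuple(v2[:t_a]))   # dynamics before the event, t = [1; ta)
    s1 = (tuple(v1[t_a:]), tuple(v2[t_a:]))   # dynamics after the event, t = [ta; tmax]
    # the transition is built from an actual observation, so its
    # probability p(s1 | s0, a) is, for the time being, equal to one
    return {"s0": s0, "a": event, "s1": s1, "p": 1.0}
```

Aggregating many such atomic transitions (and normalizing their counts, as in formula (3) below) turns the deterministic probability of 1.0 into proper transition probabilities.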
Description of the goal state and the problem. Defining the goal state at the stage of building the state set is not necessary. However, as soon as the model is built, the initial and the goal states should be defined in order to build an action policy. Accordingly, we describe how the goal state of the Markov Decision Process is interpreted in a problem expressed through time series.
The goal state is the appropriate proportion of the observed variables vi meeting the requirements of an expert. For example, consider the observed variables in the problem of car sales: v1 – the volume of sales and v2 – the sales price. Then, for some fixed parameters of the space Ψ, a state can be considered a goal state if the time series of sale volumes v1 and sale prices v2 meet the given values.
The action A and reward R matrices. When creating the model of the researched system, the action matrix may be obtained using several methods, for example:
1) the set of actions A contains only the actions present in the set of profiles П;
2) the set of actions A contains all the allowed values of change of the controlled variable vi within some range, irrespective of the presence of the specific value in the profiles П.
Analogously to the goal states, without detailing the method of building the reward matrix R, let us consider its definition for a problem expressed through time series. The reward matrix R is commensurate with the number of states and, for each state, keeps the value of the reward gained by the system (or agent) upon reaching the current state si. For example, let the reward take the value 1.0 if the goal state sgoal is achieved, and the value -0.04 in the opposite case. The matrix R can then be presented by the function:
ri = R(si) = { 1.0, if si ∈ Sgoal; -0.04, if si ∉ Sgoal }.   (2)
In this way, to provide the system with the reward matrix R, it is enough to determine the set of goal states Sgoal.
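Formula (2) translates directly into code; only the set of goal states has to be supplied. The function name is illustrative.

```python
def reward(state, goal_states):
    """Reward function (2): 1.0 in a goal state, -0.04 otherwise."""
    return 1.0 if state in goal_states else -0.04
```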
Transition probabilities graph P. A central element of the Markov Decision Process is the transition graph, in which it is necessary to find the optimal action policy π*. The building of the transition graph is the most complicated part of expressing the task being researched in terms of MDP.
In the case of many economic tasks the transition model cannot be expressed analytically so as to be correct for all states, but it can be obtained in the form of a table based on the generalization of actual transitions. The table representation makes it possible to describe the resulting state for each state-action combination. Thus, the transition graph is built for the whole space Ψ (based on the definition of a state).
The transition graph is built in the process of generalizing atomic transitions (the dynamic behaviour profiles П). The generalization is the calculation of the probability of a transition from state s into state s' when performing action a, and it is based on the number of actual observations of this transition (3). Since the environment is fully observable (a training set is available), it is reasonable to use a statistical approach to evaluate the unknown parameters when calculating the transition matrix P. We consider one of the simplest approaches in this research – maximum likelihood estimation. Thus, the number of factual observations of each transition relative to the total number of transitions from the considered state is expressed as follows:
P(s'|s, a) = N(s, a, s') / Σs'' N(s, a, s''),   Σs' P(s'|s, a) = 1,   (3)
where s' – the target state; s'' – any state of the set S into which a transition from state s under action a is possible (i.e. appropriate factual observations exist); N(s, a, s') – the number of factual transitions from state s into state s' when performing action a; Σs'' N(s, a, s'') – the total number of transitions from state s under action a into any possible state.
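The maximum likelihood estimate (3) amounts to counting and normalizing the observed transitions. A minimal sketch, with illustrative names:

```python
from collections import Counter, defaultdict

def estimate_transitions(observations):
    """Formula (3): observations is an iterable of (s, a, s_next) triples;
    returns a dict mapping (s, a, s_next) -> P(s_next | s, a)."""
    counts = Counter(observations)          # N(s, a, s')
    totals = defaultdict(int)               # sum over s'' of N(s, a, s'')
    for (s, a, _), n in counts.items():
        totals[(s, a)] += n
    return {(s, a, s2): n / totals[(s, a)] for (s, a, s2), n in counts.items()}
```

By construction the probabilities for each (s, a) pair sum to one, matching the normalization condition in (3).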
The development of the clustering procedure. The clustering procedures are the central generalizing mechanisms for the actual observations in the process of MDP model building. The precision of the prognosis and the "adequacy" of the future model to the learning data depend on the correct choice of clustering criteria with respect to the clustered data and on the accepted clustering parameter values.
In this research, agglomerative hierarchical clustering of time series is applied, which requires forming a symmetric distance matrix. Since the objects of clustering are time series, it is first of all necessary to define a metric that allows us to quantitatively compare the similarity of one time series with another.
To calculate the distance, we use two simple criteria: the Euclidean distance and a shape-based criterion. The latter allows us to compare not the absolute values, but the shapes of the curves of the corresponding time series (4).
S(a, b) = Σi=1..N-1 |Δai − Δbi|,   Δai = ai+1 − ai,   Δbi = bi+1 − bi.   (4)
The Euclidean distance makes it possible to group time series close to each other by distance, while the shape-based criterion groups them by the outline (profile) of their curves (see Figure 11).
Figure 11. Two criteria for evaluation of time series proximity
This kind of approach allows us to obtain clusters of time series close in shape, in groups homogeneous regarding the scale of distribution. The clustering of the time series of sale prices is made only according to the shape-based criterion.
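The two proximity criteria can be sketched as follows; `shape_distance` is a direct reading of formula (4), comparing step-to-step increments rather than absolute values.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def shape_distance(a, b):
    """Shape-based criterion (4): S(a, b) = sum_i |delta_a_i - delta_b_i|,
    where delta_x_i = x[i+1] - x[i]."""
    da = [a[i + 1] - a[i] for i in range(len(a) - 1)]
    db = [b[i + 1] - b[i] for i in range(len(b) - 1)]
    return sum(abs(x - y) for x, y in zip(da, db))
```

Note that two series with identical shape but shifted levels have a shape distance of zero while their Euclidean distance is large, which is exactly the distinction illustrated in Figure 11.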
MDP-model building method representation in the form of an intelligent system. The basis of dynamic programming methods is the principle of making consequent interim decisions leading to the objective. The technique of decision search thus acquires iterative features. As a consequence, it becomes preferable to realize the method in the form of a software tool. It is necessary to develop the architecture of the software tool, which will predetermine the structure of the intelligent system. The created system ensures the interaction among the dynamic programming methods presented below, as well as the interaction with the database (the problem's data source) and with an expert (operator).
Figure 12. Diagram of intellectual system functioning and its interaction with the environment
The architecture of the intelligent system has to ensure the interaction with the user database, the expert and the operator. The architecture of a system meeting all the mentioned demands is given in Figure 12. The elements of the intellectual system are marked with a double outline, including the processing modules, the time series storage and the profile storage. The intellectual system possesses a visual interface for interaction with managers (operators) and a programming interface for interaction with the user database.
The directions of the data streams are marked with dotted lines, the control streams – with solid lines. The system being designed possesses characteristic properties such as autonomous and uninterrupted functioning, adaptation to changing data (in other words, learning of some kind), goal-orientation (the presence of a goal of optimizing some parameters of the process being researched by means of model building), communicativeness (interaction with the users and the database), and others. This allows us to consider the software system as an intelligent software agent.
Chapter 4 (The application of MDP model building using methods of Data Mining in the
problem of Dynamic Pricing Policy)
The practice of sales administration points out the necessity of a Dynamic Pricing Policy to increase competitiveness. Markov models are effective when decision making involves uncertainty in the chronicle of events and crucial events can take place repeatedly. The aim of this chapter is to demonstrate, using a test example, the applicability of the Markov Decision Process to solving the Dynamic Pricing Policy problem. Respectively, the tasks of this chapter are the following:
– to formulate the problem of Dynamic Pricing Policy as a task of dynamic programming;
– to formulate and describe the method of dynamic control of the pricing policy based on the Markov Decision Process;
– to formulate the method of MDP model building on the basis of regularities detected, using Data Mining tools, in the factual sales data.
The basic reasons for choosing Dynamic Pricing Policy as the experimental problem are:
– the presence of a time dimension, determining the possibility of representing the sales data as time series;
– the possibility of generalization over a number of observed variables (for example, wholesale customer, goods and others) with the aim of building the model;
– the possibility of considering the price correction process as a system that is, at every moment of time, in a certain state, possesses a controlling mechanism (change of state) and has the conception of a goal state.
The aforementioned reasons are important because they allow us to demonstrate the features of the Markov Decision Process application approach being developed in the Dynamic Pricing Policy problem. Finally, the topicality of Dynamic Pricing Policy, dictated by the rapid evolution of internet technologies in modern business, also determines the choice of this task.
The source of sales information, the database and the method development platform in the problem under consideration is the enterprise resource planning system 1С:Enterprise v7.
The task of Dynamic Pricing Policy has several definitions. Here we consider the following one: dynamic pricing is the operative adjustment of prices to customers depending on the value that the customers associate with the product or service [49] (the definition, in its turn, is based on [56]). As the value that the customers associate with the product or service, we consider the three forms of price differentiation offered in this work [69].
A decision-making system is developed; it aims at the long-term maximization of sales sums, which is achieved through the correction of existing sale prices taking into account the available factors of the ERP system (see Figure 13) and the goal state set by an expert.
Figure 13. Interaction of price generator module and module of price correction
The mechanism of price correction is realized through the use of the MDP model. Building the price correction model includes finding the regularities of the sales evolution process in the past and generalizing them (see Chapter 3). In other words, the solution is reduced to the analysis of past changes and the creation of an appropriate model of the price evolution process.
The general objective of Markov controlled processes with discounting of incomes is the choice of a system-managing vector so as to gain the maximum profit over the horizon of the system's functioning [80]. Due to this property, the MDP framework is appropriate for the Dynamic Pricing Policy problem.
This work originally stipulates the processing of multidimensional data, including data on wholesale buyers and product names; each dimension of the space contains some hundreds of values. The methods of Data Mining are applied in order to discover the regularities that describe the outcomes of pricing policy actions in the past. A software tool (based on an intelligent agent) for online tracking of new sales data (which come from the managers and operators of the ERP system) is realized. Such tracking makes it possible to update the model in a timely manner by inserting new data on sales and the outcomes of the price corrections into it.
Data model. The price correction problem is a discrete process; in other words, the behaviour of a wholesale customer–goods system can be expressed with a finite number of states. At any discrete time moment ti the system is in one of the possible states sj ∈ S. The sales process is observed over time for fixed values of the appropriate parameters (dimensions of the space). Each state of the system is determined using two vector values (time series):
– p – the time series of prices, representing the dynamics of price changes;
– v – the time series of sales volumes, representing the dynamics of sales volumes.
Figure 14 (on the right) shows a sample system with the fixed value "Light" for the dimension Buyer and "Led" for the dimension Goods. The sale price pij and the sale volume vij are the values observed within the framework of the system. Accordingly, the dimensions of the space are the observed variables Customer, Goods and Time. Each point of the hypercube (Figure 14, left) is determined by two static values: price and sales volume.
Figure 14. Data hypercube (on the left) and time series of the state (on the right)
Apart from the dimensions "Goods" and "Customer", other dimensions also exist; however, they are omitted in this study for the sake of clarity. Let us also suppose that the data have passed all the stages of cleaning and pre-processing. The description of the data cleaning procedures is presented in [4].
The price correction problem in terms of dynamic programming. The task is stated in the context of dynamic programming, which means that its solution has to be represented as some sequence of actions – in other words, as the solution of the stated subtasks. The price correction problem possesses such a property: for example, the price p of some goods Gi may be transferred sequentially (over some period of time) to the desired value without a significant (given) loss of sales volume, while an immediate change of price could cause the loss of wholesale customers. The multistep nature of price correction makes the use of Dynamic Programming appropriate in the Dynamic Pricing Policy problem.
The price correction problem in MDP terms. We express the problem described above through the recursive Bellman equation. In terms of MDP, the equation below expresses the value of the expected reward gained for transitions from the current state s into states s' according to a certain policy π [62]:
Vπ(s) = R(s) + γ Σs' P(s'|s, π(s)) · Vπ(s').   (5)
The expression for the optimal policy is known as the Bellman optimality equation in terms of MDP. It describes the current reward for taking the action entailing the maximal expected reward in the future [62]:
V*(s) = R(s) + γ · maxa Σs' P(s'|s, a) · V*(s'),   (6)
where s – the state of the system, determined by the vectors p and v: s = {p; v}; R(s) – the reinforcement gained in the current state s.
The goal state, like any state, is determined through its own values of reinforcement. In the general case, the reward function determines how good or bad it is to stay in the current state (similar to "pleasure" or "pain" in a biological context). In the case of price correction, the reinforcement represents the local amount gained from the wholesale customer, and it is determined for state s in the following way:
R(s) = Στ=1..|s| p(τ) · v(τ),   (7)
where |s| – the length of the time series contained in state s (thus the reward is the total income over all the days observed within state s); τ – the time index of the researched time series.
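Formula (7) is simply the dot product of the price and volume series of the state; a minimal sketch with an illustrative function name:

```python
def state_reward(prices, volumes):
    """Reward (7): income accumulated over all observed days of a state,
    i.e. the sum of price(tau) * volume(tau) over the state's time series."""
    return sum(p * v for p, v in zip(prices, volumes))
```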
Let us continue the consideration of the variables of equation (6): P(s'|s, a) is the probability of the system's transition into state s' from state s when performing action a. The calculation of the probability matrix P is based on counting the factual observations of each transition relative to the total number of transitions from the state being considered (3).
Since the function V*(·) in expression (6) is present both in the left and in the right part, the calculation is performed in a recurrent way, i.e. by decomposing the whole task into subtasks. There exist two main algorithms for solving these equations: Value Iteration and Policy Iteration.
The Dynamic Pricing Policy problem is formulated in this chapter in terms of the Markov Decision Process. The model building method is offered; the obtained model is considered as the environment in which the MDP functions. The method is based on finding regularities and generalizing them. The methodology of model building is demonstrated in the framework of an elementary system including one wholesale customer and one unit of goods. The analysis of the solved problems allows us to formulate the following conclusions:
1. a dynamic system whose development is represented through time series can be expressed through a finite number of states and actions, and can have a transition model and a reward function;
2. the interaction of the decision making and support module with the wholesale customer–goods system, as well as the necessity of continuously updating the model, determine the application of an agent-oriented approach.
Chapter 5 (The performance of experiments regarding the model building in the problem of
dynamic pricing policy)
To approbate the suggested model building method, a software platform for carrying out the experiments is developed, the plan of the experiments is drawn up, and the process of their implementation is described.
Software development for the execution of experiments. The program modules for performing the experiments are realized on the platform 1С:Enterprise version 7.7. The choice is determined by the availability, for this platform, of data on purchases of products by a real enterprise over two years. The realized modules can also serve for the creation of final software designed to operate in the background mode and to perform decision making concerning price correction.
Besides, modules for carrying out the experiments that are not associated directly with the subject sphere are realized: these are the programs for working with the Artificial Neural Network and the Markov Decision Process in the toy problems.
The involvement of the development environment Borland Delphi 7 ensures and accelerates the processing of massive data whose volume exceeds the possibilities of the platform 1С:Enterprise v7.7; for example, the creation of a distance matrix is restricted to 5 000 elements in each dimension.
Plan of the experiments. To evaluate the workability and the efficiency of the developed MDP model building method, a plan of experiments is developed (see Table 3). Two series of experiments are included in the plan. By comparing the model and the actual development of the processes, the first series of experiments allows us to evaluate how well the MDP model fits the learning data. The aim of the second series of experiments is to research the quality of the MDP model created through the approximation of the space by an Artificial Neural Network.
Table 3
The plan of experiments concerning the building and application of the MDP model in the Dynamic Pricing Policy problem

Series Nr | Description | The experiment aim
I. | Using numerical characteristics, to evaluate the similarity of the built MDP model with respect to the factual processes | To evaluate the correctness of the MDP model building algorithms
I. | Using numerical characteristics, to compare the efficiency of the MDP model solution with the factual solutions on testing data | To evaluate the efficiency of the MDP model in the exploitation mode
II. | Using numerical characteristics, to estimate the similarity of the built MDP + ANN model to the factual processes | To evaluate the correctness of the algorithms of MDP model building with an approximated space
The criterion for evaluating model quality is the proportion of the number of successfully modelled transitions to the number of actual transitions, as well as the evaluation of profit expressed in conventional units.
The pricing policy model building and exploitation. Actual data on the sales of products made by the Latvian food industry are used to create the pricing policy model in this experiment. The observed data cover three months: May, June, and July. The sales data are represented by electronic documents of the ERP system „1C:Enterprise v7.7”; each electronic document contains the date of the deal, the name of the wholesale customer, the register of goods, the sales volume, and the prices. Approximately 28.5 thousand sale documents, 1 725 wholesale customers, and 1 725 product names are present within the period being researched.
The obtained transition model (see Table 4) represents the Markov decision process. It consists of 1192 transitions over 294 states.
Table 4
A transition model fragment

Initial state s   Action a               Transition state s'   Transition probability P(s'|s,a)
Id_7              p̃2 ~ (0.0009, 0.05]   Cl_11                 1.0
Cl_66             p̃2 ~ (0.0009, 0.05]   Cl_54                 0.25
Cl_3              p̃1 ~ (0.05, 0.1]      Id_11                 0.667
Id_14             p̃1 ~ (0.05, 0.1]      Id_15                 0.4
Id_16             p̃3 ~ (0.0, 0.01]      Id_17                 0.667
Cl_66             p̃4 ~ (0.01, 0.015]    Cl_54                 0.333
Cl_3              p̃4 ~ (0.01, 0.015]    Id_23                 1.0
Id_24             p̃5 ~ (0.015, 0.02]    Id_25                 0.75
…                 …                      …                     …
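A transition model of the kind fragmentarily shown in Table 4 can be estimated from observed (s, a, s') triples by frequency counting; the sketch below (state and action names invented) normalizes the counts into P(s'|s, a):

```python
from collections import Counter, defaultdict

def build_transition_model(observations):
    """Estimate P(s' | s, a) from a list of observed (s, a, s') triples."""
    counts = defaultdict(Counter)
    for s, a, s_next in observations:
        counts[(s, a)][s_next] += 1
    model = {}
    for (s, a), nexts in counts.items():
        total = sum(nexts.values())
        model[(s, a)] = {s_next: n / total for s_next, n in nexts.items()}
    return model

obs = [("Cl_66", "p2", "Cl_54"), ("Cl_66", "p2", "Id_9"),
       ("Cl_66", "p2", "Id_9"), ("Cl_66", "p2", "Id_9")]
model = build_transition_model(obs)
print(model[("Cl_66", "p2")])  # {'Cl_54': 0.25, 'Id_9': 0.75}
```

Each row of the estimated table then sums to one over the successor states.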
Using the graphical package «yEd», a visual representation of the model is possible. A fragment of the model, based on the XML file generated by the experimental platform, is given in Figure 15. The light ovals designate states, the dark ones designate actions, and the values over the edges give the corresponding transition probabilities. The resulting graph represents the pricing model for the given time period. A fragment of the sales volume development for one observed combination of wholesale customer and goods is given in the lower part of Figure 15. The states not accompanied by any price change (in other words, transitions with a zero price change) are outlined with a dotted line, while the transitions caused by a certain price change are outlined with a dashed line. The straight arrows show the position of each graph element on the real segment of the sales process.
Figure 16 shows the number (as a percentage of the total of 7000) of time series possessing the corresponding value of the numerical estimation X(C;G), which describes the correspondence of the model and its separate parts to the actual transitions (for details see paragraph 5.2 of the Ph.D. Thesis). We note the absence of marks with a value below 0.5: every time series is at least half described by the model and has a partial sequence of events. A low mark (below 0.9) is associated in 97% of cases with a discontinuity of the model (this effect can be observed in Figure 15, above, for single transitions). Nevertheless, 70% of the observations have the evaluation X(C;G) = 0.8, which characterizes the model as able to reproduce the majority of the processes.
Figure 15. The fragment of the transition graph (above); dynamics of sales volume for a fixed wholesale customer-goods combination (below)
Having the transition model, it becomes possible to use the well-studied policy search algorithms for Markov decision processes (Policy Iteration, Value Iteration), for which it is necessary to determine the goal state. In general, the exploitation of the model for the process being researched includes the following stages: (a) match the current state of the process with one of the model states, which determines the initial state s0; (b) determine the goal state s_target; (c) build the policy π* of price corrections; (d) track the current state and update the policy as the process evolves.
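Stages (a)-(c) rely on standard tabular dynamic programming. A compact Value Iteration sketch (the two-state model at the end is purely illustrative, not thesis data) could look as follows:

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """P[(s, a)] -> {s': probability}; R[(s, a, s')] -> immediate reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q = [sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                     for s2, p in P[(s, a)].items())
                 for a in actions if (s, a) in P]
            v_new = max(q) if q else 0.0          # states without actions stay terminal
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < eps:
            break
    # greedy policy extraction with respect to the converged values
    policy = {}
    for s in states:
        qs = {a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                     for s2, p in P[(s, a)].items())
              for a in actions if (s, a) in P}
        if qs:
            policy[s] = max(qs, key=qs.get)
    return V, policy

# illustrative two-state model: action a2 earns the larger reward
states, actions = ["s0", "goal"], ["a1", "a2"]
P = {("s0", "a1"): {"goal": 1.0}, ("s0", "a2"): {"goal": 1.0}}
R = {("s0", "a1", "goal"): 1.0, ("s0", "a2", "goal"): 5.0}
V, policy = value_iteration(states, actions, P, R)
print(policy["s0"])  # a2
```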
Figure 16. Distribution of value X(C;G)
The price changes made come into force, and after the specified period the system makes the transition into the next state s_{t+1}. The transition is characterized by the changed value of the price and by the reaction of demand. Such interaction of the decision making module with the external wholesale customer-goods system represents the typical interaction of an intelligent system with its environment.
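This closed loop of acting and observing can be sketched as follows; the deterministic toy environment, the action names, and the states are invented for the example:

```python
def run_episode(env_step, policy, s0, s_target, max_steps=50):
    """Apply the policy and track the visited states until the goal is reached."""
    s, trajectory = s0, [s0]
    for _ in range(max_steps):
        if s == s_target:
            break
        a = policy.get(s)
        if a is None:            # the current state is unknown to the model
            break
        s = env_step(s, a)       # the environment answers with the next state s_{t+1}
        trajectory.append(s)
    return trajectory

# invented deterministic environment
transitions = {("s0", "raise_price"): "s1", ("s1", "raise_price"): "goal"}
policy = {"s0": "raise_price", "s1": "raise_price"}
print(run_episode(lambda s, a: transitions[(s, a)], policy, "s0", "goal"))
# ['s0', 's1', 'goal']
```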
Exploitation of the obtained policy on testing data. The aim of this experiment is to evaluate the possibility of applying the model to data not included in the learning sample. In fact, the experiment reflects the initial task posed to the intelligent system. Unlike the previous experiment, this one includes the phase of exploiting the optimal policy π*, which makes it possible to evaluate the efficiency of the model numerically in terms of profit. When building the policy π*, all available states are considered goal states. The strategy is calculated based on the gained profit (7) and the corresponding actions. This allows covering all variants of process evolution and choosing the optimal one in terms of discounted profit.
The most demonstrative case of exploiting the developed policy on testing data is the example given in Figure 17. According to the MDP model, on the transition from the state Cl_438 into Cl_636, two alternatives appear: the action correcting the price within the range p̃27 ~ (0.15, 0.1625], or the action correcting the price within the range p̃24 ~ (0.1125, 0.125]. Depending on the chosen action, the terminal states can be Id_2181, Cl_683, or Id_734. Since the local reward values are known (the total income at the state), it becomes possible to calculate the policies reaching each terminal state and to choose the optimal one.
Figure 17. The fragment of the transition graph and the optimal policy π* (above); dynamics of sales values and prices for a particular wholesale customer-goods combination (below)
Although the terminal state Cl_683 does not possess the maximum value of local reward, it is the most “attractive” one in terms of discounted reward, constituting the value V(Cl_683 | p̃24) = 405.585. The maximal discounted reward determines the optimal policy π*, represented in Figure 17 by the light grey bold arrows.
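The comparison of terminal states by discounted reward reduces to a short computation; the reward sequences and the discount factor 0.95 below are hypothetical stand-ins, not the thesis values:

```python
def discounted_return(rewards, gamma=0.95):
    """Discounted sum of local rewards along a path: sum of gamma**t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# hypothetical reward paths toward two candidate terminal states
paths = {"Cl_683": [218, 247], "Id_734": [195, 153]}
best = max(paths, key=lambda s: discounted_return(paths[s]))
print(best)  # Cl_683
```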
Ultimately, the precision depends on which general features of the sales process evolution are extracted from the individual observed cases when the MDP model is built. To evaluate the effectiveness of agent functioning over the 5000 combinations, Figure 18 gives histograms reflecting the distribution of combinations according to the value of the comparison of the modelled process with the actual process.
According to Figure 18, it can be concluded that the model predominantly contains combinations (77.8%) for which the price corrections lead to positive results (i.e., profit values greater than zero). A negative evaluation value means that the price corrections offered by the model turned out to be less effective than the solutions provided by the expert.
Figure 18. Distribution of customer-goods combinations according to profit (left) and according to distance (right)
The presence of customer-goods combinations for which the price corrections cause losses is explained by the insufficient number of individual observations found for such combinations to create a valid transition graph.
Experiments on MDP space approximation by means of ANN. This experiment carries out the practical research into the possibility of approximating the Dynamic Pricing Policy decision space with Artificial Neural Networks. The details of the method and its application to the toy mountain car problem are provided in subsection 2.3.
The approximation of the transition space is needed to obtain transition probabilities for states in which the system has never been before but which it can potentially reach. In such cases it is advisable to have at least an estimated value of the transition probabilities.
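The thesis trains multilayer configurations such as “22-22-22-1” on the real data; purely as an illustration of the idea (toy dimensions, invented training data, plain stochastic gradient descent, a sigmoid output squashing the estimate into [0, 1]), a minimal one-hidden-layer approximator might look like this:

```python
import math
import random

def approx_transition_net(samples, hidden=8, lr=0.5, epochs=3000, seed=0):
    """Tiny one-hidden-layer sigmoid MLP mapping a state-action feature
    vector to a transition-probability estimate in [0, 1]."""
    rnd = random.Random(seed)
    n_in = len(samples[0][0])
    W1 = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rnd.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def sig(z):
        return 1.0 / (1.0 + math.exp(-z))

    def forward(x):
        h = [sig(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
        y = sig(sum(w * hi for w, hi in zip(W2, h)) + b2)
        return h, y

    for _ in range(epochs):
        for x, target in samples:
            h, y = forward(x)
            dy = (y - target) * y * (1.0 - y)          # output-layer error signal
            for j in range(hidden):
                dh = dy * W2[j] * h[j] * (1.0 - h[j])  # hidden-layer error signal
                W2[j] -= lr * dy * h[j]
                for i in range(n_in):
                    W1[j][i] -= lr * dh * x[i]
                b1[j] -= lr * dh
            b2 -= lr * dy
    return lambda x: forward(x)[1]

# invented training set: feature vector -> observed transition frequency
samples = [([0.0, 0.0], 0.1), ([0.0, 1.0], 0.5),
           ([1.0, 0.0], 0.5), ([1.0, 1.0], 0.9)]
net = approx_transition_net(samples)
print(round(net([1.0, 1.0]), 2), round(net([0.0, 0.0]), 2))
```

The trained function can then be queried for feature vectors of state-action pairs never observed in the learning sample.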
The convergence curves for various numbers of hidden layers and of neurons per hidden layer of the ANN are given in Figure 19. The configuration “22-66-66-1” (two hidden layers, 66 neurons in each) can be marked as the “quickest” network configuration by the number of learning iterations used, while the “slowest” is the configuration «22-88-1». If the time spent on one learning iteration is taken into consideration, the configuration “22-22-22-1” turns out to be the quickest: it achieves a result comparable to “22-66-66-1” over a larger number of iterations, but in a shorter time. For this reason, the configuration “22-22-22-1” is used in the further experiments.
To evaluate the quality of the approximation, a cross-validation test is performed. For that purpose the whole learning sample is divided into 10 blocks, each block being compiled from records taken from the learning sample at a given interval. The convergence plots for the 10 cross validations are represented in Figure 20. All validations are performed on the network with the configuration “22-22-22-1”.
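The described split, taking records at a fixed interval for each block, can be sketched as:

```python
def interval_blocks(records, k=10):
    """Split a sample into k blocks by taking records at a step of k."""
    return [records[i::k] for i in range(k)]

blocks = interval_blocks(list(range(20)), k=10)
print(blocks[0], blocks[1])  # [0, 10] [1, 11]
```

Every record lands in exactly one block, so the blocks jointly cover the sample.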
Figure 19. Convergence of ANN for different parameters
We note that the convergence on the test sample does not approach zero asymptotically, as it does on the learning sample, but stays at a certain level. At the same time, the root mean square error (RMSE) on the test sample has the same scale of values as the error on the learning sample. When analyzing the approximation result of one of the cross-validation blocks (see Figure 21), one can note that the network is able to reproduce the main traits of the test function (the RMSE value is 0.17).
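RMSE here is the usual root mean square error between the network outcome and the desired value; as a reminder of the computation (toy numbers):

```python
import math

def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

print(rmse([0.2, 0.5, 0.9], [0.0, 0.5, 1.0]))  # about 0.129
```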
Figure 20. Convergence of the 10 cross validations
Such an error value allows us to use the approximation of the model for building the MDP action policy. The policy building algorithm now uses not the tabular representation of the transition probability function but its approximation by the ANN. To obtain a precision measure comparable with the evaluations of the previous experiments, we use expression (5.2) of the Ph.D. Thesis, which for each wholesale customer-goods combination (C;G) calculates the numerical evaluation X(C;G) of the correspondence of the model (or its separate parts) to the actual transitions.
Figure 21. Fragment of test sample approximation
Figure 22 represents the number (as a percentage of the total of 5000) of time series having the corresponding value of the numerical estimation X(C;G). Compared with the results given in Figure 16, the approximated model has lower precision. The important advantage of the approach, however, is that it becomes possible to make decisions in states in which the system has not been before.
Figure 22. Distribution of the estimation value X(C;G) for the approximated model
In practice, taking into account the relatively high error, such decisions can be offered to the expert for consideration as possible price correction strategies.
Chapter 6 (Analysis of results and conclusions)
It is shown that a dynamic system characterized by the presence of time series can be expressed through finite sets of states and actions and can be given a transition model and a reward function. Most attention was concentrated on the method of building the model considered as the environment in which the MDP apparatus functions. The method is based on finding regularities in the evolution of the observed variables over time and on generalizing the detected regularities. The employed agent-oriented architecture ensured an interaction of the MDP model with the environment in which permanent learning of the MDP model and transformation of the policy within a dynamic environment can be performed. The offered experimental platform allowed us to validate the developed method of model building under the circumstances of a real task. The platform allows the expert to choose the desired goal state. At the same time, the goal state
can be selected automatically from a subset of states with the maximum value of reward, or as any other terminal state whose achievement yields the maximum discounted income.
The method of fully automated building of the price evolution model is demonstrated within a real system including hundreds of customers and product names. Validation of the model showed acceptable precision. To increase the precision further, appropriate experiments (a search for parameters, a revision of the data structure) are needed, as well as improvement of certain algorithms. The main results of this Ph.D. Thesis are the following:
- the current state of the problem of building a Markov model of a dynamic system represented by time series in a multidimensional space of observed variables is researched;
- such areas of Computational Intelligence as intelligent agent systems, Data Mining procedures, Artificial Neural Networks, etc. are researched with the aim of developing a method for building the MDP model of a dynamic system and applying it to testing data;
- a new multistep approach to building the MDP model in the case of approximating the state value table with an Artificial Neural Network is developed and validated;
- a method of MDP model building based on searching for regularities in the evolution of the observed variables and transforming them into a set of states, a set of actions, and a transition probability function is developed;
- approaches for transforming the data of a real problem into a structure meeting the MDP framework are offered: methods and a data structure (the behaviour profile of observed variables) are developed to transform multidimensional time series into the states of a Markov decision process, methods of building the action set are suggested, and a method of searching for goal states is offered;
- to validate the suggested method of MDP model building, an experimental software platform and a series of accompanying software tools are developed;
- in the course of experiments based on real sales data, numerical evaluations of the closeness of the MDP model to the actual processes under investigation, as well as evaluations of the functioning of the agent system on testing data, are obtained.
MAIN RESULTS OF THE THESIS
A decision making method based on the MDP and ensuring the building of an MDP model in problems whose data are presented as multidimensional time series was developed as part of the doctoral work. The method was tested in a series of experiments. As a result, numerical estimates were obtained that allow us to conclude that the method is able to build an MDP model which adequately reflects the learning sample. The following tasks have been solved and results obtained.
1. The review of mathematical methods based on Markov decision processes allows us to conclude that MDP and Reinforcement Learning can be considered an effective means of modelling dynamic systems; some key problems of their use in tasks containing data in the form of multidimensional time series were defined.
2. The analysis of several computational intelligence techniques (ANN, RL, agent based systems, methods of Data Mining, etc.) allowed us to describe the main features of the developed method (pipeline organization of the data transformation, timely updates of the MDP model, and so on).
3. A special agent based architecture was developed to avoid an incorrect description of the interaction of the intelligent system with the environment.
4. The approximation of the state space and decision space of the MDP with the use of an Artificial Neural Network was implemented. The efficiency of the approach was demonstrated on a toy problem.
5. An intermediate structure (the behaviour profile of a studied value) for storing and processing the identified patterns of the investigated time series was formulated. The behaviour profiles of the values being researched are used for MDP model creation.
6. An approach using different clustering criteria for time series (Euclidean distance and shape-based similarity) according to the semantic load of each studied variable was offered.
7. For the purpose of testing the method, the problem of the Dynamic Pricing Policy within the
MDP framework was formulated, and the software for implementing the experiments was
developed.
8. A series of experiments enabling one to quantify the effectiveness of the MDP model building was carried out. The assessment was based on the comparison of the resulting model with the training sample, and also on the application of the model to data outside the training set.
Registered for printing on 31.01.2012. Registration Certificate No. 2-0282. Format 60x84/16. Offset paper. 2.25 printing sheets, 1.78 author's sheets. Print run: 30 copies. Order No. 10. Printed and bound at the RTU Printing House, 1 Kalku Street, Riga, LV-1658, Latvia.