IEEE Computational Intelligence Magazine, Volume 13, Number 4, November 2018 (www.ieee-cis.org)
Departments
2 Editor's Remarks
3 President's Message, by Nikhil R. Pal
5 Society Briefs: Technical Books in Computational Intelligence, by Vaishali Damle and Jeanne Audino
7 Conference Reports: Conference Report on the 2018 IEEE World Congress on Computational Intelligence (IEEE WCCI 2018), by Pablo A. Estévez and Marley M. B. R. Vellasco
10 Publication Spotlight, by Haibo He, Jon Garibaldi, Kay Chen Tan, Julian Togelius, Yaochu Jin, and Yew Soon Ong
13 Guest Editorial: Computational Intelligence in Finance and Economics, by Okan Duru, Robert Golan, and David Quintana
96 Conference Calendar, by Bernadette Bouchon-Meunier

Features
14 Market Model Benchmark Suite for Machine Learning Techniques, by Martin Prause and Jürgen Weigand
25 Intelligent Asset Allocation via Market Sentiment Views, by Frank Z. Xing, Erik Cambria, and Roy E. Welsch
35 An Accurate Lattice Model for Pricing Catastrophe Equity Put Under the Jump-Diffusion Process, by Chuan-Ju Wang and Tian-Shyr Dai
IEEE Computational Intelligence Magazine (ISSN 1556-603X) is published quarterly by The Institute of Electrical and Electronics Engineers, Inc. Headquarters: 3 Park Avenue, 17th Floor, New York, NY 10016-5997, U.S.A. +1 212 419 7900. Responsibility for the contents rests upon the authors and not upon the IEEE, the Society, or its members. The magazine is a membership benefit of the IEEE Computational Intelligence Society, and subscriptions are included in the Society fee. Replacement copies for members are available for US$20 (one copy only). Nonmembers can purchase individual copies for US$201.00. Nonmember subscription prices are available on request. Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limits of the U.S. Copyright law for private use of patrons: 1) those post-1977 articles that carry a code at the bottom of the first page, provided the per-copy fee is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01970, U.S.A.; and 2) pre-1978 articles without fee. For other copying, reprint, or republication permission, write to: Copyrights and Permissions Department, IEEE Service Center, 445 Hoes Lane, Piscataway, NJ 08854, U.S.A. Copyright © 2018 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved. Periodicals postage paid at New York, NY and at additional mailing offices. Postmaster: Send address changes to IEEE Computational Intelligence Magazine, IEEE, 445 Hoes Lane, Piscataway, NJ 08854-1331 U.S.A. PRINTED IN U.S.A. Canadian GST #125634188.
Digital Object Identifier 10.1109/MCI.2017.2770279
Columns
46 Application Notes: Fast Unsupervised Edge Detection Using Genetic Programming, by Wenlong Fu, Bing Xue, Mengjie Zhang, and Mark Johnston
59 Research Frontier: Cross-Validation for Imbalanced Datasets: Avoiding Overoptimistic and Overfitting Approaches, by Miriam Seoane Santos, Jastin Pompeu Soares, Pedro Henriques Abreu, Hélder Araújo, and João Santos
Visualizing the Evolution of Computer Programs for Genetic Programming, by Su Nguyen, Mengjie Zhang, Damminda Alahakoon, and Kay Chen Tan
NOVEMBER 2018 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 1
Market Model Benchmark Suite for
Machine Learning Techniques
Martin Prause and Jürgen Weigand
Institute for Industrial Organization, WHU—Otto Beisheim School of Management, Vallendar, Germany
Digital Object Identifier 10.1109/MCI.2018.2866726
Date of publication: 15 October 2018
Corresponding Author: Martin Prause ([email protected])
Abstract—Recent developments in deep-reinforcement learning have yielded promising results in artificial games and test domains. To explore opportunities and evaluate the performance of these machine learning techniques, various benchmark suites are available, such as the Arcade Learning Environment, rllab, OpenAI Gym, and the StarCraft II Learning Environment. This set of benchmark suites is extended with the open business simulation model described here, which helps to promote the use of machine learning techniques as value-adding tools in the context of strategic decision making and economic model calibration and harmonization. The benchmark suite extends the current state-of-the-art problems for deep-reinforcement learning by offering an infinite state and action space for multiple players in a non-zero-sum game environment of imperfect information. It provides a model that can be characterized as both a credit assignment problem and an optimization problem. Experiments with this suite's deep-reinforcement learning algorithms, which yield remarkable results for various artificial games, highlight that stylized market behavior can be replicated, but the infinite action space, simultaneous decision making, and imperfect information pose a computational challenge. With the directions provided, the benchmark suite can be used to explore new solutions in machine learning for strategic decision making and model calibration.
I. Introduction
In the last decade, the field of strategy has evolved from being an isolated top-management task to a necessary skill across all tiers of a firm [1]. Businesses identify risks and opportunities in dynamic markets in advance by establishing data-driven competitive intelligence processes [2]. One element in these processes is a business simulation, a risk-free tool to engage executives in scenario planning, strategy testing, and competitive analysis [3]. A business simulation executes a market model in which various players (firms) repeatedly make decisions. Their actions influence the modeled economy and the situation of other players. Ranging from educational usage [4], [5] to strategy development [6], business simulations help executives think strategically, perceive the interconnections across business functions, and foster critical thinking by bridging the gap between theory and practice [7]. The core of any business simulation is its economic model. The simulation reflects the interactions of the market players and helps identify patterns of activities. Model designs can be either top-down (analytical approach) or bottom-up (agent-based approach). They include (1) narrow equation-based oligopoly models [8], (2) comprehensive system-dynamics-based market models for analyzing product diffusion, marketing network effects, or supply chain dynamics [9], and (3) complex agent-based systems modeling competitive market environments with autonomous agents [10].
Like the homo economicus assumption of pure rationality vs. the bounded rational behavior of humans, economic models inherit the trade-off between simplicity and analytical tractability vs. complexity and descriptive accuracy [11]. Complex economic market models comprise non-linearities, stochastic dynamics, and non-trivial interaction structures [12], which impede the decision-making process for scenario planning and strategy testing. In such simulations, humans typically compete against each other to test specific strategies. More advanced simulations, so-called business war games, also include various stakeholder roles such as regulatory authorities, governments, or labor unions [13]. However, using these tools to identify multiple strategic scenarios can take days or weeks depending on their complexity and real-world implementation. Automating these processes using decision-making agents would yield significant business value. Furthermore, with the advent of a data-driven society [14], economists are increasingly utilizing big data to enrich their models [15], leading to the problems of calibration (fitting the model to empirical data) and harmonization (aligning the model to conform to theoretical assumptions and reflect real-world outcomes) [16]. In such a system of non-trivial interaction structures, a minor change in one variable can cause a cascade of effects, including the emergence of new relationships. Typically, calibration of agent-based models comprises three steps: simulating the model, measuring the quality of outcomes, and locating strong and weak levers [17]. Methodologically, manual calibration of parameters, Monte Carlo simulations, and data-centric optimization using evolutionary algorithms to mimic stylized micro- and macroeconomic facts are the most common, but also the most time-consuming, approaches [11]. Therefore, efficient tools are needed to investigate the emergent behavior [18].
These two challenges, decision making and calibration/harmonization, have been explored under the umbrella term of agent-based computational economics, wherein interacting autonomous agents are simulated to analyze the decision-making and learning processes in a tailored and stylized economic environment [19]. Popular methods for studying the learning behavior emerge from the field of computational intelligence for single- and multi-objective optimization [20], [21]. Building on these advances, recent developments in the field of AI combine optimization and, due to the advent of a data-driven society [22], machine learning for pattern recognition to build rational agents for perceiving the environment and taking actions [23], [24]. Instead of applying tailored algorithms to specific markets or social systems, AI research focuses on autonomous learning for general problem settings in complex environments. Recent developments in artificial gameplay, such as the game of Go [25], [26], Atari jump-and-run games [27], or the first-person 3D game Doom [28], which are based on deep-reinforcement learning, have yielded promising results for advancing decision making in complex environments. In this light, a full-fledged business simulation represents such a general setting because it combines all major business aspects, models a complex and dynamic environment, and is closely related to artificial games. Actors must make decisions in such a dynamic environment, and their decisions influence their opponents and affect the environment.

Combining deep-learning techniques with reinforcement learning has become a standard method for these problem sets. Several freely available benchmark suites, such as the Arcade Learning Environment [29], rllab [30], OpenAI Gym [31], TorchCraft [32], and the StarCraft II Learning Environment [33], have been published to train and compare these algorithms. Further application fields and research challenges of deep-reinforcement learning are mentioned in [34]. To support the shift from applying these algorithms in artificial games to the business context, and in response to the call in [34] for a reliable benchmark suite in the context of deep-reinforcement learning, this article presents an open benchmark suite of a holistic economic model for machine learning and AI algorithms in general. Following a discussion of the business applications of such a benchmark suite, a standard deep-reinforcement learning algorithm is presented to demonstrate the use of the model and highlight its challenges.
II. The Economic Model of the Benchmark Suite
The testbed is based on an economic market model. It does not provide a set of agents and their relationships as would a full-fledged agent-based model [35] but defines an agent's decision environment and her individual boundaries. The agent herself can be any algorithm. She can communicate indirectly with other agents within this model. In the following sections, we introduce the economic model in more detail. It may suffice to say that various model elements and parameters can be adjusted during the setup phase to modify the model according to the specific simulation focus.
A. The Firm's Decision Environment
In the default mode, the economic model emulates an oligopolistic market structure with a small number of firms, many consumers, and moderate barriers to entry and exit. The firms compete in multiple rounds and make simultaneous decisions on firm-specific and market-focused variables. According to the Structure–Conduct–Performance paradigm [36], such a market structure induces competitive pressure and strategic interaction among the firms (à la Cournot or Bertrand), leading to a balanced surplus for consumers and producers. From a theoretical perspective, four firms suffice to mimic a competitive environment [37], [38]. The upper limit of eight firms ensures that patterns of strategic interaction are still traceable and manageable if applied to human and artificial players. Each firm can choose to produce either a low-cost, low-price, high-volume (LLH) product, a high-cost, high-price, low-volume (HHL) product, or both, and sell these products in two separate, unrelated markets (domestic and international). We assume that the overall evolution of market demand for these products follows the traditional product life-cycle of introduction, growth, maturity, and decline [39]. Firm decisions are organized conceptually along Porter's value chain [40] and fall into two categories: internal and external decisions. External decisions, within the scope of product research and development (R&D) and marketing and sales, influence both market demand and competitors. They define the firm's business strategy, that is, its corporate and competitive action plan. Internal decisions, within the scope of operations, production, human resources, and finance, align the firm's internal organization and constitute the structure that supports the business strategy. Decisions are subject to trade-offs between flexibility and path dependency. Either decisions incur immediate expenditures, which have short-term effects (e.g., buying an additional facility), or they are sunk-cost- or people-related investments, giving rise to additional costs but with long-term benefits (e.g., economies of learning). All decision effects are subject to diminishing marginal returns (e.g., S-curve). Each firm is assumed to be a publicly traded company so that its financials serve as performance signals to the market. The default measure of a firm's market performance is its share price, primarily influenced by discounted cash flows, dividends, leverage, and brand value. In the default configuration, all firms start identical in terms of assets and financials and can sell their products directly in both markets.
B. Defining Business Strategy
A firm's business strategy embraces corporate and competitive strategy [41]. At the corporate level, firms have to define the scope of their activities in terms of geographic exposure (domestic or international) and degree of diversification (single product/market versus multiple products/markets). The combination of the available options creates a market arena tableau of four distinct non-cannibalizing markets (LLH product/national, LLH product/international, HHL product/national, HHL product/international). At the market level, firms have to decide on their competitive strategy, i.e., their market-specific strategic positioning as either price-/cost-leader, differentiator, or outpacer [40]. The actual positioning results from the relative value–price relationship in each particular market. The external decisions determine the potential demand of a firm and contribute to that of the market as a whole. The core drivers of firm-specific and total market demand are the firms' marketing efforts [42], their brand values, and macroeconomic factors. Marketing efforts are defined by the "four P's": (1) price, (2) promotion, or expenditures that increase total market demand via advertising, (3) placement, represented by the number of salespeople, which influences firm-specific demand, and (4) product value, which can be increased through R&D investments. The value/price ratio, promotion, and placement efforts, in conjunction with the brand value and macroeconomic factors, determine a firm's potential demand. Brand value is a cumulative measure reflecting consumer satisfaction. It represents first- and second-mover advantages/disadvantages of market entry and can be influenced by corporate identity expenditures, the firm's attractiveness, and its ability to live up to its promises.

This setup ensures that a firm can tap into all sources of revenue advantage, such as differentiation (value/price ratio), innovation (product value and brand value), and people (R&D staff and salespeople). Figure 1 summarizes the domain for external decisions that define a firm's business strategy.
C. Aligning the Organization
Internal firm decisions fall into two categories: operations and financing. The operations part consists of three pillars: (1) material sourcing, (2) product outsourcing, and (3) production. For each product, the material must be purchased. Starting with a one-to-one relationship of material-to-product, a firm can invest in materials development to reduce the amount of
FIGURE 1 The potential demand for firm x is influenced by the relative effects of the four P’s of its marketing, brand value, and macroeconomic factors. Dark rounded rectangles denote decisions.
material needed for production. The sourcing costs depend on macroeconomic factors. Inventory management is a key factor because keeping stock maintains flexibility but incurs high costs. While material sourcing is necessary to produce products in-house, outsourcing is an alternative. A firm can determine the number of units to be manufactured outside as well as the length of the outsourcing contract. Like materials and products made in-house, outsourced products are immediately available and can be delivered as ready-made products. The outsourcing costs are pegged to an external exchange rate, which varies according to macroeconomic factors. Thus, by selecting contract length, the firm may hedge against cost fluctuations. Finally, a firm must decide on the volume of production for its product/market portfolio. It can produce more or less than its potential demand in any market. Potential demand is not known ex ante because it is a function of all external decisions made simultaneously by all firms. Therefore, a firm must estimate its potential demand as implied by its business strategy. The sum of product inventory, outsourced products, and manufactured products serves to satisfy the potential demand. For the production decision, two independent capacity factors have to be taken into account: (1) the production capacity of the respective facility and (2) the production staff capacity. Facilities come in different sizes and production capacities. A firm can purchase and retire facilities to adjust its total capacity. It can also invest in infrastructure and flow optimization to increase capacity. One unit of capacity equals one unit of the low- or high-cost product. The second capacity factor is the production staff. Each blue-collar worker has a base productivity rate per round. The productivity rates differ by product and can be increased over time by investing in training and incentives. The productivity rate times the number of workers defines the total production staff capacity. This setup enables a firm to exploit different types of cost advantages, such as economies of scale (e.g., bulk purchase and mass production), scope (e.g., utilizing facilities for both products), and learning (e.g., productivity increases of production staff). The minimum of the facility capacity and production staff capacity defines the number of units that can be produced (Figure 2).
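The binding-constraint logic described above (producible units are the minimum of facility capacity and production staff capacity, where staff capacity is the productivity rate times the number of workers) can be sketched as follows; the function and parameter names are illustrative, not identifiers from the suite:

```python
def producible_units(facility_capacity, workers, productivity_per_worker):
    """Units a firm can manufacture this round: the binding constraint is
    the smaller of facility capacity and production staff capacity."""
    staff_capacity = workers * productivity_per_worker
    return min(facility_capacity, staff_capacity)

# Example: a facility rated for 120,000 units but only 500 workers at
# 200 units each means the staff capacity (100,000) binds.
```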
The final element of the internal perspective is corporate financing and shareholder wealth creation. A firm must decide how to finance its activities, e.g., out of cash flow or via short-term vs. long-term loans while managing interest and repayment trade-offs, and how much to pay out to shareholders in terms of dividends.
III. Characteristics and Applications of the Benchmark Suite
Games such as chess and Go entail perfect information. Therefore, theoretically, they can be solved by an exhaustive search of the full game tree. In practice, an exhaustive search is infeasible because of the large search space: approximately 35^80 for chess and 250^150 for Go [25]. The search space can be reduced, however, by limiting the state space and the action space and using metaheuristics to map between both [26]. For Atari jump-and-run games, the state space consists of multiple subsequent frames (pixel
FIGURE 2 Internal structure of a firm. Dark rounded rectangles denote decisions. The production quantity, material, outsourced quantity, and inventory are product specific and should be aligned with each other and with the potential demand.
screens), and the action space comprises commands such as moving the game figure right, left, up, or down. The real-time strategy game StarCraft II poses the next stage of challenging problems, with more than two players interacting, a more extensive state space and action space (approximately 10^8), imperfect information due to partially observed game states, and many thousands of frames of gameplay [33]. The proposed business simulation testbed pushes these boundaries further along the dimensions of game complexity, game dynamics, and game objectives. First, referring to complexity, there are four to eight players in the testbed, and the action space has become infeasible to address. Whereas actions in chess, Go, or arcade games are limited to one per round or frame, StarCraft II allows multiple actions at once, such as moving a group of units to a specific grid position. However, the number of sequential actions per minute is limited in the StarCraft II Learning Environment to account for a fair comparison with human players. In the business simulation, an agent can make up to 48 decisions per round. A decision can take any number within a specific range, such as for price (1500–2500), number of salespeople (0–999), or production volume (0–500,000). The median of the ranges is 1000, leading to 1000^48 possible decision combinations. While most of the actions in StarCraft II are sequential over a long period of time, the business simulation focuses on simultaneous decisions over a short period of time (a few rounds). Thus, the training data are very sparse compared to existing test problems. Second, with respect to the game dynamics, the external environment changes over time: Demand, costs, exchange rates, and other macroeconomic factors vary according to the industry life-cycle and customized market shocks. Finally, the game objectives differ from those of established games. Various rewards such as share price, market share, cash position, profit level, and survival of rounds can be used as performance measures, reflecting agents' extrinsic and intrinsic motivators [43]. Whereas StarCraft II is a zero-sum (win/lose) game, the business simulation need not be if agents have different objectives. The benchmark suite focuses on the environment and not on the agent. It is a testbed for algorithms to compare and replicate scenarios without programming language limitations. Therefore, the benchmark suite provides a representational state transfer (REST) application programming interface (API) to access the current game state, reward, and decision/action vectors using JavaScript Object Notation (JSON) as follows:

[{"SHAREPRICE":257, "TOTALREVENUES":257, "NETINCOME":13, …}, … {…}]
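Before a learning algorithm can consume such a game state, the JSON records are typically flattened into a fixed-order numeric vector. A minimal sketch, using only the three field names from the example above (the full field set and its ordering are assumptions, not the suite's documented schema):

```python
import json

# Illustrative subset of state fields taken from the JSON example.
STATE_FIELDS = ["SHAREPRICE", "TOTALREVENUES", "NETINCOME"]

def to_state_vector(payload):
    """Flatten a list of per-firm JSON records into one numeric vector,
    using a fixed field order so positions are stable across rounds."""
    records = json.loads(payload)
    return [float(rec.get(field, 0.0)) for rec in records for field in STATE_FIELDS]

vec = to_state_vector('[{"SHAREPRICE":257,"TOTALREVENUES":257,"NETINCOME":13}]')
```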
The game state is a representation of the environment consisting of (1) public information from other firms, such as balance sheet and cash flow statement data, (2) sales, revenues, and product values from each market and firm, (3) macroeconomic variables such as the GDP, exchange rate, and material costs, and (4) each firm's internal measures such as unit costs and inventory levels. The workflow for using the benchmark suite (Figure 3) consists of five main functions: (1) reset the game, define specific parameters (demand cap, minimum product value standards, production costs, etc.), and induce system shocks, using either artificial or real-world data (RESET), (2) obtain the upper and lower limits for each variable (LIMITS), (3) submit the decision vector and proceed to the next round (STEP), (4) obtain the game status (STATUS), and (5) simulate internal decisions (SIM).
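The five functions map directly onto HTTP GET calls. The following is a minimal Python sketch of one round against the REST API; the host, token, and reset parameter are placeholders, and the response format assumed for LIMITS (a JSON list of (lower, upper) pairs) is an illustration, not the suite's documented schema:

```python
import json
from urllib.request import urlopen  # stdlib HTTP client

BASE = "http://x.x.x.x"   # placeholder: the service's IP address
TOKEN = "demo-token"      # placeholder: a session token

def call(path):
    """GET a REST endpoint and decode its JSON payload."""
    with urlopen(f"{BASE}/{path}", timeout=30) as resp:
        return json.load(resp)

def midpoint_action(limits):
    """Compile a decision vector from (lower, upper) bounds by taking
    the midpoint of each range (a naive but always-feasible policy)."""
    return ",".join(str((lo + hi) // 2) for lo, hi in limits)

def play_one_round():
    state = call(f"reset/{TOKEN}/default")   # (1) RESET ("default" is a placeholder param)
    limits = call(f"limits/{TOKEN}")         # (2) LIMITS
    action = midpoint_action(limits)
    state = call(f"step/{TOKEN}/{action}")   # (3) STEP
    return call(f"status/{TOKEN}")           # (4) STATUS
```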
Business decisions can be decomposed into two distinct types: external decisions (which influence the market and competitors) and internal decisions (which do not interact with the environment). Once the business strategy is set, and an agent estimates the potential sales based on her own decisions and assumptions about other players, she needs to align the internal decisions. This setup reflects Alfred Chandler's paradigm of Structure follows Strategy. The choice of external decisions can be characterized as a pattern recognition problem of mapping states to actions. Aligning the internal decisions, however, is an optimization problem. When the potential sales (the result of predicting the effect of the external decisions) for each market are submitted along with the internal decisions of a firm (the simaction vector) and the firm number (x), the tool returns all internal measures such as the inventory, unit cost, cash flow, and balance sheet. In summary, the design of the benchmark suite helps to develop and evaluate general machine learning techniques for decision making and calibration and combines the learning tasks of exploration and exploitation (external decisions) with multivariable optimization (internal decisions).
IV. The Challenges of Reinforcement Learning in Business Simulations
Pattern recognition algorithms can be categorized broadly as unsupervised, supervised, or reinforcement learning [44]. As data labels in the business simulation are sparse and time-delayed, the problem is one of reinforcement learning. This approach assumes an agent situated in an environment. She observes the environment state (s; e.g., a JSON object), takes an action (a; e.g., a JSON object), and receives a reward (r, determined by the agent; e.g., the number of rounds survived, profit after round t, etc.) and a new state (s'; e.g., a JSON object) from the environment: (s, a) → (r, s'). Reinforcement learning seeks a sequence of actions, called a policy, to maximize the total reward [45]. It solves the credit assignment problem by rewarding preceding actions that contribute to the final outcome. From the agent's perspective, the environment is non-deterministic; the total reward R_t at time t is the sum of the current reward r_t and the discounted future rewards γR_{t+1}, with γ ∈ [0, 1], when a specific action is pursued: R_t = r_t + γR_{t+1}. γ is called the discount factor, with a value close to 1 to consider future discounted cumulative rewards. Following [46], a function Q(s_t, a_t) can be defined that estimates the total reward for a given state s and action a. An optimal choice is an action that maximizes the discounted future rewards: Q(s_t, a_t) = max R_{t+1}. Assuming such a Q-function exists, an optimal policy chooses the action with the highest Q-value: π(s) = argmax_a Q(s, a). Rewriting this formula gives the Bellman equation, which can be used to iteratively approximate the Q-function. Lin [47] proves that this iteration converges if the states are finite: Q(s, a) = r + γ max_a' Q(s', a'). Following the suggestions in [27], a neural network with multiple hidden layers is used to approximate the Q-function. The input layer is the game state vector; the output layer is the Q-values for any possible action. The network is initialized randomly, and the regression task is optimized using the least-squares error method and stochastic gradient descent.
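The Bellman iteration converges for finite state spaces; a toy illustration on a four-state chain (not the business simulation itself, where states are continuous) can make the update rule concrete. The MDP and all parameter values here are invented for demonstration:

```python
import random

# Toy deterministic MDP: states 0..3, actions 0 (left) / 1 (right);
# reaching state 3 pays reward 1 and ends the episode.
N_STATES, N_ACTIONS, GAMMA, ALPHA = 4, 2, 0.9, 0.5

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
random.seed(0)
for _ in range(500):                        # episodes
    s, done = 0, False
    while not done:
        a = random.randrange(N_ACTIONS)     # pure exploration suffices here
        s_next, r, done = step(s, a)
        target = r + GAMMA * max(Q[s_next])     # Bellman target r + γ max_a' Q(s', a')
        Q[s][a] += ALPHA * (target - Q[s][a])   # move the estimate toward the target
        s = s_next

# With γ = 0.9, moving right from state 0 is worth γ² · 1 ≈ 0.81 at the fixed point.
```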
Unfortunately, the game states are not finite in a business simulation because the domain is of real numbers. Thus, convergence can no longer be guaranteed. Several countermeasures are proposed to stabilize the process, such as experience replay [47], the ε-greedy policy [27], or target networks [48]. In experience replay, state–action transitions are stored in memory from a knowledge base, random generation, or preceding training tasks; the neural network is trained by mini-batches from this memory. The ε-greedy policy prevents the algorithm from always picking the action with the highest Q-value and thus getting stuck in a local optimum: With some probability, a random action is chosen instead of the action with the highest Q-value. A typical implementation follows the policy that over time, the algorithm first explores and later becomes increasingly greedy; this is reflected by the factor λ, which controls the speed of decay of exploration: ε = ε_min + (ε_max − ε_min)·e^(−λt). Target networks are used to stabilize the training process. For any forward pass of samples regarding a specific action, all Q-values associated with similar actions will also be affected, and the neural network will never stabilize. To avoid this behavior, every x steps the weights of the training network are copied to a target network, and predictions of Q-values (used
[Figure 3 workflow: setup and customize the testbed (economic model); get decision limits and current game state; agent compiles the action/decision vector; optionally simulate the effect of internal decisions; submit the action vector and move to the next round; repeat until the end of the game.]

REST API commands (pseudocode):
RESET:  game_state <- GET x.x.x.x/reset/token/param
LIMITS: boundaries <- GET x.x.x.x/limits/token
STATUS: status     <- GET x.x.x.x/status/token
SIM:    sim_state  <- GET x.x.x.x/sim/token/x/simaction
STEP:   game_state <- GET x.x.x.x/step/token/action
FIGURE 3 The general workflow for using the benchmark suite is depicted on the left. The right part of the figure shows the corresponding REST API commands in pseudocode. x.x.x.x denotes the service's IP address.¹ The dark rounded rectangle highlights the external part of the suite, the algorithm or agent.
1 Online Resource: Access to the public free service is available at www.stratchal.com/
demo
NOVEMBER 2018 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 21
in the error function for the training network)
are simply taken from the target network. This
slows the learning process but stabilizes the
Q-values. Various other stabilization and
exploration–exploitation techniques are used
in [26], [48], and [49] for demonstration pur-
poses; these three are the most prominent
ones. A general algorithm is given in Figure 4.
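A minimal sketch of the two stabilization mechanisms, a bounded replay memory and a periodically synchronized target network, might look as follows; the "weights" are a plain list here, purely for illustration:

```python
import random
from collections import deque

class ReplayMemory:
    """Bounded memory of state-action transitions for experience replay."""

    def __init__(self, capacity):
        # When full, appending automatically discards the oldest entry.
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        # transition = (state, action, reward, next_state)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Draw a random mini-batch for training the network.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def sync_target(training_weights):
    """Copy the training network's weights to the target network."""
    return list(training_weights)
```

Training targets are then computed from the copied (frozen) weights until the next synchronization step, which is what slows but stabilizes the learning process.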
The major challenges in this task are the large
state and action space and the distinction between external
variables and internal optimization. Therefore, three experi-
ments are presented to highlight the challenges and to demon-
strate the use of the benchmark suite. The first experiment
showcases a simple Cournot oligopoly, looking only at the
external decisions. The second mimics a cobweb model and
also incorporates the internal decisions, but without differenti-
ating the learning approach. The third experiment advances
the second one by applying different learning approaches to
external and internal decisions.
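Driving the suite over its REST interface (Figure 3) might look roughly like the following Python sketch; the host address, token, and parameter strings are placeholders, and only the URL pattern is taken from the figure's pseudocode:

```python
import urllib.request

BASE = "http://x.x.x.x"  # the service's IP address (placeholder)

def endpoint(command, token, *args):
    """Build a benchmark-suite URL such as /reset/token/param or /step/token/action."""
    return "/".join([BASE, command, token] + list(args))

def call(url):
    """Issue the GET request and return the raw response body (network call)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

# Workflow: reset, query limits and status, optionally simulate, then step.
# game_state = call(endpoint("reset", TOKEN, PARAM))
# boundaries = call(endpoint("limits", TOKEN))
# status     = call(endpoint("status", TOKEN))
# game_state = call(endpoint("step", TOKEN, action_vector))
```

Because the interface is plain HTTP GET, the same loop can be written in any language the agent happens to use.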
In the first experiment, the state space consisted of 21 vari-
ables: potential sales, actual sales, cash position, balance sheet
total, and share price for each of the four players, plus the
domestic market’s GDP as a hint for the industry cycle. The
action vector was an 8-bit string, resulting in 256 different
actions. Two bits signaled a price increase of 50 (01₂), a price
decrease of 50 (10₂), or no price change (11₂ or 00₂) for a
single player. Production volumes, material procurement, and
production workers were automatically adjusted at a fixed rate,
ignoring any internal complexity. All other variables were set to
default values and left unchanged. The reward, the time-delayed
data label, was the number of rounds
all firms survived. Although different
rewards could have been imposed to
represent different objectives for individ-
ual firms, all firms shared the same
objective in order to compare the results
with other empirical findings in [8].
Once a firm goes bankrupt, the game
restarts. The industry cycle takes four
stages. Each stage lasts for 8 rounds, and
the cycle was repeated 25 times. Further
parameters, whose values were chosen to
optimize the trade-off between comput-
ing time and memory capacity, are (1)
size of the replay memory, 1000; (2)
mini-batch size, 64; (3) target network
updates, 100; and (4) the neural network
structure, two hidden dense layers (first,
512 neurons; second, 256 neurons); hid-
den layer activation functions, rectified
linear; output activation function, soft-
max; input layer, 21 neurons; and output
layer, 256 neurons. The ε-greedy policy
parameters were the standard lower and upper probability limits,
ε_min = 0.01 and ε_max = 0.99, with a slow convergence rate of
λ = 0.001; the discount factor γ = 0.99 was chosen according to
the literature [50]. Figure 5 shows the results of 1100 training runs. The
overall firm survival rate increased from 18 to a maximum of
32 rounds (left image). The decisions in the last 200 rounds
(right image) demonstrate that the algorithm learned to reduce
the price in times of industry recession (cycles a, b, c, and g) and
to increase the price in times of growth (cycles a, b, and c). This
collusive behavior aligns with the findings in [8, p. 3287],
wherein the authors applied a Q-learning algorithm to a simple
oligopoly model and concluded, “Q-learning firms generally
learn to collude with each other in Cournot oligopoly games,
although full collusion usually does not emerge, that is, firms
usually do not learn to make the highest possible joint profit.”
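The 8-bit action encoding described above can be illustrated with a short decoder; the helper name and layout are hypothetical, only the bit semantics are taken from the text:

```python
# Hypothetical decoder for the first experiment's 8-bit action string:
# two bits per player, 01 -> price +50, 10 -> price -50, 11 or 00 -> no change.
def decode_price_actions(bits):
    """Map an 8-character bit string to one price delta per player."""
    assert len(bits) == 8  # 2**8 = 256 possible actions
    deltas = {"01": 50, "10": -50, "11": 0, "00": 0}
    return [deltas[bits[i:i + 2]] for i in range(0, 8, 2)]
```

For example, `decode_price_actions("01100011")` yields a +50 price move for player 1, a −50 move for player 2, and no change for players 3 and 4.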
The second experiment incorporated price, sales, and basic
internal production decisions. Each decision was represented by
a single bit: (1) a price increase (1) or decrease (0) of 50, (2) an
increase (1) or decrease (0) of 40 in the number of salespeople,
and (3) an increase of 5000 units in the production quantity and
of 70 people in the production staff (1) or a corresponding
decrease (0). The resulting 12-bit string creates an action space
Initialize replay memory (M) with random transitions
Initialize the training neural network (Q) with random weights
Copy the weights of Q to a target network (T)
s <- GET x.x.x.x/reset/token/param
while (continue):
    Select action a (external variables) at state s with ε-greedy method
    Create action vector v from a and the internal aligned variables
    s* <- GET x.x.x.x/step/token/v
    reward <- GET x.x.x.x/status/token
    Store transition <s, v, reward, s*> in M; if M is full, discard the oldest entry
    Sample mini-batches from M:
        Train Q
    Every k steps, copy weights from Q to T
    s <- s*

FIGURE 4 Pseudocode of the reinforcement learning algorithm (with experience replay, ε-greedy policy, and target networks) using the benchmark suite; x.x.x.x, the service's IP address.
The presented business simulation provides a testbed for machine learning algorithms on a problem domain with a continuous state and action space, in a competitive, non-zero-sum game of imperfect information with time constraints.
of 4096 combinations. The underlying associated economic
model is the cobweb model in which production quantities are
chosen before the market price is observed [51]. The defined
reinforcement learning approach is best suited for credit assign-
ment tasks. Therefore, it is also applied to the second experi-
ment without any structural adjustments. Internal and external
decisions are treated with the same learning approach, despite
the fact that the testbed poses a combined credit-assignment
and optimization problem. The results (Figure 7) show a slower
learning rate within 1100 training steps (from 8 to 14) than in
the first setup due to the larger action space. However, qualita-
tively, two different scenarios can be identified (Figure 6).
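Similarly, the second experiment's 12-bit string (three decision bits per firm) might be decoded as follows; the helper is hypothetical, with only the bit semantics taken from the text:

```python
# Hypothetical decoder for the second experiment's 12-bit action string:
# three bits per firm -- price +/-50, salespeople +/-40, and production
# quantity +/-5000 units with production staff +/-70.
def decode_firm(bits3):
    """Map one firm's three decision bits to the corresponding changes."""
    price = 50 if bits3[0] == "1" else -50
    salespeople = 40 if bits3[1] == "1" else -40
    quantity, staff = (5000, 70) if bits3[2] == "1" else (-5000, -70)
    return {"price": price, "salespeople": salespeople,
            "quantity": quantity, "staff": staff}

def decode_actions(bits12):
    """Decode all four firms' decisions from the 12-bit action string."""
    assert len(bits12) == 12  # 2**12 = 4096 possible actions
    return [decode_firm(bits12[i:i + 3]) for i in range(0, 12, 3)]
```

The jump from 256 to 4096 actions is exactly the growth in action space that slows learning in this experiment.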
[Figure 6: two panels; (a) stacked price level versus runs, (b) stacked salespeople level versus runs, each showing Firm 1 through Firm 4 across lifecycles aL to dL and aR to dR.]

FIGURE 6 Second experiment. (a) The stacked prices, with firm 1 at the bottom and firm 4 on top. (b) The stacked salespeople decisions. The Red Queen effect is observed in cycles aR and dR. Lifecycles aL to dL and aR to dR in Figure 6 correspond to lifecycles a to d in Figure 5.
[Figure 5: two panels; (a) number of rounds survived versus training steps (0 to 1,000) with a regression line, (b) stacked price level versus runs for Firm 1 through Firm 4 across lifecycles a to g.]

FIGURE 5 First experiment. (a) The average number of rounds survived for all firms by manipulating the price and neglecting quantity decisions. The red line depicts the regression line. (b) Price decisions for the last 200 runs. Prices are stacked to highlight their similarity, with firm 1 at the bottom and firm 4 on top. Each vertical dotted line signals the beginning of a new industry lifecycle (a to g).
(1) During recessions, expenditures for sales-
people are used to mitigate a potential price
decrease: In rounds 71–76 and 86–91, the
number of salespeople increased as the price
dropped. (2) Once firms spend on more sales-
people, an arms race begins (as suggested by the
Red Queen principle), and firms hire even
more, especially in growth phases (rounds
16–21 and 106–111) when prices increase.
Due to the option to set internal decisions, the
algorithm was able to adjust better to industry life-cycle
fluctuations than in the first experiment (compare cycles a, b,
and c in Figure 5 with cycles aL, bL, and cL in Figure 6),
which mimicked the behavior in a cobweb model [51]. However, with this
algorithm, the performance (survival rate) was significantly
lower, which highlights the influence of the complexity of the
action space on learning speed.
The third experiment advances the second one by distin-
guishing between learning approaches for external and
internal variables. The explicit distinction between external
and internal variables provides the algorithm with domain
knowledge. Instead of changing the production staff at a
fixed rate, an evolutionary algorithm based on the approach
in [51] is implemented to determine the optimal num-
ber of workers. The inventory, the previous sales volume,
and the current change in production volume are used to
estimate the potential demand of the upcoming round, which
is passed to the SIM command. The cash position returned by the SIM
command is used as the fitness measure in the evolution-
ary algorithm. Figure 7 shows that this leads to a better
per formance due to a less complex action space and the
im ple menta tion of domain knowledge (separation of exter-
nal and internal decisions).
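The evolutionary search over the number of workers might be sketched as follows; the fitness function here is a stand-in for the cash position returned by the SIM command, and all names and parameter values are illustrative:

```python
import random

def evolve_workers(fitness, pop_size=20, generations=30,
                   lo=0, hi=2000, mut=50, seed=0):
    """Simple (mu + lambda)-style search for the worker count maximizing fitness.

    `fitness` stands in for the cash position returned by the SIM command.
    """
    rng = random.Random(seed)
    # Initial population: random worker counts within the allowed range.
    pop = [rng.randint(lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half as parents, refill with mutated copies.
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]
        children = [min(hi, max(lo, p + rng.randint(-mut, mut)))
                    for p in parents]
        pop = parents + children
    return max(pop, key=fitness)
```

Because the parents are retained each generation, the best candidate never degrades, which suits the noisy, round-by-round fitness signal a simulation call would return.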
The first two experiments demonstrate that the testbed can
be used to replicate economic models and generate economic
behavior patterns in agent-based computational economics [8],
[51]. The benchmark suite improves upon the existing approach-
es by providing decisions across all areas of Porter’s value chain in
a standard setting that can be customized and, due to the REST
implementation, can be accessed by all types of agents (algo-
rithms) regardless of their underlying programming structure and
language. The experiments also demonstrate that a trivial
implementation of state-of-the-art reinforcement learning algorithms
is capable of simulating real market behavior yet explodes in
computational time and space [48] for any continuous action
space. Further algorithmic improvement techniques such as
automatic network structure determination [52] or hyperparam-
eter optimization [53] improve the performance, but the main
problem is the combination of simultaneous decision making
and infinite action space in conjunction with the optimization of
internal variables. This calls for other approaches, such as
probability distributions over actions [33] or the actor-critic approach with
[Figure 7: two panels; each plots the number of rounds survived (0 to 35) versus training steps (900 to 1,100), with a regression line.]

FIGURE 7 Second and third experiment: the average number of rounds survived for all four firms in the last 200 training steps. The average number of rounds all firms survived in the third experiment is significantly higher (right panel, 17 rounds) than in the second experiment (left panel, 13 rounds). The red line depicts the regression line over all training steps.
a discrete [54] or continuous [48], [55] action space. The third
experiment applies the testbed to the joint solution of the credit
assignment and optimization problem. Moreover, the model can
be used to test algorithms to autonomously (without domain
knowledge) differentiate between different learning styles.
Machine learning approaches alone may not be sufficient
because the business simulation is hard to predict, is not
self-contained, and focuses on simultaneous decisions over a short
period rather than sequential decisions over a long period. Therefore,
the authors of [23] suggest that major breakthroughs might
utilize hybrid techniques combining deep learning with reason-
ing, in this case, economic reasoning.
V. Conclusion
This article has presented a benchmark suite for machine learn-
ing algorithms for strategic decision making in a business con-
text. This tool extends the current set of training environments in
an artificial context, such as rllab, OpenAI Gym, or StarCraft II,
by providing a dynamic multi-market, multi-product environ-
ment with few producers and many consumers for applications
of strategic planning. It extends the current state-of-the-art prob-
lems for reinforcement learning by allowing a continuous state
and action space in a non-zero-sum game of imperfect infor-
mation. Three reinforcement-learning approaches using deep
Q-learning with experience replay, e-greedy policy, and target
networks for stabilization were used to demonstrate price-setting
and quantity-setting behaviors of four firms. The results show
that the decisions made by the algorithms align with expected
outcomes in such oligopolistic markets.
References
[1] J. P. Kotter, Accelerate: Building Strategic Agility for a Faster-Moving World. Boston, MA,
USA: Harvard Business Review Press, 2014.
[2] B. Gilad, “‘Competitive intelligence’ shouldn’t just be about your competitors,” Harv.
Bus. Rev., 2015, May 18. [Online]. Available: https://hbr.org/2015/05/competitive-intelligence-
shouldnt-just-be-about-your-competitors. Accessed on: August 27, 2018.
[3] M. Reeves and G. Wittenburg, “Games can make you a better strategist,” Harv. Bus.
Rev., 2015, Sept. 7. [Online]. Available: https://hbr.org/2015/09/games-can-make-you-
a-better-strategist. Accessed on: August 27, 2018.
[4] T. M. Connolly et al., “A systematic literature review of empirical evidence on com-
puter games and serious games,” Comput. Educ., vol. 59, no. 2, pp. 661–686, Sept. 2012.
[5] E. A. Boyle et al., “An update to the systematic literature review of empirical evidence
of the impacts and outcomes of computer games and serious games,” Comput. Educ., vol.
94, pp. 178–192, Mar. 2016.
[6] J. P. Davis et al., “Developing theory through simulation methods,” Acad. Manage.
Rev., vol. 32, no. 2, pp. 480–499, Apr. 2007.
[7] R. Bell and M. Loon, “Reprint: The impact of critical thinking disposition on learning
using business simulations,” Int. J. Manage. Educ., vol. 13, no. 3, pp. 362–370, Nov. 2015.
[8] L. Waltman and U. Kaymak, “Q-learning agents in a Cournot oligopoly model,” J.
Econ. Dyn. Control, vol. 32, no. 10, pp. 3275–3293, Oct. 2008.
[9] J. D. Sterman, Business Dynamics: Systems Thinking and Modeling for a Complex World.
Boston, MA, USA: McGraw-Hill, 2000.
[10] K. Warren, Competitive Strategy Dynamics. West Sussex, England: Wiley, 2002.
[11] G. Fagiolo et al., “A critical guide to empirical validation of agent-based models in
economics: Methodologies, procedures, and open problems,” Comput. Econ., vol. 30, no.
3, pp. 195–226, Oct. 2007.
[12] H. Rahmandad and J. Sterman, “Heterogeneity and network structure in the dynam-
ics of diffusion: Comparing agent-based and differential equation models,” Manage. Sci.,
vol. 54, no. 5, pp. 998–1014, May 2008.
[13] M. L. Herman et al., Wargaming for Leaders: Strategic Decision Making from the Battlefield
to the Boardroom. New York, NY, USA: McGraw-Hill, 2009.
[14] A. McAfee and E. Brynjolfsson, “Big data: The management revolution,” Harv. Bus.
Rev., vol. 90, no. 10, pp. 60–68, Oct. 2012.
[15] H. R. Varian, “Big data: New tricks for econometrics,” J. Econ. Perspectives, vol. 28,
no. 2, pp. 3–28, May 2014.
[16] R. Garcia et al., “Validating agent-based marketing models through conjoint analy-
sis,” J. Bus. Res., vol. 60, no. 8, pp. 848–857, Aug. 2007.
[17] J. Grazzini et al., “Bayesian estimation of agent-based models,” J. Econ. Dyn. Control,
vol. 77, pp. 26–47, Apr. 2017.
[18] J.-S. Lee et al., “The complexities of agent-based modeling output analysis,” J. Artif.
Soc. Social Simul., vol. 18, no. 4, pp. 1–26, Oct. 2015.
[19] L. Tesfatsion, “Agent-based computational economics: Modeling economies as com-
plex adaptive systems,” Inform. Sci., vol. 149, no. 4, pp. 262–268, Feb. 2003.
[20] H. Dawid and M. Kopel, “On economic applications of the genetic algorithm: A
model of the cobweb type,” J. Evol. Econ., vol. 8, no. 3, pp. 297–315, Sept. 1998.
[21] M. Barbati et al., “Applications of agent-based models for optimization problems: A
literature review,” Expert Syst. Appl., vol. 39, no. 5, pp. 6020–6028, Apr. 2012.
[22] E. Brynjolfsson and A. McAfee, The Second Machine Age. New York, NY, USA:
Norton, 2014.
[23] D. C. Parkes and M. P. Wellman, “Economic reasoning and artif icial intelligence,”
Science, vol. 349, no. 6245, pp. 267–272, July 2015.
[24] Y. LeCun et al., “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
[25] D. Silver et al., “Mastering the game of Go with deep neural networks and tree
search,” Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[26] D. Silver et al., “Mastering the game of Go without human knowledge,” Nature, vol.
550, no. 7676, pp. 354–359, Oct. 2017.
[27] V. Mnih et al., “Playing Atari with deep reinforcement learning,” arXiv Preprint,
arXiv:1312.5602 [cs.LG], Dec. 2013.
[28] D. S. Ratcliffe et al., “Clyde: A deep reinforcement learning DOOM playing agent,”
in Proc. Workshops 31st Conf. Association Advancement Artificial Intelligence, San Francisco,
CA, USA, 2017, pp. 983–990.
[29] M. G. Bellemare et al., “The arcade learning environment: An evaluation platform
for general agents,” in Proc. 24th Int. Joint Conf. Artificial Intelligence, Buenos Aires, Argen-
tina, 2015, pp. 4148–4152.
[30] Y. Duan et al., “Benchmarking deep reinforcement learning for continuous control,”
in Proc. 33rd Int. Conf. Machine Learning, New York, NY, 2016, pp. 1329–1338.
[31] G. Brockman et al.,“OpenAI Gym,” arXiv Preprint, arXiv:1606.01540v1 [cs.LG],
June 2016.
[32] G. Synnaeve et al., “TorchCraft: A library for machine learning research on real-time
strategy games,” arXiv Preprint, arXiv:1611.00625v2 [cs.LG], Nov. 2016.
[33] O. Vinyals et al., “StarCraft II: A new challenge for reinforcement learning,” arXiv
Preprint, arXiv:1708.04782v1 [cs.LG], Aug. 2017.
[34] K. Arulkumaran et al., “Deep reinforcement learning: A brief survey,” IEEE Signal
Process. Mag., vol. 34, no. 6, pp. 26–38, Nov. 2017.
[35] C. M. Macal and M. J. North, “Tutorial on agent-based modelling and simulation,”
J. Simul., vol. 4, no. 3, pp. 151–162, Sept. 2010.
[36] J. S. Bain, Industrial Organization. New York, NY, USA: Wiley, 1968.
[37] S. Huck et al., “Two are few and four are many: Number effects in experimental
oligopolies,” J. Econ. Behav. Org., vol. 53, no. 4, pp. 435–446, Apr. 2004.
[38] I. Topolyan, “Price competition when three are few and four are many,” Int. J. Ind.
Org., vol. 54, pp. 175–191, Sept. 2017.
[39] F. M. Bass, “A new product growth for model consumer durables,” Manage. Sci., vol.
15, no. 5, pp. 215–227, Jan. 1969.
[40] M. E. Porter, Competitive Strategy: Techniques for Analyzing Industries and Competitors.
New York, NY, USA: Free Press, 2004.
[41] C. Hill, International Business: Competing in the Global Marketplace, 8th ed. New York,
NY, USA: McGraw-Hill, 2011.
[42] P. Kotler and K. L. Keller, Marketing Management. Upper Saddle River, NJ, USA:
Prentice Hall, 2006.
[43] S. Singh et al., “Intrinsically motivated reinforcement learning: An evolutionary
perspective,” IEEE Trans. Auton. Mental Develop., vol. 2, no. 2, pp. 70–82, June 2010.
[44] T. Hastie et al., The Elements of Statistical Learning. New York, NY, USA: Springer,
2009.
[45] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA,
USA: MIT Press, 1998.
[46] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Mach. Learn., vol. 8, no. 3–4, pp.
279–292, May 1992.
[47] L.-J. Lin, “Reinforcement learning for robots using neural networks,” Ph.D. disserta-
tion, School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, 1992.
[48] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” arXiv
Preprint, arXiv:1509.02971 [cs.LG], Sept. 2016.
[49] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature,
vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[50] S. Mahadevan, “To discount or not to discount in reinforcement learning: A case
study comparing R learning and Q learning,” in Proc. 11th Int. Conf. Machine Learning,
New Brunswick, NJ, 1994, pp. 164–172.
[51] J. Arifovic, “Genetic algorithm learning and the cobweb model,” J. Econ. Dyn. Con-
trol, vol. 18, no. 1, pp. 3–28, Jan. 1994.
[52] M. Suganuma et al., “A genetic programming approach to designing convolutional
neural network architectures,” arXiv Preprint, arXiv:1704.00764v2 [cs.NE], Aug. 2017.
[53] J. Bergstra et al., “Algorithms for hyper-parameter optimization,” in Proc. 24th Int.
Conf. Neural Information Processing Systems, Granada, Spain, 2011, pp. 2546–2554.
[54] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in Proc.
33rd Int. Conf. Machine Learning, New York, NY, 2016, vol. 48, pp. 1928–1937.
[55] E. Di Mario et al., “A comparison of PSO and reinforcement learning for multi-robot
obstacle avoidance,” in Proc. IEEE Congr. Evolutionary Computation, Cancun, Mexico, 2013,
pp. 149–156.