IEEE Computational Intelligence Magazine, Volume 13, Number 4, November 2018 (www.ieee-cis.org)
Departments
2 Editor's Remarks
3 President's Message, by Nikhil R. Pal
5 Society Briefs: Technical Books in Computational Intelligence, by Vaishali Damle and Jeanne Audino
7 Conference Reports: Conference Report on the 2018 IEEE World Congress on Computational Intelligence (IEEE WCCI 2018), by Pablo A. Estévez and Marley M. B. R. Vellasco
10 Publication Spotlight, by Haibo He, Jon Garibaldi, Kay Chen Tan, Julian Togelius, Yaochu Jin, and Yew Soon Ong
13 Guest Editorial: Computational Intelligence in Finance and Economics, by Okan Duru, Robert Golan, and David Quintana
96 Conference Calendar, by Bernadette Bouchon-Meunier

Features
14 Market Model Benchmark Suite for Machine Learning Techniques, by Martin Prause and Jürgen Weigand
25 Intelligent Asset Allocation via Market Sentiment Views, by Frank Z. Xing, Erik Cambria, and Roy E. Welsch
35 An Accurate Lattice Model for Pricing Catastrophe Equity Put Under the Jump-Diffusion Process, by Chuan-Ju Wang and Tian-Shyr Dai
IEEE Computational Intelligence Magazine (ISSN 1556-603X) is published quarterly by The Institute of Electrical and Electronics Engineers, Inc. Headquarters: 3 Park Avenue, 17th Floor, New York, NY 10016-5997, U.S.A. +1 212 419 7900. Responsibility for the contents rests upon the authors and not upon the IEEE, the Society, or its members. The magazine is a membership benefit of the IEEE Computational Intelligence Society, and subscriptions are included in the Society fee. Replacement copies for members are available for US$20 (one copy only). Nonmembers can purchase individual copies for US$201.00. Nonmember subscription prices are available on request. Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limits of the U.S. Copyright law for private use of patrons: 1) those post-1977 articles that carry a code at the bottom of the first page, provided the per-copy fee is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01970, U.S.A.; and 2) pre-1978 articles without fee. For other copying, reprint, or republication permission, write to: Copyrights and Permissions Department, IEEE Service Center, 445 Hoes Lane, Piscataway, NJ 08854, U.S.A. Copyright © 2018 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved. Periodicals postage paid at New York, NY and at additional mailing offices. Postmaster: Send address changes to IEEE Computational Intelligence Magazine, IEEE, 445 Hoes Lane, Piscataway, NJ 08854-1331 U.S.A. PRINTED IN U.S.A. Canadian GST #125634188.
Digital Object Identifier 10.1109/MCI.2017.2770279
Columns
46 Application Notes: Fast Unsupervised Edge Detection Using Genetic Programming, by Wenlong Fu, Bing Xue, Mengjie Zhang, and Mark Johnston
59 Research Frontier: Cross-Validation for Imbalanced Datasets: Avoiding Overoptimistic and Overfitting Approaches, by Miriam Seoane Santos, Jastin Pompeu Soares, Pedro Henriques Abreu, Hélder Araújo, and João Santos
Visualizing the Evolution of Computer Programs for Genetic Programming, by Su Nguyen, Mengjie Zhang, Damminda Alahakoon, and Kay Chen Tan
NOVEMBER 2018 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 1
Market Model Benchmark Suite for
Machine Learning Techniques
Martin Prause and Jürgen Weigand
Institute for Industrial Organization, WHU—Otto Beisheim School of Management, Vallendar, Germany
Digital Object Identifier 10.1109/MCI.2018.2866726
Date of publication: 15 October 2018
Corresponding Author: Martin Prause ([email protected])
Abstract—Recent developments in deep-reinforcement learning have yielded promising results in artificial games and test domains. To explore opportunities and evaluate the performance of these machine learning techniques, various benchmark suites are available, such as the Arcade Learning Environment, rllab, OpenAI Gym, and the StarCraft II Learning Environment. This set of benchmark suites is extended with the open business simulation model described here, which helps to promote the use of machine learning techniques as value-adding tools in the context of strategic decision making and economic model calibration and harmonization. The benchmark suite extends the current state-of-the-art problems for deep-reinforcement learning by offering an infinite state and action space for multiple players in a non-zero-sum game environment of imperfect information. It provides a model that can be characterized as both a credit assignment problem and an optimization problem. Experiments with this suite's deep-reinforcement learning algorithms, which yield remarkable results for various artificial games, highlight that stylized market behavior can be replicated, but the infinite action space, simultaneous decision making, and imperfect information pose a computational challenge. With the directions provided, the benchmark suite can be used to explore new solutions in machine learning for strategic decision making and model calibration.
I. Introduction
In the last decade, the field of strategy has evolved from being an isolated top-management task to a necessary skill across all tiers of a firm [1]. Businesses identify risks and opportunities in dynamic markets in advance by establishing data-driven competitive intelligence processes [2]. One element in these processes is a business simulation, a risk-free tool to engage executives in scenario planning, strategy testing, and competitive analysis [3]. A business simulation executes a market model in which various players (firms) repeatedly make decisions. Their actions influence the modeled economy and the situation of other players. Ranging from educational usage [4], [5] to strategy development [6], business simulations help executives think strategically, perceive the interconnections across business functions, and foster critical thinking by bridging the gap between theory and practice [7]. The core of any business simulation is its economic model. The simulation reflects the interactions of the market players and helps identify patterns of activities. Model designs can be either top-down (analytical approach) or bottom-up (agent-based approach). They include (1) narrow equation-based oligopoly models [8], (2) comprehensive system-dynamics-based market models for analyzing product diffusion, marketing network effects, or supply chain dynamics [9], and (3) complex agent-based systems modeling competitive market environments with autonomous agents [10].
Like the homo economicus assumption of pure rationality vs. the bounded rational behavior of humans, economic models inherit the trade-off between simplicity and analytical tractability vs. complexity and descriptive accuracy [11]. Complex economic market models comprise non-linearities, stochastic dynamics, and non-trivial interaction structures [12], which impede the decision-making process for scenario planning and strategy testing. In such simulations, humans typically compete against each other to test specific strategies. More advanced simulations, so-called business war games, also include various stakeholder roles such as regulatory authorities, governments, or labor unions [13]. However, using these tools to identify multiple strategic scenarios can take days or weeks depending on their complexity and real-world implementation. Automating these processes using decision-making agents would yield significant business value. Furthermore, with the advent of a data-driven society [14], economists are increasingly utilizing big data to enrich their models [15], leading to the problems of calibration (fitting the model to empirical data) and harmonization (aligning the model to conform to theoretical assumptions and reflect real-world outcomes) [16]. In such a system of non-trivial interaction structures, a minor change in one variable can cause a cascade of effects, including the emergence of new relationships. Typically, calibration of agent-based models comprises three steps: simulating the model, measuring the quality of outcomes, and locating strong and weak levers [17]. Methodologically, manual calibration of parameters, Monte Carlo simulations, and data-centric optimization using evolutionary algorithms to mimic stylized micro- and macroeconomic facts are the most common, but also the most time-consuming, approaches [11]. Therefore, efficient tools are needed to investigate the emergent behavior [18].
These two challenges, decision making and calibration/harmonization, have been explored under the umbrella term of agent-based computational economics, wherein interacting autonomous agents are simulated to analyze the decision-making and learning processes in a tailored and stylized economic environment [19]. Popular methods for studying the learning behavior emerge from the field of computational intelligence for single- and multi-objective optimization [20], [21]. Building on these advances, recent developments in the field of AI combine optimization and, due to the advent of a data-driven society [22], machine learning for pattern recognition to build rational agents for perceiving the environment and taking actions [23], [24]. Instead of applying tailored algorithms to specific markets or social systems, AI research focuses on autonomous learning for general problem settings in complex environments. Recent developments in artificial gameplay, such as the game of Go [25], [26], Atari jump-and-run games [27], or the first-person 3D game Doom [28], which are based on deep-reinforcement learning, have yielded promising results for advancing decision making in complex environments. In this light, a full-fledged business simulation represents such a general setting because it combines all major business aspects, models a complex and dynamic environment, and is closely related to artificial games. Actors must make decisions in such a dynamic environment, and their decisions influence their opponents and affect the environment.

Combining deep-learning techniques with reinforcement learning has become a standard method for these problem sets. Several freely available benchmark suites, such as the Arcade Learning Environment [29], rllab [30], OpenAI Gym [31], TorchCraft [32], and the StarCraft II Learning Environment [33], have been published to train and compare these algorithms. Further application fields and research challenges of deep-reinforcement learning are mentioned in [34]. To support the shift from applying these algorithms in artificial games to the business context, and in response to the call in [34] for a reliable benchmark suite in the context of deep-reinforcement learning, this article presents an open benchmark suite of a holistic economic model for machine learning and AI algorithms in general. Following a discussion of the business applications of such a benchmark suite, a standard deep-reinforcement learning algorithm is presented to demonstrate the use of the model and highlight its challenges.
II. The Economic Model of the Benchmark Suite
The testbed is based on an economic market model. It does not provide a set of agents and their relationships as would a full-fledged agent-based model [35] but defines an agent's decision environment and her individual boundaries. The agent herself can be any algorithm. She can communicate indirectly with other agents within this model. In the following sections, we introduce the economic model in more detail. It may suffice to say that various model elements and parameters can be adjusted during the setup phase to modify the model according to the specific simulation focus.
A. The Firm's Decision Environment
In the default mode, the economic model emulates an oligopolistic market structure with a small number of firms, many consumers, and moderate barriers to entry and exit. The firms compete in multiple rounds and make simultaneous decisions on firm-specific and market-focused variables. According to the Structure–Conduct–Performance paradigm [36], such a market structure induces competitive pressure and strategic interaction among the firms (à la Cournot or Bertrand), leading to a balanced surplus for consumers and producers. From a theoretical perspective, four firms suffice to mimic a competitive environment [37], [38]. The upper limit of eight firms ensures that patterns of strategic interaction are still traceable and manageable if applied to human and artificial players. Each firm can choose to produce either a low-cost, low-price, high-volume (LLH) product, a high-cost, high-price, low-volume (HHL) product, or both, and sell these products in two separate, unrelated markets (domestic and international). We assume that the overall evolution of market demand for these products follows the traditional product life-cycle of introduction, growth, maturity, and decline [39]. Firm decisions are organized conceptually along Porter's value chain [40] and fall into two categories: internal and external decisions. External decisions, within the scope of product research and development (R&D) and marketing and sales, influence both market demand and competitors. They define the firm's business strategy, that is, its corporate and competitive action plan. Internal decisions, within the scope of operations, production, human resources, and finance, align the firm's internal organization and constitute the structure that supports the business strategy. Decisions are subject to trade-offs between flexibility and path dependency. Either decisions incur immediate expenditures, which have short-term effects (e.g., buying an additional facility), or they are sunk-cost- or people-related investments, giving rise to additional costs but with long-term benefits (e.g., economies of learning). All decision effects are subject to diminishing marginal returns (e.g., S-curve). Each firm is assumed to be a publicly traded company so that its financials serve as performance signals to the market. The default measure of a firm's market performance is its share price, primarily influenced by discounted cash flows, dividends, leverage, and brand value. In the default configuration, all firms start identical in terms of assets and financials and can sell their products directly in both markets.
B. Defining Business Strategy
A firm's business strategy embraces corporate and competitive strategy [41]. At the corporate level, firms have to define the scope of their activities in terms of geographic exposure (domestic or international) and degree of diversification (single product/market versus multiple products/markets). The combination of the available options creates a market arena tableau of four distinct non-cannibalizing markets (LLH product/national, LLH product/international, HHL product/national, HHL product/international). At the market level, firms have to decide on their competitive strategy, i.e., their market-specific strategic positioning as either price-/cost-leader, differentiator, or outpacer [40]. The actual positioning results from the relative value–price relationship in each particular market. The external decisions determine the potential demand of a firm and contribute to that of the market as a whole. The core drivers of firm-specific and total market demand are the firms' marketing efforts [42], their brand values, and macroeconomic factors. Marketing efforts are defined by the "four P's": (1) price, (2) promotion, or expenditures that increase total market demand via advertising, (3) placement, represented by the number of salespeople, which influences firm-specific demand, and (4) product value, which can be increased through R&D investments. The value/price ratio, promotion, and placement efforts, in conjunction with the brand value and macroeconomic factors, determine a firm's potential demand. Brand value is a cumulative measure reflecting consumer satisfaction. It represents first- and second-mover advantages/disadvantages of market entry and can be influenced by corporate identity expenditures, the firm's attractiveness, and its ability to live up to its promises.

This setup ensures that a firm can tap into all sources of revenue advantage, such as differentiation (value/price ratio), innovation (product value and brand value), and people (R&D staff and salespeople). Figure 1 summarizes the domain for external decisions that define a firm's business strategy.
C. Aligning the Organization
Internal firm decisions fall into two categories: operations and financing. The operations part consists of three pillars: (1) material sourcing, (2) product outsourcing, and (3) production. For each product, the material must be purchased. Starting with a one-to-one relationship of material-to-product, a firm can invest in materials development to reduce the amount of
FIGURE 1 The potential demand for firm x is influenced by the relative effects of the four P’s of its marketing, brand value, and macroeconomic factors. Dark rounded rectangles denote decisions.
material needed for production. The sourcing costs depend on macroeconomic factors. Inventory management is a key factor because keeping stock maintains flexibility but incurs high costs. While material sourcing is necessary to produce products in-house, outsourcing is an alternative. A firm can determine the number of units to be manufactured outside as well as the length of the outsourcing contract. Like materials and products made in-house, outsourced products are immediately available and can be delivered as ready-made products. The outsourcing costs are pegged to an external exchange rate, which varies according to macroeconomic factors. Thus, by selecting contract length, the firm may hedge against cost fluctuations. Finally, a firm must decide on the volume of production for its product/market portfolio. It can produce more or less than its potential demand in any market. Potential demand is not known ex ante because it is a function of all external decisions made simultaneously by all firms. Therefore, a firm must estimate its potential demand as implied by its business strategy. The sum of product inventory, outsourced products, and manufactured products serves to satisfy the potential demand. For the production decision, two independent capacity factors have to be taken into account: (1) the production capacity of the respective facility and (2) the production staff capacity. Facilities come in different sizes and production capacities. A firm can purchase and retire facilities to adjust its total capacity. It can also invest in infrastructure and flow optimization to increase capacity. One unit of capacity equals one unit of the low- or high-cost product. The second capacity factor is the production staff. Each blue-collar worker has a base productivity rate per round. The productivity rates differ by product and can be increased over time by investing in training and incentives. The productivity rate times the number of workers defines the total production staff capacity. This setup enables a firm to exploit different types of cost advantages, such as economies of scale (e.g., bulk purchase and mass production), scope (e.g., utilizing facilities for both products), and learning (e.g., productivity increases of production staff). The minimum of the facility capacity and production staff capacity defines the number of units that can be produced (Figure 2).
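The binding-constraint logic described above (producible units are the minimum of facility capacity and production staff capacity, where staff capacity is the productivity rate times the number of workers) can be sketched as follows; the function and parameter names are illustrative, not identifiers from the suite:

```python
def producible_units(facility_capacity, workers, productivity_per_worker):
    """Units a firm can manufacture this round: the binding constraint is
    the smaller of facility capacity and production staff capacity."""
    staff_capacity = workers * productivity_per_worker
    return min(facility_capacity, staff_capacity)

# Example: a facility rated for 120,000 units but only 500 workers at
# 200 units each means the staff capacity (100,000) binds.
```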
The final element of the internal perspective is corporate financing and shareholder wealth creation. A firm must decide how to finance its activities, e.g., out of cash flow or via short-term vs. long-term loans while managing interest and repayment trade-offs, and how much to pay out to shareholders in terms of dividends.
III. Characteristics and Applications of the Benchmark Suite
Games such as chess and Go entail perfect information. Therefore, theoretically, they can be solved by an exhaustive search of the full game tree. In practice, an exhaustive search is infeasible because of the large search space: approximately 35^80 for chess and 250^150 for Go [25]. The search space can be reduced, however, by limiting the state space and the action space and using metaheuristics to map between both [26]. For Atari jump-and-run games, the state space consists of multiple subsequent frames (pixel
FIGURE 2 Internal structure of a firm. Dark rounded rectangles denote decisions. The production quantity, material, outsourced quantity, and inventory are product specific and should be aligned with each other and with the potential demand.
screens), and the action space comprises commands such as moving the game figure right, left, up, or down. The real-time strategy game StarCraft II poses the next stage of challenging problems, with more than two players interacting, a more extensive state space and action space (approximately 10^8), imperfect information due to partially observed game states, and many thousands of frames of gameplay [33]. The proposed business simulation testbed pushes these boundaries further along the dimensions of game complexity, game dynamics, and game objectives. First, referring to complexity, there are four to eight players in the testbed, and the action space has become infeasible to address. Whereas actions in chess, Go, or arcade games are limited to one per round or frame, StarCraft II allows multiple actions at once, such as moving a group of units to a specific grid position. However, the number of sequential actions per minute is limited in the StarCraft II Learning Environment to account for a fair comparison with human players. In the business simulation, an agent can make up to 48 decisions per round. A decision can take any number within a specific range, such as for price (1500–2500), number of salespeople (0–999), or production volume (0–500,000). The median of the ranges is 1000, leading to 1000^48 possible decision combinations. While most of the actions in StarCraft II are sequential over a long period of time, the business simulation focuses on simultaneous decisions over a short period of time (a few rounds). Thus, the training data are very sparse compared to existing test problems. Second, with respect to the game dynamics, the external environment changes over time: Demand, costs, exchange rates, and other macroeconomic factors vary according to the industry life-cycle and customized market shocks. Finally, the game objectives differ from those of established games. Various rewards such as share price, market share, cash position, profit level, and survival of rounds can be used as performance measures, reflecting agents' extrinsic and intrinsic motivators [43]. Whereas StarCraft II is a zero-sum (win/lose) game, the business simulation need not be if agents have different objectives. The benchmark suite focuses on the environment and not on the agent. It is a testbed for algorithms to compare and replicate scenarios without programming language limitations. Therefore, the benchmark suite provides a representational state transfer (REST) application programming interface (API) to access the current game state, reward, and decision/action vectors using JavaScript Object Notation (JSON) as follows:

[{"SHAREPRICE":257, "TOTALREVENUES":257, "NETINCOME":13, …}, … {…}]
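Before a learning algorithm can consume such a game state, the JSON records are typically flattened into a fixed-order numeric vector. A minimal sketch, using only the three field names from the example above (the full field set and its ordering are assumptions, not the suite's documented schema):

```python
import json

# Illustrative subset of state fields taken from the JSON example.
STATE_FIELDS = ["SHAREPRICE", "TOTALREVENUES", "NETINCOME"]

def to_state_vector(payload):
    """Flatten a list of per-firm JSON records into one numeric vector,
    using a fixed field order so positions are stable across rounds."""
    records = json.loads(payload)
    return [float(rec.get(field, 0.0)) for rec in records for field in STATE_FIELDS]

vec = to_state_vector('[{"SHAREPRICE":257,"TOTALREVENUES":257,"NETINCOME":13}]')
```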
The game state is a representation of the environment consisting of (1) public information from other firms, such as balance sheet and cash flow statement data, (2) sales, revenues, and product values from each market and firm, (3) macroeconomic variables such as the GDP, exchange rate, and material costs, and (4) each firm's internal measures such as unit costs and inventory levels. The workflow for using the benchmark suite (Figure 3) consists of five main functions: (1) reset the game, define specific parameters (demand cap, minimum product value standards, production costs, etc.), and induce system shocks, using either artificial or real-world data (RESET), (2) obtain the upper and lower limits for each variable (LIMITS), (3) submit the decision vector and proceed to the next round (STEP), (4) obtain the game status (STATUS), and (5) simulate internal decisions (SIM).
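The five functions map directly onto HTTP GET calls. The following is a minimal Python sketch of one round against the REST API; the host, token, and reset parameter are placeholders, and the response format assumed for LIMITS (a JSON list of (lower, upper) pairs) is an illustration, not the suite's documented schema:

```python
import json
from urllib.request import urlopen  # stdlib HTTP client

BASE = "http://x.x.x.x"   # placeholder: the service's IP address
TOKEN = "demo-token"      # placeholder: a session token

def call(path):
    """GET a REST endpoint and decode its JSON payload."""
    with urlopen(f"{BASE}/{path}", timeout=30) as resp:
        return json.load(resp)

def midpoint_action(limits):
    """Compile a decision vector from (lower, upper) bounds by taking
    the midpoint of each range (a naive but always-feasible policy)."""
    return ",".join(str((lo + hi) // 2) for lo, hi in limits)

def play_one_round():
    state = call(f"reset/{TOKEN}/default")   # (1) RESET ("default" is a placeholder param)
    limits = call(f"limits/{TOKEN}")         # (2) LIMITS
    action = midpoint_action(limits)
    state = call(f"step/{TOKEN}/{action}")   # (3) STEP
    return call(f"status/{TOKEN}")           # (4) STATUS
```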
Business decisions can be decomposed into two distinct types: external decisions (which influence the market and competitors) and internal decisions (which do not interact with the environment). Once the business strategy is set, and an agent estimates the potential sales based on her own decisions and assumptions about other players, she needs to align the internal decisions. This setup reflects Alfred Chandler's paradigm of Structure follows Strategy. The choice of external decisions can be characterized as a pattern recognition problem of mapping states to actions. Aligning the internal decisions, however, is an optimization problem. When the potential sales (the result of predicting the effect of the external decisions) for each market are submitted along with the internal decisions of a firm (the simaction vector) and the firm number (x), the tool returns all internal measures such as the inventory, unit cost, cash flow, and balance sheet. In summary, the design of the benchmark suite helps to develop and evaluate general machine learning techniques for decision making and calibration and combines the learning tasks of exploration and exploitation (external decisions) with multivariable optimization (internal decisions).
IV. The Challenges of Reinforcement Learning in Business Simulations
Pattern recognition algorithms can be categorized broadly as unsupervised, supervised, or reinforcement learning [44]. As data labels in the business simulation are sparse and time-delayed, the problem is one of reinforcement learning. This approach assumes an agent situated in an environment. She observes the environment state (s; e.g., a JSON object), takes an action (a; e.g., a JSON object), and receives a reward (r, determined by the agent; e.g., the number of rounds survived, profit after round t, etc.) and a new state (s'; e.g., a JSON object) from the environment: (s, a) → (r, s'). Reinforcement learning seeks a sequence of actions, called a policy, to maximize the total reward [45]. It solves the credit assignment problem by rewarding preceding actions that contribute to the final outcome. From the agent's perspective, the environment is non-deterministic; the total reward R_t at time t is the sum of the current reward r_t and the discounted future rewards γR_{t+1}, with γ ∈ [0, 1], when a specific action is pursued: R_t = r_t + γR_{t+1}. γ is called the discount factor, with a value close to 1 to consider future discounted cumulative rewards. Following [46], a function Q(s_t, a_t) can be defined that estimates the total reward for a given state s and action a. An optimal choice is an action that maximizes the discounted future rewards: Q(s_t, a_t) = max R_{t+1}. Assuming such a Q-function exists, an optimal policy chooses the action with the highest Q-value: π(s) = argmax_a Q(s, a). Rewriting this formula gives the Bellman equation, which can be used to iteratively approximate the Q-function. Lin [47] proves that this iteration converges if the states are finite: Q(s, a) = r + γ max_a' Q(s', a'). Following the suggestions in [27], a neural network with multiple hidden layers is used to approximate the Q-function. The input layer is the game state vector; the output layer is the Q-values for any possible action. The network is initialized randomly, and the regression task is optimized using the least-squares error method and stochastic gradient descent.
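The Bellman iteration converges for finite state spaces; a toy illustration on a four-state chain (not the business simulation itself, where states are continuous) can make the update rule concrete. The MDP and all parameter values here are invented for demonstration:

```python
import random

# Toy deterministic MDP: states 0..3, actions 0 (left) / 1 (right);
# reaching state 3 pays reward 1 and ends the episode.
N_STATES, N_ACTIONS, GAMMA, ALPHA = 4, 2, 0.9, 0.5

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
random.seed(0)
for _ in range(500):                        # episodes
    s, done = 0, False
    while not done:
        a = random.randrange(N_ACTIONS)     # pure exploration suffices here
        s_next, r, done = step(s, a)
        target = r + GAMMA * max(Q[s_next])     # Bellman target r + γ max_a' Q(s', a')
        Q[s][a] += ALPHA * (target - Q[s][a])   # move the estimate toward the target
        s = s_next

# With γ = 0.9, moving right from state 0 is worth γ² · 1 ≈ 0.81 at the fixed point.
```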
Unfortunately, the game states are not finite in a business simulation because the domain is of real numbers. Thus, convergence can no longer be guaranteed. Several countermeasures are proposed to stabilize the process, such as experience replay [47], the ε-greedy policy [27], or target networks [48]. In experience replay, state–action transitions are stored in memory from a knowledge base, random generation, or preceding training tasks; the neural network is trained by mini-batches from this memory. The ε-greedy policy prevents the algorithm from always picking the action with the highest Q-value and thus getting stuck in a local optimum: With some probability, a random action is chosen instead of the action with the highest Q-value. A typical implementation follows the policy that over time, the algorithm first explores and later becomes increasingly greedy; this is reflected by the factor λ, which controls the speed of decay of exploration: ε = ε_min + (ε_max − ε_min)·e^(−λt). Target networks are used to stabilize the training process. For any forward pass of samples regarding a specific action, all Q-values associated with similar actions will also be affected, and the neural network will never stabilize. To avoid this behavior, every x steps the weights of the training network are copied to a target network, and predictions of Q-values (used
[Figure 3 workflow: setup and customize the testbed (economic model); get decision limits and current game state; agent compiles the action/decision vector; optionally simulate the effect of internal decisions; submit the action vector and move to the next round; repeat until the end of the game.]

REST API commands (pseudocode):
RESET:  game_state <- GET x.x.x.x/reset/token/param
LIMITS: boundaries <- GET x.x.x.x/limits/token
STATUS: status     <- GET x.x.x.x/status/token
SIM:    sim_state  <- GET x.x.x.x/sim/token/x/simaction
STEP:   game_state <- GET x.x.x.x/step/token/action
FIGURE 3 The general workflow for using the benchmark suite is depicted on the left. The right part of the figure shows the corresponding REST API commands in pseudocode. x.x.x.x denotes the service's IP address.¹ The dark rounded rectangle highlights the external part of the suite, the algorithm or agent.
1 Online Resource: Access to the public free service is available at www.stratchal.com/
demo
NOVEMBER 2018 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 21
in the error function for the training network)
are simply taken from the target network. This
slows the learning process but stabilizes the
Q-values. Various other stabilization and
exploration–exploitation techniques are used
in [26], [48], and [49] for demonstration pur-
poses; these three are the most prominent
ones. A general algorithm is given in Figure 4.
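A minimal sketch of the two stabilization mechanisms, a bounded replay memory and a periodically synchronized target network, might look as follows; the "weights" are a plain list here, purely for illustration:

```python
import random
from collections import deque

class ReplayMemory:
    """Bounded memory of state-action transitions for experience replay."""

    def __init__(self, capacity):
        # When full, appending automatically discards the oldest entry.
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        # transition = (state, action, reward, next_state)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Draw a random mini-batch for training the network.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def sync_target(training_weights):
    """Copy the training network's weights to the target network."""
    return list(training_weights)
```

Training targets are then computed from the copied (frozen) weights until the next synchronization step, which is what slows but stabilizes the learning process.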
The major challenges in this task are the large
state and action space and the distinction between external
variables and internal optimization. Therefore, three experi-
ments are presented to highlight the challenges and to demon-
strate the use of the benchmark suite. The first experiment
showcases a simple Cournot oligopoly, looking only at the
external decisions. The second mimics a cobweb model and
also incorporates the internal decisions, but without differenti-
ating the learning approach. The third experiment advances
the second one by applying different learning approaches to
external and internal decisions.
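Driving the suite over its REST interface (Figure 3) might look roughly like the following Python sketch; the host address, token, and parameter strings are placeholders, and only the URL pattern is taken from the figure's pseudocode:

```python
import urllib.request

BASE = "http://x.x.x.x"  # the service's IP address (placeholder)

def endpoint(command, token, *args):
    """Build a benchmark-suite URL such as /reset/token/param or /step/token/action."""
    return "/".join([BASE, command, token] + list(args))

def call(url):
    """Issue the GET request and return the raw response body (network call)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

# Workflow: reset, query limits and status, optionally simulate, then step.
# game_state = call(endpoint("reset", TOKEN, PARAM))
# boundaries = call(endpoint("limits", TOKEN))
# status     = call(endpoint("status", TOKEN))
# game_state = call(endpoint("step", TOKEN, action_vector))
```

Because the interface is plain HTTP GET, the same loop can be written in any language the agent happens to use.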
In the first experiment, the state space consisted of 21 vari-
ables: potential sales, actual sales, cash position, balance sheet
total, and share price for each of the four players, plus the
domestic market’s GDP as a hint for the industry cycle. The
action vector was an 8-bit string, resulting in 256 different
actions. Two bits signaled a price increase of 50 (01₂), a price
decrease of 50 (10₂), or no price change (11₂ or 00₂) for a
single player. Production volumes, material procurement, and
production workers were automatically adjusted at a fixed rate,
ignoring any internal complexity. All other variables were set to
default values and left unchanged. The reward, the time-delayed
data label, was the number of rounds
all firms survived. Although different
rewards could have been imposed to
represent different objectives for individ-
ual firms, all firms shared the same
objective in order to compare the results
with other empirical findings in [8].
Once a firm goes bankrupt, the game
restarts. The industry cycle takes four
stages. Each stage lasts for 8 rounds, and
the cycle was repeated 25 times. Further
parameters, whose values were chosen to
optimize the trade-off between comput-
ing time and memory capacity, are (1)
size of the replay memory, 1000; (2)
mini-batch size, 64; (3) target network
updates, 100; and (4) the neural network
structure, two hidden dense layers (first,
512 neurons; second, 256 neurons); hid-
den layer activation functions, rectified
linear; output activation function, soft-
max; input layer, 21 neurons; and output
layer, 256 neurons. The ε-greedy policy
parameters were the standard lower and upper probability limits,
ε_min = 0.01 and ε_max = 0.99, with a slow convergence rate of
λ = 0.001; the discount factor γ = 0.99 was chosen according to
the literature [50]. Figure 5 shows the results of 1100 training runs. The
overall firm survival rate increased from 18 to a maximum of
32 rounds (left image). The decisions in the last 200 rounds
(right image) demonstrate that the algorithm learned to reduce
the price in times of industry recession (cycles a, b, c, and g) and
to increase the price in times of growth (cycles a, b, and c). This
collusive behavior aligns with the findings in [8, p. 3287],
wherein the authors applied a Q-learning algorithm to a simple
oligopoly model and concluded, “Q-learning firms generally
learn to collude with each other in Cournot oligopoly games,
although full collusion usually does not emerge, that is, firms
usually do not learn to make the highest possible joint profit.”
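The 8-bit action encoding described above can be illustrated with a short decoder; the helper name and layout are hypothetical, only the bit semantics are taken from the text:

```python
# Hypothetical decoder for the first experiment's 8-bit action string:
# two bits per player, 01 -> price +50, 10 -> price -50, 11 or 00 -> no change.
def decode_price_actions(bits):
    """Map an 8-character bit string to one price delta per player."""
    assert len(bits) == 8  # 2**8 = 256 possible actions
    deltas = {"01": 50, "10": -50, "11": 0, "00": 0}
    return [deltas[bits[i:i + 2]] for i in range(0, 8, 2)]
```

For example, `decode_price_actions("01100011")` yields a +50 price move for player 1, a −50 move for player 2, and no change for players 3 and 4.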
The second experiment incorporated price, sales, and basic
internal production decisions. Each decision was represented by
a single bit: (1) a price increase (1) or decrease (0) of 50, (2) an
increase (1) or decrease (0) of 40 in the number of salespeople,
and (3) an increase of 5000 units in the production quantity and
of 70 people in the production staff (1) or a corresponding
decrease (0). The resulting 12-bit string creates an action space
Initialize replay memory (M) with random transitions
Initialize the training neural network (Q) with random weights
Copy the weights of Q to a target network (T)
s <- GET x.x.x.x/reset/token/param
while (continue):
    Select action a (external variables) at state s with ε-greedy method
    Create action vector v from a and the internal aligned variables
    s* <- GET x.x.x.x/step/token/v
    reward <- GET x.x.x.x/status/token
    Store transition <s, v, reward, s*> in M; if M is full, discard the oldest entry
    Sample mini-batches from M:
        Train Q
    Every k steps, copy weights from Q to T
    s <- s*

FIGURE 4 Pseudocode of the reinforcement learning algorithm (with experience replay, ε-greedy policy, and target networks) using the benchmark suite; x.x.x.x, the service's IP address.
The presented business simulation provides a testbed for machine learning algorithms on a problem domain with a continuous state and action space, in a competitive, non-zero-sum game of imperfect information with time constraints.
of 4096 combinations. The underlying associated economic
model is the cobweb model in which production quantities are
chosen before the market price is observed [51]. The defined
reinforcement learning approach is best suited for credit assign-
ment tasks. Therefore, it is also applied to the second experi-
ment without any structural adjustments. Internal and external
decisions are treated with the same learning approach, despite
the fact that the testbed poses a combined credit-assignment
and optimization problem. The results (Figure 7) show a slower
learning rate within 1100 training steps (from 8 to 14) than in
the first setup due to the larger action space. However, qualita-
tively, two different scenarios can be identified (Figure 6).
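Similarly, the second experiment's 12-bit string (three decision bits per firm) might be decoded as follows; the helper is hypothetical, with only the bit semantics taken from the text:

```python
# Hypothetical decoder for the second experiment's 12-bit action string:
# three bits per firm -- price +/-50, salespeople +/-40, and production
# quantity +/-5000 units with production staff +/-70.
def decode_firm(bits3):
    """Map one firm's three decision bits to the corresponding changes."""
    price = 50 if bits3[0] == "1" else -50
    salespeople = 40 if bits3[1] == "1" else -40
    quantity, staff = (5000, 70) if bits3[2] == "1" else (-5000, -70)
    return {"price": price, "salespeople": salespeople,
            "quantity": quantity, "staff": staff}

def decode_actions(bits12):
    """Decode all four firms' decisions from the 12-bit action string."""
    assert len(bits12) == 12  # 2**12 = 4096 possible actions
    return [decode_firm(bits12[i:i + 3]) for i in range(0, 12, 3)]
```

The jump from 256 to 4096 actions is exactly the growth in action space that slows learning in this experiment.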
[Figure 6: two panels; (a) stacked price level versus runs, (b) stacked salespeople level versus runs, each showing Firm 1 through Firm 4 across lifecycles aL to dL and aR to dR.]

FIGURE 6 Second experiment. (a) The stacked prices, with firm 1 at the bottom and firm 4 on top. (b) The stacked salespeople decisions. The Red Queen effect is observed in cycles aR and dR. Lifecycles aL to dL and aR to dR in Figure 6 correspond to lifecycles a to d in Figure 5.
[Figure 5: two panels; (a) number of rounds survived versus training steps (0 to 1,000) with a regression line, (b) stacked price level versus runs for Firm 1 through Firm 4 across lifecycles a to g.]

FIGURE 5 First experiment. (a) The average number of rounds survived for all firms by manipulating the price and neglecting quantity decisions. The red line depicts the regression line. (b) Price decisions for the last 200 runs. Prices are stacked to highlight their similarity, with firm 1 at the bottom and firm 4 on top. Each vertical dotted line signals the beginning of a new industry lifecycle (a to g).
(1) During recessions, expenditures for sales-
people are used to mitigate a potential price
decrease: In rounds 71–76 and 86–91, the
number of salespeople increased as the price
dropped. (2) Once firms spend on more sales-
people, an arms race begins (as suggested by the
Red Queen principle), and firms hire even
more, especially in growth phases (rounds
16–21 and 106–111) when prices increase.
Due to the option to set internal decisions, the
algorithm was able to adjust better to industry life-cycle
fluctuations than in the first experiment (compare cycles a, b,
and c in Figure 5 with cycles aL, bL, and cL in Figure 6),
which mimicked the behavior in a cobweb model [51]. However, with this
algorithm, the performance (survival rate) was significantly
lower, which highlights the influence of the complexity of the
action space on learning speed.
The third experiment advances the second one by distin-
guishing between learning approaches for external and
internal variables. The explicit distinction between external
and internal variables provides the algorithm with domain
knowledge. Instead of changing the production staff at a
fixed rate, an evolutionary algorithm based on the approach
in [51] is implemented to determine the optimal num-
ber of workers. The inventory, the previous sales volume,
and the current change in production volume are used to
estimate the potential demand of the upcoming round, which
is passed to the SIM command. The cash position returned by the SIM
command is used as the fitness measure in the evolution-
ary algorithm. Figure 7 shows that this leads to a better
per formance due to a less complex action space and the
im ple menta tion of domain knowledge (separation of exter-
nal and internal decisions).
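The evolutionary search over the number of workers might be sketched as follows; the fitness function here is a stand-in for the cash position returned by the SIM command, and all names and parameter values are illustrative:

```python
import random

def evolve_workers(fitness, pop_size=20, generations=30,
                   lo=0, hi=2000, mut=50, seed=0):
    """Simple (mu + lambda)-style search for the worker count maximizing fitness.

    `fitness` stands in for the cash position returned by the SIM command.
    """
    rng = random.Random(seed)
    # Initial population: random worker counts within the allowed range.
    pop = [rng.randint(lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half as parents, refill with mutated copies.
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]
        children = [min(hi, max(lo, p + rng.randint(-mut, mut)))
                    for p in parents]
        pop = parents + children
    return max(pop, key=fitness)
```

Because the parents are retained each generation, the best candidate never degrades, which suits the noisy, round-by-round fitness signal a simulation call would return.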
The first two experiments demonstrate that the testbed can
be used to replicate economic models and generate economic
behavior patterns in agent-based computational economics [8],
[51]. The benchmark suite improves upon the existing approach-
es by providing decisions across all areas of Porter’s value chain in
a standard setting that can be customized and, due to the REST
implementation, can be accessed by all types of agents (algo-
rithms) regardless of their underlying programming structure and
language. The experiments also demonstrate that a trivial
implementation of state-of-the-art reinforcement learning algorithms
is capable of simulating real market behavior yet explodes in
computational time and space [48] for any continuous action
space. Further algorithmic improvement techniques such as
automatic network structure determination [52] or hyperparam-
eter optimization [53] improve the performance, but the main
problem is the combination of simultaneous decision making
and infinite action space in conjunction with the optimization of
internal variables. This calls for other approaches, such as
probability distributions over actions [33] or the actor-critic approach with
[Figure 7: two panels; each plots the number of rounds survived (0 to 35) versus training steps (900 to 1,100), with a regression line.]

FIGURE 7 Second and third experiment: the average number of rounds survived for all four firms in the last 200 training steps. The average number of rounds all firms survived in the third experiment is significantly higher (right panel, 17 rounds) than in the second experiment (left panel, 13 rounds). The red line depicts the regression line over all training steps.
a discrete [54] or continuous [48], [55] action space. The third
experiment applies the testbed to the joint solution of the credit
assignment and optimization problem. Moreover, the model can
be used to test algorithms to autonomously (without domain
knowledge) differentiate between different learning styles.
Machine learning approaches alone may not be sufficient
because the business simulation is hard to predict, is not
self-contained, and focuses on simultaneous decisions over a short
period rather than sequential decisions over a long period. Therefore,
the authors of [23] suggest that major breakthroughs might
utilize hybrid techniques combining deep learning with reason-
ing, in this case, economic reasoning.
V. Conclusion
This article has presented a benchmark suite for machine learn-
ing algorithms for strategic decision making in a business con-
text. This tool extends the current set of training environments in
an artificial context, such as rllab, OpenAI Gym, or StarCraft II,
by providing a dynamic multi-market, multi-product environ-
ment with few producers and many consumers for applications
of strategic planning. It extends the current state-of-the-art prob-
lems for reinforcement learning by allowing a continuous state
and action space in a non-zero-sum game of imperfect infor-
mation. Three reinforcement-learning approaches using deep
Q-learning with experience replay, e-greedy policy, and target
networks for stabilization were used to demonstrate price-setting
and quantity-setting behaviors of four firms. The results show
that the decisions made by the algorithms align with expected
outcomes in such oligopolistic markets.
References
[1] J. P. Kotter, Accelerate: Building Strategic Agility for a Faster-Moving World. Boston, MA,
USA: Harvard Business Review Press, 2014.
[2] B. Gilad, “‘Competitive intelligence’ shouldn’t just be about your competitors,” Harv.
Bus. Rev., 2015, May 18. [Online]. Available: https://hbr.org/2015/05/competitive-intelligence-
shouldnt-just-be-about-your-competitors. Accessed on: August 27, 2018.
[3] M. Reeves and G. Wittenburg, “Games can make you a better strategist,” Harv. Bus.
Rev., 2015, Sept. 7. [Online]. Available: https://hbr.org/2015/09/games-can-make-you-
a-better-strategist. Accessed on: August 27, 2018.
[4] T. M. Connolly et al., “A systematic literature review of empirical evidence on com-
puter games and serious games,” Comput. Educ., vol. 59, no. 2, pp. 661–686, Sept. 2012.
[5] E. A. Boyle et al., “An update to the systematic literature review of empirical evidence
of the impacts and outcomes of computer games and serious games,” Comput. Educ., vol.
94, pp. 178–192, Mar. 2016.
[6] J. P. Davis et al., “Developing theory through simulation methods,” Acad. Manage.
Rev., vol. 32, no. 2, pp. 480–499, Apr. 2007.
[7] R. Bell and M. Loon, “Reprint: The impact of critical thinking disposition on learning
using business simulations,” Int. J. Manage. Educ., vol. 13, no. 3, pp. 362–370, Nov. 2015.
[8] L. Waltman and U. Kaymak, “Q-learning agents in a Cournot oligopoly model,” J.
Econ. Dyn. Control, vol. 32, no. 10, pp. 3275–3293, Oct. 2008.
[9] J. D. Sterman, Business Dynamics: Systems Thinking and Modeling for a Complex World.
Boston, MA, USA: McGraw-Hill, 2000.
[10] K. Warren, Competitive Strategy Dynamics. West Sussex, England: Wiley, 2002.
[11] G. Fagiolo et al., “A critical guide to empirical validation of agent-based models in
economics: Methodologies, procedures, and open problems,” Comput. Econ., vol. 30, no.
3, pp. 195–226, Oct. 2007.
[12] H. Rahmandad and J. Sterman, “Heterogeneity and network structure in the dynam-
ics of diffusion: Comparing agent-based and differential equation models,” Manage. Sci.,
vol. 54, no. 5, pp. 998–1014, May 2008.
[13] M. L. Herman et al., Wargaming for Leaders: Strategic Decision Making from the Battlefield
to the Boardroom. New York, NY, USA: McGraw-Hill, 2009.
[14] A. McAfee and E. Brynjolfsson, “Big data: The management revolution,” Harv. Bus.
Rev., vol. 90, no. 10, pp. 60–68, Oct. 2012.
[15] H. R. Varian, “Big data: New tricks for econometrics,” J. Econ. Perspectives, vol. 28,
no. 2, pp. 3–28, May 2014.
[16] R. Garcia et al., “Validating agent-based marketing models through conjoint analy-
sis,” J. Bus. Res., vol. 60, no. 8, pp. 848–857, Aug. 2007.
[17] J. Grazzini et al., “Bayesian estimation of agent-based models,” J. Econ. Dyn. Control,
vol. 77, pp. 26–47, Apr. 2017.
[18] J.-S. Lee et al., “The complexities of agent-based modeling output analysis,” J. Artif.
Soc. Social Simul., vol. 18, no. 4, pp. 1–26, Oct. 2015.
[19] L. Tesfatsion, “Agent-based computational economics: Modeling economies as com-
plex adaptive systems,” Inform. Sci., vol. 149, no. 4, pp. 262–268, Feb. 2003.
[20] H. Dawid and M. Kopel, “On economic applications of the genetic algorithm: A
model of the cobweb type,” J. Evol. Econ., vol. 8, no. 3, pp. 297–315, Sept. 1998.
[21] M. Barbati et al., “Applications of agent-based models for optimization problems: A
literature review,” Expert Syst. Appl., vol. 39, no. 5, pp. 6020–6028, Apr. 2012.
[22] E. Brynjolfsson and A. McAfee, The Second Machine Age. New York, NY, USA:
Norton, 2014.
[23] D. C. Parkes and M. P. Wellman, “Economic reasoning and artif icial intelligence,”
Science, vol. 349, no. 6245, pp. 267–272, July 2015.
[24] Y. LeCun et al., “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
[25] D. Silver et al., “Mastering the game of Go with deep neural networks and tree
search,” Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[26] D. Silver et al., “Mastering the game of Go without human knowledge,” Nature, vol.
550, no. 7676, pp. 354–359, Oct. 2017.
[27] V. Mnih et al., “Playing Atari with deep reinforcement learning,” arXiv Preprint,
arXiv:1312.5602 [cs.LG], Dec. 2013.
[28] D. S. Ratcliffe et al., “Clyde: A deep reinforcement learning DOOM playing agent,”
in Proc. Workshops 31st Conf. Association Advancement Artificial Intelligence, San Francisco,
CA, USA, 2017, pp. 983–990.
[29] M. G. Bellemare et al., “The arcade learning environment: An evaluation platform
for general agents,” in Proc. 24th Int. Joint Conf. Artificial Intelligence, Buenos Aires, Argen-
tina, 2015, pp. 4148–4152.
[30] Y. Duan et al., “Benchmarking deep reinforcement learning for continuous control,”
in Proc. 33rd Int. Conf. Machine Learning, New York, NY, 2016, pp. 1329–1338.
[31] G. Brockman et al.,“OpenAI Gym,” arXiv Preprint, arXiv:1606.01540v1 [cs.LG],
June 2016.
[32] G. Synnaeve et al., “TorchCraft: A library for machine learning research on real-time
strategy games,” arXiv Preprint, arXiv:1611.00625v2 [cs.LG], Nov. 2016.
[33] O. Vinyals et al., “StarCraft II: A new challenge for reinforcement learning,” arXiv
Preprint, arXiv:1708.04782v1 [cs.LG], Aug. 2017.
[34] K. Arulkumaran et al., “Deep reinforcement learning: A brief survey,” IEEE Signal
Process. Mag., vol. 34, no. 6, pp. 26–38, Nov. 2017.
[35] C. M. Macal and M. J. North, “Tutorial on agent-based modelling and simulation,”
J. Simul., vol. 4, no. 3, pp. 151–162, Sept. 2010.
[36] J. S. Bain, Industrial Organization. New York, NY, USA: Wiley, 1968.
[37] S. Huck et al., “Two are few and four are many: Number effects in experimental
oligopolies,” J. Econ. Behav. Org., vol. 53, no. 4, pp. 435–446, Apr. 2004.
[38] I. Topolyan, “Price competition when three are few and four are many,” Int. J. Ind.
Org., vol. 54, pp. 175–191, Sept. 2017.
[39] F. M. Bass, “A new product growth for model consumer durables,” Manage. Sci., vol.
15, no. 5, pp. 215–227, Jan. 1969.
[40] M. E. Porter, Competitive Strategy: Techniques for Analyzing Industries and Competitors.
New York, NY, USA: Free Press, 2004.
[41] C. Hill, International Business: Competing in the Global Marketplace, 8th ed. New York,
NY, USA: McGraw-Hill, 2011.
[42] P. Kotler and K. L. Keller, Marketing Management. Upper Saddle River, NJ, USA:
Prentice Hall, 2006.
[43] S. Singh et al., “Intrinsically motivated reinforcement learning: An evolutionary
perspective,” IEEE Trans. Auton. Mental Develop., vol. 2, no. 2, pp. 70–82, June 2010.
[44] T. Hastie et al., The Elements of Statistical Learning. New York, NY, USA: Springer,
2009.
[45] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA,
USA: MIT Press, 1998.
[46] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Mach. Learn., vol. 8, no. 3–4, pp.
279–292, May 1992.
[47] L.-J. Lin, “Reinforcement learning for robots using neural networks,” Ph.D. disserta-
tion, School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, 1992.
[48] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” arXiv
Preprint, arXiv:1509.02971 [cs.LG], Sept. 2016.
[49] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature,
vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[50] S. Mahadevan, “To discount or not to discount in reinforcement learning: A case
study comparing R learning and Q learning,” in Proc. 11th Int. Conf. Machine Learning,
New Brunswick, NJ, 1994, pp. 164–172.
[51] J. Arifovic, “Genetic algorithm learning and the cobweb model,” J. Econ. Dyn. Con-
trol, vol. 18, no. 1, pp. 3–28, Jan. 1994.
[52] M. Suganuma et al., “A genetic programming approach to designing convolutional
neural network architectures,” arXiv Preprint, arXiv:1704.00764v2 [cs.NE], Aug. 2017.
[53] J. Bergstra et al., “Algorithms for hyper-parameter optimization,” in Proc. 24th Int.
Conf. Neural Information Processing Systems, Granada, Spain, 2011, pp. 2546–2554.
[54] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in Proc.
33rd Int. Conf. Machine Learning, New York, NY, 2016, vol. 48, pp. 1928–1937.
[55] E. Di Mario et al., “A comparison of PSO and reinforcement learning for multi-robot
obstacle avoidance,” in Proc. IEEE Congr. Evolutionary Computation, Cancun, Mexico, 2013,
pp. 149–156.