Supply Chain Digital Twin
A Case Study in a Pharmaceutical Company
João Afonso Ménagé Santos
Thesis to obtain the Master of Science Degree in
Mechanical Engineering
Supervisors: Prof. Susana Margarida da Silva Vieira
Prof. Joaquim Paul Laurens Viegas
Examination Committee
Chairperson: Prof. Carlos Baptista Cardeira
Supervisor: Prof. Susana Margarida da Silva Vieira
Members of the Committee: Prof. Jacinto Carlos Marques Peixoto do Nascimento
Prof. Rui Fuentecilla Maia Ferreira Neves
November 2019
Acknowledgments
First, I would like to thank my supervisors, Prof. Susana Vieira, Prof. Joaquim Viegas and Eng. Miguel Lopes. Their support was paramount to the success of this work and I could not have asked for better guidance.
I also extend my appreciation to Hovione Farmaciencia S.A., for giving me the possibility of carrying out my work in a highly dynamic and challenging environment, to Prof. João Sousa, for offering me this opportunity and for giving advice on the topics of the thesis whenever necessary, and to Eng. Andrea Costigliola, for giving me all the tools needed and for providing counsel and feedback throughout the duration of this work.
Additionally, I am very grateful to all the teams and colleagues with whom I have had the pleasure of working during these past few months, especially the Data Science & Digital Systems, Applications Development and Supply Chain teams.
Just as importantly, a big thank-you to my girlfriend, who has supported me unwaveringly, my parents
and my brother, who have made me into what I am today, supporting me in every decision, and my
friends and colleagues.
Resumo
A cadeia logística integrada é uma rede onde todas as áreas de negócio são dependentes entre si. Apesar de serem estruturas extremamente poderosas, a logística por detrás destas redes interconectadas de pessoas, produtos, máquinas e informação é altamente complicada, e soluções para a sua otimização são cada vez mais necessárias. Os avanços tecnológicos vistos em anos recentes permitiram uma melhor otimização dos seus processos, e são cada vez mais adotadas soluções baseadas em dados, devido aos seus resultados precisos. O conceito de digital twin, quando aplicado a cadeias logísticas internas, tem a possibilidade de ajudar na gestão das cadeias logísticas integradas, juntamente com a sua capacidade de aumentar a perceção dos colaboradores com cargos de decisão das empresas, enquanto permite a implementação de modelos de simulação precisos. Esta tese apresenta um gémeo digital de uma cadeia logística interna de uma empresa farmacêutica, juntamente com uma ferramenta de planeamento de capacidade bruta aproximada por simulação, capaz de gerar estimativas da capacidade mensal necessária para cada área produtiva da empresa a longo prazo. O trabalho realizado foi um caso de estudo numa empresa farmacêutica. O digital twin desenvolvido inclui uma interface gráfica, com diversas perspetivas das atividades executadas no passado e no presente, e com a evolução dos indicadores de desempenho chave. A ferramenta de simulação está também incluída na interface gráfica, permitindo aos colaboradores com cargos de decisão a criação dos seus próprios cenários e a obtenção dos resultados da sua simulação.
Palavras-chave: Digital Twin, Cadeia Logística Interna, Investigação Operacional, Planeamento da Capacidade Bruta Aproximada por Simulação, Cadeia Logística Farmacêutica
Abstract
The integrated supply chain is a network in which all the business areas depend on each other. While these are extremely powerful structures, the logistics behind such interconnected networks of people, products, machines and information are highly complex, and solutions for their optimization are increasingly required. The technological advancements of recent years have allowed for better optimization of supply chain processes, and data-driven solutions are being extensively adopted due to their accurate results. The concept of the digital twin, when applied to the internal supply chain, can aid in the management of the integrated supply chain, increase awareness among the company's stakeholders and decision-makers, and allow the deployment of accurate simulation models. This thesis presents a digital twin of a pharmaceutical internal supply chain, along with a simulation-based rough cut capacity planning tool capable of giving long-term estimates of the required monthly capacity for the different areas of the organization. The developed digital twin offers a graphical user interface with several views into the tasks performed in the past and present and into the evolution of the key performance indicators. The simulation tool is also included in the user interface, giving decision-makers the possibility of creating their own scenarios and running the corresponding simulations.
Keywords: Digital Twin, Internal Supply Chain, Operations Research, Simulation-based Rough
Cut Capacity Planning, Pharmaceutical Supply Chain
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1 Introduction 1
1.1 Pharmaceutical Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Pharmaceutical SCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Supply Chain 4.0 7
2.1 Digital Twin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Literature review & Commercially Available Solutions . . . . . . . . . . . . . . . . . 9
2.1.3 Examples of Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.4 Advantages and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Supply Chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Enterprise Resource Planner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Production Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.3 Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.4 Integrated Supply Chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Proposed Solution and Expected Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Knowledge Extraction 25
3.1 Collected Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.1 Processes Duration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Distributions Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Selecting the Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.2 Data Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.3 Outlier Identification and Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.4 Data Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.5 Fitting the Distributions to the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.6 Results of the fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Simulation-Based Rough Cut Capacity Planning 43
4.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Code Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.4 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5 Digital Twin User Interface 69
5.1 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6 Conclusions 75
6.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.2.1 Quality & quantity of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.2.2 Improving the Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Bibliography 79
A Goodness-of-fit Tests Comparison A.1
A.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1
A.2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.2
A.3 Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.2
B Digital Twin User Interface Screenshots B.1
List of Tables
2.1 Frequencies of appearances of digital entities against the type of the study [27] . . . . . . 9
2.2 Time horizons for the different S&OP cycle stages [5] . . . . . . . . . . . . . . . . . . . . 18
3.1 Results of the optimization of the PDFs parameters . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Example scenario of orders to be sampled . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Efforts Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3 Monthly and area-wise relative error (with sign) of the median per iteration, compared to
50000 iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Monthly relative and absolute (with sign) errors for manufacturing, QA and Warehouse . . 61
4.5 Monthly relative and absolute (with sign) errors for QC IPC, release and release review . 62
4.6 Occurrences of real monthly capacities being within the 1 or 2 IQR . . . . . . . . . . . . . 62
4.7 Consumed capacity percentage for each type of optimization . . . . . . . . . . . . . . . . 67
A.1 Example 1 goodness-of-fit values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1
A.2 Example 2 goodness-of-fit values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.2
A.3 Example 3 goodness-of-fit values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.2
List of Figures
1.1 R&D productivity evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Basic diagram of the current pharmaceutical Supply Network . . . . . . . . . . . . . . . . 4
2.1 Basic schematic of how a DT works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 High-level external SC relationships scheme [33] . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Pharmaceutical CDMO SC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Internal SC relationships scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Automation Pyramid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Production Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 Typical evolution of the capacity through the time horizons . . . . . . . . . . . . . . . . . . 19
3.1 Extracted dates and their chronological relation to the real processes. . . . . . . . . . . . 27
3.2 Binomial, negative binomial and Poisson distributions for different parameters . . . . . . . 30
3.3 Examples of PDFs from the obtained data . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Outliers filter results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Mean vs Standard Deviation of the projects . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Statistical properties parallel plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.7 Examples of Kurtosis and Skewness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.8 Cullen and Frey graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.9 Results of the PDF fitting process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.10 ECDFs and how the theoretical PDFs fit to them . . . . . . . . . . . . . . . . . . . . . . . 42
4.1 Capacity Planning Process [38] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Examples of truncated distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Example representation of the assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 Example representation of the manufacturing tasks scaling . . . . . . . . . . . . . . . . . 49
4.5 Evolution of a month’s capacity distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.6 Evolution of the median of the monthly capacity utilization (%) by area . . . . . . . . . . . 55
4.7 Evolution of a month’s capacity distributions for QC and warehouse. . . . . . . . . . . . . 56
4.8 Non-normal examples of distributions at 50000 iterations . . . . . . . . . . . . . . . . . . 56
4.9 Evolution of the number of BA interferences versus the number of iterations . . . . . . . . 57
4.10 Code Efficiency of regular versus parallel computation . . . . . . . . . . . . . . . . . . . . 58
4.11 Code Efficiency – 3 loops vs a single loop . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.12 Code Efficiency – 6 vs 7 vs 8 CPU cores used . . . . . . . . . . . . . . . . . . . . . . . . 59
4.13 Capacities validation graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.14 Forecasted capacity evolution per month and area . . . . . . . . . . . . . . . . . . . . . . 64
4.15 Percentage of maximum capacity utilized per month and area . . . . . . . . . . . . . . . . 64
4.16 Gantt chart of the BA’s utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.17 Gantt chart of the BA’s utilization after optimization . . . . . . . . . . . . . . . . . . . . . . 65
4.18 Monthly capacities per area after optimization . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1 Examples of maps shown in the Overview tab . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Map before and after being clicked . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3 Network graph of a project and corresponding map of buildings . . . . . . . . . . . . . . . 71
5.4 Example representation of the schedule of activities . . . . . . . . . . . . . . . . . . . . . 72
5.5 Gantt representation of the recipe of a project . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.6 Example of PDFs of manufacturing, QR and adherence to start date of a project . . . . . 74
A.1 Example 1 goodness-of-fit results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1
A.2 Example 2 goodness-of-fit results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.2
A.3 Example 3 goodness-of-fit results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.3
B.1 Overview tab screenshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.1
B.2 Activities by building tab screenshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.1
B.3 Activities by project tab screenshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.2
B.4 KPIs tab screenshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.2
B.5 Projects schedule Gantt chart tab screenshot . . . . . . . . . . . . . . . . . . . . . . . . . B.3
B.6 Projects database example view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.3
B.7 Example of modal help window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.4
B.8 RCCP: main view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.4
B.9 RCCP: options modal window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.5
B.10 RCCP: start simulation modal window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.5
B.11 RCCP: existing scenarios to be loaded . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.6
Acronyms
BA bottleneck asset.
BOM bill of materials.
BPR batch production record.
CDF cumulative distribution function.
CDMO contract development and manufacturing organization.
CMO contract manufacturing organization.
CS Chi-squared goodness-of-fit test.
DT digital twin.
ECDF empirical cumulative distribution function.
EDD earliest due date.
ERP enterprise resource planner.
FP final product.
GoF goodness-of-fit test.
IN intermediate product.
IoT internet of things.
IPC in-process control.
IQR interquartile range.
KPI key performance indicator.
LSD latest start date.
MC Monte Carlo.
PDF probability distribution function.
PP production planning.
QA quality assurance.
QC quality control.
QC R QC release.
QC RV QC release review.
QR quality release.
R&D research and development.
RCCP rough cut capacity planning.
RM raw material.
S&OP sales and operations planning.
SC supply chain.
SC4.0 supply chain 4.0.
TU time unit.
UI user interface.
Chapter 1
Introduction
The 21st century has emerged as a century of technological innovation and integration, in which modern technologies are being quickly and comprehensively extended to every field. Data has become increasingly important and abundant, and its large quantities support many deductions, aiding in the generation of accurate predictions.
The supply chain (SC) is an area where digitalization can bring numerous advantages. These advantages can be even more significant for the integrated SC, which envisions the SC as an interconnected collection of areas that seamlessly interact with each other, allowing for better resource management and overall improved performance. The adoption of the most recent technological advances can help successfully manage this complex network of areas, resources and entities. The internal SC of an organization, which concerns the interactions between the company's own areas and agents, is possibly the most promising recipient of these new technologies, allowing for better internal logistics, which are often extremely complex and difficult to maintain.
Furthermore, production planning and scheduling, which are central topics of the SC, can greatly benefit from these advances. The possibility of generating accurate forecasts can replace the mentality of reacting to unexpected situations, commonly seen at the production planning level, with one of predicting the unexpected and planning accordingly beforehand. Planning generally deals with the selection of the most appropriate procedures to achieve the objectives of the project, while scheduling is the process of converting the scope, time, cost and quality plans into an operating timetable. Both of these areas can be substantially improved by the adoption of accurate forecasting tools that are data-driven and based on demonstrated performance.
The pharmaceutical industry has undergone extensive transformation, driven by its changing circumstances. Shah [50] states that the industry was historically characterized by good research and development (R&D) productivity (number of approved drugs divided by the investment in R&D), long effective patent lives, large technological barriers to entry and a limited number of substitutes, which resulted in a strategy based on exploiting price inelasticity to further invest in R&D and on a dependence on blockbuster products, i.e., extremely popular drugs that generate large annual sales. However, according to the author, these trends have changed dramatically in recent years, with R&D productivity declining (see figure 1.1), patent lives shortening and substitutes, such as generics (once patents have expired), emerging. Furthermore, the liberalization of the global marketplace, exposing products to competition, and the creation of stricter laws controlling drug prices have called for drastic changes in how pharmaceutical organizations operate. These circumstances have created a necessity for achieving operational excellence across the whole enterprise.
Companies in the 21st century are pressured to deploy these new technologies, since they can optimize their businesses to levels that could not be achieved before. This includes the use of several concepts, such as: the internet of things (IoT), interconnecting devices in a production plant and allowing the use of artificial intelligence to make decisions autonomously; automation, allowing machines and intelligent systems to replace workers in tedious and repetitive jobs while increasing throughput; and smart manufacturing, using powerful algorithms to optimize processes and to provide predictive capabilities based on historical data. All of these possibilities have emerged thanks to the technological advancements of recent years. The concept of Industry 4.0 has expanded into different areas with the appearance of, for example, Pharma 4.0 and supply chain 4.0 (SC4.0). Both can be defined as the application of Industry 4.0 concepts in the respective area. The SC4.0 can benefit greatly from the forecasting tools enabled by the computational power easily accessible today. Furthermore, the large amounts of data that are collected and stored allow these algorithms to achieve very accurate forecasts. Optimizing the complex agents and interconnections within the SC also becomes possible and has the potential to greatly reduce costs and increase efficiency. The adoption of the SC4.0 in the pharmaceutical industry (as a part of Pharma 4.0) can be what is necessary for achieving the operational excellence that the current paradigm of the industry demands.
This chapter provides an overview of the current state of the pharmaceutical industry and its evolution in recent years. A focus is given to pharmaceutical SC networks, stating how they are being affected by the changes seen in the industry and how the 21st century and its technological advancements can provide the improvements necessary to successfully deal with today's reality. Finally, the contributions made by this work are presented and the structure of the thesis is outlined.
1.1 Pharmaceutical Industry
Contract manufacturing organizations (CMOs) and contract development and manufacturing organizations (CDMOs) are two types of key players in the pharmaceutical industry. In fact, Shah [50] enumerates the key players of the pharmaceutical industry as listed below.
• Large R&D multinationals with presence in branded drugs.
• Generic manufacturers, producing drugs with expired patents.
• Contract manufacturers (both CMOs and CDMOs), which do not have their own product portfolio, but instead produce intermediate products or active pharmaceutical ingredients. They operate by providing outsourcing services to other companies.
• Drug discovery and biotechnology companies, often small and with limited manufacturing capacity.
Organizations of the first two types are commonly named Big Pharma companies; these are often large organizations, spread over multiple countries. These companies frequently resort to CMOs or CDMOs for the manufacturing of drugs or drug components, for a variety of reasons: (1) the lack of production capability, either constant or seasonal; (2) for the first type of organization, to allow them to focus on R&D and marketing while leaving production to external organizations.
This focus on R&D by Big Pharma companies is caused mainly by the rise of generic manufacturers, which compete on the non-patented blockbuster drugs. This tendency has forced the big multinationals to develop new drugs that may become blockbusters and that, being patent protected, cannot be produced by generic manufacturers. This behavior can be verified in the graphs of figure 1.1, which show that not only have R&D investments been rising consistently and significantly (with an average annual increase of 10.5% from 1980 to 2018), but the number of drugs approved yearly by the American Food and Drug Administration (FDA) has also been increasing since 2002. This means that even though R&D productivity has been slightly decreasing (as shown in the bottom graph of figure 1.1), the Big Pharma companies are still investing increasingly more in R&D and the number of drugs approved yearly by the FDA is also increasing. Note that the investment values refer to the USA and the approved drugs are those approved by the FDA. Even though the data shown is exclusively for the USA, this is a fair approximation, since the USA's investment corresponds to around 80% of the yearly global R&D investment [41]. Furthermore, the USA can be considered without a doubt the biggest pharmaceutical market (and therefore the most representative), with sales of new drugs corresponding to 64.1% and sales of total drugs to 48.1% of global sales (including Canada) [19]. This new necessity of developing new drugs and focusing on R&D has led to an increase in the importance of CMOs and CDMOs.
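As an illustrative sanity check of the growth figure quoted above (using only the stated 10.5% average annual increase, not the underlying investment data), the compounding effect over the 1980-2018 window can be sketched as:

```python
def compound_factor(annual_rate: float, years: int) -> float:
    """Total growth factor implied by a constant annual growth rate."""
    return (1.0 + annual_rate) ** years

# A 10.5% average annual increase sustained over 1980-2018 (38 years)
# compounds to a roughly 44-fold increase in yearly R&D investment.
factor = compound_factor(0.105, 2018 - 1980)
print(round(factor, 1))
```

This is only a rough consistency check: the reported 10.5% is an average, so the actual series need not follow a constant geometric growth.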
Figure 1.1: Graphs of (1) the yearly evolution of approved FDA drugs (bar chart) and R&D investment in billions of US dollars (line chart) [41, 34] and (2) the yearly evolution of drugs approved by the FDA per billion US dollars invested in R&D.
An important note must be made regarding the second graph of figure 1.1. The graph shows the R&D productivity, defined as the yearly number of drugs approved by the FDA divided by the yearly investment in R&D in billions of US dollars. This is merely an approximate indicator of the tendency, since the investments made in R&D in one year do not directly affect the number of drugs approved in that same year, due to the lengthy process of creating new drugs and performing the clinical trials. However, due to the linear tendency of the R&D investments, it is a fair approximation and quantifies a reality that is affecting the pharmaceutical industry.
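The productivity indicator just described is a simple ratio; as a minimal sketch (with hypothetical figures, not the data behind figure 1.1):

```python
def rd_productivity(approved_drugs: int, investment_billion_usd: float) -> float:
    """Yearly FDA-approved drugs per billion US dollars invested in R&D."""
    return approved_drugs / investment_billion_usd

# Hypothetical year: 45 approvals against a 60 B$ R&D investment.
print(rd_productivity(45, 60.0))  # 0.75 approved drugs per billion US$
```

A lag-aware variant would divide this year's approvals by the investment made several years earlier, reflecting the development pipeline; the same-year ratio is kept here because it is the definition the figure uses.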
CDMOs differ from CMOs in their development component. This R&D component is mainly of one of two types: the development of new methods for synthesizing components, or the development of new industrial processes. The first deals with the creation of new methods for producing a specific component, which are then licensed, while the second is the development of the industrial processes for the manufacturing of a certain pharmaceutical product. This is a fundamental step, since when drugs are discovered they are produced in laboratories, and their manufacturing methods are not suitable for industrial upscaling.
Due to the new conditions that the pharmaceutical industry is facing, its focus has shifted towards optimizing the whole business: not only the R&D processes, marketing and sales, but also the other SC agents that need operational excellence to allow for a smooth transition into Pharma 4.0 corporations. The optimization of the SC becomes, therefore, a fundamental requirement.
1.2 Pharmaceutical SCs
Due to the aforementioned scenario currently affecting the pharmaceutical industry and the new organizational structure seen today, with the rise of generics manufacturers, the greater focus on R&D by Big Pharma companies and the new necessity of outsourcing manufacturing to third-party organizations, a new complex supply network has emerged, connecting these different organizations.
Figure 1.2: Basic diagram of the current pharmaceutical supply network
Figure 1.2 shows a diagram of the basic workings of the modern pharmaceutical supply network. At a high level, this network features the raw material suppliers, manufacturing, R&D and the clients or final consumers.
• Raw material (RM) suppliers: third-party organizations focused on producing the basic components for the manufacturing process. These materials range from simple ones, such as acetone or even ice, to much more sophisticated components with longer lead times. Additionally, CMOs and CDMOs can produce intermediate products (INs) or active pharmaceutical ingredients which can be used as RMs in other manufacturing processes.
• R&D: the R&D area focuses on developing new drugs, procedures or techniques, with the ultimate goal of reaching production, so that the drugs become more easily accessible to the people afflicted by the diseases they target and become profitable. This means that this area "produces" the intellectual property used in manufacturing.
• Manufacturing: the different organizations produce the drugs and pharmaceutical products to be sold to clients. Note that CMOs and CDMOs do not manufacture products to be sold under their own company's name; their products, regardless of the stage of the product cycle they are in, are sold to Big Pharma or generics companies. Their clients are, therefore, not the hospitals or pharmacies, but the other manufacturing organizations, which is not explicitly shown in the diagram.
• Clients: the customers of the pharmaceutical products, often hospitals, clinics, pharmacies and other health-related private or public organizations. The end customers of the pharmaceutical products are the people who require the medical effects of the products.
The complex SC scheme existing in today's pharmaceutical industry faces several challenges and opportunities for improvement. The digitalization and integration of the SC are among the most attractive solutions for most of the problems that SCs face today. The larger number of products in the market nowadays, reflected in the increase of drugs approved by the FDA shown in figure 1.1, has led to bigger project portfolios in most pharmaceutical companies. This creates a higher complexity in the SCs of these companies, derived from a series of factors listed below.
• The need for a larger amount and a more varied set of RMs. This has a direct impact on warehouse management and requires the addition of new RM suppliers.
• New manufacturing processes are needed. These can be independent processes, which require a higher initial investment, or processes which can be partially or completely performed in existing workcenters. More complex scheduling is a clear consequence of this.
• The addition of a new project generally requires pilot batches and a subsequent process validation campaign, performed to confirm that the production recipe works as planned. Pilot batches are often more prone to delays, occasionally affecting other campaigns.
For these reasons, the tendency towards larger portfolios has increased the complexity of pharmaceutical SCs, and a new way of organizing them must be adopted. The digitalization of the SC can bring integration to its internal processes. This Integrated SC concept features a centralized way of managing and interacting with the SC agents and benefits greatly from a digital approach. Recent technologies, such as IoT, can provide great amounts of varied data, which can be used to automate processes that do not need human interaction, such as automatically contacting RM suppliers to order products according to each individual lead-time (effectively reducing time in storage), to aid in the planning and scheduling of the manufacturing and support areas' processes, or to create forecasts based on demonstrated performance.
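The lead-time-aware ordering logic mentioned above can be sketched as follows. This is only an illustration of the idea, not the system described in this work; the function name, the fixed safety margin and the example dates are all assumptions.

```python
from datetime import date, timedelta

def order_date(needed_on: date, lead_time_days: int, safety_days: int = 2) -> date:
    """Latest date on which to place an order so the RM arrives just in
    time, keeping a small safety margin to absorb supplier variability
    (hypothetical helper; names and margin are illustrative)."""
    return needed_on - timedelta(days=lead_time_days + safety_days)

# Example: a production step needs an RM on 2019-11-20 and the
# supplier's demonstrated lead-time is 10 days.
print(order_date(date(2019, 11, 20), 10))  # 2019-11-08
```

Ordering against each supplier's individual lead-time, rather than keeping a uniform buffer stock, is what reduces the time materials spend in storage.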
1.3 Objectives

The main objectives of this thesis are:
• Mapping the internal SC
– Map crucial material, data and information flows
– Establish key dependencies and layers of interaction across and within areas
• Building the internal SC DT
– Bring end-to-end visibility to operational processes
– Monitor key performance indicators based on historical data
– Conduct scenario-based forecasting, through a simulation-based rough cut capacity planning
(RCCP) tool
1.4 Contributions

This thesis was developed in a partnership between the Institute of Mechanical Engineering of the Instituto Superior Técnico (IDMEC-IST) and the pharmaceutical CDMO Hovione Farmaciência, S.A.
The solution developed in this work is a SC DT with a simulation-based RCCP tool as its simulation engine. Furthermore, a graphical user interface was developed with the objectives of (1) intuitively showing current and past information about the manufacturing plant and the internal SC agents and (2) delivering the RCCP tool in a way that allows users to perform minor modifications and refinements to the simulation's parameters and to receive its results in detail. This tool aims to supply information to the users, so that their decisions are better informed and data-driven, and will be actively used by key stakeholders as part of the new operations planning policy at the CDMO under study, contributing to the key task of RCCP.
1.5 Thesis outline

Chapter 1 describes the industry under study, its problems and opportunities, and the objectives of this work. Chapter 2 presents the SC4.0, clarifying how a DT can provide support to the SC. The concepts of DT and SC are thoroughly explained. Furthermore, the related work, proposed solution and expected results are presented. Chapter 3 deals with knowledge extraction. The chapter starts by identifying the data that was used and how it was extracted. Then, a comprehensive description of the process of fitting a theoretical probability distribution function (PDF) to the processes' measured durations is presented. Chapter 4 introduces and defines the simulation-based RCCP tool. The implementation of the tool is presented, and thorough convergence and efficiency analyses are described. Lastly, the validation of the simulation's results is performed and predictions made by the tool are delivered. Chapter 5 presents the user interface (UI) of the DT, introducing the chosen methods for translating meaningful statistical information into graphics and tables, as well as intuitive ways of interacting with the data shown and the simulations performed. Chapter 6 presents the conclusions and achievements, as well as future work.
Chapter 2
Supply Chain 4.0
The improvement of the SC is a requirement that stems from the need to achieve operational excellence in all business areas. The most viable and effective way of improving its performance is by converting it into a digital supply chain, also known as the SC4.0. Since "digital supply chain" can have two different meanings (the second being the supply chain of digital goods, such as songs or e-books), the term SC4.0 will be used instead. SC4.0s have the opportunity to reach the next horizon of operational effectiveness; to do so, they need to become much faster, more granular and more precise [1]. Employing algorithms and frameworks that can aid in optimizing the SC and the whole business can be an important step for companies to keep their competitive edge.
In fact, Hala Zeine (president of the digital supply chain at SAP AG) states that "the future of the supply chain is somewhere where the digital world and the physical world are entangled. It is where you can simulate in one and execute in another". Furthermore, she states that the latest technologies, such as 3D printing, robotics, big data, AI or blockchain, are not useful unless there is visibility into whether the processes are working or not. She concludes by asserting that the way to push SCs into the future, and to prepare companies to do so, is by giving them visibility, by creating a digital twin of their end-to-end supply chain [45].
In this chapter the concepts of DTs and SCs are discussed, and the proposed solution and expected
results are presented.
2.1 Digital Twin
2.1.1 Concept
A DT can be defined as a dynamic virtual representation of a physical object or system, using real-
time data to enable understanding, learning and reasoning [8]. Although its definition varies from source
to source, the basic idea consists of a digital representation of an asset (be it tangible [an entity] or intangible [a system]) which uses IoT to receive meaningful real-time data and reaches conclusions based on the developed model, how it has performed in the past and how it is performing at present.
The DT is an extremely attractive concept nowadays, having been distinguished in Gartner’s Top 10
Strategic Technology trends for both 2018 and 2019 [20], an annual ranking made by Gartner (an S&P
500 global research and advisory firm) that distinguishes the most promising innovative technologies.
The basic schematic of a DT is shown in Figure 2.1.
The concept of the DT was initially introduced in 2002 by Dr. Michael Grieves from the University of Michigan, only receiving its name in 2010, in a NASA roadmap document by Piascik et al. [42]. Some
[Figure content: the real asset sends real-time operating data to the digital twin (physics, statistics and AI models, supported by CAD/FEA models, maintenance history and operational data history, with automatic updates); the digital twin supplies current data, historical data and forecasts to the decision-maker, who returns parameter changes, maintenance allocation and other instructions.]
Figure 2.1: Basic schematic of how a DT works. Note that the automatic updates transition may be optional.
sources claim that the concept was coined by the USA’s Defense Advanced Research Projects Agency
(DARPA) [21], but no unambiguous evidence of such was found when reviewing the literature.
Being a broad and relatively recent concept, the DT is usually regarded as the digital replica of exclusively tangible physical assets [29], since that is its most common application. In fact, it is mostly used in situations such as mimicking machines on a production plant or a turbine in an airplane. However, the concept can clearly be applied to intangible assets as well: more recent definitions describe the DT as a digital copy of a physical system, rather than of a physical entity or asset, wording which is semantically less excluding towards intangible assets. These intangible assets can be transportation networks, economic flows, manufacturing processes, HVAC systems or, as in the case under study, SCs.
Since a DT receives real-time data and stores historical data, it is quite common for it to act as a visualization tool. Being so information-dense, it can be used to show performance indicators, states and numerical figures regarding the present or a specific point in time. This can provide additional awareness to decision-makers, so that their decisions are more data-driven. Besides this, simulation capabilities are also extremely useful in DTs; in fact, considering that the system contains both large amounts of data and the behavior and inner workings of the model, the utility of using this tool for simulation is straightforward to understand.
The use of IoT is fundamental in DTs, since it is the technology that provides the data to the models. This can be done by resorting to sensors (measuring temperature, pressure or other quantities), or to other corporate systems that collect information in a less straightforward manner, such as the number of trucks that arrive (by accessing the records of those occurrences) or the throughput of a certain production facility.
Artificial intelligence algorithms (machine learning, deep learning) are also frequently used with DTs. This is due to two reasons: (1) the current development and widespread use of powerful hardware, capable of performing complex calculations, and (2) the existence of large quantities of data already in the
DT, which is a requirement for most artificial intelligence algorithms to perform accurately. 3D modelling is also frequently related to DTs that deal with machines or other tangible assets. By modelling the asset's components and the physical interactions between them, the model is able to better forecast the asset's state, which can be used for predictive maintenance, for example.
DTs are frequently confused with both monitoring tools and simulation models. In reality, DTs bring together both concepts, effectively delivering a visualization tool with improved simulation models [32]. DTs differ from simulation models in that they receive real-time data to generate better predictions. Typically, simulation models have complete descriptions of the object or system under study, but often lack its historical performance and almost always lack its current state. By having both, the generated simulations can be verified and improved by simulating on past data, and the predictions that the model creates are based on current states, delivering more data-driven and accurate responses. DTs also supersede monitoring tools in that all the data these tools possess and display is also available in DTs. Additionally, DTs have access to forecast data, created by their accurate simulation models.
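The monitoring-plus-simulation idea can be illustrated with a minimal sketch, not any concrete DT implementation from this work; the class, the random-walk forecast and all names are assumptions made for illustration.

```python
import random

class MinimalTwin:
    """Toy digital twin: keeps the asset's measured state and history
    (monitoring role) and seeds simulations from the current state
    (simulation role). Illustrative only; all names are hypothetical."""

    def __init__(self):
        self.history = []   # past measurements
        self.state = None   # latest real-time measurement

    def ingest(self, measurement: float) -> None:
        """Automatic physical-to-digital data flow."""
        self.state = measurement
        self.history.append(measurement)

    def forecast(self, steps: int, drift: float = 0.0, noise: float = 1.0) -> list:
        """Simulate forward from the *current* state, which is what
        distinguishes a DT from a stand-alone simulation model."""
        value, path = self.state, []
        for _ in range(steps):
            value += drift + random.gauss(0.0, noise)
            path.append(value)
        return path

twin = MinimalTwin()
for reading in (20.0, 20.4, 20.9):   # incoming sensor stream
    twin.ingest(reading)
prediction = twin.forecast(steps=5, drift=0.2)
```

A plain simulation model would start its random walk from an assumed initial value; the twin starts from the last measured one, which is the distinction the paragraph above draws.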
2.1.2 Literature review & Commercially Available Solutions
Kritzinger et al. [27] make a clear distinction between three types of digital entities that vary in their level of integration: digital models, digital shadows and DTs. The main difference between the three is the data flow between the physical and digital objects: digital models have manual data flows from the physical object to the digital one and vice-versa; digital shadows have automatic data flow from the physical object to the digital one, but manual flow from the digital object to the physical one; and DTs have automatic data flows in both directions. In their literature review, the authors analyzed a series of scientific articles regarding DTs, whether conceptual, review, case-study, definition or study, and determined the actual type of digital entity that each article was referring to, according to the authors' experience. What was shown was that the majority of the articles that regarded DTs were actually describing other digital entities with a smaller level of integration. In fact, as can be seen from table 2.1, only 18.6% of the articles featured a digital entity with a level of integration that allowed the authors to consider it a DT. Note that the study regarded DTs in manufacturing.
Integration Level   Case-study   Concept   Definition   Review   Study    Total

Undefined              4.7%       11.6%       0.0%       2.3%     0.0%    18.6%
DM                    11.6%       14.0%       0.0%       0.0%     2.3%    27.9%
DS                     7.0%       25.6%       0.0%       2.3%     0.0%    34.9%
DT                     2.3%        2.3%       4.7%       9.3%     0.0%    18.6%

Total                 25.6%       53.5%       4.7%      14.0%     2.3%   100.0%
Table 2.1: Frequencies of appearances of digital entities against the type of the study [27]
Besides scientific articles and academic applications of DTs, there are a few commercially available
implementations of it. It is important to mention that most commercial applications of DTs are of tangible
physical assets, such as machines or production facilities, rather than intangible ones.
• SAP: the company places the DT at the center of the design-operate-manufacture-deliver chain. At the implementation level, SAP offers the Asset Information Workbench, which creates a DT to help manage the enterprise's assets by creating a digital replica of its physical assets, processes and systems.
• Anylogic & Simio: both simulation software packages include DT frameworks that allow enterprises to implement the concept. Being simulation software packages, implementing a DT is rather straightforward: in its most basic form, adding real-time data to the models is all that is needed for the simulated assets to become digital twins. Furthermore, Anylogic features several case-studies of DT applications.
• Ansys & Siemens NX: although the level of integration that these finite element modelling software packages offer with regard to DTs is not clear, their use is paramount in DTs of tangible assets. These packages simulate physical phenomena on the assets, and the addition of real-time data measurements can further improve the quality of the results.
• Aspentech Aspen Mtell: Aspentech provides supply chain planning solutions and its Aspen Mtell
tool includes optimized production scheduling based on an integrated digital twin maintenance
model.
• PWC's Bodylogical: a DT of the human body, which harnesses data from, for example, fitness trackers. PWC claims that it could help people better manage their health and stick to their doctors' orders, and help pharmaceutical companies and governments better understand global health problems and employ counteractive measures to regulate them.
2.1.3 Examples of Applications
Several examples of DTs are presented below, which can help the reader better visualize how they
are usually implemented.
• DT in energy production: GE engineers developed a DT of the Haliade 150-6 wind turbine's yaw motor. The objective of the DT was to simulate the temperature at various parts of the motor, using its physical model and a series of sensor data. This allowed for better temperature monitoring, which ultimately reflects how the motor is being used. By using simulation software and the real data collected, the temperature at any point of the motor can be estimated fairly accurately. Furthermore, this allows for predictive maintenance, effectively reducing downtime. [43]
• DT in healthcare: Bruynseels et al. [9] present a form of therapy as digitally supported engineering. In it, a virtual patient is created as a model of a person, made possible by experimental big data collected using advanced technologies, from the molecular to the macroscopic scale. This creates a virtual model of the human body. Then, using sensor data currently employed in fitness trackers, for example, the DT could track the health condition of the individual and forecast problems with the user's health. The concept of the DT in healthcare has also been applied to individual organs, such as the creation of a model of the human heart [49].
• DT in fleet management: as part of the EU OPTIMISED project, Alstom developed a DT to enable
the correct scheduling and maintenance management of their UK fleet of trains. The DT dealt with
daily operating requirements, maintenance regimes, capacity and abnormal cases of accidents
and failures. The solution created included an interactive UI showing the trains' current locations and an estimate of their future locations. [3]
• DT in the automotive industry: DTs can be of great use in the automotive sector, from the vehicles' operation to their manufacturing or sales. Sharma and George [51] explore this possibility: while driving, a DT of the vehicle could combine the real world with the digital world (for example, the navigation system) into an Augmented Reality solution, delivering real-time information to the driver, effectively allowing them to make more data-driven decisions and reducing the distractions caused by currently employed systems. In more advanced approaches to this DT, it could actually control the vehicle or give driving assistance to the driver (becoming an autonomous vehicle). The DT would interact with the real vehicle, which in turn would supply it with real-time odometry data.
• DT in crisis scenarios: [44] presents a conceptual example of a DT in the form of an interactive audio assistant which deals with the water and sewer system of a city. The example shows the DT alerting the operator that a pipe burst may have happened (through real-time pressure measurements). It then supplies the operator with the required information, for instance the location of the burst or the anticipated impact. Finally, it gives response options and performs the commands given by the operator, such as notifying emergency response teams, assessing risk and alerting authorities. Note that this example differs from a regular audio assistant in that it knows the system internally and receives real-time data to estimate its real-time behavior and performance. The audio interactivity is merely an interaction method, similar to a UI.
• DT in manufacturing: Kritzinger et al. [27] present a compilation of applications of manufacturing DTs, with their corresponding opportunities.
– Production planning and control: order planning according to statistical assumptions; im-
proved decision support using detailed diagnosis; automatic planning and execution.
– Maintenance: ability to identify the effect of state changes on the processes of the production
system; evaluation of machine conditions based on descriptive methods and machine learn-
ing algorithms; achieve better insights into the machine’s health by analyzing process data at
distinct phases of the product’s lifecycle.
– Layout planning: continuous production system evaluation and planning; independent data
acquisition.
• DT of an entire country: the UK National Infrastructure Commission has suggested the creation of a DT of the entire country, mapping power production, water management, communications, meteorological and demographic history, and transportation networks. With this complex model and the astronomical amounts of data it would produce, the Commission would have insight into almost everything happening at a given time and would be able to answer strategic questions regarding investments or improvements to infrastructures or organizations, such as: what would the impact of closing a road be on traffic; is it possible to avoid building a new hospital car park by managing appointment times and traffic flows. [8]
2.1.4 Advantages and Limitations
The advantages of DTs can be summarized as improved awareness and predictive capabilities. Having access to real-time sensor data and historical data, the user of the DT has information regarding the state of the represented asset and can make deductions about seasonality, for example, based on its history. A more detailed model and a more comprehensive collection of data can lead to more accurate and more diverse information being available. Furthermore, since the DT contains the model of the asset, data that is not explicitly collected can be calculated and presented to the user, allowing additional information to be displayed. Consider, as an example, a production plant which consistently tracks a production process. Although it does not explicitly measure the stocks of the raw materials used in production, by knowing the quantities ordered and the quantities used in the manufacturing process (from the production model), the stocks can be tracked and reported to the user.
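The stock-tracking example above can be sketched as a running balance; this is a toy illustration with hypothetical names and figures, not the plant's actual accounting: inferred stock is the initial quantity, plus everything ordered, minus the consumption implied by the production model.

```python
def inferred_stock(initial: float, orders: list, batches_produced: int,
                   rm_per_batch: float) -> float:
    """Infer a raw-material stock level that is never measured directly:
    balance = initial + everything ordered - everything the production
    model says was consumed (toy example; names are hypothetical)."""
    consumed = batches_produced * rm_per_batch
    return initial + sum(orders) - consumed

# 100 kg on hand, two deliveries of 50 kg, 12 batches at 15 kg of RM each
print(inferred_stock(100.0, [50.0, 50.0], 12, 15.0))  # 20.0
```

The point is that the derived quantity comes entirely from data the plant already records, which is how a DT can display information it never measures.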
The opportunity presented by DTs in terms of presenting information also comes with a challenge in data representation. It is extremely important to convey the information in a manner that is clear and precise. Additionally, these representations often need to be interactive and to allow users to filter the data shown to their needs. Several aspects of the chosen representation, such as the type of graph or the color scheme, are frequently deemed unimportant and irrelevant but are of extreme importance for transmitting information clearly.
Regarding the predictive capabilities of DTs, their advantage is quite straightforward: a precise model of the physical asset, receiving meaningful and accurate data, creates forecasts, from manufacturing to maintenance or planning and scheduling. Manufacturing forecasts can give insights on when to order raw materials; predictive maintenance can predict when maintenance is necessary for a specific asset, allowing it to be planned ahead, effectively reducing downtime and reactive maintenance; and accurate insights into future manufacturing processes, maintenance durations and expected delays, for example, allow for better planning and scheduling, which can account for these situations and increase the overall performance of a business.
An additional, frequently overlooked advantage of DTs is the ability to simulate scenarios in a risk-free environment. Since the DT contains a representation of the actual asset, simulations can be run that would not be safe to perform on the physical asset. Consider, for example, removing the scheduled maintenance of a certain asset and simulating how long it would run until failure; the simulation has no impact, but performing that experiment on the physical asset could have catastrophic implications.
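A risk-free run-to-failure experiment of this kind can be sketched as a Monte Carlo simulation. This is a hypothetical illustration: the exponential failure distribution, its mean and the function name are all assumptions, as the real failure model would come from the DT itself.

```python
import random

def simulate_run_to_failure(n_runs: int, mean_tbf_days: float, seed: int = 0) -> float:
    """Risk-free what-if: with scheduled maintenance removed, draw
    exponentially distributed times-to-failure and return the average.
    (Illustrative only; the real asset is never touched.)"""
    rng = random.Random(seed)
    runs = [rng.expovariate(1.0 / mean_tbf_days) for _ in range(n_runs)]
    return sum(runs) / n_runs

avg = simulate_run_to_failure(n_runs=10_000, mean_tbf_days=90.0)
# avg approaches the 90-day mean as n_runs grows
```

Running the experiment ten thousand times costs milliseconds; running it once on the physical asset could destroy it, which is the asymmetry the paragraph above describes.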
The disadvantages of DTs are mainly the challenges they present in implementation: the initial cost of sensor installation and the effort of creating a comprehensive model of the asset. A more detailed model that receives accurate and varied data is more capable of making forecasts, but the difficulty of implementation increases with the level of detail. Furthermore, especially in non-corporate
DTs, such as those in healthcare or automotive applications, the collection of data to be supplied to global models to improve overall model effectiveness may be seen as privacy infringement. For all models, data encryption is also a challenge.
2.2 Supply Chain
An SC is a complex network of agents, such as organizations or people, which work in a connected manner with the objective of delivering a product or service to a customer. These networks feature the processes of converting raw material into the final product (FP) and all the logistics that this entails. Mentzer et al. [33] define the SC as "a set of three or more entities (organizations or individuals) directly involved in the upstream and downstream flows of products, services, finances, and/or information from a source to a customer".
Figure 2.2: High-level external SC relationships scheme [33]
SC management is a term customarily used when studying SCs. In fact, this is a fundamental area that regularly needs to be optimized; doing so could improve the company's efficiency and effectiveness, enabling smoother operation. Mentzer et al. [33] define SC management as "the systemic, strategic coordination of the traditional business functions and the tactics across these business functions within a particular company and across businesses within the supply chain, for the purposes of improving the long-term performance of the individual companies and the supply chain as a whole". In simpler terms, this means organizing the SC agents and the interactions between them, so as to improve the performance of the whole company.
In this work, only the internal SC will be studied, which excludes the importing of raw materials from external suppliers and the exporting of finished goods to external clients. Note that, relative to the diagram of figure 2.2, the suppliers and customers are excluded, leaving mainly the organization and its direct influences, such as market research firms. It then makes sense to expand the organization block into a more detailed view, exposing the internal connections; this is shown in figure 2.4. Examples of internal SC agents are the production areas (which can be subdivided into specific areas), quality control (QC), quality assurance (QA), warehouse, management and marketing, while examples of processes between these agents are the delivery of raw materials from the warehouse to a production area, the delivery of FPs from a production area to the warehouse, or the QC of a raw material that needs to finish before production can start. The global view of the SC of a typical CDMO company is depicted in figure 2.3. For this work, only the processes (referenced as "Process step i" in the figure) will be considered, along with all their supporting areas.
It is important to understand the internal processes that happen in a CDMO, since they define how the business operates and what the specific internal SC of such organizations looks like. The main
[Figure content: raw material suppliers → storage → process step 1 → … → process step n → shipment to client, supported by raw material tests, in-process control (IPC), stability tests, intermediate product tests and final product tests.]
Figure 2.3: Pharmaceutical CDMO SC
agents of these organizations are itemized below.
• Manufacturing: the productive areas of the business, where RM or IN are transformed into FPs or other INs. Often, organizations comprise several manufacturing areas which are independent of each other in terms of assets, be they machines or workforce, but which consume capacity in the support areas. After the productive tasks of a campaign are completed, the manufacturing teams are also responsible for reviewing the batch production record (BPR).
• QC: the area that receives samples and verifies whether or not they conform to pre-defined standards. Several areas use QC for a number of reasons:
– Manufacturing requires QC analysis during production (in-process control (IPC); production sometimes halts while waiting for the results of the analysis) and after the production of FPs or INs.
– After the productive tasks, the analytical packages have to be released by QC. This involves two stages: the QC release (QC R) and the QC release review (QC RV), performed right after the QC R. A certain campaign can have several analytical packages to be released and reviewed, regarding different operations; each of these is paired, and the review stage of a certain operation only starts after the release stage of the same operation. Generally speaking, all the QC R stages start immediately after production.
– Raw material received from external suppliers has to be verified by QC.
– R&D departments need frequent QC analysis.
– QC performs regular stability analysis on products that are in the warehouse. These analyses are scheduled months in advance and therefore less prone to scheduling complications.
QC in pharmaceutical CDMOs is an area which has been extensively studied. Examples are the work by Costigliola et al. [16], providing a simulation model for optimizing QC's workflow, and the work by Lopes et al. [30], a simulation-based decision support system for resource planning and scheduling.
• QA: in the context of providing support to manufacturing operations, QA is the area that approves the analytical package (after all the QC reviews are performed) and the BPR (after the manufacturing area reviews the BPR). This process tends to take more time and require greater effort when
dealing with FP production rather than intermediate products.
• Warehouse: the area that stores and delivers resources to and from the other areas. The processes performed by this area are mainly resource storage and delivery (of raw materials, FPs, intermediate products, by-products, co-products and other types of resources, such as packaging material) and quantity measurement. This area interacts with manufacturing (dispensing raw materials and receiving FPs), QC (dispensing samples and receiving information on the quality status), QA (receiving information on whether the FP has been approved for shipping) and R&D (similarly to manufacturing).
• R&D: the area responsible for the discovery and development of new chemical synthesis routes, industrial processes and QC strategies. It includes both GMP and non-GMP laboratories: GMP laboratories follow the good manufacturing practices, while non-GMP laboratories follow only the good laboratory practices (GLP). The second type of laboratory offers more freedom to the chemists, but the developed methods always have to be converted to GMP for validation batches and for industrial production. The complete description of the two practices can be found in [23]. The R&D area interacts with the warehouse and QC, in the sense that it requires materials from the warehouse and quality control tests on its samples.
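The QC release/review pairing described above, where each QC RV may start only after its matching QC R finishes, can be sketched as a simple precedence rule. This is a toy illustration, not the scheduling model of this work; the function name, the assumption that every QC R starts at production end, and the review duration (half the release's) are all hypothetical.

```python
def schedule_qc(release_durations: list, production_end: float) -> list:
    """Toy schedule for paired QC release (QC R) and release review
    (QC RV) stages: every QC R starts at production end, and each QC RV
    starts only when its own QC R ends (hypothetical simplification:
    the review takes half the release's duration)."""
    schedule = []
    for duration in release_durations:
        r_start, r_end = production_end, production_end + duration
        rv_start, rv_end = r_end, r_end + duration / 2
        schedule.append({"QC R": (r_start, r_end), "QC RV": (rv_start, rv_end)})
    return schedule

# two analytical packages, released in parallel after production ends at t=10
packages = schedule_qc([4.0, 6.0], production_end=10.0)
# each package's QC RV starts exactly when its own QC R ends
```

The pairing constraint, rather than the durations themselves, is what matters when such stages are fed into a capacity-planning simulation.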
Although the areas mentioned above are the ones responsible for and supporting production, a few more areas present in the internal supply chain are of extreme importance, such as IT, management, sales, marketing, finance, human resources and purchasing. The relationship scheme between the main areas is shown in figure 2.4, depicted as a business process model and notation (BPMN) graph. BPMN graphs are graphical representations of business processes, with the objective of supporting business process management, for both technical and business users. Note that the BPMN shown can be viewed as a more granular view of a process step from figure 2.3.
The manufacturing process at CDMOs is generally as described in the internal SC relationships scheme of figure 2.4. Several distinct types of products can be manufactured, considered either FPs or INs. The latter are generally stored and used in the production of an FP or of another intermediate. However, these processes tend to behave differently when a new product enters the company's product portfolio. After passing the clinical trials (the long process of discovering a drug and having it approved by the responsible entities), one or several validation batches have to be performed.
2.2.1 Enterprise Resource Planner
An enterprise resource planner (ERP) is a tool which delivers the ability to integrate a suite of business applications, according to Gartner, Inc. The majority of this work will focus on operations, more specifically on material management and production planning (PP).
The importance of the ERP can be clearly observed through the automation pyramid, which is shown
in figure 2.5. The automation pyramid is defined as the pictorial representation of the distinct levels of
automation in an organization. As depicted in the graph, the higher the position in the pyramid, the
[Figure content: BPMN swimlanes for Planning and Scheduling, Warehouse, Manufacturing, QC and QA. A planned process order (PO) arrives and a process request (PR) is created in the ERP; RM or IN is requested and shipped by the warehouse to manufacturing; the QC lab analyzes IPC samples during manufacturing and FP/IN samples afterwards; the BPR is reviewed and QA releases the analytical package and the BPR; if the FP is approved by QA, it is stored and shipped by the warehouse and P&S is informed, otherwise the product is rejected.]
Figure 2.4: Internal SC relationships scheme
more information-dense the resource becomes. In fact, comparing the ERP with field-level sensors, the
amount of information contained is completely different: from a single data point per second or
millisecond (assuming that the sensor does not include memory) to much more processed information
spanning decades of data collection. Similarly, in terms of quantity of devices, sensors exist in large
numbers while the ERP is a single solution.
[Figure 2.5: Automation Pyramid [6]. From bottom to top: sensor/actuator and process control (field level), advanced control & diagnostics and visualization (SCADA level), MES and ERP (enterprise level). Moving up the pyramid, the amount of data grows and the number of components decreases, while time constraints relax from milliseconds at the field level through seconds, minutes, hours and days up to years at the enterprise level. MES ≡ manufacturing execution system, SCADA ≡ supervisory control and data acquisition]
The utility of this central system is clear: it stores, manages and controls data collected at lower
levels of the automation pyramid. Furthermore, it can perform optimizations based on demonstrated
performance and supports a wide range of automatic behaviors. Bajer [6] states that company
executives often rely on information from the ERP to make critical decisions in near real-time:
it is the enterprise level of the automation pyramid.
2.2.2 Production Planning
[Figure 2.6: The production cycle flowchart: customer demand feeds the sales forecast, which drives production planning (material & capacity requirements), followed by production scheduling, start of production, production control, production, inspection and quality control, quality assurance, and dispatch to customers. Note the production planning stage within the cycle, represented as the orange box.]
The production cycle corresponds to the sequence of planning,
scheduling and execution steps involved in the manufacturing pro-
cess. This cycle is shown graphically in the diagram from figure
2.6. It shows that the process often repeats itself provided there
is interest from a customer (this interest also aids in driving the
sales forecasts). The central block in this diagram, in terms of lo-
gistics, planning and scheduling is the PP. PP tends to be one of
the most fundamental stages in the internal SC of an organization.
Carefully organizing the materials, workers and workcenters a few
months ahead can be of extreme importance, especially in highly
dynamic environments with little flexibility in the short-term. This
means that a good PP optimization is paramount in obtaining SC
operational excellence.
Another important concept which is deeply connected with the
production cycle is the sales and operations planning (S&OP) and
the S&OP cycle. The American Production and Inventory Control Society (APICS – a non-profit or-
ganization for supply chain management) defines S&OP as the ”function of setting the overall level of
manufacturing output and other activities to best satisfy the current planned levels of sales, while meeting
general business objectives of profitability, productivity, competitive customer lead times, as expressed
in the overall business plan” [17]. The S&OP cycle then comprises the sequential, continuously
repeating stages of a corporate plan, featuring different objectives at different stages of the cycle.
A production plan is made systematically, for a given time period, known as the planning horizon.
Generally, four planning horizons can be distinguished. These time horizons are also often called S&OP
time fences, since they bound different stages of the S&OP cycle.
• Strategic Horizon: horizon beyond the long-term, which deals with strategy rather than planning.
It is within this horizon that management evaluates the impacts of, e.g., increasing the available
capacity or workforce. Objectives are also defined during this period.
• Long-term Horizon: horizon that features planned orders (orders that will almost certainly hap-
pen) and opportunities exploration. The latter refers to results from forecasts, with the objective
of capacity optimization. Generally, capacity utilization is measured monthly, which means
that it is merely an estimation.
• Medium-term Horizon: horizon when orders’ plans are fixed – no new orders are accepted
(unless agreed by management or due to major production delays) and there should not be
changes bigger than one week.
• Short-term Horizon: horizon where scheduling is dealt with. Individual tasks for specific projects
are allocated to workcenters and operators. During this period, no changes to the plans are al-
lowed, except when caused by manufacturing delays (changes no larger than one shift).
These time horizons are of extreme importance since they define the different planning stages. Ad-
ditionally, the aim within each time horizon changes greatly, from an operational and task-oriented point
of view in the short to medium-term, to a planning and capacity focus in the long-term, to a strategic and
managerial perspective in the strategic horizon.
Although there is no clear consensus on the exact values of these time fences, which are highly
dependent on the industry in question or even on the specific organization, according to several
sources the time fences are distributed as follows.
Horizon | Short-Term | Medium-Term | Long-Term   | Strategic
Time    | 1-8 Weeks  | 1-3 Months  | 1-24 Months | 3-5+ Years

Table 2.2: Time horizons for the different S&OP cycle stages [5]
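These fences can be encoded as simple configuration data. The sketch below is illustrative only: the bounds are the values quoted in table 2.2, converted to approximate week counts (the exact values are industry- and organization-specific assumptions, not part of the thesis).

```python
# Illustrative encoding of the S&OP time fences from Table 2.2.
# Bounds are in weeks; the month/year conversions are approximations.
SOP_TIME_FENCES = {
    "short-term": (1, 8),      # 1-8 weeks
    "medium-term": (4, 13),    # ~1-3 months
    "long-term": (4, 104),     # ~1-24 months
    "strategic": (156, None),  # 3-5+ years; open-ended upper bound
}

def horizon_for(week: int) -> str:
    """Return the first S&OP horizon whose fence contains the given
    week offset (fences overlap, so dict order resolves ties)."""
    for name, (lo, hi) in SOP_TIME_FENCES.items():
        if week >= lo and (hi is None or week <= hi):
            return name
    raise ValueError(f"week {week} is before the first fence")
```

For example, `horizon_for(10)` falls outside the short-term fence (1-8 weeks) and is therefore classified as medium-term.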
2.2.3 Capacity
In a manufacturing organization, available capacity is measured for a given production plant, area or
workcenter and for a specific range of time. It corresponds to the total available time (in the considered
period) multiplied by the number of resources related to the selected scope. This can be expressed both
in worker ·hours or machine ·hours. Capacity utilization is then a measure of how intensively a resource
is being used at a given time, with relation to its available capacity [55]. A similar concept is the effort
(or utilized capacity) that a certain activity requires. It can be simply defined as the amount of work that
is required to complete it. Both capacity and effort can be measured in terms of [worker · hours]. For
example, a specific task that requires an effort of 100 [worker · hours] means that it can take 50 hours
with 2 workers at any given time, 10 hours with a team of 10 workers or any combination of time and
workers as long as t · w = 100. However, the time that the operation takes is not arbitrary, especially in the
production of chemicals, which often have specific reaction times; these require a given duration, which
cannot be shortened with additional workers. Since both capacity and effort are directly proportional
to the number of workers at any given time, the required monthly capacity can be used to estimate
the number of workers required at any time. Consider the example in equation 2.1, where a specific
area requires 3500 [worker · hours] in a month, with units w ≡ worker, sh ≡ shift, h ≡ hour,
m ≡ month, d ≡ day:

\[
\frac{3500\left[\frac{w\cdot h}{m}\right]}{30\left[\frac{d}{m}\right]\cdot 8\left[\frac{h}{sh}\right]\cdot 3\left[\frac{sh}{d}\right]}
= \frac{3500}{30\cdot 8\cdot 3}\left[\frac{w\cdot h\cdot m\cdot sh\cdot d}{m\cdot d\cdot h\cdot sh}\right]
= 4.86\ [w] \tag{2.1}
\]
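The same calculation can be sketched programmatically: the function below simply divides the monthly effort by the available hours per worker per month (days × hours per shift × shifts per day), reproducing equation 2.1.

```python
# Required-workers estimate from equation 2.1: monthly effort
# [worker-hours/month] over the hours each worker provides per month.
def required_workers(monthly_effort_wh: float,
                     days_per_month: int = 30,
                     hours_per_shift: int = 8,
                     shifts_per_day: int = 3) -> float:
    hours_per_worker = days_per_month * hours_per_shift * shifts_per_day
    return monthly_effort_wh / hours_per_worker

print(round(required_workers(3500), 2))  # 4.86, as in equation 2.1
```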
For a specific product, the amount of effort necessary can be defined a priori. Both the effort and
duration of the processes are subject to slight changes from their a priori values, due to the variability
inherent to the processes and production plants. These changes, even if relatively small, can have
a substantial impact on the overall productive activities and support areas. The variability in terms of
duration can affect the schedule of the tasks, while the variability in the efforts can affect the capacity
occupancy. To solve this problem, buffers are generally used; these can be in terms of duration, e.g.
scheduling a machine for a longer period than necessary to account for eventual delays in the process,
or in terms of effort, e.g. setting capacity limits below the maximum capacity to account for unexpected
capacity increases or losses of available capacity due to a sick worker. However, adding these buffers
decreases productivity: adding, for example, a 10% time buffer to a specific task means that the next
task can only start after the buffer ends, so whenever the buffer turns out not to be needed, that time
is wasted. Better predictions of duration and effort can therefore reduce the buffer size and thus
increase productivity.
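As a rough illustration of this trade-off (the numbers are invented, not taken from the thesis), the idle time introduced by a fixed fractional duration buffer is the part of the buffer that the actual delay does not consume:

```python
# Idle time left over by a fractional duration buffer. The scheduled
# slot is duration * (1 + buffer); the successor task may only start at
# the end of the slot, so any unused buffer becomes idle time.
def idle_hours(task_duration_h: float, buffer_fraction: float,
               actual_delay_h: float = 0.0) -> float:
    buffer_h = task_duration_h * buffer_fraction
    return max(buffer_h - actual_delay_h, 0.0)

print(idle_hours(50, 0.10))        # 5.0 idle hours if no delay occurs
print(idle_hours(50, 0.10, 3.0))   # 2.0 idle hours if the task runs 3 h late
```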
Measuring the baseline capacity of a production plant, for each individual productive area and the
supporting areas, such as QA, QC or warehouse, is a very important step, since it defines the limit
of asset utilization in each area; in regular conditions, there are no additional workers or workcenters
with which to increase that capacity.
[Figure 2.7: Typical evolution of the capacity through the time horizons. Allocated capacity (%) is plotted from 0 to 100 against the short-term, medium-term and long-term horizons.]
Figure 2.7 shows a typical evolution of the allocated capacity along the different time horizons, de-
fined in the previous section. Note that the strategic horizon is not included in the graph since it does not
deal with capacity allocation or resources but instead with managerial decisions. The types of orders
that exist and their impact on the allocated capacity are defined below.
• Planned Order: order placed by a client with a specific deadline and that can be considered
as a confirmed order (that will almost certainly happen). This type of order is placed within the
long-term time horizon, having the possibility of being placed within the medium-term horizon on
certain occasions, if so instructed by management. These orders include mainly information about
the product to be produced, the quantity required, and the deadline agreed upon – the client has
to take into account the lead-times offered by the manufacturing company.
• Process Order: a planned order is converted into a process order mainly during the medium-
term time horizon. This process associates the master recipe (which regards to tasks, operations
and workcenters) and the bill of materials (BOM) (which deals with the materials necessary for
production) to the order. Two important operations are then performed automatically by the ERP:
(1) the tasks are scheduled to their respective workcenters and their occupied capacity is accounted
for (note that this step may only happen during the short-term horizon when there are more process
orders) and (2) the materials are checked against stock and ordered if missing – this means that
projects whose raw materials have known long lead-times must be converted into process orders
sooner than other projects. In this work, these orders are often referred to as Current Orders.
• Opportunities: derived from sales forecasts, the opportunities indicate potential clients and their
required products. Often opportunities can also be more on the product-side, trying to find a client;
imagine an independent manufacturing area with its predicted allocated capacity far from its limit
– it may be desirable to increase production there and find a client for those produced goods.
Opportunities are defined in the long-term horizon and sometimes even later.
Having these definitions in mind, one can observe how they influence the allocated capacity according
to the different time fences. In figure 2.7, three time horizons can be distinguished. The short-term
fence, which has its capacity at the defined capacity limit, regards mostly process orders. The
medium-term fence contains allocated capacity from a mixture of planned and process orders; some have been
converted while others have not. The long-term fence contains only allocated capacity by planned or-
ders, with available capacity for opportunities exploration or the addition of new orders by clients.
2.2.4 Integrated Supply Chain
The need for an integrated SC derives from the pharmaceutical industry’s current paradigm and the
evolution seen in computational capabilities, which enables the collection, processing and analysis of
enormous quantities of data, in order to extract knowledge and reach conclusions that otherwise would
not be possible (or at least extremely difficult). The integrated SC acts as a centralized entity that
manages all the areas that are integral to the business in order to optimize efficiency. This goes against
the more traditional way of having each SC member in a concentrated view of its objectives and tasks,
instead requiring all members to collaborate towards a common objective.
The digitalization of the integrated SC, converting it into a SC4.0 is a natural and fundamental step for
overall improvement of its performance. A procedure for this transformation goes along the application of
IoT and the adoption of the automation pyramid from figure 2.5. This means that field level measurement
systems should be installed, collecting data at high frequency rates, leading to an ERP dense in correct
information and with all its capabilities fully utilized. The processes should be correctly and exhaus-
tively described, with automatic confirmation systems to avoid human-error and to enable demonstrated
performance-based simulation on each operation. Automatic resource measurement systems should
also be adopted to control stock levels and order materials automatically according to requirements
and supplier lead-times, and heuristics for determining the capacity limits of individual workcenters
should be implemented. Given sufficient and meaningful data, the opportunities for
improving performance along the SC agents are near limitless.
Robinson [46] points out several advantages of the digitalization of the SC, among them the ability to
better share information with international sites of the company, the decentralization of the inventories
and an overall streamlined route-to-market.
The DT both contributes to and benefits from the capabilities of the SC4.0. It takes advantage of the
SC4.0’s capabilities by using its large quantities of both raw and processed data for visualization
purposes, while aiding in its development by generating predictions which may be used by other SC agents.
The concept of DT is connected with the SC in an equivalent way to a physical asset. Considering figure
2.1 and a turbine as the physical asset, the DT would be a digital model of the physical components and
interactions between them, physical principles, finite element models and historical data, while receiving
real-time measurement data about pressures and temperatures in specific points. The DT could then
perform accurate forecasts and give insights into the behavior of the physical asset. Analogously, a DT
of an internal SC would be comprised of a model of the SC, featuring models of its agents and their in-
teractions, historical data regarding durations, capacities, maintenance, malfunctions and projects, while
receiving data regarding current projects, available capacity, material stocks and other real-time infor-
mation. Using both the intrinsic knowledge of how the SC (contained within its models) works and how
it is behaving at a given point (including information about the planned future), the DT could display the
knowledge it possesses and could perform forecasts, giving the user insights into the future and what
should be done to increase efficiency.
2.2.5 Related Work
DTs, especially as applied to the SC, are still under-researched topics in academia, and there are
few applications in industry. From the article by Kritzinger et al. [27], it can be seen that
most current applications of DTs and other digital integrations are made in a manufacturing context.
The article featured a majority of DT applications in maintenance, product lifecycle and production
planning & control. Although the latter could be connected to the SC in many ways, further research into
the individual articles reviewed by the authors showed a tendency for using discrete event simulation
in the production planning, autonomous guided vehicles and mechatronic systems, rarely mentioning
capacity planning.
Ivanov et al. [26] describe the DT as a combination of simulation, optimization and data analytics. The
authors proceed to describe the new paradigms in SC management, with the appearance of technologies
such as IoT, cyber-physical systems and connected products. The presented digital technological
applications are described in terms of their effects on risk management and the ripple effect. The SCs
are then introduced as cyber-physical systems, which utilize the modern technologies in their cyber counterpart
and big data analytics, artificial intelligence, simulation and optimization for the conversion from physical
to cyber SCs. This approach allows for improved resilience and generates contingency and recovery
plans.
Regarding the simulation tool developed in this work, a simulation-based RCCP, it was seen that no
exact application of the concept could be found either in academia or industry. However, some work
has been done in related topics. There are many applications of RCCP tools not based on simulation,
extremely frequent in many enterprise software. Papavasileiou et al. [39] offer a Monte Carlo simulation-
based approach to task scheduling in pharmaceutical environment. Although there are several method-
21
ologies used in both this research and this thesis, the main objective differs from a more operational
point-of-view (at the short-term and more task-related) to a more strategic perspective, not focused on
the individual operations. Spicar and Januska [52] present an application of Markov chain Monte Carlo
in capacity planning. The presented method combines the power of Markov chains in estimating a state
given the previous state with the stochastic character offered by the Monte Carlo methods. The results
showed sales, income and defective produced units for a given month, according to the buying behavior
of the studied company’s clients and likely competition by its main competitors.
2.3 Proposed Solution and Expected Results
Two main objectives for the tool were identified as per the thesis objectives: deliver clear visualization
into past and present metrics of the SC and offer a scenario-based forecasting tool. The DT should
receive historical information from the higher levels of the automation pyramid, while using data from
lower levels of the pyramid for real-time data. Considering that the object of study is the SC, the
real-time requirements are relaxed to a daily or weekly timeframe, corresponding to data from the
SCADA level or from the manufacturing execution system.
The DT should feature efficient methods of portraying information in a meaningful yet comprehensible
way. It should present information on the activities happening at a given time – with the possibility
of filtering the shown data by production area, building or project – as well as the activity schedule
and the evolution of the key performance indicators, giving users the freedom to interact with the
information and filter it in the way that best fits their needs. Additionally, as support
for the simulation tool, there should be a project database, including information by project, the project’s
BOM, recipes, routings, efforts and demonstrated durations of the measured processes, along with the
probability distribution that was fitted to the existing data.
The simulation tool that best delivers the proposed objectives was a simulation-based RCCP, which
deals with capacity planning. This tool has the objective of acting as a decision-support tool and scenario
explorer for the decision-makers. The tool should perform the RCCP but unlike most currently available
tools, it should be done based on demonstrated performance and on the inherent variability that phar-
maceutical operations possess: it should be based on probability distributions that model the activities
durations, obtained by statistical analysis of past data. Based on this, Monte Carlo simulation will be
used to simulate multiple scenarios and detect convergence in the monthly capacity utilization.
Furthermore, the utilization of bottleneck assets (BAs) should be checked to verify whether there are
any consistent overlaps in asset utilization.
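A minimal sketch of this Monte Carlo approach is shown below. It is illustrative only: the task list, the normal distributions and the capacity figure are invented assumptions, whereas the thesis fits distributions to historical ERP data; the point is merely the mechanic of sampling durations and averaging the resulting monthly utilization over many scenarios.

```python
import random
import statistics

# Hypothetical tasks for one month: (nominal duration, std dev) in TU,
# plus the nominal effort in worker-hours. Invented for illustration.
TASKS = [(10, 2, 300), (20, 4, 500), (15, 3, 250)]
MONTHLY_CAPACITY_WH = 3500  # available worker-hours in the month

def simulate_once(rng: random.Random) -> float:
    """One scenario: sample each task's duration and scale its effort
    proportionally, returning the month's capacity utilization in %."""
    used = 0.0
    for nominal, std, effort in TASKS:
        duration = max(rng.gauss(nominal, std), 1.0)  # truncate at 1 TU
        used += effort * duration / nominal
    return 100.0 * used / MONTHLY_CAPACITY_WH

def monte_carlo_utilization(n_runs: int = 2000, seed: int = 42) -> float:
    """Average utilization over many sampled scenarios."""
    rng = random.Random(seed)
    return statistics.mean(simulate_once(rng) for _ in range(n_runs))

est = monte_carlo_utilization()
# With these invented numbers the estimate converges near the nominal
# utilization of 1050 / 3500 = 30%.
```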
The objective of this project, as a DT of the internal supply chain of a pharmaceutical CDMO, with
a simulation tool capable of RCCP has not been found in either commercially available solutions or
academic research. Even for less specific components of the project, either simulation-based RCCP or
DTs of the internal SC, available solutions are scarce. This denotes the challenge and significance of
this project.
The definition of DT given by Kritzinger et al. [27] as an object with automatic data flows to and from
digital and physical objects does not agree with the concept here implemented. In fact, the definition
may not be seen as the most appropriate for this study, since the authors consider studies mostly made
at the operational level and not at the logistics level. This means that the DTs analyzed by the authors
mostly regarded tangible assets, in contrast to the internal SC studied in this work, which has a
much higher abstraction level. At the level of tangible assets, automatic correction of the plans can be
made with little consequence in case of errors. However, on the SC (logistics level), where campaigns
are scheduled months in advance, deadlines with customers are set, raw materials depend on
suppliers’ lead-times and the stakes are higher overall, plans and schedules cannot simply be updated
automatically without human interaction, for responsibility and liability reasons. This
means that the definition made by Kritzinger et al. may not be applicable to every scenario.
The verification of interferences between projects’ BAs combines the work by Papavasileiou et al.
[39] in terms of demonstrated performance-based simulation and main asset scheduling with capacity
planning, more specifically, RCCP. Unlike the work done by Spicar and Januska [52], this current thesis
does not have the objective of evaluating sales, clients’ behavior and competition, and is instead focused
mainly on utilized capacity along the different areas that directly affect the production of pharmaceutical
products. Furthermore, the use of Markov chain Monte Carlo is less justifiable in scenarios where the
operations to be performed are known and future states do not depend on current states, meaning
that there are no transition matrices.
The DT should be in line with the basic definition of DT by Ivanov et al. [26] as a combination of
simulation, optimization and data analytics. In fact, data analytics will be used to uncover the statistical
behavior of the processes with sufficient data; simulation will be used to obtain the expected monthly
capacity utilized; optimization will be used to verify that no BA is in conflict, and to correct such conflicts
otherwise.
The developed solution is expected to both increase insights into the internal SC through its data
visualization abilities and deliver data-driven forecasts on future area occupancy based on demonstrated
performance. This allows for (1) better awareness of the current and past states of the SC, which can
aid in making more data-driven decisions, and (2) assistance in better allocating resources and
identifying opportunities to insert new campaigns.
Chapter 3
Knowledge Extraction
The DIKW pyramid is a model which represents the hierarchy and the functional relationship between
data → information → knowledge → wisdom [47]. The model simply states that wisdom is created
from knowledge, which is created from information, obtained from data. The higher in the hierarchy,
the more actionable and valuable the asset becomes. However, to obtain the subsequent level in the
hierarchy, the current one must be analyzed and processed, often compiling the larger asset into a
smaller, more detailed and meaningful one. Obtaining wisdom is, therefore, the ultimate goal, but the
entire process starts with just data. The levels of the pyramid can be defined as follows [47]:
• Wisdom: often considered an elusive concept, related to human intuition, understanding and
interpretation. Wisdom can also be defined as accumulated knowledge.
• Knowledge: combination of data and information, to which is added expert opinion, skills and
experience.
• Information: data that has been shaped into a form that is meaningful and useful to human beings.
• Data: discrete, objective facts or observations, which are unorganized and unprocessed, and do
not convey any specific meaning.
It can be concluded that although wisdom can be seen as an almost unattainable goal, everything
starts from the bottom: with the collection of data, followed by its processing into information, which
can be understood by humans, who with their skills and experience acquire knowledge.
Data has always been a crucial resource, but the ”democratized” computational power brought by
the 21st century has allowed the collection of varied types of data in tremendous amounts and its easier
processing into information. The practical advantages are near endless, with the identification of hidden
patterns, prediction of future events based on past data and the possibility of developing black-box
models, which with sufficient data can arrive at the desired conclusions, without the need to define the
behavior of the model itself.
An important aspect has to be mentioned regarding the models that require data: these models are
only as good as the data they are supplied with. This may seem like an obvious statement, but frequently
data is corrupted in some form, and the model will undoubtedly fail to portray the system it intends to
represent. Nevertheless, there are ways of mitigating this problem, especially when only part of the
data is corrupted. The main method for doing so is outlier identification and removal, which will be
addressed later.
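One common technique, shown here only as an illustrative sketch (not necessarily the method adopted later in this work), is the interquartile-range rule, which discards observations outside the Tukey fences; the conventional multiplier k = 1.5 is an assumption, not a thesis-specific choice.

```python
import statistics

def iqr_outlier_filter(durations, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR], the classic
    Tukey fence for outlier removal."""
    q1, _, q3 = statistics.quantiles(durations, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [d for d in durations if lo <= d <= hi]

clean = iqr_outlier_filter([5, 6, 6, 7, 7, 8, 40])  # the 40 TU value is dropped
```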
The CDMO under study uses an ERP which stores its data in databases. The use of databases for
storing large volumes of data is a customary practice, as the name itself suggests. Databases tend
to be efficient in dealing with large volumes of data and can be queried to upload or retrieve existing
data. For this thesis a development environment was set up, based on the same architecture as the
production system. The data contained there was a mirror of the actual production data, retrieved in
September 2019 and hosted on a SQL server that mimics the actual system; it is thus prepared to
work with live data with no significant modifications.
3.1 Collected Datasets
The data used in this work was contained in the PP and material management modules of the
ERP. From the first module, the tables used regarded planned orders, production orders, workcenters,
reservations and tasks, while from the second module only inventory management tables were needed.
The process of determining the tables and fields to be considered was the result of an extensive study,
which provided a dictionary between the data and its meaning and utilization. This study resulted in
an exploratory analysis report that was shared with the project stakeholders and is now being used by
experts at the company to assist in exploratory analysis.
The datasets extracted can be divided into 4 categories:
• Processes duration: duration of the processes in study, calculated as the variation between the
start and end dates of the activities. Explained in more detail in section 3.1.1.
• Planned orders: correspond to the orders that are already confirmed and will happen in the
following months. These can have either or both a start date and a deadline. The extraction of
the data is done assuming that the start date and deadline are contained within the long-term time
horizon, not including any horizon before or after. This is due to the fact that the orders in the short
and medium-term horizons are fixed and only subject to minor changes. It is not in the scope of
this project to deal with orders at those horizons. The data extracted contains the project, the start
date and deadline and the desired quantity.
• Required resources: from the routings and tasks tables, the required resources (regarding worker-
hours and equipment) associated with each project were extracted. The data extracted for each
project regarded mainly the operation description, equipment code, duration and effort. Note that
these values are theoretical values, which means that they are the reference values and are sub-
ject to change in the real operation. However, the ERP uses these durations and efforts when
scheduling activities and allocating capacity, which may lead to errors. Additionally, through the
equipment code, each operation was divided into manufacturing, QA, QC or warehouse.
• Available capacities: to accurately characterize workcenters with respect to their available capac-
ity, their maximum daily effort was extracted. Since the capacity will be evaluated monthly, the daily
values have to be summed by month for the total monthly capacity. These capacities are related
to the required resources since they establish the available daily (or monthly) limit that each area
has of a specific resource. Basically, each area has a capacity of available resource and these will
be consumed by the manufacturing and support activities.
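The monthly aggregation of the daily capacities can be sketched as follows. This is illustrative only: the real extraction reads the workcenter capacity tables from the ERP, while here a plain date-to-hours mapping stands in for that data.

```python
from collections import defaultdict
from datetime import date

def monthly_capacity(daily_capacity):
    """Sum daily available capacity (worker-hours) into monthly totals,
    keyed by (year, month). Input: {date: hours}."""
    totals = defaultdict(float)
    for day, hours in daily_capacity.items():
        totals[(day.year, day.month)] += hours
    return dict(totals)

# 30 days of September at 24 worker-hours/day -> 720 worker-hours.
daily = {date(2019, 9, d): 24.0 for d in range(1, 31)}
print(monthly_capacity(daily))  # {(2019, 9): 720.0}
```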
3.1.1 Processes Duration
The tables of interest for this extraction regarded the PP, routings and warehouse movements. In
terms of the duration of the processes, the data collection can be summarized in the graph from figure
3.1. This figure shows graphically how the available data is chronologically related to the productive and
support operations.
[Figure 3.1: Extracted dates and their chronological relation to the real processes. The bars represent the real starts and ends of the different stages (manufacturing, warehouse, IPC, QC R, QC RV, QA and the BPR reviews), while the vertical lines represent the dates that can be extracted: planned start of production, actual start of production, actual finish of production, FP stored in the warehouse, stock transferred into unrestricted use, and shipping of the final product. Note that M corresponds to manufacturing, with M (BPR) being the BPR review process by the manufacturing team (not manufacturing itself).]
As can be seen from the graph, there is not enough granularity from the extracted data to obtain the
actual starts and ends of all the processes. With this in mind, the quality release (QR) is considered,
which corresponds to the region immediately after production and containing the QC R and RV, manufac-
turing (BPR review) and QA operations. The QR can be measured with its start being the manufacturing
end and its end being the stock transferred to unrestricted use. Note that this QR is a construct and is
not actually a phase of the processes, but was created as the agglomeration of the mentioned stages.
Additionally, the mismatch between the planned start and the actual start results from possible delays
in the operation. These dates can be simultaneous – no delays occurred – or the actual start can fall
after the planned start.
The process of data extraction encompasses obtaining data from existing data sources, for further
processing. This is a fundamental step in real-world systems because clean and easily accessible data
rarely exists. To obtain the durations of the Manufacturing and QR’s processes, a series of operations
had to be made to the existing datatables, which were validated by the responsible stakeholders. The
process of extracting the mentioned durations comprises a series of steps, as shown and described
below.
1. Define assumptions
A couple of assumptions have to be set before extracting the data. Most of these have already
been mentioned in the description of the timeline graph from figure 3.1. All the assumptions are
presented below.
• Actual start denotes the beginning of the manufacturing processes and must always have a
non-null value.
• Actual finish denotes the end of the manufacturing processes and must always have a non-
null value.
• The start of the QR processes is assumed to coincide with the end of the manufacturing.
• The end of the QR processes corresponds to the entry of the accounting document meaning
that stock in quality inspection can be transferred to unrestricted use.
2. Extract the raw data – SQL query
Custom-made SQL queries were written and validated with subject experts within the organization.
3. Prepare the data
A final preparation has to be done to the extracted data. This includes mainly the calculation
of the duration of the processes, by converting the type of some fields and performing a few
mathematical manipulations. The duration of the processes is assumed to be discrete, since the
data rarely allows for increased precision and, even when it does, the added precision would introduce
greater noise; it can therefore be assumed that the time unit is the smallest scale and that all the values are
discrete. This time unit will not be further defined for confidentiality reasons and will be henceforth
referred to as TU.
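The preparation step above can be sketched with pandas. The column names below are hypothetical stand-ins for the confidential source fields, and, purely for illustration, one TU is taken to be one day.

```python
import pandas as pd

# Hypothetical column names; the real table fields and the TU are confidential.
# For illustration only, one TU is taken to be one day.
df = pd.DataFrame({
    "actual_start": pd.to_datetime(["2019-01-02", "2019-01-05"]),
    "actual_finish": pd.to_datetime(["2019-01-09", "2019-01-11"]),
    "unrestricted_use": pd.to_datetime(["2019-01-20", "2019-01-18"]),
})

# Manufacturing duration: actual finish minus actual start, in whole TUs.
df["mfg_duration"] = (df["actual_finish"] - df["actual_start"]).dt.days
# QR duration: from the manufacturing end to the stock-transfer document.
df["qr_duration"] = (df["unrestricted_use"] - df["actual_finish"]).dt.days

# Discard physically impossible (negative) durations.
df = df[(df["mfg_duration"] >= 0) & (df["qr_duration"] >= 0)]
```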
This extraction resulted in 2038 observations, regarding 194 unique projects.
3.2 Distribution Fitting

Data as collected tends to have a variability associated with it, even for the same activities or
processes. There are many unpredictable events and occurrences that lead to distinct outcomes in
terms of duration, produced quantity or cost – in fact, almost all real-world systems contain at least one
source of randomness [28, p. 279]. The manufacturing and QR processes’ duration will be the main
focus of study in this section. Despite the variability inherent to real-world processes, data shows that
there is often a pattern in the duration of the activities. Considering that the durations of the processes
are discrete values in TUs, as stated previously, the probability of a duration taking a certain value can be obtained
from historical data, and the aggregation of all the probabilities generally follows a common PDF. This
can be clearly seen in a histogram, where the counts (or frequencies) of occurrence of an activity lasting
a certain amount of time are shown.
An important assumption was made, considering the discrete nature of the data and the requirements
given by the stakeholders: the bin width considered is always the TU itself, the smallest scale considered.
This is due to the fact that a certain level of granularity is required, which can be better observed through
the smallest possible bin width. Furthermore, the values themselves are not too dispersed, with datasets
frequently having only 10 to 20 bins even at the smallest bin width possible. This assumption can be
supported by the literature, in the sense that selecting the bin width (or number of bins) is a process
regularly done on a trial-and-error basis, observing the results and adjusting accordingly, until achieving
the smallest bin width that still produces a reasonably smooth (not overly rugged) histogram [28, p. 323].
Additionally, the problem of selecting the bin width is of greater importance in continuous distributions
rather than in discrete ones.
Replacing the PDF of a model by its mean is often done. However, this simplification can be
dangerous, e.g. for high-variance PDFs, where the mean is not very representative. Especially for
simulation purposes, an activity's duration should be modelled by a PDF that represents it more
reliably.
Even though the data itself can be used as the PDF, known as an Empirical Distribution, Hillier et al.
[25, p. 893] state that for simulation purposes, the assumed form of the distribution should be suffi-
ciently realistic that the model provides reasonable predictions while, at the same time, being sufficiently
simple that the model is mathematically tractable. While an empirical distribution does provide reasonable
predictions (though overfitting may sometimes cause wrong ones), it is certainly not
mathematically tractable. Note that storing a few tens or hundreds of observations may not be
too computationally demanding, but when there are millions of observations it becomes much more of
a problem; in contrast, using only a few parameters to define the whole distribution is generally a much
better choice. Hillier et al. [25, p. 891,1079] also consider manufacturing systems as queuing systems,
more specifically as Internal Service Systems (considering the case, as an example, of a machine that
can be viewed as a service, with customers being the processed jobs). For these types of systems,
exponential distributions are advised by the authors, due to their fitting capabilities and mathematical
tractability [25, p. 887]. Note that it is also mentioned that this type of PDF is generally used due to these
advantages, but that other distributions may also be chosen. In contrast, Law [28, p. 280-282] suggests
fitting a series of distributions to the data and choosing the one with the best goodness-of-fit test (GoF).
The approach that was chosen lies in between the methods stated in the citations above. First of
all, note that the PDFs of this problem are discrete, which by itself restricts the distributions chosen
to only discrete PDFs. Secondly, a considerable number of projects possess only a small number
of observations. This means that fitting a series of distributions to each individual project's dataset and
choosing the best fit is not applicable to these few-observation projects.
The idealized approach was to select a single distribution for the projects and then consider it as
the distribution that best describes all the processes. This can also be supported by the fact that since
the operations are similar in nature (pharmaceutical manufacturing processes or pharmaceutical quality
release processes) it can be assumed that their PDFs are also similar, hence a single PDF with dif-
ferent parameters is considered sufficient to correctly describe them. This selected distribution could
be obtained by a similar approach to the one mentioned by Law, fitting a series of PDFs to the most
representative projects (the ones with a considerable amount of observations) and generalizing the dis-
tribution with the best GoF to all the projects; this way, all the data would be fitted to distributions that
are proven to correctly fit data from the CDMO’s own manufacturing and QR processes.
The process of selecting the distribution that best fits the more representative projects’ data follows
the steps enumerated below.
1. Choosing the set of contender distributions
2. Defining the number of observations threshold for considering the datasets representative
3. Preprocessing the data
4. Looking for data outliers and removing them
5. Obtaining the statistical properties, such as mean, variance or skewness
6. Determining the most representative distribution for the data, based on the statistical properties
7. Fitting the distributions to the data
8. Evaluating the GoF between the data and the distribution
9. Selecting the most representative distribution
Most of the methods for fitting PDFs to real data are empirical methods based on observation and
trial and error. Nevertheless, a more automatic approach can be taken with only a small loss in accuracy
when dealing with irregular cases. This is necessary in the current scope since the distribution fitting
process will have to be done programmatically in the future, to allow new observations to modify the
distributions’ parameters.
3.2.1 Selecting the Distributions
A comprehensive set of theoretical PDFs must be chosen that can effectively fit data of diverse
types. Note that the pool of candidate PDFs must consist of discrete PDFs. According to
Law [28, p. 308-313], the Binomial, Negative Binomial and Poisson distributions are the most common
discrete PDFs and can be fitted to a wide range of data types. This set of distributions is also supported
by the aforementioned claims made by Hillier et al., defending exponential distributions as the most
common and best choice for manufacturing processes; in fact, all three of these distributions belong to
the exponential family of distributions. Examples of these distributions' behaviors can be seen in Figure
3.2.
[Figure: density plots of the Binomial (t = 10, p = 0.2; t = 10, p = 0.5; t = 5, p = 0.2), Negative Binomial (s = 10, p = 0.5; s = 2, p = 0.5; s = 3, p = 0.3) and Poisson (λ = 0.9; λ = 5; λ = 10) distributions.]
Figure 3.2: Binomial, negative binomial and Poisson distributions for different parameters
Another important aspect to mention about these PDFs derives from their mathematical form, shown
in equations 3.1a to 3.1c. In contrast to other PDFs, such as the most common Gaussian distribution,
these distributions cannot take negative values.
Binomial: p(x) = t!/(x!(t − x)!) · p^x (1 − p)^(t−x), if x ∈ {0, 1, …, t}; 0 otherwise (3.1a)

Negative Binomial: p(x) = (s + x − 1)!/(x!(s − 1)!) · p^s (1 − p)^x, if x ∈ {0, 1, …}; 0 otherwise (3.1b)

Poisson: p(x) = e^(−λ) λ^x / x!, if x ∈ {0, 1, …}; 0 otherwise (3.1c)
A crucial assumption has to be made that impacts the PDFs. This assumption states that the
observation with the smallest duration bounds the distribution from below; simply put, there will
never be any value below the smallest one observed. Although in truth such an occurrence can happen,
especially in less populated projects, the duration of a project is physically bounded by the
duration of the chemical processes, which can never be sped up. This means that in an ideal system,
the distributions would tend to become increasingly exponential in shape. This can in fact be seen in the
projects with a higher number of observations, especially in manufacturing. Even though the projects
have a lower bound, the distributions allow values greater than or equal to zero; cutting the distributions at
the lowest existing value would solve this problem, but the fit would never be quite as good as a different
approach: offsetting the distribution to the smallest value, effectively considering it as the zero-value.
Mathematically, this implies a minor change in their mass functions, shown in equations 3.2a to
3.2c. Consider x0 as the smallest observed value in a certain project.
Binomial: p(x) = t!/((x − x0)!(t − (x − x0))!) · p^(x−x0) (1 − p)^(t−(x−x0)), if x ∈ {x0, x0 + 1, …, t + x0}; 0 otherwise (3.2a)

Negative Binomial: p(x) = (s + (x − x0) − 1)!/((x − x0)!(s − 1)!) · p^s (1 − p)^(x−x0), if x ∈ {x0, x0 + 1, …}; 0 otherwise (3.2b)

Poisson: p(x) = e^(−λ) λ^(x−x0) / (x − x0)!, if x ∈ {x0, x0 + 1, …}; 0 otherwise (3.2c)
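Although the thesis implementation is in R, the shift above can be sketched in Python: scipy's discrete distributions accept a `loc` offset, which implements exactly the x → x − x0 translation of equations 3.2. The toy durations below are illustrative.

```python
from scipy import stats

durations = [8, 9, 9, 10, 12, 15]   # toy durations in TUs
x0 = min(durations)                  # smallest observed value becomes the origin

# The `loc` offset shifts all probability mass so it starts at x0,
# as in equations 3.2a-3.2c.
shifted = stats.poisson(mu=2.5, loc=x0)

at_origin = shifted.pmf(x0)      # equals an unshifted Poisson pmf at 0
below = shifted.pmf(x0 - 1)      # zero: no mass below the smallest observation
```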
3.2.2 Data Segmentation
Consider the data extracted according to the conditions detailed in the section on process
durations. This data needs to be subdivided into a smaller group of only representative projects, i.e.,
projects with a sufficiently high number of observations, capable of being fitted to a distribution in a
trustworthy manner. First of all, it is important to understand the order of magnitude of the data. There
is a total of 2038 observations, regarding 194 projects. From this dataset, a series of observations
were removed: projects with fewer than 3 observations, since they could not be meaningfully fitted to
a distribution, and observations with negative duration, since these obviously originated from data
insertion errors. This changed the amount of available data: manufacturing retained a total
of 1905 observations for 104 projects, while QR retained 1869 observations for 102 projects.
Note that the mismatch between QR and Manufacturing comes from the fact that corrupted observations
in one area do not necessarily imply that they are corrupted in the other area.
Even for this filtered set, the majority of the projects have a small number of observations; the solution
for finding the most representative distributions is to select only projects with more than 30 observations.
This leaves a total of only 10 projects, with the number of observations per project ranging from 36 to
415. Although the number of projects that verify the above conditions is small, it may be enough to reach
the desired conclusions.
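The filtering described above can be sketched as follows; the table schema and values are toy stand-ins for the confidential batch data.

```python
import pandas as pd

# Hypothetical schema: one row per batch, with its project and duration in TUs.
obs = pd.DataFrame({
    "project": ["A", "A", "A", "B", "B", "C"] * 20,
    "duration": [5, 6, 7, -1, 8, 9] * 20,
})

# Drop negative durations (data-insertion errors).
obs = obs[obs["duration"] >= 0]

# Keep projects with at least 3 observations for fitting at all ...
sizes = obs.groupby("project")["duration"].size()
kept = obs[obs["project"].isin(sizes[sizes >= 3].index)]

# ... and use only projects with 30+ observations to select the distribution.
representative = sizes[sizes >= 30].index.tolist()
```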
[Figure: three histograms of counts versus duration [TU].]
Figure 3.3: Examples of PDFs from the obtained data
As can be seen from figure 3.3, the PDFs of the real data tend to vary quite substantially from project
to project. However, even with the observed variability in the existent data, all the considered theoretical
PDFs can successfully fit to them, since they can take varied forms (check the graphs from figure 3.2).
Note that the third histogram relates to the data from a QR process; in reality, many of the QR processes
tend to have a much higher variability than the manufacturing ones. The chosen distribution function
has to be able to successfully fit distributions with an exponential tendency (such as the first graph),
a normal behavior (such as the second graph) and a uniform trend (such as the third graph).
3.2.3 Outlier Identification and Removal
By definition, an outlier is an observation that is far removed from the rest of the observations [31,
p. 89]. Its significance as representative data can be questionable: it may be data incorrectly measured,
that suffered from irregular deviations and should be discarded, or it may be merely an extreme
manifestation of the random variability inherent in the data, in which case the values should be retained and
processed in the same manner as the other observations in the sample [22].
Current methods for identifying and removing outliers from data can vary greatly. There are extremely
powerful methods using supervised or unsupervised anomaly detection algorithms such as support
vector machines [48], replicator neural networks [24] or fuzzy logic-based approaches [11]. Although these methods
may be of interest when considering massive and diverse datasets, the data used is never greater
than 500 observations per project, which means that employing such a sophisticated method for outlier
detection is not necessary for the foreseeable future. A simpler method that performs rather well for the
size of the datasets used is based on the interquartile range (IQR). This method is based on Tukey’s
Fences, a concept introduced by John Tukey in 1977 [57]. According to his definition, an outlier is an
observation outside of the range defined by:
[Q1 − k(Q3 −Q1), Q3 + k(Q3 −Q1)] (3.3)
In equation 3.3, Q1 refers to the lower quartile (which splits off the lowest 25% of the data from the
highest 75%) and Q3 refers to the upper quartile (which splits off the highest 25% from the lowest
75%). Note that Q2 corresponds to the median (effectively dividing the dataset in half). As proposed
by John Tukey, a value of k = 1.5 removes regular outliers, while a value of k = 3
removes only far-out observations.
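A minimal implementation of this filter, using the standard library's inclusive quantile method (other quartile conventions shift the fences slightly):

```python
import statistics

def tukey_filter(values, k=1.5):
    """Remove observations outside Tukey's fences (equation 3.3)."""
    # Quartiles via the inclusive method; other conventions differ slightly.
    q1, _, q3 = statistics.quantiles(sorted(values), n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if lo <= x <= hi]
```

For example, `tukey_filter([10, 11, 12, 12, 13, 14, 40])` drops the extreme value 40 while keeping the rest; with `k=0` the filter keeps only the interquartile range itself, mirroring the overly aggressive setting discussed below.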
The outlier filter was applied to some of the distributions shown in Figure 3.3 and the
results can be seen in Figure 3.4.
[Figure: the histograms of Figure 3.3 with the accepted data and the fences for k = 0, k = 1.5 and k = 3 overlaid.]
Figure 3.4: Data with outliers filtered out. Different values of k were used to observe their influence on the outlier filtering. Note
that on the histogram on the right, outliers were caught only when k = 0 (an overly aggressive filter), meaning that with this
method there can be situations in which no outlier is selected. The vertical lines plotted on the graphs denote
the median of the distributions.
As can be seen from the figure, using k = 1.5 it is possible to obtain results that look promising. In
fact, the third graph visually does not seem to have any outlier (which is supported by k = 1.5); regarding
the first histogram, it can be seen that the filter could be slightly more conservative, but the results are
acceptable. Note that using a value of k = 0 (done merely to observe the results of a far more
aggressive filter, which removes everything outside the [Q1, Q3] range) shows that the outliers can be both on
the left or right of the graph. However, due to the inherent skewness that the histograms have, with
longer tails on the right, outliers on the left will not be as frequent as on the right.
The fact that the third graph shows no removed observations also reinforces the definition
of outlier by Grubbs [22], who stated that outliers can be a mere manifestation of the randomness of
a distribution and should not be discarded.
3.2.4 Data Statistics
There are interesting statistical characteristics that can be calculated from the data which can be
helpful in determining the PDF that fits the data in the most appropriate way. These parameters
can provide insights into the shape, tendency, variability and other interesting characteristics. Besides
the more straightforward characteristics, such as the number of observations, mean, standard deviation,
minimum and maximum values, mode and median, a few other significant parameters can be calculated.
• Lexis Ratio: the Lexis Ratio is a parameter that can often provide useful insights into the form of
a discrete distribution. Note that this parameter can only be calculated for discrete distributions. It
is calculated through the expression 3.4.

τ = Variance / Mean = σ² / μ (3.4)
One empirical discrimination can be made from the Lexis ratio, as stated by Law [28, p. 322].
According to the author, a Poisson distribution is characterized by having τ = 1, a binomial distribution
by having τ < 1 and a negative binomial distribution by having τ > 1.
• Skewness: the skewness of a distribution is a measure of its asymmetry. It can be calculated
through the expression 3.5. Note that while there are several methods of calculating the skewness
of a distribution, the one shown is Pearson's moment coefficient of skewness.

Skew[X] = ν = γ₁ = E[((X − μ) / σ)³] (3.5)
A skewness of ν = 0 indicates a symmetric distribution (for example, the normal distribution), while
ν = 2 corresponds to an exponential distribution, skewed to the right. Generalizing, a positive
skewness indicates a right-skewed distribution and a negative skewness a left-skewed
distribution.
• Kurtosis: the kurtosis of a distribution is also a property that describes its shape, more specifically,
it is a measure of the ”tail weight” of a distribution [28, p. 322]. The most commonly used method
for calculating the kurtosis is the excess method, also defined by Karl Pearson.

Kurt[X] = γ₂ = E[((X − μ) / σ)⁴] − 3 (3.6)
Note that this calculation of the kurtosis is extremely similar to the moment-based calculation of
skewness. Indeed, the moment kurtosis is part of the excess-method calculation: it corresponds to
the expected-value term. The excess method merely subtracts 3 from the moment kurtosis, with
the objective of setting the normal distribution's kurtosis at zero.
While historically the kurtosis has been claimed to give insights into the flatness of a distribution
and its tail weight, this claim has been refuted, as can be seen in [58], where the author states that
its only unambiguous interpretation is in terms of tail extremity.
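The three statistics above can be computed directly; a sketch with scipy on a toy duration sample (the thesis itself used R):

```python
import statistics
from scipy import stats

durations = [8, 9, 9, 10, 10, 10, 11, 12, 15]   # toy TU durations

mean = statistics.mean(durations)
var = statistics.pvariance(durations)           # population variance

lexis = var / mean               # equation 3.4; > 1 hints at negative binomial
skew = stats.skew(durations)     # Pearson moment coefficient (equation 3.5)
kurt = stats.kurtosis(durations) # excess kurtosis (equation 3.6)
```

For this right-tailed toy sample, `skew` comes out positive, matching the general pattern reported below for the real projects.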
The properties for the 20 most relevant distributions (10 for each area: manufacturing and QR) were
calculated and several conclusions could be extracted from the results.
[Figure: scatter plots of mean [TU] versus standard deviation [TU], one panel for Manufacturing and one for QR, with marker size encoding the number of observations (100 to 400).]
Figure 3.5: Plot of the mean versus the standard deviation of the representative projects, discriminated between manufacturingand QR. Note that a third dimension is included through the size of the markers; as suggested, a bigger marker indicates a
project with a larger number of observations.
From the graphs shown in Figure 3.5, it can be seen that the manufacturing values are much more
consistent in the projects' durations, and especially in their variance. Their mean is never above 20 TUs
(frequently around 10 TUs) while their standard deviation is generally below 5 TUs. Dataset size
does not appear to have a direct correlation with the results. Regarding QR's values, the most interesting
conclusion is the apparent effect that dataset size has on the standard deviation. It can be expected that
a project with a greater number of observations is better defined and has a lower variance, and this
effect is clearly seen in the QR results. The remaining data confirms what can be clearly seen from the
observations: QR durations are generally longer and more dispersed.
The remaining properties were calculated and can be seen in the parallel plot in Figure 3.6.
A few conclusions can be taken from the parallel plot. Perhaps the most obvious one is the clearly
higher variance of most properties for the QR data. In reality, the manufacturing process tends to be
more constant between different projects than QR, whose processes are more chaotic, less predictable
and more prone to external influence. The dataset sizes show that there are 5 projects with between 30
and 60 observations and 5 projects with between 60 and 420. Note that the mismatch between the dataset
sizes for the manufacturing and QR of a single project derives from the fact that the outlier filter does not
necessarily remove the same number of observations in each area. The values for mean and standard
deviation show the same information as Figure 3.5. Regarding the median and the mode, it can be
seen that for the manufacturing durations the values tend to remain the same within a single project,
and are actually equal to the mean value. This is typical of theoretical distributions and shows,
for example, that the distributions are not multimodal. For the QR durations, these parameters
tend to vary more within a single project, which can be explained by the occurrence
of more uniform distributions. Regarding the Lexis ratio, which is a measure of variance, it can be seen
that it behaves similarly to the standard deviation. This parameter can aid in selecting a theoretical distribution
between the binomial, the negative binomial and the Poisson PDFs. The majority of the values can
be seen to be above τ = 1 (4 values for manufacturing and 8 values for QR), which by the empirical
[Figure: parallel plot with one axis per property (dataset size, standard deviation, mean, median, mode, Lexis ratio, skewness and kurtosis), one line per project, colored by area (Manufacturing or QR).]
Figure 3.6: In this parallel plot, the information regarding the distributions' statistical properties is shown, discriminated between
manufacturing and QR duration. This plot aims to clearly picture the patterns that can be found in these statistical properties, and
whether they are correlated with the areas they represent. A certain project's distribution is represented by a line, with each property on
an axis. This can give insights into a single project's properties, by following a line, or into how a certain property is distributed across
the projects, by looking at a single property's axis values.
rule described would mean that the most appropriate distribution to be fitted to the data would be the
negative binomial distribution.
Regarding the skewness of the projects, the values are mostly positive, which indicates
that the distributions are generally skewed to the right. However, this skewness is not large,
often being negligible (note that an exponential distribution has a skewness of 2, for example). Finally,
regarding the kurtosis of the distributions, it can be seen that it is mostly negative, for both manufacturing
and QR. In practical terms, knowing that a kurtosis of zero corresponds to the normal distribution,
this means that the distributions have a smaller tail weight than a normal distribution. While it may not be
completely accurate, a negative value of kurtosis can also mean that the ”head” of the distribution is
flatter than that of a comparable normal distribution, which can explain the uniform-like distributions, for
example.
[Figure: two histograms of counts versus duration [TU].]
Figure 3.7: Examples of distribution with distinct values for skewness and kurtosis. Note that the outliers have already beenremoved in this graph. The first graph has Skew[X] = 1.00 and Kurt[X] = 0.14, while the second graph has Skew[X] = 0.31
and Kurt[X] = −1.02
Considering the graphs from Figure 3.7, it can be seen that the first one shows a high skewness to
the right, due to its exponential-like shape, and a tail weight comparable to that of a normal distribution, with
its head being relatively regular. This explains the high skewness and near-zero kurtosis. The
second graph is much different, featuring a more uniform tendency. Its kurtosis is so small because
the whole distribution is considered the head; this way, not only does the flat head contribute,
but there is also no tail weight, because the whole distribution is the head. The
skewness is less meaningful in such a situation and only differs from zero because there are a few peaks
and a few null entries in between existing ones.
There are a few empirical methods for choosing a PDF using the calculated statistical parameters.
One has already been described and its conclusions have been taken (using the Lexis Ratio). Another
method is the Cullen and Frey graph, presented in 1999 in Alison Cullen and Christopher
Frey's book Probabilistic Techniques in Exposure Assessment [18]. There, a method for graphically
evaluating the most appropriate theoretical PDF is presented, through a plot of the square of skewness
versus the kurtosis (in its unbiased form).
versus the Kurtosis (in its unbiased form). This plot can be seen in figure 3.8, where the representative
datasets are plotted against the regions that classify the theoretical PDFs.
[Figure: Cullen and Frey graph, plotting kurtosis versus square of skewness, with regions for the normal, negative binomial and Poisson PDFs and the projects shown as points.]
Figure 3.8: Cullen and Frey graph. Note that as the legend suggests, the different regions for PDFs are distinguished by the linesor areas and the observations are shown as the blue dots.
As can be seen from the figure, the observations appear to be consistently inside the region that
classifies the distributions as following the negative binomial PDF. This is in accordance with the conclu-
sions taken from the Lexis ratio, which makes it likely that the most appropriate distribution to be used
is the negative binomial. However, further analysis will be performed to evaluate how each theoretical
PDF fits to the data.
3.2.5 Fitting the Distributions to the Data
The process of fitting a theoretical PDF to the existing data is paramount in obtaining a PDF that
is representative of the data that it is being fitted to. The procedure of fitting a theoretical PDF is an
optimization of the distribution’s parameters in order to maximize the GoF between the theoretical distri-
bution and the real data.
A GoF test calculates a statistic that measures how well a distribution is fitted.
Law [28, p. 344] defines a GoF test as a statistical hypothesis test used to formally assess
whether the observations of a certain dataset are an independent sample from a particular distribution.
Several GoF tests exist and are widely used. The most significant ones in the literature will be described
and one will be chosen for the optimization operation. Note that while only one GoF test can be chosen
as an optimization criterion, several tests can be used to assess visually how well the data is fitted,
according to each of their own criteria.
• Chi-squared goodness-of-fit test (CS): Pearson’s CS (χ2) tests are a set of statistical tests used
to evaluate three types of comparison: homogeneity, independence and GoF. Here, only the GoF
test will be performed. For discrete distributions, the procedure for calculating the GoF starts by
dividing the dataset into cells. Given the relatively small size of the datasets, it is common practice
to consider each TU as a cell, effectively rendering the binning process unnecessary.
The formula for calculating the CS value (called Pearson’s cumulative test statistic) is shown in
expression 3.7.
χ² = Σ_{i=1}^{n} (O_i − N·p_i)² / (N·p_i) (3.7)

with
O_i ≡ number of observations for TU i
N ≡ total number of observations
p_i ≡ theoretical probability of TU i
n ≡ number of cells
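A sketch of the statistic for a toy four-cell dataset; scipy's `chisquare` computes the same quantity:

```python
import numpy as np
from scipy import stats

observed = np.array([30, 45, 15, 10])      # counts O_i per TU cell (toy data)
N = observed.sum()                          # total number of observations
probs = np.array([0.3, 0.4, 0.2, 0.1])     # fitted PMF p_i per cell

expected = N * probs
chi2 = ((observed - expected) ** 2 / expected).sum()   # equation 3.7

# scipy computes the same statistic (and a p-value) in one call.
chi2_scipy, p_value = stats.chisquare(observed, expected)
```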
• Kolmogorov-Smirnov GoF test: the Kolmogorov-Smirnov GoF test compares the empirical cu-
mulative distribution function (ECDF) with the cumulative distribution function (CDF) of the hypoth-
esized distribution [28, p. 351]. The ECDF is calculated through the expression 3.8
F_n(x) = (number of X_i ≤ x) / n (3.8)
The Kolmogorov-Smirnov GoF statistic is then simply defined as the largest vertical distance be-
tween the ECDF and the fitted CDF. The statistic is calculated using expression 3.9.
D⁺_n = max_{1≤i≤n} { i/n − F(X_(i)) },  D⁻_n = max_{1≤i≤n} { F(X_(i)) − (i − 1)/n }

D_n = max { D⁺_n, D⁻_n } (3.9)
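Equation 3.9 translates almost directly into code; a sketch (written for any CDF passed in as a callable, with toy uniform data):

```python
import numpy as np

def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov D_n per equation 3.9."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    F = np.asarray(cdf(x), dtype=float)     # fitted CDF at the order statistics
    d_plus = np.max(np.arange(1, n + 1) / n - F)
    d_minus = np.max(F - np.arange(0, n) / n)
    return max(d_plus, d_minus)
```

For example, `ks_statistic([0.1, 0.4, 0.6, 0.9], lambda x: x)` evaluates the sample against a uniform CDF on [0, 1].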
• Cramer-von Mises GoF test: the Cramer-von Mises GoF tests are a set of tests used to evaluate
the GoF of a CDF when compared to an ECDF, similarly to the Kolmogorov-Smirnov GoF tests.
As stated in [4], although these tests are originally designed for continuous distributions, they have
also been adapted to discrete ones. This thesis uses the definitions given by Choulakian et al. [15]
in their article on Cramer-von Mises statistics for discrete distributions.
The Cramer-von Mises GoF tests can be divided into 3 separate tests, each with its own statistic:
the Cramer-von Mises GoF test, W², the Watson GoF test, U², and the Anderson-Darling GoF test,
A². Their statistics are defined in equations 3.10.
W² = N⁻¹ Σ_{j=1}^{k} Z_j² p_j (3.10a)

U² = N⁻¹ Σ_{j=1}^{k} (Z_j − Z̄)² p_j (3.10b)

A² = N⁻¹ Σ_{j=1}^{k} Z_j² p_j / (H_j(1 − H_j)) (3.10c)

with
k ≡ number of duration cells
N ≡ total number of observations
Z_j = Σ_{i=1}^{j} o_i − Σ_{i=1}^{j} e_i ≡ difference between the observed and expected cumulative frequencies
Z̄ = Σ_{j=1}^{k} Z_j p_j ≡ average cumulative difference
H_j = Σ_{i=1}^{j} e_i / N ≡ cumulative expected (fitted) CDF
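The three statistics can be sketched as below, following the definitions above; the last A² cell is guarded, since H_k = 1 makes its weight degenerate (and Z_k = 0 there when the total counts match):

```python
import numpy as np

def cvm_statistics(observed_counts, pmf):
    """Discrete W2, U2 and A2 per equations 3.10 (after Choulakian et al.)."""
    o = np.asarray(observed_counts, dtype=float)
    p = np.asarray(pmf, dtype=float)
    N = o.sum()
    e = N * p                              # expected frequencies per cell
    Z = np.cumsum(o) - np.cumsum(e)        # cumulative observed - expected
    Zbar = np.sum(Z * p)                   # average cumulative difference
    H = np.cumsum(e) / N                   # cumulative expected CDF

    W2 = np.sum(Z ** 2 * p) / N
    U2 = np.sum((Z - Zbar) ** 2 * p) / N
    # Guard cells where H*(1-H) == 0 (always the last cell) against 0/0.
    w = H * (1.0 - H)
    A2 = np.sum(np.where(w > 0, Z ** 2 * p / np.where(w > 0, w, 1.0), 0.0)) / N
    return W2, U2, A2
```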
From the GoF tests shown, one had to be chosen, in order to optimize the distributions' parameters
in relation to it. The test chosen was Pearson's CS test, for several reasons. First of
all, the CS is undeniably the most used, verified and supported test. Although this may not seem an
immediate selection criterion, a more reliable test, with results that are meaningful to
a larger number of people, is generally preferable. Secondly, in terms of algorithm, the CS is simpler,
much less prone to errors, and its results appear to be consistently good, only matched by the Cramer-
von Mises GoF tests in terms of PDF fit. Finally, from an industry-specific point of view, it makes more sense
to optimize according to the PDF, like the CS test, rather than the CDF, like the remaining tests. This
is because optimization through the CDF tends to optimize the whole distribution,
weighing the tail significantly more heavily than the CS test. This is not necessary for this specific problem and
can even jeopardize the GoF of the remaining distribution. Nevertheless, the results of how the different
GoF tests influence the fits, from both numerical and graphical points of view, are shown in
Appendix B. The optimization formulation, using the CS test as the GoF metric, can then be made as
shown in expressions 3.11.
\min_{t,\,p} \; \chi^2 = \sum_{i=1}^{n} \frac{(O_i - N p_i^{bin})^2}{N p_i^{bin}} = \sum_{i=1}^{n} \frac{\left(O_i - N \, \frac{t!}{(x-x_0)! \, (t-(x-x_0))!} \, p^{(x-x_0)} (1-p)^{t-(x-x_0)}\right)^2}{N \, \frac{t!}{(x-x_0)! \, (t-(x-x_0))!} \, p^{(x-x_0)} (1-p)^{t-(x-x_0)}}

s.t. \; t > 0, \; p \in [0, 1]                                            (3.11a)

\min_{s,\,p} \; \chi^2 = \sum_{i=1}^{n} \frac{(O_i - N p_i^{nbin})^2}{N p_i^{nbin}} = \sum_{i=1}^{n} \frac{\left(O_i - N \, \frac{(s+(x-x_0)-1)!}{(x-x_0)! \, (s-1)!} \, p^s (1-p)^{(x-x_0)}\right)^2}{N \, \frac{(s+(x-x_0)-1)!}{(x-x_0)! \, (s-1)!} \, p^s (1-p)^{(x-x_0)}}

s.t. \; s > 0, \; p \in [0, 1]                                            (3.11b)

\min_{\lambda} \; \chi^2 = \sum_{i=1}^{n} \frac{(O_i - N p_i^{pois})^2}{N p_i^{pois}} = \sum_{i=1}^{n} \frac{\left(O_i - N \, \frac{e^{-\lambda} \lambda^{(x-x_0)}}{(x-x_0)!}\right)^2}{N \, \frac{e^{-\lambda} \lambda^{(x-x_0)}}{(x-x_0)!}}

s.t. \; \lambda > 0                                                       (3.11c)

with

p^{bin}, p^{nbin}, p^{pois} ≡ PDFs shown in expressions 3.2
s, t, p, λ ≡ the PDFs' parameters to be optimized
Lastly, an optimization method had to be selected. There are several criteria for choosing an optimization algorithm: the problem's complexity, the algorithm's efficiency and how reliably it converges to a global minimum. In the current problem, the complexity is low, with only one or two parameters to be optimized; this means that there is no need for a very sophisticated algorithm (such as metaheuristics) and that the algorithm's efficiency will never be a problem. The main objective is to choose an algorithm able to consistently deliver satisfactory results. Due to the relative simplicity of the optimization, a native R function was used: the constrOptim function, from the stats package [56]. This function receives the function to be optimized, the initial estimates for the parameters and the constraints of the optimization (e.g., the constraint that p ∈ [0, 1] for the Binomial and Negative Binomial PDFs). A series of other parameters can be modified, including the optimization method. The available methods were tested to find the one that performed best. The implementations of the Nelder-Mead and BFGS algorithms were found to deliver the best results most consistently. Since their values were practically the same, Nelder-Mead, which appeared to be less prone to errors, was the method of choice. The method presented by Nelder and Mead [35] in 1965 is a heuristic search method characterized by comparing the function values at the vertices of a simplex with (n + 1) vertices (n being the number of dimensions). As stated by the authors, the method is effective and computationally compact.
Estimating the initial parameters for the optimization should be done whenever possible, not only because it provides an approximation of the parameters that reduces the number of iterations necessary, effectively increasing the algorithm's efficiency, but also because it makes the algorithm less prone to falling into local minima. The estimation of the parameters evidently depends on the PDF being optimized. The Poisson PDF, the only one considered with a single parameter, is the simplest and most effective for parameter estimation. As stated by Law [28, p. 313], the maximum likelihood estimator for the Poisson distribution is λ = X̄(n), i.e., the average value of the distribution's observations. Regarding the Binomial and Negative Binomial distributions, and considering that both their parameters are unknown, maximum likelihood estimation is a much more complex process which, given the relatively simple nature of the optimization, becomes unfruitful. As a consequence, the initial parameters were set a priori, based on empirically obtained results that seemed to lead to optimizations with fewer iterations and without getting stuck in local minima. The initial values were then t = 100, p = 0.5 for the Binomial PDF and s = 1, p = 0.2 for the Negative Binomial PDF.
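The fitting just described — fixed initial estimates plus a Nelder-Mead search on the CS objective — can be sketched as follows. The thesis implements this with R's constrOptim; this Python stand-in (SciPy's Nelder-Mead) enforces the constraints s > 0, p ∈ (0, 1) with a large penalty instead, and the observed frequencies here are synthetic, for illustration only.

```python
import math
import numpy as np
from scipy.optimize import minimize

def nbinom_pmf(x, s, p, x0=0):
    # Negative binomial PMF of expression 3.11b, shifted by x0; lgamma is
    # used so the non-integer s values visited by the optimizer are accepted.
    k = x - x0
    if k < 0:
        return 0.0
    return math.exp(math.lgamma(s + k) - math.lgamma(k + 1) - math.lgamma(s)
                    + s * math.log(p) + k * math.log(1.0 - p))

def chi_square(params, x, freq, x0=0):
    # CS objective of expression 3.11b; out-of-bounds parameters receive a
    # large penalty, standing in for constrOptim's native constraints.
    s, p = params
    if s <= 0 or not 0 < p < 1:
        return 1e12
    n = freq.sum()
    expected = np.array([n * nbinom_pmf(xi, s, p, x0) for xi in x])
    expected = np.maximum(expected, 1e-12)  # guard against division by zero
    return float(np.sum((freq - expected) ** 2 / expected))

# Synthetic observed duration frequencies for one illustrative project
rng = np.random.default_rng(0)
x = np.arange(30)
freq = np.bincount(rng.negative_binomial(5, 0.4, size=500),
                   minlength=30)[:30].astype(float)

# Initial estimates s = 1, p = 0.2, as chosen in the text
res = minimize(chi_square, x0=[1.0, 0.2], args=(x, freq), method="Nelder-Mead")
s_hat, p_hat = res.x
```

Since Nelder-Mead returns the best vertex visited, the final objective is never worse than at the initial guess; for the Poisson case, the analogous starting point would simply be the sample mean, per the MLE cited from Law.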
3.2.6 Results of the fitting
The algorithm implementing the PDF fitting is shown in algorithm 1. In the algorithm, a few R functions are used: first, dbinom, dnbinom and dpois correspond to the PDFs of the binomial, negative binomial and Poisson distributions, respectively; IQR is the function that calculates the interquartile range, and quantile the function that calculates the quantile at a given percentage. Lastly, constrOptim is the optimization function, which receives the initial parameters, the function to minimize and the constraints. Assume the variable manufacturing values (and QR values) to be the data structure containing the distribution values for all the different projects, and Freq to be the vector containing the real frequencies of a project. Note that all the functions mentioned belong to the R package stats [56].
Algorithm 1 Distribution Fitting Algorithm
 1: procedure CHISQUARE(par)
 2:     p = dbinom(x, par[1], par[2])        ▷ Could be dbinom(x, t, p), dnbinom(x, s, p) or dpois(x, l)
 3:     expected = n * p
 4:     return sum((Freq − expected)^2 / expected)
 5: vals = manufacturing values[project]     ▷ Manufacturing or QR
 6: cut = IQR(vals) * 1.5                    ▷ Removing outliers
 7: lower = quantile(vals, 0.25) − cut
 8: upper = quantile(vals, 0.75) + cut
 9: vals = vals[vals > lower & vals < upper]
10: gf = constrOptim(par, ChiSquare, constr) ▷ Optimize parameters (initial parameters par)
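The outlier-removal step of Algorithm 1 (lines 6-9) translates directly; a small NumPy sketch is given below (Python rather than the thesis's R; NumPy's default quantile interpolation matches R's quantile type 7, so the cut-offs coincide).

```python
import numpy as np

def remove_outliers(vals):
    # Tukey-style trimming from Algorithm 1: drop points farther than
    # 1.5 * IQR below Q1 or above Q3.
    vals = np.asarray(vals, dtype=float)
    q1, q3 = np.quantile(vals, [0.25, 0.75])
    cut = 1.5 * (q3 - q1)
    return vals[(vals > q1 - cut) & (vals < q3 + cut)]
```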
With the objective of evaluating how well each theoretical distribution fitted the data, the optimization process was performed for all the representative projects and for both areas, manufacturing and QR. The results are presented in table 3.1. Note that, to obtain the time taken by each optimization, each was run for 1000 iterations and the values were averaged.
                      Manufacturing            QR                       Average
PDF                   CS     Time   Exc.       CS     Time   Exc.       CS     Time   Exc.
Poisson               336.2  1.1    0          284.4  1.1    7          324.2  1.1    7
Binomial              445.3  14.9   0          579.6  17.6   7          476.3  15.5   7
Negative Binomial     20.6   15.4   0          72.0   5.3    0          46.3   10.3   0

Table 3.1: Results of the optimization of the PDFs' parameters for distribution fitting. The values presented are the CS value, the time each optimization takes (in milliseconds) and the number of exclusions. An excluded value corresponds to a project which, for a specific distribution, has a CS value above 10000. This limit was selected empirically, with the objective of removing values immensely greater than the values shown, which would otherwise create meaningless averages.
Analyzing the results in table 3.1, a series of conclusions can be drawn. Regarding code efficiency, the Poisson distribution is clearly the most efficient; this comes as no surprise, given that it is the only distribution with a single parameter. The negative binomial is generally faster than the binomial. It is important to mention, however, that optimization efficiency is not a crucial factor when selecting the distribution, since the magnitudes involved are relatively small (on the order of milliseconds). In terms of CS value, the negative binomial is much better than the remaining distributions, both in Manufacturing and in QR; the binomial distribution performs the worst. Furthermore, while the negative binomial's CS value increases from Manufacturing to QR, it has no exclusions, contrary to the remaining distributions, where the majority of the projects had CS values above 10000. In figure 3.9, the different distributions are fitted to the histograms that have already been shown.
Visually, it is clear that the Negative Binomial PDF offers the best fit to the majority of the datasets. Note the first and third graphs, describing data with an exponential and a uniform tendency, respectively.
Figure 3.9: Results of the fitting process between the projects' distributions and the theoretical PDFs, applied to the examples already shown. [Figure: Binomial, Negative Binomial and Poisson fits overlaid on the duration histograms; axes: Count vs. Duration [TU].]
Especially in these cases, it can be seen that the negative binomial is the most appropriate PDF. Additionally, by observing the CDFs shown in figure 3.10, it is possible to assess the fits of the different distributions. Note that, although the optimization was not based on a CDF, as most of the described GoF methods are, the fitted distributions (especially the negative binomial) tend to be robust in following the ECDF.
Figure 3.10: ECDFs and how the fitted theoretical distributions follow them. [Figure: CDF curves of the Binomial, Negative Binomial and Poisson fits against the ECDFs; axes: Density vs. Duration [TU].]
After analyzing all the data from the fitting process, it is clear that the PDF delivering the most consistent and most accurate predictions in both the Manufacturing and QR processes is the Negative Binomial. The assumption is then made that all the processes at the CDMO in study follow a Negative Binomial PDF. In fact, the use of the Negative Binomial distribution for modelling durations can be found in the literature, for example in the article by Carter and Potts [10], where hospital length-of-stay is modelled with this PDF. Literature on modelling the PDFs of pharmaceutical process durations was found to be scarce, with no occurrences of the negative binomial distribution; its use can, however, be justified by the study presented here.
Chapter 4
Simulation-Based Rough Cut Capacity Planning
The simulation tool developed with the objective of conducting scenario-based forecasting and op-
timizing the overall SC service level was a simulation-based RCCP tool, data-driven and built on the
concept of demonstrated performance. The key stakeholders at the CDMO in study considered this to
be the tool which could bring the most benefits, given the nature of the project.
4.1 Problem Description
Capacity planning can be defined as the process of determining and evaluating the amount of capac-
ity required for future manufacturing operations. This capacity can often be in terms of labor, machinery,
warehouse space or supplier capabilities. Planning for capacity is a crucial step, since it can both eval-
uate if the future manufacturing processes can take place without problems and enable better resource
allocation, reducing inventory levels and increasing the overall utilized capacity. It can be performed
at several levels: product-line level (resource requirements planning, RRP), master-scheduling level
(RCCP) and material requirements planning level (capacity requirements planning, CRP) [54]. This ca-
pacity planning process is graphically represented in figure 4.1, where it can be seen that the RCCP is
the capacity planning at the master-scheduling level. The master production schedule is the plan made
by the company regarding production, staffing and inventory [7].
The RCCP is the capacity plan at the tactical level, addressing the master production schedule at the requirements level. In fact, Oracle Applications [36] defines it as a long-term capacity planning tool that marketing and production use to balance required and available capacity, and to negotiate changes to the master schedule and/or available capacity. Using the results of an RCCP, the master schedule can be modified to solve capacity inconsistencies by moving scheduled dates or increasing/decreasing scheduled production quantities. Additionally, the baseline capacity can be increased when necessary, by adding overtime shifts or subcontracting personnel; to this end, a rough estimate of the necessary capacity at a given time has to be known in advance, hence the need for the RCCP.
The RCCP can be distinguished from the RRP and the CRP due to the level at which they operate.
The main definitions and differences between these are described below.
• RRP: RRP has the objective of creating a profile of the work centers’ load that the system uses to
validate a forecast, determining available capacity and long-range requirements for a work center.
Figure 4.1: Capacity Planning Process [38]. [Figure: flowchart pairing requirements and capacity at each level — Forecasting with Resources Requirements Planning (strategic plan), Master Scheduling with Rough Cut Capacity Planning (tactical plan), and Material Requirements Planning with Capacity Requirements Planning (operational plan).]
This means that it is a planning stage more focused on the strategic level. Usually, the RRP is generated after a long-term forecast, using its data on future sales to estimate the time and resources required for the production operations. Only after the RRP can the master schedule be produced, which justifies the higher level of the RRP compared with the RCCP. Due to its strategic operating level, the RRP can aid in several aspects, e.g. expanding existing facilities, staffing loads or determining capital expenditures for equipment [37].
• CRP: CRP is used to verify if an enterprise has sufficient capacity available to meet the capacity
requirements from the MRP plans. CRP is a more detailed capacity planning tool than RCCP in
the sense that it considers schedules and on-hand inventory quantities when calculating capacity
requirements. The capacity plans that come from the CRP are a direct statement of the capacity
required to meet the company’s net production requirements [36].
The rough character of the RCCP has a series of implications. First of all, the scheduled campaigns (short-term horizon) and on-hand inventory quantities are not within the scope of the capacity requirements calculation. Secondly, the capacity tends to be measured over a large timeframe, frequently monthly or biweekly.
4.2 Proposed Approach

The main objective behind developing an RCCP tool was to obtain an estimate of the utilized capacity in the long-term horizon, regarding workstations and workforce. Since this tool was to be used
as a component of a DT, which possesses large quantities of information regarding current and past
states of the productive areas and models of how the processes tend to occur, adding the concept of
demonstrated performance was deemed as an opportunity for improving the baseline performance of
the tool. This elevates the results of the RCCP tool from being strictly dependent on recipe information to results based on observed performance, effectively accounting for more scenarios and generating more accurate and dependable results. Furthermore, the DT directly affects this simulation-based RCCP by automatically updating the probability distributions that model the processes' durations and by adding the information regarding the current orders (and those in the short/medium-term horizon), creating an additional constraint in terms of activity planning. For this work, and considering the data in table 2.2, the short-term timeframe is considered to be 1 month from the current date, the medium-term timeframe 3 months and the long-term 2 years.
The approach chosen to implement the demonstrated performance in the RCCP tool was the Monte Carlo (MC) method. This class of computational algorithms relies on random sampling of values in order to find a pattern or tendency and, theoretically, is able to solve any problem with a probabilistic interpretation. In the case at hand, the manufacturing and QR durations were seen to have measurable probabilistic characteristics (see section 3.2), which are propagated to the areas' efforts.
The use of the MC method is justified by both the non-linear character of the problem and the fact
that the system cannot be accurately modeled. While other methods are more accurate and much less
computationally demanding (such as the Kalman filter for linear systems or the extended Kalman filter
for nonlinear systems), these require an accurate system model. The objective of the MC method is then
to generate distributions of the predicted monthly capacity for each area, given the variability inherent to
the systems.
The MC method can be mathematically formulated by initially defining a probability space (Ω, F, P), corresponding to the sample space Ω, the set of event outcomes F and the function P that assigns probabilities to the events. The application of the probability space to the problem at hand is shown in equation 4.1.

\Omega = \left\{ \{D_{M_1}, D_{QR_1}, D_{M_2}, D_{QR_2}, \cdots, D_{M_N}, D_{QR_N}\}_j \right\}, \quad D \in \mathbb{N}

F = 2^{\Omega}

P(x) = \left[ \prod_{i=1}^{N} P(D_{M_i}, s_{M_i}, p_{M_i}, x_{0_{M_i}}) \cdot P(D_{QR_i}, s_{QR_i}, p_{QR_i}, x_{0_{QR_i}}) \right]_j        (4.1)

with

P(x, s, p, x_0) = \frac{(s + (x - x_0) - 1)!}{(x - x_0)! \, (s - 1)!} \, p^s (1 - p)^{(x - x_0)} \quad (Negative Binomial PDF)
A few considerations are in order regarding the probability space of each PDF. Each value of D_{M_i} or D_{QR_i} corresponds to a duration in terms of manufacturing or QR for project i, taking a value from \{D_{min_i}, D_{min_i} + 1, \cdots\}. While establishing D_{min} as the minimum duration for a process is correct, doing so for an upper bound D_{max} would not be quite as mathematically correct, since in theory there is no upper bound; in practical terms, however, an upper bound exists and can be observed. Secondly, note that this formulation differs from more common applications of the MC method. Oftentimes, MC simulation is used in gambling, to evaluate the risk of successively playing a certain game; this assumes that the probabilities are sampled in succession, with the complexity increasing with each additional step considered. This means that, beyond a certain complexity, not all combinations
can be accounted for, but the universes being calculated converge to a representative solution. This would be near-impossible to obtain mathematically for problems of a certain complexity. In this problem, the probabilities are not sampled in succession; rather, they are part of the same universe. This means that, for a given iteration of the Monte Carlo simulation, each campaign's duration is sampled and translated into a specific resulting capacity. Each iteration will then feature a different universe, characterized by a certain monthly utilized capacity in each area. By analyzing more universes (iterations), the capacity converges to a more representative one.
Sugita [53] offers an additional formulation of the MC method and also justifies why the use of pseudorandom numbers in the sampling process is acceptable. This is important because, while MC simulation theoretically relies on completely random sampling, it has been viewed with some suspicion when pseudorandom sampling is used, which is unavoidable when sampling values computationally. The author, however, proves mathematically that it is also valid.
4.2.1 Methodology
The methodology followed when constructing the simulation is described in this section. This includes
assumptions made and approaches followed.
For the first approach, the concept of confidence level has to be defined in the scope of the current work. Given that the PDFs do not have an upper bound and can theoretically assume values up to infinity, such an event is undesirable and can greatly influence the results. To account for this problem, a confidence level was defined as the percentage corresponding to the maximum acceptable duration in the CDF. By doing so, the theoretical PDF is truncated, only accepting values inside the chosen confidence level. This level can be chosen by the user, but setting it to 90% has been seen to successfully remove a sizable portion of undesirable points while keeping a varied distribution. To implement this truncation, the algorithm cannot simply assign the maximum value (corresponding to the chosen confidence level) to any sampled value bigger than it, since this would create an unbalanced frequency at the maximum value. Instead, the truncation is implemented by sampling values in a loop until they fall in the acceptable region. The result of this process can be seen in the graphs of figure 4.2.
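The rejection loop described above can be sketched as follows (Python rather than the thesis's R; the PMF is rewritten with lgamma, an assumption of this sketch, so that non-integer s is accepted). The bound is the smallest duration whose CDF reaches the confidence level.

```python
import math
import numpy as np

def nbinom_pmf(k, s, p):
    # Shifted-to-zero negative binomial PMF, same form as in chapter 3.
    if k < 0:
        return 0.0
    return math.exp(math.lgamma(s + k) - math.lgamma(k + 1) - math.lgamma(s)
                    + s * math.log(p) + k * math.log(1.0 - p))

def truncation_bound(s, p, level):
    # Smallest duration whose CDF reaches the chosen confidence level.
    k, acc = 0, nbinom_pmf(0, s, p)
    while acc < level:
        k += 1
        acc += nbinom_pmf(k, s, p)
    return k

def sample_truncated(rng, s, p, level=0.90, size=2000):
    # Re-draw any value beyond the confidence level (a loop, not clipping,
    # so no artificial frequency spike accumulates at the maximum).
    bound = truncation_bound(s, p, level)
    out = np.empty(size, dtype=np.int64)
    for i in range(size):
        d = rng.negative_binomial(s, p)
        while d > bound:            # rejection loop
            d = rng.negative_binomial(s, p)
        out[i] = d
    return out, bound
```

Since at least 90% of the mass is inside the bound, each rejection loop terminates after very few draws on average.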
Figure 4.2: Examples of truncated distributions. The confidence level was set to 90% and the number of samples for each example was 2000. The grey region represents the theoretical PDF, the bars correspond to the sampled values' histogram and the vertical line marks the confidence level. [Figure: Frequency vs. Duration [TU] for three example projects.]
By setting a confidence level and truncating the PDFs, the range of values from which D_{M_i} and D_{QR_i} are taken becomes bounded, as \{D_{min_i}, \cdots, D_{max_i}\}. To better understand the mathematical formulation, consider the example defined by a scenario where the projects to be sampled are as shown in table 4.1.
Project, i   Possible manufacturing durations   Possible QR durations
1            {4, 5, 6}                          {19, 20}
2            {13, 14}                           {4, 5}
3            {3, 4}                             {15, 16, 17}
4            {13, 14}                           {4, 5}

Table 4.1: Example scenario of orders to be sampled. Note that no start and finish dates are provided: the dates are not necessary for the sampling process. This example is simplified; generally, the range of possible values is much larger, as is the total number of projects.
For this specific example, the probability space would be defined as shown in equations 4.2:

\Omega = \left\{ \{D_{M_1}, D_{QR_1}, D_{M_2}, D_{QR_2}, D_{M_3}, D_{QR_3}, D_{M_4}, D_{QR_4}\}_j \right\} =
       = \left\{ \{4, 19, 13, 4, 3, 15, 13, 4\}, \{5, 19, 13, 4, 3, 15, 13, 4\}, \{6, 19, 13, 4, 3, 15, 13, 4\}, \cdots \right\}        (4.2a)

P(x) = \left[ \prod_{i=1}^{N} P(D_{M_i}, s_{M_i}, p_{M_i}, x_{0_{M_i}}) \cdot P(D_{QR_i}, s_{QR_i}, p_{QR_i}, x_{0_{QR_i}}) \right]_j        (4.2b)
Although this example is extremely simple, the sample space Ω is composed of 576 possible scenarios, each with its respective probability of occurrence, as defined by equation 4.2b. Furthermore, the set of possible events, with |F| = 2^{|Ω|} = 2^{576} ≈ 2.47 · 10^{173} elements, would also be immensely large. In fact, considering the scenario used for the results presented in section 4.3.4, where a total of 547 orders were considered, resulting in a sample space with around 10^{1392} scenarios, it would be impossible to calculate the probability of each scenario and to evaluate which scenario is the most probable. The use of Monte Carlo is justifiable in such a scenario: by randomly sampling values, not all scenarios can be obtained, but a convergence can be found, which in theory tends to the scenario with the highest probability of occurring.
Note that one could argue that the scenario with the highest probability could be obtained simply by directly extracting each PDF's highest-probability value, generating a scenario in which all the planned orders take their most probable manufacturing and QR durations. However, the objective of this algorithm is to obtain the most probable monthly capacity utilization scenario, which is not necessarily the scenario with the most probable durations. Instead, the monthly utilized capacities have to be calculated for each scenario and a convergence has to be found.
After a confidence level and the number of iterations of the Monte Carlo algorithm are set, the base loop can be run. This loop simply samples durations for the manufacturing and QR processes of each project and for each iteration. While the capacities could be calculated in this loop, they are instead calculated in a separate loop (this is possible because the results are stored). The start and end dates of each campaign are also calculated. These can be calculated through 2 different methods, chosen by the user: latest start date (LSD) or earliest due date (EDD). Note that the planned start date and deadline are given with the planned orders. The two types of simulation simply establish whether the planned start date or the deadline is fixed during the simulation. By fixing the start date, the simulation is run and an expected finishing date is obtained, which corresponds to the EDD; in contrast, by fixing the deadline (while applying a safety buffer), the durations are calculated and the start date is obtained, corresponding to the LSD.
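The two date modes can be sketched as below (Python rather than R; the 5-day safety buffer is an illustrative value, not the one used at the CDMO).

```python
from datetime import date, timedelta

def campaign_dates(planned_start, deadline, dur_m, dur_qr,
                   mode="LSD", buffer_days=5):
    # LSD: fix the deadline (minus a safety buffer) and walk backwards;
    # EDD: fix the planned start and walk forwards.
    total = timedelta(days=dur_m + dur_qr)
    if mode == "LSD":
        end = deadline - timedelta(days=buffer_days)
        start = end - total
    else:  # EDD
        start = planned_start
        end = start + total
    return start, end
```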
After running the main loop, the monthly capacities can be calculated for each iteration. The assumptions used for this calculation are fundamental for the results to make sense. These assumptions are based on the sampled values of manufacturing and QR duration and on the values from the projects' recipes: the processes' durations and efforts, for the different productive and support areas. The main assumption is that the manufacturing effort scales with its duration. This means that if the recipe states that the manufacturing process has a duration of d and an effort of e, but in reality the duration is d_real (sampled from a PDF), then the effort is (d_real / d) · e. The remaining efforts are not scaled in quantity, only in "location". While this means that the total effort is always the same, the monthly effort can vary. Note that the QR processes include the FP storage by the warehouse, the manufacturing BPR review, the QC R and RV operations and QA. In terms of efforts, QR is divided into all these operations and does not have its own effort. The basic assumptions regarding the manufacturing and support areas are shown in table 4.2.
Area      Sampled Manufacturing   Sampled QR
M         Scaled                  Beginning
QA        –                       After M + After QC RV
QC IPC    Distributed             –
QC R      –                       Beginning
QC RV     –                       After QC R
WH        Beginning               Beginning

Table 4.2: Effort assumptions. Scaled means that the daily effort is constant and expanded to the actual duration of the process. Distributed means that the total effort is the same, but the daily effort is changed to account for the real duration. Beginning means that the effort is placed on the first day of the process (when there is no more precise information).
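The two effort rules of table 4.2 can be sketched as follows (a Python illustration of the scaling assumption, not the thesis's R code).

```python
def scale_manufacturing_effort(recipe_duration, recipe_effort, sampled_duration):
    # "Scaled": the daily effort is constant, so the total effort grows
    # with the sampled duration: e_real = (d_real / d) * e.
    return sampled_duration / recipe_duration * recipe_effort

def distribute_effort(total_effort, sampled_duration):
    # "Distributed" (e.g. QC IPC): the total effort is unchanged; only the
    # daily effort adapts to the sampled duration.
    return [total_effort / sampled_duration] * sampled_duration
```

For example, a recipe stating 10 days at 50 effort, sampled at 12 days, yields a scaled effort of 60, while a distributed effort of 30 over 5 sampled days becomes 6 per day.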
These assumptions are also represented graphically in figure 4.3. Note that, while only one set of QC R/RV is shown, in a real campaign there can be multiple occurrences of these stages, regarding different operations. The way they operate is simple: all of the QC R stages start after the end of the manufacturing processes; each QC RV stage starts after the associated QC R stage ends; finally, the QA operations regarding QC start after the last QC RV operation ends. These assumptions are specific to the CDMO in study and were verified and approved by the responsible stakeholders of this project.
One final constraint is used by the algorithm, at the planning level. The basic idea is to verify whether or not, for a specific simulation, there are clashes in the scheduling of the BA of each campaign. A BA is defined, in the scope of this thesis, as an asset which is unique and fundamental for a certain task in a project and may be used to produce different products. This verification has the objective of analyzing
Figure 4.3: Example representation of the assumptions. The first line represents the durations and efforts according to the recipe, while the second represents the values scaled according to the rules mentioned. The efforts are shown inside the bars, while the duration is read from the x axis; on the second plot, the sampled manufacturing and QR values are also indicated, affecting the durations and efforts according to the rules previously stated. [Figure: Gantt-style bars per area (M, QA, QC IPC, QC R, QC RV, WH) over Time [TU].]
if there are any campaigns that require a given BA at the same time. This can be an opportunity for improvement, since the ERP at the CDMO in study schedules the campaigns based on their recipes, and it has been seen that these are not always strictly followed. Usually, buffers are set after the campaigns to account for these problems, but ideally these should be kept as small as possible. A better solution is to use the demonstrated performance to obtain more representative scenarios. Similarly to the graph of figure 4.3, the individual tasks' starts and durations have to be scaled. This is a straightforward process, represented through the graph of figure 4.4.
Figure 4.4: Example representation of the manufacturing tasks' scaling. The tasks scale linearly, both in start and duration; this is the approach used to obtain each task's beginning and end, to check for asset clashes. [Figure: manufacturing bar and Tasks 1-3 over Time [TU].]
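This linear scaling, together with the clash check it enables, can be sketched as follows (Python rather than the thesis's R; the asset names are hypothetical).

```python
def scale_tasks(tasks, start, sampled_duration):
    # Turn each task's (asset, start fraction, end fraction) of the recipe's
    # manufacturing window into absolute intervals, scaled linearly to the
    # sampled duration, as in figure 4.4.
    return [(asset, start + s_pct * sampled_duration,
             start + e_pct * sampled_duration)
            for asset, s_pct, e_pct in tasks]

def find_clashes(intervals):
    # Report index pairs of tasks that need the same BA at overlapping times.
    # `intervals` is a list of (asset, start, end).
    clashes = []
    for i in range(len(intervals)):
        for j in range(i + 1, len(intervals)):
            a1, s1, e1 = intervals[i]
            a2, s2, e2 = intervals[j]
            if a1 == a2 and s1 < e2 and s2 < e1:
                clashes.append((i, j))
    return clashes
```

Two tasks clash only when they name the same asset and their intervals overlap; the pairwise check is adequate here, since campaigns use only a handful of BAs each.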
Using BAs as an aid in scheduling can be found in the literature, even applied to the pharmaceutical industry. Papavasileiou et al. [39] consider the main equipment when scheduling batches' activities, determining the recipe cycle time with such an approach; this is essentially a method for establishing which main operations take the longest, thereby enabling the start of new campaigns while others are still running, provided they only start after the recipe cycle time.
4.2.2 Implementation
Simply put, the implementation of the algorithm can be divided into three parts, which have been
theoretically described in the previous section. These are (1) the sampling of the Manufacturing and QR
durations, (2) the calculation of the monthly capacities and (3) the calculation of the BA’s utilization and
verification of clashes. All of these are calculated per iteration, which can be seen as the calculation
per reality or universe of values – different universes will feature different sampled values, which will
eventually lead to different utilized capacities and occupied assets. Note that while all these operations
could be calculated in the same loop, for reasons of code tractability and ease of modifications, the three
algorithms were separated. The effects of this decision in terms of code efficiency are further analyzed
in section 4.3.2.
The sampling algorithm is straightforward and is described in algorithm 2. Its main objective is to sample, for each iteration, all the manufacturing and QR durations of each project.
Algorithm 2 Durations sampling algorithm
 1: for i in iterations do
 2:     Initialize vectors for QR and manufacturing durations and start and end dates, with length equal to the number of projects in study
 3:     for campaign in planned orders do
 4:         Sample value for manufacturing duration
 5:         Sample value for QR duration
 6:         if Latest Start Date then
 7:             Set the end as the deadline minus a safety buffer
 8:             Set the start as the end minus the QR and manufacturing durations
 9:         else
10:             if Earliest Due Date then
11:                 Set the start as the planned start
12:                 Set the end as the start plus the manufacturing and QR durations
The capacity calculation algorithm is slightly more complicated than the sampling one and assumes one initial variable containing the information necessary for its correct operation. This variable, capacities, is a matrix with as many rows as there are projects and 9 different fields, as listed below. This information is extracted from the projects' recipes.
• Daily manufacturing effort
• Manufacturing effort for the BPR review
• QA effort after the manufacturing BPR review (pair of percentage of QR duration and effort)
• QA effort after the last QC RV stage (pair of percentage of QR duration and effort)
• QC IPC effort (pairs of percentage of manufacturing duration and effort)
• QC R effort (pairs of percentage of QR duration and effort)
• QC RV effort (pairs of percentage of QR duration and effort)
• Total warehouse effort during manufacturing
• Total warehouse effort during QR
Using this information and the values sampled with algorithm 2, the monthly capacities are calculated using algorithm 3. The monthly capacities are obtained (independently) for manufacturing, QA, QC IPC, QC R, QC RV and warehouse.
Algorithm 3 Monthly capacities calculation algorithm
 1: for i in iterations do
 2:     Collect vectors of processes durations, start and end for i
 3:     Initialize vectors of daily capacity for each area
 4:     for campaign in planned orders do
 5:         Determine range of days for manufacturing and QR
 6:         Add the daily manufacturing effort to each manufacturing day
 7:         Add the manufacturing BPR review effort to the first day of QR
 8:         Obtain the manufacturing days when QC IPC takes place and add the corresponding effort
 9:         Obtain the QR days when QC R takes place and add the corresponding effort
10:         Obtain the QR days when QC RV takes place and add the corresponding effort
11:         Obtain the QR day when manufacturing BPR review ends and add the QA BPR effort
12:         Obtain the QR day when the last QC RV ends and add the QA QC effort
13:         Add the manufacturing warehouse effort to the first day of manufacturing and the QR warehouse effort to the first day of QR
14:     Aggregate the daily efforts by month
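The core of steps 5, 6 and 14 (a daily effort vector accumulated per campaign, then bucketed by month) can be sketched as follows. This is a Python illustration of the R logic, and only for the manufacturing area; the field names and the fixed 30-day months are assumptions of the example.

```python
from collections import defaultdict

def manufacturing_daily_effort(campaigns, horizon_days):
    """Accumulate the daily manufacturing effort of all campaigns.

    Each campaign is a dict with integer 'mfg_start' and 'mfg_end' day
    indices and a 'daily_mfg_effort' (hypothetical field names).
    """
    daily = [0.0] * horizon_days
    for c in campaigns:
        for day in range(c["mfg_start"], c["mfg_end"]):
            daily[day] += c["daily_mfg_effort"]
    return daily

def aggregate_by_month(daily, days_per_month=30):
    """Step 14: sum the daily efforts into monthly buckets."""
    monthly = defaultdict(float)
    for day, effort in enumerate(daily):
        monthly[day // days_per_month] += effort
    return dict(monthly)
```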
After calculating the monthly capacities for every universe sampled, it is necessary to aggregate the results into a single monthly value with a variance measure. Although the mean and the standard deviation are the customary choices, the median and the IQR were deemed more appropriate, since they tend to be more representative of skewed distributions. The approach is simple: for each month, the capacity becomes the median of that month's capacities across all the iterations, with a variance measure equal to plus or minus 1 IQR.
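This monthly aggregation can be sketched as follows (Python for illustration; the thesis implementation is in R):

```python
from statistics import median, quantiles

def aggregate_month(capacity_samples):
    """Collapse one month's capacities across all iterations into the
    median with a plus/minus 1 IQR variance band."""
    med = median(capacity_samples)
    q1, _, q3 = quantiles(capacity_samples, n=4)  # quartiles of the samples
    spread = q3 - q1                              # interquartile range
    return {"median": med, "low": med - spread, "high": med + spread}
```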
Regarding the calculation of the BAs utilization, the procedure is described in algorithm 4. Note that a base structure variable is created before the cycle starts. This variable contains every asset utilized by every project and the start and end of the asset utilization as percentages of the whole manufacturing process, according to the recipe (similar to what is explained in figure 4.4). Pre-allocating this structure is useful, and it is possible because the processes are always the same within a simulation. Additionally, a variable containing the BAs for each project is loaded before the cycle.
Algorithm 4 BAs utilization calculation algorithm
1: Create base structure with project, asset and start and end percentages of manufacturing
2: for i in iterations do
3:     Collect vectors of manufacturing duration and start for i
4:     for campaign in planned orders do
5:         Multiply the sampled duration of the manufacturing process by the percentages of start and end of the tasks and sum the start date
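Step 5 of algorithm 4 reduces to a single affine transformation per task. A minimal Python sketch (the thesis code is in R; the dictionary keys are hypothetical):

```python
def place_asset_tasks(base_structure, mfg_start, mfg_duration):
    """Convert recipe percentages into concrete task dates.

    base_structure: list of dicts with 'asset', 'pct_start' and 'pct_end',
    the fractions of the whole manufacturing process (hypothetical names).
    """
    return [
        {
            "asset": t["asset"],
            "start": mfg_start + t["pct_start"] * mfg_duration,
            "end": mfg_start + t["pct_end"] * mfg_duration,
        }
        for t in base_structure
    ]
```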
This algorithm simply calculates the data necessary for creating a Gantt chart of the BAs' tasks, for each universe of sampled values. Two steps are then necessary: aggregating the iterations' results and detecting clashes between the tasks of a single asset. The aggregation once again employs the median and IQR as the central tendency and variance measures. However, it is not as straightforward as for the utilized capacities: the process varies depending on whether the simulation is run for EDD or LSD.
• EDD: the parameters used are each process's start and duration. The median and IQR are calculated for both parameters. The aggregated start becomes the median of the start, with a variation measure of plus or minus 1 IQR of the start. The aggregated end corresponds to the median start plus the median duration, while its variation measure is this value plus or minus the sum of the IQRs of the start and duration.
• LSD: the parameters used are each process's end and duration. The median and IQR are calculated for both parameters. The aggregated end becomes the median of the end, with variance equal to plus or minus 1 IQR of the end. The aggregated start corresponds to the median end minus the median duration, while its variance measure is plus or minus the sum of the IQRs of the end and duration. The processes tend to be much more variable using this type of simulation, since both the manufacturing and QR variability affect the results.
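The two aggregation rules above can be sketched as follows (an illustrative Python sketch of the R logic; the quartile-based IQR and the dictionary layout are assumptions of the example):

```python
from statistics import median, quantiles

def iqr(samples):
    """Interquartile range of a list of samples."""
    q1, _, q3 = quantiles(samples, n=4)
    return q3 - q1

def aggregate_task(starts, ends, durations, mode):
    """Aggregate one asset task across iterations per the EDD / LSD rules."""
    dur_med, dur_iqr = median(durations), iqr(durations)
    if mode == "EDD":
        # anchor on the start, derive the end
        s_med, s_iqr = median(starts), iqr(starts)
        return {"start": s_med, "start_var": s_iqr,
                "end": s_med + dur_med, "end_var": s_iqr + dur_iqr}
    # LSD: anchor on the end, derive the start
    e_med, e_iqr = median(ends), iqr(ends)
    return {"end": e_med, "end_var": e_iqr,
            "start": e_med - dur_med, "start_var": e_iqr + dur_iqr}
```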
After having a fixed result with a measure of variability, a last step has to be done: detecting whether there are any clashes between tasks of different campaigns on a single asset. To do so, algorithm 5 is followed. The result of this algorithm is the classification of each activity into one of three categories: no interference, interference, or possible interference. The first two categories are self-explanatory: if there is no clash on an activity, it is categorized as no interference; if there is at least one clash, it is categorized as interference. The possible interference category relates to tasks that have no interference using their median measures but do clash when considered by their worst-case IQR ranges. All activities are initially categorized as no interference.
Algorithm 5 BAs clash detection algorithm
1: for i in unique BAs do
2:     For every occurrence of asset i, obtain the range of dates that the asset is occupied, both for the median range and for the worst-case IQR range
3:     for j in occurrences of asset i do
4:         if Median range of occurrence j overlaps with the remaining median ranges then
5:             Occurrence j of asset i categorized as interference
6:         else
7:             if IQR range of occurrence j overlaps with the remaining IQR ranges then
8:                 Occurrence j of asset i categorized as possible interference
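The three-way classification of algorithm 5 can be sketched with simple interval-overlap checks (Python for illustration; the 'median' and 'worst' interval names are assumptions, with 'worst' being the median range widened by the IQR):

```python
def overlaps(a, b):
    """Two open intervals (start, end) overlap if neither precedes the other."""
    return a[0] < b[1] and b[0] < a[1]

def classify(occurrences):
    """Classify each occurrence of one asset, following algorithm 5."""
    labels = []
    for j, occ in enumerate(occurrences):
        others = [o for k, o in enumerate(occurrences) if k != j]
        if any(overlaps(occ["median"], o["median"]) for o in others):
            labels.append("interference")
        elif any(overlaps(occ["worst"], o["worst"]) for o in others):
            labels.append("possible interference")
        else:
            labels.append("no interference")
    return labels
```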
Additionally, an optimization stage can be performed, if so desired by the user. The objective of this optimization is to mitigate the BAs' interferences. Its formulation is presented in equation 4.3 and assumes the following variables. Consider BA = {BA_1, . . . , BA_m} = {BA_i}, i ∈ [1, m], the set of BAs chosen by the user, where i is the index of a BA and m is the total number of BAs. Each BA features a series of activities from different projects; for the BA with index i (BA_i), P_i = {P_{1_i}, P_{2_i}, . . . , P_{o_i}} = {P_{j_i}}, j_i ∈ [1, o_i], where j_i is the index of each activity of BA_i and o_i is the number of such activities.
$$
\begin{aligned}
\min \quad & n \\
& n = \sum \left[\, P_{j_i} \cap P_{k_i} \neq \emptyset \,\right] \\
\text{s.t.} \quad & i \in [1, m] \\
& j_i, k_i \in [1, o_i] \\
& j_i \neq k_i \\
& n \geq 0
\end{aligned}
\tag{4.3}
$$
The algorithm for clash minimization is described in algorithm 6. Note that this optimization process is done for a fixed scenario, considered as the aggregation of all the scenarios, which can be either the median or the median plus a measure of variability. This optimization also considers clashes with BAs regarding orders in the short- and medium-term timeframes (up to 3 months from the current date). These are used to verify that there are no clashes, but cannot be moved, since they are contained in a timeframe which does not allow for changes in scheduling.
Algorithm 6 BAs clash optimization algorithm
 1: Define if optimization is done by possible or actual interference
 2: for ba in BAs do                . Start by dealing with current-planned orders interactions
 3:     Obtain maximum finish date of current orders in ba as maxba
 4:     if maxba bigger than any start date in planned orders of ba then
 5:         Add time difference to affected planned orders
 6: while There are (possible) interferences do
 7:     Identify the BAs with (possible) interferences
 8:     for ba in BAs with (possible) interferences do
 9:         for activity in activities in ba do
10:             if Interference in activity then
11:                 Obtain the amount of interfered time i
12:                 Add i to the start of all the activities in the campaign corresponding to activity
13:     Update the campaigns start dates
14: Calculate capacities of the resulting scenario
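The shifting step at the heart of algorithm 6 can be sketched for a single asset as a greedy pass over its tasks sorted by start date (a Python illustration under simplifying assumptions: the real algorithm shifts every activity of the clashing campaign, not just the one task, and iterates until no interferences remain across all BAs):

```python
def declash(tasks):
    """Push each clashing task later by the amount of interfered time.

    tasks: list of dicts with 'start' and 'end'; mutated in place.
    """
    tasks = sorted(tasks, key=lambda t: t["start"])
    for prev, cur in zip(tasks, tasks[1:]):
        if cur["start"] < prev["end"]:           # interference detected
            shift = prev["end"] - cur["start"]   # the interfered time i
            cur["start"] += shift
            cur["end"] += shift
    return tasks
```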
After performing the optimization from algorithm 6, the user receives the new corresponding Gantt chart with the operations related to each BA, along with the capacity plots. However, both graphs are shown in a deterministic way, without any associated variance, because the variance cannot be propagated from the raw data through the optimization stage. Since the start dates of the operations are updated by the optimization, the solution for obtaining variance measures is to re-run the simulation with the new start dates, which have a greater probability of removing the interferences between BA operations. Note that it is not guaranteed that no interference will take place: due to the stochastic nature of the process, a new simulation will sample different values, and the results will not be the same, creating the possibility for new interferences that were not observed in the previous simulation.
All of these algorithms were implemented in R, a programming language mainly used for statistical computing and data science. The main reason for choosing this language over other more commonplace languages, such as Python, was its capabilities for designing intuitive and aesthetic frontend applications, which is a requirement for a DT (further explained in chapter 5). R is an extremely popular programming language nowadays, with a vast community, support and updates, which justifies its choice as the main programming language for this work.
4.3 Results
4.3.1 Convergence Analysis
Evaluating the convergence of the results is a crucial step, since it helps understand whether the simulation converges to a stable result and how many iterations it takes to do so. In terms of monthly capacity, it was seen that for most of the areas, and during months with considerable activity, the results did converge to a certain median monthly capacity. Interestingly, increasing the number of iterations tends to generate a normal probability distribution in most areas, as can be seen in figure 4.5, where the distributions for 10 to 50000 iterations are shown for the manufacturing and QA areas. Note that while the shape of the distributions varies significantly from one number of iterations to the next, the mean and median do not fluctuate greatly. While this may justify using a smaller number of iterations, it is riskier to draw conclusions regarding the central value from a simulation with 10 iterations than from one with 50000 iterations, since in the latter the distributions are much more developed and less influenced by outliers.
The panels of figure 4.5 are annotated with the following summary statistics (M on the left, QA on the right; x-axis: Duration [TU]):

Iterations   M: Mean / Median / SD / IQR          QA: Mean / Median / SD / IQR
10           11518.1 / 11444.9 / 339.5 / 210.5    352.1 / 350.9 / 54.3 / 51.3
50           11828.2 / 11848.4 / 347.2 / 377.1    328.8 / 326.9 / 65.7 / 106.4
100          11802.8 / 11809.4 / 371.0 / 482.2    308.9 / 303.2 / 57.2 / 79.1
500          11789.3 / 11781.3 / 341.0 / 453.7    321.3 / 318.9 / 63.2 / 84.2
1000         11771.6 / 11764.5 / 340.4 / 447.1    321.6 / 319.1 / 60.4 / 81.7
5000         11759.8 / 11760.1 / 345.2 / 463.4    321.6 / 319.6 / 61.5 / 85.3
10000        11783.2 / 11782.5 / 345.2 / 464.1    320.4 / 318.1 / 61.6 / 86.8
50000        11770.6 / 11769.6 / 347.5 / 476.6    321.1 / 319.6 / 60.8 / 83.7
Figure 4.5: Evolution of a month’s capacity distribution. The areas represented are manufacturing and QA, from 10 to 50000iterations. As can be seen, the monthly capacities start to form a normal distribution with the increase in iterations.
The corresponding evolution of all the areas' monthly capacities with the number of iterations, regarding the same month as shown in figure 4.5, is shown in figure 4.6. This graph shows the evolution of the median value, depicting the central tendency measure, with the median plus or minus one tenth of the IQR as the shaded area. The IQR was divided by 10 because otherwise the graph's limits would be affected and the information would not be shown as intended; the IQR is only meant to show the tendency of the variability along the number of iterations.
[Figure 4.6 consists of six panels (M, QA, QC IPC, QC R, QC RV and WH), plotting the median monthly capacity utilization (%) against the number of iterations, from 10 to 10000 and beyond.]
Figure 4.6: Evolution of the median of the monthly capacity utilization (%) by area. The shaded regions correspond to the medianplus or minus one tenth of the IQR, to give insights into the variability evolution.
Some conclusions are in order regarding the convergence of the capacities. First of all, note that results from different months vary greatly, depending on the number and mix of campaigns planned for the month. Secondly, it can be seen that the median does not vary greatly with the number of iterations. In fact, table 4.3 shows the percentage variation between the median at different numbers of iterations (discriminated by area) and the median of the 50000-iteration simulation, which is hereby considered the ground truth, since it is the most representative result. The table shows that the results are never too disparate from the 50000-iteration simulation, with arguably only the 10, 50 and 100 iteration simulations offering less-than-optimal results, as made clear by the absolute total of the variations per iteration count. Note that even the graphs that appear to feature greater variance in figure 4.6, when analyzed closely, show that variance on a small scale.
Iter     M        QA       QC IPC   QC R     QC RV    WH       Absolute Total
10       −2.8%    9.8%     0.1%     −1.7%    0.2%     −0.1%    14.6%
50       0.7%     2.3%     −0.5%    −0.4%    2.0%     0.3%     6.0%
100      0.3%     −5.1%    0.1%     2.5%     −0.4%    0.0%     8.5%
500      0.1%     −0.2%    −0.2%    0.0%     0.2%     0.0%     0.7%
1000     −0.04%   −0.1%    −0.1%    0.0%     0.0%     0.0%     0.3%
5000     −0.1%    0.00%    −0.1%    0.0%     −0.3%    0.0%     0.5%
10000    0.1%     −0.5%    0.02%    0.0%     −0.2%    0.0%     0.8%
Table 4.3: Monthly and area-wise relative error (with sign) of the median per iteration, compared to 50000 iterations
Additionally, the tendency of the distributions to become normal is not always observed and depends on the month and area in question. Observe, for example, the graphs from figure 4.7, which show the probability distributions for the 3 QC areas and the warehouse, for the month in study in figure 4.5 and a simulation of 50000 iterations. The results appear to follow an approximately normal shape (the distribution for the QC R area, for example, follows a normal tendency, even though it is clearly not a normal distribution). Nevertheless, there are results which feature distributions without such a clear pattern; examples of these cases are shown in figure 4.8. These tend to happen for less represented months or under unusual conditions, but even in those cases convergence is verified, with little error from the ground-truth capacity. In conclusion, and only with the capacity calculation in mind, simulations of 500 iterations appear to offer reliable results; keeping the number of iterations small is also desirable for computational cost, as can be seen in section 4.3.2.
[Figure 4.7 consists of four panels (QC_IPC, QC_R, QC_RV and WH), with Duration [TU] on the x-axis.]
Figure 4.7: Evolution of a month’s capacity distributions for QC and warehouse. The results shown are for a simulation at 50000iterations.
Figure 4.8: Non-normal examples of distributions at 50000 iterations
Regarding the BAs' utilization graphs, the convergence was tested in a slightly different manner. Instead of testing the convergence of a single activity's start or end, analogously to what was done for the capacities, the total numbers of asset interferences, possible interferences and no interferences were counted, and their evolution with the number of iterations used for the simulation was monitored. The results are shown in figure 4.9.
As can be seen in figure 4.9, the convergence of the number of operations with interferences is clear,
decreasing approximately 50% from simulations with 10 iterations to simulations with 50000 iterations.
The convergence of this parameter can be seen at around 500 iterations per simulation. Additionally,
it can be seen that the BAs with possible interference tend to increase until a certain point and then
Figure 4.9: Evolution of the number of BA interferences versus the number of iterations
gradually decrease. This is actually very predictable behavior: the number of possible interferences increases while the number of actual interferences decreases. This happens because, after a BA loses its interferences, it automatically becomes a possible interference; during this regime, interferences are converted into possible interferences, while the BAs with no interferences remain approximately stable. After the number of interferences converges and stagnates, the number of possible interferences starts to decline. This can be explained by the fact that increasing the number of iterations per simulation will likely decrease the variation measured in the start and end of the tasks. This reduces the range for possible interference, effectively reducing the number of tasks with possible interferences.
4.3.2 Code Efficiency
The algorithms implemented often take a long time to run due to the sheer amount of data processed and generated. Consider a simulation with 50000 iterations, the largest number of iterations simulated; additionally, for the results shown here, the number of planned orders was 299. During the first loop, the sampling algorithm samples values for each campaign's manufacturing and QR and stores 4 vectors with these durations and the start and end of each campaign. This is done for each iteration and then aggregated, leading to 299 [campaigns] · 4 [vectors] · 50000 [iterations] = 59800000 values in this loop. All these values proceed to the capacities calculation loop. There, 6 vectors (one per area) containing 1500 values each are created, which are then aggregated by month and stored. Although the final amount of data from this loop is only 25 [months] · 6 [areas] · 50000 [iterations] = 7500000 values, this loop is much more computationally demanding than the other 2 loops. Finally, the BAs' utilization loop just returns a structure with the start and end of each BA utilized by each project. For the tested scenarios this corresponded to 305 [BA-project pairs] · 3 [fields] · 50000 [iterations] = 45750000 values. Note that the clash detection and optimization algorithms are not considered in terms of efficiency because they are performed on the aggregated data and only once, which means that the time taken for these operations will always be significantly smaller.
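The three data volumes above follow directly from multiplying the scenario parameters, as a quick check (Python for illustration, using the counts stated in the text):

```python
campaigns, iterations = 299, 50000

# Sampling loop: 4 vectors (mfg duration, QR duration, start, end) per campaign
sampling_values = campaigns * 4 * iterations

# Capacities loop: 25 months x 6 areas per iteration, after daily aggregation
capacity_values = 25 * 6 * iterations

# BA utilization loop: 305 project-asset tasks x 3 fields per iteration
ba_values = 305 * 3 * iterations
```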
For these reasons, performing simulations with a substantial number of iterations may often take up to 1 hour, which justifies applying some strategies to reduce the time spent on simulation. The first strategy was identifying the code bottlenecks and developing alternative implementations for those sections. This was done using Profvis, a profiling tool for the R language [13]. With this tool, the bottlenecks were successfully identified and some of them were mitigated. The second and most significant approach was the application of parallel computing to the loops. This was done using the function parLapply from the parallel package of R [56]. This function requires the creation of a cluster before the computation, which receives the number of CPU cores to be used. Since this thesis was performed on a computer with an 8-core CPU, and it is often recommended to leave one core free for other uses, 7 cores were used for the parallel computations. The results of using parallel computation versus regular computation are shown in figure 4.10.
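The pattern of mapping iterations onto a fixed pool of workers can be illustrated as follows. This is a Python sketch only (the thesis uses parLapply over a process cluster in R); a thread pool is used here to keep the example self-contained, and simulate_iteration is a hypothetical stand-in for one Monte Carlo universe. Creating the pool plays the role of the cluster setup, a fixed overhead that only pays off for large numbers of iterations.

```python
from concurrent.futures import ThreadPoolExecutor
import random

def simulate_iteration(seed):
    """Stand-in for one Monte Carlo universe (sampling + capacities)."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(1000))

def run_parallel(n_iterations, workers=7):
    """Analogue of parLapply over a 7-worker cluster: map each iteration
    onto the pool; results come back in iteration order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate_iteration, range(n_iterations)))
```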
[Figure 4.10 has three panels (1: Main Loop, 2: Capacities Loop, 3: BA's Utilization Loop), showing the duration [s] and the relative advantage of parallel over non-parallel computing versus the number of iterations.]
Figure 4.10: Code Efficiency of regular versus parallel computation
The results shown in figure 4.10 are extremely interesting. They show that non-parallel computing is actually more efficient for smaller numbers of iterations. This happens because of the creation of the cluster for parallel computation: before each loop the cluster has to be created, and after the loop it has to be destroyed, a process that takes around 1.5 s. In simulations with a small number of iterations, 1.5 s is rather substantial, and the results are therefore dominated by this setup process. When the simulations start to take longer, this setup time ends up diluted in the total simulation time and the advantages become noticeable, frequently achieving runtimes over 50% better than the non-parallel ones. Although this may not seem like much, at 50000 iterations a non-parallel simulation takes an overall time of 42.83 minutes, while a parallel one takes only 18.07 minutes, effectively reducing the duration by 24.76 minutes, a 57.82% reduction.
As explained in section 4.2.2, the calculations of capacities and BAs' utilization could all be performed in the same loop. However, for tractability and ease of code maintenance, they were separated into three distinct functions. The effects of this decision on efficiency are shown in figure 4.11. The results are predictable for few iterations: due to the setup times of the cluster, it is clear that
Figure 4.11: Code Efficiency – 3 loops vs a single loop
the separated-loop algorithm takes longer to process, since it involves the creation of 3 clusters rather than just one, as in the joined-loop algorithm. However, for simulations with more iterations, the separated-functions algorithm becomes slightly more efficient. Although it is not clear why this happens, the main conclusion that can be taken from this test is that there is no big advantage or disadvantage in efficiency between the aggregated and the separated algorithms, enabling the choice of the separated algorithm for the reasons presented before.
One last efficiency test was performed, to check the impacts of using 7 CPU cores instead of 8 or 6.
The results of this test are shown in figure 4.12.
[Figure 4.12 has three panels (1: Main Loop, 2: Capacities Loop, 3: BA's Utilization Loop), showing the duration [s] for 6, 7 and 8 CPU cores and the relative advantage of using 7 cores, versus the number of iterations.]
Figure 4.12: Code Efficiency – 6 vs 7 vs 8 CPU cores used. Note that the graph on the bottom shows the advantages of using 7cores instead of using either 6 or 8 cores, e.g., a value of +50% means that using 7 cores takes 50% less time.
The results show that no clear conclusion can be drawn from using 7 versus 6 or 8 CPU cores. One possible explanation for the fluctuation of the results is the system instability created by using all 8 cores and not leaving a single core for other required operations, which may actually degrade the results, together with the reduced performance expected in principle from the 6-core simulation. It can be concluded that there is no advantage in performing the parallel computation with 6 or 8 CPU cores instead of 7, and that doing so may even bring unjustified instability to the system.
In conclusion, and considering 500 iterations as the ideal amount for a simulation, parallel computation takes 12.79 seconds while non-parallel computation takes 21.78 seconds. The advantage of using parallel computation is clear, even though the times are not long. This means that simulations with larger numbers of iterations can be run in acceptable times, with the benefits of parallel computation increasing with the number of iterations.
4.3.3 Validation
To validate the results obtained by the simulation tool, data from past campaigns was used. The method followed was to specify a date range where the study would be performed (this excluded the first 3 months of the chosen range, the short- and medium-term timeframes, but since the date was empirically chosen, this could be neglected). After having a specific range of dates, the orders executed in that range were extracted, along with their planned start date, actual start date, actual manufacturing duration and actual QR duration. Orders that did not have all the fields were filtered out.
With the remaining orders, the planned start date was used for the simulation process, following the methods described in section 4.2.2. Note that the type of simulation run was EDD, since at the CDMO in study the orders are processed based on their start date and not on their deadline, and, for validation purposes, no data could be used to verify a simulation run by LSD. The simulation was then run and the monthly capacities were calculated for every iteration (500 iterations) and aggregated. For the actual consumed capacities, the approach followed was the conversion of the orders into capacities using the assumptions described in section 4.2.1. This was the only possible approach, since there is no data regarding the actual consumed capacity per order; the values of consumed capacity are taken directly from the recipe, which does not perfectly translate reality. Following these methodologies for the simulated and real capacities, the results were obtained and are shown in figure 4.13.
Note that the values in the graph shown in figure 4.13 are masked by a multiplicative factor applied to all values, for confidentiality reasons. This hides the true values, but the comparisons and relative deviations remain correct. By visual inspection of the graph, it can be seen that the error between the actual monthly capacities and the simulated ones does not appear to be large, and that the correct result is often within 2 IQRs, occasionally being within 1 IQR. In fact, tables 4.4 and 4.5 show the numeric figures behind the graph.
The results from tables 4.4 and 4.5 show a series of interesting behaviors in the data. Note that the average used is the absolute average (the average of the absolute values), since it translates the tendency of the relative and absolute errors better than other common measures of aggregating error, such as the root mean squared error. It can be seen that the absolute average of the relative error is generally around 10%, which is a very good estimate for a rough-cut simulation tool. The worst case occurs in QA; by visual inspection of the corresponding graph from figure 4.13, it can be seen that the IQR of the QA utilized capacity is often much larger than that of other areas. This derives directly from the fact that the PDFs that model the QR duration are often very dispersed and QA processes are the last processes
[Figure 4.13: four panels (Manufacturing, QA, QC split into IPC, Release and Release Review, and WH) showing the utilized capacity per month, from Month 4 to Month 11.]
Figure 4.13: Capacities validation graph. The graphs of the four areas are shown with the QC graph being divided into the 3sections. The shaded regions over each bar correspond to the real capacities, while the color bars correspond to the simulated
capacities. The error bars are distinguished between 1 IQR and 2 IQRs: the broader-width error bars correspond to 2 IQRs, whilethe narrower-width bar corresponds to 1 IQR.
Month     Manufacturing        QA                   Warehouse
          % Err     Err        % Err     Err        % Err     Err
4         2.5%      252.2      38.0%     15.6       3.5%      5.9
5         7.7%      1020.6     19.3%     −97.6      0.4%      1.1
6         7.0%      −1216.9    4.7%      28.0       2.1%      −6.1
7         1.5%      186.5      42.0%     136.7      5.4%      9.0
8         13.6%     −2067.9    16.4%     −77.5      8.4%      18.7
9         4.0%      −590.3     2.5%      11.0       17.0%     −43.6
10        11.0%     −348.2     28.1%     −145.0     36.1%     9.6
11        −         −          82.1%     −19.2      −         −
Abs Avg   6.7%      811.8      29.2%     66.3       10.4%     13.4

Table 4.4: Monthly relative and absolute (with sign) errors for manufacturing, QA and warehouse. The absolute average is calculated as the average of the absolute values, which generates a better measure of variability.
during such a stage, which results in a wider range of values for the QA efforts, generating more varied capacity distributions that feature greater variability when aggregated. This means that although the relative error tends to be greater in this area, this behavior is actually expected and accounted for, as made evident by the larger IQR.

It is important to highlight that all the areas besides QA and QC release review only have values for 7 months instead of 8. This happens because QA and QC release review comprise the final tasks performed, which means that campaigns starting at the end of the sixth month of the range (month 9, the last month in which orders were allowed to start) may have had their QA processes happening during month 11.
Month     QC IPC               QC Release           QC Release Review
          % Err     Err        % Err     Err        % Err     Err
4         3.2%      −33.0      36.3%     38.0       12.6%     13.2
5         5.3%      74.3       0.5%      1.4        14.7%     30.2
6         0.8%      −12.0      7.3%      −24.1      8.7%      −33.1
7         17.1%     199.0      11.2%     22.8       16.9%     34.3
8         14.7%     −210.0     2.7%      −6.6       3.7%      −9.2
9         4.5%      −61.6      5.0%      −13.0      32.1%     72.1
10        25.1%     39.3       20.4%     −23.4      36.0%     −72.6
11        −         −          −         −          37.3%     −3.6
Abs Avg   10.1%     89.9       11.9%     18.5       20.2%     33.5

Table 4.5: Monthly relative and absolute (with sign) errors for QC IPC, release and release review. The absolute average is calculated as the average of the absolute values, which generates a better measure of variability.
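The absolute average used to summarize tables 4.4 and 4.5 is simply the mean of the absolute values, which, unlike the signed mean, does not let positive and negative errors cancel out. A minimal Python sketch:

```python
def absolute_average(errors):
    """Average of the absolute values of a list of (signed) errors."""
    return sum(abs(e) for e in errors) / len(errors)
```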
Table 4.6 shows an aggregation of the counts and percentages of occurrences per area of real
monthly capacities being within either the 1 or 2 IQRs.
IQRs            Manufacturing   QA       QC IPC   QC R     QC RV    Warehouse
1     Count     2               3        2        3        3        1
      %         28.6%           37.5%    28.6%    42.9%    37.5%    14.3%
2     Count     6               7        4        6        6        3
      %         85.7%           87.5%    57.1%    85.7%    75.0%    42.9%
Table 4.6: Occurrences of real monthly capacities being within the 1 or 2 IQR
As can be seen from the results presented in table 4.6, when considering the range of 2 IQRs, the vast majority of the real monthly capacities are contained within said range of the simulated capacities. For the warehouse, which features the lowest percentage of occurrences at only 42.9%, it can be seen from table 4.4 that the relative error is generally small. What explains the lack of adherence to the IQR is its small values, derived from the small variances that the warehouse effort features. This could also be seen in the graph referring to the warehouse in figure 4.6, regarding convergence: the monthly warehouse consumed capacity converges while fluctuating only approximately 1 hour per month, which is extremely small. Although this does not unequivocally explain the low variability of the area, it is a reliable indicator.
The validation of the BAs' utilization Gantt chart was not performed for two main reasons. First, while the simulation was based on the planned start date, the actual start date was often not the same, sometimes fluctuating by considerable amounts. While the effects of such events did not greatly impact the monthly capacities, the BAs' utilization is measured on a continuous scale, with tasks taking hours or days; changes in the start date would greatly impact the allocation time range of the assets, which would make the real versus simulated occupation almost non-comparable. Secondly, the BAs' utilization is a mere accessory to the capacity calculation, and its results should not be used strictly, but instead as a reference for how the operations tend to happen so as to avoid asset clashes. In fact, a schedule of asset utilization made one year or more in advance is not reliable at all, and the objective of its inclusion in this work is more on the basis of predicting the assets' influence on utilized capacity and how solving the conflicts would affect the overall monthly capacities.
4.3.4 Prediction
The main objective of the algorithms described is to generate future predictions of capacity utilization in the productive and support areas. The approaches followed were described and justified in section 4.2.1 and the implementation methods in section 4.2.2. Furthermore, the convergence of the results and the efficiency of the algorithms were tested in sections 4.3.1 and 4.3.2. Lastly, the results were validated against past scenarios in section 4.3.3. The objective of all these steps was the creation of trustworthy results that predict, in an approximate fashion, how future capacities will be utilized, as per the definition of RCCP.
For the prediction, a specific scenario had to be followed. The number of iterations run was 500 and the type of simulation was EDD. A total of 299 planned orders (3 months to 2 years out) and 248 current orders (present to 3 months out) were considered. Note that these are actual orders at the CDMO in study. However, the names of projects and assets are hashed, the months are replaced by a sequence of months (month 1, month 2, ...) and, whenever quantities are shown, they are scaled by a hidden factor. These steps are required for confidentiality reasons, but the modifications still allow the patterns and tendencies to be conveyed.
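The overall procedure can be illustrated with a minimal sketch. Python is used here for illustration only, since the tool in this work is implemented in R; the orders, distributions and function names are hypothetical placeholders, not the CDMO's data. Each iteration samples task durations from a demonstrated-performance distribution (a triangular placeholder here) and the monthly capacities are then aggregated by median and IQR, mirroring the error bars of figure 4.14:

```python
import random
import statistics

def simulate_monthly_capacity(orders, n_iter=500, horizon=24, seed=42):
    """Monte Carlo RCCP sketch: each order consumes `effort` hours spread
    over the months it spans; durations are sampled from a (placeholder)
    triangular distribution instead of the fixed recipe value."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_iter):
        monthly = [0.0] * horizon
        for month, effort, lo, mode, hi in orders:
            # sample a demonstrated duration (in days) for this iteration
            duration = rng.triangular(lo, hi, mode)
            end_month = min(horizon - 1, month + int(duration // 30))
            span = end_month - month + 1
            for m in range(month, end_month + 1):
                monthly[m] += effort / span
        runs.append(monthly)
    # aggregate the iterations per month: median and IQR
    summary = []
    for m in range(horizon):
        values = sorted(run[m] for run in runs)
        q = statistics.quantiles(values, n=4)
        summary.append({"median": statistics.median(values), "iqr": q[2] - q[0]})
    return summary

# hypothetical orders: (start month, effort hours, duration lo/mode/hi in days)
orders = [(0, 120.0, 20, 30, 45), (1, 80.0, 10, 15, 25)]
result = simulate_monthly_capacity(orders)
```

The aggregation across iterations is what allows the convergence behaviour discussed in section 4.3.1 to be checked month by month.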
Given the described scenario, the simulation was run and the results were obtained. The first and most important result is the monthly capacities by area, shown in figure 4.14. The same capacities can also be seen in figure 4.15, where they are plotted as a percentage of the maximum capacity utilized.
It can be seen from figure 4.15 that the capacity utilization tends to roughly follow the expected pattern, exemplified in figure 2.7. Although the behavior is not perfect, it is a particularly good approximation, which is the objective of an RCCP tool. Furthermore, although the months have been anonymized, it is possible to observe the seasonality inherent to productive operations. Another interesting point, especially in the manufacturing area, is that the utilized capacity is generally capped at around 75%. This is a customary practice in manufacturing operations: setting an upper boundary on capacity utilization in order to account for delays, unexpected events or priority orders (when customers pay a premium for expedited manufacturing).
The Gantt chart of the BA's utilization for this simulation is shown in figure 4.16. Note that this is a subset of the BAs, which is sufficient to show the tendencies of the assets. The choice of which assets are treated as BAs is left to the user (a set of BAs is predefined, but further customization is possible).
The Gantt chart from figure 4.16 shows the utilization of the specific assets and how different projects may compete for a BA at a given time. When the ERP creates the base schedule, it considers the durations of the processes as they appear in the recipe; in contrast, these algorithms simulate the demonstrated performance. This means that there may be some clashes between tasks which, although often not large, should be considered. Note that the interferences shown in red usually cover extremely small regions. This is a consequence of the ERP schedule, which, even though it may not be the most precise, generally accounts for the majority of the tasks' duration.

Figure 4.14: Forecasted capacity evolution per month and area. The full color bars correspond to the actual simulated capacity for the month and area in question, with an error bar indicating ±1 IQR. The grey bars correspond to the capacities from the current orders, which, even though they are orders starting in the first 3 months, often have effects on the following months. The shaded background area corresponds to the limit capacity of each area, per month.

Figure 4.15: Percentage of maximum capacity utilized per month and area.

Figure 4.16: Gantt chart of the BA's utilization. Tasks color-coded as grey are current orders, which can no longer be modified; green tasks do not have any kind of interference; red tasks have schedule interference with other task(s), considering the median as the aggregation criterion for the simulations; yellow tasks indicate possible interference, when tasks' schedules collide considering their extended start and end (median start minus IQR and median end plus IQR).
After running the initial simulation and obtaining the results in terms of utilized capacity and BA's utilization, the optimization of the assets can be performed. The objective of the optimization is to remove any BA interferences. The user may choose the interference criterion: either the medians of the tasks' beginnings and ends, or their extended values, considering 0.25, 0.5, 1 or 2 times the IQR of the beginning and end. The optimization shown here was performed considering 0.5 IQR. The results of this optimization can be seen in figure 4.17 for the Gantt chart of the BA's utilization and in figure 4.18 for the resulting capacity utilization.
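The interference criterion can be sketched as an overlap test on IQR-extended intervals. This is an illustrative Python sketch (the thesis tool is written in R), and `resolve` is a hypothetical greedy shift strategy used only to show the idea of pushing tasks apart, not necessarily the optimization algorithm of section 4.2.2:

```python
def extended_interval(med_start, med_end, iqr_start, iqr_end, k):
    """Extend a task's median start/end by k IQRs, as in the optimization
    confidence options (k = 0 corresponds to the pure-median criterion)."""
    return med_start - k * iqr_start, med_end + k * iqr_end

def interferes(task_a, task_b, k):
    """Two tasks on the same BA interfere if their extended intervals overlap."""
    a_lo, a_hi = extended_interval(*task_a, k)
    b_lo, b_hi = extended_interval(*task_b, k)
    return a_lo < b_hi and b_lo < a_hi

def resolve(tasks, k):
    """Greedy sketch: sort the tasks of one BA by median start and push each
    task right until its extended interval no longer overlaps the previous one."""
    tasks = sorted(tasks, key=lambda t: t[0])
    out = [tasks[0]]
    for t in tasks[1:]:
        prev_hi = extended_interval(*out[-1], k)[1]
        lo = extended_interval(*t, k)[0]
        if lo < prev_hi:
            shift = prev_hi - lo
            t = (t[0] + shift, t[1] + shift, t[2], t[3])
        out.append(t)
    return out

# hypothetical tasks: (median start, median end, IQR of start, IQR of end), days
tasks = [(0, 10, 1, 1), (9, 20, 2, 2)]
assert interferes(tasks[0], tasks[1], k=0.5)
fixed = resolve(tasks, k=0.5)
assert not interferes(fixed[0], fixed[1], k=0.5)
```

The larger k is, the further apart the tasks end up, which is why the stronger optimization types spread the capacity over more months.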
Figure 4.17: Gantt chart of the BA's utilization after optimization. The grey rectangles correspond to tasks regarding current orders (which were not and cannot be changed during the optimization) and the green rectangles correspond to the remaining orders, whether modified or not.
The graphs in figures 4.17 and 4.18 show the BA's utilization and the utilized capacities for a specific scenario, generated by the optimization of the assets' utilization. This fixed scenario has an associated capacity consumption. Regarding the first graph, it can be seen that the tasks are much more spread out when compared with the Gantt chart from figure 4.16. This is expected, since the task allocations had to be expanded so as not to clash with each other, with a confidence of 0.5 IQR. Note that one project may have several tasks on different BAs, which means that if one task is affected, all the tasks of the project are affected. If the simulation had been run to mitigate clashes between tasks considering only the median, these would generally be closer together.

Figure 4.18: Monthly capacities per area after optimization. Note that the graphs do not include error bars since this corresponds to a fixed scenario and not an aggregation of multiple scenarios.

Regarding the capacity
utilization graphs, the new tendency is quite predictable given the optimized scenario's BA utilization Gantt chart. Since the tasks have larger gaps between them, the capacities tend to be more evenly spread out, instead of being mostly concentrated in the initial months. This can be observed in the graphs of figure 4.18, where in all 4 areas the maximum monthly utilized capacity is greatly reduced (excluding the capacities derived from the current orders) and the capacity of later months is generally increased. The effects of the distinct types of optimization are shown in table 4.7, where, for manufacturing, QA and QC IPC, the monthly capacity percentage (of total available capacity) is shown for the 6 initial months of the medium-term timeframe.
From the results shown in table 4.7, a few conclusions regarding the data tendencies and patterns can be reached. First of all, fixing the initial month and one area, the percentage decreases across optimization types, with an average reduction per optimization step of −2.57% for manufacturing, −0.64% for QA and −1.41% for QC IPC. This pattern of reduction continues consistently throughout the initial 3 months (4-6), with the third month having the highest reduction per optimization for the 3 areas: −4.27% for manufacturing, −1.38% for QA and −2.80% for QC IPC. Months 7 and 8 do not show a clear pattern, with manufacturing and QC IPC continuing to decrease per optimization type, but QA already increasing. On the last month (month 9), however, all areas increase their average variation of consumed capacity percentage per optimization to positive values: +0.95% for manufacturing, +1.21% for QA and +1.99% for QC IPC. This can be seen as the turning point month, from which the capacities seen in the base optimization start to increase. These values express precisely what can be seen when comparing the graphs of utilized capacity before and after optimization (figures 4.14 and 4.18): the capacities of the initial months decrease, while after a certain month they start to increase, converging to a more uniform capacity distribution. Furthermore,
this behavior is much more visible the greater the optimization type (considering the median optimization as the "smallest" and the 2 IQRs optimization as the "greatest").

Table 4.7: Consumed capacity percentage for months 4-9, for manufacturing, QA and QC IPC, and for each type of optimization. Base refers to the non-optimized results, while the subsequent columns refer to optimization by median, 0.25 IQRs, 0.5 IQRs, 1 IQR and 2 IQRs.

Manufacturing
Month   Base    Median  0.25 IQRs  0.5 IQRs  1 IQR   2 IQRs
4       54.1%   53.5%   51.6%      48.0%     45.3%   41.2%
5       33.2%   34.0%   30.6%      25.9%     22.8%   21.1%
6       44.9%   51.4%   42.3%      35.8%     29.3%   23.5%
7       39.2%   47.5%   38.6%      33.0%     32.3%   24.1%
8       42.9%   41.6%   41.8%      39.9%     33.5%   30.8%
9       15.3%   8.8%    23.5%      31.4%     26.7%   20.0%

QA
Month   Base    Median  0.25 IQRs  0.5 IQRs  1 IQR   2 IQRs
4       63.1%   62.9%   60.6%      60.4%     60.3%   59.9%
5       24.9%   23.5%   25.6%      23.7%     24.0%   21.7%
6       29.0%   34.7%   30.8%      25.8%     22.0%   22.1%
7       21.6%   43.7%   40.1%      36.8%     35.6%   22.0%
8       12.4%   29.3%   26.0%      27.6%     24.4%   23.2%
9       25.7%   15.8%   25.9%      37.2%     33.8%   31.8%

QC IPC
Month   Base    Median  0.25 IQRs  0.5 IQRs  1 IQR   2 IQRs
4       36.1%   36.1%   35.5%      33.2%     31.0%   29.0%
5       20.9%   20.8%   19.1%      16.6%     15.3%   17.8%
6       31.1%   31.6%   25.4%      23.0%     18.3%   17.1%
7       29.9%   30.6%   26.1%      22.9%     22.9%   16.1%
8       36.7%   36.7%   35.7%      33.5%     30.7%   30.5%
9       15.1%   14.4%   25.4%      22.2%     20.8%   25.0%
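The per-optimization averages quoted for month 4 can be reproduced from the rounded table values; small deviations from the reported figures come from rounding in the table. An illustrative Python check (not part of the thesis tool):

```python
def avg_step(row):
    """Average change between consecutive optimization types (Base -> 2 IQRs)."""
    return sum(b - a for a, b in zip(row, row[1:])) / (len(row) - 1)

# month 4 rows from table 4.7 (Base, Median, 0.25, 0.5, 1, 2 IQRs), in %
manufacturing = [54.1, 53.5, 51.6, 48.0, 45.3, 41.2]
qa = [63.1, 62.9, 60.6, 60.4, 60.3, 59.9]
qc_ipc = [36.1, 36.1, 35.5, 33.2, 31.0, 29.0]

print(round(avg_step(manufacturing), 2))  # close to the reported -2.57
print(round(avg_step(qa), 2))             # matches the reported -0.64
print(round(avg_step(qc_ipc), 2))         # close to the reported -1.41
```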
Another interesting pattern is the average variation of the utilized capacity along the months, for each optimization type and area. Here, for all areas and all optimization types, the value is always negative, which makes sense, since utilization tends to be greater in the initial months, due to a larger number of planned orders. With minor exceptions, the values behave as could be predicted: the larger the type of optimization, the smaller the monthly capacity reduction. In the base scenario, the first month features a large capacity, with the second being much smaller and so on, quickly arriving at no capacity at all; the month-to-month variation is large. In an optimized scenario, where the tasks are more evenly spread, the monthly capacities are also more evenly distributed along the months: the first month has a certain capacity, the second a slightly smaller one and so on, producing lower monthly capacity reductions.
The optimization process is performed to reduce BA utilization interferences and to generate the capacity utilization of the optimized scenario. However, after it has been run, a good practice is to rerun the simulation with the altered start dates and deadlines, in order to obtain many scenarios and their aggregation for a situation that is less likely to feature interferences.
The capacities presented, although generally delivered in hours of effort, can quite easily be converted to the average number of workers per shift, which is the usual output of RCCP tools. In fact, both measures are directly proportional: a larger number of hours of effort directly requires a larger number of workers per shift. The calculation to be performed is shown in equation 4.4 (this equation is similar to the application shown in equation 2.1). Note that this calculation considers an average of 30 days per month and a total of 3 shifts per day, making 22 hours of daily work. The result is the average number of workers that should be working at any given time to successfully deliver the required hours of effort.

W = \frac{Cap}{30 \cdot 22} = \frac{Cap}{660} \quad [w] \qquad (4.4)
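Equation 4.4 translates directly into code; the following is an illustrative Python sketch with a hypothetical function name:

```python
def avg_workers(capacity_hours):
    """Convert monthly effort hours into the average number of workers per
    shift, assuming 30 days/month and 22 working hours/day (3 shifts),
    as in equation 4.4."""
    return capacity_hours / (30 * 22)

assert avg_workers(660) == 1.0   # 660 h of monthly effort needs one worker at all times
assert avg_workers(1650) == 2.5
```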
In conclusion, the results from this simulation-based RCCP can be considered trustworthy, given the rough nature of the tool. Furthermore, the results obtained are supported by the demonstrated performance of activities performed in the past, which is a clear advantage over commonly used approaches to the problem, which rely on recipe information that is often imprecise.
Chapter 5
Digital Twin User Interface
By definition, the concept of Digital Twin is deeply intertwined with a virtual representation of the assets in study. This means that a user-friendly, interactive and uncomplicated way of showing the data and allowing for user input becomes a necessity and can be of great value within an organization.

For these reasons, a UI for the Digital Twin was implemented, using the front-end capabilities of the R language and, more specifically, the Shiny package [12]. The UI can be divided into two parts: visualization and simulation. Note that the latter regards how the simulation tool is delivered to the users and how they can intuitively interact with it, not how the simulation is performed (see chapter 4). All the tabs of the UI are displayed in appendix B.
The way data is conveyed is extremely important for the recipient to correctly and quickly detect patterns and, ultimately, draw conclusions. There are many ways of showing information, and these can greatly modify the user's perception of the data in question: different types of graphs may show the same data in different ways, highlighting different patterns and behaviors. It is therefore important to select the most adequate way of representing the information, which often varies substantially from graph to graph.
Additionally, since this final tool is intended to be used by decision-makers initially unfamiliar with it, each tab of the UI includes a FAQ containing usage instructions and explaining what is on-screen. An example of such a help modal window is shown in figure B.7 of appendix B.
5.1 Visualization
The developed UI features 5 tabs regarding data visualization. Each tab has its purpose, with some
of them focusing on current and past activities and some on key performance indicators (KPIs). The
tabs are Overview, (activities) By Building, (activities) By Project, KPIs and Schedule.
The first tab, Overview, has the objective of delivering a very high-level view of the factory in study, featuring a map of the production plants and a set of gauge plots measuring the shortest-timeframe KPIs of the selected view. This is accompanied by a selector, through which the user can choose the global view or a specific productive area or building. Maps are extremely useful and effective ways of delivering data with geographical meaning. In this tab, the map acts as an aid in identifying each building's purpose (in the global view) or as a geographical locator of the buildings of a specific area or of a single building. The information conveyed in the map changes with the selector. Figure 5.1 shows examples of displays that the map can output.
Figure 5.1: Examples of maps shown in the Overview tab. The first map shows the identification of a single building; the second, the identification of all the buildings in a certain area; the third shows the global view, indicating the areas in which each building operates.

On the Overview tab, the most significant KPIs are shown for the current week, and these are the ones relevant to the selected option. Changing the selection changes not only the values but also which KPIs are displayed; the most significant ones for the manufacturing area may not be the ones that best describe the warehouse's
performance, for example. The KPIs are shown in gauge plots, a graphical way of showing a value that is ideally bilaterally bounded; they are often used to show performance, since this usually comes in percentage form. A set of two gauge plots is then shown, measuring the two most significant KPIs for the selected view.
The second tab deals with (activities) By Building. Its objective is simply to show the current, past and future activities being performed in a specific building, selected by the user. The UI presents a map of the plant, a slider for defining the range of dates shown, and a table containing information according to the selected date range and building. This table presents the activities taking place, specifying the building, production line, project, planned start date and expected finishing date, and, for activities that have already taken place, the actual start and finishing dates. The map takes two forms: the complete map with all the buildings (with the table showing data for the whole plant) and the map zoomed in on a single building (after being clicked by the user). These map states are shown in figure 5.2.
This tab offers a way of showing the activities taking place at a specific building. This can be par-
ticularly useful for decision-makers to understand the current operations of each building and how they
have performed in the past.
The third tab (activities) By Project is similar to the second one, but offers the activities’ information
based on the project that is being produced, rather than on the building where it is being made. The tab
is characterized by having a selector, where the user can choose which project to see, a network graph,
showing the INs necessary for the production of the FP, a map showing on which buildings a certain
project has productive activities and a table with the same information as the table from the second tab,
but filtered by activities regarding the selected project. An example of a combination of network plus
map is shown in figure 5.3.
This way of showing information can be useful for users who need to check the activities of a single project, see in which buildings they are produced and understand the usual duration of a project's processes.
Figure 5.2: Map before and after being clicked. Note that the crosshair button on the top left of the map resets the map view.
Figure 5.3: Network graph of a project and corresponding map of buildings
The fourth tab shifts its scope from activities to KPIs. Its objective is to show the KPIs and their current and past values, on a weekly, monthly or yearly timeframe, filtered by area or building. The tab features a selector (similar to the one on the Overview tab) where the user can choose the global view or a specific area or building, a set of the 6 most relevant KPIs (their current values) for the selected scope, shown in gauge plots, and a line chart depicting the evolution of the KPIs through time. This tab offers many options to the user, so that the most appropriate view can be shown. First of all, and similarly to the Overview tab, the 6 most relevant KPIs depend on the selected area or building. The values presented in the gauge plots refer to the current week's KPIs. If the user wants to check the historical evolution of one or more KPIs, a selection button is present on each gauge plot, which activates the line chart with the historical information for the selected KPI(s). The user can then choose to view the evolution of the weekly, monthly or yearly KPI. Furthermore, there is a selection
button to activate the normalization of the data between 0 and 1. This allows two KPIs with different scales to be plotted simultaneously; while the information regarding the absolute value of each KPI is lost, their evolutions can be compared. Lastly, the line chart allows zooming in on sections. Note that the KPIs are calculated externally to this work, whose only objective is to bring together all the KPIs from different areas and dates. Some KPIs may be added later, but the framework for their inclusion is already constructed and it is merely a task of modifying the endpoints of data exchange.
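The normalization behind this button is a standard min-max scaling. A minimal sketch follows (illustrative Python with a hypothetical function name; the UI performs this in R):

```python
def normalize(series):
    """Min-max scale a KPI series to [0, 1] so that differently scaled KPIs
    can share one axis (absolute values are lost, trends are preserved)."""
    lo, hi = min(series), max(series)
    if hi == lo:                      # flat series: map to 0 to avoid division by zero
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]

assert normalize([10, 20, 30]) == [0.0, 0.5, 1.0]
```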
The fifth and last visualization tab is the Schedule tab, focused on delivering the activities schedule per production line, in a Gantt-like fashion. The tab features a slider for defining the range of dates shown and two selectors: one for choosing whether any filter should be applied and, if so, whether by building or by project, and another for choosing the desired projects or buildings (multiple choices can be selected at the same time). A graph then presents the activities, with the production lines on the y axis, the date on the x axis and the occupation represented as bars, color-coded according to the project they refer to. Figure 5.4 shows an example of the schedule for a given range of time. Note that the activities, projects and production lines are not real values, for confidentiality reasons; the objective is to illustrate how the data is shown, not the data itself.
Figure 5.4: Example representation of the schedule of activities
Several packages from the R programming language were used for the creation of the different plots.
The maps were created using the Leaflet package [14]; the gauge plots and line chart were created
using the Billboarder package [40]; the project network was created using the VisNetwork package [2]
and the schedule Gantt-style graph was created using the ggplot2 package [59]. Additionally, the tables
were generated and shown using the DT (DataTables) package [60].
5.2 Simulation

Interaction with the simulation can be viewed as one of the most important parts of the UI. It should
be easy and intuitive, but dense with information and complete, in order for the decision-makers that use
the tool to be able to get insights into the projects, modify the simulation according to their needs and
access its results in a straightforward way. To this end, the UI features 2 tabs on the topic of simulation.
The first is a Project Database, while the second is the RCCP itself.
The first tab of the simulation category offers a project database where, simply put, the user has access to all the projects that can be produced in the production plants. This tab offers diverse information about the projects, which are selected through two selectors: the first allows the user to choose the maincode of the project and the second the project itself. A set of information is then displayed. First, a table shows the recipe of the project, indicating the version of the recipe and the individual steps taken, along with information regarding each step, its duration and the associated effort. A Gantt chart is also displayed as a graphical representation of the recipe, allowing for better inspection of the data it contains. This graph and table give the user better insights into the processes involved in the production of the chosen project, their durations and their efforts. An example of the Gantt chart translating a project's recipe is shown in figure 5.5.
Figure 5.5: Gantt representation of the recipe of a project
Additionally, the BOM is displayed, showing the materials used for the production of the project. A representation of the probability distributions that model the project's manufacturing and QA durations is also shown: both the PDFs that were fitted to the available data and the data itself. This allows the user not only to check the typical observed durations of the processes, but also to verify whether the durations used in the recipes are accurate, and to draw conclusions on how accurate the automatic scheduling made by the ERP is; some projects may be accurately scheduled, while others, whose recipes do not translate the reality, will certainly be less accurate. A distribution of the adherence to the planned start date, along with the number of observations used in the fitting process, is also shown. Examples of the PDFs of a project are shown in figure 5.6. Lastly, a table is presented containing the history of the selected project's orders, including information such as planned start date, actual start date, manufacturing duration, QA duration and quantity produced.
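The fitting idea behind these PDFs (minimizing the chi-square goodness of fit, as described in chapter 4) can be sketched as a simple grid search. This is an illustrative Python approximation with hypothetical data and a normal candidate distribution, not the thesis implementation:

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def chi_square(data, edges, mu, sigma):
    """Chi-square GoF statistic of a normal(mu, sigma) against binned data."""
    n = len(data)
    stat = 0.0
    for lo, hi in zip(edges, edges[1:]):
        observed = sum(1 for d in data if lo <= d < hi)
        expected = n * (normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma))
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat

def fit_by_cs(data, edges, mus, sigmas):
    """Pick the (mu, sigma) pair on a grid that minimizes the CS statistic."""
    return min(((m, s) for m in mus for s in sigmas),
               key=lambda p: chi_square(data, edges, *p))

# hypothetical durations (days) for one project's manufacturing step
durations = [28, 30, 31, 29, 33, 27, 30, 32, 29, 31]
edges = [25, 28, 30, 32, 35]
mu, sigma = fit_by_cs(durations, edges, mus=range(26, 35), sigmas=[1, 2, 3, 4])
```

In the actual tool, several theoretical distribution families are candidates and the minimization is performed by a proper optimizer rather than this coarse grid.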
Figure 5.6: Example of PDFs of the manufacturing and QR durations and of the adherence to the planned start date of a project. Note that the vertical lines indicate the durations of the processes referenced in the recipe, even though they frequently do not match the observed reality.

Regarding the final tab, the RCCP, it can be considered the most complex and dense tab, featuring simulation and optimization processes, a plethora of parameters available for the user to modify, tables and graphs with results and initial parameters, filters to modify all the displayed information, and modal windows to add and edit information. The basic objective of the tab is to run the simulation, but doing
so requires a series of parameters. Most of these do not require user input, since they have predefined values, but they can be modified if desired. The initial step is to load the planned orders. A table then presents these, displaying each project's name, start date and deadline. The displayed orders can be modified by selecting the desired order and pressing the corresponding button, and new orders may be added. For the simulation process, a series of parameters can be adjusted, such as the number of iterations or the confidence level. Additionally, two parameters are compulsory for the simulation to be run: the name of the simulation (for storage purposes) and the type of simulation (EDD or LSD). An estimated duration of the simulation is presented. After the simulation is run, a table with an overview of the results is shown, as well as the Gantt chart of the BA's utilization and the monthly capacities, similar to figures 4.16 and 4.14. At this point, selecting an order in the planned orders table highlights that order's tasks on the Gantt chart and its impact on the monthly capacities.
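The two simulation types dispatch the planned orders in different priority orders. A hedged sketch follows, assuming EDD sorts by earliest due date and LSD by latest admissible start date (deadline minus expected duration); the order data is a hypothetical placeholder:

```python
from datetime import date, timedelta

# hypothetical planned orders: (name, planned start, deadline, expected days)
orders = [
    ("P1", date(2019, 4, 1), date(2019, 6, 30), 20),
    ("P2", date(2019, 4, 10), date(2019, 5, 15), 10),
    ("P3", date(2019, 3, 20), date(2019, 7, 31), 40),
]

def edd(orders):
    """EDD: dispatch orders by earliest due date (deadline)."""
    return sorted(orders, key=lambda o: o[2])

def lsd(orders):
    """LSD (assumed here as latest start date): dispatch by the latest date
    an order can start and still meet its deadline."""
    return sorted(orders, key=lambda o: o[2] - timedelta(days=o[3]))

assert [o[0] for o in edd(orders)] == ["P2", "P1", "P3"]
```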
After the base simulation is run, the user can choose to optimize the results, following the methodology described in section 4.2.2. The type of optimization simply has to be chosen: median, 0.25 IQRs, 0.5 IQRs, 1 IQR or 2 IQRs. After the optimization is done, the results are updated, becoming similar to the ones shown in figures 4.17 and 4.18. At this point the user has 3 alternatives: run a new optimization, show the non-optimized results, or update the start dates and deadlines, allowing for a new simulation process that will theoretically have fewer BA interferences. During the entire process, the information is stored under the project name defined by the user, with the creation timestamp added, so that any simulation can be accessed later, giving users the ability to either continue working on a specific scenario or simply compare their simulation with others. Additionally, a button is included to export the simulation results into a .pdf file, enabling offline access to the results, which can also be printed if desired.
Chapter 6
Conclusions
The main objective of this thesis was the creation of a DT of the internal supply chain at the CDMO
in study. This tool should aid in monitoring the performance indicators across and within areas of the
internal SC and be able to conduct scenario-based forecasting, specifically, through a simulation-based
RCCP tool, capable of predicting monthly capacity utilization.
To this end, the internal SC processes were mapped and extensively studied, to fully understand the connections between the areas and the usual workflows across the different projects and campaigns. The processes' durations, starts and ends were collected from the ERP, and the probability distributions that define the manufacturing and QR processes were created in order to describe their observed variability. The fitting of theoretical PDFs to the historical data was accomplished through an extensive statistical analysis and an optimization process, with the objective of minimizing the CS GoF. After the projects' durations and their inherent variability had been modelled, the simulation-based RCCP tool was constructed, built on the concept of demonstrated performance. The tool used Monte Carlo simulation as its simulation engine, allowing multiple scenarios to be run and a convergence of the monthly utilized capacity to be found. Furthermore, the tool provided a representation of how the BAs were utilized and of any interferences between projects using a single BA at the same time. An optimization algorithm was also implemented with the objective of removing (or reducing) the interferences. Regarding visualization, the tool offered intuitive ways of delivering visibility to its users, effectively concentrating information from a series of disperse sources and allowing it to be filtered according to the users' needs.
Overall, the objectives defined for this project were fulfilled and, in some cases, additional functionalities were even added.
6.1 Achievements
The developed tool was successful in the two objectives that were proposed: visualization and simulation.
The visualization component of the tool delivers intuitive views into the operations happening at the production plant at a given time; several methods of delivering these views were implemented so that users can view the information in the way that best fits their needs. Additionally, the key performance indicators can be viewed per area, building or globally, measured by week, month or year, with the possibility of viewing each KPI's historical data. Insights into the task schedule are also delivered, allowing users to filter the data by date, project or production building. Users can also access each project's information, historical data, BOM, task recipe and probability distributions.
All of these views are supplied with information from the company's ERP system, when the data is
available, and endpoints were created for receiving new data that may be generated in the future.
The simulation tool created was able to successfully generate accurate forecasts (within the scope of
a rough-cut tool) of the necessary monthly capacity, i.e., the monthly percentage of the maximum
capacity needed. The simulations run were validated and showed promising results for a rough-cut
tool, with the possibility of improving the base capacity-utilization estimates made by the ERP, since
demonstrated-performance values were considered instead of the values strictly derived from the recipe.
Furthermore, the ability to add new orders and generate new scenarios brings a clear advantage to
this tool. The optimization algorithms offered enable the tool to be more conservative in asset
allocation, effectively removing the need for time buffers, or at least allowing them to be greatly reduced.
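As a sketch of the simulation engine, the toy example below (Python, with entirely synthetic project data and simplified capacity arithmetic — not the actual tool) samples each project's duration from its fitted distribution on every Monte Carlo run and averages the resulting monthly utilized capacity over many runs, which is the quantity that converges:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative portfolio: (start month, negative-binomial params, hours/day)
projects = [(0, (8, 0.5), 10.0), (1, (12, 0.6), 8.0), (2, (6, 0.4), 12.0)]
N_MONTHS, DAYS_PER_MONTH, MAX_HOURS = 6, 30, 400.0  # monthly plant capacity

def simulate_once():
    """One Monte Carlo run: sample each project's duration (in days) and
    spread its workload over the months that it occupies."""
    hours = np.zeros(N_MONTHS)
    for start, (n, p), hours_per_day in projects:
        duration_days = int(rng.negative_binomial(n, p)) + 1
        first_day = start * DAYS_PER_MONTH
        for d in range(first_day, first_day + duration_days):
            month = d // DAYS_PER_MONTH
            if month < N_MONTHS:
                hours[month] += hours_per_day
    return 100.0 * hours / MAX_HOURS  # monthly utilized capacity [%]

# The running mean over many scenarios converges to the expected utilization
totals, runs = np.zeros(N_MONTHS), 2000
for _ in range(runs):
    totals += simulate_once()
mean_utilization = totals / runs
print(np.round(mean_utilization, 1))
```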
6.2 Future Work
Future work for this project can be divided into two levels:
• DT framework: how it can be improved in the data it collects and how it can translate the current
and past states of the production plants more effectively and comprehensively.
• Simulation-based RCCP tool: how the model that supports the tool can be improved, with the
ultimate goal of generating more precise predictions.
6.2.1 Quality & Quantity of Data
Future work largely involves implementing more measurement points, both at the shop-floor level
(such as power measurements of the assets, temperatures and pressures) and at the corporate level,
improving and increasing the information contained in the ERP. The concept of edge computing, per-
forming data processing at the sensor level, would be extremely useful for the tool. It would allow the
data collected at the shop-floor level to be processed on-site, delivering less raw, meaningless data
to the storage systems and more processed information instead. Improving the data collection and
processing systems would result in a DT with more and better information, enabling users to access
everything happening (or that happened in the past) in a given production plant, on the productive,
logistical and even managerial levels, without leaving their office.
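As an illustration of this kind of on-site processing, the hypothetical sketch below (Python; not tied to any specific sensor platform, and all names are illustrative) collapses a window of raw shop-floor readings into a compact summary record before anything is sent to the storage systems:

```python
from statistics import mean, stdev

def summarize_window(readings, asset_id, quantity):
    """Collapse a window of raw sensor readings (e.g. power, temperature,
    pressure) into a compact record suitable for upstream storage."""
    return {
        "asset": asset_id,
        "quantity": quantity,
        "n": len(readings),
        "mean": round(mean(readings), 2),
        "std": round(stdev(readings), 2) if len(readings) > 1 else 0.0,
        "min": min(readings),
        "max": max(readings),
    }

# One minute of 1 Hz power readings becomes a single summary record
raw = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2] * 10
record = summarize_window(raw, asset_id="reactor-01", quantity="power_kW")
print(record)
```

The design choice is the usual edge-computing trade-off: raw samples stay at the sensor, and only the information the DT actually consumes crosses the network.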
6.2.2 Improving the Models
Several improvements could be made to the simulation tool in the future. Many of them concern the
PDFs that model the process durations. First, a correlation between the durations of the manufacturing
and the QR processes has been observed in some projects; this could be studied to achieve better
combinations of manufacturing and QR durations when sampling values. Additionally, the batch size
may influence the manufacturing duration, and since the batch size is known a priori, this correlation
could also be studied. The difference between the planned start date and the actual start date of the
projects could also be modelled. Some projects have also been seen to feature multimodal distributions
(especially in the QR duration). This multimodal behavior could be explained through the manufacturing-
QR correlation, but investigating the phenomenon could be useful in generating better predictions.
Finally, weights should be added to the campaigns when fitting the theoretical PDFs to the real observed
durations, in the sense that more recent campaigns should be more representative of the current reality
than older ones.
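The campaign-weighting idea could be prototyped with a simple recency weight, for example an exponential decay on campaign age (illustrative Python; the half-life and the weighting scheme are assumptions, not something prescribed by this work):

```python
import numpy as np

def recency_weights(ages_in_months, half_life=12.0):
    """Exponential-decay weights: a campaign `half_life` months old counts
    half as much as a current one. Weights are normalized to sum to 1."""
    w = 0.5 ** (np.asarray(ages_in_months, dtype=float) / half_life)
    return w / w.sum()

def weighted_mean_duration(durations, ages_in_months):
    """Recency-weighted mean duration; the same weights could multiply the
    per-observation terms of a goodness-of-fit statistic when fitting PDFs."""
    w = recency_weights(ages_in_months)
    return float(np.dot(w, durations))

# Recent campaigns (small age) pull the estimate toward their durations
durations = [10.0, 10.0, 20.0, 20.0]
ages = [36, 36, 1, 2]  # months since each campaign
print(weighted_mean_duration(durations, ages))
```

Here the unweighted mean would be 15.0, while the recency-weighted estimate lies closer to the two recent 20-unit campaigns.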
Possibly, the best way to achieve good estimates for the projects' durations, considering all the con-
ditions already in effect and the ones described here as future work, would be the application of black-
box models that would receive a project's code, start date, deadline, batch size and other meaningful
metrics and output the predicted durations of the manufacturing and QR processes. Artificial neural
networks are computing systems that could provide excellent results for this problem and are compar-
atively easy to implement: given enough data, they are able to achieve comparable, and frequently
superior, results relative to heuristic methods, without nearly as much effort.
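As a toy sketch of this black-box approach (Python/NumPy, with fully synthetic data standing in for ERP features such as batch size and start date — not a proposal for the final architecture), a one-hidden-layer neural network can be trained with plain gradient descent to map project features to a duration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature matrix: [batch size, start month, product-code id]
X = rng.uniform(0, 1, size=(200, 3))
# Synthetic "true" durations with a nonlinear dependency on the features
y = 5.0 + 4.0 * X[:, 0] + 2.0 * np.sin(3.0 * X[:, 1]) + rng.normal(0, 0.1, 200)

# One-hidden-layer network: 3 inputs -> 8 tanh units -> 1 output
W1 = rng.normal(0, 0.5, size=(3, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, size=(8, 1)); b2 = np.zeros(1)
lr = 0.05

for _ in range(2000):  # batch gradient descent on the squared error
    H = np.tanh(X @ W1 + b1)        # hidden activations
    pred = (H @ W2 + b2).ravel()    # predicted durations
    err = pred - y
    # Backpropagation of the mean squared error
    dW2 = H.T @ err[:, None] / len(y); db2 = err.mean(keepdims=True)
    dH = err[:, None] @ W2.T * (1 - H ** 2)
    dW1 = X.T @ dH / len(y); db1 = dH.mean(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2; W1 -= lr * dW1; b1 -= lr * db1

rmse = float(np.sqrt(np.mean(err ** 2)))  # RMSE of the final iteration
print(f"training RMSE: {rmse:.3f} time units")
```

In practice one would use an established library and a held-out validation set; the point of the sketch is only that such a model consumes the same features the ERP already stores.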
Bibliography
[1] K. Alicke, J. Rachor, and A. Seyfert. Supply chain 4.0 – the next-generation digital supply chain. Technical report, McKinsey & Company, June 2016. https://www.mckinsey.com/business-functions/operations/our-insights/supply-chain-40--the-next-generation-digital-supply-chain.
[2] Almende B.V., B. Thieurmel, and T. Robert. visNetwork: Network Visualization using ’vis.js’ Library,
2019. URL https://CRAN.R-project.org/package=visNetwork. R package version 2.0.6.
[3] Anylogic. Alstom develops a rail network digital twin for railway yard design and predictive fleet maintenance. Anylogic Case Studies, Apr 2018. URL https://www.anylogic.com/digital-twin-of-rail-network-for-train-fleet-maintenance-decision-support/?utm_source=white-paper&utm_medium=link&utm_campaign=digital-twin.
[4] T. B. Arnold and J. W. Emerson. Nonparametric goodness-of-fit tests for discrete null distributions.
R Journal, 3(2):34–39, 2011. doi:10.32614/RJ-2011-016.
[5] AVATA. S&OP/IBP Express, September 2015. Slide 6. Accessed on 2019/08/28. https://www.slideshare.net/christinabergman/avata-sop-ibp-express-53194300.
[6] M. Bajer. Dataflow In Modern Industrial Automation Systems. Theory And Practice. ABB Corporate
Research Krakow, Poland, 2014.
[7] J. E. Beasley. OR-Notes: master production schedule. Brunel University London, 1990. URL http://people.brunel.ac.uk/~mastjjb/jeb/or/masprod.html.
[8] R. N. Bolton, J. R. McColl-Kennedy, L. Cheung, A. Gallan, C. Orsingher, L. Witell, and M. Zaki.
Customer experience challenges: bringing together digital, physical and social realms. Journal of
Service Management, 29(5):776–808, 2018. doi:10.1108/JOSM-04-2018-0113.
[9] K. Bruynseels, F. Santoni de Sio, and J. van den Hoven. Digital twins in health care: eth-
ical implications of an emerging engineering paradigm. Frontiers in genetics, 9:31, 2018.
doi:10.3389/fgene.2018.00031.
[10] E. M. Carter and H. W. W. Potts. Predicting length of stay from an electronic patient record system:
a primary total knee replacement example. BMC medical informatics and decision making, 14(1):
26, 2014. doi:10.1186/1472-6947-14-26.
[11] S. Cateni, V. Colla, and M. Vannucci. A fuzzy logic-based method for outliers detection. In Artificial
Intelligence and Applications, pages 605–610, 2007.
[12] W. Chang, J. Cheng, J. Allaire, Y. Xie, and J. McPherson. shiny: Web Application Framework for R,
2019. URL https://CRAN.R-project.org/package=shiny. R package version 1.3.1.
[13] W. Chang, J. Luraschi, and T. Mastny. profvis: Interactive Visualizations for Profiling R Code, 2019.
URL https://CRAN.R-project.org/package=profvis. R package version 0.3.6.
[14] J. Cheng, B. Karambelkar, and Y. Xie. leaflet: Create Interactive Web Maps with the JavaScript
’Leaflet’ Library, 2018. URL https://CRAN.R-project.org/package=leaflet. R package version
2.0.2.
[15] V. Choulakian, R. A. Lockhart, and M. A. Stephens. Cramér-von Mises statistics for discrete distributions. Canadian Journal of Statistics, 22(1):125–137, 1994. doi:10.2307/3315828.
[16] A. Costigliola, F. A. Ataıde, S. M. Vieira, and J. M. Sousa. Simulation model of a quality control
laboratory in pharmaceutical industry. IFAC-PapersOnLine, 50(1):9014–9019, 2017.
[17] J. F. Cox and J. H. Blackstone. APICS dictionary. Amer Production & Inventory, 2002.
[18] A. C. Cullen and H. C. Frey. Probabilistic Techniques in Exposure Assessment: a handbook for
dealing with variability and uncertainty in models and inputs. Springer Science & Business Media,
1999.
European Federation of Pharmaceutical Industries and Associations. The pharmaceutical industry in figures. EFPIA, 2018. URL https://efpia.eu/publications/downloads/efpia/2018-the-pharmaceutical-industry-in-figures/.
Gartner Top 10 Strategic Technology Trends for 2019, Oct 2018. https://www.gartner.com/smarterwithgartner/gartner-top-10-strategic-technology-trends-for-2019/.
[21] E. Glaessgen and D. Stargel. The digital twin paradigm for future NASA and US Air Force vehicles.
In 53rd AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics and Materials Conference,
page 1818, 2012. doi:10.2514/6.2012-1818.
[22] F. E. Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 11(1):
1–21, 1969. doi:10.1080/00401706.1969.10490657.
[23] S. I. Haider. Pharmaceutical master validation plan: the ultimate guide to FDA, GMP, and GLP
compliance. CRC Press, 2001.
[24] S. Hawkins, H. He, G. Williams, and R. Baxter. Outlier detection using replicator neural networks.
In International Conference on Data Warehousing and Knowledge Discovery, pages 170–180.
Springer, 2002. doi:10.1007/3-540-46145-0-17.
[25] F. S. Hillier, G. J. Lieberman, B. Nag, and P. Basu. Introduction To Operations Research. Mc Graw
Hill Education, sie tenth edition, 2017. ISBN 978-93-392-2185-0.
[26] D. Ivanov, A. Dolgui, A. Das, and B. Sokolov. Digital supply chain twins: Managing the ripple effect,
resilience, and disruption risks by data-driven optimization, simulation, and visibility. In Handbook
of Ripple Effects in the Supply Chain, pages 309–332. Springer, 2019.
[27] W. Kritzinger, M. Karner, G. Traar, J. Henjes, and W. Sihn. Digital twin in manufacturing: A
categorical literature review and classification. IFAC-PapersOnLine, 51(11):1016–1022, 2018.
doi:10.1016/j.ifacol.2018.08.474.
[28] A. M. Law. Simulation Modeling and Analysis. McGraw Hill Education, International Fifth edition,
2015. ISBN 978-1-259-25438-3.
[29] J. Lee, E. Lapira, B. Bagheri, and H.-a. Kao. Recent advances and trends in predictive
manufacturing systems in big data environment. Manufacturing letters, 1(1):38–41, 2013.
doi:10.1016/j.mfglet.2013.09.005.
[30] M. R. Lopes, A. Costigliola, R. M. Pinto, S. M. Vieira, and J. M. Sousa. Novel governance model
for planning in pharmaceutical quality control laboratories. IFAC-PapersOnLine, 51(11):484–489,
2018.
[31] G. S. Maddala and K. Lahiri. Introduction to Econometrics. Macmillan New York, Second edition,
1992. ISBN 978-0-02-374545-4.
[32] A. M. Madni, C. C. Madni, and S. D. Lucero. Leveraging digital twin technology in model-based
systems engineering. Systems, 7(1):7, 2019.
[33] J. T. Mentzer, W. DeWitt, J. S. Keebler, S. Min, N. W. Nix, C. D. Smith, and Z. G. Zacharia. Defining
supply chain management. Journal of Business logistics, 22(2):1–25, 2001. doi:10.1002/j.2158-
1592.2001.tb00001.x.
[34] A. Mullard. 2018 FDA drug approvals. Nature Reviews – Drug Discovery, January 2019. URL
https://www.nature.com/articles/d41573-019-00014-x.
[35] J. A. Nelder and R. Mead. A simplex method for function minimization. The computer journal, 7(4):
308–313, 1965. doi:10.1093/comjnl/7.4.308.
[36] Oracle Applications. Overview of capacity planning, November 1997. Accessed on 2019/10/07. https://docs.oracle.com/cd/A60725_05/html/comnls/us/crp/ovwcp.htm.
[37] Oracle Help Center. Overview to resource requirements planning, February 2013. URL https://docs.oracle.com/cd/E26228_01/doc.93/e21770/ch_over_resrc_req_pln.htm#WEAMP231. Accessed on 2019/10/07.
[38] Oracle Help Center. JD Edwards EnterpriseOne Applications Requirements Planning Implementation Guide: planning production capacity, 2014. URL https://docs.oracle.com/cd/E64610_01/EOARP/plng_production_capacity.htm#EOARP00393. Accessed on 2019/09/09.
[39] V. Papavasileiou, A. Koulouris, C. Siletti, and D. Petrides. Optimize manufacturing of pharmaceutical
products with process simulation and production scheduling tools. Chemical Engineering Research
and Design, 85(7):1086–1097, 2007. doi:10.1205/cherd06240.
[40] V. Perrier and F. Meyer. billboarder: Create Interactive Chart with the JavaScript ’Billboard’ Library,
2019. URL https://CRAN.R-project.org/package=billboarder. R package version 0.2.5.
[41] Pharmaceutical Research and Manufacturers of America. 2019 PhRMA Annual Membership
Survey. PhRMA, 2019. URL https://www.phrma.org/report/2019-phrma-annual-membership-
survey.
[42] B. Piascik, J. Vickers, D. Lowry, S. Scotti, J. Stewart, and A. Calomino. Materials, structures,
mechanical systems, and manufacturing roadmap. NASA TA, pages 12–2, 2012.
[43] D. Pomerantz. The French connection: Digital twins from Paris will protect wind turbines against battering North Atlantic gales. GE Reports, April 2018. URL https://www.ge.com/reports/french-connection-digital-twins-paris-will-protect-wind-turbines-battering-north-atlantic-gales/.
[44] Reby Media. Engineering matters ep. 4 – the rise of the digital twin, July 2018. https://engineeringmatters.reby.media/2018/07/23/4-the-rise-of-the-digital-twin/.
[45] S. Rehana. Making a digital twin supply chain a reality. ASUG, November 2018. URL https://www.asug.com/news/making-a-digital-twin-supply-chain-a-reality.
[46] A. Robinson. The rise of the digital supply chain begets 5 huge benefits. Cerasis, February 2016.
URL https://cerasis.com/digital-supply-chain/.
[47] J. Rowley. The wisdom hierarchy: representations of the DIKW hierarchy. Journal of information
science, 33(2):163–180, 2007.
[48] B. Scholkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating
the support of a high-dimensional distribution. Neural computation, 13(7):1443–1471, 2001.
doi:10.1162/089976601750264965.
[49] S. Scoles. A digital twin of your body could become a critical part of your health care. Slate, 2016. URL https://slate.com/technology/2016/02/dassaults-living-heart-project-and-the-future-of-digital-twins-in-health-care.html.
[50] N. Shah. Pharmaceutical supply chains: key issues and strategies for optimisation. Computers &
chemical engineering, 28(6-7):929–941, 2004. doi:10.1016/j.compchemeng.2003.09.022.
[51] M. Sharma and J. P. George. Digital twin in the automotive industry: Driving physical-digital convergence (white paper). Technical report, TATA Consultancy Services, December 2018. https://www.tcs.com/content/dam/tcs/pdf/Industries/manufacturing/abstract/industry-4-0-and-digital-twin.pdf.
[52] R. Spicar and M. Januska. Use of Monte Carlo modified Markov Chains in capacity planning.
Procedia Engineering, 100:953–959, 2015.
[53] H. Sugita. A mathematical formulation of the monte carlo method. In Monte Carlo Method, Random
Number, and Pseudorandom Number, pages 9–21. Mathematical Society of Japan, 2011.
[54] Supply Chain Resource Cooperative SME. Capacity planning. NC State University, January 2011.
URL https://scm.ncsu.edu/scm-articles/article/capacity-planning.
[55] Supply Chain Resource Cooperative SME. Capacity utilization. NC State University, January 2011.
URL https://scm.ncsu.edu/scm-articles/article/capacity-utilization.
[56] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019. URL https://www.R-project.org/.
[57] J. W. Tukey. Exploratory Data Analysis. Addison-Wesley, 1977.
[58] P. H. Westfall. Kurtosis as peakedness, 1905–2014. RIP. The American Statistician, 68(3):191–
195, 2014. doi:10.1080/00031305.2014.917055.
[59] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. ISBN
978-3-319-24277-4. URL https://ggplot2.tidyverse.org.
[60] Y. Xie, J. Cheng, and X. Tan. DT: A Wrapper of the JavaScript Library ’DataTables’, 2018. URL
https://CRAN.R-project.org/package=DT. R package version 0.5.
Appendix A
Goodness-of-fit Tests Comparison
The results of the optimization of the theoretical PDFs according to the different goodness-of-fit
tests are shown here, as described in the section regarding Distribution Fitting. Note that the results
are shown for the three example distributions of Figures 3.3, 3.4 and 3.9, considering the Negative
Binomial PDF as the theoretical PDF. The results presented for each example are a table showing the
different goodness-of-fit values for each optimization (Tables A.1, A.2 and A.3) and the PDFs and CDFs
of the fitted distributions (Figures A.1, A.2 and A.3).
A.1 Example 1
Goodness-of-fit Test   CS      KS     CVM    W      AD
Chi-Squared            17.8    0.19   0.41   0.16   2.36
Kolmogorov-Smirnov     69.95   0.12   1.76   1.45   10.21
Cramér-von Mises       20.12   0.21   0.15   0.15   0.89
Watson                 20.36   0.21   0.15   0.15   0.89
Anderson-Darling       20.53   0.2    0.15   0.15   0.85
Table A.1: Goodness-of-fit values for the optimizations run for the first example. Note that the first column denotes the
goodness-of-fit test chosen for the optimization process; the remaining columns show the results from the other GoFs for that
optimization, enabling the comparison between results of different optimizations.
[Figure: PDF and CDF histograms of count versus duration (time units), with one fitted curve per goodness-of-fit test: Anderson-Darling, Chi-Squared, Cramér-von Mises, Kolmogorov-Smirnov and Watson]
Figure A.1: Results of the fitted distributions for the first example, optimized according to the different goodness-of-fit tests considered in this study.
A.2 Example 2
Goodness-of-fit Test   CS      KS     CVM    W      AD
Chi-Squared            38.39   0.15   0.27   0.26   1.82
Kolmogorov-Smirnov     44.31   0.09   0.77   0.17   5.35
Cramér-von Mises       38.57   0.14   0.27   0.26   1.84
Watson                 52.01   0.11   1.14   0.15   7.74
Anderson-Darling       38.59   0.15   0.29   0.29   1.74
Table A.2: Goodness-of-fit values for the optimizations run for the second example.
[Figure: PDF and CDF histograms of count versus duration (time units), with one fitted curve per goodness-of-fit test]
Figure A.2: Results of the fitted distributions for the second example.
A.3 Example 3
Goodness-of-fit Test   CS       KS     CVM     W          AD
Chi-Squared            144.68   0.13   0.57    0.13       3.28
Kolmogorov-Smirnov     152.93   0.07   0.13    0.11       0.87
Cramér-von Mises       157.45   0.08   0.11    0.11       0.78
Watson                 NaN      1      85.01   3.43E-19   NaN
Anderson-Darling       158.74   0.09   0.11    0.11       0.74
Table A.3: Goodness-of-fit values for the optimizations run for the third example.
[Figure: PDF and CDF histograms of count versus duration (time units), with one fitted curve per goodness-of-fit test except Watson]
Figure A.3: Results of the fitted distributions for the third example. Note that in this example the Watson GoF was removed, since the result it provided was not a good fit and would jeopardize the visualization of the remaining curves. The results are still presented in Table A.3, however.
Appendix B
Digital Twin User Interface
Screenshots
Figure B.1: Overview tab screenshot
Figure B.2: Activities by building tab screenshot
Figure B.3: Activities by project tab screenshot
Figure B.4: KPIs tab screenshot
Figure B.5: Projects schedule Gantt chart tab screenshot
Figure B.6: Projects database example view
Figure B.7: Example of modal help window
Figure B.8: RCCP: main view, displaying the planned and user defined orders
Figure B.9: RCCP: options modal window
Figure B.10: RCCP: start simulation modal window
Figure B.11: RCCP: existing scenarios to be loaded