
Artificial Intelligence Principles and Techniques

Improved Time Series Decline Curve Analysis For Oil Production Using Recurrent Neural Network

Cédric Fracès Gasmi∗1, Ouassim Khebzegga†1, and Soheil Esmaeilzadeh‡1

1Stanford University, CA 94305, USA

Abstract

In this work, we provide a more accurate alternative to Decline Curve Analysis (DCA), one of the most prevalent forecasting techniques used in the oil & gas industry. DCA relies largely on non-linear regression, where a hyperbolic function is fitted through historical data to predict future production. We propose a data-driven approach that utilizes both the most recent portion of a time series and a set of heuristics to improve upon the current benchmark. Given the time series of production history and a set of contextual metadata (e.g. well location, the field's geological information, operations), we forecast the hydrocarbon production of a given well. We show how regression and clustering techniques can be combined to improve prediction accuracy. We use different sequential neural networks with domain-informed feature engineering and enhance the accuracy of the forecast (in some cases by a large margin) compared to a DCA approach.

Introduction and Background

The most prevalent forecast method in the oil & gas industry is Decline Curve Analysis (DCA) (Arps [1945]). This technique consists of fitting a curve to the historical data in order to predict the future performance of a well. According to DCA, the pseudo-steady state production response of a well follows a differential equation of the form

D = -\frac{1}{q}\,\frac{dq}{dt} \qquad (1)

The typical time-dependent solution of this equation has the hyperbolic form $q(t) = q_0/(1+Dt)^{1/b}$, where

[email protected][email protected][email protected]

b is called the b-factor, D is the initial nominal decline rate, and q_0 is the initial production rate. These parameters are mainly chosen by engineering intuition and are adjusted to account for heuristic considerations (e.g. the basin of production, the type of driving forces, and the completion techniques used). Many commercial software packages (e.g., Aries, PHDWin, Palisade, OFM) help find the best DCA fit under any imposed constraints. With these packages, the process of calibrating a curve to the production history is typically done by the reservoir engineer, who applies a combination of heuristics and analytical methods to produce a plausible forecast for the production of a well. This process is generally probabilistic and takes into account a range of uncertainties.
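For illustration, the snippet below sketches this curve-fitting step with a standard non-linear least-squares routine; the synthetic production history and the parameter bounds are our own assumptions, not part of the commercial tools above.

```python
# Minimal sketch of the DCA fitting step: non-linear regression of the
# hyperbolic model q(t) = q0 / (1 + D*t)**(1/b) onto production history.
# The synthetic history and parameter bounds here are illustrative only.
import numpy as np
from scipy.optimize import curve_fit

def hyperbolic(t, q0, D, b):
    """Hyperbolic decline: the time-dependent solution of Eq. (1)."""
    return q0 / (1.0 + D * t) ** (1.0 / b)

t = np.arange(120.0)                               # 120 months of history
rng = np.random.default_rng(0)
q = hyperbolic(t, 1000.0, 0.05, 0.8) * rng.normal(1.0, 0.02, t.size)

# Least-squares fit; bounds keep q0, D, and b physically plausible.
(q0_f, D_f, b_f), _ = curve_fit(hyperbolic, t, q, p0=[q[0], 0.1, 1.0],
                                bounds=([0.0, 1e-6, 1e-3],
                                        [np.inf, 5.0, 2.0]))
forecast = hyperbolic(np.arange(120.0, 240.0), q0_f, D_f, b_f)
```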

Figure 1: Typical decline curve forecast - Green: production curve history - Black: production forecast (Source: ERE 269, Engineering Valuation and Appraisal of Assets)

As we see in Fig. 1, the prediction made using DCA (solid black line) is a smooth curve that cannot capture the rapid variations that usually happen in the production process (solid green line) and is therefore subject to a certain amount of error. Nevertheless, engineers still favor this method due


to its simplicity. In this work, we propose a new forecasting method that increases the accuracy of the prediction compared to DCA. This method allows for the integration of various sources of information, starting with the production history of the well, and adapts the accuracy of the prediction to the amount of available information. We use deep learning architectures along with clustering to adaptively integrate the various sources of information and features relevant to the forecast.

There is little related work on the use of time series prediction in the oil and gas industry. The application of data-driven approaches to reservoir modeling is relatively new and coincides with the rise of unconventional production in the USA, combined with our limited understanding of the physics governing the behavior of shale formations. Owing to the availability of data for this type of reservoir, researchers and engineers have started using data-driven approaches to gain insight into production patterns. Mohaghegh et al. [2011] developed a top-down approach for modeling and history matching of shale production based on statistical and pattern-recognition models and applied it to three shale reservoirs. Sun et al. [2018] applied the Long Short-Term Memory (LSTM) algorithm to predict a well's oil, water, and gas production. Their work led to an improvement of the production forecast in comparison with standard DCA models. However, the tested portion of their time series exhibits weak variations in comparison with the training part, and it remains unclear whether the algorithm generalizes well to complex time series. Ristanto [2018] applied ANN algorithms to predict the production of a well based on pressure measurements, compared vanilla Recurrent Neural Networks (RNN), Gated Recurrent Units (GRU), and LSTM algorithms, and found that the GRU gives the best forecast results.

This document presents the procedure of applying three main models to improve the oil production forecast based on historical and contextual data. The analysis is divided into three categories:

• Clustering & Unsupervised Learning: This covers the ensemble of methods that help with presenting, grouping, and filtering the data-set to better frame the problem. The main motivation behind clustering is that engineers tend to tweak certain parameters depending on a set of heuristic considerations (e.g. geology, basin, type of completion), and clustering allows for the integration of contextual static data into the predictions.

• History Based Time Series Forecast: In this part, we use the history of oil production (the quantity of interest) as a single time series and forecast the future production. This is usually the most relevant source of information and is given high importance in our work. The goal is to detect and capture a set of harmonics in the time series and use them for the forecast.

• Secondary Variables Based Sequence-to-Sequence Regression: In this class of methods, we establish a correlation between the quantity of interest (e.g. oil production) and other time series (e.g. gas production, water production, number of flowing days) that may influence it. This allows for the integration of relevant contextual temporal data into the forecast process.

Data-set Statistics

Production records of over 35,000 wells, representing a subset of California public data, are downloaded from the Department of Conservation, Division of Oil, Gas, and Geothermal Resources (DOGGR). The data is cleaned and formatted and contains well names, operators, fields, completion dates, and geographical locations, along with time series of monthly fluid production and injection. As a first pass, this work focuses on a single field (North Belridge) with around 4000 wells, where we use two sources of data: contextual (static) data and historical production (time series).

• Contextual Data: Contains macro information for each well, such as geographical location, name, type, operator, lease, and geological production zone, as well as the start and end dates of production. Contextual data is mainly used for the clustering exercise.

• Historical Data: Contains the production time series for each well, such as the monthly oil, water, and gas production, as well as the number of active days per month. Historical data is used for clustering, the history based time series forecast, and the secondary variables based sequence-to-sequence regression.


Clustering

In this section, our goal is to cluster the wells using both static and time series data. The clustering process follows two steps. First, we separate the wells into macro-clusters using static data. This requires the design of relevant features based on an engineer's expertise, achieved by taking a subset of a global list of 15 features including the geolocation coordinates; the cumulative oil and water production; the mean, peak, and standard deviation of the oil production; the first-month production; the cumulative production of the first 12 months; and the production duration. The second step consists of sub-dividing each macro-cluster into sub-clusters using the production time series. We use dynamic time warping (DTW) to measure the similarity between the production time series of each pair of wells and use it to group similar wells into the same sub-cluster.
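As an illustration, the sketch below implements the two steps with off-the-shelf libraries; the paper does not name its implementation, so the use of scikit-learn and tslearn here, along with the stand-in data, is an assumption.

```python
# Hedged sketch of the two-step clustering: KMeans on normalized static
# features (step 1), then DTW-based Time Series KMeans per macro-cluster
# (step 2). The random arrays stand in for the real well data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from tslearn.clustering import TimeSeriesKMeans

static_features = np.random.rand(100, 7)     # (n_wells, n_static_features)
series = np.random.rand(100, 120, 1)         # (n_wells, n_months, 1)

# Step 1: macro-clusters from normalized, PCA-reduced static features.
X = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(static_features))
macro = KMeans(n_clusters=7, random_state=0).fit_predict(X)

# Step 2: split each macro-cluster into sub-clusters via DTW similarity.
sub = np.zeros_like(macro)
for c in np.unique(macro):
    idx = np.where(macro == c)[0]
    sub[idx] = TimeSeriesKMeans(n_clusters=2, metric="dtw",
                                random_state=0).fit_predict(series[idx])
```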

For each step, we benchmark different clustering algorithms. For the first step, we test KMeans, Spectral Clustering, and Agglomerative Clustering, combined with data normalization and dimensionality reduction using Principal Component Analysis (PCA). For the second step, we test the Global Alignment Kernel KMeans and Time Series KMeans algorithms. Table 1 summarizes the different combinations of these methods.

Table 1: Clustering Models

No.  Step 1 Algorithm           Step 2 Algorithm
1    KMeans                     Time Series KMeans
2    KMeans                     Global Alignment Kernel KMeans
3    Spectral Clustering        Time Series KMeans
4    Spectral Clustering        Global Alignment Kernel KMeans
5    Agglomerative Clustering   Time Series KMeans
6    Agglomerative Clustering   Global Alignment Kernel KMeans

Clustering Hyper-Parameters Tuning

Table 2 presents the set of hyper-parameters used for each of the models in Table 1. The heuristic we define to compare the different models is determined by plotting the time series belonging to each sub-cluster and searching for a coherent structure characterizing the time series belonging to the same group. Some of these coherency measures are:

• Separation between high and low amplitude production time series

• Separation between short and long production time series

• Separation between time series that start with a high amplitude and then collapse sharply, in contrast with production that declines gradually; this reflects inherent differences in production conditions.

Table 2: Hyper-Parameters Tuning

Parameter                    Values
# Clusters Step 1            [5, 7, 10]
# Clusters Step 2            [2, 3, 4]
Set of static features       different subsets tested
Weight of static features    [1, 5, 10]

Clustering Results

Models 1 and 5 in Table 1 lead to the best results. We present the results of model 1, composed of the KMeans + Time Series KMeans algorithms, with 7 macro-clusters, each one decomposed into 2 sub-clusters. For this case we use the following set of features: geolocation coordinates; cumulative oil and water production; and the mean, peak, and standard deviation of the oil production, with unity weights. The results of the first step are presented in Fig. 2. We see that wells belonging to the same cluster naturally regroup geographically; this reflects the underlying geological formation characteristics of each region. Figs. 3a and 3b show the decomposition of two macro-clusters, 2 and 3, into sub-clusters. Comparing the top and bottom plots of Fig. 3a shows that the length of the time series is the characteristic that determines a well's sub-cluster, with long and short time series grouped into different clusters. Fig. 3b shows another configuration, where the amplitude of the production is the determining characteristic. These clusters can then be used to apply different forecast models in the time series forecasts presented in the following.

Time Series Forecasting

In this section, we illustrate the forecasting of a time series (e.g. oil production) using different variants of Recurrent Neural Network (RNN) architectures.

History Based Time Series Forecast

We forecast the time series of oil production using the production history of a single well. We use different Recurrent Neural Network architectures, perform a hyper-parameter study, and finally compare the best RNN model's outcome with the Decline Curve Analysis results.



Figure 2: (a) Well clusters projected in PCA space; (b) the same clusters in the physical geographical space


Data Pre-processing

The time series relevant to our work are usually non-stationary and have variable trends, meaning that the statistical characteristics of the data (e.g., frequency, variance, and mean) vary in time. Stationary time series are easier to model, and their trends are usually easier to capture. Prior to training, we convert the time series into stationary data using differencing (i.e. computing the differences between consecutive observations). We then normalize the data to the interval [−1, +1] to avoid vanishing gradients when using RNNs. Next, we split a time series into an input set X and output y using a lag-time approach (i.e. the output is lagged from the input by a few time steps). The lag size is treated as a tuning hyper-parameter to improve accuracy. After training, we convert the data back for error calculations.
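A minimal sketch of this pipeline is shown below; the helper names and the lag value are ours, not the paper's.

```python
# Differencing, scaling to [-1, 1], and lag-based (X, y) construction,
# as described above. Helper names and the lag value are illustrative.
import numpy as np

def make_supervised(series, lag=3):
    """Difference the series, scale to [-1, 1], and build lagged pairs."""
    diff = np.diff(series)                 # differencing removes the trend
    scale = np.max(np.abs(diff))
    scaled = diff / scale                  # now bounded in [-1, 1]
    X = np.array([scaled[i:i + lag] for i in range(len(scaled) - lag)])
    y = scaled[lag:]                       # output lags the input window
    return X, y, scale

def invert(preds, last_obs, scale):
    """Undo scaling and differencing to recover the original units."""
    return last_obs + np.cumsum(preds * scale)
```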

History Based Forecast Training Model

We use a many-to-one recurrent neural network (cf. Fig. 4). A many-to-one model produces one output value after receiving multiple input values (an input sequence). The internal state is accumulated with each input value before the final output value is produced. We use the Root Mean Squared Error (RMSE) to measure the accuracy of the forecast during the training process. The RMSE can be calculated as

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left[\frac{y_i^{\mathrm{pred}} - y_i^{\mathrm{real}}}{y_i^{\mathrm{real}}}\right]^2} \times 100, \qquad (2)

where n is the total number of time steps in the time series, and $y_i^{\mathrm{pred}}$, $y_i^{\mathrm{real}}$ are the model-predicted and actual values of the time series at each time step.
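In code, Eq. (2) is a one-liner:

```python
# Direct transcription of Eq. (2): root mean squared relative error, in %.
import numpy as np

def relative_rmse(y_pred, y_real):
    return np.sqrt(np.mean(((y_pred - y_real) / y_real) ** 2)) * 100.0
```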

Figure 4: A many-to-one sequence model, where an input sequence is considered for an output through connections made by RNN states

We consider three common recurrent models: (1) the Deep Vanilla Recurrent Neural Network (DVRNN), (2) the Deep Gated Recurrent Units (DGRU) network, and (3) the Deep Long Short Term Memory (DLSTM) network. RNNs have feedback loops in the recurrent layer, which lets them maintain information in memory over time. However, it can be difficult to train standard RNNs to solve problems that require learning long-term temporal dependencies, due to the vanishing gradient problem. LSTM, on the other hand, uses special units in addition to the standard units, which include memory cells for maintaining information over long periods of time. A set of gates is used to control when information enters the memory, when it is output, and when it is forgotten.


[Figure 3 panels: (a) main cluster 2, sub-clusters 0 and 1; (b) main cluster 3, sub-clusters 0 and 1; axes: time steps [month] vs. oil production [bbl/d]]

Figure 3: (a) Macro-cluster 2 decomposition based on the length of the time series; (b) macro-cluster 3 decomposition based on the amplitude of the time series

GRUs are similar to LSTMs but use a simplified structure: they also have a set of gates to control the flow of information, but they don't use separate memory cells and have fewer gates. Before doing an actual time series forecast, we run an experiment comparing the forecast capability of a very simple Vanilla RNN (with 5 neurons and 1 layer) with Decline Curve Analysis (DCA). We synthetically create the time series by adding sine and cosine modes to a base signal that decays exponentially in time (i.e. $a_{t_i}\sin(\omega_{t_i} t) + b_{t_i}\cos(\omega_{t_i} t)$, with $a_{t_i}$, $b_{t_i}$, $\omega_{t_i}$ being random amplitudes and frequencies at each time step $t_i$). Then, splitting the time series into 75% training and 25% test sets, we forecast using both the Vanilla RNN and DCA. Fig. 5 shows the comparison of the forecast results using the Vanilla RNN and DCA for a noisy, exponentially decaying time series. Using the Vanilla RNN improves the RMSE by 34%, which is a considerable improvement compared to a naive DCA approach.
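For reference, the synthetic series can be generated along the following lines; the base signal, amplitude, and frequency ranges below are our assumptions, since the text does not specify them.

```python
# Sketch of the synthetic benchmark: random sine/cosine modes added to an
# exponentially decaying base. All numeric ranges here are assumed.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(500.0)
base = 1000.0 + 100.0 * np.exp(-t / 200.0)        # decaying base signal
a = rng.uniform(0.0, 5.0, t.size)                 # a_ti: random amplitudes
b = rng.uniform(0.0, 5.0, t.size)                 # b_ti
omega = rng.uniform(0.1, 1.0, t.size)             # w_ti: random frequencies
signal = base + a * np.sin(omega * t) + b * np.cos(omega * t)

split = int(0.75 * t.size)                        # 75% train / 25% test
train, test = signal[:split], signal[split:]
```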


Figure 5: Comparison of Decline Curve Analysis vs. Recurrent Neural Network for oil production time series forecast


History Based Time Series Forecast Results

Table 3 shows the experiment results of an actual time series forecast using the three recurrent architectures.


Table 3: RMSE values of different experiments; GRU: gated recurrent unit, LSTM: long short-term memory, D.O.: with drop-out, Reg.: with regularization

No.  Experiment                   RMSE (%)
1    Vanilla RNN (VRNN)           10.32
2    Deep Vanilla RNN (DVRNN)     8.71
3    DVRNN + D.O.                 8.51
4    DVRNN + D.O. + Reg.          8.11
5    Deep GRU (DGRU)              7.72
6    DGRU + D.O.                  7.61
7    DGRU + D.O. + Reg.           7.49
8    Deep LSTM (DLSTM)            7.11
9    DLSTM + D.O.                 6.87
10   DLSTM + D.O. + Reg.          5.92

We first use a one-layer Vanilla Recurrent Neural Network (VRNN), which leads to an RMSE of 10.32%. We then use a Deep VRNN, stacking multiple layers of VRNN, which improves the RMSE to 8.71%. The RMSE decreases further, down to 8.11%, when drop-out and kernel regularization are used. Next, we use a Deep Gated Recurrent Unit (DGRU) network, which helps capture long-term dependencies in the time series; after a hyper-parameter study, a DGRU with drop-out and kernel regularization achieves a 7.49% RMSE. Finally, we use a Deep Long Short Term Memory (DLSTM) network to better capture the long-term temporal dependencies, achieving the best RMSE of 5.92% after a careful hyper-parameter study. Table 4 shows the details of the optimum network architecture that came out of the hyper-parameter study for each type of recurrent network. As Table 4 shows, for each type of network the best RMSE is achieved when both drop-out and regularization are used; using only large drop-out values under-performs compared to using mixed drop-out and regularization. All models were trained with the Adam optimizer, and training was terminated once the improvement in the RMSE over 30 epochs fell below 1%-2%.
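A sketch of the best configuration (Table 4, row 10) is given below; this is our reconstruction in Keras, not the authors' code, and the early-stopping setting only loosely mirrors the stated 1%-over-30-epochs rule.

```python
# Reconstruction (assumed) of the best model in Tables 3 and 4: a
# many-to-one DLSTM with (20, 15, 5) neurons, drop-out 0.15, and L2(0.02)
# kernel regularization, trained with Adam on lagged windows of the series.
import tensorflow as tf

lag = 3                                   # lag size (a tuned value; assumed)
reg = tf.keras.regularizers.l2(0.02)
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(20, return_sequences=True, dropout=0.15,
                         kernel_regularizer=reg, input_shape=(lag, 1)),
    tf.keras.layers.LSTM(15, return_sequences=True, dropout=0.15,
                         kernel_regularizer=reg),
    tf.keras.layers.LSTM(5, dropout=0.15, kernel_regularizer=reg),
    tf.keras.layers.Dense(1),             # single forecast value
])
model.compile(optimizer="adam", loss="mse")
early = tf.keras.callbacks.EarlyStopping(monitor="loss", patience=30)
# model.fit(X, y, epochs=500, callbacks=[early])  # X: (n, lag, 1), y: (n,)
```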

Fig. 6 shows the forecast results for the three types of architectures used, i.e. Deep VRNN, Deep GRU, and Deep LSTM, with RMSEs of 8.11%, 7.49%, and 5.92%, respectively. As Fig. 6a shows, the Deep VRNN can predict the trend; however, it undershoots the peaks and troughs with a delay. Fig. 6b shows that the Deep GRU sometimes undershoots or overshoots the peaks and troughs. On the other hand, the Deep LSTM model in Fig. 6c captures both the trends and the peaks and troughs reasonably well compared to the other two

Table 4: Detailed parameter values of the experiments in Table 3; the Layer sizes column gives the number of neurons in each layer of a model, D.O. Coef. is the drop-out coefficient, and Reg. Coef. is the coefficient for L2 regularization

No.  Layer sizes (# neurons)  D.O. Coef.  Reg. Coef.
1    (15)                     0           0
2    (20, 15, 7)              0           0
3    (20, 15, 7)              0.4         0
4    (20, 15, 7)              0.2         L2(0.03)
5    (25, 10, 5)              0           0
6    (25, 10, 5)              0.5         0
7    (25, 10, 5)              0.25        L2(0.01)
8    (20, 15, 5)              0           0
9    (20, 15, 5)              0.45        0
10   (20, 15, 5)              0.15        L2(0.02)

recurrent architectures. In Fig. 6c the Deep LSTM improves the RMSE by up to 92% compared to DCA. The change in trend observed at the end of the time series amplifies the gap between the RNN and DCA and is an example of a case where large improvements can be achieved using these deep models.

Secondary Variables Based Sequence-to-Sequence Regression

This third model aims at making predictions of oil production taking contextual time series data into account. We use a sequence-to-sequence regression model as a way to correlate oil production at a well with a set of other data measured at the same well (gas production, water production, pressures), but also at the field level (offset production, injection) and even at a macro-economic level (oil/gas price, number of rigs available, financials of the operating company, etc.). For this example, we only look at the contextual data for a given well.

Description of Sequence-to-Sequence Model

We use a sequence-to-sequence model similar to the one shown in Fig. 7. The inputs are gas production, water production, dates, the number of flowing days in a month, and the API number (related to position). The output is the cumulative liquid production. The overall network architecture is as follows: an input sequence of a given dimension (10, as an example), followed by LSTM units connected to a fully connected layer, and finally a one-neuron layer with an RMSE response. The details of the size and architecture of each part of the network are given in Table 5, subject to a hyper-parameter study.
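A hedged Keras reconstruction of this architecture is sketched below, using the best hyper-parameters of Table 5 (140 hidden units, 72 fully connected units, drop-out 0.5, learning rate 0.008, L2 of 5e-4); the masking layer and the feature count are our assumptions.

```python
# Assumed reconstruction of the seq2seq regression network: multivariate
# input sequences -> LSTM -> fully connected layer -> one output per step.
import tensorflow as tf

n_features = 5    # gas, water, date, flowing days, API number (assumed)
model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0,        # skip padded time steps
                            input_shape=(None, n_features)),
    tf.keras.layers.LSTM(140, return_sequences=True, dropout=0.5,
                         kernel_regularizer=tf.keras.regularizers.l2(5e-4)),
    tf.keras.layers.Dense(72, activation="relu"),
    tf.keras.layers.Dense(1),                      # cumulative liquid output
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.008),
              loss="mse")                          # RMSE reported from MSE
```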


[Figure 6 panels: (a) Deep Vanilla RNN; (b) Deep Gated Recurrent Units; (c) Deep Long Short Term Memory; axes: time vs. oil production ×10^3 [BBL]]

Figure 6: Oil production time series forecast using the history of production, with a 75%-25% split over training and test sets - the DVRNN under/overshoots the peaks and troughs with a slight delay - the DGRU leads to less delay; however, under/overshoots are still present - the DLSTM is reasonably robust in capturing the trend and spotting the peaks and troughs.


Figure 7: A many-to-many sequence model, where multivariate input sequences are used to predict an output

We consider time series of various lengths and use minibatch gradient descent with padding in order to train the model. LSTM networks accept input data with varying sequence lengths, but when passing data through the network, we must pad or truncate the sequences in each mini-batch to a specified length. To reduce the amount of padded or discarded data, we sort the sequences by length beforehand. Figure 8 illustrates the benefits of sorting prior to padding.

Figure 8: Padding sequences with and without sorting
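The sorting trick can be implemented in a few lines; the generator below is a sketch with assumed names, not the training code used here.

```python
# Length-sorted padding for mini-batches: sorting keeps sequences of
# similar length together, so each batch needs minimal zero-padding.
import numpy as np

def sorted_padded_batches(sequences, batch_size):
    """Yield zero-padded mini-batches of 2-D arrays (length, features)."""
    order = np.argsort([len(s) for s in sequences])
    for i in range(0, len(order), batch_size):
        batch = [sequences[j] for j in order[i:i + batch_size]]
        max_len = max(len(s) for s in batch)
        padded = np.zeros((len(batch), max_len, batch[0].shape[-1]))
        for k, s in enumerate(batch):
            padded[k, :len(s)] = s                 # pad the tail with zeros
        yield padded
```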

We train the model on 3850 wells (testing on 350). The training loss (RMSE) is shown in Fig. 9. We train the model on a single GeForce GTX 1080 Ti GPU; each epoch takes about 10 s for the hyper-parameter sets defined in Table 5.

Figure 9: Training (blue) loss (RMSE) for sequence-to-sequence regression. Training occurs over 100 epochs.

We observe good convergence behavior for each batch and overall. We run experiments over around 1000 models; our best model is obtained using an ensemble of models. The hyper-parameters used for a subset of these models are reported in Table 5.

Sequence-to-Sequence Model Results

Fig. 10 shows the comparison of the modelled and actual cumulative production for 4 wells randomly selected within the training set (Fig. 10a) and the test set (Fig. 10b). We see that they fit fairly well, with the exception of well 1957. The mean RMSE for the training set is 63,000, while it exceeds 120,000 for the test set. The model learns quite well: the match on some wells is nearly perfect, while it misses on others. It is interesting to observe that the model fails to capture the monotone nature of cumulative production, while the overall magnitude seems respected. As expected, the fit on the test set is slightly worse than on the training set. More importantly, we tested this model on the cumulative production; the signal is hence fairly smoothed compared to the original rate. We will migrate to direct rate prediction in the future, but in this work we aimed to produce acceptable results on the cumulative.


[Figure 10 panels: (a) training-set wells 2098, 1016, 2044, and 2071; (b) test-set wells 105, 187, 206, and 281; axes: time step vs. cumulative liquid production ×10^4]

Figure 10: Examples of predicted vs. actual cumulative production

Table 5: RMSE values of different experiments on the sequence-to-sequence regression model

Hidden units  Dropout  FC units  Learning rate  L2-reg   RMSE
140           0.5      72        0.008          0.0005   21
100           0.5      72        0.008          0.0004   28
200           0.5      72        0.008          0.0005   31
200           0.4      76        0.008          0.0005   38

The distribution of the RMSE over the whole test set is presented in Fig. 11.

Figure 11: Distribution of the RMSE over the test set for the sequence-to-sequence regression model

We present the results of the sequence-to-sequence model's hyper-parameter tuning in Table 5. These results are slightly worse than the baseline set by the classical DCA approach (30% RMSE). However, none of the historical data is used to compute them; we use secondary metrics only to compute the liquid production. This exercise allows us to capture long-term effects of production (such as a slowdown or an increase), which is what we see in the cumulative production trends. It is a necessary complement to the prediction based purely on historical production presented earlier in this work, as that approach captures short-term variations very well but cannot make long-term predictions.

Future Works

An advantage of DCA is that it provides a long-term estimate of production that is inaccurate but very consistent: it is a model with high bias and very low variance. LSTM models tend to have very high variance, and they fail to identify long-term trends in predictions unless the predictions are constantly adjusted. This approach, although valuable for short-term predictions, fails to provide a holistic substitute for DCA, which, for reasons of simplicity and robustness, remains the most widely used approach in the industry. In future steps, we will improve the individual models presented here and merge them in order to produce a holistic forecast. A promising approach would be to use encoder-decoder sequence-to-sequence models, where all the contextual data (static and dynamic) would be integrated into the encoder and the actual prediction would be made based on the encoding rather than the actual data. This would allow a denser representation of the data and help with comparing wells with each other.


References

J.J. Arps. Analysis of Decline Curves, 1945.

S.D. Mohaghegh, O. Grujic, S. Zargari, and M. Kalantari. Modeling, History Matching, Forecasting and Analysis of Shale Reservoirs Performance Using Artificial Intelligence: Top-Down, Intelligent Reservoir Modeling for Shale Formations. SPE Digital Energy Conference and Exhibition, SPE 143875, 2011.

T. Ristanto. Deep Learning Application in Well Production. Deep Learning Course Reports, CS230, Stanford University, 2018.

J. Sun, X. Ma, and M. Kazi. Comparison of Decline Curve Analysis (DCA) with Recursive Neural Networks (RNN) for Production Forecast of Multiple Wells. SPE-190104-MS, 2018.
