Volume 5 Number 1, January-June 2017
118
THE DATA-DRIVEN MODEL TO ANALYZE HAZE OCCURRENCE IN
NORTHERN THAILAND
Nittaya Kerdprasop1,
Kittisak Kerdprasop1,
Paradee Chuaybamroong2
1School of Computer Engineering, Suranaree University of Technology,
2Faculty of Science and Technology, Thammasat University, Rangsit Campus,
Thailand.
ABSTRACT
Seasonal smoke-haze pollution has been a chronic problem for large
areas of Southeast Asia, including Thailand. The major cause of
smoke-haze is the intentional burning of forest and vegetation fields
to clear land for cultivation. Smoke-haze has occurred in almost every
part of Thailand, but the most severely affected region is the
mountainous north. Chiang Mai province, surrounded by the northern
highlands, has recently experienced an increase in the intensity of
fire-related air pollution. We are therefore interested in deriving a
data-driven model to be used as a warning tool for haze transport from
the surrounding mountainous areas. A haze occurrence predictive model
is nevertheless difficult to derive automatically with a learning
algorithm because the event is rare compared with the majority of
normal readings in an air quality monitoring system. We thus propose to
use the sample bootstrapping technique to help recognize the rare event
of pollution caused by smoke-haze. According to the model evaluation
results, bootstrapping increases the overall model accuracy from 90% to
92%. The greatest gain of the bootstrapped model is in its sensitivity,
also called recall: the rate of recognizing the rare haze occurrences
increases from 0% to 38%.
KEYWORDS
Statistical inference, Bootstrapping
sample, Decision tree learning, Haze
occurrence, Air quality, PM10
INTRODUCTION
According to the World
Meteorological Organization
(WMO), haze has been defined as
a large number of extremely small
and dry particles that are suspended
in the atmosphere and cause the sky
to look dusky (World Meteorological
Organization, 2016). The visibility
degradation during the haze
phenomenon is due to the scattering
and absorption of visible light
by the haze particles in the
atmosphere. Haze can be a natural
phenomenon, but smoke-haze is a
man-made event arising from
forest fires intentionally set to clear
land for cultivation. Thailand and
other countries in Southeast Asia
have long suffered from
smoke-haze pollution episodes (Field,
R.D., et al., 2004), (Heil, A. and
Goldammer, J.G., 2001), (Koe, L., et
al., 2001), (Othman, J., et al., 2014),
(Velasco, E. and Rastan, S., 2015).
Haze and smoke-haze can contain high concentrations of hazardous heavy
metals and small particulate matter (PM). PM10 and PM2.5 are
particulate matter with aerodynamic diameters less than or equal to 10
and 2.5 micrometers, respectively. The smaller the PM, the more harmful
it is, because these extremely tiny particles can be retained in the
human lungs and spread through the whole body. PM can cause airway
damage, cardiovascular impairment, and adverse effects in infants
(Feng, S., et al., 2016). Winter haze events cause health problems in
many areas of China (Li, J. and Han, Z., 2016). Toxic pollutants caused
by haze episodes have been investigated (Qiao, T., et al., 2016),
(Zhang, Q., et al., 2015). Forecasting models and simulation methods
have also been applied to study the transport characteristics of haze
events (Che, H., et al., 2009), (Liu, D. and Li, L., 2015), (Tian, Y.,
et al., 2016), (Xu, H., et al., 2014), (Yan, X., et al., 2016), (Zhou,
G., et al., 2014).
In Thailand, smoke-haze is a periodic air pollution incident that
occurs during the dry season, from late January possibly until the last
week of April or even early May, depending on the arrival of the rains.
The extent and intensity of fire-related air pollution in northern
Thailand have been increasing in recent years. Controlling burning is
obviously an effective solution, but it is impractical because many
large areas in valleys and high mountains are difficult for officers to
reach.
Monitoring burning incidents as a warning sign is a reasonably feasible
way of handling the smoke-haze situation in advance (Spessa, A.C., et
al., 2014). We thus propose in this paper a data-driven model to
analyze the occurrence patterns of smoke-haze spreading from nearby
mountainous regions. Our model building method differs from others in
that we apply a data mining technique instead of statistical inference
relationships and simulation approaches. Specifically, decision tree
learning (Breiman, L., et al., 1984), (Quinlan, J.R., 1986) from the
data mining research field has been adopted as the model induction
method. Decision tree induction is an appropriate learning technique
when model interpretation, reasoning, and ease of use are major
concerns regarding the induction outcome (Kerdprasop, N. and
Kerdprasop, K., 2016).
Because the obtained air quality data are inherently imbalanced, with
far fewer polluted days than normal days, we have to apply a balancing
technique prior to the application of decision tree learning. We use
sample bootstrapping (Efron, B. and Tibshirani, R.J., 1993) as a
statistical tool to reweight the data instances and obtain data
balance. The data preparation techniques are explained in Section 2.
The model induction process is elaborated in Section 3. The results of
model evaluation are presented in Section 4. The paper is concluded in
Section 5.
STUDY AREA AND DATA
COLLECTION
In this work, we focus our study on the haze occurrence patterns around
the Chiang Mai area. Chiang Mai is an important province both
economically and politically. It is located around 685 kilometers north
of Bangkok. The city is situated along the Mae Ping River basin,
surrounded by high mountain ranges (shown in Fig. 1). Because the city
lies in a low basin valley enclosed by many high mountains, smoke
pollution in Chiang Mai lasts for weeks in some years.
Figure 1. Chiang Mai province and the surrounding borders: Mae Hong Son
(west), Chiang Rai (northeast), Nan (east), Lampang (southeast).
(Source: https://maps-for-free.com)
Figure 2. Example of average PM10 values from January 1-5, 2016.
The main source of smoke-haze in northern Thailand and neighboring
countries such as Laos and Myanmar is the burning of land. The start of
the dry season, around mid to late February, is the onset of
slash-and-burn farming. The lingering smoke from the fires causes
serious pollution problems for people living along the Chiang Mai
valley during March and April every year.
We thus collect air pollution data, with specific attention to PM10,
for Chiang Mai and other mountainous provinces nearby: Mae Hong Son to
the west of our study area, Chiang Rai to the northeast, Nan to the far
east, and Lampang to the southeast of Chiang Mai. The data are
collected from the Air Quality and Noise Management Bureau of Thailand
(Air quality data, PM10, 2016). PM10 data from these five northern
provinces cover January to April 2016. The data are 24-hour average
PM10 values from ground stations in each province. Some data samples
are shown in Fig. 2. Chiang Mai (CM) is the target of our analysis.
Lampang (LM), Chiang Rai (CR), Mae Hong Son (MH), and Nan are the
predictor variables.
METHODOLOGY FOR HAZE
OCCURRENCE MODEL
CREATION
Data Exploration
Prior to model building, an essential early stage of data analysis is
the exploration of the data distribution and the observation of
correlations among data attributes. The target of our data analysis is
haze occurrence in the Chiang Mai (CM) area. The predictors used to
forecast such an event are smoke-haze episodes with excessive amounts
of PM10 occurring in the nearby Nan, Lampang (LP), Mae Hong Son (MH),
and Chiang Rai (CR) provinces.
Taking PM10 concentration as representative of the hazardous pollutants
in a smoke-haze episode, the area highly correlated with the Chiang Mai
incidents is Nan (as shown in Fig. 3). The coincidence of peaks in the
distribution graph for January to April 2016 (Fig. 4) also confirms
this correlation of smoke-haze events.
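The correlation check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PM10 values below are made up for the example and are not the 2016 measurements, and the helper names `pearson` and `correlation_matrix` are hypothetical, not part of the tooling used in this study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(series):
    """Return {(name_a, name_b): r} for every ordered pair of named series."""
    names = list(series)
    return {(a, b): pearson(series[a], series[b]) for a in names for b in names}

# Made-up daily PM10 values (micrograms per cubic meter), for illustration only.
pm10 = {
    "CM":  [40.0, 55.0, 90.0, 130.0, 70.0],
    "Nan": [35.0, 50.0, 95.0, 125.0, 60.0],
    "MH":  [60.0, 45.0, 50.0, 80.0, 75.0],
}

matrix = correlation_matrix(pm10)

# Rank the predictor provinces by their correlation with Chiang Mai (CM).
ranked = sorted((p for p in pm10 if p != "CM"),
                key=lambda p: matrix[("CM", p)], reverse=True)
print(ranked)
```

With real station data, the same ranking step would indicate which neighboring province tracks the Chiang Mai readings most closely.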
Figure 3. Correlation matrix of PM10 values in Chiang Mai and surrounding
provinces.
Figure 4. PM10 during January-April 2016 in Chiang Mai and other
provinces.
The Model Building Process
We design our haze-occurrence modeling to comprise five main steps:
data extraction, missing value imputation, data transformation, sample
bootstrapping, and model building. Data extraction simply creates a
file containing the values of interest. The next step is missing value
handling, in which the mean value is used for imputation. For data
transformation, numeric values are transformed into nominal ones. This
format transformation serves our main purpose of building a
classification model that can be used both to explain the co-occurrence
relationships and to predict future events. We thus transform the daily
records of numeric PM10 values into either hazardous (true) or not
(false). The harmful level of PM10 concentration is at or above 120
micrograms per cubic meter. The transformed data are illustrated in
Fig. 5. At the end of this step, a haze-occurrence model may be created
(as shown in Fig. 6).
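The imputation and transformation steps can be illustrated with a minimal Python sketch. The 120 micrograms per cubic meter threshold comes from the text; the function names, the use of `None` to mark a missing reading, and the sample values are assumptions made for this example only.

```python
HAZARD_THRESHOLD = 120.0  # micrograms per cubic meter, as stated in the text

def impute_mean(values):
    """Replace missing readings (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def to_hazard_flags(values, threshold=HAZARD_THRESHOLD):
    """Map each 24-hour average PM10 value to True (hazardous) or False."""
    return [v >= threshold for v in values]

# Made-up readings with one missing day, for illustration only.
raw = [80.0, None, 120.0, 150.0, 50.0]

filled = impute_mean(raw)        # the gap becomes the mean of the other four
flags = to_hazard_flags(filled)  # values at or above the threshold map to True

print(filled)
print(flags)
```

A reading exactly at the threshold is flagged as hazardous, matching the "at or above" wording of the harmful level.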
Figure 5. A new data set in which numeric values are transformed to binomial.
Figure 6. Original Chiang Mai haze occurrence predictive model built from
imbalanced data.
Figure 7. A framework of haze occurrence predictive modeling with
bootstrapping technique.
Notice that the model in Fig. 6 has limited power, capturing only a
single hazardous event, which is represented as the true leaf node. The
limitation is due to the minority of 8 days of haze events compared
with the 113 days of normal situation. To increase the model power, we
therefore propose the important step of bootstrapping, as shown in
Fig. 7.
Bootstrapping the Model
The proposed framework of model induction with the data balancing
technique of sample bootstrapping has been implemented with the
RapidMiner software version Studio Basic 17.0 (RapidMiner). The
difference between the new bootstrapped model (in Fig. 8) and the
original model built from the imbalanced data is that the new one can
capture four sequences of haze occurrence in Chiang Mai province,
whereas the original model can recognize only one sequence. The two
versions of the haze-occurrence predictive model are, however, similar
in their root node, which indicates that smoke-haze has normally
occurred in Nan before spreading through other provinces and finally
covering the Chiang Mai area.
Figure 8. The Chiang Mai haze occurrence predictive model built from
bootstrapped data.
The haze-occurrence predictive model (in Fig.8) can be interpreted as
warning signs for the probable haze incidence in Chiang Mai as follows:
Warning sign #1: Hazardous haze will occur in Chiang Mai,
if there are hazardous haze events in Lampang and
Mae Hong Son.
Warning sign #2: Hazardous haze will occur in Chiang Mai,
if there are hazardous haze events in Chiang Rai and
Lampang.
Warning sign #3: Hazardous haze will occur in Chiang Mai,
if there are hazardous haze events in Nan and Mae
Hong Son.
Warning sign #4: Hazardous haze will occur in Chiang Mai,
if there are hazardous haze events in Nan and
Lampang.
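The four warning signs listed above can be written as a small rule function. This sketch encodes only the rules as listed in the text, not the full tree of Fig. 8; the function and parameter names are hypothetical.

```python
def chiang_mai_warning(nan, lampang, mae_hong_son, chiang_rai):
    """Return the numbers of the warning signs triggered by the hazard
    flags (True = hazardous haze) of the four neighboring provinces."""
    signs = []
    if lampang and mae_hong_son:
        signs.append(1)  # Warning sign #1: Lampang and Mae Hong Son
    if chiang_rai and lampang:
        signs.append(2)  # Warning sign #2: Chiang Rai and Lampang
    if nan and mae_hong_son:
        signs.append(3)  # Warning sign #3: Nan and Mae Hong Son
    if nan and lampang:
        signs.append(4)  # Warning sign #4: Nan and Lampang
    return signs

# Example: hazardous haze reported in Nan and Lampang only.
triggered = chiang_mai_warning(nan=True, lampang=True,
                               mae_hong_son=False, chiang_rai=False)
print(triggered)  # [4]
```

An empty list means that none of the four listed conditions holds, so no warning is raised for Chiang Mai.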
MODEL EVALUATION
Model performance evaluation is the last step of our designed process.
The performance of the bootstrapped model is assessed against the
original model built without the data balancing technique. Our original
data set contains 121 data instances, of which 8 are true harmful haze
events. The remaining 113 days are not hazardous events. We thus
perform bootstrapping to balance the data, with a weight of 1 assigned
to the harmful hazy days and a weight of 0.07 applied to the normal
days. The weight 0.07 is computed by down-balancing the majority group
of data to approximately the same size as the minority group (113
non-hazy days × 0.07 ≈ 8 days). Under this weight assignment scheme,
the bootstrapped data samples are then drawn. The bootstrapped data set
therefore does not exactly match the original data set in the number of
data instances in the true and false classes. Percentages can
nevertheless be used to make a fair comparison.
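The weighting scheme can be illustrated with a short Python sketch of weighted sampling with replacement. It is not the RapidMiner operator used in this work; the function names, the seed, and the sample size are assumptions for the example, and the class counts in any drawn sample will vary with the random draw.

```python
import random

def class_weight(minority_count, majority_count):
    """Weight that scales the majority class down to the minority class size."""
    return minority_count / majority_count

def weighted_bootstrap(records, weights, size, seed=0):
    """Draw `size` records with replacement, proportionally to `weights`."""
    rng = random.Random(seed)
    return rng.choices(records, weights=weights, k=size)

# Class labels of the original data set: 8 hazy days and 113 normal days.
labels = [True] * 8 + [False] * 113

w_normal = round(class_weight(8, 113), 2)  # 8 / 113 is about 0.0708, i.e. 0.07
weights = [1.0 if is_haze else w_normal for is_haze in labels]

# Draw a bootstrap sample of the same size as the original data set.
sample = weighted_bootstrap(labels, weights, size=len(labels), seed=42)
haze_share = sum(sample) / len(sample)

print(w_normal)
print(haze_share)
```

Because the total weight of the normal days (113 × 0.07 = 7.91) is close to that of the 8 hazy days, each draw picks a hazy day with a probability of roughly one half.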
The performance evaluation of the haze-occurrence predictive model
built from the original imbalanced data is presented as a confusion
matrix in Fig. 9(a). This evaluation is based on the 10-fold cross
validation assessment method. The overall accuracy of this original
model, which considers the predictive accuracy on both harmful hazy
days and non-hazy days, is 90.08%. On predicting harmful hazy days
alone, however, this model is 0% accurate.
The bootstrapped model is also evaluated with the 10-fold cross
validation assessment method. The overall accuracy of the bootstrapped
model is 92.56%. On predicting harmful hazy days, its accuracy is
83.33%. For non-hazardous days, the predictive accuracy is 93%, which
is approximately the same as for the original model.
In terms of sensitivity, the bootstrapped model can recall harmful
events with 38.46% correctness, whereas the original model yields 0%.
The bootstrapping technique also improves the sensitivity of predicting
non-hazardous events (the false class) from 96.46% in the original
model to 99.07% in the bootstrapped model.
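The reported percentages can be related to confusion matrix cells with a short worked example. The cell counts below are not stated in the text; they are back-calculated here so that they reproduce the reported figures, and should be read as an assumption rather than as the contents of Fig. 9. Under this reading, the 83.33% figure corresponds to the class precision of the hazy (true) class. The function name `metrics` is hypothetical.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, per-class recall, and per-class precision, in percent."""
    def pct(num, den):
        return round(100.0 * num / den, 2) if den else 0.0
    total = tp + fn + fp + tn
    return {
        "accuracy": pct(tp + tn, total),
        "recall_true": pct(tp, tp + fn),      # sensitivity on hazy days
        "recall_false": pct(tn, tn + fp),     # sensitivity on normal days
        "precision_true": pct(tp, tp + fp),
        "precision_false": pct(tn, tn + fn),
    }

# Assumed counts, inferred from the reported percentages (not from Fig. 9).
original = metrics(tp=0, fn=8, fp=4, tn=109)
bootstrapped = metrics(tp=5, fn=8, fp=1, tn=107)

for name, result in (("original", original), ("bootstrapped", bootstrapped)):
    print(name, result)
```

The example shows why overall accuracy alone is misleading for rare events: with these counts the original model reaches about 90% accuracy while recalling none of the hazy days.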
The performance evaluation presented in Fig. 9, in which bootstrapping
is applied to increase the power of decision tree induction, therefore
confirms the effectiveness of this well-known statistical tool. The
improvement in sensitivity and predictive precision by means of
bootstrapping can also be considered an augmentation of the tree-based
model, which has inherent strengths in reasoning and
comprehensibility.
(a) Performance of a model built from original imbalanced data
(b) Performance of a model built from bootstrapped data samples
Figure 9. Confusion matrices as evaluation results: (a) the original model
performance and (b) the better performance of the bootstrapped model in
terms of overall accuracy and the improved recall rate of the true
haze-occurrence class.
CONCLUSION
Smoke-haze can occur in almost every region of Thailand after
harvesting periods, normally by the end of the year. The impact of
pollution caused by smoke-haze incidents is most severe in low-lying
areas surrounded by high mountains, such as Chiang Mai province. We
thus try to induce a haze-occurrence predictive model from historical
data, with the intention of applying the model as a warning tool for
officers and ordinary people. The well-known tree induction technique
is considered appropriate for this purpose because the underlying
decision tree structure is widely accepted as simple and easy to
understand.
The major hindrance to directly applying tree induction to the haze
data at hand is that the data are imbalanced. Learning from imbalanced
data, in which instances of one class overwhelm the other classes,
causes much trouble for the tree induction algorithm. The result of
learning from highly imbalanced data is that the final model loses the
patterns of the minority class. We therefore design our model induction
steps to include data sample bootstrapping as an important part to deal
with the data imbalance. After different weights are assigned for
sample drawing, the transformed data set is ready for tree-based model
induction. The model's performance evaluation confirms the
effectiveness of the bootstrapping method, as it increases predictive
accuracy from 90% to around 93%. Sensitivity, the important metric in
predicting rare events, increases significantly from 0% to almost 39%.
ACKNOWLEDGMENT
This research work has been supported in part by grants from the Toray
Industries Foundation, the National Research Council of Thailand, and
Suranaree University of Technology through the funding of the Data
Engineering Research Unit.
BIBLIOGRAPHY
Air quality data, PM10 (2016). Air Quality Information of Thailand,
Division of Air Quality Data, Pollution Control Department, Air Quality
and Noise Management Bureau. Accessed August 2016. Retrieved from
http://www.aqmthai.com/aqi.php
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984).
Classification and Regression Trees. Monterey, CA: Wadsworth &
Brooks/Cole Advanced Books & Software.
Che, H., Zhang, X., Li, Y., Zhen, Z., Qu, J.J., Hao, X. (2009). Haze
trends over the capital cities of 31 provinces in China, 1981-2005.
Theoretical and Applied Climatology, 97, 235-242.
Efron, B., Tibshirani, R.J. (1993). An Introduction to the Bootstrap.
Boca Raton, FL: Chapman & Hall/CRC.
Feng, S., Gao, D., Liao, F., Zhou, F., Wang, X. (2016). The health
effects of ambient PM2.5 and potential mechanisms. Ecotoxicology and
Environmental Safety, 128, 67-74.
Field, R.D., Wang, Y., Roswintiarti, O., Guswanto (2004). A
drought-based predictor of recent haze events in western Indonesia.
Atmospheric Environment, 38, 1869-1878.
Heil, A., Goldammer, J.G. (2001). Smoke-haze pollution: a review of the
1997 episode in Southeast Asia. Regional Environmental Change, 2(1),
24-37.
Kerdprasop, N., Kerdprasop, K. (2016). Regression tree analysis of CO2
emissions and environmental factors to the survival rate of population
in Thailand and China. In The 24th International MultiConference of
Engineers and Computer Scientists, pp. 286-290.
Koe, L., Arellano, A.F., McGregor, J.L. (2001). Investigating the haze
transport from 1997 biomass burning in Southeast Asia: its impact upon
Singapore. Atmospheric Environment, 35, 2723-2734.
Li, J., Han, Z. (2016). A modeling study of severe winter haze events
in Beijing and its neighboring regions. Atmospheric Research, 170,
87-97.
Liu, D., Li, L. (2015). Application study of comprehensive forecasting
model based on entropy weighting method on trend of PM2.5 concentration
in Guangzhou, China. International Journal of Environmental Research
and Public Health, 12, 7085-7099.
Othman, J., Sahani, M., Mahmud, M., Sheikh Ahmad, M.K. (2014).
Transboundary smoke haze pollution in Malaysia: inpatient health
impacts and economic valuation. Environmental Pollution, 189, 194-201.
Qiao, T., Zhao, M., Xiu, G., Yu, J. (2016). Simultaneous monitoring and
compositions analysis of PM1 and PM2.5 in Shanghai: implications for
characterization of haze pollution and source apportionment. Science of
the Total Environment, 557-558, 386-394.
Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1,
81-106.
RapidMiner: RapidMiner Studio.
Retrieved from
https://rapidminer.com
Spessa, A.C., Field, R.D., Pappenberger, F., Langner, A., Englhart, S.,
Weber, U., Stockdale, T., Siegert, F., Kaiser, J.W., Moore, J. (2014).
Seasonal forecasting of fire over Kalimantan, Indonesia. Natural
Hazards and Earth System Sciences Discussions, 2, 5079-5111.
Tian, Y., Shi, G., Huang-Fu, Y., Song, D., Liu, J., Zhou, L., Feng, Y.
(2016). Seasonal and regional variations of source contributions for
PM10 and PM2.5 in urban environment. Science of the Total Environment,
557-558, 697-704.
Velasco, E., Rastan, S. (2015). Air quality in Singapore during the
2013 smoke-haze episode over the Strait of Malacca: lessons learned.
Sustainable Cities and Society, 17, 122-131.
World Meteorological Organization (2016). Meteoterm. Accessed December
2016. Retrieved from http://public.wmo.int/en/resources/meteoterm
Xu, H., Li, G., Yang, S., Xu, X. (2014). Modeling and simulation of
haze process based on Gaussian model. In The 11th International
Computer Conference on Wavelet Active Media Technology and Information
Processing, pp. 68-74.
Yan, X., Shi, W., Luo, N., Zhao, W. (2016). A new method of satellite
based haze aerosol monitoring over the North China Plain and a
comparison with MODIS collection 6 aerosol products. Atmospheric
Research, 171, 31-40.
Zhang, Q., Yan, R., Fan, J., Yu, S., Yang, W., Li, P., Wang, S., Chen,
B., Liu, W., Zhang, X. (2015). A heavy haze episode in Shanghai in
December of 2013: characteristics, origins and implications. Aerosol
and Air Quality Research, 15, 1881-1893.
Zhou, G., Yang, F., Geng, F., Xu, J., Yang, X., Tie, X. (2014).
Measuring and modeling aerosol: relationship with haze events in
Shanghai, China. Aerosol and Air Quality Research, 14, 783-792.