
Volume 5 Number 1, January-June 2017


THE DATA-DRIVEN MODEL TO ANALYZE HAZE OCCURRENCE IN NORTHERN THAILAND

Nittaya Kerdprasop¹, Kittisak Kerdprasop¹, Paradee Chuaybamroong²

¹School of Computer Engineering, Suranaree University of Technology,

²Faculty of Science and Technology, Thammasat University, Rangsit Campus,

Thailand.

ABSTRACT

Seasonal smoke-haze pollution has been a chronic problem for large areas of Southeast Asia, including Thailand. The major cause of smoke-haze is the intentional burning of forests and vegetation fields to clear land for cultivation. Smoke-haze has occurred in almost every part of Thailand, but the most severely affected region is the mountainous area in the north. Chiang Mai province, surrounded by the northern mountainous highlands, has recently experienced an increase in the intensity of fire-related air pollution. We are therefore interested in deriving a data-driven model to be used as a warning tool for haze transported from the surrounding mountainous areas. A haze occurrence predictive model is nevertheless difficult to derive automatically with a learning algorithm because the event is rare compared with the normal conditions that dominate the records of an air quality monitoring system. We thus propose to use the sample bootstrapping technique to help recognize the rare event of pollution caused by smoke-haze. In the model evaluation, bootstrapping increases the overall model accuracy from 90.08% to 92.56%. The greatest gain of the bootstrapped model is in its sensitivity, also called recall, in recognizing the rare event of haze occurrence, which increases from 0% to 38.46%.

KEYWORDS

Statistical inference, Bootstrapping sample, Decision tree learning, Haze occurrence, Air quality, PM10

INTRODUCTION

According to the World Meteorological Organization (WMO), haze is defined as a large number of extremely small, dry particles that are suspended in the atmosphere and cause the sky to look dusky (World Meteorological Organization, 2016). The visibility degradation during a haze phenomenon is due to the scattering and absorption of visible light by the haze particles in the atmosphere. Haze can be a natural phenomenon, but smoke-haze is a man-made event arising from forest fires intentionally set to clear land for cultivation. Thailand and other countries in Southeast Asia have long suffered from smoke-haze pollution episodes (Field, R.D., et al., 2004), (Heil, A. and Goldammer, J.G., 2001), (Koe, L., et al., 2001), (Othman, J., et al., 2014), (Velasco, E. and Rastan, S., 2015).

Haze and smoke-haze can contain high concentrations of hazardous heavy metals and small particulate matter (PM). PM10 and PM2.5 are particulate matter with aerodynamic diameters less than or equal to 10 and 2.5 micrometers, respectively. The smaller the PM, the more harmful it is, because these extremely tiny particles can be retained in the human lung and spread through the whole body. PM can cause airway damage, cardiovascular impairment, and adverse effects in infants (Feng, S., et al., 2016). Winter haze events cause health problems in many areas of China (Li, J. and Han, Z., 2016). Toxic pollutants caused by haze episodes have been investigated (Qiao, T., et al., 2016), (Zhang, Q., et al., 2015). Forecasting models and simulation methods have also been applied to study the transport characteristics of haze events (Che, H., et al., 2009), (Liu, D. and Li, L., 2015), (Tian, Y., et al., 2016), (Xu, H., et al., 2014), (Yan, X., et al., 2016), (Zhou, G., et al., 2014).

In Thailand, smoke-haze is a periodic air pollution incident that occurs during the dry season, from late January possibly to the last week of April or even early May, depending on the arrival of the rain. The extent and intensity of fire-related air pollution in northern Thailand have been increasing in recent years. Controlling burning is obviously an effective solution, but it is quite impractical because so many large areas in valleys and high mountains are difficult for officers to reach.

Monitoring burning incidents as a warning sign is a reasonably feasible way of handling the smoke-haze situation in advance (Spessa, A.C., et al., 2014). We thus propose in this paper a data-driven model to analyze occurrence patterns of smoke-haze spreading from nearby mountainous regions. Our model building method differs from others, however, in that we apply a data mining technique instead of statistical inference relationships and simulation approaches. Specifically, decision tree learning (Breiman, L., et al., 1984), (Quinlan, J.R., 1986) from the data mining research field has been adopted as the model induction method. Decision tree induction is an appropriate learning technique when model interpretation, reasoning, and ease of use are major concerns regarding the induction outcome (Kerdprasop, N. and Kerdprasop, K., 2016).

Because the obtained air quality data are inherently imbalanced, with far fewer polluted days than normal days, we have to apply a balancing technique prior to the application of decision tree learning. We use sample bootstrapping (Efron, B. and Tibshirani, R.J., 1993) as a statistical tool to reweight the data instances and obtain data balance. The data preparation techniques are explained in Section 2. The model induction process is elaborated in Section 3. The results of model evaluation are presented in Section 4. The paper is concluded in Section 5.

STUDY AREA AND DATA COLLECTION

In this work, we focus our study on the haze occurrence patterns around the Chiang Mai area. Chiang Mai is an important province both economically and politically. It is located around 685 kilometers north of Bangkok. The city is situated in the Mae Ping River basin and surrounded by high mountain ranges (shown in Fig. 1). Because the city lies in a low basin valley enclosed by many high mountains, smoke pollution in Chiang Mai lasts for weeks in some years.

Figure 1. Chiang Mai province and the surrounding borders: Mae Hong Son (west), Chiang Rai (northeast), Nan (east), Lampang (southeast) (source: https://maps-for-free.com).


Figure 2. Example of average PM10 values from January 1-5, 2016.

The main source of smoke-haze in northern Thailand and neighboring countries such as Laos and Myanmar is the burning of land. The start of the dry season, around mid to late February, is the onset of slash-and-burn farming. The lingering smoke from the fires causes a serious pollution problem for people living along the Chiang Mai valley from March to April every year.

We thus collect air pollution data, with specific attention to PM10, for Chiang Mai and other mountainous provinces nearby: Mae Hong Son to the west of our study area, Chiang Rai to the northeast, Nan to the far east, and Lampang to the southeast of Chiang Mai. The data are collected from the Air Quality and Noise Management Bureau of Thailand (Air quality data, PM10, 2016). PM10 data for these five northern provinces cover January to April 2016. The data are 24-hour average PM10 values from ground stations in each province. Some data samples are shown in Fig. 2. Chiang Mai (CM) is the target of our analysis. Lampang (LM), Chiang Rai (CR), Mae Hong Son (MH), and Nan are the predictor variables.

METHODOLOGY FOR HAZE OCCURRENCE MODEL CREATION

Data Exploration

Prior to model building, an essential early stage of data analysis is to explore the data distribution and observe the correlations among data attributes. The target of our data analysis is haze occurrence in the Chiang Mai (CM) area. The predictors used to forecast this event are smoke-haze episodes with excessive amounts of PM10 occurring in the nearby Nan, Lampang (LP), Mae Hong Son (MH), and Chiang Rai (CR) provinces.

Taking PM10 concentration as representative of the hazardous pollutants in a smoke-haze episode, the area most highly correlated with the Chiang Mai incidents is Nan (as shown in Fig. 3). The coincidence of peaks in the distribution graph for January to April 2016 (Fig. 4) also confirms this correlation of smoke-haze events.
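The correlation check described above can be sketched as follows. This is a minimal illustration that assumes the Pearson coefficient is the measure behind the correlation matrix; the text does not name the measure. The `pearson` helper and the PM10 readings are hypothetical and invented for this example; they are not the study data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily PM10 averages (micrograms per cubic meter); NOT the study data.
pm10 = {
    "CM": [40.0, 55.0, 90.0, 130.0, 70.0],
    "Nan": [35.0, 50.0, 85.0, 125.0, 60.0],
    "MH": [80.0, 30.0, 60.0, 50.0, 90.0],
}

# Correlate every predictor province with the target province (CM).
correlation_with_cm = {
    name: pearson(values, pm10["CM"])
    for name, values in pm10.items()
    if name != "CM"
}
best = max(correlation_with_cm, key=correlation_with_cm.get)
print(best, round(correlation_with_cm[best], 3))
```

In practice, the same computation would be run over the full January to April series for all four predictor provinces.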


Figure 3. Correlation matrix of PM10 values in Chiang Mai and surrounding provinces.

Figure 4. PM10 during January-April, 2016 in Chiang Mai and other provinces.

The Model Building Process

We design our haze-occurrence modeling to comprise five main steps: data extraction, missing value imputation, data transformation, sample bootstrapping, and model building. Data extraction simply creates a file containing the values of interest. The next step is missing value handling, in which the mean value is used for imputation. For data transformation, numeric values are transformed into nominal ones. This transformation serves our main purpose of building a classification model that can be used both to explain the co-occurrence relationships and to predict future events. We thus transform the daily records of numeric PM10 values into either hazardous (true) or not hazardous (false). The harmful level of PM10 concentration is at or above 120 micrograms per cubic meter. The transformed data are illustrated in Fig. 5. At the end of this step, a haze-occurrence model can already be created (as shown in Fig. 6).
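The imputation and transformation steps can be sketched in a few lines. This is only an illustration under stated assumptions: the study performed these steps in RapidMiner, the function names `impute_mean` and `to_hazard_flags` are hypothetical, and the sample readings are invented. Only the 120 micrograms per cubic meter threshold and the use of the mean for imputation come from the text.

```python
HAZARD_THRESHOLD = 120.0  # micrograms per cubic meter, the harmful PM10 level in the text


def impute_mean(values):
    """Replace missing readings (None) with the mean of the observed readings."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def to_hazard_flags(values, threshold=HAZARD_THRESHOLD):
    """Map numeric PM10 values to True (hazardous) or False (not hazardous)."""
    return [v >= threshold for v in values]


# Hypothetical daily PM10 readings for one province; None marks a missing day.
daily_pm10 = [95.0, None, 120.0, 143.0, 60.0]

imputed = impute_mean(daily_pm10)
flags = to_hazard_flags(imputed)
print(imputed)
print(flags)
```

Note that a reading of exactly 120 counts as hazardous, following the "at or above" wording.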

Figure 5. A new data set in which numeric values are transformed to binomial ones.

Figure 6. Original Chiang Mai haze occurrence predictive model built from imbalanced data.

Figure 7. A framework of haze occurrence predictive modeling with the bootstrapping technique.

Notice that the model in Fig. 6 has limited power: it captures only a single hazardous event pattern, which is represented as the true leaf node. This limitation is due to the small minority of 8 haze days compared with the 113 days of normal conditions. To increase the power of the model, we therefore propose the important step of bootstrapping, as shown in Fig. 7.

Bootstrapping the Model

The proposed framework of model induction with the data balancing technique of sample bootstrapping has been implemented with the RapidMiner software, version Studio Basic 17.0 (RapidMiner). The difference between the new bootstrapped model (in Fig. 8) and the original model built from the imbalanced data is that the new one can capture four sequences of haze occurrence in Chiang Mai province, whereas the original model can recognize only one sequence. The two versions of the haze-occurrence predictive model are, however, similar in their root node, which indicates that smoke-haze has normally occurred in Nan before spreading through other provinces and finally covering the Chiang Mai area.


Figure 8. The Chiang Mai haze occurrence predictive model built from bootstrapped data.

The haze-occurrence predictive model (in Fig. 8) can be interpreted as warning signs for probable haze incidents in Chiang Mai as follows:

Warning sign #1: Hazardous haze will occur in Chiang Mai, if there are hazardous haze events in Lampang and Mae Hong Son.

Warning sign #2: Hazardous haze will occur in Chiang Mai, if there are hazardous haze events in Chiang Rai and Lampang.

Warning sign #3: Hazardous haze will occur in Chiang Mai, if there are hazardous haze events in Nan and Mae Hong Son.

Warning sign #4: Hazardous haze will occur in Chiang Mai, if there are hazardous haze events in Nan and Lampang.
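Read together, the warning signs listed above amount to a disjunction of four province pairs. The sketch below encodes that reading as a plain rule function. It is a flat restatement of the listed conditions, not a reproduction of the tree structure in Fig. 8, and the function name and parameter names are hypothetical.

```python
def chiang_mai_haze_warning(nan, lampang, mae_hong_son, chiang_rai):
    """Return True if any listed pair of provinces reports hazardous haze.

    Each argument is a boolean flag: True means the province recorded a
    hazardous PM10 level on that day.
    """
    return bool(
        (lampang and mae_hong_son)      # Lampang and Mae Hong Son
        or (chiang_rai and lampang)     # Chiang Rai and Lampang
        or (nan and mae_hong_son)       # Nan and Mae Hong Son
        or (nan and lampang)            # Nan and Lampang
    )


# Example: haze in Nan and Lampang only triggers the warning.
print(chiang_mai_haze_warning(nan=True, lampang=True,
                              mae_hong_son=False, chiang_rai=False))
```

Under this reading, a hazardous reading in a single province is never sufficient to raise a warning.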

MODEL EVALUATION

Model performance evaluation is the last step of our designed process. The performance of the bootstrapped model is assessed against the original model built without the data balancing technique. Our original data set contains 121 data instances, of which 8 are true harmful haze events. The remaining 113 days are not hazardous events. We thus perform bootstrapping to balance the data, with a weight of 1 assigned to the harmful hazy days and a weight of 0.07 applied to the normal days. The weight 0.07 is computed by down-balancing the majority group of data to approximately the same size as the minority group (113 non-hazy days × 0.07 ≈ 8 days). Under this weight assignment scheme, the bootstrapped data samples are then drawn. The bootstrapped data set therefore does not exactly match the original data set in the number of data instances in the true and false classes. Percentages can nevertheless be used to make a fair comparison.
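The weighting scheme can be illustrated with a short sampling sketch. It assumes that the weights act as relative probabilities when instances are drawn with replacement; how RapidMiner applies the weights internally is not described in the text, so this is not a reproduction of the tool's behavior. The function `weighted_bootstrap` and the fixed seed are hypothetical choices for the example.

```python
import random


def weighted_bootstrap(labels, weight_by_label, sample_size, seed=0):
    """Draw indices with replacement, each with probability proportional to its class weight."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    weights = [weight_by_label[label] for label in labels]
    return rng.choices(range(len(labels)), weights=weights, k=sample_size)


# Class sizes and weights from the text: 8 hazy days (weight 1), 113 normal days (weight 0.07).
labels = [True] * 8 + [False] * 113
weights = {True: 1.0, False: 0.07}

# Expected share of hazy days in the sample under the assumption above.
expected_true_share = (8 * 1.0) / (8 * 1.0 + 113 * 0.07)

sample = weighted_bootstrap(labels, weights, sample_size=len(labels))
true_count = sum(labels[i] for i in sample)
print(round(expected_true_share, 3), true_count, len(sample) - true_count)
```

Because 113 × 0.07 = 7.91, the two classes carry nearly equal total weight, so under this assumption roughly half of the drawn instances would be hazy days. The exact class counts vary from draw to draw.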

The performance evaluation of the haze-occurrence predictive model built from the original imbalanced data is presented as a confusion matrix in the upper table (a) of Fig. 9. This evaluation report is based on the 10-fold cross validation assessment method. The overall accuracy of this original model, which considers predictive accuracy on both harmful hazy days and non-hazy days, is 90.08%. On closer inspection, however, the model's accuracy in predicting harmful hazy days is 0%.
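For readers unfamiliar with the assessment method, the following sketch shows how 10-fold cross validation partitions 121 instances into training and test sets. It is a simplified illustration: the folds here are contiguous and unshuffled, whereas the shuffling or stratification used by the validation tool is not described in the text. The helper `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_items, k=10):
    """Split range(n_items) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


folds = k_fold_indices(121, k=10)

# Each fold serves once as the test set; the other nine folds form the training set.
splits = [
    ([i for j, f in enumerate(folds) if j != held_out for i in f], folds[held_out])
    for held_out in range(len(folds))
]
print([len(test) for _, test in splits])
```

Every instance is tested exactly once, and the reported percentages are computed over the pooled test predictions.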

The bootstrapped model is also evaluated with the 10-fold cross validation assessment method. The overall accuracy of the bootstrapped model is 92.56%. When it predicts a harmful hazy day, its accuracy (precision) is 83.33%. When it predicts a non-harmful day, its accuracy is 93%, which is approximately the same as that of the original model.

In terms of sensitivity, the bootstrapped model can recall harmful events with 38.46% correctness, whereas the original model yields 0%. The bootstrapping technique also improves the sensitivity of predicting non-hazardous events (the false class) from 96.46% in the original model to 99.07% in the bootstrapped model.
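The reported percentages can be checked with a short worked computation. The confusion-matrix counts below are not read from Fig. 9; they are inferred here as the integer counts over 121 instances that reproduce the percentages quoted in the text, and should be treated as an assumption. The `metrics` helper is hypothetical.

```python
def metrics(tp, fn, fp, tn):
    """Compute accuracy, per-class recall, and precision of the true class."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall_true": tp / (tp + fn),    # sensitivity for hazardous days
        "recall_false": tn / (tn + fp),   # sensitivity for non-hazardous days
        "precision_true": tp / (tp + fp) if (tp + fp) else None,
    }


# Counts inferred from the reported percentages (assumption, not taken from Fig. 9).
original = metrics(tp=0, fn=8, fp=4, tn=109)
bootstrapped = metrics(tp=5, fn=8, fp=1, tn=107)

for name, result in (("original", original), ("bootstrapped", bootstrapped)):
    print(name, {k: round(v * 100, 2) for k, v in result.items() if v is not None})
```

With these counts, recall of the true class is 5/13 for the bootstrapped model against 0/8 for the original, which matches the 38.46% and 0% quoted above.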

The performance evaluation presented in Fig. 9, in which bootstrapping has been applied to strengthen decision tree induction, therefore confirms the efficiency of this well-known statistical tool. The improvement in sensitivity and predictive precision achieved by bootstrapping can also be considered an augmentation of the tree-based model, which has the inherent advantages of reasoning ability and comprehensibility.

(a) Performance of a model built from original imbalanced data

(b) Performance of a model built from bootstrapped data samples

Figure 9. Confusion matrices as evaluation results: (a) the original model performance and (b) the better performance of the bootstrapped model in terms of overall accuracy and the improved recall rate of the true haze-occurrence class.

CONCLUSION

Smoke-haze can occur in almost every region of Thailand after harvesting periods, normally by the end of the year. The impact of pollution caused by smoke-haze incidents is most severe in low-lying areas surrounded by high mountains, such as Chiang Mai province. We thus try to induce a haze-occurrence predictive model from historical data with the intention of applying the model as a warning tool for officers and ordinary people. The well-known tree induction technique has been considered appropriate for this purpose because the underlying decision tree structure is widely accepted as simple and easy to understand. The major hindrance to directly applying tree induction to the haze data at hand is that the data are imbalanced. Learning from imbalanced data, in which the instances of one class overwhelm those of the other classes, causes much trouble for the tree induction algorithm: the final model loses the patterns of the minority class. We therefore design our model induction steps to include data sample bootstrapping as an important part of dealing with the data imbalance. After different weights are assigned to the classes for sample drawing, the transformed data set is ready for tree-based model induction. The model performance evaluation confirms the efficiency of the bootstrapping method, as it increases predictive accuracy from 90% to around 93%. Sensitivity, the important metric in predicting rare events, increases significantly from 0% to almost 39%.

ACKNOWLEDGMENT

This research work has been supported in part by grants from the Toray Industries Foundation, the National Research Council of Thailand, and Suranaree University of Technology through the funding of the Data Engineering Research Unit.



BIBLIOGRAPHY

Air quality data, PM10 (2016). Air Quality Information of Thailand, Division of Air Quality Data, Pollution Control Department, Air Quality and Noise Management Bureau. Accessed August 2016. Retrieved from http://www.aqmthai.com/aqi.php

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.

Che, H., Zhang, X., Li, Y., Zhen, Z., Qu, J.J., Hao, X. (2009). Haze trends over the capital cities of 31 provinces in China, 1981-2005. Theoretical and Applied Climatology, 97, 235-242.

Efron, B., Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Boca Raton, FL: Chapman & Hall/CRC.

Feng, S., Gao, D., Liao, F., Zhou, F., Wang, X. (2016). The health effects of ambient PM2.5 and potential mechanisms. Ecotoxicology and Environmental Safety, 128, 67-74.

Field, R.D., Wang, Y., Roswintiarti, O., Guswanto (2004). A drought-based predictor of recent haze events in western Indonesia. Atmospheric Environment, 38, 1869-1878.

Heil, A., Goldammer, J.G. (2001). Smoke-haze pollution: a review of the 1997 episode in Southeast Asia. Regional Environmental Change, 2(1), 24-37.

Kerdprasop, N., Kerdprasop, K. (2016). Regression tree analysis of CO2 emissions and environmental factors to the survival rate of population in Thailand and China. In The 24th International MultiConference of Engineers and Computer Scientists, pp. 286-290.

Koe, L., Arellano, A.F., McGregor, J.L. (2001). Investigating the haze transport from 1997 biomass burning in Southeast Asia: its impact upon Singapore. Atmospheric Environment, 35, 2723-2734.

Li, J., Han, Z. (2016). A modeling study of severe winter haze events in Beijing and its neighboring regions. Atmospheric Research, 170, 87-97.

Liu, D., Li, L. (2015). Application study of comprehensive forecasting model based on entropy weighting method on trend of PM2.5 concentration in Guangzhou, China. International Journal of Environmental Research and Public Health, 12, 7085-7099.

Othman, J., Sahani, M., Mahmud, M., Sheikh Ahmad, M.K. (2014). Transboundary smoke haze pollution in Malaysia: inpatient health impacts and economic valuation. Environmental Pollution, 189, 194-201.

Qiao, T., Zhao, M., Xiu, G., Yu, J. (2016). Simultaneous monitoring and compositions analysis of PM1 and PM2.5 in Shanghai: implications for characterization of haze pollution and source apportionment. Science of the Total Environment, 557-558, 386-394.

Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.

RapidMiner: RapidMiner Studio. Retrieved from https://rapidminer.com

Spessa, A.C., Field, R.D., Pappenberger, F., Langner, A., Englhart, S., Weber, U., Stockdale, T., Siegert, F., Kaiser, J.W., Moore, J. (2014). Seasonal forecasting of fire over Kalimantan, Indonesia. Natural Hazards and Earth System Sciences Discussions, 2, 5079-5111.

Tian, Y., Shi, G., Huang-Fu, Y., Song, D., Liu, J., Zhou, L., Feng, Y. (2016). Seasonal and regional variations of source contributions for PM10 and PM2.5 in urban environment. Science of the Total Environment, 557-558, 697-704.

Velasco, E., Rastan, S. (2015). Air quality in Singapore during the 2013 smoke-haze episode over the Strait of Malacca: lessons learned. Sustainable Cities and Society, 17, 122-131.

World Meteorological Organization. (2016). Meteoterm. Accessed December 2016. Retrieved from http://public.wmo.int/en/resources/meteoterm

Xu, H., Li, G., Yang, S., Xu, X. (2014). Modeling and simulation of haze process based on Gaussian model. In The 11th International Computer Conference on Wavelet Active Media Technology and Information Processing, pp. 68-74.

Yan, X., Shi, W., Luo, N., Zhao, W. (2016). A new method of satellite based haze aerosol monitoring over the North China Plain and a comparison with MODIS collection 6 aerosol products. Atmospheric Research, 171, 31-40.

Zhang, Q., Yan, R., Fan, J., Yu, S., Yang, W., Li, P., Wang, S., Chen, B., Liu, W., Zhang, X. (2015). A heavy haze episode in Shanghai in December of 2013: characteristics, origins and implications. Aerosol and Air Quality Research, 15, 1881-1893.

Zhou, G., Yang, F., Geng, F., Xu, J., Yang, X., Tie, X. (2014). Measuring and modeling aerosol: relationship with haze events in Shanghai, China. Aerosol and Air Quality Research, 14, 783-792.
