Spatial Data Mining
using SAR-Kriging Model
Atje Setiawan Abdullah
A Lecturer at Informatics Engineering Study Program
Department of Computer Science FMIPA Universitas Padjadjaran
Jl. Raya Bandung Sumedang Km 21 Jatinangor
e-mail: [email protected], [email protected]
SEAMS School
Spatio Temporal Data Mining and Optimization Modeling
UTC-Bandung, August 9-19, 2016
1. Introduction
In this paper we combine the Expansion Spatial
Autoregressive (Expansion SAR) model, an extension
of the SAR model, with the Kriging technique to
predict the quality of elementary school education.
Quality of education is defined as student learning
achievement, measured by the national final
examination (UAN). In Indonesia, UAN scores still
vary widely because education services differ from
one location to another.
Elementary and secondary education is a schooling
process intended to bring students to a certain level
of cognitive, psychomotor, and affective ability, as
specified by the elementary and secondary education
curriculum. Quality of education is defined as the
achievement reached by students, measured by the
national final examination (UAN) score.
1.1 Problems
Research on quality of education is still limited. It
focuses on measuring educational outcomes through
school UAN scores, and the analysis is mostly
restricted to descriptive methods. Given the large
geographic extent of Indonesia and the social,
economic, and cultural conditions that differ from
one location to another, the quality of education in
schools at different locations in Indonesia is a
natural subject for spatial data mining.
One spatial data mining model that can be used for
both description and prediction is the Expansion
Spatial Autoregressive (Expansion SAR) model. The
Expansion SAR model predicts observations at sampled
locations and measures heterogeneity based on the
coordinates of the spatial locations. A limitation
of the SAR model is that it cannot be used for
prediction at unsampled locations. Kriging is a
spatial analysis method that can predict at
unsampled locations. We therefore combine the SAR
and Kriging methods into SAR-Kriging, which predicts
at unsampled locations by using the SAR parameters
as input to the Kriging method.
1.2 The Aims of Research
• To study a model that combines the Expansion SAR
model and the Kriging method (SAR-Kriging).
• To apply spatial data mining with the SAR-Kriging
method for prediction at unsampled locations. As a
case study we use the SDPN 2003 database to predict
quality of education for elementary, junior high,
and senior high schools in Indonesia.
[Figure: Spatial data mining process using SAR-Kriging.
Preprocessing: SDPN 2003 database and external data (subdistrict coordinates); data cleaning and transformation to ratios; indicator selection using factor analysis and SEM; integration of spatial and non-spatial data.
Data mining: SAR model and Moran index; Expansion SAR model and graphs; Kriging computation; SAR-Kriging model and equations.
Postprocessing: evaluation and visualization; interpretation; comparison of actual and predicted data; knowledge pattern.]
[Figure: Data mining process. Components shown: SDPN 2003 database, data selection, data cleaning, data transformation, data integration, data mining, interpretation and visualization of results, knowledge, application development.]
Scalability: the data size is 3.91 GB (4,178,499,369 bytes), structured as tables for SD, SMP, and SMA (elementary, junior high, and senior high school).
Non-traditional analysis: the data involve location coordinates and maps of subdistricts, districts, and provinces in Indonesia; the analysis uses spatial models.
Data ownership and distribution: the data are geographically distributed over provinces, districts, subdistricts, and villages.
Heterogeneity and complex data: the data combine non-spatial data (indicators of education quality) and spatial data (subdistrict coordinates).
High dimensionality: the total number of records is 203,590 and the number of variables is 569.
SDPN 2003 DATABASE
SCHOOL DATA
TK (kindergarten): 54,226 records
SD (elementary school): 158,590 records
SMP (junior high school): 28,949 records
SMA (senior high school): 10,810 records
SMK (vocational high school): 4,753 records
RESEARCH DATA
SD: 158,590 records with 122 variables
SMP: 28,949 records with 138 variables
SMA: 10,810 records with 142 variables
DATA SELECTION
SDPN 2003 DATABASE
School data: 257,660
Out-of-school education data: 3,047
Non-education data: 240
Higher education data: 13,202
SELECT left(sd_sarana.id, 7) AS kdkec,
       Sum(jbkips_1 + jbkips_2 + jbkips_3 + jbkips_4 + jbkips_5 + jbkips_6
           + jbkPPKN_1 + jbkPPKN_2 + jbkPPKN_3 + jbkPPKN_4 + jbkPPKN_5 + jbkPPKN_6
           + jbkINDO_1 + jbkINDO_2 + jbkINDO_3 + jbkINDO_4 + jbkINDO_5 + jbkINDO_6
           + jbkMat_1 + jbkMat_2 + jbkMat_3 + jbkMat_4 + jbkMat_5 + jbkMat_6
           + jbkipa_1 + jbkipa_2 + jbkipa_3 + jbkipa_4 + jbkipa_5 + jbkipa_6)
         / Sum(jsisK_tk1l + jsisK_tk1p + jsisK_tk2l + jsisK_tk2p + jsisK_tk3l + jsisK_tk3p
               + jsisK_tk4l + jsisK_tk4p + jsisK_tk5l + jsisK_tk5p + jsisK_tk6l + jsisK_tk6p) AS RSBKTS,
       Sum(Lbangun)
         / Sum(jsisK_tk1l + jsisK_tk1p + jsisK_tk2l + jsisK_tk2p + jsisK_tk3l + jsisK_tk3p
               + jsisK_tk4l + jsisK_tk4p + jsisK_tk5l + jsisK_tk5p + jsisK_tk6l + jsisK_tk6p) AS RSLBTS,
       Sum(Ltanah)
         / Sum(jsisK_tk1l + jsisK_tk1p + jsisK_tk2l + jsisK_tk2p + jsisK_tk3l + jsisK_tk3p
               + jsisK_tk4l + jsisK_tk4p + jsisK_tk5l + jsisK_tk5p + jsisK_tk6l + jsisK_tk6p) AS RSLTTS,
       Sum(jrng_baik)
         / Sum(jrng_baik + jrng_rr + jrng_rb + jrng_bm) AS RSRB,
       Sum(jprg_ppkn + jprg_indo + jprg_mat + jprg_ipa + jprg_ips)
         / Sum(jrng_baik + jrng_rr + jrng_rb + jrng_bm) AS RSPRGTK
FROM SD_Sarana INNER JOIN SD_SISWA ON SD_Sarana.ID = SD_SISWA.ID
GROUP BY left(sd_sarana.id, 7);
TRANSFORMATION OF DATA FROM VARIABLES TO INDICATORS
The query above produces the aggregated facility (sarana) indicators for each subdistrict.
INDICATOR SELECTION
Stage                                          SD    SMP   SMA
Base data (variables)                          122   138   142
Base data transformed to indicators (query)     21    19    20
Indicators selected by factor analysis          14    16    14
Indicators selected by SEM                        7    10    13
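RESEARCH INDICATORS OF EDUCATION QUALITY AT THE ELEMENTARY SCHOOL (SD) LEVEL
The indicators are arranged as input, process, and quality (output):
RSTRB: ratio of the number of students to the number of classes
RSTGR: ratio of the number of students to the number of teachers
RSBR7: ratio of the number of 7-year-old students to the number of students
RSULGTJS: ratio of the number of repeating students to the number of students
RSB: ratio of the number of new students to the number of students
RSDFTK: ratio of the number of applicants from kindergarten (TK) to the number of applicants
RSUM712: ratio of the number of students aged 7-12 to the number of students
RSPTSTD: ratio of the number of dropouts to the number of students
RSBKTS: ratio of the number of books to the number of students
RSLBTS: ratio of building area to the number of students
RSLTTS: ratio of land area to the number of students
RSRB: ratio of the number of classrooms in good condition to the number of classrooms
RSPRGTK: ratio of the number of teaching aids to the number of classes
RSGTTG: ratio of the number of permanent teachers to the number of teachers
RSGLTG: ratio of the number of teachers with at least a D2 qualification to the number of teachers
RSGKTG: ratio of the number of class teachers to the number of teachers
RSGATRB: ratio of the number of religion teachers to the number of study groups
RSGINTROM: ratio of the number of English teachers to the number of study groups
TOTUAS: mean total UAS score
TKTLLS: mean student pass rate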
[Figure: SEM path diagram linking input, process, and quality indicators (RSTGR, RSBR12, RSUM1315, RSDFSD, RSLAB, RSRB, RSGUAN, RSGLTG, RSPTSTS, TOTUAN). Model fit: Chi-Square = 32.88, df = 27, P-value = 0.20104, RMSEA = 0.023.]
DATA INTEGRATION
• Link the subdistricts on the spatial map with the subdistricts surveyed in SDPN 2003.
• Remove the subdistricts that were not surveyed in SDPN 2003 by editing the spatial data.
• Merge the non-spatial data with the selected spatial data in the spatial map table, matching each subdistrict.
• Run the MATLAB program with the appropriate method.
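The integration step can be illustrated with a minimal sketch in Python. It is not the authors' MATLAB program: the subdistrict codes, indicator values, and coordinates below are invented, and the helper name `integrate` is hypothetical. The sketch only shows the idea of joining non-spatial indicators to subdistrict coordinates and dropping subdistricts that were not surveyed.

```python
# Hypothetical non-spatial indicators keyed by 7-digit subdistrict code (kdkec).
indicators = {
    "3601010": {"RSTRB": 31.2, "TOTUAS": 26.1},
    "3601020": {"RSTRB": 28.5, "TOTUAS": 25.4},
    "7301010": {"RSTRB": 35.0, "TOTUAS": 24.8},
}

# Hypothetical spatial table: (longitude, latitude) of every subdistrict on the
# map, including one ("1101010") that was not surveyed.
coordinates = {
    "3601010": (106.10, -6.40),
    "3601020": (106.25, -6.30),
    "7301010": (119.40, -5.10),
    "1101010": (95.30, 5.50),
}


def integrate(indicators, coordinates):
    """Join both tables on the subdistrict code, keeping surveyed subdistricts only."""
    merged = {}
    for kdkec, values in indicators.items():
        if kdkec in coordinates:  # subdistricts missing from the map are skipped
            lon, lat = coordinates[kdkec]
            merged[kdkec] = {**values, "lon": lon, "lat": lat}
    return merged


integrated = integrate(indicators, coordinates)
for kdkec, row in integrated.items():
    print(kdkec, row)
```

Because the loop runs over the surveyed subdistricts, the unsurveyed code never enters the integrated table, which mirrors the removal step in the list above.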
[Figure: Literature map of the study]
SDPN 2003 database: Sihombing (2002); Nababan (2003).
Spatial data mining process: Cliff and Ord (1975); Anselin (1988); Cressie (1993); Armstrong (1998); Lazarevic (2000); Lichstein et al. (2002); Sekhar et al. (2003); LeSage (1999); LeSage and Pace (2004); Van Beers and Kleijnen (2004); Celik et al. (2005); Bronnenberg (2005); Kanazaki et al. (2006); Kumar and Remadevi (2006); Bakkali and Amrani (2008); Lu et al. (2008); Zhao Lu et al. (2008).
Data mining: Koperski et al. (1997); Berry and Linoff (2000); Soukup and Davidson (2002); Giudici et al. (2003); Han and Kamber (2006); Tan et al. (2006); Olson and Shi (2007); Refaat (2007); Giannotti and Pedreschi (2008); Maimon and Rokach (2008).
SPATIAL DATA MINING
Variable selection: input-process-output framework; factor analysis and SEM
Description: Moran index and Moran plot
Prediction: ordinary Kriging
Causal model: SAR model; Expansion SAR model
Combined model: SAR-Kriging model
1.3 Variables of Research
In this research we use the SDPN 2003 database from
Balitbang-Depdiknas (2003), in particular its basic variables
and indicator variables. Basic variables are the variables in
the raw data of individual schools. Indicator variables are
derived from the basic variables. The basic variables cover
school identity, student indicators, facility indicators,
teacher indicators, and the total UAN score. From these
indicators we build an input-process-output system of
education quality: the input consists of the student
indicators, the process consists of the facility and teacher
indicators, and the output consists of the total UAN score
and the mean pass rate. Indicators are selected using factor
analysis and the Structural Equation Model (SEM).
Figure 1.1 Variables Reduction Process
[Figure: Indicator variables for elementary school (SD) education quality after reduction with the Structural Equation Model.
Input: ratio of students to classes (RSTRB); ratio of 7-year-old students to students (RSBR7); ratio of new students to students (RSB).
Process: ratio of classrooms in good condition to classrooms (RSRB); ratio of teachers with at least a D2 qualification to teachers (RSGLTG).
Quality: mean total UAS score (TOTUAS); mean student pass rate (TKTLLS).]
Figure 1.1 shows the indicator variables that affect
quality after reduction with factor analysis and
SEM. The input consists of 3 indicators: the ratio
of students to classes, the ratio of 7-year-old
students to students in the first grade, and the
ratio of new students to all students. The process
consists of 2 indicators: the ratio of classrooms in
good condition to all classrooms and the ratio of
qualified teachers to all teachers. The output
consists of 2 indicators: the total UAN score and
the pass rate. The output indicator UAN is then
analyzed with the Expansion SAR model.
2. Modeling in Spatial Data Mining
2.1 The Expansion SAR Model
The SAR model measures spatial heterogeneity based on
neighborhood. The Expansion SAR model is instead a locally
linear spatial model that measures spatial heterogeneity
based on the coordinates of the spatial locations. This type
of spatial model was first introduced by Casetti (1972, 1992,
in Anselin, 1988, and LeSage, 1999). Consider the following
regression model:
$$y = \beta_0 + \beta_1 x + \varepsilon \tag{2.1}$$
where $\beta_0$ and $\beta_1$ are regression coefficients and $x$ is the
vector of observations on the independent variable. The
regression coefficient $\beta_1$ reflects spatial heterogeneity
across the observation units. To capture it, the coefficient
is expanded in terms of a number of expansion variables, for
example $z_1$ and $z_2$, such that:
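$$\beta_1 = \gamma_0 + \gamma_1 z_1 + \gamma_2 z_2 \tag{2.2}$$

Substituting equation (2.2) into equation (2.1) gives:

$$y = \beta_0 + \gamma_0 x + \gamma_1 (z_1 x) + \gamma_2 (z_2 x) + \varepsilon$$

In general, the Casetti model is formulated as follows:

$$y = X\beta + \varepsilon, \qquad \beta = ZJ\beta_0 \tag{2.3}$$

where

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} x_1' & 0 & \cdots & 0 \\ 0 & x_2' & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & x_n' \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_n \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \quad
\beta_0 = \begin{pmatrix} \beta_x \\ \beta_y \end{pmatrix}$$

$$Z = \begin{pmatrix} Z_{x1} \otimes I_k & Z_{y1} \otimes I_k & 0 & \cdots \\ 0 & \ddots & \ddots & \\ \vdots & & Z_{xn} \otimes I_k & Z_{yn} \otimes I_k \end{pmatrix}$$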
The parameters of the model are estimated by the least
squares method. Given these parameter estimates, the
estimates for each point in space are obtained from the
second equation of (2.3). The distance from the center of
the observations is formulated as:
$$d_i = \sqrt{(z_{xi} - z_{xc})^2 + (z_{yi} - z_{yc})^2} \tag{2.4}$$

so that the Expansion SAR model can be written as:

$$y = \alpha + X\beta + XD\beta_0 + \varepsilon \tag{2.5}$$
In equation (2.5), the influence of the variables can be
separated into a non-spatial part and a spatial part:

$$y = \alpha + \underbrace{X\beta}_{\text{non-spatial}} + \underbrace{XD\beta_0}_{\text{spatial}} + \varepsilon$$
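As an illustration of equations (2.4) and (2.5), the following minimal sketch fits the distance expansion by ordinary least squares with NumPy. All data are synthetic, the center is assumed to be the mean coordinate, and the term written $XD$ in (2.5) is computed as row $i$ of $X$ scaled by $d_i$; these choices are assumptions made for the example, not details taken from the original study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n locations, k = 2 explanatory variables (all values made up).
n, k = 200, 2
coords = rng.uniform(0.0, 10.0, size=(n, 2))  # (z_x, z_y) of each location
X = rng.normal(size=(n, k))
center = coords.mean(axis=0)                  # assumed center (z_xc, z_yc)

# Equation (2.4): distance of every location from the center.
d = np.sqrt(((coords - center) ** 2).sum(axis=1))

# Simulate y from equation (2.5) with known parameters so the fit can be checked.
alpha = 25.0
beta = np.array([2.0, -1.0])
beta0 = np.array([0.5, 0.3])
y = alpha + X @ beta + (d[:, None] * X) @ beta0 + rng.normal(scale=0.1, size=n)

# Design matrix [1, X, DX]: D is diagonal, so DX scales row i of X by d_i.
design = np.column_stack([np.ones(n), X, d[:, None] * X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
alpha_hat, beta_hat, beta0_hat = coef[0], coef[1:1 + k], coef[1 + k:]

# Split the fitted values into the non-spatial and spatial parts of (2.5).
non_spatial = X @ beta_hat
spatial = (d[:, None] * X) @ beta0_hat
print(alpha_hat, beta_hat, beta0_hat)
```

With real data, `coords`, `X`, and `y` would come from the integrated subdistrict table, and the fitted `beta_hat` and `beta0_hat` play the roles of $\beta$ and $\beta_0$.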
The parameters $\beta$ and $\beta_0$ describe the marginal
non-spatial and spatial influences, respectively. The effect
of each independent variable on the dependent variable can
also be shown graphically through the equations

$$\gamma_{xi} = Z_{xi}\beta_x, \qquad \gamma_{yi} = Z_{yi}\beta_y, \qquad \gamma_{di} = D_i\beta_0 \tag{2.6}$$
2.2 Ordinary Kriging Method
Kriging is a method of calculating estimates of a
regionalized variable at a point, over an area, or within a
volume, and uses as a criterion the minimization of an
estimation variance. Kriging interpolation involves the
generation of images of reservoir properties and is
commonly used to visualize reservoir heterogeneities.
Kriging techniques are therefore not well suited to
reproducing geological reservoir patterns where the
number of data is very limited. Using the Kriging technique,
we can predict the observation at an unsampled location
(Armstrong, 1998).
Assume that the regionalized variable under study has
values $Z_i = Z(x_i)$, each representing the value at a point
$x_i$. Also assume that this regionalized variable is
second-order stationary, with:
Expectation: $E[Z(x)] = m$
Covariance: $E[Z(x+h)\,Z(x)] - m^2 = C(h)$
Variogram: $E[(Z(x+h) - Z(x))^2] = 2\gamma(h)$
A kriged estimator $Z_V^*$ is a linear combination of $n$ values
of the regionalized variable:

$$Z_V^* = \sum_{i=1}^{n} \lambda_i Z_i \tag{2.7}$$
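For two sample locations, minimizing the Kriging variance
gives the weights (Armstrong, 1998):

$$\lambda_1 = \frac{1}{2}\left(1 + \frac{\gamma_{V2} - \gamma_{V1}}{\gamma_{12}}\right), \qquad
\lambda_2 = \frac{1}{2}\left(1 + \frac{\gamma_{V1} - \gamma_{V2}}{\gamma_{12}}\right) \tag{2.8}$$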
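To obtain the values of $\lambda_1$ and $\lambda_2$ with the ordinary
Kriging method, we need the values of $\gamma_{V1}$, $\gamma_{V2}$, and
$\gamma_{12}$. The value $\gamma_{12}$ is the experimental semivariogram
between the two sample points, and $\gamma_{V1}$ is the
semivariogram between the first sample point and the
unsampled point to be predicted; $\gamma_{V2}$ is defined in the
same way for the second sample point.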
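The two-point case can be checked numerically. The sketch below solves the ordinary Kriging system written with semivariograms, under the assumption that the semivariogram is zero at zero lag; the function name `two_point_ok_weights` and the input values are illustrative only.

```python
import numpy as np


def two_point_ok_weights(gamma_v1, gamma_v2, gamma_12):
    """Ordinary Kriging weights for two sample points and one target point V.

    Solves the system (assuming gamma(0) = 0):
        lam2 * gamma_12 + mu = gamma_v1
        lam1 * gamma_12 + mu = gamma_v2
        lam1 + lam2          = 1
    where mu is the Lagrange multiplier of the unbiasedness constraint.
    """
    A = np.array([[0.0, gamma_12, 1.0],
                  [gamma_12, 0.0, 1.0],
                  [1.0, 1.0, 0.0]])
    b = np.array([gamma_v1, gamma_v2, 1.0])
    lam1, lam2, mu = np.linalg.solve(A, b)
    return lam1, lam2, mu


# Illustrative semivariogram values (made up): the target is closer to point 1.
lam1, lam2, mu = two_point_ok_weights(gamma_v1=0.4, gamma_v2=0.8, gamma_12=1.0)
print(lam1, lam2)
```

The sample point with the smaller semivariogram to the target receives the larger weight, and the two weights always sum to one.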
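For the case study we use the spherical semivariogram for
two locations:
[Equation (2.9): the semivariogram $\gamma(h)$ is given piecewise in terms of the experimental semivariogram $\hat{\gamma}$ and the range $r$, with one branch for lags $h$ within the range and one for lags beyond it.]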
2.3 SAR-Kriging Method
The SAR-Kriging method in this study combines the
Expansion SAR model with the Kriging technique in order to
predict quality of education at unsampled locations. The
stages of the SAR-Kriging model are as follows (Abdullah,
2009):
• Determine the dependent and independent variables of the
Expansion SAR model, incorporating regional data through
the distance between the center location and each
observed location.
• Estimate the parameters of the Expansion SAR model with
the Maximum Likelihood method.
• Determine the unsampled location, situated between two
sample locations, from its coordinates and its distances to
the sample locations.
• Use the parameter estimates of the Expansion SAR model
as input to the Kriging method to obtain the weights at the
location where quality of education is to be predicted.
• The Kriging weights yield the parameter estimates at the
unsampled location.
• The values obtained in this way become the coefficients of
the Expansion SAR model at the unsampled location.
• Because the Expansion SAR model is a model for
cross-sectional data, the SAR-Kriging method can be used
to predict quality of education whenever the values of the
independent variables are known.
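The stages above can be sketched in code under one possible reading: the coefficients at the unsampled location are taken as the Kriging-weighted combination of the coefficients estimated at two sample locations, and the prediction then follows the Expansion SAR form. This reading, the function names, and all numbers below are assumptions made for illustration; they are not the authors' MATLAB implementation.

```python
import numpy as np


def two_point_weights(gamma_v1, gamma_v2, gamma_12):
    """Closed-form ordinary Kriging weights for two sample points (gamma(0) = 0)."""
    lam1 = 0.5 + (gamma_v2 - gamma_v1) / (2.0 * gamma_12)
    return lam1, 1.0 - lam1


def sar_kriging_predict(coef_1, coef_2, gamma_v1, gamma_v2, gamma_12, x_new, d_new):
    """Predict y at an unsampled location from two fitted Expansion SAR models.

    coef_i = (alpha, beta, beta0) estimated at sample location i;
    x_new and d_new are the independent variables and the distance from the
    center at the unsampled location.
    """
    lam1, lam2 = two_point_weights(gamma_v1, gamma_v2, gamma_12)
    # Kriged coefficients: weighted combination of the two sets of estimates.
    alpha = lam1 * coef_1[0] + lam2 * coef_2[0]
    beta = lam1 * np.asarray(coef_1[1]) + lam2 * np.asarray(coef_2[1])
    beta0 = lam1 * np.asarray(coef_1[2]) + lam2 * np.asarray(coef_2[2])
    x_new = np.asarray(x_new, dtype=float)
    # Non-spatial part plus distance-scaled spatial part.
    return alpha + x_new @ beta + d_new * (x_new @ beta0)


# Made-up coefficients for two sample provinces and one unsampled location.
coef_a = (26.0, [0.02, 5.0], [0.2, -7.0])
coef_b = (24.0, [0.04, 6.0], [0.1, -8.0])
y_hat = sar_kriging_predict(coef_a, coef_b, gamma_v1=0.4, gamma_v2=0.8,
                            gamma_12=1.0, x_new=[30.0, 0.2], d_new=0.5)
print(round(y_hat, 3))
```

If the unsampled location coincides with the first sample location, the weights become 1 and 0 and the prediction reduces to the model fitted at that location.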
3. The Result of SAR-Kriging
In this paper, we implemented spatial data mining using the
SAR-Kriging method to predict quality of education in 13
provinces in Indonesia, including Aceh Province. In the 2003
national education baseline survey (SDPN 2003), Aceh was not
included as a survey location because the situation there
was very dangerous. The SAR-Kriging method can therefore be
used to predict quality of education there.
For the SAR-Kriging method, the input-process data on
quality of education at the elementary, junior high, and
senior high school levels were taken from two provinces in
Indonesia, namely Banten Province and South Sulawesi
Province.
Figure 3.1 Maps of Provinces in Indonesia
http://zulfadli.files.wordpress.com/2008/01/indonesia-50-provinsi-gif.gif
Following the SAR-Kriging procedure, we have:
(1). The coordinates of the unsampled locations are selected
for 13 provinces around Banten and South Sulawesi.
(2). The parameter estimates of the Expansion SAR model
are obtained through the Kriging technique for the 13
new locations from their coordinates.
(3). The 13 locations lie between Banten and South Sulawesi
Provinces.
(4). Based on the Kriging weights from step 2, the Expansion
SAR prediction model through Kriging is expressed for
quality of education at the 13 unsampled locations for
elementary school.
Figure 3.2 Kriging Weight and Prediction of Quality Education at 13 Provinces
Based on these results, for the 13 locations
between Banten and South Sulawesi we obtain a
prediction model of quality of education for
elementary school through the SAR-Kriging method.
If the values of the input and process variables of
education and the coordinates of each location are
known, then quality of education, measured by the
total UAN score, can be predicted. The prediction
models of quality of education for the 13 locations
between Banten and South Sulawesi are given in the
following table:
Table 3.1 Prediction of Quality Education for Elementary School
in Indonesia using SAR-Kriging
From Table 3.1 we can see that quality of
education in the 13 provinces is influenced by a
non-spatial component with five variables and a
spatial component with five variables, the latter
involving the distance from the observed location
to the center location. If we select Aceh Province
from the locations between Banten and South
Sulawesi, the SDPN 2003 data give the following
Expansion SAR model:
Quality of Education at Aceh
= 25.61 + 0.02 RSTRB + 5.88 RSB - 2.87 RSBR7
- 6.31 RSRB + 1.77 RSGLTG
+ 0.22 d-RSTRB - 7.81 d-RSB - 11.39 d-RSBR7
- 1.53 d-RSRB + 0.57 d-RSGLTG
For the prediction of quality of education at
elementary, junior high, and senior high schools in
13 provinces in Indonesia, the actual values and the
SAR-Kriging predictions compare as follows:
Table 3.2 Comparison of Actual Quality of Education and
SAR-Kriging Prediction at Elementary School
NO  PROVINCE  ACTUAL  PREDICTION   ERROR  APE (%)
 1  DKI        26.85       23.81    3.04    11.32
 2  JABAR      31.73       26.04    5.69    17.93
 3  JATENG     26.15       27.44   -1.29     4.93
 4  DIY        26.47       26.76   -0.29     1.10
 5  JATIM      26.83       28.19   -1.36     5.07
 6  ACEH       25.94       24.27    1.67     6.44
 7  SUMUT      24.22       24.54   -0.32     1.32
 8  SUMBAR     23.13       29.13   -6.00    25.94
 9  SULUT      24.95       25.96   -1.01     4.05
10  SULBAR     25.39       25.48   -0.09     0.35
11  KALBAR     24.09       24.10   -0.01     0.04
12  KALTENG    23.43       26.52   -3.09    13.19
13  KALTIM     23.57       26.68   -3.11    13.19
    MAPE                                     8.07
Table 3.3 Comparison of Actual Quality of Education and
SAR-Kriging Prediction at Junior High School
NO  PROVINCE  ACTUAL  PREDICTION   ERROR  APE (%)
 1  DKI        18.54       16.99    1.55     8.36
 2  JABAR      17.85       16.82    1.03     5.77
 3  JATENG     17.65       18.00   -0.35     1.98
 4  DIY        18.99       17.98    1.01     5.31
 5  JATIM      16.46       16.97   -0.51     3.10
 6  ACEH       14.47       15.23   -0.76     5.25
 7  SUMUT      18.53       15.11    3.42    18.46
 8  SUMBAR     19.20       16.57    2.63    13.69
 9  SULUT      14.13       17.30   -3.17    22.43
10  SULBAR     18.02       17.36    0.66     3.66
11  KALBAR     16.15       16.07    0.08     0.50
12  KALTENG    18.20       16.94    1.26     6.92
13  KALTIM     16.42       16.71   -0.29     1.77
    MAPE                                     7.48
Table 3.4 Comparison of Actual Quality of Education and
SAR-Kriging Prediction at Senior High School
NO  PROVINCE  ACTUAL  PREDICTION   ERROR  APE (%)
 1  DKI        36.74       16.90   19.84    54.00
 2  JABAR      36.30       31.20    5.10    14.04
 3  JATENG     39.54       29.92    9.62    24.33
 4  DIY        40.30       29.25   11.05    27.43
 5  JATIM      45.34       29.55   15.79    34.82
 6  ACEH       17.16       28.93  -11.77    68.61
 7  SUMUT      31.90       38.66   -6.76    21.19
 8  SUMBAR     33.22       35.46   -2.24     6.73
 9  SULUT      45.48       38.54    6.94    15.26
10  SULBAR     20.78       37.17  -16.39    78.87
11  KALBAR     16.58       16.70   -0.12     0.72
12  KALTENG    39.09       37.96    1.13     2.89
13  KALTIM     25.48       33.33   -7.85    30.81
    MAPE                                    29.21
From the three tables above, we can conclude that
the Mean Absolute Percentage Error (MAPE) of the
predicted quality of education at 13 provinces in
Indonesia is less than 10% for elementary school
(8.07%) and junior high school (7.48%), but more
than 10% for senior high school (29.21%). This means
that the SAR-Kriging method gives a good model for
predicting quality of education at unsampled
locations for elementary and junior high schools in
Indonesia.
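The APE and MAPE columns can be reproduced from the actual and predicted values. The short sketch below does this for the elementary school figures of Table 3.2; the function names `ape` and `mape` are chosen for the example, and the error is taken as actual minus prediction, relative to the actual value.

```python
# Actual and predicted values copied from Table 3.2 (elementary school).
actual = [26.85, 31.73, 26.15, 26.47, 26.83, 25.94, 24.22,
          23.13, 24.95, 25.39, 24.09, 23.43, 23.57]
predicted = [23.81, 26.04, 27.44, 26.76, 28.19, 24.27, 24.54,
             29.13, 25.96, 25.48, 24.10, 26.52, 26.68]


def ape(a, p):
    """Absolute percentage error of one prediction, in percent."""
    return abs(a - p) / abs(a) * 100.0


def mape(actual, predicted):
    """Mean of the absolute percentage errors, in percent."""
    errors = [ape(a, p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


print(f"MAPE = {mape(actual, predicted):.2f}%")
```

The same two functions applied to the values in Tables 3.3 and 3.4 give the junior and senior high school figures.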
4. Conclusion
1). The SAR-Kriging model is a spatial data mining tool
that combines the Expansion SAR model and the
Kriging method.
2). Applying the SAR-Kriging model to predict quality
of education at unsampled locations in Indonesia
gave good results for elementary and junior high
schools in 13 provinces located between the two
selected provinces.
References
• Abdullah, A. S. 2009. Spatial Data Mining using SAR-Kriging
Model (Spatial Autoregressive-Kriging) for Mapping Quality
of Education in Indonesia. Unpublished Dissertation.
Yogyakarta: Universitas Gadjah Mada.
• Anselin, L. 1988. Spatial Econometrics: Methods and
Models. London: Kluwer Academic Publishers.
• Armstrong, M. 1998. Basic Linear Geostatistics. New
York: Springer Verlag.
• Balitbang Depdiknas. 2003. Survei Dasar Pendidikan
Nasional Tahun 2003. Jakarta.
• Han, J., and Kamber, M. 2006. Data Mining: Concepts
and Techniques. USA: Academic Press.
• LeSage, J. P. 1999. The Theory and Practice of Spatial
Econometrics. University of Toledo.