Knowledge Discovery from Area-Class Resource Maps: Capturing Prototype … · 2010-12-09 · [Source: Roschand Mervis (1975).] Prototype Theory One of the central concerns of cognitive

Knowledge Discovery from Area-Class ResourceMaps: Capturing Prototype Effects*

Feng Qi, A-Xing Zhu, Tao Pei,Chengzhi Qin, and James E. Burt

ABSTRACT: 111is paper presents a knowledge discovery approach to extracting knowledge from area-classresource maps. Prototype theory forms the basis of the approach which consists of two major components:(1) a scheme for organizing knowledge used in categorizing geographic entities which allows for the model-ing of indeterminate boundaries and non-uniform memberships within categories; and (2) a data miningmethod using the Expectation Maximization (EM) algorithm for extracting such knowledge from area-classmaps. A case study on knowledge discovery from a soil map demonstrates the details of the approach. Thestudy shows that knowledge for classifying geographic entities with indeterminate boundaries is embeddedin area-class maps and can be extracted through data mining: and that continuous spatial variation of geo-g,-aphic entities can be beuer modeled iFthe knowledge discovery process retains knowledge of within-classvariations as well as transitions between classes.

Introduction

The development of geographic knowl-edge discovery arose out of the need tointelligently and automatically trans-

form geographic data into information and syn-thesize geographic knowledge (Yuan et al. 200 I).Research in this area has received continuousattention during the past decade. Moreover,studies have extended the scope of knowledgediscovery from remote sensing data sources totraditional maps (Malerba et al. 2002; Qi andZhu 2003).

Feng ai, Department of Geology and Meteorology, Kean University,1000 Morris Ave., Union, New Jersey 07083. Phone: 908 7373668; Fax: 908 737 3699. E-mail: <[email protected]>. A-Xing Zhu,State Key Laboratory of Resources and Environmental InformationSystem, Institute of Geographical Sciences and Natural ResourcesResearch, Chinese Academy of Sciences, Building 917, DatunRoad, An Wai, Beijing 100101, China and Department of Geography,University of Wisconsin-Madison, 550 N. Park St., Madison,Wisconsin 53706. Phone: 608 262 0272; Fax: 608 265 3991. E-mail:<[email protected]>. TaD Pei, State Key Laboratory of Resourcesand Environmental Information System Institute of GeographicalSciences and Natural Resources Research, Chinese Academy ofSciences, Building 917, Datun Road, An Wai, Beijing 100101, China.E-mail: <[email protected]> . Chengzhi Oin, State Key Laboratoryof Resources and Environmental Information System Institute ofGeographical Sciences and Natural Resources Research, ChineseAcademy of Sciences, Building 917, Datun Road, An Wai, Beijing100101, China. E-mail: <[email protected]>. James E Burt,Department of Geography, University of Wisconsin-Madison, 550N. Park St., Madison, Wisconsin 53706. Phone: 6082634460; Fax:6082653991. E-mail: <[email protected]>.

The map is a powerful medium for presentingspatial information and geographical relationships.Much of our understanding of the relationshipamong spatial phenomena is embedded in maps.Tn natural resource management practice, [orexample, decades of survey practice have gener-ated and amassed large volumes of inventory map~.In particular, maps of natural resources such assoils, wildlife habitats, and potential environmen-tal hazards are usually created through knowl-edge-driven environmental modeling (Skidmore2002). That is, the distributions of these resourcesare inferred from other easily observable envi-ronmental conditions using expert knowledgeof the relationships between the resources andthese relevant environmental conditions (Zhu etal. 2001). For example, wildlife habitat maps canbe modeled using species-environmental relation-hips where the environmental factors include cli-

mate, vegetation, landscape characteristics, andhuman factors. Wildfire assessment maps can bemodeled using three main environmental factors:fuel (biomass type, moisture levels, etc.), weather,and topography (Skidmore 2002). Based on thesemodels, the natural resources are classified andrepresented as area-class maps (Mark and Csillag1989). These maps thus contain rich luunoledge onsuch relationships.

Previous studies (Moran and Bui 2002; Qi andZhu 2003; Qi 2004) have demonstrated the useof inductive learning methods to extract theserelationships from area-class resource maps.The extracted knowledge is valuable in at leasttwo ways. First, it facilitates traclitional, resourceinventory map updates. Inventory maps need to

Cartography and Geographic Information Science, Vol. 35, No.4, 2008, PI)· 223-237

be updated when relevant environmental condi-tion change or b tter reference maps exist. Theresource-environment relationships used to createthe original map often exist as domain experts'tacit, undocumented knowledge. When the domainexpert is not available during the update (which isoften the case with oil mapping where the updatecycle i usually longer than the career span of asoil expert), a new local model needs to be devel-oped from cratch. This would involve tremendousamounts of fieldwork. Therefore, if the knowledgeimplicitly embedded in the original map could beretrieved and repre ented in a quantitative format,it would facilitate the update process. Second, ifthe extracted knowledge could model continuouspatial variations of the natural resource, it mightlead to maps that are more accurate or to improvedrepresentations of the spatial distribution of suchresources through automated knowledge-basedinference.

The area-cia presentation of the resource maps,however, limited previous studie (Moran and Bui2002; Qi and Zhu 2003; Qi 2004) to extract knowl-edge that can only model natural resources in atraditional way, such as generalized and discreteclasses (categories). That i , a particular locationis a ociated with one and only one cia s withfull membership. As a result, spatial entities aredelineated as polygons with cri p boundaries andno overlaps, within which areas share a member-ship of unity to each of the prescribed catego-ries. In reality, though, most geographic entitieshave indeterminate boundaries in both attributepac and geographic pace (Burrough 1996) and

exist more or less as continuums. Fitting suchcontinuums into di crete categories with uniformmemberships limits the portrayal of geographicentitie in term of both spatial detail and attributeaccuracy (Burrough 1996), thus failing to meetthe needs of many environmental managementapplications.

In the area of soil mapping, scientists have begundeveloping new approaches to map continuousspatial variations of oil u ing fuzzy logic (Zhu etal. 1996; De Gruijter et al. 1997; Zhu et al. 2001;hi et al. 2004; Qi et al. 2006). De Guijter and

colleagues mapped soil properties using the fuzzyk-means clustering method with large amountof field points (De Cruijter et al. 1997). Otherstudies (Zhu et al. 1996; Zhu et al. 2001; hi etal. 2004; Qi et al. 2006) followed the traditionalknowledge-driven soil modeling approach, but theyadopted a number of fuzzy inference methods toinfer soil description ba ed on oil-environmentrelationships acquired directly from local soil experts.

For example, Zhu and colleagues (1996, p. 200 I)worked with local oils experts to acquire knowl-edge on membership gradations within soil classesthrough interactive graphic interface. In a laterstudy, Shi and colleagues (2004) u d case-basedreasoning where soi I expert provided the typicalcase of oil las e . Qi and colleagues (2006) lateraddre ed ome of the limitations of the previousmethods and designed a new knowledge acquisi-tion methods based on prototype theory.

The common ground of these methods is thatthey all acquire knowledge directly from localexpert, either through on-site interview tech-nique or through expert-controlled computerinput with graphic interfac . Knowl dge acqui-sition directly from domain expert, however, hasalway b en considered as a bottleneck [or thedevelopment of knowledge-based approachesand sy terns (Molokova 1993). In an area whereno experienced soil expert i available, this typeof knowledge acquisition would be difficult. Analternative way to overcoming thi difficulty is toacquir knowl dge that ha already been expressedby experts but exists in other carrier forms. Thispaper expands on previou work in knowledgediscovery from resource maps (Moran and Bui2002; Qi and Zhu 2003; Qi 2004) by exploringthese forms and focusing on the extra tion ofmembership gradations within clas es on them.We present an approach to extracting the knowl-edge for classifying natural re ource in a waythat explicitly reflects the indeterminate natureof class boundaries and member hip gradationswithin classes. The belief is that such knowledge iembedded in the area-class map presentation andcan be extracted by taking into consideration howhuman cognitively use geographi knowledge incategorizing th ontinuou geographic entitiesand creating the maps.

Our approach con i t of two major components:I) a scheme for organizing knowledge u ed in cat-egorizing geographic entitie ,which allow for themodeling of indeterminate boundaries and non-uniform membership within cat gorie ; and (2) adata mining method for extracting uch knowledgefrom map data. The cheme [or organizing knowl-edge i ba ed on prototype theory (Ro ch 1973;Lakoff 1987), a well endor ed cognitive theory ofhuman categorization. The data-mining methodis based on an empirical measure of the typicalityof category member that were fir t introducedby psychologist Ro hand Mervi (1975) withinthe context of protoLype theor .

224 Cartography and Geographic Informatioll Science

Household Item Listed Features Familv Resemblance Measure

Chair F.151' F,(4) F,(3) F (21 14

Sofa F (5) U41 FJ3) F.(2) 14

Cushion F (5) F (4) F (1) F.l21 12

Rua F.(5) U3) F (2) F (1) 11

Vase F (5) U2) F,(1) F ,(1) 9

Teleohone F (41 F.(2) F .(1) F .(1) 8a. Numbers in parentheses indicate how often each feature occurs in set of instances.

Table 1. Experimental Results of the Family Resemblance analysis. [Source: Rosch and Mervis (1975).]

Prototype TheoryOne of the central concerns of cognitive psychol-ogy is the process of categorization (and subse-quent classification) by which humans organizeand represent their knowledge of the world(Rosch 1978; Tversky and Hemenway 1984).Classical category theory views the world asbeing comprised of natural partitions, and thepurpose of categorization is to allocate objectsinto the appropriate partitions (Hahn andRarnscar 2001). Under this view, all instancesof a category share common features that aresingly necessary and jointly sufficient for defin-ing the category (Smith and Medin 1981). Thus,an instance that possesses all of the def ningfeatures belongs to the category with full mem-bership, while an instance that lacks any of thdefining features must be excluded from the cat-egory completely.

This classical view of categories received intensecriticism in the 1970s as new theories started toemerge (Smith and Medin 1981). Of the newlydeveloped theories of categorization, prototypetheory (Rosch 1973; 1978) is the most widelyendorsed. Prototype theory emphasizes the factthat category membership is not homogenousand that some members are better representa-tives of a category than others-observationsthat are noted as the "prototype effects" (Lakoff1987). To quantify prototype effects, Rosch andMervis (J 975) experimented with natural concepts,developing a measure of "family resemblance" asthe determinant of an item's typicality. In theirexperiments, subjects were asked to list features ofvarious subsets of a superordinate concept, such asfurniture, For which the subsets varied in typicality(chair is typical, lamp atypical). Table I illustratetheir analysis on the distribution of listed features.Each feature listed for a subset is weighted by thetotal number of subsets that it is listed for; then,for each subset, the weights of all of its feature

are summed, yielding a family resemblance mea-sure. In short, an item has a greater degree offamily resemblance ifit contains features shared bymany other members of that same concept group.Rosch and Mervis's study shows that these familyresemblance measures are very highly correlatedwith typicality ratings of the subsets determinedby other means, and it allows for the quantifica-tion of prototype effects.

In terms of knowledge representation, prototypetheory assumes that categories are mentally rep-resented by prototypes that summarize a person'sprevious exemplar experiences (Minda and Smith2001). The categorization of possible membersis based on how similar they are to the prototype(Hampton 1995). The prototype of' a category ia composite or average of all the real instanceexperienced which 1) reflects some measure ofcentral tendency of the instances' properties; 2)consequently, is more similar to some categorymembers than others; and 3) is itself realizable as,but may not necessarily be, an instance (Smith andMedin 1981). Compared to the classical view ofcategories, the prototype is a summary representa-tion of a category in terms of features that may beonly probable to the members of the category.

Implications for Geographic KnowledgeDiscoveryDue to the complexity and continuity of mostgeographicfeaturesand phenomena, geographicclassification is a complicated abstraction pro-cess. In an ideal situation, an indefinite numberof categories should be used to approximateone-to-one correspondences for the modeling ofspatial variations in full detail. However, we typi-cally categorize in such a way that few conceptscapture rich situations, and the categories are,as a result, the most informative (MacEachren1995). In this case, we often discretize the contin-uous geographic features to a limited number of

Vol.35, No.4 225

classes (categoric ). This leads to an irresolvableindefinite correspondence between the .onceptsand the represented world. Because this discreti-zation results in inherent gradation of member-hips to categorie , it generates prototype effect

of the geographic categories (Lakoff 1987).We traditionally partition our geographic environ-

ment into variou kinds of categories that exhibitsuch prototype effects. For example, in the caseof landform recognition, the natural landscapeis commonly seen as hills and valley , OJ~ on alarger scale, ummits, houlders, back slopes, footslopes, and drainage way. Prototype eff cts arerefle ted in that a particular foot slope locationmay be perceived to be a better exam pIe of a footlope than some other location. Moreover, in

the case of oil cia ification, the continuou soilbody is categorized into soil classes. Prototypeeffects are shown when soil scienti ts claim that acertain pedon is more representative of one soilcia than another, although both are classifiedin the same way.

Our traditional products of the classificationpractice, however, are rooted in the clas ical viewof categories. The area-class representation of soilclasses, for example, does not explicitly model theprototypical characteristic of the soil categoriesthat exist in soil cientist ' mental representationsand i an over-generalization of soil variationsin both the spatial and attribute domains (Zhu1997). Previous studie on knowledge discoveryfrom such maps overlooked the explicit prototypi-cal characteri tics and extracted knowledge thatfails to capture the indeterminate boundaries ofthe oil classes. To overcome this deficiency, wedeveloped a knowledge discovery approach thatexplicitly consider the characteristics of geographiccategorie delineated by prototype theory,

Previous work by geographers has examined theconnection of prototype theory and geographiccategorization from everal p I' pectives. For example,aspects of map reading and human knowledgeconstruction have been studied (Lloyd 1994; Lloydand Carbone 1995). Building GIS data model thatrefl ct prototype ffects (U ery 1993) and model-ing spatial relations between ge graphic entities(Mark and Egenhofer 1994) have also been topicsof interest, as ba comparing spatial categories inthe context of GIS interoperability (Rodri ' guezand Eg nhofer 2004; Feng and Flewelling 2003;Ahlqvisl2005). Few, however, have been reported nthe application of prototype theory in geographicknowledge discovery. This research bring a tartin this direction by introducing an approach torepresenting knowl dge for c1as ifying geographic

ntities with indeterminate boundaries and extract-ing such knowledge from an area-cia s map.

For knowledge repre entation chemes to becapable of retaining tbe prototype effects of geo-graphic c1as e , they should take into con iderationthe way that human experts organize their knowl-edge mentally. For individual domain expert, thedevelopment of their knowledge about the catego-rie of their specialty is a proce s of con tructingthe categories' prototypes based on accumulatedin tanc xperiences (MacEachren 1995; Mindaand Smith 200 I). Thi gives insight to knowledgerepresentation in terms of prototypes. Classificationis then performed tbrougb the cornpari on ofnew instances to such established prototypes, andmemb r bip of an in tance is determined ba edon its similarity to the prototype.

Development of a knowledge di covery algorithmcapable of deriving prototype effects from classifiedexample is the next step. implied by prototypetheory, prototyp are defined by modal featurebundles; and the "family re ernblance" mea ured veloped by Ro ch and Mervis (1975) is a directestimate oftbe typicality of member . While Ro cband Mervi (1975) experimented with only binaryfeature (absence or presence of an attribute), thenotion of a family resemblance score can be ea ilyadapted for multi-valued and numerical features.The value of a feature that leads to the highe t par-tial family resemblance is the modal value amongall category members. In other words, when weobtain the frequency di tribution of the value ofa feature for all in tance of the ame category,the prototype of thi category re ide at the pointwith the highest frequency, and the shape of thefrequency di tribution curve indicate how mem-bership changes when an in tance departs fromthe prototype.

Geographic KnowledgeDiscovery Based on Prototype

Theory

Knowledge Repre entationAccording to Markman (1999), different repre-entation model are best for different cogni-

tive ta k . For example, long-term memory usesemantic nets, while categorization tasks u e

featural model. In a featural model, a categoryis represented by a composite set of features(properties). Ba ed on prototype theory, suchfeature hould summarize the real instances

226 Canography and Geographic l nfonnation Science

Frame ID - f-+ forest Type i\Instance: /

Elevation: <600ft

Aspect: North

Instance: 2Elevation: 800-1200ji

Aspect: SOllf)'+

Instance ill

Frame Slot

~Iot ID

Figure 1. Frame representation of a forest class.

of the category which serve as the cognitive ref-erence points for inference (Minda and Smith2001).

A widely adopted knowledge representation ofcategories using the featural model organizes theset of features of a category into frames (Minsky1975; Fillmore 1985). Figure 1 shows an exampleof using a frame to represent the distribution ofa forest resource category with its environmen-tal features. In such a representation, a frame iscomposed of a number of instances that definethe typical featural configurations of the category.Each instance consists of a list of slots to describethe values of the features. The slots and fillerstogether then define the forest category in termof its geographical location. 10 incorporate theprototypical properties of geographic classes, themost straightforward way is to explicitly model theprototypes and membership gradations in a framerepresentation. This means storing the prototypesof classes using a common frame structure, whilethe membership gradations are represented withoptimality functions (Zhu 1999).

Figure 2 shows an example of such an extendedframe representation for soil classes. The slotsand fillers of the frame define the prototype(s) ofa class; each slot also has a link to an optimalityfunction that describes how membership respondswhen the value of the feature changes. Specifically,if the value of a feature corresponds to an opti-mality value of 1, the possession of such a featurewill most probably lead to the highest member-ship in the class under review. On the other hand,an optimality of 0 means that the correspondingfeature value does not favor the class at all. Inother words, the optimality value directly modelsthe partial family resemblance (normalized to [0,1]) of an individual feature. For soil class vauon(Figure 1), we see that the optimality value drops

immediately from 1 to 0 when the bed-rock changes from Oneota to any othertype, meaning that an instance that ifound on any bedrock type other thanOneota will lead to a 0 membership ofthe instance to valion, Class member-hip to flalton also depends on the slope

and curvature features: steeper slopelimit the development of valion, andlinear curvatures provide an optimisticcondition for the soil.

In our extended frame representation,inter-frame links represent spatial rela-tionships between geographic categories.Figure 2 illustrates the representation offour soil classes. It shows explicit spa-tial relationships among the four soil

classes: directly downslope from flalton one mayfind Lamoille; further below there are Elbauille andDorerton, Knowledge represented in such a waycan be obtained through traditional knowledgeengineering. Domain experts can define prototypeeither in the form of descriptive knowledge (TypeI knowledge in Zhu 1999) or in the form of typi-cal cases (Shi et al. 2004). These experts can alsodefine optimality functions (Type II knowledge inZhu 1999) or model them by using a 1Jri071 heu-ristic curves (Qi et al. 2006). This paper presentsan alternative approach to obtain knowledge onprototypes and membership gradations of geo-graphic classes through knowledge discovery fromclassified examples. Such an approach is especiallyattractive when an experienced domain expert isnot available to provide the knowledge.

Classified examples exist in two formats: fieldsamples and existing inventory maps. For theformer, data mining algorithms can be directlyapplied to derive class characteristics. While inmost situations the cost associated with massivefield sampling prohibits the collection of a largenumber offield examples, existing inventory [napsare an acceptable source of classified examples:each location enclosed within a polygon is anexample of the class indicated by the polygonlabel, although the memberships of the exampleslabeled as the same class may vary.

This paper illustrates the knowledge discoveryapproach with existing inventory maps, using soilmaps as examples. Such maps are created by soilexperts through predictive soil mapping: the distri-bution of soil classes is inferred from the distribu-tion of relevant environmental variables. The basicidea of knowledge discovery from these maps is areverse engineering of the mapping process: byassociating the map with relevant environmental

Vol.35, No.4 227

Optimali

oil type: Vallof!

OL-__~ __~ -.Jordan Oneora Glauconite Bedrock

Instance: 1

Bedrock: Oneora, Valtou Bedrock.opr

Slope: Dot steep, Vaton_ lope. opt

Curvature: linear, Valron_Curvarure.opr

Optimali •

OL- ~~ ~-.0% 12% 30"1. lope

f

Ii

Down slope I Up slope. ,iII!iIi!i. .

Optimaliry

O~ __ ~ ~~ +-0.1 -0.001 o 0.001 0.1 Curvature

oil type: Lamoille

Instance: I

... Bedrock: Oneota, Larnoille jBedrock.opr

• lope: moderarely steep, Lamoille_Slope.opt

Planform: linear, LarnoillcPlan.oj r

Profile: convex, Lamoillc_Prof.opt

I~ t!f--;:::=:~~~==::!=:::~:=-=::i_..Jl I Up slope

I

l..~::=:==~===~==:~~:::=:::'lUpslope II

Soil rype: Elba/lille

Instance: I

Bedrock: Oncota, Dorcrton_Bcdrock.opt

Elevation: low

lope: stecp, Dorcrton_ lope.opt

Planform: linear, Dorertonj Plan.opt

Profile: convex, Dorcrton_Prof.opt

Instance: 1

Bedrock: Oncora, DorerronBcdrock.opr

Eleva cion: low

lope: steep, Dorcrron , lope.opt

Planform: convex, Dorerron Plan.opt

Profile: convex, Dorcrton_Prof.opt

Figure 2. Frame representation of the knowledge about the distribution of four soil classes.

data layers, the knowledge for cla sif)ring the oilcla e will be revealed.

Data MiningThe family resemblance mea ure developed byRosch and Mervis (1975) quantifie an instan e's

typicality based on a sum of feature that aninstance po e e, then the re ult is weighted byhow many other category members al 0 possessthose feature. An item is a typical member of aconcept if it contains features shared by manyother members of the ame concept. Applyingthis to continuously valued features, a set of

228 Cartography and Geographic Information cience

modal values of the defining features will definethe class prototype. Frequency distributioncurves of these features then model the mem-bership gradations. The data-mining algorithmfor extracting knowledge from natural resourcemaps is directly based on this notion of familyresemblance.

In deriving the prototypes and optirnality func-tions of the mapped classes, histograms are firstconstructed to obtain the frequency distribution ofthe feature values of all existing examples belong-ing to each individual class. The modal valuesdefine the prototype(s), and the optimality func-tions are obtained by fitting a curve to each of thedata histograms-under the assumption that theexamples are representative of the population ofthe classes.

A smooth curve can be obtained by fitting ahistogram in a number of ways. For example, aspline curve can be fitted to pass through the datapoints. One concern with using a spline is that itsshape completely depends on the shape of thedata histogram; thus, it is sensitive to the numberof intervals used in plotting a histogram. A secondway is to fit a priori heuristic curves using the datathat generate the histogram. For example, in thefields of natural resource mapping and land usemodeling, Gaussian curves are of Len employed asfuzzy membership functions (Burrough eLal, 1992).A Gaussian curve can be easily fitted using methodssuch as maximum likelihood. However, it is oftenthe case that the histogram may nOLexactly followthe curve because the data do not follow a normaldistribution. Figure 3 shows a histogram plottedusing the data ofa soil class. The histogram of soilclass Dorerton based on feature profile curvature,for example, is clearly unsymmetrical, and theGaussian curve fitted with maximum likelihoodfails to capture the shape well (Figure 3a).

1'0 obtain curves that better approximate theshapes of data histograms, kernel functions canbe used. Figure 3b shows a curve fitted to theunderlying histogram using Gaussian kernel func-tions for individual data points. A kernel function,however, does not give a holistic representation ofthe fitted curve with a single formula, thus clutterthe extracted knowledge with stored kernel points.To obtain a holistic but realistic representationof the fitted CUI-ve,each histogram can be fittedusing the composite of two Gaussian curves. Thecombination of two curves shows a better approxi-mation of the unsymmetrical shape of a histogramthan a single curve fitted using the maximumlikelihood method (Figure 3c). Furthermore, thecase study in the next section wi 11also show that

the decomposition of the curve into two separatecurves enables bimodal situations to be accountedfor and improves the accuracy of the knowledgediscovery.

When a single Gaussian curve is expressed withthe following formula:

1 _(x_~)2

2) - e 20D - N(~,a - aJ21t (I)

where D follows the Gaussian (Normal) distribu-tion, with 11 being the mean and a the variance.Our combined model is:

D- P'N(~"a,2)+(1-p)'N(~2,ai) (2)

where p is the proportion of the two distributionswhen combining the two. The LwOcurves arefitted using the Expectation Maximization (EM)method (Dempster et al. 1977). As an iterativeoptimization method, the EM algorithm isknown to best deal with problems of estimatingparameters in a distribution function wheninformation on the observed data is incomplete.In the case of fitting two Gaussian distributionto describe the feature values of a group ofexamples, the incomplete information is theGaussian distribution to which an individualexample belongs. Set 0; E [0,1] for each datapoint, where 0; = 1 if the ith data belongs toN(Il,,<J,2) and 0; =0 ifit belongs to the otherdistribution. Thus, each data point has a featurevalue and an unknown o.. The EM algorithmis then used to estimate the parameters ofp,Il"a, ,1l2,<J2 with the unknown 0; .

With each iteration, we compute an expectationof the unknown 0; based on the current settingof parameters with the following formula (theexpectation step):

(I) N( (I) ( 2 )(1))E(o(I+')) = P~, ' a,

; p(l) N(~?), (a ,2)(1)) + (1- p )N(Il;I), (a it»)(3)

where t refers to the number of iterationso far. During the maximization step we

estimate the parameters using maximum likeli-hood, given the current value of 0;, thu

N

" x.8(t)~ II

'1(1) =...!..;-=.!, _l"'" N

LO?);=, (4)

Vol.35, No.4 229

120

C=:J HistogramVl 100Qix'0.'0(jj..cE:::J.s.>,ocCI):::J0-

~u,

-- Fitted Gaussian Function80

60

LJLJLJULJLJLJ~O~O~~~~~uuuuuuuuuuuuuu

40

20

o-0.025 -0.019 -0.014 -0.008 -0.003 0.003 0.008

Profile Curvature

(a)

120C=:J Histogram

100-ijl -- Fitted kernel functions.~ 80'0(jj 60..cE:::J.s. 40

~~~~D~D~~~uuuuuuuuuuu~

>,ocg; 200-~u, o

-0.025 -0.019 -0.014 -0.008 -0.003 0.003 0.008Profile Curvature

(b)

120 c=::J Histogram

100Vl -- Combined normal distributionsQi.~ 80

(jj 60..cE:::J.s. 40~cg; 200-CI)U: o

-0.025 0.003 0.008Profile Curvature

(c)

Figure 3. Fitting a histogram to an optimality function using: a) Gaussian curve; b) kernel functions; and c) combinationof two Gaussian curves.

230 Cartography and Geographic Information Science

N N

I.[(1-0?»)(Xi - I. (l-O?»)Xi I I. (1-0/1})]2(0 2 )(1) = i=1 i=1 i=1

2 NI.(1-0/1))i=1

.V

I.Xi(l-O/I»)11(1)_...!:i-=.cl _/""2 - N

I.(1-0?»)i=1

N N NI.[0/1)(Xi - I.0/I)Xi 1I.0/1})]2(012)(/) = 1=1 i~1 i=1

I.0/1)i=1

NI.0(/)

p'" = i=1 1

where N is the number of data, and c-t) rep-resents the tth step in the iterative process. Thecalculation ceases when values of the parametersin the M-step stabilize. Once the parameters areestimated using the EM algorithm, the final curveis then represented in the form of Equation (2)and stored in the knowledge base.

Figure 4. Soil series map of the Raffelson watershed.

Case Study

(5)

To better illustrate the knowledge discovermethod, we present here a case study on knowl-edge discovery from a soil map. In many coun-tries of the world, the spatial distribution of soilsis routinely coUected, presented, and archivedduring soil surveys. The products are soil maps ofdifferent scales. Previous research indicates thatvaluable knowledge is embedded in archived soilmaps and that such knowledge can be revealed

through knowledge discovery (Moranand Bui 2002; Qi and Zhu 2003). Morepecifically, the maps may yield knowl-

edge about the environmental condi-tions under which each of the mappedoil class developed. Environmental

conditions are typically described usinguch variables as bedrock, elevation, and

lope, and these variables are the very fea-tures that soil experts used to identify soil c1assein soil mapping.

Our study area is the Raffelson watershed inouthwestern Wisconsin. A recent soil survey pro-

duced a soil series map that shows 16 different soileries (classes) in the area. Figure 4 shows the soilseries map of the watershed. We used this soil seriesmap to extract the knowledge required to classifyoils at the watershed. A uniquely constructed GI

database captured the local soil-formative envi-ronment. The database contains five commonlyused soil-formative environmental variables at

Vol. 35, No.4 231

(6)

(7)

(8)

Vallon

t amo: I

Church town

Boon

Hixton

[Iovasll

Gaphil

Rockbluff

Klckapoo

Or ron

Urno

the watershed scale and five spatialvariables. The five primitive variablesare elevation, slope gradi nt, plan-form curvature, profile curvature, andgeology. Three of the spatial variablescapture the spatial relations of theoil-formative environmental factor :

di tance to treams, topographic wet-ness index, and percentage of col-luvium from competing bedrock.The remaining two spatial variableapture topological and directional

relations between oil classes-theupslope and downslope neighborsof each mapped location.

Pixels in the study area were groupedacc rding to the different soil seriesas assigned on the oil map. We thenconstructed histograms for the values of all pixelsin a particular clas for every individual feature.The frequency hi togram can be either unimodalor bimodal. Data pr proce ing detected bimodalcase and separated the double prototype throughvisualization. Human experts examined the his-tograms and identified the soil-feature pair withsignificant bimodal histogram. When a bimodalcase was detected, the two mode were regardeda two prototypes, and the two Gau· ian functionsderived with the EM method were viewed indepen-dently, a were their as ociated optimality function.Figure 5 hows an example of a bimodal case.

Evaluation

Frequency

0.8

0.6

0.4

0.2

o0.1 0.15 0.2 0.25 0.3

Slope Gradient

Figure 5. Histogram Curve for Soil Series LamoilleGradient.

Extracted knowledge, especially knowledge inthe form of optimality function, i diffi ult tovalidate directly. To evaluate this knowledge di -covery method, extracted knowledge was usedfor soil inference so that the inference re ultcould be evaluated using field samples. Qi and

based on Slope

colleagues (2006) used prototype-based infer-ence to infer spatial distribution of oils andtheir properties from knowledge provided bysoil experts. With prototype-based inference, thedegree of imilarity to the cla s prototypes forany pixel in the mapping area determine theirrespective degrees of class membership to thesoil series. Every pixel is then as ociated witha set of member hip values to all 16 soil series.Soil classification in the traditional sense can beea ily conducted by a signing each location thesoil cla s that has the highest membership valueamong all classes. Figure 6 illustrates the cla -ification proce s.The inferred membership values were evaluated

in an indirect way as no experienced soil expert wasavailable to rate the samples using the asse mentmethod described by Gopal et al. (2001). In ourstudy, the continuou variation of soil was repre-ented by continuou soil property map derived

From the e member hip value. Continuou soilproperty maps of surface oil te ture in term

Similarity to Soil type A = 0.7D Similarity to Soil type B = 0.1

Similaritv to Soil tvoe C = 0.1

•

---.0 oil type A

•

Mapping area

Similarity to Soil type A = 0.2

Similarity to Soil type B = 0.6

S imilaritv to Soil tvoe C = 0.0

oil type B

Figure 6. Classifying soil pixels based on membership values.

232 Cartography and Geographic Information Science

Figure 7. Membership map of soil Elbaville.

of percentage of sand and silt in the A horizonwere generated in this case study following theapproach used in Zhu et al. (1997) and Qi et al.(2006). For the sake of comparison, soil texturemaps were also derived from the original mapby assigning each pixel the typical texture valuesof the labeled soil series, based on official soilurvey records.To evaluate the inference results, we collected

data from 99 field points in the Raffelson water-shed; of these, all were classified and assignedoil series names by the USDA-NRCS local office

and 49 were given a texture analysis to determinethe percentages of sand and silt in the A horizon.The inference results were then evaluated in twoaspects. First, the soil series map was compared tothe original map (from which the knowledge wasextracted) to assess classification accuracy. Second,the property maps derived from the original soilmap and from knowledge discovery were com-pared to examine the estimation accuracies ofboth sets of property maps. Three indices werecomputed to evaluate the performances of allproperty maps against the field samples: meanabsolute error (MAE), root mean squared error(RMSE), and agreement coefficient (AG). The ACindex is defined by Willmott (1984) as:

AC = 1- n·RMSE2

PE

M emb ershi p 0

M emb ershi p 1

where rI is the number of observations and PE thepotential error variance defined as:

1/

PE = L/I P; -0 I + 10; -01)2j~ (l~

given that ° is the observed mean, and P and 0, ,are the estimated and observed value, respectively.AC values vary between 0 and 1, with I indicatingperfect agreement and 0 means complete disagree-ment between the estimated and observed value(Willmott 1984).

Results and DiscussionA set of optimality functions was fitted for the16 soil series in the watershed using the methoddetailed above; then the set was used for soilinference. The inferred membership values toa soil series Elbaville of all pixels in the area areillustrated in Figure 7 The lighter pixels arethose with higher membership values than thoseof the darker ones. White zones are usually thetypical positions at which to expect a particularsoil series; typicality gradually fades out to blackzones where memberships to the soil series arezero.

A traditional soil series map was created by assign-ing each location the soil class that has the highestmembership value among all classes (Figure 8).

Vol. 35, No.4 233

(9)

Valton

Larnoill

I:lbavlll

Dororton

Churchlown

Groonodg

Norden

Council

Boon

Hixton

Elevasll

Gaph I

Roc bluff

Klc apoo

Orion

Urne

omparing Figure 8 to Figure 4, \ e note that thetwo map are largely identical. 10st oil erieappear to be on similar terrain po ition on bothmaps, with a few howing lightl diff rent pa-tial extents (soil erie Orion, for example). It wafound that the original map correctly classifiedthe oil erie at 83 out of th 99 ite, while theinferred soil ri map named 80 ite correctly.The inferred map mi -cla ified only thr e itthat were correctly mapped by the original map.Thi indicate that the extracted knowled Fe i ableto capture the spatial di tribution of oil erieover the mapped area to a con iderable degreeof accuracy.

To validate the extracted knowledge in it abilityto capture within-cia s variation and tran itionsbetween cia s prototypes, horizon textur proper-tie were derived using the memb r hip value.Figure 9 and 10 how the re ultant oil texturemaps injuxtapo ition \ ith the corre ponding tex-ture map, based on the original oil map. Themap derived from the inference re ult and tho eba ed on th original soil map show comparablespatial letail . On noticeable difference betw enth two et of maps i the continuity of oil tex-ture variations. The inferred te tur map tend toillu trate more continuou hange of the texturevalues than those ba ed on the original map. Withthe inferred map, abrupt changes of texture onlyoccur when parent material chang .

Table 2 list the computed tati tics for fieldevaluation of th two sets of property map . The

MAE and RM E tati tics for the inf rence re ulisare 10\ er than tho e for the original 'oil map.This, together with th higher AC for the infer-ence re ult, implie better performance by thinferred property map in term of e timatingcontinuous soil propertie . Thi could be attributedto their ability to capture the tran ition betwe noil prototypes, e pecially when the inferred map'

ac ura in prediciin oil serie name i evenlower than that of the original map.

ConclusionsThe outcome of this case tudy how thatthe knowledge n ed d for classifying natu-ral r ource (uch as oil) \ ith indet rminateboundaries i emb dded in exi ting area-clasmap and can be extracted u ing an appropriatedata mining metho I. The extra ted knowledge.when u d to infer soil erie types, wa able toachieve a level of accuracy comparable to thatof the original map. Moreover, the extractedknowledge (particularly the prototype effectlemeru of the knowledge) can be used to infer

cia s member hip to different 'oil eries. Theoil property map generated with the e mem-

bership value are more continuous and morea curate than prop rt map derived from theoriginal map, even though the accuracy of theinferred oil eries type i slightly Ie than that

Cartography and Geographic Information Science234

Figure 9. Left: A horizon sand percentage derived from the original raster map using field property values. Right: Ahorizon sand percentage inferred from data mining results.

(Plo

90%1-- 1Figure 10. Left: A horizon silt percentage derived from the original raster map using field property values. Right: Ahorizon silt percentage inferred from data mining results.

Percentaoe of sand Percentaoe of silt

MAE RMSE AC MAE RMSE ACInference Result 9.69 14.47 0.81 8.17 12.14 0.82Oriainal soil man 10.66 16.63 0.67 9.51 14.31 0.67

Table 2. Accuracy of the derived A horizon texture in the Raffelson watershed: the inference result vs. the original map.

of the source map. These results indicate thatknowledge of prototype effects is important forclassifying geographic entities with indetermi-nate boundaries.

ACKNOWLEDGMENTS

The research reported in this paper was a resultof an international collaboration between theInstitute of Geographic Sciences and NaturalResources Research, Chinese Academy of Sciences,Department of Geology and Meteorology, Kean

niversity, and Department of Geography,

niversity of Wisconsin-Madison. The supportthrough the following programs are greatlyappreciated: "National Basic Research Program ofChina" (2007CB407207, Ministry of Science andTechnology of China), International PartnershipProject "Human Activities and Ecosystem Change"(C>"''TD-Z2005-1, Chinese Academy of Sciences), the"Hundred Talents" Program of Chinese Academyof Sciences and Natural Resources ConservationServices, United States Department of Agriculture.We also like to thank the two anonymous reviewersand the editor for their constructive comments thathave greatly improved the quality of this paper.

Vol. 35, No.4 235

REFERE E

Ahlqvist, O. 2005. U ing uncertain con eptualspace to tran lat b tween land cov r categories.International journal oj Geographical Information

cience 19(7): 31--7.Burrough P.. 1996. atural object with

indeterminate boundarie . In: P. . Burrough and .. Frank (ed ), Geographic Objects with Indeterminate

Boundaries. London, .K.: Franci and Ta lor.Burrough, l.A., R. . McMillan, and \ . Van D ur n.

1992. Fuzzy cia ification method for determiningland uitability from oil profile ob ervation andtopography. European journal oj Soil cience 43: 193-210.

D Cruijter; JJ., DJJ. Walvoort, and P.F.M. van Cam .1997. Continuou oil maps-a fuzzy set approach tobridge the ap between aggregation levels of procesand di tribution model. Ceoderma 77: 169-95.

Demp ter; A. . Laird, and D. Rubin. 1977. Maximumlikelihood from incomplete data via the E 1algorithm.

[oumal cfthe Royal tatisdcal ociet»: nus B 39: 1-3 .Feng, C.C., and D.M. Flewelling. 2003. e merit

of emanti imilaritv b tween land u efland covercia ification t ms. Computers, Eninronment andUrban ystems 2 : 229-46.

Fillmore, CJ. 19 5. Frames and ih emaruic ofunder tanding. Quaderni di emantica 6(2): 222-254.

Gopal, ., \V. Liu, and .Woodcock. 2001. Visualizationbased on fuzzy ARTMAP neural network for miningremotely sensed data. In: H. J. Miller and J. Han(ed ), Geographic Data Mining and Knowledge Discovery.Iew York, ew York: Ta lor & Franci .

Hahn, ., and 1. Ram car. 2001. imilm"ity andCategorization. ew York, ew York: Oxford

niver it)' PI-e s.Hampton, J. . 1995. Te ting the prototype theory of

con ept .journal oJMemory and Language 34: 6 6-70 .Lakoff, C. 19 7. Women. fire, and dangerous things: What

categories reveal about the mind. hicago, Illinois:niver ity of Chicago Pres.

Lloyd. R. 1994. Learning patial prototypes. Annals ojthe A sociation oj American Geographer. Association ojAmerican Geographers 4: 41 -40.

Lloyd, R., and C. arbone. 1995. Comparing humanand neural network learning of climate categoric .The Profe sional Geographer 47(3): 237-50.

MacEachren, A.M. 1995. How map work, New York,ew YQI-k:Guilford Pre s.

Malerba, D., .L. E po ito, and F.A. Lisi. 2001.Machine learning for information extraction fromtopographic maps. In: H.]. Miller and]. Han (ed ),Geographic Data Mining and Knowledge Discovery.

ew York: Taylor & Franci . pp. 291-314.1\1ark,D.M.,andF.C illag.19 9.Th nature ofboundarie

on 'area-da s' map. Canographica. 26: 65-7 .Mark, D.M., and M.J. Egenhofer. 1994. Modeling

patial relations between line and region:Combining formal mathematical model and

human ubject testing. Cartography and GeographicInformation cience. 21(4): 195-212.

Markman, .B. 1999. Knowledge repre entation. Mahwah,w Jersey: Lawrence Erlbaurn Associate .

Minda, .J.P., and J.D. mith. 2001. Prototype IIIcat gory learning: The effect of categolY ize,category tructure, and timulu complexity.journaloj Experimental P. ychololfY. Learning, Memory, andCognition 27(3): 775-99.

Min ky, M. 197". framework for repre entingknowledge. In: P. H. Win ton (ed), The PsychololfY ojComputer l'isioll. Iew York: M Craw-Hill.

Molokova, O .. 1993. M thodology for the a qui itionof knowledge for expert ystem: pan I. Basicconcept and definition. International journal ojCompuler and y tem cience. 3 I: 1-5.

Moran, C.J., and E. . Bui. 2002. Spatial data miningfor nhanced soil map modeling. international

jOllmaloJGeogmphicallnJor'll!ation cience 16(6): 533-49.

Qi, F., and A.X. Zhu. 2003. Knowledge di covery fromsoil maps u ing inductive learning. International

journal of Ceographical Information Science 17(8): 771-95.

Qi, F. 2004. Knowl dge di overy from area-clasre our e map: Data preproces lIlg for noisereduction. Transactions in Geographic Informationcience 8(3): 297-30 .

Qi, F., .X. Zhu, M. Harrower, and JE. Burt. 2006.Fuzz oil mapping ba ed on prototype categorytheory, Geoderma 136: 774- 7.

Ro ch, E.H. 1973. Natural ategories. CognitivePsychololfY4(3): 328-50.

Ro ch, E.H. 197 . Principle of categorization. In:E. H. Ro ch and B. B. Lloyd (ed ), Cognition andCategorization, Hill dale, ew J rsey: LawrenceErlbaum ociates.

Ro ch, E.H., and C. Mervi . 1975. Family re emblance :tudie in the internal tructure of categories.

Cognitive Psycholog)17: 573-605.Shi, X., . . Zhu, J.E. BUrl, D. irnon on, and F. Qi.

2004. case-ba ed rea oning approach to fuzzy soilmapping. oil cience ociety oj America journal 68:885-94.

kidrnore, A. 2002. Environmental Modelling with GIand Remote en ing. London, .K.: Taylor & Ranci .

rnith, E. E., and D.L. Medin. 1981. Categories andconcepts. Cambridge, 1\1a achu en: HarvardUniversity Pre s.

Tver ky, B., and L. Hemenway. 1984. Obj ct , part,and categorie . journal oj Experimental Psychology.Genemlll3: 169-93.

U ery, E. L. 1993. Category theory and the structureof features in CI . Cartography and GeographicinJormation cience 20(1): 5-12.

Willmott, .J.) 9 4. On the evaluation of modelperformance in ph)' ical geograph . In: C. L. Caileand C.]. Willmott (eds), paiia! statistics and model.Dordrecht, Bo ton, and Hingham MA: D. ReidelPubli hing Com pan)'. pp. 43-460.

236 Cartography and Geographic Information cience

Yuan. M., B. Buuenfield, M. Cahegan, and II. Miller.200 I. Ceospatial data mining and knowledgediscovery. A USGS While Paper 011 Emergent ResearchThemes.

Zhu, A.x., L.J::. Band, B. Dutton, and 1]. Nimlos.1996. Automated soilinference under rUZLYlogic.Ecological Modelling 90(2): 123-45.

Zhu, A.X. 1997. Measuring uncertainty in claassignment for natural resource maps under rULlYlogic. Photogrammetric Engineering and Remote Sensing63(10): 1195-202.

Zhu. AX. 1999. A personal construct-based knowledgeacquisition process for natural resource mapping.lntemational [ournal oj Geographical Information

cience I 3: I 19-4 l.Zhu, A.X., L. E. Band, R. Veness)" and B, Dutton.

1997. Derivation or soil properties using a soil land

inference model (SoLIM). Soil Science Society ojAI171'11caJoumal61 (2): 523-33.

Zhu, AX., B. Hudson, J. Burt, K. Lubich, and D.irnonson. 200 I. Soil mapping using CIS, expert

knowledge, and fUll), logic. Soil Science Society ojAmericajournal Sb: 1..J63-72.

* Note: "Feature" here and throughout this paper refers to aparticular attribute or property of a category, following theuse of this term in the category theory literature. It is thusdifferent from the traditional notion of "feature" in commongeographic information system (GIS) literature, where thisterm often refers to a geographic entity.

GIScience position at Penn State

OPEN RANK(FULL, ASSOCIATE, OR ASSISTANT PROFESSOR) GEOGRAPHY (GISCIENCE)

THE PENNSYLVANIA STATE UNIVERSITY

Tenure track faculty position at open rank (Full, ASSOCiate, or Assistant Professor) specializing in geographic informationscience (GIScience). We are interested in candidates who will strengthen the Department of Geography's researchand teaching program and help build strong connections to other relevant science communities. GIScience research Inthe Department is coordinated through the GeoVISTA Center (www.geovista.psu.edu).aninterdisciplinary GIScienceunit based in Geography involving five associated departmental faculty and nine faculty from departments across theUniversity Park campus. Candidates with research expertise in any area of GIScience will be considered. Excellencein teaching, research, and service is expected, as is an established record of extramural funding. A willingness toparticipate in the online Masters of Geographic Information Systems (MGIS) graduate degree program is advantageous(http://gis.e-education.psu.edu). Applicants should submit, in digital form: 1) a letter describing how they wouldcontribute to the Department's teaching and research program; 2) a complete curriculum vitae; 3) a maximum of fivereprints; and 4) the names and addresses (including e-mail and fax) of three to five referees. Review of applicationswill begin November 5, 2008, but applications will be accepted until the position is filled. Applications from women andunder-represented groups are encouraged. Ph.D. must be in-hand at time of application.

Send application to:Dr. Alan M. MacEachren

Chair, Search Committee, Department of GeographyThe Pennsylvania State University

302 Walker Building, University Park, PA 16802Phone: (814) 865-7491 (main office: 814-865-3433). Fax: (814) 863-7943

E-mail: [email protected]

Penn State University is committed to affirmative action, equal opportunity, and the diversity of its workforce

•

1111. 35, No.4 237

Documents

Knowledge Discovery from Area-Class Resource Maps: Capturing Prototype … · 2010-12-09 · [Source: Roschand Mervis (1975).] Prototype Theory One of the central concerns of cognitive