Data Mining in the Chemical Industry


Alex Kalos
The Dow Chemical Company, 2301 N. Brazosport Blvd., Freeport, TX, USA 77541
+1 [email protected]

Tim Rey
The Dow Chemical Company, 2020 Dow Center, Midland, MI, USA 48674
+1 [email protected]

ABSTRACT
In this paper we describe the experience of introducing data mining to a large chemical manufacturing company. The multi-national nature of doing business with multiple business units presents a unique opportunity for the deployment of data mining. While each business unit has its own objectives and challenges, which may be at odds with those of other units, they also share many common interests and resources. In this environment, data mining can be used to identify potential value-creating opportunities, through large-site integration of multiple assets and synergies from the use of common assets, such as site-wide manufacturing facilities and world-wide supply-chain, purchasing, and other shared services. However, issues arise, on one hand, from overly complex systems and, on the other hand, from the danger of reaching sub-optimal solutions if a big enough picture is not considered when executing projects. The company-wide initiative and use of Six Sigma at all levels of the company provided fertile ground for making the case for data mining and facilitating its acceptance. The Six Sigma mindset of measuring the performance of processes and analyzing data promotes data-based decision making, therefore making data mining a natural extension of this methodology. We will describe the approach for launching a data mining capability within this framework, and the strategy for securing upper management support, drawing from internal modeling, statistical, and other communities, and from external consultants and universities. Lessons learned from industrial case studies, enterprise-wide tool evaluation, and peer benchmarking will be discussed.

Categories and Subject Descriptors
I.6.5 [Simulation and Modeling]: Model Development - modeling methodologies.

General Terms
Management, Measurement, Documentation, Performance, Design, Human Factors, Standardization, Verification.

Keywords
Data mining, manufacturing, chemical industry.

    1. INTRODUCTION

1.1 Six Sigma Initiative
In the late 1990s The Dow Chemical Company launched the practice of the Six Sigma methodology [4]. By now, almost everyone at Dow has been exposed to or has in some way been involved in Six Sigma. The Measure, Analyze, Improve, and Control (MAIC) phases are clearly delineated. Significant efforts and resources have gone into developing and delivering training materials on these topics. However, until recently, the Define phase was being practiced inconsistently. As a result, many Six Sigma projects were delayed or terminated due to the lack of, or poor execution of, defined deliverables for this phase. Furthermore, projects that at first blush might have looked good often turned out not to be viable.

It became apparent that a better, data-driven method was needed to identify projects and generate charters with a greater potential for success. Data Mining and Modeling (DMM), the methodology of finding relationships between inputs and outputs (modeling) and converting this exploratory model into value, was identified as a viable approach for accomplishing this, and a team was formed to bring knowledge about the DMM methodology into the company and make it accessible to the Six Sigma community at large.

Fortunately, Six Sigma has played a key role in promoting a mentality of continuous improvement: its foundation is that in order to be able to fix something, you have to be able to measure it and analyze it first. This mindset has provided fertile ground for making the case for data mining and facilitating its acceptance as a natural extension to these phases of the Six Sigma methodology. Upper management in particular was much more open to giving data mining a try and providing adequate resources to launch it at a significant level.

1.2 Unique Data Mining Needs of a Global Chemical Company
Although the company shares some of the same issues and concerns as other companies when it comes to data mining, it is also different in many ways from the companies where traditional data mining has largely been applied: insurance, banking, credit card, and other financial institutions, and the large retail or on-line stores, where millions of transactions may take place on a daily basis. For example, we have only 40,000 customers and just over 1 million shipments per year. Our transaction load is nowhere near that of a large on-line retailer (millions per day). It is probably fair to say that much attention has been devoted by vendors to providing tools and methods that address those types of problems. And although other industries


can be the beneficiaries of such developments, there are still unique requirements driving the need for a different sort of data mining approach: in contrast to the huge terabyte data sets, for example, we generally deal with smaller, gigabyte-size sets of greater variety, such as manufacturing process data, research and development data for new material and product development, marketing and business data (orders, purchasing, etc.), and supply-chain data of a globally distributed company dealing with multi-national regulatory issues. The company is essentially a collection of businesses, each with different (and sometimes opposing) needs and objectives, yet they share in many of the benefits that come from large-scale implementation and integration, especially at the geographic site level, through pooled resources and services. The end result of all of this is that any widely deployed data mining methodology and tools must be general enough and flexible enough to accommodate diverse needs, in order to solve local problems effectively while avoiding sub-optimization.

2. Strategy for Deploying Data Mining

2.1 Engaged External Resources
Consultants were contracted to assess the existing situation and recommend approaches for how and where a core capability should be established. Relationships were also established with universities. A special arrangement with Central Michigan University was set up whereby funding was provided to launch the CMURC Business Intelligence/Data Mining resource center in close proximity to company headquarters. This arrangement made available to the company both hardware and software resources, as well as personnel to kick off data mining projects. Training of selected personnel on specialized software was also provided.

The intent of the CMURC is to provide an incubator-like environment to "kick the tires," so to speak, before fully investing in a data mining infrastructure. As in any investment, there are hidden costs to be concerned with. As 80% of a data mining effort is in the data preparation phase, the extent to which any given company has invested in data warehouses and data marts will determine the size of this initial investment. Therefore, for a company like ours, where the transactional load is low compared to the large retail stores, it is important to demonstrate value with low initial costs before jumping in and investing heavily in expensive tools.

2.2 Core Group Creation
A core group was created with members of diverse backgrounds drawn from different functions representing the three main chunks of the company (manufacturing, commercial, and R&D). The group was chartered to develop a business plan, set standards and best practices, define infrastructure, identify and deploy enterprise-wide tools, execute large-scale projects, and manage external relations. The group also launched an evangelical type of communications campaign throughout the company (businesses, functions, and departments) and at all levels of management.

2.3 Training Curriculum
The first task of the core group was to pull together various people who were already devoting a significant part of their time as data and knowledge workers (e.g., math modelers, Six Sigma black belts, etc.) and elevate their knowledge of data mining through an intense training program. The participants were selected such that they would reside within various businesses, in order to seed the interest by placing knowledgeable people within major functions.

After an initial search in the marketplace, it was decided that the kind of DMM course that would address the company's overall needs, one that would address business, manufacturing, and research & development needs, was rather unique. Such a course was simply not available on the market, and the course curriculum had to be developed internally, with the help of an external consultant. The curriculum development team consisted of Six Sigma Master Black Belts and data mining & modeling domain experts from inside and outside the company, who in addition to developing and delivering the curriculum were also assigned as mentors to course participants. One constraint in developing the curriculum was that it would have to be built around tools that were already deployed in the company. Essentially, this was considered a pilot, and it was important to do it at as low a cost as possible.

The expectation upon successful completion of the two-week-long course is that DMM practitioners would be able to perform advanced data analyses and create models that would enable them to make intelligent decisions regarding the viability of fixing a perceived process or product defect, and to know which part should be fixed if needed; in other terms, to develop good project charters -- the actual fixing would still be left to the MAIC process. Still, the course was designed at the 101 level, i.e., targeted at the relative newcomer to DMM; additional, more advanced training would need to be sought by those who intend to be practitioners in the field.

The team began the curriculum development late in 2001 and delivered two waves of the course in 2002. Thirty-five people took the course, seven of whom went on to actively put this knowledge into practice. Although it is still early to assess the full potential of this effort, 34 projects have been identified (13 of which have been initiated) with a potential value of over $1 billion.

3. Business Plan
The core group developed a business plan for setting up and acting on a corporate data mining effort. The key elements of the business plan include: the mission, product/service offering, identified customer base, need for the business, value proposition, business strategy, resource needs, timetable, metrics, communication plan and requirements, constraints, and risks.

The primary outcome of the business plan is the development of a long-term, sustainable resource model for data mining and modeling in the company by:

- Setting long-term DMM strategy
- Developing and managing best practices
- Solving commercial and technical problems
- Building/managing DMM skills across the company
- Enabling DMM projects and personnel
- Supporting a data miners' community
- Consulting on tools and approaches
- Leveraging external resources


A very formal approach was put in place to assess projects, to determine the support model, and to assess and track value.

4. Best Practices: The Data Mining/Modeling Process
One of the main activities of the core group is to identify, define, and promote best practices with regard to data mining. This has to be done in a manner consistent with the company's overall approach to establishing well-defined work processes, according to what is known as Most Effective Technology, or MET. This calls for detailed processes, roles and responsibilities, resource models, technology, training, and documentation.

In order to accommodate the diverse business needs for data mining, the group developed the Data Mining and Modeling (DMM) process (see Figure 1), based on its core set of competencies in mathematical modeling. When compared to other data mining methodologies, like CRISP-DM [22], SEMMA [21], the Virtuous DM Cycle [2], or the KDD process [7], there are both similarities and differences. DMM takes a somewhat deeper look at the system under study, using various methods generally found in the Systems literature to detail the system. Also, there is a more formal separation between the process and the methods used to aid certain steps in the process.

[Figure 1 shows the DMM process: business/site/functional leadership, guided by the business plan and strategic intent, feeds the data mining/modeling subprocess, whose phases span System & Data Identification, Data Preprocessing, Opportunity Discovery, and Opportunity Deployment. The local champion and data miner drive opportunity discovery; the black belt and master black belt drive deployment. Outputs include metrics, raw data, information, knowledge, and preliminary Six Sigma charters, with feedback loops to systematically fix data collection methods and defects.]

Figure 1. Data Mining and Modeling Process

It is important to note that despite its "block" or linear appearance, the DMM process, much like the KDD process [7], is highly iterative and interactive. A high-level description of each of the phases of the DMM process follows:

4.1 Strategic Intent

4.1.1 Assess Current Situation
The main deliverable of this step is a preliminary DMM project charter, which includes modifying an existing project charter that may have been handed to the data miner. Under consideration here are understanding the business objectives, alignment with strategic business goals, and an initial assessment of the value of the opportunity, including the costs of doing the project as well as the estimated hard and soft benefits (e.g., revenue generation, cost reduction, etc.). The preliminary plan will include scope and boundaries, assumptions, constraints, risks, expected deliverables, and an initial translation of the business goals into data mining project tasks.

4.1.2 Validate & Cross-Reference
The purpose of this step is to ensure that the proposed data mining project is consistent with other projects and initiatives that the business may already have underway, or is planning to undertake in the near future, as defined in the business's managing improvement (MI) plans as well as enterprise-wide initiatives. This is to ensure the relevance of the proposed data mining project and also to avoid the sub-optimization that could result if the project were done in a vacuum. Another outcome of this step is to identify the business success criteria, including various measurements (customer, process, and financial measurements), and any relevant benchmarking studies.

4.2 System & Data Identification

4.2.1 Conduct Stakeholder Analysis
The purpose of this step is to identify the key stakeholders from a long list of potential candidates, including sponsors (people driving the project) and those who will fund the project, as well as other decision makers and the process owners. Other stakeholders may include data miners, math modelers, subject matter experts and other technical people, black belts (a Six Sigma term referring to the individuals who will actually implement the solution), and finally the people who will ultimately benefit from the work (the end-users). A desired outcome of this step is to determine the hypotheses held by the key stakeholders. This is important both as a means of cross-validating project expectations and for formulating the data mining plan. These hypotheses are determined through interview sessions using brainstorming and mind-mapping techniques. Finally, the other objective here is to develop a communication plan -- what should be communicated to whom, and how often.

4.2.2 Discern Previous Work
The objective here is to review both internal documents and external literature to identify other related work on similar chemistries, systems, and work processes. Ideally, prior analysis and modeling work would be identified that may be useful for the proposed project in terms of lessons learned, what worked, what didn't, etc.

4.2.3 Determine and Document System Structure and Operation -- Basis to Proceed
The purpose of this step is to document as much as possible about the system under consideration, in order to provide a meaningful context for data mining. The documentation may include system diagrams and process maps, mind maps, relationship maps, information exchange diagrams, flow diagrams of material and money flow, and envelope analysis.

4.2.4 Identify & Understand the Sources of Data
The objective here is to identify all sources of data and determine whether all of the data needed for analysis is already available or whether it is necessary to start a data acquisition campaign. The desired outcome is to identify the specific data sets (e.g., named sources and databases), to do a preliminary assessment of any gaps in the data, to identify important missing variables and ways to remedy them, and to determine whether not being able to acquire missing data or variables in a timely fashion would be detrimental to the project.


    In addition, here we collect as much information as possible

    regarding the data sources, including the collection method

    (manual, electronic, or instrument).
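As a rough illustration of the kind of gap assessment this step calls for, the Python sketch below tabulates per-variable completeness and the longest collection gap for one source. All names (the historian extract, tag names, the simulated gap) are hypothetical, standing in for whatever the named sources actually contain.

```python
# Sketch of a preliminary data-gap assessment; all names are hypothetical.
import numpy as np
import pandas as pd

# A stand-in for one identified source, e.g. an hourly process-historian extract.
rng = np.random.default_rng(0)
historian = pd.DataFrame({
    "timestamp": pd.date_range("2004-01-01", periods=500, freq="h"),
    "reactor_temp": rng.normal(350, 5, 500),
    "feed_rate": rng.normal(120, 10, 500),
})
historian.loc[100:180, "feed_rate"] = np.nan  # simulate a collection gap

# Per-variable completeness: flags variables whose absence may hurt the project.
completeness = historian.drop(columns="timestamp").notna().mean()
print(completeness)

# Longest run of consecutive missing values, a crude measure of collection gaps.
missing = historian["feed_rate"].isna()
gap_lengths = missing.groupby((~missing).cumsum()).sum()
print("longest gap (rows):", int(gap_lengths.max()))
```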

4.2.5 Team Characteristics/Composition
Here, we identify the core group of people that will be involved in the execution of the data mining project. This typically includes data miners, process and system domain experts, data content experts, and data access/extraction experts.

4.2.6 Develop Specific Problem Statement
Given what has been found up to this point, including input from stakeholders, prior work, and what is now known about the available data, it is time to develop a specific problem statement. This is done in the form of a project charter, and templates have been developed to facilitate capturing the important elements. The problem statement is reviewed with the process owner and other key stakeholders.

4.2.7 Understand the Existing Data -- Build Context for the Data
This step is aimed at a detailed understanding of the data, including the sampling frequency (and, if the data is from a process information data historian, the granularity and any filtering or smoothing that may have been done), any intentional biases (and, if outliers have been omitted, the selection criteria), update frequencies, alignment criteria, collection gaps, and any processing algorithms that may have already been applied to the data. Also, here we identify the input and output variables; describe the variable types and attributes; and document attribute definitions, the scale (interval, nominal, ordinal), units of measure, standard operating ranges (upper and lower limits), and any real physical limits. We also determine whether a given variable is measurable, controllable, random, or descriptive. Formatting issues are documented, e.g., file types (flat files, relational files, delimiters, missing data indicators, etc.). Finally, sources and magnitude of measurement error are identified.

All of this is for building the appropriate context for mining the data and is driven by a principle of "functional paranoia" that can best be described via a series of questions: What has been done to the data? Is it filtered, aggregated, calculated, or measured? What are the update timing and time stamp rules? What is the data taxonomy, and how does it map to the business structure? Why was the data collected? How does it fit in with the project goals? Why is it needed? Are there gaps or holes in time? Is the data available at the right frequency and the appropriate level of granularity to solve the problem at hand? What about the information content: are there lots of rows of data but not enough variation in the patterns?

4.2.8 Secure/Collect Needed Data
The purpose of this step is to assemble all relevant sources of data and/or start a data collection campaign to secure needed data that is not yet available. This step generally requires working with database analysts who will do the actual data extraction, so it is important to be very specific about how the data sets should be created, including prescriptions for the time frame, level/aggregation, frequency, scope, format, variable names, file type, delimiters, index variables, sort-merge-stack sequences, the harmonization strategy for multiple data sets, etc.

4.3 Data Preprocessing
The data is prepared and structured so that it may be imported into the analysis platform. This may involve harmonizing data from multiple data sets, as well as merging and stacking the data, and is often done outside of the final analysis platform through the use of desktop databases. The merged data set is then imported for analysis.
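As a sketch of the harmonize-and-merge step (the table names, key names, and values below are hypothetical), two extracts can be reconciled on a common key before import into the analysis platform:

```python
# Sketch: harmonize two hypothetical extracts into one modeling data set.
import pandas as pd

orders = pd.DataFrame({
    "cust_id": [1, 2, 3],
    "order_volume": [120.0, 85.5, 42.0],
})
loyalty = pd.DataFrame({
    "CUSTOMER": [1, 2, 4],          # same key under a different name
    "loyalty_score": [7.8, 6.1, 8.9],
})

# Harmonize key names (and, in practice, units and coding) before merging.
loyalty = loyalty.rename(columns={"CUSTOMER": "cust_id"})

# An inner join keeps only customers present in both sources; an outer join
# would instead surface the coverage gaps for documentation.
modeling_set = orders.merge(loyalty, on="cust_id", how="inner")
print(modeling_set)
```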

4.3.1 Preliminary Data Analyses
This phase essentially follows a traditional data analysis approach [14], [20]. Major steps include visual exploration and inspection of descriptive statistics for interval and ordinal variables, assessing information content, assessing collinearity in independent variables and eliminating redundant variables, and assessing variability and time characteristics. Also, we identify missing values and devise a strategy to handle them, and we identify and handle outliers.
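A compact sketch of these checks follows (the data and the cutoff values of 0.95 and 3 standard deviations are illustrative assumptions, not prescribed thresholds):

```python
# Sketch: descriptive statistics, a collinearity screen, and simple outlier flags.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x1 = rng.normal(0, 1, 300)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.98 + rng.normal(0, 0.05, 300),  # nearly redundant with x1
    "x3": rng.normal(5, 2, 300),
})

print(df.describe())  # inspection of descriptive statistics

# Collinearity screen: flag highly correlated independent-variable pairs.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]
print("candidate redundant variables:", redundant)

# Simple outlier flag: points more than 3 standard deviations from the mean.
z = (df - df.mean()) / df.std()
print("rows flagged as outliers:", int((z.abs() > 3).any(axis=1).sum()))
```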

4.3.2 Variable Selection
In this phase, we select the variables that will be used as inputs and outputs. It is characterized primarily by four aspects: imputation, feature creation, variable reduction, and variable partitioning. Imputation techniques are used to fill in missing data in cases where there is still adequate information to consider the variable as an input. Feature creation techniques are used to create meta-variables, i.e., variables not directly found in the original data but that may be derived by combining or transforming other variables that are in the data. This generally requires domain expertise to draw from physicochemical principles. Variable reduction and variable partitioning are done when there are a lot of variables. In some cases, too many dimensions may pose a problem for some of the downstream modeling techniques, so down-selection becomes a necessity. Clustering, principal components, and other multivariate techniques are used for these activities.

4.3.3 Transform/Recode Data
Standard transformation techniques (e.g., standardization, normalization, log and other transforms) are used to recode the data. Categorical data is recoded into numerical data as appropriate.
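A short sketch of this recoding step, with hypothetical column names:

```python
# Sketch: standardize, log-transform, and recode categorical data numerically.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "flow": [10.0, 250.0, 35.0, 900.0],   # skewed; a candidate for a log transform
    "temp": [340.0, 355.0, 349.0, 362.0],
    "grade": ["A", "B", "A", "C"],        # categorical
})

df["log_flow"] = np.log(df["flow"])                            # log transform
df["temp_std"] = (df["temp"] - df["temp"].mean()) / df["temp"].std()  # standardize
df = pd.get_dummies(df, columns=["grade"])                     # one-hot recoding
print(df)
```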

4.3.4 Document & Validate Data Findings and Derivations
The last step of the data preprocessing phase is to document the data findings and transformations and to communicate and validate these with the stakeholders. Pending acceptance, the data sets are finalized to be used for modeling.

4.4 Opportunity Discovery
This phase is essentially the heart of the DMM process, consisting of two main activities: exploratory data analysis and modeling.

4.4.1 Develop Data Analysis Strategy
Here, the decision is made as to what analysis and modeling techniques will be used. This is done on the basis of the type of problem at hand. For supervised-type problems (i.e., when there is a response variable), the choices depend on whether it is a prediction, classification, estimation, or optimization problem. For unsupervised-type problems (when there is no response variable), the choice may be clustering, association, or linkage-type models.


A methods-selection decision tree, in the form of an interactive mind map, has been developed and is provided to the data miner to facilitate the selection of the appropriate analysis and modeling methods. It emphasizes the assumptions, strengths, and limitations of each method, and it provides a framework for assessing methods on their own as well as for comparing them to one another. After the analysis/modeling methods have been selected, it may be necessary to re-format or re-structure the data to accommodate these methods. Also, it may be necessary to re-sample the data, as the data set may be too large for some techniques.
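The top-level branching of such a decision aid can be caricatured in a few lines; the sketch below follows the supervised/unsupervised taxonomy described above, not the actual mind map, and the suggested technique families are illustrative only.

```python
# Sketch of the supervised/unsupervised branching described above.
def suggest_method_family(has_response: bool, goal: str) -> str:
    """Map a rough problem description to a candidate family of techniques."""
    if has_response:  # supervised: a response variable exists
        return {
            "prediction": "regression models, neural networks",
            "classification": "decision trees, logistic regression, discriminant analysis",
            "estimation": "linear or non-linear regression",
            "optimization": "model-based optimization",
        }.get(goal, "unknown supervised goal")
    # unsupervised: no response variable
    return {
        "clustering": "k-means, hierarchical clustering",
        "association": "association-rule mining",
        "linkage": "link analysis",
    }.get(goal, "unknown unsupervised goal")

print(suggest_method_family(True, "classification"))
print(suggest_method_family(False, "clustering"))
```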

4.4.2 Conduct Exploratory Analysis
The aim here is to look for patterns and themes using visualization and other techniques, such as distributions and histograms, X-by-Y plots, contingency plots, linear regression, clustering, and recursive partitioning. If necessary, row reduction is done to enhance information content. Feature extraction and dimensionality reduction are done if necessary, using techniques like principal components analysis. Finally, in the case of highly non-linear systems, we consider the generation of alternative functional forms via genetic programming. Using such features can help to linearize the problem and thus make it amenable to standard techniques.

4.4.3 Build Models & Assess Model Performance
The data is partitioned into training, validation, and test sets [10]. Again, depending on whether or not a response variable is present, and on the type of variables, different techniques may be appropriate [5], but the general approach is to try linear methods first, since these provide ample means of testing the significance of the models, and then move on to non-linear models if necessary. The general suite of methods used includes clustering techniques, principal components analysis, discriminant and factor analysis, linear and logistic regression, decision trees, and neural networks [9], [15], [23]. For time-series-type models, special techniques are used [3], aimed at accounting for different sources of variability in order to identify any periodic patterns or trends in the data.

Individual models are assessed for performance according to technique-specific procedures (e.g., F-test, correlation coefficient, lack of fit, root mean square error, predicted-vs.-actual plots, residual plots, etc.), as well as the differences in model fit on the training, validation, and test sets. Comparison of performance between different types of models is trickier (e.g., linear regression vs. neural nets), since significance tests that apply to one do not always apply to the other. The stability of non-linear models in particular is checked to ensure that convergence has been reached and that the parameter estimation procedure did not get stuck at a local minimum. As is the case for the entire DMM process, this phase in particular is highly iterative; it may be necessary to cycle back to the beginning and re-build models. Finally, the best model or models are chosen and assessed against business criteria, in order to identify and validate relationships that can be explored for further opportunities for improvement.
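A condensed sketch of the "linear first, then non-linear" progression with a train/validation/test partition follows; the synthetic data, 60/20/20 split, and model settings are illustrative assumptions.

```python
# Sketch: partition the data, fit a linear model first, then a neural network,
# and compare RMSE on the held-out sets.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(600, 5))
y = X[:, 0] + np.sin(2.0 * X[:, 1]) + rng.normal(0, 0.2, 600)

# Training, validation, and test partitions.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                                  random_state=0)

def rmse(model, X_, y_):
    return mean_squared_error(y_, model.predict(X_)) ** 0.5

linear = LinearRegression().fit(X_train, y_train)            # linear first
net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000,  # then non-linear
                   random_state=0).fit(X_train, y_train)

for name, model in [("linear", linear), ("neural net", net)]:
    print(name, "val RMSE:", round(rmse(model, X_val, y_val), 3),
          "test RMSE:", round(rmse(model, X_test, y_test), 3))
```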

4.5 Opportunity Deployment
This phase is characterized by three main activities. First, the developed model(s) may be used to make an immediate business decision; this in and of itself may be the extent of a particular data mining project. Second, the developed model may be adequate and accurate enough to be implemented as part of an on-line or real-time system; in this case, further development may be needed (e.g., software development or re-coding in a format appropriate for the deployment platform). The third outcome of the data mining effort is that opportunities are identified that will need further work to be realized. These are cast as Six Sigma MAIC projects and are turned over to black belts. In this case, it is the job of the data miner to define the project charters, including an assessment of the relative value of the discovered opportunities, identification of uncertainties, full documentation of all data mining activities and key findings, a description of the model(s), a summary of the results, and any recommendations and identification of potential challenges.
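As a toy sketch of the re-coding activity in the second outcome (a hypothetical linear model with made-up coefficients and tag names, not an actual deployed Dow model), a fitted model might be exported as a self-contained scoring function for the target platform:

```python
# Sketch: export a fitted linear model as a self-contained scoring function
# that could be re-coded for an on-line/real-time platform.
coefficients = {"intercept": 1.25, "reactor_temp": 0.031, "feed_rate": -0.008}

def score(record: dict) -> float:
    """Apply the exported model to one record of process measurements."""
    result = coefficients["intercept"]
    for name, weight in coefficients.items():
        if name != "intercept":
            result += weight * record[name]
    return result

print(score({"reactor_temp": 351.0, "feed_rate": 118.0}))
```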

5. Evaluation of Enterprise-wide Tools
As part of the development of Most Effective Technology (MET), we have formally launched an evaluation of various technology solutions. Independent entities like Gartner, Forrester Research, Frost and Sullivan, etc. have done the same over the years, but each organization has its own set of requirements and needs to go through its own learning curve in terms of the technology. Thus, we have established a formal pilot to review the top players in the industry and will then conduct a formal assessment and choose the technology that best suits the company's needs.

It is important to note that we do not assume that only one technology will suit all of our needs. In fact, we will adopt a layered approach to technology. Reporting and OLAP tools are at the base of this pyramid, used by thousands of people in the company. At the next level, we have mid-range statistical tools. Specifically, JMP [21] is broadly used at this level, with over 3000 users. JMP does in fact have some basic exploratory data analysis/data mining capabilities (PLS, neural networks, decision trees, and linear time series) and was selected as the basis for our DMM 101 training curriculum. In the JMP space there are two tiers of modelers: those who have been trained fully in basic statistics via our Six Sigma program, and those who have taken our entry-level DMM course. At the next level, and this is where the evaluation will take place, we expect to install a single enterprise-wide tool like SAS/EM, SPSS Clementine, S+ InsightfulMiner, IBM's IntelligentMiner, etc., of which we expect about 50-75 users with the appropriate training to utilize it as their primary tool. Beyond that layer, we will draw from specialty packages like those found in Wolfram's Mathematica or MathWorks' MATLAB (e.g., symbolic regression, support vector machines, genetic algorithms, etc.). In this tier we expect only a dozen or so of the highest-level modelers to be involved.

6. Collaboration with Peers & Key Customers
An important part of learning how to structure a data mining effort is collaborating with peers. The CMURC environment is designed to do just that. On a quarterly basis, companies such as The Dow Chemical Company, Ford, Eli Lilly, Steelcase, Henry Ford Health Care Systems, EDS, Kelly Services, GFS, KitchenAid, IBM, SAS, ESRI, Harris Interactive, etc. get together to discuss BI/DM applications, data sources, and technology trends. This allows companies to get "out of the box" a bit and generate ideas of where and how to apply data mining in their own situations.


6.1 Benchmarking
In order to better understand how to design, support, value, fund, and gain acceptance of a data mining effort, The Dow Chemical Company, Ford, and Eli Lilly have joined with CMURC to design and launch a BI/DM benchmarking study. This study will give participants a look into various kinds of organizations and industries that are at various stages of BI/DM implementation. Results of the study will be presented at the July 2005 BI Forum at CMU.

7. Case Study
The first project that we did in partnership with the CMURC was an effort to link Customer Loyalty to financial impact [12]. Dow has a long history of collecting Customer Satisfaction/Loyalty-type perception data. From 1999 to 2005, some 50+ separate studies were run across the globe, resulting in a Customer Loyalty data repository of over 30,000 observations, of which two-thirds is competitive data. Considering the cost of design, collection, analysis, reporting, and action, as the results of this work are used in market planning, setting service standards, and feeding Six Sigma projects, Customer Loyalty can cost a company a considerable amount of time and money. Thus, as in the case of any large company's initiatives, the question is asked: does Customer Loyalty make any difference financially? As most people associated with the Customer Satisfaction industry realize, this is in fact the holy grail.

In order to establish the value proposition for loyalty, a large data mining effort was established. Data was amassed from sources ranging from perception data, which included point-in-time market orientation assessments of the different business units, global employee satisfaction studies, customer complaint data, and the customer loyalty data, to behavior data, which included volume and sales trend data, pricing data, profitability data, and attraction and retention data. These data sets ranged in size from dozens of variables and hundreds of observations to hundreds of variables and millions of observations.

This data was harmonized with a series of complex SAS programs in order to produce a modeling data set. Data preparation processes included, but were not limited to, imputation, hostage modeling and removal, outlier testing and removal, transformations, and smoothing. The fundamental model used was based on a blend of the theoretical frameworks for Customer Loyalty of Rust [19], Gale [8], Reichheld [16], Oliver [13], and Johnson and Gustafsson [11], and is primarily linked to the work of Rey and Johnson [17]. This fundamental hierarchical path modeling framework is well suited to the Customer Loyalty problem, but it assumes that the relationships are all linear. Various authors have shown this not to be the case [1]. Thus, a structured neural network approach was used to model first the "within study" framework, then the "across many studies" framework, and then the full "customer loyalty-profit chain" framework. This work was in fact unique, in that some authors have claimed that there has never been an account-level modeling effort that showed the linkage between customer loyalty and financial impact.

In the end, data miners would set themselves up for failure if they were to expect to find loyalty as the key driver of financial impact. In general, the global economy, regional economy, industry economy, market economy, company economy, business economy, and customer economy will play a larger role in the financial landscape than loyalty alone. Breaking a financial number like profitability down into its constituent parts, one sees that there are very few aspects of profitability that can actually be affected by loyalty. In the end, margin is a function of revenue and costs. Revenue is based on volume and price. Thus, what can loyalty affect in volume, price, and cost? It has been hypothesized by many authors that loyal customers bring more volume, allow higher prices, and cost less to serve. In an industrial commodity market with deep and frequent price cycles, like that found in the chemical industry, this often translates into a slower rate of change of volume and price for loyal customers, higher account share for loyal customers, and lower costs to serve in the long run.

Using primarily a cross-sectional approach, and applying traditional time-lagged, econometrically adjusted models based on previous years' financial trends, this data mining effort did in fact show that customer loyalty perceptual measures, market orientation perceptual measures, and employee satisfaction perceptual measures do contribute, at the account level, to the explanation of financial impact in a statistically significant fashion [18]. The findings related to perceptual vs. behavioral measures followed much of what Dick and Basu reported [6].
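As a loose illustration of the shape of such a time-lagged, account-level specification, the sketch below regresses a synthetic financial outcome on a lagged perceptual measure while controlling for the prior financial trend. All data, variable names, and coefficients are fabricated for illustration; this is not the actual Dow model.

```python
# Sketch: regress account-level financial change on a lagged perceptual
# measure, controlling for the prior year's financial trend.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 400                                       # hypothetical accounts
loyalty_lag = rng.normal(7.0, 1.0, n)         # last year's loyalty score
prior_trend = rng.normal(0.02, 0.05, n)       # prior-year revenue trend
margin_change = (0.6 * prior_trend + 0.01 * (loyalty_lag - 7.0)
                 + rng.normal(0, 0.03, n))    # synthetic outcome

X = sm.add_constant(np.column_stack([prior_trend, loyalty_lag]))
fit = sm.OLS(margin_change, X).fit()
print(fit.summary(xname=["const", "prior_trend", "loyalty_lag"]))
```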

8. Conclusion
In this paper we described our approach to introducing data mining in a large, global chemical company. Due to the unique nature of the company and the dynamic nature of the needs of its constituent businesses, a customized and targeted methodology had to be developed, borrowing beneficial aspects from published methodologies while fine-tuning other aspects to better suit special needs.

Some lessons learned include that data mining is not for the faint of heart in terms of quantitative methods, and that data preparation is an important skill set (sitting between data extractors and data miners). Along the way, we had to deal with the ubiquitous abuse of the term "data mining": it means different things to different people; anyone who has ever manipulated a spreadsheet is a self-proclaimed data mining expert, and data extractors are considered by many to be data miners. In a way, we, the vendors, and the trade journals carry some of the blame: in our zeal to preach the virtues of data mining, we often end up overselling and hyping, which often leads to unrealistic expectations. Some of the misconceptions at high management levels included that only the "big iron" will do, that data mining is only for terabyte-type problems, that it is too esoteric, and that no one within the company knows modeling. We also confirmed our notion that while it is important to leverage external resources as much as possible, it is equally important to have experienced people internally to jump-start the process, oversee projects, and keep the consultants honest.

On the plus side, the widespread use of Six Sigma methods, and the measuring-and-analyzing mindset that it promotes, proved to be a catalyst both for motivating the use of data mining and for facilitating its acceptance.


9. ACKNOWLEDGMENTS
Our thanks to Dorian Pyle of Data Miners Inc. for his assistance in developing and deploying the training program. We would also like to thank Jim Mentele, Tim Pletcher, and Carl Lee of CMURC, as well as our Dow colleagues Andy Paquet, Ken Beebe, Dave Rothman, and Mike Costa.

10. REFERENCES
[1] Anderson, E. W. and Mittal, V. Strengthening the Satisfaction-Profit Chain. Journal of Service Research, Vol. 3, No. 2 (Nov. 2000), 107-120.
[2] Berry, M. J. A. and Linoff, G. S. Data Mining Techniques: For Marketing, Sales, and Customer Support. John Wiley & Sons, 1997, 17-19 and 30-34.
[3] Box, G., Jenkins, G., and Reinsel, G. Time Series Analysis: Forecasting and Control, Third Edition. Pearson Education, Inc., 1994.
[4] Breyfogle, F. W. III. Implementing Six Sigma: Smarter Solutions Using Statistical Methods. Wiley-Interscience, 1999.
[5] Dhar, V. and Stein, R. Seven Methods for Transforming Corporate Data into Business Intelligence. Prentice Hall, 1996.
[6] Dick, A. S. and Basu, K. Customer Loyalty: Toward an Integrated Conceptual Framework. Journal of the Academy of Marketing Science, 22 (2), 1994, 99-113.
[7] Fayyad, U. M., Piatetsky-Shapiro, G., and Smyth, P. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996, 1-34.
[8] Gale, B. T. Managing Customer Value. The Free Press, New York, NY, 1994.
[9] Hand, D. J., Mannila, H., and Smyth, P. Principles of Data Mining. MIT Press, 2001.
[10] Haykin, S. Neural Networks: A Comprehensive Foundation, Second Edition. Prentice Hall, New Jersey, 1999.
[11] Johnson, M. and Gustafsson, A. Improving Customer Satisfaction, Loyalty and Profit: An Integrated Measurement and Management System. Jossey-Bass, San Francisco, 2000.
[12] Lee, C., Mentele, J., Gaver, and Rey, T. D. Structured Neural Network Techniques for Modeling Loyalty and Profitability. SUGI 2005.
[13] Oliver, R. L. Satisfaction: A Behavioral Perspective on the Consumer. McGraw-Hill, New York, 1997.
[14] Pyle, D. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
[15] Pyle, D. Business Modeling and Data Mining. Morgan Kaufmann, 2003.
[16] Reichheld, F. The Loyalty Effect: The Hidden Force Behind Growth, Profits, and Lasting Value. Harvard Business School Press, Boston, 1996.
[17] Rey, T. D. and Johnson, M. Modeling the Connection Between Loyalty and Financial Impact: A Journey. In Earning a Place at the Table, 23rd Annual Marketing Research Conference, American Marketing Association, Chicago, IL, September 8-11, 2002.
[18] Rey, T. D. Tying Customer Loyalty to Financial Impact. In Symposium on Complexity and Advanced Analytics Applied to Business, Government and Public Policy, Society for Industrial and Applied Mathematics, Great Lakes Section, University of Michigan, Dearborn Campus, October 23, 2004.
[19] Rust, R. T., Zahorik, A. J., and Keiningham, T. L. Return on Quality: Measuring the Financial Impact of Your Company's Quest for Quality. Probus Professional Publishing, 1993.
[20] Wang, X. Z. Data Mining and Knowledge Discovery for Process Monitoring and Control (Advances in Industrial Control). Springer Verlag, 1999.
[21] SAS Institute. http://www.sas.com/technologies/analytics/datamining/miner/semma.html, 2005.
[22] Shearer, C. The CRISP-DM Model: The New Blueprint for Data Mining. Journal of Data Warehousing, Vol. 5, No. 4 (2000), 13-22.
[23] Witten, I. H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.
