Data Mining in the Chemical Industry


Alex Kalos
The Dow Chemical Company, 2301 N. Brazosport Blvd., Freeport, TX, USA 77541
+1 [email protected]

Tim Rey
The Dow Chemical Company, 2020 Dow Center, Midland, MI, USA 48674
+1 [email protected]

ABSTRACT
In this paper we describe the experience of introducing data mining to a large chemical manufacturing company. The multi-national nature of doing business with multiple business units presents a unique opportunity for the deployment of data mining. While each business unit has its own objectives and challenges, which may be at odds with those of other units, they also share many common interests and resources. In this environment, data mining can be used to identify potential value-creating opportunities, through large-site integration of multiple assets and synergies from the use of common assets, such as site-wide manufacturing facilities and world-wide supply-chain, purchasing, and other shared services. However, issues arise, on one hand, from overly complex systems and, on the other hand, from the danger of reaching sub-optimal solutions if a big enough picture is not considered when executing projects. The company-wide initiative and use of Six Sigma at all levels of the company provided fertile ground for making the case for data mining and facilitating its acceptance. The Six Sigma mindset of measuring the performance of processes and analyzing data promotes data-based decision making, therefore making data mining a natural extension of this methodology. We will describe the approach for launching a data mining capability within this framework, and the strategy for securing upper management support, drawing from internal modeling, statistical, and other communities, and from external consultants and universities. Lessons learned from industrial case studies, enterprise-wide tool evaluation, and peer benchmarking will be discussed.

Categories and Subject Descriptors
I.6.5 [Simulation and Modeling]: Model Development - modeling methodologies.

General Terms
Management, Measurement, Documentation, Performance, Design, Human Factors, Standardization, Verification.

Keywords
Data mining, manufacturing, chemical industry.

    1. INTRODUCTION

1.1 Six Sigma Initiative
In the late 1990s The Dow Chemical Company launched the practice of the Six Sigma methodology [4]. By now, almost everyone at Dow has been exposed to or has in some way been involved in Six Sigma. The Measure, Analyze, Improve, and Control (MAIC) phases are clearly delineated. Significant efforts and resources have gone into developing and delivering training materials on these topics. However, until recently, the Define phase was being practiced inconsistently. As a result, many Six Sigma projects were delayed or terminated due to the lack of, or poor execution of, defined deliverables for this phase. Furthermore, projects that at first blush might have looked good often turned out not to be viable.

It became apparent that a better, data-driven method was needed to identify projects and generate charters with a greater potential for success. Data Mining and Modeling (DMM), the methodology of finding relationships between inputs and outputs (modeling) and converting this exploratory model into value, was identified as a viable approach for accomplishing this, and a team was formed to bring knowledge about the DMM methodology into the company and make it accessible to the Six Sigma community at large.

Fortunately, Six Sigma has played a key role in promoting a mentality of continuous improvement: its foundation is that in order to be able to fix something, you have to be able to measure it and analyze it first. This mindset has provided fertile ground for making the case for data mining and facilitating its acceptance as a natural extension to these phases of the Six Sigma methodology. Upper management in particular was much more open to giving data mining a try and providing adequate resources to launch it at a significant level.

1.2 Unique Data Mining Needs of a Global Chemical Company
Although the company shares some of the same issues and concerns as other companies when it comes to data mining, it is also different in many ways from the companies where traditional data mining has largely been applied: insurance, banking, credit card, and other financial institutions, and the large retail or on-line stores, where millions of transactions may take place on a daily basis. For example, we have only 40,000 customers and just over 1 million shipments per year. Our transaction load is nowhere near that of a large on-line retailer (millions per day). It is probably fair to say that much attention has been devoted by vendors to providing tools and methods that address those types of problems. And although other industries


can be the beneficiaries of such developments, there are still unique requirements driving the need for a different sort of data mining approach: in contrast to the huge terabyte data sets, for example, we generally deal with smaller, gigabyte-size sets of greater variety, such as manufacturing process data, research and development data for new material and product development, marketing and business data (orders, purchasing, etc.), and supply-chain data of a globally distributed company dealing with multi-national regulatory issues. The company is essentially a collection of businesses, each with different (and sometimes opposing) needs and objectives, yet they share in many of the benefits that come from large-scale implementation and integration, especially at the geographic site level, through pooled resources and services. The end result of all of this is that any widely deployed data mining methodology and tools must be general enough and flexible enough to accommodate diverse needs, in order to solve local problems effectively while avoiding sub-optimization.

2. Strategy for Deploying Data Mining

2.1 Engaged External Resources
Consultants were contracted to assess the existing situation and recommend approaches for how and where a core capability should be established. Relationships were also established with universities. A special arrangement with Central Michigan University was set up whereby funding was provided to launch the CMURC Business Intelligence/Data Mining resource center in close proximity to company headquarters. This arrangement made available to the company both hardware and software resources, as well as personnel to kick off data mining projects. Training of selected personnel on specialized software was also provided.

The intent of the CMURC is to provide an incubator-like environment to "kick the tires," so to speak, before fully investing in a data mining infrastructure. As in any investment, there are hidden costs to be concerned with. As 80% of a data mining effort is in the data preparation phase, the extent to which any given company has invested in data warehouses and data marts will determine the size of this initial investment. Therefore, for a company like ours, where the transactional load is low compared to the large retail stores, it is important to demonstrate value with low initial costs before jumping in and investing heavily in expensive tools.

2.2 Core Group Creation
A core group was created with members of diverse backgrounds drawn from different functions representing the three main chunks of the company (manufacturing, commercial, and R&D). The group was chartered to develop a business plan, set standards and best practices, define infrastructure, identify and deploy enterprise-wide tools, execute large-scale projects, and manage external relations. The group also launched an evangelical type of communications campaign throughout the company (businesses, functions, and departments) and at all levels of management.

2.3 Training Curriculum
The first task of the core group was to pull together various people who were already devoting a significant part of their time as data and knowledge workers (e.g., math modelers, Six Sigma black belts, etc.) and elevate their knowledge of data mining through an intense training program. The participants were selected such that they would reside within various businesses, in order to seed the interest by placing knowledgeable people within major functions.

After an initial search in the marketplace, it was decided that the kind of DMM course that would address the company's overall needs, one that would address business, manufacturing, and research & development needs, was rather unique. Such a course was simply not available on the market, and the course curriculum had to be developed internally, with the help of an external consultant. The curriculum development team consisted of Six Sigma Master Black Belts and data mining & modeling domain experts from inside and outside the company, who in addition to developing and delivering the curriculum were also assigned as mentors to course participants. One constraint in developing the curriculum was that it would have to be built around tools that were already deployed in the company. Essentially, this was considered a pilot, and it was important to do it at as low a cost as possible.

The expectation upon successful completion of the two-week-long course is that DMM practitioners would be able to perform advanced data analyses and create models that would enable them to make intelligent decisions regarding the viability of fixing a perceived process or product defect, and to know which part should be fixed if needed; in other terms, to develop good project charters -- the actual fixing would still be left to the MAIC process. Still, the course was designed at the 101 level, i.e., targeted at the relative newcomer to DMM; additional, more advanced training would need to be sought by those who intend to be practitioners in the field.

The team began the curriculum development late in 2001 and delivered two waves of the course in 2002. Thirty-five people took the course, seven of whom went on to actively put this knowledge into practice. Although it is still early to assess the full potential of this effort, 34 projects have been identified (13 of which have been initiated) with a potential value of over $1 billion.

3. Business Plan
The core group developed a business plan for setting up and acting on a corporate data mining effort. The key elements of the business plan include: the mission, product/service offering, identified customer base, need for the business, value proposition, business strategy, resource needs, timetable, metrics, communication plan and requirements, constraints, and risks.

The primary outcome of the business plan is the development of a long-term, sustainable resource model for data mining and modeling in the company by:

- Setting long-term DMM strategy
- Developing and managing best practices
- Solving commercial and technical problems
- Building/managing DMM skills across the company
- Enabling DMM projects and personnel
- Supporting a data miners' community
- Consulting on tools and approaches
- Leveraging external resources


A very formal approach was put in place to assess projects, to determine the support model, and to assess and track value.

4. Best Practices: The Data Mining/Modeling Process
One of the main activities of the core group is to identify, define, and promote best practices with regard to data mining. This has to be done in a manner consistent with the company's overall approach to establishing well-defined work processes, according to what is known as Most Effective Technology, or MET. This calls for detailed processes, roles and responsibilities, resource models, technology, training, and documentation.

In order to accommodate the diverse business needs for data mining, the group developed the Data Mining and Modeling (DMM) process (see Figure 1), based on its core set of competencies in mathematical modeling. When compared to other data mining methodologies, like CRISP-DM [22], SEMMA [21], the Virtuous DM Cycle [2], or the KDD process [7], there are both similarities and differences. DMM takes a somewhat deeper look at the system under study, using various methods generally found in the Systems literature to detail the system. Also, there is a more formal separation between the process and the methods used to aid certain steps in the process.

[Figure 1 shows the DMM process: business/site/functional leadership, guided by the business plan and strategic intent, feeds the data mining/modeling subprocess, whose phases span System & Data Identification, Data Preprocessing, Opportunity Discovery, and Opportunity Deployment. The local champion and data miner drive opportunity discovery; the black belt and master black belt drive deployment. Outputs include metrics, raw data, information, knowledge, and preliminary Six Sigma charters, with feedback loops to systematically fix data collection methods and defects.]

Figure 1. Data Mining and Modeling Process

It is important to note that despite its "block" or linear appearance, the DMM process, much like the KDD process [7], is highly iterative and interactive. A high-level description of each of the phases of the DMM process follows:

4.1 Strategic Intent

4.1.1 Assess Current Situation
The main deliverable of this step is a preliminary DMM project charter, which includes modifying an existing project charter that may have been handed to the data miner. Under consideration here are understanding the business objectives, alignment with strategic business goals, and an initial assessment of the value of the opportunity, including the costs of doing the project as well as the estimated hard and soft benefits (e.g., revenue generation, cost reduction, etc.). The preliminary plan will include scope and boundaries, assumptions, constraints, risks, expected deliverables, and an initial translation of the business goals into data mining project tasks.

4.1.2 Validate & Cross-Reference
The purpose of this step is to ensure that the proposed data mining project is consistent with other projects and initiatives that the business may already have underway, or is planning to undertake in the near future, as defined in the business's managing improvement (MI) plans as well as enterprise-wide initiatives. This is to ensure the relevance of the proposed data mining project and also to avoid the sub-optimization that could result if the project were done in a vacuum. Another outcome of this step is to identify the business success criteria, including various measurements (customer, process, and financial measurements), and any relevant benchmarking studies.

4.2 System & Data Identification

4.2.1 Conduct Stakeholder Analysis
The purpose of this step is to identify the key stakeholders from a long list of potential candidates, including sponsors (people driving the project) and those who will fund the project, as well as other decision makers and the process owners. Other stakeholders may include data miners, math modelers, subject matter experts and other technical people, black belts (a Six Sigma term referring to the individuals who will actually implement the solution), and finally the people who will ultimately benefit from the work (the end-users). A desired outcome of this step is to determine the hypotheses held by the key stakeholders. This is important both as a means of cross-validating project expectations and for formulating the data mining plan. These hypotheses are determined through interview sessions using brainstorming and mind-mapping techniques. Finally, the other objective here is to develop a communication plan -- what should be communicated to whom, and how often.

4.2.2 Discern Previous Work
The objective here is to review both internal documents and external literature to identify other related work on similar chemistries, systems, and work processes. Ideally, prior analysis and modeling work would be identified that may be useful for the proposed project in terms of lessons learned, what worked, what didn't, etc.

4.2.3 Determine and Document System Structure and Operation -- Basis to Proceed
The purpose of this step is to document as much as possible about the system under consideration, in order to provide a meaningful context for data mining. The documentation may include system diagrams and process maps, mind maps, relationship maps, information exchange diagrams, flow diagrams of material and money flow, and envelope analysis.

4.2.4 Identify & Understand the Sources of Data
The objective here is to identify all sources of data and determine whether all of the data needed for analysis is already available or whether it is necessary to start a data acquisition campaign. The desired outcome is to identify the specific data sets (e.g., named sources and databases), to do a preliminary assessment of any gaps in the data, to identify important missing variables and ways to remedy them, and to determine whether not being able to acquire missing data or variables in a timely fashion would be detrimental to the project.


    In addition, here we collect as much information as possible

    regarding the data sources, including the collection method

    (manual, electronic, or instrument).
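As a rough illustration of the kind of gap assessment this step calls for, the Python sketch below tabulates per-variable completeness and the longest collection gap for one source. All names (the historian extract, tag names, the simulated gap) are hypothetical, standing in for whatever the named sources actually contain.

```python
# Sketch of a preliminary data-gap assessment; all names are hypothetical.
import numpy as np
import pandas as pd

# A stand-in for one identified source, e.g. an hourly process-historian extract.
rng = np.random.default_rng(0)
historian = pd.DataFrame({
    "timestamp": pd.date_range("2004-01-01", periods=500, freq="h"),
    "reactor_temp": rng.normal(350, 5, 500),
    "feed_rate": rng.normal(120, 10, 500),
})
historian.loc[100:180, "feed_rate"] = np.nan  # simulate a collection gap

# Per-variable completeness: flags variables whose absence may hurt the project.
completeness = historian.drop(columns="timestamp").notna().mean()
print(completeness)

# Longest run of consecutive missing values, a crude measure of collection gaps.
missing = historian["feed_rate"].isna()
gap_lengths = missing.groupby((~missing).cumsum()).sum()
print("longest gap (rows):", int(gap_lengths.max()))
```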

4.2.5 Team Characteristics/Composition
Here, we identify the core group of people that will be involved in the execution of the data mining project. This typically includes data miners, process and system domain experts, data content experts, and data access/extraction experts.

4.2.6 Develop Specific Problem Statement
Given what has been found up to this point, including input from stakeholders, prior work, and what is now known about the available data, it is time to develop a specific problem statement. This is done in the form of a project charter, and templates have been developed to facilitate capturing the important elements. The problem statement is reviewed with the process owner and other key stakeholders.

4.2.7 Understand the Existing Data -- Build Context for the Data
This step is aimed at a detailed understanding of the data, including the sampling frequency (and, if the data is from a process information data historian, the granularity and any filtering or smoothing that may have been done), any intentional biases (and, if outliers have been omitted, the selection criteria), update frequencies, alignment criteria, collection gaps, and any processing algorithms that may have already been applied to the data. Also, here we identify the input and output variables; describe the variable types and attributes; and document attribute definitions, the scale (interval, nominal, ordinal), units of measure, standard operating ranges (upper and lower limits), and any real physical limits. We also determine whether a given variable is measurable, controllable, random, or descriptive. Formatting issues are documented, e.g., file types (flat files, relational files, delimiters, missing data indicators, etc.). Finally, sources and magnitude of measurement error are identified.

All of this is for building the appropriate context for mining the data and is driven by a principle of "functional paranoia" that can best be described via a series of questions: What has been done to the data? Is it filtered, aggregated, calculated, or measured? What are the update timing and time stamp rules? What is the data taxonomy, and how does it map to the business structure? Why was the data collected? How does it fit in with the project goals? Why is it needed? Are there gaps or holes in time? Is the data available at the right frequency and the appropriate level of granularity to solve the problem at hand? What about the information content: are there lots of rows of data but not enough variation in the patterns?

4.2.8 Secure/Collect Needed Data
The purpose of this step is to assemble all relevant sources of data and/or start a data collection campaign to secure needed data that is not yet available. This step generally requires working with database analysts who will do the actual data extraction, so it is important to be very specific about how the data sets should be created, including prescriptions for the time frame, level/aggregation, frequency, scope, format, variable names, file type, delimiters, index variables, sort-merge-stack sequences, the harmonization strategy for multiple data sets, etc.

4.3 Data Preprocessing
The data is prepared and structured so that it may be imported into the analysis platform. This may involve harmonizing data from multiple data sets, as well as merging and stacking the data, and is often done outside of the final analysis platform through the use of desktop databases. The merged data set is then imported for analysis.
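As a sketch of the harmonize-and-merge step (the table names, key names, and values below are hypothetical), two extracts can be reconciled on a common key before import into the analysis platform:

```python
# Sketch: harmonize two hypothetical extracts into one modeling data set.
import pandas as pd

orders = pd.DataFrame({
    "cust_id": [1, 2, 3],
    "order_volume": [120.0, 85.5, 42.0],
})
loyalty = pd.DataFrame({
    "CUSTOMER": [1, 2, 4],          # same key under a different name
    "loyalty_score": [7.8, 6.1, 8.9],
})

# Harmonize key names (and, in practice, units and coding) before merging.
loyalty = loyalty.rename(columns={"CUSTOMER": "cust_id"})

# An inner join keeps only customers present in both sources; an outer join
# would instead surface the coverage gaps for documentation.
modeling_set = orders.merge(loyalty, on="cust_id", how="inner")
print(modeling_set)
```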

4.3.1 Preliminary Data Analyses
This phase essentially follows a traditional data analysis approach [14], [20]. Major steps include visual exploration and inspection of descriptive statistics for interval and ordinal variables, assessing information content, assessing collinearity in independent variables and eliminating redundant variables, and assessing variability and time characteristics. Also, we identify missing values and devise a strategy to handle them, and we identify and handle outliers.
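A compact sketch of these checks follows (the data and the cutoff values of 0.95 and 3 standard deviations are illustrative assumptions, not prescribed thresholds):

```python
# Sketch: descriptive statistics, a collinearity screen, and simple outlier flags.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x1 = rng.normal(0, 1, 300)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.98 + rng.normal(0, 0.05, 300),  # nearly redundant with x1
    "x3": rng.normal(5, 2, 300),
})

print(df.describe())  # inspection of descriptive statistics

# Collinearity screen: flag highly correlated independent-variable pairs.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]
print("candidate redundant variables:", redundant)

# Simple outlier flag: points more than 3 standard deviations from the mean.
z = (df - df.mean()) / df.std()
print("rows flagged as outliers:", int((z.abs() > 3).any(axis=1).sum()))
```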

4.3.2 Variable Selection
In this phase, we select the variables that will be used as inputs and outputs. It is characterized primarily by four aspects: imputation, feature creation, variable reduction, and variable partitioning. Imputation techniques are used to fill in missing data in cases where there is still adequate information to consider the variable as an input. Feature creation techniques are used to create meta-variables, i.e., variables not directly found in the original data but that may be derived by combining or transforming other variables that are in the data. This generally requires domain expertise to draw from physicochemical principles. Variable reduction and variable partitioning are done when there are a lot of variables. In some cases, too many dimensions may pose a problem for some of the downstream modeling techniques, so down-selection becomes a necessity. Clustering, principal components, and other multivariate techniques are used for these activities.

4.3.3 Transform/Recode Data
Standard transformation techniques (e.g., standardization, normalization, log and other transforms) are used to recode the data. Categorical data is recoded into numerical data as appropriate.
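A short sketch of this recoding step, with hypothetical column names:

```python
# Sketch: standardize, log-transform, and recode categorical data numerically.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "flow": [10.0, 250.0, 35.0, 900.0],   # skewed; a candidate for a log transform
    "temp": [340.0, 355.0, 349.0, 362.0],
    "grade": ["A", "B", "A", "C"],        # categorical
})

df["log_flow"] = np.log(df["flow"])                            # log transform
df["temp_std"] = (df["temp"] - df["temp"].mean()) / df["temp"].std()  # standardize
df = pd.get_dummies(df, columns=["grade"])                     # one-hot recoding
print(df)
```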

4.3.4 Document & Validate Data Findings and Derivations
The last step of the data preprocessing phase is to document the data findings and transformations and to communicate and validate these with the stakeholders. Pending acceptance, the data sets are finalized to be used for modeling.

4.4 Opportunity Discovery
This phase is essentially the heart of the DMM process, consisting of two main activities: exploratory data analysis and modeling.

4.4.1 Develop Data Analysis Strategy
Here, the decision is made as to what analysis and modeling techniques will be used. This is done on the basis of the type of problem at hand. For supervised-type problems (i.e., when there is a response variable), the choices depend on whether it is a prediction, classification, estimation, or optimization problem. For unsupervised-type problems (when there is no response variable), the choice may be clustering, association, or linkage-type models.


A methods-selection decision tree, in the form of an interactive mind map, has been developed and is provided to the data miner to facilitate the selection of the appropriate analysis and modeling methods. It emphasizes the assumptions, strengths, and limitations of each method, and it provides a framework for assessing methods on their own as well as for comparing them to one another. After the analysis/modeling methods have been selected, it may be necessary to re-format or re-structure the data to accommodate these methods. Also, it may be necessary to re-sample the data, as the data set may be too large for some techniques.
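The top-level branching of such a decision aid can be caricatured in a few lines; the sketch below follows the supervised/unsupervised taxonomy described above, not the actual mind map, and the suggested technique families are illustrative only.

```python
# Sketch of the supervised/unsupervised branching described above.
def suggest_method_family(has_response: bool, goal: str) -> str:
    """Map a rough problem description to a candidate family of techniques."""
    if has_response:  # supervised: a response variable exists
        return {
            "prediction": "regression models, neural networks",
            "classification": "decision trees, logistic regression, discriminant analysis",
            "estimation": "linear or non-linear regression",
            "optimization": "model-based optimization",
        }.get(goal, "unknown supervised goal")
    # unsupervised: no response variable
    return {
        "clustering": "k-means, hierarchical clustering",
        "association": "association-rule mining",
        "linkage": "link analysis",
    }.get(goal, "unknown unsupervised goal")

print(suggest_method_family(True, "classification"))
print(suggest_method_family(False, "clustering"))
```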

4.4.2 Conduct Exploratory Analysis
The aim here is to look for patterns and themes using visualization and other techniques, such as distributions and histograms, X-by-Y plots, contingency plots, linear regression, clustering, and recursive partitioning. If necessary, row reduction is done to enhance information content. Feature extraction and dimensionality reduction are done if necessary, using techniques like principal components analysis. Finally, in the case of highly non-linear systems, we consider the generation of alternative functional forms via genetic programming. Using such features can help to linearize the problem and thus make it amenable to standard techniques.

4.4.3 Build Models & Assess Model Performance
The data is partitioned into training, validation, and test sets [10]. Again, depending on whether or not a response variable is present, and on the type of variables, different techniques may be appropriate [5], but the general approach is to try linear methods first, since these provide ample means of testing the significance of the models, and then move on to non-linear models if necessary. The general suite of methods used includes clustering techniques, principal components analysis, discriminant and factor analysis, linear and logistic regression, decision trees, and neural networks [9], [15], [23]. For time-series-type models, special techniques are used [3], aimed at accounting for different sources of variability in order to identify any periodic patterns or trends in the data.

Individual models are assessed for performance according to technique-specific procedures (e.g., F-test, correlation coefficient, lack of fit, root mean square error, predicted-vs.-actual plots, residual plots, etc.), as well as the differences in model fit on the training, validation, and test sets. Comparison of performance between different types of models is trickier (e.g., linear regression vs. neural nets), since significance tests that apply to one do not always apply to the other. The stability of non-linear models in particular is checked to ensure that convergence has been reached and that the parameter estimation procedure did not get stuck at a local minimum. As is the case for the entire DMM process, this phase in particular is highly iterative; it may be necessary to cycle back to the beginning and re-build models. Finally, the best model or models are chosen and assessed against business criteria, in order to identify and validate relationships that can be explored for further opportunities for improvement.
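A condensed sketch of the "linear first, then non-linear" progression with a train/validation/test partition follows; the synthetic data, 60/20/20 split, and model settings are illustrative assumptions.

```python
# Sketch: partition the data, fit a linear model first, then a neural network,
# and compare RMSE on the held-out sets.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(600, 5))
y = X[:, 0] + np.sin(2.0 * X[:, 1]) + rng.normal(0, 0.2, 600)

# Training, validation, and test partitions.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                                  random_state=0)

def rmse(model, X_, y_):
    return mean_squared_error(y_, model.predict(X_)) ** 0.5

linear = LinearRegression().fit(X_train, y_train)            # linear first
net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000,  # then non-linear
                   random_state=0).fit(X_train, y_train)

for name, model in [("linear", linear), ("neural net", net)]:
    print(name, "val RMSE:", round(rmse(model, X_val, y_val), 3),
          "test RMSE:", round(rmse(model, X_test, y_test), 3))
```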

4.5 Opportunity Deployment
This phase is characterized by three main activities. First, the developed model(s) may be used to make an immediate business decision; this in and of itself may be the extent of a particular data mining project. Second, the developed model may be adequate and accurate enough to be implemented as part of an on-line or real-time system; in this case, further development may be needed (e.g., software development or re-coding in a format appropriate for the deployment platform). The third outcome of the data mining effort is that opportunities are identified that will need further work to be realized. These are cast as Six Sigma MAIC projects and are turned over to black belts. In this case, it is the job of the data miner to define the project charters, including an assessment of the relative value of the discovered opportunities, identification of uncertainties, full documentation of all data mining activities and key findings, a description of the model(s), a summary of the results, and any recommendations and identification of potential challenges.
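As a toy sketch of the re-coding activity in the second outcome (a hypothetical linear model with made-up coefficients and tag names, not an actual deployed Dow model), a fitted model might be exported as a self-contained scoring function for the target platform:

```python
# Sketch: export a fitted linear model as a self-contained scoring function
# that could be re-coded for an on-line/real-time platform.
coefficients = {"intercept": 1.25, "reactor_temp": 0.031, "feed_rate": -0.008}

def score(record: dict) -> float:
    """Apply the exported model to one record of process measurements."""
    result = coefficients["intercept"]
    for name, weight in coefficients.items():
        if name != "intercept":
            result += weight * record[name]
    return result

print(score({"reactor_temp": 351.0, "feed_rate": 118.0}))
```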

5. Evaluation of Enterprise-wide Tools
As part of the development of Most Effective Technology (MET), we have formally launched an evaluation of various technology solutions. Independent entities like Gartner, Forrester Research, Frost and Sullivan, etc. have done the same over the years, but each organization has its own set of requirements and needs to go through its own learning curve in terms of the technology. Thus, we have established a formal pilot to review the top players in the industry and will then conduct a formal assessment and choose the technology that best suits the company's needs.

It is important to note that we do not assume that only one technology will suit all of our needs. In fact, we will adopt a layered approach to technology. Reporting and OLAP tools are at the base of this pyramid, used by thousands of people in the company. At the next level, we have mid-range statistical tools. Specifically, JMP [21] is broadly used at this level, with over 3000 users. JMP does in fact have some basic exploratory data analysis/data mining capabilities (PLS, neural networks, decision trees, and linear time series) and was selected as the basis for our DMM 101 training curriculum. In the JMP space there are two tiers of modelers: those who have been trained fully in basic statistics via our Six Sigma program, and those who have taken our entry-level DMM course. At the next level, and this is where the evaluation will take place, we expect to install a single enterprise-wide tool like SAS/EM, SPSS Clementine, S+ InsightfulMiner, IBM's IntelligentMiner, etc., of which we expect about 50-75 users with the appropriate training to utilize it as their primary tool. Beyond that layer, we will draw from specialty packages like those found in Wolfram's Mathematica or MathWorks' MATLAB (e.g., symbolic regression, support vector machines, genetic algorithms, etc.). In this tier we expect only a dozen or so of the highest-level modelers to be involved.

6. Collaboration with Peers & Key Customers
An important part of learning how to structure a data mining effort is collaborating with peers. The CMURC environment is designed to do just that. On a quarterly basis, companies such as The Dow Chemical Company, Ford, Eli Lilly, Steelcase, Henry Ford Health Care Systems, EDS, Kelly Services, GFS, KitchenAid, IBM, SAS, ESRI, Harris Interactive, etc. get together to discuss BI/DM applications, data sources, and technology trends. This allows companies to get "out of the box" a bit and generate ideas of where and how to apply data mining in their own situations.


6.1 Benchmarking
In order to better understand how to design, support, value, fund, and gain acceptance of a data mining effort, The Dow Chemical Company, Ford, and Eli Lilly have joined with CMURC to design and launch a BI/DM benchmarking study. This study will give participants a look into various kinds of organizations and industries that are at various stages of BI/DM implementation. Results of the study will be presented at the July 2005 BI Forum at CMU.

7. Case Study
The first project that we did in partnership with the CMURC was an effort to link Customer Loyalty to financial impact [12]. Dow has a long history of collecting Customer Satisfaction/Loyalty-type perception data. From 1999 to 2005, some 50+ separate studies were run across the globe, resulting in a Customer Loyalty data repository of over 30,000 observations, of which two-thirds is competitive data. Considering the cost of design, collection, analysis, reporting, and action, as the results of this work are used in market planning, setting service standards, and feeding Six Sigma projects, Customer Loyalty can cost a company a considerable amount of time and money. Thus, as in the case of any large company's initiatives, the question is asked: does Customer Loyalty make any difference financially? As most people associated with the Customer Satisfaction industry realize, this is in fact the holy grail.

In order to establish the value proposition for loyalty, a large data mining effort was established. Data was amassed from sources ranging from perception data, which included point-in-time market orientation assessments of the different business units, global employee satisfaction studies, customer complaint data, and the customer loyalty data, to behavior data, which included volume and sales trend data, pricing data, profitability data, and attraction and retention data. These data sets ranged in size from dozens of variables and hundreds of observations to hundreds of variables and millions of observations.

This data was harmonized with a series of complex SAS programs in order to produce a modeling data set. Data preparation processes included, but were not limited to, imputation, hostage modeling and removal, outlier testing and removal, transformations, and smoothing. The fundamental model used was based on a blend of the theoretical frameworks for Customer Loyalty of Rust [19], Gale [8], Reichheld [16], Oliver [13], and Johnson and Gustafsson [11], and is primarily linked to the work of Rey and Johnson [17]. This fundamental hierarchical path modeling framework is well suited to the Customer Loyalty problem, but it assumes that the relationships are all linear. Various authors have shown this not to be the case [1]. Thus, a structured neural network approach was used to model first the "within study" framework, then the "across many studies" framework, and then the full "customer loyalty-profit chain" framework. This work was in fact unique, in that some authors have claimed that there has never been an account-level modeling effort that showed the linkage between customer loyalty and financial impact.

In the end, data miners would set themselves up for failure if they were to expect to find loyalty as the key driver of financial impact. In general, the global economy, regional economy, industry economy, market economy, company economy, business economy, and customer economy will play a larger role in the financial landscape than loyalty alone. Breaking a financial number like profitability down into its constituent parts, one sees that there are very few aspects of profitability that can actually be affected by loyalty. In the end, margin is a function of revenue and costs. Revenue is based on volume and price. Thus, what can loyalty affect in volume, price, and cost? It has been hypothesized by many authors that loyal customers bring more volume, allow higher prices, and cost less to serve. In an industrial commodity market with deep and frequent price cycles, like that found in the chemical industry, this often translates into a slower rate of change of volume and price for loyal customers, higher account share for loyal customers, and lower costs to serve in the long run.

Using primarily a cross-sectional approach, and applying traditional time-lagged, econometrically adjusted models based on previous years' financial trends, this data mining effort did in fact show that customer loyalty perceptual measures, market orientation perceptual measures, and employee satisfaction perceptual measures do contribute, at the account level, to the explanation of financial impact in a statistically significant fashion [18]. The findings related to perceptual vs. behavioral measures followed much of what Dick and Basu reported [6].
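As a loose illustration of the shape of such a time-lagged, account-level specification, the sketch below regresses a synthetic financial outcome on a lagged perceptual measure while controlling for the prior financial trend. All data, variable names, and coefficients are fabricated for illustration; this is not the actual Dow model.

```python
# Sketch: regress account-level financial change on a lagged perceptual
# measure, controlling for the prior year's financial trend.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 400                                       # hypothetical accounts
loyalty_lag = rng.normal(7.0, 1.0, n)         # last year's loyalty score
prior_trend = rng.normal(0.02, 0.05, n)       # prior-year revenue trend
margin_change = (0.6 * prior_trend + 0.01 * (loyalty_lag - 7.0)
                 + rng.normal(0, 0.03, n))    # synthetic outcome

X = sm.add_constant(np.column_stack([prior_trend, loyalty_lag]))
fit = sm.OLS(margin_change, X).fit()
print(fit.summary(xname=["const", "prior_trend", "loyalty_lag"]))
```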

8. Conclusion
In this paper we described our approach to introducing data mining in a large, global chemical company. Due to the unique nature of the company and the dynamic nature of the needs of its constituent businesses, a customized and targeted methodology had to be developed, borrowing beneficial aspects from published methodologies while fine-tuning other aspects to better suit special needs.

Some lessons learned include that data mining is not for the faint of heart in terms of quantitative methods, and that data preparation is an important skill set (sitting between data extractors and data miners). Along the way, we had to deal with the ubiquitous abuse of the term "data mining": it means different things to different people; anyone who has ever manipulated a spreadsheet is a self-proclaimed data mining expert, and data extractors are considered by many to be data miners. In a way, we, the vendors, and the trade journals carry some of the blame: in our zeal to preach the virtues of data mining, we often end up overselling and hyping, which often leads to unrealistic expectations. Some of the misconceptions at high management levels included that only the "big iron" will do, that data mining is only for terabyte-type problems, that it is too esoteric, and that no one within the company knows modeling. We also confirmed our notion that while it is important to leverage external resources as much as possible, it is equally important to have experienced people internally to jump-start the process, oversee projects, and keep the consultants honest.

On the plus side, the widespread use of Six Sigma methods, and the measuring-and-analyzing mindset that it promotes, proved to be a catalyst both for motivating the use of data mining and for facilitating its acceptance.


9. ACKNOWLEDGMENTS
Our thanks to Dorian Pyle of Data Miners Inc. for his assistance in developing and deploying the training program. We would also like to thank Jim Mentele, Tim Pletcher, and Carl Lee of CMURC, as well as our Dow colleagues Andy Paquet, Ken Beebe, Dave Rothman, and Mike Costa.

10. REFERENCES
[1] Anderson, E. W. and Mittal, V. Strengthening the Satisfaction-Profit Chain. Journal of Service Research, Vol. 3, No. 2 (Nov. 2000), 107-120.
[2] Berry, M. J. A. and Linoff, G. S. Data Mining Techniques: For Marketing, Sales, and Customer Support. John Wiley & Sons, 1997, 17-19 and 30-34.
[3] Box, G., Jenkins, G., and Reinsel, G. Time Series Analysis: Forecasting and Control, Third Edition. Pearson Education, Inc., 1994.
[4] Breyfogle, F. W. III. Implementing Six Sigma: Smarter Solutions Using Statistical Methods. Wiley-Interscience, 1999.
[5] Dhar, V. and Stein, R. Seven Methods for Transforming Corporate Data into Business Intelligence. Prentice Hall, 1996.
[6] Dick, A. S. and Basu, K. Customer Loyalty: Toward an Integrated Conceptual Framework. Journal of the Academy of Marketing Science, 22 (2), 1994, 99-113.
[7] Fayyad, U. M., Piatetsky-Shapiro, G., and Smyth, P. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996, 1-34.
[8] Gale, B. T. Managing Customer Value. The Free Press, New York, NY, 1994.
[9] Hand, D. J., Mannila, H., and Smyth, P. Principles of Data Mining. MIT Press, 2001.
[10] Haykin, S. Neural Networks: A Comprehensive Foundation, Second Edition. Prentice Hall, New Jersey, 1999.
[11] Johnson, M. and Gustafsson, A. Improving Customer Satisfaction, Loyalty and Profit: An Integrated Measurement and Management System. Jossey-Bass, San Francisco, 2000.
[12] Lee, C., Mentele, J., Gaver, and Rey, T. D. Structured Neural Network Techniques for Modeling Loyalty and Profitability. SUGI 2005.
[13] Oliver, R. L. Satisfaction: A Behavioral Perspective on the Consumer. McGraw-Hill, New York, 1997.
[14] Pyle, D. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
[15] Pyle, D. Business Modeling and Data Mining. Morgan Kaufmann, 2003.
[16] Reichheld, F. The Loyalty Effect: The Hidden Force Behind Growth, Profits, and Lasting Value. Harvard Business School Press, Boston, 1996.
[17] Rey, T. D. and Johnson, M. Modeling the Connection Between Loyalty and Financial Impact: A Journey. In Earning a Place at the Table, 23rd Annual Marketing Research Conference, American Marketing Association, Chicago, IL, September 8-11, 2002.
[18] Rey, T. D. Tying Customer Loyalty to Financial Impact. In Symposium on Complexity and Advanced Analytics Applied to Business, Government and Public Policy, Society for Industrial and Applied Mathematics, Great Lakes Section, University of Michigan, Dearborn Campus, October 23, 2004.
[19] Rust, R. T., Zahorik, A. J., and Keiningham, T. L. Return on Quality: Measuring the Financial Impact of Your Company's Quest for Quality. Probus Professional Publishing, 1993.
[20] Wang, X. Z. Data Mining and Knowledge Discovery for Process Monitoring and Control (Advances in Industrial Control). Springer Verlag, 1999.
[21] SAS Institute. http://www.sas.com/technologies/analytics/datamining/miner/semma.html, 2005.
[22] Shearer, C. The CRISP-DM Model: The New Blueprint for Data Mining. Journal of Data Warehousing, Vol. 5, No. 4 (2000), 13-22.
[23] Witten, I. H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.
