Optimization Methods in Data Mining
Overview
- Optimization splits into Mathematical Programming and Combinatorial Optimization
- Mathematical Programming: Support Vector Machines and Steepest Descent Search (classification, clustering, etc.; Neural Nets and Bayesian Networks optimize parameters)
- Combinatorial Optimization: Genetic Algorithms (feature selection, classification, clustering)
What is Optimization?
- Formulation: decision variables, objective function, constraints
- Solution: iterative algorithm, improving search
- Problem → Model → Solution (formulation, then algorithm)
Combinatorial Optimization
- Finitely many solutions to choose from: select the best rule from a finite set of rules, or select the best subset of attributes
- Too many solutions to consider them all
- Solutions: branch-and-bound (better than Weka's exhaustive search), random search
Random Search
- Select an initial solution x(0) and let k = 0
- Loop:
  - Consider the neighbors N(x(k)) of x(k)
  - Select a candidate x from N(x(k))
  - Check the acceptance criterion
  - If accepted, let x(k+1) = x; otherwise let x(k+1) = x(k)
- Until the stopping criterion is satisfied (see the sketch below)
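A minimal Python sketch of this loop, under assumptions the slide leaves open: solutions are bit strings, a neighbor differs in one randomly chosen bit, and the acceptance criterion accepts only improvements of an objective f to minimize.

import random

def random_search(f, n_bits, max_iters=1000):
    x = [random.randint(0, 1) for _ in range(n_bits)]  # initial solution x(0)
    fx = f(x)
    for _ in range(max_iters):                         # stopping criterion
        y = x[:]
        y[random.randrange(n_bits)] ^= 1               # candidate from N(x(k))
        fy = f(y)
        if fy < fx:                                    # acceptance criterion
            x, fx = y, fy
    return x, fx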
Common Algorithms
- Simulated Annealing (SA): accept inferior solutions with a probability that decreases as time goes on
- Tabu Search (TS): restrict the neighborhood with a list of solutions that are tabu (that is, cannot be visited) because they were visited recently
- Genetic Algorithm (GA): neighborhoods based on genetic similarity; the most used in data mining applications
Genetic Algorithms
- Maintain a population of solutions rather than a single solution
- Members of the population have a certain fitness (usually just the objective value)
- Survival of the fittest through selection, crossover, and mutation
GA Formulation
- Use binary strings (bits) to encode solutions, e.g., 0 1 1 0 1 0 0 1 0
- Terminology: a chromosome is a solution; parent chromosomes produce children (offspring)
Problems Solved
Data mining problems that have been addressed using genetic algorithms: classification, attribute selection, clustering.
Classification Example
Encode each attribute value with one bit per value:

Outlook: Sunny = 100, Overcast = 010, Rainy = 001
Windy:   Yes = 10, No = 01
Representing a Rule
A rule is the concatenation of the bit patterns, with all bits set to 1 for an unconstrained attribute (assuming the encoding above plus Play: Yes = 10, No = 01):
- "If windy = yes then play = yes" → 111 10 10 (any outlook)
- "If outlook = overcast and windy = yes then play = no" → 010 10 01
Single-Point Crossover
A single crossover point is chosen; the parents exchange the bits after that point to produce two offspring.

Two-Point Crossover
Two crossover points are chosen; the parents exchange the segment between the two points.

Uniform Crossover
Each bit of an offspring is copied from one of the two parents with equal probability. Problem? The operator can be highly disruptive of good bit combinations.

Mutation
A single randomly chosen bit of the parent is inverted (the mutated bit) to produce the offspring.
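Minimal Python sketches of these four operators, assuming parents are equal-length lists of 0/1 bits (at least three bits long for the two-point case):

import random

def single_point(p1, p2):
    c = random.randrange(1, len(p1))            # crossover point
    return p1[:c] + p2[c:], p2[:c] + p1[c:]

def two_point(p1, p2):
    a, b = sorted(random.sample(range(1, len(p1)), 2))
    return (p1[:a] + p2[a:b] + p1[b:],
            p2[:a] + p1[a:b] + p2[b:])

def uniform(p1, p2):
    pairs = [(x, y) if random.random() < 0.5 else (y, x)
             for x, y in zip(p1, p2)]
    o1, o2 = zip(*pairs)
    return list(o1), list(o2)

def mutate(parent):
    child = parent[:]
    i = random.randrange(len(child))            # mutated bit
    child[i] = 1 - child[i]
    return child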
Selection
Which strings in the population should be operated on?
- Rank and select the n fittest ones
- Assign probabilities according to fitness and select probabilistically, say P(x_i) = f(x_i) / Σ_j f(x_j) (fitness-proportionate, or "roulette wheel", selection)
Creating a New Population
Create a population P_new with p individuals:
- Survival: allow individuals from the old population to survive intact. Rate: (1-r)% of the population. How to select the survivors: deterministically or randomly
- Crossover: select fit individuals and create new ones. Rate: r% of the population. How to select?
- Mutation: slightly modify any of the above individuals. Mutation rate: m; a fixed number of mutations versus probabilistic mutation
GA Algorithm
- Randomly generate an initial population P
- Evaluate the fitness f(x_i) of each individual in P
- Repeat:
  - Survival: probabilistically select (1-r)p individuals from P and add them to P_new, according to the selection probabilities above
  - Crossover: probabilistically select rp/2 pairs from P, apply the crossover operator, and add the offspring to P_new
  - Mutation: uniformly choose m percent of the members and invert one randomly selected bit in each
  - Update: P ← P_new
  - Evaluate: compute the fitness f(x_i) of each individual in P
- Return the fittest individual from P (a sketch follows below)
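A compact Python sketch of the whole loop, using fitness-proportionate selection and single-point crossover; the parameter names p, r, m follow the slides, and a nonnegative fitness f to maximize is assumed:

import random

def select(pop, fits):
    # fitness-proportionate (roulette-wheel) selection; assumes fits >= 0
    return random.choices(pop, weights=fits, k=1)[0]

def ga(f, n_bits, p=50, r=0.6, m=0.05, generations=100):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(p)]
    for _ in range(generations):
        fits = [f(x) for x in pop]
        new = [select(pop, fits) for _ in range(int((1 - r) * p))]  # survival
        while len(new) < p:                                         # crossover
            a, b = select(pop, fits), select(pop, fits)
            c = random.randrange(1, n_bits)
            new += [a[:c] + b[c:], b[:c] + a[c:]]
        for x in random.sample(new, int(m * len(new))):             # mutation
            i = random.randrange(n_bits)
            x[i] = 1 - x[i]
        pop = new[:p]                                               # update
    return max(pop, key=f)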
Analysis of GA: Schemas
- Does GA converge? Does GA move towards a good solution? Local optima?
- Holland (1975): analysis based on schemas
- Schema: a string combination of 0s, 1s, and *s. Example: 0*10 represents {0010, 0110}
The Schema Theorem (all the theory on one slide)

E[m(s, t+1)] ≥ m(s, t) · (f(s, t) / f̄(t)) · (1 - p_c · d(s)/(l - 1)) · (1 - p_m)^o(s)

where
- m(s, t) = number of instances of schema s at time t
- f(s, t) = average fitness of individuals in schema s at time t
- f̄(t) = average fitness of the population at time t
- p_c = probability of crossover
- p_m = probability of mutation
- o(s) = number of defined bits in schema s
- d(s) = distance between the outermost defined bits in s
- l = string length
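As an illustrative calculation (all numbers assumed for the example): take string length l = 10, a schema s = 0**1****** with o(s) = 2 defined bits a distance d(s) = 3 apart, p_c = 0.6, and p_m = 0.01. Then

E[m(s, t+1)] ≥ m(s, t) · (f(s, t)/f̄(t)) · (1 - 0.6·3/9) · (0.99)^2 ≈ 0.784 · m(s, t) · (f(s, t)/f̄(t))

so this schema is expected to grow only if its average fitness exceeds the population average by a factor of about 1/0.784 ≈ 1.28.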
Interpretation
- Fit schemas grow in influence
- What is missing? Crossover? Mutation? How about time t + 1?
- Other approaches: Markov chains, statistical mechanics
GA for Feature Selection
- Feature selection: select a subset of the attributes (features)
- Reason: too many attributes, redundant attributes, irrelevant attributes
- The set of all subsets of attributes is very large, with little structure to the search space, so random search methods are a natural fit
Encoding
- Need a bit-code representation
- Given n attributes, each attribute is either in (1) or out (0) of the selected set; e.g., with n = 4, the string 1011 selects attributes 1, 3, and 4
Fitness
- Wrapper approach: apply a learning algorithm, say a decision tree, to the individual x = {outlook, humidity}; let fitness equal the error rate (minimize)
- Filter approach: let fitness equal the entropy (minimize); other diversity measures can also be used. A simplicity measure?
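A sketch of the wrapper fitness in Python, assuming scikit-learn is available and the data is a numeric matrix X with class labels y; the decision tree stands in for any learner:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def wrapper_fitness(bits, X, y):
    cols = [i for i, b in enumerate(bits) if b == 1]
    if not cols:                       # empty subset: worst possible fitness
        return 1.0
    acc = cross_val_score(DecisionTreeClassifier(), X[:, cols], y, cv=5)
    return 1.0 - acc.mean()            # error rate (to be minimized)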
Crossover
Single-point crossover applies directly to attribute-subset bit strings: a crossover point splits the two parent subsets, which exchange tails to form two new subsets.
In Weka
Genetic search for attribute selection is available as the GeneticSearch search method.
Clustering Example
Create two clusters for the instances below. Encoding each instance's cluster membership as one bit, crossover of the parent clusterings {10,20},{30,40} and {20,40},{10,30} (e.g., single-point crossover after the second bit) can produce the offspring {10,20,40},{30} and {20},{10,30,40}.
ID | Outlook  | Temperature | Humidity | Windy | Play
10 | Sunny    | Hot         | High     | True  | No
20 | Overcast | Hot         | High     | False | Yes
30 | Rainy    | Mild        | High     | False | Yes
40 | Rainy    | Cool        | Normal   | False | Yes
Discussion
- GA is a flexible and powerful random search methodology
- Efficiency depends on how well the solutions can be encoded so that they work with the crossover operator
- In data mining, attribute selection is the most natural application
Attribute Selection in Unsupervised Learning
- Attribute selection typically uses a measure, such as accuracy, that is directly related to the class attribute
- How do we apply attribute selection to unsupervised learning, such as clustering?
- Need a measure: compactness of clusters, separation among clusters. Multiple measures
Quality Measures
Compactness: based on the distances between instances and their cluster centroids, aggregated over instances and clusters, with a normalization constant (depending on the number of attributes) that keeps the measure on a fixed scale.
More Quality Measures
Cluster separation: based on the distances between the cluster centroids.
Final Quality Measures
- An adjustment for bias
- A complexity measure
Wrapper Framework
Loop:
- Obtain an attribute subset
- Apply the k-means algorithm
- Evaluate cluster quality
Until the stopping criterion is satisfied. A sketch of this loop follows below.
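A Python sketch of the wrapper loop, assuming scikit-learn's KMeans; the quality function here (mean centroid separation minus mean within-cluster distance) is a stand-in for the slides' compactness/separation measures:

import random
import numpy as np
from sklearn.cluster import KMeans

def cluster_quality(X, labels, centers):
    compact = np.mean([np.linalg.norm(x - centers[c])
                       for x, c in zip(X, labels)])
    sep = np.mean([np.linalg.norm(a - b)
                   for i, a in enumerate(centers)
                   for b in centers[i + 1:]])
    return sep - compact              # higher is better

def wrapper_search(X, k, iters=50):
    n = X.shape[1]
    best, best_q = None, -np.inf
    for _ in range(iters):
        bits = [random.randint(0, 1) for _ in range(n)]   # attribute subset
        cols = [i for i, b in enumerate(bits) if b]
        if not cols:
            continue
        km = KMeans(n_clusters=k, n_init=10).fit(X[:, cols])
        q = cluster_quality(X[:, cols], km.labels_, km.cluster_centers_)
        if q > best_q:
            best, best_q = bits, q
    return best, best_q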
Problem
- What is the optimal attribute subset?
- What is the optimal number of clusters?
- Try to find both simultaneously
Example
Find an attribute subset and the optimal number of clusters (Kmin = 2, Kmax = 3) for:
ID  | Sepal Length | Sepal Width | Petal Length | Petal Width
10  | 5.0 | 3.5 | 1.6 | 0.6
20  | 5.1 | 3.8 | 1.9 | 0.4
30  | 4.8 | 3.0 | 1.4 | 0.3
40  | 5.1 | 3.8 | 1.6 | 0.2
50  | 4.6 | 3.2 | 1.4 | 0.2
60  | 6.5 | 2.8 | 4.6 | 1.5
70  | 5.7 | 2.8 | 4.5 | 1.3
80  | 6.3 | 3.3 | 4.7 | 1.6
90  | 4.9 | 2.4 | 3.3 | 1.0
100 | 6.6 | 2.9 | 4.6 | 1.3
Formulation
Define an individual as a bit string with one bit per attribute (1 = attribute selected), followed by a bit encoding the number of clusters (0 → k = 2, 1 → k = 3).
Initial Population
0 1 0 1 1
1 0 0 1 0
Evaluate Fitness
Start with 0 1 0 1 1: three clusters and the attributes {Sepal Width, Petal Width}. Apply k-means with k = 3 to:
ID  | Sepal Width | Petal Width
10  | 3.5 | 0.6
20  | 3.8 | 0.4
30  | 3.0 | 0.3
40  | 3.8 | 0.2
50  | 3.2 | 0.2
60  | 2.8 | 1.5
70  | 2.8 | 1.3
80  | 3.3 | 1.6
90  | 2.4 | 1.0
100 | 2.9 | 1.3
K-Means
Start with random centroids at instances 10, 70, and 80 (see the sketch below).
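A bare-bones Python/numpy run of k-means on these two attributes, seeded with instances 10, 70, and 80 as on the slide:

import numpy as np

X = np.array([[3.5, 0.6], [3.8, 0.4], [3.0, 0.3], [3.8, 0.2], [3.2, 0.2],
              [2.8, 1.5], [2.8, 1.3], [3.3, 1.6], [2.4, 1.0], [2.9, 1.3]])
centers = X[[0, 6, 7]].copy()          # instances 10, 70, 80

while True:
    # assignment step: nearest centroid for each instance
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # update step: centroid = mean of assigned instances
    new = np.array([X[labels == j].mean(axis=0) for j in range(3)])
    if np.allclose(new, centers):      # no change: terminate
        break
    centers = new

print(labels, centers)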
[Scatter plot: Petal Width vs. Sepal Width for the ten instances]
New Centroids
Recompute the centroids; there is no change in assignment, so the k-means algorithm terminates.
[Scatter plot: the ten instances with centroids C1 = (2.73, 1.28) and C3 = (3.46, 0.34) marked]
Quality of Clusters
Centers:
- Center 1 at (3.46, 0.34): {10, 20, 30, 40, 50}
- Center 2 at (3.30, 1.60): {80}
- Center 3 at (2.73, 1.28): {60, 70, 90, 100}
Evaluation: compute the quality measures (compactness, separation) for this clustering.
Next Individual
Now look at 1 0 0 1 0: two clusters and the attributes {Sepal Length, Petal Width}. Apply k-means with k = 2 to:
ID  | Sepal Length | Petal Width
10  | 5.0 | 0.6
20  | 5.1 | 0.4
30  | 4.8 | 0.3
40  | 5.1 | 0.2
50  | 4.6 | 0.2
60  | 6.5 | 1.5
70  | 5.7 | 1.3
80  | 6.3 | 1.6
90  | 4.9 | 1.0
100 | 6.6 | 1.3
K-Means
Say we select instances 20 and 90 as the initial centroids:
[Scatter plot: Petal Width vs. Sepal Length with initial centroids at instances 20 and 90]
Recalculate Centroids
Instances {10, 20, 30, 40, 50} are assigned to centroid 20 and {60, 70, 80, 90, 100} to centroid 90; the recomputed centroids are (4.92, 0.34) and (6.00, 1.34).
[Scatter plot: the ten instances with recalculated centroids C1 = (4.92, 0.34) and C2 = (6.00, 1.34) marked]
Recalculate Again
Instance 90 is now closer to the first centroid and changes clusters; after recomputing the centroids once more, there is no change in assignment, so the k-means algorithm terminates.
Quality of Clusters
Centers:
- Center 1 at (4.92, 0.45): {10, 20, 30, 40, 50, 90}
- Center 2 at (6.28, 1.43): {60, 70, 80, 100}
Evaluation: compute the quality measures for this clustering.
Compare Individuals
Which is fitter?
Evaluating Fitness
We can scale the individual measures (if necessary) and then weight them together, e.g., f = w_1 · f_compactness + w_2 · f_separation + ...
Alternatively, we can use Pareto optimization and keep the individuals that are not dominated in any measure.
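A small Python helper for the Pareto alternative, assuming each individual carries a vector of quality measures oriented so that higher is better:

def dominates(u, v):
    # u dominates v: at least as good everywhere, strictly better somewhere
    return all(a >= b for a, b in zip(u, v)) and \
           any(a > b for a, b in zip(u, v))

def pareto_front(scored):             # scored: list of (individual, vector)
    return [(x, q) for x, q in scored
            if not any(dominates(q2, q) for _, q2 in scored)]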
Mathematical Programming
- Continuous decision variables
- Constrained versus unconstrained
- Form of the objective function: Linear Programming (LP), Quadratic Programming (QP), General Mathematical Programming (MP)
Linear Program
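In standard form, a linear program has a linear objective and linear constraints:

max c^T x
s.t. A x ≤ b
     x ≥ 0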
Two-Dimensional Problem
The optimum is always at an extreme point of the feasible region.
Simplex Method
Moves from one extreme point of the feasible region to an adjacent, better one until no improving neighbor exists.
Quadratic Programming
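A quadratic program keeps linear constraints but allows a quadratic objective, typically written:

min (1/2) x^T Q x + c^T x
s.t. A x ≤ b

The plot below shows a one-dimensional quadratic objective.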
[Plot: f(x) = 0.2 + (x-1)^2 for x in [0, 2]; the minimum value 0.2 is attained at x = 1]
General MP
The derivative being zero is a necessary but not a sufficient condition for an unconstrained minimum. What about a constrained problem? There the optimum may lie on the boundary of the feasible region, where the derivative need not vanish.
General MP
We write a general mathematical program in matrix notation as:
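One common statement (one of several equivalent forms):

min f(x)
s.t. g(x) ≤ 0

where f: R^n → R is the objective and g: R^n → R^m collects the constraint functions.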
Karush-Kuhn-Tucker (KKT) Conditions
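For the program above, with f and g differentiable, the standard first-order KKT conditions at a candidate minimum x* require a multiplier vector λ ∈ R^m such that:

∇f(x*) + Σ_i λ_i ∇g_i(x*) = 0     (stationarity)
g(x*) ≤ 0                          (primal feasibility)
λ ≥ 0                              (dual feasibility)
λ_i g_i(x*) = 0 for all i          (complementary slackness)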
Convex Sets
A set C is convex if any line segment connecting two points in the set lies completely within the set, that is, λx + (1-λ)y ∈ C for all x, y ∈ C and all λ ∈ [0, 1].

Convex Hull
The convex hull co(S) of a set S is the intersection of all convex sets containing S. A set V ⊆ R^n is a linear variety if λx + (1-λ)y ∈ V for all x, y ∈ V and all λ ∈ R.

Hyperplane
A hyperplane in R^n is an (n-1)-dimensional linear variety.
Convex Hull Example
[Plot: Play and No Play instances in the Temperature-Humidity plane, with the convex hull of each class drawn]
Finding the Closest Points
Formulate as a QP:
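One standard way to write this QP (notation assumed here: points x_i in one class, z_j in the other; the variables are convex-combination weights): find the closest pair of points, one in each convex hull:

min || Σ_i α_i x_i - Σ_j β_j z_j ||^2
s.t. Σ_i α_i = 1, Σ_j β_j = 1, α ≥ 0, β ≥ 0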
Support Vector Machines
[Plot: Play and No Play instances in the Temperature-Humidity plane, with a separating hyperplane between the two classes]
Example
ID  | Sepal Width | Petal Width
10  | 3.5 | 0.6
20  | 3.8 | 0.4
30  | 3.0 | 0.3
40  | 3.8 | 0.2
50  | 3.2 | 0.2
60  | 2.8 | 1.5
70  | 2.8 | 1.3
80  | 3.3 | 1.6
90  | 2.4 | 1.0
100 | 2.9 | 1.3
Separating Hyperplane
[Scatter plot: Petal Width vs. Sepal Width for the ten instances, with a separating hyperplane between the two groups]
Assume Separating Planes
Constraints (in standard notation, with class labels y_i ∈ {-1, +1}):
w^T x_i + b ≥ +1 for instances with y_i = +1
w^T x_i + b ≤ -1 for instances with y_i = -1

Distance to each plane: 1/||w||, so the margin between the two planes is 2/||w||.

Optimization Problem
Maximize the margin: min (1/2)||w||^2 subject to y_i (w^T x_i + b) ≥ 1 for all i.
How Do We Solve MPs?
Improving Search
Direction-step approach: from the current solution x(k), pick a search direction d(k) and a step size λ.

New Solution
x(k+1) = x(k) + λ · d(k)
Steepest Descent
Search direction equal to the negative gradient: d(k) = -∇f(x(k)).
Finding λ is a one-dimensional optimization problem: minimize g(λ) = f(x(k) - λ∇f(x(k))).
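A Python sketch of steepest descent with a crude grid line search, assuming a differentiable objective f and its gradient grad are supplied as callables:

import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, max_iters=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        d = -grad(x)                          # search direction
        if np.linalg.norm(d) < tol:
            break
        # one-dimensional subproblem: pick the best step from a coarse grid
        lams = np.logspace(-4, 0, 20)
        lam = min(lams, key=lambda t: f(x + t * d))
        x = x + lam * d
    return x

# example: minimize f(x) = 0.2 + (x - 1)^2 from the earlier slide
print(steepest_descent(lambda x: 0.2 + (x[0] - 1)**2,
                       lambda x: np.array([2 * (x[0] - 1)]),
                       [0.0]))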
Newton's Method
Taylor series expansion around x(k):
f(x) ≈ f(x(k)) + ∇f(x(k))^T (x - x(k)) + (1/2)(x - x(k))^T H(x(k)) (x - x(k))
The right-hand side is minimized at
x(k+1) = x(k) - H(x(k))^(-1) ∇f(x(k))
where H is the Hessian matrix.
Discussion
- Computing the inverse Hessian is difficult; quasi-Newton and conjugate gradient methods approximate or avoid it
- These methods do not account for constraints; penalty methods, Lagrangian methods, etc., handle them
Non-Separable Data
Add an error term ξ_i ≥ 0 to the constraints, y_i (w^T x_i + b) ≥ 1 - ξ_i, and penalize the total error in the objective, e.g., min (1/2)||w||^2 + C Σ_i ξ_i.
Wolfe Dual
Simple constraints; the only place the data appears is in dot products.
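In standard SVM notation, the Wolfe dual of the soft-margin problem is:

max Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
s.t. Σ_i α_i y_i = 0
     0 ≤ α_i ≤ C

The constraints are simple bounds plus one equality, and the training data enters only through the dot products x_i^T x_j.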
Extension to Non-Linear Models
Kernel functions: a mapping Φ into a high-dimensional Hilbert space; the kernel K(x_i, x_j) = Φ(x_i)^T Φ(x_j) takes the place of the dot product in the Wolfe dual.
Some Possible Kernels
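Standard examples include:

Polynomial:  K(x, z) = (x^T z + 1)^d
RBF:         K(x, z) = exp(-||x - z||^2 / (2σ^2))
Sigmoid:     K(x, z) = tanh(κ x^T z - δ)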
In Weka
weka.classifiers.SMO: a support vector machine (for nominal data only); does both linear and non-linear models.
Optimization in DM
Bayesian Classification
- Naïve Bayes assumes independence between attributes: simple computations, and the best classifier if the assumption is true
- Bayesian belief networks: joint probability distributions on a directed acyclic graph; nodes are random variables (attributes), arcs represent the dependencies
Example: Bayesian Network
Nodes: Family History, Smoker, Lung Cancer, Emphysema, Positive X-Ray, Dyspnea. For instance, Lung Cancer has parents Family History and Smoker (see the CPT example below).
Conditional Probabilities
Each node stores the probability of each outcome of its random variable given the values of its parents. The node representing the class attribute is called the output node.
How Do We Learn?
- Network structure: given/known, or inferred/learned from the data
- Variables: observable, or hidden (missing values / incomplete data)
Case 1: Known Structure and Observable Variables
Straightforward, similar to Naïve Bayes: compute the entries of the conditional probability table (CPT) of each variable.
Case 2: Known Structure and Some Hidden Variables
We still need to learn the CPT entries. Let S be a set of s training instances X_1, ..., X_s, and let w_ijk be the CPT entry for the variable Y_i = y_ij having parents U_i = u_ik.
CPT Example
CPT for Lung Cancer (LC), with parents Family History (FH) and Smoker (S):

    | FH,S | FH,~S | ~FH,S | ~FH,~S
LC  | 0.8  | 0.5   | 0.7   | 0.1
~LC | 0.2  | 0.5   | 0.3   | 0.9
Objective
We must find the values of the w_ijk. The objective is to maximize the likelihood of the data, that is, max_w P(S | w). How do we do this?
Non-Linear MP
Compute the gradients of the log-likelihood with respect to the w_ijk and move in the direction of the gradient, with the required probabilities estimated from the training data and a learning rate controlling the step size (see the update below).
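In the standard gradient-ascent treatment of this problem, the gradient and update are (l is the learning rate):

∂ ln P(S | w) / ∂w_ijk = Σ_{d=1..s} P(Y_i = y_ij, U_i = u_ik | X_d) / w_ijk

w_ijk ← w_ijk + l · ∂ ln P(S | w) / ∂w_ijk

with the probabilities computed from the training data, and the w_ijk renormalized after each step so that each CPT column sums to one.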
Case 3: Unknown Network Structure
We need to find/learn the optimal network structure for the data. What type of optimization problem is this? Combinatorial optimization (GA, etc.).