Optimization Methods in Data Mining
Overview
- Optimization splits into Mathematical Programming and Combinatorial Optimization
- Mathematical Programming: Support Vector Machines and Steepest Descent Search (classification, clustering, etc.; Neural Nets and Bayesian Networks optimize parameters)
- Combinatorial Optimization: Genetic Algorithms (feature selection, classification, clustering)
What is Optimization?
- Formulation: decision variables, objective function, constraints
- Solution: iterative algorithm, improving search
- Problem → Model → Solution (formulation, then algorithm)
Combinatorial Optimization
- Finitely many solutions to choose from: select the best rule from a finite set of rules, or select the best subset of attributes
- Too many solutions to consider them all
- Solutions: branch-and-bound (better than Weka's exhaustive search), random search
Random Search
- Select an initial solution x(0) and let k = 0
- Loop:
  - Consider the neighbors N(x(k)) of x(k)
  - Select a candidate x from N(x(k))
  - Check the acceptance criterion
  - If accepted, let x(k+1) = x; otherwise let x(k+1) = x(k)
- Until the stopping criterion is satisfied (see the sketch below)
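A minimal Python sketch of this loop, under assumptions the slide leaves open: solutions are bit strings, a neighbor differs in one randomly chosen bit, and the acceptance criterion accepts only improvements of an objective f to minimize.

import random

def random_search(f, n_bits, max_iters=1000):
    x = [random.randint(0, 1) for _ in range(n_bits)]  # initial solution x(0)
    fx = f(x)
    for _ in range(max_iters):                         # stopping criterion
        y = x[:]
        y[random.randrange(n_bits)] ^= 1               # candidate from N(x(k))
        fy = f(y)
        if fy < fx:                                    # acceptance criterion
            x, fx = y, fy
    return x, fx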
Common Algorithms
- Simulated Annealing (SA): accept inferior solutions with a probability that decreases as time goes on
- Tabu Search (TS): restrict the neighborhood with a list of solutions that are tabu (that is, cannot be visited) because they were visited recently
- Genetic Algorithm (GA): neighborhoods based on genetic similarity; the most used in data mining applications
Genetic Algorithms
- Maintain a population of solutions rather than a single solution
- Members of the population have a certain fitness (usually just the objective value)
- Survival of the fittest through selection, crossover, and mutation
GA Formulation
- Use binary strings (bits) to encode solutions, e.g., 0 1 1 0 1 0 0 1 0
- Terminology: a chromosome is a solution; parent chromosomes produce children (offspring)
Problems Solved
Data mining problems that have been addressed using genetic algorithms: classification, attribute selection, clustering.
Classification Example
Encode each attribute value with one bit per value:

Outlook: Sunny = 100, Overcast = 010, Rainy = 001
Windy:   Yes = 10, No = 01
Representing a Rule
A rule is the concatenation of the bit patterns, with all bits set to 1 for an unconstrained attribute (assuming the encoding above plus Play: Yes = 10, No = 01):
- "If windy = yes then play = yes" → 111 10 10 (any outlook)
- "If outlook = overcast and windy = yes then play = no" → 010 10 01
Single-Point Crossover
A single crossover point is chosen; the parents exchange the bits after that point to produce two offspring.

Two-Point Crossover
Two crossover points are chosen; the parents exchange the segment between the two points.

Uniform Crossover
Each bit of an offspring is copied from one of the two parents with equal probability. Problem? The operator can be highly disruptive of good bit combinations.

Mutation
A single randomly chosen bit of the parent is inverted (the mutated bit) to produce the offspring.
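Minimal Python sketches of these four operators, assuming parents are equal-length lists of 0/1 bits (at least three bits long for the two-point case):

import random

def single_point(p1, p2):
    c = random.randrange(1, len(p1))            # crossover point
    return p1[:c] + p2[c:], p2[:c] + p1[c:]

def two_point(p1, p2):
    a, b = sorted(random.sample(range(1, len(p1)), 2))
    return (p1[:a] + p2[a:b] + p1[b:],
            p2[:a] + p1[a:b] + p2[b:])

def uniform(p1, p2):
    pairs = [(x, y) if random.random() < 0.5 else (y, x)
             for x, y in zip(p1, p2)]
    o1, o2 = zip(*pairs)
    return list(o1), list(o2)

def mutate(parent):
    child = parent[:]
    i = random.randrange(len(child))            # mutated bit
    child[i] = 1 - child[i]
    return child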
Selection
Which strings in the population should be operated on?
- Rank and select the n fittest ones
- Assign probabilities according to fitness and select probabilistically, say P(x_i) = f(x_i) / Σ_j f(x_j) (fitness-proportionate, or "roulette wheel", selection)
Creating a New Population
Create a population P_new with p individuals:
- Survival: allow individuals from the old population to survive intact. Rate: (1-r)% of the population. How to select the survivors: deterministically or randomly
- Crossover: select fit individuals and create new ones. Rate: r% of the population. How to select?
- Mutation: slightly modify any of the above individuals. Mutation rate: m; a fixed number of mutations versus probabilistic mutation
GA Algorithm
- Randomly generate an initial population P
- Evaluate the fitness f(x_i) of each individual in P
- Repeat:
  - Survival: probabilistically select (1-r)p individuals from P and add them to P_new, according to the selection probabilities above
  - Crossover: probabilistically select rp/2 pairs from P, apply the crossover operator, and add the offspring to P_new
  - Mutation: uniformly choose m percent of the members and invert one randomly selected bit in each
  - Update: P ← P_new
  - Evaluate: compute the fitness f(x_i) of each individual in P
- Return the fittest individual from P (a sketch follows below)
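A compact Python sketch of the whole loop, using fitness-proportionate selection and single-point crossover; the parameter names p, r, m follow the slides, and a nonnegative fitness f to maximize is assumed:

import random

def select(pop, fits):
    # fitness-proportionate (roulette-wheel) selection; assumes fits >= 0
    return random.choices(pop, weights=fits, k=1)[0]

def ga(f, n_bits, p=50, r=0.6, m=0.05, generations=100):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(p)]
    for _ in range(generations):
        fits = [f(x) for x in pop]
        new = [select(pop, fits) for _ in range(int((1 - r) * p))]  # survival
        while len(new) < p:                                         # crossover
            a, b = select(pop, fits), select(pop, fits)
            c = random.randrange(1, n_bits)
            new += [a[:c] + b[c:], b[:c] + a[c:]]
        for x in random.sample(new, int(m * len(new))):             # mutation
            i = random.randrange(n_bits)
            x[i] = 1 - x[i]
        pop = new[:p]                                               # update
    return max(pop, key=f)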
Analysis of GA: Schemas
- Does GA converge? Does GA move towards a good solution? Local optima?
- Holland (1975): analysis based on schemas
- Schema: a string combination of 0s, 1s, and *s. Example: 0*10 represents {0010, 0110}
The Schema Theorem (all the theory on one slide)

E[m(s, t+1)] ≥ m(s, t) · (f(s, t) / f̄(t)) · (1 - p_c · d(s)/(l - 1)) · (1 - p_m)^o(s)

where
- m(s, t) = number of instances of schema s at time t
- f(s, t) = average fitness of individuals in schema s at time t
- f̄(t) = average fitness of the population at time t
- p_c = probability of crossover
- p_m = probability of mutation
- o(s) = number of defined bits in schema s
- d(s) = distance between the outermost defined bits in s
- l = string length
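As an illustrative calculation (all numbers assumed for the example): take string length l = 10, a schema s = 0**1****** with o(s) = 2 defined bits a distance d(s) = 3 apart, p_c = 0.6, and p_m = 0.01. Then

E[m(s, t+1)] ≥ m(s, t) · (f(s, t)/f̄(t)) · (1 - 0.6·3/9) · (0.99)^2 ≈ 0.784 · m(s, t) · (f(s, t)/f̄(t))

so this schema is expected to grow only if its average fitness exceeds the population average by a factor of about 1/0.784 ≈ 1.28.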
Interpretation
- Fit schemas grow in influence
- What is missing? Crossover? Mutation? How about time t + 1?
- Other approaches: Markov chains, statistical mechanics
GA for Feature Selection
- Feature selection: select a subset of the attributes (features)
- Reason: too many attributes, redundant attributes, irrelevant attributes
- The set of all subsets of attributes is very large, with little structure to the search space, so random search methods are a natural fit
Encoding
- Need a bit-code representation
- Given n attributes, each attribute is either in (1) or out (0) of the selected set; e.g., with n = 4, the string 1011 selects attributes 1, 3, and 4
Fitness
- Wrapper approach: apply a learning algorithm, say a decision tree, to the individual x = {outlook, humidity}; let fitness equal the error rate (minimize)
- Filter approach: let fitness equal the entropy (minimize); other diversity measures can also be used. A simplicity measure?
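A sketch of the wrapper fitness in Python, assuming scikit-learn is available and the data is a numeric matrix X with class labels y; the decision tree stands in for any learner:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def wrapper_fitness(bits, X, y):
    cols = [i for i, b in enumerate(bits) if b == 1]
    if not cols:                       # empty subset: worst possible fitness
        return 1.0
    acc = cross_val_score(DecisionTreeClassifier(), X[:, cols], y, cv=5)
    return 1.0 - acc.mean()            # error rate (to be minimized)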
Crossover
Single-point crossover applies directly to attribute-subset bit strings: a crossover point splits the two parent subsets, which exchange tails to form two new subsets.
In Weka
Genetic search for attribute selection is available as the GeneticSearch search method.
Clustering Example
Create two clusters for the instances below. Encoding each instance's cluster membership as one bit, crossover of the parent clusterings {10,20},{30,40} and {20,40},{10,30} (e.g., single-point crossover after the second bit) can produce the offspring {10,20,40},{30} and {20},{10,30,40}.
ID | Outlook  | Temperature | Humidity | Windy | Play
10 | Sunny    | Hot         | High     | True  | No
20 | Overcast | Hot         | High     | False | Yes
30 | Rainy    | Mild        | High     | False | Yes
40 | Rainy    | Cool        | Normal   | False | Yes
Discussion
- GA is a flexible and powerful random search methodology
- Efficiency depends on how well the solutions can be encoded so that they work with the crossover operator
- In data mining, attribute selection is the most natural application
Attribute Selection in Unsupervised Learning
- Attribute selection typically uses a measure, such as accuracy, that is directly related to the class attribute
- How do we apply attribute selection to unsupervised learning, such as clustering?
- Need a measure: compactness of clusters, separation among clusters. Multiple measures
Quality Measures
Compactness: based on the distances between instances and their cluster centroids, aggregated over instances and clusters, with a normalization constant (depending on the number of attributes) that keeps the measure on a fixed scale.
More Quality Measures
Cluster separation: based on the distances between the cluster centroids.
Final Quality Measures
- An adjustment for bias
- A complexity measure
Wrapper Framework
Loop:
- Obtain an attribute subset
- Apply the k-means algorithm
- Evaluate cluster quality
Until the stopping criterion is satisfied. A sketch of this loop follows below.
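A Python sketch of the wrapper loop, assuming scikit-learn's KMeans; the quality function here (mean centroid separation minus mean within-cluster distance) is a stand-in for the slides' compactness/separation measures:

import random
import numpy as np
from sklearn.cluster import KMeans

def cluster_quality(X, labels, centers):
    compact = np.mean([np.linalg.norm(x - centers[c])
                       for x, c in zip(X, labels)])
    sep = np.mean([np.linalg.norm(a - b)
                   for i, a in enumerate(centers)
                   for b in centers[i + 1:]])
    return sep - compact              # higher is better

def wrapper_search(X, k, iters=50):
    n = X.shape[1]
    best, best_q = None, -np.inf
    for _ in range(iters):
        bits = [random.randint(0, 1) for _ in range(n)]   # attribute subset
        cols = [i for i, b in enumerate(bits) if b]
        if not cols:
            continue
        km = KMeans(n_clusters=k, n_init=10).fit(X[:, cols])
        q = cluster_quality(X[:, cols], km.labels_, km.cluster_centers_)
        if q > best_q:
            best, best_q = bits, q
    return best, best_q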
Problem
- What is the optimal attribute subset?
- What is the optimal number of clusters?
- Try to find both simultaneously
Example
Find an attribute subset and the optimal number of clusters (Kmin = 2, Kmax = 3) for:
ID  | Sepal Length | Sepal Width | Petal Length | Petal Width
10  | 5.0 | 3.5 | 1.6 | 0.6
20  | 5.1 | 3.8 | 1.9 | 0.4
30  | 4.8 | 3.0 | 1.4 | 0.3
40  | 5.1 | 3.8 | 1.6 | 0.2
50  | 4.6 | 3.2 | 1.4 | 0.2
60  | 6.5 | 2.8 | 4.6 | 1.5
70  | 5.7 | 2.8 | 4.5 | 1.3
80  | 6.3 | 3.3 | 4.7 | 1.6
90  | 4.9 | 2.4 | 3.3 | 1.0
100 | 6.6 | 2.9 | 4.6 | 1.3
Formulation
Define an individual as a bit string with one bit per attribute (1 = attribute selected), followed by a bit encoding the number of clusters (0 → k = 2, 1 → k = 3).
Initial Population
0 1 0 1 1
1 0 0 1 0
Evaluate Fitness
Start with 0 1 0 1 1: three clusters and the attributes {Sepal Width, Petal Width}. Apply k-means with k = 3 to:
ID  | Sepal Width | Petal Width
10  | 3.5 | 0.6
20  | 3.8 | 0.4
30  | 3.0 | 0.3
40  | 3.8 | 0.2
50  | 3.2 | 0.2
60  | 2.8 | 1.5
70  | 2.8 | 1.3
80  | 3.3 | 1.6
90  | 2.4 | 1.0
100 | 2.9 | 1.3
K-Means
Start with random centroids at instances 10, 70, and 80 (see the sketch below).
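A bare-bones Python/numpy run of k-means on these two attributes, seeded with instances 10, 70, and 80 as on the slide:

import numpy as np

X = np.array([[3.5, 0.6], [3.8, 0.4], [3.0, 0.3], [3.8, 0.2], [3.2, 0.2],
              [2.8, 1.5], [2.8, 1.3], [3.3, 1.6], [2.4, 1.0], [2.9, 1.3]])
centers = X[[0, 6, 7]].copy()          # instances 10, 70, 80

while True:
    # assignment step: nearest centroid for each instance
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # update step: centroid = mean of assigned instances
    new = np.array([X[labels == j].mean(axis=0) for j in range(3)])
    if np.allclose(new, centers):      # no change: terminate
        break
    centers = new

print(labels, centers)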
[Scatter plot: Petal Width vs. Sepal Width for the ten instances]
New Centroids
Recompute the centroids; there is no change in assignment, so the k-means algorithm terminates.
[Scatter plot: the ten instances with centroids C1 = (2.73, 1.28) and C3 = (3.46, 0.34) marked]
Quality of Clusters
Centers:
- Center 1 at (3.46, 0.34): {10, 20, 30, 40, 50}
- Center 2 at (3.30, 1.60): {80}
- Center 3 at (2.73, 1.28): {60, 70, 90, 100}
Evaluation: compute the quality measures (compactness, separation) for this clustering.
Next Individual
Now look at 1 0 0 1 0: two clusters and the attributes {Sepal Length, Petal Width}. Apply k-means with k = 2 to:
ID  | Sepal Length | Petal Width
10  | 5.0 | 0.6
20  | 5.1 | 0.4
30  | 4.8 | 0.3
40  | 5.1 | 0.2
50  | 4.6 | 0.2
60  | 6.5 | 1.5
70  | 5.7 | 1.3
80  | 6.3 | 1.6
90  | 4.9 | 1.0
100 | 6.6 | 1.3
K-Means
Say we select instances 20 and 90 as the initial centroids:
[Scatter plot: Petal Width vs. Sepal Length with initial centroids at instances 20 and 90]
Recalculate Centroids
Instances {10, 20, 30, 40, 50} are assigned to centroid 20 and {60, 70, 80, 90, 100} to centroid 90; the recomputed centroids are (4.92, 0.34) and (6.00, 1.34).
[Scatter plot: the ten instances with recalculated centroids C1 = (4.92, 0.34) and C2 = (6.00, 1.34) marked]
Recalculate Again
Instance 90 is now closer to the first centroid and changes clusters; after recomputing the centroids once more, there is no change in assignment, so the k-means algorithm terminates.
Quality of Clusters
Centers:
- Center 1 at (4.92, 0.45): {10, 20, 30, 40, 50, 90}
- Center 2 at (6.28, 1.43): {60, 70, 80, 100}
Evaluation: compute the quality measures for this clustering.
Compare Individuals
Which is fitter?
Evaluating Fitness
We can scale the individual measures (if necessary) and then weight them together, e.g., f = w_1 · f_compactness + w_2 · f_separation + ...
Alternatively, we can use Pareto optimization and keep the individuals that are not dominated in any measure.
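A small Python helper for the Pareto alternative, assuming each individual carries a vector of quality measures oriented so that higher is better:

def dominates(u, v):
    # u dominates v: at least as good everywhere, strictly better somewhere
    return all(a >= b for a, b in zip(u, v)) and \
           any(a > b for a, b in zip(u, v))

def pareto_front(scored):             # scored: list of (individual, vector)
    return [(x, q) for x, q in scored
            if not any(dominates(q2, q) for _, q2 in scored)]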
Mathematical Programming
- Continuous decision variables
- Constrained versus unconstrained
- Form of the objective function: Linear Programming (LP), Quadratic Programming (QP), General Mathematical Programming (MP)
Linear Program
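In standard form, a linear program has a linear objective and linear constraints:

max c^T x
s.t. A x ≤ b
     x ≥ 0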
Two-Dimensional Problem
The optimum is always at an extreme point of the feasible region.
Simplex Method
Moves from one extreme point of the feasible region to an adjacent, better one until no improving neighbor exists.
Quadratic Programming
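A quadratic program keeps linear constraints but allows a quadratic objective, typically written:

min (1/2) x^T Q x + c^T x
s.t. A x ≤ b

The plot below shows a one-dimensional quadratic objective.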
[Plot: f(x) = 0.2 + (x-1)^2 for x in [0, 2]; the minimum value 0.2 is attained at x = 1]
General MP
The derivative being zero is a necessary but not a sufficient condition for an unconstrained minimum. What about a constrained problem? There the optimum may lie on the boundary of the feasible region, where the derivative need not vanish.
General MP
We write a general mathematical program in matrix notation as:
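One common statement (one of several equivalent forms):

min f(x)
s.t. g(x) ≤ 0

where f: R^n → R is the objective and g: R^n → R^m collects the constraint functions.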
Karush-Kuhn-Tucker (KKT) Conditions
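For the program above, with f and g differentiable, the standard first-order KKT conditions at a candidate minimum x* require a multiplier vector λ ∈ R^m such that:

∇f(x*) + Σ_i λ_i ∇g_i(x*) = 0     (stationarity)
g(x*) ≤ 0                          (primal feasibility)
λ ≥ 0                              (dual feasibility)
λ_i g_i(x*) = 0 for all i          (complementary slackness)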
Convex Sets
A set C is convex if any line segment connecting two points in the set lies completely within the set, that is, λx + (1-λ)y ∈ C for all x, y ∈ C and all λ ∈ [0, 1].

Convex Hull
The convex hull co(S) of a set S is the intersection of all convex sets containing S. A set V ⊆ R^n is a linear variety if λx + (1-λ)y ∈ V for all x, y ∈ V and all λ ∈ R.

Hyperplane
A hyperplane in R^n is an (n-1)-dimensional linear variety.
Convex Hull Example
[Plot: Play and No Play instances in the Temperature-Humidity plane, with the convex hull of each class drawn]
Finding the Closest Points
Formulate as a QP:
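One standard way to write this QP (notation assumed here: points x_i in one class, z_j in the other; the variables are convex-combination weights): find the closest pair of points, one in each convex hull:

min || Σ_i α_i x_i - Σ_j β_j z_j ||^2
s.t. Σ_i α_i = 1, Σ_j β_j = 1, α ≥ 0, β ≥ 0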
Support Vector Machines
[Plot: Play and No Play instances in the Temperature-Humidity plane, with a separating hyperplane between the two classes]
Example
ID  | Sepal Width | Petal Width
10  | 3.5 | 0.6
20  | 3.8 | 0.4
30  | 3.0 | 0.3
40  | 3.8 | 0.2
50  | 3.2 | 0.2
60  | 2.8 | 1.5
70  | 2.8 | 1.3
80  | 3.3 | 1.6
90  | 2.4 | 1.0
100 | 2.9 | 1.3
Separating Hyperplane
[Scatter plot: Petal Width vs. Sepal Width for the ten instances, with a separating hyperplane between the two groups]
Assume Separating Planes
Constraints (in standard notation, with class labels y_i ∈ {-1, +1}):
w^T x_i + b ≥ +1 for instances with y_i = +1
w^T x_i + b ≤ -1 for instances with y_i = -1

Distance to each plane: 1/||w||, so the margin between the two planes is 2/||w||.

Optimization Problem
Maximize the margin: min (1/2)||w||^2 subject to y_i (w^T x_i + b) ≥ 1 for all i.
How Do We Solve MPs?
Improving Search
Direction-step approach: from the current solution x(k), pick a search direction d(k) and a step size λ.

New Solution
x(k+1) = x(k) + λ · d(k)
Steepest Descent
Search direction equal to the negative gradient: d(k) = -∇f(x(k)).
Finding λ is a one-dimensional optimization problem: minimize g(λ) = f(x(k) - λ∇f(x(k))).
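A Python sketch of steepest descent with a crude grid line search, assuming a differentiable objective f and its gradient grad are supplied as callables:

import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, max_iters=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        d = -grad(x)                          # search direction
        if np.linalg.norm(d) < tol:
            break
        # one-dimensional subproblem: pick the best step from a coarse grid
        lams = np.logspace(-4, 0, 20)
        lam = min(lams, key=lambda t: f(x + t * d))
        x = x + lam * d
    return x

# example: minimize f(x) = 0.2 + (x - 1)^2 from the earlier slide
print(steepest_descent(lambda x: 0.2 + (x[0] - 1)**2,
                       lambda x: np.array([2 * (x[0] - 1)]),
                       [0.0]))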
Newton's Method
Taylor series expansion around x(k):
f(x) ≈ f(x(k)) + ∇f(x(k))^T (x - x(k)) + (1/2)(x - x(k))^T H(x(k)) (x - x(k))
The right-hand side is minimized at
x(k+1) = x(k) - H(x(k))^(-1) ∇f(x(k))
where H is the Hessian matrix.
Discussion
- Computing the inverse Hessian is difficult; quasi-Newton and conjugate gradient methods approximate or avoid it
- These methods do not account for constraints; penalty methods, Lagrangian methods, etc., handle them
Non-Separable Data
Add an error term ξ_i ≥ 0 to the constraints, y_i (w^T x_i + b) ≥ 1 - ξ_i, and penalize the total error in the objective, e.g., min (1/2)||w||^2 + C Σ_i ξ_i.
Wolfe Dual
Simple constraints; the only place the data appears is in dot products.
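In standard SVM notation, the Wolfe dual of the soft-margin problem is:

max Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
s.t. Σ_i α_i y_i = 0
     0 ≤ α_i ≤ C

The constraints are simple bounds plus one equality, and the training data enters only through the dot products x_i^T x_j.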
Extension to Non-Linear Models
Kernel functions: a mapping Φ into a high-dimensional Hilbert space; the kernel K(x_i, x_j) = Φ(x_i)^T Φ(x_j) takes the place of the dot product in the Wolfe dual.
Some Possible Kernels
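Standard examples include:

Polynomial:  K(x, z) = (x^T z + 1)^d
RBF:         K(x, z) = exp(-||x - z||^2 / (2σ^2))
Sigmoid:     K(x, z) = tanh(κ x^T z - δ)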
In Weka
weka.classifiers.SMO: a support vector machine (for nominal data only); does both linear and non-linear models.
Optimization in DM
Bayesian Classification
- Naïve Bayes assumes independence between attributes: simple computations, and the best classifier if the assumption is true
- Bayesian belief networks: joint probability distributions on a directed acyclic graph; nodes are random variables (attributes), arcs represent the dependencies
Example: Bayesian Network
Nodes: Family History, Smoker, Lung Cancer, Emphysema, Positive X-Ray, Dyspnea. For instance, Lung Cancer has parents Family History and Smoker (see the CPT example below).
Conditional Probabilities
Each node stores the probability of each outcome of its random variable given the values of its parents. The node representing the class attribute is called the output node.
How Do We Learn?
- Network structure: given/known, or inferred/learned from the data
- Variables: observable, or hidden (missing values / incomplete data)
Case 1: Known Structure and Observable Variables
Straightforward, similar to Naïve Bayes: compute the entries of the conditional probability table (CPT) of each variable.
Case 2: Known Structure and Some Hidden Variables
We still need to learn the CPT entries. Let S be a set of s training instances X_1, ..., X_s, and let w_ijk be the CPT entry for the variable Y_i = y_ij having parents U_i = u_ik.
CPT Example
CPT for Lung Cancer (LC), with parents Family History (FH) and Smoker (S):

    | FH,S | FH,~S | ~FH,S | ~FH,~S
LC  | 0.8  | 0.5   | 0.7   | 0.1
~LC | 0.2  | 0.5   | 0.3   | 0.9
Objective
We must find the values of the w_ijk. The objective is to maximize the likelihood of the data, that is, max_w P(S | w). How do we do this?
Non-Linear MP
Compute the gradients of the log-likelihood with respect to the w_ijk and move in the direction of the gradient, with the required probabilities estimated from the training data and a learning rate controlling the step size (see the update below).
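In the standard gradient-ascent treatment of this problem, the gradient and update are (l is the learning rate):

∂ ln P(S | w) / ∂w_ijk = Σ_{d=1..s} P(Y_i = y_ij, U_i = u_ik | X_d) / w_ijk

w_ijk ← w_ijk + l · ∂ ln P(S | w) / ∂w_ijk

with the probabilities computed from the training data, and the w_ijk renormalized after each step so that each CPT column sums to one.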
Case 3: Unknown Network Structure
We need to find/learn the optimal network structure for the data. What type of optimization problem is this? Combinatorial optimization (GA, etc.).