Gerald Friedland, http://www.gerald-friedland.org Gerald Friedland (UC Berkeley) Experimental Design for Machine Learning Paper, Demo, etc: https://tfmeter.icsi.berkeley.edu Commercial tool: http://brainome.ai

Experimental Design for Machine Learning · deeplearning.cs.cmu.edu/F20/document/slides/Gerald_friedland.pdf · Machine learning adapts a finite state machine M to an unknown function


Page 1

Gerald Friedland, http://www.gerald-friedland.org

Gerald Friedland (UC Berkeley)

Experimental Design for Machine Learning

Paper, Demo, etc: https://tfmeter.icsi.berkeley.edu

Commercial tool: http://brainome.ai

Page 2

About me…

▪ Adjunct Faculty, UC Berkeley

▪ Data Scientist at National Lab

▪ Started work in Machine Learning in 2001

Page 3

Start of this work: Simple Question

▪ How much money (CPU time, memory, I/O) do I need to budget for my deep learning experiment?

▪ State of the art: No answer. For example, ImageNet models vary significantly:

▪ AlexNet: 238 MB model, 2.27 Bn Ops

▪ DarkNet: 28 MB model, 0.96 Bn Ops

▪ VGG-16: 528 MB, 30.94 Bn Ops

Source: https://pjreddie.com/darknet/imagenet/

Page 4

A game…

▪ Continue the sequence:

▪ 2, 4, 6, 8, …

▪ 6, 5, 1, 4, …

▪ What is the next number?

▪ 100000 (sequence 1)

▪ 100000 (sequence 2)

▪ Why?

Page 5

The Scientific Method

Data Science: The Science of Automating the Scientific Method

Page 6

The Scientific Method: Practical (traditional)

$E = mc^2$

Page 7

The Scientific Method: Practical (new)

$E = mc^2$

Page 8

Thought Framework: Machine Learning

▪ Intelligence: The ability to adapt (Binet and Simon, 1904).

▪ Machine learning adapts a finite state machine M to an unknown function based on observations.

▪ Input: n rows of observations (instances) in a table with header $(x_1, x_2, \ldots, x_m, f(\vec{x}))$, where $f(\vec{x})$ is a column with labels that we call the target function.

▪ Output: State machine M that maps a point $(x_1, x_2, \ldots, x_m) \Rightarrow f(\vec{x})$.

Page 9

Thought Framework: Machine Learning

▪ Assume $x_i \in \mathbb{R}$, $f(\vec{x}) \in \{0, 1\}$ (binary classifier).

▪ Question: How many state transitions does M need to model the training data?

Page 10

Refresh: Memory Arithmetic

• Information is reduction of uncertainty: $H = -\log_2 P = -\log_2 \frac{1}{\#\text{states}} = \log_2 \#\text{states}$, measured in bits.

• Information: $\log_2 \#\text{states}$ (positive bits). Uncertainty: $\log_2 P = \log_2 \frac{1}{\#\text{states}}$ (negative bits).

• If states are not equiprobable, Shannon entropy provides a tighter bound. Math: assumptions needed (infinity, distribution)! Engineering: estimate using binning.
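The memory arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and equal-width binning is just one assumed way to carry out the "estimate using binning" step.

```python
import math
from collections import Counter


def bits_for_states(num_states):
    """Information (in bits) needed to single out one of num_states equiprobable states."""
    return math.log2(num_states)


def shannon_entropy(observations):
    """Shannon entropy (bits per observation) of the empirical distribution."""
    counts = Counter(observations)
    total = len(observations)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def binned_entropy(values, num_bins):
    """Entropy estimate for real values via equal-width binning (an assumed binning scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # avoid a zero width when all values are equal
    bins = [min(int((v - lo) / width), num_bins - 1) for v in values]
    return shannon_entropy(bins)


print(bits_for_states(8))             # 8 equiprobable states
print(shannon_entropy([0, 0, 0, 1]))  # skewed labels need less than 1 bit each
print(binned_entropy([0.0, 1.0, 2.0, 3.0], 4))
```

When the states are equiprobable, the entropy equals log2 of the number of states; for skewed distributions it is smaller, which is the "tighter bound" the slide refers to.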

Page 11

Thought Framework: Machine Learning

▪ Assume $x_i \in \mathbb{R}$, $f(\vec{x}) \in \{0, 1\}$ (binary classifier).

▪ Question: How many state transitions does M need to model the training data?

▪ Maximally: #rows (lookup table).

▪ Minimally: ? (Kolmogorov Complexity)

Page 12

Thought Framework: Machine Learning

▪ Intellectual Capacity: The number of unique target functions a machine learner is able to represent (as a function of the number of model parameters).

▪ Memory Equivalent Capacity (MEC): A machine learner's intellectual capacity is memory-equivalent to N bits when the machine learner is able to represent all $2^N$ binary labeling functions of N uniformly random inputs.

▪ At MEC or higher, M is able to memorize all possible state transitions from the input to the output.

Page 13

This Talk: Main trick

• If we deduce nothing from the data, the only thing we can do is memorize the observations verbatim.

• Using as many parameters as needed for memorization is therefore an indicator that the machine learner did not deduce anything (overfitting).

• Reducing parameters below memorization capacity will, in the best case, make the machine learner forget what is not relevant with regard to the target function: generalization.

Memorization is worst-case generalization

Page 14

Generalization in Machine Learning

Memorization is worst-case generalization. For binary classifiers:

$G = \frac{\#\text{correctly classified instances}}{\text{Memory Equivalent Capacity}} \left[\frac{\text{bits}}{\text{bit}}\right]$

• $G < 1$ => M needs more training/data (not even memorizing)

• $G = 1$ => M is memorizing = overfitting

• $1 < G < G_{MEM}$ => M could be implementing a lossless compression (and still overfit)

• $G > G_{MEM}$ => M is generalizing (no chance for overfitting)
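As a rough sketch, the G measure and its four regimes could be computed as follows. The function names are invented for this example, and the threshold `g_mem` is passed in by the caller because the slide does not say how to obtain it.

```python
def generalization(correct_instances, mec_bits):
    """G = #correctly classified instances / Memory Equivalent Capacity (bits per bit)."""
    if mec_bits <= 0:
        raise ValueError("capacity must be positive")
    return correct_instances / mec_bits


def interpret(g, g_mem):
    """Map G onto the four regimes listed on the slide; g_mem is supplied by the caller."""
    if g < 1:
        return "needs more training/data"
    if g == 1:
        return "memorizing (overfitting)"
    if g <= g_mem:
        # Assumption: the slide leaves g == g_mem open; it is treated as compression here.
        return "possibly lossless compression (may still overfit)"
    return "generalizing"


g = generalization(correct_instances=1000, mec_bits=250)
print(g, interpret(g, g_mem=2.0))
```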

Page 15

Generalization in Machine Learning

$G = \frac{\#\text{correctly classified instances}}{\text{Memory Equivalent Capacity}} \left[\frac{\text{bits}}{\text{bit}}\right]$

Advantages of this definition:

• Keep current approach with training/validation/benchmark sets.

• No i.i.d. requirement for train/test set: the only requirement is that input points are distinct!

• No distributional assumptions.

Page 16

How do we calculate the Memory Equivalent Capacity?

• Binary Decision Tree: Depth of tree (if perfect).

• Neural Network (remainder of talk)

• Random Forest: TBD

• SVM: TBD

• k-NN: TBD

• GMMs: TBD

Page 17

Machine Learning as Engineering Discipline

• Supervised machine learners have a Memory Equivalent Capacity in bits that is computable and measurable.

• Artificial neural networks with gating functions (Sigmoid, ReLU, etc.) have

  • a capacity upper limit that can be determined analytically using 4 principles

  • an effective capacity that can be measured on actual implementations.

• Predicting and measuring capacity allows for task-independent optimization of a concrete network architecture, learning algorithm, convergence tricks, etc.

• The capacity requirement can be approximately predicted given the input data and ground truth.

Page 18

Repeat: The Perceptron

Source: Wikipedia

Physical interpretation: Energy threshold

Page 19

Repeat: Activation Functions (too many)

Source: Wikipedia

Activation functions approximate the sharp decision boundary.

Page 20

How many binary functions can one model using a single Perceptron?

Source: R. Rojas, Intro to Neural Networks

Page 21

Example: Boolean Functions

Source: R. Rojas, Intro to Neural Networks

• $2^{2^v}$ possible labelings of v boolean variables

• $2^{2^v}$ labelings of $2^v$ points.

• For v = 2, a single perceptron can model all but 2 of the 16 functions: XOR and NXOR.
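The v = 2 case can be checked by brute force. The sketch below enumerates threshold units with small integer weights and a bias, then collects every labeling of the four input points they produce. The weight grid is an assumption that happens to suffice for v = 2; it is not a general separability test.

```python
from itertools import product


def perceptron(weights, bias, point):
    """Threshold unit: outputs 1 when the weighted sum plus bias is positive."""
    return int(sum(w * x for w, x in zip(weights, point)) + bias > 0)


def realizable_labelings(v, grid=range(-2, 3)):
    """All labelings of the 2**v boolean points reachable with weights and bias from grid."""
    points = list(product((0, 1), repeat=v))
    found = set()
    for params in product(grid, repeat=v + 1):
        *weights, bias = params
        found.add(tuple(perceptron(weights, bias, p) for p in points))
    return points, found


points, found = realizable_labelings(2)
all_labelings = set(product((0, 1), repeat=len(points)))
missing = sorted(all_labelings - found)
print(len(all_labelings), len(found), missing)
```

With the points ordered (0,0), (0,1), (1,0), (1,1), the two labelings that never appear are (0, 1, 1, 0) and (1, 0, 0, 1), that is, XOR and its negation.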

Page 22

Machine Learning as an Encoder/Decoder

[Figure: Sender → Encoder → Channel → Decoder → Receiver, carrying labels → weights → weights → labels'. Annotations: Learning Method, Identity, Neural Network, data, Information loss.]

Source: D. MacKay: Information Theory, Inference, and Learning Algorithms

Main trick: Let the Machine Learner label random points!

Page 23

Critical Points: Perceptron (Cover, MacKay)

N = K: VC Dimension (for points in random position)

N = 2K: Cover/MacKay Information Capacity

Source: D. MacKay: Information Theory, Inference, and Learning Algorithms
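Both critical points can be reproduced numerically with Cover's function counting theorem, which is not spelled out on the slide and is brought in here as background. For N points in general position and a perceptron with K parameters, the number of realizable labelings is C(N, K) = 2 · Σ_{k=0}^{K-1} binom(N-1, k). A minimal sketch, with function names chosen for this example:

```python
from math import comb


def cover_count(n_points, k_params):
    """Cover's count of labelings of n_points that a K-parameter perceptron can realize."""
    return 2 * sum(comb(n_points - 1, k) for k in range(k_params))


def separable_fraction(n_points, k_params):
    """Fraction of all 2**N labelings that are realizable."""
    return cover_count(n_points, k_params) / 2 ** n_points


K = 5
for n in (3, 5, 10, 20):
    print(n, separable_fraction(n, K))
```

Up to N = K every labeling is realizable (fraction 1), and at N = 2K exactly half of them are. As a cross-check against the Boolean example, `cover_count(4, 3)` gives 14 for four points and a perceptron with two weights and a bias.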

Page 24

From a Perceptron to Perceptron Networks

Source: Wikipedia

Page 25

Careful: Other Architectures

Example Solutions to XOR: Typical MLP, Shortcut Network

Source: R. Rojas, Intro to Neural Networks

Page 26

Solution: Calculate in bits!

Assume: $y_i, x_i \in \{0, 1\}$, $x_i$ uniformly distributed.

n bits of memory: $f(x_1, \ldots, x_n) = x_1, \ldots, x_n$ (identity function).

Machine Learner:

• binary classifier: $f(x_1, \ldots, x_n) = y_1$

• multi-class/regression: $f(x_1, \ldots, x_n) = y_1, \ldots, y_m$

Memory Equivalent Capacity: The number of configurations of uniformly distributed $x_1, \ldots, x_n$ that a machine learner can guarantee to label correctly.

Page 27

Memory Equivalent Capacity for Neural Networks

1) The output of a perceptron is maximally 1 bit.

2) The maximum memory capacity of a perceptron is the number of parameters (including bias) in bits. (MacKay 2003)

3) The maximum memory capacity of perceptrons in parallel is additive. (MacKay 2003 speculative, Friedland and Krell 2017)

4) The maximum memory capacity of a layer of perceptrons depending on a previous layer of perceptrons is limited by the maximum output (in bits) of the previous layer. (Data Processing Inequality, Tishby 2012)

Page 28

Examples: How many bits of maximal capacity?

[Figure: single perceptron with inputs x1, x2, weights w1, w2, and bias b] 3 bits

[Figure: network with inputs x1, x2, two hidden perceptrons (weights w1 to w4, biases b1, b2), and one output perceptron (weights w5, w6)] 2*3 bits + min(2, 3) bits = 8 bits

Page 29

Examples: How many bits of maximal capacity?

[Figure: network with inputs x1, x2, two hidden layers of two perceptrons each, and one output perceptron (weights w1 to w10, biases b1 to b5)] 2*3 bits + min(2, 2*3) bits + min(2, 3) bits = 10 bits
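The MLP examples above follow mechanically from the four rules, so they can be reproduced with a short helper. This is a sketch for fully connected layers only; the function name and the `layer_sizes` convention are assumptions made for this example, and shortcut connections are not handled.

```python
def mlp_capacity_bits(layer_sizes):
    """Upper-limit capacity in bits of a dense MLP.

    layer_sizes lists the number of inputs followed by the number of
    perceptrons in each layer, e.g. [2, 2, 1] for 2 inputs, 2 hidden units,
    and 1 output unit.
    """
    total = 0
    for index in range(1, len(layer_sizes)):
        fan_in, units = layer_sizes[index - 1], layer_sizes[index]
        params = units * (fan_in + 1)  # rule 2: weights plus one bias per perceptron
        if index == 1:
            total += params  # rule 3: parallel perceptrons on the raw input add up
        else:
            # Rules 1 and 4: the previous layer emits at most fan_in bits.
            total += min(fan_in, params)
    return total


print(mlp_capacity_bits([2, 1]))        # single perceptron
print(mlp_capacity_bits([2, 2, 1]))     # one hidden layer
print(mlp_capacity_bits([2, 2, 2, 1]))  # two hidden layers
```

The three calls correspond to the 3-bit, 8-bit, and 10-bit examples.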

Page 30

Examples: How many bits of maximal capacity?

[Figure: Shortcut or ResNet with inputs x1, x2, weights w1 to w5, and biases b1, b2] 3 bits + 4 bits = 7 bits

Page 31

Characteristic Curve of a Theoretical 3-Layer MLP

Page 32

Characteristic Curve of an Actual 3-Layer MLP

Python scikit-learn, 3-Layer MLP

Page 33

Predicting Capacity Requirements

Given data and labels: How much actual capacity do I need to memorize the function?

Idea:

1) Worst case: Let's build a memorization network where only the biases are trained.

2) Expected case: How much parameter reduction can (exponential) training buy us?

Page 34

Predicting Maximum Memory Equivalent Capacity

[Figure: "Dumb" Network with inputs x1, x2, fixed weights of +1/-1, and trained biases b1, b2, …, $b_{m \times n}$]

Runtime: O(n log n)

Page 35

Predicting Expected Minimum Memory Equivalent Capacity

Dumb Network:

• Highly inefficient.

• Potentially not 100% accurate (hash collisions).

• We can assume that training the weights (and biases) gets 100% accuracy while reducing parameters.

Expected Reduction: Exponential! n thresholds should be representable with $\log_2 n$ weights and biases (search tree!).
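One possible reading of the two estimates, offered only as a sketch: project each row onto a single number, sort, and count how often the label changes along the sorted order. Each change is a threshold the memorization network would need, and the expected case takes the logarithm of that count. Projecting onto the feature sum and adding 1 inside the logarithm (to avoid log2(0)) are assumptions of this example, not something stated on the slides.

```python
import math


def count_thresholds(rows, labels):
    """Count label changes after sorting rows by their feature sum (assumed projection)."""
    table = sorted((sum(row), label) for row, label in zip(rows, labels))
    thresholds = 0
    previous = table[0][1]  # assumes at least one row
    for _, label in table[1:]:
        if label != previous:
            thresholds += 1
            previous = label
    return thresholds


def expected_bits(thresholds):
    """Search-tree estimate: log2 of the threshold count (+1 is an assumption of this sketch)."""
    return math.log2(thresholds + 1)


rows = [[0, 0], [0, 1], [1, 1], [2, 1]]
labels = [0, 0, 1, 1]
t = count_thresholds(rows, labels)
print(t, expected_bits(t))
```

The sort dominates the cost, which is consistent with the O(n log n) runtime quoted for the worst-case estimate.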

Page 36

Empirical Results

All results repeatable at: https://github.com/fractor/nntailoring

Page 37

Training

▪ Everything we did so far assumes perfect training. That is, training that is guaranteed to reach the global minimum error.

▪ Perfect training requires exponential time.

▪ Imperfect training means the Memory Equivalent Capacity is effectively reduced.

▪ How to measure that?

Page 38

From Memorization to Generalization

Good news:

• Real-world data is not random.

• The information capacity of a perceptron is usually >1 bit per parameter (Cover, MacKay).

This means we should be able to use fewer parameters than predicted by memory capacity calculations.

Memorization is worst-case generalization.

Page 39

Suggested Engineering Process for Generalization

• Start at approximate expected capacity.

• Train to >98% accuracy. If impossible, increase parameters.

• Retrain iteratively with decreased capacity while testing against the validation set. You should see a decrease in training accuracy with an increase in validation set accuracy.

• Stop at minimum capacity for best held-out set accuracy.

Best case scenario: As parameters are reduced, the neural network fails to memorize only the insignificant (noise) bits.
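The process above can be sketched as a small search loop. Everything here is illustrative: `evaluate` is a hypothetical callback that trains a model of the given capacity and returns (training accuracy, validation accuracy), and the doubling and step-by-one schedules are arbitrary choices made for this example.

```python
def tune_capacity(start, evaluate, target=0.98, max_capacity=1 << 20):
    """Grow until the training set is fit, then shrink while tracking validation accuracy."""
    capacity = start
    train_acc, val_acc = evaluate(capacity)
    # Step 2: if the training target cannot be reached, increase parameters.
    while train_acc <= target:
        if capacity * 2 > max_capacity:
            raise ValueError("could not reach the training target")
        capacity *= 2
        train_acc, val_acc = evaluate(capacity)
    best_capacity, best_val = capacity, val_acc
    # Step 3: retrain with decreasing capacity, testing against the validation set.
    while capacity > 1:
        capacity -= 1
        _, val_acc = evaluate(capacity)
        if val_acc >= best_val:  # step 4: ties go to the smaller model
            best_capacity, best_val = capacity, val_acc
    return best_capacity, best_val


def toy_evaluate(capacity):
    """Hypothetical stand-in for real training: validation accuracy peaks at capacity 6."""
    train_acc = min(1.0, capacity / 10)
    val_acc = 1 - abs(capacity - 6) / 10
    return train_acc, val_acc


print(tune_capacity(4, toy_evaluate))
```

In practice `evaluate` would wrap an actual training run, and the shrink loop would likely stop early once validation accuracy has clearly passed its peak rather than scanning all the way down.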

Page 40

Generalization Process: Expected Curve

Page 41

Overcapacity Machine Learning: Issues

▪ Waste of money, energy, and time. Bad for the environment.

▪ The fewer parameters => the better the generalization rule => the higher the adaptation per parameter => the higher the chance that an unseen instance can be predicted correctly.

▪ Fewer parameters give a higher chance for explainability (Occam's Razor). See: G. Friedland, A. Metere: "Machine Learning for Science", UQ SciML Workshop, Los Angeles, June 2018.

Page 42

Reminder: Occam's Razor

Among competing hypotheses, the one with the fewest assumptions should be selected.

For each accepted explanation of a phenomenon, there may be an extremely large, perhaps even incomprehensible, number of possible and more complex alternatives, because one can always burden failing explanations with ad hoc hypotheses to prevent them from being falsified; therefore, simpler theories are preferable to more complex ones because they are more testable. (Wikipedia, Sep. 2017)

Page 43

General Generalization

▪ Binary classifier (repeat):

$G = \frac{\#\text{correctly classified instances}}{\text{Memory Equivalent Capacity}} \left[\frac{\text{bits}}{\text{bit}}\right]$

▪ Multi-class/regression:

$G = \frac{\#\text{correctly classified instances}}{\#\text{instances that can be memorized}}$

Page 44

Non-Statistical Definition (Literature)

Informally: When do two different inputs lead to the same machine learner output?

That is, which bits can be ignored in the comparison.

Statistical equivalent: How many bits per bit can be ignored on average (see G measure).

Page 45

Demo: Experimental Design for TensorFlow

http://tfmeter.icsi.berkeley.edu