Gerald Friedland, http://www.gerald-friedland.org
Gerald Friedland (UC Berkeley)
Experimental Design for Machine Learning
Paper, demo, etc.: https://tfmeter.icsi.berkeley.edu
Commercial tool: http://brainome.ai
About me…
▪ Adjunct Faculty, UC Berkeley
▪ Data Scientist at a National Lab
▪ Started work in Machine Learning in 2001
Start of this work: Simple Question
▪ How much money (CPU time, memory, I/O) do I need to budget for my deep learning experiment?
▪ State of the art: no answer. For example, ImageNet models vary significantly:
▪ AlexNet: 238 MB model, 2.27 Bn ops
▪ DarkNet: 28 MB model, 0.96 Bn ops
▪ VGG-16: 528 MB model, 30.94 Bn ops
Source: https://pjreddie.com/darknet/imagenet/
A game…
▪ Continue the sequence:
▪ 2, 4, 6, 8, …
▪ 6, 5, 1, 4, …
▪ What is the next number?
▪ 100000 (sequence 1)
▪ 100000 (sequence 2)
▪ Why?
The Scientific Method
Data Science: The Science of Automating the Scientific Method
The Scientific Method: Practical (traditional)
E = mc2
The Scientific Method: Practical (new)
E = mc2
Thought Framework: Machine Learning
▪ Intelligence: The ability to adapt (Binet and Simon, 1904)
▪ Machine learning adapts a finite state machine M to an unknown function based on observations.
▪ Input: n rows of observations (instances) in a table with header (x1, x2, …, xm, f(x⃗)), where f(x⃗) is a column with labels we call the target function.
▪ Output: State machine M that maps a point (x1, x2, …, xm) ⟹ f(x⃗)
Thought Framework: Machine Learning
▪ Assume (binary classifier): xi ∈ ℝ, f(x⃗) ∈ {0, 1}
▪ Question: How many state transitions does M need to model the training data?
Refresh: Memory Arithmetic
• Information is reduction of uncertainty: H = −log2 P = −log2(1/#states) = log2 #states, measured in bits.
• Information: log2 #states (positive bits). Uncertainty: log2 P = log2(1/#states) (negative bits).
• If states are not equiprobable, Shannon entropy provides a tighter bound. Math: assumptions needed (infinity, distribution)! Engineering: estimate using binning.
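The arithmetic above can be sketched in a few lines of Python; `bits` and `entropy_bits` are hypothetical helper names for illustration, and the binning scheme is just one simple way to do the engineering estimate:

```python
from math import log2
from collections import Counter

def bits(num_states):
    """Information needed to identify one of num_states equiprobable states."""
    return log2(num_states)

def entropy_bits(samples, num_bins=4):
    """Engineering estimate: bin the samples and compute Shannon entropy,
    a tighter bound when states are not equiprobable."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / num_bins or 1.0          # guard against zero range
    counts = Counter(min(int((s - lo) / width), num_bins - 1) for s in samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

print(bits(8))                    # 3.0 bits: -log2(1/8)
print(entropy_bits([0, 1, 0, 1])) # 1.0 bit: two equiprobable states
```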
Thought Framework: Machine Learning
▪ Assume (binary classifier): xi ∈ ℝ, f(x⃗) ∈ {0, 1}
▪ Question: How many state transitions does M need to model the training data?
▪ Maximally: #rows (lookup table)
▪ Minimally: ? (Kolmogorov complexity)
Thought Framework: Machine Learning
▪ Intellectual Capacity: The number of unique target functions a machine learner is able to represent (as a function of the number of model parameters).
▪ Memory Equivalent Capacity (MEC): A machine learner's intellectual capacity is memory-equivalent to N bits when the machine learner is able to represent all 2^N binary labeling functions of N uniformly random inputs.
▪ At MEC or higher, M is able to memorize all possible state transitions from the input to the output.
This Talk: Main Trick
• If we deduce nothing from data, the only thing we can do is memorize the observations verbatim.
• Using as many parameters as needed for memorization is therefore an indicator that the machine learner did not deduce anything (overfitting).
• Reducing parameters below memorization capacity will, in the best case, make the machine learner forget what is not relevant with regard to the target function: generalization.
Memorization is worst-case generalization.
Generalization in Machine Learning
Memorization is worst-case generalization. For binary classifiers:

G = #correctly classified instances / Memory Equivalent Capacity  [bits/bit]

G < 1 ⟹ M needs more training/data (not even memorizing)
G = 1 ⟹ M is memorizing = overfitting
1 < G < G_MEM ⟹ M could be implementing a lossless compression (and still overfit)
G ≥ G_MEM ⟹ M is generalizing (no chance for overfitting)
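As a minimal sketch of the definition (the function name is ours, not from any library), G is a straight ratio:

```python
def generalization(correct_instances, mec_bits):
    """G = correctly classified instances / memory-equivalent capacity,
    in bits per bit. G = 1 means pure memorization; larger G means the
    learner compresses or generalizes."""
    return correct_instances / mec_bits

print(generalization(80, 8))  # 10.0 bits/bit: well above memorization
print(generalization(8, 8))   # 1.0 bits/bit: memorizing
```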
Generalization in Machine Learning

G = #correctly classified instances / Memory Equivalent Capacity  [bits/bit]

Advantages of this definition:
• Keep the current approach with training/validation/benchmark sets.
• No i.i.d. requirement for train/test set: the only requirement is that input points are distinct!
• No distributional assumptions.
How do we calculate the Memory Equivalent Capacity?
• Binary decision tree: depth of tree (if perfect).
• Neural network: remainder of this talk.
• Random forest: TBD
• SVM: TBD
• k-NN: TBD
• GMMs: TBD
Machine Learning as an Engineering Discipline
• Supervised machine learners have a Memory Equivalent Capacity in bits that is computable and measurable.
• Artificial neural networks with gating functions (sigmoid, ReLU, etc.) have
• a capacity upper limit that can be determined analytically using 4 principles
• an effective capacity that can be measured on actual implementations.
• Predicting and measuring capacity allows for task-independent optimization of a concrete network architecture, learning algorithm, convergence tricks, etc.
• Capacity requirements can be approximately predicted given the input data and ground truth.
Repeat: The Perceptron
Source: Wikipedia. Physical interpretation: energy threshold.
Repeat: Activation Functions (too many)
Source: Wikipedia. Activation functions approximate the sharp decision boundary.
How many binary functions can one model using a single perceptron?
Source: R. Rojas, Intro to Neural Networks
Example: Boolean Functions
Source: R. Rojas, Intro to Neural Networks
• 2^(2^v) possible labelings of v boolean variables
• 2^(2^v) labelings of 2^v points.
• For v = 2, all but 2 functions work: XOR, NXOR
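The v = 2 count can be checked by brute force. This is an illustrative sketch, assuming a hard threshold at 0 and a small integer weight grid (which suffices for two boolean inputs):

```python
from itertools import product

def perceptron_realizable(truth_table, grid=range(-2, 3)):
    """Brute-force search for integer weights/bias realizing a boolean
    function of two variables with a hard threshold at 0."""
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for w1, w2, b in product(grid, repeat=3):
        if all(int(w1 * x1 + w2 * x2 + b > 0) == y
               for (x1, x2), y in zip(inputs, truth_table)):
            return True
    return False

# Enumerate all 2^(2^2) = 16 truth tables and count the realizable ones.
realizable = [tt for tt in product((0, 1), repeat=4) if perceptron_realizable(tt)]
print(len(realizable))  # 14 of the 16 functions; XOR and NXOR are missing
```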
Machine Learning as an Encoder/Decoder
[Diagram: machine learning as a communication channel (Sender → Encoder → Channel → Decoder → Receiver). The learning method is the encoder, the neural network the decoder; labels are encoded into weights and decoded back into labels'. Information loss can occur along the way.]
Source: D. MacKay: Information Theory, Inference and Learning
Main trick: Let the machine learner label random points!
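The trick can be illustrated on the simplest learner, a one-input threshold unit (2 parameters: one weight, one bias): feed it random points, enumerate all binary labelings, and find the largest n for which every labeling is representable. This is an illustrative sketch, not the actual tool's implementation:

```python
import itertools
import random

def threshold_can_fit(points, labels):
    """Check whether some threshold t and polarity realize the labeling."""
    ordered = sorted(points)
    # candidate thresholds: below all points, between neighbors, above all
    thresholds = [ordered[0] - 1.0]
    thresholds += [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    thresholds += [ordered[-1] + 1.0]
    for t in thresholds:
        for polarity in (True, False):
            pred = [int((x > t) == polarity) for x in points]
            if pred == list(labels):
                return True
    return False

def memory_equivalent_capacity(max_n=5, seed=0):
    """Largest n such that ALL 2^n labelings of n random 1-D points fit."""
    rng = random.Random(seed)
    mec = 0
    for n in range(1, max_n + 1):
        points = [rng.random() for _ in range(n)]
        if all(threshold_can_fit(points, labels)
               for labels in itertools.product((0, 1), repeat=n)):
            mec = n
        else:
            break
    return mec

print(memory_equivalent_capacity())  # 2: a 2-parameter unit yields 2 bits
```

The result matches rule 2 below: a perceptron's memory capacity is its number of parameters (including bias) in bits.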
Critical Points: Perceptron (Cover, MacKay)
N = K: VC dimension (for points in random position)
N = 2K: Cover/MacKay information capacity
Source: D. MacKay: Information Theory, Inference and Learning
From a Perceptron to Perceptron Networks
Source: Wikipedia
Careful: Other Architectures
Typical MLP vs. shortcut network
Source: R. Rojas, Intro to Neural Networks
Example solutions to XOR
Solution: Calculate in bits!
Assume: yi, xi ∈ {0, 1}, xi uniformly distributed. n bits of memory: f(x1, …, xn) = x1, …, xn (identity function).
Machine learner as binary classifier: f(x1, …, xn) = y1; multi-class/regression: f(x1, …, xn) = y1, …, ym.
Memory Equivalent Capacity: The number of configurations of uniformly distributed x1, …, xn that a machine learner can guarantee to label correctly.
Memory Equivalent Capacity for Neural Networks
1) The output of a perceptron is maximally 1 bit.
2) The maximum memory capacity of a perceptron is the number of parameters (including bias) in bits. (MacKay 2003)
3) The maximum memory capacity of perceptrons in parallel is additive. (MacKay 2003 speculative, Friedland and Krell 2017)
4) The maximum memory capacity of a layer of perceptrons depending on a previous layer of perceptrons is limited by the maximum output (in bits) of the previous layer. (Data Processing Inequality, Tishby 2012)
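The four rules translate directly into a few lines of Python. This is a sketch for fully connected feed-forward stacks only (shortcut connections need per-neuron accounting); `mlp_max_mec` is our name, and `layers` lists the input width followed by each layer's neuron count:

```python
def mlp_max_mec(layers):
    """Upper bound on memory-equivalent capacity (bits) of a fully
    connected feed-forward network, following the four rules above.
    layers = [inputs, hidden..., outputs]."""
    total = 0
    for k in range(1, len(layers)):
        params = layers[k] * (layers[k - 1] + 1)  # weights + bias per neuron
        if k == 1:
            bits = params                      # rules 2 + 3: params in bits, additive
        else:
            bits = min(layers[k - 1], params)  # rules 1 + 4: <=1 bit per upstream neuron
        total += bits
    return total

print(mlp_max_mec([2, 1]))        # 3 bits (single perceptron)
print(mlp_max_mec([2, 2, 1]))     # 2*3 + min(2,3) = 8 bits
print(mlp_max_mec([2, 2, 2, 1]))  # 2*3 + min(2,6) + min(2,3) = 10 bits
```

The three printed values reproduce the worked examples on the following slides.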
Examples: How many bits of maximal capacity?
[Diagram: a single perceptron with inputs x1, x2, weights w1, w2, and bias b: 3 bits. A network with a hidden layer of two perceptrons (weights w1–w4, biases b1, b2) and an output perceptron (weights w5, w6): 2·3 bits + min(2, 3) bits = 8 bits.]
Examples: How many bits of maximal capacity?
[Diagram: a network with inputs x1, x2, two hidden layers of two perceptrons each (weights w1–w8, biases b1–b4), and an output perceptron (weights w9, w10, bias b5): 2·3 bits + min(2, 2·3) bits + min(2, 3) bits = 10 bits.]
Examples: How many bits of maximal capacity?
[Diagram: shortcut or ResNet-style network with inputs x1, x2, one hidden perceptron (weights w1, w2, bias b1), and an output perceptron fed by the hidden unit and both inputs directly (weights w3, w4, w5, bias b2): 3 bits + 4 bits = 7 bits.]
Characteristic Curve of a Theoretical 3-Layer MLP
Characteristic Curve of an Actual 3-Layer MLP
Python scikit-learn, 3-layer MLP
Predicting Capacity Requirements
Given data and labels: how much actual capacity do I need to memorize the function?
Idea:
1) Worst case: build a memorization network where only the biases are trained.
2) Expected case: how much parameter reduction can (exponential) training buy us?
Predicting Maximum Memory Equivalent Capacity
[Diagram: "dumb" memorization network. Inputs x1, x2 feed a wide hidden layer of threshold units with fixed weights of 1 and trained biases b1, b2, …, bm×n; the hidden outputs are combined with fixed ±1 weights into a single output.]
Runtime: O(n log n)
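The worst-case estimate can be sketched as follows. The sum-based 1-D projection and the final capacity formula are our assumptions for illustration (the actual construction may project differently); the O(n log n) cost is the sort:

```python
def estimate_memorization_capacity(rows, labels):
    """Sketch of the worst-case capacity estimate: project instances to
    1-D, sort them (the O(n log n) step), and count label transitions.
    Each transition costs one threshold neuron whose bias separates the
    two classes; capacity then follows the layer rules above."""
    projected = [sum(r) for r in rows]                 # naive 1-D projection
    order = sorted(range(len(rows)), key=projected.__getitem__)
    thresholds = max(1, sum(labels[a] != labels[b]
                            for a, b in zip(order, order[1:])))
    d = len(rows[0])
    # hidden layer of `thresholds` neurons (d weights + bias each), plus an
    # output neuron limited to the hidden layer's output bits
    return thresholds * (d + 1) + min(thresholds, thresholds + 1)

# XOR needs two thresholds: 2*(2+1) + min(2, 3) = 8 bits
print(estimate_memorization_capacity([[0, 0], [0, 1], [1, 0], [1, 1]],
                                     [0, 1, 1, 0]))
```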
Predicting Expected Minimum Memory Equivalent Capacity
Dumb network:
• Highly inefficient.
• Potentially not 100% accurate (hash collisions).
• We can assume that training the weights (and biases) gets 100% accuracy while reducing parameters.
Expected reduction: exponential! n thresholds should be representable with log2 n weights and biases (search tree!).
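The reduction itself is easy to quantify (a sketch; the helper name is ours):

```python
from math import ceil, log2

def expected_thresholds(n):
    """The n thresholds of the dumb network should compress to about
    log2(n) trained weights and biases (the search-tree argument)."""
    return max(1, ceil(log2(n)))

print(expected_thresholds(1024))  # 10: exponential reduction from 1024
```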
Empirical Results
All results repeatable at: https://github.com/fractor/nntailoring
Training
▪ Everything we did so far assumes perfect training, that is, training that is guaranteed to reach the global minimum error.
▪ Perfect training requires exponential time.
▪ Imperfect training means the Memory Equivalent Capacity is effectively reduced.
▪ How do we measure that?
From Memorization to Generalization
Good news:
• Real-world data is not random.
• The information capacity of a perceptron is usually > 1 bit per parameter (Cover, MacKay).
This means we should be able to use fewer parameters than predicted by memory capacity calculations.
Memorization is worst-case generalization.
Suggested Engineering Process for Generalization
• Start at the approximate expected capacity.
• Train to > 98% accuracy. If impossible, increase parameters.
• Retrain iteratively with decreased capacity while testing against the validation set. You should see a decrease in training accuracy with an increase in validation-set accuracy.
• Stop at the minimum capacity with the best held-out-set accuracy.
Best-case scenario: as parameters are reduced, the neural network fails to memorize only the insignificant (noise) bits.
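The process can be sketched as a generic loop; `train` and `evaluate` are hypothetical callbacks supplied by the practitioner, and halving the parameter count per round is just one possible schedule:

```python
def tune_capacity(train, evaluate, start_params, min_params=1):
    """Iteratively retrain with decreasing capacity, keeping the
    parameter count that maximizes validation accuracy."""
    best_params, best_val = start_params, float("-inf")
    params = start_params
    while params >= min_params:
        model = train(params)       # retrain at this capacity
        val_acc = evaluate(model)   # test against the validation set
        if val_acc > best_val:
            best_params, best_val = params, val_acc
        params //= 2                # reduce capacity each round
    return best_params, best_val

# Stub callbacks whose validation accuracy peaks at 4 parameters:
print(tune_capacity(lambda p: p, lambda m: 1.0 - abs(m - 4) / 16, 16))
```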
Generalization Process: Expected Curve
Overcapacity Machine Learning: Issues
▪ Waste of money, energy, and time. Bad for the environment.
▪ Fewer parameters ⟹ a better generalization rule ⟹ higher adaptation per parameter ⟹ a higher chance that an unseen instance is predicted correctly.
▪ Fewer parameters give a higher chance of explainability (Occam's Razor). See: G. Friedland, A. Metere: "Machine Learning for Science", UQ SciML Workshop, Los Angeles, June 2018.
Reminder: Occam's Razor
Among competing hypotheses, the one with the fewest assumptions should be selected.
For each accepted explanation of a phenomenon, there may be an extremely large, perhaps even incomprehensible, number of possible and more complex alternatives, because one can always burden failing explanations with ad hoc hypotheses to prevent them from being falsified; therefore, simpler theories are preferable to more complex ones because they are more testable. (Wikipedia, Sep. 2017)
General Generalization
▪ Binary classifier (repeat): G = #correctly classified instances / Memory Equivalent Capacity  [bits/bit]
▪ Multi-class/regression: G = #correctly classified instances / #instances that can be memorized
Non-Statistical Definition (Literature)
Informally: when do two different inputs lead to the same machine learner output?
That is, which bits can be ignored in the comparison?
Statistical equivalent: how many bits per bit can be ignored on average (see the G measure).
Demo: Experimental Design for TensorFlow
http://tfmeter.icsi.berkeley.edu