Gerald Friedland, http://www.gerald-friedland.org
Gerald Friedland (UC Berkeley)
Experimental Design for Machine Learning
Paper, demo, etc.: https://tfmeter.icsi.berkeley.edu
Commercial tool: http://brainome.ai
About me…
▪ Adjunct Faculty, UC Berkeley
▪ Data Scientist at a National Lab
▪ Started work in Machine Learning in 2001
Start of this work: Simple Question
▪ How much money (CPU time, memory, I/O) do I need to budget for my deep learning experiment?
▪ State of the art: no answer. For example, ImageNet models vary significantly:
▪ AlexNet: 238 MB model, 2.27 Bn ops
▪ DarkNet: 28 MB model, 0.96 Bn ops
▪ VGG-16: 528 MB model, 30.94 Bn ops
Source: https://pjreddie.com/darknet/imagenet/
A game…
▪ Continue the sequence:
▪ 2, 4, 6, 8, …
▪ 6, 5, 1, 4, …
▪ What is the next number?
▪ 100000 (sequence 1)
▪ 100000 (sequence 2)
▪ Why?
The Scientific Method
Data Science: The Science of Automating the Scientific Method
The Scientific Method: Practical (traditional)
E = mc2
The Scientific Method: Practical (new)
E = mc2
Thought Framework: Machine Learning
▪ Intelligence: The ability to adapt (Binet and Simon, 1904)
▪ Machine learning adapts a finite state machine M to an unknown function based on observations.
▪ Input: n rows of observations (instances) in a table with header (x1, x2, …, xm, f(x⃗)), where f(x⃗) is a column with labels we call the target function.
▪ Output: State machine M that maps a point (x1, x2, …, xm) ⟹ f(x⃗)
Thought Framework: Machine Learning
▪ Assume (binary classifier): xi ∈ ℝ, f(x⃗) ∈ {0, 1}
▪ Question: How many state transitions does M need to model the training data?
Refresh: Memory Arithmetic
• Information is reduction of uncertainty: H = −log2 P = −log2(1/#states) = log2 #states, measured in bits.
• Information: log2 #states (positive bits). Uncertainty: log2 P = log2(1/#states) (negative bits).
• If states are not equiprobable, Shannon entropy provides a tighter bound. Math: assumptions needed (infinity, distribution)! Engineering: estimate using binning.
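The arithmetic above can be sketched in a few lines of Python; `bits` and `entropy_bits` are hypothetical helper names for illustration, and the binning scheme is just one simple way to do the engineering estimate:

```python
from math import log2
from collections import Counter

def bits(num_states):
    """Information needed to identify one of num_states equiprobable states."""
    return log2(num_states)

def entropy_bits(samples, num_bins=4):
    """Engineering estimate: bin the samples and compute Shannon entropy,
    a tighter bound when states are not equiprobable."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / num_bins or 1.0          # guard against zero range
    counts = Counter(min(int((s - lo) / width), num_bins - 1) for s in samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

print(bits(8))                    # 3.0 bits: -log2(1/8)
print(entropy_bits([0, 1, 0, 1])) # 1.0 bit: two equiprobable states
```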
Thought Framework: Machine Learning
▪ Assume (binary classifier): xi ∈ ℝ, f(x⃗) ∈ {0, 1}
▪ Question: How many state transitions does M need to model the training data?
▪ Maximally: #rows (lookup table)
▪ Minimally: ? (Kolmogorov complexity)
Thought Framework: Machine Learning
▪ Intellectual Capacity: The number of unique target functions a machine learner is able to represent (as a function of the number of model parameters).
▪ Memory Equivalent Capacity (MEC): A machine learner's intellectual capacity is memory-equivalent to N bits when the machine learner is able to represent all 2^N binary labeling functions of N uniformly random inputs.
▪ At MEC or higher, M is able to memorize all possible state transitions from the input to the output.
This Talk: Main Trick
• If we deduce nothing from data, the only thing we can do is memorize the observations verbatim.
• Using as many parameters as needed for memorization is therefore an indicator that the machine learner did not deduce anything (overfitting).
• Reducing parameters below memorization capacity will, in the best case, make the machine learner forget what is not relevant with regard to the target function: generalization.
Memorization is worst-case generalization.
Generalization in Machine Learning
Memorization is worst-case generalization. For binary classifiers:

G = #correctly classified instances / Memory Equivalent Capacity  [bits/bit]

G < 1 ⟹ M needs more training/data (not even memorizing)
G = 1 ⟹ M is memorizing = overfitting
1 < G < G_MEM ⟹ M could be implementing a lossless compression (and still overfit)
G ≥ G_MEM ⟹ M is generalizing (no chance for overfitting)
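As a minimal sketch of the definition (the function name is ours, not from any library), G is a straight ratio:

```python
def generalization(correct_instances, mec_bits):
    """G = correctly classified instances / memory-equivalent capacity,
    in bits per bit. G = 1 means pure memorization; larger G means the
    learner compresses or generalizes."""
    return correct_instances / mec_bits

print(generalization(80, 8))  # 10.0 bits/bit: well above memorization
print(generalization(8, 8))   # 1.0 bits/bit: memorizing
```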
Generalization in Machine Learning

G = #correctly classified instances / Memory Equivalent Capacity  [bits/bit]

Advantages of this definition:
• Keep the current approach with training/validation/benchmark sets.
• No i.i.d. requirement for train/test set: the only requirement is that input points are distinct!
• No distributional assumptions.
How do we calculate the Memory Equivalent Capacity?
• Binary decision tree: depth of tree (if perfect).
• Neural network: remainder of this talk.
• Random forest: TBD
• SVM: TBD
• k-NN: TBD
• GMMs: TBD
Machine Learning as an Engineering Discipline
• Supervised machine learners have a Memory Equivalent Capacity in bits that is computable and measurable.
• Artificial neural networks with gating functions (sigmoid, ReLU, etc.) have
• a capacity upper limit that can be determined analytically using 4 principles
• an effective capacity that can be measured on actual implementations.
• Predicting and measuring capacity allows for task-independent optimization of a concrete network architecture, learning algorithm, convergence tricks, etc.
• Capacity requirements can be approximately predicted given the input data and ground truth.
Repeat: The Perceptron
Source: Wikipedia. Physical interpretation: energy threshold.
Repeat: Activation Functions (too many)
Source: Wikipedia. Activation functions approximate the sharp decision boundary.
How many binary functions can one model using a single perceptron?
Source: R. Rojas, Intro to Neural Networks
Example: Boolean Functions
Source: R. Rojas, Intro to Neural Networks
• 2^(2^v) possible labelings of v boolean variables
• 2^(2^v) labelings of 2^v points.
• For v = 2, all but 2 functions work: XOR, NXOR
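The v = 2 count can be checked by brute force. This is an illustrative sketch, assuming a hard threshold at 0 and a small integer weight grid (which suffices for two boolean inputs):

```python
from itertools import product

def perceptron_realizable(truth_table, grid=range(-2, 3)):
    """Brute-force search for integer weights/bias realizing a boolean
    function of two variables with a hard threshold at 0."""
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for w1, w2, b in product(grid, repeat=3):
        if all(int(w1 * x1 + w2 * x2 + b > 0) == y
               for (x1, x2), y in zip(inputs, truth_table)):
            return True
    return False

# Enumerate all 2^(2^2) = 16 truth tables and count the realizable ones.
realizable = [tt for tt in product((0, 1), repeat=4) if perceptron_realizable(tt)]
print(len(realizable))  # 14 of the 16 functions; XOR and NXOR are missing
```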
Machine Learning as an Encoder/Decoder
[Diagram: machine learning as a communication channel (Sender → Encoder → Channel → Decoder → Receiver). The learning method is the encoder, the neural network the decoder; labels are encoded into weights and decoded back into labels'. Information loss can occur along the way.]
Source: D. MacKay: Information Theory, Inference and Learning
Main trick: Let the machine learner label random points!
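The trick can be illustrated on the simplest learner, a one-input threshold unit (2 parameters: one weight, one bias): feed it random points, enumerate all binary labelings, and find the largest n for which every labeling is representable. This is an illustrative sketch, not the actual tool's implementation:

```python
import itertools
import random

def threshold_can_fit(points, labels):
    """Check whether some threshold t and polarity realize the labeling."""
    ordered = sorted(points)
    # candidate thresholds: below all points, between neighbors, above all
    thresholds = [ordered[0] - 1.0]
    thresholds += [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    thresholds += [ordered[-1] + 1.0]
    for t in thresholds:
        for polarity in (True, False):
            pred = [int((x > t) == polarity) for x in points]
            if pred == list(labels):
                return True
    return False

def memory_equivalent_capacity(max_n=5, seed=0):
    """Largest n such that ALL 2^n labelings of n random 1-D points fit."""
    rng = random.Random(seed)
    mec = 0
    for n in range(1, max_n + 1):
        points = [rng.random() for _ in range(n)]
        if all(threshold_can_fit(points, labels)
               for labels in itertools.product((0, 1), repeat=n)):
            mec = n
        else:
            break
    return mec

print(memory_equivalent_capacity())  # 2: a 2-parameter unit yields 2 bits
```

The result matches rule 2 below: a perceptron's memory capacity is its number of parameters (including bias) in bits.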
Critical Points: Perceptron (Cover, MacKay)
N = K: VC dimension (for points in random position)
N = 2K: Cover/MacKay information capacity
Source: D. MacKay: Information Theory, Inference and Learning
From a Perceptron to Perceptron Networks
Source: Wikipedia
Careful: Other Architectures
Typical MLP vs. shortcut network
Source: R. Rojas, Intro to Neural Networks
Example solutions to XOR
Solution: Calculate in bits!
Assume: yi, xi ∈ {0, 1}, xi uniformly distributed. n bits of memory: f(x1, …, xn) = x1, …, xn (identity function).
Machine learner as binary classifier: f(x1, …, xn) = y1; multi-class/regression: f(x1, …, xn) = y1, …, ym.
Memory Equivalent Capacity: The number of configurations of uniformly distributed x1, …, xn that a machine learner can guarantee to label correctly.
Memory Equivalent Capacity for Neural Networks
1) The output of a perceptron is maximally 1 bit.
2) The maximum memory capacity of a perceptron is the number of parameters (including bias) in bits. (MacKay 2003)
3) The maximum memory capacity of perceptrons in parallel is additive. (MacKay 2003 speculative, Friedland and Krell 2017)
4) The maximum memory capacity of a layer of perceptrons depending on a previous layer of perceptrons is limited by the maximum output (in bits) of the previous layer. (Data Processing Inequality, Tishby 2012)
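The four rules translate directly into a few lines of Python. This is a sketch for fully connected feed-forward stacks only (shortcut connections need per-neuron accounting); `mlp_max_mec` is our name, and `layers` lists the input width followed by each layer's neuron count:

```python
def mlp_max_mec(layers):
    """Upper bound on memory-equivalent capacity (bits) of a fully
    connected feed-forward network, following the four rules above.
    layers = [inputs, hidden..., outputs]."""
    total = 0
    for k in range(1, len(layers)):
        params = layers[k] * (layers[k - 1] + 1)  # weights + bias per neuron
        if k == 1:
            bits = params                      # rules 2 + 3: params in bits, additive
        else:
            bits = min(layers[k - 1], params)  # rules 1 + 4: <=1 bit per upstream neuron
        total += bits
    return total

print(mlp_max_mec([2, 1]))        # 3 bits (single perceptron)
print(mlp_max_mec([2, 2, 1]))     # 2*3 + min(2,3) = 8 bits
print(mlp_max_mec([2, 2, 2, 1]))  # 2*3 + min(2,6) + min(2,3) = 10 bits
```

The three printed values reproduce the worked examples on the following slides.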
Examples: How many bits of maximal capacity?
[Diagram: a single perceptron with inputs x1, x2, weights w1, w2, and bias b: 3 bits. A network with a hidden layer of two perceptrons (weights w1–w4, biases b1, b2) and an output perceptron (weights w5, w6): 2·3 bits + min(2, 3) bits = 8 bits.]
Examples: How many bits of maximal capacity?
[Diagram: a network with inputs x1, x2, two hidden layers of two perceptrons each (weights w1–w8, biases b1–b4), and an output perceptron (weights w9, w10, bias b5): 2·3 bits + min(2, 2·3) bits + min(2, 3) bits = 10 bits.]
Examples: How many bits of maximal capacity?
[Diagram: shortcut or ResNet-style network with inputs x1, x2, one hidden perceptron (weights w1, w2, bias b1), and an output perceptron fed by the hidden unit and both inputs directly (weights w3, w4, w5, bias b2): 3 bits + 4 bits = 7 bits.]
Characteristic Curve of a Theoretical 3-Layer MLP
Characteristic Curve of an Actual 3-Layer MLP
Python scikit-learn, 3-layer MLP
Predicting Capacity Requirements
Given data and labels: how much actual capacity do I need to memorize the function?
Idea:
1) Worst case: build a memorization network where only the biases are trained.
2) Expected case: how much parameter reduction can (exponential) training buy us?
Predicting Maximum Memory Equivalent Capacity
[Diagram: "dumb" memorization network. Inputs x1, x2 feed a wide hidden layer of threshold units with fixed weights of 1 and trained biases b1, b2, …, bm×n; the hidden outputs are combined with fixed ±1 weights into a single output.]
Runtime: O(n log n)
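The worst-case estimate can be sketched as follows. The sum-based 1-D projection and the final capacity formula are our assumptions for illustration (the actual construction may project differently); the O(n log n) cost is the sort:

```python
def estimate_memorization_capacity(rows, labels):
    """Sketch of the worst-case capacity estimate: project instances to
    1-D, sort them (the O(n log n) step), and count label transitions.
    Each transition costs one threshold neuron whose bias separates the
    two classes; capacity then follows the layer rules above."""
    projected = [sum(r) for r in rows]                 # naive 1-D projection
    order = sorted(range(len(rows)), key=projected.__getitem__)
    thresholds = max(1, sum(labels[a] != labels[b]
                            for a, b in zip(order, order[1:])))
    d = len(rows[0])
    # hidden layer of `thresholds` neurons (d weights + bias each), plus an
    # output neuron limited to the hidden layer's output bits
    return thresholds * (d + 1) + min(thresholds, thresholds + 1)

# XOR needs two thresholds: 2*(2+1) + min(2, 3) = 8 bits
print(estimate_memorization_capacity([[0, 0], [0, 1], [1, 0], [1, 1]],
                                     [0, 1, 1, 0]))
```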
Predicting Expected Minimum Memory Equivalent Capacity
Dumb network:
• Highly inefficient.
• Potentially not 100% accurate (hash collisions).
• We can assume that training the weights (and biases) gets 100% accuracy while reducing parameters.
Expected reduction: exponential! n thresholds should be representable with log2 n weights and biases (search tree!).
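The reduction itself is easy to quantify (a sketch; the helper name is ours):

```python
from math import ceil, log2

def expected_thresholds(n):
    """The n thresholds of the dumb network should compress to about
    log2(n) trained weights and biases (the search-tree argument)."""
    return max(1, ceil(log2(n)))

print(expected_thresholds(1024))  # 10: exponential reduction from 1024
```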
Empirical Results
All results repeatable at: https://github.com/fractor/nntailoring
Training
▪ Everything we did so far assumes perfect training, that is, training that is guaranteed to reach the global minimum error.
▪ Perfect training requires exponential time.
▪ Imperfect training means the Memory Equivalent Capacity is effectively reduced.
▪ How do we measure that?
From Memorization to Generalization
Good news:
• Real-world data is not random.
• The information capacity of a perceptron is usually > 1 bit per parameter (Cover, MacKay).
This means we should be able to use fewer parameters than predicted by memory capacity calculations.
Memorization is worst-case generalization.
Suggested Engineering Process for Generalization
• Start at the approximate expected capacity.
• Train to > 98% accuracy. If impossible, increase parameters.
• Retrain iteratively with decreased capacity while testing against the validation set. You should see a decrease in training accuracy with an increase in validation-set accuracy.
• Stop at the minimum capacity with the best held-out-set accuracy.
Best-case scenario: as parameters are reduced, the neural network fails to memorize only the insignificant (noise) bits.
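The process can be sketched as a generic loop; `train` and `evaluate` are hypothetical callbacks supplied by the practitioner, and halving the parameter count per round is just one possible schedule:

```python
def tune_capacity(train, evaluate, start_params, min_params=1):
    """Iteratively retrain with decreasing capacity, keeping the
    parameter count that maximizes validation accuracy."""
    best_params, best_val = start_params, float("-inf")
    params = start_params
    while params >= min_params:
        model = train(params)       # retrain at this capacity
        val_acc = evaluate(model)   # test against the validation set
        if val_acc > best_val:
            best_params, best_val = params, val_acc
        params //= 2                # reduce capacity each round
    return best_params, best_val

# Stub callbacks whose validation accuracy peaks at 4 parameters:
print(tune_capacity(lambda p: p, lambda m: 1.0 - abs(m - 4) / 16, 16))
```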
Generalization Process: Expected Curve
Overcapacity Machine Learning: Issues
▪ Waste of money, energy, and time. Bad for the environment.
▪ Fewer parameters ⟹ a better generalization rule ⟹ higher adaptation per parameter ⟹ a higher chance that an unseen instance is predicted correctly.
▪ Fewer parameters give a higher chance of explainability (Occam's Razor). See: G. Friedland, A. Metere: "Machine Learning for Science", UQ SciML Workshop, Los Angeles, June 2018.
Reminder: Occam's Razor
Among competing hypotheses, the one with the fewest assumptions should be selected.
For each accepted explanation of a phenomenon, there may be an extremely large, perhaps even incomprehensible, number of possible and more complex alternatives, because one can always burden failing explanations with ad hoc hypotheses to prevent them from being falsified; therefore, simpler theories are preferable to more complex ones because they are more testable. (Wikipedia, Sep. 2017)
General Generalization
▪ Binary classifier (repeat): G = #correctly classified instances / Memory Equivalent Capacity  [bits/bit]
▪ Multi-class/regression: G = #correctly classified instances / #instances that can be memorized
Non-Statistical Definition (Literature)
Informally: when do two different inputs lead to the same machine learner output?
That is, which bits can be ignored in the comparison?
Statistical equivalent: how many bits per bit can be ignored on average (see the G measure).
Demo: Experimental Design for TensorFlow
http://tfmeter.icsi.berkeley.edu