Deep Learning For Practitioners, lecture 2: Selecting the right applications for deep learning

Deep Learning For Practitioners Lecture 2: Which applications benefit from deep learning?

Anantharaman Narayana Iyer [email protected]

17th June 2014

Note: Notes that contain code examples for these slides and detailed analysis will be published separately later.

Review of previous lecture

•  Deep learning, as a major machine learning discipline, has received phenomenal attention of late due to:
   –  Breakthrough results reported by the research community for certain classes of applications, bettering the current state of the art
   –  Substantial investments by technology companies such as Google, Facebook, Microsoft and IBM
•  While there is no single unique architecture, deep networks are typically built using some variant of autoencoders or Restricted Boltzmann Machines, with the key characteristics of:
   –  Deep architecture: multiple layers performing complex, nonlinear computations, cascading the layerwise outputs.
   –  Automated feature extraction: each layer produces as its output an abstracted form of its inputs (e.g. edges from raw pixels). One may add a classifier layer (e.g. an SVM) on top of the abstracted features and view the classification as being done on the most abstract features automatically generated by the system. (An example with code is illustrated in the next lecture.)
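The cascading of layerwise outputs described above can be sketched in a few lines. This is only an illustration, not the example promised for the next lecture: the layer sizes, the tanh nonlinearity, and the helper names (`make_layer`, `forward_layer`, `forward`) are assumptions, and the weights are random rather than learned by an autoencoder or RBM.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create a weight matrix (n_out rows of n_in weights) and a bias vector."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward_layer(inputs, layer):
    """Apply one nonlinear layer: tanh(W x + b)."""
    weights, biases = layer
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Cascade the layers and return the output of every layer."""
    outputs = []
    current = inputs
    for layer in layers:
        current = forward_layer(current, layer)
        outputs.append(current)
    return outputs


rng = random.Random(0)
layer_sizes = [16, 8, 4]  # hypothetical: 16 raw inputs -> 8 -> 4 features
layers = [make_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

raw_input = [rng.random() for _ in range(layer_sizes[0])]
layer_outputs = forward(raw_input, layers)
top_features = layer_outputs[-1]  # what a classifier layer (e.g. an SVM) would consume

print([len(out) for out in layer_outputs])  # sizes of the layerwise outputs
```

In a real system the weights of each layer would be learned (for instance through unsupervised pretraining), and `top_features` would be passed to the classifier in place of hand-designed features.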

Looking through the practitioner’s prism

•  To address real world problems, practitioners need to be aware of where deep learning yields the best results, its practical considerations and limitations, and when not to use it.
•  This requires looking at the research results and other claims from a practical perspective and staying clear of common misconceptions.

“If all you have is a hammer, everything looks like a nail”

•  Deep learning has proved its potential in some application domains (e.g. Computer Vision, Speech Recognition) and holds early promise in several other areas (e.g. Natural Language Processing), but it is not a universal tool that provides the best result for “any” AI task.
•  When does it have the potential to perform best?
   –  When the structure of the problem being solved naturally maps to a multi-layer architecture
      •  If the problem we are trying to solve can be decomposed into processing hierarchical abstract features, and these features are derivable from the input data through a set of potentially nonlinear transformations, a deep learning based solution might be effective.
      •  As a corollary, problems that don’t exhibit a multi-layer structure may not see much incremental benefit compared to traditional methods.
   –  Data availability
      •  While traditional architectures require expert-designed features, deep learning systems automatically learn these features, given the raw input.
      •  In order to learn the features, extensive unsupervised pretraining using large volumes of data is often required. Hence any advanced solution based on deep learning is likely to require the availability of such data.

“More data or better models?”

•  Data vs. algorithm: research shows that as a system is trained with more data, its performance asymptotically approaches the same level regardless of the model.
•  One may be led to believe that shallow networks trained with huge data might equal the performance of deep networks.
   –  Unfortunately, much of the available data on the web is unlabeled, and without an effective unsupervised training model that data is not useful. Deep networks with an unsupervised pretraining phase can leverage the data better.
•  Another notion could be that any algorithm or model selection for a deep network is good enough if you give it a huge volume of data.
   –  Choosing an optimal algorithm and design is critical, as deep networks are resource heavy due to multiple layers and weights. A good intuition for the problem structure is important for making the right model choices.
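To make "resource heavy due to multiple layers and weights" concrete, the sketch below counts the parameters of a fully connected network. The layer sizes and the helper name `count_parameters` are hypothetical, chosen only for illustration; real deep architectures may share or tie weights and so count differently.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total


# Hypothetical sizes: a 10,000-dimensional input (for instance a flattened
# 100 x 100 pixel image) and 10 output classes.
shallow = [10_000, 500, 10]
deep = [10_000, 500, 500, 500, 10]

print(count_parameters(shallow))  # 5005510
print(count_parameters(deep))     # 5506510
```

Under these assumed sizes the first layer dominates the count, which is one reason the choice of input dimensionality and layer widths deserves as much thought as the number of layers.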

Automated Feature Learning and data preprocessing

Though deep learning systems extract features automatically, the task of data preprocessing is still non-trivial.

–  The input data should be complete enough so that the features relevant for the given problem can be extracted.
   •  Consider the example of detecting anomalies in the operation of a nuclear reactor. The input given to a deep learning system should include signals from all the relevant sensors; missing any of them may result in inadequate performance.
–  The optimum size of the input data adequate for the job needs to be determined.
   •  Suppose we need to perform face detection, given the input images. What should the right input size be? Should it be 10 x 10 or 100 x 100 pixels? High dimensionality increases the number of model parameters substantially, requiring more compute resources.
–  The input vector representation must be determined.
   •  E.g., for an NLP problem, words from a vocabulary V may be represented in “one-hot” form, where each word in V is represented by a position. Here, the number of features for a given word w equals the size of the vocabulary |V|, and a sentence with k words will be represented as a k * |V| sized input vector. When the vocabulary becomes large (say over 10000 words), this representation increases the dimensionality substantially.
–  For many problems, data cleaning and preprocessing are still required.
   •  E.g., for many NLP problems, better performance may be obtained more easily through some preprocessing steps (such as stopword removal, stemming etc.) than by letting the deep learning system handle the data in its raw form.
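The one-hot representation and the effect of preprocessing on its size can be sketched as follows. Everything here is an assumption made for illustration: the two-sentence corpus, the tiny `STOPWORDS` set, and the crude `naive_stem` suffix stripper, which merely stands in for a real stemmer. The sketch shows only how the k * |V| input size shrinks; it says nothing about the resulting model accuracy.

```python
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "on"}  # hypothetical, tiny list


def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(sentence):
    """Lowercase, split on whitespace, drop stopwords, then stem."""
    tokens = sentence.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def one_hot_sentence(tokens, vocabulary):
    """Concatenate one |V|-sized one-hot vector per token (k * |V| values)."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = []
    for token in tokens:
        row = [0] * len(vocabulary)
        row[index[token]] = 1
        vector.extend(row)
    return vector


corpus = ["the cat is chasing the mice", "a cat chased the mouse in the garden"]

raw_tokens = [s.lower().split() for s in corpus]
raw_vocab = sorted({t for tokens in raw_tokens for t in tokens})

clean_tokens = [preprocess(s) for s in corpus]
clean_vocab = sorted({t for tokens in clean_tokens for t in tokens})

raw_vector = one_hot_sentence(raw_tokens[0], raw_vocab)
clean_vector = one_hot_sentence(clean_tokens[0], clean_vocab)

print(len(raw_vocab), len(raw_vector))      # 10 60  (k = 6, |V| = 10)
print(len(clean_vocab), len(clean_vector))  # 5 15   (k = 3, |V| = 5)
```

On this toy corpus the first sentence shrinks from a 60-dimensional to a 15-dimensional input; with a vocabulary of over 10000 words, the same k * |V| arithmetic gives tens of thousands of inputs per sentence.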

