Learning with Linear Neurons
Adapted from lectures by Geoffrey Hinton and others. Updated by N. Intrator, May 2007.

Prehistory
W.S. McCulloch & W. Pitts (1943). A logical calculus of the ideas immanent in nervous activity, Bulletin of Mathematical Biophysics, 5, 115-137.
[Figure: a threshold unit computing x AND y. The inputs x and y each enter with weight +1 and a constant input enters with weight -2; the unit outputs 1 if the sum x + y - 2 crosses the threshold, and 0 otherwise.]

Applications include investment management (a fund of >US$1b; the Deere & Co. pension fund, US$100m), forecasting weather patterns and earthquakes, speech technology (verification and generation), and medicine (predicting heart attacks from EKGs and mental illness from EEGs).

Advantages of Using ANNs
- Works well with large sets of noisy data, in domains where experts are unavailable or there are no known rules.
- Simple to use as a tool.
- A universal approximator that does not impose a structure on the data.
- Rules can be extracted from a trained network.
- Learns and adapts; does not require an expert or a knowledge engineer.
- Well suited to non-linear problems.
- Fault tolerant.

Problem with the Perceptron
A perceptron can only learn linearly separable tasks; it cannot solve any linearly non-separable problem, however interesting. The simplest non-separable function is exclusive-or (XOR):

    X1  X2  Output
    0   0   0
    0   1   1
    1   0   1
    1   1   0

The Fall of the Perceptron
Marvin Minsky & Seymour Papert (1969). Perceptrons, MIT Press, Cambridge, MA.
Before long, researchers had begun to discover the perceptron's limitations: unless input categories were linearly separable, a perceptron could not learn to discriminate between them. Unfortunately, many important categories are not linearly separable. For example, the inputs to an XOR gate that give an output of 1 (namely 10 and 01) are not linearly separable from those that do not (00 and 11).
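A quick way to see that XOR is not linearly separable is an exhaustive search: no single threshold unit over a grid of weights and thresholds reproduces the table above. A minimal sketch (the grid and names are my own, not from the slides):

```python
import itertools

# XOR truth table from the slide: (x1, x2) -> target
xor_cases = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def implements_xor(w1, w2, theta):
    # A single threshold unit: output 1 iff w1*x1 + w2*x2 >= theta
    return all(((w1 * x1 + w2 * x2) >= theta) == bool(t)
               for (x1, x2), t in xor_cases)

grid = [i / 4 for i in range(-20, 21)]          # values -5.0 ... 5.0
found = any(implements_xor(w1, w2, th)
            for w1, w2, th in itertools.product(grid, repeat=3))
print(found)  # False: no linear threshold unit computes XOR
```

The failure is not an artifact of the grid: the four cases imply w1 + w2 >= theta, theta > 0, w1 < theta and w2 < theta, which is contradictory for any real weights.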
The Fall of the Perceptron (continued)
[Figure: "successful" vs. "unsuccessful" people plotted against many vs. few hours in the gym per week; footballers and academics occupy diagonally opposite corners.]
In this example, a perceptron would not be able to discriminate between the footballers and the academics, despite the simplicity of their relationship: Academics = Successful XOR Gym. This failure caused the majority of researchers to walk away.

The simple XOR example masks a deeper problem.
[Figure: four line drawings (1-4) of shapes; the perceptron receives input only from dashed circles around the two ends of each shape.]
Consider a perceptron classifying shapes as connected or disconnected, taking inputs from the dashed circles in drawing 1. In going from 1 to 2, the change at the right-hand end alone must be sufficient to change the classification (to push the linear sum through the threshold). Similarly, the change at the left-hand end alone must be sufficient to change the classification. Therefore changing both ends must take the sum even further across the threshold, giving the wrong classification. The problem arises because, with a single layer of processing, local knowledge cannot be combined into global knowledge. So add more layers...

The Perceptron Controversy
There is no doubt that Minsky and Papert's book blocked the funding of research in neural networks for more than ten years. The book was widely interpreted as showing that neural networks are basically limited and fatally flawed. What IS controversial is whether Minsky and Papert themselves shared and/or promoted this belief. Following the rebirth of interest in artificial neural networks, Minsky and Papert claimed that they had not intended such a broad interpretation of the conclusions they reached in Perceptrons. However, Jianfeng was present at MIT in 1974 and reached a different conclusion on the basis of the internal reports circulating at MIT. What were Minsky and Papert actually saying to their colleagues in the period after the publication of their book?

Minsky and Papert describe a neural network with a hidden layer as follows:
GAMBA PERCEPTRON: A number of linear threshold systems have their outputs connected to the inputs of a linear threshold system.
Thus we have a linear threshold function of many linear threshold functions. Minsky and Papert then state: Virtually nothing is known about the computational capabilities of this latter kind of machine. We believe that it can do little more than can a low-order perceptron. (This, in turn, would mean, roughly, that although they could recognize some relations between the points of a picture, they could not handle relations between such relations to any significant extent.) That we cannot understand the Gamba perceptron mathematically very well is, we feel, symptomatic of the early state of development of elementary computational theories.

The Connectivity of a Perceptron
- The input is recoded using hand-picked features that do not adapt.
- Only the last layer of weights is learned.
- The output units are binary threshold neurons and are learned independently.
[Figure: input units feed non-adaptive hand-coded features, which feed binary threshold output units.]

Binary Threshold Neurons
McCulloch-Pitts (1943): first compute a weighted sum of the inputs from other neurons, then output a 1 if the weighted sum exceeds the threshold θ:

    z = Σ_i w_i x_i
    y = 1 if z > θ, 0 otherwise

The Perceptron Convergence Procedure
- Add an extra component with value 1 to each input vector. The bias weight on this component is minus the threshold, so we can now forget the threshold.
- Pick training cases using any policy that ensures that every training case will keep getting picked.
- If the output unit is correct, leave its weights alone.
- If the output unit incorrectly outputs a zero, add the input vector to the weight vector.
- If the output unit incorrectly outputs a one, subtract the input vector from the weight vector.
This is guaranteed to find a suitable set of weights if any such set exists.

Weight Space
Imagine a space in which each axis corresponds to a weight; a point in this space is then a weight vector. Each training case defines a plane, and on one side of that plane the output is wrong.
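The convergence procedure above fits in a few lines of code. Here it learns the linearly separable AND function (the training setup and names are my own illustration):

```python
def train_perceptron(cases, epochs=25):
    w = [0.0, 0.0, 0.0]                  # last weight acts as the bias
    for _ in range(epochs):
        for x, target in cases:
            xv = list(x) + [1.0]         # extra component fixed at 1
            y = 1 if sum(wi * xi for wi, xi in zip(w, xv)) > 0 else 0
            if y == 0 and target == 1:   # wrongly output zero: add input
                w = [wi + xi for wi, xi in zip(w, xv)]
            elif y == 1 and target == 0: # wrongly output one: subtract input
                w = [wi - xi for wi, xi in zip(w, xv)]
    return w

# AND is linearly separable, so the procedure is guaranteed to converge.
and_cases = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_cases)
preds = [1 if sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])) > 0 else 0
         for x, _ in and_cases]
print(preds)  # [0, 0, 0, 1]
```

Run on XOR instead, the same loop would cycle forever, which is exactly the limitation discussed above.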
To get all training cases right, we need to find a point on the right side of all the planes.
[Figure: weight space showing an input vector's constraint plane through the origin, with good weight vectors on one side and bad ones on the other.]

Why the Learning Procedure Works
Consider the squared distance between any satisfactory weight vector and the current weight vector. Every time the perceptron makes a mistake, the learning algorithm moves the current weight vector towards all satisfactory weight vectors (unless it crosses the constraint plane). So consider the generously satisfactory weight vectors that lie within the feasible region by a margin at least as great as the largest update. Every time the perceptron makes a mistake, the squared distance to all of these weight vectors decreases by at least the squared length of the smallest update vector.

What Perceptrons Cannot Do
Binary threshold output units cannot even tell whether two single-bit numbers are the same!
Same: (1,1) → 1; (0,0) → 1. Different: (1,0) → 0; (0,1) → 0.
These four cases give a set of inequalities that is impossible to satisfy:

    w1 + w2 ≥ θ,   0 ≥ θ,   w1 < θ,   w2 < θ

(The last two give w1 + w2 < 2θ, so the first forces θ > 0, contradicting 0 ≥ θ.)
[Figure: data space with the points (0,0), (0,1), (1,0), (1,1); the positive and negative cases cannot be separated by a plane.]

What Can Perceptrons Do?
They can only solve tasks if the hand-coded features convert the original task into a linearly separable one. How difficult is this? The N-bit parity task requires N features of the form "are at least m bits on?", and each such feature must look at all the components of the input. The 2-D connectedness task requires an exponential number of features!

The N-Bit Even Parity Task
There is a simple solution that requires N hidden units. Each hidden unit computes whether more than m of the inputs are on, which is a linearly separable problem. There are many variants of this solution. It can be learned, and it generalizes well if ...
[Figure: a parity network whose hidden units compute >0, >1, >2, >3 over the input bits, with output weights -2, +2, -2, +2 and a bias of +1.]

Why Connectedness Is Hard to Compute
Even for simple line drawings, there are exponentially many cases.
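The N-hidden-unit parity construction above can be checked directly. A sketch assuming the weights shown in the figure (alternating -2, +2 output weights plus a +1 bias; the function names are mine):

```python
from itertools import product

def even_parity_net(bits):
    n, s = len(bits), sum(bits)
    hidden = [1 if s > m else 0 for m in range(n)]      # unit m: "more than m bits on"
    w = [2 if m % 2 else -2 for m in range(n)]          # output weights -2, +2, -2, ...
    z = sum(wm * hm for wm, hm in zip(w, hidden)) + 1   # +1 bias
    return 1 if z > 0 else 0                            # 1 iff an even number of bits on

ok = all(even_parity_net(bits) == (sum(bits) % 2 == 0)
         for n in (1, 2, 3, 4, 5)
         for bits in product([0, 1], repeat=n))
print(ok)  # True
```

With s bits on, the first s hidden units fire, so the weighted sum alternates between +1 (s even) and -1 (s odd), which is why the alternating weights work for any N.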
Removing one segment can break connectedness, but whether it does depends on the precise arrangement of the other pieces. Unlike parity, there are no simple summaries of the other pieces that tell us what will happen. Connectedness is easy to compute with an iterative algorithm:
- Start anywhere in the ink.
- Propagate a marker.
- See if all the ink gets marked.

Distinguishing T from C in Any Orientation and Position
What kind of features are required to distinguish two different patterns of five pixels, independent of position and orientation? Do we need to replicate T and C templates across all positions and orientations? Looking at pairs of pixels will not work, but looking at triples will work if we assume that each input image contains only one object. Replicate the following two feature detectors in all positions:
[Figure: two three-pixel feature detectors, each with two excitatory (+) pixels and one inhibitory (-) pixel.]
If any of these detectors reaches its threshold of 2, it is a C; if not, it is a T.

Beyond Perceptrons
- We need to learn the features, not just how to weight them to make a decision. This is a much harder task, and we may need to abandon guarantees of finding optimal solutions.
- We need to make use of recurrent connections, especially for modeling sequences. The network needs a memory (in the activities) for events that happened some time ago, and we cannot easily put an upper bound on this time. Engineers call this an Infinite Impulse Response system. Long-term temporal regularities are hard to learn.
- We need to learn representations without a teacher. This makes it much harder to define what the goal of learning is.

Beyond Perceptrons (continued)
We need to learn complex hierarchical representations for structures like "John was annoyed that Mary disliked Bill." We need to apply the same computational apparatus to the embedded sentence as to the whole sentence. This is hard if we are using special-purpose hardware in which the activities of hardware units are the representations and the connections between hardware units are the program.
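The marker-propagation algorithm above is just a flood fill. A minimal sketch on a binary grid (the grid representation and names are my own choices):

```python
from collections import deque

def is_connected(grid):
    # Collect the "ink": coordinates of all 1-pixels.
    ink = {(r, c) for r, row in enumerate(grid)
                  for c, v in enumerate(row) if v}
    if not ink:
        return True
    start = next(iter(ink))                  # start anywhere in the ink
    marked, frontier = {start}, deque([start])
    while frontier:                          # propagate the marker
        r, c = frontier.popleft()
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nb in ink and nb not in marked:
                marked.add(nb)
                frontier.append(nb)
    return marked == ink                     # did all the ink get marked?

print(is_connected([[1, 1, 0],
                    [0, 1, 0],
                    [0, 1, 1]]))   # True
print(is_connected([[1, 0, 1],
                    [0, 0, 0],
                    [1, 0, 1]]))   # False
```

The contrast with a perceptron is the point: this sequential algorithm is trivial, while a single layer of fixed features needs exponentially many of them.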
We must somehow traverse deep hierarchies using fixed hardware and share knowledge between levels.

Sequential Perception
We need to attend to one part of the sensory input at a time, since we only have high resolution in a tiny region. Vision is a very sequential process (but the scale varies). We do not do high-level processing of most of the visual input (lack of motion tells us nothing has changed). Segmentation and the sequential organization of sensory processing are often ignored by neural models.

Segmentation Is a Very Difficult Problem
Segmenting a figure from its background seems very easy because we are so good at it, but it is actually very hard. Contours sometimes have imperceptible contrast, yet we still perceive them. Segmentation often requires a lot of top-down knowledge.

Fisher Linear Discrimination
Reduce the problem from multi-dimensional to one-dimensional:
- Let v be a vector in our space.
- Project the data onto the vector v.
- Estimate the scatter of the data as projected onto v.
- Use this v to create a classifier.
Suppose we are in a 2-D space: which of three candidate vectors is an optimal v?
[Figure: two classes of 2-D data points with three candidate projection directions.]
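As a sketch of this projection idea: for two classes the standard Fisher solution is v ∝ W⁻¹(m₁ − m₂), where W is the within-class scatter matrix defined below. On toy data invented for illustration:

```python
import numpy as np

X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])   # class 1 samples
X2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])   # class 2 samples
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter W = W1 + W2
W = sum(np.outer(x - m1, x - m1) for x in X1) \
  + sum(np.outer(x - m2, x - m2) for x in X2)

v = np.linalg.solve(W, m1 - m2)        # Fisher direction, up to scale

# On this data the projections of the two classes onto v do not overlap.
p1, p2 = X1 @ v, X2 @ v
print(p1.max() < p2.min() or p2.max() < p1.min())  # True
```

Any rescaling of v gives the same classifier, since only the relative positions of the projections matter.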
Fisher Linear Discrimination
The optimal vector maximizes the ratio of the between-group sum of squares to the within-group sum of squares:

    J(v) = (vᵀ B v) / (vᵀ W v)

where B is the between-class scatter matrix and W is the within-class scatter matrix.

Suppose there are two classes.
- Mean of the samples of class i: m_i = (1/n_i) Σ_{x∈X_i} x
- Mean of the projected samples: μ_i = (1/n_i) Σ_{y∈Y_i} y = (1/n_i) Σ_{x∈X_i} vᵀx = vᵀ m_i
- Scatter of the projected samples: s_i² = Σ_{y∈Y_i} (y − μ_i)²
- Criterion function: J(v) = (μ₁ − μ₂)² / (s₁² + s₂²)

The criterion function should be maximized, so present J as a function of the vector v:

    W_i = Σ_{x∈X_i} (x − m_i)(x − m_i)ᵀ,   W = W₁ + W₂
    s_i² = Σ_{x∈X_i} (vᵀx − vᵀm_i)² = vᵀ W_i v,   so   s₁² + s₂² = vᵀ W v
    B = (m₁ − m₂)(m₁ − m₂)ᵀ
    (μ₁ − μ₂)² = (vᵀm₁ − vᵀm₂)² = vᵀ B v
    J(v) = (vᵀ B v) / (vᵀ W v)

The matrix version of the criterion works the same way for more than two classes. J(v) is maximized when Bv = λWv (a generalized eigenvalue problem).

Classification of a new observation x: assign x to the class whose mean vector is closest to x in terms of the discriminant variables; in other words, the class whose mean's projection onto v is closest to the projection of x onto v.

A Regularized Fisher LDA
S_w can be singular, making its inversion inaccurate. Regularize as in ridge regression:

    S_w^reg = S_w + λI

This adds to the diagonal (the standard deviations) a compensation for the noise. Choosing λ as a percentile of the eigenvalues of S_w gives λ-FLDA.

Linear Regression
Given examples (x_i, y_i), i = 1, …, n, predict the output y for a new point x.
[Figure: scatter plots of temperature data with a fitted regression line and a prediction at a new point.]

Ordinary Least Squares (OLS)
The error, or residual, of an example is the difference between the observation y_i and the prediction wᵀx_i. Minimize the sum squared error

    E(w) = Σ_i (y_i − wᵀ x_i)²

Setting the gradient to zero gives a linear equation for each component of w, i.e. a linear system (X is the n × d matrix whose rows are the examples):

    (Xᵀ X) w = Xᵀ y

Solve the system (it is better not to invert the matrix).

LMS Algorithm (Least Mean Squares)
Update the weights after each example:

    w ← w + α (y_t − wᵀ x_t) x_t

where α is the learning rate. This is an online algorithm.
Beyond Lines and Planes
Everything is the same with basis functions: replace each input x by a vector of features φ(x). The model wᵀφ(x) is still linear in w.
[Figure: a curved fit to 1-D data.]

Geometric interpretation [Matlab demo]
[Figure: geometric interpretation of least squares.]

Ordinary Least Squares (summary)
Given n examples (x_i, y_i) with x_i ∈ R^d, let X be the n × d matrix whose rows are the examples (for example, rows of basis-function features φ(x_i)), and let y be the vector of observations. Minimize ||Xw − y||² by solving (XᵀX) w = Xᵀ y. To predict at a new point x, compute ŷ = wᵀ x.

Probabilistic Interpretation
[Figure: regression data with the likelihood of the observations under the model.]
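The batch (normal equations) and online (LMS) solvers described above can be compared on a toy, noise-free line; both recover the same weights (the data and step size are my own choices):

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = 1.0 + 2.0 * x                               # noise-free target line
Phi = np.stack([np.ones_like(x), x], axis=1)    # features [1, x]; still linear in w

# Batch OLS: minimize ||Phi w - y||^2 (lstsq avoids forming an explicit inverse)
w_ols, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Online LMS: w <- w + alpha * (y_t - w . phi_t) * phi_t
w_lms = np.zeros(2)
alpha = 0.05
for _ in range(200):                            # 200 passes over the data
    for phi_t, y_t in zip(Phi, y):
        w_lms = w_lms + alpha * (y_t - w_lms @ phi_t) * phi_t

print(np.round(w_ols, 3))   # [1. 2.]
print(np.round(w_lms, 3))   # close to [1. 2.]
```

Swapping in nonlinear features (e.g. [1, x, x²]) changes nothing in either solver, which is the point of the "beyond lines and planes" slide.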