NLP in Scala with Breeze and Epic
David Hall, UC Berkeley


ScalaNLP Ecosystem

• Breeze ≈ Numpy/Scipy: linear algebra, scientific computing, optimization
• Epic ≈ PyStruct/NLTK: natural language processing, structured prediction
• Puck: a super-fast GPU parser for English

Natural Language Processing

Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap. It certainly won't get there on looks.

[Figure: constituency parse tree of the first sentence, with S, NP, VP, V, and PP nodes]

Epic for NLP

[Figure: the same example sentence and parse tree as on the previous slide]

Named Entity Recognition

Person, Organization, Location, Misc

NER with Epic

> import epic.models.NerSelector
> val nerModel = NerSelector.loadNer("en").get
> val tokens = epic.preprocess.tokenize("Almost 20 years ago, Bill Watterson walked away from \"Calvin & Hobbes.\"")
> println(nerModel.bestSequence(tokens).render("O"))
Almost 20 years ago , [PER:Bill Watterson] walked away from `` [LOC:Calvin & Hobbes] . ''

Not a location!

Annotate a bunch of data?

Building an NER system

> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel)
> println(system.bestSequence(tokens).render("O"))
Almost 20 years ago , [PER:Bill Watterson] walked away from `` [MISC:Calvin & Hobbes] . ''

Gazetteers

http://en.wikipedia.org/wiki/List_of_newspaper_comic_strips_A%E2%80%93F


Using your own gazetteer

> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val myGazetteer = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel, gaz = myGazetteer)

Gazetteers
• Careful with gazetteers!
• If it's built from the training data, the system will use it, and only it, to make predictions!
• So only known forms will be detected.
• Still, gazetteers can be very useful…
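As a mental model, a gazetteer is just a collection of known surface forms with labels. A minimal sketch in plain Scala (the phrase list and inGazetteer helper are illustrative assumptions, not Epic's actual Gazetteer API):

// hypothetical illustration: a gazetteer as a map from known phrases to labels
val comicStrips: Map[Seq[String], String] = Map(
  Seq("Calvin", "&", "Hobbes") -> "MISC",
  Seq("Doonesbury")            -> "MISC"
)

// an "in-gazetteer" feature fires when a candidate span matches an entry
def inGazetteer(span: Seq[String]): Boolean = comicStrips.contains(span)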

Semi-CR-What?
• Semi-Markov Conditional Random Field
• Don't worry about the name.

Semi-CRFs

score(Chez Panisse)
  + score(Berkeley, CA)
  + score(- A bowl of )
  + score(Churchill-Brenneis Orchards)
  + score(Page mandarins and medjool dates)
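The point of the decomposition above: a semi-CRF scores a whole segmentation as a sum of independent per-segment scores. A toy sketch of that structure (the Segment type and scoreSegment stub are hypothetical stand-ins, not Epic's types):

case class Segment(label: String, begin: Int, end: Int)

// score one labeled span of the sentence
def scoreSegment(words: IndexedSeq[String], seg: Segment): Double = ???

// the score of a segmentation is the sum over its segments
def scoreSegmentation(words: IndexedSeq[String], segs: Seq[Segment]): Double =
  segs.map(scoreSegment(words, _)).sum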

Features

score(Chez Panisse)
  = w(starts-with-Chez)
  + w(starts-with-C…)
  + w(ends-with-P…)
  + w(starts-sentence)
  + w(shape:Xxx Xxx)
  + w(two-words)
  + w(in-gazetteer)
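And each segment score is itself a sum: the weights of the segment's active features. A minimal sketch (the weight map and features extractor are illustrative assumptions):

// learned weights, keyed by feature name
val w: Map[String, Double] = ???

// e.g. Seq("starts-with-Chez", "shape:Xxx Xxx", "two-words", ...)
def features(span: Seq[String]): Seq[String] = ???

def score(span: Seq[String]): Double =
  features(span).map(f => w.getOrElse(f, 0.0)).sum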

Building your own features

val dsl = new WordFeaturizer.DSL[L](counts) with SurfaceFeaturizer.DSL
import dsl._

word(begin)              // word at the beginning of the span
  + word(end - 1)        // end of the span
  + word(begin - 1)      // before (gets things like Mr.)
  + word(end)            // one past the end
  + prefixes(begin)      // prefixes up to some length
  + suffixes(begin)
  + length(begin, end)   // names tend to be 1-3 words
  + gazetteer(begin, end)

Using your own featurizer

> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val myFeaturizer = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel, featurizer = myFeaturizer)

Features
• So far, we've been able to do everything with (nearly) no math.
• To understand more, we need to do some math.

Machine Learning Primer
• Training example (x, y)
  – x: a sentence of some sort
  – y: its labeled version
• Goal: want score(x, y) > score(x, y′) for all y′ ≠ y.

Machine Learning Primer

score(x, y) = wᵀ f(x, y)

Machine Learning Primer

score(x, y) = w.t * f(x, y)

Machine Learning Primer

score(x, y) = w dot f(x, y)

Machine Learning Primer

score(x, y) >= score(x, y’)

Machine Learning Primer

w dot f(x, y) >= w dot f(x, y’)


Machine Learning Primer

[Figure: geometric view of the perceptron update — the weight vector w is moved toward f(x, y) and away from f(x, y′), so that:]

(w + f(x,y)) dot f(x,y) >= w dot f(x,y)

The Perceptron

> val featureIndex: Index[Feature] = ???
> val labelIndex: Index[Label] = ???
> val weights = DenseVector.rand[Double](featureIndex.size)

> for ( epoch <- 0 until numEpochs; (x, y) <- data ) {
    val labelScores = DV.tabulate(labelIndex.size) { yy =>
      val indexed = featuresFor(x, yy)
      weights.t * new FeatureVector(indexed)
      // or: weights dot new FeatureVector(indexed)
    }
    …
  }

The Perceptron (cont'd)

> val featureIndex: Index[Feature] = ???
> val labelIndex: Index[Label] = ???
> val weights = DenseVector.rand[Double](featureIndex.size)

> for ( epoch <- 0 until numEpochs; (x, y) <- data ) {
    val labelScores = ...
    val y_best = argmax(labelScores)
    if (y != y_best) {
      weights += new FeatureVector(featuresFor(x, y))
      weights -= new FeatureVector(featuresFor(x, y_best))
    }
  }

Structured Perceptron
• Can't enumerate all segmentations! (≈ L · 2ⁿ of them)
• But dynamic programs exist to max or sum…
• …if the feature function has a nice form (see the Viterbi sketch after the next slide).

Structured Perceptron

> val featureIndex: Index[Feature] = ???
> val labelIndex: Index[Label] = ???
> val weights = DenseVector.rand[Double](featureIndex.size)

> for ( epoch <- 0 until numEpochs; (x, y) <- data ) {
    val y_best = bestStructure(weights, x)
    if (y != y_best) {
      weights += featuresFor(x, y)
      weights -= featuresFor(x, y_best)
    }
  }
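What hides inside bestStructure: when the features decompose over adjacent labels, the argmax over the exponentially many label sequences is the Viterbi dynamic program, O(n · L²) rather than Lⁿ. A generic chain sketch (the score callback is an assumed interface; Epic's actual inference is the semi-Markov variant over spans):

// score(i, prevLabel, label): local score of `label` at position i
// (prevLabel == -1 at the first position)
def viterbi(n: Int, numLabels: Int,
            score: (Int, Int, Int) => Double): IndexedSeq[Int] = {
  val chart = Array.fill(n, numLabels)(Double.NegativeInfinity)
  val back  = Array.fill(n, numLabels)(0)
  for (y <- 0 until numLabels) chart(0)(y) = score(0, -1, y)
  for (i <- 1 until n; y <- 0 until numLabels; yp <- 0 until numLabels) {
    val s = chart(i - 1)(yp) + score(i, yp, y)
    if (s > chart(i)(y)) { chart(i)(y) = s; back(i)(y) = yp }
  }
  // recover the best sequence by following backpointers from the best end label
  var best = (0 until numLabels).maxBy(y => chart(n - 1)(y))
  val labels = Array.fill(n)(0)
  labels(n - 1) = best
  for (i <- n - 1 until 0 by -1) { best = back(i)(best); labels(i - 1) = best }
  labels.toIndexedSeq
}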

Constituency Parsing

Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap.

[Figure: the constituency parse tree for this sentence, as shown earlier]

Multilingual Parser

[Bar chart: parsing accuracy (70–95) for the "Berkeley" parser vs. Epic across Avg, Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish, and Swedish]

"Berkeley": [Petrov & Klein, 2007]; Epic: [Hall, Durrett, and Klein, 2014]

Epic Pre-built Models
• Parsing
  – English, Basque, French, German, Swedish, Polish, Korean
  – (working on Arabic, Chinese, Spanish)
• Part-of-Speech Tagging
  – English, Basque, French, German, Swedish, Polish
• Named Entity Recognition
  – English
• Sentence segmentation
  – English
  – (OK support for others)
• Tokenization
  – all of the above languages

What Is Breeze?
• Dense vectors, matrices, sparse vectors, counters, matrix decompositions
• Nonlinear optimization, probability distributions

Getting started

libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.8.1",
  // optional: native linear algebra libraries; bulky, but faster
  "org.scalanlp" %% "breeze-natives" % "0.8.1"
)

scalaVersion := "2.11.1"

Linear Algebra

> import breeze.linalg._
> val x = DenseVector.zeros[Int](5)
// DenseVector(0, 0, 0, 0, 0)

> val m = DenseMatrix.zeros[Int](5,5)
> val r = DenseMatrix.rand(5,5)

> m.t    // transpose
> x + x  // addition
> m * x  // multiplication by vector
> m * 3  // by scalar
> m * m  // by matrix
> m :* m // element-wise multiplication, Matlab .*

Return Type Selection

> import breeze.linalg.{DenseVector => DV}
> val dv = DV(1.0, 2.0)
> val sv = SparseVector.zeros[Double](2)

> dv + sv
// res: DenseVector[Double] = DenseVector(1.0, 2.0)

> sv + sv
// res: SparseVector[Double] = SparseVector()

Return Type Selection

> import breeze.linalg.{DenseVector => DV}
> val dv = DV(1.0, 2.0)
> val sv = SparseVector.zeros[Double](2)

> (dv: Vector[Double]) + (sv: Vector[Double])
// res: Vector[Double] = DenseVector(1.0, 2.0)
// static type: Vector; dynamic type: Dense

> (sv: Vector[Double]) + (sv: Vector[Double])
// res: Vector[Double] = SparseVector()
// static type: Vector; dynamic type: Sparse

Linear Algebra: Slices

> val m = DenseMatrix.zeros[Int](5, 5)

> m(::, 1) // slice a column
// DenseVector(0, 0, 0, 0, 0)

> m(4, ::) // slice a row

> m(4, ::) := DV(1, 2, 3, 4, 5).t

> m.toString
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 2 3 4 5

Linear Algebra: Slices

> m(0 to 1, 3 to 4).toString
0 0
2 3

> m(IndexedSeq(3, 1, 4, 2), IndexedSeq(4, 4, 3, 1)).toString
0 0 0 0
0 0 0 0
5 5 4 2
0 0 0 0

Universal Functions

> import breeze.numerics._

> log(DenseVector(1.0, 2.0, 3.0, 4.0))
// DenseVector(0.0, 0.6931471805599453,
//             1.0986122886681098, 1.3862943611198906)

> exp(DenseMatrix((1.0, 2.0), (3.0, 4.0)))

> sin(Array(2.0, 3.0, 4.0, 42.0))

> // also cos, sqrt, asin, floor, round, digamma, trigamma, ...

UFuncs: In-Place

> import breeze.numerics._
> val v = DV(1.0, 2.0, 3.0, 4.0)

> log.inPlace(v)
// v == DenseVector(0.0, 0.6931471805599453,
//                  1.0986122886681098, 1.3862943611198906)

UFuncs: Reductions

> sum(DenseVector(1.0, 2.0, 3.0))
// 6.0
> sum(DenseVector(1, 2, 3))
// 6
> mean(DenseVector(1.0, 2.0, 3.0))
// 2.0
> mean(DenseMatrix((1.0, 2.0, 3.0),
                   (4.0, 5.0, 6.0)))
// 3.5

UFuncs: Reductions

> val m = DenseMatrix((1.0, 3.0), (4.0, 4.0))

> sum(m)  == 12.0
> mean(m) == 3.0

> sum(m(*, ::))  == DenseVector(4.0, 8.0)   // row sums
> sum(m(::, *))  == DenseMatrix((5.0, 7.0)) // column sums

> mean(m(*, ::)) == DenseVector(2.0, 4.0)
> mean(m(::, *)) == DenseMatrix((2.5, 3.5))

Broadcasting

> val dm = DenseMatrix((1.0, 2.0, 3.0),
                       (4.0, 5.0, 6.0))

> sum(dm(::, *)) == DenseMatrix((5.0, 7.0, 9.0))

> dm(::, *) + DenseVector(3.0, 4.0)
// DenseMatrix((4.0, 5.0, 6.0), (8.0, 9.0, 10.0))

> dm(::, *) := DenseVector(3.0, 4.0)
// DenseMatrix((3.0, 3.0, 3.0), (4.0, 4.0, 4.0))

UFuncs: Implementation

> object log extends UFunc

> implicit object logDouble extends log.Impl[Double, Double] {
    def apply(x: Double) = scala.math.log(x)
  }

> log(1.0) // == 0.0

> // elsewhere: magic to automatically make this work:
> log(DenseVector(1.0))  // == DenseVector(0.0)
> log(DenseMatrix(1.0))  // == DenseMatrix(0.0)
> log(SparseVector(1.0)) // == SparseVector()

UFuncs: Implementation

> object add1 extends UFunc

> implicit object add1Double extends add1.Impl[Double, Double] {
    def apply(x: Double) = x + 1.0
  }

> add1(1.0) // == 2.0

> // elsewhere: magic to automatically make this work:
> add1(DenseVector(1.0))  // == DenseVector(2.0)
> add1(DenseMatrix(1.0))  // == DenseMatrix(2.0)
> add1(SparseVector(1.0)) // == SparseVector(2.0)
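If that "magic" weren't in scope, you could register a collection-level implementation by hand. A sketch following the same Impl pattern as above (the elementwise body is our assumption; DenseVector's map is real Breeze API):

> implicit object add1DV extends add1.Impl[DenseVector[Double], DenseVector[Double]] {
    // apply the scalar rule elementwise
    def apply(v: DenseVector[Double]) = v.map(_ + 1.0)
  }

> add1(DenseVector(1.0, 2.0)) // == DenseVector(2.0, 3.0)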

Breeze Benchmarks

[Bar chart: Singular Value Decomposition, lower is better —
 Breeze: 135.04, jblas: 685.5, Mahout: 2626, Apache Commons: 2151]

Breeze Benchmarks

[Bar chart: Sparse/Dense Multiply, lower is better —
 Breeze: 26, Mahout: 2562, Apache Commons: 56, jblas: N/A]

Breeze Benchmarks

[Bar chart: Sparse/Dense Addition, lower is better —
 Breeze: 0.037, Mahout: 0.075, Apache Commons: 25.376, jblas: N/A]

Nonnegative Matrix Factorization

V ≈ W H, with Vij, Wij, Hij ≥ 0

Nonnegative Matrix Factorization
• Input: matrix V
• Output: W and H with W * H ≈ V, all entries ≥ 0

[Lee and Seung, 2001]
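The code on the next slide implements Lee and Seung's multiplicative updates. Writing :* and :/ for elementwise multiply and divide (as in the Breeze code below), the updates are:

H ← H :* (Wᵀ V) :/ (Wᵀ W H)
W ← W :* (V Hᵀ) :/ (W H Hᵀ)

Each update keeps the factors nonnegative and never increases the reconstruction error.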

Breeze NMF

val n = V.rows
val m = V.cols
val r = dims
val W = DenseMatrix.rand(n, r)
val H = DenseMatrix.rand(r, m)

for (i <- 0 until iters) {
  H :*= ((W.t * V) :/= (W.t * W * H))
  W :*= ((V * H.t) :/= (W * H * H.t))
}

(W, H)

From Breeze to Gust


Breeze NMF

val n = X.rows
val m = X.cols
val r = dims
val W = DenseMatrix.rand(n, r)
val H = DenseMatrix.rand(r, m)

for (i <- 0 until iters) {
  H :*= ((W.t * X) :/= (W.t * W * H))
  W :*= ((X * H.t) :/= (W * H * H.t))
}

(W, H)

Gust NMF

// identical to the Breeze version; only DenseMatrix became CuMatrix
val n = X.rows
val m = X.cols
val r = dims
val W = CuMatrix.rand(n, r)
val H = CuMatrix.rand(r, m)

for (i <- 0 until iters) {
  H :*= ((W.t * X) :/= (W.t * W * H))
  W :*= ((X * H.t) :/= (W * H * H.t))
}

(W, H)

≥6x speedup on my laptop!

Optimization

(x² + y − 11)² + (x + y² − 7)²

Optimization

trait DiffFunction[T] extends (T => Double) {
  /** Calculates both the value and the gradient at a point */
  def calculate(x: T): (Double, T)
}

Optimization

val df = new DiffFunction[DV[Double]] {
  def calculate(values: DV[Double]) = {
    val gradient = DV.zeros[Double](2)
    val (x, y) = (values(0), values(1))
    val value = pow(x * x + y - 11, 2) + pow(x + y * y - 7, 2)
    gradient(0) = 4 * x * (x * x + y - 11) + 2 * (x + y * y - 7)
    gradient(1) = 2 * (x * x + y - 11) + 4 * y * (x + y * y - 7)
    (value, gradient)
  }
}

Optimization

breeze.optimize.minimize(df, DV(0.0, 0.0))
// DenseVector(3.0000000000012905, 1.999999999997128)
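That objective is Himmelblau's function; (3, 2) is one of its four global minima, so the optimizer has converged to a correct answer.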

Optimization
• Fully configurable
• Support for:
  – LBFGS (+ L2/L1 regularization)
  – Stochastic gradient (+ L2/L1 regularization)
  – Truncated Newton
  – Linear programs
  – Conjugate gradient descent

Thanks!

Breeze Epic Puck

www.github.com/dlwh/{breeze, epic, puck}
