
Introduction to NumPy for Machine Learning Programmers


Page 1: Introduction to NumPy for Machine Learning Programmers

Introduction to NumPy

for Machine Learning Programmers

PyData Tokyo Meetup

April 3, 2015 @ Denso IT Laboratory

Kimikazu Kato

Silver Egg Technology

1 / 26

Page 2: Introduction to NumPy for Machine Learning Programmers

Target Audience

People who want to implement machine learning algorithms in Python.

2 / 26

Page 3: Introduction to NumPy for Machine Learning Programmers

Outline

Preliminaries

Basic usage of NumPy: Indexing, Broadcasting

Case study: Reading source code of scikit-learn

Conclusion

3 / 26

Page 4: Introduction to NumPy for Machine Learning Programmers

Who am I?

Kimikazu Kato

Chief Scientist at Silver Egg Technology
Algorithm designer for a recommendation system
Ph.D. in computer science (Master's degree in math)

4 / 26

Page 5: Introduction to NumPy for Machine Learning Programmers

Python is Very Slow!

Code in C

#include <stdio.h>
int main() {
    int i;
    double s = 0;
    for (i = 1; i <= 100000000; i++) s += i;
    printf("%.0f\n", s);
}

Code in Python

s = 0.
for i in xrange(1, 100000001):
    s += i
print s

Both of the codes compute the sum of integers from 1 to 100,000,000.

Result of benchmark in a certain environment:
Above: 0.109 sec (compiled with the -O3 option)
Below: 8.657 sec (80+ times slower!!)

5 / 26

Page 6: Introduction to NumPy for Machine Learning Programmers

Better code

import numpy as np
a = np.arange(1, 100000001)
print a.sum()

Now it takes 0.188 sec. (Measured by the "time" command in Linux, loading time included.)

Still slower than C, but sufficiently fast as a scripting language.
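For reference, one way to reproduce a comparable measurement from within Python is the timeit module (a sketch added here, not from the slides; absolute numbers depend on the environment):

import timeit

# naive Python loop vs. NumPy, each run once
loop_time = timeit.timeit("s=0.\nfor i in xrange(1,100000001): s+=i", number=1)
numpy_time = timeit.timeit("np.arange(1,100000001).sum()", setup="import numpy as np", number=1)
print loop_time, numpy_time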

6 / 26

Page 7: Introduction to NumPy for Machine Learning Programmers

Lessons

Python is very slow when written badly.
Translating C (or Java, C#, etc.) code into Python is often a bad idea.
Python-friendly rewriting sometimes results in drastic performance improvement.

7 / 26

Page 8: Introduction to NumPy for Machine Learning Programmers

Basic rules for better performance

Avoid for statements as far as possible.
Utilize libraries' capabilities instead.
Forget about the cost of copying memory.

A typical C programmer might care about it, but ...

8 / 26

Page 9: Introduction to NumPy for Machine Learning Programmers

Basic techniques for NumPy

Broadcasting
Indexing

9 / 26

Page 10: Introduction to NumPy for Machine Learning Programmers

Broadcasting

>>> import numpy as np
>>> a=np.array([0,1,2,3])
>>> a*3
array([0, 3, 6, 9])
>>> np.exp(a)
array([  1.        ,   2.71828183,   7.3890561 ,  20.08553692])

exp is called a universal function.

10 / 26

Page 11: Introduction to NumPy for Machine Learning Programmers

Broadcasting (2D)

>>> import numpy as np
>>> a=np.arange(9).reshape((3,3))
>>> b=np.array([1,2,3])
>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> b
array([1, 2, 3])
>>> a*b
array([[ 0,  2,  6],
       [ 3,  8, 15],
       [ 6, 14, 24]])
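For contrast (an extra example, not in the original slide), b can instead be broadcast along the other axis by reshaping it into a column vector:

>>> a*b.reshape((3,1))
array([[ 0,  1,  2],
       [ 6,  8, 10],
       [18, 21, 24]])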

11 / 26

Page 12: Introduction to NumPy for Machine Learning Programmers

Indexing

>>> import numpy as np
>>> a=np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> indices=np.arange(0,10,2)
>>> indices
array([0, 2, 4, 6, 8])
>>> a[indices]=0
>>> a
array([0, 1, 0, 3, 0, 5, 0, 7, 0, 9])
>>> b=np.arange(100,600,100)
>>> b
array([100, 200, 300, 400, 500])
>>> a[indices]=b
>>> a
array([100,   1, 200,   3, 300,   5, 400,   7, 500,   9])

12 / 26

Page 13: Introduction to NumPy for Machine Learning Programmers

Boolean Indexing

>>> a=np.array([1,2,3])
>>> b=np.array([False,True,True])
>>> a[b]
array([2, 3])
>>> c=np.arange(-3,4)
>>> c
array([-3, -2, -1,  0,  1,  2,  3])
>>> d = c>0
>>> d
array([False, False, False, False,  True,  True,  True], dtype=bool)
>>> c[d]
array([1, 2, 3])
>>> c[c>0]
array([1, 2, 3])
>>> c[np.logical_and(c>=0,c%2==0)]
array([0, 2])
>>> c[np.logical_or(c>=0,c%2==0)]
array([-2,  0,  1,  2,  3])

13 / 26

Page 14: Introduction to NumPy for Machine Learning Programmers

Cf. In Pandas

>>> import pandas as pd
>>> import numpy as np
>>> df=pd.DataFrame(np.random.randn(5,3),columns=["A","B","C"])
>>> df
          A         B         C
0  1.084117 -0.626930 -1.818375
1  1.717066  2.554761 -0.560069
2 -1.355434 -0.464632  0.322603
3  0.013824  0.298082 -1.405409
4  0.743068  0.292042 -1.002901

[5 rows x 3 columns]
>>> df[df.A>0.5]
          A         B         C
0  1.084117 -0.626930 -1.818375
1  1.717066  2.554761 -0.560069
4  0.743068  0.292042 -1.002901

[3 rows x 3 columns]
>>> df[(df.A>0.5) & (df.B>0)]
          A         B         C
1  1.717066  2.554761 -0.560069
4  0.743068  0.292042 -1.002901

[2 rows x 3 columns]

14 / 26

Page 15: Introduction to NumPy for Machine Learning Programmers

Case Study 1: Ridge Regression (sklearn.linear_model.Ridge)

$X$, $y$: input and output of the training data; $\alpha$: hyperparameter

$\min_w \|y - Xw\|_2^2 + \alpha \|w\|_2^2$

The optimum is given as:

$w = (X^T X + \alpha I)^{-1} X^T y$

The corresponding part of the code:

K = safe_sparse_dot(X, X.T, dense_output=True)
try:
    dual_coef = _solve_cholesky_kernel(K, y, alpha)
    coef = safe_sparse_dot(X.T, dual_coef, dense_output=True).T
except linalg.LinAlgError:

(sklearn/linear_model/ridge.py L338-343)
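As a side note, here is a minimal NumPy sketch of this closed form (not the scikit-learn implementation; ridge_closed_form is a hypothetical helper):

import numpy as np

def ridge_closed_form(X, y, alpha):
    # w = (X^T X + alpha*I)^{-1} X^T y, using solve instead of an explicit inverse
    n_features = X.shape[1]
    A = np.dot(X.T, X) + alpha * np.eye(n_features)
    return np.linalg.solve(A, np.dot(X.T, y))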

15 / 26

Page 16: Introduction to NumPy for Machine Learning Programmers

K = safe_sparse_dot(X, X.T, dense_output=True)
try:
    dual_coef = _solve_cholesky_kernel(K, y, alpha)
    coef = safe_sparse_dot(X.T, dual_coef, dense_output=True).T
except linalg.LinAlgError:

(sklearn/linear_model/ridge.py L338-343)

safe_sparse_dot is a wrapper function of dot which can be applied to sparse and dense matrices.

_solve_cholesky_kernel computes $(K + \alpha I)^{-1} y$.

Since $w = (X^T X + \alpha I)^{-1} X^T y = X^T (X X^T + \alpha I)^{-1} y = X^T (K + \alpha I)^{-1} y$, this dual computation yields the same $w$.
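A quick numerical check of this identity (an illustration added here, not part of scikit-learn):

>>> import numpy as np
>>> X = np.random.randn(5, 3); y = np.random.randn(5); alpha = 2.0
>>> w1 = np.linalg.solve(np.dot(X.T, X) + alpha*np.eye(3), np.dot(X.T, y))
>>> w2 = np.dot(X.T, np.linalg.solve(np.dot(X, X.T) + alpha*np.eye(5), y))
>>> np.allclose(w1, w2)
True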

16 / 26

Page 17: Introduction to NumPy for Machine Learning Programmers

Inside _solve_cholesky_kernel

K.flat[::n_samples + 1] += alpha[0]

try:
    dual_coef = linalg.solve(K, y, sym_pos=True, overwrite_a=False)
except np.linalg.LinAlgError:

(sklearn/linear_model/ridge.py L138-146, comments omitted)

The code above computes $(K + \alpha I)^{-1} y$.

inv should not be used; solve is faster (general knowledge in numerical computation).

flat ???
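For illustration, a small sketch (not from the slides) contrasting the two approaches; both give the same answer, but solve avoids forming the inverse explicitly:

>>> A = np.array([[4., 1.], [1., 3.]]); b = np.array([1., 2.])
>>> np.linalg.solve(A, b)
array([ 0.09090909,  0.63636364])
>>> np.dot(np.linalg.inv(A), b)
array([ 0.09090909,  0.63636364])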

17 / 26

Page 18: Introduction to NumPy for Machine Learning Programmers

flat

class flatiter(builtins.object)
 |  Flat iterator object to iterate over arrays.
 |
 |  A flatiter iterator is returned by ``x.flat`` for any array x.
 |  It allows iterating over the array as if it were a 1-D array,
 |  either in a for-loop or by calling its next method.
 |
 |  Iteration is done in C-contiguous style, with the last index varying the
 |  fastest. The iterator can also be indexed using basic slicing or
 |  advanced indexing.
 |
 |  See Also
 |  --------
 |  ndarray.flat : Return a flat iterator over an array.
 |  ndarray.flatten : Returns a flattened copy of an array.
 |
 |  Notes
 |  -----
 |  A flatiter iterator can not be constructed directly from Python code
 |  by calling the flatiter constructor.

In short, x.flat is a reference to the elements of the array x, and can be used like a one-dimensional array.
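A small example (not from the slides) of indexing through flat:

>>> x = np.arange(6).reshape((2, 3))
>>> x.flat[::4] += 10       # flat indices 0 and 4, i.e. x[0,0] and x[1,1]
>>> x
array([[10,  1,  2],
       [ 3, 14,  5]])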

18 / 26

Page 19: Introduction to NumPy for Machine Learning Programmers

K.flat[::n_samples + 1] += alpha[0]

try:
    dual_coef = linalg.solve(K, y, sym_pos=True, overwrite_a=False)
except np.linalg.LinAlgError:

(sklearn/linear_model/ridge.py L138-146, comments omitted)

K.flat[::n_samples + 1] += alpha[0]

is equivalent to

K += alpha[0] * np.eye(n_samples)

(The size of $K$ is n_samples × n_samples.)

The former is an in-place version.
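A tiny check of this equivalence (illustration only):

>>> K = np.zeros((3, 3)); n_samples = 3
>>> K.flat[::n_samples + 1] += 0.5     # in-place update of the diagonal
>>> np.allclose(K, 0.5 * np.eye(n_samples))
True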

19 / 26

Page 20: Introduction to NumPy for Machine Learning Programmers

Case Study 2: NMF (sklearn.decomposition.nmf)

NMF = Non-negative Matrix Factorization

Successful in face part detection

20 / 26

Page 21: Introduction to NumPy for Machine Learning Programmers

Idea of NMF

Approximate the input matrix as a product of two smaller non-negative matrices:

$X \approx HW^T$, where $W_{ij} \ge 0$, $H_{ij} \ge 0$

Notation

Parameter set: $\Theta = (W, H)$; $\theta_i$: the i-th element of $\Theta$

Error function: $f(\Theta) = \|X - HW^T\|_F^2$
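A minimal sketch of this error function in NumPy (the shapes below are hypothetical; only the formula comes from the slide):

>>> X = np.abs(np.random.randn(6, 4))
>>> H = np.abs(np.random.randn(6, 2)); W = np.abs(np.random.randn(4, 2))
>>> f = np.linalg.norm(X - np.dot(H, W.T)) ** 2   # squared Frobenius norm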

21 / 26

Page 22: Introduction to NumPy for Machine Learning Programmers

Algorithm of NMF

Projected gradient descent (Lin 2007):

$\Theta^{(k+1)} = P[\Theta^{(k)} - \alpha \nabla f(\Theta^{(k)})]$

where $P[x]_i = \max(0, x_i)$

Convergence condition:

$\|\nabla^P f(\Theta^{(k)})\| \le \epsilon \|\nabla^P f(\Theta^{(1)})\|$

where

$\nabla^P f(\Theta)_i = \begin{cases} \nabla f(\Theta)_i & \text{if } \theta_i > 0 \\ \min(0, \nabla f(\Theta)_i) & \text{if } \theta_i = 0 \end{cases}$

(Note: $\theta_i \ge 0$)
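A minimal sketch of one update step (projected_gradient_step and grad_f are hypothetical names; only the update rule comes from the slide):

import numpy as np

def projected_gradient_step(theta, grad_f, alpha):
    # Theta^(k+1) = P[Theta^(k) - alpha * grad f(Theta^(k))], with P[x]_i = max(0, x_i)
    return np.maximum(0.0, theta - alpha * grad_f(theta))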

22 / 26

Page 23: Introduction to NumPy for Machine Learning Programmers

Computation of $\|\nabla^P f(\Theta)\|$, where

$\nabla^P f(\Theta)_i = \begin{cases} \nabla f(\Theta)_i & \text{if } \theta_i > 0 \\ \min(0, \nabla f(\Theta)_i) & \text{if } \theta_i = 0 \end{cases}$

Code:

proj_norm = norm(np.r_[gradW[np.logical_or(gradW < 0, W > 0)],
                       gradH[np.logical_or(gradH < 0, H > 0)]])

(sklearn/decomposition/nmf.py L500-501)

norm: utility function of scikit-learn which computes the L2-norm
np.r_: concatenation of arrays
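A small numeric illustration (not from the slides) of this boolean mask for a single factor:

>>> W = np.array([0., 2., 0., 1.])
>>> gradW = np.array([3., -1., -2., 4.])
>>> gradW[np.logical_or(gradW < 0, W > 0)]   # exactly the non-zero entries of the projected gradient
array([-1., -2.,  4.])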

23 / 26

Page 24: Introduction to NumPy for Machine Learning Programmers

$\nabla^P f(\Theta)_i = \begin{cases} \nabla f(\Theta)_i & \text{if } \theta_i > 0 \\ \min(0, \nabla f(\Theta)_i) & \text{if } \theta_i = 0 \end{cases}$

means

$\nabla^P f(\Theta)_i = \begin{cases} \nabla f(\Theta)_i & \text{if } \theta_i > 0 \\ \nabla f(\Theta)_i & \text{if } \theta_i = 0 \text{ and } \nabla f(\Theta)_i < 0 \\ 0 & \text{otherwise} \end{cases}$

Code:

proj_norm = norm(np.r_[gradW[np.logical_or(gradW < 0, W > 0)],
                       gradH[np.logical_or(gradH < 0, H > 0)]])

(sklearn/decomposition/nmf.py L500-501)

gradW[np.logical_or(gradW < 0, W > 0)]

means that an element is employed if $\nabla f(\Theta)_i < 0$ or $\theta_i > 0$, and discarded otherwise.

Only non-zero elements remain after indexing.

24 / 26

Page 25: Introduction to NumPy for Machine Learning Programmers

Conclusion

Avoid for statements; use NumPy/SciPy's capabilities.
Mathematical derivation is important.
You can learn a lot from the source code of scikit-learn.

25 / 26

Page 26: Introduction to NumPy for Machine Learning Programmers

References

Official

scikit-learn

For beginners of NumPy/SciPy

Gabriele Lanaro, "Python High Performance Programming," Packt Publishing, 2013.
Stéfan van der Walt, Numpy Medkit
Python Scientific Lecture Notes

Algorithm of NMF

C.-J. Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation 19, 2007.

26 / 26