Big Algorithms Made Easy with Microsoft's F#

Joel Pobar Languages Geek DEV450 http://callvirt.net/blog/post/Why-F-(TechEd-09-DEV450).aspx

Agenda

What is it?

F# Intro

Algorithms: Search

Fuzzy Matching

Classification (SVM)

Recommendations

Q&A

All This in 1 hour?

This is an awareness session! Lots of content, very broad, very fast

You’ll get all demos, pointers, and slide deck to take offline and digest

Two takeaways: F# is a great language for data

Smart algorithms aren’t hard – use them, explore more!

F# is

...a functional, object-oriented, imperative and explorative programming language for .NET

What is Functional Programming?

Wikipedia: “A programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data”

-> Emphasizes functions

-> Emphasizes shapes of data, rather than implementation

-> Modeled on lambda calculus

-> Reduced emphasis on imperative

-> Safely raises level of abstraction

Motivation for Functional

Simplicity in life is good: cheaper, easier, faster, better.

We typically achieve simplicity in software in two ways:

By raising the level of abstraction (and OO was one design to raise abstraction)

Increasing modularity

Better composition and modularity == reuse

Increasing signal to noise is another good strategy:

Communicate more in less time with more clarity

Functional Programming: safer, while still being useful

[Chart: languages placed on two axes, Unsafe to Safe and Not Useful to Useful. Labels shown: C#, C++, …; V.Next#; Haskell; F#]

Motivation for Functional

Data-driven world: more and more data, so we need higher-order algorithms and techniques to derive value from it

Scalability is king: economies of software scale are changing; the web requires tools + frameworks + languages that scale to millions

The multi-core (r)evolution: we need more adaptive languages + compilers to scale

Language features matter!

What is F# for?

F# is a general-purpose language: it can be used for a broad range of programming tasks

Superset of imperative and dynamic features

Great for learning FP concepts

Some particularly important domains: Financial modelling

Data mining

Scientific analysis

Academic

Let

Let binds values to identifiers

let helloWorld = "Hello, World"
printfn "%A" helloWorld

let myNum = 12

let myAddFunction x y =
    let sum = x + y
    sum

Type inference: the static typing of C# with the succinctness of a scripting language

Tuples

Simple, very useful data structure

let site1 = ("msdn.com", 10)
let site2 = ("abc.net.au", 12)
let site3 = ("news.com.au", 22)
let allSites = (site1, site2, site3)

let fst (a, b) = a
let snd (a, b) = b

List, Arrays, Seq, and Options

Lists and Arrays are first class citizens

Options provide a some-or-nothing capability

let list1 = ["Joel"; "Luke"]
let array = [|2; 3; 5|]
let myseq = seq [0; 1; 2]
let option1 = Some("Joel")
let option2 = None

Records

Simple concrete type definition

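type Person =
    { Name: string
      DateOfBirth: System.DateTime }

// DateOfBirth is a DateTime, so it needs a DateTime value rather than the string "13/04/81"
let n = { Name = "Joel"
          DateOfBirth = System.DateTime(1981, 4, 13) }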

Immutability

Values may not be changed

Data is immutable by default
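A minimal sketch of what immutable by default looks like in practice. The Site record and its values are made up for illustration; they are not part of the session demos.

```fsharp
// A let-bound value cannot be reassigned.
let total = 10
// total <- 11      // would not compile: 'total' is not mutable

// Mutation has to be requested explicitly.
let mutable counter = 0
counter <- counter + 1

// Records are immutable too: copy-and-update builds a new record
// and leaves the original untouched.
type Site = { Url: string; Hits: int }   // hypothetical record
let site = { Url = "msdn.com"; Hits = 10 }
let busier = { site with Hits = site.Hits + 1 }

printfn "%d %d %d" counter site.Hits busier.Hits
```

The original record keeps its value after the update, which is what makes sharing data between functions (and threads) safer.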

Discriminated Unions

Great for representing the structure of data

type Make = string
type Model = string

type Transport =
    | Car of Make * Model
    | Bicycle

let me = Car ("Holden", "Barina")
let you = Bicycle

Both of these identifiers are of type “Transport”

Functions

Functions: like delegates, but unified and simple

Deep type inference

(fun x -> x + 1)

let myFunc x = x + 1
// val myFunc : int -> int

let rec factorial n =
    if n > 1 then n * factorial (n - 1)
    else 1

let data = [5; 3; 4; 4; 5]
List.sortWith (fun x y -> x - y) data

Pattern Matching

Helps tease apart data and data structures

Works best with Unions and Records

let (fst, _) = ("first", "second")
System.Console.WriteLine(fst)

let switchOnType (a: obj) =
    match a with
    | :? System.Int32 -> printfn "int!"
    | :? Transport -> printfn "Transport"
    | _ -> printfn "Everything Else!"
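Since pattern matching works best with unions, here is a small sketch that takes apart the Transport union from the earlier slide. The describe function is a hypothetical helper added for illustration.

```fsharp
type Make = string
type Model = string

type Transport =
    | Car of Make * Model
    | Bicycle

// One branch per union case; the compiler warns if a case is left out.
let describe transport =
    match transport with
    | Car (make, model) -> sprintf "%s %s" make model
    | Bicycle -> "a bicycle"

printfn "%s" (describe (Car ("Holden", "Barina")))
printfn "%s" (describe Bicycle)
```

The Car branch binds the make and model directly in the pattern, so no casts or property lookups are needed.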

F# Interactive

Search

Given a search term and a large document corpus, rank and return a list of the most relevant results…

Blog Crawler

Search

Words: Stemming? Tokenise

Markup: Title/Author/Date

Links? A sign of strength?

Let’s explore something simple

Search

Simplify: For easy machine/language manipulation

… and most importantly, easy computation

Vectors: nature's own quality data structure

Convenient machine representation (lists/arrays)

Lots of existing vector math algorithms

After a loving incubation period, moonlight 2.0 has been released. <a href="whatever">source code</a><br><a href="something else">FireFox binaries</a> …

Term Count

Term        Count
after       2
incubation  1
loving      1
moonlight   6
firefox     4
linux       6
binaries    2
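A term count like this can be sketched in a few lines. This is only an illustration, not the crawler demo code: tokenise and termCounts are hypothetical names, and the sketch assumes that any character that is not a letter or digit separates words (no stemming, no markup handling).

```fsharp
// Lower-case the text and split it into word tokens.
// Assumption: every non-alphanumeric character is a separator.
let tokenise (text: string) =
    let cleaned =
        text.ToLowerInvariant()
        |> String.map (fun c -> if System.Char.IsLetterOrDigit c then c else ' ')
    cleaned.Split([| ' ' |], System.StringSplitOptions.RemoveEmptyEntries)
    |> List.ofArray

// Map each distinct term to the number of times it occurs.
let termCounts (text: string) =
    tokenise text
    |> List.countBy id
    |> Map.ofList

let counts = termCounts "The penguin and the dog"
printfn "%d" counts.["the"]   // "the" occurs twice once case is ignored
```

Keeping the counts in a Map makes it cheap to look up any term when the document is later lined up against a shared vocabulary.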

Document1: Linux post:

the 9, incubation 1, crazy 1, moonlight 6, firefox 4, linux 6, penguin 2

Document2: Animal post:

the 2, crazy 2, dog 1, penguin 5

Vector space:

Term       the  incubation  crazy  moonlight  firefox  linux  dog  penguin
Document1  9    1           1      6          4        6      0    2
Document2  2    0           2      0          0        0      1    5

Term Count Issues

Query ‘the dog penguin’:

Linux: 9+0+2 = 11

Animal: 2+1+5 = 8

‘the’ is overweight

Enter TF-IDF: Term Frequency-Inverse Document Frequency

A weight to evaluate how important a word is to a document in a corpus

i.e. if ‘the’ occurs in 98% of all documents, we shouldn’t weight it very highly in the total query

TF-IDF

Normalise the term count against the doc: tf = termCount / docWordCount

Measure the importance of the term: idf = log ( |D| / termInDocumentCount )

where |D| is the total number of documents in the corpus and termInDocumentCount is the number of documents that contain the term

tfidf = tf * idf

A high weight is reached by a high term frequency and a low document frequency
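The formulas above translate almost directly into F#. The following is a sketch under stated assumptions rather than the demo code: a document is modelled as a non-empty list of already-tokenised terms, a term found in no document gets an idf of 0, and the two sample documents are invented.

```fsharp
// Assumption: a document is a non-empty list of tokenised terms.
type Document = string list

// tf = termCount / docWordCount
let tf (term: string) (doc: Document) =
    let termCount = doc |> List.filter (fun t -> t = term) |> List.length
    float termCount / float (List.length doc)

// idf = log ( |D| / termInDocumentCount )
// Assumption: a term that appears nowhere gets idf 0 to avoid dividing by zero.
let idf (term: string) (corpus: Document list) =
    let containing = corpus |> List.filter (List.contains term) |> List.length
    if containing = 0 then 0.0
    else log (float (List.length corpus) / float containing)

let tfidf term doc corpus = tf term doc * idf term corpus

// Score a document against a query by summing the weight of each query term.
let score (query: string list) (corpus: Document list) (doc: Document) =
    query |> List.sumBy (fun term -> tfidf term doc corpus)

// Invented sample documents.
let linux = [ "the"; "linux"; "penguin" ]
let animal = [ "the"; "dog"; "penguin"; "penguin" ]
let corpus = [ linux; animal ]

printfn "%f" (idf "the" corpus)   // "the" is in both documents, so log (2/2) = 0
```

Because ‘the’ appears in every document its weight drops to zero, so a query such as ‘the dog’ is decided by ‘dog’ alone.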

Search in under 10 minutes

Fuzzy Matching

String similarity algorithms: Soundex; Metaphone

Jaro-Winkler distance; Cosine similarity; Sellers; Euclidean distance; …

We’ll look at the Levenshtein distance algorithm

Defined as: the minimum number of edit operations that transform string1 into string2

Fuzzy Matching

Edit costs: In-place copy – cost 0

Delete a character in string1 – cost 1

Insert a character in string2 – cost 1

Substitute a character for another – cost 1

Transform ‘kitten’ into ‘sitting’:

kitten -> sitten (cost 1 – replace k with s)

sitten -> sittin (cost 1 – replace e with i)

sittin -> sitting (cost 1 – add g)

Levenshtein distance: 3
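The edit costs above can be sketched as a standard dynamic-programming routine. This is one common way to compute the distance, not necessarily the version used in the demo; it keeps two integer arrays rather than building temporary strings.

```fsharp
let levenshtein (source: string) (target: string) =
    // previous.[j] = cost of turning the first i-1 chars of source
    // into the first j chars of target (row 0: insert j characters).
    let mutable previous = Array.init (target.Length + 1) id
    for i in 1 .. source.Length do
        let current : int[] = Array.zeroCreate (target.Length + 1)
        current.[0] <- i   // delete i characters to reach the empty string
        for j in 1 .. target.Length do
            // In-place copy costs 0, substitution costs 1.
            let cost = if source.[i - 1] = target.[j - 1] then 0 else 1
            let delete = previous.[j] + 1
            let insert = current.[j - 1] + 1
            let substitute = previous.[j - 1] + cost
            current.[j] <- min delete (min insert substitute)
        previous <- current
    previous.[target.Length]

printfn "%d" (levenshtein "kitten" "sitting")   // 3, matching the worked example
```

This version runs in time proportional to the product of the two string lengths; the O(kl) bound needs an extra cut-off on the maximum distance, which is left out here.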

Fuzzy Matching

Estimated string similarity computation costs: hard on the GC (lots of temporary strings are created and thrown away; use arrays if possible).

Levenshtein can be computed in O(kl) time, where ‘l’ is the length of the shorter string and ‘k’ is the maximum distance.

Parallelisable – split the set of words to compare across n cores.

Can do approximately 10,000 compares per second on a standard single core laptop.

Did You Mean?

Classification

Support Vector Machines (SVM): supervised learning for binary classification

Training Inputs: ‘in’ and ‘out’ vectors.

SVM will then find a separating ‘hyperplane’ in an n-dimensional space

Training is costly, but classification is cheap

Can retrain on the fly in some cases
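To see why classification is cheap, consider a linear model that has already been trained: deciding ‘in’ or ‘out’ is a single dot product against the hyperplane. The sketch below assumes a linear hyperplane, and its weights and bias are made-up numbers, not the output of a real training run or of the demo.

```fsharp
// A trained linear model: the hyperplane is the set of points where w . x + b = 0.
type LinearModel = { Weights: float[]; Bias: float }

// Dot product of two equal-length vectors.
let dot (a: float[]) (b: float[]) =
    Array.map2 (fun x y -> x * y) a b |> Array.sum

// Which side of the hyperplane does x fall on?
let classify (model: LinearModel) (x: float[]) =
    if dot model.Weights x + model.Bias >= 0.0 then "in" else "out"

// Hypothetical weights, chosen only for illustration.
let model = { Weights = [| 1.0; -1.0 |]; Bias = 0.0 }

printfn "%s" (classify model [| 3.0; 1.0 |])   // 3 - 1 = 2, non-negative side
printfn "%s" (classify model [| 1.0; 4.0 |])   // 1 - 4 = -3, negative side
```

All the expensive work goes into finding the weights and bias; once they exist, each new input costs one pass over its dimensions.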

Classification

SVM Issues

Classification on 2 dimensions is easy, but most input is multi-dimensional

Some ‘tricks’ are needed to transform the input data

SVM Classifier Demo

F# Recommendation Engine

Netflix Prize: $1 million USD

Must beat the Netflix prediction algorithm by 10%

480k users

100 million ratings

18,000 movies

Great example of deriving value out of large datasets

Earns Netflix loads and loads of $$$!

Netflix Data Format

MovieId CustomerId Rating

Clerks 444444 5

Clerks 2093393 4

Clerks 999 5

Clerks 8668478 1

Dogma 2432114 3

Dogma 444444 5

Dogma 999 5

... ... ...

Nearest Neighbour

Find the best movies my neighbours agree on

CustomerId 302 4418 3 56 732

444444 5 4 5 2

999 5 5 1

111211 3 5 3

66666 5 5

1212121 5 4

5656565 1

454545 5 5
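One way to sketch the idea of finding the best movies my neighbours agree on is shown below. It is an illustration under assumptions, not the Netflix demo: ratings are held as a map from movie id to rating, closeness is Euclidean distance over the movies two customers share, customers with nothing in common count as infinitely far apart, and the sample ratings are invented.

```fsharp
// Each customer's ratings: movie id -> rating.
type Ratings = Map<int, float>

// Distance between two customers over the movies both have rated.
// Assumption: no movies in common means infinitely far apart.
let distance (a: Ratings) (b: Ratings) =
    let shared = a |> Map.toList |> List.filter (fun (movie, _) -> b.ContainsKey movie)
    if List.isEmpty shared then infinity
    else shared |> List.sumBy (fun (movie, rating) -> (rating - b.[movie]) ** 2.0) |> sqrt

// The k customers whose ratings are closest to mine.
let nearest k (me: Ratings) (others: Ratings list) =
    others |> List.sortBy (distance me) |> List.truncate k

// Movies I have not rated, ordered by my neighbours' average rating.
let recommend k (me: Ratings) (others: Ratings list) =
    nearest k me others
    |> List.collect Map.toList
    |> List.filter (fun (movie, _) -> not (me.ContainsKey movie))
    |> List.groupBy fst
    |> List.map (fun (movie, ratings) -> movie, List.averageBy snd ratings)
    |> List.sortByDescending snd

// Invented sample data.
let me = Map.ofList [ (302, 5.0); (4418, 4.0) ]
let alice = Map.ofList [ (302, 5.0); (4418, 5.0); (3, 5.0) ]
let bob = Map.ofList [ (302, 1.0); (56, 5.0) ]

printfn "%A" (recommend 1 me [ alice; bob ])
```

A real engine would also weight neighbours by how many movies they share with me; this sketch skips that to stay short.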

Netflix Demo

Vector Math Made Easy

If we want to calculate the distance between A and B, we call on Euclidean Distance

We can represent the points in the same way using Vectors: Magnitude and Direction.

This vector representation lets us work in ‘n’ dimensions and still perform Euclidean distance and angle calculations.

A (x1,y1)

B (x2,y2)

C (x0,y0)
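Both calculations are short once points are held as arrays of coordinates. The sketch below uses hypothetical helper names (euclidean, dot, magnitude, cosine) and assumes the two arrays have the same length; it works for any number of dimensions.

```fsharp
// Euclidean distance between two points with the same number of coordinates.
let euclidean (a: float[]) (b: float[]) =
    Array.map2 (fun x y -> (x - y) ** 2.0) a b |> Array.sum |> sqrt

// Dot product and magnitude (length) of a vector.
let dot (a: float[]) (b: float[]) =
    Array.map2 (fun x y -> x * y) a b |> Array.sum

let magnitude (a: float[]) = sqrt (dot a a)

// Cosine of the angle between two vectors:
// 1.0 means same direction, 0.0 means perpendicular.
let cosine (a: float[]) (b: float[]) =
    dot a b / (magnitude a * magnitude b)

printfn "%f" (euclidean [| 0.0; 0.0 |] [| 3.0; 4.0 |])   // the 3-4-5 triangle
printfn "%f" (cosine [| 1.0; 0.0 |] [| 0.0; 1.0 |])      // perpendicular axes
```

The same functions apply unchanged to term-count vectors or rating vectors, which is why the vector representation is so convenient.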

© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
