Predicting the future with Google Prediction API

This is my presentation from Talks #32 (http://www.softbinator.ro/events/talks-32/). It is in Romanian && English. For discussions you can join the Facebook group Talks by Softbinator: https://www.facebook.com/groups/talks.by.softbinator/

Predicting the future with Google Prediction API

Talks #32

RESTful API

Asynchronous cloud-based training, automatic model selection and tuning, and the ability to add training data on the fly.

Flexible Input

Numeric or text input that can output hundreds of discrete categories or continuous values.

Great, so what do we do now?

The same thing we do every night Pinky, TRY TO TAKE OVER THE WORLD!

Does that take any money?

• Well… It’s free. Within reason :D
• 1.0 requests/second/user
• 100 requests/day
• 5MB of training data per day
• 100 streaming updates per day
• Lifetime cap (20k predictions), so after 20k predictions you have to start paying...

Great, so what do I get for my MONEY? X(

• $10, 10k predictions per month free
• 10k streaming updates free
• Max training upload of 2.5GB (via Google Cloud Storage)

How do I get started?

• Glad you asked!

• You need to create a new project in the Google API Console and enable:
• Google Prediction API
• Google Cloud Storage API (requires billing ON, meaning it wants your card)

Great, any documentation to read?

• Yes!
• But it totally sucks. (Everything under Tools and Resources has broken links…)
• But the Hello World example works. Yippee!

Great, I got things done, now What?

• Now we train on the CSV, if we have one.
• If not, we build it.

Great, what should my CSV look like?

"like","Am castigat la loto si vreau sa dau tuturor hosting gratuity forever","bucuresti","loto"
"dislike","Doi caini maidanezi au muscat 3 pisici clonate si au murit.","bucuresti","venim"

[output], [feature1], [feature2], [feature3]

Output = Output. Hahaha.

Feature = Input. It can be numeric / text / whatever.

And NO HEADERS in the CSV.

And at most 2.5GB.

Well, if you are on the Free version of Google Prediction, 4MB to be exact.
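The format rules above (output first, features after, no header row, every field quoted) can be produced with a minimal Python sketch like the one below. The sample rows are shortened illustrations modeled on the slide, and `FREE_TIER_MAX_BYTES` is an assumed name for the 4MB free-tier figure quoted above.

```python
import csv
import io

# Assumption: the free-tier size limit quoted on the slide ("4MB to be exact").
FREE_TIER_MAX_BYTES = 4 * 1024 * 1024


def build_training_csv(rows):
    """Serialize (output, feature1, feature2, ...) rows as headerless, fully quoted CSV."""
    buffer = io.StringIO()
    # QUOTE_ALL wraps every field in double quotes; no header row is ever written.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in rows:
        writer.writerow(row)  # first column is the output, the rest are features
    return buffer.getvalue()


# Illustrative rows only (shortened from the slide's examples).
rows = [
    ("like", "Am castigat la loto", "bucuresti", "loto"),
    ("dislike", "Doi caini maidanezi au muscat 3 pisici", "bucuresti", "venim"),
]
csv_text = build_training_csv(rows)
size_ok = len(csv_text.encode("utf-8")) <= FREE_TIER_MAX_BYTES
print(csv_text)
print("fits in free tier:", size_ok)
```

Letting a CSV library do the quoting avoids hand-built quotes and commas, which is where spreadsheet exports tend to go wrong.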

Great, can you show us one?

That’s one ugly Excel, not a CSV

NEVER USE EXCEL! It does not output content + quotation mark + comma + quotation mark + content (content","content).

Uploading to Google Drive and exporting from their Spreadsheet does not work either.

So, go for OpenOffice!

So? Now what?

• Upload the CSV to Google Cloud Storage.

500 training examples = 18 sec

476 instances? Shouldn’t it be 500?

Let’s see some fresh meat. I mean tweet. Lol

So, how well does Google Prediction API predict?

• A guy wanted to run some tests / examples:
• http://blog.notdot.net/2010/06/Trying-out-the-new-Prediction-API
• Training on movie/book reviews to try and predict the score given based on the text
• Training on product descriptions to try and predict their rating
• Training on Reddit submissions to try and predict the subreddit a new submission belongs in

Guessing subreddits with the Prediction API

• He had: 75MB of JSON-encoded data, comprising 72,986 submissions
• He picked the 20 subreddits with the most submissions in that time period
• This subset made up 42,753 submissions, or about 58% of the original.
• Submissions were randomly split into either the training set (98%) or the validation set (2%):

Subreddit            Submissions
reddit.com           14578
pics                 4157
AskReddit            3375
reportthespammers    3258
politics             3162
funny                2176
WTF                  1773
gaming               1367
worldnews            938
videos               849
atheism              834
Music                833
technology           732
trees                703
comics               639
nsfw                 611
circlejerk           600
news                 567
environment          537
DoesAnybodyElse      537

After training, Google estimated a success rate of 61%.

So? How did it do?

484 of 857 predicted correctly: 56%, not far off the system's own estimate.
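To make the numbers concrete, here is a minimal Python sketch of a random 98% / 2% split and of the accuracy arithmetic. The function names and the fixed seed are illustrative assumptions, not taken from the original experiment.

```python
import random


def split_dataset(items, validation_fraction=0.02, seed=0):
    """Randomly assign each item to the training or the validation set."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    training, validation = [], []
    for item in items:
        target = validation if rng.random() < validation_fraction else training
        target.append(item)
    return training, validation


def accuracy(correct, total):
    """Fraction of validation examples that were predicted correctly."""
    return correct / total


# 42,753 submissions in the 20-subreddit subset (figure from the slide).
training, validation = split_dataset(range(42753))
reported = accuracy(484, 857)  # figures from the slide
print(f"{len(training)} training, {len(validation)} validation")
print(f"reported accuracy: {reported:.1%}")
```

A per-item random draw gives a validation set of roughly, not exactly, 2% of the data, which is consistent with 857 validation submissions out of 42,753.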

Where’s the problem?

• People are the problem, not Google Prediction API
• Users put their submissions in the wrong categories. NEVER TRUST THE USER!

Anyway, back to business

• Data Harvesting (from Twitter)
• Phirehose - https://github.com/fennb/phirehose - a PHP interface to the Twitter Streaming API
• What have I gathered?

1.3GB of Twitter #bigdata harvested. Hehe

I took 500 examples (but the more, the better)

I put them into Excel and split them into 3 buckets (0, 1, 2)

0 = Dislike = I don't like it
1 = Fav = I only fav it
2 = Reshare = I give it both a fav and a retweet
Save to CSV, upload, TRAIN.

So, how does this help us?

The interesting part is that although we have 3 values (0, 1 or 2), it will return a float between 0 and 2, so a result like 1.563212 is entirely possible!
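Because the model answers with a float rather than one of the three labels, the score has to be mapped back to a bucket somehow. The Python sketch below snaps the score to the nearest bucket; that rule is an assumption for illustration, since the talk does not say which cut-offs were used.

```python
# Bucket meanings as defined in the talk.
LABELS = {0: "dislike", 1: "fav", 2: "fav + retweet"}


def nearest_bucket(score):
    """Clamp a regression score to [0, 2] and snap it to the closest bucket."""
    clamped = min(2.0, max(0.0, score))
    # Adding 0.5 and truncating rounds halves up (Python's round() rounds halves to even).
    return int(clamped + 0.5)


bucket = nearest_bucket(1.563212)
print(bucket, LABELS[bucket])
```

Clamping first guards against scores that drift slightly outside the 0 to 2 range.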

What did I use for the Twitter Bot cool Follower gathering Application blabla?

• The PHP library associated with Google Prediction, namely serviceAccount.php
• It's broken!

$result = $service->trainedmodels->predict($id, $input);

It should be:

$result = $service->trainedmodels->predict($project, $id, $input);
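The project argument matters because the project is part of the request path. The Python sketch below only builds the URL and JSON body of a predict call and sends nothing. The `v1.6` base URL, the path layout and the `csvInstance` field are recalled from the Prediction API's REST interface and should be treated as assumptions; `my-project` and `tweet-model` are made-up names.

```python
import json
from urllib.parse import quote

# Assumption: REST base URL and version of the Prediction API.
BASE_URL = "https://www.googleapis.com/prediction/v1.6"


def build_predict_request(project, model_id, features):
    """Return (url, json_body) for a predict call; no network traffic happens here."""
    url = (
        f"{BASE_URL}/projects/{quote(project, safe='')}"
        f"/trainedmodels/{quote(model_id, safe='')}/predict"
    )
    # Assumption: features travel in the same column order as the training CSV,
    # minus the output column.
    body = json.dumps({"input": {"csvInstance": list(features)}})
    return url, body


# Hypothetical project and model names, for illustration only.
url, body = build_predict_request(
    "my-project", "tweet-model", ["some tweet text", "bucuresti", "loto"]
)
print(url)
print(body)
```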

What else?

• Twitter API Exchanger - https://github.com/J7mbo/twitter-api-php

So what exactly do we do?

• From the database, take the latest tweet
• See what score it gets
• If it gets a good score, fav / retweet it.
• That's it.
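The steps above can be sketched as a small polling loop in Python. Everything here is hypothetical scaffolding: `fetch_latest_tweet`, `predict_score` and `react` stand in for the database, Prediction API and Twitter calls, and the `good_score` threshold of 1.0 is an assumed cut-off.

```python
import time


def run_bot(fetch_latest_tweet, predict_score, react,
            good_score=1.0, pause_seconds=60.0,
            max_iterations=None, sleep=time.sleep):
    """Poll for a tweet, score it, and react when the score is good enough."""
    iterations = 0
    # max_iterations=None means "run forever".
    while max_iterations is None or iterations < max_iterations:
        tweet = fetch_latest_tweet()
        if tweet is not None:
            score = predict_score(tweet)
            if score >= good_score:
                react(tweet, score)  # e.g. fav and/or retweet
        iterations += 1
        sleep(pause_seconds)  # pause between polls
    return iterations


# Usage with in-memory stand-ins instead of real services.
tweets = iter(["great #bigdata talk", "spam spam spam"])
reactions = []
run_bot(
    fetch_latest_tweet=lambda: next(tweets, None),
    predict_score=lambda text: 1.6 if "bigdata" in text else 0.2,
    react=lambda text, score: reactions.append((text, score)),
    max_iterations=2,
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
print(reactions)
```

Passing the three operations in as callables keeps the loop testable without touching Twitter or the Prediction API.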

Huge recap?

• New Google Project
• Enable Google Prediction & Google Cloud Storage
• Upload your training CSV
• Make Predictions via API EXPLORER
• Download PHP Library for Google Prediction & Twitter Library
• Fix Google Library
• Put it all together in one PHP file
• Run it, put a sleep, make it run forever lol.
