MACHINE LEARNING IN PHP
The roots of education are bitter, but the fruit is sweet
PHP conf asia, Singapore, 2016


Agenda

• How to teach tricks to your PHP

• Application : searching for code in comments

• Complex learning


Speaker

• Damien Seguy

• Exakat CTO

• Static analysis of PHP code


Machine Learning

• Teaching the machine

• Supervised learning : learning then applying

• The application builds its own model : training phase

• It applies its model to real cases : applying phase


Applications

• Play go, chess, tic-tac-toe and beat everyone else

• Fraud detection and risk analysis

• Automated translation or automated transcription

• OCR and face recognition

• Medical diagnostics

• Walk, welcome guests at hotels, play football

• Finding good PHP code


PHP Applications

• Recommendation systems

• Predicting user behavior

• SPAM

• Converting users to customers

• ETA

• Detect code in comments


Real use case

• Identify code in comments

• Classic problem

• Good problem for machine learning

• Complex, no simple solution

• A lot of data and expertise are available


Supervised Training

History data → Training → Model

Real data → Model → Results


The Fann Extension

• ext/fann (https://pecl.php.net/package/fann)

• Fast Artificial Neural Network

• http://leenissen.dk/fann/wp/

• Neural networks in PHP

• Works on PHP 7, thanks to the hard work of Jakub Zelenka

• https://github.com/bukka/php-fann


NEURAL NETWORKS

• Imitation of nature

• Input layer

• Output layer

• Intermediate layers


Initialisation

<?php

// fann_create_standard() counts every layer: input + hidden + output = 3
$num_layers         = 3;
$num_input          = 5;
$num_neurons_hidden = 3;
$num_output         = 1;
$ann = fann_create_standard($num_layers, $num_input, $num_neurons_hidden, $num_output);

// Activation function
fann_set_activation_function_hidden($ann, FANN_SIGMOID_SYMMETRIC);
fann_set_activation_function_output($ann, FANN_SIGMOID_SYMMETRIC);


Preparing data

Raw data → Extract → Filter → Human review → FANN ready
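
The "Extract" step could be sketched with PHP's built-in tokenizer. This is only an illustration: `extractComments()` is a hypothetical helper name, not something shown in the talk, and the filtering and human review steps are left out.

```php
<?php
// Hypothetical helper (not from the talk): collect every comment
// found in a string of PHP source code.
function extractComments(string $source): array
{
    $comments = [];
    foreach (token_get_all($source) as $token) {
        // Tokens are either single characters (strings) or [id, text, line] arrays.
        if (is_array($token) && in_array($token[0], [T_COMMENT, T_DOC_COMMENT], true)) {
            // Older PHP versions keep the trailing newline of // comments.
            $comments[] = trim($token[1]);
        }
    }
    return $comments;
}

$source = "<?php\n// \$a = 1;\n\$b = 2; /* plain text */\n";
print_r(extractComments($source));
```

Each extracted comment can then be reviewed by a human and labelled as code or not code before training.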


Expert at work

// Test if the if is in a compressed format

// none need yet

// There is a parser specified in `Parser::$KEYWORD_PARSERS`

// $result should exist, regardless of $_message

// $a && $b and multidimensional

// numGlyphs + 1

// TODO : fix this; var_dump($var);

// if(ob_get_clean()){

//$annots .= ' /StructParent ';

// $cfg['Servers'][$i]['controlpass'] = 'pmapass';


Input vector

• 'length' : size of the comment

• 'countDollar' : number of $

• 'countEqual' : number of =

• 'countObjectOperator' : number of -> operators ($o->p)

• 'countSemicolon' : number of semi-colon ;
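
The five characteristics above can be computed with plain string functions. The talk calls a `makeVector()` helper later without showing its body, so the version below is an assumption about what it might look like, not the speaker's code.

```php
<?php
// Sketch of a makeVector() helper; the body is an assumption,
// built only from the five characteristics listed on the slide.
function makeVector(string $comment): array
{
    return [
        strlen($comment),              // 'length'
        substr_count($comment, '$'),   // 'countDollar'
        substr_count($comment, '='),   // 'countEqual' (== counts twice, => counts once)
        substr_count($comment, '->'),  // 'countObjectOperator'
        substr_count($comment, ';'),   // 'countSemicolon'
    ];
}

print implode(' ', makeVector('// $x[3] or $x[] and multidimensional')) . "\n";
```

The values are returned in the same order as the list, which is the order ext/fann will expect for every input vector.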


Input data

46 5 1
825 0 0 0 1
0
37 2 0 0 0
0
55 2 2 0 1
1
61 2 1 3 1
1
...

First line: number of cases, number of inputs per case, number of outputs per case. Each case then takes two lines: the input vector, then the expected output (1 for code, 0 for a plain comment).

The comments behind the four cases above:

 * This file is part of Exakat.
 *
 * Exakat is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * Exakat is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU Affero General Public License for more details.
 *
 * You should have received a copy of the GNU Affero General Public License
 * along with Exakat.  If not, see <http://www.gnu.org/licenses/>.
 *
 * The latest code can be found at <http://exakat.io/>.
 *
*/

// $x[3] or $x[] and multidimensional

//if ($round == 3) { die('Round '.$round);}

//$this->errors[] = $this->language->get('error_permission');
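
Once the expert has labelled each comment, the training file can be generated rather than typed by hand. The sketch below assumes a single output per case, as in the talk; `buildTrainingData()` is a hypothetical name introduced for illustration.

```php
<?php
// Hypothetical helper (not from the talk): serialise labelled vectors
// into the FANN training file layout. Each sample is [inputVector, expectedOutput].
function buildTrainingData(array $samples): string
{
    if ($samples === []) {
        throw new InvalidArgumentException('At least one sample is required');
    }
    $numInputs = count($samples[0][0]);

    // Header: number of cases, inputs per case, outputs per case (always 1 here).
    $lines = [count($samples) . ' ' . $numInputs . ' 1'];
    foreach ($samples as [$vector, $expected]) {
        if (count($vector) !== $numInputs) {
            throw new InvalidArgumentException('All vectors must have the same size');
        }
        $lines[] = implode(' ', $vector);
        $lines[] = (string) $expected;
    }
    return implode("\n", $lines) . "\n";
}

echo buildTrainingData([
    [[37, 2, 0, 0, 0], 0],
    [[61, 2, 1, 3, 1], 1],
]);
```

Writing the returned string with `file_put_contents('incoming.data', ...)` would produce a file in the layout that the training step reads.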


1 5 1
37 2 0 0 0
0

// $x[3] or $x[] and multidimensional → ext/fann → It's a comment


Training

$max_epochs             = 500000;
$epochs_between_reports = 1000; // not set on the slide; 1000 is an assumed value
$desired_error          = 0.001;

// the actual training
if (fann_train_on_file($ann, 'incoming.data', $max_epochs, $epochs_between_reports, $desired_error)) {
    fann_save($ann, 'model.out');
}
fann_destroy($ann);
?>


TRAINING

• 47 cases

• 5 characteristics

• 3 hidden neurons

• + 5 input + 1 output

• Duration : 5.711 s


Application

History data → Training → Model

Real data → Model → Results


Application

<?php 

$ann = fann_create_from_file('model.out'); 

$comment = '//$gvars = $this->getGraphicVars();';

$input   = makeVector($comment);
$results = fann_run($ann, $input);

if ($results[0] > 0.8) {
    print "\"$comment\" -> $results[0] \n";
}

?>
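
To apply the same test to many comments at once, the threshold check can be wrapped in a small function. This is a sketch under assumptions: `flagComments()` is a hypothetical helper, and the scorer is passed in as a callable so that the example runs without ext/fann.

```php
<?php
// Hypothetical helper (not from the talk): keep the comments whose score
// is above the threshold, highest score first.
// $score is any callable that returns a float for a comment.
function flagComments(array $comments, callable $score, float $threshold = 0.8): array
{
    $flagged = [];
    foreach ($comments as $comment) {
        $value = $score($comment);
        if ($value > $threshold) {
            $flagged[] = ['comment' => $comment, 'score' => $value];
        }
    }
    usort($flagged, function (array $a, array $b): int {
        return $b['score'] <=> $a['score'];
    });
    return $flagged;
}

// Stand-in scorer with made-up values, used only to make the sketch runnable.
$fakeScore = function (string $comment): float {
    return strpos($comment, ';') !== false ? 0.99 : 0.10;
};

print_r(flagComments(['// $a = 1;', '// plain words'], $fakeScore));
```

With a trained network, the scorer would instead wrap `fann_run()` and return the first element of its result.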


Results > 0.8

• Answer between 0 and 1

• Values range from -14 to 0.999

• The closer to 1 or to 0, the more confident the answer.

• Is this a percentage? Is this a carrot count?

• It's a mix of counts…


REAL CASES

• Tested on 14093 comments

• Duration 68.01ms

• Found 1960 issues (14%)


0.99999893 // $cfg['Servers'][$i]['controlhost'] = '';    

0.99999928 //$_SESSION['Import_message'] = $message->getDisplay();    

0.99999928 /* if (defined('SESSIONUPLOAD')) {
    // write sessionupload back into the loaded PMA session
    $sessionupload = unserialize(SESSIONUPLOAD);
    foreach ($sessionupload as $key => $value) {
        $_SESSION[$key] = $value;
    }

    // remove session upload data that are not set anymore
    foreach ($_SESSION as $key => $value) {
        if (mb_substr($key, 0, mb_strlen(UPLOAD_PREFIX))
            == UPLOAD_PREFIX
            && ! isset($sessionupload[$key])
        ) {
            unset($_SESSION[$key]);
        }
    }
}


0.98780382 //LEAD_OFFSET = (0xD800 - (0x10000 >> 10)) = 55232    

0.99361396 // We have server(s) => apply default configuration

0.98383027 // Duration = as configured

0.99999928 // original -> translation mapping    

0.97590065 // = (   59 x 84   ) mm  = (  2.32 x 3.31  ) in    


True positive

False positive

True negative

False negative

Found by FANN

Target

// $cfg['Servers'][$i]['table_coords'] = 'pma__table_coords';    

//(isset($attribs['height'])?$attribs['height']: 1);    

// if ($key != null) did not work for index "0"    

// the PASSWORD() function    

0.99999923

0.73295981

0.99999851

0.2104115
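
The four quadrants can be counted once a human has reviewed the flagged comments. The sketch below is illustrative only: `confusionMatrix()` and `precision()` are hypothetical helpers, and the sample scores and labels are made up, not taken from the talk.

```php
<?php
// Hypothetical helper (not from the talk): tally the four quadrants.
// Each case is [score, isCode], where isCode is the expert's verdict.
function confusionMatrix(array $cases, float $threshold = 0.8): array
{
    $matrix = ['TP' => 0, 'FP' => 0, 'TN' => 0, 'FN' => 0];
    foreach ($cases as [$score, $isCode]) {
        $found = $score > $threshold; // "Found by FANN"
        if ($found && $isCode) {
            $matrix['TP']++;
        } elseif ($found && !$isCode) {
            $matrix['FP']++;
        } elseif (!$found && $isCode) {
            $matrix['FN']++;
        } else {
            $matrix['TN']++;
        }
    }
    return $matrix;
}

// Share of reported comments that really are code.
function precision(array $matrix): float
{
    $found = $matrix['TP'] + $matrix['FP'];
    return $found === 0 ? 0.0 : $matrix['TP'] / $found;
}

// Made-up scores and labels, for illustration only.
$cases = [[0.99, true], [0.95, false], [0.20, false], [0.70, true]];
$matrix = confusionMatrix($cases);
print_r($matrix);
printf("precision: %.2f\n", precision($matrix));
```

Tracking these counts over time shows whether a change to the data or the network reduces false positives or just moves them into false negatives.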


RESULTS

• 1960 issues

• 50+% false positives

• With an easy clean, 822 issues reported

• 14k comments, analyzed in 68 ms (367ms in PHP5)

• Total time of coding : 27 mins.

// = (   59 x 84   ) mm  = (  2.32 x 3.31  ) in

/* vim: set expandtab sw=4 ts=4 sts=4: */


Learn better, not harder

• Better training data

• Improve characteristics

• Configure the neural network

• Change algorithm

• Automate learning

• Update constantly

History data → Training → Model

Real data → Model → Results

Results → Feedback (retroaction)


Better training data

• More data, more data, more data

• Varied situations, real case situations

• Include specific cases

• Experience is capital

• https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf


Improve characteristics

• Add new characteristics

• Remove the ones that are less useful

• Find the right set of characteristics


Network Configuration

• Input vector

• Intermediate neurons

• Activation function

• Output vector

[Chart: time of training (ms), x axis 1 to 10, one series each for 1, 2, 3 and 4 layers]


Change algorithm

• First add more data before changing algorithm

• Try cascade2 algorithm from FANN

• 0.6 => 0 found

• 0.5 => 2 found

• Not found by the first algorithm


Finding the BEST

• Test with 2-4 layers 10 neurons

• Measure results

[Chart: x axis 1 to 13, one series each for 1, 2, 3 and 4 layers]
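
Trying several network shapes can be automated with a loop. The sketch below is an assumption about how such a test might be organised: `layerLayout()` is a hypothetical helper, the value of 1000 epochs between reports is assumed, and training only runs when ext/fann is loaded and an `incoming.data` file exists.

```php
<?php
// Hypothetical helper (not from the talk): layer sizes for a network with
// the given number of hidden layers, all of the same width.
function layerLayout(int $inputs, int $hiddenLayers, int $neurons, int $outputs): array
{
    return array_merge([$inputs], array_fill(0, $hiddenLayers, $neurons), [$outputs]);
}

foreach ([1, 2, 3, 4] as $hiddenLayers) {
    $layers = layerLayout(5, $hiddenLayers, 10, 1);
    print implode('-', $layers) . "\n";

    // Without ext/fann or training data, only the layouts are shown.
    if (!function_exists('fann_create_standard_array') || !is_file('incoming.data')) {
        continue;
    }

    $start = microtime(true);
    $ann = fann_create_standard_array(count($layers), $layers);
    fann_train_on_file($ann, 'incoming.data', 500000, 1000, 0.001);
    printf("  trained in %.0f ms\n", (microtime(true) - $start) * 1000);
    fann_destroy($ann);
}
```

Training time is only one measure; each candidate network should also be checked against reviewed comments before it replaces the current model.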


DEEP LEARNING

• Chaining the neural networks

• Auto-encoders

• Unsupervised Learning

• Genetic algorithm, ant, random forest, naive Bayes


Other tools

• PHP ext/fann

• R language

• https://github.com/kachkaev/php-r

• Scikit-learn

• https://github.com/scikit-learn/scikit-learn

• Mahout

• https://mahout.apache.org/


Conclusion

• Machine learning is about data, not code

• There are tools to use it with PHP

• Fast to try, easy results or fast fail

• Use it for complex problems that tolerate errors


THANK YOU!

@exakat