Data Science Toolchain presented by Jie-Han Chen slide: https://goo.gl/1hXBGk



Language & Software
- Python
- R
- Java
- Matlab
- Octave
- Jupyter Notebook

Python
- Open source community
- Packages
- Web services
- Good readability
- Machine learning

R
- Open source community
- Built-in statistics package
- Standalone computing & data analysis
- Slower than Python

Java
- High performance
- Big data
- Poor visualization, modeling

Matlab & Octave
- Powerful built-in math functions
- Simple data visualization tool
- Prototyping


Jupyter Notebook
- Supports 40+ programming languages, e.g. Python, R, Scala...
- Excellent for sharing your experiments
- Markdown, LaTeX
- Examples: example1, example2


Data Science Roadmap

Data Science Toolchains

- Data Collection
- Data Visualization
- Data Storage
- Algorithm & Modeling

Data Collection
- Using APIs: Facebook, Wikipedia
- Web scraper

Web Scraper
- HTTP request + HTML parser

HTTP: python-requests
- Better than built-in urllib
- Sessions with cookie persistence
- Thread-safety
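The session behaviour listed above can be sketched without touching the network. This is a minimal illustration, not the code from the original slide: the URL `https://example.com/profile` and the cookie `sessionid=abc123` are made-up values, and the request is only prepared, never sent.

```python
import requests

# A Session keeps cookies (and other defaults) across requests.
session = requests.Session()

# Hypothetical cookie; a real site would normally set it via a Set-Cookie header.
session.cookies.set("sessionid", "abc123")

# Build the request without sending it, so the example needs no network access.
request = requests.Request("GET", "https://example.com/profile")
prepared = session.prepare_request(request)

# The session's cookie is attached to the outgoing request automatically.
print(prepared.headers["Cookie"])
```

In real use, `session.get(url)` would send the request and store any cookies the server returns for later calls on the same session.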


Web page

Parser! Regular expression?

BeautifulSoup
- HTML/XML parser

BeautifulSoup

Ptt

HTML parser
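As a rough sketch of what parsing a page with BeautifulSoup can look like, the snippet below extracts link text and URLs from a small HTML string. The markup, the class name `title`, and the paths are hypothetical stand-ins for a fetched web page; they are not taken from the slides.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for a page fetched over HTTP.
html = """
<html><body>
  <div class="title"><a href="/post/1.html">First post</a></div>
  <div class="title"><a href="/post/2.html">Second post</a></div>
</body></html>
"""

# "html.parser" is the parser bundled with the Python standard library.
soup = BeautifulSoup(html, "html.parser")

# Collect (text, link) pairs from the anchor inside each div with class "title".
posts = []
for div in soup.find_all("div", class_="title"):
    link = div.find("a")
    posts.append((link.get_text(strip=True), link["href"]))

print(posts)
```

Compared with regular expressions, the parser works on the document tree, so the extraction does not break when attribute order or whitespace changes.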

More Powerful Tool?

Scrapy

An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.

Scrapy

$ scrapy startproject tutorial

Scrapy

path: /scrapy/dmoz.py
crawler name: dmoz


$ scrapy crawl dmoz

Scrapy

robots.txt

youtube.com/robots.txt
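Before crawling a site, a scraper can check its robots.txt rules. The sketch below uses the standard library's `urllib.robotparser` on an in-memory rule set; the rules, the crawler name `my-crawler`, and the `example.com` URLs are assumptions made for illustration, not the contents of any real robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Ask whether a given user agent may fetch a given URL.
allowed_public = parser.can_fetch("my-crawler", "https://example.com/public/page.html")
allowed_private = parser.can_fetch("my-crawler", "https://example.com/private/data.html")

print(allowed_public, allowed_private)
```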

"I believe that visualization is one of the most powerful means of achieving personal goals."
Harvey Mackay

Data Visualization

Data Visualization
- Matplotlib, ggplot2
- D3.js
- Bokeh
- Tableau
- PlotDB
- Leaflet

Matplotlib

ggplot2

D3.js
- Data visualization project
- Interactive
- Web frontend
- Examples: example1, example2

Bokeh
- Python, R, Scala, Julia
- Interactive
- Jupyter Notebook

Data Visualization

Code

Programming

Using GeoJSON with Leaflet

Configurable


S3

1. Key-value
2. Permission
3. Data Visualization
4. Big Data (Spark)

Algorithm & Modeling
- python-numpy + python-pandas + scikit-learn
- libsvm
- Spark MLlib
- Weka
- Deep Learning

Numpy + Pandas + Scikit-learn

Numpy

C

Numpy - data structure

ndarray (n-dim array)
- ndim
- size
- shape
- dtype
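The four ndarray attributes named above can be inspected directly. A small sketch, with an arbitrary 2x3 array chosen for illustration:

```python
import numpy as np

# An arbitrary 2x3 array of 64-bit floats.
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)

print(a.ndim)   # number of dimensions
print(a.size)   # total number of elements
print(a.shape)  # length along each dimension
print(a.dtype)  # element type
```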

Numpy

generate matrix
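The slide's own examples did not survive extraction, so the following is only a sketch of some common ways to generate arrays in Numpy; the shapes and ranges are arbitrary choices.

```python
import numpy as np

zeros = np.zeros((2, 3))               # 2x3 array filled with 0.0
ones = np.ones((2, 2))                 # 2x2 array filled with 1.0
identity = np.eye(3)                   # 3x3 identity matrix
steps = np.arange(0, 10, 2)            # even integers from 0 up to, not including, 10
grid = np.linspace(0.0, 1.0, 5)        # 5 evenly spaced values from 0 to 1 inclusive
reshaped = np.arange(6).reshape(2, 3)  # the integers 0..5 laid out as 2 rows, 3 columns

# Random values in [0, 1); the seed is fixed only to make runs repeatable.
rng = np.random.default_rng(0)
noise = rng.random((2, 2))

print(reshaped)
```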


Numpy - linalg
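As an illustration of `numpy.linalg`, the sketch below solves a small linear system and computes an inverse and a determinant. The matrix `A` and vector `b` are made-up values.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)   # solve A @ x = b
A_inv = np.linalg.inv(A)    # matrix inverse
det = np.linalg.det(A)      # determinant

print(x, det)
```

`solve` is usually preferred over multiplying by `inv(A)` because it avoids forming the inverse explicitly.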

Pandas
- Built on Numpy
- Series, DataFrame
- Import: csv, json ...
- NaN

Series -


DataFrame - Series
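A minimal sketch of how the two Pandas structures relate: a Series is a labelled one-dimensional array, and each column of a DataFrame is a Series. The fruit names and numbers are invented for the example.

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
prices = pd.Series([3.5, 2.0, 4.25], index=["apple", "banana", "cherry"])

# A DataFrame is a table whose columns are Series sharing one index.
df = pd.DataFrame({
    "price": prices,
    "stock": pd.Series([10, 0, 7], index=["apple", "banana", "cherry"]),
})

print(prices["banana"])            # look up a value by label
print(type(df["price"]))           # selecting a column gives back a Series
print(df.loc["cherry", "stock"])   # row label + column name
```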

Pandas - import
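The original import examples are not recoverable, so this is a hedged sketch of loading CSV and JSON data with Pandas. To stay self-contained it reads from in-memory strings; the column names and values are hypothetical, and a file path could be passed instead of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical CSV content; pd.read_csv also accepts a file path or URL.
csv_text = "name,age,city\nAmy,31,Taipei\nBob,27,Tainan\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical JSON content: a list of records.
json_text = '[{"name": "Amy", "age": 31}, {"name": "Bob", "age": 27}]'
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```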


Pandas - NaN
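Missing values in Pandas are represented as NaN. The sketch below shows three common ways to deal with them (count, drop, fill) on a small made-up table; which one is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

missing_per_column = df.isna().sum()  # count NaN values in each column
dropped = df.dropna()                 # keep only rows without any NaN
filled = df.fillna(0.0)               # replace every NaN with 0.0

print(missing_per_column)
print(dropped)
print(filled)
```

Both `dropna` and `fillna` return new objects here, so `df` itself is left unchanged.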


Pandas - operation

- Merge
- Grouping
- Reshaping
- ...
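The three operations named above can be sketched on two small tables. The `orders` and `users` data are invented for illustration, and `pivot_table` is used here as one possible form of reshaping.

```python
import pandas as pd

orders = pd.DataFrame({
    "user_id": [1, 2, 1, 3],
    "day": ["mon", "mon", "tue", "tue"],
    "amount": [10, 20, 5, 7],
})
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Amy", "Bob", "Cat"]})

# Merge: SQL-style join on a shared key column.
merged = orders.merge(users, on="user_id")

# Grouping: total amount per user name.
totals = merged.groupby("name")["amount"].sum()

# Reshaping: one row per name, one column per day, missing cells filled with 0.
wide = merged.pivot_table(index="name", columns="day", values="amount",
                          aggfunc="sum", fill_value=0)

print(totals)
print(wide)
```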

- Dataset
- Feature Engineering
- Modeling
- Evaluation
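One way to walk through these four stages is with scikit-learn, which the deck lists earlier. This is a sketch under assumptions, not the presenter's code: the Iris dataset, standard scaling as a stand-in for feature engineering, and logistic regression as the model are all illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Dataset: load a small built-in dataset and hold out 30% for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Feature engineering + modeling: scale the features, then fit a classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluation: accuracy on the held-out test set.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {accuracy:.2f}")
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test-set statistics into the model.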

LIBSVM
- C
- Easy to use
- Supports many programming languages

Dataset

LIBSVM - install
$ git clone
$ make

LIBSVM - workflow

LIBSVM - data format

label index:value index:value ... (index = attribute index, value = attribute value)
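To make the data format concrete, the sketch below converts a dense feature list to one LIBSVM-style line and parses it back. The helper names `to_libsvm_line` and `parse_libsvm_line` are hypothetical and not part of LIBSVM; the sketch assumes the usual conventions of 1-based indices and omitted zero values.

```python
def to_libsvm_line(label, features):
    """Format one sample as 'label index:value ...' (1-based indices, zeros skipped)."""
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([f"{label:g}"] + pairs)


def parse_libsvm_line(line):
    """Parse one line back into (label, {index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return float(label), features


# The second feature is zero, so it is left out of the sparse representation.
line = to_libsvm_line(1, [0.5, 0.0, 2.0])
print(line)
print(parse_libsvm_line(line))
```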


LIBSVM - toy

MLlib

MLlib, Hadoop

Java, Scala, R, Python


- Classification: logistic regression, naive Bayes,...
- Regression: generalized linear regression, survival regression,...
- Decision trees, random forests, and gradient-boosted trees
- Recommendation: alternating least squares (ALS)
- Clustering: K-means, Gaussian mixtures (GMMs),...
- Topic modeling: latent Dirichlet allocation (LDA)
- Frequent itemsets, association rules, and sequential pattern mining

MLlib

- Feature transformations: standardization, normalization, hashing,...
- ML Pipeline construction
- Model evaluation and hyper-parameter tuning
- ML persistence: saving and loading models and Pipelines


Weka

- Java library
- Big data
- Supports GUI

Deep Learning
- Theano
- Pylearn2
- Keras
- Tensorflow
- Caffe
- Deeplearning4J
- ...

Theano

- Based on Numpy
- Implemented by Cython
- Dynamic C code generation
- GPU & CUDA
- Tensor, math expression

A CPU and GPU Math Compiler in Python

Theano tutorial: http://www.slideshare.net/SergiiGavrylov/theano-tutorial

Keras

- Theano, Tensorflow
- Supports GPU
- Prototype
- High-level neural networks library

Tool ?

Homework Github repo Data science

Database, Social Network Analytics, ML library, Deep Learning Platform ...

README.md: Repo Demo Code

email: [email protected]

Google https://goo.gl/forms/PQPz8u2glyunQvfM2