Multi-modal big data analysis within the Spark ecosystem in BCN

Jordi Torres, Professor at UPC - BSC

Spark summit 2015: Multi-modal big data analysis within the Spark ecosystem in BCN


Our University

Our Research Center

Our research: Cognitive Computing

FOUNDATIONAL BUILDING BLOCKS

For us, Cognitive Computing refers to the continuous development of supercomputing systems that enable the convergence of advanced analytic algorithms and big data technologies, driving new insights from the massive amounts of available data.

Building blocks: Data, Supercomputer Systems, Big Data Technologies, Advanced Analytic Algorithms

Today's focus: Multimedia & Spark

Building blocks: Data, Supercomputer Systems, Big Data Technologies, Advanced Analytic Algorithms (ML & OpenCV)

Marenostrum Supercomputer

Our Supercomputer in Barcelona: Marenostrum

[Figure: MareNostrum InfiniBand network diagram. Recoverable labels: Mellanox 648-port IB core switches; Infiniband 648-port FDR core switch; 36-port FDR10 leaf switches; FDR10 links; "3 links to each core" and "2 links to each core"; counts 560, 18 and 12; Latency: 0.7 μs; Bandwidth: 40 Gb/s]

[Figure labels: Storage Network, Storage Racks, Computer Nodes, Computer Network]

Our Supercomputer in Barcelona: Marenostrum

new module: spark4mn

•  Framework to efficiently run a Spark cluster over an LSF-based environment and the hardware particularities of MareNostrum

•  The framework provides functionalities to evaluate different configurations (HDFS vs GPFS, different networks, different affinities, cluster geometries, etc.)


Spark over Marenostrum: Shuffling 1TB

Shuffle 1 TB of data in Sort Benchmark format: 10^10 records of 100 bytes each

Three different partitionings: 100, 1000 and 10000 partitions

101.47 TB/h max speed (128 nodes)
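The shuffle figures can be sanity-checked with a little arithmetic. The sketch below is only an illustration: it assumes decimal units (1 TB = 10^12 bytes) and, for the last two values, that the reported peak rate were sustained and spread evenly over the 128 nodes, which the slides do not state. The object and value names are hypothetical.

```scala
// Back-of-the-envelope check of the shuffle benchmark figures.
// Assumptions (not from the slides): decimal units, peak rate sustained,
// throughput spread evenly across nodes.
object ShuffleArithmetic {
  val records: Long = 10000000000L          // 10^10 records
  val bytesPerRecord: Long = 100L
  val totalBytes: Long = records * bytesPerRecord   // 10^12 bytes = 1 TB

  // Bytes per partition for the three partitionings on the slide.
  val bytesPerPartition: Seq[(Int, Long)] =
    Seq(100, 1000, 10000).map(p => (p, totalBytes / p))

  val peakTBPerHour: Double = 101.47
  val nodes: Int = 128

  // Hypothetical per-node share of the peak rate.
  val perNodeTBPerHour: Double = peakTBPerHour / nodes

  // Time to move 1 TB if the peak rate were sustained.
  val secondsForOneTB: Double = 3600.0 / peakTBPerHour
}
```

Under these assumptions the three partitionings correspond to 10 GB, 1 GB and 100 MB per partition.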

Spark over Marenostrum: ML workloads

k-means Naïve Bayes

Research line in our group: Multimedia Big Data Computing

Crossover of 4 main aspects: real-time, analysis, multimodal, social

Multimedia Big Data Computing

The challenge is to work with three kinds of data at the same time: social network relationships, audiovisual content, and metadata

Case Study:

Multimodal Data Analytics systems can aid Desigual in better understanding their customers and potential customers through the analysis of social media data sources

Source:demotix.com


Case Study: (Autumn-Winter 2015-2016)

Dataset 1: #desigual #lavidaeschula #mydesigual: 30,000 photos

Dataset 2: Followers: 100 photos x 2K followers = 200K photos (100 GB)

Case Study:

E.g. latent user attribute inference for predicting Desigual followers

AGE, GENDER, HOME LOCATION, TRAVEL PATTERNS, LIFESTYLE/CONSUMPTION PATTERNS, …

CATWALK : Social Media Image Analysis for Fashion Industry Market Research

Multimedia Big Data Computing platform that operates over freely available online images from sources such as Instagram or Twitter


CATWALK : Small files problem

[Diagram: many small json files packed into a few SEQUENCE FILEs]
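The idea behind the diagram, many small files grouped into a few large containers, can be sketched as a greedy packing step. This is a minimal plain-Scala illustration, not the CATWALK implementation: the `SmallFile` type, the `pack` function and the target size are hypothetical, and a real job would write Hadoop SequenceFiles rather than in-memory groups.

```scala
// Greedy grouping of small files into containers of a target size.
// Hypothetical names; stands in for writing SequenceFiles.
object SmallFilePacking {
  final case class SmallFile(name: String, bytes: Long)

  // Append files to the current container until the next file would push it
  // past targetBytes, then start a new container. Input order is preserved.
  def pack(files: Seq[SmallFile], targetBytes: Long): Vector[Vector[SmallFile]] = {
    val start = (Vector.empty[Vector[SmallFile]], Vector.empty[SmallFile], 0L)
    val (closed, last, _) = files.foldLeft(start) {
      case ((done, current, used), f) =>
        if (current.nonEmpty && used + f.bytes > targetBytes)
          (done :+ current, Vector(f), f.bytes)
        else
          (done, current :+ f, used + f.bytes)
    }
    if (last.isEmpty) closed else closed :+ last
  }
}
```

For example, ten files of 500,000 bytes with a 2,000,000-byte target end up in three containers holding 4, 4 and 2 files. A file larger than the target still gets a container of its own.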

CATWALK : Vectorization

[Diagram labels: PATCH 1 to PATCH 4 and KP1 to KP4 (shown for two images), feature detection, feature description, kmeans, CODEWORDS DICTIONARY (CW1, CW2, CW3), and an output vector (0.4, 0.2, 0.8)]

Necessary for visual similarity search, visual clustering, classification, etc.
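The vectorization step can be sketched as a bag-of-visual-words histogram: each feature descriptor is assigned to its nearest codeword, and the image is represented by the share of descriptors per codeword. This is a minimal plain-Scala illustration under stated assumptions: the codewords are taken as given (in the pipeline they would come from k-means), squared Euclidean distance is used, and the histogram is normalized to sum to 1. The slides do not say which normalization CATWALK uses, and all names below are hypothetical.

```scala
// Bag-of-visual-words sketch: descriptors -> histogram over codewords.
// Assumptions: codewords already computed, squared Euclidean distance,
// counts normalized to sum to 1.
object BagOfWords {
  type Vec = Array[Double]

  def sqDist(a: Vec, b: Vec): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Index of the codeword closest to descriptor d.
  def nearest(codewords: Seq[Vec], d: Vec): Int =
    codewords.indices.minBy(i => sqDist(codewords(i), d))

  // One entry per codeword: fraction of descriptors assigned to it.
  def bag(codewords: Seq[Vec], descriptors: Seq[Vec]): Array[Double] = {
    require(codewords.nonEmpty, "dictionary must contain at least one codeword")
    val counts = Array.fill(codewords.length)(0.0)
    descriptors.foreach(d => counts(nearest(codewords, d)) += 1.0)
    if (descriptors.isEmpty) counts else counts.map(_ / descriptors.length)
  }
}
```

Every image ends up with a vector of the same length (the dictionary size), which is what makes similarity search, clustering and classification over images possible.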

CATWALK: bsc.spark.image

scala> import bsc.spark.image.ImageUtils
…
scala> val images = ImageUtils.seqFile("hdfs://...", sc)
scala> val dictionary = ImageUtils.BoWDictionary(images)
scala> val vectors = dictionary.getBags(images)
…
scala> val splits = vectors.randomSplit(Array(0.6, 0.4), seed = 11L)
scala> val training = splits(0)
scala> val test = splits(1)
scala> val model = NaiveBayes.train(training, lambda = 1.0)
…

CATWALK : Locality Sensitive Hashing

e.g. near-replica detection (visual spam detection, copyright infringement)

[Diagram labels: PATCH 1 to PATCH 4, KP1 to KP4, feature detection, feature description (SIFT, SURF, ORB, etc.), a binary sketch (0 1 1 0), and hash buckets 0000, 0100, 1100, 0010, 0110, 1110, 0011, 0111, 1111]

Features are sketched, embedded into a Hamming space

Similar features are hashed into similar buckets in a hash table
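One common way to sketch features into a Hamming space is random-hyperplane hashing: each bit records which side of a hyperplane the feature falls on, so nearby features tend to share most bits. The slides do not say which LSH family CATWALK uses, so the choice here is an assumption, and the names below are hypothetical. A minimal plain-Scala illustration:

```scala
// Random-hyperplane (sign) sketching into a Hamming space.
// Assumption: this LSH family is chosen for illustration only.
object HammingSketch {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double = a.indices.map(i => a(i) * b(i)).sum

  // One bit per hyperplane: '1' if the feature is on its non-negative side.
  def sketch(hyperplanes: Seq[Vec], feature: Vec): String =
    hyperplanes.map(h => if (dot(h, feature) >= 0) '1' else '0').mkString

  // Number of positions where two equal-length sketches differ.
  def hamming(a: String, b: String): Int =
    a.zip(b).count { case (x, y) => x != y }

  // Gaussian random hyperplanes; a fixed seed keeps sketches reproducible.
  def randomHyperplanes(bits: Int, dim: Int, seed: Long): Seq[Vec] = {
    val rng = new scala.util.Random(seed)
    Vector.fill(bits)(Array.fill(dim)(rng.nextGaussian()))
  }

  // Hash table: sketch -> indices of the features that fall in that bucket.
  def buckets(hyperplanes: Seq[Vec], features: Seq[Vec]): Map[String, Seq[Int]] =
    features.indices.groupBy(i => sketch(hyperplanes, features(i)))
}
```

Near-replica candidates are then the features that share a bucket, or whose sketches lie within a small Hamming distance of each other.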

CATWALK : more …

•  High performance visual recognition

•  High performance near-replica detection

•  Image style recognition

•  Unified representation of inferred knowledge through ISO/IEC 24800-2 (JPEG's JPSearch)

•  Launch of CATWALK platform (autumn 2015)

Thank you for your attention!

@Jordi Torres BCN

#CognitiveComputing & #BCN
