Value extraction from BBVA credit card transactions Iván de Prado Alonso – CEO of Datasalt www.datasalt.es @ivanprado @datasalt www.bigdataspain.org November 16th, 2012 ETSI Telecomunicación Madrid Spain #BDSpain

Value extraction from BBVA credit card transactions. IVAN DE PRADO at Big Data Spain 2012


DESCRIPTION

Session presented at Big Data Spain 2012 Conference 16th Nov 2012 ETSI Telecomunicacion UPM Madrid www.bigdataspain.org More info: http://www.bigdataspain.org/es-2012/conference/value-extraction-from-bbva-credit-card-transactions/ivan-de-prado




BIG “MAC” DATA


104,000 employees
47 million customers


The idea

Extract value from anonymized credit card transactions data & share it

Always:
✓ Impersonal
✓ Aggregated
✓ Dissociated
✓ Irreversible


Helping

Consumers: informed decision
✓ Shop recommendations (by location and by category)
✓ Best time to buy
✓ Activity & fidelity of shop’s customers

Sellers: learning clients' patterns
✓ Activity & fidelity of shop’s customers
✓ Sex & Age & Location
✓ Buying patterns


Shop stats

For different periods
✓ All, year, quarter, month, week, day

… and much more


The applications

Customers

Internal use

Sellers


The challenges

Company silos

The amount of data

The costs

Security

Development flexibility/agility

Human failures


The platform

S3: data storage
Elastic MapReduce: data processing
EC2: data serving


The architecture


Hadoop

Distributed Filesystem
✓ Files as big as you want
✓ Horizontal scalability
✓ Failover

Distributed Computing
✓ MapReduce
✓ Batch oriented
  • Input files are processed and converted into output files
✓ Horizontal scalability


Easier Hadoop Java API
✓ But keeping similar efficiency

Common design patterns covered
✓ Compound records
✓ Secondary sorting
✓ Joins

Other improvements
✓ Instance based configuration
✓ First class multiple input/output

Tuple MapReduce implementation for Hadoop


Tuple MapReduce

Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo:
Tuple MapReduce: Beyond classic MapReduce.
In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining.
Brussels, Belgium | December 10–13, 2012

Our evolution of Google’s MapReduce


Tuple MapReduce

Example: sales difference between the top-selling offices of each location


Tuple MapReduce

Main constraint
✓ Group by clause must be a subset of sort by clause

Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
• Pangool -> Tuple MapReduce over Hadoop
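
The group by / sort by constraint can be illustrated with a small in-memory sketch. This is plain Python, not the Pangool API. The tuple fields (location, office, sales) and the function name `top_two_sales_gap` are assumptions made for illustration. The sketch reads the "sales difference" example as the gap between the two best-selling offices of each location, and assumes sales are already totalled per office.

```python
from itertools import groupby, islice


def top_two_sales_gap(tuples):
    """tuples: iterable of (location, office, sales), one tuple per office."""
    # Sort by (location, sales descending). The group-by field (location)
    # is a prefix of the sort-by fields, which satisfies the constraint.
    ordered = sorted(tuples, key=lambda t: (t[0], -t[2]))
    gaps = {}
    # Group by location only; inside each group the offices arrive
    # best seller first, so the reducer only needs the first two tuples.
    for location, group in groupby(ordered, key=lambda t: t[0]):
        best_two = list(islice(group, 2))
        if len(best_two) == 2:
            gaps[location] = best_two[0][2] - best_two[1][2]
    return gaps


offices = [
    ("Madrid", "A", 120), ("Madrid", "B", 90), ("Madrid", "C", 150),
    ("Bilbao", "D", 40), ("Bilbao", "E", 75),
]
gaps = top_two_sales_gap(offices)  # {"Bilbao": 35, "Madrid": 30}
```

Because the sort already orders each group, the reducer never has to buffer all the offices of a location in memory.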


Efficiency

http://pangool.net/benchmark.html

Similar efficiency to Hadoop


Voldemort

Distributed key/value store


Voldemort & Hadoop

Benefits
✓ Scalability & failover
✓ Updating the database does not affect serving queries
✓ All data is replaced at each execution
  • Providing agility/flexibility
    ▪ Big development changes are not a pain
  • Easier recovery from human errors
    ▪ Fix code and run again
  • Easy to set up new clusters with different topologies


Basic statistics

Count, Average, Min, Max, Stdev

Easy to implement with Pangool/Hadoop
✓ One job, grouping by the dimension over which you want to calculate the statistics.

Computing several time periods in the same job
✓ Use the mapper for replicating each datum for each period
✓ Add a period identifier field in the tuple and include it in the group by clause
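
The period replication described above can be sketched in memory as follows. This is plain Python rather than Pangool/Hadoop code. The record layout (shop, day, amount), the period identifier format and the function names are assumptions made for illustration. The standard deviation is the population one, which is also an assumption.

```python
import math
from collections import defaultdict
from datetime import date


def periods_for(day):
    """One (period type, period id) pair for every period that contains `day`."""
    iso_year, iso_week, _ = day.isocalendar()
    return [
        ("all", "all"),
        ("year", str(day.year)),
        ("quarter", f"{day.year}-Q{(day.month - 1) // 3 + 1}"),
        ("month", f"{day.year}-{day.month:02d}"),
        ("week", f"{iso_year}-W{iso_week:02d}"),
        ("day", day.isoformat()),
    ]


def map_phase(transactions):
    # Replicate each datum once per period; the period identifier becomes
    # part of the group-by key, next to the shop.
    for shop, day, amount in transactions:
        for period in periods_for(day):
            yield (shop, period), amount


def reduce_phase(pairs):
    groups = defaultdict(list)
    for key, amount in pairs:
        groups[key].append(amount)
    stats = {}
    for key, amounts in groups.items():
        n = len(amounts)
        mean = sum(amounts) / n
        variance = sum((a - mean) ** 2 for a in amounts) / n  # population variance
        stats[key] = {
            "count": n, "avg": mean, "min": min(amounts),
            "max": max(amounts), "stdev": math.sqrt(variance),
        }
    return stats


transactions = [
    ("shop1", date(2012, 11, 16), 10.0),
    ("shop1", date(2012, 11, 17), 30.0),
    ("shop1", date(2012, 12, 1), 20.0),
]
stats = reduce_phase(map_phase(transactions))
november = stats[("shop1", ("month", "2012-11"))]
```

A single pass over the transactions therefore yields the statistics for every period granularity at once, at the cost of emitting six tuples per input record.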


Distinct count

Possible to compute in a single job
✓ Using secondary sorting by the field you want to distinct count on
✓ Detecting changes on that field

Example
✓ Group by shop, sort by shop and card

Shop    Card
Shop 1  1234  <- Change +1
Shop 1  1234
Shop 1  1234
Shop 1  5678  <- Change +1
Shop 1  5678

2 distinct buyers for shop 1
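
The change-detection idea can be sketched in plain Python (not Pangool code). The function name `distinct_buyers` and the in-memory `sorted` call, which stands in for the framework's secondary sort, are assumptions made for illustration.

```python
from itertools import groupby

_UNSET = object()  # sentinel that never equals a real card number


def distinct_buyers(records):
    """records: iterable of (shop, card) pairs, duplicates allowed."""
    ordered = sorted(records)  # sort by shop, then by card
    counts = {}
    for shop, group in groupby(ordered, key=lambda r: r[0]):  # group by shop
        distinct = 0
        previous_card = _UNSET
        for _, card in group:
            if card != previous_card:  # a change on the sorted field: +1
                distinct += 1
                previous_card = card
        counts[shop] = distinct
    return counts


records = [
    ("Shop 1", "1234"), ("Shop 1", "1234"), ("Shop 1", "1234"),
    ("Shop 1", "5678"), ("Shop 1", "5678"),
]
result = distinct_buyers(records)  # {"Shop 1": 2}
```

Only the previous card has to be remembered, so the reducer uses constant memory per shop regardless of the number of buyers.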


Histograms

Typically two-pass algorithm
✓ First pass for detecting the minimum and the maximum and determining the bin ranges
✓ Second pass to count the number of occurrences in each bin

Adaptive histogram
✓ One pass
✓ Fixed number of bins
✓ Bins adapt
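
The talk does not say which adaptive method was used. A minimal sketch of one common one-pass approach, assuming bins are stored as (centroid, count) pairs and the two closest bins are merged whenever the fixed bin budget is exceeded, could look like this. The class and method names are hypothetical.

```python
import bisect


class AdaptiveHistogram:
    """One-pass histogram with a fixed number of bins whose positions adapt."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # list of [centroid, count], kept sorted by centroid

    def add(self, value):
        # Insert the value as a new single-point bin at its sorted position.
        centroids = [b[0] for b in self.bins]
        self.bins.insert(bisect.bisect_left(centroids, value), [value, 1])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Merge the adjacent pair with the smallest gap into one bin whose
        # centroid is the count-weighted average of the two.
        i = min(range(len(self.bins) - 1),
                key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
        (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
        self.bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]


histogram = AdaptiveHistogram(max_bins=3)
for value in [1, 2, 10, 11, 50]:
    histogram.add(value)
```

No minimum or maximum has to be known in advance, which is what removes the first pass of the two-pass algorithm.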


Optimal histogram

Calculate the best histogram that represents the original one using a limited number of flexible-width bins
✓ Reduces storage needs
✓ More representative than fixed-width ones -> better visualization


Optimal histogram

Exact algorithm
Petri Kontkanen, Petri Myllymäki
MDL Histogram Density Estimation
http://eprints.pascal-network.org/archive/00002983/

Too slow for production use


Optimal histogram

Alternative: approximated algorithm

Random-restart hill climbing
✓ A solution is just a way of grouping existing bins
✓ From a solution, you can move to some close solutions
✓ Some are better: they reduce the representation error

Algorithm
1. Iterate N times, keeping the best solution
   1. Generate a random solution
   2. Iterate until no improvement
      1. Move to the next better close solution
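
The steps above can be sketched as follows. The slides do not define the representation error or the set of close solutions, so both are assumptions here: the error is the squared difference between each original bin count and the average count of its group, and a close solution is obtained by moving one cut point by one position. All function names are hypothetical.

```python
import random


def representation_error(counts, cuts):
    """Squared error when each group of adjacent bins is replaced by its mean."""
    edges = [0] + list(cuts) + [len(counts)]
    error = 0.0
    for start, end in zip(edges, edges[1:]):
        group = counts[start:end]
        mean = sum(group) / len(group)
        error += sum((c - mean) ** 2 for c in group)
    return error


def neighbours(cuts, n):
    """Close solutions: move a single cut point one position left or right."""
    for i, cut in enumerate(cuts):
        lower = cuts[i - 1] if i > 0 else 0
        upper = cuts[i + 1] if i < len(cuts) - 1 else n
        for moved in (cut - 1, cut + 1):
            if lower < moved < upper:  # keep every group non-empty
                yield cuts[:i] + (moved,) + cuts[i + 1:]


def optimal_histogram(counts, groups, restarts=20, seed=0):
    rng = random.Random(seed)
    n = len(counts)
    best, best_error = None, float("inf")
    for _ in range(restarts):  # 1. iterate N times, keeping the best solution
        # 1.1 random solution: groups - 1 distinct cut points in 1..n-1
        current = tuple(sorted(rng.sample(range(1, n), groups - 1)))
        current_error = representation_error(counts, current)
        improved = True
        while improved:  # 1.2 iterate until no improvement
            improved = False
            for candidate in neighbours(current, n):
                candidate_error = representation_error(counts, candidate)
                if candidate_error < current_error:  # 1.2.1 first better move
                    current, current_error = candidate, candidate_error
                    improved = True
                    break
        if current_error < best_error:
            best, best_error = current, current_error
    return best, best_error


counts = [5, 5, 5, 20, 20, 20, 1, 1, 1]
cuts, error = optimal_histogram(counts, groups=3)
```

A single climb can stop in a local minimum (for these counts, cutting after positions 6 and 7 is one), which is why the random restarts are needed.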


Optimal histogram

Alternative: approximated algorithm

Random-restart hill climbing
✓ One order of magnitude faster
✓ 99% accuracy


Everything in one job

Basic statistics -> 1 job
Distinct count statistics -> 1 job
One pass histograms -> 1 job
Several periods & shops -> 1 job

We can put it all together so that computing all statistics for all shops fits into exactly one job


Shop recommendations

Based on co-occurrences
✓ If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
✓ Only one co-occurrence is considered even if a buyer bought several times in A and B
✓ Top co-occurrences for each shop are the recommendations

Improvements
✓ Most popular shops are filtered out because almost everybody buys in them.
✓ Recommendations by category, by location and by both
✓ Different calculation periods
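
A minimal in-memory sketch of the co-occurrence idea follows (plain Python, not the Pangool jobs used in the project). The function name, the (card, shop) record layout and the popularity filter, a cutoff on the number of distinct buyers, are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def recommendations(purchases, top_n=3, popular_cutoff=None):
    """purchases: iterable of (card, shop). Returns {shop: [recommended shops]}."""
    shops_by_buyer = defaultdict(set)  # a set: repeated purchases count once
    for card, shop in purchases:
        shops_by_buyer[card].add(shop)

    # Assumed popularity filter: drop shops with more distinct buyers
    # than the cutoff, since almost everybody buys in them.
    too_popular = set()
    if popular_cutoff is not None:
        buyers_per_shop = Counter(
            shop for shops in shops_by_buyer.values() for shop in shops)
        too_popular = {s for s, n in buyers_per_shop.items() if n > popular_cutoff}

    cooccurrences = defaultdict(Counter)
    for shops in shops_by_buyer.values():
        # Every ordered pair (A, B) with A != B is one co-occurrence.
        for a, b in permutations(sorted(shops - too_popular), 2):
            cooccurrences[a][b] += 1

    return {shop: [other for other, _ in counter.most_common(top_n)]
            for shop, counter in cooccurrences.items()}


purchases = [
    ("card1", "A"), ("card1", "A"), ("card1", "B"),
    ("card2", "A"), ("card2", "B"),
    ("card3", "A"), ("card3", "C"),
]
recs = recommendations(purchases, top_n=1)
```

Restricting the input to one category, one location or one time period before calling the function would give the per-category, per-location and per-period variants.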


Shop recommendations

Implemented in Pangool
✓ Using its counting and joining capabilities
✓ Several jobs

Challenges
✓ If somebody bought in many shops, the list of co-occurrences can explode:
  • Co-occurrences = N * (N - 1), where N = # of distinct shops where the person bought
✓ Alleviated by limiting the total number of distinct shops to consider
✓ Only uses the top M shops where the client bought the most
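
The effect of the limit is easy to check with a little arithmetic. In this sketch the function names are hypothetical, and "bought the most" is taken to mean the number of purchases, which is an assumption (it could also be the amount spent).

```python
from collections import Counter


def pair_count(n_shops):
    """Co-occurrences generated by one client who bought in n_shops distinct shops."""
    return n_shops * (n_shops - 1)


def limit_to_top_shops(shop_purchases, m):
    """Keep the m shops with the most purchases for a single client."""
    return [shop for shop, _ in Counter(shop_purchases).most_common(m)]


unlimited = pair_count(200)  # 39800 co-occurrences for a single client
limited = pair_count(20)     # 380 once only the top 20 shops are kept
top_shops = limit_to_top_shops(["A", "B", "A", "C", "A", "B"], m=2)  # ["A", "B"]
```

Since the growth is quadratic, capping N at M bounds the output of each client at M * (M - 1) pairs regardless of how active the client is.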

Future
✓ Time-aware co-occurrences: the client bought in A and B, and did so within a short period of time.


Some numbers

Estimated resources needed with 1 year of data

270 GB of stats to serve

24 large instances
~11 hours of execution

$3500/month
✓ Optimizations still possible
✓ Cost without the use of reserved instances
✓ Probably cheaper with an in-house Hadoop cluster


Conclusion

It was possible to develop a Big Data solution for a bank
✓ With low use of resources
✓ Quickly
✓ Thanks to the use of technologies like Hadoop, Amazon Web Services and NoSQL databases

The solution is
✓ Scalable
✓ Flexible/agile. Improvements are easy to implement
✓ Prepared to withstand human failures
✓ At a reasonable cost

Main advantage: always doing everything


Future: Splout

Key/value datastores have limitations
✓ Only accept querying by the key
✓ Aggregations not possible
✓ In other words, we are forced to pre-compute everything
  ✓ Not always possible -> data explode
  ✓ For this particular case, time ranges are fixed

Splout: like Voldemort but SQL!
✓ The idea: to replace Voldemort with Splout SQL
✓ Much richer queries: real-time aggregations, flexible time ranges
✓ It would allow creating some kind of Google Analytics for the statistics discussed in this presentation
✓ Open Sourced!!!

https://github.com/datasalt/splout-db
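
To make the difference concrete, here is a small sketch of the kind of query a SQL serving layer allows and a key/value store does not. It uses Python's built-in sqlite3 module as a stand-in and does not use the Splout SQL API. The table name `daily_stats`, its columns and the function `sales_between` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_stats (shop TEXT, day TEXT, sales REAL, tx_count INTEGER)")
conn.executemany(
    "INSERT INTO daily_stats VALUES (?, ?, ?, ?)",
    [
        ("shop1", "2012-11-15", 100.0, 4),
        ("shop1", "2012-11-16", 50.0, 2),
        ("shop1", "2012-12-01", 25.0, 1),
        ("shop2", "2012-11-16", 10.0, 1),
    ],
)


def sales_between(conn, shop, start_day, end_day):
    """Aggregate at query time over an arbitrary date range (ISO dates)."""
    return conn.execute(
        "SELECT COALESCE(SUM(sales), 0), COALESCE(SUM(tx_count), 0) "
        "FROM daily_stats WHERE shop = ? AND day BETWEEN ? AND ?",
        (shop, start_day, end_day),
    ).fetchone()


total = sales_between(conn, "shop1", "2012-11-16", "2012-12-01")  # (75.0, 3)
```

With a key/value store, each (shop, period) answer has to exist as a precomputed key. Here only the daily rows are stored and any time range is aggregated on demand.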


Iván de Prado Alonso – CEO of Datasalt www.datasalt.es @ivanprado @datasalt

Questions?