EFFECTIVE TECHNOLOGY STACKScdn.bdigital.org/PDF/BDC18/BDC18_EffecTechStacks.pdf · 2018-11-07 · Urban Data Platform with the citizen in the middle Data is Value only with analytics


EFFECTIVE TECHNOLOGY STACKS

Jaume Lluch

Head of the Business Intelligence Unit (IMI)

Ajuntament de Barcelona

www.ajuntament.barcelona.ca

Javier Berdonces

Head of High Technology and Architecture for the Public Sector in Catalonia

Accenture

www.accenture.com

Urban Platform for Barcelona

CityOS


ANY SUFFICIENTLY ADVANCED TECHNOLOGY IS EQUIVALENT TO MAGIC.

THE CHALLENGE

New services to make Barcelona a more inclusive city, more oriented to citizens’ needs

Collaborative Intelligence using different data sources as the foundation for new services

THE CHALLENGE


EVERYTHING IS CONNECTED

[Diagram labels: Data Ingestion, CityOS, Apps]

Data Sources

  • Sentilo (Barcelona Sensors Platform)
  • Infrastructures Information
  • 3rd Parties Data Sources
  • Citizen Suggestions

Use Cases

  • Cleaning Services
  • Noise Management
  • Public Lighting
  • Mobility
  • City Resilience

THE PRINCIPLES

  • Big Data Processes: Advanced Analytics Platform
  • Consolidation: Single View of the Data of the City
  • Collaboration: Open Source & Community
  • Open: Universal Repository to Provide Transparency

LOGICAL ARCHITECTURE

[Diagram; recoverable labels:]

  • Services & Applications
  • City OS Interfaces
  • Governance, Security & Procedures
  • Processes: Ontology, Normalization Processes, Publication Processes, Analytics
  • Repository: Publication, Real Time, Normalization, Historification, Staging
  • Interoperability Connector
  • Data load (Batch & Online)
  • Data Sources
  • Side labels: Directives, MyCiX

TECHNICAL ARCHITECTURE

[Diagram; recoverable labels:]

  • Services & Applications
  • API Manager (WSO2) & WFS (Geoserver Client)
  • Kernel: Apache Karaf
  • Processes (Talend)
  • Analytics (R)
  • Ontology: Protégé, Jena
  • Repository (Cloudera): Hive, HBase & Impala (SQL), Kudu, Spark, Flume
  • HDFS (Hadoop)
  • Kafka (Online), Flink (CEP), Talend
  • Identity Server (WSO2 + Active Directory)
  • Monitoring (Zabbix)
  • BPM (Activiti)
  • ELK
  • Data Sources

DATA LIFECYCLE: Batch Example

[Diagram layers: API Manager (WSO2) & WFS (Geoserver Client); Repository (Publication, Real Time, Normalization, Historification, Staging) and Analytics; Kafka (Online), Flink (CEP), Talend]

  • Data SQL extraction with Talend (from source databases)
  • Raw Data Creation
  • Data Normalization
  • Data Historification
  • Data Publication
  • Data Access (APP)
  • Data for Analytics

ALL ETL PROCESSES FOR CSVS, API SOURCES AND ORACLE TABLES CAN BE CREATED USING ARCHITECTURE MODULES ONLY THROUGH CONFIGURATION
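One way to read "only through configuration" is that each source is described declaratively and generic modules walk it through the lifecycle stages. The following is a minimal Python sketch of that idea, not the CityOS implementation: the config keys, the stage functions and the sample data are all hypothetical, and an in-memory CSV stands in for Talend, Oracle and the Cloudera repository.

```python
import csv
import io
from datetime import date

# Hypothetical source description: in this sketch, onboarding a new
# source means writing a new dict, not new code.
SOURCE_CONFIG = {
    "name": "noise_sensors",
    "format": "csv",
    "rename": {"id_sensor": "sensor_id", "valor": "value"},
    "types": {"sensor_id": str, "value": float},
}

def extract(raw_text, config):
    """Staging: parse the raw payload according to the configured format."""
    if config["format"] != "csv":
        raise ValueError(f"unsupported format: {config['format']}")
    return list(csv.DictReader(io.StringIO(raw_text)))

def normalize(rows, config):
    """Normalization: rename columns and cast values as configured."""
    out = []
    for row in rows:
        renamed = {config["rename"].get(k, k): v for k, v in row.items()}
        out.append({k: config["types"].get(k, str)(v) for k, v in renamed.items()})
    return out

def historify(rows, load_date):
    """Historification: stamp every record with its load date."""
    return [{**row, "load_date": load_date.isoformat()} for row in rows]

def run_pipeline(raw_text, config, load_date):
    """Run the generic stages in lifecycle order for one configured source."""
    return historify(normalize(extract(raw_text, config), config), load_date)

RAW = "id_sensor,valor\nS1,54.2\nS2,61.0\n"
published = run_pipeline(RAW, SOURCE_CONFIG, date(2018, 10, 24))
print(published)
```

The design choice illustrated here is that the stage functions never mention a specific source; everything source-specific lives in `SOURCE_CONFIG`.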


DATA LIFECYCLE: Online Example

[Diagram layers: API Manager (WSO2) & WFS (Geoserver Client); Repository (Publication, Real Time, Normalization, Historification, Staging) and Analytics; Kafka (Online), Flink (CEP), Talend]

  • Kafka Event Data (from a Transactional System)
  • Raw Data Creation
  • Data Normalization
  • Data Historification
  • Data Publication
  • Data Access (APP)

LESSONS LEARNED

Technical Lessons Learned

Strategic Lessons Learned

  • Use the specific technology that fits each problem (HBase, Kudu, Spark...). There are no “one size fits all” solutions. Hybrid architectures are here to stay.

  • Light integration with sources, with all the necessary components (Kafka Connect, Table Views, etc.)

  • LAB MINDSET: Don’t be afraid to make “mistakes under control”.

  • Follow the Big Data principle: Think Big, Start Small, Scale Fast.

  • Focus on people: the citizen in the center of everything.

  • Find a key data set or platform capability that accelerates the adoption rate.

  • Define your data inventory in advance.

  • Shape a Strong Core version of the solution and improve it. A first “MVP” will shape the final solution.

  • Define your Data Governance as soon as possible.

CONCLUSIONS

Barcelona evolved from the vision of a pure Smart Cities Platform to a more comprehensive view: an Urban Data Platform with the citizen in the middle.

Data is Value only with analytics and under a Data Governance Strategy.

The Strategy for CityOS (as a product) is based on Open Source and the Community.

It’s time for Barcelona to Scale Fast.

MANY THANKS!

EFFECTIVE TECHNOLOGY STACKS

Juan Luis Alarcón

Solution Manager

Dominion Digital

www.dominion-global.com/dominion-digital

@jlamanas

Big Data Congress BARCELONA

24 OCTOBER 2018

Cloud-Ready Data Center

Juan Luis Alarcón

juanluis.alarcon@dominion-global.com

Hybrid Cloud: Private Cloud, Public Cloud

What is it?

A solution that transforms the traditional data center into an automated service, ready to absorb both traditional mission-critical workloads and cloud-native ones. We apply knowledge, technology and innovation to make a hybrid cloud a reality.

Knowledge, Technologies, Applied Innovation

  • DEEP EXPERTISE AND KNOWLEDGE
  • BUSINESS MODEL ORIENTED TO TECHNOLOGICAL VITALITY
  • BEST PRACTICES IN NEW TECHNOLOGIES

Standardization & Automation

Infrastructure for change, Infrastructure as Code

Standardization & Centralized Configuration and Management

  • Environment Standardization: simplified infrastructure that makes it easy to apply security patches, updates and evolutionary maintenance.
  • Increase efficiency and productivity through centralized configuration management and automated deployment processes.
  • Audit and configure systems for greater security and regulatory compliance.

Configuration Management / Automation: Puppet, Ansible

[Diagram labels: Costs, Downtime, Optimization, Compliance]

Cloud-ready

Cloud backed in Enterprise-ready solutions

Where business meets IT

  • Speed up innovation and time to market
  • Eliminate the need to buy, deploy and maintain traditional IT solutions
  • Focus your efforts on business
  • Save, thanks to the elasticity of self-service capacity and on-demand workloads

IaaS: Use the world-class OpenStack solution provided by Red Hat, a platform-agnostic private cloud that takes best advantage of Ceph storage.

PaaS: Scale even more and leverage Docker, Kubernetes, Atomic and more in a truly end-to-end Container Platform with a superior user experience: it’s OpenShift.

SaaS: Thanks to the already built, security-tested and certified containerized services consumed as xPaaS, via Marketplace and container ecosystem.

Strategic alliance with the world's open source leader, Red Hat

  • Business Premier Partnership
  • Strong relation and alignment since 2006
  • Matching capabilities for Data Centre, Middleware and Cloud solutions

OpenStack

OpenStack Implementation

  • Average power per rack: 10 kW
  • Average power per X6800: ~1428 W
  • Equipment per rack:
  • 7 X6800 chassis (28U)
  • 60 storage nodes with the option to be hyper-converged
  • 192 compute nodes
  • 28 physical nodes per rack (can be scaled up to 52)
  • 2 or 3 ToR Data Centre switches
  • 12 1U servers (controller nodes)
  • Next-generation, high-performance Data Centre core switch

[Diagram labels: CE 12816; stack links of 720 Gbps and 480 Gbps; 42U racks with 2 or 3 ToR switches (CE 6850-48T6Q-HI); management switch; up to 50 racks and 350 X6800 chassis, ~1400 nodes]
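The figures on this slide are mutually consistent, which a few lines of arithmetic can confirm. The sketch below uses only numbers quoted on the slide; the derived values (rack units and nodes per chassis) are inferences from those numbers, and the power check assumes the 10 kW rack figure counts only the chassis.

```python
# Figures quoted on the slide.
CHASSIS_PER_RACK = 7
CHASSIS_POWER_W = 1428          # average power per X6800
CHASSIS_UNITS_PER_RACK = 28     # "7 X6800 Chassis 28U"
NODES_PER_RACK = 28
MAX_RACKS = 50

# Derived values (inferred, not stated on the slide).
rack_power_w = CHASSIS_PER_RACK * CHASSIS_POWER_W               # chassis power per rack
units_per_chassis = CHASSIS_UNITS_PER_RACK // CHASSIS_PER_RACK  # rack units per chassis
nodes_per_chassis = NODES_PER_RACK // CHASSIS_PER_RACK          # nodes per chassis
total_chassis = MAX_RACKS * CHASSIS_PER_RACK                    # chassis at full scale
total_nodes = total_chassis * nodes_per_chassis                 # nodes at full scale

print(f"Power per rack: {rack_power_w} W")
print(f"Units per chassis: {units_per_chassis}U, nodes per chassis: {nodes_per_chassis}")
print(f"Full scale: {total_chassis} chassis, {total_nodes} nodes")
```

Seven chassis at ~1428 W give 9996 W, which matches the quoted 10 kW per rack, and 50 racks of 7 chassis with 4 nodes each give the quoted 350 chassis and ~1400 nodes.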


About Dominion

Reaching the market through 6 divisions

T&T Services Industry Commercial Digital 360º Applied Engineering

Integrating knowledge

TECHNOLOGY FOCUS · INTERNATIONAL EXPANSION AND NEW PROJECTS · NEW BUSINESS LINES

20 years of experience acquiring know-how: Dominion has successfully integrated more than 30 companies and developed joint ventures with different partners.

Milestones

  • 1998: Birth of Dominion. Strategic decision: commitment to technology (Smart Innovation Project)
  • 2001: Expansion. International expansion; opening of the Mexico division
  • 2006: Projects. First large international projects in Health, Environment and Education
  • 2011: Merger. Merger of INSSEC-CIE. Integration of Dominion Soluciones y Servicios
  • 2014: Integration of Dominion and Beroa
  • 2015: Near and Bilcan. Dominion incorporates Digital Solutions (Near) and Commercial Services (Bilcan)

Acquisitions

  • 1998-2000: 4 acquisitions in Spain
  • 2001: 1 acquisition in Spain, 2 abroad (Mexico and Germany)
  • 2002-2005: 4 acquisitions in Iberia, 4 abroad (France, Italy, United Kingdom and USA)
  • 2006: 3 acquisitions abroad (Germany and Australia)
  • 2008-2011: 6 acquisitions abroad (Denmark, Germany, Brazil and France)
  • 2011: 1 acquisition abroad (India)
  • 2012-2013: 1 acquisition in Spain, 3 greenfields (Spain, Mexico and Peru)
  • 2016: Acquisition of the ABANTIA activities; acquisition of CDI and ICC in the USA; acquisition of the Protisa activities; integration of the Scorpio team
  • 2017: Acquisition of PHONE HOUSE SPAIN
  • 2018: Acquisition of Scorpio; acquisition of Seref; acquisition of Ditecsa Colombia

www.dominion-global.com

EFFECTIVE TECHNOLOGY STACKS

David Bordas

Big Data Tribe Lead

Minsait

www.minsait.com

Productionizing Multi-Platform Advanced Analytics

October 2018

Index

  • 01. Initial Problem

  • 02. Merging teams as the solution

  • 03. Practical Example

01. Initial Problem

  • Separation of skills by team

  • Lifecycle of a traditional DS project

Initial Problem: The Challenge

The DS department receives closed data sets (produced by prior ETLs), so data scientists do not always have access to the raw information, where additional value lies (accessibility).

If they need additional source data, they have to request it.

Communication between the Business and Architecture departments and the Data Science department is not always direct (no DevOps).

The DS team does not always know how to optimize its code to productionize it at large data volumes.

Separation of skills by team

Architecture

  • Architecture
  • Optimizations
  • Technical solutions
  • Systems

Data Science

  • Data preparation
  • Library creation
  • Advanced algorithms
  • Marketplace

BA/BI

  • Business
  • Custom reporting
  • ETLs
  • Consulting

Lifecycle of a traditional DS project

[Diagram: eight numbered stages, with iterations to optimize results]

  • ETL: making the data from the various sources available
  • Definition: identifying the problems and setting objectives for solving them
  • Information extraction: finding information sources and selecting the data extraction methods
  • Quality Profile: reviewing data quality and correcting the analyzed data
  • Model building: designing and implementing the most suitable analytical model
  • Results: evaluating the results obtained after applying the model
  • Production: automating the model and deploying it in the selected environment
  • Presentation

This was the usual picture of a DS project about 5 years ago.

02. Merging teams as the solution

  • Skills to share

Skills to share

Teams: Architecture, Data Science, BA/BI

  • They can parallelize DS code

  • They can help select libraries

  • They can support productionization

  • They can optimize RAM and resources and improve configurations

  • They can connect BI tools directly to the Big Data platform

  • They can help Business understand that the traditional variables are not necessarily the relevant ones

  • They can help BI build more dynamic reports

  • They can help improve data quality with business rules

  • They can help with the models

  • They can help us pre-evaluate results

03. Practical Example

  • Introduction to the problem

  • Algorithm

  • The old way of thinking

  • New repositories

  • Solving the problem

  • Results

  • New lifecycle of a DS project

Introduction to the problem

We have a specific situation: we work at a company that detects fraud in the buying and selling of real estate, and we have the following information:

  • Only one type of data was exploited before doing DS: data on people, declarations, property listings and average land prices by area, in CSV files or in relational databases outside our data lake. Only these sources have been exploited for years.

  • Notarial records extracted from the official application, and coordinates of municipalities and dwellings, in JSON and XML.

  • Information that looks suitable for building nodes and edges about people and companies, which we have captured with robotization processes that crawl official gazettes and sources outside our company.

Algorithm

We want to write an algorithm in R that brings together the relevant information on properties, sales and people.

We will enrich it by cross-referencing what the notaries state in each sale, extracting price and square metres from the JSON files; using the geoposition, we will group by comarca (county) and flag coast and mountain areas to see the differences.

Finally, via gazettes and deeds, we will relate buyers and sellers to the companies they work for, to notaries, to intermediaries and to people who have previously committed fraud, and we will seek to build heat maps and fraud propagation graphs.
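The enrichment step described on this slide (extract price and square metres from the notarial JSON, then group by comarca and by coast or mountain) can be sketched briefly. The slide names R as the language; Python is used here only for illustration, and the records, field names and values below are invented for the example, not taken from the talk.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical notarial records; field names are illustrative only.
NOTARIAL_JSON = """
[
  {"deed_id": "D1", "comarca": "Maresme",  "zone": "coast",    "price_eur": 300000, "area_m2": 100},
  {"deed_id": "D2", "comarca": "Maresme",  "zone": "coast",    "price_eur": 180000, "area_m2": 90},
  {"deed_id": "D3", "comarca": "Cerdanya", "zone": "mountain", "price_eur": 120000, "area_m2": 80}
]
"""

def average_price_per_m2(records, key):
    """Group records by `key` and average the price per square metre."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key]].append(record["price_eur"] / record["area_m2"])
    return {group: mean(values) for group, values in grouped.items()}

records = json.loads(NOTARIAL_JSON)
by_comarca = average_price_per_m2(records, "comarca")
by_zone = average_price_per_m2(records, "zone")
print(by_comarca)
print(by_zone)
```

Keeping the grouping key as a parameter lets the same function produce the per-comarca view and the coast versus mountain comparison.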


The old way of thinking

Perhaps 5-10 years ago we would simply have put an ETL in between, loading the fields the relational database does not understand into columns and processing them with Java or .NET.

[Diagram: the sources reach Data Science through ETLs labelled DIRECT ETL, ETL WITH SLIGHT LOSS and ETL WITH HIGH LOSS]

New repositories

The new repositories let us store information as it arrives, or transform it as little as possible, so as not to lose value and to keep the information in as native a form as possible.

[Diagram: each source sends its information without ETL]

Solving the problem

Direct connections from our code to the new repositories:

  • Columnar
  • NoSQL
  • Graph
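To make the graph repository's role concrete, here is a minimal Python sketch of the kind of question a fraud propagation graph answers: how many relationships separate each person or company from a known fraudster. It is an in-memory breadth-first search, not a graph database query, and the entities and edges are hypothetical.

```python
from collections import deque

# Hypothetical edges between entities (buyer-notary, seller-company, ...).
EDGES = [
    ("buyer_A", "notary_1"),
    ("seller_B", "notary_1"),
    ("seller_B", "company_X"),
    ("fraudster_Z", "company_X"),
    ("buyer_C", "notary_2"),
]

def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def hops_from_fraud(graph, known_fraudsters):
    """Breadth-first search: hops from each reachable node to the nearest known fraudster."""
    distance = {node: 0 for node in known_fraudsters if node in graph}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance

graph = build_graph(EDGES)
distance = hops_from_fraud(graph, {"fraudster_Z"})
print(distance)
```

Entities with no path to a known fraudster (here `buyer_C` and `notary_2`) simply do not appear in the result.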


Results

  • Improvement in execution time: 24 h → 20 min
  • Improvement in hit rate: 70% → 88%

New lifecycle of a DS project

[Diagram: numbered stages, with iterations to optimize results]

  • Connection: direct connectivity, without ETLs, to the repositories best suited to each type of data
  • Consolidation: storage and normalization in the best container
  • Quality Profile: reviewing data quality and correcting the analyzed data; optimizing the Data Quality methods
  • Model building: designing and implementing the most suitable analytical model
  • Results: evaluating the results obtained after applying the model and sharing them with consulting
  • Production: automating the model and deploying it in the selected environment
  • Presentation: new ways of presenting results

By doing DevOps, bringing roles and their cross-cutting skills closer together, we optimize our team and enrich every phase of a Data Science project.

Conclusions

  • The DS, Big Data and BA departments must work together
  • Working with the data in its native form and content adds value
  • Moving more processes to the Big Data platform makes us more efficient

Future

  • DevOps teams with cross-cutting skills
  • DS teams with more productionization capabilities
  • Tackling new problems with fewer ETLs, by understanding the data

In short

EFFECTIVE TECHNOLOGY STACKS

Ferran Galí

Lead Data Engineer

Trovit Search SLU

corporate.trovit.com

@ferrangali

Ferran Galí i Reniu

@ferrangali

About Me


Today we talk about trovit


Classified Ads


Searching for a new Home


Trovit


[Diagram: Web Scraping, five Parsers, Ads Database]

[Diagram, built up over three slides: User, Website, Ads Database]

Too many ads

Bad User Experience

Apache Solr


[Diagram: Ads Database, Indexing, Index Generation]

[Diagram: User, Website]

Nice User Experience

Our experience with Apache Solr

  • Can be blazing fast
  • Faceting (counting top field values) provided by default
  • Online modifications can result in performance degradation
  • Big hardware is needed to achieve low latencies
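For readers unfamiliar with the term, faceting means counting the most frequent values of a field among the documents that match a query. The sketch below shows the idea in plain Python; it does not use Solr or its API, and the sample ads and field names are hypothetical.

```python
from collections import Counter

# Hypothetical ads; in Solr these would be indexed documents.
ADS = [
    {"city": "Barcelona", "rooms": 3},
    {"city": "Barcelona", "rooms": 2},
    {"city": "Madrid", "rooms": 2},
    {"city": "Barcelona", "rooms": 1},
    {"city": "Girona", "rooms": 4},
    {"city": "Madrid", "rooms": 3},
    {"city": "Barcelona", "rooms": 2},
]

def facet(docs, field, top=3):
    """Count the most common values of `field` across the given documents."""
    return Counter(doc[field] for doc in docs if field in doc).most_common(top)

# "Query": ads with at least two rooms, then facet the matches by city.
matching = [ad for ad in ADS if ad["rooms"] >= 2]
city_facet = facet(matching, "city")
print(city_facet)
```

A search engine computes these counts from its index structures rather than by scanning documents, which is what makes faceting cheap enough to return with every query.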


Searching for second hand cars


Searching for job offers


Everywhere


Ads Database

Data


Ads Database

Indexing

Slow Index Generation


Awful freshness

24 hours of index generation


Ads Database

Indexing

Unstable Index Generation


Sad Engineers


Apache Hadoop


Vertical vs Horizontal scalability


[Diagram: the Hadoop stack on a cluster of eight nodes]

  • Processing: MapReduce
  • Storage: HDFS - Hadoop Distributed File System
  • Hardware: eight nodes

Node Node Node Node Node Node Node Node

HDFS - Hadoop Distributed File System

Hardware

Storage

Processing Job

The Big Data problemData Pipeline

Application

The Big Data problem: Data Pipeline

Application
Processing: Job, Job
Storage: HDFS - Hadoop Distributed File System
Hardware: Node Node Node Node Node Node Node Node

MapReduce Data Pipeline

Data Pipeline: MapReduce Job 1 -> MapReduce Job 2 -> MapReduce Job 3 -> MapReduce Job 4

Ads Pipeline

Parsers -> Upload -> HDFS -> Filter -> Index

Ads Pipeline

Parsers -> Upload -> HDFS -> Filter -> Enrich -> Deduplication -> Index
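The stages named on this slide can be sketched as plain Python functions composed in order. The record fields (`id`, `title`, `price`), the filtering rule, the enrichment and the deduplication key are all hypothetical; the slides do not describe what each stage actually does, and the real stages ran as MapReduce jobs rather than in-process functions.

```python
def filter_ads(ads):
    """Drop ads that are missing a title or a price (hypothetical rule)."""
    return [ad for ad in ads if ad.get("title") and ad.get("price") is not None]


def enrich_ads(ads):
    """Add a normalized title to each ad (hypothetical enrichment)."""
    return [{**ad, "title_norm": ad["title"].strip().lower()} for ad in ads]


def deduplicate_ads(ads):
    """Keep the first ad seen for each (normalized title, price) pair."""
    seen = set()
    unique = []
    for ad in ads:
        key = (ad["title_norm"], ad["price"])
        if key not in seen:
            seen.add(key)
            unique.append(ad)
    return unique


def index_ads(ads):
    """Build an in-memory lookup by id, standing in for a search index."""
    return {ad["id"]: ad for ad in ads}


def run_pipeline(ads):
    """Filter -> Enrich -> Deduplication -> Index, in the order shown on the slide."""
    return index_ads(deduplicate_ads(enrich_ads(filter_ads(ads))))


parsed = [
    {"id": 1, "title": "Flat in Gracia", "price": 900},
    {"id": 2, "title": " flat in gracia ", "price": 900},  # duplicate of id 1
    {"id": 3, "title": "", "price": 500},                  # filtered out
    {"id": 4, "title": "Bike", "price": 120},
]
index = run_pipeline(parsed)
```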

Our experience with Apache Hadoop

Scales massively

Easily extensible

Long running batch processes

Around 100 lines of boilerplate code for each MapReduce job

Freshness isn’t perfect

4-8 hours of index generation

Storm

Topology

Storm Topology: Spout, Spout -> Bolt, Bolt, Bolt, Bolt

Ads topology

Parsers -> Storm Topology (Spout -> Enricher Bolt -> Indexer Bolt)
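The shape of this topology can be mimicked in plain Python, with a generator standing in for the spout and simple consumers standing in for the bolts. None of this is the Storm API: the names `ad_spout`, `enricher_bolt` and `IndexerBolt`, the `city` field and its default value are assumptions made for illustration only.

```python
def ad_spout(source):
    """Spout stand-in: emits one tuple at a time from an input source."""
    for raw in source:
        yield raw


def enricher_bolt(stream):
    """Bolt stand-in: transforms each tuple as it arrives (hypothetical enrichment)."""
    for ad in stream:
        yield {**ad, "city": ad.get("city", "unknown")}


class IndexerBolt:
    """Terminal bolt stand-in: writes each tuple into an in-memory index."""

    def __init__(self):
        self.index = {}

    def process(self, stream):
        for ad in stream:
            self.index[ad["id"]] = ad


incoming = [{"id": 1, "city": "Barcelona"}, {"id": 2}]
indexer = IndexerBolt()
indexer.process(enricher_bolt(ad_spout(incoming)))
```

Each record flows through the whole chain as soon as it is emitted, which is the contrast with the batch pipeline, where every stage finishes before the next one starts.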

Ads topology + Kafka Spout

Parsers -> Storm Topology (Spout -> Enricher Bolt -> Indexer Bolt)

Our experience with Storm

Scales massively

Easy to code

Need to maintain another infrastructure

Not easy to implement different streaming semantics

Lambda Architecture

Batch (MR) + Streaming (Storm) + Serving (Solr)

Batch: Parsers -> Upload -> HDFS -> Filter -> Enrich -> Deduplication -> Index
Streaming: Parsers -> Storm Topology (Spout -> Enricher Bolt -> Indexer Bolt)

Our experience with Lambda Architecture

Real-time latency through the streaming layer

Consistency through the batch layer

Duplicated code (in our approach)

Synchronizing the serving layer adds complexity
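The synchronization problem in the last point can be sketched as follows: the serving layer answers from the batch view, overlaid with whatever the streaming layer has written since the last batch run, and it must discard streaming entries once a newer batch covers them. The view layout, the `ts` field and the cutoff rule are assumptions for illustration; the slides do not say how the serving layer was actually reconciled.

```python
def merge_views(batch_view, realtime_view):
    """Serve the batch view, overlaid with entries the streaming layer has seen since."""
    merged = dict(batch_view)
    merged.update(realtime_view)
    return merged


def on_batch_finished(new_batch_view, realtime_view, batch_cutoff):
    """After a batch run, keep only streaming entries newer than the batch cutoff."""
    remaining = {k: v for k, v in realtime_view.items() if v["ts"] > batch_cutoff}
    return new_batch_view, remaining


batch_view = {"ad-1": {"price": 900, "ts": 10}, "ad-2": {"price": 50, "ts": 12}}
realtime_view = {"ad-2": {"price": 45, "ts": 20}, "ad-3": {"price": 7, "ts": 21}}

served = merge_views(batch_view, realtime_view)

# A new batch that has processed everything up to ts=20 supersedes ad-2's streaming entry.
new_batch = {"ad-1": {"price": 900, "ts": 10}, "ad-2": {"price": 45, "ts": 20}}
batch_after, realtime_after = on_batch_finished(new_batch, realtime_view, batch_cutoff=20)
```

Getting the cutoff wrong in either direction loses updates or serves stale ones, which is one way the serving layer adds complexity.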

Synchronizing with HBase & Zookeeper

Batch: Parsers -> Upload -> HDFS -> Filter -> Enrich -> Deduplication -> Index
Streaming: Parsers -> Storm Topology (Spout -> Enricher Bolt -> Indexer Bolt)

Our experience with HBase

Can ingest lots of requests

Shared resources with MapReduce

Very unstable (in our approach)

Split brain when something fails

Mission aborted

Batch: Parsers -> Upload -> HDFS -> Filter -> Enrich -> Deduplication -> Index
Streaming: Parsers -> Storm Topology (Spout -> Enricher Bolt -> Indexer Bolt)

Mission aborted

Batch: Parsers -> Upload -> HDFS -> Filter -> Enrich -> Deduplication -> Index

Freshness isn’t perfect (again)

4-8 hours of index generation

Apache Spark

The Big Data problem: MapReduce vs Spark

MapReduce is disk-intensive

Spark is memory-intensive
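The contrast can be illustrated with a small Python sketch: the first runner writes every intermediate result to a file and reads it back before the next stage, as chained MapReduce jobs do through HDFS, while the second passes results between stages in memory. The stage functions and file names are hypothetical, and this is a simplification: it ignores distribution, and Spark itself can spill to disk when data does not fit in memory.

```python
import json
import os
import tempfile


def stage_double(values):
    return [v * 2 for v in values]


def stage_keep_large(values):
    return [v for v in values if v > 4]


def run_via_disk(values, stages, workdir):
    """Chain stages through files: each stage reads the previous output from disk."""
    path = os.path.join(workdir, "stage_0.json")
    with open(path, "w") as f:
        json.dump(values, f)
    for i, stage in enumerate(stages, start=1):
        with open(path) as f:
            data = json.load(f)
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(stage(data), f)
    with open(path) as f:
        return json.load(f)


def run_in_memory(values, stages):
    """Chain stages by handing each result directly to the next stage."""
    data = values
    for stage in stages:
        data = stage(data)
    return data


stages = [stage_double, stage_keep_large]
with tempfile.TemporaryDirectory() as workdir:
    disk_result = run_via_disk([1, 2, 3], stages, workdir)
memory_result = run_in_memory([1, 2, 3], stages)
```

Both runners produce the same result; the difference is the extra write and read per stage in the first one, which grows with the number of chained jobs.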

Data Pipeline

MapReduce Job 1 -> MapReduce Job 2 -> MapReduce Job 3 -> MapReduce Job 4

Data Pipeline

MapReduce Job 1 -> Spark Job 1 -> MapReduce Job 3 -> MapReduce Job 4

Data Pipeline

MapReduce Job 1 -> Spark Job 1 -> Spark Job 2 -> MapReduce Job 4

Data Pipeline

MapReduce Job 1 -> Spark Job 1 -> MapReduce Job 4

Ads Pipeline

Parsers -> Upload -> HDFS -> Filter -> Enrich -> Deduplication -> Index

Our experience with Apache Spark

Faster than MapReduce

Productive

Lots of connectors & libraries

Steep learning curve

Looking into the future

The future...

Apache Spark

Unifying batch & streaming code

Batch Pipeline: Spark Job
Streaming Pipeline: Spark Streaming Job
Spark Code (shared)
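The idea of one code base serving both pipelines can be sketched without Spark: a single transformation function is called once over the full dataset in the batch runner and once per micro-batch in the streaming runner. The function names, the record fields and the transformation itself are hypothetical, and plain Python lists stand in for Spark datasets and streams.

```python
def transform(ads):
    """Shared logic: normalize titles and drop ads without a price (hypothetical)."""
    return [
        {**ad, "title": ad["title"].strip().lower()}
        for ad in ads
        if ad.get("price") is not None
    ]


def run_batch(all_ads):
    """Batch pipeline: apply the shared logic to the whole dataset at once."""
    return transform(all_ads)


def run_streaming(micro_batches):
    """Streaming pipeline: apply the same logic to each micro-batch as it arrives."""
    for batch in micro_batches:
        yield from transform(batch)


ads = [
    {"title": " Bike ", "price": 120},
    {"title": "Sofa", "price": None},
    {"title": "Desk", "price": 80},
]
batch_out = run_batch(ads)
stream_out = list(run_streaming([ads[:2], ads[2:]]))
```

Because both runners call the same `transform`, a fix to the business logic lands in both pipelines at once, which addresses the duplicated code noted for the Lambda Architecture. This only holds for per-record logic; stateful steps such as deduplication need extra care in the streaming case.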

Apache Kafka

Kafka Log

Value: V1 V2 V3 V2’ V1’ V1’’
Key: A B C B A A
Offset: 0 1 2 3 4 5

Kafka Log Compaction (before)

Value: V1 V2 V3 V2’ V1’ V1’’
Key: A B C B A A
Offset: 0 1 2 3 4 5

Kafka Log Compaction (after)

Value: V3 V2’ V1’’
Key: C B A
Offset: 2 3 5
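The effect shown on this slide, keeping only the latest value for each key while preserving offsets, can be reproduced with a short Python function. This is a sketch of the outcome, not of Kafka's implementation: in Kafka, compaction runs in the background on closed log segments, so older records for a key may remain visible for some time. The tuple layout `(offset, key, value)` is an assumption for illustration.

```python
def compact(log):
    """Keep only the latest record for each key, preserving offsets and order."""
    latest_offset = {}
    for offset, key, _value in log:
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [rec for rec in log if latest_offset[rec[1]] == rec[0]]


# The log from the slide, as (offset, key, value) tuples.
log = [
    (0, "A", "V1"),
    (1, "B", "V2"),
    (2, "C", "V3"),
    (3, "B", "V2'"),
    (4, "A", "V1'"),
    (5, "A", "V1''"),
]
compacted = compact(log)
```

Offsets are not renumbered after compaction, so a consumer's stored position remains valid.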

Parsers

Streaming Pipeline

Spark Streaming

Filter Enrich Dedup

Streaming

Index

Parsers

Batch from offset 0

Spark Batch (from offset 0)

Filter Enrich Dedup

Streaming

Index

Parsers

Batch from offset 0

Upload

Spark Batch (from offset 0)

Filter Enrich Dedup

Batch (off 0)

Index
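Reprocessing from offset 0 can be sketched as replaying a retained log through the same index-building code that the live consumer uses. The helper names, the tuple layout `(offset, key, value)` and the extra record at offset 6 are hypothetical; the dictionary stands in for the real index.

```python
def consume(log, start_offset):
    """Yield records whose offset is at least start_offset, in log order."""
    for offset, key, value in log:
        if offset >= start_offset:
            yield offset, key, value


def build_index(records, index=None):
    """Apply records in order; a later value for a key overwrites an earlier one."""
    index = {} if index is None else index
    for _offset, key, value in records:
        index[key] = value
    return index


# A compacted log plus one newer record (offset 6 is a made-up example).
log = [(2, "C", "V3"), (3, "B", "V2'"), (5, "A", "V1''"), (6, "C", "V3'")]

# Batch mode: rebuild the whole index by reading from offset 0.
rebuilt = build_index(consume(log, start_offset=0))

# Streaming mode: an existing index keeps up by reading only new offsets.
live = build_index(
    consume(log, start_offset=6),
    index={"A": "V1''", "B": "V2'", "C": "V3"},
)
```

Both paths converge on the same index, so a full rebuild needs no separate batch code, only a different starting offset.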

Parsers

Streaming Pipeline

Spark Streaming

Filter Enrich Dedup

Streaming

Index

Interested? We’re hiring!

Ferran Galí i Reniu (@ferrangali)

Icons made by Freepik from Flaticon, licensed under CC BY 3.0

EFFECTIVE TECHNOLOGY

STACKS
