Natalino Busa
DESCRIPTION
How do you combine comprehensive analysis running on large amounts of data with the responsiveness demanded of today's API services? This talk illustrates one of the recipes we currently use at ING to tackle this problem. Our analytical stack combines machine learning algorithms running on a Hadoop cluster with API services executed by an Akka cluster. Cassandra is used as a 'latency adapter' between the fast and the slow path. Our API services are executed by the Akka/Spray layer. These services consume both live data sources and intermediate results promoted by the Hadoop layer via Cassandra. This approach allows us to provide internal API services that are both complete and responsive.
Awesome Banking APIs
Exposing big data and streaming analytics using Hadoop, Cassandra, Akka and Spray
Humanize Data
The bank statements
How I read the bank bills; what happened those days
data is the fabric of our lives
Personal history:
Long term Interaction:
Real time events:
>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(iris.data, iris.target)
● Flexible, concise language
● Quick to code and prototype
● Portable, with rich visualization libraries
Machine learning libraries:
scipy, statsmodels, sklearn, matplotlib, ipython

Web libraries:
flask, tornado, (no)SQL clients
# Multiple Linear Regression Example
fit <- lm(y ~ x1 + x2 + x3, data=mydata)
summary(fit) # show results
● Language for statistics
● Easy to analyze and shape data
● Advanced statistical packages
● Fueled by academia and professionals
● Very clean visualization packages
Packages for machine learning:
time series forecasting, clustering, classification, decision trees, neural networks
Remote procedure calls (RPC) from Scala/Java via RProcess and Rserve
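The `lm()` fit above can also be mirrored in plain Python. A minimal sketch of ordinary least squares for a single predictor, using only the closed-form slope/intercept formulas (the data here is hypothetical, chosen to lie exactly on a line):

```python
# Ordinary least squares for one predictor: y = a + b*x.
# Closed form: b = cov(x, y) / var(x), a = mean(y) - b * mean(x).

def ols_fit(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Hypothetical data lying exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
a, b = ols_fit(x, y)
print(a, b)  # 1.0 2.0
```

For multiple predictors, as in the R example, a real implementation would solve the normal equations (e.g. with numpy or statsmodels) rather than this single-predictor shortcut.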
OK, let’s build some banking apps
Bank schematic:
core banking systems → SOAP services and DBs → system bus → customer-facing applications → channels
Challenges
● Higher separation
● Bigger and faster
● Fewer silos
● Interactions with core systems
Reliable, low-cost computing powerhouse
Reliable, low latency, tunable CAP
Data model: hashed rows, sorted wide columns
Architecture model: no SPOF, ring of nodes, homogeneous system
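The "hashed rows, ring of nodes" model can be sketched in a few lines of Python: a row key is hashed onto a token ring and routed to the first node whose token is at or past the key's token. The node names and the tiny 0..255 ring are illustrative only, not Cassandra's actual partitioner:

```python
import bisect
import hashlib

# Illustrative token ring: each node owns the arc ending at its token.
NODES = {"node-a": 64, "node-b": 128, "node-c": 255}

def token(row_key: str) -> int:
    # Hash the partition key onto a 0..255 ring (toy ring size).
    return hashlib.md5(row_key.encode()).digest()[0]

def route(row_key: str) -> str:
    t = token(row_key)
    pairs = sorted(NODES.items(), key=lambda kv: kv[1])
    tokens = [tok for _, tok in pairs]
    names = [name for name, _ in pairs]
    i = bisect.bisect_left(tokens, t) % len(tokens)  # wrap around the ring
    return names[i]

print(route("customer:123"))  # deterministic: same key, same node
```

Because every node can compute the same routing, any node can coordinate any request, which is what makes the system homogeneous with no single point of failure.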
[Diagram: actors A, B and C exchanging asynchronous messages (msg 1..4)]
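The actor model in the diagram can be sketched in plain Python: a mailbox drained by a single worker thread, so each actor processes one message at a time. The names here are ours, not Akka's API:

```python
import queue
import threading

class Actor:
    """Toy actor: a mailbox plus one thread that handles messages in order."""
    def __init__(self, handler):
        self.mailbox = queue.Queue()
        self.handler = handler
        threading.Thread(target=self._run, daemon=True).start()

    def tell(self, msg):
        self.mailbox.put(msg)  # fire-and-forget send

    def _run(self):
        while True:
            msg = self.mailbox.get()
            self.handler(msg)          # one message at a time: no locks needed
            self.mailbox.task_done()

received = []
a = Actor(received.append)
for m in ("msg 1", "msg 2", "msg 3"):
    a.tell(m)
a.mailbox.join()  # wait until the mailbox is drained
print(received)   # ['msg 1', 'msg 2', 'msg 3']
```

The single-threaded mailbox is the key property: state inside the handler is never touched concurrently, which is why actors compose safely into large clusters.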
Architecture
● Core flow (Akka): HTTP I/O, a SOAP client towards the bank core services (bank transactions), and a NoSQL client towards Cassandra
● Hadoop: batch data science, with results promoted to Cassandra
● Real-time analytics exposed as an API
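The "latency adapter" role of Cassandra can be sketched as a read-through lookup (all names here are hypothetical stand-ins): the API first consults precomputed batch results and falls back to a live computation only when nothing has been promoted yet:

```python
# Hypothetical stand-ins: a dict for the Cassandra table of batch results,
# and a function for the (slower) live computation on raw events.
batch_results = {"customer:123": {"score": 0.87}}  # promoted by the Hadoop layer

def compute_live(customer_id):
    return {"score": 0.5}  # placeholder for streaming analytics

def get_analytics(customer_id):
    precomputed = batch_results.get(customer_id)
    if precomputed is not None:
        return precomputed            # fast path: served from Cassandra
    return compute_live(customer_id)  # slow path: compute on demand

print(get_analytics("customer:123"))  # {'score': 0.87}
print(get_analytics("customer:999"))  # {'score': 0.5}
```

The API stays responsive because the common case is a single low-latency read, while the batch layer keeps the precomputed answers complete and up to date.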
Sprayin'

trait ApiService extends HttpService {

  // Create an actor for analytics
  val actor = actorRefFactory.actorOf(Props[AnalyticsActor], "analytics-actor")

  // curl -vv -H "Content-Type: application/json" localhost:8888/api/v1/123/567
  // Serve the API path
  val serviceRoute = {
    pathPrefix("api" / "v1") {
      pathPrefix(Segment / Segment) { (aid, cid) =>
        get {
          complete {
            // The message is passed on to the analytics actor (ask pattern)
            (actor ? (aid, cid)).mapTo[String]
          }
        }
      }
    }
  }
}
https://github.com/natalinobusa/wavr
Latency tradeoffs
Managing computation
Science & Engineering
Statistics, Data Science
Python, R, visualization
IT infra, Big Data
Java, Scala, SQL
Hadoop: Big Data Infrastructure, Data Science on large datasets
Big Data and Fast Data require different profiles to achieve the best results
Some lessons learned
● Mixing and matching technologies is a good thing
● Harden the design as you go
● Define clear interfaces
● Ease integration among teams
● Hadoop, Cassandra, and Akka: they work!
● Plug in the data science!
Thanks! Any questions?