Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the...

Automata-based Dynamic Data Processing for Clouds

Reginald Cushing, Marian Bubak,Adam Belloum, Cees de Laat

System Network EngineeringUniversity of Amsterdam

BigDataClouds25th August 2014

COMMIT/

Outline● Motivation

– Traditional workflow process model

● Automata data processing model– Transition Function– Transition Operations

● Implementation– Data Routing– Process Congestion Avoidance

● Example Data State Graphs/Workflows● Results

– State Transitions– Process Congestion Avoidance

● Conclusion/Future Work

● How can we model data transformation during processing?– Workflows, etc model tasks and their respective

execution order.

– Provenance captured data lineage after processing.

● How can we model data processing agnostic from the underlying infrastructure?– Data processing schema

– Workflows often fit the resources.

● How can we have collaboration in data processing between Vms/Clouds/Groups?

Motivation - 1

Motivation – 2

● The standard black box process:– A → [p] → B (A and B are different entities)

● What if:– A → [p] → A' (both input and output are A but

in different states)

– This allows us to reason about data using state graphs.

– Data processing is described as a sequence of state transitions.

Motivation - 3

D:S1 D:S2 D:S9

D:S3 D:S4 D:S5

D:S6 D:S7

Automata Data Processing Model

● If the side effect of processing data object is changing its state then an automata can represent the possible state a data can be in.

● Specifically we use an NFA which is a 5 tuple:

– (Q, ∑, δ, q0, F)

– Q: finite set of states

– ∑: input alphabet

– δ: transition function

– F: set of final states

● Input alphabet are functions.● The transition function is the process (black box) transforms data from

state to state

Transition Function

Instead of an input symbol, the transition function takes anotherfunction σ as input. σ takes data and state as input and produces transformed data and new state. σ accepts multiple states and Outputs multiple states thus it can perform AND/OR operations onStates.

Transition Operations

● State transition logic follows boolean algebra.● Functions can output multiple states.● Basic state combinations are AND, OR.

Distributed Implementation

AMQP● Each VM/Node runs an identical copy of the stack.● Data processing is decentralized.● Communication adapters allow for P2P, MessageServer, File, HTTP, etc packet communication.

Processing Functions

● Each node can host a number of functions● Each function performs a state transitions

and data processing● Functions process data packets● Functions can optionally implement

split/run/merge functions– Split: splits data packets into many packets

– Run: processes the data packets

– Merge: merges the data packets

Data Packet Split/Merge

split()

merge()

Function A

Data Routing

● Each node builds a state routing table

● Depending on the routes data can be partitioned amongst replica functions or replicated to different functions.

– Data replicated to S1:func1, S1:func2 but split between endpoint list of each.

State Tag Endpoint List

S3:func3

S1:func1

S1:func2

S4:func3

S5:func3

tcp://192.168.1.1:7711, tcp://192.168.1.2:7711, inproc://someid, amqp://rabbitmq

tcp://192.168.1.3:7711, inproc://someid, amqp://rabbitmq

tcp://192.168.1.5:7711, inproc://someid, amqp://rabbitmq

Processing Congestion Avoidance

● Data partitioning on heterogeneous performing resources is not straight forward. – Fine-grain partitioning increases parellisim, increases

overhead (communication, etc).

– Coarse grain partitioning decreases overhead, decreases parallelism.

– Too much data can crash a VM.

● Dynamically controlling granularity.– Successively increment data payload using PRTT

(Processing Round Trip Time).

– Reduce payload when degradation is detected.

Processing Efficiency

Foo()Bar()

ttimestamp

Data record1

Data record2

Data recordn

ttimestamp

Efficiency = texec

/ (tnow

- ttimestamp

VM1VM2

Data Processing State Graph

The graph describing the allowable data transitions is also a data protocol header which allows data to be self contained and routable and processable.

MRI - Workflow

MRI – State Graph

Combined State Network

State Transition Pipeline

● Creating a chain of state transitions on 11 Vms

● Each VM hosted 2 transitions 1 Forward, 1 Backward

Filtering Machine

PCA – 1 Machine Shared Memory

● Functions on same node.● Communication through shared memory.● Stable Efficiency ~0.85● Little change in efficiency after ~150 Data Records.● Minimum comm overhead.

PCA – 2 VMs P2P

● Functions on different nodes on same virtual network.● Communication P2P through ZMQ.● Stable Efficiency ~0.65

● Little change in efficiency after ~1550 Data Records.

PCA – 2 LAN VMs AMQP Server

● Functions on different nodes. ● Communication through 3rd party message queue server.● Evidence of jitter.

● Efficiency increase but never reached plateau. ● Efficiency was degrading due to VM running out of memory.

PCA – Local Cloud – EC2

● Functions on different nodes 1 on local private cloud and the other on EC2. ● Communication through 3rd party message queue server.

● Evidence of heavy jitter.● Strange result that we have two efficiency curves?

Conclusion/Future Work

● Data automata modeling provides– Better understanding of data transformations

– Doubles as routing table for distributed architectures

– Facilitates computing collaboration

● Data model can be considered as a schema. Data records can have an additional dimension, state, which results in a kind of 3D database.

Questions?

● REFs– https://github.com/recap/pumpkin

Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the...

Documents

Weighted Tree Automata and Transducers for Syntactic ...jonmay/pubs/thesis_ss.pdf · Weighted Tree Automata and Transducers for Syntactic Natural Language Processing Jonathan David

Automata Processing: Accelerating Big Data

Web Data Management - Inriawebdam.inria.fr/Jorge/files/wdm-typing.pdfXML data typ-ing is typically based on ﬁnite-state automata. Therefore, we recall basic notion of automata, ﬁrst

Data mining with cellular automata - SIGKDD · Data mining with cellular automata Tom Fawcett Center for the Study of Language and Information Stanford University Stanford, CA 94305

Finite-State Automata Shallow Processing Techniques for NLP Ling570 October 5, 2011

Regular Expressions & Finite State Automata – Part 2 ICS 482: Natural Language Processing

Demystifying Automata Processing: GPUs, FPGAs or Micron’s …synergy.cs.vt.edu/pubs/papers/yu-demysifying-ics17.pdfDemystifying Automata Processing: GPUs, FPGAs or Micron’s AP?

Model Checking Timed Automata - ISTEiste.co.uk/data/doc_wrkszvritcbv.pdf · Chapter 4 Model Checking Timed Automata 4.1. Introduction Currently, formal veriﬁcation of reactive,

Words: Surface Variation and Automata CMSC 35100 Natural Language Processing April 3, 2003

AUTOMATA BASED VERIFICATION OVER LINEARLY ORDERED …szymtor/papers/regdata.pdf · AUTOMATA BASED VERIFICATION OVER LINEARLY ORDERED DATA DOMAINS LUCSEGOUFIN1 ANDSZYMONTORUŃCZYK2

Data Mining Using Learning Automata

ISSN: INFORMATION LOSS MINIMIZATION USING AUTOMATA … · Keywords: Associative Classification, Data Mining, Classification, Automata 1. INTRODUCTION Classification is considered

Data Processing Whitepaper - What type of data processing

Reasoning about data repetitions with counter systems€¦ · Specifying classes of data words I Automata I Register automata [Kaminski & Francez, TCS 94] I Data automata [Bouyer

Applications of Weighted Tree Automata and Tree ... · Applications of Weighted Tree Automata and Tree Transducers in Natural Language Processing Andreas Maletti Institute of Computer

DISTRIBUTED PROCESSING IN AUTOMATAprahladh/papers/KBH/KBH1999.pdfDistributed Automata are a group of automata working in unison to accept one language. We build the theory of distributed

automata In Natural Language Processing - École Pour · Automata in Natural Language Processing Jimmy Ma Technical Report no0834, December 2008 revision 2002 VAUCANSON has been designed

Digital Signal Processing, Cellular Automata, and Parallelism

Data processing - University Of Maryland · Data processing Lec 04 - May 31, 2019. NEXT!2 Data collection Data processing Exploratory analysis & ... Data Processing Operations, which

Cellular Automata and Parallel Processing for Practical