Changing Engines in Midstream: A Java Stream Computational

  • View

  • Download

Embed Size (px)

Text of Changing Engines in Midstream: A Java Stream Computational

  • Changing Engines in Midstream: A Java StreamComputational Model for Big Data Processing

    Xueyuan Su, Garret Swart, Brian Goetz, Brian Oliver, Paul SandozOracle Corporation


    ABSTRACTWith the addition of lambda expressions and the StreamAPI in Java 8, Java has gained a powerful and expressivequery language that operates over in-memory collections ofJava objects, making the transformation and analysis of da-ta more convenient, scalable and efficient. In this paper,we build on Java 8 Stream and add a DistributableStreamabstraction that supports federated query execution over anextensible set of distributed compute engines. Each queryeventually results in the creation of a materialized resultthat is returned either as a local object or as an engine de-fined distributed Java Collection that can be saved and/orused as a source for future queries. Distinctively, Distributa-bleStream supports the changing of compute engines bothbetween and within a query, allowing different parts of acomputation to be executed on different platforms. At exe-cution time, the query is organized as a sequence of pipelinedstages, each stage potentially running on a different engine.Each node that is part of a stage executes its portion of thecomputation on the data available locally or produced bythe previous stage of the computation. This approach allowsfor computations to be assigned to engines based on pric-ing, data locality, and resource availability. Coupled withthe inherent laziness of stream operations, this brings greatflexibility to query planning and separates the semantics ofthe query from the details of the engine used to execute it.We currently support three engines, Local, Apache HadoopMapReduce and Oracle Coherence, and we illustrate hownew engines and data sources can be added.

    1. INTRODUCTIONIn this paper, we introduce DistributableStream, a Java

    API that enables programmers to write distributed and fed-erated queries on top of a set of pluggable compute engines.Queries are expressed in a concise and easy to understandway as illustrated in the WordCount example shown in Pro-gram 1. The framework supports engines that are disk ormemory based, local or distributed, pessimistic or optimistic

    This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this li-cense, visit Obtain per-mission prior to any use beyond those covered by the license. Contactcopyright holder by emailing Articles from this volumewere invited to present their results at the 40th International Conference onVery Large Data Bases, September 1st - 5th 2014, Hangzhou, China.Proceedings of the VLDB Endowment, Vol. 7, No. 13Copyright 2014 VLDB Endowment 2150-8097/14/08.

    fault handling. Distinctively it can also federate multi-stagequeries over multiple engines to allow for data access to belocalized, or resource utilization to be optimized. Since Dis-tributableStream is a pure Java library with an efficient localimplementation, it can also scale down for processing smallamounts of data within a JVM.

    Program 1 WordCount

    public static Map wordCount(

    DistributableStream stream) {

    return stream

    .flatMap(s -> Stream.of(s.split("\\s+")))


    .toMap(s -> s, s -> 1, Integer::sum)); }

    The contributions of our work include:

    A DistributableStream interface for processing Big Da-ta across various platforms. The abstraction frees de-velopers from low-level platform-dependent details andallows them to focus on the algorithmic part of theirdesign. At execution time, stream applications aretranslated into query plans to be executed on the con-figured compute engines.

    A pluggable Engine abstraction. This design separatesengine-specific configuration from the stream compu-tational model and the modular design makes addingnew JVM-enabled engines straightforward, supportingnegotiable push or pull-based data movement acrossengine boundaries. Applications developed with theDistributableStream API can be reused on differentengines without rewriting any algorithmic logic.

    Three engine implementations two distributed en-gines, Apache Hadoop and Oracle Coherence, and alocal engine, layered on the Java 8 Stream thread poolwhere local computations are executed in parallel onmultiple cores serve to validate and commercializethis design. Our implementation approach allows forboth high-level query planner and low-level compileroptimizations. On the high level, we use explicit datastructures to describe the processing steps and allowan execution planner to manipulate these data struc-tures to optimize the movement of data and state be-tween the engines. On the low level, each portion ofa DistributableStream is assembled as a local Java 8Stream which arranges its code to make Java Just-in-Time based compilation optimizations applicable.


  • Implementations of Java 8 Streams for various externaldata sources, including InputFormatStream for access-ing HDFS and other Hadoop data sources and Named-CacheStream for accessing Oracle Coherence; and Java8 Stream Collector implementations, such as Output-FormatCollector and NamedCacheCollector, for stor-ing the results of stream computations into ApacheHadoop and Oracle Coherence.

    Federated query model: Data streams connect fusibleoperators within a single engine pipeline and opera-tors running in parallel on different engines, forminga clean and easy to understand federated query mod-el. The ability to exploit multiple engines within asingle query allows us to most efficiently exploit diskarrays, distributed memory, and local memory for per-forming bulk ETL, iterative jobs, and interactive ana-lytics, by applying the appropriate compute enginesfor each processing stage. For example, combiningHadoop with a distributed memory engine (Coherenceor Spark) and an SMP, we can read input data fromHDFS, cache and efficiently process it in memory, andperform the final reduction on a single machine.

    Applications are the proof of any programming sys-tem and WordCount is not enough. For that reasonwe show this interface at work for distributed reservoirsampling, PageRank, and k-means clustering. Theseexamples demonstrate the expressiveness and concise-ness of the DistributableStream programming model.We also run these examples on different combinationsof engines and compare their performance.

    A variety of features are incomplete as of this writing,including a global job optimizer, job progress monitoring,and automatic cleanup of temporary files and distributedobjects dropped by client applications. In addition we haveplans to support other engines and data sources, such asSpark, Tez, and the Oracle Database, high on our list.

    Our discussion starts with an overview of Java 8 Stream(2) and DistributableStream(3). We then present detaileddiscussions on the design and implementation of Distributa-bleStream (4), example stream applications (5), and ex-perimental evaluation (6). Finally, we explore work thatare promising for future development (7), survey relatedwork (8), and conclude the paper (9).

    2. STREAM IN JAVA 8Java SE 8 [12] is the latest version of the Java platform

    and includes language-level support for lambda expressionsand the Stream API that provides efficient bulk operationson Java Collections and other data sources.

    2.1 Prerequisite Features for Supporting Stream

    2.1.1 Lambda ExpressionsA Java lambda expression is an anonymous method con-

    sisting of an argument list with zero or more formal param-eters with optionally inferred types, the arrow token, and abody consisting of a single expression or a statement block.(a,b) -> a+b is a simple example that computes the sumof two arguments, given that the enclosing environment pro-vides sufficient type information to infer the types of a, b.

    Lambda expressions are lifted into object instances by aFunctional Interface. A functional interface is an interfacethat contains only one abstract method. Java 8 predefinessome functional interfaces in the package java.util.function,including Function, BinaryOperator, Predicate, Supplier, Con-sumer, and others. Concise representation of a lambda ex-pression depends on target typing. When the Java compilercompiles a method invocation, it has a data type for each ar-gument. The data type expected is the target type. Targettyping for a lambda expression is used to infer the argumentand return type of a lambda expression. This allows types tobe elided in the definition of the lambda expression, makingthe expression less verbose and more generic.

    The Java 8 Stream API is designed from the ground upfor use with lambda expressions, and encourages a statelessprogramming style which provides maximum flexibility forthe implementation.

    2.1.2 Dynamic CompilationThe Just-in-Time compiler is an important technique used

    to optimize the performance of programs at runtime, inliningmethods, removing redundant loads, and eliminating deadcode. The design of the Java 8 Stream API plays naturallyinto the strength of such dynamically-compiling runtime,exploiting its ability to identify and dynamically optimizecritical loops. This design allows the performance of streamsto continue to improve as the Java runtime environmentimproves, resulting in highly efficient stream execution.

    An important design principle is to ensure the classes formethods invoked in critical loops can be determined by theJIT when the loop is compiled, as this allows the inner meth-ods to be inlined and the loop to be intelligently unrolled.

    2.2 The Stream Computational ModelJava Stream is an abstraction that represents a sequence

    of elements that support sequential and parallel aggregateoper