Performance of Java 8 and beyond
By Jeroen Borgers
Contents
1. Introduction
2. Lambda expressions
3. Stream API
4. Parallel execution & cores
5. Filter map reduce, parallel streams internals
6. Fork-join framework use
7. Lambdas versus inner classes
8. Tiered compilation
9. PermGen removal
10. java.time performance
11. Accumulators and Adders
12. Map improvements
13. Java 9+ improvements
14. Utilization of GPUs
15. Value Types
16. Arrays 2.0
17. Summary and conclusions
Introduction to lambdas and streams
• Java 8 introduces lambda expressions for functional programming
• With the Stream API, iteration can be handled internally by a library
• "Tell, don't ask": tell the library to apply a function to a collection
• Or tell it to do that in parallel, on multiple cores
• The question is whether this improves your response time
Lambda expressions and streams: example
Lambda expressions and streams: example with method references
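The code shown on these two slides did not survive in this transcript. As a stand-in, here is a minimal sketch of the same idea: one stream pipeline written with lambda expressions and once more with method references. The class name, the helper `isLong` and the sample names are hypothetical and not taken from the talk.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class LambdaStreamExample {
    // Keeps names longer than three characters and upper-cases them, using lambdas.
    static List<String> withLambdas(List<String> names) {
        return names.stream()
                .filter(name -> name.length() > 3)
                .map(name -> name.toUpperCase())
                .collect(Collectors.toList());
    }

    // Same pipeline, with the lambdas replaced by method references.
    static List<String> withMethodReferences(List<String> names) {
        return names.stream()
                .filter(LambdaStreamExample::isLong)
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    // Hypothetical helper so the filter can be written as a method reference.
    static boolean isLong(String name) {
        return name.length() > 3;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Ann", "Jeroen", "Bob", "Duke");
        System.out.println(withLambdas(names));          // prints [JEROEN, DUKE]
        System.out.println(withMethodReferences(names)); // prints [JEROEN, DUKE]
    }
}
```

In both variants the library drives the iteration; the caller only says what to do with each element.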
Lambda expressions: short notation
• Short notation for an instance of an anonymous inner class of a functional interface
• A functional interface has only one abstract method
  • Runnable: void run()
  • Executor: void execute(Runnable r)
  • Iterable<T>: Iterator<T> iterator()
• New: java.util.function
  • Consumer<T>: void accept(T t)
  • Function<T, R>: R apply(T t)
  • Predicate<T>: boolean test(T t)
• Annotation: @FunctionalInterface
Anonymous inner class instance example
Inner class has boilerplate code
Lambda expression is concise
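The slide code for this comparison is missing from the transcript. The sketch below shows the contrast these three slides describe, using `Predicate<String>` from java.util.function. The class and field names are illustrative assumptions, not the speaker's code.

```java
import java.util.function.Predicate;

class InnerClassVersusLambda {
    // Anonymous inner class: five lines of boilerplate around one expression.
    static final Predicate<String> IS_BLANK_INNER = new Predicate<String>() {
        @Override
        public boolean test(String s) {
            return s.trim().isEmpty();
        }
    };

    // Lambda expression: same functional interface, only the essential part remains.
    static final Predicate<String> IS_BLANK_LAMBDA = s -> s.trim().isEmpty();

    public static void main(String[] args) {
        System.out.println(IS_BLANK_INNER.test("   "));  // true
        System.out.println(IS_BLANK_LAMBDA.test("   ")); // true
        System.out.println(IS_BLANK_LAMBDA.test("x"));   // false
    }
}
```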
Stream pipeline
• Source
• Intermediate operations: lazy evaluation
• Terminal operations: eager evaluation
Stream lazy evaluation
Stream lazy evaluation optimizes with short-circuiting - can be a big win
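To make the short-circuiting effect visible, the following sketch counts how many elements a sequential pipeline actually inspects before `findFirst()` is satisfied. The class and method names are hypothetical; `peek` is used only for instrumentation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class LazyEvaluationDemo {
    // Returns how many source elements were pulled through the pipeline
    // before the first even number was found.
    static int elementsInspected(List<Integer> numbers) {
        AtomicInteger inspected = new AtomicInteger();
        numbers.stream()
                .peek(n -> inspected.incrementAndGet()) // intermediate: lazy
                .filter(n -> n % 2 == 0)                // intermediate: lazy
                .findFirst();                           // terminal: short-circuits
        return inspected.get();
    }

    public static void main(String[] args) {
        // 1 and 3 are rejected, 4 matches; 5 and 6 are never looked at.
        System.out.println(elementsInspected(Arrays.asList(1, 3, 4, 5, 6))); // prints 3
    }
}
```

Without a terminal operation nothing runs at all, and with a short-circuiting one the rest of the source is skipped, which is where the big win comes from on large sources.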
Stream executed in parallel
Parallel execution & hardware threads
• Parallel != concurrent
• CPU frequency is at its maximum
• Number of cores/hardware threads keeps increasing: 64+
• Must be able to utilize those cores
  • need to process data faster: Big Data, IoT
• Runtime.getRuntime().availableProcessors()
  • reports the number of hardware threads
  • my Mac: 2 cores with 2 hyper-threads each = 4
• Can we get a speedup of ~4?
Parallel streams utilize ForkJoinPool
• Java 8 ForkJoinPool introduces a common pool for any ForkJoinTask
  • one per JVM
• Used in Arrays.parallelSort, Arrays.parallelSetAll and parallelStream
• Size defaults to Runtime.getRuntime().availableProcessors() - 1
• Can be set with:
  • -Djava.util.concurrent.ForkJoinPool.common.parallelism=N
• Multiple JVMs on a machine
  • consider lowering the pool size
• Tasks waiting for I/O
  • consider increasing the pool size
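A small sketch for checking these numbers on your own machine. `defaultCommonPoolSize` is a hypothetical helper that restates the default described above, with the assumption that the pool never drops below one worker; the other calls are standard JDK 8 API.

```java
import java.util.concurrent.ForkJoinPool;

class CommonPoolInfo {
    // Default sizing as described above: one less than the number of
    // hardware threads. Assumption: at least one worker on a single-core machine.
    static int defaultCommonPoolSize(int hardwareThreads) {
        return Math.max(1, hardwareThreads - 1);
    }

    public static void main(String[] args) {
        int hardwareThreads = Runtime.getRuntime().availableProcessors();
        System.out.println("Hardware threads:         " + hardwareThreads);
        System.out.println("Expected default size:    " + defaultCommonPoolSize(hardwareThreads));
        // Differs from the expected default when the parallelism property is set.
        System.out.println("Actual common pool size:  " + ForkJoinPool.getCommonPoolParallelism());
    }
}
```

Running it once as is and once with -Djava.util.concurrent.ForkJoinPool.common.parallelism=2 shows the effect of the system property.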
Fork-join framework: divide-and-conquer
• Divide the task recursively into smaller tasks
  • e.g. divide an array of 640 elements into 64 leaf tasks of 10 elements
  • e.g. sum or sort on each level
• Many ForkJoinTasks processed by a limited number of threads, e.g. ForEachTask
  • like ThreadPoolExecutor
  • worse: overhead of creating tasks
  • better: work stealing from the queues of other threads
  • great for unbalanced tasks!
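The 640-element example can be sketched with a `RecursiveTask` that keeps halving its range until a leaf holds 10 elements. This is an illustrative implementation, not the code the stream library uses internally; the class name, `LEAF_SIZE` and the leaf counter are assumptions added for the demonstration.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.concurrent.atomic.AtomicInteger;

class ForkJoinSum extends RecursiveTask<Long> {
    static final int LEAF_SIZE = 10;
    // Counts leaf tasks, only to make the splitting visible.
    static final AtomicInteger leafTasks = new AtomicInteger();

    private final int[] data;
    private final int from;
    private final int to; // exclusive

    ForkJoinSum(int[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= LEAF_SIZE) {
            // Leaf: small enough to sum sequentially.
            leafTasks.incrementAndGet();
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        // Divide: fork the left half, compute the right half in this thread.
        int mid = (from + to) >>> 1;
        ForkJoinSum left = new ForkJoinSum(data, from, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, to);
        left.fork();
        return right.compute() + left.join();
    }

    static long sum(int[] data) {
        return ForkJoinPool.commonPool().invoke(new ForkJoinSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        int[] data = new int[640];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println("sum = " + sum(data));            // 205120
        System.out.println("leaf tasks = " + leafTasks.get()); // 64
    }
}
```

Forking one half and computing the other in the current thread keeps the worker busy; idle workers steal the forked halves from its queue.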
Performance of lambdas versus inner classes
• Lambdas seem to be syntactic sugar for creating an anonymous class
  • in fact, they are not
• Inner class
  • actual class loaded by the class loader
  • new object created: allocation, initialization, GC
• Lambda
  • creates a static method called through a helper class
• Performance is similar
  • only the first-time loading of the inner class by the class loader is slower
When to use parallel streams?
• source.parallelStream().operation(F)
• F is independent
  • computation on an element does not rely on or impact other elements
  • stateless, non-interfering
• Source is efficiently splittable
  • Collections, Arrays, SplittableRandom
  • not I/O based: designed for sequential use
• Computationally expensive
  • rule of thumb: sequential version takes > 100 µs
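As an illustration of the "stateless, non-interfering" requirement, the sketch below squares the elements of a list in parallel. The class and method names are hypothetical. The commented-out variant shows the kind of shared mutable state that breaks under parallel execution.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StatelessParallel {
    // Safe: the function only looks at its own element, and collect()
    // merges per-thread partial results in encounter order.
    static List<Integer> squares(List<Integer> input) {
        return input.parallelStream()
                .map(x -> x * x)
                .collect(Collectors.toList());
    }

    // Unsafe (do not do this): the lambda mutates a shared, non-thread-safe list.
    //   List<Integer> result = new ArrayList<>();
    //   input.parallelStream().forEach(x -> result.add(x * x));

    public static void main(String[] args) {
        System.out.println(squares(Arrays.asList(1, 2, 3, 4))); // prints [1, 4, 9, 16]
    }
}
```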
Parallel when computationally expensive
• source.parallelStream().operation(F)
• Rule of thumb: sequential version takes > 100 µs
• N * Q > 10 000
  • N = number of elements
  • Q = cost per element of F, in number of operations
  • small function like x -> x * x: N > 10 000 elements
  • moderately large function, Q = 100: N > 100 elements
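The N * Q rule of thumb can be written down as a small guard around the choice between `stream()` and `parallelStream()`. This is only a sketch: the class, the method names and the idea of passing Q in as a parameter are assumptions, and Q itself remains a rough estimate.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

class ParallelRuleOfThumb {
    static final long THRESHOLD = 10_000; // N * Q threshold from the rule of thumb

    // n = number of elements, q = estimated operations per element.
    static boolean worthParallel(long n, long q) {
        return n * q > THRESHOLD;
    }

    // Sums the squares, going parallel only when the rule of thumb says so.
    static long sumOfSquares(List<Integer> values, long costPerElement) {
        Stream<Integer> stream = worthParallel(values.size(), costPerElement)
                ? values.parallelStream()
                : values.stream();
        return stream.mapToLong(x -> (long) x * x).sum();
    }

    public static void main(String[] args) {
        System.out.println(worthParallel(1_000, 1));   // false: tiny function, few elements
        System.out.println(worthParallel(20_000, 1));  // true:  tiny function, many elements
        System.out.println(worthParallel(101, 100));   // true:  moderately large function
        System.out.println(sumOfSquares(Arrays.asList(1, 2, 3), 1)); // prints 14
    }
}
```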
Overhead of parallel execution
• Startup of power-controlled cores
• Sequential part of setting up the parallel calculation
• Splittability = ease of partitioning
  • efficient if random access or efficient search: ArrayLists, [Concurrent]HashMaps, arrays
  • inefficient: LinkedLists, BlockingQueues, I/O-based
• Stream from BufferedReader.lines() is currently for sequential use
  • might be improved in a future JDK, for highly efficient bulk processing of buffered I/O
Creating the micro benchmark: tiny calculation per element
Creating the micro benchmark 2
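The benchmark code from these slides is not part of the transcript. The sketch below is a hypothetical reconstruction of the general shape only: the same tiny calculation as an old school loop, a serial stream and a parallel stream, timed with `System.nanoTime()`. It assumes speedup is defined as baseline time divided by new time. A dedicated harness such as JMH gives far more trustworthy numbers than this hand-rolled timing.

```java
import java.util.function.LongSupplier;
import java.util.stream.LongStream;

class MicroBenchmarkSketch {
    static long oldSchool(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    static long serialLambda(int n) {
        return LongStream.range(0, n).map(x -> x * x).sum();
    }

    static long parallelLambda(int n) {
        return LongStream.range(0, n).parallel().map(x -> x * x).sum();
    }

    // Best-of-N wall-clock time in nanoseconds; the early runs double as warm-up.
    static long bestNanos(LongSupplier work, int runs) {
        long best = Long.MAX_VALUE;
        long sink = 0;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            sink += work.getAsLong();
            best = Math.min(best, System.nanoTime() - start);
        }
        if (sink == 42) {
            System.out.println(); // uses the result so the JIT cannot discard the work
        }
        return Math.max(best, 1); // avoid division by zero on coarse timers
    }

    public static void main(String[] args) {
        int n = 100_000;
        int runs = 200;
        double old = bestNanos(() -> oldSchool(n), runs);
        double serial = bestNanos(() -> serialLambda(n), runs);
        double parallel = bestNanos(() -> parallelLambda(n), runs);
        System.out.println("Speedup by using serial lambdas = " + old / serial);
        System.out.println("Speedup of parallel over serial lambdas = " + serial / parallel);
        System.out.println("Speedup of parallel over oldSchool = " + old / parallel);
    }
}
```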
Micro benchmark demo
Medium sized calculation benchmark
• 1000 elements
  • Speedup by using serial lambdas = 0.95884454
  • Speedup of parallel over serial lambdas = 1.2968781
  • Speedup of parallel over oldSchool = 1.2435045
• 100_000 elements
  • Speedup by using serial lambdas = 0.9760258
  • Speedup of parallel over serial lambdas = 2.1337924
  • Speedup of parallel over oldSchool = 2.0826366
Utilization of cores: medium calculation, 1000 and 100_000 elements
[Figure: core utilization, with the parallel part marked]
Tiny calculation benchmark
• 1000 elements
  • Speedup by using serial lambdas = 0.12944984
  • Speedup of parallel over serial lambdas = 0.46804
  • Speedup of parallel over oldSchool = 0.0605877
• 100_000 elements
  • Speedup by using serial lambdas = 0.10920245
  • Speedup of parallel over serial lambdas = 5.905797
  • Speedup of parallel over oldSchool = 0.64492756
Utilization of cores: tiny calculation, 1000 and 100_000 elements
Micro benchmark conclusions (for this benchmark, on this computer)
• For high performance and small functions: use old school loops
  • the lambda infrastructure costs more overhead than the function itself
• For high performance and large functions
  • serial by default
  • if N * Q > 100 000 then parallel
• I need more cores!
Tiered compilation
• JIT compiler came in 2 flavors, now 3
• -client (C1)
  • quick startup time
• -server (C2)
  • best performance in the long run
• -XX:+TieredCompilation
  • first C1, then C2
• Only in Java 8 is TieredCompilation the default
• Java 7: often need to increase the code cache
  • -XX:ReservedCodeCacheSize=96M (Java 7), 240M (Java 8)
PermGen removal
• Up to Java 7: PermGen; Java 8: Metaspace
• PermGen (wrong name)
  • also holds data not related to classes: String pool
• Metaspace
  • only class metadata
  • Class objects themselves on the heap
  • String pool on the heap
  • -XX:[Max]MetaspaceSize=N
  • default max 'unlimited' (1 GB)
• OutOfMemoryError: Metaspace instead of PermGen space
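One way to see which non-heap areas a running JVM actually has is to list its memory pools through the standard java.lang.management API, as sketched below. Pool names are JVM-specific: on a HotSpot Java 8 JVM a pool named "Metaspace" is expected where Java 7 reported a permanent generation pool, but treat the exact names as an assumption. The class name is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

class MemoryPools {
    // Names of all non-heap memory pools of the running JVM.
    static List<String> nonHeapPoolNames() {
        List<String> names = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.NON_HEAP) {
                names.add(pool.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // may be null for an invalid pool
            long usedKb = usage == null ? 0 : usage.getUsed() / 1024;
            System.out.println(pool.getType() + "  " + pool.getName() + "  used=" + usedKb + " KB");
        }
    }
}
```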
java.time performance
• Finally a proper library for date and time, replacing the crappy stuff:
  • java.util.Date
    • mutable - defensive copies needed
  • java.util.Calendar
    • 540 bytes to store timestamp, Locale, TZ - heap/GC
  • java.text.SimpleDateFormat
    • not thread safe - so has to be re-created
• Stephen Colebourne is spec lead, from Joda-Time
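A short sketch of how java.time avoids the two costs named above: values are immutable, so no defensive copies are needed, and `DateTimeFormatter` is thread-safe, so one instance can be shared instead of re-created. The class and method names are illustrative.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class JavaTimeExample {
    // Thread-safe and immutable: safe to share as a constant,
    // unlike java.text.SimpleDateFormat.
    static final DateTimeFormatter ISO = DateTimeFormatter.ISO_LOCAL_DATE;

    static String plusDays(String isoDate, long days) {
        LocalDate date = LocalDate.parse(isoDate, ISO);
        LocalDate later = date.plusDays(days); // returns a new value; 'date' is unchanged
        return later.format(ISO);
    }

    public static void main(String[] args) {
        System.out.println(plusDays("2014-11-24", 2)); // prints 2014-11-26
    }
}
```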
java.util.concurrent.atomic Accumulators and Adders
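Only the title of this slide survives in the transcript. As a hedged illustration of the classes it names, the sketch below uses `LongAdder` as a contended counter and `LongAccumulator` to track a running maximum, both updated from a parallel stream. The class and method names are assumptions.

```java
import java.util.concurrent.atomic.LongAccumulator;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.LongStream;

class AdderAndAccumulator {
    // LongAdder spreads updates over internal cells, which reduces contention
    // compared with many threads hammering a single AtomicLong.
    static long countEvents(long events) {
        LongAdder counter = new LongAdder();
        LongStream.range(0, events).parallel().forEach(i -> counter.increment());
        return counter.sum();
    }

    // LongAccumulator generalizes the idea to any associative function, here max.
    static long maxOf(long[] values) {
        LongAccumulator max = new LongAccumulator(Long::max, Long.MIN_VALUE);
        LongStream.of(values).parallel().forEach(max::accumulate);
        return max.get();
    }

    public static void main(String[] args) {
        System.out.println(countEvents(100_000));         // prints 100000
        System.out.println(maxOf(new long[] {3, 9, 4}));  // prints 9
    }
}
```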
Map improvements
• HashMap, LinkedHashMap and ConcurrentHashMap
• Collisions on keys: keys end up in the same bucket
  • access time O(1) -> O(n)
  • follow the linked list until key.equals() returns true
• Balanced tree instead of linked list
  • if size > TREEIFY_THRESHOLD (8)
  • worst case access time O(n) -> O(log(n))
  • keys should implement Comparable
  • branches on hashCode, then compareTo
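The worst case can be provoked with a deliberately bad key whose `hashCode()` is constant, so every entry lands in the same bucket. Because the hypothetical `Key` class below implements `Comparable`, a Java 8 HashMap can organize that bucket as a tree. The sketch only demonstrates correct lookups under full collision; it does not measure the timing difference.

```java
import java.util.HashMap;
import java.util.Map;

class CollidingKeysDemo {
    // Worst-case key: all instances share one hash code.
    static final class Key implements Comparable<Key> {
        final int id;

        Key(int id) {
            this.id = id;
        }

        @Override
        public int hashCode() {
            return 42; // forces every key into the same bucket
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }

        @Override
        public int compareTo(Key other) {
            return Integer.compare(id, other.id); // lets a treeified bucket order its keys
        }
    }

    // Fills a map with keyCount colliding keys and looks one up.
    // wantedId must be in [0, keyCount), otherwise unboxing null fails.
    static int lookup(int keyCount, int wantedId) {
        Map<Key, Integer> map = new HashMap<>();
        for (int i = 0; i < keyCount; i++) {
            map.put(new Key(i), i);
        }
        return map.get(new Key(wantedId));
    }

    public static void main(String[] args) {
        System.out.println(lookup(1_000, 777)); // prints 777
    }
}
```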
Java 9+ performance improvements
Sumatra: utilization of GPUs
• GPUs have 100_000s of stream cores
• SIMD - single instruction multiple data
• Work offloaded to GPU
• Implemented off-loadable version of parallel().forEach()
• Use parallel streams and lambdas
Value Types (JEP 169): the next big thing!
• Currently:
  • limited set of primitives, by value: no identity
  • others by reference: identity
• Footprint:
  • heap allocated
  • object headers
  • 1+ pointers pointing to it
  • burden for small objects
• Object identity only serves mutability
• JVM attempts to figure out if identity is needed
  • escape analysis and object elision can unwrap in some cases
  • fragile
  • object might be used as a lock, then it needs identity
Integer overhead
[Figure: an Integer is an object pointer to a heap object (mark word, class pointer, value), versus a plain value]
Point example
Point - class versus value type
• Point object layout: object pointer to a heap object (mark word, class pointer, x, y, padding)
• @Value Point layout: x, y
Arrays 2.0 improvements
• array[(long)i] = 5;
• array[i, j, k] = 7;
• Arrays.chop(T[] a, int newLen);
  • prevents copying in StringBuilder.toString()
• Arrays become real Java objects
  • indexes of types other than int, long
  • like Map
• Thread-safe access for array slices
  • final/volatile
Summary and conclusions
• Lambdas and streams offer possible performance improvement
  • lazy evaluation
  • tiny calcs, or small #elements & medium size calc
    • don't use parallel()
    • consider old school iterations if performance is important
• Many performance improvements in Java 8
  • Use it if you can and get better performance
• Several performance improvements planned for Java 9+ (10?)
  • Better support for Big Data & number crunching
Want to know more?
• www.jpinpoint.com / www.profactive.com
  • references, presentations
• Accelerating Java Applications
  • 3 days technical training
  • 24-25-26 November 2014
  • nl-jug members 10% discount
  • hand in your business card today: 20% discount
Questions?