Data Wars: The Bloody Enterprise strikes back

Page 1: Data Wars: The Bloody Enterprise strikes back

Data Wars

The Bloody Enterprise strikes back

Page 2: Data Wars: The Bloody Enterprise strikes back

Victor Polischuk

@alkovictorvictor-cr

Long time ago in a galaxy far,

far away…

Page 3: Data Wars: The Bloody Enterprise strikes back

It is GOOD

when we have a lot of

data

when we have data

several years old, the older

the better

It is BAD

when we have to remove

historical data

Page 4: Data Wars: The Bloody Enterprise strikes back

It is BAD

when we have a lot of

code

when we have code

several years old, the older

the worse

It is GOOD

when we have to remove

historical code

Page 5: Data Wars: The Bloody Enterprise strikes back

Money, numbers, and arithmetic

Identities

Text data and Strings

Date and time

Page 6: Data Wars: The Bloody Enterprise strikes back

Please Participate

Page 7: Data Wars: The Bloody Enterprise strikes back

Money

Page 8: Data Wars: The Bloody Enterprise strikes back

Money

Float & Double

Convert to Integer

Page 9: Data Wars: The Bloody Enterprise strikes back

Money Float & Double Problem

Developers usually have no idea how it is represented:

Page 10: Data Wars: The Bloody Enterprise strikes back

Money Float & Double Quiz #1

Float: 0.6 + 0.1

• = ?

Double: 0.6 + 0.1

• = ?

Page 11: Data Wars: The Bloody Enterprise strikes back

Money Float & Double Quiz #1

Float: 0.6 + 0.1

• = 0.70000005

Double: 0.6 + 0.1

• = 0.7

Page 12: Data Wars: The Bloody Enterprise strikes back

Money: Float & Double: stackoverflow.com

Page 13: Data Wars: The Bloody Enterprise strikes back

Money Float & Double Quiz #2

Float: 0.2 + 0.1

• = ?

Double: 0.2 + 0.1

• = ?

Page 14: Data Wars: The Bloody Enterprise strikes back

Money Float & Double Quiz #2

Float: 0.2 + 0.1

• = 0.3

Double: 0.2 + 0.1

• = 0.30000000000000004
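Both quizzes can be reproduced with a few lines of Java. This is a minimal sketch; the class name `FloatQuiz` and its helper methods are made up for illustration and are not part of the original slides.

```java
class FloatQuiz {
    // Adds in single precision and renders the result the way println would.
    static String floatSum(float a, float b) {
        return Float.toString(a + b);
    }

    // Same addition in double precision.
    static String doubleSum(double a, double b) {
        return Double.toString(a + b);
    }

    public static void main(String[] args) {
        System.out.println("float  0.6 + 0.1 = " + floatSum(0.6f, 0.1f));
        System.out.println("double 0.6 + 0.1 = " + doubleSum(0.6, 0.1));
        System.out.println("float  0.2 + 0.1 = " + floatSum(0.2f, 0.1f));
        System.out.println("double 0.2 + 0.1 = " + doubleSum(0.2, 0.1));
    }
}
```

Neither type is "more correct": which sum happens to print cleanly depends on how the rounding errors of the operands line up.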

Page 15: Data Wars: The Bloody Enterprise strikes back

Money Float & Double

Drill Down

• Binary representation: [sign] [exponent] [mantissa]
• Float: 1 bit, 8 bits, 23 bits
• Double: 1 bit, 11 bits, 52 bits
• Value:

0.1f = 0-01111011-10011001100110011001101
0.1f = +2^(123-127) * (1 + 2^-1 + 2^-4 + 2^-5 + 2^-8 + 2^-9 + 2^-12 + 2^-13 + 2^-16 + 2^-17 + …)
0.1f = 2^-4 * (1 + 5033165 / 2^23) = 0.100000001490116119384765625
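The bit layout of 0.1f can be checked from Java itself. The sketch below is illustrative only (the `FloatBits` class and its method names are assumptions); it relies on `Float.floatToIntBits` to get the raw bits and on `new BigDecimal(double)` to print the exact stored value.

```java
import java.math.BigDecimal;

class FloatBits {
    // Left-pads a binary string with zeros to the given width.
    static String pad(String bits, int width) {
        return String.format("%" + width + "s", bits).replace(' ', '0');
    }

    // Renders a float as sign-exponent-mantissa, matching the slide notation.
    static String layout(float f) {
        int bits = Float.floatToIntBits(f);
        int sign = bits >>> 31;
        int exponent = (bits >>> 23) & 0xFF;
        int mantissa = bits & 0x7FFFFF;
        return sign + "-" + pad(Integer.toBinaryString(exponent), 8)
                + "-" + pad(Integer.toBinaryString(mantissa), 23);
    }

    // Exact decimal expansion of the value the float really holds.
    static String exactValue(float f) {
        return new BigDecimal(f).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(layout(0.1f));
        System.out.println(exactValue(0.1f));
    }
}
```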

Page 16: Data Wars: The Bloody Enterprise strikes back

Money Float & Double Example

Page 17: Data Wars: The Bloody Enterprise strikes back

Money Float & Double Quiz #3

+0.0f = 0-00000000-00000000000000000000000

-0.0f = 1-00000000-00000000000000000000000

+0.0f == -0.0f?

Page 18: Data Wars: The Bloody Enterprise strikes back

Money Float & Double

Drill Down

/**
 * Get or create float value for the given float.
 *
 * @param d the float
 * @return the value
 */
public static ValueFloat get(float d) {
    if (d == 1.0F) {
        return ONE;
    } else if (d == 0.0F) {
        // -0.0 == 0.0, and we want to return 0.0 for both
        return ZERO;
    }
    return (ValueFloat) Value.cache(new ValueFloat(d));
}
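The signed-zero quiz has more than one answer in Java, depending on which comparison is used. The following sketch uses a hypothetical `SignedZero` class to show the difference between `==` and `Float.equals`.

```java
class SignedZero {
    // Primitive comparison follows IEEE 754: both zeros are equal.
    static boolean primitiveEquals(float a, float b) {
        return a == b;
    }

    // Boxed comparison looks at the bit pattern, so the sign bit matters.
    static boolean boxedEquals(float a, float b) {
        return Float.valueOf(a).equals(Float.valueOf(b));
    }

    public static void main(String[] args) {
        System.out.println(primitiveEquals(0.0f, -0.0f)); // true
        System.out.println(boxedEquals(0.0f, -0.0f));     // false
        System.out.println(1 / 0.0f);                     // Infinity
        System.out.println(1 / -0.0f);                    // -Infinity
    }
}
```

This matters as soon as floats are used as map keys or compared through `compareTo`, which also distinguishes the two zeros.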

Page 19: Data Wars: The Bloody Enterprise strikes back

Money Float & Double Summary

Just never use it

Forget it exists

Unless you are working on a video codec

Page 20: Data Wars: The Bloody Enterprise strikes back

Money Convert to Integer

Multiply decimals up to integers

• (as a constant probably)

Keep the “scale” somewhere else

Page 21: Data Wars: The Bloody Enterprise strikes back

Money Convert to Integer Quiz #1

10 * 230 + 5

• = ?

Page 22: Data Wars: The Bloody Enterprise strikes back

Money Convert to Integer Quiz #1

10 * 230 + 5

• = 28, where 10 is 10%
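The quiz answer only makes sense once the hidden scale is known: 10 means 10%, so the intended result is 10% of 230 plus 5. A small sketch of the trap, with a made-up `ScaledPercent` helper:

```java
class ScaledPercent {
    // What the code literally says when the scale is forgotten.
    static long naive(long rate, long amount, long fee) {
        return rate * amount + fee;
    }

    // What the author meant: the rate is stored as whole percent (scale 100).
    static long withScale(long ratePercent, long amount, long fee) {
        return ratePercent * amount / 100 + fee;
    }

    public static void main(String[] args) {
        System.out.println(naive(10, 230, 5));     // 2305
        System.out.println(withScale(10, 230, 5)); // 28
    }
}
```

Nothing in the type `long` tells the reader which of the two is right, which is the core problem with keeping the scale "somewhere else".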

Page 23: Data Wars: The Bloody Enterprise strikes back

Money Convert to Integer

Drill Down

@Embeddable
public class Amount implements Serializable {
    private int rate;
    @Transient
    private final int scale;

    public Amount() {
        scale = 6;
    }

    public Amount(int rate, int scale) {
        this.scale = scale;
        setRate(rate);
    }
    …

Page 24: Data Wars: The Bloody Enterprise strikes back

Money Convert to Integer Summary

• It is better to keep precision closer to the number
• It is better when arithmetic just works
• It is better when equals and compareTo work
• int <*/+> int can exceed int (same with long)
• Consistency is almost always above performance
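The point about int arithmetic exceeding int can be shown directly. In this sketch the `OverflowCheck` class is hypothetical; `Math.multiplyExact` is a standard JDK method (Java 8+) that throws instead of wrapping.

```java
class OverflowCheck {
    // Plain multiplication wraps around silently on overflow.
    static int unchecked(int a, int b) {
        return a * b;
    }

    // Reports whether the exact product does not fit into an int.
    static boolean overflows(int a, int b) {
        try {
            Math.multiplyExact(a, b);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(unchecked(100_000, 100_000)); // 1410065408, not 10000000000
        System.out.println(overflows(100_000, 100_000)); // true
    }
}
```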

Page 25: Data Wars: The Bloody Enterprise strikes back

Money Solution BigDecimal

Precision and accuracy are known and adjustable

Arithmetic is included

Supported by JDK, JDBC, etc.

Performance is quite nice
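A short sketch of the earlier quizzes redone with `BigDecimal`. The class name `MoneyBigDecimal` and the choice of two decimal places with `HALF_EVEN` rounding are assumptions for illustration, not something the slides prescribe.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class MoneyBigDecimal {
    // Decimal addition without binary rounding surprises.
    static BigDecimal add(String a, String b) {
        return new BigDecimal(a).add(new BigDecimal(b));
    }

    // rate * amount + fee, rounded to two decimal places.
    static BigDecimal charge(String rate, String amount, String fee) {
        return new BigDecimal(rate)
                .multiply(new BigDecimal(amount))
                .add(new BigDecimal(fee))
                .setScale(2, RoundingMode.HALF_EVEN);
    }

    public static void main(String[] args) {
        System.out.println(add("0.2", "0.1"));            // 0.3
        System.out.println(charge("0.10", "230", "5"));   // 28.00
    }
}
```

One caveat worth knowing: `equals` on `BigDecimal` also compares the scale, so `28.00` and `28` are not equal, while `compareTo` treats them as the same value.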

Page 26: Data Wars: The Bloody Enterprise strikes back

Identity

Page 27: Data Wars: The Bloody Enterprise strikes back

Identity

Unexpected Overflow

UUID

Page 28: Data Wars: The Bloody Enterprise strikes back

Identity Overflow

Integer and Long are finite types

Sometimes they can overflow

Moreover, they are usually half as big as you think
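One reading of "half as big" is that Java integer types are signed, so a 32-bit identity runs out at 2^31 - 1 rather than 2^32 - 1. A sketch with a hypothetical `SignedRange` class:

```java
class SignedRange {
    // Increment that wraps silently, as a naive id generator would.
    static int next(int id) {
        return id + 1;
    }

    // Increment that reports exhaustion instead of wrapping.
    static boolean exhausted(int id) {
        try {
            Math.addExact(id, 1);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(Integer.MAX_VALUE);       // 2147483647
        System.out.println(next(Integer.MAX_VALUE)); // -2147483648
        System.out.println(exhausted(Integer.MAX_VALUE));
    }
}
```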

Page 29: Data Wars: The Bloody Enterprise strikes back

Identity Overflow Example

Page 30: Data Wars: The Bloody Enterprise strikes back

Identity Overflow Example

Page 31: Data Wars: The Bloody Enterprise strikes back

Identity Overflow Sucker Punch

• Also, “some languages” cannot work with 53+ bit integer types
• In addition, “some languages” work with custom 32-bit integer types
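The 53-bit limit comes from storing every number as an IEEE 754 double, which has a 52-bit mantissa plus an implicit leading bit. The sketch below simulates such a client in Java by round-tripping a long through a double; the `DoubleIds` class is an assumption made for illustration.

```java
class DoubleIds {
    // Simulates a client that keeps all numbers as doubles.
    static long throughDouble(long id) {
        return (long) (double) id;
    }

    // Text representation survives any client unchanged.
    static long throughText(long id) {
        return Long.parseLong(Long.toString(id));
    }

    public static void main(String[] args) {
        long id = (1L << 53) + 1;              // 9007199254740993
        System.out.println(throughDouble(id)); // 9007199254740992
        System.out.println(throughText(id));   // 9007199254740993
    }
}
```

This is the practical argument for exposing identities as text in an API even when the database keeps them as integers.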

Page 32: Data Wars: The Bloody Enterprise strikes back

Identity Overflow Summary

There is a difference between DB and API identity

• Always use integer types as identity for DB
• Always use text types as identity for API
• Avoid using 32-bit types as identity at all

Unless you are 99.9% sure

Page 33: Data Wars: The Bloody Enterprise strikes back

Identity UUID

• Are not guaranteed to be globally unique
• Not K-ordered
• In most cases are excessively big (128 bits)
• Can cause serious performance degradation
• Have different versions, which may suit better or worse
• Strangely enough, RDBMSs rarely support UUID/GUID data types
• Weird:
  • Time is based on 100-nanosecond intervals since the 15th of October 1582
  • Were invented/published around 1999

Page 34: Data Wars: The Bloody Enterprise strikes back

Identity UUID Store as String

UUID is a 128-bit value

• 16 bytes

A96A0D4C-49D0-4431-B126-4C66688ADEF3

• 36 symbols, which is more than 2 times bigger
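The size difference is easy to measure. This sketch packs a `java.util.UUID` into 16 bytes with `ByteBuffer` and compares it with the text form; the `UuidSize` class and its method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

class UuidSize {
    // Packs the two 64-bit halves into a 16-byte array (big-endian).
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    // Restores the UUID from its 16-byte form.
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        return new UUID(buffer.getLong(), buffer.getLong());
    }

    public static void main(String[] args) {
        UUID id = UUID.fromString("A96A0D4C-49D0-4431-B126-4C66688ADEF3");
        System.out.println(toBytes(id).length);   // 16
        System.out.println(id.toString().length()); // 36
    }
}
```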

Page 35: Data Wars: The Bloody Enterprise strikes back

Identity UUID Drill Down

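High long (64 bits): time_low (32 bits) | time_mid (16 bits) | version (4 bits) | time_hi (12 bits)

Low long (64 bits): variant (2 bits in the RFC 4122 layout) | clock_seq (14 bits) | node (48 bits)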

Page 36: Data Wars: The Bloody Enterprise strikes back

Identity UUID Example

$ uuidgen
0c8aa0f6-9f6f-4fad-9662-1b683f2f4a0d
$ uuidgen
1ee09695-3a04-4e7a-8bab-e67dabc4b5a2
$ uuidgen -t
3770f4d0-88b3-11e6-bba6-005056bb68cb
$ uuidgen -t
3b14dd54-88b3-11e6-8c53-005056bb68cb
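The fields of the generated values can be read back with `java.util.UUID`. The sketch below decodes the first time-based UUID from the example; the `UuidFields` class and the `UUID_EPOCH_OFFSET` constant name are made up here, while the offset itself is the number of 100-nanosecond intervals between 15 October 1582 and the Unix epoch.

```java
import java.time.Instant;
import java.util.UUID;

class UuidFields {
    // 100-ns intervals between 1582-10-15 and 1970-01-01 (141427 days).
    static final long UUID_EPOCH_OFFSET = 122_192_928_000_000_000L;

    // Converts the 60-bit timestamp of a version 1 UUID to an Instant.
    // UUID.timestamp() throws UnsupportedOperationException for other versions.
    static Instant createdAt(UUID id) {
        long hundredNanos = id.timestamp() - UUID_EPOCH_OFFSET;
        return Instant.ofEpochMilli(hundredNanos / 10_000);
    }

    public static void main(String[] args) {
        UUID random = UUID.fromString("0c8aa0f6-9f6f-4fad-9662-1b683f2f4a0d");
        UUID timeBased = UUID.fromString("3770f4d0-88b3-11e6-bba6-005056bb68cb");
        System.out.println(random.version());     // 4
        System.out.println(timeBased.version());  // 1
        System.out.println(createdAt(timeBased)); // a moment in October 2016
    }
}
```

The last 48 bits (`005056bb68cb`) are identical in both time-based values because they identify the generating node.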

Page 37: Data Wars: The Bloody Enterprise strikes back

Identity Solution

• Use text representation for public identities (API)
• Database Sequences (Long)
• UUID + Database Sequences (Long)
• UUID (BigInteger/Binary)
• Twitter Snowflake (Long) – outdated
• UUID (String)
• *Flake (128 bit)

Page 38: Data Wars: The Bloody Enterprise strikes back

String

Page 39: Data Wars: The Bloody Enterprise strikes back

String

Java and encoding

JDBC drivers and DB types

Page 40: Data Wars: The Bloody Enterprise strikes back

String Encoding

Java uses UTF-16 for String encoding

UTF-16 has symbol range: 0x0000..0x10FFFF

String uses char[] (byte[] in JDK9)

Char has range: 0x0000..0xFFFF

Page 41: Data Wars: The Bloody Enterprise strikes back

String Encoding Quiz #1

How to represent range 0x10000..0x10FFFF using char?

Page 42: Data Wars: The Bloody Enterprise strikes back

String Encoding Quiz #1

• Define surrogate range: 0xD800..0xDFFF (0x800 characters)
• Split it equally into “High”: 0xD800..0xDBFF and “Low”: 0xDC00..0xDFFF
• Combine “High”-to-“Low” to get 0x400 * 0x400 = 0x100000 symbols
• Profit???
• Profit!!!
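The high/low split described above is plain arithmetic on the code point. A minimal sketch, with a hypothetical `Surrogates` class, checked against the JDK's own `Character.highSurrogate` and `Character.lowSurrogate`:

```java
class Surrogates {
    // Upper 10 bits of (codePoint - 0x10000), mapped into 0xD800..0xDBFF.
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Lower 10 bits of (codePoint - 0x10000), mapped into 0xDC00..0xDFFF.
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int codePoint = 0x10437;
        System.out.printf("%04X %04X%n", (int) high(codePoint), (int) low(codePoint));
        System.out.println(Character.toCodePoint(high(codePoint), low(codePoint)) == codePoint);
    }
}
```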

Page 43: Data Wars: The Bloody Enterprise strikes back

String Encoding Quiz #2

String x = new String(new char[]{'z', 0xD801, 0xDC37, 'a', 'b', 'c'});

System.out.println(x);
System.out.println(x.substring(0, 2));
System.out.println(x.substring(2));

Page 44: Data Wars: The Bloody Enterprise strikes back

String Encoding Quiz #2

z𐐷abc
z?
?abc
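The broken output comes from cutting the string between the two halves of a surrogate pair. One way to avoid it is to count code points instead of chars; the sketch below uses the standard `String.offsetByCodePoints` method inside a made-up `CodePointSubstring` helper.

```java
class CodePointSubstring {
    // Returns the first n symbols (code points), never splitting a surrogate pair.
    static String prefix(String s, int codePoints) {
        return s.substring(0, s.offsetByCodePoints(0, codePoints));
    }

    public static void main(String[] args) {
        String x = "z\uD801\uDC37abc";
        System.out.println(x.length());                      // 6 chars
        System.out.println(x.codePointCount(0, x.length())); // 5 symbols
        System.out.println(prefix(x, 2).length());           // 3 chars: 'z' plus the pair
    }
}
```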

Page 45: Data Wars: The Bloody Enterprise strikes back

String Encoding Solution

No solution, just be aware

Yet, it might be more sophisticated soon

Page 46: Data Wars: The Bloody Enterprise strikes back

String JDBC vs DB

Mapping DB specific types to JDBC

Some DB or driver exceptional cases

BLOB vs CLOB

Narrower DB encoding

Page 47: Data Wars: The Bloody Enterprise strikes back

String JDBC vs DB Mapping

public enum JDBCType implements SQLType {
    CHAR(Types.CHAR),
    VARCHAR(Types.VARCHAR),
    LONGVARCHAR(Types.LONGVARCHAR),
    ...
    BLOB(Types.BLOB),
    CLOB(Types.CLOB),
    ...
    NCHAR(Types.NCHAR),
    NVARCHAR(Types.NVARCHAR),
    LONGNVARCHAR(Types.LONGNVARCHAR),
    NCLOB(Types.NCLOB),

Page 48: Data Wars: The Bloody Enterprise strikes back

String JDBC vs DB Mapping

• CHARACTER [(len)] or CHAR [(len)]

• VARCHAR (len)

• BOOLEAN

• SMALLINT

• INTEGER or INT

• DECIMAL [(p[,s])] or DEC [(p[,s])]

• NUMERIC [(p[,s])]

• REAL

• FLOAT(p)

• DOUBLE PRECISION

• DATE

• TIME

• TIMESTAMP

• CLOB [(len)] or CHARACTER LARGE OBJECT [(len)] or CHAR LARGE OBJECT [(len)]

• BLOB [(len)] or BINARY LARGE OBJECT [(len)]

Page 49: Data Wars: The Bloody Enterprise strikes back

String JDBC vs DB Quiz #1

What is the JDBC type LONGVARCHAR?

Page 50: Data Wars: The Bloody Enterprise strikes back

String JDBC vs DB

Drill Down

setStringInternal(int var1, String var2) throws SQLException {
    ...
    int var6 = var2 != null ? var2.length() : 0;
    ...
    if (var6 <= this.maxVcsCharsSql) {
        this.basicBindString(var1, var2);
    } else if (var6 <= this.maxStreamNCharsSql) {
        this.setStringForClobCritical(var1, var2);
    } else {
        this.setStringForClobCritical(var1, var2);
    }

Page 51: Data Wars: The Bloody Enterprise strikes back

String JDBC vs DB

CHAR & VARCHAR

Page 52: Data Wars: The Bloody Enterprise strikes back

String JDBC vs DB

BLOB & CLOB

Binary Large OBject

• just bytes

Character Large OBject

• just characters in your DB encoding

Page 53: Data Wars: The Bloody Enterprise strikes back

String JDBC vs DB

Char & NChar

Char, Varchar, CLOB…

• use your DB encoding or no encoding

NChar, NVarchar, NCLOB…

• use the specified encoding

BLOB

• does not have NBLOB

Page 54: Data Wars: The Bloody Enterprise strikes back

String JDBC vs DB

Cp1251 vs Cp1252

Sometimes encoding does not matter much

Unless too smart drivers spoil it

Unless they are not compatible

Page 55: Data Wars: The Bloody Enterprise strikes back

String JDBC vs DB Solution

Check your DB encoding upfront

If needed use N* DB types and N* JDBC types as well

Page 56: Data Wars: The Bloody Enterprise strikes back

String JDBC vs DB Solution

Losing data because of encoding is lame

If you expect some strange strings coming use N* types

Never forget that a symbol is not a char or a byte; it may save you one day

Your JDBC driver can screw you
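Data loss through a narrow encoding can be reproduced without any database. In the sketch below ISO-8859-1 merely stands in for a narrow DB encoding (an assumption for illustration), and the `NarrowEncoding` class is hypothetical. The Cyrillic sample text is written with Unicode escapes so the source file encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NarrowEncoding {
    // Encodes and decodes with the same charset, as a store/load cycle would.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String cyrillic = "\u041F\u0440\u0438\u0432\u0435\u0442";
        // Unmappable characters are silently replaced with '?'.
        System.out.println(roundTrip(cyrillic, StandardCharsets.ISO_8859_1));
        // A wide enough encoding returns the text intact.
        System.out.println(roundTrip(cyrillic, StandardCharsets.UTF_8).equals(cyrillic));
        // One symbol, two chars, four bytes.
        System.out.println("\uD801\uDC37".getBytes(StandardCharsets.UTF_8).length);
    }
}
```

No exception is thrown in the lossy case, which is why checking the DB encoding upfront is cheaper than discovering the question marks later.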

Page 57: Data Wars: The Bloody Enterprise strikes back

Date

Page 58: Data Wars: The Bloody Enterprise strikes back

Date

Time zones

DST and leap miracles

Page 59: Data Wars: The Bloody Enterprise strikes back

Date Time Zone

Do the DB and App time zones match?

What can go wrong if they don’t?

Page 60: Data Wars: The Bloody Enterprise strikes back

Date Time Zone Quiz #1

• Database: Oracle 11g
• Database time zone: CET/CEST (+01:00/+02:00)
• Application: Java 8
• Application time zone: EET/EEST (+02:00/+03:00)
• setTimestamp(‘2016-10-14 15:35:01’)?
• getTimestamp()?

Page 61: Data Wars: The Bloody Enterprise strikes back

Date Time Zone

Quiz #1 Hint

final int oracleYear(int var1) {
    int var2 = ((this.rowSpaceByte[0 + var1] & 255) - 100) * 100 + (this.rowSpaceByte[1 + var1] & 255) - 100;
    return var2 <= 0 ? var2 + 1 : var2;
}

final int oracleMonth(int var1) { return this.rowSpaceByte[2 + var1] - 1; }
final int oracleDay(int var1) { return this.rowSpaceByte[3 + var1]; }
final int oracleHour(int var1) { return this.rowSpaceByte[4 + var1] - 1; }
final int oracleMin(int var1) { return this.rowSpaceByte[5 + var1] - 1; }
final int oracleSec(int var1) { return this.rowSpaceByte[6 + var1] - 1; }
final int oracleTZ1(int var1) { return this.rowSpaceByte[11 + var1]; }
final int oracleTZ2(int var1) { return this.rowSpaceByte[12 + var1]; }

Page 62: Data Wars: The Bloody Enterprise strikes back

Date Time Zone Quiz #1

Database

• 2016-10-14 15:35:01

Application

• 2016-10-14 15:35:01

Page 63: Data Wars: The Bloody Enterprise strikes back

Date Time Zone Quiz #1

Database

• 2016-10-14 15:35:01

Application with UTC time zone

• 2016-10-14 15:35:01
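The quiz result shows the same wall-clock fields on both sides, which looks harmless until the value is interpreted as a point in time. The sketch below shows how far apart the two interpretations are; `Europe/Berlin` and `Europe/Helsinki` are stand-ins chosen here for the CET/CEST and EET/EEST zones, and the `WallClock` class is hypothetical.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

class WallClock {
    // Interprets zone-less timestamp fields in the given time zone.
    static Instant instantIn(String localDateTime, String zoneId) {
        return LocalDateTime.parse(localDateTime)
                .atZone(ZoneId.of(zoneId))
                .toInstant();
    }

    public static void main(String[] args) {
        String stored = "2016-10-14T15:35:01";
        System.out.println(instantIn(stored, "Europe/Berlin"));   // database view
        System.out.println(instantIn(stored, "Europe/Helsinki")); // application view
    }
}
```

The two instants differ by one hour even though both sides display exactly the same digits.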

Page 64: Data Wars: The Bloody Enterprise strikes back

Date Time Zone Quiz #2

JavaScript client: time zone unknown

Java server: EET time zone

How to pass dates?

Page 65: Data Wars: The Bloody Enterprise strikes back

Date Time Zone Quiz #2

• Date… ehmm… JSON does not know what it is…
• Long is a bit of a problem for integer types limited to 53 bits (41 bits are used now; in ~140,000 years we will cross the border)
• String as ISO 8601 is a lesser evil

Page 66: Data Wars: The Bloody Enterprise strikes back

Date Time Zone Solution

Use the same App/DB time zone

Check your DB driver to ensure conversion safety

Store timestamps as long: DB and API

Store timestamps as String: API
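Converting between the two recommended representations takes one line each with `java.time`. This is a sketch only; the `IsoTimestamps` class and its method names are assumptions.

```java
import java.time.Instant;
import java.time.OffsetDateTime;

class IsoTimestamps {
    // Epoch milliseconds (DB side) to an ISO 8601 string in UTC (API side).
    static String toIso(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    // ISO 8601 string with any offset back to epoch milliseconds.
    static long toEpochMillis(String iso) {
        return OffsetDateTime.parse(iso).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toIso(1476448501000L));
        System.out.println(toEpochMillis("2016-10-14T15:35:01+03:00"));
    }
}
```

Because the string carries its own offset, the client's time zone no longer needs to be known by the server.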

Page 67: Data Wars: The Bloody Enterprise strikes back

Date DST & Magic

Missing and extra hours

Leap seconds

Page 68: Data Wars: The Bloody Enterprise strikes back

Date DST & Magic Calculations

24 hours in a day

60 minutes in an hour

60 seconds in a minute

FTW: 24 * 60 * 60 * 1000

Page 69: Data Wars: The Bloody Enterprise strikes back

Date DST & Magic Quiz #1

27.03.2016 00:00:00 - 28.03.2016 00:00:00 (EET/EEST)

Page 70: Data Wars: The Bloody Enterprise strikes back

Date DST & Magic Quiz #1

26.03.2016 22:00:00 - 27.03.2016 21:00:00 (UTC) – 23h
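The 23-hour day can be verified with `java.time`. In this sketch `Europe/Helsinki` is used as a representative EET/EEST zone (an assumption), and `DayLength` is a made-up helper class.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.ZoneId;

class DayLength {
    // Real elapsed hours between two consecutive local midnights.
    static long hoursInDay(String isoDate, String zoneId) {
        LocalDate date = LocalDate.parse(isoDate);
        ZoneId zone = ZoneId.of(zoneId);
        return Duration.between(
                date.atStartOfDay(zone),
                date.plusDays(1).atStartOfDay(zone)).toHours();
    }

    public static void main(String[] args) {
        System.out.println(hoursInDay("2016-03-27", "Europe/Helsinki")); // 23
        System.out.println(hoursInDay("2016-03-28", "Europe/Helsinki")); // 24
        System.out.println(hoursInDay("2016-10-30", "Europe/Helsinki")); // 25
    }
}
```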

Page 71: Data Wars: The Bloody Enterprise strikes back

Date DST & Magic Quiz #2

31.12.2016 00:00:00 – 01.01.2017 00:00:00 (EET/EEST)

01.01.2017 00:00:00 – 02.01.2017 00:00:00 (EET/EEST)

Page 72: Data Wars: The Bloody Enterprise strikes back

Date DST & Magic Quiz #2

30.12.2016 22:00:00 – 31.12.2016 22:00:00 (UTC) – 24h

31.12.2016 22:00:00 – 01.01.2017 22:00:00 (UTC) – 24h+1s

Page 73: Data Wars: The Bloody Enterprise strikes back

Date DST & Magic Quiz #2

It will happen: 31.12.2016 23:59:60 (UTC)

It had happened: 30.06.2015 23:59:60 (UTC)
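Note that `java.time` does not count leap seconds: its time scale defines every day as exactly 86400 seconds. The sketch below (with a hypothetical `LeapSecondBlind` class) measures the interval that contains the leap second of 31.12.2016.

```java
import java.time.Duration;
import java.time.Instant;

class LeapSecondBlind {
    // Elapsed seconds between two instants according to java.time.
    static long elapsedSeconds(String from, String to) {
        return Duration.between(Instant.parse(from), Instant.parse(to)).getSeconds();
    }

    public static void main(String[] args) {
        // Physically 24h + 1s elapsed, but the library reports 24h.
        System.out.println(elapsedSeconds("2016-12-31T22:00:00Z", "2017-01-01T22:00:00Z"));
    }
}
```

Code that needs the physically elapsed time across such a boundary cannot get it from these types alone.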

Blame the Earth, and Moon, and Sun

Blame software developers

Page 74: Data Wars: The Bloody Enterprise strikes back

Date One Last Thing

Date vs Instant

Date is a tuple of year, month, day, hour, etc.

Instant is a precise point on the timeline

Page 75: Data Wars: The Bloody Enterprise strikes back

Date One Last Thing

Date vs Instant

Date can be converted to Instant

Instant can be converted to Date

• even within a Chronology

However, “conversion rate” is not constant
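The non-constant "conversion rate" shows up as local date-times that map to zero, one, or two instants. A sketch using `java.time`, with `Europe/Helsinki` assumed as the EET/EEST zone and a made-up `LocalVsInstant` class:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;

class LocalVsInstant {
    // How many instants correspond to this local date-time in the zone.
    static int matchingInstants(String localDateTime, String zoneId) {
        return ZoneId.of(zoneId).getRules()
                .getValidOffsets(LocalDateTime.parse(localDateTime))
                .size();
    }

    // What java.time does when asked to convert anyway.
    static String resolve(String localDateTime, String zoneId) {
        return LocalDateTime.parse(localDateTime)
                .atZone(ZoneId.of(zoneId))
                .toOffsetDateTime()
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(matchingInstants("2016-03-27T03:30:00", "Europe/Helsinki")); // 0: skipped hour
        System.out.println(matchingInstants("2016-10-30T03:30:00", "Europe/Helsinki")); // 2: repeated hour
        System.out.println(resolve("2016-03-27T03:30:00", "Europe/Helsinki"));          // shifted past the gap
    }
}
```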

Page 76: Data Wars: The Bloody Enterprise strikes back

Date Summary

• Use UTC as much as possible
• Keep in mind the difference between Date and Instant
• Think of Date/Instant interoperation as if it was designed/used by idiots
• 24*60*60*1000 is, basically, a simplification. Quite harmful at times.
• Use proper date libraries – you wouldn’t want to reinvent them.
• GMT is not just another name for UTC, beware!

Page 77: Data Wars: The Bloody Enterprise strikes back

Thank you

?