Data Wars: The Bloody Enterprise strikes back


Victor Polischuk

@alkovictorvictor-cr

A long time ago in a galaxy far, far away…

It is GOOD

• when we have a lot of data
• when we have data several years old, the older the better

It is BAD when we have to remove historical data

It is BAD

• when we have a lot of code
• when we have code several years old, the older the worse

It is GOOD when we have to remove historical code

• Money, numbers, and arithmetic
• Identities
• Text data and Strings
• Date and time

Please Participate

Money

Money

Float & Double

Convert to Integer

Money Float & Double Problem

Developers usually have no idea how it is represented:

Money Float & Double Quiz #1

Float: 0.6 + 0.1 = ?

Double: 0.6 + 0.1 = ?

Money Float & Double Quiz #1

Float: 0.6 + 0.1 = 0.70000005

Double: 0.6 + 0.1 = 0.7

Money: Float & Double: stackoverflow.com

Money Float & Double Quiz #2

Float: 0.2 + 0.1 = ?

Double: 0.2 + 0.1 = ?

Money Float & Double Quiz #2

Float: 0.2 + 0.1 = 0.3

Double: 0.2 + 0.1 = 0.30000000000000004

Money Float & Double

Drill Down

• Binary representation: [sign] [exponent] [mantissa]
• Float: 1 bit, 8 bits, 23 bits
• Double: 1 bit, 11 bits, 52 bits
• Value:

0.1f = 0-01111011-10011001100110011001101
0.1f = +2^(-127+123) * (1 + 2^-1 + 2^-4 + 2^-5 + 2^-8 + 2^-9 + 2^-12 + 2^-13 + 2^-16 + 2^-17 + …)
0.1f = 2^-4 * (1 + 5033165 / 2^23) = 0.100000001490116119384765625
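The breakdown above can be checked from Java itself. This is a minimal sketch; the class name FloatBits and its helper methods are made up for illustration, only Float.floatToIntBits and BigDecimal come from the JDK.

```java
import java.math.BigDecimal;

final class FloatBits {
    // Raw IEEE 754 fields of a float: 1 sign bit, 8 exponent bits, 23 mantissa bits.
    static int sign(float f)     { return Float.floatToIntBits(f) >>> 31; }
    static int exponent(float f) { return (Float.floatToIntBits(f) >>> 23) & 0xFF; }
    static int mantissa(float f) { return Float.floatToIntBits(f) & 0x7FFFFF; }

    // Exact decimal expansion of the binary value that is really stored.
    static String exact(float f) { return new BigDecimal(f).toPlainString(); }

    public static void main(String[] args) {
        float f = 0.1f;
        System.out.println("sign=" + sign(f) + " exponent=" + exponent(f) + " mantissa=" + mantissa(f));
        System.out.println("stored value: " + exact(f));
        // The quiz sums, compared against the literal one would expect.
        System.out.println("0.6f + 0.1f == 0.7f : " + (0.6f + 0.1f == 0.7f));
        System.out.println("0.2f + 0.1f == 0.3f : " + (0.2f + 0.1f == 0.3f));
        System.out.println("0.6 + 0.1 == 0.7    : " + (0.6 + 0.1 == 0.7));
        System.out.println("0.2 + 0.1 == 0.3    : " + (0.2 + 0.1 == 0.3));
    }
}
```

The stored exponent is biased by 127, which is where the -127 in the formula comes from.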

Money Float & Double Example

Money Float & Double Quiz #3

+0.0f = 0-00000000-00000000000000000000000

-0.0f = 1-00000000-00000000000000000000000

+0.0f == -0.0f?

Money Float & Double

Drill Down

/**
 * Get or create float value for the given float.
 *
 * @param d the float
 * @return the value
 */
public static ValueFloat get(float d) {
    if (d == 1.0F) {
        return ONE;
    } else if (d == 0.0F) {
        // -0.0 == 0.0, and we want to return 0.0 for both
        return ZERO;
    }
    return (ValueFloat) Value.cache(new ValueFloat(d));
}
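Quiz #3 has more than one answer in Java, depending on how the comparison is made. A small sketch follows; the class and method names are illustrative only.

```java
final class SignedZero {
    // Primitive comparison follows IEEE 754: both zeros are equal.
    static boolean primitiveEquals(float a, float b) { return a == b; }

    // Float.equals compares bit patterns, so the zeros differ.
    static boolean boxedEquals(float a, float b) {
        return Float.valueOf(a).equals(Float.valueOf(b));
    }

    // Float.compare orders -0.0f before +0.0f.
    static int compare(float a, float b) { return Float.compare(a, b); }

    public static void main(String[] args) {
        System.out.println(primitiveEquals(+0.0f, -0.0f));
        System.out.println(boxedEquals(+0.0f, -0.0f));
        System.out.println(compare(+0.0f, -0.0f));
    }
}
```

This matters for anything that uses floats as map keys or sorts them: == and equals disagree on the two zeros.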

Money Float & Double Summary

Just never use it

Forget it exists

Unless you are working on a video codec

Money Convert to Integer

Multiply decimals up to integers

• (as a constant probably)

Keep the “scale” somewhere else

Money Convert to Integer Quiz #1

10 * 230 + 5 = ?

Money Convert to Integer Quiz #1

10 * 230 + 5 = 28, where 10 is 10%

Money Convert to Integer

Drill Down

@Embeddable
public class Amount implements Serializable {
    private int rate;
    @Transient
    private final int scale;

    public Amount() {
        scale = 6;
    }

    public Amount(int rate, int scale) {
        this.scale = scale;
        setRate(rate);
    }
    …

Money Convert to Integer Summary

• It is better to keep precision closer to the number
• It is better when arithmetic just works
• It is better when equals and compareTo work
• int <*/+> int can exceed int (same with long)
• Consistency is almost always above performance
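The "int * int can exceed int" point is easy to miss because Java wraps around silently. A minimal sketch, with a hypothetical class name and sample values chosen only to trigger the overflow:

```java
final class ScaledOverflow {
    // Plain int multiplication wraps around without any warning.
    static int wrapping(int a, int b) { return a * b; }

    // Math.multiplyExact throws ArithmeticException instead of wrapping.
    static boolean overflows(int a, int b) {
        try {
            Math.multiplyExact(a, b);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int scaledAmount = 2_000_000; // hypothetical amount already multiplied by its scale
        int quantity = 2_000;
        System.out.println("wrapped product: " + wrapping(scaledAmount, quantity));
        System.out.println("overflow detected: " + overflows(scaledAmount, quantity));
    }
}
```

Math.addExact and the long overloads behave the same way, so they can guard the other operations as well.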

Money Solution BigDecimal

Precision and accuracy are known and adjustable

Arithmetic is included

Supported by JDK, JDBC, etc.

Performance is quite nice
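A short sketch of the earlier quizzes redone with BigDecimal. The class MoneyDecimal, the percentOf helper, the scale of 2, and the HALF_EVEN rounding mode are assumptions made for the example, not something the slides prescribe.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class MoneyDecimal {
    // Takes a percentage of an amount and keeps two decimal places.
    static BigDecimal percentOf(BigDecimal amount, int percent) {
        return amount.multiply(BigDecimal.valueOf(percent))
                     .divide(BigDecimal.valueOf(100), 2, RoundingMode.HALF_EVEN);
    }

    public static void main(String[] args) {
        // 0.6 + 0.1 built from strings, so no binary rounding is involved.
        System.out.println(new BigDecimal("0.6").add(new BigDecimal("0.1")));

        // 10% of 230.00 plus 5.00, the "10 * 230 + 5" quiz with explicit scale.
        BigDecimal total = percentOf(new BigDecimal("230.00"), 10).add(new BigDecimal("5.00"));
        System.out.println(total.toPlainString());

        // equals() also compares scale; compareTo() compares the numeric value only.
        System.out.println(new BigDecimal("28.0").equals(new BigDecimal("28.00")));
        System.out.println(new BigDecimal("28.0").compareTo(new BigDecimal("28.00")));
    }
}
```

Constructing from String (or BigDecimal.valueOf) rather than from a double literal is what keeps the binary representation problem out of the picture.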

Identity

Identity

Unexpected Overflow

UUID

Identity Overflow

Integer and Long are finite types

Sometimes they can overflow

Moreover, they are usually half as large as you think

Identity Overflow Example

Identity Overflow Example

Identity Overflow Sucker Punch

• Also, “some languages” cannot work with integer types wider than 53 bits
• In addition, “some languages” work with custom 32-bit integer types
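The 53-bit limit comes from the IEEE 754 double mantissa. The sketch below imitates such a client by pushing a long through double; the class and method names are hypothetical.

```java
final class Beyond53Bits {
    // Largest integer that every double-based client can still represent exactly.
    static final long MAX_SAFE = (1L << 53) - 1;

    // Imitates a client whose only numeric type is an IEEE 754 double.
    static long throughDouble(long id) { return (long) (double) id; }

    // Text form survives any client, whatever its number type is.
    static String asText(long id) { return Long.toString(id); }

    public static void main(String[] args) {
        long id = (1L << 53) + 1;
        System.out.println("original:       " + id);
        System.out.println("through double: " + throughDouble(id));
        System.out.println("as text:        " + asText(id));
    }
}
```

This is one reason to expose identities as text in an API even when the database column is a long.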

Identity Overflow Summary

There is a difference between DB and API identity

• Always use integer types as identity for DB
• Always use text types as identity for API
• Avoid using 32-bit types as identity at all

Unless you are 99.9% sure

Identity UUID

• Are not guaranteed to be globally unique
• Not K-ordered
• In most cases are excessively big (128 bits)
• Can cause serious performance degradation
• Have different versions, which may suit better or worse
• Strangely enough, RDBMSs rarely support UUID/GUID data types
• Weird:
  • Time is based on 100-nanosecond intervals since 15 October 1582
  • Were invented/published around 1999

Identity UUID Store as String

UUID is a 128-bit value

• 16 bytes

A96A0D4C-49D0-4431-B126-4C66688ADEF3

• 36 symbols, which is more than 2 times bigger
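A sketch of keeping the 16-byte form instead of the 36-character one, for example for a binary column. UuidBytes and its two helpers are illustrative names, not an existing API.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

final class UuidBytes {
    // Packs the two 64-bit halves into 16 bytes, most significant half first.
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    // Restores the UUID from the 16-byte form produced by toBytes.
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long high = buf.getLong();
        long low = buf.getLong();
        return new UUID(high, low);
    }

    public static void main(String[] args) {
        UUID id = UUID.fromString("A96A0D4C-49D0-4431-B126-4C66688ADEF3");
        System.out.println("text length:   " + id.toString().length());
        System.out.println("binary length: " + toBytes(id).length);
        System.out.println("round trip ok: " + id.equals(fromBytes(toBytes(id))));
    }
}
```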

Identity UUID Drill Down

High long: time_low (32 bits), time_mid (16 bits), version (4 bits), time_hi (12 bits)

Low long: variant (4 bits), clock_seq (12 bits), node (48 bits)

Identity UUID Example

$ uuidgen
0c8aa0f6-9f6f-4fad-9662-1b683f2f4a0d
$ uuidgen
1ee09695-3a04-4e7a-8bab-e67dabc4b5a2
$ uuidgen -t
3770f4d0-88b3-11e6-bba6-005056bb68cb
$ uuidgen -t
3b14dd54-88b3-11e6-8c53-005056bb68cb
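The version and, for time-based UUIDs, the creation time can be read back from the values above. A minimal sketch; UuidTime and timeOf are hypothetical names, while the offset constant is the number of 100-nanosecond intervals between 15 October 1582 and the Unix epoch.

```java
import java.time.Instant;
import java.util.UUID;

final class UuidTime {
    // 100-ns intervals between 1582-10-15T00:00:00Z and 1970-01-01T00:00:00Z.
    static final long GREGORIAN_OFFSET = 0x01B21DD213814000L;

    // Works only for version 1 UUIDs; UUID.timestamp() throws
    // UnsupportedOperationException for any other version.
    static Instant timeOf(UUID id) {
        long hundredNanos = id.timestamp() - GREGORIAN_OFFSET;
        return Instant.ofEpochSecond(hundredNanos / 10_000_000L,
                                     (hundredNanos % 10_000_000L) * 100L);
    }

    public static void main(String[] args) {
        UUID random = UUID.fromString("0c8aa0f6-9f6f-4fad-9662-1b683f2f4a0d");
        UUID first = UUID.fromString("3770f4d0-88b3-11e6-bba6-005056bb68cb");
        UUID second = UUID.fromString("3b14dd54-88b3-11e6-8c53-005056bb68cb");
        System.out.println("versions: " + random.version() + ", " + first.version());
        System.out.println("created:  " + timeOf(first) + " and " + timeOf(second));
    }
}
```

Note that the time-based pair shares the same node suffix, which is what makes version 1 values partly predictable.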

Identity Solution

• Use text representation for public identities (API)
• Database Sequences (Long)
• UUID + Database Sequences (Long)
• UUID (BigInteger/Binary)
• Twitter Snowflake (Long), outdated
• UUID (String)
• *Flake (128 bit)

String

String

Java and encoding

JDBC drivers and DB types

String Encoding

Java uses UTF-16 for String encoding

UTF-16 has symbol range: 0x0000..0x10FFFF

String uses char[] (byte[] in JDK9)

Char has range: 0x0000..0xFFFF

String Encoding Quiz #1

How to represent range 0x10000..0x10FFFF using char?

String Encoding Quiz #1

• Define surrogate range: 0xD800..0xDFFF (0x800 characters)
• Split it equally to “High”: 0xD800..0xDBFF and “Low”: 0xDC00..0xDFFF
• Combine “High”-to-“Low” to get 0x400 * 0x400 = 0x100000 symbols
• Profit???
• Profit!!!
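The split can be written out as plain arithmetic. This sketch assumes a code point in the supplementary range 0x10000..0x10FFFF and uses made-up helper names; the JDK offers the same result through Character.highSurrogate and Character.lowSurrogate.

```java
final class Surrogates {
    // Top 10 bits of (codePoint - 0x10000) select one of 0x400 high surrogates.
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Bottom 10 bits select one of 0x400 low surrogates.
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int codePoint = 0x10437;
        System.out.printf("%04X %04X%n", (int) high(codePoint), (int) low(codePoint));
        System.out.println(high(codePoint) == Character.highSurrogate(codePoint));
        System.out.println(low(codePoint) == Character.lowSurrogate(codePoint));
    }
}
```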

String Encoding Quiz #2

String x = new String(new char[]{'z', 0xD801, 0xDC37, 'a', 'b', 'c'});

System.out.println(x);
System.out.println(x.substring(0, 2));
System.out.println(x.substring(2));

String Encoding Quiz #2

z𐐷abc
z?
?abc
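The broken output comes from cutting between the two halves of a surrogate pair. One way to avoid it is to count code points instead of chars; substringByCodePoints below is a hypothetical helper built on String.offsetByCodePoints.

```java
final class CodePointSubstring {
    // Like String.substring, but begin and end count code points, not chars.
    static String substringByCodePoints(String s, int begin, int end) {
        int from = s.offsetByCodePoints(0, begin);
        int to = s.offsetByCodePoints(from, end - begin);
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        String x = new String(new char[]{'z', 0xD801, 0xDC37, 'a', 'b', 'c'});
        System.out.println("chars:       " + x.length());
        System.out.println("code points: " + x.codePointCount(0, x.length()));
        System.out.println(substringByCodePoints(x, 0, 2));
        System.out.println(substringByCodePoints(x, 2, 5));
    }
}
```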

String Encoding Solution

No solution, just be aware

Yet, it might be more sophisticated soon

String JDBC vs DB

Mapping DB specific types to JDBC

Some DB or driver exceptional cases

BLOB vs CLOB

Narrower DB encoding

String JDBC vs DB Mapping

public enum JDBCType implements SQLType {
    CHAR(Types.CHAR),
    VARCHAR(Types.VARCHAR),
    LONGVARCHAR(Types.LONGVARCHAR),
    ...
    BLOB(Types.BLOB),
    CLOB(Types.CLOB),
    ...
    NCHAR(Types.NCHAR),
    NVARCHAR(Types.NVARCHAR),
    LONGNVARCHAR(Types.LONGNVARCHAR),
    NCLOB(Types.NCLOB),

String JDBC vs DB Mapping

• CHARACTER [(len)] or CHAR [(len)]
• VARCHAR (len)
• BOOLEAN
• SMALLINT
• INTEGER or INT
• DECIMAL [(p[,s])] or DEC [(p[,s])]
• NUMERIC [(p[,s])]
• REAL
• FLOAT(p)
• DOUBLE PRECISION
• DATE
• TIME
• TIMESTAMP
• CLOB [(len)] or CHARACTER LARGE OBJECT [(len)] or CHAR LARGE OBJECT [(len)]
• BLOB [(len)] or BINARY LARGE OBJECT [(len)]

String JDBC vs DB Quiz #1

What is JDBC type LONGVARCHAR?

String JDBC vs DB

Drill Down

setStringInternal(int var1, String var2) throws SQLException {
    ...
    int var6 = var2 != null ? var2.length() : 0;
    ...
    if (var6 <= this.maxVcsCharsSql) {
        this.basicBindString(var1, var2);
    } else if (var6 <= this.maxStreamNCharsSql) {
        this.setStringForClobCritical(var1, var2);
    } else {
        this.setStringForClobCritical(var1, var2);
    }

String JDBC vs DB

CHAR & VARCHAR

String JDBC vs DB

BLOB & CLOB

Binary Large OBject

• just bytes

Character Large OBject

• just characters in your DB encoding

String JDBC vs DB

Char & NChar

Char, Varchar, CLOB…

• uses your DB encoding or no encoding

NChar, NVarchar, NCLOB…

• uses specified encoding

BLOB

• does not have NBLOB

String JDBC vs DB

Cp1251 vs Cp1252

Sometimes encoding does not matter much

Unless too smart drivers spoil it

Unless they are not compatible

String JDBC vs DB Solution

Check your DB encoding upfront

If needed use N* DB types and N* JDBC types as well

String JDBC vs DB Solution

Losing data because of encoding is lame

If you expect some strange strings coming use N* types

Never forget that a symbol is not a char or a byte; it may save you one day

Your JDBC driver can screw you
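What "losing data because of encoding" looks like can be reproduced without a database. In this sketch ISO-8859-1 stands in for a narrow DB encoding, which is an assumption for the example; the slides talk about Cp1251 and Cp1252, and the class and method names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class NarrowEncoding {
    // Encodes and decodes with the same charset, as a narrow column would.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    // Tells upfront whether every character can be represented in the charset.
    static boolean survives(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String cyrillic = "\u041F\u0440\u0438\u0432\u0435\u0442"; // a six-letter Cyrillic word
        System.out.println(roundTrip(cyrillic, StandardCharsets.ISO_8859_1));
        System.out.println(roundTrip(cyrillic, StandardCharsets.UTF_8).equals(cyrillic));
        System.out.println(survives(cyrillic, StandardCharsets.ISO_8859_1));
    }
}
```

A canEncode check like this is cheap enough to run before a write when the target encoding is known.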

Date

Date

Time zones

DST and leap miracles

Date Time Zone

Do DB and App time zones match?

What can go wrong if they don’t?

Date Time Zone Quiz #1

• Database: Oracle 11g
• Database time zone: CET/CEST (+01:00/+02:00)
• Application: Java 8
• Application time zone: EET/EEST (+02:00/+03:00)
• setTimestamp(‘2016-10-14 15:35:01’)?
• getTimestamp()?

Date Time Zone Quiz #1 Hint

final int oracleYear(int var1) {
    int var2 = ((this.rowSpaceByte[0 + var1] & 255) - 100) * 100
            + (this.rowSpaceByte[1 + var1] & 255) - 100;
    return var2 <= 0 ? var2 + 1 : var2;
}

final int oracleMonth(int var1) { return this.rowSpaceByte[2 + var1] - 1; }
final int oracleDay(int var1) { return this.rowSpaceByte[3 + var1]; }
final int oracleHour(int var1) { return this.rowSpaceByte[4 + var1] - 1; }
final int oracleMin(int var1) { return this.rowSpaceByte[5 + var1] - 1; }
final int oracleSec(int var1) { return this.rowSpaceByte[6 + var1] - 1; }
final int oracleTZ1(int var1) { return this.rowSpaceByte[11 + var1]; }
final int oracleTZ2(int var1) { return this.rowSpaceByte[12 + var1]; }

Date Time Zone Quiz #1

Database

• 2016-10-14 15:35:01

Application

• 2016-10-14 15:35:01

Date Time Zone Quiz #1

Database

• 2016-10-14 15:35:01

Application with UTC time zone

• 2016-10-14 15:35:01
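The fields come back unchanged because only year, month, day, and so on are stored. The catch is that the same fields mean different points in time in different zones. A sketch with java.time follows; Europe/Paris and Europe/Helsinki are stand-ins chosen here for CET/CEST and EET/EEST, and the class name is hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

final class WallClockVsInstant {
    // Interprets the same wall-clock fields in the given zone.
    static Instant at(LocalDateTime wallClock, String zone) {
        return wallClock.atZone(ZoneId.of(zone)).toInstant();
    }

    public static void main(String[] args) {
        LocalDateTime stored = LocalDateTime.of(2016, 10, 14, 15, 35, 1);
        Instant inDbZone = at(stored, "Europe/Paris");     // assumed DB zone (CET/CEST)
        Instant inAppZone = at(stored, "Europe/Helsinki"); // assumed App zone (EET/EEST)
        System.out.println("DB reading:  " + inDbZone);
        System.out.println("App reading: " + inAppZone);
        System.out.println("difference:  " + Duration.between(inAppZone, inDbZone));
    }
}
```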

Date Time Zone Quiz #2

JavaScript client: time zone unknown

Java server: EET time zone

How to pass dates?

Date Time Zone Quiz #2

• Date… ehmm… JSON does not know what it is…
• Long is a bit of a problem for integer types that cannot go past 53 bits (41 bits are needed now; in ~140,000 years we will cross the border)
• String as ISO 8601 is a lesser evil
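A minimal sketch of the String option using java.time. The class and helper names are made up; Instant.toString and Instant.parse both use the ISO 8601 UTC form.

```java
import java.time.Instant;
import java.time.OffsetDateTime;

final class IsoTimestamps {
    // Instant.toString() produces ISO 8601 in UTC, e.g. 2016-10-14T12:35:01Z.
    static String toWire(Instant t) { return t.toString(); }

    // Parses the UTC form back into a point on the timeline.
    static Instant fromWire(String s) { return Instant.parse(s); }

    // Number of bits needed to hold a non-negative value.
    static int bitsNeeded(long value) { return 64 - Long.numberOfLeadingZeros(value); }

    public static void main(String[] args) {
        Instant t = Instant.ofEpochMilli(1476448501000L);
        System.out.println(toWire(t));
        System.out.println(fromWire(toWire(t)).equals(t));
        // A client may send a local offset; it still maps to the same instant.
        System.out.println(OffsetDateTime.parse("2016-10-14T15:35:01+03:00").toInstant().equals(t));
        System.out.println("bits for epoch millis: " + bitsNeeded(t.toEpochMilli()));
    }
}
```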

Date Time Zone Solution

Use the same App/DB time zone

Check your DB driver to ensure conversion safety

Store timestamps as long: DB and API

Store timestamps as String: API

Date DST & Magic

Missing and extra hours

Leap seconds

Date DST & Magic Calculations

24 hours in a day

60 minutes in an hour

60 seconds in a minute

FTW: 24 * 60 * 60 * 1000

Date DST & Magic Quiz #1

27.03.2016 00:00:00 - 28.03.2016 00:00:00 (EET/EEST)

Date DST & Magic Quiz #1

26.03.2016 22:00:00 - 27.03.2016 21:00:00 (UTC) – 23h
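The 23-hour day can be reproduced with java.time instead of multiplying constants. Europe/Helsinki is used here as an assumed EET/EEST zone, and lengthOfDay is a hypothetical helper.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;

final class DayLength {
    // Real elapsed time between two consecutive local midnights in a zone.
    static Duration lengthOfDay(LocalDate day, ZoneId zone) {
        return Duration.between(day.atStartOfDay(zone), day.plusDays(1).atStartOfDay(zone));
    }

    public static void main(String[] args) {
        ZoneId eet = ZoneId.of("Europe/Helsinki"); // assumed EET/EEST zone
        System.out.println(lengthOfDay(LocalDate.of(2016, 3, 27), eet));   // DST starts
        System.out.println(lengthOfDay(LocalDate.of(2016, 10, 30), eet));  // DST ends
        System.out.println(lengthOfDay(LocalDate.of(2016, 3, 27), ZoneOffset.UTC));
    }
}
```

Leap seconds are a separate matter: java.time does not model them as a 61st second, so this sketch only covers the DST case.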

Date DST & Magic Quiz #2

31.12.2016 00:00:00 – 01.01.2017 00:00:00 (EET/EEST)

01.01.2017 00:00:00 – 02.01.2017 00:00:00 (EET/EEST)

Date DST & Magic Quiz #2

30.12.2016 22:00:00 – 31.12.2016 22:00:00 (UTC) – 24h

31.12.2016 22:00:00 – 01.01.2017 22:00:00 (UTC) – 24h+1s

Date DST & Magic Quiz #2

It will happen: 31.12.2016 23:59:60 (UTC)

It had happened: 30.06.2015 23:59:60 (UTC)

Blame the Earth, and Moon, and Sun

Blame software developers

Date One Last Thing

Date vs Instant

Date is a tuple of year, month, day, hour, etc.

Instant is a precise point on the timeline

Date One Last Thing

Date vs Instant

Date can be converted to Instant

Instant can be converted to Date

• even within a Chronology

However, “conversion rate” is not constant

Date Summary

• Use UTC as much as possible
• Keep in mind the difference between Date and Instant
• Think of Date/Instant interoperation as if it was designed/used by idiots
• 24*60*60*1000 is, basically, a simplification. Quite harmful at times.
• Use proper date libraries; you wouldn’t want to reinvent them.
• GMT is not just another name for UTC, beware!

Thank you

?
