DataWars
The Bloody Enterprise strikes back
Victor Polischuk
@alkovictor
victor-cr
Long time ago in a galaxy far,
far away…
It is GOOD
when we have a lot of
data
when we have data
several years old, the older
the better
It is BAD
when we have to remove
historical data
It is BAD
when we have a lot of
code
when we have code
several years old, the older
the worse
It is GOOD
when we have to remove
historical code
Money, numbers, and arithmetic
Identities
Text data and Strings
Date and time
Please Participate
Money
Money
Float & Double
Convert to Integer
Money Float & Double Problem
Developers usually have no idea how it is represented:
Money Float & Double Quiz #1
Float: 0.6 + 0.1
• = ?
Double: 0.6 + 0.1
• = ?
Money Float & Double Quiz #1
Float: 0.6 + 0.1
• = 0.70000005
Double: 0.6 + 0.1
• = 0.7
Money: Float & Double: stackoverflow.com
Money Float & Double Quiz #2
Float: 0.2 + 0.1
• = ?
Double: 0.2 + 0.1
• = ?
Money Float & Double Quiz #2
Float: 0.2 + 0.1
• = 0.3
Double: 0.2 + 0.1
• = 0.30000000000000004
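The quiz answers are easy to reproduce. The sketch below is only an illustration (the class name `FloatDoubleQuiz` and the helper `exact` are made up for this example); it prints the four sums and the exact decimal value that a binary floating-point number really stores.

```java
import java.math.BigDecimal;

final class FloatDoubleQuiz {
    // Exact decimal expansion of the binary value actually stored in a double.
    // A float argument is widened to double first, which does not change its value.
    static String exact(double value) {
        return new BigDecimal(value).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(0.6f + 0.1f); // Quiz #1, float
        System.out.println(0.6 + 0.1);   // Quiz #1, double
        System.out.println(0.2f + 0.1f); // Quiz #2, float
        System.out.println(0.2 + 0.1);   // Quiz #2, double
        System.out.println(exact(0.1f)); // what 0.1f really holds
    }
}
```

Neither type is "more right": which sums happen to print nicely depends on how the rounding errors of the operands combine.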
Money Float & Double
Drill Down
• Binary representation: [sign] [exponent] [mantissa]
• Float: 1 bit, 8 bits, 23 bits
• Double: 1 bit, 11 bits, 52 bits
• Value:
0.1f = 0-01111011-10011001100110011001101
0.1f = +2^(-127+123) * (1 + 2^-1 + 2^-4 + 2^-5 + 2^-8 + 2^-9 + 2^-12 + 2^-13 + 2^-16 + 2^-17 …)
0.1f = 2^-4 * (1 + 5033165 / 2^23) = 0.100000001490116119384765625
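The sign-exponent-mantissa breakdown can be checked with `Float.floatToIntBits`. This is a minimal sketch for normal (non-zero, non-subnormal) floats; the class and method names are illustrative only.

```java
final class FloatBits {
    // Formats a float as sign-exponent-mantissa, the notation used on the slide.
    static String layout(float value) {
        int bits = Float.floatToIntBits(value);
        String all = String.format("%32s", Integer.toBinaryString(bits)).replace(' ', '0');
        return all.substring(0, 1) + "-" + all.substring(1, 9) + "-" + all.substring(9);
    }

    // Rebuilds a normal float from its fields: sign * 2^(e - 127) * (1 + m / 2^23).
    static double rebuild(float value) {
        int bits = Float.floatToIntBits(value);
        int sign = (bits >>> 31) == 0 ? 1 : -1;
        int exponent = (bits >>> 23) & 0xFF;   // 8 bits, biased by 127
        int mantissa = bits & 0x7FFFFF;        // 23 bits, implicit leading 1
        return sign * Math.pow(2, exponent - 127) * (1 + mantissa / (double) (1 << 23));
    }

    public static void main(String[] args) {
        System.out.println(layout(0.1f));
        System.out.println(rebuild(0.1f) == (double) 0.1f);
    }
}
```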
Money Float & Double Example
Money Float & Double Quiz #3
+0.0f = 0-00000000-00000000000000000000000
-0.0f = 1-00000000-00000000000000000000000
+0.0f == -0.0f?
Money Float & Double
Drill Down
/**
 * Get or create float value for the given float.
 *
 * @param d the float
 * @return the value
 */
public static ValueFloat get(float d) {
    if (d == 1.0F) {
        return ONE;
    } else if (d == 0.0F) {
        // -0.0 == 0.0, and we want to return 0.0 for both
        return ZERO;
    }
    return (ValueFloat) Value.cache(new ValueFloat(d));
}
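The answer to Quiz #3 depends on how you ask. A small sketch (names are illustrative) of the three comparisons Java offers for the two zeros:

```java
final class SignedZero {
    // Primitive comparison follows IEEE 754: both zeros are equal.
    static boolean primitiveEqual(float a, float b) {
        return a == b;
    }

    // Float.equals compares bit patterns, so the sign bit matters.
    static boolean boxedEqual(float a, float b) {
        return Float.valueOf(a).equals(Float.valueOf(b));
    }

    // Float.compare defines a total order in which -0.0f sorts before 0.0f.
    static int ordered(float a, float b) {
        return Float.compare(a, b);
    }

    public static void main(String[] args) {
        System.out.println(primitiveEqual(0.0f, -0.0f));
        System.out.println(boxedEqual(0.0f, -0.0f));
        System.out.println(ordered(0.0f, -0.0f));
        System.out.println(1.0f / 0.0f + " vs " + 1.0f / -0.0f); // the sign still leaks out
    }
}
```

So a value used as a map key or in `equals` behaves differently from the same value compared with `==`.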
Money Float & Double Summary
Just never use it
Forget it exists
Unless you are working on a video codec
Money Convert to Integer
Multiply decimals up to integers
• (as a constant probably)
Keep the “scale” somewhere else
Money Convert to Integer Quiz #1
10 * 230 + 5
• = ?
Money Convert to Integer Quiz #1
10 * 230 + 5
• = 28, where 10 is 10%
Money Convert to Integer
Drill Down
@Embeddable
public class Amount implements Serializable {
    private int rate;
    @Transient
    private final int scale;

    public Amount() {
        scale = 6;
    }

    public Amount(int rate, int scale) {
        this.scale = scale;
        setRate(rate);
    }
    …
Money Convert to Integer Summary
• It is better to keep precision closer to the number
• It is better when arithmetic just works
• It is better when equals and compareTo work
• int <*/+> int can exceed int (same with long)
• Consistency is almost always above performance
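The overflow point deserves a demonstration, because plain `int` arithmetic fails silently. A minimal sketch (class and method names are illustrative) contrasting the `*` operator with `Math.multiplyExact`:

```java
final class ScaledOverflow {
    // Plain multiplication wraps around silently on overflow.
    static int unchecked(int a, int b) {
        return a * b;
    }

    // Math.multiplyExact throws ArithmeticException instead of wrapping.
    static boolean overflows(int a, int b) {
        try {
            Math.multiplyExact(a, b);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // 3,000,000 units scaled by 1,000 no longer fits into an int.
        System.out.println(unchecked(3_000_000, 1_000));
        System.out.println(overflows(3_000_000, 1_000));
    }
}
```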
Money Solution BigDecimal
Precision and accuracy are known and adjustable
Arithmetic is included
Supported by JDK, JDBC, etc.
Performance is quite nice
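As an illustration, here is the earlier "10% of 230 plus 5" quiz redone with `BigDecimal`. This is a sketch only: the class name, the two-decimal scale and the choice of `HALF_EVEN` rounding are assumptions made for the example, not something the deck prescribes.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class MoneyMath {
    // 10% of 230 plus 5, with the scale carried by the numbers themselves.
    static BigDecimal quiz() {
        BigDecimal rate = new BigDecimal("0.10");
        return new BigDecimal("230").multiply(rate).add(new BigDecimal("5"));
    }

    // Rounds an amount to two decimal places; the rounding mode is explicit.
    static BigDecimal toCents(BigDecimal amount) {
        return amount.setScale(2, RoundingMode.HALF_EVEN);
    }

    public static void main(String[] args) {
        System.out.println(quiz());                               // scale 2 is preserved
        System.out.println(toCents(new BigDecimal("2.345")));     // tie goes to the even digit
        System.out.println(new BigDecimal("0.6").add(new BigDecimal("0.1")));
    }
}
```

One thing to keep in mind: `BigDecimal.equals` also compares scale (28.00 is not `equals` to 28), while `compareTo` compares only the numeric value.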
identity
Identity
Unexpected Overflow
UUID
Identity Overflow
Integer and Long are finite types
Sometimes they can overflow
Moreover, they are usually half as large as you think
Identity Overflow Example
Identity Overflow Example
Identity Overflow Sucker Punch
• Also, “some languages” cannot work with integer types wider than 53 bits
• In addition, “some languages” work with custom 32-bit integer types
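The 53-bit limit comes from clients that keep every number in an IEEE 754 double, which has a 53-bit significand. A hedged sketch of what happens to a `long` identity on such a client (the class and method names are illustrative):

```java
final class FiftyThreeBits {
    // Largest integer magnitude a double holds without gaps (2^53 - 1).
    static final long MAX_SAFE = (1L << 53) - 1;

    // True when a double-based client can represent the id exactly.
    static boolean isSafeForDoubleClients(long id) {
        return id >= -MAX_SAFE && id <= MAX_SAFE;
    }

    // What such a client would see after parsing the id as a number.
    static long asSeenThroughDouble(long id) {
        return (long) (double) id;
    }

    public static void main(String[] args) {
        long id = (1L << 53) + 1;
        System.out.println(id + " -> " + asSeenThroughDouble(id));
        System.out.println(isSafeForDoubleClients(id));
    }
}
```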
Identity Overflow Summary
There is a difference between DB and API identity
• Always use integer types as identity for DB
• Always use text types as identity for API
• Avoid using 32-bit types as identity at all
Unless you are 99.9% sure
Identity UUID
• Are not guaranteed to be globally unique
• Are not K-ordered
• In most cases are excessively big (128 bits)
• Can cause serious performance degradation
• Have different versions, which may suit better or worse
• Strangely enough, RDBMSs rarely support UUID/GUID data types
• Weird:
• Time is based on 100-nanosecond intervals since 15 October 1582
• Were invented/published around 1999
Identity UUID Store as String
UUID is a 128-bit value
• 16 bytes
A96A0D4C-49D0-4431-B126-4C66688ADEF3
• 36 symbols – which is more than 2 times bigger
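If the database has no UUID type, the 16-byte form is still available through a binary column. A minimal sketch of the conversion using only `java.util.UUID` and `ByteBuffer` (the class name is illustrative, and big-endian byte order is an assumption of this example):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

final class UuidBytes {
    // Packs the two 64-bit halves into 16 bytes, high half first.
    static byte[] toBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    // Reverses toBytes.
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        return new UUID(buffer.getLong(), buffer.getLong());
    }

    public static void main(String[] args) {
        UUID uuid = UUID.fromString("A96A0D4C-49D0-4431-B126-4C66688ADEF3");
        System.out.println(uuid.toString().length() + " chars vs " + toBytes(uuid).length + " bytes");
    }
}
```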
Identity UUID Drill Down
High long
• time_low: 32 bits
• time_mid: 16 bits
• version: 4 bits
• time_hi: 12 bits
Low long
• variant: 4 bits
• clock_seq: 12 bits
• node: 48 bits
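The field layout maps directly onto shifts and masks over the two longs of `java.util.UUID`. The sketch below is illustrative (the class and method names are made up) and uses a time-based UUID as sample input.

```java
import java.util.UUID;

final class UuidFields {
    static long timeLow(UUID u) { return u.getMostSignificantBits() >>> 32; }
    static long timeMid(UUID u) { return (u.getMostSignificantBits() >>> 16) & 0xFFFF; }
    static int version(UUID u)  { return (int) ((u.getMostSignificantBits() >>> 12) & 0xF); }
    static long timeHi(UUID u)  { return u.getMostSignificantBits() & 0xFFF; }
    static long node(UUID u)    { return u.getLeastSignificantBits() & 0xFFFFFFFFFFFFL; }

    public static void main(String[] args) {
        UUID u = UUID.fromString("3770f4d0-88b3-11e6-bba6-005056bb68cb");
        System.out.printf("time_low=%x time_mid=%x version=%d time_hi=%x node=%012x%n",
                timeLow(u), timeMid(u), version(u), timeHi(u), node(u));
    }
}
```

For version 1 UUIDs the built-in `UUID.version()`, `UUID.timestamp()` and `UUID.node()` return the same information without manual masking.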
Identity UUID Example
$ uuidgen
0c8aa0f6-9f6f-4fad-9662-1b683f2f4a0d
$ uuidgen
1ee09695-3a04-4e7a-8bab-e67dabc4b5a2
$ uuidgen -t
3770f4d0-88b3-11e6-bba6-005056bb68cb
$ uuidgen -t
3b14dd54-88b3-11e6-8c53-005056bb68cb
Identity Solution
• Use text representation for public identities (API)
• Database Sequences (Long)
• UUID + Database Sequences (Long)
• UUID (BigInteger/Binary)
• Twitter Snowflake (Long) – outdated
• UUID (String)
• *Flake (128 bit)
String
String
Java and encoding
JDBC drivers and DB types
String Encoding
Java uses UTF-16 for String encoding
UTF-16 has symbol range: 0x0000..0x10FFFF
String uses char[] (byte[] in JDK9)
Char has range: 0x0000..0xFFFF
String Encoding Quiz #1
How to represent range
0x10000..0x10FFFF using char?
String Encoding Quiz #1
• Define surrogate range: 0xD800..0xDFFF (0x800 characters)
• Split it equally into “High”: 0xD800..0xDBFF and “Low”: 0xDC00..0xDFFF
• Combine “High”-to-“Low” to get 0x400 * 0x400 = 0x100000 symbols
• Profit???
• Profit!!!
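The arithmetic behind the high/low split is short enough to write out. This sketch (illustrative names) does by hand what `Character.toChars` and `Character.toCodePoint` already do in the JDK:

```java
final class Surrogates {
    // Splits a supplementary code point (0x10000..0x10FFFF) into its two chars.
    static char[] split(int codePoint) {
        int offset = codePoint - 0x10000;               // 20 bits remain
        char high = (char) (0xD800 + (offset >>> 10));  // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    // Reverses the split.
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = split(0x10437);
        System.out.printf("%04X %04X -> %X%n", (int) pair[0], (int) pair[1], combine(pair[0], pair[1]));
    }
}
```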
String Encoding Quiz #2
String x = new String(new char[]{'z', 0xD801, 0xDC37, 'a', 'b', 'c'});
System.out.println(x);
System.out.println(x.substring(0, 2));
System.out.println(x.substring(2));
String Encoding Quiz #2
z𐐷abc
z?
?abc
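The broken output comes from `substring` counting chars, not symbols. One way to stay out of trouble is to index by code points; the helper below is a hypothetical sketch built on `String.offsetByCodePoints`, not a JDK method.

```java
final class CodePointSubstring {
    // Like substring, but both indices count code points instead of chars.
    static String substring(String s, int beginCp, int endCp) {
        int begin = s.offsetByCodePoints(0, beginCp);
        int end = s.offsetByCodePoints(begin, endCp - beginCp);
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        String x = new String(new char[]{'z', 0xD801, 0xDC37, 'a', 'b', 'c'});
        System.out.println(x.length() + " chars, " + x.codePointCount(0, x.length()) + " symbols");
        System.out.println(substring(x, 0, 2)); // keeps the surrogate pair intact
        System.out.println(substring(x, 2, 5));
    }
}
```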
String Encoding Solution
No solution, just be aware
Yet, it might be more sophisticated soon
String JDBC vs DB
Mapping DB specific types to JDBC
Some DB or driver exceptional cases
BLOB vs CLOB
Narrower DB encoding
String JDBC vs DB Mapping
public enum JDBCType implements SQLType {
    CHAR(Types.CHAR),
    VARCHAR(Types.VARCHAR),
    LONGVARCHAR(Types.LONGVARCHAR),
    ...
    BLOB(Types.BLOB),
    CLOB(Types.CLOB),
    ...
    NCHAR(Types.NCHAR),
    NVARCHAR(Types.NVARCHAR),
    LONGNVARCHAR(Types.LONGNVARCHAR),
    NCLOB(Types.NCLOB),
String JDBC vs DB Mapping
• CHARACTER [(len)] or CHAR [(len)]
• VARCHAR (len)
• BOOLEAN
• SMALLINT
• INTEGER or INT
• DECIMAL [(p[,s])] or DEC [(p[,s])]
• NUMERIC [(p[,s])]
• REAL
• FLOAT(p)
• DOUBLE PRECISION
• DATE
• TIME
• TIMESTAMP
• CLOB [(len)] or CHARACTER LARGE OBJECT [(len)] or CHAR LARGE OBJECT [(len)]
• BLOB [(len)] or BINARY LARGE OBJECT [(len)]
String JDBC vs DB Quiz #1
What is JDBC type
LONGVARCHAR?
String JDBC vs DB
Drill Down
setStringInternal(int var1, String var2) throws SQLException {
    ...
    int var6 = var2 != null ? var2.length() : 0;
    ...
    if (var6 <= this.maxVcsCharsSql) {
        this.basicBindString(var1, var2);
    } else if (var6 <= this.maxStreamNCharsSql) {
        this.setStringForClobCritical(var1, var2);
    } else {
        this.setStringForClobCritical(var1, var2);
    }
String JDBC vs DB
CHAR & VARCHAR
String JDBC vs DB
BLOB & CLOB
Binary Large OBject
• just bytes
Character Large OBject
• just characters in your DB encoding
String JDBC vs DB
Char & NChar
Char, Varchar, CLOB…
• uses your DB encoding or no encoding
NChar, NVarchar, NCLOB…
• uses specified encoding
BLOB
• does not have NBLOB
String JDBC vs DB
Cp1251 vs Cp1252
Sometimes encoding does not matter much
Unless too smart drivers spoil it
Unless they are not compatible
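A mismatch like this can be simulated without any database. The sketch below (illustrative names) writes text in one single-byte charset and reads it back in another, and also checks upfront whether a string fits a charset at all; it assumes the `windows-1251` and `windows-1252` charsets are available in the running JDK.

```java
import java.nio.charset.Charset;

final class EncodingMismatch {
    // Encodes with one charset and decodes with another,
    // as a mismatched driver/DB pair would.
    static String roundTrip(String text, String writeCharset, String readCharset) {
        byte[] bytes = text.getBytes(Charset.forName(writeCharset));
        return new String(bytes, Charset.forName(readCharset));
    }

    // True when every character of the text can be encoded in the charset.
    static boolean fits(String text, String charset) {
        return Charset.forName(charset).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String cyrillic = "\u041F\u0440\u0438\u0432\u0435\u0442"; // a Cyrillic greeting
        System.out.println(roundTrip("plain ascii", "windows-1251", "windows-1252"));
        System.out.println(roundTrip(cyrillic, "windows-1251", "windows-1252"));
        System.out.println(fits(cyrillic, "windows-1252"));
    }
}
```

ASCII text survives the trip, which is why "sometimes encoding does not matter much" and why the problem tends to surface late.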
String JDBC vs DB Solution
Check your DB encoding upfront
If needed use N* DB types and N* JDBC types as well
String JDBC vs DB Solution
Losing data because of encoding is lame
If you expect some strange strings coming use N* types
Never forget that a symbol is not a char or a byte; it may save you one day
Your JDBC driver can screw you
Date
Date
Time zones
DST and leap miracles
Date Time Zone
Do the DB and App time zones match?
What can go wrong if they don’t?
Date Time Zone Quiz #1
• Database: Oracle 11g
• Database time zone: CET/CEST (+01:00/+02:00)
• Application: Java 8
• Application time zone: EET/EEST (+02:00/+03:00)
• setTimestamp(‘2016-10-14 15:35:01’)?
• getTimestamp()?
Date Time Zone
Quiz #1 Hint
final int oracleYear(int var1) {
    int var2 = ((this.rowSpaceByte[0 + var1] & 255) - 100) * 100 + (this.rowSpaceByte[1 + var1] & 255) - 100;
    return var2 <= 0 ? var2 + 1 : var2;
}

final int oracleMonth(int var1) { return this.rowSpaceByte[2 + var1] - 1; }
final int oracleDay(int var1) { return this.rowSpaceByte[3 + var1]; }
final int oracleHour(int var1) { return this.rowSpaceByte[4 + var1] - 1; }
final int oracleMin(int var1) { return this.rowSpaceByte[5 + var1] - 1; }
final int oracleSec(int var1) { return this.rowSpaceByte[6 + var1] - 1; }
final int oracleTZ1(int var1) { return this.rowSpaceByte[11 + var1]; }
final int oracleTZ2(int var1) { return this.rowSpaceByte[12 + var1]; }
Date Time Zone Quiz #1
Database
• 2016-10-14 15:35:01
Application
• 2016-10-14 15:35:01
Date Time Zone Quiz #1
Database
• 2016-10-14 15:35:01
Application with UTC time zone
• 2016-10-14 15:35:01
Date Time Zone Quiz #2
JavaScript client: time zone unknown
Java server: EET time zone
How to pass dates?
Date Time Zone Quiz #2
• Date… ehmm… JSON does not know what it is…
• Long is a bit of a problem for integer types limited to 53 bits (we use 41 bits now, and will cross the border in ~140,000 years)
• String as ISO 8601 is a lesser evil
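On the Java side the ISO 8601 option needs no extra library: `Instant.toString` and `Instant.parse` already speak it. A minimal sketch, where the class name and the idea of always sending UTC (`Z`) are assumptions of this example:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

final class Iso8601 {
    // Serializes a point in time as an ISO 8601 string in UTC.
    static String toWire(Instant instant) {
        return instant.toString();
    }

    // Parses what the client sent back.
    static Instant fromWire(String text) {
        return Instant.parse(text);
    }

    public static void main(String[] args) {
        // 15:35:01 wall-clock time in an EET/EEST zone; Europe/Helsinki is just a sample.
        Instant instant = ZonedDateTime
                .of(2016, 10, 14, 15, 35, 1, 0, ZoneId.of("Europe/Helsinki"))
                .toInstant();
        String wire = toWire(instant);
        System.out.println(wire);
        System.out.println(fromWire(wire).equals(instant));
    }
}
```

The client can feed such a string straight into its own date parser and render it in whatever zone the user is in.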
Date Time Zone Solution
Use the same App/DB time zone
Check your DB driver to ensure conversion safety
Store timestamps as long: DB and API
Store timestamps as String: API
Date DST & Magic
Missing and extra hours
Leap seconds
Date DST & Magic Calculations
24 hours in a day
60 minutes in an hour
60 seconds in a minute
FTW: 24 * 60 * 60 * 1000
Date DST & Magic Quiz #1
27.03.2016 00:00:00 - 28.03.2016 00:00:00 (EET/EEST)
Date DST & Magic Quiz #1
26.03.2016 22:00:00 - 27.03.2016 21:00:00 (UTC) – 23h
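The 23-hour day is easy to confirm with `java.time`. The sketch below measures a calendar day from local midnight to the next local midnight; the class name is illustrative and `Europe/Helsinki` is picked only as one sample EET/EEST zone.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.ZoneId;

final class DayLength {
    // Real length of a calendar day in a zone, midnight to next midnight.
    static Duration of(LocalDate day, ZoneId zone) {
        return Duration.between(day.atStartOfDay(zone), day.plusDays(1).atStartOfDay(zone));
    }

    public static void main(String[] args) {
        ZoneId eet = ZoneId.of("Europe/Helsinki");
        System.out.println(of(LocalDate.of(2016, 3, 27), eet));  // spring DST switch
        System.out.println(of(LocalDate.of(2016, 10, 30), eet)); // autumn DST switch
        System.out.println(of(LocalDate.of(2016, 3, 28), eet));  // ordinary day
    }
}
```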
Date DST & Magic Quiz #2
31.12.2016 00:00:00 – 01.01.2017 00:00:00 (EET/EEST)
01.01.2017 00:00:00 – 02.01.2017 00:00:00 (EET/EEST)
Date DST & Magic Quiz #2
30.12.2016 22:00:00 – 31.12.2016 22:00:00 (UTC) – 24h
31.12.2016 22:00:00 – 01.01.2017 22:00:00 (UTC) – 24h+1s
Date DST & Magic Quiz #2
It will happen: 31.12.2016 23:59:60 (UTC)
It had happened: 30.06.2015 23:59:60 (UTC)
Blame the Earth, and Moon, and Sun
Blame software developers
Date One Last Thing
Date vs Instant
Date is a tuple of year, month, day, hour, etc.
Instant is a precise point on the timeline
Date One Last Thing
Date vs Instant
Date can be converted to Instant
Instant can be converted to Date
• even within a Chronology
However, “conversion rate” is not constant
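In `java.time` terms the tuple is a `LocalDateTime`, and it turns into an `Instant` only once a zone is supplied. A small sketch (illustrative names; the zone IDs are sample choices) showing that the same tuple maps to different instants, and that the offset changes within one zone:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

final class DateVsInstant {
    // A date-time tuple becomes an instant only once a zone is supplied.
    static Instant toInstant(LocalDateTime tuple, String zone) {
        return tuple.atZone(ZoneId.of(zone)).toInstant();
    }

    // And an instant becomes a tuple only once a zone is supplied.
    static LocalDateTime toTuple(Instant instant, String zone) {
        return LocalDateTime.ofInstant(instant, ZoneId.of(zone));
    }

    public static void main(String[] args) {
        LocalDateTime october = LocalDateTime.of(2016, 10, 14, 15, 35, 1);
        LocalDateTime december = LocalDateTime.of(2016, 12, 14, 15, 35, 1);
        System.out.println(toInstant(october, "Europe/Helsinki"));  // summer offset
        System.out.println(toInstant(october, "Europe/Paris"));     // another zone, another instant
        System.out.println(toInstant(december, "Europe/Helsinki")); // winter offset
    }
}
```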
Date Summary
• Use UTC as much as possible
• Keep in mind the difference between Date and Instant
• Think of Date/Instant interoperation as if it was designed/used by idiots
• 24*60*60*1000 is, basically, a simplification. Quite harmful at times.
• Use proper date libraries – you wouldn’t want to reinvent them again.
• GMT is not yet another name for UTC, beware!
Thank you
?