
Page 1: Big Data Pragmaticalities - University of Tasmania

Big Data Pragmaticalities Experiences from Time Series Remote Sensing

MARINE & ATMOSPHERIC RESEARCH

Edward King Remote Sensing & Software Team Leader

3 September 2013

Page 2: Big Data Pragmaticalities - University of Tasmania

Overview

• Remote sensing (RS) and RS time series (type of processing & scale)

• Opportunities for parallelism

• Compute versus Data

• Scientific programming versus software engineering

• Some handy techniques

• Where next


Page 3: Big Data Pragmaticalities - University of Tasmania

Automated data collection….


Page 4: Big Data Pragmaticalities - University of Tasmania

Presto! Big Data(sets).


Page 5: Big Data Pragmaticalities - University of Tasmania

More Detail…

Processing levels: L0 (raw sensor) → L1B (calibrated) → L2 (derived quantity) → Remapped → Composites

Examples: 1 km imagery: 3000 scenes/year x 500 MB/scene x 10 years = 15 TB; 500 m imagery: x 4 = 60 TB
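A quick sanity check of those volumes, as a sketch (the scene counts and sizes are just the figures quoted above):

    # Back-of-the-envelope archive sizes for the figures above.
    scenes_per_year = 3000
    scene_size_gb = 0.5          # 500 MB per 1 km scene
    years = 10

    archive_1km_tb = scenes_per_year * scene_size_gb * years / 1000
    print(f"1 km archive: ~{archive_1km_tb:.0f} TB")            # ~15 TB
    print(f"500 m archive (x4): ~{archive_1km_tb * 4:.0f} TB")  # ~60 TB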

Page 6: Big Data Pragmaticalities - University of Tasmania

Recap - Big Picture View

• These archives are large

• They are often only stored in raw format

• We usually need to do some significant amount of processing to extract the geophysical variable(s) of interest

• We often need to process the whole archive to achieve consistency in the data

• As scientists, unless you have a background in high performance computing and data intensive science, this is a daunting prospect.

• There are things that can make it easier…


Page 7: Big Data Pragmaticalities - University of Tasmania

Output types…

• Scenes: individual passes, delivered one by one to the user.

• Composites: several scenes combined (e.g. into a "best pixels" product) and delivered to the user.

Page 8: Big Data Pragmaticalities - University of Tasmania

Things to notice

• Some operations are done over and over again on data from different times.
  • For example: processing Monday's data and Tuesday's data are independent tasks.
  • This is an opportunity to do things in parallel (i.e. all at the same time).

• Operations on one place in the data are completely independent of operations in other places.
  • For example: processing data from WA doesn't depend on data from Tas.
  • This is another opportunity to do things in parallel (see the sketch below).
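A minimal sketch of this per-scene parallelism in Python, assuming a hypothetical process_scene() function and input directory:

    # Each scene (day) is independent, so farm them out to a pool of workers.
    from multiprocessing import Pool
    from glob import glob

    def process_scene(path):
        # ... calibrate / remap / run your algorithm on one scene (hypothetical) ...
        return path, "ok"

    if __name__ == "__main__":
        scenes = sorted(glob("/data/l1b/*.hdf"))   # hypothetical input layout
        with Pool(processes=8) as pool:            # one worker per available CPU
            for path, status in pool.imap_unordered(process_scene, scenes):
                print(path, status)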


Page 9: Big Data Pragmaticalities - University of Tasmania


Note: this general pattern is often referred to as "map-reduce", and there are software frameworks (e.g. Hadoop) that formalise it – it lies behind Google search indexing, for example. (Disclaimer: I've never used one.)

Page 10: Big Data Pragmaticalities - University of Tasmania

So what?

• Our previous example: 10 yrs x 3000 scenes/yr @ 10 mins/scene = 5000 hrs ≈ 30 weeks
  – Give me 200 CPUs = 25 hours

• But what about the data flux?
  • 15 TB / 30 weeks ≈ 3 GB/hour
  • 15 TB / 25 hours = 600 GB/hour

• The problem is transformed from compute bound to I/O bound
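The same arithmetic as a sketch, so you can plug in your own scene counts and archive size:

    # Compute time and data flux for the example above.
    scenes = 10 * 3000                      # 10 years x 3000 scenes/year
    hours_serial = scenes * 10 / 60         # 10 minutes per scene -> 5000 hours
    weeks_serial = hours_serial / (7 * 24)  # ~30 weeks on one CPU
    hours_parallel = hours_serial / 200     # 200 CPUs -> 25 hours
    archive_gb = 15_000                     # 15 TB

    print(f"serial: {hours_serial:.0f} h (~{weeks_serial:.0f} weeks)")
    print(f"200 CPUs: {hours_parallel:.0f} h")
    print(f"flux: {archive_gb/hours_serial:.0f} GB/h serial, "
          f"{archive_gb/hours_parallel:.0f} GB/h on 200 CPUs")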



Page 11: Big Data Pragmaticalities - University of Tasmania

Key tradeoff #1:

• Can you supply data fast enough to make the most of your computing?

• How much effort you put into this depends on:
  • How big your data set is
  • How much computing you have available
  • How many times you have to do it
  • How soon you need your result

• Figuring out how to balance data organisation and supply against time spent computing is key to getting the best results.

• Unless you have an extraordinarily computationally intensive algorithm, you're (usually) better off focussing on steps to speed up the data supply.


Page 12: Big Data Pragmaticalities - University of Tasmania

Computing Clusters


• Workstation: 2 CPUs (15 weeks)

• My first (& last) cluster (2002): 20 CPUs (1.5 weeks)

• NCI (now obsolete): 20,000 CPUs (20 mins)

Page 13: Big Data Pragmaticalities - University of Tasmania

Plumbing & Software

• Somehow we have to connect data to operations:

• Operations = atmospheric correction | remap | calibrate | myCleverAlgorithm
  – Might be pre-existing packages
  – Or your own special code (Fortran, C, Python, … Matlab, IDL)

• Connect = provide the right data to the right operation and collect the results
  – Usually you will use a scripting language, since you need to:
    – work with the operating system
    – run programs
    – analyse file names
    – maybe read log files to see if something went wrong

• Software for us is like glassware in a chem lab: a specialised setup for our experiments; you can get components off the shelf, but only you know how you want to connect them together.

• Bottom line – you’re going to be doing some programming of some sort.
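As a sketch of the kind of glue script meant here (the remap command, paths, and naming are hypothetical):

    # Glue: provide the right data to the right operation and collect the results.
    import subprocess
    from glob import glob

    for path in sorted(glob("/data/l1b/*.hdf")):           # hypothetical input layout
        out = path.replace("/l1b/", "/l2/") + ".remapped"  # derive an output name
        result = subprocess.run(["remap", path, out])      # hypothetical external tool
        if result.returncode != 0:
            print(f"FAILED: {path}  (check the tool's log file)")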


Page 14: Big Data Pragmaticalities - University of Tasmania

Scientific Programming versus Software Engineering (Key Tradeoff #2)

• Do you want to do this processing only once, or many times?
  • Which parts of your workflow are repeated, and which are one-off?
  • E.g. base processing is run many times, followed by one-off analysis experiments.

• How does the cost of your time spent programming compare with the availability of computing and the time spent running your workflow?
  • Why spend a week making something twice as fast if it already runs in two days? (Maybe because you need to do it many times?)

• Will you need to understand it later?


Page 15: Big Data Pragmaticalities - University of Tasmania

Proprietary fly in the ointment (#1)

• If you use licenced software (IDL, Matlab, etc.) you need licences for each CPU you want to run on.

• This may mean you can't use anything like as much computing as you otherwise could.

• These languages are good for prototyping and testing.

• But, to really make the most of modern computing, you need to escape the licencing encumbrance = migrate to free software.

• PS: Windows is licenced software.

• Example: we have complex IDL code that we run on a big data set at the NCI. We have only 4 licences. It runs in a week (6 days). With 50 licences it would run in ~12 hours. We can live with that, since porting it to Python would take weeks and weeks of coding and testing.

Page 16: Big Data Pragmaticalities - University of Tasmania

How to do it…

Page 17: Big Data Pragmaticalities - University of Tasmania

Maximise performance by:

1. Minimise the amount of programming you do
   • Exploit existing tools (e.g. standard processing packages, operating system commands)
   • Write things you can re-use (data access, logging tools)
   • Choose file names that make it easy to figure out what to do
   • Use the file-system as your database

2. Maximise your ability to use multiple CPUs
   • Eliminate unnecessary differences (e.g. data formats, standards)
   • Look for opportunities to parallelise
   • Avoid licencing (e.g. proprietary data formats, libraries, languages)

3. Seek data movement efficiency everywhere
   • Data layout
   • Compression
   • RAM disks

4. Minimise the number of times you have to run your workflow
   • Log everything (so there is no uncertainty about whether you did what you think you did)


Page 18: Big Data Pragmaticalities - University of Tasmania

RAM disks

• Tapes are slow. Disks are less slow. Memory is even less slow. Cache is fast – but small.

• Most modern systems have multiple GB of RAM for each CPU, which you can assign to working memory and use as a virtual disk.

• If you have multiple processing steps that need intermediate file storage, use a RAM disk. You can get a factor of 10 improvement (see the sketch below).


(Diagram: storage hierarchy – tape → disk → RAM → cache → CPU)
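A minimal sketch of using a RAM disk for intermediate files, assuming a Linux system where /dev/shm is RAM-backed (the processing step is a stand-in):

    # Keep intermediate files in RAM instead of on spinning disk.
    import os, shutil, tempfile

    def run_step(src, dst):
        shutil.copy(src, dst)       # stand-in for a real processing step (hypothetical)

    workdir = tempfile.mkdtemp(dir="/dev/shm")   # RAM-backed scratch space
    try:
        intermediate = os.path.join(workdir, "step1.bin")
        run_step("/data/input/scene.hdf", intermediate)  # step 1 writes to the RAM disk
        run_step(intermediate, "/data/output/scene.nc")  # step 2 reads it straight back
    finally:
        shutil.rmtree(workdir)                   # always free the RAM when done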

Page 19: Big Data Pragmaticalities - University of Tasmania

Compression

• Data that is half the size takes half as long to move (but then you have to uncompress it – though CPUs are faster than disks).

• Zip and gzip will usually get you a factor of 2-4 compression.
  • Bzip2 is often 10-15% better
  • BUT – it is much slower (factor of 5)

• Don't store spurious precision (3.14 compresses more than 3.1415926).

• Avoid recompressing: treat the compressed archive as read-only, i.e. copy-uncompress-use-delete, DO NOT move-uncompress-use-recompress-move-back (sketch below).


(Diagram: copy file.gz from the remote disk, decompress on the CPU into RAM, work on the local uncompressed file.)
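A sketch of the copy-uncompress-use-delete pattern (paths and the process() step are hypothetical; the archive copy is never touched):

    # The compressed archive stays read-only; work on a throwaway local copy.
    import gzip, os, shutil

    archive = "/archive/scene_20130812.dat.gz"   # hypothetical read-only archive file
    local = "/dev/shm/scene_20130812.dat"        # working copy on the RAM disk

    with gzip.open(archive, "rb") as src, open(local, "wb") as dst:
        shutil.copyfileobj(src, dst)             # decompress straight into the work area

    process(local)                               # hypothetical processing step
    os.remove(local)                             # delete the copy; never recompress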

Page 20: Big Data Pragmaticalities - University of Tasmania

Data Layout

• Look at your data access patterns and organise your code/data to match

• E.g. 1: if your analysis uses multiple files repeatedly, reorganise the data so you reduce the number of open & close operations.

• E.g. 2: big files tend to end up as contiguous blocks on a disk, so try to localise access to the data rather than jumping around, which will entail waiting for the disk (see the sketch below).


(Diagram: access by row versus access by column)
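A sketch of why access order matters, assuming row-major (C-order) storage as numpy uses by default:

    # Row-major data: rows are contiguous, columns are strided.
    import numpy as np

    data = np.zeros((5000, 5000), dtype=np.float32)  # ~100 MB, row-major by default

    row_sums = data.sum(axis=1)   # walks contiguous memory/disk blocks
    col_sums = data.sum(axis=0)   # strided access; on disk-backed or chunked files
                                  # this means jumping around and waiting for the disk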

Page 21: Big Data Pragmaticalities - University of Tasmania

Data Formats (and metadata)

• This is still a religious subject; factors to consider:
  • Avoid proprietary formats (which may need licences or libraries for undocumented formats) – versus open formats that are publicly documented
  • Self-contained – keep header (metadata) and data together
  • Self-documenting – the structure can be decoded using only information already in the file
  • Architectural independence – will work on different computers
  • Storage efficiency – binary versus ASCII
  • Access efficiency and flexibility – support for different layouts
  • Interoperability – openness and standard conformance = reuse
  • Need some conventions around metadata for consistency
  • Automated metadata harvest (for indexing/cataloguing)
  • Longevity (& migration)

• Answer: use netCDF or HDF (or maybe FITS in astronomy) – see the example below
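For example, a minimal self-documenting netCDF file written with the netCDF4 Python library (the variable, attributes, and sizes are illustrative):

    # Metadata and data live together in one self-describing, open-format file.
    from netCDF4 import Dataset
    import numpy as np

    with Dataset("sst_example.nc", "w") as ds:
        ds.title = "Example SST field"             # global metadata
        ds.history = "created by example script"
        ds.createDimension("lat", 180)
        ds.createDimension("lon", 360)
        sst = ds.createVariable("sst", "f4", ("lat", "lon"), zlib=True)  # compressed
        sst.units = "kelvin"
        sst[:] = np.full((180, 360), 290.0, dtype="f4")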


Page 22: Big Data Pragmaticalities - University of Tasmania

The file-system is my database

• Often in your multi-step processing of 1000s of files you will want to use a database to keep track of things – DON’T!

• Every time you do something, you have to update the DB.
  • It doesn't usually take long before inconsistencies arise (e.g. someone deletes a file by hand).

• Databases are a pain to work with by hand (SQL syntax, forgettable rules).

• Use the file-system (folders, filenames) to keep track. Examples:
  • Once file.nc has been processed, rename it to file.nc.done and just have your processing look for files *.nc (rename it back to file.nc to run it again; use ls or dir to see where things are up to, and rm to get rid of things that didn't work).

• Create zero size files as breadcrumbs

– touch file.nc.FAIL.STEP2

– ls *.FAIL.* to see how many failures there were and at what step

• Use directories to group data that need to be grouped – for example all files for a particular composite.
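A sketch of that bookkeeping in a driver script (process_file is hypothetical; the .done and .FAIL names follow the convention above):

    # Let the file-system record what has and hasn't been processed.
    import os
    from glob import glob
    from pathlib import Path

    for path in sorted(glob("/data/work/*.nc")):  # only unprocessed files still match *.nc
        try:
            process_file(path)                    # hypothetical processing step
            os.rename(path, path + ".done")       # mark success by renaming
        except Exception:
            Path(path + ".FAIL.STEP2").touch()    # zero-size breadcrumb marks the failure

    # Later:  ls *.FAIL.*  shows how many failures there were and at which step.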


Page 23: Big Data Pragmaticalities - University of Tasmania

Filenames are really important

• Filenames are a good place to store metadata relevant to the processing workflow:
  • They're easy to access without opening the file
  • You can use file-system tools to select data

• Use YYYYMMDD (or YYYYddd) for dates in filenames – then they will automatically sort into time order (cf. DDMMYY, DDmonYYYY)

• Make it easy to get metadata out of file names (parsing example below):
  • Fixed-width numerical fields (F1A.dat, F10B.dat, F100C.dat is harder to interpret by program than F001A.dat, F010B.dat, F100C.dat)
  • Structured names – but don't go overboard!
    – D-20130812.G-1455.P-aqua.C-20130812172816.T-d000000n274862.S-n.pds
    – E.g. ls *.G-1[234]* to choose files at a particular time of day
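A sketch of pulling metadata back out of such a structured name (the regex and field meanings are illustrative, based on the example above):

    # Recover workflow metadata directly from the filename.
    import re

    name = "D-20130812.G-1455.P-aqua.C-20130812172816.T-d000000n274862.S-n.pds"

    fields = dict(re.findall(r"([A-Z])-([^.]+)", name))
    print(fields["D"])   # 20130812  (date, sorts correctly as YYYYMMDD)
    print(fields["G"])   # 1455      (time of day, as used by ls *.G-1[234]*)
    print(fields["P"])   # aqua      (presumably the platform)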


Page 24: Big Data Pragmaticalities - University of Tasmania

Logging and Provenance

• Every time you do something (move data, feed it to a program, put it somewhere): write a time-stamped message to a log file (sketch below).
  • Write a function that automatically prepends a timestamp to a piece of text you give it.

• Time-stamps are really useful for profiling – identifying where the bottlenecks are, or figuring out if something has gone wrong.

• Huge log files are a tiny marginal overhead.
  • Make them easy to read by program (e.g. grep)

• Make your processing code report a version (number or description), and its inputs, to the log file. Write the log file into the output data file as a final step.
  • This lets you understand what you did months later (so you don't do it again)
  • It keeps the relevant log file with the data (so you don't lose it, or mix it up)
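A sketch of such a timestamp-prepending log function, with the log attached to a netCDF output as a final step (the attribute name and filenames are illustrative):

    # Time-stamped logging, ending with the log written into the output data file.
    import datetime
    from netCDF4 import Dataset

    LOGFILE = "processing.log"
    VERSION = "compositer v1.3"      # illustrative version string

    def log(msg):
        stamp = datetime.datetime.now().isoformat(timespec="seconds")
        with open(LOGFILE, "a") as f:
            f.write(f"{stamp} {msg}\n")          # easy to read with grep

    log(f"started {VERSION}")
    log("input: /data/l2/20130812_sst.nc")       # report inputs too

    with Dataset("composite_20130812.nc", "a") as ds:   # output file written earlier
        ds.processing_log = open(LOGFILE).read()        # provenance travels with the data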


Page 25: Big Data Pragmaticalities - University of Tasmania

Final Thoughts

• Most of this is applicable to other data-intensive parallel processing tasks
  • E.g. spatio-temporal model output grids
  • Advantages may vary depending on file size

• Data organisation has many subtleties – a little work in understanding can offer great returns in performance

• Keep an eye on file format capabilities

• More CPUs is a double-edged sword

• Data efficiency will only become more important

• Haven't really touched on spatial metadata (very important for ease of end-use/analysis – but tedious (= automatable))

• Get your data into a self-documenting, machine-readable, open file format – and you'll never have to reformat by hand again

• These are things we now do out of habit because they work for us
  • Perhaps they'll work for you?


Page 26: Big Data Pragmaticalities - University of Tasmania

Marine & Atmospheric Research Edward King Team Leader: Remote Sensing & Software

t +61 3 6232 5334 e [email protected] w www.csiro.au/cmar

MARINE & ATMOSPHERIC RESEARCH

Thank you