
Page 1: Big Data Pragmaticalities - University of Tasmania

Big Data Pragmaticalities Experiences from Time Series Remote Sensing

MARINE & ATMOSPHERIC RESEARCH

Edward King Remote Sensing & Software Team Leader

3 September 2013

Page 2: Big Data Pragmaticalities - University of Tasmania

Overview

• Remote sensing (RS) and RS time series (type of processing & scale)

• Opportunities for parallelism

• Compute versus Data

• Scientific programming versus software engineering

• Some handy techniques

• Where next


Page 3: Big Data Pragmaticalities - University of Tasmania

Automated data collection….


Page 4: Big Data Pragmaticalities - University of Tasmania

Presto! Big Data(sets).


Page 5: Big Data Pragmaticalities - University of Tasmania

More Detail…

Processing levels: L0 (raw sensor) → L1B (calibrated) → L2 (derived quantity) → Remapped → Composites

Examples: 1 km imagery: 3000 scenes/year x 500 MB/scene x 10 years = 15 TB; 500 m imagery: x 4 = 60 TB
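A quick sanity check of those volumes, as a sketch (the scene counts and sizes are just the figures quoted above):

    # Back-of-the-envelope archive sizes for the figures above.
    scenes_per_year = 3000
    scene_size_gb = 0.5          # 500 MB per 1 km scene
    years = 10

    archive_1km_tb = scenes_per_year * scene_size_gb * years / 1000
    print(f"1 km archive: ~{archive_1km_tb:.0f} TB")            # ~15 TB
    print(f"500 m archive (x4): ~{archive_1km_tb * 4:.0f} TB")  # ~60 TB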

Page 6: Big Data Pragmaticalities - University of Tasmania

Recap - Big Picture View

• These archives are large

• They are often only stored in raw format

• We usually need to do some significant amount of processing to extract the geophysical variable(s) of interest

• We often need to process the whole archive to achieve consistency in the data

• As scientists, unless you have a background in high performance computing and data intensive science, this is a daunting prospect.

• There are things that can make it easier…


Page 7: Big Data Pragmaticalities - University of Tasmania

Output types…

• Scenes: individual passes, delivered one by one to the user.

• Composites: several scenes combined (e.g. into a "best pixels" product) and delivered to the user.

Page 8: Big Data Pragmaticalities - University of Tasmania

Things to notice

• Some operations are done over and over again on data from different times.
  • For example: processing Monday's data and Tuesday's data are independent tasks.
  • This is an opportunity to do things in parallel (i.e. all at the same time).

• Operations on one place in the data are completely independent of operations in other places.
  • For example: processing data from WA doesn't depend on data from Tas.
  • This is another opportunity to do things in parallel (see the sketch below).
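A minimal sketch of this per-scene parallelism in Python, assuming a hypothetical process_scene() function and input directory:

    # Each scene (day) is independent, so farm them out to a pool of workers.
    from multiprocessing import Pool
    from glob import glob

    def process_scene(path):
        # ... calibrate / remap / run your algorithm on one scene (hypothetical) ...
        return path, "ok"

    if __name__ == "__main__":
        scenes = sorted(glob("/data/l1b/*.hdf"))   # hypothetical input layout
        with Pool(processes=8) as pool:            # one worker per available CPU
            for path, status in pool.imap_unordered(process_scene, scenes):
                print(path, status)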


Page 9: Big Data Pragmaticalities - University of Tasmania


Note: this general pattern is often referred to as "map-reduce", and there are software frameworks (e.g. Hadoop) that formalise it – it lies behind Google search indexing, for example. (Disclaimer: I've never used one.)

Page 10: Big Data Pragmaticalities - University of Tasmania

So what?

• Our previous example: 10 yrs x 3000 scenes/yr @ 10 mins/scene = 5000 hrs ≈ 30 weeks
  – Give me 200 CPUs = 25 hours

• But what about the data flux?
  • 15 TB / 30 weeks ≈ 3 GB/hour
  • 15 TB / 25 hours = 600 GB/hour

• The problem is transformed from compute bound to I/O bound
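The same arithmetic as a sketch, so you can plug in your own scene counts and archive size:

    # Compute time and data flux for the example above.
    scenes = 10 * 3000                      # 10 years x 3000 scenes/year
    hours_serial = scenes * 10 / 60         # 10 minutes per scene -> 5000 hours
    weeks_serial = hours_serial / (7 * 24)  # ~30 weeks on one CPU
    hours_parallel = hours_serial / 200     # 200 CPUs -> 25 hours
    archive_gb = 15_000                     # 15 TB

    print(f"serial: {hours_serial:.0f} h (~{weeks_serial:.0f} weeks)")
    print(f"200 CPUs: {hours_parallel:.0f} h")
    print(f"flux: {archive_gb/hours_serial:.0f} GB/h serial, "
          f"{archive_gb/hours_parallel:.0f} GB/h on 200 CPUs")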



Page 11: Big Data Pragmaticalities - University of Tasmania

Key tradeoff #1:

• Can you supply data fast enough to make the most of your computing?

• How much effort you put into this depends on:
  • How big your data set is
  • How much computing you have available
  • How many times you have to do it
  • How soon you need your result

• Figuring out how to balance data organisation and supply against time spent computing is key to getting the best results.

• Unless you have an extraordinarily computationally intensive algorithm, you're (usually) better off focussing on steps to speed up the data supply.


Page 12: Big Data Pragmaticalities - University of Tasmania

Computing Clusters


• Workstation: 2 CPUs (15 weeks)

• My first (& last) cluster (2002): 20 CPUs (1.5 weeks)

• NCI (now obsolete): 20,000 CPUs (20 mins)

Page 13: Big Data Pragmaticalities - University of Tasmania

Plumbing & Software

• Somehow we have to connect data to operations:

• Operations = atmospheric correction | remap | calibrate | myCleverAlgorithm
  – Might be pre-existing packages
  – Or your own special code (Fortran, C, Python, … Matlab, IDL)

• Connect = provide the right data to the right operation and collect the results
  – Usually you will use a scripting language, since you need to:
    – work with the operating system
    – run programs
    – analyse file names
    – maybe read log files to see if something went wrong

• Software for us is like glassware in a chem lab: a specialised setup for our experiments; you can get components off the shelf, but only you know how you want to connect them together.

• Bottom line – you’re going to be doing some programming of some sort.
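As a sketch of the kind of glue script meant here (the remap command, paths, and naming are hypothetical):

    # Glue: provide the right data to the right operation and collect the results.
    import subprocess
    from glob import glob

    for path in sorted(glob("/data/l1b/*.hdf")):           # hypothetical input layout
        out = path.replace("/l1b/", "/l2/") + ".remapped"  # derive an output name
        result = subprocess.run(["remap", path, out])      # hypothetical external tool
        if result.returncode != 0:
            print(f"FAILED: {path}  (check the tool's log file)")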


Page 14: Big Data Pragmaticalities - University of Tasmania

Scientific Programming versus Software Engineering (Key Tradeoff #2)

• Do you want to do this processing only once, or many times?
  • Which parts of your workflow are repeated, and which are one-off?
  • E.g. base processing is run many times, followed by one-off analysis experiments.

• How does the cost of your time spent programming compare with the availability of computing and the time spent running your workflow?
  • Why spend a week making something twice as fast if it already runs in two days? (Maybe because you need to do it many times?)

• Will you need to understand it later?


Page 15: Big Data Pragmaticalities - University of Tasmania

Proprietary fly in the ointment (#1)

• If you use licenced software (IDL, Matlab, etc.) you need licences for each CPU you want to run on.

• This may mean you can't use anything like as much computing as you otherwise could.

• These languages are good for prototyping and testing.

• But, to really make the most of modern computing, you need to escape the licencing encumbrance = migrate to free software.

• PS: Windows is licenced software.

• Example: we have complex IDL code that we run on a big data set at the NCI. We have only 4 licences. It runs in a week (6 days). With 50 licences it would run in ~12 hours. We can live with that, since porting it to Python would take weeks and weeks of coding and testing.

Page 16: Big Data Pragmaticalities - University of Tasmania

How to do it…

Page 17: Big Data Pragmaticalities - University of Tasmania

Maximise performance by:

1. Minimise the amount of programming you do
   • Exploit existing tools (e.g. standard processing packages, operating system commands)
   • Write things you can re-use (data access, logging tools)
   • Choose file names that make it easy to figure out what to do
   • Use the file-system as your database

2. Maximise your ability to use multiple CPUs
   • Eliminate unnecessary differences (e.g. data formats, standards)
   • Look for opportunities to parallelise
   • Avoid licencing (e.g. proprietary data formats, libraries, languages)

3. Seek data movement efficiency everywhere
   • Data layout
   • Compression
   • RAM disks

4. Minimise the number of times you have to run your workflow
   • Log everything (so there is no uncertainty about whether you did what you think you did)


Page 18: Big Data Pragmaticalities - University of Tasmania

RAM disks

• Tapes are slow. Disks are less slow. Memory is even less slow. Cache is fast – but small.

• Most modern systems have multiple GB of RAM for each CPU, which you can assign to working memory and use as a virtual disk.

• If you have multiple processing steps that need intermediate file storage, use a RAM disk. You can get a factor of 10 improvement (see the sketch below).


(Diagram: storage hierarchy – tape → disk → RAM → cache → CPU)
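A minimal sketch of using a RAM disk for intermediate files, assuming a Linux system where /dev/shm is RAM-backed (the processing step is a stand-in):

    # Keep intermediate files in RAM instead of on spinning disk.
    import os, shutil, tempfile

    def run_step(src, dst):
        shutil.copy(src, dst)       # stand-in for a real processing step (hypothetical)

    workdir = tempfile.mkdtemp(dir="/dev/shm")   # RAM-backed scratch space
    try:
        intermediate = os.path.join(workdir, "step1.bin")
        run_step("/data/input/scene.hdf", intermediate)  # step 1 writes to the RAM disk
        run_step(intermediate, "/data/output/scene.nc")  # step 2 reads it straight back
    finally:
        shutil.rmtree(workdir)                   # always free the RAM when done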

Page 19: Big Data Pragmaticalities - University of Tasmania

Compression

• Data that is half the size takes half as long to move (but then you have to uncompress it – though CPUs are faster than disks).

• Zip and gzip will usually get you a factor of 2-4 compression.
  • Bzip2 is often 10-15% better
  • BUT – it is much slower (factor of 5)

• Don't store spurious precision (3.14 compresses more than 3.1415926).

• Avoid recompressing: treat the compressed archive as read-only, i.e. copy-uncompress-use-delete, DO NOT move-uncompress-use-recompress-move-back (sketch below).


(Diagram: copy file.gz from the remote disk, decompress on the CPU into RAM, work on the local uncompressed file.)
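A sketch of the copy-uncompress-use-delete pattern (paths and the process() step are hypothetical; the archive copy is never touched):

    # The compressed archive stays read-only; work on a throwaway local copy.
    import gzip, os, shutil

    archive = "/archive/scene_20130812.dat.gz"   # hypothetical read-only archive file
    local = "/dev/shm/scene_20130812.dat"        # working copy on the RAM disk

    with gzip.open(archive, "rb") as src, open(local, "wb") as dst:
        shutil.copyfileobj(src, dst)             # decompress straight into the work area

    process(local)                               # hypothetical processing step
    os.remove(local)                             # delete the copy; never recompress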

Page 20: Big Data Pragmaticalities - University of Tasmania

Data Layout

• Look at your data access patterns and organise your code/data to match

• E.g. 1: if your analysis uses multiple files repeatedly, reorganise the data so you reduce the number of open & close operations.

• E.g. 2: big files tend to end up as contiguous blocks on a disk, so try to localise access to the data rather than jumping around, which will entail waiting for the disk (see the sketch below).


(Diagram: access by row versus access by column)
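A sketch of why access order matters, assuming row-major (C-order) storage as numpy uses by default:

    # Row-major data: rows are contiguous, columns are strided.
    import numpy as np

    data = np.zeros((5000, 5000), dtype=np.float32)  # ~100 MB, row-major by default

    row_sums = data.sum(axis=1)   # walks contiguous memory/disk blocks
    col_sums = data.sum(axis=0)   # strided access; on disk-backed or chunked files
                                  # this means jumping around and waiting for the disk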

Page 21: Big Data Pragmaticalities - University of Tasmania

Data Formats (and metadata)

• This is still a religious subject; factors to consider:
  • Avoid proprietary formats (which may need licences or libraries for undocumented formats) – versus open formats that are publicly documented
  • Self-contained – keep header (metadata) and data together
  • Self-documenting – the structure can be decoded using only information already in the file
  • Architectural independence – will work on different computers
  • Storage efficiency – binary versus ASCII
  • Access efficiency and flexibility – support for different layouts
  • Interoperability – openness and standard conformance = reuse
  • Need some conventions around metadata for consistency
  • Automated metadata harvest (for indexing/cataloguing)
  • Longevity (& migration)

• Answer: use netCDF or HDF (or maybe FITS in astronomy) – see the example below
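For example, a minimal self-documenting netCDF file written with the netCDF4 Python library (the variable, attributes, and sizes are illustrative):

    # Metadata and data live together in one self-describing, open-format file.
    from netCDF4 import Dataset
    import numpy as np

    with Dataset("sst_example.nc", "w") as ds:
        ds.title = "Example SST field"             # global metadata
        ds.history = "created by example script"
        ds.createDimension("lat", 180)
        ds.createDimension("lon", 360)
        sst = ds.createVariable("sst", "f4", ("lat", "lon"), zlib=True)  # compressed
        sst.units = "kelvin"
        sst[:] = np.full((180, 360), 290.0, dtype="f4")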


Page 22: Big Data Pragmaticalities - University of Tasmania

The file-system is my database

• Often in your multi-step processing of 1000s of files you will want to use a database to keep track of things – DON’T!

• Every time you do something, you have to update the DB.
  • It doesn't usually take long before inconsistencies arise (e.g. someone deletes a file by hand).

• Databases are a pain to work with by hand (SQL syntax, forgettable rules).

• Use the file-system (folders, filenames) to keep track. Examples:
  • Once file.nc has been processed, rename it to file.nc.done and just have your processing look for files *.nc (rename it back to file.nc to run it again; use ls or dir to see where things are up to, and rm to get rid of things that didn't work).

• Create zero size files as breadcrumbs

– touch file.nc.FAIL.STEP2

– ls *.FAIL.* to see how many failures there were and at what step

• Use directories to group data that need to be grouped – for example all files for a particular composite.
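A sketch of that bookkeeping in a driver script (process_file is hypothetical; the .done and .FAIL names follow the convention above):

    # Let the file-system record what has and hasn't been processed.
    import os
    from glob import glob
    from pathlib import Path

    for path in sorted(glob("/data/work/*.nc")):  # only unprocessed files still match *.nc
        try:
            process_file(path)                    # hypothetical processing step
            os.rename(path, path + ".done")       # mark success by renaming
        except Exception:
            Path(path + ".FAIL.STEP2").touch()    # zero-size breadcrumb marks the failure

    # Later:  ls *.FAIL.*  shows how many failures there were and at which step.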


Page 23: Big Data Pragmaticalities - University of Tasmania

Filenames are really important

• Filenames are a good place to store metadata relevant to the processing workflow:
  • They're easy to access without opening the file
  • You can use file-system tools to select data

• Use YYYYMMDD (or YYYYddd) for dates in filenames – then they will automatically sort into time order (cf. DDMMYY, DDmonYYYY)

• Make it easy to get metadata out of file names (parsing example below):
  • Fixed-width numerical fields (F1A.dat, F10B.dat, F100C.dat is harder to interpret by program than F001A.dat, F010B.dat, F100C.dat)
  • Structured names – but don't go overboard!
    – D-20130812.G-1455.P-aqua.C-20130812172816.T-d000000n274862.S-n.pds
    – E.g. ls *.G-1[234]* to choose files at a particular time of day
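A sketch of pulling metadata back out of such a structured name (the regex and field meanings are illustrative, based on the example above):

    # Recover workflow metadata directly from the filename.
    import re

    name = "D-20130812.G-1455.P-aqua.C-20130812172816.T-d000000n274862.S-n.pds"

    fields = dict(re.findall(r"([A-Z])-([^.]+)", name))
    print(fields["D"])   # 20130812  (date, sorts correctly as YYYYMMDD)
    print(fields["G"])   # 1455      (time of day, as used by ls *.G-1[234]*)
    print(fields["P"])   # aqua      (presumably the platform)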


Page 24: Big Data Pragmaticalities - University of Tasmania

Logging and Provenance

• Every time you do something (move data, feed it to a program, put it somewhere): write a time-stamped message to a log file (sketch below).
  • Write a function that automatically prepends a timestamp to a piece of text you give it.

• Time-stamps are really useful for profiling – identifying where the bottlenecks are, or figuring out if something has gone wrong.

• Huge log files are a tiny marginal overhead.
  • Make them easy to read by program (e.g. grep)

• Make your processing code report a version (number or description), and its inputs, to the log file. Write the log file into the output data file as a final step.
  • This lets you understand what you did months later (so you don't do it again)
  • It keeps the relevant log file with the data (so you don't lose it, or mix it up)
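A sketch of such a timestamp-prepending log function, with the log attached to a netCDF output as a final step (the attribute name and filenames are illustrative):

    # Time-stamped logging, ending with the log written into the output data file.
    import datetime
    from netCDF4 import Dataset

    LOGFILE = "processing.log"
    VERSION = "compositer v1.3"      # illustrative version string

    def log(msg):
        stamp = datetime.datetime.now().isoformat(timespec="seconds")
        with open(LOGFILE, "a") as f:
            f.write(f"{stamp} {msg}\n")          # easy to read with grep

    log(f"started {VERSION}")
    log("input: /data/l2/20130812_sst.nc")       # report inputs too

    with Dataset("composite_20130812.nc", "a") as ds:   # output file written earlier
        ds.processing_log = open(LOGFILE).read()        # provenance travels with the data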


Page 25: Big Data Pragmaticalities - University of Tasmania

Final Thoughts

• Most of this is applicable to other data-intensive parallel processing tasks
  • E.g. spatio-temporal model output grids
  • Advantages may vary depending on file size

• Data organisation has many subtleties – a little work in understanding can offer great returns in performance

• Keep an eye on file format capabilities

• More CPUs is a double-edged sword

• Data efficiency will only become more important

• Haven't really touched on spatial metadata (very important for ease of end-use/analysis – but tedious (= automatable))

• Get your data into a self-documenting, machine-readable, open file format – and you'll never have to reformat by hand again

• These are things we now do out of habit because they work for us
  • Perhaps they'll work for you?


Page 26: Big Data Pragmaticalities - University of Tasmania

Marine & Atmospheric Research Edward King Team Leader: Remote Sensing & Software

t +61 3 6232 5334 e [email protected] w www.csiro.au/cmar

MARINE & ATMOSPHERIC RESEARCH

Thank you