1 High Performance Presentation: 5 slides/Minute? (65 slides / 15 minutes) IO and DB “stuff” for LSST A new world record? Jim Gray Microsoft Research

1

High Performance Presentation: 5 slides/Minute?
(65 slides / 15 minutes)
IO and DB “stuff” for LSST
A new world record?

Jim Gray

Microsoft Research

2

TerraServer Lessons Learned
• Hardware is 5 9’s (with clustering)
• Software is 5 9’s (with clustering)
• Admin is 4 9’s (offline maintenance)
• Network is 3 9’s (mistakes, environment)
• Simple designs are best
• 10 TB DB is management limit
  1 PB = 100 x 10 TB DB
  this is 100x better than 5 years ago.
  (Yahoo!, HotMail are 300 TB, Google is 2 PB)
• Minimize use of tape
  – Backup to disk (snapshots)
  – Portable disk TBs

3

Serving BIG images
• Break into tiles (compressed):
  – 10KB for modems
  – 1MB for LANs
• Mosaic the tiles for pan, crop
• Store image pyramid for zoom
  – 2x zoom only adds 33% overhead: 1 + ¼ + 1/16 + … = 4/3

• Use a spatial index to cluster & find objects
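The 33% figure for the image pyramid is the limit of a geometric series: each coarser level holds a quarter of the pixels of the level below it. A minimal Python sketch of that arithmetic follows; the function name and the `factor` parameter are illustrative, not taken from TerraServer.

```python
def pyramid_overhead(levels: int, factor: int = 2) -> float:
    """Extra storage for `levels` coarser zoom levels, as a fraction of the base image.

    Each level shrinks both axes by `factor`, so it holds 1/factor**2 of the
    pixels of the level below it.
    """
    shrink = factor * factor
    return sum(1 / shrink**k for k in range(1, levels + 1))


if __name__ == "__main__":
    for levels in (1, 2, 3, 10):
        print(f"{levels:2d} extra levels: {pyramid_overhead(levels):.4f} of base size")
```

With a 2x zoom step the overhead approaches 1/3 no matter how many levels are stored, which is why the pyramid is cheap to keep.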

[Figure: image pyramid: 1.6x1.6 km² image, .8x.8 km² image, .4x.4 km² image, .2x.2 km² tile]

4

Economics

• People are more than 50% of costs

• Disks are more than 50% of capital

• Networking is the other 50%
  – People
  – Phone bill
  – Routers

• Cpus are free (they come with the disks)

5

SkyServer/SkyQuery Lessons
• DB is easy
• Search
  – It is BEST to index
  – You can put objects and attributes in a row (SQL puts big blobs off-page)
  – If you can’t index, you can extract attributes and quickly compare
  – SQL can scan at 5M records/cpu/second
  – Sequential scans are embarrassingly parallel
• Web services are easy
• XML Data Sets:
  – a universal way to represent answers
  – minimize round trips: 1 request/response
  – Diffgrams allow disconnected update

6

How Will We Find Stuff?
Put everything in the DB (and index it)
• Need dbms features: Consistency, Indexing, Pivoting, Queries, Speed/scalability, Backup, replication
  If you don’t use one, you’re creating one!
• Simple logical structure:
  – Blob and link is all that is inherent
  – Additional properties (facets == extra tables) and methods on those tables (encapsulation)
• More than a file system
• Unifies data and meta-data
• Simpler to manage
• Easier to subset and reorganize
• Set-oriented access
• Allows online updates
• Automatic indexing, replication

7

How Do We Represent Data To The Outside World?

• File metaphor too primitive: just a blob
• Table metaphor too primitive: just records
• Need Metadata describing data context
  – Format
  – Provenance (author/publisher/citations/…)
  – Rights
  – History
  – Related documents
• In a standard format
• XML and XML schema
• DataSet is great example of this
• World is now defining standard schemas

[Figure labels: schema; data or diffgram]

<?xml version="1.0" encoding="utf-8" ?>
<DataSet xmlns="http://WWT.sdss.org/">
  <xs:schema id="radec" xmlns="" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
    <xs:element name="radec" msdata:IsDataSet="true">
      <xs:element name="Table">
        <xs:element name="ra" type="xs:double" minOccurs="0" />
        <xs:element name="dec" type="xs:double" minOccurs="0" />
        …
  <diffgr:diffgram xmlns:msdata="urn:schemas-microsoft-com:xml-msdata" xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
    <radec xmlns="">
      <Table diffgr:id="Table1" msdata:rowOrder="0">
        <ra>184.028935351008</ra>
        <dec>-1.12590950121524</dec>
      </Table>
      <Table diffgr:id="Table10" msdata:rowOrder="9">
        <ra>184.025719033547</ra>
        <dec>-1.21795827920186</dec>
      </Table>
    </radec>
  </diffgr:diffgram>
</DataSet>
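To show why a self-describing answer is convenient for the consumer, here is a small Python sketch that reads the (ra, dec) rows out of a diffgram like the one above. The `SAMPLE` string is a trimmed, well-formed stand-in (the schema part is omitted), and `read_radec` is a hypothetical helper, not part of any SDSS or .NET API.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the slide's DataSet; the xs:schema section is left out.
SAMPLE = """\
<DataSet xmlns="http://WWT.sdss.org/">
  <diffgr:diffgram
      xmlns:msdata="urn:schemas-microsoft-com:xml-msdata"
      xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
    <radec xmlns="">
      <Table diffgr:id="Table1" msdata:rowOrder="0">
        <ra>184.028935351008</ra>
        <dec>-1.12590950121524</dec>
      </Table>
      <Table diffgr:id="Table10" msdata:rowOrder="9">
        <ra>184.025719033547</ra>
        <dec>-1.21795827920186</dec>
      </Table>
    </radec>
  </diffgr:diffgram>
</DataSet>
"""

DIFFGR_NS = "urn:schemas-microsoft-com:xml-diffgram-v1"


def read_radec(xml_text: str) -> list[tuple[str, float, float]]:
    """Return (row id, ra, dec) for every Table row in the diffgram."""
    root = ET.fromstring(xml_text)
    rows = []
    # <radec xmlns=""> resets the default namespace, so rows are plain "Table".
    for table in root.iter("Table"):
        row_id = table.get(f"{{{DIFFGR_NS}}}id")
        rows.append((row_id, float(table.findtext("ra")), float(table.findtext("dec"))))
    return rows


if __name__ == "__main__":
    for row in read_radec(SAMPLE):
        print(row)
```

One request returns the whole answer set, so the client needs no further round trips to interpret it.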

8

Emerging Concepts

• Standardizing distributed data
  – Web Services, supported on all platforms
  – Custom configure remote data dynamically
  – XML: Extensible Markup Language
  – SOAP: Simple Object Access Protocol
  – WSDL: Web Services Description Language
  – DataSets: Standard representation of an answer
• Standardizing distributed computing
  – Grid Services
  – Custom configure remote computing dynamically
  – Build your own remote computer, and discard
  – Virtual Data: new data sets on demand

9

Szalay’s Law: The utility of N comparable datasets is N²
• Metcalfe’s law applies to telephones, fax, Internet.
• Szalay argues as follows:
  Each new dataset gives new information
  2-way combinations give new information.
• Example: Combine these 3 datasets
  – (ID, zip code)
  – (ID, birth day)
  – (ID, height)
• Other example: quark star: Chandra Xray + Hubble optical, + 600 year old records.
  Drake, J. J. et al. Is RX J185635-375 a Quark Star? Preprint (2002).

X-ray, optical, infrared, and radio views of the nearby Crab Nebula, which is now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese Astronomers.

Crab star 1053 AD

10

Science is hitting a wall
FTP and GREP are not adequate
• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years.
• Oh!, and 1PB ~10,000 disks
• At some point you need indices to limit search, parallel data search and analysis tools
• This is where databases can help
• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (= 1 $/GB)
• … 2 days and 1K$
• … 3 years and 1M$
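The GREP and FTP numbers above are round figures; the underlying arithmetic is just size divided by sustained rate. The Python sketch below assumes a single sequential stream at 10 MB/s, a rate chosen for illustration that is not stated on the slide, and lands in the same range for the terabyte and petabyte cases.

```python
def scan_seconds(size_bytes: float, rate_bytes_per_sec: float) -> float:
    """Time to read `size_bytes` once, sequentially, at a sustained rate."""
    return size_bytes / rate_bytes_per_sec


def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that fits."""
    for unit, length in (("years", 365 * 86400), ("days", 86400), ("minutes", 60)):
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    RATE = 10e6  # assumed: 10 MB/s for one unindexed sequential pass
    for label, size in (("1 GB", 1e9), ("1 TB", 1e12), ("1 PB", 1e15)):
        print(label, humanize(scan_seconds(size, RATE)))
```

Indexing shrinks the bytes that must be touched, and parallel scans divide the time by the number of disks and CPUs working at once.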

11

Networking: Great hardware & Software

• WANs @ 5GBps (1 λ = 40 Gbps)
• Gbps Ethernet common (~100 MBps)
  – Offload gives ~2 hz/Byte
  – Will improve with RDMA & zero-copy
  – 10 Gbps mainstream by 2004
• Faster I/O
  – 1 GB/s today (measured)
  – 10 GB/s under development
  – SATA (serial ATA) 150MBps/device

12

Bandwidth: 3x bandwidth/year for 25 more years

• Today:
  – 40 Gbps per channel (λ)
  – 12 channels per fiber (wdm): 500 Gbps
  – 32 fibers/bundle = 16 Tbps/bundle

• In lab 3 Tbps/fiber (400 x WDM)

• In theory 25 Tbps per fiber

• 1 Tbps = USA 1996 WAN bisection bandwidth

• Aggregate bandwidth doubles every 8 months!

1 fiber = 25 Tbps

13

Hero/Guru Networking
[Figure: map showing Redmond/Seattle, WA; San Francisco, CA; New York; Arlington, VA. 5626 km, 10 hops]
Participants: Information Sciences Institute, Microsoft, Qwest, University of Washington, Pacific Northwest Gigapop, HSCC (high speed connectivity consortium), DARPA

14

Real Networking
• Bandwidth for 1 Gbps “stunt” cost 400k$/month
  – ~ 200$/Mbps/m (at each end + hardware + admin)
  – Price not improving very fast
  – Doesn’t include operations / local hardware costs
• Admin… costs more ~1$/GB to 10$/GB
• Challenge: Go home and FTP from a “fast” server
• The Guru Gap: FermiLab <-> JHU
  – Both “well connected”
  – vBNS, NGI, Internet2, Abilene,….
  – Actual desktop-to-desktop ~ 100KBps
  – 12 days/TB (but it crashes first).
• The reality: to move 10GB, mail it! TeraScale Sneakernet

15

How Do You Move A Terabyte?

Context       Speed (Mbps)  Rent ($/month)  $/Mbps  $/TB sent  Time/TB
Home phone    0.04          40              1,000   3,086      6 years
Home DSL      0.6           70              117     360        5 months
T1            1.5           1,200           800     2,469      2 months
T3            43            28,000          651     2,010      2 days
OC3           155           49,000          316     976        14 hours
100 Mbps      100                                              1 day
Gbps          1000                                             2.2 hours
OC 192        9600          1,920,000       200     617        14 minutes

Source: TeraScale Sneakernet, Microsoft Research, Jim Gray et al.
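The Time/TB and $/TB sent columns follow from the link speed and the monthly rent. The Python sketch below reproduces a few rows; it assumes 1 TB = 8 x 10^12 bits, a fully used link, and 30-day months, assumptions that happen to match the table's figures but are not spelled out on the slide.

```python
TERABYTE_BITS = 8e12  # assumed: 1 TB = 10**12 bytes


def seconds_per_tb(speed_mbps: float) -> float:
    """Seconds to push one terabyte through a fully used link."""
    return TERABYTE_BITS / (speed_mbps * 1e6)


def dollars_per_tb(speed_mbps: float, rent_per_month: float,
                   days_per_month: float = 30.0) -> float:
    """Rent paid while the link is busy moving one terabyte."""
    months = seconds_per_tb(speed_mbps) / 86400 / days_per_month
    return rent_per_month * months


if __name__ == "__main__":
    links = [("Home phone", 0.04, 40), ("T1", 1.5, 1_200), ("OC 192", 9600, 1_920_000)]
    for name, mbps, rent in links:
        days = seconds_per_tb(mbps) / 86400
        print(f"{name:10s} {days:10.3f} days/TB  {dollars_per_tb(mbps, rent):8.0f} $/TB")
```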

16

There Is A Problem

• GREAT!!!!
  – XML documents are portable objects
  – XML documents are complex objects
  – WSDL defines the methods on objects (the class)
• But will all the implementations match?
  – Think of UNIX or SQL or C or…
• This is a work in progress.

Niklaus Wirth: Algorithms + Data Structures = Programs

17

Changes To DBMS’s

• Integration of Programs and Data
  – Put programs inside the database, allows OODB
  – Gives you parallel execution
• Integration of Relational, Text, XML, Time
• Scaleout (even more)
• AutoAdmin (“no knobs”)
• Manage Petascale databases (utilities, geoplex, online, incremental)

18

Publishing Data

Roles        Traditional   Emerging
Authors      Scientists    Collaborations
Publishers   Journals      Project web site
Curators     Libraries     Data+Doc Archives
Archives     Archives      Digital Archives
Consumers    Scientists    Scientists

19

The Core Problem: No Economic Model

• The archive user has not yet been born. How can he pay you to curate the data?

• The Scientist gathered data for his own purpose. Why should he pay (invest time) for your needs?

• Answer to both: that’s the scientific method

• Curating data (documenting the design, the acquisition and the processing) is very hard and there is no reward for doing it. The results are rewarded, not the process of getting them.

• Storage/archive NOT the problem (it’s almost free)

• Curating/Publishing is expensive.

20

SDSS Data Inflation – Data Pyramid

• Level 1A
  Grows 5TB pixels/year growing to 25TB
  ~ 2 TB/y compressed growing to 13TB
  ~ 4 TB today (level 1A in NASA terms)
• Level 2
  Derived data products ~10x smaller
  But there are many catalogs.
• Publish new edition each year
  – Fixes bugs in data.
  – Must preserve old editions
  – Creates data pyramid
• Store each edition
  – 1, 2, 3, 4… N ~ N² bytes
• Net: Data Inflation: L2 ≥ L1
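The N² growth comes from keeping every edition: edition k covers k years of data, so N preserved editions hold 1 + 2 + … + N units. A short Python sketch of that sum follows; the unit size per year is an illustrative parameter, not an SDSS figure.

```python
def cumulative_storage(editions: int, unit_per_year: float = 1.0) -> float:
    """Total size of editions 1..N when edition k holds k years of data.

    The sum 1 + 2 + ... + N equals N(N+1)/2, which grows as N**2.
    """
    return unit_per_year * editions * (editions + 1) / 2


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        latest = n  # size of the newest edition alone, in the same units
        print(f"{n} editions: total {cumulative_storage(n):.0f}, latest alone {latest}")
```

Doubling the number of editions roughly quadruples the archive, even though each yearly increment stays the same size.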

[Figure: data pyramid over time, editions E1, E2, E3, E4]
4 editions of level 1A data (source data)
4 editions of level 2 derived data products. Note that each derived product is small, but they are numerous. This proliferation combined with the data pyramid implies that level 2 data more than doubles the total storage volume.

21

What’s needed? (not drawn to scale)
[Figure: four groups and what each contributes]
Scientists: Science Data & Questions
Plumbers: Database to store data, execute queries
Miners: Data Mining Algorithms
Tools: Question & Answer, Visualization

22

CS Challenges For Astronomers

• Objectify your field:
  – Precisely define what you are talking about.
  – Objects and Methods / Attributes
  – This is REALLY difficult.
  – UCDs are a great start, but there is a long way to go
• “Software is like entropy, it always increases.” -- Norman Augustine, Augustine’s Laws
  – Beware of legacy software – cost can eat you alive
  – Share software where possible.
  – Use standard software where possible.
  – Expect it will cost you 25% to 40% of project.
• Explain what you want to do with the VO
  – 20 queries or something like that.

Science Data & Questions

Scientists

23

Challenge to Data Miners: Linear and Sub-Linear Algorithms

• Today most correlation / clustering algorithms are polynomial: N² or N³ or…

• N² is VERY big when N is big (10¹⁸ is big)

• Need sub-linear algorithms

• Current approaches are near optimal given current assumptions.

• So, need new assumptions, probably heuristic and approximate

Data Mining Algorithms

Miners

Techniques

24

Challenge to Data Miners: Rediscover Astronomy

• Astronomy needs deep understanding of physics.

• But, some was discovered as variable correlations then “explained” with physics.

• Famous example: Hertzsprung-Russell Diagram, star luminosity vs color (=temperature)

• Challenge 1 (the student test): How much of astronomy can data mining discover?

• Challenge 2 (the Turing test): Can data mining discover NEW correlations?

Data Mining Algorithms

Miners

25

Plumbers: Organize and Search Petabytes

• Automate
  – instrument-to-archive pipelines
    It is a messy business – very labor intensive
    Most current designs do not scale (too many manual steps)
    BaBar (1TB/day) and ESO pipeline seem promising.
    A job-scheduling or workflow system
  – Physical Database design & access
    • Data access patterns are difficult to anticipate
    • Aggressively and automatically use indexing, sub-setting.
    • Search in parallel
• Goals
  – Answer easy queries in 10 seconds.
  – Answer hard queries (correlations) in 10 minutes.

Database

To store data

ExecuteQueries

Plumbers

26

Scale UP

Scaleable Systems

• Scale UP: grow by adding components to a single system.

• Scale Out: grow by adding more systems.

Scale OUT

27

What’s New – Scale Up

• 64 bit & TB size main memory

• SMP on chip: everything’s smp

• 32… 256 SMP: locality/affinity matters

• TB size disks

• High-speed LANs

28

Who needs 64-bit addressing?
You! Need 64-bit addressing!

• “640K ought to be enough for anybody.” Bill Gates, 1981

• But that was 21 years ago == 2·21/3 = 14 bits ago.

• 20 bits + 14 bits = 34 bits, so.. “16GB ought to be enough for anybody.” Jim Gray, 2002

• 34 bits > 31 bits so…34 bits == 64 bits

• YOU need 64 bit addressing!
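The bit-counting argument above can be checked in a few lines. This Python sketch takes the slide's own premises (640K treated as 20 address bits, one more address bit every 18 months); the function name is illustrative.

```python
def address_bits(base_bits: float, years: float, months_per_bit: float = 18) -> float:
    """Address bits needed after `years`, adding one bit every `months_per_bit` months."""
    return base_bits + years * 12 / months_per_bit


if __name__ == "__main__":
    bits = address_bits(20, 21)  # 1981 -> 2002, starting from ~1 MB (20 bits)
    print(f"{bits:.0f} bits, i.e. {2 ** int(bits) // 2 ** 30} GB of addressable memory")
```

Anything past 32 bits forces the move to the next practical word size, which is the slide's point that 34 bits means 64 bits in practice.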

29

64 bit – Why bother?
• 1966 Moore’s law:
  4x more RAM every 3 years.
  1 bit of addressing every 18 months
• 36 years later: 2·36/3 = 24 more bits
  Not exactly right, but…
  32 bits not enough for servers
  32 bits gives no headroom for clients
  So, time is running out (has run out)
• Good news:
  Itanium™ and Hammer™ are maturing
  And so is the base software (OS, drivers, DB, Web,...)
  Windows & SQL @ 256GB today!

30

64 bit – why bother?
• Memory intensive calculations:
  – You can trade memory for IO and processing
• Example: Data Analysis & Clustering at JHU
• in memory CPU time is ~N log N, N ~ 100M
• Disk M chunks → time ~ M²
• must run many times
• Now running on HP Itanium, Windows .Net Server 2003, SQL Server

Graph courtesy of Alex Szalay & Adrian Pope of Johns Hopkins University

[Graph: CPU time (hrs, log scale from 1.0 to 100000.0) vs. number of galaxies in millions (0 to 100), one curve per memory size in GB (1, 4, 32, 256), with reference lines at day, week, month, year, decade]

31

Amdahl’s balanced System Laws
• 1 mips needs 4 MB ram and needs 20 IO/s
• At 1 billion instructions per second
  need 4 GB/cpu
  need 50 disks/cpu!
• 64 cpus … 3,000 disks

[Figure: 1 bips cpu, 4 GB RAM, 50 disks, 10,000 IOps, 7.5 TB]

32

The 5 Minute Rule – Trade RAM for Disk Arms

• If data is re-referenced every 5 minutes, it is cheaper to cache it in ram than to get it from disk
  A disk access/second ~ 50$, or ~ 50MB for 1 second, or ~ 50KB for 1,000 seconds.

• Each app has a memory “knee”. Up to the knee, more memory helps a lot.
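The trade can be written as a break-even interval: holding a page in RAM costs its size times the RAM price, while re-reading it every T seconds uses 1/T of a disk access per second. The Python sketch below uses the slide's figures (50$ per access/second, and 50$ buying 50 MB, so 1$/MB of RAM); the function and parameter names are illustrative.

```python
def break_even_seconds(page_bytes: float,
                       dollars_per_access_per_sec: float = 50.0,
                       dollars_per_mb_ram: float = 1.0) -> float:
    """Re-reference interval at which caching a page costs the same as re-reading it.

    Pages touched more often than this are cheaper to keep in RAM.
    """
    ram_cost = page_bytes / 1e6 * dollars_per_mb_ram
    return dollars_per_access_per_sec / ram_cost


if __name__ == "__main__":
    for size in (50e6, 50e3, 8e3):
        print(f"{size / 1e3:>8.0f} KB page: break-even {break_even_seconds(size):,.0f} s")
```

As RAM gets cheaper relative to disk arms, the break-even interval grows, so more data belongs in memory.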

33

Three TPC Benchmarks: GBs help a LOT! even if cpu clock is slower
[Chart: Transactions Per Second (0 to 100,000) for 4x1.6Ghz IA32+8GB, 4x1.6Ghz IA32+32GB, 4x1Ghz Itanium 2 +48GB]

64 bit Reduces IO, saves disks
• Large memory reduces IO
• 64-bit simplifies code
• Processors can be faster (wider word)
• Ram is cheap (4 GB ~ 1k$ to 20k$)
• Can trade ram for disk IO
• Better response time.
• Example
  – tpcC
    • 4x1Ghz Itanium2 vs
    • 4x1.6Ghz IA32
    • 40 extra GB → 60% extra throughput

34

AMD Hammer™ Coming Soon
• AMD Hammer™ is 64bit capable
• 2003: millions of Hammer™ CPUs will ship
• 2004: most AMD CPUs will be 64bit
• 4GB ram is less than 1,000$ today, less than 500$ in 2004
• Desktops (Hammer™) and servers (Opteron™).
• You do the math,…

Who will demand 64bit capable software?

35

A 1TB Main Memory
• Amdahl’s law: 1mips/MB, now 1:5
  so ~20 x 10 Ghz cpus need 1TB ram
• 1TB ram ~ 250k$ … 2m$ today
  ~ 25k$ … 200k$ in 5 years
• 128 million pages
  – Takes a LONG time to fill
  – Takes a LONG time to refill
• Needs new algorithms
• Needs parallel processing
• Which leads us to…
  – The memory hierarchy
  – smp
  – numa

36

Hyper-Threading: SMP on chip
• If cpu is always waiting for memory
  Predict memory requests and prefetch
  – done
• If cpu still always waiting for memory
  Multi-program it (multiple hardware threads per cpu)
  – Hyper Threading: Everything is SMP
  – 2 now, more later
  – Also multiple cpus/chip
• If your program is single threaded
  – You waste ½ the cpu and memory bandwidth
  – Eventually waste 80%
• App builders need to plan for threads.

37

The Memory Hierarchy
• Locality REALLY matters
• CPU 2 GHz, RAM at 5 MHz
  RAM is no longer random access.
• Organizing the code gives 3x (or more)
• Organizing the data gives 3x (or more)

Level       Latency (clocks)   Size
Registers   1                  1 KB
L1          2                  32 KB
L2          10                 256 KB
L3          30                 4 MB
Near RAM    100                16 GB
Far RAM     300                64 GB

38

[Figure: the memory hierarchy. Labels: registers, Arithmetic Logical Unit, Icache, Dcache, L1 cache, L2 cache, off chip, the bus, RAM, remote cache, remote RAM, other CPUs, disk, network]

39

Scaleup Systems: Non-Uniform Memory Architecture (NUMA)
Coherent but… remote memory is even slower
All cells see a common memory
Slow local main memory
Slower remote main memory
Scaleup by adding cells
Planning for 64 cpu, 1TB ram
Interconnect, Service Processor, Partition management are vendor specific
Several vendors doing this
Itanium and Hammer

[Figure: four cells, each with 4 CPUs, 4 memory banks and an I/O chipset, joined by a system interconnect (crossbar/switch), with a partition manager, config DB and service processors]

40

Changed Ratios Matter

• If everything changes by 2x, then nothing changes.
• So, it is the different rates that matter.

Improving FAST
  CPU speed
  Memory & disk size
  Network Bandwidth

Slowly changing
  Speed of light
  People costs
  Memory bandwidth
  WAN prices

41

Disks are becoming tapes
• Capacity:
  – 150 GB now, 300 GB this year, 1 TB by 2007
• Bandwidth:
  – 40 MBps now, 150 MBps by 2007
• Read time
  – 2 hours sequential, 2 days random now
  – 4 hours sequential, 12 days random by 2007

[Figure: disk now: 150 GB, 150 IO/s, 40 MBps; disk by 2007: 1 TB, 200 IO/s, 150 MBps]

42

Disks are becoming tapes: Consequences
• Use most disk capacity for archiving
  Copy on Write (COW) file system in Windows and other OSs.
• RAID10 saves arms, costs space (OK!).
• Backup to disk
  Pretend it is a 100GB disk + 1 TB disk
  – Keep hot 10% of data on fastest part of disk.
  – Keep cold 90% on colder part of disk

• Organize computations to read/write disks sequentially in large blocks.

43

Wiring is going serial and getting FAST!

• Gbps Ethernet and SATA built into chips

• Raid Controllers: inexpensive and fast.

• 1U storage bricks @ 2-10 TB

• SAN or NAS (iSCSI or CIFS/DAFS)

[Figure: storage brick with Ethernet at 100 MBps/link and 8 x SATA at 150 MBps/link]

44

NAS – SAN Horse Race
• Storage Hardware 1k$/TB/y
  Storage Management 10k$...300k$/TB/y
• So as with Server Consolidation
  Storage Consolidation
• Two styles:
  NAS (Network Attached Storage) File Server
  SAN (System Area Network) Disk Server
• I believe NAS is more manageable.

45

SAN/NAS Evolution

Modular

Monolithic

Sealed

46

IO Throughput: K Accesses Per Second vs. RPM
[Chart: Kaps vs. RPM; RPM axis from 0 to 20000, Kaps axis from 0 to 200]

47

Comparison Of Disk Cost: $’s for similar performance

Seagate Disk Prices*

Model #     Size      Speed      Connect.  Cost   $/K Rev
ATA 100     40 GB     5400 RPM   ATA       $86    $15.9
ATA 1000    40 GB     7200 RPM   ATA       $101   $14.0
36 ES 2     36.7 GB   10K RPM    SCSI      $325   $32.5
X15 36LP    36.7 GB   15K RPM    SCSI      $455   $29.7
X15 36LP    36.7 GB   15K RPM    Fibre     $455   $29.7

*Source: Seagate online store, quantity one prices

48

Comparison Of Disk Costs: ¢/MB for different systems

Mfg.      Size     Type       Cost    Cost/MB
Dell      80 GB    Int. ATA   $115    1.4¢
WD        120 GB   Ext. ATA   $276    2.3¢
Seagate   181 GB   Int SCSI   $1155   6.4¢
EMC       XX GB    SAN                xx¢

Source: Dell

49

Why Serial ATA Matters

• Modern interconnect

• Point-to-point drive connection

– 150 MBps → 300 MBps

• Facilitates ATA disk arrays

• Enables inexpensive “cool” storage

50

Performance (on Y2k SDSS data)

[Chart: time vs queryID; cpu and elapsed seconds (log scale, 1 to 1000) for queries Q08 Q01 Q09 Q10A Q19 Q12 Q10 Q20 Q16 Q02 Q13 Q04 Q06 Q11 Q15B Q17 Q07 Q14 Q15A Q05 Q03 Q18]

• Run times: on 15k$ HP Server (2 cpu, 1 GB, 8 disk)
• Some take 10 minutes
• Some take 1 minute
• Median ~ 22 sec.
• GHz processors are fast!
  – (10 mips/IO, 200 ins/byte)
  – 2.5 m rec/s/cpu

[Chart: cpu vs IO; IO count (1E+0 to 1E+7) vs. CPU sec (0.01 to 1,000), with a 1,000 IOs/cpu sec reference line]

~1,000 IO/cpu sec ~ 64 MB IO/cpu sec

51

NVO: How Will It Work?

• Define commonly used ‘atomic’ services

• Build higher level toolboxes/portals on top

• We do not build ‘everything for everybody’

• Use the 90-10 rule:
  – Define the standards and interfaces
  – Build the framework
  – Build the 10% of services that are used by 90%
  – Let the users build the rest from the components

[Chart: # of users (0 to 1) vs. # of services (0 to 0.5)]

52

Data Federations of Web Services
• Massive datasets live near their owners:
  – Near the instrument’s software pipeline
  – Near the applications
  – Near data knowledge and curation
  – Super Computer centers become Super Data Centers
• Each Archive publishes a web service
  – Schema: documents the data
  – Methods on objects (queries)
• Scientists get “personalized” extracts
• Uniform access to multiple Archives
  – A common global schema

[Figure label: Federation]

53

Grid and Web Services Synergy
• I believe the Grid will be many web services
  share data (computrons are free)
• IETF standards Provide
  – Naming
  – Authorization / Security / Privacy
  – Distributed Objects: Discovery, Definition, Invocation, Object Model
  – Higher level services: workflow, transactions, DB,..

• Synergy: commercial Internet & Grid tools

54

Web Services: The Key?
• Web SERVER:
  – Given a url + parameters
  – Returns a web page (often dynamic)
• Web SERVICE:
  – Given an XML document (soap msg)
  – Returns an XML document
  – Tools make this look like an RPC.
    • F(x,y,z) returns (u, v, w)
  – Distributed objects for the web.
  – + naming, discovery, security,..
• Internet-scale distributed computing
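To make the "XML document in, XML document out" pattern concrete, the Python sketch below builds a SOAP 1.1 envelope for a call F(x, y, z) and unpacks a reply (u, v, w). The service namespace, the `FResponse` element and the `fake_service` stand-in are all hypothetical; a real client would normally let tooling generate this from the WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SVC_NS = "urn:example:demo-service"                    # hypothetical service namespace


def build_message(method: str, **params) -> str:
    """Wrap a method name and its named values in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SVC_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SVC_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_message(xml_text: str) -> dict[str, str]:
    """Return the named values carried by the first element in the SOAP body."""
    body = ET.fromstring(xml_text).find(f"{{{SOAP_NS}}}Body")
    payload = body[0]
    return {child.tag.split("}")[1]: child.text for child in payload}


def fake_service(request_xml: str) -> str:
    """Local stand-in for the remote end: (u, v, w) = (x+y, y+z, z+x)."""
    args = {k: float(v) for k, v in parse_message(request_xml).items()}
    return build_message("FResponse",
                         u=args["x"] + args["y"],
                         v=args["y"] + args["z"],
                         w=args["z"] + args["x"])


if __name__ == "__main__":
    reply = fake_service(build_message("F", x=1, y=2, z=3))
    print(parse_message(reply))
```

The caller sees an ordinary function call with arguments and results; the envelope is the only thing that crosses the wire.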

[Figure: your program sends http to a Web Server and gets back a web page; your program sends soap to a Web Service and gets back an object in xml, which becomes data in your address space]

55

Grid?

• Harvesting spare cpu cycles is not important
  – They are “free” (1$/cpu day)
  – They need applications and data (which are not free) (1$/GB shipped)
• Accessing distributed data IS important
  – Send the programs to the data
  – Send the questions to the databases.
• Super Computer Centers become
  Super Data Centers
  Super Application Centers

56

The Grid: Foster & Kesselman (Argonne National Laboratory)

Internet computing and GRID technologies promise to change the way we tackle complex problems. They will enable large-scale aggregation and sharing of computational, data and other resources across institutional boundaries …. Transform scientific disciplines ranging from high energy physics to the life sciences

57

Grid/Globus

• Leader of the pack for GRID middleware

• Layered software toolkit
– 1: Grid Fabric (OS, TCP)
– 2: Grid Services
    Globus Resource Allocation Manager
    Globus Information Service (meta-computing directory service)
    Grid Security Infrastructure
    GridFTP
– 3: Application Toolkits
    Job submission
    MPICH-G2 message passing interface
– 4: Specific Applications
    OVERFLOW Navier-Stokes flow solver

58

Globus in gory detail

SHELL SCRIPTS

globus-mds-search '(&(hn=denali.mcs.anl.gov)(objectclass=GlobusSystemDynamicInformation))' cpuload1 |\
    sed -n -e '/^hn=/p' -e '/^cpuload1=/p' |\
    sed -e 's/,.*$//' -e 's/=/ /g' |\
    awk '/^hn/{printf "%s", $2} /^cpuload/{printf " %s\n", $2}'

if [ $# -eq 0 ]; then
    echo "provide argument <number of processes to start>" 1>&2
    exit 1
fi
if [ -z "$GRAMCONTACT" ] ; then
    GRAMCONTACT="`globus-hostname2contacts -type fork pitcairn.mcs.anl.gov`"
fi
pwd=`/bin/pwd`
rsl="&(executable=${pwd}/myjobtest)(count=$1)"
arch=`${GLOBUS_INSTALL_PATH}/sbin/config.guess`
${GLOBUS_INSTALL_PATH}/tools/${arch}/bin/globusrun -o -r "${GRAMCONTACT}" "${rsl}"

LIBRARIES

/* get process id and hostname */

pid = getpid();

rc = globus_libc_gethostname(hn, 256);

globus_assert(rc == GLOBUS_SUCCESS);

/* get current time and convert to string format. setting [25] to zero will strip the newline character. */

mytime = time(GLOBUS_NULL);

timestr = globus_libc_ctime_r( &mytime, buf, 30 );

timestr[25] = '\0';

globus_libc_printf("%s : process %d on %s came to life\n", timestr, pid, hn);

/*THE BARRIER!!! */

globus_duroc_runtime_barrier();

/*Passed the barrier: get current time again and print it out.*/

mytime = time(GLOBUS_NULL);

timestr = globus_libc_ctime_r( &mytime, buf, 30 );

globus_libc_printf("%s : process %d on %s passed the barrier\n", timestr, pid, hn);

/*TODO 1: get the layout of the DUROC job using first globus_duroc_runtime_intra_subjob_rank() and then globus_duroc_runtime_inter_subjob_structure(). */

/* We are done.*/

rc = globus_module_deactivate_all();

globus_assert(rc == GLOBUS_SUCCESS);

return 0;

59

Shielding Users

• Users do not want to deal with XML, they want their data

• Users do not want to deal with configuring grid computing, they want results

• SOAP: data appears in user memory, XML is invisible

• SOAP call: just a remote procedure

60

Atomic Services

• Metadata information about resources
– Waveband
– Sky coverage
– Translation of names to universal dictionary (UCD)

• Simple search patterns on the resources
– Cone Search
– Image mosaic
– Unit conversions

• Simple filtering, counting, histogramming

• On-the-fly recalibrations
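As an illustration of one of these atomic services, a cone search returns every object within a given angular radius of a sky position. The sketch below is a minimal, assumed version in Python: the catalog layout (a list of dicts with `ra` and `dec` in degrees) and the function names are hypothetical, and it scans linearly where a real service would use a spatial index.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, Dec) points in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # The haversine form stays accurate for the small separations typical of searches.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return the rows of `catalog` within `radius_deg` of (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg]


if __name__ == "__main__":
    # Hypothetical three-object catalog, positions in degrees.
    catalog = [
        {"id": 1, "ra": 181.30, "dec": -0.76},
        {"id": 2, "ra": 181.35, "dec": -0.76},
        {"id": 3, "ra": 190.00, "dec": 5.00},
    ]
    for row in cone_search(catalog, 181.3, -0.76, 0.1):
        print(row["id"])
```

Counting and histogramming, the other simple patterns listed, can then be applied to the rows a cone search returns.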

61

Higher Level Services

• Built on Atomic Services

• Perform more complex tasks

• Examples
– Automated resource discovery
– Cross-identifications
– Photometric redshifts
– Outlier detections
– Visualization facilities

• Expectation:
– Build custom portals in a matter of days from existing building blocks (like today in IRAF or IDL)

62

SkyQuery

• Distributed Query tool using a set of services

• Feasibility study, built in 6 weeks from scratch
– Tanu Malik (JHU CS grad student)
– Tamas Budavari (JHU astro postdoc)

• Implemented in C# and .NET

• Won 2nd prize of Microsoft XML Contest

• Allows queries like:

SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t)<3.5
AND AREA(181.3,-0.76,6.5)
AND o.type=3 AND (o.i - t.m_j)>2

63

Architecture

[Diagram components: Web Page, SkyQuery, Image cutout, SkyNode SDSS, SkyNode 2MASS, SkyNode FIRST]

64

Cross-id Steps

• Parse query
• Get counts
• Sort by counts
• Make plan
• Cross-match
– Recursively, from small to large
• Select necessary attributes only
• Return output
• Insert cutout image

SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t)<3.5
AND AREA(181.3,-0.76,6.5)
AND (o.i - t.m_j) > 2 AND o.type=3
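The planning steps (get counts, sort by counts, cross-match from small to large) can be sketched as a toy, in-memory example in Python. This is not the SkyQuery implementation: the archive dictionaries, the `separation_arcsec` helper, and the fixed `tol_arcsec` radius are assumptions made here, and the fixed radius merely stands in for the XMATCH criterion, whose exact definition the slides do not give.

```python
import math


def separation_arcsec(a, b):
    """Small-angle separation in arcseconds; ra/dec in degrees, no RA wraparound handling."""
    dra = (a["ra"] - b["ra"]) * math.cos(math.radians((a["dec"] + b["dec"]) / 2))
    ddec = a["dec"] - b["dec"]
    return math.hypot(dra, ddec) * 3600.0


def make_plan(archives):
    """Get counts and sort: archive names ordered from fewest to most objects."""
    return sorted(archives, key=lambda name: len(archives[name]))


def cross_match(archives, tol_arcsec):
    """Chain the match from the smallest archive to the largest.

    Each partial result is a tuple holding one object per archive visited so far.
    A tuple survives a step only if the next archive has an object close to the
    tuple's first member.
    """
    plan = make_plan(archives)
    partial = [(obj,) for obj in archives[plan[0]]]
    for name in plan[1:]:
        partial = [tup + (cand,)
                   for tup in partial
                   for cand in archives[name]
                   if separation_arcsec(tup[0], cand) <= tol_arcsec]
    return plan, partial


if __name__ == "__main__":
    # Hypothetical objects already restricted to the query area.
    archives = {
        "SDSS": [
            {"id": "s1", "ra": 181.3000, "dec": -0.7600},
            {"id": "s2", "ra": 181.4000, "dec": -0.7000},
            {"id": "s3", "ra": 181.2000, "dec": -0.8000},
        ],
        "TWOMASS": [{"id": "t1", "ra": 181.3002, "dec": -0.7601}],
    }
    plan, matches = cross_match(archives, tol_arcsec=3.5)
    print(plan)
    print([[obj["id"] for obj in tup] for tup in matches])
```

Starting from the archive with the fewest candidates keeps the intermediate result small, so less data has to move between nodes at each step.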

65

Show Cutout Web Service