Microsoft Large Databases and Grid Computing
Jim Gray, Microsoft Research
[email protected] http://research.Microsoft.com/~gray
Presentation to Kaiser Information Management Briefing
21 May 2003
About me
• In Microsoft Research (located in San Francisco)
• A database researcher
  – IBM, Tandem, DEC, Microsoft
• Work on Scalable Systems
  – Building supercomputers from commodity components.
• Do academic/government things too
  – PITAC, GriPhyN TAB, NSF/CISE, Library of Congress, …
• For the last 4 years, have been working with the astronomy community to build the World Wide Telescope.
Agenda
• TerraServer
  – What it is
  – What we learned
  – What we are doing now
• SkyServer / WWT
  – What it is
  – What we learned
  – What we are doing now
• Grid Computing
  – General comments
  – Build a web service
TerraServer
TerraService.net
• A photo of the United States
  – 1 meter resolution (photographic/topographic)
  – USGS data
  – Some demographic data (BestPlaces.net)
  – Home sales data
  – Linked to Encarta Encyclopedia
• 15 TB raw, 6 TB cooked (grows 10 GB/week)
• Point, pan, zoom interface
• Among top 1,000 websites
  – 40k visitors/day
  – 4M queries/day
  – 3 B page views (in 5 years)
• All in an SQL database
TerraServer Statistics
[Figure: timeline of TerraServer database configurations at June '98, Jan '99, Jan '00, May '00, Sept '01 and Dec '02. Software moves from SQL 7.0 to SQL 2000; database sizes range from 0.75 TB to 2.0 TB; row counts range from 173 M to 900 M; hardware moves from 1 server / Win NT 4.0 EE, to a 2nd server / Win 2k DataCenter, to a 4-node Win2k Datacenter failover cluster.]
                       Unique Users   Page Views      Image Tiles     Db Queries      Bytes Xfered
Daily Average          40,011         1,266,838       3,735,789       4,484,089       70 GB
Peak Day               277,292        2,401,209       12,388,104      10,475,674      163 GB
June 1998 - Oct 2002   63,656,904     2,015,539,605   5,943,641,024   7,134,186,170   108 TB
TerraServer Cluster
[Photo/diagram: racks hosting SQL\Inst1, SQL\Inst2, SQL\Inst3 and a Spare node, with volumes lettered E through S on storage units labeled 2200.]
• One SQL database per rack
• Each rack contains 4.5 TB
• 1 rack not in picture
• 18.0 TB total
• Meta data: stored on 101 GB, “Fast, Small Disks” (18 x 18.2 GB)
• Imagery data: stored on 4 339 GB, “Slow, Big Disks” (15 x 73.8 GB)
• Added 90 72.8 GB disks in Feb 2001 to create 18 TB SAN
• 8 Compaq DL360 “Photon” Web Servers
• Fiber SAN Switches
• 4 Compaq ProLiant 8500 Db Servers
Cluster Configuration
[Diagram: TerraServer SAN and network. Components shown:]
• Compaq SANswitch by Brocade Communications
• Compaq StorageWorks MA8000/HSG80 Controllers (3)
• Compaq ProLiant 8500 (4): Database Cluster
• Compaq DL360 (6) (Windows 2000 Web Servers) TerraServer.microsoft.com
• Compaq DL360 (10)
• ADIC LTO Tape Library
• Extreme Networks Summit 48 Switch
• Summit 7i Switch (2)
• Cisco 12000 Internet Router
• Links: 100-Mbps Ethernet, Gigabit Ethernet, Internet, Microsoft Corporate LAN
TerraServer Becomes a Web Service
TerraServer.net -> TerraService.Net
• Web server is for people.
• Web Service is for programs
  – The end of screen scraping
  – No faking a URL: pass real parameters.
  – No parsing the answer: data formatted into your address space.
• Hundreds of users but a specific example:
  – US Department of Agriculture
Vision: One Stop Shopping to Data Anywhere, Anytime, Anyplace
Business Applications Need Data
[Diagram labels: Customer Service Toolkit; Web Based Application; Public Access to Service Center Data; NASIS Soils; Ortho Photos; Common Land Units; State & County Data; GIS Critical Themes; Strategic Business Applications; Data Marts & Warehouses; APFO; NCG; States; ESRI ArcView]
• One stop shopping
• Site location
• Data selection
• Data extraction (cookie cutting) for vector, raster, and tabular
• Component architecture
• Data formatting including reprojection and MrSID compression
• Data packaging
• Data delivery including FTP, CD, and immediate download
• Public and internal security services
• Standards enforcement
• Automated retrieval under program control
• Compatibility with FGDC and Open GIS standards
• COTS or GOTS based
• Print map
And now.. 4 slides from the “customer” who built a portal using TerraService
Data Gateway Functional Overview
Navigation Service
Catalog Service
Ship Service
<<Requests Products>>
Item Broker
Customer Orders Data
XML
Order Placer
Listen for OrderPlacer raised event; select sequenced item; output XML; raise event: stats.delivery start
Validate (DTD); insert into SQL; @@Identity / GUID to client; return est. time; raise OrderMgr.event
Order Database
Selects from
XML request for data
Logger: called by anyone; raises to stats svc
ASP
XML
Soil Data Viewer
[Map: numbered land units and fields crossed by a pipeline buffer, with a per-field table of buffer area. Legend: Landunits; Fields Within Buffer; Buffer Area Within Fields; Pipelines. Scale 1:15840, with a scale bar in feet. Source: USDA NRCS Geospatial Data.]
Acknowledges item ready for delivery
Data Services
Package Service
Send order info
FTP Services
Rimage CD Service
Product Catalog Updates
Billing Services
NCGC - Fort Worth, Texas
ITC - Fort Collins, Colorado
TerraService
Custom End Product: Web Soil Data Viewer, XML Soil Report, Soil Interpretation Map
[Diagram labels: ESRI Spatial Data Engine; WebSDV; ArcIMS Connector; Image Retriever; IMSNavigator; Terraserver; Geospatial Data; Business Rules; National Soils Data]
• Connects to ArcIMS; communication is done through ArcIMS XML (AXL)
• Retrieves and processes Soils Data from the NASIS relational database
• Generates maps (JPGs) using ArcIMS
• Retrieves imagery from the Microsoft TerraServer
• Database Server - Microsoft SQL Server
• Database Server - ESRI Spatial Data Server
• Web Server - COM+ Applications
• Microsoft Terraserver
Brief tour of TerraService
• Show map service
• Show some methods
• See “TerraService.NET: An Introduction to Web Services,” Tom Barclay; Jim Gray; Eric Strand; Steve Ekblad; Jeffrey Richter, MSR TR 2002-53, pp 13, June 2002
What We Learned
• You can build and manage a very popular website with relatively little effort (if you do it right and have Tom Barclay)
• Loading 20 TB takes a lot of energy
• And you get to do it many times -- automate
• Tape and tape software are problematic
• Triplex and snap-shot disks work (we have never had to use them, but..)
• The Internet gives you 2 9’s
• Servers can run at 4 9’s easily, 5 9’s with effort.
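
To make the "9's" concrete, here is a minimal sketch that converts a number of nines into allowed downtime per year. It assumes the usual reading, where N nines means availability of 1 - 10^-N, and a 365-day year; the function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_per_year(nines: int) -> float:
    """Seconds of allowed downtime per year at the given number of nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 2 nines -> 1% unavailable
    return SECONDS_PER_YEAR * unavailability

for n in (2, 4, 5):
    hours = downtime_per_year(n) / 3600
    print(f"{n} nines: {hours:.2f} hours/year")
```

Under these assumptions 2 nines allows roughly 88 hours of downtime a year, while 4 nines allows under an hour, which is why the network, not the server, tends to dominate what a user sees.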
What we are doing now
• Building with 3K$ 2TB bricks
• 4 bricks = 1 backend
• Triplexing systems
• Duplexing sites
• 4*3*2 = 24k$ for Geoplex
• Very simple operations model
• See: “TeraScale SneakerNet: Using Inexpensive Disks for Backup, Archiving, and Data Exchange,” Jim Gray; Wyman Chong; Tom Barclay; Alex Szalay; Jan Vandenberg, pp. 1-8, May 2002
SkyServer
SkyServer.SDSS.org
• Like the TerraServer, but looking the other way: a picture of ¼ of the universe
• Pixels + Data Mining
• Astronomers get about 400 attributes for each “object”
• Get Spectrograms for 1% of the objects
Why Astronomy Data?
• It has no commercial value
  – No privacy concerns
  – Can freely share results with others
  – Great for experimenting with algorithms
• It is real and well documented
  – High-dimensional data (with confidence intervals)
  – Spatial data
  – Temporal data
• Many different instruments from many different places and many different times
• Federation is a goal
• The questions are interesting
  – How did the universe form?
• There is a lot of it (petabytes)
[Figure: the same sky region seen by different surveys and wavebands: IRAS 100 µm, ROSAT ~keV, DSS Optical, 2MASS 2 µm, IRAS 25 µm, NVSS 20cm, WENSS 92cm, GB 6cm]
Demo of SkyServer
• Shows standard web server
• Pixel/image data
• Point and click
• Explore one object
• Explore sets of objects (data mining)
Virtual Observatory
http://www.astro.caltech.edu/nvoconf/
http://www.voforum.org/
• Premise: Most data is (or could be) online
• So, the Internet is the world’s best telescope:
  – It has data on every part of the sky
  – In every measured spectral band: optical, x-ray, radio..
  – As deep as the best instruments (2 years ago).
  – It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, no..).
  – It’s a smart telescope: links objects and data to literature on them.
Time and Spectral Dimensions
The Multiwavelength Crab Nebula
X-ray, optical, infrared, and radio views of the nearby Crab Nebula, which is now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese astronomers.
Slide courtesy of Robert Brunner @ CalTech.
[Image label: Crab star 1053 AD]
Federation
Data Federations of Web Services
• Massive datasets live near their owners:
  – Near the instrument’s software pipeline
  – Near the applications
  – Near data knowledge and curation
  – Super Computer centers become Super Data Centers
• Each Archive publishes a web service
  – Schema: documents the data
  – Methods on objects (queries)
• Scientists get “personalized” extracts
• Uniform access to multiple Archives
  – A common global schema
Grid and Web Services Synergy
• I believe the Grid will be many web services that share data (computrons are free)
• IETF standards provide
  – Naming
  – Authorization / Security / Privacy
  – Distributed Objects: Discovery, Definition, Invocation, Object Model
  – Higher level services: workflow, transactions, DB, ..
• Synergy: commercial Internet & Grid tools
Web Services: The Key?
• Web SERVER:
  – Given a url + parameters
  – Returns a web page (often dynamic)
• Web SERVICE:
  – Given an XML document (SOAP msg)
  – Returns an XML document
  – Tools make this look like an RPC: F(x,y,z) returns (u, v, w)
  – Distributed objects for the web.
  – + naming, discovery, security, ..
• Internet-scale distributed computing
[Diagram: Your program talks http to a Web Server and gets a web page back. Your program talks SOAP to a Web Service and gets an object in XML back, as data in your address space.]
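
The "F(x,y,z) returns (u, v, w)" idea can be sketched with nothing but the Python standard library. This is an illustration, not the TerraService or SkyServer interface: the method name `F`, the application namespace `urn:example:demo`, and the canned reply are all hypothetical, and no network call is made. Only the SOAP 1.1 envelope namespace is real.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:demo"                            # hypothetical application namespace

def build_request(method: str, **params) -> bytes:
    """Wrap a method call and its named parameters in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")

def parse_response(xml_bytes: bytes) -> dict:
    """Turn the first element inside the SOAP Body into a plain dict."""
    root = ET.fromstring(xml_bytes)
    body = root.find(f"{{{SOAP_NS}}}Body")
    result = body[0]  # first child of Body is the response element
    return {child.tag.split("}")[1]: child.text for child in result}

request = build_request("F", x=1, y=2, z=3)

# Stand-in for the server's reply; a real client would POST `request` over HTTP.
canned_reply = (
    f'<s:Envelope xmlns:s="{SOAP_NS}" xmlns:a="{APP_NS}">'
    "<s:Body><a:FResponse><a:u>10</a:u><a:v>20</a:v><a:w>30</a:w></a:FResponse></s:Body>"
    "</s:Envelope>"
).encode("utf-8")

print(parse_response(canned_reply))
```

In practice a toolkit generates this plumbing from the service description, which is what makes the call look like an ordinary RPC; the sketch only shows that the answer arrives as named, structured values rather than a page to be scraped.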
SkyQuery: a prototype
• Defining Astronomy Objects and Methods.
• Federated 3 Web Services (fermilab/sdss, jhu/first, Cal Tech/dposs): multi-survey cross-match, distributed query optimization (T. Malik, T. Budavari, Alex Szalay @ JHU) http://SkyQuery.net/
• My first web service (cutout + annotated SDSS images) online
  – http://skyservice.pha.jhu.edu/devel/ImgCutout/chart.asp
• WWT is a great Web Services (.Net) application
  – Federating heterogeneous data sources.
  – Cooperating organizations
  – An Information At Your Fingertips challenge.
Demo of Image Cutout Service
• Shows image cutout
• Show project and debugging project
• Show hello World
• Show “theAnswer” method
SkyQuery (http://skyquery.net/)
• Distributed Query tool using a set of services
• Feasibility study, built in 6 weeks from scratch
  – Tanu Malik (JHU CS grad student)
  – Tamas Budavari (JHU astro postdoc)
• Implemented in C# and .NET
• Allows queries like:
    SELECT o.objId, o.r, o.type, t.objId
    FROM SDSS:PhotoPrimary o,
         TWOMASS:PhotoPrimary t
    WHERE XMATCH(o,t)<3.5
      AND AREA(181.3,-0.76,6.5)
      AND o.type=3 and (o.I - t.m_j)>2
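
A rough sketch of what a positional cross-match does, for readers who have not seen one. It is not the SkyQuery implementation: it treats the 3.5 threshold as a plain angular separation in arcseconds, which is an assumption (the real XMATCH may use a different, error-weighted measure), and it uses brute force where a real service would use a spatial index. The catalog rows are made-up values.

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation of two (ra, dec) positions given in degrees, in arcseconds."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula: numerically stable for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600

def cross_match(catalog_a, catalog_b, max_arcsec):
    """Return (id_a, id_b) pairs closer than max_arcsec. Brute force, O(n*m)."""
    return [(a["objId"], b["objId"])
            for a in catalog_a for b in catalog_b
            if angular_separation_arcsec(a["ra"], a["dec"], b["ra"], b["dec"]) < max_arcsec]

# Hypothetical rows standing in for two survey archives.
sdss = [{"objId": 1, "ra": 181.3000, "dec": -0.7600},
        {"objId": 2, "ra": 181.3100, "dec": -0.7500}]
twomass = [{"objId": 101, "ra": 181.3005, "dec": -0.7601}]

print(cross_match(sdss, twomass, 3.5))
```

The interesting part of SkyQuery is that the two catalogs live in different archives, so the join has to be planned across web services; the sketch only shows the matching predicate itself.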
SkyNode Basic Web Services
• Metadata information about resources
  – Waveband
  – Sky coverage
  – Translation of names to universal dictionary (UCD)
• Simple search patterns on the resources
  – Cone Search
  – Image mosaic
  – Unit conversions
• Simple filtering, counting, histogramming
• On-the-fly recalibrations
Portals: Higher Level Services
• Built on Atomic Services
• Perform more complex tasks
• Examples
  – Automated resource discovery
  – Cross-identifications
  – Photometric redshifts
  – Outlier detections
  – Visualization facilities
• Goal:
  – Build custom portals in days from existing building blocks (like today in IRAF or IDL)
Architecture
[Diagram: the SkyQuery web page calls three SkyNodes (SDSS, 2MASS, FIRST) and the Image cutout service.]
Summary So Far
• Some real web services deployed today
• Easy to build & deploy
• Services publish data, Portals unify it
• Tools really work!
• I’m using C# and the foundation classes of Visual Studio, a great tool!
• A nice book explaining the ideas: .Net Framework Essentials, Thai, Lam, ISBN 0-596-00302-1
Possible Relevance to You
• This web service stuff is REAL
• If you have a class, it is a way to publish data:
  – Internet
  – Intranet
• It is a way to find data
  – data comes with schema
  – no more screen scraping/parsing
• Business model unclear
  – Your ideas go here.
[Diagram: Your program talks SOAP to a Web Service and gets an object in XML back, as data in your address space.]
What We Learned
• Web services really are a breakthrough.
• Data mining worked beautifully. See “Data Mining the SDSS SkyServer Database,” J. Gray, D. Slutz, A. Szalay, A. Thakar, P. Kuntz, C. Stoughton, MSR TR 2002-1, pp 1-40, 2002.
• You can operate a system in Chicago from San Francisco: Terminal Server is wonderful.
• The Internet gives you 2 9’s of availability
• TeraScale SneakerNet works well
What we are doing now
• Loading more data (next data release)
• Preparing for the next generation
• Building the WWT
• Web Services for the Virtual Observatory, Alexander S. Szalay, Tamás Budavári, Tanu Malik, Jim Gray, and Ani Thakar, SPIE Astronomy Telescopes and Instruments, 22-28 August 2002, Waikoloa, Hawaii
• Petabyte Scale Data Mining: Dream or Reality?, Alexander S. Szalay; Jim Gray; Jan vandenBerg, SPIE Astronomy Telescopes and Instruments, 22-28 August 2002, Waikoloa, Hawaii
• Online Scientific Data Curation, Publication, and Archiving, Jim Gray; Alexander S. Szalay; Ani R. Thakar; Christopher Stoughton; Jan vandenBerg, SPIE Astronomy Telescopes and Instruments, 22-28 August 2002, Waikoloa, Hawaii
The Grid
• Computation Grid: harvest Internet cpus.
• Data Grid: share files
• Application Grid: Web services
• Access Grid: teleconferencing
The Microsoft View
• Web Services will subsume the Grid
  – The Grid will be data and services, not renting cycles
• OGSA: evolution of Globus Toolkit to Web services concepts and technologies…
• Lots of encouragement from Microsoft, IBM, Oracle, Sun
• GGF as forum for discussion
Engagement with Grid Community
• Goal: GXA as infrastructure for Grids
• Working with Globus & GGF
  – Funding work at Argonne National Lab (Globus)
  – Globus Toolkit 3, and CondorG on Windows
    • http://www.globus.org/win-alpha/ (we sponsored this)
  – OGSA for .NET (prototyping)
    • http://www.globus.org/ogsa/
  – Also OGSI.NET at U. VA is very interesting
    • http://www.cs.virginia.edu/~gsw2c/ogsi.net.html
  – GGF
    • Active membership
• HPC .net kit – see http://www.microsoft.com/HPC
  – Part of .net server scale out development
  – Includes MPI-CH 1.2.4, distributed job scheduler, …
  – Thomas Sterling, Beowulf on Windows, MIT Press 2001
What’s Microsoft Doing
• Mostly .NET, W3C standards, web services, …
• I think SkyQuery is the best web service (grid app) in GriPhyN today.
• My stuff is grid computing
• But…
• Globus (GT3), OGSA, and CondorG ported to Windows (we sponsored it)
• We have a HPC toolkit: MPI-CH 1.2.4
• See http://www.microsoft.com/windows2000/hpc/ for many useful links
I Can Talk About Computing on Demand But… Best to read
• Distributed Computing Economics, Jim Gray, MSR-TR-2003-24, March 2003
• The slides that follow are based on that paper.
Distributed Computing Economics
• Why is Seti@Home a great idea?
• Why is Napster a great deal?
• Why is the Computational Grid uneconomic?
• When does computing on demand work?
• What is the “right” level of abstraction?
• Is the Access Grid the real killer app?
Computing is Free
• Computers cost 1k$ (if you shop right)
• So 1 cpu day == 1$
• If you pay the phone bill (and I do), Internet bandwidth costs 50 … 500$/Mbps/month (not including routers and management).
• So 1 GB costs 1$ to send and 1$ to receive
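
The two rules of thumb above can be checked with a few lines of arithmetic. This is a back-of-envelope sketch, not the paper's calculation: the 3-year machine lifetime, full utilization of the link, and the 30-day month are assumptions, and the function names are made up for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600

def cpu_day_cost(computer_price_usd, lifetime_years):
    """Capital cost of one day of a computer amortized over its lifetime."""
    return computer_price_usd / (lifetime_years * 365)

def cost_per_gb(usd_per_mbps_month):
    """Cost to move 1 GB (1e9 bytes) over a link rented by the Mbps-month, fully used."""
    gb_per_month = 1e6 / 8 * SECONDS_PER_MONTH / 1e9  # 1 Mbps for a month = 324 GB
    return usd_per_mbps_month / gb_per_month

print(f"cpu day: {cpu_day_cost(1000, 3):.2f} $")
for rate in (50, 500):
    print(f"{rate} $/Mbps/month -> {cost_per_gb(rate):.2f} $/GB")
```

With these assumptions a cpu day comes to about 0.9$, and the 50 to 500$ range works out to roughly 0.15$ to 1.5$ per GB, so "1$ per cpu day" and "1$ per GB" are both reasonable round numbers.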
Why is Seti@Home a Good Deal?
• Sending 300 KB costs 3e-4$
• User computes for ½ day: benefit 5e-1$
• ROI: 1500:1
Why is Napster a Good Deal?
• Sending 5 MB costs 5e-3$
• ½ a penny per song
• Both sender and receiver can afford it.
• Same logic powers web sites (Yahoo!...):
  – 1e-3$/page view advertising revenue
  – 1e-5$/page view cost of serving web page
  – 100:1 ROI
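
The Seti@Home, Napster, and web-site numbers all follow from the two prices quoted earlier in the talk (1$ per GB sent, 1$ per cpu day). A short sketch of the arithmetic, with variable names invented for illustration:

```python
USD_PER_GB = 1.0       # from the talk: 1 GB costs 1$ to send
USD_PER_CPU_DAY = 1.0  # from the talk: 1 cpu day == 1$

def transfer_cost(num_bytes):
    """Cost in dollars of sending num_bytes at the quoted network price."""
    return num_bytes / 1e9 * USD_PER_GB

# Seti@Home: ship a 300 KB work unit, get half a cpu day of work back.
seti_cost = transfer_cost(300e3)
seti_benefit = 0.5 * USD_PER_CPU_DAY
seti_roi = seti_benefit / seti_cost

# Napster: ship a 5 MB song.
napster_cost = transfer_cost(5e6)

# Web site: advertising revenue versus serving cost per page view.
web_roi = 1e-3 / 1e-5

print(f"Seti@Home: cost {seti_cost:.0e} $, benefit {seti_benefit} $, ROI {seti_roi:.0f}:1")
print(f"Napster: {napster_cost:.0e} $ per song")
print(f"Web page: ROI {web_roi:.0f}:1")
```

The Seti@Home ratio comes out near 1,700:1 with these round inputs, the same order of magnitude as the 1500:1 quoted on the slide.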
The Cost of Computing: Computers are NOT free!
• Capital cost of a TpcC system is mostly storage and storage software (database)
• IBM 32 cpu, 512 GB ram, 2,500 disks, 43 TB (680,613 tpmC @ 11.13 $/tpmC, available 11/08/03) http://www.tpc.org/results/individual_results/IBM/IBMp690es_05092003.pdf
• A 7.5M$ super-computer
• Total Data Center Cost:
  – 40% capital & facilities
  – 60% staff (includes app development)
TpcC Cost Components DB2/AIX
http://www.tpc.org/results/individual_results/IBM/IBMp690es_05092003.pdf
[Pie chart: cpu/mem 29%, storage 61%, software 10%]
Computing Equivalents
1 $ buys
• 1 day of cpu time
• 4 GB ram for a day
• 1 GB of network bandwidth
• 1 GB of disk storage
• 10 M database accesses
• 10 TB of disk access (sequential)
• 10 TB of LAN bandwidth (bulk)
Some consequences
• Beowulf networking is 10,000x cheaper than WAN networking; factors of 10^5 matter.
• The cheapest and fastest way to move a Terabyte cross country is sneakernet. 24 hours = 4 MB/s; 50$ shipping vs 1,000$ WAN cost.
• Sending 10 PB of CERN data via network is silly: buy disk bricks in Geneva, fill them, ship them, one way.
• See: “TeraScale SneakerNet: Using Inexpensive Disks for Backup, Archiving, and Data Exchange,” Jim Gray; Wyman Chong; Tom Barclay; Alex Szalay; Jan vandenBerg, Microsoft Technical Report, May 2002, MSR-TR-2002-54 http://research.microsoft.com/research/pubs/view.aspx?tr_id=569
How Do You Move A Terabyte?
Context      Speed (Mbps)   Rent ($/month)   $/Mbps   $/TB sent   Time/TB
Home phone   0.04           40               1,000    3,086       6 years
Home DSL     0.6            70               117      360         5 months
T1           1.5            1,200            800      2,469       2 months
T3           43             28,000           651      2,010       2 days
OC3          155            49,000           316      976         14 hours
100 Mbps     100                                                  1 day
Gbps         1000                                                 2.2 hours
OC 192       9600           1,920,000        200      617         14 minutes
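
The $/TB and Time/TB columns can be derived from the speed and rent columns. The sketch below assumes 1 TB = 1e12 bytes, a fully used link, and a 30-day month; with those assumptions it reproduces the $/TB figures in the table. The speeds and rents are taken from the slide; the function and constant names are invented for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600
MEGABITS_PER_TB = 8e6  # 1 TB = 1e12 bytes = 8e6 megabits

# context: (speed in Mbps, rent in $/month), taken from the table
LINKS = {
    "Home phone": (0.04, 40),
    "Home DSL": (0.6, 70),
    "T1": (1.5, 1200),
    "T3": (43, 28000),
    "OC3": (155, 49000),
    "OC 192": (9600, 1920000),
}

def seconds_per_tb(mbps):
    """Time to push one terabyte through a link running flat out."""
    return MEGABITS_PER_TB / mbps

def dollars_per_tb(mbps, rent_per_month):
    """Share of the monthly rent consumed while sending one terabyte."""
    return rent_per_month * seconds_per_tb(mbps) / SECONDS_PER_MONTH

for name, (mbps, rent) in LINKS.items():
    days = seconds_per_tb(mbps) / 86400
    print(f"{name:10s} {days:10.2f} days/TB  {dollars_per_tb(mbps, rent):8.0f} $/TB")
```

Note that faster links are cheaper per terabyte only up to a point: an OC 192 moves a terabyte in minutes, but its $/TB is still hundreds of dollars, which is the comparison behind the sneakernet argument.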
Computational Grid Economics
• To the extent that the computational grid is like Seti@Home or ZetaNet or Folding@home or… it is a great thing
• To the extent that the computational grid is MPI or data analysis, it fails on economic grounds: move the programs to the data, not the data to the programs.
• The Internet is NOT the cpu backplane.
• The USG should not hide this economic fact from the academic/scientific research community.
Computing on Demand
• Was called outsourcing / service bureaus in my youth. CSC and IBM did it.
• Payroll is standard outsource.
• Now we have Hotmail, Salesforce.com, Oracle.com, ….
• Works for standard apps.
• Airlines outsource reservations. Banks outsource ATMs.
• But Amazon, Amex, Wal-Mart, ... can’t outsource their core competence.
• So, COD works for commoditized services.
• It is not a new way of doing things: think payroll.
What’s the right abstraction level for Internet Scale Distributed Computing?
• Disk block? No, too low.
• File? No, too low.
• Database? No, too low.
• Application? Yes, of course.
  – Blast search
  – Google search
  – Send/Get eMail
  – Portals that federate astronomy archives (http://skyQuery.Net/)
• Web Services (.NET, EJB, OGSA) give this abstraction level.
Access Grid
• Q: What comes after the telephone?
• A: eMail?
• A: Instant messaging?
• Both seem retro technology: text & emoticons.
• Access Grid could revolutionize human communication.
• But, it needs a new idea.
• Q: What comes after the telephone?
Distributed Computing Economics
• Why is Seti@Home a great idea?
• Why is Napster a great deal?
• Why is the Computational Grid uneconomic?
• When does computing on demand work?
• What is the “right” level of abstraction?
• Is the Access Grid the real killer app?
Based on: Distributed Computing Economics, Jim Gray, Microsoft Tech report, March 2003, MSR-TR-2003-24 http://research.microsoft.com/research/pubs/view.aspx?tr_id=655