Google Scale Data Management

33/27/27

(partially based on slides by Qing Li)

Google (..a lecture on that??)

A good example of where a good understanding of computer scienceand some entrepreneurial spirit can take you..

44/27/27

Google (what?..couldn’t they come up with a better name?)

..doesn’t mean anything, right? ...or does it mean “search” in French?... Surprising, but “Google” actually means stg.

• “Googol” is the name of a “number”…

• A big one : 1 followed by 100 zeros

..this reflects the company's mission....and the reality of the world:

• …there are a whole lot of “data” out there…

• …and whoever helps people (e.g. me) manage it effectively deserves to be rich!

55/27/27

..in fact there is more to Google than search...

These are data intensive applications

66/27/27

Crawling

Question #1: How can it know so much?...because it crawls…

Database

77/27/27

A whole lot of data

One SearchEngine

A whole lot of users

Question #2: How can it be so fast?

88/27/27

Crawling

d1d3K2

d1d2K1

Question #1: How can it be so fast?...because it indexes…

Indexing

Indexes: quick lookup tables(like the index pages in a book)

Database

99/27/27

Database(copies of all web pages!)

Front-end web server

1010/27/27

Index servers

use a lot of computers…but, intelligently! (parallelism) …organize data for fast access (databases, indexing)

1111/27/27

Index servers Content servers

1212/27/27

Index servers Content servers

1313/27/27

Index servers

use a lot of computers…but, intelligently! (parallelism) …organize data for fast access (databases, indexing) …most people want to know the same thing anyhow (caching)

CACHE(copies of recent results)

Content servers

1414/27/27

Question 3: How can it serve the most relevant page?

ANALYZE!!!!

1515/27/27

Text analysis• Analyze the content of a page to determine if it is

relevant to the query or not

1616/27/27

Link analysis• Analyze the (incoming and outgoing) links of pages to

determine if the page is “worthy” or not

1717/27/27

Link analysis• Analyze the (incoming and outgoing) links of pages to

determine if the page is “worthy” or not

User analysis• Analyze the users search context to guess what the

user wants to do to serve the most relevant content (and the advertisement!!)

1818/27/27

Text Analysis

Text analysis

Rawtext

CrawlingWeb

2121/27/27

Term Extraction

Extraction of index terms• Stemming

• “informed”, “informs” inform

2222/27/27

Term Extraction

• Compound word identification • {“white”, “house”} or {“white house”} ????

2323/27/27

Term Extraction

• Compound word identification • {“white”, “house”} or {“white house”} ????

• Removal of stop words• “a”, “an”, “the”, “is”, “are”, “am”, …

• they do not help discriminate between relevant and irrelevant documents

2424/27/27

Stop words

Those terms that occur too frequently in a language are not good discriminators for search

Rank (of the term)

Frequencyof usage

in english language

stop words

the an ASU

2626/27/27

“Inverted” index

search

Google

Directory

Doc #1---------------

Doc #2---------------

Doc #5---------------

Query“ASU”

DocumentCollection

requires “fast” string search!!(hashes, search trees)

2828/27/27

What are the weights????

They need to capture how good the term is in describing the content of the page

t1 better than t2

),"("_),"("

Docinpagekeywordsofnumber

DocASUinpagenumberDocASUtf

term-frequency

Document

2929/27/27

They need to capture how discriminating the term is (remember the stop words??)

t2 better than t1

3030/27/27

They need to capture how discriminating the term is (remember the stop words??)

t2 better than t1

)"("___

__log)"("

ASUwithdocumentofnumber

documentsofnumberASUidf

inverse-document-frequency

3131/27/27

They need to capture how • good the term is in describing the content of

the page

• discriminating the term is..

)"("),"("),"(" ASUidfDocASUtfDocASUtfidf

3333/27/27

How are the results ranked based on relevance?

3838/27/27

Vectors…what are they???

Page with “ASU” weight <0.5>, “CSE” weight <0.7>, and “Selcuk” weight <0.9>

Selcuk

<0.5,0.7.0.9>

3939/27/27

A web database (a vector space!)

Database

Selcuk

4040/27/27

A web database (a vector space!)

Query “ASU CSE Selcuk”

Selcuk

4141/27/27

Measuring relevance: “Find 2 closest vectors”

Selcuk

Requires “multidimensional” index structures

Query “ASU CSE Selcuk”

4242/27/27

How about “importance” of a page…can we measure it?

4343/27/27

We might be able to learn how important a page is by studying its connectivity (“link analysis”)

more links to a page implies a better page

4444/27/27

We might be able to learn how important a page is by studying its connectivity (“link analysis”)

Links from a more important page should count more than links from a weaker page

4545/27/27

Hubs and authorities

•Good hubs should point to good authorities

•Good authorities must be pointed by good hubs.

“Graph theory” helps analyze large networks of links

4646/27/27

PageRank

PR(A) = PR(C)/1

4747/27/27

PageRank

PR(A) = PR(C)/1

PR(B) = PR(A) / 2

4848/27/27

PageRank

PR(A) = PR(C)/1

PR(B) = PR(A) / 2

PR(C) = PR(A) / 2 + PR(B)/1

These types of problems are generally solved using “linear algebra”

5050/27/27

How to compute ranks??

PageRank (PR) definition is recursive• PageRank of a page depends on and influences PageRanks

of other pages

5151/27/27

How to compute ranks??

PageRank (PR) definition is recursive

• PageRank of a page depends on and influences PageRanks of other pages

Solved iteratively• To compute PageRank:

• Choose arbitrary initial PR_old and use it to compute PR_new

• Repeat, setting PR_old to PR_new until PR converges (the difference between old and new PR is sufficiently small)

• Rank values typically converge in 50-100 iterations

5252/27/27

Related CSE courses 185 Introduction to Web Development 205 Object-Oriented Programming and Data Structures 310 Data Structures and Algorithms 355 Introduction to Theoretical Computer Science 412 Database Management 414 Advanced Database Concepts 450 Design and Analysis of Algorithms 457 Theory of Formal Languages 471 Introduction to Artificial Intelligence 494 Cognitive Systems and Intelligent Agents 494 Information Retrieval, Mining 494 Numerical Linear Algebra Data Exploration 494 Introduction to High Performance Computing

5353/27/27

Related CSE courses 185 Introduction to Web Development 205 Object-Oriented Programming and Data Structures 310 Data Structures and Algorithms 355 Introduction to Theoretical Computer Science 412 Database Management 414 Advanced Database Concepts 450 Design and Analysis of Algorithms 457 Theory of Formal Languages 471 Introduction to Artificial Intelligence 494 Cognitive Systems and Intelligent Agents 494 Information Retrieval, Mining 494 Numerical Linear Algebra Data Exploration 494 Introduction to High Performance Computing

5454/27/27

Reference L. Page and S. Brin,The PageRank citation ranking: Bringing

order to the web , Stanford Digital Library Technique, Working paper 1999-0120, 1998.

L. Page and S. Brin. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Web Conference (WWW 98), 1998.

Google Scale Data Management

Documents

Software Systems at Scale · 2019-05-03 · None of the following slides depict an actual Google service or actual Google technologies. They are indicative of the scale at which Google

Large-scale Graph Mining @ Google NYdimacs.rutgers.edu/Workshops/ParallelAlgorithms/Slides/vahab-dima… · Large-scale graph mining. Google NYC Large-scale graph mining Develop a

Development at the Speed and Scale of Google

Google Eart - GJH · Google Eart Name of the program: Google Earth Manufacturer: Google Inc. Version: 5.1.3533.1731 ... Scale Legend Software Review David Hoffman, MYP IV ... Microstation

Monarch: Google’s Planet-Scale In-Memory Time Series …Google LLC monarch-paper@google.com ABSTRACT Monarch is a globally-distributed in-memory time series data-base system in Google

Serverless Data Architecture at scale on Google Cloud Platform - Lorenzo Ridi - Codemotion Milan 2016

Google Data Highlighter

Processing large-scale graphs with Google Pregel

2013-03-24 Continuous Integration at Google Scale

Building a Distributed Build System at Google Scale

Google Cloud computing at scale. o Punch scaled test … · 2020. 3. 9. · Google Cloud computing at scale. Google Cloud Platform engineering, Platform engineering, Developer Operations,

IoT at Google Scale

Planetary-Scale Geospatial Data Analysis with Google Earth Engine

Machine Learning at Google Scale - QConSP · Machine Learning at Google Scale ML APIs and TensorFlow. Michel Pereira Google Cloud Customer Engineer @michelpereira@ What is Neural

Infinite Scale - Introduction to Google App Engine

Google Scale Data Management - public.asu.edu · Google BigTable BigTable is a fast and extremely large-scale database management system; It is a compressed, high performance, and

Development at the Speed and Scale of Google - QCon

Google data centers

Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights

Processing large-scale graphs with Google(TM) Pregel