Large Scale Data Processing with DryadLINQ. Dennis Fetterly, Microsoft Research, Silicon Valley. Workshop on Data-Intensive Scientific Computing Using DryadLINQ.


Large Scale Data Processing with DryadLINQ

Dennis FetterlyMicrosoft Research, Silicon Valley

Workshop on Data-Intensive Scientific Computing Using

DryadLINQ

Outline

• Brief introduction to TidyFS
• Preparing/loading data onto a cluster
• Desirable properties in a Dryad cluster
• Detailed description of several IR algorithms

TidyFS goals

• A simple distributed filesystem that provides the abstractions necessary for data parallel computations

• High performance, reliable, scalable service
• Workload
– High throughput, sequential IO, write once
– Cluster machines working in parallel
– Terasort
• 240 machines reading at 240 MB/s = 56 GB/s
• 240 machines writing at 160 MB/s = 37 GB/s
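The aggregate figures follow directly from the per-machine rates; a quick check (assuming the slide rounds with 1 GB = 1024 MB):

```python
# Check the slide's aggregate Terasort throughput figures.
machines = 240
read_mb_s = 240    # per-machine sequential read rate, MB/s
write_mb_s = 160   # per-machine sequential write rate, MB/s

# Using binary GB (1 GB = 1024 MB), as the slide's rounding suggests.
agg_read_gb_s = machines * read_mb_s / 1024
agg_write_gb_s = machines * write_mb_s / 1024
print(agg_read_gb_s, agg_write_gb_s)  # 56.25 37.5
```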

TidyFS Names

• Stream: a sequence of partitions
– e.g. tidyfs://dryadlinqusers/fetterly/clueweb09-English
– Can have leases for temp files or cleanup from crashes

• Partition:
– Immutable
– 64-bit identifier
– Can be a member of multiple streams
– Stored as an NTFS file on cluster machines
– Multiple replicas of each partition can be stored

[Diagram: Stream-1 is the sequence of partitions Part 1, Part 2, Part 3, Part 4]

Preparation of Data

• Often substantially harder than it appears
• Issues:
– Data format
– Distribution of data
– Network bandwidth

• Generating synthetic datasets is sometimes useful

Data Prep – Format

• Text records are simplest
– Caveat: information that is not in the line
• e.g. if a line number encodes information

• Binary records often require custom code to load onto the cluster
– Serialization/deserialization code generated by DryadLINQ uses C# reflection

Custom Deserialization Code

public class UrlDocIdScoreQuery
{
    public string queryId;
    public string url;
    public string docId;
    public string queryString;
    public double score;

    public static UrlDocIdScoreQuery Read(DryadBinaryReader reader)
    {
        UrlDocIdScoreQuery rec = new UrlDocIdScoreQuery();
        rec.queryId = ReadAnyString(reader);
        rec.queryString = ReadAnyString(reader);
        rec.url = ReadAnyString(reader);
        rec.docId = ReadAnyString(reader);
        rec.score = reader.ReadDouble();
        return rec;
    }

    public static string ReadAnyString(DryadBinaryReader dbr) {…}
}
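A local Python analogue of the reader above, as a minimal sketch. The wire format here (4-byte little-endian length prefix plus UTF-8 bytes per string, 8-byte double for the score) is an assumption for illustration, not DryadLINQ's actual serialization format:

```python
import io
import struct

def write_string(buf, s):
    # Assumed layout: 4-byte little-endian length, then UTF-8 bytes.
    data = s.encode("utf-8")
    buf.write(struct.pack("<I", len(data)))
    buf.write(data)

def read_string(buf):
    (n,) = struct.unpack("<I", buf.read(4))
    return buf.read(n).decode("utf-8")

def read_record(buf):
    # Field order mirrors the C# Read method: queryId, queryString,
    # url, docId, then the double score.
    rec = {name: read_string(buf)
           for name in ("queryId", "queryString", "url", "docId")}
    (rec["score"],) = struct.unpack("<d", buf.read(8))
    return rec

# Round-trip one record through an in-memory buffer.
buf = io.BytesIO()
for field in ("q42", "dryad linq", "http://example.com/", "doc-7"):
    write_string(buf, field)
buf.write(struct.pack("<d", 0.83))
buf.seek(0)
rec = read_record(buf)
print(rec["url"], rec["score"])
```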

Data Prep - Loading

• DryadLINQ job
– Often needs a dummy input anchor

• Custom program
– Write records to TidyFS partitions

• “SneakerNet” often a good option

Data Loading - DryadLINQ

• Need input "anchor" to run on cluster
– Generate or use existing stream
• Sample:

IEnumerable<Entry> GenerateEntries(Random x, int numItems)
{
    for (int i = 0; i < numItems; i++)
    {
        // code to generate records
        yield return record;
    }
}


DryadLINQ Job

var streamname = "tidyfs://datasets/anchor";
var os = @"tidyfs://msri/teamname/data?compression=" + CompressionScheme.GZipFast;
var r = PartitionedTable.Get<int>(streamname)
        .Take(1)
        .SelectMany(x => Enumerable.Range(0, partitions))
        .HashPartition(x => x, partitions)
        .Select(x => new Random(x))
        .SelectMany(x => GenerateEntries(x, numItems))
        .ToPartitionedTable(os);

Data Loading - Databases

• Bulk copy into files
– Use queries to produce multiple files

• Perform queries within a DryadLINQ UDF

IEnumerable<Entry> PerformQuery(string queryArg)
{
    var results = "select * from …";
    foreach (var record in results)
    {
        yield return record;
    }
}
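The query-inside-a-UDF pattern can be sketched locally in Python, with sqlite3 standing in for the real database (the table name, columns, and threshold are made up for the example):

```python
import sqlite3

def perform_query(conn, min_score):
    # Generator: stream rows out of the database one record at a time,
    # like the C# iterator (yield return) above.
    cur = conn.execute(
        "SELECT url, score FROM docs WHERE score >= ? ORDER BY url",
        (min_score,))
    for url, score in cur:
        yield {"url": url, "score": score}

# Illustrative in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (url TEXT, score REAL)")
conn.executemany("INSERT INTO docs VALUES (?, ?)",
                 [("a", 0.9), ("b", 0.2), ("c", 0.7)])
rows = list(perform_query(conn, 0.5))
print(rows)  # [{'url': 'a', 'score': 0.9}, {'url': 'c', 'score': 0.7}]
```

Streaming via a generator matters here: the vertex never has to materialize the whole result set in memory.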

Building a cluster

• Overall goal: a high-throughput system
– Not latency sensitive
• More slower computers are often better than fewer faster computers
• Multiple cores better than high clock frequency
• Multiple disks increase throughput
• Sufficient RAM

Networking a Cluster

• Network topology for medium to large clusters
– Attempt to maximize cross-rack bandwidth
– Two-tier topology
• Rack switches and core switches
• Port aggregation
– Bond multiple connections together
• 1 GbE or 10 GbE

Cluster Software

• Runs on Windows HPC Server 2008
• Academic Release
– For non-commercial use
• Commercial License

DryadLINQ IR Toolkit

• Library that uses DryadLINQ
• Source code for a number of IR algorithms
– Text retrieval: BM25/BM25F
– Link-based ranking: PageRank/SALSA-SETR
– Text processing: shingle-based duplicate detection
• Designed to work well with the ClueWeb09 collection
– Including preprocessing the data to load the cluster
• Available from http://research.microsoft.com/dryadlinqir/

ClueWeb09 Collection

• Collected/distributed by CMU
• 1 billion web pages crawled in Jan/Feb 2009
• 10 different languages
– en, zh, es, ja, de, fr, ko, it, pt, ar
• 5 TB compressed, 25 TB uncompressed
• Available to the research community
• Dataset available for your projects
– Web graph, 503m English web pages

Example: Term Frequencies

Count term frequencies in a set of documents:

var docs = new PartitionedTable<Doc>("tidyfs://dennis/docs");
var words = docs.SelectMany(doc => doc.words);
var groups = words.GroupBy(word => word);
var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
counts.ToPartitionedTable("tidyfs://dennis/counts.txt");
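The same SelectMany/GroupBy/Select pipeline can be sketched locally in Python (a hedged analogue; the two tiny documents are made up for illustration):

```python
from collections import Counter

# Stand-in for the partitioned table of documents.
docs = [
    {"words": ["dryad", "linq", "dryad"]},
    {"words": ["cluster", "dryad"]},
]

# SelectMany: flatten each document into its words.
words = [w for doc in docs for w in doc["words"]]
# GroupBy + Select(g => Count): count occurrences per word.
counts = Counter(words)
print(counts)  # Counter({'dryad': 3, 'linq': 1, 'cluster': 1})
```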

[Diagram: query plan for term frequency: IN (metadata) feeds SM (doc => doc.words), then GB (word => word), then S (g => new …), then OUT (metadata)]

Distributed Execution of Term Freq

[Diagram: the LINQ expression (SM, GB, S) is handed to DryadLINQ, which compiles it into a Dryad execution graph running from IN to OUT]

Execution Plan for Term Frequency

[Diagram, stage (1): the logical plan SM → GB → S expands into SM (SelectMany), Q (Sort), GB (GroupBy), C (Count), D (Distribute), then MS (Mergesort), GB (GroupBy), Sum. The operators before the Distribute are pipelined, and so are the operators after it.]

Execution Plan for Term Frequency

[Diagram, stage (2): the same expanded plan instantiated once per input partition; each of the four partitions runs SM → Q → GB → C → D, and four MS → GB → Sum branches aggregate the redistributed groups.]

BM25 "Grep"

• For batch evaluation of queries, calculating BM25 is just a select operation

string queryTermDocFreqURLLocal = @"E:\TREC\query-doc-freqs.txt";
Dictionary<string, int> dfs = GetDocFreqs(queryTermDocFreqURLLocal);
PartitionedTable<InitialWordRecord> initialWords =
    PartitionedTable.Get<InitialWordRecord>(initialWordsURL);

var BM25s = from doc in initialWords
            select ComputeDocBM25(queries, doc, dfs);

BM25s.ToPartitionedTable("tidyfs://dennis/scoredDocs");
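To make the "just a select" point concrete: once per-document term frequencies and global document frequencies are in hand, BM25 is a pure per-record function. A minimal single-field scorer sketch (k1, b, and all data below are illustrative, not taken from the toolkit):

```python
import math

def bm25(query_terms, doc_tf, doc_len, avg_len, dfs, n_docs,
         k1=1.2, b=0.75):
    # Sum per-term contributions: idf(term) times a saturated,
    # length-normalized term frequency.
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        df = dfs.get(term)
        if tf == 0 or df is None:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        score += idf * norm
    return score

dfs = {"dryad": 3, "linq": 1}      # global document frequencies
doc_tf = {"dryad": 2, "linq": 1}   # term frequencies for one document
score = bm25(["dryad", "linq"], doc_tf, doc_len=10, avg_len=12,
             dfs=dfs, n_docs=100)
print(score > 0)  # True
```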

PageRank

Ranks web pages by propagating scores along the hyperlink structure.

Each iteration as an SQL query:
1. Join edges with ranks
2. Distribute rank on edges
3. GroupBy edge destination
4. Aggregate into ranks
5. Repeat

One PageRank Step in DryadLINQ

// one step of pagerank: dispersing and re-accumulating rank
public static IQueryable<Rank> PRStep(IQueryable<Page> pages,
                                      IQueryable<Rank> ranks)
{
    // join pages with ranks, and disperse updates
    var updates = from page in pages
                  join rank in ranks on page.name equals rank.name
                  select page.Disperse(rank);

    // re-accumulate
    return from list in updates
           from rank in list
           group rank.rank by rank.name into g
           select new Rank(g.Key, g.Sum());
}

A Complete DryadLINQ Program

var pages = DryadLinq.GetTable<Page>("tidyfs://pages.txt");
// repeat the iterative computation several times
var ranks = pages.Select(page => new Rank(page.name, 1.0));
for (int iter = 0; iter < iterations; iter++)
{
    ranks = PRStep(pages, ranks);
}
ranks.ToDryadTable<Rank>("outputranks.txt");

public struct Page
{
    public UInt64 name;
    public Int64 degree;
    public UInt64[] links;

    public Page(UInt64 n, Int64 d, UInt64[] l)
    { name = n; degree = d; links = l; }

    public Rank[] Disperse(Rank rank)
    {
        Rank[] ranks = new Rank[links.Length];
        double score = rank.rank / this.degree;
        for (int i = 0; i < ranks.Length; i++)
        {
            ranks[i] = new Rank(this.links[i], score);
        }
        return ranks;
    }
}

public struct Rank
{
    public UInt64 name;
    public double rank;

    public Rank(UInt64 n, double r) { name = n; rank = r; }
}

public static IQueryable<Rank> PRStep(IQueryable<Page> pages, IQueryable<Rank> ranks)
{
    // join pages with ranks, and disperse updates
    var updates = from page in pages
                  join rank in ranks on page.name equals rank.name
                  select page.Disperse(rank);

    // re-accumulate
    return from list in updates
           from rank in list
           group rank.rank by rank.name into g
           select new Rank(g.Key, g.Sum());
}
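The join/disperse/group/sum step can be sketched locally in Python (a hedged translation; the three-node graph and dict representation are illustrative, not part of the original program):

```python
from collections import defaultdict

def pr_step(pages, ranks):
    # pages: {page name: list of out-links}; ranks: {page name: rank}.
    updates = defaultdict(float)
    for name, links in pages.items():
        if not links:
            continue
        share = ranks[name] / len(links)  # Disperse: split rank over links
        for dest in links:
            updates[dest] += share        # group by destination and Sum
    return dict(updates)

pages = {1: [2, 3], 2: [3], 3: [1]}
ranks = {1: 1.0, 2: 1.0, 3: 1.0}
new_ranks = pr_step(pages, ranks)
print(new_ranks)  # {2: 0.5, 3: 1.5, 1: 1.0}
```

Note that, like the DryadLINQ version, this step conserves total rank: the sum of the output ranks equals the sum of the input ranks when every page has out-links.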

PageRank Optimizations

• Benchmark PageRank on a 954m page graph
• Naïve approach: 10 iterations, ~3.5 hours, 1.2 TB
• Apply several optimizations
– Change data distribution
– Pre-group pages by host
– Rename host groups with dense names
– Cull out leaf nodes
– Pre-aggregate ranks for each host
• Final version: 10 iterations, 11.5 min, 116 GB
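The last optimization, pre-aggregating ranks, is essentially a local combiner: sum the contributions headed to the same destination before they cross the network, so each vertex emits one partial sum per destination instead of one record per edge. A minimal sketch (the page ids and update tuples are made up):

```python
from collections import defaultdict

def local_combine(updates):
    # Sum all rank contributions to the same destination page locally;
    # only one partial sum per destination is then sent over the network.
    combined = defaultdict(float)
    for page, delta in updates:
        combined[page] += delta
    return sorted(combined.items())

# Three edge-level updates collapse to two network records.
updates = [(257, 0.2), (257, 0.3), (258, 0.1)]
combined = local_combine(updates)
print(combined)  # [(257, 0.5), (258, 0.1)]
```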

Tactics for Improving Performance

• Loop unrolling
• Reduce data movement
– Improve data locality
• Choose what to Group

Gotchas

• Non-deterministic output
– e.g. an RNG in a user-defined function

• Writing to shared state

Schedule for Today

• 9:30 – 10:00 Meet with team, finalize project
• 10:30 – 12:00 Work on projects, discuss approach with a speaker

Backup Slides

Cluster Configuration

[Diagram: a head node, TidyFS servers, and cluster machines that both run tasks and host the TidyFS storage service]

How a Dryad job reads from TidyFS

[Diagram: Job Manager, TidyFS service, and cluster machines]
• The Job Manager asks the TidyFS service to list the partitions in the stream; the service replies with their locations (Part 1 on Machine 1, Part 2 on Machine 2).
• The Job Manager schedules the vertex that reads Part 1 on Machine 1 and the vertex that reads Part 2 on Machine 2.
• Each vertex asks the TidyFS service for its partition's read path (Get Read Path: Machine 1, Part 1 / Machine 2, Part 2) and receives a local NTFS path such as D:\tidyfs\0001.data or D:\tidyfs\0002.data.

How a Dryad job writes to TidyFS

[Diagram: Job Manager, TidyFS service, and cluster machines]
• The Job Manager creates one temporary stream per writing vertex: Str1_v1 (holding Part 1) and Str1_v2 (holding Part 2).
• The Job Manager schedules Vertex 1 on Machine 1 and Vertex 2 on Machine 2.

How a Dryad job writes to TidyFS (continued)

• Each vertex asks the TidyFS service for its partition's write path (GetWritePath: Machine 1, Part 1 / Machine 2, Part 2) and writes to a local path such as D:\tidyfs\0001.data or D:\tidyfs\0002.data.
• When the vertices complete, the partitions are registered with AddPartitionInfo(Part 1, Machine 1, Size, Fingerprint, …) and AddPartitionInfo(Part 2, Machine 2, Size, Fingerprint, …).
• The Job Manager creates the final stream Str1, calls ConcatenateStreams(str1, str1_v1, str1_v2), and then deletes the temporary streams str1_v1 and str1_v2.