Other formats for data Linked lists, Hash tables, JSON, Big Data, Hadoop & MapReduce. REST. Parallel processing exercise Homework: Plans for group sorting. Prepare for RSA talk. Postings

Other formats for data

Linked lists, Hash tables, JSON, Big Data, Hadoop & MapReduce. REST. Parallel processing exercise

Homework: Plans for group sorting. Prepare for RSA talk. Postings

Linked list

• Big array for data
• Array of arrays: think of rows
• Each row has information + one or more pointers to other rows. Various ways:
  – Forward pointing list: next item
  – Forward and back: next and previous item
  – Tree: first child item and next sibling
    • or first child, next sibling, parent
    • or first child, next sibling, parent1, parent2

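Family example: name, a parent, 1st child, next sibling

Rows are numbered from 0; each pointer is the row number of another row, and -1 means none.

Row  Name     Parent  1st child  Next sibling
0    Esther   -1      1          7
1    Anne     0       6          2
2    Jeanine  0       3          -1
3    Daniel   2       5          4
4    Aviva    2       -1         -1
5    Annika   3       -1         -1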
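
As a rough sketch of how such a table can be followed, the JavaScript below stores the family example as an array of arrays and walks the first-child and next-sibling pointers. The function names `childrenOf` and `parentOf` are made up for illustration, and the code assumes that a pointer to a row outside the table (such as 6 or 7 above) should simply be treated as "not in the table".

```javascript
// Each row: [name, parent, firstChild, nextSibling]. A row's position is its index.
// -1 means "none". Pointers 6 and 7 refer to rows that are not in this table.
const family = [
  ["Esther",  -1,  1,  7],
  ["Anne",     0,  6,  2],
  ["Jeanine",  0,  3, -1],
  ["Daniel",   2,  5,  4],
  ["Aviva",    2, -1, -1],
  ["Annika",   3, -1, -1]
];

const NAME = 0, PARENT = 1, FIRST_CHILD = 2, NEXT_SIBLING = 3;

// Collect a person's children: start at the first child, then follow
// next-sibling pointers until a pointer leads nowhere in the table.
function childrenOf(table, index) {
  const names = [];
  let current = table[index][FIRST_CHILD];
  while (current >= 0 && current < table.length) {
    names.push(table[current][NAME]);
    current = table[current][NEXT_SIBLING];
  }
  return names;
}

// Look up the parent's name, or null if the parent is not in the table.
function parentOf(table, index) {
  const p = table[index][PARENT];
  return p >= 0 && p < table.length ? table[p][NAME] : null;
}

console.log(childrenOf(family, 0)); // [ 'Anne', 'Jeanine' ]
console.log(childrenOf(family, 2)); // [ 'Daniel', 'Aviva' ]
console.log(parentOf(family, 5));   // Daniel
```

Note that a row never stores a list of children, only two pointers, so every row has the same fixed size no matter how large the family is.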

Exercise

• Make your family tree
• Each row has a name, parent1 (optionally include a second parent), first child, next sibling
• You need to start somewhere
• Put down "Not defined" for things not in the table.
• Put down -1 for cases of no children, no next sibling

Hash tables

• Problem: how to find elements in a table?
  – No intrinsic order. If there were, you could use binary search.
  – Binary search: compare the value (or the key) to the middle value; if less, search the lower half; if greater, search the upper half; keep going…
  – Aside: Meyer family geography game
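
The binary search described above can be sketched as follows. This is a minimal illustration that assumes the array is already sorted in increasing order; the name `binarySearch` and the sample numbers are made up for the example.

```javascript
// Return the index of target in a sorted array, or -1 if it is not there.
function binarySearch(sorted, target) {
  let low = 0;
  let high = sorted.length - 1;
  while (low <= high) {
    const mid = Math.floor((low + high) / 2);
    if (sorted[mid] === target) return mid; // found it
    if (target < sorted[mid]) {
      high = mid - 1;                       // keep searching the lower half
    } else {
      low = mid + 1;                        // keep searching the upper half
    }
  }
  return -1;                                // ran out of places to look
}

const sortedPrices = [20, 50, 100, 150];
console.log(binarySearch(sortedPrices, 100)); // 2
console.log(binarySearch(sortedPrices, 75));  // -1
```

Each comparison halves the part of the array still under consideration, which is why the search needs an ordering to work at all.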

Hash table approach

• Have key-value pairs.
• Have task of finding if current key is in the table.
  – Assume there is a hash function that inputs the key and outputs the hash, which corresponds to a slot in the table.
    • fixed time to compute the function
    • go to that spot. If empty, then store key-value there. If not empty, compare the keys; if it matches, then …. If not, check the next position, continue.
  – http://en.wikibooks.org/wiki/Data_Structures/Hash_Tables
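
A minimal sketch of this approach follows. Several details are assumptions made only for illustration: the table size of 8, the toy hash function (sum of character codes), wrapping around to slot 0 after the last slot, and overwriting the stored value when the keys match (the slide leaves that case open).

```javascript
const SIZE = 8; // hypothetical table size

// Toy hash function (an assumption): add up the character codes,
// then wrap the total into the range 0..SIZE-1.
function hash(key) {
  let sum = 0;
  for (let i = 0; i < key.length; i++) sum += key.charCodeAt(i);
  return sum % SIZE;
}

function makeTable() {
  return new Array(SIZE).fill(null); // null marks an empty slot
}

// Store a key-value pair. Returns false if every slot is taken.
function put(table, key, value) {
  let slot = hash(key);
  for (let tries = 0; tries < SIZE; tries++) {
    if (table[slot] === null || table[slot].key === key) {
      table[slot] = { key: key, value: value }; // empty slot, or same key: store here
      return true;
    }
    slot = (slot + 1) % SIZE;                   // occupied by another key: check the next position
  }
  return false;
}

// Find the value for a key, or undefined if the key is not in the table.
function get(table, key) {
  let slot = hash(key);
  for (let tries = 0; tries < SIZE; tries++) {
    if (table[slot] === null) return undefined; // reached an empty slot: not present
    if (table[slot].key === key) return table[slot].value;
    slot = (slot + 1) % SIZE;
  }
  return undefined;
}

const table = makeTable();
put(table, "ab", 1);
put(table, "ba", 2);           // same hash as "ab", so it lands in the next slot
console.log(get(table, "ab")); // 1
console.log(get(table, "ba")); // 2
console.log(get(table, "zz")); // undefined
```

This sketch never deletes entries; removing a pair from a table that resolves collisions this way needs extra care, because an emptied slot would cut off the search for keys stored after it.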

Associative array

• Normal arrays use indices, typically starting with 0.

• An associative array uses keys, such as names, in place of numeric indices. Consider a set of 4 products: table, desk, chair, lamp. An associative array could be used to store the prices:

table=>100, desk=>150, chair=>50, lamp=>20
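
In JavaScript, one way to sketch this associative array is with a plain object, using the product names as keys. The new price for the lamp is a made-up value for the example.

```javascript
// Product names are the keys; prices are the values.
const productPrices = { table: 100, desk: 150, chair: 50, lamp: 20 };

const deskPrice = productPrices["desk"];  // look up by key, not by position
productPrices["lamp"] = 25;               // hypothetical new price for an existing key
const products = Object.keys(productPrices);

console.log(deskPrice);          // 150
console.log(productPrices.lamp); // 25
console.log(products);           // [ 'table', 'desk', 'chair', 'lamp' ]
```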

key-value pairs

• So-called key-value pairs are a generalization of the associative array and are used in other systems.

• At its most general, there can be more than one value for a given key, and the basic software OR your program needs to take care of this situation.
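
One way a program might take care of several values for the same key is to keep a list of values under each key. The sketch below uses JavaScript's built-in `Map`; the helper name `addPair` and the sample prices are made up for illustration.

```javascript
// Add a key-value pair, keeping every value ever stored under that key.
function addPair(store, key, value) {
  if (!store.has(key)) store.set(key, []); // first time this key is seen
  store.get(key).push(value);
}

const store = new Map();
addPair(store, "chair", 50);
addPair(store, "chair", 65); // a second value for the same key
addPair(store, "lamp", 20);

console.log(store.get("chair")); // [ 50, 65 ]
console.log(store.get("lamp"));  // [ 20 ]
```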

JSON

• http://www.json.org/
• Format (syntax) for information
  – smaller than XML
  – available in many languages
• name / value pairs
  – create using curly brackets (braces). Use dot notation to access and modify
• arrays
  – create using square brackets. Use square brackets with indices to access and modify.

Example

var course = {"name":"Topics", "teacher": "Jeanine Meyer", "days": "MR"};

course.name => "Topics"

course.teacher => "Jeanine Meyer"

course.days => "MR"
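
The example above builds a JavaScript object directly. As a small sketch of the other half of the story, the code below modifies a value with dot notation and then converts the object to JSON text and back using the built-in `JSON.stringify` and `JSON.parse`. The variable names and the new value "TF" are made up for the example.

```javascript
const courseInfo = { "name": "Topics", "teacher": "Jeanine Meyer", "days": "MR" };

courseInfo.days = "TF";                       // modify with dot notation (hypothetical value)

const asText = JSON.stringify(courseInfo);    // object -> JSON text, e.g. to send or store
const restored = JSON.parse(asText);          // JSON text -> a new object

console.log(asText);           // {"name":"Topics","teacher":"Jeanine Meyer","days":"TF"}
console.log(restored.teacher); // Jeanine Meyer
```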

Example

var list = { "class_list": [
    {"firstname":"Groucho", "lastname": "Marx"},
    {"firstname":"Harpo", "lastname": "Marx"},
    {"firstname":"Zeppo", "lastname": "Marx"},
    {"firstname":"Curly", "lastname": "Stooge"}
] };

list.class_list[2].firstname => "Zeppo"

Big Data

• Buzzword more than a specific product
• Data that is
  – large in Volume
  – changes rapidly [or application requires up-to-date values]: Velocity
  – in different formats: Variety
• PLUS not necessarily all owned by the organization attempting to use it.
  – in this case, can only query: no changes/updates, deletions or additions

Note

• A company / organization can store data in its own CLOUD (on servers) or in a cloud service offered by a vendor and still have total control.
  – Could even be a relational database
  – Very large databases may be just key-value pairs

Cloud

… can refer to one, some or all of the following

• where the programs are

• where the data is

• where the processors (aka computers) are for doing the calculations

REST

• Representational State Transfer
  – a "standard" / framework / style of communicating with Web services
  – typically, get information in the form of XML or JSON or something else

• Posting opportunity: find a specific service that provides REST connections….

Parallel processing / distributed processing

• Large amounts (volumes) of data

• Multiple processors

• How to speed up accomplishment of tasks?
  – Embarrassingly parallel refers to tasks that are easy to parallelize
    • Take a list of numbers (say, prices) and increase each by 10%
    • ?
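
The price example can be sketched as below. The point is only to show why the task splits so easily: each chunk is handled without looking at any other chunk. This sketch runs the chunks one after another in a single program; handing each chunk to a separate processor is left out. The function names, the number of chunks, and the sample prices are assumptions for illustration.

```javascript
// Cut a list into roughly equal pieces, one per (imagined) processor.
function splitIntoChunks(items, parts) {
  const size = Math.ceil(items.length / parts);
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// The work done on one chunk: raise each price by 10%, rounded to cents.
// Nothing here depends on any other chunk.
function raiseByTenPercent(chunk) {
  return chunk.map(function (price) {
    return Math.round(price * 110) / 100;
  });
}

const oldPrices = [100, 150, 50, 20, 80, 40];
const chunks = splitIntoChunks(oldPrices, 3);          // 3 imagined processors
const partialResults = chunks.map(raiseByTenPercent);  // could run at the same time
const newPrices = [].concat(...partialResults);        // put the pieces back in order

console.log(chunks);    // [ [ 100, 150 ], [ 50, 20 ], [ 80, 40 ] ]
console.log(newPrices); // [ 110, 165, 55, 22, 88, 44 ]
```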

What about

• Tasks in which some parts can be done in parallel, but some cannot

• How to devise ways to take advantage of multiple processors

Parallel exercise

• Divide into groups of 5

• Each take a deck of cards

• Shuffle

• Devise plan to sort into order
  – suits: hearts, spades, diamonds, clubs
  – each suit: A, 2, …. J, Q, K
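
One possible plan, offered only as a starting point and not as the intended answer: one person deals the shuffled deck into four piles by suit, four people each sort one pile at the same time, and the piles are stacked in suit order. The sketch below simulates that plan in a single program, so the four pile sorts happen one after another here. All names are made up for the example.

```javascript
const SUITS = ["hearts", "spades", "diamonds", "clubs"];
const RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"];

// Build a 52-card deck already in the target order.
function makeDeck() {
  const deck = [];
  for (const suit of SUITS) {
    for (const rank of RANKS) deck.push({ suit: suit, rank: rank });
  }
  return deck;
}

// Shuffle in place (Fisher-Yates).
function shuffle(deck) {
  for (let i = deck.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    const tmp = deck[i];
    deck[i] = deck[j];
    deck[j] = tmp;
  }
  return deck;
}

function sortDeck(deck) {
  // Step 1 (one person): deal the cards into one pile per suit.
  const piles = {};
  for (const suit of SUITS) piles[suit] = [];
  for (const card of deck) piles[card.suit].push(card);

  // Step 2 (four people, independently): sort each pile by rank.
  for (const suit of SUITS) {
    piles[suit].sort(function (a, b) {
      return RANKS.indexOf(a.rank) - RANKS.indexOf(b.rank);
    });
  }

  // Step 3: stack the piles in suit order.
  let sorted = [];
  for (const suit of SUITS) sorted = sorted.concat(piles[suit]);
  return sorted;
}

const sortedDeck = sortDeck(shuffle(makeDeck()));
console.log(sortedDeck.length); // 52
console.log(sortedDeck[0]);     // { suit: 'hearts', rank: 'A' }
```

In this plan the dealing step is not parallel at all, which makes it a reasonable place to start looking for improvements.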

Hadoop

• open source utilities for distributed computing

• http://hadoop.apache.org/

• Includes MapReduce

MapReduce

A MapReduce job

• map sets up tasks to be done in parallel

• reduce combines the results
  – may be local combine step and then a reduce across all output steps

• Requires a file system

• Data is in key/value pairs
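
To make the map and reduce roles concrete, here is a small word-count sketch in plain JavaScript. It does not use Hadoop or its API, and it runs in memory in one program with no file system, so it only imitates the shape of a MapReduce job. The function names and sample lines are made up for illustration.

```javascript
// Map step: turn one line of text into key/value pairs of the form [word, 1].
// Each line can be mapped without looking at any other line.
function mapLine(line) {
  return line
    .toLowerCase()
    .split(/\s+/)
    .filter(function (word) { return word.length > 0; })
    .map(function (word) { return [word, 1]; });
}

// Between map and reduce: gather all values that share a key.
function groupByKey(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce step: combine all the values for one key into a single result.
function reduceCounts(key, values) {
  const total = values.reduce(function (a, b) { return a + b; }, 0);
  return [key, total];
}

function wordCount(lines) {
  const pairs = [].concat(...lines.map(mapLine)); // all map outputs together
  const counts = {};
  for (const [key, values] of groupByKey(pairs)) {
    const result = reduceCounts(key, values);
    counts[result[0]] = result[1];
  }
  return counts;
}

console.log(wordCount(["big data big", "data is big"]));
// { big: 3, data: 2, is: 1 }
```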

Applications

• What are applications that use multiple processors for a [big] gain in speed?

Homework

• Come up with improved parallel sorting

• Postings: more on Hadoop, MapReduce, Big Data, etc.