Schedule
• Undergraduates
  o June 30, 2:00-6:00, Background material
  o July 2, 2:00-6:00, Data finding
  o July 4, 2:00-6:00, Data finding and machine learning
  o July 6, 2:00-6:00, Machine learning
• Graduates
  o July 1, 2:30-5:30, Introduction and data finding
  o July 3, 8:30-11:30, Data finding and machine learning
  o July 3, 2:30-5:30, Machine learning
Introduction
Useful References
• http://www.mgnet.org/~douglas/Classes/bigdata/2019su-index.html
• Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman, Mining of Massive Datasets, 2nd ed. (version 2.1), Stanford University, 2014. The most up-to-date version is online at http://www.mmds.org. I will also lecture from the 3rd edition draft.
• Andriy Burkov, The Hundred-Page Machine Learning Book, http://themlbook.com/wiki/doku.php, 2019.
• Wooyoung Kim, Parallel Clustering Algorithms: Survey, http://grid.cs.gsu.edu/~wkim/index_files/SurveyParallelClustering.html, 2009.
• Deep Learning exercises using TensorFlow, https://www.coursera.org/learn/intro-to-deep-learning/home/welcome
  o https://github.com/hse-aml/intro-to-dl
Useful Software
• TensorFlow
  o Version 1.13 is stable. Version 2.0.0-beta is not.
  o Anaconda or Miniconda environments
  o Additional Python packages: jupyter, matplotlib, pandas
• Tableau
• MapReduce, Spark, and workflow systems
• Many problems run 1000X faster on a GPU
Some Sources of Big Data
• Interactions with dynamic databases
• Internet data
• City or regional transportation flow control
• Environment and disaster management
• Oil/gas fields or pipelines, seismic imaging
• Government or industry regulation/statistics
• Closed circuit camera identification
Oil/Gas Pipelines
Picture courtesy of the Merriam-Webster Dictionary
Pipeline Network Properties
• Pipe diameters range from 2 inches to 5 feet.
• Rarely straight and level.
• Contain
  – Possibly different grades of oil or gas simultaneously.
  – Pigs as separators.
  – Sensors (inside and outside).
• Not restricted to oil/gas pipelines (water, etc.).
1970’s Modeling
• Problem modeled mathematically based on time-dependent, nonlinear, coupled partial differential equations (two models).
  – Sensors on all pipeline components (recall the cartoon).
  – Distributed GRID computing with scattered phone booths:
    • 2 minicomputers, 4 array processors, a heat pump on top, and a U.S. nickel soldered in place to allow “free” calls for telemetry.
• Sensors provided data (temperature, pressure, and velocity) dynamically, based on need and anomalies, controlled by the environment and the running model.
• No central computing, just central and distributed control sites.
• 2,000 pieces of telemetry/minute in the complete KSA network (1978).
Current Modeling
• 3D math models of pipelines with topography.
• Central computing and fiber optic TCP/IP, with Gigabit Ethernet backup near pipelines.
• Many more sensors, plus ones to measure pipe (shape) changes, internal pollutants, and external gas leakages.
• When the 1978 system was replaced in KSA in 1998, 100,000 times the telemetry/minute. In 2014, a tsunami of uncountable data.
Monitoring Site Evolution
• In the 1970s, a primitive center where “what if” scenarios were run to keep pipelines from breaking, in parallel with regular monitoring.
• Now, large-scale visualization is used to monitor pipelines in a multiscale framework. Individual high-resolution monitors (1080p and 4K+) are used for “what if” scenarios.
• Always trying to find anomalies in the data streams to avoid pipeline problems.
Computer Science Techniques
Hash Tables
• A hash table is a data structure with N buckets.
  – N is usually a prime number and may be quite large.
  – Each bucket contains data.
  – Accessed using a hash function Key = h(x).
    • h(x) must be inexpensive to evaluate.
    • Key is an index 0, 1, …, N-1 into the hash table.
    • Data x can be found only in bucket h(x).
Storing a Hash Table
• If the data is very simple (numbers or short strings), then a spreadsheet may be optimal.
• If the data is arbitrary, then dynamically allocated memory techniques are common.
  – Common to use linked lists inside of each bucket.
  – Can be error prone.
  – Must remember to deallocate all of the hash table when done, which can also be error prone.
  – Must decide if duplicates are allowed in a bucket.
Common Data Structure
[Diagram: an array of N buckets, indexed 0, 1, 2, …, N-2, N-1; each bucket points to a (null-terminated) linked list holding that bucket’s data.]

Variations:
• doubly linked lists
• nested tables
• spreadsheet
Hash Table Functionalities
• Search
• Add
  – Uses Search
• Delete
  – Uses Search
• Modify (optional)
  – Uses Search
• Change order of data in a bucket (optional)
  – Uses Search and possibly Delete and Add
Functionality
• Search(x)
  – Compute Key = h(x)
  – For each data stored in bucket Key, compare x to the data.
    • If a match, then return something that allows the data to be accessed.
    • If there is no match, return a Failure notice.
• Add(x)
  – F = Search(x)
  – If F ≠ Failure, then
    • If no duplicates are allowed, return something that allows the data to be accessed (and that it is already in the hash table).
  – Otherwise,
    • Probably make a copy of x and add it to bucket h(x).
      – Usually added as the first or last element in bucket h(x).
      – Usually have to modify the linked list for bucket h(x).
• Delete(x)
  – F = Search(x)
  – If F ≠ Failure, then
    • Remove the data from bucket h(x). This usually means deleting the copy of x and relinking inside the linked list. There may be other bookkeeping, too.
    • Return Success.
  – Otherwise,
    • Return Failure.
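The three operations above fit together as sketched below. Python lists stand in for the per-bucket linked lists, and the class name, modular hash, and no-duplicates default are illustrative choices, not from the slides.

```python
# Sketch of Search/Add/Delete, where Add and Delete both use Search.
class HashTable:
    def __init__(self, n=7, allow_duplicates=False):
        self.n = n
        self.buckets = [[] for _ in range(n)]  # one "linked list" per bucket
        self.allow_duplicates = allow_duplicates

    def _h(self, x):
        return x % self.n  # Key = h(x), an index 0..n-1

    def search(self, x):
        key = self._h(x)
        for i, item in enumerate(self.buckets[key]):
            if item == x:
                return (key, i)  # something that lets us access the data
        return None              # Failure

    def add(self, x):
        found = self.search(x)   # Add uses Search
        if found is not None and not self.allow_duplicates:
            return found         # already in the hash table
        key = self._h(x)
        self.buckets[key].append(x)  # added as the last element
        return (key, len(self.buckets[key]) - 1)

    def delete(self, x):
        found = self.search(x)   # Delete uses Search
        if found is None:
            return False         # Failure
        key, i = found
        del self.buckets[key][i]  # remove and "relink" the bucket
        return True               # Success
```

For example, `t = HashTable(); t.add(102); t.search(102)` returns the bucket and position of 102, and `t.delete(102)` removes it again.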
Simple Examples
• Dataset D consists of combinations of a, b, c, …, x, y, z of exactly string length 3.
• We encode each letter by 00, 01, 02, …, 23, 24, 25. So, abz is 000125 = 125.
• Consider two hash functions:
  – h1(x) = x mod 7
  – h2(x) = leading encoded letter in x
• We get two very different hash tables.
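The letter encoding above is mechanical enough to state as code (a small sketch; the function name is mine):

```python
# a=00, b=01, ..., z=25; a 3-letter string becomes the concatenated
# two-digit codes, read as an integer (leading zeros drop out).
def encode(s):
    return int("".join("%02d" % (ord(c) - ord("a")) for c in s))

# abz -> "00" + "01" + "25" = "000125", i.e. the integer 125
```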
Example Dataset D
• D = { abc, def, acd, zaa, bbb, bzq, zxw, faq, cap, eld, ssa, bab }, or encoded
• D = { 102, 30405, 203, 250000, 10101, 12516, 252322, 50016, 20015, 41103, 181800, 10001 }
h1(x) for D
• The number of buckets is 7 (a prime).
• This is not necessarily a well-balanced hash table, since too many members of D go into bucket 0.
• We can store the hash table using linked lists.
x h1(x) x h1(x) x h1(x)
102 4 30405 4 203 0
250000 2 10101 0 12516 0
252322 0 50016 1 20015 2
41103 6 181800 3 10001 5
Hash Table for h1(x)
Buckets and data for each bucket (0 marks the end of a bucket’s linked list):
0: 203 → 10101 → 12516 → 252322 → 0
1: 50016 → 0
2: 250000 → 20015 → 0
3: 181800 → 0
4: 102 → 30405 → 0
5: 10001 → 0
6: 41103 → 0
h2(x) for D
• The number of buckets is 26 (not a prime).
• This is a very different distribution of data than for h1(x), and more balanced for our particular D.
• We can store it as a table or spreadsheet.
x h2(x) x h2(x) x h2(x)
102 0 30405 3 203 0
250000 25 10101 1 12516 1
252322 25 50016 5 20015 2
41103 4 181800 18 10001 1
Hash Table for h2(x)
key  values
0    102, 203
1    10101, 12516, 10001
2    20015
3    30405
4    41103
5    50016
18   181800
25   250000, 252322
(keys 6-17 and 19-24 are empty)
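Both hash tables for D can be built in a few lines of Python (the list-of-lists representation is an illustrative choice; `x // 10000` picks off the leading encoded letter because the last two letters always occupy the final four digits):

```python
# The encoded dataset D from the slides.
D = [102, 30405, 203, 250000, 10101, 12516,
     252322, 50016, 20015, 41103, 181800, 10001]

def h1(x):
    return x % 7       # 7 buckets (a prime)

def h2(x):
    return x // 10000  # leading encoded letter: 26 buckets

def build_table(h, n):
    buckets = [[] for _ in range(n)]
    for x in D:
        buckets[h(x)].append(x)  # x lives only in bucket h(x)
    return buckets

table1 = build_table(h1, 7)
table2 = build_table(h2, 26)
```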
Fracking Data Example
• Open database maintained by the Pennsylvania state government, based on the fractured oil and gas wells in the Marcellus Basin.
• About 8,000 wells have been drilled, and information about each is maintained in this database.
• Each state in the United States has at least one public database about fracking wells.
• 15.3 million Americans live within 1 mile (1.6 km) of a well drilled since 2000.
• Spreadsheets in the comma-separated values format (.csv) or PDF are common.
Fracking Data File Information
• Each file contains information for a period of time during 2000-2014
  o Locations of wells
  o Owner of property
  o Approximate latitude and longitude of each well
  o Drilling company
  o Production information
    § Potential production
    § Actual production (units: barrels for oil, 1000 cubic feet for gas)
    § Active/Inactive
  o Much more information, with some cells blank
Interesting Questions
• What are the production curves?
  o Are they uniform in regions or do they vary a lot?
• How long is there a good payout? (0, 12, 39-40, …, 120 months?)
• Are there some drillers whose wells are more likely to not be in production after some period of time?
• Where are clusters of wells?
• How do you visualize the data?
• How do you put the data into the right format in order to ask the right questions and get answers quickly?
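One of the questions above (which drillers have the most inactive wells?) can be sketched against the .csv files with the standard library alone. The column names `driller` and `status` are hypothetical; the real Pennsylvania files use their own headers.

```python
# Hedged sketch: count inactive wells per driller from a fracking
# .csv file. Assumes hypothetical columns "driller" and "status".
import csv
from collections import Counter

def inactive_wells_by_driller(path):
    inactive = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Status strings in public data vary in case/whitespace.
            if row["status"].strip().lower() == "inactive":
                inactive[row["driller"]] += 1
    return inactive.most_common()  # [(driller, count), ...] sorted
```

The same pattern (one pass, grouped counts) extends to production curves or payout periods by accumulating per-month production instead of a simple count.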
Data Files
• Approximately 574 MB of files.
• First things to do:
  o Determine how to use the data (Excel, MongoDB, Hadoop, Matlab, R, etc.).
  o Use the data to answer some simple, but interesting, questions.
  o Visualize the results (Excel, Matlab, R, Tableau, etc.).
• Thereafter,
  o Determine how to answer general, complex questions.
  o Use a general database approach that uses all of your computer’s cores and GPUs.