SESSION 3 WHARTON SUMMER TECH CAMP Regex Data Acquisition


SESSION 3 WHARTON SUMMER TECH CAMP

Regex

Data Acquisition


Agenda

1: REGEX INTRO
2: DATA ACQUISITION


Editor for Python

• Mac
  • idlex – more advanced IDLE.
    • http://idlex.sourceforge.net/download.html
• Windows
  • Haven’t used them personally, but besides the Canopy IDE:
    • PyCharm
• IDEs are usually heavy
  • Great for big projects and professional developers, but for simple scripting, I’d stick with IDLE/idlex/Canopy IDE
• If you want to feel like a badass programmer/hacker, you can learn to use the Emacs or Vim editor. Both take some time to learn.


Installing Packages for Python

• The Enthought distribution includes many packages, but you will need to download additional packages later on.
• Easy_Install
  • https://pypi.python.org/pypi/setuptools/0.9.8#installing-and-using-setuptools
• Pip
  • http://www.pip-installer.org/en/latest/installing.html#
• Mac users
  • Open up a terminal and type either command name (easy_install or pip) to see if you have it
  • If you do, you can automatically download and install Python packages using
    • sudo easy_install packagename
    • sudo pip install packagename


Regular Expression


What is a Regular Expression (RE)?

• RE or REGEX is a way to describe string patterns to computers
• Basically, an advanced “Find” and “Find and Replace”
• Originated from theoretical comp sci
  • For the interested: “Formal Language Theory”, “Chomsky hierarchy”, “Automata theory”
  • Theory that guides programming languages
• Popularized by Perl, ubiquitous in Unix
• Almost all programming languages support REGEX, and the implementations are mostly the same


What is a Regular Expression (RE)?

• Given a text T, an RE matches the part of T represented by the RE
  • RE(Text) = Subset_of_matched(Text)
  • Then you can do whatever you wish with the matched part
• A regular expression can be complicated and can consist of multiple patterns
  • You can distinguish between different matched patterns
• With the matched part of T, you can do something with it or substitute part of the matched part with something you wish
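A minimal sketch of the "match, then do something with the matched part" idea, using Python's standard `re` module (covered later in this session). The sample text and the date pattern are made up for illustration; `\d` and `{4}` are explained on later slides.

```python
import re

# Hypothetical text T, invented for this example
text = "Order 66 shipped on 2013-07-15"

# The RE describes a date: 4 digits, dash, 2 digits, dash, 2 digits.
# Each pair of parentheses marks a sub-pattern we can retrieve separately.
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)

matched_part = match.group(0)  # the whole matched part of T
year = match.group(1)          # first sub-pattern
month = match.group(2)         # second sub-pattern

# Substitute the matched part with something else
redacted = re.sub(r"\d{4}-\d{2}-\d{2}", "<DATE>", text)

print(matched_part, year, month)
print(redacted)
```

Note that "66" is left alone: it is a run of digits, but it does not fit the full date pattern.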


“Oh it’s just a text searching tool, so what?”


Well, Google is a text search tool, albeit for different purposes.

The power comes from the fact that by learning regex, you are essentially learning to represent complex text patterns to computers efficiently.

The data may be too big, or the task too tedious, for humans to go through by hand

Learn their language and tell computers what to do!


True (paraphrased) quotes from some doctoral students/faculty before I introduced them to REGEX

“I despise aggregating data from the AMT – it took me a week to go through them all”

“[Grunt noise]. I had to filter out IP addresses from surveys by hand and it took me forever”

“I have this data with many different ways of representing the same variables and need to do ‘fuzzy’ matching but don’t know a good way to do this”


Reasons to use regex

1. Regular expression will be very useful for data cleaning and aggregating

2. Very useful in basic web scraping.

3. Text data is everywhere and “If you take “text” in the widest possible sense, perhaps 90% of what you do is 90% text processing” (Programming Perl book).

4. Once you learn regex, you can use it in any language since they are similarly implemented.

5. Learning regex is one of the first steps in learning NLP (natural language processing)

6. You are learning a language of the machines


Usage Examples

• You get an output from Amazon Mech Turk (or Qualtrics) and need to extract and aggregate data and make it usable by R or Stata

• You can check survey outcomes for quality control at a massive scale, for example to see whether participants are paying attention. A related use in web development is checking whether the input format is correct.

• You want to scrape simple information from a website for your project

• One simple algorithm in NLP is matching and counting words. Regex can do that.

• You want to obtain email addresses for your evil spamming purposes. You can do that but don’t.

• Etc. Many possibilities for increase in productivity
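Two of the uses above, word counting and input-format checking, can be sketched in a few lines of Python. The survey responses and the 1-to-5 rating rule are hypothetical; `\w+` and `[1-5]` are explained on later slides.

```python
import re
from collections import Counter

# Hypothetical free-text survey responses
responses = [
    "The product is great, really great!",
    "Not great. The price is too high.",
]

# Match and count words: \w+ grabs runs of letters, digits, and underscores
words = []
for response in responses:
    words.extend(re.findall(r"\w+", response.lower()))
word_counts = Counter(words)

# Check input format: accept only a single digit from 1 to 5
def is_valid_rating(answer):
    return re.fullmatch(r"[1-5]", answer.strip()) is not None

print(word_counts["great"])
print(is_valid_rating("4"), is_valid_rating("four"))
```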


But it takes some time to master

You will need to practice with a cheat sheet next to you. Literally, this is a language (“regular language”) you are learning. Just like any language, this one has vocabulary and grammar to learn.


Tools to practice REGEX

• There are great tools to practice regex
• Website
  • http://gskinner.com/RegExr/
• If you have mac
  • http://reggyapp.com/ Reggy
• If you have windows
  • http://www.regexbuddy.com/ Regexbuddy


Basics of REGEX

• Can represent strings literally or symbolically
• Literal representations are not powerful but convenient for small tasks
• Symbolic representation is the workhorse
  • There are a few concepts you need to learn to use this representation

• There are also many special characters with special meanings in REGEX. e.g., . ^ $ * + ? { } [ ] \ | ( )

• http://cloud.github.com/downloads/tartley/python-regex-cheatsheet/cheatsheet.pdf Cheat sheet


Literal Matching

• Match strings literally.
• String = “I am a string”
• RE = “string”
• Matched string = “string”

That’s it
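The same literal match, sketched in Python with `re.search` from the standard `re` module (introduced later in this session):

```python
import re

string = "I am a string"

# A literal pattern: matches the characters s-t-r-i-n-g exactly as written
match = re.search(r"string", string)

matched_string = match.group(0)  # the text that was matched
position = match.span()          # (start, end) indexes of the match in string

print(matched_string, position)
```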


Literal Matching & Quantifiers

• Symbolic matching has many special characters to learn.
• Quantifier is one concept
• + means match whatever comes before 1 or more times
  • "ba" matches only "ba"
  • "ba+" matches ba, baa, baaa, baaaa, etc.
• ? means match whatever comes before 0 or 1 time
  • "ba?" matches b or ba
• * means match whatever comes before 0 or more times
  • "ba*" matches b or ba or baa or baaa and so on
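A quick way to check these claims yourself, sketched in Python. The helper `full_matches` is a hypothetical name made up for this example; it uses `re.fullmatch` so that a candidate counts only if the pattern covers the entire string.

```python
import re

candidates = ["b", "ba", "baa", "baaa"]

def full_matches(pattern):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

plus_matches = full_matches(r"ba+")      # one or more a's
optional_matches = full_matches(r"ba?")  # zero or one a
star_matches = full_matches(r"ba*")      # zero or more a's

print(plus_matches)
print(optional_matches)
print(star_matches)
```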


More Quantifiers

• {start,end} means match whatever comes before at least “start” and at most “end” times
  • "ba{1,3}" matches ba, baa, baaa
  • "ba{2,}" matches baa, baaa, baaaa and so on
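The two range quantifiers above can be checked the same way in Python. This is a small sketch with a made-up candidate list; `re.fullmatch` requires the pattern to cover the whole string.

```python
import re

candidates = ["b", "ba", "baa", "baaa", "baaaa"]

# {1,3}: between one and three a's
one_to_three = [c for c in candidates if re.fullmatch(r"ba{1,3}", c)]

# {2,}: two or more a's (no upper limit)
two_or_more = [c for c in candidates if re.fullmatch(r"ba{2,}", c)]

print(one_to_three)
print(two_or_more)
```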


Special Meta characters

• As you’ve seen, some characters have special meanings
  • . ^ $ * + ? { } [ ] \ | ( )
• . means any one character except the newline character \n
• ^ dictates that the regex pattern should only be matched if it occurs at the beginning
  • String= “the book” RE= “book” YES RE= “^book” NO
• $ is similar to ^ but for the ending
• [] defines a set of characters; [0-9] means any one character from 0 to 9
• () is used for grouping
  • Used to group patterns
  • Can be used to memorize a certain part of the regex
• | is used as “OR”; (5|4) matches 5 or 4
• \ <- special character to rule them all – used to escape all special meta characters to represent them as is. \. matches an actual period .
• [^stuff] means match any one character that’s not in “stuff”; [^9] matches any character but 9
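Each meta character above, tried out in Python as a rough sketch. All sample strings other than “the book” are invented for illustration, including the placeholder address `user@example.edu`.

```python
import re

s = "the book"

anywhere = re.search(r"book", s) is not None   # pattern may occur anywhere
at_start = re.search(r"^book", s) is not None  # ^ demands the beginning
at_end = re.search(r"book$", s) is not None    # $ demands the ending

any_char = re.findall(r"3.1", "3.1 3x1 31")    # . is any one character
digits = re.findall(r"[0-9]", "a1b22")         # one character from the set
either = re.findall(r"(5|4)", "3456")          # 5 or 4
literal_dot = re.findall(r"\.", "3.14.")       # escaped: an actual period
not_nine = re.findall(r"[^9]", "1929")         # anything but 9

# Parentheses "memorize" parts of the match for later retrieval
m = re.search(r"([a-z]+)@([a-z]+)\.edu", "mail user@example.edu today")
user, domain = m.group(1), m.group(2)

print(anywhere, at_start, at_end)
print(any_char, digits, either, literal_dot, not_nine)
print(user, domain)
```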


Hey Jude

Hey Jude, don't make it bad

Take a sad song and make it better

Remember to let her under your skin

Then you'll begin to make it

(better ){6}, oh

(nah ){9} Hey Jude
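The lyric shorthand above really is regex: a group followed by {n} repeats the whole group n times. A small Python sketch, with the chorus string built by hand for the example:

```python
import re

pattern = r"(nah ){9}Hey Jude"

# "nah " repeated nine times, then "Hey Jude"
chorus = "nah " * 9 + "Hey Jude"
is_chorus = re.fullmatch(pattern, chorus) is not None

# Only eight repetitions: the {9} quantifier is not satisfied
too_few = re.fullmatch(pattern, "nah " * 8 + "Hey Jude") is not None

print(is_chorus, too_few)
```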


Special Vocabulary Shortcuts

• Some vocabularies are so common that shortcuts were made
• \d matches any digit [0-9]
• \w any alphanumeric plus underscore [a-zA-Z0-9_]
• \s white spaces – tabs, newlines, etc. [ \t\n]
  • notice that space in the beginning
• \W anything that is not alphanumeric or underscore [^a-zA-Z0-9_]
• \S guess?
• \D again?
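To check your guesses, here is a sketch that runs each shortcut against one made-up sample string in Python. The uppercase shortcuts turn out to be the negations of their lowercase versions.

```python
import re

# Hypothetical sample containing a space, a comma, a tab, and a newline
sample = "Room 101,\tfloor_3\n"

digits = re.findall(r"\d", sample)           # single digits
words = re.findall(r"\w+", sample)           # runs of word characters
spaces = re.findall(r"\s", sample)           # single whitespace characters
non_word = re.findall(r"\W", sample)         # NOT a word character
non_space_runs = re.findall(r"\S+", sample)  # runs of NON-whitespace
non_digit_runs = re.findall(r"\D+", sample)  # runs of NON-digits

print(digits, words, spaces)
print(non_word, non_space_runs, non_digit_runs)
```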


Flags

• Changes the way regex works
• i ignore case
• s changes the way . works. Usually . matches anything except the newline \n; this flag makes . match everything
• m multiline. Changes the way ^ and $ work with newlines. Usually, ^ and $ match strictly the start or end of the string, but this flag makes them match on each line.
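In Python's `re` module these three flags are spelled `re.I` (or `re.IGNORECASE`), `re.S` (or `re.DOTALL`), and `re.M` (or `re.MULTILINE`). A short sketch on a made-up two-line string, showing each pattern with and without its flag:

```python
import re

text = "First line\nsecond line"

# i: ignore case
case_sensitive = re.findall(r"first", text)
ignore_case = re.findall(r"first", text, flags=re.I)

# s: let . match the newline too
dot_default = re.search(r"line.second", text)
dot_all = re.search(r"line.second", text, flags=re.S)

# m: let ^ match at the start of every line
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, flags=re.M)

print(case_sensitive, ignore_case)
print(dot_default, dot_all.group(0) == "line\nsecond")
print(starts_default, starts_multiline)
```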


REGEX in python

• Python library re
  • import re
• The function used is

re.search(pattern, string, flags=0)

Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern.

• Pattern: specifies what to be matched
• String: actual string to match from
• Flags: options – basically changes the way regex works. Again, flag "i" says ignore case.
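Because `re.search` returns None when nothing matches, it is worth checking the result before using it. A minimal sketch; the helper name `find_first_number` and the sample strings are made up for this example.

```python
import re

def find_first_number(string):
    """Return the first run of digits in string, or None if there is none."""
    match = re.search(r"[0-9]+", string)
    if match is None:       # no position in the string matched
        return None
    return match.group(0)   # the matched text

found = find_first_number("Batch 42 has 7 items")
missing = find_first_number("no numbers here")

# The flags argument: re.I is Python's spelling of the "i" flag
insensitive = re.search(r"batch", "Batch 42", flags=re.I)

print(found, missing, insensitive.group(0))
```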


REGEX in python

re.search(pattern, string, flags=0)

re.findall(pattern, string, flags=0)

• Pattern: always wrap the pattern with r"" for python. r"" says interpret everything between "" to be a raw string – particular to python due to the way python interprets some characters.

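import re

s = "This is an example string"

# Returns a match object: "This" occurs at the start of s
matchedobject = re.search(r"This", s)

# Returns None: matching is case-sensitive and s has no lowercase "this"
matchedobject = re.search(r"this", s)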
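A short sketch contrasting `re.search` (first match only) with `re.findall` (all matches), and showing why the r"" prefix matters. The `\b` word-boundary token is not covered on these slides; it is used here only because Python reads "\b" in a normal string as a backspace character, which makes the difference visible.

```python
import re

s = "This is an example string"

# search stops at the first match; findall collects every match
first = re.search(r"is", s).group(0)
all_is = re.findall(r"is", s)

# Raw string: \b reaches the regex engine as "word boundary"
with_raw = re.findall(r"\bis\b", s)

# Normal string: Python turns \b into a backspace character first,
# so the regex looks for backspaces and finds nothing
without_raw = re.findall("\bis\b", s)

print(first, all_is)
print(with_raw, without_raw)
```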


Regex is easy to learn but hard to master

Example of complex regex

The regex in the next slide is taken from

http://ex-parrot.com/~pdw/Mail-RFC822-Address.html

It validates email addresses based on the RFC822 grammar, which is now obsolete. It’s not written by hand. It’s produced by combining a set of simpler regexes.


(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t] )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])* )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+ |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?: \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031 ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(? 
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(? :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<> @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(? :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,; :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\" .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0 00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@, ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? 
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])* (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[ ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*( ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:( ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(? :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n) ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n) ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)? (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?: \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t]) *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\ .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:( ?:\r\n)?[ \t])*))*)?;\s*)

NO+!+


Lab

• Try some REGEX tutorials
  • http://www.regexlab.com/
  • http://www.regular-expressions.info/tutorial.html
• The scripts I uploaded
• Play around with the regex tool

• 15-20 minutes


Fire up the REGEX.py


We are now going to use the following

• Download WGET and make sure it works
  • You may already have wget if you use mac (in terminal, type wget)
  • http://www.gnu.org/software/wget/

• Get Firefox Developer’s Toolbox