Regular Expressions for SEO The Coolest Pattern Matching Search Language... Troy Boileau | Team Leader, SEO & Inbound Marketing Consultant For Powered by Search Internal | October 2013

Regular Expressions (RegEx) for SEO

DESCRIPTION

Regular Expressions are highly technical. This training covers the basics of RegEx and also gives examples of how to use it. Take some time to go through each example and try to figure it out on your own.

Page 1: Regular Expressions (RegEx) for SEO

Regular Expressions for SEO The Coolest Pattern Matching Search Language...

Troy Boileau | Team Leader, SEO & Inbound Marketing Consultant

For Powered by Search Internal | October 2013

Page 2: Regular Expressions (RegEx) for SEO

Some of our clients...

We’re in business because we believe that great brands need both voice and visibility in order to connect people with what matters. A boutique, full-service digital marketing agency in Toronto, Powered by Search is a PROFIT HOT 50-ranked agency that delivers search engine optimization, pay per click advertising, local search, social media marketing, and online reputation management services.

Featured in...

Page 3: Regular Expressions (RegEx) for SEO

RegEx Basics

Practical SEO Uses

RegEx Puzzles for Homework

Page 4: Regular Expressions (RegEx) for SEO

Regular Expressions for SEO

Page 5: Regular Expressions (RegEx) for SEO

http://xkcd.com/

Page 6: Regular Expressions (RegEx) for SEO

RegEx Basics

Page 7: Regular Expressions (RegEx) for SEO

RegEx Basics Use Sublime Text

This is the sexiest text editor / IDE you’ll ever use. It’s lightweight, too. It’s the text editor you’ll fall in love with.

Page 8: Regular Expressions (RegEx) for SEO

RegEx Basics Literal Matching

Text I want to match this.

RegEx match this

RegEx matches literal strings. This is like running a normal search in Word. Pretty cool, huh?

Page 9: Regular Expressions (RegEx) for SEO

RegEx Basics Anchors

Text I want this, I want that, I want I want I want

RegEx ^I want

There are a couple of special characters called “Anchors.” The caret (^) represents the beginning of a line. The dollar sign ($) represents the end of a line. You see these a lot in .htaccess files.

Text I want this, I want that, I want I want I want

RegEx I want$
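To see the two anchors side by side, here is a minimal sketch in Python using the standard `re` module. The slides use a text editor for this; Python is only an assumption made for illustration.

```python
import re

text = "I want this, I want that, I want I want I want"

# ^ pins the match to the start of the line, so only the first "I want" qualifies.
start_match = re.search(r"^I want", text)

# $ pins the match to the end of the line, so only the last "I want" qualifies.
end_match = re.search(r"I want$", text)

# Without an anchor, every occurrence is found.
all_matches = re.findall(r"I want", text)

print(start_match.span(), end_match.span(), len(all_matches))
```

The anchored patterns each match exactly once, while the plain literal matches every repetition.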

Page 10: Regular Expressions (RegEx) for SEO

RegEx Basics Special Characters

There are also a series of other special characters. These are:

• [ - Starts a Character Class (More Later)
• \ - Escapes or modifies the character after it.
• . - Wildcard. It represents any character.
• | - OR, so (this|that|the other) means this, that, or the other.
• ( - Starts a group.
• ) - Ends a group.

To match any of these characters literally, put a backslash (\) in front of it. This also applies to ? + * ^ $, which we’ve talked about or will get to later.

Page 11: Regular Expressions (RegEx) for SEO

RegEx Basics Quantifiers

A quantifier tells the expression how many times to match the expression before it.

• ? - Zero or one time
• + - One or more times
• {exactly} - Exactly this many times
• {min,max} - Between min and max times
• * - Zero or more times

Text Ahhhhhhhhhhh. A spider.

RegEx A[h]+
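Here is a rough sketch of each quantifier run against the spider example, using Python's `re` module (an assumption; any RegEx engine with these quantifiers should behave the same way for these patterns).

```python
import re

text = "Ahhhhhhhhhhh. A spider."

# + : one or more "h" after "A", so the whole scream is matched.
plus = re.search(r"A[h]+", text).group()

# {3} : exactly three "h".
exact = re.search(r"Ah{3}", text).group()

# {2,5} : between two and five "h"; greedy, so it takes five.
ranged = re.search(r"Ah{2,5}", text).group()

# * : zero or more "h", so the bare "A" in "A spider" matches too.
star = re.findall(r"Ah*", text)

# ? : zero or one "h".
optional = re.findall(r"Ah?", text)

print(plus, exact, ranged, star, optional)
```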

Page 12: Regular Expressions (RegEx) for SEO

RegEx Basics Greedy vs. Non-Greedy

Quantifiers are greedy by default. A greedy quantifier repeats the expression before it as many times as possible before giving up and continuing with the rest of the RegEx. This can lead to unexpected matches. To make the quantifiers * + {} non-greedy (lazy), just add a question mark after them.

Text <p>test</p>

RegEx (Greedy) <.+>

Text <p>test</p>

RegEx (Lazy) <.+?>
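The difference is easiest to see by listing what each version matches. A small sketch in Python (the language is an assumption, the patterns are the ones above):

```python
import re

html = "<p>test</p>"

# Greedy: .+ runs to the LAST ">" it can find, swallowing the whole string.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the FIRST ">" it can find, so each tag matches separately.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<p>test</p>']
print(lazy)    # ['<p>', '</p>']
```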

Page 13: Regular Expressions (RegEx) for SEO

RegEx Basics Variations / Character Classes []

A variation is a set of literal characters that can possibly fill a space. For example:

Text Well then I’m better than you.

RegEx th[ea]n

The characters in the variation aren’t a GROUP. What the following RegEx is telling the computer is, “Find any of: a t, an h, an e, an n, a pipe, a t, an h, an a, or an n.” That’s not what we want.

Text Well then I’m better than you.

RegEx [then|than]

Page 14: Regular Expressions (RegEx) for SEO

RegEx Basics Groups ()

In the case above, we could use a group to solve our problem. A group isn’t the best answer there, since th[ea]n is shorter, but it works. Groups are for alternation and/or quantification.

Text Well then I’m better than you.

RegEx (then|than)

Text I like redredred apples.

RegEx (blue|green|red)+
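To compare character classes and groups directly, here is a short sketch in Python (an assumption; the slides run these in a text editor). It uses `finditer` for the grouped patterns because Python's `findall` returns only the captured group, not the full match.

```python
import re

text = "Well then I'm better than you."

# A character class matches ONE character from the set.
words = re.findall(r"th[ea]n", text)

# Inside square brackets the pipe is just another literal character,
# so this finds single characters, never the whole words.
single_chars = re.findall(r"[then|than]", text)

# A group with | matches one whole alternative.
grouped = [m.group(0) for m in re.finditer(r"(then|than)", text)]

# A quantifier placed after a group repeats the whole group.
repeated = re.search(r"(blue|green|red)+", "I like redredred apples.").group(0)

print(words, grouped, repeated)
```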

Page 15: Regular Expressions (RegEx) for SEO

RegEx Basics Variables / Captured Groups $1

When you use a group, it captures the text it matched in a numbered variable. The variables count up from $1. You can use them when doing a find-replace.

Text https://www.searchersforbeerfridges.com/?vote_number=9001

RegEx Find .+?//(.*?)/.*

RegEx Replace $1

New Text www.searchersforbeerfridges.com
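The same find-replace can be sketched in Python with `re.sub`. One difference to be aware of: Python writes the captured group as `\1` in the replacement string, where the editor uses `$1`.

```python
import re

url = "https://www.searchersforbeerfridges.com/?vote_number=9001"

# Group 1 lazily captures everything between "//" and the next "/".
# The pattern consumes the whole URL, so replacing with \1 leaves only the host.
host = re.sub(r".+?//(.*?)/.*", r"\1", url)

print(host)  # www.searchersforbeerfridges.com
```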

Page 16: Regular Expressions (RegEx) for SEO

Practical SEO Uses

Page 17: Regular Expressions (RegEx) for SEO

Practical SEO Uses Google Analytics – Branded Organic

In Analytics I often want to find branded organic search traffic. Let’s look at the GWT data in Analytics for our fictional client, Lett.Me.

Lett Me has a ton of common mis-typings and variations. They get traffic from lm, lm.com, let me, lettme.com, letme.com, let.me, and lett.me. What’s the regular expression that captures all of that?

Page 18: Regular Expressions (RegEx) for SEO

Practical SEO Uses Google Analytics – Branded Organic

Here’s the regular expression I came up with. It matches some funky cases like let me.com but that’s fine:

RegEx Find (lm|let[t]?[ ]?[\.]?me)(\.com)?

You can also remove the square brackets, but I feel like it’s easier to read with them in. Without them it looks like this:

RegEx Find (lm|let{1,2} ?\.?me)(\.com)?

Now just save this RegEx in your reporting document and you’ll never have to type out the whole thing again. Imagine what this could do for reporting on keyword groups!
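A quick way to check a pattern like this before saving it is to run it against every variation you expect. The sketch below does that in Python (an assumption, since the slides apply the pattern in Analytics); the non-branded query is a made-up example.

```python
import re

BRAND = re.compile(r"(lm|let[t]?[ ]?[\.]?me)(\.com)?")
BRAND_SHORT = re.compile(r"(lm|let{1,2} ?\.?me)(\.com)?")

# The variations listed for the fictional client, plus the "funky" case.
variations = ["lm", "lm.com", "let me", "lettme.com",
              "letme.com", "let.me", "lett.me", "let me.com"]

# fullmatch requires the whole query to be covered by the pattern.
covered = [q for q in variations if BRAND.fullmatch(q)]
covered_short = [q for q in variations if BRAND_SHORT.fullmatch(q)]

# Hypothetical non-branded query, used only to show a miss.
non_branded = BRAND.search("apartment rentals")

print(covered == variations, covered_short == variations, non_branded)
```

Because the pattern has no anchors, a partial-match search will also hit any query that merely contains "lm", so it is worth spot-checking real data.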

Page 20: Regular Expressions (RegEx) for SEO

Practical SEO Uses Trim To Root

Trim to Root using Find Replace. Here’s the list:

http://www.georgebrown.com/www-non-www
http://blog.russian.me/
https://russian.eu/?pg=2

What’s the RegEx?

RegEx Find ^.*?//(.*?)/.*

RegEx Replace $1

Page 22: Regular Expressions (RegEx) for SEO

Practical SEO Uses Fixing HTML – Nested Tags

I commonly get improperly formatted HTML. Here’s an example:

<h2><b></b><i></i>I Wrote This In Microsoft Word!</h2>
<h2></h2>
<p>This is a great image!</p>
<p><img src="http://site.com/sampleimage.png" /></p>

I want to remove all of the empty tags. What’s the RegEx?

RegEx Find <[a-z0-9]{1,6}></[a-z0-9]{1,6}>

RegEx Replace (leave empty)
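Here is a hedged sketch of that cleanup in Python. The loop is my own addition, not something the slides show: it repeats the replacement until nothing changes, so a tag that only becomes empty after its empty children are removed gets cleaned up as well. `strip_empty_tags` is a hypothetical helper name.

```python
import re

EMPTY_TAG = re.compile(r"<[a-z0-9]{1,6}></[a-z0-9]{1,6}>")

def strip_empty_tags(html: str) -> str:
    """Remove empty open/close tag pairs, repeating until the text is stable."""
    previous = None
    while previous != html:
        previous = html
        html = EMPTY_TAG.sub("", html)
    return html

sample = "\n".join([
    "<h2><b></b><i></i>I Wrote This In Microsoft Word!</h2>",
    "<h2></h2>",
    "<p>This is a great image!</p>",
    '<p><img src="http://site.com/sampleimage.png" /></p>',
])

cleaned = strip_empty_tags(sample)
print(cleaned)
```

Note that the pattern does not check that the opening and closing tag names agree, so it is a cleanup shortcut rather than an HTML parser.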

Page 24: Regular Expressions (RegEx) for SEO

Practical SEO Uses Top Level Domains

Find only .bs and .spam top level domains. Here’s the list:

http://www.spam.com/bs
http://bs.com/spam
http://spam.bs.com/balls
http://remove-this.bs/test
http://www.and-this.spam/

What’s the RegEx?

RegEx Find .*//(.*?)\.(bs|spam)/.*

RegEx Replace $1
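To confirm that only the last two URLs are caught, the pattern can be run over the list. A minimal sketch in Python (an assumption; `\1` plays the role of `$1`):

```python
import re

TLD = re.compile(r".*//(.*?)\.(bs|spam)/.*")

urls = [
    "http://www.spam.com/bs",
    "http://bs.com/spam",
    "http://spam.bs.com/balls",
    "http://remove-this.bs/test",
    "http://www.and-this.spam/",
]

# Keep only URLs whose host ends in .bs or .spam right before the first "/".
hits = [u for u in urls if TLD.match(u)]

# The replace step keeps group 1, i.e. the host without its TLD.
names = [TLD.sub(r"\1", u) for u in hits]

print(hits)
print(names)
```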

Page 26: Regular Expressions (RegEx) for SEO

Practical SEO Uses Finding Substrings in Domains

Does the domain contain the words “directory” or “article”? The list:

http://directorylinks.com/spamspam
http://www.spammy.com/link-directory
http://shadyarticles.com/
http://newyorktimes.com/?article_id=744
https://bonusarticles.com

What’s the RegEx? (If you can match bonus articles without the trailing slash, I salute you!)

RegEx Find ^.*?//.*(directory|article).*?(/|\..{2,3}$).*
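The sketch below runs the pattern over the list in Python (an assumption). It also includes `STRICT`, a hypothetical alternative of my own that uses `[^/]*` so the keyword must appear before the first slash; the slide pattern works on this list but would also accept a path such as /article/foo, which the extra example URL shows.

```python
import re

SLIDE = re.compile(r"^.*?//.*(directory|article).*?(/|\..{2,3}$).*")
# Hypothetical stricter variant: [^/]* cannot leave the host part.
STRICT = re.compile(r"^.*?//[^/]*(directory|article)")

urls = [
    "http://directorylinks.com/spamspam",
    "http://www.spammy.com/link-directory",
    "http://shadyarticles.com/",
    "http://newyorktimes.com/?article_id=744",
    "https://bonusarticles.com",
]

slide_hits = [u for u in urls if SLIDE.search(u)]
strict_hits = [u for u in urls if STRICT.search(u)]

# Made-up URL where the keyword is a folder name, not part of the domain.
tricky = "http://example.com/article/foo"

print(slide_hits)
print(bool(SLIDE.search(tricky)), bool(STRICT.search(tricky)))
```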

Page 27: Regular Expressions (RegEx) for SEO

Practical SEO Uses Merging Lists

Does the list of URLs contain domains we’ve already disavowed? Say we’re doing a reconsideration request and we don’t want to consider any of the links we’ve already disavowed. So, we have List A, new links with some old links mixed in, that we want cleansed of any of the domains in List B. It’s a whole process. What do you think it is?

List A
http://globeandmail.com/
http://directorylinks.com/?id=1
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
http://tomsdiner.net/article
https://thediner.pl/

List B
http://directorylinks.com/article
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/

Page 29: Regular Expressions (RegEx) for SEO

Practical SEO Uses Merging Lists

First I’d use one of the tricks we learned already to format List B in an easier to manipulate way. I’ve bolded it below. What do you think the RegEx F/R is to get that?

List A
http://globeandmail.com/
http://directorylinks.com/?id=1
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
http://tomsdiner.net/article
https://thediner.pl/

List B
http://directorylinks.com/article
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/

RegEx Find ^.*?//(.*?)/.*

RegEx Replace $1

Page 31: Regular Expressions (RegEx) for SEO

Practical SEO Uses Merging Lists

Great. Now, we’ve learned how to search for substrings (string is a substring of substrings, if that isn’t confusing). How might we turn List B into a set of variations of substrings that we can search through List A with? A tip: \n is the newline character and you need it. What’s the RegEx?

List A
http://globeandmail.com/
http://directorylinks.com/?id=1
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
http://tomsdiner.net/article
https://thediner.pl/

List B
directorylinks.com
spam.com
mafia-wars.com
192.233.111

RegEx Find \n

RegEx Replace |

Page 33: Regular Expressions (RegEx) for SEO

Practical SEO Uses Merging Lists

If you did it right, you should have what I’ve currently listed under List B. What’s the final step we need to be able to search List A with the substrings in List B?

RegEx Find .*(directorylinks.com|spam.com|mafia-wars.com|192.233.111).*

List A
http://globeandmail.com/
http://directorylinks.com/?id=1
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
http://tomsdiner.net/article
https://thediner.pl/

List B
directorylinks.com|spam.com|mafia-wars.com|192.233.111
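The whole process can be sketched end to end in Python (an assumption; the slides do each step by hand with find-replace). The steps mirror the slides: trim List B to root, swap newlines for pipes, wrap the result in `.*( ... ).*`, then drop every List A line that matches. `to_root` is a hypothetical helper name.

```python
import re

list_a = [
    "http://globeandmail.com/",
    "http://directorylinks.com/?id=1",
    "http://spam.com/article",
    "http://mafia-wars.com/torrentz",
    "http://192.233.111/",
    "http://tomsdiner.net/article",
    "https://thediner.pl/",
]
list_b = [
    "http://directorylinks.com/article",
    "http://spam.com/article",
    "http://mafia-wars.com/torrentz",
    "http://192.233.111/",
]

def to_root(url: str) -> str:
    """Step 1: trim a URL to its root, as in the Trim To Root slide."""
    return re.sub(r"^.*?//(.*?)/.*", r"\1", url)

# Step 2: one root per line, then replace each newline with a pipe.
roots_text = "\n".join(to_root(u) for u in list_b)
alternation = re.sub(r"\n", "|", roots_text)

# Step 3: wrap the alternation. The dots are left unescaped, as on the slide,
# so each one matches any character; escaping them would be stricter.
disavowed = re.compile(".*(" + alternation + ").*")

# Step 4: keep only the List A links that do not match a disavowed domain.
clean = [u for u in list_a if not disavowed.match(u)]

print(alternation)
print(clean)
```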

Page 35: Regular Expressions (RegEx) for SEO

Practical SEO Uses Finding Client Anchor in HTML

Screaming Frog lets you use Regular Expressions in your searches. One use of this feature is finding out whether someone is actually linking to your website, because all legitimate anchors share the same format:

<a (any or no attributes) href="any variation of your URL" (any or no attributes)>(possible other tags)anchor text(possible other tags)</a>

In the attached HTML document, find all 3 links to Mooz.com. Bonus: Find only the 2 links to Mooz.com that contain the anchor text, “Cow Melk” or “Milk.”

RegEx Find <a.{0,100}href=.{0,100}mooz\.com

RegEx Find (Bonus) <a.{0,100}href=.{0,100}?mooz\.com(.{0,100}?)(Cow Melk|Milk)
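The attached HTML document isn't reproduced here, so the sketch below uses a few made-up lines of HTML purely to show how the two patterns behave. It is written in Python as an assumption; in practice the patterns go into Screaming Frog's search.

```python
import re

LINK = re.compile(r"<a.{0,100}href=.{0,100}mooz\.com")
BONUS = re.compile(r"<a.{0,100}href=.{0,100}?mooz\.com(.{0,100}?)(Cow Melk|Milk)")

# Hypothetical sample lines, not taken from the original attachment.
lines = [
    '<a href="http://mooz.com/">Cow Melk</a>',
    '<a class="ext" href="https://www.mooz.com/dairy" rel="nofollow"><b>Milk</b></a>',
    '<a href="http://mooz.com/about">About us</a>',
    '<a href="http://example.com/">Milk</a>',
    "<p>Visit mooz.com for details</p>",
]

# Any real anchor pointing at mooz.com, whatever attributes surround href.
links = [line for line in lines if LINK.search(line)]

# Only anchors to mooz.com whose text contains one of the two phrases.
bonus_links = [line for line in lines if BONUS.search(line)]

print(len(links), len(bonus_links))
```

The `{0,100}` limits keep the wildcard from wandering too far, which is a cheap way to stay inside a single tag.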

Page 36: Regular Expressions (RegEx) for SEO

RegEx Puzzles for Homework

Page 38: Regular Expressions (RegEx) for SEO

RegEx Puzzles for Homework Puzzles

Some Puzzles:

• Show only the domain, no sub-domain, with a find-replace.
• Find all links that are obviously from a blog.
• Format a list of links as domains in a comma separated list.

Page 40: Regular Expressions (RegEx) for SEO

RegEx Puzzles for Homework No Sub-Domains

Show only the domain, no sub-domain, with a find-replace.

http://www.georgebrown.com/www-non-www
http://blog.russian.me/
https://russian.eu/
http://screw.you.regex.net/

What’s the RegEx?

RegEx Find ^.*?//(.*\.)*(.*)\.(.{2,3})/.*

RegEx Replace $2.$3
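Here is the same find-replace as a Python sketch (an assumption; `\2.\3` stands in for `$2.$3`). Keep in mind that the `.{2,3}` at the end assumes a two or three letter top level domain, so something like example.co.uk would come out as co.uk.

```python
import re

NO_SUBDOMAIN = re.compile(r"^.*?//(.*\.)*(.*)\.(.{2,3})/.*")

urls = [
    "http://www.georgebrown.com/www-non-www",
    "http://blog.russian.me/",
    "https://russian.eu/",
    "http://screw.you.regex.net/",
]

# Group 1 soaks up any sub-domains, group 2 is the domain name,
# and group 3 is the top level domain.
domains = [NO_SUBDOMAIN.sub(r"\2.\3", u) for u in urls]

print(domains)
```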

Page 42: Regular Expressions (RegEx) for SEO

RegEx Puzzles for Homework Blog or RSS

In the attached sample-urls.txt, find all links that are obviously from a blog or RSS feed. What’s the RegEx?

RegEx Find .*(/blog|/article|feed\.|/feed).*

Page 44: Regular Expressions (RegEx) for SEO

RegEx Puzzles for Homework Comma Separated Domains

Format a list of links as domains in a comma separated list. The links:

http://www.business2community.com/seo
http://www.buzzstream.com/blog/competitive-link-building.html
http://www.cansinmert.com/
http://www.canuckseo.com/index.php/2010
http://www.cio.com/article/738249/
http://www.clicktivist.org/

Should be: www.domain.com, www.domain2.com, etc. What’s the RegEx?

RegEx Find (|\n).*//(.*?)/.*

RegEx Replace $2, (a comma followed by a space)

The capture is lazy, (.*?), so it stops at the first slash after the domain instead of swallowing part of the path. Afterwards, delete the trailing comma.
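A sketch of this puzzle in Python (an assumption; `\2` stands in for `$2`). Note that the capture here is written lazily as `(.*?)`: with a greedy `(.*)`, URLs that have more than one slash in the path would keep part of the path in the captured domain.

```python
import re

links = """http://www.business2community.com/seo
http://www.buzzstream.com/blog/competitive-link-building.html
http://www.cansinmert.com/
http://www.canuckseo.com/index.php/2010
http://www.cio.com/article/738249/
http://www.clicktivist.org/"""

# (|\n) eats the newline before each line, so the output lands on one line.
joined = re.sub(r"(|\n).*//(.*?)/.*", r"\2, ", links)

# Delete the trailing comma (and the space after it).
result = joined.rstrip(", ")

# For comparison only: the greedy capture keeps part of the path.
greedy = re.sub(r"(|\n).*//(.*)/.*", r"\2, ", links)

print(result)
```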

Page 46: Regular Expressions (RegEx) for SEO

Questions?

Page 47: Regular Expressions (RegEx) for SEO

Thanks for Hanging Out

Stay in Touch

Twitter: @troyfawkes Google+: http://gplus.to/TroyFawkes Email: [email protected]

www.poweredbysearch.com

www.troyfawkes.com