Regular Expressions for Regular Joes (and SEOs)

DESCRIPTION

A basic introduction to Regular Expressions (aka RegEx or RegExp) for people in the SEO industry. The first half is instructional and the second half covers situational use cases.

Page 1: Regular Expressions for Regular Joes (and SEOs)

Regular Expressions for Regular Joes

(and SEOs)

COPYRIGHT 2014 CATALYST. ALL RIGHTS RESERVED.

What are regular expressions?

• A regular expression (sometimes referred to as regex or regexp) is basically find-and-replace on steroids, an advanced system of matching text patterns.

APRIL 11, 2023 | PAGE 2

I LOVE regular expressions!

Most Common Example: Google Analytics

• Using “pipes” to exclude pages with extraneous symbols attached to the URL, like UTM tracking parameters.

Where can I use regular expressions?

• Many text editors
– Notepad++ is an awesome one for Windows

• SEO Tools for Excel add-on
– http://nielsbosma.se/projects/seotools/

• Google Docs
– =regexextract() function
– =regexmatch() function
– =regexreplace() function

• Google Analytics
• Screaming Frog
• DeepCrawl
• .htaccess
– RewriteCond
– RewriteRule

• Programming Languages

RegEx Basics

The more of these you learn, the more helpful regular expressions become. You don’t have to learn all of them.

Anchors

• “Anchors” match a position in the text rather than the text itself:
– ^ (caret) will match the beginning of a line
– $ (dollar sign) will match the end of a line

Example: word word word word

• ^word will match only the first “word” in “word word word word”

• word$ will match only the last “word” in “word word word word”
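The anchors can be tried out in a few lines of Python. Python's re module is not part of the original deck; it is used here only as a convenient sandbox, and the variable names are illustrative.

```python
import re

text = "word word word word"

# ^word only matches at the beginning of the line.
at_start = [m.span() for m in re.finditer(r"^word", text)]

# word$ only matches at the end of the line.
at_end = [m.span() for m in re.finditer(r"word$", text)]

# Without an anchor, every occurrence matches.
anywhere = [m.span() for m in re.finditer(r"word", text)]

print(at_start)       # [(0, 4)]
print(at_end)         # [(15, 19)]
print(len(anywhere))  # 4
```

The spans are (start, end) character positions, which makes it easy to see that each anchored pattern matched exactly one of the four words.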

Character Classes

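• [ starts a character class
• ] ends a character class
– Any one of the characters within [ ] will be matched

Note: ranges like [G-V] (uppercase letters G through V) or [0-9] (digits 0 through 9) also work. A range covers single characters only, so [1-10] does not mean “1 through 10”; it matches just the characters 1 and 0.

Example: hnaeyesdtlaeck

• [nedl] will match each n, e, d and l in “hnaeyesdtlaeck”, picking out the letters n-e-e-d-l-e

Example: Do you do SEO or SEM?

• SE[OM] will match both “SEO” and “SEM” in “Do you do SEO or SEM?”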
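A quick way to check what a character class picks out is to list the matches. The sketch below uses Python's re module, which is an assumption of this example rather than a tool named in the slides; the third input string is made up.

```python
import re

# Each character in the class matches on its own, one character at a time.
needle_letters = re.findall(r"[nedl]", "hnaeyesdtlaeck")
print("".join(needle_letters))  # needle

# A class inside a longer pattern: SE followed by either O or M.
acronyms = re.findall(r"SE[OM]", "Do you do SEO or SEM?")
print(acronyms)  # ['SEO', 'SEM']

# Ranges are per character: [1-10] is only the characters 1 and 0.
digits = re.findall(r"[1-10]", "10 29 31")
print(digits)  # ['1', '0', '1']
```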

Miscellaneous Special Characters

• | (pipe) means OR

Example: this or that?
– this|that will match both “this” and “that” in “this or that?”

• . (period) represents any single character (wildcard)

Example: Excuse my French; detect profanity like shit, sh#t, or sh!t.
– sh.t will match “shit”, “sh#t” and “sh!t” in “Detect profanity like shit, sh#t, or sh!t.”
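Both special characters can be checked against the slide's own example strings. This is a minimal sketch in Python (an assumption of this example, not a tool from the deck).

```python
import re

# The pipe offers alternatives: either side may match.
either = re.findall(r"this|that", "this or that?")
print(either)  # ['this', 'that']

# The period stands in for exactly one character, whatever it is.
sentence = "Detect profanity like shit, sh#t, or sh!t."
masked = re.findall(r"sh.t", sentence)
print(masked)  # ['shit', 'sh#t', 'sh!t']
```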

Escaping Characters

Many characters have special meanings in regular expressions, so to find one of them as a literal character you must “escape” it by putting a backslash in front of it.

Example: I want to find the period.

– \. will match only the period at the end of “I want to find the period.”

– If I used just a period without escaping it with a backslash:

. will match every character in “I want to find the period.”
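The difference between the escaped and unescaped period is easy to see by counting matches. The sketch below uses Python, which is an assumption of this example; re.escape is Python's helper for escaping a whole literal string, and the sample URL fragment is hypothetical.

```python
import re

text = "I want to find the period."

escaped = re.findall(r"\.", text)   # literal period only
unescaped = re.findall(r".", text)  # any character at all

print(escaped)                      # ['.']
print(len(unescaped) == len(text))  # True

# re.escape adds the backslashes for you when a whole string is meant literally.
literal = "index.php?id=1"  # hypothetical example string
safe_pattern = re.escape(literal)
print(re.fullmatch(safe_pattern, literal) is not None)           # True
print(re.fullmatch(safe_pattern, "indexXphp?id=1") is not None)  # False
```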

Quantifiers

• ? (question mark) means optional. It matches 0 or 1 of the previous character, essentially making it optional.

Example: is the url http or https?

– https? will match both “http” and “https” in “is the url http or https?”

• * (asterisk) means zero or more. It will find 0 or more occurrences of the previous character.

Example #1: What’s that photo website again? Is it Flickr, Flicker, or Flickeeer?

– Flicke*r will match “Flickr”, “Flicker” and “Flickeeer”

Example #2: hlp help heelp heeeeeeeelp

– he*lp will match all four: “hlp”, “help”, “heelp” and “heeeeeeeelp”

Quantifiers - Continued

• + (plus) means one or more. It will find 1 or more occurrences of the previous character.

Example #1: hlp help heelp heeeeeeeelp

– he+lp will match “help”, “heelp” and “heeeeeeeelp”, but not “hlp”

Example #2: hlp help heelp heeeeeeeelp hellllllllp

• h.+lp will match the entire line as a single match, because . matches any character (spaces included) and + is greedy, so the match runs from the first “h” to the last “lp”

Understanding Differences Between Quantifiers

Animated GIF Example
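Since the animation does not survive in text form, here is a small side-by-side comparison of ?, * and + on the same input. It is a sketch in Python (an assumption of this example, not the content of the original GIF), reusing the “hlp help” string from the earlier slides.

```python
import re

text = "hlp help heelp heeeeeeeelp"

# Same pattern three times; only the quantifier after the "e" changes.
results = {
    pattern: re.findall(pattern, text)
    for pattern in (r"he?lp", r"he*lp", r"he+lp")
}

for pattern, matches in results.items():
    print(pattern, matches)
# he?lp ['hlp', 'help']
# he*lp ['hlp', 'help', 'heelp', 'heeeeeeeelp']
# he+lp ['help', 'heelp', 'heeeeeeeelp']
```

In words: ? allows zero or one “e”, * allows any number including none, and + requires at least one.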

Quantifiers - Continued

• { } will match a certain quantity of the previous character. You can also specify a range, like “1 to 3” or “3 or more”, if you include a , (comma) inside the braces.

Example #1: buz buzz buzzz buzzzz buzzzzz

– buz{3} will match “buzzz”, including the first “buzzz” inside “buzzzz” and “buzzzzz”

Note: {3} reads “exactly 3” in plain English.

– buz{2,4} will match “buzz”, “buzzz” and “buzzzz”, plus the first “buzzzz” inside “buzzzzz”

Note: {2,4} reads “2 to 4” in plain English.
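The brace quantifier can be checked the same way. This sketch uses Python (an assumption of this example); the \b word-boundary version is an addition not shown on the slide, included to show one way of requiring the whole word to have exactly three z's.

```python
import re

text = "buz buzz buzzz buzzzz buzzzzz"

exactly_three = re.findall(r"buz{3}", text)
two_to_four = re.findall(r"buz{2,4}", text)

# \b marks a word boundary, so longer words no longer match partially.
whole_word_three = re.findall(r"\bbuz{3}\b", text)

print(exactly_three)     # ['buzzz', 'buzzz', 'buzzz']
print(two_to_four)       # ['buzz', 'buzzz', 'buzzzz', 'buzzzz']
print(whole_word_three)  # ['buzzz']
```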

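Groups

• Groups are enclosed in parentheses ( ), so a quantifier can apply to the whole group rather than to a single character.

Example: hahaha haha ha haha ha!

– (ha)+ will match “hahaha”, “haha”, “ha”, “haha” and “ha” in “hahaha haha ha haha ha!”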
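To see what the parentheses change, compare the grouped pattern with an ungrouped one. This is a sketch in Python (an assumption of this example); the second input string is made up.

```python
import re

text = "hahaha haha ha haha ha!"

# finditer + group(0) gives each full match; with a capture group,
# re.findall would return only the last repetition of the group.
laughs = [m.group(0) for m in re.finditer(r"(ha)+", text)]
print(laughs)  # ['hahaha', 'haha', 'ha', 'haha', 'ha']

# Without parentheses, + applies only to the "a".
ungrouped = re.findall(r"ha+", "ha haa hahaha")
print(ungrouped)  # ['ha', 'haa', 'ha', 'ha', 'ha']
```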

Capture Groups

• Groups can also be captured as variables that can be repeated back:
– $1 would display the contents of the first group, $2 would display the contents of the second group, and so on.

Example: hello I am paul

– hello I am (.+) used with $1 will capture “paul”

• To disable capturing for a group we use (?: ), so that the group is used solely for grouping patterns together.

So with the above example, hello I am (?:.+) will still match but will not capture anything.
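Here is how capturing and non-capturing groups behave in practice, sketched in Python (an assumption of this example). Note that Python writes the first group as \1 in a replacement string where many tools, including the ones in this deck, write $1; the replacement text is made up.

```python
import re

sentence = "hello I am paul"

# Capturing group: the part inside ( ) is available as group 1.
captured = re.search(r"hello I am (.+)", sentence)
name = captured.group(1)
print(name)  # paul

# Repeating the captured text back in a replacement (\1 in Python, $1 elsewhere).
greeting = re.sub(r"hello I am (.+)", r"nice to meet you, \1", sentence)
print(greeting)  # nice to meet you, paul

# Non-capturing group: the pattern still matches, but nothing is captured.
uncaptured = re.search(r"hello I am (?:.+)", sentence)
print(uncaptured.groups())  # ()
```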

Lookarounds

• A Positive Lookahead requires that a group follows the main pattern, without actually including it in the result. The expression is (?=)

Example: 1in 250px 2in 3em 40px
– [0-9]+(?=px) will match “250” and “40” in “1in 250px 2in 3em 40px”

Every number WITH “px” after it

• A Negative Lookahead specifies a group that must not follow the main pattern. The expression is (?!)

Example: 1in 250px 2in 3em 40px
– [0-9]+(?!em) will match “1”, “250”, “2” and “40” in “1in 250px 2in 3em 40px”

Every number BUT the one followed by “em”
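The two lookaheads can be verified on the slide's example string. This sketch uses Python (an assumption of this example). The last two lines go beyond the slide: they show a pitfall with multi-digit numbers and one possible way around it, with made-up inputs.

```python
import re

text = "1in 250px 2in 3em 40px"

with_px = re.findall(r"[0-9]+(?=px)", text)
not_em = re.findall(r"[0-9]+(?!em)", text)
print(with_px)  # ['250', '40']
print(not_em)   # ['1', '250', '2', '40']

# Pitfall: with more than one digit, the engine backtracks to a shorter number.
partial = re.findall(r"[0-9]+(?!em)", "30em")
print(partial)  # ['3']

# Also forbidding a following digit keeps the number whole.
strict = re.findall(r"[0-9]+(?![0-9]|em)", "30em 40px")
print(strict)  # ['40']
```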

RegEx in Practice

Real Use Cases

Problem #1

I want to take a list of >2,000 Mashable.com URLs, exported from BuzzSumo.com, and sort the <title>s into different segments (list posts, titles phrased as questions, etc.) to see which ones received a greater number of social shares.

What is the fastest way of doing this?

Hint:

Solution #1: SEO Tools for Excel Add-on w/ RegEx

• Is the post title a question?
– =RegexpIsMatch(A2,"\?$")

• Is the post a listicle/list post?
– =RegexpIsMatch(A2,"^[0-9]+\s|^[0-9]+\,[0-9]+\s")

• Extract publishing year from URL
– =RegexpFind(D2,"https?:\/\/(?:www\.)?mashable\.com\/([0-9]{4})\/.+","$1")

• Presence of a year in the title
– =IFERROR(RegexpFind(A40,"([0-9]{4})","$1"),"N/A")
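For readers without the Excel add-on, the same four checks can be sketched in Python. This is an illustration under stated assumptions, not the deck's method: the function names, sample titles and sample URL are all hypothetical, and the list-post pattern uses + (one or more digits) so that a title starting with a space does not count as a list post.

```python
import re
from typing import Optional

def is_question(title: str) -> bool:
    """True when the title ends with a question mark."""
    return re.search(r"\?$", title) is not None

def is_list_post(title: str) -> bool:
    """True when the title starts with a number such as '10 ' or '1,000 '."""
    return re.search(r"^[0-9]+\s|^[0-9]+,[0-9]+\s", title) is not None

def year_from_url(url: str) -> Optional[str]:
    """Pull the four-digit year out of a mashable.com URL path, if present."""
    match = re.search(r"https?://(?:www\.)?mashable\.com/([0-9]{4})/.+", url)
    return match.group(1) if match else None

def year_in_title(title: str) -> str:
    """Return the first four-digit run in the title, or 'N/A'."""
    match = re.search(r"([0-9]{4})", title)
    return match.group(1) if match else "N/A"

# Hypothetical sample data, not taken from the BuzzSumo export.
print(is_question("Is This the End of Email?"))                      # True
print(is_list_post("10 Tips for Better Sleep"))                      # True
print(year_from_url("http://mashable.com/2014/03/05/example-post/")) # 2014
print(year_in_title("Best Gadgets of 2013"))                         # 2013
```

As in the Excel version, the year check is loose: any four consecutive digits count, so a title containing a number such as 10000 would also report a “year”.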

Nice! Took < 1 Minute.

Problem #2

• There are hundreds of pages with <span> tags that should be rendered as <h2>. Some have class and/or id attributes and some don’t. I want to grab the contents (only) of these span tags for a client.

What is the fastest way?

…RegEx!

Solution #2: SEO Tools for Excel Add-on w/ RegEx

• For a list of URLs in Excel, and again with the SEO Tools for Excel add-on, use a regular expression like this:
– =RegexpFindOnUrl(D3,"<span(?:.+)?>(.+)<\/span>",1)
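The span pattern can be tested offline against a snippet of HTML. The sketch below uses Python (an assumption of this example) and a made-up HTML string. It also shows a tighter variant, which is a suggestion rather than part of the slide: because .+ is greedy, the slide's pattern returns only the last span when several sit on one line.

```python
import re

# Hypothetical HTML: one span with a class attribute, one without.
html = '<span class="title">First heading</span><p>Body</p><span>Second heading</span>'

# The pattern from the slide (the \/ escape is not needed in Python).
slide_pattern = r"<span(?:.+)?>(.+)</span>"
greedy_result = re.search(slide_pattern, html).group(1)
print(greedy_result)  # Second heading

# Tighter sketch: [^>]* stays inside the opening tag, .*? stops at the first </span>.
all_spans = re.findall(r"<span[^>]*>(.*?)</span>", html)
print(all_spans)  # ['First heading', 'Second heading']
```

For anything more complex than flat, predictable markup, an HTML parser or XPath is usually the safer tool.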

Problem #3:

• I want to grab the full description from a long list of YouTube videos. We can grab it from the meta description, but it might be an incomplete description that is truncated, so we need to grab the actual page text.

What’s the fastest way?

…Probably XPath, but we can also use RegEx

Solution #3: SEO Tools for Excel Add-on

• For a list of YouTube video URLs in Excel, use the SEO Tools for Excel Add-on with the following regular expression:
– =RegexpFindOnUrl(A1,"<p id=.eow\-description.\s?>(.+)<\/p>",1)

Please note that because the HTML contains double quotes, which would break the Excel formula, you have to put another character in their place, like the period, which represents ANY character.
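The trick of letting a period stand in for a double quote can be checked on a small sample. This is a sketch in Python (an assumption of this example); the HTML string is made up, and the eow-description id is simply the one quoted on the slide, so it may not reflect YouTube's current markup.

```python
import re

# Hypothetical snippet; only the id value comes from the slide.
html = '<p id="eow-description" >The full video description.</p>'

# Each . before and after the id value stands in for a double quote.
pattern = r"<p id=.eow-description.\s?>(.+)</p>"

match = re.search(pattern, html)
description = match.group(1) if match else None
print(description)  # The full video description.
```

The same pattern would also accept single quotes around the id value, since the period matches any character.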

Problem #4

• I want to quickly change a long list of keywords into the exact match format with the keyword surrounded by brackets, [ ].

What’s the fastest way?

Solution #4: Notepad++ Example

1. Copy a column of keywords from Excel into Notepad++

2. Control + F and switch to the “Replace” tab.

3. Switch the “Search Mode” to “Regular Expression”

4. Enter ^ in the “Find what” field and [ in the “Replace with” field.

5. Hit the “Replace All” button.

6. Then, enter $ in the “Find what” field and ] in the “Replace with” field.

7. Again, hit the “Replace All” button.
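The same two replacements can be scripted. The sketch below uses Python as a stand-in for Notepad++ (an assumption of this example), with a made-up keyword list; re.MULTILINE makes ^ and $ work on every line, which is how they behave in the steps above.

```python
import re

# Hypothetical keyword column, one keyword per line, no trailing newline.
keywords = "blue widgets\nred widgets\ncheap widgets"

# Steps 4-5: put [ at the beginning of every line.
opened = re.sub(r"^", "[", keywords, flags=re.MULTILINE)

# Steps 6-7: put ] at the end of every line.
exact_match = re.sub(r"$", "]", opened, flags=re.MULTILINE)
print(exact_match)

# Alternative sketch: one pass with a capture group around the whole line.
one_pass = re.sub(r"^(.+)$", r"[\1]", keywords, flags=re.MULTILINE)
print(one_pass == exact_match)  # True
```

The one-pass variant has the side benefit of skipping blank lines, since .+ needs at least one character.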

Problem #5

• I want to identify which keywords from Google Webmaster Tools are Branded/Non-Branded, including misspellings, from our SQL database in Spotfire.

What’s the fastest way?

A Solution: Calculated Column with ~= Operator

• Create a calculated column with an expression like the one below:

If([keyword]~="unstopable|unstopables|unstoppable|unstoppables|instopable|instopabales|[ui]nstop[a-z]+?b[a-z]+?s?|(scent booster)|(scent boosters)",true,false)

– This should find spellings/misspellings of Downy Unstopables
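The branded pattern can be sanity-checked outside Spotfire before it goes into a calculated column. This is a sketch in Python, assuming the ~= operator finds the pattern anywhere in the keyword (as re.search does); the function name and test keywords are hypothetical.

```python
import re

# The pattern from the slide, split across two lines for readability.
BRAND_PATTERN = re.compile(
    r"unstopable|unstopables|unstoppable|unstoppables|instopable|instopabales"
    r"|[ui]nstop[a-z]+?b[a-z]+?s?|(scent booster)|(scent boosters)"
)

def is_branded(keyword: str) -> bool:
    """True when the keyword contains a brand term or a likely misspelling."""
    return BRAND_PATTERN.search(keyword) is not None

# Hypothetical keywords for a quick check.
for keyword in ("downy unstopables", "unstopabls coupon",
                "scent boosters for laundry", "laundry detergent"):
    print(keyword, is_branded(keyword))
# downy unstopables True
# unstopabls coupon True
# scent boosters for laundry True
# laundry detergent False
```

The flexible alternative, [ui]nstop[a-z]+?b[a-z]+?s?, is what catches misspellings that were not listed explicitly, such as “unstopabls”.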

Other Places We Might Use RegEx

Google Analytics supports regular expressions:

– When creating filters
– When setting up goals
– When defining goal funnel steps
– When defining advanced segments
– When using report filters
– When using filters in multichannel reporting

h/t Annie Cushing

Other Places We Might Use RegEx

.htaccess
– Redirect a set of URLs matching a certain pattern to a new URL pattern:

Example: RewriteRule ^/dir/index.php?id=([0-9]+)\.htm$ file-$1 [L]

Screaming Frog
– URL Rewriting: RegEx Replace
– Spider Include/Exclude URLs

Other Places We Might Use RegEx

Deepcrawl

Resources

Helpful tool for testing RegEx that gives a good breakdown of your patterns:
• http://www.regexr.com/

A handy cheat sheet to print and put on your desk:
• http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/pdf/

SEO Tools for Excel Add-on
• http://nielsbosma.se/projects/seotools/

Notepad++
• http://notepad-plus-plus.org/