Search Engine Optimization (SEO)
Agenda
- What is a search engine?
- Examples of popular search engines
- Search engine statistics
- Why is search engine marketing important?
- What is an SEO algorithm?
- Steps to developing a good SEO strategy
- Ranking factors
- Basic tips for optimization
Slide 3
Examples of popular search engines
Slide 4
Slide 5
How Do Search Engines Work? Mechanics of a typical search
Slide 6
Results and ads returned, ranked
Slide 7
Category of first result
Slide 8
Result for phrase query
Slide 9
How Do Search Engines Work?
- A spider crawls the web to find new documents (web pages and other documents), typically by following hyperlinks from websites already in the search engine's database.
- The search engine indexes the content (text, code) of these documents by adding it to its database, and periodically updates this content.
- When a user enters a search, the engine searches its own database for related documents; it does not search web pages in real time.
- The search engine ranks the resulting documents using an algorithm (a mathematical formula) that assigns weights to various ranking factors.
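The crawl, index, search and rank steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the three documents are made up, the index is a plain in-memory dictionary, and ranking simply counts how often the query terms occur. Real engines use far richer ranking factors.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()

def build_index(documents):
    """Build an inverted index: term -> {doc_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Return ids of documents containing every query term, best match first."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)

    def score(doc_id):
        # Toy ranking factor (an assumption): total occurrences of the query terms.
        return sum(posting[doc_id] for posting in postings)

    return sorted(candidates, key=lambda d: (-score(d), d))

# Hypothetical documents standing in for pages a spider has already crawled.
documents = {
    "a.html": "Computer chess engines play chess very well",
    "b.html": "Buy a new computer today",
    "c.html": "Chess openings for beginners",
}
index = build_index(documents)
print(search(index, "chess"))           # ['a.html', 'c.html']
print(search(index, "computer chess"))  # ['a.html']
```

Note that the search only consults the index built beforehand; no page is fetched at query time.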
Slide 10
Search on the Web
- Corpus: the publicly accessible Web, static + dynamic
- Goal: retrieve high-quality results relevant to the user's need (not docs!)
- Need
  - Informational: want to learn about something (e.g. "Low hemoglobin")
  - Navigational: want to go to that page (e.g. "United Airlines")
  - Transactional: want to do something (web-mediated)
    - Access a service (e.g. "Tampere weather")
    - Downloads (e.g. "Mars surface images")
    - Shop (e.g. "Nikon CoolPix")
  - Gray areas
    - Find a good hub (e.g. "Car rental Finland")
    - Exploratory search, "see what's there" (e.g. "Abortion morality")
Slide 11
Search Engines as Info Gatekeepers
- Search engines are becoming the primary entry point for discovering web pages.
- Ranking of web pages influences which pages users will view.
- Exclusion of a site from search engines will cut off the site from its intended audience.
- The privacy policy of a search engine is important.
Slide 12
100+ Billion Searches / Month
Slide 13
Search Engine Wars
- The battle for domination of the web search space is heating up!
- The competition is good news for users!
- Crucial: advertising is combined with search results!
- What if one of the search engines manages to dominate the space?
Slide 14
Yahoo!
- Synonymous with the dot-com boom, probably the best known brand on the web.
- Started off as a web directory service in 1994, acquired leading search engine technology in 2003.
- Has very strong advertising and e-commerce partners.
Slide 15
Lycos!
- One of the pioneers of the field
- Introduced innovations that inspired the creation of Google
Slide 16
Google
- The verb "google" has become synonymous with searching for information on the web.
- Has raised the bar on search quality.
- Has been the most popular search engine in the last few years.
- Had a very successful IPO in August 2004.
- Is innovative and dynamic.
Slide 17
Live Search (was: MSN Search)
- Synonymous with PC software.
- Remember its victory in the browser wars with Netscape.
- Developed its own search engine technology only recently, officially launched in Feb. 2005.
- May link web search into its next version of Windows.
Slide 18
Why is Search Engine Marketing Important?
- 80% of consumers find your website by first typing a query into a search engine box (Google, Yahoo, Bing)
- 90% choose a site listed on the first page
- 85% of all traffic on the internet is referred by search engines
- The top three organic positions receive 59% of user clicks
- Cost-effective advertising
- Clear and measurable ROI
- Operates under this assumption: More (relevant) traffic + Good conversion rate = More sales/leads
Slide 19
Experiment with query syntax
- Default is AND, e.g. computer chess is normally interpreted as computer AND chess, i.e. both keywords must be present in all hits.
- +chess in a query means the user insists that chess be present in all hits.
- computer OR chess means at least one of the keywords must be present in all hits.
- "computer chess" (in quotes) means that the phrase computer chess must be present in all hits.
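To experiment with these operators offline, here is a small sketch of a matcher for the four cases listed: default AND, +term, OR, and a quoted phrase. It is a simplification, not how any particular engine parses queries: OR is assumed to split the whole query into alternatives, and `parse_query` and `matches` are names made up for this example.

```python
import re

def parse_query(query):
    """Split a query into OR-alternatives, each a list of required words or phrases."""
    tokens = re.findall(r'"[^"]+"|\S+', query)
    groups, current = [], []
    for token in tokens:
        if token == "OR":
            groups.append(current)
            current = []
        else:
            # "+chess" and "chess" are both required here; quotes mark a phrase.
            current.append(token.lstrip("+").strip('"').lower())
    groups.append(current)
    return [group for group in groups if group]

def matches(query, text):
    """True if the text satisfies at least one OR-alternative of the query."""
    words = re.findall(r"\w+", text.lower())
    padded = " " + " ".join(words) + " "

    def has(item):
        # Works for single words and for multi-word phrases alike.
        return " " + item + " " in padded

    return any(all(has(item) for item in group) for group in parse_query(query))

print(matches("computer chess", "chess on a computer"))      # True: both words present
print(matches('"computer chess"', "chess on a computer"))    # False: phrase not present
print(matches("computer OR chess", "chess openings"))        # True: one word is enough
print(matches("+chess", "computer games"))                   # False: chess is missing
```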
Slide 20
The most popular search keywords

AltaVista (1998)   AlltheWeb (2002)   Excite (2001)
sex                free
applet             sex
porno              download           pictures
mp3                software           new
chat               uk                 nude
Slide 21
Free Keyword Research Tools
- https://adwords.google.com/o/Targeting/Explorer?__c=1000000000&__u=1000000000&__o=te&ideaRequestType=KEYWORD_IDEAS#search.none
  Keyword Tool and Traffic Estimator to identify competitive phrases and search frequencies
- http://www.google.com/insights/search
  Compare search patterns across specific regions, categories, time frames and properties
Slide 22
Web search users
- Ill-defined queries
  - Short length
  - Imprecise terms
  - Sub-optimal syntax (80% of queries without an operator)
  - Low effort in defining queries
- Wide variance in
  - Needs
  - Expectations
  - Knowledge
  - Bandwidth
- Specific behavior
  - 85% look over one result screen only (mostly above the fold)
  - 78% of queries are not modified (1 query/session)
  - Follow links: "the scent of information"...
Slide 23
How far do people look for results?
Slide 24
Architecture of a Search Engine
[Diagram: The Web, Web spider, Indexer, Indexes, Ad indexes, Search, User]
Slide 25
Slide 26
Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled.
Slide 27
Crawling picture
[Diagram: Web, Seed pages, URLs crawled and parsed, URLs frontier, Unseen Web]
Sec. 20.2
Slide 28
Motivation for crawlers
- Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
- Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
- Business intelligence: keep track of potential competitors, partners
- Monitor Web sites of interest
- Evil: harvest emails for spamming, phishing
- Can you think of some others?
Slide 29
A crawler within a search engine
[Diagram: Web, googlebot, Page repository, Text & link analysis, Text index, PageRank, Query, Ranker, hits]
Slide 30
One taxonomy of crawlers
Many other criteria could be used: incremental, interactive, concurrent, etc.
Slide 31
Basic crawlers
- This is a sequential crawler
- Seeds can be any list of starting URLs
- Order of page visits is determined by the frontier data structure
- Stop criterion can be anything
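A sequential crawler of this kind can be sketched as follows. To keep the example runnable without network access, the "web" is a made-up dictionary of links, and `fetch_links` is a hypothetical stand-in for downloading and parsing a page. The stop criterion used here (a page budget) is just one possible choice.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c", "d"],
    "c": [],
    "d": ["e"],
    "e": [],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its outgoing links."""
    return TOY_WEB.get(url, [])

def crawl(seeds, max_pages=100):
    """Visit pages one at a time, starting from the seeds."""
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued or visited
    visited = []
    while frontier and len(visited) < max_pages:   # stop criterion: page budget
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl(["seed"]))               # ['seed', 'a', 'b', 'c', 'd', 'e']
print(crawl(["seed"], max_pages=3))  # ['seed', 'a', 'b']
```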
Slide 32
Graph traversal (BFS or DFS?)
- Breadth First Search
  - Implemented with QUEUE (FIFO)
  - Finds pages along shortest paths
  - If we start with "good" pages, this keeps us close; maybe other good stuff
- Depth First Search
  - Implemented with STACK (LIFO)
  - Wander away ("lost in cyberspace")
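The difference between the two traversals comes down to which end of the frontier the next URL is taken from. The sketch below uses a small made-up site graph; `visit_order` is a name invented for this example.

```python
from collections import deque

def visit_order(graph, seed, strategy="bfs"):
    """Return the order in which pages are visited from the seed."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # FIFO (queue) gives breadth-first; LIFO (stack) gives depth-first.
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# Hypothetical site: a home page with a shallow branch and a deep branch.
graph = {
    "home": ["news", "shop"],
    "news": ["story1"],
    "shop": ["item1"],
    "item1": ["item1/reviews"],
    "item1/reviews": ["item1/reviews/page2"],
}

print(visit_order(graph, "home", "bfs"))
# ['home', 'news', 'shop', 'story1', 'item1', 'item1/reviews', 'item1/reviews/page2']
print(visit_order(graph, "home", "dfs"))
# ['home', 'shop', 'item1', 'item1/reviews', 'item1/reviews/page2', 'news', 'story1']
```

With the queue, every page one click from home is visited before anything deeper; with the stack, the crawler follows the shop branch to its end before it ever reaches the news page.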
Slide 33
Universal crawlers
- Support universal search engines
- Large-scale
- Huge cost (network bandwidth) of crawl is amortized over many queries from users
- Incremental updates to existing index and other data repositories
Slide 34
Large-scale universal crawlers
Two major issues:
1. Performance: need to scale up to billions of pages
2. Policy: need to trade off coverage, freshness, and bias (e.g. toward "important" pages)
Slide 35
Large-scale crawlers: scalability
- Need to minimize overhead of DNS lookups
- Need to optimize utilization of network bandwidth and disk throughput (I/O is the bottleneck)
- Use asynchronous sockets
  - Multi-processing or multi-threading does not scale up to billions of pages
  - Non-blocking: hundreds of network connections open simultaneously
  - Polling sockets to monitor completion of network transfers
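The non-blocking idea can be illustrated with Python's asyncio, which keeps many transfers in flight on a single thread. This is a sketch only: `fetch` sleeps instead of opening a real socket, the example.com URLs are placeholders, and the limit of 100 simultaneous connections is an arbitrary assumption.

```python
import asyncio

async def fetch(url, delay=0.01):
    """Stand-in for a non-blocking HTTP request; sleeps instead of using the network."""
    await asyncio.sleep(delay)
    return url, f"<html>content of {url}</html>"

async def crawl_batch(urls, max_connections=100):
    """Fetch all URLs concurrently, with at most max_connections in flight."""
    limit = asyncio.Semaphore(max_connections)

    async def bounded_fetch(url):
        async with limit:
            return await fetch(url)

    results = await asyncio.gather(*(bounded_fetch(url) for url in urls))
    return dict(results)

urls = [f"http://example.com/page{i}" for i in range(300)]
pages = asyncio.run(crawl_batch(urls))
print(len(pages))  # 300
```

Because the waits overlap, the batch takes roughly as long as a few simulated requests, not 300 of them in sequence.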
Slide 36
Universal crawlers: policy
- Coverage
  - New pages get added all the time
  - Can the crawler find every page?
- Freshness
  - Pages change over time, get removed, etc.
  - How frequently can a crawler revisit?
- Trade-off!
  - Focus on most "important" pages (crawler bias)?
  - "Importance" is subjective
Slide 37
Web coverage by search engine crawlers
This assumes we know the size of the entire Web. Do we? Can you define "the size of the Web"?
Slide 38
Maintaining a "fresh" collection
- Universal crawlers are never "done"
- High variance in rate and amount of page changes
- HTTP headers are notoriously unreliable
  - Last-modified
  - Expires
- Solution
  - Estimate the probability that a previously visited page has changed in the meanwhile
  - Prioritize by this probability estimate
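One way to make the "estimate and prioritize" idea concrete is sketched below. It assumes, purely for illustration, that page changes follow a Poisson process, so the probability of at least one change after t days at a rate of r changes per day is 1 - exp(-r * t). The slides do not say which estimator to use, and the page names and rates are made up.

```python
import math

def change_probability(changes_per_day, days_since_visit):
    """P(page changed since last visit), assuming Poisson-distributed changes."""
    return 1.0 - math.exp(-changes_per_day * days_since_visit)

def revisit_order(pages):
    """Sort URLs so the page most likely to have changed comes first.

    pages maps url -> (estimated changes per day, days since last visit).
    """
    return sorted(pages, key=lambda url: change_probability(*pages[url]), reverse=True)

# Hypothetical crawl history.
pages = {
    "news-front-page": (4.0, 0.5),    # changes often, visited 12 hours ago
    "company-about":   (0.01, 30.0),  # rarely changes, visited a month ago
    "blog-archive":    (0.1, 7.0),    # changes now and then, visited a week ago
}

for url in revisit_order(pages):
    print(url, round(change_probability(*pages[url]), 3))
```

In practice the change rate itself has to be estimated from what the crawler observed on earlier visits.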
Slide 39
Do we need to crawl the entire Web?
- If we cover too much, it will get stale
- There is an abundance of pages on the Web
- For PageRank, pages with very low prestige are largely useless
- What is the goal?
  - General search engines: pages with high prestige
  - News portals: pages that change often
  - Vertical portals: pages on some topic
- What are appropriate priority measures in these cases? Approximations?
Slide 40
Complications
- Web crawling isn't feasible with one machine
  - All of the above steps distributed
- Malicious pages
  - Spam pages
  - Spider traps, including dynamically generated ones
- Even non-malicious pages pose challenges
  - Latency/bandwidth to remote servers vary
  - Webmasters' stipulations
    - How "deep" should you crawl a site's URL hierarchy?
  - Site mirrors and duplicate pages
- Politeness: don't hit a server too often
Sec. 20.1.1
Slide 41
Your guide for the search engines
Slide 42
What is robots.txt?
It's a file in the root of your website that can either allow or restrict search engine robots from crawling pages on your website.
Slide 43
How does it work?
Before a search engine robot crawls your website, it will first look for your robots.txt file to find out where you want it to go.
There are 3 things you should keep in mind:
- Robots can ignore your robots.txt. Malware robots scanning the web for security vulnerabilities, or email address harvesters used by spammers, will not care about your instructions.
- The robots.txt file is public. Anyone can see what areas of your website you don't want robots to see.
- Search engines can still index (but not crawl) a page you've disallowed, if it's linked to from another website. The search results will then show only the URL, but usually no title or information snippet. Instead, make use of the robots meta tag for that page.
Slide 44
What to put in your robots.txt file
- User-agent: This is the line where you define which robot you're talking to. It's like saying hello to the robot:
    User-agent: *
  (or a specific robot name, e.g. Googlebot for Google, Slurp for Yahoo)
- Disallow: This tells the robots what you don't want them to crawl on your site:
    Disallow: /          (do not crawl anything on my site)
    Disallow: /images/
- Allow: This tells the robots what you want them to crawl on your site:
    Allow: /
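From the crawler's side, these directives can be checked with Python's standard `urllib.robotparser`. The sketch below is illustrative: the robot names ExampleBot and BadBot and the example.com URLs are made up, and the file is parsed from a string instead of being downloaded. As far as I know this parser applies the first matching rule in file order, which is why the Disallow line is placed before Allow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, combining the directives from the slide.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /images/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Any robot without its own section falls under "User-agent: *".
print(parser.can_fetch("ExampleBot", "https://example.com/about.html"))       # True
print(parser.can_fetch("ExampleBot", "https://example.com/images/logo.png"))  # False
# BadBot is told not to crawl anything.
print(parser.can_fetch("BadBot", "https://example.com/about.html"))           # False
```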
Slide 45
What to put in your robots.txt file
- * (Asterisk / wildcard): With the * symbol, you tell the robots to match any number of any characters. Very useful, for example, when you don't want your internal search result pages to be indexed.
    Disallow: *contact*  (do not crawl any URLs containing the word contact)
- $ (Dollar sign / ends with): The dollar sign tells the robots that it is the end of the URL.
    Disallow: *.pdf$
- # (Hash / comments): You can add comments after the # symbol, either at the start of a line or after a directive.
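The * and $ rules can be understood as a simple translation to a regular expression. The sketch below is one possible interpretation written for illustration, not the matching code of any particular search engine; `rule_to_regex`, `strip_comment` and `is_blocked` are hypothetical helper names.

```python
import re

def strip_comment(line):
    """Drop everything from the first '#' on, as robots.txt comments require."""
    return line.split("#", 1)[0].strip()

def rule_to_regex(rule):
    """Translate a path pattern with * and a trailing $ into a regular expression."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # '*' matches any run of characters; everything else is matched literally.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_blocked(path, disallow_rules):
    """True if any Disallow pattern matches the start of the path."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)

print(strip_comment("Disallow: *.pdf$   # no PDF files"))     # Disallow: *.pdf$

rules = ["*contact*", "*.pdf$"]
print(is_blocked("/contact-us.html", rules))                  # True
print(is_blocked("/files/report.pdf", rules))                 # True
print(is_blocked("/files/report.pdf?download=1", rules))      # False: does not end in .pdf
print(is_blocked("/about.html", rules))                       # False
```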
Slide 46
What to put in your robots.txt file
- Crawl-delay: This directive asks the robot to wait a certain number of seconds after each time it has crawled a page on your website.
    Crawl-delay: 5
- Request-rate: Here you tell the robot how many pages you want it to crawl within a certain number of seconds. The first number is pages, and the second number is seconds.
    Request-rate: 1/5    # load 1 page per 5 seconds
- Visit-time: It's like opening hours, i.e. when you want the robots to visit your website. This can be useful if you don't want the robots to visit your website during busy hours (when you have lots of human visitors).
    Visit-time: 2100-0500    # only visit between 21:00 (9PM) and 05:00 (5AM) UTC (GMT)
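A crawler that wants to honor these timing directives might read them as sketched below. Python's standard `urllib.robotparser` exposes Crawl-delay and Request-rate; to my knowledge it does not parse Visit-time, so `in_visit_window` is a hand-written helper made up for this example, as is the robot name ExampleBot.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Request-rate: 1/5   # load 1 page per 5 seconds
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

delay = parser.crawl_delay("ExampleBot")    # seconds to wait between pages
rate = parser.request_rate("ExampleBot")    # named tuple: (requests, seconds)

# Use whichever directive asks for the longer pause between requests.
pause = max(delay or 0, rate.seconds / rate.requests if rate else 0)
print(delay, rate.requests, rate.seconds, pause)   # 5 1 5 5.0

def in_visit_window(hhmm, window="2100-0500"):
    """True if a UTC time given as an HHMM integer falls inside the Visit-time window."""
    start, end = (int(part) for part in window.split("-"))
    if start <= end:
        return start <= hhmm <= end
    return hhmm >= start or hhmm <= end     # window wraps past midnight

print(in_visit_window(2330))   # True
print(in_visit_window(1200))   # False
```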
Slide 47
Test your page https://www.google.com/webmasters/
Slide 48
Search engine optimization
Slide 49
What is SEO?
- SEO = Search Engine Optimization
- Refers to the process of "optimizing" both the on-page and off-page ranking factors in order to achieve high search engine rankings for targeted search terms.
- Also refers to the industry that has grown up around using keyword searching as a means of increasing relevant traffic to a website.
Slide 50
Slide 51
What is an SEO Algorithm?
- Top Secret! Only select employees of a search engine company know for certain
- Reverse engineering, research and experiments give SEOs (search engine optimization professionals) a "pretty good" idea of the major factors and approximate weight assignments
- The SEO algorithm is constantly changed, tweaked and updated
- Websites and documents being searched are also constantly changing
- Varies by search engine: some give more weight to on-page factors, some to link popularity
Slide 52
http://seositecheckup.com/
Slide 53
A good SEO strategy:
- Research desirable keywords and search phrases (WordTracker, Overture, Google AdWords)
- Identify search phrases to target (should be relevant to business/market, obtainable and profitable)
- "Clean" and optimize a website's HTML code for appropriate keyword density, title tag optimization, internal linking structure, headings and subheadings, etc.
- Help in writing copy to appeal to both search engines and actual website visitors
- Study competitors (competing websites) and search engines
- Implement a quality link building campaign
- Add quality content
- Constantly monitor rankings for targeted search terms
Slide 54
Ranking factors
On-Page Factors (Code & Content)
- #1 - Content, Content, Content (Body text)
- #2 - Keyword frequency & density
- #3 - Title tags
- #4 - ALT image tags
- #5 - Header tags
- #6 - Hyperlink text
Off-Page Factors
- #1 - Anchor text
- #2 - Link Popularity ("votes" for your site): adds credibility
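The idea of links as "votes" can be made concrete with a PageRank-style calculation. The slides mention PageRank by name but give no formula, so what follows is the common textbook power iteration, offered as an assumption about how link popularity might be scored. The damping factor of 0.85 is a conventional choice and the four-page link graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages by link popularity using a simple power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three pages link to "c", and "c" links back to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page "c" comes out on top because it collects the most votes, and "a" follows because its single inbound link comes from the highly ranked "c".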
Slide 55
What a Search Engine Sees
View > Source (HTML code)
Slide 56
Pay Per Click
- PPC ads appear as "sponsored listings"
- Companies bid on the price they are willing to pay "per click"
- Typically have very good tracking tools and statistics
- Ability to control ad text
- Can set budgets and spending limits
- Google AdWords and Overture are the two leaders
Slide 57
PPC vs. "Organic" SEO

Pay-Per-Click
- results in 1-2 days
- easier for a novice or one with little knowledge of SEO
- ability to turn on and off at any moment
- generally more costly per visitor and per conversion
- fewer impressions and exposure
- easier to compete in highly competitive market space (but it will cost you)
- ability to generate exposure on related sites (AdSense)
- ability to target "local" markets
- better for short-term and high-margin campaigns

"Organic" SEO
- results take 2 weeks to 4 months
- requires ongoing learning and experience to achieve results
- very difficult to control flow of traffic
- generally more cost-effective, does not penalize for more traffic
- SERPs are more popular than sponsored ads
- very difficult to compete in highly competitive market space
- ability to generate exposure on related websites and directories
- more difficult to target local markets
- better for long-term and lower margin campaigns
Slide 58
Keys to Successful SEO Strategy
1. Do not underestimate the importance of keyword research
2. Be sure to include the proper tags in your page coding
3. You must have optimized content! (3-5 uses of keyword per 250 words)
4. Use content marketing
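The "3-5 uses per 250 words" guideline is easy to check mechanically. The sketch below counts exact occurrences of a keyword or phrase and scales the count to 250 words; the function name and the filler text are invented for the example, and matching is deliberately naive (no stemming or synonyms).

```python
import re

def keyword_uses_per_250_words(text, keyword):
    """Occurrences of a keyword (or multi-word phrase), scaled to 250 words of copy."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    phrase = keyword.lower().split()
    if not words or not phrase:
        return 0.0
    size = len(phrase)
    hits = sum(1 for i in range(len(words) - size + 1) if words[i:i + size] == phrase)
    return hits * 250 / len(words)

# Hypothetical 500-word page copy that uses "seo" 10 times.
copy = " ".join((["seo"] + ["filler"] * 49) * 10)
density = keyword_uses_per_250_words(copy, "seo")
print(density)              # 5.0
print(3 <= density <= 5)    # True: within the guideline
```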
Slide 59
Keyword Selection
- Marketing/Brand Relevance: How closely does the keyword match your product/service offering, messaging, goals and objectives?
- Search Frequency: How many people are searching on the particular keyword?
- Competition: How much competition (large, authority sites) is there for the particular keyword?
- Optimization Opportunity: Is there already a logical place on the site to optimize for the particular keyword?