Agenda
- What is a Search Engine?
- Examples of popular Search Engines
- Search Engines statistics
- Why is Search Engine marketing important?
- What is a SEO Algorithm?
- Steps to developing a good SEO strategy
- Ranking factors
- Basic tips for optimization

Examples of popular Search Engines

How Do Search Engines Work? Mechanics of a typical search

Results & ads returned ranked

Category of first result

Result for phrase query

How Do Search Engines Work?
- A spider crawls the web to find new documents (web pages, other documents), typically by following hyperlinks from websites already in the engine's database.
- The search engine indexes the content (text, code) of these documents by adding it to its database, and periodically updates this content.
- When a user enters a query, the search engine searches its own database for related documents; it does not search web pages in real time.
- The search engine ranks the resulting documents using an algorithm (a mathematical formula) that assigns weights to various ranking factors.
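
The index, search and rank steps above can be sketched in a few lines. This is a toy illustration, not how any real engine works: the `PAGES` corpus, the URLs and the term-count scoring are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical mini-corpus standing in for crawled pages (URL -> text).
PAGES = {
    "http://example.com/a": "computer chess engines play chess",
    "http://example.com/b": "computer hardware reviews",
    "http://example.com/c": "chess openings for beginners",
}

def build_index(pages):
    """Indexing step: map each word to {url: occurrence count}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, query):
    """Search the index (not the live web); every query term must match."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set(index.get(terms[0], {}))
    for term in terms[1:]:
        hits &= set(index.get(term, {}))

    # Toy ranking: sum of term counts. Real engines weigh many factors.
    def score(url):
        return sum(index[t][url] for t in terms)

    return sorted(hits, key=lambda u: (-score(u), u))

INDEX = build_index(PAGES)
print(search(INDEX, "chess"))
```

The query is answered entirely from `INDEX`, which mirrors the point that engines search their own database rather than the web itself.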

Search on the Web
Corpus: the publicly accessible Web, static + dynamic
Goal: retrieve high-quality results relevant to the user's need (not docs!)
Need:
- Informational: want to learn about something (e.g. Low hemoglobin)
- Navigational: want to go to that page (e.g. United Airlines)
- Transactional: want to do something (web-mediated)
  - Access a service (e.g. Tampere weather)
  - Downloads (e.g. Mars surface images)
  - Shop (e.g. Nikon CoolPix)
- Gray areas
  - Find a good hub (e.g. Car rental Finland)
  - Exploratory search, "see what's there" (e.g. Abortion morality)

Search Engines as Info Gatekeepers Search engines are becoming the primary entry point for discovering web pages. Ranking of web pages influences which pages users will view. Exclusion of a site from search engines will cut off the site from its intended audience. The privacy policy of a search engine is important.

100+ Billion Searches / Month

Search Engine Wars The battle for domination of the web search space is heating up! The competition is good news for users! Crucial: advertising is combined with search results! What if one search engine manages to dominate the space?

Yahoo! Synonymous with the dot-com boom, probably the best known brand on the web. Started off as a web directory service in 1994, acquired leading search engine technology in 2003. Has very strong advertising and e-commerce partners

Lycos! One of the pioneers of the field Introduced innovations that inspired the creation of Google

Google The verb "google" has become synonymous with searching for information on the web. Has raised the bar on search quality. Has been the most popular search engine in the last few years. Had a very successful IPO in August 2004. Is innovative and dynamic.

Live Search (was: MSN Search) Run by Microsoft, a name synonymous with PC software. Remember its victory in the browser wars with Netscape. Developed its own search engine technology only recently, officially launched in Feb. 2005. May link web search into its next version of Windows.

Why is Search Engine marketing important?
- 80% of consumers find your website by first typing a query into a search engine (Google, Yahoo, Bing)
- 90% choose a site listed on the first page of results
- 85% of all traffic on the internet is referred by search engines
- The top three organic positions receive 59% of user clicks
- Cost-effective advertising
- Clear and measurable ROI
- Operates under this assumption: More (relevant) traffic + Good conversion rate = More sales/leads

Experiment with query syntax
- Default is AND: e.g. computer chess is normally interpreted as computer AND chess, i.e. both keywords must be present in all hits.
- +chess in a query means the user insists that chess be present in all hits.
- computer OR chess means at least one of the keywords must be present in each hit.
- "computer chess" (in quotes) means that the exact phrase computer chess must be present in all hits.
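
A small matcher makes the query rules above concrete. It is only a sketch under stated assumptions: `matches` is a hypothetical helper, OR is given lower precedence than the implicit AND, and a leading + is treated like a plain required term. Real engines differ in these details.

```python
import shlex

def matches(query, text):
    """Return True if `text` satisfies `query` under the rules above."""
    joined = " " + " ".join(text.lower().split()) + " "

    def has(term):
        # A quoted phrase arrives from shlex as one multi-word token.
        return (" " + term.lower().lstrip("+") + " ") in joined

    # Split on OR into groups; terms inside a group are ANDed (assumption).
    groups, current = [], []
    for token in shlex.split(query):
        if token == "OR":
            groups.append(current)
            current = []
        else:
            current.append(token)
    groups.append(current)
    return any(bool(g) and all(has(t) for t in g) for g in groups)

print(matches("computer chess", "chess on a computer"))
print(matches('"computer chess"', "chess on a computer"))
```

The first call succeeds because both keywords are present; the second fails because the words do not appear as an adjacent phrase.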

The most popular search keywords
AltaVista (1998)   AlltheWeb (2002)   Excite (2001)
sex                free
applet             sex
porno              download           pictures
mp3                software           new
chat               uk                 nude

Free Keyword Research Tools
- https://adwords.google.com/o/Targeting/Explorer?__c=1000000000&__u=1000000000&__o=te&ideaRequestType=KEYWORD_IDEAS#search.none
  Keyword Tool and Traffic Estimator to identify competitive phrases and search frequencies
- http://www.google.com/insights/search
  Compare search patterns across specific regions, categories, time frames and properties

Web search users
- Ill-defined queries
  - Short length
  - Imprecise terms
  - Sub-optimal syntax (80% of queries without operator)
  - Low effort in defining queries
- Wide variance in
  - Needs
  - Expectations
  - Knowledge
  - Bandwidth
- Specific behavior
  - 85% look over one result screen only (mostly above the fold)
  - 78% of queries are not modified (1 query/session)
  - Follow links: "the scent of information" ...

How far do people look for results?

Architecture of a Search Engine
[Figure components: The Web, Web spider, Indexer, Indexes, Ad indexes, Search, User]

Q: How does a search engine know that all these pages contain the query terms? A: Because all of those pages have been crawled.

Crawling picture (Sec. 20.2)
[Figure components: Web, Seed pages, URLs crawled and parsed, URLs frontier, Unseen Web]

Motivation for crawlers
- Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
- Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
- Business intelligence: keep track of potential competitors, partners
- Monitor Web sites of interest
- Evil: harvest emails for spamming, phishing
- Can you think of some others?

A crawler within a search engine
[Figure components: Web, googlebot, Page repository, Text & link analysis, Text index, PageRank, Ranker, Query, hits]

One taxonomy of crawlers Many other criteria could be used: Incremental, Interactive, Concurrent, etc.

Basic crawlers
- This is a sequential crawler
- Seeds can be any list of starting URLs
- Order of page visits is determined by frontier data structure
- Stop criterion can be anything

Graph traversal (BFS or DFS?)
- Breadth First Search
  - Implemented with QUEUE (FIFO)
  - Finds pages along shortest paths
  - If we start with good pages, this keeps us close; maybe other good stuff
- Depth First Search
  - Implemented with STACK (LIFO)
  - Wander away (lost in cyberspace)
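
The difference between the two traversals comes down to which end of the frontier the crawler pops from. The sketch below assumes a tiny in-memory link graph (`LINKS`, with made-up page names) instead of real HTTP fetches, and uses a page cap as its stop criterion.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> outgoing links.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": [],
}

def crawl(seeds, links, strategy="bfs", max_pages=100):
    """Sequential crawler; the frontier discipline sets the visit order."""
    frontier = deque(seeds)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:  # stop criterion
        # FIFO pop gives breadth-first, LIFO pop gives depth-first.
        page = frontier.popleft() if strategy == "bfs" else frontier.pop()
        visited.append(page)  # a real crawler would fetch and parse here
        for url in links.get(page, []):
            if url not in seen:
                seen.add(url)
                frontier.append(url)
    return visited

print(crawl(["seed"], LINKS, "bfs"))
print(crawl(["seed"], LINKS, "dfs"))
```

With the queue, pages one link from the seed are visited before pages two links away; with the stack, the crawler follows one branch to its end first.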

Universal crawlers
- Support universal search engines
- Large-scale
- Huge cost (network bandwidth) of crawl is amortized over many queries from users
- Incremental updates to existing index and other data repositories

Large-scale universal crawlers
Two major issues:
1. Performance: need to scale up to billions of pages
2. Policy: need to trade off coverage, freshness, and bias (e.g. toward important pages)

Large-scale crawlers: scalability
- Need to minimize overhead of DNS lookups
- Need to optimize utilization of network bandwidth and disk throughput (I/O is bottleneck)
- Use asynchronous sockets
  - Multi-processing or multi-threading do not scale up to billions of pages
  - Non-blocking: hundreds of network connections open simultaneously
  - Polling socket to monitor completion of network transfers
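
The non-blocking idea can be illustrated with Python's asyncio, where one thread keeps many transfers in flight. This is a sketch only: `fetch` is a stand-in that sleeps instead of opening a socket, and the URLs and the connection cap are assumptions for the example.

```python
import asyncio

async def fetch(url, delay=0.01):
    """Stand-in for a non-blocking HTTP request (no real network I/O)."""
    await asyncio.sleep(delay)  # yields control while "waiting on the socket"
    return url, f"<html>content of {url}</html>"

async def crawl_batch(urls, max_connections=100):
    """Fetch many URLs concurrently on one thread, capped by a semaphore."""
    limit = asyncio.Semaphore(max_connections)

    async def bounded(url):
        async with limit:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))

def run_demo(n=200):
    urls = [f"http://example.com/page{i}" for i in range(n)]
    return asyncio.run(crawl_batch(urls))

print(len(run_demo()))
```

The event loop plays the role of the polling described above: it resumes each coroutine when its simulated transfer completes.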

Universal crawlers: Policy
- Coverage
  - New pages get added all the time
  - Can the crawler find every page?
- Freshness
  - Pages change over time, get removed, etc.
  - How frequently can a crawler revisit?
- Trade-off!
  - Focus on most important pages (crawler bias)?
  - Importance is subjective

Web coverage by search engine crawlers This assumes we know the size of the entire Web. Do we? Can you define the size of the Web?

Maintaining a fresh collection
- Universal crawlers are never done
- High variance in rate and amount of page changes
- HTTP headers are notoriously unreliable
  - Last-modified
  - Expires
- Solution
  - Estimate the probability that a previously visited page has changed in the meanwhile
  - Prioritize by this probability estimate
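
One possible way to turn the solution above into numbers is to assume page changes follow a Poisson process, so that the probability of at least one change after t days is 1 - exp(-rate * t). The slides do not name a model; the Poisson assumption, the function names and the `HISTORY` data are all illustrative.

```python
import math

def change_probability(changes_seen, days_observed, days_since_visit):
    """P(page changed since last visit) under an ASSUMED Poisson model."""
    if days_observed <= 0:
        return 1.0  # no history yet: treat the page as possibly changed
    rate = changes_seen / days_observed  # estimated changes per day
    return 1.0 - math.exp(-rate * days_since_visit)

def revisit_order(pages):
    """Sort pages so the most likely-changed page is crawled first."""
    return sorted(pages, key=lambda p: -change_probability(*p[1:]))

# Hypothetical history: (url, changes seen, days observed, days since visit)
HISTORY = [
    ("http://example.com/about", 1, 365, 30),
    ("http://example.com/news", 30, 30, 1),
    ("http://example.com/blog", 4, 28, 3),
]

for url, *stats in revisit_order(HISTORY):
    print(url, round(change_probability(*stats), 3))
```

A frequently changing news page visited yesterday outranks a rarely changing page visited a month ago, which is the prioritization the slide asks for.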

Do we need to crawl the entire Web?
- If we cover too much, it will get stale
- There is an abundance of pages in the Web
- For PageRank, pages with very low prestige are largely useless
- What is the goal?
  - General search engines: pages with high prestige
  - News portals: pages that change often
  - Vertical portals: pages on some topic
- What are appropriate priority measures in these cases? Approximations?

Complications (Sec. 20.1.1)
- Web crawling isn't feasible with one machine
  - All of the above steps distributed
- Malicious pages
  - Spam pages
  - Spider traps, incl. dynamically generated
- Even non-malicious pages pose challenges
  - Latency/bandwidth to remote servers vary
  - Webmasters' stipulations
    - How deep should you crawl a site's URL hierarchy?
  - Site mirrors and duplicate pages
- Politeness: don't hit a server too often
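
The politeness point can be sketched as a per-host scheduler that spaces out requests to the same server. The class name, the 2-second default and the hosts are assumptions for illustration. Timestamps are passed in explicitly so the example runs without sleeping.

```python
from urllib.parse import urlparse

class PolitenessScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay  # assumed gap in seconds between hits on one host
        self.next_allowed = {}      # host -> earliest allowed timestamp

    def schedule(self, url, now):
        """Return when `url` may be fetched and reserve that slot."""
        host = urlparse(url).netloc
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_delay
        return start

scheduler = PolitenessScheduler(min_delay=2.0)
print(scheduler.schedule("http://a.example/1", now=0.0))
print(scheduler.schedule("http://a.example/2", now=0.0))
print(scheduler.schedule("http://b.example/1", now=0.0))
```

Requests to different hosts are not delayed by each other, so the crawler stays busy while still being gentle with each individual server.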

Your guide for the search engines

What is robots.txt? It's a file in the root of your website that can either allow or restrict search engine robots from crawling pages on your website.

How does it work?
Before a search engine robot crawls your website, it will first look for your robots.txt file to find out where you want it to go.
There are 3 things you should keep in mind:
- Robots can ignore your robots.txt. Malware robots scanning the web for security vulnerabilities, or email address harvesters used by spammers, will not care about your instructions.
- The robots.txt file is public. Anyone can see what areas of your website you don't want robots to see.
- Search engines can still index (but not crawl) a page you've disallowed if it's linked to from another website. The search results will then show only the URL, usually with no title or information snippet. Instead, make use of the robots meta tag for that page.

What to put in your robots.txt file
User-agent: This is the line where you define which robot you're talking to. It's like saying hello to the robot:
    User-agent: *    (all robots; named robots include Googlebot for Google and Slurp for Yahoo)
Disallow: This tells the robots what you don't want them to crawl on your site:
    Disallow: /    (do not crawl anything on my site)
    Disallow: /images/
Allow: This tells the robots what you want them to crawl on your site:
    Allow: /
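
To see how a well-behaved robot reads these directives, the rules can be fed to Python's standard `urllib.robotparser`. The robots.txt content and the example.com URLs below are made up for illustration. Note that this parser applies the first matching rule in file order, so the Disallow line is placed before the broader Allow line; other robots may resolve conflicts differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="*"):
    """Ask whether `agent` may crawl `url` under the rules above."""
    return parser.can_fetch(agent, url)

print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/images/logo.png"))
```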

What to put in your robots.txt file
* (Asterisk / wildcard): With the * symbol, you tell the robots to match any number of any characters. Very useful, for example, when you don't want your internal search result pages to be indexed.
    Disallow: *contact*    (do not crawl any URLs containing the word contact)
$ (Dollar sign / ends with): The dollar sign tells the robots that it is the end of the URL.
    Disallow: *.pdf$
# (Hash / comment): You can add comments after the # symbol, either at the start of a line or after a directive.
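
The * and $ rules can be understood as a small pattern language. Below is a sketch that translates such a pattern into a regular expression; `rule_to_regex` and `rule_matches` are hypothetical helpers, and matching from the start of the path is an assumption about how a robot applies the pattern.

```python
import re

def rule_to_regex(pattern):
    """Translate a robots.txt path pattern with * and $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def rule_matches(pattern, path):
    """True if `path` falls under `pattern` (matched from the path start)."""
    return rule_to_regex(pattern).match(path) is not None

print(rule_matches("*contact*", "/about/contact-us"))
print(rule_matches("*.pdf$", "/docs/report.pdf"))
print(rule_matches("*.pdf$", "/docs/report.pdf?download=1"))
```

The last call shows the effect of $: a URL that merely contains .pdf but does not end with it is not covered by the rule.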

What to put in your robots.txt file
Crawl-delay: This directive asks the robot to wait a certain number of seconds after each page it has crawled on your website.
    Crawl-delay: 5
Request-rate: Here you tell the robot how many pages you want it to crawl within a certain number of seconds. The first number is pages, and the second number is seconds.
    Request-rate: 1/5 # load 1 page per 5 seconds
Visit-time: It's like opening hours, i.e. when you want the robots to visit your website. This can be useful if you don't want the robots to visit your website during busy hours (when you have lots of human visitors).
    Visit-time: 2100-0500 # only visit between 21:00 (9PM) and 05:00 (5AM) UTC (GMT)
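
A crawler written in Python could read the first two of these directives with the standard `urllib.robotparser` (Python 3.6 or later). The robots.txt text below is a made-up example. Visit-time is left out of the sketch because, as far as I know, this parser offers no accessor for it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Request-rate: 1/5   # load 1 page per 5 seconds
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

delay = parser.crawl_delay("*")   # seconds to wait between requests
rate = parser.request_rate("*")   # named tuple with .requests and .seconds

print(delay)
print(rate.requests, rate.seconds)
```

A polite crawler would sleep for `delay` seconds between fetches to this host, or cap itself at `rate.requests` pages per `rate.seconds` seconds.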

Test your page https://www.google.com/webmasters/

Search engine optimization

What is SEO?
- SEO = Search Engine Optimization
- Refers to the process of optimizing both the on-page and off-page ranking factors in order to achieve high search engine rankings for targeted search terms.
- Also refers to the industry that has grown up around using keyword searching as a means of increasing relevant traffic to a website.

What is a SEO Algorithm?
- Top Secret! Only select employees of a search engine company know for certain
- Reverse engineering, research and experiments give SEOs (search engine optimization professionals) a pretty good idea of the major factors and approximate weight assignments
- The SEO algorithm is constantly changed, tweaked & updated
- Websites and documents being searched are also constantly changing
- Varies by search engine: some give more weight to on-page factors, some to link popularity

http://seositecheckup.com/

A good SEO strategy:
- Research desirable keywords and search phrases (WordTracker, Overture, Google AdWords)
- Identify search phrases to target (should be relevant to business/market, obtainable and profitable)
- Clean and optimize a website's HTML code for appropriate keyword density, title tag optimization, internal linking structure, headings and subheadings, etc.
- Help in writing copy to appeal to both search engines and actual website visitors
- Study competitors (competing websites) and search engines
- Implement a quality link building campaign
- Add quality content
- Constantly monitor rankings for targeted search terms

Ranking factors
On-Page Factors (Code & Content)
  #1 - Content, Content, Content (Body text)
  #2 - Keyword frequency & density
  #3 - Title tags
  #4 - ALT image tags
  #5 - Header tags
  #6 - Hyperlink text
Off-Page Factors
  #1 - Anchor text
  #2 - Link Popularity (votes for your site), adds credibility

What a Search Engine Sees View > Source (HTML code)
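
To get a feel for what a crawler can pull out of page source, the sketch below uses Python's standard `html.parser` to collect a few on-page elements: title, headings, image ALT text and link text. The `OnPageFactors` class and the `SAMPLE` page are made up for illustration, and the parser ignores nesting (for example a link inside a heading).

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")

class OnPageFactors(HTMLParser):
    """Collects title, heading, ALT and anchor text from HTML source."""

    def __init__(self):
        super().__init__()
        self.found = {"title": [], "headings": [], "alt": [], "anchors": []}
        self._open = None  # which text-bearing tag we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._open = "title"
        elif tag in HEADINGS:
            self._open = "headings"
        elif tag == "a":
            self._open = "anchors"
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.found["alt"].append(alt)

    def handle_endtag(self, tag):
        if tag in ("title", "a") + HEADINGS:
            self._open = None

    def handle_data(self, data):
        if self._open and data.strip():
            self.found[self._open].append(data.strip())

def extract(html):
    parser = OnPageFactors()
    parser.feed(html)
    return parser.found

SAMPLE = """<html><head><title>Chess Shop</title></head>
<body><h1>Buy chess sets</h1>
<img src="board.png" alt="wooden chess board">
<p>Read our <a href="/guide">chess buying guide</a>.</p>
</body></html>"""

print(extract(SAMPLE))
```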

Pay Per Click
- PPC ads appear as sponsored listings
- Companies bid on the price they are willing to pay per click
- Typically have very good tracking tools and statistics
- Ability to control ad text
- Can set budgets and spending limits
- Google AdWords and Overture are the two leaders

PPC vs. Organic SEO
Pay-Per-Click
- results in 1-2 days
- easier for a novice or someone with little knowledge of SEO
- ability to turn on and off at any moment
- generally more costly per visitor and per conversion
- fewer impressions and exposure
- easier to compete in highly competitive market space (but it will cost you)
- ability to generate exposure on related sites (AdSense)
- ability to target local markets
- better for short-term and high-margin campaigns
Organic SEO
- results take 2 weeks to 4 months
- requires ongoing learning and experience to achieve results
- very difficult to control flow of traffic
- generally more cost-effective, does not penalize for more traffic
- SERPs are more popular than sponsored ads
- very difficult to compete in highly competitive market space
- ability to generate exposure on related websites and directories
- more difficult to target local markets
- better for long-term and lower margin campaigns

Keys to Successful SEO Strategy
1. Do not underestimate the importance of keyword research
2. Be sure to include the proper tags in your page coding
3. You must have optimized content! (3-5 uses of keyword per 250 words)
4. Use content marketing
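
The "3-5 uses per 250 words" guideline in point 3 is easy to check mechanically. The sketch below handles single-word keywords only; the function names and the simple tokenizer are assumptions for illustration.

```python
import re

def keyword_uses_per_250(text, keyword):
    """Occurrences of a single-word `keyword`, scaled to 250 words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    count = words.count(keyword.lower())
    return count * 250 / len(words)

def within_guideline(text, keyword, low=3, high=5):
    """True if the keyword rate falls inside the 3-5 per 250 words range."""
    return low <= keyword_uses_per_250(text, keyword) <= high

# Synthetic 250-word passage containing the keyword 4 times.
sample = "chess " * 4 + "word " * 246
print(keyword_uses_per_250(sample, "chess"))
print(within_guideline(sample, "chess"))
```

Copy that scores far above the range is usually a sign of keyword stuffing rather than better optimization.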

Keyword Selection
- Marketing/Brand Relevance: How closely does the keyword match your product/service offering, messaging, goals and objectives?
- Search Frequency: How many people are searching on the particular keyword?
- Competition: How much competition (large, authority sites) is there for the particular keyword?
- Optimization Opportunity: Is there already a logical place on the site to optimize for the particular keyword?

