
SEO (Search Engine Optimization)

Dragiša Miljković
[email protected]
Department of electrical engineering and computer science
Faculty of technical sciences
University of Priština

What will we do today?

This talk discusses just the most important and interesting ideas in Search Engine Optimization. SEO is the process of improving the visibility of a website within search results, so that it becomes easier to find, more relevant to search engines, and more accessible to search engine crawlers. In short, it is the process of improving a website's search engine rank position.

Who is this guy?

M.Sc. Eng. Dragiša Miljković
Teaching fellow at the Department of electrical engineering and computer science

Faculty of technical sciences
University of Priština
Serbia

Basic terminology

A search engine is a software system used for searching for information on the WWW. The user enters key phrases of interest into the search field, and the search engine returns web content results as a list on so-called search engine results pages (SERPs). These results can be a mixture of different web pages, images, videos, and other file types. The Web is perpetually changing, so search engines must maintain near real-time indexing. This is done by constantly running an algorithm on a web crawler.

SERP
A web search engine processes a keyword query submitted by a “searcher” and, as a response, presents SERPs. A SERP comprises the list of results (usually ordered by relevance to the query) that are returned by the search engine, but it may also contain other results, such as advertisements. There are two main categories of results: organic search (returned by the algorithm of a search engine) and sponsored search (i.e. advertisements).
– A result is displayed with a title, a hyperlink to that web page, and a brief description (it shows how that result matches the query).

http://www.unibo.it/it SERP

Context matters

The Internet is a curious place
There are parts of the web that are not accessed by search engines:
– the deep web, hidden from conventional search engines (e.g. by encryption);

– the dark web, intentionally hidden from search engines; it uses masked IP addresses and is accessible only with a special web browser.
• Notice that the dark web is part of the deep web.

Search engines – how do they work?

A search engine comprises three processes:

– Crawler,

– Indexer,

– Queryengine.

This is the technical part of SEO

Types of JavaScript

JavaScript may be used to enhance HTML & CSS
– It can be used to improve user experience and to add some functionality,

– This affects SEO very little.

JavaScript may be used to replace the content of a web page – its HTML and CSS
– In this way, web pages become web applications,
– This causes trouble for search engines.

Crawler
Also known as a (web) spider. This is an Internet bot that browses the WWW for the purpose of web indexing. There are billions of web pages, so the crawler needs to run constantly on a large number of computers. The crawler takes the URLs from previous crawls and from XML sitemaps; it then tries to find new or updated pages and add them to the Google index (it detects SRC and HREF links).

– It extracts hyperlinks and adds them to the queue for crawling,
– It only retrieves a page if it is new or if it has changed, and removes the dead ones.
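
The link-extraction step described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not how any real search engine crawler is implemented; the class name `LinkExtractor`, the helper `enqueue_links`, and the sample page are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs found in href and src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def enqueue_links(page_url, html, queue, seen):
    """Parse one page and add every not-yet-seen URL to the crawl queue."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    for url in parser.links:
        if url not in seen:
            seen.add(url)
            queue.append(url)


# Hypothetical page content, for illustration only.
sample_html = '<a href="/about">About</a><img src="logo.png"><a href="/about">Again</a>'
queue, seen = deque(), {"http://example.com/"}
enqueue_links("http://example.com/", sample_html, queue, seen)
print(list(queue))
```

The `seen` set is what keeps the same URL from being queued twice; a real crawler would additionally decide whether a known page has changed before fetching it again.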

Crawler

Criteria that matter to the crawler:
– How long does it take to load a web page?
– How important is that URL?
– Are there more hyperlinks on a given web page?

Crawler parses HTML

Some text
Some more text
– Some nested text
– Second nested element

Search engine cares only about URLs

A search engine does not care (too) much about the content of a web page; rather,
– it crawls, indexes, and ranks only URLs.
– A general rule: one piece of content should be associated with one URL!

URLs that the crawler returns are added to the search engine's index. When a user enters a query in a search engine, relevant results are returned based on the search engine's algorithm.
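
To make the idea of an index that is queried later a little more concrete, here is a toy inverted index. It is only a sketch: real search engines use far more elaborate data structures and ranking, and the page texts, URLs, and function names below are invented for illustration.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return URLs containing every query word, sorted alphabetically (no ranking)."""
    words = query.lower().split()
    if not words:
        return []
    results = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(results)


# Hypothetical crawled pages: URL -> extracted text.
pages = {
    "http://example.com/seo": "search engine optimization basics",
    "http://example.com/crawl": "how a search engine crawler works",
}
index = build_index(pages)
print(search(index, "search engine"))
print(search(index, "crawler"))
```

Note that the index is keyed by word but its values are URLs, which matches the point above: what ends up being stored and ranked is the URL.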

Robots exclusion protocol

Known simply as robots.txt
It must be in the top directory of the server.
This is a standard which websites use in order to regulate which areas of the website web crawlers and other web robots are allowed to process and categorize. This is solely a standard, and not an enforced rule, so not all robots will comply.
– E-mail harvesters, malware, and spambots are even likely to start at the areas of the website that should be omitted.

Unibo robots.txt

http://www.unibo.it/robots.txt

User-agent: *
Disallow: /NR/exeres
Disallow: /NR/rdonlyres
Disallow: /it/allegati/allegati-non-indicizzati
Disallow: /en/attachments/unindexed-attachments
Disallow: /modello
Disallow: /modello-a
Disallow: /modello-b
Disallow: /uniboweb/sites/UniboSearch/results.aspx
Disallow: /UniboWeb/UniboSearch/results.aspx
Disallow: /_layouts
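
A well-behaved robot could check rules like these before fetching a URL. The sketch below uses Python's standard `urllib.robotparser` on an excerpt of the rules shown above; the helper name `allowed` and the tested paths are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the rules listed above, supplied inline instead of being downloaded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /NR/exeres
Disallow: /modello
Disallow: /_layouts
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="*"):
    """Return True if the given user agent may fetch the URL under these rules."""
    return parser.can_fetch(user_agent, url)


print(allowed("http://www.unibo.it/it"))
print(allowed("http://www.unibo.it/_layouts/page.aspx"))
```

Rules are matched as path prefixes, so `Disallow: /modello` also covers `/modello-a` and `/modello-b`. Nothing forces a robot to run such a check, which is why robots.txt is a convention rather than a protection mechanism.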

Sitemaps inclusion protocol
Sitemaps are a URL inclusion protocol. A webmaster can use sitemaps in order to provide the search engines with URLs on a website that are available for crawling. A sitemap is just an XML file which lists all the URLs on one website. In order to enable search engines to crawl web pages more efficiently, some additional metadata can be included:
– How important a given URL is (in relation to other URLs on the same website),

– When that URL was last updated,
– How frequently it changes, etc.
– It can also include information about specific types of content on the website (e.g. images, videos).
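
As an illustration of the metadata listed above, the following sketch builds a small sitemap with Python's standard `xml.etree.ElementTree`. The element names (`loc`, `lastmod`, `changefreq`, `priority`) come from the sitemaps.org protocol; the URLs, dates, and the function name `build_sitemap` are made up for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Serialize a list of URL entries (dicts) into a sitemap XML string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Only "loc" is mandatory; the other tags are optional hints.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical entries, for illustration only.
sitemap_xml = build_sitemap([
    {"loc": "http://example.com/", "lastmod": "2018-03-05",
     "changefreq": "daily", "priority": 1.0},
    {"loc": "http://example.com/about", "priority": 0.5},
])
print(sitemap_xml)
```

The `priority` value is relative to other URLs on the same site, so it cannot be used to outrank pages on other websites.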

Sitemaps – Unibo website

JavaScript affects crawling

JavaScript frameworks are used to create interactive web pages and to control the behaviour of the elements on the page. In order for a search engine to access web page content, it needs to render it! This is not a job for a crawler, but for an indexer
– In the case of Google, its indexer, called Caffeine, renders web pages, and Googlebot does not execute JavaScript at all.

– Links that are embedded into JavaScript are not visible to the crawler.

Earlier workaround – AJAX-crawling

In October 2009 Google came out with “A proposal for making AJAX crawlable”. This initiative was introduced to make JavaScript-based web pages accessible to Googlebot. Googlebot sends the server the URL of the JS web page that it needs, and the server responds with a web page that is fully rendered into an HTML snapshot (the result of executing the JavaScript in a headless browser, i.e. a browser with no GUI), which is then returned to the crawler. Nowadays, this is a deprecated method, as modern Googlebot has truly advanced JavaScript rendering capabilities.

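AJAX-crawling scheme

This scheme applies to a URL that contains a "#!", or to a page that carries a "fragment meta tag" (<meta name="fragment" content="!">). This URL is called the pretty URL. The crawler then requests the content of that page from the server, but it modifies the URL by replacing the #! with "?_escaped_fragment_=" (for a page with the meta tag, "?_escaped_fragment_=" is appended instead). This URL is called the ugly URL.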
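
The pretty-to-ugly rewriting can be sketched as a small function. This is a simplified illustration of the (now deprecated) scheme: the function name `to_ugly_url` is hypothetical, and the escaping below just percent-encodes the fragment with `urllib.parse.quote`, which is coarser than the exact character list in Google's specification.

```python
from urllib.parse import quote


def to_ugly_url(pretty_url):
    """Rewrite a '#!' (pretty) URL into its '_escaped_fragment_' (ugly) form."""
    base, separator, fragment = pretty_url.partition("#!")
    if not separator:
        # No hashbang present: nothing to rewrite.
        return pretty_url
    # If the URL already has a query string, append with '&' instead of '?'.
    joiner = "&" if "?" in base else "?"
    # Simplified escaping (an assumption): keep '=' readable, encode the rest.
    return f"{base}{joiner}_escaped_fragment_={quote(fragment, safe='=')}"


print(to_ugly_url("http://example.com/page#!key=value"))
print(to_ugly_url("http://example.com/page?a=1#!state"))
```

A server that supported the scheme would recognise the `_escaped_fragment_` parameter and answer with the pre-rendered HTML snapshot for that state.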

AJAX-crawling scheme

Since 2015, Googlebot has been able (at least to some extent) to render “#!” URLs directly, so providing it with a rendered version of the web page became obsolete. (This does not apply to other search engines, though.) In the second quarter of 2018, Google completely switched to rendering JavaScript pages on Google's side, and it no longer requires that websites do this by themselves. However, AJAX-crawling scheme URLs are still supported in Google's search results.

Indexer does the rendering

When the crawler processes the page, it searches for hyperlinks. It then sends them to the indexer, which renders them and executes JavaScript; this often results in finding new URLs, which are then sent back to the crawler. The process stops when the crawler cannot crawl any further. But this does not always work ideally; the bottom line is that there is no search engine that can understand and process JavaScript at the level modern browsers can.
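
The back-and-forth between crawler and indexer can be modelled with a toy loop. Everything here is hypothetical: the `SITE` dictionary stands in for a real website, `html_links` for links visible in the raw HTML, and `js_links` for links that only appear once JavaScript has been executed.

```python
from collections import deque

# Hypothetical site: for each URL, the links visible in raw HTML and the links
# that only exist after the page's JavaScript has run.
SITE = {
    "/": {"html_links": ["/about"], "js_links": ["/products"]},
    "/about": {"html_links": ["/"], "js_links": []},
    "/products": {"html_links": [], "js_links": ["/products/1"]},
    "/products/1": {"html_links": ["/"], "js_links": []},
}


def crawl(start, render_js=True):
    """Breadth-first discovery of URLs; stops when the queue is empty."""
    queue, discovered = deque([start]), [start]
    while queue:
        page = SITE.get(queue.popleft(), {})
        links = list(page.get("html_links", []))   # crawler: raw HTML links
        if render_js:
            links += page.get("js_links", [])      # indexer: links found after rendering
        for link in links:
            if link not in discovered:
                discovered.append(link)
                queue.append(link)
    return discovered


print(crawl("/", render_js=True))
print(crawl("/", render_js=False))
```

With rendering switched off, the product pages are never discovered, which is the situation a site faces with bots that do not execute JavaScript.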

Indexer

The most advanced indexer is Google's Caffeine
Every indexer analyses the following things
– Content,
– Links,
– Layout.

Robots meta tags (placed in a page's HTML, as opposed to the crawl rules in robots.txt) can be used to control the indexer's execution.

Indexer

An indexer provides three services
– Canonicalization

• It finds the canonical URL,

– WRS (web rendering service)
• It renders a web page (like a browser),

– PageRanker
• Indexer calculates the rank of a given web page.

Canonicalization

– The indexer finds the canonical URL (the master copy of a page)
• E.g. all of the following web pages are different pages to the crawler (even though the content is the same):
– http://www.example.com
– https://www.example.com
– http://example.com
– http://example.com/index.php
– http://example.com/index.php?r...

<link rel="canonical" href="http://www.unibo.it/it"></head>
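
One signal an indexer can use is the `rel="canonical"` link element shown above. The sketch below reads that element from a page with Python's standard `html.parser`; it illustrates only this one signal, not Google's actual canonicalization logic, and the names `CanonicalFinder` and `find_canonical` are hypothetical.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # rel may hold several space-separated values.
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the declared canonical URL, or None if the page declares none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


page = ('<html><head><link rel="canonical" href="http://www.unibo.it/it">'
        '</head><body></body></html>')
print(find_canonical(page))
```

If every duplicate variant of a page declares the same canonical URL, the variants can be folded into a single index entry.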

Web rendering service

Google's JavaScript indexing capabilities are unprecedented; it uses Chrome 41 for its web rendering service (WRS).
– Chrome 41 was released on 3rd of March 2015 (so there are some modern features it doesn't support),

– WRS is stateless: it doesn't store cookies or session data,

– If JavaScript requires any user action, that web page won't be rendered.

PageRanker

PageRank is a mathematical algorithm used to determine the importance of a page; it is essentially based on assessing the quantity and quality of links leading to that web page. PageRanker is used to calculate the rank of a given web page
– What matters are the hyperlinks, internal and external,
– Damping factor – the probability that the user will continue clicking, rather than leaving that website.

PageRanker sends the results to the crawler. Pages with higher importance are crawled with higher priority!
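
The role of links and of the damping factor can be seen in a small power-iteration sketch of the textbook PageRank formulation. This is not Google's production ranker; the three-page graph is invented, and the damping value 0.85 is the commonly quoted default, used here as an assumption.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a graph given as {page: [linked pages]}.

    Every linked page must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Share given to every page when the surfer stops following links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page C ends up on top because it receives links from both other pages, while B, linked only from A, ends up last.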

Unibo pagerank


How is JavaScript indexed?

The golden rule is: if a user action is required in order to load some content, that content won't be indexed. Also, anything that requires user consent (e.g. access to the camera) is blocked as well. Links that are embedded into JavaScript are extracted upon execution

– Caffeine does the processing,
– Newly discovered URLs are sent to Googlebot.

What if a button should be clicked?

When a button should be clicked, Googlebot will render that content if the content resides on the same web page, but it will not index the content as a part of the web page if it is called from another web page using some sort of action that the user must perform.

Googlebot

Google's web crawling bot; it discovers new and updated pages to be added to Google's index. It is designed to be distributed across several machines in order to improve performance and to scale as the web grows. Googlebot uses an algorithmic process: computer programs determine which sites to crawl, how often, and how many pages to fetch from each site.

Fetch and render (as Googlebot)

Googlebot – JavaScript crawling and indexing

Source: https://www.elephate.com/blog/javascript-seo-experiment/ (updated on 5th of March 2018)

Search engines – JavaScript crawling and indexing

Source: https://moz.com/blog/search-engines-ready-for-javascript-crawling (published on 29th of August 2017)

JavaScript frameworks are a must!

Web application development technologies, such as React, Angular, Vue, Backbone, etc., are spreading throughout both frontend and backend web development. Having at least a basic understanding of these technologies is one of the requirements for efficient SEO.

Googlebot can render JavaScript

Googlebot is able to render JavaScript pages (provided it is not blocked, say, by a robots.txt file, from accessing required resources – JavaScript files/frameworks, CSS files, server responses, 3rd-party APIs, etc.). Through the render process, Googlebot extracts titles, description tags, structured data, and other metadata, much like any modern Web browser. If resources are blocked or temporarily unavailable, client-side code should be written in such a way that it fails gracefully. Web page content should be available even through browsers that are not compatible with the JavaScript implementations used for that website.

Avoid AJAX-crawling

Google recommends avoiding AJAX-crawling on new websites, and migrating old sites that still use this scheme.
– When migrating, “meta fragment” tags should be removed.

– A “meta fragment” tag should not be used if the “escaped fragment” URL doesn't serve fully rendered content.

Not every bot is Google

Other web bots are far less capable of rendering dynamic sites; some of them might not even support JavaScript at all, and just expect plain HTML. So a down-level experience should be implemented, so that bots are not prevented from crawling through the navigation or seeing the content embedded in a web page. One very efficient and flexible way to enable all search engines to access a web page's content is server-side pre-rendering.
– Major JavaScript frameworks support this feature natively.

Progressive enhancement

A website should be built using the “progressive enhancement” technique, so that the content is made available to all users, regardless of the browser they use. One technique that should be avoided is redirecting users to an “unsupported browser” page. Where needed, a polyfill (or any other fallback) should be used!

JavaScript redirects

The best practice is to perform redirection on the server side, but it is a legitimate practice (and sometimes the only possible option) to use JavaScript. Googlebot is not very patient, so client-side redirection with JavaScript should happen as quickly as possible. 301 redirects are the best option when moving a web page to a different address (this preserves the rankings as much as possible), and they should be used rather than JavaScript redirection.
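
A server-side 301 redirect can be sketched as a tiny WSGI application in Python. This is only an illustration of the idea, under the assumption that the site is served by a WSGI-compatible server; the paths in `MOVED` and the helper `request` are hypothetical.

```python
# Hypothetical mapping of moved pages: old path -> new path.
MOVED = {"/old-page": "/new-page"}


def app(environ, start_response):
    """Minimal WSGI application answering moved paths with a 301 redirect."""
    path = environ.get("PATH_INFO", "/")
    target = MOVED.get(path)
    if target is not None:
        # Permanent redirect: the Location header carries the new address.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<h1>Hello</h1>"]


def request(path):
    """Call the app directly (no network) and return (status, headers)."""
    captured = {}

    def start_response(status, headers):
        captured["status"], captured["headers"] = status, dict(headers)

    app({"PATH_INFO": path}, start_response)
    return captured["status"], captured["headers"]


print(request("/old-page"))
print(request("/"))
```

To try it over HTTP, the app could be served with the standard library, for example via `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. Because the redirect is part of the HTTP response itself, a bot sees it without having to render anything.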

Mobile-first indexing

Googlebot, by default, will always render the page, and it can crawl links included by JavaScript on mobile pages. Googlebot won't see hyperlinks that exist on desktop web pages and not on mobile web pages.
– The mobile site and the desktop site need to be equivalent!
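
One rough way to check this equivalence is to compare the hyperlinks of the two versions of a page. The sketch below works on static HTML only (it does not execute JavaScript), and the sample markup and function names are invented for the example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href values of all <a> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.add(href)


def hrefs_in(html):
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs


def missing_on_mobile(desktop_html, mobile_html):
    """Return links present on the desktop page but absent from the mobile page."""
    return sorted(hrefs_in(desktop_html) - hrefs_in(mobile_html))


# Hypothetical desktop and mobile versions of the same page.
desktop = '<a href="/courses">Courses</a><a href="/research">Research</a>'
mobile = '<a href="/courses">Courses</a>'
print(missing_on_mobile(desktop, mobile))
```

Any link reported here would be invisible under mobile-first indexing unless it is reachable some other way.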

Mobile-first policy


Test if Google can render your pages properly

Google Search Console has a Fetch and Render tool which can be used for previewing how Googlebot sees a given web page.
– This tool does not support “#!” and “#” URLs.
– Generally, URLs with “#” (apart from “#!”) should be avoided, because Googlebot rarely indexes those URLs.

SEO for JavaScript websites

JavaScript frameworks help create a modern experience for users, but JavaScript presents a challenge for search engines. Luckily, most JS frameworks support both frontend and backend rendering. Things to keep in mind:
– URLs should look like static URLs

• Avoid hashtags in URLs

– Use standard <a href> links in HTML along with “onclick” events
• This ensures discoverability by search engines

– Server-side JavaScript

Server-side JavaScript
This ensures that there is plain HTML for search engines to use. Though Googlebot can manage crawling and rendering client-side JavaScript, other search engines cannot. Things to keep in mind:
– If you have a lot of users using other search engines, you will lose traffic.

– Even Google has trouble with heavy client-side rendering
• This is also much slower

– Google might misinterpret the content
• Even the slightest errors are dangerous

– Even correct crawling and rendering can result in strange behaviour

www.unibo.it

Dragiša Miljković

Department of electrical engineering and computer science
Faculty of technical sciences

University of Priština

[email protected]
