Panda Defined

DESCRIPTION

Panda 4 stuff

Gyroop Content Writer

Reasons we need this software (Google Panda):
- Ensure writers use the right expert words (don't write fluff)
- Ensure writers don't overuse or underuse those words (frequency)

The bulk of what you need to know about Google Panda:
http://searchenginewatch.com/article/2241400/4-Steps-to-Panda-Proof-Your-Website-Before-Its-Too-Late
http://justinbriggs.org/phrase-based-indexing-and-semantics

Why it does what it does:
Google wants to see "quality" in the content it ranks highly in the search engine results. But what does "quality" mean? What's the difference between "quality" and "spammy" or "crappy"?

As the articles above point out, to count as quality, the content should include words and phrases that are related to the keyword. That is, Google should find terms that it has already deemed relevant to the niche.

How to accomplish that (what the app does):
- Allows you to enter new keywords
- Scrapes the top 50 search engine results for the keyword
- Downloads the HTML for each of those 50 URLs
- Tokenizes the downloaded HTML into every one-word, two-word, and three-word phrase that occurs in it (note: the software does one at a time to reduce the server load). While tokenizing, it also compares the frequencies of those phrases in the Google pages vs. their frequencies in COCA (the Corpus of Contemporary American English)
- Aggregates the corpora into lists and determines the required number of uses of each phrase for the article assignments (1,000 words in this example)
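
The tokenize, compare, and aggregate steps could be sketched roughly as below. This is only an illustration, not the app's actual code: the function names, the reference_per_million dict (a stand-in for COCA frequencies, loaded however your COCA license allows), the min_ratio cutoff, and the proportional scaling to a 1,000-word article are all assumptions, since the notes don't say how relevance or the required number of uses is actually calculated.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_tokens(html):
    """Strip markup and return lowercase word tokens."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9']+", " ".join(parser.parts).lower())


def ngram_counts(tokens, max_n=3):
    """Count every one-, two-, and three-word phrase in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def usage_targets(pages_html, reference_per_million,
                  article_words=1000, min_ratio=5.0):
    """Return {phrase: suggested uses per article} for over-represented phrases.

    reference_per_million: hypothetical {phrase: occurrences per million words}
    mapping standing in for COCA data. min_ratio is an assumed cutoff.
    """
    totals = Counter()
    total_tokens = 0
    for html in pages_html:  # one page at a time, as in the notes
        tokens = html_to_tokens(html)
        total_tokens += len(tokens)
        totals.update(ngram_counts(tokens))

    targets = {}
    if total_tokens == 0:
        return targets
    for phrase, count in totals.items():
        page_rate = count / total_tokens * 1_000_000
        # Assumed floor of 1 per million for phrases missing from the reference.
        ref_rate = reference_per_million.get(phrase, 1.0)
        if page_rate / ref_rate >= min_ratio:
            # Scale the observed rate to the article length (assumed rule).
            targets[phrase] = max(1, round(count / total_tokens * article_words))
    return targets
```

With real data, pages_html would be the 50 downloaded pages for one keyword, and the returned dict is the small "list for the writers" that the rest of these notes refer to.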

NEED FOR OPTIMIZATION!
Currently the database is saving, for each keyword:
- The original HTML from each scraped page
- ALL the n-grams (even the ones that get excluded because they're not relevant enough): one-word, two-word and, yes, three-word

We can probably save a TON of space by deleting all that data once we have the list of JUST the n-grams we want to use for the writers. That will be tiny by comparison.
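
A cleanup pass along those lines might look like the sketch below. It assumes a SQLite database with two hypothetical tables, pages (the scraped HTML) and ngrams (with a selected flag marking the phrases kept for writers). The real schema and database engine aren't described in these notes, so treat every table and column name here as a placeholder.

```python
import sqlite3


def create_schema(conn):
    """Hypothetical schema, for illustration only."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (keyword TEXT, url TEXT, html TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ngrams ("
        "keyword TEXT, phrase TEXT, uses INTEGER, selected INTEGER)"
    )
    conn.commit()


def prune_keyword(conn, keyword):
    """Delete raw HTML and excluded n-grams once the final list exists.

    Returns False (and deletes nothing) if no n-grams have been selected
    for the keyword yet, so unprocessed keywords keep their raw data.
    """
    kept = conn.execute(
        "SELECT COUNT(*) FROM ngrams WHERE keyword = ? AND selected = 1",
        (keyword,),
    ).fetchone()[0]
    if kept == 0:
        return False

    with conn:  # commits both deletes together, or rolls back on error
        conn.execute("DELETE FROM pages WHERE keyword = ?", (keyword,))
        conn.execute(
            "DELETE FROM ngrams WHERE keyword = ? AND selected = 0",
            (keyword,),
        )
    # In SQLite, deleted rows only free file space after a VACUUM,
    # which must run outside a transaction.
    conn.execute("VACUUM")
    return True
```

The guard at the top matters because the deletion is irreversible: if the raw HTML is gone and the selection criteria change later, the pages would have to be scraped again.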