Running head: WEB SEARCH EXERCISE
Web Search Exercise
Deborah Bluestein
Southern Connecticut State University
ILS 501-S70: Introduction to Information Science & Technology
Fall 2012
Yunseon Choi, Ph.D.
Department of Information and Library Science
October 15, 2012
Abstract
When searching the Web, several problems are encountered that can hinder finding
relevant information. This exercise sought to demonstrate proficiency in using Web search
engines to avoid those problems and obtain relevant results by formulating an appropriate search
strategy, dealing with an overabundance of information, and determining the applicability,
reliability, and currency of search results. In performing the exercise it was found that (1) search
engines may react differently to some of the Boolean operators used to obtain data, (2) at certain
volumes of results the three common search engines in this exercise generally returned many of
the same results at the same level of quality, but in a differing order, (3) the meta search engine
tested did not appear to provide the same level of satisfaction as the other three, due primarily to the
integration of advertising into the results and the inability to find a count of the documents in the
results, and (4) the advanced search tools tested were an excellent way to cull out remaining
unwanted information after the Boolean operators had been used. Based on this experience, it
appears that the best approach is to use specific Boolean operators to improve the quality of
results on multiple search engines, and then follow up with an advanced tool for stubborn search
terms and excess volume.
Web Search Exercise
To begin the exercise, a search subject was determined, initial Boolean operator
strategies were chosen, and a determination was made as to what would constitute a relevant
response resource. Progressive searches were then executed, and the results evaluated.
Search Topic
Göbekli Tepe is a Neolithic archeological site of megalithic structures in southern Turkey
that has been radiocarbon dated to about 11,000 years ago, which predates most known agriculture and
animal domestication, and came before any known writing, or production of pottery, or metal
working (Curry, 2008). To trace additional scientific findings about this site, the initial search
terms were Göbekli Tepe, neolithic, and pre-pottery, with the expectation that the number of
terms might need to be increased as searches progressively sought to narrow search results.
Methodology
A series of four searches was run, each with increasing restrictions that employed
Boolean and other operators, in an effort to achieve the goal of 300 results or fewer on each search
engine. To target authoritative results, the initial keywords were chosen to reflect scientific
terminology and avoid ambiguity. Each level of search was tallied for the relevance of the first
50 results of each search engine in consecutive groups of 10. Tables are provided with these
results. When the search criteria on a given search engine brought the tally below 300, that
search engine was dropped from further testing.
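The tallying procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the positions of the relevant items inside each group of 10 are invented (only the group counts follow the Google row reported in Table I), and an engine with no reported tally is assumed to stay in testing. The tallies passed to the drop rule are those reported in Table III.

```python
def tally_relevance(judgments, group_size=10, groups=5):
    """Count relevant results in each consecutive group of the ranked list."""
    counts = [
        sum(judgments[i * group_size:(i + 1) * group_size])
        for i in range(groups)
    ]
    return counts, sum(counts)


def still_in_testing(total_results, limit=300):
    """Keep an engine in the exercise until its tally falls below the limit.

    None stands for "not stated" (assumption: such an engine is kept).
    """
    return total_results is None or total_results >= limit


def make_group(relevant, size=10):
    # Hypothetical placement: relevant items are simply listed first.
    return [True] * relevant + [False] * (size - relevant)


# Group counts from the Google row of Table I; placement is illustrative.
judgments = []
for n in [5, 2, 2, 2, 3]:
    judgments.extend(make_group(n))

counts, total = tally_relevance(judgments)

# Tallies as reported in Table III.
tallies = {"Google": 3810, "Yahoo": 186, "Bing": 163, "Metacrawler": None}
remaining = [name for name, t in tallies.items() if still_in_testing(t)]
```

With these inputs, `counts` holds the five group counts, `total` their sum, and `remaining` the engines that would continue to a further search.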
Regarding the Operational Definition of Relevance
To obtain an objective definition of relevance, two studies were reviewed. In the first,
Cooper and Chen (2001) proposed that a simple operational definition of relevance is primarily
search results where the user takes an "action" such as printing, downloading, saving a citation,
or jotting down information (pp. 814-815). They also found that in the majority of Web based
library catalog searches such actions occur only infrequently (p. 820). But acting on the
search results for their content was not the objective of this exercise, and so other aspects of the
Cooper and Chen study became more important. Their other indicators were that relevant
sessions are twice as long, use more Boolean and modified searching, that the viewer looks at
twice as many Web pages, and more records are accessed (pp. 821-822). And, surprisingly, they
found that relevant searches retrieved about 30% more total hits than nonrelevant sessions
(perhaps due to their use of Boolean searching).
To broaden relevance beyond text to more general use, the definition of Lee, Chen, and
Ilie (2012) was also considered: "The operational definition of relevance is rooted in visual search
theory, the extent to which filler design components (e.g., text or image) are perceived as being
pertinent to the search tasks" (p. A4).
Because this exercise used search engines rather than catalogs, and accurately tracking
the amount of time spent on each search result was not possible, it was decided that the objective
definition of relevance for this exercise would be: Those text or media resources in the search
result that indicated an authoritative link (such as .edu or .gov), a recognized or academic science
magazine or journal article, an established news agency source, or that underwent the "action" of
being displayed and subsequently confirmed to be of citable quality (such as an academic paper).
Responses deemed not relevant were those without authoritative affiliation and all wikis, blogs,
personal Web pages, advertising, booksellers, and sites requiring passwords. Attempts were also
made to avoid counting differently worded multiple references to the same resource.
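A first pass over this definition can be approximated in code, although most of the judgment (recognizing a journal, a news agency, or a citable paper) still has to be made by the searcher. The sketch below is an assumption-laden illustration, not the procedure actually used in the exercise: the marker lists and the function name are hypothetical, and anything the simple rules cannot decide is returned as "needs review".

```python
from urllib.parse import urlparse

# Hypothetical rule lists drawn loosely from the definition above.
AUTHORITATIVE_SUFFIXES = (".edu", ".gov")
EXCLUDED_MARKERS = ("wiki", "blog")
MEMBERSHIP_HOSTS = ("academia.edu",)  # password site despite its .edu address


def first_pass(url):
    """Return 'relevant', 'not relevant', or 'needs review' for one listing."""
    if "://" not in url:
        url = "http://" + url  # listings are often shown without a scheme
    host = urlparse(url).hostname or ""
    if any(marker in host for marker in EXCLUDED_MARKERS):
        return "not relevant"
    if host.endswith(MEMBERSHIP_HOSTS):
        return "not relevant"
    if host.endswith(AUTHORITATIVE_SUFFIXES):
        return "relevant"
    # Journals, news agencies, and papers need a human check.
    return "needs review"


wiki_label = first_pass("en.wikipedia.org/wiki/Göbekli_Tepe")
journal_label = first_pass("http://www.pnas.org/content/108/20/8183.full")
```

The membership list is checked before the suffix list on purpose, so that a password-protected .edu address is not counted as authoritative merely because of its domain.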
Search Engines
Two of the three search engines selected were chosen based on their popularity as
indicated in the course lectures: Google and Yahoo. The third, Bing, was added to the group
simply out of curiosity about the site. All of the search engines gave a results tally at the top of
the screen, and Google added the amount of time required to execute the search. The meta
search engine, Metacrawler, was chosen because of its additional search history feature to the left
of the screen. But it (and the other meta searchers) did not provide search result tallies, and for
initial searches the number of pages simply increased as one progressed through the screens.
First Search: "Göbekli Tepe" neolithic OR pre-pottery
To find what the broadest results would indicate, the initial search included the Boolean
OR. But because it was also essential that the term Göbekli Tepe appear in every result, that
phrase was placed in quotation marks. For Google this approach worked well, as all of
the results scanned in the first 20 pages appeared to contain the phrase. However, this
was not the case for any of the other search engines, which all appeared to extend the use of OR
to Göbekli Tepe despite the quote marks, so that it appeared in only a small portion of each page
of results. Google also had both the most results and the most relevant results, followed by
Metacrawler in relevance (but primarily derived from Google sites). Yahoo results contained a
lot of blogs and YouTube listings. Bing also had many blog and wiki results, and unexpectedly
changed the results tally as one scrolled through the pages, varying from 45,300 results on page
1 down to 1,160 results on page 5. Metacrawler unfortunately included advertising listings
from the other search engines at the beginning and end of each page, and would continue to do
so throughout the exercise, also incorporating commercial listings such as Ebay into the results.
Google offered 22,800 results for photos that included links, but there was some
repetition and some were very unclear as to their relationship to the search terms. Google also
gave 33 video results with a wide variety of relevance, yet all but one carried Göbekli Tepe
captions beside their links.
Table I. First Search: Relevant Results by Consecutive 10 Results
Search: "Göbekli Tepe" neolithic OR pre-pottery
Results        1st 10   2nd 10   3rd 10   4th 10   5th 10   Total   All Results
Google            5        2        2        2        3       14    69,400
Yahoo             1        0        0        0        1        2    45,200
Bing              1        0        0        0        1        2    45,100*
Metacrawler       5        0        1        2        1        9    not stated
* varied
Yahoo had 19 videos, but their relevance was unclear. It gave no tally for the number of image
results (I stopped counting at 500), which had associated Web addresses but no links, many
photo repetitions, unclear relationships, and some foreign language titles. Yahoo also offered
separate blog results, which were not explored for this exercise. Bing had 3,530 photos with
Web addresses but no links, some of the relationships were unclear, and there was some photo
repetition. Bing had 19 videos, and all but 3 or 4 had captions referencing Göbekli Tepe.
Metacrawler had more than 700 pictures (when I stopped counting) but fewer than half appeared to
be related to the primary keyword. It also had 18 videos of which only 8 appeared to have
captions referencing Göbekli Tepe.
Second Search: Göbekli Tepe neolithic pre-pottery
Results for this search indicated that all of the search engines used the implied Boolean
operator "and", and many of the same resources began to show up in all the search engines, but
in a different order of priority. This time Yahoo had more relevant results, and as with Google
and Bing, the first several hundred results seemed to enforce the "and" requirement that all terms
be included in the results, although at around 900 results Bing began to drop some keywords.
Bing again changed its results tally, varying from 11,300 results on page 1 to 444 results on
page 5. Metacrawler returned the fewest relevant results, and within the first few items quickly
became inconsistent in meeting the "and" requirement of including all terms. Bing and
Metacrawler also tended to have more duplicates of listings.
Table II. Second Search: Relevant Results by Consecutive 10 Results
Search: Göbekli Tepe neolithic pre-pottery
Results        1st 10   2nd 10   3rd 10   4th 10   5th 10   Total   All Results
Google            5        2        2        0        1       10    36,300
Yahoo             3        3        2        4        2       14    9,050
Bing              4        4        2        2        1       13    11,300*
Metacrawler       3        1        2        2        1        9    not stated
* varied
All of the search engines listed a wiki first, although Google next mixed in other types of
sites, while Yahoo and Bing continued to include a higher portion of wiki and encyclopedia
types of sites (such as about.com) in their early listings.
To test its use, the operator "+" was next added to the beginning of each keyword, which
returned an error message from Google, but otherwise did not noticeably change Google results
or tally. When "+" was added to Yahoo, it rearranged the order of the results somewhat and
lowered the tally by 30 items, but the results items and quality appeared to be essentially the
same. When "+" was added to Bing, the tally dropped by several thousand or several hundred,
an inconsistency that varied each time the Web site was exited by clicking to display a resource,
and re-entered. When "+" was added to Metacrawler it did not alter the results, but Metacrawler
also did not drop the "+" after the browser back command was used. So, in order to
get rid of "+", the site had to be exited and re-entered.
Google had 9,690 photo results of the same prior quality, and also eight video results.
This time a quick look at two of the video links turned up one site that was down and one site advertising
to purchase a Web page. Yahoo had no remaining video results, but did have image results (28
rows X 5 pictures each = approximately 140) with the same mixed quality as before. Bing also
had photos (139) with Web addresses but no links, and some of the relationships were unclear
but there was no photo repetition. None of Bing's remaining videos were clearly related to the
search keywords. Metacrawler had 140 pictures of a quality similar to the others, and six videos
that ran previews as repeating loops simultaneously without having to click on the screen, but the
context of all but two of the videos was unclear.
Regarding the Second Search, First 10 Results
(1) Because the variety of responses was beginning to decrease, this seemed to be a good
point to examine the differences between the first 10 search results from each of the three search
engines (excluding their photo listings). See Appendix A for a list of the links, but be aware that
some of the listings changed their order as the searcher clicked to verify Web sites and then
returned to the search results using the browser back arrow. As mentioned earlier, all three
began with a wiki resource. Google's reference had a slightly different description and link
address, but clicking on the link ended up at the same wiki site as the other two.
(2) A similar difference in description and link address for Google and Yahoo also ended
up at the same material for their second listing, an information site owned by a publicly traded
internet company, IAC/InterActiveCorp. On their corporate Web site, IAC advertised that
"Through innovative use of articles, video, searches and many other creative solutions, IAC
allows advertisers to connect to a broad online audience of consumers around the world"
(InterActiveCorporation, 2012). This, of course, raised questions about the objectivity of the
information that about.com provides to the Web searcher when IAC is promoting the
interests of its advertisers. The second listing for Bing was a site for academia.edu, which is a
membership and password access .edu social networking site and corporation, where researchers
and academics may publish papers (Academia.edu, 2012).
(3) The third resource for Google was an academic paper by leading Göbekli Tepe
archeologists (Peters & Schmidt, 2004), and for both Yahoo and Bing there was another
about.com Web site.
(4) The fourth resource for Yahoo was a Web site in English and Turkish listing various
third party articles and images, but with no attributions in English. Google listed another
academic paper by Schmidt (2003), and Bing repeated its last about.com listing.
(5) Next for Yahoo was a Web site without attribution, which appeared to be once
affiliated with a movie about the archeology site. Bing listed the English/Turkish Web site
previously found in Yahoo, and Google provided a magazine article from the Archaeological
Institute of America (Scham, 2008).
(6) The sixth listings included a Google site with unspecific environmental interests, and
owned by a company registered in England and Wales. Bing listed the earlier Yahoo movie site,
and Yahoo provided a journal article refuting Schmidt's research (Banning, 2011).
(7) Google followed with another academic article, from the Proceedings of the National Academy of
Sciences (Finlayson et al., 2011), Yahoo listed the same entry that was Bing's second choice
from Academia.edu, and Bing provided Yahoo's last listing (Banning, 2011).
(8) By now the order of articles was noticeably beginning to shift on Google and Bing
as I continued following the links to verify and document resources and then returned to the search
results using the browser back arrow. Google provided another Schmidt academic article (2010),
and both Yahoo and Bing listed an earlier academic Google entry (Peters & Schmidt, 2004).
(9) The ninth entry by Google was the movie-related resource listed earlier by Yahoo
and Bing, while those two sites both listed the Scham article noted earlier by Google.
(10) Google finished with a new age Web site that did not list an attribution, and the final
entry for both Yahoo and Bing was a personal Web site focused on religion.
Overall, there was much repetition between the sites for their first 10 listings, particularly
near the end. Also, during work on this part of the exercise, the sites, particularly Bing,
noticeably altered the order of records. It was almost as if the search engines treated the
progressive accessing of result sites to review and document them as an
indicator of a preference for those sites. At the end, Google appeared to return the most
academic resources in the first 10 listings, which also resulted in a higher relevance count
for Google. But as indicated in the first two searches, Google's relevance advantage leveled off
and was sometimes equaled or surpassed by the other search engines in subsequent groups of 10.
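The repetition between engines noted in this section can also be checked mechanically once the listings are reduced to a common form. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the normalization rules (dropping the scheme, a leading "www.", and a trailing slash) are a simplification, and only three listings per engine are taken from Appendix A. It will not recognize differently worded addresses that lead to the same resource, which still had to be confirmed by following the links.

```python
def normalize(url):
    """Reduce a listing to a comparable key: no scheme, no www., no trailing slash."""
    key = url.strip().lower()
    for scheme in ("http://", "https://"):
        if key.startswith(scheme):
            key = key[len(scheme):]
    if key.startswith("www."):
        key = key[len("www."):]
    return key.rstrip("/")


def shared_listings(list_a, list_b):
    """Return the normalized listings that appear in both result lists."""
    return {normalize(u) for u in list_a} & {normalize(u) for u in list_b}


# Subset of Appendix A: Google entries 3-5 and Yahoo entries 8-10.
google_sample = [
    "http://www.mnhn.fr/museum/front/medias/publication/10613_Peters.pdf",
    "http://www.exoriente.org/docs/00046.pdf",
    "www.archaeology.org/0811/abstracts/turkey.html",
]
yahoo_sample = [
    "http://www.mnhn.fr/museum/front/medias/publication/10613_Peters.pdf",
    "http://www.archaeology.org/0811/abstracts/turkey.html",
    "http://genealogyreligion.net/gobekli-tepe-houses-of-the-holy",
]

shared = shared_listings(google_sample, yahoo_sample)
```

In this subset the Peters and Schmidt paper and the Scham article are found in both lists even though one engine displayed the address without "http://".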
Third Search: Controlled Search Engine Preferences and Using Either NOT or "-"
Experiences from the last search initiated a review of the Web sites before finalizing the
next search criteria. The image results did not seem to progress further in quality, and were not
pursued further. There were the puzzling changing Bing results to consider, and also an initial
test of using the Boolean operator NOT to see where it was effective, since the use of OR had
created some Boolean compatibility questions. While adding a desired author (Schmidt) and
using NOT on Google decreased the number of results, it also did the opposite of what was
intended: rather than eliminating the unwanted items, Google displayed in bold and prioritized both the
unwanted items and the word "not", as if the operator AND was being assumed for both of them.
When "-" was used instead on "Banning", "blog", "academia", "Childe", and "YouTube", the
outcome decreased to 3,810 results. When "-" was used with "wiki" the wiki results
disappeared, but the total results count increased by 25,000. Results also increased for
"Wikipedia", but not as much. So the wiki issue was set aside for the remaining Google search.
For Yahoo the NOT operator worked extremely well, particularly when it included the
word "wiki", which dropped the results down to a reasonable 186 and could eliminate Yahoo
from further searching. During the pre-search review of Bing, a search history option was
located and turned off. Then the same keywords and NOT operators were used as with Yahoo,
and had a similar effect. Bing was brought down to a tally of 163 (with its usual proportional
variation from page to page) which could eliminate it from further searching.
Next, the NOT Boolean search was entered into Metacrawler, with the expected outcome:
Yahoo results were without the unwanted terms, but Google and Yandex results had the
unwanted terms in bold print and prioritized. An attempt to add "-" with the unwanted terms at
the end of the keyword stream found the search box could not hold all of the entry. So, only the
"-" search was done, which improved the results quality. It also brought several foreign language
sites into the earlier parts of the list, which might mean that the total results decreased (there was no tally to
confirm this). So, it was the "-" results that were evaluated for relevance in the Metacrawler search.
The lists produced by this third search resulted in a much higher number of membership
and password sites in the first 50 results, and also a higher number of booksellers and foreign
language sites. The following table documents the relevant items from this search.
Table III. Third Search: Relevant Results by Consecutive 10 Results
Search: "Göbekli Tepe" neolithic "pre-pottery" Schmidt NOT banning NOT blog NOT academia NOT Childe NOT YouTube,
and also searching on "Göbekli Tepe" neolithic "pre-pottery" Schmidt -Banning -blog -academia -Childe -YouTube
Results            1st 10   2nd 10   3rd 10   4th 10   5th 10   Total   All Results
Google (-)            7        2        4        2        2       17    3,810
Yahoo (NOT)           4        4        1        2        3       14    186
Bing (NOT)            4        2        3        2        0       11    163*
Metacrawler (-)       2        3        2        4        3       14    not stated
* varied
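Because the engines differed in which exclusion operator they honored, the same term lists had to be written out in two forms. A minimal sketch of that bookkeeping follows. The function name and the style labels are hypothetical, and the engine-to-style mapping only records what was observed in this exercise; it is not a statement about how these engines document their query syntax.

```python
def build_query(required, excluded, style):
    """Join required terms, then append each excluded term in the given style."""
    if style == "not":
        exclusions = [f"NOT {term}" for term in excluded]
    elif style == "minus":
        exclusions = [f"-{term}" for term in excluded]
    else:
        raise ValueError(f"unknown exclusion style: {style}")
    return " ".join(list(required) + exclusions)


# Terms from the third search; phrases are already quoted by the caller.
required = ['"Göbekli Tepe"', "neolithic", '"pre-pottery"', "Schmidt"]
excluded = ["Banning", "blog", "academia", "Childe", "YouTube"]

# Operator that worked best per engine in this exercise (see Table III).
EXCLUSION_STYLE = {
    "Google": "minus",
    "Yahoo": "not",
    "Bing": "not",
    "Metacrawler": "minus",
}

queries = {
    engine: build_query(required, excluded, style)
    for engine, style in EXCLUSION_STYLE.items()
}
```

Keeping the term lists separate from the operator style makes it easier to add or drop an excluded term for every engine at once.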
Fourth Search: Bringing in Advanced Search and Searching on English Only
Because the two remaining search engines with more than 300 results had responded
better to the "-" operator, the same terms were again excluded. But this time the items were
entered into the advanced search screen of each site. The effect on Google was immediate
because it enabled the search to continue without the terms wiki and Wikipedia, and permitted
the elimination of foreign languages. It did not, however, exclude results containing the word Google or google.com,
which would have eliminated many useless book sale sites. Instead, the word Turkey was added
as a desirable search term and a few other terms were paired with "-", such as ritual and travel.
This final effort resulted in a tally of 267 results.
For Metacrawler the effect of the advanced search screen was even greater. Most of the
"-" terms from the last search were entered to be eliminated, and the foreign language option was
used. The result was a lean 6 pages of results (at 10 results each) for a final tally of 60 results.
As with the third search, both sets of results had a much higher portion of membership and
password sites, and also a higher number of booksellers. As the relevance results in the
following table indicate, however, quality as a portion of the total remaining lists was high.
Table IV. Fourth Search: Relevant Results by Consecutive 10 Results
Advanced Search - Google (English only): neolithic Schmidt turkey "Göbekli Tepe" "pre pottery" -banning -blog -academia -Facebook -Childe -travel -belly -ritual -darkness -Google -YouTube -wiki -Wikipedia -scribd -bruceowen -book -about-com
Advanced Search - Metacrawler (English only): neolithic pre pottery Turkey Schmidt "Göbekli Tepe" -banning -blog -academia -Facebook -Childe -travel -belly -ritual -darkness -YouTube -wiki -Wikipedia
Results        1st 10   2nd 10   3rd 10   4th 10   5th 10   Total   All Results
Google            4        2        4        2        4       16    267
Metacrawler       4        5        1        1        0       11    60
Summary
By implementing and customizing the use of Boolean operators, the various difficulties
encountered when using different search engines can be overcome. For Google, the problem
consistently was too much information and its incompatibility with operators such as OR and
NOT. Substituting the operators Google did accept (such as "-") effectively began the elimination process, and using the
advanced search very quickly relieved the information glut. The compatibility of Yahoo and
Bing with the Boolean operators made it much quicker to get to a reasonable number of results,
although the changing lists in Bing were a definite distraction when direct work with the found
documents began. Metacrawler’s incorporation of retail sales and advertising information into
the results was also distracting, and in some office light it was difficult to see on the screen
where their own promotional lines ended and the search results began – particularly since the
promotions used the keywords being entered. Metacrawler was also very sensitive to the
Boolean issues with Google resources, and it would probably be better when refining searches to
simply start with Metacrawler's advanced search popup.
There are several things that may thwart a maximized search, including the need to
eliminate terms in order to have results that are of a manageable, usable volume. Some of the
actions necessary to reach fewer than 300 results, particularly for Google and Metacrawler, did
eliminate some of the relevant items from earlier tables. Conversely, failing to cull
significant "noise" from the searches, such as results promoting Google books, leads to missed
opportunities for relevant items to come to the forefront. However, there was sufficient
duplication from Yahoo and Bing to help most of the important items remain visible through the
end of the exercise.
Appendix A
The First 10 Listings for Each Web Site in the Second Search
1. Google: en.wikipedia.org/wiki/Göbekli_Tepe
   Yahoo: en.wikipedia.org/wiki/Göbeklitepe
   Bing: en.wikipedia.org/wiki/Göbeklitepe
2. Google: http://archaeology.about.com/od/huntergatherers/ss/Gobekli-Tepe_2.htm
   Yahoo: archaeology.about.com/od/huntergatherers/ss/Gobekli-Tepe
   Bing: www.academia.edu/Papers/in/Pre-Pottery_Neolithic
3. Google: http://www.mnhn.fr/museum/front/medias/publication/10613_Peters.pdf
   Yahoo: archaeology.about.com/od/huntergatherers/ss/Gobekli-Tepe.htm
   Bing: http://archaeology.about.com/od/huntergatherers/ss/Gobekli-Tepe_2.htm
4. Google: http://www.exoriente.org/docs/00046.pdf
   Yahoo: http://gobekli.net/
   Bing: http://archaeology.about.com/od/huntergatherers/ss/Gobekli-Tepe.htm
5. Google: www.archaeology.org/0811/abstracts/turkey.html
   Yahoo: http://gobeklitepe.info/index.html
   Bing: http://gobekli.net/
6. Google: http://www.environmentalgraffiti.com/offbeat-news/mystery-deliberate-burial-ancient-megalithic-stone-circles/9949
   Yahoo: www.jstor.org/pss/10.1086/661207
   Bing: http://gobekli.net/
7. Google: http://www.pnas.org/content/108/20/8183.full
   Yahoo: www.academia.edu/Papers/in/Pre-Pottery_Neolithic
   Bing: http://www.jstor.org/discover/10.1086/661207?uid=3739576&uid=2&uid=4&uid=3739256&sid=21101264750731
8. Google: http://arheologija.ff.uni-lj.si/documenta/pdf37/37_21.pdf
   Yahoo: http://www.mnhn.fr/museum/front/medias/publication/10613_Peters.pdf
   Bing: http://www.mnhn.fr/museum/front/medias/publication/10613_Peters.pdf
9. Google: http://gobeklitepe.info/news.html
   Yahoo: http://www.archaeology.org/0811/abstracts/turkey.html
   Bing: http://www.archaeology.org/0811/abstracts/turkey.html
10. Google: www.crystalinks.com/gobekli_tepe.html
    Yahoo: http://genealogyreligion.net/gobekli-tepe-houses-of-the-holy
    Bing: http://genealogyreligion.net/gobekli-tepe-houses-of-the-holy
References
Academia.edu. (2012). About Academia.edu. Retrieved from http://www.academia.edu/about
Banning, E. B. (2011, October). So fair a house. Current Anthropology, 52(5).
Retrieved from www.jstor.org/pss/10.1086/661207
Cooper, M. D., & Chen, H. (2001, August). Predicting the relevance of a library catalog search.
Journal of the American Society for Information Science & Technology, 52(10), 813–827.
Retrieved from
http://www.google.com/url?sa=t&rct=j&q=&esrc=s&frm=1&source=web&cd=1&ved=0CB8QFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.21.2024%26rep%3Drep1%26type%3Dpdf&ei=O5B5ULq5FsyF0QHHsICgCg&usg=AFQjCNHdHSdLrW98baft2B-6gDEoGt0gK
Curry, A. (2008, January 18). Seeking the roots of ritual. Science, 319, 278-280.
Retrieved from
http://www.clas.ufl.edu/users/davidson/Proseminar/Week%2015%20cosmology,%20spirituality%20and%20religion/Curry%202008.pdf
Finlayson, B., Mithen, S. J., Najjar, M., Smith, S., Maričević, D., Pankhurst, N., & Yeomans, L.
(2011). Architecture, sedentism, and social complexity at Pre-Pottery Neolithic A WF16,
Southern Jordan. (B. Smith, Ed.) Proceedings of the National Academy of Sciences
(PNAS), 108(20), 8183-8188. Retrieved from
http://www.pnas.org/content/108/20/8183.full
InterActiveCorporation. (2012). IAC & your business. Retrieved from
http://www.iac.com/IAC-And-Your-Business/
Lee, Y., Chen, A., & Ilie, V. (2012, June). Can online wait be managed? The effect of filler
interfaces and presentation modes on perceived waiting time online. MIS Quarterly,
36(2, Appendices), A1-A11. Retrieved from
http://www.misq.org/skin/frontend/default/misq/pdf/appendices/2012/LeeChenIlieAppendices.pdf
Peters, J., & Schmidt, K. (2004). Animals in the symbolic world of pre-pottery neolithic Göbekli
Tepe, South-eastern Turkey: A preliminary assessment. Anthropozoologica, 39(1).
Retrieved from http://www.mnhn.fr/museum/front/medias/publication/10613_Peters.pdf
Scham, S. (2008, November/December). The world's first temple. Archaeology, 61(6).
Retrieved from www.archaeology.org/0811/abstracts/turkey.html
Schmidt, K. (2003). The 2003 campaign at Göbekli Tepe (Southeastern Turkey). Neo-Lithics
2/03. Retrieved from http://www.exoriente.org/docs/00046.pdf
Schmidt, K. (2010). Göbekli Tepe – the Stone Age Sanctuaries. New results of ongoing
excavations with a special focus on sculptures and high reliefs. Documenta Praehistorica
XXXVII, 239-256. Retrieved from
http://arheologija.ff.uni-lj.si/documenta/pdf37/37_21.pdf