

Method Used: I researched plagiarism-detection software. Some products allowed searching a custom location, but only one both supported a custom location and offered a trial (20 searches). Yelp! makes a dataset publicly available for academic purposes. The full dataset is nearly 6 GB; I loaded 100 MB, around 50,000 reviews, into a custom search location. For the tests, I generated new, manipulated, or otherwise similar reviews for comparison.

Major Hurdles

End Goals

• Determine whether there is significant value in analyzing text at a ‘whole of document’ or larger scale than traditional keyword comparisons.

• Evaluate the Yelp! dataset to determine if it is a viable dataset for current and future efforts.

- including text analysis, visualization creation, analytic technique efforts, etc.

• Besides word clouds, how could similarity be visualized or conveyed effectively?

• Identify areas for improvement on traditional text analysis methods & processes.

• Reveal non-obvious, unknown, or hidden styles or patterns of writing that could be explored further.

• Understand how plagiarism software actually works.

- N-grams, fingerprints, nearest neighbor analysis, bag of words
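Two of the techniques listed above, word n-grams and fingerprints, can be sketched in a few lines. This is a minimal illustration, not a description of how any particular plagiarism product works: the function names (`word_ngrams`, `fingerprint`, `jaccard`) are hypothetical, the sample reviews are made up for the example, and the fingerprint shown (keeping the few smallest shingle hashes) is a simplification chosen for brevity.

```python
import hashlib
import re


def word_ngrams(text, n=3):
    """Return the set of word n-grams (shingles) found in a text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def fingerprint(shingles, keep=4):
    """Hash every shingle and keep the `keep` smallest hashes.

    Assumption for illustration: a compact fingerprint built this way lets
    two texts be compared without storing every shingle.
    """
    hashes = sorted(
        int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16) for s in shingles
    )
    return set(hashes[:keep])


def jaccard(a, b):
    """Jaccard similarity: shared items divided by all distinct items."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Made-up sample reviews (not taken from the Yelp! dataset).
original = "I have been coming here for years and the food is always great"
edited = "I have been coming here for years and the service is always great"
unrelated = "Parking was impossible and nobody answered the phone"

print(jaccard(word_ngrams(original), word_ngrams(edited)))     # one word changed
print(jaccard(word_ngrams(original), word_ngrams(unrelated)))  # no shared trigrams
```

Changing a single word only disturbs the n-grams that contain it, so a lightly manipulated review still shares most of its shingles with the original, while an unrelated review shares none.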

Interesting Findings

Project Overview

Determine what similarity, even similarly authored text, could be revealed in volumes of data.

• Can writing styles be evaluated to determine the presence of the same author, same context, same message, etc.?

• Do writing styles change or transform based on the writing styles and patterns used by others around them?

• Most reviews were under 50 words and lacked a vocabulary large or diverse enough to avoid similar writing styles.

• Many contained writing styles that used common filler phrases like:

- First, let me begin by saying…

- I usually do not leave bad reviews…

- Love everything about…

- I have been coming here for years…

- I have been here a few times…

- Love love love [this/their]… (Love love love was the most used phrase in the dataset)

• The available Yelp! dataset contained over 4.1 million reviews. I was able to load 1.4 million into OpenOffice before the limit was reached. This increased the chance of false-positive associations due to the lack of diversity in typical positive/negative review writing.

• Plagiarism software isn’t designed for this type of use, and most products require a large financial investment.

- Few allowed a custom location in the check; most checked Internet locations only.

- Expensive individual and group costs.

• Plagiarism software has excellent text parsing, comparison, evaluation, visualization, and other desired analytic properties.

• I was able to search through 50,000 reviews in seconds to match intentionally crafted and manipulated texts against their originals.

• Found possible bots and fake reviewers.

- A number of reviews gave low ratings AND mentioned another restaurant, business, or service instead.

- Users posted non-review-related information on popular pages (links to sites that appeared malicious or misleading).

• Other reviewers displayed a writing style that took on the style of previous reviews for the same restaurant, business, or service.
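Matching a manipulated review back to its original, as described in the findings above, can be approximated with a bag-of-words representation and a nearest-neighbor ranking. The sketch below is only an assumption about one way to do this, not the method used by the trial software; the helper names (`bag_of_words`, `cosine`, `nearest_reviews`) are hypothetical and the three sample reviews are invented, borrowing the filler phrases noted earlier.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word appears, ignoring word order."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_reviews(query, reviews, top_k=2):
    """Rank stored reviews by similarity to the query, most similar first."""
    query_bag = bag_of_words(query)
    scored = [(cosine(query_bag, bag_of_words(r)), r) for r in reviews]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Invented stand-ins for a custom search location.
reviews = [
    "Love love love their tacos and the staff is friendly",
    "I usually do not leave bad reviews but the wait was over an hour",
    "I have been coming here for years and the food is always great",
]
# A manipulated copy of the third review with one word swapped.
query = "I have been coming here for years and the service is always great"

for score, review in nearest_reviews(query, reviews):
    print(round(score, 3), review)
```

A brute-force scan like this compares the query against every stored review; at the scale of millions of reviews an index would likely be needed, which is outside the scope of this sketch.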

Finding Similarly Authored Text

Jody Coward

LAS

919-987-3352

[email protected]