
Clustered Layout Word Cloud for User Generated Review

Ji Wang
Virginia Tech
Blacksburg, VA
[email protected]

Jian Zhao
University of Toronto
Toronto, Canada
[email protected]

Sheng Guo, Chris North
Virginia Tech
Blacksburg, VA
{guos, north}@vt.edu

ABSTRACT
User generated reviews, like those found on Yelp, have become important reference material in casual decision making, like dining, shopping and entertainment. However, very large amounts of reviews make the review reading process time consuming. A text visualization could be used to speed up the review reading process. In this paper, we present the clustered layout word cloud: a text visualization that assists in making decisions more quickly based on user generated reviews. We used a natural language processing approach, called grammatical dependency parsing, to analyze user generated review content and create a semantic graph. A force-directed graph layout was applied to the graph to create the clustered layout word cloud. We conducted a two-task user study to compare the clustered layout word cloud to two alternative review reading techniques: a random layout word cloud and normal block-text reviews. The results showed that the clustered layout word cloud offers faster task completion time and a better user satisfaction level than the other two review reading techniques.

Author Keywords
word clouds, evaluation, text visualization

ACM Classification Keywords
H.5.m. Information Interfaces and Presentation (e.g. HCI): Miscellaneous

General Terms
Human Factors; Design; Measurement.

INTRODUCTION
User generated reviews have become a popular resource for decision making in recent years. For example, Amazon, Yelp, and IMDB provide large quantities of user generated reviews for user decision making or reference. However, in most cases, reading large quantities of reviews is a difficult and time-consuming task. Also, a person does not want to spend a large amount of time making the non-critical decision of where to dine or what movie to watch. In this situation, a visualization that summarizes the user generated reviews is needed for perusing reviews.

Current visualization approaches [27] support the ability to visualize quantitative information about businesses. However, quantitative information does not always account for the descriptions given in the user reviews, which can often be detailed, providing more context about the business being reviewed. Word clouds have the potential to process user generated reviews and provide more context than quantitative information alone.

Word clouds are a popular text visualization presenting word frequency, popularity or importance through font size. The majority of word clouds lay out the words in a random, though aesthetically appealing, way. They can provide a summarization of large amounts of text in limited space and guide users' attention to discover more related information. Word clouds have been widely used in both business and research, e.g. Opinion Cloud [6], Review Spotlight [28], Tirra [14] and Wordle [25] [11].

Although they are an informative tool, the random layout of word clouds does not provide an overview of the data. In a review reading task, for example, it is necessary to present clustering of data for navigation and overview. However, the random layout of word clouds can require significant mental effort when trying to understand the review content. Thus, a word cloud's random layout strategy leads to a higher cognitive load. Furthermore, the random layout of a word cloud has another shortcoming: it precludes using the layout to encode another dimension of information.

If applied to review reading, the single dimension of word frequency cannot provide a whole overview of review content. Adding another dimension of the review content to the word cloud could improve users' reading performance because it provides more information to the user.

Semantics are a very important dimension of review content [28] [13]. Grammatical dependency parsing is an effective NLP approach [5]. The result of this processing, a grammatical dependency graph (GDG), has been used in applications [26] [4] to enhance the user's understanding of text sources. In our paper, a GDG provides the semantic information for our clustered layout word cloud.

In this paper, we present an approach that uses the layout of a word cloud to embed semantic information from grammatical dependency parsing for the review reading task. Our two research questions are:

1. How can the semantic information of a GDG be embedded into word clouds?

2. Will embedding the semantic information of a GDG into the layout influence user performance? Furthermore, will words that are semantically close also being spatially close in the word cloud layout enhance users' performance and experiences? If so, how?

In this paper, our primary contribution is the concept and prototype of a clustered layout word cloud embedded with the semantic information of a GDG to enhance users' performance in review reading. Our word cloud is designed to provide more insight about user generated reviews by creating clusters based on semantic information.

As a secondary contribution, we conducted a user study that compared our clustered layout word cloud with two alternative review reading techniques: normal review reading and a normal random word cloud. The results of the user study show that the clustered layout word cloud improves performance: lower task completion time, lower error rate, and a better user satisfaction level.

RELATED WORK

Review Visualization
Review visualizations can be categorized into the following two types based on the characteristics of reviews.

Quantitative feature based review visualizations are often used to visualize general rating, price level and other quantitative features of a business. Wu et al present a visualization to show hotel feedback based on quantitative features [27]. The drawback of quantitative feature based visualizations is that, for many products or services, quantitative features alone cannot describe the whole picture.

Content based review visualizations visualize the text content of review data, like review opinions. Liu and Street first use a natural language processing approach to analyze review content and then visualize the opinions extracted from these reviews using a bar chart [13]. Caternini and Rizoli presented a multimedia interface to visualize the fixed features extracted from review content that reflected users' opinions [2]. Yatani et al presented Review Spotlight: a frequency based word cloud with "adjective plus noun" word pairing for visualizing user reviews [28]. Huang et al present RevMiner: a smartphone interface that also used natural language processing to analyze and navigate reviews [9]. Both of the word clouds presented in Review Spotlight and RevMiner used a random layout. Review Spotlight used "adjective plus noun" word pair extraction and RevMiner used text similarity for semantic information. Our clustered layout word cloud is also a content based review visualization. The difference from Review Spotlight and RevMiner is that we use a grammatical dependency parsing NLP approach to provide semantic information.

Word Cloud Visualization
Word clouds (tag clouds) are a very popular text visualization tool. They present the word frequency data of a variety of text sources and encode the frequency in each word's font size.

Kaser and Lemire presented an algorithm to draw word clouds in the limited space of HTML, based on table components [10]. Viégas and Wattenberg invented a more compact word cloud entitled Wordle, which uses a greedy space filling approach to create the word cloud while maintaining the word cloud's feature of using font size to represent word frequency or importance level [7].

Recently, word clouds have begun to embed semantic information based on natural language processing results to enhance the detail provided in word clouds. Cui et al presented a context preserving tag cloud which used information entropy and word similarity to generate keywords and the distances between words [3]. Paulovich et al presented ProjCloud, a clustering word cloud approach that provided semantic information for document clustering [21]. It used an MDS approach to project words into fixed polygons and present the relationships between documents. Both the Review Spotlight word cloud generation approach from Yatani et al [28] and the RevMiner word cloud generation approach from Huang et al [9] used natural language processing to extract semantic information from review content and embed it into their word clouds.

Our work is related to Review Spotlight [28] and ProjCloud [21]. Review Spotlight uses a random layout word cloud to show review content, whereas our clustered layout word cloud embeds semantic information into a clustered layout to present review content. Unlike ProjCloud, our word cloud focuses on the term level, not the document level.

Word Cloud Evaluation
Word cloud evaluation is difficult because it requires measuring the insight users gain from a visualization [19].

Evaluation of word cloud layout
Lohmann et al evaluated how the layout of word clouds influenced users' performance in different types of tasks [15]. They found thematic layouts were good for the task of finding words that belonged to a specific topic. However, Lohmann's work was not based on real data, but on an experimental semantic layout word cloud that they created. Our design provides a clustered layout word cloud created by the algorithms described in the following sections, so our user study differs in that the word clouds were dynamically created. At the same time, our study data is real review data from the Yelp Academic Dataset [29].

Evaluation based on task completion
Task completion tests are one way to evaluate and compare word cloud visualizations. Bateman et al provided an evaluation of word clouds on the aspect of finding the most important word in a tag cloud based on font size, color and position [1]. They found font size and font weight were the aspects that caught the user's attention most when using word clouds. Rivadeneira et al conducted a user study to obtain performance metrics on four types of tasks using word clouds: search, browsing, impression formation, and recognition [22]. They used the impression formation metric for word cloud evaluation. Yatani et al also used impression formation to evaluate their Review Spotlight word cloud visualization [28]. Schrammel et al presented topical word cloud layouts that could assist users in finding specific words as well as words that were relevant to specific topics [23].

To the best of our knowledge, our paper is the first attempt to use real reviews for a clustered layout word cloud. At the same time, we use real review data and a real-life task design for the user study, so this paper's work can be easily applied to applications in daily life.

[Figure 1. Processing pipeline of the clustered layout word cloud: Yelp review data → natural language processing → semantic graph layout → graph drawing algorithm → force-directed layout → word cloud appending → clustered word cloud.]

CLUSTERED LAYOUT WORD CLOUD
User generated reviews are a useful tool for casual decision making in daily life, when it comes to making selections in social settings and purchases. Currently though, there are cumbersome amounts of reviews, and a user wants to assess them and make a decision in a short amount of time, so current review reading makes it difficult for users to make casual decisions based on the reviews.

In order to improve the user generated review reading process via the word cloud visualization approach, the word cloud needs to meet two design requirements, which emphasize adding more detail to the word cloud (based on previous work [28] [21] [11]):

• The word cloud needs to use its layout to present semantic information.
• The word cloud needs to support interaction for retrieval of review content.

We have implemented a prototype called the clustered layout word cloud to achieve the requirements above. The overview and techniques used to accomplish these requirements are detailed in the remainder of this section.

System Overview
A clustered layout word cloud is a word cloud which embeds semantic information in its layout, meaning it encodes details of the context of the words through visualization properties such as spatial position. The clustered layout word cloud we implemented also supports clickable interaction, which allows the user to conduct keyword searching using words in the visualization. This gives the user the ability to retrieve reviews that contain keywords relevant to them on demand, as opposed to conducting manual searches for their keyword.

In our paper, we use a grammatical dependency parsing NLP approach to process review content for semantic information. The result of this processing is a grammatical dependency graph (GDG). We then embed this semantic information into the word cloud layout.

Figure 1 shows our processing pipeline of how we went from raw user generated review data to the clustered layout word cloud. The pipeline is divided into four parts:

Part 1: A natural language processing technique was applied to the user generated review data in order to process the review content and generate the semantic graph layout.


Figure 2. (a) Normal Review Reading UI, (b) Normal Random Word Cloud UI, (c) Clustered Layout Word Cloud UI.

Part 2: From the semantic graph layout, the LinLogLayout energy model layout algorithm was used to create a force-directed graph layout to provide a basis for the clustered layout word cloud.
Part 3: A word cloud generation approach was then applied to the force-directed graph layout in order to display clustering information.
Part 4: Color encoding and clickable interaction were appended to the word cloud to provide faster recognition of clusters and to enable quick keyword searching, respectively.

Figure 2 (c) shows the user interface which we used to present our clustered layout word cloud. Section F of the interface shows the clustered layout word cloud. Section G shows the history of keywords in the clustered layout word cloud clicked by the user. The bottom portion of the interface, Section H, shows the search results for the specific keyword clicked by the user, with the clicked keyword highlighted in the search results.

In the following subsections, we explain the details of this design.

Data
In our prototype, we used the Yelp Academic Dataset [29] as our user generated review dataset. The Yelp Academic Dataset provides the business profiles and user review content of 250 businesses, like shopping centers and restaurants, from 30 universities for academic and research exploration. The dataset includes three types of objects: business profile objects, user profile objects and review content objects. Yelp provides these objects in JSON format. In our prototype, we loaded these objects into a MySQL database as three tables: a Business Profile Table, which stored the basic profile information of local businesses; a User Profile Table, which stored basic user information on Yelp; and a Review Table, which stored review content and vote information for each specific review from Yelp users.

In our project, we utilized the Business Profile Table to selectbusinesses and restaurants and the Review Table to processthe user generated review content and generate our clusteredlayout word cloud.
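The loading step above can be sketched as follows. This is a minimal illustration, not the prototype's actual code: the field names and sample records are hypothetical stand-ins for the Yelp JSON schema, and SQLite stands in for the MySQL database used in the prototype.

```python
import json
import sqlite3

# Illustrative sketch: load Yelp-style JSON objects (one per line) into
# business and review tables. Field names are assumptions, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (id TEXT PRIMARY KEY, name TEXT, stars REAL)")
conn.execute("CREATE TABLE review (business_id TEXT, user_id TEXT, text TEXT, votes INTEGER)")

records = [
    '{"type": "business", "id": "b1", "name": "La Baguette", "stars": 4.5}',
    '{"type": "review", "business_id": "b1", "user_id": "u1", "text": "Great bread!", "votes": 3}',
]
for line in records:
    obj = json.loads(line)
    if obj["type"] == "business":
        conn.execute("INSERT INTO business VALUES (?, ?, ?)",
                     (obj["id"], obj["name"], obj["stars"]))
    elif obj["type"] == "review":
        conn.execute("INSERT INTO review VALUES (?, ?, ?, ?)",
                     (obj["business_id"], obj["user_id"], obj["text"], obj["votes"]))

# Retrieve all review text for one business, as the word cloud generator would.
rows = conn.execute("SELECT text FROM review WHERE business_id = 'b1'").fetchall()
```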

Natural Language Processing (NLP)
In order to add grammatical dependency information to our clustered layout word cloud, we used the records in the Review Table for each business to obtain semantic information. By pre-processing with NLP tools, review data was extracted and converted to a textual graph of key phrases.

To construct the graph, the review content for a specific restaurant was first extracted from the raw dataset and chunked into sentences. Then, the sentences were parsed based on grammatical relations, and eventually the relationship information was filtered to form a context graph from the user generated review content for a specific business/restaurant from the Yelp Academic Dataset. We used the Stanford Parser [5] [24] and the OpenNLP toolkits [20] to create the context graphs.

First, we break down each sentence into typed-dependency parses using the Stanford Parser. In Figure 3, (a) shows the typed dependency parse for a sample sentence.

[Figure 3. (a) A collapsed typed dependency parse for the sentence "La Baguettes sandwiches are the best!". (b) The sentence level graph transferred from the parse in (a).]

(b) shows a

filtered sentence-level grammatical relations. We filtered out all edges with relations such as aux, auxpass, punct, det, cop, etc., and retained only terms with specific part-of-speech tags, such as VB, VBD, VBG, VBN, NN, NNP, NNS, JJ, JJR, JJS, etc. From the sentence-level grammatical relations now available to us, we performed a concatenation of the relations to form a graph over all the reviews of the restaurant the user is querying. In the first part of this process, we extracted the main grammatical relations within each sentence. If relations from different sentences share terms (vertices), we connected the relations through these shared vertices. The grammatical dependency graph for each restaurant was usually large due to the number of reviews we had for each restaurant.
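The filter-and-concatenate step can be sketched as below. This is a hedged illustration, not the authors' pipeline: it assumes dependency triples are already available as (head, head POS, relation, dependent, dependent POS) tuples, whereas the real system obtains them from the Stanford Parser.

```python
from collections import defaultdict

# Relations dropped and POS tags kept, as listed in the text.
DROP_RELATIONS = {"aux", "auxpass", "punct", "det", "cop"}
KEEP_POS = {"VB", "VBD", "VBG", "VBN", "NN", "NNP", "NNS", "JJ", "JJR", "JJS"}

def build_gdg(sentences):
    """Concatenate filtered sentence-level relations into one undirected graph.

    Vertices shared across sentences merge automatically because the graph
    is keyed by the term string.
    """
    graph = defaultdict(set)
    for deps in sentences:
        for head, head_pos, rel, dep, dep_pos in deps:
            if rel in DROP_RELATIONS:
                continue  # drop aux, punct, det, cop, ... edges
            if head_pos not in KEEP_POS or dep_pos not in KEEP_POS:
                continue  # keep only nouns, verbs, adjectives
            graph[head].add(dep)
            graph[dep].add(head)
    return graph

# Hypothetical parses for three sentences; the cop relation is filtered out.
sentences = [
    [("sandwiches", "NNS", "nsubj", "best", "JJS")],   # "sandwiches are the best"
    [("sandwiches", "NNS", "amod", "fresh", "JJ")],    # "fresh sandwiches"
    [("are", "VBP", "cop", "best", "JJS")],            # dropped: cop relation
]
g = build_gdg(sentences)
```

The shared vertex "sandwiches" links the two surviving relations, so `g["sandwiches"]` contains both "best" and "fresh".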

In our final step, we assigned weight values to both the vertices and the edges of the semantic graph layout. For assigning the weight W_i of a vertex (term) T_i, we use the traditional IDF value as described below:

W_i(T_i) = log(N / df_i) · (log df_i + 1)    (1)

N is the number of sentences, and df_i is the document frequency, which denotes the number of sentences that contain this term. For weighting the edges, the same strategy as for vertex weighting was used, the only difference being that df_j denotes the number of sentences that contain this edge type:

W_j(E_j) = log(N / df_j) · (log df_j + 1)    (2)
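Equation (1) translates directly into code. One caveat: the paper does not state the logarithm base, so the natural logarithm is assumed here.

```python
import math

def vertex_weight(n_sentences, df):
    """W_i = log(N / df_i) * (log(df_i) + 1), per Equation (1).

    n_sentences: total number of sentences N.
    df: number of sentences containing the term (document frequency).
    Log base is an assumption (natural log); the paper leaves it unspecified.
    """
    return math.log(n_sentences / df) * (math.log(df) + 1)

# A term appearing in 10 of 1000 sentences:
w = vertex_weight(1000, 10)
```

Edge weights (Equation (2)) use the same function with df counting sentences that contain the edge type.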

Force-directed Graph Drawing
Although the grammatical dependency graph provides information about relationships within the user generated review set for a specific business, it is dense in its number of connections, so it cannot be presented clearly. The density of the grammatical dependency graph led us to use a force-directed graph layout approach to create a two-dimensional representation of the semantic layout.

Force-directed graph layout is a good model for performing the clustering process in our word cloud design. The process of converting a semantic graph layout to a force-directed graph layout is straightforward and widely used [3] [21], and the resulting force-directed graph layout is easy to understand. However, the force-directed graph layout has a scalability problem; the layout computation process is time-consuming when there is a large number of iterations due to the large number of edges and nodes.

In comparison to the Fruchterman-Reingold model, the energy model approach to performing a force-directed graph layout is better. The energy model directly influences the quality and speed of graph layout generation [18]. The essential idea of the energy model is to map the layout to an energy value that is related to the optimization goal of the whole layout. Then, an iterative algorithm is used to search the space of possible solutions for the lowest-energy case of the entire layout. A good layout is considered to have minimal energy [17]. At the same time, the energy model can also provide clustering information based on the force-directed graph layout results [18].

In our design, we use the LinLogLayout open source code [12] as the energy model to create our force-directed graph layouts. The running time of the generation process was satisfactory; during our user study, all clustered layout word clouds were generated on-the-fly.
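To make the iterative idea concrete, here is a generic spring-embedder sketch. Note this is a Fruchterman-Reingold-style illustration of iterative force minimization, not the LinLog energy model that the prototype actually uses; constants and the cooling schedule are arbitrary choices for the example.

```python
import math
import random

def layout(edges, nodes, iterations=200, k=0.1, seed=42):
    """Toy force-directed layout: connected nodes attract, all pairs repel."""
    rng = random.Random(seed)
    pos = {v: [rng.random(), rng.random()] for v in nodes}
    for step in range(iterations):
        disp = {v: [0.0, 0.0] for v in nodes}
        for a in nodes:                       # repulsion between every pair
            for b in nodes:
                if a == b:
                    continue
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d                 # repulsive force magnitude
                disp[a][0] += dx / d * f
                disp[a][1] += dy / d * f
        for a, b in edges:                    # attraction along edges
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k                     # attractive force magnitude
            disp[a][0] -= dx / d * f
            disp[a][1] -= dy / d * f
            disp[b][0] += dx / d * f
            disp[b][1] += dy / d * f
        t = 0.1 * (1 - step / iterations)     # cooling: shrink max step size
        for v in nodes:
            d = math.hypot(*disp[v]) or 1e-9
            pos[v][0] += disp[v][0] / d * min(d, t)
            pos[v][1] += disp[v][1] / d * min(d, t)
    return pos

# Two connected pairs should settle into two spatial clusters.
pos = layout([("a", "b"), ("c", "d")], ["a", "b", "c", "d"])
```

After iteration, connected words end up closer together than unconnected ones, which is the property the clustered layout relies on.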

Word Cloud Visualization and Interaction
The force-directed graph layout is then turned into our clustered layout word cloud. Our design is mainly based on the traditional word cloud generation approach in [7].

However, the clustered layout word cloud is based on NLP results and not on random placement like traditional word clouds. Thus, there is one difference between the traditional word cloud generation process and ours: the initial positions of the word tags are not random. The initial positions of the word tags in our design are provided by the force-directed graph layout, which is generated by the LinLogLayout energy model.

The generation process of the clustered layout word cloud visualization can be described in the following steps:

Step 1: Scan the vertex list of the grammatical dependency graph and use the weights of the vertices to generate the word list for rendering. High-frequency words are rendered first.
Step 2: For each word in the rendering list, the initial position is given by the force-directed layout algorithm. As in the traditional word cloud approach, the font size is encoded by the weight of the vertex (Equation (1)). We then check whether the word collides or overlaps with already-placed words. For this check, we use a double-buffer mask: previous results are drawn into one image buffer, the word-to-be-placed into another, and the two buffers are compared for collision or overlap.
Step 3: If a collision or overlap occurs, the word follows an Archimedean spiral to find a new position, until it finds a position where it can be drawn without collision or overlap.
Step 4: Steps 2 and 3 are repeated until all the words are drawn.
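Steps 2 and 3 can be sketched as below. This simplified illustration approximates each word by its bounding box; the actual prototype uses a double-buffer pixel mask for collision detection, and the spiral constants here are arbitrary.

```python
import math

def overlaps(box, placed):
    """Axis-aligned bounding-box overlap test (stand-in for the pixel mask)."""
    x, y, w, h = box
    for px, py, pw, ph in placed:
        if x < px + pw and px < x + w and y < py + ph and py < y + h:
            return True
    return False

def place_words(words):
    """words: list of (text, init_x, init_y, width, height), sorted by weight.

    Initial positions come from the force-directed layout; on collision the
    word walks outward along an Archimedean spiral (r = a * theta).
    """
    placed, result = [], {}
    for text, x0, y0, w, h in words:
        x, y, theta = x0, y0, 0.0
        while overlaps((x, y, w, h), placed):
            theta += 0.1
            r = 2.0 * theta
            x = x0 + r * math.cos(theta)
            y = y0 + r * math.sin(theta)
        placed.append((x, y, w, h))
        result[text] = (x, y)
    return result

# Two words with the same initial position: the second spirals to a free spot.
pos = place_words([("food", 0, 0, 40, 12), ("service", 0, 0, 50, 12)])
```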

We used the Java 2D graphics library to implement this clustered layout word cloud appending process. We used color to indicate the different clusters based on the force-directed graph layout. The color is an added dimension of visual stimuli, allowing the user to find distinct clusters faster.

In order to achieve the second design requirement of supporting interaction with review content based on keywords, we made all of the words in the word cloud clickable, retrieving the review content about the business that contains the clicked word. When the user clicks a word in the word cloud, a section of our interface shows reviews containing that keyword (Section E of Figure 2 (b) and Section H of Figure 2 (c)), with the keyword highlighted in each of the reviews. The open source search engine Apache Lucene [16] was used to build an index for review content retrieval.
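The click-to-retrieve-and-highlight interaction can be illustrated with a toy inverted index. This is only a conceptual stand-in for the Lucene index used in the prototype; the tokenizer and highlight markers are assumptions for the example.

```python
import re
from collections import defaultdict

def build_index(reviews):
    """Map each lowercased token to the set of review indices containing it."""
    index = defaultdict(set)
    for i, text in enumerate(reviews):
        for token in re.findall(r"[a-z']+", text.lower()):
            index[token].add(i)
    return index

def search(reviews, index, keyword):
    """Return matching reviews with the clicked keyword wrapped in brackets."""
    hits = sorted(index.get(keyword.lower(), ()))
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [pattern.sub(lambda m: f"[{m.group(0)}]", reviews[i]) for i in hits]

reviews = ["Great sandwiches and coffee.", "Slow service.", "The sandwiches are the best!"]
index = build_index(reviews)
results = search(reviews, index, "sandwiches")
```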

USER STUDY
The major goal of this user study was to assess the effectiveness of the clustered layout word cloud on users' performance. To the best of our knowledge, our work is the first to use real review data to conduct a user study on a clustered layout word cloud.

Participants
We recruited 15 participants (7 female) for our study. Their average age was 24.39, ranging from 20 to 32. All participants were familiar with normal word clouds. No participants were color blind, and all had normal or corrected-to-normal vision. All participants were native English speakers. They were all undergraduate (4) or graduate (11) students from our university. Their academic majors were in the fields of computer science, mathematics, electrical engineering and veterinary science. The study lasted about 50 minutes and each participant was compensated with $30 cash.

Apparatus
All tasks were performed on a Lenovo Thinkpad T61 laptop with an Intel Centrino 2.10GHz CPU and 4 GB of memory running Kubuntu 12.04. An external keyboard and mouse were used, as with a normal desktop. The display was a Dell 19-inch LCD monitor with 1280×1024 resolution. The users were able to switch between multiple desktop views. Task completion time was measured using a stopwatch. After performing the visualization tasks, participants completed a TLX-based Likert-style questionnaire and survey on a Lenovo Thinkpad T400s laptop.

Review reading techniques
In this user study, we compared the clustered layout word cloud review reading technique to two alternative techniques: normal review reading and a normal random word cloud. The normal review reading technique was used because this is commonly how users read reviews, and the normal random word cloud technique was used because it doesn't include a clustered layout. Thus, it is a good control group in our study.

Normal Review Reading (RR):
In this technique, we listed all of the textual user generated review content for one pair of restaurants in a normal text editor, where the user could switch freely between the two restaurants' review content if needed (Figure 2 (a)). The overall rating for each review was omitted. Users can scroll up and down to check all review content and use the search box in the text editor to search for keywords to refine their search; the keyword is then highlighted in the text.

Normal Random Word Cloud (RW):
In this technique, we used the random word cloud layout to generate a normal word cloud (Figure 2 (b)). The user can click any word they feel is a keyword in their search, and review content containing this keyword is displayed in Section E of Figure 2 (b).

Clustered Layout Word Cloud (CL):
In this technique, the user used the clustered layout word cloud to read reviews (Figure 2 (c)). The user can click any word they feel is a keyword in their search, and review content containing this keyword is displayed in Section H of Figure 2 (c).

In Figure 2, we show an example of the three review reading techniques for the same Yelp review data: (a) is Normal Review Reading, (b) is Normal Random Word Cloud, and (c) is Clustered Layout Word Cloud.

Tasks
The two types of tasks we used to assess users' performance were a decision making task and a feature finding task. Both tasks mirror events that happen in daily life:

1. The decision making task occurs when people compare similar restaurants. It is an overview task in which one restaurant must be distinguished from another.

2. The feature finding task occurs when people attempt to find non-quantitative details. It is a specific information retrieval task in which more detail about the restaurant of their choice must be located.

With each review reading technique, users performed the two types of tasks described below:

Decision Making Task:
In this task, users applied a review reading technique to decide which restaurant they would go to from the pair of restaurants provided in the test. Each participant used the same review reading technique for two pairs, resulting in six decision making trials.

We selected 6 pairs of restaurants for this type of task. Three were good-good pairs, meaning both restaurants had high ratings (4 or 5 stars in the Yelp Business Profile Table). The other three were good-bad pairs, meaning the restaurants differed significantly in rating: "good" was high (4 or 5 stars) and "bad" was low (1 or 2 stars). Participants were unaware of how the good and bad restaurants were chosen. All restaurants in the same pair had a similar price level, location, and cuisine. To counterbalance ordering, the intra-test procedure alternated between participants: one participant would see the good restaurant before the bad one in a good-bad trial, and the next participant would see them in the opposite (bad-good) order.
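The alternating counterbalancing scheme above can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' actual study software; the function and restaurant names are invented for the example.

```python
def assign_orderings(participant_ids, pairs):
    """Alternate the within-pair presentation order between consecutive
    participants: even-indexed participants see (good, bad), odd-indexed
    participants see (bad, good) for every pair."""
    schedule = {}
    for i, pid in enumerate(participant_ids):
        if i % 2 == 0:
            schedule[pid] = [tuple(p) for p in pairs]            # good first
        else:
            schedule[pid] = [tuple(reversed(p)) for p in pairs]  # bad first
    return schedule

# Three hypothetical good-bad pairs, listed good-first
pairs = [("good_A", "bad_A"), ("good_B", "bad_B"), ("good_C", "bad_C")]
schedule = assign_orderings(["P1", "P2"], pairs)
```

Here `schedule["P1"]` presents each good restaurant first, while `schedule["P2"]` reverses every pair, balancing order effects across participants.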

Figure 4. Task Design and Study Procedure

Feature Finding Task:
In this task, we evaluate how people find the features of restaurants. We defined two kinds of features: food features, and non-food features such as the restaurant's environment or service. Each of the 3 review reading techniques was used to complete two distinct feature finding tasks, one locating a food-relevant feature and one locating a non-food-relevant feature. For the resulting 6 trials we provided 6 different restaurants, all with high ratings (4 or 5 stars), similar to the "good" restaurants in the decision making task. The order of the review reading techniques was randomized for each user, but every technique had the participant locate both a food and a non-food feature.

During the feature finding task we imposed a time limit on half of the users; the other half had no time limit. The time limit for each of the 6 trials was 2 minutes. All tasks are summarized in Figure 4.

Hypotheses

Based on the research questions presented in the Introduction, we formulated the two hypotheses listed below.

[H1] CL leads to better user performance than RW and RR for decision making and feature finding on user generated reviews.

The assumption is that, because CL embeds NLP results, users can obtain more semantic information from the clustered layout than from the other forms of presenting review content. We expected this semantic information and clustering to influence users' task completion time as well as the accuracy of decision making and feature finding.

[H2] CL has a positive impact on user satisfaction in the decision making and feature finding tasks on user generated reviews.

It is expected that user satisfaction feedback will show that participants prefer CL over RW and


Figure 5. Decision Making Task Completion Time (Good-good pair (a), Good-bad pair (b)) and Feature Finding Task Completion Time (c)

RR.

Procedure

The study used a within-subjects design: each participant used all three review reading techniques to complete the two tasks. Before the study started, we let users familiarize themselves with the three techniques on the same data set.

To obtain both quantitative and qualitative feedback, we evaluated how the clustered layout word cloud influences users' restaurant choices. The evaluation comprised task completion time, error rate, a TLX-based Likert-style questionnaire [8], users' preference rankings, and qualitative feedback.

For the decision making task, we measured the completion time and error rate of each trial. We treated the performance of normal review reading as the baseline in our user study. For the feature finding task, task completion time was recorded only for participants in the no-time-limit group.

After each review reading technique trial, users completed the TLX-based Likert-style questionnaire [8] to report mental demand, physical demand, and other metrics measuring task difficulty. After each of the two task sessions, the user ranked the three review reading techniques by preference. At the conclusion of the two task sessions, the user gave a general evaluation of each review reading technique.

RESULTS AND ANALYSIS

In our user study, we conducted two types of tasks, a decision making task and a feature finding task, to evaluate three different review reading techniques: Normal Review Reading (RR), Normal Random Word Cloud (RW), and Clustered Layout Word Cloud (CL).

Decision Making Task Results

In the decision making tasks, users evaluated 6 pairs of restaurants and chose the one they felt was better.

Task Completion Time

A Shapiro-Wilk test evaluated the normality of task completion time (in seconds) for the good-good pair group (p=0.09) and the good-bad pair group (p=0.43). Since p > 0.05 in both cases, the data can be treated as normally distributed.

We ran repeated-measures ANOVAs on task completion time for the decision making task, covering the six combinations of reading approach and restaurant pairing. We found that restaurant pairing had a significant effect on task completion time (F1,14=52.465, p<0.001), but review reading technique had no significant effect.

We ran post-hoc one-way ANOVAs for the good-good and good-bad restaurant groupings. The review reading technique had a significant effect on task completion time in good-good pairings (F2,42=3.157, p=0.05), but not in good-bad pairings (F2,42=0.253, p=0.78). Figure 5(a) shows that participants spent less time on the decision making task using CL than with the other two techniques when presented with good-good pairings. Figure 5(b) shows no significant difference in task completion time among the three techniques for good-bad pairings.
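The analysis pipeline above (normality check followed by a one-way ANOVA per grouping) can be sketched with `scipy.stats`. The timing data below is synthetic, generated only to illustrate the calls; it is not the study's data, and the group means are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical task completion times (seconds) for 15 participants per technique
rr = rng.normal(300, 60, 15)  # Normal Review Reading
rw = rng.normal(260, 60, 15)  # Normal Random Word Cloud
cl = rng.normal(210, 60, 15)  # Clustered Layout Word Cloud

# Shapiro-Wilk per condition: p > 0.05 gives no evidence against normality
p_norms = [stats.shapiro(x)[1] for x in (rr, rw, cl)]

# One-way ANOVA across the three reading techniques
f_stat, p_anova = stats.f_oneway(rr, rw, cl)
```

Note the paper's first analysis is a repeated-measures ANOVA (same participants across conditions); `f_oneway` matches only the post-hoc between-group comparison, and a repeated-measures version would need e.g. `statsmodels`' `AnovaRM`.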

Error Rate

In the decision making task, the error rate is calculated over good-bad pairings, taking the good restaurant of each pair as the correct answer. The error rates were: RR (0%), RW (6.7%, 1 error in 15 trials), and CL (6.7%, 1 error in 15 trials).

Preferential Ranking

We asked participants to provide an overall preference ranking of the reading approaches, 1 being most preferred and 3 least preferred. The rankings were analyzed with a Friedman test for differences in median rank across the three techniques. The test showed a significant difference among the three techniques (χ2(2, N=15)=6.533, p=0.04). Follow-up pairwise Wilcoxon tests found that RR had a significantly lower preference ranking than CL (p=0.05) and RW (p=0.04). There was no significant difference in preference between CL and RW. Descriptive results are shown in Figure 8 (a).
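The ranking analysis (Friedman test plus pairwise Wilcoxon follow-ups) can be sketched as follows. The rank matrix is made up for illustration; each row is one hypothetical participant's ranking of the three techniques, not the study data.

```python
from scipy import stats

# Hypothetical preference ranks (1 = most preferred) for (RR, RW, CL)
ranks = [
    (3, 2, 1), (3, 1, 2), (2, 3, 1), (3, 2, 1), (3, 2, 1),
    (2, 3, 1), (3, 1, 2), (3, 2, 1), (1, 3, 2), (3, 2, 1),
]
rr, rw, cl = zip(*ranks)

# Friedman test: difference in median rank across the three techniques
chi2, p = stats.friedmanchisquare(rr, rw, cl)

# Follow-up pairwise Wilcoxon signed-rank tests
_, p_rr_cl = stats.wilcoxon(rr, cl)
_, p_rr_rw = stats.wilcoxon(rr, rw)
```

A Bonferroni or similar correction on the pairwise p-values would be usual; the paper does not state whether one was applied.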

Feature Finding Task Results

In this task, we divided the participants into two groups: one group had a two-minute time limit and the other did not. We used six restaurants for this task. Restaurants #1, #3, and #5 were used for food feature finding, and restaurants #2, #4, and #6 for non-food feature


finding. The goal of this task was to replicate situations similar to daily life events and to collect users' feedback.

Task Completion Time

We ran a one-way ANOVA on task completion time for the participants without a time limit. The time data also passed the Shapiro-Wilk normality test (p=0.43). We found no significant effect of reading approach on task completion time (F2,42=0.253, p=0.78), as shown in Figure 5 (c).

Preferential Ranking

We again asked participants to rank the reading approaches by overall preference, 1 being most preferred and 3 least preferred, and analyzed the rankings with a Friedman test for differences in median rank. The test again showed a significant difference among the three techniques (χ2(2, N=15)=8.133, p=0.02). Follow-up pairwise Wilcoxon tests found that CL had a significantly higher preference ranking than RR (p=0.02) and RW (p=0.005). There was no significant difference in preference between RR and RW, as shown in Figure 8 (b).

User Satisfaction Level Without Time Limit

The TLX-based Likert-style questionnaire was used to collect feedback on the three review reading techniques from participants who did not have a time limit imposed during the feature finding task. A Friedman test was conducted on the questionnaire scores; as shown in Figure 6, it found no significant difference in any of the responses.

User Satisfaction Level with 2-Minute Time Limit

The same TLX-based Likert-style questionnaire was used to collect feedback from participants who did have a time limit (2 min) imposed during the feature finding task. A Friedman test on the questionnaire scores showed significant differences in mental demand (χ2(2, N=8)=11.826, p=0.003), physical demand (χ2(2, N=8)=6.5, p=0.04), temporal demand (χ2(2, N=8)=10.129, p=0.006), and effort (χ2(2, N=8)=7.00, p=0.03) (Figure 7). Follow-up pairwise Wilcoxon tests showed: for mental demand, CL is significantly lower than RR (p=0.02) and RW (p=0.03); for physical demand, CL is significantly lower than RR (p=0.04); for temporal demand, CL is significantly lower than RR (p=0.01); for effort, RR is significantly higher than RW (p=0.04) and CL (p=0.02).

Qualitative Analysis

Semantic Information Retrieval

Based on user feedback, we found that the semantic information encoded in the layout by natural language processing improved user performance in decision making.

It is an easy way to navigate through several reviews that use similar terminology to pinpoint specific aspects of a restaurant. It is easier to find out what I'm looking for. (Subject 14)

Figure 6. TLX-based Likert-style questionnaire results without time limit (where lower is better)

Figure 7. TLX-based Likert-style questionnaire results with 2-minute time limit (where lower is better)

Finding the keywords for 'service' or 'sandwich' was made easiest with the color coded clouds. It was easy to pick out the keywords that I needed to look at to make my decisions. (Subject 9)

The visual aspects of font size and color also had a positive impact on users' review reading process with the clustered layout word cloud.

The size of the words also made it easier to know what was important/more used in the reviews. (Subject 7)

It makes it much easier to look for keywords that help when deciding on which option to pick. It's well organized and groups similar words and distinguishes them by color. The black and white words with no grouping make it difficult to find tags. (Subject 15)

Keyword Query by Interaction

User feedback indicated that the clickable interaction for querying keywords in review content was useful.

From these three, the context preserving [cloud] saves my time since I can click on the things I am interested in and quickly see them highlighted in the reviews. (Subject 2)

I still liked the context preserving tag cloud over all the other techniques because it helped me find better keywords to search so I could read more details in the actual reviews. (Subject 9)

Natural Language Processing

Figure 8. (a) Preference ranking in Decision Making Task; (b) Preference ranking in Feature Finding Task

Feedback on factors attributed to our natural language processing was mixed. Some users felt that not all the necessary word tags were present in the word cloud generated by natural language processing.

I would have ranked the tag clouds higher, but I was unable to finish one of the tasks because there were no tags regarding the quality of service at a restaurant. Normally, I liked the context preserving tag cloud more, but I got the impression that fewer tags were included. I liked the clustering, but sometimes couldn't tell why terms were included in specific clusters. (Subject 1)

Some people want to assess the personality of individual reviewers from their complete reviews, and the word cloud generated by our natural language processing did not provide such information about individual reviewers' personalities.

I preferred reading full reviews because I felt like I could better understand the personality and interests of the reviewers, which factors a great deal into the way I interpret the quality and reliability of the review. (Subject 8)

DISCUSSION

Based on the results of our user study, both of our hypotheses are supported.

In our decision making task, we found that the clustered layout word cloud had a significantly lower task completion time than the normal random layout word cloud for good-good pairs of restaurants (p=0.05). Good-bad pairs are easy to distinguish with all three techniques. However, because good-good pairs have similarly high overall ratings, users need more contextual information to support their decision; in other words, they must spend more time finding evidence. In this situation, the clustered layout word cloud with semantic information can offer more context to users and enhance their performance.

The error rate of the two word cloud layouts was the same, 6.7%, and users preferred the clustered layout word cloud over the random layout word cloud. Our natural language processing techniques and word cloud visualization approach may introduce word or content discrimination and bias; the experimental results suggest this had no significant influence on the error rate.

Thus, the experimental results support [H1] with respect to our decision making task.

In our decision making task, participants significantly preferred the Clustered Layout Word Cloud and Normal Random Word Cloud over Normal Review Reading. In our feature finding task, the preference ranking of the Clustered Layout Word Cloud was significantly higher than that of Normal Review Reading (p=0.024) and Normal Random Word Cloud (p=0.005).

In our time-limited feature finding task, the clustered layout word cloud was rated significantly lower than normal review reading in mental demand, physical demand, temporal demand, and effort, and significantly lower than the normal random word cloud in mental demand. However, there was no significant difference in the feature finding task without a time limit. We attribute this to our task design: the feature finding task requires performing a categorization and clustering process mentally, and we assumed the clustered layout word cloud would help users perform this clustering.

Based on the qualitative and quantitative user feedback from our two tasks, [H2] is fully supported.

Despite the mixed feedback on the natural language processing technique we used, the statistical analysis shows that the error rate was low. If our NLP algorithm were improved, we expect the content would be more vivid for users: all the necessary word tags would be shown, clustering and font sizes would be more accurate, and personality information could also become available in the word cloud.

CONCLUSION

In this paper, we presented a new visualization technique: the semantics-based clustered layout word cloud. We used grammatical dependency parsing to obtain a semantic graph of user generated review content. An energy-based force-directed graph layout algorithm was applied to the semantic graph to produce a layout of the content, on which we applied a word cloud visualization approach to generate the clustered layout word cloud. To improve the user experience, we added interaction functionality to the word cloud. The intention of the clustered layout word cloud is to reduce review reading time, provide more insight, and improve the user experience.

We also ran a user study to evaluate how the clustered layout word cloud affects information gain, task performance, error rate, and user satisfaction. We used the Yelp Academic Dataset and designed two types of tasks for our evaluation: a decision making task and a feature finding task. The results supported our hypotheses that the clustered layout word cloud provides better user performance in the decision making task with good-good pairs of restaurants and gives users a better satisfaction level.

In conclusion, clustered layout word clouds can provide users with an efficient and more satisfying review reading experience.

FUTURE WORK

In the future, we plan to add more information to the clustered layout word cloud, such as time-series restaurant reviews


and sentiment analysis of review information. We will also apply more sophisticated NLP techniques to process the review content, and enable a search box for finding words within the word cloud more easily. Furthermore, by tuning the NLP algorithm, we will try to expose keywords that previously did not appear in the clustered layout word cloud, providing a more customizable and interactive review reading experience for users.

REFERENCES

1. Bateman, S., Gutwin, C., and Nacenta, M. Seeing things in the clouds: the effect of visual features on tag cloud selections. In Proceedings of the nineteenth ACM conference on Hypertext and hypermedia, HT '08, ACM (Pittsburgh, PA, USA, 2008), 193–202.

2. Carenini, G., and Rizoli, L. A multimedia interface for facilitating comparisons of opinions. In IUI '09: Proceedings of the 14th international conference on Intelligent user interfaces (2009), 325–334.

3. Cui, W., Wu, Y., Liu, S., Wei, F., Zhou, M., and Qu, H. Context-preserving, dynamic word cloud visualization. IEEE Computer Graphics and Applications 30, 6 (2010), 42–53.

4. Culotta, A., and Sorensen, J. Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ACL '04, Association for Computational Linguistics (Morristown, NJ, USA, 2004), 423–es.

5. de Marneffe, M.-C., MacCartney, B., and Manning, C. D. Generating typed dependency parses from phrase structure parses. In Proceedings of the 5th International Conference on Language Resources and Evaluation (2006), 449–454.

6. Economist Opinion Cloud. http://infomous.com/site/economist/.

7. Feinberg, J. Wordle. In Beautiful Visualization. 2009, ch. 3.

8. Hart, S., and Staveland, L. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Human Mental Workload, P. A. Hancock and N. Meshkati, Eds. North Holland Press, Amsterdam, 1988.

9. Huang, J., Etzioni, O., Zettlemoyer, L., and Clark, K. RevMiner: An extractive interface for navigating reviews on a smartphone. In UIST (2012).

10. Kaser, O., and Lemire, D. Tag-cloud drawing: Algorithms for cloud visualization, 2007.

11. Koh, K., Lee, B., Kim, B., and Seo, J. ManiWordle: Providing flexible control over Wordle. IEEE Transactions on Visualization and Computer Graphics 16, 6 (Nov. 2010), 1190–1197.

12. LinLogLayout. http://code.google.com/p/linloglayout/.

13. Liu, B., Hu, M., and Cheng, J. Opinion Observer: Analyzing and comparing opinions on the web. In WWW '05: Proceedings of the 14th international conference on World Wide Web (2005), 342–351.

14. Liu, S., Zhou, M. X., Pan, S., Song, Y., Qian, W., Cai, W., and Lian, X. TIARA: Interactive, topic-based visual text summarization. ACM Transactions on Intelligent Systems and Technology (TIST) 3, 2 (2012).

15. Lohmann, S., Ziegler, J., and Tetzlaff, L. Comparison of tag cloud layouts: Task-related performance and visual exploration. In Lecture Notes in Computer Science, vol. 5726. Springer Berlin/Heidelberg, 2009, 392–404.

16. Lucene. http://lucene.apache.org/.

17. Noack, A. Energy models for graph clustering. Journal of Graph Algorithms and Applications 11, 2 (2007), 453–480.

18. Noack, A. Modularity clustering is force-directed layout. Physical Review E 79, 2 (Feb. 2009).

19. North, C. Toward measuring visualization insight. IEEE Computer Graphics and Applications 26, 3 (May 2006), 6–9.

20. OpenNLP. http://opennlp.apache.org/.

21. Paulovich, F., Toledo, F., and Telles, G. Semantic wordification of document collections. Computer Graphics Forum 31, 3pt3 (June 2012), 1145–1153.

22. Rivadeneira, A. W., Gruen, D. M., Muller, M. J., and Millen, D. R. Getting our head in the clouds: Toward evaluation studies of tagclouds. In CHI '07, ACM (San Jose, CA, USA, 2007), 995–998.

23. Schrammel, J., Leitner, M., and Tscheligi, M. Semantically structured tag clouds: An empirical evaluation of clustered presentation approaches. In CHI '09, ACM (Boston, MA, USA, 2009), 2037–2040.

24. Stanford Parser. http://nlp.stanford.edu/software.

25. Viegas, F., Wattenberg, M., and Feinberg, J. Participatory visualization with Wordle. IEEE Transactions on Visualization and Computer Graphics 15, 6 (Nov. 2009), 1137–1144.

26. Wang, M., Smith, N., and Mitamura, T. What is the Jeopardy model? A quasi-synchronous grammar for QA. In Proc. of EMNLP-CoNLL (2007).

27. Wu, Y., Wei, F., Liu, S., Au, N., Cui, W., Zhou, H., and Qu, H. OpinionSeer: Interactive visualization of hotel customer feedback. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010), 1109–1118.

28. Yatani, K., Novati, M., Trusty, A., and Truong, K. N. Review Spotlight: A user interface for summarizing user-generated reviews using adjective-noun word pairs. In Proceedings of the 2011 annual conference on Human factors in computing systems, CHI '11, ACM Press (New York, NY, USA, 2011), 1541.

29. Yelp Academic Dataset. http://www.yelp.com/academic_dataset.
