Tag-Cloud Drawing: Algorithms for Cloud Visualization

Owen Kaser, University of New Brunswick, Saint John, NB, Canada

Daniel Lemire, Université du Québec à Montréal, Montreal, QC, Canada

WWW'07: 16th International World Wide Web Conference

Introduction

Tag clouds usually use font size to show the relative importance or frequency of a tag.

A consequence is wasted white space, which is problematic on small-display devices.

Clumps of white space are not aesthetically pleasing.

The goal is to optimize the display of tag clouds and to place associated tags near one another.

EDA algorithms (min-cut placement) are used for area minimization and clustering in tag clouds.

The Knuth-Plass algorithm for text justification and a book-placement exercise considered by Skiena are also used.

Related Work

Tag clouds have been attributed to Coupland but were popularized by the Web site Flickr.

Tag clouds are commonly associated with folksonomies and social software.

Graph drawing suggests some metrics that make graphs easy to understand and pleasing to the eye.

Related Work (Cont.)

Other types of tag-cloud display:

Hassan-Montero and Herrero-Solana have proposed improving tag clouds by clustering similar tags together.

Millen et al. have proposed that users be able to dynamically remove less significant tags and to add an index to large clouds.

Bielenberg has proposed circular clouds, where the most heavily weighted tags appear closer to the center.

Dubinko et al. have proposed a model to represent tags over a time line.

Russell has proposed Cloudalicious, a tool to study the evolution of a tag cloud over time.

Jaffe et al. have integrated tag clouds inside maps for displaying tags that carry geographical information, such as pictures taken at a given location.

Related Work (Cont.)

Improvement of layout in HTML:

Hurst et al. showed that it is possible to make HTML tables more pleasing.

There is ongoing work to improve the layout of text in HTML pages using Cascading Style Sheets.

Background

Typesetting

EDA: Physical Design

Typesetting

Greedy Method

The greedy method fits as many words per line as possible, starting a new line whenever further words cannot be placed on the current line.

This approach is used by most browsers.

The greedy approach can be done on-line, without waiting for the end of the paragraph.

It is fast but can produce suboptimal solutions.
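The greedy method can be sketched in a few lines of Python. This is only an illustration: the function name `greedy_lines`, the `space` parameter, and the pixel widths are assumptions made for the example, not taken from the slides.

```python
def greedy_lines(widths, max_width, space=4):
    """Greedy line breaking: returns lines as lists of 0-based item indices.

    widths    -- item widths in pixels (hypothetical input format)
    max_width -- available line width
    space     -- assumed gap inserted between neighbouring items
    """
    lines, current, used = [], [], 0
    for i, w in enumerate(widths):
        needed = w if not current else used + space + w
        if current and needed > max_width:
            # The item does not fit: close the current line, start a new one.
            lines.append(current)
            current, needed = [], w
        current.append(i)
        used = needed
    if current:
        lines.append(current)
    return lines


print(greedy_lines([32, 45, 24, 60, 50], max_width=128))
```

Each item is examined once and never revisited, which is why the method works on-line. An item wider than `max_width` simply ends up alone on its line.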

Typesetting (Cont.)

Dynamic Programming

Knuth and Plass compute an optimal solution using dynamic programming.

The TeX system can quickly determine where to break lines and fit text onto the page.

Their total-fit algorithm minimizes the sum of the squares of each line's badness.

Typesetting (Cont.)

The total-fit algorithm can be summarized as follows, excluding hyphenation and penalties.

Label the words of a paragraph from 1 to n.

$b_{k,j}$: the badness measure resulting from a line containing words k to j, with $b_{k,j} = 0$ when k > j.

$t_j$: the minimal possible sum of squares of the line badness when the jth word ends a line, with $t_0 = 0$.

$K_j$: for j > 1, the last word of the line prior to the one ending with the jth word.

$$t_j = \min_{k < j} \left( t_k + b_{k+1,j}^2 \right), \qquad K_j = \operatorname*{argmin}_{k < j} \left( t_k + b_{k+1,j}^2 \right)$$

We can compute $t_j$ and $K_j$ for all j = 1, ..., n in $O(n^2)$ time and $O(n)$ space.
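A minimal sketch of this dynamic program follows. It reads the recurrence as $t_j = \min_{k<j} (t_k + b_{k+1,j}^2)$. The badness used here (unused width on the line, infinite for an overfull line) is an assumption chosen for illustration, as are the names `total_fit` and `space`. It also assumes every word fits on a line by itself, and it penalizes the last line like any other.

```python
import math


def total_fit(widths, max_width, space=4):
    """Return (t_n, lines) minimising the sum of squared line badness.

    Lines are lists of 1-based word indices, following the labelling 1..n.
    """
    n = len(widths)
    prefix = [0]                      # prefix[j] = total width of words 1..j
    for w in widths:
        prefix.append(prefix[-1] + w)

    def badness(k, j):
        # Assumed badness: leftover width of a line holding words k..j.
        used = prefix[j] - prefix[k - 1] + space * (j - k)
        return max_width - used if used <= max_width else math.inf

    t = [0.0] + [math.inf] * n        # t[0] = 0
    K = [0] * (n + 1)                 # K[j]: last word of the previous line
    for j in range(1, n + 1):
        for k in range(j):            # candidate line holds words k+1..j
            cost = t[k] + badness(k + 1, j) ** 2
            if cost < t[j]:
                t[j], K[j] = cost, k

    lines, j = [], n                  # follow K backwards to recover lines
    while j > 0:
        lines.append(list(range(K[j] + 1, j + 1)))
        j = K[j]
    return t[n], lines[::-1]


print(total_fit([32, 45, 24, 60, 50], max_width=128))
```

The two nested loops give the quadratic running time, and only the arrays `t`, `K`, and `prefix` are stored, so the space is linear.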

EDA: Physical Design

Electronic design automation (EDA) is the category of tools for designing and producing electronic systems.

Placement and floorplanning are two closely related stages in many physical design flows.

Mathematically, floorplanning and placement solve the same problem.

Floorplanning is often done early in the design stage and gives a “bird's-eye” view of the layout.

Placement, on the other hand, is typically done with complete knowledge of module shapes.

Recent tools blur the distinction.

EDA: Physical Design (Cont.)

Placement Approaches in EDA

Placement problems are typically NP-hard.

Approaches include force-directed placement, simulated annealing, and min-cut placement.

For speed, min-cut placement is often chosen.

Models For Cloud Optimization

Tag Clouds with Inline Text

Tag Clouds with Arbitrary Placement

Tag Relationships

Tag Clouds with Inline Text

Inline text is a paragraph (block) made exclusively of inline HTML elements such as span, font, em, b, i, strong, a and br.

Any area outside a tag but inside the tag cloud will be referred to as “white”.

In the primary view, the width and height of each tag are fixed.

The height and width of a tag can still be changed through HTML styling or CSS.

The model does not include a penalty for squeezing tags or spaces.

The model does not take symmetry or homogeneity into account.

Tag Clouds with Inline Text

Badness Measuring

k: number of tags on the line

i: tag index, ranging from 1 to k

$h_i$: height of tag i

$w_i$: width of tag i

w: width of the tag cloud

W: normal width of a white space

h: line height, $h = \max_i h_i$

The badness of a line is only a function of the set of tag dimensions $(w_i, h_i)$:

$$\left( w - \sum_{i} w_i - (k-1)\,W \right) h + \sum_{i} (h - h_i)\, w_i$$


Example 1.

We have tags on the line in (width, height) format: (32, 14), (45, 16), (24, 12).

The tag-cloud width w is 128 pixels, and the expected white-space width between tags is W = 4 pixels.

The line height is h = max{14, 16, 12} = 16.

The extra white space on the line is 128 - 32 - 45 - 24 - (2 * 4) = 19, contributing 19 * 16 = 304 to the badness.

The first and last tags have lesser heights than the second tag, and they contribute respectively 32(16 - 14) + 24(16 - 12) = 160.

The total badness is 304 + 160 = 464.
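The arithmetic of Example 1 can be checked with a short function. The name `line_badness` and its parameter names are illustrative choices; the computation follows the two contributions described in the example (white space beside the tags, plus white space above the shorter tags).

```python
def line_badness(tags, cloud_width, space):
    """Badness of one line of tags given as (width, height) pairs."""
    h = max(height for _, height in tags)              # line height
    slack = cloud_width - sum(width for width, _ in tags) - (len(tags) - 1) * space
    white_beside = slack * h                           # extra white space on the line
    white_above = sum(width * (h - height) for width, height in tags)
    return white_beside + white_above


print(line_badness([(32, 14), (45, 16), (24, 12)], cloud_width=128, space=4))  # 464
```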


In the spirit of the Knuth-Plass total-fit algorithm, we might define the overall badness of a tag cloud as the sum of the squares of the line badnesses.

(l2) Summing the squares of the line badnesses has the benefit of penalizing more heavily solutions with some very bad lines.

(l1) Merely summing the line badnesses tends to produce shorter clouds.

(l∞) Minimizing the maximum badness across all lines might generate very tall clouds.
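The three ways of aggregating line badness into a cloud badness differ only in the final reduction, as this small sketch shows. The function name `cloud_badness` and the sample line values are made up for illustration. The "l2" case is the sum of squares described above, without a square root.

```python
def cloud_badness(line_badnesses, norm="l2"):
    """Aggregate per-line badness values into one score for the cloud."""
    if norm == "l1":
        return sum(line_badnesses)
    if norm == "l2":
        return sum(b * b for b in line_badnesses)   # sum of squares
    if norm == "linf":
        return max(line_badnesses)
    raise ValueError(f"unknown norm: {norm}")


lines = [464, 100, 20]  # hypothetical per-line badness values
for norm in ("l1", "l2", "linf"):
    print(norm, cloud_badness(lines, norm))
```

With these sample values the single bad line (464) accounts for most of the l2 score, which is the penalizing effect the slide describes.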

Tag Clouds with Arbitrary Placement

Assumptions:

1. tags may be reordered and placed arbitrarily (but without overlap or rotation) in the plane;

2. tag relationships are known, and strongly related tags should be in close proximity;

3. tag-cloud width has an upper bound;

4. tag-cloud height should be small, to reduce scrolling;

5. (optional) tags may be deformed slightly (made shorter but wider, for instance), so long as tag area remains (nearly) constant;

6. (optional) large clumps of white space are bad.

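Tag Clouds with Arbitrary Placement (Cont.)

There is no analogue to a “line” of tags when arbitrary placement is allowed.

We need to sum the white area surrounding tags.

Another goal is to obtain spatial clustering of semantically related tags, measured by

$$\sum_{p,q} w(p,q)\, d(p,q) \qquad (1)$$

where w(p, q) is the strength of the relationship between tags p and q, and d(p, q) is the distance between them in the layout.

Small values indicate better clustering.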
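The clustering measure, a sum over tag pairs of relationship weight times distance, can be sketched as below. Several details are assumptions made for the example: each tag is represented by a centre point, distance is Euclidean, each unordered pair is counted once, and `clustering_cost` is a hypothetical name.

```python
import math
from itertools import combinations


def clustering_cost(centers, weight):
    """Sum of w(p, q) * d(p, q) over unordered tag pairs.

    centers -- dict mapping tag -> (x, y) centre of its box (assumed representation)
    weight  -- dict mapping frozenset({p, q}) -> relationship strength
    """
    total = 0.0
    for p, q in combinations(sorted(centers), 2):
        w = weight.get(frozenset((p, q)), 0)   # unrelated pairs contribute nothing
        total += w * math.dist(centers[p], centers[q])
    return total


centers = {"beer": (0, 0), "bottle": (3, 4), "gas": (6, 8)}
weight = {frozenset(("beer", "bottle")): 2, frozenset(("beer", "gas")): 1}
print(clustering_cost(centers, weight))  # 2*5 + 1*10 = 20.0
```

Moving strongly related tags closer together lowers the score, so smaller values mean tighter clustering.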

Tag Relationships

One method of determining tag relationships counts co-occurrences, that is, how often a pair of tags has been assigned to the same resource; this gives a graph whose edges join related tags.

Another view is that each resource corresponds to a hyperedge in a hypergraph, whose members are the tags assigned to that resource.

For instance, the hyperedge {bottle, gas, beer} in the hypergraph view corresponds to the edges { (bottle, gas), (bottle, beer), (gas, beer) } in the graph view.

We use a graph rather than a hypergraph.
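Turning hyperedges (the tag set of each resource) into weighted graph edges amounts to counting co-occurring pairs. The sketch below assumes each resource is given as a set of tag strings; `cooccurrence_graph` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(resources):
    """Expand each resource's tag set (a hyperedge) into pairwise edges.

    Returns a Counter mapping (tag_a, tag_b), with tag_a < tag_b, to the
    number of resources on which both tags appear.
    """
    edges = Counter()
    for tags in resources:
        for p, q in combinations(sorted(set(tags)), 2):
            edges[(p, q)] += 1
    return edges


graph = cooccurrence_graph([{"bottle", "gas", "beer"}, {"beer", "bottle"}])
print(dict(graph))
```

A hyperedge with k tags expands into k(k-1)/2 graph edges, so heavily tagged resources contribute many edges.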


Solutions

Cloud Layout with Inline Text

Cloud Layout with Arbitrary Placement

Cloud Layout with Inline Text

Apply dynamic programming or shelf-packing.

The first breed of algorithms takes an ordered list of tags and chooses where to break lines.

1. First, design a simple greedy method: tags are added to a line until the line is full, and a new line is created when needed.

2. Then apply the Knuth-Plass algorithm, except that:

the last line is not an exception: it cannot be half empty without penalty;

if, and only if, a tag exceeds the maximal width, it is given a line of its own; no other overfull lines are allowed.

Cloud Layout with Inline Text (Cont.)

The second breed of algorithms reorders the tags in an attempt to decrease the badness (an NP-hard problem).

The problem is related to the strip packing problem (SPP).

One simple heuristic randomly shuffles the tags 10 times.

Other heuristic methods are based on approximation algorithms for SPP:

◦ NEXT FIT DECREASING HEIGHT (NFDH)

◦ FIRST FIT DECREASING HEIGHT (FFDH)

◦ FIRST FIT DECREASING HEIGHT WEIGHT (FFDHW)
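The two classical shelf heuristics can be sketched as follows. Tags are (width, height) pairs, inter-tag spacing is ignored for brevity, and the function names are illustrative. The weight-aware variant (FFDHW) is not shown because its exact rule is not spelled out here.

```python
def nfdh(tags, strip_width):
    """Next Fit Decreasing Height: only the most recent shelf stays open."""
    shelves, used = [], 0
    for w, h in sorted(tags, key=lambda t: t[1], reverse=True):
        if not shelves or used + w > strip_width:
            shelves.append([])        # open a new shelf below the previous one
            used = 0
        shelves[-1].append((w, h))
        used += w
    return shelves


def ffdh(tags, strip_width):
    """First Fit Decreasing Height: place on the first shelf with room."""
    shelves, used = [], []
    for w, h in sorted(tags, key=lambda t: t[1], reverse=True):
        for i, u in enumerate(used):
            if u + w <= strip_width:
                shelves[i].append((w, h))
                used[i] += w
                break
        else:                         # no existing shelf can take the tag
            shelves.append([(w, h)])
            used.append(w)
    return shelves


def total_height(shelves):
    # Tags arrive in decreasing height, so the first tag on a shelf is the tallest.
    return sum(shelf[0][1] for shelf in shelves)


tags = [(60, 20), (50, 18), (50, 16), (30, 14), (20, 12)]
print(total_height(nfdh(tags, 100)), total_height(ffdh(tags, 100)))  # 52 50
```

In this example FFDH goes back to fill the gap left on the first shelf, which NFDH cannot do, and so produces a slightly shorter layout.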

Cloud Layout with Arbitrary Placement

Min-cut Placement

Min-cut placement recursively decomposes a collection of tags by bipartitioning: the collection is split into two groups, then each group is recursively split.

Ideally, the bipartition should be fairly balanced.

The cut size (the number, or perhaps the total weight, of edges/hyperedges containing tags in both groups) should be small.

Tags outside the group being split (“outside” tags) should also have an influence.

Min-cut placement can run in O(m log n) time if we use the Fiduccia-Mattheyses bipartitioning heuristic.
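The recursive decomposition can be illustrated on a tiny input. Note the substitution: this sketch uses an exhaustive search over balanced splits in place of the Fiduccia-Mattheyses heuristic, so it is only usable for a handful of tags. It also alternates cut directions and ignores the influence of outside tags. These are simplifications assumed for the example, and all names are hypothetical.

```python
from itertools import combinations


def cut_size(group_a, group_b, weight):
    """Total weight of edges with one endpoint in each group."""
    return sum(weight.get(frozenset((p, q)), 0) for p in group_a for q in group_b)


def best_bipartition(tags, weight):
    """Exhaustively find a balanced split with the smallest cut (tiny inputs only)."""
    tags = sorted(tags)
    best = None
    for left in combinations(tags, len(tags) // 2):
        right = [t for t in tags if t not in left]
        cost = cut_size(left, right, weight)
        if best is None or cost < best[0]:
            best = (cost, list(left), right)
    return best


def slicing_tree(tags, weight, direction="V"):
    """Recursively bipartition a non-empty tag list into nested (direction, a, b) tuples."""
    if len(tags) == 1:
        return tags[0]
    _, left, right = best_bipartition(tags, weight)
    nxt = "H" if direction == "V" else "V"   # assumed: alternate cut directions
    return (direction, slicing_tree(left, weight, nxt), slicing_tree(right, weight, nxt))


weight = {
    frozenset(("beer", "bottle")): 3,
    frozenset(("gas", "oil")): 2,
    frozenset(("bottle", "gas")): 1,
}
print(slicing_tree(["beer", "bottle", "gas", "oil"], weight))
```

The strongly related pairs (beer, bottle) and (gas, oil) stay together because separating them would raise the cut size.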


Cloud Layout with Arbitrary Placement (Cont.)

Slicing Floorplans

The effect of recursive bipartitioning can be represented in a slicing tree.

Cloud Layout with Arbitrary Placement (Cont.)

Nested Tables for Slicing Floorplans

Each table is either 2x1 or 1x2, depending on whether the slicing-tree node is tagged ‘H’ or ‘V’.
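A slicing tree maps naturally onto nested HTML tables. The sketch below assumes a hypothetical encoding of the tree as nested tuples `(direction, first, second)` with tag names at the leaves. It also assumes that 'H' denotes a horizontal cut (children stacked, a 2x1 table) and 'V' a vertical cut (children side by side, a 1x2 table).

```python
from html import escape


def to_nested_tables(node):
    """Render a slicing tree as nested HTML tables (leaves are tag strings)."""
    if isinstance(node, str):
        return f"<span>{escape(node)}</span>"
    direction, first, second = node
    a, b = to_nested_tables(first), to_nested_tables(second)
    if direction == "H":
        # Horizontal cut: two rows, one column.
        return f"<table><tr><td>{a}</td></tr><tr><td>{b}</td></tr></table>"
    # Vertical cut: one row, two columns.
    return f"<table><tr><td>{a}</td><td>{b}</td></tr></table>"


tree = ("V", ("H", "beer", "bottle"), "gas")
print(to_nested_tables(tree))
```

Font sizes and cell alignment are left out; a real page would set them through styles on the span and td elements.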

Cloud Layout with Arbitrary Placement (Cont.)

EDA Placement Is Not (Quite) Tag Placement

We could simply feed our tag-cloud data to an EDA placement tool, but we found it appropriate to modify the tool.

Long tags are unusual for EDA.

Tags cannot be rotated.

Tags do not need to consider wire area.

Each tag in a cluster is related to every other tag, and thus dividing them should be much more expensive.

Solution-quality levels and running-time requirements differ.

Experimental Results

Test Data

Tag Clouds with In-line Text

Tag Clouds with Arbitrary Placement

Test Data

Tags and their accompanying importance levels (0-9) were obtained from ZoomClouds and Project Gutenberg. On average, clouds had 93 tags.

ZoomClouds is a Web site that uses the Yahoo! Content Analysis API.

The experiment retrieved 65 different tag clouds and normalized the weights with a linear function.

Test data were also derived from word co-occurrences in 20 e-books produced by Project Gutenberg.

The importance i of tag T was determined as

$$i = \left\lfloor 10\, \frac{t - r}{f - r + 1} \right\rfloor$$

where f, r and t are respectively the frequencies of the most frequent tag, the least frequent retained tag, and the tag T.
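One reading of the importance formula that yields levels 0 to 9 is i = floor(10 (t - r) / (f - r + 1)). The sketch below assumes that reading, which is an interpretation rather than a confirmed transcription; `importance` is a hypothetical name.

```python
def importance(t, f, r):
    """Importance level of a tag with frequency t.

    f -- frequency of the most frequent tag
    r -- frequency of the least frequent retained tag
    Assumes r <= t <= f; integer floor division keeps the result in 0..9.
    """
    return (10 * (t - r)) // (f - r + 1)


print(importance(t=100, f=100, r=1), importance(t=50, f=100, r=1), importance(t=1, f=100, r=1))
```

Under this reading the "+ 1" in the denominator keeps the most frequent tag at level 9 rather than 10.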

Tag Clouds with In-line Text

Clouds of alphabetically sorted tags are, on average, 40% larger than clouds of weight-sorted tags.

Dynamic programming does not reduce the area of the tag clouds for weight-sorted tags, but offers a reduction of about 3% for alphabetically sorted tags.

The random-shuffling algorithm does worse than sorting by weight.

Tag Clouds with In-line Text (Cont.)

The NFDH heuristic gives about the same average tag-cloud height as does the weight-sorted greedy algorithm.

The FFDH and FFDHW heuristics offer an average reduction of about 3% in the height of the ZoomClouds tag clouds, and of 1% and 2% respectively for the Project Gutenberg tag clouds.

Tag Clouds with In-line Text (Cont.)

The (l∞) aggregate can generate unacceptably tall tag clouds (3 times taller than normal).

The difference in height between the (l1) and (l2) aggregates is well below 1%.

The most competitive algorithms are FFDH, FFDHW, and either the greedy or dynamic-programming algorithms applied to weight-sorted tags.

If the l1 norm is chosen, the FFDHW heuristic is the clear winner and dynamic programming is not worth the effort.

If the sum of squares is preferred, it is a close race.


Tag Clouds with Arbitrary Placement

Some changes to the EDA algorithms:

◦ The program was modified to perform graph bipartitioning rather than hypergraph partitioning.

◦ Fiduccia-Mattheyses heuristic.

◦ Estimating the correct amount of “padding” area is not required for tag placement.

◦ An estimate of the absolute width of a floorplan area was added.

Tag Clouds with Arbitrary Placement (Cont.)

Results

Interestingly, the 100-tag case is faster than the 50-tag case.

Floorplan sizing was only a small part of the overall time.

C-soft shows unacceptably long runtimes.

Experimental Results – Tag Clouds with Arbitrary Placement (Cont.)

compaSS produces tighter clouds than the other algorithms.

The sorted greedy heuristic used 2–19% less area than the min-cut heuristic.

With 200 tags, it is remarkable that the more sophisticated compaSS approach was not as good as the greedy heuristic.

Tag Clouds with Arbitrary Placement (Cont.)

The min-cut approach clearly (and unsurprisingly) outperformed the greedy approaches and compaSS.

compaSS is apparently better at grouping than the sorted greedy heuristic. This is counterintuitive and reveals a weakness in using Equation 1.

Conclusion

Future work should include browser-based implementations.

For in-line text, our cloud-badness model is probably incomplete since it ignores some basic symmetry issues.

Despite the differences between tag-cloud layout and EDA placement, we plan to test an industrial-strength min-cut placement tool.

The new hyphenate property might encourage the use of slightly more sophisticated line-breaking algorithms in browsers.
