
Towards a theory of tables


International Journal of Document Analysis (2006) 8(2): 123–131
DOI 10.1007/s10032-006-0016-y

ORIGINAL PAPER

Matthew Hurst

Towards a theory of tables

Received: 1 June 2005 / Accepted: 6 January 2006 / Published online: 24 March 2006
© Springer-Verlag 2006

Abstract Tables appearing in natural language documents provide a compact method for presenting relational information in an immediate and intuitive manner, while simultaneously organizing and indexing that information. Despite their ubiquity and obvious utility, tables have not received the same level of formal characterization enjoyed by sentential text. Rather, they are modeled in terms of geometry, simple hierarchies of strings and database-like relational structures. Tables have been the focus of a large volume of research in the document image analysis field and, lately, have received particular attention from researchers interested in extracting information from non-trivial elements of web pages. This paper provides a framework for representing tables at both the semantic and structural levels. It presents a representation of the indexing structures present in tables and the relationship between these structures and the underlying categories.

Keywords Table understanding · Information extraction

1 Table processing systems

The recent increase in research studying automatic systems dealing with the location, recognition and understanding of tables is a consequence of the ubiquitous information overload that computational linguists and language engineers are now enjoying [5, 9, 10, 11, 13, 15]. The web demonstrates the utility of tables as a direct and compact document element for presenting anything from simple statistics (e.g. vote counts) and the formatted results of online database queries, to complex relations indicating the results of detailed scientific inquiry or the organization of complex conceptual hierarchies (e.g. the classification of meteorite types and their observable features). In the more traditional media, tables have a history that pre-dates that of sentential text.

M. Hurst, Nielsen BuzzMetrics, 56 West 22nd Street, 3rd Floor, New York, NY 10010, USA

Research into the design and implementation of systems capable of capturing and manipulating tables has a significant, and ongoing, pre-web history in the document image processing community (e.g. [12]). Document image understanding tackles the problem of converting bitmaps of scanned documents into computer readable formats that may then be searched, classified, clustered and so on. In addition, there is a large body of work on the editing and formatting of tables [2, 19, 20], as well as investigations into the psychological aspects of information presentation in tabular formats [3, 21].

The more immediate history of table processing has witnessed inquiry into the problem of getting at the content of the table in a meaningful manner by providing an understanding of the structure of the table and the manner in which different elements interact to produce a meaningful ‘reading’ [5, 7].

Both threads of research (that dealing with the reconstruction of tables from low-level input such as document images and plain text documents, and that dealing with the interpretation and manipulation of abstract representations of tables) have proposed and exploited characterizations of the table suitable to the various tasks. These characterizations have been couched at a variety of formal levels. In other cases, the work reported has relied on our intuitive understanding of what a ‘table’ is.

Document image understanding generally describes the table in terms of lines and unanalysed text blocks [8]. At a slightly higher level, the table is viewed as either a set of cells located on a two-dimensional grid [12], or as a structure laid out in an XY tree [1]. At the highest level [19], a table is modeled as a set of categories (of strings) which are combined in some way to provide a mapping to a data domain.

All of these approaches are useful for systems dealing with the appropriate level of analysis. However, tables have yet to be analysed in a manner similar to natural language text. The basic content element of these models is the string, and the semantic differences that distinguish the table fragments in Fig. 1, for example, cannot be represented or analysed.


Fig. 1 String-based theories cannot distinguish semantic variation, nor are they sensitive to the interaction between the syntax of the content and the structure of the table

This paper presents a theory of tables that views the table as a presentation of a set of relations holding between organized hierarchical concepts (categories). It then proceeds to indicate the relationships between this characterization and the strings (viewed as syntactic natural language components) that appear in the table, demonstrating the interaction between the syntactic structure of the string content of the table, the structure of the table and the manner in which the strings may represent elements of categories, category structure and relations between categories. The theory also provides an understanding of the cause and form of the various types of ambiguity that the table presents to automatic systems, due to the limitations of the two-dimensional grid that the table is expressed on and the overloading of table structure.

The motive is to provide a conceptualization couched in basic terms that will assist research into table processing systems by delivering the following:

– a framework for representing categories, an indication of the different types of categories found in the table and an understanding of the relationship between the syntax of the strings in the cells and the logical structures that they represent.

– an understanding of the relations that the table describes.

– a pragmatic approach to deriving the categories of a table based on its structure that may then be mapped to semantic objects.

In more general terms, a theory will be of use due to the following considerations. A theory allows us to characterize tables into classes in a principled manner and to evaluate methods tackling table processing tasks in terms of their ability to deal with different classes of tables; to understand the distribution of table classes in order to tune systems to give the best performance; and to understand the significance of certain features of tables which may be exploited and evaluated by machine learning methods.

What makes the table a unique type of document element, distinct from the logical structure of the document (a hierarchy of sections, paragraphs and sentences) and the syntactic structure of the sentential text, is the manner in which its structure is overloaded. The table uses its simple two-dimensional geometry to denote both the organization of its terms and the relations that hold between those terms.

The key observation is that the basic elements of the table at the relational level are nodes in a hierarchy. Their expression often requires the expression in some form of that hierarchy. This is in contrast to the sentence, whose basic elements are lexical items. The challenge of understanding the table is to recognize the categories and to understand the relationships holding between them. Both of these meaningful components are expressed via a common set of simple mechanisms based on the geometry of the table.

2 Categories, relations and the interpretation of tables

A table’s main utility is the expression of one or more relationships that hold between two or more sets of concepts. To provide a context for our discussion, we first define the basic relation.¹

Definition 1 A relation, R, defined over a set of sets, S = {S₀, …, Sₙ}, R : S₀ × ⋯ × Sₙ, is a subset of their Cartesian product. The elements of the relation are called tuples.

A relation may be represented by a set containing members of the Cartesian product for which it holds, or by a function mapping the Cartesian product to the boolean domain. For example, if set A is {a, b} and set B is {x, y}, the Cartesian product is the set of tuples {〈a, x〉, 〈a, y〉, 〈b, x〉, 〈b, y〉} and a relation R = {t₀, t₁} is {〈a, y〉, 〈b, y〉}, which may be implemented as a function that maps each element from the Cartesian product to the boolean domain {true, false}, i.e. f_R : A × B → {true, false}.
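The two representations of a relation can be sketched in a few lines of Python. This is only an illustration of the example above: the sets A and B and the relation R come from the text, while the names `cartesian` and `f_R` are chosen here for convenience.

```python
from itertools import product

A = {"a", "b"}
B = {"x", "y"}

# The Cartesian product A x B as a set of tuples.
cartesian = set(product(A, B))

# The relation R from the text, stored as the set of tuples for which it holds.
R = {("a", "y"), ("b", "y")}


def f_R(t):
    """Characteristic function of R: maps a tuple of A x B to True or False."""
    return t in R


# A relation is a subset of the Cartesian product.
assert R <= cartesian
```

Both forms carry the same information: the set form lists the tuples, while the function form answers membership queries over the whole product.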

The sets over which the various relations in a table are defined we term categories (they are sometimes referred to as domains in the table processing and database literature), which are sets of concepts organized to a greater or lesser extent according to some hierarchical principle. Providing an interpretation of a table means that we must define the set of categories described by, present in or implied by the table, and the set of relations defined for combinations of those categories.

Definition 2 For a table, T, an interpretation, [[T]], is a tuple 〈S, R〉, where S is an interpretation of the set of categories described by the table and R is a representation of the set of relations described by the table.

Clearly this definition is incomplete. We need to define how we map the syntactic description of T to the semantic description 〈S, R〉. This may be done by providing the syntactic apparatus for describing the table then showing how we can map its basic elements, and compositions of its basic elements, to the universe of semantic objects, sets and logical statements and relations.

¹ Following the definition in [19].

However, before we can do that, it is important to understand the differences between the syntactic–semantic divide in sentential text and that in tables. From there, we will motivate and discuss certain types of categories and relations with a view to developing a pragmatic model of tables that is useful for everyday interpretive and generative systems yet contains all the elements required for a complete model-theoretic approach.

3 The expression of categories and the interaction between linguistic syntax and category structure

When considering the manner in which language is used to express that an object or concept is an instance of a (more general) concept, a member of a class of objects or represents a sub-category of a category, we can observe the following patterns in sentential text.

syntactic: ‘100 dollars’, ‘$ 100’, ‘the city of New York’, ‘New York City’, ‘New York, the City’.

semantic: ‘New York is a City’, ‘100 is a quantity of American Dollars’.

contextual: ‘How many dollars?’ – ‘100.’, ‘100, are you crazy?’

We use the term syntax to describe the structure of linguistic objects. To distinguish this from the structure of tables, we don’t refer to the syntax of the table, but to the structure of the table. The structure of the table will be defined in detail in Sect. 5; for now we introduce the structure in the table as being some sort of path connecting the cells which indicates the order in which they are read. If two cells, c_i, c_j, are adjacent on that path, then we indicate this using the notation c_i → c_j. The arrow indicates the order in which the cells are read. Siblings, children of a common parent, may be indicated by set notation. For convenience, we use the textual content of a cell to denote the cell. For example, this notation may be used to describe the first table in Fig. 1 in the following manner: {‘Name’ → ‘Fred’, ‘Name’ → ‘Isaac’, ‘Colour’ → ‘brown’, ‘Colour’ → ‘blue’, ‘Fred’ → ‘brown’, ‘Isaac’ → ‘blue’}.²

Now we turn to the table, and look at how we might express membership of a set. Below we describe a number of ways in which set membership can be described within a table, followed by discussion.

structural: ‘City’ → ‘New York’, ‘Amounts in Dollars’ → ‘100’.

syntactic: ‘$ 100’.

structural, syntactic: {‘$500’, …, ‘100’, …}, ‘$’ → ‘100’, ‘The City of’ → ‘New York’, ‘The number of students taking’ → ‘Computer Science’.

contextual: ‘All Amounts in Dollars’, ‘100’ (implied by table, e.g. a reference to ‘America’ and to the use of ‘local currency’; or with the document explicitly stating the use of dollars in a phrase talking about, but external to, the table).

² The cell ‘Colour’ is in the head of the table, ‘Name’, ‘Fred’ and ‘Isaac’ are in the stub and the remainder are in the table matrix. The position of ‘Name’ in the abstract table is given by the relative co-ordinates of the top-left and bottom-right corners of the cell and is (0, 0, 0, 0). That for ‘blue’ is (1, 2, 1, 2).

The methods available to the table present us with an ambiguity. In some cases the structure is used to express the existence of some form of logical relationship between elements (for example, a type of relationship in ‘City’ → ‘New York’). In other cases, the structure of the table reflects the syntactic structure we might find in sentential text (for example, ‘$’ → ‘100’, ‘The number of students taking’ → ‘Computer Science’) or omits syntactic material due to sibling-hood ({‘$500’, …, ‘100’}).

If we consider the purely structural expressions, we can see that they are of the form C_i → C_j, where C_i and C_j are expressions which may be interpreted as definitions of sets of concepts, that there is a logical relationship between C_i and C_j, and that C_j is a subset of C_i. A simple example is the ‘type of’ relationship. For example, the structure ‘Animal’ → ‘Swine’ may be interpreted as [[‘Animal’]] = {x : animal(x)}, [[‘Swine’]] = {x : swine(x)} and the relationship between the two sets is ‘type of’. We call such a relationship a category constraint.

In order to capture this hierarchical structure, we first define the abstract category, which represents the nodes in the hierarchy. Note that the examples above represent trivial hierarchical structures of unit depth.

Definition 3 An abstract category, C, is a tuple 〈c, h, S〉 where c is an expression defining a set of objects, S is a set of categories (the sub-categories) and h is an expression of the relationship between C and each member of S.

This entirely abstract characterization can be used to define certain types of categories with varying formality. Uschold and Gruninger [18] suggested four levels of formality when describing ontological systems: highly informal (natural language), semi-informal (restricted and structured form of natural language), semi-formal (a formally defined language) and rigorously formal. In addition, we can introduce certain modes. An arbitrary category is any type of category for which h is undefined. An anonymous category is one for which both c and h are undefined.

A purely informal category might be expressed with natural language: ‘A type of organization of’ ‘the set’ ‘of animals’. This may be captured by what we will call a literal category.

Definition 4 A literal category, L, is a tuple 〈R, h, S〉, where R is a set of strings, S is a set of sub-categories and h is a string indicating the nature of the relationship between L and its sub-categories S.

Fig. 2 WordNet definition for eukaryotes

For example, the type of organization of animals may beexpressed as a literal category as follows.

〈‘animals’, ‘type of’, {〈‘swine’,‘type of’, . . .〉, . . .}〉

We will motivate and discuss this type of category in the course of this paper.
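As a rough illustration, the literal category tuple 〈R, h, S〉 can be held in a small recursive data structure. The following Python sketch is an assumption about one convenient encoding, not part of the theory itself: the class name `LiteralCategory`, its field names and the helper `depth` are hypothetical, and the sub-categories of ‘swine’, which the example above leaves elided, are simply left empty.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class LiteralCategory:
    strings: Set[str]                  # R: the strings expressing the category
    relation: Optional[str] = None     # h: None models an undefined relationship
    subcategories: List["LiteralCategory"] = field(default_factory=list)  # S


# The 'animals' example; the elided sub-categories of 'swine' are left empty.
swine = LiteralCategory({"swine"}, "type of")
animals = LiteralCategory({"animals"}, "type of", [swine])


def depth(category):
    """Number of levels of sub-categories below this category."""
    if not category.subcategories:
        return 0
    return 1 + max(depth(sub) for sub in category.subcategories)
```

Under this encoding, a category whose `relation` is `None` corresponds to the arbitrary mode described earlier, in which h is undefined.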

Perhaps the most commonly considered categories are what we might term ontological categories, which include ‘type of’ and ‘part of’ hierarchies. An example is the organization of life forms into certain classes. The Tree of Life project [14] is a perfect demonstration of this type of hierarchy, starting off with the division of ‘life form’ into sub-classes {‘Eubacteria’, ‘Archaea’, ‘Eukaryotes’}. Definitions of the manner in which the category was partitioned are provided informally, such as the following:

Eukaryotes are usually distinguished from other forms of life by the presence of nuclei and the presence of a cytoskeleton. The nuclei contain genetic information which is organized into discrete chromosomes and contained within a membrane-bounded compartment. The word ‘eukaryote’ means ‘true nuclei’.

WordNet [6] presents a similar organization of concepts with additional linguistic features. The entry for eukaryotes is shown in Fig. 2.

The organization of objects in this manner is somehow intuitive. There appears to be some form of commonality in the manner in which subsequent partitions of categories into sub-categories are carried out. This may be contrasted with less intuitive sequences of sub-categories such as the category in Fig. 3.

What distinguishes the ‘intuitive’ forms of hierarchy from arbitrary categories?

The types of features inspected to form the life-form hierarchy are in some way consistent throughout the hierarchy, whereas the types of features inspected in the animal hierarchy are not. One way we might indicate the type of relationship between a category and its sub-categories is in terms of the types of features used to partition the set of objects in the category into the set of sub-categories.

First, we define the logical category as follows.

Definition 5 A logical category, C, is a tuple 〈e, h, S〉 where e is a logical expression defining a set of semantic objects, S is a set of logical categories and h is an expression of the type of relationship between C and each member of S.

We can then introduce a special case of logical category by defining h according to the type of features used to partition e into the sub-categories S. This allows us to distinguish what we intuitively consider to be arbitrary categories from those which have a discernible type. Of course, arguments always exist which would ensure any collection of features belong together, so we term the division of features into types a view. It is the definition of the view that carries the obligation of representing some notion of ‘features of the same type’. What we aim to do here is provide a meaningful representation of this mechanism.

A view of a set of objects O is defined in the following manner (assuming some knowledge representation modeling the universe of things).

Definition 6

– The feature set of O, F_O, is the set of features used to describe the objects in O.

– If T is a set of features, a feature, f, is of type T if f ∈ T. T defines a feature type.

– A view of O is a set of feature types V such that ⋃_{T ∈ V} T = F_O and ⋂_{T ∈ V} T = ∅.
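The two conditions on a view can be checked mechanically. The sketch below follows the definition as stated (the union of the feature types equals the feature set, and the intersection of all the feature types is empty); the function name `is_view` and the feature names `sex`, `habitat` and `diet` are hypothetical and serve only as an illustration.

```python
def is_view(feature_types, feature_set):
    """Check the two conditions on a view given in Definition 6.

    feature_types: an iterable of sets of features (the candidate view V).
    feature_set:   the set of features F_O describing the objects in O.
    """
    types = [set(t) for t in feature_types]
    if not types:
        return False
    covers = set.union(*types) == set(feature_set)   # union of V equals F_O
    disjoint = not set.intersection(*types)          # intersection of V is empty
    return covers and disjoint


# Hypothetical feature set for a collection of animals.
F_A = {"sex", "habitat", "diet"}
```

For instance, `is_view([{"sex"}, {"habitat", "diet"}], F_A)` holds, whereas a candidate that leaves `habitat` uncovered does not.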

The basic idea is to specify the type of feature that is used to make the partition of objects. This specification then allows us to characterize the definition of sub-categories as those objects in the category that have specific values for specific features.

For example, if A is the set of animals and T ⊂ F_A is the set {sex} we can write:

〈A, T, {〈x : value_of(sex, x, male), _, {}〉, 〈x : value_of(sex, x, female), _, {}〉}〉.

(Note that the leaf categories have no sub-categories and so the tuple contains empty positions denoted by _ and {}.)

Using this mechanism to describe the expressions used to partition the category in terms of feature types provides a description of the relationship between a category and its sub-categories.

Fig. 3 Heterogeneous relationships


Fig. 4 Sample tables: a [16] p. 19; b Newsweek

Categories define successive subsets. These subsets may be represented by the appropriate composition of the expressions used to define the hierarchy of categories. For example, {x : x ∈ A ∧ value_of(sex, x, male)}. We can also state that for any sub-category of a category, the set defined by that sub-category is a subset of the category, and that no sub-categories of a category intersect. Naturally, we use similar compositions to define the set of objects formed by the interaction between two categories that are not hierarchically associated, such as the set formed by the intersection if animals and male objects were presented as discrete categories (as perhaps they should be).
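The composition of expressions described above can be sketched as a filter over a set of objects. In this minimal Python illustration the animal records and the helper `sub_category` are invented for the example; only the predicate name `value_of` and the `sex` feature with its `male` and `female` values come from the text.

```python
# Hypothetical objects of the category A (animals), each described by features.
animals = [
    {"name": "boar", "sex": "male"},
    {"name": "sow", "sex": "female"},
    {"name": "ram", "sex": "male"},
]


def value_of(feature, obj, value):
    """True if the object has the given value for the given feature."""
    return obj.get(feature) == value


def sub_category(category, feature, value):
    """{x : x in category and value_of(feature, x, value)}"""
    return [x for x in category if value_of(feature, x, value)]


males = sub_category(animals, "sex", "male")
females = sub_category(animals, "sex", "female")

male_names = {x["name"] for x in males}
female_names = {x["name"] for x in females}
all_names = {x["name"] for x in animals}
```

Each sub-category is a subset of its parent and the two sub-categories do not intersect, which mirrors the two properties stated in the paragraph above.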

In later discussion, we will demonstrate the relationship between the table’s structure and the set of categories that may be driven out by analyzing dependencies in the data. These so-called table categories offer a compromise between the overly general literal category and the restrictive and elusive form of the logical category.

4 Types of relation expressed in the table

We can distinguish two types of category. An access category is one used to index the central information presented in the table. For example, the category indicating the set of people by their names in the first row of tables in Fig. 1 is an access category. The categories that are indexed we term data categories. The set of eye colours in the same table is a data category. There is a clear relationship between these two types of categories and what we call the functional description of the table (to be introduced later), which distinguishes the layout of the table in a similar manner to the head, stub and matrix.

We distinguish three classes of relation that may be ex-pressed by a table.

Class 1 The first type of relation is that which takes a number of access categories and associates them with a data category.

Class 2 The second type of relation is that which is defined over a number of data categories, producing a correlation between one or more data categories and a new data category.

Class 3 The third type is that which associates elements of Class 1 and Class 2 relations into a record as suggested by the geometry of the table.

The class of a relation, R, is denoted by a superscript: R¹, R², R³, and is further illustrated in Sect. 9.³

So far, we have introduced only examples of what we might term indexed tables. An indexed table is one which presents a series of relations indexed by one or more access categories. However, relations may be presented in a tabular format in an anonymous manner, like those shown in the table in Fig. 4a. This table introduces data values found at the intersection of two categories. However, significantly, it presents those data values in meaningful groups. In this case, the relation holds for tuples grouping four data values (i.e. the rows in the table).

5 An abstract model of table layout and structure

So far we have discussed the categories and relations that go to make up a table. To complete the model we need to consider the geometry, layout and structure of the table.

We start off with an initial model of the table in the simplest abstract terms.

Definition 7 A table, T , is a set of cells C.

The basic elements of the table, the cells, have a location on the page. Abstracting the absolute location of the cell in terms of the media by which the table is presented, we can view the cells as having relative location. This can be modeled by providing a function r mapping between the set of cells and pairs of relative co-ordinates.

In common parlance, the areas of the table that are distinguished as being the structures through which the essential data presented by the table is accessed are termed the head (the uppermost area of access cells) and the stub (the leftmost area of access cells). We generalize these layout-based terms by using the functional characterizations that they suggest. Cells in the table are either part of the access structure (access cells), part of the data presented by the table (data cells) or have some other function (for example, cells containing expository material). This aspect of the table may be modeled by f, a mapping from cells to an element of the set {access, data, other}.

It is important to note that the classification of cell function is intended to represent the author’s intent (or primary intent if there are multiple uses) in presenting the data in a table. A reader of the table may derive relationships and information from the table other than that intended by the author. In such situations a different assignment of function is required to model the reader’s use of the table. For such alternate uses, one can generally imagine restructuring the table to better facilitate this novel use.

³ The interaction between Class 1 and Class 2 relations and Class 3 relations is similar to that between the tables of relational databases and the tables produced by the join operation [17].

Fig. 5 Sample table

What we refer to as the structure of the table is the lattice by which a reader navigates the cells in the table. From any cell, there is a set of cells that may be read next en route to the final data cell. We call this structure the simple table relation and it can be expressed by a relation, S, indicating which pairs of cells form the source and sink of an arc in the reading.

The complete model of the abstract geometry, layout and structure of the table is as follows.

Definition 8 A table, T, is a tuple 〈C, r, f, S〉 where C is a set of cells, r is a function mapping from cells to relative co-ordinates (r : C → (X₀, Y₀, X₁, Y₁)), f is a function mapping the domain of cells to the domain of functional descriptions (f : C → {access, data, other}) and S is a relation defining the navigation of the table (S : C × C).

The table in Fig. 5 is defined by 〈C,r,f, S〉.
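To make the tuple 〈C, r, f, S〉 concrete, the following Python sketch encodes the first table of Fig. 1 as it is described by the path notation and footnote in Sect. 3. The class name `Table` and the helper `next_cells` are hypothetical. The co-ordinates of ‘Name’ and ‘blue’ are those given in the footnote; the co-ordinates of the remaining cells are assumptions inferred from a two-column, three-row layout.

```python
from dataclasses import dataclass


@dataclass
class Table:
    C: set    # cells, denoted here by their textual content
    r: dict   # cell -> (x0, y0, x1, y1) relative co-ordinates
    f: dict   # cell -> "access" | "data" | "other"
    S: set    # (source, sink) arcs of the simple table relation


cells = {"Name", "Colour", "Fred", "Isaac", "brown", "blue"}

table = Table(
    C=cells,
    r={
        "Name": (0, 0, 0, 0),     # given in the footnote of Sect. 3
        "Colour": (1, 0, 1, 0),   # assumed from the layout
        "Fred": (0, 1, 0, 1),     # assumed from the layout
        "brown": (1, 1, 1, 1),    # assumed from the layout
        "Isaac": (0, 2, 0, 2),    # assumed from the layout
        "blue": (1, 2, 1, 2),     # given in the footnote of Sect. 3
    },
    # Head and stub cells are access cells; the matrix holds the data cells.
    f={c: ("data" if c in {"brown", "blue"} else "access") for c in cells},
    S={
        ("Name", "Fred"), ("Name", "Isaac"),
        ("Colour", "brown"), ("Colour", "blue"),
        ("Fred", "brown"), ("Isaac", "blue"),
    },
)


def next_cells(t, cell):
    """Cells that may be read immediately after `cell` according to S."""
    return sorted(sink for source, sink in t.S if source == cell)
```

Using the cell text as the cell identifier follows the convention adopted in Sect. 3; a table with repeated strings would need separate cell identifiers.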

6 Syntactic and structural table content

In Sect. 3, we looked at how the table expresses categorical information. Here we review this process and relate it to the logical/literal category distinction. A logical category may be expressed by using the following syntactic and structural devices. First we look at the possible methods for forming syntactic objects.

– we may use a string generated from the logical expression (‘Animal’).

– we may use a string generated from the relationship holding between parent and sub-categories (‘type’).

– we may use a string generated from the logical expression and the relationship holding between parent and sub-category (‘Animal type’).

– we may use a combination of these elements derived from ancestors of the specified category. This hierarchical structure may be expressed by structure in the table or by structure in the cell.

Fig. 6 The realization of categories, sub-categories and relations in a table

Noting that the category structure may be reduced into a syntactic structure, we now look at the structural expression of the logical category.

– any syntactic expression derived as an expression of some or all of the category may be mapped on to a structure in the table reflecting the syntax of the expression. Note that there may be a correlation between the syntax of the expression and the category structure.

– the structure of the category may be mapped to the structure of the table.

These mappings describe how logical categories are related to the literal categories. A logical category may be mapped to a number of literal categories and a literal category may be formed from a number of logical categories.

The task of interpreting a table must tackle the potential ambiguities that the logical/literal category distinction causes. If we have two logical categories forming a table (A, B), one will form the access category and the other the data category. The table will describe a relation holding between the two categories, R : A × B. We have discussed how the categories may be expressed in the table; however, the relation itself may also be expressed in the table.

For example, if we wish to express the following information in a table, we could construct those tables shown in Fig. 6 ($\overline{\alpha}$ is used to denote the set of natural language expressions which may be interpreted as the semantic object α and ⌈α⌉ is used to denote an example from that set).

– A is the set {a, b}.

– B is the set {x, y}.

– R is the relation {〈a, x〉, 〈b, y〉}.

⌈R⌉ may be ‘⌈B⌉ in terms of ⌈A⌉’, or simply ⌈B⌉, which is a member of $\overline{R}$. In other words, there is an ambiguity which may affect the formation of literal categories if $\overline{B} \cap \overline{R} \neq \emptyset$. A set of tables realizing these conditions is shown in Fig. 6 together with concrete examples expressing the following information.

– A is the set of boxes {a, b}, the name of a is ‘Oak’ and the name of b is ‘Ash’.

– B is the set of objects {x, y}, the name of x is ‘ball’ and the name of y is ‘cube’.

– R is the relation contains, {〈a, x〉, 〈b, y〉}.

In the logical account of the table there is no difference, as all the above tables will be expressed with the same set of constructs.

Fig. 7 A non-trivial table from [4]

The key point about literal categories, in view of their unrestricted nature, is that, when constructing literal categories in the analysis of the table, we want to construct those categories that will help us find a mapping to the logical categories, and not to blindly mimic the structure of the table.

7 Table relations: structural dependencies in the table and data driven category analysis

We need a principled way to form literal categories, otherwise we will just end up with literal categories that simply reflect the functional characterization of the table, with a head category, a stub category and a data category. A method to derive suitable literal categories is outlined below in concert with an illustrative example using the table in Fig. 7.

Definition 9 A reading path to a data cell is an ordered set of the string content of the cells encountered by the reader traversing the table either horizontally or vertically from the outside of the table to the data cell according to the simple table relation (STR), excluding the data cell.

For example, the set {‘States’, ‘q’, ‘probability’} is a reading path for the data cell ‘1.0’.

Definition 10 A reading of a data cell is the union of all thereading paths to that cell.

For the same cell, the reading would be:

{‘States’, ‘q’, ‘probability’, ‘ε’}.

Definition 11 A reading set is the set of readings for all the data cells in the table.

The reading set from the example table is as follows.

{{‘States’, ‘q’, ‘sequence’, ‘ε’},
{‘States’, ‘q’, ‘sequence’, ‘b’},
{‘States’, ‘q’, ‘probability’, ‘ε’},
{‘States’, ‘q’, ‘probability’, ‘b’},
{‘States’, ‘r’, ‘sequence’, ‘ε’},
{‘States’, ‘r’, ‘sequence’, ‘b’},
{‘States’, ‘r’, ‘probability’, ‘ε’},
{‘States’, ‘r’, ‘probability’, ‘b’}
}

Definition 12 A string which always appears with another string in a reading is said to be dependent on that string. A dependency is a tuple 〈s_i, s_j〉 indicating that s_i is dependent on s_j, and the dependency relation defines the set of string dependencies. Strings that never appear together in a reading are mutually independent.

For example, ‘States’ is always present when ‘q’ is present, so ‘q’ is dependent on ‘States’.

Definition 13 The dependency index for a dependency is a count of the number of times the dependency appears in any reading contained in the reading set.

The dependency 〈‘q’, ‘States’〉 occurs four times in the reading set so its dependency index is 4.

Definition 14 The reading index for a string is a count of the number of times the string appears in any reading contained in the reading set.

The reading index for ‘States’ is 8.
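The indices defined above can be computed directly from the reading set. The Python sketch below takes the reading set of the example table as given and reads Definition 12 as saying that s_i is dependent on s_j when every reading containing s_i also contains s_j; that reading of the definition, and the function names used here, are assumptions made for the illustration.

```python
from itertools import permutations

# The reading set of the example table, one set of strings per data cell.
reading_set = [
    {"States", "q", "sequence", "ε"},
    {"States", "q", "sequence", "b"},
    {"States", "q", "probability", "ε"},
    {"States", "q", "probability", "b"},
    {"States", "r", "sequence", "ε"},
    {"States", "r", "sequence", "b"},
    {"States", "r", "probability", "ε"},
    {"States", "r", "probability", "b"},
]


def reading_index(s):
    """Number of readings in which the string s appears."""
    return sum(s in reading for reading in reading_set)


def depends_on(s_i, s_j):
    """s_i is dependent on s_j if every reading containing s_i contains s_j."""
    readings_with_i = [reading for reading in reading_set if s_i in reading]
    return bool(readings_with_i) and all(s_j in r for r in readings_with_i)


def dependency_index(s_i, s_j):
    """Number of readings in which s_i and s_j appear together."""
    return sum(s_i in r and s_j in r for r in reading_set)


strings = set().union(*reading_set)
dependencies = {(a, b) for a, b in permutations(strings, 2) if depends_on(a, b)}
```

Under this reading of the definition, the dependency relation for the example consists of six dependencies, each pairing one of the other strings with ‘States’, and their dependency indices sum to 24.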

Definition 15 We chain dependencies to form a maximal dependency in the following way. If 〈s_i, s_j〉 holds in the dependency relation and 〈s_j, s_k〉 holds in the dependency relation then we form the maximal dependency 〈s_i, s_j, s_k〉. This is carried out exhaustively until the maximal dependency can no longer be extended.

Definition 16 The set of all maximal dependencies is themaximal dependency set.

The number of occurrences of a string in the maximal dependency set is the count of the maximal dependencies it appears in and is limited to the string’s reading index. This encodes the intuition that, in terms of semantics, a cell’s contents can’t modify the contents of cells from more than one category.

If a conflict is found when generating the maximal dependency set due to this restriction, the dependency set may be modified to effectively filter out bogus dependencies (due to the numeric constraint outlined above), thereby resulting in singular dependency sets. This action causes alterations to be made to the maximal dependency set.

Calculating the maximal dependency sets for the example requires that we make a decision about where the string ‘States’ is to go. This decision is motivated by the number of occurrences that the string represents in the current dependencies set: in total 24. In this case, the following is suggested:

{{‘States’, ‘q’}, {‘States’, ‘r’}, {‘sequence’}, {‘probability’}, {‘ε’}, {‘b’}}

Tuples in the maximal dependency set that repeat are category values, and category values which contain mutually independent strings can be formed into categories.


{〈{‘States’}, , {〈{‘q’}, , {}〉, 〈{‘r’}, , {}〉}〉,
 〈{‘’}, , {〈{‘sequence’}, , {}〉, 〈{‘probability’}, , {}〉}〉,
 〈{‘’}, , {〈{‘ε’}, , {}〉, 〈{‘b’}, , {}〉}〉}

The method outlined above provides one approach to forming literal categories given the structure and layout of a table. It cannot discover the presence of expressions indicating relations.
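The grouping of mutually independent strings into candidate categories can be illustrated on the example reading set. The sketch below is a simplification under stated assumptions: it uses a greedy grouping (a string joins the first group whose members it never co-occurs with), which is one possible strategy and not necessarily the one intended in the paper, and the names `never_together` and `group_independent` are hypothetical.

```python
# The example reading set, rebuilt as the product of its three pairs of alternatives.
READINGS = [
    {"States", state, kind, symbol}
    for state in ("q", "r")
    for kind in ("sequence", "probability")
    for symbol in ("ε", "b")
]


def never_together(readings, a, b):
    """True when a and b are mutually independent, i.e. share no reading."""
    return not any(a in r and b in r for r in readings)


def group_independent(readings):
    """Greedily group strings so that all members of a group are mutually independent."""
    groups = []
    for s in sorted(set().union(*readings)):
        for group in groups:
            if all(never_together(readings, s, member) for member in group):
                group.append(s)
                break
        else:
            groups.append([s])  # s co-occurs with some member of every existing group
    return groups


print(group_independent(READINGS))
# [['States'], ['b', 'ε'], ['probability', 'sequence'], ['q', 'r']]
```

The three two-element groups match the value sets of the three categories in the example, and ‘States’, which co-occurs with every other string, is left on its own.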

8 Interpreting abstract tables

We have now presented representational systems for the table in terms of geometry, layout and structure as well as for categories and relations. These two characterizations of the table allow us to define an interpretation of the table. Given the various ambiguities that the table presents to the interpretation process, we define two different types of interpretation based on the abstraction of the category presented earlier. These two types are based on the following general form.

Definition 17 If T is a table, an interpretation T is a mapping from 〈C, r, f, S〉 to 〈C, R〉 where C is a set of categories and R is the triple 〈R1, R2, R3〉 representing the set of Class 1, 2 and 3 relations.

Providing specifications for C allows us to define the literal interpretation

Definition 18 If T is a table, a literal interpretation [[ST]] is a mapping from 〈C, r, f, S〉 to 〈S, R〉 where S is a set of literal categories.

and the logical interpretation

Definition 19 If T is a table, a logical interpretation [[LT]] is a mapping from 〈C, r, f, S〉 to 〈L, R〉 where L is a set of logical categories.

The literal interpretation provides a pragmatic, though less refined, definition of what it means to understand the table. The logical interpretation allows us to model the process of understanding in a more precise manner.

9 An illustrative example

Consider the following set of objects.

– Y is the set of years of the Common Era.
– A is the set of American cities.
– M is the set of murder events.
– I is the set of integers.

We want to express the relation $R$ defined over $Y \times A \times I$ which holds for each tuple $\langle d, l, i \rangle$ ($d \in Y$, $l \in A$, $i \in I$) exactly when $i \equiv |\{\forall e \in M : \mathit{event\_location}(e, l) \wedge \mathit{event\_time}(e, d)\}|$, i.e. the number of murders that occurred in a certain city in a certain year.
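The defining condition of R amounts to counting events grouped by year and location. A minimal sketch follows, assuming a hypothetical `MurderEvent` record whose fields stand in for the two event predicates; the sample events are invented purely for illustration and are not real statistics.

```python
from collections import Counter
from typing import NamedTuple


class MurderEvent(NamedTuple):
    """Hypothetical record for an element of M."""
    location: str  # l, a member of A
    year: int      # d, a member of Y


def relation_r(events):
    """Return the tuples (d, l, i) where i is the number of events at location l in year d."""
    counts = Counter((e.year, e.location) for e in events)
    return {(d, l, i) for (d, l), i in counts.items()}


# Invented sample data, for illustration only.
events = [
    MurderEvent("City A", 1990),
    MurderEvent("City A", 1990),
    MurderEvent("City B", 1996),
]
print(sorted(relation_r(events)))
# [(1990, 'City A', 2), (1996, 'City B', 1)]
```

The sketch enumerates only the tuples with a non-zero count, whereas R as defined also holds for tuples with i = 0 wherever a city recorded no events in a given year.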

This information can be presented in a table such as that shown in Fig. 4b (ignoring for the moment the ‘Percent Change’). Following the method outlined in Sect. 7, we can derive the following literal categories (note that we use a summary notation here in which we elide the definition of the relationship between a category and its sub-categories). The structure is shown in Fig. 8.

The Class 1 relation expressed by these literal categories, $R^1 = \{t^1_0, t^1_1, \ldots\}$, is as follows.

{〈‘New York’, ‘1990’, ‘2,245’〉,
 . . .
 〈‘Philadelphia’, ‘1996’, ‘431’〉}

(Note that the strings are used to indicate the respective literal categories.)

Of interest is the use of the string ‘MURDERS’. There are a number of interpretations that we could derive. It may express the set of murders (M) or it may represent the relation R.

Now let’s suppose that we want to extend the table to include an indication of the change in the number of murders as a percentage of those committed in 1990. $P$, $P^2 = \{t^2_0, t^2_1, \ldots\}$, is the relation defined over R × R × R where R is the set of rational numbers. This situation is illustrated by the complete table shown in Fig. 4b and is an example of a Class 2 relation.

Finally, the Class 3 relation, indicating the percent change in murders between 1990 and 1996 for certain American cities, is as follows.

$\{\langle t^1_0, t^1_1, t^2_0 \rangle, \ldots\}$
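The way a Class 3 tuple ties two Class 1 tuples to one Class 2 tuple can be sketched as follows. Several things here are assumptions rather than statements from the paper: the Class 2 tuple is taken to be (base count, later count, percent change), the percent change is computed as 100 × (later − base) / base, Class 1 tuples hold integers rather than strings, and the figures for ‘City A’ are invented for illustration.

```python
from fractions import Fraction


def percent_change(base, later):
    """Change from base to later as a percentage of base (base must be non-zero)."""
    return Fraction(later - base, base) * 100


def class3_tuples(class1, base_year, later_year):
    """Link each city's two Class 1 tuples to the Class 2 tuple derived from them."""
    counts = {(city, year): n for city, year, n in class1}
    linked = []
    for city in sorted({city for city, _, _ in class1}):
        base = counts.get((city, base_year))
        later = counts.get((city, later_year))
        if base is None or later is None:
            continue  # the city lacks a figure for one of the two years
        # Assumed shape of a Class 2 tuple: three rational numbers.
        t2 = (Fraction(base), Fraction(later), percent_change(base, later))
        linked.append(((city, base_year, base), (city, later_year, later), t2))
    return linked


# Invented figures, for illustration only.
class1 = [("City A", 1990, 200), ("City A", 1996, 150)]
print(class3_tuples(class1, 1990, 1996)[0][2][2])  # -25
```

`Fraction` is used so that the values stay within the rational numbers, in keeping with the domain given for the Class 2 relation.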

Fig. 8 Structure of literal categories


10 Conclusion

The aim of this paper was to develop representational mechanisms capable of characterizing tables in a precise and detailed manner. In addition, we wanted to ensure that the framework developed would be of practical use, and so the strategy of providing a number of specializations of a general representation was used to increase the utility of the methods when applied to real world table processing problems.

We characterized the table in terms of geometry (relative co-ordinates), layout (functional areas: a generalization of head, stub and matrix), structure (navigation of cells), categories and relations. This characterization distinguished elements of table structure in terms of categories, which are hierarchical organizations of concepts, in terms of content syntax (indicating the potential for these two structural aspects to interact) and in terms of the relations expressed by the table.

We accounted for the strings in the table as being expressions of categories and relations, and we accounted for the structure of the table as being a reflection of the structure of the categories, the relations that the table describes and the syntax of natural language expressions denoting both.

References

1. Abu-Tarif, A.A.: Table processing and understanding. Master’s thesis, Rensselaer Polytechnic Institute, Troy, New York (1998)

2. Biggerstaff, T.J., Endres, D.M., Forman, I.R.: Table: Object oriented editing of complex structures. In: Proceedings of the International Conference on Software Engineering. IEEE Comp. Soc. (1984)

3. Cameron, J.P.: A cognitive model for tabular editing. Technical Report OSU-CISRC-6/89-TR 26, Computer and Information Science Research Center, Ohio State University (1989)

4. Charniak, E.: Statistical Language Learning. MIT Press, Cambridge, Massachusetts (1993)

5. Chen, H.-H., Tsai, S.-C., Tsai, J.-H.: Mining tables from large scale html texts. In: Proceedings of the 18th International Conference on Computational Linguistics, Saarbrücken, Germany (2000)

6. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts (1999)

7. Ferguson, D.: Parsing financial statements efficiently and accurately using C and prolog. In: Proceedings of PAP ’97 (1997)

8. Green, E., Krishnamoorthy, M.: Model-based analysis of printed tables. In: Proceedings of the International Conference on Document Analysis and Recognition, vol. 95, pp. 214–217 (1995)

9. Hu, J., Kashi, R., Lopresti, D., Wilfong, G.: A system for understanding and reformulating tables. In: Proceedings of the Fourth ICPR Workshop on Document Analysis Systems, Rio De Janeiro, Brazil (2000)

10. Hu, J., Kashi, R., Lopresti, D., Wilfong, G.: Table structure recognition and its evaluation. In: Document Recognition and Retrieval VIII, San Jose, SPIE (2001)

11. Kieninger, T.G., Strieder, B.: T-recs table recognition and validation approach. In: Proceedings of the AAAI Fall Symposium on Using Layout for the Generation, Understanding and Retrieval of Documents. AAAI (1999)

12. Laurentini, A., Viada, P.: Identifying and understanding tabular material in compound documents. In: Proceedings of the International Conference on Pattern Recognition (1992)

13. Lopresti, D., Nagy, G.: Automated table processing: An (opinionated) survey. In: Proceedings of the Third IAPR International Workshop on Graphics Recognition (GREC ’99) (1999)

14. Madison, D.R.: Tree of life. http://ag.arizona.edu/tree/phlogeny.html. Cited 1996.

15. Ng, H.T., Lim, C.Y., Koo, J.L.T.: Learning to recognize tables in free text. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 443–450. Maryland, USA (1999)

16. Swain, P.H., Davis, S.M. (eds.): Remote Sensing: The Quantitative Approach. McGraw-Hill, New York (1978)

17. Ullman, J.D.: Principles of Database and Knowledge-Base Systems. Computer Science Press, Rockville, MD (1988)

18. Uschold, M., Gruninger, M.: Ontologies: Principles, methods and applications. Knowl. Eng. Rev. 11(2) (1996)

19. Wang, X.: Tabular abstraction, editing, and formatting. Ph.D. thesis, University of Waterloo, Waterloo, Ontario, Canada (1996)

20. Wang, X., Wood, D.: A conceptual model for tables. In: Principles of Digital Document Processing, Notes in Computer Science. Springer-Verlag, Berlin Heidelberg Germany (1998)

21. Wright, P., Hull, A.J., Lickorish, A.: Psychological factors in reading tables. In: Proceedings of the 22nd International Conference on Psychology (1984)

Matthew Hurst graduated from Edinburgh University in 1992 and completed an MPhil at Cambridge in Computer Speech and Language Processing. He then worked at The University of Edinburgh on a number of projects involving text and document analysis before enrolling in the PhD programme. While studying for his PhD, he completed a European Science and Technology Fellowship in Japan. After working for IBM Research, Tokyo he moved to the United States of America to work for a number of companies with unique applications utilizing applied natural language processing and document analysis. He is currently the Director of Science and Innovation at Nielsen BuzzMetrics.