
FeTaQA: Free-form Table Question Answering


Linyong Nan 1   Chiachun Hsieh 3   Ziming Mao 1   Xi Victoria Lin 2*   Neha Verma 1
Rui Zhang 4   Wojciech Kryściński 2   Nick Schoelkopf 1   Riley Kong 5   Xiangru Tang 1
Murori Mutuma 1   Ben Rosand 1   Isabel Trindade 1   Renusree Bandaru 4
Jacob Cunningham 4   Caiming Xiong 2   Dragomir Radev 1,2

1 Yale University   2 Salesforce Research   3 The University of Hong Kong
4 Penn State University   5 Archbishop Mitty High School

{linyong.nan, ziming.mao}@yale.edu, [email protected]

Abstract

Existing table question answering datasets contain abundant factual questions that primarily evaluate the query and schema comprehension capability of a system, but they fail to include questions that require complex reasoning and integration of information due to the constraint of the associated short-form answers. To address these issues and to demonstrate the full challenge of table question answering, we introduce FeTaQA, a new dataset with 10K Wikipedia-based {table, question, free-form answer, supporting table cells} instances. FeTaQA yields a more challenging table question answering setting because it requires generating free-form text answers after retrieval, inference, and integration of multiple discontinuous facts from a structured knowledge source. Unlike datasets of generative QA over text, in which answers largely consist of copies of short text spans from the source, answers in our dataset are human-generated explanations involving entities and their high-level relations. We provide two benchmark methods for the proposed task: a pipeline method based on semantic parsing-based QA systems and an end-to-end method based on large pretrained text generation models, and show that FeTaQA poses a challenge for both methods.

1 Introduction

Question Answering (QA) is the task of producing answers to natural language questions based on knowledge resources (Burke et al., 1997; Yao and Van Durme, 2014; Chen et al., 2017). One of the primary goals of QA is to allow users to directly and efficiently interact with large-scale and heterogeneous knowledge sources. In the real world, knowledge sources take a variety of forms, including unstructured texts (documents, passages, or conversations), structured knowledge bases or databases, and semi-structured tables, each requiring dedicated modeling approaches.

* Now at Facebook AI.

For QA over text, a sequence modeling approach is usually adopted to encode the query and the context, and answers are either categorical (Lai et al., 2017), extractive (Rajpurkar et al., 2016; Yang et al., 2018), or abstractive/generative (Kocisky et al., 2017; Nguyen et al., 2016; Fan et al., 2019; Kwiatkowski et al., 2019).

For table-based QA, a common approach is to apply semantic parsing to the query and the table schema to generate a logical form (e.g., a SQL-like database query) that can be executed to retrieve the answer from the relevant portion of the table (Pasupat and Liang, 2015; Iyyer et al., 2017; Zhong et al., 2017; Yu et al., 2018). The answers are facts or entities extracted from the table and are therefore usually short-form.

Though existing datasets have enabled significant progress for table QA, their limitations prevent them from reflecting the full challenge of the task. Users of QA systems tend to ask complex questions which require elaborate answers, often containing explanations, while existing datasets are limited to simple short-form answers.

To address these shortcomings we present FeTaQA, a Free-form Table Question Answering dataset which includes long, informative, and free-form answers. FeTaQA challenges QA systems with the following tasks: 1) retrieving multiple entities from tables based on the query; 2) reasoning over relations of these entities that are pertinent to the query and integrating these pieces of information into a coherent answer; 3) aggregating associated information into an explanation when the query is abstract or ambiguous; and 4) generating an informative, relevant, and faithful answer to the query. In addition, with tables sourced from Wikipedia, FeTaQA covers a diverse set of


Figure 1: Examples of FeTaQA instances. Only part of the original table is shown for better visualization. These examples are referred to as (a), (b), (c), and (d) from upper left to bottom right in the paper.

Dataset                                        Knowledge Source                Answer Format       Avg # Words in Answer
SQuAD (Rajpurkar et al., 2016)                 Wikipedia articles              Text-span           3.2
HotpotQA (Yang et al., 2018)                   Wikipedia articles              Short-form entity   2.2
NarrativeQA (Kocisky et al., 2018)             Stories, books, movie scripts   Free-form text      4.7
ELI5 (Fan et al., 2019)                        Online forum texts              Free-form text      130.6
WikiTableQuestions (Pasupat and Liang, 2015)   Wikipedia tables                Short-form entity   1.7
SequenceQA (Saha et al., 2018)                 Wikipedia tables                Short-form entity   1.2
HybridQA (Chen et al., 2020e)                  Wikipedia tables                Short-form entity   2.1
FeTaQA                                         Wikipedia tables                Free-form text      18.9

Table 1: Comparison of FeTaQA with other QA datasets.

topics and includes semi-structured tables containing un-normalized text, including numbers, dates, and phrases. FeTaQA examples are presented in Figure 1 and differences between FeTaQA and other QA datasets are described in Table 1.

We formulate generative table question answering as a Sequence-to-Sequence learning problem to evaluate the state-of-the-art models' performance on FeTaQA. We propose two benchmark methods and provide experiment results for them. The first one is an end-to-end model that integrates query and table comprehension, logical reasoning, and language generation by adapting T5 (Raffel et al., 2019). The other is a pipeline model that achieves content selection and surface realization in separate modules involving TAPAS (Herzig et al., 2020).

Through human studies, we evaluate answers generated by our proposed models as well as the reference answers based on fluency, correctness, adequacy (informativeness), and faithfulness. The results indicate the challenging nature of FeTaQA and that there is much room for improvement in QA systems. We make the dataset available online.1

1 https://github.com/Yale-LILY/FeTaQA

2 Dataset

Here we introduce FeTaQA and describe the process and criteria for collecting the tables, questions, and answers. Some statistics of FeTaQA are shown in § 2.4.

2.1 Desiderata

We frame generative table question answering as the problem of generating an answer a to a question q based on a semi-structured table T and its metadata m. Our goal was to construct a table QA dataset {(q_i, a_i, T_i, m_i) | i = 1 ... n} that includes a large number of tables on diverse topics. The tables should be intelligible, well-formed, and moderately sized to make retrieval challenging yet plausible. Each table pairs a question with an answer sentence. The question should require retrieval and reasoning over multiple sources of information in the table, and the answer should integrate both facts and inferences into a coherent sentence that answers the question. Both questions and answers should be natural and fully grounded in the context of the entire table and its metadata, such as the title.
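To make this formulation concrete, the sketch below shows one way a single FeTaQA instance (q, a, T, m) could be represented in code; the class and field names are illustrative assumptions, not the official release schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FeTaQAInstance:
    """Illustrative container for one (q, a, T, m) example (names assumed)."""
    question: str                        # q: natural-language question
    answer: str                          # a: free-form answer sentence
    table: List[List[str]]               # T: 2-D array; first row is the header
    page_title: str                      # m: Wikipedia page title
    section_title: str                   # m: table (section) title
    highlighted_cells: List[Tuple[int, int]] = field(default_factory=list)
    # (row, column) coordinates of the supporting cells (denotations)
```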


2.2 Data Collection Method

We start building the dataset by collecting data instances from ToTTo (Parikh et al., 2020), a recent large-scale Table-to-Text dataset that contains tables and table-grounded sentences obtained from a wide variety of Wikipedia pages. Additionally, ToTTo comes with annotations of the table cells that support each sentence: a sentence is supported by the cell contents if it is directly stated or can be logically inferred from them. ToTTo applied several heuristics to sample the tables and the candidate sentences from Wikipedia pages, and its annotators were asked to revise sentences and highlight the corresponding table regions so that the sentences still have the varied language and structure found in natural sentences while being grounded to the table.

Sampling examples from the ToTTo dataset was conducted in multiple steps. We first sample tables whose sizes are within 3 to 34 rows and 3 to 7 columns (up to the 75th percentile of all ToTTo table sizes) to avoid truncating the linearized table sequence for transformer-based models, whose default maximum input sequence length is 512. To ensure sentences contain several table entities, we further select tables whose annotated highlighted regions cover multiple rows. We also collect a subset of single-row highlighted regions which span multiple rows or columns in content. Following this sampling procedure, we were able to obtain 16,576 {table, metadata, highlighted region, sentence} instances with which we conduct the annotation procedure described below. The flowchart of the sampling process is found in Figure 7 of the Appendix.
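The filter below is a rough sketch of the table-size and highlighted-region criteria just described; the function name, input layout, and coordinate convention are assumptions for illustration, not ToTTo's actual field names.

```python
from typing import List, Tuple


def keep_totto_example(table: List[List[str]],
                       highlighted_cells: List[Tuple[int, int]]) -> bool:
    """Approximate the sampling criteria: moderate table size and a
    highlighted region that touches multiple rows (or columns)."""
    n_rows = len(table)
    n_cols = len(table[0]) if table else 0
    # keep tables within 3-34 rows and 3-7 columns to avoid truncating the
    # linearized input of transformer models (default max length 512)
    if not (3 <= n_rows <= 34 and 3 <= n_cols <= 7):
        return False
    rows = {r for r, _ in highlighted_cells}
    cols = {c for _, c in highlighted_cells}
    # require the highlighted region to span multiple rows, or at least
    # multiple columns when it sits within a single row
    return len(rows) > 1 or len(cols) > 1
```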

We adopted these table-grounded sentences as the answers in our new QA dataset since they are long, natural sentences containing rich information and inferences over the corresponding table. We also exploit ToTTo's annotations of table cells (the highlighted table region) as weak supervision (denotations) for training models and as labels for evaluating model retrieval competency. We parsed the tables (originally in HTML format) into a 2-dimensional array, where the first row corresponds to the table header. We also processed merged cells by copying the cell content and cell highlighted region to all the individual cells that compose the original merged cell.
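A minimal sketch of this merged-cell handling is shown below; it assumes the HTML has already been parsed into cell records with row/column indices and span attributes, which is an assumption about the preprocessing rather than the actual code organization.

```python
from typing import Dict, List, Set, Tuple


def expand_merged_cells(cells: List[Dict]) -> Tuple[List[List[str]], Set[Tuple[int, int]]]:
    """Copy each (possibly merged) cell into every grid position it covers.

    `cells` is assumed to be a list of dicts like
    {"row": r, "col": c, "rowspan": rs, "colspan": cs,
     "text": str, "highlighted": bool}.
    """
    grid: Dict[Tuple[int, int], str] = {}
    highlights: Set[Tuple[int, int]] = set()
    for cell in cells:
        for dr in range(cell.get("rowspan", 1)):
            for dc in range(cell.get("colspan", 1)):
                pos = (cell["row"] + dr, cell["col"] + dc)
                grid[pos] = cell["text"]          # copy content into each position
                if cell.get("highlighted"):
                    highlights.add(pos)           # copy the highlight as well
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    table = [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
    return table, highlights
```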

2.2.1 Question Annotation

Question annotations were collected with the help of human judges in two phases: an internal phase conducted by on-site expert annotators, and an external phase conducted by crowd workers on Amazon Mechanical Turk. To streamline the process, we built a custom web interface to visualize table HTML and metadata, augmented with web widgets that allow table region highlighting and sentence editing. A screenshot of the annotation interface is shown in Figure 8 of the Appendix.

Provided the necessary context, the annotators were asked to write a question whose answer is the provided ToTTo sentence. The annotators were given the option to modify the sentence, the table cell content, and the highlighted region to better match the associated question.

Internal Annotations In the first phase of annotation, we enrolled 15 internal annotators who were provided with preliminary guidelines. In addition to the annotation task, they were asked to provide feedback regarding the task instructions and the user experience of the website, based on which we iteratively revised the guidelines and the website design.

External Annotations For external annotations, we hired MTurk workers who have completed at least 500 HITs, have a 97% approval rate, and are from English-speaking regions. To ensure that the MTurk annotators understood our task, we provided an instruction video on using the interactive annotation tool and FAQs that clarify the annotations we desire, along with examples of good vs. bad annotations. We also created a Slack channel for crowdsourced workers to ask questions and clarify doubts.

Annotation Evaluation To ensure FeTaQA is of high quality, we evaluate crowdsourced annotations as follows. First, we auto-rejected questions that fall outside the length range (4 to 25) or convoluted questions that contain more than two interrogatives (259 examples in total). For the remaining annotations, we built another web interface for evaluation and asked internal evaluators to label an annotation as "approve", "reject", or "questionable" and score the annotation based on its fluency, faithfulness, and the extent to which the question needs the full sentence as the answer. Internal evaluators were also asked to modify the question annotations that were not approved.


The contributions to our final dataset are distributed as follows: 3,039 (30%) instances come from internal annotators and 7,291 (70%) from MTurk workers. In total, our dataset contains 10,330 instances.
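The automatic rejection step above can be illustrated with a small filter; whitespace tokenization and the particular set of interrogative words are assumptions made for this sketch, as the paper does not specify them.

```python
INTERROGATIVES = {"what", "who", "whom", "when", "where", "which", "why", "how"}  # assumed list


def auto_reject(question: str) -> bool:
    """Return True if a crowdsourced question should be auto-rejected."""
    tokens = question.lower().rstrip("?").split()
    if not 4 <= len(tokens) <= 25:       # outside the allowed length range
        return True
    n_interrogatives = sum(tok in INTERROGATIVES for tok in tokens)
    return n_interrogatives > 2          # convoluted, multi-question prompts
```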

2.3 Dataset Split

Randomly splitting the dataset may make the train, development, and test splits contain tables with similar contents (Finegan-Dollak et al., 2018; Lewis et al., 2020). Therefore, to increase the generalization challenge, we calculated the Jaccard similarity of two instances based on the set of tokens appearing in their table headers and questions, and split the dataset in such a way that models are evaluated on test split instances that are least similar to those used for training. We first sampled 800 instances randomly as a seed split. Then we add those that have Jaccard similarities greater than 0.465 to the seed split. This process generates two splits of 70% and 30% of all instances; the former becomes the train split and the latter is randomly divided with a ratio of 1:2 to form the development and test splits. This results in 7,326/1,001/2,003 instances in the train/dev/test splits, respectively.
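A sketch of this similarity-based split is given below; only the 800-instance seed and the 0.465 threshold come from the description above, while the tokenization helper and the exact procedure details are assumptions.

```python
import random
from typing import Callable, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0


def similarity_split(instances: List,
                     tokens_of: Callable[[object], Set[str]],
                     threshold: float = 0.465,
                     seed_size: int = 800) -> Tuple[List, List]:
    """Grow a train pool from a random seed by absorbing similar instances.

    `tokens_of(x)` should return the token set of x's table header and
    question (an assumed helper). Instances dissimilar to the seed form the
    held-out pool, which is later divided 1:2 into dev and test splits.
    """
    random.seed(0)
    seed = random.sample(instances, seed_size)
    seed_ids = {id(x) for x in seed}
    train, heldout = list(seed), []
    for x in instances:
        if id(x) in seed_ids:
            continue
        if any(jaccard(tokens_of(x), tokens_of(y)) > threshold for y in seed):
            train.append(x)               # similar to the seed -> train split
        else:
            heldout.append(x)             # dissimilar -> dev/test material
    return train, heldout
```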

2.4 Data Analysis and Statistics

Basic statistics of FeTaQA are shown in Table 2, and human evaluation scores and inter-evaluator agreements are reported in Table 3. A quantitative and qualitative analysis of FeTaQA shows that it contains abundant complex questions that require retrieval of multiple entities in the context, as indicated by the human evaluation score for question complexity, and that the median number of highlighted cells (denotations) is 6, which is twice as many as the corresponding number for ToTTo. These denotations are correct and adequate, as indicated by the corresponding high evaluation scores. The free-form answers have a median length of 18 tokens and are grounded to the table and the denotations, as also suggested by the high evaluation scores.

Topics Similar to ToTTo, we use the Wikimedia Foundation's topic categorization model (Asthana and Halfaker, 2018) to investigate the topic distribution of FeTaQA. Although our dataset is limited to topics present in ToTTo, we are able to sample instances that have evenly distributed topics, as shown in Figure 2. We found that most of the instances are related to biography, sports, and geographical regions. There are also abundant instances related to media, politics, and government.

Property                                        Value
Unique Tables                                   10,330
Question Length (Median/Avg)                    12 / 13.2
Answer Length (Median/Avg)                      18 / 18.9
Rows per Table (Median/Avg)                     12 / 13.8
Columns per Table (Median/Avg)                  5 / 5.9
No. of Highlighted Cells (Median/Avg)           6 / 8.0
Percentage of Cells Highlighted (Median/Avg)    10.7% / 16.2%
Page Title Length (Median/Avg)                  2 / 3.3
Section Title Length (Median/Avg)               2 / 1.9
Training Set Size                               7,326
Development Set Size                            1,001
Test Set Size                                   2,003

Table 2: FeTaQA Core Statistics.

Annotation Quality       Score ≥ 4 (%)   % Agreement   Randolph's Kappa / 95% CI
Question Complexity      52.6            0.65          0.48 / [0.41, 0.55]
Denotation Correctness   89.0            0.88          0.82 / [0.76, 0.88]
Denotation Adequacy      91.6            0.89          0.83 / [0.77, 0.89]
Answer Fluency           95.0            0.92          0.89 / [0.84, 0.94]
Answer Correctness       92.4            0.91          0.86 / [0.80, 0.92]
Answer Adequacy          90.6            0.88          0.82 / [0.76, 0.88]
Answer Faithfulness      95.6            0.93          0.89 / [0.84, 0.94]

Table 3: Human evaluation over 100 samples of FeTaQA. Five internal evaluators are asked to rate the samples on a scale of 1 to 5. We report the percentage of samples that have a score ≥ 4 to show the high quality of FeTaQA, and report percent agreement and Randolph's Kappa (Randolph, 2005) (with 95% CI) to show that our human evaluation has high inter-annotator agreement.

Figure 2: FeTaQA Topics Distribution.

Question Types FeTaQA has diverse and complex questions, as illustrated in Figure 3. A comparison of question type distributions with other table QA datasets is shown in Figure 9 of the Appendix. We found that in FeTaQA, a large percentage of what questions ask about plural entities or abstract entities such as outcome, result, margin, and percentage. In addition, there is a higher percentage of how questions that are not how many/much, compared to existing table QA datasets.


Figure 3: FeTaQA questions by top 5 most frequent starting words, where box size represents frequency.

3 Models

To quantify the challenge posed by FeTaQA for state-of-the-art models, we used two modeling approaches that have been shown to be effective on existing table question answering datasets, with some modifications made to adjust to our task. Model configurations are shown in Figure 4.

3.1 Pipeline Model

Question answering over tables is usually seen as a semantic parsing task. Based on the table schema, a semantic parser maps the question to a logical form that can be used to retrieve the result from the table. The answer is usually a single entity that is either a table cell value or the aggregation result of multiple cell values (aggregation operators include COUNT, MAX, SUM, etc.). A table semantic parser is trained using logical forms as supervised examples, but due to their high annotation cost, an alternative approach is to use denotations for weak supervision. The denotations are usually the targets of existing table QA tasks. With this approach, the parser is able to retrieve denotations directly.

However, in our task, targets are generated texts instead of retrieved denotations, suggesting that we also need a generator to integrate the retrieved information into a cogent sentence. Therefore, we propose a pipeline model with two separately trained modules, described below.

Weakly Supervised Table Semantic Parsing The first module adopts a table semantic parser that is pre-trained with weak supervision. We use TAPAS (Herzig et al., 2020), a state-of-the-art model for table QA, as a starting point. We fine-tune it on FeTaQA with our annotated denotations (highlighted table regions). We believe fine-tuning is crucial for our task because TAPAS is pre-trained on questions that require retrieval of limited denotations (a single entity or homogeneous entities that can be aggregated with a COUNT, SUM, or AVG operation), while FeTaQA questions require retrieval of multiple entities and complex aggregation operations. Details of the experiment results are provided in Section 4.3. Note that besides denotations, TAPAS also predicts an aggregation operation (chosen from COUNT, SUM, AVG, NONE) applied to the predicted denotations to obtain the final answer. However, we use NONE as the aggregation operation label for fine-tuning due to the lack of annotations, therefore leaving the inference of aggregation operations to the second module.

Data-to-Text As shown in Figure 5, we fine-tune T5 (Raffel et al., 2019) on DART (Nan et al., 2021) to obtain a Data-to-Text model as the second module of the pipeline to perform surface realization of table cells (denotations in our case). We first convert the denotation prediction into the triple-set format with the following scheme: for each table cell in the highlighted region, we generate the triple [[TABLECONTEXT], column header, cell value], where column header is the cell's corresponding column name. Similar to DART, we use [TABLECONTEXT] as a special token for converting a table cell into a triple. We then incorporate the metadata into triples by replacing column header with the field name (TABLE TITLE, PAGE TITLE) and cell value with the metadata content (table title text, page title text). We end up with a triple-set containing all highlighted table cells and the metadata (the table title and the title of the Wikipedia page that includes the table).


Figure 4: Pipeline model and End-to-End model diagrams.

We further fine-tune the Data-to-Text model on ToTTo instances so that it adapts to our formulation of triple-set inputs. To avoid exposure to FeTaQA test instances, we fine-tune with a sample of 8K ToTTo instances that are not used for creating FeTaQA.
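The denotation-to-triple conversion can be sketched as follows; the function signature and the exact serialization of the metadata field names are illustrative assumptions.

```python
from typing import Iterable, List, Tuple


def denotations_to_triples(table: List[List[str]],
                           highlighted_cells: Iterable[Tuple[int, int]],
                           page_title: str,
                           section_title: str) -> List[List[str]]:
    """Convert highlighted cells plus metadata into a DART-style triple set."""
    header = table[0]
    triples = []
    for row, col in highlighted_cells:
        # [TABLECONTEXT] is the special subject token; the predicate is the
        # column header and the object is the cell value
        triples.append(["[TABLECONTEXT]", header[col], table[row][col]])
    # metadata triples: the column-header slot is replaced by a field name
    # and the cell-value slot by the metadata content
    triples.append(["[TABLECONTEXT]", "TABLE TITLE", section_title])
    triples.append(["[TABLECONTEXT]", "PAGE TITLE", page_title])
    return triples
```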

Figure 5: Weakly supervised fine-tuning of the table semantic parser on FeTaQA. We choose a checkpoint of TAPAS-base fine-tuned on WikiTableQuestions as a starting point. After fine-tuning, the table semantic parser predicts denotations, which are then converted to triples and sent to the Data-to-Text module.

3.2 End-to-End Model

In this approach, we model the task as a sequence-to-sequence learning problem by linearizing table T appended to question q as the source sequence, and treating the free-form answer a as the target sequence. We propose a simple linearization scheme as a baseline: table rows are concatenated with [SEP] tokens in between, and cells in each row are separated by spaces. Since the input sequence length may exceed the model limit, we prepend q to the table linearization T, using [CLS] tokens as prefixes for separation. We fine-tune models from the T5 family on the FeTaQA train set. The linearization scheme is visualized in Figure 6.

Figure 6: Table linearization in end-to-end model.
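A minimal sketch of this linearization scheme is shown below; the exact placement of the [CLS] prefixes is an assumption based on the description above and Figure 6.

```python
from typing import List


def linearize(question: str, table: List[List[str]]) -> str:
    """Flatten the question and table into a single source sequence.

    Cells within a row are joined by spaces and rows are joined with [SEP];
    [CLS] tokens prefix the question and the table for separation.
    """
    rows = [" ".join(str(cell) for cell in row) for row in table]
    return "[CLS] " + question + " [CLS] " + " [SEP] ".join(rows)
```

For example, linearize("Who won in 2010?", [["Year", "Winner"], ["2010", "Spain"]]) yields "[CLS] Who won in 2010? [CLS] Year Winner [SEP] 2010 Spain".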

4 Experiments

In this section, we explain the experiment settings and report the automatic and human evaluations of the model outputs.

4.1 Experiment Setup

We first experiment with the pipeline model in a zero-shot setting, that is, without any fine-tuning on FeTaQA. We use a checkpoint of TAPAS-base that is fine-tuned on WikiTableQuestions (Pasupat and Liang, 2015) to perform table semantic parsing implicitly in order to produce a set of denotations, which is then converted to a triple-set as described in Section 3.1. We then employ a T5-large model (Raffel et al., 2019) that goes through two fine-tuning stages: in the first stage it is fine-tuned on the downstream Data-to-Text task with DART (Nan et al., 2021); in the second stage it is further fine-tuned on ToTTo instances to adapt to the triple-set formulation we proposed. We denote this setting as Pipeline - zeroshot in Table 4. Next, we experiment with the pipeline model by fine-tuning the table semantic parser on FeTaQA. We further fine-tune the TAPAS-base checkpoint (WTQ fine-tuned) on the FeTaQA train set and select models based on their performance on the development set.


We use the same Data-to-Text model as described in the zero-shot setting.

For the End-to-End model, we adapt Hugging Face's implementation (Wolf et al., 2020) of T5 (Raffel et al., 2019) for our task. We use a standard T5 tokenizer with additional [CLS] and [SEP] tokens, and the model vocabulary is resized accordingly. Since we expect the input sequence to be significantly longer than the target, we fine-tuned the models using T5's "summarize: " prefix. The motivation behind this is to avoid simple extraction from the table, since abstractive summarization is supposed to rephrase important details in the source. T5-small is trained on 4 Tesla K80 GPUs with a per-device batch size of 16 for 30 epochs (about 6,900 steps). T5-base is trained on 4 Tesla K80 GPUs with a per-device batch size of 4 (due to GPU memory constraints) for 80 epochs (about 36,640 steps). As for T5-large, we distributed the layers across 8 Tesla K80 GPUs to train with a batch size of 4 for 80 epochs (about 80k steps).
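The snippet below sketches the corresponding Hugging Face setup (tokenizer with added special tokens, resized embeddings, and the "summarize: " prefix at inference time); the checkpoint name, generation parameters, and toy input are illustrative assumptions, and the training loop is omitted.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
tokenizer.add_special_tokens({"additional_special_tokens": ["[CLS]", "[SEP]"]})

model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.resize_token_embeddings(len(tokenizer))   # account for the added tokens

# toy example in the linearized format described in Section 3.2
question = "Who won in 2010?"
table = [["Year", "Winner"], ["2010", "Spain"]]
source = "summarize: [CLS] " + question + " [CLS] " + \
         " [SEP] ".join(" ".join(row) for row in table)

inputs = tokenizer(source, truncation=True, max_length=512, return_tensors="pt")
output_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```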

4.2 Evaluation Metrics

We use a variety of automatic metrics and human evaluation (Section 4.4) to evaluate the quality of the generated answers. We report sacreBLEU (Post, 2018), ROUGE-{1, 2, L} (Lin, 2004), and METEOR (Banerjee and Lavie, 2005), which evaluate the n-gram match between generated and reference answers. Considering the limitations of these measures in evaluating the semantic meaning of sentences, we also report BERTScore (Zhang et al., 2020) and BLEURT (Sellam et al., 2020), which incorporate semantics using contextual embeddings. To evaluate the retrieval competency of table semantic parsers, we applied various set similarity metrics to the predicted and reference denotation lists. Specifically, we report Jaccard similarity, Overlap coefficient, Cosine similarity, and Dice similarity.
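The set-similarity metrics for denotation prediction can be computed as in the sketch below; these are the standard definitions, and the exact variants used in the paper are not spelled out, so this is an assumption-laden illustration rather than the evaluation code itself.

```python
from typing import Dict, Iterable, Tuple


def denotation_scores(pred: Iterable[Tuple[int, int]],
                      gold: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Set-similarity between predicted and reference denotation cells."""
    p, g = set(pred), set(gold)
    if not p or not g:
        return {"jaccard": 0.0, "overlap": 0.0, "cosine": 0.0, "dice": 0.0}
    inter = len(p & g)
    return {
        "jaccard": inter / len(p | g),
        "overlap": inter / min(len(p), len(g)),          # overlap coefficient
        "cosine": inter / (len(p) * len(g)) ** 0.5,
        "dice": 2 * inter / (len(p) + len(g)),
    }
```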

4.3 Results and Discussion

Our experimental results on the FeTaQA test set are summarized in Table 4. The T5-large model using the End-to-End modeling approach achieves the highest scores on almost all evaluation metrics. Also, we observe a large performance gap between the pipeline models and the End-to-End models, even though the latter only adopt a simple linearization strategy for encoding tables.

We also see that after fine-tuning on FeTaQA with denotations as weak supervision, the pipeline model improves by almost 2 BLEU points.

To further examine the source of this improvement, we report the evaluation of table semantic parser performance in Table 5, from which we also observe an improvement in retrieval capability. However, we note that compared with the reference denotations, which have a median of six highlighted table cells (shown in Table 2), our table semantic parser is only able to predict two table cells on average before fine-tuning on FeTaQA, and three table cells on average after. This indicates a large space for improvement. We suspect that the low performance of denotation prediction and the loss of relational information between denotations lead to the inadequate performance of the pipeline models.

4.4 Human Evaluation

To further evaluate the quality of the answers generated by the different models compared to the references, we conduct a human evaluation based on four criteria: (1) fluency, whether an answer is natural and grammatical; (2) correctness, whether an answer is correct; (3) adequacy, whether an answer contains all the information that is asked for; and (4) faithfulness, whether an answer is faithful and grounded to the contents of the table and the highlighted region. Each evaluator is asked to examine an answer given the question and the full context (table, highlighted region, and metadata) and give a score on a scale of 1 to 5 for each of the criteria. We ask five internal annotators to evaluate 100 samples of FeTaQA instances. Each sample is paired with 3 answers: the reference, the pipeline model result, and the End-to-End model result.

Table 6 attests to the high quality of our annotations and the challenging nature of FeTaQA. Similar to the evaluation results of the automatic metrics, we observe a large gap between the pipeline model and the End-to-End model, with the latter significantly outperforming its counterpart in terms of answer correctness, adequacy, and faithfulness. Comparing the best-performing End-to-End model outputs to the human references, we see that there is room for improvement in the future.

5 Related Work

Generative QA Generative question answering datasets such as NarrativeQA (Kocisky et al., 2018), CoQA (Reddy et al., 2019), TriviaQA (Joshi et al., 2017), and MS MARCO (Nguyen et al., 2016) all have free-form answers that are generated based on the contexts of


                        sacreBLEU1   ROUGE-1   ROUGE-2   ROUGE-L   METEOR   BERTScore   BLEURT
Pipeline - zeroshot     9.16         0.38      0.20      0.33      0.22     0.88        -0.79
Pipeline - fine-tuned   11.00        0.40      0.22      0.35      0.24     0.91        -0.35
End-to-End - T5-small   21.60        0.55      0.33      0.47      0.40     0.94        0.08
End-to-End - T5-base    28.14        0.61      0.39      0.51      0.47     0.96        0.31
End-to-End - T5-large   30.54        0.63      0.41      0.53      0.49     0.96        0.57

Table 4: Experiment results on the test split of FeTaQA.

             Jaccard   Overlap Coeff.   Cosine   Dice
Zeroshot     0.065     0.300            0.140    0.109
Fine-tuned   0.101     0.311            0.184    0.161

Table 5: Evaluation of denotation prediction on the test split of FeTaQA. We report the performance of TAPAS in the zero-shot setting and fine-tuned with weak supervision.

Source       Fluent (%)   Correct (%)   Adequate (%)   Faithful (%)
Pipeline     85.2         25.4          8.4            23.6
End-to-End   94.6         54.8          48.4           50.4
Reference    95.0         92.4          90.6           95.6

Table 6: Human evaluation over 100 samples of model outputs and references. We report the percentage of outputs that have scores of 4 or 5.

Wikipedia articles, books, movie scripts, dialogues, or web documents. These responses are mostly crowd-sourced and are reported to consist largely of copies of short text spans from the source. By contrast, ELI5 (Fan et al., 2019) is a long-form question answering dataset containing a diverse set of complex questions, each paired with a paragraph-long answer and 100 relevant web source documents (Petroni et al., 2020; Krishna et al., 2021). FeTaQA is the first dataset for generative question answering over tables. Unlike the existing generative QA datasets, which assess multi-document retrieval and abstraction capability, FeTaQA poses new challenges to the reasoning and integration capability of a system given a structured knowledge source.

QA over Tables and Semantic Parsing Several datasets have been proposed to apply semantic parsing to tables, including WikiTableQuestions (Pasupat and Liang, 2015), SequentialQA (Iyyer et al., 2017), WikiSQL (Zhong et al., 2017), and Spider (Yu et al., 2018).

1 SacreBLEU signature: BLEU+case.lc+numrefs.1+smooth.exp+tok.13a+version.1.3.7

With the development of pre-trained language models, recent work (Yin et al., 2020; Herzig et al., 2020; Eisenschlos et al., 2020; Iida et al., 2021) jointly learns representations for natural language sentences and structured tables, and Yu et al. (2020, 2021) use pre-training approaches for table semantic parsing. HybridQA (Chen et al., 2020e) and OTT-QA (Chen et al., 2020a) have contexts of both structured tables and unstructured text. MultiModalQA (Talmor et al., 2021) contains complex questions over text, tables, and images. These datasets define a table QA task that is extractive in nature by restricting their answers to be short-form, while FeTaQA frames table QA as a generation task.

Data-to-Text Generation Recent neural end-to-end models tested on the WebNLG 2017 dataset (Gardent et al., 2017) have focused on incorporating pre-training and fine-tuning for specific generation tasks (Chen et al., 2020c; Kale, 2020) to improve performance and strengthen generalization ability. However, recent models featuring separate content-planning and surface-realization stages have also exhibited improvements (Moryossef et al., 2019; Iso et al., 2020) over comparable baselines. TabFact (Chen et al., 2020d) is composed of Wikipedia tables coupled with statements labeled as either "ENTAILED" or "REFUTED" by the table. LogicNLG (Chen et al., 2020b) features statements logically entailed from tables. ToTTo (Parikh et al., 2020) is a large-scale open-domain dataset consisting of Wikipedia tables with a set of highlighted table cells and a sentence description of those highlighted cells. DART (Nan et al., 2021) is an open-domain Data-to-Text dataset that contains table-ontology-preserving data samples with diverse predicate sets occurring in Wikipedia tables.

6 Conclusion

In this paper, we introduced the task of generative table question answering with FeTaQA, a table QA dataset consisting of complex questions that require


free-form, elaborate answers. We also proposed two modeling approaches: (1) a pipeline model that incorporates a table semantic parser and a Data-to-Text generator, and (2) an End-to-End model that includes query comprehension, reasoning, and text generation. Our experimental results indicate that the End-to-End model with a simple table encoding strategy achieves much higher scores than the pipeline model that requires table semantic parsing. Furthermore, we show that FeTaQA introduces new challenges for table question answering that call for innovative model designs in the future.

References

Sumit Asthana and Aaron Halfaker. 2018. With few eyes, all hoaxes are deep. Proc. ACM Hum.-Comput. Interact., 2(CSCW).

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.

Robin D. Burke, Kristian J. Hammond, Vladimir Kulyukin, Steven L. Lytinen, Noriko Tomuro, and Scott Schoenberg. 1997. Question answering from frequently asked question files: Experiences with the FAQ Finder system. AI Magazine, 18(2):57–57.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.

Wenhu Chen, Ming-Wei Chang, Eva Schlinger, W. Wang, and William W. Cohen. 2020a. Open question answering over tables and text. ArXiv, abs/2010.10439.

Wenhu Chen, Jianshu Chen, Y. Su, Zhiyu Chen, and William Yang Wang. 2020b. Logical natural language generation from open-domain tables. In ACL.

Wenhu Chen, Yu Su, X. Yan, and W. Wang. 2020c. KGPT: Knowledge-grounded pre-training for data-to-text generation. In EMNLP.

Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2020d. TabFact: A large-scale dataset for table-based fact verification. ArXiv, abs/1909.02164.

Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Wang. 2020e. HybridQA: A dataset of multi-hop question answering over tabular and textual data. Findings of EMNLP 2020.

Julian Eisenschlos, Syrine Krichene, and Thomas Müller. 2020. Understanding tables with intermediate pre-training. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 281–296, Online. Association for Computational Linguistics.

Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, Florence, Italy. Association for Computational Linguistics.

Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-SQL evaluation methodology. In ACL 2018. Association for Computational Linguistics.

Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation, pages 124–133, Santiago de Compostela, Spain. Association for Computational Linguistics.

Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Eisenschlos. 2020. TaPas: Weakly supervised table parsing via pre-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4320–4333, Online. Association for Computational Linguistics.

Hiroshi Iida, June Thai, Varun Manjunatha, and Mohit Iyyer. 2021. TABBIE: Pretrained representations of tabular data. In NAACL.

Hayate Iso, Yui Uehara, Tatsuya Ishigaki, Hiroshi Noji, Eiji Aramaki, Ichiro Kobayashi, Yusuke Miyao, Naoaki Okazaki, and Hiroya Takamura. 2020. Learning to select, track, and generate for data-to-text. Journal of Natural Language Processing, 27(3):599–626.

Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. 2017. Search-based neural structured learning for sequential question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1821–1831, Vancouver, Canada. Association for Computational Linguistics.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.

Mihir Kale. 2020. Text-to-text pre-training for data-to-text tasks. arXiv preprint arXiv:2005.10433.

Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gabor Melis, and Edward Grefenstette. 2017. The NarrativeQA reading comprehension challenge. CoRR, abs/1712.07040.

Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gabor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.

Kalpesh Krishna, Aurko Roy, and Mohit Iyyer. 2021. Hurdles to progress in long-form question answering. In NAACL.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. 2020. Question and answer test-train overlap in open-domain question answering datasets. arXiv preprint arXiv:2008.02637.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Amit Moryossef, Yoav Goldberg, and Ido Dagan. 2019. Step-by-step: Separating planning from realization in neural data-to-text generation.

Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, Yangxiaokang Liu, Nadia Irwanto, Jessica Pan, Faiaz Rahman, Ahmad Zaidi, Murori Mutuma, Yasin Tarabar, Ankit Gupta, Tao Yu, Yi Chern Tan, Xi Victoria Lin, Caiming Xiong, Richard Socher, and Nazneen Fatema Rajani. 2021. DART: Open-domain structured data record to text generation. In NAACL.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In CoCo@NIPS.

Ankur Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A controlled table-to-text generation dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1173–1186, Online. Association for Computational Linguistics.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1470–1480, Beijing, China. Association for Computational Linguistics.

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, et al. 2020. KILT: A benchmark for knowledge intensive language tasks. arXiv preprint arXiv:2009.02252.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Justus J. Randolph. 2005. Free-marginal multirater kappa (multirater K[free]): An alternative to Fleiss' fixed-marginal multirater kappa.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.

Amrita Saha, Vardaan Pahuja, Mitesh Khapra, Karthik Sankaranarayanan, and Sarath Chandar. 2018. Complex sequential question answering: Towards learning to converse over linked question answer pairs with a knowledge graph. In AAAI 2018.

Thibault Sellam, Dipanjan Das, and Ankur P. Parikh. 2020. BLEURT: Learning robust metrics for text generation.

Alon Talmor, Ori Yoran, Amnon Catav, Dan Lahav, Yizhong Wang, Akari Asai, Gabriel Ilharco, Hannaneh Hajishirzi, and Jonathan Berant. 2021. MultiModalQA: Complex question answering over text, tables and images. In International Conference on Learning Representations.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. CoRR, abs/1809.09600.

Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 956–966, Baltimore, Maryland. Association for Computational Linguistics.

Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for joint understanding of textual and tabular data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8413–8426, Online. Association for Computational Linguistics.

Tao Yu, Chien-Sheng Wu, Xi Victoria Lin, Bailin Wang, Yi Chern Tan, Xinyi Yang, Dragomir Radev, Richard Socher, and Caiming Xiong. 2020. GraPPa: Grammar-augmented pre-training for table semantic parsing. arXiv preprint arXiv:2009.13845.

Tao Yu, Rui Zhang, Oleksandr Polozov, Christopher Meek, and Ahmed Hassan Awadallah. 2021. SCoRe: Pre-training for context representation in conversational semantic parsing. In ICLR.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, Brussels, Belgium. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.


A Appendix

The Appendix contains the following contents:

• Flowchart of the ToTTo instance sampling process (Figure 7)

• Screenshot of the FeTaQA annotation interface (Figure 8)

• Question type distribution comparison between FeTaQA and other table QA datasets (Figure 9)


Figure 7: Flowchart of ToTTo filtering process


Figure 8: FeTaQA annotation interface


Figure 9: Question type distribution comparison between different Table QA datasets