
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 181–188, Florence, Italy, July 28 - August 2, 2019. ©2019 Association for Computational Linguistics


Flambé: A Customizable Framework for Machine Learning Experiments

Jeremy Wohlwend
ASAPP Inc.
[email protected]

Nicholas Matthews
ASAPP Inc.
[email protected]

Ivan Itzcovich
ASAPP Inc.
[email protected]

    Abstract

Flambé is a machine learning experimentation framework built to accelerate the entire research life cycle. Flambé's main objective is to provide a unified interface for prototyping models, running experiments containing complex pipelines, monitoring those experiments in real-time, reporting results, and deploying a final model for inference. Flambé achieves both flexibility and simplicity by allowing users to write custom code but instantly include that code as a component in a larger system which is represented by a concise configuration file format. We demonstrate the application of the framework through a cutting-edge multistage use case: fine-tuning and distillation of a state of the art pretrained language model used for text classification.1

    1 Introduction

Scientists and engineers in the machine learning community dedicate many hours and resources towards preprocessing data, iterating on model architectures, tuning hyperparameters, aggregating results and ultimately deploying their most performant model. While frameworks like PyTorch (Paszke et al., 2017) and TensorFlow (Abadi et al., 2016) abstract away the details of operations like back-propagation and make building models possible in a few lines of code, they do not explicitly aim to solve these other parts of the research cycle.

The explosion of available resources in the machine learning community (Dean et al., 2018) has included many tools that address one or more of these other phases of research, but these isolated tools do not always work harmoniously with one another, trading off customizability to provide high-level interfaces.

1 The code and documentation can be found at https://flambe.ai

Understanding that machine learning research, particularly in the field of Natural Language Processing, might require innovation at any level of abstraction and across any stage in the research process, we've built Flambé to include standardized implementations of modeling components, hyperparameter optimization and distributed execution that can all be effortlessly replaced with custom user-developed code.

By facilitating customization and iteration on a particular data pipeline and model architecture, we aim for Flambé users to spend the majority of their time doing research, not re-implementing tools for training, tuning, reporting and deploying.

    Flambé’s contributions are:

1. Modular machine learning components to develop replicable, state of the art research results. This includes: neural network components (pretrained or not), benchmark datasets, and standardized training and evaluation modules.

2. A configuration format that natively enables searching over hyperparameters and running remote multistage experiments at scale.

3. Smooth reporting and exporting, to facilitate sharing models and results with collaborators and the larger community.

4. An open source framework for both the academic community and teams in industry.

We demonstrate the application of our framework through a cutting-edge use case, namely knowledge distillation of a state of the art language model, the BERT model (Devlin et al., 2019), on a downstream text classification task.

    2 Related work

Many different tools are attempting to tackle the various challenges of building machine learning systems from different angles.


Frameworks like PyTorch and TensorFlow (Abadi et al., 2016) provide the building blocks of models as simple modules, e.g. various linear and recurrent layers, losses, optimizers etc. Many model implementations have been built on top of these modules, with some proposing new standardizations of specific architectures like sequence-to-sequence modeling (Ott et al., 2019).

Libraries such as Keras (Chollet et al., 2015) offer a high-level API for building and training models. Others including AllenNLP (Gardner et al., 2018), FastAI (Howard et al., 2018) and Texar (Hu et al., 2018) focus on some specific domains or tasks like reading comprehension or text style transfer. These types of frameworks tend to focus on training a single model at a time, but many research experiments consist of complex multistage pipelines, with hyperparameter tuning and distributed computation required at each stage. With Flambé, users can write their custom code independent from these concerns, and then easily start using algorithms like Hyperband (Li et al., 2016) and Bayesian Optimization (Bergstra et al., 2013), link components across stages, and run everything on a cluster without any modifications.

MLflow (Zaharia et al., 2018) focuses on experiment tracking and metric reporting, and contains powerful features aimed at production deployment. However, it does not have a natural way to run hyperparameter tuning, or advanced trial sampling and scheduling.

Ray (Moritz et al., 2017) implements infrastructure for distributing computational tasks on a cluster, and it also provides a higher level extension, Tune (Liaw et al., 2018), that handles hyperparameter optimization.

Flambé leverages and builds upon existing tools, connecting the dots between frameworks like PyTorch and Ray, and providing a smooth integration between them with a powerful layer of abstraction on top. By not trying to re-implement solved problems like back-propagation and distributed task execution, we can focus our attention on usability and efficiency.

    3 The Flambé Framework

Flambé executes experiments which are composed of a pipeline of modeling and processing stages (Subsection A), extensions that import user-supplied code (Subsection B), links to existing components (Subsection C), and tunable hyperparameters (Subsections D, E, F).

Figure 1: Example YAML config for text classification on the TREC dataset. The highlighted and labeled sections refer to the subsections in 3.1. There are a number of different objects that could be used in any place of this config, e.g. the optimizer could be !torch.SGD and the scheduler tune.HyperOpt (Bayesian optimization). Note the pipeline stage names "stage0", etc. are arbitrary.

All of these features are demonstrated in the Experiment shown in Figure 1, which defines a simple text classification task consisting of training an LSTM (Hochreiter and Schmidhuber, 1997) on the TREC dataset (Li and Roth, 2002).


Each tag in the YAML (Oren Ben-Kiki, 2009) config (anything beginning with '!') corresponds to a Python object that will be initialized with the keyword arguments following the tag. These tags are not hardcoded into the system, and users can use their own classes in the config just as easily as the ones we've already built. After we explain all the aforementioned features, we introduce how Flambé saves object state, enables simple metric logging, and deploys models for production.

    3.1 Walkthrough

In this section we present an example-driven explanation of the core features as they're used in Figure 1.

    A. Pipeline

The most important section of the YAML file is the pipeline section. This section contains a series of stages which each implement a step method. The example shown in Figure 1 contains 3 stages: (1) dataset loading and processing, (2) training of each model variant, and (3) evaluating the best model from stage1.
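In abbreviated YAML, such a pipeline has roughly the following shape (a sketch based on the description above, not the exact contents of Figure 1; the link in stage2 is explained in Subsection C below):

pipeline:
  stage0: !TCProcessor       # dataset loading and processing
    dataset: !TRECDataset
  stage1: !Trainer           # trains each model variant
    model: !TextClassifier
  stage2: !Evaluator         # evaluates the best model from stage1
    model: !@ stage1.model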

A stage in the pipeline can be any Python object. Users need only add a parent class to their class definition if they intend to use it in the YAML config. All objects will receive the keyword arguments given inline in the configuration file. For example, in Figure 1 the TextClassifier object receives an embedding, encoder and decoder, matching its definition in code:
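A minimal sketch of what that definition might look like (the base class follows Subsection B below; the exact signature in the Flambé library may differ):

import flambe

class TextClassifier(flambe.nn.Module):
    """Embeds, encodes, and decodes a batch of text."""

    def __init__(self, embedding, encoder, decoder):
        super().__init__()
        # The argument names match the keyword arguments
        # given inline in the YAML configuration.
        self.embedding = embedding
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, data):
        return self.decoder(self.encoder(self.embedding(data)))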

All subclasses of Flambé classes like Model are automatically registered with YAML.

    B. Extending Flambé with Custom Code

Flambé is flexible because of its ability to use custom Flambé objects in the experiment configuration file. By default, only classes in the main Flambé library and PyTorch can be referenced, but by using the extensions feature users can include their own classes and functions, from either local or remote source code repositories.

To create an extension, users need only organize their code into one or more pip-installable packages. After declaring the extensions and including them at the top of the config file, they are usable anywhere in the YAML configuration file.

In the example, the TRECDataset object is defined in an external extension hosted on GitHub. By adding its URL at the top of the YAML configuration file, the cl.TRECDataset object and any other Flambé class can be used. If you cannot or do not want to inherit from one of our pipeline classes (Model, Trainer, etc.) you can inherit from flambe.nn.Module which will supply the minimum needed functionality to support use in the config file and automatic hierarchical serialization (see Section 3.2).
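An extensions declaration might look like the following (the prefix and repository URL here are placeholders for illustration, not a real extension):

extensions:
  cl: https://github.com/example/text-classification

pipeline:
  stage0: !cl.TRECDataset   # class defined in the extension package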

C. Referencing Earlier Objects

A core feature of Flambé is the ability to connect (or "link") different components with the !@ notation, a custom YAML tag we've implemented. Any value anywhere in the pipeline can be a reference to an earlier value that has already been defined. Each link consists of the identifier of a stage, e.g. "stage1" which in this case is the Trainer object, followed by the rest of the object attributes. In the highlighted example (C), the link stage0.train means that the data keyword argument for BaseSampler should point to the train attribute of the TCProcessor.
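In isolation, that link reads as follows (surrounding keys abbreviated):

train_sampler: !BaseSampler
  data: !@ stage0.train   # the train attribute of the TCProcessor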

D. Hyperparameter Search

In addition to referencing other values via links, the value for any parameter in the config can be replaced with either a list of possible options to try (for grid search) or a distribution for sampling possible options. Grid search options are defined with the !g tag followed by the list of candidate values; Flambé will automatically duplicate the stage, choosing a single value for each variant of the stage. In the example we use this mechanism to search over different numbers of layers.
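For instance, a grid search over layer counts might be written as follows (the encoder tag echoes the RNNEncoder class mentioned in Section 3.5; the candidate values are illustrative):

encoder: !RNNEncoder
  n_layers: !g [2, 3, 4]   # one variant of this stage per value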

If distributions are used instead of lists of candidate values, Flambé performs a simple random search. Users can also specify a search field that maps stage names to the hyperparameter search algorithm, e.g. Bayesian optimization, which changes the distributions used to sample the tunable hyperparameters.
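Such a mapping might read as follows (the algorithm tag follows the tune.HyperOpt example named in the Figure 1 caption; its exact spelling here is an assumption):

search:
  stage1: !tune.HyperOpt   # sample hyperparameters via Bayesian optimization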

When links reference stages with multiple variants, the stage containing the link is duplicated as many times as there are variants.


E. Trial Scheduling

Regardless of the strategy used to choose hyperparameters, some variants will start to clearly outperform others, and scheduling algorithms like Hyperband (Li et al., 2016) use that information to intelligently allocate resources to the variants that are performing the best. Flambé surfaces an interface to these schedulers in the same way as the search algorithms: "schedulers" maps pipeline stage names to the desired scheduling algorithms, as shown in the example configuration.
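In isolation, the mapping might look like this (the tag name is an assumption, mirroring the search tags above):

schedulers:
  stage1: !tune.HyperBand   # prune underperforming variants early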

F. Selecting the Best Variants

After trying many different combinations of hyperparameters, only the best will propagate to the next stages if the reduce operation is used. For example, with reduce mapping stage1 to 1 in the example, only the single best configuration, with the optimal number of layers, will be evaluated in the final stage. In order to use this feature, the stages need to supply a metric_fn that can be used to rank the variants.
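Following the description above, that mapping reads:

reduce:
  stage1: 1   # only the single best stage1 variant moves forward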

3.2 Hierarchical Serialization

While PyTorch already provides a clear and robust saving mechanism, we augment this functionality with a generic serialization protocol for all objects that includes opt-in versioning and a directory-based file format that anyone can inspect. Rather than dumping all of the model weights and other state into a single file, the directory-based structure mirrors the object hierarchy and enables the possibility of referencing a specific component. Rather than having to load the save file to inspect the contents, it can be navigated like any other directory. By default, only what PyTorch normally saves is included in the save file; users can add additional state by overriding custom_state and load_custom_state.
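A sketch of such an override (the method names follow the text above, but the signatures here are assumptions, not the library's exact API):

class MyTrainer(Trainer):

    def custom_state(self):
        # Assumed hook: return extra values to include in the save file.
        return {'best_metric': self.best_metric}

    def load_custom_state(self, state):
        # Assumed hook: restore the extra values on load.
        self.best_metric = state['best_metric']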

3.3 Using a cluster

To run experiments on a cluster, an additional piece of YAML is needed to define the remote manager. As shown below in Figure 3, one can indicate the instance types and a timeout flag for both the orchestrator and the factories. We use this feature to keep our experiment tracking website running on the orchestrator once an experiment is over, but also to keep factories alive when rapidly experimenting or debugging. The orchestrator will communicate with workers in the cluster via Ray and Tune to execute and checkpoint progress at each step.

Figure 2: Save file directory structure for the Experiment in Figure 1.

If an experiment fails or is interrupted, it can be quickly resumed with an additional flag resume: True. Crucially, this remote functionality allows distributing the execution of the variants across a cluster of machines by only adding a few lines to the configuration.

Figure 3: Example remote config for an AWS cluster.
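A rough sketch of the kind of remote section Figure 3 shows (every field name and value below is an assumption for illustration):

!AWSCluster

name: demo-cluster
factories_num: 2              # machines that execute the variants
factories_type: g3.4xlarge
orchestrator_type: t3.large
orchestrator_timeout: -1      # keep the report site alive after the run
factories_timeout: 1          # keep factories alive while debugging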

3.4 Deploying

Typically after experimentation, machine learning projects require packaging a model together with some preprocessing and post-processing functions into a single inference-ready interface, e.g. a text classifier that actually takes raw string(s) as input. Flambé facilitates this use case with the Exporter object, wherein users can define a new version of the model from the best variants tested, and with the right interface for later use.
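As a sketch, such a stage might be appended to the pipeline like this (the argument names are assumptions):

stage3: !Exporter
  model: !@ stage1.model   # best variant, after the reduce step
  text: !@ stage0.text     # preprocessing needed to accept raw strings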

3.5 Library usage

In addition to using the Flambé framework via YAML configuration files, users can also use the individual objects (e.g. the Trainer, or RNNEncoder classes) in any Python script. This usage may be important for users that already have a production codebase (including training scripts) written purely in Python. In a future version of the software, we plan to support creating full experiments and deploying models via code (instead of YAML) to enable dynamic experiment creation and model exporting.


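For instance, a script might assemble and run a trainer directly, along these lines (the import paths and argument names are assumptions, not the library's exact API):

# Hypothetical import paths; consult the library for the real ones.
from flambe.learn import Trainer
from flambe.nlp.classification import TRECDataset

dataset = TRECDataset()
model = build_text_classifier()   # any custom model-building code
trainer = Trainer(dataset=dataset, model=model)
trainer.run()                     # same behavior as a pipeline stage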

    3.6 Logging

Flambé provides full integration with Python's logging module and Tensorboard (Abadi et al., 2016). Users are able to visualize their results by simply including log statements in their code (see Figure 4).

Figure 4: Example log statement. Logging can be done anywhere inside a custom object.
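The statement in the figure is along these lines, using the standard library integration described above (the message format is illustrative):

import logging

logger = logging.getLogger(__name__)
accuracy = 0.948  # e.g. computed during evaluation
logger.info(f'dev accuracy: {accuracy:.3f}')  # surfaces in the console and plots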

All variants will appear under the same plot for easy analysis (see Figure 5).

    4 Case study: BERT Distillation

In this section we showcase Flambé's ability to transform a pre-existing codebase with no support for hyperparameter optimization into a complex multi-stage pipeline with a YAML config less than 80 lines long. Furthermore, we were able to find the optimal set of parameters in roughly half the time otherwise needed by adding Hyperband scheduling (Li et al., 2016), and running the experiment over a large cluster.

BERT (Devlin et al., 2019) is a popular model which performs competitively across several NLP tasks by leveraging language model pre-training over a very large corpus. Two crucial issues with the BERT model are the size of the model and its inference speed, which generally inhibit its use in production environments. To address this issue, recent efforts have shown that most of BERT's performance on a downstream task can be conserved, while dramatically reducing its memory footprint (Chia et al., 2018).

In this experiment, we fine-tune the BERT model on two standard text classification benchmarks: TREC (Li and Roth, 2002) and Sentiment Treebank (Socher et al., 2013). We then apply knowledge distillation to reduce the BERT model to a simple 4-layer, 256-unit SRU network (Lei et al., 2018). This is a typical multistage experiment with preprocessing, fine-tuning, and distillation stages. All of this can be expressed in a single, concise configuration.

Model       TREC   SST2   # Parameters
SRU         94.8   86.2   ≈ 5M
BERT        96.8   91.0   ≈ 110M
DISTILLED   95.5   87.8   ≈ 5M

Table 1: Accuracy on benchmark text classification datasets: TREC and SST2 (Binary Sentiment Treebank). Distilling BERT improves the accuracy of the base SRU model, while reducing the number of parameters by more than 95%. All models were trained or fine-tuned using Flambé. The SRU and DISTILLED models have the same architecture, the SRU model being trained from scratch and the DISTILLED model benefiting from the BERT model's improved performance.

Figure 5: Some runs are pruned early by the Hyperband scheduling algorithm. The x-axis is training steps, and the y-axis is accuracy.

Results are provided in Table 1. The full configuration, containing all three stages and their respective hyperparameters, is provided as supplementary material.

Not only can Flambé express the above experiment in a concise configuration, but using a state of the art trial scheduling algorithm such as Hyperband (Li et al., 2016) can be accomplished with a single additional line in the configuration. Figure 5 shows Hyperband allocating more training steps to the best-performing models. In this example, defining grid searches, running over a cluster, and using a scheduling algorithm on an existing codebase required little to no effort.

    5 Future work

Flambé aims to integrate with research and engineering workflows through its focus on usability, modularity and reproducibility.


We continue to pursue this goal by developing a large collection of machine learning components including state of the art models, benchmark datasets, and novel training strategies. Real, working, and reproducible experiment configurations will showcase these components alongside their performance in task-based leaderboards. In parallel, we will continue to develop user-friendly abstractions like the ability to auto-scale clusters based on the size of each stage in the pipeline, and to monitor or even alter experiment execution in real-time from a website.

References

Martin Abadi et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.

Myle Ott et al. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Zhiting Hu et al. 2018. Texar: A modularized, versatile, and extensible toolbox for text generation. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 13–22, Melbourne, Australia. Association for Computational Linguistics.

James Bergstra, Dan Yamins, and David D. Cox. 2013. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, pages 13–20. Citeseer.

Yew Ken Chia, Sam Witteveen, and Martin Andrews. 2018. Transformer to CNN: Label-scarce distillation for efficient text classification.

    François Chollet et al. 2015. Keras. https://keras.io.

Jeff Dean, David Patterson, and Cliff Young. 2018. A new golden age in computer architecture: Empowering the machine-learning revolution. IEEE Micro, 38(2):21–29.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Jeremy Howard et al. 2018. fastai. https://github.com/fastai/fastai.

Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4470–4481.

Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2016. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560.

Xin Li and Dan Roth. 2002. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics.

Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E. Gonzalez, and Ion Stoica. 2018. Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118.

Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, William Paul, Michael I. Jordan, and Ion Stoica. 2017. Ray: A distributed framework for emerging AI applications. CoRR, abs/1712.05889.

Oren Ben-Kiki, Clark Evans, and Ingy döt Net. 2009. YAML. https://yaml.org/spec/1.2/spec.html.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Matei Zaharia, Andrew Chen, Aaron Davidson, Ali Ghodsi, Sue Ann Hong, Andy Konwinski, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, et al. 2018. Accelerating the machine learning lifecycle with MLflow. Data Engineering, page 39.



    A Screenshots

Below is a screenshot of the reporting site that includes a progress bar, links to see the console output and Tensorboard, and a download link for the model weights:

When you launch an experiment from the console, you will see a series of status updates as shown below:


    B BERT Configuration File