SAT and CSP competitions & benchmark libraries: some lessons learnt?
Toby Walsh
NICTA & UNSW
Sydney, Australia
What’s the best way to benchmark systems?
Outline
» Benchmark libraries
  » Founding CSPLib.org
» Competitions
  » SAT competition judge
  » TPTP competition judge
  » …
Why?
» Why did I set up CSPLib.org?
  » I needed problems against which to benchmark my latest inference techniques
» Zebra and random problems don’t cut it!
» I thought it would help unify and advance the CP community
Random problems
» +ve
  » Easy to generate
  » Hard (if chosen from phase transition)
  » Impossible to cheat
    » If you can solve 1000-variable random 3SAT problems at l/n = 4.2, I’ll be impressed
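A minimal sketch of how such instances can be generated, assuming l/n on the slide means the ratio of clauses to variables and that each clause uses three distinct variables with random signs. The function names `random_3sat` and `to_dimacs` are hypothetical, chosen for illustration only.

```python
import random


def random_3sat(n_vars, ratio=4.2, seed=None):
    """Return a list of clauses, each with 3 literals over distinct variables.

    Assumption: the number of clauses is round(ratio * n_vars).
    """
    rng = random.Random(seed)
    n_clauses = round(ratio * n_vars)
    clauses = []
    for _ in range(n_clauses):
        # Pick three distinct variables, then negate each with probability 1/2.
        variables = rng.sample(range(1, n_vars + 1), 3)
        clauses.append([v if rng.random() < 0.5 else -v for v in variables])
    return clauses


def to_dimacs(n_vars, clauses):
    """Serialise the clauses in DIMACS CNF: a 'p cnf' header, then 0-terminated clauses."""
    lines = [f"p cnf {n_vars} {len(clauses)}"]
    lines += [" ".join(map(str, clause)) + " 0" for clause in clauses]
    return "\n".join(lines) + "\n"


instance = random_3sat(1000, ratio=4.2, seed=0)
dimacs_text = to_dimacs(1000, instance)
```

Fixing the seed makes an instance reproducible, which matters if the same random problems are to be shared between competitors.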
Random problems
» -ve
  » Lack the structure found in real-world problems
  » Unrepresentative
    » E.g. random 3SAT problems either have many solutions or none
  » Different methods work well on them
    » Random SAT: forward looking algorithms
    » Industrial SAT: backward looking algorithms
Why?
» Thesis: every mature field has a benchmark library
  » Deduction started in 1960s
    » TPTP set up in 1993
  » SAT started in 1960s
    » SAT DIMACS challenge in 1992
    » SATLib set up in 1999
  » CP started in 1970s
    » CSPLib set up in 1998
Why?
» Thesis: every mature field has a benchmark library
  » Spatial and temporal reasoning started in early 80s (or before?)
  » It’s been approximately 30 years so it’s about time you guys set one up!
Benchmark libraries
» CSPLib.org
  » Over 35k unique visitors
  » Still not everything I’d want it to be
  » But state of the art for experimentation is now much better than it was
    » I haven’t seen a zebra for a very long time
An ideal library
» Desiderata taken from:
  » CSPLib: a benchmark library for constraints, Proc. CP-99
An ideal library
» Location
  » On the web and easy to find
    » TPTP.org
    » CSPLib.org
    » SATLib.org
    » QBFLib.org
    » …
    » http://elib.zib.de/pub/mp-testdata/tsp/tsplib/tsplib.html
    » http://mat.gsia.cmu.edu/COLOR/instances.html
An ideal library
» Easy to use
  » Tools to make benchmarking as painless as possible
    » tptp2X, …
» Diverse
  » To help prevent over-fitting
An ideal library
» Large
  » Growing continuously
  » Again helps to prevent over-fitting
» Extensible
  » To new problems or domains
An ideal library
» Complete
  » One stop for your problems
» Topical
  » For instance, it should report current best solutions found
An ideal library
» Independent
  » Not tied to a particular solver or proprietary input language
» Mix of difficulties
  » Hard and easy problems
  » Solved and open problems
  » With perhaps even a difficulty index?
An ideal library
» Accurate
  » It should be trusted
» Used
  » A valued resource for the community
Problem format
» Lo-tech or hi-tech?
Lo-tech formats
» DIMACS format used in SATLib
c a simple example
p cnf 3 2
1 -1 0
1 2 3 0
This represents two clauses: x or -x, and x or y or z
Lo-tech formats
» DIMACS format used in SATLib
  » +ve
    » All programming languages can read integers!
    » Small amount of extensibility built in (e.g. QBF)
  » -ve
    » Larger extensions are problematic (e.g. beyond CNF to arbitrary Boolean circuits)
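To illustrate how little machinery a lo-tech format needs, here is a small sketch of a DIMACS CNF reader. It assumes the usual conventions: lines starting with "c" are comments, a single "p cnf <vars> <clauses>" header, and each clause terminated by 0. The name `parse_dimacs` is hypothetical.

```python
def parse_dimacs(text):
    """Parse DIMACS CNF text into (number of variables, list of clauses)."""
    n_vars = None
    clauses = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("c"):
            continue  # blank line or comment
        if line.startswith("p"):
            _, fmt, nv, _nc = line.split()
            if fmt != "cnf":
                raise ValueError(f"unsupported format: {fmt}")
            n_vars = int(nv)
            continue
        # Everything else is a stream of integers; 0 closes the current clause.
        for token in line.split():
            literal = int(token)
            if literal == 0:
                clauses.append(current)
                current = []
            else:
                current.append(literal)
    return n_vars, clauses


example = """c a simple example
p cnf 3 2
1 -1 0
1 2 3 0
"""
n_vars, clauses = parse_dimacs(example)
```

Because clauses are read as a token stream, a clause split across several lines is handled in the same way as one clause per line.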
Hi-tech formats
» CP competition
<instance>
  <presentation name="4-queens" description="This problem involves placing 4 queens on a chessboard" nbSolutions="at least 1" format="XCSP1.1 (XML CSP Representation 1.1)" />
  <domains nbDomains="1">
    <domain name="dom0" nbValues="4" values="1..4" />
  </domains>
  <variables nbVariables="4">
    <variable name="X0" domain="dom0"/>
    …
  </variables>
  <relations nbRelations="3">
    <relation name="rel0" domain="dom0 dom0" nbConflicts="10" conflicts="(1,1)(1,2)(2,1)(2,2)(2,3)(3,2)(3,3)(3,4)(4,3)(4,4)" />
    …
  </relations>
  <constraints nbConstraints="6">
    <constraint name="C0" scope="X0 X1" relation="rel0"/>
    …
Hi-tech formats
» XML
  » +ve
    » Easy to extend
    » Parsing tools can be provided
  » -ve
    » Complex and verbose
    » Computers can parse terse structures easily
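As a rough illustration of the "parsing tools" point, the sketch below reads a cut-down instance with Python's standard `xml.etree.ElementTree`. The XML string is a hypothetical two-variable fragment modelled on the element and attribute names in the 4-queens example; it is not a complete or validated XCSP instance.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment reusing the element/attribute names from the slide.
XML = """<instance>
  <domains nbDomains="1">
    <domain name="dom0" nbValues="4" values="1..4"/>
  </domains>
  <variables nbVariables="2">
    <variable name="X0" domain="dom0"/>
    <variable name="X1" domain="dom0"/>
  </variables>
  <relations nbRelations="1">
    <relation name="rel0" domain="dom0 dom0" nbConflicts="10"
              conflicts="(1,1)(1,2)(2,1)(2,2)(2,3)(3,2)(3,3)(3,4)(4,3)(4,4)"/>
  </relations>
  <constraints nbConstraints="1">
    <constraint name="C0" scope="X0 X1" relation="rel0"/>
  </constraints>
</instance>"""

root = ET.fromstring(XML)

# Map each variable name to its domain name.
variables = {v.get("name"): v.get("domain") for v in root.iter("variable")}

# Turn each relation's conflict string into a set of forbidden value pairs.
conflicts = {}
for rel in root.iter("relation"):
    pairs = re.findall(r"\((\d+),(\d+)\)", rel.get("conflicts"))
    conflicts[rel.get("name")] = {(int(a), int(b)) for a, b in pairs}

# Each constraint applies a named relation to a scope of variables.
constraints = [
    (c.get("name"), c.get("scope").split(), c.get("relation"))
    for c in root.iter("constraint")
]
```

The XML layer comes for free from the library, but the tuple list inside the `conflicts` attribute still needs its own small parser, which is one face of the verbosity trade-off noted above.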
No-tech formats
» CSPLib
  » Problems are specified in natural language
  » No agreement at that time on an input language
  » One focus was on how you model a problem
  » Today there is more consensus on modelling languages like Zinc
No-tech formats
» CSPLib
  » Problems are specified in natural language
  » But you can still provide in one place
    » Input data
    » Results
    » Code
    » Parsers …
Getting problems
» Submit them yourself
  » Initially, you must do this so the library has some critical mass the first time people look at it
  » But it becomes tiresome and unrepresentative to do so continually
» Ask at every talk
  » Tried for several years but it (almost) never worked
Getting problems
» Need some incentive
  » Offer money?
  » Price of entry for the competition?
  » If you have a competition, users will submit problems that their solver is good at?
Competitions
Libraries + Competitions
» You can have a library without a competition
  » But you can’t have a competition without a library
Libraries + Competitions
» Libraries then competition
  » TPTP then CASC
  » Easy and safe!
» Libraries and competition
  » Planning
  » RoboCup
  » …
Increasing complexity
» Constraints
  » 1st year, binary extensional
  » 2nd year, limited number of globals
  » 3rd year, unlimited
» Planning
  » Increasing complexity
  » Time, metrics, uncertainty, …
Benefits
» Gets ideas implemented
» Rewards engineering
  » Progress needs both science and engineering!
» Puts it all together
Benefits
» Gives greater weight to important low-level issues
  » In SAT:
    » Watched literals
    » VSIDS
    » …
Benefits
» Witness the progress in SAT
  » 1985, 10s vars
  » 1995, 100s vars
  » 2005, 1000s vars
  » …
  » Not just Moore’s law at play!
Pitfalls
» Competitions require lots of work
  » Organizers get limited (academic) reward
  » One solution is to also organize special issues on the competition
Pitfalls
» Competitions encourage incremental improvements
  » Don’t have them too often!
» You may discover a local minimum
  » E.g. MDPs for speech recognition
  » Give out best new solver prize?
The Chaff story
» Industrial problems, SAT & UNSAT instances
  » 2008, 1st MiniSAT (son of zChaff)
  » 2007, 1st RSAT (son of MiniSAT)
  » 2006, 1st MiniSAT
  » 2005, 1st SatELite GTI (MiniSAT+preprocessor)
  » 2004, 1st zChaff (Forklift from 2003 was better)
  » 2003, 1st Forklift
  » 2002, 1st zChaff
Other issues
» Man-power
  » Organizers
    » One is not enough?
  » Judges
    » All rules need interpretation
» Compute-power
  » Find a friendly cluster
Other issues
» Multiple tracks
  » SAT/UNSAT
  » Random/industrial/crafted
  » …
  » Certified/uncertified
Other issues
» Holding problems back if possible
  » Release some problems so competitors can ensure solver compliance
  » But hold most back so competition is blind!
Other issues
» Multiple phases
  » Too many solvers for all to compete with long timeouts
  » First phase to test correctness
  » Second phase to throw out the slow solvers (who cost you many timeouts)
  » Third phase to differentiate between better solvers
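The first two phases can be sketched as simple filters over recorded results. Everything here is an assumption for illustration: the function names, the data layout, and in particular the rule of keeping the top half of solvers in phase two, which the slides do not specify.

```python
import math


def phase1_correct(solvers, answers, known):
    """Keep solvers that never contradict a known answer (a missing answer is a timeout)."""
    return [
        s for s in solvers
        if all(answers[s].get(p) in (None, known[p]) for p in known)
    ]


def phase2_drop_slow(solvers, solved_counts, keep_fraction=0.5):
    """Keep the best fraction of solvers, ranked by instances solved under a short timeout."""
    ranked = sorted(solvers, key=lambda s: solved_counts[s], reverse=True)
    keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return ranked[:keep]


# Hypothetical results for four solvers on two instances with known status.
known = {"p1": "SAT", "p2": "UNSAT"}
answers = {
    "A": {"p1": "SAT", "p2": "UNSAT"},
    "B": {"p1": "SAT"},                   # timed out on p2
    "C": {"p1": "UNSAT", "p2": "UNSAT"},  # wrong answer on p1
    "D": {"p1": "SAT", "p2": "UNSAT"},
}
survivors = phase1_correct(["A", "B", "C", "D"], answers, known)

# Hypothetical counts of instances solved with a short timeout.
solved_counts = {"A": 40, "B": 12, "D": 35}
finalists = phase2_drop_slow(survivors, solved_counts)
```

Only the finalists would then be run with long timeouts in the third phase, which is where most of the cluster time goes.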
Other issues
» Reward function
  » <#completed, average time, …>
  » solution purse + speed purse
    » Points for each problem divided between those solvers that solve it
» Getting buy-in from competitors
  » It will (and should) evolve over time!
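A minimal sketch of purse-style scoring. The equal division of the solution purse among the solvers that solve a problem follows the slide; splitting the speed purse in proportion to 1/time is an assumption made here for illustration, as are the purse sizes and the name `purse_scores`.

```python
def purse_scores(times, solution_purse=1000.0, speed_purse=1000.0):
    """Score solvers from per-problem solve times.

    times: {problem: {solver: seconds}}, listing only solvers that solved the problem.
    """
    scores = {}
    for solved in times.values():
        if not solved:
            continue  # nobody solved it, so no purse is paid out
        # Solution purse: divided equally between the solvers that solved the problem.
        share = solution_purse / len(solved)
        # Speed purse (assumed rule): divided in proportion to 1 / solve time.
        total_speed = sum(1.0 / t for t in solved.values())
        for solver, t in solved.items():
            speed_share = speed_purse * ((1.0 / t) / total_speed)
            scores[solver] = scores.get(solver, 0.0) + share + speed_share
    return scores


# Hypothetical results: both solvers tie on p1, only A solves p2, nobody solves p3.
times = {
    "p1": {"A": 10.0, "B": 10.0},
    "p2": {"A": 5.0},
    "p3": {},
}
scores = purse_scores(times)
```

Under this scheme a problem that only one solver cracks is worth a full purse to that solver, so unique solutions are rewarded more than easy instances everyone solves.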
Other issues
» Prizes
  » Give out many!
  » Good for people’s CVs
  » Good motivator for future years
Other issues
» Open or closed source?
  » Open to share progress
  » Closed to get the best
» Last year’s winner
  » Condition of entry
  » To see progress is being made!
Other issues
» Smallest unsolved problem
  » Give a prize!
» Timing
  » Run during the conference
  » Creates a buzz so people enter next year
  » Get a slot in program to discuss results
  » Get a slot in banquet to give out prizes
Conclusions
» Benchmark libraries
  » When an area is several decades old, why wouldn’t you have one?
» Competitions
  » Designed well, held not too frequently, & with buy-in from the community, why wouldn’t you?
Questions
» Disagreements
» Other opinions
» Different experiences
» …