Lecture 2: Introduction to AI - Columbia University

Lecture 2: Introduction to AI

What is Artificial Intelligence?

History of AI

The State of the Art

What is AI?

Systems that think like humans

Systems that think rationally

Systems that act like humans

Systems that act rationally

Acting Humanly: The Turing Test

Test of intelligence: The Imitation Game

Anticipated all major arguments against AI

Broke down AI into knowledge, reasoning, language understanding, and learning

Predicted the success of AI: “[By 2000, programs will be able to] play the imitation game so well that an average interrogator will not have more than 70% chance of making the right identification after 5 minutes of questioning.”

Thinking Humanly: Cognitive Science

Requires understanding of the biological processes of human thought

What level of abstraction is appropriate?

Thinking Rationally: Laws of Thought

Aristotle: What are correct arguments and thought processes?

From mathematics, logic, and philosophy through to modern AI

But, is all intelligent behavior logical?

What is the purpose of thinking?

Acting Rationally: Do the Right Thing

The right thing: That which is expected to maximize goal achievement

But, the right thing doesn’t necessarily involve thinking (e.g. blinking).

(Modern) History of AI

1943: McCulloch & Pitts’ circuit model of the brain
1950: Turing’s “Computing Machinery and Intelligence”
1950s: Samuel’s checkers; Newell & Simon’s Logic Theorist
1965: Robinson’s logical reasoning algorithm
1960s: Complexity problems
1970s: Early knowledge-based systems
1980s: Expert systems
1990s: Agent model and scientific formalization

State of the Art: Can a Machine...

... plan and control the operation of a spacecraft?

... play a master-level game of chess?

... steer a car on a cross-country trip of the US?

... diagnose lymph-node pathology at an expert level?

... manage the logistics planning and scheduling of over 50,000 vehicles, cargo, and soldiers during wartime?

... assist surgeons during microsurgery?

... solve crossword puzzles better than humans?

... unload any dishwasher and put everything away?

Problem Solving via Uninformed Search

Traveling through Romania

Currently in Arad

Flight leaves tomorrow from Bucharest

Goal: be in Bucharest

Problem:
  states = various cities
  actions = drive between cities

Solve: find the sequence of cities to drive through: Arad, Sibiu, Fagaras, Bucharest

Stating the Problem

Initial State: at Arad

Successor Function: S(state) = { (action,state,cost), ... }

Goal Test: Goal(state) = True if state is goal

Path Cost: Sum of step costs c(x,a,y) = g(n)

Solution: Sequence of actions from initial to goal state
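The five parts of the problem statement can be written down directly. The following is a minimal Python sketch, not the lecture's Lisp code; the names ROADS, successors, goal_test, and path_cost are illustrative assumptions, and the step costs are the ones implied by the g(n) values in the A* slides later in this lecture.

```python
# Fragment of the Romania road map. Step costs are taken from the g(n)
# values in the lecture's A* figure (e.g. Arad -> Sibiu = 140).
ROADS = {
    ("Arad", "Sibiu"): 140, ("Arad", "Timisoara"): 118, ("Arad", "Zerind"): 75,
    ("Sibiu", "Fagaras"): 99, ("Sibiu", "Rimnicu Vilcea"): 80,
    ("Fagaras", "Bucharest"): 211, ("Rimnicu Vilcea", "Pitesti"): 97,
    ("Pitesti", "Bucharest"): 101,
}

INITIAL_STATE = "Arad"


def successors(state):
    """S(state): list of (action, next_state, step_cost) triples."""
    result = []
    for (a, b), cost in ROADS.items():
        if a == state:
            result.append((f"drive to {b}", b, cost))
        elif b == state:
            result.append((f"drive to {a}", a, cost))
    return result


def goal_test(state):
    """Goal(state): true only in Bucharest."""
    return state == "Bucharest"


def path_cost(cities):
    """g(n): sum of step costs along a sequence of cities."""
    total = 0
    for here, there in zip(cities, cities[1:]):
        step = {nxt: cost for _, nxt, cost in successors(here)}
        total += step[there]
    return total


print(path_cost(["Arad", "Sibiu", "Fagaras", "Bucharest"]))  # 450
```

The solution from the previous slide costs 140 + 99 + 211 = 450 under these step costs.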

States vs Nodes

State: representation of a physical configuration

Node: data structure in search graph including
  state
  parent node
  action
  path cost -- g(n)
  depth

Implementing States

(defun make-node (&key state parent action (path-cost 0) (depth 0))
  (list state parent action path-cost depth))

(defun node-state (node) (car node))
(defun node-parent (node) (cadr node))
(defun node-action (node) (caddr node))
(defun node-path-cost (node) (cadddr node))
(defun node-depth (node) (car (cddddr node)))

OR

(defstruct node state (parent nil) (action nil) (path-cost 0) (depth 0))

Search Strategies

Defined by picking the order of node expansion

Strategies evaluated by:
  completeness: does it always find a solution if one exists?
  time complexity: number of nodes generated
  space complexity: maximum number of nodes in memory
  optimality: does it always find a least cost solution?

Time and space complexity are measured in terms of:
  b: maximum branching factor of the search tree (graph)
  d: depth of least-cost solution
  m: maximum depth of the state space (may be ∞)

Uninformed Search

Breadth-First

Uniform-Cost

Depth-First

Depth-Limited

Iterative Deepening

General Search

(pause here to show lisp code)

Breadth-First Search

Fringe is FIFO queue

Complete (if b is finite)

Time complexity = O(b^(d+1))

Space complexity = O(b^(d+1)) -- keeps all nodes in memory

Optimal if unit cost
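A FIFO fringe is all that distinguishes breadth-first search from the general search procedure. The following is a minimal Python sketch under stated assumptions: GRAPH is an illustrative adjacency list for a fragment of the Romania map, and the visited set is an addition that avoids repeated states (the slide describes plain tree search).

```python
from collections import deque

# Unweighted fragment of the Romania map (assumed adjacency lists).
GRAPH = {
    "Arad": ["Sibiu", "Timisoara", "Zerind"],
    "Sibiu": ["Arad", "Fagaras", "Oradea", "Rimnicu Vilcea"],
    "Timisoara": ["Arad", "Lugoj"],
    "Zerind": ["Arad", "Oradea"],
    "Fagaras": ["Sibiu", "Bucharest"],
    "Rimnicu Vilcea": ["Sibiu", "Pitesti"],
    "Pitesti": ["Rimnicu Vilcea", "Bucharest"],
    "Oradea": ["Sibiu", "Zerind"],
    "Lugoj": ["Timisoara"],
    "Bucharest": ["Fagaras", "Pitesti"],
}


def breadth_first_search(start, goal):
    """Return a path with the fewest steps from start to goal, or None."""
    fringe = deque([[start]])   # FIFO queue of paths
    visited = {start}           # states already placed on the fringe
    while fringe:
        path = fringe.popleft()           # shallowest path first
        if path[-1] == goal:
            return path
        for city in GRAPH[path[-1]]:
            if city not in visited:
                visited.add(city)
                fringe.append(path + [city])
    return None


print(breadth_first_search("Arad", "Bucharest"))
# ['Arad', 'Sibiu', 'Fagaras', 'Bucharest']
```

The returned path has the fewest steps, which is the least-cost path only when every step costs the same.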

Searching the Tree (Graph)

[Figure: Romania search tree rooted at Arad. Arad expands to Sibiu, Timisoara, and Zerind; Sibiu to Arad, Fagaras, Oradea, and Rimnicu Vilcea; Timisoara to Arad and Lugoj; Zerind to Arad and Oradea.]

Uniform-Cost Search

Fringe = queue ordered by path cost, g(n)

Equivalent to breadth-first if step costs are all equal

Complete if every step cost is at least some ε > 0

Time & space complexity: similar to breadth-first

Optimal because nodes are expanded in increasing order of path cost, g(n)
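Ordering the fringe by g(n) can be sketched with a priority queue. This is an illustrative Python sketch, not the lecture's Lisp code; ROADS and neighbors are assumed names, and the step costs are the ones implied by the g(n) values in the A* slides.

```python
import heapq

# Weighted fragment of the Romania map (step costs from the A* figure).
ROADS = [
    ("Arad", "Sibiu", 140), ("Arad", "Timisoara", 118), ("Arad", "Zerind", 75),
    ("Sibiu", "Fagaras", 99), ("Sibiu", "Rimnicu Vilcea", 80),
    ("Rimnicu Vilcea", "Pitesti", 97), ("Fagaras", "Bucharest", 211),
    ("Pitesti", "Bucharest", 101),
]


def neighbors(city):
    """Yield (next_city, step_cost) for every road touching city."""
    for a, b, cost in ROADS:
        if a == city:
            yield b, cost
        elif b == city:
            yield a, cost


def uniform_cost_search(start, goal):
    """Return (cost, path) of a least-cost path, or None."""
    fringe = [(0, start, [start])]   # priority queue ordered by g(n)
    expanded = set()
    while fringe:
        g, city, path = heapq.heappop(fringe)
        if city == goal:             # goal test on expansion, not generation
            return g, path
        if city in expanded:
            continue
        expanded.add(city)
        for nxt, cost in neighbors(city):
            if nxt not in expanded:
                heapq.heappush(fringe, (g + cost, nxt, path + [nxt]))
    return None


print(uniform_cost_search("Arad", "Bucharest"))
# (418, ['Arad', 'Sibiu', 'Rimnicu Vilcea', 'Pitesti', 'Bucharest'])
```

The four-step route through Pitesti (418) beats the three-step route through Fagaras (450), which is why the goal test is applied when a node is expanded rather than when it is generated.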

Depth First Search

Fringe is LIFO queue (stack), i.e. enqueuef = #'(lambda (x y) (append y x))

Incomplete -- fails in infinite-depth spaces or spaces with loops
  modify code to avoid repeated states
  complete in finite spaces

Time complexity is O(b^m) -- terrible if m >> d, but often much faster than breadth-first

Space complexity is O(bm) -- linear!!!

Optimal -- No. Why?

Searching the Tree (Graph)

[Figure: Romania search tree rooted at Arad. Arad expands to Sibiu, Timisoara, and Zerind; Sibiu to Arad, Fagaras, Oradea, and Rimnicu Vilcea; Timisoara to Arad and Lugoj; Zerind to Arad and Oradea.]

Other Uninformed Searches

Depth-limited is depth-first with a cutoff (nodes at maximum depth have no successors)

Iterative deepening search is iterative calls to depth-limited search, increasing depth cutoff each time

What’s the overhead?

Is it complete? What are the time and space complexities? Is it optimal?
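One way to explore these questions is to run the two procedures on a small graph. The sketch below is in Python and makes some assumptions: GRAPH is an illustrative fragment of the Romania map, the cycle check along the current path is an addition, and max_depth is an arbitrary safety bound.

```python
# Fragment of the Romania map (assumed adjacency lists).
GRAPH = {
    "Arad": ["Sibiu", "Timisoara", "Zerind"],
    "Sibiu": ["Arad", "Fagaras", "Rimnicu Vilcea"],
    "Timisoara": ["Arad"],
    "Zerind": ["Arad"],
    "Fagaras": ["Sibiu", "Bucharest"],
    "Rimnicu Vilcea": ["Sibiu", "Pitesti"],
    "Pitesti": ["Rimnicu Vilcea", "Bucharest"],
    "Bucharest": ["Fagaras", "Pitesti"],
}


def depth_limited(node, goal, limit, path):
    """Depth-first search that treats nodes at depth `limit` as leaves."""
    if node == goal:
        return path
    if limit == 0:
        return None                      # cutoff: no successors generated
    for child in GRAPH[node]:
        if child not in path:            # avoid loops along the current path
            found = depth_limited(child, goal, limit - 1, path + [child])
            if found:
                return found
    return None


def iterative_deepening(start, goal, max_depth=20):
    """Call depth_limited with cutoffs 0, 1, 2, ... until a path is found."""
    for limit in range(max_depth + 1):
        found = depth_limited(start, goal, limit, [start])
        if found:
            return found, limit
    return None


print(iterative_deepening("Arad", "Bucharest"))
# (['Arad', 'Sibiu', 'Fagaras', 'Bucharest'], 3)
```

Each iteration repeats the work of the shallower ones, but only the current path is held in memory at any time.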

Eliminating Repeated States

Failure to detect repeated states can turn a linear problem into an exponential one

Fixing this turns tree search into graph search

Learning from Observations

Types of Learning

Supervised

Unsupervised

Reinforcement

Given (x, f(x)), what is f?

Inductive Learning

Given a collection of examples (x, f(x)), find h that approximates f

h is the hypothesis

a good hypothesis is one that generalizes well

a hypothesis space, H, is the set of hypotheses that are considered

Regression

Linear regression

Polynomial regression

Ockham’s razor

Realizable and unrealizable learning problems

a learning problem is realizable if H contains f

Tradeoff between expressiveness of H and the complexity of finding simple, consistent h

Classification

Learning a discrete-valued function is called classification

f(vertebrate) = (fish, amphibian, reptile, bird, mammal)

a decision tree takes as input an object described by attributes and returns a decision

Blood Temperature?
  warm: Has Feathers?
    yes: BIRD
    no: MAMMAL
  cold: Has Scales?
    yes: Has Fins?
      yes: FISH
      no: REPTILE
    no: AMPHIBIAN

Boolean Classification

Special case in which f(x) = yes/no

∀x f(x) ⇔ (P1(x) ∨ P2(x) ∨ ... ∨ Pn(x)) where

Pi(x) is the conjunction of all tests along path i from root to leaf, for all i such that leaf is “yes”

Since it is one variable and all predicates are unary, this is actually propositional

Decision trees are fully expressive for the class of propositional languages

Inducing Decision Trees from Examples

Training Set: a list of examples of attribute value pairs with their corresponding outcomes

Test Set: another set of examples -- never used for subsequent training

(defun learn-decision-tree (examples attributes default)
  (cond
    ((null examples) default)
    ((all-same-classification examples) (classification (car examples)))
    ((null attributes) (most-frequent-classification examples))
    (t (let ((best (choose-attribute attributes examples))
             (m (most-frequent-classification examples)))
         (make-decision-tree
           :root best
           :subtrees (mapcar (lambda (examples-of-each)
                               (learn-decision-tree examples-of-each
                                                    (remove best attributes)
                                                    m))
                             (divide-examples examples best)))))))

Information

if the possible answers v_i have probabilities P(v_i), then the information content of the actual answer is

I(P(v_1), ..., P(v_n)) = ∑_i -P(v_i) log2 P(v_i)

I(0.5, 0.5) = -0.5(-1) - 0.5(-1) = 1

I(0.25, 0.25, 0.25, 0.25) = 4(-0.25)(-2) = 2

How much information is still needed after an attribute is selected?

Remainder(A) = ∑_i ((p_i + n_i)/(p + n)) I(p_i/(p_i + n_i), n_i/(p_i + n_i))

where p and n are the numbers of positive and negative examples, and attribute A splits them into subsets with p_i positive and n_i negative examples

Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A)

(choose-attribute attributes examples) chooses the attribute with the largest gain
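The information, remainder, and gain formulas are easy to check numerically. The following Python sketch is illustrative only: the function names are assumptions, and the two example splits at the bottom are hypothetical data, not from the lecture. Each split is a list of (p_i, n_i) pairs, one per attribute value.

```python
from math import log2


def information(*probs):
    """I(P(v_1), ..., P(v_n)) in bits; terms with probability 0 contribute 0."""
    return sum(-p * log2(p) for p in probs if p > 0)


def remainder(splits):
    """Expected information still needed after splitting on an attribute."""
    total = sum(p_i + n_i for p_i, n_i in splits)
    return sum(
        (p_i + n_i) / total * information(p_i / (p_i + n_i), n_i / (p_i + n_i))
        for p_i, n_i in splits
    )


def gain(splits):
    """Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A)."""
    p = sum(p_i for p_i, _ in splits)
    n = sum(n_i for _, n_i in splits)
    return information(p / (p + n), n / (p + n)) - remainder(splits)


print(information(0.5, 0.5))                 # 1.0
print(information(0.25, 0.25, 0.25, 0.25))   # 2.0

# Hypothetical attributes over 3 positive and 3 negative examples.
perfect = [(3, 0), (0, 3)]   # every subset is pure
useless = [(1, 1), (2, 2)]   # every subset is still half and half
print(round(gain(perfect), 3), round(gain(useless), 3))  # 1.0 0.0
```

An attribute-selection function along the lines of choose-attribute would compute gain for each remaining attribute and return the one with the largest value.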

Rule-Based Expert Systems

Rule-Based Systems

Rule-based systems are structured as two “memories”

Knowledge base of rules

Working memory of facts

These can be logic based and use unification to determine rule satisfaction

Rules = definite Horn clauses (multiple FOL conditions, single FOL assertion)

Facts = Horn facts (FOL predicates with terms including variables)

Or “production systems” that use pattern matching instead of unification

Rules can have multiple assertions as well as multiple conditions

Facts have no variables

Structure of Production Rules

Production rules have a left-hand side (LHS) and a right-hand side (RHS) corresponding to a condition and action respectively.

LHS is a conjunction of patterns with at least one pattern non-negated.

RHS is a set of actions that modify working memory.

Sample Rules

(((agent (location =x) (holding =a)) (arrow (name =a)) (wumpus (location (& =y !x))) (room (name =x) (adjacent =y)))

((MODIFY 1 (holding nil)) (ASSERT shoot (arrow =a) (at =y))) )

(((shoot (arrow =x) (at =y)) (=object (location =y)))

((MODIFY 2 (status dead)) (REMOVE 1)) )

Sample Working Memories

((agent (location 1) (holding A)) (arrow (name A)) (wumpus (location 2)) (room (name 1) (adjacent 2)))

((agent (location 1) (holding nil)) (arrow (name A)) (wumpus (location 2)) (room (name 1) (adjacent 2)) (shoot (arrow A) (at 2)))

((agent (location 1) (holding nil)) (arrow (name A)) (wumpus (location 2) (status dead)) (room (name 1) (adjacent 2)))


Early Ground-Breaking Rule-Based Systems

R1: Digital Equipment Corporation’s expert system for configuring large computer systems – a sales force tool to accurately meet customer requirements. Forward Chaining.

ACE: AT&T Bell Laboratories’ expert system for diagnosing telephone line repairs based on trouble tickets. Forward Chaining.

MYCIN: Bacterial infection diagnosis including explanation system. Backward Chaining.

Dendral: Elucidation of the molecular structure of unknown organic compounds taken from known groups of such compounds. Backward Chaining.

Rule Inference Engines

Like Horn clause forward chaining (or backward chaining) algorithms, rule inference engines find rules that are satisfied and fire them, thus altering the working memory and enabling other rules to be satisfied (and fired, etc.)

Rule inference engines repeat three steps:

Match: determine which rules are satisfied by which facts in working memory producing the conflict set of instantiations

Conflict Resolution: select which instantiation is best

Act: Fire the instantiation, altering working memory

Rule inference halts either when no rule matches or when an explicit halt action is fired
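The match, conflict resolution, act cycle can be sketched for the simplest case of variable-free rules. This Python sketch is an illustration under several assumptions: rules are dictionaries with "if", "add", and "remove" sets, conflict resolution uses specificity only, and each rule fires at most once (a crude stand-in for time-tag-based refraction). The two rules are the ones used in the Combining Rule Chains slide later in this lecture.

```python
def run(rules, facts, max_cycles=100):
    """Run match / conflict resolution / act until no rule matches."""
    wm = set(facts)          # working memory of ground facts
    fired = set()            # simplification: each rule fires at most once
    trace = []
    for _ in range(max_cycles):
        # Match: rules whose conditions are all present in working memory.
        conflict_set = [r for r in rules
                        if r["if"] <= wm and r["name"] not in fired]
        if not conflict_set:
            break            # halt: no rule matches
        # Conflict resolution: prefer the most specific rule.
        chosen = max(conflict_set, key=lambda r: len(r["if"]))
        # Act: fire the rule, altering working memory.
        wm = (wm - chosen["remove"]) | chosen["add"]
        fired.add(chosen["name"])
        trace.append(chosen["name"])
    return wm, trace


RULES = [
    {"name": "r1", "if": {"A", "B", "C"}, "add": {"D", "E"}, "remove": set()},
    {"name": "r2", "if": {"B", "D"}, "add": {"F"}, "remove": {"D"}},
]

wm, trace = run(RULES, {"A", "B", "C"})
print(sorted(wm), trace)   # ['A', 'B', 'C', 'E', 'F'] ['r1', 'r2']
```

Firing r1 is what makes r2 satisfiable, so the two rules necessarily fire in sequence; the net effect on working memory is to add E and F.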

Conflict Resolution

Many alternative algorithms to select which instantiation should fire

Two factors often used to select the instantiation are recency and specificity

Recency: time tags on WMEs

Specificity: How many individual values matched

Conflict resolution algorithms are linear

Inefficient Match Algorithm

For each LHS, match each clause against each working memory element (WME), maintaining variable binding consistency among clauses within the same LHS

For each set of WMEs that successfully match the clauses on a LHS, add the instantiation, that is, the pair (Rule, WME-set), to the conflict set

After conflict resolution and act phases, repeat match with all LHSs and all WMEs including changes

Observation

Changes in WM result in changes in the conflict set

Therefore, only the changes in WM should be addressed in an efficient match algorithm

Rete Match Algorithm

Precompile rules into a dataflow network (rete) data structure

Nodes in the rete represent the match of one value

Nodes store partial instantiations

Leaves of the rete represent complete instantiations

Sample Rete

LHS: ((agent (location =x) (holding =a)) (arrow (name =a)) (wumpus (location (& =y !x))) (room (name =x) (adjacent =y)))

Working memory elements:
P: (agent (location 1) (holding A))
Q: (arrow (name A))
R: (wumpus (location 2))
S: (room (name 1) (adjacent 2))

Pattern tests:
class: agent; location: =x; holding: =a
class: arrow; name: =a
class: wumpus; location: =y !x
class: room; name: =x; adjacent: =y

Partial instantiations stored at nodes:
P: =x 1
P: =a A
Q: =a A
R: =y 2
S: =x 1
S: =y 2
P,Q: =a A
P,R: =x 1, =y 2
R,S: =y 2
P,R,S: =x 1, =y 2

Complete instantiation (leaf):
P,Q,R,S: =a A, =x 1, =y 2

Improving Rule-Based System Performance

Performance and Parallelism

• many patterns and many facts often result in poor performance of rule-based systems, even with state-saving algorithms such as rete

• since rule bases are collections of unordered rules, and working memories are collections of unordered facts (time tags used for recency heuristic, not to impose a specific, unbreakable order), parallel processing is an obvious approach for performance improvement

Relevance and Match Complexities

• relevance complexity — complexity of determining intracondition satisfaction of a rule, i.e. the constants

• match complexity — complexity of determining intercondition satisfaction of a rule, i.e. the variables

Rule Parallelism

• distribute one rule per processor, keeping only relevant working memory for each rule

• during match, process all rules simultaneously

• perform conflict resolution via a parallel resolve algorithm — perhaps in O(log n) time

• act phase broadcasts changes to all processors, absorbed by relevant rules

Problems with Rule Parallelism

• inherent sequential nature — even if a given cycle is faster, each cycle must be done in sequence

• temporal redundancy — small changes to working memory per cycle means little work to be done in each cycle

• culprit rules — certain rules require substantially more match time than others, leading to load imbalance among processors; these rules are high in match complexity

Considering Node Parallelism

• better approach is to distribute rete nodes across parallel processors rather than rules

• each processor has its local memory to store the partial instantiations

• however, the three problems still exist — inherent sequential nature, temporal redundancy, and culprit nodes (rather than rules)

Addressing Inherent Sequential Nature and Temporal Redundancy

• combining rule chains — rules which often fire in succession are combined into macro rules with more complex LHSs and resulting in more changes to working memory on their RHSs

• multiple rule firing for non-conflicting rules

• specifying rule sets — flow of control is abstracted out of rules, explicitly identifying sequential requirements

Combining Rule Chains

• if A, B, and C then make D and E

• if B and D then make F and remove D

becomes

• if A, B, and C then make E and F

• perhaps more rules created, but such rules are parallelizable

Addressing Culprit Rules/Nodes

• creating constrained copies of culprit rules/nodes — culprit rules/nodes are copied many times, each copy constrained to be relevant to a distinct subset of possible working memory elements

• this process results in a shift to increasing relevance complexity and decreasing match complexity — i.e. fewer variable binding tests (joins) and more constant tests (selects)

• relevance complexity is easily parallelized whereas match complexity causes load imbalance

(p join-pieces
    (goal ^type try-join ^id1 nil ^id2 nil)
    (piece ^color <x> ^id <i>)
    (piece ^color <x> ^id { <j> <> <i> })
    -->
    (modify 1 ^id1 <i> ^id2 <j>))

Rete of Join-Pieces

[Figure: rete for join-pieces. Constant tests: class = goal, then type = try-join, id1 = nil, id2 = nil; class = piece. Join nodes: join equal colors, join unequal ids, natural join.]

Analyzing the Culprit

• say n = 1000 puzzle pieces

• upon adding the goal of type try-join, n^2 = 1,000,000 tests occur at join node

• suppose that node only looks at RED pieces and there are 20 colors, evenly distributed

(p join-pieces-RED
    (goal ^type try-join ^id1 nil ^id2 nil)
    (piece ^color RED ^id <i>)
    (piece ^color RED ^id { <j> <> <i> })
    -->
    (modify 1 ^id1 <i> ^id2 <j>))

Analyzing the Culprit

• each copy is relevant to n ≈ 50 puzzle pieces

• upon adding the goal of type try-join, each copy only performs n^2 = 2,500 tests and all are performed in parallel

• 400-fold speed improvement with 20 processors

(cc piece color)
...
(p join-pieces-1
    (goal ^type try-join ^id1 nil ^id2 nil)
    (piece ^color <X> ^hash-color 1 ^id <i>)
    (piece ^color <X> ^hash-color 1 ^id { <j> <> <i> })
    -->
    (modify 1 ^id1 <i> ^id2 <j>))

How Many Copies?

• if only one occurrence of a variable, n copies

• if two or more occurrences of a variable, yet they are all tested to be equal to each other, still only n copies

• however, if m different variables occur, or if one color must be <x> whereas another must be <> <x>, then n^m copies

(cc piece color)
...
(p join-pieces-a,b
    (goal ^type try-join ^id1 nil ^id2 nil)
    (piece ^color <X> ^hash-color a ^id <i>)
    (piece ^color <> <X> ^hash-color b ^id { <j> <> <i> })
    -->
    (modify 1 ^id1 <i> ^id2 <j>))

Copy-Constrain Speed Up

• in purely equijoin conditions, linear speed up, even without parallel processing, and an additional linear speed up with parallel processing, resulting in quadratic speed up

• in non-equijoin conditions, no speed up without parallelism, but linear speed up with parallel processing

Overall Performance Improvements

Uncertainty and Bayes’s Rule

Logic Theory vs Decision Theory

In propositional logic, propositional variables can be true or false in different models

In probabilistic decision theory, propositional variables have associated probabilities

Probabilities assign a degree of belief rather than a degree of truth

Random Variables

Boolean random variables: Cavity, Toothache, Catches (each = <true, false>)

P(Cavity=true) sometimes written as P(cavity)

Discrete random variables: Weather = <sunny, rainy, cloudy, snow>

P(Weather=sunny) sometimes written as P(sunny)

Continuous random variables: variables with real number values

Probability Axioms

0 ≤ P(a) ≤ 1

P(true) = 1; P(false) = 0

P(a ∨ b) = P(a) + P(b) - P(a ∧ b)

For discrete random variables ∑ P(V=v) = 1

Atomic Events

Cavity  Toothache  Catches  Atomic Event                           P
T       T          T        cavity ∧ toothache ∧ catches           0.108
T       T          F        cavity ∧ toothache ∧ ¬catches          0.012
T       F          T        cavity ∧ ¬toothache ∧ catches          0.072
T       F          F        cavity ∧ ¬toothache ∧ ¬catches         0.008
F       T          T        ¬cavity ∧ toothache ∧ catches          0.016
F       T          F        ¬cavity ∧ toothache ∧ ¬catches         0.064
F       F          T        ¬cavity ∧ ¬toothache ∧ catches         0.144
F       F          F        ¬cavity ∧ ¬toothache ∧ ¬catches        0.576

Atomic events are mutually exclusive and collectively exhaustive (MECE); ∑ P(e) = 1

Prior and Conditional Probabilities

The prior probability of a proposition is the degree of belief we have that it is true with no additional evidence — P(cavity) = 0.2

The conditional probability can change based on evidence — P(cavity|toothache) = 0.6

The Product Rule: P(a ∧ b) = P(b|a)P(a)
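Both example values on this slide can be recovered by summing entries of the atomic-event table. The following is a minimal Python sketch; the names JOINT, prob, and conditional are illustrative assumptions, and the eight probabilities are copied from the table.

```python
# Joint distribution over (cavity, toothache, catches), copied from the
# atomic-event table in the lecture.
JOINT = {
    (True, True, True): 0.108, (True, True, False): 0.012,
    (True, False, True): 0.072, (True, False, False): 0.008,
    (False, True, True): 0.016, (False, True, False): 0.064,
    (False, False, True): 0.144, (False, False, False): 0.576,
}


def prob(event):
    """Sum the probabilities of the atomic events in which `event` holds."""
    return sum(p for world, p in JOINT.items() if event(*world))


def conditional(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    both = prob(lambda c, t, k: event(c, t, k) and given(c, t, k))
    return both / prob(given)


def cavity(c, t, k):
    return c


def toothache(c, t, k):
    return t


print(round(prob(cavity), 3))                    # 0.2
print(round(conditional(cavity, toothache), 3))  # 0.6
```

The product rule follows from the definition used in conditional: multiplying P(toothache|cavity) by P(cavity) gives back P(cavity ∧ toothache) = 0.12.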

Inference and Independence

Exhaustive, complete inference algorithm simply calculates all conditional probabilities to find best next action -- O(2^n)

If some random variables can be deemed independent based on domain knowledge, they can be factored out

Bayes’s Rule

Since conjunction is commutative,
P(a ∧ b) = P(a|b)P(b) = P(b|a)P(a)

Thus,
P(b|a) = P(a|b)P(b)/P(a)

The conditional probability of b given a is equal to the conditional probability of a given b times the ratio of the prior probabilities of b and a

Using Bayes’s Rule

Known: if there is a cavity, there is a 90% chance that the dentist tool will catch

Known: prior probabilities of cavities and catches are 20% and 28% respectively

What is the probability of having a cavity if the tool catches?

P(cavity|catches) = P(catches|cavity)P(cavity)/P(catches) = 64%
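The arithmetic behind the 64% figure can be checked in a few lines. This is a minimal Python sketch; the function name bayes is an assumption, and the three inputs are the values stated on this slide.

```python
def bayes(p_a_given_b, p_b, p_a):
    """Bayes's rule: P(b|a) = P(a|b) P(b) / P(a)."""
    return p_a_given_b * p_b / p_a


# P(catches|cavity) = 0.9, P(cavity) = 0.2, P(catches) = 0.28
posterior = bayes(0.9, 0.2, 0.28)
print(f"{posterior:.0%}")   # 64%
```

The unrounded value is 0.18 / 0.28 ≈ 0.643.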

Combining Evidence

P(Cause|Effect1∧Effect2∧...∧EffectN) is difficult to compute because we need the conditional probabilities of the conjunction for each value of Cause -- O(2^n)

However, if the Effects are conditionally independent of each other given the Cause (even if they are all dependent on the Cause), then you can compute
P(Effect1∧Effect2∧...∧EffectN|Cause) = P(Effect1|Cause)P(Effect2|Cause)...P(EffectN|Cause)
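The factored form can be combined with Bayes's rule and normalization to get a posterior over the cause. The Python sketch below is illustrative: the function name posterior is an assumption, and the conditional probabilities are derived from the lecture's atomic-event table (for example P(toothache|cavity) = 0.12/0.2 = 0.6 and P(catches|¬cavity) = 0.16/0.8 = 0.2).

```python
def posterior(prior, likelihoods):
    """Combine evidence assumed independent given the cause.

    prior:       {cause: P(cause)}
    likelihoods: {cause: [P(effect_1|cause), ..., P(effect_n|cause)]}
    Returns {cause: P(cause | effect_1 and ... and effect_n)}.
    """
    scores = {}
    for cause, p in prior.items():
        for likelihood in likelihoods[cause]:
            p *= likelihood              # product of per-effect conditionals
        scores[cause] = p
    total = sum(scores.values())         # normalizing constant P(effects)
    return {cause: score / total for cause, score in scores.items()}


prior = {"cavity": 0.2, "no cavity": 0.8}
likelihoods = {
    "cavity": [0.6, 0.9],      # P(toothache|cavity), P(catches|cavity)
    "no cavity": [0.1, 0.2],   # P(toothache|no cavity), P(catches|no cavity)
}
result = posterior(prior, likelihoods)
print(round(result["cavity"], 3))   # 0.871
```

The products 0.6 × 0.9 × 0.2 = 0.108 and 0.1 × 0.2 × 0.8 = 0.016 match the corresponding rows of the table, so the independence assumption holds exactly for that data.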

Heuristic Search

Best-First Search

Use an evaluation function f(n) for each node

estimates “desirability”

Expand most desirable node in fringe

Enqueueing function maintains fringe in order of f(n) -- smallest (lowest cost) first

Two approaches: Greedy and A*

Romania

Map of roads between cities with distances (as used in uninformed search)

Straight-line distances to Bucharest from each city (as the crow flies)

Arad 366, Bucharest 0, Craiova 160, etc...

Greedy Best-First Search

We introduce h(n): a heuristic function that estimates the cost from n to goal

Evaluation function f(n) = h(n),

h(n) = straight line distance from state(n) to Bucharest

Greedy Best-First Search expands the node that appears to be closest to the goal

[Figure: greedy best-first expansion with h values. Arad 366 expands to Sibiu 253, Timisoara 329, Zerind 374; Sibiu expands to Arad 366, Fagaras 176, Oradea 380, Rimnicu 193; Fagaras expands to Sibiu 253, Bucharest 0.]

Properties of Greedy Search

Complete? No (if tree search) – can get stuck in loops; Yes if repeated nodes are eliminated (graph search)

Time? O(b^m), but a good heuristic dramatically improves performance

Space? O(b^m), keeps all nodes in memory

Optimal? No. Greedy search is like heuristic depth-first

A* Search

Avoid expanding paths that are already expensive

f(n) = g(n) + h(n)

g(n) = path cost from initial to n

h(n) = estimated cost from n to goal

f(n) = estimated cost from initial to goal through n
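A fringe ordered by f(n) = g(n) + h(n) can be sketched as follows. This is an illustrative Python version rather than the lecture's Lisp code: ROADS, H, and neighbors are assumed names, the step costs and straight-line distances are the values shown in the lecture's figures for a fragment of the map, and the check that skips cities already on the current path is an addition.

```python
import heapq

# Weighted fragment of the Romania map (step costs from the A* figure).
ROADS = [
    ("Arad", "Sibiu", 140), ("Arad", "Timisoara", 118), ("Arad", "Zerind", 75),
    ("Sibiu", "Fagaras", 99), ("Sibiu", "Rimnicu Vilcea", 80),
    ("Rimnicu Vilcea", "Pitesti", 97), ("Fagaras", "Bucharest", 211),
    ("Pitesti", "Bucharest", 101),
]

# h(n): straight-line distance to Bucharest.
H = {"Arad": 366, "Sibiu": 253, "Timisoara": 329, "Zerind": 374,
     "Fagaras": 176, "Rimnicu Vilcea": 193, "Pitesti": 100, "Bucharest": 0}


def neighbors(city):
    """Yield (next_city, step_cost) for every road touching city."""
    for a, b, cost in ROADS:
        if a == city:
            yield b, cost
        elif b == city:
            yield a, cost


def a_star(start, goal):
    """Return (cost, path) using a fringe ordered by f = g + h."""
    fringe = [(H[start], 0, start, [start])]   # entries are (f, g, city, path)
    while fringe:
        f, g, city, path = heapq.heappop(fringe)
        if city == goal:
            return g, path
        for nxt, cost in neighbors(city):
            if nxt not in path:                # skip loops on this path
                g2 = g + cost
                heapq.heappush(fringe, (g2 + H[nxt], g2, nxt, path + [nxt]))
    return None


print(a_star("Arad", "Bucharest"))
# (418, ['Arad', 'Sibiu', 'Rimnicu Vilcea', 'Pitesti', 'Bucharest'])
```

Bucharest is first generated through Fagaras with f = 450, but Pitesti (f = 417) is expanded before it, which puts the cheaper Bucharest node (f = 418) at the front of the fringe.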

[Figure: A* expansion with f = g + h at each node.
Arad 366=0+366
  Sibiu 393=140+253
    Arad 646=280+366
    Fagaras 415=239+176
      Sibiu 591=338+253
      Bucharest 450=450+0
    Oradea 671=291+380
    Rimnicu 413=220+193
      Craiova 526=366+160
      Pitesti 417=317+100
        Bucharest 418=418+0
        Craiova 615=455+160
      Sibiu 553=300+253
  Timisoara 447=118+329
  Zerind 449=75+374]

Admissible Heuristics

A heuristic h(n) is admissible if for every node n, h(n) <= h*(n), where h*(n) is the true cost to reach the goal from n

An admissible heuristic thus never overestimates the cost to reach the goal -- that is, it must be optimistic

For example, straight line distance is admissible

Theorem: if h(n) is admissible, A* using tree-search is optimal

Proof of Optimality of A*

Suppose some suboptimal goal G2 has been generated (in the fringe). Let n be an unexpanded node in the fringe such that n is on the shortest path to an optimal goal G.

f(G2) = g(G2) since h(G2) = 0
g(G2) > g(G) since G2 is suboptimal
f(G) = g(G) since h(G) = 0
f(G2) > f(G) from above
h(n) <= h*(n) since h is admissible
g(n) + h(n) <= g(n) + h*(n)
g(n) + h*(n) = g(G) = f(G) since n is on an optimal path to G
f(n) <= f(G)

Hence f(G2) > f(n), so A* will never select G2 for expansion

[Figure: search tree with start node S, fringe node n on the path to G, and suboptimal goal G2]

A* Tree vs Graph Search

A* with admissible h is optimal for tree search

Not so for graph search: A* may discard repeated states even if cheaper routes to them (i.e. lower g(n)) are found

Fix in two ways
  modify graph search to check and replace repeated state nodes with cheaper alternatives
  leave graph search as is, but insist on consistent h(n)

Consistent Heuristics

A heuristic is consistent if for every node n, every successor n’ of n generated by action a, h(n) <= c(n,a,n’) + h(n’)

Consistent heuristics satisfy triangularity

Difficult to concoct an admissible yet inconsistent heuristic

If h is consistent,
f(n’) = g(n’) + h(n’)
      = g(n) + c(n,a,n’) + h(n’)
      >= g(n) + h(n)
      = f(n)

That is, f(n) is non-decreasing along any path

Theorem: If h(n) is consistent, A* using graph-search is optimal

Properties of A*

Complete? Yes (unless there are infinitely many nodes with f <= f(G))

Time? Exponential

Space? Keeps all nodes in memory

Optimal? Yes

Admissible Heuristics

For example, the 8-puzzle

h1(n) = number of misplaced tiles

h2(n) = total Manhattan distance

Start state S:    Goal state:
7 2 4             _ 1 2
5 _ 6             3 4 5
8 3 1             6 7 8

h1(S) = 8

h2(S) = 3+1+2+2+2+3+3+2 = 18
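Both heuristic values can be reproduced with a short script. The sketch below is in Python and rests on one assumption: the blank squares are taken to be the center of the start state and the top-left corner of the goal state, which is the reading consistent with the sum 3+1+2+2+2+3+3+2. States are tuples read row by row, with 0 for the blank.

```python
# 8-puzzle states read row by row; 0 marks the blank (assumed positions).
START = (7, 2, 4,
         5, 0, 6,
         8, 3, 1)
GOAL = (0, 1, 2,
        3, 4, 5,
        6, 7, 8)


def h1(state, goal=GOAL):
    """Number of misplaced tiles (the blank is not counted)."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)


def h2(state, goal=GOAL):
    """Total Manhattan distance of tiles 1..8 from their goal squares."""
    total = 0
    for tile in range(1, 9):
        row, col = divmod(state.index(tile), 3)
        goal_row, goal_col = divmod(goal.index(tile), 3)
        total += abs(row - goal_row) + abs(col - goal_col)
    return total


print(h1(START), h2(START))   # 8 18
```

Every tile that is out of place needs at least one move, so h2 is never smaller than h1 on any state.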

Dominance

For admissible heuristics h1 and h2, h2 dominates h1 if h2(n) >= h1(n) for all n

Typical time complexities (number of expanded nodes) for 8-puzzle

d = 12: IDS = 3,644,035; A*(h1) = 227; A*(h2) = 73

d = 24: IDS = too many; A*(h1) = 39,135; A*(h2) = 1,641

Relaxation

Finding heuristics systematically by relaxing a problem

A problem with fewer restrictions on actions is a relaxed problem

The cost of an optimal solution to the relaxed problem is an admissible heuristic for the original problem

For 8-puzzle, allowing tiles to move anywhere generates h1 and allowing tiles to move to any adjacent square generates h2

For Romania problem, straight line distance is a relaxation generating its heuristic

Local Search

Local Search: Goal = Solution

Integrated circuit design: How should the circuitry be laid out on the chip to optimize space and function?

Job-shop scheduling: How should resources (human or equipment) be allocated and scheduled optimally?

Portfolio management: How should financial assets be allocated to optimize investment goals in a market environment?

Local Search Algorithms

Path is irrelevant

State space is set of “complete” configurations

Find a configuration that satisfies constraints

Keep a single “current” state; try to improve it

N-Queens Problem

Put n queens on an n x n board with no two queens attacking each other


Hill-Climbing Search

“Like climbing a mountain in thick fog with amnesia”

Depending on initial state, can get stuck in local maxima

Hill-Climbing Search

;; Assumes a node structure with VALUE and STATE slots,
;; e.g. (defstruct node value state).
(defun hill-climb (state successorf evalf)
  (let ((next (best (funcall successorf state) state evalf)))
    (if (eql state next)
        state
        (hill-climb next successorf evalf))))

(defun best (neighbors state evalf)
  (select-max
    (make-node :value (funcall evalf state) :state state)
    (mapcar #'(lambda (s) (make-node :value (funcall evalf s) :state s))
            neighbors)))

(defun select-max (best-so-far rest)
  (cond
    ((null rest) (node-state best-so-far))
    ;; >= keeps the current best on ties, so the climb stops on a plateau
    ;; instead of stepping sideways forever
    ((>= (node-value best-so-far) (node-value (car rest)))
     (select-max best-so-far (cdr rest)))
    (t (select-max (car rest) (cdr rest)))))

Hill-Climbing the 8-Queens

h = number of pairs of attacking queens

successorf = move one queen along its column

18 12 14 13 13 12 14 14
14 16 13 15 12 14 12 16
14 12 18 13 15 12 14 14
15 14 14  Q 13 16 13 16
 Q 14 17 15  Q 14 16 16
17  Q 16 18 15  Q 15  Q
18 14  Q 15 15 14  Q 16
14 14 13 17 12 14 12 18

(each number is the value of h after moving that column's queen to that square)

h=17
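The value h = 17 for the board above can be checked by counting attacking pairs directly. In this Python sketch a state is a list giving, for each column from left to right, the row of its queen counted from the top; that encoding and the function name are assumptions. Pairs on a shared row or diagonal are counted even when another queen stands between them.

```python
from itertools import combinations


def attacking_pairs(rows):
    """Number of queen pairs sharing a row or a diagonal.

    rows[c] is the row of the queen in column c (one queen per column).
    """
    return sum(
        1
        for (c1, r1), (c2, r2) in combinations(enumerate(rows), 2)
        if r1 == r2 or abs(r1 - r2) == abs(c1 - c2)
    )


# Queen rows read off the board above, column by column.
BOARD = [5, 6, 7, 4, 5, 6, 7, 6]
print(attacking_pairs(BOARD))                      # 17

# A known 8-queens solution has no attacking pairs.
print(attacking_pairs([1, 5, 8, 6, 3, 7, 2, 4]))   # 0
```

Since the lecture's hill-climb maximizes evalf, a function like this would be negated before being passed in.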

A Local Minimum: h=1

[Figure: 8-queens board at a local minimum with one pair of attacking queens]

Local Beam Search

Keep track of k states rather than just one

Start with k randomly generated states

At each iteration, generate all successors of all k states

If any one is a goal, stop; else select the k best successors and repeat

Why is this different than just running hill-climbing k times in parallel?

Stochastic Local Beam Search

Like local beam search except select k next states randomly with probability proportional to value

Addresses clustering issues that arise from local beam search

Genetic Algorithms

A successor state is generated by combining two parent states

Start with k randomly generated states: the population

States are represented as strings over a finite alphabet

Fitness function returns higher values for better states

Produce the next generation (next k states) by selection, crossover, mutation

Genetic 8-Queens

Fitness function = number of non-attacking pairs (goal = 28)

Selection probability is proportional to fitness, e.g. 24 ÷ (24+23+20+11) = 31%

Adversarial Search

Adversarial Search

Optimal decisions with the Minimax algorithm

αβ pruning

Imperfect decisions

Adversarial vs Search

Unpredictable opponent

explore moves for all possible replies

assume worst case (perfect opponent)

Time limits force suboptimal search

Game Tree

Two player game, deterministic, turns

Minimax Defined

Minimax(node) =

Utility(node) if terminal

maximum of Minimax of each successor if Max node

minimum of Minimax of each successor if Min node

Minimax Algorithm

Perfect play for deterministic games: Choose move with highest minimax value

This results in the best achievable payoff against best opponent

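[Figure: two-ply game tree. Leaf utilities 3, 12, 8 | 2, 4, 6 | 14, 5, 2; the three min nodes have values 3, 2, 2; the max root has value 3.]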
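The minimax definition translates almost line for line into a recursive function. This is a minimal Python sketch, not the lecture's Lisp code; representing a game tree as nested lists with numeric leaves is an assumption made for brevity. The example tree uses the leaf utilities from the slide.

```python
def minimax(node, maximizing=True):
    """Minimax value of a game tree given as nested lists of utilities."""
    if not isinstance(node, list):       # terminal node: return its utility
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


# Max root with three min children, leaves as on the slide.
TREE = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]

print([minimax(child, maximizing=False) for child in TREE])  # [3, 2, 2]
print(minimax(TREE))                                         # 3
```

The move to choose at the root is the one leading to the child with the highest value, here the first.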

Minimax Algorithm

(pause for Lisp code)

Properties of Minimax

Complete (if tree is finite)

Optimal against optimal opponent

Time complexity = O(b^m)

Space complexity = O(bm)

For chess, b ≈ 35, m ≈ 100; exact solution is infeasible

αβ pruning

keep track of best and worst possible minimax values and don’t pursue paths that cannot improve over these values

α is the value of the best choice found so far along the path for max

if value is worse than α, max-node will avoid it, pruning that branch

similarly define β for min
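The bookkeeping for α and β can be sketched as an extension of recursive minimax. This Python sketch is illustrative: the nested-list tree representation and the optional visited list (used only to record which leaves are evaluated) are assumptions. The example tree uses the leaf utilities 3, 12, 8 | 2, 4, 6 | 14, 5, 2 from the minimax slide.

```python
import math


def alphabeta(node, alpha=-math.inf, beta=math.inf, maximizing=True,
              visited=None):
    """Minimax value with alpha-beta pruning on a nested-list game tree."""
    if not isinstance(node, list):       # terminal node
        if visited is not None:
            visited.append(node)         # record which leaves were evaluated
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False, visited))
            alpha = max(alpha, value)    # best choice so far for max
            if alpha >= beta:
                break                    # min would never allow this branch
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True, visited))
        beta = min(beta, value)          # best choice so far for min
        if alpha >= beta:
            break                        # max already has something better
    return value


TREE = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
seen = []
print(alphabeta(TREE, visited=seen))   # 3
print(seen)                            # [3, 12, 8, 2, 14, 5, 2]
```

Leaves 4 and 6 are never evaluated: once the second min node is known to be worth at most 2, it cannot beat the 3 already available to max.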

αβ pruning

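[Figure: αβ pruning on the same game tree. Leaves examined: 3, 12, 8 | 2 | 14, 5, 2. The first min node has value 3, so the root is ≥3; the second min node is ≤2 after its first leaf and the rest of it is pruned; the third min node goes ≤14, ≤5, ≤2; the root value is 3.]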

Properties of αβ pruning

Pruning does not affect final result -- still complete and optimal

Good move ordering improves pruning effectiveness

With perfect ordering, time complexity becomes O(b^(m/2)), effectively doubling search depth

Reasoning about relevant computations — i.e. metareasoning

Resource Limitations

In a timed game, suppose you have 100 seconds per move

Suppose we can explore 10^4 nodes/second

Thus 10^6 nodes/move

Must cut off the search at the appropriate depth, rather than using the complete tree

Use cutoff function for terminalp

Use evaluation function (heuristic) instead of utility for evalp

evaluation functions typically weighted sum of feature functions

∑_i w_i f_i(s)

For chess, b^m = 10^6 and b = 35 means m ≈ 4; 4-ply look ahead is a terrible chess player

Games involving chance

E.g. Backgammon

Insert chance nodes between max and min nodes

Minimax(n) =
  Utility(n) if terminal
  max of minimax of successors if max node
  min of minimax of successors if min node
  ∑ p(s) ⋅ minimax(s) if chance node
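The four cases above can be sketched as one recursive function. This Python sketch is illustrative only: the tagged-tuple tree representation is an assumption, and the small example tree is hypothetical rather than a real backgammon position.

```python
def expectiminimax(node):
    """Minimax extended with chance nodes.

    A node is a number (terminal utility) or a tuple:
      ("max", [children]), ("min", [children]),
      ("chance", [(probability, child), ...]).
    """
    if not isinstance(node, tuple):
        return node
    kind, children = node
    if kind == "max":
        return max(expectiminimax(child) for child in children)
    if kind == "min":
        return min(expectiminimax(child) for child in children)
    # Chance node: probability-weighted average of successor values.
    return sum(p * expectiminimax(child) for p, child in children)


# Hypothetical tree: max chooses between two moves, each followed by a
# chance event.
TREE = ("max", [
    ("chance", [(0.5, ("min", [2, 8])), (0.5, ("min", [4, 6]))]),
    ("chance", [(0.9, 1), (0.1, 20)]),
])
print(expectiminimax(TREE))   # 3.0
```

The first move is worth 0.5 × 2 + 0.5 × 4 = 3.0 and the second 0.9 × 1 + 0.1 × 20 = 2.9, so max prefers the first even though the second has the largest possible payoff.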
