Artificial Intelligence
Ram Meshulam 2004
Lesson 1
About
• Lecturer: Prof. Sarit Kraus
• TA: Galit Haim
• (almost) All you need can be found on the course website:
– http://u.cs.biu.ac.il/~haimga/Teaching/AI/
Course Requirements 1
• The grade is comprised of 70% exam and 30% exercises.
• 3 programming exercises will be given. Work individually.
• All the exercises are counted for the final grade, but you can pass the course without submitting them if your final grade (composed from the exam and exercise grades) is above the required threshold. The exercises are equally weighted.
• Exercises will be written in C++ or Java only. They should compile and run on the planet machine, and will be submitted via “submit”. Be precise!
Course Requirements 2
• Exercises are not hard, but work is required. Plan your time ahead!
• When sending me mail please include the course number
(89-570) in the header, to pass the automatic spam filter.
• You (probably) will be required to participate in AI experiments.
• See other general rules in: http://u.cs.biu.ac.il/~haimga/Teaching/AI/assignments/general-rules.pdf
Course Schedule
• Lesson 1:
– Introduction
– Transferring a general problem to a graph search problem.
• Lesson 2:
– Uninformed Search (BFS, DFS etc.).
• Lesson 3:
– Informed Search (A*, Best-First-Search etc.).
Course Schedule – Cont.
• Lesson 4:
– Local Search (Hill Climbing, Genetic Algorithms etc.).
• Lesson 5:
– “Search algorithms” chapter summary.
• Lessons 6-7:
– Game Trees: Min-Max & Alpha-Beta algorithms.
Course Schedule – Cont.
• Lessons 8-9:
– Planning: the STRIPS algorithm.
• Lessons 10-12:
– Learning: Decision Trees, Neural Networks, Naïve Bayes, Bayesian Networks and more.
• Lesson 13:
– Questions and exercises.
AI – Alternative Definitions
• Elaine Rich and Kevin Knight: AI is the study of how to make computers do things at which, at the moment, people are better.
• Stuart Russell and Peter Norvig: [AI] has to do with smart programs, so let's get on and write some.
• Claudson Bornstein: AI is the science of common sense.
• Douglas Baker: AI is the attempt to make computers do what people think computers cannot do.
• Astro Teller: AI is the attempt to make computers do what they do in the movies.
AI Domains
• Games – chess, checkers, tile puzzle.
• Expert systems
• Speech recognition and Natural Language Processing, Computer Vision, Robotics.
AI & Search
• "The two most fundamental concerns of AI researchers are knowledge representation and search”
• “Knowledge representation … addresses the problem of capturing … in a language … suitable for computer manipulation.”
• “Search is a problem-solving technique that systematically explores a space of problem states.”
(Luger, G.F., Artificial Intelligence: Structures and Strategies for Complex Problem Solving)
Solving Problems with Search
Algorithms
• Input: a problem P.
• Preprocessing:
– Define states and a state space.
– Define operators.
– Define a start state and a goal set of states.
• Processing:
– Run a search algorithm to find a path from the start state to one of the goal states.
Example - Missionaries & Cannibals
• State space – [M,C,B]
• Initial State – [3,3,1]
• Goal State – [0,0,0]
• Operators – adding or subtracting the vectors [1,0,1], [2,0,1], [0,1,1], [0,2,1] or [1,1,1]
• Path – moves from [3,3,1] to [0,0,0]
• Path Cost – the number of river trips
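The state representation and operators above can be sketched in Python. The legality test (missionaries may never be outnumbered by cannibals on either bank) is the puzzle's standard rule, implied but not spelled out on the slide:

```python
# State: (missionaries, cannibals, boat) counted on the start bank.
# Moves: (missionaries, cannibals) carried by the boat each trip.
MOVES = [(1, 0), (2, 0), (0, 1), (0, 2), (1, 1)]

def legal(state):
    """No bank may have missionaries outnumbered by cannibals."""
    m, c, b = state
    if not (0 <= m <= 3 and 0 <= c <= 3):
        return False
    # Start bank holds (m, c); the far bank holds (3-m, 3-c).
    return (m == 0 or m >= c) and (3 - m == 0 or 3 - m >= 3 - c)

def successors(state):
    m, c, b = state
    sign = -1 if b == 1 else 1          # boat on start bank -> people leave it
    out = []
    for dm, dc in MOVES:
        new = (m + sign * dm, c + sign * dc, 1 - b)
        if legal(new):
            out.append(new)
    return out
```

From the initial state [3,3,1] only three of the five operators lead to legal states, which a search algorithm then explores.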
Breadth-First-Search Pseudo code
• Intuition: Treating the graph as a tree and scanning top-
down.
• Algorithm:
BFS(Graph graph, Node start, Vector goals)
1. L ← make_queue(start)
2. While L not empty loop
   1. n ← L.remove_front()
   2. If goal(n) return true
   3. S ← successors(n)
   4. L.insert(S)
3. Return false
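The pseudocode above can be sketched in Python (a visited set is added to avoid revisiting states, which the slides introduce later under "State Redundancies"):

```python
from collections import deque

def bfs(graph, start, goals):
    """graph: dict mapping node -> list of successors."""
    frontier = deque([start])          # the open list L, a FIFO queue
    visited = {start}                  # closed list, avoids revisiting
    while frontier:
        n = frontier.popleft()         # remove_front
        if n in goals:
            return True
        for s in graph.get(n, []):     # successors(n)
            if s not in visited:
                visited.add(s)
                frontier.append(s)
    return False
```

Because the queue is FIFO, nodes are expanded level by level, exactly the "scanning top-down" intuition on the slide.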
Breadth-First-Search Attributes
• Completeness – yes (if b < ∞ and d < ∞)
• Optimality – yes, if the graph is un-weighted.
• Time Complexity: O(1 + b + b² + … + b^(d+1)) = O(b^(d+1))
• Memory Complexity: O(b^(d+1))
– Where b is the branching factor and d is the solution depth
• See water tanks example.
Artificial Intelligence
Lesson 2
Uninformed Search
• Uninformed search methods use only information available in the problem definition.
– Breadth First Search (BFS)
– Depth First Search (DFS)
– Iterative Deepening DFS (IDS)
– Bi-directional search
– Uniform Cost Search (a.k.a. Dijkstra's algorithm)
Depth-First-Search Pseudo code
DFS(Graph graph, Node start, Vector goals)
1. L ← make_stack(start)
2. While L not empty loop
   2.1 n ← L.remove_front()
   2.2 If goal(n) return true
   2.3 S ← successors(n)
   2.4 L.insert(S)
3. Return false
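The same skeleton as BFS works here; only the open list changes to a stack (a sketch, with a visited set added so the example terminates on graphs with cycles):

```python
def dfs(graph, start, goals):
    """graph: dict mapping node -> list of successors."""
    frontier = [start]                 # the open list L, a LIFO stack
    visited = set()                    # guards against infinite loops
    while frontier:
        n = frontier.pop()             # remove_front of a stack
        if n in goals:
            return True
        if n in visited:
            continue
        visited.add(n)
        frontier.extend(graph.get(n, []))   # push successors
    return False
```

Swapping `deque.popleft()` for `list.pop()` is the entire difference between the two algorithms, which is why the slides present them with identical pseudocode.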
Depth-First-Search Attributes
• Completeness – No. Infinite loops or infinite depth can occur.
• Optimality – No.
• Time Complexity: O(b^m)
• Memory Complexity: O(bm)
– Where b is the branching factor and m is the maximum depth of the search tree
• See water tanks example.
Limited DFS Attributes
• Completeness – Yes, if d ≤ l
• Optimality – No.
• Time Complexity: O(b^l)
– If d < l, it is larger than in BFS
• Memory Complexity: O(bl)
– Where b is the branching factor and l is the depth limit.
Depth-First Iterative-Deepening
(Figure: a tree whose node labels give the order in which DFID generates them; the numbers represent the order generated by DFID.)
Iterative-Deepening Attributes
• Completeness – Yes
• Optimality – yes, if the graph is un-weighted.
• Time Complexity: O(d·b + (d-1)·b² + … + 1·b^d) = O(b^d)
• Memory Complexity: O(bd)
– Where b is the branching factor and d is the depth of the shallowest solution
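Iterative deepening can be sketched as a depth-limited DFS wrapped in a loop over increasing limits (the `max_depth` cap is my addition so the sketch terminates when no solution exists):

```python
def depth_limited(graph, node, goals, limit):
    """DFS that refuses to descend below the given depth limit."""
    if node in goals:
        return True
    if limit == 0:
        return False
    return any(depth_limited(graph, s, goals, limit - 1)
               for s in graph.get(node, []))

def ids(graph, start, goals, max_depth=50):
    """Run depth-limited DFS with limit 0, 1, 2, ... until a goal is found."""
    for limit in range(max_depth + 1):
        if depth_limited(graph, start, goals, limit):
            return limit               # depth of the shallowest solution
    return None
```

Each iteration re-expands the shallow levels, which is exactly why the time bound above is a weighted sum: the root is generated d times, its children d-1 times, and so on.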
State Redundancies
• Closed list - a hash table which holds the
visited nodes.
• For example, in BFS: the open list (frontier) holds generated but not yet expanded nodes, while the closed list holds the nodes already expanded.
Bi-directional Search
• Search both from the initial state towards the goal state and from the goal state back.
• Operators must be symmetric.
Bi-directional Search Attributes
• Completeness – Yes, if both directions use BFS
• Optimality – yes, if graph is un-weighted and both
directions use BFS.
• Time and memory Complexity: O(b^(d/2))
• Pros:
– Cuts the search tree by half (at least theoretically).
• Cons:
– Frontiers must be constantly compared.
Minimum cost path
• General minimum cost path-search problem:
– Find the shortest path from the start state to one of the goal states in a weighted graph.
– The path cost function is g(n): the sum of edge weights from the start state to n.
Uniform Cost Search
• Also known as Dijkstra’s algorithm.
• Expand the node with the minimum path cost first.
• Implementation: priority queue.
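A minimal sketch using Python's `heapq` as the priority queue, keyed on the path cost g(n):

```python
import heapq

def uniform_cost(graph, start, goals):
    """graph: dict mapping node -> list of (neighbor, edge_cost) pairs.
    Returns the cost of the cheapest path to a goal, or None."""
    frontier = [(0, start)]            # priority queue ordered by g(n)
    closed = set()
    while frontier:
        g, n = heapq.heappop(frontier) # cheapest node first
        if n in goals:
            return g
        if n in closed:                # stale duplicate entry, skip
            continue
        closed.add(n)
        for s, w in graph.get(n, []):
            if s not in closed:
                heapq.heappush(frontier, (g + w, s))
    return None
```

The usage below runs it on a small weighted tree loosely following the slide example (a with children b at cost 2 and c at cost 1): nodes come off the queue in order of total path cost, not discovery order.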
Uniform Cost Search Attributes
• Completeness: yes, for positive weights
• Optimality: yes
• Time & Memory complexity: O(b^(c/e))
– Where b is the branching factor, c is the optimal solution cost and e is the minimum edge cost
Example of Uniform Cost Search
• Assume an example tree with different edge costs, represented by numbers next to the edges: the root a has children b (edge cost 2) and c (edge cost 1); c has children d and e; b has children f and g.
• Notation for this example: a generated node sits on the open list; an expanded node is moved to the closed list.
• The search expands nodes in order of path cost g (shown after each node):

Closed list          | Open list
(empty)              | a:0
a                    | c:1, b:2
a, c                 | b:2, d:2, e:3
a, c, b              | d:2, e:3, f:3, g:4
a, c, b, d           | e:3, f:3, g:4
a, c, b, d, e        | f:3, g:4
a, c, b, d, e, f     | g:4
a, c, b, d, e, f, g  | (empty)
Informed Search
• Incorporates an additional measure of the potential of a specific state to reach the goal.
• The potential of a state to reach the goal is measured through a heuristic function h(n).
• An evaluation function is denoted f(n).
Best First Search Algorithms
• Principle: Expand the node n with the best evaluation function value f(n).
• Implemented via a priority queue.
• Algorithms differ in the definition of f:
– Greedy Search: f(n) = h(n)
– A*: f(n) = g(n) + h(n)
– IDA*: iterative deepening version of A*
– Etc.
Exercise
• Q: Can Uniform-Cost Search be considered a Best-First algorithm?
• A: Yes. It can be considered a Best-First algorithm with evaluation function f(n) = g(n).
• Q: In what scenarios does IDS outperform DFS? BFS?
• A:
– IDS outperforms DFS when the search tree is a lot deeper than the solution depth.
– IDS outperforms BFS when BFS runs out of memory.
Exercise – Cont.
• Q: Why do we need a closed list?
• A: Generally a closed list has two main functionalities:
– Prevent re-exploring of nodes.
– Hold the solution path from start to goal (DFS-based algorithms have it anyway).
• Q: Does Breadth-FS find the optimal path length in general?
• A: No, unless the search graph is un-weighted.
• Q: Will IDS always find the same solution as BFS, given that the node expansion order is deterministic?
• A: Yes. Each iteration of IDS explores new nodes in the same order as BFS does.
Artificial Intelligence
Lesson 3
Informed Search
• Incorporates an additional measure of the potential of a specific state to reach the goal.
• The potential of a state to reach the goal is measured through a heuristic function h(n); thus always h(goal) = 0.
• An evaluation function is denoted f(n).
Best First Search Algorithms
• Principle: Expand the node n with the best evaluation function value f(n).
• Implemented via a priority queue.
• Algorithms differ in the definition of f:
– Greedy Search: f(n) = h(n)
– A*: f(n) = g(n) + h(n)
– IDA*: iterative deepening version of A*
– Etc.
Properties of Heuristic functions
• The 2 most important properties:
– relatively cheap to compute
– relatively accurate estimator of the cost to reach a goal.
• Usually a heuristic is considered “good” if ½·opt(n) < h(n) ≤ opt(n).
• Examples:
– Navigating in a network of roads from one location to another. Heuristic function: airline distance.
– Sliding-tile puzzles. Heuristic function: Manhattan distance - the number of horizontal and vertical grid units each tile is displaced from its goal position.
Heuristic Function h(n)
• Admissible/Underestimating: h(n) never overestimates the actual cost from n to the goal.
• Consistent/monotonic (desirable): h(m) - h(n) ≤ w(m,n), where m is the parent of n. This ensures f(n) ≥ f(m).
Best-FS Algorithm Pseudo code
1. Start with open = [initial-state].
2. While open is not empty do
   1. Pick the best node on open.
   2. If it is the goal node then return with success. Otherwise find its successors.
   3. Assign the successor nodes a score using the evaluation function and add the scored nodes to open.
General Framework using Closed-
list (Graph-Search)
GraphSearch(Graph graph, Node start, Vector goals)
1. O ← make_data_structure(start) // open list
2. C ← make_hash_table() // closed list
3. While O not empty loop
   1. n ← O.remove_front()
   2. If goal(n) return n
   3. If n is found on C, continue
   4. //otherwise
   5. O ← successors(n)
   6. C ← n
4. Return null //no goal found
Greedy Search Attributes
• Completeness: No. Inaccurate heuristics can cause loops (unless using a closed list), or entering an infinite path.
• Optimality: No. Inaccurate heuristics can lead to a non-optimal solution.
• Time & Memory complexity: O(b^m)
• (Figure: a small graph with start s, intermediate nodes a (h=2) and b (h=1), and goal g, where the greedy choice leads to the more expensive path.)
A* Algorithm
• Combines greedy h(n) and uniform cost g(n) approaches.
• Evaluation function: f(n)=g(n)+h(n)
A* Pseudo code
A-Star(Graph graph, Node start, Node goal, HeuristicFunction h)
1. O ← make_priority_queue(start) // open list
2. C ← make_hash_table() // closed list
3. While O not empty loop
   1. n ← O.remove_front() //O is sorted by f(n)=g(n)+h(n) values
   2. If goal(n) return n
   3. If n is found on C, continue
   4. //otherwise
   5. S ← successors(n)
   6. For each node s in S
      1. Set s.g = n.g + w(n,s)
      2. Set s.parent = n //for path extraction
      3. Set s.h = h(s) //to calculate f
      4. O ← s
   7. C ← n
4. Return null //no goal found
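The pseudocode above can be sketched compactly in Python; here the parent pointers are replaced by carrying the path in the queue entries, and the heuristic is a plain dict (the example graph and h values below are my own, chosen to be admissible and consistent):

```python
import heapq

def a_star(graph, start, goal, h):
    """graph: node -> list of (neighbor, weight); h: heuristic dict."""
    frontier = [(h[start], 0, start, [start])]   # (f, g, node, path)
    closed = set()
    while frontier:
        f, g, n, path = heapq.heappop(frontier)  # lowest f = g + h first
        if n == goal:
            return g, path
        if n in closed:
            continue
        closed.add(n)
        for s, w in graph.get(n, []):
            if s not in closed:
                heapq.heappush(frontier, (g + w + h[s], g + w, s, path + [s]))
    return None
```

With an admissible h the cheaper detour through b wins over the greedy-looking move to a: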
A* Algorithm (1)
• Completeness:
– In a finite graph: Yes
– In an infinite graph: yes, if all edge costs are finite and have a minimum positive value, and all heuristic values are finite and non-negative.
• Optimality:
– In tree-search: if h(n) is admissible
– In graph-search: if it is also consistent
A* Algorithm (2)
• Optimally efficient: A* expands the minimal number of nodes possible with any given (consistent) heuristic.
• Time and space complexity:
– Worst case: cost function f(n) = g(n), giving O(b^(c/e))
– Best case: cost function f(n) = g(n) + h*(n), giving O(bd)
A* Application Example
• Game: Tales of Trolls
and Treasures
• Yellow dots are nodes in the search graph.
IDA* Algorithm
• Each iteration is a depth-first search that keeps track of the cost evaluation f = g + h of each node generated.
• The cost threshold is initialized to the heuristic of the initial state.
• If a node is generated whose cost exceeds the threshold for that iteration, its path is cut off.
IDA* Pseudo code
• IDAStar-Main(Node root)
1. Set bound = f(root)
2. WHILE (bound < infinity)
   1. Set bound = IDAStar(root, bound)

• IDAStar(Node n, Double bound)
1. if n is a goal, exit the algorithm and return the goal
2. if n has no children, return infinity
3. fn = infinity
4. for each child c of n, Set f = f(c)
   1. IF (f <= bound) fn = min(fn, IDAStar(c, bound))
   2. Else fn = min(fn, f)
5. Return fn
IDA* Attributes
• The cost threshold increases in each iteration to the total cost of the lowest-cost node that was pruned during the previous iteration.
• The algorithm terminates when a goal state is reached whose total cost does not exceed the current threshold.
• Completeness and Optimality: like A*
• Space complexity: O(c/e) (linear in the solution depth)
• Time complexity*: O(b^(c/e))
Duplicate Pruning
• Do not enter the father of the current state
– With or without using a closed list
• Using a closed list, check the closed list before entering new nodes into the open list
– Note: in A*, h has to be consistent!
– Do not remove the original check
• Using a stack, check the current branch and stack status before entering new nodes
Exercise
• Q: What are the
advantages of IDA*
over:
– A*?
OptimalitySpaceInformed
pruning
Endless
branch
Alg.
Adv.
VA*
Ram Meshulam 200458
– A*?
– DFS (no closed list)?
– Uniform-Cost (closed
list)?
VVVDFS
VVUC
Exercise – Cont.
• Q: When is IDA* not preferable?
• A:
– A state-space graph with dense node duplications.
– When all the node costs are different: if the asymptotic complexity of A* is O(N), IDA*'s complexity can reach O(N²) in the worst case.
• Q: What algorithm will we get if we implement Greedy search on a uniform cost graph using
– h(n) = g(n)?
– h(n) = -g(n)?
• A:
– h(n) = g(n) → BFS
– h(n) = -g(n) → DFS
Exercise – True/False

Sentence | True/False
DFS is not optimal | True, see the DFS slides for an example.
Forward Search is always more preferable than Backwards Search | False. For example if there are more start nodes than goal nodes, or it is more natural to go backwards (expert systems).
ID alg. is always equal or slower than BFS (assuming node expansion order is deterministic) | True. The last iteration expands nodes as BFS does.
IDS alg. is the exact implementation of BFS | False. Its space complexity is O(bd) instead of O(b^d).
Artificial Intelligence
Lesson 4
Algorithms that Perform Iterative Improvement
• For problems where the goal state is not known in advance.
• Examples:
– Earn as much money as possible.
– Pack into as little volume as possible.
– Schedule with as few conflicts as possible.
• We only know how to compare two states and say which of them is better.
• We draw a random solution, and try to make local changes in order to improve it.
Local Search
• Local improvement, no paths.
• Look around at states in the local neighborhood and choose the one with the best value.
• Pros:
– Quick (usually linear)
– Sometimes enough
– Linear space complexity
– Can often find reasonable solutions in large or infinite (continuous) state spaces for which systematic algorithms are unsuitable.
– Suitable for optimization problems: problems of finding optimal values for functions under specific constraints.
• Cons:
– Not optimal. E.g., the Travelling Salesperson Problem: find the shortest path s.t. every city is visited only once.
– Can get stuck on a local maximum or plateau.
Local Search – Cont.
• In order to avoid local maxima and plateaus we permit moves to states with lower values with probability p.
• The different algorithms differ in p:

p | Algorithm
p = 0 | Hill Climbing, GSAT
p = 1 | Random Walk
p = c (domain specific) | Mixed Walk, Mixed GSAT
p = acceptor(dh, T) | Simulated Annealing
Hill Climbing
while f-value(state) <= f-value(next-best(state))
    state := next-best(state)

(Figure: f-value = evaluation(state) plotted over the space of states.)
Hill Climbing
• Always choose the next best successor.
• Stop when no improvement is possible.
• The problems:
– Stops at a local maximum.
– If the best neighbor is equal to the node, it chooses the neighbor.
– If there are several equal neighbors, one is chosen randomly.
– Can get stuck with no progress because of all of the above.
In order to avoid plateaus and local maxima:
- Sideways move: go to sons whose value is equal to mine.
- Stochastic hill climbing: choose the next node according to its grade (how good its solution is).
- Random-restart algorithm.
Random Restart Hill Climbing
1. Choose a random point and run hill climbing.
2. If the solution you found is better than the best solution found so far, save it.
3. Go back to 1.
• When do we stop? After a fixed number of iterations, or after a fixed number of iterations in which no improvement of the best solution found so far was made.
Simulated Annealing
• Instead of starting from scratch each time, we allow descending from the peak we have reached.
• The process is similar to hill climbing, but at each step a random step is chosen.
• If the step improves the value of f, we take it.
• Otherwise, we take it with a certain probability.
• The probability function decreases exponentially as long as a solution is not found.
Simulated Annealing
• Permits moves to states with lower values.
• Gradually decreases the frequency of such moves and their size.
• Analogous to the physical process of freezing a liquid.
• Schedule():
– Returns the current temperature.
– Depends on the start temperature and the round number.
• Acceptor():
– Returns the probability of choosing a “bad” node.
– Depends on h(n)-h(n_son) and the current temperature.
Simulated Annealing – Pseudo code
• SimulatedAnnealing(start node s, Temperature t)
1. Set startTemp = t //for the schedule function
2. Set h = h(s)
3. Set round = 0
4. while terminal condition not true
   1. Set s_new = choose a random son of s
   2. Set h_new = h(s_new)
   3. if (h_new < h) or (random() < acceptor(h_new - h, t))
      1. Set s = s_new
      2. Set h = h_new
   4. Set t = schedule(startTemp, round)
   5. Set round = round + 1
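A runnable sketch of the loop above, minimizing h with the exponential acceptor and geometric cooling schedule from the next slide (the toy objective, seed, and the best-so-far bookkeeping are my additions):

```python
import math
import random

def simulated_annealing(s, h, neighbors, start_temp=10.0, c=0.95,
                        rounds=300, seed=0):
    """Minimize h. acceptor(dh, t) = e^(-dh/t); schedule: geometric cooling."""
    rng = random.Random(seed)
    t = start_temp
    best = s                                   # keep the best state ever seen
    for r in range(rounds):
        s_new = rng.choice(neighbors(s))       # random son of s
        dh = h(s_new) - h(s)
        if dh < 0 or rng.random() < math.exp(-dh / t):
            s = s_new                          # accept: improving, or lucky bad move
            if h(s) < h(best):
                best = s
        t = start_temp * (c ** (r + 1))        # schedule(startTemp, round)
    return best
```

Early on (high t) almost any move is accepted; as t decays toward zero the acceptor probability e^(-dh/t) for worsening moves vanishes and the process behaves like hill climbing.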
Simulated Annealing – Pseudo code Cont.
• Acceptor function: decides whether to go to a bad node or not. Example:
  e^(-Δh / (c·t)), with 0 < c ≤ 1
• Schedule function: decreases the temperature with the rounds. Example:
  startTemp · c^round, with 0 < c < 1
GSAT
• Greedy local search procedure for satisfying logic formulas in conjunctive normal form (CNF).
• An implementation of Hill Climbing for the CNF domain.
• Note: SAT is an NP-Complete problem.
GSAT
• Searcher:
– states: variable assignments
– actions: flip a variable's assignment
– score: the number of unsatisfied clauses
• Start with a random assignment.
• While not satisfied...
– Flip the value assigned to the variable that yields the greatest number of satisfied clauses.
– Repeat #flips times.
• Repeat with a new random assignment #trials times.
GSAT – Pseudo code
• GSAT(Clauses C, Integer tries, Integer flips)
1. for i = 1 to tries
   1. Set T = a randomly generated truth assignment
   2. for j = 1 to flips
      1. if T satisfies C then return T
      2. FLIP the variable in T whose flip results in the greatest decrease in the number of unsatisfied clauses
   3. Save the currently best T
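A compact sketch of the loop above. The clause encoding is my own choice: a clause is a list of integers, where literal k means variable k is true and -k means it is false:

```python
import random

def gsat(clauses, n_vars, tries=10, flips=50, seed=0):
    rng = random.Random(seed)
    n_clauses = len(clauses)

    def n_satisfied(assign):
        return sum(any((lit > 0) == assign[abs(lit)] for lit in clause)
                   for clause in clauses)

    for _ in range(tries):
        # a randomly generated truth assignment
        assign = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(flips):
            if n_satisfied(assign) == n_clauses:
                return assign

            def flipped_score(v):
                assign[v] = not assign[v]
                score = n_satisfied(assign)
                assign[v] = not assign[v]     # undo the trial flip
                return score

            # flip the variable yielding the most satisfied clauses
            best_var = max(range(1, n_vars + 1), key=flipped_score)
            assign[best_var] = not assign[best_var]
    return None
```

On the trivial formula (x1) ∧ (x2) ∧ (¬x3) the greedy flips fix one unit clause per step, so any starting assignment is repaired within three flips.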
Genetic Algorithm
• Inspired by Darwin's theory of evolution:
survival of the fittest.
• Begins with a set of solutions ("chromosomes") called a population.
• The best solutions from generation n are taken and used to form generation n+1, applying crossover and mutation operators.
Genetic Algorithm Pseudo code
• choose initial population
• evaluate each individual's fitness
• repeat until terminating condition
– select individuals to reproduce //better fitness → better chance to be selected
– mate pairs at random
– with crossover_prob. apply the crossover operator
– with mutation_prob. apply the mutation operator
– evaluate each individual's fitness
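A toy instantiation of the loop above on the OneMax problem (maximize the number of 1-bits). The tournament selection, one-point crossover, and elitism are common concrete choices, not prescribed by the slide:

```python
import random

def genetic_algorithm(n_bits=12, pop_size=20, generations=40,
                      crossover_p=0.9, mutation_p=0.05, seed=0):
    """Maximize the number of 1-bits (OneMax) - a toy fitness function."""
    rng = random.Random(seed)
    fitness = sum
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        def select():                            # better fitness -> better chance
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        new_pop = [max(pop, key=fitness)]        # elitism: always keep the best
        while len(new_pop) < pop_size:
            p1, p2 = select(), select()
            if rng.random() < crossover_p:       # one-point crossover
                cut = rng.randrange(1, n_bits)
                child = p1[:cut] + p2[cut:]
            else:
                child = list(p1)
            for i in range(n_bits):              # mutation
                if rng.random() < mutation_p:
                    child[i] = 1 - child[i]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)
```

Thanks to elitism the best fitness is non-decreasing across generations, which is what the assertions below rely on.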
Exercise
• Q: Is there a danger of a local maximum in GA? How does the algorithm try to avoid it?
• A: Via the mutation operator, which inserts randomization into the algorithm.
• Q: If the start temperature is very close to 0 in SA:
– How will the algorithm behave?
– What problem will it cause?
– How can we partially solve it?
• A:
– Like Greedy Search with no closed list.
– It will get stuck on the first local maximum.
– Random restarts.
Exercise – Cont.
• Q: Solve the Traveling Salesman Problem using:
– Simulated annealing (SA)
– Genetic Algorithm (GA).
• A:
– For both algorithms a state is a vector which represents
the order in which the salesman travels.
– State value/fitness is the distance the agent traveled.
– State expand/mutation is to swap order of two cities in
path.
Exercise – Cont.
• GA:
– crossover: “greedy crossover” [Grefenstette, 1985]:
– GreedyCrossover(vector v1, vector v2)
1. Set vector res = v1[0] //v1 and v2 are chosen randomly
2. Repeat until |res| = number of cities
   1. Select the closest city to res[i] from v1[i+1], v2[i+1] which is not already in res.
   2. If not possible, select randomly a city which is not in res.
Artificial Intelligence
Lesson 5
Search Algorithms Hierarchy
• Global:
– Uninformed: DFS, IDS, BFS, Uniform Cost
– Informed: Greedy, A*, IDA*
• Local: GSAT, Hill Climbing, Random Walk, Mixed Walk, Mixed GSAT, Simulated Annealing
Exercise
• What are the different
data structures used to
implement the open
list in BFS,DFS,Best-
QueueBFS
StackDFS
Ram Meshulam 200484
list in BFS,DFS,Best-
FS:StackDFS
Priority
Queue
Best-FS (Greedy,A*,Unifo
rm-Cost Alg).
Exercise – Cont.
• If there is no solution, A* will explore the whole graph. [yes]
• An admissible heuristic function h(n) will always return smaller values than the real distance to the goal. [no: h(n) ≤ h*(n), equality is allowed]
• h, h' admissible → A* will expand the same number of nodes with both functions. [no]
Artificial Intelligence
Lesson 6 (From Russell & Norvig)
Games- Outline
• Optimal decisions
• α-β pruning
• Imperfect, real-time decisions
Games vs. search problems
• "Unpredictable" opponent → must specify a move for every possible opponent reply.
• Time limits → unlikely to find the goal, must approximate.
Game tree (2-player,
deterministic, turns)
Minimax
• Perfect play for deterministic games
• Idea: choose move to position with highest minimax value
= best achievable payoff against best play
• E.g., a 2-ply game:
Minimax algorithm
Properties of minimax
• Complete? (= will not run forever) Yes (if the tree is finite)
• Optimal? (= will find the optimal response) Yes (against an optimal opponent)
• Time complexity? O(b^m)
• Space complexity? O(bm) (depth-first exploration)
• For chess, b ≈ 35, m ≈ 100 for "reasonable" games → exact solution completely infeasible
α-β pruning example
(Figures: the same game tree, pruned step by step as α and β bounds are tightened.)
Properties of α-β
• Pruning does not affect final result
• Good move ordering improves effectiveness of pruning
• With "perfect ordering" on a binary tree, time complexity = O(b^(m/2)) → doubles the solvable depth of search
• A simple example of the value of reasoning about which computations are relevant (a form of metareasoning)
Why is it called α-β?
• α is the value of the best (i.e., highest-value) choice found so far at any choice point along the path for max.
• If v is worse than α, max will avoid it → prune that branch.
• Define β similarly for min.
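The pruning rule above can be sketched as a small extension of minimax; the `visited` list is my addition, collecting the leaves actually evaluated so the pruning is observable:

```python
import math

def alphabeta(node, alpha=-math.inf, beta=math.inf, maximizing=True,
              visited=None):
    """node: a leaf utility (number) or a list of children."""
    if visited is None:
        visited = []
    if isinstance(node, (int, float)):
        visited.append(node)           # record every leaf we evaluate
        return node
    if maximizing:
        v = -math.inf
        for child in node:
            v = max(v, alphabeta(child, alpha, beta, False, visited))
            alpha = max(alpha, v)
            if alpha >= beta:          # beta cut-off: min will avoid this branch
                break
        return v
    v = math.inf
    for child in node:
        v = min(v, alphabeta(child, alpha, beta, True, visited))
        beta = min(beta, v)
        if beta <= alpha:              # alpha cut-off: max will avoid this branch
            break
    return v
```

On the standard 2-ply example tree the result is unchanged (3), but only 7 of the 9 leaves are evaluated: once the second min node sees the leaf 2 it can never beat α = 3, so its remaining leaves are pruned.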
The α-β algorithm
Resource limits
Suppose we have 100 secs and explore 10^4 nodes/sec → 10^6 nodes per move.

Standard approach:
• cutoff test: e.g., depth limit
(perhaps add quiescence search: an additional "grade" for each node).
– Quiescence: a position of unrest; in the context of games, a position where the two sides are in the middle of an exchange of pieces. If the search algorithm stops searching at such a stage, there is a high chance it will return a wrong value, since the exchange of pieces may continue. The solution is to keep deepening that branch of the tree until a quiet position, with no piece exchanges, is reached.
• evaluation function: estimated desirability of a position
Evaluation functions
• For chess, typically a linear weighted sum of features:
Eval(s) = w1·f1(s) + w2·f2(s) + … + wn·fn(s)
• e.g., w1 = 9 with f1(s) = (number of white queens) – (number of black queens), etc.
Cutting off search
MinimaxCutoff is identical to MinimaxValue except:
1. "Terminal?" is replaced by "Cutoff?"
2. Utility is replaced by Eval

Does it work in practice?
b^m = 10^6, b = 35 → m ≈ 4
4-ply lookahead is a hopeless chess player!
– 4-ply ≈ human novice
– 8-ply ≈ typical PC, human master
– 12-ply ≈ Deep Blue, Kasparov
Deterministic games in practice
• Checkers: Chinook ended the 40-year reign of human world champion Marion Tinsley in 1994. It used a precomputed endgame database defining perfect play for all positions involving 8 or fewer pieces on the board, a total of 444 billion positions.
• Chess: Deep Blue defeated human world champion Garry Kasparov in a six-game match in 1997. Deep Blue searches 200 million positions per second, uses very sophisticated evaluation, and undisclosed methods for extending some lines of search up to 40 ply.
• Othello: human champions refuse to compete against computers, who are too good.
• Go: human champions refuse to compete against computers, who are too bad. In Go, b > 300, so most programs use pattern knowledge bases to suggest plausible moves.
Summary
• Games are fun to work on!
• They illustrate several important points about AI
• perfection is unattainable � must approximate
• good idea to think about what to think about
Artificial Intelligence
Lesson 7
Planning
• Traditional search methods do not fit large, real-world problems: they require defining specific states, rather than describing states in general terms.
• We want to use general knowledge.
• We need general heuristics.
• Problem decomposition.
STRIPS Algorithm
• STRIPS stands for STanford Research Institute Problem Solver (1971).
• The STRIPS idea: search backwards, from the goal to the start state.
• See example (pdf).
STRIPS – Representation
• States and goal – sentences in FOL.
• Operators – are composed of 3 parts:
– Operator name
– Preconditions – a sentence describing the conditions that must hold so that the operator can be executed.
– Effect – a sentence describing how the world changes as a result of executing the operator. Has 2 parts:
• Add-list
• Delete-list
– Optionally, a set of (simple) variable constraints
Example – Blocks world
Basic operations:
– stack(X,Y): put block X on block Y
– unstack(X,Y): remove block X from block Y
– pickup(X): pick up block X
– putdown(X): put block X on the table

(Figure: blocks A, B and C on a TABLE.)
Example – Blocks world (Cont.)

operator(stack(X,Y),
  Precond [holding(X),clear(Y)],
  Add [handempty,on(X,Y),clear(X)],
  Delete [holding(X),clear(Y)],
  Constr [X\==Y,Y\==table,X\==table]).

operator(unstack(X,Y),
  [on(X,Y), clear(X), handempty],
  [holding(X),clear(Y)],
  [handempty,clear(X),on(X,Y)],
  [X\==Y,Y\==table,X\==table]).

operator(pickup(X),
  [ontable(X), clear(X), handempty],
  [holding(X)],
  [ontable(X),clear(X),handempty],
  [X\==table]).

operator(putdown(X),
  [holding(X)],
  [ontable(X),handempty,clear(X)],
  [holding(X)],
  [X\==table]).
STRIPS Pseudo code
STRIPS(stateList start, stateList goals)
1. Set state = start
2. Set plan = []
3. Set stack = goals
4. while stack is not empty do
   1. STRIPS-Step()
5. Return plan
STRIPS Pseudo code – Cont.
STRIPS-Step()
switch on the top of the stack, t:
1. case t is a goal that matches state:
   1. pop stack
2. case t is an unsatisfied conjunctive goal:
   1. select an ordering for the sub-goals
   2. push the sub-goals onto the stack
STRIPS Pseudo code – Cont.
3. case t is a simple unsatisfied goal:
   1. choose an operator op whose add-list matches t
   2. replace t with op
   3. push the preconditions of op onto the stack
4. case t is an operator:
   1. pop stack
   2. state = state + t.add-list - t.delete-list
   3. plan = [plan | t]
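The core state update in case 4.2 can be sketched with states as sets of ground facts and operators as add/delete lists (the string encoding of facts is my own choice; the operator is pickup(a) from the blocks-world slides):

```python
def applicable(state, op):
    """An operator may fire only when all its preconditions hold in the state."""
    return op["pre"] <= state

def apply_op(state, op):
    """Case 4.2 of STRIPS-Step: state = state + add-list - delete-list."""
    return (state - op["delete"]) | op["add"]

# pickup(a) from the blocks-world example, with facts as plain strings.
pickup_a = {
    "pre":    {"ontable(a)", "clear(a)", "handempty"},
    "add":    {"holding(a)"},
    "delete": {"ontable(a)", "clear(a)", "handempty"},
}
```

Applying pickup(a) to a state where block a sits on the table removes the three deleted facts and adds holding(a), which is exactly how the planner simulates the world while unwinding its goal stack.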
Versions and Decision points• 3 decision points
– How to order sub-goals?
– Which operator to choose?
– Which object to place in a variable?
• Different versions
– Backtracking? (at each decision point)
– Lifted: a variable may remain on the stack with no value assigned, vs.
– Grounded: a value is assigned to each variable
• The original STRIPS
– Backtracks only on the order of sub-goals
– Lifted
Artificial Intelligence
Lesson 8
Outline
• Inductive learning
• Decision tree learning
Learning
• Learning is essential for unknown environments,
– i.e., when designer lacks omniscience
• Learning is useful as a system construction
method,
– i.e., expose the agent to reality rather than trying to
write it down
• Learning modifies the agent's decision
mechanisms to improve performance
Learning Paradigms
• Supervised Learning: learning with a “supervisor” who supplies inputs together with their correct outputs.
• Reinforcement Learning: with a “reward” for a
good action, or “penalty” for a bad action.
Self learning.
• Unsupervised Learning: the learner tries to find structure on its own; there is no feedback on whether what was learned is correct.
Inductive learning
• Simplest form: learn a function from examples
• f is the target function,
An example is a pair (x, f(x))
• Problem: find a hypothesis h such that h ≈ f
given a training set of examples
• This is a highly simplified model of real learning:
– Ignores prior knowledge
– Assumes examples are given
Inductive learning method
• Construct/adjust h to agree with f on training set
• (h is consistent if it agrees with f on all examples)
• E.g., curve fitting:
• Ockham’s razor: prefer the simplest hypothesis consistent with data
• The tradeoff between the expressiveness of a hypothesis space and the complexity of finding a simple, consistent hypothesis within it
Learning decision trees
Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Attribute-based representations
• Examples described by attribute values (Boolean, discrete, continuous)
• E.g., situations where I will/won't wait for a table:
• Classification of examples is positive (T) or negative (F)
Decision trees
• One possible representation for hypotheses
• E.g., here is the “true” tree for deciding whether to wait:
Expressiveness
• Decision trees can express any function of the input attributes.
• E.g., for Boolean functions, truth table row → path to leaf:
• Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples
• Prefer to find more compact decision trees
Decision tree learning
• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose "most significant" attribute as root of
(sub)tree
Choosing an attribute
• Idea: a good attribute splits the examples into subsets that
are (ideally) "all positive" or "all negative"
• Patrons? is a better choice
Using information theory
• To implement Choose-Attribute in the DTL
algorithm
• Information Content of an answer (Entropy):
I(P(v1), … , P(vn)) = Σi=1..n −P(vi) log2 P(vi)
• For a training set containing p positive examples
and n negative examples:
I(p/(p+n), n/(p+n)) = −(p/(p+n)) log2 (p/(p+n)) − (n/(p+n)) log2 (n/(p+n))
Information gain
• A chosen attribute A divides the training set E into subsets E1, … , Ev according to their values for A, where A has v
distinct values.
remainder(A) = Σi=1..v (pi+ni)/(p+n) · I(pi/(pi+ni), ni/(pi+ni))
• Information Gain (IG) or reduction in entropy from the
attribute test:
• Choose the attribute with the largest IG
IG(A) = I(p/(p+n), n/(p+n)) − remainder(A)
Information gain
For the training set, p = n = 6, I(6/12, 6/12) = 1 bit
Consider the attributes Patrons and Type (and others too):
Patrons has the highest IG of all attributes and so is chosen by the DTL
algorithm as the root
IG(Patrons) = 1 − [2/12·I(0,1) + 4/12·I(1,0) + 6/12·I(2/6,4/6)] = 0.541 bits
IG(Type) = 1 − [2/12·I(1/2,1/2) + 2/12·I(1/2,1/2) + 4/12·I(2/4,2/4) + 4/12·I(2/4,2/4)] = 0 bits
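The two results can be checked numerically. A small sketch, with each attribute value's (positive, negative) example counts taken from the restaurant data:

```python
import math

def entropy(*probs):
    # I(P(v1), ..., P(vn)) = sum of -P(vi) * log2 P(vi)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def remainder(splits, total):
    # splits: (p_i, n_i) counts of the examples for each value of attribute A
    return sum((p + n) / total * entropy(p / (p + n), n / (p + n))
               for p, n in splits)

def info_gain(splits, p, n):
    return entropy(p / (p + n), n / (p + n)) - remainder(splits, p + n)

# Patrons: None (0+, 2-), Some (4+, 0-), Full (2+, 4-)
ig_patrons = info_gain([(0, 2), (4, 0), (2, 4)], 6, 6)
# Type: French (1+, 1-), Italian (1+, 1-), Thai (2+, 2-), Burger (2+, 2-)
ig_type = info_gain([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6)
print(round(ig_patrons, 3), round(ig_type, 3))  # → 0.541 0.0
```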
Example contd.
• Decision tree learned from the 12 examples:
• Substantially simpler than the “true” tree: a more complex hypothesis isn't justified by the small amount of data
Performance measurement
• How do we know that h ≈ f ?
1. Use theorems of computational/statistical learning theory
2. Try h on a new test set of examples
(use same distribution over example space as training set)
Learning curve = % correct on test set as a function of training set size.
A learning curve for the
decision tree algorithm on
100 randomly generated
examples in the restaurant
domain. The graph
summarizes 20 trials.
Summary
• Learning needed for unknown environments, lazy designers
• Learning agent = performance element + learning element
• For supervised learning, the aim is to find a simple hypothesis approximately consistent with training examples
• Decision tree learning using information gain
• Learning performance = prediction accuracy measured on test set
Refresh our memory with PROBABILITY
Lesson 9
• Unconditional or prior probability that a
proposition A is true: P(A)
– In the absence of any other information, the probability
to event A is P(A).
– Probability of application accepted:
Unconditional Probability
P(application-accept) = 0.2
• Propositions include random variables X
– Each random variable X has domain of values:
{red, blue, …green}
– P(X=Red) means the probability of X to be Red
• If application-accept is a binary random variable ->
values = {true,false}
– P(application-accept) same as P(app-accept = True)
– P(~app-accept) same as P(app-accept = False)
Unconditional Probability
• If Status-of-application domain:
{reject, accept, wait-list}
– We are allowed to make statements such as:
P(status-of-application = reject) = 0.2
P(status-of-application = accept) = 0.3
P(status-of-application = wait-list) = 0.5
Conditional Probability
• What if agent has some evidence?
– E.g. agent has a friend who has applied with a much weaker
qualification, and that application was accepted?
• Posterior or conditional probability
P(A|B) probability of A given all we know is B
– P(X=accept|Weaker application was accepted)
– If we know B and also know C, then P(A| B ∧ C)
– P(A ∧ B) = P(A|B)*P(B)
– P(A ∧ B) = P(B|A)*P(A)
Product rule
– P(A|B) = P(A ∧ B) / P(B)
– P(B|A) = P(A ∧ B) / P(A)
• Probability of all the possible values of X. Denoted by
P(X)
– Note that P is in bold
– In our example:
X = Status-of-application
Probability Distribution
Xi ∈{reject, accept, wait-list}
P(X) = <0.2, 0.3, 0.5>
• Σ P(X=xi) = 1
Joint Probability Distribution
• Joint probability distribution is a table
– Assigns probabilities to all possible assignment of values for
combinations of variables
• P(X1,X2,..,Xn) assigns probabilities to all possible assignments of values to the variables X1, X2,..,Xn
Joint Probability Distribution
• X1 = Status of your application
• X2 = Status of your friend’s application
• Then P(X1,X2):

              X1=Accept  X1=Reject  X1=Wait-list
X2=Accept       0.15       0.3        0.02
X2=Reject       0.3        0.02       0.09
X2=Wait-list    0.02       0.09       0.01
Bayes’ Rule
• Given that
– P(A ∧ B) = P(A|B)*P(B)
– P(A ∧ B) = P(B|A)*P(A)
⇒ P(B|A) = P(A|B)*P(B) / P(A)
• Determine P(B|A) given P(A|B), P(B) and P(A)
• Generalize to some background evidence e
P(Y | X, e) = P(X | Y, e) * P(Y | e) / P(X | e)
Bayes’ Rule Example
• S: Proposition that patient has stiff neck
• M: Proposition that patient has meningitis
• Meningitis causes stiff-neck, 50% of the time
• Given:
– P(S | M) = 0.5
– P(M) = 1/50,000
– P(S) = 1/20
– P(M|S) = P(S| M) * P(M) / P(S) = 0.0002
• If a patient complains about stiff-neck,
P(meningitis) only 0.0002
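Plugging the slide's numbers into Bayes' rule:

```python
# Bayes' rule on the meningitis / stiff-neck example above.
p_s_given_m = 0.5      # P(S | M): meningitis causes stiff neck 50% of the time
p_m = 1 / 50_000       # P(M): prior of meningitis
p_s = 1 / 20           # P(S): prior of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s   # P(M | S) = P(S | M) P(M) / P(S)
print(round(p_m_given_s, 6))  # → 0.0002
```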
Bayes’ Rule
• How can it help us?
– P(A|B) may be causal knowledge, P(B|A) diagnostic knowledge
– E.g., A is symptom, B is disease
• Diagnostic knowledge may vary:
– Robustness by allowing P(B | A) to be computed from others
Bayes’ Rule Use
• P(S | M) is causal knowledge, does not change
– It is “model based”
– It reflects the way meningitis works
• P(M | S) is diagnostic; tells us likelihood of M given
symptom S
– Diagnostic knowledge may change with circumstance, thus helpful
to derive it
– If there is an epidemic, probability of Meningitis goes up; rather
than again observing P(M | S), we can compute it
Computing the denominator: P(S)
We wish to avoid computing the denominator in the
Bayes’ rule
– May be hard to obtain
– Introduce 2 different techniques to compute (or avoid
computing P(S))
Computing the denominator:
#1 approach - compute relative likelihoods:
• If M (meningitis) and W(whiplash) are two possible
explanations:
– P(M|S) = P(S| M) * P(M) / P(S)
– P(W|S) = P(S| W) * P(W)/ P(S)
• P(M|S)/P(W|S) = [P(S|M) * P(M)] / [P(S|W) * P(W)]
• Disadvantages:
– Not always enough
– Possibility of many explanations
#2 approach - Using M & ~M:
• Checking the probability of M, ~M when S
– P(M|S) = P(S| M) * P(M) / P(S)
– P(~M|S) = P(S| ~M) * P(~M)/ P(S)
• P(M|S) + P(~M | S) = 1 (must sum to 1)
– [P(S|M)*P(M)/ P(S) ] +
[P(S|~M) * P(~M)/P(S)] = 1
– P(S|M) * P(M) + P(S|~M) * P(~M) = P(S)
• Calculate P(S) in this way…
The #2 approach is actually normalization:
• 1/P(S) is a normalization constant
– Must ensure that the computed probability values sum to 1
– For instance: P(M|S)+P(~M|S) must sum to 1
• Compute:
– (a) P(S|~M) * P(~M)
– (b) P(S | M) * P (M)
– (a) and (b) are numerators, and give us “un-normalized
values”
– We could compute those values and then scale them so that
they sum to 1
Simple Example
• Suppose two identical boxes
• Box1:
– colored red from inside
– has 1/3 black balls, 2/3 red balls
• Box2:
– colored black from inside
– has 1/3 red balls, 2/3 black balls
• We select one Box at random; we can’t tell how it is colored
inside.
• What is the probability that Box is red inside?
Applying Bayes’ Rule
What if we were to select a ball at random from the Box, and it is red,
Does that change the probability?
P(Red-box | Red-ball) = P(Red-ball | Red-box) * P(Red-box) / P(Red-ball)
= (2/3 * 0.5) / P(Red-ball)
How to calculate P(Red-ball)?
P(Black-box | Red-ball) = P(Red-ball | Black-box) * P(Black-box) / P(Red-ball)
= (1/3 * 0.5) / P(Red-ball)
Thus, by our approach #2: (2/3 * 0.5) / P(Red-ball) + (1/3 * 0.5) / P(Red-ball) = 1
Thus, P(Red-ball) = 0.5, and P(Red-box | Red-ball) = 2/3
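The same normalization step, written out:

```python
# Approach #2 (normalization) on the two-box example above.
p_ball_given_redbox = 2 / 3     # red ball drawn from the red box
p_ball_given_blackbox = 1 / 3   # red ball drawn from the black box
p_redbox = p_blackbox = 0.5     # a box is selected at random

# Un-normalized numerators of Bayes' rule
num_red = p_ball_given_redbox * p_redbox
num_black = p_ball_given_blackbox * p_blackbox

p_redball = num_red + num_black               # the denominator P(Red-ball)
p_redbox_given_redball = num_red / p_redball  # normalize
print(round(p_redball, 3), round(p_redbox_given_redball, 4))  # → 0.5 0.6667
```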
Absolute and Conditional Independence
• Absolute: P(X|Y) = P(X) or P(X ∧ Y) = P(X)P(Y)
• Conditional: P(A ∧ B | C) = P(A | C) P(B | C)
• P(A | B ∧ C)
– If A and B are conditionally independent given C, then the probability of A does not depend on B
– P(A | B ∧ C) = P(A | C)
• E.g., two independent sensors S1 and S2 and a jammer J1
– P(Si) = probability Si can read without jamming
– P(S1 | J1 ∧ S2) = P(S1 | J1)
Combining Evidence
• Example:
– S: Proposition that patient has stiff neck
– H: Proposition that patient has severe headache
– M: Proposition that patient has meningitis
– Meningitis causes stiff-neck, 50% of the time
– Meningitis causes head-ache, 70% of the time
• probability of Meningitis should go up, if both symptoms
reported
• How to combine such symptoms?
Combining Evidence
• P(C| A ∧ B) = P(C ∧ A ∧ B) / P ( A ∧ B)
• Numerator:
– P(C ∧ A ∧ B) = P(B | A ∧ C) * P(A ∧ C)
= P(B | C) * P(A ∧ C)    (B is conditionally independent of A given C)
= P(B | C) * P(A | C) * P(C)
• Going back to our example:
P(M | S ∧ H) = P(S | M) * P(H | M) * P(M) / P(S ∧ H)
Artificial Intelligence
Lesson 10
(From Russell & Norvig)
Introduction
• Why ANN
Try to imitate the computational abilities of the human brain.
Some tasks can be done easily (effortlessly) by humans but are hard for conventional algorithmic paradigms on a von Neumann machine:
• Pattern recognition (e.g., recognition of old friends or simply a hand-written character)
• Content addressable recall (ASSOCIATIVE MEMORIES)
• Approximate, common sense reasoning (e.g., driving in busy streets,
deciding what to do when we miss the bus)
These tasks are often ill-defined and experience-based; it is hard to apply logic to them.
Introduction
Von Neumann machine:
• One or a few high-speed (ns) processors with considerable computing power
• One or a few shared high-speed buses for communication
• Sequential memory access by address
• Problem-solving knowledge is separated from the computing component
• Hard to be adaptive

Human Brain:
• Large # (10^11) of low-speed (ms) processors with limited computing power
• Large # (10^15) of low-speed connections
• Content-addressable recall (CAM)
• Problem-solving knowledge resides in the connectivity of neurons
• Adaptation by changing the connectivity
• Biological neural activity
– Each neuron has a body, an axon, and many dendrites
• Can be in one of the two states: firing and rest.
• Neuron fires if the total incoming stimulus exceeds the threshold
– Synapse: thin gap between axon of one neuron and dendrite
of another.
• Signal exchange
• Synaptic strength/efficiency
Introduction
• What is an (artificial) neural network
– A set of nodes (units, neurons, processing elements)
• Each node has input and output
• Each node performs a simple computation by its node
function
– Weighted connections between nodes
• Connectivity gives the structure/architecture of the net
• What can be computed by a NN is primarily determined
by the connections and their weights
– A very much simplified version of networks of
neurons in animal nerve systems
ANN Neuron Models
General neuron model
• Each node has one or more
inputs from other nodes, and
one output to other nodes
• Input/output values can be
– Binary {0, 1}
– Bipolar {-1, 1}
– Continuous
Weighted input summation
• All inputs to one node come in
at the same time and remain
activated until the output is
produced
• Weights associated with links
• f(net) is the node function
• net = Σi=1..n wi·xi is the most popular input summation
Node Function
• Identity function: f(net) = net
• Constant function: f(net) = c
• Step (threshold) function, where c is called the threshold
• Ramp function
Node Function
• Sigmoid function
– S-shaped
– Continuous and everywhere
differentiable
– Rotationally symmetric about
some point (net = c)
– Asymptotically approach saturation points
– Examples: when y = 0 and z = 0: a = 0, b = 1, c = 0. A larger x gives a steeper curve.
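The node functions listed above can be sketched directly. The parameter names (threshold c, steepness x) follow the slide's notation; the default values are illustrative assumptions:

```python
import math

def step(net, c=0.0):
    # Step (threshold) function: output 1 iff the summed input exceeds c
    return 1.0 if net > c else 0.0

def ramp(net, lo=0.0, hi=1.0):
    # Ramp function: linear between lo and hi, saturated outside
    return max(lo, min(hi, net))

def sigmoid(net, x=1.0, c=0.0):
    # S-shaped, everywhere differentiable, rotationally symmetric about
    # net = c, asymptotically approaching its saturation points 0 and 1;
    # a larger x gives a steeper curve
    return 1.0 / (1.0 + math.exp(-x * (net - c)))
```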
Perceptron
• The purpose: examples classification:
• A perceptron with N input lines gets an
example (x1,…, xn) as input, where each xi is an attribute value.
• Result = f(x1,…, xn)
• If the result>threshold, return 1, otherwise 0.
• Note: a perceptron works only for functions that are linearly separable…
Perceptrons
• A simple perceptron
– Structure:
• Single output node with threshold function
• n input nodes with weights wi, {i = 1 to n}
– To classify input patterns into one of the two classes
(depending on whether output = 0 or 1)
– Example: input patterns: (x1, x2)
• Two groups of input patterns
(0, 0) (0, 1) (1, 0) (-1, -1);
(2.1, 0) (0, -2.5) (1.6, -1.6)
• Can be separated by a line on the (x1, x2) plane x1 - x2 = 2
• Classification by a perceptron with
w1 = 1, w2 = -1, threshold = 2
Perceptrons
• The step function is:
F(x) = 1 if x > 2; 0 if x ≤ 2
• Implement threshold by a node x0
– Constant output 1
– Weight w0 = - threshold
– A common practice in NN design
Perceptrons
• Linear separability
– A set of (2D) patterns (x1, x2) of two classes is linearly
separable if there exists a line on the (x1, x2) plane
• w0 + w1 x1 + w2 x2 = 0
• Separates all patterns of one class from the other class
– A perceptron can be built with
• 3 inputs x0 = 1, x1, x2 with weights w0, w1, w2
– n dimensional patterns (x1,…, xn)
• Hyperplane w0 + w1 x1 + w2 x2 +…+ wn xn = 0 dividing the
space into two regions
– Can we get the weights from a set of sample patterns?
• If the problem is linearly separable, then YES (by
perceptron learning)
• Examples of linearly separable classes
- Logical AND function (bipolar patterns)

x1  x2  output      weights: w1 = 1, w2 = 1, w0 = -1
-1  -1    -1        decision boundary: -1 + x1 + x2 = 0
-1   1    -1
 1  -1    -1        x: class I (output = 1)
 1   1     1        o: class II (output = -1)

- Logical OR function (bipolar patterns)

x1  x2  output      weights: w1 = 1, w2 = 1, w0 = 1
-1  -1    -1        decision boundary: 1 + x1 + x2 = 0
-1   1     1
 1  -1     1
 1   1     1
Perceptron Learning Algorithm
1. Initialize weights and threshold:
Set wi(t), (0 <= i <= n), to be the weight i at time t, and ø to be the threshold
value in the output node.
Set w0 to be -ø, the bias, and x0 to be always 1.
Set wi(0) to small random values, thus initializing the weights and threshold.
2. Present input and desired output
Present input x0, x1, x2, ..., xn and desired output d(t). (x0 is always1).
3. Calculate the actual output:
y(t) = fh[w0(t)x0(t) + w1(t)x1(t) + .... + wn(t)xn(t)]
4. Adapts weights
wi(t+1) = wi(t) + η[d(t) - y(t)]xi(t), where 0 < η <= 1 is a positive gain
factor (the learning rate) that controls the adaptation rate.
• Steps 3 and 4 are repeated until the iteration error is less than a user-specified
error threshold or a predetermined number of iterations has been completed.
• Note:
– It is a supervised learning
– Learning occurs only when a sample input is misclassified
(error driven)
• Termination criteria: learning stops when all samples are
correctly classified
– Assuming the problem is linearly separable
– Assuming the learning rate (η) is sufficiently small
• Choice of learning rate:
– If η is too large: existing weights are overtaken by η[d(t) - y(t)]
– If η is too small (≈ 0): very slow to converge
– Common choice: η = 0.1
• Non-numeric input:
– Different encoding schema:
ex. Color = (red, blue, green, yellow). (0, 0, 1, 0) encodes
“green”
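Steps 1-4 of the algorithm can be run end to end. A minimal sketch, trained on the bipolar AND patterns from the earlier slide with a 0/1 desired output and η = 0.1 (the initial weight range is an assumption):

```python
import random

def train_perceptron(samples, eta=0.1, epochs=100):
    random.seed(0)
    # 1. Initialize weights to small random values; w[0] is the bias (-threshold)
    w = [random.uniform(-0.05, 0.05) for _ in range(3)]
    for _ in range(epochs):
        errors = 0
        for (x1, x2), d in samples:               # 2. present input / desired output
            x = [1, x1, x2]                       # x0 = 1 implements the threshold
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0  # 3. output
            if y != d:                            # error-driven: update on mistakes
                errors += 1
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]  # 4. adapt
        if errors == 0:                           # all samples correctly classified
            break
    return w

# Bipolar AND: output 1 only for input (1, 1) — linearly separable
and_data = [((-1, -1), 0), ((-1, 1), 0), ((1, -1), 0), ((1, 1), 1)]
w = train_perceptron(and_data)
print([1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else 0 for (x1, x2), _ in and_data])
# → [0, 0, 0, 1]
```

Because AND is linearly separable and η is small, the loop is guaranteed to stop with all samples classified correctly.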
• MLP: Feedforward Networks
– A connection is allowed from a node in layer i only to nodes in layer i + 1.
– Most widely used architecture.
Network Architecture
Conceptually, nodes at higher levels successively abstract features from preceding layers
– Generalization: can a trained perceptron correctly classify
patterns not included in the training samples?
• Common problem for many NN learning models
– Depends on the quality of training samples selected.
– Also to some extent depends on the learning rate and
initial weights
Perceptron Learning Quality
– How can we know the learning is ok?
• Reserve a few samples for testing
• Examples of linearly inseparable classes
- Logical XOR (exclusive OR) function
patterns (bipolar) decision boundary
x1  x2  output
-1  -1    -1
-1   1     1
 1  -1     1
 1   1    -1

Linear Separability Again
No line can separate these two classes, as can be seen from the fact that the following linear inequality system has no solution
because we have w0 < 0 from
(1) + (4), and w0 >= 0 from
(2) + (3), which is a
contradiction
x: class I (output = 1)
o: class II (output = -1)
(1) w0 − w1 − w2 < 0
(2) w0 + w1 − w2 ≥ 0
(3) w0 − w1 + w2 ≥ 0
(4) w0 + w1 + w2 < 0
– XOR can be solved by a more
complex network with hidden
units
(Network diagram: inputs x1, x2 feed hidden units z1, z2 and output Y; weights 2 and -2, hidden thresholds = 0, output threshold = 1)

(x1, x2)  →  (z1, z2)  →  y
(-1, -1)     (-1, -1)    -1
(-1, 1)      (-1, 1)      1
(1, -1)      (1, -1)      1
(1, 1)       (-1, -1)    -1
MultiLayer NN
– Perceptron extension:
1. Hidden layer in addition to input and output layers
2. In the output layer, it’s possible to have more than one node, e.g., character classification
3. Activation function: Sigmoid functions and not a
regular step function
4. The functions can be different in each node, but, in
general, use the same function for all the nodes
5. In the input layer, it’s possible to use step function
MultiLayer NN-Purpose
• Examples classification: possible to classify more than 2 groups
• Function approximation: f: R^n → R^m.
Input layer with n nodes; output layer with m nodes
• MLP has much higher computational power than
a simple perceptron
• Can also handle functions that are not linearly separable.
Multilayer Network Learning Algorithm
Backpropagation example
(Network diagram: inputs x1, x2 feed hidden nodes x3, x4 via weights w13, w14, w23, w24; the hidden nodes feed output x5 via weights w35, w45)
Sigmoid as activation function with x=3:
• g(in) = 1/(1+e^(-3·in))
• g’(in) = 3·g(in)·(1-g(in))
Adding the threshold
(Network diagram: bias nodes x0 and x6 with constant input 1 are added; x0 feeds the hidden nodes x3, x4 with weights w03, w04, and x6 feeds the output node x5 with weight w65)
Training Set
• Logical XOR (exclusive OR) function
x1  x2  output
0   0   0
0   1   1
1   0   1
1   1   0
• Choose random weights
• <w03,w04,w13,w14,w23,w24,w65,w35,w45> =<0.03,0.04,0.13,0.14,-0.23,-0.24,0.65,0.35,0.45>
• Learning rate: 0.1 for the hidden layers, 0.3 for the output layer
First Example
• Compute the outputs
• a0 = 1 , a1= 0 , a2 = 0
• a3 = g(1*0.03 + 0*0.13 + 0*-0.23) = 0.522
• a4 = g(1*0.04 + 0*0.14 + 0*-0.24) = 0.530
• a6 = 1, a5 = g(0.65*1 + 0.35*0.522 + 0.45*0.530) = 0.961
• Calculate ∆5 = 3*g(1.0712)*(1-g(1.0712))*(0-0.961) = -0.108
• Calculate ∆6, ∆3, ∆4
• ∆6 = 3*g(1)*(1-g(1))*(0.65*-0.108) = -0.010
• ∆3 = 3*g(0.03)*(1-g(0.03))*(0.35*-0.108) = -0.028
• ∆4 = 3*g(0.04)*(1-g(0.04))*(0.45*-0.108) = -0.036
• Update weights for the output layer
• w65 = 0.65 + 0.3*1*-0.108 = 0.618
• w35 = 0.35 + 0.3*0.522*-0.108 = 0.333
• w45 = 0.45 + 0.3*0.530*-0.108 = 0.433
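The forward pass of this first example can be re-computed to check the slide's numbers (small rounding differences aside):

```python
import math

# g(in) = 1/(1+e^(-3*in)), as defined for this backpropagation example
def g(x):
    return 1 / (1 + math.exp(-3 * x))

w03, w04, w13, w14, w23, w24 = 0.03, 0.04, 0.13, 0.14, -0.23, -0.24
w65, w35, w45 = 0.65, 0.35, 0.45

a0, a1, a2, a6 = 1, 0, 0, 1                      # first training example (0, 0)
in3 = a0 * w03 + a1 * w13 + a2 * w23
in4 = a0 * w04 + a1 * w14 + a2 * w24
a3, a4 = g(in3), g(in4)
in5 = a6 * w65 + a3 * w35 + a4 * w45
a5 = g(in5)

delta5 = 3 * g(in5) * (1 - g(in5)) * (0 - a5)    # output error term, target 0
print(round(a3, 3), round(a4, 3), round(a5, 3))  # → 0.522 0.53 0.961
```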
First Example (cont)
• Calculate ∆0, ∆1, ∆2
• ∆0 = 3*g(1)*(1-g(1))*(0.03*-0.028 + 0.04*-0.036) = -0.001
• ∆1 = 3*g(0)*(1-g(0))*(0.13*-0.028 + 0.14*-0.036) = -0.006
• ∆2 = 3*g(0)*(1-g(0))*(-0.23*-0.028 + -0.24*-0.036) = 0.011
• Update weights for the hidden layer
• w03 = 0.03 + 0.1*1*-0.028 = 0.027
• w04 = 0.04 + 0.1*1*-0.036 = 0.036
• w13 = 0.13 + 0.1*0*-0.028 = 0.13
• w14 = 0.14 + 0.1*0*-0.036 = 0.14
• w23 = -0.23 + 0.1*0*-0.028 = -0.23
• w24 = -0.24 + 0.1*0*-0.036 = -0.24
Second Example
• Compute the outputs
• a0 = 1, a1= 0 , a2 = 1
• a3 = g(1*0.027 + 0*0.13 + 1*-0.23) = 0.352
• a4 = g(1*0.036 + 0*0.14 + 1*-0.24) = 0.352
• a6 = 1, a5 = g(0.618*1 + 0.333*0.352 + 0.433*0.352) = 0.935
• Calculate ∆5 = 3*g(0.888)*(1-g(0.888))*(1-0.935) = 0.012
• Calculate ∆6, ∆3, ∆4
• ∆6 = 3*g(1)*(1-g(1))*(0.618*0.012) = 0.001
• ∆3 = 3*g(-0.203)*(1-g(-0.203))*(0.333*0.012) = 0.003
• ∆4 = 3*g(-0.204)*(1-g(-0.204))*(0.433*0.012) = 0.004
• Update weights for the output layer
• w65 = 0.618 + 0.3*1*0.012 = 0.623
• w35 = 0.333 + 0.3*0.352*0.012 = 0.334
• w45 = 0.433 + 0.3*0.352*0.012 = 0.434
Second Example (cont)
• Calculate ∆0, ∆1, ∆2
• Skipped, we do not use them
• Update weights for the hidden layer
• w03 = 0.027 + 0.1*1*0.003 = 0.027
• w04 = 0.036 + 0.1*1*0.004 = 0.036
• w13 = 0.13 + 0.1*0*0.003 = 0.13
• w14 = 0.14 + 0.1*0*0.004 = 0.14
• w23 = -0.23 + 0.1*1*0.003 = -0.23
• w24 = -0.24 + 0.1*1*0.004 = -0.24
Summary
• Single layer nets have limited representation power
(linear separability problem)
• Error driven seems a good way to train a net
• Multi-layer nets (or nets with non-linear hidden
units) may overcome linear inseparability problem
Artificial Intelligence
Lesson 11
(From Russell & Norvig)
Conditional probability
• Conditional or posterior probabilities
e.g., P(cavity | toothache) = 0.8
i.e., given that toothache is all I know
• Notation for conditional distributions:
P(Cavity | Toothache) = 2-element vector of 2-element vectors
• If we know more, e.g., cavity is also given, then we have
P(cavity | toothache, cavity) = 1
• New evidence may be irrelevant, allowing simplification, e.g.,
P(cavity | toothache, sunny) = P(cavity | toothache) = 0.8
• This kind of inference, sanctioned by domain knowledge, is crucial
Inference by enumeration
• Start with the joint probability distribution:
• Can also compute conditional probabilities:
P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
= (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4
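Enumeration over the joint can be sketched directly. The toothache entries below appear in the calculation above; the no-toothache entries are the standard textbook values, which are not shown on the slide:

```python
# Joint distribution over (cavity, toothache, catch)
joint = {
    (True, True, True): 0.108,   (True, True, False): 0.012,
    (False, True, True): 0.016,  (False, True, False): 0.064,
    (True, False, True): 0.072,  (True, False, False): 0.008,
    (False, False, True): 0.144, (False, False, False): 0.576,
}

def p(pred):
    # Sum the probabilities of the atomic events satisfying the predicate
    return sum(pr for event, pr in joint.items() if pred(*event))

p_toothache = p(lambda cavity, toothache, catch: toothache)
p_nc_and_t = p(lambda cavity, toothache, catch: not cavity and toothache)
print(round(p_nc_and_t / p_toothache, 2))  # → 0.4
```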
Independence
• A and B are independent iff
P(A|B) = P(A) or P(B|A) = P(B) or P(A, B) = P(A) P(B)
P(Toothache, Catch, Cavity, Weather)
= P(Toothache, Catch, Cavity) P(Weather)
• Absolute independence powerful but rare
• Dentistry is a large field with hundreds of variables, none of which are independent. What to do?
Conditional independence
• P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries
• If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
(1) P(catch | toothache, cavity) = P(catch | cavity)
• The same independence holds if I haven't got a cavity:
(2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)
• Catch is conditionally independent of Toothache given Cavity:
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
• Equivalent statements:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
Bayesian networks
• A simple, graphical notation for conditional independence
assertions and hence for compact specification of full joint distributions
• It describes how variables interact locally
• Local interactions chain together to give global, indirect
interactions
• Syntax:
– a set of nodes, one per variable
– a directed, acyclic graph (link ≈ "directly influences")
– a conditional distribution for each node given its parents:
P(Xi | Parents(Xi)) - conditional probability table (CPT)
Example 1
• Topology of network encodes conditional independence
assertions:
P(Cavity=true) = 0.8
P(W=true) = 0.4

Cavity  P(C=true | Cavity)
T       .9
F       .05
• Weather is independent of the other variables
• Toothache and Catch are conditionally independent given Cavity
• It is usually easy for a domain expert to decide what direct influences exist
Cavity  P(T=true | Cavity)
T       .8
F       .4
Example 2
• N independent coin flips:
P(X1=true) = 0.5   P(X2=true) = 0.5   …   P(Xn=true) = 0.5
• No interactions between variables: absolute independence
• Can every Bayes net structure represent every full joint distribution?
• No. For example, only distributions whose variables are absolutely independent can be represented by a Bayes’ net with no arcs.
Calculation of Joint Probability
• How to build the Bayes net?
• Given its parents, each node is conditionally independent of everything except its descendants
• Thus,
P(x1 ∧ x2 ∧ … ∧ xn) = Πi=1..n P(xi | parents(Xi))
⇒ full joint distribution table
• Every BN over a domain implicitly represents some joint distribution over that domain
Example 3
• I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?
• Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
• Network topology reflects "causal" knowledge:
– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call
Example contd.
For example, what is the probability that there is a burglary, an earthquake, the alarm goes off, John calls, and Mary doesn't?
P(b,e,a,j,¬m) = P(b)·P(e)·P(a|b,e)·P(j|a)·P(¬m|a)
Answering queries
• I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?
– P(b|j,¬m) = P(b,j,¬m) / P(j,¬m)   (based on P(a|b) = P(a,b)/P(b))
– P(b,j,¬m) = P(b,e,a,j,¬m) + P(b,¬e,a,j,¬m) + P(b,e,¬a,j,¬m) + P(b,¬e,¬a,j,¬m) =
P(b)P(e)P(a|b,e)P(j|a)P(¬m|a) +
P(b)P(¬e)P(a|b,¬e)P(j|a)P(¬m|a) +
P(b)P(e)P(¬a|b,e)P(j|¬a)P(¬m|¬a) +
P(b)P(¬e)P(¬a|b,¬e)P(j|¬a)P(¬m|¬a)
– Do the same to calculate P(¬b,j,¬m) and normalize:
P(b|j,¬m) + P(¬b|j,¬m) = 1
⇒ P(b,j,¬m) + P(¬b,j,¬m) = P(j,¬m)
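The enumeration can be sketched as code. The CPT numbers below are an assumption: they are the standard textbook (Russell & Norvig) values for this network and do not appear on the slide:

```python
from itertools import product

P_b, P_e = 0.001, 0.002                              # priors (assumed values)
P_a = {(True, True): 0.95, (True, False): 0.94,      # P(a | b, e)
       (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.90, False: 0.05}                      # P(j | a)
P_m = {True: 0.70, False: 0.01}                      # P(m | a)

def joint(b, e, a, j, m):
    # P(b,e,a,j,m) = P(b) P(e) P(a|b,e) P(j|a) P(m|a)
    pb = P_b if b else 1 - P_b
    pe = P_e if e else 1 - P_e
    pa = P_a[(b, e)] if a else 1 - P_a[(b, e)]
    pj = P_j[a] if j else 1 - P_j[a]
    pm = P_m[a] if m else 1 - P_m[a]
    return pb * pe * pa * pj * pm

# Sum out e and a for each value of b, with evidence j = true, m = false
num = {b: sum(joint(b, e, a, True, False)
              for e, a in product([True, False], repeat=2))
       for b in (True, False)}
p_b_given_j_notm = num[True] / (num[True] + num[False])  # normalize
print(round(p_b_given_j_notm, 4))
```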
Laziness and Ignorance
• The probabilities actually summarize a potentially infinite set of circumstances in which the alarm might fail to go off:
– high humidity
– power failure
– dead battery
– cut wires
– a dead mouse stuck inside the bell
• John or Mary might fail to call and report it:
– out to lunch
– on vacation
– temporarily deaf
– passing helicopter
Compactness
• A CPT for Boolean Xi with k Boolean parents has 2^k rows for the
combinations of parent values
• Each row requires one number p for Xi = true (the number for Xi = false is just 1-p)
• If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers
• I.e., grows linearly with n, vs. O(2^n) for the full joint distribution
• For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)
• We utilize the property of a locally structured system: local connections between the variables, so each variable won't have too many parents.
The number of parents depends on how the net is built.
Worst case: n · 2^k is on the order of 2^n.
Causality?
• Rain (a) causes Traffic (b)
• Let’s build the joint: p(a,b)=p(a|b)*p(b)=p(b|a)*p(a)
Reverse Causality?
• Both nets are legal, but the previous one is preferred:
Rain causes traffic in general, though there is a connection between traffic and rain….
Ram Meshulam 2004204
Causality?
• What do the arrows really mean?
• Topology may happen to encode causal structure
• Topology really encodes conditional independencies
• When Bayes’ nets reflect the true causal patterns:
– Often simpler (nodes have fewer parents)
– Often easier to think about
– Often easier to elicit from experts
• BNs need not actually be causal
– Sometimes no causal net exists over the domain
– E.g. consider the variables Traffic and RoofDrips
– End up with arrows that reflect correlation, not causation
Example 2, Again
What if the net is built not in a logical order? The net looks much more complicated.
Consider the following 2 orders for insertion:
• (a) MaryCalls, JohnCalls, Alarm, Burglary, Earthquake
– Since P(Burglary | Alarm, JohnCalls, MaryCalls) = P(Burglary | Alarm)
• (b) MaryCalls, JohnCalls, Earthquake, Burglary, Alarm
Connection Types

Name            Diagram      X ind. Z?        X ind. Z, given Y?
Causal chain    B → A → M    Not necessarily  Yes
Common cause    J ← A → M    No               Yes
Common effect   B → A ← E    Yes              No
Test Question

Network: H → G; H, G → R; R → J
H - Hardworking
G - Good Grader
R - Excellent Recommendation
J - Landed a good Job

P(H=true) = 0.1

H | P(G=true | H)
T | 0.4
F | 0.8

H     G     | P(R=true | H, G)
false false | 0.2
false true  | 0.9
true  false | 0.3
true  true  | 0.8

R     | P(J=true | R)
false | 0.2
true  | 0.7
What can be inferred?
i:   P(H,G) = P(H)·P(G)
ii:  P(J | R,H) = P(J | R)
iii: P(J) ≠ P(J | H)
Q: What is the value of P(H,G,¬R,¬J)?
A: P(H,G,¬R,¬J) = P(H)·P(G|H)·P(¬R|H,G)·P(¬J|H,G,¬R)
 = P(H)·P(G|H)·P(¬R|H,G)·P(¬J|¬R) = 0.1 · 0.4 · 0.2 · 0.8 = 0.0064
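The chain-rule product can be spelled out in a few lines, reading each factor straight off the CPTs above:

```python
# P(H, G, ~R, ~J) as a product of one CPT entry per variable.
p_H = 0.1                    # P(H)
p_G_given_H = 0.4            # P(G | H=true)
p_notR_given_HG = 1 - 0.8    # P(~R | H, G) = 1 - P(R | H, G)
p_notJ_given_notR = 1 - 0.2  # P(~J | ~R); J depends only on R

p = p_H * p_G_given_H * p_notR_given_HG * p_notJ_given_notR
print(p)   # ~0.0064
```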
Q: What if we want to add another parameter, C = Has The Right Connections?
Answer

P(H=true) = 0.1
P(C=true) = ???

H | P(G=true | H)
T | 0.4
F | 0.8

C     H     G     | P(R=true | H, G, C)
false false false | ??
false false true  | ??
false true  false | ??
false true  true  | ??
true  false false | ??
true  false true  | ??
true  true  false | ??
true  true  true  | ??

R     | P(J=true | R)
false | 0.2
true  | 0.7
Reachability (the Bayes Ball)
Given a Bayes net, a source node and a target node, are these two nodes independent?
• Shade evidence nodes (things that happened)
• Start at the source node
• Try to reach the target by search
• States: a node, along with the previous arc
• Successor function:
– Unobserved nodes:
• To any child of X
• To any parent of X if the ball is coming from a child
– Observed nodes:
• From a parent of X to a parent of X
• If you can't reach a node, it's conditionally independent of the start node. If there is a path, they are probably dependent.
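The successor rules above amount to the standard d-separation reachability test. A minimal sketch, assuming the net is given as a parent map; the function names and the alarm-net example are illustrative:

```python
# Bayes-ball style reachability: a node is reachable if some active
# trail connects it to the source, given the evidence set.
from collections import deque

def reachable(parents, source, evidence):
    # Build the children map from the parent map.
    children = {v: [] for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].append(v)
    # Ancestors of evidence (including the evidence itself): these are
    # the nodes whose v-structures are activated.
    anc, stack = set(), list(evidence)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents[v])
    visited, result = set(), set()
    queue = deque([(source, "up")])   # state = (node, previous arc direction)
    while queue:
        node, d = queue.popleft()
        if (node, d) in visited:
            continue
        visited.add((node, d))
        if node not in evidence:
            result.add(node)
        if d == "up" and node not in evidence:
            # Arrived from a child: may go up to parents or down to children.
            for p in parents[node]:
                queue.append((p, "up"))
            for c in children[node]:
                queue.append((c, "down"))
        elif d == "down":
            if node not in evidence:          # chain continues downward
                for c in children[node]:
                    queue.append((c, "down"))
            if node in anc:                   # active v-structure: bounce up
                for p in parents[node]:
                    queue.append((p, "up"))
    return result

def independent(parents, x, y, evidence=frozenset()):
    return y not in reachable(parents, x, set(evidence))

# The alarm net: B and E are marginally independent, but dependent given A.
alarm = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
print(independent(alarm, "B", "E"))        # True
print(independent(alarm, "B", "E", {"A"})) # False (explaining away)
```

This reproduces the three connection types: chains and common causes are blocked by observing the middle node, while a common effect is opened by it.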
Example
• L ind. T’, given T?
Yes
• L ind. B?
Yes
• L ind. B, given T?
No
• L ind. B, given T’?
No
• L ind. B, given T and R?
Yes
Naïve Bayes
• Conditional Independence Assumption: features are independent of each other given the class:

P(X1, …, Xn | C) = P(X1 | C) · P(X2 | C) · … · P(Xn | C)

(net: a single class node C is the parent of every feature X1, X2, …, Xn)

• What can we model with naïve Bayes? Any process where:
• Each cause has lots of "independent" effects
• It is easy to estimate the CPT for each effect
• We want to reason about the probability of different causes given observed effects
Naive Bayes Classifiers
Task: classify a new instance D, given by a tuple of attribute values, into one of the classes cj ∈ C

D = ⟨x1, x2, …, xn⟩

cMAP = argmax_{c ∈ C} P(c | x1, x2, …, xn)

By Bayes' rule:

cMAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c) / P(x1, x2, …, xn)

Since the denominator is fixed:

cMAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c)
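The argmax above, combined with the conditional-independence assumption, fits in a few lines of counting. A minimal sketch; the toy weather data is invented for illustration:

```python
# Minimal naive Bayes: estimate P(c) and P(x_i | c) by counting,
# then classify with argmax_c P(c) * prod_i P(x_i | c).
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (attribute_tuple, class_label) pairs."""
    class_counts = Counter(c for _, c in examples)
    attr_counts = defaultdict(Counter)   # per class: (i, value) -> count
    for xs, c in examples:
        for i, x in enumerate(xs):
            attr_counts[c][(i, x)] += 1
    priors = {c: n / len(examples) for c, n in class_counts.items()}

    def classify(xs):
        def score(c):
            p = priors[c]
            for i, x in enumerate(xs):
                p *= attr_counts[c][(i, x)] / class_counts[c]
            return p
        return max(priors, key=score)
    return classify

data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rain", "mild"), "yes"), (("rain", "cool"), "yes"),
        (("overcast", "hot"), "yes")]
classify = train(data)
print(classify(("rain", "mild")))   # "yes"
```

In practice the per-attribute estimates are usually smoothed (e.g. add-one) so an unseen attribute value does not zero out a whole class; that refinement is omitted here.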
Summary
• Bayesian networks provide a natural
representation for (causally induced)
conditional independence
• Topology + CPTs = compact representation
of joint distribution
• Generally easy for domain experts to
construct
Artificial Intelligence
Lesson 12
Robotics, a Case Study - Coverage
• Many applications:
– Floor cleaning, mowing, de-mining, …
• Many approaches:
– Off-line (getting a map in advance) or On-line
– Heuristic or Complete (promising complete coverage)
• Multi-robot, motivated by robustness and efficiency
Robots Environment Parameters
• Dynamic vs Static: outside influence on the environment.
• Accessible vs Inaccessible: what the robot can sense.
• Non-Deterministic vs Deterministic: expected result of an action.
• Discrete vs Continuous: possible values of actions and percepts.
Environment Assumptions
• Static - to be able to guarantee completeness
• Inaccessible - greater impact on the on-line version
• Non-deterministic (commanded to move 5m, but may actually move 5.1m)
• Continuous: actions and percepts have continuous values
– Exact cellular decomposition: exact shapes, not necessarily of the same size
– Approximate cellular decomposition: squares of the same size
MSTC - Multi Robot Spanning Tree Coverage
• Complete - with approximate cellular decomposition
• Robust
– Coverage is completed as long as one robot is alive
– The robustness mechanism is simple
• Off-line and On-line algorithms
– Off-line:
o Analysis according to initial positions
o Efficiency improvements
– On-line:
o Implemented on a simulation of real robots
Off-line Coverage, Basic Assumptions
• Area division – n cells
• k homogeneous robots
• Equal associated tool size
• Robots movement
STC: Spanning Tree Coverage (Gabriely and Rimon, 2001)
• Area division
• Graph definition
• Building the spanning tree
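The tree-building step of STC can be sketched directly. Only the spanning tree itself is shown here; the circumnavigation that turns it into the actual coverage path is omitted, and the grid shape and coordinates are illustrative:

```python
# STC initialization sketch: coarse cells are sized 2x2 tool steps, and
# a spanning tree over them is built with BFS.  The coverage path then
# circumnavigates this tree, visiting each of the 4n sub-cells once.
from collections import deque

def spanning_tree(cells, root):
    """cells: set of (x, y) coarse cells; returns the set of tree edges."""
    edges, seen, queue = set(), {root}, deque([root])
    while queue:
        x, y = queue.popleft()
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in cells and nxt not in seen:
                seen.add(nxt)
                edges.add(((x, y), nxt))
                queue.append(nxt)
    return edges

cells = {(x, y) for x in range(4) for y in range(3)}   # 12 coarse cells
tree = spanning_tree(cells, (0, 0))
print(len(tree), 4 * len(cells))   # n-1 = 11 tree edges; 48 sub-cells to cover
```

Any spanning tree works for completeness; which tree is chosen matters only for multi-robot efficiency, as the later slides discuss.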
Non-backtracking MSTC
• Initialization phase: Build STC, distribute to robots
• Distributed execution: Each robot follows its section
– Low risk of collisions
(figure: robots A, B and C each follow their own section of the tree and announce "Robot A/B/C is done!")
Guaranteed Robustness
• Coverage completed as long as one robot is alive
• Low communication, no need for re-allocation
Analysis: Non-backtracking MSTC
• Running time = max over robots i = 1..k of step(i)
• Best case: n/k − 1
• Worst case: n − k
– Unfortunately, the common case
• General non-backtracking worst case: n − 2(k−1) − 1
Backtracking MSTC
• Similar initialization phase
• Robots backtrack to assist others
• No point is covered more than twice
Backtracking MSTC (cont.)
• Same robustness mechanism
• Same communication requirements
(figure: robots announce "Robot A/B/C is done!" as they finish their sections)
Backtracking MSTC Analysis
• Best case: the same, n/k − 1
• Worst case:
– k = 2: 2n/3 − 1
– k > 2: n/2
Efficiency in Off-line Coverage
• Optimal MSTC - improves the average case
• Heterogeneous robots - flexibility
• Optimal spanning tree - improves the worst case
Optimal MSTC
• Similar initialization phase
• Robots backtrack to assist others:
– All the robots can backtrack
– Backtracking on any number of steps
• No point is covered more than twice
• Same robustness mechanism
• Same communication requirements
Optimal MSTC (cont.)
• Choose a robot
• Search for the minimum valid solution
– Left search
– Right search
• Complexity:
– Check all the robots: k
– Each search: O(n log n)
– Validity check: O(k)
– Total: O(k² · n log n)
Heterogeneous Robots
• Different speeds
– Non-backtracking MSTC
– Backtracking MSTC
– Optimal MSTC ✓
• Different fuel/battery time
– Non-backtracking MSTC
– Backtracking MSTC
– Optimal MSTC ✓
Optimal Spanning tree
• Improves the worst case with all 3 algorithms
• The construction is believed to be NP-Hard
(figure: two alternative trees, (a) and (b), over robots R1, R2, R3)
Generating a Good Spanning Tree (Believed to be NP-Hard)
(figure: two spanning trees over robots A, B, C)
One tree: A to B = 12 cells, B to C = 12 cells, C to A = 12 cells
The other: A to B = 28 cells, B to C = 4 cells, C to A = 4 cells
A Heuristic Solution
• Build k subtrees on coarse grid
– Start building subtrees from initial locations
– Add cells to each subtree gradually
– Spread away from other robots (based on Manhattan dist)
• Connect subtrees
– Randomly pick connections between subtrees
– Calculate x in the resulting tree
– Repeat k^a times (a is a parameter)
– Report the tree yielding minimal x
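A deliberately simplified version of stage 1: the actual heuristic grows the k subtrees cell by cell so each region stays connected, while this sketch just assigns every coarse cell to the nearest robot by Manhattan distance. The grid size and robot positions are illustrative:

```python
# Simplified region seeding: partition coarse cells among robots by
# Manhattan distance to each robot's initial location (ties go to the
# lower robot index).  The real algorithm grows subtrees gradually.
def assign_regions(cells, robots):
    dist = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
    regions = {i: set() for i in range(len(robots))}
    for cell in cells:
        nearest = min(range(len(robots)), key=lambda i: dist(cell, robots[i]))
        regions[nearest].add(cell)
    return regions

cells = {(x, y) for x in range(6) for y in range(4)}   # 24 coarse cells
regions = assign_regions(cells, [(0, 0), (5, 3)])      # two robots, far corners
print(sorted(len(r) for r in regions.values()))        # every cell assigned once
```

Robots starting far apart naturally end up with roughly balanced regions, which is exactly the effect the "spread away from other robots" rule aims for.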
Illustration – Stage 1
(figure: subtrees growing from the robots' initial locations; each cell joins the subtree with the minimal distance, e.g. Min{3,4} = 3, Min{1,2} = 1, Min{2,3} = 2)
Example
(figure: an example resulting tree with X values 13, 16, 17, 16, 16)
On-line MSTC
• Same basic assumptions:
– Area decomposition- n cells
– k homogeneous robots
– Equal tool size and robot movements
• All the robots know their absolute initial position
• Initialization phase
1. Agreed-upon grid construction
2. Self-localization
3. Locations update
On-line MSTC (Cont.)
Guaranteed Robustness
• Coverage completed as long as one robot is alive
• No need for re-allocation
From Theory to Practice
• Player/Stage with modeled RV-400 robots
• Localization solutions
– GPS
– Odometry with limited errors
• Agreed-upon grid options
– Big enough work-area
– Dynamic work-area
• Collision avoidance with bumps
– Random wait
– Communication based
• Limited sensors solution
Off-line Algorithms Experiments (1)
• Work area: 30X20 cells, 2400 sub-cells
• Each point represents 100 trials
(graph: coverage time vs. number of robots (1-31) for non-backtracking-random, backtracking-random, optimal-random, and the best case)
Off-line Algorithms Experiments (2)
• Work area: 30X20 cells with 80 holes, 2080 sub-cells
• Each point represents 100 trials
(graph: coverage time vs. number of robots (1-31) for non-backtracking-random, backtracking-random, optimal-random, and the best case)
Experimental Results
(graphs: coverage time vs. number of robots for non-backtracking, backtracking and optimal variants, with random and Best-STC trees, against the best case)
Experimental Results - 27% Obstacles
On-line Algorithm Run-time Example
On-line Algorithm Experiments
• Random places
• Each point represents 10 trials
(graph: coverage time vs. number of robots (2-10) in outdoor and indoor environments)
Conclusion
• Complete and robust multi-robot algorithms
• Redundancy vs. efficiency with off-line algorithms
• Optimal MSTC, which handles heterogeneous robots
• Implemented on-line MSTC with approximation techniques
Recommended