1. Artificial Intelligence: Bayesian Networks

2. Graphical Models
If no assumption of independence is made, an exponential number of parameters must be estimated for sound probabilistic inference, and no realistic amount of training data is sufficient to estimate that many parameters.
If a blanket assumption of conditional independence is made, efficient training and inference are possible, but such a strong assumption is rarely warranted.
Graphical models use directed or undirected graphs over a set of random variables to specify variable dependencies explicitly. This allows less restrictive independence assumptions while still limiting the number of parameters that must be estimated.
  Bayesian networks: directed acyclic graphs that indicate causal structure.
  Markov networks: undirected graphs that capture general dependencies.

3. Bayesian Networks
Directed acyclic graph (DAG).
  Nodes are random variables.
  Edges indicate causal influences.

4. Conditional Probability Tables
Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents (each combination is a conditioning case).
Roots (sources) of the DAG, which have no parents, are given prior probabilities.
[Figure: example network with nodes Burglary, Earthquake, Alarm, JohnCalls, MaryCalls]

5. CPT Comments
The probability of false is not given, since each row must sum to 1.
The example requires 10 parameters rather than 2^5 - 1 = 31 for specifying the full joint distribution.
The number of parameters in the CPT for a node is exponential in the number of parents (fan-in).

6. Joint Distributions for Bayes Nets
A Bayesian network implicitly defines a joint distribution:
  P(x1, x2, …, xn) = ∏ (i = 1..n) P(xi | Parents(Xi))
[Example: equation image not recovered]
Therefore an inefficient approach to inference is:
  1) Compute the joint distribution using this equation.
  2) Compute any desired conditional probability using the joint distribution.

7. Naïve Bayes as a Bayes Net
Naïve Bayes is a simple Bayes net.
[Figure: class node Y with child nodes X1, X2, …, Xn]
The priors P(Y) and conditionals P(Xi | Y) for Naïve Bayes provide the CPTs for the network.

8. Independencies in Bayes Nets
If removing a subset of nodes S from the network renders nodes Xi and Xj disconnected,
then Xi and Xj are independent given S, i.e. P(Xi | Xj, S) = P(Xi | S).
However, this is too strict a criterion for conditional independence: two nodes should still be considered independent even if there exists some variable that depends on both (a common effect). For example, Burglary and Earthquake should be considered independent even though they both cause Alarm.

9. Independencies in Bayes Nets
P(Xi | Xj, S) = P(Xi | S) is equivalent to P(Xi, Xj | S) = P(Xi | S) P(Xj | S).
How to prove?

11. Independencies in Bayes Nets (cont.)
Unless we know something about a common effect of two “independent causes” or
a descendant of a common effect, they can be considered independent. For example, if we know nothing else, Earthquake and Burglary are independent.
However, if we have information about a common effect (or a descendant thereof), the two “independent” causes become probabilistically linked, since evidence for one cause can “explain away” the other. For example, if we know that the alarm went off, or that someone called about the alarm, then Earthquake and Burglary become dependent: evidence for Earthquake decreases belief in Burglary, and vice versa.

12. Bayes Net Inference
Given known values for some evidence variables, determine the posterior probability of some query variables.
Example: Given that John calls, what is the probability that there is a Burglary?
[Figure: network with nodes Burglary, Earthquake, Alarm, JohnCalls, MaryCalls; the query node is marked “?”]
John calls 90% of the time when there is an Alarm, and the Alarm detects 94% of Burglaries, so people generally think the probability should be fairly high. However, this ignores the prior probability of John calling.

13. Bayes Net Inference
Example: Given that John calls, what is the probability that there is a Burglary?
John also calls 5% of the time when there is no Alarm. So over 1,000 days we expect 1 Burglary, and John will probably call. However, he will also call with a false report 50 times on average. So a call is about 50 times more likely to be a false report: P(Burglary | JohnCalls) ≈ 0.02.

14. Bayes Net Inference
Example: Given that John
calls, what is the probability that there is a Burglary?
The actual probability of Burglary is 0.016, since the alarm is not perfect (an Earthquake could have set it off, or it could have gone off on its own). On the other hand, even if there was no alarm and John called incorrectly, there could have been an undetected Burglary anyway, but this is unlikely.

15. Types of Inference
[Figure not recovered]

16. Sample Inferences
Diagnostic (evidential, abductive): from effect to cause.
  P(Burglary | JohnCalls) = 0.016
  P(Burglary | JohnCalls ∧ MaryCalls) = 0.29
  P(Alarm | JohnCalls ∧ MaryCalls) = 0.76
  P(Earthquake | JohnCalls ∧ MaryCalls) = 0.18
Causal (predictive): from cause to effect.
  P(JohnCalls | Burglary) = 0.86
  P(MaryCalls | Burglary) = 0.67
Intercausal (explaining away): between causes of a common effect.
  P(Burglary | Alarm) = 0.376
  P(Burglary | Alarm ∧ Earthquake) = 0.003
Mixed: two or more of the above combined.
  (diagnostic and causal) P(Alarm | JohnCalls ∧ ¬Earthquake) = 0.03
  (diagnostic and intercausal) P(Burglary | JohnCalls ∧ ¬Earthquake) = 0.017

17. Sample Inferences
Assignment: Calculate these results!

18. Probabilistic Inference in Humans
People
are notoriously bad at doing correct probabilistic reasoning in certain cases. One problem is that they tend to ignore the influence of the prior probability of a situation.

19. Monty Hall Problem
[Figure: doors 1, 2, 3]
Online demo: http://math.ucsd.edu/crypto/Monty/monty.html

20. Multiply Connected Networks
Networks with undirected loops, with more than one directed path between some pair of nodes.
In general, inference in such networks is NP-hard.
Some methods construct one or more polytrees from the given network and perform inference on the transformed graph.

21. Node Clustering
Eliminate all loops by merging nodes to create meganodes that have the cross-product of the values of the merged nodes.
The number of values for a merged node is exponential in the number of nodes merged.
Still reasonably tractable for many network topologies that require relatively little merging to eliminate loops.

22. Bayes Nets Applications
Medical diagnosis
  The Pathfinder system outperforms leading experts in diagnosis of lymph-node disease.
Microsoft applications
  Problem diagnosis: printer problems
  Recognizing user intents for HCI
Text categorization and spam filtering
Student modeling for intelligent tutoring systems

23. Statistical Revolution
Across AI there has been a movement from logic-based approaches to approaches based on probability and statistics.
  Statistical natural language processing
  Statistical computer vision
  Statistical robot navigation
  Statistical learning
Most approaches are feature-based and “propositional” and do not handle complex
relational descriptions with multiple entities like those typically requiring predicate logic.

Structured (Multi-Relational) Data
In many domains, data consists of an unbounded number of entities with an arbitrary number of properties and relations between them.
  Social networks
  Biochemical compounds
  Web sites

25. Biochemical Data
Predicting mutagenicity (Srinivasan et al., 1995)

Web-KB Dataset (Slattery & Craven, 1998)
[Figure: page classes Faculty, Grad Student, Research Project, Other]

Collective Classification
Traditional learning methods assume that objects to be classified are independent (the first “i” in the i.i.d. assumption).
In structured data, the class of an entity can be influenced by the classes of related entities.
Need to assign classes to all objects simultaneously to produce the most probable globally consistent interpretation.

Logical AI Paradigm
Represents knowledge and data in a binary symbolic logic such as FOPC.
  + Rich representation that handles arbitrary sets of objects, with properties, relations, quantifiers, etc.
  - Unable to handle uncertain knowledge and probabilistic reasoning.

Probabilistic AI Paradigm
Represents knowledge and data as a fixed set of random variables with a joint probability distribution.
  + Handles uncertain knowledge and probabilistic reasoning.
  - Unable to handle arbitrary sets of objects, with properties, relations, quantifiers, etc.

30. Statistical Relational Models
Integrate methods from predicate logic (or relational databases) and probabilistic graphical models to handle structured, multi-relational data.
  Probabilistic Relational Models (PRMs)
  Stochastic Logic Programs (SLPs)
  Bayesian Logic Programs (BLPs)
  Relational Markov Networks (RMNs)
  Markov Logic Networks (MLNs)
  Other TLAs

31. Conclusions
Bayesian learning methods are firmly based on probability theory and exploit advanced methods developed in statistics.
Naïve Bayes is a simple generative model that works fairly well in practice.
A Bayesian network allows specifying a limited set of dependencies using a directed graph.
Inference algorithms allow determining the probability of values for query variables given values for evidence variables.
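
Worked example (illustrative sketch). The last conclusion, and the "inefficient approach" from slide 6, can be made concrete by brute-force enumeration on the burglary network: build the joint distribution as a product of CPT entries, then answer a query by summing the entries consistent with it. The CPT figure is not reproduced in the text, so the values marked "assumed" below are the usual textbook numbers for this example and may not match the original figure; only the 90%, 5% and 94% figures and the roughly 1-in-1,000 burglary rate are stated in the slides. The helper names (bern, joint, prob, query) are made up for this sketch.

```python
from itertools import product

# CPT entries for the burglary network. Values marked "assumed" are NOT given
# in the slides; they are the usual textbook numbers for this example.
P_B = 0.001                      # P(Burglary): about 1 per 1,000 days
P_E = 0.002                      # P(Earthquake), assumed
P_A = {(True, True): 0.95,       # P(Alarm | Burglary, Earthquake), assumed
       (True, False): 0.94,      # Alarm detects 94% of Burglaries
       (False, True): 0.29,      # assumed
       (False, False): 0.001}    # assumed
P_J = {True: 0.90, False: 0.05}  # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}  # P(MaryCalls | Alarm), assumed

VARS = ("B", "E", "A", "J", "M")


def bern(p, value):
    """Probability that a Boolean variable takes `value` when P(true) = p."""
    return p if value else 1.0 - p


def joint(b, e, a, j, m):
    """Joint probability as the product of each node's CPT entry given its parents."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))


def prob(event):
    """Sum the joint over all complete assignments consistent with `event` (a dict)."""
    total = 0.0
    for world in product((True, False), repeat=len(VARS)):
        assignment = dict(zip(VARS, world))
        if all(assignment[name] == value for name, value in event.items()):
            total += joint(*world)
    return total


def query(target, evidence):
    """P(target | evidence) = P(target, evidence) / P(evidence)."""
    return prob({**target, **evidence}) / prob(evidence)


# Independent parameters: one per root, one per conditioning case elsewhere.
n_params = 1 + 1 + len(P_A) + len(P_J) + len(P_M)

print("parameters:", n_params)
print("P(B | J)     =", round(query({"B": True}, {"J": True}), 3))
print("P(B | J, M)  =", round(query({"B": True}, {"J": True, "M": True}), 3))
print("P(B | A)     =", round(query({"B": True}, {"A": True}), 3))
print("P(B | A, E)  =", round(query({"B": True}, {"A": True, "E": True}), 3))
print("P(A | J, ~E) =", round(query({"A": True}, {"J": True, "E": False}), 3))
```

With these assumed CPT values, P(Burglary | JohnCalls) comes out at about 0.016, in line with the value quoted on slide 14. Some other queries differ slightly from the quoted figures (for example about 0.28 rather than 0.29 for Burglary given both calls), which may reflect rounding or different CPT values in the original figure. Enumeration visits all 2^5 assignments, so it is only practical for very small networks.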