Uncertainty Modelling in AI

All materials are from Module CS5340 (NUS Computing)

PGM (Probabilistic Graphical Modeling)

  • gain global insight based on local observations

Key Ideas:

  • Represent the world as a collection of random variables $X_1, \ldots, X_N$ with joint distribution $p(X_1, \ldots, X_N)$

  • Learn the distribution from data.

  • Perform “inference”: compute conditional distributions $p(X_i \mid X_1 = x_1, \ldots, X_N = x_N)$.

Introduction

Probability space

Three parts:

  • Outcome space Ω: space of possible outcomes
  • Event space $E$: $E \subseteq 2^{\Omega}$, a subset of the power set of Ω
    • contains the empty event ∅ and the trivial event Ω
    • closed under union: if $\alpha, \beta \in E$, so is $\alpha \cup \beta$
    • closed under complement: if $\alpha \in E$, so is $\Omega \setminus \alpha$
  • Probability distribution $P$: mapping from events in $E$ to real values such that
    • Non-negativity: $P(\alpha) \geq 0$
    • Normalization: $P(\Omega) = 1$
    • Additivity of mutually disjoint events: if $\alpha, \beta \in E$ and $\alpha \cap \beta = \emptyset$, then $P(\alpha \cup \beta) = P(\alpha) + P(\beta)$.

Probability distribution:

  • $X$: random variable. $x$: a generic value, $x \in \mathrm{Val}(X)$.

  • Indicator random variables: map every outcome to 0 or 1, i.e. they just indicate whether an event occurred.

  • Discrete vs. continuous

    • Discrete:
      • PMF (probability mass function): $p(x)$
    • Continuous:
      • PDF (probability density function): $p(x): \mathbb{R} \to \mathbb{R}_{\geq 0}$
      • $P(x)$ is the cumulative distribution function

Probability:

  • Joint probability: $p(x, y)$, the probability that $X = x$ and $Y = y$ hold simultaneously.

Basic operations

Marginalization

Recover the probability distribution of any variable in a joint distribution by integrating (or summing) over all other variables.

$$p(x) = \int p(x, y)\,dy; \qquad p(y) = \int p(x, y)\,dx$$

$$p(x) = \sum_y p(x, y); \qquad p(y) = \sum_x p(x, y)$$
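
In code, marginalizing a discrete joint distribution stored as a table is just a sum over an axis. A minimal numpy sketch (the joint table is made up for illustration):

```python
import numpy as np

# joint[i, j] = p(X = i, Y = j) for discrete X, Y (made-up numbers)
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_x = joint.sum(axis=1)  # p(x) = sum_y p(x, y)
p_y = joint.sum(axis=0)  # p(y) = sum_x p(x, y)

print(p_x)  # [0.3 0.7]
print(p_y)  # [0.4 0.6]
```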

Conditional probability

$p(x \mid Y = y)$

can be extracted from the joint probability:

$$p(x \mid Y = y) = \frac{p(x, Y = y)}{\int p(x, Y = y)\,dx} = \frac{p(x, Y = y)}{p(Y = y)}$$

i.e. $p(x \mid y) = \frac{p(x, y)}{p(y)}$

So $p(x, y) = p(x \mid y)\,p(y) = p(y \mid x)\,p(x)$: the chain rule (product rule).

Bayes rule

$$p(x \mid y)\,p(y) = p(y \mid x)\,p(x) \;\Rightarrow\; p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)} = \frac{p(x \mid y)\,p(y)}{\int p(x, y)\,dy} = \frac{p(x \mid y)\,p(y)}{\int p(x \mid y)\,p(y)\,dy}$$

Independence

Independence of X and Y means that every conditional distribution is the same as the marginal: $p(x \mid y) = p(x)$ for all $y$.

When variables are independent, $p(x, y) = p(x \mid y)\,p(y) = p(x)\,p(y)$.

Expectation

$$E[f(x)] = \sum_x f(x)\,p(x) \quad \text{or} \quad \int f(x)\,p(x)\,dx$$

$$E[f(x) + g(x)] = E[f(x)] + E[g(x)]$$

$$E[f(x)\,g(y)] = E[f(x)]\,E[g(y)] \quad \text{(if } X, Y \text{ independent)}$$
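
A quick numeric check of these definitions (values and probabilities made up): the exact weighted sum versus a Monte Carlo estimate of $E[f(x)]$ for $f(x) = x^2$.

```python
import numpy as np

vals = np.array([0, 1, 2])
p = np.array([0.2, 0.5, 0.3])

exact = np.sum(vals**2 * p)                       # E[f(x)] = sum_x f(x) p(x)
samples = np.random.choice(vals, size=100_000, p=p)
mc = np.mean(samples**2)                          # sample-average estimate

print(exact, mc)  # exact = 1.7; mc is close to 1.7
```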

Probability Distribution

| Data Type | Domain | Distribution | PDF (PMF) | Parameters |
|---|---|---|---|---|
| Univariate, discrete, binary | $x \in \{0, 1\}$ | Bernoulli | $p(x) = \lambda^x (1 - \lambda)^{1 - x}$ | $\lambda$ |
| Univariate, discrete, multi-valued | $x \in \{1, 2, \ldots, K\}$ | Categorical | $p(x) = \prod_{k=1}^K \lambda_k^{x_k} = \lambda_k$ (one-hot $x$) | $\boldsymbol{\lambda} = [\lambda_1, \ldots, \lambda_K]^T$, where $\lambda_k \geq 0$, $\sum_k \lambda_k = 1$ |
| Univariate, continuous, unbounded | $x \in \mathbb{R}$ | Univariate normal (Gaussian) | $p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$ | $\mu$, $\sigma^2$ |
| Univariate, continuous, bounded | $x \in [0, 1]$ | Beta | $p(x) = \frac{\Gamma[\alpha + \beta]}{\Gamma[\alpha]\Gamma[\beta]} x^{\alpha - 1} (1 - x)^{\beta - 1}$ | $\alpha$, $\beta$ |
| Multivariate, continuous, unbounded | $\mathbf{x} \in \mathbb{R}^D$ | Multivariate normal | $p(\mathbf{x}) = \frac{1}{(2\pi)^{D/2} \lvert\boldsymbol{\Sigma}\rvert^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$ | $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$ |
| Multivariate, continuous, bounded, sum to 1 | $\mathbf{x} = [x_1, x_2, \ldots, x_K]^T$, $x_k \in [0, 1]$, $\sum_{k=1}^K x_k = 1$ | Dirichlet | $p(\mathbf{x}) = \frac{\Gamma[\sum_{k=1}^K \alpha_k]}{\prod_{k=1}^K \Gamma[\alpha_k]} \prod_{k=1}^K x_k^{\alpha_k - 1}$ | $\alpha_k$ ($K$ of them) |
| Bivariate, continuous, $x_1$ unbounded, $x_2$ bounded below | $\mathbf{x} = [x_1, x_2]$, $x_1 \in \mathbb{R}$, $x_2 \in \mathbb{R}^+$ | Normal-scaled inverse gamma | | $\alpha$, $\beta$, $\gamma$, $\delta$ |
| Multivariate vector $\mathbf{x}$ and matrix $\mathbf{X}$; $\mathbf{x}$ unbounded, $\mathbf{X}$ square, positive definite | $\mathbf{x} \in \mathbb{R}^K$; $\mathbf{X} \in \mathbb{R}^{K \times K}$, $\mathbf{z}^T \mathbf{X} \mathbf{z} > 0 \;\forall \mathbf{z} \in \mathbb{R}^K$ | Normal inverse Wishart | | $\alpha$, $\boldsymbol{\Psi}$, $\gamma$, $\boldsymbol{\delta}$ |
  • Conjugate Distributions

    • Parameters of conjugate distributions are known as hyperparameters because they control the parameter distribution

    • Aim: Learning the parameters θ of a probability distribution:

      $$p(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{\int p(x \mid \theta)\,p(\theta)\,d\theta}$$

      • p(θ): prior. p(x|θ): likelihood. p(θ|x): posterior
      • Use the prior and likelihood to compute the posterior. If the posterior has the same form as the prior, the prior is conjugate to the likelihood.
    • The reason for using a conjugate prior is that the prior and posterior then have the same functional form. This matches intuition, and it also forms a chain of priors: the current posterior can serve as the prior for the next computation, which brings computational simplicity.

    • This makes Bayesian inference more convenient. For example, in sequential Bayesian inference, after receiving an observation we compute a posterior. Because the prior is conjugate, the posterior has the same form as the prior, so it can be treated as the new prior for the next observation, and the iteration continues.

    | Distribution | Domain | Conjugate Prior | Prior PDF | Hyperparameters |
    |---|---|---|---|---|
    | Bernoulli | $x \in \{0, 1\}$ | Beta | $p(\lambda) = \frac{\Gamma[\alpha + \beta]}{\Gamma[\alpha]\Gamma[\beta]} \lambda^{\alpha - 1} (1 - \lambda)^{\beta - 1}$, $\lambda \in [0, 1]$ | $\alpha$, $\beta$ |
    | Binomial | $x \in \{0, 1, \ldots, n\}$ | Beta | $p(\lambda) = \frac{\Gamma[\alpha + \beta]}{\Gamma[\alpha]\Gamma[\beta]} \lambda^{\alpha - 1} (1 - \lambda)^{\beta - 1}$, $\lambda \in [0, 1]$ | $\alpha$, $\beta$ |
    | Categorical | $x \in \{1, 2, \ldots, K\}$ | Dirichlet | $p(\lambda_1, \ldots, \lambda_K) = \frac{\Gamma[\sum_{k=1}^K \alpha_k]}{\prod_{k=1}^K \Gamma[\alpha_k]} \prod_{k=1}^K \lambda_k^{\alpha_k - 1}$ | $K$ values $\alpha_k > 0$ |
    | Univariate normal (variance given, mean unknown) | $x \in \mathbb{R}$ | Univariate normal | | |
    | Univariate normal (mean given, variance unknown) | $x \in \mathbb{R}$ | Gamma | | |
    | Univariate normal (both mean and variance unknown) | $x \in \mathbb{R}$ | Normal inverse gamma | $p(\mu, \sigma^2) = \frac{\sqrt{\gamma}}{\sigma\sqrt{2\pi}} \frac{\beta^{\alpha}}{\Gamma[\alpha]} \left(\frac{1}{\sigma^2}\right)^{\alpha + 1} \exp\left[-\frac{2\beta + \gamma(\delta - \mu)^2}{2\sigma^2}\right]$ | $\alpha, \beta, \gamma > 0$, $\delta \in \mathbb{R}$ |
    | Multivariate normal | $\mathbf{x} \in \mathbb{R}^K$ | Normal inverse Wishart | $p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{\gamma^{D/2} \lvert\boldsymbol{\Psi}\rvert^{\alpha/2}}{(2\pi)^{D/2}\, 2^{\alpha D/2}\, \Gamma_D(\alpha/2)} \lvert\boldsymbol{\Sigma}\rvert^{-(\alpha + D + 2)/2} \exp\left[-\frac{1}{2}\left(\operatorname{Tr}(\boldsymbol{\Psi}\boldsymbol{\Sigma}^{-1}) + \gamma(\boldsymbol{\delta} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\delta} - \boldsymbol{\mu})\right)\right]$ | $\alpha$, $\boldsymbol{\Psi}$, $\gamma$, $\boldsymbol{\delta}$ |

    Gamma function:

    $$\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\,dt, \quad z \in \mathbb{C}$$

    $$\Gamma(n) = (n - 1)!, \quad n \in \mathbb{Z}_{>0}$$

    Binomial distribution: $p(x) = \binom{n}{x} \lambda^x (1 - \lambda)^{n - x}$

    Categorical distribution: $p(x) = \prod_{k=1}^K \lambda_k^{x_k} = \lambda_k$ (for one-hot $x$)

    https://zhuanlan.zhihu.com/p/26638720
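
A minimal sketch of conjugacy and the sequential-prior idea above, using the Beta-Bernoulli pair (the data are made up):

```python
import numpy as np

alpha, beta = 2.0, 2.0                 # Beta prior hyperparameters
data = np.array([1, 0, 1, 1, 1, 0])    # Bernoulli observations

# Conjugacy: Beta prior + Bernoulli likelihood -> Beta posterior,
# with hyperparameters updated by simple counts.
alpha_post = alpha + data.sum()                # add number of successes
beta_post = beta + len(data) - data.sum()      # add number of failures
print(alpha_post, beta_post)                   # 6.0, 4.0

# Sequential Bayesian inference: today's posterior is tomorrow's prior.
a, b = 2.0, 2.0
for x in data:
    a, b = a + x, b + (1 - x)
print(a, b)                                    # same result: 6.0, 4.0
```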

Fitting Probability Models

We already know some common parametric probability distributions $p(x \mid \theta)$.

Now we need to know how to learn the unknown parameters $\theta$ from given data $\mathcal{D} = \{x[1], \ldots, x[N]\}$, and then use these parameters to make predictions.

Learning

Methods to learn the unknown parameters $\theta$ from data

Maximum Likelihood Estimation (MLE)

  • Given data $\mathcal{D} = \{x[1], \ldots, x[N]\}$

  • Assume:

    • a parametric family of distributions $\{p_{\theta} : \theta \in \Theta\}$, where $p_{\theta}(x) = p(x \mid \theta)$
    • $\mathcal{D}$ is sampled from $X_1, X_2, \ldots, X_N \sim p_{\theta^*}$ for some $\theta^* \in \Theta$
    • the random variables $X_1, X_2, \ldots, X_N$ are i.i.d.
  • Goal: estimate $\theta^*$

  • Method: the estimate $\theta_{MLE}$ is a maximum likelihood estimate (MLE) for $\theta^*$ if

    $$\theta_{MLE} = \underset{\theta \in \Theta}{\operatorname{argmax}}\,[p(\mathcal{D} \mid \theta)] = \underset{\theta \in \Theta}{\operatorname{argmax}}\,[p(x[1], \ldots, x[N] \mid \theta)] = \underset{\theta}{\operatorname{argmax}}\left[\prod_{i=1}^N p(x[i] \mid \theta)\right] \quad \text{(i.i.d.: decompose into a product of likelihoods)}$$

  • Solution: maximize the logarithm, $\theta_{MLE} = \underset{\theta}{\operatorname{argmax}}\left[\log \prod_{i=1}^N p(x[i] \mid \theta)\right]$, then solve by setting the derivative w.r.t. $\theta$ to 0 (see the Gaussian sketch after this list).

  • Advantages:

    • Easy and fast to compute
    • Nice asymptotic properties:
      • Consistent: if the data are generated from $f(\theta^*)$, the MLE converges to the true value, $\hat{\theta}_{MLE} \to \theta^*$ as $n \to \infty$
      • Efficient: no consistent estimator has lower asymptotic mean squared error than the MLE
    • Functional invariance: if $\hat{\theta}$ is the MLE of $\theta$, and $g(\theta)$ is a transformation of $\theta$, then the MLE for $\alpha = g(\theta)$ is $\hat{\alpha} = g(\hat{\theta})$
  • Disadvantages:

    • MLE is a point estimate, i.e. it does not represent uncertainty over the estimate
    • MLE may overfit.
    • MLE does not incorporate prior information.
    • Asymptotic results hold only in the limit and assume the model is correct.
    • MLE may not exist or may not be unique.
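
The Gaussian sketch referenced above: setting the derivatives of the log-likelihood to zero yields the sample mean and the (biased, 1/N) sample variance. Synthetic data for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)  # synthetic data

mu_mle = data.mean()                       # argmax over mu
var_mle = ((data - mu_mle) ** 2).mean()    # argmax over sigma^2 (divides by N)

print(mu_mle, var_mle)  # close to 5.0 and 4.0
```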

Bayesian Approach

To model uncertainty of the parameter estimation

Fitting: instead of a point estimate $\hat{\theta}$, compute the posterior distribution over all possible parameter values using Bayes’ rule:

$$p(\theta \mid \mathcal{D}) = \frac{\prod_{i=1}^N p(x[i] \mid \theta)\,p(\theta)}{p(\mathcal{D})}$$

Principle: why pick one set of parameters? There are many values that could have explained the data. Try to capture all of the possibilities.

Model:

  • Goal: Model Uncertainty over θ using prior distribution

  • Method: find posterior

    $$p(\theta \mid \mathcal{D}) = \frac{\prod_{i=1}^N p(x[i] \mid \theta)\,p(\theta)}{p(\mathcal{D})} = \frac{\prod_{i=1}^N p(x[i] \mid \theta)\,p(\theta)}{\int \prod_{i=1}^N p(x[i] \mid \theta)\,p(\theta)\,d\theta}$$

    ​ where, if we use a Gaussian as the observation model $p(x \mid \theta)$,

    $$\prod_{i=1}^{N} p(x[i] \mid \theta)\, p(\theta)=\prod_{i=1}^{N} \operatorname{Norm}_{x[i]}\left[\mu, \sigma^{2}\right] \underbrace{\operatorname{NormInvGam}_{\mu, \sigma^{2}}[\alpha, \beta, \gamma, \delta]}_{\text{conjugate prior}}$$

Properties:

  • models uncertainty over parameters.
  • incorporating prior information
  • can derive quantities of interest, e.g. $p(x < 10 \mid \mathcal{D})$
  • can perform model selection.

Drawbacks:

  • Need Prior
  • Computationally intractable
  • If the initial belief (prior) is not conjugate to the likelihood, the posterior has no convenient closed form.

Maximum a Posteriori Estimation (MAP)

Given: data $\mathcal{D} = \{x[1], \ldots, x[N]\}$

Assume:

  • joint distribution p(D,θ)
  • θ is a random variable

Goal: Choose “good” θ

Method:

  • the estimate $\theta_{MAP}$ is a maximum a posteriori (MAP) estimate if

    $$\begin{aligned} \theta_{MAP} &= \underset{\theta}{\operatorname{argmax}}\,[p(\theta \mid \mathcal{D})] \\ &= \underset{\theta}{\operatorname{argmax}}\left[\frac{p(\mathcal{D} \mid \theta)\,p(\theta)}{p(\mathcal{D})}\right] \quad \text{(Bayes' rule)} \\ &= \underset{\theta}{\operatorname{argmax}}\left[\frac{\prod_{i=1}^N p(x[i] \mid \theta)\,p(\theta)}{p(\mathcal{D})}\right] \quad \text{(i.i.d.)} \\ &= \underset{\theta}{\operatorname{argmax}}\left[\prod_{i=1}^N p(x[i] \mid \theta)\,p(\theta)\right] \quad (p(\mathcal{D}) \text{ removed since it is independent of } \theta) \end{aligned}$$

  • $$\theta_{MAP} = \underset{\theta}{\operatorname{argmax}}\left[\prod_{i=1}^N \underbrace{p(x[i] \mid \theta)}_{\text{likelihood}}\, \underbrace{p(\theta)}_{\text{prior}}\right]$$

  • Maximize the logarithm, then take derivatives and set them to zero to estimate the parameters.

Results:

  • More data points: MAP is closer to the MLE.
  • Fewer data points: MAP is closer to the prior.

key steps (see the Beta-Bernoulli sketch after this section):

  1. Choose a prior distribution.

  2. Derive
    $$\mathcal{L} = \log p(\mathcal{D} \mid \theta)\,p(\theta) = \log p(\mathcal{D} \mid \theta) + \log p(\theta)$$

  3. Then set $\frac{\partial \mathcal{L}}{\partial \theta_i} = 0$ for each parameter $\theta_i$.

Properties:

  • Fast and easy
  • incorporate prior information
  • Avoids overfitting (the prior acts as a regularizer)
  • As $n \to \infty$, MAP tends toward the MLE, but it does not share the MLE's nice asymptotic properties.

Drawbacks:

  • Point estimate (like MLE)
    • does not capture uncertainty over estimates
  • NEED to choose a prior.
  • Not functionally invariant:
    • if $\hat{\theta}$ is the MAP estimate of $\theta$, and $g(\theta)$ is a transformation of $\theta$, then the MAP estimate for $\alpha = g(\theta)$ is not necessarily $\hat{\alpha} = g(\hat{\theta})$
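
A minimal sketch of the MAP recipe above for a Bernoulli likelihood with a Beta(α, β) prior (the data are made up); it also illustrates that MAP approaches the MLE as the data grow:

```python
import numpy as np

def mle(data):
    return data.mean()

def map_estimate(data, alpha=5.0, beta=5.0):
    # mode of the Beta posterior: (alpha + #heads - 1) / (alpha + beta + N - 2)
    n, heads = len(data), data.sum()
    return (alpha + heads - 1) / (alpha + beta + n - 2)

small = np.array([1, 1, 1])             # 3 flips, all heads
large = np.tile([1, 1, 1, 0], 250)      # 1000 flips, 75% heads

print(mle(small), map_estimate(small))  # 1.0 vs ~0.64 (pulled toward prior 0.5)
print(mle(large), map_estimate(large))  # 0.75 vs ~0.75 (close to MLE)
```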

Prediction

  • MLE / MAP: evaluate a new data point under the parameters learnt by MLE/MAP.

  • Bayesian: calculate a weighted sum of the predictions from all possible values of the parameters:
    $$
    \begin{aligned} p\left(x^{*} \mid \mathcal{D}\right) &=\frac{p\left(x^{*}, \mathcal{D}\right)}{p(\mathcal{D})} \quad \text{(conditional probability)} \\ &=\frac{\int p\left(x^{*}, \mathcal{D}, \theta\right) d\theta}{p(\mathcal{D})} \quad \text{(marginal probability)} \\ &=\frac{\int p\left(x^{*}, \theta \mid \mathcal{D}\right) p(\mathcal{D})\, d\theta}{p(\mathcal{D})} \quad \text{(conditional probability)} \\ &=\int p\left(x^{*} \mid \mathcal{D}, \theta\right) p(\theta \mid \mathcal{D})\, d\theta \quad \text{(conditional probability)} \\ &=\int p\left(x^{*} \mid \theta\right) p(\theta \mid \mathcal{D})\, d\theta \quad (x^{*} \perp \mathcal{D} \mid \theta \text{: conditional independence)} \end{aligned}
    $$

  • Predictive density

    • $$
      p\left(x^{*} \mid \mathcal{D}\right)=\int \underbrace{p\left(x^{*} \mid \theta\right)}_{\text{prediction for each possible } \theta}\, \underbrace{p(\theta \mid \mathcal{D})}_{\text{weights}}\, d\theta
      $$

      Make a prediction that is an infinite weighted sum (integral) of the predictions for each parameter value, where the weights are the posterior probabilities.

      The MLE and MAP estimates can be viewed as probability distributions with zero probability everywhere except at the estimate.

As training data decrease, the Bayesian predictive becomes less certain, but MAP remains (wrongly) overconfident.

Exponential Family

exponential family:

  • a set of probability distributions $\{p_{\theta} : \theta \in \Theta\}$ of the form
    $$p_{\theta}(x) = \frac{h(x) \exp\left[\eta(\theta)^{\top} s(x)\right]}{Z(\theta)}$$
    where

    • $\theta \in \mathbb{R}^k$, $x \in \mathbb{R}^d$
    • natural parameters: $\eta(\theta): \Theta \to \mathbb{R}^m$
    • sufficient statistics: $s(x): \mathbb{R}^d \to \mathbb{R}^m$
    • base measure (supporting and scaling): $h(x): \mathbb{R}^d \to [0, \infty)$
    • partition function: $Z(\theta): \Theta \to [0, \infty)$

An exponential family is in its natural (canonical) form if it is parameterized by its natural parameters $\eta$:
$$p_{\eta}(x) = p(x \mid \eta) = \frac{h(x) \exp\left[\eta^{\top} s(x)\right]}{Z(\eta)}$$
Normalizer: $Z(\eta)$
$$Z(\eta) = \int h(x) \exp\left[\eta^{\top} s(x)\right] dx$$
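As a worked example (standard material, not from the slides), the Bernoulli distribution in natural form:

$$p(x) = \lambda^x (1-\lambda)^{1-x} = \exp\left[x \log\frac{\lambda}{1-\lambda} + \log(1-\lambda)\right] = \frac{\exp[\eta\, x]}{Z(\eta)}$$

with $h(x) = 1$, sufficient statistic $s(x) = x$, natural parameter $\eta = \log\frac{\lambda}{1-\lambda}$, and $Z(\eta) = 1 + e^{\eta}$ (since $\log(1-\lambda) = -\log(1+e^{\eta})$). Consistent with the moment property below, $\frac{d}{d\eta} \log Z(\eta) = \frac{e^{\eta}}{1+e^{\eta}} = \lambda = E[s(x)]$.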
Properties:

  • has a conjugate prior.
  • a fixed number of sufficient statistics summarizes i.i.d. data: estimation from arbitrarily many i.i.d. observations reduces to a few values, namely $s(x)$.
  • the posterior predictive distribution always has a closed-form solution.
  • conjugacy gives a closed-form posterior; otherwise we would need to solve complicated integrals. Moreover, a conjugate prior lets us see clearly how the likelihood affects the resulting distribution.

For any exponential family:
$$E[s(x)] = \nabla_{\eta} \log Z(\eta)$$

  • by differentiating $Z(\eta)$, the mean, variance, and other properties of the sufficient statistics $s(x)$ are easily obtained.

Important Properties:

Many distributions are Exponential Family:

| PMF | PDF |
|---|---|
| Bernoulli | Normal |
| Binomial | Gamma & Inverse Gamma |
| Categorical/Multinoulli | Wishart & Inverse Wishart |
| Poisson | Beta |
| Multinomial | Dirichlet |
| Negative Binomial | Lognormal |
| | Exponential |
| | … |

Bayesian Networks

Conditional Independence

Fitting probability models (learning) and predictive density (inference)

  • BUT so far only for the case of one random variable, i.e. $p(x \mid \theta)$
  • What about a joint probability with N random variables?

Problem:

  • Assume $N$ discrete random variables $x_1, \ldots, x_N$, where $x_i \in \{1, \ldots, K\}$. We need $O(K^N)$ parameters to represent the joint distribution $p(x_1, \ldots, x_N)$: inference is hard, and a huge amount of data is needed to learn all the parameters.

  • If all random variables are independent, $O(NK)$ parameters suffice, with $p(x_1, \ldots, x_N \mid \theta) = \prod_{i=1}^N p(x_i \mid \theta_i)$: inference becomes tractable and less data is needed.

Definition:

Two random variables $X_A$ and $X_C$ are conditionally independent given $X_B$, written $X_A \perp X_C \mid X_B$, iff $p(x_A, x_C \mid x_B) = p(x_A \mid x_B)\,p(x_C \mid x_B)$, or equivalently $p(x_A \mid x_B, x_C) = p(x_A \mid x_B)$ for all $x_B$ with $p(x_B) > 0$. This means that learning $X_C$ does not change the prediction of $X_A$ once we know the value of $X_B$.

Meaning:

  • parameter reduction
    • let $m_i$ denote the number of parents of node $X_i$, and let each node take on $K$ values.
    • the conditional probability table of $X_i$ uses $K^{m_i + 1}$ parameters
    • summing over all nodes, the parameter count is reduced from $O(K^N)$ to $O(K^{m+1})$ per node, where $m = \max_i m_i \ll N$

Learning with conditional independence: MLE

  1. represent the joint probability distribution

  2. Take the logarithm of $p(x \mid \theta)$ to convert the product into sums.

  3. Find the maximum of the log-likelihood for each parameter $\theta_i$; this can be done separately:

    $$\underset{\theta_1}{\operatorname{argmax}} \log p(x \mid \theta), \quad \underset{\theta_2}{\operatorname{argmax}} \log p(x \mid \theta), \quad \ldots \quad \text{[terms for random variables independent of } \theta_i \text{ can be removed]}$$

Learning with conditional independence: MAP

Similar to MLE:
$$\theta_{MAP} = \underset{\theta}{\operatorname{argmax}}\left[\log p(x \mid \theta) + \log p(\theta)\right]$$
where $\theta = (\theta_1, \ldots, \theta_M)$

Bayesian Networks

  • A Bayesian network is a DAG G (no cycles)

    • nodes are the variables $X_i$; the set of parent nodes $X_{\pi_i}$ contains all variables that are parents of $X_i$
    • Topological ordering: if $X_i \to X_j$, then $i < j$ [not unique]
  • Path: follows the arrows, respecting their direction. V.S. Trail: follows the edges without considering direction.

  • Ancestors: parents, grand-parents , etc. V.S. Descendants: children, grand-children, etc.

  • Local Markov assumption :

    • Each random variable $X_i$ is independent of its non-descendants $X_{\text{nonDesc}(X_i)}$ given its parents $X_{\pi_i}$

    • $\left\{X_i \perp \left(X_{\text{nonDesc}(X_i)} \setminus X_{\pi_i}\right) \mid X_{\pi_i}\right\}$

    • That is, given its parents, $X_i$ is independent of its non-descendants $X_{\text{nonDesc}(X_i)}$.

    • Then, the conditional probabilities are $p(x_i \mid x_{\pi_i})$, $i = 1, \ldots, N$ [according to the local parent-children relationships]

Joint probability: the product of all local conditional probabilities
$$p(x_1, \ldots, x_N) = \prod_{i=1}^N p(x_i \mid x_{\pi_i})$$
​ because $p(x_1, \ldots, x_N) = p(x_1) \prod_{i=2}^N p(x_i \mid x_1, \ldots, x_{i-1})$ (chain rule), and the local Markov assumption removes the non-parents from each conditioning set.

A fully connected DAG can represent any distribution over its random variables.

Interpret arrows as “possible dependence”. A sketch of the factorization follows.
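
A minimal sketch of this factorization (the chain A → B → C and its CPTs are made up for illustration):

```python
# Joint of a 3-node Bayesian network A -> B -> C as the product of
# local conditional probabilities: p(a, b, c) = p(a) p(b|a) p(c|b).
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3},
               1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.9, 1: 0.1},
               1: {0: 0.5, 1: 0.5}}

def joint(a, b, c):
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(joint(1, 1, 0), total)   # 0.16, 1.0 (a valid joint sums to 1)
```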

Interpreting/Analyzing Bayesian Networks

  • We already know the conditional independencies based on the local parent-children relationships.
  • We hope to find other, extra independencies.

Methods:

  1. prove $p(x_A, x_C \mid x_B) = p(x_A \mid x_B)\,p(x_C \mid x_B)$, or $p(x_A \mid x_B, x_C) = p(x_A \mid x_B)$ for all $x_B$ with $p(x_B) > 0$, to establish $X_A \perp X_C \mid X_B$
  2. Graph separation

Graph separation

A precise definition of “blocking” has to be done through the “three canonical 3-node graphs” and “d-separation”.

Three canonical 3-node graphs:

  • Head-Tail

    • if no variables are observed, A and B are not independent, $A \not\perp B \mid \emptyset$:
      $$p(a, b) = p(a) \sum_c p(c \mid a)\,p(b \mid c) = p(a)\,p(b \mid a)$$

    • if C is observed, using Bayes’ rule we obtain $A \perp B \mid C$:
      $$p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a)\,p(c \mid a)\,p(b \mid c)}{p(c)} = p(a \mid c)\,p(b \mid c) \quad \text{(Bayes' rule)}$$

    • Intuition: past A is independent of future B given present C; a simple Markov chain

  • Tail-Tail

    • If no variables are observed, NOT independent: $A \not\perp B \mid \emptyset$

    • If C is observed, we obtain $A \perp B \mid C$, because

    • $$p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = p(a \mid c)\,p(b \mid c)$$

  • Head-Head

    • if no variables are observed, we obtain $A \perp B \mid \emptyset$ because
      $$p(a, b) = \sum_c p(a)\,p(b)\,p(c \mid a, b) = p(a)\,p(b)$$

    • if C is observed, we obtain $A \not\perp B \mid C$, because
      $$p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a)\,p(b)\,p(c \mid a, b)}{p(c)}$$
      ​ which in general cannot be written as $p(a \mid c)\,p(b \mid c)$

    • “V-structure”.

    • if C is unobserved, it “blocks” the trail. If C is observed, it “unblocks” it.

    • Observation of any descendant node of C also “unblocks” the path from A to B.

    $A \perp B \mid C$ if all trails from nodes in A are “blocked” from nodes in B when all of C is observed.

    A is then d-separated from B by C.

  • Key: a trail is blocked at a node if:

    1. The arrows on the trail meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or
    2. The arrows meet head-to-head at the node, and neither the node, nor any of its descendants, is in the set C.

Bayes Ball

  • it’s a “reachability” algorithm:

    1. shade the nodes in set C (the given nodes)
    2. Place a ball at each of the nodes in set A
    3. Let the balls “bounce around” the graph according to the d-separation rules:
      • if none of the balls reach B, then $A \perp B \mid C$
      • else $A \not\perp B \mid C$
  • can be implemented as a breadth-first search

  • procedure of the algorithm (a sketch follows this list):

    • given a graph G and sets of nodes X, Y, and Z
    • Output True if $X \perp Y \mid Z$, False otherwise.
    • 2 phases:
      • Phase 1: “Find all the unblocked v-structures”
        • traverse the graph from the leaves to the roots, marking all nodes that are in Z or have descendants in Z.
      • Phase 2: “Traverse all the trails starting from X”
        • apply breadth-first search, stopping a specific traversal when we hit a blocked node.
        • If Y is reached during this traversal, return False. Else, return True.
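
A minimal sketch of this two-phase procedure (the graph encoding and names are my own; the example matches the head-head case above):

```python
from collections import deque

def d_separated(children, x, y, z):
    """True if x is d-separated from y given the set z.
    `children` maps each node to a list of its children."""
    parents = {n: [] for n in children}
    for p, chs in children.items():
        for c in chs:
            parents[c].append(p)

    # Phase 1: mark nodes that are in z or have a descendant in z.
    marked, stack = set(), list(z)
    while stack:
        n = stack.pop()
        if n not in marked:
            marked.add(n)
            stack.extend(parents[n])

    # Phase 2: BFS over (node, direction-of-arrival) pairs.
    visited, queue = set(), deque([(x, "up")])
    while queue:
        n, d = queue.popleft()
        if (n, d) in visited:
            continue
        visited.add((n, d))
        if n == y and n not in z:
            return False                      # y reached: not d-separated
        if d == "up" and n not in z:          # arrived from a child
            queue.extend((p, "up") for p in parents[n])
            queue.extend((c, "down") for c in children[n])
        elif d == "down":                     # arrived from a parent
            if n not in z:
                queue.extend((c, "down") for c in children[n])
            if n in marked:                   # unblocked v-structure
                queue.extend((p, "up") for p in parents[n])
    return True

# V-structure a -> c <- b with descendant d: observing c (or d) unblocks.
g = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
print(d_separated(g, "a", "b", set()))   # True
print(d_separated(g, "a", "b", {"c"}))   # False
print(d_separated(g, "a", "b", {"d"}))   # False
```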

Bayes Network Examples:

Linear Regression

  • Model:

    • $y[i] = \mathbf{w}^{\top} \mathbf{x}[i] + \epsilon[i]$

    • where the input vector is $\mathbf{x}[i]=\left[x[i]_{1}, x[i]_{2}, \ldots, x[i]_{D}\right]^{\top}$; the coefficient vector is $\mathbf{w}=\left[w_{1}, w_{2}, \ldots, w_{D}\right]^{\top}$; and the noise is $\epsilon[i] \sim N\left(0, \sigma_{n}^{2}\right)$

    • if $\epsilon \sim N(0, \sigma_n^2)$, then $Y \sim N(\mathbf{w}^{\top} \mathbf{x}, \sigma_n^2)$

    • Affine property of a Gaussian:

      • $f(x) = ax + b$ of a Gaussian $N(\mu, \sigma^2)$ gives the new PDF $N(a\mu + b, a^2 \sigma^2)$
    • Independence assertion:
      $$y[i] \perp y[i+1] \mid \mathbf{x}[i], \mathbf{w}, \sigma_n^2$$
      so we can write the factorization:
      $$p(y[1], \ldots, y[N]) = \prod_i^N p(y[i] \mid \mathbf{w}^{\top} \mathbf{x}[i], \sigma_n^2)$$

    • Problem: assume we know $\sigma_n^2$; how do we learn the unknown parameter $\mathbf{w}$?

      • Approach 1: MLE (a numpy sketch follows this block)

      $$\theta_{MLE} = \underset{\theta}{\operatorname{argmax}}\left[\prod_{i=1}^N p(x[i] \mid \theta)\right] \quad \text{(i.i.d.)}$$

      ​ with $p(x \mid \mu, \sigma^2) = \operatorname{Norm}_x\left[\mu, \sigma^2\right] = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$

      ​ Take the logarithm and compute the derivative w.r.t. $\mathbf{w}$:
      $$\log \prod_{i=1}^N p(y[i] \mid \mathbf{w}^{\top} \mathbf{x}[i], \sigma_n^2) = \sum_{i=1}^N \log N(y[i] \mid \mathbf{w}^{\top} \mathbf{x}[i], \sigma_n^2) \propto -\sum_{i=1}^N \frac{(y[i] - \mathbf{w}^{\top} \mathbf{x}[i])^2}{2\sigma_n^2}$$
      ​ So, dropping the constant $\frac{1}{2\sigma_n^2}$,
      $$\underset{\mathbf{w}}{\operatorname{argmin}} \sum_{i=1}^N (y[i] - \mathbf{w}^{\top} \mathbf{x}[i])^2 = \underset{\mathbf{w}}{\operatorname{argmin}}\,(\mathbf{Y} - \mathbf{X}\mathbf{w})^{\top}(\mathbf{Y} - \mathbf{X}\mathbf{w})$$
      ​ Set $L(\mathbf{w}) = (\mathbf{Y} - \mathbf{X}\mathbf{w})^{\top}(\mathbf{Y} - \mathbf{X}\mathbf{w})$:
      $$\nabla L(\mathbf{w}) = 2\left(\mathbf{X}^{\top}\mathbf{X}\mathbf{w} - \mathbf{X}^{\top}\mathbf{Y}\right) = 0 \quad \Rightarrow \quad \mathbf{w} = \left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{Y}$$
      ​ [Question about the slides: the handwritten page, p41]
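
The numpy sketch referenced above: under Gaussian noise, maximizing the likelihood is least squares, solved by the normal equations (synthetic data for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + rng.normal(scale=0.1, size=N)   # noise with sigma_n = 0.1

# w = (X^T X)^{-1} X^T Y, via solve() instead of an explicit inverse
w_mle = np.linalg.solve(X.T @ X, X.T @ Y)
print(w_mle)                                     # close to w_true
```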

  • Model uncertainty of w

    • Model:

      • $p(\mathbf{w} \mid v) = N(\mathbf{0}, v\mathbf{I})$

      • Conditional independence assertion:
        $$p(y[1], \ldots, y[N], \mathbf{w}) = p(\mathbf{w} \mid v) \prod_i^N p(y[i] \mid \mathbf{w}^{\top} \mathbf{x}[i], \sigma_n^2)$$

    • Problem: assume $\sigma_n^2$ is known; find the MAP solution for $\mathbf{w}$ (a sketch follows this block)

      • Same as MLE: take the derivative and set it to 0.

      • $\mathbf{w} = \left(\mathbf{X}^{\top}\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\mathbf{X}^{\top}\mathbf{Y}$, where $\lambda = \frac{\sigma_n^2}{v}$ (ridge regression, an l2-norm penalty)

      • if the prior is a Laplacian distribution, $p(w \mid 0, b) = \frac{1}{2b} \exp\left(-\frac{|w|}{b}\right)$:

        • $\log p(w \mid 0, b) \propto -\frac{|w|}{b}$; then $\lambda = \frac{2\sigma_n^2}{b}$, giving an l1-norm penalty.
      • derivative of the l0-norm: it is zero wherever it is differentiable, but only in places where it isn’t interesting. Where it is interesting, it’s not differentiable (it has jump discontinuities).

      • derivative of the l1-norm: this norm is not differentiable with respect to a coordinate where that coordinate is zero. Elsewhere, the partial derivatives are just constants, ±1 depending on the quadrant.

      • derivative of the l2-norm: usually people use the l2-norm squared so that it’s differentiable even at zero. The gradient of $\|x\|^2$ is $2x$, but without the square it is $\frac{x}{\|x\|}$ (i.e. it just points away from zero), which is not differentiable at zero.
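
The MAP sketch referenced above: with a Gaussian prior $N(\mathbf{0}, v\mathbf{I})$ this is ridge regression; the $\lambda = \sigma_n^2 / v$ convention follows the reconstruction above, and the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 20, 3                                  # deliberately few data points
X = rng.normal(size=(N, D))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=N)

sigma_n2, v = 0.25, 1.0
lam = sigma_n2 / v
w_map = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)
print(w_map)   # shrunk toward zero relative to the MLE solution
```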

  • Predictive DGM Model

Naive Bayes

  • Model:

    • Classes $c \in \{1, \ldots, K\}$ and input $\mathbf{x} = [x_1, x_2, \ldots, x_D]^{\top}$

    • Naive Bayes assumes that $p(\mathbf{x} \mid c) = \prod_j p(x_j \mid c)$. Naive assumption: all features are conditionally independent given the class (see the sketch after this list).

    • Filled dots are deterministic parameters. If we have priors, we can make the parameters random variables (open circles).

    • Even though the naive Bayes assumption is so strong, why does it still work well?

      • Key idea: it works well when the local feature dependencies between the two classes “cancel out”.
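
The sketch referenced above: the naive Bayes factorization $p(c \mid \mathbf{x}) \propto p(c) \prod_j p(x_j \mid c)$ for binary features (all probabilities are made up):

```python
import numpy as np

prior = np.array([0.5, 0.5])               # p(c) for classes 0, 1
theta = np.array([[0.8, 0.1, 0.5],         # theta[c, j] = p(x_j = 1 | c)
                  [0.2, 0.7, 0.5]])

def predict(x):
    # log p(c) + sum_j log p(x_j | c), then normalize over classes
    log_lik = (np.log(theta) * x + np.log(1 - theta) * (1 - x)).sum(axis=1)
    log_post = np.log(prior) + log_lik
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

print(predict(np.array([1, 0, 1])))   # favors class 0
print(predict(np.array([0, 1, 0])))   # favors class 1
```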

Properties of Bayesian Networks

  • Independence maps (I-maps)

    • Theorem:

      Let P be a joint distribution over X and G be a Bayesian network structure over X. If P factorizes according to G, then the local independence assertions satisfy $\mathcal{J}_l(G) \subseteq \mathcal{J}(P)$.

      • The local Markov independencies $\mathcal{J}_l(G)$ are the set of all basic conditional independence assertions of the form $\left\{X_i \perp \left(X_{\text{nonDesc}(X_i)} \setminus X_{\pi_i}\right) \mid X_{\pi_i}\right\}$
    • Proof (sketch):

      • $\mathcal{J}_l(G) \subseteq \mathcal{J}(P)$ means every local independence holds in P, i.e. $p(x_i \mid x_{\text{nonDesc}(X_i)}) = p(x_i \mid x_{\pi_i})$.

      • This follows by marginalizing the factorization $p(x_1, \ldots, x_N) = \prod_i p(x_i \mid x_{\pi_i})$ over the descendants of $X_i$ and conditioning on $X_{\pi_i}$.

  • Global Markov independencies:

    • Definition: the set of all independencies that correspond to d-separation in graph G is the set of global Markov independencies:
      $$\mathcal{J}(G) = \left\{X \perp Y \mid Z : \operatorname{dsep}_G(X ; Y \mid Z)\right\}$$

    • Soundness:

      • Definition: if a distribution P factorizes according to G, then $\mathcal{J}(G) \subseteq \mathcal{J}(P)$.
      • i.e. if two nodes are d-separated given Z, they are conditionally independent given Z.
    • Completeness:

      • Definition: P is faithful to G if for any conditional independence $(X \perp Y \mid Z) \in \mathcal{J}(P)$, we have $\operatorname{dsep}_G(X ; Y \mid Z)$.
      • i.e. any independence can be reflected as a d-separation.
      • Definition: weak completeness
      • Definition: almost completeness
    • How powerful:

      Can we always find a Bayesian network G that represents all the conditional independencies of a given P?

      • Minimal I-map: “no redundant edges”
      • Perfect map: $\mathcal{J}(G) = \mathcal{J}(P)$

      No.

Undirected Graphical Models (Markov Random Fields)

Undirected Graphical Models:

A UGM (MRF, or Markov network) is a graph G(V, E) with a set of nodes and a set of undirected edges.

  • Parameterization: via factors.

    • A factor $\varphi(\mathcal{C})$ is a function mapping a set of variables to non-negative numbers.
  • Factorization:

    • $p(x_1, \ldots, x_N) = \frac{1}{Z} \prod_{j=1}^M \varphi_j(\mathcal{C}_j)$

      whereas a DGM factorizes as $p(x_1, \ldots, x_N) = \prod_{i=1}^N p(x_i \mid x_{\pi_i})$

    • just like DGMs, UGMs encode a set of conditional independence assertions.
  • Conditional Independence:

    • 3 Markov properties of UGMs:

      1. Global Markov property

        • $X_A \perp X_B \mid X_C$ iff C separates A from B.
        • NO trails connect A and B when we remove all nodes in C.
      2. Local Markov property

        • Markov blanket $\operatorname{mb}(X_s)$: the set of nodes that renders a node $X_s$ conditionally independent of all the other nodes:
          $$X_s \perp X_{V \setminus \{s\} \setminus \operatorname{mb}(X_s)} \mid X_{\operatorname{mb}(X_s)}$$

        • The Markov blanket in a UGM is the set of immediate neighbors

          • a node is conditionally independent of all nodes other than itself and its neighbors (due to the global Markov property)

          The Markov blanket of a node $X_s$ in a DGM is the set of the node’s parents, children, and co-parents (other parents of its children).

          [Question about the slides, p35]

      3. Pairwise Markov property

        • two nodes $X_s$ and $X_t$ are conditionally independent given the rest if there is no direct edge between them:
          $$X_s \perp X_t \mid X_{V \setminus \{s, t\}} \quad \text{if } (s, t) \notin E$$
          i.e. no direct connection remains after removing the other nodes
    • relationship among the three properties:

      • They are interrelated:

      • Global Markov ⇒ Local Markov ⇒ Pairwise Markov, and Pairwise Markov ⇒ Global Markov

        • for positive distributions $p(x) > 0$, the three Markov properties are equivalent.

Parameterization of MRFs

Aim: obtain a local parameterization of UGMs.

For DGM:

  • Local conditional probabilities: $p(x_i \mid x_{\pi_i})$

  • Joint probability is the product of the local conditional probabilities: $p(x_1, \ldots, x_N) = \prod_{i=1}^N p(x_i \mid x_{\pi_i})$

But for UGMs it is difficult to obtain conditional probabilities, since there is no topological ordering. So, discard conditional probabilities.

Represent the joint as a product of local functions:

  • normalized by a global constant Z
  • the factors do NOT represent marginal/conditional distributions

Gibbs distribution

  • Definition: A distribution P is a Gibbs distribution parameterized by a set of factors ${\varphi_1(\mathcal{C}1), \varphi_2(\mathcal{C}2),\cdots, \varphi_m(\mathcal{C}m)},where$
    p\left(x
    {1}, x
    {2}, \ldots, x
    {N}\right)=\frac{1}{Z} \prod_{j=1}^{M} \varphi_{j}\left(\mathcal{C}_{j}\right)
    $$

  • Factor:

    Two nodes Xi and Xj that are not directly linked in UGM are conditionally independent given all other nodes. [不直接相连的nodes是条件独立的]

    Factorization of joint probability that places Xi and Xj in different factors.

    All nodes XC in a maximal clique C in UGM form factor (local function) φ(XC)

    • Clique: a fully connected subset of nodes.
    • Maximal Clique C: cliques that cannot be extended to include additional nodes

    Hammersley-Clifford: a positive distribution p(y)>0 satisfies the CI properties of undirected graph H iff p can be represented as a product of factors, one per maimal clique:

    • $$
      p(\mathbf{y} | \boldsymbol{\theta})=\frac{1}{Z(\boldsymbol{\theta})} \prod_{c \in \mathcal{C}} \psi_{c}\left(\mathbf{y}{c} | \boldsymbol{\theta}{c}\right)
      $$

      where C is the set of all the maimal liques; φC() is the factor or potential function of clique c; θ is the parameter of factors φC() for cC; Z(θ) is the partition function $Z(\boldsymbol{\theta}) \triangleq \sum \prod\limits_{c \in \mathcal{C}} \psi_{c}\left(\mathbf{y}{c} | \boldsymbol{\theta}{c}\right)$.

      UGM NOT specify a unique factorization.

      Hammersley-Clifford is just one type of factorization.

    Pairwise MRF:

    • parameretization to the edges of the graph, rather than maximal cliques. [每条边作为factor,不考虑maximal clique]
    • SIMPLE but not general.
  • Properties:

    • Independent set; independence map (I-map)

    • Soundness

      • if P is a Gibbs distribution over H, then H is an I-map for P.
        • Hammersley-Clifford states the converse: if H is an I-map for P, then P is a Gibbs distribution over H (for positive distributions)
    • Weak completeness

      • if X and Y are not separated in H, then there is some distribution P that factorizes over H in which X and Y are dependent.
    • How powerful: can a UGM represent all the conditional independencies of a given P?

      • Perfect map: $\mathcal{J}(G) = \mathcal{J}(P)$

      • UGMs are NOT “richer” or “more powerful” than DGMs.

  • representing the potential (local) functions

    • The log potential is a linear function of the parameters:

    • $$
      \log \psi_{c}\left(\mathbf{y}_{c}\right) \triangleq \boldsymbol{\phi}_{c}\left(\mathbf{y}_{c}\right)^{T} \boldsymbol{\theta}_{c}
      $$

      where $\boldsymbol{\phi}_c(\mathbf{y}_c)$ is a feature vector derived from the values of the variables $\mathbf{y}_c$.

      Then, the log probability is
      $$
      \log p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{c} \boldsymbol{\phi}_{c}\left(\mathbf{y}_{c}\right)^{T} \boldsymbol{\theta}_{c}-\log Z(\boldsymbol{\theta})
      $$
      also called a “maximum entropy” or “log-linear” model.

  • every MRF is an [exponential family](#Exponential Family).

Example of MRF

Model:

  • Ising and Potts (depth map from stereo images (multi-valued) or binary image denoising)
    $$p(\mathbf{y}, \mathbf{x} \mid J, \boldsymbol{\theta}) = p(\mathbf{y} \mid J) \prod_t p(x_t \mid y_t, \boldsymbol{\theta}) = \left[\frac{1}{Z(J)} \prod_{s \sim t} \underbrace{\psi(y_s, y_t ; J)}_{\text{pairwise potential}}\right] \prod_t \underbrace{p(x_t \mid y_t, \boldsymbol{\theta})}_{\text{unary potential}}$$
    [I don't really understand why.]

Parameter Learning

A UGM (MRF) in log-linear form, where c indexes the cliques:
$$
p(\mathbf{y} \mid \boldsymbol{\theta})=\frac{1}{Z(\boldsymbol{\theta})} \exp \left(\sum_{c} \boldsymbol{\theta}_{c}^{T} \boldsymbol{\phi}_{c}(\mathbf{y})\right)
$$
The scaled log-likelihood is given by
$$
\ell(\boldsymbol{\theta}) \triangleq \frac{1}{N} \sum_{i}^{N} \log p\left(\mathbf{y}_{i} \mid \boldsymbol{\theta}\right)=\frac{1}{N} \sum_{i}^{N}\left[\sum_{c} \boldsymbol{\theta}_{c}^{T} \boldsymbol{\phi}_{c}\left(\mathbf{y}_{i}\right)-\log Z(\boldsymbol{\theta})\right]
$$

  • belongs to the exponential family,

  • a convex function in $\boldsymbol{\theta}$, so a unique global maximum can be found by gradient-based optimizers.

  • $$
    \frac{\partial \ell}{\partial \boldsymbol{\theta}_{c}}=\frac{1}{N} \sum_{i}^{N}\left[\boldsymbol{\phi}_{c}\left(\mathbf{y}_{i}\right)-\frac{\partial}{\partial \boldsymbol{\theta}_{c}} \log Z(\boldsymbol{\theta})\right]
    $$

    [Question about the slides, p96: how is this derivative computed?]

  • for exponential family distribution:

    • $E[s(x)] = \nabla_{\eta} \log Z(\eta)$

    • if $s(x) = x$ (natural exponential family), we can find the moments of x simply by differentiation.

    • the derivative of the log partition function w.r.t. $\boldsymbol{\theta}_c$ is the expectation of the c-th feature under the model:

    • $$\frac{\partial \log Z(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}_{c}}=\mathbb{E}\left[\boldsymbol{\phi}_{c}(\mathbf{y}) \mid \boldsymbol{\theta}\right]=\sum_{\mathbf{y}} \boldsymbol{\phi}_{c}(\mathbf{y})\, p(\mathbf{y} \mid \boldsymbol{\theta})$$

    • Proof:

    • $$
      \begin{aligned}
      \frac{\partial \log Z(\theta)}{\partial \theta_{c}} &= \frac{1}{Z(\theta)} \frac{\partial Z(\theta)}{\partial \theta_{c}}, \quad \text{where} \quad Z(\theta) = \sum_{y} \exp \left(\sum_{c} \theta_{c}^{T} \phi_{c}(y)\right)\\
      \Rightarrow \frac{\partial Z(\theta)}{\partial \theta_{c}} &= \sum_{y} \exp \left(\sum_{c} \theta_{c}^{T} \phi_{c}(y)\right) \phi_{c}(y)\\
      \Rightarrow \frac{\partial \log Z(\theta)}{\partial \theta_{c}} &= \frac{1}{Z(\theta)} \sum_{y} \phi_{c}(y) \exp \left(\sum_{c} \theta_{c}^{T} \phi_{c}(y)\right) =\sum_{y} \phi_{c}(y) \underbrace{\frac{1}{Z(\theta)} \exp \left(\sum_{c} \theta_{c}^{T} \phi_{c}(y)\right)}_{p(y \mid \theta)} = \sum_{y} \phi_{c}(y)\, p(y \mid \theta)
      \end{aligned}
      $$

    • So, the gradient of the log-likelihood is

    • $$
      \frac{\partial \ell}{\partial \boldsymbol{\theta}_{c}}=\underbrace{\frac{1}{N} \sum_{i}^{N} \boldsymbol{\phi}_{c}\left(\mathbf{y}_{i}\right)}_{\text{clamped term}} - \underbrace{\mathbb{E}\left[\boldsymbol{\phi}_{c}(\mathbf{y})\right]}_{\text{unclamped/contrastive term}}
      $$

      Clamped term: y is fixed to its observed values. Unclamped/contrastive term: y is a free variable.

      The unclamped term requires inference in the model, once per gradient step, and this makes UGM learning much slower than DGM learning.

    • Gradient of the log-likelihood, rewritten:

    • $$\frac{\partial \ell}{\partial \boldsymbol{\theta}_c} = \mathbb{E}_{p_{emp}}\left[\boldsymbol{\phi}_c(\mathbf{y})\right] - \mathbb{E}_{p(\mathbf{y} \mid \boldsymbol{\theta})}\left[\boldsymbol{\phi}_c(\mathbf{y})\right]$$

      • $\mathbb{E}_{p_{emp}}\left[\boldsymbol{\phi}_c(\mathbf{y})\right] = \frac{1}{N}\sum_{i=1}^N \boldsymbol{\phi}_c(\mathbf{y}_i)$: expected feature vector according to the empirical distribution.
      • $\mathbb{E}_{p(\mathbf{y} \mid \boldsymbol{\theta})}\left[\boldsymbol{\phi}_c(\mathbf{y})\right]$: expected feature vector according to the model’s distribution.

      At the optimum the gradient is 0, i.e. $\mathbb{E}_{p(\mathbf{y} \mid \boldsymbol{\theta})}\left[\boldsymbol{\phi}_c(\mathbf{y})\right] = \frac{1}{N}\sum_{i=1}^N \boldsymbol{\phi}_c(\mathbf{y}_i)$. But this equation cannot be solved in closed form for the unknown $\boldsymbol{\theta}$.

    • Solution: gradient-based optimizers.

    • $$\frac{\partial \ell}{\partial \boldsymbol{\theta}_c} = \mathbb{E}_{p_{emp}}\left[\boldsymbol{\phi}_c(\mathbf{y})\right] - \mathbb{E}_{p(\mathbf{y} \mid \boldsymbol{\theta})}\left[\boldsymbol{\phi}_c(\mathbf{y})\right]$$

      But computing $\mathbb{E}_{p(\mathbf{y} \mid \boldsymbol{\theta})}\left[\boldsymbol{\phi}_c(\mathbf{y})\right]$ requires a sum over all states of $\mathbf{y}$, which is intractable.

    • Solution: combine approximate inference with gradient-based learning. i.e. Stochastic Maximum Likelihood

  • Stochastic Maximum Likelihood:

    iteratively update the parameter $\boldsymbol{\theta}_{k+1}$ at step k using the parameter and gradient from the previous step:
    $$\boldsymbol{\theta}_{k+1} \leftarrow \boldsymbol{\theta}_{k}-\eta\, \mathbf{g}_{k}$$
    where $\eta$ is the step size (learning rate) and $\mathbf{g}_k \approx -\frac{\partial \ell}{\partial \boldsymbol{\theta}_c}$ is the gradient (of the negative log-likelihood), whose intractable expectation can be approximated with Markov chain Monte Carlo (MCMC) sampling. A brute-force sketch of the clamped/unclamped gradient appears after this section.

  • Maximum a posteriori (MAP):

    • add a prior:

    • $$\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\left[\sum_{i}^{N} \log p\left(\mathbf{y}_{i} \mid \boldsymbol{\theta}\right) + \log p(\boldsymbol{\theta})\right]$$

      Choose a prior with hyper-parameters.
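
To make the clamped/unclamped gradient concrete, a toy sketch where $Z(\boldsymbol{\theta})$ and the model expectation are computed by brute-force enumeration (feasible only for tiny models; the model, features, and data are made up):

```python
import numpy as np
from itertools import product

edges = [(0, 1), (1, 2)]                       # pairwise MRF on 3 binary vars

def features(y):
    # phi_c(y) = 1 if the two endpoints of edge c agree, else 0
    return np.array([float(y[s] == y[t]) for s, t in edges])

def expected_features(theta):
    ys = list(product([0, 1], repeat=3))
    scores = np.array([theta @ features(y) for y in ys])
    p = np.exp(scores - scores.max())
    p /= p.sum()                               # exact p(y | theta)
    return sum(pi * features(y) for pi, y in zip(p, ys))

data = [(0, 0, 0), (1, 1, 1), (0, 0, 1)]       # made-up observations
theta = np.zeros(len(edges))
for _ in range(200):                           # gradient ascent on l(theta)
    clamped = np.mean([features(y) for y in data], axis=0)
    unclamped = expected_features(theta)       # "inference" at every step
    theta += 0.5 * (clamped - unclamped)
print(theta)   # agreement features receive positive weights
```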

Conditional Random Fields (CRF)

CRF or discriminative random field

  • a version of an MRF where all the clique potentials are conditioned on input features $\mathbf{x}$:

  • $$p(\mathbf{y} \mid \mathbf{x}, \mathbf{w})=\frac{1}{Z(\mathbf{x}, \mathbf{w})} \prod_{c} \psi_{c}\left(\mathbf{y}_{c} \mid \mathbf{x}, \mathbf{w}\right)$$

    • Log-linear representation of the potentials:

    • $$
      \psi_{c}\left(\mathbf{y}_{c} \mid \mathbf{x}, \mathbf{w}\right)=\exp \left(\mathbf{w}_{c}^{T} \boldsymbol{\phi}\left(\mathbf{x}, \mathbf{y}_{c}\right)\right)
      $$

      where $\boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_c)$ is a feature vector derived from the global inputs $\mathbf{x}$ and the local set of labels $\mathbf{y}_c$.

  • Advantages (compared to generative models):

    • No need to waste resources modelling things that we can always observe.
      • Focus our attention on modeling what we care about, i.e. the distribution of labels given the data.
    • We can make the potentials (or factors) of the model be data-dependent.
      • e.g., in natural language processing problems, we can make the latent labels depend on global properties of the sentence, such as which language it is written in.
  • Disadvantages:

    • require labeled training data
    • learning is slower.
  • Parameter learning:

    Consider a CRF in log-linear form:
    $$
    p(\mathbf{y} \mid \mathbf{x}, \mathbf{w})=\frac{1}{Z(\mathbf{x}, \mathbf{w})} \prod_{c} \exp \left(\mathbf{w}_{c}^{T} \boldsymbol{\phi}_{c}\left(\mathbf{x}, \mathbf{y}_{c}\right)\right)
    $$
    where $\boldsymbol{\phi}_c(\mathbf{x}, \mathbf{y}_c)$ is a feature vector derived from the global inputs $\mathbf{x}$ and the local set of labels $\mathbf{y}_c$.

    Scaled log-likelihood:
    $$
    \ell(\mathbf{w}) \triangleq \frac{1}{N} \sum_{i}^{N} \log p\left(\mathbf{y}_{i} \mid \mathbf{x}_{i}, \mathbf{w}\right) =\frac{1}{N} \sum_{i}^{N}\left[\sum_{c} \mathbf{w}_{c}^{T} \boldsymbol{\phi}_{c}\left(\mathbf{y}_{i}, \mathbf{x}_{i}\right)-\log Z\left(\mathbf{w}, \mathbf{x}_{i}\right)\right]
    $$
    Gradient:
    $$
    \frac{\partial \ell}{\partial \mathbf{w}_{c}} =\frac{1}{N} \sum_{i}^{N}\left[\boldsymbol{\phi}_{c}\left(\mathbf{y}_{i}, \mathbf{x}_{i}\right)-\frac{\partial}{\partial \mathbf{w}_{c}} \log Z\left(\mathbf{w}, \mathbf{x}_{i}\right)\right] =\frac{1}{N} \sum_{i}^{N}\left[\boldsymbol{\phi}_{c}\left(\mathbf{y}_{i}, \mathbf{x}_{i}\right)-\mathbb{E}\left[\boldsymbol{\phi}_{c}\left(\mathbf{y}, \mathbf{x}_{i}\right)\right]\right]
    $$
    ​ We need labeled pairs of data $\left(\mathbf{y}_i, \mathbf{x}_i\right)_{i=1}^N$ for learning, and $\mathbb{E}\left[\boldsymbol{\phi}_{c}\left(\mathbf{y}, \mathbf{x}_{i}\right)\right] =\sum_{\mathbf{y}} p\left(\mathbf{y} \mid \mathbf{x}_{i}, \mathbf{w}\right) \boldsymbol{\phi}_{c}\left(\mathbf{y}, \mathbf{x}_{i}\right)$

    • the expectations depend on the inputs $\mathbf{x}_i$
    • inference is needed for every single training case inside each gradient step, which is $O(N)$ times slower than for an MRF.
    • MAP can also be used to estimate the unknown parameters, $\hat{\mathbf{w}} = \underset{\mathbf{w}}{\operatorname{argmax}}\left[\sum_i \log p(\mathbf{y}_i \mid \mathbf{x}_i, \mathbf{w}) + \log p(\mathbf{w})\right]$, by choosing a prior.

Probabilistic inference

Calculate $p(X_F \mid X_E)$ for arbitrary subsets E and F.

