Uncertainty Modelling in AI

All materials are from Module CS5340 (NUS Computing)

PGM (Probabilistic Graphical Modeling)

  • gain global insight based on local observations

Key Ideas:

  • Represent the world as a collection of random variables $X_1, \cdots, X_N$ with joint distribution $p(X_1, \cdots, X_N)$

  • Learn the distribution from data.

  • Perform “inference”: compute conditional distributions such as $p(X_i \mid X_j=x_j,\cdots,X_k=x_k)$ given observed values of some variables.

Introduction

Probability space

Three parts:

  • Outcome space $\Omega$: space of possible outcomes
  • Event space $E$: $E\subseteq 2^{\Omega}$, a subset of the power set of $\Omega$
    • contains the empty event $\emptyset$ and the trivial event $\Omega$
    • closed under union: if $\alpha, \beta \in E$, so is $\alpha \cup\beta$
    • closed under complement: if $\alpha \in E$, so is $\Omega -\alpha$
  • Probability distribution $P$: mapping from events in $E$ to real values such that
    • Non-negativity: $P(\alpha)\geq 0$ for all $\alpha\in E$
    • Normalization: $P(\Omega)=1$
    • Additivity for mutually disjoint events: if $\alpha, \beta \in E$ and $\alpha \cap \beta=\emptyset$, then $P(\alpha\cup\beta)=P(\alpha)+P(\beta)$.

Probability distribution:

  • $X$: random variables. $x$: generic values [$x\in Val(X)$].

  • Indicator random variables: map every outcome to 0 or 1, indicating whether an event occurred.

  • Discrete vs. continuous

    • Discrete:
      • PMF (Probability mass function): $p(x)$
    • Continuous:
      • PDF (Probability density function): $p(x):\mathbb{R}\rightarrow \mathbb{R}_{\geq 0}$
      • $P(x)$ is the cumulative distribution function (CDF)

Probability:

  • Joint probability: $p(x,y)$, the probability that $X=x$ and $Y=y$ simultaneously.

Basic operations

Marginalization

Recover probability distribution of any variable in a joint distribution by integrating (or summing) over all other variables.

$$p(x)=\int p(x,y)\, dy ;\quad p(y)=\int p(x,y)\, dx$$

$$p(x)=\sum\limits_y p(x,y) ;\quad p(y)=\sum\limits_x p(x,y)$$
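As a minimal sketch of the discrete case (the joint table and its entries are hypothetical), marginalization is just summing a joint probability table over the axes of the other variables:

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows index x (3 values),
# columns index y (2 values). Entries sum to 1.
p_xy = np.array([[0.10, 0.20],
                 [0.25, 0.05],
                 [0.30, 0.10]])

p_x = p_xy.sum(axis=1)  # p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)  # p(y) = sum_x p(x, y)
print(p_x)  # [0.3 0.3 0.4]
print(p_y)  # [0.65 0.35]
```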

Conditional probability

$p(x|Y=y^*)$

can be extracted from joint probability.

$p(x|Y=y^*)=\frac{p(x,Y=y^*)}{\int p(x,Y=y^*)\, dx} = \frac{p(x,Y=y^*)}{p(Y=y^*)}$

i.e. $p(x|y)=\frac{p(x,y)}{p(y)}$

So, $p(x,y)=p(x|y)p(y)=p(y|x)p(x)$: the chain rule (or product rule)

Bayes rule

$$p(x|y)p(y)=p(y|x)p(x) \Longrightarrow p(y|x)=\frac{p(x|y)p(y)}{p(x)}=\frac{p(x|y)p(y)}{\int p(x,y) dy} = \frac{p(x|y)p(y)}{\int p(x|y)p(y)dy}$$
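A small numerical sketch of Bayes' rule for a discrete variable (the prior and likelihood values are hypothetical):

```python
import numpy as np

# Hypothetical two-state prior p(y) and likelihood p(x_obs | y).
prior = np.array([0.7, 0.3])        # p(y)
likelihood = np.array([0.1, 0.8])   # p(x_obs | y) for each state of y

evidence = np.sum(likelihood * prior)       # p(x_obs) = sum_y p(x_obs|y) p(y)
posterior = likelihood * prior / evidence   # p(y | x_obs)
print(posterior)  # [0.2258... 0.7741...], sums to 1
```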

Independence

Independence of X and Y means that conditioning does not change the distribution: $p(x|y)=p(x)$ for every $y$ (every conditional distribution is the same).

when variables are independent, $p(x,y)=p(x|y)p(y)=p(x)p(y)$

Expectation

$E[f(x)]=\sum\limits_x f(x)p(x)\ \text{or} \ \int f(x)p(x)dx$

$E[f(x)+g(x)]=E[f(x)]+E[g(x)]$

$E[f(x)g(y)]=E[f(x)]E[g(y)]$ (if $X$, $Y$ independent)

Probability Distribution

| Data Type | Domain | Distribution | PDF (PMF) | Parameters |
| --- | --- | --- | --- | --- |
| Univariate, discrete, binary | $x\in\{0,1\}$ | Bernoulli | $p(x)=\lambda^x(1-\lambda)^{1-x}$ | $\lambda$ |
| Univariate, discrete, multi-valued | $x\in\{1,2,\cdots,K\}$ | Categorical | $p(\mathbf{x})=\prod\limits_{k=1}^K \lambda_k^{x_k}$ (one-hot $\mathbf{x}$, i.e. $p(x=k)=\lambda_k$) | $\lambda=[\lambda_1,\cdots,\lambda_K]^T$, where $\lambda_k\geq 0$, $\sum_k\lambda_k=1$ |
| Univariate, continuous, unbounded | $x\in \mathbb{R}$ | Univariate normal (Gaussian) | $p(x\mid\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$ | $\mu$, $\sigma^2$ |
| Univariate, continuous, bounded | $x\in [0,1]$ | Beta | $p(x)=\frac{\Gamma[\alpha+\beta]}{\Gamma[\alpha]\Gamma[\beta]} x^{\alpha-1}(1-x)^{\beta-1}$ | $\alpha$, $\beta$ |
| Multivariate, continuous, unbounded | $\mathbf{x}\in \mathbb{R}^D$ | Multivariate normal | $p(\mathbf{x})=\frac{1}{(2\pi)^{D/2} \lvert\boldsymbol{\Sigma}\rvert^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]$ | $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$ |
| Multivariate, continuous, bounded, sum to 1 | $\mathbf{x}=[x_1,x_2,\cdots,x_K]^T$, $x_k\in[0,1]$, $\sum\limits_{k=1}^K x_k=1$ | Dirichlet | $p(\mathbf{x})=\frac{\Gamma\left[\sum_{k=1}^{K} \alpha_{k}\right]}{\prod_{k=1}^{K} \Gamma\left[\alpha_{k}\right]} \prod_{k=1}^{K} x_{k}^{\alpha_{k}-1}$ | $\alpha_k$ ($K$ of them) |
| Bivariate, continuous, $x_1$ unbounded, $x_2$ bounded below | $\mathbf{x}=[x_1, x_2]$, $x_1\in\mathbb{R}$, $x_2\in\mathbb{R}^+$ | Normal-scaled inverse gamma | | |
| Multivariate vector $\mathbf{x}$ and matrix $\mathbf{X}$; $\mathbf{x}$ unbounded, $\mathbf{X}$ square, positive definite | $\mathbf{x}\in\mathbb{R}^K$, $\mathbf{X}\in\mathbb{R}^{K\times K}$, $\mathbf{z}^T\mathbf{X}\mathbf{z}>0\ \forall\mathbf{z}\in\mathbb{R}^K$ | Normal inverse Wishart | | |
  • Conjugate Distributions

    • Parameters of conjugate distributions are known as hyperparameters because they control the parameter distribution

    • Aim: Learning the parameters $\theta$ of a probability distribution:

      ​ $$p(\theta|x)=\frac{p(x|\theta)p(\theta)}{\int p(x|\theta)p(\theta)d\theta}$$

      • $p(\theta)$: prior. $p(x|\theta)$: likelihood. $p(\theta|x)$: posterior
      • Use prior and likelihood to compute posterior. Posterior has the same form as prior $\longrightarrow$ conjugate
    • The reason for using a conjugate prior is that the prior and posterior then have the same functional form. This matches intuition, and it also creates a chain of priors: the current posterior can serve as the prior for the next computation, which brings computational simplicity.

    • This makes Bayesian inference more convenient. For example, in sequential Bayesian inference, after receiving an observation we can compute a posterior. Because the prior is conjugate, the posterior has the same form as the prior, so it can be used as the new prior for the next observation and the next iteration.

    | Distribution | Domain | Conjugate Prior | Prior Distribution | Hyperparameters |
    | --- | --- | --- | --- | --- |
    | Bernoulli | $x\in\{0,1\}$ | Beta | $p(\lambda)=\frac{\Gamma[\alpha+\beta]}{\Gamma[\alpha]\Gamma[\beta]}\lambda^{\alpha-1}(1-\lambda)^{\beta-1}$, $\lambda\in[0,1]$ | $\alpha$, $\beta$ |
    | Binomial | $x\in\{0,1,\cdots,n\}$ | Beta | $p(\lambda)=\frac{\Gamma[\alpha+\beta]}{\Gamma[\alpha]\Gamma[\beta]}\lambda^{\alpha-1}(1-\lambda)^{\beta-1}$, $\lambda\in[0,1]$ | $\alpha$, $\beta$ |
    | Categorical | $x\in \{1, 2, \cdots,K\}$ | Dirichlet | $p\left(\lambda_{1}, \ldots, \lambda_{K}\right)=\frac{\Gamma\left[\sum_{k=1}^{K} \alpha_{k}\right]}{\prod_{k=1}^{K} \Gamma\left[\alpha_{k}\right]} \prod_{k=1}^{K} \lambda_{k}^{\alpha_{k}-1}$ | $K$ values $\alpha_k>0$ |
    | Univariate normal (variance known, mean unknown) | $x\in\mathbb{R}$ | Univariate normal | | |
    | Univariate normal (mean known, variance unknown) | $x\in\mathbb{R}$ | Gamma | | |
    | Univariate normal (both mean and variance unknown) | $x\in\mathbb{R}$ | Normal inverse gamma | $p\left(\mu, \sigma^{2}\right)=\frac{\sqrt{\gamma}}{\sigma \sqrt{2 \pi}} \frac{\beta^{\alpha}}{\Gamma[\alpha]}\left(\frac{1}{\sigma^{2}}\right)^{\alpha+1} \exp \left[-\frac{2 \beta+\gamma(\delta-\mu)^{2}}{2 \sigma^{2}}\right]$ | $\alpha, \beta, \gamma>0$, $\delta\in\mathbb{R}$ |
    | Multivariate normal | $\mathbf{x}\in\mathbb{R}^K$ | Normal inverse Wishart | $p(\boldsymbol{\mu}, \boldsymbol{\Sigma})=\frac{\gamma^{D/2}\lvert\boldsymbol{\Psi}\rvert^{\alpha/2}\exp\left[-\frac{1}{2}\left(\operatorname{Tr}(\boldsymbol{\Psi}\boldsymbol{\Sigma}^{-1})+\gamma(\boldsymbol{\mu}-\boldsymbol{\delta})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}-\boldsymbol{\delta})\right)\right]}{2^{\alpha D/2}(2\pi)^{D/2}\lvert\boldsymbol{\Sigma}\rvert^{(\alpha+D+2)/2}\Gamma_D[\alpha/2]}$ | $\alpha$, $\boldsymbol{\Psi}$, $\gamma$, $\boldsymbol{\delta}$ |

    Gamma Function:

    $\Gamma(z)=\int_0^{\infty} t^{z-1}e^{-t} dt, \quad z\in \mathbb{C}$;

    $\Gamma(n)=(n-1)!,\quad n\in\mathbb{Z}_{>0}$

    Binomial distribution: $p(x)=\binom{n}{x} \lambda^{x}(1-\lambda)^{n-x}$

    Categorical distribution: $p(\mathbf{x})=\prod\limits_{k=1}^K \lambda_k^{x_k}$; for a one-hot $\mathbf{x}$ with $x_k=1$ this is just $\lambda_k$

    https://zhuanlan.zhihu.com/p/26638720
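As a sketch of the sequential Bayesian inference described above (hyperparameters and data are hypothetical), the Beta-Bernoulli conjugate pair only requires incrementing counts, since the posterior after each observation is again a Beta and becomes the next prior:

```python
# Conjugate Beta prior on the Bernoulli parameter lambda.
alpha, beta = 2.0, 2.0        # prior Beta(alpha, beta), hypothetical
data = [1, 0, 1, 1, 0, 1]     # Bernoulli observations, hypothetical

for x in data:
    # Posterior Beta(alpha + x, beta + 1 - x) becomes the next prior.
    alpha += x
    beta += 1 - x

print(alpha, beta)             # Beta(6.0, 4.0)
print(alpha / (alpha + beta))  # posterior mean of lambda = 0.6
```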

Fitting Probability Models

We already know some common parametric probability distributions $p(x|\theta)$.

Now we need to learn the unknown parameters $\theta$ from given data $\mathcal{D}=\{x[1],\cdots,x[N]\}$, and then use these parameters to make predictions.

Learning

Methods for learning the unknown parameters $\theta$ from data

Maximum Likelihood Estimation (MLE)

  • Given data $\mathcal{D}=\{x[1],\cdots,x[N]\}$

  • Assume:

    • a family of distributions $\{p_\theta : \theta \in \Theta\}$ where $p_\theta(x)=p(x|\theta)$
    • $\mathcal{D}$ is sampled from $X_{1}, X_{2}, \ldots, X_{N} \sim p_{\theta^{*}}$ for some $\theta^{*} \in \Theta$
    • Random variables $X_{1}, X_{2}, \ldots, X_{N}$ are i.i.d
  • Goal: estimate $\theta^*$

  • Method: estimate $\theta_{MLE}$ is a maximum likelihood estimate (MLE) for $\theta^*$ if

    ​ $$\begin{aligned} \theta_{MLE}&=\underset{\theta \in \Theta}{\operatorname{argmax}}\,[p(\mathcal{D} | \theta)]\\ &=\underset{\theta \in \Theta}{\operatorname{argmax}}\,[p(x[1],\ldots,x[N] | \theta)]\\ &=\underset{\theta}{\operatorname{argmax}}\left[\prod_{i=1}^{N} p(x[i] | \theta)\right] \quad(\text{i.i.d.}) \quad \longrightarrow \quad \text{decomposes into a product of likelihoods} \end{aligned}$$

  • Solution: maximize the logarithm $\theta_{MLE}=\underset{\theta}{\operatorname{argmax}}\left[\log\left(\prod_{i=1}^{N} p(x[i] | \theta)\right)\right]$. Then solve by setting the derivative w.r.t. $\theta$ to 0. (A numerical sketch appears at the end of this subsection.)

  • Advantages:

    • Easy and fast to compute
    • Nice asymptotic properties:
      • Consistent: if data generated from $f(\theta^*)$, MLE converges to its true value, $\hat{\theta}_{MLE} \rightarrow \theta^*$ as $n \rightarrow \infty$
      • Efficient: there is no consistent estimator that has lower mean squared error than the MLE estimate
    • Functional Invariance: if $\hat{\theta}$ is the MLE of $\theta^*$, and $g(\theta^*)$is a transformation of $\theta^*$ then the MLE for $\alpha=g(\theta^*)$ is $\hat{\alpha}=g(\hat{\theta})$
  • Disadvantages:

    • MLE is a point estimate i.e., does not represent uncertainty over the estimate
    • MLE may overfit.
    • MLE does not incorporate prior information.
    • Asymptotic results are for the limit and assumes model is correct.
    • MLE may not exist or may not be unique
    • not know uncertainty of the parameter estimation
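To make the procedure concrete, here is a minimal sketch of MLE for a univariate Gaussian (simulated data; the closed forms follow from setting the log-likelihood derivatives to zero):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # i.i.d. samples, hypothetical

mu_mle = x.mean()                     # argmax_mu log p(D | mu, sigma^2)
var_mle = ((x - mu_mle) ** 2).mean()  # biased MLE variance (divides by N)
print(mu_mle, var_mle)                # close to the true 5.0 and 4.0
```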

Bayesian Approach

To model uncertainty of the parameter estimation

Fitting: Instead of a point estimate $\hat{\theta}$, compute the posterior distribution over all possible parameter values using Bayes’ rule:

​ $$p(\theta | D)=\frac{\prod_{i=1}^{N} p(x[i] | \theta) p(\theta)}{p(D)}$$

Principle: why pick one set of parameters? There are many values that could have explained the data. Try to capture all of the possibilities.

Model:

  • Goal: Model Uncertainty over $\theta$ using prior distribution

  • Method: find posterior

    ​ $p(\theta | D)=\frac{\prod_{i=1}^{N} p(x[i] | \theta) p(\theta)}{p(D)}$

    ​ $p(\theta | x)=\frac{\prod_{i=1}^{N} p(x[i] | \theta) p(\theta)}{p(x)}=\frac{\prod_{i=1}^{N} p(x[i] | \theta) p(\theta)}{\int \prod_{i=1}^{N} p(x[i] | \theta) p(\theta) d \theta}$

    ​ where

    ​ $\prod_{i=1}^{N} p(x[i] | \theta)\, p(\theta)=\prod_{i=1}^{N} \operatorname{Norm}_{x[i]}\left[\mu, \sigma^{2}\right] \underbrace{\operatorname{NormInvGam}_{\mu, \sigma^{2}}[\alpha, \beta, \gamma, \delta]}_{\text{conjugate prior}}$ if we use a Gaussian as our observation model, i.e. $p(x|\theta)$

Properties:

  • models uncertainty over parameters.
  • incorporating prior information
  • can derive quantities of interest, e.g. $p(x<10|D)$
  • can perform model selection.

Drawbacks:

  • Need Prior
  • Computationally intractable
  • If the initial belief (prior) is not conjugate to the likelihood, the posterior has no convenient closed form.

Maximum a Posterior Estimation (MAP)

Given: data $D=\{x[1], \cdots, x[N]\}$

Assume:

  • joint distribution $p(D,\theta)$
  • $\theta$ is a random variable

Goal: Choose “good” $\theta$

Method:

  • estimate $\theta_{MAP}$ is a maximum a posterior (MAP) estimation if

    $$\begin{aligned} \theta_{MAP}&=\underset{\theta}{\operatorname{argmax}}\, [p(\theta|D)]\\ &=\underset{\theta}{\operatorname{argmax}}\left[\frac{p(D|\theta)p(\theta)}{p(D)} \right] \quad \text{Bayes’ rule}\\ &=\underset{\theta}{\operatorname{argmax}}\left[\frac{\prod_{i=1}^N p(x[i]|\theta)p(\theta)}{p(D)} \right]\quad \text{i.i.d.}\\ &=\underset{\theta}{\operatorname{argmax}}\left[ \prod_{i=1}^N p(x[i]|\theta)p(\theta) \right] \quad p(D)\ \text{is dropped since it is independent of}\ \theta \end{aligned}$$

  • $\theta_{MAP} = \underset{\theta}{\operatorname{argmax}}\left[ \prod_{i=1}^N \underbrace{p(x[i]|\theta)}_{\text{likelihood}}\ \underbrace{p(\theta)}_{\text{prior}} \right]$

  • Maximize the logarithm, then take derivatives and set them to zero $\longrightarrow$ parameter estimates

Results:

  • more data points $\rightarrow$ MAP is closer to MLE
  • fewer data points $\rightarrow$ MAP is closer to the prior

key steps:

  1. Prior distribution

  2. Derive
    $$
    \mathcal{L}=\log p(D | \theta) p(\theta)=\log p(D | \theta)+\log p(\theta)
    $$

  3. Then set $\frac{\partial \mathcal{L}}{\partial \theta_i}=0$ for each parameter $\theta_i$ (a small worked example follows the list)
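As a sketch of these steps with a Bernoulli likelihood and Beta prior (all numbers hypothetical), the derivative condition has a closed-form solution that interpolates between the prior and the MLE:

```python
# MAP for a Bernoulli parameter lambda with a Beta(alpha, beta) prior.
# Setting d/d(lambda) [log p(D|lambda) + log p(lambda)] = 0 gives the
# closed form below (all numbers hypothetical).
alpha, beta = 2.0, 2.0
data = [1, 0, 1, 1, 0, 1]
N, k = len(data), sum(data)

lam_mle = k / N                                     # 0.666..., ignores prior
lam_map = (k + alpha - 1) / (N + alpha + beta - 2)  # 0.625, pulled to prior
print(lam_mle, lam_map)
```

With more data points the MAP estimate approaches the MLE; with fewer it stays near the prior mean, matching the “Results” above.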

Properties:

  • Fast and easy
  • incorporate prior information
  • Avoid overfitting (“regularization”) ?
  • As $n \rightarrow \infty$, MAP tends to look like MLE, but it does not have the same nice asymptotic properties.

Drawbacks:

  • Point estimation (like MLE)
    • not capture uncertainty over estimates
  • NEED to choose Prior.
  • Not functionally invariant:
    • if $\hat{\theta}$ is the MAP estimate of $\theta^*$, and $g(\theta^*)$ is a transformation of $\theta^*$, then the MAP estimate for $\alpha = g(\theta^*)$ is not necessarily $\hat{\alpha} = g(\hat{\theta})$

Prediction

  • MLE / MAP: evaluate the new data point under the parameters learnt by MLE / MAP.

  • Bayesian: calculate weighted sum of predictions from all possible values of parameters:
    $$
    \begin{aligned} p\left(x^{*} | \mathcal{D}\right) &=\frac{p\left(x^{*}, D\right)}{p(D)} \quad \text{(conditional probability)} \\ &=\frac{\int p\left(x^{*}, D, \theta\right) d \theta}{p(D)} \quad \text{(marginal probability)} \\ &=\frac{\int p\left(x^{*}, \theta | D\right) p(D) d \theta}{p(D)} \quad \text{(conditional probability)} \\ &=\int p\left(x^{*} | D, \theta\right) p(\theta | D) d \theta \quad \text{(conditional probability)} \\ &=\int p\left(x^{*} | \theta\right) p(\theta | D) d \theta \quad \text{(conditional independence)} \end{aligned}
    $$

  • Predictive Density

    • $$
      p\left(x^{*} | D\right)=\int \underbrace{p\left(x^{*} | \theta\right)}_{\text{prediction for each possible}\ \theta}\ \underbrace{p(\theta | D)}_{\text{weights}}\ d \theta
      $$

      Make a prediction that is an infinite weighted sum (integral) of the predictions for each parameter value, where the weights are the posterior probabilities.

      We can view the MLE and MAP estimates as probability distributions with zero probability everywhere except at the point estimate.

As training data decrease, the Bayesian predictive becomes appropriately less certain, but MAP remains wrongly overconfident (see the sketch below).
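A minimal sketch of this effect for the Beta-Bernoulli model (hypothetical prior and data), where the posterior predictive has the closed form $(\alpha+k)/(\alpha+\beta+N)$:

```python
# Bayesian predictive vs. MAP plug-in for a Beta-Bernoulli model.
alpha, beta = 1.0, 1.0   # uniform prior, hypothetical
N, k = 3, 3              # tiny dataset: three 1s observed

p_bayes = (alpha + k) / (alpha + beta + N)           # 0.8: stays hedged
lam_map = (k + alpha - 1) / (N + alpha + beta - 2)   # 1.0: overconfident
print(p_bayes, lam_map)
```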

Exponential Family

exponential family:

  • a set of probability distributions $\{p_\theta: \theta \in \Theta\}$ with the form
    $$
    p_{\theta}(x)=\frac{h(x) \exp \left[\eta(\theta)^{\top} s(x)\right]}{Z(\theta)}
    $$
    Where

    • $\theta \in \mathbb{R}^{k}, x \in \mathbb{R}^{d}$
    • natural parameters $\eta(\theta) : \Theta \rightarrow \mathbb{R}^m$
    • sufficient statistics: $s(x): \mathbb{R}^d \rightarrow \mathbb{R}^m$
    • base measure (supporting and scaling) : $h(x):\mathbb{R}^d \rightarrow [0,\infty)$
    • partition function: $Z(\theta):\Theta \rightarrow [0,\infty)$

an exponential family is in its natural (canonical) form if it is parameterized by its natural parameters $\eta$:
$$
p_{\eta}(x)=p(x | \eta)=\frac{h(x) \exp \left[\eta^{\top} s(x)\right]}{Z(\eta)}
$$
Normaliser: $Z(\eta)$
$$
Z(\eta)=\int h(x) \exp \left[\eta^{\top} s(x)\right] d x
$$
Properties:

  • has conjugate prior.
  • fixed number of sufficient statistics that summarize i.i.d. data: the sufficient statistics $s(x)$ reduce arbitrarily many i.i.d. observations to a few estimated values.
  • Posterior predictive distribution always has closed form solution
  • The conjugate-prior property gives a closed-form posterior; otherwise we would need to solve complex integrals. Moreover, conjugacy lets us see clearly how the likelihood affects the distribution.

For any ExpFam:
$$
\mathbb{E}[s(x)]=\nabla \log \mathrm{Z}(\eta)
$$

  • Differentiating $\log Z(\eta)$ readily yields the mean, variance, and other properties of the sufficient statistics $s(x)$.
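A quick numerical check of this identity for the Bernoulli distribution in natural form, where $s(x)=x$ and $Z(\eta)=1+e^{\eta}$ (the chosen $\eta$ is arbitrary):

```python
import numpy as np

eta = 0.5
Z = lambda e: 1.0 + np.exp(e)   # partition function of a natural Bernoulli

h = 1e-6                        # central finite difference of log Z
dlogZ = (np.log(Z(eta + h)) - np.log(Z(eta - h))) / (2 * h)

E_s = np.exp(eta) / (1.0 + np.exp(eta))  # E[x] = sigmoid(eta)
print(dlogZ, E_s)                        # both approx 0.6225
```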

Many distributions are Exponential Family:

| PMF | PDF |
| --- | --- |
| Bernoulli | Normal |
| Binomial | Gamma & Inverse Gamma |
| Categorical/Multinoulli | Wishart & Inverse Wishart |
| Poisson | Beta |
| Multinomial | Dirichlet |
| Negative Binomial | Lognormal |
| | Exponential |
| | … |

Bayesian Networks

Conditional Independence

Fitting probability models (learning) and predictive density (inference)

  • BUT only for the case of a single random variable, i.e. $p(x|\theta)$
  • How about joint probability with $N$ random variables?

Problem:

  • Assume $N$ discrete random variables $x_1, \cdots, x_N$, where $x_i\in\{1,\cdots,K\}$. $\longrightarrow$ need $O(K^N)$ parameters to represent the joint distribution $p(x_1, \cdots, x_N)$ $\longrightarrow$ hard to infer, and a huge amount of data is needed to learn all parameters

  • If all random variables are independent $\longrightarrow$ $O(NK)$ with $p(x_1,\cdots, x_N|\theta) = \prod_{i=1}^N p(x_i|\theta_i)$ $\longrightarrow$ inference is feasible and less data is needed

Definition:

Two random variables $X_A$ and $X_C$ are conditionally independent given $X_B$, $X_{A} \perp X_{C} | X_{B}$ iff $p\left(x_{A}, x_{C} | x_{B}\right)=p\left(x_{A} | x_{B}\right) p\left(x_{C} | x_{B}\right)$ OR $p\left(x_{A} | x_{B}, x_{C}\right)=p\left(x_{A} | x_{B}\right), \quad \forall X_{B} : p\left(x_{B}\right)>0$, which means learning $X_C$ does not change prediction of $X_A$ once we know the value of $X_B$.

Meanings:

  • parameter reduction (see the sketch after this list)
    • let $m_i$ denote the number of parents of node $X_i$, and let each node take on $K$ values.
    • the conditional probability table of $X_i$ uses $K^{m_i+1}$ parameters
    • Summing over all nodes, the parameter count drops from $O(K^N)$ to $O(N\,K^{m+1})$, where $m=\max_i m_i \ll N$
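A sketch of the parameter-count comparison (sizes are hypothetical):

```python
# Full joint vs. Bayesian network parameter counts for N variables with
# K states each, where node i has m_i parents (numbers hypothetical).
K, N = 2, 30
m = [0, 1, 2, 2, 1] + [2] * 25   # parents per node, max m_i = 2

full_joint = K ** N                         # K^N = 1,073,741,824
bayes_net = sum(K ** (mi + 1) for mi in m)  # sum_i K^(m_i + 1) = 226
print(full_joint, bayes_net)
```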

Learning with conditional independence: MLE

  1. represent the joint probability distribution

  2. Taking the logarithm of $p(x|\theta)$ converts the product into sums

  3. find the maximum log-likelihood for each parameter $\theta_i$; this can be done separately for each parameter.

    $\underset{\theta_1}{\operatorname{argmax}} \log p(x|\theta)$, $\underset{\theta_2}{\operatorname{argmax}} \log p(x|\theta)$, $\cdots$ [terms that do not depend on $\theta_1$ can be dropped from the first problem, and so on.]

Learning with conditional independence: MAP

Similar as MLE:
$$
\begin{aligned} \underset{\theta_{i}}{\operatorname{argmax}} \log p(\theta | x) &=\underset{\theta_{i}}{\operatorname{argmax}} \log p(x | \theta) p(\theta) \\ &=\underset{\theta_{i}}{\operatorname{argmax}}\left\{\log p\left(x_{i} | x_{\pi_{i}}, \theta_{i}\right)+\log p\left(\theta_{i}\right)\right\} \end{aligned}
$$
where $\theta=\left(\theta_{1}, \ldots, \theta_{M}\right)$

Bayesian Networks

  • Bayesian Network is a DAG $\mathcal{G}$ (no cycle)

    • nodes are variables $X_i$; the set of parent nodes $X_{\pi_i}$ consists of all variables that are parents of $X_i$
    • Topological ordering: if $X_i \rightarrow X_j$, then $i<j$ [not unique]
  • Path: a sequence of nodes that follows the arrows, respecting direction. vs. Trail: a sequence that follows the edges without considering direction.

  • Ancestors: parents, grand-parents , etc. V.S. Descendants: children, grand-children, etc.

  • Local Markov assumption :

    • Each random variable $X_i$ is independent of its non-descendants $X_{nonDesc(X_i)}$ given its parents $X_{\pi_i}$

    • $$
      \left\{X_{i} \perp\left(X_{\text{nonDesc}\left(X_{i}\right)} \backslash X_{\pi_{i}}\right) | X_{\pi_{i}}\right\}
      $$

    • Given its parents, $X_i$ is independent of its non-descendants $X_{nonDesc(X_i)}$.

    • Then, conditional probability $p\left(x_{i} | x_{\pi_{i}}\right), \quad i=1, \cdots,N$ [according to local parent-children relationship]

Joint probability: Product of all local conditional independence
$$
p\left(x_{1}, \ldots, x_{N}\right)=\prod_{i=1}^{N} p\left(x_{i} | x_{\pi_{i}}\right)
$$
​ because $p\left(x_{1}, \ldots, x_{N}\right)=p\left(x_{1}\right) \prod_{i=2}^{N} p\left(x_{i} | x_{1}, \ldots, x_{i-1}\right) \quad \text{(chain rule)}$

a fully connected DAG can represent any distribution over its random variables.

Interpret arrows as “possible dependence”.
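As a small sketch of this factorization (CPT values hypothetical), the joint of a chain $X_1 \rightarrow X_2 \rightarrow X_3$ is the product of its local conditionals:

```python
# Toy Bayesian network X1 -> X2 -> X3 with hypothetical CPTs.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}  # [x1][x2]
p_x3_given_x2 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}  # [x2][x3]

def joint(x1, x2, x3):
    # p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2)
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(joint(1, 1, 0), total)  # 0.14, and the joint sums to 1.0
```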

Interpreting/Analyzing Bayesian Networks

  • already know the conditional independence based on local parent-children relationship.
  • we want to discover additional independencies beyond these

Methods:

  1. prove $p\left(x_{A}, x_{C} | x_{B}\right)=p\left(x_{A} | x_{B}\right) p\left(x_{C} | x_{B}\right)$ OR $p\left(x_{A} | x_{B}, x_{C}\right)=p\left(x_{A} | x_{B}\right), \quad \forall X_{B} : p\left(x_{B}\right)>0$ for $X_{A} \perp X_{C} | X_{B}$
  2. Graph separation

Graph separation

Precise definition of “blocking” has to be done through the “three canonical 3-node graphs”, and “d-separation”.

Three canonical 3-node graphs:

  • Head-Tail

    • if no variables are observed, A and B are not independent: $A \not\perp B \mid \emptyset$
      $$
      p(a, b)=p(a) \sum_{c} p(c | a) p(b | c)=p(a) p(b | a)
      $$

    • if C observed, using Bayes’ rule, obtain $A \perp B | C$
      $$
      \begin{aligned} p(a, b | c) &=\frac{p(a, b, c)}{p(c)} \\ &=\frac{p(a) p(c | a) p(b | c)}{p(c)} \\ &=p(a | c) p(b | c) \end{aligned} \quad \text{(Bayes' rule)}
      $$

    • Intuition: past A is independent of future B given present C, $\rightarrow$ simple Markov Chain

  • Tail-Tail

    • If no variables are observed, NOT independent: $A \not\perp B \mid \emptyset$

    • If C observed, obtain $A \perp B | C$ , because

    • $$
      \begin{aligned} p(a, b | c) &=\frac{p(a, b, c)}{p(c)} \\ &=p(a | c) p(b | c) \end{aligned}
      $$


  • Head-Head

    • if no variables are observed, we obtain $A \perp B \mid \emptyset$ because
      $$
      p(a, b)=p(a) p(b)
      $$

    • if C is observed, we obtain $A \not\perp B \mid C$, because
      $$
      \begin{aligned} p(a, b | c) &=\frac{p(a, b, c)}{p(c)} \\ &=\frac{p(a) p(b) p(c | a, b)}{p(c)} \end{aligned}
      $$
      ​ which in general does not factor as $p(a|c)\,p(b|c)$

    • “V-structure”.

    • if C unobserved, “blocks”. If C observed, “unblocks”.

    • Observation of any descendent node of C “unblocks” the path from A to B.

    $A \perp B |C$ if all trails from nodes in A to nodes in B are “blocked” when all nodes in C are observed.

    A is then d-separated from B by C.

  • Key: a trail is blocked at a node if either:

    1. The arrows on the trail meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or
    2. The arrows meet head-to-head at the node, and neither the node, nor any of its descendants, is in the set C.

Bayes Ball

  • it’s a “reachability” algorithm:

    1. shade the nodes in set $C$ (given nodes)
    2. Place a ball at each of the nodes in set $A$
    3. Let the balls “bounce around” the graph according to the d-separation rules:
      • if none of the balls reach B, then $A \perp B|C$
      • else $A \not\perp B|C$
  • can be implemented as a breadth-first search

  • procedure of algorithm:

    • given Graph $G$ and sets of nodes $X, Y$, and $Z$
    • Output True if $X\perp Y | Z$, False otherwise.
    • 2 Phases:
      • Phase 1: “Find all the unblocked v-structures”
        • traverse the graph from the leaves to the roots, marking all nodes that are in $Z$ or have descendants in $Z$.
      • Phase 2: “Traverse all the trails starting from $X$”
        • apply breadth-first-search, stopping a specific traversal when we hit a blocked node.
        • If $Y$ is reached during this traversal, return False. Else, return True. (A brute-force alternative is sketched below.)
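The BFS version above follows the course slides; as a simpler, less efficient sketch of the same d-separation rules (graph encoding and names are my own), one can enumerate every trail between two nodes and test whether the observed set blocks all of them:

```python
# Brute-force d-separation test: enumerate trails in the skeleton and
# check the blocking rules at each internal node.
def d_separated(children, x, y, Z):
    parents = {n: set() for n in children}
    for p, cs in children.items():
        for c in cs:
            parents[c].add(p)

    def descendants(n):
        out, stack = set(), [n]
        while stack:
            for c in children[stack.pop()]:
                if c not in out:
                    out.add(c)
                    stack.append(c)
        return out

    neighbours = {n: children[n] | parents[n] for n in children}

    def trails(cur, visited):
        if cur == y:
            yield [cur]
            return
        for nxt in neighbours[cur]:
            if nxt not in visited:
                for rest in trails(nxt, visited | {nxt}):
                    yield [cur] + rest

    for trail in trails(x, {x}):
        blocked = False
        for i in range(1, len(trail) - 1):
            a, m, b = trail[i - 1], trail[i], trail[i + 1]
            if m in children[a] and m in children[b]:   # head-head at m
                # blocked unless m or one of its descendants is observed
                if m not in Z and not (descendants(m) & Z):
                    blocked = True
                    break
            elif m in Z:   # head-tail or tail-tail at an observed node
                blocked = True
                break
        if not blocked:
            return False   # found an active (unblocked) trail
    return True            # every trail is blocked

# classic v-structure A -> C <- B, with D a descendant of C
g = {'A': {'C'}, 'B': {'C'}, 'C': {'D'}, 'D': set()}
print(d_separated(g, 'A', 'B', set()))   # True: unobserved C blocks
print(d_separated(g, 'A', 'B', {'D'}))   # False: observing D unblocks
```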

Bayes Network Examples:

Linear Regression

  • Model:

    • $$
      Y[i]=\mathbf{w}^{\top} \mathbf{x}[i]+\epsilon[i]
      $$

    • where input vector: $\mathbf{x}[i]=\left[x[i]_{1}, x[i]_{2}, \ldots, x[i]_{D}\right]^{\top}$; coefficient vector: $\mathbf{w}=\left[w_{1}, w_{2}, \ldots, w_{D}\right]^{\top}$; noise: $\epsilon[i] \sim N\left(0, \sigma_{n}^{2}\right)$


    • if $\epsilon \sim N\left(0, \sigma_{n}^{2}\right)$ , then $Y \sim N\left(\mathbf{w}^{\top} \mathbf{x}, \sigma_{n}^{2}\right)$

    • Affine property of Gaussian:

      • $f(x)=ax+b$ of Gaussian, $\rightarrow$ , new PDF $\mathcal{N}(a\mu+b,a^2\sigma^2)$
    • Independence assertion:
      $$
      Y[i] \perp Y[i+1] \mid \mathbf{x}[i], \mathbf{w}, \sigma_n^2
      $$
      write the factorization:
      $$
      p(y[1],\cdots,y[N])=\prod\limits_{i}^{N} p(y[i] \mid \mathbf{w}^\top\mathbf{x}[i], \sigma_n^2)
      $$

    • Problem: Assume we know $\sigma_n^2$; how do we learn the unknown parameter $\mathbf{w}$?

      • Approach 1: MLE

      $$
      \theta_{MLE}=\underset{\theta}{\operatorname{argmax}}\left[\prod_{i=1}^{N} p(x[i] | \theta)\right] \quad \text{iid}
      $$

      ​ with $p\left(x | \mu, \sigma^{2}\right)=\operatorname{Norm}_{x}\left[\mu, \sigma^{2}\right]=\frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp -\frac{(x-\mu)^{2}}{2 \sigma^{2}}$

      ​ take the logarithm, and compute the derivative w.r.t. $\mathbf{w}$
      $$
      \log \prod_{i=1}^{N} p(y[i] \mid \mathbf{w}^\top\mathbf{x}[i], \sigma_n^2) = \sum_{i=1}^N \log \mathcal{N}(y[i] \mid \mathbf{w}^\top\mathbf{x}[i], \sigma_n^2)
      \propto-\sum_{i=1}^N \frac{(y[i]-\mathbf{w}^\top \mathbf{x}[i])^2}{2\sigma_n^2}
      $$
      ​ So, stacking the inputs as the rows of a design matrix $\mathbf{X}$,
      $$
      \underset{\mathbf{w}}{\operatorname{argmin}} \sum\limits_{i=1}^N \frac{(y[i]-\mathbf{w}^\top \mathbf{x}[i])^2}{2\sigma_n^2} = \underset{\mathbf{w}}{\operatorname{argmin}}\ (\mathbf{Y}-\mathbf{X} \mathbf{w})^\top (\mathbf{Y}-\mathbf{X} \mathbf{w})
      $$
      ​ Set $\mathcal{L}(\mathbf{w})=(\mathbf{Y}-\mathbf{X} \mathbf{w})^\top (\mathbf{Y}-\mathbf{X} \mathbf{w})$
      $$
      \nabla \mathcal{L}(\mathbf{w})=2\,(\mathbf{X}^\top\mathbf{X}\,\mathbf{w}-\mathbf{X}^\top\mathbf{Y}) = 0
      \quad\Longrightarrow\quad \mathbf{w}=(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}
      $$
      [Question about the slides: the handwritten page, p. 41]
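A numerical sketch of this closed-form solution (simulated data, hypothetical noise level):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))              # rows are the inputs x[i]
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=N)  # y[i] = w^T x[i] + eps[i]

w_mle = np.linalg.solve(X.T @ X, X.T @ y)       # (X^T X)^{-1} X^T Y
print(w_mle)                                    # approx [1.0, -2.0, 0.5]
```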

  • **Model Uncertainty of $\mathbf{w}$**

    • Model:

      • $p(\mathbf{w} | v)=N(\mathbf{0}, v \mathbf{I})$

      • Conditional independence assertion:
        $$
        p(y[1], \ldots, y[N], \mathbf{w}) =p(\mathbf{w} | v) \prod\limits_{i}^N p(y[i] \mid \mathbf{w}^{\top} \mathbf{x}[i], \sigma_{n}^{2} )
        $$

    • Problem: assume $\sigma_n^2$ is known; find the MAP solution for $\mathbf{w}$

      • Same as MLE: take the derivative and set it equal to 0.

      • $$
        \mathbf{w}=(\mathbf{X}^\top\mathbf{X}+\lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{Y},\quad\text{where}\ \lambda=\frac{\sigma_n^2}{v}
        $$

      • if the prior is a Laplacian distribution: $p(\mathbf{w}|0,b)=\frac{1}{2b}\exp \left(-\frac{\|\mathbf{w}\|_1}{b}\right)$

        • $\log p(\mathbf{w}|0,b)\propto -\frac{\|\mathbf{w}\|_1}{b}$; this gives an $\ell_1$-norm penalty with $\lambda = \frac{2\sigma_n^2}{b}$ (lasso-style regularization).
      • derivative of the $\ell_0$-norm: zero wherever it is defined, but only in places where it isn’t interesting. Where it is interesting it’s not differentiable (it has jump discontinuities).

      • derivative of the $\ell_1$-norm: not differentiable with respect to a coordinate wherever that coordinate is zero. Elsewhere, the partial derivatives are just constants, $\pm 1$ depending on the quadrant.

      • derivative of the $\ell_2$-norm: usually people use the $\ell_2$-norm squared so that it’s differentiable even at zero. The gradient of $\|x\|^2$ is $2x$; without the square it is $\frac{x}{\|x\|}$ (i.e. it just points away from zero), which is not differentiable at zero.
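A sketch of the MAP (ridge) solution with the Gaussian prior above, reusing the simulated setup from the MLE sketch; $\sigma_n^2$ and $v$ are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=N)

sigma_n2, v = 0.01, 1.0
lam = sigma_n2 / v                        # lambda = sigma_n^2 / v
w_map = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(w_map)                              # shrunk slightly toward zero
```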

  • Predictive DGM Model

Naive Bayes

  • Model:

    • Class $c \in{1, \ldots, K}$, and input $\mathbf{x}=\left[x_{1}, x_{2}, \ldots, x_{D}\right]^{\top}$

    • Naive Bayes assumes that $p(\mathbf{x} | c)=\prod_{j} p\left(x_{j} | c\right)$. Naive assumption: all features are conditionally independent given the class

    • Filled dots are deterministic parameters. If we have priors, we can make parameters random variables (open circles)


    • Even though the naive Bayes assumption is very strong, why does it still work well?

      • Key idea: it works well when the local feature dependencies between the two classes “cancel out”.

Properties of Bayesian Networks

  • Independence-Maps (I-maps)

    • Theorem:

      Let $P$ be a joint distribution over $\mathcal{X}$ and $G$ be a Bayesian network structure over $\mathcal{X}$. If $P$ factorizes according to $G$, then the local independence assertions $\mathcal{J}_l(G)\subseteq \mathcal{J}(P)$.

      • The local Markov independencies $\mathcal{J}_l(G)$ are the set of all basic conditional independence assertions of the form $\left\{X_{i} \perp\left(X_{\text{nondesc}\left(X_{i}\right)} \backslash X_{\pi_{i}}\right) | X_{\pi_{i}}\right\}$
    • Proof:

      • $\mathcal{J}_l(G)\subseteq \mathcal{J}(P)$ means $p(X_i|X\setminus X_i)=p(X_i|X_{\pi_i})$
        $$
        \begin{aligned}
        p(X_i|X\setminus X_i)&=\frac{p(X_1, \cdots,X_N)}{p(\{X_1,\cdots, X_N\}\setminus X_i)}\\
        &=\frac{\prod\limits_{j=1}^N p(X_j|X_{\pi_j})}{\sum\limits_{X_i} \prod\limits_j p(X_j|X_{\pi_j})}\\
        &=\frac{p(X_i|X_{\pi_i}) \prod\limits_{j\neq i} p(X_j|X_{\pi_j})}{\sum\limits_{X_i}p(X_i|X_{\pi_i}) \prod\limits_{j\neq i} p(X_j|X_{\pi_j})}\\
        &=\frac{p(X_i|X_{\pi_i})}{\sum\limits_{X_i}p(X_i|X_{\pi_i})} \quad \text{because}\ \sum\limits_{X_i}p(X_i|X_{\pi_i})=1\\
        &=p(X_i|X_{\pi_i})
        \end{aligned}
        $$

      • So, $\left\{X_{i} \perp\left(X_{\text{nondesc}\left(X_{i}\right)} \backslash X_{\pi_{i}}\right) | X_{\pi_{i}}\right\}$

  • Global Markov independence:

    • Definition: the set of all independencies that correspond to d-separation in graph $G$ is the set of global Markov independencies:
      $$
      \mathcal{J}(G)=\left\{(X \perp Y | Z) : \operatorname{dsep}_{\mathrm{G}}(X ; Y | Z)\right\}
      $$

    • Soundness:

      • Definition: if a distribution $P$ factorizes according to $G$, then $\mathcal{J}(G)\subseteq \mathcal{J}(P)$.
      • i.e. if two nodes are d-separated given $Z$, they are conditionally independent given $Z$.
    • Completeness:

      • Definition: $P$ is faithful to $G$ if for any conditional independence $(X \perp Y | Z) \in \mathcal{J}(P)$, then $\operatorname{dsep}_{G}(X ; Y | Z)$.
      • i.e. any independence in $P$ is reflected as a d-separation in $G$.
      • Definition: weak completeness
      • Definition: almost completeness
    • How powerful:

      Can we always find a Bayesian network $G$ that represents exactly the conditional independencies of a given $P$?

      • Minimal I-map: “no redundant edges”
      • Perfect map: $\mathcal{J}(G)=\mathcal{J}(P)$

      No; a perfect map does not always exist.

Undirected Graphical Models (Markov Random Fields)

Undirected Graphical Models:

A UGM, also called a Markov random field (MRF) or Markov network, is a graph $\mathcal{G}(\mathcal{V},\mathcal{E})$ with a set of nodes and a set of undirected edges.

  • Parameterization: via factors.

    • A factor $\varphi(\mathcal{C})$ is a function mapping a set of variables to the non-negative real numbers.
  • Factorization:

    • $$
      p\left(x_{1}, \ldots, x_{N}\right)=\frac{1}{Z} \prod_{j=1}^{M} \varphi_{j}\left(\mathcal{C}_{j}\right)
      $$

      (compare: a DGM factorizes as $p\left(x_{1}, \ldots, x_{N}\right)=\prod\limits_{i=1}^{N} p\left(x_{i} | x_{\pi_{i}}\right)$)

    • just like DGMs, UGMs encode a set of conditional independence assertions.
  • Conditional Independence:

    • 3 Markov properties of UGMs:

      1. Global Markov Property

        • $X_{A} \perp X_{B} | X_{C}$ iff $C$ separates $A$ from $B$.
        • NO trails connect $A$ and $B$ after removing all nodes in $C$.
      2. Local Markov Property

        • Markov Blanket $\operatorname{mb}(X_s)$: set of nodes that renders a node $X_s$ conditionally independent of all the other nodes:
          $$
          X_{s} \perp \overbrace{\mathcal{V} \backslash\left\{\mathrm{mb}\left(X_{s}\right), X_{s}\right\}}^{\text{all other nodes in}\ G} \mid \mathrm{mb}\left(X_{s}\right)
          $$

        • Markov blanket in UGM is the set of immediate neighbors

          • a node is conditionally independent of every node other than itself and its neighbors {due to the global Markov property}

          The Markov blanket of a node $X_s$ in a DGM is the set of the node’s parents, children, and co-parents (other parents of its children).

          [Question about the slides, p. 35]

      3. Pairwise Markov Property

        • two nodes $X_s$ and $X_t$ are conditionally independent given the rest if there is no direct edge between them
          $$
          X_{s} \perp X_{t} \mid \mathcal{V} \backslash\left\{X_{s}, X_{t}\right\}, \text { where } \mathcal{E}_{s t}=\emptyset
          $$
          i.e. after removing all other nodes, there is no direct edge between them
    • relationship among three properties.

      • Interrelated.

      • Global Markov $\Longrightarrow$ local Markov $\Longrightarrow$ Pairwise Markov $\Longrightarrow$ Global Markov

        • for positive distributions $p(\mathbf{x})>0$, the three Markov properties are equivalent.

Parameterization of MRFs

Aim: obtain a local parameterization of UGMs.

For DGM:

  • Local conditional probabilities: $p(x_i|x_{\pi_i})$

  • Joint probability is product of local conditional probabilities: $p\left(x_{1}, \ldots, x_{N}\right)=\prod_{i=1}^{N} p\left(x_{i} | x_{\pi_{i}}\right)$

But for UGMs it is difficult to obtain conditional probabilities, since there is no topological ordering; so we discard conditional probabilities.

Represent the joint as a product of local functions.

  • must normalize
  • the factors do NOT represent marginal/conditional distributions

Gibbs distribution

  • Definition: A distribution $P$ is a Gibbs distribution parameterized by a set of factors $\{\varphi_1(\mathcal{C}_1), \varphi_2(\mathcal{C}_2),\cdots, \varphi_M(\mathcal{C}_M)\}$, where
    $$
    p\left(x_{1}, x_{2}, \ldots, x_{N}\right)=\frac{1}{Z} \prod_{j=1}^{M} \varphi_{j}\left(\mathcal{C}_{j}\right)
    $$

  • Factor:

    Two nodes $X_i$ and $X_j$ that are not directly linked in a UGM are conditionally independent given all other nodes.

    $\rightarrow$ Factorize the joint probability so that $X_i$ and $X_j$ are placed in different factors.

    All nodes $X_C$ in a maximal clique $C$ of the UGM form a factor (local function) $\varphi(X_C)$

    • Clique: a fully connected subset of nodes.
    • Maximal Clique $C$: cliques that cannot be extended to include additional nodes

    Hammersley-Clifford: a positive distribution $p(y)>0$ satisfies the conditional-independence properties of undirected graph $\mathcal{H}$ iff $p$ can be represented as a product of factors, one per maximal clique:

    • $$
      p(\mathbf{y} | \boldsymbol{\theta})=\frac{1}{Z(\boldsymbol{\theta})} \prod_{c \in \mathcal{C}} \psi_{c}\left(\mathbf{y}_{c} | \boldsymbol{\theta}_{c}\right)
      $$

      where $\mathcal{C}$ is the set of all maximal cliques; $\psi_c(\cdot)$ is the factor or potential function of clique $c$; $\boldsymbol{\theta}_c$ is the parameter of factor $\psi_c(\cdot)$ for $c\in\mathcal{C}$; $Z(\boldsymbol{\theta})$ is the partition function $Z(\boldsymbol{\theta}) \triangleq \sum\limits_{\mathbf{y}} \prod\limits_{c \in \mathcal{C}} \psi_{c}\left(\mathbf{y}_{c} | \boldsymbol{\theta}_{c}\right)$.

      A UGM does NOT specify a unique factorization.

      Hammersley-Clifford is just one type of factorization.

    Pairwise MRF:

    • parameterization over the edges of the graph, rather than maximal cliques [each edge is a factor; maximal cliques are not considered]
    • SIMPLE but not general.
  • Properties:

    • Independent set; Independence map (I-map);

    • Soundness

      • if $P$ is a Gibbs distribution over $H$, then $H$ is an I-map for $P$.
        • Hammersley-Clifford states that if $H$ is an I-map for $P$, then $P$ is a Gibbs distribution over $H$ (for positive distribution)
    • Weak completeness

      • if $X$ and $Y$ are not separated in $H$, then there is some distribution $P$ that factorizes over $H$ in which $X$ and $Y$ are dependent.
    • How powerful: can a UGM represent all conditional independencies of a given $P$?

      • Perfect map: $\mathcal{J}(G)=\mathcal{J}(P)$

      • UGM are NOT “richer” or “more powerful” than DGM.

  • representing the potential (local) functions

    • Log potentials are linear functions of the parameters:

    • $$
      \log \psi_{c}\left(\mathbf{y}_{c}\right) \triangleq \boldsymbol{\phi}_{c}\left(\mathbf{y}_{c}\right)^{T} \boldsymbol{\theta}_{c}
      $$

      $\boldsymbol{\phi}_c(\mathbf{y}_c)$ is a feature vector derived from the values of the variables $\mathbf{y}_c$.

      Then, log probability is
      $$
      \log p(\mathbf{y} | \boldsymbol{\theta})=\sum_{c} \boldsymbol{\phi}_{c}\left(\mathbf{y}_{c}\right)^{T} \boldsymbol{\theta}_{c}-\log Z(\boldsymbol{\theta})
      $$
      also called “maximum entropy” or “log-linear” model.

  • every MRF is an [exponential family](#Exponential Family).

Example of MRF

Model:

  • Ising and Potts (Depth Map from Stereo Images (multi-valued) or Binary Image Denoising)
$$ \begin{aligned} p(\mathbf{y}, \mathbf{x} | J, \boldsymbol{\theta}) =p(\mathbf{y} | J) \prod_{t} p\left(x_{t} | y_{t}, \boldsymbol{\theta}\right) =\left[\frac{1}{Z(J)} \underbrace{\prod_{s \sim t} \psi\left(y_{s}, y_{t} ; J\right)}_{\text{Pairwise potential}}\right] \underbrace{\prod_{t} p\left(x_{t} | y_{t}, \boldsymbol{\theta}\right)}_{\text{Unary potential}} \end{aligned} $$ [I don’t fully understand why this holds]

Parameter Learning

UGM (MRF) in log-linear form, where $c$ indexes the cliques:
$$
p(\mathbf{y} | \boldsymbol{\theta})=\frac{1}{Z(\boldsymbol{\theta})} \exp \left(\sum_{c} \boldsymbol{\theta}_{c}^{T} \boldsymbol{\phi}_{c}(\mathbf{y})\right)
$$
The scaled log-likelihood is given by
$$
\ell(\boldsymbol{\theta}) \triangleq \frac{1}{N} \sum_{i}^{N} \log p\left(\mathbf{y}_{i} | \boldsymbol{\theta}\right)=\frac{1}{N} \sum_{i}^{N}\left[\sum_{c} \boldsymbol{\theta}_{c}^{T} \boldsymbol{\phi}_{c}\left(\mathbf{y}_{i}\right)-\log Z(\boldsymbol{\theta})\right]
$$

  • belongs to exponential family,

  • convex function in $\theta$, so can find unique global maximum by gradient-based optimizers.

  • $$
    \frac{\partial \ell}{\partial \boldsymbol{\theta}_{c}}=\frac{1}{N} \sum_{i}^{N}\left[\boldsymbol{\phi}_{c}\left(\mathbf{y}_{i}\right)-\frac{\partial}{\partial \boldsymbol{\theta}_{c}} \log Z(\boldsymbol{\theta})\right]
    $$

    [Question about the slides, p. 96: how is this derivative obtained?]

  • for exponential family distribution:

    • $$
      \mathbb{E}[s(x)]=\nabla \log Z(\eta)
      $$

    • if $s(x)=x$ (natural exponential family), we can find moments of $x$ simply by differentiation.

    • the derivative of the log partition function w.r.t. $\theta_c$ is the expectation of the $c^{th}$ feature under the model:

    • $$
      \frac{\partial \log Z(\boldsymbol{\theta})}{\partial \theta_{c}}=\mathbb{E}\left[\phi_{c}(\mathbf{y}) | \boldsymbol{\theta}\right]=\sum_{\mathbf{y}} \phi_{c}(\mathbf{y}) p(\mathbf{y} | \boldsymbol{\theta})
      $$

    • Proof:

    • $$
      \begin{aligned}
      \frac{\partial \log Z(\theta)}{\partial \theta_{c}} &= \frac{1}{Z(\theta)} \frac{\partial Z(\theta)}{\partial \theta_{c}}, \quad \text{where} \quad Z(\theta) = \sum_{y} \exp \left(\sum_{c} \theta_{c}^{T} \phi_{c}(y)\right)\\
      \Rightarrow \frac{\partial Z(\theta)}{\partial \theta_{c}} &= \sum_{y} \exp \left(\sum_{c} \theta_{c}^{T} \phi_{c}(y)\right) \phi_{c}(y)\\
      \Rightarrow \frac{\partial \log Z(\theta)}{\partial \theta_{c}} &= \frac{1}{Z(\theta)} \sum_{y} \phi_{c}(y) \exp \left(\sum_{c} \theta_{c}^{T} \phi_{c}(y)\right) =\sum_{y} \phi_{c}(y) \underbrace{\frac{1}{Z(\theta)} \exp \left(\sum_{c} \theta_{c}^{T} \phi_{c}(y)\right)}_{p(y|\theta)} = \sum_{y} \phi_{c}(y) p(y | \theta)
      \end{aligned}
      $$

    • So, the gradient of the log-likelihood is

    • $$
      \frac{\partial \ell}{\partial \boldsymbol{\theta}_{c}}=\left[\underbrace{\frac{1}{N} \sum_{i}^{N} \boldsymbol{\phi}_{c}\left(\mathbf{y}_{i}\right)}_{\text{Clamped term}}\right] - \underbrace{\mathbb{E}\left[\boldsymbol{\phi}_{c}(\mathbf{y})\right]}_{\text{Unclamped/contrastive term}}
      $$

      Clamped term: $\mathbf{y}$ is fixed to its observed values. Unclamped/contrastive term: $\mathbf{y}$ is a free variable.

      The unclamped term requires inference in the model, once per gradient step, which makes UGM learning much slower than DGM learning.

    • Gradient of log-likelihood rewrite:

    • $$
      \frac{\partial l}{\partial \theta_{c}}=E_{p_{e m p}}\left[\phi_{c}(y)\right]-E_{p(y | \theta)}\left[\phi_{c}(y)\right]
      $$

      • $E_{p_{e m p}}\left[\phi_{c}(y)\right]=\frac{1}{N} \sum_{n=1}^{N} \phi_{c}\left(y_{i}\right)$ : Expected feature vector according to empirical distribution.
      • $E_{p(y | \theta)}\left[\phi_{c}(y)\right]$ : Expected feature vector according to model’s distribution.

      At the optimum the gradient is 0, i.e. $E_{p(y | \theta)}\left[\phi_{c}(y)\right]=E_{p_{emp}}\left[\phi_{c}(y)\right]$. But this equation cannot be solved in closed form for the unknown $\theta$.

    • Solution: Gradient-based optimizers.

    • $$
      \frac{\partial l}{\partial \theta_{c}}=E_{p_{e m p}}\left[\phi_{c}(y)\right]-E_{p(y | \theta)}\left[\phi_{c}(y)\right]
      $$

      But computing $E_{p(y | \theta)}\left[\phi_{c}(y)\right]$ requires a sum over all states of $y$, which is intractable.

    • Solution: combine approximate inference with gradient-based learning. i.e. Stochastic Maximum Likelihood

  • Stochastic Maximum Likelihood:

    iteratively updates the parameter $\theta_{k+1}$ at step $k+1$ using the parameter and gradient from the previous step:
    $$
    \theta_{k+1} \leftarrow \theta_{k}-\eta g_{k}
    $$
    where $\eta$ is the step size or learning rate, and $g_k\approx \frac{\partial \ell}{\partial \theta_c}$ is the gradient, which can be approximated with Markov chain Monte Carlo (MCMC) sampling.

  • Maximum A posterior (MAP):

    • add prior:

    • $$
      \underset{\theta}{\operatorname{argmax}}\left\{\sum_{i} \log p\left(y_{i} | \theta\right)+\log \underbrace{p(\theta)}_{\text{prior}}\right\}
      $$

      Choose prior with hyper-parameters.
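To make the clamped/unclamped decomposition concrete, here is a tiny log-linear MRF with two binary nodes and a single agreement feature, small enough that the unclamped expectation can be computed by exact enumeration (model, feature, and data are all hypothetical):

```python
import numpy as np
from itertools import product

# Tiny log-linear MRF: two binary nodes, one agreement feature
# phi(y) = 1[y1 == y2], scalar parameter theta.
def phi(y):
    return float(y[0] == y[1])

def model_expectation(theta):
    states = list(product([0, 1], repeat=2))
    scores = np.array([np.exp(theta * phi(y)) for y in states])
    p = scores / scores.sum()           # p(y | theta) = exp(theta*phi(y)) / Z
    return float(sum(pi * phi(y) for pi, y in zip(p, states)))

data = [(0, 0), (1, 1), (1, 1), (0, 1)]  # observed configurations
theta = 0.5

clamped = np.mean([phi(y) for y in data])   # empirical feature expectation
unclamped = model_expectation(theta)        # requires inference in the model
print(clamped - unclamped)  # > 0: increase theta to favor agreement
```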

Conditional Random Fields (CRF)

CRF or discriminative random field

  • a version of MRF where all the clique potentials are conditioned on input feature $X$:

  • $$
    p(\mathbf{y} | \mathbf{x}, \mathbf{w})=\frac{1}{Z(\mathbf{x}, \mathbf{w})} \prod_{c} \psi_{c}\left(\mathbf{y}_{c} | \mathbf{x}, \mathbf{w}\right)
    $$

    • Log-linear representation of potentials:

    • $$
      \psi_{c}\left(\mathbf{y}_{c} | \mathbf{x}, \mathbf{w}\right)=\exp \left(\mathbf{w}_{c}^{T} \boldsymbol{\phi}_{c}\left(\mathbf{x}, \mathbf{y}_{c}\right)\right)
      $$

      where $\boldsymbol{\phi}_c(\mathbf{x}, \mathbf{y}_c)$ is a feature vector derived from the global inputs $\mathbf{x}$ and the local set of labels $\mathbf{y}_c$.

  • Advantages: (compare to generative models)

    • No need to waste resources, i.e. modelling things that we can always observe.
      • Focus our attention on modeling what we care about, i.e. the distribution of labels given the data.
    • We can make the potentials (or factors) of the model be data-dependent.
      • e.g., in natural language processing problems, we can make the latent labels depend on global properties of the sentence, such as which language it is written in.
  • Disadvantages:

    • require labeled training data
    • learning is slower.
  • Parameter learning:

    Consider a CRF in log-linear form:
    $$
    p(\mathbf{y} | \mathbf{x}, \mathbf{w})=\frac{1}{Z(\mathbf{x}, \mathbf{w})} \prod_{c} \exp \left(\mathbf{w}_{c}^{T} \boldsymbol{\phi}_{c}\left(\mathbf{x}, \mathbf{y}_{c}\right)\right)
    $$
    where $\boldsymbol{\phi}_{c}\left(\mathbf{x}, \mathbf{y}_{c}\right)$ is a feature vector derived from the global inputs $\mathbf{x}$ and the local set of labels $\mathbf{y}_c$.

    Scaled log-likelihood:
    $$
    \ell(\mathbf{w}) \triangleq \frac{1}{N} \sum_{i}^{N} \log p\left(\mathbf{y}_{i} | \mathbf{x}_{i}, \mathbf{w}\right) =\frac{1}{N} \sum_{i}^{N}\left[\sum_{c} \mathbf{w}_{c}^{T} \boldsymbol{\phi}_{c}\left(\mathbf{y}_{i}, \mathbf{x}_{i}\right)-\log Z\left(\mathbf{w}, \mathbf{x}_{i}\right)\right]
    $$
    Gradient:
    $$
    \frac{\partial \ell}{\partial \mathbf{w}_{c}} =\frac{1}{N} \sum_{i}^{N}\left[\boldsymbol{\phi}_{c}\left(\mathbf{y}_{i}, \mathbf{x}_{i}\right)-\frac{\partial}{\partial \mathbf{w}_{c}} \log Z\left(\mathbf{w}, \mathbf{x}_{i}\right)\right] =\frac{1}{N} \sum_{i}^{N}\left[\boldsymbol{\phi}_{c}\left(\mathbf{y}_{i}, \mathbf{x}_{i}\right)-\mathbb{E}\left[\boldsymbol{\phi}_{c}\left(\mathbf{y}, \mathbf{x}_{i}\right)\right]\right]
    $$
    ​ We need labeled pairs of data $\{(\mathbf{y}_i,\mathbf{x}_i)\}_{i=1}^N$ for learning, and $\mathbb{E}\left[\boldsymbol{\phi}_{c}\left(\mathbf{y}, \mathbf{x}_{i}\right)\right] =\sum_{\mathbf{y}} p\left(\mathbf{y} | \mathbf{x}_{i}, \mathbf{w}\right) \boldsymbol{\phi}_{c}\left(\mathbf{y}, \mathbf{x}_{i}\right)$

    • depends on the inputs $\mathbf{x}_i$
    • Need inference for every single training case inside each gradient step, which is $O(N)$ times slower than MRF.
    • We can also use MAP to estimate the unknown parameters, $\underset{\mathbf{w}}{\operatorname{argmax}}\left\{\sum_i \log p\left(\mathbf{y}_{i} | \mathbf{x}_{i}, \mathbf{w}\right)+\log p(\mathbf{w})\right\}$, by choosing a prior.

Probabilistic inference

calculate $p(X_F|X_E)$ for arbitrary subsets $E$ (evidence) and $F$ (query)

