Artificial Intelligence

when people talk about AI, they usually mean supervised learning

ethics

  • Mill’s utilitarianism: maximize overall benefit, counting everyone roughly equally

    bane:

    • conflict of interest
    • personal benefit
  • Kant’s formalism / duty ethics: unconditional commands that apply to every individual

    bane: universal principle may harm specific people

  • Locke’s rights ethics: individuals have rights simply by existing

    bane: what to do when two people’s rights conflict

  • Aristotle’s virtue ethics: objective goodness from human qualities

    bane: how to find the “golden mean”

the golden rule (which all of the above agree with)

Do unto others as you would have others do unto you

codes of ethics

statements of general principles, followed by instructions for specific conduct

defines the duties the professional owes to society/employers/clients/colleagues/subordinates/profession/self

engineering design process

  1. recognize problem/need, gather information
  2. define problem/goal
  3. generate/propose solution/method
  4. evaluate benefits & costs of alternatives

handling ethical issues

  1. correct the problem
  2. whistle blowing
  3. resign in protest

definition of artificial intelligence

  • humanly → measured against human performance (e.g. acting humanly: the Turing test)
  • rationally → measured against an ideal standard from math/theory

Turing test

  • NLP
  • knowledge representation
  • automated reasoning
  • ML

total Turing test

perceptual abilities

  • computer vision
  • robotics

thinking humanly

  • get inside the workings of the human mind
  • general problem solver
  • cognitive science

thinking rationally

  • correctness
  • logic
  • fact-check

knowledge-based system

  • general-purpose search
  • domain-specific knowledge
  • knowledge bottleneck

intelligent agent

rational agent

  1. prior knowledge of environment
  2. performable action
  3. performance measurement
  4. perception

task environment

PEAS: performance measure, environment, actuators, sensors

properties

  1. fully/partially observable
  2. single/multiple agent
  3. deterministic/stochastic
  4. episodic/sequential
  5. static/dynamic
  6. known/unknown

agent structure

  • function: perception → action
  • architecture: sensory → actuator

table-driven structure

  • simplest
  • e.g. industrial robot
  • number of cases explodes (table grows too large)

simple reflex agent

match input against condition-action rules, return an action; only works if the environment is fully observable (a minimal sketch follows this list)

more capable agent structures

  • model-based: maintains an internal model of the world, so it also handles partial observability
  • goal-based: searches for action sequences that reach a goal (e.g. breadth/depth first)
  • utility-based: maximize expected gain (utility)
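A minimal sketch of the rule-matching idea in Python; the vacuum-world percepts and rules are illustrative assumptions, not from the notes.

```python
# Simple reflex agent: match the current percept against condition-action
# rules and return an action (hypothetical two-square vacuum world).

RULES = {
    ("A", "dirty"): "suck",
    ("B", "dirty"): "suck",
    ("A", "clean"): "move_right",
    ("B", "clean"): "move_left",
}

def simple_reflex_agent(percept):
    """percept = (location, status); decides on the current percept only."""
    return RULES[percept]

print(simple_reflex_agent(("A", "dirty")))  # -> suck
```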

example application

collaborative perception

  • share raw data → huge overhead
  • share extracted features
  • share object positions

anomaly detection

search

  • bidirectional search
  • backtracking search

breadth-first search

  • complete: will find the shallowest goal if the branching factor is finite
  • optimal if path cost is a non-decreasing function of depth
  • time/space complexity $O(b^d)$, where $b$ is the branching factor and $d$ is the depth of the shallowest goal (sketch below)
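A short breadth-first search sketch matching the properties above; the adjacency-dict graph format is an assumption.

```python
from collections import deque

def breadth_first_search(graph, start, goal):
    """Return the shallowest path from start to goal, or None.

    graph: dict mapping a node to a list of neighbor nodes.
    Complete if the branching factor is finite; optimal when path cost
    is a non-decreasing function of depth (e.g. unit step costs).
    """
    frontier = deque([[start]])          # FIFO queue of paths
    visited = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nbr in graph.get(node, []):
            if nbr not in visited:
                visited.add(nbr)
                frontier.append(path + [nbr])
    return None
```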

uniform-cost search

  • expand the node with the least path cost $g(n)$
  • optimal; complete if every action cost is at least some $\epsilon > 0$

A* search

  • cost $f(n) = g(n) + h(n)$
    • $g(n)$: cost to reach node $n$
    • $h(n)$: estimated cost from $n$ to the goal (the heuristic)
  • optimal if $h$ is admissible in tree search
    • admissibility: $h$ never overestimates the true cost to the goal
  • optimal if $h$ is consistent in graph search
    • consistency: triangle inequality $h(n) \le c(n, a, n') + h(n')$
  • optimally efficient if $h$ is consistent: expands the fewest nodes among optimal algorithms (a sketch follows this list)
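A compact A* sketch for $f(n) = g(n) + h(n)$; the `neighbors`/`cost`/`h` callables are assumed interfaces, not from the notes.

```python
import heapq, itertools

def a_star(start, goal, neighbors, cost, h):
    """A* graph search: always expand the node with the least f(n) = g(n) + h(n).

    neighbors(n) -> iterable of successor nodes
    cost(n, n')  -> step cost
    h(n)         -> heuristic estimate of cost-to-goal
    """
    tie = itertools.count()                          # tie-breaker so nodes are never compared
    frontier = [(h(start), next(tie), 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        f, _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g                           # optimal if h is admissible/consistent
        for nbr in neighbors(node):
            g2 = g + cost(node, nbr)
            if g2 < best_g.get(nbr, float("inf")):   # found a cheaper way to nbr
                best_g[nbr] = g2
                heapq.heappush(frontier, (g2 + h(nbr), next(tie), g2, nbr, path + [nbr]))
    return None, float("inf")
```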

reinforcement learning

  • control system
  • model-based vs model-free

dynamic programming

  • key idea: break the problem into overlapping sub-problems and reuse their solutions

example

  • Dijkstra’s algorithm: processes one node at a time (greedy node selection)
  • Bellman-Ford algorithm: relaxes all edges once per hop count (DP; a sketch follows)
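A Bellman-Ford sketch showing the per-hop DP structure; the edge-list format is an assumption.

```python
def bellman_ford(n, edges, source):
    """Shortest paths as DP over the number of hops.

    Conceptually dist[k][v] = cheapest path to v using at most k hops;
    the k dimension is rolled into one array updated once per pass.
    edges: list of (u, v, weight) for nodes 0..n-1.
    """
    INF = float("inf")
    dist = [INF] * n
    dist[source] = 0.0
    for _ in range(n - 1):                 # a shortest path uses at most n-1 hops
        for u, v, w in edges:
            if dist[u] + w < dist[v]:      # relax edge (u, v)
                dist[v] = dist[u] + w
    return dist
```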

discrete Markov decision process (discrete MDP)

finite tuple $(S, A, \{P_{sa}\}, \gamma, R)$

  • state space $S$

  • action set $A$

  • state transition probabilities $P_{sa}$

  • discount factor $\gamma \in [0, 1)$

  • reward function $R: S \to \mathbb{R}$: evaluation metric

  • total payoff $R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots$. maximize this

  • policy $\pi: S \to A$. find this

find optimal policy

optimal policy

  • value function $V^\pi: S \to \mathbb{R}$ maps a state to the expected total payoff from starting in that state and following $\pi$

    Bellman equation: $V^\pi(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V^\pi(s')$

  • optimal value function $V^*(s) = \max_\pi V^\pi(s)$; optimal policy $\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s') V^*(s')$

value iteration
  1. Bellman update: $V(s) := R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s') V(s')$ (sketch after this list)

  • for a fixed policy the Bellman equation is a linear system in $V^\pi$ ($|S|$ equations, $|S|$ unknowns)
  • the update applies the Bellman backup operator $B$: $V := B(V)$
  • synchronous or asynchronous updates
  • $B$ is a contraction, which forces $V$ to converge to $V^*$ exponentially fast
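A minimal tabular value-iteration sketch of the Bellman update; the `(A, S, S)` transition-array layout and stopping tolerance are assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Synchronous Bellman updates V(s) := R(s) + gamma * max_a sum_s' P_sa(s') V(s').

    P: array of shape (A, S, S), P[a, s, s'] = transition probability
    R: array of shape (S,), state reward
    """
    V = np.zeros(R.shape[0])
    while True:
        Q = R[None, :] + gamma * (P @ V)      # Q[a, s] = R(s) + gamma * sum_s' P[a,s,s'] V[s']
        V_new = Q.max(axis=0)                 # greedy backup over actions
        if np.max(np.abs(V_new - V)) < tol:   # contraction => exponentially fast convergence
            return V_new
        V = V_new
```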
policy iteration
  1. initialize $\pi$ randomly

  2. repeat until convergence:

    1. policy evaluation: set $V := V^\pi$ by solving the Bellman equation (a linear system)
    2. policy improvement: $\pi(s) := \arg\max_{a \in A} \sum_{s'} P_{sa}(s') V(s')$

  • when it converges, the policy is guaranteed optimal ($V = V^*$, $\pi = \pi^*$); see the sketch below
  • high complexity: solves a linear system every step
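A policy-iteration sketch matching the steps above (exact evaluation via a linear solve, then greedy improvement), using the same assumed `P`/`R` layout as the value-iteration sketch.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """P: (A, S, S) transition probabilities, R: (S,) state rewards."""
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)                 # 1. arbitrary initial policy
    while True:
        # 2a. policy evaluation: solve (I - gamma * P_pi) V = R, a linear system
        P_pi = P[pi, np.arange(S), :]           # (S, S): row s is P[pi[s], s, :]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R)
        # 2b. policy improvement: greedy with respect to V
        pi_new = (P @ V).argmax(axis=0)
        if np.array_equal(pi_new, pi):          # converged: pi is optimal
            return pi, V
        pi = pi_new
```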
exploration and exploitation
  • $\epsilon$-greedy

    • $\epsilon$ is small and decreases over time
  • softmax: choose action $a$ with probability proportional to $e^{Q(s,a)/\tau}$ (both shown in the sketch below)
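The two exploration rules as code; the temperature name `tau` is an assumption.

```python
import numpy as np

def epsilon_greedy(Q_s, eps):
    """With probability eps pick a random action, otherwise argmax_a Q(s, a)."""
    if np.random.random() < eps:                     # explore; eps is small and decays
        return np.random.randint(len(Q_s))
    return int(np.argmax(Q_s))                       # exploit

def softmax_action(Q_s, tau=1.0):
    """Boltzmann exploration: P(a) proportional to exp(Q(s, a) / tau)."""
    prefs = (np.asarray(Q_s) - np.max(Q_s)) / tau    # shift by max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(Q_s), p=probs))
```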

continuous Markov decision process (continuous MDP)

inverted pendulum

  • kinematic model, e.g. state $(x, \dot{x}, \theta, \dot{\theta})$

discretization

  • curse of dimensionality: with $n$ state dimensions and $k$ grid values per dimension, the discretized state space has $k^n$ states
  • bad for smooth functions (piecewise-constant approximation)
  • typically feasible only up to roughly 4 to 8 dimensions

value function approximation

approximate $V^*$ directly, without discretizing the state space

getting a model (for running trials)

  1. model/simulator: a black box that takes $(s_t, a_t)$ and outputs $s_{t+1}$ sampled from $P_{s_t a_t}$

    • assume the action space is discrete (the state space stays continuous)

  2. learn a model from data

    • run $m$ trials, each with $T$ time steps
    • supervised learning on the observed transitions
      • e.g. linear regression: $s_{t+1} = A s_t + B a_t$ (a fitting sketch follows this list)
      • deterministic or stochastic model
        • stochastic: add a noise term $\epsilon_t \sim \mathcal{N}(0, \Sigma)$
      • this is model-based reinforcement learning
      • then run fitted value iteration on the learned model
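A least-squares fit of $s_{t+1} \approx A s_t + B a_t$ from recorded trials, as a sketch; the array shapes are assumptions.

```python
import numpy as np

def fit_linear_model(states, actions, next_states):
    """Fit s_{t+1} ~= A s_t + B a_t by least squares over all recorded transitions.

    states, next_states: (N, n_s) arrays; actions: (N, n_a) array,
    stacked from m trials of T steps each.
    """
    X = np.hstack([states, actions])                      # (N, n_s + n_a)
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)   # (n_s + n_a, n_s)
    n_s = states.shape[1]
    A, B = W[:n_s].T, W[n_s:].T                           # so that s' = A s + B a
    # stochastic model: noise eps_t ~ N(0, Sigma), Sigma estimated from residuals
    resid = next_states - X @ W
    Sigma = np.cov(resid.T)
    return A, B, Sigma
```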
fitted value iteration

approximate $V^*(s)$ as a function of state features: $V(s) = \theta^T \phi(s)$

  1. trial: randomly sample $m$ states $s^{(1)}, \dots, s^{(m)} \in S$

  2. initialization: $\theta := 0$

  3. repeat for each sampled state $i = 1, \dots, m$:

    1. repeat for each action $a \in A$:

      sample $s'_1, \dots, s'_k \sim P_{s^{(i)} a}$ using the model; set $q(a) := R(s^{(i)}) + \gamma \frac{1}{k} \sum_{j=1}^{k} V(s'_j)$

      $y^{(i)} := \max_a q(a)$ is an estimation of $R(s^{(i)}) + \gamma \max_a \mathbb{E}_{s' \sim P_{s^{(i)} a}}[V(s')]$

fit $\theta$ with any regression model, e.g. linear regression: $\theta := \arg\min_{\theta} \frac{1}{2} \sum_{i=1}^{m} \left( \theta^T \phi(s^{(i)}) - y^{(i)} \right)^2$ (condensed sketch below)

  • for a deterministic model, can set $k = 1$
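A condensed fitted-value-iteration sketch following the steps above; the `simulate`, `reward`, and `phi` interfaces are assumed.

```python
import numpy as np

def fitted_value_iteration(sample_states, actions, simulate, reward, phi,
                           gamma=0.99, k=10, n_iters=50):
    """V(s) ~= theta^T phi(s), fitted by repeated regression on Bellman backups.

    sample_states: randomly sampled states s^(1..m)
    simulate(s, a) -> one sampled next state s' ~ P_sa (the model/simulator)
    reward(s) -> R(s); phi(s) -> feature vector
    """
    Phi = np.array([phi(s) for s in sample_states])       # (m, d) design matrix
    theta = np.zeros(Phi.shape[1])                        # initialization
    for _ in range(n_iters):
        y = []
        for s in sample_states:
            # q(a) = R(s) + gamma * (1/k) * sum_j theta^T phi(s'_j)
            q = [reward(s) + gamma * np.mean(
                     [theta @ phi(simulate(s, a)) for _ in range(k)])
                 for a in actions]
            y.append(max(q))                              # y^(i) = max_a q(a)
        # fit theta by least squares: min ||Phi theta - y||^2
        theta, *_ = np.linalg.lstsq(Phi, np.array(y), rcond=None)
    return theta
```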

Mealy machine MDP

  • reward depends on the action as well as the state, $R(s, a)$ (like a Mealy machine, whose output depends on state and input)

Bellman equation: $V^*(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P_{sa}(s') V^*(s') \right]$

finite horizon MDP

  • finite tuple $(S, A, \{P^{(t)}_{sa}\}, T, R^{(t)})$

  • time horizon $T$

  • maximize $\mathbb{E}\left[\sum_{t=0}^{T} R^{(t)}(s_t, a_t)\right]$ (no discount factor)

  • action depends on time: non-stationary policy $\pi^{(t)}: S \to A$

  • time-dependent dynamics: $s_{t+1} \sim P^{(t)}_{s_t a_t}$

    • solution by dynamic programming: work backwards from $T$, i.e. $V^*_T(s) = \max_a R^{(T)}(s, a)$, then for $t < T$: $V^*_t(s) = \max_a \left[ R^{(t)}(s, a) + \sum_{s'} P^{(t)}_{sa}(s') V^*_{t+1}(s') \right]$
linear quadratic regulation (LQR)
  • linear transition with noise: $s_{t+1} = A_t s_t + B_t a_t + w_t$, where $w_t \sim \mathcal{N}(0, \Sigma_w)$

  • negative quadratic reward that pushes the system back toward the origin: $R^{(t)}(s_t, a_t) = -\left( s_t^T U_t s_t + a_t^T W_t a_t \right)$ with $U_t, W_t \succeq 0$

policy search methods

  • stochastic policy $\pi_\theta$

  • $\pi_\theta(s, a)$: probability of taking action $a$ at state $s$

  • direct policy search: find a reasonable $\theta$

    • fixed initial state $s_0$
    • greedy stochastic gradient ascent on the expected total payoff
    • learning rate $\alpha$

repeat:

  1. sample a trajectory $s_0, a_0, s_1, a_1, \dots, s_T, a_T$ by executing $\pi_\theta$

  2. REINFORCE update: $\theta := \theta + \alpha \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t) \right] \cdot (\text{total payoff of the trajectory})$

reason it converges: by the product rule, the expected value of this update equals the gradient of the expected total payoff, so on average the update performs gradient ascent (sketch below)
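A minimal REINFORCE sketch with a softmax policy over per-action features; the environment interface (`reset`/`step`) and feature map `phi` are assumptions.

```python
import numpy as np

def softmax_policy(theta, phi_sa):
    """phi_sa: (n_actions, d) feature vectors for each action in the current state."""
    prefs = phi_sa @ theta
    prefs -= prefs.max()                       # numerical stability
    return np.exp(prefs) / np.exp(prefs).sum()

def reinforce(env, phi, theta, alpha=0.01, episodes=1000):
    """theta := theta + alpha * [sum_t grad log pi(s_t, a_t)] * (total payoff)."""
    for _ in range(episodes):
        s, done = env.reset(), False           # assumed env: reset() -> state
        grads, total_payoff = [], 0.0
        while not done:
            phi_sa = phi(s)                     # (n_actions, d)
            probs = softmax_policy(theta, phi_sa)
            a = np.random.choice(len(probs), p=probs)
            # grad_theta log pi(s, a) for a softmax policy:
            grads.append(phi_sa[a] - probs @ phi_sa)
            s, r, done = env.step(a)            # assumed env: step(a) -> (s', r, done)
            total_payoff += r
        theta += alpha * total_payoff * np.sum(grads, axis=0)
    return theta
```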

partially observable MDP (POMDP)

reinforce with baseline

  • subtract an arbitrary baseline $b$ from the payoff in the update to reduce variance
    • $b$ must be independent of the actions $a_t$, so the expected update is unchanged

Monte Carlo method

  • trial: run a complete episode under the current policy
  • wait until the end of the episode to compute the return, then update toward it
  • drawback: slow if episodes are long

temporal-difference learning (TD learning)

at time $t+1$, update $V(s_t) := V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$

  • TD error: $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ (sketch below)
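The TD(0) update as code; the dict-based value table is an assumption.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One temporal-difference update of the state-value table V (a dict)."""
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # delta_t
    V[s] = V.get(s, 0.0) + alpha * td_error                     # V(s_t) += alpha * delta_t
    return td_error
```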

on-policy TD (SARSA)

  • behavior policy: $\epsilon$-greedy with respect to $Q$
  1. initialize $Q(s, a)$ arbitrarily
  2. repeat for each episode:
    1. initialize $s$

    2. behavior policy: select $a$ for $s$ based on $Q$ ($\epsilon$-greedy)

    3. repeat for each step:

      1. take action $a$, observe $r$ and $s'$; select the potential next action $a'$ for $s'$ based on $Q$ ($\epsilon$-greedy)

      2. update $Q(s, a) := Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$, then $s := s'$, $a := a'$ (loop sketch below)

      until $s$ is terminal
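A tabular SARSA loop following the steps above, reusing the `epsilon_greedy` helper from the exploration sketch; the `env` interface is an assumption.

```python
import numpy as np
from collections import defaultdict

def sarsa(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """On-policy TD control: the update target uses the action actually selected next."""
    Q = defaultdict(lambda: np.zeros(n_actions))   # Q(s, a), arbitrary init (zeros)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q[s], eps)              # select a for s based on Q (eps-greedy)
        done = False
        while not done:
            s2, r, done = env.step(a)              # take a, observe r and s'
            a2 = epsilon_greedy(Q[s2], eps)        # select a' for s' based on Q (eps-greedy)
            Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])
            s, a = s2, a2
    return Q
```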

off-policy TD: Q-learning

same as SARSA except the update target uses the greedy action:

  • $Q(s, a) := Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$ (snippet below)
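Relative to the SARSA loop above, only the update target changes (greedy max instead of the behavior policy's next action):

```python
# Inside the same loop as the SARSA sketch: the behavior action stays eps-greedy,
# but the target is greedy, which makes the method off-policy.
Q[s][a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s][a])
```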