3D Vision (1)
Notes on 3D Vision from Shenlong Wang’s lecture slides: image formation, camera basics, and correspondence.
3D Transform
Homogeneous Transformation Matrix
- Rotation Matrix \(R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}\)
- Translation Vector \(T = \begin{bmatrix} t_{1} \\ t_{2} \\ t_{3} \end{bmatrix}\)
- Homogeneous Transformation Matrix \(H = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}\)
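A minimal NumPy sketch (not from the slides) of assembling \(H\) from \(R\) and \(T\) and applying it to a point in homogeneous coordinates:

```python
import numpy as np

def make_homogeneous(R, t):
    """Assemble H = [[R, t], [0, 1]] from a 3x3 rotation and a 3-vector translation."""
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = t
    return H

# Transform a point: append 1 to make it homogeneous, then multiply.
R = np.eye(3)                       # identity rotation
t = np.array([1.0, 2.0, 3.0])       # pure translation
H = make_homogeneous(R, t)
p = np.array([0.0, 0.0, 0.0, 1.0])  # world origin in homogeneous coordinates
print(H @ p)                        # -> [1. 2. 3. 1.]
```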
Euler Angles
- Roll, Pitch, Yaw \(R = R_z(\psi)R_y(\theta)R_x(\phi)\) \(R_x(\phi) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(\phi) & -\sin(\phi) \\ 0 & \sin(\phi) & \cos(\phi) \end{bmatrix}\) \(R_y(\theta) = \begin{bmatrix} \cos(\theta) & 0 & \sin(\theta) \\ 0 & 1 & 0 \\ -\sin(\theta) & 0 & \cos(\theta) \end{bmatrix}\) \(R_z(\psi) = \begin{bmatrix} \cos(\psi) & -\sin(\psi) & 0 \\ \sin(\psi) & \cos(\psi) & 0 \\ 0 & 0 & 1 \end{bmatrix}\)
- Order Matters!!
- Gimbal Lock: when the second rotation (e.g., \(\theta = \pm\pi/2\)) aligns the first and third rotation axes, they rotate about the same axis and one degree of freedom is lost.
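The elementary rotations above can be sketched in NumPy (an illustration, not code from the slides); the last lines demonstrate that the order of composition matters:

```python
import numpy as np

def Rx(phi):
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def Ry(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def Rz(psi):
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def euler_to_R(phi, theta, psi):
    """Compose in the slides' order: R = Rz(psi) Ry(theta) Rx(phi)."""
    return Rz(psi) @ Ry(theta) @ Rx(phi)

# Order matters: swapping the composition gives a different rotation.
print(np.allclose(Rz(0.3) @ Rx(0.1), Rx(0.1) @ Rz(0.3)))  # -> False
```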
Axis Angle
- Rodrigues’ Rotation Formula \(R = I + \sin(\Psi)[u]_{\times} + (1-\cos(\Psi))[u]_{\times}^2\) \([u]_{\times} = \begin{bmatrix} 0 & -u_z & u_y \\ u_z & 0 & -u_x \\ -u_y & u_x & 0 \end{bmatrix}\)
- Suffering from “edges”: axis-angle space is a ball of radius \(\pi\), and antipodal points on its boundary (\(\Psi = \pi\)) represent the same rotation, so the representation is discontinuous there.
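Rodrigues’ formula translates directly into NumPy (a sketch, not code from the slides):

```python
import numpy as np

def skew(u):
    """Cross-product matrix [u]_x."""
    return np.array([[0, -u[2], u[1]],
                     [u[2], 0, -u[0]],
                     [-u[1], u[0], 0]])

def rodrigues(u, psi):
    """R = I + sin(psi) [u]_x + (1 - cos(psi)) [u]_x^2 for axis u, angle psi."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)  # make sure the axis is a unit vector
    K = skew(u)
    return np.eye(3) + np.sin(psi) * K + (1 - np.cos(psi)) * (K @ K)

# A 90-degree rotation about z maps the x-axis to the y-axis.
print(rodrigues([0, 0, 1], np.pi / 2) @ np.array([1.0, 0.0, 0.0]))  # ~ [0, 1, 0]
```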
Quaternions
- \[q = (w, x, y, z) = w + xi + yj + zk\]
- Hamilton Product \(q_1 = (w_1, x_1, y_1, z_1)\) \(q_2 = (w_2, x_2, y_2, z_2)\) \(q_1 \otimes q_2 = (w_1w_2 - x_1x_2 - y_1y_2 - z_1z_2, w_1x_2 + x_1w_2 + y_1z_2 - z_1y_2, w_1y_2 - x_1z_2 + y_1w_2 + z_1x_2, w_1z_2 + x_1y_2 - y_1x_2 + z_1w_2)\)
- Unit Quaternion as Rotation: \(q \cdot q^* = 1\), where \(q = (\cos(\Psi/2), \sin(\Psi/2) \cdot u)\) rotates by angle \(\Psi\) about unit axis \(u\)
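The Hamilton product and the rotation \(v' = q \otimes (0, v) \otimes q^*\) can be sketched as follows (illustrative NumPy, not from the slides):

```python
import numpy as np

def qmul(q1, q2):
    """Hamilton product of quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def rotate(q, v):
    """Rotate a 3-vector v by unit quaternion q via q ⊗ (0, v) ⊗ q*."""
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return qmul(qmul(q, np.concatenate([[0.0], v])), q_conj)[1:]

# 90 degrees about z: q = (cos(psi/2), sin(psi/2)·u) with u = (0, 0, 1)
psi = np.pi / 2
q = np.array([np.cos(psi / 2), 0.0, 0.0, np.sin(psi / 2)])
print(rotate(q, np.array([1.0, 0.0, 0.0])))  # ~ [0, 1, 0]
```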
Cheat Sheet
Camera Basics
Pinhole Camera Model
- Trade-off: a small aperture gives a sharp but dark image, a large one a bright but blurry image => use a lens
Depth of Focus and Depth of Field (figure source: Bilibili)
Camera Model
Take the pinhole as the camera center and the virtual plane as the image plane, as illustrated below:
We have the following relations: \(x = P \cdot X\) where $ P $ is the camera projection matrix, $ X $ is the 3D point in the world coordinate system, and $ x $ is the 2D point in the image coordinate system.
The projection matrix $ P $ can be decomposed as: \(P = K[R|t]\) where $ K $ is the camera intrinsic matrix, $ R $ is the rotation matrix, and $ t $ is the translation vector. More information can be found in my other post Camera Calibration.
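A minimal sketch of the projection \(x = K[R|t]X\), with hypothetical intrinsics (focal length 500 px, principal point (320, 240)) chosen purely for illustration:

```python
import numpy as np

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)   # camera axes aligned with the world
t = np.zeros(3) # camera at the world origin
P = K @ np.hstack([R, t[:, None]])   # 3x4 projection matrix P = K[R|t]

X = np.array([0.2, -0.1, 2.0, 1.0])  # homogeneous 3D world point
x = P @ X
x = x[:2] / x[2]                     # perspective division to pixel coordinates
print(x)                             # -> [370. 215.]
```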
Correspondence
According to Takeo Kanade, the three most important problems in computer vision are “correspondence, correspondence, and correspondence.”
Optical Flow
Brightness Constancy \(I(x, y, t-1) = I(x + u(x,y), y + v(x,y), t)\) where $ u(x,y) $ and $ v(x,y) $ are the optical flow in the x and y directions at position $ (x, y) $. Through Taylor expansion of the right-hand side, we have: \(I(x + u, y + v, t) \approx I(x, y, t-1) + \frac{\partial I}{\partial x}u + \frac{\partial I}{\partial y}v + \frac{\partial I}{\partial t}\) Substituting into the constancy equation gives the shorthand: \(I_xu + I_yv + I_t = 0\)
Lucas-Kanade Method
We want to solve for u and v in the equation above. The L-K method assumes that the optical flow is constant within a local patch, stacking one constraint per pixel \(q_i\): \(\left\{ \begin{array}{l} I_x(q_1)V_x + I_y(q_1)V_y = -I_t(q_1) \\ I_x(q_2)V_x + I_y(q_2)V_y = -I_t(q_2) \\ \vdots \\ I_x(q_n)V_x + I_y(q_n)V_y = -I_t(q_n) \end{array} \right.\) This overdetermined system is solved in the least-squares sense.
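The stacked system above can be sketched as a least-squares solve in NumPy (illustrative; the derivatives `Ix`, `Iy`, `It` are assumed to be precomputed for the patch):

```python
import numpy as np

def lucas_kanade_patch(Ix, Iy, It):
    """Stack one constraint I_x V_x + I_y V_y = -I_t per pixel of a patch
    and solve for the flow (V_x, V_y) in the least-squares sense."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)  # n x 2 design matrix
    b = -It.ravel()
    V, *_ = np.linalg.lstsq(A, b, rcond=None)
    return V  # (V_x, V_y)

# Synthetic check: build It from a known flow, then recover it.
Ix = np.array([[1.0, 0.0], [0.5, 1.0]])
Iy = np.array([[0.0, 1.0], [1.0, 0.5]])
It = -(Ix * 0.3 + Iy * (-0.2))  # true flow (0.3, -0.2)
print(lucas_kanade_patch(Ix, Iy, It))  # ~ [0.3, -0.2]
```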
Horn-Schunck Method
The H-S method treats this as an optimization problem. It assumes that the optical flow is smooth over the whole image (i.e., the regularization term): \(\min_{u, v} \int \int (I_xu + I_yv + I_t)^2 + \alpha(||\nabla u||^2 + ||\nabla v||^2) dxdy\) which can be solved via the Euler-Lagrange equations.
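A sketch of the classical Jacobi-style iteration for the Euler-Lagrange equations of this energy, assuming periodic boundaries (via `np.roll`) for the neighbourhood average:

```python
import numpy as np

def horn_schunck(Ix, Iy, It, alpha=1.0, n_iters=100):
    """Iterate u <- u_avg - Ix(Ix u_avg + Iy v_avg + It)/(alpha^2 + Ix^2 + Iy^2),
    and symmetrically for v; u_avg/v_avg are 4-neighbour averages."""
    u = np.zeros_like(Ix, dtype=float)
    v = np.zeros_like(Ix, dtype=float)
    den = alpha**2 + Ix**2 + Iy**2
    for _ in range(n_iters):
        u_avg = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                 np.roll(u, 1, 1) + np.roll(u, -1, 1)) / 4.0
        v_avg = (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
                 np.roll(v, 1, 1) + np.roll(v, -1, 1)) / 4.0
        num = Ix * u_avg + Iy * v_avg + It
        u = u_avg - Ix * num / den
        v = v_avg - Iy * num / den
    return u, v
```

On a uniform image with `Ix = 1`, `Iy = 0`, `It = -0.5` everywhere, the iteration converges to the constant flow `(0.5, 0)` that exactly satisfies the brightness constraint.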
Deep Learning Methods
- FlowNet
- PWC-Net
- RAFT
Dense Point Tracking
- TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement
Keypoint Tracking/Sparse Correspondence
- Virtual Correspondence: Humans as a Cue for Extreme-View Geometry
- SIFT (scale-invariant feature transform)
- Step 1: Detect distinctive keypoints
- Step 2: Compute oriented histogram gradient features (SIFT feature)
- Step 3: Measure distances between each pair
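Step 3 can be sketched as nearest-neighbour matching with Lowe’s ratio test (a NumPy illustration, not from the slides; real SIFT descriptors are 128-D, the 2-D vectors here are toy data):

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio=0.8):
    """Match each descriptor in desc1 to its nearest neighbour in desc2,
    keeping a match only if it clearly beats the runner-up (ratio test)."""
    # Pairwise Euclidean distances, shape (n1, n2).
    dists = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    matches = []
    for i, row in enumerate(dists):
        j1, j2 = np.argsort(row)[:2]         # best and second-best candidates
        if row[j1] < ratio * row[j2]:        # best is distinctly closer
            matches.append((i, j1))
    return matches

desc1 = np.array([[0.0, 0.0], [5.0, 5.0]])
desc2 = np.array([[0.1, 0.0], [5.0, 5.1], [10.0, 10.0]])
print(match_descriptors(desc1, desc2))  # -> [(0, 0), (1, 1)]
```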