3D Vision (1)
Notes on 3D Vision from Shenlong Wang’s lecture slides: image formation, camera basics, and correspondence.
3D Transform
Homogeneous Transformation Matrix
- Rotation Matrix \(R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}\)
- Translation Vector \(T = \begin{bmatrix} t_{1} \\ t_{2} \\ t_{3} \end{bmatrix}\)
- Homogeneous Transformation Matrix \(H = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}\)
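A minimal NumPy sketch (not from the slides) of assembling \(H\) from \(R\) and \(T\) and applying it to a point in homogeneous coordinates:

```python
import numpy as np

def make_homogeneous(R, t):
    """Assemble H = [[R, t], [0, 1]] from a 3x3 rotation and a 3-vector translation."""
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = t
    return H

# Transform a point: append 1 to make it homogeneous, then multiply.
R = np.eye(3)                       # identity rotation
t = np.array([1.0, 2.0, 3.0])       # pure translation
H = make_homogeneous(R, t)
p = np.array([0.0, 0.0, 0.0, 1.0])  # world origin in homogeneous coordinates
print(H @ p)                        # -> [1. 2. 3. 1.]
```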
Euler Angles
- Roll, Pitch, Yaw \(R = R_z(\psi)R_y(\theta)R_x(\phi)\) \(R_x(\phi) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(\phi) & -\sin(\phi) \\ 0 & \sin(\phi) & \cos(\phi) \end{bmatrix}\) \(R_y(\theta) = \begin{bmatrix} \cos(\theta) & 0 & \sin(\theta) \\ 0 & 1 & 0 \\ -\sin(\theta) & 0 & \cos(\theta) \end{bmatrix}\) \(R_z(\psi) = \begin{bmatrix} \cos(\psi) & -\sin(\psi) & 0 \\ \sin(\psi) & \cos(\psi) & 0 \\ 0 & 0 & 1 \end{bmatrix}\)
- Order Matters!!
- Gimbal Lock: when the second rotation (e.g., \(\theta = \pm\pi/2\)) aligns the first and third rotation axes, they rotate about the same axis and one degree of freedom is lost.
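The elementary rotations above can be sketched in NumPy (an illustration, not code from the slides); the last lines demonstrate that the order of composition matters:

```python
import numpy as np

def Rx(phi):
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def Ry(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def Rz(psi):
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def euler_to_R(phi, theta, psi):
    """Compose in the slides' order: R = Rz(psi) Ry(theta) Rx(phi)."""
    return Rz(psi) @ Ry(theta) @ Rx(phi)

# Order matters: swapping the composition gives a different rotation.
print(np.allclose(Rz(0.3) @ Rx(0.1), Rx(0.1) @ Rz(0.3)))  # -> False
```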
Axis Angle
- Rodrigues’ Rotation Formula \(R = I + \sin(\Psi)[u]_{\times} + (1-\cos(\Psi))[u]_{\times}^2\) \([u]_{\times} = \begin{bmatrix} 0 & -u_z & u_y \\ u_z & 0 & -u_x \\ -u_y & u_x & 0 \end{bmatrix}\)
- Suffering from “edges”: axis-angle space is a ball of radius \(\pi\), and antipodal points on its boundary (\(\Psi = \pi\)) represent the same rotation, so the representation is discontinuous there.
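Rodrigues’ formula translates directly into NumPy (a sketch, not code from the slides):

```python
import numpy as np

def skew(u):
    """Cross-product matrix [u]_x."""
    return np.array([[0, -u[2], u[1]],
                     [u[2], 0, -u[0]],
                     [-u[1], u[0], 0]])

def rodrigues(u, psi):
    """R = I + sin(psi) [u]_x + (1 - cos(psi)) [u]_x^2 for axis u, angle psi."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)  # make sure the axis is a unit vector
    K = skew(u)
    return np.eye(3) + np.sin(psi) * K + (1 - np.cos(psi)) * (K @ K)

# A 90-degree rotation about z maps the x-axis to the y-axis.
print(rodrigues([0, 0, 1], np.pi / 2) @ np.array([1.0, 0.0, 0.0]))  # ~ [0, 1, 0]
```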
Quaternions
- \[q = (w, x, y, z) = w + xi + yj + zk\]
- Hamilton Product \(q_1 = (w_1, x_1, y_1, z_1)\) \(q_2 = (w_2, x_2, y_2, z_2)\) \(q_1 \otimes q_2 = (w_1w_2 - x_1x_2 - y_1y_2 - z_1z_2, w_1x_2 + x_1w_2 + y_1z_2 - z_1y_2, w_1y_2 - x_1z_2 + y_1w_2 + z_1x_2, w_1z_2 + x_1y_2 - y_1x_2 + z_1w_2)\)
- Unit Quaternion as Rotation: \(q \cdot q^* = 1\), where \(q = (\cos(\Psi/2), \sin(\Psi/2) \cdot u)\) rotates by angle \(\Psi\) about unit axis \(u\)
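The Hamilton product and the rotation \(v' = q \otimes (0, v) \otimes q^*\) can be sketched as follows (illustrative NumPy, not from the slides):

```python
import numpy as np

def qmul(q1, q2):
    """Hamilton product of quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def rotate(q, v):
    """Rotate a 3-vector v by unit quaternion q via q ⊗ (0, v) ⊗ q*."""
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return qmul(qmul(q, np.concatenate([[0.0], v])), q_conj)[1:]

# 90 degrees about z: q = (cos(psi/2), sin(psi/2)·u) with u = (0, 0, 1)
psi = np.pi / 2
q = np.array([np.cos(psi / 2), 0.0, 0.0, np.sin(psi / 2)])
print(rotate(q, np.array([1.0, 0.0, 0.0])))  # ~ [0, 1, 0]
```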
Cheat Sheet
Camera Basics
Pinhole Camera Model
- Trade-off: a small aperture gives a sharp but dark image, a large one a bright but blurry image => use a lens
Depth of Focus and Depth of Field (figure source: Bilibili)
Camera Model
Take the pinhole as the camera center and the virtual plane as the image plane, as illustrated below:
We have the following relations: \(x = P \cdot X\) where $ P $ is the camera projection matrix, $ X $ is the 3D point in the world coordinate system, and $ x $ is the 2D point in the image coordinate system.
The projection matrix $ P $ can be decomposed as: \(P = K[R|t]\) where $ K $ is the camera intrinsic matrix, $ R $ is the rotation matrix, and $ t $ is the translation vector. More information can be found in my other post Camera Calibration.
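A minimal sketch of the projection \(x = K[R|t]X\), with hypothetical intrinsics (focal length 500 px, principal point (320, 240)) chosen purely for illustration:

```python
import numpy as np

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)   # camera axes aligned with the world
t = np.zeros(3) # camera at the world origin
P = K @ np.hstack([R, t[:, None]])   # 3x4 projection matrix P = K[R|t]

X = np.array([0.2, -0.1, 2.0, 1.0])  # homogeneous 3D world point
x = P @ X
x = x[:2] / x[2]                     # perspective division to pixel coordinates
print(x)                             # -> [370. 215.]
```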
Correspondence
According to Takeo Kanade, the three most important problems in computer vision are “correspondence, correspondence, and correspondence.”
Optical Flow
Brightness Constancy \(I(x, y, t-1) = I(x + u(x,y), y + v(x,y), t)\) where $ u(x,y) $ and $ v(x,y) $ are the optical flow in the x and y directions at position $ (x, y) $. Through Taylor expansion of the right-hand side, we have: \(I(x + u, y + v, t) \approx I(x, y, t-1) + \frac{\partial I}{\partial x}u + \frac{\partial I}{\partial y}v + \frac{\partial I}{\partial t}\) Substituting into the constancy equation gives the shorthand: \(I_xu + I_yv + I_t = 0\)
Lucas-Kanade Method
We want to solve for u and v in the equation above. The L-K method assumes that the optical flow is constant within a local patch, stacking one constraint per pixel \(q_i\): \(\left\{ \begin{array}{l} I_x(q_1)V_x + I_y(q_1)V_y = -I_t(q_1) \\ I_x(q_2)V_x + I_y(q_2)V_y = -I_t(q_2) \\ \vdots \\ I_x(q_n)V_x + I_y(q_n)V_y = -I_t(q_n) \end{array} \right.\) This overdetermined system is solved in the least-squares sense.
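The stacked system above can be sketched as a least-squares solve in NumPy (illustrative; the derivatives `Ix`, `Iy`, `It` are assumed to be precomputed for the patch):

```python
import numpy as np

def lucas_kanade_patch(Ix, Iy, It):
    """Stack one constraint I_x V_x + I_y V_y = -I_t per pixel of a patch
    and solve for the flow (V_x, V_y) in the least-squares sense."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)  # n x 2 design matrix
    b = -It.ravel()
    V, *_ = np.linalg.lstsq(A, b, rcond=None)
    return V  # (V_x, V_y)

# Synthetic check: build It from a known flow, then recover it.
Ix = np.array([[1.0, 0.0], [0.5, 1.0]])
Iy = np.array([[0.0, 1.0], [1.0, 0.5]])
It = -(Ix * 0.3 + Iy * (-0.2))  # true flow (0.3, -0.2)
print(lucas_kanade_patch(Ix, Iy, It))  # ~ [0.3, -0.2]
```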
Horn-Schunck Method
The H-S method treats this as an optimization problem. It assumes that the optical flow is smooth over the whole image (i.e., the regularization term): \(\min_{u, v} \int \int (I_xu + I_yv + I_t)^2 + \alpha(||\nabla u||^2 + ||\nabla v||^2) dxdy\) which can be solved via the Euler-Lagrange equations.
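A sketch of the classical Jacobi-style iteration for the Euler-Lagrange equations of this energy, assuming periodic boundaries (via `np.roll`) for the neighbourhood average:

```python
import numpy as np

def horn_schunck(Ix, Iy, It, alpha=1.0, n_iters=100):
    """Iterate u <- u_avg - Ix(Ix u_avg + Iy v_avg + It)/(alpha^2 + Ix^2 + Iy^2),
    and symmetrically for v; u_avg/v_avg are 4-neighbour averages."""
    u = np.zeros_like(Ix, dtype=float)
    v = np.zeros_like(Ix, dtype=float)
    den = alpha**2 + Ix**2 + Iy**2
    for _ in range(n_iters):
        u_avg = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                 np.roll(u, 1, 1) + np.roll(u, -1, 1)) / 4.0
        v_avg = (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
                 np.roll(v, 1, 1) + np.roll(v, -1, 1)) / 4.0
        num = Ix * u_avg + Iy * v_avg + It
        u = u_avg - Ix * num / den
        v = v_avg - Iy * num / den
    return u, v
```

On a uniform image with `Ix = 1`, `Iy = 0`, `It = -0.5` everywhere, the iteration converges to the constant flow `(0.5, 0)` that exactly satisfies the brightness constraint.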
Deep Learning Methods
- FlowNet
- PWC-Net
- RAFT
Dense Point Tracking
- TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement
Keypoint Tracking/Sparse Correspondence
- Virtual Correspondence: Humans as a Cue for Extreme-View Geometry
- SIFT (scale-invariant feature transform)
- Step 1: Detect distinctive keypoints
- Step 2: Compute oriented histogram gradient features (SIFT feature)
- Step 3: Measure distances between each pair
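Step 3 can be sketched as nearest-neighbour matching with Lowe’s ratio test (a NumPy illustration, not from the slides; real SIFT descriptors are 128-D, the 2-D vectors here are toy data):

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio=0.8):
    """Match each descriptor in desc1 to its nearest neighbour in desc2,
    keeping a match only if it clearly beats the runner-up (ratio test)."""
    # Pairwise Euclidean distances, shape (n1, n2).
    dists = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    matches = []
    for i, row in enumerate(dists):
        j1, j2 = np.argsort(row)[:2]         # best and second-best candidates
        if row[j1] < ratio * row[j2]:        # best is distinctly closer
            matches.append((i, j1))
    return matches

desc1 = np.array([[0.0, 0.0], [5.0, 5.0]])
desc2 = np.array([[0.1, 0.0], [5.0, 5.1], [10.0, 10.0]])
print(match_descriptors(desc1, desc2))  # -> [(0, 0), (1, 1)]
```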