3D Vision (1)


Notes on 3D Vision from Shenlong Wang’s lecture slides: image formation, camera basics, and correspondence.

3D Transform

Homogeneous Transformation Matrix

  • Rotation Matrix \(R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}\)
  • Translation Vector \(T = \begin{bmatrix} t_{1} \\ t_{2} \\ t_{3} \end{bmatrix}\)
  • Homogeneous Transformation Matrix \(H = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}\)
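
As a quick aside (my own sketch, not from the slides): assembling \(H\) from \(R\) and \(T\) with NumPy and applying it to a point in homogeneous coordinates.

```python
import numpy as np

def make_homogeneous(R, T):
    """Assemble the 4x4 homogeneous transform H = [[R, T], [0, 1]]."""
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = T
    return H

R = np.eye(3)                       # identity rotation for illustration
T = np.array([1.0, 2.0, 3.0])       # translation
H = make_homogeneous(R, T)

X = np.array([0.0, 0.0, 0.0, 1.0])  # 3D point in homogeneous coordinates
print(H @ X)                        # -> [1. 2. 3. 1.]
```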

Euler Angles

  • Roll, Pitch, Yaw: \(R = R_z(\psi)R_y(\theta)R_x(\phi)\), where \(R_x(\phi) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(\phi) & -\sin(\phi) \\ 0 & \sin(\phi) & \cos(\phi) \end{bmatrix}\), \(R_y(\theta) = \begin{bmatrix} \cos(\theta) & 0 & \sin(\theta) \\ 0 & 1 & 0 \\ -\sin(\theta) & 0 & \cos(\theta) \end{bmatrix}\), \(R_z(\psi) = \begin{bmatrix} \cos(\psi) & -\sin(\psi) & 0 \\ \sin(\psi) & \cos(\psi) & 0 \\ 0 & 0 & 1 \end{bmatrix}\)
  • Order Matters!!
  • Gimbal Lock: when the second rotation is ±90°, the first and third rotation axes become aligned, so one degree of freedom is lost.
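
A minimal NumPy sketch (mine, not from the slides) of the Z-Y-X composition above; swapping the multiplication order yields a different rotation, which is the “order matters” point:

```python
import numpy as np

def Rx(phi):
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def Ry(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def Rz(psi):
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

phi, theta, psi = 0.1, 0.2, 0.3          # roll, pitch, yaw (radians)
R = Rz(psi) @ Ry(theta) @ Rx(phi)        # Z-Y-X convention from above
R_swapped = Rx(phi) @ Ry(theta) @ Rz(psi)
print(np.allclose(R, R_swapped))         # -> False: order matters
```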

Axis Angle

(Figure: axis-angle representation)

  • Rodrigues’ Rotation Formula \(R = I + \sin(\Psi)[u]_{\times} + (1-\cos(\Psi))[u]_{\times}^2\) \([u]_{\times} = \begin{bmatrix} 0 & -u_z & u_y \\ u_z & 0 & -u_x \\ -u_y & u_x & 0 \end{bmatrix}\)
  • Suffers from discontinuities at the “edges” of the representation (e.g., at \(\Psi = \pi\), where axes \(u\) and \(-u\) give the same rotation), as illustrated below:

(Figure: discontinuities at the edges of the axis-angle representation)
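
A small sketch of Rodrigues’ formula (assuming \(u\) is a unit axis; normalization added for safety):

```python
import numpy as np

def rodrigues(u, psi):
    """R = I + sin(psi) [u]_x + (1 - cos(psi)) [u]_x^2 for unit axis u."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)             # ensure the axis is unit length
    K = np.array([[0, -u[2], u[1]],
                  [u[2], 0, -u[0]],
                  [-u[1], u[0], 0]])      # cross-product matrix [u]_x
    return np.eye(3) + np.sin(psi) * K + (1 - np.cos(psi)) * (K @ K)

# 90 degrees about z should reproduce R_z(pi/2)
print(np.round(rodrigues([0, 0, 1], np.pi / 2), 3))
```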

Quaternions

  • \[q = (w, x, y, z) = w + xi + yj + zk\]
  • Hamilton Product: for \(q_1 = (w_1, x_1, y_1, z_1)\) and \(q_2 = (w_2, x_2, y_2, z_2)\), \(q_1 \otimes q_2 = \begin{bmatrix} w_1w_2 - x_1x_2 - y_1y_2 - z_1z_2 \\ w_1x_2 + x_1w_2 + y_1z_2 - z_1y_2 \\ w_1y_2 - x_1z_2 + y_1w_2 + z_1x_2 \\ w_1z_2 + x_1y_2 - y_1x_2 + z_1w_2 \end{bmatrix}\)
  • Unit Quaternion as Rotation: \(q \cdot q^* = 1\), where \(q = (\cos(\Psi/2),\ \sin(\Psi/2) \cdot u)\) encodes a rotation by \(\Psi\) about unit axis \(u\); a point \(p\) is rotated via \(p' = q \otimes (0, p) \otimes q^*\).
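
A sketch of the Hamilton product and quaternion rotation in the \((w, x, y, z)\) convention above (my own illustration):

```python
import numpy as np

def hamilton(q1, q2):
    """Hamilton product of quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def rotate(q, p):
    """Rotate 3D point p by unit quaternion q via p' = q (0, p) q*."""
    q_conj = q * np.array([1, -1, -1, -1])
    return hamilton(hamilton(q, np.array([0.0, *p])), q_conj)[1:]

# 90-degree rotation about z: q = (cos(psi/2), sin(psi/2) * u)
psi, u = np.pi / 2, np.array([0.0, 0.0, 1.0])
q = np.concatenate([[np.cos(psi / 2)], np.sin(psi / 2) * u])
print(np.round(rotate(q, [1.0, 0.0, 0.0]), 3))   # -> [0. 1. 0.]
```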

Cheat Sheet

(Figure: rotation representations cheat sheet)

Camera Basics

Pinhole Camera Model

  • Trade-off between a dark image (small pinhole) and a blurry image (large pinhole) => Lens

(Figure: dark vs. blurry trade-off, resolved by a lens)

  • Depth of Focus and Depth of Field (source: Bilibili)

  • Camera Model

    Take the pinhole as the camera center and the virtual plane as the image plane, as illustrated below: (Figure: camera model)

    We have the following relation: \(x = P \cdot X\) where $ P $ is the camera projection matrix, $ X $ is the 3D point in the world coordinate system (in homogeneous coordinates), and $ x $ is the 2D point in the image coordinate system (also homogeneous).

    The projection matrix $ P $ can be decomposed as \(P = K[R|t]\), where $ K $ is the camera intrinsic matrix, $ R $ is the rotation matrix, and $ t $ is the translation vector. More information can be found in my post Camera Calibration.
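
    A minimal projection sketch; the intrinsics below (focal length and principal point) are made-up values for illustration:

    ```python
    import numpy as np

    f, cx, cy = 500.0, 320.0, 240.0      # assumed intrinsics
    K = np.array([[f, 0, cx],
                  [0, f, cy],
                  [0, 0, 1.0]])          # intrinsic matrix K
    R = np.eye(3)                        # camera aligned with the world
    t = np.zeros((3, 1))
    P = K @ np.hstack([R, t])            # P = K [R | t], a 3x4 matrix

    X = np.array([0.2, -0.1, 2.0, 1.0])  # homogeneous 3D point
    x = P @ X
    u, v = x[:2] / x[2]                  # perspective divide
    print(u, v)                          # pixel coordinates -> 370.0 215.0
    ```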

Correspondence

According to Takeo Kanade, the three most important problems in computer vision are “correspondence, correspondence, and correspondence.”

Optical Flow

  • Brightness Constancy \(I(x, y, t-1) = I(x + u(x,y), y + v(x,y), t)\) where $ u(x,y) $ and $ v(x,y) $ are the optical flow in the x and y directions at position $ (x, y) $. Through Taylor expansion, we have: \(I(x+u, y+v, t) \approx I(x, y, t-1) + \frac{\partial I}{\partial x}u + \frac{\partial I}{\partial y}v + \frac{\partial I}{\partial t}\) Substituting into the brightness constancy equation gives the shorthand: \(I_xu + I_yv + I_t = 0\)

  • Lucas-Kanade Method

    We want to solve for the flow in the equation above, but a single equation has two unknowns. The L-K method assumes the flow \((V_x, V_y)\) is constant in a local patch \(q_1, \dots, q_n\): \(\left\{ \begin{array}{l} I_x(q_1)V_x + I_y(q_1)V_y = -I_t(q_1) \\ I_x(q_2)V_x + I_y(q_2)V_y = -I_t(q_2) \\ \vdots \\ I_x(q_n)V_x + I_y(q_n)V_y = -I_t(q_n) \end{array} \right.\) This overdetermined system is solved by least squares, as in the sketch below.
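
    A least-squares sketch of solving this system (my own; the gradients here are synthetic, but in practice they come from image derivatives):

    ```python
    import numpy as np

    def lucas_kanade_patch(Ix, Iy, It):
        """Solve the stacked system A v = -It in the least-squares sense.

        Ix, Iy, It: flattened gradients at the n pixels of one patch.
        Returns the flow (Vx, Vy) shared by the patch.
        """
        A = np.stack([Ix, Iy], axis=1)   # n x 2 matrix of spatial gradients
        b = -It
        v, *_ = np.linalg.lstsq(A, b, rcond=None)
        return v

    # Toy patch whose gradients are consistent with a flow of (1, -2)
    rng = np.random.default_rng(0)
    Ix, Iy = rng.normal(size=25), rng.normal(size=25)
    It = -(Ix * 1.0 + Iy * -2.0)
    print(lucas_kanade_patch(Ix, Iy, It))   # -> approx [ 1. -2.]
    ```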

  • Horn-Schunck Method

    The H-S method poses this as an optimization problem, assuming the flow is smooth over the whole image (i.e., the regularization term): \(\min_{u, v} \int \int (I_xu + I_yv + I_t)^2 + \alpha(||\nabla u||^2 + ||\nabla v||^2) \, dxdy\) which can be solved via the Euler-Lagrange equations.
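
    The Euler-Lagrange equations lead to a fixed-point iteration with local flow averages \(\bar{u}, \bar{v}\); a compact sketch assuming precomputed gradients:

    ```python
    import numpy as np
    from scipy.ndimage import uniform_filter

    def horn_schunck(Ix, Iy, It, alpha=1.0, n_iters=100):
        """Iterate the Horn-Schunck update over full gradient images."""
        u = np.zeros_like(Ix)
        v = np.zeros_like(Ix)
        for _ in range(n_iters):
            u_bar = uniform_filter(u, size=3)   # neighborhood averages
            v_bar = uniform_filter(v, size=3)   # (the smoothness coupling)
            num = Ix * u_bar + Iy * v_bar + It
            den = alpha + Ix**2 + Iy**2
            u = u_bar - Ix * num / den
            v = v_bar - Iy * num / den
        return u, v
    ```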

  • Deep Learning Methods

    • FlowNet
    • PWC-Net
    • RAFT

Dense Point Tracking

  • TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement

Keypoint Tracking/Sparse Correspondence

  • Virtual Correspondence: Humans as a Cue for Extreme-View Geometry

Sparse Correspondence

  • SIFT (scale-invariant feature transform)
    • Step 1: Detect distinctive keypoints
    • Step 2: Compute oriented histogram gradient features (SIFT feature)
    • Step 3: Match keypoints by measuring distances between each pair of descriptors, as in the sketch below
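
A sketch of this pipeline with OpenCV (SIFT needs opencv-python >= 4.4; the image paths are placeholders):

```python
import cv2

img1 = cv2.imread("img1.png", cv2.IMREAD_GRAYSCALE)  # placeholder paths
img2 = cv2.imread("img2.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)  # Steps 1-2: keypoints + descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Step 3: measure descriptor distances; keep matches passing Lowe's ratio test
bf = cv2.BFMatcher()
matches = bf.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} good matches")
```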