2D Boxes to 3D Frustums: Simplifying Point Cloud Labeling for Object Detection

From 2D detections to 3D box annotations

Labeling point clouds for object detection is a daunting task. The data is massive, and because it is 3D, annotators have a harder time finding the objects of interest than they do in images. To make this easier, I developed a simple algorithm that leverages mature image object detectors to simplify 3D object labeling. The idea is that for a dataset of paired images and point clouds, like the one collected with my robot, we can run object detection on the images and then turn each 2D bounding box into a 3D frustum.

From 2D bounding boxes to 3D frustums

Given a 2D bounding box and the camera's intrinsic and extrinsic parameters, we can back-project the corners of the box at a near and a far depth to obtain 3D points. These points are the vertices of the 3D frustum in world coordinates. Each frustum can be thought of as the 3D field of view of a camera: the 2D bounding box acts as the camera's aperture, and the frustum contains every 3D point that the camera could observe. In other words, the algorithm places a virtual camera in 3D space for each object identified by the 2D object detector.
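For reference, the back-projection can be written with the standard pinhole camera model. The notation below is an assumption, since the post does not fix one: K is the intrinsic matrix with focal lengths f_x, f_y and principal point (c_x, c_y), with zero skew; (u, v) is a pixel; d is the chosen depth along the optical axis; and R, t are extrinsics that map world coordinates to camera coordinates.

```latex
\[
\mathbf{X}_{\text{cam}}
  = d\,K^{-1}\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  = \begin{bmatrix} d\,(u - c_x)/f_x \\ d\,(v - c_y)/f_y \\ d \end{bmatrix},
\qquad
\mathbf{X}_{\text{world}} = R^{\top}\left(\mathbf{X}_{\text{cam}} - \mathbf{t}\right)
\]
```

Evaluating this for the four box corners at a minimum and a maximum depth gives the eight frustum vertices. If your extrinsics are stored the other way round (camera to world), the last step becomes a direct rotation and translation instead of an inverse.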

Let’s break it down:

2D Object Detection: I used YOLOv8 (You Only Look Once), a real-time object detection model, to detect objects in the image. Each detected object is represented by a 2D bounding box, which is used to generate a 3D frustum in the next step.

Object detection using YOLO on images

3D Frustum Generation: A frustum is the portion of a solid (typically a pyramid or cone) that lies between two parallel planes cutting it. Here, we use the 2D bounding boxes to generate 3D frustums in the point cloud. Each frustum corresponds to an object detected in the 2D image and contains the relevant 3D points from the point cloud.
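A minimal sketch of this step, not the code from the repo. The function name `frustum_vertices`, the example intrinsics, and the depth range are all hypothetical, and the extrinsics are assumed to follow the world-to-camera convention X_cam = R X_world + t.

```python
import numpy as np

def frustum_vertices(bbox, K, d_min, d_max, R=None, t=None):
    """Return the 8 frustum corners (4 near, then 4 far) as an (8, 3) array.

    bbox: (x1, y1, x2, y2) in pixels.
    K: 3x3 camera intrinsic matrix.
    R, t: optional world-to-camera extrinsics (assumed X_cam = R @ X_world + t).
    """
    x1, y1, x2, y2 = bbox
    # Homogeneous pixel coordinates of the four box corners, shape (4, 3).
    corners = np.array([[x1, y1, 1.0], [x2, y1, 1.0],
                        [x2, y2, 1.0], [x1, y2, 1.0]])
    # Back-project each corner to a ray with unit depth (z = 1) in the camera frame.
    rays = corners @ np.linalg.inv(K).T
    # Scale the rays to the near and far depths to get the 8 vertices.
    verts = np.vstack([rays * d_min, rays * d_max])
    if R is not None and t is not None:
        # Invert the extrinsics: X_world = R.T @ (X_cam - t), written for row vectors.
        verts = (verts - np.asarray(t, dtype=float).reshape(1, 3)) @ np.asarray(R, dtype=float)
    return verts

# Example with made-up intrinsics and a 100x100 pixel box around the image center.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
verts = frustum_vertices((270, 190, 370, 290), K, d_min=2.0, d_max=10.0)
print(verts.shape)  # (8, 3)
```

The depth range has to be picked by hand or from the sensor's working range; a range that is too wide pulls background points into the frustum, and one that is too narrow cuts the object off.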

The frustums: the cyan ones are cars and the others are people.

Point Cloud Segmentation: The algorithm then filters the points inside each frustum to create smaller pieces of the point cloud. Each piece contains the 3D points corresponding to an object detected in the 2D image.

A person within a frustum

Visualization and Saving: The final step of the algorithm is to visualize these smaller pieces and optionally save them for further processing or labeling.
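The post does not say which file format the pieces are saved in. As one possibility, here is a sketch that writes a piece as an ASCII .pcd file using only NumPy. The helper name `save_ascii_pcd` and the file name are hypothetical, and only XYZ coordinates are stored (no intensity or color).

```python
import numpy as np

def save_ascii_pcd(path, points):
    """Write an (N, 3) array of XYZ points as an ASCII PCD v0.7 file."""
    points = np.asarray(points, dtype=np.float32).reshape(-1, 3)
    n = len(points)
    # Standard PCD header for an unorganized cloud of float32 x, y, z fields.
    header = "\n".join([
        "# .PCD v0.7 - Point Cloud Data file format",
        "VERSION 0.7",
        "FIELDS x y z",
        "SIZE 4 4 4",
        "TYPE F F F",
        "COUNT 1 1 1",
        f"WIDTH {n}",
        "HEIGHT 1",
        "VIEWPOINT 0 0 0 1 0 0 0",
        f"POINTS {n}",
        "DATA ascii",
    ])
    # comments="" stops savetxt from prefixing the header lines with '#'.
    np.savetxt(path, points, fmt="%.6f", header=header, comments="")

# Example: save a tiny two-point piece.
piece = np.array([[0.0, 0.0, 5.0], [0.4, 0.0, 5.0]])
save_ascii_pcd("piece_000.pcd", piece)
```

ASCII is easy to inspect by eye but grows quickly; for large clouds a binary PCD written by a point cloud library is likely the better choice.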

Cars within a frustum

Once saved, the filtered point clouds can be uploaded to an annotation tool for easy labeling.

Labeling the point clouds with CVAT

The algorithm relies on a few techniques that require a basic understanding of image processing and 3D geometry. Let’s go through them:

  1. 2D to 3D Back Projection: Back projection converts 2D image points into 3D points. A pixel on its own only defines a ray from the camera, so a depth value is needed to pick a point along that ray.
  2. 3D Frustum Generation: I used the four corners of the 2D bounding box together with the minimum and maximum depths to define the eight vertices of the 3D frustum. I then back-projected these vertices into 3D space using the camera parameters and the rotation and translation between the camera and world coordinates.
  3. Point Cloud Filtering: To segment the point cloud, we need to identify which points lie within each frustum. For this, I computed the Delaunay triangulation of the frustum vertices and then checked whether each point in the point cloud lies within the triangulation. The points that do are part of the frustum and are used to create the smaller point cloud pieces.
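The filtering step in item 3 can be sketched with SciPy's `scipy.spatial.Delaunay`, whose `find_simplex` method returns -1 for points outside the triangulation. This is an illustration under assumptions, not the repo's code: the function name `points_in_frustum` is hypothetical and the frustum vertices are hard-coded example values.

```python
import numpy as np
from scipy.spatial import Delaunay

def points_in_frustum(points, frustum_vertices):
    """Return a boolean mask of the points that lie inside the frustum.

    points: (N, 3) array of point cloud coordinates.
    frustum_vertices: (8, 3) array of frustum corners, in the same frame as points.
    """
    # Triangulate the 8 corners into tetrahedra that fill the frustum volume.
    tri = Delaunay(frustum_vertices)
    # find_simplex gives the index of the containing tetrahedron, or -1 if outside.
    return tri.find_simplex(points) >= 0

# Example frustum: near face 0.4 x 0.4 at z = 2, far face 2 x 2 at z = 10.
frustum = np.array([
    [-0.2, -0.2, 2.0], [0.2, -0.2, 2.0], [0.2, 0.2, 2.0], [-0.2, 0.2, 2.0],
    [-1.0, -1.0, 10.0], [1.0, -1.0, 10.0], [1.0, 1.0, 10.0], [-1.0, 1.0, 10.0],
])
cloud = np.array([
    [0.0, 0.0, 5.0],   # on the optical axis, between near and far planes
    [0.0, 0.0, 1.0],   # in front of the near plane
    [2.0, 0.0, 5.0],   # too far to the side
    [0.4, 0.0, 5.0],   # off-axis but still within the frustum at this depth
])
mask = points_in_frustum(cloud, frustum)
piece = cloud[mask]  # the smaller point cloud piece for this detection
```

Because a frustum is convex, testing against its Delaunay triangulation is the same as testing against its convex hull. The point cloud and the frustum vertices must be expressed in the same coordinate frame before the check.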

For access to the full code and sample point clouds, check my repo:


For a full demo check the video at the top.




