Recovering Non-Rigid 3-D Structures from Stereo and Structured Light


Faculty: A. Goshtasby
Student: Lyubomir Zagorchev

Sponsor: Ohio Board of Regents



Binocular stereo can produce a dense depth map of a scene, but the computed depth values may not be very accurate. Structured light, on the other hand, produces more accurate depth values but is very restrictive; for example, it cannot be used to continuously capture the depth map of a dynamic scene. The objective of this project is to combine binocular stereo and structured light to recover the geometry of a non-rigid scene continuously in time, such as the geometry of a person's face during speech.

First, structured light is used to recover the geometry of the scene while the scene is kept static. Then binocular stereo is used to track the object in 3-D while the scene is in motion. Knowing the geometry of the scene at time t, the correspondences between points in the images at time t+dt can be predicted, so the search for accurate correspondences needs to cover only small neighborhoods. Approximate correspondences at time t+dt are, therefore, obtained from the depth map at time t and are then refined using image colors through a local matching process. From the refined correspondences, a revised depth map is produced for the scene at time t+dt. This process is repeated until all images in the stereo video pair are processed. The steps of the process are demonstrated through an example below.
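The predict-then-refine idea in the overview can be illustrated with a minimal sketch. It assumes rectified images (so correspondences lie on the same scanline), an integer disparity predicted from the depth map at time t, and a sum-of-squared-differences color cost; the function name `refine_disparity` and its parameters are hypothetical, and the project's actual matching cost is not stated in the text.

```python
import numpy as np

def refine_disparity(left, right, row, col, predicted_disp,
                     search_radius=2, half_win=3):
    """Refine a predicted disparity for left-image pixel (row, col).

    Only disparities within +/- search_radius of the prediction are tested.
    The cost is the sum of squared color differences over a square window
    (an assumed cost; the source only says "image colors").
    """
    h, w = left.shape[:2]
    r0, r1 = row - half_win, row + half_win + 1
    c0, c1 = col - half_win, col + half_win + 1
    if r0 < 0 or r1 > h or c0 < 0 or c1 > w:
        return predicted_disp  # window leaves the image; keep the prediction
    patch = left[r0:r1, c0:c1].astype(np.float64)

    best_disp, best_cost = predicted_disp, np.inf
    for d in range(predicted_disp - search_radius,
                   predicted_disp + search_radius + 1):
        rc0, rc1 = c0 - d, c1 - d  # same window shifted by d in the right image
        if rc0 < 0 or rc1 > w:
            continue
        candidate = right[r0:r1, rc0:rc1].astype(np.float64)
        cost = np.sum((patch - candidate) ** 2)
        if cost < best_cost:
            best_cost, best_disp = cost, d
    return best_disp
```

Because the search covers only `2 * search_radius + 1` candidates per pixel, the cost of refinement stays small as long as the prediction from the previous depth map is close.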

Step 1: Determining the geometry of a scene from structured light.

Two video frames showing a stereo pair obtained by a stereo camera setup while sweeping the laser over the object.

Detection of the laser stripe in the two images. The horizontal displacement between laser points on the same scanline in the images determines the 3-D coordinates of the laser point.
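As a rough illustration of how a horizontal displacement (disparity) yields 3-D coordinates, the sketch below uses the standard rectified pinhole stereo model. The focal length, baseline, and principal point are assumed calibration values that the text does not give, and `triangulate_laser_point` is a hypothetical name.

```python
import numpy as np

def triangulate_laser_point(x_left, x_right, y, focal_px, baseline, cx, cy):
    """Return (X, Y, Z) for a laser point seen at column x_left in the left
    image and x_right in the right image, both on scanline y.

    Assumes rectified cameras with the same focal length (in pixels) and a
    horizontal baseline; these are assumptions, not values from the project.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    Z = focal_px * baseline / disparity   # depth is inversely proportional to disparity
    X = (x_left - cx) * Z / focal_px      # back-project the left-image column
    Y = (y - cy) * Z / focal_px           # back-project the scanline
    return np.array([X, Y, Z])
```

With these assumptions, a larger displacement between the two detected laser points means the point is closer to the cameras.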

Initial 3-D model of the object in different views. Since there are missing image points in the model, we first approximate the missing points and obtain a new 3-D model as shown below. This process will also reduce noise in data.

There are still some holes in the data because mapping points in an image to a 3-D model produces some gaps. We fit a NURBS surface to the 3-D data estimated from a stereo image pair to represent a continuous model.

3-D model constructed with a NURBS surface. This further reduces noise and produces a smooth model.

The NURBS surface with mapped texture.

The following images show the same process but using a human model.

Initial 3-D model.

3-D model approximated by the initial filling-in process.

NURBS surface representation of the model.

The NURBS surface with mapped texture.

Step 2: Rigid tracking from stereo correspondence.

With a complete 3D model of the face from Step 1, we proceed to track the rigid motion of the head in the stereo video sequence. For every pair of frames, we track a number of MPEG-4 facial features. The tracking process is initiated by selecting the features interactively, and the features are then tracked by template matching in a small neighborhood.

During tracking, a variable template size and linear prediction are used to make the matches as accurate as possible.
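The tracking step can be sketched as follows, under two assumptions that the text does not confirm: linear prediction is taken to mean constant-velocity extrapolation from the two previous positions, and the match score is a sum of squared differences. The template size is simply whatever template is passed in; how the project varies it is not described. Both function names are hypothetical.

```python
import numpy as np

def predict_position(prev_pos, curr_pos):
    """Constant-velocity (linear) prediction of the next (row, col) position."""
    prev_pos = np.asarray(prev_pos)
    curr_pos = np.asarray(curr_pos)
    return 2 * curr_pos - prev_pos

def match_template(image, template, center, search_radius):
    """Return the (row, col) within search_radius of `center` at which the
    template, centered there, has the smallest sum of squared differences."""
    th, tw = template.shape[:2]
    hh, hw = th // 2, tw // 2
    H, W = image.shape[:2]
    tmpl = template.astype(np.float64)
    r_c, c_c = int(center[0]), int(center[1])

    best, best_cost = None, np.inf
    for r in range(r_c - search_radius, r_c + search_radius + 1):
        for c in range(c_c - search_radius, c_c + search_radius + 1):
            r0, c0 = r - hh, c - hw
            if r0 < 0 or c0 < 0 or r0 + th > H or c0 + tw > W:
                continue  # skip windows that leave the image
            window = image[r0:r0 + th, c0:c0 + tw].astype(np.float64)
            cost = np.sum((window - tmpl) ** 2)
            if cost < best_cost:
                best_cost, best = cost, (r, c)
    return best
```

A typical use would call `predict_position` with a feature's positions in the two previous frames and pass the result as `center` to `match_template`, so the search neighborhood can stay small.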

From the tracked points, we pick four among them that move the least with respect to other points because they represent the rigid (global) movement of the head. Having the 3D coordinates of these four points in consecutive frames, we determine the new rotation and the translation of the head by the least squares approach. The rotation and translation are then applied to the 3D model and the process is repeated for all pairs of frames in a video sequence.
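One common way to solve this least-squares problem is the SVD-based (Kabsch) method, sketched below. The text says only "the least squares approach", so the choice of this particular solver is an assumption, and `estimate_rigid_motion` is a hypothetical name.

```python
import numpy as np

def estimate_rigid_motion(points_prev, points_curr):
    """Least-squares rotation R and translation t such that
    points_curr[i] is approximately R @ points_prev[i] + t.

    Both inputs are (n, 3) arrays of corresponding 3D points, n >= 3 and
    not all collinear (the text uses n = 4 tracked feature points).
    """
    P = np.asarray(points_prev, dtype=np.float64)
    Q = np.asarray(points_curr, dtype=np.float64)
    cp, cq = P.mean(axis=0), Q.mean(axis=0)   # centroids
    H = (P - cp).T @ (Q - cq)                 # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t
```

Applying the result to every vertex `v` of the model as `R @ v + t` moves the model from its pose in one frame to its pose in the next.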

At this point we have determined the rotation and translation of the head for all frames, and by applying them to the previously created model we track the head in 3D.

Between some frames, due to inaccurate template matches, we may not obtain a smooth transition in 3D. Therefore, we introduce a smoothing step. For every feature point we build a 3D Rational Gaussian (RaG) curve that represents the trajectory of the point in 3-D along the x, y, and z axes.

Smoothing the initial RaG curves, we obtain the new 3D trajectories for all feature points. By running the initial algorithm for determining rotation and translation over the smoothed trajectories, we obtain the smooth rigid motion of the head.
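A minimal sketch of RaG smoothing for one feature trajectory is given below. It assumes the tracked 3D positions serve as control points, the nodes are the frame indices, and all weights are 1 unless supplied; the curve is P(u) = sum_i V_i g_i(u), with g_i(u) = W_i G_i(u) / sum_j W_j G_j(u) and G_i(u) = exp(-(u - u_i)^2 / (2 sigma^2)). The function name `rag_smooth` is hypothetical, and the project's exact parameterization is not given in the text.

```python
import numpy as np

def rag_smooth(points, sigma, weights=None):
    """Evaluate a Rational Gaussian (RaG) curve at its own nodes.

    points : (n, 3) array, one tracked 3D position per frame (control points).
    sigma  : standard deviation of the Gaussians; larger values smooth more.
    Returns an (n, 3) array of smoothed positions, one per frame.
    """
    pts = np.asarray(points, dtype=np.float64)
    n = len(pts)
    u = np.arange(n, dtype=np.float64)          # assumed nodes: frame indices
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=np.float64)
    diff = u[:, None] - u[None, :]              # u_k - u_i for every pair
    G = w[None, :] * np.exp(-diff ** 2 / (2.0 * sigma ** 2))
    g = G / G.sum(axis=1, keepdims=True)        # rational basis; each row sums to 1
    return g @ pts                              # weighted average of control points
```

Because each basis row sums to 1, every smoothed position is a weighted average of nearby tracked positions, so isolated matching errors are damped while the overall path is preserved.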

Left Camera Stereo Video Sequence

Initial Motion 

RaG Smoothed Motion (sigma=10)

Step 3: Non-rigid tracking from stereo correspondence (in progress).


For more information contact A. Goshtasby (

Last modified: 7/28/03