One of my main research efforts is the recognition of objects by comparing target models with extracted data, using statistically sound methods that directly account for and manipulate uncertainty in the various stages of the system. I have developed and tested recognition methods that explicitly model the various sources of uncertainty. These methods use the Expectation-Maximization (EM) algorithm to alternate between solving for the correspondence between model and data features and solving for the pose of the target in the data coordinate frame. A multiresolution version of the method was designed and implemented to demonstrate the potential gains in efficiency and robustness from using multiresolution data and models.
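As a rough illustration of this alternation (a minimal sketch, not the original system: it assumes 2-D point features, isotropic Gaussian noise, and a rigid pose, and omits outlier handling and the multiresolution pyramid; all names and parameters are illustrative), the E-step soft-assigns data features to model features and the M-step re-estimates the pose by weighted Procrustes:

```python
# Minimal sketch of EM-based pose estimation (illustrative, not the
# original system): the E-step soft-assigns data points to model points,
# the M-step solves a weighted Procrustes problem for a 2-D rigid pose.
import numpy as np

def em_align(model, data, sigma=1.0, iters=50):
    """model: (N, 2) model feature points; data: (K, 2) extracted points.
    Returns rotation R (2, 2) and translation t (2,)."""
    R, t = np.eye(2), np.zeros(2)
    for _ in range(iters):
        proj = model @ R.T + t
        # E-step: w[k, j] = P(data point k corresponds to model point j)
        d2 = ((data[:, None, :] - proj[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2 * sigma ** 2))
        w /= w.sum(axis=1, keepdims=True) + 1e-12
        # M-step: weighted Procrustes for the pose given soft correspondences
        W = w.sum()
        mu_m = (w.sum(axis=0) @ model) / W          # weighted model centroid
        mu_d = (w.sum(axis=1) @ data) / W           # weighted data centroid
        C = (data - mu_d).T @ w @ (model - mu_m)    # weighted cross-covariance
        U, _, Vt = np.linalg.svd(C)
        S = np.diag([1.0, np.sign(np.linalg.det(U @ Vt))])
        R = U @ S @ Vt
        t = mu_d - R @ mu_m
    return R, t
```

Running such a loop coarse-to-fine over successively finer feature sets would be one way to exploit multiresolution data and models, in the spirit of the variant mentioned above.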
A drawback to this approach is its reliance on explicit uncertainty models. As an alternative, with co-workers, I am investigating a new multiresolution approach for finding the pose of an object model in an image, based on a new formulation of the mutual information between model and image. As applied here, the technique is intensity-based rather than feature-based. It works well in domains where edge- or gradient-based methods have difficulty, yet it is more robust than traditional correlation. The general problem of alignment involves comparing a predicted image of an object with an actual image. Given an object model and a pose, a model of the imaging process can be used to predict the image that will result, although in general this prediction is difficult. If we had a good imaging model, deciding whether an image contains a particular model at a given pose would be straightforward: compute the predicted image and compare it to the actual image directly. Given a perfect imaging model, the two images would be identical, or nearly so. Of course, finding the correct alignment would still remain a challenge.
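If such a prediction were available, the straightforward test described above might look like the following sketch (here `render` stands in for the imaging model, which, as just noted, is rarely known in practice; the names and threshold are illustrative):

```python
# Hypothetical "predict and compare" test: only workable given a good
# imaging model, which is exactly what is usually missing in practice.
import numpy as np

def contains_at_pose(image, model, pose, render, tol=1e-2):
    predicted = render(model, pose)               # predicted brightness image
    residual = np.mean((image - predicted) ** 2)  # near zero for a perfect model
    return residual < tol
```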
In general the relationship between an object model (no matter how accurate) and the object's image is a complex one. However, in the part of the scene containing an image of the object, we can formulate a general imaging equation:

$$v(T(x)) = F(u(x), q),$$

where $x$ are coordinates of a surface patch of the object model, $u(x)$ describes the properties of the surface of the model at position $x$, and $q$ are parameters of the imaging process, such as the illumination conditions. $F$ is the image formation function that generates the brightness of the surface patch in the image. Thus, $v(T(x))$ is the brightness image of the object placed in the scene by coordinate transformation $T$.

If $F$ and $q$ were known in detail, it would be feasible to make an accurate prediction of scene intensities, since the physics of image formation are well understood. But because of the complexity of visible-light imaging, it may be difficult to determine the particular $F$ and $q$ for a given scene. Similar complexities arise in the cases of laser radar and SAR imagery.

One reason that it is, in principle, possible to find $F$ is that the model does supply much information about the scene: clearly, if there were no mutual information between $u$ and $v$, there could be no such $F$. We finesse the problem of finding and computing $F$ by dealing with this mutual information directly. Such a technique finds the alignment of the model in the scene by maximizing the information that the model gives us about the scene (an approach motivated by the work of Viola and Wells).
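Concretely, in the standard information-theoretic form used by Viola and Wells (written here in the notation of the imaging equation above), the quantity maximized over the transformation $T$ is

$$I\bigl(u(x);\, v(T(x))\bigr) = h\bigl(u(x)\bigr) + h\bigl(v(T(x))\bigr) - h\bigl(u(x), v(T(x))\bigr),$$

where $h(\cdot)$ denotes entropy. The first term does not depend on $T$, so only the latter two terms need to be estimated and differentiated during the search.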
We have adopted the following approach to aligning the object model to the image: (1) the mutual information of the model and image is defined and expressed in terms of the entropies of several random variables; (2) the entropies and their derivatives are approximated by a method that involves random sampling from the model and image data; and (3) a local maximum of the mutual information is sought using a stochastic analog of gradient descent, in which steps are repeatedly taken that are proportional to the approximation of the derivative of the mutual information with respect to the transformation.
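The following sketch illustrates the flavor of these three steps without reproducing the method itself: entropies are estimated with Parzen windows over small random samples, and the pose is updated by stochastic uphill steps. Note one deliberate substitution: the actual method differentiates the entropy estimates analytically, while for brevity this sketch uses finite differences; the transform parameterization, sampling scheme, and step sizes are all illustrative assumptions.

```python
# Illustrative sketch of mutual-information alignment: Parzen-window entropy
# estimates from random samples, plus stochastic uphill steps on the pose.
import numpy as np

rng = np.random.default_rng(0)

def transform(x, p):
    """2-D rigid transform of points x (n, 2); p = (angle, tx, ty)."""
    c, s = np.cos(p[0]), np.sin(p[0])
    return x @ np.array([[c, -s], [s, c]]).T + p[1:]

def lookup(image, xy):
    """Bilinear image sampling at locations xy (n, 2), clipped to bounds."""
    h, w = image.shape
    x = np.clip(xy[:, 0], 0, w - 1.001)
    y = np.clip(xy[:, 1], 0, h - 1.001)
    j0, i0 = np.floor(x).astype(int), np.floor(y).astype(int)
    fx, fy = x - j0, y - i0
    return ((1 - fy) * (1 - fx) * image[i0, j0] + (1 - fy) * fx * image[i0, j0 + 1]
            + fy * (1 - fx) * image[i0 + 1, j0] + fy * fx * image[i0 + 1, j0 + 1])

def parzen_entropy(a, b, sigma):
    """Step (2): entropy estimate from two independent samples; densities are
    evaluated at the points of `a` using Gaussian kernels centered on `b`."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    p = np.exp(-d2 / (2 * sigma ** 2)).mean(axis=1)
    p /= (2 * np.pi * sigma ** 2) ** (a.shape[1] / 2)   # kernel normalization
    return -np.log(p + 1e-12).mean()

def mi_estimate(u, v, sigma=0.1):
    """Step (1): I(u; v) = h(u) + h(v) - h(u, v), each term from samples."""
    half = len(u) // 2
    ua, ub = u[:half, None], u[half:, None]
    va, vb = v[:half, None], v[half:, None]
    return (parzen_entropy(ua, ub, sigma) + parzen_entropy(va, vb, sigma)
            - parzen_entropy(np.hstack([ua, va]), np.hstack([ub, vb]), sigma))

def align(model_pts, model_vals, image, pose, steps=200, lr=0.02, n=60, eps=1e-3):
    """Step (3): repeated stochastic steps proportional to an approximate
    derivative of the mutual information with respect to the transformation."""
    pose = np.asarray(pose, dtype=float)
    for _ in range(steps):
        idx = rng.integers(len(model_pts), size=n)      # fresh random sample
        u = model_vals[idx]
        def mi_at(p):
            return mi_estimate(u, lookup(image, transform(model_pts[idx], p)))
        grad = np.array([(mi_at(pose + eps * e) - mi_at(pose - eps * e)) / (2 * eps)
                         for e in np.eye(len(pose))])
        pose += lr * grad                               # ascend mutual information
    return pose
```

In the actual method, analytic derivatives of the Parzen estimates (which bring in image gradients through the chain rule) take the place of these finite differences, making each stochastic step inexpensive.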