Fig. 1: Inverse rendering for monocular multi-object tracking and inverse rendering optimization.
From: Towards generalizable and interpretable three-dimensional tracking with inverse neural rendering

a, We initialize the embedding codes of an object generator, \(z_\mathrm{S}\) for shape and \(z_\mathrm{T}\) for texture, for each detected object k. The generative object prior (for example, GET3D (ref. 2), pretrained on synthetic data) is frozen. Only the embedding codes for an object's geometry \(z_\mathrm{S}\) and texture \(z_\mathrm{T}\), together with the location \(t_k\), rotation \(\psi_k\) and size \(s_k\) of each object instance k, are optimized with inverse rendering to best fit the image observation. The inverse rendering loss (\(\mathcal{L}_\mathrm{IR}\)) quantifies the discrepancy between the observed and rendered images to guide the optimization. Solid connectors denote sequential processing steps, and dashed connectors indicate iterative feedback loops for optimization. The process terminates after a fixed maximum number of steps or when \(\mathcal{L}_\mathrm{IR}\) converges. The inverse-rendered texture and shape embeddings and refined object locations are provided to the matching stage, which matches the predicted states of previously tracked objects, together with their historical exponentially moving averaged (EMA) texture and shape embeddings \(z_\mathrm{T,EMA}\) and \(z_\mathrm{S,EMA}\), against the new observations. Matched and new tracklets are updated, and unmatched tracklets are ultimately discarded before predicting states in the next step (data from ref. 33). b, An example of this test-time optimization method with zoomed-in views of rendered objects. From left to right: the observed image, the rendering predicted from the initial latent embeddings, the predicted rendered objects after the texture code is optimized, the predicted rendered objects after the translation, scale and rotation are optimized, and the predicted rendered objects after the shape latent code is optimized. The ground-truth images are faded to show our rendered objects clearly. Our proposed method effectively refines the predicted texture, pose and shape over several optimization steps, even when initialized with poses or appearances far from the target, with all parameters found at test time through inverse rendering. Init., initial; IoU, intersection over union.
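The test-time optimization loop described in panel a, with the staged texture-then-pose-then-shape schedule visible in panel b, could be sketched as follows. This is a minimal illustration only: the callables `generator` (a frozen GET3D-style prior mapping latent codes to an object mesh) and `render` (a differentiable renderer), the masked L1 photometric form of \(\mathcal{L}_\mathrm{IR}\), and all hyperparameters are assumptions, not the paper's exact implementation.

```python
import torch

def inverse_render_object(image, mask, generator, render,
                          z_s, z_t, t, psi, s,
                          max_steps=100, tol=1e-4):
    """Test-time inverse rendering for one detected object (sketch).

    `generator` (frozen object prior) and `render` (differentiable
    renderer) are hypothetical stand-ins; the real loss L_IR and
    optimization schedule may differ in detail.
    """
    for p in (z_s, z_t, t, psi, s):
        p.requires_grad_(True)
    # Stage-wise schedule, as in panel b: texture -> pose -> shape.
    stages = [[z_t], [t, psi, s], [z_s]]
    for stage_params in stages:
        opt = torch.optim.Adam(stage_params, lr=1e-2)
        prev_loss = float("inf")
        for _ in range(max_steps):
            opt.zero_grad()
            mesh = generator(z_s, z_t)          # frozen generative prior
            rendered = render(mesh, t, psi, s)  # differentiable rendering
            # L_IR: masked photometric discrepancy (one plausible choice).
            loss = ((rendered - image) * mask).abs().mean()
            loss.backward()
            opt.step()
            # Terminate on convergence or after max_steps iterations.
            if abs(prev_loss - loss.item()) < tol:
                break
            prev_loss = loss.item()
    return tuple(p.detach() for p in (z_s, z_t, t, psi, s))
```

Optimizing only one parameter group per stage mirrors the left-to-right progression shown in panel b, where texture, then pose, then shape are refined in turn.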
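The EMA bookkeeping used in the matching stage could look like the sketch below; the decay value and the cosine-distance matching cost are assumed illustrations, not the paper's stated choices.

```python
import torch
import torch.nn.functional as F

def ema_update(z_ema, z_new, decay=0.9):
    """Exponential moving average of a tracklet's latent embedding
    (z_T,EMA or z_S,EMA). `decay` is an assumed value."""
    return decay * z_ema + (1.0 - decay) * z_new

def embedding_cost(z_ema, z_detection):
    """Cosine-distance cost between a tracklet's historical EMA
    embedding and a newly inverse-rendered embedding, one plausible
    term for matching tracked objects to new observations."""
    return 1.0 - F.cosine_similarity(z_ema, z_detection, dim=-1)
```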