Dexterity Supercharges Foresight by 17x with NVIDIA
Foresight is Dexterity's world model, the Physical AI system that gives robots real-time understanding of the physical world. Before every action, Foresight must answer: what does the world look like right now? The answer must be precise enough to plan stable placements among 100+ objects, and fast enough that the robot never waits.
01 | Problem Statement
At every timestep, Foresight integrates multi-modal sensor data (3D cameras, depth sensors, robot body state, force feedback, and the outcomes of physical interactions) into a unified world state. Each update maintains explicit uncertainty bounds, physics constraints, and rollback capability.
Of these modalities, visual perception from multi-camera arrays is the most computationally demanding. The visual perception pipeline must complete before any downstream reasoning, from whole-scene stability analysis to geometric and motion planning, can begin.
Before NVIDIA optimization, visual perception alone consumed 1508ms per cycle, longer than the entire decision budget. Real-time operation was possible only by running perception and planning in parallel with predictive pipelining, tolerating occasionally stale world state. The pipeline needed to be fundamentally faster. This post focuses on that visual perception pipeline: the biggest bottleneck and the biggest win.
02 | Solution and Impact Summary
17x Faster: From 1508ms to 90ms
TensorRT-optimized inference. Massively parallel geometric reasoning. Purpose-built physics CUDA kernels. Three stages of NVIDIA acceleration compound to deliver a 17x speedup, bringing visual perception from 1508ms down to 90ms per cycle.
After optimization, the full sense-think-act loop fits comfortably within the cadence needed for real-time dual-arm production operation. The result: higher throughput and fewer placement errors, enabling denser, more reliable trailer loads in production. What follows is the deep dive into each pipeline stage and how these gains were achieved.
03 | Deep Dive
Three Pipeline Stages, Three NVIDIA Solutions
Foresight's visual perception pipeline has three stages. Each one was a bottleneck. Each one was solved with a different layer of NVIDIA technology. Benchmarks were run on production data on an NVIDIA L4 Tensor Core GPU.
03.1 | Foresight Perception
Object-Level Scene Decomposition: 2x Faster with TensorRT
Decomposing raw sensor data into individual objects, each with its own spatial embedding, depth features, and downstream attributes. TensorRT optimization halves median inference latency and dramatically tightens the tail.
Before Foresight can reason about the world, it must decompose raw multi-camera imagery into object-level representations. This isn't simple segmentation. For each object in the scene, Foresight Perception localizes not by image patches but by object instance, isolating each box's region across RGB and depth channels. From that localized attention area, the system extracts a rich set of per-object embeddings and attributes: spatial priors, surface features, and operational properties like pickability scores that predict whether removing a box would destabilize its neighbors. In a full trailer, this means reasoning about 100+ objects per frame.
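As a rough sketch of what such an object-level representation might look like, here is a toy Python version. Every field and function name here (spatial_embedding, pickability, and so on) is hypothetical and illustrative, not Dexterity's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SceneObject:
    """Illustrative per-object output of a perception stage.

    All field names are hypothetical; the real Foresight schema is not public.
    """
    instance_id: int
    spatial_embedding: List[float]   # learned embedding of the object's localized region
    depth_features: List[float]      # per-object depth statistics
    pickability: float               # 0..1 score: is it safe to remove this box?
    neighbors: List[int] = field(default_factory=list)  # instance ids in contact

def pickable_order(scene: List[SceneObject], threshold: float = 0.5) -> List[int]:
    """Toy downstream policy: consider objects above a pickability
    threshold, most pickable first."""
    candidates = [o for o in scene if o.pickability >= threshold]
    return [o.instance_id for o in sorted(candidates, key=lambda o: -o.pickability)]

scene = [
    SceneObject(0, [0.1], [1.2], pickability=0.9),
    SceneObject(1, [0.3], [0.8], pickability=0.2, neighbors=[0]),
    SceneObject(2, [0.7], [1.0], pickability=0.6),
]
print(pickable_order(scene))  # → [0, 2]
```

In a full trailer this list would hold 100+ such records per frame, which is why per-object inference latency dominates the stage.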
The Foresight 0.1 pipeline was already fast, but not fast enough. TensorRT optimization, combined with architectural improvements to eliminate unnecessary overhead, delivers a 2x median speedup and, critically, compresses the P95 tail from 207ms down to 114ms. That tail matters: in a real-time pipeline, one slow frame can stall everything downstream.
Foresight 0.1, heavy tail: 5% of frames exceed 207ms, creating downstream stalls. Median 57ms, but the worst case spikes to 760ms.
Foresight 1.0, tight distribution: P95 compressed to 114ms, median 28ms, via TensorRT optimization plus architectural improvements.
Full Benchmark Results
On NVIDIA GPU • 9,977 production images
| Configuration | Median | P95 | P99 | Worst Case | Speedup |
|---|---|---|---|---|---|
| Foresight 0.1 (train) | 56.7ms | 207.7ms | 308.3ms | 759.8ms | baseline |
| Foresight 1.0 TensorRT (train) | 27.7ms | 114.0ms | 126.6ms | 146.8ms | 2.0x |
| Foresight 0.1 (val) | 62.7ms | 236.8ms | 303.0ms | 504.4ms | baseline |
| Foresight 1.0 TensorRT (val) | 27.6ms | 112.7ms | 124.4ms | 245.1ms | 2.3x |
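Summary statistics like those in the table can be derived from raw per-frame timings. A minimal sketch using only the Python standard library, with synthetic heavy-tailed latencies standing in for the production data:

```python
import random
import statistics

def latency_summary(samples_ms):
    """Median / P95 / P99 / worst case from raw per-frame latencies.

    Uses the 'inclusive' quantile method; benchmark harnesses may use a
    slightly different percentile definition.
    """
    q = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "median": statistics.median(samples_ms),
        "p95": q[94],           # 95th percentile cut point
        "p99": q[98],           # 99th percentile cut point
        "worst": max(samples_ms),
    }

# Synthetic heavy-tailed latencies (illustrative, not Dexterity's data):
# log-normal distributions are a common stand-in for latency tails.
random.seed(0)
samples = [random.lognormvariate(3.4, 0.5) for _ in range(10_000)]
s = latency_summary(samples)
print({k: round(v, 1) for k, v in s.items()})
```

The gap between median and P95 is what the "tail" discussion above is about: a pipeline's effective frame rate is set by its slow frames, not its typical ones.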
03.2 | Foresight Reconstruction
Geometric Scene Understanding: 4.6x Faster, 32x More Data
Forming geometric hypotheses about every object in the world. Massive GPU parallelism doesn't just make it faster; it makes it possible to use the full, uncompromised sensor resolution.
Once objects are identified, Foresight must build a geometric understanding of each one, inferring shape, pose, and spatial extent from raw point cloud data. This is a dense optimization problem: the system generates and refines geometric hypotheses for every object in the scene, converging on a representation that captures how each object exists in 3D space.
The Foresight 0.1 CPU-based implementation was fast enough for subsampled point clouds, but subsampling means throwing away data, which means less precise reconstructions. On the GPU, Foresight explores a massive space of geometric hypotheses in parallel, evaluating candidates simultaneously across thousands of threads. It processes the full, unsubsampled point cloud from every camera, 32x more data points per object, and still runs 4.6x faster than the CPU version did on the reduced data.
The CUDA implementation processes 32x more point cloud data, the full, uncompromised bandwidth of every camera, while completing in a 49ms median versus 225ms for the CPU baseline. This isn't just speed; it's higher-fidelity geometric understanding that was previously impossible in real time.
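The hypothesize-and-score pattern described above can be sketched sequentially in Python. In the real system, each (hypothesis, point) evaluation would map to a GPU thread, and the shape models would be far richer than the toy sphere used here; everything below is illustrative:

```python
import math

def score_hypothesis(center, radius, points):
    """Mean absolute distance between observed points and a sphere
    hypothesis. Lower is better. A real system would score richer shape
    models (boxes, meshes) and evaluate all hypotheses in parallel."""
    err = 0.0
    for p in points:
        err += abs(math.dist(p, center) - radius)
    return err / len(points)

def best_fit(points, hypotheses):
    """Pick the geometric hypothesis that best explains the points."""
    return min(hypotheses, key=lambda h: score_hypothesis(h[0], h[1], points))

# Points sampled on a unit sphere centered at the origin (illustrative).
points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, -1.0), (0.6, 0.8, 0.0)]
hypotheses = [((0, 0, 0), 0.5), ((0, 0, 0), 1.0), ((1, 1, 1), 1.0)]
center, radius = best_fit(points, hypotheses)
print(center, radius)  # the unit sphere at the origin wins
```

Because each score is independent, widening the hypothesis set or using the full-resolution point cloud adds parallel work rather than sequential latency, which is why the GPU version can ingest 32x more data and still finish sooner.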
Full Benchmark Results
On NVIDIA GPU • 265 five-camera production scenes
| Configuration | Median | P95 | P99 | Worst Case | Data Volume |
|---|---|---|---|---|---|
| Foresight 0.1 (subsampled point clouds) | 225.2ms | 237.4ms | 410.0ms | 491.2ms | 1x |
| Foresight 1.0 (full resolution) | 49.1ms | 58.7ms | 61.2ms | 73.3ms | 32x |
03.3 | Foresight Synthesis
Physical Consistency at GPU Speed
Reconciling multi-sensor observations into a single, physically consistent world state, at the speed that real-time robotics demands. Purpose-built CUDA kernels enforce the laws of physics across every object simultaneously.
Multiple cameras observe the same scene from different viewpoints. Their observations disagree. Objects occlude each other, deform under contact, compress under load, and sag under gravity. Foresight Synthesis must reconcile all of this into one coherent world state: a representation where every physical constraint is satisfied and every observation is explained.
This is a continuous optimization over a suite of physical consistency constraints. Each pass evaluates how well the current world state explains reality, then refines it. The constraints span geometric agreement, volumetric accountability, contact mechanics, and physical plausibility, each demanding its own compute-intensive evaluation across every object in the scene.
This is the most compute-intensive stage of the pipeline, and the one where purpose-built CUDA kernels had the most dramatic impact.
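The evaluate-then-refine loop can be illustrated with a deliberately tiny example: a one-variable world state nudged by coordinate descent until two toy constraints are satisfied. The solver, the constraints, and every value below are illustrative, not Foresight's actual formulation:

```python
def total_residual(state, constraints):
    """Sum of per-constraint residuals; 0 means fully consistent."""
    return sum(c(state) for c in constraints)

def refine(state, constraints, step=0.1, iters=200):
    """Toy coordinate descent: nudge each variable in whichever
    direction lowers the total residual. Real systems use far more
    sophisticated solvers, evaluated in parallel on the GPU."""
    state = list(state)
    for _ in range(iters):
        for i in range(len(state)):
            base = total_residual(state, constraints)
            state[i] += step
            if total_residual(state, constraints) > base:
                state[i] -= 2 * step
                if total_residual(state, constraints) > base:
                    state[i] += step  # no improvement in either direction
    return state

# Two toy constraints on a box height h: agree with a sensor reading of
# 2.0 (geometric agreement), and stay above the floor (plausibility).
sensor_agreement = lambda s: abs(s[0] - 2.0)
floor_plausibility = lambda s: max(0.0, -s[0])
state = refine([0.0], [sensor_agreement, floor_plausibility])
print(round(state[0], 1))  # → 2.0
```

The production version runs this kind of evaluate-and-refine cycle over many constraints and 100+ objects per frame, which is what makes the per-constraint kernel timings below matter so much.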
Each constraint evaluates a different dimension of physical consistency, computed across every object in the scene:
Constraint Evaluation Performance
Averaged across 122 production scenes • NVIDIA GPU
| Constraint | What It Enforces | Foresight 0.1 | Foresight 1.0 | Speedup |
|---|---|---|---|---|
| Geometric Consistency | The world state must agree with observed sensor data | 141.9ms | 1.9ms | 73x |
| Plausibility Enforcement | Configurations must remain physically plausible | 16.6ms | 2.0ms | 8.3x |
| Contact Mechanics | Objects deform, compress, and flex under contact | 44.1ms | 5.5ms | 8.0x |
| Volumetric Reasoning | All occupied space must be accounted for | 1,023.3ms | 3.7ms | 280x |
Every constraint benefits from GPU acceleration, but the gains are not uniform. Volumetric Reasoning and Geometric Consistency see the most dramatic speedups (280x and 73x) because they scale as O(N×M), where N is the number of objects and M is the number of sensor observations: exactly the kind of embarrassingly parallel workload where purpose-built CUDA kernels dominate. Contact Mechanics and Plausibility Enforcement, while lighter, still achieve 8x speedups on the GPU.
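The O(N×M) structure is easy to see in miniature: every (object, observation) pair is scored independently, with no data dependency between pairs, so a CUDA kernel can assign one thread per pair. A toy Python sketch with illustrative 1-D "positions":

```python
def geometric_consistency(objects, observations, explain):
    """Score every (object, observation) pair independently.

    This is O(N*M) work with no dependencies between pairs, which is
    exactly why a CUDA kernel (one thread per pair) accelerates it so
    dramatically. Here it is written as a plain nested comprehension.
    """
    return [[explain(o, z) for z in observations] for o in objects]

# Toy 1-D example: each object is a predicted position, each observation
# a measured position, and 'explain' is the squared error (illustrative).
objects = [1.0, 4.0]             # N = 2 predicted object positions
observations = [1.1, 3.9, 8.0]   # M = 3 sensor measurements
grid = geometric_consistency(objects, observations, lambda o, z: (o - z) ** 2)
print(grid)
```

In the real constraint, each "score" is a compute-heavy geometric comparison rather than a squared difference, but the independence of the N×M grid is the same, and it is what the kernel parallelizes over.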
Latency scales linearly with scene complexity. A near-empty trailer (1–10 objects) completes in 5–7ms. A full trailer (140+ objects) takes 21ms. The linear scaling is a property of the Foresight 1.0 implementation. The Foresight 0.1 implementation scaled super-linearly due to memory allocation patterns.
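Those figures imply a simple affine cost model, latency ~ a + b·N. A back-of-envelope fit through the two operating points quoted above (~6ms at ~5 objects, 21ms at 140 objects; the fit itself is illustrative, not a published model):

```python
def affine_fit(n1, t1, n2, t2):
    """Fit latency = a + b * n_objects through two measured points."""
    b = (t2 - t1) / (n2 - n1)   # per-object cost (ms)
    a = t1 - b * n1             # fixed overhead (ms)
    return a, b

# Operating points from the paragraph above.
a, b = affine_fit(5, 6.0, 140, 21.0)
print(f"fixed cost ~ {a:.1f}ms, per-object cost ~ {b:.3f}ms")
```

The takeaway from such a fit: most of the Foresight 1.0 budget is fixed overhead, and each additional object costs on the order of a tenth of a millisecond, so even pathological scenes stay well inside the real-time budget.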
Volumetric Reasoning dominated the Foresight 0.1 pipeline at 1023ms median. A purpose-built CUDA kernel brought it to 3.7ms, a 280x speedup that turned the bottleneck into the fastest constraint evaluation.
Evaluating geometric agreement across every object and every observation is O(N×M). CUDA parallelizes this across thousands of GPU threads, bringing 142ms down to 1.9ms.
03.4 | Cumulative Impact
How Every Optimization Compounds
Perception drops from 57ms to 28ms, Reconstruction from 225ms to 49ms, and Synthesis from 1226ms to 13ms. The three stages sum to 90ms, down from 1508ms: a 17x end-to-end speedup.
Stage-by-Stage Breakdown: Foresight 0.1 vs. Foresight 1.0
Median latencies across production data • NVIDIA GPU • Same scale
BEFORE: Foresight 0.1 | Total 1508ms per visual perception cycle
AFTER: Foresight 1.0 (NVIDIA-Accelerated) | Total 90ms per visual perception cycle
Per-Stage Contribution
Perception
2x
57ms → 28ms
TensorRT
Reconstruction
4.6x
225ms → 49ms
GPU Parallelism
Synthesis
94x
1226ms → 13ms
CUDA Physics Kernels
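The compounding arithmetic can be checked directly from the per-stage medians above:

```python
stages = {
    # stage: (Foresight 0.1 median ms, Foresight 1.0 median ms)
    "Perception":     (57, 28),     # TensorRT
    "Reconstruction": (225, 49),    # GPU parallelism
    "Synthesis":      (1226, 13),   # CUDA physics kernels
}

before = sum(b for b, _ in stages.values())
after = sum(a for _, a in stages.values())
for name, (b, a) in stages.items():
    print(f"{name}: {b}ms -> {a}ms ({b / a:.1f}x)")
print(f"Total: {before}ms -> {after}ms ({before / after:.0f}x)")  # 1508 -> 90, 17x
```

Note that the end-to-end 17x is not the product of the per-stage speedups but the ratio of the summed latencies; Synthesis dominated the old pipeline, so its 94x reduction drives most of the total gain.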
04 | Dexterity's Broader Use of NVIDIA Ecosystem
Dexterity × NVIDIA
TensorRT for inference. Massive GPU parallelism for geometric understanding. Purpose-built physics CUDA kernels for physical consistency. This is what it takes to give a robot real-time understanding of the physical world. This acceleration work is just the foundation. Below is a snapshot of the broader NVIDIA ecosystem we are actively building on.
Edge Deployment & Compute
High-bandwidth visual compute from edge to cloud, intelligent video, streaming sensor processing, next-gen robotic compute.
Jetson • Metropolis • Holoscan • RTX Ada • L4
Vision & Depth Models
Next-gen perception powered by NVIDIA foundation models for visual understanding and metric depth estimation at scale.
Physical AI Models
Leveraging NVIDIA's world foundation model platform for tasks such as real-time object detection.
Mapping & Autonomy
GPU-accelerated 3D reconstruction and mapping for the Mech's autonomous navigation.
Isaac Nvblox
Dexterity • Powered by NVIDIA • March 2026