Engineering · March 16, 2026 · Arpit Agarwal, Robert Sun

Dexterity Supercharges Foresight by 17x with NVIDIA

Foresight is Dexterity's world model, the Physical AI system that gives robots real-time understanding of the physical world. Before every action, Foresight must answer: what does the world look like right now? The answer must be precise enough to plan stable placements among 100+ objects, and fast enough that the robot never waits.

01 | Problem Statement

At every timestep, Foresight integrates multi-modal sensor data (3D cameras, depth sensors, robot body state, force feedback, and the outcomes of physical interactions) into a unified world state. Each update maintains explicit uncertainty bounds, physics constraints, and rollback capability.

Of these modalities, visual perception from multi-camera arrays is the most computationally demanding. The visual perception pipeline must complete before any reasoning about whole-scene stability, geometric and motion planning, and other downstream AI can begin.

Before NVIDIA optimization, visual perception alone consumed 1508ms per cycle, longer than the entire decision budget. Real-time operation was possible only by running perception and planning in parallel with predictive pipelining, tolerating occasionally stale world state. The pipeline needed to be fundamentally faster. This post focuses on that visual perception pipeline: the biggest bottleneck and the biggest win.

02 | Solution and Impact Summary

17x Faster: From 1508ms to 90ms

TensorRT-optimized inference. Massively parallel geometric reasoning. Purpose-built physics CUDA kernels. Three stages of NVIDIA acceleration compound to deliver a 17x speedup, bringing visual perception from 1508ms down to 90ms per cycle.

After optimization, the full sense-think-act loop fits comfortably within the cadence needed for real-time dual-arm production operation. The result: higher throughput and fewer placement errors, enabling denser, more reliable trailer loads in production. What follows is the deep dive into each pipeline stage and how these gains were achieved.

Vision Pipeline Speedup: 17x
Perception Latency: 90ms (down from 1508ms)
Production Speed: real-time dual-arm parallel operation

03 | Deep Dive

Three Pipeline Stages, Three NVIDIA Solutions

Foresight's visual perception pipeline has three stages. Each one was a bottleneck. Each one was solved with a different layer of NVIDIA technology. Benchmarks were run on production data on an NVIDIA L4 Tensor Core GPU.

Stage 1 | Foresight Perception: object-level scene decomposition and attribute extraction. NVIDIA Tech: TensorRT.
Stage 2 | Foresight Reconstruction: geometric scene understanding from full-resolution point clouds. NVIDIA Tech: massive GPU compute parallelism.
Stage 3 | Foresight Synthesis: physical consistency enforcement across the full scene. NVIDIA Tech: purpose-built physics CUDA kernels.

03.1 | Foresight Perception

Object-Level Scene Decomposition: 2x Faster with TensorRT

Decomposing raw sensor data into individual objects, each with its own spatial embedding, depth features, and downstream attributes. TensorRT optimization halves median inference latency and dramatically tightens the tail.

Before Foresight can reason about the world, it must decompose raw multi-camera imagery into object-level representations. This isn't simple segmentation. For each object in the scene, Foresight Perception localizes not by image patches but by object instance, isolating each box's region across RGB and depth channels. From that localized attention area, the system extracts a rich set of per-object embeddings and attributes: spatial priors, surface features, and operational properties like pickability scores that predict whether removing a box would destabilize its neighbors. In a full trailer, this means reasoning about 100+ objects per frame.
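To make the per-object output concrete, here is a hedged sketch of what such a record might look like. The type and field names (`ObjectInstance`, `pickability`, and so on) are illustrative assumptions, not Dexterity's actual schema:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ObjectInstance:
    """One box's slice of the world state (hypothetical field names)."""
    instance_id: int
    rgb_roi: np.ndarray            # localized RGB region for this instance
    depth_roi: np.ndarray          # matching depth region
    spatial_embedding: np.ndarray  # spatial prior / surface features
    pickability: float             # predicted risk of destabilizing neighbors


# A full trailer yields 100+ of these per frame.
box = ObjectInstance(
    instance_id=0,
    rgb_roi=np.zeros((64, 64, 3)),
    depth_roi=np.zeros((64, 64)),
    spatial_embedding=np.zeros(128),
    pickability=0.87,
)
```

The key design point the post describes is that attention is localized per instance, so each record carries its own RGB and depth crops rather than whole-frame features.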

The Foresight 0.1 pipeline was already fast, but not fast enough. TensorRT optimization, combined with architectural improvements to eliminate unnecessary overhead, delivers a 2x median speedup and, critically, compresses the P95 tail from 207ms down to 114ms. That tail matters: in a real-time pipeline, one slow frame can stall everything downstream.

Median Speedup: 2x (57ms → 28ms)
P95 Speedup: 1.8x (207ms → 114ms)
Images Benchmarked: 9,977 production images
Latency Distribution: Before
Foresight 0.1 inference (train split, 9,977 images)

Heavy tail: 5% of frames exceed 207ms, creating downstream stalls. Median 57ms, but worst case spikes to 760ms.

Latency Distribution: After
Foresight 1.0 TensorRT inference (train split, 9,977 images)

Tight distribution: P95 compressed to 114ms. Median 28ms. TensorRT optimization + architectural improvements.

Full Benchmark Results

On NVIDIA GPU • 9,977 production images

| Configuration | Median | P95 | P99 | Worst Case | Speedup |
|---|---|---|---|---|---|
| Foresight 0.1 (train) | 56.7ms | 207.7ms | 308.3ms | 759.8ms | baseline |
| Foresight 1.0 TensorRT (train) | 27.7ms | 114.0ms | 126.6ms | 146.8ms | 2.0x |
| Foresight 0.1 (val) | 62.7ms | 236.8ms | 303.0ms | 504.4ms | baseline |
| Foresight 1.0 TensorRT (val) | 27.6ms | 112.7ms | 124.4ms | 245.1ms | 2.3x |
Two optimizations, one number. The speedup comes from two changes applied together: TensorRT model optimization (graph fusion, kernel auto-tuning, precision optimization) and architectural improvements that eliminate unnecessary serialization and overhead. Both rely on the NVIDIA software stack.
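For readers reproducing tables like this, the tail metrics are plain percentiles over per-frame latencies. A minimal sketch with hypothetical sample data (NumPy assumed; the real tables aggregate 9,977 images):

```python
import numpy as np

# Hypothetical per-frame inference latencies in ms; the benchmark tables
# report the same statistics computed over production images.
latencies_ms = np.array([21.0, 25.0, 27.7, 30.1, 95.0, 114.0, 126.6, 146.8])

median = np.percentile(latencies_ms, 50)
p95 = np.percentile(latencies_ms, 95)
p99 = np.percentile(latencies_ms, 99)
worst = latencies_ms.max()
```

The P95 and P99 columns matter more than the median for a real-time pipeline: a single slow frame stalls everything downstream, which is why compressing the tail was called out as the critical win.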

03.2 | Foresight Reconstruction

Geometric Scene Understanding: 4.6x Faster, 32x More Data

Forming geometric hypotheses about every object in the world. Massive GPU parallelism doesn't just make it faster; it makes it possible to use the full, uncompromised sensor resolution.

Once objects are identified, Foresight must build a geometric understanding of each one, inferring shape, pose, and spatial extent from raw point cloud data. This is a dense optimization problem: the system generates and refines geometric hypotheses for every object in the scene, converging on a representation that captures how each object exists in 3D space.

The Foresight 0.1 CPU-based implementation was fast enough for subsampled point clouds, but subsampling means throwing away data, and less data means less precise reconstructions. On the GPU, Foresight explores a massive space of geometric hypotheses in parallel, evaluating candidates simultaneously across thousands of threads. The GPU pipeline processes the full, unsubsampled point cloud from every camera (32x more data points per object) and still runs 4.6x faster than the CPU version did on the reduced data.
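The hypothesis-search idea can be sketched in miniature. In this toy NumPy version, every candidate is scored against every point at once, standing in for the one-thread-per-candidate CUDA layout; the plane-height model and all the numbers are illustrative, not Foresight's actual geometry:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "point cloud" z-values: half cluster on a surface at z = 0.42,
# the rest are clutter.
z = np.concatenate([rng.normal(0.42, 0.005, 5_000),
                    rng.uniform(0.0, 1.0, 5_000)])

# Hypothesis space: candidate surface heights. On a GPU each candidate
# (or candidate-point pair) maps to its own thread; broadcasting plays
# that role here.
candidates = np.linspace(0.0, 1.0, 201)               # (K,)
residuals = np.abs(z[None, :] - candidates[:, None])  # (K, N): all pairs at once
inliers = (residuals < 0.01).sum(axis=1)              # score every hypothesis
best = candidates[np.argmax(inliers)]                 # converged estimate
```

The same structure is why full resolution becomes affordable on the GPU: adding points widens the residual matrix, but every entry is still independent work.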

Median Speedup: 4.6x (225ms → 49ms)
More Data Processed: 32x (full sensor resolution)
Scenes Benchmarked: 265 multi-camera production scenes
Latency Comparison: Foresight 0.1 (Subsampled) vs. Foresight 1.0 (Full Resolution)
265 production scenes from 5-camera logs • NVIDIA GPU

The CUDA implementation processes 32x more point cloud data (the full, uncompromised bandwidth of every camera) while completing in a 49ms median versus 225ms. This isn't just speed; it's higher-fidelity geometric understanding that was previously impossible in real time.

Full Benchmark Results

On NVIDIA GPU • 265 five-camera production scenes

| Configuration | Median | P95 | P99 | Worst Case | Data Volume |
|---|---|---|---|---|---|
| Foresight 0.1 (subsampled point clouds) | 225.2ms | 237.4ms | 410.0ms | 491.2ms | 1x |
| Foresight 1.0 (full resolution) | 49.1ms | 58.7ms | 61.2ms | 73.3ms | 32x |
More data, better decisions. The previous CPU pipeline had to subsample point clouds to meet latency budgets, discarding 97% of available sensor data. With CUDA, Foresight processes every point from every camera. No subsampling. No compromise. The result is richer geometric understanding, which cascades into better placement decisions.

03.3 | Foresight Synthesis

Physical Consistency at GPU Speed

Reconciling multi-sensor observations into a single, physically consistent world state, at the speed that real-time robotics demands. Purpose-built CUDA kernels enforce the laws of physics across every object simultaneously.

Multiple cameras observe the same scene from different viewpoints. Their observations disagree. Objects occlude each other, deform under contact, compress under load, and sag under gravity. Foresight Synthesis must reconcile all of this into one coherent world state: a representation where every physical constraint is satisfied and every observation is explained.

This is a continuous optimization over a suite of physical consistency constraints. Each pass evaluates how well the current world state explains reality, then refines it. The constraints span geometric agreement, volumetric accountability, contact mechanics, and physical plausibility, each demanding its own compute-intensive evaluation across every object in the scene.
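The evaluate-then-refine loop can be illustrated in miniature. This toy version reconciles disagreeing height estimates from several cameras with a gradient step on a squared-error residual; the single-scalar state and the numbers are deliberate simplifications of the real multi-constraint optimization:

```python
import numpy as np

# Hypothetical multi-camera estimates of one box's height (m); they disagree.
observations = np.array([0.300, 0.310, 0.290, 0.305])

state = 0.0  # initial world-state guess
for _ in range(200):
    residual = state - observations  # how poorly the state explains reality
    state -= 0.1 * residual.mean()   # refine toward agreement

# state converges to the value that best explains all observations
```

In the real pipeline each pass evaluates several such residual families (geometric, volumetric, contact, plausibility) over every object, which is what makes per-pass cost the number worth optimizing.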

This is the most compute-intensive stage of the pipeline, and the one where purpose-built CUDA kernels had the most dramatic impact.

Volumetric Reasoning: 277x (1023ms → 3.7ms)
Geometric Consistency: 73x (142ms → 1.9ms)
Scenes Benchmarked: 122 production scenes, 1–138 objects each

Each constraint evaluates a different dimension of physical consistency, computed across every object in the scene:

Constraint Evaluation Performance

Averaged across 122 production scenes • NVIDIA GPU

| Constraint | What It Enforces | Foresight 0.1 | Foresight 1.0 | Speedup |
|---|---|---|---|---|
| Geometric Consistency | The world state must agree with observed sensor data | 141.9ms | 1.9ms | 73x |
| Plausibility Enforcement | Configurations must remain physically plausible | 16.6ms | 2.0ms | 8.3x |
| Contact Mechanics | Objects deform, compress, and flex under contact | 44.1ms | 5.5ms | 8.0x |
| Volumetric Reasoning | All occupied space must be accounted for | 1,023.3ms | 3.7ms | 277x |

Every constraint benefits from GPU acceleration, but the gains are not uniform. Volumetric Reasoning and Geometric Consistency see the most dramatic speedups (277x and 73x) because they scale as O(N×M) where N is the number of objects and M is the number of sensor observations, exactly the kind of embarrassingly parallel workload where purpose-built CUDA kernels dominate. Contact Mechanics and Plausibility Enforcement, while lighter, still achieve 8x speedups on GPU.
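The O(N×M) structure is exactly what makes these constraints parallel-friendly: every object-observation pair gets an independent residual, i.e. one GPU thread per pair. In this illustrative sketch a NumPy broadcast stands in for the CUDA grid (sizes and the nearest-object cost are hypothetical):

```python
import numpy as np

N, M = 138, 2_000  # objects x observations (hypothetical sizes)
rng = np.random.default_rng(1)
objects = rng.uniform(size=(N, 1, 3))       # object centers
observations = rng.uniform(size=(1, M, 3))  # sensor points

# All N*M residuals at once; on the GPU this is one thread per pair.
residuals = np.linalg.norm(objects - observations, axis=-1)  # (N, M)

# Toy geometric-consistency cost: each observation should be explained
# by some object, so charge its distance to the nearest one.
cost = residuals.min(axis=0).sum()
```

Because no pair depends on any other, the work maps onto thousands of threads with no synchronization until the final reduction, which is the "embarrassingly parallel" property the text describes.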

Scaling: How Latency Grows With Scene Complexity
CUDA total cost vs. number of objects in scene • 122 production scenes

Latency scales linearly with scene complexity. A near-empty trailer (1–10 objects) completes in 5–7ms; a full trailer (~138 objects) takes 21ms. The linear scaling is a property of the Foresight 1.0 implementation; the Foresight 0.1 implementation scaled super-linearly due to its memory allocation patterns.

Volumetric Reasoning: The 277x Win
The most compute-intensive constraint

Volumetric Reasoning dominated the Foresight 0.1 pipeline at a 1023ms median. A purpose-built CUDA kernel brought it to 3.7ms, a 277x speedup that turned the pipeline's biggest bottleneck into its fastest constraint evaluation.

Geometric Consistency: 73x Speedup
Sensor-to-model agreement evaluation

Evaluating geometric agreement across every object and every observation is O(N×M). CUDA parallelizes this across thousands of GPU threads, bringing 142ms down to 1.9ms.

Purpose-built kernels, not off-the-shelf. These aren't generic GPU-accelerated matrix operations. Each constraint required a purpose-built CUDA kernel tailored to the specific physics it enforces: contact resolution, spatial reasoning, volumetric analysis. NVIDIA's CUDA development ecosystem (profiling tools, debugging infrastructure, and the maturity of the platform) made this level of low-level optimization practical for a robotics team.

03.4 | Cumulative Impact

How Every Optimization Compounds

Perception drops from 57ms to 28ms, Reconstruction from 225ms to 49ms, and Synthesis from 1226ms to 13ms. The three stages sum to 90ms, down from 1508ms: a 17x end-to-end speedup.
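The compounding is plain arithmetic over the three stage medians reported above:

```python
before_ms = {"perception": 57, "reconstruction": 225, "synthesis": 1226}
after_ms = {"perception": 28, "reconstruction": 49, "synthesis": 13}

total_before = sum(before_ms.values())  # 1508 ms
total_after = sum(after_ms.values())    # 90 ms
speedup = total_before / total_after    # ~16.8x, reported as 17x
```

Note that the end-to-end factor is dominated by Synthesis: once the 1226ms stage falls to 13ms, the remaining budget is mostly Perception and Reconstruction.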

Stage-by-Stage Breakdown: Foresight 0.1 vs. Foresight 1.0

Median latencies across production data • NVIDIA GPU • Same scale

BEFORE: Foresight 0.1 | Total 1508ms per visual perception cycle

Perception 57ms · Reconstruction 225ms · Synthesis 1226ms

AFTER: Foresight 1.0 (NVIDIA-Accelerated) | Total 90ms per visual perception cycle

Perception 28ms · Reconstruction 49ms · Synthesis 13ms

Per-Stage Contribution

Perception: 2x (57ms → 28ms) · TensorRT
Reconstruction: 4.6x (225ms → 49ms) · GPU Parallelism
Synthesis: 94x (1226ms → 13ms) · CUDA Physics Kernels

04 | Dexterity's Broader Use of NVIDIA Ecosystem

Dexterity × NVIDIA

TensorRT for inference. Massive GPU parallelism for geometric understanding. Purpose-built physics CUDA kernels for physical consistency. This is what it takes to give a robot real-time understanding of the physical world. This acceleration work is just the foundation. Below is a snapshot of the broader NVIDIA ecosystem we are actively building on.

Edge Deployment & Compute

High-bandwidth visual compute from edge to cloud, intelligent video, streaming sensor processing, next-gen robotic compute.

Jetson · Metropolis · Holoscan · RTX Ada · L4

Vision & Depth Models

Next-gen perception powered by NVIDIA foundation models for visual understanding and metric depth estimation at scale.

FoundationStereo · Video Search and Summarization (VSS)

Physical AI Models

Leveraging NVIDIA's world foundation model platform (e.g. real-time object detection).

Cosmos

Mapping & Autonomy

GPU-accelerated 3D reconstruction and mapping for the Mech's autonomous navigation.

Isaac · Nvblox

Dexterity • Powered by NVIDIA • March 2026