Engineering · March 16, 2026 · Arpit Agarwal, Robert Sun

Dexterity Supercharges Foresight by 17x with NVIDIA

Foresight is Dexterity's world model, the Physical AI system that gives robots real-time understanding of the physical world. Before every action, Foresight must answer: what does the world look like right now? The answer must be precise enough to plan stable placements among 100+ objects, and fast enough that the robot never waits.

01 | Problem Statement

At every timestep, Foresight integrates multi-modal sensor data (3D cameras, depth sensors, robot body state, force feedback, and the outcomes of physical interactions) into a unified world state. Each update maintains explicit uncertainty bounds, physics constraints, and rollback capability.

Of these modalities, visual perception from multi-camera arrays is the most computationally demanding. The visual perception pipeline must complete before any reasoning about whole-scene stability, geometric and motion planning, and other downstream AI can begin.

Before NVIDIA optimization, visual perception alone consumed 1508ms per cycle, longer than the entire decision budget. Real-time operation was possible only by running perception and planning in parallel with predictive pipelining, tolerating occasionally stale world state. The pipeline needed to be fundamentally faster. This post focuses on that visual perception pipeline: the biggest bottleneck and the biggest win.

02 | Solution and Impact Summary

17x Faster: From 1508ms to 90ms

TensorRT-optimized inference. Massively parallel geometric reasoning. Purpose-built physics CUDA kernels. Three stages of NVIDIA acceleration compound to deliver a 17x speedup, bringing visual perception from 1508ms down to 90ms per cycle.

After optimization, the full sense-think-act loop fits comfortably within the cadence needed for real-time dual-arm production operation. The result: higher throughput and fewer placement errors, enabling denser, more reliable trailer loads in production. What follows is the deep dive into each pipeline stage and how these gains were achieved.

Vision Pipeline Speedup: 17x
Perception Latency: 90ms (down from 1508ms)
Production Speed: real-time dual-arm parallel operation

03 | Deep Dive

Three Pipeline Stages, Three NVIDIA Solutions

Foresight's visual perception pipeline has three stages. Each one was a bottleneck. Each one was solved with a different layer of NVIDIA technology. Benchmarks were run on production data on an NVIDIA L4 Tensor Core GPU.

Stage 1 | Foresight Perception: object-level scene decomposition and attribute extraction. NVIDIA Tech: TensorRT.
Stage 2 | Foresight Reconstruction: geometric scene understanding from full-resolution point clouds. NVIDIA Tech: massive GPU compute parallelism.
Stage 3 | Foresight Synthesis: physical consistency enforcement across the full scene. NVIDIA Tech: purpose-built physics CUDA kernels.

03.1 | Foresight Perception

Object-Level Scene Decomposition: 2x Faster with TensorRT

Decomposing raw sensor data into individual objects, each with its own spatial embedding, depth features, and downstream attributes. TensorRT optimization halves median inference latency and dramatically tightens the tail.

Before Foresight can reason about the world, it must decompose raw multi-camera imagery into object-level representations. This isn't simple segmentation. For each object in the scene, Foresight Perception localizes not by image patches but by object instance, isolating each box's region across RGB and depth channels. From that localized attention area, the system extracts a rich set of per-object embeddings and attributes: spatial priors, surface features, and operational properties like pickability scores that predict whether removing a box would destabilize its neighbors. In a full trailer, this means reasoning about 100+ objects per frame.
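To make the per-object output concrete, here is a hedged sketch of what such a record might look like. The type and field names (`ObjectInstance`, `pickability`, and so on) are illustrative assumptions, not Dexterity's actual schema:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ObjectInstance:
    """One box's slice of the world state (hypothetical field names)."""
    instance_id: int
    rgb_roi: np.ndarray            # localized RGB region for this instance
    depth_roi: np.ndarray          # matching depth region
    spatial_embedding: np.ndarray  # spatial prior / surface features
    pickability: float             # predicted risk of destabilizing neighbors


# A full trailer yields 100+ of these per frame.
box = ObjectInstance(
    instance_id=0,
    rgb_roi=np.zeros((64, 64, 3)),
    depth_roi=np.zeros((64, 64)),
    spatial_embedding=np.zeros(128),
    pickability=0.87,
)
```

The key design point the post describes is that attention is localized per instance, so each record carries its own RGB and depth crops rather than whole-frame features.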

The Foresight 0.1 pipeline was already fast, but not fast enough. TensorRT optimization, combined with architectural improvements to eliminate unnecessary overhead, delivers a 2x median speedup and, critically, compresses the P95 tail from 207ms down to 114ms. That tail matters: in a real-time pipeline, one slow frame can stall everything downstream.

Median Speedup: 2x (57ms → 28ms)
P95 Speedup: 1.8x (207ms → 114ms)
Images Benchmarked: 9,977 production images
Latency Distribution: Before
Foresight 0.1 inference (train split, 9,977 images)

Heavy tail: 5% of frames exceed 207ms, creating downstream stalls. Median 57ms, but worst case spikes to 760ms.

Latency Distribution: After
Foresight 1.0 TensorRT inference (train split, 9,977 images)

Tight distribution: P95 compressed to 114ms. Median 28ms. TensorRT optimization + architectural improvements.

Full Benchmark Results

On NVIDIA GPU • 9,977 production images

| Configuration | Median | P95 | P99 | Worst Case | Speedup |
|---|---|---|---|---|---|
| Foresight 0.1 (train) | 56.7ms | 207.7ms | 308.3ms | 759.8ms | baseline |
| Foresight 1.0 TensorRT (train) | 27.7ms | 114.0ms | 126.6ms | 146.8ms | 2.0x |
| Foresight 0.1 (val) | 62.7ms | 236.8ms | 303.0ms | 504.4ms | baseline |
| Foresight 1.0 TensorRT (val) | 27.6ms | 112.7ms | 124.4ms | 245.1ms | 2.3x |
Two optimizations, one number. The speedup comes from two changes applied together: TensorRT model optimization (graph fusion, kernel auto-tuning, precision optimization) and architectural improvements that eliminate unnecessary serialization and overhead. Both rely on the NVIDIA software stack.
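For readers reproducing tables like this, the tail metrics are plain percentiles over per-frame latencies. A minimal sketch with hypothetical sample data (NumPy assumed; the real tables aggregate 9,977 images):

```python
import numpy as np

# Hypothetical per-frame inference latencies in ms; the benchmark tables
# report the same statistics computed over production images.
latencies_ms = np.array([21.0, 25.0, 27.7, 30.1, 95.0, 114.0, 126.6, 146.8])

median = np.percentile(latencies_ms, 50)
p95 = np.percentile(latencies_ms, 95)
p99 = np.percentile(latencies_ms, 99)
worst = latencies_ms.max()
```

The P95 and P99 columns matter more than the median for a real-time pipeline: a single slow frame stalls everything downstream, which is why compressing the tail was called out as the critical win.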

03.2 | Foresight Reconstruction

Geometric Scene Understanding: 4.6x Faster, 32x More Data

Forming geometric hypotheses about every object in the world. Massive GPU parallelism doesn't just make it faster; it makes it possible to use the full, uncompromised sensor resolution.

Once objects are identified, Foresight must build a geometric understanding of each one, inferring shape, pose, and spatial extent from raw point cloud data. This is a dense optimization problem: the system generates and refines geometric hypotheses for every object in the scene, converging on a representation that captures how each object exists in 3D space.

The Foresight 0.1 CPU-based implementation was fast enough for subsampled point clouds, but subsampling means throwing away data, and less data means less precise reconstructions. On the GPU, Foresight explores a massive space of geometric hypotheses in parallel, evaluating candidates simultaneously across thousands of threads. The GPU pipeline processes the full, unsubsampled point cloud from every camera (32x more data points per object) and still runs 4.6x faster than the CPU version did on the reduced data.
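The hypothesis-search idea can be sketched in miniature. In this toy NumPy version, every candidate is scored against every point at once, standing in for the one-thread-per-candidate CUDA layout; the plane-height model and all the numbers are illustrative, not Foresight's actual geometry:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "point cloud" z-values: half cluster on a surface at z = 0.42,
# the rest are clutter.
z = np.concatenate([rng.normal(0.42, 0.005, 5_000),
                    rng.uniform(0.0, 1.0, 5_000)])

# Hypothesis space: candidate surface heights. On a GPU each candidate
# (or candidate-point pair) maps to its own thread; broadcasting plays
# that role here.
candidates = np.linspace(0.0, 1.0, 201)               # (K,)
residuals = np.abs(z[None, :] - candidates[:, None])  # (K, N): all pairs at once
inliers = (residuals < 0.01).sum(axis=1)              # score every hypothesis
best = candidates[np.argmax(inliers)]                 # converged estimate
```

The same structure is why full resolution becomes affordable on the GPU: adding points widens the residual matrix, but every entry is still independent work.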

Median Speedup: 4.6x (225ms → 49ms)
More Data Processed: 32x (full sensor resolution)
Scenes Benchmarked: 265 multi-camera production scenes
Latency Comparison: Foresight 0.1 (Subsampled) vs. Foresight 1.0 (Full Resolution)
265 production scenes from 5-camera logs • NVIDIA GPU

The CUDA implementation processes 32x more point cloud data (the full, uncompromised bandwidth of every camera) while completing in a 49ms median versus 225ms. This isn't just speed; it's higher-fidelity geometric understanding that was previously impossible in real time.

Full Benchmark Results

On NVIDIA GPU • 265 five-camera production scenes

| Configuration | Median | P95 | P99 | Worst Case | Data Volume |
|---|---|---|---|---|---|
| Foresight 0.1 (subsampled point clouds) | 225.2ms | 237.4ms | 410.0ms | 491.2ms | 1x |
| Foresight 1.0 (full resolution) | 49.1ms | 58.7ms | 61.2ms | 73.3ms | 32x |
More data, better decisions. The previous CPU pipeline had to subsample point clouds to meet latency budgets, discarding 97% of available sensor data. With CUDA, Foresight processes every point from every camera. No subsampling. No compromise. The result is richer geometric understanding, which cascades into better placement decisions.

03.3 | Foresight Synthesis

Physical Consistency at GPU Speed

Reconciling multi-sensor observations into a single, physically consistent world state, at the speed that real-time robotics demands. Purpose-built CUDA kernels enforce the laws of physics across every object simultaneously.

Multiple cameras observe the same scene from different viewpoints. Their observations disagree. Objects occlude each other, deform under contact, compress under load, and sag under gravity. Foresight Synthesis must reconcile all of this into one coherent world state: a representation where every physical constraint is satisfied and every observation is explained.

This is a continuous optimization over a suite of physical consistency constraints. Each pass evaluates how well the current world state explains reality, then refines it. The constraints span geometric agreement, volumetric accountability, contact mechanics, and physical plausibility, each demanding its own compute-intensive evaluation across every object in the scene.
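The evaluate-then-refine loop can be illustrated in miniature. This toy version reconciles disagreeing height estimates from several cameras with a gradient step on a squared-error residual; the single-scalar state and the numbers are deliberate simplifications of the real multi-constraint optimization:

```python
import numpy as np

# Hypothetical multi-camera estimates of one box's height (m); they disagree.
observations = np.array([0.300, 0.310, 0.290, 0.305])

state = 0.0  # initial world-state guess
for _ in range(200):
    residual = state - observations  # how poorly the state explains reality
    state -= 0.1 * residual.mean()   # refine toward agreement

# state converges to the value that best explains all observations
```

In the real pipeline each pass evaluates several such residual families (geometric, volumetric, contact, plausibility) over every object, which is what makes per-pass cost the number worth optimizing.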

This is the most compute-intensive stage of the pipeline, and the one where purpose-built CUDA kernels had the most dramatic impact.

Volumetric Reasoning: 277x (1023ms → 3.7ms)
Geometric Consistency: 73x (142ms → 1.9ms)
Scenes Benchmarked: 122 production scenes, 1–138 objects each

Each constraint evaluates a different dimension of physical consistency, computed across every object in the scene:

Constraint Evaluation Performance

Averaged across 122 production scenes • NVIDIA GPU

| Constraint | What It Enforces | Foresight 0.1 | Foresight 1.0 | Speedup |
|---|---|---|---|---|
| Geometric Consistency | The world state must agree with observed sensor data | 141.9ms | 1.9ms | 73x |
| Plausibility Enforcement | Configurations must remain physically plausible | 16.6ms | 2.0ms | 8.3x |
| Contact Mechanics | Objects deform, compress, and flex under contact | 44.1ms | 5.5ms | 8.0x |
| Volumetric Reasoning | All occupied space must be accounted for | 1,023.3ms | 3.7ms | 277x |

Every constraint benefits from GPU acceleration, but the gains are not uniform. Volumetric Reasoning and Geometric Consistency see the most dramatic speedups (277x and 73x) because they scale as O(N×M) where N is the number of objects and M is the number of sensor observations, exactly the kind of embarrassingly parallel workload where purpose-built CUDA kernels dominate. Contact Mechanics and Plausibility Enforcement, while lighter, still achieve 8x speedups on GPU.
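The O(N×M) structure is exactly what makes these constraints parallel-friendly: every object-observation pair gets an independent residual, i.e. one GPU thread per pair. In this illustrative sketch a NumPy broadcast stands in for the CUDA grid (sizes and the nearest-object cost are hypothetical):

```python
import numpy as np

N, M = 138, 2_000  # objects x observations (hypothetical sizes)
rng = np.random.default_rng(1)
objects = rng.uniform(size=(N, 1, 3))       # object centers
observations = rng.uniform(size=(1, M, 3))  # sensor points

# All N*M residuals at once; on the GPU this is one thread per pair.
residuals = np.linalg.norm(objects - observations, axis=-1)  # (N, M)

# Toy geometric-consistency cost: each observation should be explained
# by some object, so charge its distance to the nearest one.
cost = residuals.min(axis=0).sum()
```

Because no pair depends on any other, the work maps onto thousands of threads with no synchronization until the final reduction, which is the "embarrassingly parallel" property the text describes.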

Scaling: How Latency Grows With Scene Complexity
CUDA total cost vs. number of objects in scene • 122 production scenes

Latency scales linearly with scene complexity. A near-empty trailer (1–10 objects) completes in 5–7ms; a full trailer (~138 objects) takes 21ms. The linear scaling is a property of the Foresight 1.0 implementation; the Foresight 0.1 implementation scaled super-linearly due to its memory allocation patterns.

Volumetric Reasoning: The 277x Win
The most compute-intensive constraint

Volumetric Reasoning dominated the Foresight 0.1 pipeline at a 1023ms median. A purpose-built CUDA kernel brought it to 3.7ms, a 277x speedup that turned the pipeline's biggest bottleneck into its fastest constraint evaluation.

Geometric Consistency: 73x Speedup
Sensor-to-model agreement evaluation

Evaluating geometric agreement across every object and every observation is O(N×M). CUDA parallelizes this across thousands of GPU threads, bringing 142ms down to 1.9ms.

Purpose-built kernels, not off-the-shelf. These aren't generic GPU-accelerated matrix operations. Each constraint required a purpose-built CUDA kernel tailored to the specific physics it enforces: contact resolution, spatial reasoning, volumetric analysis. NVIDIA's CUDA development ecosystem (profiling tools, debugging infrastructure, and the maturity of the platform) made this level of low-level optimization practical for a robotics team.

03.4 | Cumulative Impact

How Every Optimization Compounds

Perception drops from 57ms to 28ms, Reconstruction from 225ms to 49ms, and Synthesis from 1226ms to 13ms. The three stages sum to 90ms, down from 1508ms: a 17x end-to-end speedup.
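The compounding is plain arithmetic over the three stage medians reported above:

```python
before_ms = {"perception": 57, "reconstruction": 225, "synthesis": 1226}
after_ms = {"perception": 28, "reconstruction": 49, "synthesis": 13}

total_before = sum(before_ms.values())  # 1508 ms
total_after = sum(after_ms.values())    # 90 ms
speedup = total_before / total_after    # ~16.8x, reported as 17x
```

Note that the end-to-end factor is dominated by Synthesis: once the 1226ms stage falls to 13ms, the remaining budget is mostly Perception and Reconstruction.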

Stage-by-Stage Breakdown: Foresight 0.1 vs. Foresight 1.0

Median latencies across production data • NVIDIA GPU • Same scale

BEFORE: Foresight 0.1 | Total 1508ms per visual perception cycle

Perception 57ms · Reconstruction 225ms · Synthesis 1226ms

AFTER: Foresight 1.0 (NVIDIA-Accelerated) | Total 90ms per visual perception cycle

Perception 28ms · Reconstruction 49ms · Synthesis 13ms

Per-Stage Contribution

Perception: 2x (57ms → 28ms) · TensorRT
Reconstruction: 4.6x (225ms → 49ms) · GPU Parallelism
Synthesis: 94x (1226ms → 13ms) · CUDA Physics Kernels

04 | Dexterity's Broader Use of NVIDIA Ecosystem

Dexterity × NVIDIA

TensorRT for inference. Massive GPU parallelism for geometric understanding. Purpose-built physics CUDA kernels for physical consistency. This is what it takes to give a robot real-time understanding of the physical world. This acceleration work is just the foundation. Below is a snapshot of the broader NVIDIA ecosystem we are actively building on.

Edge Deployment & Compute

High-bandwidth visual compute from edge to cloud, intelligent video, streaming sensor processing, next-gen robotic compute.

Jetson · Metropolis · Holoscan · RTX Ada · L4

Vision & Depth Models

Next-gen perception powered by NVIDIA foundation models for visual understanding and metric depth estimation at scale.

FoundationStereo · Video Search and Summarization (VSS)

Physical AI Models

Leveraging NVIDIA's world foundation model platform (e.g. real-time object detection).

Cosmos

Mapping & Autonomy

GPU-accelerated 3D reconstruction and mapping for the Mech's autonomous navigation.

Isaac · Nvblox

Dexterity • Powered by NVIDIA • March 2026