The content on this page was provided by an independent third party and syndicated by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large, AI-focused enterprises and AI clouds, failure-driven restarts are therefore inevitable, making them a major barrier to scaling AI’s impact.

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass tackles this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. For enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now address impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
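
The three scenarios above could be modeled as a simple dispatch from a detected event to a migration type. The sketch below is illustrative only: the event names, the `MigrationKind` enum, and the `choose_migration` function are hypothetical and do not represent TorchPass's actual interface or detection logic.

```python
from enum import Enum


class MigrationKind(Enum):
    UNPLANNED = "unplanned"      # a hard failure has already occurred
    PREEMPTIVE = "pre-emptive"   # early warning, fault not yet fatal
    PLANNED = "planned"          # operator-initiated work


# Hypothetical event names, grouped by the scenarios in the list above.
HARD_FAILURES = {"kernel_crash", "power_failure", "gpu_fault"}
EARLY_WARNINGS = {"rising_temperature", "ecc_memory_error"}
OPERATOR_ACTIONS = {"maintenance", "patching", "rebalancing"}


def choose_migration(event: str) -> MigrationKind:
    """Map a detected event to the migration type that would handle it."""
    if event in HARD_FAILURES:
        return MigrationKind.UNPLANNED
    if event in EARLY_WARNINGS:
        return MigrationKind.PREEMPTIVE
    if event in OPERATOR_ACTIONS:
        return MigrationKind.PLANNED
    raise ValueError(f"unrecognized event: {event!r}")


if __name__ == "__main__":
    for name in ("gpu_fault", "ecc_memory_error", "patching"):
        print(name, "->", choose_migration(name).value)
```

Unknown events raise an error rather than defaulting to a migration type, on the assumption that an unclassified signal should be surfaced to an operator.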

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
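
As a rough check, the two quoted figures can be reconciled with simple arithmetic. The sketch below takes the numbers at face value (about three hours, or 180 minutes, lost per day before, and a 95% reduction); the function name is illustrative.

```python
def remaining_lost_minutes(baseline_minutes: float, reduction: float) -> float:
    """Lost time per day that remains after cutting waste by `reduction` (0..1)."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return baseline_minutes * (1.0 - reduction)


if __name__ == "__main__":
    baseline = 3 * 60  # approximately three hours per day, from the text
    remaining = remaining_lost_minutes(baseline, 0.95)
    # The result lands just under the ten-minute figure quoted above.
    print(f"{remaining:.0f} minutes per day")
```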

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX, SemiAnalysis’ independent benchmark for large-scale AI training, stress-tested Clockwork.io TorchPass and found that it delivered leading performance and efficiency for large-scale distributed training while enabling users to reduce checkpointing overhead. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC from March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork

