#fault-tolerance

[ follow ]
Artificial intelligence
fromInfoQ
2 weeks ago

Temporal and OpenAI Launch AI Agent Durability with Public Preview Integration

Temporal integration with the OpenAI Agents SDK provides durable execution for AI agents, adding persistence, retries, and crash recovery.
fromTechzine Global
3 weeks ago

Clockwork launches FleetIQ to dramatically improve AI training

In large-scale AI training, billions in computing power are currently lost due to communication problems between processors. Clockwork tackles this problem with FleetIQ, a Software-Driven Fabric that provides real-time insight into GPU clusters. The system detects bottlenecks within microseconds and automatically redirects traffic via alternative routes. In addition, stateful fault tolerance prevents entire AI jobs from having to be restarted after a failure.
Artificial intelligence
Science
fromTechCrunch
1 month ago

With Starship Flight 10, SpaceX prioritized resilience over perfection | TechCrunch

Starship's 10th test flight intentionally stressed fault tolerance—heat shield damage, propulsion redundancy, and engine relighting—to advance large-scale rocket reusability and operational safety.
Science
fromIT Pro
1 month ago

From SETI to CERN: What you need to know about cluster computing and how it can help your enterprise

Cluster computing uses many networked ordinary computers coordinated by middleware to divide work, providing scalable, fault-tolerant high-performance computing for tasks like simulations and data analysis.
fromWIRED
1 month ago

How Supercomputing Will Evolve, According to Jack Dongarra

Quantum computers produce a probability distribution of potential answers rather than definitive results, indicating a significant shift from traditional computation models.
Artificial intelligence
[ Load more ]