We’d like to share information about a rare phenomenon we’re calling “Corrupted Cache,” which appears to affect hardware used to process calls to LLMs (the type of AI models used by AI Dungeon and countless other platforms). Under extremely high loads, the GPU may fail to clear its memory during a crash, resulting in cross-contamination of outputs when the GPU recovers and begins its next task. In AI Dungeon terms, this means that a partially constructed output for one story could be incorrectly carried over into another. When we discovered this might be happening, we immediately took down affected models to investigate the cause and identify a solution. Because this seems to be a hardware-level issue, we believe the best mechanism to avoid these conditions is better GPU load management, and we’re working with our providers to implement safer failure patterns and early detection of high-load conditions.
Although we suspect the corrupted cache is an industry-wide issue, it’s extremely rare and when it occurs it’s likely diagnosed as common AI hallucination, making it a tricky issue to identify and confirm. We’ve been unable to find concrete examples of others who’ve observed this phenomenon, and we may be one of the first companies to observe and publish about the issue. Much of what we share here today may change as more people observe this issue and more information becomes available.
Now, in true Latitude fashion, let us give you the full story of how we came to learn about the “Corrupted Cache” and talk in greater detail about how we’re working to prevent the conditions that seem to trigger it.
Managed Services
AI Dungeon relies on “managed services” for many parts of our tech stack. This means that for technologies like our database, servers, and even AI compute, our technology partners are the ones who are managing the day-to-day operations like setting up physical storage devices, configuring network connections, thinking through data recovery options, etc. This allows us to spend most of our time thinking about making AI Dungeon great, instead of worrying about hardware scaling and configurations. Using managed services is a standard practice for most smaller companies, since managing your own cloud computing and AI resources is an expensive and specialized field of work. We are no exception. Generally, it’s massive organizations like Amazon, Google, Meta, or Microsoft that are at a large enough scale that it makes sense to run their own hardware.
Because of that, it’s pretty unusual for hardware level issues across any of these managed services to come to our team’s attention. When there’s an issue, our vendors are usually the ones identifying, troubleshooting, and servicing any disruptions to service or bugs in the system.
AI Dungeon’s unique traffic load
When it comes to working with AI vendors, we’re a bit of an outlier. We consume a lot of AI compute, which has made us an attractive customer to many AI providers. Because AI is such a new space, it’s unsurprising that many of the AI providers are still relatively new companies. We’ve worked with many of them, and have often found ourselves pushing the limits of what their services can offer. On multiple occasions, the scale of our production traffic on even one AI model has brought a service to its knees.
As an outlier and high-use customer, we are sometimes helping our vendors discover places to shore up their services and identify improvements they need to make to their architecture.
In short…y’all love playing AI Dungeon, and it takes a lot of work to handle all the playing you do 🙂 And that playing has led to the discovery of the corrupted cache phenomenon.
The Corrupted Cache Phenomenon
When you take an action on AI Dungeon, it is sent to one of our AI providers. They have specialized hardware that is configured to receive, process, and return responses from Large Language Models. With each request, the GPU on this specialized hardware is running complex calculations, and storing the outputs in memory.
In rare instances, when the hardware is pushed beyond its limits, instead of outright failing it can exhibit strange behaviors. For instance, we’ve seen models start operating strangely at large context lengths. Or, a model might return complete gibberish. One of the rarest and most unusual behaviors we’re seeing is that the GPU crashes and fails to clear its memory. In other words, the GPU may be working on an AI response, store parts of it in memory, and then crash. When it recovers, it picks up a new task, but assumes the non-wiped data in memory is part of the next response it’s working on. This can cause parts of the output from one AI call (or player story) to be used and sent as part of the output for another player’s story.
As we’ve worked with our vendors to understand this phenomenon, it appears that the memory clearing function is handled at the BIOS level of the AI hardware. BIOS is the essential firmware that is physically embedded into the motherboard of the machine. In other words, it’s not an issue that is easily addressed. The best way to address the issue is to avoid letting the hardware ever get into this state.
As we’ve explored the space, it seems like this issue isn’t widely understood or even discussed. It’s possible that in the event a corrupted cache occurs on other services, it could be dismissed as run-of-the-mill AI hallucination. We anticipate that, over time, this behavior might be observed by other companies and, perhaps, even resolved in future generations of AI hardware.
Fortunately, the set of conditions required to put AI hardware into this state appears to be extremely unusual and rare. In full transparency, neither we nor our partners are able to fully explain what specific conditions cause the cache to be corrupted, nor are we confident that our explanation of how the corrupted cache happens is correct. Hopefully, more information about this will become available over time. That said, we do know how to prevent it.
What we’ve observed
We’ve only had one confirmed case of a corrupted cache occurring, and it happened a few weeks ago with one of our test models on a test environment. We sent testing traffic to an AI server that we didn’t realize was only configured for extremely low traffic, essentially for developer use only. Over time, that server choked on the traffic, and after several days it ended up going into a strange state that our provider has been unable to recreate since (for testing and diagnosing purposes).
In the most unusual of coincidences, the phenomenon was discovered by some of our testers in a private channel shared with our development team. A player shared an unexpected output that seemed like it was related to another player’s story. Our team quickly jumped on, confirmed the issue, and shut down the server. In less than 24 hours, we worked with that vendor to not only get us a correctly scaled AI server, but also put in protections so that model calls fail completely before hitting the threshold where a corrupted cache could occur.
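To make the “fail completely before hitting the threshold” idea concrete, here is a minimal sketch of one common way to do it: refuse new requests once too many are already in flight, rather than letting them pile up. This is an illustration only, not our vendor’s actual protection. The names `FailFastGate`, `OverloadedError`, and `max_in_flight` are hypothetical, and a real limit would come from load testing the specific hardware.

```python
import threading


class OverloadedError(RuntimeError):
    """Raised when a request is rejected instead of being queued."""


class FailFastGate:
    """Hypothetical guard: rejects new model calls once a fixed number are in flight."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    @property
    def in_flight(self) -> int:
        with self._lock:
            return self._in_flight

    def __enter__(self) -> "FailFastGate":
        with self._lock:
            # Fail immediately rather than queueing work the server cannot absorb.
            if self._in_flight >= self._max_in_flight:
                raise OverloadedError("server at capacity; request rejected")
            self._in_flight += 1
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs only for requests that were admitted, whether they succeed or fail.
        with self._lock:
            self._in_flight -= 1


def call_model(gate: FailFastGate, prompt: str) -> str:
    """Placeholder for a real model call, wrapped in the gate."""
    with gate:
        return f"response to: {prompt}"
```

With a guard like this, a caller sees a clear error under heavy load and can retry or switch models, and the server never takes on more work than the configured limit.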
Because the circumstances of this occurrence seemed highly unusual (heavy traffic on a test server) and specific to the configuration of that test server, it felt like a one-off issue. Now, we’re beginning to suspect that, although extremely rare, the issue may not be a one-off occurrence like we thought at the time, which is why we’re bringing this to your attention.
On Tuesday, Jan 7th, 2025, players started reporting slowness and outages with Hermes 3 70B and Hermes 3 405B, which are hosted on a different provider than the one involved in the previous occurrence. During that time, we were seeing players share outputs that we suspect (but haven’t been able to confirm) could have been caused by a similar issue. Due to the uptick in reports around the same time as these models were experiencing issues, we shut down the models out of an abundance of caution.
To be clear, we haven’t been able to confirm whether these are simply AI hallucinations, or a manifestation of a corrupted cache. Even if hallucination is the most likely explanation, we didn’t want to take any chances. We took the models out of circulation until we could ask our vendor to put additional protections in place, or find an alternative hosting partner for Hermes 3 70B and Hermes 3 405B.
What we’re doing
If our theory behind the cause is correct, addressing the root source of the problem appears to be something at the BIOS level of AI hardware. This means that even AI providers (ours or any provider) may not be able to directly address the source of the issue. We may need to wait for this corrupted cache issue to become more widely understood, and for hardware manufacturers to build protections into their firmware.
As we did with the first vendor we saw this with, we’re working with our other vendors to put protections in place. Given what we know now, this will be a requirement for all vendors we work with going forward.
Also, while we may not have visibility into the hardware load of the servers we’re using, we have metrics and alerting for model latency, which can give us an early indication of hardware that might be starting to struggle under load. We’re considering more aggressive interventions as well on our end to direct traffic to different models (alerting players, of course) to completely avoid letting servers get even close to the extremely overloaded state where a corrupted cache has a higher chance of occurring.
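As a rough illustration of how latency can serve as an early warning, the sketch below keeps a rolling window of recent response times and switches to a different model when the 95th percentile climbs past a limit. It is a simplified example under stated assumptions, not a description of our production alerting. The class `LatencyMonitor`, the function `choose_model`, the window size, and the limit are all hypothetical.

```python
from collections import deque
from typing import Optional


class LatencyMonitor:
    """Hypothetical rolling-window monitor for model response times (seconds)."""

    def __init__(self, window: int, p95_limit_s: float) -> None:
        # deque(maxlen=...) silently drops the oldest sample once the window is full.
        self._samples = deque(maxlen=window)
        self._p95_limit_s = p95_limit_s

    def record(self, latency_s: float) -> None:
        self._samples.append(latency_s)

    def p95(self) -> Optional[float]:
        """Nearest-rank 95th percentile of the window, or None if it is empty."""
        if not self._samples:
            return None
        ordered = sorted(self._samples)
        rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
        return ordered[rank - 1]

    def is_struggling(self) -> bool:
        value = self.p95()
        return value is not None and value > self._p95_limit_s


def choose_model(primary: str, fallback: str, monitor: LatencyMonitor) -> str:
    """Route to the fallback model while the primary looks overloaded."""
    return fallback if monitor.is_struggling() else primary
```

A percentile is used here instead of an average so that a handful of very slow responses stands out even when most calls are still fast. That is a design choice for the sketch, and the right signal and threshold would depend on the model and provider.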
We believe that, between the protections we can implement on AI Dungeon and the protections our vendors can provide, we can reduce the chances of this happening from “rare” to “darn near impossible”.
Naturally, we welcome and appreciate players who share their odd model responses. We’ve looked into these reports many times over the years, and most of the time, odd responses are simply AI model hallucination, which is a frequent occurrence with LLMs, especially for those of you who set your temperature high. Occasionally, these reports reveal bugs we need to address in our models or system. In this instance, these reports helped uncover something truly rare.
Thank you for your help.
Hopefully it goes without saying that we take our responsibility to protect any data that passes through our platform very seriously. We apologize to any of you who were disappointed when we took down the Hermes models. We simply couldn’t tolerate even the slightest and rarest of chances of this phenomenon happening on our platform.