FAILS: An automated analysis tool from a Distributed Systems perspective
Lab Assignment for "Distributed Systems", Vrije Universiteit 2024/2025
The Framework for Analysis of Incidents and Outages of LLM Services (FAILS) is in the llm_analysis folder, with instructions on how to run it.
Large Language Model (LLM) services have rapidly become essential tools for applications ranging from customer support to content generation, yet their distributed nature makes them prone to failures that impact reliability and uptime. Existing tools for analysing service incidents are either closed-source, lack comparative capabilities, or fail to provide comprehensive insights into failure trends and recovery patterns. To address these gaps, we present FAILS (Framework for Analysis of Incidents and Outages of LLM Services), an open-source system designed to collect, analyse, and visualise incident data from leading LLM providers. FAILS enables users to explore temporal trends, assess reliability metrics such as Mean Time to Recovery (MTTR) and Mean Time Between Failures (MTBF), and gain insights into service co-dependencies using LLM-assisted analysis. With a web-based interface and advanced plotting tools, FAILS helps researchers, engineers, and decision-makers understand and mitigate disruptions caused by LLM service failures.
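To make the two reliability metrics named above concrete, here is a minimal sketch of how MTTR and MTBF could be computed from a list of incidents. It is an illustration, not the FAILS implementation: the function names `mttr` and `mtbf`, the `(started, resolved)` tuple layout, and the sample timestamps are all assumptions. MTBF is taken here as the mean gap between one incident's resolution and the next incident's start, which is one common convention; other definitions exist.

```python
from datetime import datetime, timedelta

# Hypothetical incident records as (started, resolved) pairs.
# The layout and the values are illustrative, not real provider data.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 11, 0)),  # 1 h outage
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 13, 0)),  # 3 h outage
    (datetime(2024, 1, 6, 13, 0), datetime(2024, 1, 6, 15, 0)),  # 2 h outage
]


def mttr(records):
    """Mean Time to Recovery: average of (resolved - started)."""
    if not records:
        raise ValueError("need at least one incident")
    durations = [resolved - started for started, resolved in records]
    return sum(durations, timedelta()) / len(durations)


def mtbf(records):
    """Mean Time Between Failures: average gap between the end of one
    incident and the start of the next (assumed convention)."""
    if len(records) < 2:
        raise ValueError("need at least two incidents")
    ordered = sorted(records)  # sort by start time
    gaps = [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return sum(gaps, timedelta()) / len(gaps)


print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents))
```

With the sample data, the three outages last 1, 3, and 2 hours, and the two gaps between them are 47 and 72 hours, so the averages can be checked by hand.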
By Nishanthi Srinivasan, Bálint László Szarvas and Sándor Battaglini-Fischer.
Many thanks to Xiaoyu Chu and Prof. Dr. Ir. Alexandru Iosup for the support!



