From MLOps to LLMOps

Machine learning operations, or MLOps, emerged to bring structure to deploying and maintaining models. But the rise of large language models has stretched these practices to the breaking point. The challenges of building, fine-tuning, and serving LLMs have given rise to a new discipline often called LLMOps.
At training time, the sheer scale of LLMs requires distributed orchestration across hundreds or thousands of GPUs. Sharding, pipeline parallelism, and tensor parallelism are not optimizations but necessities. At inference, the size of these models makes serving costly, pushing engineers toward compression strategies such as quantization and toward parameter-efficient fine-tuning methods such as low-rank adaptation.
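As a concrete illustration of the inference side, here is a minimal sketch that loads a model with 4-bit quantization and attaches a low-rank adapter. It assumes the Hugging Face transformers, bitsandbytes, and peft libraries, which this article does not prescribe, and the model identifier is a placeholder.

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    # 4-bit NF4 quantization: weights are stored compressed and
    # dequantized on the fly, cutting serving memory several-fold.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",  # placeholder model id
        quantization_config=bnb_config,
        device_map="auto",
    )

    # Low-rank adaptation: train small rank-r update matrices instead
    # of the full weights, so fine-tuning touches under 1% of parameters.
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],  # attention projections
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()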
Monitoring also looks different. Instead of tracking accuracy and loss alone, LLMOps requires monitoring hallucination rates, bias, and user satisfaction. Feedback loops often involve reinforcement learning from human feedback (RLHF) or continual fine-tuning to adapt to shifting data distributions.
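As a rough sketch of what such monitoring can look like in practice, the snippet below logs each generation together with simple quality signals. The flag_hallucination heuristic and the log format are hypothetical placeholders, not an established API; production systems typically use NLI models or LLM-as-judge scoring instead.

    import json
    import time

    def flag_hallucination(response: str, sources: list[str]) -> bool:
        # Hypothetical heuristic: flag a response that shares no text
        # overlap with any retrieved source passage.
        return not any(s.lower() in response.lower() for s in sources)

    def log_generation(prompt, response, sources, user_rating=None):
        record = {
            "timestamp": time.time(),
            "prompt": prompt,
            "response": response,
            "hallucination_flagged": flag_hallucination(response, sources),
            "user_rating": user_rating,  # e.g. thumbs up/down from the UI
        }
        # Append-only log; downstream jobs aggregate these records into
        # dashboards tracking hallucination rate and user satisfaction.
        with open("llm_monitoring.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")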
The ecosystem of tools is evolving quickly. Libraries such as DeepSpeed (distributed training) and vLLM (high-throughput serving) handle efficiency, while frameworks like LangChain and Haystack focus on orchestration for retrieval-augmented generation. Cloud providers now offer managed services specifically for large models, but open-source communities are pushing alternatives to keep access broad.
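To give a flavor of how these tools surface to the engineer, here is a minimal sketch of offline batch inference with vLLM's Python API; the model identifier is a placeholder, and serving details beyond this basic call are omitted.

    from vllm import LLM, SamplingParams

    # vLLM batches requests continuously and manages KV-cache memory
    # with PagedAttention, which keeps GPU utilization high.
    llm = LLM(model="facebook/opt-125m")  # placeholder model id

    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

    prompts = [
        "Summarize the difference between MLOps and LLMOps.",
        "List three operational challenges of serving large language models.",
    ]

    for output in llm.generate(prompts, sampling_params):
        print(output.prompt, "->", output.outputs[0].text)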
Just as DevOps became inseparable from software engineering, and MLOps became inseparable from machine learning, LLMOps is becoming inseparable from large language models. It represents a new operational philosophy for the era of foundation models.
References
Kaddour et al., "Challenges and Applications of Large Language Models." https://arxiv.org/abs/2307.10169
DeepSpeed. https://www.deepspeed.ai/
vLLM. https://www.vllm.ai/