Abstract: To successfully manage the development and deployment of machine learning (ML) models, organizations require a platform that provides the tools, standardized settings, and workflows ML teams need to easily build, monitor, and leverage the intelligence and insights of ML models at scale. This introduction highlights the challenges organizations face in building robust, automated, reliable, and production-ready machine learning solutions: managing changes across multiple tools, monitoring components throughout the AI lifecycle, and enabling collaboration among all teams involved in that lifecycle. Organizations need a hybrid cloud infrastructure that allows them to connect MLOps tools hosted on different clouds and by different vendors while enabling low-maintenance integration and a simplified, extensible developer experience. Traditionally, data scientists and AI developers performed data and model storage, training, and prediction on the organization's own infrastructure, whether on-premise or cloud-hosted. These systems supported the development of ML solutions at scale through hardware clustering or distributed training techniques, and the solutions were typically written as self-contained batch programs scheduled and monitored with existing data-processing schedulers. The majority of ML workloads, however, were still too simple to warrant the engineering and deployment work needed for a robust and efficient solution, which allowed other approaches to be followed. Once models were defined and trained, they were typically saved to disk, logged into a version control system, or manually documented in the source code or a ticketing system. Existing systems, generally designed for more traditional software, did not efficiently address the needs of ML development teams, which had very different tooling and priorities. This introduction focuses on vendor-neutral and cloud-agnostic approaches to an MLOps platform that empowers organizations to choose, or easily integrate, multiple open-source or proprietary tools in their workflows and pipelines while abstracting them behind a streamlined API. The proposed platform addresses the aforementioned challenges by offering a set of deployment-ready components, giving organizations more freedom in customizing their MLOps and AI infrastructure management. Finally, the achievements of the MLOps work mentioned above and the expected contributions to the literature are discussed.

Keywords: Data Science Platform; Data Lifecycle Management; Deployment & Monitoring; Machine Learning Platforms; MLOps; MLOps System; ML; AI; AIOps; Development Process.


DOI: 10.17148/IJIREEICE.2022.101216