Artificial Intelligence for IT Operations: an Overview

Artificial intelligence for IT operations (AIOps) combines sophisticated methods from deep learning, data stream processing, and domain knowledge to analyse infrastructure data from internal and external sources, with the aim of automating operations and detecting anomalies (unusual system behavior) before they impact the quality of service. Odej Kao, professor at the University of Technology Berlin, gave a keynote presentation about artificial intelligence for IT operations at DevOpsCon Berlin 2021.

According to Kao, log data is the most powerful source of information: it is widely available and lends itself well to processing by AI-based prediction models. As he explained:

In data stream processing we frequently struggle to find sufficient amounts of data. On the other hand, in AIOps we have many different sources (e.g., metric, logs, tracing, events, alerts) with several Terabytes of data produced in a typical IT infrastructure per day. We utilize the power of these hidden gems to assist DevOps administrators and jointly with the AI-models improve the availability, security, and the performance of the overall system.

According to Kao, AI-driven log analytics will be a mandatory component in future Industry 4.0, IoT, smart cities and homes, autonomous driving, data centers, and IT organizations.

Most companies have already set the scene for operating AIOps platforms: monitoring and ELK stacks are in place and need to be extended with AI-based analytics tools to ensure availability, performance, and security, Kao said.
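As a rough illustration of what adding an analytics step to an existing log pipeline could look like, the sketch below reduces log lines to templates by masking variable parts and flags incoming lines whose template was rare or absent in a baseline window. It is a simple frequency heuristic, not the deep learning models Kao described. The masking rules, the function names `to_template` and `unusual_lines`, the `min_share` threshold, and the sample log lines are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical masking rules: replace variable tokens so that
# similar log lines collapse into the same template.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Mask IP addresses, hex values, and numbers in a log line."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line.strip()


def unusual_lines(baseline, incoming, min_share=0.01):
    """Return incoming lines whose template accounts for less than
    min_share of all baseline lines (including unseen templates)."""
    counts = Counter(to_template(line) for line in baseline)
    total = sum(counts.values()) or 1  # avoid division by zero
    return [
        line for line in incoming
        if counts[to_template(line)] / total < min_share
    ]


# Invented sample data: a baseline of routine messages and two new lines.
baseline = [
    f"Accepted connection from 10.0.0.{i} port {4000 + i}"
    for i in range(200)
]
incoming = [
    "Accepted connection from 10.0.0.7 port 4321",
    "Disk /dev/sda1 failed with error 0x1f",
]

flagged = unusual_lines(baseline, incoming)
print(flagged)
```

In this toy run only the disk error is flagged, because its template never occurs in the baseline. A production setup would more likely read the lines from the existing log store and replace the frequency check with a trained model.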

Kao presented how an AIOps workflow can look:

The workflow starts with collecting data from many different sources, e.g. metric data from hardware CPU/mem/net utilization, system logs from logstash, and distributed traces from the resource manager. The hard part here is to get a holistic picture of the current infrastructure: due to virtualization, SDNs, VNFs, etc. the system ..

