Design Overview
The goal of the Kafka Monitor framework is to make it as easy as possible to 1) develop and execute long-running Kafka-specific system tests in real clusters, and 2) monitor an existing Kafka deployment from the user's perspective. Developers should be able to create new tests easily by composing reusable modules that take actions and collect metrics. Users should be able to run Kafka Monitor tests that perform actions on a user-defined schedule against the test cluster, e.g. broker hard kill and cluster bounce, and validate that Kafka still works in accordance with its design.
A typical test may start some producers/consumers, take a predefined sequence of actions periodically, report metrics, and validate metrics against assertions. For example, Kafka Monitor can start one producer and one consumer, and bounce a random broker (say, if it is monitoring a test cluster) every five minutes; the availability and message loss rate can be exposed via JMX metrics that are collected and displayed on a health dashboard in real time; and an alert is triggered if the message loss rate is greater than 0.
To allow tests to be composed from reusable modules, we implement the logic of periodic/long-running actions in services. A service will execute the action in its own thread and export metrics. We have the following services to start with:
- Produce service, which produces messages to Kafka and exports produce rate and availability.
- Consume service, which consumes messages from Kafka and exports message loss rate, message duplicate rate, and end-to-end latency. This service depends on the produce service to provide messages that encode certain information.
- Broker bounce service, which bounces a given broker at a given interval.
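The service abstraction above, a periodic action running in its own thread that exports metrics, can be sketched as follows. This is an illustrative sketch, not the real Xinfra Monitor API (which is written in Java); the class and method names here are hypothetical.

```python
# Hypothetical sketch of the "service" abstraction described above.
# Each service runs its action periodically in its own thread and
# exposes metrics; the real Xinfra Monitor API differs.
import threading
import time


class Service:
    """Runs a periodic action in its own thread and exposes metrics."""

    def __init__(self, interval_seconds):
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.metrics = {}

    def start(self):
        self._thread.start()

    def stop(self):
        # Signal the loop to exit and wait for the thread to finish.
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            self.act()
            self._stop.wait(self._interval)

    def act(self):
        raise NotImplementedError


class CountingService(Service):
    """Toy stand-in for a produce service: just counts invocations."""

    def act(self):
        self.metrics["actions"] = self.metrics.get("actions", 0) + 1


svc = CountingService(interval_seconds=0.01)
svc.start()
time.sleep(0.1)
svc.stop()
print(svc.metrics["actions"] >= 1)  # True
```

A broker bounce service would follow the same shape, with `act()` killing and restarting a broker instead of incrementing a counter.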
A test is composed of services and validates certain assertions either continuously or periodically. For example, we can create a test that includes one produce service, one consume service, and one broker bounce service. The produce and consume services are configured to use the same topic, and the test validates that the message loss rate is constantly 0.
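Because the produce service encodes a sequence number in each message, the consume service can detect gaps and compute a loss rate. The toy sketch below illustrates this idea with an in-memory queue standing in for a Kafka topic; all names are illustrative, not the real API.

```python
# Toy sketch of the loss-rate assertion: the producer encodes a
# sequence number in every message, and the consumer counts gaps.
# An in-memory queue stands in for the Kafka topic.
import queue


def run_test(num_messages, drop=frozenset()):
    topic = queue.Queue()
    # Produce service: encode a sequence number in each message.
    for seq in range(num_messages):
        if seq not in drop:  # simulate message loss
            topic.put({"seq": seq})
    # Consume service: count gaps in the observed sequence numbers.
    expected, lost = 0, 0
    while not topic.empty():
        msg = topic.get()
        lost += msg["seq"] - expected
        expected = msg["seq"] + 1
    # Also count messages lost at the tail of the sequence.
    lost += num_messages - expected
    return lost / num_messages


# Assertion the test validates continuously: loss rate is 0.
print(run_test(100))          # 0.0
print(run_test(100, {5, 6}))  # 0.02
```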
Finally, a given Kafka Monitor instance runs on a single physical machine, and multiple tests can run in one Kafka Monitor instance. The diagram below demonstrates the relationships between services, tests, and a Kafka Monitor instance, as well as how Kafka Monitor interacts with Kafka and the user.
While all services in the same Kafka Monitor instance must run on the same physical machine, we can start multiple Kafka Monitor instances in different clusters that coordinate to orchestrate a single end-to-end test. In the test described by the diagram below, we start two Kafka Monitor instances in two clusters. The first instance contains one produce service that produces to Kafka cluster 1. The messages are then mirrored from cluster 1 to cluster 2. Finally, the consume service in the second instance consumes messages from the same topic and exports the end-to-end latency of this cross-cluster pipeline.
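The cross-cluster latency measurement works because the produce service embeds a send timestamp in each message payload, which the consume service in the other cluster subtracts on arrival. The sketch below illustrates the idea; the function names and payload format are hypothetical, and the real services use Kafka clients and report via JMX.

```python
# Sketch of cross-cluster end-to-end latency measurement: the produce
# service embeds a send timestamp in each message, and the consume
# service subtracts it on arrival. Payload format is illustrative.
import json
import time


def encode_message(seq):
    """What a produce service might embed in each message payload."""
    return json.dumps({"seq": seq, "send_ts_ms": time.time() * 1000})


def end_to_end_latency_ms(payload):
    """What a consume service computes when the message arrives."""
    msg = json.loads(payload)
    return time.time() * 1000 - msg["send_ts_ms"]


payload = encode_message(0)  # produced in cluster 1
time.sleep(0.05)             # stand-in for mirroring delay
print(end_to_end_latency_ms(payload) >= 40)  # True
```

Note that this scheme assumes the clocks of the two Kafka Monitor instances are reasonably synchronized, since the timestamp is compared across machines.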
Xinfra Monitor (KMF): https://github.com/linkedin/kafka-monitor/
For inquiries or issues: https://github.com/linkedin/kafka-monitor/issues/new