Concepts of Apache Kafka

Saurav Kumar
11 min read · Sep 25, 2023


Apache Kafka is a popular open-source stream processing platform that is widely used for building real-time data pipelines and event-driven applications. It was originally developed by LinkedIn and later open-sourced as an Apache project. Kafka is designed to handle high-throughput, fault-tolerant, scalable data streaming and it relies on a set of internal components and mechanisms to achieve these goals. Here are some key concepts associated with Apache Kafka and an overview of how Kafka works internally:

  1. Topics:
  • In Kafka, data is organized into topics. A topic is a logical channel or category to which messages are published by producers and from which messages are consumed by consumers.
  • Topics allow you to categorize and organize the data streams based on different data sources, events, or use cases.

2. Producer:

  • Producers are applications or systems that push data into Kafka topics.
  • They are responsible for creating and publishing messages to Kafka topics.
  • Producers can be configured to send messages to one or more Kafka topics.
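
To make the producer side concrete, here is a minimal sketch using Kafka's Java client (kafka-clients). The broker address localhost:9092, the topic name orders, and the key/value contents are placeholders for illustration, not part of any particular setup.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one message to the "orders" topic (hypothetical topic name).
            producer.send(new ProducerRecord<>("orders", "order-42", "created"));
            producer.flush();
        }
    }
}
```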

3. Consumer:

  • Consumers are applications or systems that subscribe to Kafka topics and process messages.
  • Consumers can read messages from one or more partitions of a topic in parallel.
  • Kafka supports both standalone consumers and consumer groups for parallel processing of messages.
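
And here is a matching consumer sketch with the same Java client. The group.id ties the consumer to a consumer group (covered in point 8 below); the broker address, group name, and topic are assumed values.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker
        props.put("group.id", "order-processors");         // assumed consumer group name
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // poll() fetches records from the partitions assigned to this consumer.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```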

4. Broker:

  • Kafka brokers are the servers or nodes that make up a Kafka cluster.
  • Brokers are responsible for storing and serving messages to consumers.
  • Kafka clusters consist of multiple brokers for redundancy and scalability.

5. Partition:

  • Topics are divided into partitions, which are the basic units of parallelism and scalability in Kafka.
  • Each partition is a linearly ordered, immutable sequence of messages.
  • Partitions allow Kafka to distribute and parallelize the processing of data across multiple brokers and consumers.
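
A small illustration of how partitioning interacts with message keys: with the default partitioner, records that share a key are hashed to the same partition, which preserves per-key ordering. The topic, keys, and the explicitly chosen partition number below are hypothetical.

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitioningExample {
    public static void main(String[] args) {
        // Records with the same key are hashed by the default partitioner to the
        // same partition, so events for "customer-7" stay in order.
        ProducerRecord<String, String> first  = new ProducerRecord<>("orders", "customer-7", "created");
        ProducerRecord<String, String> second = new ProducerRecord<>("orders", "customer-7", "paid");

        // A partition can also be chosen explicitly (partition 2 here).
        ProducerRecord<String, String> pinned = new ProducerRecord<>("orders", 2, "customer-9", "created");

        System.out.println(first.key() + " and " + second.key() + " share a partition; "
                + pinned.key() + " is pinned to partition " + pinned.partition());
    }
}
```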

6. Replication:

  • Kafka provides data redundancy and fault tolerance through replication.
  • Each partition can have multiple replicas distributed across different brokers.
  • Replicas ensure that data is not lost in case of broker failures.

7. Offset:

  • An offset is a unique identifier for each message within a partition.
  • Consumers use offsets to keep track of the messages they have consumed.
  • Kafka retains messages for a configurable retention period, allowing consumers to replay past messages if needed.
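
Because offsets are just positions in a partition's log, a consumer can rewind and replay messages that are still within the retention period. Here is a rough sketch using assign() and seek(); the topic, partition number, group name, and broker address are assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker
        props.put("group.id", "replay-demo");               // assumed group name
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("orders", 0);
            consumer.assign(Collections.singletonList(partition));

            // Rewind to offset 0 (or any earlier offset still within retention)
            // to re-read past messages from this partition.
            consumer.seek(partition, 0L);
            consumer.poll(Duration.ofMillis(500))
                    .forEach(r -> System.out.println(r.offset() + ": " + r.value()));
        }
    }
}
```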

8. Consumer Groups:

  • Consumer groups are a way to parallelize message processing in Kafka.
  • Consumers within a group coordinate to consume messages from a topic.
  • Each message in a topic partition is consumed by only one consumer within a group.

9. ZooKeeper (deprecated in recent versions):

  • Kafka originally depended on Apache ZooKeeper for distributed coordination and management.
  • In newer Kafka versions, ZooKeeper is being phased out in favor of the Kafka Controller for better scalability and simplicity.

10. Kafka Connect:

  • Kafka Connect is a framework for building and running connectors to integrate Kafka with various data sources and sinks (databases, file systems, cloud services, etc.).
  • It simplifies the process of getting data in and out of Kafka.

11. Stream Processing:

  • Kafka Streams is a library for building real-time stream processing applications on top of Kafka.
  • It enables applications that consume data from Kafka topics, perform computations, and produce results back to Kafka or external systems.

The flow of messages:

  1. Messages are produced by producers and written to Kafka topics.
  2. Kafka’s partitioning mechanism determines which partition within a topic a message should go to.
  3. The message is written to the leader replica of that partition.
  4. Consumers within a consumer group subscribe to topics and are assigned partitions to read from.
  5. Each consumer reads messages from the partition(s) it is assigned to, starting from the last committed offset.
  6. As consumers process messages, they advance their offsets to keep track of the last successfully consumed message.
  7. If a consumer fails or new consumers join the group, Kafka automatically reassigns partitions to maintain load balancing.
  8. Consumers can commit their offsets periodically or based on some processing checkpoint.

Offset management is crucial for consumers to keep track of their position in the partition and avoid reprocessing messages.
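
As a sketch of that offset management, the consumer below disables auto-commit and commits offsets only after a batch has been processed, so a crash leads to reprocessing rather than message loss. The broker address, group, topic, and the process() helper are illustrative only.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker
        props.put("group.id", "order-processors");          // assumed group name
        props.put("enable.auto.commit", "false");           // commit offsets ourselves
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);  // application-specific processing (hypothetical helper)
                }
                // Commit only after the batch has been processed, so a crash
                // before this point means the batch is re-read, not lost.
                if (!records.isEmpty()) {
                    consumer.commitSync();
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.offset() + " -> " + record.value());
    }
}
```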

What if none of the replicas is available?

In Apache Kafka, replication is a fundamental mechanism for ensuring data availability and fault tolerance. Each partition can have multiple replicas, typically including a leader and one or more followers. If none of the replicas for a partition is available, it can result in data unavailability and potential data loss. Here’s what can happen if none of the replicas is available:

  1. Data Unavailability:
  • Kafka relies on replication to ensure that data is available for consumption even in the face of broker failures.
  • If none of the replicas for a partition is available, there will be no source from which consumers can read messages for that partition.
  • This can lead to data unavailability, and consumers won’t be able to access the messages stored in that partition.

2. Message loss:

  • Kafka is designed to maintain data durability through replication. When a producer sends a message and receives acknowledgment, it assumes that the message is safely stored on at least one replica.
  • If none of the replicas is available, it means that the message may not have been successfully stored anywhere in the Kafka cluster.
  • This can result in potential data loss, as there would be no guaranteed copy of the message.

3. Service Disruption:

  • Kafka’s replication mechanism is essential for maintaining high availability and fault tolerance. It allows Kafka to continue functioning even if a broker or replica fails.
  • If none of the replicas are available, the Kafka topic or partition essentially becomes unavailable, which can disrupt services and applications relying on the data.

4. Data Recovery Challenges:

  • In situations where none of the replicas are available, recovering the lost data can be challenging.
  • Recovery efforts may involve examining data on the failed broker’s disks (if possible) or trying to restore data from backups.
  • These recovery processes can be time-consuming and may result in data loss if not handled carefully.

To mitigate the risks associated with a scenario where none of the replicas are available, it’s essential to ensure that Kafka clusters are properly configured with an adequate number of replicas and that data durability and replication settings are appropriately tuned. This includes setting replication factors to ensure that multiple copies of data exist and maintaining adequate monitoring and backup procedures to respond to potential issues promptly.
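
As a rough example of such tuning, the sketch below creates a topic with a replication factor of 3 and min.insync.replicas=2 via the Java AdminClient, and sets acks=all on the producer side so writes are acknowledged only once the in-sync replicas have them. The topic name, partition count, and broker address are assumptions.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.ProducerConfig;

public class DurabilitySetup {
    public static void main(String[] args) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", "localhost:9092");  // placeholder broker list

        try (AdminClient admin = AdminClient.create(adminProps)) {
            // 6 partitions, replication factor 3: each partition gets a leader
            // plus two follower replicas on different brokers.
            NewTopic topic = new NewTopic("orders", 6, (short) 3)
                    .configs(Collections.singletonMap("min.insync.replicas", "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }

        // On the producer side, acks=all makes the broker acknowledge a write only
        // after the in-sync replicas have it, trading some latency for durability.
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.ACKS_CONFIG, "all");
        // ...pass producerProps (plus serializers) to a KafkaProducer as in the earlier sketch.
    }
}
```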

What if consumers went down?

If Kafka consumers go down, it generally doesn’t result in data loss or data unavailability, as Kafka is designed to be resilient and provide fault tolerance. However, the impact on your system’s real-time processing and message consumption depends on the following factors:

  1. Consumer Group:
  • If a single consumer within a consumer group goes down, the Kafka cluster automatically rebalances the partitions among the remaining consumers in the group.
  • The partitions that were previously assigned to the failed consumer are reassigned to active consumers.
  • This rebalancing ensures that message consumption continues even if individual consumers fail.

2. Offset Commit:

  • Kafka consumers typically commit their offsets periodically or based on some processing checkpoint.
  • If a consumer crashes before committing its offset, upon recovery, it will resume consuming from the last committed offset.
  • This means it may reprocess some messages, but no data is lost.

3. Consumer Group Liveliness:

  • If all consumers within a consumer group were to go down simultaneously (for example, due to a cluster-wide failure or all consumers crashing), there would be no active consumers to read new messages.
  • In this case, Kafka would continue to produce messages, and those messages would accumulate in the topics.
  • Once at least one consumer in the group is back online, it will start consuming messages from where it left off, including any messages that were produced during the downtime.

4. Message Retention:

  • Kafka retains messages in topics for a configurable retention period.
  • As long as the retention period has not expired, consumers can catch up on missed messages when they come back online.
  • If the retention period has expired, older messages may be deleted from Kafka, and they cannot be recovered.

5. Consumer Redundancy:

  • To ensure high availability and fault tolerance, it’s common practice to have multiple instances of a consumer within a consumer group.
  • If one instance goes down, others can continue processing messages without interruption.

In summary, Kafka is designed to handle consumer failures gracefully. If individual consumers go down, Kafka’s consumer group rebalancing and offset tracking mechanisms ensure that message consumption can continue without data loss. However, if all consumers within a group go down simultaneously, new messages will accumulate in topics until at least one consumer is back online to resume consumption. Properly configuring and monitoring your Kafka consumers and consumer groups is essential to ensure that your system can recover from consumer failures effectively.
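
To see this rebalancing behavior from the application side, a consumer can register a ConsumerRebalanceListener, which is invoked when partitions are taken away from or handed to the instance. The sketch below only logs those events; the broker, group, and topic names are placeholders.

```java
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RebalanceAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker
        props.put("group.id", "order-processors");          // assumed group name
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("orders"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Called before a rebalance takes partitions away, e.g. when another
                // consumer joins the group or this instance is shutting down.
                System.out.println("Revoked: " + partitions);
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Called after a rebalance; consumption resumes from the last
                // committed offset of each newly assigned partition.
                System.out.println("Assigned: " + partitions);
            }
        });

        while (true) {
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(r -> System.out.println(r.partition() + ":" + r.offset()));
        }
    }
}
```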

ZooKeeper and its requirements:

Apache ZooKeeper is a centralized, open-source coordination service that is often used in distributed systems to manage configuration, maintain synchronization, provide distributed locking, and handle leader elections. In the context of Apache Kafka, ZooKeeper was historically used for critical cluster coordination and management tasks. However, it’s important to note that recent versions of Kafka (2.8.0 and later) have been moving away from ZooKeeper and towards the Kafka Controller, a built-in component for managing cluster metadata. The following explanation outlines how ZooKeeper was used in Kafka and how it helped:

Why ZooKeeper was used in Kafka:

  1. Distributed Coordination: Kafka is designed as a distributed system that typically runs on a cluster of multiple brokers. Coordination between these brokers is essential for tasks such as leader election, maintaining broker health, and managing partition assignments to consumers.
  2. Leader Election: Each partition in Kafka has a leader broker that is responsible for handling reads and writes. In case of a broker failure, ZooKeeper was used to elect a new leader for the affected partitions to ensure data availability.
  3. Cluster Metadata Management: ZooKeeper was used to store and manage metadata about the Kafka cluster, including broker information, topic configuration, and partition assignments.
  4. Failover and Recovery: In case of a broker failure or network partition, ZooKeeper helped Kafka maintain consistency and manage recovery procedures.
  5. Locking and Synchronization: ZooKeeper provided mechanisms for distributed locking and synchronization, which Kafka used to coordinate activities like partition reassignment, controller election, and configuration updates.
  6. Health Monitoring: ZooKeeper could be used to monitor the health and availability of Kafka brokers and other components. This helped Kafka detect and respond to issues promptly.

However, it’s worth noting that managing ZooKeeper alongside Kafka added complexity to Kafka clusters, and maintaining ZooKeeper required additional operational overhead.

As of Kafka 2.8.0 and later, Kafka has been working to remove its dependency on ZooKeeper by introducing the Kafka Controller, which handles many of the coordination and metadata management tasks internally. This simplifies Kafka cluster management and reduces the operational overhead associated with maintaining ZooKeeper. While ZooKeeper was crucial in Kafka’s earlier versions, its role is diminishing in favor of a more integrated Kafka Controller approach. Users are encouraged to adopt Kafka versions that support this transition.

Kafka Controller in Kafka

The Kafka Controller is a critical component that, in Apache Kafka 2.8.0 and later, handles various coordination and management tasks within a Kafka cluster. It plays a central role in managing the Kafka broker cluster and is designed to reduce Kafka's dependency on Apache ZooKeeper for certain operational tasks. Here’s an overview of the Kafka Controller and how it differs from ZooKeeper:

Kafka Controller:

  1. Cluster Management: The Kafka Controller is responsible for managing the overall health and state of the Kafka cluster. It monitors the availability of Kafka brokers and partitions, making decisions to maintain the cluster’s stability.
  2. Partition Leadership: One of the primary tasks of the Controller is managing partition leadership. Each partition in Kafka has a leader broker responsible for handling reads and writes. The controller ensures that when a leader broker becomes unavailable (e.g., due to a failure), a new leader is elected, maintaining data availability.
  3. Reassignment Handling: The Controller is in charge of handling partition reassignments. When partitions need to be moved between brokers (e.g., for load balancing or broker replacement), the Controller coordinates these reassignments.
  4. Broker Registration and Deregistration: It manages broker registration and deregistration. When a new broker joins the cluster or an existing one leaves, the Controller updates the cluster’s metadata accordingly.
  5. Topic and Partition Management: The Controller is responsible for managing topic and partition metadata, including creation, deletion, and configuration changes.
  6. Preferred Replica Election: In situations where the preferred replica for a partition needs to be changed (e.g., to optimize fault tolerance), the Controller facilitates this process.

Difference between Kafka Controller and ZooKeeper:

  1. Dependency:
  • Kafka Controller is a built-in component of Apache Kafka, whereas ZooKeeper is an external, separate service that Kafka traditionally depended on for coordination and management tasks.

2. Simplicity:

  • Kafka Controller simplifies Kafka cluster management by reducing the number of external dependencies. It eliminates the need for running and maintaining ZooKeeper alongside Kafka.

3. Integrated Design:

  • The Controller is designed specifically for Kafka’s needs, whereas ZooKeeper is a generic coordination service used in various distributed systems.
  • Kafka Controller’s design is tailored to Kafka’s requirements, making it more efficient and straightforward for Kafka-specific tasks.

4. Transition Away from ZooKeeper:

  • Kafka Controller is a part of Kafka’s effort to reduce its dependency on ZooKeeper. While Kafka still supports ZooKeeper for backward compatibility, the goal is to eventually eliminate this dependency in future Kafka versions.

In summary, the Kafka Controller is a component introduced in recent Kafka versions to handle critical coordination and management tasks within a Kafka cluster. It is designed to simplify Kafka cluster management, reduce external dependencies (like ZooKeeper), and provide better control and efficiency for Kafka-specific operations. Kafka users are encouraged to adopt Kafka versions that support the transition to the Kafka Controller for improved cluster management and reduced operational complexity.

Kafka Streams

Kafka Streams is a powerful and lightweight stream processing library and framework that is part of the Apache Kafka ecosystem. It allows developers to build real-time data processing applications that can consume, process, and produce data streams from Kafka topics. Kafka Streams is designed to make stream processing simple, scalable, and fault-tolerant, and it provides a native and integrated approach to processing data within Kafka.

Key characteristics and features of Kafka Streams include:

  1. Stream Processing: Kafka Streams enables the processing of continuous, real-time data streams. It is ideal for scenarios where data is generated continuously and needs to be processed as it arrives.
  2. Integration with Kafka: Kafka Streams is tightly integrated with Kafka, making it easy to consume data from Kafka topics, perform processing, and produce results back to Kafka topics. It leverages Kafka’s built-in fault tolerance, scalability, and durability.
  3. Stateful Processing: Kafka Streams allows developers to build stateful processing applications. You can maintain and update local state stores, which can be queried and joined with incoming data streams.
  4. Event Time Processing: It supports event time processing, making it suitable for handling out-of-order data and time-based aggregations.
  5. Exactly Once Semantics: Kafka Streams provides support for exactly-once processing semantics, ensuring that each message is processed and produced only once, even in the presence of failures.
  6. Windowed Aggregations: You can perform windowed aggregations, such as tumbling, hopping, and sliding windows, which are useful for time-based analytics.
  7. Join Operations: Kafka Streams supports stream-table joins and stream-stream joins, allowing you to combine data from different sources for complex processing.
  8. Interactive Queries: It enables interactive queries to access the state stores, making it possible to build interactive applications and dashboards on top of real-time data.
  9. Distributed and Scalable: Kafka Streams applications can be deployed across multiple instances, allowing for horizontal scaling and distributing the processing load.
  10. Integration with External Systems: While Kafka Streams is designed to work seamlessly with Kafka, it can also integrate with external systems and services when needed.
  11. Built-in Error Handling: It provides built-in error handling and recovery mechanisms, ensuring that processing continues even in the presence of transient failures.

Kafka Streams is commonly used for a wide range of stream processing applications, including real-time analytics, monitoring, fraud detection, recommendation systems, and more. Its integration with Kafka and the ability to leverage Kafka’s reliability and scalability make it a popular choice for building event-driven, real-time applications within the Kafka ecosystem.
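
As a flavor of the API, here is a minimal word-count topology, the canonical Kafka Streams example: it reads lines from an input topic, keeps a running count per word in a local state store, and writes the counts to an output topic. The application id, broker address, and topic names (text-input, word-counts) are assumptions for this sketch.

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo");    // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input");  // assumed input topic

        // Split each line into words, group by word, and keep a running count
        // in a local state store backed by a changelog topic in Kafka.
        KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                .groupBy((key, word) -> word)
                .count();

        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```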
