Apache Kafka

Apache Kafka is a distributed event streaming platform designed to handle trillions of events daily, making it ideal for real-time data processing.

Notes

"Kafka is designed to manage high-throughput, low-latency systems." (Apache Kafka Team)

Apache Kafka was initially developed at LinkedIn and later open-sourced, becoming part of the Apache Software Foundation in 2011. It is built to handle massive volumes of data efficiently with high throughput and fault tolerance, making it well suited for real-time data pipelines and streaming applications.

Takeaways

  • 📌 Kafka’s distributed architecture allows it to manage large volumes of data across multiple servers, providing scalability, durability, and fault tolerance.
  • 💡 It retains all messages for a configurable period (default: 7 days), enabling consumers to read at their own pace.
  • 🔍 Capable of handling trillions of events per day.

Understanding Apache Kafka

🐦 Kafka is designed to manage high-throughput, low-latency systems, making it ideal for real-time data processing pipelines.

Key Kafka Concepts

  • Producers: Entities that publish (or send) records/data to Kafka topics.
  • Consumers: Entities that subscribe to one or more topics and process the feed of published messages.
  • Topics: Named categories or feeds to which records are published. Producers write data to topics, and consumers read from them (a minimal producer sketch follows this list).
  • Partitions: A way to split a topic into multiple streams, allowing parallelism and scalability. Each partition is an ordered, immutable sequence of records.
  • Brokers: Kafka servers that store data for a set period and serve data to clients. A broker can host multiple partitions.
  • Kafka Raft (KRaft): The built-in consensus mechanism replacing ZooKeeper for cluster management, metadata maintenance, and controller elections. Available since Kafka 2.8 and the default as of Kafka 3.3+.
    • ZooKeeper (Legacy): Previously used for cluster management in older Kafka versions. It managed configuration, naming, synchronization, and leader election but has been deprecated in favor of KRaft.
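
To make the producer side concrete, below is a minimal sketch of a Java producer built with the kafka-clients library. The broker address (localhost:9092), the topic name (user-activity), and the key/value shown are illustrative assumptions, not part of any specific deployment.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Assumed local broker; point this at your own cluster's bootstrap servers.
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish one record to the hypothetical "user-activity" topic.
                // Records sharing a key are routed to the same partition, preserving per-key order.
                producer.send(new ProducerRecord<>("user-activity", "user-42", "clicked-checkout"));
                producer.flush();
            }
        }
    }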

How It Works

  1. Producer writes data to a topic.
  2. The data is split into partitions spread across brokers and replicated, providing parallelism and fault tolerance.
  3. Consumers read the data from these partitions, processing it as needed (a matching consumer sketch follows).
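
As a companion to the producer sketch above, here is a minimal consumer using the same kafka-clients library. The group id (activity-readers) and topic name are assumptions for illustration; in a real deployment, the members of a consumer group divide the topic's partitions among themselves.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class SimpleConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
            props.put("group.id", "activity-readers");          // consumers in a group share the topic's partitions
            props.put("auto.offset.reset", "earliest");         // start from the oldest retained record
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("user-activity"));
                while (true) {
                    // Each poll returns records from the partitions assigned to this consumer.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                                record.partition(), record.offset(), record.key(), record.value());
                    }
                }
            }
        }
    }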

Core Features

  1. Scalability: The distributed architecture handles vast amounts of data across many servers; scaling is achieved primarily through partitions and brokers.
  2. Durability: Retains all messages for a configurable period (default: 7 days), so consumers can read messages at their own pace, regardless of when they were produced.
  3. Fault Tolerance: Partition replication ensures that data is not lost if one broker fails; producers and consumers can handle failures without significant downtime.
  4. High Throughput: Optimized to ingest massive volumes of data from numerous producers simultaneously. (A topic-creation sketch illustrating these settings follows this list.)
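
The scalability, durability, and fault-tolerance features above map directly onto topic settings. The sketch below uses the Kafka AdminClient to create a topic with six partitions, a replication factor of three, and a custom retention period; the topic name and the specific values are illustrative assumptions.

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

            try (AdminClient admin = AdminClient.create(props)) {
                // 6 partitions for parallelism; replication factor 3 so data survives a broker failure.
                NewTopic topic = new NewTopic("user-activity", 6, (short) 3)
                        // Override the broker-level default retention (typically 7 days) with 3 days.
                        .configs(Map.of("retention.ms", String.valueOf(3L * 24 * 60 * 60 * 1000)));
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }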

Use Cases

  1. Log Aggregation: Collecting logs from various services and making them available for analysis.
  2. Metrics Collection: Gathering system metrics such as CPU usage and disk I/O.
  3. Activity Tracking: Recording user activities across web applications or mobile apps.
  4. Stream Processing: Real-time data processing using the Kafka Streams API (see the sketch after this list).
  5. Event Sourcing: Capturing the full history of changes to a system's state.
  6. Microservices Communication: Enabling communication between microservices in a scalable manner.
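
As a sketch of use case 4, the Kafka Streams application below reads the assumed user-activity topic, keeps only records whose value mentions a checkout, and writes them to a downstream checkout-events topic. The application id and topic names are hypothetical placeholders.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class ActivityFilter {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "activity-filter");    // assumed application id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed local broker
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Process each record as it arrives: keep only checkout events and forward them downstream.
            KStream<String, String> activity = builder.stream("user-activity");
            activity.filter((key, value) -> value.contains("checkout"))
                    .to("checkout-events");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }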

Thoughts

  • 🛠️ Producer & Consumer API : Interfaces for publishing and consuming records in Kafka.
  • 🌍 Distributed Architecture : Ensures high availability and fault tolerance.
  • Scalability : Easily handles increasing data loads by adding more servers to the cluster.