Apache Kafka - Part 1

In this post, I share my notes on Apache Kafka, based on Stephane Maarek's course. Enjoy!

Topics, partitions and offsets

  • A topic is a particular stream of data.
  • A Kafka cluster may have many topics.
  • The sequence of messages is called a data stream.
  • Topics are split into partitions. Messages within each partition are ordered and indexed with incremental ids called offsets; the ids keep increasing as new messages arrive. (A topic-creation sketch follows this list.)
[Image: Producers, topics and partitions]
  • Kafka topics are immutable: once data is written to a partition, it cannot be changed.
  • Data is kept only for a limited time (the default is one week, but it's configurable).
  • Order is guaranteed only within a partition.
  • Data is assigned to a partition randomly unless a key is provided.
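
To make topics and partitions concrete, here is a minimal sketch of creating a topic programmatically with the Java AdminClient. The topic name "orders", the broker address, and the partition count are assumptions for the example, not part of the original notes.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 1 (fine for a single-broker dev setup)
            NewTopic orders = new NewTopic("orders", 3, (short) 1);
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```

With three partitions, ordering is only guaranteed per partition, which is why producers that need per-key ordering send a key (next section).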

Producers

  • Producers write data to topics.
  • Producers know which partition to write to (and which Kafka broker has it).
  • Producers can choose to send a key with the message (string, number, binary, etc.); see the producer sketch after this list.
    • If the key is null, the partition is chosen by round robin.
    • If the key is not null, all messages with the same key go to the same partition (the key is hashed).
  • What does a Kafka message look like?
[Image: Kafka message structure]
  • The key and the value are serialized before they are sent to Kafka; the consumer then deserializes them after receiving the message (i.e., messages travel over the wire as bytes).
[Image: Kafka message serializer]
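
To illustrate keys and serializers together, here is a minimal producer sketch; the topic "orders", the broker address, and the key/value strings are assumptions for the example.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Serializers turn the key and value into bytes before sending
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition (the key is hashed), so per-key order holds
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
            // Null key -> the partitioner spreads messages across partitions
            producer.send(new ProducerRecord<>("orders", null, "anonymous event"));
            producer.flush();
        }
    }
}
```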

Consumers

  • Consumers read data from a topic (pull model).
  • Consumers know which broker and partition to read from.
  • In case of broker failures, consumers know how to recover.

Consumer Deserializer

  • Deserializes binary data to its original form.
[Image: Kafka consumer deserializer]
  • The consumer must know what kind of data the topic carries; this means a topic can carry only a single type of data, and that type can't change during the topic's lifetime. (A consumer sketch follows.)
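
Putting the pull model and deserialization together, here is a minimal consumer sketch; the group id "order-service", the topic "orders", and the broker address are assumptions for the example.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-service");
        // Deserializers must match what the producer's serializers wrote
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // Pull model: the consumer asks the broker for new records
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```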

Consumer Groups

  • All the consumers of an application read data as a consumer group.
  • Each consumer within a group reads from exclusive partitions.
[Image: Kafka consumer groups - 1]
  • If you have more consumers than partitions, some consumers will be inactive.
[Image: Kafka consumer groups - 2]
  • Multiple consumer groups can read from the same topic.
[Image: Kafka consumer groups - 3]
  • Consumer groups can be thought of as services.
  • Consumers know which group they belong to via their group.id property. (A sketch follows this list.)
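
As a sketch of the "consumer groups as services" idea, the snippet below builds configs for two hypothetical services. Because their group.ids differ, each service independently receives every message of the topic; two consumers sharing one group.id would instead split the partitions between them.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;

import java.util.Properties;

public class GroupIdsSketch {
    // Hypothetical helper: shared base config, varying only the group id
    static Properties configFor(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        return props;
    }

    public static void main(String[] args) {
        // Different group ids: each service gets its own copy of every message
        Properties billing = configFor("billing-service");
        Properties analytics = configFor("analytics-service");
        System.out.println(billing.getProperty(ConsumerConfig.GROUP_ID_CONFIG));
        System.out.println(analytics.getProperty(ConsumerConfig.GROUP_ID_CONFIG));
    }
}
```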

Consumer Offsets

  • Kafka stores the offsets at which a consumer group has been reading.
  • When a consumer in a group is processing data received from Kafka, it should periodically commit its offsets.
  • If a consumer dies, it will be able to resume from where it left off thanks to the committed consumer offsets.
  • By default, Java consumers automatically commit offsets (at-least-once).
  • There are 3 delivery semantics if you choose to commit manually (an at-least-once sketch follows this list):
    • At least once (usually preferred):
      • Offsets are committed after the message is processed.
      • If the processing goes wrong, the message will be read again.
      • This can result in duplicate processing of messages, so the user should make sure the system won't be impacted by messages that are processed again.
    • At most once:
      • Offsets are committed as soon as messages are received.
      • If the processing goes wrong, some messages will be lost (they won’t be read again)
    • Exactly once:
      • Messages are processed only once (for Kafka-to-Kafka workflows, this is achieved with the transactional API, e.g. via Kafka Streams).
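
Here is a minimal at-least-once sketch: auto-commit is disabled, and offsets are committed only after the whole batch has been processed. The topic, group id, and broker address are assumptions for the example.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Turn off auto-commit so offsets are committed only after processing
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Real processing would go here; a failure before the commit
                    // below means the batch will be re-read (possible duplicates)
                    System.out.println("processed offset " + record.offset());
                }
                if (!records.isEmpty()) {
                    // Commit only after the whole batch is processed: at-least-once
                    consumer.commitSync();
                }
            }
        }
    }
}
```

If the process crashes between processing and commitSync(), the uncommitted batch is read again on restart, which is exactly the duplicate-processing caveat noted above.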