Apache Kafka - Part 1

In this post, I share my notes on Apache Kafka, based on Stephane Maarek's course. Enjoy!

Topics, partitions and offsets

  • A topic is a particular stream of data.
  • A Kafka cluster may have many topics.
  • The sequence of messages is called a data stream.
  • Topics are split into partitions. Messages within each partition are ordered and indexed with incremental ids called offsets; the ids keep increasing as new messages arrive. (A topic-creation sketch follows this list.)
[Image: Producers, topics and partitions]
  • Kafka topics are immutable: once data is written to a partition, it cannot be changed.
  • Data is kept only for a limited time (the default is one week, but it's configurable).
  • Order is guaranteed only within a partition.
  • Data is assigned to a partition randomly unless a key is provided.
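
To make topics and partitions concrete, here is a minimal sketch of creating a topic programmatically with the Java AdminClient. The topic name "orders", the broker address, and the partition count are assumptions for the example, not part of the original notes.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 1 (fine for a single-broker dev setup)
            NewTopic orders = new NewTopic("orders", 3, (short) 1);
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```

With three partitions, ordering is only guaranteed per partition, which is why producers that need per-key ordering send a key (next section).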

Producers

  • Producers write data to topics.
  • Producers know which partition to write to (and which Kafka broker has it).
  • Producers can choose to send a key with the message (string, number, binary, etc.); see the producer sketch after this list.
    • If the key is null, the partition is chosen by round robin.
    • If the key is not null, all messages with the same key go to the same partition (the key is hashed).
  • What does a Kafka message look like?
[Image: Kafka message structure]
  • The key and the value are serialized before they are sent to Kafka; the consumer then deserializes them after receiving the message (i.e., messages travel over the wire as bytes).
[Image: Kafka message serializer]
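
To illustrate keys and serializers together, here is a minimal producer sketch; the topic "orders", the broker address, and the key/value strings are assumptions for the example.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Serializers turn the key and value into bytes before sending
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition (the key is hashed), so per-key order holds
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
            // Null key -> the partitioner spreads messages across partitions
            producer.send(new ProducerRecord<>("orders", null, "anonymous event"));
            producer.flush();
        }
    }
}
```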

Consumers

  • Consumers read data from a topic (pull model).
  • Consumers know which broker and partition to read from.
  • In case of broker failures, consumers know how to recover.

Consumer Deserializer

  • Deserializes binary data to its original form.
[Image: Kafka consumer deserializer]
  • The consumer must know what kind of data the topic carries; this means a topic can carry only a single type of data, and that type can't change during the topic's lifetime. (A consumer sketch follows.)
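
Putting the pull model and deserialization together, here is a minimal consumer sketch; the group id "order-service", the topic "orders", and the broker address are assumptions for the example.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-service");
        // Deserializers must match what the producer's serializers wrote
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // Pull model: the consumer asks the broker for new records
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```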

Consumer Groups

  • All the consumers of an application read data as a consumer group.
  • Each consumer within a group reads from exclusive partitions.
[Image: Kafka consumer groups - 1]
  • If you have more consumers than partitions, some consumers will be inactive.
[Image: Kafka consumer groups - 2]
  • Multiple consumer groups can read from the same topic.
[Image: Kafka consumer groups - 3]
  • Consumer groups can be thought of as services.
  • Consumers know which group they belong to via their group.id property. (A sketch follows this list.)
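
As a sketch of the "consumer groups as services" idea, the snippet below builds configs for two hypothetical services. Because their group.ids differ, each service independently receives every message of the topic; two consumers sharing one group.id would instead split the partitions between them.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;

import java.util.Properties;

public class GroupIdsSketch {
    // Hypothetical helper: shared base config, varying only the group id
    static Properties configFor(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        return props;
    }

    public static void main(String[] args) {
        // Different group ids: each service gets its own copy of every message
        Properties billing = configFor("billing-service");
        Properties analytics = configFor("analytics-service");
        System.out.println(billing.getProperty(ConsumerConfig.GROUP_ID_CONFIG));
        System.out.println(analytics.getProperty(ConsumerConfig.GROUP_ID_CONFIG));
    }
}
```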

Consumer Offsets

  • Kafka stores the offsets at which a consumer group has been reading.
  • When a consumer in a group is processing data received from Kafka, it should periodically commit its offsets.
  • If a consumer dies, it will be able to resume from where it left off thanks to the committed consumer offsets.
  • By default, Java consumers automatically commit offsets (at-least-once).
  • There are 3 delivery semantics if you choose to commit manually (an at-least-once sketch follows this list):
    • At least once (usually preferred):
      • Offsets are committed after the message is processed.
      • If the processing goes wrong, the message will be read again.
      • This can result in duplicate processing of messages, so the user should make sure the system won't be impacted by messages that are processed again.
    • At most once:
      • Offsets are committed as soon as messages are received.
      • If the processing goes wrong, some messages will be lost (they won’t be read again)
    • Exactly once:
      • Messages are processed only once (for Kafka-to-Kafka workflows, this is achieved with the transactional API, e.g. via Kafka Streams).
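
Here is a minimal at-least-once sketch: auto-commit is disabled, and offsets are committed only after the whole batch has been processed. The topic, group id, and broker address are assumptions for the example.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Turn off auto-commit so offsets are committed only after processing
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Real processing would go here; a failure before the commit
                    // below means the batch will be re-read (possible duplicates)
                    System.out.println("processed offset " + record.offset());
                }
                if (!records.isEmpty()) {
                    // Commit only after the whole batch is processed: at-least-once
                    consumer.commitSync();
                }
            }
        }
    }
}
```

If the process crashes between processing and commitSync(), the uncommitted batch is read again on restart, which is exactly the duplicate-processing caveat noted above.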