Live data streaming project using Kafka — part 1

Subham Kumar Sahoo
8 min read · Oct 8, 2022

A Kafka project to stream bidding data from a Python Flask website. Data is fetched by multiple consumers in parallel and stored in a MySQL database. A dashboard is built on top using Power BI.

GitHub repository : https://github.com/sksgit7/Big-data-projects/tree/main/kafka%20project1

Apache Kafka is an open-source distributed streaming system used for stream processing, real-time data pipelines, and data integration at scale.

Real-time data examples:

  • Bank transactions
  • Applications for real-time sports scores or real-time gaming
  • Any real-time dashboard, like stock market dashboards

Not interested in theory? Click here to jump to the hands-on part : https://medium.com/@subham-sahoo/live-data-streaming-project-using-kafka-part-2-2df7ed529413

Problem statement

We are going to organize an auction where people will bid for certain products. We need a website for that as well as a database to store the bid entries. On top of that, we need a real-time dashboard that shows the latest insights from the data.

We also want to try parallel processing of the data, i.e. multiple applications processing the data and storing it in the DB, which reduces the time taken. And simultaneously we need to store the data in a DB table or file for future use or audit purposes.

That means there will be one producer, i.e. the web application, and multiple consumers: one group will insert records into a DB table while another will back up the data into a file or DB table.

Solution overview

We will build an end-to-end pipeline that fetches bid submissions from a website and, in parallel, inserts them into a DB table and a backup file. On top of that there will be a Power BI dashboard.

There will be 5 components:

  • Flask web application (producer).
  • Kafka topic (on Confluent-Kafka cloud).
  • Consumer group 1 (consumers 1 and 2) will insert records into the DB table, and consumer group 2 (consumer 3) will back up records into a file.
  • MySQL DB table to store the bidding data.
  • Power BI dashboard to visualize our data.

Why do we even need Kafka here?

You might think this could have been done with just a Python application: run the Flask code and, after fetching the data from the web page, insert it directly into the DB table. But I am going ahead with Kafka because:

  • With a standalone Python application for the web page, we cannot have multiple consumers (with different functions) of that web data. We could create multiple Python applications in parallel, but then we would have 3 different servers running, which we do not need. Instead, we can have a single web server with auto-scaling on it (cloud services are good for this).
  • We would lose the parallel processing power. Kafka lets multiple consumers in the same group share the messages among themselves so that each can process its portion of the data in parallel. This leads to better performance and less time consumption.
  • We need not specify a maximum size for a topic. Kafka is massively scalable.
  • Kafka provides features like partitions, replicas, the schema registry etc., which we will discuss below.

Kafka overview

Kafka runs on a cluster. A cluster contains multiple brokers, which are individual servers that store the actual data, plus a leader (controller) node that keeps the metadata about topics, messages and offsets. One broker can contain multiple topics.

Kafka cluster

Topic : a named stream of messages stored within the Kafka brokers. A topic is divided into multiple partitions; think of a partition like a list or an array (an append-only log) that stores the messages.

Offset : It is like an index (unique id) for each message in a partition. It starts from 0. When there is no message in a partition, the offset is -1.

Benefits of having partitions : the messages of a topic are divided among multiple partitions, and each partition can reside on a different cluster node (server). This makes operations on a topic (publishing, consuming messages etc.) faster. These partitions are also replicated across several brokers, which adds to data durability.

Replica : the partitions of a topic are replicated across multiple broker nodes (the number is configurable), and each copy is called a replica. So if one broker node goes down, the message can still be published to or consumed from a replica. Also, when multiple consumers access messages in the same partition, they can be spread across replicas of that partition to reduce load on any single copy and make operations faster. The leader node holds the metadata about all of this and is responsible for orchestrating these operations.
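
To make the partition and replica settings concrete, here is a minimal sketch (not part of this project's code) of creating a topic programmatically with the confluent-kafka Python client; the topic name, partition count and replication factor below are placeholder assumptions.

```python
# Sketch only: create a topic with explicit partitions and replicas
# using the confluent-kafka AdminClient. All values are placeholders.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "<broker-host>:9092"})

# 3 partitions spread the messages across brokers for parallel consumption;
# replication_factor=2 keeps a copy of each partition on 2 brokers.
new_topic = NewTopic("bids", num_partitions=3, replication_factor=2)

futures = admin.create_topics([new_topic])
for topic, future in futures.items():
    future.result()  # raises an exception if topic creation failed
    print(f"Created topic {topic}")
```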

Producer : the application that continuously reads data from the source (API, website, file, DB etc.) and publishes messages to a Kafka topic.

It first serializes the data based on the schema and produces it to the Kafka topic. Serialization means converting values into a binary format. This keeps messages compact and fast to transfer, which lowers the network cost.

Consumer : the application that continuously reads data/messages from a Kafka topic. It can then apply transformations, write to the storage layer, or simply hand the data to a processing engine such as a Spark job.

It fetches the message and de-serializes it based on the schema. We can hard-code the schema in the code, but it is better to fetch the latest schema for that topic from the schema registry.

A consumer application can have its offset reset configured as earliest or latest. Earliest means read from the oldest record/message in each partition, while latest means start from the newest message in each partition (i.e. only messages that arrive from then on).
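
For illustration, in the confluent-kafka Python client this behaviour is controlled by the auto.offset.reset setting in the consumer configuration; the broker address, group id and topic name below are placeholders.

```python
# Sketch: where earliest/latest is configured on a consumer.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "<broker-host>:9092",  # placeholder
    "group.id": "group1",            # consumers sharing this id split the partitions
    "auto.offset.reset": "earliest", # or "latest" to start from the newest messages
})
consumer.subscribe(["bids"])
```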

Schema registry : a central repository of the schemas for each topic. It runs on its own server.

We will be using Confluent Kafka, which provides a cloud-hosted version of Kafka, meaning our data will be stored in the cloud. Under the free tier we get $400 of credits for 30 days and a single-node cluster (1 broker).

Flask web application

Flask is a micro web framework written in Python. It is classified as a micro-framework because it does not require particular tools or libraries. It has no database abstraction layer, form validation, or any other components where pre-existing third-party libraries provide common functions.

So, we will use this framework to create a web application where the bidders can place their bids. Something like this one :

I will be running the Flask application on my local machine and will access it via localhost. You can host your website elsewhere if you want.

Here the bidder just enters his/her name and the price for the bid. On every refresh, the highest bid so far is also shown to the bidder, based on which he/she can decide the price. The Flask back-end simply connects to the DB, fetches the maximum price and sends it back to the HTML front-end.

Improvements : a few other attributes, like the product name or more details about the bidder, could be added here. I have just kept things simple.

After the bid is placed, the back-end Python code connects to the Kafka cluster (on the cloud) and fetches the schema associated with the Kafka topic. Based on that it serializes the record (converts it into binary form to reduce size and allow easy, low-cost transfer over the network). Then it publishes the message or record to the Kafka topic. After that it returns the highest bid and the message “bid added!”.
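
A minimal sketch of what such a Flask route could look like is shown below. This is not the exact project code: the topic name, form field names, template, credentials and the get_highest_bid() helper are placeholder assumptions, and error handling is omitted.

```python
# Sketch: Flask route that serializes a bid using the topic's schema
# (fetched from Schema Registry) and publishes it to the Kafka topic.
from flask import Flask, request, render_template
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

app = Flask(__name__)
TOPIC = "bids"  # placeholder topic name

producer = Producer({
    "bootstrap.servers": "<confluent-bootstrap-server>",  # Confluent Cloud placeholders
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api-key>",
    "sasl.password": "<api-secret>",
})

sr_client = SchemaRegistryClient({
    "url": "<schema-registry-url>",
    "basic.auth.user.info": "<sr-key>:<sr-secret>",
})
# Fetch the latest registered schema for the topic's value and build a serializer.
schema_str = sr_client.get_latest_version(f"{TOPIC}-value").schema.schema_str
avro_serializer = AvroSerializer(sr_client, schema_str)

@app.route("/", methods=["GET", "POST"])
def place_bid():
    if request.method == "POST":
        record = {"name": request.form["name"], "price": int(request.form["price"])}
        producer.produce(
            topic=TOPIC,
            value=avro_serializer(record, SerializationContext(TOPIC, MessageField.VALUE)),
        )
        producer.flush()
    highest = get_highest_bid()  # hypothetical helper: SELECT MAX(price) FROM bid
    return render_template("index.html", highest=highest)
```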

Consumer groups

We will have two consumer groups, 1 and 2. Group 1 will have consumer applications 1 and 2, which will fetch messages from the Kafka topic and write to the database in parallel. As they run in parallel, they can divide the messages between them.

For example, say we produce 10 bids and consumer 1 fetches 6 of them while consumer 2 gets the remaining 4. They then insert these records into the DB table. If each message takes 5 seconds from fetching the record to inserting it into the DB, consumer 1 will take 5*6=30 seconds while consumer 2 will need 5*4=20 seconds. Because they work in parallel, the whole process finishes in 30 seconds. With a single consumer application it would have taken 5*10=50 seconds.

Group 2 will contain a single consumer application, i.e. consumer 3, which will fetch all records from the Kafka topic and append them to a file.

Also, each consumer application will first fetch the topic's schema from the schema registry and use it to de-serialize the messages it gets from the Kafka topic. Then it connects to the DB to insert the record into the table “bid”.
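
Here is a minimal sketch of what consumer 1 or 2 in group 1 could look like; the topic name, group id, credentials and column names are assumptions rather than the project's exact code, and error handling is kept minimal.

```python
# Sketch: group-1 consumer that de-serializes each bid with the registered
# schema and inserts it into the MySQL "bid" table.
import mysql.connector
from confluent_kafka import Consumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import SerializationContext, MessageField

TOPIC = "bids"  # placeholder topic name

sr_client = SchemaRegistryClient({"url": "<schema-registry-url>",
                                  "basic.auth.user.info": "<sr-key>:<sr-secret>"})
schema_str = sr_client.get_latest_version(f"{TOPIC}-value").schema.schema_str
avro_deserializer = AvroDeserializer(sr_client, schema_str)

consumer = Consumer({
    "bootstrap.servers": "<confluent-bootstrap-server>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api-key>",
    "sasl.password": "<api-secret>",
    "group.id": "group1",            # consumers 1 and 2 share this group id
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])

db = mysql.connector.connect(host="localhost", user="root",
                             password="<password>", database="test")
cursor = db.cursor()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    bid = avro_deserializer(msg.value(), SerializationContext(TOPIC, MessageField.VALUE))
    cursor.execute("INSERT INTO bid (name, price) VALUES (%s, %s)",
                   (bid["name"], bid["price"]))
    db.commit()
```

Consumer 3 in group 2 would look much the same, except with a different group.id and appending the record to a file instead of inserting it into MySQL.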

MySQL DB table

MySQL is an open-source relational database management system.

To store the bidding details we will use a MySQL database. We will first create a DB instance through which we will connect to the MySQL engine. There we will create a database “test” and a table “bid” inside it. The attributes/columns of the table will be (a sketch of the CREATE TABLE statement is shown below) :

  • id (integer, primary key) : this will be auto-incremented on each record insert. As one person can place multiple bids, we need a unique column to identify each record, so we use an auto-incrementing id column.
  • name (string)
  • price (integer)
  • bid_ts (timestamp) : date-time when bid was placed.

I will be using the MySQL Workbench tool on my local system. You can use any other DB like PostgreSQL, Oracle etc., and these can be hosted on the cloud too.
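
For reference, here is a sketch of the DDL for the “bid” table described above, run from Python with the mysql.connector driver (you could equally run the same statement directly in MySQL Workbench); the credentials and column sizes are placeholder assumptions.

```python
# Sketch: create the "bid" table in the "test" database. Credentials are placeholders.
import mysql.connector

db = mysql.connector.connect(host="localhost", user="root",
                             password="<password>", database="test")
cursor = db.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS bid (
        id     INT AUTO_INCREMENT PRIMARY KEY,      -- unique id per bid record
        name   VARCHAR(100) NOT NULL,               -- bidder's name
        price  INT NOT NULL,                        -- bid amount
        bid_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP  -- when the bid was placed
    )
""")
db.commit()
```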

Power BI tool

Power BI is an interactive data visualization software product developed by Microsoft with a primary focus on business intelligence.

We will create a connection from Power BI to the MySQL database and load the data. On top of that we can apply transformations and create visualizations. And if the table data changes, we can refresh the report.

Too much theory, I guess. But it was important. Maybe I will add another blog where we can explore Kafka in more depth.

Now let’s jump to some hands-on..

Next part : https://medium.com/@subham-sahoo/live-data-streaming-project-using-kafka-part-2-2df7ed529413

If you liked this project, kindly do clap. Feel free to share it with others.

👉Follow me on Medium for more such interesting content and projects.

Check out my other projects here : https://medium.com/@subham-sahoo/

Connect with me on LinkedIn. ✨

Thank you!!

References

  • Wikipedia
  • YouTube
