Data Compare Tool using Pandas, Flask with MongoDB and Docker + AWS ECR — Part 1

Subham Kumar Sahoo
7 min readJan 30, 2023

Data comparison between MySQL tables using Python Pandas and Flask. MongoDB for logging the statistics and CI/CD using Docker with AWS ECR.

Architecture Diagram

GitHub repository : https://github.com/sksgit7/Data-compare-docker

In this project, along with our problem statement we will be focus on distribution of our application so that other users of our application can easily start using it without much of configuration. Therefore we will be using Docker and AWS ECR as a private docker repository.

Problem statement

Let’s say we have a process that extracts records from a RDBMS table, then apply some transformations on those and at last load to another DB table. Now we want to validate that those records have been loaded properly i.e. transformations worked as expected. Example :

SRC_EMP table:

TGT_EMP table:

So, let’s say there is a process that adds 10% bonus with salary from SRC_EMP table and loads it to TGT_EMP table. And accordingly we have mentioned the SQL queries in the web form.

Note — The name of the columns getting fetched from the source and target query should be same. Use alias if required.

But here we can see only Id 1 record matches i.e. 2000 in source and 2200 in target (+10%). But Id 2 record does not have correct value in target. And Id 3 present in Source only and Id 4 in target only. So, it is expected that:

  • src_diff_tgt.csv will contain 2 records Id 2 and Id 3.
  • tgt_diff_src.csv will contain 2 records Id 2 and 4.
  • Matches is only 1 record.

There can be various ways to do comparison like :

  • Doing that manually by looking a few records from source and target table.
  • Using SQL queries to get transformed results from Source table and compare the data with Target table.

The 1st method will not be full proof as we will just be checking few records. The 2nd method might not be feasible if there are too many records (millions) because in that case we cannot compare the results by copying into an excel sheets as well as any compare tool (integrated with SQL editor) might not perform well.

Solution

We will be following the 2nd method only and design the comparison process. At first we will fetch the data from Source and Target based on the SQL queries we provide and then compare those using Pandas dataframes. This will generate two Pandas dataframe having data in “source not in target” and “target not in source”. Then we will just put the mismatch records into two csv files for analysis.

Not interested in theory :( Click here to jump to hands-on part :

Components

Python Flask

Flask is a web framework that provides libraries to build lightweight web applications in python. We will have a Flask web application where we can submit our DB configuration details like host, user, pass, DB name, SQL queries etc. From there the inputs go to backend, do the processing and we can see the compare results at the end.

Web page
Results

MySQL database

For this project we will be using MySQL database which will run on our system (local). MySQL is a widely used relational database management system (RDBMS). MySQL is free and open-source. MySQL is ideal for both small and large applications.

To connect from Python to MySQL DB we will be using a Python driver called “python-mysql-connector”. If you plan to use any other DB engines like SQL server, Postgres etc. feel free to use respective Python drivers/libraries for the same.

Python Pandas

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. It provides data structures called data-frame which has a tabular form and is best suited for analysis of structured data. These are also used in data analysis, data science and machine learning processes.

MongoDB

MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schema. We will be using this database to store our compare results for each run so that we can come back later and check the results for a certain run we did.

  • A table in MongoDB is called “collection”.
  • A record in a collection is called “document”.

Mongo-Express

Mongo Express is an interactive lightweight Web-Based Administrative Tool to effectively manage MongoDB Databases written in Node.js, Express and Bootstrap 3. This provides an interactive web dashboard to efficiently interact with MongoDB data.

Note — We will be directly using MongoDB image from DockerHub but for Mongo-Express we will use the base image from DockerHub but will install curl library in it using Dockerfile which we will use to ping MongoDB container for health check.

Docker

Docker is a set of platform as a service products that use OS-level virtualization to deliver software in packages called containers.

Docker workflow

Container

A way to package application with required dependencies and configs.

  • Portable artifact (shared/moved easily)
  • These live in Container repository
  • Private repo
  • Public repo for docker (DockerHub)
  • The package is called Image.
  • When we pull and run the image, the env is called container env. So, it is a running env for image.
  • Contains image, own virtual file system, port etc.

APP Dev

Before containers

  • The developers had to install packages (binaries) on their own in their OS

After containers

  • Developers just need to find the container with the specific package and with one command fetch and run it
  • docker run postgres
  • We can also run multiple versions of same application on our system.

App deploy

Before

  • Dev team create and give artifacts (packaged code) to operations team with instructions
  • Ops team will configure the server and deploy the application

After

  • The whole app (code+dependencies) is packaged
  • No environmental config required on server (only configure docker runtime that creates and run the containers)
  • Just pull the container from repo and run.

Images

Container is a layers of images.

  • Base image — Linux (small size)
  • Application image — top layer

Advantage of having layers is that if any particular layer is updated in a diff version then only that will be pulled.

Dockerfile

It is a text document that contains all the commands a user could call on the command line to assemble an image.

Docker-Compose

It is a tool to build and run multiple containers with specific configurations. We specify the configurations in a .yaml or .yml file. All the containers in the same file runs on one docker network which enable these to connect to each other by container name.

AWS ECR

Amazon ECR (Elastic Container Repository) is a fully managed container registry offering high-performance hosting, so you can reliably deploy application images and artifacts anywhere. Here we can push and store our docker images an anyone having the access can pull the images and run the applic/ation. We will create our private docker repository on ECR.

On ECR for every docker image we have to create a repository unlike DockerHub where one repository can contain different images like Mongo, Python etc. Also the name of the repositories on ECR should match the image name that we will be pushing to it. Each repository will contain different versions of the same image with unique tag for each.

We can run these ECR images on AWS ECS (Elastic Container Services), AWS EKS (Elastic Kubernetes Services) as well as pull the images and run on on-prem servers (which we will do at the end).

Too much theory, I guess. But it was important though. Maybe I will add another blog where we can explore more about CI/CD pipelines.

Now let’s jump to some hands-on..

Next part : https://medium.com/@subham-sahoo/data-compare-tool-using-pandas-flask-with-mongodb-and-docker-aws-ecr-part-2-a673daf80cd0

If you liked this project, kindly do clap. Feel free to share it with others.

👉Follow me on medium for more such interesting contents and projects.

Check out my other projects here : https://medium.com/@subham-sahoo/

Connect with me at LinkedIn. ✨

Thank you!!

References

All references can be found here : https://github.com/sksgit7/Data-compare-docker/blob/main/references.txt

--

--