AWS Change Data Capture project part 2

Subham Kumar Sahoo
5 min read · Jun 14, 2022

We will be implementing a Change Data Capture project using AWS RDS, DMS, S3, Lambda, and a Glue job, with IAM for access management.

Previous part: https://subham-sahoo.medium.com/aws-change-data-capture-project-part-1-ce0103ba28e7

Step 1 : Create RDS MySQL instance

Create RDS parameter group

As our source database is RDS MySQL, replicating ongoing changes requires the MySQL binary log to be enabled and its format set to ROW. DMS will also show you this message while creating the migration task. So, let’s create a custom parameter group for this.

Go to the RDS console on AWS and, in the left navigation pane, click on “Parameter groups”. Then create a new parameter group for the MySQL 8.0 family.

Then click on that parameter group, search for the “binlog_format” parameter and click on edit. Change its value from MIXED to ROW.

Then save.
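If you prefer to script this step instead of clicking through the console, here is a minimal boto3 sketch that does the same thing. The parameter group name and region are placeholders I picked for illustration.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")  # assumed region

# Create a custom parameter group for the MySQL 8.0 family
rds.create_db_parameter_group(
    DBParameterGroupName="cdc-mysql8-params",   # hypothetical name
    DBParameterGroupFamily="mysql8.0",
    Description="Row-based binary logging for DMS CDC",
)

# Set binlog_format to ROW so DMS can replicate ongoing changes
rds.modify_db_parameter_group(
    DBParameterGroupName="cdc-mysql8-params",
    Parameters=[
        {
            "ParameterName": "binlog_format",
            "ParameterValue": "ROW",
            "ApplyMethod": "immediate",
        }
    ],
)
```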

Create RDS instance

On the RDS console, click on “Databases”, then “Create database”.

Keep the configurations as listed below and leave the rest at their defaults.

Choose Version — MySQL 8.0.28, Templates — Free tier.

Provide suitable username (like admin) and password.

DB instance class — choose a free-tier eligible class (db.t2.micro or db.t3.micro).

Enable public access so you can connect to the DB from your local MySQL Workbench.

Database auth — choose password authentication.

No need for an initial database; we will create our schema manually.

Select the parameter group created above and enable automated backups.

Encryption and enhanced monitoring are not required.

Maintenance window — no preference.

Create.
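The same instance can also be created with boto3. This is only a sketch of the configuration described above; the identifier, password, and storage size are placeholder values.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="cdc-mysql-db",        # hypothetical identifier
    Engine="mysql",
    EngineVersion="8.0.28",
    DBInstanceClass="db.t3.micro",              # free-tier eligible
    MasterUsername="admin",
    MasterUserPassword="change-me-please",      # use a strong secret in practice
    AllocatedStorage=20,
    PubliclyAccessible=True,                    # so MySQL Workbench can connect
    DBParameterGroupName="cdc-mysql8-params",   # the group created above
    BackupRetentionPeriod=7,                    # enables automated backups
    StorageEncrypted=False,
)
```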

Step 2 : Create S3 bucket

DMS will put the initial full-load .csv files here, and whenever a change is captured in the DB it will create new files containing the modified rows.

On S3 dashboard, create bucket.

Object ownership — ACL disabled.

Uncheck — Block all public access.

Versioning and encryption — disable.

Create results bucket

This bucket will hold the final processed files from the Glue job. We could reuse the same bucket instead of creating a new one, because the main requirement is only that the Lambda function gets triggered by the files generated by DMS, not by these final files, and DMS will be creating its files under a distinct directory structure inside that bucket, i.e. <schema_name>/<table_name>.

But it is better, and standard practice, to maintain separate buckets for raw and processed data.

Object ownership — ACL disabled.

Uncheck — Block all public access.

Versioning and encryption — disable.
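Both buckets can be created programmatically as well. Here is a minimal boto3 sketch, assuming the buckets live in us-east-1 and using made-up bucket names (bucket names must be globally unique, so pick your own).

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Raw bucket that DMS writes the full-load and CDC files into
s3.create_bucket(Bucket="cdc-raw-bucket-demo")        # hypothetical name

# Separate bucket for the processed output of the Glue job
s3.create_bucket(Bucket="cdc-processed-bucket-demo")  # hypothetical name

# Mirror the console setting where "Block all public access" is unchecked
for bucket in ("cdc-raw-bucket-demo", "cdc-processed-bucket-demo"):
    s3.delete_public_access_block(Bucket=bucket)
```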

Step 3 : Create DMS endpoints

An endpoint provides connection, data store type, and location information about your data store. AWS Database Migration Service uses this information to connect to a data store and migrate data from a source endpoint to a target endpoint.

Source endpoint (for RDS)

In the DMS console, select “Endpoints” in the left navigation pane. Then create an endpoint.

Choose the RDS instance.

Then provide a name for the endpoint, choose MySQL as the source engine, and fill in the server configuration (server name, port, username, password) manually.

Then create.
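For reference, the same source endpoint can be defined with boto3. The server name, username, and password below are placeholders; use the values from your own RDS instance.

```python
import boto3

dms = boto3.client("dms", region_name="us-east-1")

dms.create_endpoint(
    EndpointIdentifier="cdc-mysql-source",   # hypothetical name
    EndpointType="source",
    EngineName="mysql",
    ServerName="<your-rds-endpoint>.rds.amazonaws.com",  # RDS endpoint hostname
    Port=3306,
    Username="admin",
    Password="change-me-please",
)
```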

IAM role for target endpoint

As our target endpoint will be S3, we need to grant DMS access to connect to and read/write on S3. To avoid any permission issues, we will attach the AmazonS3FullAccess policy to a role and attach that role to the target endpoint.

Go to the IAM console > Roles > Create role. Choose DMS as the trusted service and click Next.

Attach the AmazonS3FullAccess policy.

Provide a name, create the role, and copy the role ARN (click on the role and you will find it).
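The role can also be created with boto3. The sketch below creates a role that the DMS service can assume and attaches the managed AmazonS3FullAccess policy; the role name is a placeholder.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the DMS service assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "dms.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="dms-s3-target-role",   # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the managed S3 full-access policy used in this walkthrough
iam.attach_role_policy(
    RoleName="dms-s3-target-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)

# This ARN is what the S3 target endpoint needs
print(role["Role"]["Arn"])
```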

Destination endpoint

Click on create endpoint again.

Give it a name and choose Amazon S3 as the target engine.

Paste the role ARN and provide the name of the raw S3 bucket.
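In boto3 the same target endpoint would look roughly like this; the role ARN and bucket name are the placeholders from the earlier sketches.

```python
import boto3

dms = boto3.client("dms", region_name="us-east-1")

dms.create_endpoint(
    EndpointIdentifier="cdc-s3-target",   # hypothetical name
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-target-role",
        "BucketName": "cdc-raw-bucket-demo",   # the raw bucket created earlier
    },
)
```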

Step 4 : Create DMS replication instance

AWS DMS uses a replication instance to connect to your source data store, read the source data, and format the data for consumption by the target data store. It then loads the data into the target data store. AWS DMS creates the replication instance on an Amazon EC2 instance in a VPC.

Click on “Replication instances” in the DMS console. Then create.

Provide a name, choose a dms.t2.micro or dms.t3.micro instance class as it comes under the free tier, and choose single-AZ deployment.

Allocated storage — not much is required for now; a few GB is enough.

Create.
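And the equivalent boto3 sketch for the replication instance, again with a made-up identifier:

```python
import boto3

dms = boto3.client("dms", region_name="us-east-1")

dms.create_replication_instance(
    ReplicationInstanceIdentifier="cdc-replication-instance",  # hypothetical name
    ReplicationInstanceClass="dms.t3.micro",                   # free-tier eligible
    AllocatedStorage=20,      # a few GB is enough for this project
    MultiAZ=False,            # single-AZ deployment
    PubliclyAccessible=True,
)
```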

Further steps are covered in the next part.

Part 3: https://subham-sahoo.medium.com/aws-change-data-capture-project-part-3-ccf745c2ca00

Follow me for more such interesting content and projects.

Connect with me on LinkedIn.

References

AWS documentation.

PySpark and AWS: Master Big Data with PySpark and AWS by Muhammad Ahmad.
