Building a Data Lake, ingesting and analysing data!

2024-05-21 3 min read Projects Barny Baron

Journey: 📊 Community Builder 📊

Subject matter: Building on AWS

Task: Building a Data Lake, ingesting and analysing data!

This project is about Scalability in data analysis.

Using the 6 Pillars of the AWS Well-Architected Framework, Performance Efficiency and Sustainability will be achieved in this build.

Today, I will be building a Data Lake. I will be using raw data held in a CSV file, hosted in an S3 bucket which will be ingested into the Data Lake using a Glue Crawler. This will create a table in Lake Formation that contains the schema information which can then be read. Finally, I will use the serverless query engine Amazon Athena to query the dataset.

Analytics

Resource credit: This serverless architecture was created using guidance from Haimo Zhang on Medium Here.

What did I use to build this environment?

Amazon S3
AWS Lake Formation
AWS Glue
Amazon Athena
AWS IAM

What is built?

A new S3 bucket with data capture and processing folders
A new Lake Formation database, configured to house ingested data in a database table.
A Glue Crawler, created in AWS Glue to discover and create a schema for Athena to be able to query the dataset.
Amazon Athena is configured to point to the ingest table and export data to a QueryResult folder in the S3 bucket.

In this task, I created a new S3 bucket and created analytics sub-folders to house the various stages of analysed data. I then registered the location of my S3 bucket in Lake Formation and assigned an IAM role to perform the tasks. After creating a new database to ingest the data, I created a Glue Crawler in AWS Glue. I pointed the Crawler at my raw sample CSV dataset and assigned the IAM role. After granting the Crawler access to the Lake Formation Database, I was then able to run the Crawler service which took roughly 1 minute to process.

Crawler processing

When I returned to AWS Lake Formation, I observed that AWS Glue had successfully created a batch table in the database which contained schema information and data.

This permitted me to use Amazon Athena to query the data. I first had to create a location to output the query data that Athena was analysing. I used a new folder in my S3 bucket for this. Once I pointed Athena to the ingest table and ran a general query to return all rows as a test, it successfully output the data in the query results pane and also printed the output to my S3 bucket!

I also used the following AWS documentation to troubleshoot database creation in Lake Formation:

https://docs.aws.amazon.com/healthlake/latest/devguide/troubleshooting-data-lake-admin.html

I have performed labs in AWS Cloud Quest previously under the Data Analytics role where I used Data Lakes, Glue and Athena, but I had never set it up from scratch before. I enjoyed seeing how it all came together. Being able to run a query against data that has just been ingested from raw data was quite satisfying as I could watch one process passing off to the other in real time.