
Automating Data Ingestion: A Guide to Snowpipe in Snowflake

November 3, 2023 (updated March 5, 2024)

Snowpipe is a serverless service offered by Snowflake that facilitates the seamless and real-time ingestion of data. This service is particularly useful when you have continuous data streams that need to be processed promptly. Through this blog post, we aim to delve into the practical aspects of implementing Snowpipe in Snowflake to automate data ingestion, reducing the time and effort required in managing this crucial task.

Understanding Snowpipe

Snowpipe operates on a serverless architecture, which means it abstracts and handles all the infrastructure, allowing you to focus solely on loading your data. Below are some key points about Snowpipe:

  • Automated Data Loading:
    • Snowpipe automatically loads data from your external stage (e.g., AWS S3, Azure Blob Storage) into your Snowflake data warehouse without any manual intervention.
    • It uses a continuous data ingestion mechanism that constantly checks for new files and loads them into Snowflake in micro-batches.
  • Real-Time Data Ingestion:
    • With Snowpipe, data can be loaded into Snowflake almost in real-time, which is invaluable for applications and analyses that require up-to-date information.
    • The latency is typically just a few minutes, ensuring your data is promptly available for querying and analysis.
  • Cost-Effectiveness:
    • You only pay for what you use with Snowpipe, making it a cost-effective solution for data ingestion.
    • The serverless nature means there’s no need for provisioned resources (Warehouses), which further reduces the cost and operational overhead.
  • Ease of Setup and Use:
    • Setting up Snowpipe is straightforward, with just a few steps needed to configure your pipe and external stage.
    • Snowflake provides a rich set of documentation and tools to help you get started with Snowpipe quickly.
  • Scalability:
    • Snowpipe can handle data ingestion at any scale, whether you’re dealing with a few megabytes or several terabytes of data.
    • It scales automatically with your data volumes, ensuring consistent performance regardless of the size of the data being ingested.

By utilizing Snowpipe, you can significantly streamline the data ingestion process, ensuring your data is always ready for analysis when you need it. In the following sections, we will look at how to prepare your data, configure Snowpipe, and automate data ingestion to make the most out of this powerful feature.

Connecting Your External Data with Snowflake

In this section, we will walk through a practical example of how to configure your external Amazon S3 repository with Snowflake. This step-by-step guide will help you set up Snowpipe to continuously load data from AWS S3 into your Snowflake data warehouse.

Notifications Triggering Snowpipe

Utilizing notifications from your cloud storage infrastructure is a straightforward way to trigger Snowpipe for continuous data loading. Among the cloud storage platforms Snowpipe supports are Google Cloud Storage, Microsoft Azure Blob Storage, and AWS S3. For this tutorial, we will focus on integrating AWS S3 with Snowpipe.

With AWS S3 and Snowpipe integration, you can choose between S3 event notifications and Amazon’s Simple Notification Service (SNS) to notify Snowpipe of newly staged data. Choose between these two approaches based on your current setup and security considerations.
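
To illustrate the difference, a pipe that subscribes to an SNS topic passes the topic ARN through the AWS_SNS_TOPIC parameter. This is only a sketch; all object names and the ARN below are placeholders, and the full pipe setup is covered later in this guide:

CREATE OR REPLACE PIPE <your_pipe_name>
  AUTO_INGEST = TRUE
  AWS_SNS_TOPIC = 'arn:aws:sns:<region>:<account_id>:<topic_name>'
  AS
  COPY INTO <table_name>
  FROM @<stage_name>;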

Setting Up AWS S3 Event Notifications for Snowpipe

To start using AWS storage notifications for Snowpipe processing, follow these steps within your AWS and Snowflake accounts to set up the necessary security configuration:

  1. IAM Policy Creation:
    • Create an IAM policy (snowflake_access) granting Snowflake permissions such as s3:GetObject, s3:GetObjectVersion, and s3:ListBucket on your S3 bucket.
  2. New IAM Role:
    • Create a new IAM role (snowflake_role) associated with the snowflake_access policy to allow Snowflake to access your S3 bucket.
  3. Snowflake Storage Integration:
    • In Snowflake, run a CREATE STORAGE INTEGRATION command to create a storage integration (<storage_integration_name>) linking to the IAM role (see the sketch after this list).
  4. Run Integration Description Command:
    • Run a command to display the new integration’s description, recording the STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID.
  5. Update IAM Role Trust Relationship:
    • Update the trust relationship of the snowflake_role IAM role using the STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID values obtained in the previous step.
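
As a reference for steps 3 and 4, a minimal sketch of the Snowflake commands might look like this; the integration name, AWS account ID, role name, and bucket path are placeholders to replace with your own values:

CREATE STORAGE INTEGRATION <storage_integration_name>
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account_id>:role/snowflake_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket_name>/<path>/');

DESC STORAGE INTEGRATION <storage_integration_name>;
-- Record STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID from the output.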

After these steps, your AWS and Snowflake accounts are set up and ready for Snowpipe data ingestion. For a detailed walkthrough, refer to Snowflake’s documentation on configuring secure access to Amazon S3.

Creating Snowpipe

With the appropriate security configuration in place between your AWS and Snowflake accounts, it’s time to complete the Snowpipe setup and enable S3 event notifications for automated data ingestion. Follow the steps below to create the necessary database objects and configure Snowpipe:

  1. Create a table to store the data.
  2. Create an external stage pointing to the URL of your S3 bucket and using the storage integration you created (a sketch of both objects follows the pipe definition below).
  3. Create the pipe with the following code:
CREATE OR REPLACE PIPE <your_pipe_name>
  AUTO_INGEST = TRUE
  AS
  COPY INTO <table_name>
  FROM @<stage_name>;
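
For reference, a minimal sketch of steps 1 and 2 might look like the following; the column definition, bucket URL, and file format are illustrative assumptions, so adapt them to your data:

CREATE OR REPLACE TABLE <table_name> (
  raw_data VARIANT  -- e.g. semi-structured JSON records
);

CREATE OR REPLACE STAGE <stage_name>
  URL = 's3://<bucket_name>/<path>/'
  STORAGE_INTEGRATION = <storage_integration_name>
  FILE_FORMAT = (TYPE = 'JSON');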

The key to the process is the AUTO_INGEST parameter, which allows the pipe to automatically load the data staged in your bucket into your database.

Now, the next step is configuring permissions to ensure the user executing Snowpipe actions has sufficient rights. To do this, best practice is to create a new role and grant it the following privileges (a sample grant script follows the list):

  • USAGE on the database, schema, and stage.
  • INSERT and SELECT on the table.
  • OWNERSHIP on the pipe.
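
A minimal sketch of these grants, assuming a hypothetical role named snowpipe_role and placeholder object names:

CREATE ROLE IF NOT EXISTS snowpipe_role;
GRANT USAGE ON DATABASE <database_name> TO ROLE snowpipe_role;
GRANT USAGE ON SCHEMA <database_name>.<schema_name> TO ROLE snowpipe_role;
GRANT USAGE ON STAGE <database_name>.<schema_name>.<stage_name> TO ROLE snowpipe_role;
GRANT INSERT, SELECT ON TABLE <database_name>.<schema_name>.<table_name> TO ROLE snowpipe_role;
GRANT OWNERSHIP ON PIPE <database_name>.<schema_name>.<pipe_name> TO ROLE snowpipe_role;
GRANT ROLE snowpipe_role TO USER <user_name>;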

The permissions ensure that the designated user or role has the necessary rights to execute actions for data ingestion while adhering to the principle of least privilege, enhancing the security of your setup. With the auto_ingest parameter enabled, new data landing in the S3 bucket triggers Snowpipe to automatically load it into the specified table in Snowflake, without any manual intervention.

Extra: run the command SHOW PIPES and record the ARN listed in the notification_channel column; this is the SQS queue ARN you will use as the destination when configuring the event notification on your S3 bucket.
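
For example:

SHOW PIPES;
-- The notification_channel column contains the SQS queue ARN used as the
-- event notification destination for your S3 bucket.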

Conclusion

Setting up Snowpipe to move data from your external bucket to Snowflake makes data handling much easier and quicker. This setup removes a lot of manual work, making data ready for analysis as soon as it lands in Snowflake.

Now, with less time spent on moving data around, you can focus more on using the data to make good decisions for your business. This setup is a big step towards making better use of your data in Snowflake.

As you keep working with Snowflake, you might find more tools and features that can help you work with your data even better. So, keep exploring and improving your data setup to help your business grow.
