Skip to main content

This week I set-up automate external stage loading. I did this for our first mock datastack project with Snowflake and dbt. The main idea of the project is that we get daily data (acquired and manipulated from the YouTube API) through an Amazon s3 bucket. The s3 bucket environment is hosted by a third-party data provider, therefore using a pipe was not optimal. I have chosen to automate the process of loading the data from an external stage through tasks and stored procedure.

Tasks

To automate external stage loading, I set-up five different tasks. The root task checks if there are any files loaded on the external stage. This task runs on a schedule, namely every morning at 6 AM. It calls a stored procedure check_stage that refreshes the external stage and compares any files with the current date.

-- ROOT TASK
CREATE TASK check_for_new_data_on_stage
    WAREHOUSE = transforming
    SCHEDULE = 'USING CRON 0 6 * * * '
AS
    CALL check_stage( current_date );

The other four tasks are dependent on this root task. Each task will call the same stored procedure for loading this specific data, albeit with different parameters to identify the source file and target table. An example of one of these tasks is shown below.

--CHILD TASK
CREATE OR REPLACE TASK load_youtube_file
    WAREHOUSE = transforming
    AFTER check_for_new_data_on_stage
AS
    CALL load_youtube_data( CURRENT_DATE , 'source_file', 'target_table');

The good thing about setting up one task after another is that if there is an error in the root task, the child task will not run. As you can see, my root task checks the external stage by a stored procedure. In that stored procedure I have declared my own exception that is raised when the external stage does not have any new files.

Stored procedure

The code below shows how to declare and raise your own exception. In this case I have named the exception no_new_files. An important statement is the refresh of the external stage. When the stage is refreshed, the directory table is updated and you can access this through the DIRECTORY(@external_stage) command. Then you can use a simple IF statement to either return true, or raise the exception.

--PROCEDURE THAT WILL RAISE AN ERROR IF THERE ARE NO NEW FILES
CREATE OR REPLACE PROCEDURE check_stage(check_date VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
DECLARE
    new_files_bool BOOLEAN;
    no_new_files EXCEPTION (-20001, 'No new files on external stage');
BEGIN
    ALTER STAGE youtube_ext_stage REFRESH;

    new_files_bool:= (SELECT COUNT(*) > 0
    FROM DIRECTORY(@youtube_ext_stage)
    WHERE relative_path ILIKE '%' || :check_date || '%');

    IF (new_files_bool) THEN      
      RETURN TRUE;
    ELSE
      RAISE no_new_files;
    END IF;
END;

Relevant documentation

Auteur