In the previous chapter we talked about Snowflake stages. It should now be clear that these objects are the linking entities that connect the “world” with our Snowflake account.
Through stages it is possible to load and unload data into (and out of) the various tables that make up our databases. In the next lines we are going to explain how to do the staging, using two commands: PUT (file into stage) and COPY INTO (table).
Let’s find out how to upload data from a source to a stage.
Data Load
We have learned that if the information we want to transfer to our Snowflake account is stored in a local directory/folder on a client machine (our laptop, our firm’s server…), then we have to rely on internal stages.
We must provide a way for the client to get in contact with the Snowflake instance. This can be done using SnowSQL: a command-line client that connects to a Snowflake account without going through the web interface.
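For instance (the account identifier and user name below are just placeholders), the connection from the terminal might look like this:
-- Open an interactive SnowSQL session for the given account and user:
snowsql -a my_account_identifier -u my_user
SnowSQL will prompt for the password and then open a session where the statements shown in the rest of this chapter can be executed.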
PUT
Once a connection with the account is established from the command prompt, we can load files into an internal stage using the PUT command.
This command takes as input the path of the file on the local machine and the stage where the file should be loaded; Snowflake then executes the upload for us. Let’s go through some examples:
USE SCHEMA my_DB.my_schema;
-- Load from a Windows path into a table stage (addressed with %):
PUT file://C:\Users\Documents\my_file.csv @%my_table;
-- Load from a quoted Windows path into a user stage (addressed with ~):
PUT 'file://C:\Users\Documents\my file.csv' @~;
-- Load multiple files from a Windows path into a folder in a named stage:
PUT file://C:\Users\Documents\my_file_0*.csv @my_int_stage/my_folder;
We first specified the level at which we are operating: our schema.
Then, we showed that it is possible to provide the local path either as a bare path or quoted as a string. The second option allows us to provide paths where folder or file names contain special characters (such as spaces).
Finally, we pointed out that it is possible to load multiple files at the same time, provided they are named consistently, using the wildcards ‘*’ (multiple characters) and ‘?’ (single character), and to archive them in specific folders to keep the stages cleaner.
Creating a logical folder structure that reflects the data hierarchy is an important best practice. For example, organizing data files by year, month, or day can help to improve data management and make it easier to locate specific data files. However, deep folder nesting makes it harder to find the right file.
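As a small illustration (the stage name, folder layout and file name are hypothetical), the folder hierarchy can be expressed directly in the PUT target path:
-- Stage the January 2024 sales extract under a year/month folder:
PUT file://C:\Users\Documents\sales_2024_01.csv @my_int_stage/sales/2024/01;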
Why use multiple files instead of keeping all the data in a single one? Simple! Snowflake recommends keeping file sizes in the 50-500 MB range to get the best loading performance and avoid timeouts.
In the examples we have also tried to recall how to address all the types of internal stages. We encourage the reader to check the official documentation to learn how the path should be provided on other operating systems.
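For reference, on Linux or macOS the path uses forward slashes and a triple slash after the protocol; a minimal example (the directory is made up) would be:
-- Load from a Linux/macOS path into a table stage:
PUT file:///tmp/data/my_file.csv @%my_table;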
REMOVE OR OVERWRITE
It often happens that we modify the content of a file and would like to reload it into the stage. Snowflake will not let us upload a new file with the same name as an existing one. We have two options:
-- Overwrite the file:
PUT file://C:\Users\Documents\my_file.csv @%my_table
OVERWRITE = TRUE;
-- Remove the old version and load the new one:
REMOVE @%my_table/my_file.csv.gz;
PUT file://C:\Users\Documents\my_file.csv @%my_table;
We have also used this example to implicitly convey two other concepts:
- Monitoring the stages: it is smart to keep track of our stages and to maintain order. It is good practice to update file versions and to remove files when they are no longer needed (to reduce storage costs). REMOVE can address the whole stage (REMOVE @[Stage]), a single folder (REMOVE @[Stage]/my_folder) or just one element (REMOVE @[Stage]/my_folder/my_file).
- Compression: Snowflake applies gzip compression to the files as soon as they hit a stage (which also saves storage costs). To keep a file as it is, one can disable this feature by setting AUTO_COMPRESS = FALSE in the PUT statement, as sketched below.
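A minimal sketch of both points, reusing the table stage from the previous examples:
-- Inspect what is currently sitting in the table stage:
LIST @%my_table;
-- Upload a file keeping it uncompressed:
PUT file://C:\Users\Documents\my_file.csv @%my_table
AUTO_COMPRESS = FALSE;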
WAIT!
Just wait. Why haven’t we talked about external stages yet?
Well, as the name suggests, external stages do not belong to the Snowflake account. Thus, it is not possible to push files from our machine into these cloud services using SnowSQL.
We cannot populate these stages using the PUT command. The only way to do it is to store the files “manually” in the cloud storage (or have the owner of the storage do it for us) and reference them in the stage through its URL.
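To make the idea concrete, a named external stage pointing at a cloud location could be defined along these lines (the bucket path and the storage integration name are purely hypothetical):
-- The stage only references the bucket; files are placed there outside Snowflake:
CREATE STAGE my_ext_stage
URL = 's3://my-bucket/my_folder/'
STORAGE_INTEGRATION = my_s3_integration;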
COPY INTO
Now that we have explored how to load files into the stages, it is time to transfer the data within them into our tables. We will use the COPY INTO command.
What is its syntax? Let’s analyze an example:
-- Load data from the file my_file.csv.gz into the table:
COPY INTO my_table
FROM @%my_table
FILE_FORMAT = (TYPE = CSV RECORD_DELIMITER = '\n' FIELD_DELIMITER = ',' SKIP_HEADER = 1)
PATTERN = '.*my_file.csv.gz'
ON_ERROR = 'continue';
First, we have loaded into the table the data from the file contained in its table stage.
At the same time, we have declared the file format and its specifics. Snowflake, by default, supports the following file formats: CSV | JSON | AVRO | ORC | PARQUET | XML. Any other option, or modification to how these formats are interpreted, can be achieved by creating your own named FILE FORMAT object.
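As a sketch (the format name is made up), a reusable named file format could be created once and then referenced in the COPY statement instead of repeating the inline options:
-- Define a reusable CSV format:
CREATE FILE FORMAT my_csv_format
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1;
-- Reference it by name in the load:
COPY INTO my_table
FROM @%my_table
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');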
Something that is not mandatory is to define a pattern: this option allows us to address only some of the files that belong to a stage. Conversely, without any specification, Snowflake by default tries to copy the whole content of the referenced stage into the table.
The ON_ERROR clause tells Snowflake how to behave in case of issues. Set to ‘continue’, the load will skip the rows that cause problems and will not insert them into the table.
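If you prefer to inspect the problems before loading anything, a dry run of this kind is possible (a sketch, reusing the table stage from the example above):
-- Return the parsing errors without loading any data:
COPY INTO my_table
FROM @%my_table
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
VALIDATION_MODE = 'RETURN_ERRORS';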
Many more clauses can be listed in this statement. To read more about the COPY INTO table function, check the documentation.
NOTE WELL!
Whenever you execute the COPY INTO command, Snowflake reads the data from the file and inserts it into the specified table. After loading the data, Snowflake creates an internal marker to indicate that the file has been processed. This marker links to the file metadata and is fundamental to avoid record duplication.
If you try to execute the COPY INTO command on the same file again without making any changes to the file, Snowflake will detect the internal marker and will not load the data again. Instead, it will return a message indicating that the file has already been processed.
However, if you modify the file after the first load (for example by adding new data, changing existing records, or renaming the file), then Snowflake will detect that the file has changed and will load the modified data into the table.
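Should you ever need to reload a file that Snowflake has already marked as processed, the FORCE option overrides this check; a minimal sketch (beware that it can produce duplicate rows):
-- Reload the staged files even if they were already loaded:
COPY INTO my_table
FROM @%my_table
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
FORCE = TRUE;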
Conclusion
We hope we have managed to convey all the fundamentals to load data on your Snowflake account.
We drew a distinction between internal stages (with their PUT command) and external stages, where staging happens “manually” on the cloud services.
Finally, we covered the COPY INTO command, which transfers the data from a file in the stage to a table, with specific directives on how the data should be interpreted.
Read the next chapters on FILE FORMAT creation and data unloading.
Have questions about data? Be sure to check our blog.