
In the previous chapters of this series, we saw how to load data from external sources into your Snowflake account through staging, and how to do the reverse.
While covering those topics, we mentioned a useful tool that we are now going to explore.

Let’s talk about file formats!

What are file formats?

A Snowflake FILE FORMAT is a named database object that streamlines the interpretation of the files involved in staging, making it easier to load and unload data from database tables.
A file format captures the file's characteristics, from its type to the format options used for preparing the data.
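
Since a file format is a schema-level object, you can list the ones available to you like any other database object. A minimal sketch (the database and schema names are illustrative):

-- List all named file formats visible in the current schema:
SHOW FILE FORMATS;

-- Or restrict the listing to a specific schema:
SHOW FILE FORMATS IN SCHEMA my_db.my_schema;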

Types

The file formats that are supported by default on Snowflake are the following:

  • CSV
  • JSON
  • AVRO
  • ORC
  • PARQUET
  • XML

Except for CSV, all the other formats are for semi-structured data.
All of these formats can be referenced directly in a COPY INTO statement that loads from stage to table or vice versa.
Note, however, that Snowflake supports CSV, JSON and PARQUET both for loading data into a table and for creating files during unloading, while AVRO, ORC and XML are supported for loading only.
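
When one of the default types is enough, you can reference it inline in the COPY INTO statement without creating a named object first. A minimal sketch, assuming a target table my_table and a stage my_stage (both names are illustrative):

-- Load CSV files from a stage using an inline, default file format:
COPY INTO my_table
FROM @my_stage
 FILE_FORMAT = (TYPE = CSV);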

Syntax

In the examples in the previous sections, we focused on showing how to call one of the default file formats. We will now explain how to create a user-defined file format with all the options that might be of interest. Let’s look at this example:

-- Create a specific file format to read .dat files:
CREATE OR REPLACE FILE FORMAT my_file_format
  TYPE = CSV                           -- parse the files as delimited text
  FIELD_DELIMITER = ','                -- columns are separated by commas
  RECORD_DELIMITER = '*'               -- rows are separated by asterisks
  SKIP_HEADER = 1                      -- skip the first line (header)
  NULL_IF = ('NULL', 'null')           -- treat these strings as NULL
  EMPTY_FIELD_AS_NULL = true           -- empty fields become NULL
  COMPRESSION = gzip                   -- files are gzip-compressed
  REPLACE_INVALID_CHARACTERS = TRUE;   -- replace invalid UTF-8 characters

The syntax for creating a file format is as easy as for any other Snowflake object. Using a single SQL statement, you can specify (note that none of these fields is mandatory) the type, the field (column) and record (row) delimiters, and whether or not the file has a header.
Moreover, you can tell Snowflake how to behave when a NULL value is encountered while reading, and what to do if empty field values (or invalid characters) appear in the dataset.
Finally, you can also state which type of compression was applied to the file when staging it.

Note that when you don’t specify a value for an optional file format parameter, Snowflake automatically applies the default value for that option. For more information on the specifics, please check the official documentation.
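
If you want to check which values a named file format actually uses, including the defaults you did not set explicitly, you can inspect it. A minimal sketch, using the my_file_format object created above:

-- List every option of the file format with its current value:
DESCRIBE FILE FORMAT my_file_format;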

How to call them

Snowflake file formats can simply be referenced in any COPY INTO statement, whether from stage to table or from table to stage.
The following example shows loading data into a table from a user stage using the file format my_file_format.

-- Load data from the file my_file.csv.gz into the table:
COPY INTO my_table
FROM @~
 FILE_FORMAT = (FORMAT_NAME = 'my_file_format')
 PATTERN = '.*my_file.csv.gz'
 ON_ERROR = 'continue';
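
Since my_file_format is of TYPE CSV, one of the formats supported for unloading, the same object can be reused in the opposite direction. A minimal sketch, assuming the my_table and my_file_format objects from above (the target path on the user stage is illustrative):

-- Unload the table back to the user stage using the same format:
COPY INTO @~/unload/my_file
FROM my_table
 FILE_FORMAT = (FORMAT_NAME = 'my_file_format');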

Conclusion

We have talked about FILE FORMAT, showing both how to use a default option and how to define your own specific format.
To wrap it up: you reference a file format in a COPY INTO statement to specify how to decompress/compress the data, along with specific directives for decoding and interpreting it.

Read the next chapter on Snowflake pipes to automate staging (when possible) or, in general, make it smoother.

Have questions about data? Be sure to check our blog.

Author

  • Mario Calise

    He was born in Cosenza (Italy); he has already lived in 7 different cities throughout his life and counts on adding more to the list! Educated as an engineer (BSc in Engineering Physics – Polytechnic University of Milan, Italy; MSc in Space Research – DTU, Copenhagen, Denmark), Mario is an ambitious person who enjoys developing new solutions to tech-related problems. He has strong technical skills and an academic background covering statistics, mathematical methods, data management and modelling. He also has good knowledge of programming in different languages and environments (VisualStudio, Snowflake, Matlab, Python, R, C#, SQL and HTML). He joined the Nimbus Intelligence Academy because his professional goal is to work with data and predictive models to support decision-making and process improvement in the industry.
