
In this blog, we are going to discuss how Snowflake makes it easy to interpret semi-structured data through its Schema Detection feature.

This feature saves time when setting up the tables that will receive the imported data, and it reduces the effort required at the start of data ingestion.

N.B. on Supported Formats: This feature is not yet available for all file formats. It is limited to Apache Parquet, Apache Avro, and ORC files.

What is Snowflake INFER SCHEMA?

Let’s picture one of these scenarios: we are dealing with a semi-structured file that contains a huge number of fields, or we have to bulk-load a large number of semi-structured files – each with its own set of fields – into a staging area.
In both cases we would have to prepare the table(s) that will ingest the data from the files, taking care to define the column names and data types correctly.

In principle, the idea would be to automate this data pipeline, but having to interpret the file structure manually every time gets in the way.
Snowflake does offer native support for semi-structured data, either by copying the whole payload into a VARIANT column or by querying the data directly from the staged files through external tables. However, at some point the structure of the datasets contained in the files still has to be provided manually.
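For reference, here is a minimal sketch of the VARIANT approach under those assumptions (the stage name and the field name below are placeholders for illustration; each AVRO record lands as one OBJECT in the VARIANT column):

//Load whole records into a single VARIANT column, then extract fields at query time
CREATE OR REPLACE TABLE raw_avro (record VARIANT);

COPY INTO raw_avro
FROM @my_stage
FILE_FORMAT = (TYPE = AVRO);

//Every field still has to be extracted and cast by hand
SELECT record:some_field::STRING AS some_field
FROM raw_avro;

This works, but the manual extraction and casting is exactly the step we would like to avoid.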

As a solution, Snowflake is able to automatically detect the structure of a staged file and infer a table schema from it.
In particular, the function INFER_SCHEMA detects and returns the schema of a set of staged files. It supports external stages and internal (user or named) stages. As logical as it sounds, it does not support table stages, since we are trying to build a table starting from the file in the first place.
Moreover, it is essential to provide a file format so that the file can be interpreted correctly.

How to use INFER SCHEMA?

Here is an example of how it works.
Let’s infer the schema of a table from an AVRO file. To work, INFER_SCHEMA needs a stage and a file format:

//Create an AVRO file format so the staged files can be interpreted correctly
CREATE FILE FORMAT my_avro_format
 TYPE = AVRO;

//Create a stage; remember to populate it with an AVRO file to use as a template!
CREATE STAGE AVRO_Stage
  FILE_FORMAT = my_avro_format;
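If you are working with an internal stage, a quick way to get a sample file in place is to upload it from your machine and then check that it is visible (the local path below is only a placeholder):

//Upload a local AVRO file to the internal stage (run from SnowSQL or another client that supports PUT)
PUT file:///tmp/sample.avro @AVRO_Stage;

//Verify that the file is staged
LIST @AVRO_Stage;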

Now we can use the function itself:

SELECT *
FROM TABLE( 
           INFER_SCHEMA(
                         LOCATION=>'@AVRO_Stage'
                       , FILE_FORMAT=>'my_avro_format'
                       )
          );

The result of this query should look something like this:

[Image: sample output of INFER_SCHEMA – credits to Snowflake]
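The output contains one row per detected column, with fields such as COLUMN_NAME, TYPE, NULLABLE, EXPRESSION and FILENAMES. If you prefer to read the detected schema as ready-made column definitions, a sketch using Snowflake’s GENERATE_COLUMN_DESCRIPTION function on the same output looks like this:

SELECT GENERATE_COLUMN_DESCRIPTION(ARRAY_AGG(OBJECT_CONSTRUCT(*)), 'table') AS column_definitions
FROM TABLE(
           INFER_SCHEMA(
                         LOCATION=>'@AVRO_Stage'
                       , FILE_FORMAT=>'my_avro_format'
                       )
          );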

It works! We can now create a table from this schema, which means we won’t have to specify each column name and data type by hand:

CREATE OR REPLACE TABLE AVRO_Upload
USING TEMPLATE ( SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
                 FROM TABLE(
                            INFER_SCHEMA(
                                          LOCATION=>'@AVRO_Stage'
                                        , FILE_FORMAT=>'my_avro_format'
                                        )
                           )
               );
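To double-check what the template produced, you can inspect the generated columns before loading anything:

//Inspect the columns created from the inferred schema
DESCRIBE TABLE AVRO_Upload;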

It’s time to fill the table. The MATCH_BY_COLUMN_NAME option loads each field of the AVRO records into the table column whose name matches it, regardless of the order of the fields in the file:

COPY INTO AVRO_Upload 
FROM @AVRO_Stage
 MATCH_BY_COLUMN_NAME = 'CASE_INSENSITIVE';

 //Check the result!
SELECT * FROM AVRO_Upload;

What would happen if we had multiple files with different structures and column orders for the same table upload?

Snowflake returns the union of the column lists of all the files, in no guaranteed order.
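A minimal way to see this in practice (assuming the stage now holds more than one AVRO file) is to look at the FILENAMES field returned by INFER_SCHEMA, which shows which file(s) each detected column comes from:

SELECT COLUMN_NAME, TYPE, FILENAMES
FROM TABLE(
           INFER_SCHEMA(
                         LOCATION=>'@AVRO_Stage'
                       , FILE_FORMAT=>'my_avro_format'
                       )
          )
ORDER BY COLUMN_NAME;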

Conclusion

We have talked about Schema Detection and the INFER_SCHEMA function, explaining what they are and how to put them in place.

We hope you have found this useful!

Have questions about data? Be sure to check our blog.

Author

  • Mario Calise

    He was born in Cosenza (Italy); he has already lived in 7 different cities throughout his life and he counts on adding more to the list! Educated as an engineer (BSc in Engineering Physics – Polytechnic University of Milan, Italy; MSc in Space Research – DTU, Copenhagen, Denmark), Mario enjoys developing new solutions to tech-related problems. He has strong technical skills and an academic background that covers statistics, mathematical methods, data management and modelling. He also has good knowledge of programming in different languages and environments (Visual Studio, Snowflake, Matlab, Python, R, C#, SQL and HTML). He joined the Nimbus Intelligence Academy because his professional goal is to work with data and predictive models to support decision-making and process improvement in industry.
