Skip to main content
JSONSnowflakeUnstructured Data

Killing JSON: The Flatten Function

By noviembre 10, 2023marzo 5th, 2024No Comments

Welcome to Killing JSON, a series in which I channel my frustration with loading data into teaching you something. Today, we’re talking about the FLATTEN function in Snowflake.

Flattening JSON, Easy Enough

If you’ve ever loaded semi-structured data into Snowflake, I can bet you’ve used FLATTEN to put it into a table. You probably understand that this function plays an important role in unraveling the nested dictionaries in a JSON file to produce a structure compatible with a table.

Side-by-side, the same data in a JSON format and in a tabular format.

It sounds simple (maybe because it is) but let’s look at the syntax for the FLATTEN function together:

FLATTEN( INPUT => <expr> [ , PATH => <constant_expr> ]
                         [ , OUTER => TRUE | FALSE ]
                         [ , RECURSIVE => TRUE | FALSE ]
                         [ , MODE => 'OBJECT' | 'ARRAY' | 'BOTH' ] )

Huh? What should my INPUT be? What is recursion? Which mode should I select? And, most of all, why do all the arguments use => for assignment?

Well, at least the last one has a simple answer: the => operator is simply an assignment operator, needed because SQL uses the ordinary = as a logical operator. As for the rest, let’s tackle them one by one. Let’s try to figure out how FLATTEN works though simple examples.

Simple Outputs of the FLATTEN Function

In this first example, we’re passing very simple JSON data as the input to FLATTEN.

-- "parse_json" let's us create JSON inline.
SELECT * FROM TABLE(FLATTEN(input => parse_json('{"a":1, "b":[77,88]}')));
SEQ KEY PATH INDEX VALUE THIS
1 a a null 1 { «a»: 1, «b»: [ 77, 88 ] }
1 b b null [77, 88] { «a»: 1, «b»: [ 77, 88 ] }
The output of the FLATTEN function.

What does each column mean?

  • SEQ – A unique sequence number associated to the JSON data we passed.
  • KEY – For maps (such as "a" : 1), returns the key ("a")
  • PATH – The path to the element that needs to be flattened. In this case, it’s the same as the key.
  • INDEX – The index of the array element. If it is not an array, this is null.
  • VALUE – The value in the flattened value ("1")
  • THIS – The element being flattened.

So far, so good. Let’s keep playing with simple data, and pass FLATTEN an array.

-- This array is missing a value, but it'll be ok!
SELECT * FROM TABLE(FLATTEN(input => parse_json('[1, 55, ,77]'))) ;
SEQ KEY PATH INDEX VALUE THIS
1 null [0] 0 1 [ 1, 55, undefined, 77 ]
1 null [1] 1 55 [ 1, 55, undefined, 77 ]
1 null [3] 3 77 [ 1, 55, undefined, 77 ]
Another output of the FLATTEN function.

There’s some interesting differences between this example and the previous. For one, all the keys are null. This makes sense, our data consists of a simple array. Also, the path variable is now equivalent to the index of the value within the array. Null values in the array were completely skipped.

Using the PATH Parameter

Armed with our knowledge of the output of the FLATTEN function, let’s learn to use the PATH parameter. This parameter indicates the path to the element within the data structure which needs to be flattened. If we construct JSON data which contains nested elements that can be flattened, we can use this parameter to pinpoint those elements.

-- "b" is the key to an array, which can be flattened
SELECT * FROM TABLE(FLATTEN(input => parse_json('{"a":1, "b":[77,88]}'), path => 'b'));
SEQ KEY PATH INDEX VALUE THIS
1 null b[0] 0 77 [ 77, 88 ]
1 null b[1] 1 88 [ 77, 88 ]
The FLATTEN function applied to the array keyed by «b.»

Interesting! We can see that the path variable now reflects that we passed "b" as an argument, and this now refers to the array keyed by "b."

Using the OUTER Parameter

The OUTER parameter can either be TRUE or FALSE. If false, which is the default, any input rows that cannot be expanded are completely omitted from the output. If true, one row will be generated even for zero-row expansions. Zero-row expansions are those with NULL in the KEY, INDEX, and VALUE columns. Let’s see an example to understand this.

-- There's nothing here...
SELECT * FROM TABLE(FLATTEN(input => parse_json('[]')));
SEQ KEY PATH INDEX VALUE THIS
There’s nothing here!

Hm. There’s no values to unpack, so we have no output. Setting OUTER to TRUE changes this behavior.

-- There's nothing here... but empty tables make me sad!
SELECT * FROM TABLE(FLATTEN(input => parse_json('[]'), outer => true));
SEQ KEY PATH INDEX VALUE THIS
1 bull   null null []
There’s nothing here… but I don’t like seeing empty tables!

Using the RECURSIVE Parameter

The RECURSIVE (disabled, or FALSE, by default) parameter specifies whether the FLATTEN function will keep digging into the data to recursively expand sub-elements.

SELECT * FROM TABLE(FLATTEN(input => parse_json('{"a":1, "b":[77,88]}'), recursive => true));
SEQ KEY PATH INDEX VALUE THIS
1 a a null 1 { «a»: 1, «b»: [ 77, 88 ] }
1 b b null [77, 88] { «a»: 1, «b»: [ 77, 88 ] }
1 null b[0] 0 77 [ 77, 88 ]
1 null b[1] 1 88 [ 77, 88 ]
A recursive output of the FLATTEN function.

Using the MODE Parameter

The MODE parameter species which kind of data to unpack: objects, array, or both. This is set to BOTH by default.

-- Flatten recursively, but don't expand arrays.
SELECT * FROM TABLE(FLATTEN(input => parse_json('{"a":1, "b":[77,88], "c": {"d":"X"}}'),
                            recursive => true, mode => 'object'));
SEQ KEY PATH INDEX VALUE THIS
1 a a null 1 { «a»: 1, «b»: [ 77, 88 ], «c»: { «d»: «X» } }
1 b b null [ 77, 88 ] { «a»: 1, «b»: [ 77, 88 ], «c»: { «d»: «X» } }
1 c c null { «d»: «X» } { «a»: 1, «b»: [ 77, 88 ], «c»: { «d»: «X» } }
1 d c.d null «X» { «d»: «X» }
Recursive flatten excluding arrays.

That’s pretty cool! The array keyed by "b" was accessed, but it wasn’t recursively expanded. In contrast, the dictionary keyed by "c" was accessed and the element within that dictionary got its own row in the output table.

Conclusion

The real problem is… semi-structured data is awesome. The data within can be well organized but not rigidly constrained to specific fields. Loading this data into a relational database, however, can be tricky. All the more reason to learn the tools at our disposal for reliably extracting data from semi-structured file formats. I’ll catch you next week, where I’ll be taking a look at how the flatten function can help us unpack a real JSON file.

 

Auteur

Leave a Reply