Welcome to Killing JSON, a series in which I channel my frustration with loading data into teaching you something. Today, we’re talking about the FLATTEN function in Snowflake.
Flattening JSON, Easy Enough
If you’ve ever loaded semi-structured data into Snowflake, I can bet you’ve used FLATTEN to put it into a table. You probably understand that this function plays an important role in unraveling the nested dictionaries in a JSON file to produce a structure compatible with a table.
It sounds simple (maybe because it is) but let’s look at the syntax for the FLATTEN function together:
FLATTEN( INPUT => <expr> [ , PATH => <constant_expr> ]
[ , OUTER => TRUE | FALSE ]
[ , RECURSIVE => TRUE | FALSE ]
[ , MODE => 'OBJECT' | 'ARRAY' | 'BOTH' ] )
Huh? What should my INPUT be? What is recursion? Which mode should I select? And, most of all, why do all the arguments use =>
for assignment?
Well, at least the last one has a simple answer: the =>
operator is simply an assignment operator, needed because SQL uses the ordinary =
as a logical operator. As for the rest, let’s tackle them one by one. Let’s try to figure out how FLATTEN works though simple examples.
Simple Outputs of the FLATTEN Function
In this first example, we’re passing very simple JSON data as the input to FLATTEN.
-- "parse_json" let's us create JSON inline.
SELECT * FROM TABLE(FLATTEN(input => parse_json('{"a":1, "b":[77,88]}')));
SEQ | KEY | PATH | INDEX | VALUE | THIS |
1 | a | a | null | 1 | { “a”: 1, “b”: [ 77, 88 ] } |
1 | b | b | null | [77, 88] | { “a”: 1, “b”: [ 77, 88 ] } |
What does each column mean?
- SEQ – A unique sequence number associated to the JSON data we passed.
- KEY – For maps (such as
"a" : 1
), returns the key ("a"
) - PATH – The path to the element that needs to be flattened. In this case, it’s the same as the key.
- INDEX – The index of the array element. If it is not an array, this is null.
- VALUE – The value in the flattened value (
"1"
) - THIS – The element being flattened.
So far, so good. Let’s keep playing with simple data, and pass FLATTEN an array.
-- This array is missing a value, but it'll be ok!
SELECT * FROM TABLE(FLATTEN(input => parse_json('[1, 55, ,77]'))) ;
SEQ | KEY | PATH | INDEX | VALUE | THIS |
1 | null | [0] | 0 | 1 | [ 1, 55, undefined, 77 ] |
1 | null | [1] | 1 | 55 | [ 1, 55, undefined, 77 ] |
1 | null | [3] | 3 | 77 | [ 1, 55, undefined, 77 ] |
There’s some interesting differences between this example and the previous. For one, all the keys are null. This makes sense, our data consists of a simple array. Also, the path variable is now equivalent to the index of the value within the array. Null values in the array were completely skipped.
Using the PATH Parameter
Armed with our knowledge of the output of the FLATTEN function, let’s learn to use the PATH parameter. This parameter indicates the path to the element within the data structure which needs to be flattened. If we construct JSON data which contains nested elements that can be flattened, we can use this parameter to pinpoint those elements.
-- "b" is the key to an array, which can be flattened
SELECT * FROM TABLE(FLATTEN(input => parse_json('{"a":1, "b":[77,88]}'), path => 'b'));
SEQ | KEY | PATH | INDEX | VALUE | THIS |
1 | null | b[0] | 0 | 77 | [ 77, 88 ] |
1 | null | b[1] | 1 | 88 | [ 77, 88 ] |
Interesting! We can see that the path variable now reflects that we passed "b"
as an argument, and this
now refers to the array keyed by "b."
Using the OUTER Parameter
The OUTER parameter can either be TRUE
or FALSE
. If false, which is the default, any input rows that cannot be expanded are completely omitted from the output. If true, one row will be generated even for zero-row expansions. Zero-row expansions are those with NULL in the KEY, INDEX, and VALUE columns. Let’s see an example to understand this.
-- There's nothing here...
SELECT * FROM TABLE(FLATTEN(input => parse_json('[]')));
SEQ | KEY | PATH | INDEX | VALUE | THIS |
Hm. There’s no values to unpack, so we have no output. Setting OUTER to TRUE changes this behavior.
-- There's nothing here... but empty tables make me sad!
SELECT * FROM TABLE(FLATTEN(input => parse_json('[]'), outer => true));
SEQ | KEY | PATH | INDEX | VALUE | THIS |
1 | bull | null | null | [] |
Using the RECURSIVE Parameter
The RECURSIVE (disabled, or FALSE
, by default) parameter specifies whether the FLATTEN function will keep digging into the data to recursively expand sub-elements.
SELECT * FROM TABLE(FLATTEN(input => parse_json('{"a":1, "b":[77,88]}'), recursive => true));
SEQ | KEY | PATH | INDEX | VALUE | THIS |
1 | a | a | null | 1 | { “a”: 1, “b”: [ 77, 88 ] } |
1 | b | b | null | [77, 88] | { “a”: 1, “b”: [ 77, 88 ] } |
1 | null | b[0] | 0 | 77 | [ 77, 88 ] |
1 | null | b[1] | 1 | 88 | [ 77, 88 ] |
Using the MODE Parameter
The MODE parameter species which kind of data to unpack: objects, array, or both. This is set to BOTH
by default.
-- Flatten recursively, but don't expand arrays.
SELECT * FROM TABLE(FLATTEN(input => parse_json('{"a":1, "b":[77,88], "c": {"d":"X"}}'),
recursive => true, mode => 'object'));
SEQ | KEY | PATH | INDEX | VALUE | THIS |
1 | a | a | null | 1 | { “a”: 1, “b”: [ 77, 88 ], “c”: { “d”: “X” } } |
1 | b | b | null | [ 77, 88 ] | { “a”: 1, “b”: [ 77, 88 ], “c”: { “d”: “X” } } |
1 | c | c | null | { “d”: “X” } | { “a”: 1, “b”: [ 77, 88 ], “c”: { “d”: “X” } } |
1 | d | c.d | null | “X” | { “d”: “X” } |
That’s pretty cool! The array keyed by "b"
was accessed, but it wasn’t recursively expanded. In contrast, the dictionary keyed by "c"
was accessed and the element within that dictionary got its own row in the output table.
Conclusion
The real problem is… semi-structured data is awesome. The data within can be well organized but not rigidly constrained to specific fields. Loading this data into a relational database, however, can be tricky. All the more reason to learn the tools at our disposal for reliably extracting data from semi-structured file formats. I’ll catch you next week, where I’ll be taking a look at how the flatten function can help us unpack a real JSON file.