It is possible to match strings with regular expression (also called REGEX) in Snowflake. I think many data scientist, including myself, love this feature. As part of a series of blogs around the topic of SQL Injection, this blog will touch upon the use of REGEX in Snowflake. To show the functionality I will continue with my example of inputting Little Bobby Tables into a students table.
Snowflake functions
REGEX is available through built-in Snowflake string functions. Check out the full documentation here. In this blog I will show you two of these functions. The REGEXP or RLIKE function can be used just as the SQL LIKE command. For example, if I want to select all students that have a name starting with the letter ‘B’, I could use either one of the codes below.
SELECT name
FROM students
WHERE name RLIKE 'B.*';
SELECT name
FROM students
WHERE name LIKE 'B%';
Thus we can see that the RLIKE function has many similarities with the SQL LIKE command. There are some differences.
- The LIKE function uses % as a wildcard, whereas RLIKE needs
.*
. This is because the LIKE function uses SQL wildcards in the pattern. The RLIKE function uses POSIX ERE (Extended Regular Expression) syntax. - The LIKE function has no default escape character, but does include an option to manually specify it. The RLIKE function has the backslash (
\
) as default escape character, but no option to reset it. If you do not want the backslash, look into dollar-quoted string constants. - The LIKE function supports string searches, but no pattern search.
Pattern search
Lets dive deeper into the pattern search. This is a useful tool when validating data because we can, for example, check that our VARCHAR column only contains “word” characters. The word characters are any combination of Latin upper or lower case letters, the underscore and decimal digits. One could assume that a name has only word characters, and use the pattern search to exclude any values that do not comply (i.e. Robert'); DROP TABLE Students;--
). Use the pattern in the code below to select only the students with a name that has word characters.
SELECT name
FROM students
WHERE name RLIKE '\\w*';
This looks pretty useful for our case! However, what happens when we enter two other new students into our students table that are named Mike O’Leary and X Æ A-12 (the son of Elon Musk) ? Both of these students will not be excluded from our students table as well. Imagine the tweets Elon would write about that! How can we fix this?
Regex substring
To be as inclusive as possible, without being vulnerable to any SQL injection, we can use another Snowflake regex function: REGEXP_SUBSTR. Just like the regular Snowflake substring function, it will return a part of your string. With this we can specify a set of characters that we are not expecting in our names. To not exclude our newly added students, but save us from SQL injection we could for example add the semicolon and the two dashes as a pattern to check for.
SELECT name
FROM students
WHERE regexp_substr(name, ';\\s*--') IS NULL;
The code above selects only the students which do not have a combination of the semicolon, one or more whitespace characters and two dashes. Note that we could have used a general string function for this as well, but then we would not have been able to filter the whitespace out. Furthermore, note that this code cleans up our specific example and does not protect you in general against SQL injection.