
This is the second blog of the Road to Snowflake Data Engineer series. If you’ve already read part 1, Road to Snowflake Data Engineer: Data Movement, you can scroll straight to the Troubleshooting the queries that underperform heading.

Road to Snowflake Data Engineer is a five-part blog series in which I dive into the five domains assessed in the Snowflake SnowPro® Advanced Data Engineer certification exam, as outlined in the official SnowPro Advanced: Data Engineer Certification Exam Study Guide. Writing about the concepts covered in the guide helps me memorize them and prepare for the exam.

If you’re still in the process of preparing for the Snowflake SnowPro Core Certification, I recommend you read some of these blog posts first. They collect the experience of some of my Nimbus Intelligence colleagues, who share their own approaches to obtaining the certification. Perhaps their study methods are more aligned with what works for you.

If you’re aiming for the Snowflake Data Engineer Certification, keep reading.

Troubleshooting the queries that underperform

One of the first steps in optimizing performance in Snowflake is identifying and troubleshooting underperforming queries.

This involves analyzing query execution plans to pinpoint bottlenecks. Telemetry data, which provides metrics on query operations, is invaluable here. By understanding the root cause, whether it’s due to resource constraints or inefficient SQL code, you can take targeted actions to increase efficiency. Techniques might include refining query structure, optimizing joins, or adjusting warehouse sizing.
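As a quick illustration of what that analysis can look like in SQL (the database, schema, and table names below are placeholders I made up), you can ask Snowflake for a plan up front with EXPLAIN, or pull operator-level statistics for a query that has already run:

```sql
-- Inspect the execution plan before running the query (placeholder names)
EXPLAIN
SELECT o.customer_id, SUM(o.amount)
FROM my_db.my_schema.orders o
GROUP BY o.customer_id;

-- Operator-level statistics for the most recent query in this session,
-- useful for spotting expensive joins, spilling, or full scans
SELECT *
FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()));
```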

From my cheat sheet:

  • A query remains in the History tab for 14 days.
  • To improve performance, Snowflake recommends managing workloads by dedicating separate warehouses to load operations and to query operations.
  • QUERY_HISTORY family of table functions (see the example after this list):
    • QUERY_HISTORY_BY_SESSION, to return queries within a specified session and time range.
    • QUERY_HISTORY_BY_USER, to return queries submitted by a specified user within a specified time range.
    • QUERY_HISTORY_BY_WAREHOUSE, to return queries executed by a specified warehouse within a specified time range.
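A minimal sketch of how I use one of these table functions, assuming a warehouse called MY_WH; sorting by elapsed time quickly surfaces the slowest recent queries:

```sql
-- Slowest recent queries on a given warehouse (MY_WH is a placeholder)
SELECT query_id,
       query_text,
       total_elapsed_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_WAREHOUSE(
       WAREHOUSE_NAME => 'MY_WH',
       RESULT_LIMIT   => 100))
ORDER BY total_elapsed_time DESC;
```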

Configuring solutions for the best performance

Configuring Snowflake to meet specific performance needs requires an understanding of both the environment and the workload. Deciding whether to scale out (adding clusters to a multi-cluster warehouse to handle higher concurrency) or scale up (resizing the warehouse to handle larger, more complex queries) is crucial.

Other considerations include setting the right virtual warehouse properties, understanding the role of micro-partitions, utilizing materialized views to reduce computation on large datasets, and employing services like Search Optimization and Query Acceleration to expedite specific types of queries.
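To make the scale-up versus scale-out distinction concrete, here is a sketch using a hypothetical warehouse MY_WH and made-up table names (multi-cluster warehouses require Enterprise edition or higher):

```sql
-- Scale up: a larger warehouse size helps individual, complex queries
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale out: extra clusters help many concurrent queries (Enterprise edition+)
ALTER WAREHOUSE my_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY = 'STANDARD';

-- Targeted services for specific query patterns (placeholder object names)
ALTER TABLE my_db.my_schema.events ADD SEARCH OPTIMIZATION;
ALTER WAREHOUSE my_wh SET ENABLE_QUERY_ACCELERATION = TRUE;
```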

From my cheat sheet:

  • For better pruning and lower storage consumption, it is good practice to flatten your OBJECT and key data into separate relational columns.
  • You cannot set a storage quota at any level. You can set a compute quota at the account or warehouse level using a resource monitor.
  • You might consider disabling the auto-suspend functionality in two cases:
    • if you have a heavy, steady workload;
    • if you require the warehouse to be available at all times, with no resume delay.
  • Disabling the auto-suspend functionality might incur more compute costs (see the example after this list).
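A sketch of the last two points, with made-up names and an illustrative quota value; the resource monitor caps compute spend, while auto-suspend is turned off for an always-on warehouse:

```sql
-- Cap compute spend with a resource monitor (quota value is illustrative)
CREATE RESOURCE MONITOR monthly_quota WITH
  CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 90 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

-- Attach it at the account level (it can also be assigned per warehouse)
ALTER ACCOUNT SET RESOURCE_MONITOR = monthly_quota;

-- Keep a steady-workload warehouse always running; expect higher compute costs
ALTER WAREHOUSE my_wh SET AUTO_SUSPEND = 0;
```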

About caching features 

Caching is a powerful feature in Snowflake that significantly improves query performance by storing results and data in fast-access storage.

This means subsequent queries can retrieve data from the cache rather than processing the entire dataset again. Effective use of caching involves understanding when and how data is cached and strategically writing queries to leverage this feature, especially for frequently accessed data.
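A small experiment I find useful for seeing the result cache at work (the shared sample database may not be enabled in every account): run the same query twice, then switch result-cache reuse off at the session level to compare timings.

```sql
-- The first run computes on the warehouse; an identical re-run shortly after
-- can be served from the result cache without using any compute
SELECT c_mktsegment, COUNT(*) AS customers
FROM snowflake_sample_data.tpch_sf1.customer
GROUP BY c_mktsegment;

-- Turn off result-cache reuse for this session to compare timings
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```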

For a more detailed explanation, I highly recommend the Snowflake Level Up: Query History & Caching course.

Monitoring continuous data pipelines

Continuous data pipelines in Snowflake, including Snowpipe, tasks, and streams, are essential for real-time data processing and analytics.

Monitoring these pipelines is crucial to ensure they operate at peak efficiency and data flows smoothly without interruptions. Regular monitoring helps identify and rectify performance degradation, ensuring that data latency remains minimal and that compute resources are adequately provisioned.
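Before the cheat sheet, here is a sketch of the checks I would run, with placeholder pipe, table, and task names:

```sql
-- Is the pipe running and is its queue draining? (placeholder pipe name)
SELECT SYSTEM$PIPE_STATUS('my_db.my_schema.my_pipe');

-- Files loaded into a table over the last 24 hours (placeholder table name)
SELECT file_name, status, row_count, last_load_time
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
       TABLE_NAME => 'MY_TABLE',
       START_TIME => DATEADD('hour', -24, CURRENT_TIMESTAMP())));

-- Recent task runs, including failures and their error messages
SELECT name, state, error_message, scheduled_time
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
ORDER BY scheduled_time DESC
LIMIT 20;
```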

From my cheat sheet:

  • Snowpipe facilitates the immediate loading of data from files once they arrive in a stage. This allows for data to be loaded in micro-batches, making it accessible to users in just minutes, as opposed to the manual execution of scheduled COPY statements for loading larger batches.
  • A task in Snowflake can execute various types of SQL code, including a single SQL statement, a call to a stored procedure, or procedural logic using Snowflake Scripting. When paired with table streams, tasks support continuous ELT workflows by processing newly changed rows with exactly-once semantics. Tasks can also function independently to periodically generate reports by inserting or merging rows into a report table or performing other regular operations.
  • A stream does not hold any table data itself but stores an offset for the source object, using the versioning history to return CDC (Change Data Capture) records. When a stream is initially created for a table, several hidden columns are added to the source table to store change tracking metadata, which occupies a minimal amount of storage (see the sketch after this list).
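To tie the stream and task pieces together, a minimal sketch with made-up table, stream, and warehouse names; the task only spins up the warehouse when the stream actually has new rows:

```sql
-- Capture changes on a source table (all names are placeholders)
CREATE OR REPLACE STREAM orders_stream ON TABLE raw.orders;

-- Process new rows on a schedule, but only when there is something to do
CREATE OR REPLACE TASK process_orders
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('ORDERS_STREAM')
AS
  INSERT INTO analytics.orders_clean
  SELECT order_id, amount
  FROM orders_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created suspended; resume the task to start the schedule
ALTER TASK process_orders RESUME;
```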

 

Resources to pass the Snowflake Data Engineer Certification Exam:

Author

  • Darko Monzio Compagnoni

    Before becoming an analytics engineer, I worked in marketing, communications, customer support, and hospitality. I noticed how each of these fields, in its own way, benefits from decisions backed by data. Which fields don’t, after all? After spotting this pattern, I decided to retrain as a self-taught data analyst and then complete the Nimbus Intelligence Academy program, graduating as an Analytics Engineer with certifications in Snowflake, dbt, and Alteryx. I'm now equipped to bring my unique perspective to any data-driven team.

