In this blog post, one of a five-part series, I dive into the five domains assessed in the Snowflake SnowPro® Advanced Data Engineer exam, as outlined in the official SnowPro Advanced: Data Engineer Certification Exam Study Guide. Writing about the concepts that are covered in the guide helps me memorize and prepare for the exam.
If you’re still in the process of preparing for the Snowflake SnowPro Core Certification, the Snowflake Data Engineer Certification is probably too advanced. I recommend reading some of these blog posts first. They collect the experience of my fellow Nimbus Intelligence colleagues, who share their own approaches to obtaining the certification. Perhaps their study methods are more aligned with what works for you.
If you’re aiming for the Snowflake Data Engineer Certification, keep reading.
Loading data into Snowflake
“No Martini, no party” was the tagline of a popular old TV commercial starring George Clooney. In data engineering, we can translate this into “no data loading, no insights.”
When loading data into Snowflake, it’s crucial to consider a few things: source, volume, frequency, and cleanliness of the data. These factors vary according to the business needs, and determine the optimal method of loading (bulk or streaming) and the tools required.
In Snowflake, data loading can be managed through various features, the main ones being:
- the COPY INTO command, for bulk operations (also possible via the web interface);
- Snowpipe, for near-real-time ingestion.
The impact of these methods on performance and costs can be significant. Keep in mind that efficient data loading strategies minimize latency and optimize computing resources, which are key to managing expenses in the cloud.
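To make this concrete, here is a minimal sketch of both approaches, assuming a hypothetical target table raw_orders and hypothetical stages (the Snowpipe example further assumes an external stage with cloud event notifications already configured):

```sql
-- Bulk load: copy staged CSV files into a table in one batch operation.
COPY INTO raw_orders
  FROM @my_int_stage/orders/
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Near-real-time ingestion: a pipe wraps a COPY statement so Snowpipe can run it
-- automatically whenever new files land in the external stage.
CREATE OR REPLACE PIPE orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_orders
    FROM @my_ext_stage/orders/
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```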
From my cheat sheet:
- Snowflake creates metadata when you load data from a file in an internal stage into a Snowflake table.
- Pattern matching is generally the slowest option when using the COPY INTO command, while specifying an explicit list of files is the fastest. When using a regex, keep the wildcard at the end of the pattern to speed up scanning (see the sketch after this list).
- Organize and name files logically in the source, when possible. United_States/Colorado/Denver/20230129/filename.csv is better than Colorado/20230129/Denver/United_States/filename.csv.
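As a rough illustration of the difference (stage, table, and file names are made up):

```sql
-- Fastest: name the exact files to load.
COPY INTO sales
  FROM @my_stage/United_States/Colorado/Denver/20230129/
  FILES = ('filename.csv');

-- Slower: let Snowflake match a regular expression against the staged paths.
-- Keeping the fixed prefix in the path and the wildcard at the end narrows the scan.
COPY INTO sales
  FROM @my_stage/United_States/Colorado/Denver/
  PATTERN = '.*20230129/.*[.]csv';
```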
Ingesting data of different formats
Snowflake supports a wide array of data formats (CSV, JSON, Avro, ORC, Parquet, and XML), making it versatile for ingesting both structured and semi-structured data. Semi-structured formats can be parsed at a later stage to facilitate their consumption and the development of apps and dashboards.
The ingestion process typically involves staging the data, which can be done in Snowflake’s managed storage (internal stages) or through external stages like Amazon S3 or Google Cloud Storage. This staged approach allows for preprocessing and batch processing of large datasets, facilitating more efficient data management and query performance.
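A hedged sketch of both staging options (the stage names, bucket, and storage integration below are placeholders):

```sql
-- Internal stage: files are held in Snowflake-managed storage.
CREATE OR REPLACE STAGE my_int_stage
  FILE_FORMAT = (TYPE = JSON);

-- External stage: files stay in your own cloud storage, here a hypothetical S3 bucket
-- accessed through a pre-existing storage integration.
CREATE OR REPLACE STAGE my_ext_stage
  URL = 's3://my-bucket/landing/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (TYPE = PARQUET);
```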
Make sure to familiarize yourself with the characteristics of the different data formats, and the different types of sources and stages.
From my cheat sheet:
- The three Snowflake data types used for semi-structured data are VARIANT, OBJECT, and ARRAY. VARIANT is the universal data type.
- You can define the file format settings for your staged data in CREATE TABLE, CREATE STAGE, or COPY INTO TABLE.
- The VALIDATION_MODE parameter validates the data files without loading them. If it returns errors, you can analyze them by querying the RESULT_SCAN table function. COPY transformations are not supported when using this parameter (see the sketch after this list).
- STRIP_NULL_VALUES: Boolean that instructs the JSON parser to remove object fields or array elements containing null values. It’s used to reduce the file size. The default value is FALSE.
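A minimal sketch of the VALIDATION_MODE workflow from the list above, using hypothetical names:

```sql
-- Dry run: check the staged JSON files for errors without loading anything.
COPY INTO raw_events
  FROM @my_stage/events/
  FILE_FORMAT = (TYPE = JSON STRIP_NULL_VALUES = TRUE)
  VALIDATION_MODE = RETURN_ERRORS;

-- Inspect the errors reported by the validation run that just executed.
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
```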
Troubleshooting data ingestion
A famous proverb states, “Invite trouble and you’re in for double.” This applies to data ingestion as well, which is why knowing how to troubleshoot it properly is crucial.
In Snowflake, troubleshooting data ingestion issues often revolves around identifying the root cause, such as data corruption, format inconsistencies, or connectivity problems. The error messages and load history that Snowflake provides are a good place to start looking for that root cause.
Once identified, resolutions may include adjusting file formats, fixing corrupt data, or optimizing connection settings to ensure a reliable data flow. Effective troubleshooting ensures data accuracy and availability, which are critical for downstream analytics and business intelligence.
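For example, two built-in functions I reach for when digging into load errors (the table name is hypothetical):

```sql
-- Load history for a table over the last 24 hours, including error counts and messages.
SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'RAW_ORDERS',
  START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())));

-- Rows rejected by the most recent COPY INTO executed against the table.
SELECT * FROM TABLE(VALIDATE(raw_orders, JOB_ID => '_last'));
```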
From my cheat sheet:
- Docs: Troubleshooting Snowpipe.
- Docs: Troubleshooting bulk data loads.
- Docs: Troubleshooting processing of unstructured data.
Designing, building, and troubleshooting data pipelines
Continuous data pipelines in Snowflake are built using stages for data collection, tasks for automated processing, and streams for capturing changes. Snowpipe plays a critical role here, allowing for automatic loading of data as soon as it arrives in the staging area.
The choice between auto-ingest and the Snowpipe REST API also affects how quickly data is available for use. Efficient pipeline design ensures minimal latency and maximizes the real-time decision-making capabilities of an organization.
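A stripped-down continuous pipeline, assuming hypothetical raw_orders and clean_orders tables and a warehouse named transform_wh:

```sql
-- Stream: captures the changes on the raw table since it was last consumed.
CREATE OR REPLACE STREAM raw_orders_stream ON TABLE raw_orders;

-- Task: runs on a schedule, but only does work when the stream has new data.
CREATE OR REPLACE TASK transform_orders
  WAREHOUSE = transform_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('RAW_ORDERS_STREAM')
AS
  INSERT INTO clean_orders
  SELECT order_id, amount, order_date
  FROM raw_orders_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created suspended; resume the task to start the schedule.
ALTER TASK transform_orders RESUME;
```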
From my cheat sheet:
- It is beneficial to remove loaded files from stages, to improve load performance and ensure that data isn’t inadvertently loaded again.
- When staging regular data sets, Snowflake recommends partitioning the data into logical paths that include identifying details such as geographical location or other source identifiers, along with the date when the data was written.
- Of the three options for identifying/specifying data files to load from a stage, providing a discrete list of files is generally the fastest. Note, however, that the FILES parameter supports a maximum of 1,000 files per COPY command (see the sketch after this list).
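Two hedged ways to keep a stage clean after loading, with made-up names:

```sql
-- Option 1: delete files from the stage automatically once they have loaded successfully.
COPY INTO raw_orders
  FROM @my_stage/orders/
  FILES = ('orders_20230129.csv')
  PURGE = TRUE;

-- Option 2: clean up explicitly afterwards.
REMOVE @my_stage/orders/ PATTERN = '.*20230129.*';
```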
Analyzing and differentiating types of data pipelines
Differentiating types of data pipelines involves understanding when to use batch processing versus real-time streams.
In Snowflake, creating User-Defined Functions (UDFs) and utilizing stored procedures can customize data transformations and processing. The integration of Snowpark allows developers to build sophisticated data pipelines within Snowflake’s framework, while the Snowflake SQL API facilitates complex queries and integrates with external applications, enhancing the pipeline’s functionality and flexibility.
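As a small illustration, a scalar SQL UDF (entirely hypothetical) that a pipeline could reuse inside its transformations:

```sql
-- Scalar SQL UDF: returns the amount including VAT.
CREATE OR REPLACE FUNCTION total_with_vat(amount NUMBER, vat_rate NUMBER)
  RETURNS NUMBER
AS
$$
  amount * (1 + vat_rate)
$$;

-- Used like any built-in function.
SELECT order_id, total_with_vat(amount, 0.21) AS total
FROM clean_orders;
```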
Using connectors to connect to Snowflake
Connectors play a pivotal role in integrating Snowflake with various data sources and applications. Proper installation and configuration of connectors such as JDBC, ODBC, Python, or Kafka ensure seamless data flow between systems. Each connector has specific settings that need to be tuned according to the operational environment and data requirements, which is essential for maintaining data integrity and performance across platforms.
On the Snowflake Marketplace you can find the Nimbus Intelligence Connectors, offering an effortless data integration solution for your software. These apps allow you to seamlessly connect to a software API and easily retrieve data.
Follow this link to access the Nimbus Connectors Documentation.
Data sharing on Snowflake
Effective data sharing solutions in Snowflake involve implementing data shares, secure views, and row-level filtering to manage data access and security. Data shares allow organizations to provide access to selected data across different Snowflake accounts without duplicating data physically.
Secure views and row-level filtering protect sensitive data by making it accessible only to authorized users, in compliance with data governance and privacy standards.
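A hedged sketch of the provider side (database, schema, view, and the consumer account identifier are placeholders):

```sql
-- Secure view: hides the definition and applies row-level filtering before sharing.
CREATE OR REPLACE SECURE VIEW sales_db.public.v_sales_eu AS
  SELECT * FROM sales_db.public.sales WHERE region = 'EU';

-- Share: grant access to the objects, then add the consumer account.
CREATE OR REPLACE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON VIEW sales_db.public.v_sales_eu TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = my_org.consumer_account;
```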
From my cheat sheet:
- When you share data with a consumer account, the consumer pays for the compute they use, because they have their own Snowflake account.
- In the case of reader accounts, the provider pays for the compute costs.
External tables: when to use them? How do they work?
External tables in Snowflake are useful for managing large datasets that do not need to be loaded into Snowflake storage. These tables can be queried directly where they reside, such as in an S3 bucket, using SQL commands. Features like partitioning external tables improve query performance by minimizing the data scanned during queries. Additionally, the use of materialized views and partitioned data unloading can optimize the management and accessibility of large-scale data, making external tables an efficient tool for specific data handling needs.
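A rough example of a partitioned external table over a hypothetical external stage, assuming files are laid out as sales/<YYYYMMDD>/file.parquet:

```sql
-- External table: the Parquet files stay in the bucket behind the stage; Snowflake stores only metadata.
CREATE OR REPLACE EXTERNAL TABLE ext_sales (
    sale_date DATE AS TO_DATE(SPLIT_PART(METADATA$FILENAME, '/', 2), 'YYYYMMDD'),
    store_id VARCHAR AS (VALUE:store_id::VARCHAR),
    amount NUMBER AS (VALUE:amount::NUMBER)
  )
  PARTITION BY (sale_date)
  LOCATION = @my_ext_stage/sales/
  AUTO_REFRESH = FALSE
  FILE_FORMAT = (TYPE = PARQUET);

-- Filtering on the partition column lets Snowflake prune whole files instead of scanning everything.
SELECT store_id, amount
FROM ext_sales
WHERE sale_date = '2023-01-29';
```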
From my cheat sheet:
- Benefits of micro-partitioning:
  - Micro-partitions are derived automatically.
  - Micro-partitions enable extremely efficient DML and fine-grained pruning for faster queries.
  - Columns are stored independently within micro-partitions.
  - Columns are compressed individually within micro-partitions.
Resources to pass the Snowflake Data Engineer Certification Exam:
- The official SnowPro Advanced: Data Engineer Certification Exam Study Guide
- Frosty Fridays, a series of weekly challenges released every Friday to help you practice and develop your Snowflake skills, created by Snowflake users, for Snowflake users.
- Udemy, for courses specific to the Snowflake Data Engineer Certification.