Best practices Snowflake

Author

labcitrico

Published on

19 April 2021

Reading time

16 min read

1- Introduction

Snowflake is one of the few enterprise-ready cloud data warehouses that brings simplicity without sacrificing features. It automatically scales, both up and down, to get the right balance of performance vs. cost. Snowflake’s claim to fame is that it separates compute from storage. This is significant because almost every other database, Redshift included, combines the two, meaning you must size for your largest workload and incur the cost that comes with it.

In this document, we will focus on some of the best practices based on Snowflake’s recommendation and previous experience.

2- Best Practice — Efficient Table Design

i. Choosing the appropriate data types helps in improving query performance.

Date and TimeStamp are stored more efficiently than VARCHAR on Snowflake.
Snowflake recommends choosing a date or timestamp data type for storing date and timestamp fields instead of a character data type.

ii. It is recommended to use Referential Integrity Constraints

They provide valuable metadata that users can use to understand the schema and the relationships defined between tables.
In Snowflake, referral integrity constraints are not enforced by default. When created, they are disabled. NOT NULL is an exception and is enforced by default.

iii. For big data sets, Clustering is a good practice and helps improve query performance.

One will have to manually run a DML statement to re-cluster a table.
The existing data will be re-written in a different order upon re-clustering. Snowflake saves the previous order for 7 days to provide Fail-safe protection, one of Snowflake’s cloud data warehouse’s most lauded features.
Re-clustering a table on Snowflake costs additional dollars. This is directly proportional to the size of the data set that is re-ordered.

3- Best Practice — Data Storage

Snowflake provides an array of features for data that is stored. Continuous Data Protection (CDP) which includes Fail-Safe and Time Travel is given to all Snowflake accounts for no additional cost. This does not mean that CDP will not have an impact on your storage costs. On the contrary,

Your account will be charged for all the data stored in schemas, tables, and databases created in your Snowflake architecture. Based on the data stored and the duration for which it is stored, CDP has an impact on the storage costs.
Until the data leaves the Fail-safe state, storage costs will be incurred. This means that you pay for the data storage irrespective of whether it is in an Active, Time-travel, or Fail-safe State. Hence, it is important to make the right storage considerations.

4- Best Practice — Data Staging

To assist in loading bulk data into tables, Snowflake has a feature called stages where files that have the data to be loaded are staged.

Snowflake allows both internal (within Snowflake) and external (S3, Azure) stages.
No additional cost is charged for Time Travel and Fail-safe features for data stored in internal stages within Snowflake. However, standard data storage costs apply.
While performing Snowflake ETL, it is a good practice to stage regular data sets by partitioning them into logical paths. This could include details such as source identifiers or geographical location, etc., along with the date when the data was written.
Staging will allow you to take advantage of Snowflake’s parallel operations by letting you execute concurrent COPY statements that match a subset of files.

5- Best Practice — Data Cloning

Snowflake has a zero-copy cloning feature that gives an easy way to take a “snapshot” of any schema, table, or database.

· Cloning creates a derived copy of that object which initially shares the underlying storage. This can come in handy when creating instant backups.

· Cloning does not incur any additional costs if you do not need to make any changes to the cloned object.

6- Best Practice — Data Loading Considerations

i. General File Sizing Recommendations

While performing Snowflake ETL, it is important to optimize the number of parallel loads into Snowflake. It is recommended to create compressed data files that are roughly 10 MB to 100 MB in size.
Aggregate the smaller files to reduce processing overhead. Split the large files into several smaller files for faster load. This allows you to distribute the load between servers in the active Snowflake warehouse.

ii. Data Size Limitations for Semi-Structured Data

The VARIANT data type has a 16 MB (compressed) size limit on the individual rows.
For efficiency enhancement, while executing the COPY INTO <table> command it is recommended to enable the STRIP_OUTER_ARRAY file format option. This will load the records into separate table rows by removing the outer array structure.

iii. Data Size Limitations of Parquet files

It is recommended to split parquet files that are greater than 3GB in size into smaller files of 1GB or lesser for smooth loading. This will ensure that the loading does not timeout.

iv. Preparing Delimited Text Files

The following points must be considered while preparing CSV/Delimited text files for loading:

Files must have data in ASCII format only. The default character set is UTF-8. However, additional encodings can be mentioned using the ENCODING file format option.
Within the files, records and fields should be delimited by different characters. Note, that both should be a single (necessarily not the same) character. Pipe (|), caret (^), comma (,), and tilde (~) are common field delimiters. Often the line feed (n) is used as a row delimiter.
Fields that have delimiter should be enclosed in single or double-quotes. If the data being loaded contains quotes, then those must be escaped.
Fields that have carriage returns (r n) should be enclosed in single or double quotes too. In the windows system, carriage returns are commonly introduced along with a line feed character to mark the end of a line.
Each row should have the same number of columns.

7- Best Practice — Improving Load Performance

Use bulk loading to get the data into tables in Snowflake.
Consider splitting large data files so the load can be efficiently distributed across servers in a cluster.
Delete from internal stages files that are no longer needed. You may notice improved performance in addition to saving on costs.
Isolate load and transform jobs from queries to prevent resource contention. Dedicate separate warehouse for loading and querying operations to optimize performance for each.
Leverage the scalable compute layer to do the bulk of the data processing.
Consider using SnowPipe in micro-batching scenarios.

8- Best Practice — Data Caching

While Caching is an automatic behaviour, there are two best practices you can implement to maximize cache usage and speed query performance.

Firstly, when segmenting query workload, you should place users querying the same data on the same virtual warehouse. This maximizes the chances that data retrieved to the cache by one user will also be used by others.
Finally, be aware that the result cache is completely independent of the virtual warehouse, and any query executed by any user on the account will be served from the result cache, provided the SQL text is the same.

9- Best Practice — Data Querying

i. Dedicated warehouse for Querying:

Snowflake automatically caches data in the Virtual Warehouse (local disk cache), so place users querying the same data on the same virtual warehouse. This maximizes the chances that data retrieved to the cache by one user will also be used by others. The suspending warehouse will erase this cache.
Result Cache is maintained by the Global Services layer, any query executed by any user on the account will be served from the result cache, provided the SQL text is the same. Results are retained for 24 hours.
Snowflake Query Profile feature helps us to analyze queries being run from BI tools as well. In case, you have multiple BI tools and common users, having a dedicated warehouse for each BI tool will help to identify queries generated from the BI tool.

10- Best Practice — Warehouse Size

Snowflake offers pay-per-second billing. They allow the different sizes of the warehouse (Large, X-Large, 2X-Large, etc.). Create separate warehouses for different environments such as development, testing, and production.

For queries in development or testing environments, smaller warehouse sizes (such as X-Small, Small, Medium) may be enough.
For queries in production environments, larger warehouse sizes (such as Large, X-Large, 2X-Large, etc.) may be enough and cost-effective.
From a performance point of view, it is always better to create a separate warehouse for your data loading and query execution. Start with a smaller size and based on the performance, you can manually resize the warehouse.
Increasing the size of a warehouse does not always improve data loading performance. Data loading performance is influenced more by the number of files being loaded (and the size of each file) than the size of the warehouse.
Credit charges are calculated based on:

The number of Clusters (if using a multi-cluster warehouse)
The number of servers per cluster (a derivative of warehouse-size)
The time duration that each cluster in each server runs for.

11- Best Practice — Warehouse Optimizations

i. Enable Auto-Suspend

Make sure all virtual warehouses are set to auto-suspend. This way, when they are done processing queries, auto-suspend will turn off your virtual warehouses, and thus stop credit consumption.

ii. Enable Auto-Resume

Make sure all virtual warehouses are set to auto-resume. If you are going to implement auto-suspend and set appropriate timeout limits, enabling auto-resume is a must; otherwise, users will not be able to query the system.

iii. Set Timeouts Appropriately for Workloads

All virtual warehouses should have an appropriate timeout for their particular workload:

For the task, data loading, and ETL/ELT warehouses set the timeout for suspension immediately upon completion.
For BI and SELECT query warehouses, set the suspension timeout to 10 minutes in most situations to keep data caches warm for frequent access by end-users.
For DevOps, DataOps, and data science warehouses, set the suspension timeout to 5 minutes because having a warm cache is not as important for ad hoc and highly unique queries.

iv. Set Account Statement Timeouts

Use the STATEMENT_QUEUED_TIMEOUT_IN_SECONDS and STATEMENT_TIMEOUT_IN_SECONDS parameters to automatically stop queries that are taking too long to execute, either due to a user error or a frozen cluster. Customize warehouse, account, session, and user timeout-level statements according to your data strategy for long-running queries.

Here’s an example:

ALTER WAREHOUSE LOAD_WH SET STATEMENT_TIMEOUT_IN_SECONDS= 3600;
SHOW PARAMETERS IN WAREHOUSE LOAD_WH

v. Drop Unused Tables

You might have unused tables that are candidates to be dropped. Just make sure no one is querying these tables. In fact, you might want to make it mandatory to check all tables before deletion. (If you have Time Travel set up, you can un-drop a table if you make an error.) This is specific to the database context, so be sure to look at tables across your databases. Also, be mindful of tables used only because of DDLs.

vi. Find Warehouses That Don’t Have Resource Monitors

Resource monitors are a great way to proactively control workload budgets and prevent unexpected resource spikes. Resource monitors can help monitor both user usage and service account usage in Snowflake. First, you should have dedicated virtual warehouses for each of your loading, ELT, BI, reporting, and data science workloads as well as for other workloads. Accounts and warehouses can have total, yearly, monthly, weekly, and daily credit quotas.

vii. Apply Resource Monitors

You can use the UI or SQL to apply your resource monitor policy. Based on account preference settings, resource monitors can notify you when consumption reaches a lower threshold, and then suspend the warehouse or account at a higher threshold.

Considerations for Resource Monitoring

We recommend setting monitors to notify you when a certain threshold of consumption is reached.
When consumption approaches the maximum budgeted level, set the resource monitor to auto-suspend the warehouse or the entire account, allowing queries to complete but preventing future requests.
Resource monitors can also be used to terminate all currently running queries and immediately suspend the resource or account. This setting is usually reserved for situations where a hard quota is exceeded.
For customers that do not want to set hard limits, it’s still always a good idea to have notification monitors set on all warehouses in case of usage unexpectedly spikes. That way, all admins within the account will get an email or on-screen notification when thresholds are reached.

The following figure shows the resource monitor configuration screen:

12- Best Practice — Scaling Data Warehouse

i. Scale- Up

Snowflake allows for a scale-up in the virtual data warehouse to better handle large workloads. When using scale-up to improve performance, make note of the following:

Snowflake supports fast and easy adjustments to the warehouse-size to handle the workload.
It can also automatically suspend or resume the scale-up, with complete transparency for the user.

Snowflake’s scale-up functionality supports the continually changing requirements for processing.

ii. Scale-Out

Snowflake supports the deployment of same-size clusters to support concurrency. Keep these points in mind for how scale-out can help performance optimization:

As users execute queries, the virtual data warehouse automatically adds clusters up to a fixed limit.
It can scale-up in a more controlled way instead of deploying one or more clusters of larger machines like legacy data platforms.

Snowflake automatically adjusts based on user queries, with automatic clustering during peak and off-hours as needed.

iii. Enable Auto-Scaling

If you are using an enterprise edition of Snowflake, multi-cluster warehouses should be configured to run in an Auto-scale mode, which enables Snowflake to automatically start and stop clusters as needed.

13- Best Practice — Leverage Materialized Views

Materialized views can improve the performance of queries that repeatedly use the same subquery results. Use materialized views when

The query contains a small number of rows and/or columns relative to the base table
The query contains results that require significant processing
The query is against file-based data
Results don’t change often
Results are used often
The query consumes a lot of resources

You can create a materialized view like this:

CREATE MATERIALIZED VIEW MV1 AS
  SELECT * FROM TAB1 WHERE COLUMN_1 = x OR COLUMN_1 = y;

And use it like this:

SELECT * FROM MV1 WHERE COLUMN_1 = y;

But be careful! There are some gotchas:

A background service automatically updates the materialized view after changes are made to the base table
The maintenance of materialized views consumes credits
Compare materialized views vs. scaling up compute to solve performance issues.

14- Best Practice — Segment Data

Snowflake caches data in the virtual data warehouse, but it’s still essential to segment data. Consider these best practices for data query performance:

Group users with common queries in the same virtual data warehouse to optimize data retrieval and use.
The Snowflake Query Profile supports query analysis to help identify and address performance concerns.

Snowflake draws from the same virtual data warehouse to support complex data science operations, business intelligence queries, and ELT data integration.

15- Best Practice — Using Snowflake Query Profile

The first step to Snowflake optimization is to discover the source of performance bottlenecks. Use Snowflake’s Query Profile tool to diagnose issues in slow-running queries. Like similar database management tools, Query Profile lets you explore how Snowflake executed your query and what steps in the process are causing slowdowns.

Once you have identified the nature of the problem, use these seven best practices to help optimize your query speed and overall Snowflake performance.

16- Benefit from a range of Integrations and Partnerships

Snowflake’s ecosystem is designed for flexibility, openness, and extensibility without exposure to security threats. That means you benefit from partnerships and seamless integrations with vendors of your choice, including vendors for BI tools, data science technologies, marketing and sales automation software, CRM solutions, and machine learning.

Another area you might use vendors for is security analytics. Snowflake enables you to connect to existing SIEM software, fraud products, and threat intelligence feeds. Also, Snowflake provides built-in security options from leading BI partners, including Tableau, Looker, Sigma Computing, and Periscope Data, so you can create a wide range of user interfaces, visualizations, and reports that align with your needs, processes, and workflows.

17- Additional Features

i. Secure Data Sharing

Let’s put this to the test. In your organization, if you need to pass data to a team in another company in a secure way, how would you do it? Secure FTP? Ugh! An expensive proprietary tool? Blegh! Or — worst of all — an email (hopefully you would at least password-protect it)? Don’t you dare! If any of these are the case, you really need to consider Snowflake Secure Data Sharing.

Snowflake not only supports view-based data sharing but has enhanced this by revolutionizing how organizations distribute and consume shared data. Unlike FTP and email, Snowflake Data Sharing is far easier to use as it provides instant access to live data and eliminates data copying. The Snowflake Built for the Cloud Architecture enables Data Sharing without any complicated ETL and setup and more importantly allows you to control security in one single place, at the data level.

With Secure Data Sharing, no actual data is copied or transferred between accounts. All sharing is accomplished through Snowflake’s unique services layer and metadata store. This is an important concept because it means that shared data does not take up any storage in a consumer account and, therefore, does not contribute to the consumer’s monthly data storage charges. The only charges to consumers are for the compute resources (i.e. virtual warehouses) used to query the shared data.

Real-time secure sharing of data is ready as soon as a data provider adds to or updates data for the end consumers. There are out of the box advantages to this approach such as:

Immediate access: no transformation, data movement, loading, or reconstruction is required, and data is available for immediate use.
Live data: changes made in real-time by a data provider are immediately available to data consumers without effort. Thus, data is always current.
Secure Managed access: a data provider can share all their data to N number of data consumers in a secure, governed manner with minimal effort. The flexibility of controlling all aspects of managed access increases the ease of use.
No ETL management: users do not need to manage any ETL for secure data sharing capabilities and operations.
Access Logs: Data providers can track and monitor access to Data Shares and quickly respond to the user’s actions.

ii. Data Marketplace

A data marketplace is an online transactional location or store that facilitates the buying and selling of data. As many businesses seek to augment or enrich internal data sets with external data, cloud-based data marketplaces are appearing at a growing rate to match data consumers with the right data sellers.

Snowflake Data Marketplace gives data scientists, business intelligence and analytics professionals, and everyone who desires data-driven decision-making, access to more than 375 live and ready-to-query data sets from more than 125 third-party data providers and data service providers (as of January 29, 2021).

USE CASES

Data Augmentation: Source quality, third-party data to augment your own data sets and make better business decisions.
Data Analysis: Get personalized data feeds from vendors you already work with or request customized data sets based on your business requirements from the many providers in the Data Marketplace.
Data Science: Discover new data sets your team can use to train AI/ML models and to improve data science applications.
Data Enrichment: Take advantage of bidirectional data exchange and enrich your own data by leveraging data enrichment services from different vendors.

Original source article