Snowflake Study Guide
I recently obtained my SnowPro certification and I want to share my preparation materials with you. I gathered information from the Snowflake community, documentation, and online training through Udemy and Snowflake. I hope this collection of resources will help you in your journey towards becoming a SnowPro.
After passing the test the other day, I strongly advise you to thoroughly read the guides and materials available at https://docs.snowflake.com/en/guides-overview. The test consists of 100 questions, and some of them are quite obscure. I found most of these obscure questions in the guide after taking the test. If you have any questions, feel free to DM me on LinkedIn.
General Information
- Analytic Data Warehouse
- Built by industry professionals from Oracle, it has the look and feel of a traditional DBMS
- SaaS Offering
- No hardware
- No maintenance (handled by Snowflake)
- Cloud only (no on-premises)
- Cloud
- AWS, Azure, GCP or VPC (cloud only)
- Snowflake Docs: Regions
- Decoupled compute and storage
- Editions
- Standard - Entry level product
- Enterprise - Adds new functionality like time travel, row level security, object tagging, classification, materialized views.
- Business Critical - For organizations that need PHI compliance (HIPAA, PCI, etc) and high uptime.
- VPC - Private Snowflake
- Workloads:
- AI/ML
- Applications (Transactional capability added in 2022)
- Data Engineering
- Data Lake
- Data Warehouse
- Unistore (Data Cloud)
Costs
- Function of Storage, Compute, and Edition
- Costs are in the form of credits
- Different regions/clouds may have different costs per credit.
- Generally $2 for standard, $3 for Enterprise, and $4 for business critical
- Credits are consumed by a running compute cluster
- Pre-purchased storage capacity is approximately $23/TB per month, depending on the cloud
- Data transfer costs
- Same Cloud/Region - 0
- Same cloud, different region - varies, starting at $20/TB (US)
- Different cloud, different region - varies, starting at $90/TB
- Snowflake Docs: Pricing
Architecture
- Snowflake is a self managed platform
- No server or software maintenance
- Hybrid of traditional shared-disk and shared-nothing database architecture
- Shared Disk - Central data repository accessible by all nodes
- Shared-nothing - Using MPP clusters where each node stores a portion of the entire data set locally
- 3 Layers
- Database Storage
- Query Processing
- Cloud Services
Layer 1: Database Storage
- Stored in an internal, optimized, compressed columnar format (micro-partitions)
- Snowflake manages how data is stored
- Objects stored by Snowflake are not visible to customers
Layer 2: Query Processing
- Query processing layer
- Uses virtual warehouse (compute clusters)
- Each cluster is independent and doesn't share resources
- Each warehouse is an MPP cluster
- https://docs.snowflake.com/user-guide/warehouses
Layer 3: Cloud Services
- Collection of services that coordinate activities across Snowflake
- Includes:
- Authentication
- Infrastructure Management
- Metadata Management
- Query parsing and optimization
- Access Control
- Customers don’t have direct access to change
- Associated terms are Query Planning, Optimization, and Compilation (*Note - not Processing or Design)
Connecting to Snowflake
- WebUI
- Classic
- Snowsight
- Worksheet Organization and Sharing
- Coding Productivity with Autocomplete
- Visualization - Limited at the time, but basic dashboard functions
- Collaboration
- Administration
- Command Line (SnowSQL)
- Connect with `snowsql -a {ORGNAME}-{ACCOUNTNAME} -u USER`
- ODBC/JDBC
- Libraries (Python, Spark, Kafka, .Net, Node.js)
- Third Party (Matillion, Fivetran, etc)
- https://docs.snowflake.com/en/user-guide/connecting
- Connections
- APP: https://app.snowflake.com
- URL with Account ID and Cloud: https://qm79258.east-us-2.azure.snowflakecomputing.com/console/login#/ (Example)
Cache
- Snowflake caches queries to improve performance
- Metadata Cache
- Cloud services layer
- Improves compile time for queries and commonly used tables
- Result Cache
- Cloud services layer
- Holds query results for 24 hours
- Available across warehouses and users
- Local Disk / Warehouse Cache
- Storage layer
- Caches data used by SQL queries on its local SSD and memory
- Improves query performance
- Remote Disk
- Holds long term storage
- Data resilience and durability (for data center failure)
Virtual Warehouses (Compute Clusters)
- Cluster of compute resources
- 2 Types:
- Standard
- Snowpark-optimized - Recommended for workloads with large memory requirements
- X-Small and Small sizes not available
- Sizes
- X-Small
- Small
- Medium
- Large
- X-Large to 6X-Large
- Any transaction requires a warehouse
- Auto-suspend: default is 10 minutes
- Auto-resume: starts as needed
- Multi-cluster Warehouse
- 1 to 10 clusters
- Can autoscale up and down (horizontal scaling)
- Increases clusters as needed for performance, and scales down as load drops
- Helps with concurrency and not slow running queries
- Scaling Policy
- Standard - favors starting clusters over conserving credits
- Economy - favors conserving credits over starting clusters
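Putting the warehouse options above together, a minimal sketch of a multi-cluster warehouse DDL (the warehouse name and settings are illustrative):

```sql
-- Multi-cluster warehouse: scales horizontally between 1 and 3 clusters,
-- suspends after 5 minutes idle, resumes automatically when queried
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY = 'ECONOMY'   -- conserve credits over starting clusters
  AUTO_SUSPEND = 300           -- seconds
  AUTO_RESUME = TRUE;
```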
Data Structures
- Snowflake is a columnstore database that leverages micropartitions
- Data Types
- Numeric Data Types
- NUMBER - Default precision and scale are (38,0).
- DECIMAL, NUMERIC - Synonymous with NUMBER.
- INT, INTEGER, BIGINT, SMALLINT, TINYINT, BYTEINT - Synonymous with NUMBER except precision and scale cannot be specified.
- FLOAT, FLOAT4, FLOAT8
- DOUBLE, DOUBLE PRECISION, REAL - Synonymous with FLOAT. [1]
- String & Binary Data Types
- VARCHAR - Default (and maximum) is 16,777,216 bytes.
- CHAR, CHARACTER - Synonymous with VARCHAR except default length is VARCHAR(1).
- STRING - Synonymous with VARCHAR.
- TEXT - Synonymous with VARCHAR.
- BINARY
- VARBINARY - Synonymous with BINARY.
- Logical Data Types
- BOOLEAN - Currently only supported for accounts provisioned after January 25, 2016.
- Date & Time Data Types
- DATE
- DATETIME - Alias for TIMESTAMP_NTZ
- TIME
- TIMESTAMP - Alias for one of the TIMESTAMP variations (TIMESTAMP_NTZ by default).
- TIMESTAMP_LTZ - TIMESTAMP with local time zone; time zone, if provided, is not stored.
- TIMESTAMP_NTZ - TIMESTAMP with no time zone; time zone, if provided, is not stored.
- TIMESTAMP_TZ - TIMESTAMP with time zone.
- Semi-structured Data Types
- VARIANT
- OBJECT
- ARRAY
- Geospatial Data Types
- GEOGRAPHY
- GEOMETRY
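As a quick illustration of the types above, a sketch of a CREATE TABLE with one column per category (table and column names are made up):

```sql
CREATE TABLE type_demo (
  id         NUMBER(38,0),   -- numeric (default precision/scale)
  amount     NUMBER(10,2),
  label      VARCHAR(100),   -- string
  is_active  BOOLEAN,        -- logical
  created_at TIMESTAMP_NTZ,  -- date & time
  payload    VARIANT,        -- semi-structured
  location   GEOGRAPHY       -- geospatial
);
```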
Tables
- All tables are automatically divided into micro-partitions of contiguous units between 50 and 500MB of uncompressed data, approximately 16MB after compression
- Metadata stored: range of values, distinct values, additional properties
- Extremely efficient DML
- Compression occurs after micro-partitioning, with Snowflake selecting the best compression model
- Clustering
- Tables can be clustered using a cluster key during the CREATE or ALTER DDL
- Automatic Clustering is the Snowflake service that seamlessly and continually manages all reclustering, as needed, of clustered tables.
- You can `SUSPEND RECLUSTER` / `RESUME RECLUSTER` for a table if you want to disable clustering
- Reclustering is non-blocking DML
- Will consume credits, but you don't need a warehouse (serverless)
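The clustering operations above can be sketched as follows (table and column names are illustrative):

```sql
-- Define a clustering key at create time...
CREATE TABLE sales (sale_date DATE, region VARCHAR, amount NUMBER(10,2))
  CLUSTER BY (sale_date, region);

-- ...or add/change one on an existing table
ALTER TABLE sales CLUSTER BY (sale_date);

-- Pause or resume the automatic reclustering service
ALTER TABLE sales SUSPEND RECLUSTER;
ALTER TABLE sales RESUME RECLUSTER;
```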
- Search Optimization
- Table level setting
- Uses an additional persistent data structure that is designed for search
- Only speeds up equality searches; does not speed up range searches
- Applied to all columns
- Use `SELECT SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS('TABLENAME')` to estimate the costs of search optimization and maintenance
- Enable with `ALTER TABLE TABLENAME ADD SEARCH OPTIMIZATION`
- Types of Tables
- Permanent - explicit drop, 7 days Fail-safe, up to 90 days Time Travel
- Temporary - remains for the session, 0 Fail-safe days, 1 Time Travel day
- Transient - explicit drop, 0 Fail-safe days, 1 Time Travel day
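The three table types differ only by a keyword at creation time; a sketch (names illustrative):

```sql
-- Permanent: explicit drop, 7 days Fail-safe, up to 90 days Time Travel
CREATE TABLE orders (id NUMBER, total NUMBER(10,2));

-- Temporary: dropped at session end, 0 Fail-safe days, 1 Time Travel day
CREATE TEMPORARY TABLE scratch (id NUMBER);

-- Transient: explicit drop, 0 Fail-safe days, 1 Time Travel day
CREATE TRANSIENT TABLE staging (id NUMBER);
```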
Views
- Types of Views
- Non-Materialized - Named definition of a query. Results are NOT stored anywhere. Performance is slower than materialized views.
- Materialized - Defined the same as non-materialized, but behaves like a table. Faster access, but will have additional storage costs.
- Secure Views
- Both Materialized and Non-materialized can be a secure view.
- Allow for additional data privacy
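A sketch of the three view flavors (underlying tables and names are illustrative):

```sql
-- Non-materialized: just a named query, nothing stored
CREATE VIEW active_customers AS
  SELECT * FROM customers WHERE status = 'ACTIVE';

-- Materialized: results stored for faster access, extra storage cost
CREATE MATERIALIZED VIEW daily_totals AS
  SELECT order_date, SUM(total) AS total
  FROM orders GROUP BY order_date;

-- Secure: hides the view definition from non-owners
CREATE SECURE VIEW customer_emails AS
  SELECT id, email FROM customers;
```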
Data Management
- Stages
- Internal - Snowflake managed space, can be used as a lake.
- External - AWS, GCP, or Azure storage.
- `PUT` command moves data into a stage
- `COPY INTO` moves data from the stage into a table
- Makes use of the File Format. Each COPY INTO needs a format to interpret the data.
- Formats with inherent structure, like Parquet and JSON, can use `INFER_SCHEMA`
- The `ON_ERROR` option allows you to manage error handling (fail on file, fail on row, etc)
- Can truncate columns, transform data, etc on load
- Can remove staged data with the `REMOVE` command
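A typical internal-stage load might look like this (file path, table, and format settings are illustrative; PUT runs from a SnowSQL client):

```sql
-- Upload a local file to the table's internal stage
PUT file:///tmp/data.csv @%mytable;

-- Load it, specifying a file format and row-level error handling
COPY INTO mytable
  FROM @%mytable
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
  ON_ERROR = 'CONTINUE';

-- Remove the staged files once loaded
REMOVE @%mytable;
```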
- SnowPipes
- Enable automatically loading data once it appears in a bucket
- Data is made available immediately
- Serverless feature instead of warehouses (Don’t need to worry about scaling)
- Uses event notification to trigger
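A Snowpipe is essentially a named COPY INTO with auto-ingest; a sketch (stage, table, and pipe names are illustrative):

```sql
CREATE PIPE my_pipe
  AUTO_INGEST = TRUE              -- fires on cloud storage event notifications
  AS COPY INTO landing_table
     FROM @my_ext_stage
     FILE_FORMAT = (TYPE = 'JSON');
```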
- Streams
- A stream object records data manipulation language (DML) changes made to tables (INSERT, UPDATE, DELETE, etc) as well as metadata changes
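A minimal stream sketch (table names illustrative); the stream's METADATA$ACTION column records whether each change was an INSERT or DELETE:

```sql
-- Track DML changes on a table
CREATE STREAM orders_stream ON TABLE orders;

-- Query the captured changes (they are consumed when the stream
-- is read inside a DML transaction)
SELECT *, METADATA$ACTION FROM orders_stream;
```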
- Stored Procs
- Similar to other DB platforms
- UDFs - You can create these to extend Snowflake. Supports Java, JavaScript, Python, Scala, and SQL.
- External Functions
- An external function calls code that is executed outside Snowflake.
- A type of UDF
- Orchestration
- Tasks - a task can execute any one of the following types of SQL code:
- Single SQL statement
- Call to a stored procedure
- Procedural logic using Snowflake Scripting
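A task sketch (warehouse, schedule, and procedure names are illustrative; tasks are created in a suspended state):

```sql
CREATE TASK nightly_refresh
  WAREHOUSE = my_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
  AS CALL refresh_daily_totals();

ALTER TASK nightly_refresh RESUME;  -- enable the task
```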
- Classes
- A Class is similar to a class in object oriented programming and serves as a blueprint for creating instances. An Instance is an object created from a Class. Classes and instances are schema-level objects in Snowflake.
Security
Roles
- Roles are a logical grouping of privileges
- Security structure is Privileges -> Roles -> Users
- Roles Hierarchy:
    - ACCOUNTADMIN - All SECURITYADMIN and SYSADMIN functions
    - SECURITYADMIN - All user privileges; can manage objects and grants globally
    - USERADMIN - User and role management only
    - SYSADMIN - Can create databases, schemas, etc.
    - Custom Roles - User-created roles should live here (rolled up under SYSADMIN)
    - PUBLIC - Every user; can create and own objects individually
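The Privileges -> Roles -> Users structure in practice (role, database, and user names are illustrative):

```sql
CREATE ROLE analyst;

-- Privileges attach to the role...
GRANT USAGE ON DATABASE sales_db TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE analyst;

-- ...and the role attaches to users
GRANT ROLE analyst TO USER alice;

-- Custom roles are conventionally rolled up under SYSADMIN
GRANT ROLE analyst TO ROLE sysadmin;
```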
API Auth and Secrets
- Snowflake supports external API code and can store sensitive information in keys/passwords as secrets
- A secret is a schema-level object that stores sensitive information, limits access to the sensitive information using RBAC, and is encrypted using the Snowflake key encryption hierarchy
- can create and alter secrets
Network Policy and Rules
- Allow access based on an IP whitelist, or restrict via a blacklist
- ACCOUNTADMIN and SECURITYADMIN can CRUD
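A network policy sketch (policy name and IP ranges are illustrative):

```sql
CREATE NETWORK POLICY corp_only
  ALLOWED_IP_LIST = ('192.168.1.0/24')
  BLOCKED_IP_LIST = ('192.168.1.99');

-- Apply it account-wide
ALTER ACCOUNT SET NETWORK_POLICY = corp_only;
```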
Federated Auth and SSO
- User authentication is separated from user access through an independent identity provider (IdP)
SCIM
- SCIM is an open specification to help facilitate the automated management of user identities and groups
- https://docs.snowflake.com/en/user-guide/scim-intro
Private Link (Azure, AWS, Google)
- Snowflake can work with all private link services
- requires setup
- not a snowflake service
Key Management
- Snowflake supports key pair authentication using RSA keys
- https://docs.snowflake.com/en/user-guide/key-pair-auth
MFA
- Not enabled by default
- The Duo application service communicates through TCP port 443.
- MFA token caching can be enabled with `ALTER ACCOUNT SET ALLOW_CLIENT_MFA_CACHING = TRUE;`
Data Governance
Tagging
- Tags enable data stewards to monitor sensitive data
- Schema level object that can be assigned to another object
- Created with `CREATE TAG cost_center;`
- Assign on create using `WITH TAG (cost_center = 'IT')`
- Assign to an existing object using `SET TAG` (e.g. `ALTER TABLE t SET TAG cost_center = 'IT'`)
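End to end, tagging looks like this (the tag value and object names are illustrative):

```sql
CREATE TAG cost_center;

-- Assign at creation time
CREATE WAREHOUSE it_wh WITH TAG (cost_center = 'IT');

-- Assign to an existing object
ALTER WAREHOUSE it_wh SET TAG cost_center = 'IT';
```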
Data Classification
- Snowflake has built-in semantic and privacy tags available
- SNOWFLAKE.CORE.SEMANTIC_CATEGORY
- SNOWFLAKE.CORE.PRIVACY_CATEGORY
- Helps to manage PII data tracking
Data Masking and External Tokenization
- Masking
- Snowflake supports masking policies as a schema-level object to protect sensitive data from unauthorized access while allowing authorized users to access sensitive data at query runtime.
- Use with data sharing
- Value of data is lost
- External Tokens
- External Tokenization enables accounts to tokenize data before loading it into Snowflake and detokenize the data at query runtime.
- Can’t be used with data sharing
- Loses analytical value
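A masking policy sketch (role, policy, and column names are illustrative):

```sql
-- Unauthorized roles see a redacted value at query runtime
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ANALYST_FULL') THEN val
       ELSE '*** MASKED ***'
  END;

ALTER TABLE customers MODIFY COLUMN email
  SET MASKING POLICY email_mask;
```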
Data Sharing
- Only ACCOUNTADMIN can provision
- Sharer = Provider
- User = Consumer
- uses Snowflake service layer and metadata store
- 1 DB per share but unlimited number of shares on the account
- Consists of
- Privilege to grant DB access
- Privilege to grant object access
- Consumer account DB and objects are shared
- Consumer is READ ONLY
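Provider-side setup for a share (share, database, and account identifiers are illustrative):

```sql
CREATE SHARE sales_share;

-- Grant access to the database and the objects inside it
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;

-- Make the share visible to a consumer account (read-only for them)
ALTER SHARE sales_share ADD ACCOUNTS = my_org.consumer_acct;
```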
Monitoring
- Resource monitors can help to impose limits on the number of credits consumed
- Can adjust to alert, suspend, or immediately suspend
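A resource monitor sketch showing all three trigger actions (monitor name, quota, and warehouse are illustrative):

```sql
CREATE RESOURCE MONITOR monthly_limit
  WITH CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 75 PERCENT DO NOTIFY
           ON 90 PERCENT DO SUSPEND
           ON 100 PERCENT DO SUSPEND_IMMEDIATE;

ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_limit;
```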
Example Code
Warehouse, Databases, Schemas, Tables, and Views
```sql
-- Create a warehouse
CREATE WAREHOUSE my_wh WITH WAREHOUSE_SIZE = 'XSMALL';
```
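The remaining objects in this section's heading can be created similarly; a sketch (all names are illustrative):

```sql
CREATE DATABASE demo_db;
CREATE SCHEMA demo_db.demo_schema;

CREATE TABLE demo_db.demo_schema.customers (
  id    NUMBER,
  name  VARCHAR,
  email VARCHAR
);

CREATE VIEW demo_db.demo_schema.customer_names AS
  SELECT id, name FROM demo_db.demo_schema.customers;
```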