Snowflake Study Guide
I recently obtained my SnowPro certification and I want to share my preparation materials with you. I gathered information from the Snowflake community, documentation, and online training through Udemy and Snowflake. I hope this collection of resources will help you in your journey towards becoming a SnowPro.
After passing the test the other day, I strongly advise you to thoroughly read the guides and materials available at https://docs.snowflake.com/en/guides-overview. The test consists of 100 questions, and some of them are quite obscure. I found most of these obscure questions in the guide after taking the test. If you have any questions, feel free to DM me on LinkedIn.
General Information
- Analytic Data Warehouse
- Built by industry professionals from Oracle, it has the look and feel of a traditional DBMS
- SaaS Offering
- No hardware
- No maintenance (handled by Snowflake)
- Cloud only (no on-premises)
- Cloud
- AWS, Azure, GCP or VPC (cloud only)
- Snowflake Docs: Regions
- Decoupled compute and storage
- Editions
- Standard - Entry level product
- Enterprise - Adds new functionality like time travel, row level security, object tagging, classification, materialized views.
- Business Critical - For organizations that need PHI compliance (HIPAA, PCI, etc) and high uptime.
- VPC - Private Snowflake
- Workloads:
- AI/ML
- Applications (Transactional capability added in 2022)
- Data Engineering
- Data Lake
- Data Warehouse
- Unistore (Data Cloud)
Costs
- Function of Storage, Compute, and Edition
- Costs are in the form of credits
- Different regions/clouds may have different costs per credit.
- Generally $2 for standard, $3 for Enterprise, and $4 for business critical
- Credits are consumed by a running compute cluster
- Pre-purchased storage capacity is approximately $23/TB per month, depending on the cloud
- Data transfer costs
- Same Cloud/Region - 0
- Same cloud, different region - varies, starting at $20/TB (US)
- Different cloud, different region - varies, starting at $90/TB
- Snowflake Docs: Pricing
Architecture
- Snowflake is a self managed platform
- No server or software maintenance
- Hybrid of traditional shared-disk and shared-nothing database architecture
- Shared Disk - Central data repository accessible by all nodes
- Shared-nothing - Using MPP clusters where each node stores a portion of the entire data set locally
- 3 Layers
- Database Storage
- Query Processing
- Cloud Services
Layer 1: Database Storage
- Stored in an internal, optimized, compressed columnar format (micro-partitions)
- Snowflake manages how data is stored
- Objects stored by Snowflake are not visible to customers
Layer 2: Query Processing
- Query processing layer
- Uses virtual warehouse (compute clusters)
- Each cluster is independent and doesn't share resources
- Each warehouse is an MPP cluster
- https://docs.snowflake.com/user-guide/warehouses
Layer 3: Cloud Services
- Collection of services that coordinate activities across Snowflake
- Includes:
- Authentication
- Infrastructure Management
- Metadata Management
- Query parsing and optimization
- Access Control
- Customers don’t have direct access to change
- Associated terms are Query Planning, Optimization, and Compilation (*Note - not Processing or Design)
Connecting to Snowflake
- WebUI
- Classic
- Snowsight
- Worksheet Organization and Sharing
- Coding Productivity with Autocomplete
- Visualization - Limited at the time, but basic dashboard functions
- Collaboration
- Administration
- Command Line (SnowSQL)
- Connect with `snowsql -a {ORGNAME}-{ACCOUNTNAME} -u USER`
- ODBC/JDBC
- Libraries (Python, Spark, Kafka, .Net, Node.js)
- Third Party (Matillion, Fivetran, etc)
- https://docs.snowflake.com/en/user-guide/connecting
- Connections
- APP: https://app.snowflake.com
- URL with Account ID and Cloud: https://qm79258.east-us-2.azure.snowflakecomputing.com/console/login#/ (Example)
Cache
- Snowflake caches queries to improve performance
- Metadata Cache
- Cloud services layer
- Improves compile time for queries and commonly used tables
- Result Cache
- Cloud services layer
- Holds query results for 24 hours
- Available across warehouses and users
- Local Disk / Warehouse Cache
- Storage layer
- Caches data used by SQL queries on its local SSD and memory
- Improves query performance
- Remote Disk
- Holds long term storage
- Data resilience and durability (for data center failure)
Virtual Warehouses (Compute Clusters)
- Cluster of compute resources
- 2 Types:
- Standard
- Snowpark-optimized - Recommended for workloads with large memory requirements
- X-Small and Small sizes not available
- Sizes
- X-Small
- Small
- Medium
- Large
- X-Large to 6X-Large
- Any transaction requires a warehouse
- Auto-suspend: default is 10 minutes
- Auto-resume: starts as needed
- Multi-cluster Warehouse
- 1 to 10 clusters
- Can autoscale up and down (horizontal scaling)
- Increases clusters as needed for performance, and scales down as load drops
- Helps with concurrency and not slow running queries
- Scaling Policy
- Standard - favors starting clusters over conserving credits
- Economy - favors conserving credits over starting clusters
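Putting the warehouse options above together, a minimal sketch of a multi-cluster warehouse DDL (the warehouse name and settings are illustrative):

```sql
-- Multi-cluster warehouse: scales horizontally between 1 and 3 clusters,
-- suspends after 5 minutes idle, resumes automatically when queried
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY = 'ECONOMY'   -- conserve credits over starting clusters
  AUTO_SUSPEND = 300           -- seconds
  AUTO_RESUME = TRUE;
```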
Data Structures
- Snowflake is a columnstore database that leverages micropartitions
- Data Types
- Numeric Data Types
- NUMBER - Default precision and scale are (38,0).
- DECIMAL, NUMERIC - Synonymous with NUMBER.
- INT, INTEGER, BIGINT, SMALLINT, TINYINT, BYTEINT - Synonymous with NUMBER except precision and scale cannot be specified.
- FLOAT, FLOAT4, FLOAT8
- DOUBLE, DOUBLE PRECISION, REAL - Synonymous with FLOAT. [1]
- String & Binary Data Types
- VARCHAR - Default (and maximum) is 16,777,216 bytes.
- CHAR, CHARACTER - Synonymous with VARCHAR except default length is VARCHAR(1).
- STRING - Synonymous with VARCHAR.
- TEXT - Synonymous with VARCHAR.
- BINARY
- VARBINARY - Synonymous with BINARY.
- Logical Data Types
- BOOLEAN - Currently only supported for accounts provisioned after January 25, 2016.
- Date & Time Data Types
- DATE
- DATETIME - Alias for TIMESTAMP_NTZ
- TIME
- TIMESTAMP - Alias for one of the TIMESTAMP variations (TIMESTAMP_NTZ by default).
- TIMESTAMP_LTZ - TIMESTAMP with local time zone; time zone, if provided, is not stored.
- TIMESTAMP_NTZ - TIMESTAMP with no time zone; time zone, if provided, is not stored.
- TIMESTAMP_TZ - TIMESTAMP with time zone.
- Semi-structured Data Types
- VARIANT
- OBJECT
- ARRAY
- Geospatial Data Types
- GEOGRAPHY
- GEOMETRY
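As a quick illustration of the types above, a sketch of a CREATE TABLE with one column per category (table and column names are made up):

```sql
CREATE TABLE type_demo (
  id         NUMBER(38,0),   -- numeric (default precision/scale)
  amount     NUMBER(10,2),
  label      VARCHAR(100),   -- string
  is_active  BOOLEAN,        -- logical
  created_at TIMESTAMP_NTZ,  -- date & time
  payload    VARIANT,        -- semi-structured
  location   GEOGRAPHY       -- geospatial
);
```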
Tables
- All tables are automatically divided into micro-partitions of contiguous units between 50 and 500MB of uncompressed data, approximately 16MB after compression
- Metadata stored: range of values, distinct values, additional properties
- Extremely efficient DML
- Compression occurs after micro-partitioning, with Snowflake selecting the best compression model
- Clustering
- Tables can be clustered using a cluster key during the CREATE or ALTER DDL
- Automatic Clustering is the Snowflake service that seamlessly and continually manages all reclustering, as needed, of clustered tables.
- You can `SUSPEND RECLUSTER` / `RESUME RECLUSTER` for a table if you want to disable clustering
- Reclustering is non-blocking DML
- Will consume credits, but you don't need a warehouse (serverless)
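The clustering operations above can be sketched as follows (table and column names are illustrative):

```sql
-- Define a clustering key at create time...
CREATE TABLE sales (sale_date DATE, region VARCHAR, amount NUMBER(10,2))
  CLUSTER BY (sale_date, region);

-- ...or add/change one on an existing table
ALTER TABLE sales CLUSTER BY (sale_date);

-- Pause or resume the automatic reclustering service
ALTER TABLE sales SUSPEND RECLUSTER;
ALTER TABLE sales RESUME RECLUSTER;
```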
- Search Optimization
- Table level setting
- Uses an additional persistent data structure that is designed for search
- Only speeds up equality searches; does not speed up range searches
- Applied to all columns
- Use `SELECT SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS('TABLENAME')` to estimate the costs of search optimization and maintenance
- Enable with `ALTER TABLE TABLENAME ADD SEARCH OPTIMIZATION`
- Types of Tables
- Permanent - explicit drop, 7 days Fail-safe, up to 90 days Time Travel
- Temporary - remains for the session, 0 Fail-safe days, 1 Time Travel day
- Transient - explicit drop, 0 Fail-safe days, 1 Time Travel day
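The three table types differ only by a keyword at creation time; a sketch (names illustrative):

```sql
-- Permanent: explicit drop, 7 days Fail-safe, up to 90 days Time Travel
CREATE TABLE orders (id NUMBER, total NUMBER(10,2));

-- Temporary: dropped at session end, 0 Fail-safe days, 1 Time Travel day
CREATE TEMPORARY TABLE scratch (id NUMBER);

-- Transient: explicit drop, 0 Fail-safe days, 1 Time Travel day
CREATE TRANSIENT TABLE staging (id NUMBER);
```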
Views
- Types of Views
- Non-Materialized - Named definition of a query. Results are NOT stored anywhere. Performance is slower than materialized views.
- Materialized - Defined the same as non-materialized, but behaves like a table. Faster access, but will have additional storage costs.
- Secure Views
- Both Materialized and Non-materialized can be a secure view.
- Allow for additional data privacy
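A sketch of the three view flavors (underlying tables and names are illustrative):

```sql
-- Non-materialized: just a named query, nothing stored
CREATE VIEW active_customers AS
  SELECT * FROM customers WHERE status = 'ACTIVE';

-- Materialized: results stored for faster access, extra storage cost
CREATE MATERIALIZED VIEW daily_totals AS
  SELECT order_date, SUM(total) AS total
  FROM orders GROUP BY order_date;

-- Secure: hides the view definition from non-owners
CREATE SECURE VIEW customer_emails AS
  SELECT id, email FROM customers;
```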
Data Management
- Stages
- Internal - Snowflake managed space, can be used as a lake.
- External - AWS, GCP, or Azure storage.
- `PUT` command moves data into a stage
- `COPY INTO` moves data from the stage into a table
- Makes use of the File Format. Each COPY INTO needs a format to interpret the data.
- Formats with inherent structure, like Parquet and JSON, can use `INFER_SCHEMA`
- The `ON_ERROR` option allows you to manage error handling (fail on file, fail on row, etc)
- Can truncate columns, transform data, etc on load
- Can remove staged data with the `REMOVE` command
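A typical internal-stage load might look like this (file path, table, and format settings are illustrative; PUT runs from a SnowSQL client):

```sql
-- Upload a local file to the table's internal stage
PUT file:///tmp/data.csv @%mytable;

-- Load it, specifying a file format and row-level error handling
COPY INTO mytable
  FROM @%mytable
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
  ON_ERROR = 'CONTINUE';

-- Remove the staged files once loaded
REMOVE @%mytable;
```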
- SnowPipes
- Enable automatically loading data once it appears in a bucket
- Data is made available immediately
- Serverless feature instead of warehouses (Don’t need to worry about scaling)
- Uses event notification to trigger
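A Snowpipe is essentially a named COPY INTO with auto-ingest; a sketch (stage, table, and pipe names are illustrative):

```sql
CREATE PIPE my_pipe
  AUTO_INGEST = TRUE              -- fires on cloud storage event notifications
  AS COPY INTO landing_table
     FROM @my_ext_stage
     FILE_FORMAT = (TYPE = 'JSON');
```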
- Streams
- A stream object records data manipulation language (DML) changes made to tables (INSERT, UPDATE, DELETE, etc) as well as metadata changes
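A minimal stream sketch (table names illustrative); the stream's METADATA$ACTION column records whether each change was an INSERT or DELETE:

```sql
-- Track DML changes on a table
CREATE STREAM orders_stream ON TABLE orders;

-- Query the captured changes (they are consumed when the stream
-- is read inside a DML transaction)
SELECT *, METADATA$ACTION FROM orders_stream;
```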
- Stored Procs
- Similar to other DB platforms
- UDFs - You can create these to extend Snowflake. Supports Java, JavaScript, Python, Scala, and SQL.
- External Functions
- An external function calls code that is executed outside Snowflake.
- A type of UDF
- Orchestration
- Tasks - a task can execute any one of the following types of SQL code:
- Single SQL statement
- Call to a stored procedure
- Procedural logic using Snowflake Scripting
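A task sketch (warehouse, schedule, and procedure names are illustrative; tasks are created in a suspended state):

```sql
CREATE TASK nightly_refresh
  WAREHOUSE = my_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
  AS CALL refresh_daily_totals();

ALTER TASK nightly_refresh RESUME;  -- enable the task
```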
- Classes
- A Class is similar to a class in object oriented programming and serves as a blueprint for creating instances. An Instance is an object created from a Class. Classes and instances are schema-level objects in Snowflake.
Security
Roles
- Roles are a logical grouping of privileges
- Security structure is Privileges -> Roles -> Users
- Roles Hierarchy:
    - ACCOUNTADMIN - All SECURITYADMIN and SYSADMIN functions
    - SECURITYADMIN - All user privileges; can manage objects and grants globally
    - USERADMIN - User and role management only
    - SYSADMIN - Can create databases, schemas, etc.
    - Custom Roles - User-created roles should live here (rolled up under SYSADMIN)
    - PUBLIC - Every user; can create and own objects individually
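The Privileges -> Roles -> Users structure in practice (role, database, and user names are illustrative):

```sql
CREATE ROLE analyst;

-- Privileges attach to the role...
GRANT USAGE ON DATABASE sales_db TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE analyst;

-- ...and the role attaches to users
GRANT ROLE analyst TO USER alice;

-- Custom roles are conventionally rolled up under SYSADMIN
GRANT ROLE analyst TO ROLE sysadmin;
```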
API Auth and Secrets
- Snowflake supports external API code and can store sensitive information in keys/passwords as secrets
- A secret is a schema-level object that stores sensitive information, limits access to the sensitive information using RBAC, and is encrypted using the Snowflake key encryption hierarchy
- can create and alter secrets
Network Policy and Rules
- Allow access based on an IP whitelist, or restrict via a blacklist
- ACCOUNTADMIN and SECURITYADMIN can CRUD
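A network policy sketch (policy name and IP ranges are illustrative):

```sql
CREATE NETWORK POLICY corp_only
  ALLOWED_IP_LIST = ('192.168.1.0/24')
  BLOCKED_IP_LIST = ('192.168.1.99');

-- Apply it account-wide
ALTER ACCOUNT SET NETWORK_POLICY = corp_only;
```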
Federated Auth and SSO
- User authentication is separated from user access through an independent identity provider (IdP)
SCIM
- SCIM is an open specification to help facilitate the automated management of user identities and groups
- https://docs.snowflake.com/en/user-guide/scim-intro
Private Link (Azure, AWS, Google)
- Snowflake can work with all private link services
- requires setup
- not a snowflake service
Key Management
- Snowflake supports key pair authentication using RSA keys
- https://docs.snowflake.com/en/user-guide/key-pair-auth
MFA
- Not enabled by default
- The Duo application service communicates through TCP port 443.
- MFA token caching can be enabled with `ALTER ACCOUNT SET ALLOW_CLIENT_MFA_CACHING = TRUE;`
Data Governance
Tagging
- Tags enable data stewards to monitor sensitive data
- Schema level object that can be assigned to another object
- Created with `CREATE TAG cost_center;`
- Assign on create using `WITH TAG (cost_center = 'IT')`
- Assign to an existing object using `SET TAG` (e.g. `ALTER TABLE t SET TAG cost_center = 'IT'`)
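End to end, tagging looks like this (the tag value and object names are illustrative):

```sql
CREATE TAG cost_center;

-- Assign at creation time
CREATE WAREHOUSE it_wh WITH TAG (cost_center = 'IT');

-- Assign to an existing object
ALTER WAREHOUSE it_wh SET TAG cost_center = 'IT';
```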
Data Classification
- Snowflake has built-in semantic and privacy tags available
- SNOWFLAKE.CORE.SEMANTIC_CATEGORY
- SNOWFLAKE.CORE.PRIVACY_CATEGORY
- Helps to manage PII data tracking
Data Masking and External Tokenization
- Masking
- Snowflake supports masking policies as a schema-level object to protect sensitive data from unauthorized access while allowing authorized users to access sensitive data at query runtime.
- Use with data sharing
- Value of data is lost
- External Tokens
- External Tokenization enables accounts to tokenize data before loading it into Snowflake and detokenize the data at query runtime.
- Can’t be used with data sharing
- Loses analytical value
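A masking policy sketch (role, policy, and column names are illustrative):

```sql
-- Unauthorized roles see a redacted value at query runtime
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ANALYST_FULL') THEN val
       ELSE '*** MASKED ***'
  END;

ALTER TABLE customers MODIFY COLUMN email
  SET MASKING POLICY email_mask;
```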
Data Sharing
- Only ACCOUNTADMIN can provision
- Sharer = Provider
- User = Consumer
- uses Snowflake service layer and metadata store
- 1 DB per share but unlimited number of shares on the account
- Consists of
- Privilege to grant DB access
- Privilege to grant object access
- Consumer account DB and objects are shared
- Consumer is READ ONLY
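Provider-side setup for a share (share, database, and account identifiers are illustrative):

```sql
CREATE SHARE sales_share;

-- Grant access to the database and the objects inside it
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;

-- Make the share visible to a consumer account (read-only for them)
ALTER SHARE sales_share ADD ACCOUNTS = my_org.consumer_acct;
```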
Monitoring
- Resource monitors can help to impose limits on the number of credits consumed
- Can adjust to alert, suspend, or immediately suspend
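A resource monitor sketch showing all three trigger actions (monitor name, quota, and warehouse are illustrative):

```sql
CREATE RESOURCE MONITOR monthly_limit
  WITH CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 75 PERCENT DO NOTIFY
           ON 90 PERCENT DO SUSPEND
           ON 100 PERCENT DO SUSPEND_IMMEDIATE;

ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_limit;
```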
Example Code
Warehouse, Databases, Schemas, Tables, and Views
```sql
-- Create a warehouse
CREATE WAREHOUSE my_wh WITH WAREHOUSE_SIZE = 'XSMALL';
```
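The remaining objects in this section's heading can be created similarly; a sketch (all names are illustrative):

```sql
CREATE DATABASE demo_db;
CREATE SCHEMA demo_db.demo_schema;

CREATE TABLE demo_db.demo_schema.customers (
  id    NUMBER,
  name  VARCHAR,
  email VARCHAR
);

CREATE VIEW demo_db.demo_schema.customer_names AS
  SELECT id, name FROM demo_db.demo_schema.customers;
```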