 
 
Snowflake Study Guide
I recently obtained my SnowPro certification and I want to share my preparation materials with you. I gathered information from the Snowflake community, documentation, and online training through Udemy and Snowflake. I hope this collection of resources will help you in your journey towards becoming a SnowPro.
After passing the test the other day, I strongly advise you to thoroughly read the guides and materials available at https://docs.snowflake.com/en/guides-overview. The test consists of 100 questions, and some of them are quite obscure. After taking the test, I found most of these obscure topics covered in those guides. If you have any questions, feel free to DM me on LinkedIn.
General Information
- Analytic Data Warehouse
- Built by industry professionals from Oracle, it has the look and feel of a traditional DBMS
- SaaS Offering - No hardware
- No maintenance (handled by Snowflake)
- Cloud only (no on-premises option)
 
- Clouds: AWS, Azure, or GCP (cloud only)
- Snowflake Docs: Regions
 
- Decoupled compute and storage
- Editions:
  - Standard - Entry-level product
  - Enterprise - Adds functionality such as extended Time Travel (up to 90 days), row-level security, object tagging, classification, and materialized views
  - Business Critical - For organizations that need support for protected data such as PHI (HIPAA, PCI, etc.) and high uptime
  - VPS (Virtual Private Snowflake) - Completely isolated, private Snowflake environment
 
- Workloads:
  - AI/ML
  - Applications (transactional capability added in 2022)
  - Data Engineering
  - Data Lake
  - Data Warehouse
  - Unistore (Data Cloud)
 
Costs
- Function of Storage, Compute, and Edition
- Costs are in the form of credits
- Different regions/clouds may have different costs per credit:
  - Generally $2 for Standard, $3 for Enterprise, and $4 for Business Critical
- Credits are consumed by a running compute cluster
 
- Pre-purchased storage capacity is approximately $23/TB per month, depending on the cloud and region
- Data transfer costs:
  - Same cloud/region - 0
  - Same cloud, different region - varies, starting at $20/TB (US)
  - Different cloud, different region - varies, starting at $90/TB
 
- Snowflake Docs: Pricing
Architecture
- Snowflake is a self-managed platform
- No server or software maintenance
- Hybrid of traditional shared-disk and shared-nothing database architectures:
  - Shared-disk - Central data repository accessible by all nodes
  - Shared-nothing - MPP compute clusters where each node stores a portion of the entire data set locally
 
- 3 layers:
  - Database Storage
  - Query Processing
  - Cloud Services
 
Layer 1: Database Storage
- Data is stored in an internal, optimized, compressed columnar format (micro-partitions)
- Snowflake manages how data is stored
- Objects stored by Snowflake are not visible to customers
Layer 2: Query Processing
- Uses virtual warehouses (compute clusters) to execute queries
- Each cluster is independent and doesn't share resources with other clusters
- Each warehouse is an MPP cluster
- https://docs.snowflake.com/user-guide/warehouses
Layer 3: Cloud Services
- Collection of services that coordinate activities across Snowflake
- Includes:
  - Authentication
  - Infrastructure Management
  - Metadata Management
  - Query Parsing and Optimization
  - Access Control
 
- Customers don't have direct access to change this layer
- Associated terms are Query Planning, Optimization, and Compilation (note: not Processing or Design)
Connecting to Snowflake
- WebUI - Classic
- Snowsight:
  - Worksheet organization and sharing
  - Coding productivity with autocomplete
  - Visualization - limited at the time, but basic dashboard functions
  - Collaboration
  - Administration
 
 
- Command Line (SnowSQL) - connect with snowsql -a {ORGNAME}-{ACCOUNTNAME} -u USER
- ODBC/JDBC
- Libraries (Python, Spark, Kafka, .Net, Node.js)
- Third party (Matillion, Fivetran, etc.)
- https://docs.snowflake.com/en/user-guide/connecting
- Connections:
  - App: https://app.snowflake.com
  - URL with account ID and cloud, e.g. https://qm79258.east-us-2.azure.snowflakecomputing.com/console/login#/
 
Cache
- Snowflake caches query results and data to improve performance
- Metadata Cache:
  - Cloud services layer
  - Improves compile time for queries against commonly used tables
 
- Result Cache:
  - Cloud services layer
  - Holds query results for 24 hours
  - Available across warehouses and users
 
- Local Disk / Warehouse Cache:
  - Compute (virtual warehouse) layer
  - Caches data used by SQL queries on the warehouse's local SSD and memory
  - Improves query performance
 
- Remote Disk:
  - Holds long-term storage
  - Provides data resilience and durability (survives a data center failure)
 
Virtual Warehouses (Compute Clusters)
- Cluster of compute resources
- 2 types:
  - Standard
  - Snowpark-optimized - recommended for workloads with large memory requirements (X-Small and Small sizes not available)
 
 
- Sizes:
  - X-Small
  - Small
  - Medium
  - Large
  - X-Large to 6X-Large
 
- Any transaction requires a warehouse
- Auto-suspend: default is 10 minutes
- Auto-resume: starts as needed
- Multi-cluster Warehouse:
  - 1 to 10 clusters
  - Can autoscale up and down (horizontal scaling)
  - Increases clusters as needed for performance and scales down as load decreases
  - Helps with concurrency, not slow-running queries
 
- Scaling Policy (see the sketch below):
  - Standard - favors starting clusters over conserving credits
  - Economy - favors conserving credits over starting clusters
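As a sketch, creating a multi-cluster warehouse with the settings above might look like this (demo_wh and the size are placeholders; multi-cluster warehouses require Enterprise edition or higher):

```sql
CREATE WAREHOUSE IF NOT EXISTS demo_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3        -- scale out horizontally for concurrency
  SCALING_POLICY = 'ECONOMY'   -- conserve credits over starting clusters
  AUTO_SUSPEND = 600           -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE;
```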
 
Data Structures
- Snowflake is a columnar-store database that leverages micro-partitions
- Data Types:
  - Numeric Data Types:
    - NUMBER - Default precision and scale are (38,0).
    - DECIMAL, NUMERIC - Synonymous with NUMBER.
    - INT, INTEGER, BIGINT, SMALLINT, TINYINT, BYTEINT - Synonymous with NUMBER except precision and scale cannot be specified.
    - FLOAT, FLOAT4, FLOAT8
    - DOUBLE, DOUBLE PRECISION, REAL - Synonymous with FLOAT.
 
  - String & Binary Data Types:
    - VARCHAR - Default (and maximum) length is 16,777,216 bytes.
    - CHAR, CHARACTER - Synonymous with VARCHAR except default length is VARCHAR(1).
    - STRING - Synonymous with VARCHAR.
    - TEXT - Synonymous with VARCHAR.
    - BINARY
    - VARBINARY - Synonymous with BINARY.
 
  - Logical Data Types:
    - BOOLEAN - Currently only supported for accounts provisioned after January 25, 2016.
 
  - Date & Time Data Types:
    - DATE
    - DATETIME - Alias for TIMESTAMP_NTZ.
    - TIME
    - TIMESTAMP - Alias for one of the TIMESTAMP variations (TIMESTAMP_NTZ by default).
    - TIMESTAMP_LTZ - TIMESTAMP with local time zone; time zone, if provided, is not stored.
    - TIMESTAMP_NTZ - TIMESTAMP with no time zone; time zone, if provided, is not stored.
    - TIMESTAMP_TZ - TIMESTAMP with time zone.
 
  - Semi-structured Data Types:
    - VARIANT
    - OBJECT
    - ARRAY
  - Geospatial Data Types:
    - GEOGRAPHY
    - GEOMETRY
 
 
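To illustrate several of these types together, a hypothetical table definition:

```sql
-- Hypothetical table exercising several Snowflake data types
CREATE OR REPLACE TABLE orders (
  order_id     NUMBER(38,0),
  amount       NUMBER(10,2),
  customer     VARCHAR(100),
  is_priority  BOOLEAN,
  ordered_at   TIMESTAMP_NTZ,
  attributes   VARIANT        -- semi-structured payload, e.g. JSON
);
```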
Tables
- All tables are automatically divided into micro-partitions: contiguous units of between 50 and 500 MB of uncompressed data, approximately 16 MB after compression
- Metadata stored per micro-partition: range of values, number of distinct values, additional properties
- Enables extremely efficient DML and pruning
- Compression occurs after micro-partitioning, with Snowflake selecting the best compression scheme for each column
- Clustering (see the sketch below):
  - Tables can be clustered using a cluster key defined during the CREATE or ALTER DDL
  - Automatic Clustering is the Snowflake service that seamlessly and continually manages all reclustering, as needed, of clustered tables
  - You can SUSPEND/RESUME RECLUSTER for a table if you want to disable reclustering
  - Reclustering is non-blocking to DML
  - Consumes credits, but you don't have to have a warehouse (serverless)
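A minimal sketch of defining and managing a cluster key (table and column names are hypothetical):

```sql
-- Define a cluster key at creation time
CREATE OR REPLACE TABLE sales (sale_date DATE, region VARCHAR, amount NUMBER)
  CLUSTER BY (sale_date, region);

ALTER TABLE sales SUSPEND RECLUSTER;  -- pause Automatic Clustering
ALTER TABLE sales RESUME RECLUSTER;   -- resume it
```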
 
- Search Optimization:
  - Table-level setting
  - Uses an additional persistent data structure that is designed for search
  - Only speeds up equality searches, does not speed up range searches
  - Applied to all columns
  - Use SELECT SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS('TABLENAME') to estimate the costs of search optimization and maintenance
  - Use ALTER TABLE TABLENAME ADD SEARCH OPTIMIZATION to enable it
 
- Types of Tables:
  - Permanent - explicit drop, 7 days Fail-safe, up to 90 days Time Travel
  - Temporary - remains for the session, 0 Fail-safe days, 1 Time Travel day
  - Transient - explicit drop, 0 Fail-safe days, 1 Time Travel day
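For example (hypothetical table names):

```sql
CREATE TABLE perm_t (id NUMBER);            -- permanent: Fail-safe + Time Travel
CREATE TEMPORARY TABLE temp_t (id NUMBER);  -- dropped when the session ends
CREATE TRANSIENT TABLE trans_t (id NUMBER); -- no Fail-safe, 1 day Time Travel
```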
 
Views
- Types of Views:
  - Non-materialized - Named definition of a query. Results are NOT stored anywhere. Performance is slower than materialized views.
  - Materialized - Defined the same as a non-materialized view, but behaves like a table. Faster access, but will have additional storage costs.
- Secure Views:
  - Both materialized and non-materialized views can be secure views
  - Allow for additional data privacy
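A sketch of each kind, assuming hypothetical customers and sales tables (materialized views require Enterprise edition or higher):

```sql
-- Non-materialized secure view
CREATE SECURE VIEW v_us_customers AS
  SELECT id, name FROM customers WHERE region = 'US';

-- Materialized view: results are stored and maintained by Snowflake
CREATE MATERIALIZED VIEW mv_daily_sales AS
  SELECT sale_date, SUM(amount) AS total FROM sales GROUP BY sale_date;
```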
 
Data Management
- Stages:
  - Internal - Snowflake-managed space, can be used as a lake
  - External - AWS, GCP, or Azure storage
  - The PUT command moves data into a stage
  - COPY INTO moves data from the stage into a table (see the example below):
    - Makes use of a File Format; each COPY INTO needs a format to interpret the data
    - Files with inherent structure, like Parquet and JSON, can use INFER_SCHEMA
    - The ON_ERROR option allows you to manage error handling (abort, skip the file, continue, etc.)
    - Can truncate columns, transform data, etc. on load
  - Can remove staged data with the REMOVE command
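A minimal load flow, assuming a local CSV file and hypothetical object names (PUT is run from a client such as SnowSQL):

```sql
CREATE OR REPLACE FILE FORMAT my_csv TYPE = 'CSV' SKIP_HEADER = 1;
CREATE OR REPLACE STAGE my_stage FILE_FORMAT = my_csv;

-- From SnowSQL: upload a local file into the internal stage
PUT file:///tmp/orders.csv @my_stage;

-- Load staged files into a table, skipping bad files instead of aborting
COPY INTO orders FROM @my_stage ON_ERROR = 'SKIP_FILE';
```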
 
- Snowpipe:
  - Enables automatically loading data once it appears in a bucket
  - Data is made available immediately
  - Serverless feature instead of warehouses (don't need to worry about scaling)
  - Uses event notifications to trigger loads
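A sketch of a pipe, assuming a pre-existing external stage @my_ext_stage and the my_csv file format from above:

```sql
-- AUTO_INGEST relies on cloud event notifications on the external stage
CREATE OR REPLACE PIPE orders_pipe AUTO_INGEST = TRUE AS
  COPY INTO orders FROM @my_ext_stage FILE_FORMAT = (FORMAT_NAME = 'my_csv');
```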
 
- Streams:
  - A stream object records data manipulation language (DML) changes made to tables (INSERT, UPDATE, DELETE, etc.), as well as metadata about each change
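For example, a hypothetical stream on the orders table:

```sql
CREATE OR REPLACE STREAM orders_changes ON TABLE orders;

-- Rows include change metadata such as METADATA$ACTION and METADATA$ISUPDATE
SELECT * FROM orders_changes;
```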
 
- Stored Procedures:
  - Similar to other DB platforms
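A minimal Snowflake Scripting procedure as a sketch (the name and table are hypothetical):

```sql
CREATE OR REPLACE PROCEDURE order_count()
RETURNS NUMBER
LANGUAGE SQL
AS
$$
BEGIN
  LET c NUMBER := (SELECT COUNT(*) FROM orders);
  RETURN c;
END;
$$;

CALL order_count();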
 
- UDFs - You can create them to extend Snowflake. Supports Java, JavaScript, Python, Scala, and SQL.
  - External Functions:
    - An external function calls code that is executed outside Snowflake
    - A type of UDF
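A simple SQL UDF as a sketch (hypothetical name):

```sql
CREATE OR REPLACE FUNCTION area_of_circle(radius FLOAT)
RETURNS FLOAT
AS 'pi() * radius * radius';
```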
- Orchestration:
  - Tasks - a task can execute any one of the following types of SQL code:
    - Single SQL statement
    - Call to a stored procedure
    - Procedural logic using Snowflake Scripting
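A sketch of a scheduled task (warehouse, table, and task names are hypothetical):

```sql
-- Refresh a summary table every hour
CREATE OR REPLACE TASK refresh_summary
  WAREHOUSE = demo_wh
  SCHEDULE = '60 MINUTE'
AS
  INSERT INTO daily_summary
    SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date;

ALTER TASK refresh_summary RESUME;  -- tasks are created in a suspended state
```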
- Classes:
  - A Class is similar to a class in object-oriented programming and serves as a blueprint for creating instances. An Instance is an object created from a Class. Classes and instances are schema-level objects in Snowflake.
 
Security
Roles
- Roles are a logical grouping of privileges
- Security structure is Privileges -> Roles -> Users
- Role hierarchy (see the sketch below):
  - ACCOUNTADMIN - All SECURITYADMIN and SYSADMIN functions
    - SYSADMIN - Can create databases, schemas, etc.
      - Custom Roles - User-created roles should live here
    - SECURITYADMIN - Can manage objects and grants globally
      - USERADMIN - User and role management only
  - PUBLIC - Users can create and own objects individually; granted to everyone
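A sketch of creating a custom role under SYSADMIN (role and user names are hypothetical):

```sql
USE ROLE USERADMIN;
CREATE ROLE analyst;
GRANT ROLE analyst TO ROLE SYSADMIN;  -- keep custom roles under SYSADMIN
GRANT ROLE analyst TO USER jsmith;    -- hypothetical user
```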
 
 
 
 
API Auth and Secrets
- Snowflake supports calling external APIs and can store sensitive information such as keys and passwords as secrets
- A secret is a schema-level object that stores sensitive information, limits access to it using RBAC, and is encrypted using the Snowflake key encryption hierarchy
- You can create and alter secrets
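For example, a hypothetical generic-string secret (the value is a placeholder):

```sql
CREATE OR REPLACE SECRET my_api_key
  TYPE = GENERIC_STRING
  SECRET_STRING = 'replace-with-real-value';
```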
Network Policy and Rules
- Allows access based on an IP whitelist, or restricts access via a blacklist
- ACCOUNTADMIN and SECURITYADMIN can create, read, update, and delete network policies
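A sketch (the IP ranges and policy name are placeholders):

```sql
CREATE NETWORK POLICY corp_only
  ALLOWED_IP_LIST = ('192.168.1.0/24')
  BLOCKED_IP_LIST = ('192.168.1.99');

ALTER ACCOUNT SET NETWORK_POLICY = corp_only;
```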
Federated Auth and SSO
- User authentication is separated from user access through an independent identity provider (IdP)
SCIM
- SCIM is an open specification to help facilitate the automated management of user identities and groups
- https://docs.snowflake.com/en/user-guide/scim-intro
Private Link (Azure, AWS, Google)
- Snowflake can work with the private link services of all three clouds
- Requires setup
- Not a Snowflake service
Key Management
- Snowflake supports key pair authentication using RSA keys
- https://docs.snowflake.com/en/user-guide/key-pair-auth
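For example, assigning a public key to a hypothetical user (key value truncated):

```sql
ALTER USER jsmith SET RSA_PUBLIC_KEY = 'MIIBIjANBgkq...';
```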
MFA
- Not enabled by default
- The Duo application service communicates through TCP port 443.
- Enable MFA token caching with alter account set allow_client_mfa_caching = true;
Data Governance
Tagging
- Tags enable data stewards to monitor sensitive data
- Schema-level object that can be assigned to another object
- Created with CREATE TAG cost_center;
- Set at creation time using WITH TAG (cost_center = 'IT')
- Set on an existing object using ALTER ... SET TAG cost_center = 'IT'
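Putting those commands together (the warehouse names are hypothetical):

```sql
CREATE TAG cost_center;

-- Assign at creation time
CREATE WAREHOUSE tagged_wh WITH TAG (cost_center = 'IT');

-- Assign to an existing object
ALTER WAREHOUSE demo_wh SET TAG cost_center = 'IT';
```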
Data Classification
- Snowflake has built-in semantic and privacy tags available:
  - SNOWFLAKE.CORE.SEMANTIC_CATEGORY
  - SNOWFLAKE.CORE.PRIVACY_CATEGORY
 
- Helps to manage PII data tracking
Data Masking and External Tokenization
- Masking:
  - Snowflake supports masking policies as schema-level objects to protect sensitive data from unauthorized access while allowing authorized users to access it at query runtime
  - Can be used with data sharing
  - Analytical value of the masked data is lost
- External Tokenization:
  - Enables accounts to tokenize data before loading it into Snowflake and detokenize the data at query runtime
  - Can't be used with data sharing
  - Preserves analytical value after de-identification
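A sketch of a masking policy, assuming a hypothetical customers.email column and an ANALYST role:

```sql
-- Only ANALYST sees raw emails; everyone else gets a masked value
CREATE OR REPLACE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() = 'ANALYST' THEN val ELSE '*** MASKED ***' END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY mask_email;
```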
 
Data Sharing
- Only ACCOUNTADMIN can provision
- Sharer = Provider
- User = Consumer
- Uses the Snowflake cloud services layer and metadata store
- 1 DB per share, but an unlimited number of shares per account
- Consists of:
  - Privilege to grant DB access
  - Privileges to grant object access
  - Consumer account(s) with which the DB and objects are shared
 
- Consumer is READ ONLY
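A sketch of the provider-side setup (database, share, and consumer account names are hypothetical):

```sql
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = myorg.consumer_acct;  -- consumer account
```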
Monitoring
- Resource monitors can help impose limits on the number of credits consumed
- Can be set to notify, suspend (after running queries finish), or suspend immediately
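A sketch (the quota, thresholds, and names are placeholders):

```sql
-- Notify at 80% and suspend at 100% of a 100-credit quota
CREATE RESOURCE MONITOR monthly_cap WITH CREDIT_QUOTA = 100
  TRIGGERS ON 80 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE demo_wh SET RESOURCE_MONITOR = monthly_cap;
```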
Example Code
Warehouse, Databases, Schemas, Tables, and Views
```sql
-- Create a warehouse
CREATE WAREHOUSE my_wh WITH WAREHOUSE_SIZE = 'XSMALL';
```