Thursday, August 9, 2018

Snowflake Shuffle Doc Deep-Dive


(Image is from page 6 of US Patent 10019454 B2 Data Management Systems and Methods)


I was recently encouraged to join a Slack channel about DATA and was reading through the chat history. There was an old discussion about Snowflake, which I had only recently heard of, and an open question that had not been answered:

"What I can't find is how they are handling things like shuffles efficiently between their compute nodes."
Being in a somewhat inquisitive state of mind today, I decided to do a quick study of Snowflake and see what I could learn. What did I find?

What I found suggests that Snowflake does go to quite a bit of trouble to efficiently manage cache on compute nodes as well as intelligently distribute table data across storage nodes. 

Summary of findings:

How does the resource manager determine what data is needed and which compute nodes to select to process a query? It depends on:
  • what data is already cached at particular compute nodes
  • caching resources available at particular compute nodes
  • computing resources available at particular compute nodes
  • I/O throughput available at particular nodes
  • current resource loads, number of concurrent users at particular nodes

Also, there are further optimizations:
  • Data are assigned to compute nodes using a consistent hashing model, increasing cache hit ratio for subsequent operations.
  • There is a "file stealing model" where datafiles that are unprocessed at an overallocated compute node subsequently get reassigned to a different available node.
  • Files are assigned "ownership levels" that further optimize the file stealing model and determine file processing order.
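The consistent hashing idea is worth pausing on, because it is what makes the caching claim plausible: when a compute node is added or removed, only the files that hash into that node's portion of the ring get reassigned, so most nodes keep serving the same files and their caches stay warm. Here is a toy Python sketch of the concept. This is my own illustration of consistent hashing in general, not Snowflake's actual code; the class and node names are made up.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Stable 32-bit hash of a string key (md5-based, for illustration)."""
    return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

class ConsistentHashRing:
    """Map file names to compute nodes on a hash ring.

    Each node is placed at several points ("virtual nodes") on the ring;
    a file is assigned to the first node clockwise from the file's hash.
    Adding or removing a node only remaps files falling in that node's
    arcs, so most cached file-to-node assignments survive a resize.
    """
    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, filename: str) -> str:
        idx = bisect.bisect(self._keys, _hash(filename)) % len(self._ring)
        return self._ring[idx][1]

# Grow a 3-node cluster to 4 nodes and count how many files move.
ring_a = ConsistentHashRing(["node1", "node2", "node3"])
ring_b = ConsistentHashRing(["node1", "node2", "node3", "node4"])
files = [f"part-{i:04d}" for i in range(1000)]
moved = sum(ring_a.node_for(f) != ring_b.node_for(f) for f in files)
# Roughly 1/4 of the files move to the new node; a naive
# hash(file) % n_nodes scheme would remap about 3/4 of them.
```

The point of the experiment at the bottom: cache warmth is preserved for the roughly three-quarters of files that do not move.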

The resource manager also has access to metadata for all data stored throughout the platform, whether in remote systems or in local caches, including how the data is organized.

Snowflake automatically uses date or timestamp columns in tables to distribute table data across storage nodes. Snowflake calls this clustering or micro-partitioning. You can manually change clustering keys or recluster. There is also an Automatic Clustering Service that can be enabled. As data distribution changes over time and becomes unbalanced, there is a benefit to reclustering. Clustering keys and the clustering ratio will have an impact on efficient data access and query performance.
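The payoff of clustering is pruning: each micro-partition carries metadata such as the min and max values of its columns, so a filter on a well-clustered column can skip whole partitions without reading them. A toy Python sketch of the idea (my own illustration of min/max pruning in general, not Snowflake internals; the field names are made up):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class MicroPartition:
    """Hypothetical partition: min/max of the clustering column plus rows."""
    min_event_date: date
    max_event_date: date
    rows: list = field(default_factory=list)

def prune(partitions, lo, hi):
    """Keep only partitions whose [min, max] range overlaps [lo, hi]."""
    return [p for p in partitions
            if p.max_event_date >= lo and p.min_event_date <= hi]

parts = [
    MicroPartition(date(2018, 1, 1), date(2018, 1, 31)),
    MicroPartition(date(2018, 2, 1), date(2018, 2, 28)),
    MicroPartition(date(2018, 3, 1), date(2018, 3, 31)),
]
# A filter on a mid-February range touches only the February partition;
# the other two are skipped entirely without being read.
hit = prune(parts, date(2018, 2, 10), date(2018, 2, 20))
```

When clustering degrades (partitions with wide, overlapping min/max ranges), fewer partitions can be skipped, which is why reclustering pays off.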

Question for the reader: How do Snowflake caching methods differ/improve upon established and familiar methods of distributed cache management?

No, I do not know this answer off the top of my head, but I have cultivated a critical attitude about such claims and wonder how much magic sauce there really is in Snowflake's distributed cache management. Of course, the Patent Office thought it was unique, but I've seen them fooled before.

What were my sources? Nope - not too many docs about Snowflake around...


Snowflake Computing, according to Wikipedia, is a cloud-based data-warehousing startup founded in 2012. It hasn't been around long enough to accumulate much documentation. I found some info on the company website, but then resorted to the US Patent Office to look for the dirty details. This is a trick I've used in the past when trying to understand the guts of new software from startups. If open source, the other trick is to LOOK AT THE SOURCE, but that was not an option here.

So, if you want to dig down to the bottom of the barrel, read US Patent 10019454 B2, "Data Management Systems and Methods," from start to finish.

OTOH - I don't believe the docs until I see it with my own eyes...


Then again, perhaps everything in the documentation and the patent is incomplete, misleading, or just plain wrong. I've learned that often the best way to understand how something works is to fashion a little experiment and see for yourself what happens. For someone who has access to a Snowflake instance and can play around with some ginormous tables, two fun experiments come to mind:

(1) Aggregate a ginormous table on a cluster-key column and then a non-cluster-key column, and compare and contrast the query profiles.

For example:

Imagine a 1-billion-row table like:

create table t1 (
    event_id number,
    customer_id number,
    event_date timestamp,
    payload string
);

-- aggregate on the cluster-key column; this should require no "reshuffling"
select count(event_id)
from t1
group by event_date;

-- aggregate on a non-cluster-key column; this should require "reshuffling"
select count(event_id)
from t1
group by customer_id;

examine the query profile for both queries, compare and contrast
useful link: https://docs.snowflake.net/manuals/user-guide/ui-query-profile.html

Question: From the query profiles, what clues (if any) can be gleaned regarding how data was "shuffled" between the compute nodes to aggregate by customer, even though the table is clustered by event_date?

I would expect a certain unavoidable "reshuffle" for the aggregation on a non-cluster-key column. It would be interesting to measure and objectively characterize the impact.
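Why is that reshuffle unavoidable? Rows for the same customer_id can live in micro-partitions on any node (they were laid out by event_date), so before the final per-customer counts can be computed, every node has to route each of its rows to a node chosen by hashing the grouping key. A toy Python sketch of that shuffle-then-aggregate pattern, in the spirit of the Spark shuffle docs referenced below (my own illustration, not Snowflake's implementation):

```python
from collections import defaultdict

def shuffle_and_count(node_rows, n_nodes):
    """node_rows: one list per source node of (customer_id, event_id) rows.

    Returns count of events per customer_id, computed the distributed way:
    repartition rows by hash of the grouping key, then aggregate locally.
    """
    # Shuffle phase: every source node sends each row to its target node,
    # chosen by hashing the grouping key. This is the network-heavy step.
    inboxes = [[] for _ in range(n_nodes)]
    for rows in node_rows:
        for customer_id, event_id in rows:
            inboxes[hash(customer_id) % n_nodes].append((customer_id, event_id))
    # Aggregate phase: each target node now holds ALL rows for its
    # customers, so a purely local count is correct.
    counts = defaultdict(int)
    for inbox in inboxes:
        for customer_id, _ in inbox:
            counts[customer_id] += 1
    return dict(counts)

# Rows for customer 1 are scattered across both source nodes.
result = shuffle_and_count(
    [[(1, 10), (2, 11)], [(1, 12), (3, 13)]],
    n_nodes=2,
)
# result == {1: 2, 2: 1, 3: 1}
```

Had the table been clustered by customer_id instead, each node would already hold all rows for its customers and the shuffle phase could be skipped, which is exactly the difference the two query profiles should expose.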

(2) Join two ginormous tables on a cluster-key column, and then a non-cluster-key column. Compare and contrast the query profiles.


Example left as exercise for the reader :)

References:

US Patent 10019454 B2, "Data Management Systems and Methods"
The Snowflake Elastic Data Warehouse (useful whitepaper)
Apache Spark 2.3.0 RDD Programming Guide, Shuffle Operations
Snowflake Online Documentation

