Connect to data in a Clean Room

Fredrik Göransson
10 min read · Dec 5, 2024

Privacy-preserving collaboration with data in protected environments. Collaborate securely with anyone, cross-cloud and cross-region. Here are the options that make the collaboration much easier to manage, regardless of where you or the data are.

Publishers and Consumers collaborating in a Clean Room (Imagined by AI if Clean Rooms were a tangible thing and collaboration parties were sitting in the same physical room)

Data Clean Rooms have become an essential part of collaboration strategies that help organizations drive joint insights and analysis, even in scenarios where first-party data must be protected throughout the exchange. By combining strong data governance features natively on the platform, Snowflake Data Clean Rooms allow organizations to stand up secure environments for collaboration without moving data, without creating separate protected environments (a.k.a. the Switzerlands of data) and without forcing collaborators to move data into closed environments (the walled gardens). Each party in a collaboration (and there can be multiple parties involved, both as providers and as consumers of data) has explicit control over their data: how it is accessed, how it is secured and how it can be utilized in a specific clean room. This gives each party flexibility, as well as virtually unlimited ability to collaborate. Simply by being on Snowflake they already have everything required to stand up and operate clean rooms. This lowered barrier to entry means that clean rooms no longer have to be costly, complicated or time-consuming for organizations. In short, you can realize use cases with less friction and less effort.

Cross-cloud and region collaboration

The ability to collaborate across cloud platforms and regions allows organizations to connect to just about any other organization, regardless of their choices around geographic residency of the data or which cloud provider they have chosen. This unlocks collaboration opportunities well beyond what most other data clean room solutions offer. Consider an organization that has chosen AWS as its cloud provider, with a majority of its data there. There are certainly opportunities to collaborate with other organizations that are also on AWS, but what about those that base their operations on Azure? Or on Google Cloud? By only collaborating directly with other organizations on the same cloud provider platform, the reach is limited to 30%, 20%, or even 12% of the addressable market, depending on each CSP’s general market share.
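To make the reach argument concrete, here is a minimal sketch of the arithmetic. The shares are simply the illustrative 30%, 20% and 12% figures from the paragraph above, not measured data, and the names `csp_share` and `reachable_share` are made up for this example.

```python
# Illustrative shares only: the 30%, 20% and 12% quoted above,
# keyed by hypothetical placeholder names rather than real providers.
csp_share = {"cloud_a": 0.30, "cloud_b": 0.20, "cloud_c": 0.12}


def reachable_share(supported_clouds):
    """Sum the share of potential partners on the clouds you can collaborate across."""
    return sum(csp_share[cloud] for cloud in supported_clouds)


# Same-cloud collaboration only reaches the partners on that one cloud.
single_cloud = reachable_share(["cloud_a"])

# Cross-cloud collaboration reaches the partners on every supported cloud.
cross_cloud = reachable_share(["cloud_a", "cloud_b", "cloud_c"])

print(f"single cloud: {single_cloud:.0%}, cross-cloud: {cross_cloud:.0%}")
```

Under these assumed numbers, supporting all three clouds roughly doubles the reach compared with staying on the largest one alone.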

Data connected between cloud regions and providers

Even for organizations that are not on Snowflake, there are options for collaborating in a clean room with an organization that stands up a Snowflake Data Clean Room. Providers in a data clean room can simply invite such collaborators to join their clean room, and securely connect their data in a managed environment, which is still governed, secure and protected from other parties in the clean room.

Let’s explore the options for collaboration between different parties, especially when it comes to Data Clean Rooms, and how to connect to the data of each organization, preferably without any costly, inefficient or, worst of all, unsecured movement of data between environments.

Role of providers and consumers

Before diving into the different options, it helps to define the two major roles in a Data Clean Room. They exist in any data collaboration, but they are especially clearly defined in a data clean room. The first is the Provider. The Provider offers data that can be joined and analysed together with another party’s data. In the world of advertising and marketing, it could be a publisher that is selling ad inventory and wants to allow advertisers to (securely) explore the potential overlap of the publisher’s subscribers with the advertiser’s customers. In a clean room there can be multiple providers of data, each having their own data securely protected by their defined set of policies and governance rules. Each provider’s data never moves outside of their control and is never copied to another party’s environment. This is an essential cornerstone of the data clean room that ensures the protection of the data.

The other role in a collaboration is the Consumer of data. Consumers also bring their own data to the collaboration and run analyses on top of the combined data. In the example above, it would be an advertiser with a list of customers they would like to target, who wants to understand whether those customers can be reached on the publisher’s platform without revealing to the publisher which customers they are, while the publisher in turn does not reveal their list of subscribers. Many different types of analyses can be run in this type of environment. Simply put, the Consumer is the party in a Data Clean Room that can ask questions (and get them answered) based on the joined data from Provider(s) and Consumer.
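The overlap question in this example can be sketched in a few lines. This is a toy illustration in plain Python, not how Snowflake Data Clean Rooms are implemented: the function name `overlap_count`, the hashed-email join key and the `min_group_size` threshold are all assumptions made for the sake of the example. The hashing here only normalizes the join key. The point is that the Consumer gets an aggregate back, never the Provider’s rows.

```python
import hashlib


def hashed(email):
    """Normalize and hash an identifier so both parties join on the same key."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def overlap_count(provider_ids, consumer_ids, min_group_size=3):
    """Return only the size of the overlap, or None if it is too small to release.

    min_group_size is a hypothetical threshold standing in for whatever
    policy the Provider attaches to their data.
    """
    matches = set(provider_ids) & set(consumer_ids)
    if len(matches) < min_group_size:
        return None  # suppress small groups that could identify individuals
    return len(matches)


# Hypothetical data: publisher subscribers and advertiser customers.
subscribers = {hashed(e) for e in ["ann@example.com", "bob@example.com",
                                   "cat@example.com", "dan@example.com"]}
customers = {hashed(e) for e in ["Ann@example.com", "bob@example.com ",
                                 "cat@example.com", "eve@example.com"]}

print(overlap_count(subscribers, customers))
```

The advertiser learns how many of its customers are also subscribers, but not which ones, and a very small overlap is withheld entirely.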

Common setup

The most common setup for a Data Clean Room, and the easiest to get started with, is that each party, Provider and Consumer, has its own Snowflake account. With that, each organization has all it needs to collaborate.

In the basic scenario the Provider creates a Data Clean Room in their Snowflake account. A Snowflake account is bound to a specific cloud provider and a region from that provider. It is important to know, however, that an organization can have multiple Snowflake accounts; this is the foundation of cross-platform and cross-region collaboration. For example, a customer can have one Snowflake account in AWS Frankfurt and another in Azure Ireland. Here is a list of all supported regions.

Direct Publisher to Consumer setup

In this scenario a Consumer is invited to join the Data Clean Room and can connect their Snowflake account in the same region to bring in their data. This scenario involves no data movement at all: the data stays in the respective environments at all times, during and after the collaboration.

The data from the Provider is made available in the Data Clean Room through data sharing, but is never moved, copied or replicated.

Cross-cloud and region collaboration

There are also scenarios where a Provider would like to allow Consumers to join the Data Clean Room regardless of where they are or which cloud provider they have chosen. There are two ways to achieve this. Either the Provider’s Data Clean Room is made available to Consumers wherever they are, which can be done efficiently and seamlessly with Listing Auto-fulfillment, or the Consumers make their data available in a region where the Data Clean Room is offered. In both scenarios, data never moves outside of each party’s accounts.

Provider cross-cloud collaboration

Publisher to Consumer setup with cross-region/-cloud Auto-Fulfillment

Here the Provider’s data and Clean Room protection are made available using Listing Auto-fulfillment to Snowflake accounts in each region where the Provider wants to meet Consumers. The Provider’s data is securely and efficiently replicated to an account that is controlled by the Provider, in the region where it needs to be. That account is created behind the scenes for the Provider and no additional work needs to be done. Consumers simply discover the Data Clean Room in the region where they join, and their data never moves outside their original account.

Pros: Simple to set up and simple to manage for Providers. Simple to discover and connect to for the Consumers.

Cons: If the Provider dataset is very large and/or has a very high churn, the replication of data involved may introduce latency (in terms of data updates) and additional cost. While the replication is highly performant and cost-efficient, it is a cost that is appended to the Provider’s cost for operating the Data Clean Room.

Consumer cross-cloud collaboration

The Consumer can inversely make their data available for the Data Clean Room collaboration in a region that the Provider has chosen. This could be beneficial in situations where the Consumer’s dataset is significantly smaller than the Provider’s dataset, or where the Consumers are joining from a wide range of regions and clouds that the Provider does not want to support directly.

Here the Consumer simply ensures they have, or creates, an account in a region matching the Provider’s offered Data Clean Room, and then uses the built-in data replication to make the Consumer dataset available in that region and run the analysis there. Data replication in Snowflake is a simple and efficient process: you specify the source and target databases, and after that Snowflake keeps them consistent and applies updates efficiently. It is cost-effective and performant, and in most cases only incremental changes are actually moved.
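As a rough intuition for why it matters that only incremental changes are moved, the sketch below compares an initial load with a later refresh. It is a toy model in plain Python, not Snowflake’s replication mechanism; `changed_rows`, `refresh`, the row dictionaries and the version numbers are all invented for illustration.

```python
def changed_rows(source, replica):
    """Return the rows that are new or have a newer version than the replica's copy."""
    return {
        key: row
        for key, row in source.items()
        if key not in replica or replica[key]["version"] < row["version"]
    }


def refresh(source, replica):
    """Apply only the changed rows to the replica and report how many moved."""
    delta = changed_rows(source, replica)
    replica.update(delta)
    # Drop rows that no longer exist in the source.
    for key in set(replica) - set(source):
        del replica[key]
    return len(delta)


# Hypothetical Consumer table, keyed by customer id.
source = {i: {"version": 1, "segment": "new"} for i in range(1000)}
replica = {}

first = refresh(source, replica)  # initial load moves every row

source[7] = {"version": 2, "segment": "loyal"}   # one row changes
source[1000] = {"version": 1, "segment": "new"}  # one row is added

second = refresh(source, replica)  # only the two affected rows move
print(first, second)
```

In this model the cost of each refresh scales with the churn of the dataset rather than its total size, which is why a small or slowly changing Consumer dataset is a good candidate for this setup.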

Publisher to Consumers setup with Consumer direct data replication

Pros: The Provider saves on cost by not replicating data, and the process is simplified for the Provider as only a single region has to be considered.

Cons: There is no automated process for setting up data replication for Consumers. Consumers have to create an account in a region matching the Provider’s, where they can set up the replication.

Consumer external data connection

The final option is to allow Consumers to connect to data that sits in external cloud storage. This could be AWS S3, Azure Blob Storage or Google Cloud Storage. The simplest form is files in Parquet format, but there are additional benefits to keeping the data in the Iceberg table format, which can improve performance and reduce the cost of accessing the files. Both are valid options for the Consumer. As a side note, these options are available to the Provider as well: if the data is available in cloud storage, it can be connected to the Data Clean Room without having to be moved from there.

In the scenario where a Consumer is connecting to data in external cloud storage, a Snowflake account is still used to securely connect to that data. This is an important aspect, as it ensures that the Provider account, or any other part of the Data Clean Room, is not connected to the storage, only the account dedicated to the Consumer. So while read and access rights to that cloud storage can be given to the clean room environment, it is always done in the scope of a Snowflake account dedicated to the Consumer.

Direct Publisher to Consumer setup with Consumer external data or Iceberg data access

In this setup, the data in the external storage can be located anywhere, as long as read and access permissions can be given to the Snowflake account for that Consumer. This allows data to be connected regardless of where it is stored. This offers flexibility, but it should be noted that it may also introduce latency if the data is stored in a region remote from the Data Clean Room region, as well as potential egress costs on the cloud storage side when the data is accessed. The data is, however, not moved when it is connected to the Clean Room.

Pros: Data can be connected from anywhere, as long as it is in cloud storage.

Cons: Can drive latency and lower performance in analysis when accessing the data. With Iceberg tables this is somewhat alleviated, but data is still accessed from a potentially remote region.

Consumers without Snowflake accounts

While having a dedicated Snowflake account is the option that allows the highest level of flexibility and, in some situations, the best performance, it is not necessary for all Consumers to have a Snowflake account to collaborate in a Provider’s Data Clean Room. A Provider can invite a Consumer to join using a Managed Account. This is a Snowflake account that is dedicated to the Consumer. It is not controlled by or accessible to the Provider, so it offers the same security, isolation and control that a Consumer-owned account does, but it is dedicated to the collaboration in a specific Data Clean Room.

The invitation to use a Managed Account means that Consumers can join a Data Clean Room by simply accepting basic Terms and Conditions for that account; they don’t need a full signed agreement or contract with Snowflake. This makes it very easy for Providers to collaborate with a wide range of Consumers, whether or not they are already on Snowflake.

The limitation of these accounts is that they are dedicated to the specific Data Clean Room and cannot be used for other purposes, such as general analytics, data engineering or other operations. If a Consumer requires this, they can simply get a full account to work with.

Publisher to Managed Consumer accounts with external cloud storage data setup

Once a Managed Account has been created for a Consumer, that Consumer can securely connect their data that sits in an external storage location, like an AWS S3 bucket. That connection can only be used by the Managed Account, and for the purposes of the Data Clean Room.

Summary

Enabling two organizations to collaborate in a Clean Room is a fairly low-effort operation, especially considering that no additional license fee or product is needed. Choosing how to meet Consumers in a distributed environment with cross-cloud and cross-region data can be done in a number of different ways, and by looking at the requirements of each use case, different approaches can be combined for a balance of ease of use, data movement and performance. Requirements around data movement, and which party has the larger or higher-churn dataset, usually influence the decision as well.

With these options, customers are free to set up and design their Data Clean Room strategies without limitations. All of the above strategies can additionally be mixed in any single engagement. It is up to the Provider to design the approach that meets their needs and the needs of their Consumers, wherever they are.
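The trade-offs listed in the Pros and Cons above can be condensed into a simple decision helper. This is only one reading of the guidance in this article: the function `suggest_setup`, its parameters and the order of the checks are assumptions, and a real decision will weigh cost, latency and governance requirements in more detail.

```python
def suggest_setup(*, same_region, consumer_has_account,
                  consumer_data_in_cloud_storage,
                  provider_data_large_or_high_churn):
    """Suggest one of the setups described in this article for a single Consumer.

    The ordering of the checks is an assumption made for this sketch.
    """
    if not consumer_has_account:
        # The Provider invites the Consumer through a Managed Account.
        return "managed account with external cloud storage"
    if same_region:
        # No replication needed on either side.
        return "direct collaboration in the same region"
    if consumer_data_in_cloud_storage:
        # Data stays in the bucket; expect possible latency and egress cost.
        return "consumer external data connection"
    if provider_data_large_or_high_churn:
        # Replicating the Provider's data would be the expensive side.
        return "consumer-side data replication"
    return "provider listing auto-fulfillment"


print(suggest_setup(same_region=False, consumer_has_account=True,
                    consumer_data_in_cloud_storage=False,
                    provider_data_large_or_high_churn=True))
```

Because the strategies can be mixed in a single engagement, a Provider could run a check like this per Consumer rather than picking one setup for everyone.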

Comparison of Cost, Ease of Use, Performance and Data Movement for the different setups

Data Clean Rooms on Snowflake are free of charge. Any customer with a contract with Snowflake can simply create one and get started with direct collaborations; just follow the Getting Started documentation.


Written by Fredrik Göransson

Have worked with innovation and architecture in IT for the last 20+ years. Really passionate getting cutting-edge technology, architecture and code
