What You Should Know About Snowflake’s Snowpipe Streaming


Data ingestion is a vital aspect of data management. Fast, reliable ingestion drives faster decision-making and increases the value a business gets from its data. One common data platform used by businesses today is Snowflake. With Snowflake, you can do batch loading with COPY, continuous loading with Snowpipe, and stream loading with the new Snowpipe Streaming API.

Snowpipe automatically loads newly available data from your cloud storage into Snowflake tables. In contrast, Snowpipe Streaming exposes a streaming API, accessed through a Java SDK, that loads data from sources directly into the appropriate Snowflake table.

Let’s explore Snowpipe Streaming, how it differs from the commonly used Snowpipe, its business applications, and some of its limitations.

Snowpipe vs. Snowpipe Streaming

Snowpipe is a micro-batch, continuous, serverless ingestion tool that loads recently arrived data from a staging area, such as a cloud storage solution, into Snowflake tables. Snowpipe detects new data in the staging area (arriving from sources like WebSocket APIs, CRM systems, and web event trackers) either through cloud messaging notifications or through calls to a public REST endpoint. Snowpipe is therefore a two-step loading process involving a staging area like Google Cloud Storage, Amazon S3, or Microsoft Azure Blob Storage.
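As a minimal sketch of that second step under the REST approach, the example below uses the SimpleIngestManager class from Snowflake's snowflake-ingest-sdk to notify Snowpipe that a staged file is ready. The account, user, pipe name, staged file path, and key handling are placeholders, and the sketch assumes the pipe and staged file already exist:

```java
import java.security.KeyPair;
import java.security.KeyPairGenerator;

import net.snowflake.ingest.SimpleIngestManager;
import net.snowflake.ingest.connection.IngestResponse;
import net.snowflake.ingest.utils.StagedFileWrapper;

public class SnowpipeRestSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder key pair; in practice, load the RSA key pair whose
        // public key is registered with the Snowflake user.
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair keyPair = kpg.generateKeyPair();

        // Point the manager at an existing pipe (placeholder names).
        SimpleIngestManager manager = new SimpleIngestManager.Builder()
                .setAccount("my_account")
                .setUser("INGEST_USER")
                .setPipe("MY_DB.PUBLIC.MY_PIPE")
                .setKeypair(keyPair)
                .build();

        // Notify Snowpipe that a file already sitting in the stage is ready to load.
        StagedFileWrapper file = new StagedFileWrapper("events/2024/01/events.csv.gz");
        IngestResponse response = manager.ingestFile(file, null);
        System.out.println("Snowpipe response: " + response);
    }
}
```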

Snowpipe Streaming, on the other hand, offers real-time data ingestion directly into your Snowflake tables using the streaming ingest SDK via the streaming API. Removing the intermediary staging step cuts end-to-end latency from minutes to seconds, significantly improving performance and easing scalability.

Combining the streaming, low-latency capabilities of Snowpipe Streaming with Snowpipe enables you to build a scalable, high-performing data architecture catering to a diverse range of business use cases.

However, these ingestion methods differ in the following ways.

| Data ingestion methods | Snowpipe Streaming | Snowpipe |
| --- | --- | --- |
| Latency | Direct loading into Snowflake reduces latency to just seconds for real-time applications. | The intermediary loading step from cloud storage adds network latency, increasing latency to minutes, which isn't ideal for real-time apps. |
| Costs | No additional storage costs, though additional costs may arise from configuring separate Java clients. | The staging area adds storage costs to your overall data architecture budget. |
| Configuration complexity | Requires a Java client to use the streaming API, so use cases with data sources other than the Kafka connector require building a custom Java client. | Needs no third-party configuration. |
| Data sources | Kafka connectors, IoT devices, website events, and other managed cloud services. | A cloud data store, like Amazon S3. |
| Loading data format | Writes data into Snowflake tables as rows. | Writes data into Snowflake tables as files. |

Data Ingestion Method

Although both Snowpipe and Snowpipe Streaming are serverless ingestion methods, Snowpipe loads new data from a staging area into Snowflake tables as files, while Snowpipe Streaming writes data directly into Snowflake tables as rows via the streaming API.

Latency

Snowpipe Streaming offers latency of seconds because it eliminates the delay of loading through the storage stage, making it ideal for highly perishable data such as IoT sensor readings, stock market trading data, and logs. Snowpipe offers latency of minutes, which may be unsuitable because the data's value can drop significantly before it is used.

Costs

Snowpipe Streaming eliminates the extra costs incurred for cloud data storage. Snowpipe's need for staged data increases your storage costs.

Configuration Complexity

Snowpipe Streaming requires developing a custom Java client around the ingest SDK to load data into the appropriate table. This adds configuration work, complexity, and development time to your ingestion architecture. Snowpipe requires no such third-party configuration, making it easier to deploy and maintain.
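A rough sketch of what such a client looks like, based on the streaming classes in Snowflake's snowflake-ingest-sdk, follows. The connection properties, database, schema, table, channel name, and column names are all placeholders:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import net.snowflake.ingest.streaming.InsertValidationResponse;
import net.snowflake.ingest.streaming.OpenChannelRequest;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestChannel;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClient;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClientFactory;

public class StreamingIngestSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings for your Snowflake account.
        Properties props = new Properties();
        props.put("url", "https://my_account.snowflakecomputing.com:443");
        props.put("user", "INGEST_USER");
        props.put("private_key", "<private key PEM body>");

        try (SnowflakeStreamingIngestClient client =
                 SnowflakeStreamingIngestClientFactory.builder("MY_CLIENT")
                         .setProperties(props)
                         .build()) {

            // A channel is an append-only stream bound to one target table.
            OpenChannelRequest request = OpenChannelRequest.builder("MY_CHANNEL")
                    .setDBName("MY_DB")
                    .setSchemaName("PUBLIC")
                    .setTableName("EVENTS")
                    .setOnErrorOption(OpenChannelRequest.OnErrorOption.CONTINUE)
                    .build();
            SnowflakeStreamingIngestChannel channel = client.openChannel(request);

            // Each row is a map of column name to value; the offset token lets a
            // restarted client resume from the last row Snowflake acknowledged.
            Map<String, Object> row = new HashMap<>();
            row.put("EVENT_ID", 42);
            row.put("PAYLOAD", "{\"price\": 101.5}");
            InsertValidationResponse resp = channel.insertRow(row, "offset-42");
            if (resp.hasErrors()) {
                throw resp.getInsertErrors().get(0).getException();
            }

            // Flush and wait for the channel's rows to be committed.
            channel.close().get();
        }
    }
}
```

Note that the Kafka connector wraps this kind of client for you; hand-written code along these lines is only needed when no connector exists for your source.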

Data Sources

Snowpipe loads data from cloud storage, such as Amazon S3, Google Cloud Storage, or Azure Blob Storage. Snowpipe Streaming loads data directly from sources via the streaming API and client SDK.

Snowpipe Streaming: When to Use It and When Not To

Snowpipe Streaming is most valuable for highly perishable streaming data like stock market, IoT, and messaging data:

  • Financial applications: Effective, revenue-generating trades that boost portfolio performance rely on quick decisions driven by market data. With the streaming ingest API, messages from sources like your WebSocket APIs are mapped immediately into columns in your table, saving storage costs and shortening the time to insights for your trades.
  • IoT applications: IoT devices rely on the timely, uninterrupted flow of data generated by sensors, phones, watches, equipment, and more to deliver fast insights that keep business operations smooth and efficient. Loading this data into a staging area before loading it into Snowflake adds latency, hurting analytics efforts. The streaming API reduces latency, making your streaming data available for analysis and other use cases within seconds (see the batch-loading sketch after this list).
  • Monitoring systems and logs for preventative maintenance: Large-scale manufacturers rely on logs that report the health and performance of their tools and equipment. The Snowpipe Streaming API grants engineering and maintenance teams quicker access to log metrics, like process effectiveness, mean time to failure, and downtime, that flag potential issues needing immediate attention.
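For the IoT case above, here is a hedged sketch of batching sensor readings through a channel. It assumes a channel opened against the target table as in the earlier sketch and uses the insertRows batch call from the same SDK; the reading fields and offset token are placeholders:

```java
import java.util.List;
import java.util.Map;

import net.snowflake.ingest.streaming.InsertValidationResponse;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestChannel;

public class SensorBatchWriter {
    // Writes a batch of sensor readings as rows; assumes the channel was
    // opened against the target table as in the earlier sketch.
    static void writeBatch(SnowflakeStreamingIngestChannel channel,
                           List<Map<String, Object>> readings,
                           String lastOffset) {
        // insertRows submits the whole batch under a single offset token,
        // so a restarted client can resume from the last acknowledged batch.
        InsertValidationResponse resp = channel.insertRows(readings, lastOffset);
        if (resp.hasErrors()) {
            throw resp.getInsertErrors().get(0).getException();
        }
    }
}
```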

The Downsides of Snowpipe Streaming

Although Snowpipe Streaming improves latency and reduces storage requirements for your data architecture, it faces the following challenges:

  • Increased technical complexity: Instead of a staging area that loads data into Snowflake tables, Snowpipe Streaming requires a streaming ingest SDK Java client that writes data into the appropriate table. This means you may need to develop your own Java client to support streaming ingestion for use cases without a Kafka connector. Building such a client is technical work that requires skilled developers and may increase the development cost of your architecture.
  • Expensive: Ingestion costs stay low when your use case is covered by the already supported client SDK, the Kafka connector. The picture changes for streaming data from other sources, like Amazon Kinesis or Google Cloud Dataflow, where you need to develop a Java client for each source, increasing configuration costs for your data architecture.
  • Risk of vendor lock-in: A robust data architecture usually integrates multiple cloud services, letting you benefit from each service's strengths. However, as with Snowpipe, the Snowpipe Streaming API stores data in Snowflake's proprietary FDN format, which means all data transformations and pipelines remain usable only within Snowflake, and data streamed from other sources may incur costs to transform it into the FDN format. This risk of vendor lock-in may limit the flexibility and innovation otherwise offered by open-source data formats.
  • Limited functionality: Currently, the Snowpipe Streaming API offers only insert operations. Other data operations, like updates, still require a staging table before loading, which may mean using both methods in your data architecture (a common pattern is sketched below).
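As a sketch of that hybrid pattern, the example below uses the Snowflake JDBC driver to periodically fold rows streamed into a staging table into the target table with a MERGE. The connection settings, table names, and key column are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class MergeFromStagingSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder JDBC settings for the Snowflake driver.
        Properties props = new Properties();
        props.put("user", "INGEST_USER");
        props.put("password", "<password>");
        props.put("db", "MY_DB");
        props.put("schema", "PUBLIC");
        props.put("warehouse", "MY_WH");

        String url = "jdbc:snowflake://my_account.snowflakecomputing.com";
        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement()) {
            // Snowpipe Streaming appends rows to EVENTS_STAGING (insert-only);
            // a scheduled MERGE then applies them to EVENTS as upserts.
            stmt.executeUpdate(
                    "MERGE INTO EVENTS t USING EVENTS_STAGING s ON t.EVENT_ID = s.EVENT_ID "
                            + "WHEN MATCHED THEN UPDATE SET t.PAYLOAD = s.PAYLOAD "
                            + "WHEN NOT MATCHED THEN INSERT (EVENT_ID, PAYLOAD) "
                            + "VALUES (s.EVENT_ID, s.PAYLOAD)");
        }
    }
}
```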

How StreamSets Complements Snowpipe and Snowpipe Streaming

Designing a robust data integration strategy for loading data into your Snowflake data warehouse can be challenging. StreamSets enables a seamless strategy by letting users create smart, reusable pipelines for Snowflake cloud data integration. Its CDC capabilities also ensure your Snowflake data is always up to date.

The StreamSets Transformer for Snowflake engine uses Snowpark and pre-built processors to design and execute complex processing tasks natively in Snowflake, without writing complex SQL queries, functions, or orchestration code, which can reduce development, training, and collaboration costs.

With StreamSets, you can use all the power of Snowflake with an intuitive drag-and-drop interface. Democratize your data by empowering the people closest to it to build, maintain, and orchestrate data pipelines, even if they aren't technical. More technical data professionals don't have to give up any functionality with StreamSets' support for custom UDFs and advanced processors like Slowly Changing Dimensions. Give your data team the best of both worlds.