
Release 0.13.0


Major changes

New serializer to improve serialization speed

A new custom serializer has been implemented that follows the Apache Arrow serialization format while minimizing extra allocations and memory copies.

Additionally, the default compression method was changed from ZLib to Zstd, further improving serialization performance.

Support for pause & resume

A new feature has been added to allow pausing and resuming data streams, making it easier to conduct maintenance or temporarily halt processing without losing state.

For more information, visit https://koralium.github.io/flowtide/docs/deployment/pauseresume.

Integer column changed from 64 bits to dynamic size

The integer column now selects its bit size based on the data stored in the column. This reduces memory usage for columns with smaller integer values. The bit size is determined on a per-page basis, so pages with larger values use wider integers only where necessary.
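
As a rough illustration of the idea (a minimal sketch, not Flowtide's internal code), selecting the smallest bit width that covers every value on a page might look like this:

static int SelectBitWidth(ReadOnlySpan<long> pageValues)
{
    // Find the range of values stored on this page.
    long min = long.MaxValue, max = long.MinValue;
    foreach (var v in pageValues)
    {
        if (v < min) min = v;
        if (v > max) max = v;
    }
    // Pick the narrowest integer type that can hold every value.
    if (min >= sbyte.MinValue && max <= sbyte.MaxValue) return 8;
    if (min >= short.MinValue && max <= short.MaxValue) return 16;
    if (min >= int.MinValue && max <= int.MaxValue) return 32;
    return 64;
}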

Delta Lake Support

This version adds support for both reading from and writing to the Delta Lake format, allowing easy integration with data lake storage. To learn more about Delta Lake support, please visit: https://koralium.github.io/flowtide/docs/connectors/deltalake

Custom data source & sink changed to use column-based events

Both the custom data source and sink now use column-based events. This improves connector performance by eliminating the need to convert data between column-based and row-based formats during streaming.

Minor changes

Elasticsearch connector change from Nest to Elastic.Clients.Elasticsearch

The Elasticsearch connector has been updated from the deprecated Nest package to Elastic.Clients.Elasticsearch. This change requires stream configurations to be adjusted for the new connection settings.

Additionally, connection settings are now provided via a function, enabling dynamic credential management, such as rolling passwords.
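
A hypothetical configuration might look like the following sketch. The sink name, option shape, and FetchCurrentPassword helper are assumptions for illustration; ElasticsearchClientSettings and BasicAuthentication come from the Elastic.Clients.Elasticsearch and Elastic.Transport packages.

// Hypothetical example: the method and helper names are illustrative only.
.AddElasticsearchSink("es_output", () =>
    // Invoked when a connection is created, so rotated credentials are picked up.
    new ElasticsearchClientSettings(new Uri("https://localhost:9200"))
        .Authentication(new BasicAuthentication("user", FetchCurrentPassword())));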

Add support for custom stream listeners

Applications can now listen to stream events like checkpoints, state changes, and failures, allowing for custom exit strategies or monitoring logic.

Example:

.AddCustomOptions(s =>
{
    s.WithExitProcessOnFailure();
});

Cache lookup table for state clients

An internal optimization adds a small lookup table for state client page access, reducing contention on the global LRU cache. This change has shown a 10–12% performance improvement in benchmarks.
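
Conceptually, the optimization resembles the following sketch (the names are illustrative, not Flowtide's internals): each state client first checks a small local table before falling back to the shared cache.

// Illustrative sketch: a small per-client lookup table in front of the
// shared LRU cache, so hot pages are resolved without global synchronization.
private readonly Dictionary<long, Page> _localPages = new();

public Page GetPage(long pageId)
{
    if (_localPages.TryGetValue(pageId, out var page))
    {
        return page; // fast path: no contention on the global cache
    }
    page = _globalLruCache.Get(pageId); // slow path: shared, synchronized
    _localPages[pageId] = page;
    return page;
}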

Release 0.12.0


Major changes

All Processing Operators Updated to Column-Based Events

All processing operators now use the column-based event format, leading to better performance. However, some connector sources and sinks still use the row-based event format, and a few functions continue to rely on it.

MongoDB Source Support

This release adds support for reading data from MongoDB, including using MongoDB's change streams to react directly to data changes.
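
Under the hood this builds on the MongoDB driver's change streams. A minimal sketch of the mechanism using the official C# driver directly (outside Flowtide) looks like this:

using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017");
var collection = client.GetDatabase("mydb").GetCollection<BsonDocument>("orders");

// Watch() opens a change stream; inserts, updates, and deletes arrive as
// events instead of being polled for.
using var cursor = collection.Watch();
foreach (var change in cursor.ToEnumerable())
{
    Console.WriteLine($"{change.OperationType}: {change.DocumentKey}");
}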

SQL Server Support for Stream State Persistence

You can now store the stream state in SQL Server. For setup instructions, refer to the documentation: https://koralium.github.io/flowtide/docs/statepersistence#sql-server-storage

Timestamp with Time Zone Data Type

A new data type for timestamps has been added. This ensures that connectors can use the appropriate data type, especially when writing. For example, writing to MongoDB now uses the BSON Date type.

Minor changes

Virtual Table Support

Static data selection is now supported. Example usage:

INSERT INTO output
SELECT * FROM
(
    VALUES
    (1, 'a'),
    (2, 'b'),
    (3, 'c')
)

Release 0.11.0


Major Changes

Column-Based Event Format

Most operators have transitioned from treating events as FlexBuffer-encoded rows to a column-based format following the Apache Arrow specification. This change has led to significant performance improvements, especially in the merge join and aggregation operators. Transitioning from row-based to column-based events involved a major rewrite of core components, and some operators still use the row-based format, which will be updated in the future.

Not all expressions have been converted to work with column data yet. However, the solution currently handles conversions between formats to maintain backward compatibility. Frequent conversions may result in performance decreases.

The shift to a column format also introduced the use of unmanaged memory for new data structures, for the following reasons:

  • 64-byte aligned memory addresses for optimal SIMD/AVX operations.
  • Immediate memory return when a page is removed from the LRU cache, instead of waiting for the next garbage collection cycle.

With unmanaged memory, it is now possible to track memory allocation by different operators, providing better insight into memory usage in your stream.
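
As a sketch of the general technique (not Flowtide's allocator), .NET 6+ exposes aligned unmanaged allocation directly:

using System.Runtime.InteropServices;

unsafe
{
    // 64-byte aligned, unmanaged allocation for a column page.
    void* page = NativeMemory.AlignedAlloc(byteCount: 32 * 1024, alignment: 64);
    try
    {
        // ... fill the page with column data laid out for SIMD/AVX reads ...
    }
    finally
    {
        // Memory is returned immediately, not at the next GC cycle.
        NativeMemory.AlignedFree(page);
    }
}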

B+ Tree Splitting by Byte Size

Previously, the B+ tree determined page sizes based on the number of elements, splitting pages into two equal parts when the max size (e.g., 128 elements) was reached. While this worked for streams with uniform element sizes, it led to size discrepancies in other cases, affecting serialization time and slowing down the stream.

This update introduces page splitting based on byte size, with a default page size of 32KB, ensuring more consistent and predictable page sizes.
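
A minimal sketch of the split decision: instead of cutting after a fixed element count, pick the index where the cumulative byte size reaches half the page.

// Illustrative sketch: find the split point by cumulative byte size, so both
// resulting pages hold roughly equal bytes even when element sizes vary widely.
static int FindSplitIndex(IReadOnlyList<int> elementByteSizes)
{
    long totalBytes = 0;
    foreach (var size in elementByteSizes) totalBytes += size;

    long running = 0;
    for (int i = 0; i < elementByteSizes.Count; i++)
    {
        running += elementByteSizes[i];
        if (running >= totalBytes / 2)
        {
            return i + 1; // first element that moves to the new right-hand page
        }
    }
    return elementByteSizes.Count;
}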

Initial SQL Type Validation

This release contains the beginning of type validation when creating a Substrait plan using SQL. Currently, only SQL Server provides specific type metadata, while sources like Kafka continue to designate columns as 'any' due to varying payload types.

The new validation feature raises exceptions for type mismatches, such as when a boolean column is compared to an integer (e.g., boolColumn = 1). This helps inform users transitioning from SQL Server that bit columns are treated as boolean in Flowtide.

New UI


A new UI has been developed, featuring an integrated time series database that enables developers to monitor stream behavior over time. This database’s API aligns with Prometheus standards, allowing for custom queries to investigate potential issues.

The UI retrieves all data through the Prometheus API endpoint, enabling future deployment as a standalone tool connected to a Prometheus server.

Minor Changes

Congestion Control Based on Cache Misses

Flowtide processes data in small batches, typically 1-100 events. While this approach works well with in-memory data, cache misses that require persistent storage access can create bottlenecks. This is particularly problematic with multiple chained joins, where sequential data fetching can delay processing.

To address this, the join operator now monitors cache misses during a batch and, when a threshold is reached, splits the processed events and forwards them to the next operator. This change allows operators to access persistent storage in parallel, easing congestion.
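
In pseudocode form, the behavior resembles this sketch (the helper names and threshold are assumptions, not Flowtide's API):

// Illustrative sketch: cut a batch early once too many cache misses occur,
// so the next operator can start its own storage fetches in parallel.
const int CacheMissThreshold = 16; // illustrative value
int cacheMisses = 0;
var pending = new List<RowEvent>();

foreach (var e in batch)
{
    if (!TryProcessFromCache(e))
    {
        cacheMisses++;
        ProcessWithStorageFetch(e); // touches persistent storage
    }
    pending.Add(e);

    if (cacheMisses >= CacheMissThreshold)
    {
        EmitToNextOperator(pending); // forward early instead of waiting
        pending = new List<RowEvent>();
        cacheMisses = 0;
    }
}
if (pending.Count > 0) EmitToNextOperator(pending);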

Reduce the number of pages written to persistent storage

Previously, all B+ tree metadata was written to persistent storage at every checkpoint, including root page IDs. In streams with numerous operators, this led to unnecessary writes.

Now, metadata is only written if changes have occurred, reducing the number of writes and improving storage efficiency.
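
Conceptually this is a dirty flag per tree, as in the following sketch (names illustrative, not Flowtide's internals):

// Illustrative sketch: persist B+ tree metadata at a checkpoint only when it
// has actually changed since the last checkpoint.
private bool _metadataDirty;

public async Task CheckpointAsync()
{
    if (_metadataDirty)
    {
        await WriteMetadataPageAsync(); // root page id and related metadata
        _metadataDirty = false;
    }
}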