Data infrastructure

Data infrastructure software is the layer that stores, moves, processes and exposes data across cloud and enterprise environments - data warehouses and lakehouses, transactional and streaming databases, ELT/ETL pipelines, observability, catalogs, governance and the AI/ML infrastructure built on top. The category breaks into cloud data warehouses (Snowflake, BigQuery, Redshift and Databricks), databases (MongoDB, PostgreSQL ecosystem, Couchbase and Cassandra), streaming (Kafka/Confluent), ELT/ETL (dbt, Fivetran, Airbyte and Hevo), data observability, data catalogs and the new vector and AI databases. Snowflake and Databricks define the modern data cloud; the lakehouse versus warehouse architecture debate has narrowed since open table formats (Iceberg and Delta) became broadly supported.

It spans cloud data warehouses and lakehouses, transactional databases, streaming and real-time data, ELT/ETL and data integration, data observability, data catalogs and governance, reverse ETL and operational analytics, and vector databases for AI.

Revenue is dominated by consumption-based pricing on data warehouses and lakehouses (Snowflake credits, Databricks DBUs), per-row or per-event pricing on ELT/ETL pipelines, per-seat or per-event pricing on observability and data quality, capacity-based pricing on streaming, and increasingly token-based pricing on vector and AI databases.

Data infrastructure is part of Software.

Global market size: $120B

Public companies: 28

Key VC investors: Y Combinator, Sequoia Capital, Alumni Ventures, General Catalyst

Key strategic buyers: Snowflake, Databricks, IBM, MariaDB

Business model

How do data infrastructure companies monetize?

Data infrastructure software companies monetize through consumption-based pricing on warehouses and lakehouses, per-row and per-event pricing on ELT pipelines, and open-source-core enterprise tiers.

Consumption-based pricing

Pay-per-compute pricing on data warehouses and lakehouses. Snowflake credits and Databricks DBUs are the reference models. This pricing is also the largest source of revenue volatility in the category.
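As a rough illustration of how a pay-per-compute bill is built up, the sketch below multiplies the compute units consumed each month by a contracted unit price. The function name, the price per unit and the usage figures are all hypothetical assumptions, not vendor list prices.

```python
from decimal import Decimal


def consumption_bill(units_by_month, price_per_unit):
    """Return the monthly charges for a pay-per-compute contract.

    units_by_month: credits (or DBUs) consumed each month (hypothetical usage).
    price_per_unit: contracted price per unit (hypothetical, not a list price).
    """
    price = Decimal(str(price_per_unit))
    return [Decimal(str(units)) * price for units in units_by_month]


# Hypothetical usage: a quiet month, a typical month and a heavy backfill month.
monthly_units = [800, 1000, 1900]
charges = consumption_bill(monthly_units, price_per_unit=3)
print([str(charge) for charge in charges])
```

Because each charge tracks usage rather than a fixed subscription, billed revenue moves with customer workloads from month to month.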

Per-row / per-event ELT/ETL

Pricing based on rows processed, events synced or active records. Fivetran, Hevo and Airbyte use this pricing, which produces customer pushback at scale.

Per-seat data tools

Per-user subscriptions for data observability, catalogs, dbt and operational analytics. Monte Carlo, dbt Labs and Hightouch use this pricing.

Capacity-based streaming

Capacity, throughput or partition-based pricing on streaming and Kafka. Confluent, Aiven Kafka and Amazon MSK use variants.

Open-source-core enterprise

Free open-source core with a paid enterprise tier. dbt Labs, Confluent, Elastic, MongoDB, Grafana and Airbyte all use variants of this model.

Token / vector consumption

Token-based or query-based pricing on vector databases and AI infrastructure. Pinecone, Weaviate, Qdrant and Chroma use variants.

Data infrastructure valuations in May 2026

Public data infrastructure comps trade at a median of 8.9x EV/Revenue. The median revenue multiple across data infrastructure M&A deals was 6.9x in the last 12 months, and the median across data infrastructure VC rounds was 28x over the same period.

Median EV/Revenue as of May 2026 for public data infrastructure companies: 8.9x

Highest valued public data infrastructure company based on EV/Revenue (excluding outliers): Oracle, 12x

Median EV/Revenue across data infrastructure M&A deals in the last 12 months: 6.9x

Median EV/Revenue across data infrastructure VC rounds in the last 12 months: 28x
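For readers less familiar with the metric, the sketch below shows one way a median EV/Revenue multiple could be computed across a comp set. The comp names and figures are hypothetical and illustrative only, and the revenue period (for example last twelve months) is an assumption that the figures above do not specify.

```python
from statistics import median


def ev_to_revenue(enterprise_value, revenue):
    """EV/Revenue multiple: enterprise value divided by revenue (same currency and units)."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return enterprise_value / revenue


# Hypothetical comps as (enterprise value, revenue) in $M; not real company data.
comps = {
    "Comp A": (9000, 1000),
    "Comp B": (4200, 700),
    "Comp C": (2400, 200),
}

multiples = {name: ev_to_revenue(ev, rev) for name, (ev, rev) in comps.items()}
median_multiple = median(multiples.values())
print(f"Median EV/Revenue: {median_multiple:.1f}x")
```

Using the median rather than the mean keeps a single very high or very low multiple from dominating the headline figure.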

Sector breakdown

Data infrastructure market segments

Data infrastructure software spans cloud data warehouses and lakehouses, transactional databases, streaming and real-time data, ELT/ETL, data observability and vector databases.

Cloud data warehouses & lakehouses

Cloud-native data warehouses and lakehouse platforms. Snowflake and Databricks lead the independents; BigQuery (Google), Amazon Redshift and Microsoft Fabric anchor the hyperscaler side. Open table formats (Iceberg and Delta) are increasingly cross-compatible.

Transactional databases

OLTP databases used by application teams. MongoDB (NASDAQ: MDB) is the largest standalone vendor; Couchbase and Cassandra serve specific use cases; the PostgreSQL ecosystem (Neon, Supabase, Crunchy Data and Aiven) is growing fast.

Streaming & real-time data

Apache Kafka-based streaming, change data capture and real-time analytics. Confluent (NASDAQ: CFLT) leads commercial Kafka; Redpanda, Materialize and Striim compete; Aiven and Amazon MSK offer managed services.

ELT/ETL & data integration

Software moving data between systems and into warehouses. Fivetran leads commercial ELT; Airbyte leads open-source; Hevo and Stitch (Talend) compete; dbt Labs anchors the transformation layer.

Data observability

Software monitoring data quality, freshness and lineage. Monte Carlo, Bigeye, Acceldata and Validio lead; Anomalo, Soda, Datafold and Sifflet serve adjacent segments.

Data catalogs & governance

Data discovery, governance and metadata management. Collibra and Alation lead enterprise; Atlan and Castor (Coalesce) lead modern; Microsoft Purview and Unity Catalog (Databricks) compete from cloud platforms.

Reverse ETL & operational analytics

Software moving data from warehouses back to operational tools. Hightouch and Census lead the standalone category; Polytomic and Grouparoo compete.

Vector databases & AI infrastructure

Vector databases for embeddings and AI applications. Pinecone, Weaviate, Qdrant and Chroma lead standalone; pgvector (PostgreSQL extension) competes from the database side; major warehouses now support native vector workloads.

Sector KPIs

Key data infrastructure KPIs to track

ARR, consumption credit growth, net revenue retention, gross margin, customer count and net new ARR are the metrics investors and operators track in data infrastructure software.

KPI | Definition
ARR | Recurring revenue. Standard headline for SaaS data platforms; less informative for consumption-priced businesses where usage is the cleaner signal.
Consumption / credit growth | Snowflake credits or Databricks DBUs consumed. The headline activity metric for consumption-based data platforms.
Net revenue retention | Expansion via consumption growth and new workload deployment. Snowflake and Databricks have historically run NRR at 130-160%.
Gross margin | Pure-software data platform SaaS at 70-80%; consumption-priced platforms net of cloud costs sit at 65-75%.
Customer count | Enterprise logo count. Concentration in $1M+ ARR customers is the key revenue-quality lens.
Net new ARR | Period-over-period change in ARR. Cleanest read on underlying business momentum after consumption volatility.
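Two of these KPIs reduce to simple arithmetic, sketched below. Net new ARR follows the period-over-period definition given in the table. The NRR calculation uses a common cohort-based convention (current ARR from the customers held a year ago, divided by their ARR a year ago), which is an assumption here rather than something the table specifies. All figures are hypothetical.

```python
def net_revenue_retention(start_arr, end_arr_same_cohort):
    """NRR under a cohort convention (an assumption, see the text above).

    start_arr: ARR a year ago from a fixed set of customers.
    end_arr_same_cohort: ARR today from that same set, after expansion,
    contraction and churn; customers added since are excluded.
    """
    if start_arr <= 0:
        raise ValueError("starting ARR must be positive")
    return end_arr_same_cohort / start_arr


def net_new_arr(arr_by_period):
    """Period-over-period change in ARR for each consecutive pair of periods."""
    return [curr - prev for prev, curr in zip(arr_by_period, arr_by_period[1:])]


# Hypothetical figures in $M.
nrr = net_revenue_retention(start_arr=100, end_arr_same_cohort=130)
changes = net_new_arr([100, 112, 127, 150])
print(f"NRR: {nrr:.0%}")
print("Net new ARR by period:", changes)
```

An NRR above 100% means the existing customer base grew on its own, before counting any new logos.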
Key players

Main data infrastructure players globally

The most active data infrastructure software companies and category leaders globally.

Company | HQ | Overview
Snowflake (snowflake.com) | Bozeman | Cloud data warehouse and data cloud (NYSE: SNOW). One of the largest cloud-native SaaS franchises globally; consumption-based pricing model defines the modern data cloud category.
Databricks (databricks.com) | San Francisco | Lakehouse and AI data platform. Private; raised at $62B valuation in 2024. Direct competitor to Snowflake; stronger ML and unstructured data positioning.
Confluent (confluent.io) | Mountain View | Apache Kafka-based streaming platform (NASDAQ: CFLT). Commercial Kafka leader; expanding into broader real-time data infrastructure.
MongoDB | New York | Document database (NASDAQ: MDB). Largest standalone non-relational database business; Atlas managed service is the dominant revenue line.
Fivetran (fivetran.com) | Oakland | Managed ELT platform. Private; raised at $5.6B valuation in 2021. Mature commercial ELT business.
dbt Labs (getdbt.com) | Philadelphia | Transformation layer for the modern data stack. Private; raised at $4.2B valuation in 2022. Open-source core with dbt Cloud as enterprise tier.
Monte Carlo | San Francisco | Data observability leader. Private; raised at $1.6B valuation in 2022.
Collibra (collibra.com) | Brussels | Data governance and catalog platform. Private; raised at $5.25B valuation in 2022.
Alation | Redwood City | Data catalog and intelligence platform. Private; backed by ICONIQ, Costanoa, Sapphire and Salesforce Ventures.
Pinecone (pinecone.io) | New York | Vector database for AI applications. Private; raised at $750M valuation in 2023. Leading commercial vector database.

Market trends

Key data infrastructure market trends

Lakehouse architecture, AI workloads driving data infrastructure spend, and vector databases and RAG infrastructure are reshaping data infrastructure software right now.

Lakehouse architecture

The Databricks-led lakehouse architecture competes with Snowflake's warehouse-first model. The architectural debate has narrowed since both vendors embraced open table formats (Iceberg and Delta).

Modern data stack consolidation

dbt, Fivetran, Hightouch and Census remain the modern data stack reference. Consolidation pressure comes from cloud platforms (Snowflake Snowpipe and Databricks Lakeflow) bundling overlapping capabilities.

AI workloads driving data infra spend

RAG infrastructure, vector databases, AI-data pipelines and inference-optimised warehouses are driving structural new spend. Pinecone, Weaviate and warehouse-native vector capabilities are competing for the AI data tier.

Open table formats and Iceberg adoption

Apache Iceberg has emerged as the de facto open table format. Snowflake embraced Iceberg in 2024; Delta Lake remains the default on Databricks. Multi-format support is now standard.

Data observability becoming standard

Monte Carlo, Bigeye and Acceldata are moving from emerging to mainstream as data teams scale. Embedded observability in dbt, Snowflake and Databricks competes with standalone vendors.

Vector databases and RAG infrastructure

Pinecone leads commercial vector databases; Weaviate, Qdrant and Chroma compete; pgvector and warehouse-native vector support reshape the category. The RAG infrastructure stack is still evolving rapidly.

© 2026 Flow Partners (London) Ltd. All rights reserved. Registered as a limited liability company in England and Wales (registered number 12969521).