Data infrastructure
Data infrastructure software is the layer that stores, moves, processes and exposes data across cloud and enterprise environments - data warehouses and lakehouses, transactional and streaming databases, ELT/ETL pipelines, observability, catalogs, governance and the AI/ML infrastructure built on top. The category breaks into cloud data warehouses (Snowflake, BigQuery, Redshift and Databricks), databases (MongoDB, PostgreSQL ecosystem, Couchbase and Cassandra), streaming (Kafka/Confluent), ELT/ETL (dbt, Fivetran, Airbyte and Hevo), data observability, data catalogs and the new vector and AI databases. Snowflake and Databricks define the modern data cloud; the lakehouse versus warehouse architecture debate has narrowed since open table formats (Iceberg and Delta) became broadly supported.
It spans cloud data warehouses and lakehouses, transactional databases, streaming and real-time data, ELT/ETL and data integration, data observability, data catalogs and governance, reverse ETL and operational analytics, and vector databases for AI.
Revenue is dominated by consumption-based pricing on data warehouses and lakehouses (Snowflake credits, Databricks DBUs), per-row or per-event pricing on ELT/ETL pipelines, per-seat or per-event pricing on observability and data quality, capacity-based pricing on streaming, and increasingly token-based pricing on vector and AI databases.
Data infrastructure is part of Software.
Global market size: $120B
Public companies: 28
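How do data infrastructure companies monetize?
Data infrastructure software companies monetize through consumption-based pricing on warehouses and lakehouses, per-row and per-event pricing on ELT pipelines, per-seat subscriptions for data tools, capacity-based pricing on streaming, open-source-core enterprise tiers, and token- or query-based pricing on vector databases.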
Consumption-based pricing
Pay-per-compute pricing on data warehouses and lakehouses. Snowflake credits and Databricks DBUs are the reference models. Because billings follow usage rather than a fixed contract value, this model is the largest source of revenue volatility in the category.
Per-row / per-event ELT/ETL
Pricing on rows processed, events synced or active records. Fivetran, Hevo and Airbyte use this pricing; it tends to produce customer pushback at scale, as bills grow with data volume.
Per-seat data tools
Per-user subscriptions for data observability, catalogs, dbt and operational analytics. Monte Carlo, dbt Labs and Hightouch use this pricing.
Capacity-based streaming
Capacity, throughput or partition-based pricing on streaming and Kafka. Confluent, Aiven Kafka and Amazon MSK use variants.
Open-source-core enterprise
Free open-source core with paid enterprise tiers. dbt Labs, Confluent, Elastic, MongoDB, Grafana and Airbyte all use variants of this model.
Token / vector consumption
Token-based or query-based pricing on vector databases and AI infrastructure. Pinecone, Weaviate, Qdrant and Chroma use variants.
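The arithmetic behind the first three pricing models can be sketched roughly as follows. This is a minimal illustration, not any vendor's actual billing logic: the function names and every rate used here are hypothetical assumptions chosen only to show how each bill scales.

```python
# Hypothetical sketch of three pricing models; all rates are illustrative assumptions.

def consumption_bill(units_consumed: float, price_per_unit: float) -> float:
    """Consumption pricing: pay for compute units (e.g. credits or DBUs) actually used."""
    return units_consumed * price_per_unit


def per_row_bill(rows_synced: int, price_per_million_rows: float) -> float:
    """Per-row ELT pricing: the bill scales with the number of rows moved."""
    return rows_synced / 1_000_000 * price_per_million_rows


def per_seat_bill(seats: int, price_per_seat: float) -> float:
    """Per-seat pricing: the bill scales with licensed users, not data volume."""
    return seats * price_per_seat


# Usage swings month to month, so a consumption bill swings with it.
monthly_units = [1_000, 1_400, 900]  # hypothetical compute units per month
monthly_bills = [consumption_bill(u, 3.0) for u in monthly_units]
```

Under these assumed numbers the consumption bill moves from 3,000 to 4,200 to 2,700 across three months, while a per-seat bill would stay flat unless headcount changed. That difference is the volatility the consumption-based model introduces.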
Data infrastructure valuations in May 2026
Public data infrastructure comps trade at 8.9x EV/Revenue. Median revenue multiple across data infrastructure M&A deals was 6.9x in the last 12 months. Median revenue multiple across data infrastructure VC rounds was 28x in the last 12 months.
8.9x: Median EV/Revenue as of May 2026 for public data infrastructure companies
12x: Oracle is the highest valued public data infrastructure company based on EV/Revenue (excluding outliers)
6.9x: Median EV/Revenue across data infrastructure M&A deals in the last 12 months
28x: Median EV/Revenue across data infrastructure VC rounds in the last 12 months
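For readers who want to reproduce a median multiple on their own comp set, a minimal sketch follows. The company names and figures are placeholders, not the data behind the numbers above, and the helper name `ev_to_revenue` is an assumption made for illustration.

```python
from statistics import median


def ev_to_revenue(enterprise_value: float, revenue: float) -> float:
    """EV/Revenue multiple: enterprise value divided by revenue for the same period."""
    if revenue <= 0:
        raise ValueError("revenue must be positive to compute a multiple")
    return enterprise_value / revenue


# Placeholder comp set: (enterprise value, trailing revenue), both in $M.
comps = {
    "CompanyA": (9_000, 1_000),
    "CompanyB": (4_000, 800),
    "CompanyC": (12_000, 1_000),
}

multiples = {name: ev_to_revenue(ev, rev) for name, (ev, rev) in comps.items()}
median_multiple = median(multiples.values())
```

The median is used rather than the mean so that a single outlier with a very high or very low multiple does not distort the headline figure.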
Data infrastructure market segments
Data infrastructure software spans cloud data warehouses and lakehouses, transactional databases, streaming and real-time data, ELT/ETL, data observability and vector databases.
Cloud data warehouses & lakehouses
Cloud-native data warehouses and lakehouse platforms. Snowflake and Databricks lead the independents; BigQuery (Google), Amazon Redshift and Microsoft Fabric anchor the hyperscaler side. Open table formats (Iceberg and Delta) are increasingly cross-compatible.
Transactional databases
OLTP databases used by application teams. MongoDB is the largest standalone (NASDAQ: MDB); Couchbase and Cassandra serve specific use cases; the PostgreSQL ecosystem (Neon, Supabase, Crunchy Data and Aiven) is growing fast.
Streaming & real-time data
Apache Kafka-based streaming, change data capture and real-time analytics. Confluent (NASDAQ: CFLT) leads commercial Kafka; Redpanda, Materialize and Striim compete; Aiven and Amazon MSK offer managed services.
ELT/ETL & data integration
Software moving data between systems and into warehouses. Fivetran leads commercial ELT; Airbyte leads open-source; Hevo and Stitch (Talend) compete; dbt Labs anchors the transformation layer.
Data observability
Software monitoring data quality, freshness and lineage. Monte Carlo, Bigeye, Acceldata and Validio lead; Anomalo, Soda, Datafold and Sifflet serve adjacent segments.
Data catalogs & governance
Data discovery, governance and metadata management. Collibra and Alation lead enterprise; Atlan and Castor (Coalesce) lead modern; Microsoft Purview and Unity Catalog (Databricks) compete from cloud platforms.
Reverse ETL & operational analytics
Software moving data from warehouses back to operational tools. Hightouch and Census lead the standalone category; Polytomic and Grouparoo compete.
Vector databases & AI infrastructure
Vector databases for embeddings and AI applications. Pinecone, Weaviate, Qdrant and Chroma lead standalone; pgvector (PostgreSQL extension) competes from the database side; major warehouses now support native vector workloads.
Key data infrastructure KPIs to track
ARR, consumption credit growth, net revenue retention, gross margin, customer count and net new ARR are the metrics investors and operators track in data infrastructure software.
| KPI | Definition |
|---|---|
| ARR | Recurring revenue. Standard headline for SaaS data platforms; less informative for consumption-priced businesses where usage is the cleaner signal. |
| Consumption / credit growth | Snowflake credits or Databricks DBUs consumed. The headline activity metric for consumption-based data platforms. |
| Net revenue retention | Expansion via consumption growth and new workload deployment. Snowflake and Databricks have historically run NRR at 130-160%. |
| Gross margin | Pure-software data platform SaaS at 70-80%; consumption-priced platforms net of cloud costs sit at 65-75%. |
| Customer count | Enterprise logo count. Concentration in $1M+ ARR customers is the key revenue-quality lens. |
| Net new ARR | Period-over-period change in ARR. Cleanest read on underlying business momentum after consumption volatility. |
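Two of the KPIs above reduce to simple arithmetic. The sketch below uses a common formulation of net revenue retention (starting ARR from an existing cohort, plus expansion, minus contraction and churn, divided by starting ARR). Exact definitions vary by company, so treat the function names and inputs as assumptions.

```python
def net_revenue_retention(
    starting_arr: float, expansion: float, contraction: float, churn: float
) -> float:
    """NRR for a fixed customer cohort; 1.30 means 130%. New-logo ARR is excluded."""
    if starting_arr <= 0:
        raise ValueError("starting_arr must be positive")
    return (starting_arr + expansion - contraction - churn) / starting_arr


def net_new_arr(ending_arr: float, starting_arr: float) -> float:
    """Period-over-period change in total ARR, including new logos."""
    return ending_arr - starting_arr


# Hypothetical cohort, in $M: 100 of starting ARR, 40 of expansion,
# 5 of contraction and 5 of churn.
nrr = net_revenue_retention(100.0, 40.0, 5.0, 5.0)
delta = net_new_arr(ending_arr=150.0, starting_arr=100.0)
```

With these assumed inputs NRR comes out at 130%, the low end of the range cited in the table, and net new ARR is $50M, of which $30M comes from the existing cohort and the rest from new customers.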
Main data infrastructure players globally
The most active data infrastructure software companies and category leaders globally.
| Company | HQ | Overview |
|---|---|---|
| Snowflake (snowflake.com) | Bozeman | Cloud data warehouse and data cloud (NYSE: SNOW). One of the largest cloud-native SaaS franchises globally; consumption-based pricing model defines the modern data cloud category. |
| Databricks (databricks.com) | San Francisco | Lakehouse and AI data platform. Private; raised at $62B valuation in 2024. Direct competitor to Snowflake; stronger ML and unstructured data positioning. |
| Confluent (confluent.io) | Mountain View | Apache Kafka-based streaming platform (NASDAQ: CFLT). Commercial Kafka leader; expanding into broader real-time data infrastructure. |
| MongoDB (mongodb.com) | New York | Document database (NASDAQ: MDB). Largest standalone non-relational database business; Atlas managed service is the dominant revenue line. |
| Fivetran (fivetran.com) | Oakland | Managed ELT platform. Private; raised at $5.6B valuation in 2021. Mature commercial ELT business. |
| dbt Labs (getdbt.com) | Philadelphia | Transformation layer for the modern data stack. Private; raised at $4.2B valuation in 2022. Open-source core with dbt Cloud as enterprise tier. |
| Monte Carlo (montecarlodata.com) | San Francisco | Data observability leader. Private; raised at $1.6B valuation in 2022. |
| Collibra (collibra.com) | Brussels | Data governance and catalog platform. Private; raised at $5.25B valuation in 2022. |
| Alation (alation.com) | Redwood City | Data catalog and intelligence platform. Private; backed by ICONIQ, Costanoa, Sapphire and Salesforce Ventures. |
| Pinecone (pinecone.io) | New York | Vector database for AI applications. Private; raised at $750M valuation in 2023. Leading commercial vector database. |
Key data infrastructure market trends
Three shifts are reshaping data infrastructure software right now: lakehouse architecture, AI workloads driving data infrastructure spend, and vector databases with RAG infrastructure.
Lakehouse architecture
Databricks-led lakehouse architecture competes with Snowflake's warehouse-first model. The architectural debate has narrowed since both vendors embraced open table formats (Iceberg and Delta).
Modern data stack consolidation
dbt, Fivetran, Hightouch and Census remain the modern data stack reference. Consolidation pressure comes from cloud platforms bundling overlapping capabilities (Snowflake Snowpipe and Databricks Lakeflow).
AI workloads driving data infra spend
RAG infrastructure, vector databases, AI-data pipelines and inference-optimised warehouses are driving structurally new spend. Pinecone, Weaviate and warehouse-native vector capabilities are competing for the AI data tier.
Open table formats and Iceberg adoption
Apache Iceberg has emerged as the de-facto open table format. Snowflake embraced Iceberg in 2024; Databricks Delta Lake continues to be the default for Databricks. Multi-format support is now standard.
Data observability becoming standard
Monte Carlo, Bigeye and Acceldata are moving from emerging to mainstream as data teams scale. Embedded observability in dbt, Snowflake and Databricks competes with standalone vendors.
Vector databases and RAG infrastructure
Pinecone leads commercial vector databases; Weaviate, Qdrant and Chroma compete; pgvector and warehouse-native vector support reshape the category. The RAG infrastructure stack is still evolving rapidly.