Data Engineering Startups funded by Y Combinator (YC) in the San Francisco Bay Area 2026

May 2026

Browse 38 of the top Data Engineering startups funded by Y Combinator. Headquartered in the San Francisco Bay Area, these are some of the hottest and fastest-growing startups. This doesn't include all companies originally founded in the San Francisco Bay Area or by founders from there.

We also have a Startup Directory where you can search through over 5,000 companies.

  • Fivetran
    Y Combinator W2013
    Active • 1,200 employees • Oakland, CA, USA
    Fivetran automates data movement out of, into and across cloud data platforms. We automate the most time-consuming parts of the ELT process from extracts to schema drift handling to transformations, so data engineers can focus on higher-impact projects with total pipeline peace of mind. With 99.9% uptime and self-healing pipelines, Fivetran enables hundreds of leading brands across the globe, including Autodesk, Conagra Brands, JetBlue, Lionsgate, Morgan Stanley, and Ziff Davis, to accelerate data-driven decisions and drive business growth. Fivetran is headquartered in Oakland, California, with offices around the world. 
    data-engineering
    saas
    analytics
    b2b
  • Ardent
    Y Combinator P2026
    Active • 2 employees • San Francisco, CA, USA
    Clone any Postgres DB, regardless of size, in 6s so agents can test their work.
    infrastructure
    data-engineering
    ai
  • Haladir
    Y Combinator W2026
    Active • 4 employees • San Francisco
    Haladir is the operational AI layer for logistics. We unify and structure data across WMS, TMS, OMS, etc., and embed solver-grade optimization, ML models, RL environments for frontier AI labs, and forecasting into the decisions that power supply chains. Today's AI brought intelligence. The next frontier is judgement. We define operational superintelligence as AI that consistently makes maximally-optimal operational decisions in complex environments. The first component is speed: continuous decisions that take operations managers and OR analysts weeks to make, in seconds. The second is reliability: every decision is guaranteed to satisfy operational constraints. The third is scope: optimize across thousands of constraints no human or team could possibly reason about.
    reinforcement-learning
    data-engineering
    logistics
    operations
  • Velum Labs
    Y Combinator W2026
    Active • 2 employees • San Francisco, CA, USA
    Velum is the operating system for data quality. Velum automatically monitors and enforces data quality across a company's data stack, so bad data never reaches dashboards. We turn data quality from a manual task into infrastructure that runs itself. Data trust you can prove. From the pipeline to the boardroom.
    machine-learning
    data-engineering
  • Captain
    Y Combinator W2026
    Active • 2 employees • San Francisco
    Captain delivers the most accurate file search engine built for AI agents. We’ll index data from the sources folks already use like S3, SharePoint, and Google Drive, and easily scale multimodal, petabyte-level content search. We’re the Snowflake for Unstructured Data. Captain tops the Open-RAG-Benchmark with over 20% higher accuracy than standard RAG pipelines. We achieve this through robust data processing techniques like embedding normalization across modalities, ensuring that representations cluster by semantic content rather than data type.
    data-engineering
    infrastructure
    b2b
    api
  • Hyperspell
    Y Combinator F2025
    Active • 6 employees • San Francisco, CA, USA
    Hyperspell is the Memory & Context Layer for AI Agents. AI agents are clueless geniuses. They crush humans on any standardized test, but wouldn't last a day at a real job. What today's super-intelligent agents are missing is the real-world context they are operating in. A context that humans stitch together from hundreds of data points across dozens of interactions and channels. A context that grows with their tasks. Hyperspell gives AI agents this context by connecting to their user’s workspace data and building a personalized memory and context layer.
    artificial-intelligence
    machine-learning
    saas
    data-engineering
  • DeepAware AI
    Y Combinator S2025
    Active • 2 employees • San Francisco, CA, USA
    DeepAware (Silicon Valley Robotics Center, roboticscenter.ai) is the fastest way for enterprises and researchers to get robots and robotics parts in the US, with 72-hour delivery or Bay Area pickup. Beyond hardware, we help teams collect teleoperation data, build reinforcement learning environments, and deploy robots into production. Customers include AI labs, industrial operations, research teams, and event producers.
    supply-chain
    data-engineering
    artificial-intelligence
    robotics
    machine-learning
  • Vision Lab
    Y Combinator P2025
    Active • 8 employees • San Francisco, CA, USA
    We capture and structure real factory workflows at scale by combining first-person industrial video with SOP-level process knowledge. This enables robotics and AI labs to train on real production data, not just controlled lab environments.
    manufacturing
    artificial-intelligence
    robotics
    data-engineering
  • Nitrode
    Y Combinator W2025
    Active • 4 employees • San Francisco, CA, USA
    Nitrode builds high-quality game data to train and evaluate LLMs and agents on spatial and temporal reasoning. Today’s models are trained on static text and images, but struggle to understand how the world evolves over time. We create small, fully specified game environments that generate ground-truth data on state, transitions, and hidden dynamics. This enables AI systems to move beyond simple pattern matching, incorporating memory, causality, and multi-step reasoning to create more reliable agents in dynamic environments.
    data-engineering
    machine-learning
    b2b
    artificial-intelligence
  • Sensei
    Y Combinator S2024
    Active • 2 employees • San Francisco, CA, USA
    Sensei helps robotics companies scale and outsource their training data collection. Our hardware platform enables the collection of human-demonstration data at a tenth of the cost and twice the speed of current teleop approaches. Our software platform acts like Scale AI for robotics data: a large network of paid human operators use our low-cost collection platform to fulfill data-generation requests.
    robotics
    hard-tech
    artificial-intelligence
    marketplace
    data-engineering
  • Sharpe
    Y Combinator S2024
    Active • 3 employees • San Francisco, CA, USA
    Sharpe helps traders go from idea to profit in minutes with AI, bundling petabytes of market data with high-performance infrastructure. Sharpe helped a top 5 quant firm find $10m+ of profit within a few hours of deployment.
    artificial-intelligence
    finance
    analytics
    data-engineering
    big-data
  • Mica AI
    Y Combinator S2024
    Active • 3 employees • San Francisco, CA, USA
    Mica's AI agents replace the data ops teams fixing bad data. When bad or missing data breaks the pipeline, and orchestration, retries, and monitoring fail, painful manual review work kicks in, pulling humans in to investigate and patch data issues across systems. Mica does what those humans do: gathering the right information from internal docs and external systems, reasoning across context, and resolving errors autonomously to get the pipeline moving again. The result: dramatically reduced time, cost, and operational drag as your data pipelines scale, without scaling ops headcount. Mica turns judgment-heavy data fixes from a manual bottleneck into an automated background process with full auditability.
    data-engineering
    data-science
    enterprise-software
  • Trellis AI
    Y Combinator W2024
    Active • 25 employees • San Francisco
    Trellis helps healthcare providers treat more patients, faster, while eliminating pre-service paperwork. We automate document intake, prior authorizations, and appeals at scale to streamline operations and accelerate care. Our AI agent is trained on millions of clinical data points and converts messy, unstructured documents into clean, structured data directly in your EHR. With Trellis, leading healthcare providers and pharmaceutical companies were able to: 1. Reduce time to treatment by over 90%. 2. Improve prior authorization approval and reimbursement rates. 3. Leverage structured data to enhance drug program performance and clinical decision-making. Administrative costs account for over 20% of U.S. healthcare spending, delaying care, draining revenue, and driving staff burnout, while leaving less visibility into patient care than ever before. We built Trellis to tackle this head on.
    b2b
    data-engineering
    databases
    infrastructure
    ai
  • kater.ai
    Y Combinator W2024
    Active • 3 employees • San Francisco, CA, USA
    1. You explain your problem. 2. Kater identifies the most important data questions to ask. 3. Kater writes the code. 4. You get insights in seconds rather than weeks. Kater.ai flips the script on enterprise analytics by making every user an expert analyst. It uses a continuous classification engine to turn a single business question into a contextualized package of questions that is specific to your needs. Kater puts the power of data into the hands of business experts while ensuring they use trusted data that is specific to their persona. No more waiting for data analysts. No more wasted time on analysis misfires and rework. Yvonne was a data engineer and analyst who built the entire data stack at CREXi. Robin led engineering at Microsoft. Data is the new oil. Companies are data-rich, insight-poor. We're helping companies become insight-rich. This is the future of data.
    data-engineering
    analytics
    artificial-intelligence
  • Ocular AI
    Y Combinator W2024
    Active • 6 employees • San Francisco, CA, USA
    Ocular AI is the data annotation engine for Generative AI, Computer Vision, and Enterprise AI models. We help you transform unstructured, multi-modal data into golden datasets to power generative AI, frontier models, and computer vision. Ocular Foundry is the most intuitive, data-centric, and fastest platform that lets you label, annotate, version, and deploy your data for training models. It also orchestrates your annotation jobs, improving collaboration with members and annotators. With Ocular Bolt, shift from humans in the loop to experts in the loop to supercharge your data labeling and annotation projects. Our global expert workforce ensures fast, accurate results—no matter the scale or complexity of your data. Companies spend huge amounts on training data, but Foundry and Bolt are AI-native tools that lower costs, reduce manual effort, and accelerate high-quality data collection. We’re replacing outdated, clunky, and expensive data software!
    artificial-intelligence
    data-engineering
    machine-learning
    computer-vision
    developer-tools
  • Reducto
    Y Combinator W2024
    Active • 30 employees • San Francisco, CA, USA
    Reducto offers robust and reliable document ingestion for any workflow. Our API allows you to convert complex, unstructured documents into structured outputs that are perfect for RAG, process automation, and more.
    artificial-intelligence
    documents
    data-engineering
    search
    enterprise-software
  • Cedalio
    Y Combinator S2023
    Active • 6 employees • San Francisco, CA, USA
    Track and control emissions, energy, water, gas, and more, across all sites and countries, from a single place. Cedalio automates utility bill processing, detects anomalies, and centralizes your data into one source of truth. Save hours of manual work, make faster decisions, and enhance your sustainability efforts.
    data-engineering
    artificial-intelligence
    energy
  • Artie
    Y Combinator S2023
    Active • 16 employees • San Francisco, CA, USA
    Artie is software that streams data from databases to data warehouses in real time. Today, most companies run their ETL process every few hours or overnight, so their data warehouse is always out of date; with Artie, the warehouse always has live production data.
    data-engineering
    developer-tools
    enterprise-software
    saas
  • Ohm
    Y Combinator W2023
    Active • 6 employees • San Francisco, CA, USA
    Ohm is transforming how physical products are developed, using AI. We support major tech enterprises across batteries, automotive and consumer electronics to reimagine how they develop, test and validate their products. Ohm is used by leading engineering teams to launch new hardware products faster, optimize engineer productivity and reduce the cost of engineering test programs.
    data-engineering
    ai
    artificial-intelligence
  • TableFlow
    Y Combinator W2023
    Active • 2 employees • San Francisco, CA, USA
    TableFlow builds AI teammates for data tasks, helping operations and data teams automate the messy, manual tasks buried in PDFs, spreadsheets, images, and emails.
    ai
    automation
    documents
    data-engineering
    saas
  • Sunpia
    Y Combinator S2022
    Active • 3 employees • San Jose, CA, USA
    Sunpia lets developers easily experience the cost and speed benefits of serverless infrastructure, without having to rewrite their code. Developers annotate their code and Sunpia automatically designs a microservice version of it they can deploy on their own cloud.
    data-engineering
    developer-tools
    kubernetes
  • LanceDB
    Y Combinator W2022
    Active • 35 employees • San Francisco, CA, USA
    LanceDB is a new open-source vector database that can support low-latency billion-scale vector search on a single node. Built around a new columnar data format, LanceDB makes it incredibly easy to build applications for generative AI, recsys, search engines, content moderation, and more.
    aiops
    data-engineering
    machine-learning
    open-source
  • Dynamo AI
    Y Combinator W2022
    Active • 40 employees • San Francisco, CA, USA
    End-to-end privacy, security, and compliance solutions to prepare your organization for emerging AI regulations.
    privacy
    data-engineering
    machine-learning
  • Sieve
    Y Combinator W2022
    Active • 18 employees • San Francisco, CA, USA
    Sieve is the only AI research lab exclusively focused on video data. Video already makes up 80% of internet traffic and has become the dominant medium driving creativity, communication, gaming, AR/VR, and robotics. Unlocking the ability to truly model video is the key to breakthroughs across all of these domains but progress has been bottlenecked by one thing: high-quality training data. That’s where Sieve comes in. We bring together exabyte-scale video infrastructure, novel video understanding techniques, and dozens of diverse data sources to create datasets that push the frontier of video modeling. This unique combination allows us to deliver data with unmatched precision, quality, and speed which has earned the trust of frontier AI labs, Fortune 100 companies, and fast-growing generative AI startups.
    video
    developer-tools
    ai
    data-engineering
    data-labeling
  • Versable
    Y Combinator W2022
    Active • 3 employees • San Francisco
    Auto parts retailers get product data from hundreds of manufacturers that is inaccurate and inconsistent, often with big gaps in key values. They currently have a team of "catalog managers" who are required to process and enhance this data line by line, resulting in a weeks- to months-long lag between receiving product data and actually being able to start generating revenue from those products. Versable leverages AI to scan the web for tens of millions of auto parts listings, and uses a fine-tuned LLM with RAG to instantly process, enhance, and transform data. With just a part number, Versable is able to generate market-ready titles, product descriptions, and specs, in any format that's needed.
    automotive
    manufacturing
    data-engineering
    ai
  • Waydev AI
    Y Combinator W2021
    Active • 13 employees • Menlo Park, CA, USA
    Waydev tracks AI-generated code from commit to production. AI Checkpoints capture which agent wrote the code, tokens consumed, cost per PR, acceptance rate, and whether it deployed. Compare Copilot, Cursor, and Claude Code on real production outcomes. Measure AI adoption by team, monthly spend, avg tokens per developer, and total tokens used. The Waydev Agent answers plain-English questions about your AI investment. Built for VPs of Engineering and CTOs. Trusted by Dropbox, American Express, Caterpillar, and PwC.
    data-engineering
    ai
    ai-assistant
    artificial-intelligence
    ai-enhanced-learning
  • Polytomic
    Y Combinator W2020
    Active • 7 employees • San Francisco, CA, USA
    Polytomic is a no-code web app to sync data between your internal databases, business systems (e.g. Stripe, Salesforce), data warehouses, spreadsheets, and even HTTP APIs.
    saas
    b2b
    data-engineering
  • Mozart Data
    Y Combinator S2020
    Active • 24 employees • San Francisco, CA, USA
    Mozart Data provides an out-of-the-box modern data stack that empowers anyone to easily consolidate, organize, and prepare their data for analysis. Spin up a data stack that’s built on a best-in-class data warehouse and ETL tool in hours, without any engineering. You can finally spend more time on generating insights and less time wrangling your data.
    saas
    b2b
    data-engineering
  • Supabase
    Y Combinator S2020
    Active • 120 employees • San Francisco, CA, USA
    Supabase is the easiest way to get started with Postgres. Each project within Supabase is an isolated Postgres cluster, allowing customers to scale independently, while still providing the features that you need to build: instant database setup, auth, row-level security, realtime data streams, auto-generating APIs, and a simple-to-use web interface. We are 100% remote.
    open-source
    databases
    data-engineering
    big-data
    developer-tools
  • Airbyte
    Y Combinator W2020
    Active • 90 employees • San Francisco, CA, USA
    Airbyte Agents is the context layer your AI agents query before they act. Features the Context Store, which unifies entities across Salesforce, Stripe, Zendesk and more. Use it from the UI, your LLM via MCP, or our SDK. 40% fewer tool calls, up to 80% fewer tokens. Airbyte Data Replication is the leading open data movement platform that empowers data teams in the AI era by transforming raw data into actionable intelligence. https://github.com/airbytehq/
    developer-tools
    open-source
    data-engineering
    ai
    artificial-intelligence
  • Logarithm Labs
    Y Combinator W2020
    Active • 2 employees • Foster City, CA, USA
    Easy button to use data for your daily operations. Power your business workflows with quality data. Logarithm Labs helps you turn manual data wrangling and ad-hoc scripts into repeatable pipelines for your operational workflows. Our product and team of experts do the heavy lifting so that you can focus on the business logic that drives your organization. To learn more, contact us at hello@logarithmlabs.com.
    data-engineering
    developer-tools
  • TRM Labs
    Y Combinator S2019
    Active • 250 employees • San Francisco, CA, USA
    TRM is on a mission to build a safer financial system for billions of people. We deliver a blockchain intelligence data platform to financial institutions, crypto companies, and governments to fight cryptocurrency fraud and financial crime. We consider our business — and our profit — as a way to move towards our mission sustainably and at scale. Join our mission ➔ www.trmlabs.com/careers
    crypto-web3
    machine-learning
    data-engineering
    fintech
  • Mezmo
    Y Combinator W2015
    Active • 172 employees • San Jose, CA, USA
    Mezmo, formerly LogDNA, is an observability platform to manage and take action on your data. It ingests, processes, and routes log data to fuel enterprise-level application development and delivery, security, and compliance use cases. Mezmo was brought to life by three-time co-founders Chris Nguyen and Lee Liu and included in the Winter 2015 batch of Y Combinator. In 2018 the company partnered with tech giant IBM to become the sole logging provider for IBM Cloud. Mezmo is on a mission to empower people who build solutions that shape the world. We’re doing this by delivering a platform that enables enterprises to get more value from their observability data in real time, regardless of source, destination, use case, or scale. We’re not the only ones working on this problem but we have a few things the others don’t. We’re cloud-native and know how to make the most of modern technology like Kubernetes. We have scaled a solution from zero to petabyte scale in a short amount of time, while supporting thousands of active users across multiple environments. We are hungry for change and are surrounded by enterprises telling us they’re hungry, too. We have a kick-ass group of people who are thinking about the problem analytically and are excited to change the observability world for the better. Mezmo has helped some of the world’s most innovative companies transform how they manage their systems and applications. Still, we know that we can help them get more value from their observability data by providing more flexibility and control over how they use it. This will enable teams to spend less time switching between data silos so they can focus on shipping better, more resilient, and secure products. We have momentum on our side. Last year we saw triple-digit revenue growth and added 800 new customers to our roster. Recent accolades include being named to YC’s Top Companies, CRN’s 10 Hottest DevOps Startups, and EMA’s Top 3 Observability Platforms.
    data-engineering
    devsecops
    kubernetes
    saas
    developer-tools
  • Etleap
    Y Combinator W2013
    Active • 11 employees • San Francisco, CA, USA
    Etleap is an ETL solution for creating perfect data pipelines from day one. Unlike other enterprise solutions, Etleap doesn’t require extensive engineering work to set up, maintain, and scale. It automates most ETL setup and maintenance work, and simplifies the rest into 10-minute tasks that analysts can own.
    data-engineering
  • Satsuma
    Y Combinator S2021
    Acquired • 5 employees • San Francisco, CA, USA
    Satsuma is a developer tool for building applications on top of real-time blockchain data. Our product lets developers take decoded data from multiple chains, customize it for their use cases, and access it through API endpoints. Blockchains serve as distributed databases for these products, holding their most important data. However, it’s difficult to access and query that data. We believe this friction is an enormous blocker for web3 developers and that better tooling will enable mass adoption for web3. We’re a founding team of engineers, having built data infrastructure and product as early employees at Airtable, Heap, and Y Combinator.
    crypto-web3
    data-engineering
    developer-tools
    saas
  • HomeRoom
    Y Combinator W2022
    Acquired • 25 employees • San Jose, CA, USA
    Homeroom helps investors provide affordable housing while making a 22% ROI. We do this by sourcing properties, arranging capital, managing construction, vetting tenants and collecting rent by the room. To date, Homeroom has brought on 85 property investors and, growing 6X annually, is bringing in 420K in annualized net revenue. How it works: We help investors buy homes in cities that are attractive to young people, but lack affordable housing options. We then renovate and after about 20 days, the home is ready and we find qualified renters by the room. We launched in 2018 in Kansas City with 1 home. We now have 105 homes in 31 cities. In 2021, we grew rental GMV to $1.8M (300% YoY growth). Our average rent across every property is $458, which is about 50% lower than market comps, and our investors see returns up to 50% higher. We are HomeRoom. Johnny is the financial analyst/domain expert. Thomas is a serial entrepreneur with a PhD in ML, and Mike hacked growth for Airbnb and Facebook.
    real-estate
    proptech
    machine-learning
    nlp
    data-engineering
  • Hydra
    Y Combinator W2022
    Acquired • 6 employees • San Francisco, CA, USA
    Hydra is a real-time analytics database management system for Postgres. We separate compute from storage to offer software engineers serverless analytics with autoscale, write isolation, automatic caching, and more. Shipping scalable projects on time series and event data has never been easier. Hydra is available for local development, cloud, and bare metal deployment.
    open-source
    data-engineering
    developer-tools
    analytics