Data Engineering Startups funded by Y Combinator (YC) in the San Francisco Bay Area 2024

May 2024

Browse 26 of the top Data Engineering startups funded by Y Combinator. Headquartered in the San Francisco Bay Area, these are some of the hottest and fastest-growing startups.

We also have a Startup Directory where you can search through over 5,000 companies.

  • TRM Labs
    TRM Labs (s2019) • Active • 180 employees • San Francisco, CA, USA
    At TRM, we're on a mission to build trust in digital assets, because the promise of crypto is too valuable to be impeded by bad actors. We provide a blockchain intelligence platform to law enforcement, financial institutions, and crypto firms to assist in the detection and prevention of cryptocurrency fraud and financial crime. Our vision is to build a company that can sustainably deliver on our mission for decades to come, enabling consumers to transact safely and securely on the blockchain. Join our mission ➔ www.trmlabs.com/careers
    fintech
    machine-learning
    crypto-web3
    data-engineering
  • Airbyte
    Airbyte (w2020) • Active • 110 employees • San Francisco, CA, USA
    Airbyte is the leading open-source ELT platform that replicates data from applications, APIs & databases to data warehouses, data lakes, and other destinations. https://github.com/airbytehq/airbyte
    developer-tools
    open-source
    data-engineering
  • Mozart Data
    Mozart Data (s2020) • Active • 24 employees • San Francisco, CA, USA
    Mozart Data provides an out-of-the-box modern data stack that empowers anyone to easily consolidate, organize, and prepare their data for analysis. Spin up a data stack that’s built on a best-in-class data warehouse and ETL tool in hours, without any engineering. You can finally spend more time on generating insights and less time wrangling your data.
    saas
    b2b
    data-engineering
  • Jitsu
    Jitsu (s2020) • Active • 4 employees • San Francisco, CA, USA
    Jitsu is the fastest, most durable way to collect event data from every source - web, app, email, chatbot, CRM - into your data warehouse. 100% open-source. Purpose-built, secure, and ready in minutes.
    saas
    b2b
    open-source
    data-engineering
  • Imbue (formerly Generally Intelligent)
    Imbue (formerly Generally Intelligent) (s2017) • Active • 15 employees • San Francisco, CA, USA
    Imbue builds AI systems that reason and code, enabling AI agents to accomplish larger goals and safely work in the real world. We train our own foundation models optimized for reasoning and prototype agents on top of these models. By using these agents extensively, we gain insights into improving both the capabilities of the underlying models and the interaction design for agents. We aim to rekindle the dream of the *personal* computer, where computers become truly intelligent tools that empower us, giving us freedom, dignity, and agency to pursue the things we love.
    machine-learning
    data-engineering
    ai
  • HomeRoom
    HomeRoom (w2022) • Active • 25 employees • San Jose, CA, USA
    HomeRoom helps investors provide affordable housing while making a 22% ROI. We do this by sourcing properties, arranging capital, managing construction, vetting tenants, and collecting rent by the room. To date, HomeRoom has brought on 85 property investors, is growing 6X annually, and is bringing in 420K in annualized net revenue. How it works: We help investors buy homes in cities that are attractive to young people but lack affordable housing options. We then renovate, and after about 20 days the home is ready and we find qualified renters by the room. We launched in 2018 in Kansas City with 1 home. We now have 105 homes in 31 cities. In 2021, we grew rental GMV to $1.8M (300% YoY growth). Our average rent across every property is $458, which is about 50% lower than market comps, and our investors see returns up to 50% higher. We are HomeRoom. Johnny is the financial analyst/domain expert. Thomas is a serial entrepreneur with a PhD in ML, and Mike hacked growth for Airbnb and Facebook.
    machine-learning
    real-estate
    proptech
    nlp
    data-engineering
  • Polytomic
    Polytomic (w2020) • Active • 7 employees • San Francisco, CA, USA
    Polytomic is a no-code web app to sync data between your internal databases, business systems (e.g. Stripe, Salesforce), data warehouses, spreadsheets, and even HTTP APIs.
    saas
    b2b
    data-engineering
  • Outerbase
    Outerbase (w2023) • Active • 4 employees • Pittsburgh, PA, USA
    Outerbase is the interface for your database. Companies use Outerbase to view, edit, and modify their data and even generate beautiful visual dashboards without having to write a single line of SQL.
    developer-tools
    generative-ai
    analytics
    data-engineering
    ai
  • Tarsal
    Tarsal (s2021) • Active • 10 employees • New York, NY, USA
    Tarsal is a data pipeline custom-built for security teams. As security data grows 25% year over year, security teams desperately need access to best-in-class data infrastructure. Tarsal bridges the gap between the modern data stack and security teams, pioneering the modern security data stack.
    b2b
    cybersecurity
    big-data
    data-engineering
  • Waydev
    Waydev (w2021) • Active • 15 employees • San Francisco, CA, USA
    Leverage insights from your engineering stack to accelerate velocity, align engineering work to business priorities, and increase visibility into your team’s DORA Metrics and SPACE Framework Metrics.
    b2b
    analytics
    enterprise
    data-engineering
    ai-assistant
  • Pipekit
    Pipekit (s2021) • Active • 7 employees • San Francisco, CA, USA
    Our app manages Argo Workflows for data teams, enabling complex data & CI pipelines in half the time while saving companies hundreds of thousands of dollars annually. Argo Workflows is an open-source pipeline framework for Kubernetes that’s used in production by Bloomberg, Intuit, Adobe, New Relic, NVIDIA, and many other open-source early adopters.
    developer-tools
    open-source
    data-engineering
    devops
  • Dynamo AI
    Dynamo AI (w2022) • Active • 40 employees • San Francisco, CA, USA
    End-to-end privacy, security, and compliance solutions to prepare your organization for emerging AI regulations.
    machine-learning
    privacy
    data-engineering
  • Hydra
    Hydra (w2022) • Active • 6 employees • San Francisco, CA, USA
    Open source Snowflake alternative. Query billions of rows instantly on column-oriented Postgres. Hydra can be used as open source, as a managed cloud, or deployed in customer cloud infrastructure. Get parallelized analytics in minutes with no code changes.
    developer-tools
    analytics
    open-source
    data-engineering
  • OneSchema
    OneSchema (s2021) • Active • 10 employees • San Francisco, CA, USA
    Product and engineering teams use OneSchema to save months of development time when building a CSV importer. OneSchema improves customer activation / import completion rates by automatically correcting customer data.
    developer-tools
    saas
    b2b
    data-engineering
  • Mezmo
    Mezmo (w2015) • Active • 172 employees • San Jose, CA, USA
    Mezmo, formerly LogDNA, is an observability platform to manage and take action on your data. It ingests, processes, and routes log data to fuel enterprise-level application development and delivery, security, and compliance use cases. Mezmo was brought to life by three-time co-founders Chris Nguyen and Lee Liu and included in the Winter 2015 batch of Y Combinator. In 2018 the company partnered with tech giant IBM to become the sole logging provider for IBM Cloud. Mezmo is on a mission to empower people who build solutions that shape the world. We’re doing this by delivering a platform that enables enterprises to get more value from their observability data in real time, regardless of source, destination, use case, or scale. We’re not the only ones working on this problem, but we have a few things the others don’t. We’re cloud-native and know how to make the most of modern technology like Kubernetes. We have scaled a solution from zero to petabyte scale in a short amount of time, while supporting thousands of active users across multiple environments. We are hungry for change and are surrounded by enterprises telling us they’re hungry, too. We have a kick-ass group of people who are thinking about the problem analytically and are excited to change the observability world for the better. Mezmo has helped some of the world’s most innovative companies transform how they manage their systems and applications. Still, we know that we can help them get more value from their observability data by providing more flexibility and control over how they use it. This will enable teams to spend less time switching between data silos so they can focus on shipping better, more resilient, and secure products. We have momentum on our side. Last year we saw triple-digit revenue growth and added 800 new customers to our roster. Recent accolades include being named to YC’s Top Companies, CRN’s 10 Hottest DevOps Startups, and EMA’s Top 3 Observability Platforms.
    developer-tools
    devsecops
    saas
    kubernetes
    data-engineering
  • Converge
    Converge (s2023) • Active • 3 employees • San Francisco, CA, USA
    Tracking customer events (e.g. Add To Cart, Purchase, etc.) correctly is important, yet unattainable for most online stores due to the limitations of tracking in the browser and lack of in-house developers. Converge auto-tracks all important events – across the browser, store backend and subscription platforms. Once tracking is set up, Converge allows online stores to forward these events with the flip of a switch to their advertising platforms and analytics tools, leading to improved ad performance and better insights. Our larger vision is to go beyond data infrastructure and leverage our single customer data layer to build out a perfectly integrated set of applications that helps brands reduce their customer acquisition cost.
    saas
    analytics
    e-commerce
    data-engineering
    infrastructure
  • Platypus
    Platypus (w2021) • Active • 3 employees • San Francisco, CA, USA
    For Business Operators: Connect & automate processes on top of any data, crazy fast. For Engineering Teams: Connect any data, across any stack, in any format, crazy fast.
    b2b
    workflow-automation
    data-engineering
    ai-assistant
  • Patterns
    Patterns (s2021) • Active • 2 employees • San Francisco, CA, USA
    Patterns enables everyone to analyze data, no matter their technical ability. No more waiting for reports from your data team or fiddling around with dashboards: simply make an analytics request and get an AI-generated answer from a bot fine-tuned on your company’s data.
    analytics
    data-science
    data-engineering
    data-visualization
  • Fivetran
    Fivetran (w2013) • Active • 1,200 employees • Oakland, CA, USA
    Fivetran automates data movement out of, into and across cloud data platforms. We automate the most time-consuming parts of the ELT process from extracts to schema drift handling to transformations, so data engineers can focus on higher-impact projects with total pipeline peace of mind. With 99.9% uptime and self-healing pipelines, Fivetran enables hundreds of leading brands across the globe, including Autodesk, Conagra Brands, JetBlue, Lionsgate, Morgan Stanley, and Ziff Davis, to accelerate data-driven decisions and drive business growth. Fivetran is headquartered in Oakland, California, with offices around the world.
    saas
    b2b
    analytics
    data-engineering
  • FlowDeploy
    FlowDeploy (w2022) • Active • 3 employees • Mountain View, CA, USA
    FlowDeploy helps bioinformaticians manage their data analysis pipelines. We provide everything they need to try, run, develop, and share their pipelines. That includes integrations with AWS, Snakemake, Nextflow, GitHub, Slack, SSO, and more, as well as a clean API and web app for launching and monitoring pipelines and managing their data. FlowDeploy is built for bioinformaticians: it doesn't restrict how pipelines are built and managed, as long as a bioinformatics workflow manager like Nextflow or Snakemake is used. But it does eliminate several footguns like idle spend and accidental data egress, and it reduces the potential for users accidentally sharing credentials. FlowDeploy runs the pipelines in either our managed cloud or the customer's cloud – eliminating the need to transfer data externally. Non-computational biologists can use FlowDeploy, too: features like pipeline templates reduce the complexity of launching a new pipeline, which reduces user error and decreases the need for advanced cloud training for non-computational users.
    developer-tools
    drug-discovery
    data-engineering
  • Etleap
    Etleap (w2013) • Active • 11 employees • San Francisco, CA, USA
    Etleap is an ETL solution for creating perfect data pipelines from day one. Unlike other enterprise solutions, Etleap doesn’t require extensive engineering work to set up, maintain, and scale. It automates most ETL setup and maintenance work, and simplifies the rest into 10-minute tasks that analysts can own.
    data-engineering
  • Chaos Genius
    Chaos Genius (w2020) • Active • 10 employees • San Francisco, CA, USA
    Chaos Genius is a DataOps Observability platform for Snowflake. Enable Snowflake Observability to reduce Snowflake costs and optimize query performance.
    cloud-workload-protection
    machine-learning
    analytics
    open-source
    data-engineering
  • Logarithm Labs
    Logarithm Labs (w2020) • Active • 2 employees • Foster City, CA, USA
    Easy button to use data for your daily operations. Power your business workflows with quality data. Logarithm Labs helps you turn manual data wrangling and ad-hoc scripts into repeatable pipelines for your operational workflows. Our product and team of experts do the heavy lifting so that you can focus on the business logic that drives your organization. To learn more, contact us at hello@logarithmlabs.com.
    developer-tools
    data-engineering
  • LanceDB
    LanceDB (w2022) • Active • 10 employees • San Francisco, CA, USA
    LanceDB is a new open-source vector database that can support low-latency billion-scale vector search on a single node. Built around a new columnar data format, LanceDB makes it incredibly easy to build applications for generative AI, recsys, search engines, content moderation, and more.
    aiops
    machine-learning
    data-engineering
  • Satsuma
    Satsuma (s2021) • Acquired • 5 employees • San Francisco, CA, USA
    Satsuma is a developer tool for building applications on top of real-time blockchain data. Our product lets developers take decoded data from multiple chains, customize it for their use cases, and access it through API endpoints. Blockchains serve as distributed databases for these products, holding their most important data. However, it’s difficult to access and query that data. We believe this friction is an enormous blocker for web3 developers and that better tooling will enable mass adoption for web3. We’re a founding team of engineers, having built data infrastructure and product as early employees at Airtable, Heap, and Y Combinator.
    developer-tools
    saas
    crypto-web3
    data-engineering