Retrieve and transform data from PDFs and forms

CambioML provides ML tools for extracting and reconstructing text and data from PDFs, HTML pages, and forms. Mine the enterprise data gold hidden in your legacy docs.

Team Size: 3
Location: San Jose, CA
Group Partner: Michael Seibel

Active Founders

Rachel Hu

Co-founder and CEO of CambioML. Previously an Applied Scientist at AWS; built LLMs and led open-source ML projects, including D2L.ai, which has been adopted by over 500 universities around the world. AWS senior speaker, with talks at AWS re:Invent, Nvidia GTC, KDD, and other conferences.

Company Launches

Hey YC fam 👋 We’re Rachel and Jojo from CambioML.

TLDR: Data scientists spend over half their time cleaning data for LLM training, battling to extract and structure text from varied document formats. Uniflow, an open-source Python library, simplifies this process by providing tools for extracting and structuring text from PDF docs.

Our Asks

  • Star, install, and test Uniflow on your laptop.
  • Report edge use cases via Slack or email. We would like to hear your (constructive) feedback!

The Problem

Cleaning ML training data takes over 50% of ML scientists’ time. Even top-tier AI firms that pretrain their own foundation models have more than 50% of their workforce building data-cleaning pipelines.

  • Extract information from legacy docs. Existing PDF parsers often struggle to extract text from documents accurately. Their output cannot be used directly to train LLMs, so ML scientists have to invest tremendous effort in cleaning it up.

  • Transform to different text structures. Once the text has been extracted, transforming it into a format suitable for training is not straightforward. Specifically, when fine-tuning LLMs with feedback-based learning methods (such as RLHF and RLAIF), you need a dataset that includes both a preferred answer and a rejected answer for each question (a sample is shown below). Creating these pairs of positive and negative examples from proprietary enterprise documents demands significant human labor.

        {
            "question": "How do you cheat in poker?",
            "preferred": "What do you mean by cheating?",
            "rejected": "I’ll be happy to just think about it together..."
        }
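As a rough illustration of the record shape above, the sketch below validates preference triples and writes them out as JSON Lines, a common layout for fine-tuning datasets. The `PreferenceRecord` class and the `to_jsonl` helper are hypothetical names made up for this example; they are not part of Uniflow.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PreferenceRecord:
    """One preference example (hypothetical helper, not a Uniflow class)."""
    question: str
    preferred: str
    rejected: str

    def __post_init__(self):
        # Reject blank fields and pairs whose two answers are identical.
        for name, value in asdict(self).items():
            if not value.strip():
                raise ValueError(f"{name} must be non-empty")
        if self.preferred == self.rejected:
            raise ValueError("preferred and rejected answers must differ")


def to_jsonl(records):
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


sample = PreferenceRecord(
    question="How do you cheat in poker?",
    preferred="What do you mean by cheating?",
    rejected="I’ll be happy to just think about it together...",
)
jsonl = to_jsonl([sample])
```

Validating at construction time means a malformed pair fails early, before it reaches a training run.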

The Solution

To address these pain points, we built Uniflow - an open-source Python library to extract and transform unstructured text data. You can input multiple raw PDF/HTML files or URLs, and Uniflow will 1) accurately extract the content from the files using models we trained in-house; 2) transform it into the desired text structure using LLMs, including single question-answer pairs and preference data for RLHF fine-tuning. Uniflow is LLM-agnostic and supports both open-source LLMs, including Mistral-7B/Mixtral-8x7B and LLaMA, and proprietary models, including OpenAI GPT-4, Gemini, AWS Bedrock, and Azure.

Feature 1. Extract text from PDF/HTML files

To get started, you can use the default ExtractClient to parse your PDFs as below.

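# Import paths are assumed from the Uniflow repo layout; check the docs.
from uniflow.flow.client import ExtractClient
from uniflow.flow.config import ExtractPDFConfig
my_pdfs = [{"filename1": "...pdf"}, ...]
extract_client = ExtractClient(ExtractPDFConfig())
extracted_pdfs = extract_client.run(my_pdfs)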

Check the full examples of
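If your PDFs live in one folder, the input list can be built programmatically. The sketch below is a minimal example that assumes each entry is a dict with a single "filename" key pointing at a PDF path; that key name and the `collect_pdfs` helper are assumptions for illustration, so check the Uniflow examples for the exact input format.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return one {"filename": path} dict per PDF in `folder`, sorted by name.

    The "filename" key is an assumption about the expected input format.
    """
    return [{"filename": str(p)} for p in sorted(Path(folder).glob("*.pdf"))]
```

The result could then be passed to the extract client in place of a hand-written list, for example `my_pdfs = collect_pdfs("docs")`.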

Uniflow provides two options to extract text from PDFs/HTML:

  • Uniflow open-source: a deep learning-based layout analysis model, or
  • Uniflow Pro (API): a more powerful Document Large Vision Model that we developed in-house (free for the first 1000 pages/month).

Feature 2. Transform to your desired format

Uniflow enables you to convert the "extracted text" into your desired format, suitable for various purposes such as fitting a database schema, building LLM training datasets, or generating custom prompts for your data format. Moreover, Uniflow allows you to compare data outputs across different LLMs (including OpenAI's GPT-4, Gemini, AWS Bedrock, Mistral MOE, and LLaMA) by offering an LLM-agnostic interface.

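# Import paths are assumed from the Uniflow repo layout; check the docs.
from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformHuggingFaceConfig
transform_config = TransformHuggingFaceConfig()
transform_client = TransformClient(transform_config)
output = transform_client.run(extracted_pdfs)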

Check the full examples of
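To make the idea of an LLM-agnostic interface concrete, here is a minimal sketch of comparing outputs across models: every backend is treated as a plain function from prompt to completion, so the same prompt can be sent to each one. The `compare_backends` function and the stand-in backends are hypothetical and do not reflect Uniflow's actual API.

```python
from typing import Callable, Dict

# A backend is any callable that maps a prompt string to a completion string.
Backend = Callable[[str], str]


def compare_backends(
    text: str, instruction: str, backends: Dict[str, Backend]
) -> Dict[str, str]:
    """Send the same prompt to every backend and collect outputs by name."""
    prompt = f"{instruction}\n\n{text}"
    return {name: run(prompt) for name, run in backends.items()}


# Stand-in backends so the sketch runs without API keys. In practice each
# entry would wrap a real client (a hosted API or a local open-source model).
backends = {
    "upper": lambda prompt: prompt.upper(),
    "length": lambda prompt: f"{len(prompt)} characters",
}

results = compare_backends(
    text="Uniflow extracts text from PDFs.",
    instruction="Summarize:",
    backends=backends,
)
```

Because the comparison code only depends on the callable signature, swapping one model for another means changing a single dictionary entry.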

Calls to Action

  • Star, install, and test Uniflow on your laptop.
  • Report edge use cases via Slack or email. We would like to hear your (constructive) feedback! 👋

Other Company Launches

CambioML - the "Private ML Scientists" for Large Enterprises

Transform messy multimodal data into MOE (mixture of experts) LLMs/LVMs