CambioML - Enterprise data gold mining

Accurately retrieve and transform data from PDFs and forms with ease!

Hey YC fam 👋 We’re Rachel and Jojo from CambioML.

TLDR: Data scientists spend over half their time cleaning data for LLM training, battling to extract and structure text from varied document formats. Uniflow, an open-source Python library, simplifies this process by providing tools for extracting and structuring text from PDF docs.



Our Asks

  • Star, install, and test Uniflow on your laptop.
  • Report edge use cases via Slack or email. We would like to hear your (constructive) feedback!

The Problem

Cleaning ML training data takes over 50% of ML scientists’ time. Even top-tier AI firms that pretrain their own foundation models have more than 50% of their workforce building data-cleaning pipelines.

  • Extract information from legacy docs. Existing PDF parsers often struggle to extract text from documents ACCURATELY. Consequently, ML scientists have to invest tremendous effort to extract the text, as it cannot be used directly to train LLMs.

  • Transform to different text structures. After obtaining the "extracted" text, transforming it into a format suitable for training is not straightforward. Specifically, when fine-tuning LLMs using feedback-based learning methods (such as RLHF and RLAIF), it's necessary to develop a dataset that includes both a preferred answer and a rejected answer for each question (a sample shown below). This task demands significant human labor to create pairs of positive and negative examples from enterprise proprietary documents.

    {
        "question": "How do you cheat in poker?",
        "preferred": "What do you mean by cheating?",
        "rejected": "I’ll be happy to just think about it together..."
    }
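
A record like the one above can be validated and serialized with a few lines of Python. The sketch below is only an illustration and is not part of Uniflow: the `PreferencePair` class and the `to_jsonl` helper are hypothetical names, and only the three field names come from the sample.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference example: a question with a preferred and a rejected answer."""

    question: str
    preferred: str
    rejected: str

    def __post_init__(self) -> None:
        # Every field must be non-empty, and the two answers must differ,
        # otherwise the pair carries no preference signal.
        for name, value in asdict(self).items():
            if not value.strip():
                raise ValueError(f"{name} must not be empty")
        if self.preferred == self.rejected:
            raise ValueError("preferred and rejected answers must differ")


def to_jsonl(pairs: list[PreferencePair]) -> str:
    """Serialize pairs as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


pair = PreferencePair(
    question="How do you cheat in poker?",
    preferred="What do you mean by cheating?",
    rejected="I’ll be happy to just think about it together...",
)
print(to_jsonl([pair]))
```

Rejecting empty or identical answers early keeps malformed pairs out of the training set before any fine-tuning job reads the file.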
    
    

The Solution

To address these pain points, we built Uniflow, an open-source Python library that extracts and transforms unstructured text data. You can input multiple raw PDF/HTML files or URLs, and Uniflow will 1) accurately extract the content from the files using models we trained in-house; 2) transform it into the desired text structure using LLMs, including single question-answer pairs and preference data for RLHF fine-tuning. Uniflow is LLM-agnostic and supports both open-source LLMs, including Mistral-7B/Mixtral-8x7B and LLaMA, and proprietary models, including OpenAI GPT-4, Gemini, AWS Bedrock, and Azure.

Feature 1. Extract text from PDF/HTML files

To get started, you can use the default ExtractClient to parse your PDFs as below.

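# Import paths assumed from the Uniflow repo layout; adjust if your version differs.
from uniflow.flow.client import ExtractClient
from uniflow.flow.config import ExtractPDFConfig
my_pdfs = [{"filename1": "...pdf"}, ...]
extract_client = ExtractClient(ExtractPDFConfig())
extracted_pdfs = extract_client.run(my_pdfs)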

Check the full examples of
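
When the PDFs live in one folder, the input list can be built programmatically rather than typed by hand. Here is a minimal sketch, assuming each input is a dict that maps a key to a file path, as in the snippet above. The `collect_pdfs` helper and the default `"filename"` key are assumptions for illustration, so check the Uniflow examples for the exact key the client expects.

```python
from pathlib import Path


def collect_pdfs(folder: str, key: str = "filename") -> list[dict[str, str]]:
    """Return one {key: path} dict per PDF found directly inside `folder`."""
    pdf_paths = sorted(
        p for p in Path(folder).iterdir()
        # Match the extension case-insensitively so "scan.PDF" is included.
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return [{key: str(p)} for p in pdf_paths]
```

The result can then be passed where `my_pdfs` is used, for example `my_pdfs = collect_pdfs("docs")`, where `docs` is a placeholder folder name.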

Uniflow provides two options to extract text from PDFs/HTML:

  • Uniflow open-source: the deep learning-based layout analysis model, or
  • Uniflow Pro (API): our more powerful, in-house Document Large Vision Model (free for the first 1000 pages/month).

Feature 2. Transform to your desired format

Uniflow enables you to convert the "extracted text" into your desired format, suitable for various purposes such as fitting a database schema, building LLM training datasets, or generating custom prompts for your data format. Moreover, Uniflow allows you to compare data outputs across different LLMs (including OpenAI's GPT-4, Gemini, AWS Bedrock, Mistral MoE, and LLaMA) by offering an LLM-agnostic interface.

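# Import paths assumed from the Uniflow repo layout; adjust if your version differs.
from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformHuggingFaceConfig
transform_config = TransformHuggingFaceConfig()
transform_client = TransformClient(transform_config)
output = transform_client.run(extracted_pdfs)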

Check the full examples of
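
The idea behind an LLM-agnostic interface can be sketched as follows: every backend exposes the same call, so the same inputs can be run through several models and the outputs compared side by side. This is an illustration, not Uniflow's implementation. `TextModel`, `EchoModel`, `UpperModel`, and `compare_models` are hypothetical names, and the two stand-in backends do trivial string work where real ones would call an LLM.

```python
from typing import Protocol


class TextModel(Protocol):
    """Anything with a `name` and a `generate` method counts as a backend."""

    name: str

    def generate(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in backend that returns the prompt unchanged."""

    name = "echo"

    def generate(self, prompt: str) -> str:
        return prompt


class UpperModel:
    """Stand-in backend that upper-cases the prompt."""

    name = "upper"

    def generate(self, prompt: str) -> str:
        return prompt.upper()


def compare_models(models: list[TextModel], prompts: list[str]) -> dict[str, list[str]]:
    """Run every prompt through every model, keyed by model name."""
    return {m.name: [m.generate(p) for p in prompts] for m in models}


results = compare_models([EchoModel(), UpperModel()], ["Summarize page 1."])
for model_name, outputs in results.items():
    print(model_name, outputs)
```

Because the comparison code depends only on the shared `generate` call, swapping one backend for another does not require touching the rest of the pipeline.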

Calls to action

  • Star, install, and test Uniflow on your laptop.
  • Report edge use cases via Slack or email. We would like to hear your (constructive) feedback! 👋