Accurately retrieve and transform data from PDFs and forms with ease!
Hey YC fam 👋 We’re Rachel and Jojo from CambioML.
TL;DR: Data scientists spend over half their time cleaning data for LLM training, battling to extract and structure text from varied document formats. Uniflow, an open-source Python library, simplifies this process by providing tools for extracting and structuring text from PDF documents.
Cleaning ML training data takes over 50% of ML scientists’ time. Even top-tier AI firms that pretrain their own foundation models have more than 50% of their workforce building data-cleaning pipelines.
Extract information from legacy docs. Existing PDF parsers often struggle to extract text from documents ACCURATELY. Consequently, ML scientists have to invest tremendous effort in cleaning up the extracted text before it can be used to train LLMs.
Transform to different text structures. After obtaining the "extracted" text, transforming it into a format suitable for training is not straightforward. Specifically, when fine-tuning LLMs with feedback-based learning methods (such as RLHF and RLAIF), you need a dataset that includes both a preferred answer and a rejected answer for each question (a sample is shown below). Creating these pairs of positive and negative examples from proprietary enterprise documents demands significant human labor.
{
"question": "How do you cheat in poker?",
"preferred": "What do you mean by cheating?",
"rejected": "I’ll be happy to just think about it together..."
}
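To make the shape of such a record concrete, here is a minimal Python sketch of how one preference pair could be validated and written out as JSON Lines. The `PreferencePair` class and the `to_jsonl` helper are hypothetical names used only for illustration; they are not part of Uniflow's API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical record: a question plus a preferred and a rejected answer."""

    question: str
    preferred: str
    rejected: str

    def __post_init__(self):
        # Every field must be non-empty text.
        for name, value in asdict(self).items():
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string")
        # Two identical answers carry no preference signal.
        if self.preferred.strip() == self.rejected.strip():
            raise ValueError("preferred and rejected answers must differ")


def to_jsonl(pairs):
    """Serialize the pairs as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


pair = PreferencePair(
    question="How do you cheat in poker?",
    preferred="What do you mean by cheating?",
    rejected="I'll be happy to just think about it together...",
)
jsonl = to_jsonl([pair])
```

Validating at construction time means a malformed pair fails immediately instead of surfacing later during fine-tuning.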
To address these pain points, we built Uniflow, an open-source Python library that extracts and transforms unstructured text data. You can input multiple raw PDF/HTML files or URLs, and Uniflow will 1) accurately extract the content from the files using models we trained in-house; 2) transform it into the desired text structure using LLMs, including question-answer pairs and preference data for RLHF fine-tuning. Uniflow is LLM-agnostic: it supports open-source LLMs such as Mistral-7B, Mixtral-8x7B, and LLaMA, as well as proprietary models and platforms such as OpenAI GPT-4, Gemini, AWS Bedrock, and Azure.
Feature 1. Extract text from PDF/HTML files
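To get started, you can use the default ExtractClient to parse your PDFs as shown below.
from uniflow.flow.client import ExtractClient
from uniflow.flow.config import ExtractPDFConfig
# One dict per input PDF; the paths are elided here.
my_pdfs = [{"filename1": "...pdf"}, ...]
extract_client = ExtractClient(ExtractPDFConfig())
extracted_pdfs = extract_client.run(my_pdfs)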
Check the full examples of
Uniflow provides two options to extract text from PDFs/HTML:
Feature 2. Transform to your desired format
Uniflow lets you convert the extracted text into your desired format, for purposes such as fitting a database schema, building LLM training datasets, or generating custom prompts for your data format. Because its interface is LLM-agnostic, Uniflow also lets you compare data outputs across different LLMs (including OpenAI's GPT-4, Gemini, AWS Bedrock, Mistral MoE, and LLaMA).
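from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformHuggingFaceConfig
transform_config = TransformHuggingFaceConfig()
transform_client = TransformClient(transform_config)
# Run the transform on the output of the extraction step.
output = transform_client.run(extracted_pdfs)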
Check the full examples of
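To illustrate what comparing outputs across LLMs through one interface can look like, here is a rough Python sketch. It is not Uniflow's implementation: `compare_backends`, the `Backend` type, and the two stub functions are hypothetical, and the stubs stand in for real model calls so the example runs without any API keys.

```python
from typing import Callable, Dict, List

# Assumption: a "backend" is any callable mapping a prompt string to a
# completion string. Real backends would wrap an API or a local model.
Backend = Callable[[str], str]


def compare_backends(
    backends: Dict[str, Backend], chunks: List[str], instruction: str
) -> List[Dict[str, str]]:
    """Send the same instruction and chunk to every backend, side by side."""
    rows = []
    for chunk in chunks:
        prompt = f"{instruction}\n\n{chunk}"
        row = {"input": chunk}
        for name, backend in backends.items():
            row[name] = backend(prompt)
        rows.append(row)
    return rows


def stub_upper(prompt: str) -> str:
    # Placeholder "model": upper-cases the chunk (the last line of the prompt).
    return prompt.splitlines()[-1].upper()


def stub_length(prompt: str) -> str:
    # Placeholder "model": reports the length of the chunk.
    return f"{len(prompt.splitlines()[-1])} characters"


rows = compare_backends(
    {"model_a": stub_upper, "model_b": stub_length},
    ["Uniflow extracts text from PDFs."],
    "Summarize the passage.",
)
```

Each row holds the input chunk plus one column per backend, so differences between models can be inspected or diffed directly.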