written by Eric J. Ma on 2024-12-20 | tags: docling nougat llms document parsing gpu
In this blog post, I explore the challenges of extracting structured text from PDFs, especially when dealing with equations, tables, and figures. I discuss two tools, Nougat-OCR by Facebook Research and Docling by IBM, which I found effective for this task. Nougat-OCR excels at handling equations and tables, while Docling excels at extracting figures. By combining these tools, we can develop a workflow that captures all critical components of a PDF. Want to know how to retain valuable knowledge from complex PDFs?
Parsing published literature into plain text is a task that seems deceptively simple. In reality, PDFs can be notoriously difficult to work with, especially when they include elements like equations, tables, and figures. If you're working with large language models (LLMs) or just trying to extract data for analysis, the standard text extraction tools often leave significant amounts of useful context behind. Recently, I explored two tools, Nougat-OCR by Facebook Research and Docling by IBM, to address this problem more effectively.
Standard methods for extracting text from PDFs often work well for plain paragraphs but stumble when it comes to three critical areas: equations, tables, and figures.
Given these challenges, I wanted to find tools that could improve text extraction for equations, tables, and figures. Here’s what I found.
Nougat-OCR is a tool developed by Facebook Research that focuses on converting scientific PDFs into structured text, including support for equations and tables. Its installation and usage are straightforward.
I used the uv package manager to set up Nougat-OCR:
uv tool install nougat-ocr --python 3.12 --with transformers==4.38.2
Once installed, the nougat command becomes available on your system PATH.
To extract text from a PDF, run the following command:
nougat data/curve-sim.pdf > data/curve-sim.mmd
This processes the PDF and redirects the extracted text into a Markdown file.
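When there is more than one paper to process, the same command can be scripted. The sketch below relies only on what the command above shows, namely that `nougat <pdf>` writes the extracted text to standard output. The function names and the one-folder layout are hypothetical, chosen for illustration.

```python
import subprocess
from pathlib import Path


def nougat_jobs(pdf_dir: Path) -> list[tuple[list[str], Path]]:
    """Pair each PDF in pdf_dir with its nougat command and .mmd output path."""
    jobs = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        command = ["nougat", str(pdf)]
        jobs.append((command, pdf.with_suffix(".mmd")))
    return jobs


def run_jobs(jobs: list[tuple[list[str], Path]]) -> None:
    """Run each command, sending its stdout to the output file (like `>`)."""
    for command, output_path in jobs:
        with output_path.open("w") as handle:
            subprocess.run(command, stdout=handle, check=True)
```

Calling `run_jobs(nougat_jobs(Path("data")))` would then produce one `.mmd` file next to each PDF in `data/`. Keeping job construction separate from execution makes it easy to inspect the planned commands before spending GPU time on them.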
Nougat-OCR handles equations and tables impressively well. For example, consider this equation from the paper A curve similarity approach to parallelism testing in bioassay:
\[f(\theta_{i},x)=a_{i}+\frac{(b_{i}-a_{i})}{1+\exp\{d_{i}(x-\log(c_{i}))\}}\,. \tag{1}\]
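Because the equation comes out as clean LaTeX, it is straightforward to transcribe into code and sanity-check. Here is a minimal sketch of equation (1) in Python; the function name and the parameter values used below are made up for illustration and do not come from the paper.

```python
import math


def curve_eq1(x: float, a: float, b: float, c: float, d: float) -> float:
    """Evaluate equation (1): a + (b - a) / (1 + exp(d * (x - log(c))))."""
    return a + (b - a) / (1.0 + math.exp(d * (x - math.log(c))))
```

At `x = math.log(c)` the exponent is zero, so the function returns the midpoint `(a + b) / 2`, which is a quick way to confirm the transcription matches the extracted formula.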
It also processes tables cleanly. Here’s an example table:
\begin{table}
\begin{tabular}{l c c c c c c}
\hline \hline
 & \multicolumn{3}{c}{Reference} & \multicolumn{3}{c}{Sample} \\
\cline{2-7}
Concentration & 1 & 2 & 3 & 1 & 2 & 3 \\
\hline
125,000 & 2.086879 & 2.119145 & 2.273702 & 1.524275 & 1.438422 & 1.563780 \\
... (truncated for brevity) ...
\hline \hline
\end{tabular}
\end{table}
This table is extracted with alignment preserved, making it ideal for further analysis. However, Nougat-OCR does not perform well with figures.
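To get from the LaTeX table to something you can analyze, the rows need to be split into cells. The following is a rough sketch of one way to do that with regular expressions, assuming a simple `tabular` like the one above; the function name is hypothetical and nested tables or escaped ampersands are not handled.

```python
import re


def tabular_rows(latex: str) -> list[list[str]]:
    """Split the body of a LaTeX tabular into rows of cell strings."""
    # Keep only what sits between \begin{tabular}{...} and \end{tabular}.
    match = re.search(
        r"\\begin\{tabular\}\{[^}]*\}(.*?)\\end\{tabular\}", latex, re.DOTALL
    )
    body = match.group(1) if match else latex
    # Drop rule commands, which carry no data.
    body = re.sub(r"\\hline|\\cline\{[^}]*\}", " ", body)
    rows = []
    for line in body.split(r"\\"):
        cells = [cell.strip() for cell in line.split("&")]
        if any(cells):
            rows.append(cells)
    return rows
```

Cells come back as strings, so values like `125,000` still need cleaning before numeric conversion, and `\multicolumn` headers are returned as raw LaTeX.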
For figures, I turned to Docling by IBM. While Nougat-OCR shines at text-based elements like equations and tables, Docling focuses on images and visual components.
Like Nougat-OCR, Docling can be installed with uv:
uv tool install docling --python 3.12
To extract images from a PDF, run:
docling data/curve-sim.pdf > data/curve-sim-figures.md
Docling processes the PDF and outputs images as base64-encoded strings embedded in a Markdown file.
Docling faithfully extracts all figures and encodes them as reusable base64 strings, which can then be passed to a multimodal LLM for description or analysis.
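To reuse those strings, they first have to be pulled out of the Markdown. Here is a minimal sketch, assuming the figures are embedded as standard Markdown images with data URIs of the form `![alt](data:image/png;base64,...)`. That format is an assumption about the output, and the function name is hypothetical.

```python
import base64
import re

# Assumed format: ![alt text](data:image/<type>;base64,<payload>)
DATA_URI = re.compile(
    r"!\[[^\]]*\]\(data:image/(\w+);base64,([A-Za-z0-9+/=]+)\)"
)


def extract_images(markdown: str) -> list[tuple[str, bytes]]:
    """Return (image type, decoded bytes) for every embedded data-URI image."""
    return [
        (image_type, base64.b64decode(payload))
        for image_type, payload in DATA_URI.findall(markdown)
    ]
```

Each returned byte string can be written to disk with `Path.write_bytes` and handed to whichever multimodal model you use.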
Docling does not handle equations well. For example, the earlier equation:
f(\theta_{i},x)=a_{i}+...
was rendered incorrectly as:
f θ i ; x ð Þ ¼ ai þ bi /C0 ai ð Þ 1 þ exp di x /C0 log ci ð Þ ð Þ f g : (1)
To get the best of both tools, I used a multi-step workflow:
Extract text-based elements (equations, tables) with Nougat-OCR:
nougat data/curve-sim.pdf > data/curve-sim.mmd
Extract figures with Docling:
docling data/curve-sim.pdf
Process figures with a multimodal LLM like LlamaBot:
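import base64
from pathlib import Path

import llamabot as lmb

# img_string is the base64 payload of one figure taken from Docling's output.
image_path = Path("figure.png")
image_path.write_bytes(base64.b64decode(img_string))

# Assumes lmb.user() turns a path to an image file into an image message.
bot = lmb.SimpleBot("Describe the figure you are given.")
description = bot(lmb.user(image_path))
print(description.content)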
This approach ensures that you capture all critical components of a PDF (equations, tables, and figures) with minimal loss of context.
Both Nougat-OCR and Docling benefit significantly from GPU acceleration, especially when processing large volumes of PDFs. To make this workflow more scalable and accessible, my next step is to deploy these tools on Modal, a serverless platform that supports GPU-based processing. By deploying Nougat and Docling as APIs on Modal, I can:
This deployment will allow me to scale preprocessing tasks effortlessly and unlock the full potential of structured PDF data.
Parsing PDFs into structured plain text is more than just a convenience; it's a necessity when working with LLMs or conducting scientific analysis. By combining Nougat-OCR for text-based elements and Docling for visual content, you can extract high-quality data from published literature.
To make this solution scalable, deploying these tools on Modal with GPU support will ensure rapid, on-demand preprocessing through simple API calls. This workflow allows you to retain equations, tables, and figures, ensuring that no valuable knowledge is left behind. As tools like Nougat-OCR and Docling continue to improve, so too will our ability to make sense of complex, multimodal content.
@article{
ericmjl-2024-accurately-docling,
author = {Eric J. Ma},
title = {Accurately extract text from research literature PDFs with Nougat-OCR and Docling},
year = {2024},
month = {12},
day = {20},
howpublished = {\url{https://ericmjl.github.io}},
journal = {Eric J. Ma's Blog},
url = {https://ericmjl.github.io/blog/2024/12/20/accurately-extract-text-from-research-literature-pdfs-with-nougat-ocr-and-docling},
}