Read text, metadata, words, lines, and tables from PDF files (with optional Tesseract OCR), and convert documents (docx, odt, rtf, html, …) to PDF.
Installing and Loading
INSTALL pdf FROM community;
LOAD pdf;
Example
-- Load the extension
LOAD pdf;
-- One row per page (filename, page, page_count, text, width, height).
-- Accepts a single file, a list, or a glob.
SELECT page, text
FROM read_pdf('report.pdf');
-- Search across many PDFs at once
SELECT filename, page
FROM read_pdf('reports/*.pdf')
WHERE text ILIKE '%revenue%';
-- Document metadata, one row per file
SELECT title, author, pages FROM read_pdf_meta('report.pdf');
-- Extract tabular regions from digital PDFs
SELECT * FROM read_pdf_tables('financial_report.pdf');
-- Whole document as plain text (scalar)
SELECT pdf_to_text('report.pdf') AS full_text;
-- Convert a document (docx, odt, rtf, html, ...) to PDF, then read it back
-- (requires LibreOffice installed at runtime)
SELECT to_pdf('resume.docx'); -- writes resume.pdf, returns the path
SELECT * FROM read_pdf((SELECT to_pdf('resume.docx')));
About pdf
The pdf extension brings native PDF reading to DuckDB using the Poppler
library for rendering and Tesseract for OCR of scanned pages.
All table functions accept a single path, a list of paths, or a glob
('docs/*.pdf'). Shared named parameters: first_page, last_page,
password, layout ('reading' | 'physical' | 'raw'), and the OCR knobs
ocr, auto_ocr, ocr_language, ocr_dpi, ocr_psm, ocr_oem,
tessdata_dir.
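As an illustration of the shared parameters, the sketch below restricts reading to a page range, switches the layout mode, and opens a protected file. The file names and the password are placeholders, and the value types (integers for page numbers, a string for the password) are assumed from the parameter names rather than stated above.

```sql
-- Hypothetical file names; parameter value types are assumptions.
LOAD pdf;

-- Read only pages 2 through 5 using the 'physical' layout mode
SELECT page, text
FROM read_pdf('report.pdf', first_page := 2, last_page := 5, layout := 'physical');

-- Open a password-protected file (placeholder password)
SELECT page, text
FROM read_pdf('protected.pdf', password := 'secret');

-- The same parameters applied to a list of files
SELECT filename, page, line, text
FROM read_pdf_lines(['a.pdf', 'b.pdf'], first_page := 1, last_page := 1);
```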
Table functions
- read_pdf(path, ...): one row per page; columns: filename, page, page_count, text, width, height (page size in PDF points).
- read_pdf_lines(path, ...): one row per layout-preserving line of text; columns: filename, page, line, text. A PDF-aware analog to read_lines.
- read_pdf_meta(path): one row per file; columns: filename, title, author, subject, keywords, creator, producer, pages, pdf_version, encrypted.
- read_pdf_words(path, ...): one row per word; columns: filename, page, word, x0, y0, x1, y1 (bounding box in PDF user-space points, origin bottom-left), font_name, font_size.
- read_pdf_tables(path, ...): extracts tabular regions from digital PDFs; columns: filename, page, table_index, row_index, cells (VARCHAR[]).
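Because every table function returns ordinary rows, the columns listed above can be filtered, grouped, and windowed like any other table. The queries below are a sketch with placeholder file names; treating encrypted as a boolean is an assumption, since the column types are not listed here.

```sql
LOAD pdf;

-- Words set in the largest font on each page (often headings)
SELECT page, word, font_size
FROM read_pdf_words('report.pdf')
QUALIFY font_size = max(font_size) OVER (PARTITION BY page)
ORDER BY page;

-- Number of text lines per page
SELECT page, count(*) AS line_count
FROM read_pdf_lines('report.pdf')
GROUP BY page
ORDER BY page;

-- Encrypted files in a directory (assumes encrypted is BOOLEAN)
SELECT filename, pages, pdf_version
FROM read_pdf_meta('docs/*.pdf')
WHERE encrypted;
```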
Scalar functions
- pdf_to_text(path[, layout]): entire document as a plain-text VARCHAR.
- pdf_to_html(path): document rendered to HTML.
- pdf_to_xml(path): document rendered to XML (Poppler pdftoxml format).
- pdf_to_svg(path, page): a single page rendered to SVG.
- to_pdf(path[, output_path]): convert a document (docx, doc, odt, rtf, html, odp, pptx, xlsx, …) to PDF; writes alongside the input (extension swapped to .pdf) or to output_path, and returns the written path. See "Saving documents to PDF" below.
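The optional arguments in the signatures above can be exercised as follows. This is a sketch with placeholder paths; it assumes the scalar layout argument takes the same values as the layout named parameter ('reading', 'physical', 'raw') and that page numbers start at 1.

```sql
LOAD pdf;

-- Whole document as text with an explicit layout mode (assumed value set)
SELECT pdf_to_text('report.pdf', 'raw') AS raw_text;

-- Render a single page to SVG (page numbering assumed to start at 1)
SELECT pdf_to_svg('report.pdf', 1) AS first_page_svg;

-- Convert to an explicit output path instead of writing next to the input
SELECT to_pdf('resume.docx', '/tmp/resume.pdf') AS written_path;
```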
Saving documents to PDF
to_pdf converts office and markup documents to PDF by invoking LibreOffice
at runtime (the conversion engine is not bundled — only a runtime process is
spawned, so nothing is added to the build). LibreOffice is auto-detected on
$PATH (soffice/libreoffice), in the macOS app bundle, or via the
LIBREOFFICE_PATH environment variable; if none is found it raises a clear,
actionable error (install with brew install --cask libreoffice,
apt-get install libreoffice, or the Windows installer). A pure-SQL
alternative needs no new function at all — compose the shellfs extension
with LibreOffice's headless converter:
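-- shellfs is a separate community extension; install it once, then load it
INSTALL shellfs FROM community;
LOAD shellfs;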
SELECT * FROM read_text(
'soffice --headless --convert-to pdf --outdir /tmp "resume.docx" && echo ok |'
);
SELECT * FROM read_pdf('/tmp/resume.pdf');
OCR (scanned / image-only PDFs)
Pages with no extractable text layer are OCR'd automatically (auto_ocr,
on by default); pass ocr := true to force OCR on every page. OCR requires a
Tesseract language model at runtime — package managers do not ship one — but
once you install one the usual way it works with no configuration: the
extension auto-detects the standard model directories used by Homebrew
(brew install tesseract tesseract-lang), apt
(apt-get install tesseract-ocr tesseract-ocr-eng), and the Windows
installer. To use a non-standard location, pass
tessdata_dir := '/path/to/tessdata' per query, or set the TESSDATA_PREFIX
environment variable (resolution order: tessdata_dir → TESSDATA_PREFIX →
auto-detected paths). Select the language with ocr_language
(e.g. ocr_language := 'deu'). If no model is found anywhere, OCR raises a
clear, actionable error rather than returning empty text.
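The OCR parameters combine with the table functions like any other named parameter. The sketch below uses placeholder file names; it assumes ocr_dpi takes an integer, that auto_ocr := false disables the automatic fallback, and that pages without a text layer then come back with NULL or empty text.

```sql
LOAD pdf;

-- Force OCR on every page with the German model at a chosen DPI (integer assumed)
SELECT page, text
FROM read_pdf('scan.pdf', ocr := true, ocr_language := 'deu', ocr_dpi := 300);

-- Point at a non-standard model directory for a single query
SELECT page, text
FROM read_pdf('scan.pdf', tessdata_dir := '/path/to/tessdata');

-- With automatic OCR off, list pages that appear to have no text layer
SELECT page
FROM read_pdf('mixed.pdf', auto_ocr := false)
WHERE text IS NULL OR trim(text) = '';
```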
Table extraction: scope
read_pdf_tables uses a precision-first geometric heuristic (word
bounding-box column clustering with a regularity gate) on digital PDFs. It
reliably handles clean, aligned tables and avoids emitting spurious tables
from prose, but it does not do ML-based table-structure recognition — merged
cells, borderless/sparse tables, and scanned tables are out of scope. For
state-of-the-art document understanding, reach for tools like docling,
marker, or a cloud Document AI service; this extension targets the ~80% of
everyday text/word/line/metadata/simple-table extraction directly in SQL.
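Since each extracted row arrives as a cells list, individual columns can be pulled out with DuckDB's 1-based list indexing. The following sketch uses a placeholder file name; the number of cells per row depends on the document, so the col1 and col2 aliases are purely illustrative.

```sql
LOAD pdf;

-- Project the first two cells of every extracted row into named columns
SELECT page, table_index, row_index,
       cells[1] AS col1,
       cells[2] AS col2,
       len(cells) AS n_cells
FROM read_pdf_tables('financial_report.pdf')
ORDER BY page, table_index, row_index;

-- Shape of each detected table
SELECT page, table_index, count(*) AS n_rows, max(len(cells)) AS n_cols
FROM read_pdf_tables('financial_report.pdf')
GROUP BY page, table_index
ORDER BY page, table_index;
```

Checking n_cols against expectations is a quick way to spot tables that the heuristic split or merged.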
Platform support
Linux (x86_64, arm64), macOS (x86_64, arm64), and Windows (x64, MSVC) are supported — all dependencies are resolved through vcpkg and statically linked. The mingw/rtools Windows variants and windows_arm64 are excluded (different toolchain / untested), and WebAssembly is excluded because Poppler and Tesseract cannot be linked into the wasm target.
License
GPL-2.0-or-later. Poppler is GPL-2.0; statically linking it requires the combined work to be distributed under the GPL.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| pdf_to_html | scalar | NULL | NULL | |
| pdf_to_svg | scalar | NULL | NULL | |
| pdf_to_text | scalar | NULL | NULL | |
| pdf_to_xml | scalar | NULL | NULL | |
| read_pdf | table | NULL | NULL | |
| read_pdf_lines | table | NULL | NULL | |
| read_pdf_meta | table | NULL | NULL | |
| read_pdf_tables | table | NULL | NULL | |
| read_pdf_words | table | NULL | NULL | |
| to_pdf | scalar | NULL | NULL | |
Overloaded Functions
This extension does not add any function overloads.
Added Types
This extension does not add any types.
Added Settings
This extension does not add any settings.