pdf – DuckDB Community Extensions

pdf

Downloads: 280 this week

GitHub stars: 3

Read text, metadata, words, lines, and tables from PDF files (with optional Tesseract OCR), and convert documents (docx, odt, rtf, html, …) to PDF.

Maintainer(s): asubbarao

Installing and Loading

INSTALL pdf FROM community;
LOAD pdf;

Example

-- Load the extension
LOAD pdf;

-- One row per page (filename, page, page_count, text, width, height).
-- Accepts a single file, a list, or a glob.
SELECT page, text
FROM read_pdf('report.pdf');

-- Search across many PDFs at once
SELECT filename, page
FROM read_pdf('reports/*.pdf')
WHERE text ILIKE '%revenue%';

-- Document metadata, one row per file
SELECT title, author, pages FROM read_pdf_meta('report.pdf');

-- Extract tabular regions from digital PDFs
SELECT * FROM read_pdf_tables('financial_report.pdf');

-- Whole document as plain text (scalar)
SELECT pdf_to_text('report.pdf') AS full_text;

-- Convert a document (docx, odt, rtf, html, ...) to PDF, then read it back
-- (requires LibreOffice installed at runtime)
SELECT to_pdf('resume.docx');                    -- writes resume.pdf, returns the path
SELECT * FROM read_pdf((SELECT to_pdf('resume.docx')));

About pdf

The pdf extension brings native PDF reading to DuckDB using the Poppler library for rendering and Tesseract for OCR of scanned pages.

All table functions accept a single path, a list of paths, or a glob ('docs/*.pdf'). Shared named parameters: first_page, last_page, password, layout ('reading' | 'physical' | 'raw'), and the OCR knobs ocr, auto_ocr, ocr_language, ocr_dpi, ocr_psm, ocr_oem, tessdata_dir.
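
As a minimal sketch of how the shared named parameters might be combined, the queries below pass a list of paths and restrict the page range. The file names and the password value are placeholders, and the example assumes first_page and last_page take integer page numbers; only the parameter names themselves come from the list above.

```sql
LOAD pdf;

-- A list of paths instead of a single file or glob (hypothetical file names).
SELECT filename, page, text
FROM read_pdf(
    ['q1.pdf', 'q2.pdf'],
    first_page := 2,          -- assumed to be an integer page number
    last_page  := 5,
    layout     := 'physical'  -- one of 'reading', 'physical', 'raw'
);

-- A password-protected file (placeholder password).
SELECT page, text
FROM read_pdf('locked.pdf', password := 'secret');
```

Restricting the page range up front should avoid extracting pages that the query would discard anyway, which is likely to matter most when OCR is involved.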

Table functions

read_pdf(path, ...) — one row per page; columns: filename, page, page_count, text, width, height (page size in PDF points).
read_pdf_lines(path, ...) — one row per layout-preserving line of text; columns: filename, page, line, text. A PDF-aware analog to read_lines.
read_pdf_meta(path) — one row per file; columns: filename, title, author, subject, keywords, creator, producer, pages, pdf_version, encrypted.
read_pdf_words(path, ...) — one row per word; columns: filename, page, word, x0, y0, x1, y1 (bounding box in PDF user-space points, origin bottom-left), font_name, font_size.
read_pdf_tables(path, ...) — extracts tabular regions from digital PDFs; columns: filename, page, table_index, row_index, cells (VARCHAR[]).
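
The column lists above suggest a few everyday queries. The following is a sketch only: the file names are placeholders, the 14-point threshold is arbitrary, and font_size is assumed to be numeric.

```sql
LOAD pdf;

-- Word count per page.
SELECT filename, page, count(*) AS n_words
FROM read_pdf_words('report.pdf')
GROUP BY filename, page
ORDER BY filename, page;

-- Words set in large type, listed top to bottom.
-- The origin is bottom-left, so a larger y1 is higher on the page.
SELECT page, word, font_name, font_size
FROM read_pdf_words('report.pdf')
WHERE font_size >= 14         -- arbitrary threshold, assumes a numeric column
ORDER BY page, y1 DESC, x0;

-- grep-style search over lines in many files.
SELECT filename, page, line, text
FROM read_pdf_lines('reports/*.pdf')
WHERE regexp_matches(text, '^Total');
```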

Scalar functions

pdf_to_text(path[, layout]) — entire document as a plain-text VARCHAR.
pdf_to_html(path) — document rendered to HTML.
pdf_to_xml(path) — document rendered to XML (Poppler pdftoxml format).
pdf_to_svg(path, page) — a single page rendered to SVG.
to_pdf(path[, output_path]) — convert a document (docx, doc, odt, rtf, html, odp, pptx, xlsx, …) to PDF; writes alongside the input (extension swapped to .pdf) or to output_path, and returns the written path. See "Saving documents to PDF" below.
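
A short sketch of the scalar functions in use, following the signatures listed above. File names and the output path are placeholders, and the page argument of pdf_to_svg is assumed to be 1-based. The last query pairs pdf_to_text with DuckDB's built-in glob table function to apply a scalar to many files.

```sql
LOAD pdf;

-- Whole document as text, with an explicit layout.
SELECT length(pdf_to_text('report.pdf', 'physical')) AS n_chars;

-- A single page as SVG (page 1 assumed to be the first page).
SELECT pdf_to_svg('report.pdf', 1) AS first_page_svg;

-- Convert to an explicit output path; the written path is returned.
SELECT to_pdf('resume.docx', '/tmp/resume.pdf') AS written_path;

-- Apply a scalar function to every file matched by a glob.
SELECT file, pdf_to_text(file) AS full_text
FROM glob('reports/*.pdf');
```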

Saving documents to PDF

to_pdf converts office and markup documents to PDF by invoking LibreOffice at runtime. The conversion engine is not bundled: the extension only spawns a LibreOffice process, so nothing is added to the build. LibreOffice is auto-detected on $PATH (soffice/libreoffice), in the macOS app bundle, or via the LIBREOFFICE_PATH environment variable. If none is found, to_pdf raises a clear, actionable error (install with brew install --cask libreoffice, apt-get install libreoffice, or the Windows installer).

An alternative that needs no new function at all is to compose the shellfs extension with LibreOffice's headless converter:

INSTALL shellfs FROM community;  -- one-time install
LOAD shellfs;
SELECT * FROM read_text(
  'soffice --headless --convert-to pdf --outdir /tmp "resume.docx" && echo ok |'
);
SELECT * FROM read_pdf('/tmp/resume.pdf');

OCR (scanned / image-only PDFs)

Pages with no extractable text layer are OCR'd automatically (auto_ocr, on by default); pass ocr := true to force OCR on every page.

OCR requires a Tesseract language model at runtime (package managers do not ship one). Once you install one the usual way, it works with no configuration: the extension auto-detects the standard model directories used by Homebrew (brew install tesseract tesseract-lang), apt (apt-get install tesseract-ocr tesseract-ocr-eng), and the Windows installer.

To use a non-standard location, pass tessdata_dir := '/path/to/tessdata' per query, or set the TESSDATA_PREFIX environment variable (resolution order: tessdata_dir → TESSDATA_PREFIX → auto-detected paths). Select the language with ocr_language (e.g. ocr_language := 'deu'). If no model is found anywhere, OCR raises a clear, actionable error rather than returning empty text.
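
The OCR parameters might be combined as in the sketch below. The file names are placeholders, the ocr_dpi value of 300 is an assumed integer setting rather than a documented default, and auto_ocr := false assumes the parameter is a boolean.

```sql
LOAD pdf;

-- Force OCR on every page of a German-language scan.
SELECT page, text
FROM read_pdf(
    'scan.pdf',
    ocr          := true,
    ocr_language := 'deu',
    ocr_dpi      := 300       -- assumed integer; value chosen for illustration
);

-- Point at a non-standard model directory for this query only.
SELECT page, text
FROM read_pdf('scan.pdf', tessdata_dir := '/path/to/tessdata');

-- Skip automatic OCR and return only the existing text layer.
SELECT page, text
FROM read_pdf('mixed.pdf', auto_ocr := false);
```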

Table extraction: scope

read_pdf_tables uses a precision-first geometric heuristic (word bounding-box column clustering with a regularity gate) on digital PDFs. It reliably handles clean, aligned tables and avoids emitting spurious tables from prose, but it does not do ML-based table-structure recognition — merged cells, borderless/sparse tables, and scanned tables are out of scope. For state-of-the-art document understanding, reach for tools like docling, marker, or a cloud Document AI service; this extension targets the ~80% of everyday text/word/line/metadata/simple-table extraction directly in SQL.
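
Because read_pdf_tables returns each row as a cells list, a typical workflow is to take an inventory of the detected tables first and then unpack one of them into columns. The sketch below assumes a three-column table; the page and table_index values in the second query are placeholders to be replaced with values from the inventory, since the numbering base of table_index is not stated here.

```sql
LOAD pdf;

-- Inventory: which tables were detected, and how large are they?
SELECT filename, page, table_index,
       count(*)        AS n_rows,
       max(len(cells)) AS n_cols
FROM read_pdf_tables('financial_report.pdf')
GROUP BY ALL
ORDER BY page, table_index;

-- Unpack one table into named columns (DuckDB lists are 1-indexed).
SELECT row_index,
       cells[1] AS col_1,
       cells[2] AS col_2,
       cells[3] AS col_3
FROM read_pdf_tables('financial_report.pdf')
WHERE page = 3 AND table_index = 1   -- placeholder values from the inventory
ORDER BY row_index;
```

Indexing past the end of a list yields NULL in DuckDB, so a row with fewer cells than expected shows up as NULL columns rather than an error.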

Platform support

Linux (x86_64, arm64), macOS (x86_64, arm64), and Windows (x64, MSVC) are supported — all dependencies are resolved through vcpkg and statically linked. The mingw/rtools Windows variants and windows_arm64 are excluded (different toolchain / untested), and WebAssembly is excluded because Poppler and Tesseract cannot be linked into the wasm target.

License

GPL-2.0-or-later. Poppler is GPL-2.0; statically linking it requires the combined work to be distributed under the GPL.

Added Functions

function_name	function_type	description	comment
pdf_to_html	scalar	NULL	NULL
pdf_to_svg	scalar	NULL	NULL
pdf_to_text	scalar	NULL	NULL
pdf_to_xml	scalar	NULL	NULL
read_pdf	table	NULL	NULL
read_pdf_lines	table	NULL	NULL
read_pdf_meta	table	NULL	NULL
read_pdf_tables	table	NULL	NULL
read_pdf_words	table	NULL	NULL
to_pdf	scalar	NULL	NULL
