
How to Extract Text from a PDF (Text-Based and Scanned)

Two approaches for two kinds of PDFs: direct text extraction for digital PDFs, and OCR for scanned documents. Covers browser tools, command-line utilities, and common problems.

Two Kinds of PDFs

Before you try to extract text, you need to understand what kind of PDF you have. This determines which method works — and which doesn't.

  • Text-based PDFs. Created by Word, Google Docs, LaTeX, InDesign, or any application that “prints” digital text to PDF. The text is stored as character data inside the file. You can select it, copy it, and search it. Extraction is fast and accurate.
  • Scanned / image-based PDFs. Created by scanning paper documents. Each page is a photograph — the PDF contains pixel images, not character data. You cannot select or search the text. Extracting it requires OCR (Optical Character Recognition).

Quick test: open the PDF and try to select a word with your cursor. If you can highlight individual words, it's text-based. If the entire page highlights as one block (or nothing highlights at all), it's a scanned image.
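The quick selection test can also be approximated in code. Below is a rough stdlib-only heuristic (the function name is illustrative, and it's a hint rather than a guarantee: fonts stored inside compressed object streams will evade a plain byte scan):

```python
def probably_has_text_layer(pdf_bytes: bytes) -> bool:
    """Crude hint: text-based PDFs declare fonts via /Font resources,
    while image-only scans usually contain just /Image XObjects.
    Fonts hidden in compressed object streams can evade this check."""
    return b"/Font" in pdf_bytes

# Usage: probably_has_text_layer(open("document.pdf", "rb").read())
```

When the heuristic says False, treat the file as a scan and reach for OCR.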

Extracting Text from Text-Based PDFs

1. Copy and Paste

The simplest method. Open the PDF in any viewer, select the text, copy, and paste into your target application. This works for short passages but is tedious for long documents, and it often loses formatting — columns merge, headers mix into body text, and tables collapse into a jumble.

2. Browser-Based Extraction

FileKit PDF Text Extractor reads the PDF's internal text layer and outputs clean plain text. It processes the entire document at once, handling multi-page files and preserving basic line breaks. Everything runs in your browser — the file stays on your device.

Best for: extracting full-document text for editing, analysis, or migration to another format.

3. Command Line (pdftotext)

# Poppler's pdftotext
pdftotext input.pdf output.txt

# With layout preservation
pdftotext -layout input.pdf output.txt

Best for: batch processing, scripting, preserving spatial layout.
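The batch use case is a few lines of Python around the same command. A sketch, assuming pdftotext is on the PATH; `build_cmd` and `batch_extract` are illustrative names, not part of Poppler:

```python
import subprocess
from pathlib import Path

def build_cmd(pdf_path, layout=True):
    """Build the pdftotext argument list for one file."""
    pdf = Path(pdf_path)
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # preserve spatial layout (columns, tables)
    cmd += [str(pdf), str(pdf.with_suffix(".txt"))]
    return cmd

def batch_extract(folder):
    """Run pdftotext over every PDF in a folder (requires Poppler installed)."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        subprocess.run(build_cmd(pdf), check=True)
```

Each `input.pdf` becomes `input.txt` alongside it; adjust the output naming to taste.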

Extracting Text from Scanned PDFs (OCR)

If the PDF is a scanned image, you need OCR to recognize the text in the page images.

1. Browser-Based OCR

FileKit OCR uses Tesseract.js to recognize text in scanned documents and images. It supports English, Chinese, Japanese, and Spanish. Like all FileKit tools, processing happens in your browser.

2. Adobe Acrobat

Acrobat Pro can “Recognize Text” (Scan & OCR tool) and add a searchable text layer on top of the scanned images. The result is a hybrid PDF that looks the same but is now searchable and selectable.

3. Tesseract (Command Line)

# Basic OCR
tesseract input.png output -l eng

# OCR a scanned PDF (rasterize pages with pdftoppm first)
pdftoppm -png input.pdf page
tesseract page-1.png output -l eng
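The two-step pipeline above can be scripted as well. A sketch, assuming Poppler and Tesseract are on the PATH; the helper names are mine, and `pipeline_cmds` only covers the first rendered page, so a real run loops over every page-*.png that pdftoppm writes:

```python
import subprocess

def pipeline_cmds(pdf_path, lang="eng", dpi=300):
    """Argument lists for the two steps: rasterize, then OCR page 1."""
    rasterize = ["pdftoppm", "-png", "-r", str(dpi), pdf_path, "page"]
    # pdftoppm writes page-1.png, page-2.png, ...; tesseract appends
    # .txt to the output base name itself, so "page-1" -> page-1.txt
    ocr_page1 = ["tesseract", "page-1.png", "page-1", "-l", lang]
    return rasterize, ocr_page1

def run_pipeline(pdf_path, lang="eng"):
    """Execute both steps (requires Poppler + Tesseract installed)."""
    for cmd in pipeline_cmds(pdf_path, lang):
        subprocess.run(cmd, check=True)
```

The `-r 300` rendering resolution matters: OCR accuracy drops sharply below 300 DPI, as noted in the tips at the end.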

Common Extraction Problems

  • Garbled characters (e.g. “fi” becomes a strange symbol). Cause: custom font encoding without a ToUnicode map. Solution: try OCR instead of text extraction.
  • Missing spaces between words. Cause: the PDF positions each glyph individually. Solution: use a tool that infers word boundaries (FileKit does this).
  • Columns merged into one stream. Cause: text extraction reads left to right across the full page. Solution: use pdftotext -layout or extract column by column.
  • Tables become plain text soup. Cause: PDF tables are not semantically marked. Solution: use Tabula or Camelot for structured table extraction.

Tips for Better Results

  • For scanned documents, higher scan resolution (300 DPI+) dramatically improves OCR accuracy. 150 DPI is the minimum for English text; 300 DPI is recommended for CJK characters.
  • If OCR results are poor, try preprocessing the image: convert to grayscale, increase contrast, and de-skew rotated pages.
  • For structured data (tables, forms), consider dedicated table extraction tools rather than general text extraction.
  • Always proofread OCR output. Even the best OCR engines make mistakes on unusual fonts, handwriting, or low-quality scans.
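The grayscale-and-contrast preprocessing mentioned above can be sketched on raw pixel values. A toy stdlib-only version (a real pipeline would use an imaging library such as Pillow; these function names are illustrative):

```python
def to_grayscale(rgb_pixels):
    """Luminance approximation for a list of (r, g, b) tuples."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in rgb_pixels]

def binarize(gray_pixels, threshold=128):
    """Push every pixel to pure black or white -- the high-contrast
    form that OCR engines generally handle best."""
    return [255 if p >= threshold else 0 for p in gray_pixels]
```

Tuning the threshold (or using an adaptive one per region) is often the difference between unusable and near-perfect OCR on uneven scans.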