Document Parsing

6 packages

Unstructured

Toolkit for loading and partitioning documents (PDFs, HTML, PPTX, images) into clean text/chunks with metadata for downstream NLP/RAG.

Recently updated

Hero Score 60

PyMuPDF

Fast, feature-rich PDF toolkit for text/HTML extraction, images, metadata, and page rendering with coordinates.

Recently updated

Hero Score 65

Docling

Open-source toolkit from IBM Research for parsing diverse document formats (PDF, DOCX, PPTX, HTML, XLSX, images, audio) into a unified structured representation suited to GenAI and RAG workflows. Hosted by the LF AI & Data Foundation.

Recently updated

Hero Score 80

Marker

Converts PDFs and images to clean Markdown using deep-learning models; strong on tables, equations, and complex academic layouts.

Active

Hero Score 52

MarkItDown

Microsoft utility that converts Office documents, PDFs, images, audio, and more to Markdown, designed for LLM-friendly text extraction.

Active

Hero Score 69

PaddleOCR

Multilingual OCR toolkit with text detection, recognition, layout analysis, and table extraction, built on Baidu's PaddlePaddle framework.

Active

Hero Score 63