Document Parsing
6 packages
Document Parsing · Recently updated
Unstructured
Toolkit for loading and partitioning documents (PDFs, HTML, PPTX, images) into clean text/chunks with metadata for downstream NLP/RAG.
Hero Score 60
Document Parsing · Active
PyMuPDF
Fast, feature-rich PDF toolkit for text/HTML extraction, images, metadata, and page rendering with coordinates.
Hero Score 61
Document Parsing · Recently updated
Docling
Open-source toolkit by IBM Research for parsing diverse document formats (PDF, DOCX, PPTX, HTML, XLSX, images, audio) into a unified structured representation suited to GenAI and RAG workflows. Hosted by the LF AI & Data Foundation.
Hero Score 79
Document Parsing · Active
Marker
PDF/image to clean Markdown converter using deep-learning models — strong on tables, equations, and complex academic layouts.
Hero Score 52
Document Parsing · Recently updated
MarkItDown
Microsoft utility for converting Office docs, PDFs, images, audio, and more to Markdown — designed for LLM-friendly text extraction.
Hero Score 70
Document Parsing · Recently updated
PaddleOCR
Multilingual OCR toolkit with text detection, recognition, layout analysis, and table extraction — built on Baidu's PaddlePaddle framework.
Hero Score 64