
MinerU

opendatalab/MinerU

MinerU is a document-analysis tool that turns PDFs and office files into Markdown/JSON for search, RAG, and agent workflows.

Forks 5,734
Author opendatalab
Language Python
License NOASSERTION
Synced 2026-06-20

What it is

MinerU is a document-analysis tool. It turns complex PDFs, images, and office formats into Markdown/JSON so the material can be searched, processed by models, or used in RAG scenarios.

The repository appeared in 2024, and its main language is Python. Its topics include PDF, OCR, layout analysis, DOCX, PPTX, XLSX, and data extraction.

What is inside

The repository contains document-parsing models and a processing pipeline, a CLI, demos, and documentation. It supports PDF and office formats, table parsing, image handling, and complex layout analysis.

Converting a document

The example shows the basic MinerU command: an input document is parsed and the results are saved to an output folder. The second line adds -b to select the pipeline backend explicitly.

Language: Bash
mineru -p ./paper.pdf -o ./output
mineru -p ./paper.pdf -o ./output -b pipeline
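For more than one file, the same command can be driven from a script. The following is a minimal sketch, not part of MinerU itself: the helper names build_command and convert_folder are hypothetical, only the -p, -o, and -b flags from the example above are used, and it assumes the mineru executable is on PATH and returns a non-zero exit code when a conversion fails.

```python
import subprocess
from pathlib import Path


def build_command(input_path, output_dir, backend=None):
    """Build the argument list for one `mineru` call (hypothetical helper).

    Only the flags shown in the example above are used: -p, -o and -b.
    """
    cmd = ["mineru", "-p", str(input_path), "-o", str(output_dir)]
    if backend is not None:
        cmd += ["-b", backend]
    return cmd


def convert_folder(input_dir, output_dir, backend=None):
    """Run one conversion per PDF and return the files that failed.

    Assumption: a failed conversion is signalled by a non-zero exit code.
    """
    failed = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        result = subprocess.run(build_command(pdf, output_dir, backend))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Calling convert_folder("./papers", "./output", backend="pipeline") would then run the second command from the example once per PDF in ./papers and return the files worth retrying or inspecting by hand.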

How people use it

MinerU is useful where ordinary PDF text extraction produces noise: scientific papers, reports, tables, scans, presentations, and documents with complicated structure. The project tries to preserve not only the text but also the meaning carried by its layout.

Its strength is treating documents as input for AI systems. Markdown and JSON are easier to feed into search, indexing, and models than raw PDF pages.

Project details

MinerU addresses one of the most stubborn problems in AI projects: documents are rarely clean text. Tables, images, footnotes, columns, and page breaks defeat simple extraction and degrade knowledge-base search.

The project is useful as a preparation layer. Before a document reaches a model, it must be parsed, its structure must be preserved, tables must be separated from running text, and the results must be turned into a format that can be checked and indexed.
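What happens after parsing can be sketched in a few lines. The example below is an illustration only and does not use any MinerU API: it assumes the parsed output is Markdown with # headings and that tables appear as pipe-delimited rows. Output that encodes tables differently, for example as HTML, would need a different check. The function name split_markdown is hypothetical.

```python
import re

# Matches Markdown ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def _has_content(section):
    return bool(section["heading"] or section["text"] or section["tables"])


def split_markdown(markdown):
    """Split Markdown into heading-scoped sections, keeping table rows apart from prose.

    Assumption: tables are pipe-delimited rows (lines starting with "|").
    """
    sections = []
    current = {"heading": "", "text": [], "tables": []}
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the previous section, if it holds anything.
            if _has_content(current):
                sections.append(current)
            current = {"heading": match.group(2).strip(), "text": [], "tables": []}
        elif line.lstrip().startswith("|"):
            current["tables"].append(line.strip())
        elif line.strip():
            current["text"].append(line.strip())
    if _has_content(current):
        sections.append(current)
    return sections
```

Each returned section can be indexed as its own unit, with the heading as metadata and the table rows handled separately from the surrounding text.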

The limitation is that document parsing is almost always probabilistic. Good results on some PDFs do not guarantee the same accuracy on scans, presentations, or tables with unusual layouts. Teams need quality checks on their own files.
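One way to run such a check is to compare parsed output against a small set of hand-verified references. The sketch below is a generic illustration using Python's standard difflib module, not a MinerU feature. The function names and the 0.9 threshold are assumptions to be tuned per document type.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case and collapse whitespace so layout differences do not dominate the score."""
    return " ".join(text.lower().split())


def similarity(parsed, reference):
    """Return a ratio between 0 and 1 for parsed output versus a hand-checked reference."""
    return SequenceMatcher(None, normalize(parsed), normalize(reference)).ratio()


def flag_for_review(samples, threshold=0.9):
    """Return the names of documents whose parsed text scores below the threshold.

    `samples` maps a document name to a (parsed_text, reference_text) pair.
    The default threshold is an arbitrary starting point, not a recommendation.
    """
    return [
        name
        for name, (parsed, reference) in samples.items()
        if similarity(parsed, reference) < threshold
    ]
```

A character-level ratio is a coarse signal. It catches missing or garbled passages, but it will not notice a table whose cells are all present and in the wrong columns, so structured content still needs a targeted check.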

Strengths and limitations

Accuracy depends on document quality, language, whether the input is a scan, and table structure. Automated parsing output must be checked, especially in legal or financial processes.

MinerU matters as practical document-AI infrastructure: it sits between a raw file and data suitable for further automation.

Context