
MarkItDown

microsoft/markitdown

MarkItDown is a Python utility from Microsoft's AutoGen team that converts PDF, Office documents, HTML, images, audio, and archives into Markdown for LLM pipelines.

Forks 10,067
Author microsoft
Language Python
License MIT
Synced 2026-06-07

What MarkItDown is

MarkItDown is a lightweight Python utility for converting files to Markdown for LLM and text-analysis pipelines. It is not trying to be a perfect visual converter; its goal is to preserve document structure such as headings, lists, tables, links, and text in a model-friendly format.

It supports PDF, PowerPoint, Word, Excel, images with EXIF/OCR, audio metadata and transcription, HTML, CSV/JSON/XML, ZIP, YouTube URLs, EPUB, and more. That makes it useful as an ingestion layer before RAG, classification, summarization, or search.
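As a rough sketch of the ingestion-layer role, the snippet below walks a folder and converts every file whose extension is on an allow-list. The `convert_folder` helper, the `SUFFIXES` set, and the injected `convert` callable are illustrative assumptions, not part of MarkItDown. `make_markitdown_converter` shows one way to plug the real library in, assuming it is installed.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical allow-list of extensions; adjust to the formats you actually ingest.
SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".csv", ".json", ".xml", ".epub"}


def convert_folder(root: Path, convert: Callable[[Path], str]) -> Dict[str, str]:
    """Return {relative path: Markdown text} for every matching file under root."""
    results: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in SUFFIXES:
            results[path.relative_to(root).as_posix()] = convert(path)
    return results


def make_markitdown_converter() -> Callable[[Path], str]:
    """Build a convert callable backed by MarkItDown (imported lazily)."""
    from markitdown import MarkItDown

    md = MarkItDown()  # one instance reused for every file
    return lambda path: md.convert(str(path)).text_content
```

Passing the converter in as a callable keeps the folder walk independent of the library, so the same loop can be exercised with a stub or pointed at a different backend. A typical call would be `convert_folder(Path("docs"), make_markitdown_converter())`.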

What is inside and how it is used

File conversion

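The usual entry point is a `MarkItDown` instance: its `convert()` method takes a file path or URL and returns a result object whose `text_content` attribute holds the Markdown.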

Language: Python
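from markitdown import MarkItDown

# One converter instance can be reused for many files.
md = MarkItDown()
result = md.convert("report.pdf")

# text_content holds the Markdown; print only the first 1,000 characters.
print(result.text_content[:1000])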

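Security matters because MarkItDown performs I/O with the privileges of the current process, so any path or URL passed to `convert()` is read with whatever access that process has. In untrusted environments, validate and narrow the inputs, call the narrowest `convert_*` method that fits (for example `convert_stream()` on an already opened file), and avoid giving the process broad filesystem access.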
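One way to narrow inputs, shown here only as a sketch: resolve the requested path, refuse anything outside an allowed directory, and pass the converter an open binary stream rather than a path. The `convert_inside` helper and its parameters are hypothetical. `converter` is assumed to be a `MarkItDown` instance, or any object with a compatible `convert_stream()` method. Requires Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path
from typing import Any


def convert_inside(base_dir: Path, user_path: str, converter: Any) -> str:
    """Convert user_path only if it resolves to a file inside base_dir."""
    base = base_dir.resolve()
    target = (base / user_path).resolve()
    # Reject paths that escape base_dir via "..", absolute paths, or symlinks.
    if not target.is_relative_to(base):
        raise PermissionError(f"{user_path!r} is outside the allowed directory")
    # Hand over an open binary stream so the converter never resolves
    # a path or URL on its own.
    with target.open("rb") as stream:
        return converter.convert_stream(stream).text_content
```

With only a stream to work from, the converter has to infer the file type from the content, so detection may be less reliable than with a path. That is the trade-off for the narrower entry point.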

Strengths and limits

The main limit is visual fidelity. If the goal is a faithful Word or PDF rendering for human readers, a different class of tools is needed. MarkItDown is strongest when Markdown is an intermediate format for analysis rather than the final deliverable.
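To illustrate what an intermediate analysis format makes possible, here is a minimal sketch that splits converted Markdown into sections by heading, for example before indexing them for retrieval. The `split_by_heading` function is a hypothetical helper, not a MarkItDown API. It assumes ATX-style `#` headings and treats lines inside triple-backtick fences as body text.

```python
import re
from typing import List, Tuple

# ATX heading: one to six "#" characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) pairs.

    Text before the first heading is returned under an empty heading.
    """
    sections: List[Tuple[str, List[str]]] = [("", [])]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on opening and closing fences
        match = None if in_fence else HEADING.match(line)
        if match:
            sections.append((match.group(2), []))
        else:
            sections[-1][1].append(line)
    return [
        (title, "\n".join(body).strip())
        for title, body in sections
        if title or any(ln.strip() for ln in body)
    ]
```

Each pair can then be embedded, classified, or summarized on its own, which is easier than working from the original binary document.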