A fast, reliable TypeScript library and CLI tool to convert web pages into clean, readable markdown.
mdfetch is a production-ready tool for extracting article content from web pages and converting it to clean markdown format. It combines Mozilla's battle-tested Readability algorithm with Turndown's markdown conversion to deliver consistently high-quality results.
The library follows a clean pipeline architecture with four main modules:
```
┌─────────┐     ┌──────────┐     ┌───────────┐     ┌──────────┐
│ Fetcher │ ──→ │ Readable │ ──→ │  Reader   │ ──→ │ Markdown │
└─────────┘     └──────────┘     └───────────┘     └──────────┘
     ↓               ↓                 ↓                 ↓
  HTTP GET      Readability      Orchestration       Turndown
 + Retries    + Absolute URLs   + Error Handling      + GFM
```
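Conceptually, each stage's output feeds the next. A minimal sketch of that composition shape (the `Stage` type and `pipe` helper below are illustrative only, not mdfetch's internal signatures):

```typescript
// Illustrative only: models the four-stage flow as async function composition.
type Stage<In, Out> = (input: In) => Promise<Out>;

// Compose two stages into one, awaiting each in order.
function pipe<A, B, C>(f: Stage<A, B>, g: Stage<B, C>): Stage<A, C> {
  return async (input) => g(await f(input));
}

// e.g. pipe(pipe(fetcher, readable), pipe(reader, markdown)) mirrors the diagram.
```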
```bash
npm install mdfetch
```
```typescript
import { readURL } from 'mdfetch';

// Fetch and convert a URL
const result = await readURL('https://example.com/article');

// Access the content in different formats
console.log(result.markdown);     // GitHub Flavored Markdown
console.log(result.plainText);    // Plain text
console.log(result.readableHTML); // Clean HTML

// Access metadata
console.log(result.title);         // "Article Title"
console.log(result.byline);        // "Author Name"
console.log(result.excerpt);       // "Article summary..."
console.log(result.publishedTime); // "2024-01-15T10:30:00Z"
```
```typescript
import { readURL } from 'mdfetch';

const result = await readURL('https://example.com/article', {
  timeout: 60000,   // 60 second timeout
  retries: 5,       // Retry up to 5 times
  retryDelay: 2000  // 2 second initial delay
});
```
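The `retryDelay` option is described as an *initial* delay, which suggests the wait grows between attempts. A hedged sketch of that retry shape, assuming exponential backoff (`withRetries` is illustrative; mdfetch's actual backoff strategy may differ):

```typescript
// Illustrative retry helper: retries a failing async operation,
// doubling the delay after each attempt (exponential backoff).
async function withRetries<T>(
  fn: () => Promise<T>,
  retries = 3,      // number of retry attempts after the first try
  retryDelay = 1000 // initial delay in ms
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // Backoff: retryDelay, 2*retryDelay, 4*retryDelay, ...
        await new Promise((r) => setTimeout(r, retryDelay * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```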
### Functions

- `readURL(url, options?)` - Main entry point that orchestrates the full pipeline
- `fetchHTML(url, options?)` - HTTP fetching with retry logic
- `makeReadable(html)` - Extract readable content using Mozilla Readability
- `makeImgPathsAbsolute(baseURL, html)` - Convert relative image URLs to absolute
- `makeLinksAbsolute(baseURL, html)` - Convert relative link URLs to absolute
- `convertToAbsoluteURL(baseURL, relativeURL)` - URL resolution utility

### Types

- `ConversionResult` - Complete result with content in all formats plus metadata
- `ReaderOptions` - Configuration options for fetching (timeout, retries)
- `Article` - Extracted article with content and metadata
- `FetchOptions` - HTTP fetching configuration

### Errors

- `ReaderError` - Wraps errors from the reading/conversion pipeline
- `FetchError` - HTTP fetching errors with status codes

```typescript
import { readURL } from 'mdfetch';
import { writeFile } from 'fs/promises';

const result = await readURL('https://blog.example.com/post');
await writeFile('article.md', result.markdown);
```
```typescript
import { readURL } from 'mdfetch';

const result = await readURL('https://news.example.com/story');
const wordCount = result.plainText.split(/\s+/).length;
const readingTime = Math.ceil(wordCount / 200); // Minutes at 200 WPM
```
```typescript
import { readURL } from 'mdfetch';

const urls = [
  'https://example.com/article1',
  'https://example.com/article2',
  'https://example.com/article3'
];

const results = await Promise.all(
  urls.map(url => readURL(url))
);

for (const result of results) {
  console.log(`${result.title} (${result.length} chars)`);
}
```
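Note that `Promise.all` rejects as soon as any single URL fails, aborting the whole batch. For resilient batch jobs, a `Promise.allSettled` variant can collect successes and failures separately. A sketch of that pattern (`readMany` is illustrative, and the reader function is injected so the sketch is not tied to mdfetch's imports):

```typescript
// Illustrative batch helper: one failed URL does not abort the others.
async function readMany(
  urls: string[],
  read: (url: string) => Promise<{ title: string }>
) {
  const settled = await Promise.allSettled(urls.map((url) => read(url)));
  const ok: { url: string; title: string }[] = [];
  const failed: { url: string; reason: unknown }[] = [];
  settled.forEach((res, i) => {
    if (res.status === "fulfilled") {
      ok.push({ url: urls[i], title: res.value.title });
    } else {
      failed.push({ url: urls[i], reason: res.reason });
    }
  });
  return { ok, failed };
}
```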
```typescript
import { readURL, ReaderError } from 'mdfetch';
import { FetchError } from 'mdfetch/fetcher';

try {
  const result = await readURL('https://example.com/article');
  console.log(result.markdown);
} catch (error) {
  if (error instanceof FetchError) {
    console.error(`HTTP Error ${error.statusCode}: ${error.message}`);
  } else if (error instanceof ReaderError) {
    console.error(`Extraction Error: ${error.message}`);
  } else {
    console.error(`Unknown Error: ${error}`);
  }
}
```
```typescript
import { readURL } from 'mdfetch';

// Increase timeout and retries for slow sites
const result = await readURL('https://slow-site.example.com/article', {
  timeout: 120000,  // 2 minutes
  retries: 10,      // Try 10 times
  retryDelay: 3000  // Wait 3 seconds between retries
});
```
The markdown output includes a metadata header followed by the article content:
````markdown
# Article Title

**By:** Author Name
**Source:** Example Site
**Published:** 2024-01-15T10:30:00Z
**URL:** https://example.com/article

---

Article content begins here with proper markdown formatting...

## Section Heading

Paragraphs, **bold text**, *italic text*, and [links](https://example.com).

- Bullet lists
- Are properly formatted

```javascript
// Code blocks with syntax highlighting
const example = "code";
```

| Tables | Are |
|---|---|
| Also | Supported |
````
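A header like the one above could be assembled from the result's metadata fields. A hedged sketch of that assembly (`buildHeader` is illustrative, not mdfetch's implementation; it skips optional fields that are `null`):

```typescript
// Illustrative: builds a markdown metadata header from ConversionResult-shaped
// fields (field names follow the API reference above).
interface HeaderFields {
  url: string;
  title: string;
  byline: string | null;
  siteName: string | null;
  publishedTime: string | null;
}

function buildHeader(r: HeaderFields): string {
  const lines = [`# ${r.title}`];
  if (r.byline) lines.push(`**By:** ${r.byline}`);
  if (r.siteName) lines.push(`**Source:** ${r.siteName}`);
  if (r.publishedTime) lines.push(`**Published:** ${r.publishedTime}`);
  lines.push(`**URL:** ${r.url}`, '---');
  return lines.join('\n\n');
}
```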
### ConversionResult Object
```typescript
{
  url: string;                  // Original URL
  title: string;                // Article title
  readableHTML: string;         // Clean HTML (no ads/nav/footer)
  plainText: string;            // Plain text version
  markdown: string;             // GitHub Flavored Markdown
  excerpt: string | null;       // Article summary
  byline: string | null;        // Author name
  siteName: string | null;      // Site/publication name
  lang: string | null;          // Content language code
  dir: string | null;           // Text direction (ltr/rtl)
  publishedTime: string | null; // Publication timestamp
  length: number;               // Reading length (chars)
}
```
```typescript
interface ReaderOptions {
  timeout?: number;     // Request timeout in ms (default: 30000)
  retries?: number;     // Number of retry attempts (default: 3)
  retryDelay?: number;  // Initial retry delay in ms (default: 1000)
}
```
The library uses two custom error classes:
Thrown when HTTP fetching fails:

```typescript
class FetchError extends Error {
  statusCode?: number;    // HTTP status code (404, 500, etc.)
  originalError?: Error;  // Original error that caused the failure
}
```

Common scenarios: HTTP error responses (4xx/5xx status codes), request timeouts, and network failures that persist after all retries are exhausted.
Thrown when content extraction or conversion fails:

```typescript
class ReaderError extends Error {
  originalError?: Error;  // Original error that caused the failure
}
```

Common scenarios: pages with no extractable article content, HTML parsing failures, and errors during markdown conversion.
This library is designed for Node.js environments only. It uses:

- `fetch` API (Node 18+)
- `linkedom` for DOM parsing (not browser DOM)

Minimum Node version: 18.0.0
The library is written in TypeScript and includes complete type definitions:
```typescript
import { readURL, ConversionResult, ReaderOptions, ReaderError } from 'mdfetch';
import { fetchHTML, FetchError, FetchOptions } from 'mdfetch/fetcher';
import { makeReadable, Article } from 'mdfetch/readable';
```
All functions are fully typed with JSDoc comments for excellent IDE autocomplete and inline documentation.
For JavaScript-heavy sites:
```typescript
// Use puppeteer or playwright to render the page first
import puppeteer from 'puppeteer';
import { makeReadable } from 'mdfetch/readable';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://spa-example.com');
const html = await page.content();
await browser.close();

// Now process the rendered HTML
const article = makeReadable(html);
```