Webclaw

★ 1,600

from 0xMassi

Web content extraction for LLM pipelines: clean markdown or structured JSON from any URL using browser-grade TLS fingerprinting, no headless browser required. CLI, REST API, and MCP server.

webclaw

Turn websites into clean markdown, JSON, and LLM-ready context. CLI, MCP server, REST API, and SDKs for AI agents and RAG pipelines.

Most web scraping tools give your agent one of two bad outputs:

  • a blocked page, login wall, or empty app shell

  • raw HTML full of nav, scripts, styling, ads, and duplicated boilerplate

webclaw.io is the hosted web extraction API for webclaw. This repo contains the open-source CLI, MCP server, extraction engine, and self-hostable server.

webclaw turns a URL into clean content your tools can actually use.

webclaw https://example.com --format markdown

Output:

# Example Domain

This domain is for use in illustrative examples in documents.

You may use this domain in literature without prior coordination or asking for permission.

Use it from the terminal, wire it into Claude/Cursor through MCP, call the hosted API from your app, or self-host the OSS server.

MCP Server

webclaw ships with an MCP server for AI agents.

npx create-webclaw

Manual config:

{
  "mcpServers": {
    "webclaw": {
      "command": "~/.webclaw/webclaw-mcp"
    }
  }
}
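
Some MCP clients launch the command directly rather than through a shell, in which case a leading `~` may not be expanded. A minimal sketch, assuming that applies to your client, that produces the same config with the home directory resolved to an absolute path. The helper name `build_mcp_config` is hypothetical, and where the resulting JSON should be saved depends on the MCP client, which this sketch does not cover.

```python
import json
from pathlib import Path


def build_mcp_config(binary: str = "~/.webclaw/webclaw-mcp") -> dict:
    """Return the manual MCP config with a leading `~` expanded."""
    # expanduser() only rewrites a leading "~"; it does not check that
    # the binary actually exists at that location.
    command = str(Path(binary).expanduser())
    return {"mcpServers": {"webclaw": {"command": command}}}


def render_mcp_config() -> str:
    """Serialize the config as indented JSON, ready to paste into a client."""
    return json.dumps(build_mcp_config(), indent=2)
```

Calling `print(render_mcp_config())` emits the same structure as the manual config, with the path made absolute for the current user.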

Then ask your agent things like:

  • Scrape these competitor pricing pages and summarize the differences.

  • Crawl this documentation site and prepare clean context for a RAG index.

  • Extract the brand colors, fonts, and logos from this company website.

Tools

| Tool | What it does | Local |
|------|--------------|-------|
| scrape | Extract one URL as markdown, text, JSON, LLM format, or HTML | Yes |
| crawl | Follow same-origin links and extract discovered pages | Yes |
| map | Discover URLs without extracting every page | Yes |
| batch | Scrape multiple URLs in parallel | Yes |
| extract | Convert page content into structured data | Yes, with local or configured LLM |
| summarize | Summarize a page | Yes, with local or configured LLM |
| diff | Compare page content snapshots | Yes |
| brand | Extract colors, fonts, logos, and metadata | Yes |
| search | Search the web and scrape results | Hosted API |
| research | Multi-source research workflow | Hosted API |

SDKs

npm install @webclaw/sdk
pip install webclaw
go get github.com/0xMassi/webclaw-go

TypeScript

import { Webclaw } from "@webclaw/sdk";

const client = new Webclaw({ apiKey: process.env.WEBCLAW_API_KEY! });

const page = await client.scrape({
  url: "https://example.com",
  formats: ["markdown"],
  only_main_content: true,
});

console.log(page.markdown);

Python

from webclaw import Webclaw

client = Webclaw(api_key="wc_your_key")

page = client.scrape(
    "https://example.com",
    formats=["markdown"],
    only_main_content=True,
)

print(page.markdown)

cURL

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'
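
If you would rather not install an SDK, the same request can be made with Python's standard library. A minimal sketch that mirrors the cURL call: the endpoint, headers, and body fields come from the example above, while the helper names `build_scrape_request` and `send` are hypothetical. The response schema is not described in this README, so the sketch returns the raw body rather than guessing at field names.

```python
import json
import urllib.request

API_URL = "https://api.webclaw.io/v1/scrape"  # endpoint from the cURL example


def build_scrape_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request from the cURL example."""
    body = json.dumps(
        {"url": url, "formats": ["markdown"], "only_main_content": True}
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the raw response body as text."""
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

To use it, read the key from the environment, for example `send(build_scrape_request("https://example.com", os.environ["WEBCLAW_API_KEY"]))` after `import os`.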

Output Formats

| Format | Use it when you need |
|--------|----------------------|
| markdown | Clean page content with structure preserved |
| llm | Compact context for agents and RAG pipelines |
| text | Plain text with minimal formatting |
| json | Structured metadata, links, images, and extracted fields |
| html | Cleaned HTML for custom processing |
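
A minimal sketch of driving the CLI from Python with a chosen output format. It assumes `webclaw` is on your PATH and that `--format` accepts each format name listed above; this README only demonstrates `--format markdown`, so treat the other values as unverified. `build_command` and `run_webclaw` are hypothetical helper names.

```python
import subprocess

# Format names from the list above; assumed to be valid --format values.
FORMATS = ("markdown", "llm", "text", "json", "html")


def build_command(url: str, fmt: str = "markdown") -> list[str]:
    """Return the argv for `webclaw <url> --format <fmt>`."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format {fmt!r}; expected one of {FORMATS}")
    return ["webclaw", url, "--format", fmt]


def run_webclaw(url: str, fmt: str = "markdown") -> str:
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(
        build_command(url, fmt), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Passing the arguments as a list, rather than a shell string, avoids quoting problems with URLs that contain `&` or `?`.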

Local First, Hosted When Needed

The CLI and MCP server work locally without an account for the core extraction path.

Use the hosted API at webclaw.io when you need:

  • protected-site access without managing infrastructure

  • JavaScript rendering

  • async crawl and research jobs

  • web search

  • watches and production usage tracking

  • SDKs for application code

export WEBCLAW_API_KEY=wc_your_key

webclaw https://example.com --cloud

What You Can Build

| Use case | Example |
|----------|---------|
| AI agent web access | Give Claude, Cursor, or another MCP client clean page context |
| RAG ingestion | Crawl docs, help centers, blogs, and knowledge bases |
| Competitor monitoring | Track pricing pages, changelogs, docs, and product pages |
| Structured extraction | Turn messy pages into typed JSON for automations |
| Research workflows | Search, scrape, summarize, and cite multiple sources |
| Brand intelligence | Extract logos, colors, fonts, and social metadata |
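
For the competitor monitoring case, the core idea is comparing two extracted snapshots of the same page. A minimal sketch using Python's standard `difflib` on two markdown strings. It only illustrates the idea and is not how webclaw's diff tool is implemented; the function name `changed_lines` is hypothetical.

```python
import difflib
from itertools import islice


def changed_lines(old_md: str, new_md: str) -> list[str]:
    """Return added ("+") and removed ("-") lines between two snapshots."""
    diff = difflib.unified_diff(
        old_md.splitlines(), new_md.splitlines(), lineterm="", n=0
    )
    # The first two lines are the "---"/"+++" file headers. Skip them by
    # position so content that itself starts with dashes is not dropped.
    body = islice(diff, 2, None)
    # Hunk markers start with "@@"; everything else is a "+" or "-" line.
    return [line for line in body if line.startswith(("+", "-"))]
```

Given yesterday's and today's markdown for a pricing page, the result is a short list of lines worth reviewing or passing to a summarizer.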

Architecture

webclaw/
  crates/
    webclaw-core     HTML to markdown, text, JSON, and LLM-ready output
    webclaw-fetch    Fetching, crawling, batching, and mapping
    webclaw-llm      Local and hosted LLM provider support
    webclaw-pdf      PDF text extraction
    webclaw-mcp      MCP server for AI agents
    webclaw-cli      Command-line interface

webclaw-core is pure extraction logic: no network I/O, small surface area, and usable independently from the fetching layer.

Contributing

The most useful contributions right now are practical and small:

  • add examples for real agent and RAG workflows

  • improve SDK snippets

  • report pages that extract poorly

  • add failing fixtures for messy HTML

  • improve docs for MCP clients and local setup

  • test the CLI on more Linux/macOS environments

Good first places to start:

If a page extracts badly, include:

URL:
Command or API request:
Expected output:
Actual output:
Format used: markdown / llm / text / json / html
CLI, MCP, SDK, or API:

Please remove secrets, cookies, private tokens, and customer data from logs before posting.
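
A minimal sketch of a redaction pass you could run over logs before pasting them into an issue. The patterns are illustrative and not exhaustive: the `wc_` prefix is taken from the placeholder key used in this README, and real key formats may differ. `redact` is a hypothetical helper, so still review the output by hand.

```python
import re

# Each entry is (pattern, replacement). Quotes are excluded from the
# matched value so a surrounding curl -H "..." stays intact.
_PATTERNS = [
    (
        re.compile(r"(authorization:\s*bearer\s+)[^\s\"']+", re.IGNORECASE),
        r"\1[REDACTED]",
    ),
    (
        re.compile(r"((?:set-)?cookie:\s*)[^\"'\n]+", re.IGNORECASE),
        r"\1[REDACTED]",
    ),
    # Assumed key shape: "wc_" followed by letters, digits, "_" or "-".
    (re.compile(r"\bwc_[A-Za-z0-9_\-]+"), "[REDACTED_KEY]"),
]


def redact(log: str) -> str:
    """Mask bearer tokens, cookie headers, and wc_-style keys in a log string."""
    for pattern, replacement in _PATTERNS:
        log = pattern.sub(replacement, log)
    return log
```

This does not catch customer data or secrets embedded in URLs and request bodies, which still need manual removal.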

Infrastructure Partner

ColdProxy supports webclaw as an Infrastructure Partner, providing residential IPv4, residential IPv6, and datacenter IPv6 proxy infrastructure across 195+ countries for public data collection, regional testing, monitoring, and web scraping workflows. Explore ColdProxy's latest plans and available offers directly on the website. See the proxy-backed crawling guide for a hands-on walkthrough of wiring ColdProxy into webclaw.

Studio Partners

NodeMaven is the most reliable proxy provider with the highest-quality IPs on the market. Best solution for automation, web scraping, SEO research, and social media management: 99.9% uptime, sticky sessions up to 7 days, IP filtering (all proxies under a 97% fraud score), no KYC, and cashback up to 10% on traffic. Use WEBCLAW35 for 35% off Mobile and Residential proxies, or WEBCLAW40 for 40% off ISP (Static) proxies at NodeMaven.

RapidProxy delivers fast, reliable proxy infrastructure for large-scale data collection. With 90M+ residential IPs, smart rotation, high concurrency, AI-powered CAPTCHA bypass, and non-expiring traffic, it helps keep scraping workflows stable at scale. Use code webclaw for 10% off, or try it free.

MangoProxy provides residential, ISP, datacenter, and mobile proxies across 200+ locations, backed by a 90M+ IP pool with HTTP and SOCKS5 support and high stability for web scraping and data collection at scale. Use code 0XMASSI for 8% off ISP (Static) proxies at mangoproxy.com.

Community Plugins

Third-party plugins that integrate webclaw with AI agent platforms:

| Plugin | Platform | What it does |
|--------|----------|--------------|
| openclaw-webclaw | OpenClaw | Native webclaw v1 API plugin with 9 tools: scrape, search, crawl, extract, summarize, diff, map, batch, brand |
| hermes-webclaw | Hermes Agent | Web search provider and 9 dedicated tools for the full v1 API surface. Install with hermes plugins install jal-co/hermes-webclaw |

Built a webclaw integration? Open a PR to add it here.

Contributors

Thanks to everyone improving webclaw through issues, examples, docs, bug reports, and pull requests.

License

AGPL-3.0