
Crawl4AI RAG

★ 2,200

from coleam00

Integrates web crawling and Retrieval-Augmented Generation (RAG) into AI agents and coding assistants.

🔥🔥🔥🔥 ✓ Verified · Account required · Advanced setup

Crawl4AI RAG MCP Server

Web Crawling and RAG Capabilities for AI Agents and AI Coding Assistants

A powerful implementation of the Model Context Protocol (MCP) integrated with Crawl4AI and Supabase for providing AI agents and AI coding assistants with advanced web crawling and RAG capabilities.

With this MCP server, you can scrape anything and then use that knowledge anywhere for RAG.

The primary goal is to bring this MCP server into Archon as I evolve it to be more of a knowledge engine for AI coding assistants to build AI agents. This first version of the Crawl4AI/RAG MCP server will be improved considerably soon, especially by making it more configurable so you can use different embedding models and run everything locally with Ollama.

Consider this GitHub repository a testbed, which is why I haven't been actively addressing issues and pull requests yet. I certainly will, though, as I bring this into Archon V2!

Overview

This MCP server provides tools that enable AI agents to crawl websites, store content in a vector database (Supabase), and perform RAG over the crawled content. It follows the best practices for building MCP servers based on the Mem0 MCP server template I provided on my channel previously.

The server includes several advanced RAG strategies that can be enabled to enhance retrieval quality:

  • Contextual Embeddings for enriched semantic understanding

  • Hybrid Search combining vector and keyword search

  • Agentic RAG for specialized code example extraction

  • Reranking for improved result relevance using cross-encoder models

  • Knowledge Graph for AI hallucination detection and repository code analysis

See the Configuration section below for details on how to enable and configure these strategies.
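To make "Hybrid Search combining vector and keyword search" concrete, here is a minimal sketch of one common way to merge two ranked result lists: reciprocal rank fusion. This is an illustration only, not necessarily how this server merges results; the function name, the constant `k=60`, and the document IDs are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(vector_hits, keyword_hits, k=60):
    """Merge two ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents found by both searches tend to rise to the top.
    """
    scores = defaultdict(float)
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest combined score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc-a", "doc-b", "doc-c"]   # hypothetical semantic-search results
keyword_hits = ["doc-c", "doc-a", "doc-d"]  # hypothetical keyword-search results
merged = reciprocal_rank_fusion(vector_hits, keyword_hits)
print(merged)
```

In this example the two documents returned by both searches end up ahead of the two that only one search returned.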

Vision

The Crawl4AI RAG MCP server is just the beginning. Here's where we're headed:

Integration with Archon: Building this system directly into Archon to create a comprehensive knowledge engine for AI coding assistants to build better AI agents.

Multiple Embedding Models: Expanding beyond OpenAI to support a variety of embedding models, including the ability to run everything locally with Ollama for complete control and privacy.

Advanced RAG Strategies: Implementing sophisticated retrieval techniques like contextual retrieval, late chunking, and others to move beyond basic "naive lookups" and significantly enhance the power and precision of the RAG system, especially as it integrates with Archon.

Enhanced Chunking Strategy: Implementing a Context7-inspired chunking approach that focuses on examples and creates distinct, semantically meaningful sections for each chunk, improving retrieval precision.

Performance Optimization: Increasing crawling and indexing speed so that new documentation can realistically be indexed "quickly" and then used within the same prompt in an AI coding assistant.

Features

  • Smart URL Detection: Automatically detects and handles different URL types (regular webpages, sitemaps, text files)

  • Recursive Crawling: Follows internal links to discover content

  • Parallel Processing: Efficiently crawls multiple pages simultaneously

  • Content Chunking: Intelligently splits content by headers and size for better processing

  • Vector Search: Performs RAG over crawled content, optionally filtering by data source for precision

  • Source Retrieval: Retrieve sources available for filtering to guide the RAG process
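The Content Chunking feature above (split by headers, then by size) could look roughly like the following. This is a minimal sketch under stated assumptions, not the server's actual chunker: the function name `chunk_markdown`, the `max_chars` default, and the choice to cut at paragraph breaks are all hypothetical.

```python
import re


def chunk_markdown(text, max_chars=1000):
    """Split markdown at headers, then cap every piece at max_chars."""
    # Zero-width split just before each line that starts with 1-6 '#' and a space.
    sections = re.split(r"^(?=#{1,6} )", text, flags=re.MULTILINE)
    chunks = []
    for section in sections:
        section = section.strip()
        while len(section) > max_chars:
            # Prefer to cut at the last paragraph break inside the size window.
            cut = section.rfind("\n\n", 0, max_chars)
            if cut <= 0:
                cut = max_chars  # no paragraph break found: hard cut
            chunks.append(section[:cut].strip())
            section = section[cut:].strip()
        if section:
            chunks.append(section)
    return chunks


sample = "# A\nintro\n\n## B\n" + "x" * 30 + "\n\n" + "y" * 30
for chunk in chunk_markdown(sample, max_chars=50):
    print(repr(chunk))
```

Splitting on headers first keeps each chunk inside one topical section; the size cap only applies when a section is too long to embed as one piece.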

Tools

The server provides essential web crawling and search tools:

Core Tools (Always Available)

  • crawl_single_page: Quickly crawl a single web page and store its content in the vector database

  • smart_crawl_url: Intelligently crawl a full website based on the type of URL provided (sitemap, llms-full.txt, or a regular webpage that needs to be crawled recursively)

  • get_available_sources: Get a list of all available sources (domains) in the database

  • perform_rag_query: Search for relevant content using semantic search with optional source filtering
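As a rough illustration of how smart_crawl_url might tell the three URL types apart, here is a minimal sketch. The function name `classify_url`, the returned labels, and the exact matching rules are assumptions for illustration; the server's own detection logic may differ.

```python
from urllib.parse import urlparse


def classify_url(url):
    """Return 'sitemap', 'text_file', or 'webpage' for a URL."""
    path = urlparse(url).path.lower()
    if path.endswith(".xml") and "sitemap" in path:
        return "sitemap"      # crawl every URL listed in the sitemap
    if path.endswith(".txt"):
        return "text_file"    # e.g. llms-full.txt: fetch the file directly
    return "webpage"          # follow internal links recursively


for url in (
    "https://example.com/sitemap.xml",
    "https://example.com/llms-full.txt",
    "https://example.com/docs/intro",
):
    print(url, "->", classify_url(url))
```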

Conditional Tools

  • search_code_examples (requires USE_AGENTIC_RAG=true): Search specifically for code examples and their summaries from crawled documentation. This tool provides targeted code snippet retrieval for AI coding assistants.

Knowledge Graph Tools (requires USE_KNOWLEDGE_GRAPH=true, see below)

  • parse_github_repository: Parse a GitHub repository into a Neo4j knowledge graph, extracting classes, methods, functions, and their relationships for hallucination detection

  • check_ai_script_hallucinations: Analyze Python scripts for AI hallucinations by validating imports, method calls, and class usage against the knowledge graph

  • query_knowledge_graph: Explore and query the Neo4j knowledge graph with commands like repos, classes, methods, and custom Cypher queries

Integration with MCP Clients

SSE Configuration

Once you have the server running with SSE transport, you can connect to it using this configuration:

{
  "mcpServers": {
    "crawl4ai-rag": {
      "transport": "sse",
      "url": "http://localhost:8051/sse"
    }
  }
}

Note for Windsurf users: Use serverUrl instead of url in your configuration:

{
  "mcpServers": {
    "crawl4ai-rag": {
      "transport": "sse",
      "serverUrl": "http://localhost:8051/sse"
    }
  }
}

Note for Docker users: Use host.docker.internal instead of localhost if your client is running in a different container. This will apply if you are using this MCP server within n8n!

Note for Claude Code users:

claude mcp add-json crawl4ai-rag '{"type":"sse","url":"http://localhost:8051/sse"}' --scope user
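The JSON variants in this section differ only in the URL key (url vs. serverUrl) and the host name (localhost vs. host.docker.internal). A small sketch that generates the right entry may make that easier to see. The helper name `sse_client_config` and its parameters are hypothetical; the keys, port, and path come from the configurations above.

```python
import json


def sse_client_config(host="localhost", port=8051, windsurf=False, in_docker=False):
    """Build the mcpServers entry for the SSE transport."""
    if in_docker:
        # The client runs in another container, so localhost would point at itself.
        host = "host.docker.internal"
    url_key = "serverUrl" if windsurf else "url"  # Windsurf expects serverUrl
    return {
        "mcpServers": {
            "crawl4ai-rag": {
                "transport": "sse",
                url_key: f"http://{host}:{port}/sse",
            }
        }
    }


print(json.dumps(sse_client_config(windsurf=True), indent=2))
```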

Stdio Configuration

Add this server to your MCP configuration for Claude Desktop, Windsurf, or any other MCP client:

{
  "mcpServers": {
    "crawl4ai-rag": {
      "command": "python",
      "args": ["path/to/crawl4ai-mcp/src/crawl4ai_mcp.py"],
      "env": {
        "TRANSPORT": "stdio",
        "OPENAI_API_KEY": "your_openai_api_key",
        "SUPABASE_URL": "your_supabase_url",
        "SUPABASE_SERVICE_KEY": "your_supabase_service_key",
        "USE_KNOWLEDGE_GRAPH": "false",
        "NEO4J_URI": "bolt://localhost:7687",
        "NEO4J_USER": "neo4j",
        "NEO4J_PASSWORD": "your_neo4j_password"
      }
    }
  }
}

Docker with Stdio Configuration

{
  "mcpServers": {
    "crawl4ai-rag": {
      "command": "docker",
      "args": ["run", "--rm", "-i",
               "-e", "TRANSPORT",
               "-e", "OPENAI_API_KEY",
               "-e", "SUPABASE_URL",
               "-e", "SUPABASE_SERVICE_KEY",
               "-e", "USE_KNOWLEDGE_GRAPH",
               "-e", "NEO4J_URI",
               "-e", "NEO4J_USER",
               "-e", "NEO4J_PASSWORD",
               "mcp/crawl4ai"],
      "env": {
        "TRANSPORT": "stdio",
        "OPENAI_API_KEY": "your_openai_api_key",
        "SUPABASE_URL": "your_supabase_url",
        "SUPABASE_SERVICE_KEY": "your_supabase_service_key",
        "USE_KNOWLEDGE_GRAPH": "false",
        "NEO4J_URI": "bolt://localhost:7687",
        "NEO4J_USER": "neo4j",
        "NEO4J_PASSWORD": "your_neo4j_password"
      }
    }
  }
}
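In the Docker configuration, each "-e", "NAME" pair forwards one variable from the env block into the container, so the two lists have to stay in sync. The sketch below derives the args from the env block so they cannot drift apart. The helper name `docker_stdio_config` is hypothetical; the image name and flags are the ones shown above.

```python
import json


def docker_stdio_config(env, image="mcp/crawl4ai"):
    """Build a Docker stdio entry whose -e flags mirror the env block."""
    args = ["run", "--rm", "-i"]
    for name in env:
        args += ["-e", name]  # forward the variable by name; the value comes from env
    args.append(image)
    return {
        "mcpServers": {
            "crawl4ai-rag": {"command": "docker", "args": args, "env": dict(env)}
        }
    }


env = {
    "TRANSPORT": "stdio",
    "OPENAI_API_KEY": "your_openai_api_key",
    "SUPABASE_URL": "your_supabase_url",
    "SUPABASE_SERVICE_KEY": "your_supabase_service_key",
}
config = docker_stdio_config(env)
print(json.dumps(config, indent=2))
```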

Knowledge Graph Architecture

The knowledge graph system stores repository code structure in Neo4j with the following components:

Core Components (knowledge_graphs/ folder):

  • parse_repo_into_neo4j.py: Clones and analyzes GitHub repositories, extracting Python classes, methods, functions, and imports into Neo4j nodes and relationships

  • ai_script_analyzer.py: Parses Python scripts using AST to extract imports, class instantiations, method calls, and function usage

  • knowledge_graph_validator.py: Validates AI-generated code against the knowledge graph to detect hallucinations (non-existent methods, incorrect parameters, etc.)

  • hallucination_reporter.py: Generates comprehensive reports about detected hallucinations with confidence scores and recommendations

  • query_knowledge_graph.py: Interactive CLI tool for exploring the knowledge graph (functionality now integrated into MCP tools)
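To give a feel for the AST step that ai_script_analyzer.py is described as performing, here is a minimal sketch using Python's standard ast module. It is not the project's code: the function name `extract_usage` and the shape of the returned data are assumptions, and it needs Python 3.9+ for ast.unparse.

```python
import ast


def extract_usage(source):
    """Collect imports, plain calls, and attribute-style calls from Python source."""
    tree = ast.parse(source)
    imports, plain_calls, method_calls = set(), [], []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            imports.update(f"{module}.{alias.name}" for alias in node.names)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            # obj.method(...): record the receiver expression and the method name
            method_calls.append((ast.unparse(node.func.value), node.func.attr))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Name(...): a function call or a class instantiation
            plain_calls.append(node.func.id)
    return sorted(imports), plain_calls, method_calls


script = "from shapes import Circle\nc = Circle(2)\narea = c.area()\n"
print(extract_usage(script))
```

A real analyzer would also need to track which class each variable refers to before a method call can be checked; this sketch only records the raw names.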

Knowledge Graph Schema:

The Neo4j database stores code structure as:

Nodes:

  • Repository: GitHub repositories

  • File: Python files within repositories

  • Class: Python classes with methods and attributes

  • Method: Class methods with parameter information

  • Function: Standalone functions

  • Attribute: Class attributes

Relationships:

  • Repository -[:CONTAINS]-> File

  • File -[:DEFINES]-> Class

  • File -[:DEFINES]-> Function

  • Class -[:HAS_METHOD]-> Method

  • Class -[:HAS_ATTRIBUTE]-> Attribute
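The schema above can be queried with ordinary Cypher. The sketch below wraps two example queries in Python. The node labels and relationship types come from the schema; the `name` property on each node, the query constants, and the `run_query` helper are assumptions for illustration, so check the actual property names in your database first.

```python
# Methods defined on one class (assumes nodes carry a `name` property).
METHODS_OF_CLASS = (
    "MATCH (c:Class {name: $class_name})-[:HAS_METHOD]->(m:Method) "
    "RETURN m.name AS method_name ORDER BY method_name"
)

# Classes defined anywhere in one repository.
CLASSES_IN_REPO = (
    "MATCH (:Repository {name: $repo_name})-[:CONTAINS]->(:File)"
    "-[:DEFINES]->(c:Class) "
    "RETURN c.name AS class_name ORDER BY class_name"
)


def run_query(uri, user, password, query, **params):
    """Run a read query and return rows as dicts (requires the `neo4j` package)."""
    from neo4j import GraphDatabase  # lazy import: the constants load without it

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            return [record.data() for record in session.run(query, parameters=params)]
    finally:
        driver.close()
```

With a running database, a call might look like run_query("bolt://localhost:7687", "neo4j", "your_neo4j_password", METHODS_OF_CLASS, class_name="Agent"), where "Agent" is a placeholder class name.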

Workflow:

  • Repository Parsing: Use parse_github_repository tool to clone and analyze open-source repositories

  • Code Validation: Use check_ai_script_hallucinations tool to validate AI-generated Python scripts

  • Knowledge Exploration: Use query_knowledge_graph tool to explore available repositories, classes, and methods
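The Code Validation step boils down to comparing what a script calls against what the graph knows. Here is a minimal in-memory sketch of that comparison. It stands in for the Neo4j lookup with a plain dictionary; the function name `find_unknown_calls`, the data shapes, and the sample class are hypothetical.

```python
def find_unknown_calls(known_methods, calls):
    """Return (class_name, method_name, reason) for calls the graph cannot confirm.

    known_methods maps a class name to the set of method names stored for it.
    calls is a list of (class_name, method_name) pairs found in the script.
    """
    problems = []
    for class_name, method_name in calls:
        if class_name not in known_methods:
            problems.append((class_name, method_name, "unknown class"))
        elif method_name not in known_methods[class_name]:
            problems.append((class_name, method_name, "unknown method"))
    return problems


known = {"Circle": {"area", "perimeter"}}  # hypothetical graph contents
calls = [("Circle", "area"), ("Circle", "volume"), ("Sphere", "area")]
for problem in find_unknown_calls(known, calls):
    print(problem)
```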

Building Your Own Server

This implementation provides a foundation for building more complex MCP servers with web crawling capabilities. To build your own:

  • Add your own tools by creating methods with the @mcp.tool() decorator

  • Create your own lifespan function to add your own dependencies

  • Modify the utils.py file for any helper functions you need

  • Extend the crawling capabilities by adding more specialized crawlers
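The first two steps above can be sketched as follows, assuming the FastMCP class from the official MCP Python SDK (the mcp package). The server name "my-crawler", the `AppContext` dataclass, and the `count_words` tool are hypothetical placeholders; the SDK import is kept inside `build_server` so the rest of the example runs without it.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass, field


@dataclass
class AppContext:
    """Dependencies shared by all tools (hypothetical; add your own clients here)."""
    seen_urls: set = field(default_factory=set)


@asynccontextmanager
async def app_lifespan(server):
    """Create dependencies at startup and release them at shutdown."""
    context = AppContext()
    try:
        yield context
    finally:
        context.seen_urls.clear()


async def count_words(text: str) -> int:
    """Count whitespace-separated words in a piece of crawled text."""
    return len(text.split())


def build_server():
    """Register the tool on a FastMCP server (requires the `mcp` package)."""
    from mcp.server.fastmcp import FastMCP

    server = FastMCP("my-crawler", lifespan=app_lifespan)
    server.tool()(count_words)  # same effect as decorating with @server.tool()
    return server


print(asyncio.run(count_words("crawl store retrieve")))
```

Calling build_server().run(transport="stdio") would then start the server over stdio.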