Pinned Project

Async Web Crawler API

A FastAPI + HTTPX crawler with Beautiful Soup parsing, an image-extraction variant, and domain-constrained crawling.

Language: Python
Default branch: main
Last pushed: Dec 15, 2025

Technologies Used

  • FastAPI exposes /crawl and root endpoints with automatic docs.
  • Pydantic validates request/response models (e.g., CrawlRequest, PageData).
  • HTTPX handles async fetching.
  • Beautiful Soup parses HTML for titles, links, and images.
  • asyncio runs fetches in concurrent batches; the crawler logic enforces the depth and host limits.
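
To make the validation layer concrete, here is a minimal sketch of what request/response models along these lines might look like. Only the model names CrawlRequest and PageData and the extracted values (URL, title, status, depth, link count) come from the description above; every field name, default, and the validate_request helper are assumptions for illustration, not the repository's actual definitions.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class CrawlRequest(BaseModel):
    # Field names and defaults are illustrative assumptions.
    url: str
    max_depth: int = Field(default=1, ge=0)   # reject negative depths
    max_pages: int = Field(default=10, ge=1)  # require at least one page
    same_domain: bool = True


class PageData(BaseModel):
    # Mirrors the per-page values listed for main.py; names are assumed.
    url: str
    title: Optional[str] = None
    status: int
    depth: int
    link_count: int


class CrawlResponse(BaseModel):
    # Hypothetical wrapper for the list of crawled pages.
    pages: List[PageData]


def validate_request(payload: dict) -> Optional[CrawlRequest]:
    """Return a parsed request, or None if the payload fails validation."""
    try:
        return CrawlRequest(**payload)
    except ValidationError:
        return None
```

In a FastAPI app, declaring such a model as the parameter of the /crawl handler would let the framework run this validation automatically and reject malformed bodies before the crawler starts.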

Repository Layout

  • main.py: Core crawler with FastAPI; extracts URL, title, status, depth, link count.
  • data.py: Enhanced variant adding ImageData plus extract_images for image metadata.
  • README.md: Brief project description.
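
As a rough illustration of the image path in data.py, the sketch below shows one way an extract_images helper could collect image metadata with Beautiful Soup. The function name comes from the layout above; its signature, the returned keys (src, alt, width, height), and the decision to skip images without a src are assumptions, and the real ImageData model may differ.

```python
from typing import Dict, List, Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_images(html: str, base_url: str) -> List[Dict[str, Optional[str]]]:
    """Collect basic metadata for every <img> tag that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    images: List[Dict[str, Optional[str]]] = []
    for tag in soup.find_all("img"):
        src = tag.get("src")
        if not src:
            # Assumption: images without a source are not useful to report.
            continue
        images.append(
            {
                # Resolve relative paths against the page URL.
                "src": urljoin(base_url, src),
                "alt": tag.get("alt"),
                "width": tag.get("width"),
                "height": tag.get("height"),
            }
        )
    return images
```

Attribute values come back as strings (or None when absent), so a model like ImageData would need to convert width and height if it stores them as integers.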

Key Behaviors

  • Batched async crawling via asyncio.gather (default batch size 5).
  • Optional same-domain restriction to stay within the starting host.
  • Image scraping path for richer responses when using data.py.
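
The batching and same-domain behaviors can be sketched as a level-by-level crawl that fetches each depth in groups of five with asyncio.gather. This is a simplified sketch under stated assumptions, not the project's code: the crawl and same_host names, the parameters, and the injected fetch coroutine (standing in for an HTTPX request plus link extraction) are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple
from urllib.parse import urlparse

# Assumed fetcher shape: takes a URL, returns (status code, outgoing links).
Fetcher = Callable[[str], Awaitable[Tuple[int, List[str]]]]


def same_host(url: str, start_url: str) -> bool:
    """True when both URLs share the same host (case-insensitive)."""
    return urlparse(url).netloc.lower() == urlparse(start_url).netloc.lower()


async def crawl(
    start_url: str,
    fetch: Fetcher,
    max_depth: int = 1,
    batch_size: int = 5,
    same_domain: bool = True,
) -> List[Dict[str, object]]:
    seen = {start_url}
    frontier = [start_url]
    results: List[Dict[str, object]] = []
    for depth in range(max_depth + 1):
        next_frontier: List[str] = []
        # Fetch the current depth in fixed-size concurrent batches.
        for i in range(0, len(frontier), batch_size):
            batch = frontier[i : i + batch_size]
            fetched = await asyncio.gather(
                *(fetch(url) for url in batch), return_exceptions=True
            )
            for url, outcome in zip(batch, fetched):
                if isinstance(outcome, BaseException):
                    continue  # skip pages whose fetch failed
                status, links = outcome
                results.append(
                    {"url": url, "status": status, "depth": depth,
                     "link_count": len(links)}
                )
                for link in links:
                    if link in seen:
                        continue
                    if same_domain and not same_host(link, start_url):
                        continue  # stay within the starting host
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
        if not frontier:
            break
    return results
```

Passing the fetcher in as a parameter keeps the batching logic independent of the network layer, so it can be exercised with a fake fetcher; in the real service that role would presumably be played by an HTTPX async client call followed by Beautiful Soup link extraction.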