
Pinned Project

Async Web Crawler API

FastAPI + HTTPX crawler with Beautiful Soup parsing, an image-extraction variant, and domain-constrained crawling.

PYTHON

Repository

aueskinj/web_crawler

Default branch

main

Last pushed

Dec 15, 2025

Technologies Used

FastAPI exposes the /crawl and root (/) endpoints, with automatically generated API docs.
Pydantic validates request/response models (e.g., CrawlRequest, PageData).
HTTPX handles async fetching.
Beautiful Soup parses HTML for titles, links, and images.
asyncio manages batched concurrency and depth/host limits.
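
To make the parsing step concrete, here is a minimal sketch of pulling a page title and its outgoing links from HTML. It is not the repository's code: it swaps Beautiful Soup for the standard library's html.parser so it runs without extra dependencies, and the function name parse_title_and_links is hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _TitleAndLinks(HTMLParser):
    """Collects the <title> text and every <a href> value."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.hrefs = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors with no href
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_title_and_links(page_url, html):
    """Hypothetical helper: returns (title, absolute_links) for one page."""
    parser = _TitleAndLinks()
    parser.feed(html)
    # Resolve relative hrefs against the page URL so they can be crawled.
    links = [urljoin(page_url, href) for href in parser.hrefs]
    return parser.title.strip(), links
```

The length of the returned link list would correspond to the link count that the crawler reports for each page.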

Repository Layout

main.py: Core crawler with FastAPI; extracts URL, title, status, depth, link count.
data.py: Enhanced variant adding ImageData plus extract_images for image metadata.
README.md: Brief project description.
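
As a rough illustration of what the data.py variant adds, the sketch below shows one way an extract_images helper could work. The names ImageData and extract_images come from the layout above, but everything else is an assumption: the fields (src, alt) and the function signature are hypothetical, a dataclass stands in for the Pydantic model, and the standard library's html.parser stands in for Beautiful Soup.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


@dataclass
class ImageData:
    # Hypothetical fields; the real model may hold different metadata.
    src: str
    alt: Optional[str] = None


class _ImageCollector(HTMLParser):
    """Collects one ImageData per <img> tag that has a src attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if src:  # ignore <img> tags without a source
            absolute = urljoin(self.base_url, src)
            self.images.append(ImageData(absolute, attr_map.get("alt")))


def extract_images(base_url, html):
    """Assumed signature: returns image metadata found in one page's HTML."""
    collector = _ImageCollector(base_url)
    collector.feed(html)
    return collector.images
```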

Key Behaviors

Batched async crawling via asyncio.gather (default batch size 5).
Optional same-domain restriction to stay within the starting host.
Image scraping for richer responses when running the data.py variant.
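
The batching and same-domain behaviors can be sketched as follows. This is an illustration under stated assumptions, not the code in main.py: the names crawl, same_host, and the fetch parameter are hypothetical, and fetch is a caller-supplied coroutine that returns a page's outgoing links (in the real project that role would be played by HTTPX plus the HTML parser).

```python
import asyncio
from urllib.parse import urlparse


def same_host(url, start_url):
    """True when both URLs share the same network location (host and port)."""
    return urlparse(url).netloc == urlparse(start_url).netloc


async def crawl(start_url, fetch, max_depth=1, batch_size=5, same_domain=True):
    """Breadth-first crawl that fetches each depth level in fixed-size batches.

    `fetch(url)` is an assumed coroutine returning a list of absolute links.
    Returns a list of (url, depth) pairs for pages fetched successfully.
    """
    seen = {start_url}
    frontier = [start_url]
    visited = []
    for depth in range(max_depth + 1):
        next_frontier = []
        for i in range(0, len(frontier), batch_size):
            batch = frontier[i:i + batch_size]
            # Fetch one batch concurrently; failures come back as exceptions.
            results = await asyncio.gather(
                *(fetch(url) for url in batch), return_exceptions=True
            )
            for url, links in zip(batch, results):
                if isinstance(links, Exception):
                    continue  # skip pages that failed to load
                visited.append((url, depth))
                for link in links:
                    if link in seen:
                        continue
                    if same_domain and not same_host(link, start_url):
                        continue  # stay within the starting host
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
        if not frontier:
            break
    return visited
```

Capping each asyncio.gather call at batch_size requests bounds how many connections are open at once, while the seen set keeps the crawler from fetching the same URL twice.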

Need more detail?

Happy to walk through the implementation or roadmap.

Email me LinkedIn