Technologies Used
- FastAPI exposes the /crawl and root endpoints with automatic docs.
- Pydantic validates request/response models (e.g., CrawlRequest, PageData).
- HTTPX handles async fetching.
- Beautiful Soup parses HTML for titles, links, and images.
- asyncio manages batched concurrency and depth/host limits.
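
The depth and host limits mentioned above could be checked with a small helper like the one below. This is a minimal sketch using only the standard library, not the project's actual code: the function name should_visit and the parameters max_depth and same_domain are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def should_visit(start_url: str, base_url: str, href: str, depth: int,
                 max_depth: int = 2, same_domain: bool = True) -> Optional[str]:
    """Return the absolute URL to crawl next, or None if a limit rejects it.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    # Depth limit: stop following links past the configured depth.
    if depth > max_depth:
        return None

    # Resolve relative links against the page they were found on.
    absolute = urljoin(base_url, href)
    parsed = urlparse(absolute)

    # Skip non-web schemes such as mailto: or javascript:.
    if parsed.scheme not in ("http", "https"):
        return None

    # Host limit: optionally stay on the host of the starting URL.
    if same_domain and parsed.netloc != urlparse(start_url).netloc:
        return None

    return absolute
```

Comparing netloc values treats www.example.com and example.com as different hosts; a real crawler may want to normalize them first.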
Repository Layout
- main.py: Core crawler with FastAPI; extracts URL, title, status, depth, link count.
- data.py: Enhanced variant adding ImageData plus extract_images for image metadata.
- README.md: Brief project description.
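
To give a rough idea of what the image path in data.py might look like, here is a dependency-free sketch. Only the names ImageData and extract_images come from the layout above; the fields (src, alt), the function signature, and the use of the standard library's html.parser in place of Beautiful Soup and a dataclass in place of a Pydantic model are all assumptions for illustration.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urljoin


@dataclass
class ImageData:
    # Assumed fields; the real model in data.py may differ.
    src: str
    alt: Optional[str] = None


class _ImgCollector(HTMLParser):
    """Collects <img> tags that carry a src attribute."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.images: List[ImageData] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if src:
            # Resolve relative image paths against the page URL.
            self.images.append(
                ImageData(src=urljoin(self.base_url, src), alt=attr_map.get("alt"))
            )


def extract_images(html: str, base_url: str) -> List[ImageData]:
    """Hypothetical signature: return image metadata found in an HTML string."""
    collector = _ImgCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.images
```

Images without a src attribute are skipped, since there is nothing to resolve or report for them.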
Key Behaviors
- Batched async crawling via asyncio.gather (default batch size 5).
- Optional same-domain restriction to stay within the starting host.
- Image scraping path for richer responses when using data.py.
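
The batched crawling behavior can be illustrated roughly as follows. This is a sketch under stated assumptions, not the code from main.py: crawl_in_batches and fake_fetch are hypothetical names, and fake_fetch stands in for a real HTTPX request so the example runs without network access. Only asyncio.gather and the batch size of 5 come from the description above.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def crawl_in_batches(
    urls: List[str],
    fetch: Callable[[str], Awaitable[Any]],
    batch_size: int = 5,
) -> List[Any]:
    """Fetch URLs concurrently, at most batch_size at a time (hypothetical helper)."""
    results: List[Any] = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        # gather runs the whole batch concurrently and keeps input order;
        # return_exceptions=True keeps one failed fetch from aborting the batch.
        results.extend(
            await asyncio.gather(*(fetch(url) for url in batch), return_exceptions=True)
        )
    return results


async def fake_fetch(url: str) -> str:
    # Placeholder for a real async HTTP request.
    await asyncio.sleep(0)
    return f"fetched {url}"


if __name__ == "__main__":
    demo_urls = [f"https://example.com/{i}" for i in range(12)]
    pages = asyncio.run(crawl_in_batches(demo_urls, fake_fetch))
    print(len(pages))
```

Each batch finishes before the next one starts, so a single slow page delays its whole batch; a semaphore-based limiter is a common alternative when that matters.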