2026 / Practical Python CLI / portfolio engineering project
Sitemap Product Image Scraper
Python CLI for ecommerce product data extraction and image inventory workflows
Built a Python CLI that reads an ecommerce sitemap, extracts product data, downloads product images, and exports structured CSV/JSON files for asset inventory workflows.
Python, Playwright, Data Extraction, CSV/JSON, Automation, Ethical Scraping
Problem
Ecommerce migration, SEO audit, product-data review, and asset inventory workflows often require product information and image assets to be collected from many product pages. Manual review is slow, inconsistent, and difficult to organise.
Approach
Reads sitemap.xml URLs, filters likely product pages, renders pages with Playwright, extracts product data through structured and fallback sources, downloads images, deduplicates files with SHA-256 hashes, and exports CSV/JSON files.
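The first two steps, reading sitemap URLs and filtering likely product pages, might look roughly like the sketch below. It is illustrative only and not taken from the project: the function names, the default pattern, and the `shop.example` sample data are assumptions, and an inline XML string stands in for a fetched sitemap so that the example runs without network access.

```python
import re
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return every non-empty <loc> value listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


def filter_product_urls(urls: list[str], pattern: str = r"/products?/") -> list[str]:
    """Keep URLs matching a configurable pattern (hypothetical default shown)."""
    matcher = re.compile(pattern)
    return [url for url in urls if matcher.search(url)]


# Stand-in sitemap content; a real run would download this from the site.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://shop.example/products/blue-mug</loc></url>
  <url><loc>https://shop.example/about</loc></url>
  <url><loc>https://shop.example/product/red-scarf</loc></url>
</urlset>
"""

product_urls = filter_product_urls(extract_sitemap_urls(SAMPLE_SITEMAP))
print(product_urls)
```

The default pattern accepts both `/product/` and `/products/`; a stricter or storefront-specific pattern can be passed as the second argument.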
Outcome
Demonstrates Python scripting, browser automation, data extraction, fallback logic, file organisation, documentation, and ethical-use awareness.
Tools / Frameworks
Python 3.11+, Playwright, BeautifulSoup, CSV export, JSON export, SHA-256 hashing
Evidence Produced
- Public GitHub repository
- README documentation
- CLI usage examples
- Sample output structure
- products.csv export
- products.json export
- Organised image folders
- Ethical-use documentation
Technical Workflow
Sitemap discovery, rendering, extraction, filtering, deduplication, and local export workflow.
| Stage | What happens | Why it matters |
|---|---|---|
| Sitemap discovery | Reads a sitemap.xml URL as the starting inventory of candidate pages. | Keeps the workflow structured and avoids manually collecting page URLs. |
| Product URL filtering | Filters pages using a configurable pattern such as /product/ or /products/. | Reduces noise and keeps extraction focused on likely product pages. |
| Browser rendering | Uses Playwright to render JavaScript-powered pages before extraction. | Supports modern ecommerce pages where product data appears after client-side rendering. |
| Product metadata extraction | Prefers JSON-LD Product schema, then falls back to visible HTML, Open Graph metadata, Twitter image metadata, and common ecommerce selectors. | Improves resilience across different storefront themes and markup patterns. |
| Image URL collection | Collects product image candidates from structured data, metadata, and page content. | Builds an asset inventory alongside product data instead of only scraping text fields. |
| Image filtering | Filters likely icons, logos, favicons, SVGs, sprites, placeholders, base64 images, and tracking pixels. | Keeps downloaded folders focused on useful product assets. |
| Deduplication | Deduplicates image URLs and downloaded files using SHA-256 hashes. | Avoids repeated downloads and keeps local output cleaner. |
| Local export | Exports products.csv and products.json, saves images into product-specific folders, and records page-level errors. | Creates reviewable local outputs for migration, SEO audit, product-data review, and asset inventory work. |
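The image filtering and deduplication stages in the table could be approximated as follows. This is a minimal sketch under stated assumptions rather than the project's actual code: the skip markers, helper names, and sample URLs are hypothetical, and short byte strings stand in for downloaded image content.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical markers for assets that are unlikely to be product photos.
SKIP_MARKERS = ("logo", "icon", "favicon", "sprite", "placeholder", "pixel")


def is_probable_product_image(url: str) -> bool:
    """Reject base64 data URLs, SVGs, and paths containing a skip marker."""
    if url.startswith("data:"):
        return False
    path = urlparse(url).path.lower()
    if path.endswith(".svg"):
        return False
    return not any(marker in path for marker in SKIP_MARKERS)


def dedupe_by_content(files: dict[str, bytes]) -> dict[str, str]:
    """Map each unique SHA-256 digest to the first filename with that content."""
    unique: dict[str, str] = {}
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        unique.setdefault(digest, name)  # later duplicates are ignored
    return unique


candidates = [
    "https://cdn.example/img/mug-front.jpg",
    "https://cdn.example/img/mug-front.jpg",  # repeated URL
    "https://cdn.example/assets/logo.png",
    "https://cdn.example/assets/cart-icon.svg",
    "data:image/png;base64,iVBORw0KGgo=",
]
# dict.fromkeys drops repeated URLs while preserving first-seen order.
image_urls = [u for u in dict.fromkeys(candidates) if is_probable_product_image(u)]

# Stand-in bytes; a real run would hold the downloaded image content.
downloaded = {"mug-front.jpg": b"AAA", "mug-copy.jpg": b"AAA", "mug-side.jpg": b"BBB"}
kept_files = sorted(dedupe_by_content(downloaded).values())
print(image_urls, kept_files)
```

Hashing file content rather than URLs catches the common case where the same image is served from several different URLs, such as CDN size variants that resolve to identical bytes.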
Evidence / Artefacts
Coming soon
Sample output structure
Sample folder and output structure not published yet.
CLI screenshots
CLI usage screenshots not published yet.
products.csv / products.json sample
Example export files not published yet.
Architecture / workflow diagram
Workflow diagram not published yet.
Limitations
This tool is intended for authorised ecommerce migration, SEO audit, product-data review, and asset inventory workflows only. It is not designed to bypass authentication, paywalls, robots.txt restrictions, rate limits, or anti-bot controls.
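One way a boundary like the robots.txt restriction could be respected in code is to consult the standard library's `urllib.robotparser` before requesting a page. The sketch below is an assumption about how such a check might be written, not a description of the tool's implementation; the user-agent string and the robots.txt rules are made up, and the rules are parsed from an inline string so that no network request is made.

```python
from urllib.robotparser import RobotFileParser

# Stand-in robots.txt content; a real run would fetch it from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Disallow: /account/
"""

USER_AGENT = "sitemap-image-scraper"  # hypothetical user-agent name


def build_robots_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt rules from text instead of fetching them over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


robots = build_robots_parser(ROBOTS_TXT)
allowed = robots.can_fetch(USER_AGENT, "https://shop.example/products/blue-mug")
blocked = robots.can_fetch(USER_AGENT, "https://shop.example/checkout/cart")
print(allowed, blocked)
```

A scraper following this pattern would skip any URL for which `can_fetch` returns `False` and record the skip in its page-level error log.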
What I Learned
Structured a practical CLI around sitemap discovery, Playwright rendering, fallback extraction, local file organisation, and ethical scraping boundaries.
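The fallback extraction idea can be illustrated with a small sketch that prefers a JSON-LD Product block and falls back to Open Graph metadata when none is found. The project lists BeautifulSoup as its parser. The standard library's `html.parser` is swapped in here only to keep the example free of dependencies, and the class name, function name, and returned field names are hypothetical.

```python
import json
from html.parser import HTMLParser


class ProductMetaParser(HTMLParser):
    """Collect JSON-LD script bodies and Open Graph meta tags from HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.json_ld_blocks: list[str] = []
        self.open_graph: dict[str, str] = {}
        self._in_json_ld = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "script" and attr.get("type") == "application/ld+json":
            self._in_json_ld = True
            self.json_ld_blocks.append("")
        elif tag == "meta" and (attr.get("property") or "").startswith("og:"):
            self.open_graph[attr["property"]] = attr.get("content") or ""

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the last block.
        if self._in_json_ld:
            self.json_ld_blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def extract_product(html: str) -> dict:
    """Prefer a JSON-LD Product; otherwise fall back to Open Graph tags."""
    parser = ProductMetaParser()
    parser.feed(html)
    for block in parser.json_ld_blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed structured data: try the next block
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Product":
                # "image" may be a string, list, or object in real markup.
                return {"name": item.get("name"), "image": item.get("image"), "source": "json-ld"}
    return {
        "name": parser.open_graph.get("og:title"),
        "image": parser.open_graph.get("og:image"),
        "source": "open-graph",
    }


WITH_SCHEMA = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Product", "name": "Blue Mug", "image": "https://cdn.example/mug.jpg"}'
    "</script></head></html>"
)
WITHOUT_SCHEMA = (
    '<html><head><meta property="og:title" content="Red Scarf">'
    '<meta property="og:image" content="https://cdn.example/scarf.jpg"></head></html>'
)

print(extract_product(WITH_SCHEMA))
print(extract_product(WITHOUT_SCHEMA))
```

Recording which source produced each record (the `source` field here) makes it easier to review how often a storefront relies on fallbacks instead of structured data.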
Next Improvement
Add tests for parsing helpers, richer per-page diagnostics, optional Shopify/WooCommerce export formats, config file support, and clearer sample datasets.
Next Step