2026 / Practical Python CLI / portfolio engineering project

Sitemap Product Image Scraper

Python CLI for ecommerce product data extraction and image inventory workflows

Built a Python CLI that extracts product data from ecommerce sitemaps, downloads product images, exports structured CSV/JSON files, and supports asset inventory workflows.

Python, Playwright, Data Extraction, CSV/JSON, Automation, Ethical Scraping

Problem

Ecommerce migration, SEO audit, product-data review, and asset inventory workflows often require collecting product information and image assets from many product pages. Doing this manually is slow, inconsistent, and difficult to organise.

Approach

The CLI reads URLs from a sitemap.xml file, filters likely product pages, renders each page with Playwright, extracts product data from structured sources with fallbacks, downloads images, deduplicates files by SHA-256 hash, and exports CSV/JSON files.
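
The first two steps (reading sitemap URLs and filtering likely product pages) can be sketched with the standard library alone. This is a minimal illustration rather than the project's actual code: the function names, the default pattern, and the shop.example URLs are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return every non-empty <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


def filter_product_urls(urls: list[str], pattern: str = r"/products?/") -> list[str]:
    """Keep URLs that match the product pattern (hypothetical default)."""
    matcher = re.compile(pattern)
    return [url for url in urls if matcher.search(url)]


# Made-up sitemap used only for this example.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://shop.example/products/blue-mug</loc></url>
  <url><loc>https://shop.example/about</loc></url>
  <url><loc>https://shop.example/product/red-teapot</loc></url>
</urlset>"""

product_urls = filter_product_urls(extract_sitemap_urls(SAMPLE_SITEMAP))
print(product_urls)
```

The default pattern accepts both /product/ and /products/. Because the lookup matches any loc element, a sitemap index file would yield the URLs of its child sitemaps, which a fuller implementation would need to fetch in turn.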

Outcome

Demonstrates Python scripting, browser automation, data extraction, fallback logic, file organisation, documentation, and ethical-use awareness.

Tools / Frameworks

Python 3.11+, Playwright, BeautifulSoup, CSV export, JSON export, SHA-256 hashing

Evidence Produced

Public GitHub repository
README documentation
CLI usage examples
Sample output structure
products.csv export
products.json export
Organised image folders
Ethical-use documentation

Technical Workflow

Sitemap discovery, rendering, extraction, filtering, deduplication, and local export workflow.

Sitemap Product Image Scraper workflow showing each stage, what happens, and why it matters.
Stage	What happens	Why it matters
Sitemap discovery	Reads a sitemap.xml URL as the starting inventory of candidate pages.	Keeps the workflow structured and avoids manually collecting page URLs.
Product URL filtering	Filters pages using a configurable pattern such as /product/ or /products/.	Reduces noise and keeps extraction focused on likely product pages.
Browser rendering	Uses Playwright to render JavaScript-powered pages before extraction.	Supports modern ecommerce pages where product data appears after client-side rendering.
Product metadata extraction	Prefers JSON-LD Product schema, then falls back to visible HTML, Open Graph metadata, Twitter image metadata, and common ecommerce selectors.	Improves resilience across different storefront themes and markup patterns.
Image URL collection	Collects product image candidates from structured data, metadata, and page content.	Builds an asset inventory alongside product data instead of only scraping text fields.
Image filtering	Filters likely icons, logos, favicons, SVGs, sprites, placeholders, base64 images, and tracking pixels.	Keeps downloaded folders focused on useful product assets.
Deduplication	Deduplicates image URLs and downloaded files using SHA-256 hashes.	Avoids repeated downloads and keeps local output cleaner.
Local export	Exports products.csv and products.json, saves images into product-specific folders, and records page-level errors.	Creates reviewable local outputs for migration, SEO audit, product-data review, and asset inventory work.
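
The "prefer JSON-LD, then fall back" idea from the metadata extraction stage could look roughly like the sketch below. It is an assumption-laden illustration, not the project's implementation: it uses the standard-library html.parser instead of BeautifulSoup, works on an HTML string that is assumed to have been rendered already, and covers only one fallback (Open Graph) out of the several listed in the table. All names and sample pages are hypothetical.

```python
import json
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects JSON-LD script bodies and Open Graph <meta> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.json_ld_blocks: list[str] = []
        self.open_graph: dict[str, str] = {}
        self._in_json_ld = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "script" and attr.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []
        elif tag == "meta" and (attr.get("property") or "").startswith("og:"):
            self.open_graph[attr["property"]] = attr.get("content") or ""

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.json_ld_blocks.append("".join(self._buffer))
            self._in_json_ld = False


def find_product(node):
    """Return the first JSON-LD object typed as Product, searching @graph too."""
    if isinstance(node, list):
        for item in node:
            found = find_product(item)
            if found:
                return found
    elif isinstance(node, dict):
        types = node.get("@type")
        if types == "Product" or (isinstance(types, list) and "Product" in types):
            return node
        return find_product(node.get("@graph", []))
    return None


def extract_product(html: str) -> dict:
    """Prefer JSON-LD Product data; fall back to Open Graph metadata."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    for block in parser.json_ld_blocks:
        try:
            product = find_product(json.loads(block))
        except json.JSONDecodeError:
            continue  # Malformed JSON-LD: try the next block, then fall back.
        if product:
            images = product.get("image") or []
            if isinstance(images, str):
                images = [images]  # A single URL string becomes a one-item list.
            return {"name": product.get("name"), "images": images, "source": "json-ld"}
    image = parser.open_graph.get("og:image")
    return {
        "name": parser.open_graph.get("og:title"),
        "images": [image] if image else [],
        "source": "open-graph",
    }


# Made-up pages used only for this example.
STRUCTURED_PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Blue Mug", "image": "https://shop.example/img/blue-mug.jpg"}
</script></head><body></body></html>"""

FALLBACK_PAGE = """<html><head>
<meta property="og:title" content="Red Teapot">
<meta property="og:image" content="https://shop.example/img/red-teapot.jpg">
</head><body></body></html>"""

print(extract_product(STRUCTURED_PAGE)["source"])
print(extract_product(FALLBACK_PAGE)["source"])
```

Recording which source produced each record (the hypothetical "source" field here) makes it easier to spot pages where structured data was missing. JSON-LD image values can also be lists of objects rather than URL strings; the sketch passes those through unchanged.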

Evidence / Artefacts

Available

GitHub repository: Public repository for the Python CLI.
README documentation: Repository README with usage and project context.

Coming soon

Sample output structure: Sample folder and output structure not published yet.
CLI screenshots: CLI usage screenshots not published yet.
products.csv / products.json sample: Example export files not published yet.
Architecture / workflow diagram: Workflow diagram not published yet.

Limitations

This tool is intended for authorised ecommerce migration, SEO audit, product-data review, and asset inventory workflows only. It is not designed to bypass authentication, paywalls, robots.txt restrictions, rate limits, or anti-bot controls.
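
As an illustration of staying within robots.txt boundaries, a scraper can check each candidate URL against the site's rules before fetching it. The sketch below uses Python's standard urllib.robotparser; whether the project performs this exact check is not stated here, so treat it as an assumption. The user agent string, helper names, and sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent string for this example.
USER_AGENT = "sitemap-image-scraper-example"


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(urls: list[str], parser: RobotFileParser,
                 user_agent: str = USER_AGENT) -> list[str]:
    """Keep only the URLs that the parsed rules permit for this agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Made-up rules used only for this example.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /checkout/
Crawl-delay: 2
"""

robots = build_robots_parser(SAMPLE_ROBOTS)
candidates = [
    "https://shop.example/products/blue-mug",
    "https://shop.example/checkout/cart",
]
permitted = allowed_urls(candidates, robots)

# Use the site's Crawl-delay when declared, else an assumed 1-second default.
delay_seconds = robots.crawl_delay(USER_AGENT) or 1.0
print(permitted, delay_seconds)
```

Sleeping for delay_seconds between page requests is one simple way to stay within a site's stated rate expectations.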

What I Learned

Structured a practical CLI around sitemap discovery, Playwright rendering, fallback extraction, local file organisation, and ethical scraping boundaries.

Next Improvement

Add tests for parsing helpers, richer per-page diagnostics, optional Shopify/WooCommerce export formats, config file support, and clearer sample datasets.
