2026 / Practical Python CLI / portfolio engineering project

Sitemap Product Image Scraper

Python CLI for ecommerce product data extraction and image inventory workflows

Built a Python CLI that reads an ecommerce sitemap, extracts product data, downloads product images, and exports structured CSV/JSON files for asset inventory workflows.

Python, Playwright, Data Extraction, CSV/JSON, Automation, Ethical Scraping

Problem

Ecommerce migration, SEO audit, product-data review, and asset inventory workflows often require product information and image assets to be collected from many product pages. Manual review is slow, inconsistent, and difficult to organise.

Approach

The CLI reads URLs from a sitemap.xml file, filters likely product pages, renders each page with Playwright, extracts product data from structured sources with fallbacks, downloads images, deduplicates downloaded files using SHA-256 hashes, and exports the results as CSV and JSON files.
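
As an illustration of the first two steps, the sketch below parses a sitemap document and keeps only URLs that match a product pattern. It uses the standard library only and is an assumption about the general shape, not the project's actual code: the function names, the default regex, and the shop.example URLs are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap_xml):
    """Return every non-empty <loc> value found under <url> entries."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


def filter_product_urls(urls, pattern=r"/products?/"):
    """Keep URLs matching the pattern (hypothetical default: /product/ or /products/)."""
    matcher = re.compile(pattern)
    return [url for url in urls if matcher.search(url)]


# Inline sample so the sketch runs without any network access.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://shop.example/products/blue-mug</loc></url>
  <url><loc>https://shop.example/about</loc></url>
  <url><loc>https://shop.example/product/red-scarf</loc></url>
</urlset>"""

product_urls = filter_product_urls(extract_sitemap_urls(SAMPLE))
print(product_urls)
```

Here the /about page is dropped and the two product URLs are kept in sitemap order. A sitemap index file (one that lists other sitemaps) would need an extra pass, which this sketch leaves out.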

Outcome

Demonstrates Python scripting, browser automation, data extraction, fallback logic, file organisation, documentation, and ethical-use awareness.

Tools / Frameworks

Python 3.11+, Playwright, BeautifulSoup, CSV export, JSON export, SHA-256 hashing

Evidence Produced

  • Public GitHub repository
  • README documentation
  • CLI usage examples
  • Sample output structure
  • products.csv export
  • products.json export
  • Organised image folders
  • Ethical-use documentation

Technical Workflow

A local workflow covering sitemap discovery, product URL filtering, browser rendering, metadata extraction, image filtering, deduplication, and export.

Sitemap Product Image Scraper workflow showing each stage, what happens, and why it matters.
Stage | What happens | Why it matters
Sitemap discovery | Reads a sitemap.xml URL as the starting inventory of candidate pages. | Keeps the workflow structured and avoids manually collecting page URLs.
Product URL filtering | Filters pages using a configurable pattern such as /product/ or /products/. | Reduces noise and keeps extraction focused on likely product pages.
Browser rendering | Uses Playwright to render JavaScript-powered pages before extraction. | Supports modern ecommerce pages where product data appears after client-side rendering.
Product metadata extraction | Prefers JSON-LD Product schema, then falls back to visible HTML, Open Graph metadata, Twitter image metadata, and common ecommerce selectors. | Improves resilience across different storefront themes and markup patterns.
Image URL collection | Collects product image candidates from structured data, metadata, and page content. | Builds an asset inventory alongside product data instead of only scraping text fields.
Image filtering | Filters likely icons, logos, favicons, SVGs, sprites, placeholders, base64 images, and tracking pixels. | Keeps downloaded folders focused on useful product assets.
Deduplication | Deduplicates image URLs and downloaded files using SHA-256 hashes. | Avoids repeated downloads and keeps local output cleaner.
Local export | Exports products.csv and products.json, saves images into product-specific folders, and records page-level errors. | Creates reviewable local outputs for migration, SEO audit, product-data review, and asset inventory work.
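
The image filtering and deduplication stages can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the hint list, the function names, and the stand-in bytes used in place of real downloads are all hypothetical, and substring heuristics like these can misfire (for example, "icon" also appears in "silicone").

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical path substrings that suggest a decorative or tracking asset.
SKIP_HINTS = ("logo", "icon", "favicon", "sprite", "placeholder", "pixel")


def is_likely_product_image(url):
    """Reject data URIs, SVG files, and paths containing a skip hint."""
    if url.startswith("data:"):
        return False
    path = urlparse(url).path.lower()
    if path.endswith(".svg"):
        return False
    return not any(hint in path for hint in SKIP_HINTS)


def dedupe_by_content(downloads):
    """Keep the first (url, digest) pair seen for each SHA-256 content digest."""
    seen = set()
    unique = []
    for url, content in downloads:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            continue  # identical bytes already kept under another URL
        seen.add(digest)
        unique.append((url, digest))
    return unique


candidates = [
    "https://shop.example/img/mug-front.jpg",
    "https://shop.example/assets/logo.png",
    "data:image/png;base64,iVBORw0KGgo=",
    "https://cdn.shop.example/img/mug-front.jpg",
    "https://shop.example/img/icons.svg",
]
kept = [url for url in candidates if is_likely_product_image(url)]

# Stand-in bytes instead of real downloads: both kept URLs serve the same image.
fake_downloads = [(url, b"same-image-bytes") for url in kept]
unique = dedupe_by_content(fake_downloads)
print(kept)
print(unique)
```

Hashing the downloaded bytes rather than the URL is what catches the same image served from two hosts, as with the two mug URLs above.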

Evidence / Artefacts

  • GitHub repository (available): Public repository for the Python CLI.
  • README documentation (available): Repository README with usage and project context.
  • Sample output structure (coming soon): Sample folder and output structure not published yet.
  • CLI screenshots (coming soon): CLI usage screenshots not published yet.
  • products.csv / products.json sample (coming soon): Example export files not published yet.
  • Architecture / workflow diagram (coming soon): Workflow diagram not published yet.

Limitations

This tool is intended for authorised ecommerce migration, SEO audit, product-data review, and asset inventory workflows only. It is not designed to bypass authentication, paywalls, robots.txt restrictions, rate limits, or anti-bot controls.
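
One way a crawl could check robots.txt rules before fetching a page is sketched below with the standard library's urllib.robotparser. The source does not say how (or whether) the tool performs this check, so treat this as an assumption: the helper name, the "inventory-bot" user agent, and the sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="inventory-bot"):
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())

    def is_allowed(url):
        return parser.can_fetch(user_agent, url)

    return is_allowed


# Inline sample rules; a real run would read the site's own robots.txt.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /checkout/
Disallow: /account/
"""

is_allowed = build_robots_checker(SAMPLE_ROBOTS)
allowed_product = is_allowed("https://shop.example/products/blue-mug")
allowed_checkout = is_allowed("https://shop.example/checkout/cart")
print(allowed_product, allowed_checkout)
```

With these sample rules the product page is allowed and the checkout page is not. Authorisation from the site owner still matters independently of what robots.txt permits.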

What I Learned

Structured a practical CLI around sitemap discovery, Playwright rendering, fallback extraction, local file organisation, and ethical scraping boundaries.
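
The fallback extraction mentioned above can be illustrated with a small sketch that prefers a JSON-LD Product block and falls back to Open Graph metadata. It uses only the standard library (the project itself lists BeautifulSoup) and covers just two of the sources named earlier. All class and function names are hypothetical, and it ignores cases such as @graph wrappers or image objects.

```python
import json
from html.parser import HTMLParser


class ProductMetaParser(HTMLParser):
    """Collects JSON-LD script bodies and Open Graph <meta> tags."""

    def __init__(self):
        super().__init__()
        self.json_ld_blocks = []
        self.og = {}
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []
        elif tag == "meta" and (attrs.get("property") or "").startswith("og:"):
            self.og[attrs["property"]] = attrs.get("content") or ""

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.json_ld_blocks.append("".join(self._buffer))
            self._in_json_ld = False


def find_product_schema(json_ld_blocks):
    """Return the first JSON-LD object whose @type is exactly "Product"."""
    for raw in json_ld_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the page
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "Product":
                return item
    return None


def extract_product(html):
    """Prefer JSON-LD Product data; otherwise fall back to Open Graph tags."""
    parser = ProductMetaParser()
    parser.feed(html)
    parser.close()
    schema = find_product_schema(parser.json_ld_blocks)
    if schema is not None:
        image = schema.get("image")
        if isinstance(image, str):
            images = [image]
        elif isinstance(image, list):
            images = [item for item in image if isinstance(item, str)]
        else:
            images = []
        return {"name": schema.get("name"), "images": images, "source": "json-ld"}
    og_image = parser.og.get("og:image")
    return {
        "name": parser.og.get("og:title"),
        "images": [og_image] if og_image else [],
        "source": "open-graph",
    }


WITH_SCHEMA = """<html><head>
<script type="application/ld+json">{"@type": "Product", "name": "Blue Mug", "image": ["https://shop.example/img/mug.jpg"]}</script>
<meta property="og:title" content="Blue Mug | Shop">
</head></html>"""

OG_ONLY = """<html><head>
<meta property="og:title" content="Red Scarf">
<meta property="og:image" content="https://shop.example/img/scarf.jpg">
</head></html>"""

schema_result = extract_product(WITH_SCHEMA)
fallback_result = extract_product(OG_ONLY)
print(schema_result["source"], fallback_result["source"])
```

Recording which source produced each record (the "source" field here) is one way to support the richer per-page diagnostics a tool like this might want.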

Next Improvement

Add tests for parsing helpers, richer per-page diagnostics, optional Shopify/WooCommerce export formats, config file support, and clearer sample datasets.
