Introduction

Sitemap Web Scraper (sws) is a tool for simple, flexible, yet performant web page scraping. It consists of a CLI that executes a LuaJIT script and outputs a CSV file.

All the crawling/scraping logic is defined in Lua and executed on multiple threads in Rust. The actual HTML parsing is done in Rust, and standard CSS selectors are also implemented in Rust (using Servo's html5ever and selectors). Both functionalities are exposed through a Lua API for writing flexible scraping logic.

As for the crawling logic, multiple seeding options are available: robots.txt, sitemaps, or a custom list of HTML pages. By default, sitemaps (either provided directly or extracted from robots.txt) are crawled recursively, and the discovered HTML pages are scraped with the provided Lua script. It is also possible to dynamically add page links to the crawling queue while scraping an HTML page. See the crawl subcommand and the Lua scraper for more details.
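
For illustration, a typical crawl could look like the sketch below. The flag names (`--script`, `-o`) and the example script path are assumptions made here for readability, not the documented interface; the crawl subcommand doc lists the actual options.

```sh
# Hypothetical invocation (flag names are assumptions, see the crawl subcommand doc).
# Seed the crawl from the target site's robots.txt/sitemaps, scrape every
# discovered HTML page with the Lua script, and write the records to a CSV file.
sws crawl --script examples/blog_scraper.lua -o results.csv
```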

In addition, the Lua scraping script can be used on HTML pages stored as local files, without any crawling. See the scrap subcommand doc for more details.
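
As a sketch of that mode, the command below runs the same kind of Lua script against HTML files already on disk. Again, the flag names and the glob pattern are illustrative assumptions; the scrap subcommand doc lists the real options.

```sh
# Hypothetical invocation (flag names are assumptions, see the scrap subcommand doc).
# Scrape previously downloaded HTML files with the Lua script, without crawling,
# and write the extracted records to a CSV file.
sws scrap --script examples/blog_scraper.lua --glob "pages/*.html" -o results.csv
```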

Furthermore, the CLI is composed of crates that can be used independently in a custom Rust program.