Introduction
Sitemap Web Scraper (sws) is a tool for simple, flexible, yet performant web page scraping. It consists of a CLI that executes a LuaJIT script and outputs a CSV file.
All the crawling/scraping logic is defined in Lua and executed on multiple threads in Rust. The actual parsing of HTML is done in Rust, and standard CSS selectors are also implemented in Rust (using Servo's html5ever and selectors). Both functionalities are accessible through a Lua API for flexible scraping logic.
As for the crawling logic, multiple seeding options are available: robots.txt, sitemaps, or a custom list of HTML pages. By default, sitemaps (either provided directly or extracted from robots.txt) will be crawled recursively, and the discovered HTML pages will be scraped with the provided Lua script. It's also possible to dynamically add page links to the crawling queue when scraping an HTML page, as shown in the sketch below. See the crawl subcommand and the Lua scraper for more details.
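For instance, dynamically adding links to the crawling queue can be done from the scrapPage function with the context's sendUrl method. The following is a minimal sketch; the "a" selector and the URL filter are illustrative assumptions, not part of the official examples:

function scrapPage(page, context)
  -- Enqueue every absolute link found on the page for later crawling/scraping
  for _, link in page:select("a"):enumerate() do
    local href = link:attr("href")
    if href and string.find(href, "^https://") then
      context:sendUrl(href)
    end
  end
end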
Besides, the Lua scraping script can be used on HTML pages stored as local files, without any crawling. See the scrap subcommand doc for more details.
Furthermore, the CLI is composed of crates that can be used independently in a custom Rust program.
Getting Started
Get the binary
Download the latest standalone binary for your OS on the release page, and put it in a location available in your PATH.
Basic example
Let's create a simple urbandict_demo.lua scraper for Urban Dictionary. Copy and paste the following command:
cat << 'EOF' > urbandict_demo.lua
sws.seedPages = {
  "https://www.urbandictionary.com/define.php?term=Lua"
}

function scrapPage(page, context)
  for defIndex, def in page:select("section .definition"):enumerate() do
    local word = def:select("h1 a.word"):iter()()
    if not word then
      word = def:select("h2 a.word"):iter()()
    end
    if not word then
      goto continue
    end
    word = word:innerHtml()

    local contributor = def:select(".contributor"):iter()()
    local date = string.match(contributor:innerHtml(), ".*</a>%s*(.*)")
    date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d")

    local meaning = def:select(".meaning"):iter()()
    meaning = meaning:innerText():gsub("[\n\r]+", " ")

    local example = def:select(".example"):iter()()
    example = example:innerText():gsub("[\n\r]+", " ")

    if word and date and meaning and example then
      local record = sws.Record()
      record:pushField(word)
      record:pushField(defIndex)
      record:pushField(date)
      record:pushField(meaning)
      record:pushField(example)
      context:sendRecord(record)
    end

    ::continue::
  end
end
EOF
You can then run it with:
sws crawl --script urbandict_demo.lua
As we have defined sws.seedPages to be a single page (that is, Urban Dictionary's Lua definition), the scrapPage function will be run on that single page only. There are multiple seeding options, which are detailed in the Lua scraper - Seed definition section.
By default the resulting CSV file is written to stdout; however, the -o (or --output-file) option lets us specify a proper output file. Note that this file can also be appended to or truncated, using the additional flags --append or --truncate respectively. See the crawl subcommand section for more details.
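For instance, reusing the urbandict_demo.lua script from the basic example above, the following invocation writes the records to a file and truncates any previous content (urbandict.csv is just an example file name):

sws crawl --script urbandict_demo.lua -o urbandict.csv --truncate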
Bash completion
You can source the completion script in your ~/.bashrc file with:
echo 'source <(sws completion)' >> ~/.bashrc
Subcommand: crawl
Crawl sitemaps and scrap pages content
Usage: sws crawl [OPTIONS] --script <SCRIPT>
Options:
-s, --script <SCRIPT>
Path to the Lua script that defines scraping logic
-o, --output-file <OUTPUT_FILE>
Optional file that will contain scraped data, stdout otherwise
--append
Append to output file
--truncate
Truncate output file
-q, --quiet
Don't output logs
-h, --help
Print help information
More options in CLI override
Crawler Config
The crawler configurable parameters are:
Parameter | Default | Description |
---|---|---|
user_agent | "SWSbot" | The User-Agent header that will be used in all HTTP requests |
page_buffer | 10_000 | The size of the pages download queue. When the queue is full new downloads are on hold. This parameter is particularly relevant when using concurrent throttling. |
throttle | Concurrent(100) if robot is None, otherwise Delay(N) where N is read from the robots.txt field Crawl-delay: N | A throttling strategy for HTML page downloads. Concurrent(N) means at most N downloads at the same time, PerSecond(N) means at most N downloads per second, Delay(N) means wait for N seconds between downloads |
num_workers | max(1, num_cpus-2) | The number of CPU cores that will be used for scraping pages in parallel using the provided Lua script. |
on_dl_error | SkipAndLog | Behaviour when an error occurs while downloading an HTML page. Other possible value is Fail . |
on_xml_error | SkipAndLog | Behaviour when an error occurs while processing an XML sitemap. Other possible value is Fail. |
on_scrap_error | SkipAndLog | Behaviour when an error occurs while scraping an HTML page in Lua. Other possible value is Fail . |
robot | None | An optional robots.txt URL used to retrieve a specific Throttle::Delay . ⚠ Conflicts with seedRobotsTxt in Lua Scraper, meaning that when robot is defined the seed cannot be a robot too. |
These parameters can be changed through Lua script or CLI arguments.
The priority order is: CLI (highest priority) > Lua > Default values
Lua override
You can override parameters in Lua through the global variable sws.crawlerConfig.
Parameter | Lua name | Example Lua value |
---|---|---|
user_agent | userAgent | "SWSbot" |
page_buffer | pageBuffer | 10000 |
throttle | throttle | { Concurrent = 100 } |
num_workers | numWorkers | 4 |
on_dl_error | onDlError | "SkipAndLog" |
on_xml_error | onXmlError | "Fail" |
on_scrap_error | onScrapError | "SkipAndLog" |
robot | robot | "https://www.google.com/robots.txt" |
Here is an example of crawler configuration parameters set using Lua:
-- You don't have to specify all parameters, only the ones you want to override.
sws.crawlerConfig = {
  userAgent = "SWSbot",
  pageBuffer = 10000,
  throttle = { Concurrent = 100 }, -- or: { PerSecond = 100 }, { Delay = 2 }
  numWorkers = 4,
  onDlError = "SkipAndLog", -- or: "Fail"
  onXmlError = "SkipAndLog",
  onScrapError = "SkipAndLog",
  robot = nil,
}
CLI override
You can override parameters through the CLI arguments.
Parameter | CLI argument name | Example CLI argument value |
---|---|---|
user_agent | --user-agent | 'SWSbot' |
page_buffer | --page-buffer | 10000 |
throttle (Concurrent) | --conc-dl | 100 |
throttle (PerSecond) | --rps | 10 |
throttle (Delay) | --delay | 2 |
num_workers | --num-workers | 4 |
on_dl_error | --on-dl-error | skip-and-log |
on_xml_error | --on-xml-error | fail |
on_scrap_error | --on-scrap-error | skip-and-log |
robot | --robot | 'https://www.google.com/robots.txt' |
Here is an example of crawler configuration parameters set using CLI arguments:
sws crawl --script path/to/scrape_logic.lua -o results.csv \
  --user-agent 'SWSbot' \
  --page-buffer 10000 \
  --conc-dl 100 \
  --num-workers 4 \
  --on-dl-error skip-and-log \
  --on-xml-error fail \
  --on-scrap-error skip-and-log \
  --robot 'https://www.google.com/robots.txt'
Subcommand: scrap
Scrap a single remote page or multiple local pages
Usage: sws scrap [OPTIONS] --script <SCRIPT> <--url <URL>|--files <GLOB>>
Options:
-s, --script <SCRIPT> Path to the Lua script that defines scraping logic
--url <URL> A distant html page to scrap
--files <GLOB> A glob pattern to select local files to scrap
-o, --output-file <OUTPUT_FILE> Optional file that will contain scraped data, stdout otherwise
--append Append to output file
--truncate Truncate output file
--num-workers <NUM_WORKERS> Set the number of CPU workers when scraping local files
--on-error <ON_ERROR> Scrap error handling strategy when scraping local files [possible values: fail, skip-and-log]
-q, --quiet Don't output logs
-h, --help Print help information
The parameters --url and --files are mutually exclusive (only one can be specified).
This subcommand is meant to either:
- Quickly test a Lua script on a given URL (with --url)
- Process HTML pages that have been previously stored on disk (with --files), as illustrated below
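A local batch run over previously saved pages might look like the following; the script path and glob pattern are placeholders:

sws scrap --script path/to/scrape_logic.lua --files 'saved_pages/**/*.html' -o results.csv --num-workers 4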
Lua Scraper
The scraping logic is configured through a single Lua script.
The customizable parameters are:
- seed: Defines the seed pages for crawling
- acceptUrl: A function to specify whether to accept a URL when crawling an XML Sitemap
- scrapPage: A function that defines the scraping logic for a single HTML page
Seed definition
The seed can be one of seedSitemaps, seedPages, or seedRobotsTxt.
Defining a seed is always mandatory. However, when using the scrap subcommand it will be ignored, as the input will be either the specified URL or the specified local files.
⚠️ Defining multiple seeds will throw an error ⚠️
Example
-- A list of sitemap URLs (gzipped sitemaps are supported)
sws.seedSitemaps = {
"https://www.urbandictionary.com/sitemap-https.xml.gz"
}
-- A list of HTML pages
sws.seedPages = {
"https://www.urbandictionary.com/define.php?term=Rust",
"https://www.urbandictionary.com/define.php?term=Lua",
}
-- A single robots.txt URL
sws.seedRobotsTxt = "https://www.urbandictionary.com/robots.txt"
Robot definition
A robots.txt can be used either as:
- A crawling seed, through sws.seedRobotsTxt (see above)
- A URL validation helper, through sws.crawlerConfig's robot parameter (see crawler configuration)
In both cases, the resulting Robot can be used to check whether a given URL is crawlable. This Robot is available through both CrawlingContext (in acceptUrl) and ScrapingContext (in scrapPage).
The underlying Robot implementation in Rust uses the texting_robots crate.
Defining a robot is optional.
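As a minimal sketch (assuming a robot was configured through sws.crawlerConfig's robot parameter), acceptUrl can combine sitemap filtering with the robots.txt rules exposed by the crawling context:

function acceptUrl(url, context)
  -- Reject URLs disallowed by robots.txt, when a robot was configured
  local robot = context:robot()
  if robot and not robot:allowed(url) then
    return false
  end
  return true
end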
Function acceptUrl
function acceptUrl(url, context)
A Lua function to specify whether to accept a URL when crawling an XML Sitemap. Its parameters are:
- url: A URL string that is a candidate for crawling/scraping
- context: An instance of CrawlingContext
Defining acceptUrl is optional.
Example
From examples/urbandict.lua:
function acceptUrl(url, context)
  if context:sitemap() == sws.Sitemap.URL_SET then
    return string.find(url, "term=")
  else
    -- For sws.Sitemap.INDEX accept all entries
    return true
  end
end
Function scrapPage
function scrapPage(page, context)
A Lua function that defines the scraping logic for a single page. Its parameters are:
- page: The Html page being scraped
- context: An instance of ScrapingContext
Defining scrapPage is mandatory.
CSS Selectors
CSS selectors are the most powerful feature of this scraper: they are used to target and extract HTML elements in a flexible and efficient way. You can read more about CSS selectors in the MDN doc, and find a good reference in the W3C doc.
function scrapPage(page, context)
  for i, def in page:select("section .definition"):enumerate() do
    local word = def:select("h1 a.word"):iter()()
    print(string.format("Definition %i: %s", i, word))
  end
end
The select method expects a CSS selector string; its result can be either iterated or enumerated with iter and enumerate respectively. Interestingly, the elements being iterated over also have a select method that allows sub-selection, which enables very flexible HTML element selection.
See more details in the reference for the Select class.
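As a small illustration (the selectors are arbitrary), iter can be used when the element index is not needed, and each iterated ElemRef can itself be sub-selected:

function scrapPage(page, context)
  for def in page:select("section .definition"):iter() do
    -- Sub-select within the current element
    local link = def:select("a"):iter()()
    if link then
      print(link:innerText())
    end
  end
end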
Utils
Some utility functions are also exposed in Lua.
- Date utils: The Date helper can parse and format dates:

  local date = "March 18, 2005" -- Extracted from some page's element
  date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d") -- Now date is "2005-03-18"

  Under the hood a Date wraps a Rust chrono::NaiveDate that is created using NaiveDate::parse_from_str. The format method returns a string formatted with the specified format (see the specifiers for the formatting options).
Example
From examples/urbandict.lua:
function scrapPage(page, context)
  for defIndex, def in page:select("section .definition"):enumerate() do
    local word = def:select("h1 a.word"):iter()()
    if not word then
      word = def:select("h2 a.word"):iter()()
    end
    if not word then
      goto continue
    end
    word = word:innerHtml()

    local contributor = def:select(".contributor"):iter()()
    local date = string.match(contributor:innerHtml(), ".*</a>%s*(.*)")
    date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d")

    local meaning = def:select(".meaning"):iter()()
    meaning = meaning:innerText():gsub("[\n\r]+", " ")

    local example = def:select(".example"):iter()()
    example = example:innerText():gsub("[\n\r]+", " ")

    if word and date and meaning and example then
      local record = sws.Record()
      record:pushField(word)
      record:pushField(defIndex)
      record:pushField(date)
      record:pushField(meaning)
      record:pushField(example)
      context:sendRecord(record)
    end

    ::continue::
  end
end
CSV Record
The Lua Record class wraps a Rust csv::StringRecord struct. In Lua it can be instantiated through sws.Record(). Its pushField(someString) method should be used to add string fields to the record.
It is possible to customize the underlying CSV Writer in Lua through the sws.csvWriterConfig table.
csv::WriterBuilder method | Lua parameter | Example Lua value | Default Lua value |
---|---|---|---|
delimiter | delimiter | "\t" | "," |
escape | escape | ";" | "\"" |
flexible | flexible | true | false |
terminator | terminator | CRLF | { Any = "\n" } |
Example
sws.csvWriterConfig = {
  delimiter = "\t"
}

function scrapPage(page, context)
  local record = sws.Record()
  record:pushField("foo field")
  record:pushField("bar field")
  context:sendRecord(record)
end
Lua API Overview
Global variables
Lua name | Lua Type | Description |
---|---|---|
scrapPage | function | Define the scraping logic for a single HTML page. See details |
acceptUrl | function | Specify whether to accept a URL when crawling an XML Sitemap, true by default. See details |
sws | table | The sws namespace |
Namespaced variables
All the following variables are defined in the sws table.
Seeds
The configurable seed
Lua name | Lua Type | Description |
---|---|---|
seedSitemaps | table | A list of sitemap URLs |
seedPages | table | A list of HTML page URLs |
seedRobotsTxt | string | A single robots.txt URL |
Configurations
Lua name | Lua Type | Description |
---|---|---|
csvWriterConfig | table | Config used to write output csv records. See details |
crawlerConfig | table | Config used to customize crawler behavior. See details |
Types
All types are defined in the sws table.
Class Html
A parsed HTML page. Its HTML elements can be selected with CSS selectors.
Lua signature | Description |
---|---|
Html:select(selector: string) -> Select | Parses the given CSS selector and returns a Select instance |
Html:root() -> ElemRef | Returns an ElemRef to the HTML root node |
Class Select
A selection made with CSS selectors. Its HTML elements can be iterated.
Lua signature | Description |
---|---|
Select:iter() -> iterator<ElemRef> | An iterator of ElemRef over the selected HTML nodes |
Select:enumerate() -> iterator<(integer, ElemRef)> | An iterator of ElemRef and their indices over the selected HTML nodes |
Class ElemRef
An HTML element reference. Its descendant HTML elements can be selected with CSS selectors.
Lua signature | Description |
---|---|
ElemRef:select(selector: string) -> Select | Parses the given CSS selector and returns a Select instance over its descendants |
ElemRef:innerHtml() -> string | The inner HTML string of this element |
ElemRef:innerText() -> string | Returns all the descendant text nodes' content concatenated |
ElemRef:name() -> string | The HTML element name |
ElemRef:id() -> string | The HTML element id, if any |
ElemRef:hasClass(class: string) -> boolean | Whether the HTML element has the given class |
ElemRef:classes() -> table | Returns all classes of the HTML element |
ElemRef:attr(name: string) -> string | If the HTML element has the name attribute, return its value, nil otherwise |
ElemRef:attrs() -> table | Returns all attributes of the HTML element |
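A short sketch of typical element inspection inside scrapPage; the selector and class name are arbitrary:

function scrapPage(page, context)
  local elem = page:select("div"):iter()()
  if elem then
    print(elem:name())              -- e.g. "div"
    print(elem:attr("id"))          -- attribute value, or nil if missing
    print(elem:hasClass("content")) -- true or false
  end
end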
Class Date
A helper class for parsing and formatting dates.
Lua signature | Description |
---|---|
Date(date: string, fmt: string) -> Date | Parses the given date according to fmt, uses chrono::NaiveDate::parse_from_str under the hood |
Date:format(fmt: string) -> string | Formats the current date according to fmt, uses chrono::NaiveDate::format under the hood |
Class ScrapingContext
The context available when an HTML page is scraped, provided as a parameter to scrapPage
Lua signature | Description |
---|---|
ScrapingContext:pageLocation() -> PageLocation | Returns the current PageLocation |
ScrapingContext:sendRecord(rec: Record) | Sends a CSV Record to the current output (either stdout or the specified output file) |
ScrapingContext:sendUrl(url: string) | Adds the given url to the internal crawling queue so that it will be scraped later |
ScrapingContext:workerId() -> string | A string identifying the current worker thread. It simply consists of the worker's number (starting from 0) |
ScrapingContext:robot() -> Robot | Returns current Robot if it was setup, nil otherwise |
Class PageLocation
The location of an HTML page.
Lua signature | Description |
---|---|
PageLocation:kind() -> option<Location> | Get the page's Location kind |
PageLocation:get() -> option<string> | If the current page is a Location.URL, returns its URL; if it's a Location.PATH, returns its path on disk |
Enum Location
Location kind.
Lua variant | Description |
---|---|
Location.URL | A URL location kind (remote). Relevant when using the crawl subcommand |
Location.PATH | A PATH location kind (local). Relevant when using the scrap subcommand |
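A sketch of how a script can branch on the page location, so that the same Lua file works with both the crawl and scrap subcommands:

function scrapPage(page, context)
  local loc = context:pageLocation()
  if loc:kind() == sws.Location.URL then
    -- Remote page downloaded by the crawler
    print("Scraping URL: " .. (loc:get() or ""))
  elseif loc:kind() == sws.Location.PATH then
    -- Local file processed with the scrap subcommand
    print("Scraping file: " .. (loc:get() or ""))
  end
end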
Class Record
A dynamic CSV record. CSV formatting can be customized (see details).
Lua signature | Description |
---|---|
Record() -> Record | Creates a new empty CSV record |
Record:pushField(field: string) | Adds the given field value to this CSV record |
Class CrawlingContext
The context available when an XML Sitemap page is crawled, provided as a parameter to acceptUrl
Lua signature | Description |
---|---|
CrawlingContext:robot() -> Robot | Returns current Robot if it was setup, nil otherwise |
CrawlingContext:sitemap() -> Sitemap | The Sitemap format of the sitemap page being crawled |
Class Robot
Lua signature | Description |
---|---|
Robot:allowed(url: string) -> boolean | Whether the given url is allowed for crawling or not. This relies on texting_robots::Robot::allowed |
Enum Sitemap
The Sitemaps formats of an XML Sitemap page.
Lua variant | Description |
---|---|
Sitemap.INDEX | A <sitemapindex> format |
Sitemap.URL_SET | A <urlset> format |