Introduction

Sitemap Web Scraper (sws) is a tool for simple, flexible, yet performant web page scraping. It consists of a CLI that executes a LuaJIT script and outputs a CSV file.

All the crawling/scraping logic is defined in Lua and executed on multiple threads in Rust. The actual HTML parsing is done in Rust, and standard CSS selectors are also implemented in Rust (using Servo's html5ever and selectors crates). Both functionalities are exposed through a Lua API for flexible scraping logic.

As for the crawling logic, multiple seeding options are available: robots.txt, sitemaps, or a custom list of HTML pages. By default, sitemaps (either provided or extracted from robots.txt) are crawled recursively and the discovered HTML pages are scraped with the provided Lua script. It's also possible to dynamically add page links to the crawling queue when scraping an HTML page. See the crawl subcommand and the Lua scraper for more details.

The Lua scraping script can also be used on HTML pages stored as local files, without any crawling. See the scrap subcommand doc for more details.

Furthermore, the CLI is composed of crates that can be used independently in a custom Rust program.

Getting Started

Get the binary

Download the latest standalone binary for your OS on the release page, and put it in a location available in your PATH.

Basic example

Let's create a simple urbandict_demo.lua scraper for Urban Dictionary. Copy-paste the following command:

cat << 'EOF' > urbandict_demo.lua
sws.seedPages = {
   "https://www.urbandictionary.com/define.php?term=Lua"
}

function scrapPage(page, context)
   for defIndex, def in page:select("section .definition"):enumerate() do
      local word = def:select("h1 a.word"):iter()()
      if not word then
         word = def:select("h2 a.word"):iter()()
      end
      if not word then
         goto continue
      end
      word = word:innerHtml()

      local contributor = def:select(".contributor"):iter()()
      local date = string.match(contributor:innerHtml(), ".*\\?</a>%s*(.*)\\?")
      date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d")

      local meaning = def:select(".meaning"):iter()()
      meaning = meaning:innerText():gsub("[\n\r]+", " ")

      local example = def:select(".example"):iter()()
      example = example:innerText():gsub("[\n\r]+", " ")

      if word and date and meaning and example then
         local record = sws.Record()
         record:pushField(word)
         record:pushField(defIndex)
         record:pushField(date)
         record:pushField(meaning)
         record:pushField(example)
         context:sendRecord(record)
      end

      ::continue::
   end
end
EOF

You can then run it with:

sws crawl --script urbandict_demo.lua

As we have defined sws.seedPages to be a single page (that is Urban Dictionary's Lua definition), the scrapPage function will be run on that single page only. There are multiple seeding options which are detailed in the Lua scraper - Seed definition section.

By default the resulting CSV file is written to stdout; the -o (or --output-file) flag lets us specify a proper output file. Note that this file can also be appended to or truncated, using the additional flags --append or --truncate respectively. See the crawl subcommand section for more details.
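
For example, the following writes the scraped records to urbandict.csv and truncates any previous content (the output filename is arbitrary):

sws crawl --script urbandict_demo.lua -o urbandict.csv --truncate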

Bash completion

You can source the completion script in your ~/.bashrc file with:

echo 'source <(sws completion)' >> ~/.bashrc

Subcommand: crawl

Crawl sitemaps and scrap pages content

Usage: sws crawl [OPTIONS] --script <SCRIPT>

Options:
  -s, --script <SCRIPT>
          Path to the Lua script that defines scraping logic
  -o, --output-file <OUTPUT_FILE>
          Optional file that will contain scraped data, stdout otherwise
      --append
          Append to output file
      --truncate
          Truncate output file
  -q, --quiet
          Don't output logs
  -h, --help
          Print help information

More options are described in the CLI override section.

Crawler Config

The configurable crawler parameters are:

| Parameter | Default | Description |
|---|---|---|
| user_agent | "SWSbot" | The User-Agent header that will be used in all HTTP requests |
| page_buffer | 10_000 | The size of the page download queue. When the queue is full, new downloads are put on hold. This parameter is particularly relevant when using concurrent throttling |
| throttle | Concurrent(100) if robot is None, otherwise Delay(N) where N is read from the robots.txt field Crawl-delay: N | The throttling strategy for HTML page downloads: Concurrent(N) means at most N downloads at the same time, PerSecond(N) means at most N downloads per second, Delay(N) means wait N seconds between downloads |
| num_workers | max(1, num_cpus-2) | The number of CPU cores used to scrape pages in parallel with the provided Lua script |
| on_dl_error | SkipAndLog | Behaviour when an error occurs while downloading an HTML page. The other possible value is Fail |
| on_xml_error | SkipAndLog | Behaviour when an error occurs while processing an XML sitemap. The other possible value is Fail |
| on_scrap_error | SkipAndLog | Behaviour when an error occurs while scraping an HTML page in Lua. The other possible value is Fail |
| robot | None | An optional robots.txt URL used to retrieve a specific Throttle::Delay |

⚠ The robot parameter conflicts with seedRobotsTxt in the Lua Scraper: when robot is defined, the seed cannot be a robots.txt as well.

These parameters can be changed through the Lua script or through CLI arguments.

The priority order is: CLI (highest priority) > Lua > Default values

Lua override

You can override parameters in Lua through the global variable sws.crawlerConfig.

| Parameter | Lua name | Example Lua value |
|---|---|---|
| user_agent | userAgent | "SWSbot" |
| page_buffer | pageBuffer | 10000 |
| throttle | throttle | { Concurrent = 100 } |
| num_workers | numWorkers | 4 |
| on_dl_error | onDlError | "SkipAndLog" |
| on_xml_error | onXmlError | "Fail" |
| on_scrap_error | onScrapError | "SkipAndLog" |
| robot | robot | "https://www.google.com/robots.txt" |

Here is an example of crawler configuration parameters set using Lua:

-- You don't have to specify all parameters, only the ones you want to override.
sws.crawlerConfig = {
  userAgent = "SWSbot",
  pageBuffer = 10000,
  throttle = { Concurrent = 100 }, -- or: { PerSecond = 100 }, { Delay = 2 }
  numWorkers = 4,
  onDlError = "SkipAndLog", -- or: "Fail"
  onXmlError = "SkipAndLog",
  onScrapError = "SkipAndLog",
  robot = nil,
}

CLI override

You can override parameters through the CLI arguments.

| Parameter | CLI argument name | Example CLI argument value |
|---|---|---|
| user_agent | --user-agent | 'SWSbot' |
| page_buffer | --page-buffer | 10000 |
| throttle (Concurrent) | --conc-dl | 100 |
| throttle (PerSecond) | --rps | 10 |
| throttle (Delay) | --delay | 2 |
| num_workers | --num-workers | 4 |
| on_dl_error | --on-dl-error | skip-and-log |
| on_xml_error | --on-xml-error | fail |
| on_scrap_error | --on-scrap-error | skip-and-log |
| robot | --robot | 'https://www.google.com/robots.txt' |

Here is an example of crawler configuration parameters set using CLI arguments:

sws crawl --script path/to/scrape_logic.lua -o results.csv \
    --user-agent     'SWSbot'                            \
    --page-buffer    10000                               \
    --conc-dl        100                                 \
    --num-workers    4                                   \
    --on-dl-error    skip-and-log                        \
    --on-xml-error   fail                                \
    --on-scrap-error skip-and-log                        \
    --robot          'https://www.google.com/robots.txt'

Subcommand: scrap

Scrap a single remote page or multiple local pages

Usage: sws scrap [OPTIONS] --script <SCRIPT> <--url <URL>|--files <GLOB>>

Options:
  -s, --script <SCRIPT>            Path to the Lua script that defines scraping logic
      --url <URL>                  A distant html page to scrap
      --files <GLOB>               A glob pattern to select local files to scrap
  -o, --output-file <OUTPUT_FILE>  Optional file that will contain scraped data, stdout otherwise
      --append                     Append to output file
      --truncate                   Truncate output file
      --num-workers <NUM_WORKERS>  Set the number of CPU workers when scraping local files
      --on-error <ON_ERROR>        Scrap error handling strategy when scraping local files [possible values: fail, skip-and-log]
  -q, --quiet                      Don't output logs
  -h, --help                       Print help information

The parameters --url and --files are mutually exclusive (only one can be specified).

This subcommand is meant to either:

  • Quickly test a Lua script on a given URL (with --url)

  • Process HTML pages that have been previously stored on disk (with --files)
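
For example, using the urbandict_demo.lua script from the Basic example (the dumps/*.html glob is only an illustration of pages previously saved to disk):

sws scrap --script urbandict_demo.lua --url 'https://www.urbandictionary.com/define.php?term=Lua'

sws scrap --script urbandict_demo.lua --files 'dumps/*.html' -o definitions.csv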

Lua Scraper

The scraping logic is configured through a single Lua script.

The customizable parameters are:

  • seed: Defines the seed pages for crawling
  • acceptUrl: A function to specify whether to accept a URL when crawling an XML Sitemap
  • scrapPage: A function that defines the scraping logic for a single HTML page

Seed definition

The seed must be one of seedSitemaps, seedPages, or seedRobotsTxt.

Defining a seed is always mandatory. However, when using the scrap subcommand it will be ignored as the input will be either the specified URL or the specified local files.

⚠️ Defining multiple seeds will throw an error ⚠️

Example

-- A list of sitemap URLs (gzipped sitemaps are supported)
sws.seedSitemaps = {
   "https://www.urbandictionary.com/sitemap-https.xml.gz"
}
-- A list of HTML pages
sws.seedPages = {
   "https://www.urbandictionary.com/define.php?term=Rust",
   "https://www.urbandictionary.com/define.php?term=Lua",
}
-- A single robots.txt URL
sws.seedRobotsTxt = "https://www.urbandictionary.com/robots.txt"

Robot definition

A robots.txt can be used either as:

  • A crawling seed through sws.seedRobotsTxt (see above)

  • A URL validation helper through sws.crawlerConfig's parameter robot (see crawler configuration)

In both cases, the resulting Robot can be used to check whether a given URL is crawlable. This Robot is available through both CrawlingContext (in acceptUrl), and ScrapingContext (in scrapPage).

The underlying Robot implementation relies on the Rust crate texting_robots.

Defining a robot is optional.
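
Here is a minimal sketch of the second usage: a robot set in sws.crawlerConfig is retrieved in acceptUrl to filter out disallowed URLs (the robots.txt URL mirrors the seed example above; remember that when robot is set, the seed must not be a robots.txt as well):

sws.crawlerConfig = {
   robot = "https://www.urbandictionary.com/robots.txt",
}

function acceptUrl(url, context)
   -- context:robot() returns the Robot built from the crawler config, or nil if none was set
   local robot = context:robot()
   if robot then
      return robot:allowed(url)
   end
   return true
end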

Function acceptUrl

function acceptUrl(url, context)

A Lua function to specify whether to accept a URL when crawling an XML Sitemap. Its parameters are:

  • url: A URL string that is a candidate for crawling/scraping

  • context: An instance of CrawlingContext

Defining acceptUrl is optional.

Example

From examples/urbandict.lua:

function acceptUrl(url, context)
   if context:sitemap() == sws.Sitemap.URL_SET then
      return string.find(url, "term=")
   else
      -- For sws.Sitemap.INDEX accept all entries
      return true
   end
end

Function scrapPage

function scrapPage(page, context)

A Lua function that defines the scraping logic for a single page. Its parameters are:

  • page: The HTML page being scraped, an instance of Html

  • context: An instance of ScrapingContext

Defining scrapPage is mandatory.

CSS Selectors

CSS selectors are the most powerful feature of this scraper: they are used to target and extract HTML elements in a flexible and efficient way. You can read more about CSS selectors in the MDN documentation, and find a good reference in the W3C specification.

function scrapPage(page, context)
   for i, def in page:select("section .definition"):enumerate() do
      local word = def:select("h1 a.word"):iter()()
      print(string.format("Definition %i: %s", i, word))
   end
end

The select method expects a CSS selector string; its result can be either iterated or enumerated with iter and enumerate respectively. Interestingly, the elements being iterated over also have a select method of their own, which allows sub-selection and makes HTML element selection very flexible.

See more details in the reference for the Select class.

Utils

Some utility functions are also exposed in Lua.

  • Date utils:

    The Date helper can parse and format dates:

    local date = "March 18, 2005" -- Extracted from some page's element
    date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d") -- Now date is "2005-03-18"
    

    Under the hood a Date wraps a Rust chrono::NaiveDate that is created using NaiveDate::parse_from_str. The format method will return a string formatted with the specified format (see specifiers for the formatting options).

Example

From examples/urbandict.lua:

function scrapPage(page, context)
   for defIndex, def in page:select("section .definition"):enumerate() do
      local word = def:select("h1 a.word"):iter()()
      if not word then
         word = def:select("h2 a.word"):iter()()
      end
      if not word then
         goto continue
      end
      word = word:innerHtml()

      local contributor = def:select(".contributor"):iter()()
      local date = string.match(contributor:innerHtml(), ".*\\?</a>%s*(.*)\\?")
      date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d")

      local meaning = def:select(".meaning"):iter()()
      meaning = meaning:innerText():gsub("[\n\r]+", " ")

      local example = def:select(".example"):iter()()
      example = example:innerText():gsub("[\n\r]+", " ")

      if word and date and meaning and example then
         local record = sws.Record()
         record:pushField(word)
         record:pushField(defIndex)
         record:pushField(date)
         record:pushField(meaning)
         record:pushField(example)
         context:sendRecord(record)
      end

      ::continue::
   end
end

CSV Record

The Lua Record class wraps a Rust csv::StringRecord struct. In Lua it can be instantiated through sws.Record(). Its pushField(someString) method should be used to add string fields to the record.

It is possible to customize the underlying CSV Writer in Lua through the sws.csvWriterConfig table.

| csv::WriterBuilder method | Lua parameter | Example Lua value | Default Lua value |
|---|---|---|---|
| delimiter | delimiter | "\t" | "," |
| escape | escape | ";" | "\"" |
| flexible | flexible | true | false |
| terminator | terminator | "CRLF" | { Any = "\n" } |

Example

sws.csvWriterConfig = {
   delimiter = "\t"
}

function scrapPage(page, context)
    local record = sws.Record()
    record:pushField("foo field")
    record:pushField("bar field")
    context:sendRecord(record)
end

Lua API Overview

Global variables

| Lua name | Lua type | Description |
|---|---|---|
| scrapPage | function | Defines the scraping logic for a single HTML page. See details |
| acceptUrl | function | Specifies whether to accept a URL when crawling an XML Sitemap, true by default. See details |
| sws | table | The sws namespace |

Namespaced variables

All the following variables are defined in the sws table.

Seeds

The configurable seed

| Lua name | Lua type | Description |
|---|---|---|
| seedSitemaps | table | A list of sitemap URLs |
| seedPages | table | A list of HTML page URLs |
| seedRobotsTxt | string | A single robots.txt URL |

Configurations

| Lua name | Lua type | Description |
|---|---|---|
| csvWriterConfig | table | Config used to write output csv records. See details |
| crawlerConfig | table | Config used to customize crawler behavior. See details |

Types

All types are defined in the sws table.

Class Html

A parsed HTML page. Its HTML elements can be selected with CSS selectors.

| Lua signature | Description |
|---|---|
| Html:select(selector: string) -> Select | Parses the given CSS selector and returns a Select instance |
| Html:root() -> ElemRef | Returns an ElemRef to the HTML root node |
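
As a small illustration (assuming the root element is the usual <html> node), the root can serve as a starting point for explicit traversal:

function scrapPage(page, context)
   local root = page:root() -- ElemRef to the document's root element, typically <html>
   print(root:name())
   for title in root:select("head > title"):iter() do
      print(title:innerText())
   end
end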

Class Select

A selection made with CSS selectors. Its HTML elements can be iterated.

| Lua signature | Description |
|---|---|
| Select:iter() -> iterator<ElemRef> | An iterator of ElemRef over the selected HTML nodes |
| Select:enumerate() -> iterator<(integer, ElemRef)> | An iterator of ElemRef and their indices over the selected HTML nodes |

Class ElemRef

An HTML element reference. Its descendant HTML elements can be selected with CSS selectors.

| Lua signature | Description |
|---|---|
| ElemRef:select(selector: string) -> Select | Parses the given CSS selector and returns a Select instance over its descendants |
| ElemRef:innerHtml() -> string | The inner HTML string of this element |
| ElemRef:innerText() -> string | Returns all the descendant text nodes' content concatenated |
| ElemRef:name() -> string | The HTML element name |
| ElemRef:id() -> string | The HTML element id, if any |
| ElemRef:hasClass(class: string) -> boolean | Whether the HTML element has the given class |
| ElemRef:classes() -> table | Returns all classes of the HTML element |
| ElemRef:attr(name: string) -> string | If the HTML element has the name attribute, returns its value, nil otherwise |
| ElemRef:attrs() -> table | Returns all attributes of the HTML element |
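
As an illustration of these methods, the sketch below prints the name, classes, and href attribute of every link on a page (the a[href] selector and the output format are arbitrary, and classes() is assumed to return a plain list of strings):

function scrapPage(page, context)
   for link in page:select("a[href]"):iter() do
      -- classes() is assumed to be a plain list of class name strings
      local classes = table.concat(link:classes(), " ")
      print(string.format("%s [%s] -> %s", link:name(), classes, link:attr("href")))
   end
end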

Class Date

A helper class for parsing and formatting dates.

| Lua signature | Description |
|---|---|
| Date(date: string, fmt: string) -> Date | Parses the given date according to fmt, uses chrono::NaiveDate::parse_from_str under the hood |
| Date:format(fmt: string) -> string | Formats the current date according to fmt, uses chrono::NaiveDate::format under the hood |

Class ScrapingContext

The context available when an HTML page is scraped, provided as a parameter to scrapPage.

| Lua signature | Description |
|---|---|
| ScrapingContext:pageLocation() -> PageLocation | Returns the current PageLocation |
| ScrapingContext:sendRecord(rec: Record) | Sends a CSV Record to the current output (either stdout or the specified output file) |
| ScrapingContext:sendUrl(url: string) | Adds the given url to the internal crawling queue so that it will be scraped later |
| ScrapingContext:workerId() -> string | A string identifying the current worker thread. It simply consists of the worker's number (starting from 0) |
| ScrapingContext:robot() -> Robot | Returns the current Robot if it was set up, nil otherwise |
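
For instance, sendUrl makes it possible to enqueue links discovered while scraping, as mentioned in the introduction. A minimal sketch (only absolute URLs are forwarded here, since the URL string is passed to sendUrl as is):

function scrapPage(page, context)
   for link in page:select("a[href]"):iter() do
      local href = link:attr("href")
      -- Only forward absolute URLs; relative hrefs would need to be resolved first
      if href and string.find(href, "^https?://") then
         context:sendUrl(href)
      end
   end
end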

Class PageLocation

The location of an HTML page.

| Lua signature | Description |
|---|---|
| PageLocation:kind() -> option<Location> | Gets the page's Location kind |
| PageLocation:get() -> option<string> | If the current page is a Location.URL returns its URL, if it's a Location.PATH returns its path on disk |

Enum Location

Location kind.

| Lua variant | Description |
|---|---|
| Location.URL | A URL location kind (remote). Relevant when using the crawl subcommand |
| Location.PATH | A PATH location kind (local). Relevant when using the scrap subcommand |
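
A short sketch of how a script can branch on the page location, e.g. to record where each scraped page came from (the field layout is arbitrary):

function scrapPage(page, context)
   local loc = context:pageLocation()
   local source = "unknown"
   if loc:kind() == sws.Location.URL then
      source = "url: " .. loc:get()
   elseif loc:kind() == sws.Location.PATH then
      source = "file: " .. loc:get()
   end
   local record = sws.Record()
   record:pushField(source)
   context:sendRecord(record)
end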

Class Record

A dynamic CSV record. CSV formatting can be customized (see details).

| Lua signature | Description |
|---|---|
| Record() -> Record | Creates a new empty CSV record |
| Record:pushField(field: string) | Adds the given field value to this CSV record |

Class CrawlingContext

The context available when an XML Sitemap page is crawled, provided as a parameter to acceptUrl.

| Lua signature | Description |
|---|---|
| CrawlingContext:robot() -> Robot | Returns the current Robot if it was set up, nil otherwise |
| CrawlingContext:sitemap() -> Sitemap | The Sitemap format of the sitemap page being crawled |

Class Robot

| Lua signature | Description |
|---|---|
| Robot:allowed(url: string) -> boolean | Whether the given url is allowed for crawling or not. This relies on texting_robots::Robot::allowed |

Enum Sitemap

The possible formats of an XML Sitemap page.

| Lua variant | Description |
|---|---|
| Sitemap.INDEX | A <sitemapindex> format |
| Sitemap.URL_SET | A <urlset> format |