Introduction
Sitemap Web Scraper (sws) is a tool for simple, flexible, yet performant web page scraping. It consists of a CLI that executes a LuaJIT script and outputs a CSV file.
All the crawling/scraping logic is defined in Lua and executed on multiple threads in Rust. The actual parsing of HTML is done in Rust, and standard CSS selectors are also implemented in Rust (using Servo's html5ever and selectors). Both functionalities are accessible through a Lua API for flexible scraping logic.
As for the crawling logic, multiple seeding options are available: robots.txt, sitemaps, or a custom list of HTML pages. By default, sitemaps (either provided directly or extracted from robots.txt) will be crawled recursively, and the discovered HTML pages will be scraped with the provided Lua script. It's also possible to dynamically add page links to the crawling queue when scraping an HTML page, as shown in the sketch below. See the crawl subcommand and the Lua scraper for more details.
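For instance, dynamically adding links to the crawling queue can be done from the scrapPage function with the context's sendUrl method. The following is a minimal sketch; the "a" selector and the URL filter are illustrative assumptions, not part of the official examples:

function scrapPage(page, context)
  -- Enqueue every absolute link found on the page for later crawling/scraping
  for _, link in page:select("a"):enumerate() do
    local href = link:attr("href")
    if href and string.find(href, "^https://") then
      context:sendUrl(href)
    end
  end
end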
Besides, the Lua scraping script can be used on HTML pages stored as local files, without any crawling. See the scrap subcommand doc for more details.
Furthermore, the CLI is composed of crates that can be used independently in a custom Rust program.
Getting Started
Get the binary
Download the latest standalone binary for your OS on the release page, and put it in a location available in your PATH.
Basic example
Let's create a simple urbandict_demo.lua scraper for Urban Dictionary. Copy and paste the following command:
cat << 'EOF' > urbandict_demo.lua
sws.seedPages = {
  "https://www.urbandictionary.com/define.php?term=Lua"
}

function scrapPage(page, context)
  for defIndex, def in page:select("section .definition"):enumerate() do
    local word = def:select("h1 a.word"):iter()()
    if not word then
      word = def:select("h2 a.word"):iter()()
    end
    if not word then
      goto continue
    end
    word = word:innerHtml()

    local contributor = def:select(".contributor"):iter()()
    local date = string.match(contributor:innerHtml(), ".*</a>%s*(.*)")
    date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d")

    local meaning = def:select(".meaning"):iter()()
    meaning = meaning:innerText():gsub("[\n\r]+", " ")

    local example = def:select(".example"):iter()()
    example = example:innerText():gsub("[\n\r]+", " ")

    if word and date and meaning and example then
      local record = sws.Record()
      record:pushField(word)
      record:pushField(defIndex)
      record:pushField(date)
      record:pushField(meaning)
      record:pushField(example)
      context:sendRecord(record)
    end

    ::continue::
  end
end
EOF
You can then run it with:
sws crawl --script urbandict_demo.lua
As we have defined sws.seedPages to be a single page (that is, Urban Dictionary's Lua definition), the scrapPage function will be run on that single page only. There are multiple seeding options, which are detailed in the Lua scraper - Seed definition section.
By default the resulting CSV file is written to stdout; however, the -o (or --output-file) option lets us specify a proper output file. Note that this file can also be appended to or truncated, using the additional flags --append or --truncate respectively. See the crawl subcommand section for more details.
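For instance, reusing the urbandict_demo.lua script from the basic example above, the following invocation writes the records to a file and truncates any previous content (urbandict.csv is just an example file name):

sws crawl --script urbandict_demo.lua -o urbandict.csv --truncate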
Bash completion
You can source the completion script in your ~/.bashrc file with:
echo 'source <(sws completion)' >> ~/.bashrc
Subcommand: crawl
Crawl sitemaps and scrap pages content
Usage: sws crawl [OPTIONS] --script <SCRIPT>
Options:
-s, --script <SCRIPT>
Path to the Lua script that defines scraping logic
-o, --output-file <OUTPUT_FILE>
Optional file that will contain scraped data, stdout otherwise
--append
Append to output file
--truncate
Truncate output file
-q, --quiet
Don't output logs
-h, --help
Print help information
More options in CLI override
Crawler Config
The crawler configurable parameters are:
Parameter | Default | Description |
---|---|---|
user_agent | "SWSbot" | The User-Agent header that will be used in all HTTP requests |
page_buffer | 10_000 | The size of the pages download queue. When the queue is full new downloads are on hold. This parameter is particularly relevant when using concurrent throttling. |
throttle | Concurrent(100) if robot is None, otherwise Delay(N) where N is read from the robots.txt field Crawl-delay: N | A throttling strategy for HTML page downloads. Concurrent(N) means at most N downloads at the same time, PerSecond(N) means at most N downloads per second, Delay(N) means wait for N seconds between downloads |
num_workers | max(1, num_cpus-2) | The number of CPU cores that will be used for scraping pages in parallel using the provided Lua script. |
on_dl_error | SkipAndLog | Behaviour when an error occurs while downloading an HTML page. Other possible value is Fail . |
on_xml_error | SkipAndLog | Behaviour when an error occurs while processing an XML sitemap. Other possible value is Fail. |
on_scrap_error | SkipAndLog | Behaviour when an error occurs while scraping an HTML page in Lua. Other possible value is Fail . |
robot | None | An optional robots.txt URL used to retrieve a specific Throttle::Delay . ⚠ Conflicts with seedRobotsTxt in Lua Scraper, meaning that when robot is defined the seed cannot be a robot too. |
These parameters can be changed through Lua script or CLI arguments.
The priority order is: CLI (highest priority) > Lua > Default values
Lua override
You can override parameters in Lua through the global variable sws.crawlerConfig.
Parameter | Lua name | Example Lua value |
---|---|---|
user_agent | userAgent | "SWSbot" |
page_buffer | pageBuffer | 10000 |
throttle | throttle | { Concurrent = 100 } |
num_workers | numWorkers | 4 |
on_dl_error | onDlError | "SkipAndLog" |
on_xml_error | onXmlError | "Fail" |
on_scrap_error | onScrapError | "SkipAndLog" |
robot | robot | "https://www.google.com/robots.txt" |
Here is an example of crawler configuration parameters set using Lua:
-- You don't have to specify all parameters, only the ones you want to override.
sws.crawlerConfig = {
  userAgent = "SWSbot",
  pageBuffer = 10000,
  throttle = { Concurrent = 100 }, -- or: { PerSecond = 100 }, { Delay = 2 }
  numWorkers = 4,
  onDlError = "SkipAndLog", -- or: "Fail"
  onXmlError = "SkipAndLog",
  onScrapError = "SkipAndLog",
  robot = nil,
}
CLI override
You can override parameters through the CLI arguments.
Parameter | CLI argument name | Example CLI argument value |
---|---|---|
user_agent | --user-agent | 'SWSbot' |
page_buffer | --page-buffer | 10000 |
throttle (Concurrent) | --conc-dl | 100 |
throttle (PerSecond) | --rps | 10 |
throttle (Delay) | --delay | 2 |
num_workers | --num-workers | 4 |
on_dl_error | --on-dl-error | skip-and-log |
on_xml_error | --on-xml-error | fail |
on_scrap_error | --on-scrap-error | skip-and-log |
robot | --robot | 'https://www.google.com/robots.txt' |
Here is an example of crawler configuration parameters set using CLI arguments:
sws crawl --script path/to/scrape_logic.lua -o results.csv \
  --user-agent 'SWSbot' \
  --page-buffer 10000 \
  --conc-dl 100 \
  --num-workers 4 \
  --on-dl-error skip-and-log \
  --on-xml-error fail \
  --on-scrap-error skip-and-log \
  --robot 'https://www.google.com/robots.txt'
Subcommand: scrap
Scrap a single remote page or multiple local pages
Usage: sws scrap [OPTIONS] --script <SCRIPT> <--url <URL>|--files <GLOB>>
Options:
-s, --script <SCRIPT> Path to the Lua script that defines scraping logic
--url <URL> A distant html page to scrap
--files <GLOB> A glob pattern to select local files to scrap
-o, --output-file <OUTPUT_FILE> Optional file that will contain scraped data, stdout otherwise
--append Append to output file
--truncate Truncate output file
--num-workers <NUM_WORKERS> Set the number of CPU workers when scraping local files
--on-error <ON_ERROR> Scrap error handling strategy when scraping local files [possible values: fail, skip-and-log]
-q, --quiet Don't output logs
-h, --help Print help information
The parameters --url and --files are mutually exclusive (only one can be specified).
This subcommand is meant to either:
- Quickly test a Lua script on a given URL (with --url)
- Process HTML pages that have been previously stored on disk (with --files), as illustrated below
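A local batch run over previously saved pages might look like the following; the script path and glob pattern are placeholders:

sws scrap --script path/to/scrape_logic.lua --files 'saved_pages/**/*.html' -o results.csv --num-workers 4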
Lua Scraper
The scraping logic is configured through a single Lua script.
The customizable parameters are:
- seed: Defines the seed pages for crawling
- acceptUrl: A function to specify whether to accept a URL when crawling an XML Sitemap
- scrapPage: A function that defines the scraping logic for a single HTML page
Seed definition
The seed can be one of seedSitemaps, seedPages, or seedRobotsTxt.
Defining a seed is always mandatory. However, when using the scrap subcommand it will be ignored, as the input will be either the specified URL or the specified local files.
⚠️ Defining multiple seeds will throw an error ⚠️
Example
-- A list of sitemap URLs (gzipped sitemaps are supported)
sws.seedSitemaps = {
"https://www.urbandictionary.com/sitemap-https.xml.gz"
}
-- A list of HTML pages
sws.seedPages = {
"https://www.urbandictionary.com/define.php?term=Rust",
"https://www.urbandictionary.com/define.php?term=Lua",
}
-- A single robots.txt URL
sws.seedRobotsTxt = "https://www.urbandictionary.com/robots.txt"
Robot definition
A robots.txt can be used either as:
- A crawling seed, through sws.seedRobotsTxt (see above)
- A URL validation helper, through sws.crawlerConfig's robot parameter (see crawler configuration)
In both cases, the resulting Robot can be used to check whether a given URL is crawlable. This Robot is available through both CrawlingContext (in acceptUrl) and ScrapingContext (in scrapPage).
The underlying Robot implementation in Rust uses the texting_robots crate.
Defining a robot is optional.
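As a minimal sketch (assuming a robot was configured through sws.crawlerConfig's robot parameter), acceptUrl can combine sitemap filtering with the robots.txt rules exposed by the crawling context:

function acceptUrl(url, context)
  -- Reject URLs disallowed by robots.txt, when a robot was configured
  local robot = context:robot()
  if robot and not robot:allowed(url) then
    return false
  end
  return true
end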
Function acceptUrl
function acceptUrl(url, context)
A Lua function to specify whether to accept a URL when crawling an XML Sitemap. Its parameters are:
- url: A URL string that is a candidate for crawling/scraping
- context: An instance of CrawlingContext
Defining acceptUrl is optional.
Example
From examples/urbandict.lua:
function acceptUrl(url, context)
  if context:sitemap() == sws.Sitemap.URL_SET then
    return string.find(url, "term=")
  else
    -- For sws.Sitemap.INDEX accept all entries
    return true
  end
end
Function scrapPage
function scrapPage(page, context)
A Lua function that defines the scraping logic for a single page. Its parameters are:
- page: The Html page being scraped
- context: An instance of ScrapingContext
Defining scrapPage is mandatory.
CSS Selectors
CSS selectors are the most powerful feature of this scraper: they are used to target and extract HTML elements in a flexible and efficient way. You can read more about CSS selectors in the MDN doc, and find a good reference in the W3C doc.
function scrapPage(page, context)
  for i, def in page:select("section .definition"):enumerate() do
    local word = def:select("h1 a.word"):iter()()
    print(string.format("Definition %i: %s", i, word))
  end
end
The select method expects a CSS selector string; its result can be either iterated or enumerated with iter and enumerate respectively. Interestingly, the elements being iterated over also have a select method that allows sub-selection, which enables very flexible HTML element selection.
See more details in the reference for the Select class.
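As a small illustration (the selectors are arbitrary), iter can be used when the element index is not needed, and each iterated ElemRef can itself be sub-selected:

function scrapPage(page, context)
  for def in page:select("section .definition"):iter() do
    -- Sub-select within the current element
    local link = def:select("a"):iter()()
    if link then
      print(link:innerText())
    end
  end
end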
Utils
Some utility functions are also exposed in Lua.
- Date utils: The Date helper can parse and format dates:

  local date = "March 18, 2005" -- Extracted from some page's element
  date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d") -- Now date is "2005-03-18"

  Under the hood a Date wraps a Rust chrono::NaiveDate that is created using NaiveDate::parse_from_str. The format method returns a string formatted with the specified format (see the specifiers for the formatting options).
Example
From examples/urbandict.lua:
function scrapPage(page, context)
  for defIndex, def in page:select("section .definition"):enumerate() do
    local word = def:select("h1 a.word"):iter()()
    if not word then
      word = def:select("h2 a.word"):iter()()
    end
    if not word then
      goto continue
    end
    word = word:innerHtml()

    local contributor = def:select(".contributor"):iter()()
    local date = string.match(contributor:innerHtml(), ".*</a>%s*(.*)")
    date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d")

    local meaning = def:select(".meaning"):iter()()
    meaning = meaning:innerText():gsub("[\n\r]+", " ")

    local example = def:select(".example"):iter()()
    example = example:innerText():gsub("[\n\r]+", " ")

    if word and date and meaning and example then
      local record = sws.Record()
      record:pushField(word)
      record:pushField(defIndex)
      record:pushField(date)
      record:pushField(meaning)
      record:pushField(example)
      context:sendRecord(record)
    end

    ::continue::
  end
end
CSV Record
The Lua Record class wraps a Rust csv::StringRecord struct. In Lua it can be instantiated through sws.Record(). Its pushField(someString) method should be used to add string fields to the record.
It is possible to customize the underlying CSV Writer in Lua through the sws.csvWriterConfig table.
csv::WriterBuilder method | Lua parameter | Example Lua value | Default Lua value |
---|---|---|---|
delimiter | delimiter | "\t" | "," |
escape | escape | ";" | "\"" |
flexible | flexible | true | false |
terminator | terminator | CRLF | { Any = "\n" } |
Example
sws.csvWriterConfig = {
  delimiter = "\t"
}

function scrapPage(page, context)
  local record = sws.Record()
  record:pushField("foo field")
  record:pushField("bar field")
  context:sendRecord(record)
end
Lua API Overview
Global variables
Lua name | Lua Type | Description |
---|---|---|
scrapPage | function | Define the scraping logic for a single HTML page. See details |
acceptUrl | function | Specify whether to accept a URL when crawling an XML Sitemap, true by default. See details |
sws | table | The sws namespace |
Namespaced variables
All the following variables are defined in the sws table.
Seeds
The configurable seed
Lua name | Lua Type | Description |
---|---|---|
seedSitemaps | table | A list of sitemap URLs |
seedPages | table | A list of HTML page URLs |
seedRobotsTxt | string | A single robots.txt URL |
Configurations
Lua name | Lua Type | Description |
---|---|---|
csvWriterConfig | table | Config used to write output csv records. See details |
crawlerConfig | table | Config used to customize crawler behavior. See details |
Types
All types are defined in the sws table.
Class Html
A parsed HTML page. Its HTML elements can be selected with CSS selectors.
Lua signature | Description |
---|---|
Html:select(selector: string) -> Select | Parses the given CSS selector and returns a Select instance |
Html:root() -> ElemRef | Returns an ElemRef to the HTML root node |
Class Select
A selection made with CSS selectors. Its HTML elements can be iterated.
Lua signature | Description |
---|---|
Select:iter() -> iterator<ElemRef> | An iterator of ElemRef over the selected HTML nodes |
Select:enumerate() -> iterator<(integer, ElemRef)> | An iterator of ElemRef and their indices over the selected HTML nodes |
Class ElemRef
An HTML element reference. Its descendant HTML elements can be selected with CSS selectors.
Lua signature | Description |
---|---|
ElemRef:select(selector: string) -> Select | Parses the given CSS selector and returns a Select instance over its descendants |
ElemRef:innerHtml() -> string | The inner HTML string of this element |
ElemRef:innerText() -> string | Returns all the descendant text nodes' content concatenated |
ElemRef:name() -> string | The HTML element name |
ElemRef:id() -> string | The HTML element id, if any |
ElemRef:hasClass(class: string) -> boolean | Whether the HTML element has the given class |
ElemRef:classes() -> table | Returns all classes of the HTML element |
ElemRef:attr(name: string) -> string | If the HTML element has the name attribute, return its value, nil otherwise |
ElemRef:attrs() -> table | Returns all attributes of the HTML element |
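A short sketch of typical element inspection inside scrapPage; the selector and class name are arbitrary:

function scrapPage(page, context)
  local elem = page:select("div"):iter()()
  if elem then
    print(elem:name())              -- e.g. "div"
    print(elem:attr("id"))          -- attribute value, or nil if missing
    print(elem:hasClass("content")) -- true or false
  end
end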
Class Date
A helper class for parsing and formatting dates.
Lua signature | Description |
---|---|
Date(date: string, fmt: string) -> Date | Parses the given date according to fmt, uses chrono::NaiveDate::parse_from_str under the hood |
Date:format(fmt: string) -> string | Formats the current date according to fmt, uses chrono::NaiveDate::format under the hood |
Class ScrapingContext
The context available when an HTML page is scraped, provided as a parameter to scrapPage
Lua signature | Description |
---|---|
ScrapingContext:pageLocation() -> PageLocation | Returns the current PageLocation |
ScrapingContext:sendRecord(rec: Record) | Sends a CSV Record to the current output (either stdout or the specified output file) |
ScrapingContext:sendUrl(url: string) | Adds the given url to the internal crawling queue so that it will be scraped later |
ScrapingContext:workerId() -> string | A string identifying the current worker thread. It simply consists of the worker's number (starting from 0) |
ScrapingContext:robot() -> Robot | Returns current Robot if it was setup, nil otherwise |
Class PageLocation
The location of an HTML page.
Lua signature | Description |
---|---|
PageLocation:kind() -> option<Location> | Get the page's Location kind |
PageLocation:get() -> option<string> | If the current page is a Location.URL, returns its URL; if it's a Location.PATH, returns its path on disk |
Enum Location
Location kind.
Lua variant | Description |
---|---|
Location.URL | A URL location kind (remote). Relevant when using the crawl subcommand |
Location.PATH | A PATH location kind (local). Relevant when using the scrap subcommand |
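A sketch of how a script can branch on the page location, so that the same Lua file works with both the crawl and scrap subcommands:

function scrapPage(page, context)
  local loc = context:pageLocation()
  if loc:kind() == sws.Location.URL then
    -- Remote page downloaded by the crawler
    print("Scraping URL: " .. (loc:get() or ""))
  elseif loc:kind() == sws.Location.PATH then
    -- Local file processed with the scrap subcommand
    print("Scraping file: " .. (loc:get() or ""))
  end
end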
Class Record
A dynamic CSV record. CSV formatting can be customized (see details).
Lua signature | Description |
---|---|
Record() -> Record | Creates a new empty CSV record |
Record:pushField(field: string) | Adds the given field value to this CSV record |
Class CrawlingContext
The context available when an XML Sitemap page is crawled, provided as a parameter to acceptUrl
Lua signature | Description |
---|---|
CrawlingContext:robot() -> Robot | Returns current Robot if it was setup, nil otherwise |
CrawlingContext:sitemap() -> Sitemap | The Sitemap format of the sitemap page being crawled |
Class Robot
Lua signature | Description |
---|---|
Robot:allowed(url: string) -> boolean | Whether the given url is allowed for crawling or not. This relies on texting_robots::Robot::allowed |
Enum Sitemap
The Sitemaps formats of an XML Sitemap page.
Lua variant | Description |
---|---|
Sitemap.INDEX | A <sitemapindex> format |
Sitemap.URL_SET | A <urlset> format |