# Lua Scraper
The scraping logic is configured through a single Lua script. The customizable parameters are:

- `seed`: Defines the seed pages for crawling
- `acceptUrl`: A function to specify whether to accept a URL when crawling an XML Sitemap
- `scrapPage`: A function that defines the scraping logic for a single HTML page
## Seed definition
The seed must be exactly one of `seedSitemaps`, `seedPages`, or `seedRobotsTxt`.

Defining a seed is always mandatory. However, when using the `scrap` subcommand it is ignored, as the input will be either the specified URL or the specified local files.

⚠️ Defining multiple seeds will throw an error ⚠️
### Example
```lua
-- A list of sitemap URLs (gzipped sitemaps are supported)
sws.seedSitemaps = {
    "https://www.urbandictionary.com/sitemap-https.xml.gz"
}

-- A list of HTML pages
sws.seedPages = {
    "https://www.urbandictionary.com/define.php?term=Rust",
    "https://www.urbandictionary.com/define.php?term=Lua",
}

-- A single robots.txt URL
sws.seedRobotsTxt = "https://www.urbandictionary.com/robots.txt"
```
## Robot definition
A robots.txt can be used either as:

- A crawling seed, through `sws.seedRobotsTxt` (see above)
- A URL validation helper, through `sws.crawlerConfig`'s parameter `robot` (see crawler configuration)
In both cases, the resulting Robot can be used to check whether a given URL is crawlable. This Robot is available through both CrawlingContext (in `acceptUrl`) and ScrapingContext (in `scrapPage`).

The underlying Robot implementation in Rust uses the texting_robots crate.

Defining a robot is optional.
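As a sketch of robot-assisted URL filtering, the snippet below checks crawlability from inside `acceptUrl`. Note that the accessor and method names (`robot()`, `allowed(url)`) are assumptions for illustration, not confirmed API; check the CrawlingContext reference for the actual names.

```lua
-- Hypothetical sketch: filter sitemap entries through the Robot.
-- `robot()` and `allowed(url)` are assumed names, not confirmed API.
function acceptUrl(url, context)
    local robot = context:robot() -- assumed accessor on CrawlingContext
    if robot and not robot:allowed(url) then
        return false -- skip URLs disallowed by robots.txt
    end
    return string.find(url, "term=") ~= nil
end
```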
## Function acceptUrl

```lua
function acceptUrl(url, context)
```
A Lua function to specify whether to accept a URL when crawling an XML Sitemap. Its parameters are:

- `url`: A URL string that is a candidate for crawling/scraping
- `context`: An instance of CrawlingContext

Defining `acceptUrl` is optional.
### Example

From `examples/urbandict.lua`:
```lua
function acceptUrl(url, context)
    if context:sitemap() == sws.Sitemap.URL_SET then
        return string.find(url, "term=")
    else
        -- For sws.Sitemap.INDEX accept all entries
        return true
    end
end
```
## Function scrapPage

```lua
function scrapPage(page, context)
```
A Lua function that defines the scraping logic for a single page. Its parameters are:

- `page`: The Html page being scraped
- `context`: An instance of ScrapingContext

Defining `scrapPage` is mandatory.
### CSS Selectors

CSS selectors are the most powerful feature of this scraper: they are used to target and extract HTML elements in a flexible and efficient way. You can read more about CSS selectors in the MDN doc, and find a good reference in the W3C doc.
```lua
function scrapPage(page, context)
    for i, def in page:select("section .definition"):enumerate() do
        local word = def:select("h1 a.word"):iter()()
        print(string.format("Definition %i: %s", i, word))
    end
end
```
The `select` method expects a CSS selector string. Its result can be either iterated or enumerated, with `iter` and `enumerate` respectively. The elements being iterated over also have a `select` method, allowing for sub-selection; this enables very flexible HTML element selection.

See more details in the reference for the Select class.
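To illustrate the sub-selection described above, here is a minimal sketch. The page layout and selectors (`main article`, `h2.title`, `.tag`) are hypothetical; only `select`, `iter`, `enumerate`, and `innerText` come from the examples in this document.

```lua
-- Hypothetical page layout: a list of articles, each with a title and tags.
function scrapPage(page, context)
    for i, article in page:select("main article"):enumerate() do
        -- Sub-select within the current article element only
        local title = article:select("h2.title"):iter()()
        if title then
            print(string.format("Article %i: %s", i, title:innerText()))
        end
        -- Iterate over all tag elements nested in the same article
        for tag in article:select(".tag"):iter() do
            print("  tag: " .. tag:innerText())
        end
    end
end
```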
## Utils
Some utility functions are also exposed in Lua.
- Date utils:

  The `Date` helper can parse and format dates:

  ```lua
  local date = "March 18, 2005" -- Extracted from some page's element
  date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d")
  -- Now date is "2005-03-18"
  ```

  Under the hood a `Date` wraps a Rust chrono::NaiveDate that is created using NaiveDate::parse_from_str. The `format` method will return a string formatted with the specified format (see specifiers for the formatting options).
### Example

From `examples/urbandict.lua`:
```lua
function scrapPage(page, context)
    for defIndex, def in page:select("section .definition"):enumerate() do
        local word = def:select("h1 a.word"):iter()()
        if not word then
            word = def:select("h2 a.word"):iter()()
        end
        if not word then
            goto continue
        end
        word = word:innerHtml()

        local contributor = def:select(".contributor"):iter()()
        local date = string.match(contributor:innerHtml(), ".*</a>%s*(.*)")
        date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d")

        local meaning = def:select(".meaning"):iter()()
        meaning = meaning:innerText():gsub("[\n\r]+", " ")

        local example = def:select(".example"):iter()()
        example = example:innerText():gsub("[\n\r]+", " ")

        if word and date and meaning and example then
            local record = sws.Record()
            record:pushField(word)
            record:pushField(defIndex)
            record:pushField(date)
            record:pushField(meaning)
            record:pushField(example)
            context:sendRecord(record)
        end

        ::continue::
    end
end
```
## CSV Record
The Lua Record class wraps a Rust csv::StringRecord struct. In Lua it can be instantiated through `sws.Record()`. Its `pushField(someString)` method should be used to add string fields to the record.

It is possible to customize the underlying CSV Writer in Lua through the `sws.csvWriterConfig` table.
| csv::WriterBuilder method | Lua parameter | Example Lua value | Default Lua value |
|---|---|---|---|
| delimiter | delimiter | "\t" | "," |
| escape | escape | ";" | "\"" |
| flexible | flexible | true | false |
| terminator | terminator | "CRLF" | { Any = "\n" } |
### Example
```lua
sws.csvWriterConfig = {
    delimiter = "\t"
}

function scrapPage(page, context)
    local record = sws.Record()
    record:pushField("foo field")
    record:pushField("bar field")
    context:sendRecord(record)
end
```