Lua Scraper
The scraping logic is configured through a single Lua script.
The customizable parameters are:
- seed: Defines the seed pages for crawling
- acceptUrl: A function to specify whether to accept a URL when crawling an XML Sitemap
- scrapPage: A function that defines the scraping logic for a single HTML page
Seed definition
The seed must be one of seedSitemaps, seedPages, or seedRobotsTxt.
Defining a seed is always mandatory. However, when using the scrap subcommand it is ignored, as the input will be either the specified URL or the specified local files.
⚠️ Defining multiple seeds will throw an error ⚠️
Example
-- A list of sitemap URLs (gzipped sitemaps are supported)
sws.seedSitemaps = {
"https://www.urbandictionary.com/sitemap-https.xml.gz"
}
-- A list of HTML pages
sws.seedPages = {
"https://www.urbandictionary.com/define.php?term=Rust",
"https://www.urbandictionary.com/define.php?term=Lua",
}
-- A single robots.txt URL
sws.seedRobotsTxt = "https://www.urbandictionary.com/robots.txt"
Robot definition
A robots.txt file can be used either as:
- A crawling seed, through sws.seedRobotsTxt (see above)
- A URL validation helper, through sws.crawlerConfig's robot parameter (see crawler configuration)
In both cases, the resulting Robot can be used to check whether a given URL is crawlable. This Robot is available through both CrawlingContext (in acceptUrl) and ScrapingContext (in scrapPage). The underlying Robot implementation in Rust uses the texting_robots crate.
Defining a robot is optional.
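As an illustration, a Robot check could be combined with custom logic in acceptUrl. This is only a sketch: the robot() accessor and the allowed() method names below are assumptions modeled on the texting_robots crate's Robot::allowed; check the sws reference for the actual names exposed to Lua.

```lua
-- Hypothetical sketch: "robot()" and "allowed()" are assumed names,
-- mirroring texting_robots' Robot::allowed.
function acceptUrl(url, context)
    local robot = context:robot()
    -- Reject URLs that the site's robots.txt disallows
    if robot and not robot:allowed(url) then
        return false
    end
    return true
end
```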
Function acceptUrl
function acceptUrl(url, context)
A Lua function to specify whether to accept a URL when crawling an XML Sitemap. Its parameters are:
- url: A URL string that is a candidate for crawling/scraping
- context: An instance of CrawlingContext
Defining acceptUrl is optional.
Example
From examples/urbandict.lua
:
function acceptUrl(url, context)
if context:sitemap() == sws.Sitemap.URL_SET then
return string.find(url, "term=")
else
-- For sws.Sitemap.INDEX accept all entries
return true
end
end
Function scrapPage
function scrapPage(page, context)
A Lua function that defines the scraping logic for a single page. Its parameters are:
- page: The Html page being scraped
- context: An instance of ScrapingContext
Defining scrapPage is mandatory.
CSS Selectors
CSS selectors are the most powerful feature of this scraper: they are used to target and extract HTML elements in a flexible and efficient way. You can read more about CSS selectors in the MDN documentation, and find a good reference in the W3C specification.
function scrapPage(page, context)
for i, def in page:select("section .definition"):enumerate() do
local word = def:select("h1 a.word"):iter()()
print(string.format("Definition %i: %s", i, word))
end
end
The select method expects a CSS selector string. Its result can be either iterated or enumerated, with iter and enumerate respectively. Notably, the elements being iterated over also have a select method, allowing sub-selection, which enables very flexible HTML element selection.
See the reference for the Select class for more details.
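Since iter returns a standard Lua iterator, calling it once (the iter()() idiom used throughout these examples) yields the first matching element, or nil when there is no match. A small sketch, assuming the page contains the selected elements:

```lua
function scrapPage(page, context)
    -- Iterate over every matching element
    for link in page:select("a"):iter() do
        print(link:innerHtml())
    end
    -- Calling the iterator once retrieves only the first match (or nil)
    local first = page:select("h1"):iter()()
    if first then
        print(first:innerHtml())
    end
end
```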
Utils
Some utility functions are also exposed in Lua.
- Date utils: The Date helper can parse and format dates:

local date = "March 18, 2005" -- Extracted from some page's element
date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d")
-- Now date is "2005-03-18"

Under the hood a Date wraps a Rust chrono::NaiveDate that is created using NaiveDate::parse_from_str. The format method returns a string formatted with the specified format (see chrono's format specifiers for the formatting options).
Example
From examples/urbandict.lua
:
function scrapPage(page, context)
for defIndex, def in page:select("section .definition"):enumerate() do
local word = def:select("h1 a.word"):iter()()
if not word then
word = def:select("h2 a.word"):iter()()
end
if not word then
goto continue
end
word = word:innerHtml()
local contributor = def:select(".contributor"):iter()()
local date = string.match(contributor:innerHtml(), ".*</a>%s*(.*)")
date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d")
local meaning = def:select(".meaning"):iter()()
meaning = meaning:innerText():gsub("[\n\r]+", " ")
local example = def:select(".example"):iter()()
example = example:innerText():gsub("[\n\r]+", " ")
if word and date and meaning and example then
local record = sws.Record()
record:pushField(word)
record:pushField(defIndex)
record:pushField(date)
record:pushField(meaning)
record:pushField(example)
context:sendRecord(record)
end
::continue::
end
end
CSV Record
The Lua Record class wraps a Rust csv::StringRecord struct. In Lua it can be instantiated through sws.Record(). Its pushField(someString) method should be used to add string fields to the record.
It is possible to customize the underlying CSV Writer in Lua through the sws.csvWriterConfig table.
csv::WriterBuilder method | Lua parameter | Example Lua value | Default Lua value |
---|---|---|---|
delimiter | delimiter | "\t" | "," |
escape | escape | ";" | "\"" |
flexible | flexible | true | false |
terminator | terminator | CRLF | { Any = "\n" } |
Example
sws.csvWriterConfig = {
delimiter = "\t"
}
function scrapPage(page, context)
local record = sws.Record()
record:pushField("foo field")
record:pushField("bar field")
context:sendRecord(record)
end