Lua Scraper

The scraping logic is configured through a single Lua script.

The customizable parameters are:

  • seed: Defines the seed pages for crawling
  • acceptUrl: A function to specify whether to accept a URL when crawling an XML Sitemap
  • scrapPage: A function that defines the scraping logic for a single HTML page

Seed definition

The seed must be one of seedSitemaps, seedPages, or seedRobotsTxt.

Defining a seed is always mandatory. However, it is ignored by the scrap subcommand, whose input is either the specified URL or the specified local files.

⚠️ Defining multiple seeds will throw an error ⚠️

Example

-- A list of sitemap URLs (gzipped sitemaps are supported)
sws.seedSitemaps = {
   "https://www.urbandictionary.com/sitemap-https.xml.gz"
}
-- A list of HTML pages
sws.seedPages = {
   "https://www.urbandictionary.com/define.php?term=Rust",
   "https://www.urbandictionary.com/define.php?term=Lua",
}
-- A single robots.txt URL
sws.seedRobotsTxt = "https://www.urbandictionary.com/robots.txt"

Robot definition

A robots.txt can be used either as:

  • A crawling seed through sws.seedRobotsTxt (see above)

  • A URL validation helper through sws.crawlerConfig's parameter robot (see crawler configuration)

In both cases, the resulting Robot can be used to check whether a given URL is crawlable. This Robot is available through both CrawlingContext (in acceptUrl), and ScrapingContext (in scrapPage).

The underlying Robot is implemented in Rust using the texting_robots crate.

Defining a robot is optional.
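To illustrate, a robots.txt check inside acceptUrl could look like the following sketch. Note that the robot accessor and the allowed method names below are assumptions (loosely based on the texting_robots API), not verified names; check the CrawlingContext reference for the exact API.

```lua
-- Hypothetical sketch: reject URLs disallowed by the robots.txt.
-- NOTE: `context:robot()` and `robot:allowed(url)` are assumed names,
-- see the CrawlingContext reference for the actual accessors.
function acceptUrl(url, context)
   local robot = context:robot()
   if robot and not robot:allowed(url) then
      return false
   end
   return true
end
```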

Function acceptUrl

function acceptUrl(url, context)

A Lua function to specify whether to accept a URL when crawling an XML Sitemap. Its parameters are:

  • url: A URL string that is a candidate for crawling/scraping

  • context: An instance of CrawlingContext

Defining acceptUrl is optional.

Example

From examples/urbandict.lua:

function acceptUrl(url, context)
   if context:sitemap() == sws.Sitemap.URL_SET then
      return string.find(url, "term=")
   else
      -- For sws.Sitemap.INDEX accept all entries
      return true
   end
end

Function scrapPage

function scrapPage(page, context)

A Lua function that defines the scraping logic for a single page. Its parameters are:

  • page: The HTML page being scraped, which can be queried with CSS selectors through its select method

  • context: An instance of ScrapingContext

Defining scrapPage is mandatory.

CSS Selectors

CSS selectors are the most powerful feature of this scraper: they are used to target and extract HTML elements in a flexible and efficient way. You can read more about CSS selectors in the MDN documentation, and find a good reference in the W3C specification.

function scrapPage(page, context)
   for i, def in page:select("section .definition"):enumerate() do
      local word = def:select("h1 a.word"):iter()()
      print(string.format("Definition %i: %s", i, word))
   end
end

The select method expects a CSS selector string; its result can be iterated with iter, or enumerated (with indices) with enumerate. The elements being iterated over also expose a select method of their own, which enables flexible nested selection of HTML elements.

See more details in the reference for the Select class.
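As a sketch of nested selection, the loop below enumerates each definition block and then iterates over all links inside it (the CSS selectors are taken from the Urban Dictionary example; the page structure is assumed):

```lua
-- Sketch: combine enumerate on the outer selection with iter on a
-- nested sub-selection.
function scrapPage(page, context)
   for i, def in page:select("section .definition"):enumerate() do
      -- `def` is itself selectable, so we can drill into it
      for link in def:select("a"):iter() do
         print(string.format("definition %i link: %s", i, link:innerHtml()))
      end
   end
end
```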

Utils

Some utility functions are also exposed in Lua.

  • Date utils:

    The Date helper can parse and format dates:

    local date = "March 18, 2005" -- Extracted from some page's element
    date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d") -- Now date is "2005-03-18"
    

    Under the hood, a Date wraps a Rust chrono::NaiveDate created with NaiveDate::parse_from_str. The format method returns a string formatted according to the specified format (see chrono's format specifiers for the available options).

Example

From examples/urbandict.lua:

function scrapPage(page, context)
   for defIndex, def in page:select("section .definition"):enumerate() do
      local word = def:select("h1 a.word"):iter()()
      if not word then
         word = def:select("h2 a.word"):iter()()
      end
      if not word then
         goto continue
      end
      word = word:innerHtml()

      local contributor = def:select(".contributor"):iter()()
      local date = string.match(contributor:innerHtml(), ".*</a>%s*(.*)")
      date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d")

      local meaning = def:select(".meaning"):iter()()
      meaning = meaning:innerText():gsub("[\n\r]+", " ")

      local example = def:select(".example"):iter()()
      example = example:innerText():gsub("[\n\r]+", " ")

      if word and date and meaning and example then
         local record = sws.Record()
         record:pushField(word)
         record:pushField(defIndex)
         record:pushField(date)
         record:pushField(meaning)
         record:pushField(example)
         context:sendRecord(record)
      end

      ::continue::
   end
end

CSV Record

The Lua Record class wraps a Rust csv::StringRecord struct. In Lua it can be instantiated through sws.Record(). Its pushField(someString) method should be used to add string fields to the record.

It is possible to customize the underlying CSV Writer in Lua through the sws.csvWriterConfig table.

  csv::WriterBuilder method   Lua parameter   Example Lua value   Default Lua value
  delimiter                   delimiter       "\t"                ","
  escape                      escape          ";"                 "\""
  flexible                    flexible        true                false
  terminator                  terminator      CRLF                { Any = "\n" }

Example

sws.csvWriterConfig = {
   delimiter = "\t"
}

function scrapPage(page, context)
    local record = sws.Record()
    record:pushField("foo field")
    record:pushField("bar field")
    context:sendRecord(record)
end