| Lua name | Lua Type | Description |
|----------|----------|-------------|
| scrapPage | function | Defines the scraping logic for a single HTML page. See details |
| acceptUrl | function | Specifies whether to accept a URL when crawling an XML Sitemap; returns true by default. See details |
| sws | table | The sws namespace |
All the following variables are defined in the sws table.
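Taken together, the two entry points can be sketched as a minimal script; the CSS selector and the record content are illustrative, not prescribed:

```lua
-- scrapPage: called for every fetched HTML page.
function scrapPage(page, context)
   for h1 in page:select("h1"):iter() do
      local record = sws.Record()
      record:pushField(h1:innerText())
      context:sendRecord(record)
   end
end

-- acceptUrl: called for every URL found in a crawled XML Sitemap.
function acceptUrl(url, context)
   return true -- mirrors the default: accept everything
end
```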
The configurable seed
| Lua name | Lua Type | Description |
|----------|----------|-------------|
| seedSitemaps | table | A list of sitemap URLs |
| seedPages | table | A list of HTML page URLs |
| seedRobotsTxt | string | A single robots.txt URL |
| Lua name | Lua Type | Description |
|----------|----------|-------------|
| csvWriterConfig | table | Config used to write output CSV records. See details |
| crawlerConfig | table | Config used to customize crawler behavior. See details |
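As a sketch, the seed and the configs are plain assignments at the top of the script. The config keys shown here (delimiter, numWorkers) are assumptions made for illustration only; refer to the linked details for the actual fields:

```lua
-- Seed: a list of HTML pages to scrape directly.
sws.seedPages = {
   "https://example.com/a.html",
   "https://example.com/b.html",
}

-- Hypothetical config keys, shown only to illustrate the shape.
sws.csvWriterConfig = { delimiter = ";" }
sws.crawlerConfig = { numWorkers = 4 }
```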
All types are defined in the sws table.
A parsed HTML page. Its HTML elements can be selected with CSS selectors.
| Lua signature | Description |
|---------------|-------------|
| Html:select(selector: string) -> Select | Parses the given CSS selector and returns a Select instance |
| Html:root() -> ElemRef | Returns an ElemRef to the HTML root node |
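For instance, inside scrapPage the parsed page can be queried like this (the selector is illustrative):

```lua
function scrapPage(page, context)
   local links = page:select("a[href]") -- a Select over the matching elements
   local root = page:root()             -- an ElemRef to the document root
end
```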
A selection made with CSS selectors. Its HTML elements can be iterated.
| Lua signature | Description |
|---------------|-------------|
| Select:iter() -> iterator&lt;ElemRef&gt; | An iterator of ElemRef over the selected HTML nodes |
| Select:enumerate() -> iterator&lt;(integer, ElemRef)&gt; | An iterator of ElemRef and their indices over the selected HTML nodes |
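Both iteration styles can be sketched as follows (selector illustrative):

```lua
-- iter: elements only.
for elem in page:select("li"):iter() do
   -- elem is an ElemRef
end

-- enumerate: elements together with their index in the selection.
for i, elem in page:select("li"):enumerate() do
   -- i is the element's position within the selection
end
```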
An HTML element reference. Its descendant HTML elements can be selected with CSS selectors.
| Lua signature | Description |
|---------------|-------------|
| ElemRef:select(selector: string) -> Select | Parses the given CSS selector and returns a Select instance over its descendants |
| ElemRef:innerHtml() -> string | The inner HTML string of this element |
| ElemRef:innerText() -> string | Returns all the descendant text nodes' content concatenated |
| ElemRef:name() -> string | The HTML element name |
| ElemRef:id() -> string | The HTML element id, if any |
| ElemRef:hasClass(class: string) -> boolean | Whether the HTML element has the given class |
| ElemRef:classes() -> table | Returns all classes of the HTML element |
| ElemRef:attr(name: string) -> string | If the HTML element has the given attribute, returns its value; nil otherwise |
| ElemRef:attrs() -> table | Returns all attributes of the HTML element |
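A hypothetical scrapPage combining these accessors; the selector, class name, and record layout are illustrative:

```lua
function scrapPage(page, context)
   for link in page:select("a"):iter() do
      local href = link:attr("href") -- nil when the attribute is missing
      if href and link:hasClass("external") then
         local record = sws.Record()
         record:pushField(link:innerText())
         record:pushField(href)
         context:sendRecord(record)
      end
   end
end
```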
A helper class for parsing and formatting dates.
The context available when an HTML page is scraped, provided as a parameter to scrapPage.
| Lua signature | Description |
|---------------|-------------|
| ScrapingContext:pageLocation() -> PageLocation | Returns the current PageLocation |
| ScrapingContext:sendRecord(rec: Record) | Sends a CSV Record to the current output (either stdout or the specified output file) |
| ScrapingContext:sendUrl(url: string) | Adds the given url to the internal crawling queue so that it will be scraped later |
| ScrapingContext:workerId() -> string | A string identifying the current worker thread. It simply consists of the worker's number (starting from 0) |
| ScrapingContext:robot() -> Robot | Returns the current Robot if it was set up, nil otherwise |
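As a sketch, a scrapPage can both emit records and feed the crawling queue (the pagination selector is illustrative):

```lua
function scrapPage(page, context)
   -- Queue pagination links so they get scraped later.
   for link in page:select("a.next"):iter() do
      local href = link:attr("href")
      if href then context:sendUrl(href) end
   end
   -- Emit one CSV record for this page.
   local record = sws.Record()
   record:pushField(context:workerId()) -- e.g. "0", "1", ...
   context:sendRecord(record)
end
```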
The location of an HTML page.
| Lua signature | Description |
|---------------|-------------|
| PageLocation:kind() -> option&lt;Location&gt; | Returns the page's Location kind |
| PageLocation:get() -> option&lt;string&gt; | If the current page is a Location.URL, returns its URL; if it is a Location.PATH, returns its path on disk |
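Branching on the location kind inside scrapPage might look like this:

```lua
local loc = context:pageLocation()
if loc:kind() == sws.Location.URL then
   -- crawl subcommand: loc:get() is the page's URL
elseif loc:kind() == sws.Location.PATH then
   -- scrap subcommand: loc:get() is the page's path on disk
end
```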
Location kind.
| Lua variant | Description |
|-------------|-------------|
| Location.URL | A URL location kind (remote). Relevant when using the crawl subcommand |
| Location.PATH | A PATH location kind (local). Relevant when using the scrap subcommand |
A dynamic CSV record. CSV formatting can be customized (see details).
| Lua signature | Description |
|---------------|-------------|
| Record() -> Record | Creates a new empty CSV record |
| Record:pushField(field: string) | Adds the given field value to this CSV record |
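Fields are written in push order, one per CSV column; a minimal sketch:

```lua
local record = sws.Record()       -- empty record
record:pushField("first column")  -- column 1
record:pushField("second column") -- column 2
context:sendRecord(record)        -- context comes from scrapPage
```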
The context available when an XML Sitemap page is crawled, provided as a parameter to acceptUrl.
| Lua signature | Description |
|---------------|-------------|
| CrawlingContext:robot() -> Robot | Returns the current Robot if it was set up, nil otherwise |
| CrawlingContext:sitemap() -> Sitemap | The Sitemap format of the sitemap page being crawled |
A robots.txt checker, backed by the texting_robots crate.

| Lua signature | Description |
|---------------|-------------|
| Robot:allowed(url: string) -> boolean | Whether the given url is allowed for crawling or not. This relies on texting_robots::Robot::allowed |
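A sketch of honoring robots.txt in acceptUrl; it assumes sws.seedRobotsTxt was set so that a Robot is available:

```lua
function acceptUrl(url, context)
   local robot = context:robot()
   if robot then
      return robot:allowed(url)
   end
   return true -- no robots.txt set up: fall back to accepting
end
```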
The possible formats of an XML Sitemap page.

| Lua variant | Description |
|-------------|-------------|
| Sitemap.INDEX | A &lt;sitemapindex&gt; format |
| Sitemap.URL_SET | A &lt;urlset&gt; format |
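For example, acceptUrl can branch on the sitemap format; the /blog/ filter below is purely illustrative:

```lua
function acceptUrl(url, context)
   if context:sitemap() == sws.Sitemap.INDEX then
      return true -- url points to a nested sitemap, keep crawling it
   end
   -- URL_SET: url points to an actual page.
   return string.find(url, "/blog/", 1, true) ~= nil
end
```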