Lua API Overview

Global variables

| Lua name | Lua Type | Description |
| --- | --- | --- |
| `scrapPage` | function | Define the scraping logic for a single HTML page. See details |
| `acceptUrl` | function | Specify whether to accept a URL when crawling an XML Sitemap, `true` by default. See details |
| `sws` | table | The `sws` namespace |

Namespaced variables

All the following variables are defined in the sws table.

Seeds

The configurable seeds.

| Lua name | Lua Type | Description |
| --- | --- | --- |
| `seedSitemaps` | table | A list of sitemap URLs |
| `seedPages` | table | A list of HTML page URLs |
| `seedRobotsTxt` | string | A single robots.txt URL |
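
As a minimal sketch of how a seed might be set in a script, assuming the fields are assigned directly on the `sws` table: the URLs below are hypothetical, and this section does not say whether several seeds can be combined, so only one is active here.

```lua
-- Hypothetical URLs, for illustration only.
sws.seedSitemaps = {
   "https://example.com/sitemap.xml",
   "https://example.com/sitemap-news.xml",
}

-- Alternative seeds (left commented out here):
-- sws.seedPages = { "https://example.com/", "https://example.com/about" }
-- sws.seedRobotsTxt = "https://example.com/robots.txt"
```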

Configurations

| Lua name | Lua Type | Description |
| --- | --- | --- |
| `csvWriterConfig` | table | Config used to write output csv records. See details |
| `crawlerConfig` | table | Config used to customize crawler behavior. See details |

Types

All types are defined in the sws table.

Class Html

A parsed HTML page. Its HTML elements can be selected with CSS selectors.

| Lua signature | Description |
| --- | --- |
| `Html:select(selector: string) -> Select` | Parses the given CSS selector and returns a Select instance |
| `Html:root() -> ElemRef` | Returns an ElemRef to the HTML root node |

Class Select

A selection made with CSS selectors. Its HTML elements can be iterated.

| Lua signature | Description |
| --- | --- |
| `Select:iter() -> iterator<ElemRef>` | An iterator of ElemRef over the selected HTML nodes |
| `Select:enumerate() -> iterator<(integer, ElemRef)>` | An iterator of ElemRef and their indices over the selected HTML nodes |
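
The two iterators can be consumed with Lua's generic `for`. The sketch below assumes `page` is an Html instance; the helper names `headingTexts` and `indexedHeadings` and the `h2` selector are made up for illustration, and the value of the first index reported by `enumerate()` is not assumed.

```lua
-- Collects the text of every <h2> on the page, in selection order.
function headingTexts(page)
   local texts = {}
   for elem in page:select("h2"):iter() do
      texts[#texts + 1] = elem:innerText()
   end
   return texts
end

-- Same traversal, but keeps the index that enumerate() reports for each node.
function indexedHeadings(page)
   local rows = {}
   for index, elem in page:select("h2"):enumerate() do
      rows[#rows + 1] = { index = index, text = elem:innerText() }
   end
   return rows
end
```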

Class ElemRef

An HTML element reference. Its descendant HTML elements can be selected with CSS selectors.

| Lua signature | Description |
| --- | --- |
| `ElemRef:select(selector: string) -> Select` | Parses the given CSS selector and returns a Select instance over its descendants |
| `ElemRef:innerHtml() -> string` | The inner HTML string of this element |
| `ElemRef:innerText() -> string` | Returns the concatenated content of all descendant text nodes |
| `ElemRef:name() -> string` | The HTML element name |
| `ElemRef:id() -> string` | The HTML element id, if any |
| `ElemRef:hasClass(class: string) -> boolean` | Whether the HTML element has the given class |
| `ElemRef:classes() -> table` | Returns all classes of the HTML element |
| `ElemRef:attr(name: string) -> string` | Returns the value of the `name` attribute if the HTML element has it, nil otherwise |
| `ElemRef:attrs() -> table` | Returns all attributes of the HTML element |
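
Because an ElemRef can itself be selected from, selections can be nested. The following is a sketch only: `articleLinks` is a hypothetical helper, the `article` and `a` selectors are arbitrary, and `page` is assumed to be an Html instance.

```lua
-- Returns a list of { href = ..., label = ... } for links inside <article> elements.
function articleLinks(page)
   local links = {}
   for article in page:select("article"):iter() do
      -- ElemRef:select only looks at descendants of this article.
      for anchor in article:select("a"):iter() do
         local href = anchor:attr("href") -- nil when the attribute is absent
         if href then
            links[#links + 1] = { href = href, label = anchor:innerText() }
         end
      end
   end
   return links
end
```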

Class Date

A helper class for parsing and formatting dates.

| Lua signature | Description |
| --- | --- |
| `Date(date: string, fmt: string) -> Date` | Parses the given date according to fmt, uses chrono::NaiveDate::parse_from_str under the hood |
| `Date:format(fmt: string) -> string` | Formats the current date according to fmt, uses chrono::NaiveDate::format under the hood |
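
A date can be normalized by parsing it with one format and formatting it with another. This sketch assumes the constructor is reached as `sws.Date` and uses chrono's `%d`, `%m` and `%Y` specifiers; `toIsoDate` is a hypothetical helper. How a failed parse is reported is not stated here, so the call is wrapped in `pcall` to cover both a raised error and a nil result.

```lua
-- Converts a "day/month/year" string to "year-month-day", or returns nil on failure.
function toIsoDate(text)
   local ok, result = pcall(function()
      return sws.Date(text, "%d/%m/%Y"):format("%Y-%m-%d")
   end)
   if ok then
      return result
   end
   return nil
end
```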

Class ScrapingContext

The context available when an HTML page is scraped, provided as a parameter of scrapPage.

| Lua signature | Description |
| --- | --- |
| `ScrapingContext:pageLocation() -> PageLocation` | Returns the current PageLocation |
| `ScrapingContext:sendRecord(rec: Record)` | Sends a CSV Record to the current output (either stdout or the specified output file) |
| `ScrapingContext:sendUrl(url: string)` | Adds the given url to the internal crawling queue so that it will be scraped later |
| `ScrapingContext:workerId() -> string` | A string identifying the current worker thread. It simply consists of the worker's number (starting from 0) |
| `ScrapingContext:robot() -> Robot` | Returns the current Robot if it was set up, nil otherwise |
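
As an illustration of `sendUrl` combined with `robot`, the sketch below queues the links found on a page and skips those the Robot disallows. It assumes that scrapPage receives the parsed Html page followed by the ScrapingContext, and that the `href` values are already absolute URLs; neither is stated in this section.

```lua
-- Parameter order (page, context) is an assumption.
function scrapPage(page, context)
   local robot = context:robot() -- nil when no Robot was set up
   for link in page:select("a[href]"):iter() do
      local url = link:attr("href")
      if url and (robot == nil or robot:allowed(url)) then
         context:sendUrl(url) -- queued for later scraping
      end
   end
end
```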

Class PageLocation

The location of an HTML page.

| Lua signature | Description |
| --- | --- |
| `PageLocation:kind() -> option<Location>` | Returns the page's Location kind |
| `PageLocation:get() -> option<string>` | Returns the page's URL if it is a Location.URL, or its path on disk if it is a Location.PATH |

Enum Location

Location kind.

| Lua variant | Description |
| --- | --- |
| `Location.URL` | A URL location kind (remote). Relevant when using the crawl subcommand |
| `Location.PATH` | A PATH location kind (local). Relevant when using the scrap subcommand |
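
A script shared between the crawl and scrap subcommands may want to branch on the location kind. The sketch below assumes the variants are reached as `sws.Location.URL` and `sws.Location.PATH` and can be compared with `==`; `describeLocation` is a hypothetical helper.

```lua
-- Returns a short label for where the current page came from.
function describeLocation(context)
   local location = context:pageLocation()
   local kind = location:kind()
   local where = location:get() -- may be nil, hence tostring below
   if kind == sws.Location.URL then
      return "remote: " .. tostring(where)
   elseif kind == sws.Location.PATH then
      return "local: " .. tostring(where)
   end
   return "unknown"
end
```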

Class Record

A dynamic CSV record. CSV formatting can be customized (see details).

| Lua signature | Description |
| --- | --- |
| `Record() -> Record` | Creates a new empty CSV record |
| `Record:pushField(field: string)` | Adds the given field value to this CSV record |
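
A typical use is to build one Record per scraped item and hand it to the context. This is a sketch under assumptions: the constructor is reached as `sws.Record`, scrapPage receives the Html page followed by the ScrapingContext, and the `h1` selector and two-column layout are arbitrary.

```lua
-- Emits one CSV record (title, location) per <h1> found on the page.
function scrapPage(page, context)
   for title in page:select("h1"):iter() do
      local record = sws.Record()
      record:pushField(title:innerText())
      -- get() may return nil, so fall back to an empty field.
      record:pushField(context:pageLocation():get() or "")
      context:sendRecord(record)
   end
end
```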

Class CrawlingContext

The context available when an XML Sitemap page is crawled, provided as a parameter of acceptUrl.

| Lua signature | Description |
| --- | --- |
| `CrawlingContext:robot() -> Robot` | Returns the current Robot if it was set up, nil otherwise |
| `CrawlingContext:sitemap() -> Sitemap` | The Sitemap format of the sitemap page being crawled |

Class Robot

| Lua signature | Description |
| --- | --- |
| `Robot:allowed(url: string) -> boolean` | Whether the given url is allowed for crawling or not. This relies on texting_robots::Robot::allowed |

Enum Sitemap

The formats of an XML Sitemap page.

| Lua variant | Description |
| --- | --- |
| `Sitemap.INDEX` | A `<sitemapindex>` format |
| `Sitemap.URL_SET` | A `<urlset>` format |
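
The Sitemap format can drive different filtering rules in acceptUrl. The sketch below assumes acceptUrl receives the URL followed by the CrawlingContext and that the variants compare with `==`; the `/blog/` filter is a made-up example.

```lua
-- Follows every nested sitemap, but only accepts page URLs under /blog/.
function acceptUrl(url, context)
   if context:sitemap() == sws.Sitemap.INDEX then
      return true
   end
   -- Plain (non-pattern) substring search.
   return string.find(url, "/blog/", 1, true) ~= nil
end
```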