| Lua name | Lua Type | Description |
|----------|----------|-------------|
| scrapPage | function | Defines the scraping logic for a single HTML page. See details |
| acceptUrl | function | Specifies whether to accept a URL when crawling an XML Sitemap; returns true by default. See details |
| sws | table | The sws namespace |
All the following variables are defined in the sws table.
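Taken together, the two entry points can be sketched as a minimal script; the CSS selector and the record content are illustrative, not prescribed:

```lua
-- scrapPage: called for every fetched HTML page.
function scrapPage(page, context)
   for h1 in page:select("h1"):iter() do
      local record = sws.Record()
      record:pushField(h1:innerText())
      context:sendRecord(record)
   end
end

-- acceptUrl: called for every URL found in a crawled XML Sitemap.
function acceptUrl(url, context)
   return true -- mirrors the default: accept everything
end
```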
The configurable seed
| Lua name | Lua Type | Description |
|----------|----------|-------------|
| seedSitemaps | table | A list of sitemap URLs |
| seedPages | table | A list of HTML page URLs |
| seedRobotsTxt | string | A single robots.txt URL |
| Lua name | Lua Type | Description |
|----------|----------|-------------|
| csvWriterConfig | table | Config used to write output CSV records. See details |
| crawlerConfig | table | Config used to customize crawler behavior. See details |
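As a sketch, the seed and the configs are plain assignments at the top of the script. The config keys shown here (delimiter, numWorkers) are assumptions made for illustration only; refer to the linked details for the actual fields:

```lua
-- Seed: a list of HTML pages to scrape directly.
sws.seedPages = {
   "https://example.com/a.html",
   "https://example.com/b.html",
}

-- Hypothetical config keys, shown only to illustrate the shape.
sws.csvWriterConfig = { delimiter = ";" }
sws.crawlerConfig = { numWorkers = 4 }
```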
All types are defined in the sws table.
A parsed HTML page. Its HTML elements can be selected with CSS selectors.
| Lua signature | Description |
|---------------|-------------|
| Html:select(selector: string) -> Select | Parses the given CSS selector and returns a Select instance |
| Html:root() -> ElemRef | Returns an ElemRef to the HTML root node |
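For instance, inside scrapPage the parsed page can be queried like this (the selector is illustrative):

```lua
function scrapPage(page, context)
   local links = page:select("a[href]") -- a Select over the matching elements
   local root = page:root()             -- an ElemRef to the document root
end
```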
A selection made with CSS selectors. Its HTML elements can be iterated.
| Lua signature | Description |
|---------------|-------------|
| Select:iter() -> iterator&lt;ElemRef&gt; | An iterator of ElemRef over the selected HTML nodes |
| Select:enumerate() -> iterator&lt;(integer, ElemRef)&gt; | An iterator of ElemRef and their indices over the selected HTML nodes |
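Both iteration styles can be sketched as follows (selector illustrative):

```lua
-- iter: elements only.
for elem in page:select("li"):iter() do
   -- elem is an ElemRef
end

-- enumerate: elements together with their index in the selection.
for i, elem in page:select("li"):enumerate() do
   -- i is the element's position within the selection
end
```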
An HTML element reference. Its descendant HTML elements can be selected with CSS selectors.
| Lua signature | Description |
|---------------|-------------|
| ElemRef:select(selector: string) -> Select | Parses the given CSS selector and returns a Select instance over its descendants |
| ElemRef:innerHtml() -> string | The inner HTML string of this element |
| ElemRef:innerText() -> string | Returns all the descendant text nodes' content concatenated |
| ElemRef:name() -> string | The HTML element name |
| ElemRef:id() -> string | The HTML element id, if any |
| ElemRef:hasClass(class: string) -> boolean | Whether the HTML element has the given class |
| ElemRef:classes() -> table | Returns all classes of the HTML element |
| ElemRef:attr(name: string) -> string | If the HTML element has the given attribute, returns its value; nil otherwise |
| ElemRef:attrs() -> table | Returns all attributes of the HTML element |
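A hypothetical scrapPage combining these accessors; the selector, class name, and record layout are illustrative:

```lua
function scrapPage(page, context)
   for link in page:select("a"):iter() do
      local href = link:attr("href") -- nil when the attribute is missing
      if href and link:hasClass("external") then
         local record = sws.Record()
         record:pushField(link:innerText())
         record:pushField(href)
         context:sendRecord(record)
      end
   end
end
```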
A helper class for parsing and formatting dates.
The context available when an HTML page is scraped, provided as a parameter to scrapPage.
| Lua signature | Description |
|---------------|-------------|
| ScrapingContext:pageLocation() -> PageLocation | Returns the current PageLocation |
| ScrapingContext:sendRecord(rec: Record) | Sends a CSV Record to the current output (either stdout or the specified output file) |
| ScrapingContext:sendUrl(url: string) | Adds the given url to the internal crawling queue so that it will be scraped later |
| ScrapingContext:workerId() -> string | A string identifying the current worker thread. It simply consists of the worker's number (starting from 0) |
| ScrapingContext:robot() -> Robot | Returns the current Robot if it was set up, nil otherwise |
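As a sketch, a scrapPage can both emit records and feed the crawling queue (the pagination selector is illustrative):

```lua
function scrapPage(page, context)
   -- Queue pagination links so they get scraped later.
   for link in page:select("a.next"):iter() do
      local href = link:attr("href")
      if href then context:sendUrl(href) end
   end
   -- Emit one CSV record for this page.
   local record = sws.Record()
   record:pushField(context:workerId()) -- e.g. "0", "1", ...
   context:sendRecord(record)
end
```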
The location of an HTML page.
| Lua signature | Description |
|---------------|-------------|
| PageLocation:kind() -> option&lt;Location&gt; | Returns the page's Location kind |
| PageLocation:get() -> option&lt;string&gt; | If the current page is a Location.URL, returns its URL; if it is a Location.PATH, returns its path on disk |
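Branching on the location kind inside scrapPage might look like this:

```lua
local loc = context:pageLocation()
if loc:kind() == sws.Location.URL then
   -- crawl subcommand: loc:get() is the page's URL
elseif loc:kind() == sws.Location.PATH then
   -- scrap subcommand: loc:get() is the page's path on disk
end
```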
Location kind.
| Lua variant | Description |
|-------------|-------------|
| Location.URL | A URL location kind (remote). Relevant when using the crawl subcommand |
| Location.PATH | A PATH location kind (local). Relevant when using the scrap subcommand |
A dynamic CSV record. CSV formatting can be customized (see details).
| Lua signature | Description |
|---------------|-------------|
| Record() -> Record | Creates a new empty CSV record |
| Record:pushField(field: string) | Adds the given field value to this CSV record |
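Fields are written in push order, one per CSV column; a minimal sketch:

```lua
local record = sws.Record()       -- empty record
record:pushField("first column")  -- column 1
record:pushField("second column") -- column 2
context:sendRecord(record)        -- context comes from scrapPage
```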
The context available when an XML Sitemap page is crawled, provided as a parameter to acceptUrl.
| Lua signature | Description |
|---------------|-------------|
| CrawlingContext:robot() -> Robot | Returns the current Robot if it was set up, nil otherwise |
| CrawlingContext:sitemap() -> Sitemap | The Sitemap format of the sitemap page being crawled |
A robots.txt checker, backed by the texting_robots crate.

| Lua signature | Description |
|---------------|-------------|
| Robot:allowed(url: string) -> boolean | Whether the given url is allowed for crawling or not. This relies on texting_robots::Robot::allowed |
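A sketch of honoring robots.txt in acceptUrl; it assumes sws.seedRobotsTxt was set so that a Robot is available:

```lua
function acceptUrl(url, context)
   local robot = context:robot()
   if robot then
      return robot:allowed(url)
   end
   return true -- no robots.txt set up: fall back to accepting
end
```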
The possible formats of an XML Sitemap page.

| Lua variant | Description |
|-------------|-------------|
| Sitemap.INDEX | A &lt;sitemapindex&gt; format |
| Sitemap.URL_SET | A &lt;urlset&gt; format |
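For example, acceptUrl can branch on the sitemap format; the /blog/ filter below is purely illustrative:

```lua
function acceptUrl(url, context)
   if context:sitemap() == sws.Sitemap.INDEX then
      return true -- url points to a nested sitemap, keep crawling it
   end
   -- URL_SET: url points to an actual page.
   return string.find(url, "/blog/", 1, true) ~= nil
end
```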