| Lua name | Lua Type | Description |
|---|---|---|
| scrapPage | function | Define the scraping logic for a single HTML page. See details |
| acceptUrl | function | Specify whether to accept a URL when crawling an XML Sitemap, true by default. See details |
| sws | table | The sws namespace |
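A minimal script skeleton might look like the following. This is an illustrative sketch: the `(page, context)` and `(url, context)` parameter orders are assumptions inferred from the context types documented below, not confirmed signatures.

```lua
-- Hypothetical minimal scraper script.
function scrapPage(page, context)
  -- page is an Html, context is a ScrapingContext
  for a in page:select("a"):iter() do
    local href = a:attr("href")
    if href ~= nil then
      context:sendUrl(href)
    end
  end
end

function acceptUrl(url, context)
  -- context is a CrawlingContext; accept everything by default
  return true
end
```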
All the following variables are defined in the sws table.
The configurable seed
| Lua name | Lua Type | Description |
|---|---|---|
| seedSitemaps | table | A list of sitemap URLs |
| seedPages | table | A list of HTML page URLs |
| seedRobotsTxt | string | A single robots.txt URL |
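For instance, a seed could be set up as follows (the `example.com` URLs are placeholders):

```lua
-- Illustrative seed: crawl starting from a sitemap URL.
sws.seedSitemaps = { "https://example.com/sitemap.xml" }

-- Alternatively, seed with explicit pages or a robots.txt:
-- sws.seedPages = { "https://example.com/a.html", "https://example.com/b.html" }
-- sws.seedRobotsTxt = "https://example.com/robots.txt"
```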
| Lua name | Lua Type | Description |
|---|---|---|
| csvWriterConfig | table | Config used to write output csv records. See details |
| crawlerConfig | table | Config used to customize crawler behavior. See details |
All types are defined in the sws table.
### Html

A parsed HTML page. Its HTML elements can be selected with CSS selectors.
| Lua signature | Description |
|---|---|
| Html:select(selector: string) -> Select | Parses the given CSS selector and returns a Select instance |
| Html:root() -> ElemRef | Returns an ElemRef to the HTML root node |
### Select

A selection made with CSS selectors. Its HTML elements can be iterated over.
| Lua signature | Description |
|---|---|
| Select:iter() -> iterator<ElemRef> | An iterator of ElemRef over the selected HTML nodes |
| Select:enumerate() -> iterator<(integer, ElemRef)> | An iterator of ElemRef and their indices over the selected HTML nodes |
### ElemRef

An HTML element reference. Its descendant HTML elements can be selected with CSS selectors.
| Lua signature | Description |
|---|---|
| ElemRef:select(selector: string) -> Select | Parses the given CSS selector and returns a Select instance over its descendants |
| ElemRef:innerHtml() -> string | The inner HTML string of this element |
| ElemRef:innerText() -> string | Returns the content of all descendant text nodes, concatenated |
| ElemRef:name() -> string | The HTML element name |
| ElemRef:id() -> string | The HTML element id, if any |
| ElemRef:hasClass(class: string) -> boolean | Whether the HTML element has the given class |
| ElemRef:classes() -> table | Returns all classes of the HTML element |
| ElemRef:attr(name: string) -> string | If the HTML element has the given attribute name, returns its value, nil otherwise |
| ElemRef:attrs() -> table | Returns all attributes of the HTML element |
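A sketch of how selected elements could be inspected. The `div.item` selector and `highlight` class are illustrative, and iterating `attrs()` with `pairs` assumes it returns a name-to-value map:

```lua
-- Hypothetical page inspection logic.
function scrapPage(page, context)
  for i, item in page:select("div.item"):enumerate() do
    print(i, item:name(), item:id())
    if item:hasClass("highlight") then
      print(item:innerText())
    end
    -- Assumes attrs() yields a { name = value } table
    for name, value in pairs(item:attrs()) do
      print(name, value)
    end
  end
end
```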
### Date

A helper class for parsing and formatting dates.
### ScrapingContext

The context available when an HTML page is scraped, provided as a parameter to scrapPage.
| Lua signature | Description |
|---|---|
| ScrapingContext:pageLocation() -> PageLocation | Returns the current PageLocation |
| ScrapingContext:sendRecord(rec: Record) | Sends a CSV Record to the current output (either stdout or the specified output file) |
| ScrapingContext:sendUrl(url: string) | Adds the given url to the internal crawling queue so that it will be scraped later |
| ScrapingContext:workerId() -> string | A string identifying the current worker thread. It simply consists of the worker's number (starting from 0) |
| ScrapingContext:robot() -> Robot | Returns the current Robot if it was set up, nil otherwise |
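Putting the pieces together, a scrapPage implementation could emit records and queue further URLs like this. The selectors and record layout are illustrative only:

```lua
-- Sketch: emit one CSV record per <h2> and queue discovered links.
function scrapPage(page, context)
  for h2 in page:select("h2"):iter() do
    local rec = sws.Record()
    rec:pushField(context:pageLocation():get() or "")
    rec:pushField(h2:innerText())
    context:sendRecord(rec)
  end
  for a in page:select("a"):iter() do
    local href = a:attr("href")
    if href then context:sendUrl(href) end
  end
end
```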
### PageLocation

The location of an HTML page.
| Lua signature | Description |
|---|---|
| PageLocation:kind() -> option<Location> | Get the page's Location kind |
| PageLocation:get() -> option<string> | If the current page is a Location.URL, returns its URL; if it is a Location.PATH, returns its path on disk |
### Location

The kind of a PageLocation.
| Lua variant | Description |
|---|---|
| Location.URL | A URL location kind (remote). Relevant when using the crawl subcommand |
| Location.PATH | A PATH location kind (local). Relevant when using the scrap subcommand |
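A script could branch on the location kind, for example to log where a page came from. This is an illustrative sketch:

```lua
-- Sketch: branching on the page location kind.
function scrapPage(page, context)
  local loc = context:pageLocation()
  if loc:kind() == sws.Location.URL then
    print("crawled from: " .. (loc:get() or "?"))
  elseif loc:kind() == sws.Location.PATH then
    print("scraped local file: " .. (loc:get() or "?"))
  end
end
```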
### Record

A dynamic CSV record. CSV formatting can be customized (see details).
| Lua signature | Description |
|---|---|
| Record() -> Record | Creates a new empty CSV record |
| Record:pushField(field: string) | Adds the given field value to this CSV record |
### CrawlingContext

The context available when an XML Sitemap page is crawled, provided as a parameter to acceptUrl.
| Lua signature | Description |
|---|---|
| CrawlingContext:robot() -> Robot | Returns the current Robot if it was set up, nil otherwise |
| CrawlingContext:sitemap() -> Sitemap | The Sitemap format of the sitemap page being crawled |
### Robot

| Lua signature | Description |
|---|---|
| Robot:allowed(url: string) -> boolean | Whether the given url is allowed for crawling. This relies on texting_robots::Robot::allowed |
### Sitemap

The Sitemap formats of an XML Sitemap page.
| Lua variant | Description |
|---|---|
| Sitemap.INDEX | A `<sitemapindex>` format |
| Sitemap.URL_SET | A `<urlset>` format |
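An acceptUrl implementation could combine the sitemap format with robots.txt rules. This sketch assumes it is sensible to always descend into `<sitemapindex>` pages and to defer to the Robot (when one was set up) for `<urlset>` entries:

```lua
-- Sketch: filter crawled URLs using the sitemap kind and robots.txt.
function acceptUrl(url, context)
  if context:sitemap() == sws.Sitemap.INDEX then
    return true
  end
  local robot = context:robot()
  if robot ~= nil then
    return robot:allowed(url)
  end
  return true
end
```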