# Crawler Config

The crawler's configurable parameters are:

| Parameter | Default | Description |
|-----------|---------|-------------|
| `user_agent` | `"SWSbot"` | The `User-Agent` header that will be used in all HTTP requests |
| `page_buffer` | `10_000` | The size of the pages download queue. When the queue is full, new downloads are put on hold. This parameter is particularly relevant when using concurrent throttling. |
| `throttle` | `Concurrent(100)` if `robot` is `None`, otherwise `Delay(N)` where `N` is read from the `robots.txt` field `Crawl-delay: N` | The throttling strategy for downloading HTML pages.<br/><br/>`Concurrent(N)` means at most `N` downloads at the same time, `PerSecond(N)` means at most `N` downloads per second, and `Delay(N)` means waiting `N` seconds between downloads. |
| `num_workers` | `max(1, num_cpus - 2)` | The number of CPU cores that will be used to scrape pages in parallel using the provided Lua script |
| `on_dl_error` | `SkipAndLog` | Behaviour when an error occurs while downloading an HTML page. The other possible value is `Fail`. |
| `on_xml_error` | `SkipAndLog` | Behaviour when an error occurs while processing an XML sitemap. The other possible value is `Fail`. |
| `on_scrap_error` | `SkipAndLog` | Behaviour when an error occurs while scraping an HTML page in Lua. The other possible value is `Fail`. |
| `robot` | `None` | An optional `robots.txt` URL used to retrieve a specific `Throttle::Delay`.<br/><br/>⚠ Conflicts with `seedRobotsTxt` in the Lua Scraper, meaning that when `robot` is defined the seed cannot be a robot too. |
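
For example, the default throttle depends on `robot`. Below is a minimal sketch (using the Lua override described in the next section) that assumes a hypothetical `https://example.com/robots.txt` declaring `Crawl-delay: 2`:

```lua
-- Hypothetical robots.txt served at https://example.com/robots.txt:
--   User-agent: *
--   Crawl-delay: 2

sws.crawlerConfig = {
  robot = "https://example.com/robots.txt",
  -- No explicit throttle is set, so the crawler reads `Crawl-delay: 2`
  -- from the robots.txt above and uses Delay(2), i.e. it waits 2 seconds
  -- between downloads instead of the default Concurrent(100).
}
```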

These parameters can be changed through the Lua script or through CLI arguments.

The priority order is: CLI (highest priority) > Lua > Default values
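
For instance, a value set both in Lua and on the command line resolves to the CLI value; the sketch below uses illustrative numbers:

```lua
-- In the Lua script:
sws.crawlerConfig = {
  pageBuffer = 5000, -- Lua-level override of the 10_000 default
}

-- If the crawler is then invoked with `--page-buffer 10000`, the CLI value
-- (10000) wins over the Lua value (5000). Parameters set in neither place
-- keep their defaults, e.g. the user agent stays "SWSbot".
```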

## Lua override

You can override parameters in Lua through the global variable `sws.crawlerConfig`.

| Parameter | Lua name | Example Lua value |
|-----------|----------|-------------------|
| `user_agent` | `userAgent` | `"SWSbot"` |
| `page_buffer` | `pageBuffer` | `10000` |
| `throttle` | `throttle` | `{ Concurrent = 100 }` |
| `num_workers` | `numWorkers` | `4` |
| `on_dl_error` | `onDlError` | `"SkipAndLog"` |
| `on_xml_error` | `onXmlError` | `"Fail"` |
| `on_scrap_error` | `onScrapError` | `"SkipAndLog"` |
| `robot` | `robot` | `"https://www.google.com/robots.txt"` |

Here is an example of crawler configuration parameters set using Lua:

```lua
-- You don't have to specify all parameters, only the ones you want to override.
sws.crawlerConfig = {
  userAgent = "SWSbot",
  pageBuffer = 10000,
  throttle = { Concurrent = 100 }, -- or: { PerSecond = 100 }, { Delay = 2 }
  numWorkers = 4,
  onDlError = "SkipAndLog", -- or: "Fail"
  onXmlError = "SkipAndLog",
  onScrapError = "SkipAndLog",
  robot = nil,
}
```

## CLI override

You can override parameters through CLI arguments.

| Parameter | CLI argument name | Example CLI argument value |
|-----------|-------------------|----------------------------|
| `user_agent` | `--user-agent` | `'SWSbot'` |
| `page_buffer` | `--page-buffer` | `10000` |
| `throttle` (`Concurrent`) | `--conc-dl` | `100` |
| `throttle` (`PerSecond`) | `--rps` | `10` |
| `throttle` (`Delay`) | `--delay` | `2` |
| `num_workers` | `--num-workers` | `4` |
| `on_dl_error` | `--on-dl-error` | `skip-and-log` |
| `on_xml_error` | `--on-xml-error` | `fail` |
| `on_scrap_error` | `--on-scrap-error` | `skip-and-log` |
| `robot` | `--robot` | `'https://www.google.com/robots.txt'` |

Here is an example of crawler configuration parameters set using CLI arguments:

```sh
sws --script path/to/scrape_logic.lua -o results.csv     \
    --user-agent     'SWSbot'                            \
    --page-buffer    10000                               \
    --conc-dl        100                                 \
    --num-workers    4                                   \
    --on-dl-error    skip-and-log                        \
    --on-xml-error   fail                                \
    --on-scrap-error skip-and-log                        \
    --robot          'https://www.google.com/robots.txt'
```