POST /crawls
Create a web crawl data source

Example request:
curl --request POST \
  --url https://developer.qaip.com/api/v1/crawls \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '
{
  "name": "<string>",
  "start_url": "<string>",
  "max_depth": 5,
  "max_num_files": 50000,
  "path_filters": [
    "<string>"
  ],
  "content_pattern": [
    "<string>"
  ],
  "html_only": false,
  "use_browser": false,
  "file_extensions": [
    "<string>"
  ],
  "rrule": "<string>"
}
'
Example response:

{
  "id": "<string>",
  "name": "<string>",
  "start_url": "<string>",
  "status": "unknown",
  "ingestion_setting_id": "<string>",
  "creation_time": 123,
  "start_time": 123,
  "end_time": 123,
  "error": {
    "title": "<string>",
    "message": "<string>"
  }
}
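The cURL request above can also be issued from Python. A minimal sketch using only the standard library; the endpoint URL and headers are taken from the example above, while the function and parameter names are illustrative, not part of the API:

```python
import json
import urllib.request

API_URL = "https://developer.qaip.com/api/v1/crawls"

def build_payload(name: str, start_url: str, max_depth: int = 5,
                  max_num_files: int = 50000, **optional) -> dict:
    """Assemble the request body; optional keys include path_filters,
    content_pattern, html_only, use_browser, file_extensions, and rrule."""
    return {"name": name, "start_url": start_url,
            "max_depth": max_depth, "max_num_files": max_num_files,
            **optional}

def create_crawl(api_key: str, **kwargs) -> dict:
    """POST a new web crawl data source and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(**kwargs)).encode(),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Splitting payload construction from the HTTP call keeps the body testable without hitting the network.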

Authorizations

x-api-key
string
header
required

API key for authentication

Body

application/json
name
string
required

Name of the web crawl data source

Maximum string length: 200
start_url
string
required

Start URL of the web crawl

Maximum string length: 2000
max_depth
integer
required

Maximum crawl depth

Required range: 1 <= x <= 10
max_num_files
integer
required

Maximum number of files to crawl

Required range: 1 <= x <= 100000
path_filters
string[]

Path filters for crawling. The total number of characters across all elements in the array must be 2000 or fewer.

Maximum array length: 2000
content_pattern
string[]

Content patterns for filtering. The total number of characters across all elements in the array must be 2000 or fewer.

Maximum array length: 2000
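The combined-character limit on path_filters and content_pattern can be checked client-side before submitting. A small sketch; the helper name is ours, not part of the API:

```python
def check_filter_list(values: list[str], limit: int = 2000) -> None:
    """Raise if the list length or its combined character count exceeds the limit."""
    if len(values) > limit:
        raise ValueError(f"too many entries: {len(values)} > {limit}")
    total = sum(len(v) for v in values)
    if total > limit:
        raise ValueError(f"combined length {total} exceeds {limit} characters")

check_filter_list(["/docs/*", "/blog/*"])  # passes: 14 characters total
```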
html_only
boolean
default: false

When true, only HTML files will be downloaded

use_browser
boolean
default: false

Whether to use a headless browser for crawling

file_extensions
string[]

File extensions to include in the crawl

Maximum array length: 2000
Maximum string length: 10
rrule
string

Recurrence rule (RFC 5545 RRULE)
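The documented ranges on the required body fields can likewise be enforced before the request leaves the client. A sketch of the checks; the function name is illustrative, and treating empty strings as invalid for the required fields is our assumption:

```python
def validate_crawl_body(body: dict) -> None:
    """Enforce the documented request constraints on a crawl-creation body."""
    if not body.get("name") or len(body["name"]) > 200:
        raise ValueError("name is required and must be at most 200 characters")
    if not body.get("start_url") or len(body["start_url"]) > 2000:
        raise ValueError("start_url is required and must be at most 2000 characters")
    if not 1 <= body.get("max_depth", 0) <= 10:
        raise ValueError("max_depth must be between 1 and 10")
    if not 1 <= body.get("max_num_files", 0) <= 100000:
        raise ValueError("max_num_files must be between 1 and 100000")
```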

Response

Successfully created web crawl data source

id
string
required

Web crawl data source ID

name
string
required

Name of the web crawl ingestion setting

start_url
string
required

Start URL of the web crawl

status
enum<string>
required

Job status

Available options:
unknown,
queued,
not_started,
managed,
starting,
started,
success,
failure,
canceling,
canceled,
deleting,
delete_job_failure
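When polling a crawl job, it helps to know when to stop. A sketch that partitions the status values listed above; which states count as terminal is our assumption, not stated by the API:

```python
# Assumed terminal states: the job will not leave these on its own.
TERMINAL = {"success", "failure", "canceled", "delete_job_failure"}
IN_FLIGHT = {"unknown", "queued", "not_started", "managed",
             "starting", "started", "canceling", "deleting"}

def is_done(status: str) -> bool:
    """True once a job has reached an (assumed) terminal state."""
    if status not in TERMINAL | IN_FLIGHT:
        raise ValueError(f"unrecognized status: {status}")
    return status in TERMINAL
```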
ingestion_setting_id
string

Web crawl ingestion setting ID

creation_time
integer<int64>

Creation time (Unix timestamp in seconds)

start_time
integer<int64>

Job start time (Unix timestamp in seconds)

end_time
integer<int64>

Job end time (Unix timestamp in seconds)
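The three *_time fields are Unix timestamps in seconds, so they convert directly to timezone-aware datetimes with the standard library:

```python
from datetime import datetime, timezone

def ts_to_utc(seconds: int) -> datetime:
    """Convert a Unix timestamp in seconds to an aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

ts_to_utc(0)  # 1970-01-01 00:00:00+00:00
```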

error
object

Error details, with title and message fields