Documentation

API

Web Scraper Cloud can be managed via an HTTPS JSON API. The API allows you to manage sitemaps and scraping jobs, and to download data.

  • Use our Node.js package when developing your application in JS.
  • Use our PHP SDK when developing your application in PHP.

Your API access token can be found on the Web Scraper Cloud API page.

API call limit

By default, each user has a limit of 200 API calls per 15 minutes. The limit can be tracked via the API call response headers:

X-RateLimit-Limit: 200
X-RateLimit-Remaining: 199
X-RateLimit-Reset: 1609372800   // returned only when limit is reached
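
For example, the remaining quota can be checked on any response returned by the API. A minimal sketch using the built-in fetch in Node.js 18+, calling the documented account endpoint:

const response = await fetch(
	"https://api.webscraper.io/api/v1/account?api_token=<YOUR API TOKEN>"
);
console.log(response.headers.get("X-RateLimit-Limit"));     // e.g. 200
console.log(response.headers.get("X-RateLimit-Remaining")); // e.g. 199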

Handle API request limit

The PHP SDK and the Node.js SDK both have a built-in backoff mechanism for when the limit is reached. If the API request limit is reached and a 429 response code is returned, the client is automatically put to sleep and repeats the request once the API request limit has been restored.

This behavior can be disabled so that an exception is thrown instead of sleeping.

// ES6 import
import { Client } from "@webscraperio/api-client-nodejs";

// or CommonJS require
const api = require("@webscraperio/api-client-nodejs");
const Client = api.Client;

const client = new Client({
	token: "your api token",
	useBackoffSleep: false
});

$client = new Client([
	'token' => 'your api token',
	'use_backoff_sleep' => false,
]);
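
When backoff sleep is disabled, hitting the limit surfaces as an error from the client instead of a transparent retry. A minimal sketch of handling that manually; the exact error type thrown by the SDK is not shown in this documentation, so the catch below is deliberately generic, and waiting out the 15-minute window is an assumption based on the limit described above:

async function getSitemapWithRetry(client, sitemapId) {
	try {
		return await client.getSitemap(sitemapId);
	} catch (error) {
		// assumption: the failure was a 429 rate-limit response;
		// wait out the 15-minute window and retry once
		await new Promise((resolve) => setTimeout(resolve, 15 * 60 * 1000));
		return client.getSitemap(sitemapId);
	}
}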

If more API calls are required, please contact support.

Scraping job status

A scraping job can have one of the following statuses (a minimal status-handling sketch follows the list):

  • waiting-to-be-scheduled - the scraping job is waiting in a queue to be scraped;
  • scheduled - the scraping job is waiting for the scraper server and will start in a moment;
  • started - the scraping job is in progress;
  • failed - the website returned more than 50% 4xx or 5xx responses or there were network errors; job execution was stopped and the scraping job was marked as failed, but the user can continue it manually;
  • finished - the scraping job has completed successfully without any failed or empty pages;
  • stopped - the scraping job has been stopped manually by the user;
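
For example, once a scraping job record has been fetched (see Get Scraping Job below), the status field can be branched on. A minimal sketch using the Node.js client:

const scrapingJob = await client.getScrapingJob(500);

switch (scrapingJob.status) {
	case "finished":
		// safe to download the scraped data
		break;
	case "failed":
	case "stopped":
		// inspect problematic URLs or continue the job manually
		break;
	default:
		// waiting-to-be-scheduled, scheduled or started: the job is still in progress
		break;
}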

API calls

Create Sitemap

Method: POST
URL: https://api.webscraper.io/api/v1/sitemap?api_token=<YOUR API TOKEN>
JSON:
{
	"_id": "webscraper-io-landing",
	"startUrl": [
		"http://webscraper.io/"
	],
	"selectors": [
		{
			"parentSelectors": [
				"_root"
			],
			"type": "SelectorText",
			"multiple": false,
			"id": "title",
			"selector": "h1",
			"regex": "",
			"delay": ""
		}
	]
}
Response:
{
	"success": true,
	"data": {
		"id": 123
	}
}
const sitemap = `
{
	"_id": "webscraper-io-landing",
	"startUrl":[
		"http://webscraper.io/"
	],
	"selectors":[
		{
			"parentSelectors":[
				"_root"
			],
			"type": "SelectorText",
			"multiple": false,
			"id": "title",
			"selector": "h1",
			"regex": ""
		}
	]
}
`;

const response = await client.createSitemap(sitemap);
Response:
{
	id: 123
}
$sitemapJSON = '
{
	"_id": "webscraper-io-landing",
	"startUrl": [
		"http://webscraper.io/"
	],
	"selectors": [
		{
			"parentSelectors": [
				"_root"
			],
			"type": "SelectorText",
			"multiple": false,
			"id": "title",
			"selector": "h1",
			"regex": "",
			"delay": ""
		}
	]
}
';

$sitemap = json_decode($sitemapJSON, true);
$response = $client->createSitemap($sitemap);
Response:
[
	'id' => 123
]

Get Sitemap

Method: GET
URL: https://api.webscraper.io/api/v1/sitemap/<Sitemap ID>?api_token=<YOUR API TOKEN>
Response:
{
	"success": true,
	"data": {
		"id": 123
		"name": "webscraper-io-landing",
		"sitemap": "{\"_id\": \"webscraper-io-landing\", ...}",
	}
}
const sitemap = await client.getSitemap(sitemapId);
Response:
{
	id: 123
	name: 'webscraper-io-landing'
	sitemap: '{"_id": "webscraper-io-landing", ...}'
}
$sitemap = $client->getSitemap($sitemapId);
Response:
[
	'id' => 123,
	'name' => 'webscraper-io-landing',
	'sitemap' => '{"_id": "webscraper-io-landing", ...}',
]

Get Sitemaps

Method: GET
URL: https://api.webscraper.io/api/v1/sitemaps?api_token=<YOUR API TOKEN>
Optional query parameters:
- page: &page=2
Response:
{
	"success": true,
	"data": [
		{
			"id": 123
			"name": "webscraper-io-landing",
		},
		{
			"id": 123
			"name": "webscraper-io-landing2",
		}
	],
	"current_page": 1,
	"last_page": 1,
	"total": 2,
	"per_page": 100,
}
let generator = client.getSitemaps();
const sitemaps = await generator.getAllRecords();

// or iterate through all sitemaps manually
generator = client.getSitemaps();
for await (const record of await generator.fetchRecords()) {
	console.log(JSON.stringify(record));
}
Response:
// response (generator)
[
	{
		id: 123,
		name: "webscraper-io-landing"
	},
	{
		id: 124,
		name: "webscraper-io-landing2"
	}
]
$sitemapIterator = $client->getSitemaps();

// iterate through all sitemaps
foreach($sitemapIterator as $sitemap) {
	var_dump($sitemap);
}

// or iterate through all sitemaps while manually handling pagination
$page = 1;
do {
	$sitemaps = $sitemapIterator->getPageData($page);
	foreach($sitemaps as $sitemap) {
		var_dump($sitemap);
	}
	$page++;
} while($page <= $sitemapIterator->getLastPage());
Response:
// response (Iterator)
[
	[
		'id' => 123,
		'name' => 'webscraper-io-landing'
	],
	[
		'id' => 124,
		'name' => 'webscraper-io-landing2'
	]
]

Update Sitemap

Method: PUT
URL: https://api.webscraper.io/api/v1/sitemap/<Sitemap ID>?api_token=<YOUR API TOKEN>
JSON:
{
    "_id": "webscraper-io-landing",
    "startUrl": [
        "http://webscraper.io/"
    ],
    "selectors": [
        {
            "parentSelectors": [
                "_root"
            ],
            "type": "SelectorText",
            "multiple": false,
            "id": "title",
            "selector": "h1",
            "regex": "",
            "delay": ""
        }
    ]
}
Response:
{
    "success": true,
    "data": "ok"
}
const sitemap = `
{
	"_id": "webscraper-io-landing",
	"startUrl":[
		"http://webscraper.io/"
	],
	"selectors":[
		{
			"parentSelectors":[
				"_root"
			],
			"type": "SelectorText",
			"multiple": false,
			"id": "title",
			"selector": "h1",
			"regex": ""
		}
	]
}
`;

const response = await client.updateSitemap(500, sitemap);
Response:
"ok"
$sitemapJSON = '
{
	"_id": "webscraper-io-landing",
	"startUrl": [
		"http://webscraper.io/"
	],
	"selectors": [
		{
			"parentSelectors": [
				"_root"
			],
			"type": "SelectorText",
			"multiple": false,
			"id": "title",
			"selector": "h1",
			"regex": "",
			"delay": ""
		}
	]
}
';

$sitemap = json_decode($sitemapJSON, true);
$response = $client->updateSitemap(500, $sitemap);
Response:
"ok"

Delete Sitemap

Method: DELETE
URL: https://api.webscraper.io/api/v1/sitemap/<Sitemap ID>?api_token=<YOUR API TOKEN>
Response:
{
	"success": true,
	"data": "ok"
}
await client.deleteSitemap(123);
Response:
"ok"
$client->deleteSitemap(123);
Response:
"ok"

Create Scraping Job (Scrape Sitemap)

Method: POST
URL: https://api.webscraper.io/api/v1/scraping-job?api_token=<YOUR API TOKEN>
JSON:
{
	"sitemap_id": 123,
	"driver": "fast", // "fast" or "fulljs"
	"page_load_delay": 2000,
	"request_interval": 2000,
	"proxy": 0, // optional. 0 - no proxy, 1 - use proxy. Or proxy id for Scale plan users
	"start_urls": [		// optional, if set, will overwrite sitemap start URLs
		"https://www.webscraper.io/test-sites/e-commerce/allinone/computers",
		"https://www.webscraper.io/test-sites/e-commerce/allinone/phones"
	],
	"custom_id": "custom-scraping-job-12" // optional, will be included in webhook notification
}
Response:
{
	"success": true,
	"data": {
		"id": 500,
		"custom_id": "custom-scraping-job-12"
	}
}
const response = await client.createScrapingJob({
	sitemap_id: 123,
	driver: "fast", // "fast" or "fulljs"
	page_load_delay: 2000,
	request_interval: 2000,
	proxy: 0, // optional. 0 - no proxy, 1 - use proxy. Or proxy id for Scale plan users
	start_urls: [   // optional, if set, will overwrite sitemap start URLs
		"https://www.webscraper.io/test-sites/e-commerce/allinone/computers",
		"https://www.webscraper.io/test-sites/e-commerce/allinone/phones"
	],
	custom_id: "custom-scraping-job-12" // optional, will be included in webhook notification
})
Response:
{
	id: 500
	custom_id: "custom-scraping-job-12"
}
$response = $client->createScrapingJob([
	'sitemap_id' => 123,
	'driver' => 'fast', // 'fast' or 'fulljs'
	'page_load_delay' => 2000,
	'request_interval' => 2000,
	'proxy' => 0, // optional. 0 - no proxy, 1 - use proxy. Or proxy id for Scale plan users
	'start_urls' => [   // optional, if set, will overwrite sitemap start URLs
		'https://www.webscraper.io/test-sites/e-commerce/allinone/computers',
		'https://www.webscraper.io/test-sites/e-commerce/allinone/phones'
	],
	'custom_id' => 'custom-scraping-job-12' // optional, will be included in webhook notification
]);
Response:
[
	'id' => 500,
	'custom_id' => 'custom-scraping-job-12'
]

Enable Sitemap Scheduler

Method: POST
URL: https://api.webscraper.io/api/v1/sitemap/<Sitemap ID>/enable-scheduler?api_token=<YOUR API TOKEN>
JSON:
{
    "cron_minute": "*/10",
    "cron_hour": "*",
    "cron_day": "*",
    "cron_month": "*",
    "cron_weekday": "*",
    "request_interval": 2000,
    "page_load_delay": 2000,
    "cron_timezone": "Europe/Riga",
    "driver": "fast", // "fast" or "fulljs"
    "proxy": 0 // optional. 0 - no proxy, 1 - use proxy. Or proxy id for Scale plan users
}
Response:
{
    "success": true,
    "data": "ok"
}
const response = await client.enableSitemapScheduler(123, {
	cron_minute: "*/10",
	cron_hour: "*",
	cron_day: "*",
	cron_month: "*",
	cron_weekday: "*",
	request_interval: 2000,
	page_load_delay: 2000,
	cron_timezone: "Europe/Riga",
	driver: "fast", // 'fast' or 'fulljs'
	proxy: 0, // optional. 0 - no proxy, 1 - use proxy. Or proxy id for Scale plan users
});
Response:
"ok"
$response = $client->enableSitemapScheduler(123, [
	'cron_minute' => '*/10',
	'cron_hour' => '*',
	'cron_day' => '*',
	'cron_month' => '*',
	'cron_weekday' => '*',
	'request_interval' => 2000,
	'page_load_delay' => 2000,
	'cron_timezone' => 'Europe/Riga',
	'driver' => 'fast', // 'fast' or 'fulljs'
	'proxy' => 0, // optional. 0 - no proxy, 1 - use proxy. Or proxy id for Scale plan users
]);
Response:
"ok"

Disable Sitemap Scheduler

Method: POST
URL: https://api.webscraper.io/api/v1/sitemap/<Sitemap ID>/disable-scheduler?api_token=<YOUR API TOKEN>
Response:
{
    "success": true,
    "data": "ok"
}
const response = await client.disableSitemapScheduler(123);
Response:
"ok"
$response = $client->disableSitemapScheduler(123);
Response:
"ok"

Get Sitemap Scheduler

Method: GET
URL: https://api.webscraper.io/api/v1/sitemap/<Sitemap ID>/scheduler?api_token=<YOUR API TOKEN>
Response:
{
    "success": true,
    "data": {
        "scheduler_enabled" => true,
        "proxy": 0, // optional. 0 - no proxy, 1 - use proxy. Or proxy id for Scale plan users
        "cron_minute": "*/10",
        "cron_hour": "*",
        "cron_day": "*",
        "cron_month": "*",
        "cron_weekday": "*",
        "request_interval": 2000,
        "page_load_delay": 2000,
        "driver": "fast", // "fast" or "fulljs"
        "cron_timezone": "Europe/Riga"
    }
}
const config = await client.getSitemapScheduler(123);
Response:
{
	scheduler_enabled: true,
	proxy: 0, // optional. 0 - no proxy, 1 - use proxy. Or proxy id for Scale plan users
	cron_minute: "*/10",
	cron_hour: "*",
	cron_day: "*",
	cron_month: "*",
	cron_weekday: "*",
	request_interval: 2000,
	page_load_delay: 2000,
	driver: "fast", // 'fast' or 'fulljs'
	cron_timezone: "Europe/Riga",
}
$response = $client->getSitemapScheduler(123);
Response:
[
    'scheduler_enabled' => true,
    'proxy' => 0, // optional. 0 - no proxy, 1 - use proxy. Or proxy id for Scale plan users
    'cron_minute' => '*/10',
    'cron_hour' => '*',
    'cron_day' => '*',
    'cron_month' => '*',
    'cron_weekday' => '*',
    'request_interval' => 2000,
    'page_load_delay' => 2000,
    'driver' => 'fast', // 'fast' or 'fulljs'
    'cron_timezone' => 'Europe/Riga'
]

Get Scraping Job

Note! You can also receive a push notification once the scraping job has finished. Polling the API repeatedly until the scraping job has finished is not good practice.
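
A minimal sketch of such a notification receiver using only Node's built-in http module. The form-encoded payload and the field name used here (scrapingjob_id) are assumptions; check them against the notification your account actually sends:

import * as http from "http";

http.createServer((req, res) => {
	let body = "";
	req.on("data", (chunk) => (body += chunk));
	req.on("end", () => {
		const params = new URLSearchParams(body); // assumes a form-encoded body
		const scrapingJobId = Number(params.get("scrapingjob_id")); // assumed field name
		console.log(`Scraping job ${scrapingJobId} finished`);
		// fetch the job via client.getScrapingJob() or queue a data download here
		res.end("ok");
	});
}).listen(8080);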

Method: GET
URL: https://api.webscraper.io/api/v1/scraping-job/<SCRAPING JOB ID>?api_token=<YOUR API TOKEN>
Response:
{
	"success": true,
	"data": {
		"id": 500,
		"custom_id": "custom-scraping-job-12",
		"sitemap_name": "webscraper-io-landing",
		"status": "scheduling",
		"sitemap_id": 123,
		"test_run": 0,
		"jobs_scheduled": 0,
		"jobs_executed": 0,
		"jobs_failed": 0,
		"jobs_empty": 0,
		"stored_record_count": 0,
		"request_interval": 2000,
		"page_load_delay": 2000,
		"driver": "fast",
		"scheduled": 0, // scraping job was started by scheduler
		"time_created": 1493370624, // unix timestamp
		"scraping_duration": 60, // seconds
	}
}
const scrapingJob = await client.getScrapingJob(500);
Response:
{
	id: 500
	custom_id: "custom-scraping-job-12"
	sitemap_name: "webscraper-io-landing"
	status: "scheduling"
	sitemap_id: 123
	test_run: 0
	jobs_scheduled: 0
	jobs_executed: 0
	jobs_failed: 0
	jobs_empty: 0
	stored_record_count: 0
	request_interval: 2000
	page_load_delay: 2000
	driver: "fast"
	scheduled: 0 // scraping job was started by scheduler
	time_created: "1493370624" // unix timestamp
}
$scrapingJob = $client->getScrapingJob(500);
Response:
[
	'id' => 500,
	'custom_id' => 'custom-scraping-job-12',
	'sitemap_name' => 'webscraper-io-landing',
	'status' => 'scheduling',
	'sitemap_id' => 123,
	'test_run' => 0,
	'jobs_scheduled' => 0,
	'jobs_executed' => 0,
	'jobs_failed' => 0,
	'jobs_empty' => 0,
	'stored_record_count' => 0,
	'request_interval' => 2000,
	'page_load_delay' => 2000,
	'driver' => 'fast',
	'scheduled' => 0, // scraping job was started by scheduler
	'time_created' => 1493370624, // unix timestamp
	'scraping_duration' => 60, // seconds
]

Get Scraping Jobs

Method: GET
URL: https://api.webscraper.io/api/v1/scraping-jobs?api_token=<YOUR API TOKEN>
Optional query parameters:
- page: &page=2
- sitemap: &sitemap_id=123
Response:
{
	"success": true,
	"data": [
		{
			"id": 500,
			"custom_id": "custom-scraping-job-12",
			"sitemap_name": "webscraper-io-landing",
			"status": "scheduling",
			"sitemap_id": 123,
			"test_run": 0,
			"jobs_scheduled": 0,
			"jobs_executed": 0,
			"jobs_failed": 0,
			"jobs_empty": 0,
			"stored_record_count": 0,
			"request_interval": 2000,
			"page_load_delay": 2000,
			"driver": "fast",
			"scheduled": 0, // scraping job was started by scheduler
			"time_created": 1493370624, // unix timestamp
			"scraping_duration": 60, // seconds
		},
		{
		...
		}
	],
	"current_page": 1,
	"last_page": 1,
	"total": 5,
	"per_page": 100,
}
let generator = client.getScrapingJobs({
	sitemap_id: 123, // optional
});
const scrapingJobs = await generator.getAllRecords();

// or iterate through all scraping jobs manually
generator = client.getScrapingJobs({
	sitemap_id: 123, // optional
});
for await (const record of await generator.fetchRecords()) {
	console.log(JSON.stringify(record));
}
Response:
// response (generator)
[
	{
		id: 500,
		custom_id: "custom-scraping-job-12",
		sitemap_name: "webscraper-io-landing",
		status: "scheduling",
		sitemap_id: 123,
		test_run: 0,
		jobs_scheduled: 0,
		jobs_executed: 0,
		jobs_failed: 0,
		jobs_empty: 0,
		stored_record_count: 0,
		request_interval: 2000,
		page_load_delay: 2000,
		driver: "fast",
		scheduled: 0, // scraping job was started by scheduler
		time_created: "1493370624", // unix timestamp
	},
	{
		...
	},
]
$scrapingJobIterator = $client->getScrapingJobs($sitemapId = null);

// iterate through all scraping jobs
foreach($scrapingJobIterator as $scrapingJob) {
	var_dump($scrapingJob);
}

// or iterate through all scraping jobs while manually handling pagination
$page = 1;
do {
	$scrapingJobs = $scrapingJobIterator->getPageData($page);
	foreach($scrapingJobs as $scrapingJob) {
		var_dump($scrapingJob);
	}
	$page++;
} while($page <= $scrapingJobIterator->getLastPage());
Response:
// response (iterator)
[
	[
		'id' => 500,
		'custom_id' => 'custom-scraping-job-12',
		'sitemap_name' => 'webscraper-io-landing',
		'status' => 'scheduling',
		'sitemap_id' => 123,
		'test_run' => 0,
		'jobs_scheduled' => 0,
		'jobs_executed' => 0,
		'jobs_failed' => 0,
		'jobs_empty' => 0,
		'stored_record_count' => 0,
		'request_interval' => 2000,
		'page_load_delay' => 2000,
		'driver' => 'fast',
		'scheduled' => 0, // scraping job was started by scheduler
		'time_created' => 1493370624, // unix timestamp
		'scraping_duration' => 60, // seconds
	],
	[
	...
	],
]

Download scraped data in JSON format

Note! A good practice is to move the download/import task to a queue job instead of handling it within the web request; a minimal sketch of this follows below.
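
A minimal in-process sketch of that idea, deferring the download to a periodically drained queue. A production setup would use a real queue backend; the interval and file path are arbitrary, and client is assumed to be an already-configured API client as in the examples below:

const downloadQueue = [];

// called e.g. from a webhook handler when a scraping job finishes
function enqueueDownload(scrapingJobId) {
	downloadQueue.push(scrapingJobId);
}

// worker loop: take one queued job at a time and download its data
setInterval(async () => {
	const scrapingJobId = downloadQueue.shift();
	if (scrapingJobId === undefined) return;
	const outputFile = `/tmp/scrapingjob-${scrapingJobId}.json`;
	await client.downloadScrapingJobJSON(scrapingJobId, outputFile);
}, 1000);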

Method: GET
URL: https://api.webscraper.io/api/v1/scraping-job/<SCRAPING JOB ID>/json?api_token=<YOUR API TOKEN>
Response:
// File with one JSON string per line.
{"title":"Nokia 123","price":"$24.99","description":"7 day battery"}
{"title":"ProBook","price":"$739.99","description":"14\", Core i5 2.6GHz, 4GB, 500GB, Win7 Pro 64bit"}
{"title":"ThinkPad X240","price":"$1311.99","description":"12.5\", Core i5-4300U, 8GB, 240GB SSD, Win7 Pro 64bit"}
{"title":"Aspire E1-572G","price":"$581.99","description":"15.6\", Core i5-4200U, 8GB, 1TB, Radeon R7 M265, Windows 8.1"}
import * as fs from "fs";
// JsonReader is the client package's built-in reader used below (assumed import path)
import { JsonReader } from "@webscraperio/api-client-nodejs";

const scrapingJobId = 500;
const outputFile = `/tmp/scrapingjob-${scrapingJobId}.json`;
await client.downloadScrapingJobJSON(scrapingJobId, outputFile);

// read data from file with built in JSON reader
const reader = new JsonReader(outputFile);
for await (const row of reader.fetchRows()) {
	console.log(`ROW: ${JSON.stringify(row)} \n`);
}

// remove temporary file
fs.unlinkSync(outputFile);
Response:
// File with one JSON string per line.
{"title":"Nokia 123","price":"$24.99","description":"7 day battery"}
{"title":"ProBook","price":"$739.99","description":"14\", Core i5 2.6GHz, 4GB, 500GB, Win7 Pro 64bit"}
{"title":"ThinkPad X240","price":"$1311.99","description":"12.5\", Core i5-4300U, 8GB, 240GB SSD, Win7 Pro 64bit"}
{"title":"Aspire E1-572G","price":"$581.99","description":"15.6\", Core i5-4200U, 8GB, 1TB, Radeon R7 M265, Windows 8.1"}
use WebScraper\ApiClient\Reader\JsonReader;

$scrapingJobId = 500;
$outputFile = "/tmp/scrapingjob{$scrapingJobId}.json";
$client->downloadScrapingJobJSON($scrapingJobId, $outputFile);

// read data from file with built in JSON reader
$reader = new JsonReader($outputFile);
$rows = $reader->fetchRows();
foreach ($rows as $row) {
	echo "ROW: " . json_encode($row) . "\n";
}

// remove temporary file
unlink($outputFile);
Response:
// File with one JSON string per line.
{"title":"Nokia 123","price":"$24.99","description":"7 day battery"}
{"title":"ProBook","price":"$739.99","description":"14\", Core i5 2.6GHz, 4GB, 500GB, Win7 Pro 64bit"}
{"title":"ThinkPad X240","price":"$1311.99","description":"12.5\", Core i5-4300U, 8GB, 240GB SSD, Win7 Pro 64bit"}
{"title":"Aspire E1-572G","price":"$581.99","description":"15.6\", Core i5-4200U, 8GB, 1TB, Radeon R7 M265, Windows 8.1"}

Download scraped data in CSV format

Note! We recommend using the JSON format, since different products use multiple incompatible CSV notations. For example:

  • CSV Standard: https://tools.ietf.org/html/rfc4180
  • MS Excel cannot handle escape sequences from the CSV standard
  • PHP's default CSV implementation is incorrect. See https://wiki.php.net/rfc/kill-csv-escaping

Method: GET
URL: https://api.webscraper.io/api/v1/scraping-job/<SCRAPING JOB ID>/csv?api_token=<YOUR API TOKEN>
Response:
// CSV file
web-scraper-order,title,Color
1494492462-1,Fluffy Cat,blue
1494492462-1,Fluffy Dog,white
import * as fs from "fs";

const scrapingJobId = 500;
const outputFile = `/tmp/scrapingjob-${scrapingJobId}.csv`;
await client.downloadScrapingJobCSV(scrapingJobId, outputFile);

// Use a library that supports the RFC 4180 standard to parse the CSV file.
// That said, we recommend downloading the data in JSON format instead, since
// CSV readers and writers have been implemented incorrectly in multiple
// applications and programming languages.

// remove temporary file
fs.unlinkSync(outputFile);
Response:
// CSV file
web-scraper-order,title,Color
1494492462-1,Fluffy Cat,blue
1494492462-1,Fluffy Dog,white
use League\Csv\Reader;

$scrapingJobId = 500;
$outputFile = "/tmp/scrapingjob-data{$scrapingJobId}.csv";
$client->downloadScrapingJobCSV($scrapingJobId, $outputFile);

$records = Reader::createFromPath($outputFile)->fetchAssoc();

foreach($records as $record) {
	// Import records into database. Importing records in bulk will speed up
	// the process.
}

// remove temporary file
unlink($outputFile);
Response:
// CSV file
web-scraper-order,title,Color
1494492462-1,Fluffy Cat,blue
1494492462-1,Fluffy Dog,white

Get Scraping Job Problematic URLs

Returns the empty and failed URLs for a specific scraping job.

Method: GET
URL: https://api.webscraper.io/api/v1/scraping-job/<SCRAPING JOB ID>/problematic-urls?api_token=<YOUR API TOKEN>
Optional query parameters:
- page: &page=2
Response:
{
	"success": true,
	"data": [
		{
			"url": "https://webscraper.io/empty",
			"type": "empty",
		},
		{
			"url": "https://webscraper.io/failed",
			"type": "failed",
		},
		{
		...
		}
	],
	"current_page": 1,
	"last_page": 1,
	"total": 5,
	"per_page": 100,
}
let generator = client.getProblematicUrls(scrapingJobId);
const problematicUrls = await generator.getAllRecords();

// or iterate through all problematic urls manually
generator = client.getProblematicUrls(scrapingJobId);
for await (const record of await generator.fetchRecords()) {
	console.log(JSON.stringify(record));
}
Response:
// response (generator)
[
	{
		url: "https://webscraper.io/empty",
		type: "empty",
	},
	{
		url: "https://webscraper.io/failed",
		type: "failed",
	},
	{
		...
	},
]
$problematicUrlsIterator = $client->getProblematicUrls($scrapingJobId);

// iterate through all urls
foreach($problematicUrlsIterator as $problematicUrl) {
	var_dump($problematicUrl);
}

// or iterate through all problematic urls while manually handling pagination
$page = 1;
do {
	$problematicUrls = $problematicUrlsIterator->getPageData($page);
	foreach($problematicUrls as $problematicUrl) {
		var_dump($problematicUrl);
	}
	$page++;
} while($page <= $problematicUrlsIterator->getLastPage());
Response:
// response (iterator)
[
	[
		'url' => 'https://webscraper.io/empty',
		'type' => 'empty',
	],
	[
		'url' => 'https://webscraper.io/failed',
		'type' => 'failed',
	],
	[
	...
	],
]

Get Scraping Job Data Quality

Method: GET
URL: https://api.webscraper.io/api/v1/scraping-job/<SCRAPING JOB ID>/data-quality?api_token=<YOUR API TOKEN>
Response:
{
    "success": true,
    "data": {
        "min_record_count": {
            "got": 1,
            "expected": 1,
            "success": true // Specific data quality control indication
        },
        "max_failed_pages_percent": {
            "got": 0,
            "expected": 5,
            "success": true // Specific data quality control indication
        },
        "max_empty_pages_percent": {
            "got": 0,
            "expected": 5,
            "success": true // Specific data quality control indication
        },
        "min_column_records": {
            "title": {
                "got": 100,
                "expected": 95,
                "success": true // Specific data quality control indication
            }
        },
        "overall_data_quality_success": true // Global data quality control indication
    }
}
const scrapingJobQuality = await client.getScrapingJobDataQuality(123);
Response:
{
	min_record_count: {
		got: 1,
		expected: 1,
		success: true // Specific data quality control indication
	},
	max_failed_pages_percent: {
		got: 0,
		expected: 5,
		success: true // Specific data quality control indication
	},
	max_empty_pages_percent: {
		got: 0,
		expected: 5,
		success: true // Specific data quality control indication
	},
	min_column_records: {
		title: {
			got: 100,
			expected: 95,
			success: true // Specific data quality control indication
		}
	},
	overall_data_quality_success: true // Global data quality control indication
}
$scrapingJob = $client->getScrapingJobDataQuality(123);
Response:
[
	'min_record_count' => [
		'got' => 1,
		'expected' => 1,
		'success' => true, // Specific data quality control indication
	],
	'max_failed_pages_percent' => [
		'got' => 0,
		'expected' => 5,
		'success' => true, // Specific data quality control indication
	],
	'max_empty_pages_percent' => [
		'got' => 0,
		'expected' => 5,
		'success' => true, // Specific data quality control indication
	],
	'min_column_records' => [
		'title' => [
			'got' => 100,
			'expected' => 95,
			'success' => true, // Specific data quality control indication
		],
	],
	'overall_data_quality_success' => true, // Global data quality control indication
]

Delete Scraping Job

Method: DELETE
URL: https://api.webscraper.io/api/v1/scraping-job/<SCRAPING JOB ID>?api_token=<YOUR API TOKEN>
Response:
{
	"success": true,
	"data": "ok"
}
await client.deleteScrapingJob(500);
Response:
"ok"
$client->deleteScrapingJob(500);
Response:
"ok"

Account info

Method: GET
URL: https://api.webscraper.io/api/v1/account?api_token=<YOUR API TOKEN>
Response:
{
	"success": true,
	"data": {
		"email": "user@example.com",
		"firstname": "John",
		"lastname": "Deere",
		"page_credits": 500
	}
}
const info = await client.getAccountInfo();
Response:
{
	email: "user@example.com"
	firstname: "John"
	lastname: "Deere"
	page_credits: 500
}
$info = $client->getAccountInfo();
Response:
[
	'email' => 'user@example.com',
	'firstname' => 'John',
	'lastname' => 'Deere',
	'page_credits' => 500
]

