This is very similar to spatie/http-status-check, but due to the way Guzzle handles redirects I wasn't happy with the results: known 404s, and even whole areas of the site, were missing from the crawl results. By following https://github.com/guzzle/guzzle/blob/master/docs/faq.rst#how-can-i-track-redirected-requests I was able to get a crawl result with the complete list of responses I was expecting.
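The relevant option from that FAQ entry is Guzzle's `track_redirects` flag, which records every hop of a redirect chain on the final response. A minimal sketch, assuming guzzlehttp/guzzle is installed via Composer and the test server from this project is running on localhost:8080:

```php
<?php
// Sketch: track each hop of a redirect chain with Guzzle,
// per the Guzzle FAQ entry linked above.
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\RequestOptions;

$client = new Client();

$response = $client->request('GET', 'http://localhost:8080/redirectToFound', [
    RequestOptions::ALLOW_REDIRECTS => [
        'max'             => 10,
        'track_redirects' => true,
    ],
    RequestOptions::HTTP_ERRORS => false, // report 4xx/5xx instead of throwing
]);

// With track_redirects enabled, Guzzle exposes the intermediate URLs
// and status codes as headers on the final response.
$urls     = $response->getHeader('X-Guzzle-Redirect-History');
$statuses = $response->getHeader('X-Guzzle-Redirect-Status-History');

foreach ($urls as $i => $url) {
    echo $statuses[$i] . ' ' . $url . PHP_EOL;
}
echo $response->getStatusCode() . ' (final)' . PHP_EOL;
```

This is how the crawler can report both the 302 for the redirect URL and the real status of the destination, rather than only one or the other.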

Full details of differences below:

Patch Guzzle for invalid status code bug

There is a patch applied to Guzzle to prevent the crawl failing with EXCEPTION: Status code must be an integer value between 1xx and 5xx (https://github.com/spatie/crawler/issues/271). Without this patch, CrawlerTest::testInvalidStatusCode will fail.

Tests

The Node.js test server is copied directly from spatie/http-status-check but refactored to offer a more diverse range of test cases covering the redirect issues and new functionality described above.

The examples below are run against the test server in this project.

Collate all FoundOnUrls

Additionally, I wanted a list of all the pages on which a 404 (or any specific link) was found. This was not possible, as only the first FoundOn URL is reported. I created a patch that adds a new function to the observer to make this possible: https://patch-diff.githubusercontent.com/raw/spatie/crawler/pull/280. Without it, the CrawlerTest::testInterlinked test will fail.
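To illustrate, here is a hedged sketch of an observer that collates every page a URL was found on. The `crawled()`/`crawlFailed()` signatures match spatie/crawler's `CrawlObserver`; the extra hook that reports each additional FoundOn URL is the one added by the patch above, and the name `addedOnUrl()` used here is hypothetical — the real signature is in the linked diff.

```php
<?php
// Hedged sketch: collate every page a URL was found on.
// crawled()/crawlFailed() are spatie/crawler observer methods;
// addedOnUrl() stands in for the hook added by the patch above
// (hypothetical name -- see the linked diff for the real one).

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObserver;

class CollatingObserver extends CrawlObserver
{
    /** @var array<string, string[]> crawled URL => pages it was found on */
    public $foundOn = [];

    public function crawled(UriInterface $url, ResponseInterface $response, ?UriInterface $foundOnUrl = null)
    {
        $this->remember($url, $foundOnUrl);
    }

    public function crawlFailed(UriInterface $url, RequestException $requestException, ?UriInterface $foundOnUrl = null)
    {
        $this->remember($url, $foundOnUrl);
    }

    // Called each time an already-queued URL is found on another page.
    // This is the hook the patch adds; the name here is hypothetical.
    public function addedOnUrl(UriInterface $url, UriInterface $foundOnUrl)
    {
        $this->remember($url, $foundOnUrl);
    }

    private function remember(UriInterface $url, ?UriInterface $foundOnUrl)
    {
        if ($foundOnUrl !== null) {
            $this->foundOn[(string) $url][] = (string) $foundOnUrl;
        }
    }
}
```

The `-f` output below is essentially a dump of this kind of collation: each crawled URL followed by every page it was found on, with a count.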

$ bin/crawler crawl http://localhost:8080/

200 http://localhost:8080/
200 http://localhost:8080/found
404 http://localhost:8080/notFound
200 http://localhost:8080/externalLink
200 http://localhost:8080/deeplink1
200 http://localhost:8080/interlinked1
500 http://localhost:8080/internalServerError
--- http://localhost:8080/invalidStatusCode
200 http://localhost:8080/twoRedirectsToSameLocation
200 http://localhost:8080/deeplink2
200 http://localhost:8080/interlinked2
200 http://localhost:8080/interlinked3
302 http://localhost:8080/redirectToFound
302 http://localhost:8080/redirectToNotFound
200 http://localhost:8080/deeplink3
302 http://localhost:8080/redirectToRedirectToNotFound
302 http://localhost:8080/redirect1
302 http://localhost:8080/redirect2
--- http://localhost:8080/redirectLoop
200 http://example.com/
--- http://localhost:8080/timeout
200 http://localhost:8080/deeplink4
404 http://localhost:8080/deeplink5


$ bin/crawler crawl http://localhost:8080/ -f

200 http://localhost:8080/
    -> (1) 
200 http://localhost:8080/found
    -> (2) http://localhost:8080/
    -> (2) http://localhost:8080/twoRedirectsToSameLocation
404 http://localhost:8080/notFound
    -> (3) http://localhost:8080/
200 http://localhost:8080/externalLink
    -> (1) http://localhost:8080/
200 http://localhost:8080/deeplink1
    -> (1) http://localhost:8080/
200 http://localhost:8080/interlinked1
    -> (1) http://localhost:8080/
    -> (1) http://localhost:8080/interlinked1
    -> (1) http://localhost:8080/interlinked2
    -> (1) http://localhost:8080/interlinked3
500 http://localhost:8080/internalServerError
    -> (1) http://localhost:8080/
--- http://localhost:8080/invalidStatusCode
    -> (1) http://localhost:8080/
200 http://localhost:8080/twoRedirectsToSameLocation
    -> (1) http://localhost:8080/
200 http://localhost:8080/deeplink2
    -> (1) http://localhost:8080/deeplink1
200 http://localhost:8080/interlinked2
    -> (1) http://localhost:8080/interlinked1
    -> (1) http://localhost:8080/interlinked2
    -> (1) http://localhost:8080/interlinked3
200 http://localhost:8080/interlinked3
    -> (1) http://localhost:8080/interlinked2
    -> (1) http://localhost:8080/interlinked1
    -> (1) http://localhost:8080/interlinked3
302 http://localhost:8080/redirectToFound
    -> (1) http://localhost:8080/
302 http://localhost:8080/redirectToNotFound
    -> (2) http://localhost:8080/
200 http://localhost:8080/deeplink3
    -> (1) http://localhost:8080/deeplink2
302 http://localhost:8080/redirectToRedirectToNotFound
    -> (1) http://localhost:8080/
302 http://localhost:8080/redirect1
    -> (1) http://localhost:8080/twoRedirectsToSameLocation
302 http://localhost:8080/redirect2
    -> (1) http://localhost:8080/twoRedirectsToSameLocation
--- http://localhost:8080/redirectLoop
    -> (1) http://localhost:8080/
200 http://example.com/
    -> (1) http://localhost:8080/externalLink
--- http://localhost:8080/timeout
    -> (1) http://localhost:8080/
200 http://localhost:8080/deeplink4
    -> (1) http://localhost:8080/deeplink3
404 http://localhost:8080/deeplink5
    -> (1) http://localhost:8080/deeplink4

Follow Redirects

By default spatie/http-status-check configures Guzzle not to follow redirects. As a result, parts of the site can be uncrawlable if they sit behind a 301 or 302 redirect and are not linked anywhere else with a non-redirecting link.

Some web servers include an <a href="destination"> link in the 301/302 response body, which mitigates the problem (spatie follows and indexes it); however, if the web server does not do this, the link won't be followed to its true destination, and the destination won't be indexed.

This is most obvious with a redirect to a not-found page. You'd expect to see a 404 here:

./http-status-check scan http://localhost:8080/redirectToNotFound

Start scanning http://localhost:8080/redirectToNotFound

[2020-02-22 12:07:22] 302 Found - http://localhost:8080/redirectToNotFound

Crawling summary
----------------
Crawled 1 url(s) with statuscode 302

Or a redirect to a page that is found: you'd expect to see a 200, and the links of that page crawled too:

$ ./http-status-check scan http://localhost:8080/redirectToFound

Start scanning http://localhost:8080/redirectToFound

[2020-02-22 12:08:34] 302 Found - http://localhost:8080/redirectToFound

Crawling summary
----------------
Crawled 1 url(s) with statuscode 302
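Enabling redirect following means passing Guzzle's allow_redirects option through to the crawler's client. A minimal sketch using spatie/crawler's `Crawler::create()`, which forwards its options array to the underlying Guzzle client (the observer class name is a placeholder for whatever observer you use):

```php
<?php
// Hedged sketch: enable redirect following in spatie/crawler,
// as suggested in spatie/crawler issue #263.
require 'vendor/autoload.php';

use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;

Crawler::create([
    RequestOptions::ALLOW_REDIRECTS => true,  // follow 301/302s
    RequestOptions::HTTP_ERRORS     => false, // report 4xx/5xx instead of throwing
])
    ->setCrawlObserver(new YourObserver()) // placeholder observer
    ->startCrawling('http://localhost:8080/');
```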

If I enable RequestOptions::ALLOW_REDIRECTS as suggested in https://github.com/spatie/crawler/issues/263, I get:

$ ./http-status-check scan http://localhost:8080/redirectToNotFound

Start scanning http://localhost:8080/redirectToNotFound

[2020-02-22 12:13:53] 404: Not Found - http://localhost:8080/redirectToNotFound

Crawling summary
----------------
Crawled 1 url(s) with statuscode 404

and

$ ./http-status-check scan http://localhost:8080/redirectToFound

Start scanning http://localhost:8080/redirectToFound

[2020-02-23 18:10:19] 200 OK - http://localhost:8080/redirectToFound

Crawling summary
----------------
Crawled 1 url(s) with statuscode 200

This looks better: we can now see the 404 and the 200. However, on closer inspection you'll notice no 302s are being shown, and /redirectToFound is reported as a 200 when the real response was a 302; the 200 should be associated with the /found URL, which is still missing from the results. For generating a sitemap of valid pages we need the destination URL, not the redirect URL.

Enter this sitemap crawler:

$ bin/crawler crawl http://localhost:8080/redirectToNotFound

302 http://localhost:8080/redirectToNotFound
404 http://localhost:8080/notFound

$ bin/crawler crawl http://localhost:8080/redirectToFound

302 http://localhost:8080/redirectToFound
200 http://localhost:8080/found