diff --git a/README.md b/README.md
index 06b0376..86f61d0 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,6 @@
 This is very similar to spatie/http-status-check but due to the way guzzle handles redirects I wasn't happy with the results; known 404's, and even whole areas of the site were missing from crawl results. By following this https://github.com/guzzle/guzzle/blob/master/docs/faq.rst#how-can-i-track-redirected-requests I was able to get a crawl result with the complete list of responses as I was expecting.
 
-## Collate all FoundOnUrl's
-
-Additionally I wanted to have a list of **all** the pages a 404 or specific link was found on. This was not possible as only the first FoundOn URL is reported. I created a patch to add a new function to the observer to make this possible. https://patch-diff.githubusercontent.com/raw/spatie/crawler/pull/280 without this the `CrawlerTest::testInterlinked` test will fail.
+Full details of differences below:
 
 ## Patch guzzle for invalid status code bug
 
@@ -14,7 +12,97 @@ The node.js test server is copied directly from spatie/http-status-check but ref
 
 The examples below are run against the test server in this project.
 
-## More info about the redirects
+## Collate all FoundOnUrl's
+
+Additionally I wanted to have a list of **all** the pages a 404 or specific link was found on. This was not possible as only the first FoundOn URL is reported. I created a patch to add a new function to the observer to make this possible. https://patch-diff.githubusercontent.com/raw/spatie/crawler/pull/280 without this the `CrawlerTest::testInterlinked` test will fail.
+
+```plain
+$ bin/crawler crawl http://localhost:8080/
+
+200 http://localhost:8080/
+200 http://localhost:8080/found
+404 http://localhost:8080/notFound
+200 http://localhost:8080/externalLink
+200 http://localhost:8080/deeplink1
+200 http://localhost:8080/interlinked1
+500 http://localhost:8080/internalServerError
+--- http://localhost:8080/invalidStatusCode
+200 http://localhost:8080/twoRedirectsToSameLocation
+200 http://localhost:8080/deeplink2
+200 http://localhost:8080/interlinked2
+200 http://localhost:8080/interlinked3
+302 http://localhost:8080/redirectToFound
+302 http://localhost:8080/redirectToNotFound
+200 http://localhost:8080/deeplink3
+302 http://localhost:8080/redirectToRedirectToNotFound
+302 http://localhost:8080/redirect1
+302 http://localhost:8080/redirect2
+--- http://localhost:8080/redirectLoop
+200 http://example.com/
+--- http://localhost:8080/timeout
+200 http://localhost:8080/deeplink4
+404 http://localhost:8080/deeplink5
+
+
+$ bin/crawler crawl http://localhost:8080/ -f
+
+200 http://localhost:8080/
+ -> (1)
+200 http://localhost:8080/found
+ -> (2) http://localhost:8080/
+ -> (2) http://localhost:8080/twoRedirectsToSameLocation
+404 http://localhost:8080/notFound
+ -> (3) http://localhost:8080/
+200 http://localhost:8080/externalLink
+ -> (1) http://localhost:8080/
+200 http://localhost:8080/deeplink1
+ -> (1) http://localhost:8080/
+200 http://localhost:8080/interlinked1
+ -> (1) http://localhost:8080/
+ -> (1) http://localhost:8080/interlinked1
+ -> (1) http://localhost:8080/interlinked2
+ -> (1) http://localhost:8080/interlinked3
+500 http://localhost:8080/internalServerError
+ -> (1) http://localhost:8080/
+--- http://localhost:8080/invalidStatusCode
+ -> (1) http://localhost:8080/
+200 http://localhost:8080/twoRedirectsToSameLocation
+ -> (1) http://localhost:8080/
+200 http://localhost:8080/deeplink2
+ -> (1) http://localhost:8080/deeplink1
+200 http://localhost:8080/interlinked2
+ -> (1) http://localhost:8080/interlinked1
+ -> (1) http://localhost:8080/interlinked2
+ -> (1) http://localhost:8080/interlinked3
+200 http://localhost:8080/interlinked3
+ -> (1) http://localhost:8080/interlinked2
+ -> (1) http://localhost:8080/interlinked1
+ -> (1) http://localhost:8080/interlinked3
+302 http://localhost:8080/redirectToFound
+ -> (1) http://localhost:8080/
+302 http://localhost:8080/redirectToNotFound
+ -> (2) http://localhost:8080/
+200 http://localhost:8080/deeplink3
+ -> (1) http://localhost:8080/deeplink2
+302 http://localhost:8080/redirectToRedirectToNotFound
+ -> (1) http://localhost:8080/
+302 http://localhost:8080/redirect1
+ -> (1) http://localhost:8080/twoRedirectsToSameLocation
+302 http://localhost:8080/redirect2
+ -> (1) http://localhost:8080/twoRedirectsToSameLocation
+--- http://localhost:8080/redirectLoop
+ -> (1) http://localhost:8080/
+200 http://example.com/
+ -> (1) http://localhost:8080/externalLink
+--- http://localhost:8080/timeout
+ -> (1) http://localhost:8080/
+200 http://localhost:8080/deeplink4
+ -> (1) http://localhost:8080/deeplink3
+404 http://localhost:8080/deeplink5
+ -> (1) http://localhost:8080/deeplink4
+```
+
+## Follow Redirects
 
 By default `spatie/http-status-check` has `guzzle` set to not follow redirects. This results in the potential for parts of the site to be uncrawlable if they are behind a 301 or 302 redirect, and not linked internally anywhere else with a non-redirecting link.