Update 'README.md'

James 2020-02-23 19:10:49 +00:00
parent 2b6bd3024e
commit 5fc2b1fb83

@@ -1,8 +1,6 @@
This is very similar to `spatie/http-status-check`, but because of the way Guzzle handles redirects I wasn't happy with the results: known 404s, and even whole areas of the site, were missing from the crawl results. By following the Guzzle FAQ entry on tracking redirected requests (https://github.com/guzzle/guzzle/blob/master/docs/faq.rst#how-can-i-track-redirected-requests) I was able to get a crawl result with the complete list of responses I was expecting.
## Collate all FoundOnUrl's
Additionally, I wanted a list of **all** the pages on which a 404 or a specific link was found. This was not possible because only the first FoundOn URL is reported, so I created a patch that adds a new function to the observer (https://patch-diff.githubusercontent.com/raw/spatie/crawler/pull/280). Without this patch the `CrawlerTest::testInterlinked` test will fail.
Full details of differences below:
## Patch guzzle for invalid status code bug
@@ -14,7 +12,97 @@ The node.js test server is copied directly from spatie/http-status-check but ref
The examples below are run against the test server in this project.
## More info about the redirects
## Collate all FoundOnUrl's
Additionally, I wanted a list of **all** the pages on which a 404 or a specific link was found. This was not possible because only the first FoundOn URL is reported, so I created a patch that adds a new function to the observer (https://patch-diff.githubusercontent.com/raw/spatie/crawler/pull/280). Without this patch the `CrawlerTest::testInterlinked` test will fail.
```plain
$ bin/crawler crawl http://localhost:8080/
200 http://localhost:8080/
200 http://localhost:8080/found
404 http://localhost:8080/notFound
200 http://localhost:8080/externalLink
200 http://localhost:8080/deeplink1
200 http://localhost:8080/interlinked1
500 http://localhost:8080/internalServerError
--- http://localhost:8080/invalidStatusCode
200 http://localhost:8080/twoRedirectsToSameLocation
200 http://localhost:8080/deeplink2
200 http://localhost:8080/interlinked2
200 http://localhost:8080/interlinked3
302 http://localhost:8080/redirectToFound
302 http://localhost:8080/redirectToNotFound
200 http://localhost:8080/deeplink3
302 http://localhost:8080/redirectToRedirectToNotFound
302 http://localhost:8080/redirect1
302 http://localhost:8080/redirect2
--- http://localhost:8080/redirectLoop
200 http://example.com/
--- http://localhost:8080/timeout
200 http://localhost:8080/deeplink4
404 http://localhost:8080/deeplink5
$ bin/crawler crawl http://localhost:8080/ -f
200 http://localhost:8080/
-> (1)
200 http://localhost:8080/found
-> (2) http://localhost:8080/
-> (2) http://localhost:8080/twoRedirectsToSameLocation
404 http://localhost:8080/notFound
-> (3) http://localhost:8080/
200 http://localhost:8080/externalLink
-> (1) http://localhost:8080/
200 http://localhost:8080/deeplink1
-> (1) http://localhost:8080/
200 http://localhost:8080/interlinked1
-> (1) http://localhost:8080/
-> (1) http://localhost:8080/interlinked1
-> (1) http://localhost:8080/interlinked2
-> (1) http://localhost:8080/interlinked3
500 http://localhost:8080/internalServerError
-> (1) http://localhost:8080/
--- http://localhost:8080/invalidStatusCode
-> (1) http://localhost:8080/
200 http://localhost:8080/twoRedirectsToSameLocation
-> (1) http://localhost:8080/
200 http://localhost:8080/deeplink2
-> (1) http://localhost:8080/deeplink1
200 http://localhost:8080/interlinked2
-> (1) http://localhost:8080/interlinked1
-> (1) http://localhost:8080/interlinked2
-> (1) http://localhost:8080/interlinked3
200 http://localhost:8080/interlinked3
-> (1) http://localhost:8080/interlinked2
-> (1) http://localhost:8080/interlinked1
-> (1) http://localhost:8080/interlinked3
302 http://localhost:8080/redirectToFound
-> (1) http://localhost:8080/
302 http://localhost:8080/redirectToNotFound
-> (2) http://localhost:8080/
200 http://localhost:8080/deeplink3
-> (1) http://localhost:8080/deeplink2
302 http://localhost:8080/redirectToRedirectToNotFound
-> (1) http://localhost:8080/
302 http://localhost:8080/redirect1
-> (1) http://localhost:8080/twoRedirectsToSameLocation
302 http://localhost:8080/redirect2
-> (1) http://localhost:8080/twoRedirectsToSameLocation
--- http://localhost:8080/redirectLoop
-> (1) http://localhost:8080/
200 http://example.com/
-> (1) http://localhost:8080/externalLink
--- http://localhost:8080/timeout
-> (1) http://localhost:8080/
200 http://localhost:8080/deeplink4
-> (1) http://localhost:8080/deeplink3
404 http://localhost:8080/deeplink5
-> (1) http://localhost:8080/deeplink4
```
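The `-f` output above groups every crawled URL with the pages it was found on. The idea of collating them can be sketched as a nested mapping. This is a minimal Python illustration, not the project's PHP code or the patched observer: the function names `collate_found_on` and `format_report` and the sample `links` data are hypothetical, and it assumes the number in parentheses is a per-page occurrence count.

```python
from collections import defaultdict


def collate_found_on(links):
    """Group (found_on, target) pairs by target, counting repeats per source page."""
    found_on = defaultdict(lambda: defaultdict(int))
    for source, target in links:
        found_on[target][source] += 1
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {target: dict(sources) for target, sources in found_on.items()}


def format_report(found_on):
    """Render the mapping in a layout loosely modelled on the `-f` output."""
    lines = []
    for target, sources in found_on.items():
        lines.append(target)
        for source, count in sources.items():
            lines.append(f"  -> ({count}) {source}")
    return "\n".join(lines)


# Hypothetical observations: (page the link was found on, link target).
links = [
    ("http://localhost:8080/", "http://localhost:8080/found"),
    ("http://localhost:8080/", "http://localhost:8080/found"),
    ("http://localhost:8080/twoRedirectsToSameLocation", "http://localhost:8080/found"),
    ("http://localhost:8080/", "http://localhost:8080/notFound"),
]

collated = collate_found_on(links)
print(format_report(collated))
```

Keeping every source page, instead of only the first one seen, is what makes it possible to list all the places a broken link needs fixing.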
## Follow Redirects
By default, `spatie/http-status-check` configures `guzzle` not to follow redirects. As a result, parts of a site can be uncrawlable if they sit behind a 301 or 302 redirect and are not linked internally anywhere else with a non-redirecting link.
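The effect of not following redirects can be shown with a toy model. The sketch below is Python and uses no HTTP client at all: the `SITE` table and the `crawl` function are hypothetical stand-ins, not Guzzle or the crawler's real behaviour. It only illustrates why a page reachable solely through a redirect disappears from the results when redirects are not followed.

```python
from collections import deque

# Hypothetical in-memory site: path -> (status, redirect target or None, links on the page).
SITE = {
    "/": (200, None, ["/redirectToFound"]),
    "/redirectToFound": (302, "/found", []),
    "/found": (200, None, ["/deeplink1"]),
    "/deeplink1": (200, None, []),
}


def crawl(start, follow_redirects):
    """Breadth-first crawl returning {path: status} for every path requested."""
    seen = {}
    queue = deque([start])
    while queue:
        path = queue.popleft()
        if path in seen:
            continue
        # Unknown paths are treated as 404 in this toy model.
        status, location, links = SITE.get(path, (404, None, []))
        seen[path] = status
        if location is not None and follow_redirects:
            # The redirect itself is recorded above; its target is queued as well.
            queue.append(location)
        queue.extend(links)
    return seen


without_follow = crawl("/", follow_redirects=False)
with_follow = crawl("/", follow_redirects=True)
print(sorted(without_follow))
print(sorted(with_follow))
```

In this model `/found` and `/deeplink1` are only discovered when the redirect is followed, because nothing links to them directly.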