Update 'README.md'
parent 2b6bd3024e
commit 5fc2b1fb83
96 README.md
@@ -1,8 +1,6 @@
This is very similar to `spatie/http-status-check`, but because of the way guzzle handles redirects I wasn't happy with the results: known 404s, and even whole areas of the site, were missing from the crawl results. By following the guzzle FAQ entry on tracking redirected requests (https://github.com/guzzle/guzzle/blob/master/docs/faq.rst#how-can-i-track-redirected-requests) I was able to get a crawl result with the complete list of responses I was expecting.

## Collate all FoundOnUrl's
Full details of differences below:

Additionally, I wanted a list of **all** the pages on which a 404 or a specific link was found. This was not possible because only the first FoundOn URL is reported, so I created a patch that adds a new function to the observer: https://patch-diff.githubusercontent.com/raw/spatie/crawler/pull/280. Without this patch the `CrawlerTest::testInterlinked` test will fail.

## Patch guzzle for invalid status code bug

@@ -14,7 +12,97 @@ The node.js test server is copied directly from spatie/http-status-check but ref

The examples below are run against the test server in this project.

## More info about the redirects
## Collate all FoundOnUrl's

Additionally, I wanted a list of **all** the pages on which a 404 or a specific link was found. This was not possible because only the first FoundOn URL is reported, so I created a patch that adds a new function to the observer: https://patch-diff.githubusercontent.com/raw/spatie/crawler/pull/280. Without this patch the `CrawlerTest::testInterlinked` test will fail.

```plain
$ bin/crawler crawl http://localhost:8080/

200 http://localhost:8080/
200 http://localhost:8080/found
404 http://localhost:8080/notFound
200 http://localhost:8080/externalLink
200 http://localhost:8080/deeplink1
200 http://localhost:8080/interlinked1
500 http://localhost:8080/internalServerError
--- http://localhost:8080/invalidStatusCode
200 http://localhost:8080/twoRedirectsToSameLocation
200 http://localhost:8080/deeplink2
200 http://localhost:8080/interlinked2
200 http://localhost:8080/interlinked3
302 http://localhost:8080/redirectToFound
302 http://localhost:8080/redirectToNotFound
200 http://localhost:8080/deeplink3
302 http://localhost:8080/redirectToRedirectToNotFound
302 http://localhost:8080/redirect1
302 http://localhost:8080/redirect2
--- http://localhost:8080/redirectLoop
200 http://example.com/
--- http://localhost:8080/timeout
200 http://localhost:8080/deeplink4
404 http://localhost:8080/deeplink5


$ bin/crawler crawl http://localhost:8080/ -f

200 http://localhost:8080/
-> (1)
200 http://localhost:8080/found
-> (2) http://localhost:8080/
-> (2) http://localhost:8080/twoRedirectsToSameLocation
404 http://localhost:8080/notFound
-> (3) http://localhost:8080/
200 http://localhost:8080/externalLink
-> (1) http://localhost:8080/
200 http://localhost:8080/deeplink1
-> (1) http://localhost:8080/
200 http://localhost:8080/interlinked1
-> (1) http://localhost:8080/
-> (1) http://localhost:8080/interlinked1
-> (1) http://localhost:8080/interlinked2
-> (1) http://localhost:8080/interlinked3
500 http://localhost:8080/internalServerError
-> (1) http://localhost:8080/
--- http://localhost:8080/invalidStatusCode
-> (1) http://localhost:8080/
200 http://localhost:8080/twoRedirectsToSameLocation
-> (1) http://localhost:8080/
200 http://localhost:8080/deeplink2
-> (1) http://localhost:8080/deeplink1
200 http://localhost:8080/interlinked2
-> (1) http://localhost:8080/interlinked1
-> (1) http://localhost:8080/interlinked2
-> (1) http://localhost:8080/interlinked3
200 http://localhost:8080/interlinked3
-> (1) http://localhost:8080/interlinked2
-> (1) http://localhost:8080/interlinked1
-> (1) http://localhost:8080/interlinked3
302 http://localhost:8080/redirectToFound
-> (1) http://localhost:8080/
302 http://localhost:8080/redirectToNotFound
-> (2) http://localhost:8080/
200 http://localhost:8080/deeplink3
-> (1) http://localhost:8080/deeplink2
302 http://localhost:8080/redirectToRedirectToNotFound
-> (1) http://localhost:8080/
302 http://localhost:8080/redirect1
-> (1) http://localhost:8080/twoRedirectsToSameLocation
302 http://localhost:8080/redirect2
-> (1) http://localhost:8080/twoRedirectsToSameLocation
--- http://localhost:8080/redirectLoop
-> (1) http://localhost:8080/
200 http://example.com/
-> (1) http://localhost:8080/externalLink
--- http://localhost:8080/timeout
-> (1) http://localhost:8080/
200 http://localhost:8080/deeplink4
-> (1) http://localhost:8080/deeplink3
404 http://localhost:8080/deeplink5
-> (1) http://localhost:8080/deeplink4
```

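The collation shown in the `-f` output above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's PHP code or the patched observer: the function names `collate_found_on` and `format_report` and the sample link data are hypothetical, and it assumes the number in parentheses is how many times a link to the URL was seen on that page.

```python
from collections import Counter, defaultdict

BASE = "http://localhost:8080"  # the test server address used in the examples


def collate_found_on(links):
    """Group (found_on_url, target_url) pairs by target, counting repeats."""
    found_on = defaultdict(Counter)
    for source, target in links:
        found_on[target][source] += 1
    return found_on


def format_report(statuses, found_on):
    """Render one status line per URL, followed by every page it was found on."""
    lines = []
    for url, status in statuses.items():
        lines.append(f"{status} {url}")
        for source, count in found_on.get(url, {}).items():
            lines.append(f"-> ({count}) {source}")
    return "\n".join(lines)


# Hypothetical link observations: (page the link was found on, link target).
links = [
    (f"{BASE}/", f"{BASE}/found"),
    (f"{BASE}/", f"{BASE}/found"),
    (f"{BASE}/twoRedirectsToSameLocation", f"{BASE}/found"),
    (f"{BASE}/twoRedirectsToSameLocation", f"{BASE}/found"),
    (f"{BASE}/", f"{BASE}/notFound"),
    (f"{BASE}/", f"{BASE}/notFound"),
    (f"{BASE}/", f"{BASE}/notFound"),
]
statuses = {f"{BASE}/found": 200, f"{BASE}/notFound": 404}

collated = collate_found_on(links)
report = format_report(statuses, collated)
print(report)
```

Keeping a counter per target, rather than only the first page a link was seen on, is what makes it possible to list every page that needs fixing when a 404 turns up.
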
## Follow Redirects

By default `spatie/http-status-check` configures `guzzle` not to follow redirects. As a result, parts of a site can be uncrawlable if they sit behind a 301 or 302 redirect and are not linked internally anywhere else with a non-redirecting link.
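
The effect described above can be illustrated with a toy crawl over an in-memory site. This is only a sketch in Python under stated assumptions: the `SITE` map, its paths, and the `crawl` function are hypothetical and do not reflect how guzzle or the crawler is implemented. Each redirect is still recorded as its own response, but its target is queued only when redirects are followed.

```python
from collections import deque

# Hypothetical site: path -> (status code, link targets or redirect target).
SITE = {
    "/": (200, ["/found", "/redirectToHidden"]),
    "/found": (200, []),
    "/redirectToHidden": (302, ["/hidden"]),
    "/hidden": (200, ["/deepNotFound"]),
    "/deepNotFound": (404, []),
}


def crawl(start, follow_redirects):
    """Breadth-first crawl that returns {path: status} for every page reached."""
    seen = {}
    queue = deque([start])
    while queue:
        path = queue.popleft()
        if path in seen:
            continue
        status, targets = SITE.get(path, (404, []))
        seen[path] = status
        # Without redirect following, the crawl stops at the 301/302 response.
        if status in (301, 302) and not follow_redirects:
            continue
        queue.extend(targets)
    return seen


shallow = crawl("/", follow_redirects=False)
deep = crawl("/", follow_redirects=True)
print(sorted(set(deep) - set(shallow)))
```

In this model the 404 at `/deepNotFound` is reachable only through the redirect, so it shows up in the crawl result only when redirects are followed.
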