From ca4686f81b0ada4c571655880fccb8f8b431233e Mon Sep 17 00:00:00 2001
From: James
Date: Fri, 6 Mar 2020 13:46:36 +0000
Subject: [PATCH] new readme

---
 README.md | 202 +++++++++---------------------------------
 1 file changed, 34 insertions(+), 168 deletions(-)

diff --git a/README.md b/README.md
index 86f61d0..4246ac3 100644
--- a/README.md
+++ b/README.md
@@ -1,181 +1,47 @@
-This is very similar to spatie/http-status-check but due to the way guzzle handles redirects I wasn't happy with the results; known 404's, and even whole areas of the site were missing from crawl results. By following this https://github.com/guzzle/guzzle/blob/master/docs/faq.rst#how-can-i-track-redirected-requests I was able to get a crawl result with the complete list of responses as I was expecting.
+Similar project to [spatie/http-status-check](https://github.com/spatie/http-status-check), but this one collects ALL the FoundOnUrls for each crawled URL (not just the first occurrence) and returns the results as an array for further processing in your own application (no console command).
-Full details of differences below:
+## Install
-## Patch guzzle for invalid status code bug
+```plain
+composer config repositories.jhodges composer https://git.jhodges.co.uk/composer
+composer require jhodges/sitemap
+```
-There is a patch applied to guzzle to prevent the crawl failing with `EXCEPTION: Status code must be an integer value between 1xx and 5xx` https://github.com/spatie/crawler/issues/271 without this patch `CrawlerTest::testInvalidStatusCode` will fail.
+## Usage
+
+Note: to run the usage examples below against the localhost:8080 site, first start the test server as described in the Tests section.
+
+```php
+use \JHodges\Sitemap\Crawler;
+
+$crawler = new Crawler();
+$crawler->crawl('http://localhost:8080');
+$sitemap = $crawler->getResults();
+print_r($sitemap);
+```
+
+To crawl multiple areas of the same site that are not interlinked, you can call `crawl()` more than once before fetching the results:
+
+```php
+use \JHodges\Sitemap\Crawler;
+
+$crawler = new Crawler();
+$crawler->crawl('http://localhost:8080/interlinked1');
+$crawler->crawl('http://localhost:8080/deeplink1');
+$sitemap = $crawler->getResults();
+print_r($sitemap);
+```
 ## Tests
-The node.js test server is copied directly from spatie/http-status-check but refactored to offer a more diverse range of tests cases that cover the redirect issues and new functionality described above.
-
-The examples below are run against the test server in this project.
-
-## Collate all FoundOnUrl's
-
-Additionally I wanted to have a list of **all** the pages a 404 or specific link was found on. This was not possible as only the first FoundOn URL is reported. I created a patch to add a new function to the observer to make this possible. https://patch-diff.githubusercontent.com/raw/spatie/crawler/pull/280 without this the `CrawlerTest::testInterlinked` test will fail.
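+Instead of just dumping the results with `print_r()`, the array returned by `getResults()` in the usage examples above can be filtered in plain PHP; for instance, to list every broken URL together with all the pages it was found on. NB: the array shape used in this sketch (URLs as keys, each with `code` and `foundOn` entries) is an assumption for illustration, not documented API -- check your own `print_r($sitemap)` output for the real structure.
+
+```php
+// Hypothetical post-processing sketch: report all 4xx/5xx URLs
+// and every page that linked to them.
+foreach ($sitemap as $url => $info) {
+    if ($info['code'] >= 400) {
+        echo "$url ({$info['code']}) found on:\n";
+        foreach ($info['foundOn'] as $foundOn => $count) {
+            echo "  $foundOn ($count link(s))\n";
+        }
+    }
+}
+```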
+Start the test server, which will listen on localhost:8080:
 ```plain
-$ bin/crawler crawl http://localhost:8080/
-
-200 http://localhost:8080/
-200 http://localhost:8080/found
-404 http://localhost:8080/notFound
-200 http://localhost:8080/externalLink
-200 http://localhost:8080/deeplink1
-200 http://localhost:8080/interlinked1
-500 http://localhost:8080/internalServerError
---- http://localhost:8080/invalidStatusCode
-200 http://localhost:8080/twoRedirectsToSameLocation
-200 http://localhost:8080/deeplink2
-200 http://localhost:8080/interlinked2
-200 http://localhost:8080/interlinked3
-302 http://localhost:8080/redirectToFound
-302 http://localhost:8080/redirectToNotFound
-200 http://localhost:8080/deeplink3
-302 http://localhost:8080/redirectToRedirectToNotFound
-302 http://localhost:8080/redirect1
-302 http://localhost:8080/redirect2
---- http://localhost:8080/redirectLoop
-200 http://example.com/
---- http://localhost:8080/timeout
-200 http://localhost:8080/deeplink4
-404 http://localhost:8080/deeplink5
-
-
-$ bin/crawler crawl http://localhost:8080/ -f
-
-200 http://localhost:8080/
- -> (1)
-200 http://localhost:8080/found
- -> (2) http://localhost:8080/
- -> (2) http://localhost:8080/twoRedirectsToSameLocation
-404 http://localhost:8080/notFound
- -> (3) http://localhost:8080/
-200 http://localhost:8080/externalLink
- -> (1) http://localhost:8080/
-200 http://localhost:8080/deeplink1
- -> (1) http://localhost:8080/
-200 http://localhost:8080/interlinked1
- -> (1) http://localhost:8080/
- -> (1) http://localhost:8080/interlinked1
- -> (1) http://localhost:8080/interlinked2
- -> (1) http://localhost:8080/interlinked3
-500 http://localhost:8080/internalServerError
- -> (1) http://localhost:8080/
---- http://localhost:8080/invalidStatusCode
- -> (1) http://localhost:8080/
-200 http://localhost:8080/twoRedirectsToSameLocation
- -> (1) http://localhost:8080/
-200 http://localhost:8080/deeplink2
- -> (1) http://localhost:8080/deeplink1
-200 http://localhost:8080/interlinked2
- -> (1) http://localhost:8080/interlinked1
- -> (1) http://localhost:8080/interlinked2
- -> (1) http://localhost:8080/interlinked3
-200 http://localhost:8080/interlinked3
- -> (1) http://localhost:8080/interlinked2
- -> (1) http://localhost:8080/interlinked1
- -> (1) http://localhost:8080/interlinked3
-302 http://localhost:8080/redirectToFound
- -> (1) http://localhost:8080/
-302 http://localhost:8080/redirectToNotFound
- -> (2) http://localhost:8080/
-200 http://localhost:8080/deeplink3
- -> (1) http://localhost:8080/deeplink2
-302 http://localhost:8080/redirectToRedirectToNotFound
- -> (1) http://localhost:8080/
-302 http://localhost:8080/redirect1
- -> (1) http://localhost:8080/twoRedirectsToSameLocation
-302 http://localhost:8080/redirect2
- -> (1) http://localhost:8080/twoRedirectsToSameLocation
---- http://localhost:8080/redirectLoop
- -> (1) http://localhost:8080/
-200 http://example.com/
- -> (1) http://localhost:8080/externalLink
---- http://localhost:8080/timeout
- -> (1) http://localhost:8080/
-200 http://localhost:8080/deeplink4
- -> (1) http://localhost:8080/deeplink3
-404 http://localhost:8080/deeplink5
- -> (1) http://localhost:8080/deeplink4
+cd tests/server
+./start_server.sh
 ```
-
-## Follow Redirects
-
-By default `spatie/http-status-check` has `guzzle` set to not follow redirects. This results in the potential for parts of the site to be uncrawlable if they are behind a 301 or 302 redirect, and not linked internally anywhere else with a non-redirecting link.
-
-Some webservers include a `` link on the 301/302 body and this will mitigate the problem (spaite follows and indexes), however if the webserver does not do this, then the link won't be followed to its true destination, and the destination won't be indexed.
-
-This is most obvious with a redirect to a not found page: You'd expect to see a 404 here:
+Run the tests:
 ```plain
-./http-status-check scan http://localhost:8080/redirectToNotFound
-
-Start scanning http://localhost:8080/redirectToNotFound
-
-[2020-02-22 12:07:22] 302 Found - http://localhost:8080/redirectToNotFound
-
-Crawling summary
-----------------
-Crawled 1 url(s) with statuscode 302
+vendor/bin/phpunit tests
 ```
-
-Or a redirect to a page thats found: You'd expect to see a 200 and the links of that page crawled too:
-```plain
-$ ./http-status-check scan http://localhost:8080/redirectToFound
-
-Start scanning http://localhost:8080/redirectToFound
-
-[2020-02-22 12:08:34] 302 Found - http://localhost:8080/redirectToFound
-
-Crawling summary
-----------------
-Crawled 1 url(s) with statuscode 302
-```
-
-If I enable `RequestOptions::ALLOW_REDIRECTS` as suggested here https://github.com/spatie/crawler/issues/263 I get:
-
-```plain
-$ ./http-status-check scan http://localhost:8080/redirectToNotFound
-
-Start scanning http://localhost:8080/redirectToNotFound
-
-[2020-02-22 12:13:53] 404: Not Found - http://localhost:8080/redirectToNotFound
-
-Crawling summary
-----------------
-Crawled 1 url(s) with statuscode 404
-```
-
-and
-
-```plain
-$ ./http-status-check scan http://localhost:8080/redirectToFound
-
-Start scanning http://localhost:8080/redirectToFound
-
-[2020-02-23 18:10:19] 200 OK - http://localhost:8080/redirectToFound
-
-Crawling summary
-----------------
-Crawled 1 url(s) with statuscode 200
-```
-
-This looks better, we can see the 404 and the 200 now, however on closer inspection you'll notice there are no 302's being shown. And the `/redirectToFound` is actually showing as a 200 where the real response was 302, the 200 should be associated with the `/found` URL that is still missing from the results. For generating a sitemap of valid paged we'd need the destination URL, not the redirect URL.
-
-Enter this sitemap crawler:
-
-```plain
-$ bin/crawler crawl http://localhost:8080/redirectToNotFound
-
-302 http://localhost:8080/redirectToNotFound
-404 http://localhost:8080/notFound
-
-$ bin/crawler crawl http://localhost:8080/redirectToFound
-
-302 http://localhost:8080/redirectToFound
-200 http://localhost:8080/found
-```
-
\ No newline at end of file