This webSet gets the main articles from the Guardian newspaper site. The result doesn't look very nice, but a lot of the content is there. The Guardian site has extremely complicated source HTML, but every article has a URL that includes the date, so the inclusion string /2016/ means that only links to these articles are followed. (You'll need to change this string when the year changes.) The inclusion string .jpg means that links to some pictures are also followed.
http://www.theguardian.com/uk Guardian main articles /Guardian/index.html /Guardian/index.html/shortened467/an.com/uk/index.html 2 0 500 1 1 30/09/2014 19:24:58 2 /2016/ .jpg 0
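The effect of the two inclusion strings can be sketched in a few lines of Python. This is only an illustration, not the program's own code: it assumes that a link is followed when its URL contains at least one inclusion string as a plain substring, and the function name should_follow and the sample links are made up for the example.

```python
def should_follow(url, inclusion_strings):
    """Return True if the URL contains at least one inclusion string.

    Assumption: plain substring matching, and one match is enough.
    """
    return any(s in url for s in inclusion_strings)


# The two inclusion strings from the definition above.
inclusion_strings = ["/2016/", ".jpg"]

# Hypothetical links, invented for illustration only.
links = [
    "http://www.theguardian.com/world/2016/jan/01/example-article",
    "http://www.theguardian.com/uk/sport",
    "http://www.example.com/images/photo.jpg",
]

# Keep only the links that match at least one inclusion string.
followed = [link for link in links if should_follow(link, inclusion_strings)]

for link in followed:
    print(link)
```

Under these assumptions the dated article link and the picture link are kept, while the section link with no date in its URL is skipped. Changing "/2016/" to the new year is all that is needed when the year changes.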