Handle non-HTML content referenced by a URL

If a URL references a non-HTML resource (PDF, DOCX, ...), the content of the document is not inspected (as far as I can see), and the document is saved in the working directory. These artifacts pile up and are never deleted.
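
One possible direction, as a minimal sketch only: decide from the `Content-Type` header whether a response is HTML, and write anything else into a temporary directory that is removed once processing finishes. The names `is_html`, `process_non_html`, and `HTML_TYPES` are hypothetical and not taken from the project's code; the size check merely stands in for whatever inspection the tool would actually perform.

```python
import tempfile
from email.message import Message
from pathlib import Path

# Assumption: these two media types are what the tool should treat as HTML.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_html(content_type_header: str) -> bool:
    """Return True if a Content-Type header value denotes HTML."""
    # Message parses the header and strips parameters such as charset.
    msg = Message()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type() in HTML_TYPES


def process_non_html(payload: bytes, filename: str) -> int:
    """Write the payload to a temporary directory, inspect it, then clean up.

    Returns the size in bytes as a placeholder for real inspection.
    """
    with tempfile.TemporaryDirectory() as tmp:
        # Keep only the final path component so the file stays inside tmp.
        path = Path(tmp) / Path(filename).name
        path.write_bytes(payload)
        size = path.stat().st_size  # hypothetical stand-in for inspection
    # The temporary directory and the file in it are deleted at this point.
    return size
```

With something along these lines, downloaded PDFs or DOCX files would no longer accumulate in the working directory, because the temporary directory is deleted as soon as the `with` block exits.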