Steve Kemp wrote:
> A simple request which is confusing me mightily!
>
> I'd like to download a remote webpage *including* any images, CSS
> files, etc. which are required, and rewrite the links so they work in
> the local copy. This is simple stuff with wget usually, but I'm
> running into problems because I need the initial page to be
> downloaded under a fixed name.
>
> wget seems to dislike my initial attempt:
>
> wget -O index.html --no-clobber --page-requisites \
>      --convert-links --no-directories http://en.wikipedia.org/
>
> The "--no-clobber" here, designed to avoid a file overwriting one
> which already exists, stops things from working.
>
> curl seems to allow me to name files like -o "index_#1", but it
> doesn't do rewriting of the page contents (images/CSS/etc.).
>
> (I'm trying to create archives of bookmarks in an online bookmark
> application - so I want files for bookmark "xx" to be located in
> /path/to/archives/xx/ - which is why I have to insist upon "index.html"
> as the initial page.)
>
> I guess I could use Perl to get a URL's contents, parse it for
> links, and then get them individually - but it seems like this should
> be a simple request... I looked at httrack too, but that seemed
> confusingly complex.
>
> Steve
Could you use
wget -nv en.wikipedia.org
to get the name of the first file in a relatively-easy-to-parse format?
Then after running
wget --page-requisites --convert-links --no-directories \
en.wikipedia.org
you could just rename the relevant file. As you aren't doing recursion,
links shouldn't be messed up.
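Something along these lines might do it (an untested sketch: the sed
expression assumes the usual -nv log lines ending in -> "filename" [1],
and $url and $dir are just placeholders for your bookmark URL and
archive directory):

    #!/bin/sh
    url=http://en.wikipedia.org/
    dir=/path/to/archives/xx        # hypothetical bookmark directory
    mkdir -p "$dir" && cd "$dir" || exit 1

    # -nv logs one line per saved file; wget writes its log to stderr
    wget -nv --page-requisites --convert-links --no-directories \
         "$url" 2> wget.log

    # The first log line names the top-level page; pull the quoted
    # filename out of it.
    first=$(sed -n 's/.*-> "\(.*\)".*/\1/p' wget.log | head -n 1)

    # Rename it to index.html unless wget already chose that name.
    if [ -f "$first" ] && [ "$first" != index.html ]; then
        mv -- "$first" index.html
    fi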
The -O option won't work because it will try to put all the files
(images, CSS, etc.) into a single file. And --no-clobber is no good
to you either.
Hope that helps.
cheers
Chris
--
Chris Dennis cgdennis@???
Fordingbridge, Hampshire, UK