Preston Maness ☭<p>I've mirrored a relatively simple website (redsails.org; it's mostly text, some images) for posterity via <a href="https://tenforward.social/tags/wget" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>wget</span></a>. However, I also wanted to grab snapshots of any outlinks (of which there are many, as citations/references). I couldn't figure out a configuration where wget would do that out of the box without endlessly, recursively spidering the whole internet. I ended up making a kind of poor man's <a href="https://tenforward.social/tags/ArchiveBox" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>ArchiveBox</span></a> instead:</p><p>for i in $(cat others.txt) ; do dirname=$(echo "$i" | sha256sum | cut -d' ' -f 1) ; mkdir -p "$dirname" ; wget --span-hosts --page-requisites --convert-links --backup-converted --adjust-extension --tries=5 --warc-file="$dirname/$dirname" --execute robots=off --wait 1 --waitretry 5 --timeout 60 -o "$dirname/wget-$dirname.log" --directory-prefix="$dirname/" "$i" ; done</p><p>Basically, there's a list of bookmarks^W URLs in others.txt that I grabbed from the initial mirror of the website with some <a href="https://tenforward.social/tags/grep" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>grep</span></a>-fu. I want to make as good a mirror/snapshot of each specific URL as I can, without spidering/mirroring endlessly all over. So, I hash the URL to get a unique directory name, and kick off a dedicated wget job for it that will span hosts, but only for the purpose of making that specific URL as usable locally/offline as possible. I know from experience that this isn't perfect. But... it'll be good enough for my purposes. I'm also stashing a WARC file. 
Probably a bit overkill, but I figure it might be nice to have.</p><p><a href="https://tenforward.social/tags/RedSails" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>RedSails</span></a> <a href="https://tenforward.social/tags/archive" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>archive</span></a> <a href="https://tenforward.social/tags/archival" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>archival</span></a> <a href="https://tenforward.social/tags/archiving" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>archiving</span></a> <a href="https://tenforward.social/tags/warc" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>warc</span></a></p>
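<p>The grep step that produced others.txt isn't shown above, so here is a minimal sketch of one way it might be done. It assumes GNU grep, a mirror made up of .html files, and links written as double-quoted href attributes. The function names extract_outlinks and url_dirname are hypothetical, made up for this illustration; only the hashing pipeline inside url_dirname is taken from the loop in the post.</p>

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the original post): print the unique
# external links found in a local mirror, one URL per line.
#   $1 = mirror directory
#   $2 = hostname of the mirrored site itself (its links are excluded)
extract_outlinks() {
  local mirror_dir=$1 own_host=$2
  # Escape dots so the hostname is matched literally in the regex below.
  local host_re=${own_host//./\\.}
  grep -rhoE --include='*.html' 'href="https?://[^"]+"' "$mirror_dir" \
    | sed -E 's/^href="//; s/"$//' \
    | grep -vE "^https?://([^/]*\.)?${host_re}(/|\$)" \
    | LC_ALL=C sort -u
}

# Same naming scheme as the loop in the post: SHA-256 of the URL
# (including the trailing newline that echo appends).
url_dirname() {
  echo "$1" | sha256sum | cut -d' ' -f1
}
```

<p>Something like <code>extract_outlinks ./redsails.org redsails.org > others.txt</code> would then feed the loop, assuming the mirror lives in ./redsails.org. Links in single quotes, or in non-HTML files, would be missed by this sketch.</p>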