Monday, March 9, 2009

UrlExtract - Extract URLs from a web page

This is a very simple shell script that uses lynx and sed to extract the URLs linked from a web page.

Script

url-extract.sh:

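#!/bin/sh

# Process every URL given on the command line.
for URL
do
    # Dump the page as text, keep the numbered links listed after the
    # "References" heading, strip the numbering, and encode spaces as %20.
    lynx -dump "$URL" | sed -n -e '/^References$/,$s/^ *[0-9]\+\. \+//p' | sed -e 's/ /%20/g'
done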
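To see what the sed filters do without fetching anything, here is a minimal sketch that runs them on a canned sample. The sample text and the function names sample_dump and extract_urls are made up for illustration; the sample only approximates the layout of lynx -dump output (a "References" heading followed by a numbered list of links). The `\+` in the pattern is a GNU sed extension, so this assumes GNU sed.

```shell
#!/bin/sh

# Hypothetical stand-in for `lynx -dump` output: page text, then a
# "References" heading followed by the numbered list of links.
sample_dump() {
    cat <<'EOF'
   Example page with [1]a link and [2]another.

References

   1. http://example.com/a.html
   2. http://example.com/my file.pdf
EOF
}

# The same two sed filters used in url-extract.sh:
# 1. from the "References" line to the end, strip the leading
#    "  N. " numbering and print only the lines where that succeeded;
# 2. replace every space with %20.
extract_urls() {
    sed -n -e '/^References$/,$s/^ *[0-9]\+\. \+//p' | sed -e 's/ /%20/g'
}

sample_dump | extract_urls
```

With this sample the script prints the two URLs, one per line, with the space in the second one encoded as %20.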

Usage

Usage is straightforward:

url-extract.sh <url> ...

Combine it with grep, xargs, and wget to download only the URLs contained in a page that match a certain regular expression:

url-extract.sh <url> | grep <regex> | xargs wget
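As a rough illustration of that pipeline, the sketch below filters a list of URLs and hands the matches to xargs. The URL list, the `\.pdf$` pattern, and the filter_pdf name are assumptions chosen for the example; the list stands in for the output of url-extract.sh. It is a dry run: `echo wget` only prints the command that would be executed.

```shell
#!/bin/sh

# Hypothetical URL list standing in for the output of url-extract.sh.
urls='http://example.com/index.html
http://example.com/paper.pdf
http://example.com/slides.pdf
http://example.com/logo.png'

# Keep only the URLs that end in .pdf (example pattern).
filter_pdf() {
    grep -E '\.pdf$'
}

# Dry run: xargs appends the matching URLs to "echo wget", so the
# wget command line is printed instead of being executed.
printf '%s\n' "$urls" | filter_pdf | xargs echo wget
```

Removing `echo` makes xargs run wget on the matching URLs instead of printing the command.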


UrlExtract homepage
José Fonseca's Tech blog (UrlExtract author)
jrfonseca, José Fonseca's utilitarian scripts
