Hello!
In this example, I want to show you how to easily extract file links from HTML pages.

Start here:

  1. Make sure lxml is installed on your Linux system:
    • sudo dnf install python2-lxml on Fedora
    • sudo apt install python2-lxml on Ubuntu
  2. Clone the code (or copy it from the code preview below):
    • git clone https://gist.github.com/e045c27e0d27ea6baab4ddce16b906ab.git ~/bin
  3. Set the execute permission:
    • chmod +x ~/bin/getlink.py
  4. All done. How to use it:
    • Run the script with its full path, and put the page address in place of “URL”:
      • # ~/bin/getlink.py "URL"
    • If everything is set up correctly, you should see a list of links.
      In this example, I want to search for .rar files on this page:
      • $ getlink.py "http://p30download.com/fa/entry/69723/" | grep "\.rar"
    • I usually like to download the output list to my local computer; one way to do that is sketched right after this list.
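The command for this last step is not shown in the original post, so the following is only a rough sketch of one way to do it, assuming you want to save the filtered list and then fetch every file in it with wget (the file name rar-links.txt is made up for the example):

  • $ getlink.py "http://p30download.com/fa/entry/69723/" | grep "\.rar" > rar-links.txt
  • $ wget -i rar-links.txt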

Notes:

  1. To support non-ASCII charsets, I’ve added lines 6 and 7.
  2. With grep/sed/tr/awk you can use more complex patterns to find your target, e.g. grep -Eo '\b[[:digit:]]{2}-[[:upper:]][[:lower:]]{2}-[[:digit:]]{2}\b'
  3. Add PATH=$PATH:$HOME/bin to your .bash_profile or .profile file so you can run getlink.py without typing its full path.
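For example, the pattern from note 2 can be combined with the script in a single pipeline (the "URL" here is just a placeholder for the page you want to scan):

  • $ getlink.py "URL" | grep -Eo '\b[[:digit:]]{2}-[[:upper:]][[:lower:]]{2}-[[:digit:]]{2}\b'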

Code preview:
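The code preview box from the original gist is not reproduced here. Below is only a minimal sketch of what such a script could look like, assuming Python 2 and lxml as in step 1; it is not the author's actual code, and its line numbers do not necessarily match note 1.

    #!/usr/bin/env python2
    # -*- coding: utf-8 -*-
    # getlink.py -- print every link found on the page given as the first argument.
    # NOTE: a sketch only, not the original gist code.
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')   # handle non-ASCII charsets (presumably what note 1 refers to)
    import urllib2
    from lxml import html

    def main():
        if len(sys.argv) != 2:
            print >> sys.stderr, 'usage: getlink.py "URL"'
            sys.exit(1)
        url = sys.argv[1]
        page = urllib2.urlopen(url).read()   # fetch the raw HTML
        tree = html.fromstring(page)         # parse it with lxml
        tree.make_links_absolute(url)        # turn relative hrefs into absolute URLs
        for element, attribute, link, pos in tree.iterlinks():
            if attribute == 'href':          # keep only href links (e.g. <a href=...>)
                print link

    if __name__ == '__main__':
        main()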

If you enjoyed this, please give it a star at https://gist.github.com/Ahmadly/e045c27e0d27ea6baab4ddce16b906ab
Thank you,
Ahmad