David Nash

Wordpress Guru Sydney

Delete links across multiple HTML files in bash

Posted on December 5, 2011

Or any type of file, really. I needed to quickly remove links from an old website that was built from flat HTML files. On the Linux command line, I found I could do:

perl -pi -e 's/SEARCH/REPLACE/g' *.html

This replaces every instance of SEARCH with REPLACE in all the *.html files in the current directory, editing them in place.
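If you want to see what the flags do before pointing this at real files, here is a small sketch that runs in a throwaway directory. The file names and contents are made up for illustration, and the .bak suffix on -i is an optional extra that keeps a backup copy of each file.

```shell
# Work in a scratch directory so nothing real is touched (hypothetical files).
demo_dir=$(mktemp -d)
printf 'one SEARCH two SEARCH\n' > "$demo_dir/a.html"
printf 'no match here\n'         > "$demo_dir/b.html"

# -p      read each line, run the code on it, then print it
# -i.bak  write the result back to the file, keeping the original as *.bak
# -e      the Perl code to run; the trailing g replaces every match on a line
perl -pi.bak -e 's/SEARCH/REPLACE/g' "$demo_dir"/*.html

cat "$demo_dir/a.html"      # one REPLACE two REPLACE
cat "$demo_dir/a.html.bak"  # one SEARCH two SEARCH
```

Files without a match, like b.html here, come out with the same content they went in with.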

Except I needed to do a fair bit of escaping, because HTML is full of characters that mean something else on the command line.

So let’s say the string I needed to remove was:

<a title="Search Engine Optimisation" href="http://superspammyseocompany.com/" target="_self"><span>Search Engine Optimisation</span></a> by <a title="Super Spammy SEO Company" href="http://superspammyseocompany.com/" target="_self">Super Spammy SEO Company</a>

I copy + pasted this into vim, and then every time these characters occur:

<, >, / and "

I put a \ in front of each of these, which gave me:

\<a title=\"Search Engine Optimisation\" href=\"http:\/\/superspammyseocompany.com\/\" target=\"_self\"\>\<span\>Search Engine Optimisation\<\/span\>\<\/a\> by \<a title=\"Super Spammy SEO Company\" href=\"http:\/\/superspammyseocompany.com\/\" target=\"_self\"\>Super Spammy SEO Company\<\/a\>

Which was a bit of work, but still much more fun than manually removing the link from each file.

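Note that these characters do not need to be escaped with a backslash:

= (equals), . (dot), and _ (underscore)

(An unescaped dot matches any character in a regular expression, so it still matches the literal dot here. Strictly speaking, only the / has to be escaped, because it is also the delimiter in s/SEARCH/REPLACE/; the backslashes in front of <, > and " are harmless but optional.)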

So my final command was:

perl -pi -e 's/\<a title=\"Search Engine Optimisation\" href=\"http:\/\/superspammyseocompany.com\/\" target=\"_self\"\>\<span\>Search Engine Optimisation\<\/span\>\<\/a\> by \<a title=\"Super Spammy SEO Company\" href=\"http:\/\/superspammyseocompany.com\/\" target=\"_self\"\>Super Spammy SEO Company\<\/a\>//' *.html

I’d already initialised a git repository and committed the files so I could easily restore the files in case of a mistake. A quick look through the links showed it all worked perfectly, and it saved me so much time I thought I’d write this post about it.
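The git safety net might look roughly like the sketch below. It assumes git is installed, and the repository, file name, commit message and shortened link are placeholders rather than the ones from the real site.

```shell
# Hypothetical repository with a single page, committed before the bulk edit.
repo=$(mktemp -d)
cd "$repo"
git init -q
printf '<p>keep</p><a href="http://example.com/">spam</a>\n' > page.html
git add page.html
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'Snapshot before bulk edit'

# The bulk edit itself, on a much shorter link than the real one.
perl -pi -e 's/<a href="http:\/\/example.com\/">spam<\/a>//' page.html

git diff --stat                   # summary of which files changed
changed=$(git diff --name-only)   # just the names of the changed files

# If the result looks wrong, throw the edit away and try again.
git checkout -- page.html
```

Running git diff without --stat shows the exact lines that were touched, which is handy for spot-checking a few files before committing the result.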

Bonus: I wrote the names of all the changed files to list.html, one filename per line, like:

./file1.html
./file2.html
./file3.html
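One possible way to build a list like that, assuming it is captured before running the replacement, is to ask grep which files contain the string. The file names, their contents and the shortened link below are placeholders.

```shell
# Scratch directory with three hypothetical pages; only two contain the link.
work=$(mktemp -d)
cd "$work"
printf 'x <a href="http://example.com/">spam</a>\n' > file1.html
printf 'nothing to remove\n'                        > file2.html
printf '<a href="http://example.com/">spam</a> y\n' > file3.html

# -l prints only the names of files with a match.
# -F treats the pattern as a fixed string, so no regex escaping is needed.
grep -lF '<a href="http://example.com/">spam</a>' ./file*.html > list.html

cat list.html
```

The glob is ./file*.html rather than ./*.html so that list.html itself can never end up in the list.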

Here’s the vim command to turn them all into links, for easy human checking:

:%s/^\(.*\)$/<a href="\1">\1<\/a><br>/
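If you would rather stay on the command line, the same transformation can be sketched with the perl one-liner from earlier. This is an equivalent I am suggesting, not what I ran at the time; it uses braces as the delimiter so the slash in the closing tag needs no backslash, and the list contents are sample data.

```shell
# Hypothetical list of changed files, one per line.
tmp=$(mktemp -d)
printf './file1.html\n./file2.html\n' > "$tmp/list.html"

# s{...}{...} is s/// with braces as delimiters.
# (.*) captures the whole line (without its newline) and $1 reuses it twice.
perl -pi -e 's{^(.*)$}{<a href="$1">$1</a><br>}' "$tmp/list.html"

cat "$tmp/list.html"
# <a href="./file1.html">./file1.html</a><br>
# <a href="./file2.html">./file2.html</a><br>
```

Open the resulting list.html in a browser from the same directory as the pages and each line becomes a clickable link.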
