Delete links across multiple HTML files in bash

David Nash | bash, linux, vim

Or any type of file, really. I needed to quickly remove links from an old website that was built from flat HTML files. On the Linux command line, I found I could do:
[bash]perl -pi -e 's/SEARCH/REPLACE/g' *.html[/bash]

This replaces every instance of SEARCH with REPLACE in all the .html files in the current directory.

Except I needed to do a fair bit of escaping, because HTML is full of characters that could mean something else inside the substitution, above all the / that separates SEARCH from REPLACE.

So let’s say the string I needed to remove was:
[sourcecode]<a title="Search Engine Optimisation" href="http://superspammyseocompany.com/" target="_self"><span>Search Engine Optimisation</span></a> by <a title="Super Spammy SEO Company" href="http://superspammyseocompany.com/" target="_self">Super Spammy SEO Company</a>[/sourcecode]

I copied and pasted this into vim and put a \ in front of every occurrence of these characters:

< , > , / and "

That gave me:

[sourcecode]\<a title=\"Search Engine Optimisation\" href=\"http:\/\/superspammyseocompany.com\/\" target=\"_self\"\>\<span\>Search Engine Optimisation\<\/span\>\<\/a\> by \<a title=\"Super Spammy SEO Company\" href=\"http:\/\/superspammyseocompany.com\/\" target=\"_self\"\>Super Spammy SEO Company\<\/a\>[/sourcecode]

Which was a bit of work, but still much more fun than manually removing the link from each file.

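Note that these characters do not need to be escaped with a backslash:

= (equals), . (dot), and _ (underscore)

Strictly speaking, an unescaped dot in a Perl pattern matches any character, so it still matches the literal dot here; write \. if you need an exact match. Of the characters I did escape, only / really needs it, because it is the delimiter. Perl treats \<, \> and \" as the plain characters, so the extra backslashes are harmless.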

So my final command was:

[sourcecode]perl -pi -e 's/\<a title=\"Search Engine Optimisation\" href=\"http:\/\/superspammyseocompany.com\/\" target=\"_self\"\>\<span\>Search Engine Optimisation\<\/span\>\<\/a\> by \<a title=\"Super Spammy SEO Company\" href=\"http:\/\/superspammyseocompany.com\/\" target=\"_self\"\>Super Spammy SEO Company\<\/a\>//' *.html[/sourcecode]

I’d already initialised a git repository and committed the files so I could easily restore the files in case of a mistake. A quick look through the links showed it all worked perfectly, and it saved me so much time I thought I’d write this post about it.
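Roughly, that safety net looks like the sketch below, assuming git is installed. The file, the commit message and the demo identity are placeholders, and a short pattern stands in for the real link:

```shell
# Work in a scratch directory so nothing real is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Commit the files before touching them.
git init -q .
printf 'keep <b>REMOVE</b> keep\n' > page.html
git add page.html
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'Before bulk edit'

# Run the bulk edit.
perl -pi -e 's/<b>REMOVE<\/b>//g' *.html

git diff --stat      # which files changed, and by how much
git checkout -- .    # throw the edit away if the result looks wrong
cat page.html        # back to the committed version
```

If the diff looks right, commit it instead of running the checkout.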

Bonus: I wrote the names of all the changed files to list.html, one filename per line, like:

[sourcecode]./file1.html
./file2.html
./file3.html
[/sourcecode]
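The post does not say how that list was produced. One possible way, sketched here with made-up sample files, is to ask grep which files contain the spammy domain before running the replacement:

```shell
# Work in a scratch directory so nothing real is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

printf '<a href="http://superspammyseocompany.com/">x</a>\n' > file1.html
printf '<p>no link here</p>\n' > file2.html
printf '<a href="http://superspammyseocompany.com/">y</a>\n' > file3.html

# -l prints only the names of matching files, -F treats the pattern as a
# fixed string. Writing to a .txt file first keeps the list itself out of
# the *.html glob; it is renamed afterwards.
grep -lF 'superspammyseocompany.com' ./*.html > list.txt
mv list.txt list.html

cat list.html
```

With a git repository in place, git diff --name-only after the edit would be another way to get the same list.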

Here’s the vim command to turn them all into links, for easy human checking:

[sourcecode]:%s/^\(.*\)$/<a href="\1">\1<\/a><br>/[/sourcecode]
