Regular expressions for scraping icons, titles, descriptions and more from websites

June 26 2016

I'm building a link preview fetcher endpoint for Ada, and I had to assemble some regular expressions for pulling icons, titles and descriptions out of websites so that users get a nice link preview in their chat conversations. I thought it would be good to share these, as it took me a couple of years to get fluent with regex and I had to test these on a number of different sites to make sure they worked. I hope they are useful to someone else parsing HTML from webpages.

Parsing the title

This can be straightforward, depending on the site. Most websites set the title dynamically to match the content on the page, so you can just use the content of the title element:

\<title\>(.*?)\<\/title\>
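Here is a minimal sketch of applying that expression, assuming Python's built-in re module. The function name parse_title and the sample markup are made up for illustration, and the two flags are my own additions: IGNORECASE so that <TITLE> also matches, and DOTALL so that a title broken across lines is still captured.

```python
import re

# The expression from above with the optional backslashes removed.
# The flags are an assumption, not part of the original expression.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def parse_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


sample_html = "<html><head><title>\n  Example Domain\n</title></head></html>"
print(parse_title(sample_html))  # Example Domain
```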

Unfortunately some sites don't do this. Also, a lot of sites will append or prepend the site name to the title element. I recommend preferring the meta tag for the title (expression below) and falling back on the title element if that doesn't return anything.

\<meta.*?(name|property)=\".*?title\".*?content=\"(.*?)\".*?\>
// Use the second capture group to get the content
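The "prefer the meta tag, fall back on the title element" approach could look roughly like this in Python. This is only a sketch: parse_preview_title and the sample markup are hypothetical names for illustration, and the flags are my own assumption.

```python
import re

# Both expressions from above, written without the optional backslashes.
META_TITLE_RE = re.compile(
    r'<meta.*?(name|property)=".*?title".*?content="(.*?)".*?>',
    re.IGNORECASE,
)
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def parse_preview_title(html):
    """Prefer the meta title; fall back on the <title> element."""
    meta = META_TITLE_RE.search(html)
    # Group 1 is the attribute name (name/property), group 2 is the content.
    if meta and meta.group(2).strip():
        return meta.group(2).strip()
    title = TITLE_RE.search(html)
    return title.group(1).strip() if title else None


with_meta = (
    "<head><title>My Post | Example Site</title>"
    '<meta property="og:title" content="My Post"></head>'
)
without_meta = "<head><title>Only Title</title></head>"

print(parse_preview_title(with_meta))     # My Post
print(parse_preview_title(without_meta))  # Only Title
```

Keep in mind that the meta expression expects the name or property attribute to come before content, so tags with the attributes in the opposite order fall through to the title element.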

Parsing the description

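This is just a modified version of the above expression, scoped to descriptions only. Note that it matches on the name attribute alone, so a tag written as property="og:description" will not be picked up.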

\<meta.*?name=\".*?description\".*?content=\"(.*?)\".*?\>
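As a rough example of using the description expression in Python: the helper name parse_description and the sample tag are invented for illustration, and decoding HTML entities with html.unescape is an extra step I'm assuming you'd want before showing the text in a preview.

```python
import html
import re

# The description expression from above, without the optional backslashes.
DESCRIPTION_RE = re.compile(
    r'<meta.*?name=".*?description".*?content="(.*?)".*?>',
    re.IGNORECASE,
)


def parse_description(page):
    """Return the meta description with HTML entities decoded, or None."""
    match = DESCRIPTION_RE.search(page)
    if not match:
        return None
    return html.unescape(match.group(1)).strip()


sample_tag = '<meta name="description" content="Tips &amp; tricks for scraping">'
print(parse_description(sample_tag))  # Tips & tricks for scraping
```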

Parsing the favicon or shortcut icon

\<link.*?rel=\".*?shortcut icon\".*?href=\"(.*?)\".*?\>
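The href you get back is often relative (for example /favicon.ico), so it usually needs resolving against the page URL before you can fetch it. Below is a small sketch of that, assuming Python's re and urllib.parse modules. The name parse_icon_url and the example URLs are hypothetical.

```python
import re
from urllib.parse import urljoin

# The icon expression from above, without the optional backslashes.
ICON_RE = re.compile(
    r'<link.*?rel=".*?shortcut icon".*?href="(.*?)".*?>',
    re.IGNORECASE,
)


def parse_icon_url(html, page_url):
    """Return the absolute icon URL, or None if no shortcut icon is declared."""
    match = ICON_RE.search(html)
    if not match:
        return None
    # urljoin leaves absolute hrefs alone and resolves relative ones.
    return urljoin(page_url, match.group(1))


sample_link = '<link rel="shortcut icon" href="/static/favicon.ico">'
print(parse_icon_url(sample_link, "https://example.com/blog/post"))
# https://example.com/static/favicon.ico
```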

I hope that helps anyone struggling to parse content from webpages.

Bringing it home

I thought I'd share how this project turned out. I'm really happy with how much link previews improve the experience of sharing links in conversations.

Before and After

🖖

Note: You may need to add or remove some backslashes to the above regular expressions depending on your environment.
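To illustrate the point about backslashes: in Python's re module (my assumption for this example) the escaped and unescaped forms of the title expression behave the same, so the backslashes can simply be dropped. That is not true everywhere. GNU grep, for instance, treats \< and \> as word boundaries, so check what your own engine does.

```python
import re

# The title expression exactly as written above...
escaped = re.compile(r"\<title\>(.*?)\<\/title\>")
# ...and the same expression with the backslashes removed.
unescaped = re.compile(r"<title>(.*?)</title>")

page = "<title>Hello</title>"
escaped_result = escaped.search(page).group(1)
unescaped_result = unescaped.search(page).group(1)

# In Python both forms capture the same text.
print(escaped_result == unescaped_result)  # True
```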