Regular expressions for scraping icons, titles, descriptions and more from websites
June 26 2016
I'm building a link preview fetcher endpoint for Ada, and I had to assemble some regular expressions for fetching icons, titles and descriptions from websites so that users get a nice link preview in their chat conversations. I thought it would be good to share these with others, as it took me a couple of years to get fluent with regex and I had to test these on a number of different sites to make sure they worked. I hope they are useful to someone else parsing HTML from webpages.
Parsing the title
This can be straightforward, depending on the site. Most websites set the title dynamically to match the content on the page, so you can just use the content of the title element.
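As a rough illustration, here is a minimal Python sketch of pulling the text out of the title element. The pattern and the function name `parse_title_element` are my own assumptions for this example, not a tested production expression; it assumes a single title element and collapses any line breaks or extra spaces inside it.

```python
import re

# Hypothetical pattern: captures whatever sits between <title ...> and </title>.
# DOTALL lets the title span several lines; IGNORECASE tolerates <TITLE>.
TITLE_ELEMENT_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def parse_title_element(html):
    """Return the text of the first title element, or None if there is none."""
    match = TITLE_ELEMENT_RE.search(html)
    if match is None:
        return None
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(match.group(1).split())
```

Calling `parse_title_element(page_source)` on a fetched page gives you the raw title, including any site name the page appends to it.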
Unfortunately, some sites don't do this. Also, a lot of sites append or prepend the site name to the title element. I recommend preferring the meta tag for the title (expression below) and falling back to the title element if that doesn't return anything.
\<meta.*?(name|property)=\".*?title\".*?content=\"(.*?)\".*?\> // Use the second capture group to get the content
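To show how the meta-first, title-element-second approach might fit together, here is a small Python sketch. It uses the meta expression above with the backslashes dropped, since Python raw strings don't need them. The title element pattern and the function name `parse_title` are assumptions made for this example. Note that the meta expression only matches when the name or property attribute comes before content and both use double quotes.

```python
import re

# The meta expression from above, written as a Python raw string.
META_TITLE_RE = re.compile(
    r'<meta.*?(name|property)=".*?title".*?content="(.*?)".*?>',
    re.IGNORECASE,
)
# Hypothetical fallback pattern for the plain title element.
TITLE_ELEMENT_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def parse_title(html):
    """Prefer a meta title (og:title, twitter:title, ...), else the title element."""
    meta = META_TITLE_RE.search(html)
    # The second capture group holds the content attribute.
    if meta and meta.group(2).strip():
        return meta.group(2).strip()
    element = TITLE_ELEMENT_RE.search(html)
    if element:
        # Collapse whitespace; treat an empty title as missing.
        return " ".join(element.group(1).split()) or None
    return None
```

With this ordering, a page whose title element reads "Post | Site" but whose meta tag says "Post" yields the cleaner "Post".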
Parsing the description
This is just a modified version of the above expression, but scoped only for descriptions.
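A minimal Python sketch of that modification follows, assuming the only change needed is swapping "title" for "description" in the meta expression. The function name `parse_description` is hypothetical, and the HTML entity decoding is an extra step I've added for the example. As before, it expects the name or property attribute to come before content, with double-quoted values.

```python
import re
from html import unescape

# Assumed variant of the meta title expression, scoped to descriptions.
# It matches name="description" as well as property="og:description".
META_DESCRIPTION_RE = re.compile(
    r'<meta.*?(name|property)=".*?description".*?content="(.*?)".*?>',
    re.IGNORECASE,
)


def parse_description(html):
    """Return the first meta description found, or None."""
    match = META_DESCRIPTION_RE.search(html)
    if match is None:
        return None
    # Decode entities such as &amp; so the preview shows plain text.
    return unescape(match.group(2)).strip() or None
```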
Parsing the favicon or shortcut icon
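Here is a minimal Python sketch of one way to approach this, under a few assumptions: the icon is declared with a link element whose rel value contains "icon" (which covers "icon", "shortcut icon" and "apple-touch-icon"), rel appears before href, and both use double quotes. The pattern and the function name `parse_icon` are my own for this example. When nothing matches, it falls back to /favicon.ico at the site root, which is where browsers look by default.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: a link tag with rel containing "icon", then an href.
ICON_RE = re.compile(
    r'<link[^>]*?rel="[^"]*icon[^"]*"[^>]*?href="([^"]*)"[^>]*>',
    re.IGNORECASE,
)


def parse_icon(html, page_url):
    """Return an absolute icon URL for the page at page_url."""
    match = ICON_RE.search(html)
    if match and match.group(1):
        # Icon hrefs are often relative, so resolve them against the page URL.
        return urljoin(page_url, match.group(1))
    # No icon declared (or href came before rel): try the conventional location.
    return urljoin(page_url, "/favicon.ico")
```

Resolving against the page URL matters because many sites use relative paths such as /static/favicon.png, which are useless to a chat client on their own.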
I hope that helps anyone struggling to parse content from webpages.
Bringing it home
I thought I'd share how this project turned out. I'm really happy with how much link previews improve the experience of sharing links in conversations.
Note: You may need to add or remove some backslashes to the above regular expressions depending on your environment.