Blog - Unity Behind Diversity

Searching for beauty in the dissonance

Tagged: php

Cleaning up HTML entities in MySpace blog RSS feeds (or how to eliminate squidginess)

I recently set up a Facebook musician page for Robyn Dell’Unto. We ran into one really annoying problem importing her blog posts from her MySpace blog. As Robyn described it,

my only issue with the notes is that they go all squidgy when there’s punctuation in the title. which, frankly, embarrasses me! I’m really embarrassed by squidgy punctuation!

By “squidgy,” she meant that the HTML entities were not displaying properly. Titles from imported posts displayed like this: “I&#8217;m doing stuff I swear.”

Ugh.

First, I thought it was a problem with Facebook Notes, but upon inspecting the MySpace RSS feed, I found that (aside from being woefully invalid — iTunes?) MySpace seems to have no freaking clue how to handle HTML entities properly. It’s no secret that I’m not a fan of MySpace. Why would I expect a valid feed? *sigh*

There were two really annoying things that MySpace was doing (aside from the whole iTunes thing):

  1. They double encode entities. Sure, it’s necessary that they turn each & into &amp; in links, but not in text that they’ve already encoded!! This leads to the &#8217; “squidgies” in the titles.
  2. There are a bunch of Unicode characters that they don’t encode. For all the double encoding, other characters which ought to be encoded are missed entirely.
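To make the double-encoding problem concrete, here's a minimal PHP sketch. The sample title is made up, modelled on the one above, and I'm assuming the feed uses numeric entities like &#8217;. It only illustrates the problem; it isn't the cleanup code itself.

```php
<?php
// Hypothetical title as it might appear in the feed source: the apostrophe
// was encoded as &#8217;, and then its ampersand was encoded a second time.
$doubleEncoded = 'I&amp;#8217;m doing stuff I swear.';

// Undo exactly one layer of encoding: &amp; becomes & again.
$singleEncoded = htmlspecialchars_decode($doubleEncoded, ENT_QUOTES);

// Decode the remaining entity into the actual UTF-8 character.
$decoded = html_entity_decode($singleEncoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $singleEncoded, PHP_EOL; // I&#8217;m doing stuff I swear.
echo $decoded, PHP_EOL;       // I’m doing stuff I swear.
```

A reader that decodes the feed only once ends up showing the middle form, entity code and all.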

On top of that, I discovered that Facebook won’t display any of the Unicode characters (I think?) even when they are represented by the proper HTML entities. They just display the entity code, causing the &#8217; “squidgies.”

Now, I’m no expert on character encoding and HTML entities, but I can do better than that. I’ve hacked together some PHP code to clean up the feed a bit before importing to Facebook, which has solved all of our problems so far. I realize I’m only addressing a limited subset of Unicode character entities, but it’s working for our purposes for now.
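For a rough idea of what a cleanup like this can look like, here's a hedged sketch. It is not the linked code: the function name clean_feed_text, the choice of characters in the map, and the decision to fall back to plain ASCII punctuation are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper: undo double encoding in a feed title or body, then
 * swap a small, hand-picked set of typographic characters for plain ASCII.
 */
function clean_feed_text(string $text): string
{
    // Step 1: collapse &amp; back to & only where it is followed by
    // something that looks like an entity (e.g. &amp;#8217; or &amp;rsquo;).
    $text = preg_replace(
        '/&amp;(#\d+|#x[0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);/',
        '&$1;',
        $text
    );

    // Step 2: decode every entity to UTF-8, so encoded and unencoded
    // characters can be handled the same way in the next step.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Step 3: replace a limited subset of characters with ASCII stand-ins.
    $map = [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote / apostrophe
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2026}" => '...', // ellipsis
    ];
    $text = strtr($text, $map);

    // Step 4: re-escape &, < and > so the result is safe to write back
    // into the XML feed. Quotes are left alone in element text.
    return htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8');
}

echo clean_feed_text('I&amp;#8217;m doing stuff I swear.'), PHP_EOL;
// I'm doing stuff I swear.
```

Falling back to ASCII throws away the nice typography, but it sidesteps the question of which entities the importer will actually render. Anything outside the map passes through as raw UTF-8.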

View the code.

It’s nowhere near perfect, but it’s a definite improvement and it works so far. Hopefully this can be of assistance to someone else. Suggestions welcome!

Creative Commons Attribution-ShareAlike 4.0 International