Linux.com

Weekend Project: Harvest Microformats for Fun and Profit

 

Microformats are a way to slip computer-readable information into HTML, so it's "semantically" marked-up, not just visually marked up. With hCalendar, for example, when you advertise a big event on your Web page, it's not only human-readable, but the browser can notice that it's a calendar event and prompt the reader to add it to his or her Sunbird or Google Calendar schedule. The tricky part is, most browsers don't highlight microformats at all, much less prompt you to do something interesting with them. This weekend, you can enable that power in Firefox and start making better use of the microformats that are hidden, all around you, on the Web.

Such semantic Web markup is one of those great-sounding ideas that is just a little too hard to explain to the average non-tech-obsessed person on the street. There are a couple of reasons why. First, it combines multiple existing "formats," which can make it hard to keep the pieces straight. In a nutshell, microformats are a way to wrap invisible HTML or XHTML elements and attributes around text in a page, so that it conforms to some other data format. For example, the hCard microformat squeezes the existing vCard contact format into plain HTML by putting <div> elements around a person's name, email address, and whatnot.
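To make that concrete, here is a minimal sketch in Python (my choice for illustration, not anything Firefox or Operator uses internally). The class names vcard, fn, org, and email come from the hCard format; the sample person, HCardParser, HCARD_FIELDS, and VOID_TAGS are made-up names. It only reads element text, whereas a real parser follows more rules (such as taking some values from attributes), so treat it as a rough picture of how invisible class attributes turn plain HTML into structured data.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the markup looks like ordinary HTML, but the
# class attributes label each piece of text as an hCard field.
SAMPLE = """
<div class="vcard">
  <span class="fn">Jane Example</span>,
  <span class="org">Example Corp</span>,
  <a class="email" href="mailto:jane@example.com">jane@example.com</a>
</div>
"""

HCARD_FIELDS = {"fn", "org", "email"}  # small subset chosen for this sketch
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class HCardParser(HTMLParser):
    """Collect the text of elements whose class names match hCard fields."""

    def __init__(self):
        super().__init__()
        self.cards = []   # one dict per class="vcard" container
        self._stack = []  # one marker per open element: "vcard", a field, or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if "vcard" in classes:
            self.cards.append({})
            self._stack.append("vcard")
        elif "vcard" in self._stack:
            # Inside a card: remember which field (if any) this element labels.
            field = next((c for c in classes if c in HCARD_FIELDS), None)
            self._stack.append(field)
        else:
            self._stack.append(None)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        field = self._stack[-1] if self._stack else None
        if field in HCARD_FIELDS and data.strip():
            card = self.cards[-1]
            card[field] = (card.get(field, "") + data).strip()


parser = HCardParser()
parser.feed(SAMPLE)
print(parser.cards)
```

The printed list holds one dictionary per hCard found, which is roughly the kind of structured record an address book import needs. Nested cards and malformed HTML are not handled here.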

Second, as the name implies, there are multiple microformats, so once you start getting into the idea, you pick up a whole list of formats to keep track of. The big ones are pretty simple, though: the aforementioned hCalendar and hCard, geo for geographic locations, adr for postal mailing addresses, XFN for "friend" relationships, and rel for several varieties of HTML <link> elements. There is a full list of established and draft standards at the microformats wiki.

Now comes the big twist: despite its relative lack of publicity, Firefox has had built-in support for recognizing and parsing all of these microformats and more ever since Firefox 3.0. There is just no user interface. That's where the Operator extension comes in.

I am the Operator of My Microformat Reader

Operator is an XPI extension for Firefox 2.0 and above, including the latest betas of Firefox 4. You can install it from the Mozilla Add-ons site, or from author Mike Kaply's Operator page. Whenever a page loads, Operator recognizes and parses all of the microformats found within, and presents you with actions you can perform for each. For instance, when it finds hCards on a page, you see actions for adding them to your address book, or exporting them as standard vCards. When it finds geo locations, the actions include looking them up in a variety of mapping services, or exporting them as KML data.
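As a rough idea of what a "export as KML" action has to do, here is a minimal Python sketch. It is not Operator's own code (Operator is written in JavaScript); geo_to_kml and the sample place are hypothetical, and the coordinates are approximate values chosen for illustration. The one detail worth noticing is that geo markup gives latitude and longitude as separate fields, while KML writes them in longitude,latitude order.

```python
from xml.sax.saxutils import escape


def geo_to_kml(name, latitude, longitude):
    """Return a minimal KML document holding a single Placemark."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "  <Placemark>\n"
        f"    <name>{escape(name)}</name>\n"
        "    <Point>\n"
        # KML expects longitude first, then latitude.
        f"      <coordinates>{longitude},{latitude}</coordinates>\n"
        "    </Point>\n"
        "  </Placemark>\n"
        "</kml>\n"
    )


# Approximate coordinates for Battle, East Sussex (illustrative values only).
kml = geo_to_kml("Battle, East Sussex", latitude=50.91, longitude=0.49)
print(kml)
```

Saving the result with a .kml extension should be enough for a KML-aware mapping program to open it, though I have only sketched the single-point case here.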

By default, Operator exposes all of this functionality in a Firefox toolbar, with drop down menus for each of the supported formats. The appropriate menu buttons light up automatically when there are microformats on the current page, displaying the number found in parentheses. Under the menu buttons, each microformat item found gets its own entry, which in turn receives its own secondary menu for the available actions. If you get more than a couple dozen of any one format type on a page, you might even have to scroll through them.

If you already have several other toolbars active, it might make more sense to choose one of Operator's other UI options to save vertical real estate. Operator lets you place a button in the status area (at the bottom of the window), inside the location bar (beside the "bookmark" button), or in any other Firefox toolbar (by choosing View -> Toolbars -> Customize). You can also open Operator as a sidebar from View -> Sidebar -> Operator, which presents you with a clickable list. Finally, if you bring up Operator's preferences window, you can activate a "highlight microformats" behavior that pops an outline around any microformat-labeled text when the mouse cursor moves over it. You can then right-click on the item to access the appropriate actions.

Out of the box, Operator supports hCard, hCalendar, geo, adr, xFolk, and RDFa formats. xFolk essentially picks up any hash-tag-like markup, so it will automatically populate its menu with tagged words from StatusNet, Twitter, YouTube, and other social media, plus a surprisingly high number of tag-using blogs and news sites. RDFa is an embedded form of the Resource Description Framework, a metadata standard published by the W3C. It is more complex to author, but it offers many of the same features as microformats, and Firefox speaks it as well, so it makes for a natural fit.

Operator's default actions tie in to popular Web services like Google Calendar, Google Maps, Ma.gnolia, Delicious.com, Yahoo Calendar, 30 Boxes, Mapquest, and Upcoming.org. The xFolk support allows you to find content matching the selected tag on Flickr, YouTube, Yedda, or even Amazon.com product searches.

When you first install Operator, I recommend using the toolbar interface because of how it trains you to notice microformats. After a while, it can get distracting because of the space required, and the location bar button eventually became my interface of choice.

Finding Microformats

Given their utter invisibility in the default Firefox build, you might think there aren't many microformats in the wild. You'd be soooo wrong.

The major social networking sites like Twitter, Identi.ca, Friendfeed, and LinkedIn already mark up all contacts in hCard (Facebook seems to be the lone holdout here). Those that, like Identi.ca, also display geolocation information mark it up with geo tags. That makes it easy to add a contact to your personal address book just by visiting the site and clicking on the Operator menu. You can look up a location you don't recognize in the same manner. Oddly enough, I noticed that although my Twitter account is merely a re-direct from my Identi.ca account, Identi.ca marks up the geo location of each post's origin, but Twitter strips it out.

Yahoo's main services are thoroughly microformat-enabled; Yahoo Mail supports hCard for address book entries, and Yahoo Calendar supports hCalendar event export. Delicious supports a bookmarking microformat that I believe is based on HTML <link> elements, although the Operator documentation of it is sparse. Still, it is nice to have one-click access to adding a link to your own bookmarks, without scrolling several pages down.

Despite being owned by Yahoo, Flickr is inconsistent. I read in an older article on Operator that geotagged photos were marked with geo information, and individual photo pages seem to include RDF data, but I could not find any hCalendar (even on timestamped images) or hCard (for contacts) data at all.

Regarding other popular services, much to my surprise, I discovered that a large number of Wikipedia pages have geo-formatted locations and dates embedded in article headers that are imperfect, but hCalendar compliant. Of course, many of these are historical events, but when you add an event to your calendar (say, The Battle of Hastings) you can add it as a repeating entry to save it as a recurring holiday.
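Here is a minimal Python sketch of what "add it as a repeating entry" looks like once an hCalendar event is exported to the iCalendar format that calendar programs import. The function names, the UID, and the PRODID string are made up for illustration, and this is not how Operator itself does the export. The summary and start date stand in for the hCalendar summary and dtstart fields, and the yearly repeat is a single RRULE line.

```python
from datetime import date, datetime, timezone


def escape_text(value):
    """Escape characters that are special in iCalendar TEXT values."""
    return (
        value.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def event_to_ics(summary, start, uid, yearly=False):
    """Build a one-event iCalendar file from hCalendar-style fields."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example.org//hcalendar sketch//EN",  # made-up product id
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{stamp}",
        # An all-day event: a DATE value with no time part.
        f"DTSTART;VALUE=DATE:{start.isoformat().replace('-', '')}",
        f"SUMMARY:{escape_text(summary)}",
    ]
    if yearly:
        lines.append("RRULE:FREQ=YEARLY")  # repeat on the same date every year
    lines += ["END:VEVENT", "END:VCALENDAR"]
    # iCalendar lines end with CRLF.
    return "\r\n".join(lines) + "\r\n"


ics = event_to_ics(
    summary="Battle of Hastings",
    start=date(1066, 10, 14),
    uid="hastings-1066@example.org",  # hypothetical unique id
    yearly=True,
)
print(ics)
```

Some calendar programs may balk at a start date that far in the past, so for a real anniversary entry you might prefer to start the recurrence in a recent year.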

Many blogging platforms also microformat some or all of their data, including Blogger, but sadly, not WordPress.com. Last.fm marks up contacts and addresses; Meetup.com (as you would expect) marks up contacts, events, and locations. Bugzilla even marks up bug reporters and developers. Perhaps the most interesting microformat feature on a daily basis that I've found is the xFolk tag folksonomy, probably because it works across multiple services. You can pull up blog entries from news outlets on a particular category topic, or Flickr photos related to an Identi.ca discussion (which is ironic considering Flickr's own present lack of microformat markup).

The big hole at the moment is Google. Outside of Blogger's hCard support and YouTube's hCard and xFolk support, none of the other Google services use any microformats at all. You might think that indicates the search giant's lack of interest in the topic, but Google did recently announce that it was going to start indexing microformats as it crawls Web pages, and integrate them into search results. It even indicated support for a few draft microformat specifications, like hReview for product reviews and hProduct for shopping sites. Hopefully it is just a matter of time before support rolls out to the company's other services.

Because the most common microformats use techniques like embedding HTML element ids and classes to label their data fields, Operator will frequently find what you might call "partial hits": for example, a news site that marks up its commenters' avatars with hCard-like element ids or publication dates with hCalendar-like classes. These are not as useful as a full hCard or hCalendar entry, but you can still use Operator to extract the data and do something useful with it. Turn on the "debug" option in Operator's preferences, and the context menu will let you pull up a view on the fully-parsed microformat data, examine its markup, and preview how it will look when exported to an external data format.

Customized Formats and Actions

It's important to remember that which Web sites use microformats has nothing to do with which actions Operator allows you to perform on any of the microformats it finds. That's how you can look up a location on Google Maps, even though Google Maps pages aren't themselves marked up. Operator simply hands the data it discovers elsewhere to the service's API. If you're not interested in a particular action, you can disable it in Operator's preferences window.
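The separation between where data is found and what you do with it can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function names and the GEO_ACTIONS table are made up, and the URL patterns are ones these mapping sites accept publicly, not necessarily what Operator sends. Each action is just a function that takes coordinates and returns a link to open.

```python
from urllib.parse import urlencode


def google_maps_url(lat, lon):
    """Build a Google Maps search link for a coordinate pair."""
    query = urlencode({"api": 1, "query": f"{lat},{lon}"})
    return "https://www.google.com/maps/search/?" + query


def openstreetmap_url(lat, lon):
    """Build an OpenStreetMap link that drops a marker at the coordinates."""
    return "https://www.openstreetmap.org/?" + urlencode({"mlat": lat, "mlon": lon})


# Menu label -> action. Where the coordinates came from does not matter.
GEO_ACTIONS = {
    "Find with Google Maps": google_maps_url,
    "Find with OpenStreetMap": openstreetmap_url,
}

# Approximate coordinates, as they might be read from geo markup on any page.
for label, action in GEO_ACTIONS.items():
    print(label, "->", action(50.91, 0.49))
```

Disabling an action amounts to removing its entry from the table, and adding a new service means adding one more function.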

Operator's pre-defined actions are a good fit for the Web-browsing public, but they are not exactly geared towards Linux users. For instance, they support vCard export and Yahoo Contacts, but not Evolution, and adding events to Google Calendar or Yahoo Calendar, but not Sunbird or Mozilla Lightning. The solution to this is user scripts. Essentially you can write your own actions using JavaScript, and add them to Operator's menus.

Kaply maintains a page at his site listing known user scripts from himself and other Operator users. They include some lookups for additional Web services (such as BlogMarks.net and corkd.com), plus some interesting gems like "Send to Bluetooth device" that enables entirely new functionality. Third-party users have added actions for Skype, Google Earth, ISBN lookup, and more. Unfortunately we're still waiting on good Evolution and Thunderbird actions, but Kaply has a three-part tutorial for writing your own scripts — and there's still a lot of weekend left.

The other interesting accessory on the page at Kaply's site is additional microformat plugins for Operator. You can add them through the extension's preferences window (just click "Add"). These are also written in JavaScript, and the collection he offers includes XFN, hReview, hProduct, hToDo, and several more. Some of these microformat plugins also add new actions (such as basic Yahoo and Google searches), which is an odd mix of functionality, but it's nice to see the new features nevertheless.

Most people probably hear about microformats and think "ehh, that can't be that useful." I'm betting they'll change their minds in less than a week with Operator and Firefox. There is a lot of microformat-enabled data already out there. When you can click and add a Twitter contact's username to his or her address book entry, or an event to your Google calendar, you'll wonder why you ever put up with doing it the old-fashioned way. Plus, it's a lot of fun to see the browser pull out relevant information for you; it makes you wish that functionality were built right in to the browsing experience, not relegated to a plugin. Maybe we'll see that in the next Firefox development cycle. It's clearly useful, and the competing browsers aren't even trying to support it. So let's all mark that as a future event in our calendars.

 
