Thursday, May 18th, 2006
Connotea, social bookmarking for scientists.
Why for scientists? Obviously, scientists and clinicians are a core market. Doesn’t exclude others, but by concentrating on users with a common interest they could increase the discovery benefits. Hooks into academic publishing technologies.
Connotea is an open tool, is social so connects to other users, and has tags. But what it also does is identify articles solely from the bookmark URLs. So it can pull up the citation from the URL - title, author, journal, issue no., page, publication date. This is important for scientists.
Way it does it is by ‘URL scanning’. So user is on a page, e.g. PubMed, which is a huge database of abstracts from biomed publications. When the user clicks ‘Add to Connotea’, this opens a window, recognises that this is a scholarly article, and imports the data.
Uses ‘citation source plug-ins’ - Perl modules for each API. It asks each plug-in whether it recognises the URL, and when one does, that plug-in goes and gets the information, which is then associated with the bookmark in the database.
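A rough sketch of that plug-in dispatch, in Python rather than the Perl the talk mentioned. This is not Connotea's actual code: the class names, the `understands`/`citation` methods, and the PubMed URL shape are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse, parse_qs


class PubMedPlugin:
    """Hypothetical plug-in for PubMed-style URLs (the URL shape is an assumption)."""

    def understands(self, url):
        parts = urlparse(url)
        return (parts.netloc.endswith("ncbi.nlm.nih.gov")
                and "list_uids" in parse_qs(parts.query))

    def citation(self, url):
        # A real plug-in would now query the source's API for title, author, etc.
        pmid = parse_qs(urlparse(url).query)["list_uids"][0]
        return {"source": "pubmed", "id": pmid}


class DoiPlugin:
    """Hypothetical plug-in for any URL that embeds a DOI."""

    DOI = re.compile(r"10\.\d{4,9}/[^\s?#]+")

    def understands(self, url):
        return self.DOI.search(url) is not None

    def citation(self, url):
        return {"source": "doi", "id": self.DOI.search(url).group(0)}


PLUGINS = [PubMedPlugin(), DoiPlugin()]


def lookup(url, plugins=PLUGINS):
    """Ask each plug-in in turn; the first one that recognises the URL answers."""
    for plugin in plugins:
        if plugin.understands(url):
            return plugin.citation(url)
    return None  # no plug-in recognised the URL: store as a plain bookmark
```

Each new source then only needs one more class added to the list, which is also why every publisher ends up needing its own hand-written rules.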
[Now runs through some programming stuff.]
Bookmarked URLs on a lot of these scientific resources are far from clean or permanent and have a lot of session data in them. So this needs cleaning off.
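A minimal sketch of that kind of clean-up, assuming the session data shows up as path parameters or query parameters. The parameter names in `SESSION_PARAMS` are guesses for illustration; in practice each publisher's site would need its own list.

```python
from urllib.parse import urlparse, parse_qsl, urlencode

# Assumed session parameter names; not taken from any particular publisher.
SESSION_PARAMS = {"jsessionid", "sid", "sessionid", "phpsessid"}


def clean_url(url):
    """Return the URL with session-style parameters stripped."""
    parts = urlparse(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # params="" drops ";jsessionid=..." style data attached to the path
    return parts._replace(params="", query=urlencode(kept)).geturl()
```

For example, `clean_url("http://example.org/article/123;jsessionid=ABC?sid=1&vol=2")` keeps only the `vol` parameter.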
So what’s important? Retrieval and discovery. Already has tagging for navigation. Also has search in case there are some articles that haven’t been accurately tagged.
Provides extra link options for bookmarks. Main title links to the article, say in PubMed; but there are links to other sources for this article, e.g. to the original Nature article; plus other databases, and cross-referencing services.
System also produces a long OpenURL with all the bibliographic information in it.
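Roughly what building such a URL could look like. The resolver address is hypothetical, and the key names (`genre`, `atitle`, `title`, `volume`, `spage`, `date`) follow the older key/value style of OpenURL as best I recall it, so treat them as assumptions rather than what Connotea emits.

```python
from urllib.parse import urlencode

RESOLVER = "http://resolver.example.org/openurl"  # hypothetical resolver address


def to_openurl(citation, resolver=RESOLVER):
    """Pack citation fields into one long query string on the resolver URL."""
    fields = {"genre": "article"}
    # Skip any field the citation does not have a value for
    fields.update({key: value for key, value in citation.items() if value})
    return resolver + "?" + urlencode(fields)


url = to_openurl({
    "atitle": "An example article",   # article title (placeholder data)
    "title": "Example Journal",       # journal title (placeholder data)
    "volume": "1",
    "issue": None,                    # unknown, so left out of the URL
    "spage": "10",
    "date": "2006",
})
```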
Now … the hate.
- poorly documented and poorly implemented data formats. Variety of different XML schemas. Liberal interpretations of standards.
- have to do lots of unnecessary hoop-jumping to get this data. Lots of pinging different URLs to get cookies, POSTs, etc.
- have to do everything on a case-by-case basis. Have to reverse engineer each publisher’s site. Have to write ad hoc rules and custom procedures for each case.
Nature has released a proposal called OTMI, the Open Text Mining Interface - wants to make Nature’s text open for data mining, but not the articles themselves. Researchers are looking for raw XML for doing data mining research, but every time someone asks, Nature has to make ad hoc arrangements for each case. So OTMI does some pre-processing to make the data more usable.
Publishers could choose to be supported by Connotea, removing the need for Connotea to reverse engineer their sites. Publisher just puts a link on the page through to an Atom doc with the relevant data in, so that the citation can be easily retrieved.
Blogs already do autodiscovery of Atom feeds, so can test the idea using a citation source plug-in for a blog. It works, so can treat any source as a citation, but only whilst the post is still in the feed.
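A sketch of the autodiscovery step, assuming the page advertises its feed with the usual `<link rel="alternate" type="application/atom+xml">` element in the HTML. The class and function names are made up for illustration; fetching and reading the Atom doc itself is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AtomLinkFinder(HTMLParser):
    """Collects href values from <link> tags that advertise an Atom document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        if "alternate" in rel and mime == "application/atom+xml" and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_atom_links(html, base_url):
    """Return absolute URLs of any Atom documents the page links to."""
    finder = AtomLinkFinder()
    finder.feed(html)
    return [urljoin(base_url, href) for href in finder.hrefs]


page = """<html><head>
<link rel="stylesheet" type="text/css" href="/style.css">
<link rel="alternate" type="application/atom+xml" href="/atom.xml">
</head><body>A post</body></html>"""
links = find_atom_links(page, "http://blog.example.org/post/1")
```

A plug-in built this way needs no publisher-specific rules: it only has to follow the link and read the metadata from the Atom entry.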
Citation microformat. Connotea would work really well with a citation microformat, so is going to look into that.
How to get from URL to metadata:
- manual entry
- scraping the page
- recognise and extract some ID. Connotea does that, but it doesn’t scale to the whole web.
- follow a metadata link from the page; this is the blog plug-in
- parse the page directly, not possible yet.
Useful not just for Nature as publishers of data, but also anyone else who wants to be discoverable and bookmarkable.
Nature has a blog about this, Nascent.