Monday, February 27, 2012

Nepomuk Metadata Web-Extraction

When I announced my Conquirere program I got some feedback from Tuukka Verho. Seems I wasn't the first one working on a paper management system with Nepomuk.

Tuukka told me he developed a python plugin based Konqueror system that extracts publication relevant metadata from the currently visited website. If you know Zotero, this should remind you of the translators they are already using.

Extracting metadata from the web is a nice addition to the whole Nepomuk idea. There already exist a few implementations to extract tvshows, websites, movies and there was once a GSoC fully related to this topic.

One of the comments below Sebastians tvshow extractor was about the extension to also fetch anime data. Of course this is just one idea, there are hundreds of other sources where such data can be retrieved from. Also why stop with video files, there also exist music, publications, books and other files where additional metadata from the web can be a real help.

Now we could start to write a new program for any website we want to fetch the data from, but at the end we will only have a lot of copy and past code with some minor changes in it.

As I wanted to add such a metadata extraction for my Conquirere program anyway I sat down and thought of a more general way to combine all of this.

The "Nepomuk-Metadata-Extractor" was born. Hopefully this time the name was better chosen than the last one.

So what is this all about?
I sat down with Tuukka and combined his great python work with most of what Sebastian created for his tvshow stuff.
The end result is now a small little program, that can be called from the command line with the url of the file or folder you want to scan. Also you can simple select the right action from within dolphin if you prefer.

At the moment the program will detect if you scan publications (.pdf or .odt) or if it is a movie/tvshow.

In case you have a movie, you can fetch all the necessary data from imdb, for the tvshow it will fail, as I haven't implement this part.

The publication scan will do some more neat things.
First It starts by retrieving as much information from the file as possible.

This is done by scanning the RDF part of it. As most of my pdf articles have lousy RDF metadata attached to them, I also try to extract the title from the first page of the pdf.

Now that we got as much information as possible the program starts a search via Microsoft Academic Search to get the relevant abstract, bibtex data and all references used by the publication and fills the nepomuk database with it.

But what about the python part I've talked about earlier?
The actual search and extract job is done by a really simple to write python file. Currently I have written only two of them, but in the future you can extend it easily and then select the the right backend directly from the program.

So here we are with a new little program.
I will add this functionality into Conquirere later and hopefully Tuukka will release his Konqueror plugin later on too.

The question is just, should I add support for tvshows?
Is this the right way to start a more general web extraction service?
What do you think about this?
It could be extended with a KCM, so that someone can specify the default engine used for searching. Also a backup search engine, if the first one fails.

Sebastian also added a Nepomuk:Service so his fetcher is executed once the libstreamanalyzer adds a new video file into the storage, the same could be done here just for all kind of files.
Currently the whole system is created around files to scan, but it is also very easy to extend it, so any kind of nepomuk resource can be fed to the python parts to fetch all kind of data from the web.

If you want to try it out, the sources are currently located in my scratch repo and soon in the KDE playground.

Sunday, February 12, 2012

Conquirere joins Nepomuks future

Ok the title might be rather cryptic, but today I like to blog about some of the changing details I have been working on since the last time.

As some of you might now, Nepomuk is going to change. Sebastian already blogged about the new Data Management Service and why it is a good thing. Now Conquirere is ready for this bright future as it relies nearly completely on the new API.

What does this change now? Not much for the end-user to be honest. But it is a great change in the background that allows for some functions that were very hard to do beforehand.
  • Automatic merging of duplicate entries
  • Merging of user selected entries
  • Automatic type checking of the inserted entries
  • Identify what changes are done with Conquirere
Especially the last part is great. Currently, if you tested the program it produced a lot of entries in Nepomuk and often not all of them are removed when you remove some publications (the chapters or the websites connected to it for example). This leaves a lot of junk in the database you don't need anymore.

From now on, you can test the program without any fear. You can always simply clean the Nepomuk database from all entries created or changed by Nepomuk without altering the other parts of the Nepomuk storage. One of the greatest things that come with the new dms api.

So the dms change took some time, but wasn't the only thing I've changed. I did fix a few bugs and produced a few more while doing the transition. Also I've changed a few other parts I didn't like.

UI Changes
The most noticeable change is the refactoring of the ui. I have replace the QDockwidgets by a QSplitter layout. As it seems there is no need to freely rearrange the components of the ui this seems to be a better solution. Now the ui has a fixed layout, but you can simply hide parts of it by collapsing it to the side.
To automate this process a little bit further, I added 3 buttons to change the "mode" of the ui view.

Now there is a "full view mode" that displays all parts of the ui at once. When you like to scroll trough your library there is the "project view mode" which simply hides/collapse the document preview panel and when you want to work with your documents, you can switch to the "document view mode" which hides/collapse the library/resource table and let you concentrate on the document. If you need even more space, you can hide/collapse the right side panel too and you would end up with okular (as this is used as kpart to show your files).

Document view
Project/Library view
Full view
I have also simplified the way you can add your sources (files, remote files, websites).
Additionally you can now add "cited sources" so you can "quickly" check what other papers might be interesting regarding the one you are looking at. Currently you have to add this information manually. In the future I hope such information can be retrieved automatically via pdf parsing or from the web.

Also there is now no extra fields for the note content anymore, instead any kind of note is created as a pimo:Note which is a sub-resource of the publication (or document / email / event / reference / series / webpage). So from now on, bibtex files that have keys like note-1, note 2 etc are handled correctly too.

The "multiply selection" widget is another big change.
Beforehand it was a pain to to any kind of action on several files, especially deleting them. Now you can simply select several resources and do some actions with them, including merging them into 1 resource (thanks to the new dms api this was really easy to implement).

Nepomuk auto-completion is now done based on a live Nepomuk search (same way krunner does it). So at any time there is data from the database available to help you to complete your text.

Multi-selection widget 
Nepomuk auto-completion

Manage cited sources

The imported keywords for the publications are not handled as simple tags anymore, but will be imported as pimo:Topic. This will reduce the clutter in other parts of the system. You can still tag your publication though. But now a publication will have topics such as "face recognition, fantastic math solution, solution to world peace" and you can add tags like "important, reviewed, needs attention"

Zotero changes
The other bigger part I have changed is the Zotero integration. Now I can handle Zotero groups with more than 50 items (forgot to add this beforehand). Also child items (notes) are downloaded and added to the publication, as well as files, that can be downloaded into a specified directory. Sadly uploading files isn't working at the moment, as I have  no idea how to do that via the Zotero API.

New is also the merge dialog. As soon as the item you want to sync with changed on the server, you get a neat little dialog (unless you selected to use always the server/local version) that allows you to specify what changes you really want in your database.

Zotero merge dialog
That's it for now.
There is still  a lot to do and I still do not recommend to use Conquirere in a productive environment, but now I feel comfortable to recommend proper testing. Conquirere won't mess with your system in a way you can't simply wipe all the changes it did anymore. Also as long as you don't upload anything to Zotero, it can't mess with your Zotero data too, but even the upload should work (apart from the bugs that might still be there).

For the next step I like to add the right magic. Sebastian already showed how great automatic meta data fetching will be for the system and I think this should be expanded. Not only tv-shows, but also any other media file and of course text document should have a service that fetches all kind of meta data in the background. This will lead to a magic system that knows more about your files than you do and allows to bring Nepomuk to its full potential. 

Lets hope we can make this dream come true, as soon as possible.