Wednesday, November 7, 2012

MetaData Extractor 0.3 in KDE Review

Hi all,

A week ago I moved the current MetaData Extractor into KDE Review and tagged it as 0.3. As you might have read, Jason moofang already blogged about his anime changes in the extractor plugins.

A few things that changed on my side since the last release:

  • Limit the number of background processes that fetch data at the same time
  • Disable the Nepomuk2::Service on first run.
    This means the user has to activate it in the KCM to make it work, and it won't fetch data for all files as soon as it is installed
  • Allow specifying which resource types are used in the automatic fetching (documents, videos, music)
  • Add a general configuration KCM to the System Settings
  • Save unprocessed queries on KDE shutdown and start them again next time
  • Add a DocBook manual that explains most parts of the UI
  • Fix several bugs and crashes
  • Allow restricting online searches to preferred plugins only, without falling back to all the other plugins if the lookup fails
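
To illustrate the first point, here is a minimal sketch of capping how many fetch jobs run at once. It is not the extractor's actual code: the names `fetch_all`, `CountingFetcher` and `MAX_PARALLEL_FETCHES`, and the limit of 3, are assumptions made up for this example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_FETCHES = 3  # hypothetical limit, not the extractor's real default


def fetch_all(paths, fetch, max_parallel=MAX_PARALLEL_FETCHES):
    """Run fetch(path) for every path, with at most max_parallel jobs at a time."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map keeps the results in the same order as the input paths
        return list(pool.map(fetch, paths))


class CountingFetcher:
    """Stand-in for a real fetch job; records how many jobs overlapped."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0

    def __call__(self, path):
        with self._lock:
            self._running += 1
            self.peak = max(self.peak, self._running)
        try:
            time.sleep(0.01)  # pretend to wait for a web service
            return "metadata for " + path
        finally:
            with self._lock:
                self._running -= 1
```

Calling `fetch_all(paths, CountingFetcher(), max_parallel=2)` never runs more than two jobs at once, no matter how many files are queued.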
 

In its current state the extractor respects the user's privacy as much as possible and can still be configured to do all its magic in the background. Besides the service changes, it is still always possible to start the extractor manually from the Dolphin service menu.

Another thing that will change is the name of the program.
When I started this project (mostly as a helper for Conquirere) there was already a project named Nepomuk WebExtractor in playground. As that project seems to be dead, the name is available again.

The name MetaData Extractor might be misleading too. This program does not extract information from files (other than analysing the filename for better search parameters) but rather finds additional data on the Internet.

So possible new names are:
  • Nepomuk-WebExtractor
  • File Metadata Extractor 
  • File Metadata Retriever
  • File Metadata Miner

If you would like to test the latest changes, you can find them in kdereview.

Monday, November 5, 2012

KCM Wacom Tablet 1.3.7 & 2.0 Beta 1 (1.99.2)

I'm posting this on behalf of Alexander Maret-Huskinson who did most of the work in the latest release.

It's been a while since the last update, but today we released two new versions of KCM Wacom Tablet. Version 1.3.7 is another update of the stable 1.3 release, and 2.0-beta1 (1.99.2) is the first testing version of the upcoming 2.0 release.
The 1.3.7 update does not contain any new features but adds support for some additional tablets and fixes a bug where the tablet was rotated in the wrong direction when auto-rotation was enabled.

Unfortunately the beta version does not contain any new features either, but a lot of work has still been put into the upcoming 2.0 release. We rewrote the whole backend to make it more manageable and prepared it for the long-awaited libwacom support.
With the new architecture we can now easily support different subsystems, which allows us to support not only libwacom but also other configuration tools such as the Intuos 4 LED project. Support for these projects is not yet included but will be added in one of the next releases.

Although we put a lot of effort into testing, there are probably still some bugs left that we did not catch. We also have only a very limited number of tablets available, which is why we ask everyone to test this beta thoroughly and report any bugs we missed.

You can download the source from http://kde-apps.org/content/show.php/wacom+tablet?content=114856 or (K)Ubuntu packages for 12.04 and 12.10 from these repositories:

1.3.7: https://launchpad.net/~maret/+archive/wacom
2.0-beta1 (1.99.2): https://launchpad.net/~maret/+archive/wacom-unstable

Tuesday, September 11, 2012

MetaData Extractor 0.2

A small update on the Nepomuk MetaData Extractor.
Today I am releasing version 0.2. This also means my extractor is finally in a state that is good enough to get it out of playground and somewhere into KDE (as kdelibs is frozen I'll aim for extragear).

To make this happen a little more testing is needed, so that it will be "bugfree" by the time 4.10 is released.

What changed since my last blog post:
* I've added a BatchDownloader so external programs can integrate the extractor better.
* Plugin / NepomukPipe execution does not block the UI anymore.
* Made some more methods public. This helps to reuse the NepomukPipe classes (mainly from Conquirere).
* Old metadata (from this extractor) is removed before new data is added.
* In automatic mode, loop through all available plugins for one type until a match is found, maximising the chance of getting a result.
* Lim Yuen Hoe extended the tvdb plugin for anime support.
* IMDb TV show fetching works now
* Added plugins for SpringerLink and nature.com
* Save contacts (Author, Director, Actor etc.) as nco:Contact rather than nco:PersonContact
* Show all metadata results in a treeview, rather than a subset of them in a plain list.
* Several other bugfixes
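
The automatic-mode fallback from the list above can be sketched roughly like this. It is only an illustration: the function name `first_match` and the idea of passing plugins as `(name, search)` pairs are assumptions for this example, not the extractor's real interface.

```python
def first_match(resource, plugins):
    """Try each plugin in order and stop at the first one that finds something.

    plugins is a list of (name, search) pairs, where search(resource)
    returns a result or None. This shape is hypothetical.
    """
    for name, search in plugins:
        result = search(resource)
        if result:
            # A match was found, so the remaining plugins are not queried.
            return name, result
    return None, None
```

Plugins later in the list are only queried when all earlier ones came back empty, which keeps the number of web requests low.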
Anime support and TreeView in the MetaDataExtractor

Conquirere integration

Friday, August 10, 2012

Music support for the MetaData Extractor

Just a quick update on Nepomuk's metadata extractor.

I have added support for music files via MusicBrainz. From now on you can easily get additional data from there. To get some additional search parameters, the ID3 tags are read via TagLib.
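
As a rough idea of how tag values can become search parameters, here is a small sketch that builds a MusicBrainz-style search string from a dictionary of tags. The function name `musicbrainz_query` and the dictionary keys are assumptions for this example, and the real plugin may build its query differently.

```python
def musicbrainz_query(tags):
    """Build a Lucene-style query string from already-read tag values.

    tags is a plain dict such as {"title": ..., "artist": ..., "album": ...};
    reading the tags from the file (e.g. via TagLib) is not shown here.
    """
    # Hypothetical mapping from tag keys to MusicBrainz search fields.
    fields = (("title", "recording"), ("artist", "artist"), ("album", "release"))
    parts = []
    for tag_key, field in fields:
        value = (tags.get(tag_key) or "").strip()
        if value:
            # Escape backslashes and quotes so the value stays one phrase.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            parts.append('%s:"%s"' % (field, escaped))
    return " AND ".join(parts)
```

Missing or empty tags are simply left out, so a file with only a title still yields a usable (if less precise) query.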


Now all "important" file types (documents, video, music)  are handled by this little helper.
The next step will be some UI cleanup to increase the usability.

Sunday, August 5, 2012

Conquirere and Nepomuk MetaData Extractor Update

After a long time some new updates on Conquirere and the Nepomuk MetaData extractor.

Since KDE 4.9 now uses nepomuk-core and thus the Nepomuk2 namespace, some changes had to be made to make everything work again. Andreas Cord-Landwehr was kind enough to do the dirty work and ported Conquirere to the new namespace.

While he worked on the port I decided to add some minor helpers that deal with large datasets in the program. The new Conquirere now shows a splash screen to indicate what's going on during startup, and for large datasets it is possible to use a cache to speed up the loading time of all entries.

What's still left is the very long time necessary to load all data into the model if the cache wasn't created.
This will be the main ToDo for the next versions.

After I started working on the Nepomuk parts again, I wanted to fix the last few missing corners in the Nepomuk MetaData fetcher too.

Now there is a working fetcher that adds:

  • a GUI to fetch data for folders and files from a Dolphin service menu
  • an automatic fetcher that processes the files in the background without user interaction
  • a plugin for Konqueror to import data from supported websites directly into Nepomuk
  • an NPAPI plugin and extension for Chrome to import metadata from websites
  • Everything can be added easily into any other program

Which websites are supported and how they are parsed is all defined in Python scripts (and can be done in Ruby and JavaScript too).

Currently only IMDB.com, theTvdb.com and academic.research.microsoft.com are supported.

Because everyone loves screenshots, here is one showing the Chrome integration:

You can get the source from https://projects.kde.org

What will the future bring?

  • support for music files
  • maybe support to fetch person data
  • more websites to fetch data from

Friday, March 2, 2012

Nepomuk Metadata Extractor Update

A small update on my extractor approach.

I took Ben Smith's advice and had a look at Calibre.
Now you get a neat UI instead of a command line dialog with some random popup dialog.



Also, the extractor is now a reusable library (as it was supposed to be anyway). The first screenshot shows the integration within Conquirere. The second screenshot shows what happens when the extractor is called via the Dolphin action menu. This way you can step forward and backward through the individual files and search/extract info for each of them. You see what will be inserted and can decide whether you really want it, or use a different entry/search engine and try again.

The UI still has some rough edges, but now you can change the search parameters manually and also restrict the search by author, year and so on.

So this is a nice UI that could be added to any program that wants to fetch metadata for its resources and add it to Nepomuk.

While I am starting to like the UI, the plugin structure in the background isn't designed well enough to work in all kinds of situations. I'm sure this will be sorted out soon too.

The metadata extractor is now in playground if you would like to have a look.

Monday, February 27, 2012

Nepomuk Metadata Web-Extraction

When I announced my Conquirere program I got some feedback from Tuukka Verho. Seems I wasn't the first one working on a paper management system with Nepomuk.

Tuukka told me he had developed a Konqueror system based on Python plugins that extracts publication-relevant metadata from the currently visited website. If you know Zotero, this should remind you of the translators they are already using.

Extracting metadata from the web is a nice addition to the whole Nepomuk idea. There already exist a few implementations to extract tvshows, websites, movies and there was once a GSoC fully related to this topic.

One of the comments below Sebastian's tvshow extractor was about extending it to also fetch anime data. Of course this is just one idea; there are hundreds of other sources where such data can be retrieved from. Also, why stop with video files? There are also music, publications, books and other files where additional metadata from the web can be a real help.

Now we could start to write a new program for every website we want to fetch data from, but in the end we would only have a lot of copy-and-paste code with some minor changes in it.

As I wanted to add such a metadata extraction for my Conquirere program anyway I sat down and thought of a more general way to combine all of this.

The "Nepomuk-Metadata-Extractor" was born. Hopefully this time the name was better chosen than the last one.

So what is this all about?
I sat down with Tuukka and combined his great python work with most of what Sebastian created for his tvshow stuff.
The end result is now a small program that can be called from the command line with the URL of the file or folder you want to scan. You can also simply select the right action from within Dolphin if you prefer.

At the moment the program will detect if you scan publications (.pdf or .odt) or if it is a movie/tvshow.

In case you have a movie, you can fetch all the necessary data from IMDb; for a tvshow it will fail, as I haven't implemented this part yet.
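
A very simple way to make this kind of decision is to look at the file extension, as in the sketch below. This is only an assumption about how such a check could look: the function name `resource_kind` and the list of video extensions are made up for the example, and the real program may well rely on mimetypes instead.

```python
from pathlib import Path

PUBLICATION_EXTENSIONS = {".pdf", ".odt"}    # the two types named in this post
VIDEO_EXTENSIONS = {".avi", ".mkv", ".mp4"}  # assumption: illustrative list only


def resource_kind(path):
    """Return "publication", "video" or None based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in PUBLICATION_EXTENSIONS:
        return "publication"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return None  # unknown type, nothing to fetch
```

Whether a video is a movie or a tvshow episode cannot be told from the extension alone; that would need a further look at the filename or the search results.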




The publication scan will do some more neat things.
First, it starts by retrieving as much information from the file as possible.

This is done by scanning the RDF part of it. As most of my pdf articles have lousy RDF metadata attached to them, I also try to extract the title from the first page of the pdf.

Now that we have as much information as possible, the program starts a search via Microsoft Academic Search to get the relevant abstract, BibTeX data and all references used by the publication, and fills the Nepomuk database with it.



But what about the python part I've talked about earlier?
The actual search and extract job is done by a Python file that is really simple to write. Currently I have written only two of them, but in the future you can extend it easily and then select the right backend directly from the program.

So here we are with a new little program.
I will add this functionality into Conquirere later and hopefully Tuukka will release his Konqueror plugin later on too.

The question is just, should I add support for tvshows?
Is this the right way to start a more general web extraction service?
What do you think about this?
It could be extended with a KCM, so that someone can specify the default engine used for searching. Also a backup search engine, if the first one fails.

Sebastian also added a Nepomuk::Service so his fetcher is executed once the libstreamanalyzer adds a new video file to the storage; the same could be done here, just for all kinds of files.
Currently the whole system is built around files to scan, but it is also very easy to extend it so that any kind of Nepomuk resource can be fed to the Python parts to fetch all kinds of data from the web.

If you want to try it out, the sources are currently located in my scratch repo and soon in the KDE playground.