joergs weblog: 2012

Wednesday, November 7, 2012

MetaData Extractor 0.3 in KDE Review

Hi all,

a week ago I moved the current MetaData Extractor into KDE Review and tagged it as 0.3. As you might have read, Jason moofang has already blogged about his anime changes in the extractor plugins.

A few things that changed on my side since the last release:

Limit the number of background processes that fetch data at the same time
Disable the Nepomuk2::Service on first run.
This means the user has to activate it in the KCM to make it work, and it won't fetch data for all files as soon as it is installed
Allow specifying which resource types should be used in the automatic fetching (documents, videos, music)
Add a general configuration KCM to the system settings module
Save unprocessed queries on KDE shutdown and start them again next time
Add a docbook to explain most parts of the UI
Fix several bugs and crashes
Allow restricting online searches to preferred plugins only, so the other plugins are not tried if the lookup failed

In its current state the extractor respects the user's privacy as much as possible and can still be configured to do all its magic in the background. Besides the service changes, it is always possible to start the extractor manually from the Dolphin service menu.

Another thing that will change is the name of the program.

When I started this project (mostly as a helper for Conquirere) there was already a project named Nepomuk WebExtractor in playground. As it seems this project is dead, the name will be available again.

The name MetaData Extractor might be misleading too. This program does not extract information from files (other than analysing the filename for better search parameters) but rather finds additional data on the Internet.

So possible new names are:

Nepomuk-WebExtractor
File Metadata Extractor
File Metadata Retriever
File Metadata Miner

If you would like to test the latest changes, you can find them in kdereview.

Monday, November 5, 2012

KCM Wacom Tablet 1.3.7 & 2.0 Beta 1 (1.99.2)

I'm posting this on behalf of Alexander Maret-Huskinson who did most of the work in the latest release.

It's been a while since the last update, but today we released two new versions of KCM Wacom Tablet. Version 1.3.7 is another update of the stable 1.3 release, and 2.0-beta1 (1.99.2) is the first testing version of the upcoming 2.0 release.
The 1.3.7 update does not contain any new features but adds support for some additional tablets and fixes a bug where the tablet was rotated in the wrong direction when auto-rotation was enabled.

Unfortunately the beta version does not contain any new features either, but still a lot of work has been put into this upcoming 2.0 release. We rewrote the whole backend to make it more manageable and prepared it for the long awaited libwacom support.
With the new architecture we can now easily support different subsystems, which allows us to support not only libwacom but also any other configuration tool, such as the Intuos 4 Led project. Support for these projects is not yet included but will be added in one of the next releases.

Although we put much effort into testing, there are probably still some bugs left which we did not catch. We also have only a very limited number of tablets available, which is why we ask everyone to thoroughly test this beta and report any bugs we missed.

You can download the source from http://kde-apps.org/content/show.php/wacom+tablet?content=114856 or (K)Ubuntu packages for 12.04 and 12.10 from these repositories:

1.3.7: https://launchpad.net/~maret/+archive/wacom
2.0-beta1 (1.99.2): https://launchpad.net/~maret/+archive/wacom-unstable

Tuesday, September 11, 2012

MetaData Extractor 0.2

A small update on the Nepomuk MetaData Extractor.
Today I released version 0.2. This also means my extractor is finally in a state that is good enough to get it out of playground and somewhere into KDE (as kdelibs is frozen I'll aim for extragear).

To make this happen a little more testing is needed, so it will be "bugfree" by the time 4.10 is released.

What changed since my last blog post:
* I've added a BatchDownloader so external programs can integrate the extractor stuff better.
* Plugins / Nepomukpipe execution do not block the ui anymore.
* Made some more methods public. Helps to reuse the NepomukPipe classes (mainly from Conquirere)
* Old metadata (from this extractor) will be removed before new data is added.
* In automatic mode, loop through all available plugins for one type until a match is found, maximising the chance of getting a result.
* Lim Yuen Hoe extended the tvdb plugin for anime support.
* imdb tv show fetching works now
* add plugins for SpringerLink, nature.com
* save contacts (Author, Director, Actor etc) as nco:Contact rather than nco:PersonContact
* Show all metadata results in a treeview, rather than a subset of it in a plain list.
* several other bugfixes
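The "loop through all available plugins until a match is found" item from the list above can be sketched in a few lines of Python. This is only an illustration under assumptions: the plugin table and the search functions are stand-ins and do not reflect how the extractor really registers or calls its plugins.

```python
def search_imdb(query):
    # Stand-in: pretend this site has no match for the query.
    return None


def search_tvdb(query):
    # Stand-in: pretend this site always finds the show.
    return {"title": query, "source": "tvdb"}


# (plugin name, resource type it handles, search function); all hypothetical
PLUGINS = [
    ("imdb", "tvshow", search_imdb),
    ("tvdb", "tvshow", search_tvdb),
]


def first_match(resource_type, query, plugins):
    """Try every plugin for `resource_type` in order and stop at the first hit."""
    for name, handled_type, search in plugins:
        if handled_type != resource_type:
            continue  # this plugin cannot handle the resource type
        result = search(query)
        if result:
            return name, result
    return None, None  # no plugin produced a result


plugin_name, match = first_match("tvshow", "Some Show", PLUGINS)
```

Here the first plugin fails, so the loop falls through to the second one, which returns the result.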

Anime support and TreeView in the MetaDataExtractor

Conquirere integration

Friday, August 10, 2012

Music support for the MetaData Extractor

Just a quick update on Nepomuk's metadata extractor.

I have added support for music files via MusicBrainz. From now on you can easily get additional data from there. To get some additional search parameters, the ID3 tags are read via TagLib.

Now all "important" file types (documents, video, music) are handled by this little helper.
The next step will be some UI cleanup to increase the usability.

Sunday, August 5, 2012

Conquirere and Nepomuk MetaData Extractor Update

After a long time some new updates on Conquirere and the Nepomuk MetaData extractor.

Since KDE 4.9 now uses nepomuk-core and thus the Nepomuk2 namespace, some changes had to be made to make everything work again. Andreas Cord-Landwehr was so nice to do the dirty work and ported Conquirere to the new namespace.

While he worked on the port I decided to add some minor helpers that deal with large datasets in the program. The new Conquirere now adds a splash screen to show what's going on during startup, and for large datasets it is possible to use a cache to speed up the loading time of all entries.

What's still left is the very long time necessary to load all data into the model if the cache wasn't created.
This will be the main ToDo for the next versions.

After I started to work on the Nepomuk parts again I wanted to fix the last few missing corners in the Nepomuk MetaData fetcher too.

Now there is a working fetcher that adds:

a gui to fetch folders and files with a Dolphin Service menu
an automatic fetcher that processes the files in the background without user interaction
Plugin for Konqueror to import data from supported websites directly to Nepomuk
NPAPI Plugin and Extension for Chrome to import meta data from websites
Everything can be added easily into any other program

Which websites are supported and how they are parsed is all defined in Python scripts (and can be done in Ruby and JavaScript too).

Currently only IMDB.com, theTvdb.com and academic.research.microsoft.com are supported.

Because everyone loves screen shots, here is one showing the Chrome integration:

You can get the source from https://projects.kde.org

What will the future bring?

support for music files
maybe support to fetch person data
more websites to fetch data from

Friday, March 2, 2012

Nepomuk Metadata Extractor Update

A small update on my extractor approach.

I took Ben Smith's advice and had a look at Calibre.
Now you get a neat UI instead of a command line dialog with some random popup dialog.

Also the extractor is now a reusable library (as it was supposed to be anyway). The first screenshot shows the integration within Conquirere. The second screenshot shows what happens when the extractor is called via the Dolphin action menu. This way you can move forward/backward through the individual files and search/extract info for each of them. See what will be inserted and decide if you really want it, or use a different entry/search engine and try again.

The UI has still some rough edges, but now you can change the search parameters manually and also restrict the search by author, year and so on.

So this is a nice UI that could be added to all programs that want to fetch and add metadata for their resources to Nepomuk.

While I am starting to like the UI, the plugin structure in the background isn't designed well enough to work in all kinds of situations. I'm sure this will be sorted out soon too.

The metadata extractor is now in playground if you would like to have a look.

Monday, February 27, 2012

Nepomuk Metadata Web-Extraction

When I announced my Conquirere program I got some feedback from Tuukka Verho. Seems I wasn't the first one working on a paper management system with Nepomuk.

Tuukka told me he developed a python plugin based Konqueror system that extracts publication relevant metadata from the currently visited website. If you know Zotero, this should remind you of the translators they are already using.

Extracting metadata from the web is a nice addition to the whole Nepomuk idea. There already exist a few implementations to extract tvshows, websites, movies and there was once a GSoC fully related to this topic.

One of the comments below Sebastian's tvshow extractor was about extending it to also fetch anime data. Of course this is just one idea; there are hundreds of other sources where such data can be retrieved from. Also, why stop with video files? There are also music, publications, books and other files where additional metadata from the web can be a real help.

Now we could start to write a new program for every website we want to fetch data from, but in the end we would only have a lot of copy-and-paste code with some minor changes in it.

As I wanted to add such a metadata extraction for my Conquirere program anyway I sat down and thought of a more general way to combine all of this.

The "Nepomuk-Metadata-Extractor" was born. Hopefully this time the name was better chosen than the last one.

So what is this all about?

I sat down with Tuukka and combined his great python work with most of what Sebastian created for his tvshow stuff.

The end result is a small program that can be called from the command line with the url of the file or folder you want to scan. You can also simply select the right action from within Dolphin if you prefer.

At the moment the program will detect if you scan publications (.pdf or .odt) or if it is a movie/tvshow.
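To make the detection step concrete, here is a minimal Python sketch that decides by file extension. Only .pdf and .odt are named in this post; the video extensions and the function name are assumptions for the example, and the real program may well rely on the file types stored in Nepomuk instead.

```python
from pathlib import Path

PUBLICATION_SUFFIXES = {".pdf", ".odt"}    # named in the post
VIDEO_SUFFIXES = {".avi", ".mkv", ".mp4"}  # assumed for illustration only


def detect_kind(path):
    """Return "publication", "video" or None for an unknown file type."""
    suffix = Path(path).suffix.lower()
    if suffix in PUBLICATION_SUFFIXES:
        return "publication"
    if suffix in VIDEO_SUFFIXES:
        return "video"
    return None


kinds = [detect_kind(p) for p in ("paper.PDF", "/home/user/movie.mkv", "notes.txt")]
```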

In case you have a movie, you can fetch all the necessary data from imdb; for a tvshow it will fail, as I haven't implemented this part.

The publication scan will do some more neat things.

First it starts by retrieving as much information from the file as possible.

This is done by scanning the RDF part of it. As most of my pdf articles have lousy RDF metadata attached to them, I also try to extract the title from the first page of the pdf.

Now that we got as much information as possible the program starts a search via Microsoft Academic Search to get the relevant abstract, bibtex data and all references used by the publication and fills the nepomuk database with it.

But what about the python part I've talked about earlier?

The actual search and extract job is done by a really simple to write python file. Currently I have written only two of them, but in the future you can extend it easily and then select the right backend directly from the program.
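The plugin interface itself is not shown in this post, so the following is only a hypothetical sketch of what such a python backend file could look like. The function names (info, search, extract) and the canned entries are made up for illustration; a real backend would query a website instead of a local list.

```python
# Hypothetical backend layout; the function names are assumptions,
# not the extractor's real plugin interface.

# Canned example data standing in for what a website would return.
CANNED_RESULTS = [
    {"title": "A Study of Semantic Desktops", "year": 2011,
     "url": "https://example.org/1"},
    {"title": "Desktop Search in Practice", "year": 2009,
     "url": "https://example.org/2"},
]


def info():
    """Describe the backend so the program can offer it for selection."""
    return {"name": "Example backend", "resource": "publication"}


def search(title, year=None):
    """Return the entries whose title contains every word of the query."""
    words = title.lower().split()
    hits = [entry for entry in CANNED_RESULTS
            if all(word in entry["title"].lower() for word in words)]
    if year is not None:
        hits = [entry for entry in hits if entry["year"] == year]
    return hits


def extract(entry):
    """Turn one search hit into the key/value pairs that would be stored."""
    return {"title": entry["title"],
            "date": str(entry["year"]),
            "see_also": entry["url"]}


hits = search("semantic desktops")
record = extract(hits[0])
```

Splitting the work into a search step and an extract step means the user can pick one of several hits before anything is written to the database.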

So here we are with a new little program.

I will add this functionality into Conquirere later and hopefully Tuukka will release his Konqueror plugin later on too.

The question is just, should I add support for tvshows?

Is this the right way to start a more general web extraction service?

What do you think about this?

It could be extended with a KCM, so that someone can specify the default engine used for searching. Also a backup search engine, if the first one fails.

Sebastian also added a Nepomuk:Service so his fetcher is executed once the libstreamanalyzer adds a new video file to the storage; the same could be done here, just for all kinds of files.

Currently the whole system is built around files to scan, but it is also very easy to extend it so that any kind of nepomuk resource can be fed to the python parts to fetch all kinds of data from the web.

If you want to try it out, the sources are currently located in my scratch repo and soon in the KDE playground.

Sunday, February 12, 2012

Conquirere joins Nepomuk's future

OK, the title might be rather cryptic, but today I would like to blog about some of the changes I have been working on since last time.

As some of you might know, Nepomuk is going to change. Sebastian already blogged about the new Data Management Service and why it is a good thing. Now Conquirere is ready for this bright future as it relies nearly completely on the new API.

What does this change now? Not much for the end-user to be honest. But it is a great change in the background that allows for some functions that were very hard to do beforehand.

Automatic merging of duplicate entries
Merging of user selected entries
Automatic type checking of the inserted entries
Identify what changes are done with Conquirere

Especially the last part is great. Currently, if you tested the program it produced a lot of entries in Nepomuk and often not all of them are removed when you remove some publications (the chapters or the websites connected to it for example). This leaves a lot of junk in the database you don't need anymore.

From now on, you can test the program without any fear. You can always simply clean the Nepomuk database of all entries created or changed by Conquirere without altering the other parts of the Nepomuk storage. This is one of the greatest things that comes with the new dms api.

So the dms change took some time, but wasn't the only thing I've changed. I did fix a few bugs and produced a few more while doing the transition. Also I've changed a few other parts I didn't like.

UI Changes

The most noticeable change is the refactoring of the ui. I have replaced the QDockWidgets with a QSplitter layout. As there seems to be no need to freely rearrange the components of the ui, this looks like a better solution. Now the ui has a fixed layout, but you can simply hide parts of it by collapsing them to the side.
To automate this process a little bit further, I added 3 buttons to change the "mode" of the ui view.

Now there is a "full view mode" that displays all parts of the ui at once. When you want to scroll through your library there is the "project view mode", which simply hides/collapses the document preview panel, and when you want to work with your documents you can switch to the "document view mode", which hides/collapses the library/resource table and lets you concentrate on the document. If you need even more space, you can hide/collapse the right side panel too and you end up with okular (as this is used as a kpart to show your files).

Document view

Project/Library view

Full view

I have also simplified the way you can add your sources (files, remote files, websites).

Additionally you can now add "cited sources" so you can "quickly" check what other papers might be interesting regarding the one you are looking at. Currently you have to add this information manually. In the future I hope such information can be retrieved automatically via pdf parsing or from the web.

Also there are no extra fields for the note content anymore; instead any kind of note is created as a pimo:Note which is a sub-resource of the publication (or document / email / event / reference / series / webpage). So from now on, bibtex files that have keys like note-1, note 2 etc are handled correctly too.

The "multiple selection" widget is another big change.

Beforehand it was a pain to do any kind of action on several files, especially deleting them. Now you can simply select several resources and do some actions with them, including merging them into 1 resource (thanks to the new dms api this was really easy to implement).

Nepomuk auto-completion is now done based on a live Nepomuk search (same way krunner does it). So at any time there is data from the database available to help you to complete your text.

Multi-selection widget

Nepomuk auto-completion

Manage cited sources

The imported keywords for the publications are not handled as simple tags anymore, but will be imported as pimo:Topic. This will reduce the clutter in other parts of the system. You can still tag your publication though. But now a publication will have topics such as "face recognition, fantastic math solution, solution to world peace" and you can add tags like "important, reviewed, needs attention"

Zotero changes

The other bigger part I have changed is the Zotero integration. Now I can handle Zotero groups with more than 50 items (forgot to add this beforehand). Also child items (notes) are downloaded and added to the publication, as well as files, that can be downloaded into a specified directory. Sadly uploading files isn't working at the moment, as I have no idea how to do that via the Zotero API.
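Handling groups with more than 50 items comes down to requesting the items page by page until the server runs out. The Python sketch below shows that loop in isolation. The page size of 50 is taken from the paragraph above; the fetch function is a local stand-in, so nothing here should be read as the actual Zotero API calls.

```python
PAGE_SIZE = 50  # per-request cap implied by the post


def fetch_all_items(fetch_page, page_size=PAGE_SIZE):
    """Collect all items by asking for one page after another."""
    items = []
    start = 0
    while True:
        page = fetch_page(start, page_size)
        items.extend(page)
        if len(page) < page_size:  # a short (or empty) page marks the end
            return items
        start += page_size


# Stand-in for a remote group with 120 items.
FAKE_GROUP = [f"item-{i}" for i in range(120)]


def fake_fetch_page(start, limit):
    return FAKE_GROUP[start:start + limit]


all_items = fetch_all_items(fake_fetch_page)
```

With 120 items the loop makes three requests (50, 50 and 20 items) and then stops.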

The merge dialog is also new. As soon as the item you want to sync with has changed on the server, you get a neat little dialog (unless you chose to always use the server/local version) that allows you to specify which changes you really want in your database.

Zotero merge dialog

That's it for now.

There is still a lot to do and I still do not recommend using Conquirere in a production environment, but now I feel comfortable recommending proper testing. Conquirere can no longer mess with your system in a way that you can't simply wipe all the changes it made. Also, as long as you don't upload anything to Zotero, it can't mess with your Zotero data either, but even the upload should work (apart from the bugs that might still be there).

For the next step I would like to add the right magic. Sebastian already showed how great automatic metadata fetching will be for the system and I think this should be expanded. Not only tv-shows, but also any other media file and of course text documents should have a service that fetches all kinds of metadata in the background. This will lead to a magic system that knows more about your files than you do and allows Nepomuk to reach its full potential.

Let's hope we can make this dream come true as soon as possible.

Monday, January 23, 2012

Hello Planet & Conquirere 0.1

Hello Planet! Most of you don't know me yet, so let me introduce myself: My name is Jörg Ehrichs and I'm the developer of the KDE Wacom tablet module as well as the upcoming Conquirere.

In my first of hopefully many blog posts I would like to introduce you to my approach to combining bibliographic data with your files and thoughts via the Nepomuk framework.

Conquirere allows you to add bibliographic data such as journals, books, proceedings papers, articles and many more to Nepomuk. It combines this data with the documents on your harddisk or some online storage and helps you to organize your data so you can quickly find parts of your research again.

Why use Nepomuk when the same could be achieved via KBibTeX and the good old .bib files? Because we can! Even though KBibTeX does gain file support in its latest versions, it will still only operate on the flat database from the .bib files. The same is true for any other bibliographic manager. With the help of Nepomuk we can give our data a semantic meaning.

Connect the publications with the contacts from Akonadi, attach notes to them or mark the article as part of an event from your Akonadi calendar. Group your research papers together without duplicating the data or copying files around on your harddisk and search through your papers and data via Nepomuk to find exactly what you are looking for.

Even if many people seem to dislike Nepomuk, their real problems are within the feeders (which are getting better and better now). The Nepomuk database itself was always working great and very fast.

Nepomuk's only problem is the lack of programs making use of the great features offered by this system or, better yet, that most people don't know that Nepomuk is used in the background by several programs (searching, tagging, adding notes in KDEPim, Dolphin or krunner).

Now there is another program based completely on the idea of this semantic database.

Enough introduction for the moment, now let's see what's working already.
In its current state Conquirere will list all your documents (pdf, odt etc.) as indexed by Nepomuk, allow you to tag and comment on them, and let you tell what kind of publication each file represents. Just as you would add data to your usual bibtex file, you can add authors, editors, title, publication date and so on.

Because this would be a lot of work, you can also import existing bibliographic data. Thanks to KBibTeX's approach of making its functions available to others, you can also import any kind of bibliographic data and create the Nepomuk resources from it.

Furthermore, if you want to look for new information, search through several online services and import their data directly. Or use Zotero to add the data while you are surfing for some research papers and import/sync them later on.

Got lost in all your data and can't find the right file with some content you need? Nepomuk offers a great way for full text search and you never have to spend hours opening and reading your pdf files again.

But don't stop with files, connect emails, events or notes together so you have always all the data you need.

Finally all the research is done and you want to start to write and cite. Export your references again. Thanks again to KBibTeX this is as easy as 2 clicks and you can create a .bib file with the references you need. Or pipe the data directly to LyX.

As you can see Conquirere already offers a lot and hopefully one day it will offer a lot more.
If you would like to have a look at the current progress, feel free to check it out from playground. Just follow the instructions from the README and you're ready to go.

But please keep in mind: It is in an early state. So expect it to crash and be buggy for now.
Also not all parts are working; especially the zotero sync isn't finished completely. So I wouldn't recommend using it in a production environment at the moment.

Enjoy the first preview of Conquirere.