Wednesday, February 6, 2013

Nepomuk WebMiner 0.5

Since my last post a lot has happened to the Nepomuk-WebMiner (formerly MetaDataExtractor).

The WebMiner went through the KDE Review process and got cleaned up a bit during this process. The new location of it is extragear/base/nepomuk-webminer.

On the code side, I have fixed several bugs and integrated the automatic fetching better into the current Nepomuk system.

The new WebMiner-Service respects suspend/resume and event monitoring (no internet, low disk space, on battery mode) in the same way the FileIndexer does.

When the automatic fetching is started via the command line or the dolphin command, the service is used for the actual fetching. This allows the current fetching progress to be shown in the nepomukcontroller (in the systray).

Starting with KDE 4.11 the Systemsettings for Nepomuk and the Nepomuk WebMiner are combined and won't show up as two different entries anymore.

The buggy IMDb Python script had a hard time keeping up with changes on the IMDb website, which made proper fetching of movie resources unreliable. A new plugin was created to replace it.

The next step for the WebMiner will be the full integration into KDE SC for the 4.11 release.
That means moving out of extragear again into a more suitable place.

In order to make this happen there is still one large blocker task that needs to be done.

So if anyone is good with Python and has some time, the script at nepomuk-core/services/storage/rcgen/ needs to be improved.
This script is responsible for generating the SimpleResource classes from the ontology in use.
As it takes nearly all ontologies into account and is rather slow right now, each generation run takes ~20 minutes. This is a pain for anyone compiling the WebMiner from source.
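
For anyone who wants to dig in, a sensible first step is probably to find out where the time actually goes. Below is a minimal sketch using Python's standard cProfile and pstats modules. Note that `generate_classes` and `profile_call` are hypothetical names I made up for illustration; they are not the real entry points of the rcgen script, and the real generator would have to be called in place of the stand-in.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report of the `top` entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def generate_classes(ontology_terms):
    # Hypothetical stand-in for the real generator: one class stub per term.
    return {term: "class %s {};" % term for term in ontology_terms}


if __name__ == "__main__":
    classes, report = profile_call(generate_classes, ["Movie", "TVShow"])
    print(report)
```

The report lists call counts and timings per function, which should show whether the time is lost in parsing the ontologies, in the generation itself, or somewhere else.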

Any help is very welcome.


  1. Is the webminer the same thing as the metadata extractor you blogged about last year? I'm not following its development closely enough to know what it does...

    1. Yes, it is exactly the same thing. It just got renamed because it never actually "extracted" anything; its purpose was always to get more data from the web.

  2. Wow, great feature to have such thing included.

    When looking at the screenshot, I wonder why the name "Nepomuk" is displayed in the settings. Can't it just show:
    - tabs: "Indexing", "Online services" ?
    - remove the word Nepomuk at the "indexer" panels?

  3. Great and interesting work, Thanks! Esp. the academic publication plugin sounds very promising.

  4. I have a suggestion: could you also use the code and ideas from MovieThumbs (content = 157543)? If you use these ideas, this could become a very good thing.

    1. The KIO slave from Sebastian's TvNamer can be used independently of the fetcher. Both fetchers (mine and the TvNamer one) retrieve the poster/banner for movies and TV shows and allow them to be shown.