joergs weblog: Metadata Extractor


Monday, May 13, 2013

Nepomuk WebMiner 0.6

A few months have passed since my last WebMiner update. In the meantime I finished my Master's thesis, moved to a new location and started my new job. A perfect time to release a new version with the changes I have made since.

Besides several bugfixes, the Nepomuk WebMiner 0.6 adds:

* User-changeable regular expressions for the filename parsing.
* The WebMiner no longer does its own file indexing and instead reuses Nepomuk's internal file indexing to get ID3 tags and other file metadata.
* Added a whitelist for the automatic web search. You might like to look up the folder with your publication PDFs but not your private documents, or the network share with your TV shows but not your private family videos. This works on top of the Nepomuk whitelist, so Nepomuk can index all of these files, but not all of them will be searched on the web.
* Instead of the dull treeview that shows the raw fetched metadata, you can now see and edit the metadata in several fancy edit fields.
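
To give an idea of what a user-changeable filename expression can do, here is a minimal Python sketch. It is not the WebMiner's actual code: the pattern, the group names (`show`, `season`, `episode`) and the function `parse_filename` are assumptions made up for this illustration. A custom pattern has to provide the same three named groups.

```python
import re

# Hypothetical default pattern for names like "Show.Name.S01E02.avi".
# The patterns really shipped with the WebMiner are not shown in this post.
TVSHOW_PATTERN = r"^(?P<show>.+?)[ ._-]+[Ss](?P<season>\d{1,2})[Ee](?P<episode>\d{1,2})"


def parse_filename(filename, pattern=TVSHOW_PATTERN):
    """Return search parameters extracted from a filename, or None if nothing matches."""
    match = re.match(pattern, filename)
    if match is None:
        return None
    params = match.groupdict()
    # Dots and underscores in the show name usually stand for spaces.
    params["show"] = re.sub(r"[._]+", " ", params["show"]).strip()
    params["season"] = int(params["season"])
    params["episode"] = int(params["episode"])
    return params


# Default pattern
print(parse_filename("The.Big.Show.S02E05.avi"))
# A user-supplied pattern for names like "My Show - 1x03.mkv"
print(parse_filename("My Show - 1x03.mkv",
                     r"^(?P<show>.+) - (?P<season>\d+)x(?P<episode>\d+)"))
```

The extracted values would then be used as search parameters for the web lookup instead of the raw filename.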

You can find the latest release on projects.kde.org or the tarball on kde-apps.org.
Even though I wanted to get this into KDE SC 4.11, I doubt this is going to happen. Soft feature freeze is around the corner and I don't feel comfortable enough yet to let this be part of SC and annoy all users with this service. There are still a lot of usability problems I would like to have solved properly before this can be part of any KDE installation.

So please test the latest release and report any errors back to me.

Wednesday, February 6, 2013

Nepomuk WebMiner 0.5

Since my last post a lot has happened to the Nepomuk-WebMiner (formerly MetaDataExtractor).

The WebMiner went through the KDE Review process and got cleaned up a bit along the way. Its new location is extragear/base/nepomuk-webminer.

On the code side, I have fixed several bugs and integrated the automatic fetching better into the current Nepomuk system.

The new WebMiner-Service respects suspend/resume and the event monitoring (no internet, low disk space, on battery mode) in the same way the FileIndexer does.

When the automatic fetching is started via the command line or the Dolphin command, the service is used for the actual fetching. This makes it possible to show the current fetching progress in the nepomukcontroller (in the systray).

Starting with KDE 4.11, the System Settings modules for Nepomuk and the Nepomuk WebMiner are combined and won't show up as two different entries anymore.

Because the buggy IMDb Python script has a hard time keeping up with the changes on the IMDb website, a new plugin for themoviedb.org was created to fetch movie resources properly.

The next step for the WebMiner will be the full integration into KDE SC for the 4.11 release.
That means moving out of extragear again into some other proper place.

In order to make this happen there is still one large blocker task that needs to be done.

So if anyone is good with Python and has some time, the script at nepomuk-core/services/storage/rcgen/nepomuk-simpleresource-rcgen.py needs to be improved.
This script is responsible for generating the SimpleResource classes from the ontologies in use.
As it takes nearly all ontologies into account and is rather slow right now, each generation run takes ~20 minutes. This is a pain for anyone compiling the WebMiner from source.

Any help is very welcome.

Wednesday, November 7, 2012

MetaData Extractor 0.3 in KDE Review

Hi all,

A week ago I moved the current MetaData Extractor into KDE Review and tagged it as 0.3. As you might have read, Jason moofang already blogged about his anime changes in the extractor plugins.

A few things that changed on my side since the last release:

* Limit the number of background processes that fetch data at the same time.
* Disable the Nepomuk2::Service on first run. This means the user has to activate it in the KCM to make it work, and it won't fetch data for all files as soon as it is installed.
* Allow specifying which resource types should be used in the automatic fetching (documents, videos, music).
* Add a general configuration KCM to the system settings module.
* Save unprocessed queries on KDE shutdown and start them again next time.
* Added a docbook to explain most parts of the UI.
* Fix several bugs and crashes.
* Allow restricting online searches to preferred plugins only, instead of trying all the other plugins if the lookup failed.
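
The idea behind limiting the parallel fetches can be sketched in a few lines of Python. This is only an illustration under assumptions, not the extractor's code: it uses a thread pool rather than separate processes for simplicity, and `MAX_PARALLEL_FETCHES`, `fetch_metadata` and `fetch_all` are hypothetical names with a placeholder lookup.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_FETCHES = 3  # assumed value, the real limit is not given in this post


def fetch_metadata(path):
    """Placeholder for a real web lookup; it only echoes the file path."""
    return {"file": path, "status": "fetched"}


def fetch_all(paths, max_parallel=MAX_PARALLEL_FETCHES):
    """Fetch metadata for all paths with at most max_parallel lookups at once."""
    # The pool never runs more than max_parallel lookups at the same time;
    # the remaining files wait in the executor's internal queue.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() returns the results in the same order as the input paths.
        return list(pool.map(fetch_metadata, paths))


results = fetch_all(["paper.pdf", "episode.avi", "song.mp3", "thesis.pdf"])
print(len(results), "files processed")
```

However large the folder is, only a handful of lookups hit the network at any moment.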

In its current state the extractor respects the user's privacy as much as possible and can still be configured to do all its magic in the background. Besides the service changes, it is always possible to start the extractor manually from the Dolphin service menu.

Another thing that will change is the name of the program.

When I started this project (mostly as a helper for Conquirere) there was already a project named Nepomuk WebExtractor in playground. As this project seems to be dead, the name is available again.

The name MetaData Extractor might be misleading too. This program does not extract information from files (other than analysing the filename for better search parameters) but rather finds additional data on the Internet.

So possible new names are:

* Nepomuk-WebExtractor
* File Metadata Extractor
* File Metadata Retriever
* File Metadata Miner

If you would like to test the latest changes, you can find them in kdereview.

Tuesday, September 11, 2012

MetaData Extractor 0.2

A small update on the Nepomuk MetaData Extractor.
Today I release version 0.2. This also means my extractor is finally in a state that is good enough to get it out of playground and somewhere into KDE (as kdelibs is frozen, I'll aim for extragear).

To make this happen a little bit more testing is needed, so it will be "bugfree" by the time 4.10 is released.

What changed since my last blog post:
* I've added a BatchDownloader so external programs can integrate the extractor stuff better.
* Plugin and NepomukPipe execution no longer blocks the UI.
* Made some more methods public, which helps to reuse the NepomukPipe classes (mainly from Conquirere).
* Old metadata (from this extractor) is removed before new data is added.
* In automatic mode, loop through all available plugins for one type until a match is found, maximising the chance to get a result.
* Lim Yuen Hoe extended the tvdb plugin for anime support.
* IMDb TV show fetching works now.
* Added plugins for SpringerLink and nature.com.
* Save contacts (Author, Director, Actor etc.) as nco:Contact rather than nco:PersonContact.
* Show all metadata results in a treeview, rather than a subset of them in a plain list.
* Several other bugfixes.
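
The plugin loop of the automatic mode boils down to "ask one plugin after the other and stop at the first hit". A minimal Python sketch of that idea, assuming each plugin can be treated as a plain search function; `search_with_fallback` and the two stub plugins are hypothetical and not part of the extractor's API:

```python
def search_with_fallback(plugins, query):
    """Try each (name, search_function) pair in order; return the first non-empty result."""
    for name, search in plugins:
        results = search(query)
        if results:
            # First match wins, the remaining plugins are not queried.
            return name, results
    # No plugin found anything.
    return None, []


# Stub plugins standing in for real website scrapers (illustration only).
def imdb_stub(query):
    return []  # pretend this site found nothing


def tvdb_stub(query):
    return [{"title": query, "source": "tvdb"}]


plugin_name, hits = search_with_fallback(
    [("imdb", imdb_stub), ("tvdb", tvdb_stub)], "Some Show")
print(plugin_name, hits)
```

The order of the list decides which plugin gets asked first.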

Anime support and TreeView in the MetaDataExtractor

Conquirere integration

Friday, August 10, 2012

Music support for the MetaData Extractor

Just a quick update on Nepomuk's metadata extractor.

I have added support for music files via MusicBrainz. From now on you can easily get additional data from there. To get some additional search parameters, the ID3 tags are read in via TagLib.

Now all "important" file types (documents, video, music) are handled by this little helper.
The next step will be some UI cleanup to increase the usability.

Sunday, August 5, 2012

Conquirere and Nepomuk MetaData Extractor Update

After a long time some new updates on Conquirere and the Nepomuk MetaData extractor.

Since KDE 4.9 now uses nepomuk-core and thus the Nepomuk2 namespace, some changes had to be made to make everything work again. Andreas Cord-Landwehr was so nice to do the dirty work and ported Conquirere to the new namespace.

While he worked on the port, I decided to add some minor helpers that deal with large datasets in the program. The new Conquirere now adds a splash screen to show what's going on during startup, and for large datasets it is possible to use a cache to speed up the loading time of all entries.

What's still left is the very long time necessary to load all data into the model if the cache wasn't created.
This will be the main ToDo for the next versions.

After I started working on the Nepomuk parts again, I wanted to fix the last few missing pieces in the Nepomuk MetaData fetcher too.

Now there is a working fetcher that adds:

* a GUI to fetch metadata for folders and files via a Dolphin service menu
* an automatic fetcher that processes the files in the background without user interaction
* a plugin for Konqueror to import data from supported websites directly into Nepomuk
* an NPAPI plugin and extension for Chrome to import metadata from websites
* all of it can easily be integrated into any other program

Which websites are supported and how they are parsed is all defined in Python scripts (this can be done in Ruby and JavaScript too).

Currently only IMDB.com, theTvdb.com and academic.research.microsoft.com are supported.

Because everyone loves screen shots, here is one showing the Chrome integration:

You can get the source from https://projects.kde.org

What will the future bring?

* support for music files
* maybe support to fetch person data
* more websites to fetch data from

Friday, March 2, 2012

Nepomuk Metadata Extractor Update

A small update on my extractor approach.

I took Ben Smith's advice and had a look at Calibre.
Now you get a neat UI instead of a command line dialog with some random popup dialog.

Also the extractor is now a reusable library (as it was supposed to be anyway). The first screenshot shows the integration within Conquirere. The second screenshot shows what happens when the extractor is called via the Dolphin action menu. This way you can step forward/backward through the individual files and search/extract info for each of them. You see what will be inserted and decide if you really want it, or use a different entry/search engine and try again.

The UI still has some rough edges, but now you can change the search parameters manually and also restrict the search by author, year and so on.

So this is a nice UI that could be added to any program that wants to fetch and add metadata for its resources to Nepomuk.

While I start to like the UI, the plugin structure in the background isn't designed well enough to work in all kinds of situations. I'm sure this will be sorted out soon too.

The metadata extractor is now in playground if you would like to have a look.