July 2, 2010

Weekend Project: Spring Clean your Photo Collection


Back in May, we provided a step-by-step guide to sorting out and shoring up a digital audio library. As several readers noted, audio collections aren't the only media stores that can get out of control. Your digital photos may also be an unsorted mess, split up over several directories (or even machines), with inconsistent file names, duplicates, and a host of other problems. This weekend, why not sift through the clutter and get the whole collection in top shape before the camera comes home with a full memory card from the next big cookout?

The broad outline we'll follow is the same as the one used for music: find and eliminate duplicates, centralize storage, standardize the file names, then apply or fix the metadata. Image files have some key differences, though, particularly at the metadata stage, that deserve special attention. On top of that, because our photos are used as "read/write" information, unlike most people's music libraries, we'll have to pay special attention to the complexities that image editors can introduce.

Before starting, it is a good idea to move all of your images into a single location, to help with the duplicate-finding and batch-renaming processes. A side benefit is that you may automatically discover some exact duplicates in the process. If you are worried about accidentally deleting a file that has the same name as another, but is actually different content, feel free to create a subdirectory and move the duplicates there. Unlike with a music collection, where the artist and album make for ready-made folder hierarchies, there is no one-size-fits-all subdirectory scheme for images. Besides, they are best sorted on their metadata by our image management program.

Find and eliminate duplicates

As with music, you can inadvertently acquire multiple copies of the same photo, often when you offload from a memory card onto more than one computer. You can take the exact same approach to catching exact duplicates as we used for audio files: the command-line file scanner Fdupes. Running fdupes -r -n -A /home/username/photos/ will recursively search through the given directory, skipping over empty and hidden files, and compare checksums, listing each set of matching files on stdout.
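If Fdupes is not available, the same idea can be sketched in a few lines of Python. This is only an illustration of content-based duplicate detection, not a reimplementation of Fdupes: the function names find_exact_duplicates and file_digest are made up for this example, and the choice of SHA-256 is an assumption. Like the command above, it skips empty and hidden files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group non-empty, non-hidden files under root by identical content."""
    root = Path(root)
    by_size = defaultdict(list)
    for path in root.rglob("*"):
        rel_parts = path.relative_to(root).parts
        # Skip directories and anything inside (or named as) a dot-entry.
        if not path.is_file() or any(p.startswith(".") for p in rel_parts):
            continue
        size = path.stat().st_size
        if size > 0:
            by_size[size].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:  # files of different sizes cannot be identical
            for path in paths:
                by_hash[file_digest(path)].append(path)
    return [sorted(group) for group in by_hash.values() if len(group) > 1]


if __name__ == "__main__":
    # Hypothetical location; substitute your own photo directory.
    for group in find_exact_duplicates("/home/username/photos"):
        print(*group, sep="\n", end="\n\n")
```

Grouping by file size first means that only files with a size twin are ever hashed, which keeps the scan reasonably quick on a large collection.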

This will give you a list of files that have the exact same contents but different names. Because most cameras automatically name image files, this is not likely to be a big time-consumer, but it is always worth doing. What is more likely is that you will have generated low-resolution copies of your original images in order to send them via email or upload them to a web site.

In most cases, you do not need to retain these low-resolution copies forever, so as long as the original can be found, strongly consider deleting small versions of your images that you created just to send to relatives or add to your eBay auction pages. If you named these files by adding a suffix to the original filename, such as IMG3453-small.JPG, you shouldn't have any trouble finding them with your system's file manager and its search tools.
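If you would rather script that search than click through a file manager, something like the following sketch could work. It assumes the "-small" suffix convention from the example above and that each resized copy sits in the same directory as its original with the same extension; find_resized_copies is a hypothetical helper name. It only reports pairs and deletes nothing.

```python
from pathlib import Path


def find_resized_copies(root, suffix="-small"):
    """Return (copy, original) pairs for copies whose original still exists."""
    pairs = []
    for copy in sorted(Path(root).rglob(f"*{suffix}.*")):
        if not copy.is_file() or not copy.stem.endswith(suffix):
            continue
        # IMG3453-small.JPG -> IMG3453.JPG in the same directory
        original = copy.with_name(copy.stem[: -len(suffix)] + copy.suffix)
        if original.is_file():
            pairs.append((copy, original))
    return pairs


if __name__ == "__main__":
    # Hypothetical location; substitute your own photo directory.
    for copy, original in find_resized_copies("/home/username/photos"):
        print(f"{copy}  (original: {original})")
```

Copies whose original cannot be found are deliberately left out of the list, since those are the ones you may want to keep.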

On the other hand, it is always possible that you have duplicate images that you will not catch by filename alone. Fortunately, several of the popular Linux photo editors and managers have built-in duplicate detection capabilities, with which you can catch duplicates based on image similarity. Geeqie (which is a revival of the venerable GQView photo organizer) and Gwenview are your best bets here. The Digikam, Picasa, and F-Spot applications also have this functionality, but they require importing your photo collection first, which would be a mistake at this point in the clean-up process (albeit useful later on).

Rename (and in some cases, convert) in a standard manner

File names are not quite as big an issue with photo collections as they are with music libraries, but they can still be confusing, particularly if you have photos from multiple cameras, each of which uses a different naming scheme. In addition, Extensible Metadata Platform (XMP) metadata can be stored in a "sidecar" file that uses the filename of the original file (sans extension), to avoid touching the original. Two images that share the same base filename therefore create unfortunate ambiguity about which one the sidecar describes. The easiest solution for your existing files is a batch renamer like PyRenamer, which we also recommended for music collectors.
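To see whether the sidecar ambiguity affects your collection, you can look for files in the same directory that differ only in their extension. The sketch below is one way to do that; find_sidecar_collisions is a hypothetical name, and the code assumes sidecars use the .xmp extension and compares base names case-sensitively.

```python
from collections import defaultdict
from pathlib import Path


def find_sidecar_collisions(root):
    """Map each shared base path to the files that would share one sidecar."""
    by_base = defaultdict(list)
    for path in Path(root).rglob("*"):
        # Ignore the sidecars themselves; we only care about their owners.
        if path.is_file() and path.suffix.lower() != ".xmp":
            by_base[path.with_suffix("")].append(path.name)
    return {
        str(base): sorted(names)
        for base, names in by_base.items()
        if len(names) > 1
    }


if __name__ == "__main__":
    # Hypothetical location; substitute your own photo directory.
    for base, names in find_sidecar_collisions("/home/username/photos").items():
        print(f"{base}.xmp is ambiguous between: {', '.join(names)}")
```

A raw file and the JPEG saved alongside it by the camera are the most likely pair to show up in the output.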

Most standalone cameras on the market today allow you to tweak the naming scheme as an in-camera setting. Consider doing this if you regularly use two or more cameras, to ensure that they use different prefixes, such as EOS1D_ and COOLPIX_. Picking a descriptive prefix also saves your brain a tiny amount of work; EOS1D_5678.CR2 may not be much more descriptive than IMG_5678.CR2, but every little bit helps.

If you have scanned images (either from your own scanner or from DVD media), you should take the time to give them a thorough batch-renaming to provide some consistency. Since cameras tend to use a predictable filename prefix like DSCN or IMG, something easy-to-understand like SCAN or Scan_ is hard to beat. Finally, do not overlook your phone. Few smartphones allow you to adjust the camera's image file prefix, but you should do it if possible to avoid naming conflicts. On models where it is not possible, you may start to collect images from evenings out that have unwieldy names like 20100526_004.JPG, transferred over Bluetooth to various locations around your hard drive. Ceaseless vigilance will be required on your part; take a minute to steel your resolve.
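A graphical renamer is the comfortable route, but a prefix swap is also simple enough to script. The following is a minimal sketch, not a substitute for PyRenamer: plan_prefix_rename and apply_rename are hypothetical helpers, and the directory and prefixes in the usage section are examples only. The plan is built first so you can inspect it before any file is touched, and the script refuses to continue if a target name is already taken.

```python
from pathlib import Path


def plan_prefix_rename(directory, old_prefix, new_prefix):
    """Return (source, target) pairs without renaming anything."""
    plan = []
    for source in sorted(Path(directory).iterdir()):
        if source.is_file() and source.name.startswith(old_prefix):
            new_name = new_prefix + source.name[len(old_prefix):]
            target = source.with_name(new_name)
            if target.exists():
                raise FileExistsError(f"{target} already exists")
            plan.append((source, target))
    return plan


def apply_rename(plan):
    """Carry out a plan produced by plan_prefix_rename."""
    for source, target in plan:
        source.rename(target)


if __name__ == "__main__":
    # Hypothetical directory and prefixes; adjust to your own scans.
    plan = plan_prefix_rename("/home/username/photos/scans", "IMG_", "SCAN_")
    for source, target in plan:
        print(f"{source.name} -> {target.name}")
    # Uncomment once the printed plan looks right:
    # apply_rename(plan)
```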

One controversial option worth discussing is converting some images into a more archival format. Many people consider this overkill, because even converting a lossy JPEG into a TIFF file (for example) does not add any quality. However, there are proponents of the Digital Negative (DNG) format who suggest keeping your image library in DNG format: specifically, converting TIFF and other generic formats into DNG, though not camera raw formats. The advantages are said to be easier image editing in raw editors like Rawstudio, UFRaw, or RawTherapee, as well as better metadata support in all image management applications. If you decide to do this, Digikam is currently the only option on Linux.

Select an image management application, and assign the metadata

By far, the most time-consuming process of the photo library spring cleaning is writing metadata to the files. This is because, unlike with commercial music tracks, the meaningful metadata has to come from you and cannot be looked up automatically over the Internet. It is certainly tempting to skip this step, but in the long run it is what enables you to find your photos months and years in the future. Just ask yourself this question: could you manage your music library if all you had to browse your collection with were one-to-five star ratings and the dates the various tracks were recorded? Fortunately, there are ways to speed up the metadata assignment process, but it cannot be bypassed entirely.

The only information written automatically into a photo is technical data like the shutter speed, aperture, and whether the flash fired, found in the EXIF tags. But you aren't likely to remember a photo based on its aperture setting. For content-based metadata, you will use the XMP tag format. XMP supports every kind of relevant information, from caption to photographer's name, to names of the people depicted, to geo-tags based either on GPS location or generic country and city name. XMP is an XML-based replacement for an older format called IPTC, and the good news is that it is well-supported in many modern Linux photo managers.
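To make the "XML-based" part concrete, here is a rough sketch of what a small XMP document with keywords and a photographer's name looks like when generated from Python. The helper names (build_xmp, sidecar_path, q) are invented for this example. It uses the Dublin Core dc:subject and dc:creator properties, leaves out the xpacket wrapper that many tools emit around the XML, and is meant for illustration; a photo manager will normally write this for you.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, tag):
    """Build a namespace-qualified tag name for ElementTree."""
    return f"{{{NS[prefix]}}}{tag}"


def build_xmp(keywords, creator):
    """Return XMP as a string: keywords in dc:subject, name in dc:creator."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # Keywords are an unordered bag of strings.
    bag = ET.SubElement(ET.SubElement(desc, q("dc", "subject")), q("rdf", "Bag"))
    for keyword in keywords:
        ET.SubElement(bag, q("rdf", "li")).text = keyword

    # Creators are an ordered sequence, here with a single entry.
    seq = ET.SubElement(ET.SubElement(desc, q("dc", "creator")), q("rdf", "Seq"))
    ET.SubElement(seq, q("rdf", "li")).text = creator
    return ET.tostring(xmpmeta, encoding="unicode")


def sidecar_path(image_path):
    """Sidecar name: the image's filename with its extension swapped for .xmp."""
    return Path(image_path).with_suffix(".xmp")


if __name__ == "__main__":
    # Example values only.
    print(sidecar_path("IMG_5678.CR2"))
    print(build_xmp(["cookout", "family"], "Jane Doe"))
```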

You will probably want to assign metadata to photos within a photo management tool, although that is not absolutely necessary. The photo batch processor Phatch can edit and assign XMP metadata, which allows you to process potentially hundreds of photos with similar features in a single click. The down side to this approach, though, is that you must know which photos to drop into Phatch's batch processing queue.

The more typical solution is to use one of the photo collection managers — of which Linux has a vast and ever-growing array of options. Each one has its proponents, but the vast majority of them offer the same basic feature set. Unfortunately, most projects these days seem to focus on adding editing and web-uploading features rather than image management features, which leads me to offer this very simple test: if you want to see how good a photo manager is, look at its search box. If the search box allows you to look for a photo based on any criteria you can think of, it is going to serve you well. If all it offers you is star-rating and time-stamp search parameters, you are probably going to stop using it before too long.

Based on that practical measurement, the current leader of the pack is definitely Digikam. Digikam can read and edit EXIF, XMP, IPTC, and Makernote (a variant of EXIF) metadata, and it allows you to search on a wide — although not an unlimited — set of these metadata tags. Digikam also allows you to keep your photo collection in its current location, unlike some managers that automatically copy every photo they "import" into a private location. That is valuable not only because it saves you disk space, but because it allows you to access the same collection with other tools. You can even import multiple, discrete directories, and have Digikam watch them individually for changed or added files.

You can edit XMP metadata for an image (or a set of images) through the Image -> Metadata menu. The current release offers eight categories of metadata: Content, Origin, Credits, Subjects, Keywords, Categories, Status, and Properties. Some of them sound redundant at first, and indeed you could interpret some of the categories in overlapping ways, but the important thing is that they are available. Fill in all that you can. The vast majority of your pictures might need to be marked with your name in the "photographer" field, but accurately recording this is a lifesaver when you need to find that photo that comes from your spouse, relative, or co-worker's camera. Finally, note that in Digikam, you can batch-process your metadata edits to affect entire collections at once to save time.

Other useful alternatives include Geeqie, which takes a more lightweight approach than Digikam. In fact, Digikam's one flaw might be that it does too many things — it includes a full-fledged raster image editor complete with effects, as well as slide show and various web service uploading plugins. Geeqie is slimmer, and implements editing by letting you launch any file in an external editing program, whether that program is GIMP, Phatch, Hugin, or Rawstudio. Geeqie supports XMP and EXIF tags, although currently its search interface is limited to the "Keywords" and "Comment" tags, which are generic. Better support is slated to come, though, so keep an eye on the project.

You might also take a look at ResourceSpace, which is a web-based image manager with excellent storage and metadata management options. It is designed for multi-person use (such as around an office), which adds complexity to the interface, but it is a solid and blindingly fast option. If you embed it in a Mozilla Prism container, you can even run it as if it were a desktop program.

Further work: image editing, non-photo images, and backups

With a little time and a little patience, it is possible to whip your photo collection into tip-top shape: a real library, indexed by its contents so you can always find what you are looking for, rather than a digital shoebox that you leave untouched because flipping through it image by image is such an ordeal. However, the real test comes when you start to add new files to it.

If you followed the directions on renaming the file prefixes of your cameras, you should be able to avoid future naming conflicts. And you will find it easier to fill in metadata tags on new image imports than on the entire collection all at once. Digikam even lets you set up metadata templates to pre-fill the most common tags automatically.

What isn't so simple is adopting and sticking to a naming scheme for edited images. You will have to correct, crop, and resize images. Even if you use a lossless raw editor, the output can collect in the same folder as your originals and get out of control if you do not perform maintenance. Unfortunately there is no way to automate that process; it is up to your own willpower.

If you use a raw photo editor, you've probably noticed that it saves a "sidecar" file containing its stack of image adjustments, rather than altering the original file. A similar approach is taken by Hugin and Phatch, and may be taken by future versions of GIMP. This presents two challenges. First, you probably do not want these non-image files taking up space in your photo manager's thumbnail browser. The various managers take different approaches to this: Digikam ignores them completely; Geeqie lets you selectively activate or deactivate each file format individually. Depending on your photo manager's behavior, these sidecar files may trigger the "directory has changed" flag that forces a re-scan of the entire folder. Second, you do want to make sure these files are backed up, so be absolutely certain to account for them in your backup scripts.
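One way to check that your backups account for sidecars is to list them first. The sketch below treats any non-image file whose name is built on an image's name in the same directory as a sidecar; find_sidecars is a hypothetical helper, and the IMAGE_EXTENSIONS set is an assumption you should adjust to the formats your cameras and editors produce.

```python
from pathlib import Path

# Assumed set of image extensions; extend it for your own collection.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".cr2", ".nef", ".dng"}


def find_sidecars(root):
    """Return non-image files named after an image in the same directory."""
    root = Path(root)
    directories = [root] + [p for p in root.rglob("*") if p.is_dir()]
    sidecars = []
    for directory in directories:
        files = [p for p in directory.iterdir() if p.is_file()]
        images = [p for p in files if p.suffix.lower() in IMAGE_EXTENSIONS]
        stems = {p.stem for p in images}  # IMG_1 (sidecar replaces extension)
        names = {p.name for p in images}  # IMG_1.CR2 (sidecar appends to name)
        for candidate in files:
            if candidate.suffix.lower() in IMAGE_EXTENSIONS:
                continue
            if candidate.stem in stems or candidate.stem in names:
                sidecars.append(candidate)
    return sorted(sidecars)


if __name__ == "__main__":
    # Hypothetical location; substitute your own photo directory.
    for sidecar in find_sidecars("/home/username/photos"):
        print(sidecar)
```

Comparing this list against the contents of a backup is a quick sanity check that no adjustment files were silently excluded by an images-only filter.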

It is fully up to you whether or not you want to use the same techniques mentioned here to manage non-photographic images; if you create a lot of screenshots for work or your personal open source projects, it can really help to manage them in a consistent way. Furthermore, because XMP metadata can itself be written in a sidecar, you can associate XMP's well-thought-out metadata scheme even with formats that were not originally designed for it, such as PNG. Finally, if managing all of your photos, screenshots, and other images isn't enough, try tackling your video collection, too. The XMP format is designed to support it. But you may need more than a weekend to find a good video manager that understands it.
