May 28, 2010

Weekend Project: Spring Clean Your Music Library

A long weekend is coming up. If you're like most of us, your collection of digital music has accumulated a few problems over time, making it an iffy proposition to simply copy a folder from your hard drive to your phone or music player. There are duplicate files, files with bad or missing tags, and maybe a few stray songs purchased from a different store, sitting in a separate folder from the material you rip from CDs. And although your slick new media player software supports displaying album art, most of your albums don't have artwork attached to them. Why not set aside a few hours to give your library a quick scrubbing?

The game plan is simple. First, locate and remove unnecessary duplicate files. Then, combine your stray files into a single library and fix up the directory and file names so that they follow a proper naming scheme. After that, fix the metadata tags, then, finally, find and apply the cover art. At the end, you'll have a uniform and polished music library you can show off to the neighbors.  All it takes is a few special-purpose tools and a bit of time.

Find and Eliminate Duplicate Files

With any large collection of digital content, one of the primary ways in which clutter collects is through the acquisition of duplicate files -- old backups, consolidating individual archives into a single location, re-purchasing assets, and re-encoding can all leave you with a filesystem that wastes considerable space on duplicate content. Locating and disposing of the extras is not as simple as searching for the same file name, either -- an inexact naming scheme will let similar file names through, and you may have mislabeled content altogether.

The best way to identify multiple copies of the same file is to start with a file scanner that matches the contents of files on disk using checksums. Following such a scan, you can eliminate the exact matches and move on to more difficult duplicate-matching techniques. Fdupes is a command-line tool designed for this purpose. You should be able to install fdupes through your distribution's package-management system; if not, you can download builds for most popular Linux distributions at the project's Web site. You run it from the command line, and it takes a number of run-time options to perform different types of scans. For example, fdupes -r -n -A /home/username/media/ will perform a recursive search (-r) starting in the specified directory, will not report any empty files as matching (-n), and will skip any hidden files (-A; file names starting with a dot). These last two options will save you a lot of spurious matches. Fdupes prints out matching pairs (or triples, or n-tuples) of files to stdout. There is a switch to automatically delete all but the first instance of the duplicates, but that is probably not worth the risk.
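If you are curious how checksum-based matching works, here is a minimal Python sketch of the idea. It is not how fdupes itself is implemented; the function names and the choice of SHA-256 are assumptions made for illustration. Like the fdupes options above, it recurses through a directory, ignores empty files, and skips hidden files (and, as an extra choice of this sketch, hidden directories too).

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return sorted groups (lists) of two or more files with identical content."""
    by_size = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if name.startswith("."):
                continue  # skip hidden files
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            if size > 0:  # never report empty files as matching
                by_size[size].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

Calling find_duplicates("/home/username/media/") returns a list of groups, each one a list of paths with the same content. The sketch only reports matches; deciding which copy to delete is left to you, for the same reason that automatic deletion in fdupes is risky.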

While fdupes can catch files with the same content but different names, it obviously will not match two different encodes of the same audio track in different formats. In some cases, you want to maintain two copies -- perhaps a lossless directory with FLAC copies for use on the media server, and a second directory for considerably smaller Vorbis files you can move onto and off of portable devices. But re-encodes can happen accidentally, too -- an old directory filled with a Vorbis or MP3 album purchased off of the Web that you don't really need anymore, now that you have access to a lossless (or simply higher-quality) source.

To capture these duplicates, the easiest approach is to use a dedicated music tag exploration tool, such as EasyTag. It can scan a directory (including a nested directory structure), read all of the tags in the files found within, and allow you to easily identify duplicates by sorting results on the tags. Although EasyTag is powerful, its interface is on the peculiar side. The main window contains a directory tree, a list of files found within, and a tag-editing pane. Whenever you click on a directory in the tree, EasyTag rapidly scans the entire directory and loads the files found within into the list.  This may not be what you expect and could take time on a large directory -- if you are just navigating to the correct folder, be sure you click only on the "+" expansion button next to directories in the tree. Once you have loaded the directory in question into the list, just click on a column heading to sort the found files; you can delete any duplicates through the right-click menu.

Admittedly, mis-tagged files are a problem in their own right, and examining only the tags will not help you there. To discover the actual content of a file, there is no substitute for simply firing it up in your audio player of choice.
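The tag-sorting trick boils down to grouping files by their artist, album, and title. The sketch below shows that logic in Python under a loud assumption: the tags have already been read into plain dictionaries by some other tool, and the (path, tags) input format and function names are hypothetical. It does not read audio files itself and is not part of EasyTag.

```python
from collections import defaultdict


def normalize(value):
    """Lower-case a tag value and collapse whitespace so near-identical tags match."""
    return " ".join(value.casefold().split())


def group_by_tags(tracks):
    """Group paths that share artist, album, and title.

    tracks is an iterable of (path, tags) pairs, where tags is a dict that
    may contain 'artist', 'album', and 'title' keys (a hypothetical format).
    Returns a dict mapping each normalized tag triple to its list of paths,
    keeping only triples that occur more than once.
    """
    groups = defaultdict(list)
    for path, tags in tracks:
        key = tuple(normalize(tags.get(field, ""))
                    for field in ("artist", "album", "title"))
        if all(key):  # files with missing tags need manual review instead
            groups[key].append(path)
    return {key: paths for key, paths in groups.items() if len(paths) > 1}
```

A FLAC file and an old MP3 of the same song end up in the same group, so you can decide which encode to keep. Files with incomplete tags are deliberately left out, because matching on empty fields would lump unrelated tracks together.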

Rename Using a Standard Syntax

With a single copy of each file, now is the best time to clean up the directory structure of your library itself -- however it is organized. If you follow a standard scheme already, such as Artist/Album/Song.ogg, moving folders around may be the only step required. If, on the other hand, you prefer to encode the artist and album in the file name too for added clarity when viewing individual files (such as Artist/Album/Artist_Album_Song.ogg), or to add the track number, then you may need to rename a lot of the files themselves. This is not only tedious, it can easily get confusing after a few hundred files. It is better to use a dedicated batch-renaming tool to do most of the heavy lifting.

Several batch-renamers are available for Linux, but for this particular task one of the best is pyRenamer, thanks to its understanding of audio tags and collection naming conventions. You can do a music-specific renaming run by switching to the "Music" tab at the bottom of the pyRenamer window. PyRenamer's syntax uses curly braces for tag names; for example, the pattern {artist}-{album}-{title}. You can also specify common substitutions in a separate tab, to replace spaces with underscores, fix capitalization, and remove accent marks that may cause trouble on some simpler devices. PyRenamer can be run in files-only, directories-only, and files-and-directories mode, depending on what you need to rename.
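To see what a pattern plus substitutions amounts to, here is a small Python sketch that fills a curly-brace pattern from a dictionary of tags, strips accent marks, and replaces spaces with underscores. It only illustrates the idea and is not pyRenamer's code; the function names and the tags dictionary are assumptions.

```python
import re
import unicodedata


def strip_accents(text):
    """Replace accented characters with their unaccented base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_name(pattern, tags, extension):
    """Fill a {tag}-style pattern from a tags dict and tidy up the result."""
    name = pattern.format(**tags)              # raises KeyError if a tag is missing
    name = strip_accents(name)
    name = name.replace("/", "_")              # a slash would create a subdirectory
    name = re.sub(r"\s+", "_", name.strip())   # runs of whitespace become one underscore
    return name + extension
```

For example, build_name("{artist}-{album}-{title}", tags, ".ogg") produces a single file name from one track's tags. Letting a missing tag raise an error is a deliberate choice here: it is better to stop than to rename a file to something like "--.ogg".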

You may also want to explore the renaming functions of EasyTag and of the Picard tag editor. In EasyTag, the renamer is available as an option under the "Scanner" menu. Before using it, however, you must configure its behavior separately in the EasyTag preferences. EasyTag's renaming syntax uses percent signs for tag types, such as %a-%b-%t (again, for Artist-Album-Title). EasyTag makes more tags available as renaming keys, although some of them, such as %c for "comment," are not particularly useful.

Picard also has a renaming utility, although the built-in documentation does not include an explanation of the syntax, which is more complicated than that of either pyRenamer or EasyTag.  For instance, you must specify the number of digits to be used for numeric values like track number. On the other hand, Picard does include built-in support for naming multi-artist tracks differently, which could be your best option if you have many of those files to process.

Fix Missing and Incorrect Tags

Now, with a single copy of each file in storage and a cleaned-up naming scheme, you can begin to tackle the actual contents of the files -- starting with the metadata stored inside the ID3 tags (for MP3) or Vorbis comments (for Ogg). Both Picard and EasyTag, mentioned above, can handle batch tag-filling duties, but in different ways.

In EasyTag, you can add a single "artist" or "genre" tag to an entire directory full of files with one click, for example. First, you scan a directory of files. You can then multi-select all of the files you wish to apply the same tags to from the file list.  EasyTag will always display the most-recently-selected file's tags in the tag editing pane to the right -- to apply one of those tags to the entire file selection, just click the small button next to the text entry widget.

EasyTag can also batch-apply tags to files based on the names of the files, so if you have already corrected your naming conventions as described in the previous step, this can be a quick-and-dirty approach to filling in the basic metadata tags.
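Filling tags from file names is essentially pattern matching in reverse. The Python sketch below converts a mask written with the percent-sign codes mentioned earlier (%a, %b, and %t) into a regular expression and pulls the fields out of a file name. This is an illustration only, not EasyTag's implementation, and the function names are hypothetical.

```python
import os
import re

# Percent-sign codes as described in the article; the mapping covers only these three.
CODES = {"%a": "artist", "%b": "album", "%t": "title"}


def mask_to_regex(mask):
    """Turn a mask such as '%a-%b-%t' into a compiled regular expression."""
    pattern = ""
    for part in re.split(r"(%[a-z])", mask):
        if part in CODES:
            pattern += "(?P<%s>.+?)" % CODES[part]  # named group for this tag
        else:
            pattern += re.escape(part)              # literal separator text
    return re.compile("^" + pattern + "$")


def tags_from_filename(path, mask="%a-%b-%t"):
    """Return a dict of tags guessed from the file name, or None if it does not fit."""
    stem = os.path.splitext(os.path.basename(path))[0]
    match = mask_to_regex(mask).match(stem)
    if match is None:
        return None
    # Undo the spaces-to-underscores substitution from the renaming step.
    return {tag: value.replace("_", " ") for tag, value in match.groupdict().items()}
```

The weak point of this approach is a separator that also appears inside a field: with a hyphen-separated mask, a hyphenated artist name will be split in the wrong place. That is why it is a quick-and-dirty method and the results deserve a look before you save them.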

Picard takes a different approach. It ties in with the free MusicBrainz service, and tries to match un-tagged files by calculating an "acoustic fingerprint" of the track audio, which it then uses to query the central MusicBrainz server. Assuming that other users have correctly tagged the track, it can then make intelligent guesses about the likely content of tags. You can automatically accept the server's best match, or manually select a metadata listing from the search results.  In the event that you do not get a match, you can add the appropriate metadata yourself in the Picard application, and send it to MusicBrainz for the benefit of others.

Of course, no amount of guesswork can come up with details like composer or year, but most music players don't make use of these tags anyway.  Still, both tag editors can help you fill them in quickly, if you can look up the appropriate data.

Find and Add Cover Artwork

More and more audio players are making use of cover art in the user interface, particularly on portable devices where such cues can prove useful as you scroll through a collection on a small screen. Unfortunately, the audio world has two competing standards for how album art is stored and associated with music files, and on top of that there is no generally-approved method of grabbing the correct cover.

Format-wise, there is actually a standard ID3 tag for a cover image, which you can use to embed an image file into an MP3. Ogg, however, does not use this format, and Ogg's own cover tag format is still not widely supported. However, if you choose to go the embedded-cover-image route, both EasyTag and Picard can embed the cover image for you as part of their normal file tagging duties.

The alternative is an ad-hoc solution that works for most music players: drop an image file named "folder.jpg" or "cover.jpg" into each album's directory. The player will look in the current directory for such a file whenever it plays a new track, and display the folder image as the album cover.  Naturally, this provides only folder-level granularity, so rarities collectors may face the difficult choice between creating dozens of folders for one or two tracks each, or trying to find an appropriately broad cover image for a collection of unrelated songs in a single directory.
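If you go the folder-image route, it helps to know which album directories still lack a cover. The following Python sketch lists every directory that contains audio files but neither "folder.jpg" nor "cover.jpg". The list of audio extensions and the case-insensitive name matching are assumptions of this sketch; some players may insist on an exact, lower-case file name.

```python
import os

COVER_NAMES = {"folder.jpg", "cover.jpg"}
AUDIO_EXTENSIONS = {".ogg", ".flac", ".mp3"}  # assumption: extend to suit your library


def albums_missing_cover(root):
    """Return a sorted list of directories with audio files but no cover image."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        lowered = {name.lower() for name in filenames}
        has_audio = any(os.path.splitext(name)[1] in AUDIO_EXTENSIONS
                        for name in lowered)
        if has_audio and not (lowered & COVER_NAMES):
            missing.append(dirpath)
    return sorted(missing)
```

Run it against the top of your library and you get a to-do list of albums to feed to whichever cover-finding method you choose.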

In either case, though, you must first locate and acquire the cover image before you can add it to the file or rename it for directory-wide usage. Some music players provide plug-ins to try and retrieve album art images (generally from common online music stores), but none of them are flexible in their approach. For batch album art location, there is but one option on desktop Linux systems: Album Cover Art Downloader. It is a small Qt application that is not usually packaged by distributions. However, you can download the latest release from the project's Web site and install it on any modern Linux desktop. The dependencies are Python, Python-Qt, and Python-imaging, all of which are commonly available.

Once installed, you can start the program with albumart-qt & from the command line.  Select a directory to be scanned with File -> Open, then choose Edit -> "Download missing cover images."  The application will recurse through the selected directory, and try to retrieve cover images from a variety of online sources by performing HTTP search queries based on the track metadata. You can configure which sources to check from the preferences. You may want to start with a small directory until you get a feel for the program's success rate with the chosen image source; some are notoriously unreliable, particularly online stores that may rotate cover images or feature band photos instead of product shots. In any case, Album Cover Art Downloader will let you select which of the matching images it saves to the folder location. You can scroll through the images found for each album, replacing false positives with better options, or re-doing the search entirely.

Cover art downloading is more art than science, so this could be the most time-consuming step of the process -- it does require manual attention.  But if you make a few test runs and things seem to be performing smoothly, you can run a batch download session on your entire collection, leave the process running unattended, and spend a few minutes cleaning up the obvious mistakes after the fact.

Further Work: Gain, Formats, Other File Types

If you are still in the mood to fix up your music library but have already finished the above steps with plenty of spare time remaining, there are a few other factors to consider. The first is adding Replay Gain metadata to all of your files. Replay Gain is a numerical calculation that normalizes the apparent volume of different tracks; if enabled in a player, all tracks marked with Replay Gain tags will sound approximately equally loud, no matter how they were originally mastered. Note that this is not a simple volume boost, and it does not make quiet tracks sound flat and washed out or loud tracks sound "clipped"; it is an acoustic property based on what sounds "loud" to the human ear.

Command-line tools exist to calculate and tag Replay Gain in Vorbis and MP3 files, and the capability is built into the official FLAC encoder itself. The down side is that the entire file has to be scanned, so this is a set-it-running-overnight process; calculating gain for an entire collection of files could take many hours. Classical music fans especially will be happy to hear that Replay Gain tools have a "whole album" mode, which lets you set a single gain for an entire piece rather than for each individual movement, which could destroy the mood.  Most Linux audio players support Replay Gain.

Next is automatically creating a separate, low-bitrate version of your existing library.  As mentioned earlier, you may prefer to keep high-quality, lossless versions of your music on the hard disk of your desktop machine or media server, where space is plentiful and cheap; but you might still want a smaller version for portable use, where headsets do not produce the same level of audio fidelity and storage space is limited.

Several tools exist to help you recursively batch-convert an entire directory structure into a second format, including SoundConverter. SoundConverter can batch-convert an entire directory structure in a single pass, preserving tag information in the output files and giving you the option of saving the results in the same directory as the original files, or in a separate location. As with Replay Gain, though, this is a time-consuming process that you will not want to sit and watch (and which may even slow down other work, particularly on a single-processor system). Best to start it overnight and wake up the next morning with a fresh batch of audio waiting for you.
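Saving the results in a separate location means mirroring the directory layout of the original library. The Python sketch below shows just that bookkeeping step: it walks a source tree and works out where each converted file would go, without doing any encoding. It is not how SoundConverter works internally, and the function name and default extensions are assumptions.

```python
import os


def plan_conversion(source_root, target_root, source_ext=".flac", target_ext=".ogg"):
    """Return sorted (source, target) path pairs that mirror the source tree."""
    plan = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        # Directory of the current files, relative to the top of the library.
        relative_dir = os.path.relpath(dirpath, source_root)
        for name in filenames:
            stem, ext = os.path.splitext(name)
            if ext.lower() != source_ext:
                continue  # leave cover images and other files out of the plan
            target = os.path.normpath(
                os.path.join(target_root, relative_dir, stem + target_ext))
            plan.append((os.path.join(dirpath, name), target))
    return sorted(plan)
```

Each pair could then be handed to the encoder of your choice. Building the plan first makes it easy to review the output paths, or to skip targets that already exist, before committing to an overnight run.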

Finally, you probably realized long ago that music is far from the only file type in need of some cleaning up on your hard drive. You can use many of the same techniques to help straighten up your photos, videos, and other content as well. Sadly, videos, which are the most like audio libraries in other respects, suffer from a distinct lack of metadata standardization. In all probability, you will have to deal with the internal settings of your specific media-playing application (be it Boxee, MythTV, VLC, Miro, or any other tool) in order to set and forget useful properties like titles, plot summaries, cast information, and cover images.  But don't let that discourage you -- as long as you are in spring cleaning mode, use the energy. At the very least, run fdupes on your home folder, and start clearing out the duplicates, whether they are images, source code, office documents, or any other type of content. 

A clutter-free directory is a happy directory.
