Author: Manolis Tzanidakis
User Level: Beginner to intermediate
Recent versions of most Linux distributions support non-English languages out of the box by using the Unicode standard. I was pleasantly surprised when I found out that I was able to read and write in Greek — my native language — on a fresh Ubuntu Edgy Eft installation without any manual intervention. Unfortunately, my happiness lasted only until I tried to open files with Greek file names. Instead of Greek characters I saw garbage. I’ve been using the 8-bit ISO 8859-7 encoding for Greek file names, and since it worked well I was too lazy to convert my systems to Unicode. Manually renaming hundreds of files in order to convert them to Unicode was not an option; I needed some kind of automation. Convmv is the right tool for that job.
Convmv is a Perl program that converts the names of files and directories from one character encoding to another. It changes only the names, not the contents of the files, and can also convert a whole filesystem, including symlinks. Most Linux distributions offer packages for convmv, and you can also find it in the FreeBSD ports and NetBSD pkgsrc. Manual installation is fairly easy since the program depends only on Perl, which is installed by default on virtually all Linux distributions and BSD variants; running
make install will install the program in /usr/local/bin and its man page in /usr/local/share/man/man1.
Running convmv without any arguments prints the list of all available options. Each option is explained in detail in the program’s man page.
Let’s start by running
convmv --list to display the supported encodings. To convert all Ogg/Vorbis files in the current directory from ISO 8859-7 to UTF-8 (Unicode) run
convmv -f iso-8859-7 -t utf8 *.ogg. This command will not actually rename the files; it only prints what it would do. To rename the files, add the
--notest option. If you want the program to ask for confirmation before any action, add the
-i option to enable interactive mode.
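To make the dry-run idea concrete, here is a minimal Python sketch of what converting a file name between encodings means at the byte level. It is not how convmv is implemented; the function name plan_renames and the sample file names are my own, made up for illustration.

```python
def plan_renames(names, src="iso-8859-7", dst="utf-8"):
    """Return (old, new) byte-string pairs; nothing on disk is touched."""
    plan = []
    for old in names:
        new = old.decode(src).encode(dst)  # reinterpret the name, then re-encode it
        if new != old:                     # pure-ASCII names come out unchanged
            plan.append((old, new))
    return plan

# The Greek word for "music" plus ".ogg", as raw ISO 8859-7 bytes (hypothetical file).
legacy_name = b"\xec\xef\xf5\xf3\xe9\xea\xde.ogg"

plan = plan_renames([legacy_name, b"plain.ogg"])
for old, new in plan:
    print(old, "->", new)  # a dry run: only report what would change
```

Each Greek letter takes one byte in ISO 8859-7 and two bytes in UTF-8, so the 11-byte legacy name becomes an 18-byte UTF-8 name, while the ASCII-only name is left alone.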
By default the program checks whether the file names you want to rename are already using the specified encoding and skips them accordingly. Although you can speed up the whole process by disabling this feature with the
--nosmart switch, it’s better not to, since it could lead to “double-encoded” file names with incorrect characters. Nevertheless, the man page has a section on how to repair double-encoded files. The program also refuses to rename a file if a file with the converted name already exists in the same directory. You can, however, use the
--replace switch to overwrite the existing file, provided its content is identical to that of the file being renamed.
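The double-encoding problem is easier to see with actual bytes. The following Python sketch is my own simplified illustration, not convmv's code: the helper names and the "already valid UTF-8" test are assumptions standing in for whatever check the program really performs.

```python
def convert(name: bytes, src="iso-8859-7", dst="utf-8") -> bytes:
    """Blindly re-encode a file name from src to dst."""
    return name.decode(src).encode(dst)

def is_valid_utf8(name: bytes) -> bool:
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def smart_convert(name: bytes) -> bytes:
    """Assumed stand-in for the default check: skip names that are already UTF-8."""
    return name if is_valid_utf8(name) else convert(name)

# The Greek word for "water" plus ".txt", as raw ISO 8859-7 bytes (hypothetical file).
legacy_name = b"\xed\xe5\xf1\xfc.txt"

once = convert(legacy_name)   # correct UTF-8 name
twice = convert(once)         # converted again by mistake: double-encoded
print(len(legacy_name), len(once), len(twice))

# With the check in place, a second pass leaves the name alone.
assert smart_convert(once) == once

# Undoing the extra step recovers the correct UTF-8 name in this case.
repaired = twice.decode("utf-8").encode("iso-8859-7")
assert repaired == once
```

The second, unwanted pass treats every byte of the UTF-8 name as a separate ISO 8859-7 character, which is why the name grows again and shows up as garbage.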
After making sure that your options work correctly, it’s time to convert the whole filesystem to UTF-8 with a single command. We will also add the
-r switch, which enables recursive mode. For example, issue
convmv -f iso-8859-7 -t utf8 -r --notest --replace ~/data to convert all the files and directories inside the data directory in your home directory from ISO 8859-7 to UTF-8. You can also use convmv to convert file names to all upper or lower case with the
--upper and --lower options respectively. If the file name is not ASCII-encoded you must also supply its encoding with the -f option.
Besides the conversion to Unicode, convmv can be useful when you need to exchange files with users of obsolete operating systems that have no support for Unicode, such as Windows 98 or older versions of Linux. Speaking of cross-platform interoperability, Mac OS X has a strange way of handling Unicode-encoded file names. Linux and most other Unix-like operating systems use Normalization Form C (NFC) for UTF-8 file names, while OS X uses Normalization Form D (NFD). Convmv can convert file names between these two forms with the
--nfc and --nfd switches. You might face similar issues with the JFS and NFS v4 file systems; check the convmv man page for more information.
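The difference between the two normalization forms can be shown with Python's standard unicodedata module. This is only an illustration of NFC versus NFD on a sample name of my own choosing, not a description of what convmv does internally.

```python
import unicodedata

# The Greek word for "water" plus ".txt"; the last letter is a precomposed
# omicron with tonos (U+03CC), which is how NFC stores it.
name_nfc = "\u03bd\u03b5\u03c1\u03cc.txt"

# NFD splits that letter into a plain omicron (U+03BF) followed by a
# combining acute accent (U+0301).
name_nfd = unicodedata.normalize("NFD", name_nfc)

print(len(name_nfc.encode("utf-8")), len(name_nfd.encode("utf-8")))

# The two names look identical on screen but are different byte strings,
# so a file saved under one form is not found under the other.
print(name_nfc == name_nfd)

# Normalizing back to NFC restores the original name.
print(unicodedata.normalize("NFC", name_nfd) == name_nfc)
```

Both strings are valid UTF-8, which is why this mismatch is easy to overlook: nothing looks wrong until a name comparison fails.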
Convmv made my transition to Unicode as painless as possible. It converted all my files while I was making a cup of coffee, giving me plenty of time to play with the new version of Ubuntu.