July 22, 2008

Linux tools to convert file formats

Author: Federico Kereki

Life would be a lot easier if we could live in a Linux-only world where applications never required data from other sources. In reality, the need to get data from Windows, MS-DOS, or old Macintosh systems is all too common, and the import process requires some conversion to smooth over file format differences; otherwise, it would be impossible to share data, or file contents would be imported incorrectly. The easiest way to transfer data between systems is with plain text files or common formats like comma-separated values (CSV) files. However, even such files from Windows or Mac OS differ from their Linux counterparts in two ways: the newline characters and the character encoding. This article explains why these problems exist and shows ways to solve them.

The newline problem

Every operating system uses a special character (or sequence of characters) to signify the end of a line of text. They cannot use standard, common characters to represent the line end, because those could appear in normal text, so they use special, nonprinting characters -- but each operating system uses different ones:

  • Linux and Mac OS X inherited the Unix style of using LF (line feed, an ASCII control character) at the end of each line.
  • Older Macintosh systems use CR (carriage return, another ASCII control character).
  • Windows uses a pair of characters -- both a CR and an LF.

To check what a particular text file uses to indicate a new line, try hexdump, which lets you inspect the contents of a file at the byte level. I prepared two three-line files -- one on a Linux system and one on a Windows machine -- and dumped their contents. With the -cb options, hexdump shows the file 16 bytes at a time, printing each byte twice: once as a character and once as its octal value. Notice that the Linux file has a \n character at the end of each line, while the Windows version uses \r and \n. An older Macintosh file would have used a single \r character instead.

> cat test.linux
This is the first line of a Linux file.
This is the second line.
Here's the last line.

> hexdump -cb test.linux
0000000 T h i s i s t h e f i r s
0000000 124 150 151 163 040 151 163 040 164 150 145 040 146 151 162 163
0000010 t l i n e o f a L i n u
0000010 164 040 154 151 156 145 040 157 146 040 141 040 114 151 156 165
0000020 x f i l e . \n T h i s i s
0000020 170 040 146 151 154 145 056 012 124 150 151 163 040 151 163 040
0000030 t h e s e c o n d l i n e .
0000030 164 150 145 040 163 145 143 157 156 144 040 154 151 156 145 056
0000040 \n H e r e ' s t h e l a s t
0000040 012 110 145 162 145 047 163 040 164 150 145 040 154 141 163 164
0000050 l i n e . \n
0000050 040 154 151 156 145 056 012
0000057

> cat test.windows
This is the first line on a Windows file.
This is the second line.
Here's the last line.

> hexdump -cb test.windows
0000000 T h i s i s t h e f i r s
0000000 124 150 151 163 040 151 163 040 164 150 145 040 146 151 162 163
0000010 t l i n e o n a W i n d
0000010 164 040 154 151 156 145 040 157 156 040 141 040 127 151 156 144
0000020 o w s f i l e . \r \n T h i s
0000020 157 167 163 040 146 151 154 145 056 015 012 124 150 151 163 040
0000030 i s t h e s e c o n d l i
0000030 151 163 040 164 150 145 040 163 145 143 157 156 144 040 154 151
0000040 n e . \r \n H e r e ' s t h e
0000040 156 145 056 015 012 110 145 162 145 047 163 040 164 150 145 040
0000050 l a s t l i n e . \r \n
0000050 154 141 163 164 040 154 151 156 145 056 015 012
000005c

To convert files from Windows to Linux, you can use the appropriately titled dos2unix command. The simplest way to convert test.windows to the Linux format would be with dos2unix test.windows, but you can also use the command in stream fashion -- for example, dos2unix <test.windows >test.windows.fixed. Check all possible options with dos2unix -h or man dos2unix.
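If dos2unix isn't installed, GNU sed can do the same job by deleting the carriage return at the end of each line. A minimal sketch (the file names here are just for illustration):

```shell
# Create a small file with Windows (CR+LF) line endings.
printf 'first line\r\nsecond line\r\n' > sample.windows

# GNU sed understands \r; delete a trailing CR from every line.
sed 's/\r$//' sample.windows > sample.linux

# od -c confirms that only \n remains at each line end.
od -c sample.linux
```

Note that \r in the sed expression is a GNU extension; on other systems you may need to type a literal carriage return instead.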

An old-fashioned Macintosh text file requires changing CR characters to LF ones, so you could use the tr (translate) command: tr "\015" "\012" <anOldMacintoshFile >theNewLinuxFile simply changes each CR (octal 015) into an LF (octal 012). With tr, you can also use the -d (delete) option to remove the CR characters from a Windows file, giving you a valid Linux file. The dos2unix conversion shown in the previous paragraph could also be done with tr -d "\015" <test.windows >test.windows.fixed, and the results would be identical.
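Both tr conversions can be tried end to end with throwaway files; the names below are only placeholders:

```shell
# Fake an old Macintosh file: each line ends in a bare CR (octal 015).
printf 'first line\rsecond line\r' > oldmac.txt

# Translate every CR into an LF, producing a Unix-style file.
tr '\015' '\012' < oldmac.txt > frommac.txt

# Fake a Windows file (CR+LF) and simply delete the CRs.
printf 'first line\r\nsecond line\r\n' > win.txt
tr -d '\015' < win.txt > fromwin.txt

# Both results are now identical Linux-style text files.
cmp frommac.txt fromwin.txt && echo "identical"
```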

The encoding problem

English and other languages include some special typographical characters in addition to the normal 26 letters. Have you ever watched the movie Æon Flux or sent in a curriculum vitæ? If you have to import German text, you'll find lots of vowels with umlauts, and some ß characters as well. Spanish adds ñ and acute accents to the mix, and French has grave and circumflex accents ("pie à la mode," anybody?). (If you need to type these characters on a standard English keyboard, check out our article on how to customize your keyboard.)

You don't need Unicode for standard English: you can do perfectly well with ASCII characters, which include the plain unaccented letters from A to Z, the digits from 0 to 9, and some punctuation characters. If you deal only with the Latin alphabets used in Western European languages, you might also get files encoded in ISO 8859-1 (informally known as Latin-1), which uses just a single byte per character at the cost of not being able to represent other languages. However, many languages need a wider character set and require Unicode, which supports more than 100,000 characters across dozens of scripts. Unicode (kept in sync with the ISO 10646 standard) extends ASCII, but where ASCII requires one byte per character, Unicode may require several. To keep compatibility between Unicode and ASCII, the UTF-8 character encoding is generally used. UTF-8 uses a single byte for ASCII characters (so any ASCII file is automatically a valid UTF-8 file) and two to four bytes per character for accented letters and other symbols.
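You can observe the variable-length encoding directly with wc -c, which counts bytes. Here the UTF-8 bytes for ñ are written out as octal escapes so the example doesn't depend on your terminal's encoding:

```shell
# ñ (U+00F1) encodes to two bytes in UTF-8: octal 303 261.
printf '\303\261' | wc -c    # two bytes for one character

# A plain ASCII letter stays a single byte in UTF-8.
printf 'n' | wc -c           # one byte
```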

While UTF-8 and Latin-1 do use the same representation for ASCII characters, they differ for other characters; therefore, in order to process the file correctly, you must know its format, or all non-ASCII characters might end up garbled.

Check out the following files, with the same Spanish text ("¡Que la Fuerza te acompañe!", or "May the Force be with you!") in both formats:

> hexdump -cb test.force.utf8
0000000 302 241 Q u e l a F u e r z a
0000000 302 241 121 165 145 040 154 141 040 106 165 145 162 172 141 040
0000010 t e a c o m p a 303 261 e ! \n
0000010 164 145 040 141 143 157 155 160 141 303 261 145 041 012
000001e

> hexdump -cb test.force.latin1
0000000 241 Q u e l a F u e r z a t
0000000 241 121 165 145 040 154 141 040 106 165 145 162 172 141 040 164
0000010 e a c o m p a 361 e ! \n
0000010 145 040 141 143 157 155 160 141 361 145 041 012
000001c

If you check the foreign characters (the inverted exclamation sign at the beginning and the ñ near the end), you can verify that UTF-8 uses two bytes for each, while Latin-1 requires just one. Also, note that "normal" ASCII characters are the same in both formats.

Determining what format and encoding you require depends on your particular application. While most Linux programs work with UTF-8, many others use Latin-1, and some are even able to use both. You first need to learn what your application expects, and then convert the text file to that format, if necessary. Fortunately, you can easily accomplish the required translation in both directions (either from or to UTF-8) by using the recode command.

recode offers many options (use recode --help or info recode for a more thorough description), and you can use the program to convert to and from many different formats (try recode -l to get the list of supported formats). For simple conversions, though, it's enough to run recode UTF-8..ISO-8859-1 test.force.utf8 to get a Latin-1 version, or recode ISO-8859-1..UTF-8 test.force.latin1 to go the other way; note that recode overwrites the file in place. Depending on the specific conversion you need, the newline problem might be taken care of automatically, but check the documentation (or give it a try and see what happens) for each specific case.
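If recode isn't available, iconv (shipped with the GNU C library) handles the same simple conversions: -f names the source encoding and -t the target, and the result goes to standard output. A sketch with illustrative file names, building the sample text from its UTF-8 octal bytes:

```shell
# Build a tiny UTF-8 file: 302 241 is the inverted exclamation
# sign, 303 261 is the n-with-tilde (both octal).
printf '\302\241Que la Fuerza te acompa\303\261e!\n' > force.utf8

# Convert to Latin-1 (one byte per accented character) and back.
iconv -f UTF-8 -t ISO-8859-1 force.utf8 > force.latin1
iconv -f ISO-8859-1 -t UTF-8 force.latin1 > force.roundtrip

# A lossless round trip reproduces the original bytes.
cmp force.utf8 force.roundtrip && echo "round trip OK"
```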

In conclusion

Processing text files from other operating systems is not always straightforward, but Linux provides the tools to make the job easy. No matter what format a file arrives in, you can automate the required conversion steps and deal with the inconvenience of incompatible formats.

Categories:

  • Tools & Utilities
  • Desktop Software