Linux.com

iconv and sed help

Link to this post 21 Feb 11

Hi,

I have a UTF-8 file that I need to convert to ISO-8859-1.

The UTF-8 file contains characters such as å, ä and ö, and I don't want these characters.

So I apply this sed command:


$ sed "s/å/aa/g; s/ä/aaa/g; s/ö/ooo/g" utf8.txt > output.txt

Now when I view this file, the characters å, ä and ö are gone.

Then I use the iconv command to convert that UTF-8 file (output.txt) to ISO-8859-1:


$ iconv -c -f UTF-8 -t ISO-8859-1 < output.txt > newfile

But when I check the file type with the file command, it reports ASCII rather than ISO-8859-1:


$ file newfile
newfile: ASCII text, with CRLF line terminators

I don't understand what went wrong. I have also attached that UTF-8 file with this post.

Please help.

usmangt [file name=utf8.txt size=1010]http://www.linux.com/media/kunena/attachments/legacy/files/utf8.txt[/file]

Link to this post 21 Feb 11

I went through your exact procedure on Slackware 13.1, and my output file shows as:
ut3.txt: ISO-8859 text, with very long lines

The way the data is read and displayed may be controlled by a deeper configuration within your OS. Can you share which distro you use, so that those familiar with it can tell you where those settings are?

Link to this post 21 Feb 11

I am using Fedora 13.

Link to this post 22 Feb 11

Hi,

I am so sorry, I attached the wrong file (both have the same name but are in different folders on my machine).

This is the one which is causing the problem.

Link to this post 22 Feb 11

Here is the file.

I don't know why the name became so long when uploading.

[file name=utf8-7a6351909c73ba4a81575d6ad10cf46f.txt size=1131]http://www.linux.com/media/kunena/attachments/legacy/files/utf8-7a6351909c73ba4a81575d6ad10cf46f.txt[/file]

Link to this post 23 Feb 11

Now that I have processed your original file, I am getting the same issue. It appears that something is different between the files.

The two files are very different. I have combined your commands into:

sed "s/å/aa/g; s/ä/aaa/g; s/ö/ooo/g" utf8.txt|iconv -c -f UTF-8 -t ISO-8859-1 -o out.txt

When I ran that command against both files, I got the following output:

matt:~/Desktop$rm *.txt.txt;for i in `ls|grep utf|grep -v "txt\.txt"`;do sed "s/å/aa/g; s/ä/aaa/g; s/ö/ooo/g" $i|iconv -c -f UTF-8 -t ISO-8859-1 -o $i.txt ;file $i;file $i.txt;done
utf8.txt: UTF-8 Unicode text, with very long lines, with CRLF line terminators
utf8.txt.txt: ISO-8859 text, with very long lines, with CRLF line terminators
utf82.txt: UTF-8 Unicode text
utf82.txt.txt: ASCII text

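Based upon the output, the second file seems to contain no non-ASCII characters other than å, ä and ö, so once sed has replaced those, every remaining byte is plain ASCII. ASCII is a subset of ISO-8859-1, which means the converted file is still valid ISO-8859-1 and file simply reports the narrower type. The line terminators are not the cause: iconv passes them through unchanged in both files.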
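A small experiment may make the difference between the two results easier to see. ASCII is the 7-bit subset of ISO-8859-1, so a converted file that contains no byte above 0x7F is valid as both. The sketch below is only an illustration under stated assumptions: the two sample strings are made up (they are not the attached files), `count_high_bytes` is a hypothetical helper, and only the å substitution from the thread is included for brevity.

```shell
#!/bin/sh
# Hypothetical helper: count the bytes >= 0x80 (outside 7-bit ASCII) in a file.
count_high_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

tmp=$(mktemp -d)

# Made-up UTF-8 samples with CRLF line endings, written with octal escapes
# so this script itself stays pure ASCII: \303\245 is å, \303\251 is é.
printf 'bl\303\245 himmel\r\n' > "$tmp/only_aring.txt"
printf 'bl\303\245 caf\303\251\r\n' > "$tmp/with_eacute.txt"

# The sed expression s/å/aa/g, built with printf for the same reason.
subst=$(printf 's/\303\245/aa/g')

for f in "$tmp/only_aring.txt" "$tmp/with_eacute.txt"; do
    # Same two steps as in the thread: substitute first, then convert.
    sed "$subst" "$f" | iconv -c -f UTF-8 -t ISO-8859-1 > "$f.latin1"
    echo "$(basename "$f"): $(count_high_bytes "$f.latin1") non-ASCII byte(s) after conversion"
done
```

The first sample should end up with 0 non-ASCII bytes and the second with 1 (the é, which sed did not touch and iconv converted to the single byte 0xE9). Running file on the two .latin1 results would be expected to report ASCII for the first and ISO-8859 for the second, which matches the pattern seen with utf82.txt and utf8.txt.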
