sed behaving strangely in a UTF-8 environment (like SuSE LINUX 9.1)

Your problem

You are using a Linux distribution with UTF-8 encoding, such as SuSE 9.1, and you are using sed to operate on files containing German umlauts or other non-ASCII characters. sed is behaving quite strangely: an expression like

sed 's/.*/x/'

should normally replace an arbitrary line with a single x. The dot, however, no longer matches non-ASCII characters! Only the part of the line before the first such character is replaced.

The reason

The problem occurs when you operate on ISO-8859 (Latin) encoded files. In such a file, a non-ASCII character is a single byte with the high bit set. In a UTF-8 environment that byte is misinterpreted as part of a multi-byte sequence or, even worse, as an invalid UTF-8 string. So sed classifies it as something that is not matched by a dot. Strange and dangerous...
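
To make this concrete, it can help to look at the raw bytes. The following is a minimal sketch, assuming a shell with printf, od, tr and iconv available; the sample word "Bär" and the variable names are only illustrative and do not come from the original tip. It shows that the umlaut is the single byte 0xE4 in latin1 but the two bytes 0xC3 0xA4 in UTF-8, and then asks iconv whether the latin1 bytes are acceptable as UTF-8.

```shell
# 'ä' in ISO-8859-1 (latin1) is the single byte 0xE4 (octal 344).
latin1_bytes=$(printf '\344' | od -An -tx1 | tr -d ' \n')

# The same character converted to UTF-8 takes two bytes.
utf8_bytes=$(printf '\344' | iconv -f latin1 -t utf-8 | od -An -tx1 | tr -d ' \n')

echo "latin1: $latin1_bytes"
echo "utf-8:  $utf8_bytes"

# Feed the latin1 word "Bär" to iconv as if it were UTF-8.
# The byte 0xE4 followed by a plain ASCII letter is not a valid
# UTF-8 sequence, so iconv is expected to reject the input.
if printf 'B\344r\n' | iconv -f utf-8 -t utf-8 >/dev/null 2>&1; then
    echo "input is valid UTF-8"
else
    echo "input is not valid UTF-8"
fi
```

A tool that decodes its input as UTF-8, as sed does in a UTF-8 locale, runs into the same invalid byte.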

Solution

Converting your system back from UTF-8 to ISO-8859 does not seem to be a good solution: a problem similar to the one above would then occur when you operate on UTF-8 files. It is better to use iconv to convert the data on the fly:

> iconv -f latin1 -t utf-8 sourcefile | sed 's/.*/x/' | iconv -f utf-8 -t latin1
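
If you need this more than once, the pipeline can be wrapped in a small shell function. This is only a sketch: the name sed_latin1 is made up for illustration, and it assumes the input arrives on standard input and is really latin1 encoded. All arguments are passed on to sed unchanged.

```shell
# sed_latin1 (hypothetical helper name): run sed on latin1 data in a
# UTF-8 environment by converting to UTF-8 first and back afterwards.
sed_latin1() {
    iconv -f latin1 -t utf-8 | sed "$@" | iconv -f utf-8 -t latin1
}

# Example: "Grüße" encoded in latin1 (ü = octal 374, ß = octal 337).
result=$(printf 'Gr\374\337e\n' | sed_latin1 's/.*/x/')
echo "$result"
```

With the conversion in place, the dot matches the umlauts again and the whole line is replaced by a single x.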

Keywords: utf8 sed characterset regex regular expression streameditor stream editor suse suse91 german umlaut umlaute latin1 iso8859

Author: Mathias Kettner
