Plaintext Is Not So Plain

One of the banes of my current existence is having to deal with text encodings. I have a good friend who uses a few antique Windows programs that puke out Windows-1252 encoded characters as though they were going out of style. Much to my surprise, when I opened up a CSV file, made a few changes, and saved it, I ended up with a bunch of Unicode replacement characters: “�”, otherwise known as U+FFFD.

I thought I needed a brain replacement instead of a replacement character. I proceeded to guess that the CSV I was working with was encoded in Windows-1252 (aka CP1252). I then needed an easy way to convert those characters in a CSV to UTF-8, using PHP if at all possible, since the content would be used on the web.
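A quick way to test a guess like that is to check whether the raw bytes are valid UTF-8 and, if not, see what they look like when decoded as Windows-1252. This is only a sketch: it assumes the mbstring extension is loaded, and the sample string is made up for illustration.

```php
<?php

// "\x93" and "\x94" are the CP1252 bytes for curly double quotes.
$raw = "\x93Hello\x94";

// A lone 0x93 byte is not a valid UTF-8 sequence, so this is false.
$isUtf8 = mb_check_encoding($raw, 'UTF-8');

// Decode the bytes as Windows-1252 and re-encode them as UTF-8.
$converted = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

var_dump($isUtf8);      // bool(false)
echo $converted, "\n";  // “Hello”
```

If the converted text reads sensibly, the guess was probably right. Bytes in a single-byte encoding rarely happen to form valid UTF-8, which is why the validity check is a useful first hint.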

I tried a couple of different ways to do it. The problem was that I had opened the file in LibreOffice first and then saved it, which replaced the unknown characters with the Unicode replacement character. The original bytes were gone, so none of the conversion methods I tried could recover them. After playing ring-around-the-rosy for a while, I figured the whole mess out, but not before much wailing and gnashing of teeth.
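To avoid chasing a file that is already damaged, it can help to look for U+FFFD before attempting any conversion. The helper name below is hypothetical and the sample strings are invented; it is a minimal sketch, not a full validator.

```php
<?php

/**
 * Hypothetical helper: returns true if the text already contains U+FFFD.
 */
function has_replacement_chars(string $text): bool
{
    // U+FFFD is encoded in UTF-8 as the three bytes EF BF BD.
    return strpos($text, "\xEF\xBF\xBD") !== false;
}

$damaged = "caf\xEF\xBF\xBD"; // "caf" followed by U+FFFD: the original byte is lost
$intact  = "caf\xE9";         // "caf" followed by the CP1252 byte for é

var_dump(has_replacement_chars($damaged)); // bool(true)
var_dump(has_replacement_chars($intact));  // bool(false)
```

If the check comes back true, the only real fix is to go back to an untouched copy of the original file.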

This page helped me to decipher the differences between ISO-8859-1, ISO-8859-15, and Windows-1252, which led down the rabbit trail of needing to replace a few characters to end up at UTF-8.

Well, it turns out that replacement is perhaps not the best solution. You should convert from Windows-1252 straight to UTF-8, whether the source text is labeled ISO-8859-1 or Windows-1252 (see this post).
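The reason is easier to see with a couple of bytes. The sketch below (again assuming mbstring) converts the same byte from both encodings: 0x80 comes out as a euro sign from Windows-1252 but as an invisible control character from ISO-8859-1, while a byte outside 0x80 through 0x9F gives the same result either way.

```php
<?php

// 0x80 is the euro sign in Windows-1252 but a C1 control character in ISO-8859-1.
$euroFrom1252 = mb_convert_encoding("\x80", 'UTF-8', 'Windows-1252');
$ctrlFrom8859 = mb_convert_encoding("\x80", 'UTF-8', 'ISO-8859-1');

// 0xE9 (é) sits outside 0x80-0x9F, so both encodings agree on it.
$eFrom1252 = mb_convert_encoding("\xE9", 'UTF-8', 'Windows-1252');
$eFrom8859 = mb_convert_encoding("\xE9", 'UTF-8', 'ISO-8859-1');

echo bin2hex($euroFrom1252), "\n";   // e282ac (U+20AC)
echo bin2hex($ctrlFrom8859), "\n";   // c280 (U+0080)
var_dump($eFrom1252 === $eFrom8859); // bool(true)
```

Since real text almost never contains C1 control characters on purpose, treating everything as Windows-1252 usually loses nothing and rescues the curly quotes, dashes, and euro signs.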

The code below is the solution that I ended up using to convert Windows-1252 to UTF-8. It could be prettier, but it did the job, and it was code I didn’t have to write thanks to this post on StackOverflow.

<?php

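/**
 * Convert a Windows-1252 (or ISO-8859-1) string to UTF-8.
 *
 * $default is returned for null or empty input. $replace acts only as a
 * flag: when it is non-empty, HTML entities in the result are decoded.
 */
function convert_cp1252_to_utf8($input, $default = '', $replace = array()) {
    if ($input === null || $input == '') {
        return $default;
    }

    // Strict detection, limited to the two single-byte encodings expected here.
    $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
    if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
        /*
         * ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F,
         * which ISO-8859-1 reserves for control characters, so always
         * convert from Windows-1252 to UTF-8.
         */
        // The //IGNORE suffix drops any byte that cannot be converted.
        $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
        if (count($replace) != 0) {
            $input = html_entity_decode($input);
        }
    }
    return $input;
}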
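
For a whole CSV file, the same idea can be applied to the file contents in one go. The sketch below is not the function above: it swaps in mb_convert_encoding for iconv, skips detection entirely, and assumes the whole file really is Windows-1252. The helper name and the file names are hypothetical.

```php
<?php

/**
 * Hypothetical helper: read a CP1252 file and write a UTF-8 copy.
 * Returns false if the source cannot be read or the copy cannot be written.
 */
function convert_csv_file_to_utf8(string $source, string $target): bool
{
    $bytes = file_get_contents($source);
    if ($bytes === false) {
        return false;
    }

    // Assumes the whole file is Windows-1252; no detection is attempted.
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

    return file_put_contents($target, $utf8) !== false;
}

// Usage with placeholder file names:
// convert_csv_file_to_utf8('export-cp1252.csv', 'export-utf8.csv');
```

Writing to a separate target file keeps the original bytes around, which matters because a bad conversion cannot be undone once the source has been overwritten.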

Here are a few handy links for reading up on encoding types:

  • Joel Spolsky’s Primer on Unicode and Encodings
  • https://en.wikipedia.org/wiki/UTF-8
  • https://en.wikipedia.org/wiki/ISO/IEC_8859-1
  • https://en.wikipedia.org/wiki/Windows-1252
  • http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
  • https://en.wikipedia.org/wiki/ISO/IEC_8859-15

Written on September 25, 2016