Plaintext Is Not So Plain
One of the banes of my current existence is having to deal with text encodings. I have a good friend who uses a few antique Windows programs that puke out Windows-1252 encoded characters as though they were going out of style. Much to my surprise, when I opened up a CSV file, made a few changes, and saved it, I ended up with a bunch of Unicode replacement characters: "�", otherwise known as U+FFFD.
I thought I needed a brain replacement instead of a replacement character. I guessed that the CSV I was working with was encoded in Windows-1252 (aka CP1252). I then needed an easy way to convert those characters in a CSV to UTF-8, using PHP if at all possible since the content would be used on the web.
I tried a couple of different ways to do it. The problem was that I had opened the file in LibreOffice first and then saved it, which replaces the unknown characters with the Unicode replacement character. Once that happens the original bytes are gone, so none of the conversion methods I tried could recover them. After playing ring-around-the-rosy for a while, I figured the whole mess out, but not before much wailing and gnashing of teeth.
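To make that failure mode concrete, here is a minimal sketch of a check for whether the damage has already been done. The helper name `has_replacement_chars` is made up for illustration; the only fact it relies on is that U+FFFD is stored in UTF-8 as the three bytes EF BF BD.

```php
<?php
// U+FFFD encoded as UTF-8 is the three bytes EF BF BD.
const REPLACEMENT_CHAR = "\xEF\xBF\xBD";

/**
 * Hypothetical helper: report whether a string already contains U+FFFD,
 * which means the original bytes were discarded by an earlier save.
 */
function has_replacement_chars(string $data): bool {
    return strpos($data, REPLACEMENT_CHAR) !== false;
}

// Untouched Windows-1252 bytes: 0x93 and 0x94 are curly double quotes.
$original = "\x93hello\x94";
// Roughly what ends up on disk after the bytes were misread and re-saved.
$damaged = REPLACEMENT_CHAR . "hello" . REPLACEMENT_CHAR;

var_dump(has_replacement_chars($original)); // bool(false)
var_dump(has_replacement_chars($damaged));  // bool(true)
```

If the check comes back true, no conversion will bring the characters back; the only real fix is to go back to a copy of the file that has not been re-saved.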
This page helped me to decipher the differences between ISO-8859-1, ISO-8859-15, and Windows-1252, which led me down the rabbit trail of needing to replace a few characters to end up at UTF-8.
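The differences are easiest to see on a couple of individual bytes. The sketch below is only an illustration, not part of the original solution: it assumes the mbstring extension is available, and `utf8_hex_of` is a made-up helper name. It decodes the bytes 0x80 and 0xA4 under each of the three encodings and prints the resulting UTF-8 bytes in hex.

```php
<?php
/**
 * Hypothetical helper: decode one byte under the given encoding and
 * return the hex of its UTF-8 form.
 */
function utf8_hex_of(string $byte, string $encoding): string {
    return bin2hex(mb_convert_encoding($byte, 'UTF-8', $encoding));
}

// Compare the same two bytes under each candidate encoding.
foreach (array("\x80", "\xA4") as $byte) {
    foreach (array('ISO-8859-1', 'ISO-8859-15', 'Windows-1252') as $encoding) {
        echo '0x', strtoupper(bin2hex($byte)), ' as ', $encoding,
            ' => ', utf8_hex_of($byte, $encoding), "\n";
    }
}
```

Under Windows-1252, byte 0x80 should come out as e282ac, the euro sign. ISO-8859-15 puts the euro sign at 0xA4 instead, and ISO-8859-1 has no euro sign at all, so the same file can look subtly different depending on which label you trust.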
Well, it turns out that replacement is perhaps not the best solution. Instead, you should convert from Windows-1252 straight to UTF-8, whether the source is labeled ISO-8859-1 or Windows-1252 (see this post).
Below is the solution that I ended up using to convert Windows-1252 to UTF-8. It could be prettier, but it did the job, and it was code I didn't have to write thanks to this post on Stack Overflow.
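<?php
/**
 * Convert a string that is assumed to be Windows-1252 (or ISO-8859-1) to UTF-8.
 *
 * @param string|null $input   Raw text to convert.
 * @param string      $default Value returned when $input is null or empty.
 * @param array       $replace When non-empty, HTML entities are decoded after conversion.
 */
function convert_cp1252_to_utf8($input, $default = '', $replace = array()) {
    if ($input === null || $input == '') {
        return $default;
    }
    // Strict detection against the two single-byte candidates.
    $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
    if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
        /*
         * Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
         * (C1 control characters in ISO-8859-1), always convert from Windows-1252 to UTF-8.
         */
        $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
        if (count($replace) != 0) {
            $input = html_entity_decode($input);
        }
    }
    return $input;
}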
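For a whole CSV file rather than a single string, something along these lines might do. This is a rough sketch under a few assumptions, not the code from the Stack Overflow post: the helper name `convert_file_to_utf8` is made up, it uses `mb_convert_encoding` instead of `iconv`, and it assumes any file that is not already valid UTF-8 is Windows-1252.

```php
<?php
/**
 * Hypothetical helper: read a file assumed to be Windows-1252 and write a
 * UTF-8 copy. Content that is already valid UTF-8 is copied unchanged so
 * that it does not get encoded twice.
 */
function convert_file_to_utf8(string $source, string $target): bool {
    $raw = file_get_contents($source);
    if ($raw === false) {
        return false; // Source could not be read.
    }
    $utf8 = mb_check_encoding($raw, 'UTF-8')
        ? $raw
        : mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
    return file_put_contents($target, $utf8) !== false;
}

// Usage with throwaway temp files: 0x93 and 0x94 are curly quotes in CP1252.
$in = tempnam(sys_get_temp_dir(), 'csv');
$out = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($in, "name,quote\nBob,\x93hi\x94\n");

var_dump(convert_file_to_utf8($in, $out)); // bool(true)
echo bin2hex(file_get_contents($out)), "\n";

unlink($in);
unlink($out);
```

The important part is to run this on the original file, before any editor has had a chance to open and re-save it.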
Here are a few handy links for reading up on encoding types: