One of the banes of my current existence is having to deal with text encodings. I have a good friend who uses a few antique Windows programs that puke out Windows-1252 encoded characters as though they were going out of style. Much to my surprise, when I opened up a CSV file, made a few changes, and saved it, I ended up with a bunch of Unicode replacement characters: “�”, otherwise known as U+FFFD.

I thought I needed a brain replacement instead of a replacement character. I proceeded to guess that the CSV I was working with was encoded in Windows-1252 (aka CP1252). I then needed an easy way to convert those characters in a CSV to UTF-8, using PHP if at all possible, since the content would be used on the web.

I tried a couple of different ways to do it. The problem was that I had opened the file in LibreOffice first and then saved it, which replaced the unknown characters with the Unicode replacement character. Once that happens the original bytes are gone, so none of the conversion methods I tried could bring them back. After playing ring-around-the-rosy for a while, I figured the whole mess out, but not without much wailing and gnashing of teeth.
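A quick way to tell whether a file has already been damaged like this is to look for the UTF-8 bytes of U+FFFD (EF BF BD) before attempting any conversion. This is only a minimal sketch: the helper name `has_replacement_chars` is made up for illustration, and it assumes the whole file fits in memory.

```php
<?php

/**
 * Hypothetical helper: report whether a byte string already contains
 * U+FFFD, which is encoded in UTF-8 as the bytes EF BF BD.
 */
function has_replacement_chars(string $bytes): bool
{
    return strpos($bytes, "\xEF\xBF\xBD") !== false;
}

// A raw CP1252 byte (0x92, right single quote) can still be converted.
$raw = "don\x92t";
// The same text after an editor swapped that byte for U+FFFD cannot.
$damaged = "don\xEF\xBF\xBDt";

var_dump(has_replacement_chars($raw));     // bool(false)
var_dump(has_replacement_chars($damaged)); // bool(true)
```

If the check comes back true, the only real fix is to go back to an untouched copy of the original file.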

This page helped me to decipher the differences between ISO-8859-1, ISO-8859-15, and Windows-1252, which led me down the rabbit trail of needing to replace a few characters to end up at UTF-8.

Well, it turns out that replacement is perhaps not the best solution. You should convert from Windows-1252 straight to UTF-8 whether the source is ISO-8859-1 or Windows-1252 (see this post).
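To see why treating everything as Windows-1252 is the safer bet, here is a small sketch comparing the two source encodings on a single byte. It assumes the iconv extension is available; the variable names are only for illustration.

```php
<?php

// 0x80 is the euro sign in Windows-1252 but a C1 control character in ISO-8859-1.
$byte = "\x80";

$fromCp1252 = iconv('Windows-1252', 'UTF-8', $byte);
$fromLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);

echo bin2hex($fromCp1252), "\n"; // e282ac (U+20AC, euro sign)
echo bin2hex($fromLatin1), "\n"; // c280 (U+0080, an invisible control character)

// A byte at 0xA0 or above (here 0xE9, "é") converts identically from either encoding.
var_dump(iconv('Windows-1252', 'UTF-8', "\xE9") === iconv('ISO-8859-1', 'UTF-8', "\xE9")); // bool(true)
```

Since real text almost never contains C1 control characters, reading 0x80 through 0x9F as Windows-1252 gives the result people actually meant.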

Below is the solution that I ended up using to convert Windows-1252 to UTF-8. It could be prettier, but it did the job, and it was code I didn’t have to write thanks to this post on StackOverflow.

<?php

/**
 * Convert a Windows-1252 (or ISO-8859-1) string to UTF-8.
 * Returns $default when $input is null or empty.
 */
function convert_cp1252_to_utf8($input, $default = '', $replace = array()) {
    if ($input === null || $input == '') {
        return $default;
    }

    $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
    if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
        /*
         * Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
         * and control characters, always convert from Windows-1252 to UTF-8.
         */
        $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
        // HTML entities are decoded only when the caller passes a non-empty $replace array.
        if (count($replace) != 0) {
            $input = html_entity_decode($input);
        }
    }
    return $input;
}
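For a whole CSV, the same conversion can be run on the raw bytes of the file, without ever opening it in an editor first. The following is a rough sketch rather than the code I used: the helper name `convert_file_cp1252_to_utf8` and the file names are made up, and it assumes the file really is Windows-1252 and small enough to read into memory.

```php
<?php

/**
 * Hypothetical helper: read a CP1252 file as raw bytes and write a UTF-8 copy.
 * Returns false if the file cannot be read or the conversion fails.
 */
function convert_file_cp1252_to_utf8(string $source, string $target): bool
{
    $bytes = file_get_contents($source);
    if ($bytes === false) {
        return false;
    }

    // Same conversion as above, applied to the entire file contents.
    $utf8 = iconv('Windows-1252', 'UTF-8//IGNORE', $bytes);
    if ($utf8 === false) {
        return false;
    }

    return file_put_contents($target, $utf8) !== false;
}

// Example call with made-up file names:
// convert_file_cp1252_to_utf8('export-cp1252.csv', 'export-utf8.csv');
```

Writing to a separate target file keeps the original bytes around in case the encoding guess turns out to be wrong.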


Here are a few handy links for reading up on encoding types: