Whenever you are doing any kind of file shenanigans, you should always explicitly set the expected encoding of the file you are reading from. Most programming languages’ standard file handling libraries fall back to a default encoding taken from the environment, which is often not the encoding the file was actually written in.
Example code
Let me demonstrate the potential problems of the default encoding settings with the following example code:
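A minimal sketch of the kind of script in question, assuming Python 3. The helper name `read_text_with_default_encoding` is hypothetical. It opens a file without naming an encoding, so the decoding depends on whatever default the interpreter derives from its environment (typically the locale).

```python
import locale


def read_text_with_default_encoding(path):
    """Read a text file WITHOUT naming an encoding (the fragile variant)."""
    # No encoding argument: open() falls back to an environment-dependent default.
    with open(path) as handle:
        return handle.read()


# Show which encoding this process would use when none is given explicitly.
print("default encoding:", locale.getpreferredencoding(False))
```

Calling the helper on a UTF-8 file with non-ASCII content gives the expected text only when the reported default is also UTF-8.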
Now when I connect to a server with that code installed via SSH from my Mac, everything works perfectly fine.
But when I connect from a Windows machine to the same server and do the same thing, I end up with a different result.
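The kind of difference involved can be reproduced without SSH. This is a minimal sketch, assuming the working session used a UTF-8 locale and the other one a non-Unicode locale. Latin-1 and ASCII are stand-ins here, and the sample word is made up. The results are printed with `ascii()` so the demo itself does not depend on the terminal's encoding.

```python
# The bytes a UTF-8 encoded file would contain for "Kaese" spelled with a-umlaut.
raw = "K\u00e4se".encode("utf-8")

# Decoded as UTF-8, the text round-trips.
as_utf8 = raw.decode("utf-8")

# Decoded as Latin-1, every byte maps to *some* character, so the result is
# silently wrong instead of failing.
as_latin1 = raw.decode("latin-1")

# Decoded as ASCII, the non-ASCII bytes cannot be mapped at all.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(ascii(as_utf8))               # 'K\xe4se'
print(ascii(as_latin1))             # 'K\xc3\xa4se'
print(type(ascii_error).__name__)   # UnicodeDecodeError
```

The same bytes give three different outcomes, and only the encoding assumed by the reader has changed.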
Behind the scenes
When you connect to a server via SSH, modern clients send environment variables to pass on your local locale settings. You can see this once you enable the debug output of your SSH client (for example with ssh -v in OpenSSH), which reports the environment variables it sends.
If these variables are omitted, you end up with a default locale, which in my case was a non-Unicode version of en_US.
The same problem will emerge if your script is executed as a cron job: the cron daemon normally also ends up with the system's default locale settings.
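To check which locale a process actually received, you can look at the variables themselves. The sketch below assumes the usual POSIX precedence for character handling (LC_ALL, then LC_CTYPE, then LANG). The function name `ctype_locale_from_env` and the sample values are hypothetical.

```python
import locale
import os


def ctype_locale_from_env(env):
    """Return the locale name that governs character encoding, or None.

    Follows the POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
    Empty values count as unset.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return None


# With a forwarded variable (illustrative value, not captured from a real session).
print(ctype_locale_from_env({"LANG": "en_US.UTF-8"}))
# With nothing forwarded, as can happen under cron: no locale is named at all.
print(ctype_locale_from_env({}))
# What this process actually sees, and the encoding Python derives from it.
print(ctype_locale_from_env(os.environ), locale.getpreferredencoding(False))
```

Running the last line once in an interactive SSH session and once from a cron job is a quick way to see whether the two environments differ.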
Explicit encoding to the rescue
When you are reading data from a file, you should always explicitly define the encoding you expect the file to be in.
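A minimal sketch of the explicit variant, assuming Python 3 and files that are expected to be UTF-8. The helper name `read_text` is hypothetical.

```python
def read_text(path, encoding="utf-8"):
    """Read a text file, decoding it with an explicitly named encoding."""
    # The encoding is stated here, so locale settings no longer matter.
    with open(path, encoding=encoding) as handle:
        return handle.read()
```

If the file turns out not to be in the stated encoding, the call raises a `UnicodeDecodeError` (or yields wrong characters for single-byte encodings), and it does so the same way on every machine.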
This way you end up with portable code that executes in a predictable way independent of environment settings. Just make sure to specify the encoding your application uses for file handling in your documentation.
The same concepts apply to other programming languages as well. Java’s java.io.FileReader, for example, was for a long time completely useless for this purpose because it did not allow you to state the encoding manually (constructors that accept a Charset were only added in Java 11). Its older documentation says:
"The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream."
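The layering that the documentation describes, a decoder wrapped around a raw byte stream, can be sketched in Python too. This is an analogy only, not Java code: `io.TextIOWrapper` plays the role of InputStreamReader, the binary file object plays the role of FileInputStream, and the helper name `open_decoded` is hypothetical.

```python
import io


def open_decoded(path, encoding="utf-8"):
    """Wrap a raw byte stream in a decoder with an explicit encoding."""
    byte_stream = open(path, "rb")  # yields bytes only, no decoding yet
    # The wrapper decodes the bytes; closing it also closes byte_stream.
    return io.TextIOWrapper(byte_stream, encoding=encoding)
```

Used as `with open_decoded(path) as reader: text = reader.read()`, it behaves like `open(path, encoding="utf-8")`, with the byte and character layers spelled out.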