Friday, June 29, 2007

UTF8 For Erlang

Update: added tests and corrected a couple of bugs in the 3- and 4-byte encoding lengths.

A friend and I are learning Erlang at more or less the same time (along with a bunch of the 'cool kids' programmers). We were talking about how nicely UTF-8 could be handled in Erlang, given its treatment of strings as lists of integers. I decided to try writing binary -> UTF-32 (to code points, really) and code point -> binary transformation functions as an exercise to get to know the language. I'm still new enough that I have to look up just about every piece of syntax and function the first time I use it.

The UTF-8 encoding follows the explanation in Wikipedia, and the error handling is simplistic.
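
To make the layout concrete before the code: ü is code point U+00FC, which needs more than 7 bits, so it takes the 2-byte form. Its bits fill the y and z slots like this:

110yyyyy 10zzzzzz
11000011 10111100  ->  octets C3 BC

Those are exactly the octal escapes \303\274 that appear in the test data further down.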


-module(utf8).
-export([from_binary/1, to_binary/1]).

%% Given a binary of UTF-8 encoded text, return a UTF-32 String
%% (i.e. each element is a unicode code point).
from_binary(Bin) ->
    decode_binary(Bin, []).

decode_binary(<<>>, Str) ->
    {ok, lists:reverse(Str)};
decode_binary(Bin, Str) ->
    {B1,B2} = split_binary(Bin, 1),
    case B1 of
        %% 0-7F 0zzzzzzz
        <<0:1,Z:7>> ->
            decode_binary(B2, [Z|Str]);

        %% 110yyyyy 10zzzzzz
        <<2#110:3,Y:5>> ->
            {<<2#10:2,Z:6>>,B3} = split_binary(B2, 1),
            U32 = (Y bsl 6) bor Z,
            decode_binary(B3, [U32|Str]);

        %% 1110xxxx 10yyyyyy 10zzzzzz
        <<2#1110:4,X:4>> ->
            {<<2#10:2,Y:6,2#10:2,Z:6>>,B3} = split_binary(B2, 2),
            U32 = (X bsl 12) bor (Y bsl 6) bor Z,
            decode_binary(B3, [U32|Str]);

        %% 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        <<2#11110:5,W:3>> ->
            {<<2#10:2,X:6,2#10:2,Y:6,2#10:2,Z:6>>,B3} = split_binary(B2, 3),
            U32 = (W bsl 18) bor (X bsl 12) bor (Y bsl 6) bor Z,
            decode_binary(B3, [U32|Str]);

        %% any other leading octet is invalid; note that a malformed
        %% continuation octet raises a badmatch exception in the
        %% clauses above
        _ ->
            {bad_octet, B1}
    end.

%% Given a list of unicode code points, return a binary of UTF-8
%% encoded text.
to_binary(Str) ->
    encode_utf32(Str, []).

encode_utf32([], Utf8) ->
    {ok, list_to_binary(lists:reverse(Utf8))};
encode_utf32([U32|Str], Utf8) ->
    if
        %% 0-7F 0zzzzzzz
        U32 < 16#80 ->
            encode_utf32(Str, [U32|Utf8]);

        %% 110yyyyy 10zzzzzz
        U32 < 16#800 ->
            Y = 2#11000000 bor ((U32 band 16#7c0) bsr 6),
            Z = 2#10000000 bor (U32 band 16#3f),
            encode_utf32(Str, [Z,Y|Utf8]);

        %% 1110xxxx 10yyyyyy 10zzzzzz
        U32 < 16#10000 ->
            X = 2#11100000 bor ((U32 band 16#f000) bsr 12),
            Y = 2#10000000 bor ((U32 band 16#fc0) bsr 6),
            Z = 2#10000000 bor (U32 band 16#3f),
            encode_utf32(Str, [Z,Y,X|Utf8]);

        %% 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        U32 < 16#110000 ->
            W = 2#11110000 bor ((U32 band 16#1c0000) bsr 18),
            X = 2#10000000 bor ((U32 band 16#3f000) bsr 12),
            Y = 2#10000000 bor ((U32 band 16#fc0) bsr 6),
            Z = 2#10000000 bor (U32 band 16#3f),
            encode_utf32(Str, [Z,Y,X,W|Utf8]);

        %% past allocated code points
        true ->
            {bad_code_point, U32}
    end.


The bit fiddling looks nastier to me than the equivalent C, but it's reasonable enough. Otherwise, Erlang reminds me of OCaml, which I studied briefly before returning to Ruby, except that Erlang is much nicer to work with thanks to its dynamic typing (which I'm far more used to, having written Ruby for more than four years now).
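
Much of that fiddling could be pushed into the bit syntax itself, by matching the payload bits directly in the function head and reassembling them with a binary-to-binary assignment instead of bsl/bor. Here is a minimal sketch of that idea for just the 1- and 2-byte cases (decode2 is a hypothetical stand-in, not part of the module above):

decode2(<<>>, Str) ->
    {ok, lists:reverse(Str)};
decode2(<<0:1, Z:7, Rest/binary>>, Str) ->
    decode2(Rest, [Z|Str]);
decode2(<<2#110:3, Y:5, 2#10:2, Z:6, Rest/binary>>, Str) ->
    %% pack Y and Z back to back, then read them out as one integer;
    %% this replaces (Y bsl 6) bor Z
    <<U32:16>> = <<0:5, Y:5, Z:6>>,
    decode2(Rest, [U32|Str]).

The head also consumes the continuation octet in the same match, so the split_binary bookkeeping disappears.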

Now let's test the code. I saved it to a file called utf8.erl, started the Erlang shell, and compiled it:


bishop@arachnis:~$ erl
Erlang (BEAM) emulator version 5.5.2 [source] [async-threads:0] [kernel-poll:false]

Eshell V5.5.2 (abort with ^G)
1> c(utf8).
{ok,utf8}
2> Bin = <<"K\303\274\303\237chen  \344\270\255\345\234\213  \360\220\214\260\360\220\214\261\360\220\214\262">>.
<<75,195,188,195,159,99,104,101,110,32,32,228,184,173,229,156,139,32,32,240,144,140,176,240,144,140,177,240,144,...>>
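As a quick sanity check on the input (the choice of text is explained below): the lengths should add up to 9 octets for the German, 2 for the spaces, 6 for the Chinese, 2 more spaces, and 12 for the Gothic, i.e. 9 + 2 + 6 + 2 + 12 = 31 in total:

size(Bin).  %% should return 31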


I initialized the test binary with the UTF-8 encoding of "Küßchen 中國 𐌰𐌱𐌲": the German for a little kiss, the Chinese for China, and the Gothic letters Ahsa, Bairkan, and Giba (cf. ABC). This exercises all four UTF-8 encoding lengths: 1 and 2 bytes for the German, 3 bytes for the Chinese, and 4 bytes for the Gothic. Now to decode the UTF-8 into a list of code points, i.e. an Erlang string:


3> {ok,Str} = utf8:from_binary(Bin).
{ok,[75,252,223,99,104,101,110,32,32,20013,22283,32,32,66352,66353,66354]}
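
The non-ASCII code points are easier to compare against a character table in hex; ~.16B is io:format's base-16 integer directive, and this should print FC 4E2D 10330 (ü, 中, and 𐌰, respectively):

io:format("~.16B ~.16B ~.16B~n", [252, 20013, 66352]).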


By checking a Unicode character reference, you can verify that the code points agree with the UTF-8 input and that the binary is being decoded correctly. Finally, to encode the string back into a binary:


4> {ok,Bin2} = utf8:to_binary(Str).
{ok,<<75,195,188,195,159,99,104,101,110,32,32,228,184,173,229,156,139,32,32,240,144,140,176,240,144,140,177,...>>}
5> Bin =:= Bin2.
true
6>


The string has been re-encoded to the same binary that we started with.
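
Since from_binary and to_binary should be exact inverses on well-formed input, the round trip makes a handy regression test. A minimal sketch of such a test (test/0 is not part of the module above):

%% Round-trip check: decoding and re-encoding must reproduce the
%% original binary exactly; a badmatch exception signals failure.
test() ->
    Bin = <<75,195,188,195,159,99,104,101,110,32,32,228,184,173,
            229,156,139,32,32,240,144,140,176,240,144,140,177,
            240,144,140,178>>,
    {ok, Str} = utf8:from_binary(Bin),
    {ok, Bin} = utf8:to_binary(Str),  %% Bin is already bound, so this asserts equality
    ok.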

This code looks correct to me, discounting the lack of proper error handling. Feel free to use and improve it. If you are a Unicode expert and spot any bugs, I'd love to hear about them.

Have fun!

2 comments:

Unknown said...

This is another version of your program which avoids using shifts and other bitwise operations. It also does not return an error message, but raises an exception when there is something wrong with the input.

Your version is probably faster than mine, but I think my program is interesting since it uses a technique that is hard to replicate in other languages.

I tried to post the code inline in the comment, but I couldn't get it to work.

Matt Bishop said...

Ooh, nice way to use function-argument matching and binary-to-binary assignment. It's always nice to see others' takes on my code, especially since I am new to Erlang's idioms.

I'm not sure what is wrong with inline code, but thanks for trying.