Friday, June 29, 2007

UTF8 For Erlang

Update: added tests and corrected a couple of bugs in the 3- and 4-byte encoding lengths.

A friend and I are learning Erlang at more or less the same time (along with a bunch of the 'cool kid' programmers). We were talking about how nicely UTF-8 could be handled in Erlang, given its treatment of strings as lists of integers. I decided to try writing binary -> UTF-32 (to code points, really) and code point -> binary transformation functions as an exercise to get to know the language. I'm still new and have to look up just about every piece of syntax and function the first time I use it.

The UTF-8 encoding is based on the explanation in Wikipedia and has simplistic error handling.


-module(utf8).
-export([from_binary/1, to_binary/1]).

%% Given a binary of UTF-8 encoded text, return a UTF-32 string
%% (i.e. each element is a Unicode code point).
from_binary(Bin) ->
    decode_binary(Bin, []).

decode_binary(<<>>, Str) ->
    {ok, lists:reverse(Str)};
decode_binary(Bin, Str) ->
    {B1,B2} = split_binary(Bin, 1),
    case B1 of
        %% 0-7F 0zzzzzzz
        <<0:1,Z:7>> ->
            decode_binary(B2, [Z|Str]);

        %% 110yyyyy 10zzzzzz
        <<2#110:3,Y:5>> ->
            {<<2#10:2,Z:6>>,B3} = split_binary(B2, 1),
            U32 = (Y bsl 6) bor Z,
            decode_binary(B3, [U32|Str]);

        %% 1110xxxx 10yyyyyy 10zzzzzz
        <<2#1110:4,X:4>> ->
            {<<2#10:2,Y:6,2#10:2,Z:6>>,B3} = split_binary(B2, 2),
            U32 = (X bsl 12) bor (Y bsl 6) bor Z,
            decode_binary(B3, [U32|Str]);

        %% 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        <<2#11110:5,W:3>> ->
            {<<2#10:2,X:6,2#10:2,Y:6,2#10:2,Z:6>>,B3} = split_binary(B2, 3),
            U32 = (W bsl 18) bor (X bsl 12) bor (Y bsl 6) bor Z,
            decode_binary(B3, [U32|Str]);

        %% a leading octet matching none of the patterns above lands
        %% here; a malformed continuation byte instead raises a match
        %% error in one of the clauses above
        _ ->
            {bad_octet, B1}
    end.

%% Given a list of Unicode code points, return a binary of UTF-8
%% encoded text.
to_binary(Str) ->
    encode_utf32(Str, []).

encode_utf32([], Utf8) ->
    {ok, list_to_binary(lists:reverse(Utf8))};
encode_utf32([U32|Str], Utf8) ->
    if
        %% 0-7F 0zzzzzzz
        U32 < 16#80 ->
            encode_utf32(Str, [U32|Utf8]);

        %% 110yyyyy 10zzzzzz
        U32 < 16#800 ->
            Y = 2#11000000 bor ((U32 band 16#7c0) bsr 6),
            Z = 2#10000000 bor (U32 band 16#3f),
            encode_utf32(Str, [Z, Y | Utf8]);

        %% 1110xxxx 10yyyyyy 10zzzzzz
        U32 < 16#10000 ->
            X = 2#11100000 bor ((U32 band 16#f000) bsr 12),
            Y = 2#10000000 bor ((U32 band 16#fc0) bsr 6),
            Z = 2#10000000 bor (U32 band 16#3f),
            encode_utf32(Str, [Z, Y, X | Utf8]);

        %% 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        U32 < 16#110000 ->
            W = 2#11110000 bor ((U32 band 16#1c0000) bsr 18),
            X = 2#10000000 bor ((U32 band 16#3f000) bsr 12),
            Y = 2#10000000 bor ((U32 band 16#fc0) bsr 6),
            Z = 2#10000000 bor (U32 band 16#3f),
            encode_utf32(Str, [Z, Y, X, W | Utf8]);

        %% past the last Unicode code point
        true ->
            {bad_code_point, U32}
    end.


The bit fiddling looks nastier to me than C syntax, but it's reasonable enough. Otherwise Erlang is reminding me of OCaml, which I studied briefly before returning to Ruby, except that Erlang is much nicer to work with due to its dynamic typing (which I'm far more used to, having written Ruby for more than 4 years now).
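
Erlang's bit syntax can also build the encoded bytes directly, which arguably reads better than masks and shifts. A minimal sketch for the two-byte case (my own illustration, not part of the module above; it builds a small binary rather than the list elements the accumulator expects, and bit syntax truncates each value to its segment width, so no band is needed):


%% hypothetical two-byte encoder using bit syntax: the marker bits
%% and the value bits are packed in one expression, and each segment
%% width truncates its value for us
encode2(U32) when U32 >= 16#80, U32 < 16#800 ->
    <<2#110:3, (U32 bsr 6):5, 2#10:2, U32:6>>.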

Now let's test the code. I saved it to a file called utf8.erl, started the Erlang shell, and compiled it.


bishop@arachnis:~$ erl
Erlang (BEAM) emulator version 5.5.2 [source] [async-threads:0] [kernel-poll:false]

Eshell V5.5.2 (abort with ^G)
1> c(utf8).
{ok,utf8}
2> Bin = <<"K\303\274\303\237chen \344\270\255\345\234\213 \360\220\214\260\360\220\214\261\360\220\214\262">>.
<<75,195,188,195,159,99,104,101,110,32,32,228,184,173,229,156,139,32,32,240,144,140,176,240,144,140,177,240,144,...>>


I also initialized the test binary data with the UTF-8 encoding of "Küßchen 中國 𐌰𐌱𐌲", which is the German for a little kiss, the Chinese for China, and the Gothic letters Ahsa, Bairkan, and Giba (cf. ABC). This covers all the UTF-8 encoding lengths: 1 and 2 bytes for the German, 3 bytes for the Chinese, and 4 bytes for the Gothic. Now to decode the UTF-8 into a list of code points, i.e. an Erlang string:


3> {ok,Str} = utf8:from_binary(Bin).
{ok,[75,252,223,99,104,101,110,32,32,20013,22283,32,32,66352,66353,66354]}
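
As a spot check, 中 is code point 20013 (16#4E2D). Writing 16#4E2D in binary and splitting it into the 3-byte pattern's x, y, and z fields gives 0100 111000 101101, which packs into the octets 2#11100100, 2#10111000, 2#10101101, i.e. 228, 184, 173: exactly the three bytes visible in the binary above.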


By looking at a Unicode character reference you can verify that the UTF-8 bytes and code points agree and that the binary is being decoded correctly. Finally, to encode the string back into a binary:


4> {ok,Bin2} = utf8:to_binary(Str).
{ok,<<75,195,188,195,159,99,104,101,110,32,32,228,184,173,229,156,139,32,32,240,144,140,176,240,144,140,177,...>>}
5> Bin =:= Bin2.
true
6>


The string has been re-encoded to the same binary that we started with.
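
The update at the top mentions tests; here is a minimal sketch of the round-trip style of check I have in mind (the function name roundtrip is my own, and it assumes the utf8 module above is compiled):


%% decode to code points, re-encode, and confirm we get the
%% original bytes back
roundtrip(Bin) ->
    {ok, Str} = utf8:from_binary(Bin),
    {ok, Bin2} = utf8:to_binary(Str),
    Bin =:= Bin2.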

This code looks correct to me, discounting the lack of proper error handling. Feel free to use and improve it. If you are a Unicode expert and see any bugs, I'd love to hear about them.

Have fun!

Wednesday, May 16, 2007

Enumerating Integers

My friend Dave B. came to me with a programming problem today. The end goal is to compute the minimum collection of pennies, nickels, dimes, and quarters you need to make any amount under a dollar. This can be broken up into two major parts, the first of which is enumerating integer combinations. I will arrive at this algorithm in two steps. The first is something I've solved before and forgotten about; the second is a patching-up of the first, which Dave helped me with.

When I visited my Grandmother growing up, I would sleep in a spare bedroom with many unfamiliar objects. The most captivating was this digital clock with a mechanical display instead of LEDs. As time passed, the numbers would flip over like a Rolodex to reveal the next one, finally looping back to 0. When the minutes position looped back to zero, the tens (of minutes) position would flip over. If the tens position flipped back to zero, then the hour position would flip over too.

Right vs. Left Directions in an Array.
When I write out an array, I do so like this:
array: [a, b, c, d]
index:  0  1  2  3

The first row shows the array contents itself and the second row shows the indexes used to access the array elements. The first element a is leftmost, and the last element d is rightmost, and I would say that b is to the left of c.

We can enumerate a 'permutation' of integers in this way. The idea is to have every combination of the integers 0..max appear across the elements of the array. (This is not a true mathematical permutation; since values may repeat, we are really enumerating every tuple in the N-fold Cartesian product of 0..max.) Note that this will enumerate the indexes needed to access all elements of an N-dimensional array without N nested loops; a sketch of this appears after the test below.

First, create an array ary with 3 elements. Then, starting at the rightmost index, increment the element at that index until it overflows (i.e. reaches the limit 2, one past the largest value an element may hold). When it overflows, reset it to 0 (zero) and increment the element to the left of the current one, cascading until the increments stop overflowing. If you need to increment an element past the start of the array (less than index zero), then you are done enumerating. Ruby code implementing this algorithm follows (return can't be used at the top level, so catch/throw breaks out of the nested loops):


ary = [0, 0, 0]
catch(:done) do
  loop do
    k = ary.length - 1
    p ary
    while (ary[k] += 1) == 2  # 2 is the exclusive limit
      ary[k] = 0
      k -= 1
      throw :done if k < 0    # carried past the leftmost element
    end
  end
end


We could easily generalize this to parameterize the number of integers and their min and max values:


def permute_integers(count, min, max)
  ary = [min] * count
  loop do
    k = count - 1
    yield ary
    while (ary[k] += 1) >= max
      ary[k] = min
      return if (k -= 1) < 0
    end
  end
end


Where count is the number of integers in each list, and min and max denote the half-open interval min <= n < max of values each integer in the list can take. A simple test reveals:


irb(main):015:0> permute_integers(2, 1, 4) { |a| p a }
[1, 1]
[1, 2]
[1, 3]
[2, 1]
[2, 2]
[2, 3]
[3, 1]
[3, 2]
[3, 3]
=> nil
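
As a sketch of the earlier claim about N-dimensional arrays (the cube array and the values stored in it are just illustrative):


# Fill every cell of a 3x3x3 nested array with one call to
# permute_integers instead of three nested loops. Ruby splats the
# yielded array across the block parameters i, j, k.
cube = Array.new(3) { Array.new(3) { Array.new(3, 0) } }
permute_integers(3, 0, 3) do |i, j, k|
  cube[i][j][k] = 9 * i + 3 * j + k
end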


Now that's all well and good if we're enumerating multi-dimensional array indexes. For coin counting, we really need a 'combination' because the order of the numbers doesn't matter. This requires a different updating strategy. On each iteration, find the rightmost array element that will not overflow if we increment it. Increment that element and copy its new value to the elements to its right, if any. Modifying permute_integers with this strategy gives:


def combine_integers(count, min, max)
  ary = [min] * count
  loop do
    k = ary.length - 1
    yield ary
    while ary[k] >= max - 1
      return if (k -= 1) < 0
    end
    v = ary[k] + 1
    (k...count).each { |i| ary[i] = v }
  end
end


To see that this works, let's take the above example and delete those integer collections that are mere reorderings of others:


[1, 1]
[1, 2]
[1, 3]
[2, 1] (same as [1, 2])
[2, 2]
[2, 3]
[3, 1] (same as [1, 3])
[3, 2] (same as [2, 3])
[3, 3]



Testing combine_integers with the same parameters as the permute_integers test, we get:


irb(main):013:0> combine_integers(2, 1, 4) { |a| p a }
[1, 1]
[1, 2]
[1, 3]
[2, 2]
[2, 3]
[3, 3]
=> nil


This is exactly the list we created manually. After trying it with three integers as well, I feel assured that the algorithm works. I am not mathematician enough for a real proof, sorry.
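
One caveat if you collect the results rather than printing them: both methods yield the same array object on every iteration and mutate it in place, so copy it as you go (combos here is just an illustrative name):


combos = []
# Without the dup, every stored entry would alias the one mutating array.
combine_integers(2, 1, 4) { |a| combos << a.dup }
p combos  # [[1, 1], [1, 2], [1, 3], [2, 2], [2, 3], [3, 3]]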

Monday, May 14, 2007

Lua: Good or Evil?

I was just checking out Lua as I look for a language for scripting 3d applications. Lua is a neat little language that combines Algol family syntax with the Lisp idea of basic but powerful language constructs. Most exciting are functions and associative arrays. The functions act as closures and can be treated as first-class values. The associative arrays are true to their name, but with a twist in the implementation. Internally, they store values either as an array or a hash table depending on how you index the elements. PHP (associative) 'arrays' done right. I've heard Lua described as a 'fixed Javascript.' And it's speedy.

As I was going through the "Programming in Lua" book, I was severely disappointed to learn that the language and the relevant library functions assume 1-based indexing for arrays, strings, and variadic function parameters. After more than 10 years of C, C++, Ruby, and other 0-based indexing languages, that's just too much to swallow. It's a wonderful little language, but if it's going to make me throw out all my hard-earned 0-based indexing tricks and correctness-reasoning skills, I won't have it.

The Lua wiki shows some controversy on the subject (http://lua-users.org/wiki/CountingFromOne), but no serious moves to change anything. There was a good article by Edsger Dijkstra linked there, "Why numbering should start at zero."