Saturday, October 20, 2007

How I'm Liking Haskell So Far

I've been playing with Haskell in my spare time for the past few months. So far I have not hit any major annoyances, as I did with other languages I've looked at recently (notably Lua, Smalltalk, and Erlang). Haskell's a keeper.

My major work in Haskell so far is a program to compute the 51,000,000,000th letter in the string formed by sorting the words for the numbers 1 through 999,999,999 lexicographically (problem description here), which I found through programming.reddit.com a while back. I was thoroughly impressed by Haskell's ability to represent the number words symbolically and exactly. I defined a set of types that worked much like a grammar, specifying which words could go where. Ultimately I went with a slightly different approach that appealed more to my dynamically typed past, but I am hoping to leave such habits behind as I learn Haskell better.
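
To give a flavor of the approach (this is just an illustrative sketch, not the actual types from my program, and it only covers numbers below one hundred):


-- Sketch of grammar-like types for number words: the types themselves
-- constrain which words may appear in which positions.
data Unit = One | Two | Three | Four | Five | Six | Seven | Eight | Nine
  deriving (Show, Eq, Enum, Bounded)

data Teen = Ten | Eleven | Twelve | Thirteen | Fourteen | Fifteen
          | Sixteen | Seventeen | Eighteen | Nineteen
  deriving (Show, Eq, Enum, Bounded)

data Tens = Twenty | Thirty | Forty | Fifty | Sixty | Seventy | Eighty | Ninety
  deriving (Show, Eq, Enum, Bounded)

-- A number word below one hundred is a unit, a teen, or a tens word
-- optionally followed by a unit ("twentyone", "ninetynine", ...).
data SubHundred = U Unit | T Teen | TU Tens (Maybe Unit)
  deriving (Show, Eq)
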

While working to speed up the word numbers problem, I was pleasantly surprised by the simplicity of the profiler built into the Glasgow Haskell Compiler (GHC). You compile your program with a couple of command-line flags, then pass it a couple more when you run it. The output is a nicely formatted text file showing where your CPU time and memory allocations went. I told GHC to just profile all top-level functions in my short program; if I had wanted, I could have designated specific functions or even expressions for profiling. I found the interface easier to use than GCC's profiling and the output easier to understand than Ruby's profiler.
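
For the record, the whole thing is just a couple of command lines, roughly like the following (the file names here are made up, and the exact flag spellings vary a bit between GHC versions):


$ ghc --make -O2 -prof -auto-all WordNumbers.hs
$ ./WordNumbers +RTS -p -RTS
$ less WordNumbers.prof
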

Haskell has very beautiful syntax. Certainly nicer than OCaml and Erlang, and perhaps tied with Smalltalk as most-beautiful-syntax-I-know. It is clutter free; for example, using \args -> expr for the lambda syntax is simply brilliant. The language has a very 'mathy' feel to it that rubs me the right way. One thing I'm not as keen on is the idea that 'good software engineers' always document the types of their top-level functions. After using Ruby (and thus dynamic typing) for several years, I prefer to let the type inferencer do all that for me. It will be interesting to see how this opinion evolves over time.
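
For a trivial illustration of both points (a throwaway example, not from the word numbers program):


-- No type signatures anywhere; GHC happily infers
-- applyTwice :: (a -> a) -> a -> a on its own.
applyTwice = \f x -> f (f x)

main = print (map (applyTwice (\n -> n + 10)) [1, 2, 3])   -- [21,22,23]
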

To help me get a handle on monads and other weirdness, I picked up The Haskell School of Expression by Paul Hudak. It has been excellent, clearing up holes left in my understanding by Yet Another Haskell Tutorial. I'm not so interested in the multimedia programming the book uses to motivate new language features, but it is done well.

I'm still not entirely comfortable with the way functional languages are taught. In the past I have learned a language by building a mental model of how the compiler might turn code constructs into machine language. This has served me well with C, C++, Java and even Ruby. With Haskell I am still trying to do this, but it gets confusing quickly. It doesn't help that functional languages' computational models are generally based on the lambda calculus rather than Von Neumann imperative stuff. This leads to descriptions of the language's semantics based on value substitution. While this may be an effective reasoning and proving tool, I just don't find it as satisfying as what's ultimately happening at the hardware level. I may have to read some papers on compiling functional languages before I'm happy :/

Haskell has three features that are totally new and weird to me: lazy evaluation, functional purity, and monads. Lazy evaluation means that an expression is not evaluated until its value is needed. Laziness seems really cool: you can define an infinite list of values and use only as many elements as you need without causing an infinite loop. Like closures, it seems to be one of those features that sound weird and not very useful until you've thought about them for a while. Functional purity means that Haskell cannot do I/O in the usual way, since a print statement is side-effectful and doesn't have a sensible return value. The solution is to use these things called monads.
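
The standard toy example (nothing to do with my program, but it gets the idea across):


-- An infinite list of squares; thanks to laziness, only the elements
-- we actually demand are ever computed.
squares :: [Integer]
squares = map (^ 2) [1 ..]

main :: IO ()
main = do
  print (take 5 squares)            -- [1,4,9,16,25]
  print (takeWhile (< 50) squares)  -- [1,4,9,16,25,36,49]
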

Monads are apparently the hardest thing for Haskell newcomers to learn, and that has been the case for me. I basically understand the mechanics of monads thanks to The Haskell School of Expression and the wonderful blog post A Monad Tutorial for Ocaml, but I still don't feel like I've got my head wrapped around them. What's really going on with I/O under the monadic covers is still very unclear to me. But the idea that you can wrap stateful, imperative things up in this monad mechanism and end up with a purely functional result is terribly compelling, even though it feels like magic.
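
To make 'the mechanics' concrete, this is about the extent of my monad comfort zone right now: do-notation for IO, and the same program written directly with >>=, which is all the do block desugars to.


-- do-notation version: read a line, echo it back with a prefix.
echoSugared :: IO ()
echoSugared = do
  line <- getLine
  putStrLn ("you said: " ++ line)

-- The same thing with the bind operator spelled out.
echoDesugared :: IO ()
echoDesugared = getLine >>= \line -> putStrLn ("you said: " ++ line)

main :: IO ()
main = echoSugared >> echoDesugared
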

Haskell has been really fun so far, and I've learned a lot, with more to come. If I stay, it will be for the expressive semantics and beautiful syntax. If I leave for another language, it will be for a more pragmatic (expedient, if dirty) approach, maybe OCaml (though it has plenty of flaws of its own).

Wednesday, September 12, 2007

The Average CS Graduate?

Overheard: "That's what coding is... we're like monkeys...we're like monkeys on typewriters trying every possibility". Very depressing.

Friday, June 29, 2007

UTF8 For Erlang

Update: added tests and corrected a couple of bugs in the 3- and 4-byte encoding lengths.

A friend and I are learning Erlang at more or less the same time (as are a bunch of the 'cool kid' programmers). We were talking about how nicely UTF-8 could be handled in Erlang, given its treatment of strings as lists of integers. I decided to try writing binary -> UTF-32 (to code point, really) and code point -> binary transformation functions as an exercise to get to know the language. I'm still new and have to look up just about every piece of syntax and every function the first time I use it.

The UTF-8 encoding is based on the explanation in Wikipedia and has simplistic error handling.


-module(utf8).
-export([from_binary/1, to_binary/1]).

%% Given a binary of UTF-8 encoded text, return a UTF-32 String
%% (i.e. each element is a unicode code point).
from_binary(Bin) ->
    decode_binary(Bin, []).

decode_binary(<<>>, Str) ->
    {ok, lists:reverse(Str)};
decode_binary(Bin, Str) ->
    {B1, B2} = split_binary(Bin, 1),
    case B1 of
        %% 0-7F 0zzzzzzz
        <<0:1, Z:7>> ->
            decode_binary(B2, [Z|Str]);

        %% 110yyyyy 10zzzzzz
        <<2#110:3, Y:5>> ->
            {<<2#10:2, Z:6>>, B3} = split_binary(B2, 1),
            U32 = (Y bsl 6) bor Z,
            decode_binary(B3, [U32|Str]);

        %% 1110xxxx 10yyyyyy 10zzzzzz
        <<2#1110:4, X:4>> ->
            {<<2#10:2, Y:6, 2#10:2, Z:6>>, B3} = split_binary(B2, 2),
            U32 = (X bsl 12) bor (Y bsl 6) bor Z,
            decode_binary(B3, [U32|Str]);

        %% 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        <<2#11110:5, W:3>> ->
            {<<2#10:2, X:6, 2#10:2, Y:6, 2#10:2, Z:6>>, B3} = split_binary(B2, 3),
            U32 = (W bsl 18) bor (X bsl 12) bor (Y bsl 6) bor Z,
            decode_binary(B3, [U32|Str]);

        %% an invalid leading octet ends up here; a bad continuation
        %% octet will still raise a badmatch error in the clauses above
        _ ->
            {bad_octet, B1}
    end.

%% Given a list of unicode code points, return a binary of UTF-8
%% encoded text.
to_binary(Str) ->
    encode_utf32(Str, []).

encode_utf32([], Utf8) ->
    {ok, list_to_binary(lists:reverse(Utf8))};
encode_utf32([U32|Str], Utf8) ->
    if
        %% 0-7F 0zzzzzzz
        U32 < 16#80 ->
            encode_utf32(Str, [U32|Utf8]);

        %% 110yyyyy 10zzzzzz
        U32 < 16#800 ->
            Y = 2#11000000 bor ((U32 band 16#7c0) bsr 6),
            Z = 2#10000000 bor (U32 band 16#3f),
            encode_utf32(Str, [Z|[Y|Utf8]]);

        %% 1110xxxx 10yyyyyy 10zzzzzz
        U32 < 16#10000 ->
            X = 2#11100000 bor ((U32 band 16#f000) bsr 12),
            Y = 2#10000000 bor ((U32 band 16#fc0) bsr 6),
            Z = 2#10000000 bor (U32 band 16#3f),
            encode_utf32(Str, [Z|[Y|[X|Utf8]]]);

        %% 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        U32 < 16#110000 ->
            W = 2#11110000 bor ((U32 band 16#1c0000) bsr 18),
            X = 2#10000000 bor ((U32 band 16#3f000) bsr 12),
            Y = 2#10000000 bor ((U32 band 16#fc0) bsr 6),
            Z = 2#10000000 bor (U32 band 16#3f),
            encode_utf32(Str, [Z|[Y|[X|[W|Utf8]]]]);

        %% past allocated code points
        true ->
            {bad_code_point, U32}
    end.


The bit fiddling looks nastier to me than C syntax, but it's reasonable enough. Otherwise, Erlang reminds me of OCaml, which I studied briefly before returning to Ruby, only Erlang is much nicer to work with thanks to its dynamic typing (which I'm far more used to, having written Ruby for more than four years now).

Now let's test the code. I saved it to a file called utf8.erl, started the Erlang shell, and compiled it.


bishop@arachnis:~$ erl
Erlang (BEAM) emulator version 5.5.2 [source] [async-threads:0] [kernel-poll:false]

Eshell V5.5.2 (abort with ^G)
1> c(utf8).
{ok,utf8}
2> Bin = <<"K\303\274\303\237chen \344\270\255\345\234\213 \360\220\214\260\360\220\214\261\360\220\214\262">>.
<<75,195,188,195,159,99,104,101,110,32,32,228,184,173,229,156,139,32,32,240,144,140,176,240,144,140,177,240,144,...>>


The test binary Bin holds the UTF-8 encoding of "Küßchen 中國 𐌰𐌱𐌲", which is the German for a little kiss, the Chinese for China, and the Gothic letters Ahsa, Bairkan, and Giba (cf. ABC). This exercises all the UTF-8 encoding lengths: 1 and 2 bytes for the German, 3 bytes for the Chinese, and 4 bytes for the Gothic. Now to decode the UTF-8 into a list of code points, i.e. an Erlang string:


3> {ok,Str} = utf8:from_binary(Bin).
{ok,[75,252,223,99,104,101,110,32,32,20013,22283,32,32,66352,66353,66354]}


By looking at a Unicode character reference, you can verify that the UTF-8 bytes and the code points agree and that the binary is being decoded correctly. Finally, to encode the string back into a binary:


4> {ok,Bin2} = utf8:to_binary(Str).
{ok,<<75,195,188,195,159,99,104,101,110,32,32,228,184,173,229,156,139,32,32,240,144,140,176,240,144,140,177,...>>}
5> Bin =:= Bin2.
true
6>


The string has been re-encoded to the same binary that we started with.

This code looks correct to me, discounting the lack of proper error handling. Feel free to use and improve it. If you are a Unicode expert and see any bugs, I'd love to hear about them.

Have fun!

Wednesday, May 16, 2007

Enumerating Integers

My friend Dave B. came to me with a programming problem today. The end goal is to compute the minimum collection of pennies, nickels, dimes, and quarters you need to make any amount under a dollar. This can be broken into two major parts, the first of which is enumerating integer combinations. I will arrive at this algorithm in two steps. The first is something I've solved before and forgotten about; the second is a patching-up of the first that Dave helped me with.

When I visited my grandmother growing up, I would sleep in a spare bedroom with many unfamiliar objects. The most captivating was a digital clock with a mechanical display instead of LEDs. As time passed, the numbers would flip over like a Rolodex to reveal the next one, finally looping back to 0. When the minutes position looped back to zero, the tens-of-minutes position would flip over. If the tens position flipped back to zero, then the hour position would flip over too.

Right vs. Left Directions in an Array.
When I write out an array, I do so like this:
array: [a,b,c,d]
index: 0 1 2 3

The first row shows the array contents and the second row shows the indexes used to access the array elements. The first element a is leftmost, the last element d is rightmost, and I would say that b is to the left of c.

We can enumerate a 'permutation' of integers in this way. The idea is to run through every assignment of the integers 0 <= n < max to the elements of the array, just as the clock runs through every time of day. (This is not a true mathematical permutation because the collection of numbers we are selecting from has duplicates and is thus not a set.) Note that this enumerates the indexes of all elements of an N-dimensional array without needing N nested loops.

First, create an array ary with 3 elements. Then, starting at the rightmost index, increment the element at that index. When it overflows (i.e. reaches the exclusive upper bound, here 2), reset it to 0 and move one position to the left, incrementing and resetting elements until one of them stops overflowing. If you would need to increment an element past the start of the array (before index zero), you are done enumerating. Ruby code implementing this algorithm follows:


ary = [0, 0, 0]
catch(:done) do              # :done is thrown to escape both loops at once
  loop do
    k = ary.length - 1
    p ary
    while (ary[k] += 1) == 2
      ary[k] = 0
      k -= 1
      throw :done if k < 0  # carried past the leftmost element: all done
    end
  end
end


We could easily generalize this to parameterize the number of integers and their min and max values:


def permute_integers(count, min, max)
  ary = [min] * count
  loop do
    k = count - 1
    yield ary
    while (ary[k] += 1) >= max
      ary[k] = min
      return if (k -= 1) < 0
    end
  end
end


Here count is the number of integers in each list, and min and max denote the half-open interval min <= n < max from which each integer n in the list is drawn. A simple test reveals:


irb(main):015:0> permute_integers(2, 1, 4) { |a| p a }
[1, 1]
[1, 2]
[1, 3]
[2, 1]
[2, 2]
[2, 3]
[3, 1]
[3, 2]
[3, 3]
=> nil


Now that's all well and good if we're enumerating multi-dimensional array indexes. For coin counting we really need a 'combination', because the order of the numbers doesn't matter. This requires a different updating strategy. On each iteration, find the rightmost array element that will not overflow if we increment it (if there is no such element, we are done). Increment that element and copy its new value to the elements to its right, if any. Modifying permute_integers with this strategy gives:


def combine_integers(count, min, max)
  ary = [min] * count
  loop do
    k = ary.length - 1
    yield ary
    while ary[k] >= max - 1
      return if (k -= 1) < 0
    end
    v = ary[k] + 1
    (k...count).each { |i| ary[i] = v }
  end
end


To see that this works, let's take the above example and delete those integer collections that are mere reorderings of others:


[1, 1]
[1, 2]
[1, 3]
[2, 1] (same as [1, 2])
[2, 2]
[2, 3]
[3, 1] (same as [1, 3])
[3, 2] (same as [2, 3])
[3, 3]



Testing combine_integers with the same parameters as the permute_integers test, we get:


irb(main):013:0> combine_integers(2, 1, 4) { |a| p a }
[1, 1]
[1, 2]
[1, 3]
[2, 2]
[2, 3]
[3, 3]
=> nil


This is exactly the list we created manually. After trying the same thing with three integers, I feel assured that the algorithm works. I am not mathematician enough for a real proof, sorry.

Monday, May 14, 2007

Lua: Good or Evil?

I was just checking out Lua as I look for a language for scripting 3d applications. Lua is a neat little language that combines Algol family syntax with the Lisp idea of basic but powerful language constructs. Most exciting are functions and associative arrays. The functions act as closures and can be treated as first-class values. The associative arrays are true to their name, but with a twist in the implementation. Internally, they store values either as an array or a hash table depending on how you index the elements. PHP (associative) 'arrays' done right. I've heard Lua described as a 'fixed Javascript.' And it's speedy.

As I was going through the "Programming in Lua" book I was severely disappointed to learn that the language and relevant library functions assume 1-based indexing for arrays, strings and variadic function parameters. After more than 10 years of C, C++, Ruby and other 0-based indexing languages, that's just too much to swallow. It's a wonderful little language, but if it's going to make me throw out all my hard-earned 0-based indexing tricks and correctness reasoning skills, I won't have it.

The Lua wiki shows some controversy on the subject (http://lua-users.org/wiki/CountingFromOne), but no serious moves to change anything. There is a good article by Edsger Dijkstra linked there, "Why numbering should start at zero."