The Magic of Thunk - Stream_list

alt

list

The built-in list is a fundamental type in OCaml. We can have a list of numbers such as [1;2;3;4] or functions e.g. [fun x -> x + 1; fun y -> 2 * y]. Although list is most frequently used, don't forget its limitation: all elements inside must be generated before use of the list.

This gives us two troubles sometimes:

  • What if the computation of each element is intensive and maybe only partial of the list would be used in run time? This wastes CPU cycles.
(* generate all sqrt for natures within [p, q]
   this function is not tail-recursive, 
   but it should be in practice *)
let range_sqt p q =  
  let rec gen i =
    if i > q then []
    else sqrt (float_of_int i)::gen (i+1)
  in
  gen p;;

(* take only the first 5
   One million sqrt will be done anyway
   List.take is from Core lib *)
List.take (range_sqt 1 1000000) 5;;  
  • What if we need a list of all elements and the all means infinite? This becomes impossible and the process will not end ([1])
(* Generate a list of all nature numbers
   Since all elements need to be in place,
   the generation will never end *)
let natures_list =  
  let rec gen i = i::gen (i+1) in gen 1;;

stream_list

stream_list is born to solve the above two problems list has. It is like list, having a head and a tail. Before we dive into stream_list, let's first review the type of list, hd and tail functions.

(* This is list! Note that [] and :: are ocaml built-in support. Here to demonstrate, we use Nil and Cons *)

type 'a list = Nil | Cons of 'a * 'a list;;

let hd = function  
  | Nil -> failwith "empty list"
  | x::_ -> x;;

let tail = function  
  | Nil -> failwith "empty list"
  | Cons (_, tl) -> tl;;

l = [1;2;3];;  
let tl_0 = tail l;; (* tl_0 is [2;3] *)  
let hd_1 = hd tl_0;; (* hd_1 is 2 *)  

The important bit to notice is that when you need the tail, you simply retrieve and also the new head is already there. stream_list, however, is different: Only when needed, the real tail together with the new head is generated.

Quiz 1: How can we encapsulate the generation of tail of a stream_list?

Ans
Thunk, of course. Recall Magic of Thunk, we use thunks to store the way of generation, so we do not need to really generate them.

(* This is stream_list! It transforms from list by simply convert the tail to a thunk. *)

type 'a stream_list = Nil | Cons of 'a * (unit -> 'a stream_list);;

(* the hd function is even the same as list *)
let hd = function  
  | Nil -> failwith "empty list"
  | Cons(x, _) -> x;;

let tail = function  
  | Nil -> Nil
  | Cons (_, tl) -> tl();;

let rec take sl n =  
  match sl with
  | Nil -> []
  | _ when n = 0 -> []
  | Cons (x, tl) -> x::take (tl()) (n-1);;

To better understand it, let's follow a simple example: natures, which list cannot do.

let natures =  
  let rec gen i = Cons (i, fun() -> gen (i+1)) in
  gen 1;;

take natures 2;;  

natures initially is Cons(1, fun() -> gen 2). If we do tail natures, then fun() -> gen 2 gets applied and we will get Cons (2, fun() -> gen 3). The thunk is like a door and when we need the next thing, we have to open it. It naturally splits the endless process into steps by element, but still provides the possibility of accessing the elements endlessly. Let's practise more.

A trick

If stream_list doesn't feel natural yet, here is a little trick:

  1. Forget about stream_list first
  2. Write the list using Nil and Cons
  3. No tail-recursive
  4. Ignore or accept the fact that it would not stop
  5. Convert the tail to a thunk
  6. Then you get a stream_list

Fibonacci numbers

Following the above trick, it should not be too hard to write a stream_list of fibonacci numbers.

At first, purely list-based:

(* list *)
let fibs_list =  
  let rec fib a b = Cons (a, fib b (a+b)) in
  fib 0 1;;

Then convert the tail part:

(* stream_list *)
let fibs =  
  let rec fib a b = Cons (a, fun () -> fib b (a+b)) in 
  fib 0 1;;

Neat?

Primes

A good way to generate a list of primes is using Sieve of Eratosthenes algorithm. A functional version can:

  1. Maintain a list of the primes found so far (p_acc)
  2. For a new nature number, if the mods between it and all primes in pacc are not zero, then it is a new prime, add it to pacc and repeat; if not, then next nature number
(* List.for_all is from Core lib [2]*)
let is_prime x primes = List.for_all primes ~f:(fun p -> x mod p <> 0);;

(* This prime list will not end and eventually cause stackoverflow, but will help us to write a stream_list of all primes. Also, the reason of using :: in middle is because that's not relevant to stream_list *)
let prime_list =  
  let rec gen i p_acc =
    if is_prime i p_acc then Cons(i, gen (i+1) (i::p_acc))
    else gen (i+1) p_acc
  in
  gen 3 [2];;

Then easily, we can write the stream_list version:

let primes =  
  let rec gen i p_acc =
    if is_prime i p_acc then Cons(i, fun() -> gen (i+1) (i::p_acc))
    else gen (i+1) p_acc
  in
  Cons(2, gen 3 [2]);;

The above has a flaw though. primes always holds a list of primes found so far and this list will grows after each retrieval of prime. So theorically, this prime stream_list is not endless. Just be aware of this, despite that it could be fine in practical uses.

Combined operations

Just like list, stream_list also provides operatoins (functions) such as filter, map, folder, etc. Although the meanings and purposes of those operations of stream_list are the same as list, the behaviors of combined operations are a bit different. Let's first have a look at two combined filtering on list.

let l = [2;3;4;5;6];;

let double_filtering l =  
  List.filter ~f:(fun x -> Printf.printf "%d mod 3\n" x;x mod 3 = 0) (
    List.filter ~f:(fun x -> Printf.printf "%d mod 2\n" x; x mod 2 = 0) l)

double_filtering l;;  
1 mod 2  
2 mod 2  
3 mod 2  
4 mod 2  
5 mod 2  
6 mod 2

2 mod 3  
4 mod 3  
6 mod 3  
- : int list = [6]

After the first filtering, all x mod 2 = 0 are kept. Based on it, the second filtering keeps all x mod 3 = 0. Through the printed output, we can see that the second filtering started only after the first filtering finished on all elements of the original list. This means combined operations on list are executed strictly in order and each operation must finish on all elements in one go. Does stream_list have the same behavior?

(* stream_list of [i, j) *)
let rec range i j =  
  if i >= j then Nil
  else Cons(i, fun() -> range (i+1) j)

let rec filter ~f = function  
  | Nil -> Nil
  | Cons(x, tl) ->
    if f x then Cons(x, fun() -> filter ~f (tl()))
    else filter ~f (tl());;

let double_filter_sl sl =  
  filter ~f:(fun x -> Printf.printf "%d mod 3\n" x;x mod 3 = 0) (
    filter ~f:(fun x -> Printf.printf "%d mod 2\n" x; x mod 2 = 0) sl);;

(* take was define in beginging of the post *)
take (double_filter_sl (range 1 7)) 6;;  
1 mod 2  
2 mod 2

2 mod 3  
3 mod 2  
4 mod 2

4 mod 3

5 mod 2  
6 mod 2

6 mod 3  
- : int list = [6]

The order of filterings is quite different now. As shown from the output, two filterings are executed in turn on elements. If one element passed the first filtering then it will be given to the second filtering immediately.

If we write filters and lists as rows and columns:

filter1 [1;2;3;4;5;7]  
filter2 [1;2;3;4;5;7]  
filter3 [1;2;3;4;5;7]  
filter4 [1;2;3;4;5;7]  

The behavior of list is to finish row by row, while for stream_list it is column by column [4]. It is fascinating that the design of stream_list naturally let the overlayed operations focus on element one by one, which still obeys the rule of stream_list (always delays operations on rest of the elements)

Primes reloaded

We can now have a better prime stream_list.

Let's first remove p_acc (i.e., the aid of built-in list) in the list version of primes:

(* all functions are for list and endless *)

(* generate a endless nature nubmers starting from i *)
let rec from i = Cons(i, from (i+1));;

(* This function can easily be replaced by a filter *)
let rec prime_candidate prime = function  
  | Nil -> Nil
  | Cons(x, tl) -> 
    if x mod prime = 0 then prime_candidate prime tl
    else Cons(x, prime_candidate prime tl);;

let primes_list =  
  let rec gen = function
    | Nil -> Nil
    | Cons (x, tl) -> Cons(x, prime_candidate x tl)
  in
  gen (from 2);;

Instead of letting a new number mod all primes found so far, we take the other way around. Once we find a prime, we let all following numbers mod the prime to produce a prime candidate list. The first element of such a list will be the next prime (this next prime is the number that passed the check of mod all primes found so far). Then we repeat.

  1. prime candidates are [2;3;4;5;6;7;8;9;...]
  2. 2 is prime, so we prime_candate 2 [3;4;5;6;..] and we get [3;5;7;9;...]
  3. 3 is prime, so we prime_candate 3 [5;7;9;...] and we get [3;5;7;...]
  4. 5 is prime, so ...

Evantually a endless prime list will be generated. To convert the whole part to stream_list based, we get:

let rec from i = Cons(i, fun() -> from (i+1));;

let prime_candidate prime remaining_natures =  
  filter ~f:(fun x -> x mod prime <>0) remaining_natures;;

let primes =  
  let rec gen = function
    | Nil -> Nil
    | Cons (x, tl) -> Cons(x, fun() -> gen (prime_candidate x (tl())))
  in
  gen (from 2);;

Look closer and you will find out the key is combining together filterings each of which is based on whether a natural number can be divided by a prime.

Yet, the above primes is not flawless.

Also if you are interested, maybe you shall have a look at the paper The Genuine Sieve of Eratosthenes. It proves the ultimate version of prime stream_list is not perfect.


[1] We can use let rec alist = 0::alist to create a endless list of 0, but it uses cycled references and its values still need to be fixed; otherwise, the building process won't stop. We will describe let rec further in a later post.

[2] Core_list doc

[3] Note when excuting column by column, it is fail-fast, i.e., if an element does not pass one filter, it will not be fed to later filters.

Ps. All codes end up with ;;. This is to make it convenient for you to copy / paste in utop; however, it is not necessary if you store them to a .ml file and run the file. Also all the names of variables are quite long or verbose; this is to make it a little easier to understand.

Ps1. The fundamental study materials of this post comes from OCaml course in Cornell


To be continued