Generic servers and supervisors in Jobim Clojure’s actors library

The concepts of a “supervisor” and a “generic server” are immediately familiar to anyone who has worked on an Erlang application. The main idea behind them is to factor common patterns of distributed Erlang applications into higher-level abstractions that can be easily reused. These abstractions are called behaviours, and Erlang OTP includes a very useful set of them.

A “generic server” is just an abstraction of a regular Erlang process, with some scaffolding to maintain state, handle synchronous and asynchronous messages, and terminate the process in an orderly way in case of error.

A “supervisor” is a more interesting concept. It is a process with three main goals:

  • Starting other processes
  • Monitoring their correct execution
  • Trying to recover from child failures

Designers can choose different ways of recovering from child errors. These options are called recovery policies and can take different values: one for all (if a child process fails, the remaining processes are terminated and then all are restarted), one for one (only the failing process is restarted), etc. A maximum number of restarts within a certain period of time can also be specified for the supervisor. If this limit is reached, the supervisor itself fails instead of getting trapped in a loop of failures and restarts.

A supervisor can have other supervisors among its children. This is a central idea in Erlang applications, which are usually designed as a tree with several levels of supervisors. If a child at some level fails, the supervisor at that level tries to recover from the failure. If that is impossible, the supervisor fails and the supervisor one level above tries to recover from the error. If the top-level supervisor fails, the application crashes in an orderly manner. Failing early and letting some other component of the system recover from the error is a common design principle in Erlang systems.

Jobim is modeled to a great extent after the Erlang OTP platform. It makes sense to add support for components like generic servers and supervisors to Jobim, since the issues present in Erlang applications are just as important for a Jobim application.

Support for generic servers can be found in the jobim.behaviours.server namespace. Generic servers in Jobim must implement the jobim.behaviours.server.Server protocol, defined as:

;; Server protocol

(defprotocol Server
  (init [this initial-value] 
      "Invoked when the server starts execution")
  (handle-call [this request from state] 
       "Invoked when a new synchronous request 
        arrives at the server")
  (handle-cast [this request state] 
        "Invoked when a new asynchronous request 
         arrives at the server")
  (handle-info [this request state] 
        "Handles messages distinct to events")
  (terminate [this state] 
        "Clean up code invoked when the server is 
         going to be terminated externally"))

The main functions are handle-call and handle-cast, which the server process invokes when it receives a synchronous or an asynchronous message, respectively. The following snippet shows a sample implementation of a generic server, taken from the supervisor example included with Jobim:

(ns jobim.examples.supervisor
  (:use [jobim])
  (:require [jobim.behaviours.server :as gen-server]
            [jobim.behaviours.supervisor :as supervisor]))

;; Test server

(gen-server/def-server ExceptionTestServer

  (init [this _] [])

  (handle-call [this request from state]
               (println (str "RECEIVED " request " PID " (self) " ALIVE"))
               (gen-server/reply :alive []))

  (handle-cast [this request state] (if (= request :exception)
                                      (throw (Exception. (str "I DIED! " (self))))
                                      (gen-server/noreply state)))

  (handle-info [this request state] (gen-server/noreply state))

  (terminate [this state] (println (str "CALLING TO TERMINATE " (self)))))

The test server just replies with :alive if it receives a synchronous message, and throws an exception if it receives an asynchronous message with the value :exception. The functions reply and noreply are used to send a potential reply to the message and to pass on the new state of the server. The implementation of the terminate function shows that the server just prints a message when it is about to be terminated.
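
The way reply and noreply carry the next state can be illustrated without Jobim at all. The following is only a minimal sketch in plain Clojure: the MiniServer protocol, the Counter record and the run function are hypothetical names made up for this illustration, and the maps returned by reply and noreply are an assumption, not Jobim's actual internal representation.

```clojure
;; Hypothetical, Jobim-free model of a generic server callback cycle.
(defprotocol MiniServer
  (handle-call [this request state] "Synchronous request: must produce a reply")
  (handle-cast [this request state] "Asynchronous request: no reply expected"))

;; Assumed shapes: a reply carries a value plus the new state,
;; a noreply carries only the new state.
(defn reply [value state] {:reply value :state state})
(defn noreply [state] {:state state})

(defrecord Counter []
  MiniServer
  (handle-call [_ request state]
    (case request
      :get (reply state state)))
  (handle-cast [_ request state]
    (case request
      :inc (noreply (inc state)))))

(defn run
  "Feeds a sequence of [kind request] messages to the server, threading
   the state returned by each callback into the next invocation and
   collecting the replies produced by synchronous calls."
  [server initial-state messages]
  (reduce (fn [{:keys [state replies]} [kind request]]
            (let [result (case kind
                           :call (handle-call server request state)
                           :cast (handle-cast server request state))]
              {:state   (:state result)
               :replies (if (contains? result :reply)
                          (conj replies (:reply result))
                          replies)}))
          {:state initial-state :replies []}
          messages))

;; Two casts followed by one call:
(run (->Counter) 0 [[:cast :inc] [:cast :inc] [:call :get]])
;; => {:state 2, :replies [2]}
```

In a real generic server the loop is driven by the process mailbox instead of a fixed sequence of messages, and the reply is sent back to the caller rather than accumulated.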

Once a generic server is implemented, it is good practice to hide the interaction with the server behind a few wrapper functions that define a public interface for the process. Some of these functions are defined in the same namespace:

;; Public interface

(defn make-test-client
  ([n] (gen-server/start (str "test-client-" n)  (ExceptionTestServer.) [])))

(defn ping-client
  ([n] (gen-server/send-call! (resolve-name (str "test-client-" n)) :ping)))

(defn trigger-exception
  ([n] (gen-server/send-cast! (resolve-name (str "test-client-" n)) :exception)))

The function make-test-client starts a new test server and registers it under the name test-client- followed by the given index. The functions ping-client and trigger-exception send messages to check that the server is alive and to provoke a remote exception, respectively.

We can test the server from the REPL by starting a new Jobim node and calling these functions:

user> (use 'jobim)
nil

user> (use 'jobim.examples.supervisor)
nil

user> (bootstrap-node "test-node.clj")

 ** Jobim node started ** 


 - node-name: remote-test
 - id: 82534b8d74ce434d9164a59cf361c7df
 - messaging-type :rabbitmq
 - messaging-args {:host "localhost"}
 - zookeeper-args ["localhost:2181" {:timeout 3000}]


:ok

user> (spawn-in-repl)
"82534b8d74ce434d9164a59cf361c7df.1"

user> (make-test-client 0)
"82534b8d74ce434d9164a59cf361c7df.2"

user> (registered-names)
{"test-client-0" "82534b8d74ce434d9164a59cf361c7df.2"}

user> (ping-client 0)
:alive

user> (trigger-exception 0)
:ok

user> (registered-names)
{}

The previous REPL transcript starts a single Jobim node that uses RabbitMQ as the transport mechanism, spawns a Jobim process associated with the REPL thread, and starts a test server with index 0. Messages are then sent to the process with the send-call! and send-cast! functions. We can also see that the server stops executing after the asynchronous :exception message is sent, and that it no longer appears as a registered process.

A look at the log shows some more information about the interaction with the server:

RECEIVED :hi PID 82534b8d74ce434d9164a59cf361c7df.2 ALIVE
ERROR jobim.core - *** process 82534b8d74ce434d9164a59cf361c7df.2 
died with message : I DIED! 82534b8d74ce434d9164a59cf361c7df.2 [...]
...

If we want to be notified when the test client throws an exception, we can link the actor associated with the REPL thread to the newly created server. For example:

user> (def *client-pid* (make-test-client 0))
#'user/*client-pid*
user> (link *client-pid*)
{"82534b8d74ce434d9164a59cf361c7df.1" ["82534b8d74ce434d9164a59cf361c7df.4"], 
 "82534b8d74ce434d9164a59cf361c7df.4" ["82534b8d74ce434d9164a59cf361c7df.1"]}

user> (trigger-exception 0)
:ok

user> (receive)
{:signal :link-broken, 
 :from "82534b8d74ce434d9164a59cf361c7df.4", 
 :cause "class java.lang.Exception:I DIED! 82534b8d74ce434d9164a59cf361c7df.4"}
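
The map printed by link and the :link-broken message received afterwards suggest a simple mental model, sketched here in plain Clojure. This is an assumption made for illustration only: link-pids and link-broken-signals are hypothetical helpers, and Jobim's real bookkeeping (which is distributed across nodes) may look quite different.

```clojure
;; Hypothetical model: the link table is a map from a PID to the
;; vector of PIDs it is linked to.
(defn link-pids
  "Records a bidirectional link between PIDs a and b."
  [links a b]
  (-> links
      (update-in [a] (fnil conj []) b)
      (update-in [b] (fnil conj []) a)))

(defn link-broken-signals
  "Builds the messages that would be delivered to every PID linked
   to dead-pid when that process dies with the given cause."
  [links dead-pid cause]
  (mapv (fn [pid]
          {:to pid :signal :link-broken :from dead-pid :cause cause})
        (get links dead-pid [])))

(def links (link-pids {} "node.1" "node.4"))
;; links => {"node.1" ["node.4"], "node.4" ["node.1"]}

(link-broken-signals links "node.4" "I DIED!")
;; => [{:to "node.1", :signal :link-broken, :from "node.4", :cause "I DIED!"}]
```

Because the link is recorded in both directions, the death of either process produces a signal for the other one, which is the property a supervisor relies on to notice that a child has failed.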

This ability to create a bidirectional link between actors is used to define a supervisor actor. A supervisor actor can be created with the jobim.behaviours.supervisor/start function. This function receives a supervisor specification containing the restart strategy, the maximum number of restarts, the period over which restarts are counted, and a list of child process specifications. Jobim’s supervisor example defines the following function to create a new supervisor:

(defn start-supervisor
  ([] (start-supervisor :one-for-one))
  ([restart-strategy]
     (supervisor/start
      (supervisor/supervisor-specification
       restart-strategy                 ; restart strategy
       1                                ; one restart max
       20000                            ; within each 20-second period
       ; Children specifications
       [(supervisor/child-specification
         "test-client-1"
         "jobim.examples.supervisor/make-test-client"
         [1])
        (supervisor/child-specification
         "test-client-2"
         "jobim.examples.supervisor/make-test-client"
         [2])
        (supervisor/child-specification
         "test-client-3"
         "jobim.examples.supervisor/make-test-client"
         [3])]))))

The function creates three children with indexes 1, 2 and 3, and receives the restart strategy as a parameter. The possible restart strategies are :one-for-one, :one-for-all and :rest-for-one. We will review all of them.

one-for-one

This is the simplest restart strategy: if a child process dies, only that process is restarted.

Figure: one-for-one restart strategy, from Erlang's official documentation

The following transcript shows how this strategy works:

user> (use 'jobim)
nil

user> (use 'jobim.examples.supervisor)
nil

user> (bootstrap-node "test-node.clj")

 ** Jobim node started ** 


 - node-name: remote-test
 - id: ca6377017189452a90814d95edcac79b
 - messaging-type :rabbitmq
 - messaging-args {:host "localhost"}
 - zookeeper-args ["localhost:2181" {:timeout 3000}]


:ok

user> (spawn-in-repl)
"ca6377017189452a90814d95edcac79b.1"

user> (start-supervisor :one-for-one)
"ca6377017189452a90814d95edcac79b.2"

user> (registered-names)
{"test-client-3" "ca6377017189452a90814d95edcac79b.5", 
 "test-client-1" "ca6377017189452a90814d95edcac79b.3", 
 "test-client-2" "ca6377017189452a90814d95edcac79b.4"}

user> (trigger-exception 2)
:ok

user> (registered-names)
{"test-client-3" "ca6377017189452a90814d95edcac79b.5", 
 "test-client-1" "ca6377017189452a90814d95edcac79b.3", 
 "test-client-2" "ca6377017189452a90814d95edcac79b.6"}

We can see how the PID for the actor identified with test-client-2 has changed once the supervisor restarted it. The other processes are not affected at all.

one-for-all

In this restart strategy, when one child dies, all the remaining children are terminated and then all of them are restarted.

Figure: one-for-all restart strategy, from Erlang's official documentation

The following transcript shows an example of how this strategy works:

user> (use 'jobim)
nil

user> (use 'jobim.examples.supervisor)
nil

user> (bootstrap-node "test-node.clj")

 ** Jobim node started ** 


 - node-name: remote-test
 - id: a1c47e33320149938dbde4564b4a4199
 - messaging-type :rabbitmq
 - messaging-args {:host "localhost"}
 - zookeeper-args ["localhost:2181" {:timeout 3000}]


:ok

user> (spawn-in-repl)
"a1c47e33320149938dbde4564b4a4199.1"

user> (start-supervisor :one-for-all)
"a1c47e33320149938dbde4564b4a4199.2"

user> (registered-names)
{"test-client-3" "a1c47e33320149938dbde4564b4a4199.5", 
 "test-client-1" "a1c47e33320149938dbde4564b4a4199.3", 
 "test-client-2" "a1c47e33320149938dbde4564b4a4199.4"}

user> (trigger-exception 2)
:ok

user> (registered-names)
{"test-client-3" "a1c47e33320149938dbde4564b4a4199.8", 
 "test-client-1" "a1c47e33320149938dbde4564b4a4199.6", 
 "test-client-2" "a1c47e33320149938dbde4564b4a4199.7"}

We can see that the PIDs of all the registered actors have changed: the supervisor restarted all of them once the exception was triggered in one of them. If we check the log, we can see that the terminate function has been invoked for the actors that were still running:

ERROR jobim.core - *** process a1c47e33320149938dbde4564b4a4199.4 
died with message : I DIED! a1c47e33320149938dbde4564b4a4199.4 [...]
CALLING TO TERMINATE a1c47e33320149938dbde4564b4a4199.5
CALLING TO TERMINATE a1c47e33320149938dbde4564b4a4199.3

rest-for-one

The last strategy is :rest-for-one. With this strategy, only the actors defined after the failing actor in the supervisor specification are terminated, and they are then restarted together with the failing one.
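
The difference between the three strategies comes down to which children are restarted when one of them fails. A minimal sketch in plain Clojure, assuming the children are kept in the order in which they appear in the supervisor specification; children-to-restart is a hypothetical helper written for this illustration, not a function of Jobim's supervisor namespace:

```clojure
(defn children-to-restart
  "Returns the names of the children a supervisor would restart when
   `failed` dies. `children` must be in specification order."
  [strategy children failed]
  (case strategy
    :one-for-one  [failed]
    :one-for-all  (vec children)
    ;; the failed child plus every child defined after it
    :rest-for-one (vec (drop-while #(not= % failed) children))))

(def children ["test-client-1" "test-client-2" "test-client-3"])

(children-to-restart :one-for-one children "test-client-2")
;; => ["test-client-2"]

(children-to-restart :one-for-all children "test-client-2")
;; => ["test-client-1" "test-client-2" "test-client-3"]

(children-to-restart :rest-for-one children "test-client-2")
;; => ["test-client-2" "test-client-3"]
```

With :rest-for-one, a failure in test-client-1 would restart all three children, while a failure in test-client-3 would restart only that one.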

A supervisor can fail if too many child actors fail within the period of time specified in the supervisor definition. In the start-supervisor function, the limit is set to one restart every 20 seconds. We can force the supervisor to fail by making more than one process restart within that period. When the supervisor is about to terminate, it terminates all its child actors.

user> (use 'jobim)
nil

user> (use 'jobim.examples.supervisor)
nil

user> (bootstrap-node "test-node.clj")

 ** Jobim node started ** 


 - node-name: remote-test
 - id: 39df41a4d4224e80818e7f33061a5ca2
 - messaging-type :rabbitmq
 - messaging-args {:host "localhost"}
 - zookeeper-args ["localhost:2181" {:timeout 3000}]


:ok

user> (spawn-in-repl)
"39df41a4d4224e80818e7f33061a5ca2.1"

user> (start-supervisor :one-for-one)
"39df41a4d4224e80818e7f33061a5ca2.2"

user> (registered-names)
{"test-client-3" "39df41a4d4224e80818e7f33061a5ca2.5", 
 "test-client-1" "39df41a4d4224e80818e7f33061a5ca2.3", 
 "test-client-2" "39df41a4d4224e80818e7f33061a5ca2.4"}

user> (trigger-exception 2)
:ok

user> (registered-names)
{"test-client-3" "39df41a4d4224e80818e7f33061a5ca2.5", 
 "test-client-1" "39df41a4d4224e80818e7f33061a5ca2.3", 
 "test-client-2" "39df41a4d4224e80818e7f33061a5ca2.6"}

user> (trigger-exception 2)
:ok

user> (registered-names)
{}

If we check the log, we can see that the supervisor throws an exception after terminating all its child processes:

ERROR jobim.core - *** process 39df41a4d4224e80818e7f33061a5ca2.4 
died with message : I DIED! 39df41a4d4224e80818e7f33061a5ca2.4 [...]
ERROR jobim.core - *** process 39df41a4d4224e80818e7f33061a5ca2.6 
died with message : I DIED! 39df41a4d4224e80818e7f33061a5ca2.6 [...]
MAX RESTARTS REACHED
CALLING TO TERMINATE 39df41a4d4224e80818e7f33061a5ca2.7
CALLING TO TERMINATE 39df41a4d4224e80818e7f33061a5ca2.5
CALLING TO TERMINATE 39df41a4d4224e80818e7f33061a5ca2.3
ERROR jobim.core - *** process 39df41a4d4224e80818e7f33061a5ca2.2 
died with message : Max numer of restarts reached in supervisor: 
39df41a4d4224e80818e7f33061a5ca2.2 [...]
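
The restart limit that produces MAX RESTARTS REACHED in this log can be sketched as a sliding window over the timestamps of recent restarts. This is a plain Clojure illustration under stated assumptions: record-restart and the keys :max-restarts and :period-ms are hypothetical, and the period is assumed to be expressed in milliseconds. It does not reproduce Jobim's actual implementation.

```clojure
(defn record-restart
  "Adds a restart at time now-ms to the history, dropping restarts that
   fall outside the period, and reports whether the limit is exceeded."
  [{:keys [max-restarts period-ms]} restarts now-ms]
  (let [recent (conj (filterv #(< (- now-ms %) period-ms) restarts)
                     now-ms)]
    {:restarts  recent
     :exceeded? (> (count recent) max-restarts)}))

;; Same limits as start-supervisor: one restart per 20-second period.
(def limits {:max-restarts 1 :period-ms 20000})

;; First failure: the child is restarted.
(record-restart limits [] 0)
;; => {:restarts [0], :exceeded? false}

;; Second failure 5 seconds later: the supervisor gives up.
(record-restart limits [0] 5000)
;; => {:restarts [0 5000], :exceeded? true}

;; Second failure 30 seconds later: the first restart has expired.
(record-restart limits [0] 30000)
;; => {:restarts [30000], :exceeded? false}
```

When :exceeded? becomes true, a supervisor following this model would terminate its remaining children and then fail itself, which matches the order of events in the log above.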

Conclusions

Supervisors, generic servers and other components implemented in Jobim (FSM, event managers, generic TCP servers) are important pieces for building distributed applications using actors.

The current implementation is still immature, but it already shows how the interface of the basic building blocks of Erlang applications can be implemented in Clojure. Nevertheless, the JVM, where Clojure code is usually executed, makes it hard to reproduce the behaviour of the Erlang VM. Actors must collaborate with the supervisor to terminate their execution, since it is impossible to force the termination of a thread. This problem shows up, for instance, with blocking operations.
Even so, the blocks already implemented are enough to start building test applications, and they make it possible to define new abstractions on top of them at the level of an application running on a set of nodes.

One thought on “Generic servers and supervisors in Jobim Clojure’s actors library”

  1. Awesome! I was working on a library very similar to this.

    This library uses RabbitMQ, right? Have you thought about using Erlang’s jinterface? It would enable jobim to talk to already existing Erlang applications.

    Oh, and have you forgot to add a link to your library? I found it on Github: http://github.com/antoniogarrote/jobim
