Message Passing Concurrency in Clojure using Kilim

Some slides from my talk about Clojure and Kilim.
You can get the code and the examples in github

Monte Carlo integration with Clojure and Mahout

Monte Carlo simulations can be used to approximate the value of different mathematical models where computing an analytical solution is an intractable problem or it is impossible to find because of the uncertainty in some variables.

Conceptually Monte Carlo simulations are simple.
They are based in the repeated computation of the model where the value for the random variables are extracted from some known probability distribution.
Finally, the results of all the simulations are averaged to obtain the approximated value.

The two main requirements to run a Monte Carlo simulation, provided you already have a mathematical model of the proble, are a good random number generator and some tool to model the different probability distributions. The Apache Mahout project has implemented these tools. It includes a powerful pseudo-random numbers generator that passes all the DieHard randomness tests. It also includes Java classes that make easy to work with different probability distributions.

This post is a small example of how to use these classes from Clojure as a convenient scripting language to work with Mahout. The example implemented is extracted from this excellent introductory book published by Springer.
It consists of computing the integral for the following function:

  (defn h
    ([x] (pow (+ (cos (* 50 x)) (sin (* 20 x))) 2)))

We can use Incanter to visualize this function:

  (use 'incanter.core)
  (use 'incanter.charts)

  (view  (function-plot h 0 1))

The integral can be computed anallytically obtaining the value 0.965.

The montcarlo approach is based in extracting iid variables from an uniform probability distribution U(0,1), and approximate the value with the function:

  (def *solution* (/ (reduce + *randoms*) *trials*))

Where *randoms* is the collection of randomly generated iid variables and *trials* is the number of variables generated.

In order to generate these values we can use the org.apache.mahout.math.jet.random.Uniform Mahout class:

  (import 'org.apache.mahout.math.jet.random.Uniform)
  
  (def *unif* (Uniform. (double 0.0)
                        (double 1.0)
                        (mod (.getTime (java.util.Date.)) 10000)))

  (def *trials* 100000)

  (def *randoms* (take *trials* (repeatedly #(h (.nextDouble *unif*)))))

To check that the values generated, we can plot the generated values using Incanter:

  (view (bar-chart (range 0 *trials*) *randoms* ))

The computed value is 0.9651838354718588, very close to the actual solution of 0.965.

We can also compute the running averages and plot them to see how the algorithm converges to the solution:

  (def *averages* (reduce (fn [ac i]
                            (conj ac (/ (+ (last ac) (nth *randoms* i)) (inc i))))
                          [0]
                          (range 0 *trials*)))

  (view  (function-plot (fn [x] (nth *averages* (Math/round (float x)))) 0 *trials*))

Monte Carlo simulations are a simple yet very powerful idea that can be applied to a lots of different of situations. If you want to learn more about Monte Carlo methods, these are some useful resources:

visualizing Mahout’s output with Clojure and Incanter

Some Clojure code to visualize clusters built using Apache Mahout implementation of the K-Means clustering algorithm.

The code retrieves the output of the algorithm (clustered-points and centroids) from HDFS, builds a Clojure friendly representation of the output (a map and a couple of lazy-seqs) and finally uses Incanter’s wrapper around JFreeChart to visualize the results.

A sample execution using the output generated by the example from Mahout’s documentation:

(use 'mahout-vis.core)

(bootstrap! "/Users/antonio/Development/tmp/hadoop-0.20.2/
conf/core-site.xml")

(def *results* 
  (k-means-output "output/clusters-9/part-r-00000" 
                  "output/clusteredPoints/part-m-00000"))

(visualize-plots 
  (compute-comps *results* [5 10] [15 50] 
                           {:display-centroids true}))

The output of the previous code are 4 frames displaying the clusters for the components with indices 5,10,15 and 50 of the input data.

Other visualizations can be generated interactively from Clojure’s REPL. Another example of how Clojure can provide an interactive and powerful interface to complex Java systems.

Extending MongoDB with custom commands

MongoDB is one of the fastest alternatives to store and retrieve data. Mongo stores information organized as schema-less documents. Queries and data objects are expressed as JSON objects encoded using a “binary” serialization instead of plain text for improved performance.

There are two main mechanisms to extract information from MongoDB. A find command wrapping a small ad-hoc query language including conditionals, sorting, etc. and a mapReduce command for batch processing.

When not standard retrieval of data is required, the mapReduce command is the only option MongoDB offers. In a recent project I’ve been working on, I had to select documents stored in mongo with a value for the Levenshtein distance from a query term inferior to a certain threshold, a similar functionality tha the one offered by the fuzzystrmatch module in PostgreSQL. The task can be accomplished using Mongo’s mapReduce command, but the performance of the queries was not optimal.

As an experiment, I started reading the code of MongoDB to see if there was an easy way to implement this functionality directly in the database. What I found is that Mongo’s code is really modular and easy to extend.
The outcome has been a new command that implements the functionality with a big improvement in performance, as the following table shows:

implementation mapReduce native
levenshtein 0.7 (0 matches) 1.941 s 0.077 s
levenshtein 0.4 (21 matches) 2.691 s 0.091 s
levenshtein 0.1 (22.478 matches) 22.857 s 7.962 s

The collection being queried in this test is a collection with 100.000 documents containgin random strings of text between 30 and 100 characters.

The code for the new command can be found at Github. The files in this commit contain all the code required to implement the command.

The following is a small summary of the steps required to extend MongoDB to support this kind of queries.

Retrieving MongoDB’s code and building process

MongoDB’s code is available at Github. Once the code has been retrieved, the next step is to build the data base and all the additional functionality, like the Javascript shell.
Mongo uses SCons as the build infrastructure. SCons is itself built using python, so this is a dependency that must be installed in your system before you can build Mongo.

To build the whole system, a single command is enough:

$scons .

The task can take quite a long time but after building the whole system, SCons does a great work just re-building only the modified sources.
Different parts of the system can also be built as independent targets:

# builds only the db
$scons mongod
# builds only the JS shell
$scons mongo

Creating a new DB command

Mongo’s core functinality can be found in the db directory of the source distribution. It includes the implementation of Mongo’s RESTful API, indexes/BTree support, standard Mongo’s queries and also the list of commands that can be issued to the database, e.g. the mapReduce.
Adding a new command to the list means implementing a new C++ class with the functionality of the command and registering a name for this command in a map of command-names to command classes.

If we take a look at db/commands.cpp we will find the function used by the server frontend to look for the function it has to execute:

    map * Command::_commands;
   ... 

    Command* Command::findCommand( const string& name ) {
        map::iterator i = _commands->find( name );
        if ( i == _commands->end() )
            return 0;
        return i->second;
    }

All commands implement the abstract mongo::Command class. The subclass must implement some functions in order for the command to be executed. The mos important function is mongo::Command::run defined in db/commands.h:

  // db/commands.h line 50
   virtual bool run(const string& db, BSONObj& cmdObj, 
                    string& errmsg, BSONObjBuilder& result, 
                    bool fromRepl) = 0;

The base Command class also provides a base constructor that will automatically register the command in the commands map when invoked in the subclass. For example, the implementation of the mapReduce command registers itself for execution invoking the base constructor:

/**
 * This class represents a map/reduce command executed on a single server
 */
class MapReduceCommand : public Command {
  public:
     MapReduceCommand() : Command("mapReduce", false, "mapreduce") {}

Retrieving the arguments for the command
The query retrieved from the client is encoded as a BSON object and passed as the second argument to the run function.
There is a whole suite of functions to manipulate BSON objects defined in MongoDB. They can be found in bson/bsonobj.h and bson/bsonelement.h.
In this fragment of code from the mapReduce command implementation the out parameter of the query is handled. The BSON object is stored in the variable cmdObj:

if ( cmdObj["out"].type() == String ) {
    finalShort = cmdObj["out"].String();
    outType = REPLACE;
}
else if ( cmdObj["out"].type() == Object ) {
    BSONObj o = cmdObj["out"].embeddedObject();

    BSONElement e = o.firstElement();
    string t = e.fieldName();

    if ( t == "normal" || t == "replace" ) {
        outType = REPLACE;
        finalShort = e.String();
    }
    else if ( t == "merge" ) {
        outType = MERGE;
        finalShort = e.String();
    }
    else if ( t == "reduce" ) {
        outType = REDUCE;
        finalShort = e.String();
    }
    else if ( t == "inline" ) {
        outType = INMEMORY;
    }
    else {
        uasserted( 13522 , str::stream() << "unknown out specifier [" << t << "]" );
    }

    if (o.hasElement("db")) {
        outDB = o["db"].String();
    }
}

Obtaining a cursor
To implement the desired functionality it usually necessary to traverse the collection of Mongo documents stored in the DB. Mongo implements this functionality using cursors.
Cursors can be obtained using a factory function called bestGuessCursor that receives as a parameter an unique namespace for the command and a description of a DB query.
The cursor is returned as a Boost smart pointer so we don’t have to deal with the deallocation of the resources consumed by the pointer. A possible template for a function using a collection pointer could be:

// run function
bool run(...) {
  
  // get the cursor
  shared_ptr temp = bestGuessCursor( ns, BSONQuery, BSONObj() );        
  auto_ptr cursor( new ClientCursor( timeoutOpts , temp , ns ) );

  // main loop
  while ( cursor->ok() ) {

    // get current document
    BSONObj o = cursor->current();

    ... logic ...

    cursor->advance(); 
  }
}

Building the output

The result of the command must be returned also as a BSON object. To build this object a reference to a BSONObjBuilder object is passed as an argument to the run function. The logic of the function can use functions like append to add values to the resulting BSON object. If the values of this object must also be BSON objects, additional BSONObjBuilder instances can be created. Once the object has been built, it can be retrieved from the builder calling to the obj funtion.

The run function must also signal if the execution of the command has been successful returning a boolean value.

Adding support for the command in the shell
In order to use the command we have implemented, we can add support for the command in the Mongo JS shell. A good location for the JS code invoking the command is shell/collection.js.
The function must build the JSON object tha will be later received as a parameter in the command implementation at the server. The only requirement for this JSON object is that the first property of the object must have the same name that the string used to register the command in the DB. The value for that property must be the short name of the collection. The rest of properties are optional. The command can be executed using the this._db.runCommand object from the present collection object.

As an example, this is the implementation of the custom levenshtein command:

DBCollection.prototype.levenshtein = function( sourceTerm , field, threshold, opts ){
    var c = { levenshtein : this._shortName , sourceTerm : sourceTerm , field : field, threshold : threshold };
    opts = opts || {"level":"word"};

    if(!opts["level"] || opts["level"] === "word") {
        c["word"] = true;
        c["sentence"] = false;
    } else {
        c["word"] = false;
        c["sentence"] = true;    
    }

    c["separators"] = (opts["separators"]||".,:; ");

    if(opts["limit"]) {
        c["limit"] = opts["limit"];
    }
    if(opts["outputField"]) {
        c["outputField"] = opts["outputField"];
    }

    var raw = this._db.runCommand( c );
    if ( ! raw.ok ){
        __mrerror__ = raw;
        throw "levenshtein matches failed:" + tojson(raw);
    }

    return tojson(raw);

}

Adding support in a driver
One problem of extending MongoDB this way is that we must add support for the new command in all the layers between our client code and the DB. We have already added support to the JS shell but many applications access MongoDB through some kind of driver interfacing with the DB server.

In my case, I was using Sominum’s congomongo Clojure library. This means adding support in two different layers, the low level Java driver and the Clojure wrapper library.
Fortunately, the design of the library and the Java driver make possible to add support for the command entirely in client code without further modification of the library sources. Congomongo’s coerce function also makes very easy to transform data structures to and from Clojure’s native data types and BSON objects. An example implementation can be found in this Github’s gist

Chasing Erlang: profiling a Clojure library

This is a short summary of my efforts profiling and benchmarking Jobim, the actors library for Clojure I’ve been working for the last few months. I have very little experience profiling Clojure applications, so I thought a brief summary of the process may be interesting for other people in the same situation.

How to measure performance?

The first problem I found wast to make up a good way of testing the actual performance of the library. Performance of complex systems is really hard to test. The number of moving parts make difficult to spot the bottlenecks and wrong conclusions can be drawn easily.
Fortunately, in my case I found an easy solution for the problem. I used the same test suite built to benchmark Termite Scheme. Termite is a variant of Scheme built for distributed computing and modelled after Erlang programming model on the Gambit Scheme system.
The Termite paper has a full section discussing different performance tests and showing the results for Termite and Erlang. Furthermore, the implementation of the tests can be found in Termite’s source code.

Because of the similarities between the projects, implementing these benchmark tests in Clojure/Jobim provide a good starting point to test the performance of the library.

Finding tools

To measure performance, System/nanoTime can be good enough. On the other hand, to profile the code when performance is bad, you really need a good profiler tool. Unfortunately, there is no a good Clojure profiler tool as fas as I know. Commercial Java profilers like Yourkit are a good option but they can also be expensive. A good free alternative to these profilers is VisualVM. The profiler is included in the Java distribution and it’s really ease to use. Using VisualVM with Clojure is just a matter of selecting the REPL process from the profiler user interface and start gathering data.

Adding type hints

One first step to increase the performance of a Clojure application is to add type hints to the code of your functions. Clojure is a dynamic language, but that’s not the case of Java, the host-language for Clojure. In order to invoke Java methods, Clojure needs to find the type of the java object passed as arguments to these methods or returned from them. Java reflection is the mechanism that makes possible to look up for this informatipn at run-time. The use of reflection has a double impact on performance, it adds processing time and it consumes additional memory creating reflection objects. The use of reflection is usually not a problem in regular applications, memory is freed by the garbage collector and the performance penalty is not significative. Nevertheless, in some other cases the use of the reflection mechanism can become a problem.

The following images show a profiler session for the execution of the Termite ring test with a ring of 40k actors. The non-deterministic execution of the garbage collector has been disabled with the -Xincgc JVM parameter.

The screenshots show how the execution of the test allocates a huge amount of memory as java.lang.reflect.Method objects. These objects are released by the GC once it is manually executed but in a regular execution, the execution of the GC thread might have impacted very negatively the outcome of the test.

In order to avoid these situations, Clojure can notify us when it will use reflection to resolve a method call if we set up the *warn-on-reflection* var to true. With this modification, the list of conflicting calls will be shown when the source code is being compiled. Using information we can add type hints with the correct type for the call to the metadata of the problematic symbols. Clojure will use this information instead of using reflection to discover the right type for the objects.

The following figures show another run of the same test after having added the required type hints. We can see how the use of memory has decreased and the java.lang.reflect.Method objects have dissapeared.

In the case of Jobim, the number of type hints required to resolve all the problematic calls was really small. Around 15 type hints were enough. Nevertheless, some calls involving protocols were harder to solve. The use of VisualVM was a great help to check that no reflection objects were being created.

Object serialization
One of the main performance problems in tests involving the exchange of messages among actors in different nodes is the serialization and deserialization of messages. Using standard Java serialization the amount of measured time for some tests spent in the serialization of messages could reach 20% of the total. Besides, serialization of some Clojure objects like lambda functions is not supported.

Lacking an unique solution for the problem. Different serialization services have been implemented as plugins for Jobim. The following table shows some performance results for a test consisting of encoding and decoding a java HasMap with long lists of integers as values:

Serialization mechanism Time encoding+decoding
JSON serialization 5.566 ms
Java serialization 4.419 ms
JBoss serialization 1.437 ms
Kryo serialization 0.573 ms

The result shows how the standard Java serialization is almost as slow as JSON serialization. Kryo offers a very good performance with the drawback of not supporting Clojure core data types. Finally JBoss serialization library offers a much better performance than the standard Java serialization with a compatible interface and the same degree of support for Java types, including Clojure core data types.

The lack of support for clojure data structures makes Kryo a valid option only for certain cases where actors restrict the kind of messages they exchange so they can be serialized by Kryo. JBoss serialization is a good default option and has been used in the rest of the tests. Jobim data types used in the exchange of messages between nodes by the library have restricted to standard Java types so they can be used with any serialization mechanism.

Results

The following table shows the results I have obtained profiling the current version of Jobim. The results for Erlang have been measured using the same hardware where the Clojure version was tested.

test Erlang Jobim threaded Jobim evented
fib(534) 585 ms 378 ms -
Spawn 6.92 us 49 us 27 us
Send 0.2 us 120 us 16 us
ring 40k/0 3 us - 83 us
ring 40k/4 5 us - 166 us
ping-pong internal (0) 0.7 us 148 us 25 us
ping-pong internal (1000) 16 us 152 us 29 us
ping-pong same node (0) 58 us 14333 us 5027 us
ping-pong same node (1000) 182 us 25921 us 6303 us
ping-pong remote (0) 3267 us 18481 us 7022 us
ping-pong remote (1000) 16556 us 38730 us 8400 us

Ping-Pong tests where executed using a Jobim configuration consisting of the ZooKeeper plugin for the coordination service, the TCP plugin for the messaging service and the JBoss plugin for the serialization service. The rest of the tests where run with the localnode plugin for coordination, service and serialization.

Some conclusions:

  • Clojure is faster than Erlang in the base case chosen, computing fibonacci of a small number with a naive implementation.
  • Actor models primitives: spawn, send, receive are orders of magnitude faster in Erlang that implemented as library functions in Jobim.
  • The number of processes/actors that can be created in a single node is far inferior in Jobim than in Erlang. In a machine where the Erlang version of the ring test was running without problems with 1M processes, the JVM was struggling with ~50k evented actors. Threaded actors are even less scalable.
  • Erlang seems to have some issues with the size of network messages, Jobim faces less variation in the response times when the size of the messages varies
  • Evented actors offer far superior performance than the threaded implementation in the tests.
  • In tests involving actual network communication, Jobim gets closer to Erlang and it even outperforms Erlang in one case.

With these results, it is probably not a good idea to write Erlang-in-Clojure applications using Jobim. Nevertheless, the library starts to be performant enough to be used as a distribution mechanism for Clojure applications using actors semantics and some of Erlang features like actors linking or behaviours.

Barber shop problem again

Today, ther have been some talking about concurrency in Clojure vs Scala. Most of the debate was based on an excellent article Scala vs Clojure – Round 2: Concurrency! by Lau B. Jensen.

The article discusses the classic “barber shop” problem and offers two implementations one in Clojure, using Clojure concurrency primitives, and another one in Scala using Scala implementation of actors.

Since I’ve been working in the latest months in Jobim, an actor’s library for Clojure, I thought it would be interesting to offer an additional implementation using actors in Clojure. The code can be found here.

The implementation of the actors is pretty straightforward. It is interesting to compare it with the Scala version from the previous article.

The shop actor:

(defn do-shop
  "Shop actor"
  ([max-seats max-clients]
     (let [seats (atom [])
           max-clients (atom max-clients)]
       (loop [[msg pid] (receive)]
         (condp = msg
             ;; allocate a seat for the client
             :get-seat (if (< (count @seats) max-seats)
                         (do (swap! seats conj pid)
                             (send! pid :seated))
                         (send! pid :no-space))
             ;; barber asks for a client
             :next-client (let [pid (first @seats)]
                            (swap! seats rest)
                            (send! (resolve-name "the-barber") pid))
             ;; A client is leaving the shop
             :exiting (swap! max-clients dec))
         (if (> @max-clients 0)
           (recur (receive))
           (notify-shop "Closing barber shop"))))))

The customer:

;; Customer
(defn customer
  "Customer actor"
  ([id]
     (spawn
      ;; enters the shop and ask for a seat
      #(loop []
         (println (str "Customer id " id "..."))
         (let [seated (do (send! (resolve-name "barber-shop") [:get-seat (self)])
                          (receive))]
           ;; we have a seat, wait to have a hair cut
           (if (= seated :seated)
             (let [_ (send! (resolve-name "the-barber") :hey)
                    ;; Awaking the barber
                   msg (receive)]
               ;; barber starting to do his thing
               (when (= msg :cutting)
                 (notify-shop (str "Customer " id " getting his hair cut")))
               ;; barber finished
               (let [msg (receive)]
                 (when (= msg :finished)
                   (notify-shop (str "Customer " id " got his hair cut")))
                 ;; leaving the shop
                 (send! (resolve-name "barber-shop") 
                           [:exiting (self)])
                 (notify-shop (str "Customer " id " leaving the shop"))))
             ;; No space available in the shop, exiting
             (do
               (notify-shop (str "Customer " id " could not get his hair cut"))
               (Thread/sleep (rand 5000))
               (recur))))))))

And finally the barber:

;; Barber
(defn barber
  "Barber actor"
  ([]
     (let [pid (spawn
                #(loop [msg (receive)] ;; wait for a customer
                   (when (= msg :hey) ;; awaken!
                     ;; ask for the customer to come in
                     (send! (resolve-name "barber-shop") [:next-client (self)])
                     ;; ignore the customers trying to awake us
                     (let [client-pid (receive (fn [msg] (not (= :hey msg))))] 
                       ;; starting to cut
                       (send! client-pid :cutting)
                       (Thread/sleep (rand 4000))
                       ;; finished cutting
                       (send! client-pid :finished)))
                   (recur (receive))))]
       (register-name "the-barber" pid))))

Additionally, the implementation includes the user of a behaviour. Behaviours is a concept taken from Erlang that refers to a piece of functionality in a distributed system that is encapsulated so it can be easily reused.

In this case, since the customers can be executed in any node, if we would like to collect the messages produced by all the customers, we could not use the traditional mechanism of printing to the standard output. The messages will appear all across the cluster. Using an event manager and a event handler we can collect messages from all the customers in the cluster in a central location.
Behaviours in Jobim are implemented using Clojure’s protocols. In this case:

;; Message logger

(def-event-handler BarberLogger
  (init [this initial-state] initial-state)
  (handle-event [this event state]
                (println (str "*** " event)))
  (terminate [this state] (println ("BarberLogger exiting"))))

(defn notify-shop [msg]
  (notify (resolve-name "barber-manager") msg))

More information about the event manager behaviour can be found in the Erlang manual.

One of the fundamental advantages of using a platform like Erlang is that you can run the same application in a single node or in a set of nodes with minimal changes. In this example, there are a couple of function to start the application in a single node or in multiple node. The only difference is the way the customer actors are initiated:

(defn run-barber-shop
  "Runs the barber shop problem in a single node"
  ([num-seats num-clients]
     (start-event-manager "barber-manager")
     (add-handler (resolve-name "barber-manager")
                  [name jobim.examples.barber.BarberLogger]
                  nil)
     (shop num-seats num-clients)
     (barber)
     (doseq [i (range 0 num-clients)]
       (customer i))))

(defn run-barber-shop-distributed
  "Runs the barber shop problem distributed in a set of nodes"
  ([num-seats num-clients nodes]
     (start-event-manager "barber-manager")
     (add-handler (resolve-name "barber-manager")
                  [name jobim.examples.barber.BarberLogger]
                  nil)
     (shop num-seats num-clients)
     (barber)
     (loop [ids (range 0 num-clients)
            nodes (cycle nodes)]
       (if (empty? ids)
         (println (str "*** All clients spawned"))
         (do (rpc-blocking-call (resolve-node-name (first nodes))
                                "jobim.examples.barber/customer"
                                [(first ids)])
             (recur (rest ids)
                    (rest nodes)))))))

The result of a test run can be seen in this video:

As you can see, mixing actors in Clojure is just another possibility and it makes a really good alternative when you want to build systems spanning more than a single node.

Embedding V8 in a Clojure SPARQL library

Using Javascript in Java applications is a common practice. The easiest path to achieve this integration is to use the excellent Rhino library from the Mozilla Foundation. This Library is an implementation of Javascript in Java, so interop. from both sides is really easy.
Nevertheless, when I was in the need of re-using a Javascript implementation of a SPARQL 1.1 parser I has been working in a different Clojure project, I decided to try a different approach and use Google’s V8 Javascript engine. These are the results of this small experiment.

The V8 project has some instructions about how to embed the engine into a C application. V8 exposes JavaScript objects to C code through ‘handlers’ referencing these objects. Objects, can be createad and released from C but instead of manually tracking each object individually, they can be grouped into ‘handle scopes’. This mechanism simplifies the management of JavaScript objects memory, since all the objects associated to a scope can be garbage-collected by V8 once the scope is deleted. Additionallly, in the same way that a C object can be created in the heap or in the stack, V8 object handlers can also be persistent or transient.

V8 objects are created and manipulated in a certain ‘execution context’. These contexts are persistent between function calls, and a scope to declare or manipulate additional JS objects can always be retrieved from the context object. C data structures and functions can also be used in JS code using ‘templates’ that wrap the C object in the JS execution context.

As an example, the following code shows the constructor of a C++ class that loads the JS parser code from a file, initializes a persistent execution context, and evaluates the JS script in that context. As a result the JS parsing function will be available into that V8 execution context.

SparqlParser::SparqlParser() {

  char script[500000];

  // load the parser js code
  std::ifstream fin;
  fin.open("sparql_parser.js", std::ios::in);
  fin.read(script,500000);
  fin.close();

  // The context where the parser will be executed
  context = Context::New();
  HandleScope handle_scope;
  Context::Scope context_scope(context);

  // compiles the parser function
  Handle parserScript = Script::Compile(String::New(script));
  parserScript->Run();

};

The method parse from the same clase, just builds the invocation to the parse JS function, retrieves a new scope in the persistent execution context and executes the actual JS function invocation:

std::string SparqlParser::parse(std::string query) {
  Context::Scope context_scope(context);
  HandleScope handle_scope;

  // executes the parser function call on the global object
  std::string query_string = "sparql_query('";
  query_string = query_string.append(query);
  query_string = query_string.append("');");

  Handle runnerScript = Script::Compile(String::New(query_string.c_str()));
  Handle result = runnerScript->Run();

  if (!result.IsEmpty() && !result->IsUndefined()) {
    String::Utf8Value str(result);
    std::string toReturn = std::string(ToCString(str));

    return toReturn;
  } else {
    std::cout << "Something went wrong";
    return NULL;
  }

The destructor of the class, just releases the persistent execution context.

To be able to use this code from a Clojure function, a mechnism to execute native code from Java code must be used. The main alternatives are the Java Native Interface (JNI) and the newer Java Native Access (JNA). JNA offers a simpler and cleaner integration with Java code, and is well supported in Clojure and Leiningen thanks to projects like clojure-jna and clj-native. Unfortunately JNA is oriented towards plain C APIs and using C++ code like V8 supposes writing a C interface layer on top of C++ classes. Taking this into account, I decided to write a JNI interface, consisting of a Java class, that will wrap the C++ code using the JNI automatically generated C interface code.

This is the Java wrapper class:

package sparqltest;

public class SparqlParser {
     static {
         System.loadLibrary("SparqlParserWrapper");
     }

    private native void init();
    private native String parse_query(String query);

    private static SparqlParser parser;

    private SparqlParser() {
        init();
    }

    public String _parse(String query){
        return parse_query(query);
    }

    public static String parse(String query) {
        if(parser == null) {
            parser = new SparqlParser();
        }
        return parser._parse(query);
    }
 }

And this is the generated C interface, with the usual conversion between wrapped Java types and C types:

#include <jni.h>
#include <stdio.h>
#include "sparql_parser.h"
#include "sparqltest_SparqlParser.h"

SparqlParser parser;

JNIEXPORT void JNICALL Java_sparqltest_SparqlParser_init (JNIEnv *env , jobject obj) {
  parser = SparqlParser();
}

JNIEXPORT jstring JNICALL Java_sparqltest_SparqlParser_parse_1query (JNIEnv *env, jobject obj, jstring javaQuery) {

  const char *query = env->GetStringUTFChars(javaQuery, 0);
  std::string result = parser.parse(std::string(query));

  jstring javaResult = env->NewStringUTF(result.c_str());

  return javaResult;
}

After compiling and packaging the Java wrapper in a Jar file, we are ready to test the code from Clojure. Before that, the C++ code must have been compiled into a library that will be loaded by the JNI framework. The library must be located somewhere in the load path of JNI, as well as the V8 library if we have decided to compile it as a shared library. This path can be configured in the Leiningen project specification using the :native-path keyword (equivalent to use the -Djava.library.path argument in the Java invocation.
Since the C code, reads the JS parser from disk, the file containing the parser must also be located in the same directory where we are running the application.

If all the paths are OK and JNI can found the native libraries, the JS parser can be invoked without problems from Clojure code:

(ns clj-sparql-test.core
  (:import sparqltest.SparqlParser)
  (:use [clojure.contrib.json :only [read-json]]))

(defn parse-sparql [query]
  (read-json (SparqlParser/parse query)))

A sample invocation:

  user> (use 'clj-sparql-test.core)
  nil
  user> (parse-sparql "SELECT * { ?s ?p ?o }")
  [{:token "prologue",
    :base "",
    :prefixes []}
   {:kind "select",
    :token "executableunit",
    :dataset [],
    :projection [{:token "variable", :kind "*"}],
    :pattern {:token "groupgraphpattern",
              :patterns [{:token "basicgraphpattern",
                          :triplesContext 
                                [{:subject {:token "var", :value "s"},
                                  :predicate {:token "var", :value "p"},
                                  :object {:token "var", :value "o"}}]}],
                          :filters []}}]

The code of this experiment along with more details about the building of the differents components, ca be found here.