GraphQL is for silos

GraphQL is booming.

After GitHub released its GraphQL API, it’s clear that many other services and developers are going to try to adopt the technology. It may prove to have been a turning point.

There are many things to like in GraphQL: the type system, the idea of having, at last, an effective schema mechanism connecting client and server, or the view of a unified data graph you can query. Many of these ideas are not new at all, but if GraphQL is able to finally make them popular and accepted by development teams, I would be very happy.

However, there’s a main design problem with GraphQL that needs to be addressed: GraphQL is for building silos.

This open issue in GraphQL’s GitHub repository shows the problem clearly.

GraphQL exposes a single data tree through a single end-point. All your data is captured in that single data space and cannot reference or be referenced from other GraphQL end-points in a standard way.

Open, decentralised systems don’t exist in the GraphQL world in its current form (which is not a surprise considering the original authors of the technology).

Of course, most organisations are just building their own public-facing silo. Most of them have just a few clients they control directly: a JS web app, mobile apps, etc.
In this context GraphQL might be an attractive solution, especially because the integration between your client and the data layer is more sophisticated than what you can get with the mainstream interpretation of REST as some kind of HTTP+JSON combo.

But even if this is the case for your external-facing API, in your back-end the landscape probably looks a lot more like a loosely coupled federation of services trying to work together. In this context, HTTP is still the best glue to tie them together in a unified data layer.

It should be possible to modify GraphQL to solve this issue and make the technology open:

  • Replace IDs with URIs (or map them in a standard way): if I’m going to reference some object in your data graph I need to be able to refer to it in a unique and unambiguous way. Also, your identifiers and my identifiers need to coexist in the same identifier space. Relay global object IDs are half-way there.
  • Add namespaces for the types: If you are not alone in the data universe, you might not be the only one to define the ‘User’ type. Even better, we might want to re-use the same ‘User’ type. Extra points if the final identifier for the type is a URI and I can de-reference it to obtain the introspection query result for the type.
  • Add hyperlinks/pointers to the language: I want to hold references to objects in this or other graphs using their IDs/URIs.

With these three changes, and a shared authentication scheme, a single GraphQL end-point could be broken into many smaller federated micro-GraphQL end-points forming a single (real) data graph. This graph could also span multiple data sources within an organisation or across organisations. In a sentence, it could become a real alternative to HTTP and REST.

The flip side of all this is that the pieces and technologies to provide the same level of experience GraphQL offers to developers have been available for HTTP as W3C standards for more than a decade. From the foundational components to the latest bits and ideas to bind them together.

It’s sad, but the surge in popularity of GraphQL makes our failure in the Linked Data and SemWeb communities to offer value and fix real problems for developers all the more clear.

From RPC to Hypermedia API in 7 Steps

Hypermedia does not seem to be a popular option among API designers these days. On the other hand, REST (or RESTful) API design is supposed to be the orthodoxy in API design and implementation.
This is not a surprise, since REST, by simple abuse of the term, has become just a concept vaguely related to the use of ‘resources’ and a conventional mapping of CRUD operations to HTTP methods.

But REST should be much more than that. Fielding’s thesis was an attempt to systematically describe the network architecture of the web, built on a few key components (HTTP, URI, HTML). Architecture, in REST, means a process of imposing constraints over the many possible ways of building distributed network systems.
Each of these constraints limits the way we can build the system, but at the same time makes the final result more reliable, extensible, decoupled and able to scale. On the other hand, these same constraints can have a negative impact on other aspects, like raw performance. They are conscious trade-offs.

When building RESTful APIs these days, most of the problems come from fragility, coupling, and a lack of reliability and scalability. These are the very same problems that REST architectural principles should help us avoid.
There’s something wrong in our RESTful APIs.

A few months after the first web server was bootstrapped by Tim Berners-Lee at CERN, the web was already composed of distributed servers linking information between multiple organisations, and a couple of implementations of a generic client for the web were already being written. Both things are still impossible to achieve in the web of data APIs.

If we look again at the components of the web, HTTP is the only one we truly embrace. URI and HTML (or a version of HTML for data), and their combined use as hypermedia are lacking in most APIs.
This post tries to explore how hypermedia can be used in API design to fix some of the issues we have mentioned before: fragility, coupling and lack of scalability.
Instead of just discussing abstract notions and enumerating technologies and standards, the approach will be slightly different. We’ll start with a fully RPC API, with almost no other constraint than using the HTTP protocol for data transport, and we will add REST constraints through different revisions, trying to understand the trade-offs involved in each small change.

The toy API we will be improving is just another variation of a ‘Todos’ service, where users can register and create and edit lists of Todo notes.
I will use RAML to describe the API, but I could use Swagger/OpenAPI, Blueprints or any other API documentation format.

 

API Version 0: RPC

The initial version of the API is just a list of RPC operations invoked over the HTTP protocol.

#%RAML 1.0
title: Todos Service
baseUri: http://todosapp.com/api/version_0
mediaType: application/json
/usersServiceCreate:
  description: Creates a user
  post:
    body:
      properties:
        email: string
        name: string
    responses:
      200:
        body:
          properties:
            id: string
/usersServiceList:
  description: Lists all Users
  post:
    responses:
      200:
        body:
          type: array
          items:
            properties:
              id: string
              email: string
              name: string
/usersServiceDelete:
  description: Deletes a user
  post:
    body:
      properties:
        id: string
/todosServiceCreate:
  description: Creates Todos for a user
  post:
    body:
      properties:
        user_id: string
        todos:
          type: array
          items:
            properties:
              title: string
              description: string
    responses:
      200:
        body:
          properties:
            id: string
/todosServiceList:
  description: Lists all Todos
  post:
    responses:
      200:
        body:
          type: array
          items:
            properties:
              id: string
              user_id: string
              title: string
              description: string
/todosServiceDelete:
  description: Deletes todos
  post:
    body:
      properties:
        id: string[]

There are only two REST constraints in this design:

  • Client/server architecture: there’s a clear separation between the client side and the server side of the service, with a well-defined boundary between them and a clear separation of concerns.
  • Stateless architecture: all requests to the API include all the information required for the service to process them. There are no dependencies between requests, which can be issued in any order. Session state is kept in the client.

The API uses the HTTP protocol just as a transport layer for the invocation of remote functions in the server. The semantics of HTTP are ignored: all operations are bound to HTTP POST requests. The API doesn’t take advantage of any of the features of the protocol, like caching headers; HTTP is just a convenient mechanism to invoke remote functions.
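For example, creating a user in this version is just an HTTP POST against an operation-named end-point. An illustrative request, consistent with the spec above (the data values are, of course, made up):

curl -X POST http://todosapp.com/api/version_0/usersServiceCreate \
     -H "Content-Type: application/json" \
     -d '{"email": "john.doo@test.com", "name": "John Doo"}'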

Understanding this version of the API requires reading the textual description of the operation and examining the payloads of requests and responses. The API describes a private communication protocol.

Coupling between clients and servers is also high. Client code must be tailored to match the description of the API endpoints. Any change in the service interface will break the client and it’s impossible to rely on any kind of client library other than basic HTTP library support.

On the other hand, the removal of some REST constraints also has benefits. For example, since we are not imposing the notion of resources, we can save some requests to the server just by introducing end-points that accept arrays of identifiers to operate over sets of entities on the server side. It might also be possible to change the transport layer without changing the client code, just replacing the basic HTTP functionality with some other transport protocol. One example of this kind of API is Dropbox’s HTTP API.

API Version 1: Resources

The main change we are going to make in this version of the API is to add one of the main REST architectural elements: the notion of resource and resource representations.

If we just look at the version 0 of the API, we can see how different operations in the API rely on identifiers being received or returned by the operations endpoints. These identifiers designate entities in the service that operations in the API manipulate: users and todos.

If we assign an identifier, a URI, to each of these entities, we can consider them resources from a REST point of view. Resources are not to be confused with the actual state they are in. We can have resources for values that don’t exist yet, we can have multiple resources pointing to the same state, we can have resources whose state doesn’t change over time, and resources whose value changes constantly. The only important thing is that the semantics of the resource should not change; for example, the resource identified by http://todosapp.com/api/version_1/users/latest should always point to the latest user created in the system.

When we use the HTTP protocol to dereference an HTTP URI and retrieve the state associated with the resource the URI identifies, what we obtain is a representation of that state. This representation is fully transferred from the server to the client; it includes data and meta-data and can have multiple formats.
Clients need to select the most convenient representation for a resource among the representations available in the server using a mechanism known as content negotiation, involving the exchange of HTTP protocol headers containing media types.
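For instance, a client could ask for the same resource as JSON or as XML just by changing the Accept header. These requests are illustrative, using the URIs this version of the API will expose:

curl -H "Accept: application/json" http://todosapp.com/api/version_1/users/1
curl -H "Accept: application/xml" http://todosapp.com/api/version_1/users/1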

Finally, in order to define how resources can be manipulated using the HTTP protocol, REST imposes another architectural constraint on our API: the uniform interface.
This interface defines the contract between clients and servers, and it is based on the semantics of the HTTP protocol. It includes a small set of methods that define how resources can be created, read, updated and deleted (CRUD).

In this new version of the API we will use these architectural elements and constraints to rewrite our API:

  • We will group our entities into 4 main resource types: collections of users, individual users, collections of todos and individual todos
  • Different representations for the resources are introduced: JSON and XML
  • We take the CRUD functionality offered by the functions exposed in API version 0 and map it to the right HTTP methods

The outcome is an API similar to most RESTful APIs built nowadays, for example Twitter’s REST API.

#%RAML 1.0
title: Todos Service
baseUri: http://todosapp.com/api/version_1
mediaType: [ application/json, application/xml ]
types:
  User:
    properties:
      id: string
      email: string
      name: string
  Todo:
    properties:
      id: string
      title: string
      description: string
/users:
  get:
    description: Gets all the users in the service
    responses:
      200:
        body: User[]
  post:
    description: Creates a new user
    body:
      properties:
        email: string
        name: string
        password: string
    responses:
      201:
        headers:
          Location:
            type: string
/users/{id}:
  get:
    description: Information about a single user of the service
    responses:
      200:
        body: User
  delete:
    description: Deletes a user from the service
/users/{id}/todos:
  get:
    description: Lists all the Todos created by a user
    responses:
      200:
        body: Todo[]
  post:
    description: Creates a new Todo for a user
    body:
      properties:
        title: string
        description: string
    responses:
      201:
        headers:
          Location:
            type: string
/users/{id}/todos/{todo_id}:
  get:
    description: Information about a single Todo for a user
    responses:
      200:
        body: Todo
  delete:
    description: Deletes a Todo for a User from the service

One of the trade-offs we have introduced with the addition of resources and the HTTP uniform interface is that achieving certain functionality might involve a larger number of HTTP requests.

A partial solution to this problem can be found in another architectural constraint introduced by REST: cacheability.
Since the API respects the semantics of the HTTP protocol, the caching features built into HTTP can be used to reduce the number of requests required to retrieve the information available in the service. Caching in HTTP is based on a number of headers (Cache-Control, Expires, ETag) that can be used by clients, but also by any intermediate HTTP connectors, like proxies, to avoid new requests for resources whose state can be inferred to still be valid.
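As an illustration, a response carrying validation headers lets the client (or a proxy) revalidate it later without transferring the body again; the header values here are made up:

GET /api/version_1/users/1 HTTP/1.1
Accept: application/json

HTTP/1.1 200 OK
Cache-Control: max-age=60
ETag: "v1-user-1"

# a later conditional request can be answered without resending the representation
GET /api/version_1/users/1 HTTP/1.1
If-None-Match: "v1-user-1"

HTTP/1.1 304 Not Modified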

Another problem with this version of the API is that clients still need to know in advance the identifiers for the resources exposed through the HTTP interface. Relationships between resources are only implicitly present in the structure of the URIs. Clients need to know these identifiers and relations in advance, and any change in these identifiers or in the structure of the resources will break the client logic.
However, the adoption of the HTTP uniform interface can improve the support for re-usable libraries in clients. For example, if certain conventions are followed in the way URIs for the resources are constructed, generic ‘REST’ libraries like Rails’ ActiveResource can be used.

API Version 2: Hyperlinks

In this revision of the API we are going to explore a solution to the coupling between clients and servers we detected in version 1. We’ll address the problem by introducing a form of hypermedia: HTTP hyperlinks.

In version 1 of the API, clients need to know in advance the URIs of the resources in the API. Moreover, they need to compute URIs dynamically from hard-coded templates, extracting resource string identifiers from the representations retrieved from other resources. We’ll remove these requirements by adding links inside the resource representations.

Hypermedia can be defined as control information that is embedded in data, in our case in the representation of resources. Through hypermedia, the server bundles together with the data the mechanism that clients can use to request that the server perform a certain operation on a resource. Thanks to hypermedia, the interface to the information can be packaged with the information itself and carried away from the server.
One kind of hypermedia is the hyperlink. Hyperlinks encode a relation between resources identified by URIs. Following a link, a client can retrieve a representation of the resource through the HTTP uniform interface, without any need for documentation or prior knowledge of the resource URI.

We will introduce links into our API using properties marked with a special suffix, _url. Clients will need to understand the combination of these specially marked properties and the string values assigned to them as links to the related resources.
An example of this kind of API, with hyperlinks introduced in an ad-hoc way into resource payloads, is GitHub’s API.

#%RAML 1.0
title: Todos Service
baseUri: http://todosapp.com/api/version_2
mediaType: [ application/json, application/xml ]
types:
  User:
    properties:
      id: string
      email: string
      name: string
      self_url: string
      todos_url: string
  Todo:
    properties:
      id: string
      title: string
      description: string
      self_url: string
      user_url: string
# ...
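With this convention, the representation of a user might look like the following hypothetical payload, matching the types above:

{
  "id": "1",
  "email": "john.doo@test.com",
  "name": "John Doo",
  "self_url": "http://todosapp.com/api/version_2/users/1",
  "todos_url": "http://todosapp.com/api/version_2/users/1/todos"
}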

Just adding links to the payload improves the reliability of the API and decouples clients from servers. The structure of URIs can change as long as the semantics of the link properties remain the same. If that is the case, client code does not need to change: we just need to find the right property and follow the link to retrieve the resource.
However, one of the problems with this solution is the use of JSON to encode the representation of resources. JSON doesn’t have syntactical support for links; we can only encode them as JSON strings in the payload. Clients are still coupled to the server because it is impossible for them to discover links in the representation. Clients written to take advantage of links will only work for this particular API, and if the property introducing the link changes, even those clients will break.

API Version 3: Data Graph

In the previous version of this API we have shown the benefits of adding hyperlinks to the representation of resources. However, we found that our capacity to work with these links is limited when they are encoded as JSON, because this format does not support hyperlinks. XML, the other representation we support, has a similar problem, but being an extensible data format, links can be introduced in a standard way using XLink.

In order to deal with hyperlinks in JSON we need to augment its syntax. Since JSON doesn’t provide extensibility features, we will need a superset of the format identified by a completely new media type.

Many options have been proposed to introduce hypermedia on top of JSON; HAL and Siren are just two examples.

In this version of the API we will use for this purpose JSON-LD.

JSON-LD is a W3C recommendation, with a minimalistic syntax, good support for tools and increasing adoption.

Links in JSON-LD are URI strings wrapped into a { "@id": URI } JSON object. To avoid ambiguity in the properties used in JSON objects, JSON-LD properties must also be URIs. To provide a simple way to transform property strings into URIs, JSON-LD introduces a "@context" property where a "@vocab" prefix can be declared that applies to all properties in the object.

In the following revision of the API spec we will introduce a couple of additional types to describe JSON-LD syntax:

#%RAML 1.0
title: Todos Service
baseUri: http://todosapp.com/api/version_3
mediaType: [ application/ld+json, application/json, application/xml ]
types:
  URI: string
  Link:
    properties:
      @id: URI
  Context:
    properties:
      @vocab: URI
  User:
    properties:
      @context: Context
      @id: URI
      email: string
      name: string
      todos: Link
  Todo:
    properties:
      @context: Context
      @id: URI
      title: string
      description: string
      user: Link
# ...

In this version of the spec we have introduced the Link and Context types and the URI alias to support JSON-LD syntax. We have also replaced the old id identifiers with JSON-LD @id URI-based identifiers.

When requesting a particular resource from the API:

curl -X GET -H "Accept: application/ld+json" http://todosapp.com/api/version_3/users/1/todos/1

The JSON-LD representation returned by the server looks like:

{
  "@context": { "@vocab": "http://todosapp.com/api/vocab#" },
  "@id": "http://todosapp.com/api/version_3/users/1/todos/1",
  "title": "Test",
  "description": "Test TODO",
  "user": { "@id": "http://todosapp.com/api/version_3/users/1" }
}

The same representation could be expressed in a different way omitting the @context property like:
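In the expanded form each property is written out as its full vocabulary URI; a sketch of that expanded document:

{
  "@id": "http://todosapp.com/api/version_3/users/1/todos/1",
  "http://todosapp.com/api/vocab#title": "Test",
  "http://todosapp.com/api/vocab#description": "Test TODO",
  "http://todosapp.com/api/vocab#user": { "@id": "http://todosapp.com/api/version_3/users/1" }
}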


The payload of the JSON-LD representation of this resource can be understood as a data graph, where properties are links between resources identified by URIs or literal values:

[Figure: the Todo representation as a data graph]
The same graph can be serialised as a sequence of assertions of the form SUBJECT PREDICATE OBJECT:
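A sketch of that serialisation for the payload above, in an N-Triples-style notation with the vocabulary URIs written out:

<http://todosapp.com/api/version_3/users/1/todos/1> <http://todosapp.com/api/vocab#title> "Test" .
<http://todosapp.com/api/version_3/users/1/todos/1> <http://todosapp.com/api/vocab#description> "Test TODO" .
<http://todosapp.com/api/version_3/users/1/todos/1> <http://todosapp.com/api/vocab#user> <http://todosapp.com/api/version_3/users/1> .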

This capacity to interpret a JSON-LD document as a data graph is a powerful feature, especially when dealing with hyperlinks between resources.
In a REST API, state is transferred from the service to clients. Relations between resources are denoted by links in the representation of resources that can be traversed using the HTTP protocol. For example, we could follow the http://todosapp.com/api/vocab#user property contained in the previous representation:

curl -X GET -H "Accept: application/ld+json" http://todosapp.com/api/version_3/users/1

To retrieve the JSON-LD representation:

{
  "@context": { "@vocab": "http://todosapp.com/api/vocab#" },
  "@id": "http://todosapp.com/api/version_3/users/1",
  "name": "John Doo",
  "email": "john.doo@test.com",
  "todos": { "@id": "http://todosapp.com/api/version_3/users/1/todos" }
}

This JSON-LD document can also be interpreted as a data graph:

[Figure: the User representation as a data graph]

Both graphs can be merged in the client just by appending the assertions:
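Appending the assertions from both representations, in the same notation as above, would give:

<http://todosapp.com/api/version_3/users/1/todos/1> <http://todosapp.com/api/vocab#title> "Test" .
<http://todosapp.com/api/version_3/users/1/todos/1> <http://todosapp.com/api/vocab#description> "Test TODO" .
<http://todosapp.com/api/version_3/users/1/todos/1> <http://todosapp.com/api/vocab#user> <http://todosapp.com/api/version_3/users/1> .
<http://todosapp.com/api/version_3/users/1> <http://todosapp.com/api/vocab#name> "John Doo" .
<http://todosapp.com/api/version_3/users/1> <http://todosapp.com/api/vocab#email> "john.doo@test.com" .
<http://todosapp.com/api/version_3/users/1> <http://todosapp.com/api/vocab#todos> <http://todosapp.com/api/version_3/users/1/todos> .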

To obtain the combined data graph:

[Figure: the combined data graph after merging both representations]
We have been able to re-construct in the client a graph of data distributed across multiple HTTP resources, just by merging the underlying JSON-LD data models.
This distributed data graph model is specified in the RDF W3C recommendation, and it opens multiple possibilities for clients.

For example, we could query the local data graph using SPARQL and a client library like RDFStore-JS.

PREFIX todos: <http://todosapp.com/api/vocab#>
SELECT ?name (COUNT(?todo) AS ?todos)
WHERE {
  ?user todos:name ?name .
  ?todo todos:user ?user
}
GROUP BY ?name

To obtain the following results:

| ?name    | ?todos |
|----------|--------|
| John Doo | 1      |

This ability to merge and query distributed information, exposed as HTTP resources in a standard data format, is the foundation for building generic API clients completely decoupled from the server. It can be the core of a hypothetical generic client for APIs, equivalent to a web browser for the web of documents.

API Version 3.1: URIs

In version 3 of the API we have seen how to embed hyperlinks in the representations of resources. In this revision we are going to address a problem related to the way we are building identifiers for these resources.

The notion of a URI has evolved since the original conception of the web.
At the beginning, in a web of documents, URIs were just identifiers for these documents (or, more technically, information resources), specifying their location in the network and the network protocol that should be used to retrieve them.
But URIs can not only be used to locate documents; they can also be used to identify resources in the REST sense. REST resources are concepts (technically, non-information resources) that cannot be transmitted using the HTTP protocol. A resource can be anything, from a person to Mount Everest; the only condition is that resources are identified by URIs.

So the question is: what should a server do when a client issues an HTTP GET request for a URI identifying a non-information resource like Mount Everest?

This problem is known as ‘httpRange-14’, after the issue where it was first raised.

From a REST point of view the solution is clear: the resource cannot be transmitted, but a representation for the resource, agreed between client and server through content negotiation, can. The question is whether the URIs for the resource and for its representation should be the same.

Different solutions have been proposed; one is to use HTTP redirections to send clients from the resource to the representation of the resource.

Our favoured solution is to use two different kinds of URIs:

  • Hash URIs, like http://todosapp.com/api/users/1/todos/1#self, used for naming resources and other ‘non-information resources’
  • URLs, a subset of URIs that can be used as locators, like http://todosapp.com/api/users/1/todos/1, for ‘information resources’

The interesting thing about this solution is that it offers a mechanism to automatically relate both URIs.
Hash URIs cannot be directly dereferenced by an HTTP client. If a client tries to dereference a hash URI, the fragment is removed and the remaining URI, a potential URL, is sent in the request.
We can use this fact to distinguish the resource http://todosapp.com/api/users/1/todos/1#self from the JSON-LD document where the resource is described, http://todosapp.com/api/users/1/todos/1. Moreover, the hash URI will appear inside the data graph encoded in the document.

[Figure: the hash URI for the resource vs. the URL of the document describing it]

We will change our API so every resource will be identified using a hash URI. An example representation for a resource will look like:

[
  {
    "@context": { "@vocab": "http://todosapp.com/api/vocab#" },
    "@id": "http://todosapp.com/api/users/1/todos/1#self",
    "title": "Test",
    "description": "Test TODO",
    "user": { "@id": "http://todosapp.com/api/users/1#self" }
  }
]

An example of an API addressing the httpRange-14 issue is DBpedia. If we try to request the Mount Everest resource (http://dbpedia.org/resource/Mount_Everest) in DBpedia with a JSON-LD representation (or with a web browser), we will be redirected to the right URL for the resource representation:

$ curl -iv -H "Accept: application/ld+json" http://dbpedia.org/resource/Mount_Everest
* Trying 194.109.129.58...
* Connected to dbpedia.org (194.109.129.58) port 80 (#0)
> GET /resource/Mount_Everest HTTP/1.1
> Host: dbpedia.org
> User-Agent: curl/7.43.0
> Accept: application/ld+json
>
< HTTP/1.1 303 See Other
HTTP/1.1 303 See Other
< Location: http://dbpedia.org/data/Mount_Everest.jsonld
Location: http://dbpedia.org/data/Mount_Everest.jsonld
# ...
<
* Connection #0 to host dbpedia.org left intact
$ curl -iv -H "Accept: text/html" http://dbpedia.org/resource/Mount_Everest
* Trying 194.109.129.58...
* Connected to dbpedia.org (194.109.129.58) port 80 (#0)
> GET /resource/Mount_Everest HTTP/1.1
> Host: dbpedia.org
> User-Agent: curl/7.43.0
> Accept: text/html
>
< HTTP/1.1 303 See Other
HTTP/1.1 303 See Other
< Location: http://dbpedia.org/page/Mount_Everest
Location: http://dbpedia.org/page/Mount_Everest
# ...
<
* Connection #0 to host dbpedia.org left intact

API Version 3.2: Collections

In this revision of the API we are going to introduce the concept of collection resources.

Up to this point we have paid attention to individual resources, Users and Todos. We have also introduced hash URIs to identify them properly and distinguish them from their representations. But if we look at URIs like http://todosapp.com/api/version_3_1/users, they are being treated as URLs pointing to a JSON-LD document containing a data graph that can be fully partitioned into as many sub-graphs as there are user resources in the service. The problem is that these collections of resources are not resources themselves: they don’t have a hash URI identifying them, and they will never appear as a subject resource in the client’s data graph.

Furthermore, since they are not resources, we cannot describe them. We cannot, for instance, assert that the collection of users is paginated, or at what time the last user was created.

To solve this problem, we will introduce a Collection type of resource, with its own hash URI and a property members pointing to the URIs of the contained resources.

#%RAML 1.0
title: Todos Service
baseUri: http://todosapp.com/api/version_3_2
mediaType: [ application/ld+json, application/json, application/xml ]
types:
  URI: string
  Link:
    properties:
      @id: URI
  Context:
    properties:
      @vocab: URI
  User:
    properties:
      @context: Context
      @id: URI
      email: string
      name: string
      todos: Link
  Todo:
    properties:
      @context: Context
      @id: URI
      title: string
      description: string
      user: Link
  Collection:
    properties:
      @context: Context
      @id: URI
      members: Link[]
/users:
  get:
    description: Gets all the users in the service
    responses:
      200:
        body: Collection
  post:
    description: Creates a new user
    body:
      properties:
        email: string
        name: string
        password: string
    responses:
      201:
        headers:
          Location:
            type: string
/users/{id}:
  get:
    description: Information about a single user of the service
    responses:
      200:
        body: User
  delete:
    description: Deletes a user from the service
/users/{id}/todos:
  get:
    description: Lists all the Todos created by a user
    responses:
      200:
        body: Collection
  post:
    description: Creates a new Todo for a user
    body:
      properties:
        title: string
        description: string
    responses:
      201:
        headers:
          Location:
            type: string
/users/{id}/todos/{todo_id}:
  get:
    description: Information about a single Todo for a user
    responses:
      200:
        body: Todo
  delete:
    description: Deletes a Todo for a User from the service

If we now follow a link leading to a Collection resource in our data graph we will obtain the following representation:
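For instance, the collection of Todos for a user might be represented like this (an illustrative payload following the Collection type above):

{
  "@context": { "@vocab": "http://todosapp.com/api/vocab#" },
  "@id": "http://todosapp.com/api/version_3_2/users/1/todos#self",
  "members": [
    { "@id": "http://todosapp.com/api/version_3_2/users/1/todos/1#self" },
    { "@id": "http://todosapp.com/api/version_3_2/users/1/todos/2#self" }
  ]
}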

And we could add that information to our data graph to obtain a more connected graph:

[Figure: data graph including the Collection resource]

Another design decision we have taken is to reduce the properties of the members of the collection resource to just the link to the member resource. That’s the only requirement from the hypermedia point of view, but it also means that we need to issue a new HTTP request if we want to obtain any other information about the member resource.
The opposite solution would be to embed the full member resource data graph in the collection representation, minimising the number of requests required to retrieve the information but increasing the size of the response payload. Intermediate solutions, where some properties are in-lined, are also possible. The designer of the API must decide what the right granularity is for her use case.

API Version 3.3: Entry Point

In version 3.2 of the API we have connected our data graph of resources by introducing Collection resources. However, there’s still a resource we cannot reach: the collection of all the users, http://todosapp.com/api/version_3_2/users#self. We still need to document that ‘end-point’ in our RAML API spec for a client capable of following hyperlinks to be able to explore our API.

In fact, that’s the only URI we need to document, since all the resources in the API include hyperlinks in their representations and they are documented in the description of the resource types.

One of the goals of REST design is that a single URI should be enough for a client to consume the API. We will make that idea explicit by introducing a new resource type, EntryPoint, with a single resource: the URI that will be the entry point for our API. The entry point itself will only include a link to the collection of all users.
We will also remove all the other end-points in our RAML specification, leaving only the reference to the entry point URI.

#%RAML 1.0
title: Todos Service
baseUri: http://todosapp.com/api/version_3_3
mediaType: [ application/ld+json, application/json, application/xml ]
types:
  URI: string
  Link:
    properties:
      @id: URI
  Context:
    properties:
      @vocab: URI
  EntryPoint:
    properties:
      @context: Context
      @id: URI
      users: Link
  User:
    properties:
      @context: Context
      @id: URI
      email: string
      name: string
      todos: Link
  Todo:
    properties:
      @context: Context
      @id: URI
      title: string
      description: string
      user: Link
  Collection:
    properties:
      @context: Context
      @id: URI
      members: Link[]
/:
  get:
    description: Entry point for the application
    responses:
      200:
        body: EntryPoint
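A representation of the entry point itself could then be as small as this (illustrative payload):

{
  "@context": { "@vocab": "http://todosapp.com/api/vocab#" },
  "@id": "http://todosapp.com/api/version_3_3/#self",
  "users": { "@id": "http://todosapp.com/api/version_3_3/users#self" }
}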

An example of an API with an entry point is, again, GitHub’s HTTP API, where it is called the ‘Root Endpoint’.

$ curl -X GET https://api.github.com/
{
  "current_user_url": "https://api.github.com/user",
  "current_user_authorizations_html_url": "https://github.com/settings/connections/applications{/client_id}",
  "authorizations_url": "https://api.github.com/authorizations",
  "code_search_url": "https://api.github.com/search/code?q={query}{&page,per_page,sort,order}",
  "emails_url": "https://api.github.com/user/emails",
  "emojis_url": "https://api.github.com/emojis",
  "events_url": "https://api.github.com/events",
  "feeds_url": "https://api.github.com/feeds",
  "followers_url": "https://api.github.com/user/followers",
  "following_url": "https://api.github.com/user/following{/target}",
  "gists_url": "https://api.github.com/gists{/gist_id}",
  "hub_url": "https://api.github.com/hub",
  "issue_search_url": "https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}",
  "issues_url": "https://api.github.com/issues",
  "keys_url": "https://api.github.com/user/keys",
  "notifications_url": "https://api.github.com/notifications",
  "organization_repositories_url": "https://api.github.com/orgs/{org}/repos{?type,page,per_page,sort}",
  "organization_url": "https://api.github.com/orgs/{org}",
  "public_gists_url": "https://api.github.com/gists/public",
  "rate_limit_url": "https://api.github.com/rate_limit",
  "repository_url": "https://api.github.com/repos/{owner}/{repo}",
  "repository_search_url": "https://api.github.com/search/repositories?q={query}{&page,per_page,sort,order}",
  "current_user_repositories_url": "https://api.github.com/user/repos{?type,page,per_page,sort}",
  "starred_url": "https://api.github.com/user/starred{/owner}{/repo}",
  "starred_gists_url": "https://api.github.com/gists/starred",
  "team_url": "https://api.github.com/teams",
  "user_url": "https://api.github.com/users/{user}",
  "user_organizations_url": "https://api.github.com/user/orgs",
  "user_repositories_url": "https://api.github.com/users/{user}/repos{?type,page,per_page,sort}",
  "user_search_url": "https://api.github.com/search/users?q={query}{&page,per_page,sort,order}"
}

 

API Version 4: Read-Write API

In the different revisions 3.X of the API we have built a quite sophisticated REST API with support for hyperlinks.

The main problem with the resulting API is that it is read-only. It would be equivalent to a version of the web working with HTML supporting only ‘<a></a>’ tags and without support for ‘<form></form>’ elements.

To define a read-write version of the API, we need to support richer forms of hypermedia encoding control information for the whole HTTP uniform interface.

JSON-LD only provides syntax to encode hyperlinks in the JSON-LD document, but this is not a problem, since we can insert the required hypermedia controls in the data graph, encoding them in the same way we encode the data.

For example, let’s create a new type in our RAML specification called Operation, and let’s allow Links to have an associated array of operations.
We will also introduce two ‘template’ types, UserTemplate and TodoTemplate, to document the shape of the payload required by the service to create a new User and a new Todo respectively:

#%RAML 1.0
title: Todos Service
baseUri: http://todosapp.com/api/version_4
mediaType: [ application/ld+json, application/json, application/xml ]
types:
  URI: string
  Type: string
  Operation:
    properties:
      method:
        enum: [ GET, POST, PUT, DELETE, PATCH, HEAD, OPTIONS ]
      expects: Type
      returns: Type
  Link:
    properties:
      @id: URI
      operations: Operation[]
  Context:
    properties:
      @vocab: URI
  EntryPoint:
    properties:
      @context: Context
      @id: URI
      users: Link
  UserTemplate:
    properties:
      name?: string
      email: string
  User:
    properties:
      @context: Context
      @id: URI
      email: string
      name: string
      todos: Link
  Todo:
    properties:
      @context: Context
      @id: URI
      title: string
      description: string
      user: Link
  TodoTemplate:
    properties:
      title: string
      description?: string
  Collection:
    properties:
      @context: Context
      @id: URI
      members: Link[]
/:
  get:
    description: Entry point for the application
    responses:
      200:
        body: EntryPoint

Now, if we request any resource in the API we will retrieve a much richer set of hypermedia controls with enough control information to consume the whole HTTP uniform interface.

{
  "@context": { "@vocab": "http://todosapp.com/api/vocab#" },
  "@id": "http://todosapp.com/api/version_4/users/1#self",
  "name": "John Doo",
  "email": "john.doo@test.com",
  "todos": {
    "@id": "http://todosapp.com/api/version_4/todos#self",
    "operations": [
      {
        "method": "GET",
        "returns": "Collection"
      },
      {
        "method": "POST",
        "expects": "TodoTemplate",
        "returns": "Todo"
      }
    ]
  }
}

Our data graph has also been enriched with additional nodes for the hypermedia controls. The control nodes are, in this particular implementation, anonymous blank nodes in the graph, with identifiers that cannot be dereferenced outside the local graph.

[Figure: data graph enriched with hypermedia control nodes]

Consuming this version of the API, a generic hypermedia client could query the data graph encoded in the resource representation using SPARQL to discover the embedded hypermedia controls:

SELECT ?resource ?method
WHERE {
  ?resource <http://todosapp.com/api/vocab#operations> ?operation .
  ?operation <http://todosapp.com/api/vocab#method> ?method .
}
ORDER BY (?resource)

To obtain the results:

| ?resource                                      | ?method |
|------------------------------------------------|---------|
| <http://todosapp.com/api/version_4/todos#self> | GET     |
| <http://todosapp.com/api/version_4/todos#self> | POST    |

API Version 5: Meta-data

Two of the major goals of REST are robustness and extensibility: the ability to gradually deploy changes in the architecture.
This is granted by the extensible elements of the HTTP protocol, like versioning or headers and by separating the parsing of the messages from their semantic. HTTP also constrains these messages to be self-descriptive by the inclusion of meta-data in the resource representation.
In the web of documents the notion of self-descriptive messages is usually equated to assigning media types to the messages. This is usually just a way of describing the format of the message for the client to be able to select a supported representation.

When we are working with APIs, we should also try to make our messages, the returned representations of resources, self-descriptive, introducing as much meta-data as is required for a client to automatically process the representation.

In a narrow sense this conception of meta-data matches the notion of a schema: “the payload is a map with two properties (name and email), both of type ‘string’”. It can also mean the notion of a ‘type’ for the resource: “User”, “Todo”. Ultimately it should mean the semantics of the resource: “name and email of the persons using our service”.

In version 4 of the API this meta-data is encoded in our RAML specification. The main problem with this approach is that this information is not encoded in the resource representations at all. So far, it lives in a document completely outside of the API and it is not available to API clients.

In this version of the API we are going to make our API representations self-descriptive using hypermedia. First, we are going to introduce a property, resource_type, designating the type of the different resources. Second, we will add a hyperlink to the RAML description of the API:

#%RAML 1.0
title: Todos Service
baseUri: http://todosapp.com/api/version_5
mediaType: [ application/ld+json, application/json, application/xml ]
types:
  URI: string
  Type: string
  Operation:
    properties:
      method:
        enum: [ GET, POST, PUT, DELETE, PATCH, HEAD, OPTIONS ]
      expects: Type
      returns: Type
  Link:
    properties:
      @id: URI
      operations: Operation[]
  Context:
    properties:
      @vocab: URI
  EntryPoint:
    properties:
      @context: Context
      @id: URI
      described_by: Link
      resource_type: Type
      users: Link
  UserTemplate:
    properties:
      name: string
      email?: string
  User:
    properties:
      @context: Context
      @id: URI
      described_by: Link
      resource_type: Type
      email: string
      name: string
      todos: Link
  Todo:
    properties:
      @context: Context
      @id: URI
      described_by: Link
      resource_type: Type
      title: string
      description: string
      user: Link
  TodoTemplate:
    properties:
      title: string
      description?: string
  Collection:
    properties:
      @context: Context
      @id: URI
      described_by: Link
      resource_type: Type
      members: Link[]
/:
  get:
    description: Entry point for the application
    responses:
      200:
        body: EntryPoint

Now, if we request any resource of the API, the payload will include the additional meta-data:

{
  "@context": { "@vocab": "http://todosapp.com/api/vocab#" },
  "@id": "http://todosapp.com/api/version_5/users/1#self",
  "resource_type": "User",
  "name": "John Doo",
  "email": "john.doo@test.com",
  "described_by": { "@id": "http://todosapp.com/api/version_5/spec.raml" },
  "todos": {
    "@id": "http://todosapp.com/api/version_5/todos#self",
    "operations": [
      {
        "method": "GET",
        "returns": "Collection"
      },
      {
        "method": "POST",
        "expects": "TodoTemplate",
        "returns": "Todo"
      }
    ]
  }
}

And the additional nodes will be added to the resource data graph:

[Figure: data graph with the additional meta-data nodes]

This solution has a number of shortcomings:

  • Some meta-data is encoded as resources with URIs inside the data graph (like the hypermedia controls), while other meta-data is encoded as RAML in a document linked through a URL
  • The connection is made by a plain string, the type name in the RAML document; it is impossible to link it using a hyperlink
  • Clients need two different parsers, one for JSON-LD and another one for RAML
  • The protocol to discover meta-data is completely private: the client needs to understand the http://todosapp.com/api/vocab#described_by property
  • The size of the payloads keeps increasing with more meta-data

API Version 6: Self-descriptive API

A possible solution for the issues we have found in version 5 of the API, one that is consistent with our design, is to move the meta-data from the representations to a different plane above them, and link the meta-data to the resources using a hyperlink.
A possible implementation of this solution is to introduce a new resource type, Class, in the RAML specification. This type will encode the information of the RAML spec itself. With this resource type defined, we can introduce another new resource of type ApiDocumentation, exposing the RAML specification information as yet another resource in the API with its own URI.

#%RAML 1.0
title: Todos Service
baseUri: http://todosapp.com/api/version_6
mediaType: application/ld+json
types:
  URI: string
  Operation:
    properties:
      method:
        enum: [ GET, POST, PUT, DELETE, PATCH, HEAD, OPTIONS ]
      expects: Link
      returns: Link
  Link:
    properties:
      @id: URI
      operations: Operation[]
  Context:
    properties:
      @vocab: URI
  Class:
    properties:
      @id: URI
      operations: Operation[]
      properties:
        @id: URI
        domain: URI
        range: URI
      links: Link[]
  Resource:
    properties:
      @id: URI
      @type: URI
  ApiDocumentation:
    properties:
      @id: URI
      @type: URI
      classes: Class[]
      entryPoint: URI
/spec:
  get:
    description: API specification
    responses:
      200:
        body: ApiDocumentation
/:
  get:
    description: Entry point for the application
    responses:
      200:
        body: Resource

Now, if we try to retrieve an API resource, we will obtain a much smaller data graph in the resource representation:
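A sketch of such a representation, where the only meta-data left inline is the @type link (the exact type URI is illustrative):

{
  "@context": { "@vocab": "http://todosapp.com/api/vocab#" },
  "@id": "http://todosapp.com/api/version_6/users/1#self",
  "@type": "http://todosapp.com/api/version_6/spec#User",
  "name": "John Doo",
  "email": "john.doo@test.com",
  "todos": { "@id": "http://todosapp.com/api/version_6/users/1/todos#self" }
}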

The @type keyword in JSON-LD is a way of denoting the type of a typed node in the encoded data graph, and it will be expanded to the URI http://www.w3.org/1999/02/22-rdf-syntax-ns#type. If the client now follows the provided @type link, all the meta-data for the resource will be retrieved from the server.
It is important to note that types are no longer strings; they are URIs pointing to resources in the same way properties are. That’s the reason we can now connect them with hyperlinks.

{
  "@context": { "@vocab": "http://todosapp.com/api/vocab#" },
  "@id": "http://todosapp.com/api/version_6/spec#self",
  "@type": "http://todosapp.com/api/version_6/spec#ApiDocumentation",
  "entryPoint": { "@id": "http://todosapp.com/api/version_6/#self" },
  "classes": [
    {
      "@id": "http://todosapp.com/api/version_6/spec#EntryPoint",
      "operations": [
        {
          "method": "GET",
          "returns": { "@id": "http://todosapp.com/api/vocab#EntryPoint" }
        }
      ],
      "links": [
        {
          "@id": "http://todosapp.com/api/vocab#users_url",
          "operations": [
            {
              "method": "GET",
              "returns": { "@id": "http://todosapp.com/api/vocab#Collection" }
            },
            {
              "method": "POST",
              "expects": {
                "@type": "http://todosapp.com/api/vocab#Class",
                "properties": [
                  {
                    "@id": "http://todosapp.com/api/vocab#name",
                    "range": { "@id": "http://www.w3.org/2001/XMLSchema#string" }
                  },
                  {
                    "@id": "http://todosapp.com/api/vocab#email",
                    "range": { "@id": "http://www.w3.org/2001/XMLSchema#string" }
                  }
                ]
              },
              "returns": { "@id": "http://todosapp.com/api/vocab#User" }
            }
          ]
        }
      ]
    }
  ]
}

Both graphs, data and meta-data, can be merged into a single unified data graph in the client that can be queried in the same way as in previous versions.

[Figure: merged data and meta-data graphs]

This distinction between two related ‘planes’ of information, one for data and one for meta-data, is what is technically known as the distinction between the TBox (terminological box) and the ABox (assertional box). From a logical point of view, the first component contains the conceptualisation of the system, the semantics of the data model; the second contains the data, regarded as assertions over the concepts defined in the TBox.

API Version 7: Generic Hypermedia API

In the previous version we have transformed the meta-data in our RAML API description into resources and we have exposed them using REST principles to describe the semantics of our API.

Both kinds of resources are linked and form a single data graph that can be discovered and explored by any client that understands the semantics of the hypermedia controls.

At this point we no longer need the RAML document; the only thing we require is the URI of the single ‘EntryPoint’ resource.

The main problem of version 6 of the API is that the vocabulary used to describe hypermedia control information and the semantics of the resource types is private to our API.

In order to build a generic hypermedia API client, a truly generic API browser, we would need a standard way of describing that information, a common vocabulary for hypermedia in APIs.

Fortunately, an effort to standardise such a vocabulary is currently underway at the W3C. It is called Hydra.

Hydra provides a rich vocabulary for describing hypermedia in APIs. The vocabulary we have used up to version 6 is just a sub-set of Hydra.

We could rewrite our ApiDocumentation class to use Hydra to describe the API so any generic Hydra browser could consume it:

{
  "@context": {
    "hydra": "http://www.w3.org/ns/hydra/core#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@id": "http://todosapp.com/api/vocab#self",
  "@type": "hydra:ApiDocumentation",
  "hydra:entryPoint": { "@id": "http://todosapp.com/api/version_7/#self" },
  "hydra:supportedClass": [
    {
      "@id": "http://todosapp.com/api/vocab#User",
      "@type": "hydra:Class",
      "hydra:supportedOperation": [
        {
          "hydra:method": "GET",
          "hydra:returns": { "@id": "http://todosapp.com/api/vocab#User" }
        },
        {
          "hydra:method": "DELETE"
        }
      ],
      "hydra:supportedProperty": [
        {
          "hydra:property": {
            "@id": "http://todosapp.com/api/vocab#name",
            "rdfs:range": "xsd:string"
          },
          "readonly": false,
          "required": false
        },
        {
          "hydra:property": {
            "@id": "http://todosapp.com/api/vocab#email",
            "rdfs:range": "xsd:string"
          },
          "readonly": false,
          "required": true
        },
        {
          "hydra:property": {
            "@id": "http://todosapp.com/api/vocab#todos",
            "@type": "hydra:Link",
            "hydra:supportedOperation": [
              {
                "hydra:method": "GET",
                "hydra:returns": { "@id": "hydra:Collection" }
              },
              {
                "hydra:method": "POST",
                "hydra:expects": { "@id": "http://todosapp.com/api/vocab#Todo" },
                "hydra:returns": { "@id": "http://todosapp.com/api/vocab#Todo" }
              }
            ]
          }
        }
      ]
    },
    {
      "@id": "http://todosapp.com/api/vocab#Todo",
      "@type": "hydra:Class",
      "hydra:supportedOperation": [
        {
          "hydra:method": "GET",
          "hydra:returns": { "@id": "http://todosapp.com/api/vocab#Todo" }
        },
        {
          "hydra:method": "DELETE"
        }
      ],
      "hydra:supportedProperty": [
        {
          "hydra:property": {
            "@id": "http://todosapp.com/api/vocab#title",
            "rdfs:range": "xsd:string"
          },
          "readonly": false,
          "required": true
        },
        {
          "hydra:property": {
            "@id": "http://todosapp.com/api/vocab#description",
            "rdfs:range": "xsd:string"
          },
          "readonly": false,
          "required": false
        },
        {
          "hydra:property": {
            "@id": "http://todosapp.com/api/vocab#user",
            "@type": "hydra:Link",
            "hydra:supportedOperation": [
              {
                "hydra:method": "GET",
                "hydra:returns": { "@id": "http://todosapp.com/api/vocab#User" }
              }
            ]
          }
        }
      ]
    },
    {
      "@id": "http://todosapp.com/api/vocab#EntryPoint",
      "@type": "hydra:Class",
      "hydra:supportedOperation": [
        {
          "hydra:method": "GET",
          "hydra:returns": { "@id": "http://todosapp.com/api/vocab#EntryPoint" }
        }
      ],
      "hydra:supportedProperty": [
        {
          "hydra:property": {
            "@id": "http://todosapp.com/api/vocab#users_url",
            "@type": "hydra:Link",
            "hydra:supportedOperation": [
              {
                "hydra:method": "GET",
                "hydra:returns": { "@id": "hydra:Collection" }
              },
              {
                "hydra:method": "POST",
                "hydra:expects": { "@id": "http://todosapp.com/api/vocab#User" },
                "hydra:returns": { "@id": "http://todosapp.com/api/vocab#User" }
              }
            ]
          }
        }
      ]
    }
  ]
}

Additionally, we will introduce another discovery mechanism in our server: we will include an HTTP Link header, as described in RFC 5988 and prescribed in the Hydra recommendation draft, to make clients aware of the location of the meta-data graph:
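A response from the API would then carry a header along these lines; the URL of the documentation resource is, of course, specific to each deployment:

HTTP/1.1 200 OK
Content-Type: application/ld+json
Link: <http://todosapp.com/api/version_7/spec>; rel="http://www.w3.org/ns/hydra/core#apiDocumentation"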

Now we have a truly generic, hypermedia-enabled API.

This is just a starting point. Since we have support in our API for the same powerful architecture that powers the web, we can use it to change the way we address a number of problems in APIs: client identity, authentication, ACLs, validation, API integration and client collaboration. All these problems could be solved in a truly decentralised and scalable way, just relying on the support for hypermedia and meta-data that this architecture for building APIs enables.

Building a Resque like job processing system for JVM apps with RabbitMQ

RabbitMQ is a complex beast.
It’s flexible and powerful, but also hard to fully grasp and master.
Many different use cases and usage patterns can be built on top of this powerful piece of software, but missteps and design errors in the first attempts at coding a particular solution are also commonplace.
In this article I will discuss the design and implementation of Palermo, a system implementing one of the possible usage patterns that can be built using RabbitMQ as the underlying queuing mechanism: a batch job processing system.

By batch job processing system we refer to a mechanism for the automated execution of jobs by workers that extract jobs from different queues. Clients can enqueue new jobs into these queues, and the jobs will eventually be delivered to the workers that execute them. If a job fails during execution, the worker thread will place the failed job into a special queue where it can be retried or inspected for debugging.
The main inspiration when building Palermo has been Resque, a job processing system built by GitHub using Ruby and Redis.
Resque makes it possible for users of the system to define queues with different names, enqueue jobs matching the name of a Ruby class in the system with certain input arguments, and start worker processes on different machines that will consume the jobs, instantiate the Ruby class and perform the job with the provided input arguments. If the job execution fails, the job will be routed to a special ‘failed’ job queue where it can be retried or removed. Resque workers are well integrated with the underlying operating system, being able to deal with incoming signals as a way for system operators to control the execution of jobs. It also provides a web interface where the state of jobs and workers can be inspected, and certain actions, like re-enqueueing a job or clearing the jobs in a queue, can be performed.
From a developer’s point of view, the main advantage of Resque is the extreme ease of use of the whole system. Defining and enqueuing jobs requires just a few lines of Ruby code, and the system performs in a consistent and robust way.

The aim of Palermo is to build a job processing system as easy to use and robust as Resque but for JVM languages and using RabbitMQ as the underlying queuing mechanism instead of Redis.

From a formal point of view, a job processing system’s queuing mechanism can be defined as a tuple space, a kind of persistent associative memory where some processes enqueue jobs by writing tuples with the form:


write(queue_name, job_type, input_argument)

Worker processes remove tuples from memory, one at a time, by using a predicate matching the queue name:


read(queue_name, ?, ?)

The only additional constraint that must be added to this formulation of the job processing queuing mechanism as a tuple space is that the read function over the distributed memory must adhere to first-in, first-out (FIFO) semantics. RabbitMQ queues can be viewed as this kind of memory with FIFO semantics. The name of the queue serves as the first argument passed to the `read` predicate to extract the next matching tuple stored in memory.

[Figure: the Palermo job queues viewed as a tuple space]

In order to get the desired tuple space semantics for reading we need to deal with some of the RabbitMQ internals and configure the following options:

– Persistence of queues
– Persistence of messages
– Quality of service
– Message acknowledgments

Since the memory we need to use must be persistent, we need to add persistence to the queues and messages managed by RabbitMQ. In order to do that, queues must be declared in RabbitMQ as `durable` and messages as `persistent`. In this way, even if the RabbitMQ broker goes down, information about the queues that have been created and the pending messages will survive a restart.

When more than one worker is connected to the same queue, RabbitMQ will round-robin messages between all the available workers. The main problem is that RabbitMQ will deliver a message to the next worker as soon as the message arrives at the queue, regardless of whether that worker is already processing a different message. If one job takes a long time to process, incoming messages can pile up in the local buffer of the busy worker while the other workers starve. We can deal with this situation using two features of RabbitMQ: message acknowledgments and quality of service/pre-fetch counts.

First of all, the worker can be required to notify RabbitMQ when it has processed the message, with an explicit acknowledgment. Messages will not be removed from the queue until they are acknowledged. If the worker dies without acknowledging a message, it will automatically be re-enqueued by RabbitMQ.

At the same time, the quality of service (qos) configuration can be used to tell RabbitMQ the maximum number of unacknowledged messages that can be sent to a particular worker. If we set this value, known as the pre-fetch count, to 1, no other message will be sent from the queue until the last message has been acknowledged by the worker process.
In this way, a fair dispatch of the jobs between the workers can be achieved.
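A minimal sketch of this worker-side configuration using the plain RabbitMQ Java client (not Palermo’s actual code; the queue name and job-handling logic are placeholders):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class WorkerSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // Durable queue: survives a broker restart ("jobs" is a placeholder name)
        channel.queueDeclare("jobs", true, false, false, null);
        // Pre-fetch count of 1: no new message is delivered until the last one is acked
        channel.basicQos(1);

        DeliverCallback onDelivery = (consumerTag, delivery) -> {
            try {
                processJob(delivery.getBody()); // placeholder for the actual job logic
                // Explicit acknowledgment: only now is the message removed from the queue
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            } catch (Exception e) {
                // Reject without requeueing; Palermo instead re-publishes to a 'failed' queue
                channel.basicNack(delivery.getEnvelope().getDeliveryTag(), false, false);
            }
        };

        // autoAck = false, so acknowledgments are manual
        channel.basicConsume("jobs", false, onDelivery, consumerTag -> { });
    }

    private static void processJob(byte[] payload) {
        // deserialize the job arguments and run the job class here
    }
}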

[Figure: RabbitMQ set-up with explicit acknowledgments and a pre-fetch count of 1]

We still need to find a way to implement the write operation into the memory. One important aspect of RabbitMQ that makes it different from other queuing systems is the strong separation of concerns between publishers and consumers. In the RabbitMQ architecture this is achieved by the introduction of an exchange between the queues and the publishers. Publishers don’t know about the final destination of their messages; they only need the name of an exchange and a routing key that can be seen as the address the message is being sent to.
RabbitMQ supports different semantics for different exchange types, which affect the way a message with an address matching an exchange will be delivered to the actual mailboxes (queues) where the consumers are receiving messages.
In our case, the first argument of the write function is the queue name. The queue name is also the argument to the predicate used by the workers to extract the next job that needs to be processed from the memory.
The approach followed by Palermo is to use a direct exchange. In this type of exchange, the routing key has to match the name of the queue where the message will be delivered.
If a message is sent to a RabbitMQ exchange and no queue is attached to the exchange, the message will be discarded or sent to an associated alternate exchange. To avoid this situation, every time a publisher enqueues a new job in Palermo, it declares the queue matching the routing key before sending the job data. In this way we can be sure that the message will not be discarded if no worker is awaiting jobs. Re-declaring an existing queue with the same options is a valid operation in RabbitMQ and has no effect.
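A sketch of the enqueue path under the same assumptions (exchange and queue names are illustrative, not Palermo’s actual ones):

import com.rabbitmq.client.BuiltinExchangeType;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;
import java.nio.charset.StandardCharsets;

public class EnqueueSketch {
    public static void main(String[] args) throws Exception {
        Connection connection = new ConnectionFactory().newConnection();
        Channel channel = connection.createChannel();

        String exchange = "jobs.direct"; // illustrative name for a durable direct exchange
        String queue = "jobs";           // routing key == queue name in a direct exchange

        channel.exchangeDeclare(exchange, BuiltinExchangeType.DIRECT, true);
        // Declare (or harmlessly re-declare) the durable queue before publishing,
        // so the message is not discarded when no worker has created it yet
        channel.queueDeclare(queue, true, false, false, null);
        channel.queueBind(queue, exchange, queue);

        byte[] payload = "{\"job\":\"example\"}".getBytes(StandardCharsets.UTF_8);
        // PERSISTENT_BASIC marks the message as persistent so it survives a broker restart
        channel.basicPublish(exchange, queue, MessageProperties.PERSISTENT_BASIC, payload);

        channel.close();
        connection.close();
    }
}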

Using the previously described setup for RabbitMQ we can build the core of a job processing system. However, there are certain features, like the management of failed jobs or the serialization of messages, that cannot be addressed using RabbitMQ features alone. In Palermo these features have been implemented as an additional software layer that can be distributed as a Java library.

The first issue is how to handle failed jobs. We have seen how, since we are using explicit worker acknowledgments, RabbitMQ will handle fatal failures in the worker execution or timeout issues. This mechanism could also be used to handle any kind of exceptional condition in the job processing logic, but our desired functionality requires that failed messages be sent to a special queue of failed jobs where they can be inspected, removed and, if appropriate, re-enqueued. This has been accomplished in Palermo by wrapping the execution of the job in a generic Java try/catch block and piping the failed messages to the failed queue, adding some information about the job error, retry count and original queue to the headers of the message meta-data sent along with the message to RabbitMQ.
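
A minimal sketch of that wrapping, again using the plain RabbitMQ Java client from Clojure (the header keys, the "-failed" queue naming convention and the process-job function are illustrative, not Palermo’s actual ones):

(import '[com.rabbitmq.client AMQP$BasicProperties$Builder])

(defn run-job [ch queue process-job ^bytes body]
  (try
    (process-job body)
    (catch Exception e
      ;; error details travel in the message headers, next to the original payload
      (let [headers {"error"          (.getMessage e)
                     "original-queue" queue
                     "retries"        (int 1)}
            props   (-> (AMQP$BasicProperties$Builder.)
                        (.headers headers)
                        (.deliveryMode (int 2))   ; keep the failed message persistent
                        (.build))
            failed  (str queue "-failed")]
        (.queueDeclare ch failed true false false nil)
        (.basicPublish ch "" failed props body)))))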

The serialization problem has been addressed by defining a job message with header information about the Java class for the job, the serialized arguments for the job and the serialization type. A small preview of the arguments as a human-readable string is also sent along with the message. Palermo has support for plugging and unplugging different serializers into the system and includes a JSON based serializer, but the default serialization mechanism is JBoss Serialization. Only the arguments for the job are serialized and sent to the worker; the class with the bytecode for the job must still be available in the worker’s class path for the job to be correctly executed.
Palermo workers are just generic executors of the actual job logic encapsulated inside the definitions of the job classes.

The whole logic of Palermo has been implemented in the Clojure programming language, but thanks to Clojure’s Java inter-op features it can be used from Java code or any other JVM based language. Since Palermo worker threads run on the JVM, they are isolated from the underlying operating system. Integration with the OS in the way Resque workers do it is hard to achieve, but a certain degree of integration has been attempted where possible, e.g. using process identifiers for the identity tags of the RabbitMQ consumers associated with Palermo workers. A command line interface has also been implemented so new workers can be started easily by scripts and users.

A final component of the solution is a web interface for a running Palermo system. It’s just a simple web application that uses the introspection features coded in the Palermo library to provide an easy-to-understand view of how the Palermo system is performing. This interface is a copy of the Resque web interface and can be used as a replacement for RabbitMQ’s generic interface for managing exchanges, queues and workers.
All the functionality provided by the web interface can also be used programmatically from any Java code through the Palermo library.

[Screenshots: Palermo web interface]

Taking all of the above into consideration, we can think of Palermo as just a thin layer of functionality built on top of a particular setup of RabbitMQ and wrapped as a library that can be used as a reusable implementation of the job processing usage pattern. The same approach can potentially be applied to other usage patterns built using RabbitMQ as the underlying engine, saving the time spent writing non-application-specific code as well as dealing with the setup and the rest of the complexity introduced by RabbitMQ specific configuration.

Monte Carlo integration with Clojure and Mahout

Monte Carlo simulations can be used to approximate the value of different mathematical models where computing an analytical solution is an intractable problem or it is impossible to find because of the uncertainty in some variables.

Conceptually, Monte Carlo simulations are simple.
They are based on repeatedly evaluating the model, with the values for the random variables drawn from some known probability distribution.
Finally, the results of all the simulations are averaged to obtain the approximated value.

The two main requirements to run a Monte Carlo simulation, provided you already have a mathematical model of the problem, are a good random number generator and some tool to model the different probability distributions. The Apache Mahout project has implemented these tools. It includes a powerful pseudo-random number generator that passes all the DieHard randomness tests, as well as Java classes that make it easy to work with different probability distributions.

This post is a small example of how to use these classes from Clojure as a convenient scripting language to work with Mahout. The example implemented is extracted from this excellent introductory book published by Springer.
It consists of computing the integral of the following function over the interval [0,1]:

  (defn h
    ([x] (pow (+ (cos (* 50 x)) (sin (* 20 x))) 2)))

We can use Incanter to visualize this function:

  (use 'incanter.core)
  (use 'incanter.charts)

  (view  (function-plot h 0 1))

The integral can be computed analytically, obtaining the value 0.965.

The Monte Carlo approach is based on drawing iid samples from a uniform probability distribution U(0,1), evaluating h at each sample, and approximating the value of the integral with:

  (def *solution* (/ (reduce + *randoms*) *trials*))

Where *randoms* is the collection of values of h evaluated at the randomly generated iid samples and *trials* is the number of samples generated.

In order to generate these values we can use the org.apache.mahout.math.jet.random.Uniform Mahout class:

  (import 'org.apache.mahout.math.jet.random.Uniform)
  
  (def *unif* (Uniform. (double 0.0)
                        (double 1.0)
                        (mod (.getTime (java.util.Date.)) 10000)))

  (def *trials* 100000)

  (def *randoms* (take *trials* (repeatedly #(h (.nextDouble *unif*)))))

To check the generated values, we can plot them using Incanter:

  (view (bar-chart (range 0 *trials*) *randoms* ))

The computed value is 0.9651838354718588, very close to the actual solution of 0.965.

We can also compute the running averages and plot them to see how the algorithm converges to the solution:

  ;; running average after n samples = cumulative sum of the first n values
  ;; divided by n
  (def *averages*
    (vec (map-indexed (fn [i sum] (/ sum (inc i)))
                      (reductions + *randoms*))))

  (view (function-plot (fn [x] (nth *averages* (min (dec *trials*) (Math/round (float x))))) 0 *trials*))

Monte Carlo simulations are a simple yet very powerful idea that can be applied to lots of different situations. If you want to learn more about Monte Carlo methods, these are some useful resources:

Visualizing Mahout’s output with Clojure and Incanter

Some Clojure code to visualize clusters built using Apache Mahout’s implementation of the K-Means clustering algorithm.

The code retrieves the output of the algorithm (clustered-points and centroids) from HDFS, builds a Clojure friendly representation of the output (a map and a couple of lazy-seqs) and finally uses Incanter’s wrapper around JFreeChart to visualize the results.
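
As a rough idea of what retrieving the clustered points from HDFS involves (this is not the actual code of the library, and it assumes the Mahout 0.x writable classes IntWritable/WeightedVectorWritable used by the K-Means output), the reading step could be sketched like this:

(import '[org.apache.hadoop.conf Configuration]
        '[org.apache.hadoop.fs FileSystem Path]
        '[org.apache.hadoop.io SequenceFile$Reader IntWritable]
        '[org.apache.mahout.clustering WeightedVectorWritable])

(defn clustered-points
  "Reads pairs of (cluster id, weighted vector) from a clusteredPoints file."
  [^Configuration conf path]
  (let [fs     (FileSystem/get conf)
        reader (SequenceFile$Reader. fs (Path. path) conf)]
    (loop [acc []]
      (let [k (IntWritable.)
            v (WeightedVectorWritable.)]
        (if (.next reader k v)
          (recur (conj acc {:cluster (.get k) :vector (.getVector v)}))
          (do (.close reader) acc))))))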

A sample execution using the output generated by the example from Mahout’s documentation:

(use 'mahout-vis.core)

(bootstrap! "/Users/antonio/Development/tmp/hadoop-0.20.2/
conf/core-site.xml")

(def *results* 
  (k-means-output "output/clusters-9/part-r-00000" 
                  "output/clusteredPoints/part-m-00000"))

(visualize-plots 
  (compute-comps *results* [5 10] [15 50] 
                           {:display-centroids true}))

The output of the previous code is four frames displaying the clusters for the components with indices 5, 10, 15 and 50 of the input data.

Other visualizations can be generated interactively from Clojure’s REPL. Another example of how Clojure can provide an interactive and powerful interface to complex Java systems.

Extending MongoDB with custom commands

MongoDB is one of the fastest alternatives to store and retrieve data. Mongo stores information organized as schema-less documents. Queries and data objects are expressed as JSON documents, encoded using a binary serialization (BSON) instead of plain text for improved performance.

There are two main mechanisms to extract information from MongoDB: a find command wrapping a small ad-hoc query language with support for conditionals, sorting, etc., and a mapReduce command for batch processing.

When a non-standard retrieval of data is required, the mapReduce command is the only option MongoDB offers. In a recent project I’ve been working on, I had to select documents stored in Mongo whose Levenshtein distance to a query term was below a certain threshold, functionality similar to the one offered by the fuzzystrmatch module in PostgreSQL. The task can be accomplished using Mongo’s mapReduce command, but the performance of the queries was not optimal.

As an experiment, I started reading the code of MongoDB to see if there was an easy way to implement this functionality directly in the database. What I found is that Mongo’s code is really modular and easy to extend.
The outcome has been a new command that implements the functionality with a big improvement in performance, as the following table shows:

query                              mapReduce   native
levenshtein 0.7 (0 matches)        1.941 s     0.077 s
levenshtein 0.4 (21 matches)       2.691 s     0.091 s
levenshtein 0.1 (22,478 matches)   22.857 s    7.962 s

The collection being queried in this test contains 100,000 documents with random strings of text between 30 and 100 characters.

The code for the new command can be found at Github. The files in this commit contain all the code required to implement the command.

The following is a small summary of the steps required to extend MongoDB to support this kind of queries.

Retrieving MongoDB’s code and the build process

MongoDB’s code is available at Github. Once the code has been retrieved, the next step is to build the database and all the additional functionality, like the Javascript shell.
Mongo uses SCons as its build infrastructure. SCons is itself written in Python, so Python is a dependency that must be installed in your system before you can build Mongo.

To build the whole system, a single command is enough:

$scons .

The task can take quite a long time, but once the whole system has been built, SCons does a great job of re-building only the modified sources.
Different parts of the system can also be built as independent targets:

# builds only the db
$scons mongod
# builds only the JS shell
$scons mongo

Creating a new DB command

Mongo’s core functionality can be found in the db directory of the source distribution. It includes the implementation of Mongo’s RESTful API, index/BTree support, the standard Mongo queries and also the list of commands that can be issued to the database, e.g. mapReduce.
Adding a new command to the list means implementing a new C++ class with the functionality of the command and registering a name for it in a map of command names to command classes.

If we take a look at db/commands.cpp we will find the function used by the server frontend to look up the command it has to execute:

    map<string,Command*> * Command::_commands;
    ...

    Command* Command::findCommand( const string& name ) {
        map<string,Command*>::iterator i = _commands->find( name );
        if ( i == _commands->end() )
            return 0;
        return i->second;
    }

All commands implement the abstract mongo::Command class. The subclass must implement some functions in order for the command to be executed. The most important one is mongo::Command::run, defined in db/commands.h:

  // db/commands.h line 50
   virtual bool run(const string& db, BSONObj& cmdObj, 
                    string& errmsg, BSONObjBuilder& result, 
                    bool fromRepl) = 0;

The base Command class also provides a base constructor that will automatically register the command in the commands map when invoked in the subclass. For example, the implementation of the mapReduce command registers itself for execution by invoking the base constructor:

/**
 * This class represents a map/reduce command executed on a single server
 */
class MapReduceCommand : public Command {
  public:
     MapReduceCommand() : Command("mapReduce", false, "mapreduce") {}

Retrieving the arguments for the command
The query retrieved from the client is encoded as a BSON object and passed as the second argument to the run function.
There is a whole suite of functions to manipulate BSON objects defined in MongoDB. They can be found in bson/bsonobj.h and bson/bsonelement.h.
In this fragment of code from the mapReduce command implementation the out parameter of the query is handled. The BSON object is stored in the variable cmdObj:

if ( cmdObj["out"].type() == String ) {
    finalShort = cmdObj["out"].String();
    outType = REPLACE;
}
else if ( cmdObj["out"].type() == Object ) {
    BSONObj o = cmdObj["out"].embeddedObject();

    BSONElement e = o.firstElement();
    string t = e.fieldName();

    if ( t == "normal" || t == "replace" ) {
        outType = REPLACE;
        finalShort = e.String();
    }
    else if ( t == "merge" ) {
        outType = MERGE;
        finalShort = e.String();
    }
    else if ( t == "reduce" ) {
        outType = REDUCE;
        finalShort = e.String();
    }
    else if ( t == "inline" ) {
        outType = INMEMORY;
    }
    else {
        uasserted( 13522 , str::stream() << "unknown out specifier [" << t << "]" );
    }

    if (o.hasElement("db")) {
        outDB = o["db"].String();
    }
}

Obtaining a cursor
To implement the desired functionality it is usually necessary to traverse the collection of Mongo documents stored in the DB. Mongo implements this functionality using cursors.
Cursors can be obtained using a factory function called bestGuessCursor that receives as parameters a unique namespace for the command and a description of a DB query.
The cursor is returned as a Boost smart pointer so we don’t have to deal with the deallocation of the resources consumed by the pointer. A possible template for a function using a collection pointer could be:

// run function
bool run(...) {
  
  // get the cursor
  shared_ptr<Cursor> temp = bestGuessCursor( ns, BSONQuery, BSONObj() );
  auto_ptr<ClientCursor> cursor( new ClientCursor( timeoutOpts , temp , ns ) );

  // main loop
  while ( cursor->ok() ) {

    // get current document
    BSONObj o = cursor->current();

    ... logic ...

    cursor->advance(); 
  }
}

Building the output

The result of the command must also be returned as a BSON object. To build this object, a reference to a BSONObjBuilder object is passed as an argument to the run function. The logic of the function can use functions like append to add values to the resulting BSON object. If the values of this object must also be BSON objects, additional BSONObjBuilder instances can be created. Once the object has been built, it can be retrieved from the builder by calling the obj function.

The run function must also signal if the execution of the command has been successful returning a boolean value.

Adding support for the command in the shell
In order to use the command we have implemented, we can add support for it in the Mongo JS shell. A good location for the JS code invoking the command is shell/collection.js.
The function must build the JSON object that will later be received as a parameter in the command implementation at the server. The only requirement for this JSON object is that its first property must have the same name as the string used to register the command in the DB. The value of that property must be the short name of the collection. The rest of the properties are optional. The command can be executed using the this._db.runCommand function of the current collection object.

As an example, this is the implementation of the custom levenshtein command:

DBCollection.prototype.levenshtein = function( sourceTerm , field, threshold, opts ){
    var c = { levenshtein : this._shortName , sourceTerm : sourceTerm , field : field, threshold : threshold };
    opts = opts || {"level":"word"};

    if(!opts["level"] || opts["level"] === "word") {
        c["word"] = true;
        c["sentence"] = false;
    } else {
        c["word"] = false;
        c["sentence"] = true;    
    }

    c["separators"] = (opts["separators"]||".,:; ");

    if(opts["limit"]) {
        c["limit"] = opts["limit"];
    }
    if(opts["outputField"]) {
        c["outputField"] = opts["outputField"];
    }

    var raw = this._db.runCommand( c );
    if ( ! raw.ok ){
        __mrerror__ = raw;
        throw "levenshtein matches failed:" + tojson(raw);
    }

    return tojson(raw);

}

Adding support in a driver
One problem of extending MongoDB this way is that we must add support for the new command in all the layers between our client code and the DB. We have already added support to the JS shell but many applications access MongoDB through some kind of driver interfacing with the DB server.

In my case, I was using Somnium’s congomongo Clojure library. This means adding support in two different layers: the low-level Java driver and the Clojure wrapper library.
Fortunately, the design of the library and the Java driver makes it possible to add support for the command entirely in client code, without further modification of the library sources. Congomongo’s coerce function also makes it very easy to transform data structures between Clojure’s native data types and BSON objects. An example implementation can be found in this Github gist.
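
For reference, this is roughly what invoking the custom command looks like when going through the plain MongoDB Java driver from Clojure (a congomongo-based version is analogous once the query map is coerced to a DBObject); the host, database and collection names are hypothetical:

(import '[com.mongodb Mongo BasicDBObject])

(defn levenshtein-query [db-name coll source-term field threshold]
  (let [db  (.getDB (Mongo. "localhost") db-name)
        cmd (doto (BasicDBObject.)
              ;; the first key must match the name the command was registered with,
              ;; and its value must be the short name of the collection
              (.put "levenshtein" coll)
              (.put "sourceTerm" source-term)
              (.put "field" field)
              (.put "threshold" threshold))]
    (.command db cmd)))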

Chasing Erlang: profiling a Clojure library

This is a short summary of my efforts profiling and benchmarking Jobim, the actor library for Clojure I’ve been working on for the last few months. I have very little experience profiling Clojure applications, so I thought a brief summary of the process might be interesting for other people in the same situation.

How to measure performance?

The first problem I found was to come up with a good way of testing the actual performance of the library. The performance of complex systems is really hard to test: the number of moving parts makes it difficult to spot the bottlenecks, and wrong conclusions can be drawn easily.
Fortunately, in my case I found an easy solution for the problem: I used the same test suite built to benchmark Termite Scheme. Termite is a variant of Scheme built for distributed computing, modelled after Erlang’s programming model and implemented on the Gambit Scheme system.
The Termite paper has a full section discussing different performance tests and showing the results for Termite and Erlang. Furthermore, the implementation of the tests can be found in Termite’s source code.

Because of the similarities between the projects, implementing these benchmark tests in Clojure/Jobim provides a good starting point for testing the performance of the library.

Finding tools

To measure performance, System/nanoTime can be good enough. On the other hand, to profile the code when performance is bad, you really need a good profiler tool. Unfortunately, there is no good Clojure profiler tool as far as I know. Commercial Java profilers like YourKit are a good option, but they can also be expensive. A good free alternative is VisualVM. The profiler is included in the Java distribution and it’s really easy to use. Using VisualVM with Clojure is just a matter of selecting the REPL process in the profiler user interface and starting to gather data.
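
For the raw measurements, a helper along these lines (not part of Jobim itself) is usually all that is needed:

(defmacro timed-ms
  "Evaluates body and returns a vector [result elapsed-milliseconds]."
  [& body]
  `(let [start# (System/nanoTime)
         res#   (do ~@body)]
     [res# (/ (- (System/nanoTime) start#) 1000000.0)]))

;; (timed-ms (Thread/sleep 10)) ;=> [nil 10.2...]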

Adding type hints

A first step to increase the performance of a Clojure application is to add type hints to the code of your functions. Clojure is a dynamic language, but that’s not the case for Java, the host language of Clojure. In order to invoke Java methods, Clojure needs to find the types of the Java objects passed as arguments to these methods or returned from them. Java reflection is the mechanism that makes it possible to look up this information at run-time. The use of reflection has a double impact on performance: it adds processing time and it consumes additional memory creating reflection objects. The use of reflection is usually not a problem in regular applications, since memory is freed by the garbage collector and the performance penalty is not significant. Nevertheless, in some cases the use of the reflection mechanism can become a problem.

The following images show a profiler session for the execution of the Termite ring test with a ring of 40k actors. The non-deterministic execution of the garbage collector has been disabled with the -Xincgc JVM parameter.

The screenshots show how the execution of the test allocates a huge amount of memory as java.lang.reflect.Method objects. These objects are released by the GC once it is manually triggered, but in a regular execution the GC thread might have had a very negative impact on the outcome of the test.

In order to avoid these situations, Clojure can notify us when it will use reflection to resolve a method call if we set the *warn-on-reflection* var to true. With this modification, the list of conflicting calls will be shown when the source code is compiled. Using this information, we can add type hints with the correct type for the call to the metadata of the problematic symbols. Clojure will use these hints instead of reflection to discover the right type of the objects.
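
A small illustration of the process, unrelated to Jobim’s actual code: with the warning enabled the compiler flags the reflective call, and a ^String hint on the argument removes it.

(set! *warn-on-reflection* true)

(defn shout-reflective [s]
  (.toUpperCase s))   ; compiler emits a reflection warning for this call

(defn shout-hinted [^String s]
  (.toUpperCase s))   ; resolved at compile time, no reflection involved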

The following figures show another run of the same test after having added the required type hints. We can see how the use of memory has decreased and the java.lang.reflect.Method objects have disappeared.

In the case of Jobim, the number of type hints required to resolve all the problematic calls was really small. Around 15 type hints were enough. Nevertheless, some calls involving protocols were harder to solve. The use of VisualVM was a great help to check that no reflection objects were being created.

Object serialization
One of the main performance problems in tests involving the exchange of messages among actors in different nodes is the serialization and deserialization of messages. Using standard Java serialization, the time spent serializing messages could reach 20% of the total in some tests. Besides, serialization of some Clojure objects, like anonymous functions, is not supported.

Lacking a single solution for the problem, different serialization mechanisms have been implemented as plugins for Jobim. The following table shows some performance results for a test consisting of encoding and decoding a Java HashMap with long lists of integers as values:

Serialization mechanism   Time encoding+decoding
JSON serialization        5.566 ms
Java serialization        4.419 ms
JBoss serialization       1.437 ms
Kryo serialization        0.573 ms

The results show how standard Java serialization is almost as slow as JSON serialization. Kryo offers very good performance, with the drawback of not supporting Clojure core data types. Finally, the JBoss serialization library offers much better performance than standard Java serialization, with a compatible interface and the same degree of support for Java types, including Clojure core data types.

The lack of support for Clojure data structures makes Kryo a valid option only for those cases where actors restrict the kind of messages they exchange to ones Kryo can serialize. JBoss serialization is a good default option and has been used in the rest of the tests. The Jobim data types used by the library in the exchange of messages between nodes have been restricted to standard Java types so they can be used with any serialization mechanism.
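
As a rough illustration of the kind of round trip measured in the table above, this is what the plain Java serialization case looks like for a HashMap with lists of integers as values (the sizes are arbitrary):

(import '[java.io ByteArrayOutputStream ByteArrayInputStream
                  ObjectOutputStream ObjectInputStream]
        '[java.util HashMap ArrayList])

(defn java-roundtrip
  "Serializes and deserializes obj with standard Java serialization."
  [obj]
  (let [bos (ByteArrayOutputStream.)]
    (doto (ObjectOutputStream. bos) (.writeObject obj) (.close))
    (.readObject (ObjectInputStream. (ByteArrayInputStream. (.toByteArray bos))))))

(def test-map
  (let [m (HashMap.)]
    (doseq [k (range 10)]
      (.put m (str "key-" k) (ArrayList. (range 10000))))
    m))

;; e.g. (timed-ms (java-roundtrip test-map)) using the timing helper sketched earlier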

Results

The following table shows the results I have obtained profiling the current version of Jobim. The results for Erlang have been measured using the same hardware where the Clojure version was tested.

test Erlang Jobim threaded Jobim evented
fib(534) 585 ms 378 ms
Spawn 6.92 us 49 us 27 us
Send 0.2 us 120 us 16 us
ring 40k/0 3 us 83 us
ring 40k/4 5 us 166 us
ping-pong internal (0) 0.7 us 148 us 25 us
ping-pong internal (1000) 16 us 152 us 29 us
ping-pong same node (0) 58 us 14333 us 5027 us
ping-pong same node (1000) 182 us 25921 us 6303 us
ping-pong remote (0) 3267 us 18481 us 7022 us
ping-pong remote (1000) 16556 us 38730 us 8400 us

The ping-pong tests were executed using a Jobim configuration consisting of the ZooKeeper plugin for the coordination service, the TCP plugin for the messaging service and the JBoss plugin for the serialization service. The rest of the tests were run with the localnode plugin for coordination, messaging and serialization.

Some conclusions:

  • Clojure is faster than Erlang in the base case chosen, computing Fibonacci of a small number with a naive implementation.
  • Actor model primitives (spawn, send, receive) are orders of magnitude faster in Erlang than when implemented as library functions in Jobim.
  • The number of processes/actors that can be created in a single node is far smaller in Jobim than in Erlang. In a machine where the Erlang version of the ring test was running without problems with 1M processes, the JVM was struggling with ~50k evented actors. Threaded actors are even less scalable.
  • Erlang seems to have some issues with the size of network messages, while Jobim shows less variation in response times when the size of the messages varies.
  • Evented actors offer far better performance than the threaded implementation in the tests.
  • In tests involving actual network communication, Jobim gets closer to Erlang and it even outperforms Erlang in one case.

With these results, it is probably not a good idea to write Erlang-in-Clojure applications using Jobim. Nevertheless, the library is starting to be performant enough to be used as a distribution mechanism for Clojure applications using actor semantics and some Erlang features like actor linking or behaviours.

Barber shop problem again

Today there has been some talk about concurrency in Clojure vs Scala. Most of the debate was based on an excellent article, Scala vs Clojure – Round 2: Concurrency!, by Lau B. Jensen.

The article discusses the classic “barber shop” problem and offers two implementations: one in Clojure, using Clojure’s concurrency primitives, and another in Scala, using Scala’s implementation of actors.

Since I’ve been working on Jobim, an actor library for Clojure, for the last few months, I thought it would be interesting to offer an additional implementation using actors in Clojure. The code can be found here.

The implementation of the actors is pretty straightforward. It is interesting to compare it with the Scala version from the previous article.

The shop actor:

(defn do-shop
  "Shop actor"
  ([max-seats max-clients]
     (let [seats (atom [])
           max-clients (atom max-clients)]
       (loop [[msg pid] (receive)]
         (condp = msg
             ;; allocate a seat for the client
             :get-seat (if (< (count @seats) max-seats)
                         (do (swap! seats conj pid)
                             (send! pid :seated))
                         (send! pid :no-space))
             ;; barber asks for a client
             :next-client (let [pid (first @seats)]
                            (swap! seats rest)
                            (send! (resolve-name "the-barber") pid))
             ;; A client is leaving the shop
             :exiting (swap! max-clients dec))
         (if (> @max-clients 0)
           (recur (receive))
           (notify-shop "Closing barber shop"))))))

The customer:

;; Customer
(defn customer
  "Customer actor"
  ([id]
     (spawn
      ;; enters the shop and ask for a seat
      #(loop []
         (println (str "Customer id " id "..."))
         (let [seated (do (send! (resolve-name "barber-shop") [:get-seat (self)])
                          (receive))]
           ;; we have a seat, wait to have a hair cut
           (if (= seated :seated)
             (let [_ (send! (resolve-name "the-barber") :hey)
                    ;; Awaking the barber
                   msg (receive)]
               ;; barber starting to do his thing
               (when (= msg :cutting)
                 (notify-shop (str "Customer " id " getting his hair cut")))
               ;; barber finished
               (let [msg (receive)]
                 (when (= msg :finished)
                   (notify-shop (str "Customer " id " got his hair cut")))
                 ;; leaving the shop
                 (send! (resolve-name "barber-shop") 
                           [:exiting (self)])
                 (notify-shop (str "Customer " id " leaving the shop"))))
             ;; No space available in the shop, exiting
             (do
               (notify-shop (str "Customer " id " could not get his hair cut"))
               (Thread/sleep (rand 5000))
               (recur))))))))

And finally the barber:

;; Barber
(defn barber
  "Barber actor"
  ([]
     (let [pid (spawn
                #(loop [msg (receive)] ;; wait for a customer
                   (when (= msg :hey) ;; awaken!
                     ;; ask for the customer to come in
                     (send! (resolve-name "barber-shop") [:next-client (self)])
                     ;; ignore the customers trying to awake us
                     (let [client-pid (receive (fn [msg] (not (= :hey msg))))] 
                       ;; starting to cut
                       (send! client-pid :cutting)
                       (Thread/sleep (rand 4000))
                       ;; finished cutting
                       (send! client-pid :finished)))
                   (recur (receive))))]
       (register-name "the-barber" pid))))

Additionally, the implementation includes the use of a behaviour. Behaviours are a concept taken from Erlang that refers to pieces of functionality in a distributed system that are encapsulated so they can be easily reused.

In this case, since the customers can be executed on any node, if we want to collect the messages produced by all the customers we cannot use the traditional mechanism of printing to the standard output: the messages would appear all across the cluster. Using an event manager and an event handler we can collect messages from all the customers in the cluster in a central location.
Behaviours in Jobim are implemented using Clojure’s protocols. In this case:

;; Message logger

(def-event-handler BarberLogger
  (init [this initial-state] initial-state)
  (handle-event [this event state]
                (println (str "*** " event)))
  (terminate [this state] (println "BarberLogger exiting")))

(defn notify-shop [msg]
  (notify (resolve-name "barber-manager") msg))

More information about the event manager behaviour can be found in the Erlang manual.

One of the fundamental advantages of using a platform like Erlang is that you can run the same application in a single node or in a set of nodes with minimal changes. In this example, there are a couple of functions to start the application in a single node or in multiple nodes. The only difference is the way the customer actors are started:

(defn run-barber-shop
  "Runs the barber shop problem in a single node"
  ([num-seats num-clients]
     (start-event-manager "barber-manager")
     (add-handler (resolve-name "barber-manager")
                  [name jobim.examples.barber.BarberLogger]
                  nil)
     (shop num-seats num-clients)
     (barber)
     (doseq [i (range 0 num-clients)]
       (customer i))))

(defn run-barber-shop-distributed
  "Runs the barber shop problem distributed in a set of nodes"
  ([num-seats num-clients nodes]
     (start-event-manager "barber-manager")
     (add-handler (resolve-name "barber-manager")
                  [name jobim.examples.barber.BarberLogger]
                  nil)
     (shop num-seats num-clients)
     (barber)
     (loop [ids (range 0 num-clients)
            nodes (cycle nodes)]
       (if (empty? ids)
         (println (str "*** All clients spawned"))
         (do (rpc-blocking-call (resolve-node-name (first nodes))
                                "jobim.examples.barber/customer"
                                [(first ids)])
             (recur (rest ids)
                    (rest nodes)))))))

The result of a test run can be seen in this video:

As you can see, using actors in Clojure is just another possibility, and it makes a really good alternative when you want to build systems spanning more than a single node.

Embedding V8 in a Clojure SPARQL library

Using Javascript in Java applications is a common practice. The easiest path to achieve this integration is to use the excellent Rhino library from the Mozilla Foundation. This library is an implementation of Javascript in Java, so interop from both sides is really easy.
Nevertheless, when I needed to re-use a Javascript implementation of a SPARQL 1.1 parser I had been working on in a different Clojure project, I decided to try a different approach and use Google’s V8 Javascript engine. These are the results of this small experiment.

The V8 project has some instructions about how to embed the engine into a C application. V8 exposes JavaScript objects to C code through ‘handles’ referencing these objects. Objects can be created and released from C, but instead of manually tracking each object individually, they can be grouped into ‘handle scopes’. This mechanism simplifies the management of JavaScript object memory, since all the objects associated with a scope can be garbage-collected by V8 once the scope is deleted. Additionally, in the same way that a C object can be created on the heap or on the stack, V8 object handles can also be persistent or transient.

V8 objects are created and manipulated in a certain ‘execution context’. These contexts are persistent between function calls, and a scope to declare or manipulate additional JS objects can always be retrieved from the context object. C data structures and functions can also be used in JS code using ‘templates’ that wrap the C object in the JS execution context.

As an example, the following code shows the constructor of a C++ class that loads the JS parser code from a file, initializes a persistent execution context, and evaluates the JS script in that context. As a result, the JS parsing function will be available in that V8 execution context.

SparqlParser::SparqlParser() {

  char script[500000];

  // load the parser js code
  std::ifstream fin;
  fin.open("sparql_parser.js", std::ios::in);
  fin.read(script,500000);
  fin.close();

  // The context where the parser will be executed
  context = Context::New();
  HandleScope handle_scope;
  Context::Scope context_scope(context);

  // compiles the parser function
  Handle<Script> parserScript = Script::Compile(String::New(script));
  parserScript->Run();

};

The parse method from the same class just builds the invocation of the JS parsing function, retrieves a new scope in the persistent execution context and executes the actual JS function call:

std::string SparqlParser::parse(std::string query) {
  Context::Scope context_scope(context);
  HandleScope handle_scope;

  // executes the parser function call on the global object
  std::string query_string = "sparql_query('";
  query_string = query_string.append(query);
  query_string = query_string.append("');");

  Handle<Script> runnerScript = Script::Compile(String::New(query_string.c_str()));
  Handle<Value> result = runnerScript->Run();

  if (!result.IsEmpty() && !result->IsUndefined()) {
    String::Utf8Value str(result);
    std::string toReturn = std::string(ToCString(str));

    return toReturn;
  } else {
    std::cout << "Something went wrong";
    return NULL;
  }
}

The destructor of the class just releases the persistent execution context.

To be able to use this code from a Clojure function, a mechanism to execute native code from Java must be used. The main alternatives are the Java Native Interface (JNI) and the newer Java Native Access (JNA). JNA offers a simpler and cleaner integration with Java code, and is well supported in Clojure and Leiningen thanks to projects like clojure-jna and clj-native. Unfortunately, JNA is oriented towards plain C APIs, and using C++ code like V8 means writing a C interface layer on top of the C++ classes. Taking this into account, I decided to write a JNI interface consisting of a Java class that wraps the C++ code using the automatically generated JNI C interface code.

This is the Java wrapper class:

package sparqltest;

public class SparqlParser {
     static {
         System.loadLibrary("SparqlParserWrapper");
     }

    private native void init();
    private native String parse_query(String query);

    private static SparqlParser parser;

    private SparqlParser() {
        init();
    }

    public String _parse(String query){
        return parse_query(query);
    }

    public static String parse(String query) {
        if(parser == null) {
            parser = new SparqlParser();
        }
        return parser._parse(query);
    }
 }

And this is the generated C interface, with the usual conversion between wrapped Java types and C types:

#include <jni.h>
#include <stdio.h>
#include "sparql_parser.h"
#include "sparqltest_SparqlParser.h"

SparqlParser parser;

JNIEXPORT void JNICALL Java_sparqltest_SparqlParser_init (JNIEnv *env , jobject obj) {
  parser = SparqlParser();
}

JNIEXPORT jstring JNICALL Java_sparqltest_SparqlParser_parse_1query (JNIEnv *env, jobject obj, jstring javaQuery) {

  const char *query = env->GetStringUTFChars(javaQuery, 0);
  std::string result = parser.parse(std::string(query));

  jstring javaResult = env->NewStringUTF(result.c_str());

  return javaResult;
}

After compiling and packaging the Java wrapper in a Jar file, we are ready to test the code from Clojure. Before that, the C++ code must have been compiled into a library that will be loaded by the JNI framework. The library must be located somewhere in the JNI load path, as must the V8 library if we have decided to compile it as a shared library. This path can be configured in the Leiningen project specification using the :native-path keyword (equivalent to using the -Djava.library.path argument in the Java invocation).
Since the C++ code reads the JS parser from disk, the file containing the parser must also be located in the directory where we are running the application.
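
The relevant bits of the Leiningen project file could look like this minimal sketch (the project name matches the namespace used below, but the dependency versions are illustrative only):

(defproject clj-sparql-test "0.1.0"
  :dependencies [[org.clojure/clojure "1.2.0"]
                 [org.clojure/clojure-contrib "1.2.0"]]
  ;; directory where libSparqlParserWrapper (and V8, if built as a shared
  ;; library) are placed; equivalent to passing -Djava.library.path to the JVM
  :native-path "native")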

If all the paths are OK and JNI can find the native libraries, the JS parser can be invoked without problems from Clojure code:

(ns clj-sparql-test.core
  (:import sparqltest.SparqlParser)
  (:use [clojure.contrib.json :only [read-json]]))

(defn parse-sparql [query]
  (read-json (SparqlParser/parse query)))

A sample invocation:

  user> (use 'clj-sparql-test.core)
  nil
  user> (parse-sparql "SELECT * { ?s ?p ?o }")
  [{:token "prologue",
    :base "",
    :prefixes []}
   {:kind "select",
    :token "executableunit",
    :dataset [],
    :projection [{:token "variable", :kind "*"}],
    :pattern {:token "groupgraphpattern",
              :patterns [{:token "basicgraphpattern",
                          :triplesContext 
                                [{:subject {:token "var", :value "s"},
                                  :predicate {:token "var", :value "p"},
                                  :object {:token "var", :value "o"}}]}],
                          :filters []}}]

The code for this experiment, along with more details about building the different components, can be found here.