Extending MongoDB with custom commands

MongoDB is one of the fastest options for storing and retrieving data. Mongo stores information organized as schema-less documents. Queries and data objects are expressed as JSON objects encoded with a binary serialization (BSON) instead of plain text for improved performance.

There are two main mechanisms to extract information from MongoDB: a find command wrapping a small ad-hoc query language with support for conditionals, sorting, etc., and a mapReduce command for batch processing.

When a non-standard retrieval of data is required, the mapReduce command is the only option MongoDB offers. In a recent project I’ve been working on, I had to select documents stored in Mongo whose Levenshtein distance to a query term was below a certain threshold, functionality similar to the one offered by the fuzzystrmatch module in PostgreSQL. The task can be accomplished using Mongo’s mapReduce command, but the performance of the queries was not optimal.

As an experiment, I started reading the code of MongoDB to see if there was an easy way to implement this functionality directly in the database. What I found is that Mongo’s code is really modular and easy to extend.
The outcome has been a new command that implements the functionality with a big improvement in performance, as the following table shows:

implementation                      mapReduce    native
levenshtein 0.7 (0 matches)          1.941 s     0.077 s
levenshtein 0.4 (21 matches)         2.691 s     0.091 s
levenshtein 0.1 (22,478 matches)    22.857 s     7.962 s

The collection being queried in this test contains 100,000 documents with random strings of text between 30 and 100 characters.

The code for the new command can be found at Github. The files in this commit contain all the code required to implement the command.

The following is a small summary of the steps required to extend MongoDB to support this kind of query.

Retrieving and building MongoDB’s code

MongoDB’s code is available at Github. Once the code has been retrieved, the next step is to build the database and all the additional functionality, like the JavaScript shell.
Mongo uses SCons as its build infrastructure. SCons is itself written in Python, so Python is a dependency that must be installed on your system before you can build Mongo.

To build the whole system, a single command is enough:

$ scons .

The task can take quite a long time but, after the whole system has been built once, SCons does a great job of rebuilding only the modified sources.
Different parts of the system can also be built as independent targets:

# builds only the db
$ scons mongod
# builds only the JS shell
$ scons mongo

Creating a new DB command

Mongo’s core functionality can be found in the db directory of the source distribution. It includes the implementation of Mongo’s RESTful API, indexes/BTree support, standard Mongo queries and also the list of commands that can be issued to the database, e.g. mapReduce.
Adding a new command to the list means implementing a new C++ class with the functionality of the command and registering a name for this command in a map of command names to command classes.

If we take a look at db/commands.cpp we will find the function used by the server frontend to look up the command it has to execute:

    map<string,Command*> * Command::_commands;
    ...

    Command* Command::findCommand( const string& name ) {
        map<string,Command*>::iterator i = _commands->find( name );
        if ( i == _commands->end() )
            return 0;
        return i->second;
    }

All commands are implemented as subclasses of the abstract mongo::Command class. The subclass must implement some functions in order for the command to be executed. The most important one is mongo::Command::run, defined in db/commands.h:

// db/commands.h line 50
virtual bool run(const string& db, BSONObj& cmdObj,
                 string& errmsg, BSONObjBuilder& result,
                 bool fromRepl) = 0;

The base Command class also provides a base constructor that will automatically register the command in the commands map when it is invoked from the subclass. For example, the implementation of the mapReduce command registers itself for execution by invoking the base constructor:

/**
 * This class represents a map/reduce command executed on a single server
 */
class MapReduceCommand : public Command {
  public:
     MapReduceCommand() : Command("mapReduce", false, "mapreduce") {}
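
Putting these pieces together, a minimal skeleton for a custom command might look like the following sketch. The command name levenshtein is the one used throughout this post, but the class body, the auxiliary methods and the return value are only an illustrative outline, not the actual implementation:

// a minimal sketch of a custom command class, following the structure of the
// existing commands in the db directory (and living inside the mongo namespace)
class LevenshteinCommand : public Command {
  public:
    // invoking the base constructor registers the "levenshtein" command name
    LevenshteinCommand() : Command("levenshtein") {}

    virtual bool slaveOk() const { return true; }       // the command only reads data
    virtual LockType locktype() const { return READ; }

    virtual bool run(const string& db, BSONObj& cmdObj,
                     string& errmsg, BSONObjBuilder& result,
                     bool fromRepl) {
        // by convention the first element carries the short collection name
        string collection = cmdObj.firstElement().valuestrsafe();
        // ... parse the remaining parameters, traverse the collection and
        //     append the matching documents to `result` ...
        return true;
    }
} levenshteinCommand;  // a static instance triggers the registration at startup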

Retrieving the arguments for the command
The query received from the client is encoded as a BSON object and passed as the second argument to the run function.
There is a whole suite of functions for manipulating BSON objects defined in MongoDB. They can be found in bson/bsonobj.h and bson/bsonelement.h.
In this fragment of code from the mapReduce command implementation, the out parameter of the query is handled. The BSON object is stored in the variable cmdObj:

if ( cmdObj["out"].type() == String ) {
    finalShort = cmdObj["out"].String();
    outType = REPLACE;
}
else if ( cmdObj["out"].type() == Object ) {
    BSONObj o = cmdObj["out"].embeddedObject();

    BSONElement e = o.firstElement();
    string t = e.fieldName();

    if ( t == "normal" || t == "replace" ) {
        outType = REPLACE;
        finalShort = e.String();
    }
    else if ( t == "merge" ) {
        outType = MERGE;
        finalShort = e.String();
    }
    else if ( t == "reduce" ) {
        outType = REDUCE;
        finalShort = e.String();
    }
    else if ( t == "inline" ) {
        outType = INMEMORY;
    }
    else {
        uasserted( 13522 , str::stream() << "unknown out specifier [" << t << "]" );
    }

    if (o.hasElement("db")) {
        outDB = o["db"].String();
    }
}
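
Following the same pattern, a custom command can pull its own parameters out of cmdObj. This sketch shows how the parameters used later by the shell helper (sourceTerm, field, threshold, limit) might be read; error handling and validation are left out:

// illustrative parameter parsing inside run(); cmdObj holds the query sent by the client
string sourceTerm = cmdObj["sourceTerm"].String();  // term to compare against
string fieldName  = cmdObj["field"].String();       // document field to inspect
double threshold  = cmdObj["threshold"].Number();   // maximum accepted distance

int limit = 0;
if ( cmdObj.hasElement( "limit" ) )
    limit = cmdObj["limit"].numberInt();            // optional cap on the number of results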

Obtaining a cursor
To implement the desired functionality it is usually necessary to traverse the collection of Mongo documents stored in the DB. Mongo implements this functionality using cursors.
Cursors can be obtained using a factory function called bestGuessCursor that receives as parameters a unique namespace for the command and a description of a DB query.
The cursor is returned as a Boost smart pointer, so we don’t have to deal with the deallocation of the resources consumed by the pointer. A possible template for a function using a collection pointer could be:

// run function
bool run(...) {

  // get the cursor
  shared_ptr<Cursor> temp = bestGuessCursor( ns, BSONQuery, BSONObj() );
  auto_ptr<ClientCursor> cursor( new ClientCursor( timeoutOpts , temp , ns ) );

  // main loop
  while ( cursor->ok() ) {

    // get current document
    BSONObj o = cursor->current();

    ... logic ...

    cursor->advance();
  }
}
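
Inside the loop each document is a plain BSONObj, so the per-document logic can use the same BSON accessors shown before. A possible, purely illustrative body for the levenshtein case (fieldName and sourceTerm standing for the parameters parsed from cmdObj) could be:

// possible per-document logic: read the field to compare, ignoring documents
// where it is missing or is not a string
BSONElement elem = o.getFieldDotted( fieldName.c_str() );
if ( elem.type() == String ) {
    string value = elem.String();
    // ... compute the Levenshtein distance between `value` and `sourceTerm`
    //     and keep the document if it is below the threshold ...
}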

Building the output

The result of the command must also be returned as a BSON object. To build this object, a reference to a BSONObjBuilder object is passed as an argument to the run function. The logic of the function can use functions like append to add values to the resulting BSON object. If the values of this object must also be BSON objects, additional BSONObjBuilder instances can be created. Once the object has been built, it can be retrieved from the builder by calling the obj function.

The run function must also signal whether the execution of the command has been successful by returning a boolean value.
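
As an illustration, the output of a command like levenshtein could be assembled along these lines; the results and matches field names are just one possible shape for the response, not part of MongoDB’s API:

// possible output construction at the end of run()
BSONArrayBuilder matches;
int count = 0;
// ... for every document `o` whose distance is below the threshold ...
//     matches.append( o );
//     count++;
result.appendArray( "results", matches.arr() );  // nested BSON array with the matches
result.append( "matches", count );               // number of documents found
return true;                                     // signal successful execution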

Adding support for the command in the shell
In order to use the command we have implemented, we can add support for it in the Mongo JS shell. A good location for the JS code invoking the command is shell/collection.js.
The function must build the JSON object that will later be received as a parameter by the command implementation at the server. The only requirement for this JSON object is that its first property must have the same name as the string used to register the command in the DB. The value for that property must be the short name of the collection. The rest of the properties are optional. The command can be executed using the this._db.runCommand function of the current collection object.

As an example, this is the implementation of the custom levenshtein command:

DBCollection.prototype.levenshtein = function( sourceTerm , field, threshold, opts ){
    var c = { levenshtein : this._shortName , sourceTerm : sourceTerm , field : field, threshold : threshold };
    opts = opts || {"level":"word"};

    if(!opts["level"] || opts["level"] === "word") {
        c["word"] = true;
        c["sentence"] = false;
    } else {
        c["word"] = false;
        c["sentence"] = true;    
    }

    c["separators"] = (opts["separators"]||".,:; ");

    if(opts["limit"]) {
        c["limit"] = opts["limit"];
    }
    if(opts["outputField"]) {
        c["outputField"] = opts["outputField"];
    }

    var raw = this._db.runCommand( c );
    if ( ! raw.ok ){
        __mrerror__ = raw;
        throw "levenshtein matches failed:" + tojson(raw);
    }

    return tojson(raw);

}

Adding support in a driver
One problem with extending MongoDB this way is that we must add support for the new command in every layer between our client code and the DB. We have already added support to the JS shell, but many applications access MongoDB through some kind of driver interfacing with the DB server.

In my case, I was using Sominum’s congomongo Clojure library. This means adding support in two different layers: the low-level Java driver and the Clojure wrapper library.
Fortunately, the design of the library and the Java driver makes it possible to add support for the command entirely in client code, without further modification of the library sources. Congomongo’s coerce function also makes it very easy to transform data structures to and from Clojure’s native data types and BSON objects. An example implementation can be found in this Github gist.
