A better Task Engine with MongoDB

What is a Task Engine

    A process contains two main types of activities, or steps: steps that the system executes automatically and steps that require human intervention. The latter are also known as Tasks or Human Activities.

    The life cycle of such a step is controlled by a dedicated component called the Task Engine. A Task Engine is responsible for creating, updating, searching, completing and deleting tasks.

    Besides life-cycle management, more advanced Task Engines take on other important responsibilities:

  • task events - perform actions in response to task life-cycle changes or business data changes
  • task audit - keep track of who changed a task, when the change was made and what exactly changed (field old value -> field new value)
  • task permission - control who can perform certain actions on a task

    The Task Engine is the component where most user requests end up (task search, read, update), so it largely determines how end users perceive a BPM system.
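
    The task audit responsibility listed above (who, when, old value -> new value) can be illustrated with a minimal sketch. The names AuditEntry and audit_changes are hypothetical and are not part of any webMethods API; the sketch assumes flat task snapshots represented as dictionaries.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class AuditEntry:
    """One recorded change: who changed which field, when, and how."""
    user: str
    timestamp: datetime
    field: str
    old_value: Any
    new_value: Any


def audit_changes(user: str, before: dict, after: dict) -> list:
    """Compare two flat task snapshots; return one entry per changed field.

    Assumption: a field missing from a snapshot is treated like None.
    """
    now = datetime.now(timezone.utc)
    entries = []
    # Visit every field present in either snapshot, in a stable order.
    for name in sorted(before.keys() | after.keys()):
        old, new = before.get(name), after.get(name)
        if old != new:
            entries.append(AuditEntry(user, now, name, old, new))
    return entries


before = {"status": "ACTIVE", "assignee": "alice", "amount": 100}
after = {"status": "COMPLETED", "assignee": "alice", "amount": 120}
for entry in audit_changes("bob", before, after):
    print(f"{entry.user} changed {entry.field}: "
          f"{entry.old_value!r} -> {entry.new_value!r}")
```

    Unchanged fields (assignee in this example) produce no audit entry, so the audit trail grows only with actual modifications.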

How tasks are persisted

    The data behind a Task Step varies with the business requirements. The root document is called TaskData; everything beneath it differs from project to project.

    The Task Engine persists data by serializing the TaskData document and saving it as a BLOB in a DB table.

    Each time a task instance is updated, the Task Engine de-serializes the BLOB into a Java object, modifies the data, serializes the Java object again, and updates the BLOB value in the DB.

    When a search is performed, the values inside the BLOB have to be compared against the search criteria. The Task Engine therefore de-serializes each BLOB, compares the resulting Java object with the search criteria and adds the matching entries to a list. On large production systems the number of required de-serializations can reach hundreds of thousands or even millions, seriously affecting the response time and, with it, the user experience.
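
    The persistence pattern described above can be sketched in a few lines. This is an illustration only, not the actual webMethods implementation: it uses Python's pickle and an in-memory SQLite database as stand-ins for Java serialization and the production DB, and the table and function names are hypothetical.

```python
import pickle
import sqlite3


def create_store() -> sqlite3.Connection:
    """Create an in-memory table that holds each task as an opaque BLOB."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, task_data BLOB)")
    return conn


def save_task(conn, task_id, task_data):
    """Serialize the whole document and store it as a BLOB."""
    conn.execute(
        "INSERT OR REPLACE INTO task (id, task_data) VALUES (?, ?)",
        (task_id, pickle.dumps(task_data)),
    )


def update_task(conn, task_id, **changes):
    """De-serialize, modify, serialize again, write back."""
    (blob,) = conn.execute(
        "SELECT task_data FROM task WHERE id = ?", (task_id,)
    ).fetchone()
    task_data = pickle.loads(blob)
    task_data.update(changes)
    save_task(conn, task_id, task_data)


def search_tasks(conn, **criteria):
    """The DB cannot look inside the BLOB, so every row is de-serialized."""
    matches = []
    for task_id, blob in conn.execute("SELECT id, task_data FROM task ORDER BY id"):
        task_data = pickle.loads(blob)  # one de-serialization per stored task
        if all(task_data.get(key) == value for key, value in criteria.items()):
            matches.append(task_id)
    return matches


conn = create_store()
save_task(conn, 1, {"sapNumber": "4711", "department": "finance", "amount": 100})
save_task(conn, 2, {"sapNumber": "4712", "department": "sales", "amount": 250})
update_task(conn, 2, department="finance")
print(search_tasks(conn, department="finance"))  # prints [1, 2]
```

    The cost of search_tasks grows with the total number of stored tasks rather than with the number of matches, which is the effect described above.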


Performance improvements

The webMethods Task Engine, which is integrated into My webMethods Server (MWS), offers two main ways to improve performance.

  • Caching

Each task that is fetched from the DB is kept in memory and synchronised across the MWS cluster. Although the performance improvement is significant, it can also add a lot of configuration complexity in clustered environments. The number of tasks that can be cached is limited and for un-cached tasks the response time is still high.

  • Indexing

Starting with webMethods version 8, the fields on which searches will be made can be defined at development time. For example, if the business data is InvoiceData and searches will only be performed on sapNumber, amount and department, then only those fields need to be indexed. At deployment time a new table is created with the indexed data.

MWS Indexing

The advantage is that a search takes only the normal SQL time, with no overhead for Java serialisation and de-serialisation. The downsides, however, are also significant:

  • the caching capability is lost
  • the search fields must be carefully chosen since the Index Table increases the size of the DB
  • it is not possible to search over arrays, only plain types
  • it is not possible to use aggregation
  • it is not possible to search at the same time for indexed fields and non-indexed fields
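
The index-table idea can be sketched as follows. The table layout, column names and function names are assumptions made for illustration, not the schema that MWS actually generates; SQLite and pickle again stand in for the production DB and Java serialization. The InvoiceData fields sapNumber, amount and department are taken from the example above, while lineItems is a hypothetical array field.

```python
import pickle
import sqlite3

# Fields chosen for indexing at development time (from the InvoiceData example).
INDEXED_FIELDS = ("sapNumber", "amount", "department")


def create_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Full document, still stored as an opaque BLOB.
    conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, task_data BLOB)")
    # Hypothetical index table: one plain column per indexed field.
    conn.execute(
        "CREATE TABLE task_index ("
        "task_id INTEGER PRIMARY KEY, sapNumber TEXT, amount REAL, department TEXT)"
    )
    conn.execute("CREATE INDEX idx_department ON task_index (department)")
    return conn


def save_task(conn, task_id, invoice):
    """Write the BLOB and copy the indexed fields into the index table."""
    conn.execute(
        "INSERT OR REPLACE INTO task VALUES (?, ?)",
        (task_id, pickle.dumps(invoice)),
    )
    conn.execute(
        "INSERT OR REPLACE INTO task_index VALUES (?, ?, ?, ?)",
        (task_id, *(invoice.get(name) for name in INDEXED_FIELDS)),
    )


def search_by_department(conn, department):
    """Plain SQL lookup: no BLOB is de-serialized."""
    rows = conn.execute(
        "SELECT task_id FROM task_index WHERE department = ? ORDER BY task_id",
        (department,),
    )
    return [task_id for (task_id,) in rows]


conn = create_store()
save_task(conn, 1, {"sapNumber": "4711", "amount": 100.0,
                    "department": "finance", "lineItems": ["paper", "toner"]})
save_task(conn, 2, {"sapNumber": "4712", "amount": 250.0,
                    "department": "sales", "lineItems": ["laptop"]})
print(search_by_department(conn, "finance"))  # prints [1]
```

In this sketch the lineItems array exists only inside the BLOB, so it cannot be searched through the index table, and every indexed field is stored twice, which mirrors two of the downsides listed above.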

An alternative to a Relational DB based Task Engine

    Since the persistence layer of the Task Engine was first designed, solutions specialized in document-based data storage have been developed in parallel. The most popular is MongoDB, an open-source document database.

    Many of the persistence challenges that the webMethods Task Engine faces are already solved out of the box by MongoDB. By switching from a relational DB to a document-oriented DB, the Task Engine can push the responsibility for document management one layer down, taking a serious load off My webMethods Server (the Task Engine container).

    The most important MongoDB features that the Task Engine would benefit from are:

  • Document-Oriented Storage
    The Task Engine is expected to store and manipulate a variety of document structures. Each task type defines its own document structure (under TaskData) which can undergo modifications as the product matures.
    "Data in MongoDB has a flexible schema. Collections do not enforce document structure. This flexibility gives you data-modeling choices to match your application and its performance requirements" (http://docs.mongodb.org/manual/data-modeling/).
  • Full Index Support
    Search is one of the operations a user performs most often, so a short response time is very important. Without indexes, every document in a collection has to be scanned. MongoDB offers a variety of index types (Single Field Indexes, Multikey Indexes, Text Indexes, Hashed Indexes) which, when properly defined, can lead to much faster searches than the existing Task Engine indexing. "Indexes provide high performance read operations for frequently used queries" (http://docs.mongodb.org/manual/indexes/)
  • Sharding
    On an intensively used BPM system, the amount of data stored for human steps will increase significantly over time. One approach to keeping up with the ever-larger amount of data is to scale the MongoDB instances horizontally and "shard" the Task Data across these instances. The advantage is that the MWS cluster does not have to be scaled based on the amount of data; only the data layer has to be adjusted. "Sharding is the process of storing data records across multiple machines and is MongoDB’s approach to meeting the demands of data growth" (http://docs.mongodb.org/manual/sharding/)
  • Caching
    "MongoDB keeps all of the most recently used data in RAM. If you have created indexes for your queries and your working data set fits in RAM, MongoDB serves all queries from memory" (http://docs.mongodb.org/manual/faq/fundamentals/). The caching mechanism at the Task Engine level can be replaced by the one built into MongoDB.
  • Aggregation
    Almost any type of query can be performed using the aggregation framework. For example, if one wishes to know the sum of all invoices received in a certain time interval, there is a simple way to do this at the MongoDB level. The Task Engine, on the other hand, has to make one or more queries first and then compute the sum in the business logic layer. "Aggregations are operations that process data records and return computed results. MongoDB provides a rich set of aggregation operations that examine and perform calculations on the data sets" (http://docs.mongodb.org/manual/core/aggregation-introduction/)
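
The invoice-sum example from the Aggregation item can be sketched as an aggregation pipeline. The field names taskData.receivedAt and taskData.amount are assumptions about how a task document might be laid out; they do not come from webMethods. The pipeline is built as plain Python data so the sketch runs without a MongoDB server. With the PyMongo driver, such a list would typically be passed to Collection.aggregate(). The second function shows, for comparison, the equivalent computation done in the business logic layer after the documents have been fetched.

```python
from datetime import datetime


def invoice_sum_pipeline(start: datetime, end: datetime) -> list:
    """Build a pipeline that sums invoice amounts received in [start, end)."""
    return [
        # Stage 1: keep only tasks whose invoice arrived in the interval.
        {"$match": {"taskData.receivedAt": {"$gte": start, "$lt": end}}},
        # Stage 2: collapse all matches into one group and sum the amounts.
        {"$group": {"_id": None, "total": {"$sum": "$taskData.amount"}}},
    ]


def sum_invoices_in_application(tasks, start, end):
    """What the application has to do without aggregation:
    fetch the documents first, then filter and add them up itself."""
    return sum(
        task["taskData"]["amount"]
        for task in tasks
        if start <= task["taskData"]["receivedAt"] < end
    )


# Hypothetical task documents, already fetched into application memory.
tasks = [
    {"taskData": {"amount": 100, "receivedAt": datetime(2014, 1, 10)}},
    {"taskData": {"amount": 250, "receivedAt": datetime(2014, 1, 20)}},
    {"taskData": {"amount": 999, "receivedAt": datetime(2014, 3, 1)}},
]
start, end = datetime(2014, 1, 1), datetime(2014, 2, 1)

print(invoice_sum_pipeline(start, end))
print(sum_invoices_in_application(tasks, start, end))  # prints 350
```

With the pipeline, only the computed total travels from the database to the application instead of every matching task document.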