• The power of ActiveMq and virtual topics

    24-04-2013Author: basdenooijer

    ActiveMq is an opensource message broker, with lots of enterprise features. And while it’s written in Java, it can be used in almost any environment using one of the interfaces it provides. We used it in a PHP project I’ve worked on recently, with the PHP Stomp extension as the client.

    If you’re not familiar with ActiveMq, here’s a short introduction: ActiveMq is a message broker. It basically receives messages from producers, these messages are placed in a queue to be handled by consumers. One message might be consumed by multiple consumers, and the producer doesn’t even need to know what consumers handle its messages. This allows for a very flexible/dynamical workflow and simplifies the producers and consumers. (read on for an example…)
    On top of the basic functionality many additional features are available, for instance: master/slave setups, clustering, message queue persistence, message groups and virtual destinations.

    In this post I want to highlight ActiveMq in general, and one feature in particular that proved to be crucial for our project: virtual topics.

    Scenario

    To demonstrate the concept of virtual topics I’m first going to introduce you to an imaginary project: a data import service for a CRM system. This service needs to be able to process all kinds of input data:

    • XML files with contact details, in several formats (depending on the source)
    • Emails
    • Scanned business cards
    • PDF documents with contact details

    Also, any new data formats that might be used in the future should be easy to add without impacting existing functionality.

    While the data formats are very different, there still is quite some overlap in functionality if we look at the steps required to extract the data we need for each type of input data.

    XML file

    1. arrival
    2. detect data type
    3. detect XML format using XSDs to match
    4. extract data based on detected XML format
    5. save data in CRM

    Email

    1. arrival
    2. detect data type
    3. extract relevant text content from the email source
    4. process text using address detection/extraction algorithm
    5. save data in CRM

    Scanned business card

    1. arrival
    2. detect data type
    3. image-to-text (OCR)
    4. process text using address detection/extraction algorithm
    5. save data in CRM

    PDF document

    1. arrival
    2. detect data type
    3. process in Tika (pdf-to-text)
    4. if Tika could not extract data (for instance a scanned document):
      1. PDF to image
      2. image-to-text (OCR)
    5. process text using address detection/extraction algorithm
    6. save data in CRM

    Why would we need ActiveMq?

    Only the underlined steps are unique for a single data type, all the others are re-usable. Now, you could go ahead and create this functionality in plain PHP (or any other language) using OOP or other mechanisms for the re-use of functionality. Depending on the project at hand this might be ok, but you miss out on some very important advantages of implementing this using ActiveMq:

    • Components are not tightly coupled, so you can easily mix the best technologies fitting for the jobs. In many cases the best solution varies per task at hand. One component might be a PHP script, another one a Ruby script and yet another one might execute a tool like Tika.
    • On top of using several technologies, you can even easily using multiple servers/platforms. This can be for scaling, or because a specific task requires a special server platform.
    • As one queue can have multiple consumers you can easily add more consumers to a busy queue. You could run multiple consumers for a queue on a single system, to better utilize cores without more complex threaded programming. Using a platform like Amazon EC2 you could even scale out on demand to efficiently handle big request peaks.
    • Each component is simple and has only one task. It receives X and should produce Y or X. This greatly simplifies development, testing, documenting and the debugging of issues.
    • If multiple consumers need to process the same message this can easily be done in parallel, even on separate systems to improve performance.

    Implementing the scenario in ActiveMq

    So, how would  this be implemented using ActiveMq? We could link all steps together using producers and consumers. The arrival would send a message to the ‘detect data type’ component and so one. However, this is not very flexible. What if we also need to copy the input data to an archive? We have three options:

    1. Add this to the ‘arrival’ component
    2. Add this to the ‘detect data type’ component
    3. Introduce a new step to the flow: ‘arrival’ –> ‘archive’ –> ‘detect-data-type’ –> …..

    None of the above are ideal. Option 1 and 2 introduce new responsibility into components that should only have a single task, so can quickly be ruled out. Option 3 is slightly better, however still causes issues. To add a new step after ‘arrival’ we need to update the ‘arrival’ component as well. It now need to sends it’s messages to a topic for the ‘archive’ component , instead of the topic for the ‘detect-data-type’ component. So our changes are no longer isolated…
    Also, while we are required to log the original input data, this doesn’t necessarily have to be done before any other steps. So this could be done in parallel, instead of added steps to the existing flow, slowing it down.

    What we ideally want is this:

    'arrival' --> 'detect-data-type' --> .....
              --> 'archive'

    The original flow still exists, we’ve just added another task on a ‘side track’. That’s where ActiveMQ’s virtual topics come into play. Virtual topics work like this:

    • The ‘arrival’ component sends messages to VirtualTopic.arrival
    • The ‘detect-data-type’ component consumes the topic Consumer.detect-data-type.VirtualTopic.arrival
    • The ‘archive’ component consumes the topic Consumer.archive.VirtualTopic.arrival

    ActiveMQ takes care of forwarding the messages sent to VirtualTopic.arrival to both virtual topics. Within each virtual topic a message is processed only once, so you can safely have multiple consumers for a virtual topic.

    Also, you can use wildcard for matching topics. The ‘detect-data-type’ component can send messages including the detected data type, for instance “VirtualTopic.data.email” and “VirtualTopic.data.pdf”. If consumer X only wants PDF files, it should consume “Consumer.x.VirtualTopic.data.pdf”. But if  consumer Y wants to handle any file type, it can consume “Consumer.y.VirtualTopic.data.*”. And ofcourse, this will not interfere with consumer X. For a PDF file both will consume the message, for XML only Y will consume it.

    With these two concepts we can create this matrix of components, covering all scenarios:

    COMPONENT               CONSUMES                                                PRODUCES         
    
    arrival                 --- (entrypoint)                                    	VirtualTopic.arrival
    
    detect-data-type        Consumer.detect-data-type.VirtualTopic.arrival	        VirtualTopic.data.xml
    										VirtualTopic.data.email
    										VirtualTopic.data.pdf
    										VirtualTopic.data.card
    
    detect-xml-format       Consumer.detect-xml-format.VirtualTopic.data.xml	VirtualTopic.xml.format1
    									        VirtualTopic.xml.format2
    extract-xml-data        Consumer.extract-xml-data.VirtualTopic.xml.*(#1)	VirtualTopic.contactdetails
    
    email-to-text           Consumer.email-to-text.VirtualTopic.data.email    	VirtualTopic.data.text
    
    image-to-text		Consumer.image-to-text.VirtualTopic.data.image    	VirtualTopic.data.text
    
    pdf-to-image		Consumer.pdf-to-image.VirtualTopic.data.pdf(#2)	        VirtualTopic.data.image
    
    pdf-to-text		Consumer.pdf-to-text.VirtualTopic.data.pdf(#2)  	VirtualTopic.data.text
    
    text-to-address-data    Consumer.text-to-address-data.VirtualTopic.data.text    VirtualTopic.contactdetails	
    
    save-data-to-crm        Consumer.save-data-to-crm.VirtualTopic.contactdetails	---

    With this matrix you should be able to create all flows of the scenario by create the sequence of consumers and producers, a nice exercise that I’m leaving up to you… In case of questions you can always post a comment!

    I’ve also added two notes for special cases in the matrix:
    (#1): The two XML formats are handled by a single component in this example, to demonstrate the wildcard
    (#2): Two components consume the same PDF topic. A PDF is either text or image. Both components try to process the PDF simultaneously, and one will always fail. This introduces some overhead, but should be faster (given enough resources) as a second try is never needed.

    Conclusion

    ActiveMQ offers flexibility and scaling that can otherwise be very hard to achieve. There is a bit of a learning curve, however the time spent should quickly be earned back. If you have a project that requires complex multistep tasks, take a look at ActiveMq. For me it easily beats other solutions I’ve seen: custom deamons, cronjobs polling every minute or gearman. ActiveMq can also be used for simple single tasks like send email, however in that case something like Gearman will also be fine.

    This post is just an introduction to the possibilities of ActiveMq. The actual implementation has been omitted, as this will vary a lot per project.
    Also, the example flow above has been simplified, omitting important details like errorhandling. Maybe I will post a follow-up with more details.

    References

    ,
  • Comments are closed.