MiddlewareGuru

Is ESB a single point of failure?

If you use an ESB (Enterprise Service Bus) or a gateway extensively in your SOA infrastructure, this question might resonate with you. Whether an ESB is essential for implementing SOA is up for debate. If your enterprise has chosen to make an ESB an integral part of its SOA, there are a few pitfalls to be aware of and to plan for ahead of time.


First, let me introduce the term ESB briefly: an Enterprise Service Bus helps connect disparate systems in an enterprise with the least effort. ESBs add value by implementing reusable functions such as security, logging and transformation. ESB products range from light-weight, nimble gateways at the front (IBM DataPower and Layer 7, for example) to ESBs meant for heavy lifting, such as IBM WebSphere Message Broker or Microsoft BizTalk. Typically, ESBs connect services offered by providers to the clients that make use of those services. In an SOA environment, each service could be used by multiple clients.

 

Figure 1 – A happy ESB scenario

In Figure 1, a manufacturing firm has clients from its Marketing, Sales and Supply Chain business functions (‘domains’) using services that offer information about customers, orders and products, respectively. The sparsely scattered colored boxes represent the normal flow of messages. For simplicity, reuse of multiple services by one client is not shown. The services are meeting their SLAs (Service Level Agreements), the ESB is in great health and the clients support their respective business domains.

 

What could go wrong here? It depends. If you have large capacity, high availability and redundancy relative to a small volume of traffic, then problems might not escalate into catastrophic failures. If your enterprise does not use ESBs extensively for online transactions, or uses an asynchronous transport such as MQ, then there could be a time buffer in which to address problems. If neither applies to you, below are some common scenarios that could put your team in the spotlight (for the wrong reasons) and bring the business to a standstill. All of my clients with prolific ESB use faced these problems at one level or another. Often there is little forethought or planning to tackle the resulting crippling outages.

 

  • A service slowdown – One of the services might slow down, causing the ESB to be clogged with request messages meant for that service.
  • A rogue client – One of the client applications might send an unexpected surge of requests, unintentional or malicious, causing the ESB to become clogged.
  • Resource constraints on ESB – The ESB might be running out of capacity because of insufficient memory or CPU, or another resource constraint such as limited network bandwidth.


A service slow-down clogs ESB


Figure 2 – A service slows down

Figure 2 depicts a situation where the Orders service slows down, for example because of an issue with an underlying database. The Customers and Products services are healthy and available. The colored boxes once again represent the number of messages, in this case showing a build-up at some points. As a result of this slowdown:

  • The ESB is flooded with slow-moving Orders messages that consume resources, which are taken away from requests for the Customers and Products services.
  • The Customers and Products services are healthy, but few incoming requests reach them for processing.
  • The Sales function is directly affected by the slowdown. The Marketing and Supply Chain functions are also affected, because of their dependency on the same ESB.

There are some possible ways to address this problem.

  • Request timeouts – To mitigate risk at the ESB, reduce the time that the ESB waits for a response from the service, the Orders service in this case. In HTTP terms, this is the request timeout value, whose default could be as much as 120 seconds, depending on the implementation. Reducing the timeout to 3 seconds, for example, ensures that requests hold resources for no longer than 3 seconds, making way for other requests. Applied to the example, this option limits the impact of the slowdown to the clients of the Orders service.
  • Federated ESB – This is a way to partition the ESB by business function. If the Marketing ESB does not process messages for the Orders service, then clients using that ESB are not affected by a slowdown in the Orders service. This option, coupled with request timeout enforcement, could dramatically improve quality of service. Figure 3 shows an example of a federated ESB architecture.



Figure 3 – Federated ESB
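The request timeout option described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the two service functions, the slot count of 2 and the shortened 0.05-second timeout are hypothetical stand-ins chosen so the demo finishes quickly, and no real ESB product is modeled.

```python
import asyncio

REQUEST_TIMEOUT_S = 0.05  # hypothetical; the article's example value is 3 seconds


async def orders_service(request):
    """Hypothetical stand-in for a slow Orders backend."""
    await asyncio.sleep(0.5)
    return f"orders:{request}"


async def customers_service(request):
    """Hypothetical stand-in for a healthy Customers backend."""
    await asyncio.sleep(0.001)
    return f"customers:{request}"


async def forward(slots, service, request, timeout_s=REQUEST_TIMEOUT_S):
    """Forward one request, holding an ESB slot for at most timeout_s."""
    async with slots:
        try:
            # wait_for cancels the pending call when the timeout expires,
            # so the slot is released for the next waiting request.
            return await asyncio.wait_for(service(request), timeout_s)
        except asyncio.TimeoutError:
            return "timeout"


async def demo():
    slots = asyncio.Semaphore(2)  # the ESB's limited processing capacity
    calls = [forward(slots, orders_service, i) for i in range(4)]
    calls.append(forward(slots, customers_service, "c1"))
    return await asyncio.gather(*calls)


results = asyncio.run(demo())
print(results)
```

In this sketch the four slow Orders requests are each given up after the timeout, so the Customers request queued behind them still gets a slot and completes. Without the timeout, it would wait for the full duration of the slow calls ahead of it.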

A rogue client floods ESB with requests

One of the clients connecting through the ESB might experience a surge in requests for whatever reason. In most cases, it is more likely to be a peak in volume or an unplanned batch job than something malicious, such as a denial-of-service attack. See the figure below.


Figure 4 – A client floods the ESB with a spurt of request messages

In the above case, all 3 services are in good health. Sales clients are sending an abnormally large number of requests for the Orders service. As a result, Marketing and Supply Chain clients might not be able to achieve the same throughput for their messages as they do in normal circumstances. This might translate into a longer wait for a customer filling in a marketing survey, or a longer time to process an order at a point-of-sale terminal. Let’s look at some possible remedies.

 

  • Client-specific runtime policies to shape or throttle traffic – Most ESB products, especially ones designed as gateways, support runtime policies to shape or throttle traffic. Shaping applies when the volume of requests from a certain client exceeds a limit specified in the policy: once the limit is hit, further requests from the client are reduced in priority and are allotted fewer resources. Throttling means that any requests above the limit specified in a policy are rejected until the client’s throughput complies with that limit. Such runtime policies should be applied carefully. One might rather shape or throttle traffic from batch processes or from Marketing activities than limit traffic from core business domains such as Sales or Supply Chain. The calculation of throughput for such policies varies by implementation and thus should be examined carefully.
  • Federated ESB architecture – As described in the previous section. A spurt of traffic in the Sales ESB, for example, affects only clients in the Sales domain.
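To make the difference between shaping and throttling concrete, here is a minimal sketch of a per-client policy. It assumes a simple fixed-window request count, which is only one of several ways a product might calculate throughput. The class name, field names and limits are hypothetical and do not come from any ESB product.

```python
from dataclasses import dataclass


@dataclass
class ClientPolicy:
    """Hypothetical per-client policy: a request limit per time window."""
    limit: int            # requests allowed per window
    window_s: float       # window length in seconds
    action: str           # "throttle" (reject) or "shape" (deprioritize)
    window_start: float = 0.0
    count: int = 0

    def decide(self, now):
        """Return 'accept', 'low_priority' or 'reject' for a request at time now."""
        if now - self.window_start >= self.window_s:
            self.window_start = now  # start a fresh counting window
            self.count = 0
        self.count += 1
        if self.count <= self.limit:
            return "accept"
        # Over the limit: throttling rejects, shaping lowers the priority.
        return "reject" if self.action == "throttle" else "low_priority"


policies = {
    "batch": ClientPolicy(limit=2, window_s=1.0, action="throttle"),
    "marketing": ClientPolicy(limit=2, window_s=1.0, action="shape"),
}

# Timestamps are passed in explicitly so the example is deterministic.
batch = [policies["batch"].decide(t) for t in (0.0, 0.1, 0.2, 1.5)]
marketing = [policies["marketing"].decide(t) for t in (0.0, 0.1, 0.2)]
print(batch)
print(marketing)
```

The third batch request within the same window is rejected, while the third marketing request is still served at lower priority. The batch request at 1.5 seconds falls into a new window and is accepted again.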

 

Resource constraints on ESB


A resource-constrained system is the most familiar scenario. An ESB might run out of CPU cycles or memory, or experience reduced network bandwidth, which causes a build-up of pending messages within the ESB. This results in a poor user experience or an outage for the clients. The services are in good health, but are barely used because of the hiccups on the ESB.

 


Figure 5 – Resource constraints on ESB


The easiest way to address resource constraints is to add more resources or to scale horizontally by adding more servers or devices to the ESB cluster. The availability of private clouds makes scaling much easier, especially when the ESB servers can be hosted on operating systems such as Linux or Windows.

 

There might also be underlying issues that cause the resource constraints in the first place, such as memory leaks or functions such as XML schema validation being performed unintentionally.

 

Something that is often overlooked is the potential size of response messages from services. A 1 MB report being parsed, validated or transformed in flight could easily consume quite a bit of memory and CPU at the expense of messages from online transactions. A possible way to address such problems is to partition or group the ESB by the functions it performs or by the nature of the content processed. See Figure 6 for a sample partitioned ESB architecture.


 

Figure 6 – ESB partitions

 

Another benefit of partitioning is the ability to set runtime service-level policies by service (as opposed to by client), while creating isolation between groups of clients. For instance, all services are offered at a lower SLA over Batch-ESB than over OLTP-ESB. During unexpected surges, it might be possible to further adjust SLAs for the partitions in order to maintain an acceptable user experience for all critical business functions.
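As a rough illustration of partitioning by the nature of the content, the routing decision could look like the sketch below. The partition names Batch-ESB and OLTP-ESB come from the example above. The function name, the "batch" source label and the 1 MB threshold are assumptions made for illustration only.

```python
LARGE_MESSAGE_BYTES = 1_000_000  # hypothetical threshold, echoing the 1 MB report example


def choose_partition(source, expected_response_bytes):
    """Pick an ESB partition from the request's origin and expected payload size."""
    if source == "batch" or expected_response_bytes >= LARGE_MESSAGE_BYTES:
        # Heavy or non-interactive work is kept away from online transactions.
        return "Batch-ESB"
    return "OLTP-ESB"


print(choose_partition("point-of-sale", 2_000))   # small online transaction
print(choose_partition("reporting", 1_500_000))   # large report response
print(choose_partition("batch", 500))             # batch job, regardless of size
```

The point of the design is that a large report or a batch run can only compete for resources with other work in the same partition.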


Comments, feedback and ideas are welcome!

Category: Messaging, Patterns, SOA


