7 Oct 2010

The curious case of Load Balancers

On one of our enterprise applications we learned an interesting lesson, one that shows why architecture owes more to experience (say 60%) than to theory alone (40%).

The system had a couple of functionally decoupled application areas. From the outset the design/development team had planned the component deployment with scalability in mind, and it was with this perspective that they introduced Load Balancers into the architecture. Now came the classic problem: the team that advocated the Load Balancers and decided where they fit in the architecture had never actually worked with them. Still, after discussions with the hardware and networking teams, this was the architecture that was finalised.


LB001 was supposed to be the entry point into the application, and the cluster of web servers behind it would then route requests to the different functional areas based on URL mapping. The web servers also had a secondary task: adding a layer of single sign-on authentication tokens. For each functional area, a second layer of load balancers (LB002-LB004) was envisaged to balance load across the application servers. So far so good. LB001 was configured for round-robin scheduling (a session's requests could go through any of the web servers) and the others for sticky sessions (i.e. once a session took a path, it would keep that path for the remainder of the session).
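To make the two scheduling modes concrete, here is a minimal sketch in Python. It is only an illustration, not how any real load balancer is implemented: the class names, the `pick` method and the backend names (`web1`, `app1`, ...) are all made up for this post.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands each request to the next backend in turn (the LB001 mode)."""

    def __init__(self, backends):
        self._next_backend = cycle(backends)

    def pick(self, session_key):
        # The session key is ignored: any request may land on any backend.
        return next(self._next_backend)


class StickyBalancer:
    """Pins each session key to one backend (the LB002-LB004 mode)."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._assigned = {}

    def pick(self, session_key):
        # First request for this key: assign the next backend in rotation.
        if session_key not in self._assigned:
            index = len(self._assigned) % len(self._backends)
            self._assigned[session_key] = self._backends[index]
        # Every later request with the same key follows the same path.
        return self._assigned[session_key]
```

The important detail is that a sticky balancer is only as good as its session key. If every request arrives with the same key, every request gets the same backend.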

The system was deployed and, interestingly, there were no problems until the user load started increasing six months later. That is when the application servers started reporting failures caused by memory and resource constraints. All preliminary investigations pointed to load being distributed unevenly across the servers. The development team felt this should not happen, as the load balancers were in place and should have been functioning as expected. The networking team blamed the development team, who in turn blamed the networking team, until it was decided to sit together, simulate the setup and crack the issue.

The issue was really interesting. When LB001 forwards load-balanced incoming network packets, it rewrites them so that components further down in the architecture see the packets as coming from LB001, i.e. the actual external IP address is lost. As a result LB002, LB003 and LB004 assume that all packets are coming from LB001. And since they are configured for sticky sessions keyed on IP, every request is treated as part of the same session and passed on to only one application server. That means the load balancers were effectively passing the buck to a single server and weren't doing any balancing at all! Such a simple point, and we overlooked it in the design!
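The effect is easy to reproduce in a few lines. The sketch below is a toy simulation, not our actual setup: the server names, the address used for LB001, the client addresses and the hash-based stickiness rule are all assumptions chosen for illustration.

```python
import zlib
from collections import Counter

APP_SERVERS = ["app1", "app2", "app3"]  # hypothetical application servers
LB001_IP = "10.0.0.1"                   # hypothetical internal address of LB001


def sticky_by_ip(source_ip, servers=APP_SERVERS):
    """IP-based stickiness: the same source IP always maps to the same server."""
    return servers[zlib.crc32(source_ip.encode()) % len(servers)]


def simulate(client_ips, rewrite_source):
    """Count requests per app server as a second-tier balancer would see them."""
    hits = Counter()
    for client_ip in client_ips:
        # With rewriting on, the second tier only ever sees LB001's address.
        seen_ip = LB001_IP if rewrite_source else client_ip
        hits[sticky_by_ip(seen_ip)] += 1
    return hits


clients = [f"203.0.113.{i}" for i in range(1, 201)]  # 200 made-up clients
print("source rewritten:", dict(simulate(clients, rewrite_source=True)))
print("source preserved:", dict(simulate(clients, rewrite_source=False)))
```

With the source address rewritten, all 200 requests land on a single server, because the balancer has only one distinct key to work with. With the original addresses preserved, the same rule spreads them across the servers.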

Well, there are various ways to fix this one. In our case we changed the sticky-session configuration to work on cookies instead of IP addresses, and that worked for us. Another option might be to restore the original source IP at the web server level (the actual client IP can usually be found in one of the HTTP headers, and that value can be used for the rewrite).
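Both fixes boil down to choosing a better stickiness key. Here is a rough sketch of that choice, again with assumptions of my own: the cookie name `SESSIONID` is hypothetical, and `X-Forwarded-For` is simply the header proxies commonly use to carry the client address; your load balancer may use a different one.

```python
import zlib
from http.cookies import SimpleCookie

APP_SERVERS = ["app1", "app2", "app3"]  # hypothetical application servers


def sticky_key(headers, peer_ip, cookie_name="SESSIONID"):
    """Pick the stickiness key for a request, best source first."""
    # 1. Prefer the session cookie: it identifies the session itself.
    cookies = SimpleCookie(headers.get("Cookie", ""))
    if cookie_name in cookies:
        return "cookie:" + cookies[cookie_name].value
    # 2. Otherwise use the client IP recorded by an upstream proxy, if any.
    forwarded = headers.get("X-Forwarded-For")
    if forwarded:
        # The first entry in the list is the original client.
        return "ip:" + forwarded.split(",")[0].strip()
    # 3. Last resort: the peer address, which may just be the upstream balancer.
    return "ip:" + peer_ip


def pick_server(key, servers=APP_SERVERS):
    """Map a key to a server; the same key always gets the same server."""
    return servers[zlib.crc32(key.encode()) % len(servers)]
```

For example, `sticky_key({"Cookie": "SESSIONID=abc123"}, "10.0.0.1")` gives `"cookie:abc123"`, so two users arriving through LB001 are kept apart even though their peer address is identical.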
