Five Fallacies of Application Ruggedization
9:01 pm in SOA Solutions by admin
Summa architects often find ourselves smoke-jumping late into failing projects to fire-fight business-critical web and enterprise applications. Often the failures are the direct result of a road paved with good intentions and of penny-wise but pound-foolish decisions. Here is my top-five list of enterprise application architecture fallacies that lead to significant failures. Each fallacy deserves a lot more discussion – but let’s start with some thought-provoking ideas:
1. Just cluster the servers for high availability
Elimination of SPOFs (single points of failure) is indeed a key characteristic of highly available systems – but it is not the only one. Counter this vendor-propagated fallacy by knowing and understanding the whole picture:
- Custom and commercial applications or components may not be designed to correctly exploit redundant hardware.
- Services may not be configured to fail over correctly.
- Dependent resources (known and implicit) such as databases and/or underlying disk storage may not be redundant.
- Application data stored in memory, on disk, in queues, in files, etc. may not be properly replicated to the failover servers in the cluster.
- You may also need to think about failover between locations and network connectivity in between…
And… you must take steps to test a variety of failure modes: hardware, software, network, storage and dependent services to verify that your configuration is indeed fault resilient.
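As a minimal sketch of what such a failure test can look like, the Python below simulates a two-node cluster and injects failures one at a time. Everything in it is hypothetical: `FakeNode`, `call_with_failover` and `run_failure_drill` are illustrative names, not part of any clustering product, and a real drill would kill processes, drop network links and fail storage rather than flip a flag.

```python
class FakeNode:
    """Hypothetical stand-in for one server in a cluster."""

    def __init__(self, name):
        self.name = name
        self.up = True  # flipping this flag simulates a server failure

    def handle(self, request):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def call_with_failover(nodes, request):
    """Try each node in order and return the first successful response."""
    errors = []
    for node in nodes:
        try:
            return node.handle(request)
        except ConnectionError as exc:
            errors.append(str(exc))
    raise RuntimeError("all nodes failed: " + "; ".join(errors))


def run_failure_drill():
    """Inject failures one at a time and record what a client observes."""
    primary, secondary = FakeNode("primary"), FakeNode("secondary")
    nodes = [primary, secondary]
    results = {}

    # Baseline: with everything healthy the primary should answer.
    results["healthy"] = call_with_failover(nodes, "order-1")

    # Failure mode 1: primary fails, the secondary should take over.
    primary.up = False
    results["primary_down"] = call_with_failover(nodes, "order-2")

    # Failure mode 2: both nodes fail, the outage must be reported, not hidden.
    secondary.up = False
    try:
        call_with_failover(nodes, "order-3")
        results["all_down"] = "unexpected success"
    except RuntimeError:
        results["all_down"] = "outage detected"
    return results
```

The point is the shape of the drill rather than the toy cluster: name each failure mode, inject it deliberately, and assert what the client should see afterwards.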
2. The firewall (or app-server or web server or database or…) will take care of security
I hear this far less often these days than in the past. Security is a holistic concern that spans the physical site, network, server OS, app server, middleware, database, external services, commercial components and custom code at each tier. Security threats are both external and internal. Authorization schemes are often neglected or, conversely, over-designed to the point that they hurt usability and exceed the abilities of the staff who will administer application security.
3. I need more cache
Developers and application architects love to put caches into application and network designs – fun stuff! But caches are often added too soon, in too many places, and turn out to be far less effective than anticipated. At best the extra cache components add needless complexity; frequently they cause incorrect, inconsistent and difficult-to-test behavior. Take time when designing the system to do capacity planning and modeling: understand the performance goals and the application access profiles, then focus measurement and improvement on the real hotspots and bottlenecks where caching will actually help (and will not be defeated by some other architecture layer).
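One inexpensive way to check whether a cache would help before building it is to replay a recorded access profile through a simulated cache and look at the hit rate. The sketch below assumes an LRU eviction policy and a list of keys as the access trace; `simulate_lru_hit_rate` is a hypothetical helper written for this post, not a library function.

```python
from collections import OrderedDict


def simulate_lru_hit_rate(accesses, capacity):
    """Replay a sequence of keys through a simulated LRU cache.

    Returns the fraction of accesses that would have been cache hits.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(accesses) if accesses else 0.0
```

In this simulation a looping scan over 10 keys (`list(range(10)) * 2`) gets no hits at all from a 5-entry cache, because each key is evicted just before it is needed again. That is the kind of access profile where a cache adds complexity and nothing else.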
4. We can tune in performance later (or just add hardware)
Designing for performance means making the right architectural trade-off decisions to meet capacity and performance goals. This means that you have to start with realistic, attainable capacity goals – transaction volumes and response times – considering both nominal and peak load. Performance goals are often hand-waved (“it has to be fast”) or misstated (“we need a 1 second response time”) because there is a fundamental lack of understanding of system performance engineering. Sometimes the typically valuable sage advice, “premature optimization is the root of all evil”, surfaces as an excuse to evade proper performance concerns during design time.
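One reading of why “we need a 1 second response time” is misstated (my interpretation, not a definition): it names neither a percentile nor a load level. The sketch below shows how a goal can be stated against a percentile and how Little’s Law relates throughput and response time to the number of requests in flight. The function names are hypothetical, and the nearest-rank percentile is just one common convention.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_goal(samples, pct, limit_sec):
    """Check a goal of the form 'pct percent of requests finish within limit_sec'."""
    return percentile(samples, pct) <= limit_sec


def required_concurrency(throughput_per_sec, mean_response_sec):
    """Little's Law: average requests in flight = arrival rate * mean time in system."""
    return throughput_per_sec * mean_response_sec
```

With made-up response times of 90 requests at 0.2 s and 10 requests at 3.0 s, the mean is about 0.48 s, which sounds comfortably under one second, yet the 95th percentile is 3.0 s. By Little’s Law, sustaining 200 requests per second at a mean of 0.5 s implies roughly 100 requests in flight at once, which is a capacity number the architecture has to be designed for.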
5. What is serviceability?
OK – so this is not really stated as a fallacy, but it is an important and often neglected topic that deserves attention. Leveraging the facilities of the environment and building in logging, tracing, performance measurement points and other debugging capabilities that work in a production setting and help to isolate failure causes is easy and inexpensive if done early in the architecture, design and development efforts. It is usually neglected because it is not front and center for the architect, analyst or developer during the early phases of the project. Those who have worked in support and maintenance positions with well-designed facilities for capturing production troubleshooting information will recognize the benefits, and the defensive coding techniques that save the day (though they are often under-appreciated by the business).
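To show how little code a basic performance measurement point needs, here is a minimal sketch using only the Python standard library. The `measured` helper, the logger name and the `request_id` field are assumptions made for illustration; a real system would follow whatever logging and tracing conventions its environment already provides.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name; production code would use its own hierarchy.
logger = logging.getLogger("app.serviceability")


@contextmanager
def measured(operation, request_id):
    """Log the outcome and elapsed time of a block of work.

    request_id is an assumed correlation id that lets support staff
    tie log lines from different tiers back to one request.
    """
    start = time.perf_counter()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "error"
        # Record the stack trace at the point of failure, then re-raise
        # so that normal error handling is not changed.
        logger.exception("op=%s request_id=%s failed", operation, request_id)
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "op=%s request_id=%s outcome=%s elapsed_ms=%.1f",
            operation, request_id, outcome, elapsed_ms,
        )
```

Wrapping a call looks like `with measured("load_order", request_id): ...` (the operation name is made up). Because the output goes through the standard `logging` module, the detail level can be raised or lowered in production without a code change.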
What other common reliability-killing fallacies have you seen in your own experience?