The fine folks at Twitter Engineering recently posted about the performance issues they had over the holiday weekend. Since Saturday, the site has been slow for both users and API calls. While AppliedTrust hasn't (yet) made the leap to Twitter, we recognize how important it is for delivering World Cup news. I give Twitter Engineering tons of credit for being so transparent about the details of the problem - they say:
|
In brief, we made three mistakes:
* We put two critical, fast-growing, high-bandwidth components on the same segment of our internal network.
* Our internal network wasn't appropriately being monitored.
* Our internal network was temporarily misconfigured.
|
Twitter is well known for great application-layer monitoring and instrumentation, so this gap in network monitoring is a surprise. It exposes a common misconception among social software companies - that their server and network infrastructure is "covered" by their hosting provider. Long before a web application reaches even 1/1000 the size of Twitter, its software becomes critically dependent on the underlying network. Infrastructure should be instrumented and monitored at least as closely as the software that depends on it.
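To make that last point concrete, here is a minimal sketch of the kind of check that can flag a crowded network segment before users notice. It is not how Twitter monitors its network; it simply assumes you can periodically read a cumulative byte counter for an interface (for example from SNMP or the operating system) and that you know the link speed. The names `CounterSample`, `utilization`, and `is_saturated`, and the 80% threshold, are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One reading of a cumulative interface byte counter (hypothetical type)."""
    timestamp: float  # seconds, from any monotonic clock
    octets: int       # cumulative bytes seen on the interface


def utilization(prev: CounterSample, curr: CounterSample,
                link_bps: float, counter_bits: int = 64) -> float:
    """Return the fraction of link capacity used between two samples."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    # The modulo handles a counter that wrapped once between the two samples.
    delta_octets = (curr.octets - prev.octets) % (2 ** counter_bits)
    return (delta_octets * 8) / (elapsed * link_bps)


def is_saturated(prev: CounterSample, curr: CounterSample,
                 link_bps: float, threshold: float = 0.8) -> bool:
    """Flag a link whose utilization meets an assumed alert threshold."""
    return utilization(prev, curr, link_bps) >= threshold
```

Feeding this two samples taken a fixed interval apart on each internal segment, and alerting when `is_saturated` returns true, is one simple way to notice two fast-growing components competing for the same link.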