Posted by Petter Måhlén on Thursday, January 28th 2010
It’s not enough for a complex software system to just do what it should do; it also needs to be able to prove that it does it. So there are functional requirements (such as “show relevant results when a user enters a search term”) and operational requirements (such as “explain how you got the results” and “show what the resource usage was”). The former kind of requirement makes the system perform a useful function, and the latter kind makes it possible to verify the primary functions in QA and to monitor and troubleshoot the system in production.
To meet the kind of operational requirements that this blog post is about, you need to track metadata: not just the data that makes up the response, but also information about how the system derived the response. With a service-oriented architecture like the one we use at Shopzilla, this metadata is crucial; without it, the complexity of troubleshooting the system would be too much. The most obvious and, I would say, useful place we display that data is in a debug header and footer that can be turned on only if you’re on Shopzilla’s internal network:

Debug footer example
Every service invocation that led to the page that is being displayed is shown as a link in the footer. If there is something wrong with the page, you can click on the links to perform identical service invocations and see what the responses were. All responses are XML (we call the services RESTful, but they’re not, really – and that’s fine) and, with very few exceptions, the service invocations are idempotent. So checking service invocation results is something we do easily and all the time. Bug reports typically include site URLs that show a broken page and one or two service invocation URLs that highlight a problem. We copy/paste URLs from the debug footer and edit the hostname bit to compare results from different environments. Product owners use service invocation results to understand which information is already available from which services, in order to get a feel for how quickly we can develop new features, and so on.
While this feature is invaluable to us, I’ve always been in two minds about how we’ve implemented it. This is a pretty typical code snippet:
    try {
        sasInvoker.invokeSas(urlPathPart, productCallback, headers);
    } catch (SasException e) {
        throw new ServiceInvocationException("Error invoking SAS for products for keyword: " + keyword, e);
    } finally {
        ServiceDebugInfo.get().setSasProductUrl(productCallback.getUrl());
    }
The class that knows how to construct a product search query does that and calls an invoker which knows the host to contact and can handle an HTTP request/response. There is a callback which knows how to parse the response. The invoker has the responsibility to store the final URL that was invoked in the callback, and the code we’re looking at stashes that URL in a ThreadLocal where it can be retrieved later by code that is interested in the metadata.
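To make the moving parts concrete, here is a minimal sketch of how the three collaborators could fit together. Only the names ServiceDebugInfo, get(), setSasProductUrl(), invokeSas() and getUrl() come from the snippet above; the class bodies, the ProductSearch class, the example hostname and the canned response are assumptions made for illustration, and the HTTP call is faked so that the sketch runs on its own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Assumed shape of the debug-info holder: one instance per thread.
final class ServiceDebugInfo {
    private static final ThreadLocal<ServiceDebugInfo> CURRENT =
            ThreadLocal.withInitial(ServiceDebugInfo::new);

    private String sasProductUrl;

    static ServiceDebugInfo get() { return CURRENT.get(); }
    static void clear() { CURRENT.remove(); }

    void setSasProductUrl(String url) { this.sasProductUrl = url; }
    String getSasProductUrl() { return sasProductUrl; }
}

// Parses the response; the invoker also stores the final URL here.
final class ProductCallback {
    private String url;
    private final List<String> products = new ArrayList<>();

    void setUrl(String url) { this.url = url; }
    String getUrl() { return url; }

    void handleResponse(String body) {
        for (String product : body.split(",")) {
            products.add(product.trim());
        }
    }

    List<String> getProducts() { return products; }
}

// Knows the host to contact. The HTTP round trip is replaced by a
// canned response (an assumption, so the sketch needs no network).
final class SasInvoker {
    private final String host;

    SasInvoker(String host) { this.host = host; }

    void invokeSas(String urlPathPart, ProductCallback callback, Map<String, String> headers) {
        // Implicit contract: the invoker must remember to set the URL.
        callback.setUrl("http://" + host + urlPathPart);
        callback.handleResponse("red shoes, blue shoes");
    }
}

// Hypothetical business-logic class that builds the query.
final class ProductSearch {
    private final SasInvoker sasInvoker;

    ProductSearch(SasInvoker sasInvoker) { this.sasInvoker = sasInvoker; }

    List<String> search(String keyword) {
        ProductCallback productCallback = new ProductCallback();
        try {
            sasInvoker.invokeSas("/products?keyword=" + keyword, productCallback,
                    Collections.<String, String>emptyMap());
        } finally {
            // Side channel: the metadata leaves through the ThreadLocal,
            // not through the return value.
            ServiceDebugInfo.get().setSasProductUrl(productCallback.getUrl());
        }
        return productCallback.getProducts();
    }
}
```

Note that nothing in the signature of search() tells you that the ThreadLocal gets written; you have to read the body to find out.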
The thing that bothers me is the use of the ThreadLocal. The intention is of course to keep the metadata out of the main business logic – it’s not needed for the primary functional requirement after all, so mixing the metadata code with the business logic will just obscure what the business logic actually does and confuse the reader. This is a good point, but I think there are some pretty significant drawbacks with the solution as well:
- The use of a ThreadLocal creates problems with threading – for performance reasons, we execute most service invocations in parallel, using different threads. In order to collect all the debug information from invocations made by all these different threads we need to jump through some hoops, and it is possible for people who don’t know about these hoops to start up threads in a way that loses this information.
- Breaking encapsulation – the callee needs to reach outside itself and poke objects that really belong to the caller. So we introduce dependencies that shouldn’t really be there. Examples above include the fact that the invoker needs to set the URL on the callback, but the contract for doing this is really implicit. Also, the code we’re looking at reaches out and sets the ThreadLocal to something that is hopefully what the client expects.
- A very similar point to the above is that objects hide parts of their API (as also described here). There is in fact a requirement on the product search implementation that it sets the ThreadLocal, but there is nothing in the API definition that indicates that. Similarly, there is nothing indicating a need to unit-test the ThreadLocal.
- The debug info is essentially a (request-)global variable with all the problems that entails. It is for instance possible for two different objects to set the SasProductUrl field. Unit tests would indicate that everything is fine, but when the classes are wired up, one or the other of those objects will lose and the value it wrote will be dropped.
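Two of these drawbacks are easy to reproduce in isolation. The sketch below uses a bare ThreadLocal<String> rather than our real debug-info class, and all class names, method names and URLs in it are made up for the example. The first method shows a value written by a worker thread being invisible to the thread that started it; the second shows two writers sharing one slot.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a request-scoped debug-info slot.
final class DebugUrlHolder {
    static final ThreadLocal<String> SAS_PRODUCT_URL = new ThreadLocal<>();

    private DebugUrlHolder() { }
}

final class ThreadLocalPitfalls {

    // A worker thread records the URL, then the calling thread reads it.
    // The caller gets null, because each thread has its own copy.
    static String urlSeenByCallerAfterWorkerWrite() {
        DebugUrlHolder.SAS_PRODUCT_URL.remove();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> done = pool.submit(() ->
                    DebugUrlHolder.SAS_PRODUCT_URL.set("http://sas.example.test/products?keyword=shoes"));
            done.get(); // wait until the worker has definitely written
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return DebugUrlHolder.SAS_PRODUCT_URL.get();
    }

    // Two independent objects write the same field on the same thread.
    // Each would pass its own unit test; wired together, the first
    // value is silently replaced by the second.
    static String urlAfterTwoWriters() {
        DebugUrlHolder.SAS_PRODUCT_URL.remove();
        DebugUrlHolder.SAS_PRODUCT_URL.set("http://a.example.test/products");
        DebugUrlHolder.SAS_PRODUCT_URL.set("http://b.example.test/products");
        return DebugUrlHolder.SAS_PRODUCT_URL.get();
    }
}
```

The "hoops" mentioned in the first bullet amount to copying values between threads by hand, which is exactly the kind of step that is easy to forget.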
To my mind (I’m really only in one mind about the implementation, I lied earlier), these issues are big enough and cause enough problems that I’m prepared to say that metadata should actually be made an explicit part of the API of at least some parts of our system, specifically those that do service invocations. It’s not a huge issue, so I doubt that we will try to fix it urgently, if at all, but it’s the sort of thing I will probably try to solve differently the next time I am involved in the design of a new system.
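As a rough idea of what "explicit" could look like, here is one possible shape, not something we have built. The names ServiceResult, ExplicitProductSearch and PageAssembler, the example hostname and the canned product list are all invented for this sketch. The point is only that the invoked URL travels with the return value, so it survives being computed on another thread and shows up in the method signature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical result type: primary data plus the metadata about how
// it was obtained, returned together.
final class ServiceResult<T> {
    private final T data;
    private final String invokedUrl;

    ServiceResult(T data, String invokedUrl) {
        this.data = data;
        this.invokedUrl = invokedUrl;
    }

    T getData() { return data; }
    String getInvokedUrl() { return invokedUrl; }
}

// Hypothetical search class; the service call is faked with canned data.
final class ExplicitProductSearch {
    private final String host;

    ExplicitProductSearch(String host) { this.host = host; }

    ServiceResult<List<String>> search(String keyword) {
        String url = "http://" + host + "/products?keyword=" + keyword;
        List<String> products = Arrays.asList(keyword + " A", keyword + " B");
        return new ServiceResult<>(products, url);
    }
}

final class PageAssembler {
    // Runs the searches in parallel and collects the URLs for the debug
    // footer. No thread-to-thread copying is needed, because each
    // future hands back its own metadata.
    static List<String> debugFooterUrls(ExplicitProductSearch search, List<String> keywords) {
        List<CompletableFuture<ServiceResult<List<String>>>> pending = new ArrayList<>();
        for (String keyword : keywords) {
            pending.add(CompletableFuture.supplyAsync(() -> search.search(keyword)));
        }
        List<String> urls = new ArrayList<>();
        for (CompletableFuture<ServiceResult<List<String>>> future : pending) {
            urls.add(future.join().getInvokedUrl());
        }
        return urls;
    }
}
```

The cost is that callers which only want the products now have to unwrap them with getData(), which is the "mixing metadata with business logic" trade-off discussed above. A unit test for search() can, however, assert on the URL directly.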
To summarise:
- For some parts of more complex systems, metadata is as important as the primary data (well, close enough that the difference doesn’t really matter).
- When classes generate metadata in addition to primary data, the metadata should be made a full-fledged part of the API, not hidden away.
Metadata matters – if it is important enough to generate in production code, it is important enough to make that code explicit, unit-testable, and as solid as can be.