The Fragile Manifesto

A lot of what I’ve been reading and thinking about of late is about the agile backlash. More speed, lower velocity reflects on IT teams pursuing “deliver more/newer IT” at the cost of “help the company achieve its mission”. Grooming the Backfog is about one dysfunction that arises as a result: (mis)managing a never-ending road of small changes rather than looking at the big picture and finding a path toward the destination. Our products are not our products attempts to address this problem by recasting teams not as makers of product, but as solvers of problems.

Here’s the latest: UK wasting £37 billion a year on failed agile IT projects. Some people will say that this is a result of not Agiling enough: if you were all Lean and MVP and whatever you’d not get to waste all of that money. I don’t necessarily agree with that: I think there’s actually things to learn by, y’know, reading the article.

The truth is that, despite the hype, Agile development doesn’t always work in practice.

True enough, but not a helpful statement, because “Agile” now means a lot of different things to different people. If we take it to mean the values, principles and practices written by the people who came up with the term, then I can readily believe that it wouldn’t work in practice for people whose context is different from those who came up with the ideas in 2001. Which may well be everyone.

I’m also very confident that it doesn’t mean that. I met a team recently who said they did “Agile”, and discussed their standups and two-week iterations. They also described how they were considering whether to go from an annual to biannual release.

Almost three quarters (73%) of CIOs think Agile IT has now become an industry in its own right while half (50%) say they now think of Agile as “an IT fad”.

The Agile-Industrial Complex is well-documented. You know what isn’t well-documented? Your software.

The report revealed 44% of Agile IT projects that fail, do so because of a failure to produce enough (or any) documentation.

The survey found that 34% of failed Agile projects failed because of a lack of upfront and ongoing planning. Planning is a casualty of today’s interpretation of the Agile Manifesto[…]

68% of CIOs agree that agile teams require more Architects. From defining strategy, to championing technical requirements (such as performance and security) to ensuring development teams stick to the rules of the game, the role of the Architect is sorely missed in the agile space. It must be reintroduced.

A bit near the top of the front page of the manifesto for agile software development is a sentence fragment that says:

Working software over comprehensive documentation

Before we discuss that fragment, I’d just like to quote the end of the sentence. It’s a long way further down the page, so it’s possible that some readers have missed it.

That is, while there is value in the items on the right, we value the items on the left more.

Refactor -> Inline Reference:

That is, while there is value in comprehensive documentation, we value working software more.

Refactor -> Extract Statement:

There is value in comprehensive documentation.

Now I want to apply the same set of transforms to another of the sentence fragments:

There is value in following a plan.

Nobody ever said don’t have a plan. You should have a plan. You should be willing to amend the plan. I was recently asked what I’d do if I found that my understanding of the “requirements” of a system differ from the customer’s understanding. It depends a lot on context but if there truly is a “the customer” and they want something that I’m not expecting to offer them, it’s time for me to either throw away my version or find a different customer.

Similarly, nobody said don’t have comprehensive documentation. I have been on a very “by-the-book” Agile team, where a developer team lead gave feedback that they couldn’t work out where a change would go to enable a particular feature. That’s architecture! What they wanted was an architectural plan of the system. Except that they couldn’t explicitly want that, because software architecture is so, ugh, 1990s and Rational Rose. Wanting an architecture diagram is like wanting to use CORBA, urrr.

Once you get past that bizarre emotional response, give me a call.

Input-Output Maps are Strongly Biased Towards Simple Outputs

About this paper

Input-Output Maps are Strongly Biased Towards Simple Outputs, Kamaludin Dingle, Chico Q. Camargo and Ard A. Louis, Nature Communications 9, 761 (2018).


On Saturday I went to my alma mater’s Morning of Theoretical Physics, which was actually on “the Physics of Life” (or Active Matter as theoretical physicists seem to call it). Professor Louis presented this work in relation to RNA folding, but it’s the relevance to neural networks that interested me.

The assertion made in this paper is that if you have a map of a lot of inputs to a (significantly smaller, but still large) collection of outputs, the outputs are not equally likely to occur. Instead, the simpler outputs are preferentially selected.

A quick demonstration of the intuition behind this argument: imagine randomly assembling a fixed number of Lego bricks into a shape. Some particular unique shape with weird branches can only be formed by an individual configuration of the bricks. On the other hand, a simpler shape with a large degree of symmetry can be formed from many different configurations. Therefore the process of randomly selecting a shape will preferentially pick the symmetric shape.

The complexity metric that’s useful here is called Kolmogorov complexity, and roughly speaking it’s a measure of the length of the shortest Universal Turing Machine program needed to describe the shape (or other object). Consider strings. A random string of 40 characters, say a56579418dc7908ce5f0b24b05c78e085cb863dc, may not be representable in any more efficient way than its own characters. But the string aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, which is also 40 characters, can be written with the Python program:

  'a'*40

which is seven characters long including the newline. Assuming eight bits per character, the random string needs 40*8=320 bits to be represented. The forty a’s can be found by finding the Python program, which is 7*8=56 bits. The assertion is that a “find a program that generates character sequences of length 40” algorithm (with some particular assumptions in place) will find the a56579… string with probability 2^-320, but will find the aaa… string with probability 2^-56, which is much, much more likely.

In fact, this paper shows that the upper and lower bounds on the probability of a map yielding a particular output for random input are both dependent on the Kolmogorov complexity of the output. It happens that due to the halting problem, you can’t calculate Kolmogorov complexity for arbitrary outputs. But you can approximate it, for example using Lempel-Ziv complexity (i.e. the length of the input to a lossless compression algorithm needed to recover the same output).
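To get a feel for the compression-based approximation, here is a minimal sketch using Python's standard zlib module. DEFLATE is a relative of the Lempel-Ziv family rather than the exact complexity measure used in the paper, so treat the numbers as a rough proxy only; the helper name compressed_bits is my own.

```python
import zlib

def compressed_bits(text: str) -> int:
    """Rough upper-bound proxy for complexity: bits in the zlib-compressed text."""
    return 8 * len(zlib.compress(text.encode("ascii"), 9))

random_looking = "a56579418dc7908ce5f0b24b05c78e085cb863dc"
repetitive = "a" * 40

for label, s in (("random-looking", random_looking), ("repetitive", repetitive)):
    print(f"{label}: {8 * len(s)} raw bits, {compressed_bits(s)} compressed bits")
```

The compressed size includes zlib's fixed header and checksum, so for strings this short the overhead is a large part of the total. The point is only the ordering: the repetitive string compresses to far fewer bits than the random-looking one.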

Where does this meet neural networks? In a preprint of a paper selected for ICLR 2019, with two of the same authors as this paper. Here, we find that a neural network can be thought of as a map from the inputs and weights to a function that does the inference.

Typically neural network architectures have lots more parameters than there are points in the training set, so how is it that they manage to generalise so well? And why is it that different training techniques, including stochastic gradient descent and genetic algorithms, result in networks with comparable performance?

The authors argue that a generalising function is much less complex than an overfitting function, using the same idea of complexity shown above. And that as the training process for the network is sampling the map of inputs->functions, it is more likely to hit on the simple functions than the complex ones. Therefore the fact that neural networks generalise well is intrinsic to the way they select functions from a wealth of possibilities.

My hope is that this is a step toward a predictive theory of neural network architectures. That by knowing something of the function we want to approximate, we can set a lower bound on the complexity of a network needed to discover sufficiently generalisable functions. This would be huge for both reducing the training effort needed for networks, and for reducing the evaluation runtime. That, in turn, would make it easier to use pretrained networks on mobile and IoT devices.

HPC at FOSDEM 2019

This year’s FOSDEM featured an HPC, Big Data and Data Science devroom on the Sunday. This post is the first part of my notes on the topics presented there. If you are interested, book some time and let’s talk about what it means for you and your high-performance computing team.

OpenHPC Update

Adrian Reber from the OpenHPC project gave a refresher on what OpenHPC is, and a status update. OpenHPC has not been represented at FOSDEM since 2016, when the project was very new.

It’s a community-driven project with representation from many vendors and HPC sites. At first blush their output might appear to be “RPM packages” and “documentation” but their mission is actually to discover and share best practices in HPC management. Those packages are all well-tested with each other, and the documentation is tested every release, too. The idea is that if you build the core of your cluster with OpenHPC packages on CentOS-like Linux distributions, on either x86-64 or AArch64, you get to rely on tried and tested work from the whole community.

Reber, who works at Red Hat on their OpenHPC efforts, invited everyone to join the weekly project steering calls in a demonstration of the openness of the project. He discussed future directions, including an upcoming release v1.3.7 that will include packages rebuilt with the ARM HPC compiler for AArch64, and the challenges of understanding when it is right to release v1.4, which will drop SLES12 for SLES15 and RHEL7 for RHEL8.


On the subject of HPC libraries, a common frustration is testing codes with various combinations of compilers, MPI libraries, hardware capabilities and so on. Developers both want to know that their code is correct (i.e. the science outcomes are still valid after a change) and that the performance has not been significantly impacted.

Victor Holanda discussed ReFrame, a tool for HPC regression and performance testing developed at CSCS and used regularly on Piz Daint and their other clusters. Written in Python, it gives test authors a way to express what their tests require (e.g. that they must run on machines with CUDA, compile a particular code with one of three different compilers, load environment modules with one of two different MPIs), run the tests, and inspect the output for certain outcomes.

Testers get to run a single command, or point their Jenkins or Travis CIs at a single command, to discover and execute the tests. The ReFrame runtime will compare the environments that the test can use with the ones that are available, and will report on the outcomes in each of those environments.
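The matching step can be pictured with a small sketch. To be clear, this is not ReFrame's API: Environment, Requirements and runnable_environments are hypothetical names I've made up to illustrate comparing what a test can use with what a site offers.

```python
# Hypothetical names throughout: this is NOT ReFrame's API, only the matching idea.
from dataclasses import dataclass

@dataclass(frozen=True)
class Environment:
    compiler: str
    mpi: str
    has_cuda: bool

@dataclass(frozen=True)
class Requirements:
    compilers: frozenset
    mpis: frozenset
    needs_cuda: bool

    def accepts(self, env: Environment) -> bool:
        """True when the environment offers everything the test asks for."""
        return (env.compiler in self.compilers
                and env.mpi in self.mpis
                and (env.has_cuda or not self.needs_cuda))

def runnable_environments(req, available):
    """Filter the site's environments down to those the test can run in."""
    return [env for env in available if req.accepts(env)]

available = [
    Environment("gnu", "openmpi", True),
    Environment("intel", "mpich", False),
    Environment("cray", "mpich", True),
]
req = Requirements(frozenset({"gnu", "intel", "cray"}),
                   frozenset({"openmpi", "mpich"}),
                   needs_cuda=True)

for env in runnable_environments(req, available):
    print("would run in:", env)
```

A real runner would then build and execute the test once per matching environment and report each outcome separately.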

Inside CSCS, ReFrame is used for a 90 minute nightly production test run, and 10 minute maintenance runs to check for system regressions after configuration changes. They also have a set of diagnostic tests to help understand what’s happened if a node goes bad. Their approach to correctness is very robust; the team do not declare that they support something until it has enough users to know how well it works. They also say that in three years of development they “have never seen a python stacktrace” from ReFrame, as they test ReFrame with ReFrame while they are developing it.

Singularity Containers

Singularity from Sylabs is a container runtime tool that specifically addresses problems containerising HPC workloads. Eduardo Arango gave a “what’s new in Singularity” update, as FOSDEM 2017 had already featured an introduction-level talk.

What’s new is that they’ve rewritten in Go. This means they get better integration with libraries used in Docker, Kube etc., and could adopt the de facto standard Container Network Interface (CNI) for software-defined networking when running containers. It also reduces the dependencies needed to get Singularity up and running.

The new version uses a new format for containers, SIF (Singularity Image Format), a read-only SquashFS filesystem along with metadata, all of which can be cryptographically signed using PGP for integrity protection. An upcoming extension will allow a writable overlay to be added to a SIF.

Supporting this, Sylabs have a new container library similar to DockerHub for hosting SIF images for public or private cloud use. They have a key store for those PGP signing keys, and a cloud-based remote image builder for developers who need to build images but can’t do it locally.


This has been part one of my FOSDEM HPC round-up. I’ve focussed on the tools that are out there for automating and simplifying HPC workflows, because it’s an interesting problem and one that presents challenges to many HPC teams. Don’t forget that the Labrary can help!

How UX Practitioners Produce Findings in Usability Testing

The Paper

How UX Practitioners Produce Findings in Usability Testing by Stuart Reeves, in ACM Transactions on Computer-Human Interaction, January 2019.


Various features of this paper make it a shoo-in for Research Watch.

  • It is about the intersection between academia and commercial practice. That is where the word “Labrary” comes from.
  • It extends the usual “human-computer interaction” focus of UX to include the team performing the UX, which is an aspect of PETRI.
  • I get to use the word “praxeology”.

Reeves compares the state of UX in the academic literature with the state of UX in commercial fields. He finds a philosophical gap that is similar to something I observed when studying “Requirements Engineering” on a Software Engineering M.Sc. course. Generally, the academic treatment of UX describes usability problems as things that exist, and that the task of UX activities is to find them.

The same can be seen in much early literature on requirements engineering. We assume that there is a Platonic model of how a software product should work, and that the job of the requirements engineer is to “gather” requirements from the stakeholders. Picture a worker with a butterfly net, trying to collect in these elusive and flighty requirements so they can pin them down in a display case made by the Jira Cabinet Company.

There’s an idea here that, even before it’s formed, the software is real and has an identity independent of the makers, users, and funders. Your role in the software production process is one of learning and discovery, trying to attain or at least approximate this ideal view of the system that’s out there to be had.

Contrasted with this is the “postmodern” view, which is a more emergent view. Systems and processes result from the way that we come together and interact. A software system both mediates particular interactions and blocks or deters others. The software system itself is the interaction between people, and developments in it arise as a result of their exchanges.

In this worldview, there are not “UX problems” to be found by adequate application of UX problem-discovery tools. There are people using software, people observing people using software, and people changing software, and sometimes their activities come together to result in a change to the software.

This philosophy is the lens through which Reeves engages in the praxeology (study of methods) of UX practitioners. His method is informed by ethnomethodological conversation analysis, which is an academic way of saying “I watched people in their context, paying particular attention to what they said to each other”.

The UX activity he describes is performed by actors in two different rooms. In the test room, the participant uses a computer to achieve a goal, with some context and encouragement provided by a moderator. The rest of the team are in the observation room, where they can see and hear the test room and the participant’s screen but talk amongst themselves.

Four representative fragments expose different features of the interactions, and to my mind show that UX is performative, arising from those interactions rather than being an intrinsic property of the software.

  • In fragment A, the participant reports a problem, the observers react and decide to report it.
  • In fragment B, the participant reports a problem, the observers react and suppress reporting it.
  • In fragment C, the participant does not seem to be having a problem, but the observers comment that they did not do something they would have expected, and discuss whether this is an issue.
  • In fragment D, the participant is working on the task but does not choose the expected approach, observers see that, and define a problem and a solution that encompasses that.

One observation here is that even where a participant is able to complete the task, a problem was raised. The case in fragment D is that the participant was asked how they would report a problematic advert. They described sending an email to the client. That would work. However, the product team see that as a problem, because they are working on the “submit a complaint” feature on the website. So, even though the task goal can be satisfied, it was not satisfied the way they want, which means there’s a UX problem.

There are all sorts of things to learn from this. One is that you can’t separate the world neatly into “ways humans do things” and “measurements of the ways humans do things”, because the measurements themselves are done by humans who have ways of doing things. Another is that what you get out of UX investigations depends as much on the observers as it does on the participants’ abilities. What they choose to collectively see as problems and to report as problems depends on their views and their interactions to an extent comparable to their observations of the participants working through the tasks.

Ultimately it’s more evidence for the three systems model. Your team, your software, and your customers are all interacting in subtle ways. Behaviour in any one of these parts can cause significant changes in the others.

Grooming the Backfog

This is “Pub Walks in Warwickshire”. NEW EDITION, it tells me! This particular EDITION was actually NEW back in 2008. It’s no longer in print.

Pub Walks in Warwickshire

Each chapter is a separate short walk, starting and finishing at a pub with a map and instructions to find your way around the walk. Some of the instructions are broken: a farmer has put a barbed wire fence across a field, or a gate has been replaced or removed. You find when you get there that it’s impossible to follow the instructions, and you have to invent a new route to get back on track. You did bring a different map, didn’t you? If not, you’ll be relying on good old-fashioned trial and error.

Other problems are more catastrophic. The Crown at Napton-on-the-hill seems to have closed in about 2013, so an attempt to do a circular walk ending with a pint there is going to run into significant difficulties, and come to an unsatisfactory conclusion. The world has moved on, and those directions are no longer relevant. You might want to start/end at the Folly, but you’ll have to make up a route that joins to the bits described here.

This morning, a friend told me of a team that he’d heard of who were pulling 25 people in to a three-hour backlog grooming session. That sounds like they’re going to write the NEW EDITION of “Pub Walks in Warwickshire” for their software, and that by the time they come around to walking the route they’ll find some of the paths are fenced over and the pubs closed.

Decomposing the Analogy

A lengthy, detailed backlog is not any different from having a complete project plan in advance of starting work, and comes with the same problems. Just as with the pub walks book, you may find that some details need to change when you get to tackling them, so there was no value in spending the time constructing all of those details in the first place. These sorts of changes happen when assumptions about the organisation or architecture of the system are invalidated. Yes, you want this feature, but you can no longer put it in the Accounts module because you found that customers think about that when they’re sorting their bills, not their accounts. Or you need to put more effort into handling input from an external data source, because the way it really works isn’t quite the same as the documentation.

Or you find that a part of the landscape is no longer present and there’s no value in being over there. This happens when the introduction of your system, or a competitor’s, means that people no longer worry about the problem they had back at the start. Or when changes in what people are trying to do mean they no longer want or need to solve that problem at all.

A book of maps and directions is a snapshot in time of ways to navigate the landscape. If it takes long enough to follow all of the directions, you will find that the details on the ground no longer match the approximation provided by the book.

A backlog of product features and stories is a snapshot in time of ways to develop the product. If it takes long enough to implement all of the features, you will find that the details in the environment no longer match the approximation provided by the backlog.

A Feeling of Confidence

We need to accept that people are probably producing this hefty backlog because they feel good about doing it, and replace it with something else to feel good about. Otherwise, we’re just making people feel bad about what they’re doing, or making them feel bad by no longer doing it.

What people seem to get from detailed plans is confidence. If what they’re confident in is “the process as documented says I need a backlog, and I feel confident that I have done that” then there’s not much we can do other than try to change the process documentation. But reality probably isn’t that facile. The confidence comes from knowing where they’re trying to go, and having a plan to get there.

We can substitute that confidence with frequent feedback: confidence that the direction they’re going in now is the best one given current knowledge, and that it’s really easy to get updates and course corrections. Replace the confidence of a detailed map with the confidence of live navigation.

On the Backfog

A software team should still have an idea of where it’s going. It helps to situate today’s development in the context of where we think (but do not know) we will be soon, to organise the system into a logical architecture, to see which bits of flexibility Ya [Probably] Ain’t Gonna Need and which bits Ya [Probably] Are. It also helps to have the discussion with people who might buy our stuff, because we can say “we think we’re going to do these things in the coming months” and they can say “I will give you a wheelbarrow full of money if you do this one first” or “actually I don’t need that thing so I hope it doesn’t get in my way”.

But we don’t need to know the detailed steps and directions to get there, because building those details now will be wasted effort if things change by the time we are ready to tackle all of the pieces. Those discussions we’re having with the people who might buy our stuff? They might, and indeed probably should, change that high-level direction.

Think of it like trying to navigate an unknown landscape in fog. You know that where you’re trying to get to is over there somewhere, but you can’t clearly see the whole path from here. You probably wouldn’t just take a compass bearing and head toward the destination. You’d look at what you can see around, and what paths there are. You’d check a map, sure, but you’d probably compare it with what you can see. You’d phone ahead to the destination, and check that they expect to be open when you expect to get there. You’d find out if there are any fruitful places to stop along the way.

So yes, share the high-level direction, it’s helpful. But share the uncertainty too. The thing we’re doing next should definitely be known, the thing we’re doing later should definitely be guesswork. Get confidence not from colouring in the plan all the way up to the edges, but by knowing how ready and able you are to update the plan.

Structured Pruning of Deep Convolutional Neural Networks

Structured Pruning of Deep Convolutional Neural Networks, Sajid Anwar et al. In the ACM Journal on Emerging Technologies in Computing Systems special issue on hardware and algorithms for learning-on-a-chip, May 2017.


Quick, a software engineer mentions a “performance” problem to you. What do they mean?

This is, of course, an unfair question. There are too many different ideas that all get branded “performance” for us to know what we are trying to solve. This paper is simultaneously about two different flavours of performance.

On the one hand, the “performance” of a neural network is related to its ability to perform its task: the correctness of its inferences. There isn’t really a good way to know what neural network configuration will perform well for the (unknown) function you want it to approximate. Therefore, the rule of thumb is to find a network that’s too complex, and stop training it when it begins to overfit (when its performance starts to degrade, because it’s being too specific about whether an input looks like an example from the training set rather than whether it shares important features).

Now we meet the other kind of performance: the amount of resources consumed to do the work. A large neural network needs to do a lot of computations with a lot of numbers to classify an input, and that means using a lot of processor cycles and a lot of memory. Because our approach to designing the network was to overspecify it, we are using more computer than we need. But if that computer is relatively low-specification and battery operated—a mobile phone for example—this may render our solution unusable.

So, how can we turn a complex neural network into a simpler neural network? While this isn’t totally satisfying, the answer is: “guess”. Turn off bits of the network (i.e. set weights to zero), and see whether it still classifies (performs) well. This act of turning bits of the network off is called pruning.

Ironically some previous work in this space has actually not been great for performance (the resource kind). You can “unroll” convolutional layers (commonly found in image-classifying networks) into matrix multiplications, and you can turn that into a sparse matrix by approximating all small weights with zero and only storing the non-zero values and their locations. But now, even though you have fewer calculations, you may have more memory accesses in trying to solve where the weights should be used. And that could be slower than not reducing the network.
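As a toy illustration of that unstructured approach, the sketch below zeroes small weights and stores only the survivors with their positions. The function names and the threshold are my own choices for illustration, not anything taken from the paper or the earlier work it discusses.

```python
def prune_small_weights(weights, threshold):
    """Return (index, value) pairs for weights whose magnitude is at least threshold."""
    return [(i, w) for i, w in enumerate(weights) if abs(w) >= threshold]

def sparse_dot(sparse_weights, inputs):
    """Dot product over the stored weights; every term needs an indexed lookup."""
    return sum(w * inputs[i] for i, w in sparse_weights)

weights = [0.80, -0.01, 0.02, -0.65, 0.003, 0.40]
inputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

sparse = prune_small_weights(weights, threshold=0.05)
dense_result = sum(w * x for w, x in zip(weights, inputs))
sparse_result = sparse_dot(sparse, inputs)

print(len(sparse), "of", len(weights), "weights kept")
print("dense:", dense_result, "pruned:", sparse_result)
```

Half the multiplications disappear, but each remaining one now goes through a stored index to find its input. That indirection is where the extra memory accesses come from.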

The work in this paper takes a structured approach to pruning the network. Whole feature maps (scores indicating whether particular characteristics of an image were found, and where, in the input image) can be removed, the network retrained, and the performance (ability to classify) measured afterwards. At smaller scales, the kernels can be pruned in particular deterministic ways, replacing a full weights matrix with a start index, a “stride” (gap between each non-zero value) and the list of non-zero weights. The different possibilities are explored using a combination of random generation and evolutionary iteration; networks that have a misclassification rate within a given tolerance of the original are kept into subsequent generations.
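The start-index-and-stride representation can be sketched in a few lines. This is my own reading of the description above, using a flattened kernel and made-up helper names; the paper's actual memory layout may well differ.

```python
def prune_with_stride(kernel, start, stride):
    """Keep every stride-th weight from start; return the compact (start, stride, values) form."""
    return start, stride, kernel[start::stride]

def expand(start, stride, values, size):
    """Rebuild the dense kernel, with zeros in every pruned position."""
    dense = [0.0] * size
    for k, v in enumerate(values):
        dense[start + k * stride] = v
    return dense

kernel = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # a flattened 3x3 kernel
start, stride, values = prune_with_stride(kernel, start=1, stride=2)

print("kept weights:", values)
print("dense equivalent:", expand(start, stride, values, len(kernel)))
```

Because the positions follow from two integers, there is no per-weight index to store or look up, which is the advantage over the sparse-matrix approach.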

The results seem promising. With pruning at both levels of abstraction, the resulting network is just as deep (it contains as many layers) but it has fewer nodes at each layer and fewer connections between nodes. The systematic pruning approach means that the resulting networks are smaller in memory and faster in use: CPU time measurements are down approximately two thirds when compared with the initial, unpruned network.

However, be careful when interpreting the graph: the authors are showing the reduced execution time of the unrolled matrix multiplication for one layer of one network configuration. It is not clear what this means for overall behaviour of the network, what the misclassification rate of this network was (they show a tolerance cutoff at 4%, which may be too high for a given use case), or in general how the CPU time savings vary with network topology. In other words, we have a single graph, and don’t know how to generalise it.

I hope that at some point a sound theoretical basis for choosing the architecture for a neural network to solve a given problem will be developed. In fact, I sort of hope that it exists now, and that I haven’t found it. I don’t think so: for the moment, whack-a-mole is the state of the art, and this paper shows you can whack quite a lot of moles and still get reasonable results.

On the continuous history of approximation

The Difference Engine – the Charles Babbage machine, not the steampunk novel – is a device for tabulating successive values of polynomials by adding up the differences introduced by each term between the successive input values.
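Here is a minimal sketch of that method of differences in Python, assuming evenly spaced inputs. The function names and the example polynomial are mine. Once the first column of the difference table is set up, every further value comes from additions alone, which is all the machine had to do.

```python
def initial_differences(poly, start, step, degree):
    """First column of the difference table: f(start) and its forward differences."""
    row = [poly(start + i * step) for i in range(degree + 1)]
    column = []
    while row:
        column.append(row[0])
        row = [b - a for a, b in zip(row, row[1:])]
    return column

def tabulate(column, count):
    """Produce successive polynomial values using additions only."""
    state = list(column)
    values = []
    for _ in range(count):
        values.append(state[0])
        # Each register absorbs the (not yet updated) register below it.
        for i in range(len(state) - 1):
            state[i] += state[i + 1]
    return values

def f(x):
    return 2 * x * x + 3 * x + 1

column = initial_differences(f, start=0, step=1, degree=2)
print(column)
print(tabulate(column, 6))
```

For this quadratic the second difference is constant, so three registers are enough to generate the whole table.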

This sounds like a fairly niche market, but in fact it’s quite useful because there are a whole lot of other functions that can be approximated by polynomial equations. The approach, which is based in calculus, generates a Taylor series (or a Maclaurin series, if the approximation is for input values near zero).

Now, it happens that this collection of other functions includes logarithms:

\(\ln(1+x) \approx x - x^2/2 + x^3/3 - x^4/4 + \ldots\)

and exponents:

\(e^x \approx 1 + x + x^2/2! + x^3/3! + x^4/4! + \ldots\)

and so, given a difference engine, you can make tables of logarithms and exponents.

In fact, your computer is probably using exactly this approach to calculate those functions. Here’s how glibc calculates ln(x) for x roughly equal to 1:

  r = x - 1.0;
  r2 = r * r;
  r3 = r * r2;
  y = r3 * (B[1] + r * B[2] + r2 * B[3]
    + r3 * (B[4] + r * B[5] + r2 * B[6]
        + r3 * (B[7] + r * B[8] + r2 * B[9] + r3 * B[10])));
  // some more twiddling that add terms in r and r*r, then return y

In other words, it works out r so that it is calculating ln(1+r), instead of ln(x). Then it adds together r + a*r^2 + b*r^3 + c*r^4 + d*r^5 + ... + k*r^12…it does the Taylor series for ln(1+r)!
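To see how well a twelve-term polynomial does near 1, here is a plain truncated Taylor series in Python compared against math.log. This is only a sketch of the idea: ln1p_taylor is my own name, and I am not claiming glibc's B coefficients are exactly the Taylor coefficients.

```python
import math

def ln1p_taylor(r, terms=12):
    """Truncated Taylor series for ln(1+r): r - r^2/2 + r^3/3 - ..."""
    total = 0.0
    power = 1.0
    for n in range(1, terms + 1):
        power *= r
        # Odd powers are added, even powers are subtracted.
        total += power / n if n % 2 else -power / n
    return total

for x in (0.95, 1.0, 1.05):
    r = x - 1.0
    print(x, ln1p_taylor(r), math.log(x))
```

The series converges quickly only when r is small, which is why this path is reserved for x roughly equal to 1.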

Now given these approximations, we can combine numbers into probabilities (using the sigmoid function, which is in terms of e^x) and find the errors on those probabilities (using the cross entropy, which is in terms of ln(x)). We can build a learning neural network!
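For concreteness, these are the two functions in question, written as a small Python sketch using the standard math module. The binary form of the cross entropy is my assumption; the multi-class version is built from the same logarithm.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1, via e^x."""
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(p, label):
    """Binary cross entropy for predicted probability p and true label 0 or 1, via ln(x)."""
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

p = sigmoid(2.0)
print("probability:", p)
print("error if the true label is 1:", cross_entropy(p, 1))
print("error if the true label is 0:", cross_entropy(p, 0))
```

A confident prediction that turns out to be right gives a small error, and the same prediction when wrong gives a large one. That difference is the signal the network learns from.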

And, more than a century after it was designed, our technique could still do it using the Difference Engine.

HPC’s Shift to the Cloud

Timothy Prickett Morgan writes on The Next Platform about the slow but inevitable shift to cloudy infrastructure. It seems that a tipping point has been reached, where the amount of IT money spent on “cloudy” infrastructure overtook the amount spent on “traditional” datacentre gear. This happened in 2018Q3, according to the IDC report cited in the article.

Prickett Morgan suggests that the transformation from bare metal to the cloud has been faster in HPC than in enterprise IT. In some senses, this makes sense, because HPC has long had the sorts of abstractions between the application and its environment that make it possible to change infrastructure. The days where an atomic energy or climate simulation would be capable of running only on dedicated hardware with integrated bench seating are long gone, and all of the top supercomputers are now (highly tuned, admittedly) GNU/Linux clusters running on normal-ish CPUs: mostly Intel, some IBM POWER, and ARM are moving from evaluation to deployment too. All of these technologies, as well as the Nvidia GPUs used in CUDA codes and deep learning applications, and even Google’s TPUs, are to be found in public cloud environments from the big providers.

On the other hand, there still are big honkin’ boxes of bare metal, with the number one spot changing almost every year. So not all HPC applications are cloud-suitable, and some of those codes that people are trying to port to the cloud may prove challenging. Here’s a summary of the components of a “traditional” HPC deployment, and how it might help or hinder adaptation to the cloud.



Plenty of HPC sites already virtualise their filesystems to some extent, with the Modules package. Modules let administrators separate the installation and management of packages from their availability to users, by defining modulefiles that configure the environment to make each package accessible.

Where a team is already using modules to set up its environment for building or running codes, adopting containers and similar abstractions should be straightforward. Docker images, for example, can contain the module packages and the environment changes necessary to use the modules, and the HPC application image can be composed on top to include the relevant environment.

Job submission

HPC systems tend to already be built with the kind of self-service in mind that devops teams in commercial software development strive to provide. This heritage has evolved from the necessarily multi-user nature of a large supercomputer deployment. Mainframe batch submission systems, grid middleware (such as Sun -> Oracle -> Univa Grid Engine) and SLURM are based around the idea that a user can request a certain amount of resources to run their codes, the request being queued until the resources are available.

The open source SLURM project already supports cloud-native demand scheduling. Others are using Kubernetes as an elastic demand scheduler.

However, a lot of teams have job-specific submission scripts with hard-coded assumptions about the environment they will run on, and codes that are tightly coupled to the submission script. Loosening that coupling will require some effort, but will make the codes “portable” to a cloud environment and enable new workflows for testing and development.

File systems

HPC sites frequently use high-performance parallel filesystems like Lustre or IBM’s GPFS. While these filesystems can be deployed to a cloud environment, the performance characteristics will differ and it will be harder to tune to the specific topology offered by a physical deployment. Notice that HPC filesystems do not perform well in some scenarios so some applications like AI training may benefit from re-evaluating the data access strategy anyway. Portable codes could be tested against new hardware without significant capital outlay; for example Google Cloud uses Intel Optane non-volatile memory.

Job-specific nodes

A traditional cluster will often have login nodes for accessing the cluster from the scientific workstations, batch nodes for running and using the batch submission systems, compiler nodes for building codes, metadata nodes if it uses a parallel filesystem, and finally compute nodes on which the simulations and deep learning jobs are actually executed. The compute nodes may be divided into groups to service different queues, or to differentiate between testing/debugging and production jobs.

While operations teams may be interested in getting close to 100% utilisation out of the compute nodes, the fact is that the other classes of machine exist because they need different configurations, not because they need to always be available with dedicated hardware. They are ideal candidates to lead the transition to on-demand scaling, perhaps treating a physical cluster as a “private cloud” that commits as much hardware to compute as possible, scaling its other functions as needed.

Meanwhile, compilation and computation can be modelled as serverless workloads, consuming resource when they are executing but scaling to zero when not in use.

Application Support


MPI libraries like Open MPI already support demand-based scaling at job launch, using the -np option to control how many processes are started and the --hostfile to indicate where those hosts are. In principle it might seem like the hosts in the host file could be discovered using the Kubernetes service registry or similar services from other cloud orchestration layers. In practice the MPI library will need to support launching the process on the nodes so a middleware (see above) will still need to be deployed on top, or the MPI software extended with native support for the cloud’s orchestration API.
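The hostfile half of that is the easy part. As a minimal sketch in Java, assuming some discovery step has already produced a map of host names to slot counts (the host names here are hypothetical, and the discovery itself is not shown), rendering the "hostname slots=N" lines that Open MPI hostfiles use looks like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch only: the discovery step that fills the map is assumed, not shown.
final class HostfileSketch {
    /** Renders one "hostname slots=N" line per host. */
    static String render(Map<String, Integer> slotsByHost) {
        return slotsByHost.entrySet().stream()
                .map(e -> e.getKey() + " slots=" + e.getValue())
                .collect(Collectors.joining("\n", "", "\n"));
    }

    /** Total slots available, a natural upper bound for the -np argument. */
    static int totalSlots(Map<String, Integer> slotsByHost) {
        return slotsByHost.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        Map<String, Integer> discovered = new LinkedHashMap<>();
        discovered.put("compute-0", 4); // hypothetical host names
        discovered.put("compute-1", 4);
        System.out.print(render(discovered));
        System.out.println("mpirun -np " + totalSlots(discovered) + " --hostfile hosts.txt ./a.out");
    }
}
```

Writing the file does nothing to start processes on those hosts, which is the gap the middleware or native orchestration support has to fill.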

Software Licences

This turns out to be one of the biggest hurdles for demand-scaling for many teams. HPC software such as proprietary compilers, numerical algorithms libraries and developer tools are licensed with a particular maximum number of parallel uses. Lab-developed codes may have evolved with assumptions about where the licence file is located, and not built defensively against the possibility that a licence can’t be checked out. The ISV may have built assumptions into their licensing scheme, for example the host having a fixed IP or MAC address. A researcher or developer could have copied a particular licence file into their home directory, using that beyond other agreements being arranged with the vendor.

Where the licensing scheme is flexible enough to allow portability of the software, a good technique is to centralise management of the licences using a secrets store, for example Vault, and to inject them into the HPC applications’ containers when they are launched.
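On the application side, the defensive half of this can be quite small. Here is a sketch in Java, assuming the container launcher injects the licence location through an environment variable; the name APP_LICENCE_FILE is invented for illustration, and how the secrets store populates it is out of scope.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Optional;

// Sketch only: the variable name is hypothetical, not any vendor's convention.
final class LicenceLocator {
    static final String LICENCE_ENV = "APP_LICENCE_FILE";

    /** Returns the licence path only if the variable is set and the file is readable. */
    static Optional<Path> locate(Map<String, String> env) {
        return Optional.ofNullable(env.get(LICENCE_ENV))
                .filter(value -> !value.isBlank())
                .map(Path::of)
                .filter(Files::isReadable);
    }

    public static void main(String[] args) {
        Optional<Path> licence = locate(System.getenv());
        if (licence.isPresent()) {
            System.out.println("Using licence at " + licence.get());
        } else {
            // Fail clearly here instead of crashing later inside the licensed library.
            System.err.println("No readable licence found; set " + LICENCE_ENV + ".");
        }
    }
}
```

Passing the environment in as a map keeps the lookup free of assumptions about home directories or fixed install paths, and makes it easy to exercise without a real licence.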

Alternatively, particularly if the licensing scheme is too rigid, it’s worth evaluating the effort required and performance impact sustained to port the code to a different technology, for example an open source compiler. The trade-off is that increased deployment flexibility is strategically beneficial, but the short-term costs, staffing requirements and impact on the scientific mission can make the port hard to justify or unworkable.


While there are significant benefits to be had in porting high-performance codes to cloud environments, the task is not without its challenges. Labrary consultancy with Graham Lee, bringing his experience in cloud-first devops teams, scalable systems at Facebook, and High-Performance Computing on ARM, can help your team identify and overcome these challenges.

Graham will be at the HPC, Big Data and Data Science devroom at FOSDEM in Brussels, February 2-3. Say hello, grab some time and let’s move your codes forward!

The ABC of Software Engineering Research

About this paper

The ABC of Software Engineering Research by Klaas-Jan Stol and Brian Fitzgerald, published October 2018. See link for full citation.


There are too many ways in which terms describing research methods in software engineering get used, and these authors have a solution. The reason, at least according to the introductory discussion in this paper, is in part a case of discipline envy. This is the idea that we don’t quite know what software engineering is, but we know what those people over there do, and we like that, so we’ll co-opt it.

You could argue that the entire idea of software engineering is discipline envy. A collection of computing experts from academia and industry didn’t quite know how to formalise the problems faced by software makers in the 1960s, but they did know what engineers do, and they liked that. In fact, it’s not clear that they (or at least we, their intellectual descendants) truly understood what engineers do, but nonetheless we gave it a jolly good go. In 1968, in the town of Garmisch in Germany, a discipline was born.

Now, it’s interesting that discipline envy has turned up so early in this discussion, because the contribution in this paper is a framework borrowed from social science researchers. To understand the applicability of cross-discipline seeding, we have to ask how strong the analogy “software engineering is just like X” seems to be (as well as identify how well the proposed idea worked in field X). So, is software engineering like a social science?

Here, the authors carve the field in two. They distinguish “solution-seeking” research, in which we identify what we ought to do about a problem, from “knowledge-seeking” research, in which we identify what people do do about the problem. The bad news about ditching the solution-seeking half of the discipline is that we just lost the engineering from software engineering, the bit where we use scientific results to propose novel solutions to problems.

The good news is that knowledge-seeking software engineering research does look quite a bit like a social science. People, in some context, do things, and we can try to understand that. Indeed that is the origin of the ABC initialism: Actors (the people), Behaviour (the things) and Context.

Well, we can understand bits of it at a time. Like good consultants, the authors introduce a quadrant diagram. On one axis, the “generalisability” of a research method, from highly universal contexts to deeply specific contexts. On the other, the “obtrusiveness” of the method, reflecting whether the researchers are passive observers or active interferers.

As the Labrary stands at the intersection, we approve of the idea that two different things lie on a continuum, rather than being an either/or choice. This framework makes the point that while a particular research technique or strategy is situated somewhere in the general/obtrusive map, others are available elsewhere. The reaction to a highly-controlled lab experiment should not be to declare that the result is not generalisable, but to understand what else could be done to explore generalisations of its results.

The discussions of where particular research strategies fit in the map are interesting, though some of the analogies drawn are fairly tenuous. The authors show where on the map the maximum applicability of a method for each of the key properties lies: universally contextual, unobtrusive research generalises over Actors best, while highly-obtrusive methods allow more precise measures of particular Behaviours and more focus gives a more realistic Context. It would be really beneficial to see a similar framework for “solution-seeking” literature, so we can evaluate the applicability of techniques developed in software engineering research to “practical” problems.

Java By Contract: a Worked Example

Java by Contract is an implementation of Design by Contract, as promoted by Bertrand Meyer and the Eiffel Software company, for the Java programming language. The contract is specified using standard Java methods and annotations, making it a more reliable tool than earlier work which used javadoc comments and rewrote the Java source code to include the relevant tests.

Which is all well and good, but how do you use it? Here’s an example.

The problem

There is a whole class of algorithms to approximately find roots to a function, using an iterative technique. Given, in Java syntax, the abstract type MathFunction that implements a function over the double type:

interface MathFunction {
    double f(double x);
}

Define the abstract interface that exposes such an iterative solution, including the details of its contract.

The solution

The interface is designed using the Command-Query Separation Principle. Given access to the function f(), the interface has a command findRoot(seed1, seed2) which locates the root between those two values, and a query root() which returns that root. Additionally, a boolean query exhaustedIterations() reports whether the solution converged.

Both of the queries have the precondition that the command must previously have successfully run; i.e. you cannot ask what the answer was without requesting that the answer be discovered.

The contract on the command is more interesting. The precondition is that for the two seed values seed1 and seed2, one of them must correspond to a point f(x) > 0 and the other to a point f(x) < 0 (it does not matter which). This guarantees an odd, and therefore non-zero, number of roots[*] to f(x) between the two, and the method will iterate toward one of them. If the precondition does not hold, then an even number of roots (including possibly zero) lies between the seed values, so it cannot be guaranteed that a solution exists to find.

In return for satisfying the precondition, the command guarantees that it either finds a root or exhausts its iteration allowance looking. Another way of putting that: if the method exits early, it is because it has already found a convergent solution.

Many, but not all, of these contract details can be provided as default method implementations in the interface. The remainder must be supplied by the implementing class.

/**
 * Given a mathematical function f over the doubles, and two bounds for a root to that function,
 * find the root using an (unspecified) iterative approach.
 * A root is an input value x such that f(x)=0.
 */
public interface RootFinder {
    /**
     * @return The function that this object is finding a root for.
     */
    MathFunction f();

    /**
     * A root to the function f() is thought to lie between seed1 and seed2. Find it.
     * @param seed1 One boundary for the root to f().
     * @param seed2 Another boundary for the root to f().
     */
    @Precondition(name = "seedGuessesStraddleRoot")
    @Postcondition(name = "earlyExitImpliesConvergence")
    void findRoot(double seed1, double seed2);

    /**
     * @return The root to the function f() that was discovered.
     */
    @Precondition(name = "guessWasCalculated")
    Double root();

    /**
     * @return Whether the iterative solution used the maximum number of iterations.
     */
    @Precondition(name = "guessWasCalculated")
    boolean exhaustedIterations();

    default Boolean guessWasCalculated() {
        return this.root() != null;
    }

    default Boolean seedGuessesStraddleRoot(Double seed1, Double seed2) {
        double r1 = f().f(seed1);
        double r2 = f().f(seed2);
        return ((r1 > 0 && r2 < 0) || (r1 < 0 && r2 > 0));
    }

    Boolean earlyExitImpliesConvergence(Double seed1, Double seed2, Void result);
}
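To show what "supplied by the implementing class" might look like, here is a sketch of one possible implementation using bisection. It is deliberately self-contained: it declares its own stand-in MathFunction, checks the precondition by hand with an IllegalArgumentException, and does not use the Java by Contract annotations or enforcement at all. The class name, the iteration limit and the tolerance are my assumptions.

```java
// Sketch only: plain Java stand-ins, no Java by Contract annotations or proxies.
final class BisectionSketch {
    interface MathFunction { double f(double x); }

    private final MathFunction function;
    private final int maxIterations;
    private final double tolerance;
    private Double root;        // null until findRoot has run
    private int iterationsUsed;

    BisectionSketch(MathFunction function, int maxIterations, double tolerance) {
        this.function = function;
        this.maxIterations = maxIterations;
        this.tolerance = tolerance;
    }

    /** Mirrors the seedGuessesStraddleRoot precondition: f changes sign between the seeds. */
    boolean seedGuessesStraddleRoot(double seed1, double seed2) {
        double r1 = function.f(seed1);
        double r2 = function.f(seed2);
        return (r1 > 0 && r2 < 0) || (r1 < 0 && r2 > 0);
    }

    void findRoot(double seed1, double seed2) {
        if (!seedGuessesStraddleRoot(seed1, seed2)) {
            throw new IllegalArgumentException("seeds do not straddle a root");
        }
        // Keep one end where f is negative and one where it is positive.
        double negative = function.f(seed1) < 0 ? seed1 : seed2;
        double positive = function.f(seed1) < 0 ? seed2 : seed1;
        double midpoint = (negative + positive) / 2;
        iterationsUsed = 0;
        while (iterationsUsed < maxIterations && Math.abs(function.f(midpoint)) > tolerance) {
            if (function.f(midpoint) < 0) { negative = midpoint; } else { positive = midpoint; }
            midpoint = (negative + positive) / 2;
            iterationsUsed++;
        }
        root = midpoint;
    }

    Double root() { return root; }

    boolean exhaustedIterations() { return iterationsUsed >= maxIterations; }

    /** Mirrors the postcondition: stopping early is only allowed once |f(root)| is within tolerance. */
    boolean earlyExitImpliesConvergence() {
        return exhaustedIterations() || Math.abs(function.f(root)) <= tolerance;
    }
}
```

The loop can only stop for one of two reasons, so the postcondition falls out of its exit condition: either the iteration allowance ran out, or |f(midpoint)| dropped to the tolerance.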

Example usage

There are swaths of algorithms to implement this interface. See, for example, the book Numerical Recipes. Given a particular implementation, we can look for roots of a simple function, for example f(x) = x^2 - 2:

    RootFinder squareRootOfTwo = SecantRootFinder.finderForFunction((double x) -> x*x - 2);
    squareRootOfTwo.findRoot(1.0, 2.0);
    System.out.println(String.format("Root: %f", squareRootOfTwo.root()));
    System.out.println(String.format("The solution did%s converge before hitting the iteration limit",
        squareRootOfTwo.exhaustedIterations() ? " not" : ""));

This suggests that a root exists at x~=1.414214, and that it converged on the solution before running out of goes. Let’s see if there’s another root between 2 and 3:

Exception in thread "main" online.labrary.javaByContract.ContractViolationException:
  online.labrary.javaByContract.Precondition seedGuessesStraddleRoot had unexpected value false on object
    at javaByContract/online.labrary.javaByContract.ContractEnforcer.invoke(
    at jdk.proxy1/com.sun.proxy.jdk.proxy1.$Proxy4.findRoot(Unknown Source)
    at javaByContract/online.labrary.rootFinder.RootFinder.main(

Whoops! I’m holding it wrong: the function doesn’t change sign between x=2 and x=3. I shouldn’t expect the tool to work, and indeed it’s been designed to communicate that expectation by failing a precondition.

[*] Nitpick: roots _or singularities_.

