Projecting mainframe demand

June 17th, 2010  |  Published in Uncategorized  |  3 Comments

When mutual funds and such issue a prospectus, they typically have a phrase like “past results should not be taken as a guarantee of future returns,” and something similar should probably be said about projecting future demand for mainframe capacity. But as long as we’re trying to predict the future, the past provides most of what little actual data we have to go on.

When we were preparing a response to the migration assessment, I asserted that historically we have seen a 10%–20% annual increase in mainframe demand. Dennis countered with a 2%/year growth rate over the life of our current z890. Who is right? In this post I’ll explain why I said what I did. First, the picture:

[Graph: CPU growth on the current z890]

In this graph, the green line is the average CPU busy during prime shift, the orange line is the maximum CPU busy during prime shift, and the blue line shows how much capacity was available, since most of the time we've run with one of the four engines turned off. (Prime shift is 8:00–5:00 on weekdays, excluding holidays.) If you want to see the raw numbers, you can get them from my RMF reporting site. (This site is behind a firewall and is only available to UT campus IP addresses.) The light green line is the best-fit line for the average CPU busy, and its slope does indeed come out to a little over 2%/year, as Dennis said.

The first point I'd like to make is that this result depends heavily on exactly where you start and end. For example, a best fit from June 2006 through January 2009 gives a slope of just under 5%/year, while one from April 2006 through March 2008 gives negative 1%/year growth. I'll have more to say later about choosing appropriate starting and ending points.
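If you want to play with the fits yourself, the calculation is nothing fancier than a least-squares line over whatever window you choose. Here's a minimal sketch in Python, assuming the prime-shift monthly averages have been exported to a CSV; the file name and column layout are made up, and the real numbers live on the RMF reporting site:

    # Sketch: how the fitted growth rate depends on which window you pick.
    # Input is a hypothetical CSV with one row per month: "YYYY-MM,avg_busy",
    # where avg_busy is percent of total z890 capacity.
    import csv
    from datetime import datetime

    import numpy as np

    def load_monthly(path):
        months, busy = [], []
        with open(path) as f:
            for row in csv.reader(f):
                months.append(datetime.strptime(row[0], "%Y-%m"))
                busy.append(float(row[1]))
        return months, busy

    def slope_pct_per_year(months, busy, start, end):
        """Least-squares linear fit over [start, end); slope in percentage points per year."""
        pairs = [(m, b) for m, b in zip(months, busy) if start <= m < end]
        t = np.array([(m - start).days / 365.25 for m, _ in pairs])
        y = np.array([b for _, b in pairs])
        slope, _intercept = np.polyfit(t, y, 1)
        return slope

    months, busy = load_monthly("rmf_prime_shift_avg.csv")  # hypothetical export
    for s, e in [("2006-06", "2009-02"), ("2006-04", "2008-04")]:
        start, end = datetime.strptime(s, "%Y-%m"), datetime.strptime(e, "%Y-%m")
        print(f"{s} to {e}: {slope_pct_per_year(months, busy, start, end):.1f} points/year")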

However, a bigger point is that when Dennis said "2%/year" and I said "historically, 10%–20% annual growth," we were talking about two completely different things. For one thing, the first assumes a linear growth pattern while the second assumes an exponential one. (For this interval the exponential curve is a slightly better fit, but the difference is very small.) So the "2%" means "2% of the total capacity of the z890," while the "10%" means "10% of what it was the year before."(1) Here are two tables, at an annual granularity, to show what I mean:

CPU busy in February

    Year   Actual   5% growth   10% growth
    2005   43.12    43.12       43.12
    2006   63.95    45.28       47.43
    2007   52.74    47.54       52.18
    2008   53.60    49.92       57.39
    2009   66.39    52.41       63.13
    2010   57.03    55.03       69.45

CPU busy in August

    Year   Actual   5% growth   10% growth
    2005   52.23    52.23       52.23
    2006   57.71    54.84       57.45
    2007   60.82    57.58       63.20
    2008   63.31    60.46       69.52
    2009   58.92    63.49       76.47

The February table fits the 10% growth rate pretty well through 2009, while the August one comes somewhere between 5% and 10% through 2008. I’ll argue in a bit that we shouldn’t include anything since November 2008 when trying to forecast future demand.
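To make the arithmetic behind the projection columns explicit, here's a small sketch; the "Actual" numbers are copied from the February table, and the last column shows the linear 2-points-per-year alternative for contrast:

    # The "5% growth" and "10% growth" columns are compound growth from the
    # 2005 baseline: baseline * (1 + rate) ** years_elapsed.  A linear
    # "2 points of capacity per year" column is included for contrast.
    feb_actual = {2005: 43.12, 2006: 63.95, 2007: 52.74,
                  2008: 53.60, 2009: 66.39, 2010: 57.03}

    base_year, base = 2005, feb_actual[2005]
    print(f"{'Year':>4} {'Actual':>7} {'5%/yr':>7} {'10%/yr':>7} {'+2/yr':>7}")
    for year, actual in sorted(feb_actual.items()):
        n = year - base_year
        compound_5 = base * 1.05 ** n    # exponential: percent of the previous year
        compound_10 = base * 1.10 ** n
        linear_2 = base + 2.0 * n        # linear: percentage points of total capacity
        print(f"{year:>4} {actual:7.2f} {compound_5:7.2f} {compound_10:7.2f} {linear_2:7.2f}")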

So far we've only looked at the big picture; now let's examine some smaller portions of the data to see whether they help us better understand what's going on.

The first thing to note is that for the first year and a half the growth rate was extremely high, and then in July 2006 there was a sharp drop. (I've marked this with the letter "A" on the graph.) What happened here was that a certain high-use application (names removed to protect the guilty; most of the programmers have since left the University for greener pastures) was doing a lot of inefficient processing. Brick and Jon and Greg (and maybe some others I don't recall) worked with them to redesign the application, which led to a significant drop in CPU demand.

The next event to note occurred in November 2008 (marked "D"). This is when Jon turned off the Trim monitor for Adabas. During peak periods Trim was consuming up to 10% of our CPU capacity. That is significant enough that if you're trying to project future demand, you can't mix periods when Trim was running with periods when it wasn't and expect your projections to mean anything. And I have to point out that since Trim isn't running, there are problems we can't diagnose because we don't have the necessary data. So far we've been lucky and haven't hit one of those problems at a critical time, but there's no guarantee our luck will continue.

Back near the beginning I promised to say more about choosing appropriate starting and ending points when doing curve fitting; here is where I fulfill that promise. The starting and ending points should be either both before Trim was removed or both after. Since we have only about a year and a half of data from after, I'm going to go with before. (There are other reasons for excluding the last year, which I'll get to eventually.) I also think we should exclude the time before point A, when the aforementioned inefficient processing was artificially inflating demand. After that, we need to choose starting and ending points that are not too close to peaks or troughs, and that cover full years so the annual cycles average out. I think the best period, therefore, is from the middle of October 2006 to the same time in 2008. The best linear fit for this interval is 4%/year growth, and the exponential fit comes in between 8% and 10% annual growth.
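(For the curious, those two fits amount to something like the following; as in the earlier sketch, the file name and layout are hypothetical.)

    # Sketch: linear vs. exponential fit over the mid-October 2006 through
    # mid-October 2008 window.  The exponential fit is a linear fit to log(y).
    import csv
    from datetime import datetime

    import numpy as np

    def fit_window(path, start, end):
        months, busy = [], []
        with open(path) as f:
            for row in csv.reader(f):
                m = datetime.strptime(row[0], "%Y-%m")
                if start <= m < end:
                    months.append(m)
                    busy.append(float(row[1]))
        t = np.array([(m - start).days / 365.25 for m in months])
        y = np.array(busy)
        lin_slope, _ = np.polyfit(t, y, 1)            # y ~ a*t + b
        log_rate, _ = np.polyfit(t, np.log(y), 1)     # log(y) ~ r*t + c
        return {"linear (points/year)": lin_slope,
                "exponential (%/year)": (np.exp(log_rate) - 1) * 100}

    print(fit_window("rmf_prime_shift_avg.csv",       # hypothetical export
                     datetime(2006, 10, 15), datetime(2008, 10, 15)))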

Let me note a couple of other events. In June 2007 ("B" on the graph), after a multi-year project, UTCAT (the library card catalog) was replaced by a non-mainframe application. If that project had been completed on schedule, UTCAT usage wouldn't appear in our statistics at all. (I'm not sure why anyone would expect moving everything off the mainframe to take less time than moving the library catalog.) When I first started working here UTCAT accounted for about half of our mainframe usage, but it grew much more slowly than other applications, and by the time it moved it was using less than 5%. The next event ("C"), in November 2007, was when we purchased and installed the Natural Optimizing Compiler, a product that reduces the amount of CPU consumed by Natural. It had been around for years, but when we'd looked at it before, the cost appeared greater than the CPU savings it would provide.

So far we've been looking at average CPU usage; now I want to talk about peak usage. This more or less grows along with the average, but at a slightly higher rate. (I was thinking of peak usage when I said "10% to 20%.") The other thing about peak usage is that you can never use more than the available capacity: any time you see close to 100% busy, some processes didn't get the service they needed. (The technical term for this is "latent demand.") If you include those times when making projections, you'll underestimate the actual demand. This is the other reason I wouldn't include the last year when making projections.
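One simple way to keep saturated periods from dragging a projection down is to set them aside before fitting; a rough sketch, with an arbitrary 95% threshold:

    # Sketch: separate out intervals where the box was effectively saturated,
    # since observed CPU busy is capped at the available capacity and hides
    # latent demand.  The 95% threshold is illustrative, not a magic number.
    def split_saturated(samples, threshold=0.95):
        """samples: iterable of (avg_busy, max_busy, available_capacity) tuples."""
        usable, saturated = [], []
        for avg_busy, max_busy, available in samples:
            if max_busy >= threshold * available:
                saturated.append((avg_busy, max_busy, available))  # latent demand likely
            else:
                usable.append((avg_busy, max_busy, available))
        return usable, saturated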

I could go on and talk about the queuing Jim does in the Netscaler to control demand, or the demand generated by more responsive applications like HRMS, but this has probably gone on long enough. I'll just finish by saying that, while any projection of the future is, in the end, a guess, my projection is that with a new mainframe to support it we will see between 10% and 15% annual growth in demand.


(1) Also, when I say “historically” I’m thinking back to before we got the z890. I took over keeping track of CPU busy statistics from Bill Wagner in 1989, and he still had data from several previous years that he passed on to me.

Responses

  1. Adam Connor says:

    June 18th, 2010 at 8:43 am

    I guess there could be different (valid) reasons for tracking demand, but when considering sizing I would think we’d be most interested in how often we are hitting peak capacity or are very close to it. (From that point of view, the times when we are not running with all processors don’t seem as relevant, since we are clearly not requiring full capacity then.) One thing that stands out in the graph is that our peaks seem much wider (longer) now than they once were — and the latest figures understate this because of Trim.

    Then again, this considers all demand to be equally valid — not sure that is true, but also not sure there is a better alternative, absent clear and implementable guidance by campus leaders.

  2. ross hartshorn says:

    June 18th, 2010 at 9:25 am

    If it were worth it, the best thing to do with this data is:
    1) acquire the services of a statistician who knows how to do multivariate regression (with software like SAS, for example)
    2) leave in the entire time period, but add the variables “TrimOrNot”, “AppAWasOptimizedOrNot”, “UTCATOrNot”, and “NOCOrNot”, along with the variable for month and year
    3) this would give you the underlying (year-based) rate of change, without any of the above one-time events distorting it. Incidentally, it would also give you the month-to-month variation, and the size of each of the one-off events mentioned above, without being distorted by what time of year they changed

    Not saying this would necessarily be worth the effort, but it wouldn’t be too hard for a statistician with SAS or R, and would utilize all the data.
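    Something like this rough sketch (column names made up):

        # Rough sketch of (2): one row per month, with 0/1 indicator columns for
        # each of the one-time events; column names are made up.
        import pandas as pd
        import statsmodels.formula.api as smf

        df = pd.read_csv("monthly_cpu.csv")  # year, month, avg_busy, trim_on,
                                             # app_a_optimized, utcat_on, noc_on
        df["years"] = df["year"] + (df["month"] - 1) / 12.0

        fit = smf.ols("avg_busy ~ years + C(month) + trim_on"
                      " + app_a_optimized + utcat_on + noc_on", data=df).fit()
        print(fit.summary())  # the 'years' coefficient is the underlying trend,
                              # net of the one-off events and the annual cycle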

    The answer that comes out of it, of course, will still be “we’ve almost run out of mainframe capacity”. 🙂

  3. Adam Connor says:

    June 18th, 2010 at 12:47 pm

    Been a long time since I did any of this stuff, but I wonder if time series might be more appropriate. Anyhow, a trained statistician would certainly be able to build a better model, but the answer is obvious so I doubt it’s worth the work.

    Best course I ever took in statistics (and I took a bunch of them in grad school): non-parametric statistics.
