28 | May | 2010 | Thimbles & Care

RAS is an acronym IBM likes to use a lot; it stands for “Reliability, Availability, Serviceability.” In general, it’s about how much you can count on a system to be up and running when you need it.

While we’d always prefer systems that were perfectly reliable and always available, getting there costs a lot of money. Part of designing a system involves trading off RAS characteristics against cost: if it’s OK for a service to be down for hours at a time, why spend the extra money for highly reliable hardware and software?
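
To put rough numbers on that trade-off, here is a minimal sketch that converts an availability percentage into expected downtime per year. The percentages are the usual "nines" used for illustration, not figures for any particular IBM system, and the function name is made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Expected hours of downtime per year at a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Illustrative availability targets only; not measurements of any real system.
for pct in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% available -> about {hours:.2f} hours down per year")
```

A service that can tolerate 99% availability is allowed more than three days of downtime a year, while each additional nine cuts that allowance by a factor of ten, and that is what the extra money buys.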

IBM’s z Series hardware and the z/OS operating system are designed for very high RAS, which is one of the reasons for their high prices. I think that many services on our mainframe need this level of reliability, but lots of others are there only because they have integration points with the mission-critical services. It would be really good if the business leaders of the University would categorize the services the mainframe provides according to how critical they are. Then, if we could solve some of the integration problems I mentioned in my last post, we could start running the less critical applications on less costly platforms.

This would also help us more immediately during registration and other times of peak demand. Since we don’t have a big enough mainframe to meet the demand at these times, our only way of getting through is to stop some of the services. (By the way, why does last summer’s “mainframe efficiency initiative” keep getting touted as a success? We only made it through August registration because Jon turned off the Trim monitor on Adabas, which means we now have no way to diagnose database performance problems, and in January we did have to turn off services.) As systems administrators, we can’t really evaluate the relative priorities of different applications, so during crunch times we don’t know what to stop and what to try to keep running. In January we picked services that were timing out anyway (it seems safe to turn something off if it’s not working), but if that isn’t enough, we really shouldn’t be the ones deciding.
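
If the business did hand us a criticality ranking, deciding what to stop during a crunch could become mechanical rather than guesswork. The sketch below is only an illustration of that idea: the service names, tiers, and load figures are all invented, and it assumes load can be expressed as a single number per service, which is a big simplification.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    criticality: int  # 1 = mission-critical; larger = less critical (hypothetical tiers)
    load: int         # capacity units used at peak (made-up figures)


def services_to_stop(services, capacity):
    """Return the services to stop so that total load fits within capacity.

    Least critical services are stopped first; within a tier, the
    heaviest service is stopped first.
    """
    total = sum(s.load for s in services)
    stopped = []
    # Walk from least critical to most critical.
    for s in sorted(services, key=lambda s: (-s.criticality, -s.load)):
        if total <= capacity:
            break
        stopped.append(s)
        total -= s.load
    return stopped


# Hypothetical catalogue; every name, tier, and load here is invented.
catalogue = [
    Service("registration", 1, 50),
    Service("payroll", 1, 20),
    Service("directory-lookup", 2, 15),
    Service("report-batch", 3, 25),
    Service("archive-search", 3, 10),
]

for s in services_to_stop(catalogue, capacity=90):
    print(f"stop {s.name} (tier {s.criticality}, load {s.load})")
```

The point of the sketch is that the tiers are an input: the code can only apply a ranking, not invent one, which is why the ranking has to come from the business side.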

(As long as I’m talking about RAS, I should mention that one of the problems we had with the migration assessment plan was that the hardware it recommended is from a lower reliability class than the current mainframe. Multiple hardware vendors had provided specifications for candidate systems, but the one that made it into the report was the one that didn’t meet what we felt were minimum reliability criteria. Also, the other candidates, which were high-reliability systems, didn’t cost that much less than a z Series machine.)

Curtis Pew’s thoughts related to his job

RAS