RAS

May 28th, 2010  |  Published in Uncategorized  |  3 Comments

RAS is an acronym IBM likes to use a lot; it stands for “Reliability, Availability, Serviceability.” In general, it’s about how much you can count on a system to be up and running when you need it.

While we’d always prefer systems that were perfectly reliable and always available, getting there costs a lot of money. Part of designing a system involves trading off RAS characteristics against cost: if it’s OK for a service to be down for hours at a time, why spend the extra money for highly reliable hardware and software?

IBM’s zSeries hardware and the z/OS operating system are designed for very high RAS, which is one of the reasons for their high prices. I think many of the services on our mainframe need this level of reliability, but plenty of others are there only because they have integration points with the mission-critical services. It would be really good if the business leaders of the University would categorize the various services provided by the mainframe according to how critical they are. Then, if we could solve some of the integration problems I mentioned in my last post, we could start running the less critical applications on less costly platforms.

This would also help us more immediately during registration and other periods of peak demand. Since we don’t have a big enough mainframe to meet the demand at these times, our only way of getting through is to stop some of the services. (By the way, why does last summer’s “mainframe efficiency initiative” keep getting touted as a success? We only made it through August registration because Jon turned off the Trim monitor on Adabas, which means we now have no way to diagnose database performance problems, and in January we did have to turn off services.) As systems administrators, we can’t really evaluate the relative priorities of different applications, so during crunch times we don’t know what to stop and what to try to keep running. In January we picked services that were timing out anyway (it seems safe to turn something off if it isn’t working), but if that isn’t enough, we really shouldn’t be the ones trying to decide.

(As long as I’m talking about RAS, I should mention one of the problems we had with the migration assessment plan: the recommended hardware is from a lower reliability class than the current mainframe. Multiple hardware vendors had provided specifications for candidate systems, but the one that made it into the report was the one that didn’t meet what we felt were minimum reliability criteria. And the other high-reliability systems didn’t cost that much less than a zSeries machine.)

Responses

  1. Adam Connor says:

    June 1st, 2010 at 8:54 am

    I don’t know much about RAS, but our current architecture’s “SOA” solution is basically an in-process model. This gives very good performance, but requires a monolithic deployment: I can’t CALLNAT your secured module unless we are both running in the same place.

    If we moved from in-process to something else (true SOA, perhaps), you would have the flexibility to move applications around, but the performance would degrade a lot. Given our high level of integration, I would guess that would cause serious problems elsewhere, notably in batch programs. Such jobs could perhaps be rewritten to avoid this: instead of making a CALLNAT for every FOO entity seen, you could write the entities to a dataset and make a single SOA call that processes the whole dataset. But that’s a lot of work.
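    The batching idea above can be sketched in a few lines. This is purely illustrative Python, not any real mainframe API: process_record stands in for whatever per-entity work the remote service does, and the round-trip counts just make the cost difference visible.

    ```python
    # Hypothetical sketch: per-entity remote calls vs. one batched call.
    # All names here are invented for illustration.

    def process_record(record):
        """Stand-in for the per-entity work the remote service performs."""
        return record * 2

    def call_per_entity(records):
        """One remote call per entity: one round trip for each record."""
        results = []
        for r in records:
            results.append(process_record(r))  # each iteration = one round trip
        return results, len(records)           # (results, round trips)

    def call_batched(records):
        """Write the records out first, then one call processes them all."""
        batch = list(records)                  # stand-in for writing a dataset
        results = [process_record(r) for r in batch]
        return results, 1                      # one round trip total
    ```

    Both versions produce the same answers; the difference is the number of round trips, which is exactly what starts to hurt once the call has to leave the process.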

    It’s also one of the questions I have about moving to something like Python. It’s too dynamic a language for an in-process model to be secure, but I’m not sure how we would scale any alternative model.

  2. ross hartshorn says:

    June 2nd, 2010 at 11:40 am

    If we had the ability to run just the database, and batch programs, on the central (high RAS) hardware, we could handle this a bit better. Really, most of what we write in Natural doesn’t need to be on the same machine as the database, but some of it does.

    As I understand it, the biggest obstacle to accessing Adabas without Natural is our home-grown security system. If we could allow unNatural software to access Adabas directly, but only for the files it has business accessing, we could migrate code off the mainframe a piece at a time. Who knows, we might never have to move the last few pieces (probably batch programs) off, but even if we did we could do them last, while extending the life of the current mainframe by moving non-critical code off.
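    The per-file gating described above amounts to an allowlist check in front of the database. A minimal sketch, with every name and file number invented for illustration (our real security system obviously works nothing like this internally):

    ```python
    # Hypothetical gateway: off-mainframe code may read only the Adabas
    # files it is explicitly allowlisted for. File numbers are made up.

    ALLOWED_FILES = {
        "payroll_app": {135, 136},   # Adabas file numbers this app may read
        "registration_app": {201},
    }

    def authorize(app_id, file_number):
        """Return True only if app_id is allowlisted for file_number."""
        return file_number in ALLOWED_FILES.get(app_id, set())

    def read_file(app_id, file_number):
        """Gate a (stubbed) direct read behind the allowlist check."""
        if not authorize(app_id, file_number):
            raise PermissionError(f"{app_id} may not read file {file_number}")
        return f"records from file {file_number}"  # stand-in for the real read
    ```

    The point is just that the check is per-file, not all-or-nothing, which is what would let code migrate off a piece at a time.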

    I take it switching to a different security system, one that could accept SQL from off the mainframe without granting access to everything, is a task so big you might as well rewrite everything?

  3. Adam Connor says:

    June 2nd, 2010 at 3:43 pm

    But then we would end up with a lot of, say, Python code written to access Adabas directly. I don’t think that’s necessarily the way to go; nothing against Adabas per se, but the tools all support SQL, so I think it would make more sense to migrate in that direction.
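    To make the point concrete: once data is reachable through SQL, ordinary tooling works against it unchanged. Here sqlite3 stands in for whatever SQL engine a migration would actually target, and the schema is invented for illustration.

    ```python
    # Sketch only: sqlite3 as a stand-in SQL target; hypothetical schema.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO students (id, name) VALUES (?, ?)",
                     [(1, "Ada"), (2, "Grace")])

    # Any SQL-aware tool (reporting, ORMs, BI) could issue this same query.
    rows = conn.execute("SELECT name FROM students ORDER BY id").fetchall()
    print(rows)  # [('Ada',), ('Grace',)]
    ```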

    It does strike me that the conversation always seems to brush up against the tension between doing what’s best for us and being compatible with what’s popular. Most places choose the popular path because it is easier. To do something else requires commitment and vision.
