Optimizing Technology Infrastructure with Master Craftsmen

One of my former colleagues, Ralph Bertrum, has provided the primary material for today’s post on how to optimize a technology infrastructure with master craftsmen. Ralph is one of those master craftsmen in the mainframe infrastructure space. If you are a CIO or senior IT leader looking to improve your shop’s cost or performance, I recommend optimizing your infrastructure and systems through high-payback platform efficiency reviews.

In today’s shops, often with development and coding partially or fully outsourced, and without enough experienced and capable resources on staff, many applications are built for functionality without much regard for efficiency.  And nearly every shop has legacy applications where few engineers, if any, actually understand how they work. These applications have often been patched and extended so heavily that just having them run is viewed as the goal, rather than having them run effectively and efficiently. The result is that in most shops, 10, 20, or even 30% of compute and storage capacity is wasted on inefficient systems. This happens partly because it is easiest to just throw hardware at a problem and partly because shops do not have the superior engineering resources — or master craftsmen — required to identify and resolve such inefficiencies. Yet it is a tremendous waste and a major recurring cost for the IT shop, and a significant opportunity for IT leaders to attack.  In my experience, every one of these master craftsmen, given the framework and support, can easily return 4 to 10 times their annual compensation in savings each quarter!

So, how do you go about building and then leveraging such an efficiency engineering capability? First, to build the capability, you must be willing to invest in select engineers who are dedicated to this work. I recommend focusing on mainframe efficiency and server efficiency (Unix and Windows) as the primary areas of opportunity. Given the different skill sets, you should establish two separate efficiency teams for these areas. Storage usage should be reviewed as part of each team’s mission. A small team of two to four individuals is a good starting point. You can either acquire experienced talent or build up by leveraging promising engineers on staff and augmenting with experienced contractors until your staff have attained full capability. Ensure you invest in the more sophisticated tools needed to instrument the systems and diagnose the issues.  And importantly, ensure their recommended application and system changes are treated with priority and implemented so the savings are actually achieved. A monthly report on the recommendations and results completes the framework.

Now for the approach, which Ralph Bertrum, an experienced (perhaps even old-school) efficiency engineer, has provided for mainframe systems:

Having spent 50 years in Information Technology working on Mainframe Computers, I have seen a great many misunderstandings.  The greatest single misunderstanding is the value and impact of systems engineering training and experience, and its use in performing maintenance on a very costly investment. Many CIOs prefer to purchase a computer engine upgrade and continue to run a wasteful collection of jobs on a new, faster machine.  It is the easiest way out but definitely not the most cost-effective.  It is the equivalent of trading in your car every time the air filter, spark plugs, or hoses need changing or the tires need air, and then moving the old air filter, old spark plugs, old hoses, and old tires to the new car.

Would you drive around with a thousand-pound bag of sand in the trunk of your car?  Would you pull a thousand-pound anchor down the street behind your car?  That is exactly what you are doing when you don’t regularly review and improve the Job Control Language (JCL), Programs, and Files that run on your Mainframe Computer.  And would you transfer that thousand-pound bag of sand and that anchor to your new car every time you purchased a new one?  Most IT shops are doing just that: every time they upgrade their computer, they simply port all the inefficiencies to the new machine.

Platform efficiency reviews reduce waste across all kinds of resources: CPU, storage, memory, I/O, and networks. The results make the data center greener, reduce electricity bills and equipment footprints, speed online and batch processing, eliminate or delay the need for upgrades, reduce wall-clock and cycle times, and ultimately improve employee efficiency, customer satisfaction, and company profitability.

You can apply platform efficiency reviews to any server, but let’s use the mainframe as the primary example. And we will extend the analogy to a car, because there are many relevant similarities.

Both automobiles and computers need maintenance.  An automobile needs to have its oil, air filter, and spark plugs changed, its tires rotated, and its tire pressure checked.  Performed regularly, these save a large amount of gas over the useful life of the automobile and extend the life of the car.  Reasonable maintenance on a car can improve mileage by three to four miles per gallon, roughly a 20% improvement. When maintenance is not performed, gas mileage begins to degrade and the automobile becomes sluggish, loses its reliability, and will soon be traded in for a newer model.  The sand is growing in weight every day and the anchor is getting heavier.

For a mainframe, maintenance is not just upgrading the systems software to the most recent version. Additional maintenance work must be done on the application software and its databases. The transactions, files, programs, and JCL must be reviewed, adjusted, and tuned to identify and correct hot spots, inefficient code, and poor configurations that have been introduced by application changes, additional volume, or a different workload mix. Over the last twenty-five years I have analyzed and tuned millions of Mainframe Computer Jobs, Files, JCL, and Programs for more than one thousand major data centers, and every one of them was improved.  I have never seen a Mainframe Computer that couldn’t have its costs reduced by at least 10% to 15%, and more likely 20%, through a platform efficiency review.

Often, there are concerns that such tuning can introduce issues. But adjusting JCL or file definitions is just as safe as changing a spark plug or putting air in a tire.  It is simple and easy and does not change program logic.  The typical effect is that everything runs better and costs less.  The best thing about maintenance in a data center is that almost all of it lasts much longer than it does in an automobile, staying in effect with continued savings upgrade after upgrade, year after year.

Think of a Mainframe Computer as a special kind of motor vehicle with thousands of under-inflated tires.  By making simple adjustments you can get an improvement in performance from every one of those under-inflated tires. Each individual improvement is small, but because there are so many, the improvements multiply into a significant effect.  You get the cost reduction every time the file is used or the transaction is executed, and when the savings from all the little improvements are added together you will see a 15% to 20% reduction in processing costs. The maintenance is a one-time cost that pays for itself over and over, upgrade after upgrade.
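
To see why so many small fixes add up, consider some back-of-the-envelope arithmetic; every number below is hypothetical:

```python
# Illustrative arithmetic only: the volumes and per-fix savings are made up.
# Many tiny per-use savings compound because high-volume files and
# transactions run constantly.
daily_executions = 5_000_000    # hypothetical daily transaction volume
cpu_saved_per_exec = 0.0002     # seconds shaved per execution by one small fix
tuned_spots = 40                # number of "under-inflated tires" fixed

daily_cpu_saved = daily_executions * cpu_saved_per_exec * tuned_spots
print(f"{daily_cpu_saved:,.0f} CPU-seconds/day saved "
      f"({daily_cpu_saved / 3600:.1f} CPU-hours), recurring after a one-time fix")
```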

Here are some areas to focus on for performance improvement, with examples:

Avoid excessive data retention:  Many IT shops leave data in a file long after its useful life is over, or process data that is no longer meaningful.  Examples would be payroll records for an employee no longer with the company, General Ledger transactions from previous years, or inventory parts that are no longer sold.  By removing these records from the active file and saving them in separate archive storage, you save CPU every time the file is used, and work may complete much faster.  For example, one IT shop had an Accounts Receivable file with 14 million records.  Every day they would run the file through a billing program that produced and mailed invoices.  At that time a first-class postage stamp cost $0.32.  A recommendation was made to the CFO to purge all billing amounts of $0.32 or less from the billing file; it was silly to pay $0.32 to collect $0.32 or less.  Two million records were removed from the file, the daily job ran four hours faster, and they saved $35,000 a month on CPU and DASD space, to say nothing of employee time and postage costs.  After a trial period the minimum billing amount was raised to $1.00 and another set of very large savings was achieved.
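
A minimal sketch of that kind of purge, assuming a flat CSV extract with a `balance` column; the file names and threshold are illustrative, not the shop’s actual setup:

```python
# Sketch: split receivables into an active file worth invoicing and an
# archive of balances that cost more to bill than to collect.
import csv

MIN_BILLABLE = 0.32  # cost of a first-class stamp at the time

def split_receivables(infile, active_out, archive_out):
    with open(infile, newline="") as src, \
         open(active_out, "w", newline="") as act, \
         open(archive_out, "w", newline="") as arc:
        reader = csv.DictReader(src)
        active = csv.DictWriter(act, fieldnames=reader.fieldnames)
        archive = csv.DictWriter(arc, fieldnames=reader.fieldnames)
        active.writeheader()
        archive.writeheader()
        for row in reader:
            if float(row["balance"]) > MIN_BILLABLE:  # "balance" column assumed
                active.writerow(row)   # still worth invoicing
            else:
                archive.writerow(row)  # costs more to bill than to collect

split_receivables("receivables.csv", "receivables_active.csv",
                  "receivables_archive.csv")
```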

Optimize databases for their use:  An IT shop was looking to reduce the run time of a mailing-list label system.  Analysis of the data showed that 90% of the labels were located in California, and that the program looked up each city and state in a zip code table using a sequential search.  Each time the program needed a California city name, it had to do up to ninety thousand zip code compares before finding the correct city and state.  The table was rearranged so that the California entries came first, and the job went from running twenty hours to running only one hour.  CPU dropped by over 90%.  The same technique has worked with online transaction tables and file placement in Local Shared Resource (LSR) buffer pools. Optimizing databases is a key improvement technique.
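
Here is a toy sketch of the idea in Python; the zip data is made up, and the dictionary variant is a modern equivalent rather than what the shop actually did:

```python
# Sketch: the period fix was reordering a sequentially searched table so the
# hottest entries (California zips) come first.
zip_table = [
    ("10001", "New York", "NY"),
    ("60601", "Chicago", "IL"),
    ("90001", "Los Angeles", "CA"),
    ("94102", "San Francisco", "CA"),
]

def lookup_seq(zipcode):
    """Before: sequential scan; worst case walks the whole table."""
    for z, city, state in zip_table:
        if z == zipcode:
            return city, state

# The fix: a stable sort puts CA rows first, so 90% of lookups end early.
zip_table.sort(key=lambda row: row[2] != "CA")

# Today: a dictionary gives constant-time lookups regardless of skew.
zip_index = {z: (city, state) for z, city, state in zip_table}
print(zip_index.get("90001"))  # ('Los Angeles', 'CA')
```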

Optimize the infrastructure configuration for your system:  One shop had jobs that would run very quickly one day and very slowly the next.  After analyzing the jobs and the file locations, it was determined that the public storage pools contained two different types of disk drives, so the temporary work files landed on different disk drives every day.  What was the very best setting for one type of disk was the very worst for the other, and this was causing the erratic behavior.  The storage pool was changed to contain only one type of disk device; the problem went away and the jobs ran fast every day.
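
The kind of check that catches this is trivially scriptable. A toy sketch, with hypothetical pool contents:

```python
# Toy audit sketch: flag storage pools that mix device types, the condition
# behind the fast-one-day, slow-the-next behavior. Pool data is made up.
pools = {
    "PUBLIC01": ["3390-3", "3390-3", "3380-K"],   # mixed: trouble
    "PUBLIC02": ["3390-3", "3390-3", "3390-3"],   # uniform: fine
}

for name, devices in pools.items():
    kinds = sorted(set(devices))
    if len(kinds) > 1:
        print(f"{name}: MIXED device types {kinds} -- temp files will land "
              f"on different hardware from day to day")
    else:
        print(f"{name}: uniform ({kinds[0]})")
```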

Tune your systems to match your applications:  A mainframe comes with a great many capabilities and features, but if your team is not adept with them, your systems will not be optimized to run well.  I have analyzed over 1,000 data centers and applications and never once failed to discover significant tuning that could be accomplished within the existing system configuration. This happens because of a lack of training, experience, or focus. Ensure your team treats tuning as a priority and, if needed, bring in experts to adopt best practices. As an example of system tuning, I worked with a shop that had an online file accessing a database and was having a major response-time problem. They were afraid they were going to need a very costly upgrade. Every time they entered a transaction, the system would go into a 110% utilization mode with paging.  An efficiency analysis was conducted, and a system file was discovered that was doing sixteen million Input/Output (I/O) operations a day. After working with IBM to optimize the configuration, we achieved a 50% drop in I/O, to eight million per day, and response time improved to less than one second.  It turned out the shop had installed the system file as delivered and never adjusted it for their environment.

Tune your configurations to match your hardware:  When you make a hardware change, be sure to make all the necessary software changes as well.  Last year I worked with a very large bank that upgraded their disk drives but forgot to change a System Managed Storage (SMS) storage pool definition, and continued to run forty-five thousand monthly jobs using the worst blocksize possible in twenty-five hundred files.  When found and corrected, the forty-five thousand jobs ran 68% faster, with significant CPU savings as well as a 20% disk space savings.
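
To make the blocksize point concrete, here is a small sketch of the usual half-track arithmetic for fixed-length records on 3390-class DASD; the LRECL values are illustrative:

```python
# Sketch: a 3390 track holds two blocks of up to 27,998 bytes each, so an
# efficient BLKSIZE for a fixed-length (FB) dataset is the largest multiple
# of LRECL that fits in a half track.
HALF_TRACK = 27998

def half_track_blksize(lrecl: int) -> int:
    """Largest multiple of LRECL that still fits in a 3390 half track."""
    if not 0 < lrecl <= HALF_TRACK:
        raise ValueError("LRECL out of range for half-track blocking")
    return (HALF_TRACK // lrecl) * lrecl

for lrecl in (80, 133, 500):
    blk = half_track_blksize(lrecl)
    print(f"LRECL={lrecl}: BLKSIZE={blk} ({blk // lrecl} records per block)")
```

On current z/OS, coding BLKSIZE=0 lets the system determine the blocksize and accomplishes much the same thing, but definitions that hard-code a poor blocksize, as in the bank’s SMS pool, still abound.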

Ensure you are optimizing the most costly resource:  Remember that efficient use of disk space is important, but not nearly as important as CPU consumption.  Analysis at another company discovered that, to save disk space, many files were using a compression option.  The storage group had implemented this to save DASD space, and the increased CPU usage unwittingly forced a multi-million dollar CPU upgrade. The compression was removed on some of the files, CPU dropped by 20% across the board for both batch and online processing, and another upgrade was delayed for two more years. Optimizing disk usage at the expense of CPU resources may not be a good strategy.
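
Before enabling a compression option broadly, it is worth measuring the trade on your own data. A minimal sketch, using Python’s zlib as a stand-in for whatever compression your storage software offers, on a synthetic payload:

```python
# Sketch: measure CPU cost versus space saved for compression on sample data
# before turning it on fleet-wide. The payload below is synthetic.
import os
import time
import zlib

data = b"payroll-record " * 100_000 + os.urandom(1_000_000)

start = time.process_time()
compressed = zlib.compress(data, level=6)
cpu_secs = time.process_time() - start

saved = 1 - len(compressed) / len(data)
print(f"size: {len(data):,} -> {len(compressed):,} bytes ({saved:.0%} saved)")
print(f"CPU cost: {cpu_secs:.3f} CPU-seconds for {len(data) / 1e6:.1f} MB")
```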

Tune vendor software for your configuration:  Remember that vendors sell their product to thousands of customers, not just you. Each vendor’s product must run in all IBM-compatible environments, and many of those environments will be older or smaller than yours.  When you install vendor software, it should always be adjusted to fit properly in your environment.  Last year I did an analysis for a company that was beginning to have a run-time problem.  They had an online viewing product from a vendor and had set it up to create an online file for each customer.  They had created over three million online files and were adding one hundred thousand new ones every day.  They ran into serious performance issues because they did not understand the vendor software and the setup had been done incorrectly. So don’t add more sand to your computer by failing to understand how to best use a vendor product and configure it correctly for your system.

Understand your systems and avoid duplication whenever possible:  Duplication of data and work is a common issue. We reviewed one IT shop that backed up those same three million online files to one hundred volumes of DASD every night, a job that took five to six hours (and missed its SLA every morning).  Analysis showed that the files were view-only report files that were never updated or changed in any way.  Except for a small number of new files, they were exactly the same unchanged report files backed up over and over.  It would have been much better to back up only the newly created files each night.  After all, how many copies of a report file do you need?
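
The incremental approach is simple to express. A minimal sketch, with hypothetical paths standing in for the shop’s report library:

```python
# Sketch: copy only files created since the last run instead of re-copying
# millions of never-changing report files every night.
import shutil
import time
from pathlib import Path

SOURCE = Path("/reports/online")   # hypothetical report-file library
ARCHIVE = Path("/backup/reports")  # hypothetical backup target
STAMP = ARCHIVE / ".last_backup"   # records when the previous run finished

ARCHIVE.mkdir(parents=True, exist_ok=True)
last_run = STAMP.stat().st_mtime if STAMP.exists() else 0.0

copied = 0
for f in SOURCE.rglob("*"):
    if f.is_file() and f.stat().st_mtime > last_run:
        dest = ARCHIVE / f.relative_to(SOURCE)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(f, dest)  # older, unchanged files are skipped entirely
        copied += 1

STAMP.touch()  # mark this run complete
print(f"backed up {copied} new files")
```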

If it’s not used, remove it:  Remember that every file you are not using is typically still being backed up and stored every night.  If a file is not used, it should be backed up, saved to archive, and removed.  The space will be released and can be used for other purposes.
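
A minimal sketch of the kind of sweep that finds such files, assuming the filesystem tracks access times and using a hypothetical path and a one-year threshold:

```python
# Sketch: flag files untouched for a year as candidates to archive and remove.
import time
from pathlib import Path

POOL = Path("/data/active")          # hypothetical active storage pool
CUTOFF = time.time() - 365 * 86400   # one year ago

for f in POOL.rglob("*"):
    if f.is_file() and f.stat().st_atime < CUTOFF:
        print(f"archive candidate: {f}")
```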

So the next time you think you need a computer upgrade, don’t move the thousand-pound bag of sand or connect the anchor to your new computer. Remember that maintenance is easy, simple, safe, and green.  Maintenance has a much greater return on investment than an upgrade.  Conduct a thorough platform efficiency review; it will save you far more than you think, over and over, year after year, upgrade after upgrade from now on.

Best, Ralph Bertrum and Jim Ditmore

About Ralph: Ralph has been co-founder and principal of Critical Path Software since 1986. He is an inventor, designer, and software developer of the TURBO suite of mainframe analysis tools and expert performance-tuning database. He has provided performance-tuning services and analyses for over 1,100 major Fortune 5,000 corporations worldwide.  He is a former MVS, VSE, VM, CICS, and EDOS/VSE systems programmer, and an IDMS and IMS DBA.