With Great Power Management Comes Great Responsibility

Yes, that title is a tweak of a quote attributed to Voltaire or Winston Churchill or maybe Superman or Spider-Man, depending upon your age and whom you ask. But power management in HPC is now not only possible but highly desirable. We, as a community, must become more aware of our power consumption and usage models and, more importantly, we must begin to create policies and procedures around power management. HPC Matters, and we have the responsibility to use power effectively in doing that important work.

In the past we would hear things like “We don’t pay the power bill, so it’s not that important to us,” or “Power is a facilities thing; we are the Data Center.” Sometimes we would even hear “Power around here is cheap,” and “I need to use my whole power budget to justify additional resources.”

But today we are hearing things like “The Data Center has a power limit.” and “Winter (and/or Summer) power demands exceed resources.” In many cases, the power costs are now included in the total HPC costs and in some places around the world, the power costs change hourly, daily, and seasonally.

Instead of addressing these issues head-on, let me show by example, using typical and atypical scenarios, why now is the time to begin thinking about and creating policies and procedures around power management. By doing so now, we may be able to proactively address some unseen political, facility, and end-user issues that will almost certainly arise if we do nothing.

The first typical example results from the push for higher-efficiency facility cooling. As the efficiencies are pushed and our tolerances are lowered, there will be more hours per day, or maybe days per year, when the facility’s cooling system is less efficient. The cooling system may then consume more power, that power may not be available, and the HPC resources may be asked to lower their power consumption. Similarly, that same request for lowered power consumption may begin to appear during facilities maintenance events.

My first atypical example is one we all hope never occurs: a complete cooling failure. Power is still readily available and jobs are continuing, but where should we spend our limited time? I personally would put saving the parallel file system at the top of the list.

Another example is degraded cooling with a long-lead-time replacement part. Again, power is readily available, but our HPC resources can generate more heat than the degraded cooling system can handle. Should we stop some queues or jobs and shut down part of the system? Maybe. But which ones? Suppose we knew which jobs were required for an upcoming paper or conference, or to complete a grant application, or for a report to the Director, the Governor, the General, the Admiral, the Senator, etc. Then we could keep those jobs running and lower the power consumption of all the others.
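To make the degraded-cooling scenario concrete, here is a minimal sketch of what such a site policy might look like in Python. It is an illustration only and does not use any SGI or Altair interface: the Job class, the critical flag, the plan_caps function, and the wattage figures are all hypothetical. The assumed rule is that jobs the site has marked as critical keep their full power, and every other job is scaled back by the same factor so the total fits the reduced budget.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; not tied to any real scheduler."""
    name: str
    requested_watts: float  # power the job draws when uncapped
    critical: bool = False  # set by site policy (paper, grant, report, ...)


def plan_caps(jobs, budget_watts):
    """Return {job name: power cap in watts} for a reduced power budget.

    Assumed policy: critical jobs keep their full requested power;
    all other jobs share whatever budget is left, scaled proportionally.
    """
    critical_total = sum(j.requested_watts for j in jobs if j.critical)
    if critical_total > budget_watts:
        # The policy cannot be met; a human has to decide what to stop.
        raise ValueError("critical jobs alone exceed the power budget")

    other_total = sum(j.requested_watts for j in jobs if not j.critical)
    remaining = budget_watts - critical_total
    # Never scale above 1.0: a job is not given more than it asked for.
    scale = min(1.0, remaining / other_total) if other_total > 0 else 0.0

    caps = {}
    for job in jobs:
        if job.critical:
            caps[job.name] = job.requested_watts
        else:
            caps[job.name] = job.requested_watts * scale
    return caps


if __name__ == "__main__":
    jobs = [
        Job("conference-paper", 400.0, critical=True),
        Job("grant-application", 300.0, critical=True),
        Job("parameter-sweep", 600.0),
        Job("regression-tests", 200.0),
    ]
    # Degraded cooling can only handle 1000 W of the 1500 W requested.
    for name, cap in plan_caps(jobs, 1000.0).items():
        print(f"{name}: {cap:.0f} W")
```

A real policy would likely need more than this, for example a minimum cap below which a job should be held rather than throttled, but the sketch shows why the site has to know in advance which jobs are the critical ones.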

With SGI® System Management Suite (SMS), we currently have the ability to power cap the entire system, a complete rack, or a specific node. Coming this summer (2015), with the help of our partner Altair, SGI’s SMS will work in concert with Altair PBS Professional. Altair PBS Professional is using SGI Management Suite’s power management capabilities, which measure and limit power, to manage power resources on a per-job basis. This capability will be available across all of SGI’s HPC product lines.

We will have the power management capabilities to use electricity more effectively in normal operations and to serve our community more effectively when things go wrong. But the policies and procedures to exploit these capabilities remain in the hands of individual sites. Now is the time to create those policies and procedures around power management. HPC Matters, and we have the responsibility to use power effectively and serve our community optimally. #HPCMatters
