Selected Highlights of the Labs21 2008 Annual Conference
Energy Efficient and Sustainable HPC at Pacific Northwest National Laboratory
Phil Tuma, P.E., 3M Corporation, Andrés Márquez, Ph.D., Pacific Northwest National Laboratory, Landon Sego, Ph.D., Pacific Northwest National Laboratory, Roger Schmidt, Ph.D., P.E., IBM Corporation, and Tahir Cader, Ph.D., HP (formerly SprayCool)
Power consumption in US data centers has been escalating at an alarming rate. In response to Public Law 109-431, the EPA reported that electricity usage by US data centers accounted for 1.5% of the total electricity used in the US in 2006. The report projected that if current data center operating practices continue, this share will almost double to 2.9% by 2011 (EPA, 2007). Consequently, the EPA issued a call to action to both government and industry to collaborate and set aggressive goals for reducing power consumption in data centers.
In the spirit of responding to the findings and recommendations of the EPA report, the Pacific Northwest National Laboratory (PNNL) has teamed with several key organizations, including The Green Grid (TGG), ASHRAE TC9.9, IBM, 3M, and SprayCool. As part of this effort, a highly instrumented liquid-cooled cluster has been installed at PNNL. The cluster is housed in an 800 ft² data center that resides, along with a significant amount of additional instrumentation, within a larger mixed-use facility. The eventual objective of the effort is to report the real-time power consumption, energy efficiency, and productivity of the liquid-cooled data center.
Preliminary results from the effort at PNNL are reported in this paper. Thermal results are reported for the hottest server components, including the microprocessors and memory DIMMs. Under all conditions tested, the components have not exceeded manufacturers' specifications. More importantly, the data show that the liquid-cooled servers can be maintained within specifications while rejecting heat to non-chilled facility water at 78°F (25.6°C). Furthermore, a reasonable extrapolation suggests that the specifications can still be maintained at 86°F (30°C). In an effort to address global warming, work has started on the qualification of a new 3M Fluoroketone fluid that has a Global Warming Potential (GWP) of 1. This GWP is the lowest published value of all commercially available coolants. Details are provided in the body of this paper.
PNNL and SprayCool have been collaborating on the development of an energy efficient data center cooling solution since 2004, and started with the conversion and testing of a rack of spray cooled HP rx2600 2U servers. The rack was installed and run for over a year in the Molecular Sciences Computing Facility's data center. This rack achieved an overall uptime of 96.9%. This effort was followed by the installation of a rack of spray cooled HP rx1620 1U servers and a similar evaluation. Given the performance of the cooling solution, PNNL issued a request for proposals to develop a turnkey spray cooled cluster. IBM was the winning bidder with SprayCool the chosen cooling solution. The decision by IBM to deliver to PNNL, in collaboration with SprayCool, a spray cooled cluster (named NW-ICE) represented a major milestone for the program.
Since the initiation of the effort, the ultimate objective has been to demonstrate the ability to raise a data center's energy efficiency through the deployment of liquid-cooled IT equipment. With the introduction of the Lieberman-Warner Climate Security Act of 2007 (Lieberman-Warner, 2007), the need to migrate to greener coolants has become a top priority. In this paper, the authors discuss both energy efficiency and sustainability issues in high performance computing environments.
Spray Cooled NW-ICE and the ESDC-TBF
SprayCool technology has been implemented in an IBM x3550 cluster, with the cluster installed in a small instrumented 800 ft2 data center housed within a larger mixed-use facility. The cluster includes seven compute racks, five of which are spray cooled. The cluster has been named NW-ICE and the data center is referred to as the Energy Smart Data Center–Test Bed Facility (ESDC–TBF, or simply ESDC). Additional information on SprayCool technology, NW-ICE, and the ESDC is provided in the remainder of this section.
Description of SprayCool Technology
Spray cooling broadly refers to the delivery of a coolant, via spray, to one or more heated objects. In the context of this paper, the sprayed coolant cools the microprocessors deployed in servers.
The technology can be implemented in a number of ways. The two most common approaches are referred to as global spray cooling (Cader et al., 2005) and hybrid (or indirect) spray cooling (Cader et al., 2006). In global spray cooling, electronics such as single board computers are placed in enclosures or chassis and are directly sprayed with an electrically non-conducting fluid such as 3M's PF5060. Upon delivery to the heated electronics, the coolant vaporizes and expands. The vapor and unevaporated coolant are delivered to a heat exchanger for cooling and a reservoir for collection. The fluid is then returned to the enclosure for delivery to the electronics.
The hybrid spray cooling approach refers to the fact that some of the heat from the electronics, typically from the microprocessors, is removed indirectly by spray, while the balance of the electronics is cooled with air. The heat removal is indirect in that the microprocessors are cooled by spray cooled cold plates placed directly on them. As with the global approach, the hybrid implementation operates in a closed loop, relying on a heat exchanger for heat rejection, a reservoir for fluid collection, and a pump for fluid delivery.
Description of NW-ICE
NW-ICE consists of 195 1U IBM x3550 servers, each with two quad-core Intel Xeon (Clovertown) sockets. The servers are housed in seven IBM 19” 42U equipment racks with a maximum of 28 servers per rack. The computational network is routed through a DDR2 InfiniBand switch housed in a separate rack, while the management IP network operates across seven rack-mounted HP switches. In addition to the management network, each node of NW-ICE can be addressed via five terminal servers.
NW-ICE, assembled through a joint effort between IBM and SprayCool, is a turnkey solution delivered to PNNL. System benchmarking with High Performance Linpack (HPL) clocks a minimum sustained performance of 9.3 TFlops. Five of the seven compute racks are spray cooled; the remaining two were left unmodified so that air cooling could be compared to liquid cooling.
Description of the ESDC
The ESDC (Figure 1) is a state-of-the-art 800 ft² data center housed in PNNL's Molecular Sciences Computing Facility (MSCF), which in turn is located in the Environmental Molecular Sciences Lab (EMSL). The ESDC is located adjacent to MSCF's prominent 163 TF HP supercomputer, sharing power and cooling infrastructure. The shared components include chillers, chilled water pumps, condenser water pumps, cooling towers, and other utilities. The uniqueness of the ESDC is that it is a true research-dedicated data center housed in a mixed-use facility. Standard air-cooling in the ESDC is provided by two nominal 30 ton air handler systems, located at opposite ends of the room.
Figure 1: Photograph of NW-ICE installed in the ESDC (HX = heat exchanger).
The ESDC is completely instrumented to measure power delivered to each server, power delivered to each rack, water flow rate delivered to each rack, water temperature, water temperature rise, power consumption by each air handler, and power delivered before and after each power distribution unit. Close to 500 sensors are deployed within NW-ICE. In essence, most of the data needed to quantify the key energy efficiency and energy productivity metrics for a computer running production jobs can be determined from the instrumentation immediately within the ESDC. In addition, there is a high level of instrumentation in the remainder of EMSL. The data capture and storage requirements are orchestrated by a software tool developed at PNNL, called FRED (Fundamental Research for Efficient Datacenters). FRED is linked to the building management and instrumentation system as well as to customized sensor arrays controlled by data acquisition systems distributed over the ESDC.
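The per-rack water flow rate and temperature-rise measurements described above are sufficient to estimate how much heat each rack rejects to the water loop. The following sketch applies the standard water-side energy balance Q = ṁ·cp·ΔT; it is not part of FRED, and the function name and example values are illustrative only.

```python
# Sketch (not the actual FRED tool): estimating rack heat rejection from
# the instrumented water flow rate and water temperature rise.
# Function name and example values are illustrative, not from the paper.

def rack_heat_load_kw(flow_gpm: float, delta_t_f: float) -> float:
    """Heat rejected to the water loop, Q = m_dot * cp * dT.

    flow_gpm  -- water flow rate in US gallons per minute
    delta_t_f -- water temperature rise across the rack in deg F

    Uses the common rule of thumb Q[BTU/hr] = 500 * GPM * dT[F]
    (valid for water near room temperature), converted to kW.
    """
    btu_per_hr = 500.0 * flow_gpm * delta_t_f
    return btu_per_hr / 3412.14  # 1 kW = 3412.14 BTU/hr

# Example: 10 GPM with a 5 F rise across a spray cooled rack
print(round(rack_heat_load_kw(10.0, 5.0), 2))  # ~7.33 kW
```

Summed over the five spray cooled racks, the same balance gives the total heat diverted from the room air to the facility water loop.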
Tests Conducted in the ESDC
A number of preliminary experiments have been conducted in the ESDC with the ultimate goals of: (1) characterizing the chip case temperatures for various combinations of the cooling water temperature and CRAH temperatures, (2) comparing the energy consumption of the air versus spray cooled technology, and (3) quantifying the energy usage relative to the amount of useful work performed by the data center. The experiments which address (2) and (3) are still underway with the results pending. The results in relation to (1) are discussed below in the section, "Results from the ESDC."
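For goals (2) and (3), the following is a minimal sketch of the kinds of metrics involved. PUE (Power Usage Effectiveness) is The Green Grid's facility-level metric; the productivity ratio shown is purely illustrative, since the paper's productivity results are still pending and its exact metric is not specified here.

```python
# Sketch of the kind of efficiency metrics the pending experiments target.
# PUE is The Green Grid's facility metric; the productivity ratio below is
# an illustrative placeholder, not a metric defined in the paper.

def pue(total_facility_kw: float, it_kw: float) -> float:
    """Total facility power divided by IT equipment power (>= 1.0 ideally)."""
    return total_facility_kw / it_kw

def useful_work_per_kwh(tflop_hours: float, energy_kwh: float) -> float:
    """Illustrative productivity: sustained TFLOP-hours per kWh consumed."""
    return tflop_hours / energy_kwh

# Example with made-up numbers: 180 kW at the facility meter, 120 kW at the IT load
print(round(pue(180.0, 120.0), 2))  # 1.5
```

The ESDC instrumentation described earlier (per-server and per-rack power, PDU input/output power, air handler power) supplies exactly the numerator and denominator terms such metrics require.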
Results from the ESDC
Figures 2 and 3 present data for a series of experiments conducted on March 19, 2008. For these experiments, spray cooled racks B1 and B2 were fully exercised while running CPU Burn to stress the functional units of the processors. During the experiments, the CRAH and rack cooling water¹ set point temperatures were varied according to the numbering used in the figures: 1 = CRAH set point temperature of 80°F and water temperature set point of 68°F; 2 = water temperature set point raised to 86°F; 3 = CRAH set point temperature changed to 65°F; 4 = water temperature set point dropped to 70°F. A primary objective of the experiments was to show an ability to cool the CPUs while rejecting the waste heat to 86°F water. A water temperature of 86°F was selected because condenser water (i.e., cooling tower water) is unlikely to exceed this value at any time of the year at a majority of US locations. Meeting this target would prove the ability to cool the CPUs without the use of a power-hungry chilled water plant.
Figure 2 (left-hand image) presents the case temperatures for a number of instrumented CPUs in Racks B1 and B2 plotted against time. In addition, the right-hand vertical axis shows the temperature of the water delivered to the spray cooled racks. Due to an unresolved test set-up issue, the target water temperature of 86°F was not achieved (the water reached approximately 78°F). Figure 2 shows that all the CPUs, with the exception of one, reached a maximum temperature of just under 130°F, while the remaining CPU reached a maximum temperature of 140°F. The manufacturer's maximum allowable case temperature for these CPUs is 150.8°F (66°C). The data for a 'stable' period between 15:25 and 17:02 were used to extrapolate the CPU case temperatures to a level corresponding to a water temperature of 86°F (see right-hand image of Figure 2). This worst-case extrapolation suggests that the case temperatures will not exceed 148°F even at a water temperature of 86°F, supporting the claim that the CPUs can be cooled by rejecting their waste heat to non-chilled water. It is important to note that the validity of the extrapolation depends upon the reasonable assumption that the physical properties of the cooling fluids do not change over the range of the extrapolation.
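The extrapolation in Figure 2 can be sketched as an ordinary least-squares fit of CPU case temperature against delivered water temperature, evaluated at 86°F with a 95% prediction interval. The data below are synthetic (the paper's measurements are not reproduced here), and the t-critical value is hardcoded to keep the sketch dependency-free.

```python
# Sketch of the Figure 2 extrapolation: least-squares fit of CPU case
# temperature vs. delivered water temperature, extrapolated to 86 F with
# a 95% prediction interval. Data are synthetic, for illustration only.
import math

water_f = [68.0, 70.0, 72.0, 74.0, 76.0, 78.0]        # delivered water temp (F)
case_f  = [121.0, 123.1, 124.9, 127.2, 129.0, 131.1]  # CPU case temp (F)

n = len(water_f)
xbar = sum(water_f) / n
ybar = sum(case_f) / n
sxx = sum((x - xbar) ** 2 for x in water_f)
slope = sum((x - xbar) * (y - ybar) for x, y in zip(water_f, case_f)) / sxx
intercept = ybar - slope * xbar

# Residual standard error of the fit
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(water_f, case_f))
s = math.sqrt(sse / (n - 2))

x_new = 86.0
y_hat = intercept + slope * x_new
# 95% prediction interval; t-critical for n-2 = 4 df hardcoded (no scipy)
t_crit = 2.776
half_width = t_crit * s * math.sqrt(1 + 1 / n + (x_new - xbar) ** 2 / sxx)

print(f"predicted case temp at 86 F: {y_hat:.1f} F "
      f"(95% PI: {y_hat - half_width:.1f} to {y_hat + half_width:.1f} F)")
```

As in the paper, the validity of such an extrapolation rests on the coolant's physical properties remaining effectively constant over the extrapolated range.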
Figure 2: CPU case temperatures for Racks B1 and B2 (specific nodes indicated in legend). The red interval in the extrapolation plot indicates the 95% prediction interval.
Figure 3 shows the data for DIMM2 and DIMM4, for all nodes in Racks B1 and B2, as a function of experiment time. The data were acquired at the same time, and under the same conditions, as the CPU data presented in Figure 2. DIMM2 was the coolest DIMM for all the nodes in Racks B1 and B2, while DIMM4 was the hottest. As shown, DIMM4 temperatures reach a maximum of 149°F (65°C), which is well below the typical maximum allowable junction temperature of 167°F (75°C) for similar fully buffered DIMMs. Since the DIMMs are air-cooled, the temperature results are invariant to the temperature of the water delivered to the racks. The DIMMs do, however, benefit from the fact that the CPU heat is not rejected to the server ambient.
Figure 3: DIMM surface temperatures for Racks B1 and B2 (specific nodes indicated in legend)
Sustainable Fluid Chemistries
The issue of ozone depleting chemicals was addressed with the 1987 Montreal Protocol. In response to this protocol, Hydrofluorocarbons (HFCs) like HFC-23, 32, 134a, 245fa and blends thereof were introduced. Since the Montreal Protocol, attention has turned to global warming. The 1997 Kyoto Protocol, and the 2006 European Union F-Gas regulation, are the result of the concern about global warming. Under the Kyoto Protocol, various nations pledged to reduce their emissions of greenhouse gases such as Perfluorocarbons (PFCs) and HFCs, and companies within those nations are compelled to achieve reductions as well. Although the US did not ratify the Kyoto Protocol, individual states are addressing the issue of global warming emissions with their own regulations. As an example, the California Air Resources Board (CARB) has suggested emission standards for commercial refrigeration systems and a variety of other HFC applications (CARB, 2007). Refrigeration and food industry consortia like the US EPA's GreenChill partnership are also striving to reduce HCFC and HFC refrigerant emissions (US EPA, 2007).
Most recently, the Lieberman-Warner Climate Security Act of 2007 (Lieberman-Warner, 2007) commenced a dialog on the need to more stringently regulate greenhouse gases such as Perfluorocarbons and HFCs. This legislation is expected to be aggressively promoted by President-elect Barack Obama (2008 election). It is noted that NW-ICE is currently using a Perfluorocarbon (PF5060), while typical pumped refrigerant data center cooling systems commonly use HFC-134a.
In light of the mounting evidence of the impact of greenhouse gases on global warming, 3M and SprayCool have teamed to investigate the deployment of environmentally friendly and sustainable fluid chemistries for use in SprayCool systems. Figure 4 highlights the Global Warming Potential (GWP) of several commercially available fluids. The two most attractive fluids, from the standpoint of acceptable GWPs, are Fluoroketones (C6K) and Hydrofluoro-olefins (HFO-1234yf). The Hydrofluoro-olefins, while attractive from the standpoint of their thermophysical properties, have an unacceptably high vapor pressure in the preferred operating temperature range for SprayCool systems. The Fluoroketone C6K, along with another related compound with a higher boiling point, is attractive from the standpoint of the thermophysical properties, as well as the vapor pressure in the preferred operating temperature range. Fluid qualification in SprayCool systems is currently underway, and results will be reported in an upcoming publication.
Figure 4: Comparative Global Warming Potentials (GWPs) of new and commercially significant compounds
In addition to the concern over global warming, power consumption in data centers is an issue of critical importance in the US and around the world. In this paper, the authors document an effort to address both areas of concern.
The liquid-cooled cluster NW-ICE, installed at PNNL, has been operational for almost two years, with a number of production jobs running during the last year. Two key objectives of the undertaking have been to demonstrate (1) the viability and (2) the energy efficiency of a liquid-cooled cluster. The viability of the cooling solution has been demonstrated by the cluster operating in both developmental and production modes with minimal downtime. Proving the energy efficiency of the solution remains a work in progress. The experimental results to date suggest the ability to reject the processor waste heat to 30°C cooling tower water. Consequently, spray cooled processors would not require chilled water, thereby reducing the load on the chiller plant. Finally, work is underway to demonstrate the ability to migrate from the current perfluorocarbon coolant to a green Fluoroketone coolant. The selected coolant has a global warming potential of 1, which is the lowest published value of all commercially available coolants.
- Cader, T., Tolman, B., Kabrell, C., and Krishnan, S., 2005, “SprayCool Thermal Management for Dense Stacked Memory,” proceedings of IMECE 2005, paper # IMECE2005-81692, Orlando (FL).
- Cader, T., Westra, L., Marquez, A., McAllister, H., and Regimbal, K., 2006, “Performance of a Rack of Liquid-Cooled Servers,” ASHRAE Journal paper #DA-07-12.
- CARB, 2007, “Expanded List of Early Action Measures to Reduce Greenhouse Gas Emissions in California Recommended for Board Consideration,” California Environmental Protection Agency Air Resources Board, September 2007.
- US EPA, 2007, “GreenChill Partnership.” See also: www.epa.gov/ozone/partnerships/greenchill/.
- EPA, 2007, “Report to Congress on Server and Data Center Energy Efficiency,” in response to Public Law 109-431, August 2, 2007. See also: www.energystar.gov/index.cfm?c=prod_development.server_efficiency_study
- Lieberman-Warner, 2007, “Lieberman-Warner Climate Security Act (S. 2191).”
Footnote 1: The "water temperature" refers to the temperature of the water delivered to the spray cooled racks. The water is used to remove the CPU waste heat via a liquid-to-liquid heat exchanger located in the bottom of each rack.
Phillip Tuma, P.E., is an advanced application development specialist in the Electronics Markets Materials Division of 3M Company. He has worked for 13 years developing applications for fluorinated heat transfer fluids in various industries, including military and aerospace electronics, supercomputers, lasers, pharmaceuticals, and semiconductor manufacturing. Mr. Tuma received a B.A. from the University of St. Thomas, a B.S.M.E. from the University of Minnesota and a M.S.M.E. from Arizona State University.
Dr. Andrés Márquez is the Principal Investigator for the Energy Smart Data Center project at the Pacific Northwest National Laboratory (PNNL). He is the lead hardware architect for the Data Intensive Computing Initiative at the laboratory. He also acts as a scientist at the Center of Adaptive Software Systems and the Exascale Computing Initiative. Dr. Márquez is a high performance computer and compiler architect who has worked on the development of the German supercomputers SUPRENUM and MANNA, on the GENESIS European supercomputer design studies, on US supercomputer design studies for the Hybrid Technology and Multithreaded Technology (HTMT) computer (funded by NASA, DARPA, NSA, and JPL), and on academic high performance computing projects such as the Efficient Architecture for Running Threads (EARTH) and the Compiler Aided Reorder Engine (CARE). He has published over 30 peer-reviewed papers in journals and conferences in the fields of hardware, software, and systems architecture, as well as IT infrastructure.
Dr. Landon Sego is a collaborative scientist for the Energy Smart Data Center (ESDC) project at the Pacific Northwest National Laboratory (PNNL). His doctoral research focused on statistical methods for health care surveillance and the monitoring of rare events. His areas of statistical expertise also include experimental design, quality control, statistical programming, and methodologies which account for less-than-detect data. While at PNNL, Dr. Sego has provided statistical consultation in the design and analysis of experiments involving staff training, graphical user interfaces, and the ESDC. As a graduate student, he provided statistical consultation in the design and analysis of experiments for graduate students and faculty in the Agricultural, Biological, Veterinary, and Engineering sciences. Dr. Sego also consulted with Bank of America developing statistical methodology and customized software to improve the quality of the information used to guide investment strategies. He has authored (or co-authored) a number of journal articles and presentations in both statistical journals and other subject areas.
Dr. Roger R. Schmidt, Distinguished Engineer, National Academy of Engineering Member, IBM Academy of Technology Member and American Society of Mechanical Engineers (ASME) Fellow, has over 30 years of experience in engineering and engineering management in the thermal design of IBM's large-scale computers. He has led development teams in cooling mainframes, client/servers, parallel processors and test equipment utilizing such cooling mediums as air, water, and refrigerants. He has published more than 100 technical papers and holds 96 patents/patents pending in the area of electronics cooling. He is a member of ASME's Heat Transfer Division and an active member of the K-16 Electronics Cooling Committee. He has been an Associate Editor of the Journal of Electronics Packaging and is now associate editor of the ASME Journal of Heat Transfer. Over the past 25 years he has taught Mechanical Engineering courses extensively for prospective Professional Engineers and has given seminars on electronics cooling at a number of universities. He is Chair of the ASHRAE TC9.9 committee on Mission Critical Facilities, Technology Spaces, and Electronic Equipment.
Dr. Tahir Cader is an alternate director of The Green Grid, is a member of The Green Grid's Technical and Liaison Committees, and is a member of ASHRAE Technical Committee 9.9. At SprayCool, Dr. Cader serves as Technical Director. His areas of emphasis include high performance computing and commercial data centers, as well as system architecture development for telecom, semiconductor test, and other emerging market opportunities for SprayCool's liquid-cooling technology. With over 14 years in the high tech industry, he has served on several industry electronics packaging and thermal management panels, has been a member of the organizing committee for several technical conferences, and has served as session chair/co-chair for several technical conferences. Dr. Cader is both a sole inventor as well as a co-inventor on 15 SprayCool patents issued, a sole/co-inventor on 12 filed SprayCool patents, and a co-author for more than 40 peer-reviewed journal, conference, and trade journal technical articles. He is also a significant contributor to several published ASHRAE Datacom Series books.