
So, You Want to Go Liquid – Here’s What Awaits Your Data Center Team

Yes, in some cases you’ll need gloves and aprons, but once you get some practice with liquid, it may provide a more comfortable working environment.

Mary Branscombe | Aug 09, 2018

Liquid cooling can be more efficient than air cooling, but data center operators have been slow to adopt it for a number of reasons, ranging from it being disruptive in terms of installation and management to it being simply unnecessary. In most cases where it does become necessary, the driver is high power density. So, if your data center (or parts of it) has reached the level of density that calls for liquid cooling, how will your day-to-day routine change?

Depending on how long you've been working in data centers, liquid cooling may seem brand new (and potentially disturbing) or pretty old-school. “Back in the 80s and 90s, liquid cooling was still common for mainframes as well as in the supercomputer world,” Chris Brown, CTO of the Uptime Institute, says. “Just being comfortable with water in data centers can be a big step. If data center managers are older, they may find it familiar, but the younger generation are nervous of any liquid.”

GRC's cooling tanks at a CGG data center in Houston

There’s often an instinctive reluctance to mix water and expensive IT assets. But that concern goes away once you understand the technology better, because in many cases, the liquid that’s close to the hardware isn’t actually water and can’t do any damage.

Modern immersion and some direct-to-chip liquid cooling systems, for example, use dielectric (non-conductive) non-flammable fluids, with standard cooling distribution units piping chilled water to a heat exchanger that removes heat from the immersion fluid. “That allows them to have the benefits of liquid cooling without having water right at the IT asset … so that if there is a leak, they’re not destroying millions of dollars’ worth of hardware,” Brown explains.
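
To make the water side of that arrangement concrete, here is a back-of-the-envelope sketch (a minimal Python example with assumed, illustrative numbers, not GRC or Uptime Institute figures) of how much chilled water a cooling distribution unit has to circulate for a given tank load, using the basic heat balance Q = m_dot * c_p * delta_T:

    # Rough sizing of the facility-water loop for one immersion tank.
    # All numbers are illustrative assumptions, not vendor specifications.
    WATER_SPECIFIC_HEAT = 4186.0   # J/(kg*K)
    WATER_DENSITY = 998.0          # kg/m^3 at roughly 20 C

    def chilled_water_flow_lpm(heat_load_kw: float, delta_t_k: float) -> float:
        """Litres per minute of chilled water needed to carry away heat_load_kw
        when the water warms by delta_t_k across the heat exchanger."""
        mass_flow_kg_s = (heat_load_kw * 1000.0) / (WATER_SPECIFIC_HEAT * delta_t_k)
        volume_flow_m3_s = mass_flow_kg_s / WATER_DENSITY
        return volume_flow_m3_s * 60.0 * 1000.0

    # Example: a 40 kW tank with an assumed 10 K rise on the water side
    print(round(chilled_water_flow_lpm(40.0, 10.0), 1), "L/min")   # roughly 57 L/min

Doubling the tank load simply doubles the required flow (or the temperature rise), which is part of why the same coolant loop and distribution units scale so easily.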

Impact on Facilities

In fact, he says, data centers that already use chilled water won’t get much more complex to manage from switching to liquid cooling. “They’re already accustomed to dealing with hydraulics and chillers, and worrying about maintaining the water treatment in the piping to keep the algae growth down – because if the water quality is low, it’s going to plug the tubes in the heat exchangers.” The water loop that cools immersion tanks can run under an existing raised floor without needing extra structural support.

If you don’t have that familiarity with running a water plant because you’re using direct expansion air conditioning units, Brown warns that liquid cooling will require a steeper learning curve and more changes to your data center operations. But that’s true of any chilled-water system.

Impact on IT

How disruptive liquid cooling will be for day-to-day IT work depends on the type of cooling technology you choose. Rear-door heat exchangers will require the fewest changes, says Dale Sartor, an engineer at the US Department of Energy’s Lawrence Berkeley National Laboratory who oversees the federal Center of Expertise for Data Centers. “There are plumbing connections on the rear door, but they’re flexible, so you can open and close the rear door pretty much the same way as you did before; you just have a thicker, heavier door, but otherwise servicing is pretty much the same.”

Similarly, for direct-to-chip cooling there’s a manifold in the back of the rack, with narrow tubes running into the server from the manifold and on to the components. Those tubes have dripless connectors, Sartor explains. “The technician pops the connector off the server, and it’s designed not to drip, so they can pull the server out as they would before.”

One problem to watch out for here is putting the connections back correctly. “You could easily reverse the tubes, so the supply water could be incorrectly connected to the return, or vice versa,” he warns. Some connectors are color-coded, but an industry group that includes Microsoft, Facebook, Google, and Intel is working on an open specification for liquid-cooled server racks that would introduce non-reversible plugs to avoid the issue. “The cold should only be able to connect up to the cold and the hot to the hot to eliminate that human error,” Sartor says.
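
The thinking behind those keyed, non-reversible plugs is the same error-proofing idea that software teams get from static types: make the wrong connection impossible to express. The Python fragment below is only an analogy, and none of the names in it come from the draft specification:

    # Analogy only: model supply and return lines as distinct types so that a
    # swapped connection is rejected before anything flows.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class SupplyPort:
        rack_id: str

    @dataclass(frozen=True)
    class ReturnPort:
        rack_id: str

    def connect(supply: SupplyPort, ret: ReturnPort) -> None:
        # A static type checker such as mypy rejects connect(ret, supply),
        # much as a keyed plug physically refuses the wrong socket.
        print(f"Connected supply and return on rack {supply.rack_id}")

    connect(SupplyPort("A01"), ReturnPort("A01"))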

Adjusting to Immersion

Immersion cooling does significantly change maintenance processes and the equipment needed, says Ted Barragy, manager of the advanced systems group at geosciences company CGG, which has been using GRC’s liquid immersion systems for more than five years.

If your server supplier hasn’t made all the changes before shipping, you may have to remove fans or reverse rails, so that motherboards hang down into the immersion fluid. For older systems with a BIOS that monitors the speed of cooling fans, cooling vendors like GRC offer fan emulator circuits, but newer BIOSes don’t require that.
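
There is also a software-side counterpart to this that the article does not cover, so treat it as an assumption: once fans are removed or emulated, monitoring should stop treating a 0 RPM reading on an immersion-cooled host as a failure. A minimal sketch against a hypothetical sensor feed:

    # Hypothetical sketch: suppress fan alarms for hosts tagged as immersion-cooled,
    # since their fans have been removed or replaced by emulator circuits.
    IMMERSION_HOSTS = {"node-017", "node-018"}   # assumed inventory tag

    def should_alert(host: str, sensor: str, value_rpm: float) -> bool:
        """Return True if a fan reading should raise an alarm."""
        if sensor.lower().startswith("fan") and host in IMMERSION_HOSTS:
            return False           # no fans fitted: 0 RPM is expected, not a fault
        return value_rpm < 500.0   # assumed minimum healthy RPM for air-cooled nodes

    print(should_alert("node-017", "FAN1", 0.0))   # False: immersion-cooled
    print(should_alert("node-042", "FAN1", 0.0))   # True: a genuinely failed fan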

Networking equipment isn’t always suitable for immersion, Barragy says. “Commodity fiber is plastic-based and clouds in the oil.” In practice, CGG has found that network devices don’t actually need liquid cooling, and its data center team now attaches them outside the tanks, freeing up space for more compute.

While CGG had some initial teething troubles with liquid cooling, Barragy is confident that the technology is reliable once you understand how to adjust your data center architecture and operations to take advantage of it. “The biggest barrier is psychological,” he says.

Gloves and Aprons

To replace components like memory in a server dipped in a tub of coolant, you have to remove the whole motherboard from the fluid – which is expensive enough that you don’t want to waste it and messy enough that you don’t want to spill it – and allow it to drain before you service it.

Barragy recommends wearing disposable nitrile gloves and remembers spilling oil down his legs the first time he worked with immersed components. Some technicians wear rubber aprons; others, who have more experience, do it in business-casual and don’t get a drop on them. “Once you’ve done it a few times, you learn the do’s and don’ts, like pulling the system out of the oil very slowly,” he says. “Pretty much anyone that does break-fix will master this.”

A bigger difference is that you’re going to be servicing IT equipment in a specialized area off the data center floor rather than working directly in the racks. You might have to replace an entire chassis and bring the replacement online before taking the original chassis away to replace or upgrade the components, Brown suggests.

“You want to do break-fix in batches,” Barragy agrees. His team will wait until they have four or five systems to work on before starting repairs, often leaving faulty servers offline for days, with failed jobs automatically requeued on other systems. To speed the process up, he recommends having a spare-parts kiosk.
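
A simple way to picture that batching policy (purely illustrative; the article does not describe CGG's actual tooling) is a queue that only opens a maintenance window once enough failed nodes have accumulated, or once the oldest failure has waited too long:

    # Illustrative batching policy for immersion break-fix work.
    from datetime import datetime, timedelta

    REPAIR_BATCH_SIZE = 4          # assumed threshold, per the "four or five systems"
    MAX_WAIT = timedelta(days=5)   # assumed cap so a lone failure is not stranded

    pending = []                   # list of (hostname, time_failed)

    def node_failed(hostname: str) -> None:
        # In practice the job scheduler would also requeue this node's work elsewhere.
        pending.append((hostname, datetime.now()))

    def should_open_maintenance_window() -> bool:
        if len(pending) >= REPAIR_BATCH_SIZE:
            return True
        return bool(pending) and datetime.now() - pending[0][1] > MAX_WAIT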

The Trade-Offs

There are relatively few suppliers of liquid-cooled systems to choose from, and until systems based on the upcoming open specification for liquid-cooled racks are on the market, you can’t mix and match vendors. “There is no interoperability,” Barragy warns. “There are ten or 15 suppliers of immersive cooling and fewer of direct-to-chip [systems], and they tend to partner up with a hardware supplier. That means the ability to just choose the hardware you need is very limited once you're locked into a cooling provider.”

On the other hand, you don’t have to redo complex airflow dynamics calculations or figure out how to spread load across more racks if you want to increase power density. You can just switch from a 20kW to a 40kW tank and keep the same coolant and coolant distribution units.

Returns get somewhat more complicated and are best done in batches. “If you’ve got some broken components, you're going to let those drip dry for a couple of days,” Barragy explains. “They'll have an oil film on them, but you’re not going to wind up with a puddle of mineral oil on your hands.” Vendors who design motherboards for use in immersion systems will be comfortable dealing with components coming back in this condition, and CGG is able to process systems that reach end of life through their normal recycling channels.

Creature Comfort

Liquid cooling may mean extra work, but it also makes for a more pleasant working environment, says Scott Tease, executive director of high-performance computing and artificial intelligence at Lenovo’s Data Center Group. Heat is becoming a bigger problem than power in many data centers, with faster processors and accelerators coming in ever-smaller packages.

That means you need more and more air to move inside each server. “The need for more air movement will drive up power consumption inside the server, and will also make air handlers and air conditioning work harder,” he says. Not only will it be hard to deliver enough cubic feet per minute of air, it will also be noisy.
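
A rough worked comparison shows why: for the same heat load and the same temperature rise, water carries far more heat per unit volume than air. The figures below are assumed textbook properties, not Lenovo data:

    # Volume flow needed to remove 1 kW with a 15 K temperature rise in the medium.
    # Illustrative properties at roughly room conditions.
    AIR_DENSITY = 1.2        # kg/m^3
    AIR_CP = 1005.0          # J/(kg*K)
    WATER_DENSITY = 998.0    # kg/m^3
    WATER_CP = 4186.0        # J/(kg*K)

    def volume_flow_m3_h(heat_w: float, delta_t_k: float, density: float, cp: float) -> float:
        """Volumetric flow required to carry heat_w at a rise of delta_t_k."""
        return heat_w / (density * cp * delta_t_k) * 3600.0

    print(round(volume_flow_m3_h(1000, 15, AIR_DENSITY, AIR_CP), 1), "m^3/h of air")       # ~199
    print(round(volume_flow_m3_h(1000, 15, WATER_DENSITY, WATER_CP), 3), "m^3/h of water")  # ~0.057

Moving that much air is what drives up fan power and produces the noise Tease describes.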

The break-fix and first-level-fix IT staff at CGG now prefer to work in the immersion-cooled data center rather than the company’s other, air-cooled facility, which is state-of-the-art. “Once you learn the techniques so you don’t get the oil all over you, it’s a nicer data center, because it’s quiet and you can talk to people,” Barragy said. “The other data center with the 40mm high-speed fans is awful. It’s in the 80dB range.”

Liquid-cooled data centers also have more comfortable air temperature for the staff working inside. “A lot of work in data centers is done from the rear of the cabinet, where the hot air is exhausted, and those hot aisles can get to significant temperatures that are not very comfortable for people to work in,” Brown says. “The cold aisles get down to pretty cold temperatures, and that's not comfortable either.”
