[Novalug] C code bug disrupts power to 50 Million People

Mark Smith mark@winksmith.com
Mon Aug 31 05:32:32 EDT 2020


Interesting differences. The great NE blackout of 1965 is the topic of 
S1E1 of Connections with James Burke 
(https://www.youtube.com/watch?v=XetplHcM7aQ). I was just watching it a 
few minutes ago. A totally different take, as there was no (or not 
much) software involved.


On 30/08/20 9:57 am, Roger W. Broseus via Novalug wrote:
> From the archives: C code bug disrupts power to 50 Million People, 
> August 14, 2003
>
> A long read but interesting history.
> -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> Tracking the Blackout bug
> By Kevin Poulsen
> Published Thursday 8th April 2004 13:57 GMT
>
> A number of factors and failings came together to make the August 14th 
> northeastern blackout the worst outage in North American history. One 
> of them was buried in a massive piece of software compiled from four 
> million lines of C code and running on an energy management computer 
> in Ohio.
>
> To nobody's surprise, the final report on the blackout released by a 
> US-Canadian task force Monday puts most of the blame for the outage on 
> Ohio-based FirstEnergy Corp., faulting poor communications, inadequate 
> training, and the company's failure to trim back trees encroaching on 
> high-voltage power lines. But over a dozen of the task force's 46 
> recommendations for preventing future outages across North America are 
> focused squarely on cyberspace.
>
> That may have something to do with the timing of the blackout, which 
> came three days after the relentless Blaster worm began wreaking havoc 
> around the Internet - a coincidence that prompted speculation at the 
> time that the worm, or the traffic it was generating in its efforts to 
> spread, might have triggered or exacerbated the event. When US and 
> Canadian authorities assembled their investigative teams, they 
> included a computer security contingent tasked with looking 
> specifically at any cybersecurity angle on the outage.
>
> In the end, it turned out that a computer snafu actually played a 
> significant role in the cascading blackout - though it had nothing to 
> do with viruses or cyber terrorists. A silent failure of the alarm 
> function in FirstEnergy's computerized Energy Management System (EMS) 
> is listed in the final report as one of the direct causes of a 
> blackout that eventually cut off electricity to 50 million people in 
> eight states and Canada.
>
> The alarm system failed at the worst possible time: in the early 
> afternoon of August 14th, at the critical moment of the blackout's 
> earliest events. The glitch kept FirstEnergy's control room operators 
> in the dark while three of the company's high voltage lines sagged 
> into unkempt trees and "tripped" off. Because the computerized alarm 
> failed silently, control room operators didn't know they were relying 
> on outdated information; trusting their systems, they even discounted 
> phone calls warning them about worsening conditions on their grid, 
> according to the blackout report.
>
> "Without a functioning alarm system, the [FirstEnergy] control area 
> operators failed to detect the tripping of electrical facilities 
> essential to maintain the security of their control area," reads the 
> report. "Unaware of the loss of alarms and a limited EMS, they made no 
> alternate arrangements to monitor the system."
>
> With the FirstEnergy control room blind to events, operators failed to 
> take actions that could have prevented the blackout from cascading out 
> of control.
>
> In the aftermath, investigators quickly zeroed in on the Ohio 
> line-tripping as a root cause. But the reason for the alarm failure 
> remained a mystery. Solving that mystery fell squarely on the 
> corporate shoulders of GE Energy, makers of the XA/21 EMS in use at 
> FirstEnergy's control center. According to interviews, half a dozen 
> workers at GE Energy began working feverishly with the utility and 
> with energy consultants from KEMA Inc. to figure out what went wrong.
>
> The XA/21 isn't based on Windows, so it couldn't have been infected by 
> Blaster, but the company didn't immediately rule out the possibility 
> that the worm somehow played a role in the alarm failure. "In the 
> initial stages, nobody really knew what the root cause was," says Mike 
> Unum, manager of commercial solutions at GE Energy. "We spent a 
> considerable amount of time analyzing that, trying to understand if it 
> was a software problem, or if - like some had speculated - something 
> different had happened."
>
> Sometimes working late into the night and the early hours of the 
> morning, the team pored over the approximately one million lines of 
> code that comprise the XA/21's Alarm and Event Processing Routine, 
> written in the C and C++ programming languages. Eventually they were 
> able to reproduce the Ohio alarm crash in GE Energy's Florida 
> laboratory, says Unum. "It took us a considerable amount of time to go 
> in and reconstruct the events." In the end, they had to slow down the 
> system, injecting deliberate delays in the code while feeding alarm 
> inputs to the program. About eight weeks after the blackout, the bug 
> was unmasked as a particularly subtle incarnation of a common 
> programming error called a "race condition," triggered on August 14th 
> by a perfect storm of events and alarm conditions on the equipment 
> being monitored. The bug had a window of opportunity measured in 
> milliseconds.
>
> "There was a couple of processes that were in contention for a common 
> data structure, and through a software coding error in one of the 
> application processes, they were both able to get write access to a 
> data structure at the same time," says Unum. "And that corruption led 
> to the alarm event application getting into an infinite loop and 
> spinning."
>
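For anyone who wants to see what that kind of bug looks like, here is a 
minimal, hypothetical C sketch. It is not GE's code: the report describes 
separate processes sharing a data structure, while this sketch uses 
threads to stay self-contained, and the structure, names, and loop counts 
are all invented. Two writers update a shared structure and one of them 
omits the lock, so their writes can interleave and corrupt it. The window 
is tiny, which is why such a bug can hide in production for years; 
injecting a deliberate delay, as GE's lab did, makes it easy to reproduce.

    /* race_sketch.c - illustrative only. Build: cc -pthread race_sketch.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>   /* usleep(), if you widen the race window below */

    struct alarm_log {            /* invented stand-in for the shared data */
        int count;                /* number of entries written so far      */
        int last_id;              /* id of the most recent entry           */
    };

    static struct alarm_log log_state;
    static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Correct writer: holds the lock, so count and last_id stay in step. */
    static void *good_writer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 200000; i++) {
            pthread_mutex_lock(&log_lock);
            log_state.count++;
            log_state.last_id = log_state.count;
            pthread_mutex_unlock(&log_lock);
        }
        return NULL;
    }

    /* Buggy writer: the coding error. Same updates, no lock, so both
     * threads can write the structure at the same instant; increments
     * are lost and the two fields drift apart, i.e. the data is corrupt.
     * In the XA/21 the analogous corruption sent the alarm routine into
     * an infinite loop. Uncommenting usleep() widens the window, the
     * same deliberate-delay trick GE used to reproduce the crash. */
    static void *buggy_writer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 200000; i++) {
            log_state.count++;
            /* usleep(1); */
            log_state.last_id = log_state.count;
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, good_writer, NULL);
        pthread_create(&b, NULL, buggy_writer, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Without the race this prints count=400000 every time; lost
         * updates show up as a smaller, run-to-run varying number. */
        printf("count=%d last_id=%d\n", log_state.count, log_state.last_id);
        return 0;
    }
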
> Testing for Flaws
>
> "This fault was so deeply embedded, it took them weeks of poring 
> through millions of lines of code and data to find it," FirstEnergy 
> spokesman Ralph DiNicola said in February.
>
> After the alarm function crashed in FirstEnergy's control center, 
> unprocessed events began to queue up, and within half an hour the EMS 
> server hosting the alarm process folded under the burden, according to 
> the blackout report. A backup server kicked in, but it also failed. By 
> the time FirstEnergy operators figured out what was going on and 
> restarted the necessary systems, hours had passed, and it was too late.
>
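As a rough sense of scale, here is a toy back-of-the-envelope sketch in 
C. The arrival rate and event size are made-up numbers, not 
FirstEnergy's; the point is only that with the consumer dead the backlog 
grows linearly and without bound, and a backup server that inherits the 
same dead alarm process inherits the same fate.

    /* backlog_sketch.c - toy numbers, purely illustrative. */
    #include <stdio.h>

    int main(void)
    {
        const double events_per_sec  = 50.0;   /* assumed arrival rate      */
        const double bytes_per_event = 512.0;  /* assumed size of one event */

        /* With nothing draining the queue, depth is just rate * time. */
        for (int minute = 0; minute <= 30; minute += 5) {
            double backlog = events_per_sec * 60.0 * minute;
            printf("t=%2d min  backlog=%7.0f events  ~%5.1f MB queued\n",
                   minute, backlog, backlog * bytes_per_event / 1e6);
        }
        return 0;
    }
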
> This week's blackout report recommends that the U.S. and Canadian 
> governments require all utilities using the XA/21 to check in with GE 
> Energy to ensure "that appropriate actions have been taken to avert 
> any recurrence of the malfunction." GE Energy says that's a moot 
> point: though the flaw has not manifested itself elsewhere, last fall 
> the company gave its customers a patch against the bug, along with 
> installation instructions and a utility to repair any alarm log data 
> corrupted by the glitch. According to Unum, the company sent the 
> package to every XA/21 customer - more than 100 utilities around the 
> world - and offered to help install it, "irrespective of their current 
> support status," he says.
>
> The company did everything it could, says Unum. "We test exhaustively, 
> we test with third parties, and we had in excess of three million 
> online operational hours in which nothing had ever exercised that 
> bug," says Unum. "I'm not sure that more testing would have revealed 
> that. Unfortunately, that's kind of the nature of software... you may 
> never find the problem. I don't think that's unique to control systems 
> or any particular vendor software."
>
> Tom Kropp, manager of the enterprise information security program at 
> the Electric Power Research Institute, an industry think tank, agrees. 
> He says faulty software may always be a part of the electric grid's 
> DNA. "Code is so complex, that there are always going to be some 
> things that, no matter how hard you test, you're not going to catch," 
> he says. "If we see a system that's behaving abnormally well, we 
> should probably be suspicious, rather than assuming that it's behaving 
> well."
>
> But Peter Neumann, principal scientist at SRI International and 
> moderator of the Risks Digest, says that the root problem is that 
> makers of critical systems aren't availing themselves of a large body 
> of academic research into how to make software bulletproof.
>
> "We keep having these things happen again and again, and we're not 
> learning from our mistakes," says Neumann. "There are many possible 
> problems that can cause massive failures, but they require a certain 
> discipline in the development of software, and in its operation and 
> administration, that we don't seem to find. ... If you go way back to 
> the AT&T collapse of 1990, that was a little software flaw that 
> propagated across the AT&T network. If you go ten years before that 
> you have the ARPAnet collapse.
>
> "Whether it's a race condition, or a bug in a recovery process as in 
> the AT&T case, there's this idea that you can build things that need 
> to be totally robust without really thinking through the design and 
> implementation and all of the things that might go wrong," Neumann says.
>
> Despite the absence of cyber terrorism in the blackout's genesis, the 
> final report includes 13 recommendations focused squarely on 
> protecting critical power-grid systems from intruders. The computer 
> security prescriptions came after task force investigators discovered 
> that the practices of some of the utility companies involved in the 
> blackout created "potential opportunities for cyber system compromise" 
> of EMS computers.
>
> "Indications of procedural and technical IT management vulnerabilities 
> were observed in some facilities, such as unnecessary software 
> services not denied by default, loosely controlled system access and 
> perimeter control, poor patch and configuration management, and poor 
> system security documentation," reads the report.
>
> Among the recommendations, the task force says cyber security 
> standards established by the North American Electric Reliability 
> Council, the industry group responsible for keeping electricity 
> flowing, should be vigorously enforced. Joe Weiss, a control system 
> cyber security consultant at KEMA, and one of the authors of the NERC 
> standards, says that's a good start. "The NERC cyber security 
> standards are very basic standards," says Weiss. "They provide a 
> minimum basis for due diligence."
>
> But so far, it seems software failure has had more of an effect on the 
> power grid than computer intrusion. Nevertheless, both Weiss and 
> EPRI's Kropp believe that the final report is right to place more 
> emphasis on cybersecurity than software reliability. "You don't try to 
> look for something that's going to occur very, very, very 
> infrequently," says Weiss. "Essentially, a blackout like this was 
> something like that. There are other issues that are higher 
> probability that need to be addressed."
>
> Source:
> http://www.theregister.co.uk/2004/04/08/blackout_bug_report/
> ------------------------------------------------------------------------------ 
>
-- 
Mark Smith
E: mark@winksmith.com


