[Novalug] C code bug disrupts power to 50 Million People
Mark Smith
mark@winksmith.com
Mon Aug 31 05:32:32 EDT 2020
Interesting differences. The great Northeast blackout of 1965 is the topic of
S1E1 of Connections with James Burke
<https://www.youtube.com/watch?v=XetplHcM7aQ>. I was just watching it a
few minutes ago. A totally different take, as there was no (or not much)
software involved.
On 30/08/20 9:57 am, Roger W. Broseus via Novalug wrote:
> From the archives: C code bug disrupts power to 50 Million People,
> August 14, 2003
>
> A long read but interesting history.
> -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> Tracking the Blackout bug
> By Kevin Poulsen
> Published Thursday 8th April 2004 13:57 GMT
>
> A number of factors and failings came together to make the August 14th
> northeastern blackout the worst outage in North American history. One
> of them was buried in a massive piece of software compiled from four
> million lines of C code and running on an energy management computer
> in Ohio.
>
> To nobody's surprise, the final report on the blackout released by a
> US-Canadian task force Monday puts most of the blame for the outage on
> Ohio-based FirstEnergy Corp., faulting poor communications, inadequate
> training, and the company's failure to trim back trees encroaching on
> high-voltage power lines. But over a dozen of the task force's 46
> recommendations for preventing future outages across North America are
> focused squarely on cyberspace.
>
> That may have something to do with the timing of the blackout, which
> came three days after the relentless Blaster worm began wreaking havoc
> around the Internet - a coincidence that prompted speculation at the
> time that the worm, or the traffic it was generating in its efforts to
> spread, might have triggered or exacerbated the event. When US and
> Canadian authorities assembled their investigative teams, they
> included a computer security contingent tasked with looking
> specifically at any cybersecurity angle on the outage.
>
> In the end, it turned out that a computer snafu actually played a
> significant role in the cascading blackout - though it had nothing to
> do with viruses or cyber terrorists. A silent failure of the alarm
> function in FirstEnergy's computerized Energy Management System (EMS)
> is listed in the final report as one of the direct causes of a
> blackout that eventually cut off electricity to 50 million people in
> eight states and Canada.
>
> The alarm system failed at the worst possible time: in the early
> afternoon of August 14th, at the critical moment of the blackout's
> earliest events. The glitch kept FirstEnergy's control room operators
> in the dark while three of the company's high voltage lines sagged
> into unkempt trees and "tripped" off. Because the computerized alarm
> failed silently, control room operators didn't know they were relying
> on outdated information; trusting their systems, they even discounted
> phone calls warning them about worsening conditions on their grid,
> according to the blackout report.
>
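The silent failure described above is the kind of thing a heartbeat check is meant to catch: the monitored process records a timestamp whenever it makes progress, and an independent watcher raises a flag when that timestamp gets too old. Below is a minimal sketch in C, not anything taken from the XA/21. The struct, the function names, and the use of plain seconds are all assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical liveness record for one monitored process. */
struct heartbeat {
    long last_beat;    /* time of the most recent beat, in seconds */
    long max_silence;  /* longest tolerated gap between beats, in seconds */
};

/* Start tracking: treat "now" as the first beat. */
void heartbeat_init(struct heartbeat *hb, long now, long max_silence)
{
    hb->last_beat = now;
    hb->max_silence = max_silence;
}

/* Called by the monitored process each time it finishes a unit of work. */
void heartbeat_beat(struct heartbeat *hb, long now)
{
    hb->last_beat = now;
}

/* Called by an independent watcher: true if the process has been
 * silent for longer than the allowed gap. */
bool heartbeat_is_stale(const struct heartbeat *hb, long now)
{
    return (now - hb->last_beat) > hb->max_silence;
}
```

A process stuck in an infinite loop stops calling heartbeat_beat(), so the watcher sees the record go stale and can warn the operators that the displayed data may be out of date. In a real system the watcher would have to run outside the process it monitors, and the record would need to be shared safely between the two.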
> "Without a functioning alarm system, the [FirstEnergy] control area
> operators failed to detect the tripping of electrical facilities
> essential to maintain the security of their control area," reads the
> report. "Unaware of the loss of alarms and a limited EMS, they made no
> alternate arrangements to monitor the system."
>
> With the FirstEnergy control room blind to events, operators failed to
> take actions that could have prevented the blackout from cascading out
> of control.
>
> In the aftermath, investigators quickly zeroed in on the Ohio
> line-tripping as a root cause. But the reason for the alarm failure
> remained a mystery. Solving that mystery fell squarely on the
> corporate shoulders of GE Energy, makers of the XA/21 EMS in use at
> FirstEnergy's control center. According to interviews, a half-a-dozen
> workers at GE Energy began working feverishly with the utility and
> with energy consultants from KEMA Inc. to figure out what went wrong.
>
> The XA/21 isn't based on Windows, so it couldn't have been infected by
> Blaster, but the company didn't immediately rule out the possibility
> that the worm somehow played a role in the alarm failure. "In the
> initial stages, nobody really knew what the root cause was," says Mike
> Unum, manager of commercial solutions at GE Energy. "We spent a
> considerable amount of time analyzing that, trying to understand if it
> was a software problem, or if - like some had speculated - something
> different had happened."
>
> Sometimes working late into the night and the early hours of the
> morning, the team pored over the approximately one-million lines of
> code that comprise the XA/21's Alarm and Event Processing Routine,
> written in the C and C++ programming languages. Eventually they were
> able to reproduce the Ohio alarm crash in GE Energy's Florida
> laboratory, says Unum. "It took us a considerable amount of time to go
> in and reconstruct the events." In the end, they had to slow down the
> system, injecting deliberate delays in the code while feeding alarm
> inputs to the program. About eight weeks after the blackout, the bug
> was unmasked as a particularly subtle incarnation of a common
> programming error called a "race condition," triggered on August 14th
> by a perfect storm of events and alarm conditions on the equipment
> being monitored. The bug had a window of opportunity measured in
> milliseconds.
>
> "There was a couple of processes that were in contention for a common
> data structure, and through a software coding error in one of the
> application processes, they were both able to get write access to a
> data structure at the same time," says Unum. "And that corruption led
> to the alarm event application getting into an infinite loop and
> spinning."
>
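For anyone who has not run into this class of bug, here is a minimal sketch in C with POSIX threads of two writers sharing one data structure, with a mutex making each update exclusive. The article does not say how GE actually fixed the XA/21, so this is only an illustration of the general technique. The queue layout, names, and sizes are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 1024
#define EVENTS_PER_WRITER 400

/* Hypothetical shared alarm queue; not the XA/21 data structure. */
struct alarm_queue {
    pthread_mutex_t lock;
    size_t count;
    int events[QUEUE_CAP];
};

struct alarm_queue queue = { PTHREAD_MUTEX_INITIALIZER, 0, {0} };

/* Append one event. The lock makes the bounds check, the store, and
 * the count update one indivisible step. Returns 0 on success, -1 if
 * the queue is full. */
int queue_push(struct alarm_queue *q, int event)
{
    int rc = -1;
    pthread_mutex_lock(&q->lock);
    if (q->count < QUEUE_CAP) {
        q->events[q->count] = event;
        q->count++;
        rc = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return rc;
}

/* Thread body: push this writer's id EVENTS_PER_WRITER times. */
void *writer(void *arg)
{
    int id = *(int *)arg;
    for (int i = 0; i < EVENTS_PER_WRITER; i++)
        queue_push(&queue, id);
    return NULL;
}

/* Run two writers concurrently and return how many events were stored. */
size_t run_two_writers(void)
{
    pthread_t a, b;
    int id_a = 1, id_b = 2;

    queue.count = 0;
    pthread_create(&a, NULL, writer, &id_a);
    pthread_create(&b, NULL, writer, &id_b);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return queue.count;
}
```

With the lock in place, run_two_writers() stores every event from both threads. Without the lock/unlock pair, both threads can read the same count and write the same slot, so events get lost or the count is corrupted, and it only happens when the two threads interleave inside a very small window. That is why this kind of bug can survive millions of hours of operation. Build with -pthread.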
> Testing for Flaws
>
> "This fault was so deeply embedded, it took them weeks of poring
> through millions of lines of code and data to find it," FirstEnergy
> spokesman Ralph DiNicola said in February.
>
> After the alarm function crashed in FirstEnergy's control center,
> unprocessed events began to queue up, and within half-an-hour the EMS
> server hosting the alarm process folded under the burden, according to
> the blackout report. A backup server kicked in, but it also failed. By
> the time FirstEnergy operators figured out what was going on and
> restarted the necessary systems, hours had passed, and it was too late.
>
> This week's blackout report recommends that the U.S. and Canadian
> governments require all utilities using the XA/21 to check in with GE
> Energy to ensure "that appropriate actions have been taken to avert
> any recurrence of the malfunction." GE Energy says that's a moot
> point: though the flaw has not manifested itself elsewhere, last fall
> the company gave its customers a patch against the bug, along with
> installation instructions and a utility to repair any alarm log data
> corrupted by the glitch. According to Unum, the company sent the
> package to every XA/21 customer - more than 100 utilities around the
> world - and offered to help install it, "irrespective of their current
> support status," he says.
>
> The company did everything it could, says Unum. "We test exhaustively,
> we test with third parties, and we had in excess of three million
> online operational hours in which nothing had ever exercised that
> bug," says Unum. "I'm not sure that more testing would have revealed
> that. Unfortunately, that's kind of the nature of software... you may
> never find the problem. I don't think that's unique to control systems
> or any particular vendor software."
>
> Tom Kropp, manager of the enterprise information security program at
> the Electric Power Research Institute, an industry think tank, agrees.
> He says faulty software may always be a part of the electric grid's
> DNA. "Code is so complex, that there are always going to be some
> things that, no matter how hard you test, you're not going to catch,"
> he says. "If we see a system that's behaving abnormally well, we
> should probably be suspicious, rather than assuming that it's behaving
> abnormally well."
>
> But Peter Neumann, principal scientist at SRI International and
> moderator of the Risks Digest, says that the root problem is that
> makers of critical systems aren't availing themselves of a large body
> of academic research into how to make software bulletproof.
>
> "We keep having these things happen again and again, and we're not
> learning from our mistakes," says Neumann. "There are many possible
> problems that can cause massive failures, but they require a certain
> discipline in the development of software, and in its operation and
> administration, that we don't seem to find. ... If you go way back to
> the AT&T collapse of 1990, that was a little software flaw that
> propagated across the AT&T network. If you go ten years before that
> you have the ARPAnet collapse.
>
> "Whether it's a race condition, or a bug in a recovery process as in
> the AT&T case, there's this idea that you can build things that need
> to be totally robust without really thinking through the design and
> implementation and all of the things that might go wrong," Neumann says.
>
> Despite the absence of cyber terrorism in the blackout's genesis, the
> final report includes 13 recommendations focused squarely on
> protecting critical power-grid systems from intruders. The computer
> security prescriptions came after task force investigators discovered
> that the practices of some of the utility companies involved in the
> blackout created "potential opportunities for cyber system compromise"
> of EMS computers.
>
> "Indications of procedural and technical IT management vulnerabilities
> were observed in some facilities, such as unnecessary software
> services not denied by default, loosely controlled system access and
> perimeter control, poor patch and configuration management, and poor
> system security documentation," reads the report.
>
> Among the recommendations, the task force says cyber security
> standards established by the North American Electric Reliability
> Council, the industry group responsible for keeping electricity
> flowing, should be vigorously enforced. Joe Weiss, a control system
> cyber security consultant at KEMA, and one of the authors of the NERC
> standards, says that's a good start. "The NERC cyber security
> standards are very basic standards," says Weiss. "They provide a
> minimum basis for due diligence."
>
> But so far, it seems software failure has had more of an effect on the
> power grid than computer intrusion. Nevertheless, both Weiss and
> EPRI's Kropp believe that the final report is right to place more
> emphasis on cybersecurity than software reliability. "You don't try to
> look for something that's going to occur very, very, very
> infrequently," says Weiss. "Essentially, a blackout like this was
> something like that. There are other issues that are higher
> probability that need to be addressed."
>
> Source:
> http://www.theregister.co.uk/2004/04/08/blackout_bug_report/
> ------------------------------------------------------------------------------
>
--
Mark Smith
E: mark@winksmith.com