Process Alarm and Event Data in Non-Traditional Cloud-Based Applications

Note from Jim Cahill: Our team at Emerson Impact Partner, Spartan Controls, recently worked with a major oil & gas client to implement this use case of bringing DeltaV Alarms & Events data to the Amazon Web Services (AWS) cloud using Emerson’s Plantweb Optics platform, a holistic data management platform that unifies Operational Technology (OT) data into actionable information.

Background

The data available in OT, including smart sensors, Internet of Things (IoT) devices, mobile equipment, electrical and automation systems, and a host of other plant systems, has grown exponentially over the past decade. Although onboard computing, memory, and new capabilities in OT have grown and become available in operating plants, the ability to egress large data sets from operations into modern IT environments remains a bottleneck. Plant network load balancing, cybersecurity, single-port firewall traversals, and other networking, data transport, and storage challenges have constrained organizations’ ability to access these data sets at the scale desired. Even so, growing volumes of data from operating plants are finding their way to enterprise IT networks, including cloud platforms, and are being leveraged in a large number of use cases.

Time series process data like pressure, temperature, level and flow are the most in demand and are normally archived in what OT people would refer to as a site historian on local plant networks (L3). In some organizations this data may flow up to an ‘enterprise historian’ residing on corporate networks (L4) and more recently many companies have either moved their enterprise historian to the cloud or connected it to their enterprise data lake that gathers routine operating data from multiple facilities.

Other forms of time-series data generated by automation systems that are finding their way to enterprise IT systems are process alarms and event data. In plant operations, process alarms inform operators when process parameters have moved outside of normal operating limits and that action may be required to return the process or equipment to a normal operating state. “Low Boiler Feedwater Tank Level” or “High Furnace Temperature” are simple examples of routine alarm data and would be time-stamped, displayed, and archived by the automation system. In an industrial plant, alarms are typically viewed by plant operations staff on specially designed operations graphics like the ones below, which present the data in the context of the operating units.

DeltaV operating graphic with process alarm banner

Plant Operating Graphic with Process Alarm Banner

DeltaV alarm summary screen

Typical Alarm Summary Screen as Viewed by Operator

Events generated by the control system can be distinguished from process alarms in that they are activities recorded by the automation system that relate either to operator interactions with the process or to the automation system itself logging information generated by devices on the network. Launching a startup or shutdown sequence, manually starting or stopping plant equipment, changing the mode of a control loop, or bypassing a safety interlock are examples of operating events which automation systems track and record.

Information about the event, including who made the change and when the change occurred, is recorded in the alarm / event (A&E) chronicle. Records of control module installations, changes in a system device status, or network communication errors are examples of system-level events which are also typically logged by automation systems. This information is typically used by automation staff to troubleshoot hardware, software, or automation network problems.

For storage of these data, an independent repository typically referred to as the A&E chronicle is used. Depending on the automation system, and particularly in large plants, there may be independent A&E chronicles for different plant operating zones or areas. Process operators managing a specific plant area normally do not wish to see alarms and events generated in other areas of the plant that are not under their direct control. So, getting a complete picture of the A&E for a large site can require combining and synchronizing the data from multiple A&E chronicles.
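As a rough illustration of combining several area chronicles into one site-wide timeline, here is a minimal Python sketch. The `AERecord` fields, the area and module names, and the assumption that every chronicle exports records already sorted by a UTC timestamp are all hypothetical; a real export would need its own parsing and clock-synchronization handling.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AERecord:
    """One alarm/event row. The field names are hypothetical."""
    timestamp: datetime  # assumed to be timezone-aware UTC
    area: str
    module: str
    message: str


def merge_chronicles(*chronicles):
    """Merge per-area record lists, each already sorted by time, into one timeline."""
    return list(heapq.merge(*chronicles, key=lambda record: record.timestamp))


def utc(hour, minute):
    """Helper that builds a UTC timestamp on an arbitrary example date."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Two made-up area chronicles, each sorted by timestamp.
boiler_area = [
    AERecord(utc(8, 0), "BOILER", "LIC-101", "Low Boiler Feedwater Tank Level"),
    AERecord(utc(8, 7), "BOILER", "LIC-101", "Alarm acknowledged"),
]
furnace_area = [
    AERecord(utc(8, 3), "FURNACE", "TIC-201", "High Furnace Temperature"),
]

site_timeline = merge_chronicles(boiler_area, furnace_area)
for record in site_timeline:
    print(record.timestamp.isoformat(), record.area, record.message)
```

Because `heapq.merge` only interleaves inputs that are already sorted, this approach depends on the chronicles sharing a consistent time base; if controller clocks drift between areas, the timestamps would need to be corrected before merging.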

DeltaV Alarm & Event Chronicle

Example Data Captured by Alarm and Event Chronicle

In industrial plants, A&E data is used to help isolate the root cause of process upsets, log operator actions on the process, and provide diagnostic information during system maintenance and troubleshooting. In many cases, third-party alarm management applications are used to benchmark facilities against industry alarm standards such as ISA 18.2. These applications help ensure the alarm system is engineered appropriately, meets industry standards, and does not overwhelm operations staff with too much information. Alarm floods are a common challenge during major process transitions or plant trips, and providing actionable information to operations under these conditions is a critical role of the alarm system in industrial plants.

Use Cases for Industrial Alarm and Event Data Outside of the Operation

Traditionally, alarm and event data gathered from automation systems did not leave the boundaries of the industrial operation. More recently, however, within the digital transformation and Industry 4.0 programs of industrial operators, organizations have begun to access and make use of these data sets in new and innovative use cases outside of the operation itself. The demand for alarm & event data, along with the process data and available metadata from industrial automation systems, is growing rapidly.

Operational Risk Management

For any company, operational risk is defined as the risk of loss resulting from failed internal processes, people, systems, or external events that disrupt the business operation. In the case of industrial companies in the process or manufacturing industries, significant operational risks exist within the production facilities themselves. Hazardous chemicals, process conditions, and dangerous plant equipment are factors which increase the operational risk to human health & safety and to the surrounding natural environment (land, air, water). Process and equipment reliability issues, extreme weather events, and accidents or errors by onsite personnel can disrupt plant production.

Industrial organizations put tremendous thought and effort into the risk their production facilities pose to the overall business. Most have mature health & safety programs and standards, which are designed to mitigate the operational risks described above. Technical staff use Hazard and Operability Studies (HAZOPs) and Layers of Protection Analysis (LOPA), for example, to systematically identify possible hazards and accidents in operating plants, along with their potential frequency and consequences. These techniques are used to investigate the adequacy of existing protection or the addition of new layers of protection from known hazards.

In general, it is from a LOPA (or alternative methods) that industrial plants determine the Safety Integrity Levels (SIL) required in their operating units to help specify or target the desired level of risk reduction. The SIL is a measure of performance of a Safety Instrumented Function (SIF), the set of equipment used to reduce the risk associated with a specific hazard. International functional safety standards (IEC 61508 & IEC 61511) govern both the equipment standards and the work processes and methods required to meet a required SIL.

Today many organizations in these industries want to monitor the operational risk generated by their production facilities on a corporate scale in near-real-time. “What are the current risk levels and how do we mitigate them at each of our production operations?” is the kind of question corporate safety officers and risk management leaders are asking in industrial companies.

Keep in mind that for process facilities to run efficiently and safely, people and the plant automation systems must work in unison. Under normal operating conditions, operations staff must ensure plant processes are kept within safe operating limits using plant automation systems which are managed by engineering and technical teams. During critical abnormal situations like an Emergency Shut Down (ESD), the automation systems (basic process control & safety systems) are designed to move the operation to a safe operating state autonomously.

Answers to many of the questions about risk levels within the industrial plant reside in the data generated in these operations by these automation systems. Data that resides in the Alarm and Event chronicles and the plant data historians can be key to assessing some of these risk factors. As described, the A&E chronicle keeps a time-stamped record of every process and safety alarm as well as each action taken by operations staff at the operator console, while the process historian keeps a historical record of real-time operating conditions in the plant: pressure, temperature, level, flow, etc.

Novel Use Cases for Industrial A&E data Outside the Operation

Automation Bypass and Override Reporting

Instrument bypass and override logic is commonly configured in an automation or safety system to allow maintenance activities like instrument calibration, proof testing, or repair to take place without risk of tripping the plant. Instrument bypasses and control logic overrides (permissive, interlock) effectively ‘jumper’ or bypass control logic that would be active during normal operations. For example, a low tank level alarm may be interlocked to a downstream control valve such that the system automatically closes the valve to protect downstream equipment like a pump. During plant commissioning or as part of routine maintenance, however, there may be times when the tank has been completely emptied of liquid. In this state, technicians may wish to open the valve to verify wiring or control logic or to conduct valve maintenance, which requires the interlock that shuts the valve to be bypassed for the work to proceed. In the plant, rigorous work processes are normally followed to conduct the work and ensure bypasses are removed when it is complete. This is critical to avoid these layers of protection being inadvertently left defeated, creating unsafe conditions once normal operation resumes.

Due to the potential impact of these bypasses in the industrial operation, some corporate safety and risk management programs have prioritized tracking and reporting on instrument and logic bypasses in their automation systems. Officers of these companies wish to know:

  • What devices and logic are bypassed currently in each operation?
  • How many bypasses are active across their fleet of operating assets?
  • How long have they been active? When will the bypass be removed?

Based on these numbers and the criticality of each bypass, risk/safety KPIs are generated and can be used to drive work processes which ensure bypasses are removed in a timely fashion, reducing the overall risk to the organization. A couple of example reports are shown below, documenting and organizing data collected from the A&E chronicles of automation systems.

Alarm Bypass List

Current Bypass List – By Area/Module, Type, User

Alarm bypass count report

Bypass Count Reports – by Area/Module, Type, User
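The bypass questions listed above (what is bypassed now, how many are active, and for how long) could be answered from chronicle events along the lines of the following Python sketch. The event tuple layout, the `BYPASS_ON` / `BYPASS_OFF` action labels, the module tags, and the 24-hour limit are illustrative assumptions, not the format of any particular A&E chronicle.

```python
from collections import Counter
from datetime import datetime, timedelta


def active_bypasses(events, now):
    """Return {(area, module): age} for bypasses that were set but not yet removed.

    `events` is an iterable of (timestamp, area, module, action) tuples sorted
    by time, where action is "BYPASS_ON" or "BYPASS_OFF" (hypothetical labels).
    """
    started = {}
    for timestamp, area, module, action in events:
        key = (area, module)
        if action == "BYPASS_ON":
            started.setdefault(key, timestamp)  # keep the earliest unmatched ON
        elif action == "BYPASS_OFF":
            started.pop(key, None)
    return {key: now - since for key, since in started.items()}


def count_by_area(active):
    """Count active bypasses per plant area."""
    return Counter(area for area, _module in active)


def overdue(active, limit):
    """List bypasses that have been active longer than `limit`."""
    return sorted(key for key, age in active.items() if age > limit)


# Made-up events for two plant areas.
events = [
    (datetime(2024, 1, 1, 6, 0), "TANK_FARM", "LSL-101", "BYPASS_ON"),
    (datetime(2024, 1, 1, 9, 0), "TANK_FARM", "XV-102", "BYPASS_ON"),
    (datetime(2024, 1, 1, 11, 0), "TANK_FARM", "XV-102", "BYPASS_OFF"),
    (datetime(2024, 1, 2, 7, 0), "FURNACE", "TSH-201", "BYPASS_ON"),
]
now = datetime(2024, 1, 2, 12, 0)

active = active_bypasses(events, now)
print(dict(count_by_area(active)))
print(overdue(active, limit=timedelta(hours=24)))
```

The planned removal date of a bypass is not something event data alone can supply; in practice it would have to come from the work-management process that authorized the bypass.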

Process Alarm Studies, Facility Benchmarking

At the plant level, operations and automation engineering teams labor to manage the alarm data generated by automation systems. Industry standards (ISA 18.2) and best practices are mature and provide guidance on acceptable numbers of alarms in industrial operations so that alarm systems remain usable by operations teams. Alarm systems that generate excessive numbers of alarms in routine operating scenarios are not useful and will eventually be ignored by operations teams if not addressed. Technology like ‘state-based’ alarming is becoming more prevalent in industrial operations and is used to mask or shelve alarms that are not relevant to specific operating states. Starting up and shutting down, for example, are perfectly valid operating states that are notorious for generating volumes of alarms that good design can avoid.

Outside of the operation, we are seeing process alarm data being used to document alarm volumes, benchmark facilities against industry standards, and compare one plant to the next across a fleet. In well-designed and well-implemented alarm systems, high alarm volumes over extended periods indicate that processes are routinely running outside of their designed operating envelopes. This is undesirable and, left unchecked, can result in any number of negative effects, including lower plant throughput, off-spec product, and decreased energy efficiency.
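One way to document alarm volumes for this kind of benchmarking is to bucket alarms into fixed 10-minute windows per operator position, as in the Python sketch below. The function name and data layout are hypothetical, and the flood threshold of more than 10 alarms in a 10-minute window is a commonly cited rule of thumb that is treated here as an assumption; the actual targets should be taken from ISA 18.2 and the site’s alarm philosophy.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)


def alarm_rate_kpis(alarm_times, start, end, flood_threshold=10):
    """Summarize alarm load for one operator position between start and end.

    `flood_threshold` is an assumed parameter, not a value taken from the post.
    """
    n_windows = int((end - start) / WINDOW)
    if n_windows <= 0:
        raise ValueError("reporting period must cover at least one 10-minute window")

    per_window = Counter()
    for t in alarm_times:
        index = int((t - start) / WINDOW)
        if t >= start and index < n_windows:  # ignore alarms outside the period
            per_window[index] += 1

    total = sum(per_window.values())
    flooded = sum(1 for count in per_window.values() if count > flood_threshold)
    return {
        "avg_per_10_min": total / n_windows,
        "peak_per_10_min": max(per_window.values(), default=0),
        "pct_windows_in_flood": 100.0 * flooded / n_windows,
    }


# Made-up hour of data: a burst of 12 alarms, then 3 scattered alarms.
start = datetime(2024, 1, 1, 0, 0)
end = start + timedelta(hours=1)
alarm_times = [start + timedelta(seconds=30 * i) for i in range(12)]
alarm_times += [start + timedelta(minutes=m) for m in (25, 26, 45)]

kpis = alarm_rate_kpis(alarm_times, start, end)
print(kpis)
```

Running the same calculation per plant and per operator position gives figures that can be compared across a fleet, provided every site applies the same window size and threshold.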

Alarm data, effectively fused and visualized with process data, can be used to inform remote support teams on overall plant health. It can also be used to investigate where additional automation could be applied to further optimize the operation.

In addition, there is a safety and risk aspect to poorly managed industrial alarm systems. Alarm flooding conditions caused by spurious events in the plant are cause for concern in the operation as they effectively mask the root problem and make troubleshooting operating problems challenging. Missing critical alarms during an anomalous plant event can lead to an incorrect response by operations and, for example, a plant trip. Human factors, including operator stress and fatigue, are well-known and well-studied concerns for operations teams trying to manage processes and equipment for long durations away from normal steady-state conditions.

Below are examples of process alarm dashboards that are being used by engineering and automation teams away from the operation to investigate the efficacy of their plant alarm systems and benchmark each operation against industry standards.

This information can be summarized and used in KPI dashboards for each operation, and it can help remote teams identify opportunities to optimize the operation. Evaluations of current operating procedures and automation strategies are other examples of how the data is used by remote workers.

Alarm distribution report

Alarm Distribution Report – by Plant Area, Time, Priority, Category

Alarm/Event Bad Actor Report

Alarm / Event Bad Actor Report – Bad Actors by Plant Area and Control Module

Alarm key performance indicator (KPI) report

ISA 18.2 Alarm KPI Report
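A bad actor ranking like the one in the report above can be approximated by counting alarms per plant area and control module and expressing each count as a share of the total. The following Python sketch uses made-up area and module names and a hypothetical `(area, module)` pair per alarm occurrence.

```python
from collections import Counter


def bad_actors(alarms, top_n=10):
    """Rank (area, module) pairs by alarm count and by share of all alarms.

    `alarms` is an iterable of (area, module) pairs, one per alarm occurrence.
    """
    counts = Counter((area, module) for area, module in alarms)
    total = sum(counts.values())
    return [
        (area, module, count, 100.0 * count / total)
        for (area, module), count in counts.most_common(top_n)
    ]


# Made-up alarm occurrences.
alarms = (
    [("BOILER", "LIC-101")] * 6
    + [("FURNACE", "TIC-201")] * 3
    + [("BOILER", "FIC-105")] * 1
)

ranking = bad_actors(alarms, top_n=2)
for area, module, count, share in ranking:
    print(f"{area}/{module}: {count} alarms ({share:.0f}% of total)")
```

A short list like this is often enough to show where tuning, maintenance, or alarm rationalization effort would reduce the alarm load the most.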

Operator Training & Automation Studies

Operating a complex industrial operation is fraught with challenges, and organizations work hard to continuously improve their production operations. Operating leaders want to know if they have the appropriate number of skilled plant and field operators in the facilities. Training groups want to know if plant operators have the training and skills they need to manage the operation confidently and safely. Automation people are often asked to investigate why individual operating teams respond differently during similar plant upsets or disturbances to determine if control strategy changes or additional automation is required.

How can one group of operators start up the plant in a few hours when it may take another a full day? These are questions asked by stakeholders across industrial organizations, and again alarm and event data chronicled in the operation can provide important clues. How many interactions (touches) a board operator has with the process is a good measure of how much trouble he/she is having with it. The number of manual equipment starts and stops, along with mode and other changes board operators make to plant control loops, is a strong indication that there are problems with either the plant automation or the operations team itself. Control strategies could be poorly designed for current operating conditions, loops could be incorrectly tuned, there may be issues with the associated sensor or end device (valve), or the operations team may require additional training. A well-automated and well-tuned plant normally requires very few manual interventions by operators.

Again, the alarm/event chronicle contains some of the data required to troubleshoot these issues. Each operator action on the control console is timestamped and logged as an event by this repository and can be used to investigate operator interactions with the process via the automation systems. Combining these logged events with process alarm conditions and time series process data can help investigate questions and operating challenges like the ones posed above.
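As an illustration of how logged operator actions might be turned into a simple comparison between operating teams, the Python sketch below counts touches per hour and breaks them down by action type. The crew names, action labels, and shift lengths are hypothetical; real event records would first need to be filtered down to operator-initiated actions.

```python
from collections import Counter


def touches_per_hour(events, hours_by_crew):
    """Average operator touches per hour for each crew.

    `events` is an iterable of (crew, action) pairs for operator-initiated
    actions; `hours_by_crew` maps each crew to its hours at the console.
    """
    totals = Counter(crew for crew, _action in events)
    return {crew: totals[crew] / hours for crew, hours in hours_by_crew.items()}


def action_mix(events, crew):
    """Break one crew's touches down by action type."""
    return Counter(action for c, action in events if c == crew)


# Made-up operator actions logged during two 12-hour shifts.
events = (
    [("CREW_A", "MODE_CHANGE")] * 14
    + [("CREW_A", "MANUAL_START")] * 6
    + [("CREW_A", "SETPOINT_CHANGE")] * 4
    + [("CREW_B", "SETPOINT_CHANGE")] * 5
    + [("CREW_B", "MODE_CHANGE")] * 1
)
hours_by_crew = {"CREW_A": 12, "CREW_B": 12}

rates = touches_per_hour(events, hours_by_crew)
print(rates)
print(action_mix(events, "CREW_A").most_common(1))
```

A large gap between crews, or a high share of mode changes on particular loops, is only a starting point for investigation; it does not by itself say whether the cause is tuning, instrumentation, control strategy, or training.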

Summary

This blog post has examined some of the emerging use cases, outside the operating facilities themselves, for process alarm and event data routinely chronicled by automation systems. As described, this type of data, along with many others, is now finding its way to enterprise data systems and being exploited in non-traditional use cases outside the plant gates. Enterprise operational risk management studies, facility benchmarking against industry standards, and operating team and automation investigations are good examples of how industrial organizations are looking to leverage data that historically never left the production operation itself.

The post Process Alarm and Event Data in Non-Traditional Cloud-Based Applications appeared first on the Emerson Automation Experts blog.