Event Chronicle Filling Up with "I/O Input Failure"

With 300 MB event chronicles turning over every 10-14 days, I had a look and I'm seeing several hundred (perhaps 200-500 or more) instances of "IO Input Failure" followed by "Error Cleared". They are all "Events", all "4-INFO", and are not spawning any "Bad Quality" alarms in DeltaV Operate. Various blocks are indicated as experiencing the failure - AI blocks, ALM blocks, ISEL, occasionally a PID. Some are associated with bus IO (Fieldbus and Modbus), but not all . . . 

Anyone else seeing this? I have not yet deployed the latest 14.3.1 controller hotfix, but I saw nothing in the issues addressed that looked like mine. Most of these modules have been around for a long time - some of them since at least v4 (ca. 2000). This is the first time I have seen this volume of I/O input failures on an ongoing basis.

12 Replies

  • Yes. We are running version 13.3.1. In the last 8 hours, we have had 918 such messages, both I/O Input Failure and I/O Output Failure.
  • Could someone have changed the execution rate of the module by any chance?

  • John,
    I've been seeing a lot of chattering from PCSD and PCSD-derived module types lately, but as you point out, your configuration hasn't changed in quite a long time. How about your hardware?

    I think the best thing to do is quantify a list of the worst bad actors (some simple SQL scripts can do the trick), but most importantly, try to find the starting point for the issue in your long-term history. It's likely that the worst offenders are orders of magnitude above the baseline noise of I/O errors (which may have existed but was less noticeable), and correlating those worst bad actors will lead to an insight on the source issue.
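
    A counting query along these lines is usually where I start. This is only a sketch using placeholder names (a table dbo.Journal with Event_Time, Module and Description columns) - substitute whatever your EJournal chronicle schema actually exposes:

        -- Rank modules by "IO Input Failure" count over the last 14 days
        -- (placeholder table/column names - adjust to your EJournal schema)
        SELECT TOP 20
               Module,
               COUNT(*) AS FailureCount
        FROM   dbo.Journal
        WHERE  Description LIKE '%IO Input Failure%'
          AND  Event_Time >= DATEADD(day, -14, GETDATE())
        GROUP  BY Module
        ORDER  BY FailureCount DESC;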

    Did the problem start immediately after the upgrade to 14.3.1?
  • In reply to Youssef.El-Bahtimy:

    Note that most of the time system owners are focused only on alarm quantification as part of their situational awareness plan, but as you point out, chattering I/O and other non-annunciating events should still be evaluated to uncover underlying performance issues.

    Typically, my clients forward the A&E chronicles to the Batch Historian so the short-term chronicle SQL is aggregated into the longer-term batch history SQL, or even into PI and Aspen IP21 historians for the longest-term querying.

    I hit those systems with a counting query on a regular basis to get faster identification of these sorts of noise floods that aren't annunciating (until they fill up the storage).
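
    The counting query is basically a daily bucket count, so the start of a flood stands out. Same caveat as above - dbo.Journal, Event_Time and Description are placeholder names, and the long-term batch history or PI/IP21 copies will have their own schemas:

        -- Events per day, so a step change in the noise floor is obvious
        SELECT CAST(Event_Time AS date) AS EventDay,
               COUNT(*)                 AS EventCount
        FROM   dbo.Journal
        WHERE  Description LIKE '%IO Input Failure%'
           OR  Description LIKE '%IO Output Failure%'
        GROUP  BY CAST(Event_Time AS date)
        ORDER  BY EventDay;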

  • In reply to Lun.Raznik:

    Just to add, an I/O Input Failure indicates that the module has executed twice (I can't remember the exact number of executions) without the I/O database seeing an update.

    Are you using CHARMs by any chance?
  • In reply to Lun.Raznik:

    Our I/O is almost all CHARMs.
  • In reply to Mark Bendele:

    Well this is a huge clue.

    I am assuming that at one point your system was running DeltaV 11.3.x? The default IO scan rate for CHARMs was fixed (if I remember this correctly) at 50 ms in 11.3.x. This was changed to 250 ms in v13.3.1 (or maybe even before that release). So, if you happen to have modules that are running at, say, 100 ms, then you will be getting IO Failures in newer versions.

    Can you please check how CIOCs are configured?

  • In reply to Lun.Raznik:

    All but two CIOCs have a scan rate of 100 ms. Two are at 50 ms, because they have I/O for control modules running at 200 ms. The great majority of our control modules run at 1 sec. This is not an upgraded system. We started off with 13.3.1 a year ago.
  • In reply to Mark Bendele:

    Bummer :).

    Maybe we can try this: let us focus on one module, and you have to pick it (a sample chronicle query for step 1 is sketched after the list):

    1. Using PHV, identify the module that reports IO Input Failure. Open this module.
    2. Using PHV again, on that same row, identify which function block is reporting the failure.
    3. Once the function block is identified, identify the actual IO assigned to the block. 
    4. Once you have identified the IO, let us know the following:
      1. What is the module scan rate?
      2. What is the function block type?
      3. What is the type of IO?
      4. If CHARMs, what is the scan rate? Note that the CIOC has a default scan rate, but once assigned, you can also change the scan rate of the IO from the controller's "Assigned IO". We need to ensure that we check both.
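
    For step 1, a quick chronicle query can pull that module's failure rows so the reporting block shows up alongside the description. Table and column names below are placeholders for whatever your EJournal schema uses, and 'FIC-101' is just a stand-in for the module you pick:

        -- All IO Input Failure / Error Cleared rows for one module, newest first
        SELECT Event_Time,
               Module,
               Attribute,          -- assumed column; often carries the block/parameter path
               Description
        FROM   dbo.Journal         -- placeholder chronicle table
        WHERE  Module = 'FIC-101'  -- the one module you picked
          AND  (Description LIKE '%IO Input Failure%'
                OR Description LIKE '%Error Cleared%')
        ORDER  BY Event_Time DESC;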
  • In reply to Lun.Raznik:

    The change in CIOC update rates occurred in 11.3.1. The original 50 ms scan rate caused issues with CPU loading on SDPlus controllers, and offering a slower update rate on the CIOC allowed the SDPlus to handle a reasonable number of CIOCs to accommodate 750 IO. The CIOC update rate has been configurable since 11.3.1, with a default of 250 ms.

    Luz is correct in suggesting that you focus on one module and investigate it more completely. Best to fully understand one source of the errors before we try to suggest a fix.

    900 IO events over two weeks does not seem to me to account for the consumption of a 300 MB Data Set. On my system I have a 100 MB Ejournal active database, currently showing 81 MB used, 19 MB free. A look in PHV shows slightly more than 29,000 total records, which indicates each record takes roughly 2.5 to 3 KB. A 300 MB data set should therefore accommodate well over 100,000 events. Seems there might be more to the story than just I/O Input errors.

    At this point, I would ask that you post a few examples of the Ejournal events, just to clarify whether we are dealing with an IO Input or an IO Output error, or whether it is an Input Transfer Error or Output Transfer Error. On my system I have an IO Input error, and it indicates IOF and General IO Error in the description fields. I tried posting a screen capture, but it is not legible in this post.

    The IO Input error on my system is caused by an open loop condition on the AI CHARM, so it is latched and not chattering.

    Luz mentioned that the Module Execution scan rate can cause IO errors and that the module must not execute faster than the IO updates. Specifically, the module must not write to an Output channel faster than the IO subsystem can process the output to that channel. Module execution rates affect Output channels only in this way. A module can read an input channel faster than the Input is updating, and this does not cause an Input error.

    It is certainly good practice to have modules write at a slower rate than the IO subsystem can process the signals. Note that if the output value has not changed, an output error will not be registered even if the module executes faster than the IO update rate. This can account for sporadic Output errors. The issue is typically seen with analog values more so than with discrete values that change infrequently.

    You can also use Diagnostic Explorer to see the current state of MERROR and MSTATUS for all modules in a controller. You can sort on IOINerr and IOOUTerr to find all modules with input or output errors. Select the Assigned Modules container, and in the right-hand pane you can see all the individual bits of the MERROR and MSTATUS words. This is a good place to view the overall health of your control modules and any online issues or abnormal conditions. Each row represents one module, which is identified in the first column.

    Note: on our database, Disparity Detect had been enabled on one transmitter, and this was toggling an alarm that chattered every 4 to 6 seconds. It was generating 29,000 events in one day, filling my Event Journal in about 30 hours. I disabled this feature until I can determine what the root cause is. It is important to keep the Event Journal from being inundated with chattering events so that you can both maintain a longer time period of current data and avoid performance issues when retrieving an hour's or a day's worth of data. Having 100 times more events means each query must process that much more data. A clean Ejournal helps PHV Echarts and Event views perform better.

    If you only have 900 such events in your 300 MB archive, the archive must not be very full. Use Event Chronicle Administrator to view the start time of the Archive, and view the Properties of the archive to see the Used and Free Space. Is the Archive filling in a couple of weeks, or is it simply a couple of weeks since this data set became active? View all events in a PHV event view and determine how many events are in your dataset, then compare this to the used space. This will help determine how much time this data set will collect before switching to a new active Data Set.
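
    A rough way to do the same arithmetic in SQL (placeholder names again, run against the active chronicle data set) is to count the records and the span of time they cover, then scale up to the data set size:

        -- Record count and time span of the active data set
        SELECT COUNT(*)                                        AS TotalEvents,
               MIN(Event_Time)                                 AS FirstEvent,
               MAX(Event_Time)                                 AS LastEvent,
               DATEDIFF(day, MIN(Event_Time), MAX(Event_Time)) AS DaysSpanned
        FROM   dbo.Journal;   -- placeholder chronicle table
        -- (Used MB x 1024) / TotalEvents gives KB per event;
        -- (300 MB x 1024) / (KB per event) estimates how many events fit before rollover,
        -- and TotalEvents / DaysSpanned gives the current fill rate in events per day.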

    We still need to determine the cause of the IO IN/OUT errors and address them. But at least you'll know what impact these events have on your dataset rollover period.

    Andre Dicaire

  • I have used DMAIC on a similar problem.
    Usually I use the 5 Ws and 1 H to define the problem as a starting point.
    Ask as many questions as you can that start with W, i.e.: What is the problem? When did it start? What happened then? Has it ever stopped, and what happened then? What has changed? What is the trend? Who is involved? Has this problem been solved before? Does anyone else have the same problem? Does anyone with the same equipment not have the problem? It is a very simple approach, but it allows us to focus on the problem rather than trying to find a solution.

    One possible issue is a hardware/sensor fault that is feeding into the system. Parasitic oscillation is present in some sensors and systems. Parasitic oscillation is defined as an oscillation or feedback that causes the system to no longer behave in the manner it was designed for. If these devices develop a fault, they can feed into the system and create all sorts of unusual behavior.
  • In reply to Andre Dicaire:

    Andre - sorry, to clarify, I am seeing 200-500 I/O failures per 24 hours. We are looking at specific bad actors and trying to see if we can clean some of them up - with some success. No CHARMs in our system at the moment, as we only recently (in the last two years) upgraded to MQ controllers and couldn't support CIOCs without adding an "S" series controller. I'll post some examples when I'm back in the plant . . .