Cold Restart - Event Flooding Log

Hi,

I have an issue with an event flooding Process History every 10 seconds...

03/02/2018, 08:56:45, EVENT, AREA_A, PPLUS01, COLD_RESTART, DOWNLOAD, Cold restart on CTR01 failed; download data incomplete
02/03/2018, 08:56:45, EVENT, AREA_A, PPLUS01, COLD_RESTART, DOWNLOAD, Cold restart on CTR01 started

The event started following a partial download to an AI module - simply a change to the HI_LIM parameter.

At the point of downloading, I receive the following warning dialog...

WARNING:
CTR01 has one or more system files missing which could impact the integrity of controller redundancy. Perhaps, another download is in progress.

To check the powerup directory in more detail, run the Diagnostics application and use the Powerup Directory Test tool.

The Powerup Directory Test of CTR01 shows that the majority of items in the Powerup directory differ from the controller. The exceptions appear to be any items recently downloaded.

The controller (SX) cold restart source is "Controller"; User Restart State is "Yes"; Cold Restart State is "No"

A total download to the controller stops the cold restart events appearing in the log, but I'm trying to understand why the controller is getting into this situation in the first place.

Is the Powerup directory mismatch related to the cold restart event appearing in the event log?

Thanks for any pointers

1 Reply

  • Colin, to answer your last question: yes. The Power Up directory stores all the download scripts sent to controllers and workstations, as well as a few general housekeeping files that define the commissioned hardware. The Power Up directory is used to execute Cold Restart downloads, and if the Cold Restart source location is the Workstation, the Power Up directory files are actually downloaded during the cold restart. A Cold Restart download happens under two conditions: the controller loses power and power is subsequently restored; or the Standby controller of a redundant pair is being updated following a partial or total download to the Active controller. (Note that a user-initiated Cold Restart download transfers the files to the controller's cold restart location and does not affect the running modules.)

    You mention this is an SX controller, so you are at v11, v12 or v13. In v12, the S-series hardware received a new OS, and I believe there were a couple of changes in the Cold Restart behavior. But I'm thinking that on a partial download to the controller, the Active controller will pass the download to the Standby so that both the Active and the Standby will have the new module running. At this point, the Cold Restart source is set to the Workstation (Pro Plus). A cold restart download is then done to update the controller's cold restart configuration and restore the source location to Controller.

    If we end up with a failed download, the content of the Power Up directory might not match the actual controller modules running. At this point, a total download will resolve this. On a download, the controller receives the module script, instantiates a new version of the module and transfers runtime data from the old version before shifting to the new version. Even if all you changed was the HI_LIM parameter value, a partial download means a new instance of the module that replaces the old. During this process, if anything goes wrong (the file is corrupted, interrupted or incomplete), the new instance is not created and the old version of the module continues as though nothing happened. At this point, the Power Up directory no longer contains the same configuration as the running module. A total download aligns the runtime with the Power Up directory and resolves the issue.
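
    To make the sequence above concrete, here is a toy model in Python. It is only a sketch of the behavior as I described it, not how the controller is implemented: the names Controller, partial_download, mismatches and total_download are made up for illustration, and it assumes the download script is recorded in the Power Up directory before the controller tries to instantiate the new module.

```python
# Toy model only. Every name here is hypothetical; none of this is a DeltaV API.
from dataclasses import dataclass, field


@dataclass
class Controller:
    running: dict = field(default_factory=dict)  # module name -> version running
    powerup: dict = field(default_factory=dict)  # module name -> version in Power Up directory


def partial_download(ctrl, module, new_version, script_ok=True):
    """Record the script, then switch versions only if the script is usable."""
    # Assumption: the script lands in the Power Up directory first.
    ctrl.powerup[module] = new_version
    if not script_ok:
        # Corrupted, interrupted or incomplete: the old instance keeps running.
        return False
    # The new instance replaces the old one.
    ctrl.running[module] = new_version
    return True


def mismatches(ctrl):
    """Modules whose Power Up directory version differs from the running one."""
    return sorted(m for m, v in ctrl.powerup.items() if ctrl.running.get(m) != v)


def total_download(ctrl):
    """Align the runtime with the Power Up directory."""
    ctrl.running = dict(ctrl.powerup)
```

    With a controller created as Controller(running={"AI1": 1}, powerup={"AI1": 1}), a failed partial_download(c, "AI1", 2, script_ok=False) leaves version 1 running while mismatches(c) reports AI1, and total_download(c) clears the mismatch.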

    The events in the Ejournal are strictly due to the fact that Cold Restart attempts are happening and so are recorded. We could debate why it continues to do this indefinitely, but we do know that if not resolved, we have lost cold restart functionality. There should also be a Hardware alert on the controller that, along with the failed download message, tells us to address the issue. However, since the condition can persist for a while until a resolution is identified, it might be beneficial if the frequency of the retries dropped to maybe once a minute after a few initial retries. Hmm.
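
    The kind of retry schedule I have in mind could look like the sketch below. This is purely illustrative Python, not existing controller behavior: the function name retry_delays and the choice of three fast retries are my own assumptions, with the 10-second and 60-second values taken from the numbers mentioned in this thread.

```python
from itertools import islice


def retry_delays(fast_retries=3, fast_s=10, slow_s=60):
    """Yield the wait in seconds before each retry.

    Hypothetical schedule: a few quick retries, then a slower steady pace.
    """
    attempt = 0
    while True:
        yield fast_s if attempt < fast_retries else slow_s
        attempt += 1


# First five waits under the default (assumed) settings.
first_five = list(islice(retry_delays(), 5))
```

    At one retry per minute instead of one every 10 seconds, the steady-state event rate in the journal drops by a factor of six while the condition is still being reported.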

    As to why the download failure happened, I'm not sure. If this is a v11 controller, there are issues in the OS memory management that result in memory fragmentation. However, with SX controllers we almost never see this, as the available 96 MB of user configuration memory all but eliminated it. If this is v12 or v13, the new OS changed the memory management and eliminated this completely. If the memory was so fragmented that the largest block of contiguous memory was too small to accept the module size, the download would fail. Since the controller does do some memory optimization over time, the SX controller's larger memory allows fragments to coalesce and become larger blocks.
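
    The fragmentation point can be illustrated with a small Python sketch. This is a generic free-list model, not the controller's actual memory manager: the (start, size) block representation and the function names are assumptions for illustration, and coalesce here only merges free blocks that are already adjacent.

```python
def largest_free_block(free_blocks):
    """Size of the biggest contiguous free block; blocks are (start, size) pairs."""
    return max((size for _, size in free_blocks), default=0)


def download_fits(free_blocks, module_size):
    """A module needs one contiguous block, so total free memory is not enough."""
    return largest_free_block(free_blocks) >= module_size


def coalesce(free_blocks):
    """Merge free blocks that sit directly next to each other."""
    merged = []
    for start, size in sorted(free_blocks):
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_size = merged[-1]
            merged[-1] = (prev_start, prev_size + size)
        else:
            merged.append((start, size))
    return merged


# 120 units free in total, but the largest single block is only 50.
free = [(0, 40), (40, 30), (100, 50)]
fits_before = download_fits(free, 60)
fits_after = download_fits(coalesce(free), 60)
```

    In this example a 60-unit module does not fit before coalescing even though 120 units are free, and it does fit once the two adjacent blocks are merged into one 70-unit block.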

    A network fault could corrupt a download packet, but if total downloads work and there are no error counters being recorded on the network switches, that would seem unlikely. It could have been a transient condition that is now gone.

    At this point, I'd strongly urge you to log a call with the GSC and record the occurrence of this issue. They may have encountered similar issues and there may be a known resolution. A call would also inform Emerson of the issue. If others are experiencing the same behavior, it will help provide more information to the investigations and subsequent resolutions. Even if you've fixed the issue with a total download, that has only masked the underlying issue that caused the failed download. If you find out why the download failed, then please post that here too.

    Andre Dicaire