Watchdog problem at customer site

Dear All,

We are facing a watchdog failure at one of our customer sites. As soon as the watchdog timer fails, the entire batch goes to hold, which stops production. The system uses dynamic referencing to issue different commands. Controller free time and free memory seem good when the issue happens. If anyone has come across this problem, or has found a fix or an alternative, please share. Thank you very much.

  • As the general community does not have access to the GSC call tracking system, I would suggest you post the details of your issue here.

    If the problem pertains to a customer that you are the service provider for (based on the title of the posting), please make sure not to post any sensitive information.

    Youssef El-Bahtimy | Systems Integration Technologist
    PROCONEX | 103 Enterprise Drive | Royersford, PA 19468 USA
    Proconex Office: 610 495 2970 | Cell: 267 275 7513
  • In reply to Youssef.El-Bahtimy:

    I agree with Youssef.

    In general terms, though, a watchdog failure at the node level can be caused by several things.

    Off the top of my head:

    - Loading on the node: for the controller, is it within the specified limits? For example:

      Controller free time minimum: 20%

      Controller free memory minimum:

        MD: 1.4 MB

        SD Plus/MD Plus: 4.8 MB

        SX/MX: 9.6 MB

    - For workstations: check that you have enough CPU idle time and free memory, and that the disk I/O load is acceptable.

    - At the network level: is the traffic clean? Is the network equipment in good shape?

    On the control level, it is possible that improvements/fixes have exposed issues in your configuration. Or there might be a regression in functionality.
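
    The loading limits quoted in this reply can be turned into a quick sanity check. The sketch below is only an illustration: the function name and the way the readings are passed in are assumptions, and the thresholds are simply the figures listed above, not values read from any controller.

```python
# Minimum free memory in MB per controller type, as quoted in the reply above.
MIN_FREE_MEMORY_MB = {
    "MD": 1.4,
    "SD Plus": 4.8,
    "MD Plus": 4.8,
    "SX": 9.6,
    "MX": 9.6,
}

# Minimum controller free time in percent, as quoted in the reply above.
MIN_FREE_TIME_PERCENT = 20.0


def check_controller_loading(controller_type, free_time_percent, free_memory_mb):
    """Return a list of warnings; an empty list means the readings are within limits.

    Hypothetical helper: the readings must be collected by hand or by some
    other tool, this function only compares them with the quoted limits.
    """
    warnings = []
    if free_time_percent < MIN_FREE_TIME_PERCENT:
        warnings.append(
            f"free time {free_time_percent}% is below {MIN_FREE_TIME_PERCENT}%"
        )
    min_memory = MIN_FREE_MEMORY_MB.get(controller_type)
    if min_memory is None:
        warnings.append(f"no memory limit known for type {controller_type!r}")
    elif free_memory_mb < min_memory:
        warnings.append(
            f"free memory {free_memory_mb} MB is below {min_memory} MB"
        )
    return warnings


if __name__ == "__main__":
    # Example readings (made up for illustration).
    for problem in check_controller_loading("MD", 15.0, 1.0):
        print(problem)
```

    Note that passing this check does not rule out a watchdog problem; the original poster reported good free time and free memory when the failure occurred.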

  • Dear Youssef/Raznik,

    The data is critical and important to the client, and that is the reason I did not submit it here. I thought some of you might have access to SMS to check the details of the problem.

    Thank you for the reminder. I will try to summarize the problem faced at site in general terms without sharing any critical information.

    The main concern is not the watchdog failure itself, but the trip of the entire batch because of the watchdog failure.

    I will get back with details as soon as I can.

    Thanks a lot.
  • In reply to Ashish P:

    Dear All,

    Sorry for the delay. I just want to update some information on this problem:

    The issue we are facing is a different one. THIS IS NOT THE SERIAL WATCHDOG; this is the batch watchdog.

    Your phase is loaded to the controller. While running, if it loses connection with the batch executive, after a certain time your batch fails with a reason such as 'Phase Logic Failure: PLM Watchdog Failed 1' or 'Device connection Error', etc.

    Has anyone observed this kind of situation before? Any details or information would be really helpful.

    Thank you for your time, gentlemen!
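
    For readers unfamiliar with the mechanism, the behaviour described above (a running phase that fails after losing contact with the batch executive for a certain time) follows the general watchdog-timer pattern. The sketch below is a generic illustration only, not the actual PLM implementation: the class, the timeout value, and the state names are all assumptions.

```python
import time


class Watchdog:
    """Generic watchdog timer: expires if it is not kicked within timeout_s."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_kick = clock()

    def kick(self):
        """Record a sign of life from the supervising side (here: the executive)."""
        self._last_kick = self._clock()

    def expired(self):
        """True once more than timeout_s has passed since the last kick."""
        return self._clock() - self._last_kick > self.timeout_s


def phase_state(watchdog):
    """Hypothetical phase reaction: go to hold when the watchdog has expired."""
    return "HELD" if watchdog.expired() else "RUNNING"


if __name__ == "__main__":
    wd = Watchdog(timeout_s=0.2)     # short timeout, for demonstration only
    print(phase_state(wd))           # communication is still considered healthy
    time.sleep(0.3)                  # simulate lost communication
    print(phase_state(wd))           # the watchdog has expired by now
```

    In this pattern the phase holds even when the controller itself is healthy, because the only thing being measured is the time since the last successful contact.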

  • In reply to Ashish P:

    You mentioned that you use a lot of dynamic referencing. I have had experience with modules with dynamic references embedded in phases causing problems because module execution would start before the references were bound.

    Perhaps take a look at network utilization during phase loading to see whether excessive dynamic reference binding causes a spike large enough to interfere with the batch executive watchdog function.

    I would look into whether you can modify the watchdog timeout. For soft phases I believe you can, but I don't think that applies to controller phases.
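
    The binding problem mentioned at the start of this reply is essentially an ordering issue: logic runs before all of its references are resolved. A minimal, generic sketch of a guard against that is shown below. It is not tied to any vendor API; the function name, the dictionary layout, and the example reference names are hypothetical.

```python
def ready_to_execute(references):
    """Check whether every dynamic reference has been bound.

    `references` maps a reference name to its resolved target, or to None
    while the reference is still unbound (a hypothetical representation).
    Returns (all_bound, sorted list of unbound reference names).
    """
    unbound = sorted(name for name, target in references.items() if target is None)
    return (len(unbound) == 0, unbound)


if __name__ == "__main__":
    # Made-up reference names and targets, for illustration only.
    refs = {"VALVE_REF": "XV-101/SP", "PUMP_REF": None}
    all_bound, unbound = ready_to_execute(refs)
    if not all_bound:
        print("holding off execution, unbound references:", unbound)
```

    Logging which references are still unbound when a phase starts may also help narrow down why only one particular phase is affected.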

  • In reply to Youssef.El-Bahtimy:

    Dear Youssef,

    You mentioned that modules with dynamic references embedded in phases can cause problems because module execution starts before the references are bound. However, this problem is observed only in one particular phase; not all phases cause the PLM watchdog failure.

    Does controller memory fragmentation have any impact on watchdog failure?

  • In reply to Ashish P:

    The client considered a hardware upgrade to solve this problem. The client will now replace all the existing controllers with MX controllers.