
IOC lockRequest messages and inconsistent record scanning #71

Open · LeeHudsonDLS opened this issue Mar 26, 2021 · 12 comments

@LeeHudsonDLS

Hi,
I'm in the early stages of characterising an issue I've been seeing, but in case it's something obvious I thought I would raise it now. Our version of StreamDevice is 2.8.18.

We have rolled this version of StreamDevice out on a few IOCs and haven't had any issues, except on one IOC. The IOC in question monitors how long it takes to get a response from a device, and it reports that some readback PVs aren't being updated as fast as they were on an older version (2.5). The record in question is an "ai" record with a scan rate of 0.2 seconds. By setting some asyn trace masks I was able to see exactly when the record was querying the hardware; below is the time in seconds between these queries:

| Version | avg (s) | max (s) | stdev (s) |
|---------|---------|---------|-----------|
| 2.8.18  | 0.458166666666666 | 1.569 | 0.385062690171769 |
| 2.5     | 0.342230769230769 | 0.777 | 0.220675811463744 |

As you can see, there doesn't appear to be a big difference in the average, but the max is double, and the timing is less consistent, as shown by the stdev.
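For what it's worth, the statistics above can be reproduced from the trace timestamps with a few lines of Python; the timestamp values below are invented for illustration, and in practice they would be parsed from the time prefixes in the asyn trace log:

```python
from statistics import mean, stdev

# Hypothetical query timestamps (in seconds) extracted from the asyn trace
# output; real values would be parsed from the log's time prefixes.
timestamps = [0.00, 0.35, 0.71, 1.20, 2.77, 3.10]

# Intervals between consecutive hardware queries for the record.
intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]

print(f"avg   {mean(intervals):.3f}")
print(f"max   {max(intervals):.3f}")
print(f"stdev {stdev(intervals):.3f}")
```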

This particular IOC also gives the occasional error, on boot and during operation:

```
2021/03/24 14:26:57.778463 robot_serial BL04I-MO-ROBOT-01:CNTL_ERR_CODE lockRequest: status returned to normal
2021/03/24 14:26:56.867206 robot_serial BL04I-MO-ROBOT-01:DRYTMR lockRequest: status returned to normal
2021/03/24 14:26:56.867218 robot_serial BL04I-MO-ROBOT-01:DRY_SPEED lockRequest: status returned to normal
2021/03/24 14:26:56.867229 robot_serial BL04I-MO-ROBOT-01:DRY_TIME lockRequest: status returned to normal
```

I don't know if this is related.

It should be pointed out that I've seen this behaviour when changing only the StreamDevice dependency.

I appreciate this isn't much to go on and I will investigate further; I just thought I'd raise the issue now in case you can see anything obvious.

Thanks

@dirk-zimoch (Member) commented Mar 26, 2021

I never had real-time performance in mind when writing this driver and did not pay much attention to the performance values.
To understand your values better, please explain:

  • Which underlying transport layer do you use? TCP? Serial?
  • How many records compete for access to the port?
  • Are the numbers the time between two processings? They are all much longer than your 0.2 second scan rate, so in that sense it "never worked": either your device is slow or processing the data takes longer.

I am a bit surprised that you only see "returned to normal" messages but no actual error messages. I need to check that.

Please provide the full database, protocol files and sample data. Best to send them per mail directly to me.

@LeeHudsonDLS (Author) commented Mar 26, 2021

This IOC is using Serial.

We have 146 records that can access this port, but only 62 of them have a regular scan rate.

That's a good point: the numbers aren't the time between record processings but the time between the communications triggered by the record processing. I got these values by setting asynSetTraceIOMask to report each command sent. As this is the time between messages sent from the IOC, I don't think the device is the slow part; I see the subsequent response from the device in around 1 ms (again using asynSetTraceIOMask).
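For reference, the kind of iocsh trace setup that timestamps each message on the port might look like the following sketch. The mask values are assumptions based on asyn's documented trace bits, not taken from the actual startup script:

```
# Illustrative iocsh commands, not from the actual IOC startup script.
# Port name "robot_serial" is taken from the log above; -1 means all addresses.
asynSetTraceMask("robot_serial", -1, 0x9)     # ASYN_TRACE_ERROR | ASYN_TRACEIO_DRIVER
asynSetTraceIOMask("robot_serial", -1, 0x2)   # ASYN_TRACEIO_ESCAPE: printable escapes
asynSetTraceInfoMask("robot_serial", -1, 0x1) # ASYN_TRACEINFO_TIME: timestamp prefix
```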

I will send the IOC to you via email.

Sorry, I missed some previous error messages:

```
[Wed Mar 24 12:14:36 2021] 2021/03/24 12:14:36.753586 main BL04I-MO-ROBOT-01:BCODE lockRequest: port robot_serial not connected
[Wed Mar 24 12:14:36 2021] 2021/03/24 12:14:36.753613 main BL04I-MO-ROBOT-01:BCODE: Can't start @init handler
[Wed Mar 24 12:14:36 2021] 2021/03/24 12:14:36.753622 main BL04I-MO-ROBOT-01:BCODE: Record initialization failed
[Wed Mar 24 12:14:36 2021] Bad init_rec return value PV: BL04I-MO-ROBOT-01:BCODE ao: init_record
```

iocLog.txt

@dirk-zimoch (Member)

The message "lockRequest: port robot_serial not connected" shows that pasynManager->queueRequest() fails when StreamDevice is starting the protocol. The error string "not connected" comes directly from the asyn driver. I see you are using drvAsynIPPortConfigure, so it is a TCP device, not serial. The message means that the TCP socket is not (yet) connected when communication is supposed to start. For that reason, all @init handlers have failed during iocInit as well as the first few processings. Only 3 seconds later, the asyn port connects and StreamDevice reports "status returned to normal". From that time on, everything works normally, from my point of view.
I will have a look at how to force (and/or wait for) a connect in iocInit. But this will introduce timeouts for any device that is not online when the IOC starts. For the time being, you may try to add `connect <timeout_ms>;` to the beginning of your @init handlers.
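A hypothetical protocol-file sketch of that workaround; only the `connect 1000;` line is the suggested change, and the protocol name, command strings and terminator are placeholders, not from the actual IOC:

```
# Hypothetical StreamDevice protocol excerpt; everything except
# "connect 1000;" is a placeholder.
Terminator = CR LF;

read_status {
    out "STATUS?";
    in "%d";
    @init {
        connect 1000;   # wait up to 1000 ms for the asyn port to connect
        out "STATUS?";
        in "%d";
    }
}
```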

@LeeHudsonDLS (Author)

Thanks, Dirk, for looking into this and explaining the issue. We are using a terminal server over TCP to talk to a serial device. What is odd is that this method of communication is extremely common at Diamond, and with the older version of StreamDevice (2.5) we never see this error.

@dirk-zimoch (Member)

I changed something in the connection handling: I now use pasynManager->exceptionCallbackAdd() to get asynchronous connect/disconnect events (and call @init again when the device re-connects). I need to check whether that degrades the behaviour in iocInit when connecting is slow...

Anyway, this has nothing to do with the processing time changes.

@dirk-zimoch (Member)

Comparing with my older code, I see that before relying on auto-connect, I used asynQueuePriorityConnect in the pasynManager->queueRequest() call when the device was not connected. This forced a connection attempt, with a potential timeout, in every communication attempt (I think). The idea behind auto-connect was to simply skip the device while it is disconnected and leave it to asyn to re-connect. But I see that this leads to problems at initialization. On the other hand, I want to avoid timing out as many times as there are @init handlers for the same device. Unfortunately, the records do not really know which other records belong to the same device/asyn port, so I cannot say "wait only once per device".

@LeeHudsonDLS (Author)

Hi Dirk, I don't know this module intimately, but are you saying that using asynQueuePriorityConnect would mask any slow-to-respond hardware by waiting for the connection rather than continuing without one? The error messages I get look worse than they are, considering the device does connect shortly afterwards.

@dirk-zimoch (Member) commented Jun 18, 2021

Using asynQueuePriorityConnect makes the I/O call wait for the connect. That slows down IOC startup until the device is connected, instead of starting without a connection. But what if the device is offline and not just slow? That would slow down IOC startup a lot: TIMEOUT × records_with_@init.
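As a rough worked example of that worst case (both numbers here are assumptions: the 146 records on the port mentioned earlier, and an illustrative 1-second connect timeout):

```python
# Rough worst-case extra iocInit time if every @init handler waits out the
# connect timeout while the device is offline. Both numbers are assumptions,
# not measured values from this IOC.
connect_timeout_s = 1.0    # assumed per-attempt connect timeout
records_with_init = 146    # e.g. if every record on the port had an @init handler
worst_case_delay_s = connect_timeout_s * records_with_init
print(worst_case_delay_s)  # 146.0 extra seconds of startup
```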

And it slows down the port thread later when the device drops out. You may not see that, as the rest of the IOC will continue to run; only the records connected to the same port will get LockTimeout. However, you may see the asyn thread hanging when the IOC is terminated.

As @init should run once the device connects, everything should be fine. But I will see if I can do something about it anyway. That may take some time though...

@LeeHudsonDLS (Author)

OK, thanks Dirk. Now that I understand the errors, I think I can run with them. The other issue is the performance, but I guess that's potentially a rabbit hole...

@dirk-zimoch (Member) commented Jun 18, 2021

For that I would probably need access to such a device. Simply telling me "something is slower" does not really allow me to track things down. What you can do: build intermediate versions of StreamDevice and hunt down which commit caused the performance loss. Then I may understand what is going on. Maybe start with the latest 2.7 version and compare it to the first 2.8 version.

@LeeHudsonDLS (Author)

OK thanks, I'll do that. Because I'm comparing against a very old version, it may be that the performance has gotten worse over many commits, but I will verify that. Thanks

@LeeHudsonDLS (Author)

Apologies for the slow progress on this. I have narrowed the issue down to commit d777084.

The issue is present after this commit but not before it.
