
After running for a long time, an IOC using StreamDevice and ASYN exhibits significant deviations between the message transmission intervals and the set scan period. #102

Open
1458861693 opened this issue Jan 20, 2025 · 28 comments

Comments

@1458861693

I have developed an EPICS IOC to control an X-ray source. The IOC needs to continuously send messages to the X-ray source controller at a 100ms interval to maintain the connection with the X-ray source. These messages serve as heartbeat packets. If the time interval between two consecutive heartbeat packets exceeds 3 seconds, the high voltage of the X-ray source will be automatically set to 0kV.

Additionally, upon power-up, the X-ray source sends unsolicited operational status data to the IOC at a 100ms interval. This unsolicited data provides continuous updates on the working status of the X-ray source without requiring any explicit requests from the IOC.

I have developed an IOC using StreamDevice and ASYN, which successfully receives and parses the status data from the X-ray source. Now, I can monitor the status of the X-ray source in real-time. However, there are some issues with sending heartbeat packets to maintain the connection with the X-ray source.

In the db file, I set the SCAN field to ".1 second", which processes the record every 100 ms, as follows:

record(ao, "$(P)source:sendMessages") {
    field(DESC, "send control messages")
    field(DTYP, "stream")
    field(SCAN, ".1 second")
    field(OUT, "@XraySource.proto SendMsg($(P)kV.VAL,$(P)mA.VAL,$(P)source:ctrlData.VAL,$(P)exposure:time:amp.VAL) $(PORT)")
    field(FLNK, "$(P)source:cleanMessages")     # clean data after send messages
  }
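The actual protocol file is in the attached zip. For readers without the attachment, here is a purely hypothetical sketch of what a StreamDevice protocol with such a SendMsg entry could look like; the frame layout, the formats, and the use of the 0x5A start byte (mentioned later in this thread as the IOC's frame prefix) are assumptions, not the real file:

```
# XraySource.proto -- hypothetical sketch only; the real attached file may differ.
# ReplyTimeout in ms; the terminator and field formats are assumptions.
ReplyTimeout = 100;

SendMsg {
    # \$1..\$4 are the link arguments (kV, mA, ctrlData, exposure amp records);
    # %(\$n) redirects each format converter to read that record's value.
    out 0x5A "%(\$1)d" "%(\$2)d" "%(\$3)d" "%(\$4)d";
}
```
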

Initially the IOC runs well, and the message transmission intervals stay around 100ms. As shown in the figure below, data starting with "5A" is sent by the IOC and data starting with "A5" by the X-ray source.

Image

However, after running for a while, approximately 2 to 3 hours, I noticed that the message transmission intervals become unstable, sometimes being 100ms, sometimes 400ms, and sometimes even 2000ms. The following figure shows the IOC running after 5 hours.

Image

I am not sure if the message transmission intervals will continue to increase as the IOC keeps running, but I cannot allow this interval to exceed 3 seconds because the X-ray source controller will automatically shut down the high voltage.

I am not sure what is causing this issue. Currently, I suspect that there might be a problem with the ASYN configuration, but after reviewing the documentation, I still haven't found the root cause. I need help to resolve this issue.

The versions of ASYN and StreamDevice I am using are asyn-R4-44 and StreamDevice-master, and the EPICS version is base-7.0.8.
The db, protocol files and st.cmd file are attached.
Thanks very much.

@dirk-zimoch
Member

Do you get any error messages from your IOC about scan processing overruns?

@dirk-zimoch
Copy link
Member

dirk-zimoch commented Jan 20, 2025

The db, protocol files and st.cmd file are attached.

I cannot find them.

@1458861693
Author

1458861693 commented Jan 20, 2025

Sorry, I forgot to upload the files. They are now attached.

db and proto and cmd ZipFiles.zip

The IOC looks normal and no errors are reported, as shown below.

#!../../bin/linux-x86_64/XraySource
< envPaths
epicsEnvSet("IOC","iocXraySource")
epicsEnvSet("TOP","/home/ls/epics/XraySource")
epicsEnvSet("SUPPORT","/home/ls/epics/softIoc/synApps_6_1/support")
epicsEnvSet("ASYN","/home/ls/epics/softIoc/synApps_6_1/support/asyn-R4-44/")
epicsEnvSet("AUTOSAVE","/home/ls/epics/softIoc/synApps_6_1/support/autosave-R5-10/")
epicsEnvSet("CALC","/home/ls/epics/softIoc/synApps_6_1/support/calc-R3-7-3/")
epicsEnvSet("STREAM","/home/ls/epics/softIoc/synApps_6_1/support/StreamDevice-master/")
epicsEnvSet("SNCSEQ","/home/ls/epics/softIoc/synApps_6_1/support/seq-2-2-9/")
epicsEnvSet("EPICS_BASE","/home/ls/epics/softIoc/base-7.0.8")
cd "/home/ls/epics/XraySource"
## Register all support components
dbLoadDatabase "dbd/XraySource.dbd"
XraySource_registerRecordDeviceDriver pdbbase
# create communicate port and set the communicate parameters of X ray Source
### test ip and port ###
drvAsynIPPortConfigure ("L0", "192.168.1.7:5000",0,0,0)
##################
# drvAsynIPPortConfigure ("L0", "192.168.1.199:5000",0,0,0)
## Load record instances
dbLoadRecords("/home/ls/epics/XraySource/db/XraySource.db", "P=xanes2:, PORT=L0")
# IMAX, OMAX setting
# dbLoadRecords("db/asynRecord.db","P=norum:,R=asyn,PORT=L0,ADDR=24,IMAX=100,OMAX=100")
### Asyn
dbLoadRecords("/home/ls/epics/softIoc/synApps_6_1/support/asyn-R4-44//db/asynRecord.db","P=xanes2:,R=asyn,PORT=L0,ADDR=0,OMAX=2000,IMAX=2000")
epicsEnvSet ("EPICS_CAS_SERVER_PORT", "6020")
epicsEnvSet ("STREAM_PROTOCOL_PATH", "/home/ls/epics/XraySource/db")
asynSetTraceMask("L0", -1, 0x9)
asynSetTraceIOMask("L0", -1, 0x4)
asynSetTraceInfoMask("L0",-1, 0x1)
################# AUTOSAVE ##################
# cd ${AUTOSAVE}
# Optional database for autosave sta
# dbLoadRecords "db/save_restoreStatus.db", "P=$(IOCNAME):"
#save_restoreSet_status_prefix("$(IOCNAME):")
# set_requestfile_path("$(SAVE_DIR)")
# set_savefile_path("$(SAVE_DIR)")
# save_restoreSet_NumSeqFiles(3)
# save_restoreSet_SeqPeriodInSeconds(30)
# Arrange for restoring saved values into records
# set_pass1_restoreFile("$(IOCNAME).sav")
################# AUTOSAVE ##################
cd "/home/ls/epics/XraySource/iocBoot/iocXraySource"
iocInit
Starting iocInit
############################################################################
## EPICS R7.0.8
## Rev. 2024-01-30T23:45+0800
## Rev. Date build date/time: 
############################################################################
iocRun: All initialization complete
## Start any sequence programs
#seq sncxxx,"user=ls"
# Set trace file
dbpf("xanes2:asyn.TFIL", "/home/ls/Desktop/XRaySourceLog.txt")
DBF_STRING:         "/home/ls/Desktop/XRaySourceLog.txt"    
epics> 

@dirk-zimoch
Member

I notice all your inputs use "I/O Intr". Does this device send its data automatically, independently of your SendMsg?
If not (if they are replies to the message sent in SendMsg), try to handle them in the SendMsg protocol (using redirections) instead of using "I/O Intr".

@1458861693
Author

1458861693 commented Jan 20, 2025

Yes, the device automatically sends status messages to the IOC at a 100ms interval after it starts. No request needs to be sent.

@1458861693
Author

You can download these files and test them on your computer. Looking forward to your reply.

@MarkRivers
Contributor

How did you generate the communications trace? I see you set the asynTrace settings, but that does not look like the output of asynTrace, which should have "write" and "read" messages.

@1458861693
Author

I used a network debugging assistant to simulate the communication between the X-ray source and the IOC, so the messages look unchanged. Yes, I have set up asynTrace; its output is saved to a txt file, which I will upload (the file is somewhat large).
After the IOC has been running for an extended period, the intervals between the messages it sends are not as stable as they were initially.

XRaySourceLog.txt
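Large trace files like this are hard to inspect by eye. A small script can pull the timestamps out of the log and report every interval above a threshold. This is a generic sketch, not something from this thread: the regex assumes a `YYYY/MM/DD HH:MM:SS.fff` timestamp prefix on each line, so it would need adjusting to the actual trace format.

```python
import re
from datetime import datetime

# Assumed timestamp prefix, e.g. "2025/01/22 20:01:23.456 ..." -- adjust to the real log.
TS_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")

def long_gaps(lines, threshold_s=0.4):
    """Return (previous_ts, current_ts, gap_seconds) for every gap above threshold_s."""
    gaps, prev = [], None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # skip lines without a timestamp (hex dumps, etc.)
        ts = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f")
        if prev is not None:
            dt = (ts - prev).total_seconds()
            if dt > threshold_s:
                gaps.append((prev, ts, dt))
        prev = ts
    return gaps
```

Running `long_gaps(open("XRaySourceLog.txt"))` would then list exactly when the 100 ms cadence starts to drift, instead of scrolling through hours of trace output.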

@dirk-zimoch
Member

Some buffer accumulating input or output data over time would explain the degradation, if more and more old data is searched (maybe for terminators or 0 bytes) or copied (when the buffer grows).
But I do not see anything like that.
I need to add some debugging to read the buffer sizes, set up a test system, and let it run for a while...

@MarkRivers
Contributor

Can you run "top" to look at the resources used by the IOC process and see whether CPU or memory usage is growing when the time between writes gets long?

@1458861693
Author

1458861693 commented Jan 22, 2025

I tested this IOC on a physical machine today. The IOC runs on a CentOS 8 system, while the network debugging assistant runs on a Windows 11 system. However, because the Windows system went into sleep mode, the test lasted less than 2 hours.

Now I will run this IOC on a Raspberry Pi 4, communicating with the network debugging assistant on a Windows 10 system.
I will post the output of the top command here every so often. Please help me resolve this issue; it is very important to me. Thank you.

Output of the top command at 2025-1-22 20:01:
This is the resource usage of the IOC right after it has started.
Image

@1458861693
Author

1458861693 commented Jan 22, 2025

Output of the top command at 2025-1-22 21:53:
The IOC has been running for 2 hours, and the resource usage appears normal.
The send and receive rates of the IOC remain consistent with the configured settings.

Image

@dirk-zimoch
Member

Keep observing your IOC process. You may also look at /proc/[PID of your IOC]/statm regularly (every hour or so) and see if any of the numbers keep growing.
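The suggested hourly check can be automated. Below is a generic sketch (not from this thread) that samples `/proc/<pid>/statm` periodically and prints a line only when any field has changed; the field order (size, resident, shared, text, lib, data, dt) follows proc(5).

```python
import time

def parse_statm(text):
    """Parse a /proc/<pid>/statm line into a dict of page counts (order per proc(5))."""
    fields = ("size", "resident", "shared", "text", "lib", "data", "dt")
    return dict(zip(fields, map(int, text.split())))

def watch(pid, interval_s=3600, samples=24):
    """Sample statm every interval_s seconds; print only when a number changes."""
    last = None
    for _ in range(samples):
        with open(f"/proc/{pid}/statm") as f:
            cur = parse_statm(f.read())
        if cur != last:
            print(time.strftime("%F %T"), cur)
            last = cur
        time.sleep(interval_s)
```

For the IOC in this thread, `watch(2072)` would print once at startup and then only if a memory number grows, which is exactly the signal being looked for here.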

@1458861693
Author

Keep observing your IOC process. You may also look at /proc/[PID of your IOC]/statm regularly (every hour or so) and see if any of the numbers keep growing.

OK. Is the PID that of my IOC's st.cmd process?

@dirk-zimoch
Member

dirk-zimoch commented Jan 22, 2025

If you have started the IOC as 'st.cmd', then it is that one (PID 2072).
If you have started the IOC as '[programname] st.cmd', then look for the programname. In your case probably "XraySource", according to what I see in st.cmd.

@1458861693
Author

If you have started the IOC as 'st.cmd', then it is that one (PID 2072). If you have started the IOC as '[programname] st.cmd', then look for the programname. In your case probably "XraySource", according to what I see in st.cmd.

I started this IOC by typing './st.cmd' in a terminal, and I don't find a process named 'XraySource' or 'Xray...', so the PID of my IOC should be 2072.

This is the output when I type "ps aux | grep 'Xray*'":

pi@raspberrypi:~ $ ps aux | grep 'Xray*'
pi 2072 2.0 0.0 40512 5376 pts/4 Sl+ 19:52 4:22 ../../bin/linux-arm/XraySource ./st.cmd
pi 6934 0.0 0.0 7348 524 pts/1 S+ 23:21 0:00 grep --color=auto Xray*

@1458861693
Author

This is the output of "cat /proc/2072/statm".
Could it be that some setting in asyn, such as a buffer size, needs to be increased? I didn't find such a setting. Please help me.
Thanks!

pi@raspberrypi:/opt/egcs $ date
Wed 22 Jan 2025 10:13:22 PM CST
pi@raspberrypi:/opt/egcs $ cat /proc/2072/statm
10128 1344 1066 7 0 6905 0
pi@raspberrypi:/opt/egcs $ cat /proc/2072/statm
10128 1344 1066 7 0 6905 0
pi@raspberrypi:/opt/egcs $ date
Wed 22 Jan 2025 10:21:14 PM CST
pi@raspberrypi:/opt/egcs $ cat /proc/2072/statm
10128 1344 1066 7 0 6905 0
pi@raspberrypi:/opt/egcs $ date
Wed 22 Jan 2025 10:31:38 PM CST
pi@raspberrypi:/opt/egcs $ cat /proc/2072/statm
10128 1344 1066 7 0 6905 0
pi@raspberrypi:/opt/egcs $ date
Wed 22 Jan 2025 11:20:57 PM CST
pi@raspberrypi:/opt/egcs $ cat /proc/2072/statm
10128 1344 1066 7 0 6905 0
pi@raspberrypi:/opt/egcs $ date
Thu 23 Jan 2025 12:02:11 AM CST
pi@raspberrypi:/opt/egcs $ cat /proc/2072/statm
10128 1344 1066 7 0 6905 0
pi@raspberrypi:/opt/egcs $

@MarkRivers
Contributor

@1458861693 I don't think it is a problem with asyn because others would have reported this problem. The thing that you are doing that is less typical is having I/O Intr scanned input records and periodic output records. I suspect that could be causing an issue in StreamDevice.

I don't think you really need to run the network monitoring. The asyn trace output should be sufficient. You can send that output to a file.

@dirk-zimoch
Member

None of the memory numbers is going up. Thus my hypothesis about a buffer accumulating more and more data to process seems to be wrong.

@1458861693
Author

@MarkRivers , I will stop running the network monitoring, use only asyn trace, and send the output to a file. Please wait for my feedback.

@1458861693
Author

@dirk-zimoch @MarkRivers Input records scanned with "I/O Intr" should be normal according to the StreamDevice documentation. But regarding the output record: if I need to send a message to a port periodically, and the interval must be very short, how should I handle this?

@MarkRivers
Contributor

What you are doing should work fine. It seems like there may be a problem we were not aware of.

@1458861693
Author

@MarkRivers @dirk-zimoch
There is a new discovery. Another IOC is running on a CentOS 8 system, using the same db and proto files as in the previous attachment. After running for 20 hours it suddenly stopped sending data, and the system became very sluggish; I could barely operate the CS-Studio software. At that point the output of the top command was as shown in the picture below.

Image

And the IOC shell also displayed the following warning messages:

epics>
dbScan warning from '.1 second' scan thread:
Scan processing averages 1.550 seconds (0.505 .. 2.324).
Over-runs have now happened 10 times in a row.
To fix this, move some records to a slower scan rate.

dbScan warning from '.5 second' scan thread:
Scan processing averages 1.533 seconds (0.797 .. 2.432).
Over-runs have now happened 10 times in a row.
To fix this, move some records to a slower scan rate.

dbScan warning from '.1 second' scan thread:
Scan processing averages 1.789 seconds (0.505 .. 6.287).
Over-runs have now happened 20 times in a row.
To fix this, move some records to a slower scan rate.

dbScan warning from '.5 second' scan thread:
Scan processing averages 2.157 seconds (0.797 .. 8.662).
Over-runs have now happened 16 times in a row.
To fix this, move some records to a slower scan rate.

dbScan warning from '.5 second' scan thread:
Scan processing averages 1.828 seconds (1.073 .. 2.528).
Over-runs have now happened 10 times in a row.
To fix this, move some records to a slower scan rate.

dbScan warning from '1 second' scan thread:
Scan processing averages 1.624 seconds (1.026 .. 2.127).
Over-runs have now happened 10 times in a row.
To fix this, move some records to a slower scan rate.

dbScan warning from '.1 second' scan thread:
Scan processing averages 1.900 seconds (0.505 .. 7.760).
Over-runs have now happened 40 times in a row.
To fix this, move some records to a slower scan rate.

dbScan warning from '.5 second' scan thread:
Scan processing averages 2.245 seconds (1.073 .. 9.658).
Over-runs have now happened 17 times in a row.
To fix this, move some records to a slower scan rate.

dbScan warning from '1 second' scan thread:
Scan processing averages 2.771 seconds (1.026 .. 17.366).
Over-runs have now happened 13 times in a row.
To fix this, move some records to a slower scan rate.

dbScan warning from '1 second' scan thread:
Scan processing averages 8.178 seconds (1.026 .. 78.466).
Over-runs have now happened 14 times in a row.
To fix this, move some records to a slower scan rate.
errlog: 3 messages were discarded

dbScan warning from '2 second' scan thread:
Scan processing averages 7.558 seconds (2.111 .. 25.687).
Over-runs have now happened 12 times in a row.
To fix this, move some records to a slower scan rate.
epics>

Reviewing the trace files from these 20 hours, the data sent by the IOC was mostly normal; although there were slight fluctuations in the transmission intervals, there were basically no several-second gaps.

@1458861693
Author

@MarkRivers , I will stop running the network monitoring, use only asyn trace, and send the output to a file. Please wait for my feedback.

Meanwhile, I am also running another IOC as mentioned above, without using the network monitor, only utilizing the IOC to send data with the scan field set to 0.1 seconds. After 5 hours, based on the timestamps of the data sent from the IOC, the intervals have remained largely consistent with the scan period. I have also uploaded the asyn trace file to the attachment. I will continue to run this IOC indefinitely and will provide feedback here if there are any issues.

XRaySourceLog.txt

@dirk-zimoch
Member

I find it a bit strange that the problems start so suddenly and then affect all the scan threads, even up to "2 second". It is as if the whole IOC is suddenly frozen for at least 24 seconds (12 overruns in a row of the 2 second scan). The sending record is scanned with ".1 second" and the actual I/O is done in the asyn thread "L0", so I would have expected that only those threads would be affected. Problems with CSS on the same host point to a global problem.

Also, I noticed that you have 518% CPU load (5 of 8 cores?) for the "node" (probably Node.js) program, plus some percentage for python3 and java. Can it be that whatever your JS program is doing kills the host performance?

Is your CentOS 8 system a real machine or virtual? If it is virtual, maybe you need to tune your VM settings.

Is your machine accessing NFS mounts? If something is wrong with the NFS server, several processes may be affected, in particular if the machine is virtual with an NFS root file system.

If the IOC running on the Raspberry Pi is more stable than the one on your CentOS 8 machine, then the problem is most probably specific to the CentOS 8 machine and not related to the IOC.

@MarkRivers
Contributor

This command will show the CPU time of each thread; it can be useful for finer-grained information on what is using the CPU:

top -H

@1458861693
Author

Sorry for the late reply; it was the Chinese Spring Festival recently.

My CentOS 8 runs on a real machine without NFS mounts. My Control System Studio makes extensive use of Jython and Python scripts. Could this affect the operation of the IOC?

The IOC running on the Raspberry Pi might be more stable than the one on CentOS 8 because only one IOC runs on the Raspberry Pi, whereas multiple IOCs and Control System Studio are running on CentOS 8.

Is it recommended to use another Linux distribution, such as Rocky Linux 8, instead of CentOS 8? This is currently my plan.

@MarkRivers
Contributor

Is it recommended to use another Linux distribution, such as Rocky Linux 8, instead of CentOS 8? This is currently my plan.

No, CentOS 8 should work fine, even with several IOCs.

You should run "top" to see the load on the system. How much free CPU and memory capacity is there?

Mark
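For a quick numeric check without watching top interactively, the same figures can be read from /proc. This is a generic Linux-only sketch (not from this thread); the field layout follows proc(5):

```python
def read_loadavg(text):
    """Parse /proc/loadavg contents: 1-, 5-, and 15-minute load averages."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)

def read_mem_available_kb(meminfo_text):
    """Return MemAvailable in kB from /proc/meminfo contents, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None
```

Logging `read_loadavg(open("/proc/loadavg").read())` alongside the IOC's trace timestamps would show whether the long send intervals coincide with spikes in overall system load.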
