-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Loading driver cause system crash on version of Beta-V1 #95
Comments
I can't reproduce your reported issue with the latest open-source version. Do you restart the PC after programming the SPI flash (cold start/complete power down)? |
I could reproduce this with CentOS, kernel 6.4, and Vivado 2023.1 (Vivado v2023.1 (64-bit) SW Build: 3865809 on Sun May 7 15:04:56 MDT 2023 IP Build: 3864474 on Sun May 7 20:36:21 MDT 2023 SharedData Build: 3865790 on Sun May 07 13:33:03 MDT 2023). SOM image works, image built with Open-Source implementation does not. Similar with Ubuntu (Linux ocp 6.5.3 #1 SMP PREEMPT_DYNAMIC Mon Sep 18 10:41:37 IST 2023 x86_64 x86_64 x86_64 GNU/Linux) SOM image works, but Open-Source image built with Vivado 2023.1 does not. Kernel does not crash though, and I can see the following with lspcie:
Note that I have changed Device ID to 0x0401 to be sure my image is loaded. With SOM image the lspcie shows:
Clearly MSI is somehow disabled for Open-Source version. Can someone please explain what is the main difference between SOM version and the Open-Source version? Mainly any difference in the PCIe MSI settings? |
Yeah, well the with the Device ID 0x0401 the ptp_ocp driver is not loaded and the interrupts are not enabled. Why do you have 0x0401 as device id? Did you changed it manually? In the TCL script which creates the PCIe config it's not like that: Time-Appliance-Project/Time-Card/FPGA/Open-Source/Implementation/Xilinx/TimeCard/Bd/TimeCardBd.tcl Lines 613 to 625 in b22096f
I guess the crash has maybe something to do with the driver and already pending interrupts or similar. There are some small differences with the available modules and possible interrupt sources in the SOM and Open-Source version. PCIe config is the same. |
Thanks @thschaub for such a quick reply. Yeah, I've changed the Device ID to 0x0401 just to see that I have successfully loaded the new bit file. Thanks for pointing out that driver needs to have the ID changed accordingly. Now the card is visible with lspcie. Another issue, same to #73 is that the driver only gets allocated 1 MSI IRQ. @thschaub would you be able to tell what might be the issue and/or where to look? I have this behaviour with SOM and also clean Open-Source compiled bit stream. lspcie shows that 32 MSIs are enabled (see below), but driver only gets one IRQ after calling
|
I'm not sure if I should start a different issue. I have a v8 card and if I boot with a usb flash with ubuntu-23.04-live-server-amd64.iso with the closed TimeCard.bin firmware, it will boot ok, but if I load the TimeCardOS.bin firmware, the kernel will panic. In both cases, I did a cold boot after changing the firmware. I have not tried other linux versions. Here is a screenshot of the panic. I tried a few times and it paniced every time. I only have a single GNSS installed, if that makes a difference. I'm actually trying to write a FreeBSD driver for the card, but am a bit stuck with the MSI interrupts causing panics there, so I thought to check how linux was doing. I started with the OS version, which paniced, and I thought I had a smoking gun, but then linux did not panic with the SOM version. My FreeBSD code panic with both, but it is probably something I do wrong. :-) Any suggestions on debugging MSI interrupts on the TimeCard would be useful. @chaloz while searching I saw that for more than one MSI interrupt on linux, you need a kernel option CONFIG_IRQ_REMAP. |
@JohnHay Thanks for the tip, I think I have this option enabled: @JohnHay I am no expert on kernel drivers, but what I would suggest is to start with only one interrupt being enabled by the driver, and go on from there. If it works with a single interrupt, enable two and so on. The line where you allocate interrupts is here: Time-Appliance-Project/Time-Card/DRV/ptp_ocp.c Line 4868 in b22096f
So force err = 1; after the line above.
|
@thschaub are the MsiIrq Ips the same for the SOM and Open-Source version? I had a look at the Open-Source MsiIrq.vhd and noticed that MsiVectorWidth_DatIn is not being used, so I was wondering how the interaction with the PCIe would be in the case where the OS sets only 1 interrupt. Like we see in @chaloz example and I have also seen when booting with ubuntu and the SOM firmware. In the AXI PCIe documentation I found (DS820 April 24, 2012), in the MSI Interrupt section there is a sentence: I'm not sure if that refers to the capability part of that register or the part where the OS sets the number of vectors it allows/allocated. Would that mean if MsiIrq still sends all the vectors, just their lowest bit will be used? So will all the even vectors cause the same interrupt? What will then happen to the odd numbered ones? Is there a way for the driver to determine if there are pending interrupts before enabling it, other than going through each function on the card? |
I made some progress, but still have some questions. I followed @chaloz's suggestion and made a stripped down driver that only allocated a single msi interrupt with a handler that does nothing. That did not panic, but I found that it was getting 1.5M interrupts per second. That was with a cold booted system and the driver did nothing on the card. So I wrote a program that read and displayed the interrupt enable registers of all the functions on the card. Nothing was enabled. I tried it with both the SOM and Open-Source firmware and they behaved the same. Here is the output of the program with the Open-Source firmware. It used the the CoreList to get a list of all the functions and only looked at the ones with an irq that is not 0xffffffff. The number after the / is the interrupt sensitivity.
So I changed the driver to allocate 2 msi interrupts and a handler for each. That did not crash either, but I saw that both interrupts now received around 900k interrupts per second. So I increased it in powers of 2 until it allocated 32 msi interrupts and it still did not panic. I think it is because I installed a handler for each interrupt. In my original driver I only installed a handler for the functions was busy with. From where the driver allocated 4 msi interrupts, I could see that the interrupts came from irq 1 and 2, "timestamper gnss pps" and "timestamper 1".
@thschaub do you have an idea why it would happen? I started to look through the firmware on git, but have not found anything yet. |
@JohnHay, great to see you're making progress. Without seeing your code anything I recommend is just a shot in the dark, but here we go: are you clearing the interrupts in the interrupt handler? Like so: Time-Appliance-Project/Time-Card/DRV/ptp_ocp.c Line 2045 in b22096f
|
@chaloz, I made a lot of progress. I found the source of the interrupts. There are 2 unused interrupts (17 and 18) on the board. In TimeCardBd.tcl they are connected to xlconstant_5_dout, but xlconstant_5 is defined without giving it a value, so its default is 1. If they were edge interrupts, that would only cause 2 interrupts, but also in TimeCardBd.tcl, they are defined as level interrupts with this bitmask: @thschaub can xlconstant_5 not just be removed and TC_MsiIrq_0/IrqIn17_DatIn and TC_MsiIrq_0/IrqIn18_DatIn be left open? Looking at the MsiIrq code, it will use a 0 for IrqIn pins that are not connected. Also should the level mask not be changed to 0x000805B8, ie. clearing bits 17 and 18? That seems safer, except if you already know what irq 17 and 18 will be used for in the future, or is there another reason to keep them as level interrupts? The last part of the puzzle (for now) was the panics in FreeBSD. I found that there is a mistake in the function native_apic_alloc_vectors() that did not align it on a 32 boundary, but only on a 16 boundary. So msi allocations up to 16 were fine. If you allocate 32 msi vectors and they are not aligned, the axi pcie blob will just ignore the lower 5 bits and write over it. In my case the vector received from the OS was 0x50, so the axi pcie would ignore bit 4 and create vectors from 0x40 to 0x5f. The interrupts outside of the range caused the panics. The axi pcie implementation is correct though, because the pci spec says that the vector has to be aligned to the number of bits that are requested. So now I can return to the driver I'm busy with. :-) |
Good catch, the xlconstant_5 should be definitly '0'. The reason why we have not left them open and we have a defined mask is because on a special HW revision (production version) where this IRQs are used: Left open is also somehow undefined, but driving it with a constant allows to optimize the logic. So left something open is usually not the proper way in VHDL. I will inform you once this bug is fixed. |
I am using the PCB version is Beta -V1.
When I reflash Factory_TimeCardOS.bin file from \Time-Appliance-Project-master\Time-Card\FPGA\Open-Source\Implementation\Xilinx\TimeCard\Binaries and executing modprobe, then system gets crashed, only reboot.
(BTW, I don't make any changes on Vivado project)
Then I reflash Factory_TimeCard.bin file from \SOM\FPGA\Binaries and it works well.
It looks very wired, that is why?
If I only reflash from SOM folder, I can not make any changes on Vivado, it will be lost funny.
The text was updated successfully, but these errors were encountered: