Does antlr have a length limit for parsing c/cpp code? #3599

Open
yaosheng-zhang opened this issue Jul 11, 2023 · 7 comments

Comments

@yaosheng-zhang

Does ANTLR have a length limit for parsing C/C++ code? I'm using ANTLR to parse a 2000-line C file, but the parser only gets through about the first 500 lines. When I delete the first 500 lines, it parses a few hundred lines of what remains. How can I get around this length limitation?

@kaby76
Contributor

kaby76 commented Jul 11, 2023

Please provide a few more details.

  • C and C++ really are two different languages. Is it the c grammar or the cpp grammar?
  • The targets all behave differently. What is the target (Cpp, CSharp, Dart, Go, Java, JavaScript, PHP, Python2, Python3, or TypeScript)?

@yaosheng-zhang
Author

I've attached two files. code.txt is the Java code I use to parse C code, and cfile.txt is the input, which has 378 lines of code. I use ANTLR to parse cfile.txt, but the parse result covers only 34 lines. Please help!
cfile.txt
code.txt

@kaby76
Contributor

kaby76 commented Jul 11, 2023

... I use antlr ...

What version of Antlr are you using?

OK, you are using the "Java" target.

For the c grammar, using Antlr 4.13.0, the CSharp (dotnet 7.0.305) target, on Ubuntu 20.04.6, on an AMD Ryzen 7 2700 Eight-Core Processor, 16GB DDR4, code.txt and cfile.txt each take about 0.13 s. cfile.txt is 377 lines long and code.txt is 50 lines (wc cfile.txt code.txt). Neither of these is over 500 lines long.

I tried it on a 1k line file from the GCC testsuite (Wmisleading-indentation.c). Took about the same amount of time.

NB: pre-processor directives should be ignored, but it looks like the c grammar parses only two types of directives. That's wrong. #3601

@kaby76
Contributor

kaby76 commented Jul 11, 2023

Updated the grammar for parsing preprocessor directives. #3602

For the Java target, using "grouped parsing" (aka "warm up parsing"), these are the runtimes for each of the test files.

07/11-12:05:44 ~/issues/g4-3601/c/Generated-Java
$ bash run.sh ../examples/*.c
Java 0 ../examples/add.c success 0.038
Java 1 ../examples/BinaryDigit.c success 0.001
Java 2 ../examples/bt.c success 0.04
Java 3 ../examples/dialog.c success 0.002
Java 4 ../examples/FuncCallAsFuncArgument.c success 0.01
Java 5 ../examples/FuncCallwithVarArgs.c success 0.009
Java 6 ../examples/FuncForwardDeclaration.c success 0.002
Java 7 ../examples/FunctionCall.c success 0.003
Java 8 ../examples/FunctionPointer.c success 0.009
Java 9 ../examples/FunctionReturningPointer.c success 0.004
Java 10 ../examples/helloworld.c success 0.0
Java 11 ../examples/integrate.c success 0.013
Java 12 ../examples/ll.c success 0.002
Java 13 ../examples/ParameterOfPointerType.c success 0.001
Java 14 ../examples/pr403.c success 0.0
Java 15 ../examples/TypeCast.c success 0.007
Java 16 ../examples/Wmisleading-indentation.pp.c success 0.073
Total Time: 0.405
07/11-12:06:00 ~/issues/g4-3601/c/Generated-Java
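
For readers unfamiliar with "grouped parsing": the idea is to parse every input inside one process, so JVM start-up and parser warm-up are paid once rather than per file. The following is only a minimal sketch of such a timing loop, not the actual run.sh or driver used above. `GroupedTiming`, `timeAll`, and the brace-counting stand-in "parser" are hypothetical names invented for illustration; a real driver would call the generated lexer and parser where the stand-in is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class GroupedTiming {
    /** One result row: the input text, whether the parse succeeded, elapsed seconds. */
    static final class Result {
        final String input;
        final boolean success;
        final double seconds;

        Result(String input, boolean success, double seconds) {
            this.input = input;
            this.success = success;
            this.seconds = seconds;
        }
    }

    /** Parses every input in the same process and times each parse separately. */
    static List<Result> timeAll(List<String> inputs, Predicate<String> parse) {
        List<Result> results = new ArrayList<>();
        for (String input : inputs) {
            long start = System.nanoTime();
            boolean ok = parse.test(input);
            double seconds = (System.nanoTime() - start) / 1e9;
            results.add(new Result(input, ok, seconds));
        }
        return results;
    }

    /** Hypothetical stand-in for a parser: succeeds when curly braces balance. */
    static boolean bracesBalanced(String src) {
        int depth = 0;
        for (char c : src.toCharArray()) {
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth < 0) {
                return false;
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        List<String> inputs = List.of("int main(void) { return 0; }", "void f(void) {");
        List<Result> results = timeAll(inputs, GroupedTiming::bracesBalanced);
        for (int i = 0; i < results.size(); i++) {
            Result r = results.get(i);
            System.out.println(i + " " + (r.success ? "success" : "fail") + " " + r.seconds);
        }
    }
}
```

With a real ANTLR parser in place of the stand-in, the first input in the group will often show a higher time than later ones of similar size, because prediction caches are still being built.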

@kaby76
Contributor

kaby76 commented Jul 11, 2023

OK. "code.txt" is your driver code for the Java target.

"cfile.txt" is NOT a C-language file. It's a C++ source file. For example, it contains a class declaration "class ImageServer". Classes do not exist in the C language. So, you are using the wrong grammar.

This cannot be parsed by the c grammar; it needs the cpp grammar. Starting over...
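
A quick way to catch this kind of mix-up before parsing is to scan the input for constructs that exist only in C++. The sketch below is a rough heuristic under stated assumptions, not part of ANTLR or grammars-v4: `GrammarGuess` and `guessGrammar` are hypothetical names, the pattern list is illustrative rather than exhaustive, and matches inside comments or string literals will produce false positives.

```java
import java.util.List;
import java.util.regex.Pattern;

class GrammarGuess {
    // Constructs that appear in C++ but not in C. Illustrative only, not exhaustive.
    private static final List<Pattern> CPP_ONLY = List.of(
        Pattern.compile("\\bclass\\s+\\w+"),          // class ImageServer
        Pattern.compile("\\busing\\s+namespace\\b"),  // using namespace mavlinkcom;
        Pattern.compile("\\bnamespace\\s+\\w+"),      // namespace foo { ... }
        Pattern.compile("\\btemplate\\s*<"),          // template <typename T>
        Pattern.compile("\\w+::\\w+")                 // std::string, UnitTests::RunAll
    );

    /** Returns "cpp" if any C++-only construct is found, otherwise "c". */
    static String guessGrammar(String source) {
        for (Pattern p : CPP_ONLY) {
            if (p.matcher(source).find()) {
                return "cpp";
            }
        }
        return "c";
    }

    public static void main(String[] args) {
        // prints "cpp"
        System.out.println(guessGrammar("void UnitTests::RunAll(std::string comPort, int boardRate) {}"));
        // prints "c"
        System.out.println(guessGrammar("int main(void) { return 0; }"));
    }
}
```

A "c" result only means none of the listed patterns matched; it does not prove the input is valid C.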

@kaby76
Contributor

kaby76 commented Jul 11, 2023

$ tail -n +15 /c/Users/Kenne/Downloads/cfile.txt | head
STRICT_MODE_OFF
#include "json.hpp"
STRICT_MODE_ON
#include <iostream>
using namespace mavlink_utils;
using namespace mavlinkcom;
extern std::string replaceAll(std::string s, char toFind, char toReplace);
void UnitTests::RunAll(std::string comPort, int boardRate)
{
    com_port_ = comPort;
07/11-12:27:04 ~/issues/g4-3601/cpp/Generated-Java

This input is C++ source code that has not yet been preprocessed. It cannot be parsed cleanly with the cpp grammar because the macro call STRICT_MODE_OFF is not a C++ statement. The input should be the source code after preprocessing.

However, with the cpp grammar, the input is still parsed, though with errors and rather slowly.

$ bash run.sh /c/Users/Kenne/Downloads/cfile.txt
line 19:0 no viable alternative at input 'STRICT_MODE_OFF#include "json.hpp"\rSTRICT_MODE_ON#include <iostream>\rusing'
Java 0 C:/Users/Kenne/Downloads/cfile.txt fail 1.498
Total Time: 1.662
07/11-12:37:18 ~/issues/g4-3601/cpp/Generated-Java
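
One way to get "the source code after preprocessing" from a Java driver is to run the compiler's preprocessor first and hand its output to the parser. The sketch below assumes GCC is installed; `-E` (stop after preprocessing), `-P` (omit linemarkers), `-I` and `-o` are standard GCC options. The class `Preprocess`, its methods, and the file names `cfile.cpp`, `cfile.pp.cpp`, and `include` are hypothetical, chosen for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class Preprocess {
    /** Builds a command that only preprocesses `input` and writes the result to `output`. */
    static List<String> buildCommand(String compiler, Path input, Path output, List<Path> includeDirs) {
        List<String> cmd = new ArrayList<>();
        cmd.add(compiler);          // e.g. "gcc" for C, "g++" for C++
        cmd.add("-E");              // stop after the preprocessing stage
        cmd.add("-P");              // do not emit linemarkers such as: # 1 "file.c"
        for (Path dir : includeDirs) {
            cmd.add("-I");          // extra header search directory
            cmd.add(dir.toString());
        }
        cmd.add(input.toString());
        cmd.add("-o");
        cmd.add(output.toString());
        return cmd;
    }

    /** Runs the command and returns true when the process exits with code 0. */
    static boolean run(List<String> cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        return p.waitFor() == 0;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand(
            "g++", Path.of("cfile.cpp"), Path.of("cfile.pp.cpp"), List.of(Path.of("include")));
        // prints: g++ -E -P -I include cfile.cpp -o cfile.pp.cpp
        System.out.println(String.join(" ", cmd));
        // Calling run(cmd) additionally requires g++ on the PATH and all headers to be reachable.
    }
}
```

Note that GCC picks the language from the file extension, so an input named cfile.txt would need to be renamed or passed with an explicit `-x c++`. Macros such as STRICT_MODE_OFF are expanded only if the header that defines them is found during preprocessing.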

@yaosheng-zhang
Author

I'm using ANTLR 4.9 through Maven. Does that mean a .c file longer than 500 lines can't be parsed? I downloaded c.g4 from the official ANTLR repository. Do I need to preprocess the input myself? Is there a .g4 file that can parse both C and C++?
