Parsing C source file

Parsing C source file - java

If I have a C source file and I want to locate a specific local variable within a function and make it global, so that another tool (one I didn't write) can process the C file, what would be the easiest way to do this? I was thinking of using regex, but even that poses its own problems. It's kind of like writing a mini C parser in Java... a lot of work :S
Are there any libraries that can help make this easier?
For example, say I want to make the variable "i" into a global variable. The user will specify the function name and the variable name (but not the variable's type, i.e. "int").
I can use regex to find the function, sure. But from there I really don't know what the best approach would be. Will the CDT plugin help?
Example:
/*
* add.c
* a simple C program
*
*/
#include <stdio.h>
#define LAST 10
int main()
{
int i = 0;
int sum = 0;
for ( i = 1; i <= LAST; i++ ) {
sum += i;
} /*-for-*/
printf("sum = %d\n", sum);
return 0;
}
converted to:
/*
* add.c
* a simple C program
*
*/
#include <stdio.h>
#define LAST 10
int i = 0;
int main()
{
int sum = 0;
for ( i = 1; i <= LAST; i++ ) {
sum += i;
} /*-for-*/
printf("sum = %d\n", sum);
return 0;
}

If you do only trivial examples, you can hack this with Perl or some java regex. It won't work reliably on complex programs, because you need a real parser.
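To make the "trivial examples only" caveat concrete, here is a minimal sketch of such a regex hack in Java. The class name `HoistLocalSketch`, the method `hoist`, and both patterns are my own illustrative assumptions, not a recommended tool. It only handles a one-line declaration such as `int i = 0;`, and it ignores comments, strings, macros, multi-variable declarations, and the C rule that file-scope initializers must be constant expressions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive sketch: moves the first one-line declaration of `var` found after the
// opening brace of `func` to just before the function. Not a C parser.
class HoistLocalSketch {

    static String hoist(String src, String func, String var) {
        // Function header such as "int main()" followed by "{" (simplified, hypothetical pattern).
        Pattern header = Pattern.compile(
            "(?m)^[\\w \\t*]+\\b" + Pattern.quote(func) + "\\s*\\([^)]*\\)\\s*\\{");
        Matcher h = header.matcher(src);
        if (!h.find()) {
            return src; // function not found: leave the source untouched
        }
        // A whole-line declaration such as "int i = 0;" (no arrays, no "int i, j;").
        Pattern decl = Pattern.compile(
            "(?m)^[ \\t]*[A-Za-z_][\\w \\t*]*?\\b" + Pattern.quote(var)
            + "\\b[ \\t]*(=[^;]*)?;[ \\t]*\\r?\\n");
        Matcher d = decl.matcher(src);
        if (!d.find(h.end())) {
            return src; // no matching declaration after the opening brace
        }
        String declaration = d.group().trim();
        // Re-assemble: text before the function, the declaration, then the
        // function with the declaration line cut out.
        return src.substring(0, h.start())
            + declaration + "\n"
            + src.substring(h.start(), d.start())
            + src.substring(d.end());
    }

    public static void main(String[] args) {
        String src = "#define LAST 10\nint main()\n{\nint i = 0;\nint sum = 0;\nreturn 0;\n}\n";
        System.out.print(hoist(src, "main", "i"));
    }
}
```

Anything beyond this shape (a declaration inside a nested block, a name shadowed in another scope, a macro that expands to a declaration) will be mishandled, which is why a real parser is needed for real programs.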
Our DMS Software Reengineering Toolkit and its C Front End could be used to do this pretty reliably.
DMS provides general-purpose program analysis and transformation capability, parameterized by a programming language description. DMS's C Front End explains to DMS what the precise syntax is for C (for a variety of dialects of C, including GCC and MS); in effect it provides a complete parser that produces abstract syntax trees (and the inverse: a C code generator from the ASTs). This allows DMS to read C source files accurately, including preprocessing.
With the parsed code in AST form, you can build DMS functions and/or write patterns to find function definitions and, in particular, your targeted variable. DMS code or, alternatively, source-to-source transforms can then be used to lift the variable out of the function and/or insert code to track state changes of that variable so it can be seen.
So, with DMS and some custom code, you can achieve your desired effect. The example you provided is probably pretty simple to do with DMS, but the learning curve will still be steep; DMS is complex because the languages it handles are complex, and you have to learn how to use it. So, this isn't an afternoon's exercise for a newbie.
Note: you will want to do this to preprocessed programs (otherwise you won't be generally able to parse them reliably). So, this should be something you do just before compilation, and shouldn't become part of the finalized code.
If you want to make permanent code changes, you'll need to parse the unpreprocessed code; that's a heckuva lot harder. DMS's C front end can do this to the extent the preprocessor directives are "structured"; about 95% of them are. So now you have a new problem: either fix the unstructured ones (a one time manual change), or reject files that can't be parsed with "tough luck".
You might use GCC instead of DMS; after all it has a very well tested C parser. It won't help you generate modified C code, though. Another alternative is Clang, which is coming up fast as a pretty good alternative. I think it will parse C++; not so sure about C or in particular the dialect of C your end user may be using (you didn't say). It has ASTs like DMS, and a kind of scheme for generating "patches" to code that might work.

The first thing I would demand is a complete specification of exactly when this is required and why, and how to identify when it is safe to do so without adversely affecting the program semantics. This is a really bad idea. Clearly those who gave you the assignment have no idea of either the implementation complexity, which is immense, or the adverse semantic effects. I am guessing that they will therefore be unable to come up with an adequate specification either, which will ultimately let you out.
I would also draw their attention to this discussion, especially Ira Baxter's comments. I used to build compilers for a living. It is not a task to learn, or ask about, on a forum.

Even if you are able to come up with a way to make such transformations, I think it's not a good idea. The program will not stay the same since you move around construction and destruction. Also, not all types are default constructable or copyable so in general the transformation is not possible.
Are you interested only in a few simple types? Then make that a part of the solution. Is the original code generated? If not, how can you trust that local objects can be identified by name only? The same name may also be used for different types of objects.

Related

Is there any way to write parsing logic using json?

I have a map in java Map<String,Object> dataMap whose content looks like this -
{country=Australia, animal=Elephant, age=18}
Now while parsing the map the use of various conditional statements may be made like-
if(dataMap.get("country").contains("stra")
OR
if(dataMap.get("animal") || 100 ==0)
OR
Some other operation inside if
I want to create a config file that contains all the rules on how the data inside the Map should look like. In simple words, I want to define the conditions that value corresponding to keys country, animal, and age should follow, what operations should be performed on them, all in the config file, so that the if elses and extra code can be removed. The config file will be used for parsing the map.
Can someone tell me how such a config file can be written, and how can it be used inside Java?
Sample examples and code references will be of help.
I am thinking of creating a json file for this purpose
Example -
Boolean b = true;
List<String> conditions = new ArrayList<>();
if(dataMap.get("animal").toString().contains("pha")){
conditions.add("condition1 satisifed");
if(((Integer.parseInt(dataMap.get("age").toString()) || 100) ==0)){
conditions.add("condition2 satisifed");
if(dataMap.get("country").equals("Australia")){
conditions.add("condition3 satisifed");
}
else{
b=false;
}
}
else{
b=false;
}
}
else{
b=false;
}
Now suppose I want to define the conditions in a config file for each map value like the operation ( equals, OR, contains) and the test values, instead of using if else's. Then the config file can be used for parsing the java map

Just to manage expectations: Doing this in JSON is a horrible, horrible idea.
To give you some idea of what you're trying to make:
Grammars like this are best visualized as a tree structure. The 'nodes' in this tree are:
'atomics' (100 is an atom, so is "animal", so is dataMap).
'operations' (+ is an operation, so is or / ||).
potentially, 'actions', though you can encode those as operations.
Java works like this, so do almost all programming languages, and so does a relatively simple 'mathematical expression engine', such as something that can evaluate e.g. the string "(1 + 2) * 3 + 5 * 10" into 59.
In java, dataMap.get("animal") || 100 ==0 is parsed into this tree:
            OR operation
           /            \
   INVOKE get[1]       equality
    /        \         /      \
dataMap   "animal"  INT(100)  INT(0)
where [1] is stored, to be very precise, as INVOKEVIRTUAL java.util.Map :: get(Object), with an IDENT node (an atomic with value dataMap) as the 'receiver' and an args list node that contains 1 element, the string literal atomic "animal".
Once you see this tree, you see how the notion of precedence works: your engine will need to be capable of representing both (1 + 2) * 3 and 1 + (2 * 3), so doing this without trees is not really possible unless you delve into bizarre syntax where the lexical ordering matches the processing ordering. (If you want that, look at how reverse Polish notation calculators work, or at stack-based language design such as Forth. I don't think you'll like what you find there.)
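To illustrate the tree idea, here is a minimal sketch of an expression tree in Java with only integer atoms and the + and * operations. The names `ExprTreeSketch`, `Node`, `Num` and `BinOp` are made up for this example. The point is only that the two groupings are different trees, so precedence lives in the shape of the tree rather than in the evaluator.

```java
// Minimal expression tree: 'atomics' are integers, 'operations' are + and *.
class ExprTreeSketch {

    interface Node {
        int eval();
    }

    // An atomic node: a literal integer.
    static final class Num implements Node {
        final int value;
        Num(int value) { this.value = value; }
        public int eval() { return value; }
    }

    // An operation node with exactly two children.
    static final class BinOp implements Node {
        final char op;
        final Node left;
        final Node right;
        BinOp(char op, Node left, Node right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }
        public int eval() {
            int l = left.eval();
            int r = right.eval();
            switch (op) {
                case '+': return l + r;
                case '*': return l * r;
                default: throw new IllegalArgumentException("unknown op: " + op);
            }
        }
    }

    public static void main(String[] args) {
        Node a = new BinOp('*', new BinOp('+', new Num(1), new Num(2)), new Num(3)); // (1 + 2) * 3
        Node b = new BinOp('+', new Num(1), new BinOp('*', new Num(2), new Num(3))); // 1 + (2 * 3)
        System.out.println(a.eval()); // 9
        System.out.println(b.eval()); // 7
    }
}
```

A JSON rule format would have to encode exactly this nesting, which is where it gets painful to write by hand.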
You're already making language design decisions here. Apparently, you think the language should adopt a 'truthy'/'falsy' concept, where dataMap.get("animal") which presumably returns an animal object, is to be considered as 'true' (as you're using it in a boolean operation) if, presumably, it isn't null or whatnot.
So, you're designing an entire programming language here. Why handicap yourself by enforcing that it is written in, of all things, JSON, which is epically unsuitable for the job? Go whole hog and write an entire language. It'll take 2 to 3 years, of course. Doing it in json isn't going to knock off more than a week off of that total, and make something that is so incredibly annoying to write, nobody would ever do it, buying you nothing.
The language will also naturally trend towards turing completeness. Once a language is turing complete, it becomes mathematically impossible to answer such questions as: "Is this code ever going to actually finish running or will it loop forever?" (see 'halting problem'), you have no idea how much memory or CPU power it takes, and other issues that then result in security needs. These are solvable problems (sandboxing, for example), but it's all very complicated.
The JVM is, what, 2000 personyears worth of experience and effort?
If you've got 2000 years to write all this, by all means. The point is: there is no 'simple' way here. Either it's a woefully incomplete thing that never feels like you can actually do what you'd want to do (which is express arbitrary ideas in a manner that feels natural enough, can be parsed by your system, and still makes sense when you read it back), or it's as complex as any language would be.
Why not just ... use a language? Let folks write not JSON but write full blown java, or js, or python, or ruby, or lua, or anything else that already exists, is open source, seems well designed?
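If "use a language" ends up meaning plain Java, one possible shape is a table of predicates keyed by map key. This is only a sketch under assumptions: `MapRulesSketch`, `rules` and `satisfiesAll` are names I made up, and the age rule is a placeholder because the question's own age condition is unclear.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Sketch: the "rules" are ordinary Java lambdas instead of a JSON mini-language.
class MapRulesSketch {

    // Hypothetical rule table, one predicate per key.
    static Map<String, Predicate<Object>> rules() {
        Map<String, Predicate<Object>> rules = new LinkedHashMap<>();
        rules.put("animal", v -> v.toString().contains("pha"));
        rules.put("age", v -> Integer.parseInt(v.toString()) >= 18); // placeholder condition
        rules.put("country", v -> v.equals("Australia"));
        return rules;
    }

    // True only if every rule's key is present and its predicate accepts the value.
    static boolean satisfiesAll(Map<String, Object> data, Map<String, Predicate<Object>> rules) {
        for (Map.Entry<String, Predicate<Object>> rule : rules.entrySet()) {
            Object value = data.get(rule.getKey());
            if (value == null || !rule.getValue().test(value)) {
                return false;
            }
        }
        return true;
    }
}
```

The nested if/else chain collapses into one loop, and adding a rule is one line, without inventing a parser.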

Count how many list entries have a string property that ends with a particular char

I have an array list with some names inside it (first and last names). What I have to do is go through each "first name" and see how many times a character (which the user specifies) shows up at the end of every first name in the array list, and then print out the number of times that character showed up.
public int countFirstName(char c) {
int i = 0;
for (Name n : list) {
if (n.getFirstName().length() - 1 == c) {
i++;
}
}
return i;
}
That is the code I have. The problem is that the counter (i) doesn't add 1 even if there is a character that matches the end of the first name.

You're comparing the index of last character in the string to the required character, instead of the last character itself, which you can access with charAt:
String firstName = n.getFirstName();
if (firstName.charAt(firstName.length() - 1) == c) {
i++;
}
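For completeness, a self-contained sketch of the whole method. The `Name` class here is a hypothetical stand-in for the asker's class, and the list is passed in as a parameter so the example runs on its own. I've also added a guard for empty first names, which the original does not need to have but which avoids an exception from charAt.

```java
import java.util.Arrays;
import java.util.List;

class LastCharCountSketch {

    // Hypothetical stand-in for the asker's Name class.
    static final class Name {
        private final String firstName;
        Name(String firstName) { this.firstName = firstName; }
        String getFirstName() { return firstName; }
    }

    static int countFirstName(List<Name> list, char c) {
        int count = 0;
        for (Name n : list) {
            String firstName = n.getFirstName();
            // Compare the last character itself, not its index.
            // The isEmpty() check avoids charAt(-1) on an empty name.
            if (!firstName.isEmpty() && firstName.charAt(firstName.length() - 1) == c) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Name> names = Arrays.asList(new Name("Anna"), new Name("Maria"), new Name("John"));
        System.out.println(countFirstName(names, 'a')); // 2
    }
}
```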

When you're setting out learning to code, there is great value in using pencil and paper, or describing your algorithm ahead of time, in the language you think in. Most people who learn a foreign language start out by assembling a sentence in their native language, translating it into the foreign language, then speaking it. Few, if any, learners of a foreign language are able to think in it natively.
Coding is no different; all your life you've been speaking English and thinking in it. Now you're aiming to learn a different pattern of thinking, syntax, key words. This task will go a lot easier if you:
work out in high level natural language what you want to do first
write down the steps in clear and simple language, like a recipe
don't try to do too much at once
Had I been a tutor marking your program, I'd have been looking for something like this:
//method to count the number of list entries ending with a particular character
public int countFirstNamesEndingWith(char lookFor) {
//declare a variable to hold the count
int cnt = 0;
//iterate the list
for (Name n : list) {
//get the first name
String fn = n.getFirstName();
//get the last char of it
char lc = fn.charAt(fn.length() - 1);
//compare
if (lc == lookFor) {
cnt++;
}
}
return cnt;
}
Taking the bullet points in turn:
The comments serve as a high-level description of what must be done. We write them all first, before writing a single line of code. My course penalised uncommented code, and writing the comments first was a handy way of getting that requirement out of the way (they're a chore, right? Not always, but..). It is also really easy to write an algorithm in high-level language and then translate the steps into the language you're learning. I definitely think that if you'd taken this approach you wouldn't have made the error you did, as it would have been clear that the code you wrote didn't implement the algorithm you'd described earlier.
Don't try to do too much in one line. Yes, I'm sure plenty of coders think it looks cool, or trick, or shows off what impressive coding smarts they have to pack a good 10 line algorithm into a single line of code that uses some obscure language features but one day it's highly likely that someone else is going to have to come along to maintain that code, improve it or change part of what it does - at that moment it's no longer cool, and it was never really a smart thing to do
Aominee, in their comment, actually gives us something like an example of this:
return (int)list.stream().filter(e -> e.charAt(e.length()-1)==c).count();
It's a one line implementation of a solution to your problem. Cool huh? Well, it has a bug* (for a start) but it's not the main thrust of my argument. At a more basic level: have you got any idea what it's doing? can you look at it and in 2 seconds tell me how it works?
It's quite an advanced language feature, it's trick for sure, but it might be a very poor solution because it's hard to understand, hard to maintain as a result, and does a lot while looking like a little- it only really makes sense if you're well versed in the language. This one line bundles up a facility that loops over your list, a feature that effectively has a tiny sub method that is called for every item in the list, and whose job is to calculate if the name ends with the sought char
It's a brilliant feature, a cute example, and it surely has its place in production Java, but its place is probably not here, in your learning exercise.
Similarly, I'd go as far to say that this line of yours:
if (n.getFirstName().length() - 1 == c) {
Is approaching "doing too much" - I say this because it's where your logic broke down; you didn't write enough code to effectively implement the algorithm. You'd actually have to write even more code to implement this way:
if (n.getFirstName().charAt(n.getFirstName().length() - 1) == c) {
This is a right eyeful to load into your brain and understand. The accepted answer broke it down a bit by first getting the name into a temporary variable. That's a sensible optimisation. I broke it out another step by getting the last char into a temp variable. In a production system I probably wouldn't go that far, but this is your learning phase - try to minimise the number of operations each of your lines does. It will aid your understanding of your own code a great deal
If you do ever get a penchant for writing as much code as possible in as few chars, look at some code golf games here on the Stack Exchange network; the game is to abuse as many language features as possible to make really short, trick code.. pretty much every winner stands as a testament to condensed code that should never, ever be put into a production system maintained by normal coders who value their sanity.
*the bug is it doesn't get the first name out of the Name object
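For reference, a sketch of what the one-liner looks like once that bug is fixed, i.e. with the first name extracted before the comparison. As before, `Name` is a hypothetical stand-in and the list is passed in as a parameter; the empty-string guard is my addition.

```java
import java.util.Arrays;
import java.util.List;

class StreamCountSketch {

    // Hypothetical stand-in for the asker's Name class.
    static final class Name {
        private final String firstName;
        Name(String firstName) { this.firstName = firstName; }
        String getFirstName() { return firstName; }
    }

    static int countFirstName(List<Name> list, char c) {
        return (int) list.stream()
            .map(Name::getFirstName)                                    // Name -> first name
            .filter(s -> !s.isEmpty() && s.charAt(s.length() - 1) == c) // keep names ending in c
            .count();
    }

    public static void main(String[] args) {
        List<Name> names = Arrays.asList(new Name("Anna"), new Name("Maria"), new Name("John"));
        System.out.println(countFirstName(names, 'a')); // 2
    }
}
```

It does the same job as the loop; whether it is easier to read depends on how comfortable the reader already is with streams, which is the point being made above.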

How are keywords represented in binary form?

For ex:: In java, how is the sin() represented in binary? How is sqrt() and other functions represented.
If not only in java, in any language how is it represented?? because ultimately everything is translated into binary and then into on and off signals.
Thanks in advance.

Firstly, sin is not a keyword in Java. It is an identifier. Keywords are things like if, class, and so on.
It depends on when you are asking about.
In the source code, the sin identifier is represented as characters, and those characters are represented as bits (i.e. binary) .... if you want to look at it that way.
In the classfile that is output by the javac compiler, the word sin is represented as string in the Constant Pool. (The JVM spec specifies the format of classfiles in great detail.)
When the classfile is first loaded by a JVM, the word sin becomes a Java String object.
When the code is linked by the JVM, the reference to the String is resolved to some kind of reference to a method. (The details are implementation specific. You'd need to read the JVM source code to find out more.)
When the code is JIT compiled, the reference to the method (typically) turns into the address in memory of the first native instruction of the JIT compiled method. (Strictly speaking, this is not "assembly language". But the native instructions could be represented as assembly language. Assembly language is really just a "human friendly" textual representation of the instructions.)
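The first stage in that list (source characters represented as bits) is easy to see for yourself. A small sketch, assuming the source file is ASCII-encoded; the class and method names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

class SourceBitsSketch {

    // Renders each ASCII byte of the text as 8 binary digits, separated by spaces.
    static String toBits(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.US_ASCII)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            out.append("00000000".substring(bits.length())).append(bits); // left-pad to 8 bits
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBits("sin")); // 01110011 01101001 01101110
    }
}
```

Those bits are just the letters s, i, n. Nothing in them says "sine"; the meaning only appears in the later stages, when the name is resolved to a method.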
so how does the computer know that when sin is written it has to do the sine of a number.
What happens is that the Java runtime loads that class containing the method. Then it looks for the sin(double) method in the class that it loaded. What typically happens is that the named method resolves to some bytecodes that are the instructions that tell the runtime what the method should do. But in the case of sin, the method is a native method, and the instructions are actually native instructions that are part of one of the JVM's native libraries.
If not of methods, Can we have binary representation of Keywords?? Like int, and float etc??
It depends on the actual keywords. But generally speaking, genuine Java keywords are transformed by the compiler into a form that doesn't have a distinct / discrete representation for the individual keywords.

If not only in java, in any language how is it represented?? because ultimately everything is translated into binary and then into on and off signals.
This tells me that you probably have a fundamental misunderstanding of how programming languages are implemented. So instead of answering this question (it doesn't really have a proper answer other than "well they're not represented at all"), I will try to help you understand why this question is the wrong one to ask.
Your computer runs machine code, and only machine code. You can feed it any random sequence of bytes, it doesn't matter what they were intended to be, as soon as you point the program counter to it it will be interpreted as if it is machine code (of course giving it bytes that were not intended to be machine code is probably a bad idea). As a running example, I'll use this x64 code:
48 01 F7 48 89 F8 C3
If you have no idea what's going on, that's normal at this level. Most people don't read machine code (but they could if they learned it, it's not magic). This is where the zeroes and ones are, to the processor it's not even in hexadecimal, that's just what humans like to read.
At a level above that there is assembly, which is in most cases really just a different way of looking at machine code, in such a way that humans find it easier to read. The example from earlier looks more sensible in assembly:
add rdi, rsi
mov rax, rdi
ret
Still not very clear what's going on to someone who doesn't know x64 assembly, but at least it gives some sort of clue: there's an add in it. It probably adds things.
At a yet higher level, you could have java bytecode or java, but I think the java aspect of this question misses the point, it's probably there because OP doesn't realize that java is different from "the classic picture". Java just complicates matters without explaining the big picture. Let's use C instead. The example in C could look like:
int64_t foo_or_whatever(int64_t x, int64_t y)
{
return x + y;
}
If you don't know C but you do know Java, the only strange thing here is int64_t, which is roughly the equivalent of a long in Java.
So yes, things were added, as the assembly code suggested. Now where did the keywords go?
That question doesn't make as much sense as you thought it did. The compiler understands keywords, and uses them to create machine code that implements your program. After that point they stop being relevant. They only mean something in the context of the high level language that you wrote the code in, you could say that at that level, they are stored as ASCII or UTF8 string in a file. They have nothing to do with machine code, they do not appear in any form there, and you can write machine code without having translated it from a high level language that has keywords. That return and ret looks vaguely similar is a bit of a red herring, they have something to do with each other but the relation is far from simple (that it worked out simply in the example I'm using is of course no accident).
The int64_t has perhaps not entirely disappeared (mostly it has, though). The fact that the addition operates on 64bit integers is encoded in the instruction 48 01 F7. Not the keyword int64_t (which isn't even a keyword, but let's not get into that), "the fact that what you have there is an addition between 64bit integers", which is a conceptually different thing though caused here by the use of int64_t. To split that instruction out while skipping some of the detail (because this is a beginner question), there's
48 = 01001000 encoding REX.W, meaning this instruction is 64bit
01 = 00000001 encoding add rm64, r64 in this case
F7 = 11110111 encoding the operands rdi and rsi
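To show that there is nothing magic in those bytes, here is a small sketch in Java that decodes just this one form (REX.W prefix 48, opcode 01, register-to-register operands) from the instruction 48 01 F7 used above. It is deliberately not a disassembler; `ModRmSketch` and `decodeAdd` are names invented for this example, and everything outside this single encoding is rejected.

```java
class ModRmSketch {

    // 64-bit register names indexed by their 3-bit encoding
    // (valid when the REX.R and REX.B extension bits are 0).
    static final String[] REGS = {"rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"};

    // Decodes "REX.W 01 /r" with both operands in registers, e.g. 48 01 F7.
    static String decodeAdd(int rex, int opcode, int modrm) {
        if (rex != 0x48 || opcode != 0x01) {
            throw new IllegalArgumentException("only 48 01 /r is handled in this sketch");
        }
        int mod = (modrm >> 6) & 0x3; // top 2 bits: addressing mode
        int reg = (modrm >> 3) & 0x7; // middle 3 bits: source register
        int rm = modrm & 0x7;         // low 3 bits: destination register
        if (mod != 3) {
            throw new IllegalArgumentException("only register-direct operands are handled");
        }
        return "add " + REGS[rm] + ", " + REGS[reg];
    }

    public static void main(String[] args) {
        System.out.println(decodeAdd(0x48, 0x01, 0xF7)); // add rdi, rsi
    }
}
```

The last byte splits as 11 110 111: mode 11 (both operands are registers), 110 is rsi, 111 is rdi.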
To learn more about what the processor does with machine code (in case your follow-up question is "but how does it know what to do with something like 48 01 F7"), study computer architecture. If you want a book, I recommend Computer Architecture, Fifth Edition: A Quantitative Approach, which is quite accessible to beginners and commonly used in first-year courses about computer architectures.
To learn more about the journey from high level language to machine code, study compiler construction. If you want a book, I recommend Compilers: Principles, Techniques, and Tools, but it may be hard to get through it as a beginner. If you want a free course, you could follow Compilers on Coursera (the first few lectures especially will give you an overview of what compilers do without getting too technical yet).
Incidentally, if you give the example C code to GCC, it makes
lea rax, [rdi + rsi]
ret
It's still doing the same thing, but in a way that didn't fit my story, so I took the liberty of doing it in a slightly different way.

sin() is a function so it's represented as a memory address where its code block is.
Keywords (like for) aren't represented as binary. For example, for is converted to a list of bytecode jump instructions, which are compiled into assembly instructions, which are represented as binary.
My point is that you cannot convert most keywords directly into binary. You can unroll them into bytecode which you could then convert to native machine code and binary but not directly to binary.
Here, read this then after you understand it move onto how bytecode is converted to native code.
Keywords and Functions
That said, a keyword in Java (and most languages) is a reserved word like for, while or return but your examples are not keywords, they are function names like sin() and sqrt()

Not really sure what you want to know here; so let's go "bytecode"...
Both the .sin() and .sqrt() methods are static methods from the Math class; therefore, the compiler will generate a call site that pushes the argument and then calls invokestatic with a reference to the method.
Other than invokestatic, you have invokevirtual, invokespecial, invokeinterface and (since Java 7) invokedynamic.
Now, at runtime, the JIT will kick in; and the JIT may end up producing pure native code, but this is not a guarantee. In any event, the code will be fast enough.
And the same goes for the JDK libraries themselves; the JIT will kick in and maybe turn the byte code into native code given a sufficient time to analyze it (escape analysis, inlining etc).
And since the JIT does "whatever it wants", you cannot reliably have a "binary" representation of any method from any class.

Compare Code Submissions with Previous Submissions?

Users submit code (mainly java) on my site to solve simple programming challenges, but sending the code to a server to compile and execute it can sometimes take more than 10 seconds.
To speed up this process, I plan to first check the submissions database to see if equivalent code has been submitted before. I realize this will cause Random methods to always return the same result, but that doesn't matter much. Is there any other potential problem that could be caused by not running the code?
To find matches, I remove comments and whitespace when comparing code. However, the same code can still be written in different ways, such as with different variable names. Is there a way to compare code that will find more equivalent code?

You could store a SHA1 hash of the code to compare with a previous submission. You are right that different variable names would give different hashes. Try running the code through a minifier or obfuscator. That way, variable cat and dog will both end up like a1, then you could see if they are unique. The only other way would be to actually compile it into bytecode, but then it's too late.
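A minimal sketch of the hashing idea, using the JDK's own MessageDigest. The normalization is the naive "strip comments and whitespace" step from the question, done with regexes, so it will be fooled by comment markers inside string literals. The class and method names are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class SubmissionHashSketch {

    // Naive normalization: drop block comments, line comments, then all whitespace.
    static String normalize(String code) {
        return code
            .replaceAll("(?s)/\\*.*?\\*/", "")
            .replaceAll("//[^\\n]*", "")
            .replaceAll("\\s+", "");
    }

    // Hex-encoded SHA-1 of the UTF-8 bytes of the text.
    static String sha1Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Key to store and look up in the submissions database.
    static String submissionKey(String code) {
        return sha1Hex(normalize(code));
    }
}
```

Two submissions that differ only in comments and layout get the same key; two that differ in variable names do not, which is where the minifier suggestion comes in.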
Instead of analyzing the source code, why not speed up the compilation? Try having a servlet container always running with a custom ClassLoader, and use the JDK tools.jar to compile on the fly. You could even submit the code via AJAX REST and get the results back the same way.
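As a rough sketch of compiling on the fly, the standard javax.tools API (available when running on a JDK) can compile a source string and a URLClassLoader can load the result. This version goes through a temporary directory rather than a custom in-memory ClassLoader, does no sandboxing at all, and never cleans up the temp files; `OnTheFlyCompileSketch` and `compileAndRun` are names invented for the example.

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

class OnTheFlyCompileSketch {

    // Compiles a public top-level class (default package) and invokes one of its
    // public static no-argument methods. WARNING: runs untrusted code unsandboxed.
    static Object compileAndRun(String className, String source, String methodName) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system compiler: run on a JDK, not a JRE");
        }
        Path dir = Files.createTempDirectory("submission");
        Path file = dir.resolve(className + ".java");
        Files.write(file, source.getBytes(StandardCharsets.UTF_8));

        // Equivalent to: javac -d <dir> <file>
        int status = compiler.run(null, null, null, "-d", dir.toString(), file.toString());
        if (status != 0) {
            throw new IllegalArgumentException("compilation failed");
        }
        try (URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
            Class<?> cls = Class.forName(className, true, loader);
            Method method = cls.getMethod(methodName);
            return method.invoke(null);
        }
    }
}
```

Keeping the JVM and the compiler warm in a long-running process avoids paying the startup cost on every submission, which is often a large part of a multi-second round trip.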
Consider how Eclipse compiles your files in the background.
Also, consider how http://ideone.com implements their online compiler.
FYI It is a big security risk to allow random code execution. You have to be very careful about hackers.

Variable names:
You can write code to match variable names in one file with the variable names in the other, then you can replace both sets with a consistent variable name.
File 1:
var1 += this(var1 - 1);
File 2:
sum += this(sum - 1);
After you read File 1, you look for what variable name File 2 is using in the place of sum, then make the variable names the same across both files.
*Note, if variables are used in similar ways you may get incorrect substitutions. This is most likely when variables are being declared. To help mitigate this, you can start searching for variable names at the bottom of the file and work up.
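One simple way to get consistent names is to rename every identifier to a placeholder in order of first appearance. Below is a naive sketch: it works on raw text with a regex, so it also renames method and type names and would touch identifiers inside string literals. The `RESERVED` set is a deliberately partial, hypothetical list, and the class name is made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RenameSketch {

    // Partial list for illustration; a real tool would use the full keyword set.
    static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
        "this", "int", "double", "if", "else", "for", "while", "return",
        "new", "class", "static", "void", "public"));

    // Replaces each non-reserved identifier with v0, v1, ... by order of first use.
    static String canonicalize(String code) {
        Map<String, String> names = new HashMap<>();
        Matcher m = Pattern.compile("\\b[A-Za-z_]\\w*\\b").matcher(code);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String id = m.group();
            String replacement = id;
            if (!RESERVED.contains(id)) {
                replacement = names.get(id);
                if (replacement == null) {
                    replacement = "v" + names.size();
                    names.put(id, replacement);
                }
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("var1 += this(var1 - 1);")); // v0 += this(v0 - 1);
        System.out.println(canonicalize("sum += this(sum - 1);"));   // v0 += this(v0 - 1);
    }
}
```

With both files canonicalized this way, the File 1 and File 2 lines above compare equal without having to match names across files.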
Short hands:
Force {} and () braces into each if/else/for/while/etc...
rewrite operations like "i+=..." as "i=i+..."
Functions:
In cases where function order doesn't matter, you can make sure functions are equivalent and then ignore them.
Operator precedence:
"3 + (2 * 4)" is usually equivalent to "2 * 4 + 3"
A way around this could be by determining the precedence of each operation and then matching it to an operation of the same precedence in the other set of code. Once a set of operations have been matched, you can replace them with a variable to represent them.
Ex.
(2+4) * 3 + (2+6) * 5 == someotherequation
//substitute most precedent: (2+4) and (2+6) for a and b
... a * 3 + b * 5
//substitute most precedent: (a*3) and (b*5) for c and d
... c + d
//substitute most precedent....
These are just a couple ways I could think of. If you do it this way, it'll end up being quite a big project... especially if you're working with multiple languages.

BigDecimal notation eclipse plugin or nice external tool

I need to make a lot of operations using BigDecimal, and I found having to express
Double a = b - c * d; //natural way
as
BigDecimal a = b.subtract(c.multiply(d)); //BigDecimal way
is not only ugly, but a source of mistakes and communication problems between me and business analysts. They were perfectly able to read code with Doubles, but now they can't.
Of course a perfect solution would be Java support for operator overloading, but since this is not going to happen, I'm looking for an Eclipse plugin or even an external tool that makes an automatic conversion from "natural way" to "bigdecimal way".
I'm not trying to preprocess source code or dynamic translation or any complex thing, I just want something I can input text and get text, and keep the "natural way" as a comment in source code.
P.S.: I've found this incredibly smart hack but I don't want to start doing bytecode manipulation. Maybe I can use that to create a Natural2BigDecimal translator, but I don't want to reinvent the wheel if someone has already done such a tool.
I don't want to switch to Scala/Groovy/JavaScript and I also can't, company rules forbid anything but java in server side code.

"I'm not trying to preprocess source code ... I just want something I can input [bigDecimal arithmetic expression] text".
Half of solving a problem is recognizing the problem for what it is. You exactly want something to preprocess your BigDecimal expressions to produce legal Java.
You have only two basic choices:
A stand-alone "domain specific language" and DSL compiler that accepts "standard" expressions and converts them directly to Java code. (This is one kind of preprocessor). This leaves you with the problem of keeping all the expression fragments around, and somehow knowing where to put them in the Java code.
A tool that reads the Java source text, finds such expressions, and converts them to BigDecimal in the text. I'd suggest something that let you code the expressions outside the actual code and inserted the translation.
Perhaps (stolen from another answer):
// BigDecimal a = b - c * d;
BigDecimal a = b.subtract( c.multiply( d ) );
with the meaning "compile the big decimal expression in the comment into its Java equivalent, and replace the following statement with that translation".
To implement the second idea, you need a program transformation system, which can apply source-to-source rewriting rules to transform the code (generation being a special case of transformation). This is just a preprocessor that is organized to be customizable to your needs.
Our DMS Software Reengineering Toolkit with its Java Front End could do this. You need a full Java parser to do that transformation part; you'll want name and type resolution so that you can parse/check the proposed expression for sanity.
While I agree that the as-is Java notation is ugly, and your proposal would make it prettier, my personal opinion is this isn't worth the effort. You end up with a dependency on a complex tool (yes, DMS is complex: manipulating code isn't easy) for a rather marginal gain.
If you and your team wrote thousands of these formulas, or the writers of such formulas were Java-naive, it might make sense. In that case, I'd go further, and simply insist you write the standard expression format where you need it. You could customize the Java Front End to detect when the operand types were of decimal type, and do the rewriting for you. Then you simply run this preprocessor before every Java compilation step.
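For the narrow text-in, text-out case only, the expression-to-method-call step can be sketched with a small recursive-descent translator. This is not DMS and does none of the name or type resolution discussed above: it assumes every operand is a BigDecimal variable, supports only identifiers, + - * / and parentheses, and rejects numeric literals. All names are invented for the example. Note also that BigDecimal.divide(x) with no rounding argument throws for non-terminating results, so real code would likely need a MathContext.

```java
// Translates e.g. "b - c * d" into "b.subtract(c.multiply(d))".
class BigDecimalNotationSketch {
    private final String src;
    private int pos;

    private BigDecimalNotationSketch(String src) { this.src = src; }

    static String translate(String expression) {
        BigDecimalNotationSketch p = new BigDecimalNotationSketch(expression);
        String result = p.expr();
        if (p.peek() != '\0') {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return result;
    }

    // expr := term (('+' | '-') term)*
    private String expr() {
        String left = term();
        while (true) {
            char op = peek();
            if (op == '+') { pos++; left = left + ".add(" + term() + ")"; }
            else if (op == '-') { pos++; left = left + ".subtract(" + term() + ")"; }
            else return left;
        }
    }

    // term := factor (('*' | '/') factor)*
    private String term() {
        String left = factor();
        while (true) {
            char op = peek();
            if (op == '*') { pos++; left = left + ".multiply(" + factor() + ")"; }
            else if (op == '/') { pos++; left = left + ".divide(" + factor() + ")"; }
            else return left;
        }
    }

    // factor := IDENTIFIER | '(' expr ')'
    private String factor() {
        char c = peek();
        if (c == '(') {
            pos++;
            String inner = expr();
            if (peek() != ')') {
                throw new IllegalArgumentException("expected ')' at " + pos);
            }
            pos++;
            return inner; // a method chain needs no extra parentheses
        }
        if (!Character.isJavaIdentifierStart(c)) {
            throw new IllegalArgumentException("expected identifier at " + pos);
        }
        int start = pos;
        while (pos < src.length() && Character.isJavaIdentifierPart(src.charAt(pos))) {
            pos++;
        }
        return src.substring(start, pos);
    }

    // Skips whitespace and returns the next character, or '\0' at end of input.
    private char peek() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
            pos++;
        }
        return pos < src.length() ? src.charAt(pos) : '\0';
    }
}
```

Calling `BigDecimalNotationSketch.translate("b - c * d")` yields `b.subtract(c.multiply(d))`, which can be pasted under a comment holding the natural form.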

I agree, it's very cumbersome! I use proper documentation (comments before each equation) as the best "solution" to this.
// a = b - c * d;
BigDecimal a = b.subtract( c.multiply( d ) );

You might go the route of an expression evaluator. There is a decent (albeit paid) one at http://www.singularsys.com/jep. ANTLR has a rudimentary grammar that also does expression evaluation (though I am not sure how it would perform) at http://www.antlr.org/wiki/display/ANTLR3/Expression+evaluator.
Neither would give you the compile-time safety you would have with true operators. You could also write the various algorithm-based classes in something like Scala, which does support operator overloading out of the box and would interoperate seamlessly with your other Java classes.
