Compare Code Submissions with Previous Submissions? - java

Users submit code (mainly Java) on my site to solve simple programming challenges, but sending the code to a server to compile and execute it can sometimes take more than 10 seconds.
To speed up this process, I plan to first check the submissions database to see if equivalent code has been submitted before. I realize this will cause Random methods to always return the same result, but that doesn't matter much. Is there any other potential problem that could be caused by not running the code?
To find matches, I remove comments and whitespace when comparing code. However, the same code can still be written in different ways, such as with different variable names. Is there a way to compare code that will find more equivalent code?

You could store a SHA-1 hash of the code to compare with previous submissions. You are right that different variable names would give different hashes. Try running the code through a minifier or obfuscator first. That way, variables cat and dog will both end up as something like a1, and you can then check whether the results match. The only other way would be to actually compile it into bytecode, but by then it's too late to save any time.
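As a minimal sketch of the hashing idea, the class below strips comments and whitespace and then hashes the result with SHA-1. The class and method names (SubmissionFingerprint, normalize, fingerprint) are made up for illustration, and the regex-based normalization is an assumption rather than a real Java lexer:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class SubmissionFingerprint {

    // Crude normalization: drop /* ... */ and // ... comments, then all whitespace.
    // Caveats: comment markers inside string literals are stripped too, and
    // removing every space can make different token sequences look identical.
    static String normalize(String source) {
        String noBlockComments = source.replaceAll("(?s)/\\*.*?\\*/", "");
        String noLineComments = noBlockComments.replaceAll("//[^\\n]*", "");
        return noLineComments.replaceAll("\\s+", "");
    }

    // Hex-encoded SHA-1 digest of the UTF-8 bytes of the text.
    static String sha1Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // The value you would store and look up in the submissions database.
    static String fingerprint(String source) {
        return sha1Hex(normalize(source));
    }
}
```

Two submissions that differ only in comments and spacing get the same fingerprint, so the lookup becomes a single indexed query on the hash column.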
Instead of analyzing the source code, why not speed up the compilation? Try having a servlet container always running with a custom ClassLoader, and use the JDK tools.jar to compile on the fly. You could even submit the code via AJAX REST and get the results back the same way.
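A rough sketch of compiling inside an already running JVM, using the standard javax.tools API (which exposes the same compiler that tools.jar provided). The class name OnTheFlyCompiler and the convention that a submission exposes a public static no-argument method are assumptions for illustration. This does no sandboxing at all:

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

class OnTheFlyCompiler {

    // Compiles `source`, which must declare a public class named `className`
    // with a public static no-arg method `methodName`, then invokes that method.
    static Object compileAndRun(String className, String source, String methodName)
            throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system compiler: run on a JDK, not a JRE");
        }

        // Write the submission to a fresh temporary directory.
        Path dir = Files.createTempDirectory("submission");
        Path file = dir.resolve(className + ".java");
        Files.write(file, source.getBytes(StandardCharsets.UTF_8));

        // Null streams mean System.in / System.out / System.err.
        int exitCode = compiler.run(null, null, null, "-d", dir.toString(), file.toString());
        if (exitCode != 0) {
            throw new IllegalArgumentException("Compilation failed");
        }

        // Load the freshly compiled class from the temporary directory.
        try (URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
            Class<?> cls = Class.forName(className, true, loader);
            Method method = cls.getMethod(methodName);
            return method.invoke(null);
        }
    }
}
```

Because the JVM and compiler stay warm between requests, most of the per-submission startup cost goes away. Running the loaded class still needs isolation (a separate process, time limits, restricted permissions), which this sketch leaves out.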
Consider how Eclipse compiles your files in the background.
Also, consider how http://ideone.com implements their online compiler.
FYI: it is a big security risk to allow arbitrary code execution. You have to be very careful about hackers.

Variable names:
You can write code to match variable names in one file with the variable names in the other, then you can replace both sets with a consistent variable name.
File 1:
var1 += this(var1 - 1);
File 2:
sum += this(sum - 1);
After you read File 1, you look for the variable name File 2 uses in the place of var1 (here, sum), then make the variable names the same across both files.
*Note, if variables are used in similar ways you may get incorrect substitutions. This is most likely when variables are being declared. To help mitigate this, you can start searching for variable names at the bottom of the file and work up.
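A simpler variation on the same idea, sketched below, avoids pairing the two files at all: each file is rewritten independently so that identifiers become v1, v2, ... in order of first appearance, and the rewritten texts are then compared. The class name and the short keyword list are assumptions for illustration. The sketch also renames method and type names and does not skip string literals:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdentifierNormalizer {

    // Deliberately incomplete keyword list, just enough for the example.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "this", "int", "long", "double", "boolean", "if", "else", "for", "while",
            "return", "new", "void", "class", "static", "true", "false", "null"));

    // An identifier that does not start in the middle of another token (e.g. the L in 10L).
    private static final Pattern IDENTIFIER =
            Pattern.compile("(?<![A-Za-z0-9_$])[A-Za-z_$][A-Za-z0-9_$]*");

    // Replaces every non-keyword identifier with v1, v2, ... by first appearance.
    static String normalize(String code) {
        Map<String, String> renamed = new HashMap<>();
        Matcher m = IDENTIFIER.matcher(code);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group();
            String replacement = name;
            if (!KEYWORDS.contains(name)) {
                replacement = renamed.get(name);
                if (replacement == null) {
                    replacement = "v" + (renamed.size() + 1);
                    renamed.put(name, replacement);
                }
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this, both example lines above normalize to `v1 += this(v1 - 1);`, so they compare as equal. Code that declares the same variables in a different order will still look different.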
Short hands:
Force {} and () braces into each if/else/for/while/etc...
rewrite operations like "i+=..." as "i=i+..."
Functions:
In cases where function order doesn't matter, you can make sure functions are equivalent and then ignore them.
Operator precedence:
"3 + (2 * 4)" is usually equivalent to "2 * 4 + 3"
A way around this could be by determining the precedence of each operation and then matching it to an operation of the same precedence in the other set of code. Once a set of operations have been matched, you can replace them with a variable to represent them.
Ex.
(2+4) * 3 + (2+6) * 5 == someotherequation
//substitute most precedent: (2+4) and (2+6) for a and b
... a * 3 + b * 5
//substitute most precedent: (a*3) and (b*5) for c and d
... c + d
//substitute most precedent....
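One way to sketch the reordering problem is to parse each expression and print it back in a canonical form where the operands of every + and * are sorted. The class below is an illustrative assumption, not a complete solution: it only handles +, *, parentheses, and simple names or integer literals, it does not flatten nested sums such as (1+2)+3, and it assumes + is numeric (string concatenation in Java is not commutative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class CanonicalExpression {
    private final String text;
    private int pos = 0;

    private CanonicalExpression(String text) {
        this.text = text.replaceAll("\\s+", "");
    }

    // Returns a fully parenthesized form with sorted operands.
    static String canonicalize(String expression) {
        CanonicalExpression parser = new CanonicalExpression(expression);
        String result = parser.sum();
        if (parser.pos != parser.text.length()) {
            throw new IllegalArgumentException("Unexpected input at " + parser.pos);
        }
        return result;
    }

    // sum := product ('+' product)*
    private String sum() {
        List<String> terms = new ArrayList<>();
        terms.add(product());
        while (peek() == '+') {
            pos++;
            terms.add(product());
        }
        return combine(terms, "+");
    }

    // product := atom ('*' atom)*
    private String product() {
        List<String> factors = new ArrayList<>();
        factors.add(atom());
        while (peek() == '*') {
            pos++;
            factors.add(atom());
        }
        return combine(factors, "*");
    }

    // atom := name-or-number | '(' sum ')'
    private String atom() {
        if (peek() == '(') {
            pos++;
            String inner = sum();
            if (peek() != ')') {
                throw new IllegalArgumentException("Expected ')' at " + pos);
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < text.length() && Character.isLetterOrDigit(text.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected operand at " + pos);
        }
        return text.substring(start, pos);
    }

    // Sorting the operands is what makes "a+b" and "b+a" print identically.
    private static String combine(List<String> operands, String op) {
        if (operands.size() == 1) {
            return operands.get(0);
        }
        Collections.sort(operands);
        return "(" + String.join(op, operands) + ")";
    }

    private char peek() {
        return pos < text.length() ? text.charAt(pos) : '\0';
    }
}
```

Both `3 + (2 * 4)` and `2 * 4 + 3` canonicalize to `((2*4)+3)`, while `(2 + 3) * 4` stays distinct from `2 + 3 * 4`.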
These are just a couple ways I could think of. If you do it this way, it'll end up being quite a big project... especially if you're working with multiple languages.

Related

Is there any way to write parsing logic using json?

I have a map in java Map<String,Object> dataMap whose content looks like this -
{country=Australia, animal=Elephant, age=18}
Now while parsing the map the use of various conditional statements may be made like-
if(dataMap.get("country").toString().contains("stra"))
OR
if(dataMap.get("animal") || 100 ==0)
OR
Some other operation inside if
I want to create a config file that contains all the rules on how the data inside the Map should look like. In simple words, I want to define the conditions that value corresponding to keys country, animal, and age should follow, what operations should be performed on them, all in the config file, so that the if elses and extra code can be removed. The config file will be used for parsing the map.
Can someone tell me how such a config file can be written, and how can it be used inside Java?
Sample examples and code references will be of help.
I am thinking of creating a json file for this purpose
Example -
Boolean b = true;
List<String> conditions = new ArrayList<>();
if(dataMap.get("animal").toString().contains("pha")){
    conditions.add("condition1 satisfied");
    if(((Integer.parseInt(dataMap.get("age").toString()) || 100) ==0)){
        conditions.add("condition2 satisfied");
        if(dataMap.get("country").equals("Australia")){
            conditions.add("condition3 satisfied");
        }
        else{
            b=false;
        }
    }
    else{
        b=false;
    }
}
else{
    b=false;
}
Now suppose I want to define the conditions in a config file for each map value like the operation ( equals, OR, contains) and the test values, instead of using if else's. Then the config file can be used for parsing the java map
Just to manage expectations: Doing this in JSON is a horrible, horrible idea.
To give you some idea of what you're trying to make:
Grammars like this are best visualized as a tree structure. The 'nodes' in this tree are:
'atomics' (100 is an atom, so is "animal", so is dataMap).
'operations' (+ is an operation, so is or / ||).
potentially, 'actions', though you can encode those as operations.
Java works like this, so do almost all programming languages, and so does a relatively simple 'mathematical expression engine', such as something that can evaluate e.g. the string "(1 + 2) * 3 + 5 * 10" into 59.
In java, dataMap.get("animal") || 100 ==0 is parsed into this tree:
                  OR operation
                 /            \
      INVOKE get[1]            equality
       /        \              /      \
  dataMap    "animal"     INT(100)   INT(0)
where [1] is stored as INVOKEVIRTUAL java.util.Map :: get(Object) with as 'receiver' an IDENT node, which is an atomic, with value dataMap, and an args list node which contains 1 element, the string literal atomic "animal", to be very precise.
Once you see this tree, you see how the notion of precedence works: your engine will need to be capable of representing both (1 + 2) * 3 and 1 + (2 * 3), so doing this without trees is not really possible unless you delve into bizarre syntax where the lexical ordering matches the processing ordering (if you want that, look at how reverse Polish notation calculators work, or at a stack-based language design such as Forth; I don't think you'll like what you find there).
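To make the tree idea concrete, here is a minimal sketch of the 'mathematical expression engine' mentioned above. It only supports integer literals, +, *, and parentheses. The names ExpressionTree, Num, and BinOp are invented for this example:

```java
class ExpressionTree {

    // Every node in the tree can compute its own value.
    interface Node {
        int eval();
    }

    // An 'atomic': an integer literal.
    static final class Num implements Node {
        final int value;
        Num(int value) { this.value = value; }
        @Override public int eval() { return value; }
    }

    // An 'operation': '+' or '*' applied to two subtrees.
    static final class BinOp implements Node {
        final char op;
        final Node left;
        final Node right;
        BinOp(char op, Node left, Node right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }
        @Override public int eval() {
            int l = left.eval();
            int r = right.eval();
            return op == '+' ? l + r : l * r;
        }
    }

    private final String text;
    private int pos = 0;

    private ExpressionTree(String text) {
        this.text = text.replaceAll("\\s+", "");
    }

    static Node parse(String expression) {
        ExpressionTree parser = new ExpressionTree(expression);
        Node root = parser.sum();
        if (parser.pos != parser.text.length()) {
            throw new IllegalArgumentException("Unexpected input at " + parser.pos);
        }
        return root;
    }

    // sum := product ('+' product)*   (lowest precedence, so it sits nearest the root)
    private Node sum() {
        Node left = product();
        while (peek() == '+') {
            pos++;
            left = new BinOp('+', left, product());
        }
        return left;
    }

    // product := atom ('*' atom)*
    private Node product() {
        Node left = atom();
        while (peek() == '*') {
            pos++;
            left = new BinOp('*', left, atom());
        }
        return left;
    }

    // atom := integer | '(' sum ')'
    private Node atom() {
        if (peek() == '(') {
            pos++;
            Node inner = sum();
            if (peek() != ')') {
                throw new IllegalArgumentException("Expected ')' at " + pos);
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < text.length() && Character.isDigit(text.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected number at " + pos);
        }
        return new Num(Integer.parseInt(text.substring(start, pos)));
    }

    private char peek() {
        return pos < text.length() ? text.charAt(pos) : '\0';
    }
}
```

`ExpressionTree.parse("(1 + 2) * 3 + 5 * 10").eval()` returns 59. Even this toy needs a parser and a node hierarchy, and it has no variables, no method calls, and no boolean logic yet.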
You're already making language design decisions here. Apparently, you think the language should adopt a 'truthy'/'falsy' concept, where dataMap.get("animal") which presumably returns an animal object, is to be considered as 'true' (as you're using it in a boolean operation) if, presumably, it isn't null or whatnot.
So, you're designing an entire programming language here. Why handicap yourself by requiring that it be written in, of all things, JSON, which is epically unsuitable for the job? Go whole hog and write an entire language. It'll take 2 to 3 years, of course. Doing it in JSON isn't going to knock more than a week off that total, and it produces something so incredibly annoying to write that nobody would ever do it, buying you nothing.
The language will also naturally trend towards Turing completeness. Once a language is Turing complete, it becomes mathematically impossible to answer questions such as "Is this code ever going to actually finish running, or will it loop forever?" (see the 'halting problem'). You also have no idea how much memory or CPU power a given script takes, and there are other issues that then result in security needs. These are solvable problems (sandboxing, for example), but it's all very complicated.
The JVM is, what, 2000 person-years' worth of experience and effort?
If you've got 2000 years to write all this, by all means. The point is: there is no 'simple' way here. Either it's a woefully incomplete thing that never feels like you can actually do what you'd want to do (which is express arbitrary ideas in a manner that feels natural enough, can be parsed by your system, and still makes sense when you read it back), or it's as complex as any language would be.
Why not just ... use a language? Let folks write not JSON but write full blown java, or js, or python, or ruby, or lua, or anything else that already exists, is open source, seems well designed?
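As one hedged illustration of "just use a language", the rules can live in plain Java as a table of predicates, one per map key. Everything here (MapRules, rules, failures) is a hypothetical sketch. The age rule in particular is an invented stand-in, because the `|| 100` condition in the question is not valid Java and its intent is unclear:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class MapRules {

    // Hypothetical rule table: one predicate per key, written in ordinary Java.
    static Map<String, Predicate<Object>> rules() {
        Map<String, Predicate<Object>> rules = new LinkedHashMap<>();
        rules.put("animal", v -> v.toString().contains("pha"));
        // Invented stand-in for the question's age condition.
        rules.put("age", v -> Integer.parseInt(v.toString()) >= 18);
        rules.put("country", v -> "Australia".equals(v));
        return rules;
    }

    // Returns the keys whose rule failed; an empty list means every rule passed.
    static List<String> failures(Map<String, Object> data,
                                 Map<String, Predicate<Object>> rules) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Predicate<Object>> rule : rules.entrySet()) {
            Object value = data.get(rule.getKey());
            if (value == null || !rule.getValue().test(value)) {
                failed.add(rule.getKey());
            }
        }
        return failed;
    }
}
```

This removes the nested if/else chain while keeping the full expressiveness of Java for each condition. The trade-off is that changing a rule means recompiling, which is what the config-file idea was trying to avoid.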

Unexpected results from Metaphone algorithm

I am using phonetic matching for different words in Java. I used Soundex, but it's too crude. I switched to Metaphone and found it was better. However, when I tested it rigorously, I found weird behaviour. I want to ask whether that's the way Metaphone works or whether I am using it the wrong way. The following example shows the problem:
import org.apache.commons.codec.language.Metaphone; // Apache Commons Codec
Metaphone meta = new Metaphone();
if (meta.isMetaphoneEqual("cricket","criket")) System.out.println("Match 1");
if (meta.isMetaphoneEqual("cricket","criketgame")) System.out.println("Match 2");
This prints:
Match 1
Match 2
Now "cricket" does sound like "criket", but how come "cricket" and "criketgame" are considered the same? If someone could explain this, it would be a great help.
Your usage is slightly incorrect. The default maximum code length is 4, so the encoding of the longer "criketgame" is truncated. Printing the maximum code length and the encoded strings shows this:
System.out.println(meta.getMaxCodeLen());
System.out.println(meta.encode("cricket"));
System.out.println(meta.encode("criket"));
System.out.println(meta.encode("criketgame"));
Output (note "criketgame" is truncated from "KRKTKM" to "KRKT", which matches "cricket"):
4
KRKT
KRKT
KRKT
Solution: Set the maximum code length to something appropriate for your application and the expected input. For example:
meta.setMaxCodeLen(8);
System.out.println(meta.encode("cricket"));
System.out.println(meta.encode("criket"));
System.out.println(meta.encode("criketgame"));
Now outputs:
KRKT
KRKT
KRKTKM
And now your original test gives the expected results:
Metaphone meta = new Metaphone();
meta.setMaxCodeLen(8);
System.out.println(meta.isMetaphoneEqual("cricket","criket"));
System.out.println(meta.isMetaphoneEqual("cricket","criketgame"));
Printing:
true
false
As an aside, you may also want to experiment with DoubleMetaphone, which is an improved version of the algorithm.
By the way, note the caveat from the documentation regarding thread-safety:
The instance field maxCodeLen is mutable but is not volatile, and accesses are not synchronized. If an instance of the class is shared between threads, the caller needs to ensure that suitable synchronization is used to ensure safe publication of the value between threads, and must not invoke setMaxCodeLen(int) after initial setup.

In dalvik, what expression will generate instructions 'not-int' and 'const-string/jumbo'?

I am new to Dalvik, and I want to write Java code that makes the compiler emit every Dalvik instruction so that I can dump them.
But there are still 3 instructions I cannot get no matter how I write the code.
They are 'not-int', 'not-long', and 'const-string/jumbo'.
I wrote this to get 'not-int', but it failed:
int y = ~x;
Dalvik generated an 'xor x, -1' instead.
I know 'const-string/jumbo' means that there are more than 65535 strings in the code and the index is 32 bits. But when I declared 70000 strings in the code, the compiler said the code was too long.
So the question is: how can I get 'not-int' and 'const-string/jumbo' in Dalvik from Java code?
const-string/jumbo is easy. As you noted, you just need to define more than 65535 strings, and reference one of the later ones. They don't all need to be in a single class file, just in the same DEX file.
Take a look at dalvik/tests/056-const-string-jumbo, in particular the "build" script that generates a Java source file with a large number of strings.
As far as not-int and not-long go, I don't think they're ever generated. I ran dexdump -d across a pile of Android 4.4 APKs and didn't find a single instance of either.
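For the const-string/jumbo part, a source generator along the following lines can spread the strings across many small classes so that no single class hits the "code too long" limit. This is an untested sketch, not the script from the Dalvik test suite. The class name JumboStringSource and the literal format are assumptions. It relies on the DEX string table being sorted by content, so the highest-numbered literals should land at indices above 65535:

```java
class JumboStringSource {

    // Builds the Java source of one class holding `count` distinct string
    // literals numbered from `first`, e.g. "s0000000", "s0000001", ...
    static String classSource(String className, int first, int count) {
        StringBuilder sb = new StringBuilder();
        sb.append("public class ").append(className).append(" {\n");
        sb.append("    public static final String[] STRINGS = {\n");
        for (int i = first; i < first + count; i++) {
            // Zero-padding keeps lexical order equal to numeric order.
            sb.append(String.format("        \"s%07d\",\n", i));
        }
        sb.append("    };\n");
        sb.append("}\n");
        return sb.toString();
    }

    // Example driver: 70 classes of 1000 strings each gives 70000 literals.
    // Writing each source to its own .java file is left to the caller.
    static String[] allSources(int classes, int stringsPerClass) {
        String[] sources = new String[classes];
        for (int c = 0; c < classes; c++) {
            sources[c] = classSource("Strings" + c, c * stringsPerClass, stringsPerClass);
        }
        return sources;
    }
}
```

After compiling all generated classes into one DEX file, the static initializers of the last few classes are where you would look for const-string/jumbo in the dexdump output.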

Are 0.0 and 1.0 considered magic numbers?

I know that -1, 0, 1, and 2 are exceptions to the magic number rule. However I was wondering if the same is true for when they are floats. Do I have to initialize a final variable for them or can I just use them directly in my program.
I am using it as a percentage in a class. If the input is less than 0.0 or greater than 1.0, then I want to set the percentage automatically to zero. So: if (0.0 <= input && input <= 1.0).
Thank you
Those numbers aren't really exceptions to the magic number rule. The common sense rule (as far as there is "one" rule), when it isn't simplified to the level of dogma, is basically, "Don't use numbers in a context that doesn't make their meaning obvious." It just so happens that these four numbers are very commonly used in obvious contexts. That doesn't mean they're the only numbers where this applies, e.g. if I have:
long kilometersToMeters(int km) { return km * 1000L; }
there is really no point in naming the number: it's obvious from the tiny context that it's a conversion factor. On the other hand, if I do this in some low-level code:
sendCommandToDevice(1);
it's still wrong, because that should be a constant kResetCommand = 1 or something like it.
So whether 0.0 and 1.0 should be replaced by a constant completely depends on the context.
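For the percentage case in the question, a sketch like the following arguably keeps the literals readable without named constants, because the surrounding class already says what the bounds mean. The class name Percentage is an assumption:

```java
class Percentage {
    private final double value;

    // Stores the input if it is a valid fraction, otherwise falls back to zero.
    Percentage(double input) {
        // 0.0 and 1.0 are the bounds of a fraction; the context makes that obvious.
        this.value = (0.0 <= input && input <= 1.0) ? input : 0.0;
    }

    double value() {
        return value;
    }
}
```

If the same bounds were checked in several unrelated places, or if the valid range might change, a named constant would start to pay for itself.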
It really depends on the context. The whole point of avoiding magic numbers is to maintain the readability of your code. Use your best judgement, or provide us with some context so that we may use ours.
Magic numbers are [u]nique values with unexplained meaning or multiple occurrences which could (preferably) be replaced with named constants.
http://en.wikipedia.org/wiki/Magic_number_(programming)
Edit: When to document code with variables names vs. when to just use a number is a hotly debated topic. My opinion is that of the author of the Wiki article linked above: if the meaning is not immediately obvious and it occurs multiple times in your code, use a named constant. If it only occurs once, just comment the code.
If you are interested in other people's (strongly biased) opinions, read
What is self-documenting code and can it replace well documented code?
Usually, every rule has exceptions (and this one too). It is a matter of style to use some mnemonic names for these constants.
For example:
int Rows = 2;
int Cols = 2;
Is a pretty valid example where usage of raw values will be misleading.
The meaning of the magic number should be obvious from the context. If it is not - give the thing a name.
Attaching a name for something creates an identity. Given the definitions
const double Moe = 2.0;
const double Joe = 2.0;
...
double Larry = Moe;
double Harry = Moe;
double Garry = Joe;
the use of symbols for Moe and Joe suggests that the default value of Larry and Harry are related to each other in a way that the default value of Garry is not. The decision of whether or not to define a name for a particular constant shouldn't depend upon the value of that constant, but rather whether it will non-coincidentally appear multiple places in the code. If one is communicating with a remote device which requires that a particular byte value be sent to it to trigger a reset, I would consider:
void ResetDevice()
{
// The 0xF9 command is described in the REMOTE_RESET section of the
// Frobnitz 9000 manual
transmitByte(0xF9);
}
... elsewhere
myDevice.ResetDevice();
...
otherDevice.ResetDevice();
to be in many cases superior to
// The 0xF9 command is described in the REMOTE_RESET section of the
// Frobnitz 9000 manual
const int FrobnitzResetCode = 0xF9;
... elsewhere
myDevice.transmitByte(FrobnitzResetCode );
...
otherDevice.transmitByte(FrobnitzResetCode );
The value 0xF9 has no real meaning outside the context of resetting the Frobnitz 9000 device. Unless there is some reason why outside code should prefer to send the necessary value itself rather than calling a ResetDevice method, the constant should have no value to any code outside the method. While one could perhaps use
void ResetDevice()
{
// The 0xF9 command is described in the REMOTE_RESET section of the
// Frobnitz 9000 manual
int FrobnitzResetCode = 0xF9;
transmitByte(FrobnitzResetCode);
}
there's really not much point to defining a name for something which is in such a narrow context.
The only thing "special" about values like 0 and 1 is that they are used significantly more often than other constants (e.g. 23) in cases where they have no domain-specific identity outside the context where they are used. If one is using a function which requires that the first parameter indicate the number of additional parameters (somewhat common in C), it's better to say:
output_multiple_strings(4, "Bob", Joe, Larry, "Fred"); // There are 4 arguments
...
output_multiple_strings(4, "George", Fred, "James", Lucy); // There are 4 arguments
than
#define NUMBER_OF_STRINGS 4 // There are 4 arguments
output_multiple_strings(NUMBER_OF_STRINGS, "Bob", Joe, Larry, "Fred");
...
output_multiple_strings(NUMBER_OF_STRINGS, "George", Fred, "James", Lucy);
The latter statement implies a stronger connection between the value passed to the first method and the value passed to the second, than exists between the value passed to the first method and anything else in that method call. Among other things, if one of the calls needs to be changed to pass 5 arguments, it would be unclear in the second code sample what should be changed to allow that. By contrast, in the former sample, the constant "4" should be changed to "5".

comparing "the likes" smartly

Suppose you need to perform some kind of comparison amongst 2 files. You only need to do it when it makes sense, in other words, you wouldn't want to compare JSON file with Property file or .txt file with .jar file
Additionally suppose that you have a mechanism in place to sort all of these things out and what it comes down to now is the actual file name. You would want to compare "myFile.txt" with "myFile.txt", but not with "somethingElse.txt". The goal is to be as close to "apples to apples" rules as possible.
So here we are, on one side you have "myFile.txt" and on another side you have "_myFile.txt", "_m_y_f_i_l_e.txt" and "somethingReallyClever.txt".
Task is to pick the closest name to later compare. Unfortunately, identical name is not found.
Looking at the character composition, it is not hard to figure out what the relationship is. My algo says:
_myFile.txt to _m_y_f_i_l_e.txt 0.312
_myFile.txt to somethingReallyClever.txt 0.16
So _m_y_f_i_l_e.txt is closer to _myFile.txt than somethingReallyClever.txt is. Fantastic. But it also says that it is only 2 times closer, whereas in reality we can look at the 2 files and would never think to compare somethingReallyClever.txt with _myFile.txt.
Why?
What logic would you suggest I apply to not only measure likeness by having chars in the same place, but also test whether the resulting weight makes sense?
In my example, somethingReallyClever.txt should have had a weight of 0.0
I hope i am being clear.
Please share your experience and thoughts on this.
(whatever approach you suggest should not depend on number of characters filename consists out of)
Possibly helpful previous question which highlights several possible algorithms:
Word comparison algorithm
These algorithms are based on how many changes would be needed to get from one string to the other - where a change is adding a character, deleting a character, or replacing a character.
Certainly any sensible metric here should have a low score as meaning close (think distance between the two strings) and larger scores as meaning not so close.
Sounds like you want the Levenshtein distance, perhaps modified by preconverting both words to the same case and normalizing spaces (e.g. replace all spaces and underscores with empty string)
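A minimal sketch of that suggestion: normalize both names (lower-case, drop spaces and underscores), then compute the Levenshtein distance with the usual two-row dynamic programming table. The class and method names are assumptions for illustration:

```java
import java.util.Locale;

class FileNameDistance {

    // Lower-case the name and remove spaces and underscores.
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[ _]", "");
    }

    // Classic Levenshtein distance: minimum number of single-character
    // insertions, deletions, and substitutions to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Distance between two file names after normalization.
    static int distance(String a, String b) {
        return levenshtein(normalize(a), normalize(b));
    }
}
```

With this, `distance("myFile.txt", "_m_y_f_i_l_e.txt")` is 0 because the normalized names are identical, while somethingReallyClever.txt ends up far away. Dividing the distance by the length of the longer normalized name would give a score that does not depend on how long the file names are.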
