Using int flags in lieu of booleans - java

So, for example, Notification has the following flag:
public static final int FLAG_AUTO_CANCEL = 0x00000010;
This is hexadecimal for the number 16. There are other flags with values:
0x00000020
0x00000040
0x00000080
Each time, it goes up by a power of 2. Converting this to binary, we get:
00010000
00100000
01000000
10000000
Hence, we can use bitwise operators to determine which of the flags are present, since each flag contains only a single 1 bit and each is in a different position.
Question:
This all makes perfect sense, but why not just use booleans? Is this merely stylistic, or are there memory or efficiency benefits?
EDIT:
I understand that by combining them, we can store a lot of information in a single int. Is this used solely so we can pass a lot of boolean type values in a single int instead of having to pass a ton of parameters? I don't mean to trivialize that, it's very convenient, but are there any other benefits?

What you're talking about is called a bit field. One advantage is that all the information can be contained in a single variable (with no overhead like that of an ArrayList). This is useful for keeping function signatures tidy, and may bring some minor efficiency benefits from fewer stack operations, though these are probably offset by the additional bit-shift operations. Additionally, you can use (for example) one byte to store 8 fields rather than wasting 7 additional bytes. You can also, if you're clever about it, perform several flag checks in a single operation.
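For illustration, here is a minimal sketch of how a single int carries several on/off options (the flag constants are made up, not the real Notification values):

public class FlagDemo {
    static final int FLAG_A = 0x01; // 00000001
    static final int FLAG_B = 0x02; // 00000010
    static final int FLAG_C = 0x04; // 00000100

    public static void main(String[] args) {
        int flags = FLAG_A | FLAG_C;          // combine flags with bitwise OR

        boolean hasA = (flags & FLAG_A) != 0; // test a single flag with bitwise AND
        boolean hasB = (flags & FLAG_B) != 0;

        flags &= ~FLAG_C;                     // clear a flag

        // several checks in one operation: are both A and B set?
        boolean both = (flags & (FLAG_A | FLAG_B)) == (FLAG_A | FLAG_B);

        System.out.println(hasA + " " + hasB + " " + both);
    }
}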
Having said that, as a matter of personal preference some may find a list of booleans cleaner or preferable. Bit fields are most common in embedded systems, where space is limited, or in similar situations.
In reference to your edit: the flag values are stored in ints, but those are just constants for reference; you aren't editing them, you're setting (or clearing) those bits in the flags field, which is a single int. I don't really know why they chose a bit field for this application; perhaps someone who grew up programming space-limited microcontrollers coded that specific class. The general consensus seems to be that bit fields shouldn't be included in new code.

This is a common idiom in C, where resource constraints are a much larger concern, and you usually see it in Java where the Java API directly maps an underlying well-known C API. However, it's not a great idea in Java, for a number of reasons.
As of Java 5, most of the uses for one-bit bit fields are taken care of very nicely by EnumSet, which is internally implemented using a bit field (so it's extremely fast) but is type-safe, easy to read, and Iterable.
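For example, a brief sketch of the EnumSet approach (the enum constants below are illustrative names modeled on the Notification flags, not a real API):

import java.util.EnumSet;

public class EnumSetDemo {
    // Each former bit flag becomes an enum constant.
    enum NotificationFlag { AUTO_CANCEL, NO_CLEAR, ONGOING_EVENT, INSISTENT }

    public static void main(String[] args) {
        // Internally backed by a bit field, but type-safe and readable.
        EnumSet<NotificationFlag> flags =
                EnumSet.of(NotificationFlag.AUTO_CANCEL, NotificationFlag.ONGOING_EVENT);

        boolean autoCancel = flags.contains(NotificationFlag.AUTO_CANCEL); // membership test
        flags.remove(NotificationFlag.ONGOING_EVENT);                      // clear a "flag"

        for (NotificationFlag f : flags) {   // Iterable, unlike a raw int
            System.out.println(f);
        }
        System.out.println(autoCancel);
    }
}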

Related

Efficiently Storing a Short History of Boolean Events for many Components

To preface this - I have no influence on the design of this problem and I can't really give a lot of details about the technical background.
Say I have a lot of components of the same type that regularly get a boolean event - and I need to hold a short history of these boolean events.
A coworker of mine wrote a rather naive implementation using the type Map<Component, CircularFifoQueue<Boolean>>, CircularFifoQueue being a data structure from Apache Commons. The code works, but given how generics work in Java and the dimensions involved, this is really inefficient, as it stores a reference to one of the two singleton Boolean objects instead of just one bit.
Generally there are around 100K components and the history is supposed to hold the 5-10 most recent boolean values (might be subject to change but probably won't be larger than 10). This currently means that around 1.5GB of RAM are allocated just for these history maps. Also these changes happen quite frequently so it wouldn't hurt to increase the CPU efficiency if possible.
One obvious change would be to move the history into the Component class to remove the HashMap-induced overhead.
The more complicated question is how to efficiently store the last few boolean values.
One possible way would be to use BitSets, but as those use long[] as their underlying data structure, I doubt it would be the most efficient way to store what is essentially 5 bits.
Another option would be to directly use an integer and shift the value as a way to remove old entries. So basically
int history = 0;

public void set(int length, boolean active) {
    if (active) {
        history |= 1 << length;
    } else {
        history &= ~(1 << length);
    }
    // shift one to the right to remove the oldest entry
    history = history >> 1;
}
Just off the top of my head; this code is untested. I don't know how efficient it is or whether it even works, but that is roughly what I had in mind.
But that would still lead to quite some overhead compared to the optimal case of storing 5 bits of data using 5 bits of memory.
One could achieve some additional savings if the histories of the different components were stored contiguously, but I'm not sure how to handle either one giant contiguous BitSet or, alternatively, a large byte[] where each byte represents one bool history as described above.
This is a weirdly specific problem and I'd be really glad about any suggestions.
Setting aside the bit manipulations, which I'm sure you'll conquer, please think about how efficient is efficient enough.
Every instance of
class Foo {}
allocates 16 bytes. So if you were to introduce
class ComponentHistory {
private final int bits;
}
that's 20 bytes.
If you replace the int with a byte, you're still at 20 bytes: a byte field is padded to 4 bytes by the JVM (at least).
If you define a global array of bits somewhere and refer to it from ComponentHistory, the reference itself is at least 4 bytes.
Basically, you can't win :)
But consider this: if you go with the simplest approach that you have already outlined, that produces simple readable code, your 100K component histories will take up 2MB of RAM - substantial savings from your current level of 1.5GB. Specifically, you've saved 1498MB.
Suppose you indeed invent a cumbersome yet working way of storing only 5 bits per history. You'd then need 500K bits ≈ 60KB to store all histories. With the baseline of 1.5GB, your savings are now 1499.94MB; the savings improve by about 0.1%. Does that matter at all? More often than not, I'd prefer not to over-optimize here while sacrificing simplicity.
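If it helps, a minimal sketch of that simple approach, with the history held as an int field directly inside the component (class, field, and method names are hypothetical):

public class Component {
    private static final int HISTORY_LENGTH = 5; // assumed history size

    // Bit (HISTORY_LENGTH - 1) holds the newest event, bit 0 the oldest.
    private int history;

    public void recordEvent(boolean active) {
        history >>>= 1;                            // age every entry; the oldest falls off
        if (active) {
            history |= 1 << (HISTORY_LENGTH - 1);  // insert the newest entry at the top
        }
    }

    public boolean eventAt(int age) {              // age 0 = most recent event
        return (history & (1 << (HISTORY_LENGTH - 1 - age))) != 0;
    }
}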

EBCDIC unpacking comp-3 data returns 40404** in Java

I have used the unpack data logic provided in below link for java
How to unpack COMP-3 digits using Java?
But for null (space-filled) data in the source, the Java unpack code returns values like 404040404. I understand that 0x40 is a space in EBCDIC, but how can I unpack while handling or avoiding these spaces?
There are two problems that we have to deal with. First, is the data valid COMP-3 data? Second, is the data considered "valid" by older language implementations such as COBOL (since COMP-3 was mentioned)?
If the offsets are not misaligned, it would appear that space-filled fields are being interpreted by existing programs as 0 rather than as spaces. This would be incorrect, but it could be an artifact of older programs that were engineered to tolerate this bad behaviour.
The approach I would take in a legacy shop (assuming no misalignment) is to consider "spaces" (sequences of 0x404040404040) as being zero. This would be a legacy check: compare the field with spaces and, if it matches, assume 0x00000000000f (a packed zero) as the actual value. This is something an individual shop would have to determine; it is not recognized as a general programming approach.
In terms of Java, one has to remember that bytes are "signed", so comparisons can be tricky depending on how the code is written. The only "unsigned" data type I recall in Java is char, which is really two bytes (a uint16), basically.
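As a rough sketch of that "treat an all-spaces field as zero" convention (the method and constant names are made up; the actual unpack routine would be the one from the linked question):

public final class Comp3Util {
    private static final byte EBCDIC_SPACE = (byte) 0x40;

    // Returns true if every byte of the packed field is an EBCDIC space.
    public static boolean isAllSpaces(byte[] field) {
        for (byte b : field) {
            // Java bytes are signed; 0x40 is still positive, so a direct comparison
            // is safe here, but values >= 0x80 would need masking with 0xFF.
            if (b != EBCDIC_SPACE) {
                return false;
            }
        }
        return field.length > 0;
    }

    // Unpack, treating a space-filled field as zero (the legacy-shop convention).
    public static long unpackOrZero(byte[] field) {
        return isAllSpaces(field) ? 0L : unpackComp3(field);
    }

    // Placeholder for whatever COMP-3 unpacking routine you already use.
    private static long unpackComp3(byte[] field) {
        throw new UnsupportedOperationException("plug in your existing unpack logic");
    }
}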
This is less of a programming problem than it is recognizing historical tolerance and remediation.

Why was arg = args[n++] more efficient than 2 separate statements in earlier compilers?

From the Book "Core Java for the Impatient", Chapter "increment and decrement operators"
String arg = args[n++];
sets arg to args[n], and then increments n. This made sense thirty years ago when compilers didn't do a good job optimizing code. Nowadays, there is no performance drawback in using two separate statements, and many programmers find the explicit form easier to read.
I thought such usage of increment and decrement operators was only used in order to write less code, but according to this quote it wasn't so in the past.
What was the performance benefit of writing statements such as String arg = args[n++]?
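For concreteness, a minimal sketch of the two spellings being compared (the names are illustrative); a modern JIT will typically compile them identically:

public class IncrementDemo {
    public static void main(String[] args) {
        String[] words = {"a", "b", "c"};
        int n = 0;

        String first = words[n++];  // combined form: read words[n], then increment n

        String second = words[n];   // explicit two-statement form
        n++;

        System.out.println(first + " " + second); // prints "a b"
    }
}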
Some processors, like the Motorola 68000, support addressing modes that dereference a pointer and then increment it in a single instruction; the 68000's post-increment mode, written (A0)+, is one example.
Older compilers might conceivably be able to use this addressing mode on an expression like *p++ or arr[i++], but might not be able to recognize it split across two statements.
Architectures and compilers have improved over the years, so there is no single answer to this.
From the architecture standpoint: many processors support a store with pointer auto-increment in a single CPU cycle, so in the past the way you wrote the code could affect the result (one operation versus several). DSP architectures in particular were good at parallelizing things (e.g. TI DSPs like the C54xx, with post-increment and post-decrement instructions and instructions that execute over circular buffers; for example, "ADD *AR2+, AR2-, A ; after accessing the operands, AR2 is incremented by one", from the TMS320C54x DSP reference set). ARM cores also feature instructions that allow similar parallelism (the VLDR and VSTR instructions; see the documentation).
From the compiler standpoint: the compiler looks at how a variable is used in its scope (which was not always the case before). It can see whether the variable is reused later or not. It might be the case that a variable is incremented in the code but then discarded; what would be the point of doing that? Nowadays the compiler has to track variable usage and can make smart decisions based on that (if you look at Java 8, the compiler must be able to spot "effectively final" variables that are never reassigned).
These operators were, and are, generally used for programmer convenience rather than to achieve performance, because effectively the statement gets split into a two-line statement during compilation anyway. Apparently, the overhead of performing the post/pre-increment/decrement would be more than that of an already split two-line statement.

How are keywords represented in binary form?

How are keywords represented in binary form?
For example: in Java, how is sin() represented in binary? How are sqrt() and other functions represented?
If not only in Java, how is it represented in any language? Because ultimately everything is translated into binary and then into on and off signals.
Thanks in advance.
Firstly, sin is not a keyword in Java. It is an identifier. Keywords are things like if, class, and so on.
It depends on which stage you are asking about.
In the source code, the sin identifier is represented as characters, and those characters are represented as bits (i.e. binary) .... if you want to look at it that way.
In the classfile that is output by the javac compiler, the word sin is represented as a string in the constant pool. (The JVM spec specifies the format of classfiles in great detail.)
When the classfile is first loaded by a JVM, the word sin becomes a Java String object.
When the code is linked by the JVM, the reference to the String is resolved to some kind of reference to a method. (The details are implementation specific. You'd need to read the JVM source code to find out more.)
When the code is JIT compiled, the reference to the method (typically) turns into the in-memory address of the first native instruction of the JIT-compiled method. (Strictly speaking, this is not "assembly language", but the native instructions could be represented as assembly language. Assembly language is really just a "human friendly" textual representation of the instructions.)
So how does the computer know that when sin is written it has to compute the sine of a number?
What happens is that the Java runtime loads that class containing the method. Then it looks for the sin(double) method in the class that it loaded. What typically happens is that the named method resolves to some bytecodes that are the instructions that tell the runtime what the method should do. But in the case of sin, the method is a native method, and the instructions are actually native instructions that are part of one of the JVM's native libraries.
If not for methods, can we have a binary representation of keywords, like int and float, etc.?
It depends on the actual keywords. But generally speaking, genuine Java keywords are transformed by the compiler into a form that doesn't have a distinct / discrete representation for the individual keywords.
If not only in Java, how is it represented in any language? Because ultimately everything is translated into binary and then into on and off signals.
This tells me that you probably have a fundamental misunderstanding of how programming languages are implemented. So instead of answering this question (it doesn't really have a proper answer other than "well they're not represented at all"), I will try to help you understand why this question is the wrong one to ask.
Your computer runs machine code, and only machine code. You can feed it any random sequence of bytes; it doesn't matter what they were intended to be, as soon as you point the program counter at it, it will be interpreted as if it were machine code (of course, giving it bytes that were not intended to be machine code is probably a bad idea). As a running example, I'll use this x64 code:
48 01 F7 48 89 F8 C3
If you have no idea what's going on, that's normal at this level. Most people don't read machine code (but they could if they learned it, it's not magic). This is where the zeroes and ones are, to the processor it's not even in hexadecimal, that's just what humans like to read.
At a level above that there is assembly, which is in most cases really just a different way of looking at machine code, in such a way that humans find it easier to read. The example from earlier looks more sensible in assembly:
add rdi, rsi
mov rax, rdi
ret
Still not very clear what's going on to someone who doesn't know x64 assembly, but at least it gives some sort of clue: there's an add in it. It probably adds things.
At a yet higher level, you could have Java bytecode or Java, but I think the Java aspect of this question misses the point; it's probably there because the OP doesn't realize that Java is different from "the classic picture". Java just complicates matters without explaining the big picture. Let's use C instead. The example in C could look like:
int64_t foo_or_whatever(int64_t x, int64_t y)
{
return x + y;
}
If you don't know C but you do know Java, the only strange thing here is int64_t, which is roughly the equivalent of a long in Java.
So yes, things were added, as the assembly code suggested. Now where did the keywords go?
That question doesn't make as much sense as you thought it did. The compiler understands keywords and uses them to create machine code that implements your program. After that point they stop being relevant. They only mean something in the context of the high-level language that you wrote the code in; you could say that at that level they are stored as ASCII or UTF-8 strings in a file. They have nothing to do with machine code, they do not appear in any form there, and you can write machine code without having translated it from a high-level language that has keywords. That return and ret look vaguely similar is a bit of a red herring; they have something to do with each other, but the relation is far from simple (that it worked out simply in the example I'm using is of course no accident).
The int64_t has perhaps not entirely disappeared (mostly it has, though). The fact that the addition operates on 64-bit integers is encoded in the instruction 48 01 F7. Not the keyword int64_t (which isn't even a keyword, but let's not get into that), but "the fact that what you have there is an addition between 64-bit integers", which is a conceptually different thing, though caused here by the use of int64_t. To split that instruction out while skipping some of the detail (because this is a beginner question), there's
48 = 01001000 encoding REX.W, meaning this instruction is 64bit
01 = 00000001 encoding add rm64, r64 in this case
F7 = 11110111 encoding the operands rdi and rsi
To learn more about what the processor does with machine code (in case your follow-up question is "but how does it know what to do with something like 48 01 F7"), study computer architecture. If you want a book, I recommend Computer Architecture, Fifth Edition: A Quantitative Approach, which is quite accessible to beginners and commonly used in first-year courses about computer architectures.
To learn more about the journey from high level language to machine code, study compiler construction. If you want a book, I recommend Compilers: Principles, Techniques, and Tools, but it may be hard to get through it as a beginner. If you want a free course, you could follow Compilers on Coursera (the first few lectures especially will give you an overview of what compilers do without getting too technical yet).
Incidentally, if you give the example C code to GCC, it makes
lea rax, [rdi + rsi]
ret
It's still doing the same thing, but in a way that didn't fit my story, so I took the liberty of doing it in a slightly different way.
sin() is a function so it's represented as a memory address where its code block is.
Keywords (like for) aren't represented in binary; for, for example, is converted to a list of bytecode jump instructions, which are compiled into assembly instructions, which are represented as binary.
My point is that you cannot convert most keywords directly into binary. You can unroll them into bytecode which you could then convert to native machine code and binary but not directly to binary.
Here, read this; then, after you understand it, move on to how bytecode is converted to native code.
Keywords and Functions
That said, a keyword in Java (and most languages) is a reserved word like for, while, or return, but your examples are not keywords; they are method names like sin() and sqrt().
Not really sure what you want to know here; so let's go "bytecode"...
Both the .sin() and .sqrt() methods are static methods of the Math class; therefore, the compiler will generate a call site with the arguments and a reference to the method, and then emit invokestatic.
Other than invokestatic, you have invokevirtual, invokespecial, invokeinterface and (since Java 7) invokedynamic.
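For a concrete, if simplified, illustration: compiling a call to Math.sin and inspecting it with javap -c shows the invokestatic (the exact constant-pool indices will differ per class file):

public class SinCall {
    public static double half() {
        return Math.sin(0.5);
        // javap -c SinCall shows roughly:
        //   ldc2_w        #<n>   // double 0.5d
        //   invokestatic  #<m>   // Method java/lang/Math.sin:(D)D
        //   dreturn
    }
}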
Now, at runtime, the JIT will kick in; and the JIT may end up producing pure native code, but this is not a guarantee. In any event, the code will be fast enough.
And the same goes for the JDK libraries themselves; the JIT will kick in and maybe turn the byte code into native code given a sufficient time to analyze it (escape analysis, inlining etc).
And since the JIT does "whatever it wants", you cannot reliably have a "binary" representation of any method from any class.

Detecting equivalent expressions

I'm currently working on a Java application where I need to implement a system for building BPF expressions. I also need to implement a mechanism for detecting equivalent BPF expressions.
Building the expressions is not too hard. I can build a syntax tree using the Interpreter design pattern and implement toString to produce the BPF syntax.
However, detecting if two expressions are equivalent is much harder. A simple example would be the following:
A: src port 1024 and dst port 1024
B: dst port 1024 and src port 1024
In order to detect that A and B are equivalent, I probably need to transform each expression into a "normalized" form before comparing them. This would be easy for the above example; however, when working with a combination of nested AND, OR and NOT operations, it gets harder.
Does anyone know how I should best approach this problem?
One way to compare boolean expressions may be to convert both to the disjunctive normal form (DNF), and compare the DNF. Here, the variables would be Berkeley Packet Filter tokens, and the same token (e.g. port 80) appearing anywhere in either of the two expressions would need to be assigned the same variable name.
There is an interesting-looking applet at http://www.izyt.com/BooleanLogic/applet.php - sadly I can't give it a try right now due to Java problems in my browser.
I'm pretty sure detecting equivalent expressions is either an NP-hard or NP-complete problem, even for boolean-only expressions. Meaning that to do it perfectly, the optimal way is basically to build complete truth tables of all possible combinations of inputs and the results, then compare the tables.
Maybe BPF expressions are limited in some way that changes that? I don't know, so I'm assuming not.
If your problems are small, that may not be a problem. I do exactly that as part of a decision-tree designing algorithm.
Alternatively, don't try to be perfect. Allow some false negatives (cases which are equivalent, but which you won't detect).
A simple approach may be to do a variant of the normal expression evaluation, but evaluating an alternative representation of the expression rather than the result. Impose an ordering on the operands of commutative operators. Apply some obvious simplifications during the evaluation. Replace a rich operator set with a minimal set of primitive operators, e.g. using De Morgan's laws to eliminate OR operators.
This alternative representation forms a canonical representation for all members of a set of equivalent expressions. It should be an equivalence class in the sense that you always find the same canonical form for any member of that set. But that's only the set-theory/abstract-algebra sense of an equivalence class - it doesn't mean that all equivalent expressions are in the same equivalence class.
For efficient dictionary lookups, you can use hashes or comparisons based on that canonical representation.
I'd definitely go with syntax normalization. That is, like aix suggested, transform the booleans using DNF and reorder the abstract syntax tree such that the lexically smallest arguments are on the left-hand side. Normalize all comparisons to < and <=. Then, two equivalent expressions should have equivalent syntax trees.
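As an illustration of the normalization idea, here is a rough Java sketch (all class names are made up, and it only handles a commutative AND over leaf tokens, so it is a starting point rather than a complete equivalence checker):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

abstract class Expr {
    abstract String canonical();   // a normalized textual form used for comparison
}

class Token extends Expr {         // a leaf such as "src port 1024"
    private final String text;
    Token(String text) { this.text = text; }
    String canonical() { return text; }
}

class And extends Expr {           // commutative: operand order is irrelevant
    private final List<Expr> operands;
    And(List<Expr> operands) { this.operands = operands; }
    String canonical() {
        List<String> parts = new ArrayList<>();
        for (Expr e : operands) parts.add(e.canonical());
        Collections.sort(parts);   // impose an ordering on commutative operands
        return "(and " + String.join(" ", parts) + ")";
    }
}

class NormalizeDemo {
    public static void main(String[] args) {
        Expr a = new And(List.of(new Token("src port 1024"), new Token("dst port 1024")));
        Expr b = new And(List.of(new Token("dst port 1024"), new Token("src port 1024")));
        System.out.println(a.canonical().equals(b.canonical())); // true: A and B are equivalent
    }
}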
