Say I have a List<Integer> ls and I know its length. Could I allocate length*4 bytes for a ByteBuffer and use put to directly read ls into the buffer?
This should do what you're looking for:
ByteBuffer buffer = ByteBuffer.allocate(ls.size() * 4);
for (Integer i : ls)
    buffer.putInt(i);
One caveat: This snippet assumes that your list does not contain any null entries.
I don't think you can do much better than this, since the underlying array of Integer objects in ls is an array of references (pointers) to these Integer objects rather than their contained int-values. The actual Integer objects will reside in memory in some random order, dictated roughly by when they were created. So, it's unlikely that you would find some consecutive block of memory that contains the data that you would need to copy into your ByteBuffer (I think this is what you meant by "directly reading" it.). So, something like using sun.misc.Unsafe will probably not be able to help you here.
Even if you had an int[] instead of a List<Integer> you probably won't be able to reliably read the data directly from memory into your ByteBuffer, since some JVMs on 64-bit machines will align all values on 64-bit address locations for faster access, which would lead to "gaps" between your ints. And then there is the issue of endianness, which might differ from platform to platform...
Edit:
Hmmm... I just took a look at the OpenJDK source code for the putInt() function that this would be calling... It's a horrendous mess of sub-function calls. If the "sun" implementation is as "bad", you might be better off just doing the conversion yourself (using shifts and binary ops) if you're looking for performance... The fastest might be to convert your Integers into a byte[] and then use ByteBuffer.wrap(...) if you need the answer as a ByteBuffer. Let me know if you need code...
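For illustration, here is a minimal sketch of that manual conversion: the shifts produce big-endian byte order (ByteBuffer's default), and, as above, it assumes no null entries:

byte[] bytes = new byte[ls.size() * 4];
int pos = 0;
for (int v : ls) {              // auto-unboxing; throws NullPointerException on null entries
    bytes[pos++] = (byte) (v >>> 24);
    bytes[pos++] = (byte) (v >>> 16);
    bytes[pos++] = (byte) (v >>> 8);
    bytes[pos++] = (byte) v;
}
ByteBuffer buffer = ByteBuffer.wrap(bytes);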
Related
I have a text file, with a sequence of integers per line:
47202 1457 51821 59788
49330 98706 36031 16399 1465
...
The file has 3 million lines of this format. I have to load this file into memory, extract 5-grams out of it and do some statistics on it. I do have a memory limitation (8GB RAM). I tried to minimize the number of objects I create (I only have 1 class with 6 float variables, and some methods). And each line of that file basically generates a number of objects of this class (proportional to the size of the line in terms of number of words). I started to feel that Java is not a good way to do these things when C++ is around.
Edit:
Assume that each line produces (n-1) objects of that class, where n is the number of tokens in that line separated by spaces (e.g. 1457). So considering an average of 10 words per line, each line gets mapped to 9 objects on average, giving 9*3*10^6 = 27 million objects. The memory needed is therefore: 27*10^6 * (8-byte object header + 6 x 4-byte floats) ≈ 860 MB, plus a map(String, Objects) and another map(Integer, ArrayList(Objects)). I need to keep everything in memory, because there will be some mathematical optimization happening afterwards.
Reading/Parsing the file:
The best way to handle large files, in any language, is to try and NOT load them into memory.
In Java, have a look at MappedByteBuffer. It allows you to map a file into process memory and access its contents without loading the whole thing into your heap.
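A rough sketch of the mapping (the file name is a placeholder):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

try (RandomAccessFile raf = new RandomAccessFile("data.txt", "r");
     FileChannel ch = raf.getChannel()) {
    MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    while (buf.hasRemaining()) {
        byte b = buf.get(); // contents are paged in by the OS as you touch them, not held in your heap
    }
}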
You might also try reading the file line-by-line and discarding each line after you read it - again to avoid holding the entire file in memory at once.
Handling the resulting objects
For dealing with the objects you produce while parsing, there are several options:
Same as with the file itself - if you can perform whatever it is you want to perform without keeping all of them in memory (while "streaming" the file) - that is the best solution. You didn't describe the problem you're trying to solve, so I don't know if that's possible.
Compression of some sort - switch from wrapper objects (Float) to primitives (float), use something like the flyweight pattern to store your data in giant float[] arrays and only construct short-lived objects to access it (see the sketch after this list), or find some pattern in your data that allows you to store it more compactly.
Caching/offload - if your data still doesn't fit in memory, "page it out" to disk. This can be as simple as extending Guava to page out to disk or bringing in a library like Ehcache or the like.
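A hedged sketch of the flyweight idea from the second option (the names RecordStore and FIELDS are illustrative, not from the question): all records live in one float[], and accessors compute offsets instead of allocating one object per record.

class RecordStore {
    static final int FIELDS = 6; // six floats per record, as in the question
    final float[] data;

    RecordStore(int capacity) {
        data = new float[capacity * FIELDS];
    }

    float get(int record, int field) {
        return data[record * FIELDS + field];
    }

    void set(int record, int field, float value) {
        data[record * FIELDS + field] = value;
    }
}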
A note on Java collections, and maps in particular
For small objects, Java collections, and maps in particular, incur a large memory penalty (due mostly to everything being wrapped as Objects and to the Map.Entry inner-class instances). At the cost of a slightly less elegant API, you should probably look at the GNU Trove collections if memory consumption is an issue.
Optimal would be to hold only integers and line ends.
To that end, one way would be: convert the file to two files:
one binary file of integers (4 bytes)
one binary file with indexes where the next line would start.
For this one can use a Scanner to read, and a DataOutputStream+BufferedOutputStream to write.
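A sketch of that conversion step (file names are placeholders; DataOutputStream writes big-endian 4-byte ints):

import java.io.*;
import java.util.Scanner;

try (Scanner sc = new Scanner(new File("input.txt"));
     DataOutputStream ints = new DataOutputStream(
             new BufferedOutputStream(new FileOutputStream("integers.bin")));
     DataOutputStream ends = new DataOutputStream(
             new BufferedOutputStream(new FileOutputStream("lineEnds.bin")))) {
    int count = 0;
    while (sc.hasNextLine()) {
        Scanner line = new Scanner(sc.nextLine());
        while (line.hasNextInt()) {
            ints.writeInt(line.nextInt());
            count++;
        }
        ends.writeInt(count); // index where the next line starts
    }
}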
Then you can load those two files in arrays of primitive type:
int[] integers = new int[(int)integersFile.length() / 4];
int[] lineEnds = new int[(int)lineEndsFile.length() / 4];
Reading can be done with FileChannel.map(...) and MappedByteBuffer.asIntBuffer(). (You then would not even need the arrays, but it would become a bit COBOL-like in its verbosity.)
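For example (a sketch, assuming the integers file written above; the default big-endian order matches DataOutputStream):

import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

try (FileChannel ch = FileChannel.open(Paths.get("integers.bin"), StandardOpenOption.READ)) {
    IntBuffer ints = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asIntBuffer();
    int[] integers = new int[ints.remaining()];
    ints.get(integers); // bulk copy; or work off the IntBuffer directly and skip the array
}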
If I take an XML file that is around 2kB on disk and load the contents as a String into memory in Java and then measure the object size it's around 33kB.
Why the huge increase in size?
If I do the same thing in C++ the resulting string object in memory is much closer to the 2kB.
To measure the memory in Java I'm using Instrumentation.
For C++, I take the length of the serialized object (e.g. string).
I think there are multiple factors involved.
First of all, as Bruce Martin said, objects in Java have an overhead of 16 bytes per object; C++ does not.
Second, Strings in Java might be 2 bytes per character instead of 1.
Third, it could be that Java reserves more memory for its Strings than C++'s std::string does.
Please note that these are just ideas where the big difference might come from.
Assuming that your XML file contains mainly ASCII characters and uses an encoding that represents them as single bytes, then you can expect the in-memory size to be at least double, since Java uses UTF-16 internally (I've heard of some JVMs that try to optimize this, though). Added to that will be the overhead of 2 objects (the String instance and an internal char array) with some fields, IIRC about 40 bytes overall.
So your "object size" of 33kb is definitely not correct, unless you're using a weird JVM. There must be some problem with the method you use to measure it.
In Java a String object has some extra data that increases its size.
There is the object header, the backing char array's data, and some other fields: the array reference, offset, length, etc.
Visit http://www.javamex.com/tutorials/memory/string_memory_usage.shtml for details.
String: a String's memory growth tracks its internal char array's growth. However, the String class adds another 24 bytes of overhead.
For a nonempty String of size 10 characters or less, the added overhead cost relative to useful payload (2 bytes for each char plus 4 bytes for the length), ranges from 100 to 400 percent.
More:
What is the memory consumption of an object in Java?
Yes, you should GC and give it time to finish. Just call System.gc(); and print totalMem() in the loop. You'd also do better to create a million string copies in an array (measure the empty array's size, then the array filled with strings) to be sure that you measure the size of the strings and not other service objects which may be present in your program. A String alone cannot take 32 kB, but a hierarchy of XML objects can.
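A rough sketch of that measurement (xml stands in for the string under test; new String(xml.toCharArray()) forces a real copy of the character data, since new String(String) may share the backing array):

Runtime rt = Runtime.getRuntime();
System.gc();
long before = rt.totalMemory() - rt.freeMemory();
String[] copies = new String[1_000_000];
for (int i = 0; i < copies.length; i++)
    copies[i] = new String(xml.toCharArray()); // force a copy of the char data
System.gc();
long after = rt.totalMemory() - rt.freeMemory();
System.out.println("approx. bytes per string: " + (after - before) / copies.length);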
That said, I cannot resist the irony that nobody cares about memory (and cache hits) in the world of Java. We know that the JIT is improving and it can outperform native C++ code in some cases. So there is no need to bother about memory optimization. Premature optimization is the root of all evil.
As stated in other answers, Java's String adds overhead. If you need to store a large number of strings in memory, I suggest you store them as byte[] instead. Doing so, the size in memory should be the same as the size on disk.
String -> byte[] :
String a = "hello";
byte[] aBytes = a.getBytes(StandardCharsets.UTF_8); // name the charset explicitly rather than relying on the platform default
byte[] -> String :
String b = new String(aBytes, StandardCharsets.UTF_8);
Does anyone know where I can find that algorithm? It takes a double and a StringBuilder and appends the double to the StringBuilder without creating any objects or garbage. Of course I am not looking for:
sb.append(Double.toString(myDouble));
// or
sb.append(myDouble);
I tried poking around the Java source code (I am sure it does it somehow) but I could not see any block of code/logic clear enough to be re-used.
I have written this for ByteBuffer. You should be able to adapt it. Writing it to a direct ByteBuffer saves you having to convert it to bytes or copy it into "native" space.
See public ByteStringAppender append(double d)
If you are logging this to a file, you might use the whole library as it can write around 20 million doubles per second sustained. It can do this without system calls as it writes to a memory mapped file.
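If you only need something simple, here is a hedged sketch of the idea with a fixed number of decimal places (no NaN/infinity handling, and not a drop-in replacement for Double.toString()). StringBuilder.append(long) and append(char) write digits straight into the builder's buffer, so no intermediate objects are created:

static void appendDouble(StringBuilder sb, double d, int decimalPlaces) {
    if (d < 0) {
        sb.append('-');
        d = -d;
    }
    long scale = 1;
    for (int i = 0; i < decimalPlaces; i++)
        scale *= 10;
    long scaled = Math.round(d * scale); // value * 10^decimalPlaces, rounded
    sb.append(scaled / scale);           // integer part; append(long) allocates nothing
    sb.append('.');
    long frac = scaled % scale;
    for (long f = scale / 10; f > 0; f /= 10)
        sb.append((char) ('0' + (frac / f) % 10)); // fraction digits, including leading zeros
}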
I made a JavaClass which is making addition, sub, mult. etc.
And the numbers are like (155^199 [+,-,*,/] 555^669 [+,-,*,/] ..... [+,-,*,/] x^n);
Each number is stored in a byte array, and a byte array can contain at most 66,442 values.
example:
(byte) array = [1][0] + [9][0] = [1][0][0]
(byte) array = [9][0] * [9][0] = [1][8][0][0]
My class is not working if the number is bigger than that (for example: 999^999).
How can I solve this problem, to do addition between much bigger numbers?
When the byte array reaches 66,443 values, the VM gives this error:
Caused by: java.lang.ClassNotFoundException, which is actually not the correct error description.
Well, it means that if I have a byte array with 66,443 values, the class cannot be read correctly.
Solved:
I used a multidimensional byte array to solve this problem:
array{array, ... nth-array} [+, -, *, /] nth-array{array, ... nth-array}
It takes only a few seconds to do an addition between big numbers.
Thank you!
A single method in Java is limited to 64KB of bytecode. When you initialise an array in code, it uses bytecode to do so. This limits the maximum size of an array you can define with an in-code initialiser to roughly that size.
If you have a large byte array of values, I suggest you store it in an external file and load it at runtime. This way you can have a byte array of up to 2 GB. If you need more than this, you need an array of arrays.
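A minimal sketch of the load-at-runtime suggestion (the file name is a placeholder):

import java.nio.file.Files;
import java.nio.file.Paths;

byte[] data = Files.readAllBytes(Paths.get("data.bin")); // up to 2 GB in one array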
What does your actual code look like? What error are you getting?
A Java byte array can hold up to 2^31-1 values, if there is that much contiguous memory available.
Each array can hold a maximum of Integer.MAX_VALUE values. If it crashes, I guess you see an OutOfMemoryError. Fix that by starting your Java VM with more heap space:
java -Xmx1024M <...>
(this example gives 1024 MB of heap space)
java.lang.ClassNotFoundException is thrown if the virtual machine needs a class and can't load it - usually because it is not on the class path (sometimes the case when we simply forget to compile a java source file..). This exception is totally unrelated to java array operations.
To continue the discussion in the comments section:
The name of the missing class is very important. At the line of code, where the exception is thrown, the VM tries to load the class ClassBigMath for the very first time and fails. The classloader can't find a file ClassBigMath.class on the classpath.
Double check first if the compiled java file is really present and double check that you don't have a typo in your source code. Typical reasons for this error:
We simply forget to compile a source file
A class file is on the classpath at compilation time but not at execution time
We do a Class.forName("MyClass") and have a typo in the class name
java.math.BigInteger is a much better solution for handling large numbers. Is there any reason you chose a byte array?
The maximum size of an array in Java is given by Integer.MAX_VALUE. This is 2^31-1 elements. You might get OOM exceptions for less if there is not enough memory free. Besides that, for what you are doing you might want to look at the BigInteger class. It seems you are doing your math in some form of decimal representation, which is not very memory efficient.
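For illustration, a quick sketch of the BigInteger approach, using the numbers from the question:

import java.math.BigInteger;

BigInteger a = BigInteger.valueOf(155).pow(199);
BigInteger b = BigInteger.valueOf(555).pow(669);
System.out.println(a.add(b));      // addition of arbitrarily large integers
System.out.println(a.multiply(b)); // multiplication, with no digit arrays to manage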
According to here, the C compiler will pad out values when writing a structure to a binary file. As the example in the link says, when writing a struct like this:
struct {
char c;
int i;
} a;
to a binary file, the compiler will usually leave an unnamed, unused hole between the char and int fields, to ensure that the int field is properly aligned.
How could I create an exact replica of the binary output file (generated in C) using a different language (in my case, Java)?
Is there an automatic way to apply C padding in Java output? Or do I have to go through compiler documentation to see how it works (the compiler is g++ by the way).
Don't do this, it is brittle and will lead to alignment and endianness bugs.
For external data it is much better to explicitly define the format in terms of bytes and write explicit functions to convert between internal and external format, using shift and masks (not union!).
This is true not only when writing to files, but also in memory. It is the fact that the struct is padded in memory, that leads to the padding showing up in the file, if the struct is written out byte-by-byte.
It is in general very hard to replicate with certainty the exact padding scheme, although I guess some heuristics would get you quite far. It helps if you have the struct declaration, for analysis.
Typically, fields larger than one char will be aligned so that their starting offset inside the structure is a multiple of their size. This means shorts will generally be on even offsets (divisible by 2, assuming sizeof (short) == 2), while doubles will be on offsets divisible by 8, and so on.
UPDATE: It is for reasons like this (and also reasons having to do with endianness) that it is generally a bad idea to dump whole structs out to files. It's better to do it field-by-field, like so:
put_char(out, a.c);
put_int(out, a.i);
Assuming the put-functions only write the bytes needed for the value, this will emit a padding-less version of the struct to the file, solving the problem. It is also possible to ensure a proper, known, byte-ordering by writing these functions accordingly.
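In Java, a rough analogue of this field-by-field approach is DataOutputStream, which writes exactly the bytes asked for (big-endian for multi-byte types), so no padding ends up in the file. The variables c and i here stand in for the struct's fields:

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;

try (DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream("a.bin")))) {
    out.writeByte(c); // the struct's char field: exactly 1 byte
    out.writeInt(i);  // the struct's int field: exactly 4 bytes, big-endian
}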
Is there an automatic way to apply C padding in Java output? Or do I have to go through compiler documentation to see how it works (the compiler is g++ by the way).
Neither. Instead, you explicitly specify a data/communication format and implement that specification, rather than relying on implementation details of the C compiler. You won't even get the same output from different C compilers.
For interoperability, look at the ByteBuffer class.
Essentially, you create a buffer of a certain size, put() variables of different types at different positions, and then call array() at the end to retrieve the "raw" data representation:
ByteBuffer bb = ByteBuffer.allocate(8);
bb.order(ByteOrder.LITTLE_ENDIAN);
bb.put(0, (byte) someChar);  // the struct's 1-byte char field at offset 0
bb.putInt(4, someInteger);   // the 4-byte int field at offset 4, after 3 bytes of padding
byte[] rawBytes = bb.array();
But it's up to you to work out where to put the padding, i.e. how many bytes to skip between positions.
For reading data written from C, then you generally wrap() a ByteBuffer around some byte array that you've read from a file.
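A sketch of that reading direction, under the same layout assumptions as the writing example above (the path is a placeholder):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

byte[] raw = Files.readAllBytes(Paths.get("struct.bin"));
ByteBuffer bb = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
byte c = bb.get(0);   // char field at offset 0
int i = bb.getInt(4); // int field at offset 4, skipping 3 padding bytes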
In case it's helpful, I've written more on ByteBuffer.
A handy way of reading/writing C structs in Java is to use the javolution Struct class (see http://www.javolution.org). This won't help you with automatically padding/aligning your data, but it does make working with raw data held in a ByteBuffer much more convenient. If you're not familiar with javolution, it's well worth a look as there's lots of other cool stuff in there too.
This hole is configurable: the compiler has switches to align structs on 1-, 2-, 4- or 8-byte boundaries.
So the first question is: which alignment exactly do you want to simulate?
With Java, the sizes of data types are defined by the language specification. For example, a byte type is 1 byte, short is 2 bytes, and so on. This is unlike C, where the size of each type is architecture-dependent.
Therefore, it would be important to know how the binary file is formatted in order to be able to read the file into Java.
It may be necessary to take steps to ensure that fields are a specific size, to account for differences in compiler or architecture. The mention of alignment seems to suggest that the output file will depend on the architecture.
You could try Preon:
Preon is a java library for building codecs for bitstream-compressed data in a declarative (annotation based) way. Think JAXB or Hibernate, but then for binary encoded data.
It can handle big/little-endian binary data, alignment (padding) and various numeric types, among other features. It is a very nice library, I like it very much.
My $0.02.
I highly recommend protocol buffers for exactly this problem.
As I understand it, you're saying that you don't control the output of the C program. You have to take it as given.
So do you have to read this file for some specific set of structures, or do you have to solve this in a general case? I mean, is the problem that someone said, "Here's the file created by program X, you have to read it in Java"? Or do they expect your Java program to read the C source code, find the structure definition, and then read it in Java?
If you've got a specific file to read, the problem isn't really very difficult. Either by reviewing the C compiler specifications or by studying example files, figure out where the padding is. Then on the Java side, read the file as a stream of bytes, and build the values you know are coming. Basically I'd write a set of functions to read the required number of bytes from an InputStream and turn them into the appropriate data type. Like:
int readInt(InputStream is, int len)
    throws IOException, PrematureEndOfDataException
{
    int n = 0;
    while (len-- > 0)
    {
        int i = is.read();
        if (i == -1)
            throw new PrematureEndOfDataException();
        n = (n << 8) + i; // i is already an unsigned byte value (0-255); casting to byte would sign-extend and corrupt n
    }
    return n;
}
You can alter the packing on the C side to ensure that no padding is used, or alternatively you can look at the resulting file format in a hex editor, which will let you write a parser in Java that skips the bytes that are padding.