Raw memory access in Java/Python

Memory-mapped hardware
On some computing architectures, pointers can be used to directly
manipulate memory or memory-mapped devices.
Assigning addresses to pointers is an invaluable tool when programming
microcontrollers. Below is a simple example declaring a pointer of
type int and initialising it to a hexadecimal address (in this example,
the constant 0x7FFF):
int *hardware_address = (int *)0x7FFF;
In the mid-1980s, using the BIOS to access the video capabilities of PCs
was slow. Display-intensive applications typically accessed CGA video
memory directly by casting the hexadecimal constant 0xB8000 to a pointer
to an array of 80 unsigned 16-bit int values.
Each value consisted of an ASCII code in the low byte, and a colour in
the high byte. Thus, to put the letter 'A' at row 5, column 2 in
bright white on blue, one would write code like the following:
#define VID ((unsigned short (*)[80])0xB8000)
void foo() {
    VID[4][1] = 0x1F00 | 'A';
}
Is such a thing possible in Java/Python in the absence of pointers?
EDIT:
Is such an access possible:
char* m_ptr=(char*)0x603920;
printf("\nm_ptr: %c",*m_ptr);
?

I'm totally uncertain of the context, and thus the useful application, of what you're trying to do, but here goes:
The Java Native Interface should allow direct memory access within the process space. Similarly, Python can load a C module that provides an access method.
Unless you've got a driver loaded by the system to do the interfacing, however, any hardware device memory will be out-of-bounds. Even then, the driver / kernel module must be the one to address non-application space memory.
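As an illustration of the EDIT above, here is a hedged sketch using JNA rather than raw JNI (the answer above does not mention JNA, so treat this as an assumption): JNA's Pointer can wrap an arbitrary in-process address, and dereferencing it behaves much like the C snippet, provided the address is actually mapped into your process; if it isn't, the JVM will simply crash.
import com.sun.jna.Pointer;

public class RawRead {
    public static void main(String[] args) {
        // Wrap an arbitrary address inside this process (0x603920 is just the
        // value from the question; it is almost certainly not mapped here).
        Pointer p = new Pointer(0x603920L);
        byte b = p.getByte(0);                      // roughly *(char *)0x603920
        System.out.printf("m_ptr: %c%n", (char) b);
    }
}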

If you are on an operating system with /dev/mem, you can create a MappedByteBuffer onto it and do this sort of thing.
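A minimal sketch of that idea, assuming a Linux box where you may open /dev/mem (normally root only), the kernel permits access to that physical range, and legacy VGA text memory really lives at 0xB8000:
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class DevMemDemo {
    public static void main(String[] args) throws Exception {
        // Map 4 KiB of physical memory starting at the old CGA/VGA text buffer.
        try (RandomAccessFile mem = new RandomAccessFile("/dev/mem", "rw");
             FileChannel ch = mem.getChannel()) {
            MappedByteBuffer vid = ch.map(FileChannel.MapMode.READ_WRITE, 0xB8000L, 4096);
            // Row 5, column 2: each cell is 2 bytes (character, then attribute).
            int offset = (4 * 80 + 1) * 2;
            vid.put(offset, (byte) 'A');        // character byte
            vid.put(offset + 1, (byte) 0x1F);   // attribute: bright white on blue
        }
    }
}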

Related

Buffer vs Unsafe - Outside JVM

I have a requirement to use a region of the available RAM that the GC has no control over. I read a few articles on this, which introduced me to two approaches. They are shown in the following code.
package com.directmemory;

import java.lang.reflect.Field;
import java.nio.ByteBuffer;

import sun.misc.Unsafe;

public class DirectMemoryTest {

    public static void main(String[] args) {
        // Approach 1
        ByteBuffer directByteBuffer = ByteBuffer.allocateDirect(8);
        directByteBuffer.putDouble(1.0);
        directByteBuffer.flip();
        System.out.println(directByteBuffer.getDouble());

        // Approach 2
        Unsafe unsafe = getUnsafe();
        long pointer = unsafe.allocateMemory(8);
        unsafe.putDouble(pointer, 2.0);
        unsafe.putDouble(pointer + 8, 3.0);
        System.out.println(unsafe.getDouble(pointer));
        System.out.println(unsafe.getDouble(pointer + 8));
        System.out.println(unsafe.getDouble(pointer + 16));
    }

    public static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
}
I have a couple of questions:
1) Why should I ever pay attention to approach 1, given that, as per my understanding, ByteBuffer.allocateDirect() cannot return a buffer with a capacity greater than 2 GB? So if my requirement is to store, say, 3 GB of data, I have to create a new buffer and store the data there, which means that apart from storing the data I have the additional responsibility of identifying the particular buffer (out of the list of 'n' buffers) that holds the pointer to the right piece of direct memory.
2) Isn't approach 2 a little faster than approach 1, because I don't have to first find the buffer and then the data? I just need an indexing mechanism for an object's field and can use the getDouble/getInt methods, passing the absolute address.
3) Is the allocation of direct memory (is it right to say off-heap memory?) tied to a PID? If I have two Java processes on one machine, will allocateMemory calls in PID 1 and PID 2 give me memory blocks that never intersect?
4) Why is the last sysout statement not printing 0.0? The idea is that every double uses 8 bytes, so I store 2.0 at the address returned by allocateMemory, say address 1, then 3.0 at address 1+8, which is 9, and then stop. So shouldn't the default value at the next address be 0.0?
One point to consider is that sun.misc.Unsafe is not a supported API. It will be replaced by something else (http://openjdk.java.net/jeps/260)
1) If your code must run unchanged with Java 8 to Java 10 (and later), approach 1 with ByteBuffers is the way to go.
If you're ready to replace the use of sun.misc.Unsafe with whatever replaces it in Java 9 / Java 10, you may well go with sun.misc.Unsafe.
2) For very large data structures with more than 2 GB, approach 2 might be faster due to the additional indirection needed in approach 1. However, without a solid (micro)benchmark I would not bet anything on it.
3) The allocated memory is always bound to the currently running JVM. So with two JVMs running on the same machine you will not get intersecting memory.
4) You are allocating 8 bytes of uninitialized memory. The only amount of memory you may legally access now is 8 bytes. For the memory beyond your allocated size no guarantees are made.
4a) You are writing 8 bytes beyond your allocated memory (unsafe.putDouble(pointer+8, 3.0);), which already leads to memory corruption and can lead to a JVM crash on the next memory allocation.
4b) You are reading 16 bytes beyond your allocated memory, which (depending on your processor architecture and operating system and previous memory use) can lead to an immediate JVM crash.
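To illustrate the indirection discussed in 1) and 2), here is a hedged sketch of spreading more than 2 GB across several direct ByteBuffers behind a single long-addressed facade. The class name and chunking scheme are made up for this example, and a real run needs enough direct memory (e.g. -XX:MaxDirectMemorySize=4g):
import java.nio.ByteBuffer;

public class ChunkedDirectBuffer {
    private static final int CHUNK_SIZE = 1 << 30; // 1 GiB per chunk
    private final ByteBuffer[] chunks;

    public ChunkedDirectBuffer(long capacity) {
        int n = (int) ((capacity + CHUNK_SIZE - 1) / CHUNK_SIZE);
        chunks = new ByteBuffer[n];
        for (int i = 0; i < n; i++) {
            chunks[i] = ByteBuffer.allocateDirect(CHUNK_SIZE);
        }
    }

    // Assumes the 8 bytes of a double never straddle a chunk boundary.
    public void putDouble(long offset, double value) {
        chunks[(int) (offset / CHUNK_SIZE)].putDouble((int) (offset % CHUNK_SIZE), value);
    }

    public double getDouble(long offset) {
        return chunks[(int) (offset / CHUNK_SIZE)].getDouble((int) (offset % CHUNK_SIZE));
    }

    public static void main(String[] args) {
        ChunkedDirectBuffer buf = new ChunkedDirectBuffer(3L << 30); // ~3 GiB
        buf.putDouble((2L << 30) + 16, 42.0);
        System.out.println(buf.getDouble((2L << 30) + 16)); // 42.0
    }
}
Every access pays for the extra array lookup and the division/modulo, which is exactly the additional indirection the answer refers to.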

File size vs. in memory size in Java

If I take an XML file that is around 2kB on disk and load the contents as a String into memory in Java and then measure the object size it's around 33kB.
Why the huge increase in size?
If I do the same thing in C++ the resulting string object in memory is much closer to the 2kB.
To measure the memory in Java I'm using Instrumentation.
For C++, I take the length of the serialized object (e.g. the string).
I think there are multiple factors involved.
First of all, as Bruce Martin said, objects in Java have an overhead of 16 bytes per object; C++ objects do not.
Second, Strings in Java might use 2 bytes per character instead of 1.
Third, it could be that Java reserves more memory for its Strings than C++'s std::string does.
Please note that these are just ideas where the big difference might come from.
Assuming that your XML file contains mainly ASCII characters and uses an encoding that represents them as single bytes, you can expect the in-memory size to be at least double, since Java uses UTF-16 internally (I've heard of some JVMs that try to optimize this, though). Added to that is the overhead of two objects (the String instance and an internal char array) with some fields, IIRC about 40 bytes overall.
So your "object size" of 33 kB is definitely not correct, unless you're using a weird JVM. There must be some problem with the method you use to measure it.
In Java, a String object carries some extra data that increases its size: the object header, the backing char array, and some other fields such as the array reference, offset, and length.
Visit http://www.javamex.com/tutorials/memory/string_memory_usage.shtml for details.
String: a String's memory growth tracks its internal char array's growth. However, the String class adds another 24 bytes of overhead.
For a nonempty String of size 10 characters or less, the added overhead cost relative to useful payload (2 bytes for each char plus 4 bytes for the length), ranges from 100 to 400 percent.
More:
What is the memory consumption of an object in Java?
Yes, you should run the GC and give it time to finish: just call System.gc(); and print the used memory in a loop. You would also be better off creating a million string copies in an array (measure the size of the empty array first, then the array filled with strings), to be sure that you measure the size of the strings and not of other service objects that may be present in your program. A String alone cannot take 32 kB, but a hierarchy of XML objects can.
That said, I cannot resist the irony that nobody cares about memory (and cache hits) in the world of Java. We know that the JIT keeps improving and can outperform native C++ code in some cases, so there is no need to bother with memory optimization. Premature optimization is the root of all evil.
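A minimal sketch of that measurement approach (the copy count and the sample string are arbitrary, and Runtime-based measurement is only an approximation, since System.gc() is merely a hint):
public class StringMemoryEstimate {
    private static final int COPIES = 1_000_000;

    private static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        rt.gc();                                  // only a hint, but good enough for a rough estimate
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        String sample = "<xml>a small sample document</xml>";

        String[] holder = new String[COPIES];     // allocate the array before measuring
        long before = usedHeap();
        for (int i = 0; i < COPIES; i++) {
            holder[i] = new String(sample.toCharArray()); // distinct copy with its own backing array
        }
        long after = usedHeap();

        System.out.printf("approx. bytes per String: %d%n", (after - before) / COPIES);
        System.out.println(holder.length);        // keep the array reachable until here
    }
}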
As stated in other answers, Java's String adds overhead. If you need to store a large number of strings in memory, I suggest you store them as byte[] instead. That way the size in memory should be the same as the size on disk.
String -> byte[] :
String a = "hello";
byte[] aBytes = a.getBytes(StandardCharsets.UTF_8); // be explicit about the charset
byte[] -> String :
String b = new String(aBytes, StandardCharsets.UTF_8);

What does 'base' mean in JNA's Pointer.getPointerArray(long base) and Pointer.getStringArray(long base)?

What does 'base' mean in JNA's
Pointer.getPointerArray(long base)
Pointer.getStringArray(long base)
?
The JNA Documentation for Pointer doesn't explain what 'base' this is supposed to refer to.
If it is a text-formatting base, then why is it passed to getPointerArray as well?
Could it refer to the number of bits of a memory address? Why would it need such a thing passed in (within Java, couldn't it figure that out itself, and if not, how could I?)
And if it is the address width, why use long? Preparing for the future? Does the JNA project foresee a massive machine with a memory address bus which is 1E19 bits wide?
Is it supposed to be a long with all bits set to 1?
Could it refer to the hardware base of the host machine? Could this be anything other than 2 for binary?
Is it supposed to be an offset?
Could it be the array termination character? What if my termination character exceeds 64 bits? What if it is less than 64 bits?
Digging through JNA's source for the Pointer class,
Pointer.getPointerArray(long base)
Pointer.getStringArray(long base)
... apparently refer to 'base' in the context of the base+offset memory addressing modes found at the assembler/hardware level, where one register stores a memory address while a second register stores an offset to that address, and the two are summed automatically during a memory access. Ideally a pointer would be the 'base address', and as you iterate over the memory's contents you adjust the 'offset'.
So basically 'base' means 'offset' in this context: the method starts at 'base' bytes after the pointer's location and then spits out Pointer/String objects based on the addresses it reads from those parts of memory, until it finds a null value. I speculate that the reason the word 'base' is used has to do with how the method is coded internally:
It creates a second Pointer object based on itself, but passes your 'base' argument as an 'offset', and then creates an index variable called 'offset'... yeah. Then it iterates, incrementing 'offset' by the address word size in bytes (typically 8) until it reads a null value.
Because 'offset' is already taken by that local variable and would clash with the parameter, the coder named the method's offset parameter 'base'.
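A small sketch of how that reads in practice, assuming a reasonably recent JNA (Native.POINTER_SIZE, Memory, and getStringArray as documented); the NULL-terminated pointer array is built by hand here just to have something to read:
import com.sun.jna.Memory;
import com.sun.jna.Native;

public class StringArrayDemo {
    public static void main(String[] args) {
        int ptrSize = Native.POINTER_SIZE;

        // Two native, NUL-terminated C strings.
        Memory s1 = new Memory(6);
        s1.setString(0, "hello");
        Memory s2 = new Memory(6);
        s2.setString(0, "world");

        // A char*[] holding them; the third slot stays zero as the NULL terminator.
        Memory array = new Memory(3L * ptrSize);
        array.clear();
        array.setPointer(0, s1);
        array.setPointer(ptrSize, s2);

        // 'base' is simply the byte offset into 'array' at which to start reading pointers.
        String[] all = array.getStringArray(0);               // ["hello", "world"]
        String[] fromSecond = array.getStringArray(ptrSize);  // ["world"]

        System.out.println(java.util.Arrays.toString(all));
        System.out.println(java.util.Arrays.toString(fromSecond));
    }
}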

Huge String Table in Java

I've got a question about storing a huge number of Strings in application memory. I need to load from a file and store about 5 million lines, each of them at most 255 chars (URLs), but mostly ~50. From time to time I'll need to search for one of them. Is it possible to make this app runnable on ~1 GB of RAM?
Will
ArrayList <String> list = new ArrayList<String>();
work?
As far as I know, String in Java is coded in UTF-8, which gives me huge memory use. Is it possible to make such an array with Strings coded in ANSI?
This is console application run with parameters:
java -Xmx1024M -Xms1024M -jar "PServer.jar" nogui
The latest JVMs support -XX:+UseCompressedStrings (enabled by default), which stores strings that only use ASCII as a byte[] internally.
Having several GB of text in a List isn't a problem, but it can take a while to load from disk (many seconds).
If the average URL is 50 ASCII chars, with 32 bytes of overhead per String, 5 M entries could use about 400 MB, which isn't much for a modern PC or server.
A Java String is a full-blown object. This means that apart from the characters of the string themselves, there is other information to store in it (an object header with a pointer to the object's class, identity-hash and lock information, and some other infrastructure data). So an empty String already takes 45 bytes in memory (as you can see here).
Now you just have to add the maximum length of your strings and do some easy calculations to get the maximum memory used by that list.
Anyway, I would suggest you load the strings as byte[] if you have memory issues. That way you can control the encoding and you can still do searches.
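A hedged sketch of that byte[] approach (the file name, the ASCII assumption, and the UrlTable class are all made up for this example): store each URL as a byte[], sort once, and binary-search with an unsigned-byte comparator. Note that readAllLines briefly holds the Strings as well; streaming line by line would keep the peak memory lower.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class UrlTable {
    // Lexicographic comparison of unsigned bytes, matching String order for ASCII.
    private static final Comparator<byte[]> CMP = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    };

    private final byte[][] urls;

    public UrlTable(List<String> lines) {
        urls = new byte[lines.size()][];
        for (int i = 0; i < urls.length; i++) {
            urls[i] = lines.get(i).getBytes(StandardCharsets.US_ASCII);
        }
        Arrays.sort(urls, CMP);
    }

    public boolean contains(String url) {
        byte[] key = url.getBytes(StandardCharsets.US_ASCII);
        return Arrays.binarySearch(urls, key, CMP) >= 0;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("urls.txt"), StandardCharsets.US_ASCII);
        UrlTable table = new UrlTable(lines);
        System.out.println(table.contains("http://example.com/"));
    }
}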
Is there some reason you need to restrict it to 1G? If you want to search through them, you definitely don't want to swap to disk, but if the machine has more memory it makes sense to go higher then 1G.
If you have to search, use a SortedSet, not an ArrayList

Replicating C struct padding in Java

According to here, the C compiler will pad out values when writing a structure to a binary file. As the example in the link says, when writing a struct like this:
struct {
    char c;
    int i;
} a;
to a binary file, the compiler will usually leave an unnamed, unused hole between the char and int fields, to ensure that the int field is properly aligned.
How could I create an exact replica of the binary output file (generated in C), using a different language (in my case, Java)?
Is there an automatic way to apply C padding in Java output? Or do I have to go through compiler documentation to see how it works (the compiler is g++ by the way).
Don't do this, it is brittle and will lead to alignment and endianness bugs.
For external data it is much better to explicitly define the format in terms of bytes and write explicit functions to convert between internal and external format, using shift and masks (not union!).
This is true not only when writing to files, but also in memory. It is the fact that the struct is padded in memory, that leads to the padding showing up in the file, if the struct is written out byte-by-byte.
It is in general very hard to replicate with certainty the exact padding scheme, although I guess some heuristics would get you quite far. It helps if you have the struct declaration, for analysis.
Typically, fields larger than one char will be aligned so that their starting offset inside the structure is a multiple of their size. This means shorts will generally be on even offsets (divisible by 2, assuming sizeof (short) == 2), while doubles will be on offsets divisible by 8, and so on.
UPDATE: It is for reasons like this (and also reasons having to do with endianness) that it is generally a bad idea to dump whole structs out to files. It's better to do it field-by-field, like so:
put_char(out, a.c);
put_int(out, a.i);
Assuming the put-functions only write the bytes needed for the value, this will emit a padding-less version of the struct to the file, solving the problem. It is also possible to ensure a proper, known, byte-ordering by writing these functions accordingly.
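On the Java side, the same field-by-field idea might look like this sketch (the file name is a placeholder; note that DataOutputStream always writes big-endian, so if the C reader expects little-endian values, use a ByteBuffer with an explicit ByteOrder instead):
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class WriteStruct {
    public static void main(String[] args) throws IOException {
        char c = 'A';
        int i = 42;
        // Write exactly the bytes that carry data, with no padding between fields.
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream("a.bin"))) {
            out.writeByte(c);   // 1 byte for the char field
            out.writeInt(i);    // 4 bytes for the int field, big-endian
        }
    }
}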
Is there an automatic way to apply C padding in Java output? Or do I have to go through compiler documentation to see how it works (the compiler is g++ by the way).
Neither. Instead, you explicitly specify a data/communication format and implement that specification, rather than relying on implementation details of the C compiler. You won't even get the same output from different C compilers.
For interoperability, look at the ByteBuffer class.
Essentially, you create a buffer of a certain size, put() variables of different types at different positions, and then call array() at the end to retrieve the "raw" data representation:
ByteBuffer bb = ByteBuffer.allocate(8);
bb.order(ByteOrder.LITTLE_ENDIAN);
bb.put(0, (byte) someChar);   // the char field at offset 0
bb.putInt(4, someInteger);    // the int field at offset 4; offsets 1-3 are left as padding
byte[] rawBytes = bb.array();
But it's up to you to work out where to put padding-- i.e. how many bytes to skip between positions.
For reading data written from C, then you generally wrap() a ByteBuffer around some byte array that you've read from a file.
In case it's helpful, I've written more on ByteBuffer.
A handy way of reading/writing C structs in Java is to use the javolution Struct class (see http://www.javolution.org). This won't help you with automatically padding/aligning your data, but it does make working with raw data held in a ByteBuffer much more convenient. If you're not familiar with javolution, it's well worth a look as there's lots of other cool stuff in there too.
This hole is configurable; the compiler has switches to align structs on 1-, 2-, 4- or 8-byte boundaries.
So the first question is: which alignment exactly do you want to simulate?
With Java, the sizes of data types are defined by the language specification. For example, a byte is 1 byte, a short is 2 bytes, and so on. This is unlike C, where the size of each type is architecture-dependent.
Therefore, it would be important to know how the binary file is formatted in order to be able to read the file into Java.
It may be necessary to take steps to ensure that fields are a specific size, to account for differences in the compiler or architecture. The mention of alignment seems to suggest that the output file will depend on the architecture.
You could try Preon:
Preon is a java library for building codecs for bitstream-compressed data in a
declarative (annotation based) way. Think JAXB or Hibernate, but then for binary
encoded data.
It can handle big/little-endian binary data, alignment (padding) and various numeric types, among other features. It is a very nice library; I like it very much.
My $0.02.
I highly recommend protocol buffers for exactly this problem.
As I understand it, you're saying that you don't control the output of the C program. You have to take it as given.
So do you have to read this file for some specific set of structures, or do you have to solve this in a general case? I mean, is the problem that someone said, "Here's the file created by program X, you have to read it in Java"? Or do they expect your Java program to read the C source code, find the structure definition, and then read it in Java?
If you've got a specific file to read, the problem isn't really very difficult. Either by reviewing the C compiler specifications or by studying example files, figure out where the padding is. Then on the Java side, read the file as a stream of bytes, and build the values you know are coming. Basically I'd write a set of functions to read the required number of bytes from an InputStream and turn them into the appropriate data type. Like:
int readInt(InputStream is, int len)
    throws IOException, PrematureEndOfDataException
{
    int n = 0;
    while (len-- > 0)
    {
        int i = is.read();    // next byte as 0-255, or -1 at end of stream
        if (i == -1)
            throw new PrematureEndOfDataException();
        n = (n << 8) + i;     // no sign extension: i is already in the range 0-255
    }
    return n;
}
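Using helpers like that, reading the struct from the question might look like the following sketch (the file name is a placeholder, the 3 skipped bytes assume the int is aligned to offset 4, and DataInputStream.readInt is big-endian; for a file written on a little-endian machine, reverse the bytes or use a ByteBuffer with ByteOrder.LITTLE_ENDIAN):
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class ReadStruct {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream("a.bin"))) {
            char c = (char) in.readUnsignedByte(); // the char field at offset 0
            in.skipBytes(3);                       // the unnamed padding hole
            int i = in.readInt();                  // the int field at offset 4 (big-endian)
            System.out.println(c + " " + i);
        }
    }
}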
You can alter the packing on the C side to ensure that no padding is used, or alternatively you can look at the resulting file format in a hex editor, which will let you write a parser in Java that ignores the bytes that are padding.
