I’m working on a Java program to create a really large Neo4J database. I use the BatchInserter and Executors.newFixedThreadPool to speed things up. My Win2012R2 server has 2 CPUs (2x6 cores + 2x6 hyper-threads) and 256GB of RAM in a NUMA architecture. My problem is that my importer only uses one CPU (NUMA node).
Is it possible to use both NUMA nodes with only one Java process?
Java options: -XX:+UseNUMA -Xmx64g -Xms64g
It isn't clear how much memory is assigned to each node - is it 256GB in total, or 128GB per node? Either way, as I understand it, setting a max heap size smaller than the amount of memory assigned to the node will usually mean the application stays affined to a single node. This is true under Windows, Solaris and Linux, as far as I'm aware.
Even if you allocate a JVM max heap size greater than the memory assigned to a node, the process won't spill over as long as the heap doesn't actually grow beyond that size, because the JVM object allocator will always try to create a new object in the same memory pool as the creating thread - and that includes new thread objects.
The primary design goal of the NUMA architecture is to enable different processes to operate on different CPUs with each CPU having localised memory access, rather than having all CPUs contend for the same global shared memory. Having the same process running across multiple nodes is not necessarily that efficient, unless you can arrange for a particular thread to always use the local memory associated with a specific node (thread affinity). Otherwise, remote memory access will slow you down.
I suspect that to use more than one node in your example you will need to either assign different tasks to different nodes, or parallelise the same task across multiple nodes. In the latter case you'll need to ensure that each node has a copy of the same data in local memory. There are libraries available to manage thread affinity from your Java code.
https://github.com/peter-lawrey/Java-Thread-Affinity
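For illustration, here is a minimal sketch of pinning a thread with that library (OpenHFT's Java-Thread-Affinity; the AffinityLock API is as I recall it from the project's documentation, so verify it against the version you actually use):

import net.openhft.affinity.AffinityLock;

public class PinnedWorker {
    public static void main(String[] args) {
        new Thread(() -> {
            // Acquire a CPU for this thread; it is released when the try block exits.
            try (AffinityLock lock = AffinityLock.acquireLock()) {
                System.out.println("pinned to CPU " + lock.cpuId());
                // ... do the NUMA-local part of the work here ...
            }
        }, "pinned-worker").start();
    }
}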
The BatchInserter is single-threaded. You should use the import tool instead. See http://neo4j.com/docs/stable/import-tool.html
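For reference, a typical invocation of the 2.x-era tool looks something like this (flag names taken from the linked docs; the CSV file names here are placeholders): bin/neo4j-import --into graph.db --nodes nodes.csv --relationships rels.csv. Unlike the BatchInserter, the import tool parallelises the work across CPUs.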
Related
If I have a multithreaded program and one dual-core CPU in the machine, how many threads can I create in parallel for these 2 cores? In some articles I've seen that a CPU core can handle only one thread. Does that mean I can create only 2 threads? Or can I create many threads, with only 2 of them being executed by the cores at any given moment? I know this question is simple, but I'm a little confused.
Modern hardware and OS combinations can easily handle thousands of them. And they can all be in the 'running state'.
However, let's say you have 2000 threads, 2 CPUs in the box, and each CPU has 16 cores; then at most 32 of your running threads are truly 'executing stuff' - the others are ready to go if any of the 32 currently executing threads does something that isn't CPU-bound (for example, waiting for bytes to flow in from the network, or for bytes to be read from a disk), or simply when time passes (eventually the CPU+OS+JVM work together to pre-empt an executing thread - to pause it so that it doesn't hog the resources).
The bottleneck you're most likely to run into first is memory: each thread needs a stack, and stacks take memory. If, for example, you are working with 1MB stacks, creating 2000 threads means you now have 2GB worth of stack space alone, and the Java heap (where all objects live) probably also takes a GB or 2 - that's 4GB. On your average provisioned IaaS box you don't have that much.
A simple solution to that problem is that you can actually control the stack size when you create a new thread object. You can use this to make much smaller stacks - for example, 64k stacks - so 2000 threads only take an eighth of a GB. Of course, a 64k stack is not capable of running particularly deeply nested methods; usually, if you're creating thousands of threads, the code they actually run should be relatively simple, with no large stack frames or recursive calls.
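As a concrete illustration, the stack size can be passed directly to the Thread constructor (a sketch; note that the javadoc describes the value as a hint which some platforms ignore):

// The fourth constructor argument is the requested stack size in bytes.
Runnable task = () -> {
    // keep call depth shallow here: a 64k stack overflows quickly
};
Thread t = new Thread(null, task, "small-stack-worker", 64 * 1024);
t.start();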
If you mess this up, you'll get StackOverflowError. Then you know to either adjust the code or increase the stack sizes.
If you're not making thousands of threads (merely, say, hundreds), don't worry about it. Just make 'em and run 'em, and trust the OS+CPU+JVM to sort it out for you.
In a single-threaded program, how are changes made by the thread on core 1 made visible to core 2, so that after a context switch the thread (now running on core 2) sees the updated value?
Consider the following example:
The value in main memory for variable x is 10.
The thread runs on core 1 and changes x to 5, which is still in cache and not yet flushed to main memory as we are not using any memory barrier.
A context switch occurs and the thread moves from core 1 to core 2.
The thread reads the value of x.
What would be the value of x if thread resumes execution in core 2 after the context switch?
If "cache coherence" manages consistency to handle a case like above then why do we need for explicit locking (or any read/write barrier) in a multi-threaded program?
Considering your first question: context switches also preserve the register contents, so the thread sees the latest value even when moved to another core (or CPU).
However, in a multi-threaded program, CPU registers are distinct for different threads (regardless of how many cores the threads are executed on), and registers are not part of cache coherency.
Therefore, I think, a multi-threaded program does need to make sure the values in the registers are up-to-date with the values in the main memory.
(Cache coherence only makes sure that the CPU cache is up-to-date with the memory).
Therefore, I suppose, you need a barrier to synchronize the register with the memory.
You can think of it like this: the program essentially operates only on the main memory. However, compilers optimise access to main memory and use registers for intermediate operations.
Thus, the program accesses only memory and registers.
However, the CPU also introduces its own cache of the memory.
Reads and writes from/to the memory are internally (by the CPU) optimised by the cache.
Cache coherency only ensures, within the CPU, that the cache is up-to-date (and therefore that a program accessing memory gets the correct value).
To sum up:
Cache coherence ensures cache and memory are up-to-date; it is out of the program's control, as it is internal to the CPU.
Context switches are handled by the operating system, which ensures correct values of registers when it moves threads to different cores.
Memory barriers ensure that the registers and memory are up-to-date; this is what the program has to ensure.
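A small Java sketch of that last point: without volatile (which makes the JVM emit the needed barriers), the reader thread may keep working from a stale register or cache line and never observe the write.

import java.util.concurrent.TimeUnit;

public class VisibilityDemo {
    // volatile forces reads/writes through to memory with the proper barriers;
    // without it, the reader below may legally spin forever on a stale value.
    static volatile boolean done = false;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!done) { /* spin until the write becomes visible */ }
            System.out.println("saw the update");
        });
        reader.start();
        TimeUnit.MILLISECONDS.sleep(100);
        done = true; // guaranteed to become visible to the reader
    }
}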
I wanted to understand what data structures the heap managers (the JVM in Java, or the OS/runtime in the case of C and C++) use to keep track of the memory locations used by threads and processes. One way is to use a map from objects to memory addresses, plus a reverse map from a block's starting address to the size of the object stored there.
But then it can't serve new memory requests in O(1) time. Is there a better data structure for this?
Note that unmanaged languages are going to be allocating/freeing memory through system calls, generally not managing it themselves. Still, regardless of the level of abstraction (from the OS down to the runtime), something has to deal with this:
One method is called buddy block allocation, described well with an example on Wikipedia. It essentially keeps track of the usage of memory blocks of varying sizes (typically multiples of 2). This can be done with a number of arrays and clever indexing, or perhaps more intuitively with a binary tree, where each node tells whether a certain block is free and all nodes on one level represent blocks of the same size.
This suffers from fragmentation: as things come and go, you might end up with your data scattered rather than efficiently consolidated, making it harder to fit in large data. This could be countered by a more complicated, dynamic system, but buddy blocks have the advantage of simplicity.
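To make the idea concrete, here is a deliberately tiny buddy allocator in Java (a sketch of the technique over abstract 'units', not how any real OS or runtime codes it): an array-backed binary tree where longest[i] records the largest free block available under node i.

class BuddyAllocator {
    private final int size;
    private final int[] longest; // tree in an array: node 1 is the root, children of i are 2i and 2i+1

    BuddyAllocator(int size) {
        this.size = size; // assumed to be a power of two
        longest = new int[2 * size];
        for (int i = 1; i < 2 * size; i++)
            longest[i] = size / Integer.highestOneBit(i); // block size at this tree level
    }

    // Returns the offset of a free block of at least n units, or -1 if none fits.
    int alloc(int n) {
        int want = Integer.highestOneBit(Math.max(n, 1));
        if (want < n) want <<= 1;               // round the request up to a power of two
        if (longest[1] < want) return -1;       // no block anywhere is big enough
        int index = 1, nodeSize = size;
        while (nodeSize != want) {              // walk down toward a fitting block
            index = (longest[2 * index] >= want) ? 2 * index : 2 * index + 1;
            nodeSize /= 2;
        }
        longest[index] = 0;                     // mark the block used
        int offset = index * nodeSize - size;
        for (int i = index / 2; i >= 1; i /= 2) // re-derive each ancestor's best free block
            longest[i] = Math.max(longest[2 * i], longest[2 * i + 1]);
        return offset;
    }

    // Frees a block previously returned by alloc(); merges with its buddy where possible.
    void free(int offset) {
        int index = offset + size, nodeSize = 1;
        while (longest[index] != 0) {           // climb from the leaf to the allocated node
            index /= 2;
            nodeSize *= 2;
        }
        longest[index] = nodeSize;              // the whole block is free again
        while ((index /= 2) >= 1) {             // merge free buddies on the way up
            nodeSize *= 2;
            int l = longest[2 * index], r = longest[2 * index + 1];
            longest[index] = (l + r == nodeSize) ? nodeSize : Math.max(l, r);
        }
    }
}

A request for 3 units takes a whole 4-unit block; that rounding up to powers of two is the internal-fragmentation cost of the scheme, on top of the scattering described above.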
The OS keeps track of the process's memory allocation in an overall view - 4KB pages or bigger "lumps" are stored in some form of list.
In the typical Windows implementation (Microsoft's C runtime library), at least in recent versions, all memory allocations are done through the HeapAlloc() system call, so every single heap allocation goes through to the OS. Whether the OS actually tracks every single allocation or just keeps a map of "what is free, what is used" is another matter. It is my understanding that the heap management code has no list of "current allocations", just a list of freed memory lumps.
In Linux/Unix, the C library will typically avoid calling the OS for every little allocation, and instead uses a large lump of memory, and splits that up into smaller pieces per allocation. Again, no tracking of allocated memory inside the heap management.
This is done at a process level. I'm not aware of an operating system that differentiates memory allocations on a per-thread level (other than TLS - thread local storage, but that is typically a very small region, outside of the typical heap code management).
So, in summary: the OS and/or C/C++ runtime doesn't actually keep a list of all the used allocations - it keeps a list of freed memory [and when another lump is freed, it will typically "join" it with the previous and next consecutive free lumps to reduce fragmentation]. When the allocator is first started, it's given a large lump, which is registered as a single free block. When a request is made, the lump is split: one part satisfies the request and the remainder goes back on the free list. When that lump is not sufficient, another big lump is carved off using the underlying OS allocations.
There is a small amount of metadata stored with each allocation, which contains things like "how much memory is allocated", and this metadata is used when freeing the memory. In the typical case, this data is stored immediately before the allocated memory. But there is no way to find the allocation metadata without knowing about the allocations in some other way.
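As a toy illustration of that "metadata just before the allocation" idea (simulated over a Java array with made-up names - real C runtimes do the same thing with raw pointers, and recycle through a free list rather than a bump pointer):

class HeaderedHeap {
    private final long[] heap = new long[1024];
    private int top = 0; // bump pointer; a real heap would reuse freed blocks

    // Hands out 'words' units and hides a one-word header before the payload.
    int alloc(int words) {
        int header = top;
        heap[header] = words;   // the metadata that free() would later rely on
        top += words + 1;       // skip header + payload
        return header + 1;      // the caller only ever sees the payload offset
    }

    // What free() would do first: read the size stashed just before the payload.
    int sizeOf(int payloadOffset) {
        return (int) heap[payloadOffset - 1];
    }
}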
There is no automatic garbage collection in C++. You need to call free/delete for malloc/new heap memory allocations. That's where tools like Valgrind (for checking memory leaks) come in handy. There are also concepts like auto_ptr, which automatically frees heap memory, that you can look into.
I've set up a GlassFish cluster with 1 DAS and 2 node agents.
The system has TimedObjects which are batched once a day. Per the GlassFish architecture, only one cluster instance is allowed to trigger the timeout event of each Timer created by the TimerService.
My problem is about the heap size of the cluster instance which triggers the batch job. VisualVM shows that one instance always has a scalable heap size (it increases when the server is loaded and decreases afterwards), but the other one always has its heap size at the maximum and it never decreases.
It would be acceptable to tell me that the heap size is at the maximum because the batch job is huge. But the one question I have is: why does it not decrease after the job is done?
VisualVM shows that the "Used Heap Memory" of the instance which triggers the timeout event decreases after the batch job. But why is its "Heap Size" not scaled down accordingly?
Thank you for your advice!!! ^^
Presumably you have something still referencing the memory. I suggest getting a copy of MAT (the Eclipse Memory Analyzer) and taking a heap dump. From there you can see what's been allocated and what is referencing it.
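If it helps: a dump for MAT can be produced externally with jmap -dump:live,format=b,file=heap.hprof <pid>, or from inside the JVM via the HotSpot diagnostic MXBean, roughly like this (a sketch; com.sun.management is HotSpot-specific):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // true = only live objects (forces a GC first, like jmap's 'live' option)
        bean.dumpHeap("heap.hprof", true);
    }
}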
This is the final answer (thanks Preston ^^)
From the article:
http://www.ibm.com/developerworks/java/library/j-nativememory-linux/index.html
I captured these statements to answer my question!
1:
"Runtime environments (JVM) provide capabilities that are driven by some unknown user code; that makes it impossible to predict which resources the runtime environment will require in every situation"
2: This is why the node which triggers the batch job always consumes the memory at all times.
"Reserving native memory is not the same as allocating it. When native memory is reserved, it is not backed with physical memory or other storage. Although reserving chunks of the address space will not exhaust physical resources, it does prevent that memory from being used for other purposes"
3: And this is why the node which does not trigger the batch job shows the scalable heap-size behaviour.
"Some garbage collectors minimise the use of physical memory by decommitting (releasing the backing storage for) parts of the heap as the used area of heap shrinks."
What is the rough "cost" of using threads in Java? Are there any rules of thumb or empirical values for how much memory the creation of one thread costs? Is there a rough estimate of how many CPU cycles it takes to create a thread?
Context: In a servlet of a web application I want to parallelize the content creation, as parts of the content are file based, database based, and web-service based. But this would mean that for every "http-request thread" (of my servlet container) I will have two to four additional threads. Note that I will be using the ExecutorService in Java 6.
What should I expect when I use hundreds to thousands of Java threads on a web-server?
Each thread has its own stack, and consequently there's an immediate memory impact. The default thread stack size is, IIRC, 512k for Java 6 (different JVMs/versions will possibly have different defaults). This figure is adjustable using the -Xss option. Consequently, using hundreds of threads will have an impact on the memory the VM consumes (quite possibly before any CPU impact, unless those threads are running).
I've seen clients run into problems related to threads and memory, since the link isn't obvious. It's trivial to create 100,000 threads (using executors/pools etc.), and the resulting memory problems don't appear to be immediately attributable to them.
If you're servicing many clients, you may want to take a look at the Java NIO API and in particular multiplexing, which allows asynchronous network programming. That will permit you to handle many clients with only one thread, and consequently reduce your requirement for a huge number of threads.
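A minimal sketch of that multiplexing pattern (standard java.nio API; error handling omitted for brevity):

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;

public class EchoSelector {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);            // non-blocking mode is required for select()
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {                              // one thread services every client
            selector.select();                      // block until some channel is ready
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(1024);
                    if (client.read(buf) < 0) { client.close(); continue; }
                    buf.flip();
                    client.write(buf);              // echo back what was read
                }
            }
        }
    }
}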
That depends on the OS, the Java version, and the CPU. The only way to figure it out is to try it and measure the results.
Since you'll be using the ExecutorService, it will be simple to control the number of threads. Don't use too few or requests will stack up. If you use too many, you'll run into performance problems with your file system and the DB long before Java runs out of threads.
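To make that concrete for the servlet scenario above, a sketch (the pool size is a starting guess to tune by measurement, the three loader methods are hypothetical placeholders, and the lambda/method-reference syntax is Java 8+ - under Java 6 you'd spell them as anonymous Callables):

import java.util.concurrent.*;

public class ContentAssembler {
    // One shared, bounded pool for the whole webapp - never one pool per request.
    private static final ExecutorService POOL =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() * 2);

    public String assemble() throws Exception {
        // The three parts load in parallel; the request thread just joins them.
        Future<String> filePart = POOL.submit(this::loadFromFile);
        Future<String> dbPart   = POOL.submit(this::loadFromDatabase);
        Future<String> wsPart   = POOL.submit(this::callWebService);
        return filePart.get() + dbPart.get() + wsPart.get();
    }

    // Placeholders standing in for the real file/DB/web-service loaders.
    private String loadFromFile()     { return "file"; }
    private String loadFromDatabase() { return "db";   }
    private String callWebService()   { return "ws";   }
}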
During preparation of a magazine article about fibers (aka Project Loom) I ran some simple tests (Windows 10, JDK-Loom 15.b3):
AtomicInteger counter = new AtomicInteger(10000); // one tick per thread
AtomicBoolean go = new AtomicBoolean(false);
for (int i = 0; i < 10000; i++) {
    // Thread.newThread(Thread.VIRTUAL, ...) is the early-access Loom API of that build
    Thread.newThread(Thread.VIRTUAL, () -> { // <-- remove Thread.VIRTUAL for plain threads
        try {
            while (!go.get()) Thread.sleep(1); // wait until the main thread flips 'go'
        } catch (InterruptedException ignored) { }
        counter.decrementAndGet();
    }).start();
}
My Windows desktop (i7-8700K) needs about 400000 ms to create all 10000 threads and an additional 200 ms to run the counter down.
Surprisingly, I could not confirm the memory consumption of 512k per thread (1MB according to some other sources): the Windows memory monitor shows an additional memory consumption of only about 500MB for all 10000 threads (50k per thread).
Project Loom's fibers manage to run the test in 30 and 50 ms respectively, and show no measurable memory consumption.