I have a Java application which is started and stopped multiple times per second over hundreds of millions of items (it is called from an external script).
Input: String key
Output: int value
The purpose of this application is to look for a certain key in a never ever ever changing Map (~30k keys) and to return the value. Very easy.
Question: what is more efficient when used multiple times per second:
hard-coded dictionary in a Map
Read an external file with a BufferedReader
...amaze me with your other ideas
I know hard-coding is evil but sometimes, you need to be evil to be efficient :-)
Read in the dictionary from file. Store it in a Map. Set up your Java application as a service that runs continuously (since you said it gets called many times per second). Then your Map will be cached in RAM.
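For example, a minimal sketch of that load-once approach, assuming a tab-separated "key<TAB>value" file (the file name and format here are made up):
Map<String, Integer> dictionary = new HashMap<>();
try (BufferedReader reader = new BufferedReader(new FileReader("dictionary.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t");                    // hypothetical "key<TAB>value" format
        dictionary.put(parts[0], Integer.parseInt(parts[1])); // ~30k entries fit easily in the heap
    }
}
// after this one-time load, every lookup is just dictionary.get(key)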
The fastest is a hard coded map in memory.
If you have a huge file you can use a memory-mapped file:
long LENGTH = new File("map.txt").length();
MappedByteBuffer in = new FileInputStream("map.txt").getChannel().map(
        FileChannel.MapMode.READ_ONLY, 0, LENGTH);
StringBuilder bs = new StringBuilder();
int i = 0;
// read the first 1/4 of the file
while (i < LENGTH / 4)
    bs.append((char) in.get(i++));
This approach is a bit problematic though; in practice you will want to partition the file on line breaks, i.e. read until the 100th line, clear the buffer, and read some more.
I would load the file into a Map at startup of the application, and then use it as you describe.
I would store the data in a database for faster load times.
Definitely do not have the application start up and shut down every time it is called; run it as a service that waits for I/O, using asynchronous I/O such as Netty.
I have data saved as a single partition on HDFS (in bytes), and when I want to get the content of the data using the code below, collect takes more time than first on a single partition of the data.
JavaRDD<String> mytext = sc.textFile("...");
List<String> lines = mytext.collect();
I was expecting collect and first to take the same time. Yet collect is slower than first for data in a single partition of HDFS.
What might be the reason behind this?
rdd.first() doesn't have to scan the whole partition. It gets only the first item and returns it.
rdd.collect() has to scan the whole partition, collect all of it and send
all of it back (serialization + deserialization costs, etc.)
The reason (see the apache-spark-developers forum) is likely that first() is executed entirely on the driver node in the same process, while collect() needs to connect with worker nodes.
Usually the first time you run an action, most of the JVM code is not yet optimized, and the classloader also needs to load a lot of things on the fly. Having to connect with other processes via RPC can slow down the first execution of collect.
That said, if you run this a few times (in the same driver program) and it
is still much slower, you should look into other factors such as network
congestion, cpu/memory load on workers, etc.
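A quick way to verify this is to time both actions a few times in the same driver program, so that JIT compilation and classloading warm-up don't dominate the first measurement:
JavaRDD<String> mytext = sc.textFile("...");
for (int run = 0; run < 3; run++) {
    long t0 = System.currentTimeMillis();
    String firstLine = mytext.first();
    long t1 = System.currentTimeMillis();
    List<String> allLines = mytext.collect();
    long t2 = System.currentTimeMillis();
    System.out.println("run " + run + ": first() " + (t1 - t0) + " ms, collect() " + (t2 - t1) + " ms");
}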
Let's say I'm designing a REST service with Spring, and I need to have a method that accepts a file and returns some kind of ResponseDto. The application server has its POST request size limited to 100MB. Here's the hypothetical Spring controller method implementation:
public ResponseEntity<ResponseDto> uploadFile(@RequestBody MultipartFile file) {
return ResponseEntity.ok(someService.process(file));
}
Let's assume that my server has 64GB of RAM. How do I ensure that I don't get an out of memory error if in a short period (short enough for process() method to still be running for every file uploaded), 1000 users decide to upload a 100MB file (or just 1 user concurrently uploads 1000 files)?
EDIT: To clarify, I want to make sure my application doesn't crash, but instead just stops accepting/delays new requests.
You can monitor the memory usage and see when you have to stop accepting requests or cancel existing requests.
https://docs.oracle.com/javase/6/docs/api/java/lang/management/MemoryMXBean.html
https://docs.oracle.com/javase/6/docs/api/java/lang/management/MemoryPoolMXBean.html
Also, you can use this:
Runtime runtime = Runtime.getRuntime();
System.out.println("Free memory: " + runtime.freeMemory() + " bytes.");
Consider creating a database table that holds the uploads being done:
CREATE TABLE PROC_FILE_UPLOAD
(
ID NUMBER(19,0) NOT NULL
, USER_ID NUMBER(19,0) NOT NULL
, UPLOAD_STATUS_ID NUMBER(19,0) NOT NULL
, FILE_SIZE NUMBER(19,0)
, CONSTRAINT PROC_FILE_UPLOAD_PK PRIMARY KEY (ID) ENABLE
);
COMMENT ON COLUMN PROC_FILE_UPLOAD.FILE_SIZE IS 'In Bytes';
USER_ID being a FK to your users table and UPLOAD_STATUS_ID a FK to a data dictionary with the different statuses for your application (IN_PROGRESS, DONE, ERROR, UNKNOWN, whatever suits you).
Before your service uploads a file, it must check if the current user is already uploading a file and if the maximum number of concurrent uploads has been reached. If so, reject the upload, else update PROC_FILE_UPLOAD information with the new upload and proceed.
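A hedged sketch of that pre-upload check, assuming Spring's JdbcTemplate, the current user's userId, and hypothetical IN_PROGRESS_STATUS_ID and MAX_CONCURRENT_UPLOADS constants:
Integer inProgressForUser = jdbcTemplate.queryForObject(
        "SELECT COUNT(*) FROM PROC_FILE_UPLOAD WHERE USER_ID = ? AND UPLOAD_STATUS_ID = ?",
        Integer.class, userId, IN_PROGRESS_STATUS_ID);
Integer inProgressTotal = jdbcTemplate.queryForObject(
        "SELECT COUNT(*) FROM PROC_FILE_UPLOAD WHERE UPLOAD_STATUS_ID = ?",
        Integer.class, IN_PROGRESS_STATUS_ID);
if (inProgressForUser > 0 || inProgressTotal >= MAX_CONCURRENT_UPLOADS) {
    // reject the upload, e.g. with HTTP 429 Too Many Requests
} else {
    // insert a new IN_PROGRESS row and start processing
}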
Even though you could hold many files in memory with 64 GB of RAM, you don't want to waste too many resources on it. There are memory-efficient ways to read files; for example, you could use a BufferedReader, which doesn't store the whole file in memory.
The documentation does a really good job explaining it:
Reads text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines.
The buffer size may be specified, or the default size may be used. The default is large enough for most purposes.
In general, each read request made of a Reader causes a corresponding read request to be made of the underlying character or byte stream. It is therefore advisable to wrap a BufferedReader around any Reader whose read() operations may be costly, such as FileReaders and InputStreamReaders. For example,
BufferedReader in
= new BufferedReader(new FileReader("foo.in"));
will buffer the input from the specified file. Without buffering, each invocation of read() or readLine() could cause bytes to be read from the file, converted into characters, and then returned, which can be very inefficient.
Here is another SO question that you may find useful:
Java read large text file with 70 million lines of text
If you need to calculate the checksum of the file, like you said in the comments, you could use this link.
You can either limit the number of concurrent requests or use streaming to avoid keeping the whole file in RAM.
Limiting requests
You can limit the number of concurrent incoming requests in the web server. The default web server for Spring Boot is Tomcat, which is configurable in application.properties with server.tomcat.max-connections. If you have 64 GB of RAM available after the app is fully loaded and your max file size is 100 MB, you should be able to accept 640 concurrent requests. After that limit is reached, you can keep incoming connections in a queue before accepting them, configurable with server.tomcat.accept-count. These properties are described here: https://tomcat.apache.org/tomcat-9.0-doc/config/http.html
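For example, in application.properties (the values are just the figures derived above and should be tuned):
server.tomcat.max-connections=640
server.tomcat.accept-count=100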
(In theory you can do better. If you know the upload size in advance, you can use a counting semaphore to reserve space for a file when it's time to start processing it, and delay starting any upload until there is room to reserve space for it.)
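A rough sketch of that reservation idea, assuming the upload size is known up front (e.g. from Content-Length) and roughly 60 GB of headroom; both figures are assumptions:
// permits represent megabytes of headroom reserved for in-flight uploads
Semaphore memoryBudget = new Semaphore(60_000);  // ~60 GB expressed in MB

void handleUpload(long contentLengthBytes) throws InterruptedException {
    int permits = (int) (contentLengthBytes / (1024 * 1024)) + 1;
    memoryBudget.acquire(permits);               // blocks until enough headroom is free
    try {
        // process the upload
    } finally {
        memoryBudget.release(permits);
    }
}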
Streaming
If you are able to implement streaming instead, you can handle many more connections at the same time by not ever keeping the whole file in RAM for any one connection but instead processing the upload one bit at a time, e.g. as you write the upload out to a file or database. It looks like Apache Commons library has a component to help you build an API which streams in the request:
https://www.initialspark.co.uk/blog/articles/java-spring-file-streaming.html
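A rough sketch of what such an endpoint could look like with Commons FileUpload's streaming API; it assumes Spring's own multipart handling is turned off (spring.servlet.multipart.enabled=false) so the raw request reaches the controller, and a hypothetical processStream() variant of the service method:
@PostMapping("/upload")
public ResponseEntity<ResponseDto> uploadFile(HttpServletRequest request) throws Exception {
    ResponseDto result = null;
    FileItemIterator iterator = new ServletFileUpload().getItemIterator(request);
    while (iterator.hasNext()) {
        FileItemStream item = iterator.next();
        try (InputStream stream = item.openStream()) {
            // consume the upload chunk by chunk instead of buffering the whole file in RAM
            result = someService.processStream(stream);
        }
    }
    return ResponseEntity.ok(result);
}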
I need to write a list of words to a file and then save the file on a disk. Is one of the following two ways better than the other? The second one obviously uses more main memory but is there a difference in speed?
(this is just pseudocode)
for i = 0 to i = n:
word = generateWord();
FileWriter.println(word);
end loop
versus
List<String> listOfWords = new ArrayList<>()
for i = 0 to i = n:
word = generateWord();
listOfWords.add(word)
end loop
for i = 0 to n:
FileWriter.println(listOfWords.get(i));
end loop
These two methods you show are exactly the same in terms of disk usage efficiency.
When thinking about speed of disk writes, you must always take into account what kind of writer object you are using. There are many types of writer objects and each of them may behave differently when it comes to actual disk writes.
If the object you are using is one of those that write the exact data you tell it to, then your way of writing is very inefficient. You should consider switching to another writer (BufferedWriter for example) or building a longer string before writing it.
In general, you should try to write data in chunks that fit the disk's chunk size.
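For example, a minimal sketch of the buffered variant of the first approach (generateWord() and n are the asker's placeholders):
try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("words.txt")))) {
    for (int i = 0; i < n; i++) {
        out.println(generateWord()); // lines accumulate in the buffer and are flushed to disk in large chunks
    }
}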
Between your code and the disk, you have a stack something like: Java library code, a virtual machine runtime, the C runtime library, the operating system file cache/virtual memory subsystem, the operating system I/O scheduler, a device driver and the physical disk firmware.
Just do the simplest thing possible unless profiling shows a problem. Several of those layers will already be tuned to handle buffering, batching and scheduling sequential writes since they're such a common use case.
From FileWriter's standpoint you are doing the exact same thing in both examples, so clearly there cannot be any difference regarding file I/O. And, as you say, the first one's space complexity is O(1), as opposed to the second one's O(N).
I am reading a 50G file containing millions of rows separated by newline characters. Presently I am using the following syntax to read the file:
String line = null;
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("FileName")));
while ((line = br.readLine()) != null)
{
// Processing each line here
// All processing is done in memory. No IO required here.
}
Since the file is so big, it is taking 2 hours to process the whole file. Can I improve the reading of the file from the hard disk so that the I/O (reading) operation takes minimal time? The restriction with my code is that I have to process each line in sequential order.
it is taking 2 hours to process the whole file.
50 GB / 2 hours equals approximately 7 MB/s. It's not a bad rate at all. A good (modern) hard disk should be capable of sustaining higher rate continuously, so maybe your bottleneck is not the I/O? You're already using BufferedReader, which, like the name says, is buffering (in memory) what it reads. You could experiment creating the reader with a bit bigger buffer than the default size (8192 bytes), like so:
BufferedReader br = new BufferedReader(
new InputStreamReader(new FileInputStream("FileName")), 100000);
Note that with the default 8192-byte buffer and 7 MB/s throughput the BufferedReader is going to re-fill its buffer almost 1000 times per second, so increasing the buffer size (and thereby lowering that refill count) could really help cut down some overhead. But if the processing that you're doing, rather than the I/O, is the bottleneck, then no I/O trick is going to help you much. You should maybe consider making it multi-threaded, but whether that's doable, and how, depends on what "processing" means here.
Your only hope is to parallelize the reading and processing of what's inside. Your strategy should be to never require the entire file contents to be in memory at once.
Start by profiling the code you have to see where the time is being spent. Rewrite the part that takes the most time and re-profile to see if it improved. Keep repeating until you get an acceptable result.
I'd think about Hadoop and a distributed solution. Data sets that are larger than yours are processed routinely now. You might need to be a bit more creative in your thinking.
Without NIO you won't be able to break the throughput barrier. For example, try using new Scanner(File) instead of directly creating readers. Recently I took a look at that source code; it uses NIO's file channels.
But the first thing I would suggest is to run an empty loop with BufferedReader that does nothing but reading. Note the throughput -- and also keep an eye on the CPU. If the loop floors the CPU, then there's definitely an issue with the IO code.
Disable the antivirus and any other program which adds to disk contention while reading the file.
Defragment the disk.
Create a raw disk partition and read the file from there.
Read the file from an SSD.
Create a 50GB Ramdisk and read the file from there.
I think you may get the best results by re-considering the problem you're trying to solve. There's clearly a reason you're loading this 50Gig file. Consider if there isn't a better way to break the stored data down and only use the data you really need.
The way you read the file is fine. There might be ways to get it faster, but it usually requires understanding where your bottleneck is. Because the I/O throughput is actually on the lower end, I assume the computation is what's hurting performance. If it's not too lengthy, you could show your whole program.
Alternatively, you could run your program without the contents of the loop and see how long it takes to read through the file :)
I have a program that generates a lot of data and puts it in a queue to write, but the problem is it's generating data faster than I'm currently writing (causing it to max out memory and start to slow down). Order does not matter as I plan to parse the file later.
I looked around a bit and found a few questions that helped me design my current process(but I still find it slow). Here's my code so far:
//...background multi-threaded process keeps building the queue..
FileWriter writer = new FileWriter("foo.txt",true);
BufferedWriter bufferWritter = new BufferedWriter(writer);
while(!queue_of_stuff_to_write.isEmpty()) {
String data = solutions.poll().data;
bufferWritter.newLine();
bufferWritter.write(data);
}
bufferWritter.close();
I'm pretty new to programming so I may be assessing this wrong (maybe it's a hardware issue, as I'm using EC2), but is there a way to very quickly dump the queue results into a file, or, if my approach is okay, can I improve it somehow? As order does not matter, does it make more sense to write to multiple files on multiple drives? Will threading make it faster? Etc. I'm not exactly sure of the best approach and any suggestions would be great. My goal is to save the results of the queue (sorry, no outputting to /dev/null :-) and keep memory consumption as low as possible for my app (I'm not 100% sure, but the queue fills up to 15 GB, so I'm assuming it'll be a 15 GB+ file).
Fastest way to write huge data in text file Java (realized I should use buffered writer)
Concurrent file write in Java on Windows (made me see that maybe multi-threading writes wasn't a great idea)
Looking at that code, one thing that springs to mind is character encoding. You're writing strings, but ultimately, it's bytes that go to the streams. A writer does character-to-byte encoding under the hood, and it's doing it in the same thread that is handling the writing. That may mean that time is being spent on encoding that is delaying writes, which could reduce the rate at which data is written.
A simple change would be to use a queue of byte[] instead of String, do the encoding in the threads which push onto the queue, and have the IO code use a BufferedOutputStream rather than a BufferedWriter.
This may also reduce memory consumption, if the encoded text takes up less than two bytes per character on average. For latin text and UTF-8 encoding, this will usually be true.
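A minimal sketch of that change, assuming a shared BlockingQueue<byte[]> and UTF-8 encoding:
BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();

// in each producer thread, where "data" is the generated String: encode before enqueueing
queue.put((data + "\n").getBytes(StandardCharsets.UTF_8));

// the writer thread only moves bytes, no encoding work
try (OutputStream out = new BufferedOutputStream(new FileOutputStream("foo.txt", true))) {
    byte[] chunk;
    while ((chunk = queue.poll()) != null) {
        out.write(chunk);
    }
}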
However, I suspect it's likely that you're simply generating data faster than your IO subsystem can handle. You will need to make your IO subsystem faster - either by using a faster one (if you're on EC2, perhaps renting a faster instance, or writing to a different backend - SQS vs EBS vs local disk, etc.), or by ganging several IO subsystems together in parallel somehow.
Yes, writing multiple files on multiple drives should help, and if nothing else is writing to those drives at the same time, performance should scale linearly with the number of drives until I/O is no longer the bottleneck. You could also try a couple other optimizations to boost performance even more.
If you're generating huge files and the disk simply can't keep up, you can use a GZIPOutputStream to shrink the output--which, in turn, will reduce the amount of disk I/O. For non-random text, you can usually expect a compression ratio of at least 2x-10x.
//...background multi-threaded process keeps building the queue..
OutputStream out = new FileOutputStream("foo.txt",true);
OutputStreamWriter writer = new OutputStreamWriter(new GZIPOutputStream(out));
BufferedWriter bufferWriter = new BufferedWriter(writer);
while(!queue_of_stuff_to_write.isEmpty()) {
String data = solutions.poll().data;
bufferWriter.newLine();
bufferWriter.write(data);
}
bufferWriter.close();
If you're outputting regular (i.e., repetitive) data, you might also want to consider switching to a different output format--for example, a binary encoding of the data. Depending on the structure of your data, it might be more efficient to store it in a database. If you're outputting XML and really want to stick to XML, you should look into a Binary XML format, such as EXI or Fast InfoSet.
I guess as long as you produce your data from calculations and do not load it from another data source, writing will always be slower than generating the data.
You can try writing your data to multiple files (not to the same file, due to synchronization problems) in multiple threads (but I guess that will not fix your problem).
Is it possible for you to wait for the writing part of your application to finish before continuing your calculations?
Another thing to check: do you actually empty your queue? Does solutions.poll() reduce your solutions queue?
Writing to different files using multiple threads is a good idea. Also, you should look into setting the BufferedWriter's buffer size, which you can do from the constructor. Try initializing it with a 10 MB buffer and see if that helps.
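For example (the 10 MB figure is just a starting point to experiment with):
BufferedWriter bufferWriter = new BufferedWriter(new FileWriter("foo.txt", true), 10 * 1024 * 1024);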