I had this "java.lang.OutOfMemoryError: Java heap space " and I read and understand that I can increase my memory using -Xmx1024m. But I think in my code I can change something to this error does not happen anymore.
First, this is the image from VisualVM about my memory :
In the image you can see that the object "Pedidos" is not so big and I have the another object "Enderecos" that have more and less the same size but is not complete because I have the error before the object is completed.
The point is :
I have 2 classes that search for a big csv file ( 400.000 values each ), I will show the code. I tried to use Garbage Collector, set variables as null, but is not working, can anyone help me please? Here is the Code from the class "Pedidos", the class "Enderecos" is the same and my project is just calling this 2 classes.
// all Imports
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import javax.swing.JOptionPane;
import Objetos.Pedido;
// CLASS
public class GerenciadorPedido{
// ArrayList I will add all the "Pedidos" Objects
ArrayList<Pedido> listaPedidos = new ArrayList<Pedido>();
// int offset that I need to index the values correctly
int helper;
// Declared as fields because I didn't want to create a new String every time the loop runs (trying to use less memory)
String Campo[];
String Linha;
String newLinha;
public ArrayList<Pedido> getListaPedidos() throws IOException {
// Here I change the "\" and "/" to be accepted be the FILE (the csv address)
String Enderecotemp = System.getProperty("user.dir"), Endereco = "";
char a;
for (int i = 0; i < Enderecotemp.length(); i++) {
a = Enderecotemp.charAt(i);
if (a == '\\') a = '/';
Endereco = Endereco + String.valueOf(a);
}
Endereco = Endereco + "/Pedido.csv";
// Open the CSV File and the reader to read it
File NovoArquivo = new File(Endereco);
Reader FileLer = null;
// Try to read the File
try
{
FileLer = new FileReader(NovoArquivo);
}
catch(FileNotFoundException e) {
JOptionPane.showMessageDialog(null, "Erro, fale com o Vini <Arquivo de Pedido Não Encontrado>");
}
// Read the File
BufferedReader Lendo = new BufferedReader(FileLer);
try
{
// Do for each line of the csv
while (Lendo.ready()) {
// Read the line and replace characters (needed for the parsing to work)
Linha = Lendo.readLine();
newLinha = Linha.replaceAll("\"", "");
newLinha = newLinha.replaceAll(",,", ", , ");
newLinha = newLinha.replaceAll(",,", ", , ");
newLinha = newLinha + " ";
// Create Campo[x] for each value between ","
Campo = newLinha.split(",");
// Object
Pedido pedido = new Pedido();
helper = 0;
// Fill the object with the right values when Campo.length is 15, 16, 17, 18 or 19.
switch (Campo.length) {
case 15: pedido.setAddress1(Campo[9]);
break;
case 16: pedido.setAddress1(Campo[9] + Campo[10]);
helper = 1;
break;
case 17: pedido.setAddress1(Campo[9] + Campo[10] + Campo[11]);
helper = 2;
break;
case 18: pedido.setAddress1(Campo[9] + Campo[10] + Campo[11] + Campo[12]);
helper = 3;
break;
case 19: pedido.setAddress1(Campo[9] + Campo[10] + Campo[11] + Campo[12] + Campo[13]);
helper = 4;
break;
}
// Complete the Object
pedido.setOrder(Campo[0]);
pedido.setOrderValue(Float.parseFloat(Campo[1]));
pedido.setOrderPv(Float.parseFloat(Campo[2]));
pedido.setCombinedOrderFlag(Campo[3]);
pedido.setCombineOrder(Campo[4]);
pedido.setOrderType(Campo[5]);
pedido.setOrderShipped(Campo[6]);
pedido.setOrderCancelled(Campo[7]);
pedido.setTransactionType(Campo[8]);
pedido.setAddress2(Campo[10 + helper]);
pedido.setAddress3(Campo[11 + helper]);
pedido.setPost(Campo[12 + helper]);
pedido.setCity(Campo[13 + helper]);
pedido.setState(Campo[14 + helper]);
// Add the object in the ArrayList
listaPedidos.add(pedido);
// Set everything to null to start again
Campo = null;
Linha = null;
newLinha = null;
}
}
catch(IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
finally
{
// Close the file and run garbage collector to try to clear the trash
Lendo.close();
FileLer.close();
System.gc();
}
// return the ArrayList.
return listaPedidos;
}
}
The project runs this class, but when it tries to run the other one (the same as this one, with just the names and the CSV changed), I get the memory error. I don't know how to clear the char[] and String instances that are so big, as you can see in the image. Any new ideas? Is it really impossible without increasing the memory?
As is already being discussed in the comments, the main factor is that your program places everything in memory at the same time. That design inherently limits the size of the files you can process.
The way garbage collection works is that only garbage is collected. Any object that is referenced by another is not garbage. So, starting with the "root" objects (anything declared as static or local variables currently on the stack), follow the references. Your GerenciadorPedido instance is surely referenced from main(). It references a list listaPedidos. That list references (many) instances of Pedido each of which references many String instances. Those objects will all remain in memory while they are reachable through the list.
The way to design your program so it doesn't have a limit on the size of the file it can process is to eliminate the list entirely. Don't read the entire file and return a list (or other collection). Instead implement an Iterator. Read one line from the CSV file, create the Pedido, return it. When the program is finished with that one, then read the next line and create the next Pedido. Then you will have only one of these objects in memory at any given time.
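For illustration, here is a rough sketch of that idea; PedidoIterator and parsePedido() are hypothetical names, and the per-line parsing would be the same field handling you already have inside the while loop:
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;
// Sketch only: stream one Pedido at a time instead of collecting them all in an ArrayList.
public class PedidoIterator implements Iterator<Pedido>, Closeable {
    private final BufferedReader reader;
    private String nextLine;
    public PedidoIterator(String csvPath) throws IOException {
        this.reader = new BufferedReader(new FileReader(csvPath));
        this.nextLine = reader.readLine();
    }
    @Override
    public boolean hasNext() {
        return nextLine != null;
    }
    @Override
    public Pedido next() {
        if (nextLine == null) throw new NoSuchElementException();
        Pedido pedido = parsePedido(nextLine); // same splitting/setter logic as the original loop
        try {
            nextLine = reader.readLine(); // advance; only one line is held in memory
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pedido;
    }
    @Override
    public void close() throws IOException {
        reader.close();
    }
    private Pedido parsePedido(String line) {
        Pedido pedido = new Pedido();
        // ... split the line and call the setters exactly as in getListaPedidos() ...
        return pedido;
    }
}
The caller processes each Pedido as it arrives (and closes the iterator afterwards), so no more than one line's worth of objects is reachable at any given time.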
Some additional notes regarding your current algorithm:
every String object references a char[] internally that contains the characters
ArrayList has very poor memory usage characteristics when adding to a large list. Since it is backed by an array, in order to grow to add the new element, it must create an entirely new array larger than the current one then copy all of the references. During the process it will use double the memory that it needs. This also becomes increasingly slower the larger the list is.
One solution is to tell the ArrayList how large you will need it to be so you can avoid resizing. This is only applicable if you actually know how large you will need it to be. If you need 100 elements: new ArrayList<>(100).
Another solution is to use a different data structure. A LinkedList is better for adding elements one at a time because it does not need to allocate and copy an entire array.
Each call to .replaceAll() will create a new char[] for the new String object. Since you then orphan the previous String object, it will get garbage collected. Just be aware of this need for allocation.
Each string concatenation (eg newLinha + " " or Campo[9] + Campo[10]) will create a new StringBuilder object, append the two strings, then create a new String object. This, again, can have an impact when repeated for large amounts of data.
You should, in general, never need to call System.gc(). It is okay to call it, but the system will perform garbage collection whenever memory is needed.
One additional note: your approach to parsing the CSV will fail when the data contains characters you aren't expecting, in particular if any of the fields contains a comma. I recommend using an existing CSV parsing library as a simple way to correctly handle the entire definition of CSV. (I have had good experience with opencsv.)
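For illustration only (not the asker's code), a rough opencsv sketch; the package name and declared exceptions differ a little between opencsv versions, so treat this as the shape of the code rather than an exact API reference:
import com.opencsv.CSVReader; // older opencsv releases use au.com.bytecode.opencsv.CSVReader
import java.io.FileReader;

// Sketch: the library deals with quoting and embedded commas, so no replaceAll()/split(",") is needed
try (CSVReader reader = new CSVReader(new FileReader("Pedido.csv"))) {
    String[] campos;
    while ((campos = reader.readNext()) != null) {
        Pedido pedido = new Pedido();
        pedido.setOrder(campos[0]);
        // ... set the remaining fields from campos ...
    }
}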
Related
The RasEnumConnections function, as implemented through JNA, is returning incomplete data.
What is wrong? This is my code:
public static void main(String[] args) {
Connected();
}
private static void Connected () {
boolean state = false;
ArrayList<String> connectedNames = new ArrayList<>();
IntByReference lpcb = new IntByReference(0);
IntByReference lpcConnections = new IntByReference(0);
Rasapi32.INSTANCE.RasEnumConnections(null, lpcb,lpcConnections);
WinRas.RASCONN conn = new WinRas.RASCONN();
conn.dwSize = lpcb.getValue();
WinRas.RASCONN[] connArray;
if(lpcConnections.getValue() > 0)
connArray = (WinRas.RASCONN[])conn.toArray(lpcConnections.getValue());
else
connArray = (WinRas.RASCONN[])conn.toArray(1);
System.out.println("lpcb: " + lpcb.getValue() + " lpcConnections: " + lpcConnections.getValue() + " RASCONN Size: " + conn.dwSize);
int error = Rasapi32.INSTANCE.RasEnumConnections(connArray, lpcb,lpcConnections);
if(error == WinError.ERROR_SUCCESS) {
System.out.println("Entry name: " + Native.toString(connArray[0].szEntryName)
+ " Guid string: " + connArray[0].guidEntry.toGuidString());
System.out.println(connArray[0].guidEntry.Data1);
System.out.println(connArray[0].guidEntry.Data2);
System.out.println(connArray[0].guidEntry.Data3);
}
else System.out.println("Error: " + error);
WinRas.RASENTRY.ByReference entry = getPhoneBookEntry("test1");
if(entry != null) {
System.out.println("test1 guid: "+ entry.guidId.toGuidString());
System.out.println(entry.guidId.Data1);
System.out.println(entry.guidId.Data2);
System.out.println(entry.guidId.Data3);
}
else System.out.println("Error: " + Native.getLastError());
}
}
The char array szEntryName contains only the last 3 characters of the connection name. (The connection name is "test1".)
As I've noted in the comments, the debug output gives you a strong hint at what's happening. The missing "t" and "e" characters appear as 0x74 and 0x65 in the midst of what JNA expects to be a 64-bit pointer. The logical conclusion is that Windows is returning a 32-bit pointer followed by the string, 4 bytes earlier than JNA expected.
RasEnumConnections states a few things regarding the buffer you are passing as connArray:
On input, an application must set the dwSize member of the first
RASCONN structure in the buffer to sizeof(RASCONN) in order to
identify the version of the structure being passed.
In your sample code above you are leaving this value the same as the value from the initial return. This is specifying the "wrong" version of the structure. Instead, you should set the dwSize member to the size you want in your JNA structure:
conn.dwSize = conn.size();
Actually, the constructor for RASCONN sets this for you, so you don't have to do it at all. But in your code sample above you are overwriting what was pre-set; just delete your conn.dwSize line.
Note that since you are now requesting a larger buffer (4 bytes more per array element) by defining the structure size, you also need to pass the increased size in the (second) RasEnumConnections() call. It is currently set to the number of elements times the (smaller) structure size, but you should reset it to the number of elements times the (larger) size like this:
lpcb.setValue(conn.size() * lpcConnections.getValue());
prior to fetching the full array. Otherwise you'll get the error 632 (Incorrect Structure Size).
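Putting the two corrections together, the relevant part of your calling sequence would look roughly like this (a sketch based on your code above, not a drop-in tested snippet):
IntByReference lpcb = new IntByReference(0);
IntByReference lpcConnections = new IntByReference(0);
Rasapi32.INSTANCE.RasEnumConnections(null, lpcb, lpcConnections);

WinRas.RASCONN conn = new WinRas.RASCONN();
// do NOT overwrite conn.dwSize - the constructor already set it to conn.size()
WinRas.RASCONN[] connArray =
        (WinRas.RASCONN[]) conn.toArray(Math.max(lpcConnections.getValue(), 1));

// size the buffer for the JNA structure size, not the smaller size Windows reported
lpcb.setValue(conn.size() * lpcConnections.getValue());

int error = Rasapi32.INSTANCE.RasEnumConnections(connArray, lpcb, lpcConnections);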
For reference (or perhaps a suitable replacement for your own code), take a look at the code as implemented in the getRasConnection(String connName) method in JNA's Rasapi32Util.java class.
I have a folder with 100 .txt files, 20 or + MB each.
All the files have about 2*10^5 lines of UTF-8 encoded text.
What is the fastest way, possibly using multithreading, to find which files contain a fixed key string? (The criterion for "contains" is the same as Java's .contains() method, i.e. a normal substring match.)
There are several approaches I found here on SO, but none of them used multithreading (why?), and their speed seems to vary depending on the requirements; I can't work out which of the approaches is best for me.
For example this super-complex approach:
https://codereview.stackexchange.com/questions/44021/fast-way-of-searching-for-a-string-in-a-text-file
seems to be 2 times slower than a simple line by line search with a BufferedReader and the .contains() function. How can it be?
And how can I use multithreading to its full potential? The program is run on a very powerful multicore machine.
The output I'm looking for is which files contain the string, and, possibly, at which line.
The following code does the job.
It will walk your directory and collect all the files.
Then it will create a new thread for each file and look for the target string.
Make sure to change the folder path and the target String in the TheThread class as per your needs.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;
//class used for thread
class TheThread implements Runnable {
//shared counter; AtomicInteger so concurrent increments from the threads don't race
AtomicInteger counter = new AtomicInteger(0);
//to get stream of paths
Stream<Path> streamOfFiles = Files.walk(Paths.get("./src/Multi_tasking/Files"));
//List of all files in the folder (index 0 is the folder itself)
List<Path> listOfFiles = streamOfFiles.collect(Collectors.toList());
//because Files.walk may throw IOException
public TheThread() throws IOException {
}
@Override
public void run() {
//atomically increments the counter to get this thread's index into the list (skips index 0, the folder)
int index = counter.incrementAndGet();
//Calling the method to search the file at that index for the target String
SearchTextInMultipleFilesUsingMultiThreading.lookIn(listOfFiles.get(index), "target String");
}
}
public class SearchTextInMultipleFilesUsingMultiThreading {
//method responsible for searching the target String in file
public static void lookIn(Path path, String text) {
try {
List<String> texts = Files.readAllLines(path);
boolean flag = false;
for (int i = 0; i < texts.size(); i++) {
String str = texts.get(i);
if (str.contains(text)) {
System.out.println("Found \"" + text + "\" in " + path.getFileName() + " at line : " + (i + 1) + " from thread : " + Thread.currentThread().getName());
flag = true;
}
}
if (!flag) {
System.out.println("\"" + text + "\" not found in " + path.getFileName() + " through thread : " + Thread.currentThread().getName());
}
} catch (IOException e) {
System.out.println("Error while reading " + path.getFileName());
e.printStackTrace();
}
}
public static void main(String[] args) throws IOException {
//creating object of our thread class
TheThread theThread = new TheThread();
//getting the number of files in the folder
int numberOfFiles = theThread.listOfFiles.size() - 1;
//if the folder doesn't contain any file at all
if (numberOfFiles == 0) {
System.out.println("No file found in the folder");
System.exit(0);
}
//creating the List to store threads
List<Thread> listOfThreads = new ArrayList<>();
//keeping required number of threads inside the list
for (int i = 0; i < numberOfFiles; i++) {
listOfThreads.add(new Thread(theThread));
}
//starting all the threads
for (Thread thread :
listOfThreads) {
thread.start();
}
}
}
I'll let the answers to other questions speak for themselves, but multithreading is unlikely to be helpful for I/O bound tasks with data stored on a single disk. Assuming your folder is stored on a single disk, the use case that the disk caches are most optimized for is single threaded access, so that's likely to be the most effective solution. The reason is because reading the data from disk is likely to be slower than looking through the data once it's loaded into memory, so the disk read is rate limiting.
The simple solution with a BufferedReader and the contains() function may well be the fastest since this is library code that is likely highly optimized.
Now, if your data was sharded onto multiple disks, it might be worthwhile to run multiple threads, depending on how the operating system did disk caching. If you were going to do multiple searches for different strings, not all known at the time of the first search so that a single pass approach wouldn't work, it might be worthwhile to load all the files into memory and then do multithreaded searches only on memory. But then your problem isn't really a file search problem any more, but a more general data search problem.
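For reference, a minimal single-threaded scan along those lines, reporting the file name and line number the question asks for (the folder path and search string are placeholders):
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class SingleThreadedSearch {
    public static void main(String[] args) throws IOException {
        String needle = "target String";           // placeholder
        Path folder = Paths.get("path/to/folder"); // placeholder
        try (Stream<Path> files = Files.list(folder)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    String line;
                    int lineNo = 0;
                    while ((line = in.readLine()) != null) {
                        lineNo++;
                        if (line.contains(needle)) {
                            System.out.println(file.getFileName() + " : line " + lineNo);
                        }
                    }
                } catch (IOException e) {
                    System.err.println("Could not read " + file + ": " + e);
                }
            });
        }
    }
}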
My application stores a large number (about 700,000) of strings in an ArrayList. The strings are loaded from a text file like this:
List<String> stringList = new ArrayList<String>(750_000);
//there's a try catch here but I omitted it for this example
Scanner fileIn = new Scanner(new FileInputStream(listPath), "UTF-8");
while (fileIn.hasNext()) {
String s = fileIn.nextLine().trim();
if (s.isEmpty()) continue;
if (s.startsWith("#")) continue; //ignore comments
stringList.add(s);
}
fileIn.close();
Later on, other strings are compared against this list, using this code:
String example = "Something";
if (stringList.contains(example))
doSomething();
This comparison will happen many hundreds (thousands?) of times.
This all works, but I want to know if there's anything I can do to make it better. I notice that the JVM increases in size from about 100MB to 600MB when it loads the 700K Strings. The strings are mainly about this size:
Blackened Recordings
Divergent Series: Insurgent
Google
Pixels Movie Money
X Ambassadors
Power Path Pro Advanced
CYRFZQ
Is there anything I can do to reduce the memory, or is that to be expected? Any suggestions in general?
ArrayList is memory-efficient. Your issue is probably caused by java.util.Scanner. Scanner creates a lot of temporary objects during parsing (Patterns, Matchers, etc.) and is not suitable for big files.
Try replacing it with java.io.BufferedReader:
List<String> stringList = new ArrayList<String>();
BufferedReader fileIn = new BufferedReader(new FileReader("UTF-8"));
String line = null;
while ((line = fileIn.readLine()) != null) {
line = line.trim();
if (line.isEmpty()) continue;
if (line.startsWith("#")) continue; //ignore comments
stringList.add(line);
}
fileIn.close();
See java.util.Scanner source code
To pinpoint the memory issue, attach any memory profiler to your JVM, for example VisualVM from the JDK tools.
Added:
Let's make a few assumptions:
you have 700,000 strings with 20 characters each.
an object reference is 32 bits, an object header 24, an array header 16, a char 16, an int 32.
Then every String will consume 24 + 32*2 + 32 + (16 + 20*16) = 456 bits.
The whole ArrayList with the String objects will consume about 700000*(32*2 + 456) = 364,000,000 bits = 43.4 MB (very roughly).
Not quite an answer, but:
Your scenario uses around 70mb on my machine:
long usedMemory = -(Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory());
{//
String[] strings = new String[700_000];
for (int i = 0; i < strings.length; i++) {
strings[i] = new String(new char[20]);
}
}//
usedMemory += Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
System.out.println(usedMemory / 1_000_000d + " mb");
How did you reach 500 MB there? As far as I know, a String internally holds a char[], and each char takes 16 bits. Taking the Object and String overhead into account, 500 MB is still quite a lot for the strings alone. You may want to run some benchmark tests on your machine.
As others have already mentioned, you should change the data structure used for element look-ups/comparisons.
You're likely going to be better off using a HashSet instead of an ArrayList as both add and contains are constant time operations in a HashSet.
However, this assumes that your object's hashCode() implementation (which is part of Object, but can be overridden) distributes values evenly.
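As a sketch (the same loading loop as in the question, with listPath and doSomething() taken from there; only the collection type changes):
Set<String> stringSet = new HashSet<String>(1_000_000); // sized up front to avoid rehashing
Scanner fileIn = new Scanner(new FileInputStream(listPath), "UTF-8");
while (fileIn.hasNext()) {
    String s = fileIn.nextLine().trim();
    if (s.isEmpty()) continue;
    if (s.startsWith("#")) continue; //ignore comments
    stringSet.add(s);
}
fileIn.close();

// contains() is now an expected O(1) hash lookup instead of an O(n) scan
if (stringSet.contains("Something"))
    doSomething();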
There is the Trie data structure, which can be used as a dictionary; with so many strings, some can occur multiple times. https://en.wikipedia.org/wiki/Trie . It seems to fit your case.
UPDATE:
An alternative can be a HashSet, or a HashMap of string -> something if you want, for example, the number of occurrences of each string. A hashed collection will certainly be faster than a list.
I would start with a HashSet.
Using an ArrayList is a very bad idea for your use case, because it is not sorted, and hence you cannot search for an entry efficiently.
The best built-in type for your case is a TreeSet<String>. It guarantees O(log(n)) performance for add() and contains().
Be aware that TreeSet is not thread-safe in its basic implementation. Use a thread-safe wrapper (see the JavaDocs of TreeSet for this).
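If the set is shared between threads, a minimal sketch of such a wrapper:
// Collections.synchronizedSortedSet makes individual calls thread-safe;
// iterate inside a synchronized block if you ever need to traverse the set.
SortedSet<String> strings = Collections.synchronizedSortedSet(new TreeSet<String>());
strings.add("Something");
boolean found = strings.contains("Something"); // O(log n)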
Here is a Java 8 approach. It uses the Files.lines() method, which takes advantage of the Stream API. This method reads the lines of a file lazily as a Stream.
As a consequence, the whole file is never held in memory as a collection of Strings; each line is processed and discarded as it reaches the terminal operation, which here calls the static method MyExecutor.doSomething(String).
/**
* Process lines from a file.
* Uses Files.lines() method which take advantage of Stream API introduced in Java 8.
*/
private static void processStringsFromFile(final Path file) {
try (Stream<String> lines = Files.lines(file)) {
lines.map(s -> s.trim())
.filter(s -> !s.isEmpty())
.filter(s -> !s.startsWith("#"))
.filter(s -> s.contains("Something"))
.forEach(MyExecutor::doSomething);
} catch (IOException ex) {
logProcessStringsFailed(ex);
}
}
I ran a memory usage analysis in NetBeans; here are the memory results for an empty implementation of doSomething():
public static void doSomething(final String s) {
}
Live Bytes = 6702720 ≈ 6.4MB.
I need advice from someone who knows Java very well and understands memory issues.
I have a large file (something like 1.5GB) and I need to cut it into many smaller files (100 small files, for example).
I generally know how to do it (using a BufferedReader), but I would like to know whether you have any advice regarding memory, or tips on how to do it faster.
My file contains text; it is not binary, and I have about 20 characters per line.
To save memory, do not unnecessarily store/duplicate the data in memory (i.e. do not assign them to variables outside the loop). Just process the output immediately as soon as the input comes in.
It really doesn't matter whether you're using BufferedReader or not. It will not cost significantly more memory, as some implicitly seem to suggest; at most it will cost a few % of performance. The same applies to using NIO: it will only improve scalability, not memory use. It only becomes interesting when you have hundreds of threads running on the same file.
Just loop through the file, write every line immediately to other file as you read in, count the lines and if it reaches 100, then switch to next file, etcetera.
Kickoff example:
String encoding = "UTF-8";
int maxlines = 100;
BufferedReader reader = null;
BufferedWriter writer = null;
try {
reader = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
int count = 0;
for (String line; (line = reader.readLine()) != null;) {
if (count++ % maxlines == 0) {
close(writer);
writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("/smallfile" + (count / maxlines) + ".txt"), encoding));
}
writer.write(line);
writer.newLine();
}
} finally {
close(writer);
close(reader);
}
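The snippet assumes a small null-safe close() helper along these lines (not shown above):
private static void close(Closeable resource) {
    if (resource != null) {
        try {
            resource.close();
        } catch (IOException e) {
            // closing failed; the data has already been written, so just log it
            e.printStackTrace();
        }
    }
}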
First, if your file contains binary data, then using BufferedReader would be a big mistake (because you would be converting the data to String, which is unnecessary and could easily corrupt the data); you should use a BufferedInputStream instead. If it's text data and you need to split it along linebreaks, then using BufferedReader is OK (assuming the file contains lines of a sensible length).
Regarding memory, there shouldn't be any problem if you use a decently sized buffer (I'd use at least 1MB to make sure the HD is doing mostly sequential reading and writing).
If speed turns out to be a problem, you could have a look at the java.nio packages - those are supposedly faster than java.io.
You can consider using memory-mapped files, via FileChannels .
Generally a lot faster for large files. There are performance trade-offs that could make it slower, so YMMV.
Related answer: Java NIO FileChannel versus FileOutputstream performance / usefulness
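As a rough sketch of that approach (file names and the chunk size are placeholders; note that transferTo() splits on byte boundaries, not line boundaries):
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChannelSplit {
    public static void main(String[] args) throws IOException {
        long chunkSize = 16L * 1024 * 1024; // 16 MB per output file (placeholder)
        try (FileChannel in = FileChannel.open(Paths.get("bigfile.txt"), StandardOpenOption.READ)) {
            long position = 0, size = in.size();
            for (int part = 0; position < size; part++) {
                try (FileChannel out = FileChannel.open(Paths.get("part" + part + ".bin"),
                        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                    // let the OS move the bytes between channels without copying through Java arrays
                    position += in.transferTo(position, Math.min(chunkSize, size - position), out);
                }
            }
        }
    }
}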
This is a very good article:
http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
In summary, for great performance, you should:
Avoid accessing the disk.
Avoid accessing the underlying operating system.
Avoid method calls.
Avoid processing bytes and characters individually.
For example, to reduce the access to disk, you can use a large buffer. The article describes various approaches.
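For instance, both BufferedReader and BufferedWriter accept an explicit buffer size; a sketch (1 MB is just an illustrative value):
int bufferSize = 1024 * 1024; // 1 MB, so disk access stays mostly sequential
BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("/bigfile.txt"), "UTF-8"), bufferSize);
BufferedWriter writer = new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream("/smallfile0.txt"), "UTF-8"), bufferSize);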
Does it have to be done in Java? I.e. does it need to be platform independent? If not, I'd suggest using the 'split' command in *nix. If you really wanted to, you could execute this command from your Java program. While I haven't tested it, I imagine it performs faster than whatever Java IO implementation you could come up with.
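If platform independence is not required, invoking split from Java could look roughly like this (line count and file names are placeholders, and split must be on the PATH):
// splits bigfile.txt into chunks of 100 lines named smallfileaa, smallfileab, ...
Process p = new ProcessBuilder("split", "-l", "100", "/bigfile.txt", "smallfile")
        .inheritIO()   // reuse this JVM's stdin/stdout/stderr
        .start();      // throws IOException if 'split' cannot be started
int exitCode = p.waitFor(); // 0 means success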
You can use java.nio which is faster than classical Input/Output stream:
http://java.sun.com/javase/6/docs/technotes/guides/io/index.html
Yes.
I also think that using read() with arguments, like read(char[], int offset, int length), is a better way to read such a large file
(e.g. read(buffer, 0, buffer.length)).
I also experienced the problem of missing values when using a BufferedReader instead of a BufferedInputStream for a binary input stream, so using a BufferedInputStream is much better in a case like this.
package all.is.well;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import junit.framework.TestCase;
/**
* @author Naresh Bhabat
*
Following implementation helps to deal with extra large files in Java.
This program has been tested with a 2GB input file.
There are some points where extra logic can be added in the future.
Please note: if we want to deal with a binary input file, then instead of reading lines we need to read bytes from the file object.
It uses RandomAccessFile, which is almost like a streaming API.
* ****************************************
Notes regarding the executor framework and its readings.
Please note: ExecutorService executor = Executors.newFixedThreadPool(10);
* For 10 threads: Total time required for reading and writing the text: 349.317 seconds
*
* For 100: Total time required for reading and writing the text: 464.042 seconds
*
* For 1000: Total time required for reading and writing the text: 466.538 seconds
* For 10000: Total time required for reading and writing: 479.701 seconds
*
*
*/
public class DealWithHugeRecordsinFile extends TestCase {
static final String FILEPATH = "C:\\springbatch\\bigfile1.txt.txt";
static final String FILEPATH_WRITE = "C:\\springbatch\\writinghere.txt";
static volatile RandomAccessFile fileToWrite;
static volatile RandomAccessFile file;
static volatile String fileContentsIter;
static volatile int position = 0;
public static void main(String[] args) throws IOException, InterruptedException {
long currentTimeMillis = System.currentTimeMillis();
try {
fileToWrite = new RandomAccessFile(FILEPATH_WRITE, "rw");//for random write,independent of thread obstacles
file = new RandomAccessFile(FILEPATH, "r");//for random read,independent of thread obstacles
seriouslyReadProcessAndWriteAsynch();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Thread currentThread = Thread.currentThread();
System.out.println(currentThread.getName());
long currentTimeMillis2 = System.currentTimeMillis();
double time_seconds = (currentTimeMillis2 - currentTimeMillis) / 1000.0;
System.out.println("Total time required for reading the text in seconds " + time_seconds);
}
/**
* @throws IOException
* Something asynchronously serious
*/
public static void seriouslyReadProcessAndWriteAsynch() throws IOException {
ExecutorService executor = Executors.newFixedThreadPool(10);//pls see for explanation in comments section of the class
while (true) {
String readLine = file.readLine();
if (readLine == null) {
break;
}
Runnable genuineWorker = new Runnable() {
@Override
public void run() {
// do hard processing here in this thread,i have consumed
// some time and ignore some exception in write method.
writeToFile(FILEPATH_WRITE, readLine);
// System.out.println(" :" +
// Thread.currentThread().getName());
}
};
executor.execute(genuineWorker);
}
executor.shutdown();
while (!executor.isTerminated()) {
}
System.out.println("Finished all threads");
file.close();
fileToWrite.close();
}
/**
* @param filePath
* @param data
* @param position
*/
private static void writeToFile(String filePath, String data) {
try {
// fileToWrite.seek(position);
data = "\n" + data;
if (!data.contains("Randomization")) {
return;
}
System.out.println("Let us do something time consuming to make this thread busy"+(position++) + " :" + data);
System.out.println("Lets consume through this loop");
int i=1000;
while(i>0){
i--;
}
fileToWrite.write(data.getBytes());
throw new Exception();
} catch (Exception exception) {
System.out.println("exception was thrown but still we are able to proceeed further"
+ " \n This can be used for marking failure of the records");
//exception.printStackTrace();
}
}
}
Don't use read() without arguments.
It's very slow.
Better to read the data into a buffer and move it to a file quickly.
Use BufferedInputStream because it supports binary reading.
And that's all.
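A sketch of that kind of buffered copy (byte-oriented, so it also works for binary data; file names and the buffer size are placeholders):
byte[] buffer = new byte[64 * 1024]; // copy in 64 KB chunks instead of byte by byte
try (BufferedInputStream in = new BufferedInputStream(new FileInputStream("/bigfile.txt"));
     BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream("/part0.txt"))) {
    int read;
    while ((read = in.read(buffer, 0, buffer.length)) != -1) {
        out.write(buffer, 0, read);
    }
}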
Unless you accidentally read in the whole input file instead of reading it line by line, your primary limitation will be disk speed. You may want to try starting with a file containing 100 lines and writing it to 100 different files, one line in each, making the triggering mechanism work on the number of lines written to the current file. That program will easily scale to your situation.
In some part of my application, I am parsing a 17MB log file into a list structure - one LogEntry per line. There are approximately 100K lines/log entries, meaning approx. 170 bytes per line. What surprised me is that I run out of heap space, even when I specify 128MB (256MB seems sufficient). How can ~10MB of text, turned into a list of objects, cause a tenfold increase in space?
I understand that String objects use at least twice the amount of space compared to ANSI text (Unicode, one char=2 bytes), but this consumes at least four times that.
What I am looking for is an approximation for how much an ArrayList of n LogEntries will consume, or how my method might create extraneous objects that aggravate the situation (see comment below on String.trim())
This is the data part of my LogEntry class
public class LogEntry {
private Long id;
private String system, version, environment, hostName, userId, clientIP, wsdlName, methodName;
private Date timestamp;
private Long milliSeconds;
private Map<String, String> otherProperties;
This is the part doing the reading
public List<LogEntry> readLogEntriesFromFile(File f) throws LogImporterException {
CSVReader reader;
final String ISO_8601_DATE_PATTERN = "yyyy-MM-dd HH:mm:ss,SSS";
List<LogEntry> logEntries = new ArrayList<LogEntry>();
String[] tmp;
try {
int lineNumber = 0;
final char DELIM = ';';
reader = new CSVReader(new InputStreamReader(new FileInputStream(f)), DELIM);
while ((tmp = reader.readNext()) != null) {
lineNumber++;
if (tmp.length < LogEntry.getRequiredNumberOfAttributes()) {
String tmpString = concat(tmp);
if (tmpString.trim().isEmpty()) {
logger.debug("Empty string");
} else {
logger.error(String.format(
"Invalid log format in %s:L%s. Not enough attributes (%d/%d). Was %s . Continuing ...",
f.getAbsolutePath(), lineNumber, tmp.length, LogEntry.getRequiredNumberOfAttributes(), tmpString)
);
}
continue;
}
List<String> values = new ArrayList<String>(Arrays.asList(tmp));
String system, version, environment, hostName, userId, wsdlName, methodName;
Date timestamp;
Long milliSeconds;
Map<String, String> otherProperties;
system = values.remove(0);
version = values.remove(0);
environment = values.remove(0);
hostName = values.remove(0);
userId = values.remove(0);
String clientIP = values.remove(0);
wsdlName = cleanLogString(values.remove(0));
methodName = cleanLogString(stripNormalPrefixes(values.remove(0)));
timestamp = new SimpleDateFormat(ISO_8601_DATE_PATTERN).parse(values.remove(0));
milliSeconds = Long.parseLong(values.remove(0));
/* remaining properties are the key-value pairs */
otherProperties = parseOtherProperties(values);
logEntries.add(new LogEntry(system, version, environment, hostName, userId, clientIP,
wsdlName, methodName, timestamp, milliSeconds, otherProperties));
}
reader.close();
} catch (IOException e) {
throw new LogImporterException("Error reading log file: " + e.getMessage());
} catch (ParseException e) {
throw new LogImporterException("Error parsing logfile: " + e.getMessage(), e);
}
return logEntries;
}
Utility function used for populating the map
private Map<String, String> parseOtherProperties(List<String> values) throws ParseException {
HashMap<String, String> map = new HashMap<String, String>();
String[] tmp;
for (String s : values) {
if (s.trim().isEmpty()) {
continue;
}
tmp = s.split(":");
if (tmp.length != 2) {
throw new ParseException("Could not split string into key:value :\"" + s + "\"", s.length());
}
map.put(tmp[0], tmp[1]);
}
return map;
}
You also have a Map there, where you store other properties. Your code doesn't show how this Map is populated, but keep in mind that Maps may have a hefty memory overhead compared to the memory needed for the entries themselves.
The size of the array that backs the Map (at least 16 entries * 4 bytes) + one key/value pair per entry + the size of the data themselves. Two map entries, each using 10 chars for the key and 10 chars for the value, would consume 16*4 + 2*2*4 + 2*10*2 + 2*10*2 + 2*2*8 = 64 + 16 + 40 + 40 + 32 = 192 bytes (1 char = 2 bytes, a String object consumes at least 8 bytes). That alone would more than double the space requirement for the entire log string.
Add to this that the LogEntry contains 12 Objects, i.e. at least 96 bytes. Hence the log objects alone would need around 100 bytes, give or take some, without the Map and without actual string data. Plus all the pointers for the references (4B each). I count at least 18 with the Map, meaning 72 bytes.
Adding the data (-object references and object "headers" mentioned in the last paragraph):
2 longs = 16B, 1 date stored as long = 8B, the map = 192B. In addition comes the string content, say 90 chars = 180 bytes. Perhaps a byte or two at each end of the list item when put in the list, so in total somewhere around 100 + 72 + 16 + 8 + 192 + 180 = 568 ~ 600 bytes per log line.
So around 600 bytes per log line means 100K lines would consume around 60MB minimum. That would place it at least in the same order of magnitude as the heap size that was set aside. In addition comes the fact that tmpString.trim() in a loop might be creating copies of the string. Similarly, String.format() may also be creating copies. The rest of the application must also fit within this heap space, and that might explain where the rest of the memory is going.
Don't forget that each String object consumes space (24 bytes?) for the actual Object definition, plus the reference to the char array, the offset (for substring() usage) etc. So representing a line as 'n' strings will add that additional storage requirement. Can you lazily evaluate these instead within your LogEntry class?
(Re. the String offset usage - prior to Java 7u6, String.substring() acts as a window onto an existing char array and consequently needs an offset. This has since changed, and it may be worth determining whether a later JDK build is more memory efficient.)
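If the application does run on a pre-7u6 JDK, copying the few substrings that are kept detaches them from the large backing char[] of the original line; a sketch using one of the fields from the reading code above:
// pre-Java-7u6 only: substring()/split() results share the parent line's char[],
// so an explicit copy keeps just the characters that are needed and lets the rest be collected
methodName = new String(cleanLogString(stripNormalPrefixes(values.remove(0))));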