My application stores a large number (about 700,000) of strings in an ArrayList. The strings are loaded from a text file like this:
List<String> stringList = new ArrayList<String>(750_000);
//there's a try catch here but I omitted it for this example
Scanner fileIn = new Scanner(new FileInputStream(listPath), "UTF-8");
while (fileIn.hasNext()) {
    String s = fileIn.nextLine().trim();
    if (s.isEmpty()) continue;
    if (s.startsWith("#")) continue; //ignore comments
    stringList.add(s);
}
fileIn.close();
Later on, other strings are compared against this list, using this code:
String example = "Something";
if (stringList.contains(example))
    doSomething();
This comparison will happen many hundreds (thousands?) of times.
This all works, but I want to know if there's anything I can do to make it better. I notice that the JVM increases in size from about 100MB to 600MB when it loads the 700K Strings. The strings are mainly about this size:
Blackened Recordings
Divergent Series: Insurgent
Google
Pixels Movie Money
X Ambassadors
Power Path Pro Advanced
CYRFZQ
Is there anything I can do to reduce the memory, or is that to be expected? Any suggestions in general?
ArrayList itself is memory-efficient. Your issue is probably caused by java.util.Scanner: Scanner creates a lot of temporary objects during parsing (Patterns, Matchers, etc.) and is not suitable for big files.
Try replacing it with a java.io.BufferedReader:
List<String> stringList = new ArrayList<String>(750_000);
BufferedReader fileIn = new BufferedReader(
        new InputStreamReader(new FileInputStream(listPath), "UTF-8"));
String line;
while ((line = fileIn.readLine()) != null) {
    line = line.trim();
    if (line.isEmpty()) continue;
    if (line.startsWith("#")) continue; //ignore comments
    stringList.add(line);
}
fileIn.close();
See the java.util.Scanner source code.
To pinpoint the memory issue, attach a memory profiler to your JVM, for example VisualVM from the JDK tools.
Added:
Let's make a few assumptions:
you have 700,000 strings with 20 characters each;
an object reference is 32 bits, an object header 24 bits, an array header 16 bits, a char 16 bits, and an int 32 bits.
Then every String will consume 24 + 32*2 + 32 + (16 + 20*16) = 456 bits.
The whole ArrayList with its String objects will consume about 700,000 * (32*2 + 456) = 364,000,000 bits ≈ 43.4 MB (very roughly).
Not quite an answer, but:
Your scenario uses around 70 MB on my machine:
// start negative, then add the usage measured after the allocations
long usedMemory = -(Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory());
{//
    String[] strings = new String[700_000];
    for (int i = 0; i < strings.length; i++) {
        strings[i] = new String(new char[20]); // a 20-character string, like the examples above
    }
}//
usedMemory += Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
System.out.println(usedMemory / 1_000_000d + " mb");
How did you reach 500 MB there? As far as I know, a String internally holds a char[], and each char takes 16 bits. Taking the Object and String overhead into account, 500 MB is still quite a lot for the strings alone. You may want to run some benchmarks on your machine.
As others already mentioned, you should change the data structure used for element look-ups/comparisons.
You're likely going to be better off using a HashSet instead of an ArrayList as both add and contains are constant time operations in a HashSet.
However, it does assume that your object's hashCode implementation (which is part of Object, but can be overridden) is evenly distributed.
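For illustration, here is a minimal sketch of the HashSet variant, reusing the loading loop from the question (the class name and file argument are illustrative, not from the question):
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

public class HashSetLookup {
    public static void main(String[] args) throws IOException {
        // Size the set up front so ~700K entries don't trigger repeated rehashing.
        Set<String> strings = new HashSet<>(1_000_000);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) continue;
                strings.add(line);
            }
        }
        // contains() is now an O(1) hash lookup on average instead of an O(n) scan.
        System.out.println(strings.contains("Something"));
    }
}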
There is the Trie data structure, which can be used as a dictionary; with so many strings, some of them may occur multiple times. https://en.wikipedia.org/wiki/Trie. It seems to fit your case.
UPDATE:
An alternative is a HashSet, or a HashMap<String, Something> if, for example, you want to count occurrences of strings. A hashed collection will certainly be faster than a list for look-ups.
I would start with HashSet.
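If it is occurrence counts you are after, here is a small sketch of the HashMap idea (the sample data and names are made up):
import java.util.HashMap;
import java.util.Map;

public class OccurrenceCount {
    public static void main(String[] args) {
        String[] lines = {"Google", "Google", "X Ambassadors"};
        // Map each string to how many times it has been seen.
        Map<String, Integer> counts = new HashMap<>();
        for (String s : lines) {
            counts.merge(s, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        System.out.println(counts); // e.g. {Google=2, X Ambassadors=1} (iteration order not guaranteed)
    }
}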
Using an ArrayList is a very bad idea for your use case, because it is not sorted, and hence you cannot efficiently search for an entry.
The best built-in type for your case is a TreeSet<String>. It guarantees O(log(n)) performance for add() and contains().
Be aware that TreeSet is not thread-safe in its basic implementation; use a thread-safe wrapper if you need one (see the Javadoc of TreeSet).
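A minimal sketch of that setup (the synchronized wrapper comes from java.util.Collections; the sample data is illustrative):
import java.util.Collections;
import java.util.SortedSet;
import java.util.TreeSet;

public class SortedLookup {
    public static void main(String[] args) {
        // Wrap the TreeSet if several threads will read and write it concurrently.
        SortedSet<String> strings =
                Collections.synchronizedSortedSet(new TreeSet<String>());
        strings.add("Google");
        strings.add("X Ambassadors");
        // add() and contains() are O(log n) on the underlying red-black tree.
        System.out.println(strings.contains("Google")); // true
    }
}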
Here is a Java 8 approach. It uses the Files.lines() method, which takes advantage of the Stream API: it reads the lines of a file lazily, as a Stream.
As a consequence, lines are processed one at a time and the whole file is never held in memory; the terminal operation here is the static method MyExecutor.doSomething(String).
/**
 * Process lines from a file.
 * Uses the Files.lines() method, which takes advantage of the Stream API introduced in Java 8.
 */
private static void processStringsFromFile(final Path file) {
    try (Stream<String> lines = Files.lines(file)) {
        lines.map(s -> s.trim())
                .filter(s -> !s.isEmpty())
                .filter(s -> !s.startsWith("#"))
                .filter(s -> s.contains("Something"))
                .forEach(MyExecutor::doSomething);
    } catch (IOException ex) {
        logProcessStringsFailed(ex);
    }
}
I conducted an analysis of memory usage in NetBeans, and here are the memory results for an empty implementation of doSomething():
public static void doSomething(final String s) {
}
Live Bytes = 6702720 ≈ 6.4MB.
Related
Suppose I have a list of Book where the number of books can be quite large, and I'm logging the ISBNs of those books. I've come up with two approaches; would there be any performance difference, and which approach is considered better?
My concern with 2) is whether the String could get long enough to become an issue. Referring to How many characters can a Java String have?, it's unlikely to hit the maximum number of characters, but I'm not sure about the point on "Half your maximum heap size", and whether it's actually good practice to construct such a long String.
1) Convert to a list of Strings
List<Book> books = new ArrayList<>();
books.add(new Book().name("book1").isbn("001"));
books.add(new Book().name("book2").isbn("002"));
if (books != null && books.size() > 0) {
    List<String> isbns = books.stream()
            .map(Book::getIsbn)
            .collect(Collectors.toList());
    logger.info("List of isbn = {}", isbns);
} else {
    logger.info("Empty list of isbn");
}
2) Use a StringBuilder and concatenate them into one long String
List<Book> books = new ArrayList<>();
books.add(new Book().name("book1").isbn("001"));
books.add(new Book().name("book2").isbn("002"));
if (books != null && books.size() > 0) {
    StringBuilder strB = new StringBuilder();
    strB.append("List of isbn: ");
    books.stream()
            .forEach(book -> {
                strB.append(book.getIsbn());
                strB.append("; ");
            });
    logger.info("List of isbn = {}", strB.toString());
} else {
    logger.info("Empty list of isbn");
}
... but I'm not sure on the point about "Half your maximum heap size"
The JVM heap has a maximum size set by command line options, or by defaults.
If you fill the heap, your JVM will throw an OutOfMemoryError (OOME), and that will typically cause your application to terminate (or worse!).
When you construct a string using a StringBuilder, the builder uses a roughly exponential resizing strategy: when you fill the builder's buffer, it allocates a new one with double the size. But the old and new buffers need to exist at the same time. So when the buffer is between 1/2 and 2/3rds of the size of the entire heap and it fills up, the StringBuilder will attempt to allocate a new buffer that is larger than the remaining available space, and an OOME will ensue.
Having said that, assembling a single string containing a huge amount of data is going to be bad for performance even if you don't trigger an OOME. A better idea is to write the data via a buffered OutputStream or Writer.
This may be problematic if you are outputting the data via a Logger. But I wouldn't try to use a Logger for that. And I certainly wouldn't try to do it using a single Logger.info(...) call.
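As an illustration of the buffered-Writer advice above, here is a sketch that streams the ISBNs to a file instead of building one giant String (the Book stand-in and the file name are assumptions, not part of the question):
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class IsbnDump {
    // Minimal stand-in for the Book class from the question.
    static class Book {
        private final String isbn;
        Book(String isbn) { this.isbn = isbn; }
        String getIsbn() { return isbn; }
    }

    static void writeIsbns(List<Book> books) throws IOException {
        // Each ISBN goes straight to a buffered file writer; no huge intermediate String is built.
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("isbns.txt"))) {
            for (Book book : books) {
                out.write(book.getIsbn());
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        writeIsbns(Arrays.asList(new Book("001"), new Book("002")));
    }
}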
I have solved a simple problem on CodeEval in various ways; its specification can be found here (it is only a few lines long).
I have made 3 working versions (one of them in Scala) and I don't understand the performance difference for my last Java version, which I expected to be the best both time- and memory-wise.
I also compared this to code found on GitHub. Here are the performance stats returned by CodeEval:
Version 1 is the version found on GitHub.
Version 2 is my Scala solution:
object Main extends App {
  val p = Pattern.compile("\\d+")
  scala.io.Source.fromFile(args(0)).getLines
    .filter(!_.isEmpty)
    .map(line => {
      val dists = new TreeSet[Int]
      val m = p.matcher(line)
      while (m.find) dists += m.group.toInt
      val list = dists.toList
      list.zip(0 +: list).map { case (x,y) => x - y }.mkString(",")
    })
    .foreach(println)
}
Version 3 is my Java solution, which I expected to be the best:
public class Main {
    public static void main(String[] args) throws IOException {
        Pattern p = Pattern.compile("\\d+");
        File file = new File(args[0]);
        BufferedReader br = new BufferedReader(new FileReader(file));
        String line;
        while ((line = br.readLine()) != null) {
            Set<Integer> dists = new TreeSet<Integer>();
            Matcher m = p.matcher(line);
            while (m.find()) dists.add(Integer.parseInt(m.group()));
            Iterator<Integer> it = dists.iterator();
            int prev = 0;
            StringBuilder sb = new StringBuilder();
            while (it.hasNext()) {
                int curr = it.next();
                sb.append(curr - prev);
                sb.append(it.hasNext() ? "," : "");
                prev = curr;
            }
            System.out.println(sb);
        }
        br.close();
    }
}
Version 4 is the same as version 3, except that I don't use a StringBuilder for the output and print directly, like in version 1.
Here is how I interpreted those results:
version 1 is too slow because of the large number of System.out.print calls. Moreover, using split on very long lines (which is the case in the tests performed) uses a lot of memory.
version 2 seems slow too, but mainly because of the "overhead" of running Scala code on CodeEval; even very efficient code runs slowly there.
version 2 also uses unnecessary memory to build a list from the set, which takes some time as well but should not be too significant. Writing more efficient Scala would probably end up looking like the Java version, so I preferred elegance over performance.
version 3 should not use that much memory in my opinion. Using a StringBuilder has the same impact on memory as calling mkString in version 2.
version 4 proves that the calls to System.out.println are slowing down the program.
Does someone see an explanation to those results ?
I conducted some tests.
There is a baseline for every language. I code in Java and JavaScript; for JavaScript, here are my test results:
Rev 1: Default empty boilerplate for JS with a message to standard output
Rev 2: Same without file reading
Rev 3: Just a message to the standard output
You can see that no matter what, there will be at least 200 ms of runtime and about 5 MB of memory usage. This baseline depends on the load of the servers as well! There was a time when CodeEval was heavily overloaded, making it impossible to run anything within the max time (10 s).
Check this out, a totally different challenge than the previous:
Rev4: My solution
Rev5: The same code submitted again now. Scored 8000 more ranking points. :D
Conclusion: I would not worry too much about CPU and memory usage and rank. It is clearly not reliable.
Your Scala solution is slow not because of "overhead on CodeEval", but because you are building an immutable TreeSet, adding elements to it one by one. Replacing it with something like
val regex = """\d+""".r // in the beginning, instead of your Pattern.compile
...
.map { line =>
val dists = regex.findAllIn(line).map(_.toInt).toIndexedSeq.sorted
...
Should shave about 30-40% off your execution time.
The same approach (build a list, then sort) will probably also help memory utilization in "version 3" (Java sets are real memory hogs). It is also a good idea to give your list an initial size while you are at it (otherwise it'll grow by 50% every time it runs out of capacity, which is wasteful in both memory and performance). 600 sounds like a good number, since that's the upper bound for the number of cities from the problem description.
Now, since we know the upper bound, an even faster and slimmer approach is to do away with lists and boxed Integers, and just use int[] dists = new int[600];.
If you wanted to get really fancy, you'd also make use of the "route length" range that's mentioned in the description. For example, instead of throwing ints into an array and sorting (or keeping a TreeSet), make an array of 20,000 bits (or even 20K bytes for speed), and set the ones you see in the input as you read it ... That would be both faster and more memory efficient than any of your solutions.
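A rough sketch of that last idea, assuming, as the paragraph above does, that every distance fits in 0..20,000; the parsing is the same \d+ matching used in the question:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DistanceDiffs {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns the comma-separated gaps between consecutive (distinct, sorted) distances on one line.
    static String gaps(String line) {
        boolean[] seen = new boolean[20_001]; // one flag per possible distance
        Matcher m = DIGITS.matcher(line);
        while (m.find()) {
            seen[Integer.parseInt(m.group())] = true;
        }
        StringBuilder sb = new StringBuilder();
        int prev = 0;
        for (int d = 0; d <= 20_000; d++) {
            if (seen[d]) {
                if (sb.length() > 0) sb.append(',');
                sb.append(d - prev);
                prev = d;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(gaps("A,10;B,25;C,40")); // prints 10,15,15
    }
}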
I tried solving this question and figured that you don't need the names of the cities, just the distances in a sorted array.
It has a much better runtime of 738 ms, and memory usage of 4,513,792 bytes.
Although this may not help improve your piece of code, it seems like a better way to approach the question. Any suggestions to improve the code further are welcome.
import java.io.*;
import java.util.*;

public class Main {
    public static void main(String[] args) throws IOException {
        File file = new File(args[0]);
        BufferedReader buffer = new BufferedReader(new FileReader(file));
        String line;
        while ((line = buffer.readLine()) != null) {
            line = line.trim();
            String out = new Main().getDistances(line);
            System.out.println(out);
        }
    }

    public String getDistances(String s) {
        //split the string
        String[] arr = s.split(";");
        //create an array to hold the distances as integers
        int[] distances = new int[arr.length];
        for (int i = 0; i < arr.length; i++) {
            //find the index of , - get the characters after that - convert to integer - add to distances array
            distances[i] = Integer.parseInt(arr[i].substring(arr[i].lastIndexOf(",") + 1));
        }
        //sort the array
        Arrays.sort(distances);
        String output = "";
        output += distances[0]; //append the distance to the closest city to the string
        for (int i = 0; i < arr.length - 1; i++) {
            //get distance between current element (city) and the next
            int distance_between = distances[i + 1] - distances[i];
            //append the distance to the string
            output += "," + distance_between;
        }
        return output;
    }
}
I am getting an out-of-memory error while reading a large CSV file in Java. How can I deal with this problem? I increased the heap size and also tried using a BufferedReader, but the same problem still persists. Here is my code:
public class CsvParser {

    public static void main(String[] args) {
        try {
            FileReader fr = new FileReader((args.length > 0) ? args[0] : "data.csv");
            Map<String, List<String>> values = parseCsv(fr, " ", true);
            System.out.println(values);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static Map<String, List<String>> parseCsv(Reader reader, String separator, boolean hasHeader)
            throws IOException {
        Map<String, List<String>> values = new LinkedHashMap<String, List<String>>();
        List<String> columnNames = new LinkedList<String>();
        BufferedReader br = new BufferedReader(reader);
        String line;
        int numLines = 0;
        while ((line = br.readLine()) != null) {
            if (StringUtils.isNotBlank(line)) {
                if (!line.startsWith("#")) {
                    String[] tokens = line.split(separator);
                    if (tokens != null) {
                        for (int i = 0; i < tokens.length; ++i) {
                            if (numLines == 0) {
                                columnNames.add(hasHeader ? tokens[i] : ("row_" + i));
                            } else {
                                List<String> column = values.get(columnNames.get(i));
                                if (column == null) {
                                    column = new LinkedList<String>();
                                }
                                column.add(tokens[i]);
                                values.put(columnNames.get(i), column);
                            }
                        }
                    }
                    ++numLines;
                }
            }
        }
        return values;
    }
}
Do not try to build your custom parser. Your implementation will probably not be fast or flexible enough to handle all corner cases.
You should try uniVocity-parsers to handle that for you. It comes with a built-in CSV parser, which is the fastest CSV parser available for Java. Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
It is extremely memory efficient, and we built a custom parser on top of its architecture to parse a 42 GB MySQL dump file, with more than 1 billion rows, for this project.
Here's a quick and dirty example of how to use the uniVocity-parsers CSV parser:
CsvParserSettings settings = new CsvParserSettings();
CsvParser parser = new CsvParser(settings);
// parses all rows in one go.
List<String[]> allRows = parser.parseAll(new FileReader(yourFile));
If you want to load everything in memory, you need memory.
By loading the complete file into memory you will always risk an OutOfMemoryError.
If you really need all the data to be always accessible, you can start thinking about using a database. An embedded database like SQLite is easy to integrate, has little overhead, and is able to manage the data on disk. This way, no matter how large your files are, you will not have a memory issue.
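For illustration only, here is a sketch of pushing rows into an embedded SQLite database; this assumes a JDBC SQLite driver (for example xerial's sqlite-jdbc) is on the classpath, and the table and column names are made up:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class CsvToSqlite {
    public static void main(String[] args) throws SQLException {
        // "jdbc:sqlite:data.db" creates/opens a single-file database on disk.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:data.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS rows (col1 TEXT, col2 TEXT)");
            }
            conn.setAutoCommit(false); // batch all inserts in one transaction
            try (PreparedStatement ps = conn.prepareStatement("INSERT INTO rows VALUES (?, ?)")) {
                // In the real code this loop would read the CSV line by line.
                ps.setString(1, "a");
                ps.setString(2, "b");
                ps.executeUpdate();
            }
            conn.commit();
        }
    }
}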
Memory is a limited resource, so if you want to deal with large files you need an approach that processes them in portions. I suggest taking a look at RandomAccessFile and MappedByteBuffer from the NIO library. That is the best solution I can think of for your problem: you can access the data of the file without loading it entirely into memory. Take a look at this link for a quick head start.
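A minimal sketch of memory-mapping a file with NIO; it only counts newlines, just to show that the file's bytes are accessed without reading them all into the heap:
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedCount {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            // The OS pages the file in and out on demand, so the Java heap stays small.
            // Note: a single mapping is limited to 2 GB; bigger files need several mapped regions.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long lines = 0;
            while (buf.hasRemaining()) {
                if (buf.get() == '\n') lines++;
            }
            System.out.println(lines + " lines");
        }
    }
}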
It's not the CSV file itself that fills up the memory; it's the values variable, which contains a copy of the file's contents plus a certain amount of object overhead.
I also see that you are "transposing" your original CSV file. That means, as other posters already mentioned, that you have to use some file-based storage to keep the memory footprint at a minimum, or add more RAM to your computer and hope that helps.
Assuming C columns, L lines, B characters per field, and a 64-bit JVM:
The data from the CSV file has roughly C×L×B characters, so it takes about (32 + 24 + 2×B)×C×L bytes of memory to store all the values as strings. Consider interning them if the values repeat, or storing them as UTF-8 byte arrays in (24 + B)×C×L bytes. Or, if you feel confident, combine the two and implement an interning pool for byte arrays.
A LinkedList takes 40 bytes per node, so it's another 40×C×L bytes. ArrayLists are smaller, taking only 8 bytes per node, and they are also faster in almost every use case, including yours.
You need at least (96 + 2×B)×L×C bytes of memory, plus a bit of overhead. If you switch to ArrayLists and byte arrays, you should need about (32 + B)×L×C bytes plus overhead.
Rather than loading it all into memory, try doing a bit at a time.
Something like a LineNumberReader or a BufferedReader should help you manage this.
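A sketch of that "bit at a time" idea: stream the file and aggregate as you go instead of keeping every cell; the per-first-column count is just an invented example of such an aggregate:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class StreamingCsv {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> countsByFirstColumn = new HashMap<>();
        try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.isEmpty() || line.startsWith("#")) continue;
                // Only the current line is held in memory; aggregate and move on.
                String first = line.split(",", 2)[0];
                countsByFirstColumn.merge(first, 1, Integer::sum);
            }
        }
        System.out.println(countsByFirstColumn);
    }
}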
I'm working with a very big text file (755Mb).
I need to sort the lines (about 1890000) and then write them back in another file.
I already found this discussion, which starts from a file really similar to mine:
Sorting Lines Based on words in them as keys
The problem is that I cannot store the lines in a collection in memory, because I get a Java heap space exception (even after expanding the heap to its maximum; I already tried that).
I can't open it with Excel and use its sorting feature either, because the file is too large to be loaded completely.
I thought about using a DB, but I think that writing all the lines and then fetching them with a SELECT query would take too long. Am I wrong?
Any hints appreciated
Thanks in advance
I think the solution here is to do a merge sort using temporary files:
Read the first n lines of the first file (n being the number of lines you can afford to store and sort in memory), sort them, and write them to file 1.tmp (or whatever you call it). Do the same with the next n lines and store them in 2.tmp. Repeat until all lines of the original file have been processed.
Read the first line of each temporary file. Determine the smallest one (according to your sort order), write it to the destination file, and read the next line from the corresponding temporary file. Repeat until all lines have been processed.
Delete all the temporary files.
This works with arbitrarily large files, as long as you have enough disk space.
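A rough sketch of those two passes (chunk size, temp-file handling, and error handling are deliberately simplistic):
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {
    // One open temp file plus the line we are currently looking at.
    static class ChunkReader {
        final BufferedReader reader;
        String current;
        ChunkReader(Path p) throws IOException {
            reader = Files.newBufferedReader(p, StandardCharsets.UTF_8);
            current = reader.readLine();
        }
        void advance() throws IOException { current = reader.readLine(); }
    }

    public static void main(String[] args) throws IOException {
        final int chunkLines = 100_000; // lines we can afford to sort in memory at once
        List<Path> chunks = new ArrayList<>();

        // Pass 1: read, sort and write fixed-size chunks to temp files.
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(chunkLines);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == chunkLines) { chunks.add(writeChunk(buffer)); buffer.clear(); }
            }
            if (!buffer.isEmpty()) chunks.add(writeChunk(buffer));
        }

        // Pass 2: k-way merge; always emit the smallest current line across all chunks.
        PriorityQueue<ChunkReader> queue =
                new PriorityQueue<>(Comparator.comparing((ChunkReader c) -> c.current));
        for (Path p : chunks) {
            ChunkReader c = new ChunkReader(p);
            if (c.current != null) queue.add(c);
        }
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
            while (!queue.isEmpty()) {
                ChunkReader c = queue.poll();
                out.write(c.current);
                out.newLine();
                c.advance();
                if (c.current != null) queue.add(c); else c.reader.close();
            }
        }
    }

    private static Path writeChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        // Temp chunks are not deleted here; a real version would clean them up afterwards.
        Path tmp = Files.createTempFile("chunk", ".txt");
        Files.write(tmp, lines, StandardCharsets.UTF_8);
        return tmp;
    }
}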
You can run the following with
-mx1g -XX:+UseCompressedStrings # on Java 6 update 29
-mx1800m -XX:-UseCompressedStrings # on Java 6 update 29
-mx2g # on Java 7 update 2.
import java.io.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Main {
    public static void main(String... args) throws IOException {
        long start = System.nanoTime();
        generateFile("lines.txt", 755 * 1024 * 1024, 189000);

        List<String> lines = loadLines("lines.txt");

        System.out.println("Sorting file");
        Collections.sort(lines);
        System.out.println("... Sorted file");
        // save lines.
        long time = System.nanoTime() - start;
        System.out.printf("Took %.3f second to read, sort and write to a file%n", time / 1e9);
    }

    private static void generateFile(String fileName, int size, int lines) throws FileNotFoundException {
        System.out.println("Creating file to load");
        int lineSize = size / lines;
        StringBuilder sb = new StringBuilder();
        while (sb.length() < lineSize) sb.append('-');
        String padding = sb.toString();

        PrintWriter pw = new PrintWriter(fileName);
        for (int i = 0; i < lines; i++) {
            String text = (i + padding).substring(0, lineSize);
            pw.println(text);
        }
        pw.close();
        System.out.println("... Created file to load");
    }

    private static List<String> loadLines(String fileName) throws IOException {
        System.out.println("Reading file");
        BufferedReader br = new BufferedReader(new FileReader(fileName));
        List<String> ret = new ArrayList<String>();
        String line;
        while ((line = br.readLine()) != null)
            ret.add(line);
        System.out.println("... Read file.");
        return ret;
    }
}
prints
Creating file to load
... Created file to load
Reading file
... Read file.
Sorting file
... Sorted file
Took 4.886 second to read, sort and write to a file
Divide and conquer is the best solution :)
Divide your file into smaller ones, sort each file separately, then merge them back together.
Links:
Sort a file with huge volume of data given memory constraint
http://hackerne.ws/item?id=1603381
Algorithm:
How much memory do we have available? Let’s assume we have X MB of memory available.
Divide the file into K chunks, where X * K = 2 GB. Bring each chunk into memory and sort the lines as usual using any O(n log n) algorithm. Save the lines back to the file.
Now bring the next chunk into memory and sort.
Once we’re done, merge them one by one.
The above algorithm is also known as an external sort. The final merging step is known as an N-way merge.
Why don't you try multithreading and increasing the heap size of the program you are running? (This also requires you to use something like a merge sort, provided you have more than 755 MB of memory in your system.)
Maybe you can use Perl to format the file and load it into a database like MySQL; it's very fast, and you can use an index to query the data and write the results to another file.
You can set the JVM heap size with '-Xms256m -Xmx1024m'. I hope this helps, thanks.
I am comparing substrings in two large text files. Very simple: tokenizing into two token containers and comparing with two for loops. Performance is disastrous! Does anybody have advice or an idea on how to improve it?
for (int s = 0; s < txtA.TokenContainer.size(); s++) {
    String strTxtA = txtA.getSubStr(s);
    strLengthA = txtA.getNumToken(s);
    if (strLengthA >= dp.getMinStrLength()) {
        int tokenFileB = 1;
        for (int t = 0; t < txtB.TokenContainer.size(); t++) {
            String strTxtB = txtB.getSubStr(t);
            strLengthB = txtB.getNumToken(t);
            if (strTxtA.equalsIgnoreCase(strTxtB)) {
                try {
                    subStrTemp = new SubStrTemp(
                            txtA.ID, txtB.ID, tokenFileA, tokenFileB,
                            (tokenFileA + strLengthA - 1),
                            (tokenFileB + strLengthB - 1));
                    if (subStrContainer.contains(subStrTemp) == false) {
                        subStrContainer.addElement(subStrTemp);
                    }
                } catch (Exception ex) {
                    logger.error("error");
                }
            }
            tokenFileB += strLengthB;
        }
        tokenFileA += strLengthA;
    }
}
Generally, my code reads two large Strings with a Java tokenizer into containers A and B, and then tries to compare the substrings. The positions of substrings that exist in both strings should be stored in a Vector. But performance is awful, and I don't really know how to solve it with a HashMap.
Your main problem is that you go through all of txtB for each token in txtA.
You should store information about the tokens from txtA (in a HashMap, for instance) and then, in a second loop (not a nested one), compare the tokens from txtB against the ones already in the map.
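A minimal sketch of that idea (token extraction is simplified to whitespace splitting, and the SubStrTemp bookkeeping from the original code is left out):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TokenJoin {
    public static void main(String[] args) {
        String[] tokensA = "alpha Beta gamma".split("\\s+");
        String[] tokensB = "beta delta ALPHA".split("\\s+");

        // One pass over A: map each lowercased token to the positions where it occurs.
        Map<String, List<Integer>> positionsInA = new HashMap<>();
        for (int i = 0; i < tokensA.length; i++) {
            positionsInA.computeIfAbsent(tokensA[i].toLowerCase(), k -> new ArrayList<>()).add(i);
        }

        // One pass over B: a hash lookup replaces the inner loop over A.
        for (int j = 0; j < tokensB.length; j++) {
            List<Integer> hits = positionsInA.get(tokensB[j].toLowerCase());
            if (hits != null) {
                System.out.println(tokensB[j] + " at B[" + j + "] matches A positions " + hits);
            }
        }
    }
}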
On the same topic:
term frequency using java program
How to count words in java
You are doing a join with nested loops? Yes, that is O(n^2). What about doing a hash join instead? That is, create a map from (lowercased) strText to t and do lookups with this map rather than iterating over the token container?
Put the tokens of fileA into a trie data structure. Then when tokenising fileB you can check quite quickly if these tokens are in the trie. A few code comments would help.
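A compact sketch of the trie idea (lowercasing keeps the comparison case-insensitive, as in the original code; only insert and contains are shown):
import java.util.HashMap;
import java.util.Map;

public class Trie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if a token ends at this node
    }

    private final Node root = new Node();

    void insert(String token) {
        Node node = root;
        for (char c : token.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    boolean contains(String token) {
        Node node = root;
        for (char c : token.toLowerCase().toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.isWord;
    }

    public static void main(String[] args) {
        Trie trie = new Trie();
        trie.insert("Google");                        // tokens from file A go in once...
        System.out.println(trie.contains("GOOGLE"));  // ...and file B tokens are checked in O(token length)
        System.out.println(trie.contains("Pixels"));  // false
    }
}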
As said, this is a complexity issue: your algorithm runs in O(n^2), instead of the O(n) you would get using a hash.
For second-order improvements, try to make fewer function calls; for example, you can fetch the size once:
sizeB = txtB.TokenContainer.size();
Depending on the size, you may also call the container once to get an array of strings, to save the getStr... calls.
Roni