i'm making an Highscore implementation for my game. Here there's what i want to do:
I have object Score which contains String Name and Integer score.
Now :
if Name isn't already in the file, add it
if Name is on the file, after a space take the String and convert into integer, so i got the score.
Now, if score is better than the actual, i have to OVERWRITE it on the file...
and here's my problem..i can i do that? how can i write exactly a string over another in a certain point of the file?
Generally it's considered too fiddly to replace text in text files for this kind of requirement, so the usual way to do it is just to read in the whole file, make the replacement and write a new version of the whole file. If you have large amounts of data you would use a database or a NoSQL solution instead.
P.S. consider using serialization, it can make things easier.
I have object Score which contains String Name and Integer score.
Use a Properties file for this instead. It provides an easy interface to load & save the file, get the keys (Name) and set or retrieve values (Score). String values are stored, but they can be converted to/from integer easily.
I concur that this is best done by fully re-serializing the entire database. On modern computers, you can push 30MB/s per disk for linear writes (more if there's sufficient cache). And if you're dealing with more than 30MB of data, you REALLY need a DB (HSQLDB, DerbyDB, BerkleyDB) are trivial DBs. Or go all the way to postgres/mysql.
However, the way to overwrite a FIXED sized section of an existing file (or rather, the way to emulate doing so), is to use:
RandomAccessFile raf = new RandomAccessFile(fileName, "rw");
try {
raf.seek(position);
raf.writeInt(newScore);
raf.close();
} finally { raf.close(); }
Note using the writeInt instead of raf.write(Integer.toHexString(newScore).getBytes()), because you really really need that to be fixed in size.
Now if the text file is intrinsicly ascii (e.g. humans will read the file), and thus the value can't be binary.. Perhaps you could keep it HexString (because that will be fixed in size), or you could a zero-padded decimal string:
But what you absoluately positively can not do is grow the string by 1 byte.
So:
bob 15
joe 7
nina 981
Can't have joe's score replaced with 10, UNLESS you've padded a bunch of spaces.
If this is your data-file, then you will absolutely have to rewrite the whole file (even if you write the extra code to only rewrite from the point of change on - statistically that'll be 50% of the file and thus not worth bothering.
One other thing - if you do rewrite, you have the risk of shortening the file.. For that you need to call
raf.setLength(0);
before writing the first byte.. Otherwise, you'll see phantom text beyond the end of your new file.
final Map<String,Long> symbol2Position = new ConcurrentHashMap<>();
final Map<String,Integer> symbol2Score = new ConcurrentHashMap<>();
final String fileName;
final RandomAccessFile raf;
// ... skipped code
void storeFull() {
raf.position(0);
raf.setLength(0);
for (Map.Entry<String,Long> e : symbol2Position.entrySet()) {
raf.writeUTF8(e.getKey());
raf.write(',');
symbol2Score.put(e.getKey(), raf.position());
raf.writeUTF8(String.format("%06d",e.getValue()));
raf.write('\n');
}
}
void updateScore(String key, int newScore) {
if (symbol2Score.containsKey(key)) {
symbol2Score.put(key, newScore);
raf.position(symbol2Position.get(key));
raf.writeString(String.format("%06d", newScore));
} else {
symbol2Score.put(key, newScore);
long fileLen = raf.length();
symbol2Position.put(key, fileLen);
raf.position(fileLen);
raf.writeString(String.format("%06d", newScore));
}
}
I'd probably rather use a DB or binary map file.. meaning file with 8B per field, 4B pointing to user-name position, and 4B representing score. But this allows for a human readable data-file, while making updates faster than just rewriting the property file.
Check out LevelDB - fastest damn DB on the planet for embedded systems. :) Main thing it has over the above, is thousands / millions of updates per second without the rand-seek-rewrite cost of updating 6 bytes randomly across a multi-GB file.
Just a thought, any specific reason for the file storage of names and scores ?
Seems like a Map<String, Integer> would serve you much better...
Related
I need to build an application which scans through a large amount of files. These files contain blocks with some data about a sessions, in which each line has a different value. E.g.: "=ID: 39487".
At that point I have that line, but the problem I now face is that I need the value n lines above that ID. I was thinking about an Iterator but it only has forward methods. I also thought about saving the results in a List but that defies the reason to use Stream and some files are huge so that would cause memory problems.
I was wondering if something like this is possible using the Stream API (Files)? Or perhaps a better question, is there a better way to approach this?
Stream<String> lines = Files.lines(Paths.get(file.getName()));
Iterator<String> search = lines.iterator();
You can't arbitrarily read backwards and forwards through the file with the same reader (no matter if you're using streams, iterators, or a plain BufferedReader.)
If you need:
m lines before a given line
n lines after the given line
You don't know the value of m and n in advance, until you reach that line
...then you essentially have three options:
Read the whole file once, keep it in memory, and then your task is trivial (but this uses the most memory.)
Read the whole file once, mark the line numbers that you need, then do a second pass where you extract the lines you require.
Read the whole file once, storing some form of metadata about line lengths as you go, then use a RandomAccessFile to extract the specific bits you need without having to read the whole file again.
I'd suggest given the files are huge, the second option here is probably the most realistic. The third will probably give you better performance, but will require much more in the way of development effort.
As an alternative if you can guarantee that both n and m are below a certain value, and that value is a reasonable size - you could also just keep a certain number of lines in a buffer as you're processing the file, and read through that buffer when you need to read lines "backwards".
Try my library. abacus-util
try(Reader reader = new FileReader(yourFile)) {
StreamEx.of(reader)
.sliding(n, n, ArrayList::new)
.filter(l -> l.get(l.size() - 1).contains("=ID: 39487"))
./* then do your work */
}
No matter how big your file is. as long as n is small number, not millions
I'm trying to load in a csv file with a huge amount of lines (>5 million) but it slows down massively when trying to process them all into an arraylist of each value
I've tried a few different variations of reading and removing from the input list i loaded from the file, but it still ends up running out of heapspace, even when i allocate 14gb to the process, while the file is only 2gb
I know i need to be removing values so that i dont end up with duplicate references in memory, so that I dont end up with an arraylist of lines and also an arraylist of the individual comma seperated values, but i have no idea how to do something like that
Edit : For reference, in this particular situation, data should end up containing 16 * 5 million values.
If there's a more elegant solution, i'm all for it
The intention when loading this file is to process it as a database, with the appropriate methods like select and select where, all handled by a sheet class. It worked just fine with my smaller sample file of 36k lines, but i guess it doesnt scale very well
Current code :
//Load method to load it from file
private static CSV loadCSV(String filename, boolean absolute)
{
String fullname = "";
if (!absolute)
{
fullname = baseDirectory + filename;
if (!Load.exists(fullname,false))
return null;
}
else if (absolute)
{
fullname = filename;
if (!Load.exists(fullname,false))
return null;
}
ArrayList<String> output = new ArrayList<String>();
AtomicInteger atomicInteger = new AtomicInteger(0);
try (Stream<String> stream = Files.lines(Paths.get(fullname)))
{
stream.forEach(t -> {
output.add(t);
atomicInteger.getAndIncrement();
if (atomicInteger.get() % 10000 == 0)
{
Log.log("Lines done " + output.size());
}
});
CSV c = new CSV(output);
return c;
}
catch (IOException e)
{
Log.log("Error reading file " + fullname,3,"FileIO");
e.printStackTrace();
}
return null;
}
//Process method inside CSV class
public CSV(List<String> output)
{
Log.log("Inside csv " + output.size());
ListIterator<String> iterator = output.listIterator();
while (iterator.hasNext())
{
ArrayList<String> d = new ArrayList<String>(Arrays.asList(iterator.next().split(splitter,-1)));
data.add(d);
iterator.remove();
}
}
You need to use any database, which provide required functionality for your task (select, group).
Any database can effective read and aggregate 5 million rows.
Don't try to use "operations on ArrayList", it's works good only on small dataset.
I think some key concepts are missing here:
You said the file size is 2GB. That does not mean that when you load that file data in an ArrayList, the size in memory would also be 2GB. Why? Usually files store data using UTF-8 character encoding, whereas JVM internally stores String values using UTF-16. So, assuming your file contains only ASCII characters, each character occupies 1 byte in the filesystem whereas 2 bytes in memory. Assuming (for the sake of simplicity) all String values are unique, there will be space required to store the String references which are 32 bits each (assuming a 64 bits system with compressed oop). How much is your heap (excluding other memory areas)? How much is your eden space and old space? I'll come back to this again shortly.
In your code, you don't specify ArrayList size. This is a blunder in this case. Why? JVM creates a small ArrayList. After sometime JVM sees that this guy keeps pumping in data. Let's create a bigger ArrayList and copy the data of the old ArrayList into the new list. This event has some deeper implications when you are dealing with such huge volume of data: firstly, note that both the old and new arrays (with millions of entries) are in memory simultaneously occupying space, secondly unnecessarily data copy happens from one array to another - not once or twice but repeatedly, everytime the array run out of space. What happens to the old array? Well it's discarded and needs to be garbage collected. So, these repeated array copy and garbage collections slow down the process. CPU is really working hard here. What happens when your data no longer fits into the young generation (which is smaller than heap)? Maybe you need to see the behaviour using something like JVisualVM.
All in all, what I mean to say is there are good number of reasons why a 2GB file fills up your much larger heap and why your process performance is poor.
I would have a method that took a line read from the file as parameter and split it into a list of strings and then returned that list. I would then add that list to the CSV object in the file reading loop. That would mean only one large collection instead of two and the read lines could be freed from memory quicker.
Something like this
CSV csv = new CSV();
try (Stream<String> stream = Files.lines(Paths.get(fullname))) {
stream.forEach(t -> {
List<String> splittedString = splitFileRow(t);
csv.add(splittedString);
});
Trying to solve this problem using pure Java it is overwhelming. I suggest using a processing engine like Apache Spark that can process the file in a distributed way, by increasing the level of parallelism.
Apache Spark has specific APIs to load CSV file:
spark.read.format("csv").option("header", "true").load("../Downloads/*.csv")
You can transform it into an RDD, or Dataframe and perform operations on it.
You can find more online, or here
I need the advice from someone who knows very well java and the memory issues. I have a large CSV files (something like 500mb each) and I need to merge these files in one using only 64mb of xmx. I've tried to do it different ways, but nothing works - always got memory exception. What should I do to make it work properly?
The task is:
Develop a simple implementation that joins two input tables in a reasonably efficient way and can store both tables in RAM if needed.
My code works, but it takes alot of memory, so can't fit at 64mb.
public class ImprovedInnerJoin {
public static void main(String[] args) throws IOException {
RandomAccessFile firstFile = new RandomAccessFile("input_A.csv", "r");
FileChannel firstChannel = firstFile.getChannel();
RandomAccessFile secondFile = new RandomAccessFile("input_B.csv", "r");
FileChannel secondChannel = secondFile.getChannel();
RandomAccessFile resultFile = new RandomAccessFile("result2.csv", "rw");
FileChannel resultChannel = resultFile.getChannel().position(0);
ByteBuffer resultBuffer = ByteBuffer.allocate(40);
ByteBuffer firstBuffer = ByteBuffer.allocate(25);
ByteBuffer secondBuffer = ByteBuffer.allocate(25);
while (secondChannel.position() != secondChannel.size()){
Map <String, List<String>>table2Part = new HashMap();
for (int i = 0; i < secondChannel.size(); ++i){
if (secondChannel.read(secondBuffer) == -1)
break;
secondBuffer.rewind();
String[] table2Tuple = (new String(secondBuffer.array(), Charset.defaultCharset())).split(",");
if (!table2Part.containsKey(table2Tuple[0]))
table2Part.put(table2Tuple[0], new ArrayList());
table2Part.get(table2Tuple[0]).add(table2Tuple[1]);
secondBuffer.clear();
}
Set <String> taple2keys = table2Part.keySet();
while (firstChannel.read(firstBuffer) != -1){
firstBuffer.rewind();
String[] table1Tuple = (new String(firstBuffer.array(), Charset.defaultCharset())).split(",");
for (String table2key : taple2keys){
if (table1Tuple[0].equals(table2key)){
for (String value : table2Part.get(table2key)){
String result = table1Tuple[0] + "," + table1Tuple[1].substring(0,14) + "," + value; // 0,14 or result buffer will be overflown
resultBuffer.put(result.getBytes());
resultBuffer.rewind();
while(resultBuffer.hasRemaining()){
resultChannel.write(resultBuffer);
}
resultBuffer.clear();
}
}
}
firstBuffer.clear();
}
firstChannel.position(0);
table2Part.clear();
}
firstChannel.close();
secondChannel.close();
resultChannel.close();
System.out.println("Operation completed.");
}
}
A very easy to implement version of an external join is the external hash join.
It is much easier to implement than an external merge sort join and only has one drawback (more on that later).
How does it work?
Very similar to a hashtable.
Choose a number n, which signifies how many files ("buckets") you're distributing your data into.
Then do the following:
Setup n file writers
For each of your files that you want to join and for each line:
take the hashcode of the key you want to join on
compute the modulo of the hashcode and n, that will give you k
append your csv line to the kth file writer
Flush/Close all n writers.
Now you have n, hopefully smaller, files with the guarantee that the same key will always be in the same file. Now you can run your standard HashMap/HashMultiSet based join on each of these files separately.
Limitations
Why did I mentioned hopefully smaller files? Well, it depends on the distribution of the keys and their hashcodes. Think for the worst case, all of your files have the exact same key: you only have one file and you didn't win anything from partitioning.
Similar for skewed distributions, sometimes a few of your bucket files will be too big to fit into your RAM.
Usually there are three ways out of this dilemma:
Run the algorithm again with a bigger n, so you have more buckets to distribute to
Take only the buckets that are too big and do another hash partitioning pass only on those files (so each file goes into n newly created buckets again)
Fallback to an external merge sort on the big partition files.
Sometimes all three are used in a different combinations, which is called dynamic partitioning.
If central memory is a constraint for your application but you can access a persistent file, I would create as suggested by blahfunk a temporary SQLite file to your tmp folder, read every file by chunks and merge them with a simple join. You could could create a temporary SQLite DB by giving a look to libraries such as Hibernate, just take a look to what have I found on this StackOverflow question: How to create database in Hibernate at runtime?
If you cannot perform such a task, your remaining option is to consume more cpu and load just the first row of the first file searching for a row with the same index on the second file, buffering the result and flushing them as late as possible on the output file, repeating this for every row of the first file.
Maybe you can stream the first file and turn each line into a hashcode and save all those hashcodes in memory. Then stream the second file and make a hashcode for each line as it comes in. If the hashcode is in the first file, i.e., in memory, then don't write the line, else write the line. After that, append the first file in its entirety into the result file.
This would be effectively creating an index to compare your updates to.
It's probably a stupid question but here's the thing. I was reading this question:
Storing 1 million phone numbers
and the accepted question was what I was thinking: using a trie. In the comments Matt Ball suggested:
I think storing the phone numbers as ASCII text and compressing is a very reasonable suggestion
Problem: how do I do that in Java? And ASCII text does stand for String?
For in-memory storage as indicated in the question:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStreamWriter out = new OutputStreamWriter(
new GZIPOutputStream(baos), "US-ASCII");
for(String number : numbers){
out.write(number);
out.write('\n');
}
byte[] data = baos.toByteArray();
But as Pete remarked: this may be good for memory efficiency, but you can't really do anything with the data afterwards, so it's not really very useful.
Yes, ASCII means Strings in this case. You can store compressed data in Java using the java.util.zip.GZIPOutputStream.
In answer to an implied, but different question;
Q: You have 1 billion phones numbers and you need to send these over a low bandwidth connection. You only need to send whether the phone number is in the collection or not. (No other information required)
A: This is the general approach
First sort the list if its not sorted already.
From the lowest number find regions of continuous numbers. Send the start of the region and the phones which are taken. This can be stored a BitSet (1-bit per possible number) Send the phone number at the start and the BitSet whenever the gap is more than some threshold.
Write the stream to a compressed data set.
Test this to compare with a simple sending of all numbers.
You can use Strings in a sorted TreeMap. One million numbers is not very much and will use about 64 MB. I don't see the need for a more complex solution.
The latest version of Java can store ASCII text efficiently by using a byte[] instead of a char[] however, the overhead of your data structure is likely to be larger.
If you need to store a phone numbers as a key, you could store them with the assumption that large ranges will be continous. As such you could store them like
NavigableMap<String, PhoneDetails[]>
In this structure, the key would define the start of the range and you could have a phone details for each number. This could be not much bigger than the reference to the PhoneDetails (which is the minimum)
BTW: You can invent very efficient structures if you don't need access to the data. If you never access the data, don't keep it in memory, in fact you can just discard it as it won't ever be needed.
Alot depending on what you want to do with the data and why you have it in memory at all.
You can Use DeflatorOutputStream to a ByteArrayOutputStream, which will be very small, but not very useful.
I suggest using DeflatorOutputStream as its more light weight/faster/smaller than GZIPOutputStream.
Java String are by default UTF-8 encoded, you have to change the encoding if you want to manipulate ASCII text.
I have a large (3Gb) binary file of doubles which I access (more or less) randomly during an iterative algorithm I have written for clustering data. Each iteration does about half a million reads from the file and about 100k writes of new values.
I create the FileChannel like this...
f = new File(_filename);
_ioFile = new RandomAccessFile(f, "rw");
_ioFile.setLength(_extent * BLOCK_SIZE);
_ioChannel = _ioFile.getChannel();
I then use a private ByteBuffer the size of a double to read from it
private ByteBuffer _double_bb = ByteBuffer.allocate(8);
and my reading code looks like this
public double GetValue(long lRow, long lCol)
{
long idx = TriangularMatrix.CalcIndex(lRow, lCol);
long position = idx * BLOCK_SIZE;
double d = 0;
try
{
_double_bb.position(0);
_ioChannel.read(_double_bb, position);
d = _double_bb.getDouble(0);
}
...snip...
return d;
}
and I write to it like this...
public void SetValue(long lRow, long lCol, double d)
{
long idx = TriangularMatrix.CalcIndex(lRow, lCol);
long offset = idx * BLOCK_SIZE;
try
{
_double_bb.putDouble(0, d);
_double_bb.position(0);
_ioChannel.write(_double_bb, offset);
}
...snip...
}
The time taken for an iteration of my code increases roughly linearly with the number of reads. I have added a number of optimisations to the surrounding code to minimise the number of reads, but I am at the core set that I feel are necessary without fundamentally altering how the algorithm works, which I want to avoid at the moment.
So my question is whether there is anything in the read/write code or JVM configuration I can do to speed up the reads? I realise I can change hardware, but before I do that I want to make sure that I have squeezed every last drop of software juice out of the problem.
Thanks in advance
As long as your file is stored on a regular harddisk, you will get the biggest possible speedup by organizing your data in a way that gives your accesses locality, i.e. causes as many get/set calls in a row as possible to access the same small area of the file.
This is more important than anything else you can do because accessing random spots on a HD is by far the slowest thing a modern PC does - it takes about 10,000 times longer than anything else.
So if it's possible to work on only a part of the dataset (small enough to fit comfortably into the in-memory HD cache) at a time and then combine the results, do that.
Alternatively, avoid the issue by storing your file on an SSD or (better) in RAM. Even storing it on a simple thumb drive could be a big improvement.
Instead of reading into a ByteBuffer, I would use file mapping, see: FileChannel.map().
Also, you don't really explain how your GetValue(row, col) and SetValue(row, col) access the storage. Are row and col more or less random? The idea I have in mind is the following: sometimes, for image processing, when you have to access pixels like row + 1, row - 1, col - 1, col + 1 to average values; on trick is to organize the data in 8 x 8 or 16 x 16 blocks. Doing so helps keeping the different pixels of interest in a contiguous memory area (and hopefully in the cache).
You might transpose this idea to your algorithm (if it applies): you map a portion of your file once, so that the different calls to GetValue(row, col) and SetValue(row, col) work on this portion that's just been mapped.
Presumably if we can reduce the number of reads then things will go more quickly.
3Gb isn't huge for a 64 bit JVM, hence quite a lot of the file would fit in memory.
Suppose that you treat the file as "pages" which you cache. When you read a value, read the page around it and keep it in memory. Then when you do more reads check the cache first.
Or, if you have the capacity, read the whole thing into memory, in at the start of processing.
Access byte-by-byte always produce poor performance (not only in Java). Try to read/write bigger blocks (e.g. rows or columns).
How about switching to database engine for handling such amounts of data? It would handle all optimizations for you.
May be This article helps you ...
You might want to consider using a library which is designed for managing large amounts of data and random reads rather than using raw file access routines.
The HDF file format may by a good fit. It has a Java API but is not pure Java. It's licensed under an Apache Style license.