I have more than 10 million JSON documents of the form:
{"key": "val2", "key1": "val", "key2": "val2"}
in one file.
Importing it using the Java driver API took around 3 hours with the following function (which imports one BSON document at a time):
public static void importJSONFileToDBUsingJavaDriver(String pathToFile, DB db, String collectionName) {
    // open file
    FileInputStream fstream = null;
    try {
        fstream = new FileInputStream(pathToFile);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        System.out.println("file does not exist, exiting");
        return;
    }
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    // read it line by line
    String strLine;
    DBCollection newColl = db.getCollection(collectionName);
    try {
        while ((strLine = br.readLine()) != null) {
            // convert each line to BSON
            DBObject bson = (DBObject) JSON.parse(strLine);
            // insert the BSON document into the database
            try {
                newColl.insert(bson);
            } catch (MongoException e) {
                // duplicate key
                e.printStackTrace();
            }
        }
        br.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Is there a faster way? Maybe MongoDB settings could influence the insertion speed? For example, adding a "_id" key that functions as the index, so that MongoDB would not have to create an artificial key (and thus an index) for each document, or disabling index creation at insertion time altogether.
Thanks.
I'm sorry, but you're all picking at minor performance issues instead of the core one. Separating the file-reading logic from the inserting is a small gain. Loading the file in binary mode (via MMAP) is a small gain. Using Mongo's bulk inserts is a big gain, but still no dice.
The whole performance bottleneck is the DBObject bson = (DBObject) JSON.parse(line). Or in other words, the problem with the Java driver is that it needs a conversion from JSON to BSON, and this code seems to be awfully slow or badly implemented. A full JSON round trip (encode + decode) via json-simple, or especially via json-smart, is 100 times faster than the JSON.parse() command.
I know Stack Overflow is telling me right above this box that I should be posting an answer, which I'm not, but rest assured that I'm still looking for an answer to this problem. I can't believe all the talk about Mongo's performance when this simple example code fails so miserably.
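To show what I mean, this is the workaround I'm experimenting with: parse each line with json-simple (json-smart has an almost identical API) and build the DBObject from the resulting Map, bypassing JSON.parse() entirely. This is only a sketch against the legacy driver classes used in the question's code; the class name and the flat-document assumption are mine.
import java.util.Map;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;

public class FastLineParser {
    // json-simple parser; not thread-safe, so use one instance per thread
    private final JSONParser parser = new JSONParser();

    // Converts one line of the dump into a DBObject without going through com.mongodb.util.JSON
    @SuppressWarnings("unchecked")
    public DBObject parseLine(String line) throws ParseException {
        Map<String, Object> map = (Map<String, Object>) parser.parse(line);
        // nested objects come back as json-simple Maps/Lists, which the driver can still encode
        return new BasicDBObject(map);
    }
}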
I imported a multi-line JSON file with ~250M records. I just used mongoimport < data.txt and it took 10 hours. Compared to your 10M in 3 hours, I think this is considerably faster.
Also, from my experience, writing your own multi-threaded parser speeds things up drastically. The procedure is simple (a rough sketch follows below):
Open the file as BINARY (not TEXT!)
Set markers (offsets) evenly across the file. The number of markers depends on the number of threads you want.
Search for '\n' near each marker and adjust the markers so they are aligned to line boundaries.
Parse each chunk with its own thread.
A reminder: when you want performance, don't use a stream reader or any built-in line-based read method. They are slow. Just use a binary buffer and search for '\n' to identify a line, and (most preferably) do in-place parsing in the buffer without creating a String. Otherwise the garbage collector won't be happy with this.
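Here is a minimal sketch of the marker/offset idea, assuming a '\n'-terminated file; the class and method names are mine and the per-chunk parsing is left as a stub:
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class ChunkedImport {

    // Splits the file into nThreads chunks whose boundaries are aligned to '\n'
    static long[] alignedOffsets(RandomAccessFile raf, int nThreads) throws IOException {
        long length = raf.length();
        long[] offsets = new long[nThreads + 1];
        offsets[nThreads] = length;
        for (int i = 1; i < nThreads; i++) {
            long pos = i * (length / nThreads);          // marker placed evenly
            raf.seek(pos);
            while (pos < length && raf.read() != '\n') { // calibrate to the next line break
                pos++;
            }
            offsets[i] = Math.min(pos + 1, length);
        }
        return offsets;
    }

    public static void main(String[] args) throws Exception {
        int nThreads = Runtime.getRuntime().availableProcessors();
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r")) {
            long[] offsets = alignedOffsets(raf, nThreads);
            FileChannel channel = raf.getChannel();
            Thread[] workers = new Thread[nThreads];
            for (int i = 0; i < nThreads; i++) {
                // map each chunk read-only as a binary buffer (no line-based reader involved)
                MappedByteBuffer chunk = channel.map(FileChannel.MapMode.READ_ONLY,
                        offsets[i], offsets[i + 1] - offsets[i]);
                workers[i] = new Thread(() -> parseChunk(chunk));
                workers[i].start();
            }
            for (Thread t : workers) {
                t.join();
            }
        }
    }

    static void parseChunk(MappedByteBuffer chunk) {
        // stub: scan the buffer for '\n', parse each line in place and insert it into MongoDB
    }
}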
You can parse the entire file at once and then insert the whole JSON into the Mongo document. Avoid multiple loops; you need to separate the logic as follows:
1) Parse the file and retrieve the JSON object.
2) Once the parsing is over, save the JSON object in the Mongo document.
I've got a slightly faster way (I'm also inserting millions at the moment): insert lists of documents instead of single documents with
insert(List<DBObject> list)
http://api.mongodb.org/java/current/com/mongodb/DBCollection.html#insert(java.util.List)
That said, it's not that much faster. I'm about to experiment with setting other WriteConcerns than ACKNOWLEDGED (mainly UNACKNOWLEDGED) to see if I can speed things up further. See http://docs.mongodb.org/manual/core/write-concern/ for info.
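Roughly what I mean, as a sketch against the legacy driver API from the question's code (the batch size of 1000 and the UNACKNOWLEDGED write concern are just the values I'm experimenting with):
import java.util.ArrayList;
import java.util.List;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.WriteConcern;

public class BatchedInsert {

    // Inserts documents in batches instead of one insert() call per document
    public static void insertBatched(DBCollection coll, Iterable<DBObject> docs) {
        final int batchSize = 1000;                              // tune this
        List<DBObject> batch = new ArrayList<DBObject>(batchSize);
        for (DBObject doc : docs) {
            batch.add(doc);
            if (batch.size() == batchSize) {
                coll.insert(batch, WriteConcern.UNACKNOWLEDGED); // no per-write acknowledgement
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            coll.insert(batch, WriteConcern.UNACKNOWLEDGED);     // flush the remainder
        }
    }
}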
Another way to improve performance is to create the indexes after the bulk insert. However, this is rarely an option except for one-off jobs.
Apologies if this is slightly woolly sounding; I'm still testing things myself. Good question.
You can also remove all the indexes (except for the PK index, of course) and rebuild them after the import.
Use bulk insert/upsert operations. Since MongoDB 2.6 you can do bulk updates/upserts. The example below does a bulk upsert using the C# driver.
MongoCollection<foo> collection = database.GetCollection<foo>(collectionName);
var bulk = collection.InitializeUnorderedBulkOperation();
foreach (FooDoc fooDoc in fooDocsList)
{
    var update = new UpdateDocument { { fooDoc.ToBsonDocument() } };
    bulk.Find(Query.EQ("_id", fooDoc.Id)).Upsert().UpdateOne(update);
}
BulkWriteResult bwr = bulk.Execute();
You can use bulk insertion.
You can read the documentation on the MongoDB website, and you can also check this Java example on Stack Overflow.
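For reference, with the Java driver (2.12+) a bulk insert looks roughly like this; it is only a sketch, with the collection obtained the same way as in the question's code:
import com.mongodb.BulkWriteOperation;
import com.mongodb.BulkWriteResult;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

public class BulkInsertExample {

    // Queues all documents into one unordered bulk operation and executes it once
    public static BulkWriteResult bulkInsert(DBCollection coll, Iterable<DBObject> docs) {
        BulkWriteOperation bulk = coll.initializeUnorderedBulkOperation();
        for (DBObject doc : docs) {
            bulk.insert(doc);
        }
        return bulk.execute();
    }
}
For millions of documents you would still want to execute in batches of a few thousand inserts rather than queueing everything into one operation.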
Related
I have some data that I want to write.
Code:
private void saveStats(int early, int infected, int recovered, int deads, int notInfected, int vaccinated, int iteration) {
    try {
        FileWriter txt = new FileWriter("statistic.csv");
        txt.write(String.valueOf(early));
        txt.write(";");
        txt.write(String.valueOf(infected));
        txt.write(";");
        txt.write(String.valueOf(recovered));
        txt.write(";");
        txt.write(String.valueOf(deads));
        txt.write(";");
        txt.write(String.valueOf(notInfected));
        txt.write(";");
        txt.write(String.valueOf(vaccinated));
        txt.write("\n");
        txt.close();
    } catch (IOException ex) {
        ex.printStackTrace();
        System.out.println("Error!");
    }
}
I will use this function to save the iteration number and some additional data; for example:
Iteration Infected Recovered Dead NotInfected Vaccinated
1 200 300 400 500
2 300 400 600 900
etc
A perfect solution would have the first row of the file hold names for each column, similar to what's written above.
For something like this, it is a good idea to use an existing Java CSV library. One possibility is Apache Commons CSV. "Google is your friend" if you want to find tutorials or other alternatives.
But if you wanted to "roll your own" code, there are various ways to do it. The simplest way to change your code so that it records multiple rows in the CSV would be to change
new FileWriter("statistic.csv");
to
new FileWriter("statistic.csv", true);
That opens the file in "append" mode, and the new row will be added at the end of the file instead of replacing the existing row.
You should also use Java 7+ try-with-resources to manage the FileWriter. That will make sure that the FileWriter is always properly closed.
If you want to get fancy with CSV headers, more efficient file handling, etc, you will need to write your own CSVWriter class. But if you are doing that, you would be better off using a library someone has already designed, written and tested. (See above!)
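A minimal sketch of the hand-rolled version, combining append mode, try-with-resources and a header row (the column order and the ';' separator are taken from the question; the header-detection logic is my own assumption):
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class StatsWriter {

    public void saveStats(int iteration, int infected, int recovered, int deads,
                          int notInfected, int vaccinated) {
        File file = new File("statistic.csv");
        boolean writeHeader = !file.exists() || file.length() == 0;  // write the header only once
        // try-with-resources closes (and flushes) the writer even if an exception is thrown
        try (FileWriter txt = new FileWriter(file, true)) {           // true = append mode
            if (writeHeader) {
                txt.write("Iteration;Infected;Recovered;Dead;NotInfected;Vaccinated\n");
            }
            txt.write(iteration + ";" + infected + ";" + recovered + ";" + deads + ";"
                    + notInfected + ";" + vaccinated + "\n");
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}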
I need to process a large file and insert its contents into a DB, and I don't want to spend a lot of RAM doing it. I know I can read lines in streaming mode using the Apache Commons API or a BufferedReader, but I wish to insert into the DB in batch mode, e.g. 1000 insertions in one go, not one by one. Is reading the file line by line, adding to a list, counting its size, inserting, and then clearing the list of lines the only option to achieve this?
Based on your description, Spring Batch fits very well.
Basically, it uses the chunk concept to read/process/write the content. By the way, the processing can run concurrently for performance.
@Bean
protected Step loadFeedDataToDbStep() {
    return stepBuilder.get("load new fincon feed").<com.xxx.Group, FinconFeed>chunk(250)
            .reader(itemReader(OVERRIDDEN_BY_EXPRESSION))
            .processor(itemProcessor(OVERRIDDEN_BY_EXPRESSION, OVERRIDDEN_BY_EXPRESSION_DATE, OVERRIDDEN_BY_EXPRESSION))
            .writer(itemWriter())
            .listener(archiveListener())
            .build();
}
You can refer to the documentation here for more.
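If you'd rather not pull in Spring Batch, the approach you describe (read line by line, buffer rows, flush every 1000) works fine too; here is a rough sketch assuming a relational target via JDBC (the table name, column and SQL are made up for illustration):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchLoader {

    public static void load(Connection conn, String path) throws IOException, SQLException {
        final int batchSize = 1000;                                // flush every 1000 rows
        String sql = "INSERT INTO lines (content) VALUES (?)";     // hypothetical table and column
        try (BufferedReader reader = new BufferedReader(new FileReader(path));
             PreparedStatement ps = conn.prepareStatement(sql)) {
            String line;
            int count = 0;
            while ((line = reader.readLine()) != null) {
                ps.setString(1, line);
                ps.addBatch();                                     // buffer the row, no DB round trip yet
                if (++count % batchSize == 0) {
                    ps.executeBatch();                             // send one batch of 1000
                }
            }
            ps.executeBatch();                                     // flush the final partial batch
        }
    }
}
Only one batch of rows is held in memory at a time, so RAM usage stays flat no matter how large the file is.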
I have a Java program that is supposed to output data, read that data back in, and then output it again with a few extra columns of results (so two outputs in total). To test my program I just tried to read and write out the exact same CSV to see if it works. However, my first output contains 786718 rows of data, which is complete and correct, but when it gets read and written again the second time, the data is cut off at row 786595, and even that row is missing some column data. The file sizes are also 74868KB vs. 74072KB. Is this because my Java program ran out of memory, or is it a problem with Excel/the .csv file?
PrintWriter writer = null;
try {
    writer = new PrintWriter(saveFileName + " updated.csv", "UTF-8");
    for (Map.Entry<String, ArrayList> entry : readOutputCSV(saveFileName).entrySet()) {
        FindOutput.find(entry.getKey(), entry.getValue(), checkInMRTWriter);
    }
} finally {
    if (writer != null) {
        writer.flush();
        writer.close();
    }
}
The most likely reason is that you are not flushing or closing the PrintWriter.
From the Java source
public PrintWriter(OutputStream out) {
    this(out, false);
}

public PrintWriter(OutputStream out, boolean autoFlush) {
    this(new BufferedWriter(new OutputStreamWriter(out)), autoFlush);
}
You can see that PrintWriter is buffered by default.
The default buffer size is 8 KiB so if you leave this data in the buffer and don't write it out you can lose up to the last 8 KiB of your data.
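One way to guarantee the buffer gets flushed is try-with-resources, which closes (and therefore flushes) the writer even when an exception is thrown. A sketch adapted to the snippet in the question (saveFileName, readOutputCSV and FindOutput are the identifiers from there; I'm passing the PrintWriter itself to FindOutput.find on the assumption that that is where the rows get written):
try (PrintWriter writer = new PrintWriter(saveFileName + " updated.csv", "UTF-8")) {
    for (Map.Entry<String, ArrayList> entry : readOutputCSV(saveFileName).entrySet()) {
        // write through 'writer'; close() at the end of this block flushes the buffer
        FindOutput.find(entry.getKey(), entry.getValue(), writer);
    }
}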
Some things that might have an influence here:
input/output encoding
line separators (you might be reading a file with '\r\n' and writing '\n' back)
CSV escaping - values might be escaped or not depending on how you handle the special cases (values with newlines, commas, or quotes). You might be reading valid CSV with a parser but printing out unescaped (and broken) CSV.
whitespace. Some libraries trim whitespace automatically when parsing.
The best way to verify is to use a CSV parsing library, such as univocity-parsers, and use it to read/write your data with a fixed format configuration. Disclosure: I am the author of this library. It's open-source and free (Apache 2.0 license).
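For example, a round trip with univocity-parsers could look roughly like this (a sketch; see the documentation for the exact settings your format needs):
import java.io.FileReader;
import java.io.FileWriter;
import java.util.List;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import com.univocity.parsers.csv.CsvWriter;
import com.univocity.parsers.csv.CsvWriterSettings;

public class CsvRoundTrip {

    public static void copy(String inputPath, String outputPath) throws Exception {
        CsvParserSettings parserSettings = new CsvParserSettings();
        parserSettings.setLineSeparatorDetectionEnabled(true);   // handles \r\n vs. \n transparently

        CsvParser parser = new CsvParser(parserSettings);
        List<String[]> rows = parser.parseAll(new FileReader(inputPath));

        CsvWriter writer = new CsvWriter(new FileWriter(outputPath), new CsvWriterSettings());
        for (String[] row : rows) {
            writer.writeRow((Object[]) row);                     // quoting/escaping applied consistently
        }
        writer.close();
    }
}
If both sides use the same settings, any row or column that still goes missing points to the data itself rather than to escaping or line-separator handling.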
I would like to extract the value of <div class="score">4.1</div> from a website with Java (Android). I tried Jsoup, and even though it couldn't be simpler to use, it gives me the value in 8 seconds, which is very slow. For context, the page source of the site has 300,000 characters and this <div> is somewhere in the middle.
Even using HttpClient, reading the source into a StringBuilder and then scanning the whole string until the score part is found is faster (3-4 seconds).
I couldn't try HtmlUnit, as it requires a massive number of jar files, and after a while Eclipse always got hopelessly confused.
Is there a faster way?
You may simply send an XMLHttpRequest and then search the response using the search() function. I think this would be much faster.
Similar question: Retrieving source code using XMLhttpRequest in javascript
To make the search faster, you can simply use indexOf([string to search], [starting index]) and specify the starting index (it doesn't need to be very accurate; you just have to narrow down the search area).
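In Java terms, the indexOf approach could look roughly like this (a sketch; the marker string comes from the question and the rough starting offset is my assumption):
// Extracts the text of <div class="score">...</div> from the raw page source
public static String extractScore(String page) {
    String marker = "<div class=\"score\">";
    // start searching near the middle of the page, since the question says the div is around there
    int from = Math.max(0, page.length() / 2 - 10000);
    int start = page.indexOf(marker, from);
    if (start < 0) {
        start = page.indexOf(marker);       // fall back to a full scan if the guess overshot
    }
    if (start < 0) {
        return null;                        // marker not found at all
    }
    start += marker.length();
    int end = page.indexOf("</div>", start);
    return (end < 0) ? null : page.substring(start, end).trim();
}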
Here is what I did. The problem was that I read the webpage line by line and then glued the lines together into a StringBuilder before searching for the specific part. Then I asked myself: why do I read the page line by line and then glue the lines back together? So instead I read the page into a byte array and converted it into a String. The scraping time became less than a second!
ByteArrayOutputStream outputDoc = new ByteArrayOutputStream();
try
{
    InputStream is = new URL(url).openStream();
    byte[] buf = new byte[1024];
    int len;
    while ((len = is.read(buf)) > 0)
    {
        outputDoc.write(buf, 0, len);
    }
    is.close();
    outputDoc.close();
} catch (Exception e) { e.printStackTrace(); }

try {
    page = new String(outputDoc.toByteArray(), "UTF-8");
    // here I used page.indexOf to find the part
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
I was curious what the best and FASTEST way is to get a response from the server, say if I used a for loop to load a URL that returns an XML file. Which way could I use to load the URL and get the response 10 times in a row? Speed is the most important thing. I know it can only go as fast as my internet connection, but I need a way to load the URL as fast as my connection will allow and then put the whole output of the URL into a String so I can append it to a JTextArea. This is the code I've been using, but I'm seeking faster alternatives if possible:
int times = Integer.parseInt(jTextField3.getText());
for (int abc = 0; abc != times; abc++) {
    try {
        URL gameHeader = new URL(jTextField2.getText());
        InputStream in = gameHeader.openStream();
        byte[] buffer = new byte[1024];
        try {
            for (int cwb; (cwb = in.read(buffer)) != -1;) {
                jTextArea1.append(new String(buffer, 0, cwb));
            }
        } catch (IOException e) {}
    } catch (MalformedURLException e) {} catch (IOException e) {}
}
Is there anything that would be faster than this?
Thanks
-CLUEL3SS
This seems like a job for Java NIO (non-blocking I/O). This article is from Java 1.4, but it will still give you a good understanding of how to set up NIO. Since then NIO has evolved a lot, and you may need to look up the API for Java 6 or Java 7 to find out what's new.
This solution is probably best as an async option. Basically it will allow you to load 10 URLs without waiting for each one to complete before moving on and loading another.
You can't load text this way, as the 1024-byte boundary could break an encoded character in two.
Copy all the data to a ByteArrayOutputStream and use toString() on it, or read text as text using a BufferedReader.
Use a BufferedReader with a much larger buffer size than 1024, and don't swallow exceptions. You could also try re-using the same URL object instead of creating a new one each time; that might help with connection pooling.
But why would you want to read the same URL 10 times in a row?
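A sketch of what that could look like (the 8192-character buffer and the UTF-8 charset are my assumptions; jTextField2 and jTextArea1 come from the question's code):
// Reads the whole response as text, decoding with an explicit charset
StringBuilder response = new StringBuilder();
try {
    URL gameHeader = new URL(jTextField2.getText());
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(gameHeader.openStream(), "UTF-8"), 8192)) {
        char[] buffer = new char[8192];      // larger buffer than 1024, and chars instead of raw bytes
        int read;
        while ((read = reader.read(buffer)) != -1) {
            response.append(buffer, 0, read);
        }
    }
} catch (IOException e) {
    e.printStackTrace();                     // don't silently swallow the exception
}
jTextArea1.append(response.toString());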