I'm trying to find the most efficient way to parse a large file and save the results in a database.
The file in question is a 12,500,000-line text file containing logs from a server.
A log line looks like this:
[notice] 2021-03-10T16:19:26.102551Z couchdb#127.0.0.1 <0.8999.68> 351ac014dd 87.92.211.148:5984 125.129.113.37 user1 GET /userdb 200 ok 8
I'm parsing the database name (userdb) and the verb (GET).
while ((line = reader.readLine()) != null) {
    String[] data = line.split("\\s+");
    // data[8] is the verb (GET), data[9] is the path ("/userdb"); drop the leading slash
    service.save(new Request(data[8], data[9].substring(1)));
}
The parsing time (under 1 ms) is insignificant compared to the time it takes to save a record to the database (78.6 ms).
I wanted to do this async, but from what I understand, you can't save records to a database asynchronously.
Any idea which way to go to make this faster?
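For what it's worth, batching the inserts is one common approach; here is a minimal JDBC sketch of that idea (the requests table, its columns, the batch size, and the connection variable are assumptions, not from the question):

    String sql = "INSERT INTO requests (verb, db_name) VALUES (?, ?)";
    try (PreparedStatement ps = connection.prepareStatement(sql)) {  // connection: an assumed java.sql.Connection
        int pending = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            String[] data = line.split("\\s+");
            ps.setString(1, data[8]);               // verb, e.g. GET
            ps.setString(2, data[9].substring(1));  // database name without the leading '/'
            ps.addBatch();
            if (++pending == 1000) {
                ps.executeBatch();                  // one round trip per 1000 rows
                pending = 0;
            }
        }
        if (pending > 0) {
            ps.executeBatch();
        }
    }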
I use a Vertica flex table to load JSON into Vertica without defining the tables, and I'm having problems with my loading time.
I connect to Vertica with the JDBC driver and then use this code:
String copyQuery = "COPY schema.tablename FROM STDIN PARSER fjsonparser()";
VerticaCopyStream vstream = new VerticaCopyStream((VerticaConnection)conn, copyQuery);
InputStream input;
vstream.start();
for (JsonNode json : jsonList) {
    input = new ByteArrayInputStream(json.toString().getBytes());
    vstream.addStream(input);
    input.close();
}
vstream.execute();
vstream.finish();
The command vstream.execute() takes 12 seconds for 5000 JSON documents, but when I use the COPY command from a file it runs in less than a second.
Your problem is not with VerticaCopyStream; the problem is the different parsers you used. You need to compare apples to apples: a JSON parser will be slower than a simple CSV parser.
COPY FROM STDIN and COPY LOCAL stream data from the client. Running it on the server with just a COPY (no LOCAL or STDIN) will be a direct load straight from the Vertica daemon with no network latency (assuming the file is on local disk and not a NAS).
In addition, regarding your method of reinstantiating the ByteArrayInputStream: wouldn't it be better to turn your jsonList into a single InputStream and pass just that in, instead of creating an input stream for every item?
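A rough sketch of that suggestion, reusing the vstream and jsonList from the question (records separated by newlines; whether fjsonparser handles your concatenated documents as expected is worth verifying):

    StringBuilder sb = new StringBuilder();
    for (JsonNode json : jsonList) {
        sb.append(json.toString()).append('\n');  // one record per line
    }
    InputStream all = new ByteArrayInputStream(sb.toString().getBytes());
    vstream.start();
    vstream.addStream(all);   // a single stream for the whole batch
    vstream.execute();
    vstream.finish();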
If you run the same insert using vsql, it solves the problem.
I need to insert about 700 records (name, id) into SQLite permanently, because the app will get the user's name from the database.
I think reading a text file is one solution, but I don't know whether it is the best one.
Can you show me other options for inserting about 700 records into the database?
Thanks.
The best practice for doing multiple inserts into the database is shown in this video tutorial; you can watch it from 10:15 (Android SQLite3 video tutorial, inserting multiple values into the database the fast way):
https://www.youtube.com/watch?v=dBnOn17pI7c&list=PLGLfVvz_LVvQUjiCc8lUT9aO0GsWA4uNe&index=14
You can use SQLite Browser to view an SQLite database. Insert the data using the browser, and you can use that database permanently.
Or try adding the data to the database using web services.
It really depends on what you want to do and why you want to do it. That being said, text files can work. I had a similar case where I stored a few thousand items into an SQLite database. I used a text file and a CSVReader to parse the text file.
InputStream is = new ByteArrayInputStream(theContent.getBytes());
BufferedReader br = new BufferedReader(new InputStreamReader(is));
CSVReader<String[]> csvReader = new CSVReaderBuilder<String[]>(br)
        .strategy(new CSVStrategy('\t', '\b', '#', true, true))
        .entryParser(new EntryParser())
        .build();
String[] nextLine;
while ((nextLine = csvReader.readNext()) != null) {
    // Do parsing work and store to the SQLite database
}
If you know the data won't change and want the fastest solution, then a text file is sufficient. If the data will change frequently, then you're probably going to want to access a web service to update your data. The speed of this method will be affected by the internet speed of the user.
I have more than 10 million JSON documents of the form:
["key": "val2", "key1" : "val", "{\"key\":\"val", \"key2\":\"val2"}"]
in one file.
Importing with the Java driver API took around 3 hours, using the following function (importing one BSON document at a time):
public static void importJSONFileToDBUsingJavaDriver(String pathToFile, DB db, String collectionName) {
    // open the file
    FileInputStream fstream = null;
    try {
        fstream = new FileInputStream(pathToFile);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        System.out.println("file does not exist, exiting");
        return;
    }
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    // read it line by line
    String strLine;
    DBCollection newColl = db.getCollection(collectionName);
    try {
        while ((strLine = br.readLine()) != null) {
            // convert each line to BSON
            DBObject bson = (DBObject) JSON.parse(strLine);
            // insert the BSON document into the database
            try {
                newColl.insert(bson);
            } catch (MongoException e) {
                // duplicate key
                e.printStackTrace();
            }
        }
        br.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Is there a faster way? Maybe MongoDB settings influence the insertion speed? For example, adding an "_id" key that will function as the index, so that MongoDB does not have to create an artificial key (and thus an index) for each document, or disabling index creation entirely during insertion.
Thanks.
I'm sorry, but you're all picking at minor performance issues instead of the core one. Separating the file-reading logic from the inserting is a small gain. Loading the file in binary mode (via MMAP) is a small gain. Using Mongo's bulk inserts is a big gain, but still no dice.
The whole performance bottleneck is the DBObject bson = (DBObject) JSON.parse(line) line. In other words, the problem with the Java driver is that it needs a conversion from JSON to BSON, and this code seems to be awfully slow or badly implemented. A full JSON round trip (encode + decode) via JSON-simple, or especially via JSON-smart, is 100 times faster than the JSON.parse() command.
I know Stack Overflow is telling me right above this box that I should be answering the question, which I'm not, but rest assured that I'm still looking for an answer to this problem. I can't believe all the talk about Mongo's performance and then this simple example code fails so miserably.
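For the record, a hedged sketch of what swapping the parser could look like with json-smart; the BasicDBObject(Map) constructor is from the legacy Java driver, so check availability against your driver version:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBObject;
    import net.minidev.json.JSONObject;
    import net.minidev.json.JSONValue;

    // Hypothetical replacement for the JSON.parse() call above, assuming json-smart is on the classpath.
    static DBObject parseLine(String line) {
        JSONObject parsed = (JSONObject) JSONValue.parse(line);  // json-smart; JSONObject is a Map<String, Object>
        return new BasicDBObject(parsed);                        // wrap the map instead of going through JSON.parse()
    }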
I've imported a multi-line JSON file with ~250M records. I just used mongoimport < data.txt and it took 10 hours. Compared to your 10M in 3 hours, I think this is considerably faster.
Also, from my experience, writing your own multi-threaded parser would speed things up drastically. The procedure is simple (a rough sketch follows the reminder below):
Open the file as BINARY (not TEXT!)
Set markers (offsets) evenly across the file. The count of markers depends on the number of threads you want.
Search for '\n' near the markers and calibrate the markers so they are aligned to lines.
Parse each chunk with a thread.
A reminder:
When you want performance, don't use a stream reader or any built-in line-based read methods; they are slow. Just use a binary buffer and search for '\n' to identify a line, and (preferably) do in-place parsing in the buffer without creating a string. Otherwise the garbage collector won't be happy.
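A rough, hedged sketch of that procedure (not the answerer's code; the file name, thread count, and per-line handling are placeholders):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    public class ChunkedParser {
        public static void main(String[] args) throws Exception {
            final String path = "data.json";                     // placeholder file name
            final int threads = Runtime.getRuntime().availableProcessors();
            long[] offsets = new long[threads + 1];
            try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                long length = raf.length();
                offsets[threads] = length;
                for (int i = 1; i < threads; i++) {
                    long pos = i * (length / threads);           // evenly spaced marker
                    raf.seek(pos);
                    int b;
                    while (pos < length && (b = raf.read()) != -1 && b != '\n') {
                        pos++;                                   // calibrate the marker to the next line break
                    }
                    offsets[i] = Math.min(pos + 1, length);
                }
            }
            List<Thread> workers = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                final long start = offsets[i];
                final long end = offsets[i + 1];
                Thread t = new Thread(() -> parseChunk(path, start, end));
                workers.add(t);
                t.start();
            }
            for (Thread t : workers) {
                t.join();
            }
        }

        // Each worker reads its byte range as binary and scans for '\n' itself
        // instead of going through a line-based reader.
        static void parseChunk(String path, long start, long end) {
            try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                byte[] buf = new byte[(int) (end - start)];
                raf.seek(start);
                raf.readFully(buf);
                int lineStart = 0;
                for (int i = 0; i < buf.length; i++) {
                    if (buf[i] == '\n') {
                        // parse buf[lineStart..i] in place here, then insert in bulk
                        lineStart = i + 1;
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }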
You can parse the entire file at once and then insert the whole JSON into the Mongo document. Avoid multiple loops; you need to separate the logic as follows:
1) Parse the file and retrieve the JSON object.
2) Once the parsing is over, save the JSON object in the Mongo document.
I've got a slightly faster way (I'm also inserting millions at the moment): insert lists of documents instead of single documents with
insert(List<DBObject> list)
http://api.mongodb.org/java/current/com/mongodb/DBCollection.html#insert(java.util.List)
That said, it's not that much faster. I'm about to experiment with setting WriteConcerns other than ACKNOWLEDGED (mainly UNACKNOWLEDGED) to see if I can speed it up further. See http://docs.mongodb.org/manual/core/write-concern/ for info.
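For illustration, a hedged sketch that combines the list insert with an UNACKNOWLEDGED write concern, using the legacy DBCollection API (the batch size and file handling are placeholders, not tested advice):

    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.WriteConcern;
    import com.mongodb.util.JSON;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class BatchedImport {
        // Reads the file line by line but sends documents to the server in batches.
        public static void importBatched(String path, DBCollection coll) throws IOException {
            final int batchSize = 1000;  // illustrative; tune for your document size
            List<DBObject> batch = new ArrayList<>(batchSize);
            try (BufferedReader br = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = br.readLine()) != null) {
                    batch.add((DBObject) JSON.parse(line));
                    if (batch.size() == batchSize) {
                        coll.insert(batch, WriteConcern.UNACKNOWLEDGED);  // one round trip per batch
                        batch.clear();
                    }
                }
            }
            if (!batch.isEmpty()) {
                coll.insert(batch, WriteConcern.UNACKNOWLEDGED);
            }
        }
    }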
Another way to improve performance is to create indexes after bulk inserting. However, this is rarely an option except for one-off jobs.
Apologies if this is slightly woolly sounding; I'm still testing things myself. Good question.
You can also remove all the indexes (except for the PK index, of course) and rebuild them after the import.
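A minimal sketch with the legacy DBCollection API (the index key is just an example):

    newColl.dropIndexes();                                   // drops secondary indexes; the _id index stays
    // ... run the bulk import ...
    newColl.createIndex(new BasicDBObject("someField", 1));  // rebuild the indexes afterwards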
Use bulk operations for inserts/upserts. Since Mongo 2.6 you can do bulk updates/upserts. The example below does a bulk upsert using the C# driver.
MongoCollection<foo> collection = database.GetCollection<foo>(collectionName);
var bulk = collection.InitializeUnorderedBulkOperation();
foreach (FooDoc fooDoc in fooDocsList)
{
    var update = new UpdateDocument { {fooDoc.ToBsonDocument() } };
    bulk.Find(Query.EQ("_id", fooDoc.Id)).Upsert().UpdateOne(update);
}
BulkWriteResult bwr = bulk.Execute();
You can use a bulk insertion.
You can read the documentation on the MongoDB website, and you can also check this Java example on Stack Overflow.
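In the Java driver (2.6+), the equivalent bulk API looks roughly like this; coll and docs stand in for your DBCollection and your Iterable<DBObject> of documents:

    BulkWriteOperation bulk = coll.initializeUnorderedBulkOperation();
    for (DBObject doc : docs) {
        bulk.insert(doc);         // queued locally, sent in batches on execute()
    }
    BulkWriteResult result = bulk.execute();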
I'm using the following method to fetch data from a web API:
public static String sendRequest(String requestURL, String data)
throws IOException {
URL url = new URL(requestURL + "?" + data);
URLConnection conn = url.openConnection();
conn.setReadTimeout(10000);
BufferedReader in = new BufferedReader(new InputStreamReader(
conn.getInputStream()));
String inputLine;
StringBuilder answerBuilder = new StringBuilder("");
try {
while ((inputLine = in.readLine()) != null)
answerBuilder.append(inputLine);
in.close();
} catch (Exception e) {
}
return answerBuilder.toString();
}
With some requests, this leads to an OutOfMemoryError caused by too small a heap size:
(...)Caused by: java.lang.OutOfMemoryError: (Heap Size=17927KB, Allocated=14191KB, Bitmap Size=2589KB)
at java.lang.AbstractStringBuilder.enlargeBuffer(AbstractStringBuilder.java:95)
at java.lang.AbstractStringBuilder.append0(AbstractStringBuilder.java:132)
at java.lang.StringBuilder.append(StringBuilder.java:272)
at java.io.BufferedReader.readLine(BufferedReader.java:423)
at com.elophant.utils.HTTPUtils.sendRequest(HTTPUtils.java:23)
at (..)
I already swapped from normal String operations like answer += inputLine to a StringBuilder, but this didn't help. How can I solve this problem? Increasing the maximum heap size via export JVM_ARGS="-Xmx1024m -XX:MaxPermSize=256m" isn't an option, as it's an Android app.
Use a file for temporary storage like when a hard drive starts paging because it's out of memory.
One solution would be to persist the content being downloaded to storage.
Depending on what you are downloading, you could parse it while reading and store it in an SQLite database. This would allow you to use a query language to handle the data afterwards, and it is really useful if the file being downloaded is JSON or XML.
For JSON you could get the InputStream as you already do and read the stream with a streaming JSON reader. Every record read from the JSON can be stored in a table (or more tables, depending on how each record is structured). The good thing about this approach is that at the end you don't need any file handling, and you already have your data distributed in tables within your database, ready to be queried.
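A hedged sketch of that idea using Android's streaming android.util.JsonReader; the "records" table, the assumption that the payload is a JSON array of flat objects, and storing every field as text are all simplifications:

    import android.content.ContentValues;
    import android.database.sqlite.SQLiteDatabase;
    import android.util.JsonReader;

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    public static void streamToDatabase(InputStream in, SQLiteDatabase db) throws IOException {
        JsonReader reader = new JsonReader(new InputStreamReader(in, "UTF-8"));
        db.beginTransaction();  // one transaction for the whole batch keeps the inserts fast
        try {
            reader.beginArray();                         // assumes the payload is a JSON array of objects
            while (reader.hasNext()) {
                ContentValues row = new ContentValues();
                reader.beginObject();
                while (reader.hasNext()) {
                    String name = reader.nextName();
                    row.put(name, reader.nextString());  // stores each field as text for simplicity
                }
                reader.endObject();
                db.insert("records", null, row);
            }
            reader.endArray();
            db.setTransactionSuccessful();
        } finally {
            db.endTransaction();
            reader.close();
        }
    }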
You should write the StringBuilder content into a file and clear it from time to time.
If your String is really that large, you will need to store it in a file temporarily and process it in chunks (or handle it in chunks while you receive it).
Not for the faint-hearted, but write your own MyString class that uses a byte for each char: ~50% memory savings! And, consequently, a MyStringBuilder class. This only works if you are dealing with ASCII.
I have an sqback file representing an SQLite db file. I want to extract the data from this sqback file, i.e. the table names and contents, and convert it into CSV format. I want to do this using Java.
** The sqback file will have already been uploaded from the Android device to a PC prior to processing, so I need a solution that is appropriate for running server side.
Does anyone have any leads on how to perform such a task?
If using Android, you can take advantage of the built-in SQLiteDatabase and SQLiteOpenHelper. You'll find all the info you need here.
After parsing everything, you can export to CSV the way you want by using File.
EDIT: So basically what you need to do is parse the bytes by reading them, and that way get access to the content. In some cases you don't even need to convert them to a String, since you may only need the value of the byte (e.g. Offset: 44, Length: 4, Description: the schema format number; supported schema formats are 1, 2, 3, and 4).
You can always check whether your values are correct with any hex editor; even opening the SQLite file with a text editor of any kind would help.
Let's start from scratch. First, reading the file. Two approaches:
a. Read the whole file and parse it afterwards
b. Read and parse the file in blocks (recommended, especially for bigger files)
Both approaches would share most of the following code:
File file = new File("YOUR FILE ROUTE");
int len = 1024 * 512; // 512 KB blocks
FileInputStream fis = null;
try {
    fis = new FileInputStream(file);
} catch (FileNotFoundException e1) {
    e1.printStackTrace();
    return; // nothing to read
}
byte[] b = new byte[len];
int bread;
try {
    while ((bread = fis.read(b, 0, len)) != -1) {
        if (parseBlock(b, bread) == 1) { // your parsing routine; returning 1 stops early
            return;
        }
    }
    fis.close();
} catch (IOException e) {
    e.printStackTrace();
}
The difference is between getting partial blocks and parsing them on the fly (which I guess works for you) and reading the whole thing at once, which would just be
fis.read(b, 0, fis.available());
instead of the while loop.
Ultimately your approach is right, and that is the way to get bytes into a String (new String(b)). Moreover, the first characters are likely to look like weird symbols; if you have a look at the SQLite file format, these are reserved bytes for storing some metadata.
Open the SQLite file with any text editor and check that what you see there matches what comes out of your code.
This website indicates which extra libraries to use and provides examples of how to interact with SQLite files: http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC#Usage
Important things to note:
1) Make sure to load the sqlite-JDBC driver using the current class loader. This is done with the line shown below.
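(If I'm reading the Xerial examples correctly, that line is the following; double-check it against the linked page.)

    Class.forName("org.sqlite.JDBC");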
2) The SQLite file IS a db, even if it's not sitting on a server somewhere. So you still must create a connection to the file to interact with it, and you must open and close the connection as well.
Connection connection = null;
connection = DriverManager.getConnection("jdbc:sqlite:" + path); // path is a String to the sqlite file (sqlite or sqback)
connection.close(); // after you are done with the file
3) Information can be extracted by using SQL to query the file. This returns a processable object of type ResultSet that holds the data pertaining to the query.
Statement statement = connection.createStatement();
statement.setQueryTimeout(30);
ResultSet rs = statement.executeQuery("SELECT * FROM " + tblName);
4) From the ResultSet you can grab data using the getter methods with either the column index or the column header key:
rs.getString("qo")
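Since the original goal was CSV, here is a hedged sketch of dumping one table to a CSV file from such a ResultSet. It reuses the connection and tblName from above, assumes java.sql.* and java.io.PrintWriter imports, uses an illustrative file name, and does naive quoting (no escaping of commas or quotes inside values):

    try (Statement st = connection.createStatement();
         ResultSet rs = st.executeQuery("SELECT * FROM " + tblName);
         PrintWriter out = new PrintWriter(tblName + ".csv")) {
        ResultSetMetaData meta = rs.getMetaData();
        int cols = meta.getColumnCount();
        // header row with the column names
        StringBuilder header = new StringBuilder();
        for (int i = 1; i <= cols; i++) {
            if (i > 1) header.append(',');
            header.append(meta.getColumnName(i));
        }
        out.println(header);
        // one CSV line per row
        while (rs.next()) {
            StringBuilder row = new StringBuilder();
            for (int i = 1; i <= cols; i++) {
                if (i > 1) row.append(',');
                row.append(rs.getString(i));
            }
            out.println(row);
        }
    }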
Hope that helps anyone having the same issue as I was having.