I have an entity implemented by a POJO, let's call it TestEntity, that is saved in a collection in MongoDB. I generate instances of this entity by iterating over a CSV file and inserting a new record, or updating an existing one if certain conditions are met. Below is a sample of my code:
// File is parsed into an entity
MongoClient client = new MongoClient("127.0.0.1", 27017);
Mapper w2sMap = new Mapper();
w2sMap.addMappedClass(TestEntity.class);
morphia = new Morphia();
Datastore datastore = morphia.createDatastore(client, w2sMap, "test");
// SaveTask implements Runnable
@Override
public void run() {
    try {
        datastore.save(testEntity);
    } catch (DuplicateKeyException ex) {
        if (updateConditions) {
            // Query Mongo to find the entity, change some fields and save
        } else {
            // Log
        }
    }
}
The SaveTask is run by an ExecutorService on multiple threads, each saving one instance of TestEntity. However, some of the CSV files are enormous and many save operations have to be performed, which is slow (750k saves take roughly 10 minutes on my machine). One solution is to collect the TestEntity instances in an Iterable collection and fire a save task on the whole collection every now and then, thus reducing network round trips and increasing performance (my actual MongoDB runs on a different server).
However, the problem arises when I try to use bulk inserts, because I still get exceptions when a duplicate key is found. I know about new InsertOptions().continueOnError(true), but I need to be able to upsert documents in one go, and that option simply ignores the duplicated entries.
Is there any way around this, or do I need to catch the exceptions, find out which objects were NOT saved and keep going from there? I can provide a better code example if requested.
You have a couple of options to try:
Save with a write concern of 0 ("don't wait for acknowledgement from the server"), i.e. datastore.save(testEntity, WriteConcern.UNACKNOWLEDGED);. This is fast but will ignore errors because it doesn't wait for a server acknowledgement.
You already considered saving multiple objects in one go with save(...), but you could try that WriteConcern there too.
You can use the Java driver's bulk writes to get more control, e.g.
DBCollection coll = db.getCollection("user");
BulkWriteOperation bulk = coll.initializeUnorderedBulkOperation();
bulk.find(new BasicDBObject("z", 1)).upsert().update(new BasicDBObject("$inc", new BasicDBObject("y", -1)));
bulk.find(new BasicDBObject("z", 1)).upsert().update(new BasicDBObject("$inc", new BasicDBObject("y", -1)));
bulk.execute();
Your example code doesn't show how you are using upserts, but in general you should always catch exceptions and handle them yourself.
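For example, with an unordered bulk insert the duplicates surface as a single BulkWriteException after all the other documents have been written, and each error carries the index of the offending input document. Here is a rough, untested sketch using the same legacy com.mongodb classes as above; it works at the driver level with DBObjects rather than through Morphia, so you would map your TestEntity instances first (e.g. via morphia.toDBObject(entity), if that fits your setup):

import com.mongodb.BulkWriteError;
import com.mongodb.BulkWriteException;
import com.mongodb.BulkWriteOperation;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import java.util.List;

public class BulkInsertWithFallback {

    // Insert everything in one unordered bulk; documents with duplicate keys fail
    // individually, and getIndex() tells you which input document they correspond to,
    // so you can fall back to your conditional update for just those.
    static void insertAll(DBCollection coll, List<DBObject> docs) {
        BulkWriteOperation bulk = coll.initializeUnorderedBulkOperation();
        for (DBObject doc : docs) {
            bulk.insert(doc);
        }
        try {
            bulk.execute();
        } catch (BulkWriteException e) {
            for (BulkWriteError error : e.getWriteErrors()) {
                if (error.getCode() == 11000) { // duplicate key
                    DBObject duplicate = docs.get(error.getIndex());
                    // apply your "updateConditions" logic to 'duplicate' here
                }
            }
        }
    }
}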
I have a Spring application that runs a cron job. Every few minutes the cron fetches new data from an external API. The data should be stored in a database (MySQL) in place of the old data (the old data should be overwritten by the new data). The data needs to be overwritten rather than updated. The application itself exposes a REST API so clients can read the data from the database, so there should never be a moment when a client sees empty or partial data because an update is in progress.
Currently I've tried deleting the whole old data set and inserting the new data, but there is a window in which a client can get only part of the data. I've tried it via Spring Data's deleteAll and saveAll methods.
@Override
@Transactional
public List<Country> overrideAll(@NonNull Iterable<Country> countries) {
    removeAllAndFlush();
    List<CountryEntity> countriesToCreate = stream(countries.spliterator(), false)
            .map(CountryEntity::from)
            .collect(toList());
    List<CountryEntity> createdCountries = repository.saveAll(countriesToCreate);
    return createdCountries.stream()
            .map(CountryEntity::toCountry)
            .collect(toList());
}

private void removeAllAndFlush() {
    repository.deleteAll();
    repository.flush();
}
I also thought about having a temporary table that gets the new data, and when the data is complete, just replacing the main table with the temporary table. Is that a good idea? Any other ideas?
It's a good idea. You can minimize the downtime by loading the new data into another table and then switching tables quickly by renaming. This also improves perceived performance for users, because no records need to be locked the way they are with UPDATE/DELETE.
In MySQL, you can use RENAME TABLE if you don't have triggers on the table. It allows renaming multiple tables at once and it works atomically (i.e. like a transaction - if any error happens, no change is made). For example:
RENAME TABLE countries TO countries_old, countries_new TO countries;
DROP TABLE countries_old;
Refer here for more details: https://dev.mysql.com/doc/refman/5.7/en/rename-table.html
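On the Java side, a rough sketch of that staged swap with Spring's JdbcTemplate could look like the following; the countries_new staging table, the (code, name) columns and the Country getters are assumptions for illustration:

import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;

public class CountryRefresher {

    // Hedged sketch: fill a staging table, then swap it in with one atomic RENAME TABLE,
    // so readers either see the complete old data or the complete new data.
    public void refreshCountries(JdbcTemplate jdbc, List<Country> countries) {
        jdbc.execute("DROP TABLE IF EXISTS countries_new");
        jdbc.execute("CREATE TABLE countries_new LIKE countries");
        jdbc.batchUpdate("INSERT INTO countries_new (code, name) VALUES (?, ?)",
                countries, 500,
                (ps, country) -> {
                    ps.setString(1, country.getCode()); // hypothetical getters
                    ps.setString(2, country.getName());
                });
        // Readers keep seeing the old table until this single statement swaps them.
        jdbc.execute("RENAME TABLE countries TO countries_old, countries_new TO countries");
        jdbc.execute("DROP TABLE countries_old");
    }
}

RENAME TABLE takes its own metadata locks, so a read that arrives mid-swap simply waits for the rename to finish rather than seeing partial data.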
I am trying to write points to InfluxDB using their Java client.
Batching is important to me.
If I use influxDB.enableBatch together with influxDB.write(Point), no data is inserted.
If I use BatchPoints and influxDB.write(batchPoints), data is inserted successfully.
Both code samples are taken from: https://github.com/influxdata/influxdb-java/tree/influxdb-java-2.7
InfluxDB influxDB = InfluxDBFactory.connect(influxUrl, influxUser, influxPassword);
influxDB.setDatabase(dbName);
influxDB.setRetentionPolicy("autogen");
// Flush every 2000 Points, at least every 100ms
influxDB.enableBatch(2000, 100, TimeUnit.MILLISECONDS);
influxDB.write(Point.measurement("cpu")
.time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
.addField("idle", 90L)
.addField("user", 9L)
.addField("system", 1L)
.build());
Query query = new Query("SELECT idle FROM cpu", dbName);
QueryResult result = influxDB.query(query);
Returns nothing.
BatchPoints batchPoints = BatchPoints.database(dbName).tag("async", "true").build();
Point point1 = Point
.measurement("cpu")
.tag("atag", "test")
.addField("idle", 90L)
.addField("usertime", 9L)
.addField("system", 1L)
.build();
batchPoints.point(point1);
influxDB.write(batchPoints);
Query query = new Query("SELECT * FROM cpu ", dbName);
QueryResult result = influxDB.query(query);
This returns data successfully.
As mentioned, I need the first way to function.
How can I achieve that?
versions:
influxdb-1.3.6
influxdb-java:2.7
Regards, Ido
Maybe it's too late or you have already resolved your issue, but I will answer anyway; it may be useful for others.
I think your first example is not working because you enabled the batch functionality, which will "Flush every 2000 Points, at least every 100ms". So it is actually working, but you are running the select before the actual write has been flushed.
When you use the influxDB.enableBatch(...) functionality, the influxdb-java client creates an internal thread pool that writes your data only after enough points have been collected or the timeout has expired, so the write does not happen immediately.
In the second example, when you use influxDB.write(batchPoints), the client writes your data to InfluxDB synchronously. That's why your select statement is able to return the data immediately.
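So the first example should also return data once the batch has actually been flushed. A rough way to check that (just a sketch continuing the first code block from the question; the sleep length is only chosen to exceed the 100 ms flush interval configured there):

influxDB.enableBatch(2000, 100, TimeUnit.MILLISECONDS);
influxDB.write(Point.measurement("cpu")
        .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
        .addField("idle", 90L)
        .build());

// Give the background batch thread time to flush
// (it flushes every 100 ms or every 2000 points, whichever comes first).
Thread.sleep(500);

// Alternatively (an assumption about your client version): disabling batching
// flushes any pending points before returning.
// influxDB.disableBatch();

QueryResult result = influxDB.query(new Query("SELECT idle FROM cpu", dbName));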
In my application there are 2 threads:
crawl the web-sites and insert the data into MongoDB
retrieve the crawled sites and perform business logic
In order to retrieve the crawled sites I use the following query:
Document query = new Document("fetchStatus", new Document("$lte", fetchStatusParam));
FindIterable<Document> unfetchedEpisodes = dbC_Episodes.find(query);
As a result I get all episodes whose fetchStatus is less than or equal to the specified value.
In the next step I store the items of the result set in a HashMap<String, TrackedEpisode>, which is an object property, in order to track them:
for (Document document : unfetchedEpisodes) {
    this.trackedEpisodes.put(document.get("_id").toString(), new TrackedEpisode(document));
}
Then I do some business logic, which:
doesn't modify the unfetchedEpisodes result set.
doesn't remove any object from trackedEpisodes.
Up till now everything is OK.
In the last step, I go over all retrieved documents and mark them as fetched in order to prevent duplicate fetching in the future.
for (Document document : unfetchedEpisodes) {
    if (this.trackedEpisodes.containsKey(document.get("_id").toString())) {
        // prevent repeated fetching
        document.put("fetchStatus", FetchStatus.IN_PROCESS.getID());
        if (this.trackedEpisodes.get(document.get("_id").toString()).isExpired()) {
            document.put("isExpired", true);
            document.put("fetchStatus", FetchStatus.FETCHED.getID());
        }
    } else {
        System.out.println("BOO! Strange new object detected");
    }
    dbC_Episodes.updateOne(new Document("_id", document.get("_id")), new Document("$set", document));
}
I ran this code for a couple of days and noticed that sometimes it reaches the else branch of the if (this.trackedEpisodes.containsKey()) statement. That is strange to me: how can unfetchedEpisodes and trackedEpisodes get out of sync and not contain the same items?
I began to investigate and noticed that whenever I reach "BOO! Strange new object detected", the document iterator contains an item that exists in the database but should not yet be in unfetchedEpisodes, since I did not execute a new query against the database.
I checked several times that the retrieved items were stored in trackedEpisodes, and every element of unfetchedEpisodes had indeed been added to trackedEpisodes, yet afterwards I still sometimes arrive at "BOO! Strange new object detected".
My question:
Why does unfetchedEpisodes get new items after the query has been executed?
Is it possible that unfetchedEpisodes is modified by the MongoDB driver after the call to find()?
Should I maybe call some kind of .close() after executing the query against MongoDB?
The used versions:
MongoDB: 3.2.3, x64
MongoDB Java Driver: mongodb-driver-3.2.2, mongodb-driver-core-3.2.2, bson-3.2.2
When you call find here:
FindIterable<Document> unfetchedEpisodes = dbC_Episodes.find(query);
you are not actually getting all the episodes back. You are getting a database cursor pointing to the matched documents.
Then when you call:
for (Document document : unfetchedEpisodes) { ... }
an iterator is created over all of the documents that match the query.
When you call it a second time, a new cursor is returned, for the same query, and all of the documents that match now are iterated over.
If the collection has changed in between, the results will be different.
If you want to ensure that the contents of unfetchedEpisodes are unchanged, one option is to pull the entire result set into memory and iterate over it there rather than on the DB, e.g.
ArrayList<Document> unfetchedEpisodes = dbC_Episodes.find(query).into(new ArrayList<Document>());
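With the documents pulled into memory up front, both of your loops then iterate over exactly the same snapshot, roughly like this (a sketch based on the code in the question):

// Snapshot the matching documents once; documents inserted by the crawler thread
// afterwards will not show up in this list.
List<Document> unfetchedEpisodes = dbC_Episodes.find(query).into(new ArrayList<Document>());

// First pass: everything that was matched at query time gets tracked.
for (Document document : unfetchedEpisodes) {
    this.trackedEpisodes.put(document.get("_id").toString(), new TrackedEpisode(document));
}

// ... business logic ...

// Second pass: iterates the same in-memory list, so the "BOO!" branch can no longer
// be triggered by documents that were inserted in between.
for (Document document : unfetchedEpisodes) {
    document.put("fetchStatus", FetchStatus.IN_PROCESS.getID());
    dbC_Episodes.updateOne(new Document("_id", document.get("_id")), new Document("$set", document));
}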
I'm having a problem with MongoDB in Java when I try to add documents with a customized _id field. When I insert a new document into the collection, I want to ignore the document if its _id already exists.
In the Mongo shell, collection.save() can be used in this case, but I cannot find the equivalent method in the MongoDB Java driver.
Just to add an example:
I have a collection of documents containing websites' information, with the URL as the _id field (which is unique).
I want to add some more documents. Some of them might already exist in the current collection, so I want to keep adding all the new documents except for the duplicate ones.
This can be achieved with collection.save() in the Mongo shell, but I can't find the equivalent method in the MongoDB Java driver.
Hopefully someone can share the solution. Thanks in advance!
In the MongoDB Java driver, you could try using a BulkWriteOperation obtained from the initializeUnorderedBulkOperation() method of the DBCollection object (the one that contains your collection). It is used as follows:
MongoClient mongo = new MongoClient("localhost", port_number);
DB db = mongo.getDB("db_name");
DBCollection col = db.getCollection("collection_name");

ArrayList<DBObject> objectList = new ArrayList<>(); // Fill this list with your objects to insert

BulkWriteOperation operation = col.initializeUnorderedBulkOperation();
for (int i = 0; i < objectList.size(); i++) {
    operation.insert(objectList.get(i));
}
BulkWriteResult result = operation.execute();
With this method, each document is inserted with individual error handling, so documents that have a duplicate _id will raise an error as usual, but since the bulk operation is unordered it still continues with the rest of the documents. At the end, you can use the getInsertedCount() method of the BulkWriteResult object to find out how many documents were really inserted.
This can prove to be a bit inefficient if lots of data is inserted this way, though. This is just sample code (originally found on journaldev.com and edited to fit your situation); you may need to adapt it to your configuration, and it is untested.
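If you would rather avoid the duplicate-key errors altogether, an alternative is to queue upserts with $setOnInsert instead of plain inserts, so existing _ids are left untouched and only new documents are written. A rough, untested sketch along those lines, reusing col and objectList from above:

BulkWriteOperation operation = col.initializeUnorderedBulkOperation();
for (DBObject doc : objectList) {
    // copy the document without its _id; the _id is already part of the filter
    BasicDBObject fieldsOnInsert = new BasicDBObject(doc.toMap());
    fieldsOnInsert.removeField("_id");
    operation.find(new BasicDBObject("_id", doc.get("_id")))
             .upsert()
             .updateOne(new BasicDBObject("$setOnInsert", fieldsOnInsert));
}
BulkWriteResult result = operation.execute();
// result.getUpserts().size() is the number of documents that were actually new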
I guess save is doing something like this:

fun save(doc: Document, col: MongoCollection<Document>) {
    if (doc.get("_id") == null) {
        doc.put("_id", ObjectId()) // generate an id when the document has none
    }
    // upsert: insert when the _id is new, replace the existing document otherwise
    col.replaceOne(Document("_id", doc.get("_id")), doc, ReplaceOptions().upsert(true))
}
Maybe they removed save so you decide how to generate the new id.
I have a table with approximately 62,000,000 rows and I need to select data from it and export it to a .txt or .csv file.
My query limits the result to approximately 60,000 rows.
When I run the query on my development machine, it consumes all the memory and I get a java.lang.OutOfMemoryError.
At the moment I use Hibernate for the DAO, but I can switch to a pure JDBC solution if you recommend it.
My pseudo-code is:
List<Map> list = myDao.getMyData(param); // program crashes here
initFile();
for (Map map : list) {
    util.append(map); // this transforms a row into a line of the file
}
closeFile();
Any suggestions on how I should write my file?
Note: I use .setResultTransformer(Transformers.ALIAS_TO_ENTITY_MAP); to get a Map instead of an entity.
You could use Hibernate's ScrollableResults. See the documentation here: http://docs.jboss.org/hibernate/orm/4.3/manual/en-US/html/ch11.html#objectstate-querying-executing-scrolling
This uses server-side cursors, if your database engine / JDBC driver supports them. For this to work, be sure to set the following properties:
query.setReadOnly(true);
query.setCacheable(false);

ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
while (results.next()) {
    SomeEntity entity = (SomeEntity) results.get()[0];
}
results.close();
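Combined with the export loop from the question, it could look roughly like this. This is only a sketch: it assumes an open Hibernate Session called session and the question's HQL string, reuses the question's initFile/util.append/closeFile helpers, and works with the raw Object[] row because whether the ALIAS_TO_ENTITY_MAP transformer is applied to scrolled rows depends on the Hibernate version:

Query query = session.createQuery(myHql)
        .setReadOnly(true)
        .setCacheable(false)
        .setFetchSize(1000); // hint for the JDBC driver, if it honours it

initFile();
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
try {
    int count = 0;
    while (results.next()) {
        Object[] row = results.get(); // one row of the projection
        util.append(row);             // hypothetical: adapt your row-to-file code to take a row
        if (++count % 1000 == 0) {
            session.clear();          // keep Hibernate's first-level cache from growing
        }
    }
} finally {
    results.close();
    closeFile();
}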
Lock the table, then perform subset selections and exports, appending to the results file. Ensure you unconditionally unlock the table when done.
It's not nice, but the task will run to completion even on servers or clients with limited resources.
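For completeness, a rough JDBC sketch of that approach, assuming MySQL-style LOCK TABLES and LIMIT/OFFSET syntax; the table name, column names and page size are made up for illustration:

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class LockedExport {

    // Hedged sketch: lock the table for reading, export it page by page, always unlock.
    static void export(String jdbcUrl, String user, String pass) throws Exception {
        try (Connection con = DriverManager.getConnection(jdbcUrl, user, pass);
             Statement lock = con.createStatement();
             BufferedWriter out = Files.newBufferedWriter(Paths.get("export.csv"))) {
            lock.execute("LOCK TABLES my_table READ"); // keep the data stable during the export
            try {
                int pageSize = 10_000;
                for (int offset = 0; ; offset += pageSize) {
                    int rows = 0;
                    try (PreparedStatement ps = con.prepareStatement(
                            "SELECT col_a, col_b FROM my_table LIMIT ? OFFSET ?")) {
                        ps.setInt(1, pageSize);
                        ps.setInt(2, offset);
                        try (ResultSet rs = ps.executeQuery()) {
                            while (rs.next()) {
                                out.write(rs.getString("col_a") + "," + rs.getString("col_b"));
                                out.newLine();
                                rows++;
                            }
                        }
                    }
                    if (rows < pageSize) {
                        break; // last page reached
                    }
                }
            } finally {
                lock.execute("UNLOCK TABLES"); // unconditionally unlock, as recommended
            }
        }
    }
}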