Concurrent inserts into a DB - Java

I made a parser based on Jsoup. This parser handles a page with pagination; each page contains, for example, 100 links to be parsed. I created a main loop that walks the pagination, and I need to run async tasks to parse each of the 100 items on every page. As I understand it, Jsoup does not support asynchronous request handling. After handling each item I need to save it to the DB, and I want to avoid errors during the insert into the DB's table (if threads use the same id for different items at the same time, if that is even possible). What would you suggest?
Could I use a simple Thread instance to parse each item:
public class ItemParser extends Thread {

    private String url;
    private MySpringDataJpaRepository repo;

    public ItemParser(String url, MySpringDataJpaRepository repoReference) {
        this.url = url;
        this.repo = repoReference;
    }

    @Override
    public void run() {
        final MyItem item = jsoupParseItem();
        repo.save(item);
    }
}
And run this like:
public class Parser {

    @Autowired
    private MySpringDataJpaRepository repoReference; // <-- SINGLETON

    public static void main(String[] args) {
        int pages = 10000;
        for (int i = 0; i < pages; i++) {
            Document currentPage = Jsoup.parse();
            List<String> links = currentPage.extractLinks(); // contains 100 links to be parsed on each for-loop iteration
            links.forEach(link -> new ItemParser(link, repoReference).start());
        }
    }
}
I know this code does not compile; I just want to show you my idea.
Or maybe it's better to use Spring Batch?
What is the best practice for solving this?
What do you think?

If you use row-level locking, you should be fine. It might save trouble to make each insert its own transaction, but this has implications given the whole notion of a transaction as a unit of work (i.e. if a single insert fails, do you want the whole run to fail and roll back?).
Also, if you use UUIDs or db-generated ids, you won't have any collision issues.
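For example, a db-generated key means no two threads can ever pick the same id. A minimal sketch, assuming JPA annotations on the MyItem entity from the question:

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;

@Entity
public class MyItem {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY) // the database assigns the id on insert
    private Long id;

    // ... the parsed fields go here
}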
As to how to structure the code, I'd look at using a Runnable for each task and a thread pool executor. With too many threads, the system loses efficiency just trying to manage them all. I notice you're using Spring, so take a look at https://docs.spring.io/spring/docs/current/spring-framework-reference/html/scheduling.html
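To illustrate, here is a rough sketch of that structure. It reuses the names from your snippet; the pool size is an arbitrary assumption, and fetchLinksForPage() / jsoupParseItem() are placeholders standing in for your pagination and parsing logic:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Parser {

    @Autowired
    private MySpringDataJpaRepository repoReference;

    public void parseAll() throws InterruptedException {
        // a bounded pool keeps the thread count manageable
        ExecutorService pool = Executors.newFixedThreadPool(16);
        int pages = 10000;
        for (int i = 0; i < pages; i++) {
            List<String> links = fetchLinksForPage(i); // your Jsoup pagination logic
            for (String link : links) {
                pool.submit(() -> {
                    MyItem item = jsoupParseItem(link); // your Jsoup item parsing
                    repoReference.save(item);           // the singleton repository is safe to share
                });
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the outstanding tasks
    }
}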

Related

Handling additional data in Apache ServiceComb compensation methods

I'm currently looking at implementations of the saga pattern for distributed transactions, and I found that Apache ServiceComb Pack might be something that works for me.
However, I have found a problem: the requirement that compensation methods have the same declaration as the methods they compensate may be a bottleneck.
From Apache's example:
@Compensable(compensationMethod = "cancel")
void order(CarBooking booking) {
    booking.confirm();
    bookings.put(booking.getId(), booking);
}

void cancel(CarBooking booking) {
    Integer id = booking.getId();
    if (bookings.containsKey(id)) {
        bookings.get(id).cancel();
    }
}
You can see that we have the same declaration for both methods.
But what if I need additional information to compensate my transaction? For instance, I have a call to an external system that updates some flag to "true". When I need to compensate it, how do I make the "cancel" method know what the original value of this flag was?
Things get more tricky when we update a whole object. How do I send the object as it was before the modification to the cancel transaction?
This limitation doesn't look very promising. Do you know of any approaches to work around it?
You can save the localTxId and the flag in your application, and use the localTxId in the compensation method to get the flag back:
Map<String, Object> extMap = new HashMap<>();

@Autowired
OmegaContext omegaContext;

@Compensable(compensationMethod = "cancel")
void order(CarBooking booking) {
    booking.confirm();
    bookings.put(booking.getId(), booking);
    // save the flag under the current local transaction id
    extMap.put(omegaContext.localTxId(), "your flag");
}

void cancel(CarBooking booking) {
    // get the flag back using the same local transaction id
    Object flag = extMap.get(omegaContext.localTxId());
    Integer id = booking.getId();
    if (bookings.containsKey(id)) {
        bookings.get(id).cancel();
    }
}

How to shard a Set?

Could you help me with one thing? Imagine I have a simple RESTful microservice with one GET method which simply responds with a random String.
I collect all the strings in a concurrent Set<String> (ConcurrentHashMap.newKeySet()) that holds all the answers.
There is a sloppy implementation below; the main point is that the Set<String> is thread-safe and can be modified concurrently.
@RestController
public class Controller {

    private final StringService stringService;
    private final CacheService cacheService;

    public Controller(final StringService stringService, final CacheService cacheService) {
        this.stringService = stringService;
        this.cacheService = cacheService;
    }

    @GetMapping
    public String get() {
        final String str = stringService.random();
        cacheService.add(str);
        return str;
    }
}

public class CacheService {

    private final Set<String> set = ConcurrentHashMap.newKeySet();

    public void add(final String str) {
        set.add(str);
    }
}
While you are reading this line, my endpoint is being used by 1 billion people.
I want to shard the cache. Since my system is heavily loaded, I can't hold all the strings on one server. I want to have 256 servers/instances and distribute my cache uniformly, using a str.hashCode() % 256 function to determine on which server/instance a string should be kept.
Could you tell me what I should do next?
Assume that currently I only have a Spring Boot application running locally.
You should check out Hazelcast. It is open source and has proved useful for me in a case where I wanted to share data among multiple instances of my application. The in-memory data grid provided by Hazelcast might be just the thing you are looking for.
I agree with Vicky, this is what Hazelcast is made for. It's a single jar, a couple of lines of code, and instead of a HashMap you have an IMap, an extension of ConcurrentMap, and you're good to go. All the distribution, sharding, concurrency, etc. is done for you. Check out:
https://docs.hazelcast.org/docs/3.11.1/manual/html-single/index.html#map
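To give an idea, a minimal sketch (my own assumption of how this could look with the Hazelcast 3.x API linked above; the "strings" set name is arbitrary):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ISet;

public class CacheService {

    // each instance started this way joins the same cluster;
    // the set's entries are partitioned across the members automatically
    private final ISet<String> set;

    public CacheService() {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        this.set = hz.getSet("strings");
    }

    public void add(final String str) {
        set.add(str);
    }
}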
Try the code below. But it is a poor approach: within one instance you are better off caching your data in a single Map, and if you need a distributed application, try a distributed cache service like Redis.
class CacheService {

    /**
     * Assumes read operations are more frequent than write operations.
     */
    private static final List<Set<String>> sets = new CopyOnWriteArrayList<>();

    static {
        for (int i = 0; i < 256; i++) {
            sets.add(ConcurrentHashMap.newKeySet());
        }
    }

    public void add(final String str) {
        // floorMod avoids a negative index, since hashCode() can be negative
        int insertIndex = Math.floorMod(str.hashCode(), 256);
        sets.get(insertIndex).add(str);
    }
}

Unit testing ElasticSearch search result converter

In our project I have written a small class which takes the result of an ElasticSearch query containing a named aggregation and returns information about each of the buckets in the result in a neutral format, suitable for passing on to our UI.
public class AggsToSimpleChartBasicConverter {

    private SearchResponse searchResponse;
    private String aggregationName;

    private static final Logger logger = LoggerFactory.getLogger(AggsToSimpleChartBasicConverter.class);

    public AggsToSimpleChartBasicConverter(SearchResponse searchResponse, String aggregationName) {
        this.searchResponse = searchResponse;
        this.aggregationName = aggregationName;
    }

    public void setChartData(SimpleChartData chart,
                             BucketExtractors.BucketNameExtractor keyExtractor,
                             BucketExtractors.BucketValueExtractor valueExtractor) {
        Aggregations aggregations = searchResponse.getAggregations();
        Terms termsAggregation = aggregations.get(aggregationName);
        if (termsAggregation != null) {
            for (Terms.Bucket bucket : termsAggregation.getBuckets()) {
                chart.add(keyExtractor.extractKey(bucket),
                        Long.parseLong(valueExtractor.extractValue(bucket).toString()));
            }
        } else {
            logger.warn("Aggregation " + aggregationName + " could not be found");
        }
    }
}
I want to write a unit test for this class by calling setChartData() and performing some assertions against the object passed in, since the mechanics of it are reasonably simple. However in order to do so I need to construct an instance of org.elasticsearch.action.search.SearchResponse containing some test data, which is required by my class's constructor.
I looked at implementing a solution similar to this existing question, but the process for adding aggregation data to the result is more involved and requires the use of private internal classes which would likely change in a future version, even if I could get it to work initially.
I reviewed the ElasticSearch docs on unit testing and there is a mention of a class org.elasticsearch.test.ESTestCase.java (source) but there is no guidance on how to use this class and I'm not convinced it is intended for this scenario.
How can I easily unit test this class in a manner which is not likely to break in future ES releases?
Note, I do not want to have to start up an instance of ElasticSearch, embedded or otherwise since that is overkill for this simple unit test and would significantly slow down the execution.
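One possible direction (my suggestion, not an official ES testing recipe) is to stub only the pieces the converter actually touches, since setChartData() just calls getAggregations(), get(aggregationName) and getBuckets(). A sketch with Mockito, assuming these classes are not final in your ES version; the "myAgg" name and the single stubbed bucket are arbitrary:

import static org.mockito.Mockito.*;

import java.util.Collections;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.aggregations.Aggregations;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;

// inside a test method:
SearchResponse searchResponse = mock(SearchResponse.class);
Aggregations aggregations = mock(Aggregations.class);
Terms terms = mock(Terms.class);
Terms.Bucket bucket = mock(Terms.Bucket.class);

when(searchResponse.getAggregations()).thenReturn(aggregations);
when(aggregations.get("myAgg")).thenReturn(terms);
// doReturn sidesteps generics trouble with the wildcard list returned by getBuckets()
doReturn(Collections.singletonList(bucket)).when(terms).getBuckets();

AggsToSimpleChartBasicConverter converter =
        new AggsToSimpleChartBasicConverter(searchResponse, "myAgg");
// then call converter.setChartData(...) with extractors reading from bucket,
// and assert against the SimpleChartData instance passed in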

Should I abstract the service layer on the client side, and if so, how?

The thing is that I am using Hibernate on the server side and basically send "raw" database data to the client, which is fine I guess, but it also means that my client gets a List<UpcomingEventDTO> when calling the corresponding service, which is just a list of events from one specified date to another.
If I now want to split those events into a map whose keys map to lists of events of one day, e.g. a Map<Integer, List<UpcomingEventDTO>>, then I have to do this on the client side. That wouldn't bother me if I didn't have to do it in my presenter.
On the one hand I have the loading in my presenter:
private void loadUpcomingEvents(final Integer calendarWeekOffset) {
    new XsrfRequest<StoreServletAsync, List<UpcomingEventDTO>>(this.storeServlet) {
        @Override
        protected void onCall(AsyncCallback<List<UpcomingEventDTO>> asyncCallback) {
            storeServlet.getUpcomingEventsForCalendarWeek(storeId, calendarWeekOffset, asyncCallback);
        }
        @Override
        protected void onFailure(Throwable caught) {
        }
        @Override
        protected void onSuccess(List<UpcomingEventDTO> result) {
            upcomingEvents = result;
            presentUpcomingEvents();
        }
    }.request();
}
and the conversion of the data before I can present it:
private void presentUpcomingEvents() {
    Map<Integer, List<UpcomingEventDTO>> dayToUpcomingEvent = new HashMap<>();
    for (UpcomingEventDTO upcomingEvent : this.upcomingEvents) {
        @SuppressWarnings("deprecation")
        Integer day = upcomingEvent.getDate().getDay();
        List<UpcomingEventDTO> upcomingEvents = dayToUpcomingEvent.get(day);
        if (upcomingEvents == null) {
            upcomingEvents = new ArrayList<>();
        }
        upcomingEvents.add(upcomingEvent);
        dayToUpcomingEvent.put(day, upcomingEvents);
    }
    List<Integer> days = new ArrayList<Integer>(dayToUpcomingEvent.keySet());
    Collections.sort(days);
    this.calendarWeekView.removeUpcomingEvent();
    for (Integer day : days) {
        CalendarDayPresenterImpl eventCalendarDayPresenter = this.dayToEventCalendarDayPresenter.get(day);
        if (eventCalendarDayPresenter == null) {
            List<UpcomingEventDTO> upcomingEvents = dayToUpcomingEvent.get(day);
            eventCalendarDayPresenter = new CalendarDayPresenterImpl(upcomingEvents);
            this.dayToEventCalendarDayPresenter.put(day, eventCalendarDayPresenter);
        }
        this.calendarWeekView.appendEventCalendarDay(eventCalendarDayPresenter.getView());
    }
}
So my problem is basically that I am not really happy having code like this in my presenter, but on the other hand I don't know how and where else to provide the data in this "upgraded" form for my presenter(s).
One could argue that I could just return the data from the server in the shape I need it, but then I would lose generality, and I don't want to write for every view and presenter its "own" API to the database.
Another possibility would be to introduce another layer between the service/servlet layer and my presenter's model, something like a DAO or database layer. But this also raises quite a lot of questions for me, e.g. what the name of such a layer would be ^^ and whether that layer would provide "customized" data for presenters or keep the data somewhat generalized.
I'm having quite a huge issue figuring out what to do here, so I hope I can benefit from someone's experience.
Thanks a lot for any help here!
The presentation logic should be on the server side, in the controller layer, which is meant to prepare the view for the clients (MVC pattern).
And if many views want to use this, you can make an abstract controller that can be reused for other views.
It is also good to prepare your controller layer for future requirements. Ask yourself whether another client might ask to present the data at a different granularity, say showing the upcoming events by month or by hour. You could give your API a granularity enum, e.g. UPCOMING_EVENTS_DAY_GRANULARITY(DAY, MONTH, HOUR), as a method parameter, so that the client decides what it wants (as sketched below).
And to make it more beautiful, you could also rename/move the controller layer into a web-service layer, which can be considered your future API for external systems (not only for your views but for anyone outside your system).
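For instance, the grouping the question currently does client-side could live in that server-side layer. A rough sketch, assuming Java 8+; the enum, class, and method names are made up for illustration:

import java.util.Calendar;
import java.util.Date;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

enum Granularity { HOUR, DAY, MONTH }

public class UpcomingEventGrouping {

    // maps an event date to the grouping key for the requested granularity
    private static int keyFor(Date date, Granularity granularity) {
        Calendar c = Calendar.getInstance();
        c.setTime(date);
        switch (granularity) {
            case HOUR:  return c.get(Calendar.HOUR_OF_DAY);
            case MONTH: return c.get(Calendar.MONTH);
            default:    return c.get(Calendar.DAY_OF_WEEK);
        }
    }

    // the client passes the granularity it wants; the server returns the grouped view
    public Map<Integer, List<UpcomingEventDTO>> groupByGranularity(
            List<UpcomingEventDTO> events, Granularity granularity) {
        return events.stream()
                .collect(Collectors.groupingBy(e -> keyFor(e.getDate(), granularity)));
    }
}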

How do I synchronize to prevent a java.util.ConcurrentModificationException

I have a program consisting of a number of classes. I have a problem with the interaction of two of them: WebDataCache and Client. The problem classes are listed below.
WebData:
This is just a data class representing some data retrieved from the internet.
WebService:
This is just a web service wrapper class which connects to a particular web service, reads some data and stores it in an object of type WebData.
WebDataCache:
This is a class which uses the WebService class to retrieve data that's cached in a map, keyed by the id fields of the data.
Client:
This is a class which contains a reference to an instance of the WebDataCache class and uses the cached data.
The problem is (as illustrated below) that while the Client class is looping over the cached data, the WebDataCache may update the underlying collection.
My question is: how do I synchronize access to the cache?
I don't want to synchronize the whole cache, as there are multiple instances of the Client class, each instantiated with a unique id (i.e. new Client(0, ...), new Client(1, ...), new Client(2, ...), etc.), and each instance is only interested in data keyed by the id it was instantiated with.
Are there any relevant design patterns I can use?
class WebData {
    private final int id;
    private final long id2;

    public WebData(int id, long id2) {
        this.id = id;
        this.id2 = id2;
    }

    public int getId() { return this.id; }
    public long getId2() { return this.id2; }
}

class WebService {
    Collection<WebData> getData(int id) {
        Collection<WebData> a = new ArrayList<WebData>();
        // populate a with data from a web service
        return a;
    }
}
class WebDataCache implements Runnable {
    private Map<Integer, Map<Long, WebData>> cache =
            new HashMap<Integer, Map<Long, WebData>>();
    private Collection<Integer> requests =
            new ArrayList<Integer>();

    @Override
    public void run() {
        WebService webSvc = new WebService();
        // get data from some web service
        while (true) {
            for (int id : requests) {
                Collection<WebData> webData = webSvc.getData(id);
                Map<Long, WebData> row = cache.get(id);
                if (row == null)
                    row = cache.put(id, new HashMap<Long, WebData>());
                else
                    row.clear();
                for (WebData webDataItem : webData) {
                    row.put(webDataItem.getId2(), webDataItem);
                }
            }
            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public synchronized Collection<WebData> getData(int id) {
        return cache.get(id).values();
    }

    public synchronized void requestData(int id) {
        requests.add(id);
    }
}
class Client implements Runnable {
    private final WebDataCache cache;
    private final int id;

    public Client(int id, WebDataCache cache) {
        this.id = id;
        this.cache = cache;
    }

    @Override
    public void run() {
        cache.requestData(id);
        while (true) {
            for (WebData item : cache.getData(id)) {
                // java.util.ConcurrentModificationException is thrown here...
                // I understand that the collection is probably being modified in WebDataCache::run().
                // My question: what's the best way to synchronize this code snippet?
            }
        }
    }
}
Thanks!
Use java.util.concurrent.ConcurrentHashMap instead of plain old java.util.HashMap. From the Javadoc:

A hash table supporting full concurrency of retrievals and adjustable expected concurrency for updates. This class obeys the same functional specification as Hashtable, and includes versions of methods corresponding to each method of Hashtable. However, even though all operations are thread-safe, retrieval operations do not entail locking, and there is not any support for locking the entire table in a way that prevents all access. This class is fully interoperable with Hashtable in programs that rely on its thread safety but not on its synchronization details.

http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ConcurrentHashMap.html
So you would replace:
private Map<Integer, Map<Long, WebData>> cache = new HashMap<Integer, Map<Long, WebData>>();
with:
private Map<Integer, Map<Long, WebData>> cache = new ConcurrentHashMap<Integer, Map<Long, WebData>>();
My best recommendation is to use an existing cache implementation such as JCS or EhCache - these are battle tested implementations.
Otherwise, you have a couple of things going on in your code. Things that can break in funny ways.
HashMap can get stuck in infinite loops when modified concurrently by multiple threads. So don't. Use java.util.concurrent.ConcurrentHashMap instead.
The ArrayList you use for WebDataCache.requests isn't thread-safe either, and you have inconsistent synchronization: either change it to a safer list implementation from java.util.concurrent or make sure all access to it synchronizes on the same lock.
Lastly, have your code checked with FindBugs and/or properly reviewed by someone with solid knowledge and experience on writing multi-threaded code.
If you want to read a book on this stuff, I can recommend Java Concurrency in Practice by Brian Goetz.
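Putting those recommendations together, here is a minimal sketch of the cache built on the concurrent collections mentioned above (my own sketch, assuming Java 8+; computeIfAbsent also sidesteps the fact that Map.put() returns the previous value rather than the new map):

import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

class WebDataCache implements Runnable {
    private final Map<Integer, Map<Long, WebData>> cache = new ConcurrentHashMap<>();
    private final List<Integer> requests = new CopyOnWriteArrayList<>();

    @Override
    public void run() {
        WebService webSvc = new WebService();
        while (true) {
            for (int id : requests) {
                // creates the row once, atomically, and reuses it on later refreshes
                Map<Long, WebData> row = cache.computeIfAbsent(id, k -> new ConcurrentHashMap<>());
                row.clear();
                for (WebData item : webSvc.getData(id)) {
                    row.put(item.getId2(), item);
                }
            }
            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public Collection<WebData> getData(int id) {
        Map<Long, WebData> row = cache.get(id);
        // iterators over ConcurrentHashMap views are weakly consistent: they never throw
        // ConcurrentModificationException, but a reader may see a mix of old and new entries
        return row == null ? Collections.<WebData>emptyList() : row.values();
    }

    public void requestData(int id) {
        requests.add(id);
    }
}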
In addition to the other posted recommendations, consider how often the cache is updated versus just being read. If the reading dominates and updating is rare, and it's not critical that the reading loop be able to see every update immediately, consider using a CopyOnWriteArraySet. It and its sibling CopyOnWriteArrayList allow concurrent reading and updating of the members; the reader sees a consistent snapshot unaffected by any mutation of the underlying collection -- analogous to the SERIALIZABLE isolation level in a relational database.
The problem here, though, is that neither of these two structures give you your dictionary or associative array storage (a la Map) out of the box. You'd have to define a composite structure to store the key and value together, and, given that CopyOnWriteArraySet uses Object#equals() for membership testing, you'd have to write an unconventional key-based equals() method for your structure.
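A tiny sketch of such a composite entry (a hypothetical class, not from the thread; equals() and hashCode() deliberately look only at the key, which is what makes them unconventional):

final class Entry {
    final Long key;      // e.g. getId2() of the WebData
    final WebData value;

    Entry(Long key, WebData value) {
        this.key = key;
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        // membership in the CopyOnWriteArraySet is decided by the key alone
        return o instanceof Entry && key.equals(((Entry) o).key);
    }

    @Override
    public int hashCode() {
        return key.hashCode();
    }
}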
The answer from LES2 is good, except that you would have to replace:
row = cache.put(id, new HashMap<Long, WebData>());
with:
row = cache.put(id, new ConcurrentHashMap<Long, WebData>());
since that is the map holding the "problematic" collection, not the whole cache. (Note, by the way, that Map.put() returns the previous value for the key, which is null here, so either way the new map needs to be created first, put into the cache, and then used as row.)
You can synchronize on the row returned by the cache, since that is what ultimately holds the collection being shared.
In WebDataCache:
Map<Long, WebData> row = cache.get(id);
if (row == null) {
    row = cache.put(id, new HashMap<Long, WebData>());
} else synchronized (row) {
    row.clear();
}
for (WebData webDataItem : webData) synchronized (row) {
    row.put(webDataItem.getId2(), webDataItem);
}

// it doesn't make sense to synchronize the whole cache here
public Collection<WebData> getData(int id) {
    return cache.get(id).values();
}
In Client:
Collection<WebData> data = cache.getData(id);
synchronized (data) {
    for (WebData item : data) {
        // ...
    }
}
Of course this is far from perfect; it just answers the question of what to synchronize. In this case that is the access to the underlying collection: row.clear() and row.put() in the cache, and the iteration in the client.
BTW, why do you have a Map in the cache but use a Collection in the client? You should use the same structure on both sides and not expose the underlying implementation.
