Vespa visitor indexing documents - java

I want to attribute an ID to every document in a Vespa cluster.
But I don't completely understand how visitors work in Vespa.
Can I get a shared field (meaning shared by all instances of my visitor) which I can atomically increment (using some lock) every time I visit a document?
What I tried obviously doesn't work, but you'll see the general idea:
public class MyVisitor extends DocumentProcessor {

    // where should I put this?
    private int document_id;
    private final Lock lock = new ReentrantLock();

    @Override
    public Progress process(Processing processing) {
        Iterator<DocumentOperation> it = processing.getDocumentOperations().iterator();
        while (it.hasNext()) {
            DocumentOperation op = it.next();
            if (op instanceof DocumentPut) {
                Document doc = ((DocumentPut) op).getDocument();
                /*
                 * Remove the PUT operation from the iterator so that it is not indexed back in
                 * the document cluster.
                 */
                it.remove();
                try {
                    lock.lock();
                    try {
                        document_id += 1;
                    } finally {
                        lock.unlock();
                    }
                } catch (StatusRuntimeException | IllegalArgumentException e) {
                    // ignored
                }
            }
        }
        return Progress.DONE;
    }
}
Another idea is to get the number of buckets and the id of the bucket I'm currently dealing with, and to increment using this pattern:
document_id = bucket_id
document_id += bucket_count
which would work (if I can ensure my visitor operates on a single bucket at a time), but I don't know how to get this information from my visitor.

Document processors operate on incoming document writes, so they cannot be applied to the result of visiting (not without a bit more setup, anyway).
What you can do instead is visit all the documents over HTTP/2: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#visit
Then use the same API to issue an update operation for each document to set the field: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#put
Since this is done by a single process, you can have a document_id counter which assigns unique values.
As an aside, a common trick to avoid that requirement entirely is to generate a UUID for each document.
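For illustration, here is a minimal sketch of that visit-then-update approach, assuming a hypothetical namespace mynamespace, document type mydoc, an int field document_id, and Jackson for JSON parsing (none of these names come from the question):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class AssignDocumentIds {

    // Hypothetical endpoint and document type; substitute your own.
    private static final String ENDPOINT = "http://localhost:8080";
    private static final String DOC_PATH = "/document/v1/mynamespace/mydoc/docid";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder().version(HttpClient.Version.HTTP_2).build();
        ObjectMapper mapper = new ObjectMapper();
        long documentId = 0;    // single process drives the visit, so a plain counter is safe
        String continuation = null;
        do {
            String url = ENDPOINT + DOC_PATH
                    + (continuation == null ? "" : "?continuation=" + continuation);
            HttpResponse<String> page = client.send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            JsonNode root = mapper.readTree(page.body());
            for (JsonNode doc : root.path("documents")) {
                // "id" is the full document id, e.g. "id:mynamespace:mydoc::some-local-id"
                String fullId = doc.path("id").asText();
                String localId = fullId.substring(fullId.lastIndexOf("::") + 2);
                String update = "{\"fields\":{\"document_id\":{\"assign\":" + documentId++ + "}}}";
                client.send(HttpRequest.newBuilder(URI.create(ENDPOINT + DOC_PATH + "/" + localId))
                                .header("Content-Type", "application/json")
                                .PUT(HttpRequest.BodyPublishers.ofString(update))
                                .build(),
                        HttpResponse.BodyHandlers.ofString());
            }
            continuation = root.hasNonNull("continuation") ? root.get("continuation").asText() : null;
        } while (continuation != null);
    }
}
On a large cluster, the selection and wantedDocumentCount parameters described in the linked reference can be used to tune the visit.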

Related

How to write large volumes of unique data to Postgres without storing it all in memory

I have a Spring Boot application that generates images. I'm trying to scale it to the point where it can generate an unlimited number of images.
When an image is generated, I create a MurmurHash3 hash of the base64-encoded values of the image; this is then added to an object as a @Lob value. The hash is how I consider images to be unique, and the images are then pushed into Postgres.
So far everything is fine, and this creates ~1,000 images in a few seconds without problem. Where I'm having issues is, say, I want to create 100,000+ images.
When the images are generated there is a pretty good chance of duplicates, so what I thought was a good idea would be to create 'chunks' of images using a HashSet, to hopefully rule out duplicates at least within the specific 'chunk':
public class CreateImages {
    //...

    @EventListener(ApplicationReadyEvent.class)
    public void process() {
        while (repository.count() < 100_000) {
            createChunk();
        }
    }

    private void createChunk() {
        Set<TokenUri> result = new HashSet<>();
        while (result.size() < 1000) {
            final ImageWrapper imageWrapper = svgService.create(-1);
            result.add(TokenUri.builder()
                    .hash(imageWrapper.hash())
                    .data(encodeService.encode(imageWrapper))
                    .build());
        }
        try {
            repository.saveAll(result);
        } catch (Exception e) {
            log.error("Failed to save chunk {}", e.getMessage());
        }
        log.info("Created {} images", repository.count());
    }
}
Not worrying about the time taken to create the images here (it's all single-threaded), this does what I expect: each chunk doesn't contain duplicates, but it more than likely contains duplicates when compared to previously generated chunks.
So to try and solve that, I added a @Column(unique = true) annotation to the hash column being saved, thinking Postgres would reject duplicates but allow 'non-duplicates' to be saved.
What seems to happen, though, is that the batch write fails because the constraint is violated, and it doesn't seem to move past it.
2022-01-02 15:38:56.071 ERROR 19292 --- [ main] o.h.engine.jdbc.spi.SqlExceptionHelper : ERROR: duplicate key value violates unique constraint "uk_90jgw9r7w8bhtgw17fmi79j0w"
Detail: Key (hash)=(1625765490) already exists.
Even when attempting to catch those with a generic Exception, either I'm not handling it correctly or it doesn't do what I expect.
Even this feels rather hacky and not a correct solution.
So, tl;dr: how can I generate an unknown number (assume millions) of unique objects, without keeping them all in memory to check for uniqueness, and safely store them in Postgres?
Is there some standard pattern for this kind of thing?
This could be a possible solution. You save only the hash property in a Set; that way you have unique hashes across chunks, and it is memory-efficient because you are saving only a String and not an object with all its properties. You also need to override hashCode and equals in TokenUri (to use the hash property), because otherwise Set<TokenUri> doesn't work.
public class CreateImages {
    //...

    @EventListener(ApplicationReadyEvent.class)
    public void process() {
        Set<String> hashCodesOfSavedImages = new HashSet<>();
        while (hashCodesOfSavedImages.size() < 100_000) {
            hashCodesOfSavedImages.addAll(createChunkOfImages(hashCodesOfSavedImages, 1000));
        }
    }

    private Set<String> createChunkOfImages(Set<String> hashCodesOfSavedImages, int chunkSize) {
        Set<TokenUri> chunkOfImages = new HashSet<>();
        while (chunkOfImages.size() < chunkSize) {
            final ImageWrapper imageWrapper = svgService.create(-1);
            // O(1) time complexity (contains)
            if (!hashCodesOfSavedImages.contains(imageWrapper.hash())) {
                chunkOfImages.add(TokenUri.builder()
                        .hash(imageWrapper.hash())
                        .data(encodeService.encode(imageWrapper))
                        .build());
            }
        }
        repository.saveAll(chunkOfImages);
        return chunkOfImages.stream().map(TokenUri::hash).collect(Collectors.toSet());
    }
}
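As an alternative that avoids keeping any hashes in memory at all, you could let Postgres itself discard duplicates with INSERT ... ON CONFLICT DO NOTHING. Here is a minimal JDBC sketch, assuming a hypothetical token_uri table with a unique hash column (table name, columns, and connection details are illustrative, not from the question):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ConflictFreeInsert {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and table layout.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/images", "user", "secret")) {
            String sql = "INSERT INTO token_uri (hash, data) VALUES (?, ?) "
                       + "ON CONFLICT (hash) DO NOTHING";   // duplicates are silently skipped
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "1625765490");
                ps.setString(2, "...base64 image data...");
                int inserted = ps.executeUpdate();          // 0 if the hash already existed
                System.out.println(inserted == 1 ? "stored" : "duplicate skipped");
            }
        }
    }
}
With this, a failed uniqueness check never aborts the batch; the database simply reports how many rows were actually inserted.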

Datastore queries in Dataflow DoFn slow down pipeline when run in the cloud

I am trying to enhance data in a pipeline by querying Datastore in a DoFn step.
A field from an object of the class CustomClass is used to do a query against a Datastore table, and the returned values are used to enhance the object.
The code looks like this:
public class EnhanceWithDataStore extends DoFn<CustomClass, CustomClass> {

    private static Datastore datastore = DatastoreOptions.defaultInstance().service();
    private static KeyFactory articleKeyFactory = datastore.newKeyFactory().kind("article");

    @Override
    public void processElement(ProcessContext c) throws Exception {
        CustomClass event = c.element();
        Entity article = datastore.get(articleKeyFactory.newKey(event.getArticleId()));
        String articleName = "";
        try {
            articleName = article.getString("articleName");
        } catch (Exception e) {
            // fall back to the empty name
        }
        CustomClass enhanced = new CustomClass(event);
        enhanced.setArticleName(articleName);
        c.output(enhanced);
    }
}
When it is run locally, this is fast, but when it is run in the cloud, this step slows down the pipeline significantly. What's causing this? Is there any workaround or better way to do this?
A picture of the pipeline can be found here (the last step is the enhancing step):
pipeline architecture
What you are doing here is a join between your input PCollection<CustomClass> and the enhancements in Datastore.
For each partition of your PCollection, the calls to Datastore are going to be single-threaded, hence incur a lot of latency. I would expect this to be slow in the DirectPipelineRunner and InProcessPipelineRunner as well. With autoscaling and dynamic work rebalancing, you should see parallelism when running on the Dataflow service, unless something about the structure of your pipeline causes it to be optimized poorly, so you can try increasing --maxNumWorkers. But you still won't benefit from bulk operations.
It is probably better to express this join within your pipeline, using DatastoreIO.readFrom(...) followed by a CoGroupByKey transform. That way, Dataflow will do a bulk parallel read of all the enhancements and use the efficient GroupByKey machinery to line them up with the events.
// Here are the two collections you want to join
PCollection<CustomClass> events = ...;
PCollection<Entity> articles = DatastoreIO.readFrom(...);

// Key them both by the common id
PCollection<KV<Long, CustomClass>> keyedEvents =
    events.apply(WithKeys.of(event -> event.getArticleId()));
PCollection<KV<Long, Entity>> keyedArticles =
    articles.apply(WithKeys.of(article -> article.getKey().getId()));

// Set up the join by giving tags to each collection
TupleTag<CustomClass> eventTag = new TupleTag<CustomClass>() {};
TupleTag<Entity> articleTag = new TupleTag<Entity>() {};
KeyedPCollectionTuple<Long> coGbkInput =
    KeyedPCollectionTuple
        .of(eventTag, keyedEvents)
        .and(articleTag, keyedArticles);

// A ParDo lets us emit one output per event in each group
PCollection<CustomClass> enhancedEvents = coGbkInput
    .apply(CoGroupByKey.create())
    .apply(ParDo.of(new DoFn<KV<Long, CoGbkResult>, CustomClass>() {
        @Override
        public void processElement(ProcessContext c) {
            CoGbkResult joinResult = c.element().getValue();
            for (CustomClass event : joinResult.getAll(eventTag)) {
                String articleName;
                try {
                    articleName = joinResult.getOnly(articleTag).getString("articleName");
                } catch (Exception e) {
                    articleName = "";
                }
                CustomClass enhanced = new CustomClass(event);
                enhanced.setArticleName(articleName);
                c.output(enhanced);
            }
        }
    }));
Another possibility, if there are few enough articles to store the lookup in memory, is to use DatastoreIO.readFrom(...), read them all as a map side input via View.asMap(), and look them up in a local table.
// Here are the two collections you want to join
PCollection<CustomClass> events = ...;
PCollection<Entity> articles = DatastoreIO.readFrom(...);

// Key the articles and create a map view
PCollectionView<Map<Long, Entity>> articleView = articles
    .apply(WithKeys.of(article -> article.getKey().getId()))
    .apply(View.asMap());

// Do a lookup join by side input to a ParDo
PCollection<CustomClass> enhanced = events
    .apply(ParDo.withSideInputs(articleView).of(new DoFn<CustomClass, CustomClass>() {
        @Override
        public void processElement(ProcessContext c) {
            CustomClass event = c.element();
            Map<Long, Entity> articleLookup = c.sideInput(articleView);
            String articleName;
            try {
                articleName =
                    articleLookup.get(event.getArticleId()).getString("articleName");
            } catch (Exception e) {
                articleName = "";
            }
            CustomClass enhanced = new CustomClass(event);
            enhanced.setArticleName(articleName);
            c.output(enhanced);
        }
    }));
Depending on your data, either of these may be a better choice.
After some checking I managed to pinpoint the problem: the project is located in the EU (and as such, the Datastore is located in the EU zone, same as the App Engine zone), while the Dataflow jobs themselves (and thus the workers) are hosted in the US by default (when the zone option is not overridden).
The difference in performance is 25-30 fold: ~40 elements/s compared to ~1200 elements/s for 15 workers.
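For reference, a minimal sketch of pinning the workers to an EU zone; the option names assume the pre-Beam Dataflow SDK 1.x used here, and the zone value is illustrative:
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class EuZonePipeline {
    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(DataflowPipelineOptions.class);
        options.setZone("europe-west1-d");  // co-locate the workers with the EU Datastore
        Pipeline p = Pipeline.create(options);
        // ... build the pipeline as before ...
        p.run();
    }
}
The same effect can be achieved on the command line with --zone=europe-west1-d.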

XML Remove node without changing other nodes [duplicate]

When processing XML by means of the standard DOM, attribute order is not guaranteed after you serialize back. At least that is what I just realized when using the standard Java XML Transform API to serialize the output.
However, I do need to keep an order. I would like to know if there is any possibility in Java to keep the original order of attributes of an XML file processed by means of the DOM API, or any way to force the order (maybe by using an alternative serialization API that lets you set this kind of property). In my case, processing reduces to altering the value of some attributes (not all) of a sequence of the same elements with a bunch of attributes, and maybe inserting a few more elements.
Is there any "easy" way, or do I have to define my own XSLT transformation stylesheet to specify the output and alter the whole input XML file?
Update: I must thank you all for your answers. The answer seems more obvious now than I expected. I never paid any attention to attribute order, since I had never needed it before.
The main reason to require an attribute order is that the resulting XML file just looks different. The target is a configuration file that holds hundreds of alarms (every alarm is defined by a set of attributes). This file usually has little modification over time, but it is convenient to keep it ordered, since when we need to modify something it is edited by hand. Now and then some projects need light modifications of this file, such as setting one of the attributes to a customer-specific code.
I just developed a little application to merge the original file (common to all projects) with the specific parts of each project (modifying the value of some attributes), so the project-specific file gets the updates of the base one (new alarm definitions or attribute-value bugfixes). My main motivation for requiring ordered attributes is to be able to check the output of the application against the original file by means of a text comparison tool (such as WinMerge). If the format (mainly attribute order) remains the same, the differences can be easily spotted.
I really thought this was possible, since XML handling programs, such as XML Spy, let you edit XML files and apply some ordering (grid mode). Maybe my only choice is to use one of these programs to manually modify the output file.
Sorry to say, but the answer is more subtle than "No you can't" or "Why do you need to do this in the first place?".
The short answer is "DOM will not allow you to do that, but SAX will".
This is because DOM does not care about the attribute order, since it's meaningless as far as the standard is concerned, and by the time the XSL gets hold of the input stream, the info is already lost.
Most XSL engines will actually gracefully preserve the input stream attribute order (e.g. Xalan-C (except in one case) or Xalan-J (always)), especially if you use <xsl:copy*>.
The cases where the attribute order is not kept, to the best of my knowledge, are:
- If the input stream is a DOM
- Xalan-C: if you insert your result-tree tags literally (e.g. <elem att1="{@att1}" .../>)
Here is one example with SAX, for the record (inhibiting DTD nagging as well).
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;

// File input and String COOKER_XSL are defined elsewhere
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setNamespaceAware(true);
spf.setValidating(false);
spf.setFeature("http://xml.org/sax/features/validation", false);
spf.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
spf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

SAXParser sp = spf.newSAXParser();
Source src = new SAXSource(sp.getXMLReader(), new InputSource(input.getAbsolutePath()));

String resultFileName = input.getAbsolutePath().replaceAll(".xml$", ".cooked.xml");
Result result = new StreamResult(new File(resultFileName));

TransformerFactory tf = TransformerFactory.newInstance();
Source xsltSource = new StreamSource(new File(COOKER_XSL));
Transformer xsl = tf.newTransformer(xsltSource);
xsl.setParameter("srcDocumentName", input.getName());
xsl.setParameter("srcDocumentPath", input.getAbsolutePath());
xsl.transform(src, result);
I'd also like to point out, for the benefit of the many naysayers, that there are cases where attribute order does matter.
Regression testing is an obvious case.
Whoever has been called in to optimise not-so-well-written XSL knows that you usually want to make sure that the "new" result trees are similar or identical to the "old" ones. And when the result trees are around one million lines, XML diff tools prove too unwieldy...
In these cases, preserving the attribute order is of great help.
Hope this helps ;-)
Look at section 3.1 of the XML recommendation. It says, "Note that the order of attribute specifications in a start-tag or empty-element tag is not significant."
If a piece of software requires attributes on an XML element to appear in a specific order, that software is not processing XML, it's processing text that looks superficially like XML. It needs to be fixed.
If it can't be fixed, and you have to produce files that conform to its requirements, you can't reliably use standard XML tools to produce those files. For instance, you might try (as you suggest) to use XSLT to produce attributes in a defined order, e.g.:
<test>
<xsl:attribute name="foo"/>
<xsl:attribute name="bar"/>
<xsl:attribute name="baz"/>
</test>
only to find that the XSLT processor emits this:
<test bar="" baz="" foo=""/>
because the DOM that the processor is using orders attributes alphabetically by tag name. (That's common but not universal behavior among XML DOMs.)
But I want to emphasize something. If a piece of software violates the XML recommendation in one respect, it probably violates it in other respects. If it breaks when you feed it attributes in the wrong order, it probably also breaks if you delimit attributes with single quotes, or if the attribute values contain character entities, or any of a dozen other things that the XML recommendation says that an XML document can do that the author of this software probably didn't think about.
XML Canonicalisation results in a consistent attribute ordering, primarily to allow one to check a signature over some or all of the XML, though there are other potential uses. This may suit your purposes.
It's not possible to over-emphasize what Robert Rossney just said, but I'll try. ;-)
The benefit of International Standards is that, when everybody follows them, life is good. All our software gets along peacefully.
XML has to be one of the most important standards we have. It's the basis of "old web" stuff like SOAP, and still 'web 2.0' stuff like RSS and Atom. It's because of clear standards that XML is able to interoperate between different platforms.
If we give up on XML, little by little, we'll get into a situation where a producer of XML will not be able to assume that a consumer of XML will be able to consume their content. This would have a disastrous effect on the industry.
We should push back very forcefully, on anyone who writes code that does not process XML according to the standard. I understand that, in these economic times, there is a reluctance to offend customers and business partners by saying "no". But in this case, I think it's worth it. We would be in much worse financial shape if we had to hand-craft XML for each business partner.
So, don't "enable" companies who do not understand XML. Send them the standard, with the appropriate lines highlighted. They need to stop thinking that XML is just text with angle brackets in it. It simply does not behave like text with angle brackets in it.
It's not like there's an excuse for this. Even the smallest embedded devices can have full-featured XML parser implementations in them. I have not yet heard a good reason for not being able to parse standard XML, even if one can't afford a fully-featured DOM implementation.
I think I can find some valid justifications for caring about attribute order:
You may be expecting humans to have to manually read, diagnose or edit the XML data one time or another; readability would be important in that instance, and a consistent and logical ordering of the attributes helps with that;
You may have to communicate with some tool or service that (admittedly erroneously) cares about the order; asking the provider to correct its code may not be an option: try asking that of a government agency while your users' deadline for electronically delivering a bunch of fiscal documents looms closer and closer!
It seems like Alain Pannetier's solution is the way to go.
Also, you may want to take a look at DecentXML; it gives you full control over how the XML is formatted, even though it's not DOM-compatible. Especially useful if you want to modify some hand-edited XML without losing the formatting.
I had the same exact problem. I wanted to modify XML attributes but wanted to keep the order because of diff. I used StAX to achieve this. You have to use XMLStreamReader and XMLStreamWriter (the Cursor based solution). When you get a START_ELEMENT event type, the cursor keeps the index of the attributes. Hence, you can make appropriate modifications and write them to the output file "in order".
Look at this article/discussion. You can see how to read the attributes of the start elements in order.
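For the record, here is a minimal, namespace-unaware sketch of that cursor-based approach; the file names and the attribute being modified are hypothetical:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class StaxAttributeRewrite {
    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("in.xml"));
        XMLStreamWriter writer = XMLOutputFactory.newInstance()
                .createXMLStreamWriter(new FileOutputStream("out.xml"), "UTF-8");
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    writer.writeStartElement(reader.getLocalName());
                    // attributes are visited in document order, so the order survives
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        String name = reader.getAttributeLocalName(i);
                        String value = reader.getAttributeValue(i);
                        if ("price".equals(name)) {         // hypothetical edit
                            value = "0.00";
                        }
                        writer.writeAttribute(name, value);
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    writer.writeCharacters(reader.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    writer.writeEndElement();
                    break;
                default:
                    break;  // comments, PIs etc. are dropped in this sketch
            }
        }
        writer.flush();
        writer.close();
        reader.close();
    }
}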
You can still do this using the standard DOM and Transformation API by using a quick and dirty solution like the one I am describing:
We know that the Transformation API solution orders the attributes alphabetically. You can prefix the attribute names with some easy-to-strip-later strings so that they will be output in the order you want. Simple prefixes such as "a_", "b_", etc. should suffice in most situations and can easily be stripped from the output XML using a one-liner regex.
If you are loading an XML file and re-saving it and want to preserve attribute order, you can use the same principle: first modify the attribute names in the input XML text, then parse it into a Document object. Again, make this modification based on textual processing of the XML. This can be tricky, but can be done by detecting elements and their attribute strings, again using regex. Note that this is a dirty solution. There are many pitfalls when parsing XML on your own, even for something as simple as this, so be careful if you decide to implement this.
You really shouldn't need to keep any sort of order. As far as I know, no schema takes attribute order into account when validating an XML document either. It sounds like whatever is processing XML on the other end isn't using a proper DOM to parse the results.
I suppose one option would be to manually build up the document using string building, but I strongly recommend against that.
Robert Rossney said it well: if you're relying on the ordering of attributes, you're not really processing XML, but rather, something that looks like XML.
I can think of at least two reasons why you might care about attribute ordering. There may be others, but at least for these two I can suggest alternatives:
You're using multiple instances of attributes with the same name:
<foo myAttribute="a" myAttribute="b" myAttribute="c"/>
This is just plain invalid XML; a DOM processor will probably drop all but one of these values, if it processes the document at all. Instead of this, you want to use child elements:
<foo>
  <myChild value="a"/>
  <myChild value="b"/>
  <myChild value="c"/>
</foo>
You're assuming that some sort of distinction applies to the attribute(s) that come first. Make this explicit, either through other attributes or through child elements. For example:
<foo attr1="a" attr2="b" attr3="c" theMostImportantAttribute="attr1" />
Kind of works...
package mynewpackage;

// for the method
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
// for the test example
import org.xml.sax.InputSource;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;
import org.w3c.dom.Document;
import java.math.BigDecimal;

public class NodeTools {

    /**
     * Method sorts any NodeList by provided attribute.
     * @param nl NodeList to sort
     * @param attributeName attribute name to use
     * @param asc true - ascending, false - descending
     * @param B class must implement Comparable and have Constructor(String) - e.g. Integer.class, BigDecimal.class etc
     * @return sorted nodes as an array
     */
    public static Node[] sortNodes(NodeList nl, String attributeName, boolean asc, Class<? extends Comparable> B) {
        class NodeComparator<T> implements Comparator<T> {
            @Override
            public int compare(T a, T b) {
                int ret;
                Comparable bda = null, bdb = null;
                try {
                    Constructor bc = B.getDeclaredConstructor(String.class);
                    bda = (Comparable) bc.newInstance(((Element) a).getAttribute(attributeName));
                    bdb = (Comparable) bc.newInstance(((Element) b).getAttribute(attributeName));
                } catch (Exception e) {
                    return 0; // yes, ugly, I know :)
                }
                ret = bda.compareTo(bdb);
                return asc ? ret : -ret;
            }
        }

        List<Node> x = new ArrayList<>();
        for (int i = 0; i < nl.getLength(); i++) {
            x.add(nl.item(i));
        }
        Node[] ret = new Node[x.size()];
        ret = x.toArray(ret);
        Arrays.sort(ret, new NodeComparator<Node>());
        return ret;
    }

    public static void main(String... args) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        String s = "<xml><item id=\"1\" price=\"100.00\" /><item id=\"3\" price=\"29.99\" /><item id=\"2\" price=\"5.10\" /></xml>";
        Document doc = null;
        try {
            builder = factory.newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(s)));
        } catch (Exception e) {
            System.out.println("Alarm " + e);
            return;
        }

        System.out.println("*** Sort by id ***");
        Node[] ret = NodeTools.sortNodes(doc.getElementsByTagName("item"), "id", true, Integer.class);
        for (Node n : ret) {
            System.out.println(((Element) n).getAttribute("id") + " : " + ((Element) n).getAttribute("price"));
        }

        System.out.println("*** Sort by price ***");
        ret = NodeTools.sortNodes(doc.getElementsByTagName("item"), "price", true, BigDecimal.class);
        for (Node n : ret) {
            System.out.println(((Element) n).getAttribute("id") + " : " + ((Element) n).getAttribute("price"));
        }
    }
}
In my simple test it prints:
*** Sort by id ***
1 : 100.00
2 : 5.10
3 : 29.99
*** Sort by price ***
2 : 5.10
3 : 29.99
1 : 100.00
Inspired by the answer from Andrey Lebedenko.
Capable of sorting by a node's attribute or by a node's text content.
Ready to be used in your XML utility class.
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Objects;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public static Collection<Node> nodeListCollection(final NodeList nodeList) {
    if (nodeList == null) {
        return Collections.emptyList();
    }
    final int length = nodeList.getLength();
    if (length == 0) {
        return Collections.emptyList();
    }
    return IntStream.range(0, length)
            .mapToObj(nodeList::item)
            .collect(Collectors.toList());
}

private static int compareString(final String str1, final String str2, final boolean nullIsLess) {
    if (Objects.equals(str1, str2)) {
        return 0;
    }
    if (str1 == null) {
        return nullIsLess ? -1 : 1;
    }
    if (str2 == null) {
        return nullIsLess ? 1 : -1;
    }
    return str1.compareTo(str2);
}

private static final Function<Boolean, Comparator<Node>> StringNodeValueComparatorSupplier = (asc) ->
        (Node a, Node b) -> {
            final String va = a == null ? null : a.getTextContent();
            final String vb = b == null ? null : b.getTextContent();
            return (asc ? 1 : -1) * compareString(va, vb, asc);
        };

private static final BiFunction<Boolean, String, Comparator<Node>> StringNodeAttributeComparatorSupplier = (asc, attrName) ->
        (Node a, Node b) -> {
            final String va = a == null ? null : a.hasAttributes() ?
                    ((Element) a).getAttribute(attrName) : null;
            final String vb = b == null ? null : b.hasAttributes() ?
                    ((Element) b).getAttribute(attrName) : null;
            return (asc ? 1 : -1) * compareString(va, vb, asc);
        };

private static <T extends Comparable<T>> Comparator<Node> nodeComparator(
        final boolean asc,
        final boolean useAttr,
        final String attribute,
        final Constructor<T> constructor
) {
    return (Node a, Node b) -> {
        if (a == null && b == null) {
            return 0;
        } else if (a == null) {
            return (asc ? -1 : 1);
        } else if (b == null) {
            return (asc ? 1 : -1);
        }
        T aV;
        try {
            final String aStr;
            if (useAttr) {
                aStr = a.hasAttributes() ? ((Element) a).getAttribute(attribute) : null;
            } else {
                aStr = a.getTextContent();
            }
            aV = aStr == null || aStr.matches("\\s+") ? null : constructor.newInstance(aStr);
        } catch (Exception ignored) {
            aV = null;
        }
        T bV;
        try {
            final String bStr;
            if (useAttr) {
                bStr = b.hasAttributes() ? ((Element) b).getAttribute(attribute) : null;
            } else {
                bStr = b.getTextContent();
            }
            bV = bStr == null || bStr.matches("\\s+") ? null : constructor.newInstance(bStr);
        } catch (Exception ignored) {
            bV = null;
        }
        final int ret;
        if (aV == null && bV == null) {
            ret = 0;
        } else if (aV == null) {
            ret = -1;
        } else if (bV == null) {
            ret = 1;
        } else {
            ret = aV.compareTo(bV);
        }
        return (asc ? 1 : -1) * ret;
    };
}

/**
 * Method to sort any NodeList by an attribute all nodes must have. <br>If the attribute is absent for a single
 * {@link Node} or the {@link NodeList} does contain elements without attributes, null is used instead. <br>If
 * <code>asc</code> is <code>true</code>, nulls first, else nulls last.
 *
 * @param nodeList The {@link NodeList} containing all {@link Node}s to sort.
 * @param attribute Name of the attribute to extract and compare
 * @param asc <code>true</code>: ascending, <code>false</code>: descending
 * @param compareType Optional class to use for comparison. Must implement {@link Comparable} and have a Constructor
 * that takes a single {@link String} argument. If <code>null</code> is supplied, {@link String} is used.
 * @return A collection of the {@link Node}s passed as {@link NodeList}
 * @throws RuntimeException If <code>compareType</code> does not have a constructor taking a single {@link String}
 * argument. Also, if the comparator created does violate the {@link Comparator} contract, an
 * {@link IllegalArgumentException} is raised.
 * @implNote Exceptions during calls of the single String argument constructor of <code>compareType</code> are
 * ignored. Values are substituted by <code>null</code>
 */
public static <T extends Comparable<T>> Collection<Node> sortNodesByAttribute(
        final NodeList nodeList,
        String attribute,
        boolean asc,
        Class<T> compareType) {
    final Comparator<Node> nodeComparator;
    if (compareType == null) {
        nodeComparator = StringNodeAttributeComparatorSupplier.apply(asc, attribute);
    } else {
        final Constructor<T> constructor;
        try {
            constructor = compareType.getDeclaredConstructor(String.class);
        } catch (NoSuchMethodException e) {
            throw new RuntimeException(
                    "Cannot compare Node Attribute '" + attribute + "' using the Type '" + compareType.getName()
                            + "': No Constructor available that takes a single String argument.", e);
        }
        nodeComparator = nodeComparator(asc, true, attribute, constructor);
    }
    final List<Node> nodes = new ArrayList<>(nodeListCollection(nodeList));
    nodes.sort(nodeComparator);
    return nodes;
}

/**
 * Method to sort any NodeList by their text content using an optional type. <br>If
 * <code>asc</code> is <code>true</code>, nulls first, else nulls last.
 *
 * @param nodeList The {@link NodeList} containing all {@link Node}s to sort.
 * @param asc <code>true</code>: ascending, <code>false</code>: descending
 * @param compareType Optional class to use for comparison. Must implement {@link Comparable} and have a Constructor
 * that takes a single {@link String} argument. If <code>null</code> is supplied, {@link String} is used.
 * @return A collection of the {@link Node}s passed as {@link NodeList}
 * @throws RuntimeException If <code>compareType</code> does not have a constructor taking a single {@link String}
 * argument. Also, if the comparator created does violate the {@link Comparator} contract, an
 * {@link IllegalArgumentException} is raised.
 * @implNote Exceptions during calls of the single String argument constructor of <code>compareType</code> are
 * ignored. Values are substituted by <code>null</code>
 */
public static <T extends Comparable<T>> Collection<Node> sortNodes(
        final NodeList nodeList,
        boolean asc,
        Class<T> compareType) {
    final Comparator<Node> nodeComparator;
    if (compareType == null) {
        nodeComparator = StringNodeValueComparatorSupplier.apply(asc);
    } else {
        final Constructor<T> constructor;
        try {
            constructor = compareType.getDeclaredConstructor(String.class);
        } catch (NoSuchMethodException e) {
            throw new RuntimeException(
                    "Cannot compare Nodes using the Type '" + compareType.getName()
                            + "': No Constructor available that takes a single String argument.", e);
        }
        nodeComparator = nodeComparator(asc, false, null, constructor);
    }
    final List<Node> nodes = new ArrayList<>(nodeListCollection(nodeList));
    nodes.sort(nodeComparator);
    return nodes;
}
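For illustration, a hypothetical usage of the utility above, assuming a parsed Document doc like in the earlier NodeTools example:
NodeList items = doc.getElementsByTagName("item");
Collection<Node> byPrice = sortNodesByAttribute(items, "price", true, BigDecimal.class);
for (Node n : byPrice) {
    System.out.println(((Element) n).getAttribute("id") + " : " + ((Element) n).getAttribute("price"));
}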
I had quite a similar problem. I needed to always have the same attribute first.
Example :
<h50row a="1" xidx="1" c="1"></h50row>
<h50row a="2" b="2" xidx="2"></h50row>
must become
<h50row xidx="1" a="1" c="1"></h50row>
<h50row xidx="2" a="2" b="2"></h50row>
I found a solution with a regex:
test = "<h50row a=\"1\" xidx=\"1\" c=\"1\"></h50row>";
test = test.replaceAll("(<h5.*row)(.*)(.xidx=\"\\w*\")([^>]*)(>)", "$1$3$2$4$5");
Hope you find this useful.

Data commit issue in multithreading

I am new to Java and Hibernate.
I have implemented a functionality where I generate request numbers based on the already saved request number. This is done by finding the maximum request number, incrementing it by 1, and then saving it back to the database.
However, I am facing issues with multithreading. When two threads access my code at the same time, both generate the same request number. My code is already synchronized. Please suggest some solution.
synchronized (this.getClass()) {
    System.out.println("start");
    certRequest.setRequestNbr(generateRequestNumber(certInsuranceRequestAddRq.getAccountInfo().getAccountNumberId()));
    reqId = Utils.getUniqueId();
    certRequest.setRequestId(reqId);
    ItemIdInfo itemIdInfo = new ItemIdInfo();
    itemIdInfo.setInsurerId(certRequest.getRequestId());
    certRequest.setItemIdInfo(itemIdInfo);
    dao.insert(certRequest);
    addAccountRel();
    System.out.println("end");
}
Following is the output showing my synchronization:
start
end
start
end
Is it some Hibernate issue?
Does the use of the transactional attribute in Spring affect the commit in my case?
I am using the following transactional attribute:
@Transactional(readOnly = false, propagation = Propagation.REQUIRED, rollbackFor = Exception.class)
EDIT: code for generateRequestNumber() shown in chat room.
public String generateRequestNumber(String accNumber) throws Exception {
    String requestNumber = null;
    if (accNumber != null) {
        String SQL_QUERY = "select CERTREQUEST.requestNbr from CertRequest as CERTREQUEST, "
                + "CertActObjRel as certActObjRel where certActObjRel.certificateObjkeyId=CERTREQUEST.requestId "
                + " and certActObjRel.certObjTypeCd=:certObjTypeCd "
                + " and certActObjRel.certAccountId=:accNumber ";
        String[] parameterNames = {"certObjTypeCd", "accNumber"};
        Object[] parameterValues = new Object[]
                {
                    Constants.REQUEST_RELATION_CODE, accNumber
                };
        List<?> resultSet = dao.executeNamedQuery(SQL_QUERY,
                parameterNames, parameterValues);
        // List<?> resultSet = dao.retrieveTableData(SQL_QUERY);
        if (resultSet != null && resultSet.size() > 0) {
            requestNumber = (String) resultSet.get(0);
        }
        int maxRequestNumber = -1;
        if (requestNumber != null && requestNumber.length() > 0) {
            maxRequestNumber = maxValue(resultSet.toArray());
            requestNumber = Integer.toString(maxRequestNumber + 1);
        } else {
            requestNumber = Integer.toString(1);
        }
        System.out.println("inside function request number " + requestNumber);
        return requestNumber;
    }
    return null;
}
Don't synchronize on the Class instance obtained via getClass(). It can have some strange side effects. See https://www.securecoding.cert.org/confluence/pages/viewpage.action?pageId=43647087
For example use:
synchronized (this) {
    // synchronized code
}
or
private synchronized void myMethod() {
    // synchronized code
}
to synchronize on the object instance.
Or do:
private static final Object lock = new Object();

private void myMethod() {
    synchronized (lock) {
        // synchronized code
    }
}
Like @diwakar suggested. This uses a constant field to synchronize on, to guarantee that this code is always synchronizing on the same lock.
EDIT: Based on information from chat, you are using a SELECT to get the maximum requestNumber and increasing the value in your code. Then this value is set on the CertRequest, which is then persisted in the database via a DAO. If this persist action is not committed (e.g. by making the method @Transactional or some other means), then another thread will still see the old requestNumber value. So you could solve this by making the code transactional (how depends on which frameworks you use etc.). But I agree with @VA31's answer, which states that you should use a database sequence for this instead of incrementing the value in code. Instead of a sequence you could also consider using an auto-increment field in CertRequest, something like:
@GeneratedValue(strategy = GenerationType.AUTO)
private int requestNumber;
For getting the next value from a sequence you can look at this question.
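For illustration, a minimal sketch of letting the database assign the number through a sequence; the sequence name and the mapping of requestNbr as the generated value are assumptions, not taken from the question:
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.SequenceGenerator;

@Entity
public class CertRequest {

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "certReqSeq")
    @SequenceGenerator(name = "certReqSeq", sequenceName = "cert_request_seq", allocationSize = 1)
    private Long requestNbr;    // assigned by the database, safe under concurrent inserts

    // ...
}
Because the database hands out sequence values atomically, no synchronized block is needed in the service or DAO layer.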
You mentioned this information in your question.
I have implemented a functionality where I generate request numbers based on the already saved request number. This is done by finding the maximum request number, incrementing it by 1, and then saving it back to the database.
On first look, it seems the problem is caused by running the code on multiple appservers. Threads are synchronized inside one JVM (appserver). If you are using more than one appserver, then you have to do it differently, using a more robust approach such as server-to-server communication or batch allocation of request numbers to each appserver.
But if you are using only one appserver and multiple threads are accessing the same code, then you can put a lock on the instance of the class rather than the class itself.
synchronized (this) {
    lastName = name;
    nameCount++;
}
Or you can use a lock private to the class instance:
private Object lock = new Object();
// ...
synchronized (lock) {
    System.out.println("start");
    certRequest.setRequestNbr(generateRequestNumber(certInsuranceRequestAddRq.getAccountInfo().getAccountNumberId()));
    reqId = Utils.getUniqueId();
    certRequest.setRequestId(reqId);
    ItemIdInfo itemIdInfo = new ItemIdInfo();
    itemIdInfo.setInsurerId(certRequest.getRequestId());
    certRequest.setItemIdInfo(itemIdInfo);
    dao.insert(certRequest);
    addAccountRel();
    System.out.println("end");
}
But make sure that your DB is updated with the new sequence number before the next thread accesses it to get a new one.
It is good practice to generate the request number (unique id) using a database sequence, so that you don't need to synchronize your Service/DAO methods.
First thing: why are you taking the lock inside the method? It is not required here.
Also, one thing: can you try it like this once:
final static Object lock = new Object();

synchronized (lock) {
    // .....
}
What I feel is that the object you are locking on is different in each case, so try this once.

Spring Batch Processor

I have a requirement in Spring Batch where I have a file with thousands of records coming in sorted order. The key field is product code.
The file may have multiple records with the same product code. The requirement is that I have to group the records that have the same product code into a collection (i.e. a List) and then send them over to a method, i.e. validateProductCodes(List prodCodeList).
I am looking for the best way to do this. The approach I thought of was to read every record in the processor and build up a collection of records with the same product code there. If at any point in the processor the product code in a record is different, it would imply that the grouping for the previous product code is complete and validateProductCodes() can be called for that group of records. Also, I am using a Step. Does that automatically mean the process is multithreaded, i.e. that groups of records with the same product code will be processed in a multithreaded way? Please advise.
Thanks
There are two questions in your question: first, you want to know how to group the items together, and second, how they are processed.
In order to group them, you could create a group reader as Luca suggested or something like:
import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.support.SingleItemPeekableItemReader;
import org.springframework.beans.factory.InitializingBean;
import org.springframework.util.Assert;

public class GroupReader<I> implements ItemReader<List<I>>, InitializingBean {

    private enum State { NEW, READING, COMPLETE }

    private SingleItemPeekableItemReader<I> reader;
    private ItemReader<I> peekReaderDelegate;

    public void setReader(ItemReader<I> reader) {
        peekReaderDelegate = reader;
    }

    @Override
    public void afterPropertiesSet() throws Exception {
        Assert.notNull(peekReaderDelegate, "The 'itemReader' may not be null");
        this.reader = new SingleItemPeekableItemReader<I>();
        this.reader.setDelegate(peekReaderDelegate);
    }

    @Override
    public List<I> read() throws Exception {
        State state = State.NEW;
        List<I> group = null;
        I item = null;
        while (state != State.COMPLETE) {
            item = reader.read();
            switch (state) {
                case NEW: {
                    if (item == null) {
                        // end reached
                        state = State.COMPLETE;
                        break;
                    }
                    group = new ArrayList<I>();
                    group.add(item);
                    state = State.READING;
                    I nextItem = reader.peek();
                    if (isItAKeyChange(item, nextItem)) {
                        state = State.COMPLETE;
                    }
                    break;
                }
                case READING: {
                    group.add(item);
                    // peek and check whether the peeked entry has a new key
                    I nextItem = reader.peek();
                    if (isItAKeyChange(item, nextItem)) {
                        state = State.COMPLETE;
                    }
                    break;
                }
                default: {
                    throw new IllegalStateException("ParsingError: Reader is in an invalid state");
                }
            }
        }
        return group;
    }

    private boolean isItAKeyChange(I item, I nextItem) {
        // Placeholder: compare the grouping key (e.g. the product code) of both
        // items here; nextItem == null means the end of the input was reached.
        return nextItem == null || !item.equals(nextItem);
    }
}
For every key, this reader will return a list with all elements matching that key. The grouping is therefore done directly in the reader.
You cannot do that with a processor, as you described.
Your second question is about multithreading.
Using a step does not necessarily mean that the step is processed with several threads.
In order to do that, you need to set an AsyncTaskExecutor, and you have to set the throttle limit.
But if you do that, your reader must be thread-safe, or otherwise your grouping won't work. You could do that by simply defining the read method above as synchronized.
Another way could be to write a small SynchronizedWrapperReader, as suggested in this question: Parellel Processing Spring Batch StaxEventItemReader
Please note, depending on the target you are writing to, you probably also have to synchronize the writer, and if necessary reorder the result.
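A minimal sketch of such a synchronizing wrapper (the class name and constructor are illustrative, not taken from the linked question):
import org.springframework.batch.item.ItemReader;

// Wraps any ItemReader so that concurrent step threads cannot interleave reads.
public class SynchronizedWrapperReader<T> implements ItemReader<T> {

    private final ItemReader<T> delegate;

    public SynchronizedWrapperReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() throws Exception {
        return delegate.read();
    }
}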
