Modify XML File with SAX - Java

I'm currently trying to generate a set of models (specified via XML). In order to achieve this, I need to change a single attribute inside the file and save it under a new file name.
The XML File looks like this:
(...)
<place id="P19" initialMarking="0" invariant="< inf" markingOffsetX="0.0" markingOffsetY="0.0" name="P19" nameOffsetX="-5.0" nameOffsetY="35.0" positionX="615.0" positionY="375.0"/>
<place id="P20" initialMarking="0" invariant="< inf" markingOffsetX="0.0" markingOffsetY="0.0" name="P20" nameOffsetX="-5.0" nameOffsetY="35.0" positionX="375.0" positionY="225.0"/>
(...)
What needs changing is the value of initialMarking to values from 2 through 999.
Here is what I have so far:
This is where I get the list of files to change and pass them to the parser
public void parse(String dir) {
    getFiles(dir);
    try {
        XMLReader xmlReader = XMLReaderFactory.createXMLReader();
        for (int i = 0; i < fileList.length; i++) {
            FileReader reader = new FileReader(fileList[i]);
            InputSource inputSource = new InputSource(reader);
            xmlReader.setContentHandler(new ModelContentHandler());
            xmlReader.parse(inputSource);
        }
(...)
This is where I'm searching for the element I need to change:
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
    if (localName.equals("place") && atts.getValue(0).equals("P14") && atts.getValue(1).equals("2")) {
        System.out.println("Initial Marking of " + atts.getValue(0) + " is: " + atts.getValue(1) + "\n");
        while (currentTokens <= Configuration.MAX_TOKENS) {
            System.out.println("Setting initial Tokens to: " + currentTokens);
        }
    }
}
Now, instead of printing out "Setting...", I'd like to change the corresponding value and save the whole file under some new name like "Model_X_Y_Token.xml".
Seems like a fairly simple thing to do, but I've never used SAX before and, looking at the JavaDoc, I can't even find a place to start.
Maybe someone can point me in the right direction?
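
Since you're already using SAX, one SAX-native route is an XMLFilter feeding an identity Transformer: the filter rewrites the attribute as the events stream through, and the transformer serializes the result to a new file. A minimal, untested sketch (the class name, file names, and marking value are illustrative; loop it over 2 through 999 to produce your model set):
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

public class MarkingRewriter extends XMLFilterImpl {
    private final String placeId;
    private final String newMarking;

    public MarkingRewriter(XMLReader parent, String placeId, String newMarking) {
        super(parent);
        this.placeId = placeId;
        this.newMarking = newMarking;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        // Rewrite initialMarking on the matching place element as it streams past
        if (localName.equals("place") && placeId.equals(atts.getValue("id"))) {
            AttributesImpl modified = new AttributesImpl(atts);
            int idx = modified.getIndex("initialMarking");
            if (idx >= 0) {
                modified.setValue(idx, newMarking);
            }
            atts = modified;
        }
        super.startElement(uri, localName, qName, atts);
    }

    public static void main(String[] args) throws Exception {
        XMLReader filter = new MarkingRewriter(XMLReaderFactory.createXMLReader(), "P14", "2");
        // The identity transform replays the filtered SAX events into a new file
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(new SAXSource(filter, new InputSource("Model.xml")),
                new StreamResult(new File("Model_X_Y_Token.xml")));
    }
}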

One of the best approaches here is to use dom4j. I don't exactly get the big picture of what you're trying to do, but I understand the result you want to get. Note that you will also need jaxen for this.
Step 1: read each file into an XML document.
for (int i = 0; i < fileList.length; i++) {
    Document doc = new SAXReader().read(fileList[i]);
}
Step 2: select the elements you need. For this you need to know a bit of XPath. //place will fetch all the place elements; //place[@id="P14"] will fetch only the matching place element.
Element place14 = (Element) doc.selectSingleNode("//place[@id='P14' and @initialMarking='2']");
Step 3: change the attribute's value on the element.
place14.attribute("attributeName").setValue("attributeValue");
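Step 4: write the modified document out under a new file name. A minimal sketch using dom4j's XMLWriter (the output name is just the pattern from the question):
// uses org.dom4j.io.OutputFormat and org.dom4j.io.XMLWriter
OutputFormat format = OutputFormat.createPrettyPrint();
XMLWriter writer = new XMLWriter(new FileWriter("Model_X_Y_Token.xml"), format);
writer.write(doc);
writer.close();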

The most efficient way is with vtd-xml, as it is the only API that supports what it calls incremental update...
import com.ximpleware.*;

public class ChangeAttrVal {
    public static void main(String s[]) throws VTDException, java.io.UnsupportedEncodingException, java.io.IOException {
        VTDGen vg = new VTDGen();
        if (!vg.parseFile("input.xml", false))
            return;
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        XMLModifier xm = new XMLModifier(vn);
        // select the initialMarking attribute of the matching place element
        ap.selectXPath("/*/place[@id=\"P14\" and @initialMarking=\"2\"]/@initialMarking");
        int i = 0;
        while ((i = ap.evalXPath()) != -1) {
            // evalXPath returns the attribute-name token index; i + 1 is the value token
            xm.updateToken(i + 1, "499"); // change initial marking from 2 to 499
        }
        xm.output("new.xml"); // output to a new document called new.xml
    }
}

Related

Stop bullet numbers being updated automatically when merging Word docs using docx4j

I am trying to merge 2 docx files, each of which has its own bullet numbering; after merging the Word docs, the bullets are automatically renumbered.
E.g:
Doc A has 1 2 3
Doc B has 1 2 3
After merging, the bullet numbering is updated to 1 2 3 4 5 6.
How do I stop this?
I am using the following code:
if (counter == 1)
{
    FirstFileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
    FirstFileIS = new java.io.ByteArrayInputStream(FirstFileByteStream);
    FirstWordFile = org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(FirstFileIS);
    main = FirstWordFile.getMainDocumentPart();
    // Add page break for Table of Content
    main.addObject(objBr);
    if (htmlCode != null) {
        main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html, htmlCode.toString().getBytes());
    }
    // Table of contents - End
}
else
{
    FileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
    FileIS = new java.io.ByteArrayInputStream(FileByteStream);
    byte[] bytes = IOUtils.toByteArray(FileIS);
    AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/part" + (chunkCount++) + ".docx"));
    afiPart.setContentType(new ContentType(CONTENT_TYPE));
    afiPart.setBinaryData(bytes);
    Relationship altChunkRel = main.addTargetPart(afiPart);
    CTAltChunk chunk = Context.getWmlObjectFactory().createCTAltChunk();
    chunk.setId(altChunkRel.getId());
    main.addObject(objBr);
    htmlCode = new StringBuilder();
    htmlCode.append("<html>");
    htmlCode.append("<h2><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><p style=\"font-family:'Arial Black'; color: #f35b1c\">" + ReqName + "</p></h2>");
    htmlCode.append("</html>");
    if (htmlCode != null) {
        main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html, htmlCode.toString().getBytes());
    }
    // Add Page Break before new content
    main.addObject(objBr);
    // Add new content
    main.addObject(chunk);
}
Looking at your code, you are adding HTML altChunks to your document.
For these to display in Word, the HTML is converted to normal docx content.
An altChunk is usually converted by Word when you open the docx.
(Alternatively, docx4j-ImportXHTML can do it for an altChunk of type XHTML)
The upshot is that what happens with the bullets (when Word converts your HTML) is largely outside your control. You could experiment with CSS but I think Word will mostly ignore it.
An alternative may be to use XHTML altChunks, and have docx4j-ImportXHTML convert them via main.convertAltChunks().
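A rough, untested sketch of that approach (it assumes docx4j-ImportXHTML is on the classpath and that your chunk content is valid XHTML):
// Add the chunk as XHTML instead of HTML...
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Xhtml,
        htmlCode.toString().getBytes());
// ...then convert it eagerly with docx4j-ImportXHTML, so Word never has to
// do the HTML-to-docx conversion (and the list renumbering) itself
WordprocessingMLPackage converted = main.convertAltChunks();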
If the same problem occurs when you try that, well, at least we can address it.
I was able to fix my issue using the following code, which I found at http://webapp.docx4java.org/OnlineDemo/forms/upload_MergeDocx.xhtml. You can also generate your own custom code; they have a nice demo that generates code according to your requirements :).
public final static String DIR_IN = System.getProperty("user.dir") + "/";
public final static String DIR_OUT = System.getProperty("user.dir") + "/";

public static void main(String[] args) throws Exception
{
    String[] files = {"part1docx_20200717t173750539gmt.docx", "part1docx_20200717t173750539gmt (1).docx", "part1docx_20200717t173750539gmt.docx"};
    List<BlockRange> blockRanges = new ArrayList<>();
    for (int i = 0; i < files.length; i++) {
        BlockRange block = new BlockRange(WordprocessingMLPackage.load(new File(DIR_IN + files[i])));
        blockRanges.add(block);
        block.setStyleHandler(StyleHandler.RENAME_RETAIN);
        // ADD_NEW_LIST gives each source document its own list, instead of continuing the numbering
        block.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);
        block.setRestartPageNumbering(false);
        block.setHeaderBehaviour(HfBehaviour.DEFAULT);
        block.setFooterBehaviour(HfBehaviour.DEFAULT);
        block.setSectionBreakBefore(SectionBreakBefore.NEXT_PAGE);
    }
    // Perform the actual merge
    DocumentBuilder documentBuilder = new DocumentBuilder();
    WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);
    // Save the result
    SaveToZipFile saver = new SaveToZipFile(output);
    saver.save(DIR_OUT + "OUT_MergeWholeDocumentsUsingBlockRange.docx");
}

Upload documents into Watson's Retrieve & Rank service

I'm implementing a solution using Watson's Retrieve & Rank service.
When I use the tooling interface, I upload my documents and they appear as a list, where I can click on any of them to open up all the titles that are inside the document (answer units), as you can see in Picture 1 and Picture 2.
When I try to upload documents via Java, it won't recognize the documents; they get uploaded in parts (answer units as documents), each part as a new document.
I would like to know how I can upload my documents as entire documents and not only parts of them.
Here's the code for the upload function in Java:
public Answers ConvertToUnits(File doc, String collection) throws ParseException, SolrServerException, IOException {
    DC.setUsernameAndPassword(USERNAME, PASSWORD);
    Answers response = DC.convertDocumentToAnswer(doc).execute();
    SolrInputDocument newdoc = new SolrInputDocument();
    WatsonProcessing wp = new WatsonProcessing();
    Collection<SolrInputDocument> newdocs = new ArrayList<SolrInputDocument>();
    for (int i = 0; i < response.getAnswerUnits().size(); i++) {
        String titulo = response.getAnswerUnits().get(i).getTitle();
        String id = response.getAnswerUnits().get(i).getId();
        newdoc.addField("title", titulo);
        for (int j = 0; j < response.getAnswerUnits().get(i).getContent().size(); j++) {
            String texto = response.getAnswerUnits().get(i).getContent().get(j).getText();
            newdoc.addField("body", texto);
        }
        wp.IndexDocument(newdoc, collection);
        newdoc.clear();
    }
    wp.ComitChanges(collection);
    return response;
}

public void IndexDocument(SolrInputDocument newdoc, String collection) throws SolrServerException, IOException {
    UpdateRequest update = new UpdateRequest();
    update.add(newdoc);
    UpdateResponse addResponse = solrClient.add(collection, newdoc);
}
You can specify config options in this line:
Answers response = DC.convertDocumentToAnswer(doc).execute();
I think something like this should do the trick:
String configAsString = "{ \"conversion_target\":\"answer_units\", \"answer_units\": { \"selector_tags\": [] } }";
JsonParser jsonParser = new JsonParser();
JsonObject customConfig = jsonParser.parse(configAsString).getAsJsonObject();
Answers response = DC.convertDocumentToAnswer(doc, null, customConfig).execute();
I've not tried it out, so might not have got the syntax exactly right, but hopefully this will get you on the right track.
Essentially, what I'm trying to do here is use the selector_tags option in the config (see https://www.ibm.com/watson/developercloud/doc/document-conversion/customizing.shtml#htmlau for doc on this) to specify which tags the document should be split on. By specifying an empty list with no tags in, it results in it not being split at all - and coming out in a single answer unit as you want.
(Note that you can do this through the tooling interface, too - by unticking the "Split my documents up into individual answers for me" option when you upload the document)

ArrayList<String> in PDF from a new row

I want to generate a survey as a PDF from Java, and I have tried different methods. I tried it with a StringBuffer and without, but I always see the text in the PDF in one row.
public void writePdf(OutputStream outputStream) throws Exception {
    Paragraph paragraph = new Paragraph();
    Document document = new Document();
    PdfWriter.getInstance(document, outputStream);
    document.open();
    document.addTitle("Survey PDF");
    ArrayList nameArrays = new ArrayList();
    StringBuffer sb = new StringBuffer();
    int i = -1;
    for (String properties : textService.getAnswer()) {
        nameArrays.add(properties);
        i++;
    }
    for (int a = 0; a <= i; a++) {
        System.out.println("nameArrays.get(a) -" + nameArrays.get(a));
        sb.append(nameArrays.get(a));
    }
    paragraph.add(sb.toString());
    document.add(paragraph);
    document.close();
}
textService.getAnswer() is an ArrayList<String>.
Could you please advise how to separate the text so that each new sentence starts on a new row? At the moment everything comes out on a single line.
You forgot the newline character \n and your code seems a bit overcomplicated.
Try this:
StringBuffer sb = new StringBuffer();
for (String property : textService.getAnswer()) {
    sb.append(property);
    sb.append('\n');
}
What about:
nameArrays.add(properties+"\n");
You might be able to fix that by simply appending "\n" to the strings that you are collecting in your list; but I think that very much depends on the PDF library you are using.
You see, "newlines" or "paragraphs" are to a certain degree about formatting. It seems like a conceptual problem to add that "formatting" information to the data that you are processing.
Meaning: you might want to check whether your library allows you to provide plain strings, and then have the library do the formatting for you!
In other words: instead of passing strings with newlines, you should check whether you can keep using strings without newlines and have the PDF library add line breaks where appropriate.
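For example, with iText (which your writePdf method appears to use) you could add each answer as its own Paragraph and let the library start each one on a new line. A minimal sketch:
// One Paragraph per answer: iText begins each Paragraph on a new line,
// so no manual "\n" handling is needed
for (String answer : textService.getAnswer()) {
    document.add(new Paragraph(answer));
}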
Side note on code quality: you are using raw types:
ArrayList nameArrays = new ArrayList();
should better be
ArrayList<String> names = new ArrayList<>();
[ I also changed the name - there is no point in putting the type of a collection into the variable name! ]
This method saves the values in an ArrayList into a PDF document. The "/" in the mFilePath variable is where you can give a folder name, for example "/example/".
For the mFileName variable you can use any name; I use the date and time at which the document is created. Don't use a static name, otherwise your values will be overwritten in the same PDF.
private void savePDF()
{
    com.itextpdf.text.Document mDoc = new com.itextpdf.text.Document();
    // pattern: yyyy = year, MM = month, dd = day, HH = hour, mm = minute, ss = second
    String mFileName = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss", Locale.getDefault()).format(System.currentTimeMillis());
    String mFilePath = Environment.getExternalStorageDirectory() + "/" + mFileName + ".pdf";
    try
    {
        PdfWriter.getInstance(mDoc, new FileOutputStream(mFilePath));
        mDoc.open();
        for (int d = 0; d < answers.size(); d++)
        {
            String mtext = answers.get(d);
            mDoc.add(new Paragraph(mtext));
        }
        mDoc.close();
    }
    catch (Exception e)
    {
        e.printStackTrace(); // don't swallow the exception silently
    }
}

How to keep Lucene index without deleted documents

This is my first question on Stack Overflow, so wish me luck.
I am doing a classification process over a Lucene index with Java, and I need to update a document field named category. I have been using Lucene 4.2 with the index writer's updateDocument() function for that purpose, and it is working very well, except for the deletion part. Even if I use the forceMergeDeletes() function after the update, the index shows me some already-deleted documents. For example, if I run the classification over an index with 1000 documents, the final number of documents in the index remains the same and works as expected; but when I increase the index to 10000 documents, the index shows some already-deleted documents, though not all. So, how can I actually erase those deleted documents from the index?
Here is some snippets of my code:
public static void main(String[] args) throws IOException, ParseException {
    /////////////////////// Preparing config data ////////////////////////////
    File indexDir = new File("/indexDir");
    Directory fsDir = FSDirectory.open(indexDir);
    IndexWriterConfig iwConf = new IndexWriterConfig(Version.LUCENE_42, new WhitespaceSpanishAnalyzer());
    iwConf.setOpenMode(IndexWriterConfig.OpenMode.APPEND);
    IndexWriter indexWriter = new IndexWriter(fsDir, iwConf);
    IndexReader reader = DirectoryReader.open(fsDir);
    IndexSearcher indexSearcher = new IndexSearcher(reader);
    KNearestNeighborClassifier classifier = new KNearestNeighborClassifier(100);
    AtomicReader ar = new SlowCompositeReaderWrapper((CompositeReader) reader);
    classifier.train(ar, "text", "category", new WhitespaceSpanishAnalyzer());
    System.out.println("***Before***");
    showIndexedDocuments(reader);
    System.out.println("***Before***");
    int maxdoc = reader.maxDoc();
    int j = 0;
    for (int i = 0; i < maxdoc; i++) {
        Document doc = reader.document(i);
        String clusterClasif = doc.get("category");
        String text = doc.get("text");
        String docid = doc.get("doc_id");
        ClassificationResult<BytesRef> result = classifier.assignClass(text);
        String classified = result.getAssignedClass().utf8ToString();
        if (!classified.isEmpty() && clusterClasif.compareTo(classified) != 0) {
            Term term = new Term("doc_id", docid);
            doc.removeField("category");
            doc.add(new StringField("category", classified, Field.Store.YES));
            indexWriter.updateDocument(term, doc);
            j++;
        }
    }
    indexWriter.forceMergeDeletes(true);
    indexWriter.close();
    System.out.println("Classified documents count: " + j);
    System.out.println();
    reader.close();
    reader = DirectoryReader.open(fsDir);
    System.out.println("Deleted docs: " + reader.numDeletedDocs());
    System.out.println("***After***");
    showIndexedDocuments(reader);
}

private static void showIndexedDocuments(IndexReader reader) throws IOException {
    int maxdoc = reader.maxDoc();
    for (int i = 0; i < maxdoc; i++) {
        Document doc = reader.document(i);
        String idDoc = doc.get("doc_id");
        String text = doc.get("text");
        String category = doc.get("category");
        System.out.println("Id Doc: " + idDoc);
        System.out.println("Category: " + category);
        System.out.println("Text: " + text);
        System.out.println();
    }
    System.out.println("Total: " + maxdoc);
}
I have spent many hours looking for a solution to this. Some say that the deleted documents in the index are not important and that eventually they will be erased as we keep adding documents to the index, but I need to control that process so that I can iterate over the index documents at any time and be sure the documents I retrieve are actually the live ones. Lucene versions prior to 4.0 had a function in the IndexReader class named isDeleted(docId) that told you whether a document had been marked as deleted; that could be half of the solution to my problem, but I have not found a way to do that with version 4.2 of Lucene. If you know how to do that, I'd really appreciate it if you shared it.
You can check whether a document is deleted via the MultiFields class, like:
Bits liveDocs = MultiFields.getLiveDocs(reader);
if (!liveDocs.get(docID)) ...
So, working this into your code, perhaps something like:
int maxdoc = reader.maxDoc();
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < maxdoc; i++) {
    // liveDocs is null when the index contains no deletions
    if (liveDocs != null && !liveDocs.get(i)) continue;
    Document doc = reader.document(i);
    String idDoc = doc.get("doc_id");
    ....
}
By the way, it sounds like you have previously been working with 3.X and are now on 4.X. The Lucene Migration Guide is very helpful for understanding these sorts of changes between versions, and how to resolve them.

Parsing malformed/incomplete/invalid XML files [duplicate]

This question already has answers here:
How to parse invalid (bad / not well-formed) XML?
(4 answers)
Closed 5 years ago.
I have a process that parses an XML file using JDOM and xpath to parse the file as shown below:
private static SAXBuilder builder = null;
private static Document doc = null;
private static XPath xpathInstance = null;

builder = new SAXBuilder();
Text list = null;
try {
    doc = builder.build(new StringReader(xmldocument));
} catch (JDOMException e) {
    throw new Exception(e);
}
try {
    xpathInstance = XPath.newInstance("//book[author='Neal Stephenson']/title/text()");
    list = (Text) xpathInstance.selectSingleNode(doc);
} catch (JDOMException e) {
    throw new Exception(e);
}
The above works fine. The XPath expressions are stored in a properties file so they can be changed at any time. Now I have to process some more XML files that come from a legacy system that will only send the XML files in chunks of 4000 bytes. The existing processing reads the 4000-byte chunks and stores them in an Oracle database, with each chunk as one row (making any changes to the legacy system, or to the processing that stores the chunks as rows in the database, is out of the question).
I can build the complete, valid XML document by extracting all the rows related to a specific XML document and merging them, and then use the existing processing (shown above) to parse the document.
The thing is, though, that the data I need to extract will always be in the first 4000 bytes. This chunk of course is not a valid XML document, as it will be incomplete, but it will contain all the data I need. I can't parse just the one chunk, as the JDOM builder will reject it.
I am wondering whether I can parse the malformed XML chunk without having to merge all the parts (of which there could be quite many) in order to get a valid XML document. This would save me several trips to the database to check whether a chunk is available, and I wouldn't have to merge hundreds of chunks only to use the first 4000 bytes.
I know I could probably use Java's string functions to extract the relevant data, but is this possible using a parser or even XPath? Or do they both expect the XML document to be well-formed before they can parse it?
You could try to use JSoup to parse the invalid XML. By definition XML should be well-formed, otherwise it's invalid and should not be used.
UPDATE - example:
public static void main(String[] args) {
    for (Node node : Parser.parseFragment("<test><author name=\"Vlad\"><book name=\"SO\"/>",
            new Element(Tag.valueOf("p"), ""),
            "")) {
        print(node, 0);
    }
}

public static void print(Node node, int offset) {
    for (int i = 0; i < offset; i++) {
        System.out.print(" ");
    }
    System.out.print(node.nodeName());
    for (Attribute attribute : node.attributes()) {
        System.out.print(", ");
        System.out.print(attribute.getKey() + "=" + attribute.getValue());
    }
    System.out.println();
    for (Node child : node.childNodes()) {
        print(child, offset + 4);
    }
}
