ITEXT dataElements Loop Performance

Hi, I'm working on a project in which one of the reporting modules uses the iText 2.0.8 library. Everything worked fine until the amount of data became huge (around 50,000+ rows). I would really appreciate suggestions from the experts on Stack Overflow on how to improve my code.
My code logic: I build the HTML with all the data contained inside it. Once the full HTML is built, it is stored in a variable called "content"; I then convert "content" into a list of IElement objects and add each one to the document in a for loop. I realised that this loop performs badly (CPU usage is high) and the report is generated very slowly (it even caused a connection timeout).
The following is the part of the code that causes very high CPU usage in the Java process.
// The content String variable contains the HTML code of the report
// (from <head> to <body>, with a <table> as the main content to structure the data rows).
// I didn't include it here because the code is huge.
String PDFFileName = "123.pdf";
PdfDocument pdf = new PdfDocument(new PdfWriter(new FileOutputStream(PDFFileName)));
Document document = new Document(pdf);
List<IElement> dataElements = HtmlConverter.convertToElements(content.toString(), converterProperties);
for (IElement element : dataElements) {
    if (element instanceof IBlockElement) {
        document.add((IBlockElement) element);
    }
}
I know the loop is the issue, but I don't know of any better, more efficient approach for my case. I hope someone can help me with this. Thank you! Please comment below if you need extra information (sorry, I can't include all of the code because it is very large).
Specification: iText 2.0.8, Java 8, HTML, CSS.

Related

iText PDF Table error using writeSelectedRows

I have the following error.
I'm trying to create a PDF with iText in a specific format. I opted to use a table for each section of the page, because the format I need is made up of tables. I created the tables and added them with the doc.add(table) method, and this worked fine, but I needed to place the tables at specific positions. So I switched to the table.writeSelectedRows() method, and this also worked fine.
And here comes the error. This is my code:
table_SectionTwo.addCell(cell_White);
table_SectionTwo.addCell(cell_White);
table_SectionTwo.addCell(p);
table_SectionTwo.addCell(cell_OrderDate);
table_SectionTwo.addCell(cell_CustomerOrderDate);
table_SectionTwo.addCell(cell_OrderNumberSection);
float[] columnWidths = new float[] {38f, 105f, 90f};
table_SectionTwo.setTotalWidth(columnWidths);
table_SectionTwo.setLockedWidth(true);
table_SectionTwo.completeRow();
table_SectionTwo.writeSelectedRows(0, -1, 260f, 770f, super.getPdfWriter().getDirectContent());
doc.add(table_SectionTwo);
As you can see, if I execute this code, the same table is added twice.
The problem appears when I remove doc.add(table), which I do so that only one table is written, at a specific position, using table.writeSelectedRows(). This is what my code looks like then:
table_SectionTwo.writeSelectedRows(0, -1, 260f, 770f, super.getPdfWriter().getDirectContent());
//super.addTable(table_SectionTwo);
I commented out doc.add(table).
This should write only one table, but it doesn't work. When I run it, it throws:
ExceptionConverter: java.io.IOException: The document has no pages.
at com.itextpdf.text.pdf.PdfPages.writePageTree(PdfPages.java:113)
at com.itextpdf.text.pdf.PdfWriter.close(PdfWriter.java:1217)
at com.itextpdf.text.pdf.PdfDocument.close(PdfDocument.java:777)
at com.itextpdf.text.Document.close(Document.java:398)
at PDFConstructor.CloseDocument(PDFConstructor.java:85)
at InvoicePDF.CloseDocument(InvoicePDF.java:58)
at Demo.main(Demo.java:72)
The curious thing is that when I comment out doc.add(table) it fails, and when I comment out table.writeSelectedRows() instead, doc.add(table) works fine.
The error occurs only when doc.add(table) is commented out and table.writeSelectedRows() is not.
Please help me.

Although you don't give sufficient information in your question, I think the problem is caused by the fact that you don't define the width of the table.
Do this test: ask the table for its total height. If iText returns 0, then you forgot to define the width of the table; if it doesn't return 0, then iText knows the width either because you defined it explicitly, or because you used document.add(table), which calculated the dimensions of the table based on the page metrics of the document object.
If something else is at play, you'll have to provide more info.

1. What I understand from your question is that you want to write specific rows to the document at a specified position. If that is correct, is super.getPdfWriter().getDirectContent() really necessary? I don't think so. To analyse this part I would need your whole code snippet, or a demo version of the code that shows the same behaviour.
2. Internally, iText also uses PdfContentByte to write a PdfPTable by way of PdfPRow. Also remember that, according to the author (Bruno), iText is built on the Builder pattern. If that means nothing to you, skip it.
You are currently adding content to the table before all of its required properties are set.
table_SectionTwo.addCell(cell_White);
table_SectionTwo.addCell(cell_White);
table_SectionTwo.addCell(p);
table_SectionTwo.addCell(cell_OrderDate);
table_SectionTwo.addCell(cell_CustomerOrderDate);
table_SectionTwo.addCell(cell_OrderNumberSection);
It should be something like this:
float[] columnWidths = new float[] {38f, 105f, 90f};
PdfPTable table_SectionTwo = new PdfPTable(columnWidths);
table_SectionTwo.setTotalWidth(500.0f);
table_SectionTwo.setWidthPercentage(100.0f);
table_SectionTwo.setLockedWidth(true);
3. Don't use super.getPdfWriter().getDirectContent(). Since the code above shows that you are using a Document, I think you must also have written something like the following snippet:
PdfWriter pdfWrtr=null;
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Document doc= new Document(UtilConstant.pageSizePdf,0,0,0,0);
pdfWrtr=PdfWriter.getInstance(doc,baos);
Inside the try/catch, use pdfWrtr.getDirectContent() instead.
All of this is based on my reading of your code.
One more point, from the exception:
ExceptionConverter: java.io.IOException: The document has no pages.
at com.itextpdf.text.pdf.PdfPages.writePageTree(PdfPages.java:113)
...............................
at InvoicePDF.CloseDocument(InvoicePDF.java:58)
at Demo.main(Demo.java:72)
This is a typical error when nothing has been added to the document. Maybe an exception is thrown (and ignored) in step 4 (according to iText in Action), and maybe you are executing step 5 (document.close()) anyway, in spite of the exception in step 4. So please attach Demo.java if the above is not clear enough to help you.

Same Jsoup code behaving differently on Android and desktop

I have a simple five-line Jsoup snippet that parses some strings. On desktop it runs smoothly and returns an ArrayList with the values I want; on the Android emulator and on a phone, however, it returns nothing, without even raising an error.
That's the whole code:
Document doc = Jsoup.connect(myURL).get();
Elements els = doc.select("div font a");
for (int i = 3; i < els.size(); i++) {
    latestNews.add(els.get(i).text());
}
On desktop it adds the elements to the ArrayList; on the device nothing happens. Could anyone help with this?

Are you sure you are receiving the same HTML from the site? You should debug and check your doc variable to make sure it contains the HTML you expect. You may be getting the mobile site while your selectors are written for the full site (I'm not sure whether Jsoup prevents getting the mobile site or not). You likely need to set the user agent so that you receive the full desktop variant of the website.
For example:
Document doc = Jsoup.connect(myURL).userAgent("Mozilla").get();
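To check whether the server really sends different HTML depending on the user agent, something along these lines might help. It is only a rough sketch that uses plain JDK classes instead of Jsoup; the class name UserAgentCheck, the two user-agent strings and the UTF-8 assumption are mine, not from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class UserAgentCheck {

    // Prepares (but does not yet send) a request carrying the given User-Agent header.
    static HttpURLConnection open(String url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        return conn;
    }

    // Sends the request and reads the whole body as text (assumes the page is UTF-8).
    static String fetch(String url, String userAgent) throws IOException {
        HttpURLConnection conn = open(url, userAgent);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: UserAgentCheck <url>");
            return;
        }
        // Illustrative user-agent strings; substitute the ones your clients actually send.
        String desktop = fetch(args[0], "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        String mobile = fetch(args[0], "Mozilla/5.0 (Linux; Android 10; Mobile)");
        System.out.println("desktop length: " + desktop.length());
        System.out.println("mobile length:  " + mobile.length());
        System.out.println("identical: " + desktop.equals(mobile));
    }
}
```

If the two bodies differ, the selector "div font a" is probably matching markup that only exists in one variant of the page.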

Saxon is slow parsing

I am trying to parse some XML with Saxon in order to run XPath queries on it, but I have two problems. The first is that Saxon takes a very long time to build a very short XHTML document.
The code is this:
Processor processorInstance = new Processor(false);
processorInstance.setConfigurationProperty(FeatureKeys.DTD_VALIDATION, false);
XPathCompiler XPathCompilerInstance = processorInstance.newXPathCompiler();
XPathCompilerInstance.setBackwardsCompatible(false);
String expressionTitre = "//div[@class='score_global']/preceding-sibling::img[1]";
XPathExecutable XPathExecutableInstance = XPathCompilerInstance.compile(expressionTitre);
XPathSelector selector = XPathExecutableInstance.load();
logger.info("Xpath compiled.");
// Phase 2, load xml document.
DocumentBuilder documentBuilderInstance = processorInstance.newDocumentBuilder();
documentBuilderInstance.setSchemaValidator(null);
documentBuilderInstance.setLineNumbering(false);
documentBuilderInstance.setRetainPSVI(false);
XdmNode context = documentBuilderInstance.build(new File("sample/sample.xml")); // This line takes ages to return.
What I don't understand is that if I do it with SAX, it loads at normal speed :(.
What did I forget to configure in Saxon?
Java 1.6
Saxon 9.1.0.8
The second problem is that it is unable to process accented characters when my XML starts like this:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
So I removed the xml:lang and lang attributes, but had no better luck :(
Do you have any ideas?
Thank you!

Well, after much reading, it turned out that I simply needed to define a CatalogResolver and download the XHTML DTDs locally. I dropped Saxon and used plain JAXP/SaxReader instead.
This page http://xml.apache.org/commons/components/resolver/resolver-article.html proved very interesting.
I hope these considerations prove useful to someone :)

OK, I've found out that although I configured Saxon not to validate, it nonetheless tried to resolve the DTD URI, did not find it locally, and so went online and got a 503 from W3C, which takes a long time to return.
I removed the DTD declaration from my XML, and it worked.
My next step is to stop it from trying to resolve the DTD. I am currently reading the Saxon documentation and experimenting with an entity resolver, and that should be OK.
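For what it's worth, the entity-resolver idea can be sketched with plain JAXP (not Saxon's own API): every external entity request, including the DTD named in the DOCTYPE, is answered with an empty stream, so nothing is fetched from w3.org. The class and method names below are made up for illustration. Note that with an empty DTD, named XHTML entities such as &eacute; are no longer defined, so this only suits documents that use literal characters or numeric references; otherwise a local copy of the DTDs (the catalog approach) is needed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class OfflineDtdParse {

    // Parses the XML without ever downloading the DTD referenced in its DOCTYPE.
    static Document parseWithoutFetchingDtd(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false); // turns validation off, but the DTD would still be loaded
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(new EntityResolver() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Hand back an empty document instead of going to the network.
                return new InputSource(new StringReader(""));
            }
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"
                + "<body><p>caf\u00e9</p></body></html>";
        Document doc = parseWithoutFetchingDtd(xhtml);
        System.out.println(doc.getDocumentElement().getLocalName());
        System.out.println(doc.getElementsByTagNameNS("*", "p").item(0).getTextContent());
    }
}
```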

Is there a decent, customisable, HTML to Markdown Java API?

I want to save text I scrape from various sources without the HTML tags that are on it, but also keeping as much of the structure as I reasonably can.
Markdown seems to be the solution to this (or possibly MultiMarkdown).
There is a question which offers a suggestion on converting from HTML to Markdown, but I want to specify some particular requirements:
ALL links (including images) are referenced at the END only (i.e. no inline URLs)
NO embedded HTML (I'm not even 100% sure yet how I'd like to deal with difficult HTML... but it won't be embedded!)
So my question is as stated in the title: Is there a decent, customisable, HTML to Markdown Java API?

You could try adapting HtmlCleaner, which provides a workable interface onto the DOM:
TagNode root = htmlCleaner.clean( stream );
Object[] found = root.evaluateXPath( "//div[@id='something']" );
// Check the first match itself, not the array, before casting it.
if( found.length > 0 && found[0] instanceof TagNode ) {
    ((TagNode)found[0]).removeFromTree();
}
This would allow you to structure your output stream in any format that you want using a fairly simple API.

There is a great library for JS called Turndown; you can try it online here. It can be partially customized. For example, links can be referenced at the end. And as far as I know there is no embedded HTML; everything is transformed.
I needed it for Java (as in the linked question), so I ported it. The Java library is called CopyDown, and it has the same test suite as Turndown.
To install with Gradle:
dependencies {
    compile 'io.github.furstenheim:copy_down:1.0'
}
Then to use it:
CopyDown converter = new CopyDown();
String myHtml = "<h1>Some title</h1><div>Some html<p>Another paragraph</p></div>";
String markdown = converter.convert(myHtml);
System.out.println(markdown);
> Some title\n==========\n\nSome html\n\nAnother paragraph\n
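To make "links referenced at the end" concrete, here is a toy sketch of the output shape. It is not how CopyDown or Turndown work internally: it uses a regular expression instead of a real HTML parser, only handles anchors whose first attribute is href, and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReferenceLinks {

    // Matches <a href="...">text</a>; deliberately simplistic, for illustration only.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+href=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Rewrites each anchor as [text][n] and appends "[n]: url" lines at the end.
    static String toReferenceStyle(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        StringBuffer body = new StringBuffer();
        while (m.find()) {
            urls.add(m.group(1));
            String link = "[" + m.group(2) + "][" + urls.size() + "]";
            m.appendReplacement(body, Matcher.quoteReplacement(link));
        }
        m.appendTail(body);

        StringBuilder out = new StringBuilder(body);
        if (!urls.isEmpty()) {
            out.append('\n'); // blank line between the text and the reference list
        }
        for (int i = 0; i < urls.size(); i++) {
            out.append("\n[").append(i + 1).append("]: ").append(urls.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "See <a href=\"http://a.example\">A</a> and <a href=\"http://b.example\">B</a>.";
        System.out.println(toReferenceStyle(html));
    }
}
```

For real input, a proper converter built on an HTML parser is the safer route; the sketch only shows what the requirement in the question amounts to.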

Unknown error when calling Java applet from JavaScript

Here's the JavaScript (on an aspx page):
function WriteDocument(clientRef, system, branch, category, pdfXML)
{
    AppletReturnValue = document.DocApplet.WriteDocument(clientRef, apmBROOMS, branch, category, pdfXML);
    if (AppletReturnValue.length > 0) {
        document.getElementById('pdfData').value = "";
        CallServer(AppletReturnValue,'');
    }
    PostBackAndDisplayPDF()
}
pdfXML is taken from pdfData, a hidden field on the page holding the XML, which contains the base64-encoded PDF data that is passed to the Java applet. All the other values being passed are sensible and within range.
The XML looks like this:
<Documents>
<FileName>AFileName</FileName>
<PDF>JVBERiDAzOTY1NzMwIDAwMDAwIG4NCjAwMDM5NjU4NDcgMDAwMDAgbg0KMDAwMzk2NTk2</PDF>
</Documents>
The contents of the PDF element are a lot bigger than shown here.
The signature of the Java method is:
public String WriteDocument(String clientPolicyReference,
                            int systemType,
                            int branch,
                            String category,
                            String PDFData) throws Exception
It seems that when the PDF data gets large, the applet call fails and the error 'Unknown Error' is thrown in the JS.
The PDF document whose data produces this error is about 4 MB in size.
Many thanks in advance for any help.

Thanks for responding, chaps, but I've sorted the problem.
How? I took JRE 1.6 update 12 off my machine and installed update 7 (which is the version we recommend to those who use our website).
Why update 12 stopped working I don't know. Why update 7 is stable I don't know. [sigh]
It's things like this that make me glad I work mostly with a 'long time between releases' framework like .NET.
