I have an issue when displaying strings received from a server in a JTable. Certain characters appear as little white squares instead of "é", "à", etc. I have tried a lot of things, but none of them fixed the problem. I'm working with Eclipse under Windows. The server was developed using Visual Studio 2010.
The server sends an XML file using tinyXML2; the client uses JDom to read it. The font used is "Dialog". The server takes the strings from an Oracle database.
I assume this is an encoding problem, but I haven't been able to fix it yet.
Does anyone have an idea?
Thanks,
Arnaud
EDIT: As requested, this is how I use JDom:
public static Player fromXML(Element e)
{
Player result = new Player();
String e_text = null;
try
{
e_text = e.getChildText(XMLTags.XML_Player_playerId);
if (e_text != null) result.setID(Integer.parseInt(e_text));
e_text = e.getChildText(XMLTags.XML_Player_lastName);
if (e_text != null) result.setName(e_text);
e_text = e.getChildText(XMLTags.XML_Player_point_scored);
if (e_text != null) result.addSpecial(STAT_SCORED, Double.parseDouble(e_text));
e_text = e.getChildText(XMLTags.XML_Player_point_scored_last);
if (e_text != null) result.addSpecial(STAT_SCORED_LAST, Double.parseDouble(e_text));
}
catch (Exception ex) {
ex.printStackTrace();
}
return result;
}
public static Document load(String filename) {
File XMLFile = new File(CLIENT_TO_SERVER, filename);
SAXBuilder sxb = new SAXBuilder();
Document document = new Document();
try
{
document = sxb.build(new File(XMLFile.getPath()));
} catch(Exception e){e.printStackTrace();}
return document;
}
Read the file using the correct encoding, something like:
document = sxb.build(new BufferedReader(new InputStreamReader(new FileInputStream(XMLFile.getPath()), "UTF-8")));
Notes:
1. First determine which character encoding is actually used in that file, and specify that charset instead of "UTF-8" above.
2. In case the encoding is not known, or the files are generated by various systems with different encodings, you may use Mozilla's encoding-detector library, juniversalchardet: https://code.google.com/p/juniversalchardet/
3. You need to handle the UnsupportedEncodingException thrown by the InputStreamReader constructor.
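For illustration, a detection pass with juniversalchardet's UniversalDetector might look like this (a sketch; it assumes the library is on the classpath and reuses the XMLFile variable from the load method above):

byte[] buf = new byte[4096];
UniversalDetector detector = new UniversalDetector(null);
try (FileInputStream fis = new FileInputStream(XMLFile)) {
    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread); // feed bytes to the detector
    }
}
detector.dataEnd();
String encoding = detector.getDetectedCharset(); // null if nothing was detected

The detected name can then be passed to the InputStreamReader in place of the hard-coded "UTF-8".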
I have a Java program that prints PDFs. It uses Apache PDFBox to create a PDDocument object (from a PDF document or, in some cases, from a stream) and then sends it to the printer using the javax.print API:
private boolean print(File pdf, String printer)
{
boolean success = false;
try (PDDocument document = PDDocument.load(pdf))
{
PrintService[] printServices = PrinterJob.lookupPrintServices();
PrintService printService = PrintServiceLookup.lookupDefaultPrintService();
PrinterJob job = PrinterJob.getPrinterJob();
job.setPageable(new PDFPageable(document));
// set printer
if (printer != null)
{
for (PrintService selected : printServices)
{
if (selected.getName().equals(printer))
{
printService = selected;
break;
}
}
}
job.setPrintService(printService);
job.print();
success = true;
}
catch (Exception e)
{
myLog.error("Printer error.", e);
}
return success;
}
Now I need to be able to tell the printer to staple the thing...
I am familiar with the javax.print.attributes API and use this successfully for specifying the tray or setting duplex, e.g.:
// this works fine
if (duplex != null)
{
if (duplex.equalsIgnoreCase("short"))
{
myLog.debug("Setting double-sided: Short");
attr.add(Sides.TWO_SIDED_SHORT_EDGE);
}
else
{
myLog.debug("Setting double-sided: Long");
attr.add(Sides.TWO_SIDED_LONG_EDGE);
}
}
I know there is an attribute for stapling:
attr.add(javax.print.attribute.standard.Finishings.STAPLE);
I have a Xerox Versalink B7035 with a Finisher XL attachment that fully supports stapling (i.e. it works from MS Office document settings) however the printer disregards the STAPLE attribute set from Java. I tried all other variants of staple attributes but soon found that the printer did not support ANY Java finishing attributes.
Or to put it in code, the following prints NO results:
DocFlavor flavor = DocFlavor.SERVICE_FORMATTED.PRINTABLE;
Object finishings = myPrinter.getSupportedAttributeValues(Finishings.class, flavor, null);
if (finishings != null && finishings.getClass().isArray())
{
for (Finishings finishing : (Finishings[]) finishings)
{
System.out.println(finishing.getValue() + " : " + finishing);
}
}
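For completeness, one can also ask the service about a specific value directly (a sketch; myPrinter is the PrintService from above, and on this printer I would expect it to report false, consistent with the empty list):

// Presumably false here, matching the empty getSupportedAttributeValues() result
boolean stapleSupported = myPrinter.isAttributeValueSupported(
        Finishings.STAPLE, DocFlavor.SERVICE_FORMATTED.PRINTABLE, null);
System.out.println("STAPLE supported: " + stapleSupported);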
After reading this and trying a few different things I concluded the printer will not accept the STAPLE attribute because the finisher is an attachment or simply because Xerox doesn't like Java or something. So now I am attempting to solve this by prepending PJL commands to the pdf before sending it, as covered here.
*PJL = Print Job Language
E.g:
<ESC>%-12345X#PJL<CR><LF>
#PJL SET STAPLE=LEFTTOP<CR><LF>
#PJL ENTER LANGUAGE = PDF<CR><LF>
[... all bytes of the PDF file, starting with '%PDF-1.' ...]
[... all bytes of the PDF file ............................]
[... all bytes of the PDF file ............................]
[... all bytes of the PDF file, ending with '%%EOF' .......]
<ESC>%-12345X
At first I assumed there would just be some method in the Apache PDFBox library to do just this, but no luck. Then I checked out the API for Ghost4J and saw nothing for prepending. Has anyone else solved this already?
Switching to raw Java socket printing makes PJL an option:
// this works, and it also printed faster than javax.print when tested
private static void print(File document, String printerIpAddress, boolean staple)
{
    try (Socket socket = new Socket(printerIpAddress, 9100);
         DataOutputStream out = new DataOutputStream(socket.getOutputStream()))
    {
        byte[] bytes = Files.readAllBytes(document.toPath());
        out.write(27); // ESC, opens the Universal Exit Language (UEL) sequence
        out.write("%-12345X#PJL\n".getBytes());
        out.write("#PJL SET DUPLEX=ON\n".getBytes());
        if (staple)
        {
            out.write("#PJL SET STAPLEOPTION=ONE\n".getBytes());
        }
        out.write("#PJL ENTER LANGUAGE=PDF\n".getBytes());
        out.write(bytes); // the raw PDF payload
        out.write(27); // ESC again, closing UEL
        out.write("%-12345X".getBytes());
        out.flush();
    }
    catch (Exception e)
    {
        System.out.println(e);
    }
}
The needed PJL commands came from this Xerox datasheet.
It should be noted that the same PJL commands worked for two different Xerox models and a Lexmark printer; those were all I had handy to test with, so I don't know whether other models will want something different.
This approach no longer needs the Apache PDFBox library, or any external libraries at all.
This might work for other types of documents, aside from PDFs.
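For reference, a call would look something like this (the path and IP address are made up):

print(new File("C:/temp/report.pdf"), "192.168.0.42", true);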
I went through this link for Java NLP: https://www.tutorialspoint.com/opennlp/index.htm
I tried the code below in Android:
try {
File file = copyAssets();
// InputStream inputStream = new FileInputStream(file);
ParserModel model = new ParserModel(file);
// Creating a parser
Parser parser = ParserFactory.create(model);
// Parsing the sentence
String sentence = "Tutorialspoint is the largest tutorial library.";
Parse topParses[] = ParserTool.parseLine(sentence, parser,1);
for (Parse p : topParses) {
p.show();
}
} catch (Exception e) {
}
I downloaded the file **en-parser-chunking.bin** from the internet and placed it in the assets of my Android project, but the code stops on the third line, i.e. ParserModel model = new ParserModel(file);, without throwing any exception. How can I make this work in Android? If it can't work, is there any other support for NLP in Android that does not consume any services?
The reason the code stalls/breaks at runtime is that you need to use an InputStream instead of a File to load the binary model resource. Most likely, the File instance is invalid when you "load" it the way indicated in line 2 of your snippet. In theory, this constructor of ParserModel should detect this and throw an IOException. Yet, sadly, the JavaDoc of OpenNLP is not precise about this kind of situation, and you are not handling this exception properly in your catch block.
Moreover, the code snippet you presented should be improved so that you know what actually went wrong.
Therefore, loading a ParserModel from within an Activity should be done differently. Here is a variant that takes care of both aspects:
AssetManager assetManager = getAssets();
InputStream in = null;
try {
    in = assetManager.open("en-parser-chunking.bin");
    // AssetManager#open throws an IOException if the asset is missing,
    // so reaching this point means the stream is valid.
    ParserModel parserModel = new ParserModel(in);
    // From here, <parserModel> is initialized and you can start playing with it...
    // Creating a parser
    Parser parser = ParserFactory.create(parserModel);
    // Parsing the sentence
    String sentence = "Tutorialspoint is the largest tutorial library.";
    Parse[] topParses = ParserTool.parseLine(sentence, parser, 1);
    for (Parse p : topParses) {
        p.show();
    }
}
catch (IOException ex) {
    // covers a missing/unreadable asset as well as a broken model file
    Log.e("NLP", "OpenNLP binary model could not be loaded from assets.", ex);
}
catch (Exception ex) {
    Log.e("NLP", "message: " + ex.getMessage(), ex);
    // proper exception handling here...
}
finally {
    if (in != null) {
        try {
            in.close();
        } catch (IOException ignored) {
            // nothing sensible to do here
        }
    }
}
This way, you're using an InputStream approach and at the same time taking care of proper exception and resource handling. Moreover, you can now use a debugger in case something remains unclear with the resource path references of your model files. For reference, see the official JavaDoc of AssetManager#open(String resourceName).
Note well:
Loading OpenNLP's binary resources can consume quite a lot of memory. For this reason, it may happen that your Android app's request to allocate the memory needed for this operation will not be granted by the actual runtime (i.e., smartphone) environment.
Therefore, carefully monitor the amount of requested/required RAM while parserModel = new ParserModel(in); is invoked; a rough check is sketched below.
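A minimal sketch (the figures are approximate, since garbage collection may run between the measurements):

Runtime rt = Runtime.getRuntime();
long before = rt.totalMemory() - rt.freeMemory();
ParserModel parserModel = new ParserModel(in); // the allocation-heavy call
long after = rt.totalMemory() - rt.freeMemory();
Log.d("NLP", "Model load used roughly " + ((after - before) / (1024 * 1024)) + " MB");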
Hope it helps.
I am trying to use the boilerpipe Java library to extract news articles from a set of websites.
It works great for text in English, but for text with special characters, for example words with accent marks (história), those special characters are not extracted correctly. I think it is an encoding problem.
In the boilerpipe FAQ, it says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in this paper.
My question is: are there any parameters when using boilerpipe with which I can specify the encoding? Is there any way to work around this and get the text correctly?
How I'm using the library:
(first attempt based on the URL):
URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);
(second attempt, on the HTML source code)
String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);
You don't have to modify inner Boilerpipe classes.
Just pass an InputSource object to the ArticleExtractor.INSTANCE.getText() method and force the encoding on that object. For example:
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
Regards!
Well, from what I see, when you use it like that, the library will auto-choose which encoding to use. From the HTMLFetcher source:
public static HTMLDocument fetch(final URL url) throws IOException {
final URLConnection conn = url.openConnection();
final String ct = conn.getContentType();
Charset cs = Charset.forName("Cp1252");
if (ct != null) {
Matcher m = PAT_CHARSET.matcher(ct);
if(m.find()) {
final String charset = m.group(1);
try {
cs = Charset.forName(charset);
} catch (UnsupportedCharsetException e) {
// keep default
}
}
}
Try debugging their code a bit, starting with ArticleExtractor.getText(URL), and see if you can override the encoding.
Ok, got a solution.
As Andrei said, I had to change the class HTMLFetcher, which is in the package de.l3s.boilerpipe.sax.
What I did was to convert all the text that was fetched to UTF-8.
At the end of the fetch function, I had to add two lines and change the last one:
final byte[] data = bos.toByteArray(); // stays the same
byte[] utf8 = new String(data, cs.displayName()).getBytes("UTF-8"); // new line (conversion)
cs = Charset.forName("UTF-8"); // set the charset to UTF-8
return new HTMLDocument(utf8, cs); // edited line
Boilerpipe's ArticleExtractor uses some algorithms that have been specifically tailored to English - measuring the number of words in average phrases, etc. In any language that is more or less verbose than English (i.e., every other language), these algorithms will be less accurate.
Additionally, the library uses some English phrases to try and find the end of the article (comments, post a comment, have your say, etc) which will clearly not work in other languages.
This is not to say that the library will outright fail - just be aware that some modification is likely needed for good results in non-English languages.
Java:
import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
public class Boilerpipe {
public static void main(String[] args) {
try{
URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
System.out.println(text);
}catch(Exception e){
e.printStackTrace();
}
}
}
Eclipse:
Run > Run Configurations > Common Tab. Set Encoding to Other(UTF-8), then click Run.
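Outside Eclipse, the equivalent (as far as I know) is to set the file.encoding system property when launching the JVM:

java -Dfile.encoding=UTF-8 Boilerpipe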
I had the same problem; cnr's solution works great. Just change the UTF-8 encoding to ISO-8859-1. Thanks!
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
I am writing a crawler/parser that should be able to process different types of content: RSS, Atom, and plain HTML files. To determine the correct parser, I wrote a class called ParseFactory, which takes a URL, tries to detect the content-type, and returns the correct parser.
Unfortunately, checking the content-type using the method provided in URLConnection doesn't always work. For example,
String contentType = url.openConnection().getContentType();
doesn't always provide the correct content-type (e.g. "text/html" where it should be RSS) or doesn't allow distinguishing between RSS and Atom (e.g. "application/xml" could be either an Atom or an RSS feed). To solve this problem, I started looking for clues in the InputStream itself. The problem is that I am having trouble coming up with an elegant class design in which I need to download the InputStream only once. In my current design, I first wrote a separate class that determines the correct content-type; next, the ParseFactory uses this information to create an instance of the corresponding parser, which in turn, when the method 'parse()' is called, downloads the entire InputStream a second time.
public Parser createParser(){
InputStream inputStream = null;
String contentType = null;
String contentEncoding = null;
ContentTypeParser contentTypeParser = new ContentTypeParser(this.url);
Parser parser = null;
try {
inputStream = new BufferedInputStream(this.url.openStream());
contentTypeParser.parse(inputStream);
contentType = contentTypeParser.getContentType();
contentEncoding = contentTypeParser.getContentEncoding();
assert (contentType != null);
inputStream = new BufferedInputStream(this.url.openStream());
if (contentType.equals(ContentTypes.rss))
{
logger.info("RSS feed detected");
parser = new RssParser(this.url);
}
else if (contentType.equals(ContentTypes.atom))
{
logger.info("Atom feed detected");
parser = new AtomParser(this.url);
}
else if (contentType.equals(ContentTypes.html))
{
logger.info("html detected");
parser = new HtmlParser(this.url);
parser.setContentEncoding(contentEncoding);
}
else if (contentType.equals(ContentTypes.UNKNOWN))
logger.debug("Unable to recognize content type");
if (parser != null)
parser.parse(inputStream);
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (inputStream != null)
inputStream.close();
} catch (IOException e) {
e.printStackTrace();
}
}
return parser;
}
Basically, I am looking for a solution that allows me to eliminate the second "inputStream = new BufferedInputStream(this.url.openStream())".
Any help would be greatly appreciated!
Side note 1: Just for the sake of being complete, I also tried using the URLConnection.guessContentTypeFromStream(inputStream) method, but this returns null way too often.
Side note 2: The XML parsers (Atom and RSS) are based on SAXParser, the HTML parser on Jsoup.
Can you just call mark and reset?
inputStream = new BufferedInputStream(this.url.openStream());
inputStream.mark(2048); // Or some other sensible number
contentTypeParser.parse(inputStream);
contentType = contentTypeParser.getContentType();
contentEncoding = contentTypeParser.getContentEncoding();
inputStream.reset(); // Let the parser have a crack at it now
Perhaps your ContentTypeParser should cache the content internally and feed it to the appropriate ContentParser instead of reacquiring data from the InputStream, along these lines:
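A minimal sketch of that caching idea (assuming the documents are small enough to buffer fully in memory):

// Read the stream once into memory...
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
try (InputStream in = this.url.openStream()) {
    byte[] chunk = new byte[4096];
    int n;
    while ((n = in.read(chunk)) != -1) {
        buffer.write(chunk, 0, n);
    }
}
byte[] content = buffer.toByteArray();

// ...then replay it from memory as often as needed.
contentTypeParser.parse(new ByteArrayInputStream(content));
// later, for whichever parser was chosen:
parser.parse(new ByteArrayInputStream(content));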
I am new to org.xhtmlrenderer.pdf.ITextRenderer and have this problem:
The PDFs that my test servlet streams to my Downloads folder are in fact empty files.
The relevant method, streamAndDeleteTheClob, is shown below.
The first try block is definitely not a problem.
The server spends a lot of time in the second try block. No exception thrown.
Can anyone suggest a solution to this problem or a good approach to debugging it?
Can anyone point me to essentially similar code that really works?
Any help would be much appreciated.
res.setContentType("application/pdf");
ServletOutputStream out = res.getOutputStream();
...
private boolean streamAndDeleteTheClob(int pageid,
Connection con,
ServletOutputStream out) throws IOException, ServletException {
Statement statement;
ResultSet resultSet;
Clob htmlpage;
StringBuffer pdfbuf = new StringBuffer();
final String pageToSendQuery = "SELECT text FROM page WHERE pageid = " + pageid;
// create xhtml file as a CLOB (Oracle large character object) and stream it into StringBuffer pdfbuf
try { // definitely no problem in this block
statement = con.createStatement();
resultSet = statement.executeQuery(pageToSendQuery);
if (resultSet.next()) {
htmlpage = resultSet.getClob(1);
} else {
return true;
}
final Reader in = htmlpage.getCharacterStream();
final char[] buffer = new char[4096];
int charsRead;
while ((charsRead = in.read(buffer)) != -1) {
pdfbuf.append(buffer, 0, charsRead); // append only the chars actually read
}
} catch (Exception ex) {
out.println("buffering CLOB failed: " + ex);
}
// create pdf from StringBuffer
try {
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(new InputSource(new StringReader(pdfbuf.toString())));
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(doc, null);
renderer.layout();
renderer.createPDF(out);
out.close();
} catch (Exception ex) {
out.println("streaming of pdf failed: " + ex);
}
deleteClob(con, pageid);
return false;
}
Using DocumentBuilder.parse this way will try to resolve the DTD referenced in the XHTML page, which takes a really long time. The easiest way to avoid that, if you are using Flying Saucer (xhtmlrenderer), is to create the document this way:
Document myDocument = XMLResource.load(myInputStream).getDocument();
Note that you can use XMLResource.load with a Reader too.
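Applied to the servlet above, the second try block could then start like this (a sketch, untested):

// Parse the buffered XHTML without fetching the DTD over the network
Document doc = XMLResource.load(new StringReader(pdfbuf.toString())).getDocument();
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(doc, null);
renderer.layout();
renderer.createPDF(out);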
Two things I can think of.
1) If the iText document is not closed, it'll be empty. Looks like renderer.finish() will work, but createPDF(out) should do that already.
2) If there are no pages, you could get an empty doc as well... so an empty input could result in a 0-byte PDF.
3) You might be getting a perfectly valid PDF that's not being streamed properly. Try writing to a ByteArrayOutputStream and checking the length there (see the sketch below).
4) An almost fanatical dedication to the Pope!
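For item 3, a minimal check could stand in for renderer.createPDF(out) in the second try block (a sketch):

ByteArrayOutputStream pdfBytes = new ByteArrayOutputStream();
renderer.createPDF(pdfBytes);
System.out.println("Generated PDF size: " + pdfBytes.size() + " bytes");
pdfBytes.writeTo(out); // stream the verified bytes to the servlet response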