Apache TikaParser throws uncatchable exceptions - java

I'm currently developing a tool that uses Apache Tika to extract the content from different files. In most cases everything works fine, but there are some files where Tika throws the following exception:
Mar 09, 2020 11:21:58 AM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$€-2]\ * "-"_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$€-2]\ * "-"_)' with c2: null
at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:373)
at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:287)
at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:191)
at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:193)
at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:167)
at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:343)
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:901)
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:873)
at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell(FormatTrackingHSSFListener.java:143)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell(ExcelExtractor.java:673)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord(ExcelExtractor.java:447)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord(ExcelExtractor.java:340)
at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord(FormatTrackingHSSFListener.java:92)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.processRecord(ExcelExtractor.java:666)
at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord(HSSFRequest.java:109)
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:178)
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:135)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:316)
at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:169)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at attproc.processors.AttachmentProcessor.run(AttachmentProcessor.java:68)
at attproc.Main.lambda$main$0(Main.java:89)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
I'm trying to catch this exception with the following code:
try {
byte[] content = Files.readAllBytes(path);
try {
Metadata metadata = new Metadata();
BodyContentHandler handler = new BodyContentHandler(-1);
ParseContext parseContext = new ParseContext();
parseContext.set(PDFParserConfig.class, tikaConfig.pdfConfig);
try {
tikaConfig.autoDetectParser.parse(new ByteArrayInputStream(content), handler, metadata, parseContext);
text = Optional.ofNullable(handler.toString()).orElse("");
} catch (Exception ignored) {}
} catch (Exception ignored) {
}
} catch (IOException ignored) {
}
"tikaConfig" is a singleton object:
public class TikaConfiguration {
private final TikaConfig tikaConfig;
public final PDFParserConfig pdfConfig;
public final Parser autoDetectParser;
private static TikaConfiguration instance;
private TikaConfiguration() throws Exception {
ClassLoader classLoader = getClass().getClassLoader();
InputStream stream = classLoader.getResourceAsStream("tikaconfig.xml");
this.tikaConfig = new TikaConfig(stream);
this.pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(false);
tikaConfig.getDetector();
autoDetectParser = new AutoDetectParser(tikaConfig);
}
public static TikaConfiguration setConfiguration() {
if (TikaConfiguration.instance == null) {
try {
TikaConfiguration.instance = new TikaConfiguration();
} catch (Exception ignored) {}
}
return TikaConfiguration.instance;
}
}
What do I have to do to catch this exception?

Take a look at this somewhat old thread. What you are seeing looks very similar. It suggests that the POI library, used by Tika for parsing Excel, is throwing a warning, not an error (and your log output reflects that also). The warning happens to include a stack trace in its logging (caught by POI I assume, then passed on to Tika).
The warning would therefore not be caught by your code (it's not a thrown exception).
As one commenter mentions in the JIRA:
I'm not sure this is even a bug. This is the output of the POILogger, not, e.g. printStackTrace().
Regardless of its status as a bug, a work-around is also proposed in the JIRA: When running the application, redirect the err stream to null (an example is provided).
I downloaded the spreadsheet attached to the JIRA and I was able to recreate their version of your message:
WARNING: Invalid format: "_([$Ç-2]\ * #,##0.00_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * #,##0.00_)' with c2: null
at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:373)
at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:287)
at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:191)
at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:193)
...
However, my program completed successfully. It went on to generate its output correctly.
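If the stack trace in the log is the only problem, one option is to turn that logger off rather than try to catch anything. The following is a minimal sketch, assuming the message is emitted through java.util.logging (the "Mar 09, 2020 ... WARNING:" header matches the default java.util.logging console format) and that the logger's name lies under org.apache.poi.ss.format. That logger name is inferred from the log header and is not confirmed by the JIRA, so check it against your POI version. QuietPoiFormatLogging is a hypothetical helper name.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class QuietPoiFormatLogging {
    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // configured logger could otherwise be garbage collected and lose its level.
    private static final Logger FORMAT_LOGGER =
            Logger.getLogger("org.apache.poi.ss.format"); // assumed logger namespace

    private QuietPoiFormatLogging() {
    }

    /** Turns off the assumed namespace; child loggers without their own level inherit it. */
    static void silence() {
        FORMAT_LOGGER.setLevel(Level.OFF);
    }
}
```

Calling QuietPoiFormatLogging.silence() once, before the first parse, should be enough. Unlike redirecting the err stream, this leaves other output on System.err untouched.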

Related

How to create mock CsvExceptions to use with csvToBean.getCapturedExceptions()

I am trying to write some unit tests to see if a logging method gets called for csv exceptions. The flow goes something like this:
CsvToBean is used to parse some info and each bean that is produced has some work done on it.
After all this, CsvToBean.getCapturedExceptions().forEach() is used to process the exceptions.
How do I create some of these exceptions for testing?
public void parseAndSaveReportToDB(Reader reader, String reportFileName,ItemizedActivityRepository iaRepo,
ICFailedRecordsRepository icFailedRepo,
String reportCols) throws Exception {
try {
CsvToBean<ItemizedActivity> csvToBean = new CsvToBeanBuilder<ItemizedActivity>(reader).withType(ItemizedActivity.class).withThrowExceptions(false).build();
csvToBean.parse().forEach(itmzActvty -> {
itmzActvty.setReportFileName(reportFileName);
String liteDesc = itmzActvty.getBalanceTransactionDescription();
if (liteDesc.contains(":")) {
liteDesc = liteDesc.substring(liteDesc.indexOf(":")+1).trim();
}
itmzActvty.setLiteDescription(liteDesc);
itmzActvty.setAmount(convertCentToDollar(itmzActvty.getAmount()));
iaRepo.save(itmzActvty);
});
log.info("Successfully saved report data in DB");
csvToBean.getCapturedExceptions().forEach(csvExceptionObj -> logFailedRecords(reportFileName, csvExceptionObj, icFailedRepo, reportCols));
reader.close();
} catch (Exception ex) {
log.error("Exception when saving report data to DB", ex);
throw ex;
}
}
In this code I need to trigger the logFailedRecords method. To do so I need to fill the captured exceptions queue with an exception. I don't know how to get an exception in there.
What I have so far is not much, since I keep hitting walls:
@Test
public void testParseAndSaveReportToDBWithExceptions() throws Exception {
// CsvException csvExceptionObject = new CsvException("testException");
CsvToBean<ItemizedActivity> csvToBean = mock(CsvToBean.class);//<ItemizedActivity>(reader).withType(ItemizedActivity.class).withThrowExceptions(false).build().class);
BufferedReader reader = mock(BufferedReader.class);
ReportingMetadata rmd = this.getReportingMetadata();
verify(this.reportsUtil).parseAndSaveReportToDB(reader,"test.csv",
this.iaRepo,this.icFailedRepo,rmd.getReportCols());
// System.out.println(csvToBean.getCapturedExceptions().toString());
}

Possible bug with load() and parse() methods in PDFBox?

I tried to use PDFBox on regular .pdf files and it worked fine.
However, when I encountered a corrupted .pdf, the code would "freeze": it did not throw any error, the load or parse call simply took forever to execute.
Here is the corrupted file (I have zipped it so that everybody can download it). It is probably not a native PDF file, but it was saved with a .pdf extension and it is only 4 KB.
I am not an expert at all, but I think that this is a bug in PDFBox. According to the documentation, both the load() and parse() methods are supposed to throw exceptions if they fail. With my file, however, the code takes forever to execute and never throws an exception.
I tried using only load(); one can try parse(), and the result is the same:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
public class TestTest {
public static void main(String[] args) throws FileNotFoundException, IOException {
System.out.println(pdfToText("C:\\..............MYFILE.pdf"));
System.out.println("done ! ! !");
}
private static String pdfToText(String fileName) throws IOException {
PDDocument document = null;
document = PDDocument.load(new File(fileName)); // THIS TAKES FOREVER
PDFTextStripper stripper = new PDFTextStripper();
document.close();
return stripper.getText(document);
}
}
How can I force this code to throw an exception or stop executing if the .pdf file is corrupted?
Thanks
Try this solution:
private static String pdfToText(String fileName) {
PDDocument document = null;
try {
document = PDDocument.load(fileName);
PDFTextStripper stripper = new PDFTextStripper();
return stripper.getText(document);
} catch (IOException e) {
System.err.println("Unable to open PDF Parser. " + e.getMessage());
return null;
} finally {
if (document != null) {
try {
document.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
For implementing simple timeouts for 3rd party libs I often use an implementation like Apache Commons ThreadMonitor:
long timeoutInMillis = 1000;
try {
Thread monitor = ThreadMonitor.start(timeoutInMillis);
// do some work here
ThreadMonitor.stop(monitor);
} catch (InterruptedException e) {
// timed amount was reached
}
Example code is from Apache's ThreadMonitor Javadoc.
I only use this when the 3rd party API does not provide some timeout mechanism, of course.
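A similar timeout can be sketched with only the JDK's java.util.concurrent classes. Note that this swaps in a different technique (Future.get with a timeout) and is not how ThreadMonitor works internally. TimeoutRunner and callWithTimeout are hypothetical names. Cancelling only interrupts the worker thread, so code that ignores interruption (as a parser stuck in a loop may) keeps running in the background, which is why the worker is created as a daemon thread here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutRunner {
    private TimeoutRunner() {
    }

    /** Runs the task on a worker thread and gives up after timeoutMillis. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "timeout-worker");
            worker.setDaemon(true); // a stuck task must not keep the JVM alive
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; it may or may not react
            throw e;
        } catch (ExecutionException e) {
            // Unwrap so callers see the task's own exception where possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

For the PDF case this could be called as TimeoutRunner.callWithTimeout(() -> pdfToText(fileName), 10000), treating a TimeoutException as "file is probably corrupted".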
However, I was forced to tweak this a bit some weeks ago, because this solution does not work well with (3rd party) code that uses exception masking.
In particular, we ran into problems with c3p0, which masks all exceptions (including InterruptedException). Our solution was to tweak the implementation to also check the exception's cause chain for an InterruptedException.
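The cause-chain check described above might look like the following sketch. InterruptDetection and causedByInterrupt are hypothetical names, and this is not the code we actually used, only an illustration of the idea.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

final class InterruptDetection {
    private InterruptDetection() {
    }

    /** Returns true if the throwable or any of its causes is an InterruptedException. */
    static boolean causedByInterrupt(Throwable thrown) {
        // Track visited throwables by identity to guard against cyclic cause chains.
        Set<Throwable> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Throwable t = thrown; t != null && visited.add(t); t = t.getCause()) {
            if (t instanceof InterruptedException) {
                return true;
            }
        }
        return false;
    }
}
```

In a catch block around the third-party call, a true result can then be treated the same way as a directly thrown InterruptedException.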

How to extract useful information from TransformerException

I am using javax.xml.transform.* to do XSLT transformation. Since the xslt file to be used comes from the outside world there could be errors in that file, and I am going to give back some meaningful response to the user.
Although I can easily catch the TransformerException, I found no way to obtain enough information from it. For example, if a tag is missing its end-tag, printStackTrace() gives a scary message:
javax.xml.transform.TransformerConfigurationException: Could not compile stylesheet
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl.newTemplates(Unknown Source)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl.newTransformer(Unknown Source)
... (100 lines)
and getMessage() gives only
Could not compile stylesheet
None of them gives the real reason of the error.
I noticed that in Eclipse test console I can see the following
[Fatal Error] :259:155: The element type "sometag" must be terminated by the matching end-tag "</sometag>".
ERROR: 'The element type "sometag" must be terminated by the matching end-tag "</sometag>".'
FATAL ERROR: 'Could not compile stylesheet'
This is exactly what I want. Unfortunately, since this is a web application, the user cannot see this.
How can I display the correct error message to the user?
Put your own ErrorListener on your Transformer instance using Transformer.setErrorListener, like so:
final List<TransformerException> errors = new ArrayList<TransformerException>();
Transformer transformer = ... ;
transformer.setErrorListener(new ErrorListener() {
@Override
public void error(TransformerException exception) {
errors.add(exception);
}
@Override
public void fatalError(TransformerException exception) {
errors.add(exception);
}
@Override
public void warning(TransformerException exception) {
// handle warnings as well if you want them
}
});
// Any other transformer setup
Source xmlSource = ... ;
Result outputTarget = ... ;
try {
transformer.transform(xmlSource, outputTarget);
} catch (TransformerException e) {
errors.add(e); // Just in case one is thrown that isn't handled
}
if (!errors.isEmpty()) {
// Handle errors
} else {
// Handle output since there were no errors
}
This will log all the errors that occur into the errors list, then you can use the messages off those errors to get what you want. This has the added benefit that it will try to resume the transformation after the errors occur. If this causes any problems, just rethrow the exception by doing:
@Override
public void error(TransformerException exception) throws TransformerException {
errors.add(exception);
throw exception;
}
@Override
public void fatalError(TransformerException exception) throws TransformerException {
errors.add(exception);
throw exception;
}
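Note that the exception in the question is a TransformerConfigurationException, raised while the stylesheet is compiled, before any Transformer exists. In that situation the listener probably has to go on the TransformerFactory instead (TransformerFactory.setErrorListener is part of JAXP). A minimal sketch, assuming the JDK's built-in XSLT processor; StylesheetCompileCheck and compileErrors are hypothetical names:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.ErrorListener;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

final class StylesheetCompileCheck {
    private StylesheetCompileCheck() {
    }

    /** Returns the messages reported while compiling the stylesheet; empty if it compiled. */
    static List<String> compileErrors(String xsl) {
        final List<String> messages = new ArrayList<>();
        TransformerFactory factory = TransformerFactory.newInstance();
        factory.setErrorListener(new ErrorListener() {
            @Override
            public void warning(TransformerException e) {
                // warnings are ignored in this sketch
            }

            @Override
            public void error(TransformerException e) {
                messages.add(e.getMessageAndLocation());
            }

            @Override
            public void fatalError(TransformerException e) {
                messages.add(e.getMessageAndLocation());
            }
        });
        try {
            factory.newTemplates(new StreamSource(new StringReader(xsl)));
        } catch (TransformerException e) {
            // Keep the thrown message too, in case the listener never saw it.
            String thrown = e.getMessageAndLocation();
            if (!messages.contains(thrown)) {
                messages.add(thrown);
            }
        }
        return messages;
    }
}
```

The returned list can be shown to the user directly. Which messages a processor passes to the listener, and in what order, depends on the implementation.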
Firstly, it's likely that any solution will depend on your choice of XSLT processor. Different implementations of the JAXP interface might well provide different information in the exceptions they generate.
It's possible that the error from the XML parser is available in a wrapped exception. For historic reasons, TransformerConfigurationException offers both getException() and getCause() to access wrapped exceptions, and it may be worth checking them both.
Alternatively it's possible that the information was supplied in a separate call to the ErrorListener.
Finally, this particular error is detected by the XML parser (not the XSLT processor) so in the first instance it will be handled by the parser. It may well be worth setting the parser's ErrorHandler and catching parsing errors at that level. If you want explicit control over the XML parser used by the transformation, use a SAXSource whose XMLReader is suitably initialized.
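The parser-level approach can be sketched as follows: the stylesheet is handed over as a SAXSource whose XMLReader carries its own ErrorHandler, so well-formedness errors are collected with line and column numbers. This assumes the XSLT processor actually uses the supplied XMLReader, which is implementation-dependent. StylesheetParseErrors and collect are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

final class StylesheetParseErrors {
    private StylesheetParseErrors() {
    }

    /** Compiles the stylesheet and returns the parser problems seen on the way. */
    static List<String> collect(String xsl) throws ParserConfigurationException, SAXException {
        final List<String> problems = new ArrayList<>();
        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true); // XSLT requires namespace-aware parsing
        XMLReader reader = parserFactory.newSAXParser().getXMLReader();
        reader.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException e) {
                // warnings are ignored in this sketch
            }

            @Override
            public void error(SAXParseException e) {
                problems.add(describe(e));
            }

            @Override
            public void fatalError(SAXParseException e) {
                problems.add(describe(e));
            }
        });
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(xsl)));
        try {
            TransformerFactory.newInstance().newTemplates(source);
        } catch (TransformerConfigurationException e) {
            // Fall back to the exception text if the handler was never called.
            if (problems.isEmpty()) {
                problems.add(e.getMessageAndLocation());
            }
        }
        return problems;
    }

    private static String describe(SAXParseException e) {
        return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
    }
}
```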
You can redirect System.out to your own OutputStream.
Using an ErrorListener doesn't catch all of the output.
If you work with threads, you can look here (http://maiaco.com/articles/java/threadOut.php) to avoid changing System.out for other threads.
Example:
public final class XslUtilities {
private XslUtilities() {
// only static methods
}
public static class ConvertWithXslException extends Exception {
public ConvertWithXslException(String message, Throwable cause) {
super(message, cause);
}
}
public static String convertWithXsl(String input, String xsl) throws ConvertWithXslException {
ByteArrayOutputStream systemOutByteArrayOutputStream = new ByteArrayOutputStream();
PrintStream oldSystemOutPrintStream = System.out;
System.setOut(new PrintStream(systemOutByteArrayOutputStream));
ByteArrayOutputStream systemErrByteArrayOutputStream = new ByteArrayOutputStream();
PrintStream oldSystemErrPrintStream = System.err;
System.setErr(new PrintStream(systemErrByteArrayOutputStream));
String resultXml;
try {
System.setProperty("javax.xml.transform.TransformerFactory", "net.sf.saxon.TransformerFactoryImpl");
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer(new StreamSource(new StringReader(xsl)));
StringWriter stringWriter = new StringWriter();
transformer.transform(new StreamSource(new StringReader(input)), new StreamResult(stringWriter));
resultXml = stringWriter.toString();
} catch (TransformerException e) {
System.out.flush();
final String systemOut = systemOutByteArrayOutputStream.toString();
System.err.flush();
final String systemErr = systemErrByteArrayOutputStream.toString();
throw new ConvertWithXslException("TransformerException - " + e.getMessageAndLocation()
+ (systemOut.length() > 0 ? ("\nSystem.out:" + systemOut) : "")
+ (systemErr.length() > 0 ? ("\nSystem.err:" + systemErr) : ""), e);
} finally {
System.setOut(oldSystemOutPrintStream);
System.setErr(oldSystemErrPrintStream);
}
return resultXml;
}
}

Unable to catch exception for bad URL in JAVA class thrown by SCALA class

I'm writing a program that should read data from an online XML file. The computation is done by classes written in Scala. If an exception is caught there, it must be rethrown to a Java class that will handle it. For some reason, I get an error about the exception type. What is the right exception to throw when trying to access a bad URL or hitting a similar issue (for example, no internet connection)? Thanks!
The main class (Scala)
object test
{
def main(args:Array[String])
{
val x:A = new A()
}
}
The class that parses the XML file and tries to access the URL
import java.net.{URL, URLConnection}
import xml.{XML, Elem}
import java.lang.NullPointerException
import java.io.IOException
class XMLparser {
@throws(classOf[Exception])
private val connectionXMLURL = "http://www.boi.org.il/currency.xml1"
private var urlConnection:URLConnection = null
private var url:URL = null
private var doc:Elem = null
private val currencies = new java.util.LinkedHashMap[String,java.lang.Double]()
try
{
url = new URL(connectionXMLURL)
urlConnection = url.openConnection
doc = XML.load(urlConnection.getInputStream)
}
catch
{
case ex: org.xml.sax.SAXParseException => //is caught!!! and thrown!
{
throw ex
}
case e: IOException =>
{
throw e
//add error log!!
}
case e: NullPointerException =>
{
throw e
//add error log!!
}
case e: Exception =>
{
throw e
}
}
}
The class that should catch the exception (Java)
public class A
{
private XMLparser x;
public A()
{
try
{
x = new XMLparser();
}
catch(org.xml.sax.SAXParseException e) //Cannot catch it!!??
{
}
/* catch (Exception e)
{
e.printStackTrace();
} */
}
}
EDIT: this is the error message I get when trying to catch the exception:
scala: warning: [options] bootstrap class path not set in conjunction with -source 1.6
scala: C:\Users\home\Dropbox\Exchange Currency\src\il\hit\currencyExchange\CurrencyExchangeGUI.java:138: error: exception SAXParseException is never thrown in body of corresponding try statement
scala: catch (org.xml.sax.SAXParseException ex)
The problem is that you need to declare that your constructor (the default one) throws a checked Java exception. I saw that you already added the @throws(classOf[Exception]) annotation to your code, but it is slightly misplaced for your purposes. Check this link:
https://issues.scala-lang.org/browse/SI-1420
It should look like
class XMLparser @throws(classOf[Exception]) {
private val connectionXMLURL = "http://www.boi.org.il/currency.xml1"
...
For a URLConnection, it can be a SocketTimeoutException for both connection and input stream timeouts. It can also be an IOException for an unavailable website.
You can find this in the Javadoc for URLConnection.
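On the Java side, the two cases can be told apart by catching SocketTimeoutException before the more general IOException. The following is a minimal sketch; UrlFetch, Outcome and tryOpen are hypothetical names, and the timeout value is only an example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;
import java.net.URL;
import java.net.URLConnection;

final class UrlFetch {
    enum Outcome { OK, TIMEOUT, IO_ERROR }

    private UrlFetch() {
    }

    /** Tries to open the address and read one byte, classifying any failure. */
    static Outcome tryOpen(String address, int timeoutMillis) {
        try {
            // new URL(...) throws MalformedURLException (an IOException) for a bad address.
            URLConnection connection = new URL(address).openConnection();
            connection.setConnectTimeout(timeoutMillis);
            connection.setReadTimeout(timeoutMillis);
            try (InputStream in = connection.getInputStream()) {
                in.read();
                return Outcome.OK;
            }
        } catch (SocketTimeoutException e) {
            // Must come first: SocketTimeoutException is a subclass of IOException.
            return Outcome.TIMEOUT;
        } catch (IOException e) {
            // Covers unknown hosts, refused connections, missing resources and malformed URLs.
            return Outcome.IO_ERROR;
        }
    }
}
```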

loader.InputStreams with no valid reference is closed

While upgrading Sun Application Server 8.2 to a new patch level, an exception occurred and I don't know why. The following is a code snippet from a servlet:
public void init() throws ServletException {
Properties reqProperties = new Properties();
try {
reqProperties.load(this.getClass().getResourceAsStream(
"/someFile.properties"));
} catch (IOException e) {
e.printStackTrace();
}
...
}
The file does exist on the classpath, and in previous patch versions it worked just fine, but now deploying results in an exception. The stack trace:
[#|2010-04-14T16:43:48.208+0200|WARNING|sun-appserver-ee8.2|javax.enterprise.system.core.classloading|_ThreadID=11;|loader.InputStreams with no valid reference is closed
java.lang.Throwable
at com.sun.enterprise.loader.EJBClassLoader$SentinelInputStream.<init>(EJBClassLoader.java:1172)
at com.sun.enterprise.loader.EJBClassLoader.getResourceAsStream(EJBClassLoader.java:858)
at java.lang.Class.getResourceAsStream(Class.java:1998)
at a.package.TestServlet.init(TestServlet.java:44)
at javax.servlet.GenericServlet.init(GenericServlet.java:261)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:592)
at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:249)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:282)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:165)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:118)
at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1093)
at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:931)
at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:4183)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4535)
at com.sun.enterprise.web.WebModule.start(WebModule.java:241)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1086)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:847)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1086)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:483)
at org.apache.catalina.startup.Embedded.start(Embedded.java:894)
at com.sun.enterprise.web.WebContainer.start(WebContainer.java:741)
at com.sun.enterprise.web.HttpServiceWebContainer.startInstance(HttpServiceWebContainer.java:963)
at com.sun.enterprise.web.HttpServiceWebContainerLifecycle.onStartup(HttpServiceWebContainerLifecycle.java:50)
at com.sun.enterprise.server.ApplicationServer.onStartup(ApplicationServer.java:300)
at com.sun.enterprise.server.PEMain.run(PEMain.java:308)
at com.sun.enterprise.server.PEMain.main(PEMain.java:221)
|#]
I have no idea what the problem could be. Does anyone have any idea?
(note that I changed some names in the code and stacktrace)
Are you sure it throws an exception? We get warnings like this in Glassfish all the time. The EJBClassLoader uses a throwable to dump the stack trace so it may look like an exception to you.
EJBClassLoader wraps all streams with sentinels. This warning simply tells you that your stream is not closed. You can safely ignore it. To get rid of the warning, you have to close the stream after you use it.
You should always close input streams after using them:
public void init() throws ServletException {
InputStream str = null;
Properties reqProperties = new Properties();
try {
str = this.getClass().getResourceAsStream("/someFile.properties");
reqProperties.load(str);
} catch (IOException e) {
e.printStackTrace();
} finally {
if (str != null) {
try {
str.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
By the way, the finally clause can be made a lot simpler using Apache Commons IO:
finally {
IOUtils.closeQuietly(str);
}
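On Java 7 or later (which the application server in the question may predate), try-with-resources closes the stream without an explicit finally block and without an extra library. A minimal sketch; PropertiesLoader and its methods are hypothetical names:

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

final class PropertiesLoader {
    private PropertiesLoader() {
    }

    /** Loads properties from the stream and always closes it, even on failure. */
    static Properties load(InputStream in) throws IOException {
        Properties properties = new Properties();
        try (InputStream stream = in) {
            properties.load(stream);
        }
        return properties;
    }

    /** Looks the resource up relative to the given class, as the servlet code does. */
    static Properties loadFromClasspath(Class<?> anchor, String resourcePath) throws IOException {
        InputStream in = anchor.getResourceAsStream(resourcePath);
        if (in == null) {
            // getResourceAsStream returns null instead of throwing for a missing resource.
            throw new FileNotFoundException("Resource not found on classpath: " + resourcePath);
        }
        return load(in);
    }
}
```

In the servlet's init() this would be called as PropertiesLoader.loadFromClasspath(getClass(), "/someFile.properties").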