What HTML parser can tidy up this code? - java

I'm using Java 6. I want to find a tool that can clean up ugly HTML. Specifically, I would like a tool that could deal with the following ...
<script type="text/javascript">
document.write(
'<scr'+'ipt src="http://ox-d.journatic.com/w/1.0/jstag"><\/scr'+'ipt>');
</script>
I have tried JSoup v 1.6.2 to deal with the above, but running the above code using
final org.jsoup.nodes.Document doc = Jsoup.parse(html);
final String formattedHtml = doc.toString();
returns the same code. THe problem with the above is taht when I try and parse it with
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
final DocumentBuilder builder = factory.newDocumentBuilder();
final InputSource s = new InputSource(new StringReader(cleanedUpHtml));
org.w3c.dom.Document result = builder.parse(s);
I get the exception ...
org.xml.sax.SAXParseException: Element type "scr" must be followed by either attribute specifications, ">" or "/>".
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:249)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
at com.myco.myproject.util.XmlUtilities.getStringAsDocument(XmlUtilities.java:146)
at com.myco.myproject.util.NetUtilities.getUrlAsDocument(NetUtilities.java:54)
at com.myco.myproject.parsers.impl.AbstractMetromixParser.parsePage(AbstractMetromixParser.java:107)
at com.myco.myproject.parsers.impl.AbstractMetromixParser.getEvents(AbstractMetromixParser.java:76)
at com.myco.myproject.domain.EventFeed.refresh(EventFeed.java:81)
at com.myco.myproject.domain.EventFeed.getEvents(EventFeed.java:72)
at com.myco.myproject.parsers.impl.MetromixParserTest.testParser(MetromixParserTest.java:21)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.springframework.test.context.junit4.statements.RunBeforeTestMethodCallbacks.evaluate(RunBeforeTestMethodCallbacks.java:74)
at org.springframework.test.context.junit4.statements.RunAfterTestMethodCallbacks.evaluate(RunAfterTestMethodCallbacks.java:83)
at org.springframework.test.context.junit4.statements.SpringRepeat.evaluate(SpringRepeat.java:72)
at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:231)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.springframework.test.context.junit4.statements.RunBeforeTestClassCallbacks.evaluate(RunBeforeTestClassCallbacks.java:61)
at org.springframework.test.context.junit4.statements.RunAfterTestClassCallbacks.evaluate(RunAfterTestClassCallbacks.java:71)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.run(SpringJUnit4ClassRunner.java:174)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Any suggestions on an HTML parser that can tidy up the above? - Dave

You do not need to parse the String into a html document. Just use Jsoup.clean() on the raw String. See a simple example at http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer

Related

java.lang.IllegalArgumentException: While deserializing JSON feed , Using OdatajClient

I keep running into a java.lang.IllegalArgumentException: While deserializing JSON feed
Whenever I run this function inside a JUnit test. Any Ideas what could be causing the issue
public ODataEntity getInvoicedOrderDetails(Long orderNumber) {
ODataURIBuilder uriBuilder = new ODataURIBuilder(serviceRootUri).appendEntitySetSegment(orders).top(3);
ODataEntitySetRequest req = ODataRetrieveRequestFactory.getEntitySetRequest(uriBuilder.build());
req.setFormat(ODataPubFormat.JSON);
ODataRetrieveResponse<ODataEntitySet> groupResult = req.execute();
ODataEntitySet result = groupResult.getBody();
ODataEntity res = result.getEntities().get(0);
return res ;
}
Full Trace from the junit test
java.lang.IllegalArgumentException: While deserializing JSON feed
at com.msopentech.odatajclient.engine.data.Deserializer.toJSONFeed(Deserializer.java:234)
at com.msopentech.odatajclient.engine.data.Deserializer.toFeed(Deserializer.java:116)
at com.msopentech.odatajclient.engine.data.ODataReader.readEntitySet(ODataReader.java:63)
at com.msopentech.odatajclient.engine.communication.request.retrieve.ODataEntitySetRequest$ODataEntitySetResponseImpl.getBody(ODataEntitySetRequest.java:89)
at com.msopentech.odatajclient.engine.communication.request.retrieve.ODataEntitySetRequest$ODataEntitySetResponseImpl.getBody(ODataEntitySetRequest.java:61)
at com.sbu.orderLookup.orderLookup.getInvoicedOrderDetails(orderLookup.java:43)
at com.sbu.orderLookup.orderLookupTest.retrieveInvoiceNumberTest(orderLookupTest.java:35)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "Items" (class com.msopentech.odatajclient.engine.data.json.JSONFeed), not marked as ignorable (4 known properties: , "value", "odata.count", "odata.nextLink", "odata.metadata"])
at [Source: org.apache.http.conn.EofSensorInputStream#6616b9; line: 1, column: 11] (through reference chain: com.msopentech.odatajclient.engine.data.json.JSONFeed["Items"])
at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:79)
at com.fasterxml.jackson.databind.DeserializationContext.reportUnknownProperty(DeserializationContext.java:555)
at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:708)
at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1160)
at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:315)
at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:121)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2888)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2080)
at com.msopentech.odatajclient.engine.data.Deserializer.toJSONFeed(Deserializer.java:232)
... 31 more

How to fetch a map with value as list in mybatis

I am trying to fetch a Map< X, List> from a mapper function in mybatis using annotations something like :
#Select("SELECT * FROM relation WHERE id = #{id}")
Map<X, List<Y>> getXYRelations(int id );
Using this I am getting a TooManyResultException, something like this :
org.apache.ibatis.exceptions.TooManyResultsException: Expected one result (or null) to be returned by selectOne(), but found: 2
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectOne(DefaultSqlSession.java:63)
at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:95)
at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:40)
at sun.proxy.$Proxy51.getXYRelation(Unknown Source)
at com.project.service.XYServiceImpl.getXYRelation(XYServiceImpl.java:548)
at com.project.ee.test.JunitTest1.test(JunitTest1.java:127)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
So I added a mapkey something like this:
#Select("Select * FROM relation WHERE id = #{id}")
#MapKey("X")
Map<X, List<Y>> getXYRelations(int id);
but I am still getting failure with the following trace :
org.apache.ibatis.exceptions.PersistenceException:
### Error querying database. Cause: java.lang.UnsupportedOperationException
### The error may involve com.project.data.mapper.XYMapper.getXYRelation-Inline
### The error occurred while setting parameters
### SQL: SELECT * FROM relation WHER id = ?
### Cause: java.lang.UnsupportedOperationException
at org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:23)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:104)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectMap(DefaultSqlSession.java:78)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectMap(DefaultSqlSession.java:74)
at org.apache.ibatis.binding.MapperMethod.executeForMap(MapperMethod.java:158)
at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:92)
at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:40)
at sun.proxy.$Proxy51.getAllUsersWithRoles(Unknown Source)
at com.project.ee.test.JunitTest1.test(JunitTest1.java:127)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: java.lang.UnsupportedOperationException
at org.apache.ibatis.reflection.wrapper.CollectionWrapper.findProperty(CollectionWrapper.java:42)
at org.apache.ibatis.reflection.MetaObject.findProperty(MetaObject.java:86)
at org.apache.ibatis.executor.resultset.FastResultSetHandler.applyAutomaticMappings(FastResultSetHandler.java:332)
at org.apache.ibatis.executor.resultset.FastResultSetHandler.getRowValue(FastResultSetHandler.java:261)
at org.apache.ibatis.executor.resultset.FastResultSetHandler.handleRowValues(FastResultSetHandler.java:214)
at org.apache.ibatis.executor.resultset.FastResultSetHandler.handleResultSet(FastResultSetHandler.java:186)
at org.apache.ibatis.executor.resultset.FastResultSetHandler.handleResultSets(FastResultSetHandler.java:152)
at org.apache.ibatis.executor.statement.PreparedStatementHandler.query(PreparedStatementHandler.java:57)
at org.apache.ibatis.executor.statement.RoutingStatementHandler.query(RoutingStatementHandler.java:70)
at org.apache.ibatis.executor.SimpleExecutor.doQuery(SimpleExecutor.java:57)
at org.apache.ibatis.executor.BaseExecutor.queryFromDatabase(BaseExecutor.java:267)
at org.apache.ibatis.executor.BaseExecutor.query(BaseExecutor.java:141)
at org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:105)
at org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:81)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:101)
... 33 more
I need List in the value somehow and I am not wanting to write a business logic for fetching a different list conveniently and converting it into the required as that will be of O(n) minimum. So can someone tell me something about this?
It should be
#Select("SELECT * FROM relation WHERE id = #{id}")
List<Map<String,Object>> getXYRelations(int id );

Log4j + Amqp Log sample Code is Failing

I am trying to create a Distributed Logging system, where my each tomcat instance will put the logs into a exchange/queue and it will be accumulated and stored in one plcae.
I was reading about Log4j + Amqp using rabbitmq.
I have downloaded the following sample code also.
https://nodeload.github.com/SpringSource/spring-amqp-samples/zipball/master
Imported project log4j from the package.
When i run the IntegrationTest.java (Junit). I get a failure. Following is the trace.
java.lang.AssertionError: Time out waiting for message
at org.junit.Assert.fail(Assert.java:91)
at org.junit.Assert.assertTrue(Assert.java:43)
at org.springframework.amqp.rabbit.log4j.web.controller.IntegrationTest.logInfo(IntegrationTest.java:75)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.springframework.test.context.junit4.statements.RunBeforeTestMethodCallbacks.evaluate(RunBeforeTestMethodCallbacks.java:74)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
at org.springframework.test.context.junit4.statements.RunAfterTestMethodCallbacks.evaluate(RunAfterTestMethodCallbacks.java:82)
at org.springframework.test.context.junit4.statements.SpringRepeat.evaluate(SpringRepeat.java:72)
at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:240)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.springframework.test.context.junit4.statements.RunBeforeTestClassCallbacks.evaluate(RunBeforeTestClassCallbacks.java:61)
at org.springframework.test.context.junit4.statements.RunAfterTestClassCallbacks.evaluate(RunAfterTestClassCallbacks.java:70)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.run(SpringJUnit4ClassRunner.java:180)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
This is happening because no logs is coming to the listener AmqpLogMessageListener in the watch time.
I have some doubts about the code:
How the listener AmqpLogMessageListener.java is registered for listening to the configured exchange.
Why the Test is failing ?
Juste to be sure, check your log4j properties file against the source https://github.com/SpringSource/spring-amqp-samples/blob/master/log4j/src/main/resources/log4j.properties.
Then, it may be your connection on RabbitMQ which can be long to obtain, so change this:
new CountDownLatch(1);
by this:
new CountDownLatch(5);

Way it ignore entity reference resolving?

I'm using Java 6 and the latest version of Xerces. I'm trying to parse an HTML document that begins like this ...
<!DOCTYPE html>
and later references the entity "&raquo". Parsing dies with the exception ...
org.xml.sax.SAXParseException: The entity "raquo" was referenced, but not declared.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:249)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
at com.myco.myproject.util.XmlUtilities.getStringAsDocument(XmlUtilities.java:147)
at com.myco.myproject.util.NetUtilities.getUrlAsDocument(NetUtilities.java:65)
at com.myco.myproject.parsers.impl.AbstractMetromixParser.parsePage(AbstractMetromixParser.java:107)
at com.myco.myproject.parsers.impl.AbstractMetromixParser.getEvents(AbstractMetromixParser.java:76)
at com.myco.myproject.domain.EventFeed.refresh(EventFeed.java:81)
at com.myco.myproject.domain.EventFeed.getEvents(EventFeed.java:72)
at com.myco.myproject.parsers.impl.MetromixParserTest.testParser(MetromixParserTest.java:21)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.springframework.test.context.junit4.statements.RunBeforeTestMethodCallbacks.evaluate(RunBeforeTestMethodCallbacks.java:74)
at org.springframework.test.context.junit4.statements.RunAfterTestMethodCallbacks.evaluate(RunAfterTestMethodCallbacks.java:83)
at org.springframework.test.context.junit4.statements.SpringRepeat.evaluate(SpringRepeat.java:72)
at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:231)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.springframework.test.context.junit4.statements.RunBeforeTestClassCallbacks.evaluate(RunBeforeTestClassCallbacks.java:61)
at org.springframework.test.context.junit4.statements.RunAfterTestClassCallbacks.evaluate(RunAfterTestClassCallbacks.java:71)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.run(SpringJUnit4ClassRunner.java:174)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Is there any way to tell the parser to ignore these types of entities it cannot resolve? If not, what resolver do I have to plugin?
Edit: Here is how I am parsing the HTML, which is actually XHTML. I pass the String through JSoup to clean it up before I attempt the below ...
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setExpandEntityReferences(false);
final DocumentBuilder builder = factory.newDocumentBuilder();
final InputSource s = new InputSource(new StringReader(str));
org.w3c.dom.Document result = builder.parse(s);
Starting from 1.10.3 version JSoup provides the W3CDom helper class which allows you to convert your org.jsoup.nodes.Document directly to the org.w3c.dom.Document.
Consider the following example:
String str =
"<!DOCTYPE html>" +
"<html>" +
"<dody>" +
"<div>ยป example</div>" +
"</dody>" +
"</html>";
Document document = Jsoup.parse(str);
W3CDom w3cDom = new W3CDom();
org.w3c.dom.Document result = w3cDom.fromJsoup(document);

Cure for 'The string "--" is not permitted within comments.' exception?

I'm using Java 6. I have this dependency in my pom ...
<dependency>
<groupId>xerces</groupId>
<artifactId>xercesImpl</artifactId>
<version>2.10.0</version>
</dependency>
I'm trying to parse an XHTML doc with this line
<!--[if gte mso 9]><xml> <w:WordDocument> <w:View>Normal</w:View> <w:Zoom>0</w:Zoom> <w:TrackMoves/> <w:TrackFormatting/> <w:PunctuationKerning/> <w:ValidateAgainstSchemas/> <w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid> <w:IgnoreMixedContent>false</w:IgnoreMixedContent> <w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText> <w:DoNotPromoteQF/> <w:LidThemeOther>EN-US</w:LidThemeOther> <w:LidThemeAsian>JA</w:LidThemeAsian> <w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript> <w:Compatibility> <w:BreakWrappedTables/> <w:SnapToGridInCell/> <w:WrapTextWithPunct/> <w:UseAsianBreakRules/> <w:DontGrowAutofit/> <w:SplitPgBreakAndParaMark/> <w:EnableOpenTypeKerning/> <w:DontFlipMirrorIndents/> <w:OverrideTableStyleHps/> <w:UseFELayout/> </w:Compatibility> <m:mathPr> <m:mathFont m:val="Cambria Math"/> <m:brkBin m:val="before"/> <m:brkBinSub m:val="--"/> <m:smallFrac m:val="off"/> <m:dispDef/> <m:lMargin m:val="0"/> <m:rMargin m:val="0"/> <m:defJc m:val="centerGroup"/> <m:wrapIndent m:val="1440"/> <m:intLim m:val="subSup"/> <m:naryLim m:val="undOvr"/> </m:mathPr></w:WordDocument> </xml><![endif]-->
using this code ...
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setExpandEntityReferences(false);
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
final DocumentBuilder builder = factory.newDocumentBuilder();
final InputSource s = new InputSource(new StringReader(str));
org.w3c.dom.Document result = builder.parse(s);
but my parsing is dying with the following exception ...
[Fatal Error] :91:947: The string "--" is not permitted within comments.
org.xml.sax.SAXParseException: The string "--" is not permitted within comments.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at com.myco.myproject.util.XmlUtilities.getStringAsDocument(XmlUtilities.java:201)
at com.myco.myproject.util.NetUtilities.getUrlAsDocument(NetUtilities.java:67)
at com.myco.myproject.parsers.impl.ForesightEventsParser.getEventsFromElement(ForesightEventsParser.java:133)
at com.myco.myproject.parsers.impl.ForesightEventsParser.parsePage(ForesightEventsParser.java:99)
at com.myco.myproject.parsers.impl.ForesightEventsParser.getEvents(ForesightEventsParser.java:58)
at com.myco.myproject.domain.EventFeed.refresh(EventFeed.java:87)
at com.myco.myproject.domain.EventFeed.getEvents(EventFeed.java:72)
at com.myco.myproject.parsers.impl.ForesightParserTest.testParser(ForesightParserTest.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.springframework.test.context.junit4.statements.RunBeforeTestMethodCallbacks.evaluate(RunBeforeTestMethodCallbacks.java:74)
at org.springframework.test.context.junit4.statements.RunAfterTestMethodCallbacks.evaluate(RunAfterTestMethodCallbacks.java:83)
at org.springframework.test.context.junit4.statements.SpringRepeat.evaluate(SpringRepeat.java:72)
at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:231)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.springframework.test.context.junit4.statements.RunBeforeTestClassCallbacks.evaluate(RunBeforeTestClassCallbacks.java:61)
at org.springframework.test.context.junit4.statements.RunAfterTestClassCallbacks.evaluate(RunAfterTestClassCallbacks.java:71)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.run(SpringJUnit4ClassRunner.java:174)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Without changing my XHTML, anyone know how I can parse this document successfully?
Edit Per the comments given, I removed the term "well-formed" from my original question. I'm still really interested in how to make this exception go away without changing the text I'm parsing (which I don't have control over). For the purposes of this question, you can assume the "--" within comments is the only violation of the term "well-formed."
By definition:
A comment starts and ends with "--", and does not contain any occurrence of "--".
So no, your XHTML is not well-formed because you can't use -- anywhere inside a comment. Can you replace it by something else? or maybe put a space in-between, like this: - -. There really isn't a clean solution to this problem, any alternatives involve messing around with placeholders, encodings, etc.
Your document must have at least an extra "-". Might be you have written:
<!--- --> or
<!--- ---> or
<!-- ---> etc.

Categories

Resources