Parsing a non-XML file in Java

I want to parse a document that is not pure XML. For example:
my name is <j> <b> mike</b> </j>
Example 2:
my name is <mytag1 attribute="val" >mike</mytag1> and yours is <mytag2> john</mytag2>
In other words, my input is not pure XML. It is similar to HTML, but the tags are not HTML tags.
How can I parse it in Java?

Your examples are well-formed XML, except for the lack of a single document (root) element. If you know this to always be the case, then you could just wrap a set of dummy tags around the whole thing and use a standard parser (SAX, DOM...).
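A minimal sketch of the dummy-root idea, using only the JDK's DOM parser. The class name `FragmentParser`, the method `parse` and the wrapper element name `root` are made up for illustration; any element name that does not clash with your own tags would do.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class FragmentParser {

    // Wraps the fragment in a dummy <root> element so a standard parser accepts it.
    public static Document parse(String fragment) {
        String wrapped = "<root>" + fragment + "</root>";
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapped)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Reached when tags do not match up, overlap, and so on.
            throw new IllegalArgumentException("Fragment is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("my name is <mytag1 attribute=\"val\" >mike</mytag1>"
                + " and yours is <mytag2> john</mytag2>");
        // The children of the dummy root are the text runs and custom tags of the input.
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            String kind = child.getNodeType() == Node.ELEMENT_NODE
                    ? "element " + child.getNodeName()
                    : "text";
            System.out.println(kind + ": " + child.getTextContent());
        }
    }
}
```

If the input turns out not to be well-formed after all, `parse` fails with an exception, which is the point where the custom rules described in the next paragraph would have to take over.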
On the other hand if you get something uglier (e.g. tags don't match up, or are spaced out in an overlapping fashion), you'll have to do something custom which will involve a number of rules that you have to decide on that will be unique to your application. (e.g. How do I handle an opening tag that has no close? What do I do if the closing tag is outside the parent?)

There are a few parsers that take HTML that is not well formed and turn it into well-formed XML. Here is a comparison with examples that includes the most popular ones, except maybe HTMLParser. One of them is probably what you need.

Related

Java add attribute to HTML tags without changing formatting

I have a task to make a Maven plugin which takes HTML files in a certain location and adds a service attribute to each tag that doesn't have it. This is done on the source code, which means my colleagues and I will have to edit those files further.
As a first solution I turned to Jsoup which seems to be doing the job but has one small yet annoying problem: if we have a tag with multiple long attributes (we often do as this HTML code is a source for further processing) we wrap the lines like this:
<ui:grid id="category_search" title="${handler.getMessage( 'title' )}"
class="is-small is-outlined is-hoverable is-foldable"
filterListener="onApplyFilter" paginationListener="onPagination" ds="${handler.ds}"
filterFragment="grid_filter" contentFragment="grid_contents"/>
However, Jsoup turns this into one very long line:
<ui:grid id="category_search" title="${handler.getMessage( 'title' )}" class="is-small is-outlined is-hoverable is-foldable" filterListener="onApplyFilter" paginationListener="onPagination" ds="${handler.ds}" filterFragment="grid_filter" contentFragment="grid_contents"/>
This is bad practice and a real pain to read and edit.
So is there any other not very convoluted way to add this attribute without parsing and recomposing HTML code or maybe somehow preserve line breaks inside the tag?
Unfortunately JSoup's main use case is not to create HTML that is read or edited by humans. Specifically JSoup's API is very closely modeled after DOM which has no way to store or model line breaks inside tags, so it has no way to preserve them.
I can think of only two solutions:
(1) Find (or write) an alternative HTML parser library that has an API that preserves formatting inside tags. I'd be surprised if such a thing already exists.
(2) Run the generated code through a formatter that supports wrapping inside tags. This won't preserve the original line breaks, but at least the attributes won't be all on one line. I wasn't able to find a Java library that does that, so you may need to consider using an external program.
It seems there is no good way to preserve breaks inside tags while parsing them into POJOs (or I haven't found one), so I wrote a simple tokenizer which splits the incoming HTML string into parts, sort of like this:
String[] parts = html.split( "((?=<)|(?<=>))" );
This uses regex lookarounds (a lookahead and a lookbehind) to split before < and after >. Then just iterate over the parts and decide whether to insert the attribute or not.
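One way the iteration over the parts could look is sketched here. The class `AttributeAdder` and its method are hypothetical, and the sketch assumes that no attribute value, script or comment contains a literal < or >, since the split would cut through it. The new attribute is inserted right after the tag name, so the existing line breaks inside the tag are left alone.

```java
import java.util.regex.Pattern;

public class AttributeAdder {

    // Split before every '<' and after every '>' so tags and text become separate parts.
    private static final Pattern TAG_BOUNDARY = Pattern.compile("(?=<)|(?<=>)");

    public static String addAttribute(String html, String name, String value) {
        // Finds "name=" preceded by whitespace, i.e. the attribute is already set.
        // Simplification: this also matches the same text inside another attribute's value.
        Pattern present = Pattern.compile("\\s" + Pattern.quote(name) + "\\s*=");
        StringBuilder out = new StringBuilder();
        for (String part : TAG_BOUNDARY.split(html)) {
            // Opening tags start with '<' plus a letter; closing tags, comments
            // and plain text are copied through unchanged.
            boolean openingTag = part.length() > 2
                    && part.charAt(0) == '<'
                    && Character.isLetter(part.charAt(1))
                    && part.endsWith(">");
            if (openingTag && !present.matcher(part).find()) {
                // Find the end of the tag name; the loop stops at the final '>' at the latest.
                int end = 1;
                while (!Character.isWhitespace(part.charAt(end))
                        && part.charAt(end) != '/'
                        && part.charAt(end) != '>') {
                    end++;
                }
                out.append(part, 0, end)
                   .append(' ').append(name).append("=\"").append(value).append('"')
                   .append(part, end, part.length());
            } else {
                out.append(part);
            }
        }
        return out.toString();
    }
}
```

Calling `AttributeAdder.addAttribute(html, "service", "x")` on the wrapped `ui:grid` example would add the attribute after `<ui:grid` and keep the remaining three lines of the tag exactly as they were typed.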

How to replace all visible text by text with tags

I need to wrap every word of the HTML that is visible in the browser in a tag, like this:
source:
<p><strong> My source sentence</strong></p>
goal:
<p><strong><span>My </span><span>source </span><span>sentence</span></strong></p>
But it must not touch any tags, JavaScript, etc.
How can I do this?
With no disrespect, this looks like a dumb thing to do. In any case, you can try to parse the HTML (the same way you would parse XML, using a library) and then replace every line with the new line.
If your source is valid XML then it should be fairly easy to write a SAX handler to read the source in and output it the way you want, have a look at this tutorial.
Essentially, each time you come across an element you just write the element to the output stream. Each time you come across some text, just use a regular expression (or similar) to split it into the parts you want and wrap each part in a span element. This seems like a really strange thing to do though.
If your input source isn't valid XML (if it's HTML with all the various things that can be broken with that) then it's going to be much harder unless you can first transform the source into valid XML.
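For the well-formed case, the SAX approach might look roughly like the following. `WordWrapper` is a made-up name, and the sketch makes several simplifying assumptions: it drops comments and processing instructions, discards whitespace-only text and leading whitespace, and would also wrap the contents of script or style elements, which a real implementation would have to skip.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class WordWrapper extends DefaultHandler {

    // One word plus the whitespace that follows it.
    private static final Pattern WORD = Pattern.compile("\\S+\\s*");

    private final StringBuilder out = new StringBuilder();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getQName(i)).append("=\"")
               .append(escape(atts.getValue(i))).append('"');
        }
        out.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text run in several chunks, so buffer until the next tag.
        text.append(ch, start, length);
    }

    // Writes the buffered text with every word wrapped in its own span.
    private void flushText() {
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.append("<span>").append(escape(m.group())).append("</span>");
        }
        text.setLength(0);
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static String wrapWords(String xml) {
        WordWrapper handler = new WordWrapper();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return handler.out.toString();
    }
}
```

`WordWrapper.wrapWords(...)` takes the source string and returns the rewritten markup; input that is not well-formed XML is rejected with an exception instead of being repaired.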

Text Processing - Detecting if you are inside an HTML tag in Java

I have a program that does text processing on an HTML-formatted document based on information from the same document without the HTML information. Basically, I locate a word or phrase in the unformatted document, then find the corresponding word in the formatted document and alter the appearance of the word or phrase using HTML tags to make it stick out (e.g. bold it or change its color).
Here is my problem. Occasionally, I want to apply formatting to a word or phrase which might also be part of an HTML tag (for example, perhaps I want to format the word "font", but only if it is a word that is not inside an HTML tag). Is there an easy way to detect whether a string is part of an HTML tag in a block of text or not?
By the way, I can't just strip out the html tags in the document and do my processing on the remaining text because I need to preserve the html in the result. I need to add to the existing html but I need to reliably distinguish between strings that are part of tags and strings that are not.
Any ideas?
Thank you,
Elliott
You could do a few things:
Write a regular expression for what you're doing. There are plenty of prewritten ones you can find on Google.
Find a library to parse the document (e.g., http://htmlparser.sourceforge.net/) and only replace text.
The first is likely to be the fastest and easiest, but the second will be more reliable.
Use the following regex to detect whether it has HTML tags: "<.*?>" (angle brackets need no escaping, and "\<" is not a legal escape in a Java string literal).
And here you can learn how to effectively use regex in your java code.
Happy coding ;)
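A rough sketch of the regex route: walk over the tags with a matcher and only touch the text between them. `TextHighlighter` is a hypothetical helper, and it assumes tags never contain a bare > inside attribute values and that the document has no script blocks or comments containing angle brackets; those cases are where a real parser becomes more reliable.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextHighlighter {

    // Anything from '<' up to the next '>' is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static String highlight(String html, String word) {
        Pattern target = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        StringBuilder out = new StringBuilder();
        Matcher tags = TAG.matcher(html);
        int last = 0;
        while (tags.find()) {
            // Text between the previous tag and this one may be altered...
            out.append(bold(html.substring(last, tags.start()), target));
            // ...while the tag itself is copied through untouched.
            out.append(tags.group());
            last = tags.end();
        }
        out.append(bold(html.substring(last), target));
        return out.toString();
    }

    private static String bold(String text, Pattern target) {
        // $0 refers to the whole match, i.e. the word as it appeared in the text.
        return target.matcher(text).replaceAll("<b>$0</b>");
    }
}
```

With this, the word "font" inside `<font color="red">` is left alone, while the same word in the visible text is wrapped in a bold tag.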
If you have parsed the document into a DOM (which is what you should have, if you are doing this correctly), ask for the parent tag that contains the current node, and keep walking up for as long as the parent is not the tag you are looking for.
If you use some custom search or regex to parse HTML, then check the best answer to this question:
RegEx match open tags except XHTML self-contained tags (It has +4000 upvotes for a reason)

How do I write unescaped XML outside of a CDATA

I am trying to write XML data using StAX where the content itself is HTML.
If I try
xtw.writeStartElement("contents");
xtw.writeCharacters("<b>here</b>");
xtw.writeEndElement();
I get this
<contents>&lt;b&gt;here&lt;/b&gt;</contents>
Then I notice the CDATA method and change my code to:
xtw.writeStartElement("contents");
xtw.writeCData("<b>here</b>");
xtw.writeEndElement();
and this time the result is
<contents><![CDATA[<b>here</b>]]></contents>
which is still not good. What I really want is
<contents><b>here</b></contents>
So is there an XML API/library that allows me to write raw text without it being in a CDATA section? So far I have looked at StAX and JDOM and they do not seem to offer this.
In the end I might resort to good old StringBuilder but this would not be elegant.
Update:
I agree mostly with the answers so far. However instead of <b>here</b> I could have a 1MB HTML document that I want to embed in a bigger XML document. What you suggest means that I have to parse this HTML document in order to understand its structure. I would like to avoid this if possible.
Answer:
It is not possible, otherwise you could create invalid XML documents.
The issue is that it is not raw text, it is an element, so you should be writing:
xtw.writeStartElement("contents");
xtw.writeStartElement("b");
xtw.writeCharacters("here");
xtw.writeEndElement();
xtw.writeEndElement();
If you want the XML to be included AS XML and not as character data, then it has to be parsed at some point. If you don't want to manually do the parsing yourself, you have two alternatives:
(1) Use external parsed entities -- in this case the external file will be pulled in and parsed by the XML parser. When the output is again serialized, it will include the contents of the external file.
[ See http://www.javacommerce.com/displaypage.jsp?name=entities.sql&id=18238 ]
(2) Use XInclude -- in that case the file has to be run through an XInclude processor, which will merge the XInclude references into the output. Most XSLT processors, as well as xmllint, will also do XInclude with an appropriate option.
[ See: http://www.xml.com/pub/a/2002/07/31/xinclude.html ]
( XSLT can also be used to merge documents without using the XInclude syntax. XInclude just provides a standard syntax )
The problem is not "here", it's <b></b>.
Add the <b> element as a child of contents and you'll be able to do it. Any library like JDOM or DOM4J will allow you to do this. The general case is to parse the content into an XML DOM and add the root element as a child of <contents>.
You can't add unescaped values outside of a CDATA section.
If you want to embed a large HTML document in an XML document then CDATA imho is the way to go. That way you don't have to understand or process the internal structure and you can later change the document type from HTML to something else without much hassle. Also I think you can't embed e.g. DOCTYPE instructions directly (i.e. as structured data that retains the semantics of the DOCTYPE instruction). They have to be represented as characters.
(This is primarily a response to your update, but alas I don't have enough rep to comment.)
I don't see what the problem is with parsing the large block of XML you want to insert into your output. Use a StAX parser to parse it, and just write code to forward all of the events to your existing serializer (variable "xtw").
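A sketch of that forwarding loop with the JDK's StAX API, assuming the embedded content is well-formed XHTML. `XhtmlEmbedder` and its methods are illustrative names; the sketch copies elements, attributes and text only, and it ignores comments, processing instructions, DTDs and namespace declarations, which a complete version would have to carry over as well.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class XhtmlEmbedder {

    // Replays the reader's events on the writer, one event at a time.
    static void forward(XMLStreamReader xtr, XMLStreamWriter xtw) throws XMLStreamException {
        while (xtr.hasNext()) {
            switch (xtr.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    xtw.writeStartElement(xtr.getLocalName());
                    for (int i = 0; i < xtr.getAttributeCount(); i++) {
                        xtw.writeAttribute(xtr.getAttributeLocalName(i),
                                xtr.getAttributeValue(i));
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    xtw.writeEndElement();
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.SPACE:
                    xtw.writeCharacters(xtr.getText());
                    break;
                case XMLStreamConstants.CDATA:
                    xtw.writeCData(xtr.getText());
                    break;
                default:
                    break; // everything else is skipped in this sketch
            }
        }
    }

    // Wraps the given XHTML in a <contents> element without building a tree in memory.
    public static String embed(String xhtml) {
        StringWriter result = new StringWriter();
        try {
            XMLStreamWriter xtw = XMLOutputFactory.newInstance().createXMLStreamWriter(result);
            XMLStreamReader xtr = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xhtml));
            xtw.writeStartElement("contents");
            forward(xtr, xtw);
            xtw.writeEndElement();
            xtw.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Content is not well-formed XML", e);
        }
        return result.toString();
    }
}
```

Because both sides are streaming, the 1MB document from the update is never held as a tree; it is still parsed, but only one event at a time.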
If the blob of html is actually xhtml then I'd suggest doing something like (in pseudo-code):
xtw.writeStartElement("contents")
XMLReader xtr=new XMLReader();
xtr.read(blob);
Dom dom=xtr.getDom();
for(element e:dom){
xtw.writeElement(e);
}
xtw.writeEndElement();
or something like that. I had to do something similar once but used a different library.
If your XML and HTML are not too big, you could make a workaround:
xtw.writeStartElement("contents");
xtw.writeCharacters("anUniqueIdentifierForReplace"); // <--
xtw.writeEndElement();
When you have your XML as a String:
xmlAsString = xmlAsString.replace("anUniqueIdentifierForReplace", yourHtmlAsString); // Strings are immutable, so keep the result
I know, it's not so nice, but this could work.
Edit: Of course, you should check if yourHtmlAsString is valid.

Getting elements by type in malformed HTML

What's the easiest way in Java to retrieve all elements with a certain type in a malformed HTML page? So I want to do something like this:
public static void main(String[] args) {
// Read in an HTML file from disk
// Retrieve all INPUT elements regardless of whether the HTML is well-formed
// Loop through all elements and retrieve their ids if they exist for the element
}
HtmlCleaner is arguably one of the best HTML parsers out there when it comes to dealing with (somewhat) malformed HTML.
Documentation is here with some code samples; you're basically looking for getElementsByName() method.
Take a look at Comparison of Java HTML parsers if you're considering other libraries.
I've had success using TagSoup. Here's a short description from their home page:
This is the home page of TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Check JTidy.
JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.
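If adding a library is not an option, the lenient HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`) can serve as a starting point. Note that this is a different technique from the libraries recommended above: it dates from the HTML 3.2 era and is less forgiving than HtmlCleaner, TagSoup or JTidy. The class name `InputIdFinder` is made up, and the sketch takes the page as a string instead of reading it from disk.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class InputIdFinder {

    // Returns the id of every INPUT element that has one, in document order.
    public static List<String> inputIds(String html) {
        List<String> ids = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // INPUT has no closing tag, so it is reported as a "simple" tag.
                Object id = attrs.getAttribute(HTML.Attribute.ID);
                if (tag == HTML.Tag.INPUT && id != null) {
                    ids.add(id.toString());
                }
            }
        };
        try {
            // The last argument tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ids;
    }
}
```

To process a file, the `StringReader` could be replaced with a reader over the file; the callback stays the same.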
