Parse html string in java servlet - java

I am trying to parse the string below in a Java servlet.
<con>
<status>OK</status>
<session>12312332432</session>
</con>
I want the value of the <session> element.
Any ideas?

If it's not XML, you can use regex:
import java.util.regex.*;
String aParser = "<con><status>OK</status><session>12312332432</session></con>";
// Reluctant (.*?) stops at the first </session>, so several <session> elements are matched separately
Pattern p = Pattern.compile("<session>(.*?)</session>");
Matcher m = p.matcher(aParser);
while (m.find()) {
    System.out.println(m.group(1));
}

For XML, you should use an XML parser.
Look at SAXParser; it will do the job.
SAXParser works well with big files because the whole document isn't held in memory.
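A minimal sketch of the SAX approach for the string in the question, using only the JDK's built-in parser. The class name SessionExtractor and the method extractSession are made up for this illustration; the element name "session" comes from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
class SessionExtractor {

    /** Returns the text of the first <session> element, or null if there is none. */
    public static String extractSession(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        final boolean[] found = {false};

        DefaultHandler handler = new DefaultHandler() {
            private boolean inSession = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // Only the first <session> element is of interest.
                if (!found[0] && "session".equals(qName)) {
                    inSession = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver one element's text in several chunks, so append.
                if (inSession) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (inSession && "session".equals(qName)) {
                    inSession = false;
                    found[0] = true;
                }
            }
        };

        // Parse straight from the String; no temporary file or byte conversion needed.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return found[0] ? text.toString() : null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<con><status>OK</status><session>12312332432</session></con>";
        System.out.println(extractSession(xml)); // prints 12312332432
    }
}
```

For a document this small a DOM parser would be just as reasonable; SAX mainly pays off when the input is large.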

If you just need the session value and not the entire XML data, just split() the string as below:
xmlString.split(starttag)[1].split(endtag)[0];

Related

Extracting String's from a String JAVA

Hello, I want to extract "Hello, World!", "and", and the paragraph "This is a minimal....." from the given string in Java. I am having problems extracting them, so can anyone help me with it?
The strings are always different, and I want to extract the text between two square-bracket tags: []......[].
String s1 = "[sh1] Hello, World! [/s11] and [pp]This is a minimal \"hello world\" HTML document. It demonstrates the basic structure of an HTML file and anchors. [/xy]";
Thanks
Use Pattern and Matcher to match the square brackets:
Pattern pattern = Pattern.compile("\\[[^\\]]*\\]([^\\]]*)\\[[^\\]]*\\]");
Matcher matcher = pattern.matcher(s1);
while (matcher.find()) {
System.out.println( "Found value: " + matcher.group(1).trim() );
}
Demo: https://ideone.com/kNKBgg
Please don't use regexes to do this (that's what Pattern and Matcher do) - see here for the reason why you shouldn't. While you could use this for the particular bracket example, don't do it if you expect full-blown HTML.
If you want to extract content from HTML, use a parser, for example SAXParser or DOMParser - see the Oracle documentation for examples.

Get an Array or List of Strings between some Strings (Search multiple Strings)

I have a large String which contains some XML. This XML contains input like:
<xyz1>...</xyz1>
<hello>text between strings #1</hello>
<xyz2>...</xyz2>
<hello>text between strings #2</hello>
<xyz3>...</xyz3>
I want to get all these <hello>text between strings</hello> elements.
So in the end I want to have a List or other Collection which contains all <hello>...</hello> entries.
I tried it with regex and Matcher, but it doesn't work with large strings; with smaller strings, it works. I read a blog post about this which says that Java regex is broken for alternation over large strings.
Is there an easy and good way to do this?
Edit:
An attempt is...
String pattern1 = "<hello>";
String pattern2 = "</hello>";
List<String> helloList = new ArrayList<String>();
String regexString = Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2);
Pattern pattern = Pattern.compile(regexString);
Matcher matcher = pattern.matcher(scannerString);
while (matcher.find()) {
String textInBetween = matcher.group(1); // Since (.*?) is capturing group 1
// You can insert match into a List/Collection here
helloList.add(textInBetween);
logger.info("-------------->>>> " + textInBetween);
}
If you have to parse an XML file, I suggest you use the XPath language. Basically, you have to do the following:
Parse the XML String into a DOM object
Create an XPath query
Query the DOM
Have a look at this link.
An example of what you have to do:
String xml = ...;
try {
    // Build structures to parse the String
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    // Parse the XML string into a DOM object; reading from a StringReader
    // avoids converting to bytes with the platform default charset
    Document document = builder.parse(new InputSource(new StringReader(xml)));
    // Create an XPath query
    XPath xPath = XPathFactory.newInstance().newXPath();
    // Query the DOM object with the query '//hello'
    NodeList nodeList = (NodeList) xPath.compile("//hello").evaluate(document, XPathConstants.NODESET);
} catch (Exception e) {
    e.printStackTrace();
}
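To round this off, here is a rough sketch of how the resulting NodeList could be turned into the List of strings the question asks for. The class name HelloXPath, the method helloTexts, and the wrapping <root> element in the sample are assumptions made for this example; a DOM parser needs a single root element, which the fragment in the question does not have.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class HelloXPath {

    /** Returns the text content of every <hello> element in the document. */
    public static List<String> helloTexts(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));

        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.evaluate("//hello", document, XPathConstants.NODESET);

        // Copy the text of each matched node into a plain Java list.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // <root> is added here only so that the input is well-formed.
        String xml = "<root>"
                + "<xyz1>...</xyz1><hello>text between strings #1</hello>"
                + "<xyz2>...</xyz2><hello>text between strings #2</hello>"
                + "</root>";
        System.out.println(helloTexts(xml));
    }
}
```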
You should parse your XML with an XML parser. It is easier than using regular expressions.
A DOM parser is the simplest to use, but if your XML is very big, use a SAX parser.
I would highly recommend using one of the many publicly available XML parsers:
Woodstox
StAX
dom4j
It is simply an easier way to achieve what you're trying to do (even if you extend your requirements in the future). If speed and memory are not a concern, go ahead and use dom4j. There are a lot of resources online; I can post good examples in this answer if you wish, as right now it simply points you to alternative options, and I'm not sure what your limitations are.
Regarding REGEX when parsing XML, Dour High Arch gave a great response:
XML is not a regular language. You cannot parse it using a regular expression. An expression you think will work will break when you get nested tags, then when you fix that it will break on XML comments, then CDATA sections, then processor directives, then namespaces, ... It cannot work, use an XML parser.
Parsing XML with REGEX in Java
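As an illustration of the streaming option, here is a small sketch using the StAX API that ships with the JDK (javax.xml.stream), which Woodstox also implements. The class name HelloStax and the method helloTexts are invented for this example, and it assumes each <hello> element contains only text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
class HelloStax {

    /** Streams through the XML and collects the text of every <hello> element. */
    public static List<String> helloTexts(String xml) throws Exception {
        List<String> result = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "hello".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag;
                    // it throws if the element has child elements.
                    result.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<bunch_of_data>"
                + "<xyz1>...</xyz1><hello>text between strings #1</hello>"
                + "<xyz2>...</xyz2><hello>text between strings #2</hello>"
                + "</bunch_of_data>";
        System.out.println(helloTexts(xml));
    }
}
```

Because the reader only moves forward and keeps no tree in memory, this style is suited to the "large string" case described in the question.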
With Java 8 you could use the Dynamics library to do this in a straightforward way:
XmlDynamic xml = new XmlDynamic(
"<bunch_of_data>" +
"<xyz1>...</xyz1>" +
"<hello>text between strings #1</hello>" +
"<xyz2>...</xyz2>" +
"<hello>text between strings #2</hello>" +
"<xyz3>...</xyz3>" +
"</bunch_of_data>");
List<String> hellos = xml.get("bunch_of_data").children()
.filter(XmlDynamic.hasElementName("hello"))
.map(hello -> hello.asString())
.collect(Collectors.toList()); // ["text between strings #1", "text between strings #2"]
See https://github.com/alexheretic/dynamics#xml-dynamics

Regex to remove html does not get rid of img tag

I am using a regex to remove HTML tags. I do something like:
result.replaceAll("<.*?>", "");
However, it does not get rid of the img tags in the HTML. Any idea what would be a good way to do that?
If you cannot use HTML parsers/cleaners, then I would at least suggest using the Pattern.DOTALL flag to take care of multi-line HTML blocks. Consider code like this:
String str = "123 <img \nsrc='ping.png'>abd foo";
Pattern pt = Pattern.compile("<.*?>", Pattern.DOTALL);
Matcher matcher = pt.matcher(str);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(sb, "");
}
matcher.appendTail(sb);
System.out.println("Output: " + sb);
OUTPUT
Output: 123 abd foo
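The same effect can be had in one call with the inline flag (?s), which is the embedded form of Pattern.DOTALL. The sketch below contrasts the two behaviours; TagStripper and stripTags are names chosen for this example only, and this is still a regex workaround rather than real HTML parsing.

```java
// Hypothetical helper class, not part of any library.
class TagStripper {

    /** Removes anything that looks like a tag, even when it spans several lines. */
    public static String stripTags(String html) {
        // (?s) turns on DOTALL, so '.' also matches line terminators inside a tag.
        return html.replaceAll("(?s)<.*?>", "");
    }

    public static void main(String[] args) {
        String str = "123 <img \nsrc='ping.png'>abd foo";
        // Without DOTALL, '.' stops at the newline and the img tag is left in place.
        System.out.println("Without DOTALL: " + str.replaceAll("<.*?>", ""));
        // With DOTALL, the multi-line tag is removed.
        System.out.println("With DOTALL: " + stripTags(str));
    }
}
```

This is the likely reason the original replaceAll call left some img tags behind: those tags contained a line break.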
To give a more concrete recommendation, use JSoup (or NekoHTML) to parse the HTML into a Java object.
Once you've got a Document object it can easily be traversed to remove the tags. This cookbook recipe shows how to get attributes and text from the DOM.
Another suggestion is HtmlCleaner
I'm just reiterating what others have already said, but this point cannot be overstated: DO NOT USE REGEXES TO PARSE HTML. There are a thousand similar questions on this on SO. Use a proper HTML parser; it will make your life so much easier and is far more robust and reliable. Take a look at Dom4j, Jericho, JSoup. Please.
So, a piece of code for you.
I use http://htmlparser.sourceforge.net/ to parse HTML. It is not overcomplicated and quite straightforward to use.
Basically it looks like this:
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
...
String html; /* read your HTML into variable 'html' */
String result=null;
....
try {
Parser p = new Parser(html);
NodeList nodes = p.parse(null);
result = nodes.asString();
} catch (ParserException e) {
e.printStackTrace();
}
That will give you plain text stripped of tags (but character entities such as &amp; will not be converted). And of course you can do plenty more with this library, like applying filters and visitors, iterating over nodes, and so on.
Use an HTML parser instead. Iterate over the resulting object, print it however you like, and you will get the best result.
I have been able to achieve this with the code snippet below.
String htmlContent = values.get(position).getContentSnippet();
String plainTextContent = htmlContent.replaceAll("<img .*?/>", "");
I used the above regex to clean the img tags in my RSS content.

Convert HTML symbols and HTML names to HTML number using Java

I have an XML file which contains many special symbols like ® (HTML number &#174) etc.
and HTML names like &atilde (HTML number &#227) etc.
I am trying to replace these HTML symbols and HTML names with the corresponding HTML numbers using Java. For this, I first read the XML file into a string and then used the replaceAll method:
File fn = new File("myxmlfile.xml");
String content = FileUtils.readFileToString(fn);
content = content.replaceAll("®", "&\#174");
FileUtils.writeStringToFile(fn, content);
But this is not working.
Can anyone please tell me how to do it?
Thanks!
The signature for the replaceAll method is:
public String replaceAll(String regex, String replacement)
You have to be careful that your first parameter is a valid regular expression. The Java Pattern class describes the constructs used in a Java regular expression.
Based on what I see in the Pattern class description, I don't see what's wrong with:
content = content.replaceAll("®", "&\#174");
You could try:
content = content.replaceAll("\\p(®)", "&\#174");
and see if that works better.
I don't think that \# is a valid escape sequence.
BTW, what's wrong with "&#174" ?
If you want HTML numbers, try escaping for XML first.
Use StringEscapeUtils from Apache Commons Lang.
Java may have trouble dealing with it, so I prefer to escape for Java first, and after that for XML or HTML.
String escapedStr = StringEscapeUtils.escapeJava(yourString);
escapedStr = StringEscapeUtils.escapeXml(yourString);
escapedStr = StringEscapeUtils.escapeHtml(yourString); // Commons Lang 2.x name; Commons Lang 3 calls it escapeHtml4
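If pulling in a library is not an option, the symbol-to-number part can also be sketched with the standard library alone. The class NumericEntities and the method toNumericReferences are hypothetical names for this illustration. It only handles literal characters such as ®; named entities such as &atilde; would need a separate name-to-number lookup table, which is not shown here.

```java
// Hypothetical helper class, not part of any library.
class NumericEntities {

    /** Replaces every non-ASCII character with a decimal reference such as &#174; */
    public static String toNumericReferences(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Work on code points so characters outside the BMP are handled as one unit.
            int codePoint = input.codePointAt(i);
            if (codePoint > 127) {
                out.append("&#").append(codePoint).append(';');
            } else {
                out.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00AE is the registered sign, \u00E3 is a with tilde.
        System.out.println(toNumericReferences("Acme\u00AE S\u00E3o")); // prints Acme&#174; S&#227;o
    }
}
```

Walking the string once avoids one replaceAll call per symbol and sidesteps regex escaping entirely.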

How to extract a substring from a string in java

I am validating URLs from my code. I have a file with URLs in it, and I want to see whether they exist. If they exist, the web page contains XML in which there will be an email address I want to extract.
I go round a while loop and, on each iteration, if the URL exists, the XML is added to a string. This one big string contains the XML. What I want to do is extract the email address from this string. I can't use the methods in the String API as they require you to specify the starting index, which I don't know as it varies each time.
What I was hoping to do was search the string for a substring starting with (e.g. "<email id>") and ending with (e.g. "</email id>") and add the text between them to a separate string.
Does anyone know if this is possible, or if there is an easier/different way of doing what I want to do?
Thanks.
If you know the structure of the XML document well, I recommend using XPath.
For example, with emails contained in <email>a#b.com</email>, there will be an XPath query like /root/email (depending on your XML structure).
By executing this XPath query on your XML file, you will automatically get all <email> elements (Node) returned in an array. And once you have an XML element, you have its content (#getNodeValue).
To answer the question in your title: .indexOf or regular expressions.
But after a brief review of your question, you should really be processing the XML document properly.
A regular expression that will find and return strings between two " characters:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
private final static Pattern pattern = Pattern.compile("\"(.*?)\"");
private void doStuffWithStringsBetweenQuotes(String source) {
Matcher matcher = pattern.matcher(source);
while (matcher.find()) {
String match = matcher.group(1);
}
}
Have you tried using regex? A sample document would be very useful for this kind of question.
Check out the org.xml.sax API. It is very easy to use and allows you to parse through XML and do whatever you want with the contents whenever you come across anything of interest. So you could easily add some logic to look for <email> start elements and then save the contents (characters), which will contain your email address.
If I understand your question correctly you are extracting pieces of XML from multiple web pages and concatenating them into a big 'xml' string,
something that looks like
"<somedata>blah</somedata>
<email>a.b#c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.c#c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.d#c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
"
I'd advise making that a somewhat valid xml document by including a root element.
"
<?xml version="1.0" encoding="ISO-8859-1"?>
<newRoot>
<somedata>blah</somedata>
<email>a.b#c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.c#c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.d#c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
</newRoot>"
Then you could load that into an XML Document object and use XPath expressions to extract the email nodes and their values.
If you don't want to do that, you could use the indexOf(String str, int fromIndex) method to find the positions of <email> and </email> (or whatever they are called) and then call substring based on those. That's not a particularly clean or easy-to-read way of doing it, though.
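For completeness, the indexOf/substring route could look roughly like the sketch below. BetweenTags and between are names invented for this example, and the tag strings are passed in because the real element names are not known. It treats the input as plain text, so it will not cope with attributes, nesting, or CDATA.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
class BetweenTags {

    /** Collects every substring found between startTag and the next endTag. */
    public static List<String> between(String source, String startTag, String endTag) {
        List<String> result = new ArrayList<>();
        int from = 0;
        while (true) {
            // Look for the next opening tag at or after the current position.
            int start = source.indexOf(startTag, from);
            if (start < 0) {
                break;
            }
            int contentStart = start + startTag.length();
            int end = source.indexOf(endTag, contentStart);
            if (end < 0) {
                break; // unterminated element: stop rather than guess
            }
            result.add(source.substring(contentStart, end));
            // Continue searching after the closing tag just consumed.
            from = end + endTag.length();
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<somedata>blah</somedata><email>a.b#c.com</email>"
                + "<somedata>blah</somedata><email>a.c#c.com</email>";
        System.out.println(between(xml, "<email>", "</email>"));
    }
}
```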
