Let's say the string is <title>xyz</title>
I want to extract xyz from the string.
I used:
Pattern titlePattern = Pattern.compile("<title>\\s*(.+?)\\s*</title>");
Matcher titleMatcher = titlePattern.matcher(line);
String title = titleMatcher.group(1);
But I am getting an error at titlePattern.matcher(line);
You say your error occurs earlier. What is the actual error? The code runs without an error for me. Once that is solved, you will still need to call find() on the matcher once to actually search for the pattern before you call group(1):
if(titleMatcher.find()){
String title = titleMatcher.group(1);
}
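A minimal, self-contained sketch of the find()-then-group(1) sequence, assuming the input really contains unescaped <title> tags. The class name TitleExtractor and the method extractTitle are made up for illustration. Calling group(1) without a prior successful find() or matches() throws an IllegalStateException, which is why the result is wrapped in an Optional here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
final class TitleExtractor {
    // Same pattern as in the question: lazily capture the text between the tags.
    private static final Pattern TITLE =
            Pattern.compile("<title>\\s*(.+?)\\s*</title>");

    /** Returns the title text, or an empty Optional if the line has no match. */
    static Optional<String> extractTitle(String line) {
        Matcher matcher = TITLE.matcher(line);
        // group(1) is only valid after find() has returned true.
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("<title>xyz</title>"));
        System.out.println(extractTitle("no title here"));
    }
}
```

Returning an Optional is just one way to handle the no-match case; returning null or throwing would work as well.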
Note that if you are really matching against a string with unescaped markup like
<title>xyz</title>
then your regular expression has to use the same literal characters, not the escaped entities:
"<title>\\s*(.+?)\\s*</title>"
Also, you should be careful about how far you try to get with this, as you can't really parse HTML or XML with regular expressions. If you are working with XML, it's much easier to use an XML parser, e.g. JDOM.
Not technically an answer but you shouldn't be using regular expressions to parse HTML. You can try and you can get away with it for simple tasks but HTML can get ugly. There are a number of Java libraries that can parse HTML/XML just fine. If you're going to be working a lot with HTML/XML it would be worth your time to learn them.
As others have suggested, it's probably not a good idea to parse HTML/XML with regex. You can parse XML documents with the standard Java API, but I don't recommend it. As Fabian Steeg already answered, it's probably better to use JDOM or a similar open source library for parsing XML.
With javax.xml.parsers you can do the following:
String xml = "<title>abc</title>";
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document doc = docBuilder.parse(new InputSource(new StringReader(xml)));
NodeList nodeList = doc.getElementsByTagName("title");
String title = nodeList.item(0).getTextContent();
This parses your XML string into a Document object which you can use for further lookups. The API is kinda horrible though.
Another way is to use XPath for the lookup:
XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xPath = xpathFactory.newXPath();
String titleByXpath = xPath.evaluate("/title/text()", new InputSource(new StringReader(xml)));
// or use the Document for lookup
String titleFromDomByXpath = xPath.evaluate("/title/text()", doc);
I have a large String which contains some XML. This XML contains input like:
<xyz1>...</xyz1>
<hello>text between strings #1</hello>
<xyz2>...</xyz2>
<hello>text between strings #2</hello>
<xyz3>...</xyz3>
I want to get all of these <hello>text between strings</hello> elements.
In the end I want a List or some other Collection that contains all of the <hello>...</hello> entries.
I tried it with Regex and Matcher, but the problem is that it doesn't work with large strings. If I try it with smaller strings, it works. I read a blog post about this which says that Java regex is broken for alternation over large strings.
Is there an easy and good way to do this?
Edit:
An attempt is...
String pattern1 = "<hello>";
String pattern2 = "</hello>";
List<String> helloList = new ArrayList<String>();
String regexString = Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2);
Pattern pattern = Pattern.compile(regexString);
Matcher matcher = pattern.matcher(scannerString);
while (matcher.find()) {
String textInBetween = matcher.group(1); // Since (.*?) is capturing group 1
// You can insert match into a List/Collection here
helloList.add(textInBetween);
logger.info("-------------->>>> " + textInBetween);
}
If you have to parse an XML file, I suggest you use XPath. Basically you have to do the following:
Parse the XML string into a DOM object
Create an XPath query
Query the DOM
Try to have a look at this link.
An example of what you have to do is this:
String xml = ...;
try {
// Build structures to parse the String
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
// Parse the XML string into a DOM object
Document document= builder.parse(new ByteArrayInputStream(xml.getBytes()));
// Create an XPath query
XPath xPath = XPathFactory.newInstance().newXPath();
// Query the DOM object with the query '//hello'
NodeList nodeList = (NodeList) xPath.compile("//hello").evaluate(document, XPathConstants.NODESET);
} catch (Exception e) {
e.printStackTrace();
}
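The snippet above stops at the NodeList. As a rough sketch of the remaining step, the following collects the text of every <hello> element into a List. It assumes the fragments from the question are wrapped in a single root element (an XML document needs exactly one), and the class and method names (HelloXPath, helloTexts) are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
final class HelloXPath {
    /** Returns the text content of every <hello> element, in document order. */
    static List<String> helloTexts(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Reading from a StringReader avoids any byte/charset conversion.
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                    (NodeList) xPath.evaluate("//hello", document, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Malformed XML or a broken XPath expression ends up here.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<data><xyz1>...</xyz1><hello>text between strings #1</hello>"
                + "<xyz2>...</xyz2><hello>text between strings #2</hello></data>";
        System.out.println(helloTexts(xml));
    }
}
```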
You have to parse your XML with an XML parser. It is easier than using regular expressions.
A DOM parser is the simplest to use, but if your XML is very big, use a SAX parser.
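For the large-input case, a SAX handler might look roughly like the sketch below. It streams through the document and keeps only the text of <hello> elements, so the whole tree never has to sit in memory. It assumes <hello> elements are not nested inside each other and that the input has a single root element. HelloSax and helloTexts are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for illustration only.
final class HelloSax {
    /** Streams through the XML and collects the text of each <hello> element. */
    static List<String> helloTexts(String xml) {
        final List<String> result = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Non-null only while the parser is inside a <hello> element.
            private StringBuilder current;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if ("hello".equals(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver one text node in several chunks.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("hello".equals(qName) && current != null) {
                    result.add(current.toString());
                    current = null;
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return result;
    }
}
```

For a real file you would pass a File or InputStream to parse() instead of building a String first, otherwise the memory advantage is lost.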
I would highly recommend using one of the multiple public XML parsers available:
Woodstox
StAX
dom4j
It is simply easier to achieve what you're trying to do this way (even if you want to extend your requirements in the future). If speed and memory are not a concern, go ahead and use dom4j. There are a lot of resources online, and I can post good examples in this answer if you wish. For now my answer only points you to alternative options, because I'm not sure what your limitations are.
Regarding REGEX when parsing XML, Dour High Arch gave a great response:
XML is not a regular language. You cannot parse it using a regular expression. An expression you think will work will break when you get nested tags, then when you fix that it will break on XML comments, then CDATA sections, then processor directives, then namespaces, ... It cannot work, use an XML parser.
Parsing XML with REGEX in Java
With Java 8 you could use the Dynamics library to do this in a straightforward way:
XmlDynamic xml = new XmlDynamic(
"<bunch_of_data>" +
"<xyz1>...</xyz1>" +
"<hello>text between strings #1</hello>" +
"<xyz2>...</xyz2>" +
"<hello>text between strings #2</hello>" +
"<xyz3>...</xyz3>" +
"</bunch_of_data>");
List<String> hellos = xml.get("bunch_of_data").children()
.filter(XmlDynamic.hasElementName("hello"))
.map(hello -> hello.asString())
.collect(Collectors.toList()); // ["text between strings #1", "text between strings #2"]
See https://github.com/alexheretic/dynamics#xml-dynamics
I have an XPath for an element and need to write Java code that gives me exactly that element as an object. I believe I need to use SAX or DOM? I'm a total newbie.
XPath:
/*[local-name(.)='feed']/*[local-name(.)='entry']/*[local-name(.)='title']
Your comment suggests you want to use DOM4J, which supports XPath out of the box:
SAXReader reader = new SAXReader();
Document doc = reader.read(new File(....)); // or URL, or wherever the XML comes from
Node selectedNode = doc.selectSingleNode("/*[local-name(.)='feed']/*[local-name(.)='entry']/*[local-name(.)='title']");
(or there's also selectNodes which returns a List, if there might be more than one node matching that XPath expression - quite likely if this is an Atom feed).
But rather than using the local-name hack like this, if you know the namespace URI of the elements in your XML you can declare a prefix for this namespace and select the nodes by their fully qualified name:
SAXReader reader = new SAXReader();
Map<String, String> namespaces = new HashMap<>();
namespaces.put("atom", "http://www.w3.org/2005/Atom");
reader.getDocumentFactory().setXPathNamespaceURIs(namespaces);
Document doc = reader.read(new File(....)); // or URL, or wherever the XML comes from
List selectedNodes = doc.selectNodes("/atom:feed/atom:entry/atom:title");
Read here:
https://howtodoinjava.com/java/xml/java-xpath-tutorial-example/
I found it while searching for how to convert an XPath PMD rule to a Java rule. I did not find what I needed in it,
but maybe you can find what you need.
I want to count some child nodes of a given xml. But it always returns me 0 and I can't figure out why.
Here's the xml:
<FirstOne xmlns:xxx="http://www.w3.org/2001/XMLSchema-instance">
<Formulas xmlns:d2p1="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
<xxx:yyy>
<aa:bb>something</aa:bb>
<cc:dd>something</cc:dd>
</xxx:yyy>
<xxx:yyy>
<aa:bb>something</aa:bb>
<cc:dd>something</cc:dd>
</xxx:yyy>
<xxx:yyy>
<aa:bb>something</aa:bb>
<cc:dd>something</cc:dd>
</xxx:yyy>
</Formulas>
</FirstOne>
I want to count the number of "xxx:yyy". In this example 3.
I tried the following:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new FileInputStream(new File(fileArray[i].toString())));
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
String expression;
expression = "count(//Formulas/xxx:yyy)";
Double result = (Double) xpath.evaluate(expression, doc, XPathConstants.NUMBER);
It always gives me 0.0 ...
Thanks for your help!
The problems all stem from the namespaces.
Firstly, XPath evaluation is only defined over namespace-well-formed XML, so you need to ensure that the aa and cc prefixes are properly mapped to namespace URIs in the XML.
Secondly, you need to parse the XML into a DOM tree using a namespace-aware parser (for what I can only assume are historical reasons, DocumentBuilderFactory is not namespace-aware by default).
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);
dbf.setNamespaceAware(true);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new FileInputStream(new File(fileArray[i].toString())));
Now you have a proper namespace-well-formed DOM tree you need to handle the namespaces correctly in the XPath. You need to define a NamespaceContext telling the XPath how to relate prefixes and namespace URIs. Annoyingly there's no default implementation of this interface available in the core Java libraries but there are third-party implementations such as Spring's SimpleNamespaceContext, or it's only three methods to implement it yourself. With a SimpleNamespaceContext:
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
SimpleNamespaceContext nsCtx = new SimpleNamespaceContext();
xpath.setNamespaceContext(nsCtx);
nsCtx.bindNamespaceUri("x", "http://www.w3.org/2001/XMLSchema-instance");
With this context in place you can now select namespaced nodes in your XPath expression:
String expression = "count(//Formulas/x:yyy)";
(the prefixes you use are the ones in the NamespaceContext, not necessarily the ones in the original XML source).
While some DOM parsers and XPath implementations might let you get away with parsing non-namespace-aware and omitting the prefixes in the XPath expressions, this is an implementation detail and the behaviour is not defined by the specifications. It might work in one version but fail in another, or behave differently if you add additional JARs to your project that change the default parser, etc.
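If you would rather not pull in Spring for this, the three-method NamespaceContext mentioned above can be written by hand. The sketch below binds only the prefix x and counts the yyy elements. It assumes a namespace-well-formed version of the XML from the question (without the undeclared aa and cc prefixes), and CountNamespaced, SinglePrefixContext and countYyy are made-up names.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
final class CountNamespaced {
    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

    /** Minimal hand-written NamespaceContext that knows a single prefix, "x". */
    static final class SinglePrefixContext implements NamespaceContext {
        @Override
        public String getNamespaceURI(String prefix) {
            return "x".equals(prefix) ? XSI : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return XSI.equals(namespaceURI) ? "x" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            String prefix = getPrefix(namespaceURI);
            return prefix == null
                    ? Collections.<String>emptyIterator()
                    : Collections.singletonList(prefix).iterator();
        }
    }

    /** Counts yyy elements in the XSI namespace directly below Formulas. */
    static double countYyy(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // off by default
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new SinglePrefixContext());
            // "x" is the prefix from the context, not the "xxx" used in the document.
            return (Double) xpath.evaluate(
                    "count(//Formulas/x:yyy)", doc, XPathConstants.NUMBER);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not count elements", e);
        }
    }
}
```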
Since xxx is only the tag prefix, just use count(//Formulas/yyy).
I'm trying to follow the UniversalNamespaceResolver example from http://www.ibm.com/developerworks/xml/library/x-nmspccontext/index.html for resolving namespaces when evaluating XPath against an XML document. The problem I encountered is that the lookupNamespaceURI call below returns null for the XML given below:
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document dDoc = builder.parse(new InputSource(new StringReader(xml)));
String nsURI = dDoc.lookupNamespaceURI("h");
the XML:
<?xml version="1.0"?>
<h:root xmlns:h="http://www.w3.org/TR/html4/">
<h:table>
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
</h:root>
while I'd expect it to return "http://www.w3.org/TR/html4/".
When configuring a DocumentBuilder, you have to explicitly make it namespace aware (a silly relic from the first days of xml when there were no namespaces):
domFactory.setNamespaceAware(true);
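To see the effect of that flag in isolation, here is a small sketch that parses a string with the flag set as requested and then looks up a prefix on the resulting Document. LookupNamespace and lookup are hypothetical names, and the XML in main is a shortened version of the one in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
final class LookupNamespace {
    /** Parses the XML with the given namespace awareness and resolves the prefix. */
    static String lookup(String xml, String prefix, boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.lookupNamespaceURI(prefix);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<h:root xmlns:h=\"http://www.w3.org/TR/html4/\">"
                + "<h:table><h:tr><h:td>Apples</h:td></h:tr></h:table>"
                + "</h:root>";
        // With namespace awareness switched on, the prefix resolves to its URI.
        System.out.println(lookup(xml, "h", true));
        // Without it, the question reports that the lookup returns null.
        System.out.println(lookup(xml, "h", false));
    }
}
```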
As a side note, the advice in that article is not very good. It fundamentally misses the point that you don't care what the namespace prefixes are in the actual document; they are irrelevant. You need the XPath namespace resolver to match the XPath expressions that you are using, and that is all. If you do what they are suggesting, you will have to change your XPath code whenever the document's prefixes change, which is a horrible idea.
Note that they sort of concede this point in their last bullet, but the rest of the article seems to miss that this is the fundamental idea when using XPath.
But if you don't have control over the XML file, and someone can send you any prefixes they wish, it might be better to be independent of their choices. You can code your own namespace resolution as in Example 1 (HardcodedNamespaceResolver), and use them in your XPath expressions.
Currently I am parsing XML messages with XPath expressions. It works very well. However, I have the following problem:
I am parsing the whole data of the XML, so I instantiate a new InputSource for every call made to xPath.evaluate.
StringReader xmlReader = new StringReader(xml);
InputSource source = new InputSource(xmlReader);
XPathExpression xpe = xpath.compile("msg/element/@attribute");
String attribute = (String) xpe.evaluate(source, XPathConstants.STRING);
Now I would like to go deeper into my XML message and evaluate more information. For this I found that I need to instantiate the source again. Is this required? If I don't do it, I get "Stream closed" exceptions.
Parse the XML to a DOM and keep a reference to the node(s). Example:
XPath xpath = XPathFactory.newInstance()
.newXPath();
InputSource xml = new InputSource(new StringReader("<xml foo='bar' />"));
Node root = (Node) xpath.evaluate("/", xml, XPathConstants.NODE);
System.out.println(xpath.evaluate("/xml/@foo", root));
This avoids parsing the string more than once.
If you must reuse the InputSource for a different XML string, you can probably use the setters with a different reader instance.
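Spelled out a little further, the parse-once approach from this answer could look like the sketch below: the string is parsed a single time into a DOM node, and any number of XPath expressions are then evaluated against that node. ParseOnce, readValues and the sample message are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
final class ParseOnce {
    /** Parses the XML once, then evaluates each expression against the DOM. */
    static String[] readValues(String xml, String... expressions) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The InputSource is consumed exactly once, by this first evaluation.
            Node root = (Node) xpath.evaluate(
                    "/", new InputSource(new StringReader(xml)), XPathConstants.NODE);
            String[] values = new String[expressions.length];
            for (int i = 0; i < expressions.length; i++) {
                // Later evaluations reuse the in-memory DOM; no stream is involved.
                values[i] = xpath.evaluate(expressions[i], root);
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<msg><element attribute='a' other='b'/></msg>";
        String[] values = readValues(xml, "/msg/element/@attribute", "/msg/element/@other");
        System.out.println(values[0] + " " + values[1]);
    }
}
```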