I have this text from a file, and I would like to extract the values from it using pattern matching:
<!--
#author: batman
#description: 100000
-->
I'm using this code to find comments in XML files and extract the values with a regex Pattern:
XMLStreamReader xr = XMLInputFactory.newInstance().createXMLStreamReader(new FileInputStream(file));
Pattern pattern = Pattern.compile("#(?<key>([\\w]+)?): (?<value>(.+)?)");
while (xr.hasNext())
{
if (xr.next() == XMLStreamConstants.COMMENT) {
String comment = xr.getText();
Matcher matcher = pattern.matcher(comment);
if(!matcher.matches()){
continue;
}
.....
}
When I run the code I get error:
persistence-configuration.xsd (No such file or directory)
at [row,col {unknown-source}]: [4,69]
But when I remove matcher.matches(), I get several iterations of xr because I have several comments in the XML file.
The idea is to get only the comment with the proper match and to skip the rest of the comments. Do you know why the code is not working fine?
I tried also with "#(?<key>[\\w]+): (?<value>.*)" but again I get the same issue.
Is your input source actually there? The exception you get hints at a file problem.
Matcher.matches() tests a String for complete conformity with the entire pattern. All characters of the input sequence need to be consumed. So if your text starts with blanks (tabs, spaces...), the matcher will fail when you use matches(). The comment in the question also spans several lines, and by default . does not match line terminators, so matches() cannot consume the whole comment.
You are looking for matcher.find(), which can be used like this:
Pattern myPattern = Pattern.compile("somePattern");
Matcher matcher = myPattern.matcher("someInput");
while (matcher.find()) {
// this is where you can check what you found...
}
Besides, you may want to check in what chunks the stream provides its elements. Do they really arrive line by line?
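As a rough illustration of the find() approach applied to the comment from the question, here is a minimal sketch. The class and method names (CommentKeyValues, extract) and the inline sample document are made up for the example, and it parses from a String instead of a file so that it stays self-contained:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class CommentKeyValues {
    // One "#key: value" entry; '.' stops at the end of the line by default.
    private static final Pattern ENTRY =
            Pattern.compile("#(?<key>\\w+): (?<value>.+)");

    // Collects every "#key: value" entry found in any comment of the document.
    static Map<String, String> extract(String xml) {
        Map<String, String> result = new LinkedHashMap<>();
        try {
            XMLStreamReader xr = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (xr.hasNext()) {
                if (xr.next() != XMLStreamConstants.COMMENT) {
                    continue; // not a comment, skip it
                }
                Matcher matcher = ENTRY.matcher(xr.getText());
                // find() scans for the next entry; comments without
                // any entry simply contribute nothing.
                while (matcher.find()) {
                    result.put(matcher.group("key"), matcher.group("value").trim());
                }
            }
            xr.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<root><!--\n#author: batman\n#description: 100000\n-->"
                + "<!-- some other comment --><child/></root>";
        System.out.println(extract(xml)); // {author=batman, description=100000}
    }
}
```

Comments that contain no "#key: value" line are skipped without any explicit continue on a failed match.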
I have an htmlBody field that holds the HTML of a web page. I want to find all occurrences of relative links ending in .html and remove the extension from each of them. I do not want to use htmlBody.replaceAll(".html", "") because it would strip the extension from every link and break some external links. My approach is to find every occurrence that matches a regex, remove the extension from each match using replaceAll(), and append the result to sb. I tried to follow the example from the official documentation, but it does not change any link. What could be the problem?
StringBuilder sb = new StringBuilder();
Pattern p = Pattern.compile("^\\/(.+\\\\)*(.+).(html)$");
Matcher m = p.matcher(htmlBody);
while (m.find()) {
String updatedLink = m.group().replaceAll(".html", "");
m.appendReplacement(sb, updatedLink);
}
m.appendTail(sb);
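Your regex is wrong: ^ matches the start of the string and $ matches the end of the string,
so the matcher in your code will never match a link in the middle of the page.
A working regex would look like Pattern p = Pattern.compile("['\"]\\/(.+\\\\)*(.+).(html)");
However, it cannot match an unquoted attribute value such as <a href=/a.html>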
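One possible way to put this together with appendReplacement is sketched below. The pattern is an assumption of this example rather than the regex from the answer: it only handles quoted href attributes whose value starts with a single "/", and like the answer's regex it does not handle unquoted attributes. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripHtmlExtension {
    // group 1: href= plus the opening quote
    // group 2: a path starting with a single '/', without the extension
    // group 3: the closing quote
    private static final Pattern RELATIVE_HTML_LINK =
            Pattern.compile("(href=[\"'])(/(?!/)[^\"']*?)\\.html([\"'])");

    static String stripHtmlExtension(String htmlBody) {
        Matcher m = RELATIVE_HTML_LINK.matcher(htmlBody);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Rebuild the attribute without ".html". quoteReplacement keeps
            // '$' and '\' in the link from being read as group references.
            String updated = m.group(1) + m.group(2) + m.group(3);
            m.appendReplacement(sb, Matcher.quoteReplacement(updated));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"/docs/a.html\">x</a> "
                + "<a href='http://example.com/b.html'>y</a>";
        System.out.println(stripHtmlExtension(html));
    }
}
```

Links that start with a scheme such as http:// are left alone because the character after the opening quote is not "/".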
To begin with, the XML file is 2.84 GB, and neither a SAX nor a DOM parser seems to work. I have already tried them, and they crash every time. So I chose to read the file with a BufferedReader and extract the data I want, parsing the XML file as if it were plain text.
XML file (small part):
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2019-11-22.dtd">
<dblp>
<phdthesis mdate="2016-05-04" key="phd/dk/Heine2010">
<author>Carmen Heine</author>
<title>Modell zur Produktion von Online-Hilfen.</title>
<year>2010</year>
<school>Aarhus University</school>
<pages>1-315</pages>
<isbn>978-3-86596-263-8</isbn>
<ee>http://d-nb.info/996064095</ee>
</phdthesis><phdthesis mdate="2020-02-12" key="phd/Hoff2002">
<author>Gerd Hoff</author>
<title>Ein Verfahren zur thematisch spezialisierten Suche im Web und seine Realisierung im Prototypen HomePageSearch</title>
<year>2002</year>
From that XML file I want to retrieve the data between the <year> tags. I also used Pattern and Matcher with a regex to find the information I want. My code so far:
public class Publications {
public static void main(String[] args) throws IOException {
File file = new File("dblp-2020-04-01.xml");
FileInputStream fileStream = new FileInputStream(file);
InputStreamReader input = new InputStreamReader(fileStream);
BufferedReader reader = new BufferedReader(input);
String line;
String regex = "\\d+";
// Reading line by line from the
// file until a null is returned
while ((line = reader.readLine()) != null) {
final Pattern pattern = Pattern.compile("<year>(.+?)</year>", Pattern.DOTALL);
final Matcher matcher = pattern.matcher("<year>"+regex+"</year>");
matcher.find();
System.out.println(matcher.group(1)); // Prints String I want to extract
}
}
}
After compiling, the results aren't what I expected. Instead of printing the exact year every time the parser finds the ... tag, the output is the following:
\d+
\d+
\d+
\d+
\d+
\d+
\d+
\d+
\d+
\d+
Any suggestions?
Please don't try parsing XML using regular expressions. We get hundreds of questions on this forum from people trying to generate XML in peculiar formats because that's the only thing the receiving application can handle, and the reason the receiving application has such restrictions is that it's trying to do the XML parsing "by hand". You're storing up trouble for yourself, for the people you want to exchange data with, and for the people on StackOverflow that you will turn to for help when it all goes pear-shaped. XML standards exist for a reason, and work very well when everyone conforms to them.
The right approach in this case is a streaming XML approach, using SAX, StAX, or streaming XSLT 3.0, and you've abandoned those approaches for completely spurious reasons.
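A minimal StAX sketch of the streaming approach might look like the following. It is only an illustration: the class name YearExtractor and the tiny inline document are made up, and the real dblp file references a DTD, so the parser would also need to find that DTD and may need its entity limits adjusted, which is not shown here.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class YearExtractor {
    // Streams through the document and hands the text of every <year>
    // element to the sink; the whole file is never held in memory.
    static void forEachYear(Reader in, Consumer<String> sink) {
        try {
            XMLStreamReader xr = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (xr.hasNext()) {
                if (xr.next() == XMLStreamConstants.START_ELEMENT
                        && "year".equals(xr.getLocalName())) {
                    // getElementText() reads up to the matching end tag
                    sink.accept(xr.getElementText());
                }
            }
            xr.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<dblp><phdthesis><author>Carmen Heine</author>"
                + "<year>2010</year></phdthesis>"
                + "<phdthesis><author>Gerd Hoff</author>"
                + "<year>2002</year></phdthesis></dblp>";
        forEachYear(new StringReader(xml), System.out::println);
    }
}
```

For a file on disk, the StringReader would be replaced by a reader or input stream over the file.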
Remark
Regexes are the wrong tool to extract information from XML (or similar structured formats). The general approach is not recommended. For the right way to handle it, see Michael Kay's answer.
Answer
You provide the wrong argument in constructing the matcher. Instead of the expression in your code you need to provide the current line:
// ...
final Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
System.out.println(matcher.group(1)); // Prints String I want to extract
}
// ...
Note the extra conditional to check whether the current line does match at all.
Also note that the pattern you match against is defined in the call to Pattern.compile. Thus, to match only <year> tags that contain numerical values, that line has to be changed to
final Pattern pattern = Pattern.compile("<year>(" + regex + ")</year>", Pattern.DOTALL);
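Putting both corrections together, a self-contained version could look roughly like this. It is a sketch, with the caveat from the remark still applying; the class name YearLines is made up, and it reads from a StringReader so the example does not depend on the dblp file being present.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YearLines {
    // Compiled once, outside the loop; only numeric years are captured.
    private static final Pattern YEAR = Pattern.compile("<year>(\\d+)</year>");

    static List<String> years(BufferedReader reader) {
        List<String> found = new ArrayList<>();
        reader.lines().forEach(line -> {
            Matcher matcher = YEAR.matcher(line); // match the line, not the regex text
            while (matcher.find()) {              // lines without a <year> tag are skipped
                found.add(matcher.group(1));
            }
        });
        return found;
    }

    public static void main(String[] args) {
        String sample = "<author>Carmen Heine</author>\n"
                + "<year>2010</year>\n"
                + "<school>Aarhus University</school>\n"
                + "<year>2002</year>\n";
        System.out.println(years(new BufferedReader(new StringReader(sample))));
    }
}
```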
I have a string that contains file names like:
"file1.txt file2.jpg tricky file name.txt other tricky filenames containing áéíőéáóó.gif"
How can I get the file names, one by one?
I am looking for the safest, most thorough method, preferably something standard in Java. There has got to be a regular expression for this already out there; I am counting on your experience.
Edit: expected results:
"file1.txt", "file2.jpg", "tricky file name.txt", "other tricky filenames containing áéíőéáóó.gif"
Thanks for the help,
Sziro
The regular expression that enrico.bacis suggested, (\S.*?\.\S+), will not work if there are file names with no characters before the ".", such as .project.
A correct pattern would be:
(([^ .]+ +)*\S*\.\S+)
A Java program that extracts the file names could look like this:
String patternStr = "([^ .]+ +)*\\S*\\.\\S+";
String input = "file1.txt .project file2.jpg tricky file name.txt other tricky filenames containing áéíoéáóó.gif";
Pattern pattern = Pattern.compile(patternStr, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
If you want to use regular expressions you can find all the occurrences of:
(\S.*?\.\S+)
If there are spaces in the file names, it gets trickier.
If you can assume there are no dots (.) in the file names, you can use the dot to find each individual record, as has been suggested.
If you can't assume that, e.g. my file.new something.txt, I'd suggest you create a list of acceptable extensions, e.g. .doc, .jpg, .pdf etc.
I know the list may be long, so it's not ideal. Once you have it, you can look for these extensions and assume that whatever precedes one is a valid file name.
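The extension-list idea could be sketched as follows. The class name, the method signature and the sample extension list are assumptions made for illustration. A name whose extension is missing from the list is silently merged into the next file name, which is the main weakness of this approach.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class FileNameSplitter {
    static List<String> split(String input, List<String> extensions) {
        // Builds an alternation such as \Qtxt\E|\Qjpg\E from the allowed extensions.
        String alternation = extensions.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // A name starts at a non-space character and ends at the first
        // allowed extension that is followed by whitespace or the end of input.
        Pattern pattern = Pattern.compile(
                "\\S.*?\\.(?:" + alternation + ")(?=\\s|$)");
        List<String> names = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            names.add(matcher.group());
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = split("my file.new something.txt file2.jpg",
                Arrays.asList("txt", "jpg"));
        System.out.println(names); // [my file.new something.txt, file2.jpg]
    }
}
```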
String txt = "file1.txt file2.jpg tricky file name.txt other tricky filenames containing áéíőéáóó.gif";
Pattern pattern = Pattern.compile("\\S.*?\\.\\S+"); // Get regex from enrico.bacis
Matcher matcher = pattern.matcher(txt);
while (matcher.find()) {
System.out.println(matcher.group().trim());
}
I'm trying to build a Java regex to search a .txt file for a Windows formatted file path, however, due to the file path containing literal backslashes, my regex is failing.
The .txt file contains the line:
C\Windows\SysWOW64\ntdll.dll
However, some of the filenames in the text file are formatted like this:
C\Windows\SysWOW64\ntdll.dll (some developer stuff here...)
So I'm unable to use String.equals
To match this line, I'm using the regex:
String filename = "C\\Windows\\SysWOW64\\ntdll.dll";
String read = reader.readLine(); // reader is a BufferedReader over the .txt file
if (Pattern.compile(Pattern.quote(filename), Pattern.CASE_INSENSITIVE).matcher(read).find()) {
I've tried escaping the literal backslashes, using the replace method, i.e:
filename.replace("\\", "\\\\");
However, this fails to find a match. I'm guessing this is because I need to further escape the backslashes after the Pattern has been built, and I'm thinking I might need to escape up to an additional four backslashes, i.e.:
Pattern.replaceAll("\\\\", "\\\\\\\\");
However, each time I try, the pattern doesn't get matched. I'm certain it's a problem with the backslashes, but I'm not sure where to do the replacement, or if there's a better way of building the pattern.
I think the problem is further compounded because the replaceAll method also uses a regex, which means the pattern will have its own backslashes in there to deal with the case insensitivity.
Any input or advice would be appreciated.
Thanks
It seems like you're attempting a direct comparison of one String against another. For exact matches, you could use
if (read.equalsIgnoreCase(filename)) {
or simply
if (read.startsWith(filename)) {
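Note that startsWith is case-sensitive. If the case-insensitive behaviour from the question is needed, String.regionMatches with ignoreCase set to true is one option. The sketch below shows it next to the Pattern.quote variant from the question; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class PathMatch {
    // Case-insensitive prefix check without any regex.
    static boolean startsWithIgnoreCase(String line, String prefix) {
        return line.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    // Case-insensitive search anywhere in the line. Pattern.quote makes the
    // backslashes and the dot literal, so no manual escaping is needed.
    static boolean containsIgnoreCase(String line, String literal) {
        return Pattern.compile(Pattern.quote(literal), Pattern.CASE_INSENSITIVE)
                .matcher(line)
                .find();
    }

    public static void main(String[] args) {
        String filename = "C\\Windows\\SysWOW64\\ntdll.dll";
        String read = "c\\windows\\syswow64\\NTDLL.dll (some developer stuff here...)";
        System.out.println(startsWithIgnoreCase(read, filename)); // true
        System.out.println(containsIgnoreCase(read, filename));   // true
    }
}
```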
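Try this:
The lines read from the file do not need any replacement. A literal backslash is written as '\\' in a Java string literal and as '\\\\' when that string literal is a regex.
Then: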
String lLine = "C\\Windows\\SysWOW64\\ntdll.dll";
Pattern lPattern = Pattern.compile("C\\\\Windows\\\\SysWOW64\\\\ntdll\\.dll");
Matcher lMatcher = lPattern.matcher(lLine);
if(lMatcher.find()) {
System.out.println(lMatcher.group());
}
lLine = "C\\Windows\\SysWOW64\\ntdll.dll (some developer stuff here...)";
lMatcher = lPattern.matcher(lLine);
if(lMatcher.find()) {
System.out.println(lMatcher.group());
}
The correct usage will be:
String filename = "C\\Windows\\SysWOW64\\ntdll.dll";
String file = filename.replace('\\', ' ');
I have a text file like:
"GET /opacial/index.php?op=results&catalog=1&view=1&language=el&numhits=10&query=\xce\x95\xce\xbb\xce\xbb\xce\xac\xce\xb4\xce\xb1%20--%20\xce\x95\xce\xb8\xce\xbd\xce\xb9\xce\xba\xce\xad\xcf\x82%20\xcf\x83\xcf\x87\xce\xad\xcf\x83\xce\xb5\xce\xb9\xcf\x82%20--%20\xce\x99\xcf\x83\xcf\x84\xce\xbf\xcf\x81\xce\xaf\xce\xb1&search_field=11&page=1
I want to cut all the characters after the word "query" and before "&search".
I am trying to cut the data using patterns, but something is wrong. Can you give me an example for the sample text above?
EDIT:
Another problem, apart from the one above, is that a Matcher only works on a CharSequence, and I have a file, which cannot be cast to a CharSequence... :\
Something like this:
String yourNewText=yourOldText.split("query")[1].split("&search")[0];
?
To see how to read a file into a String, you can look here (there are different possibilities).
".*query\\=(.*)\\&search_field.*"
This regex should work to give you a capture of what you want to remove. Then String.replace should do the trick.
Edit - response to comment. The following code...
String s = "GET /opacial/index.php?op=results&catalog=1&view=1&language=el&numhits=10&query=\\xce\\x95\\xce\\xbb\\xce\\xbb\\xce\\xac\\xce\\xb4\\xce\\xb1%20--%20\\xce\\x95\\xce\\xb8\\xce\\xbd\\xce\\xb9\\xce\\xba\\xce\\xad\\xcf\\x82%20\\xcf\\x83\\xcf\\x87\\xce\\xad\\xcf\\x83\\xce\\xb5\\xce\\xb9\\xcf\\x82%20 --%20\\xce\\x99\\xcf\\x83\\xcf\\x84\\xce\\xbf\\xcf\\x81\\xce\\xaf\\xce\\xb1&search_field=11&page=1";
Pattern p = Pattern.compile(".*query\\=(.*)\\&search_field.*");
Matcher m = p.matcher(s);
if (m.matches()){
String betweenQueryAndSearch = m.group(1);
System.out.println(betweenQueryAndSearch);
}
Produced the following output....
\xce\x95\xce\xbb\xce\xbb\xce\xac\xce\xb4\xce\xb1%20--%20\xce\x95\xce\xb8\xce\xbd\xce\xb9\xce\xba\xce\xad\xcf\x82%20\xcf\x83\xcf\x87\xce\xad\xcf\x83\xce\xb5\xce\xb9\xcf\x82%20 --%20\xce\x99\xcf\x83\xcf\x84\xce\xbf\xcf\x81\xce\xaf\xce\xb1
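To apply the same idea to a file rather than a single String, each line can be read and matched in turn, since every line returned by a BufferedReader is a String and therefore a CharSequence. The following is a sketch under that assumption; the class name QueryExtractor is made up, and it uses find() with a shorter pattern instead of matches().

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QueryExtractor {
    // Captures everything between "query=" and the next "&search_field".
    private static final Pattern QUERY =
            Pattern.compile("query=(.*?)&search_field");

    static List<String> extract(BufferedReader reader) {
        List<String> queries = new ArrayList<>();
        reader.lines().forEach(line -> {
            Matcher m = QUERY.matcher(line);
            if (m.find()) { // lines without a query parameter are skipped
                queries.add(m.group(1));
            }
        });
        return queries;
    }

    public static void main(String[] args) {
        String log = "GET /opacial/index.php?op=results&numhits=10"
                + "&query=\\xce\\x95%20--%20\\xce\\x99&search_field=11&page=1\n"
                + "GET /opacial/index.php?op=home\n";
        System.out.println(extract(new BufferedReader(new StringReader(log))));
    }
}
```

For a real file, the reader could come from Files.newBufferedReader(path) instead of the StringReader used here.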