I am trying to parse this XML:
<?xml version="1.0" encoding="UTF-8"?>
<veranstaltungen>
<veranstaltung id="201611211500#25045271">
<titel>Mal- und Zeichen-Treff</titel>
<start>2016-11-21 15:00:00</start>
<veranstaltungsort id="20011507">
<name>Freizeitclub - ganz unbehindert </name>
<anschrift>Macht los e.V.
Lipezker Straße 48
03048 Cottbus
</anschrift>
<telefon>xxxx xxxx </telefon>
<fax>0355 xxxx</fax>
[...]
</veranstaltungen>
As you can see, some of the texts have whitespace or even linebreaks. I am having issues with the text from the node anschrift, because I need to find the right location data in a database. Problem is, the returned String is:
Macht los e.V.Lipezker Straße 4803048 Cottbus
instead of:
Macht los e.V. Lipezker Straße 48 03048 Cottbus
I know the correct way to parse it should be with normalize-space() but I cannot quite work out how to do it. I tried this:
// Does not work; afaik because XPath 1.0 normalize-space() only handles the first node
xPath.compile("normalize-space(veranstaltungen/veranstaltung[position()=1]/veranstaltungsort/anschrift/text())");
// Does not work either
xPath.compile("veranstaltungen/veranstaltung[position()=1]/veranstaltungsort[normalize-space(anschrift/text())]");
I also tried the solution given here: xpath-normalize-space-to-return-a-sequence-of-normalized-strings
xPathExpression = xPath.compile("veranstaltungen/veranstaltung[position()=1]/veranstaltungsort");
NodeList result = (NodeList) xPathExpression.evaluate(doc, XPathConstants.NODESET);
String normalize = "normalize-space(.)";
xPathExpression = xPath.compile(normalize);
int length = result.getLength();
for (int i = 0; i < length; i++) {
System.out.println(xPathExpression.evaluate(result.item(i), XPathConstants.STRING));
}
System.out prints:
Macht los e.V.Lipezker Straße 4803048 Cottbus
What am I doing wrong?
Update
I have a workaround already, but this can't be the solution. The following few lines show how I put the String together from the HttpResponse:
try (BufferedReader reader = new BufferedReader(new InputStreamReader(response.getEntity().getContent(), Charset.forName(charset)))) {
final StringBuilder stringBuilder = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
// stringBuilder.append(line);
// WORKAROUND: Add a space after each line
stringBuilder.append(line).append(" ");
}
// Work with the read lines
}
I would rather have a solid solution.
Originally, you seem to be using the following code for reading the XML:
try (BufferedReader reader = new BufferedReader(new InputStreamReader(response.getEntity().getContent(), Charset.forName(charset)))) {
final StringBuilder stringBuilder = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
stringBuilder.append(line);
}
}
This is where your newlines get eaten: readLine() does not return the trailing line-termination characters. If you then parse the contents of the stringBuilder object, you get a DOM whose text nodes no longer contain the original newlines from the XML.
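A minimal demonstration of the effect (the input string is illustrative): concatenating what readLine() returns joins the words across the former line break.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class ReadLineDemo {
    public static void main(String[] args) throws IOException {
        String xmlFragment = "Macht los e.V.\nLipezker Straße 48";
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(xmlFragment))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line); // the '\n' is silently dropped here
            }
        }
        System.out.println(sb); // prints "Macht los e.V.Lipezker Straße 48"
    }
}
```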
Thanks to Markus's help, I was able to solve the issue. The cause was the readLine() method of the BufferedReader discarding line breaks. The following code snippet works for me (maybe it can be improved):
public Document getDocument() throws IOException, ParserConfigurationException, SAXException {
final HttpResponse response = getResponse(); // returns a HttpResonse
final HttpEntity entity = response.getEntity();
final Charset charset = ContentType.getOrDefault(entity).getCharset();
// Not 100% sure if I have to close the InputStreamReader. But I guess so.
try (InputStreamReader isr = new InputStreamReader(entity.getContent(), charset == null ? Charset.forName("UTF-8") : charset)) {
return documentBuilderFactory.newDocumentBuilder().parse(new InputSource(isr));
}
}
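Once the document is parsed straight from the stream like this, the whitespace survives and normalize-space() behaves as expected. A self-contained sketch (the XML is abbreviated from the question):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NormalizeSpaceDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<veranstaltungen><veranstaltung><veranstaltungsort>"
                + "<anschrift>Macht los e.V.\nLipezker Straße 48\n03048 Cottbus\n</anschrift>"
                + "</veranstaltungsort></veranstaltung></veranstaltungen>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        // normalize-space() collapses the preserved newlines into single spaces
        String anschrift = (String) xPath.evaluate(
                "normalize-space(veranstaltungen/veranstaltung[1]/veranstaltungsort/anschrift)",
                doc, XPathConstants.STRING);
        System.out.println(anschrift); // prints "Macht los e.V. Lipezker Straße 48 03048 Cottbus"
    }
}
```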
Related
I am trying, using a BufferedReader to count the appearances of a string inside a .txt file. I am using:
File file = new File(path);
int appearances = 0; // declared before the try so it is still in scope below
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String line;
    while ((line = br.readLine()) != null) {
        if (line.contains("Hello")) {
            appearances++;
        }
    }
} catch (IOException e) { // FileNotFoundException is an IOException, so one catch suffices
    e.printStackTrace();
}
System.out.println("Found " + appearances);
But the problem is that if my .txt file contains, for example, the string "Hello, world\nHello, Hello, world!" and "Hello" is to be found, then appearances becomes two instead of three, because each line is counted for at most one appearance of the string. How could I fix this? Thanks a lot
The simplest solution is to do
while ((line = br.readLine()) != null)
appearances += line.split("Hello", -1).length-1;
Note that, if instead of "Hello", you search for anything with regex-reserved characters, you should escape the string before splitting:
String escaped = Pattern.quote("Hello."); // avoid '.' special meaning in regex
while ((line = br.readLine()) != null)
appearances += line.split(escaped, -1).length-1;
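Assembled into a runnable sketch (class and method names are my own), the split approach counts every occurrence, including several on one line:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;

public class SplitCounter {
    public static int count(BufferedReader br, String needle) throws IOException {
        String escaped = Pattern.quote(needle); // guard against regex metacharacters
        int appearances = 0;
        String line;
        while ((line = br.readLine()) != null) {
            // splitting on the needle yields (occurrences + 1) pieces
            appearances += line.split(escaped, -1).length - 1;
        }
        return appearances;
    }

    public static void main(String[] args) throws IOException {
        String text = "Hello, world\nHello, Hello, world!";
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            System.out.println(count(br, "Hello")); // prints 3
        }
    }
}
```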
This is an efficient and correct solution:
String line;
int count = 0;
while ((line = br.readLine()) != null) {
    int index = -1;
    while ((index = line.indexOf("Hello", index + 1)) != -1) {
        count++;
    }
}
return count;
It walks through the line and looks for the next index, starting from the previous index+1.
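The same indexOf walk, wrapped into a self-contained method (names are my own) so it can be run directly:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class IndexOfCounter {
    public static int count(BufferedReader br, String needle) throws IOException {
        int count = 0;
        String line;
        while ((line = br.readLine()) != null) {
            int index = -1;
            // restart the search one position past the previous hit
            while ((index = line.indexOf(needle, index + 1)) != -1) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new StringReader("Hello, world\nHello, Hello, world!"))) {
            System.out.println(count(br, "Hello")); // prints 3
        }
    }
}
```

Note that, unlike the split version, this restarts one character after each hit, so it also counts overlapping matches.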
The problem with Peter's solution is that it is wrong (see my comment). The problem with TheLostMind's solution is that it creates a lot of new strings by replacement, which is an unnecessary performance cost.
A regex-driven version:
String line;
Pattern p = Pattern.compile(Pattern.quote("Hello")); // quotes in case you need 'Hello.'
int count = 0;
while ((line = br.readLine()) != null) {
    for (Matcher m = p.matcher(line); m.find(); count++) { }
}
return count;
I am now curious as to performance between this and gexicide's version - will edit when I have results.
EDIT: benchmarked by running 100 times on a ~800k log file, looking for strings that were found once at the start, once around middle-ish, once at the end, and several times throughout. Results:
IndexFinder: 1579ms, 2407200hits. // gexicide's code
RegexFinder: 2907ms, 2407200hits. // this code
SplitFinder: 5198ms, 2407200hits. // Peter Lawrey's code, after quoting regexes
Conclusion: for non-regex strings, the repeated-indexOf approach is fastest by a nice margin.
Essential benchmark code (log file from vanilla Ubuntu 12.04 installation):
public static void main(String ... args) throws Exception {
Finder[] fs = new Finder[] {
new SplitFinder(), new IndexFinder(), new RegexFinder()};
File log = new File("/var/log/dpkg.log.1"); // around 800k in size
Find test = new Find();
for (int i=0; i<100; i++) {
for (Finder f : fs) {
test.test(f, log, "2014"); // start
test.test(f, log, "gnome"); // mid
test.test(f, log, "ubuntu1"); // end
test.test(f, log, ".1"); // multiple; not at start
}
}
test.printResults();
}
while (line.contains("Hello")) { // loop while the line still contains "Hello"
    appearances++;
    line = line.replaceFirst("Hello", ""); // remove the first occurrence of "Hello"
}
I need to parse XML tags which are commented out, like
<DataType Name="SecureCode" Size="4" Type="NVARCHAR">
<!-- <Validation>
<Regex JavaPattern="^[0-9]*$" JSPattern="^[0-9]*$"/>
</Validation> -->
<UIType Size="4" UITableSize="4"/>
</DataType>
But all I found was setIgnoringComments(boolean)
Document doc = docBuilder.parse(new File(PathChecker.getDataTypesFile()));
docFactory.setIgnoringComments(true); // true or false, no difference
But it doesn't seem to change anything.
Is there any other way to parse these comments? I have to use DOM.
Regards
The method setIgnoringComments(true) removes comments from the DOM tree during parsing. With setIgnoringComments(false) you can get the comment text like this:
NodeList nl = doc.getDocumentElement().getChildNodes();
for (int i = 0; i < nl.getLength(); i++) {
    if (nl.item(i).getNodeType() == Node.COMMENT_NODE) {
        Comment comment = (Comment) nl.item(i);
        System.out.println(comment.getData());
    }
}
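If the goal is to work with the commented-out tags themselves, one possible approach (my sketch, not from the original answer) is to take comment.getData() and parse it as a small XML document of its own, wrapped in a dummy root in case the comment holds several top-level elements:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class CommentParser {
    public static void main(String[] args) throws Exception {
        // this would come from comment.getData() in the loop above
        String commentData = "<Validation>\n"
                + "<Regex JavaPattern=\"^[0-9]*$\" JSPattern=\"^[0-9]*$\"/>\n"
                + "</Validation>";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // wrap in a dummy root so the fragment is a well-formed document
        Document fragment = builder.parse(new InputSource(
                new StringReader("<wrapper>" + commentData + "</wrapper>")));
        Element regex = (Element) fragment.getDocumentElement()
                .getElementsByTagName("Regex").item(0);
        System.out.println(regex.getAttribute("JavaPattern")); // prints ^[0-9]*$
    }
}
```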
Since there seems to be no "regular way" of solving the problem, I've just stripped the comment markers:
try (BufferedReader br = new BufferedReader(new FileReader(new File(PathChecker.getDataTypesFile())));
     BufferedWriter bw = new BufferedWriter(new FileWriter(new File(PathChecker.getDataTypesFileWithoutComments())))) {
    String line;
    while ((line = br.readLine()) != null) {
        line = line.replace("<!--", "").replace("-->", "") + "\n";
        bw.write(line);
    }
}
I have an XML file that contains non-standard characters (like a weird "quote").
I read the XML using UTF-8 / ISO / ASCII and unmarshalled it:
BufferedReader br = new BufferedReader(new InputStreamReader(
(conn.getInputStream()),"ISO-8859-1"));
String output;
StringBuffer sb = new StringBuffer();
while ((output = br.readLine()) != null) {
//fetch XML
sb.append(output);
}
try {
jc = JAXBContext.newInstance(ServiceResponse.class);
Unmarshaller unmarshaller = jc.createUnmarshaller();
ServiceResponse OWrsp = (ServiceResponse) unmarshaller
.unmarshal(new InputSource(new StringReader(sb.toString())));
I have an Oracle function that takes ISO-8859-1 codes and converts/maps them to "literal" symbols, i.e. "’" => left single quote.
JAXB unmarshalling using ISO displays the characters after the ISO conversion fine, i.e. all the weird single quotes are encoded as "’".
So suppose my string is: class of 10–11‐year‐olds (note the weird dash between 11 and year)
jc = JAXBContext.newInstance(ScienceProductBuilderInfoType.class);
Marshaller m = jc.createMarshaller();
m.setProperty(Marshaller.JAXB_ENCODING, "ISO-8859-1");
//save a temp file
File file2 = new File("tmp.xml");
this will save to the file:
class of 10–11‐year‐olds. (what I want...so file saving works!)
[side note: I have read the file using a Java FileReader, and it outputs the above string fine]
The issue I have is that the String representation from the JAXB unmarshaller has weird output; for some reason I cannot seem to get the string to contain –.
when I
1: check the xml unmarshalled output:
class of 10?11?year?olds
2: the File output:
class of 10–11‐year‐olds
I even tried to read back the saved XML file and then unmarshal that (in hopes of getting the – into my string):
String sCurrentLine;
BufferedReader br = new BufferedReader(new FileReader("tmp.xml"));
StringBuffer sb = new StringBuffer();
while ((sCurrentLine = br.readLine()) != null) {
sb.append(sCurrentLine);
}
ScienceProductBuilderInfoType rsp = (ScienceProductBuilderInfoType) unm
.unmarshal(new InputSource(new StringReader(sb.toString())));
To no avail. Any ideas how to get the ISO-8859-1 encoded character in JAXB?
Solved, using this tidbit of code found on Stack Overflow:
final class HtmlEncoder {
private HtmlEncoder() {}
public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
T out) throws java.io.IOException {
for (int i = 0; i < sequence.length(); i++) {
char ch = sequence.charAt(i);
if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
out.append(ch);
} else {
int codepoint = Character.codePointAt(sequence, i);
// handle supplementary range chars
i += Character.charCount(codepoint) - 1;
// emit entity
out.append("&#x");
out.append(Integer.toHexString(codepoint));
out.append(";");
}
}
return out;
}
}
StringBuilder out = new StringBuilder();
HtmlEncoder.escapeNonLatin("class of 10–11‐year‐olds", out); // the method takes the Appendable as second argument
// out.toString() -> "class of 10&#x2013;11&#x2010;year&#x2010;olds"
I'm calling grep in java to separately count the number of a list of words in a corpus.
String l;
BufferedReader fb = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("french.txt"), "UTF-8"));
while ((l = fb.readLine()) != null) {
    String lpt = "\\b" + l + "\\b";
    String[] args = new String[]{"grep", "-ic", lpt, corpus};
    Process grep = Runtime.getRuntime().exec(args);
    grep.waitFor();
    BufferedReader grepInput = new BufferedReader(new InputStreamReader(grep.getInputStream()));
    int tmp = Integer.parseInt(grepInput.readLine());
    System.out.println(l + "\t" + tmp);
}
This works well for my English word list and corpus. But I also have a French word list and corpus, and there it doesn't work; a sample of the output on the Java console looks like this:
� bord 0
� c�t� 0
The correct forms are "à bord" and "à côté".
Now my question is: where is the problem? Should I fix my Java code, or is it a grep issue?
If so, how do I fix it? (I also can't see French characters correctly on my terminal, even though I changed the encoding to UTF-8.)
The problem is in your design. Do not call grep from Java; use a pure-Java implementation instead: read the file line by line and implement your own "grep" with the plain Java API.
But seriously, I believe the problem is in your shell. Did you try to run grep manually and filter for French characters? I believe it will not work for you. It depends on your shell configuration and therefore on the platform. Java can provide a platform-independent solution; to achieve this you should avoid, as much as possible, non-pure-Java techniques, including executing command-line utilities.
BTW, code that reads your file line by line and uses String.contains() or pattern matching to filter lines is even shorter than code that runs grep.
I would suggest that you read the file line by line, then call split on the word boundary to get the number of words.
public static void main(String[] args) throws IOException {
final File file = new File("myFile");
try (final BufferedReader bufferedReader =
new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
String line;
while ((line = bufferedReader.readLine()) != null) {
final String[] words = line.split("\\b");
System.out.println(words.length + " words in line \"" + line + "\".");
}
}
}
This avoids calling grep from your program.
The odd characters you are getting may well be to do with using the wrong encoding. Are you sure your file is in UTF-8?
EDIT
OP wants to read one file line-by-line and then search for occurrences of the read line in another file.
This can still be done more easily using Java. Depending on how big your other file is, you can either read it into memory first and search it, or search it line by line as well.
A simple example reading the file into memory:
public static void main(String[] args) throws UnsupportedEncodingException, IOException {
final File corpusFile = new File("corpus");
final String corpusFileContent = readFileToString(corpusFile);
final File file = new File("myEngramFile");
try (final BufferedReader bufferedReader =
new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
String line;
while ((line = bufferedReader.readLine()) != null) {
final int matches = countOccurencesOf(line, corpusFileContent);
System.out.println(line + "\t" + matches);
}
}
}
private static String readFileToString(final File file) throws IOException {
final StringBuilder stringBuilder = new StringBuilder();
try (final FileChannel fc = new RandomAccessFile(file, "r").getChannel()) {
final ByteBuffer byteBuffer = ByteBuffer.allocate(4096);
final CharsetDecoder charsetDecoder = Charset.forName("UTF-8").newDecoder();
while (fc.read(byteBuffer) > 0) {
byteBuffer.flip();
stringBuilder.append(charsetDecoder.decode(byteBuffer));
byteBuffer.clear(); // reset() would throw without a mark; clear() readies the buffer for the next read
}
}
return stringBuilder.toString();
}
private static int countOccurencesOf(final String countMatchesOf, final String inString) {
final Matcher matcher = Pattern.compile("\\b" + countMatchesOf + "\\b").matcher(inString);
int count = 0;
while (matcher.find()) {
++count;
}
return count;
}
This should work fine if your "corpus" file is less than a hundred megabytes or so. Any bigger and you will want to change the "countOccurencesOf" method to something like this
private static int countOccurencesOf(final String countMatchesOf, final File inFile) throws IOException {
final Pattern pattern = Pattern.compile("\\b" + countMatchesOf + "\\b");
int count = 0;
try (final BufferedReader bufferedReader =
new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF-8"))) {
String line;
while ((line = bufferedReader.readLine()) != null) {
final Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
++count;
}
}
}
return count;
}
Now you would just pass your "File" object into the method rather than the stringified file.
Note that the streaming approach reads the file line by line and hence drops the line breaks; you need to add them back before matching if your Pattern relies on them being there.
I am doing my first Android app and I have to fetch the source code of an HTML page.
Currently I am doing this:
private class NetworkOperation extends AsyncTask<Void, Void, String > {
protected String doInBackground(Void... params) {
try {
URL oracle = new URL("http://www.nationalleague.ch/NL/fr/");
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
String s1 = "";
while ((inputLine = in.readLine()) != null)
s1 = s1 + inputLine;
in.close();
//return
return s1;
}
catch (IOException e) {
e.printStackTrace();
}
return null;
}
but the problem is that it takes too much time. How can I take, for example, the HTML from line 200 to line 300?
Sorry for my bad English :$
Best case: instead of readLine(), use read(char[] cbuf, int off, int len). Another, dirty way:
int i = 0;
while ((inputLine = in.readLine()) != null) {
    i++;
    if (i >= 200 && i <= 300) {
        // DO SOMETHING
    }
}
in.close();
You get the HTML document through HTTP, and HTTP usually relies on TCP. So... you can't just "skip lines"! The server will always send you all the data preceding the portion you are interested in, and your side of the connection must acknowledge receiving it.
Do not read line by line [use read(char[] cbuf, int off, int len)]
Do not concatenate Strings [use a StringBuilder]
Open The buffered reader (much like you already do):
URL oracle = new URL("http://www.nationalleague.ch/NL/fr/");
BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
Instead of reading line by line, read into a char[] (I would use one of about 8192 chars) and then use a StringBuilder to append all the read chars.
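Putting those two pieces of advice together, a sketch (readAll and PageFetcher are my names): bulk read(char[], int, int) calls into a StringBuilder instead of readLine() plus string concatenation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.net.URL;

public class PageFetcher {
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192]; // bulk reads avoid the per-line overhead
        int read;
        while ((read = reader.read(buf, 0, buf.length)) != -1) {
            sb.append(buf, 0, read);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // for the question's page you would pass:
        // new BufferedReader(new InputStreamReader(new URL("http://www.nationalleague.ch/NL/fr/").openStream()))
        String s = readAll(new StringReader("line one\nline two"));
        System.out.println(s.length()); // prints 17
    }
}
```

Unlike readLine(), this also keeps the line terminators, so any later line counting can be done on the buffered characters.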
Reading specific lines of the HTML source seems a little risky because the formatting of the page's source code may change.