using a regex in jsoup - java

I'm trying my first serious project in jsoup and I've gotten stuck on this:
I'm trying to get zipcodes from a site. There is a list of zipcodes.
Here is one of the lines that contains a zipcode:
<td align="center">33011</td>
So my idea is to go through the page and get all the strings that consist of exactly five digits (0-9). The regex is ^[0-9]{5,5}$
The code was:
doc.select("td:matchesOwn(^[0-9]{5,5}$)");
but nothing came out. I can't find a way to get these zipcodes out of that site.
Does anyone know how to do it?
The real question here is how do I get the numbers that are not wrapped in any further tags but are just written out in the open as the element's own text (I guess there is a term for that, text nodes I think, but I'm not that good with XML terms).

I solved it using Element#getElementsMatchingOwnText:
public static void main(String[] args) {
    final String html = "<td align=\"center\">33011</td> ";
    final Elements elements = Jsoup.parse(html).getElementsMatchingOwnText("^[0-9]{5,5}$");
    for (final Element element : elements) {
        System.out.println("element = [" + element + "]");
        System.out.println("zip = [" + element.text() + "]");
    }
}
Output:
element = [33011]
zip = [33011]
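For what it's worth, the original td:matchesOwn selector does work once the <td> is parsed as part of a table; a bare <td> outside a <table> is dropped by the HTML parser (see the last question on this page). A minimal sketch, assuming the zipcodes really sit in table cells like the line quoted above:
public static void main(String[] args) {
    // the <td> must sit inside a <table>, otherwise jsoup's HTML parser drops the tag
    final Document doc = Jsoup.parse("<table><tr><td align=\"center\">33011</td></tr></table>");
    final Elements zips = doc.select("td:matchesOwn(^[0-9]{5}$)");
    for (final Element td : zips) {
        System.out.println("zip = [" + td.text() + "]"); // prints: zip = [33011]
    }
}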

Related

Can't understand why my regex in Java doesn't work

I'm trying to find the pieces of text on a fetched web page that lie between the 'align="left">\n' and '</form>\n</td>' substrings.
I wrote a regex:
(align=\"left\">\\n)(?<part>.*?)(<\/form>\\n<\/td>)
and tested it at https://www.freeformatter.com/java-regex-tester.html where it works just as I need.
But in the Java code it can't find anything.
My test code that I'm trying to get working:
String frontPage = "<html>\n<head>\n<title>Hello</title>\n</head>\n" +
        "<body>\n<table>\n<tr align=\"left\">\n" +
        "<td>Hello \n<form>\n<input type=\"submit\" value=\"ok\">\n" +
        "</form>\n</td>\n" +
        "<td>World \n<form>\n<input type=\"submit\" value=\"ok\">\n" +
        "</form>\n</td>\n" +
        "</tr>\n</table>\n</body>\n</html>";
java.util.regex.Pattern p =
        java.util.regex.Pattern.compile(
                "(align=\"left\">\\n)(?<part>.*?)(<\\/form>\\n<\\/td>)");
java.util.regex.Matcher m = p.matcher(frontPage);
List<String> parts = new ArrayList<>();
while (m.find()) {
    parts.add(m.group("part"));
}
if (parts.isEmpty()) {
    System.out.println("No page parts found");
} else {
    System.out.println("Something matches at least");
}
It finds matches when only the first two groups are specified, but as soon as I add even a simple (form) sequence to the last group it stops matching anything, and I can't figure out why.
Add the DOTALL flag to the compile call, so that . also matches line terminators (the text captured by the part group spans several \n characters). Like this:
java.util.regex.Pattern.compile(
        "(align=\"left\">\\n)(?<part>.*?)(<\\/form>\\n<\\/td>)",
        java.util.regex.Pattern.DOTALL
);
See it here at ideone.
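For reference, the same fix can be written with the inline (?s) flag instead of passing the DOTALL constant; the \/ escapes are not needed in a Java regex, so they are dropped here. A short sketch, assuming the frontPage string from the question is in scope:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// (?s) switches on DOTALL from inside the pattern, so . also matches \n
Pattern p = Pattern.compile("(?s)(align=\"left\">\\n)(?<part>.*?)(</form>\\n</td>)");
Matcher m = p.matcher(frontPage);
while (m.find()) {
    System.out.println("part = [" + m.group("part") + "]");
}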

JSoup get specific data from webpage

I've been trying to get data from: http://www.betvictor.com/sports/en/to-lead-anytime, where I would like to get the list of matches using JSoup.
For example:
Caen v AS Saint Etienne
Celtic v Rangers
and so on...
My current code is:
String couponPage = "http://www.betvictor.com/sports/en/to-lead-anytime";
Document doc1 = Jsoup.connect(couponPage).get();
String match = doc1.select("#coupon_143751140 > table:nth-child(3) > tbody > tr:nth-child(2) > td.event_description").text();
System.out.println("match:" + match);
Once I can figure out how to get one item of data, I will put it in a for loop to loop through the whole table, but first I need to get one item of data.
Currently, the output is "match: " so it looks like the "match" variable is empty.
Any help is most appreciated.
I have worked out how to answer my question after a few hours of experimenting. It turns out the page didn't load properly straight away, so I had to use the "timeout" method.
Document doc;
try {
    // the URL needs the http protocol prefix
    doc = Jsoup.connect("http://www.betvictor.com/sports/en/football/coupons/100/0/0/43438/0/100/0/0/0/0/1")
            .timeout(10000)
            .get();
    // each match description is a link inside a td.event_description cell
    Elements matches = doc.select("td.event_description a");
    for (Element match : matches) {
        // get the link text, e.g. "Caen v AS Saint Etienne"
        String matchEvent = match.text();
        String[] parts = matchEvent.split(" v ");
        String team1 = parts[0];
        String team2 = parts[1];
        System.out.println("text : " + team1 + " v " + team2);
    }
} catch (IOException e) {
    e.printStackTrace();
}
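Since split(" v ") assumes every description follows the "Team1 v Team2" pattern, a defensive variant of the loop body (purely an assumption on my part, not part of the original answer) avoids an ArrayIndexOutOfBoundsException on any row that doesn't:
// hypothetical guard: skip rows that are not in "Team1 v Team2" form
String[] parts = matchEvent.split(" v ");
if (parts.length == 2) {
    System.out.println("text : " + parts[0] + " v " + parts[1]);
} else {
    System.out.println("skipping unexpected description: " + matchEvent);
}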

Can I include white space between all html text() elements in Jsoup

I want to use Jsoup to extract all the text from an HTML page and return a single string of all the text without any HTML. The code I am using half works, but it has the effect of joining elements together, which breaks my keyword searches against the string.
This is the Java code I am using:
String resultText = scrapePage(htmldoc);

private String scrapePage(Document doc) {
    Element allHTML = doc.select("html").first();
    return allHTML.text();
}
Run against the following HTML:
<html>
<body>
<h1>Title</h1>
<p>here is para1</p>
<p>here is para2</p>
</body>
</html>
Outputting resultText gives "Titlehere is para1here is para2", meaning I can't search for the word "para1" because the string only contains "para1here".
I don't want to split the document into more elements than necessary (for example, selecting all h1 and p text elements), as there is such a wide range of tags I could be matching. For example, "data1data2" would come from:
<td>data1</td><td>data2</td>
Is there a way I can get all the text from the page but also include a space between the tags? I don't need to preserve whitespace otherwise (no need to keep line breaks etc.), since I am just preparing a keyword string and will probably collapse all other whitespace to a single space anyway.
I don't have this issue using JSoup 1.7.3.
Here's the full code I used for testing:
final String html = "<html>\n"
        + " <body>\n"
        + " <h1>Title</h1>\n"
        + " <p>here is para1</p>\n"
        + " <p>here is para2</p>\n"
        + " </body>\n"
        + "</html>";
Document doc = Jsoup.parse(html);
Element element = doc.select("html").first();
System.out.println(element.text());
And the output:
Title here is para1 here is para2
Can you run my code? Also update to a newer version of jsoup if you don't have 1.7.3 yet.
The previous answer is not right: it only works thanks to the "\n" added to the end of each line of the test string, but real HTML may not have a line break at the end of every line...
void example2text() throws IOException {
    String url = "http://www.example.com/";
    // download the raw HTML as-is
    String out = new Scanner(new URL(url).openStream(), "UTF-8").useDelimiter("\\A").next();
    org.jsoup.nodes.Document doc = Jsoup.parse(out);
    String text = "";
    // visit every element and collect its own text nodes, separated by single spaces
    Elements tags = doc.select("*");
    for (Element tag : tags) {
        for (TextNode tn : tag.textNodes()) {
            String tagText = tn.text().trim();
            if (tagText.length() > 0) {
                text += tagText + " ";
            }
        }
    }
    System.out.println(text);
}
Based on this answer: https://stackoverflow.com/a/35798214/4642669
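Another option, sketched here under the assumption that a jsoup version with the static NodeTraversor.traverse(NodeVisitor, Node) overload is available (older versions use new NodeTraversor(visitor).traverse(doc) instead): walk every text node in one pass and join the non-empty pieces with single spaces, using a StringBuilder rather than string concatenation. The helper name textWithSpaces is just illustrative:
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

static String textWithSpaces(Document doc) {
    final StringBuilder sb = new StringBuilder();
    NodeTraversor.traverse(new NodeVisitor() {
        @Override
        public void head(Node node, int depth) {
            // append each non-empty text node followed by a single space
            if (node instanceof TextNode) {
                String t = ((TextNode) node).text().trim();
                if (!t.isEmpty()) {
                    sb.append(t).append(' ');
                }
            }
        }

        @Override
        public void tail(Node node, int depth) {
            // nothing to do on the way back up the tree
        }
    }, doc);
    return sb.toString().trim();
}
Run against the sample HTML from the question, this returns "Title here is para1 here is para2".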

Trim() in Java not working the way I expect? [duplicate]

Possible Duplicate:
Query about the trim() method in Java
I am parsing a site's usernames and other information, and each one has a bunch of spaces after it (as well as the normal spaces between the words).
For example: "Bob the Builder " or "Sam the welder ". The number of trailing spaces varies from name to name. I figured I'd just use .trim(), since I've used it before.
However, it's giving me trouble. My code looks like this:
for (int i = 0; i < splitSource3.size(); i++) {
    splitSource3.set(i, splitSource3.get(i).trim());
}
The result is just the same; no spaces are removed at the end.
Thank you in advance for your excellent answers!
UPDATE:
The full code is a bit more complicated, since there are HTML tags that are parsed out first. It goes exactly like this:
for (String s : splitSource2) {
    if (s.length() > "<td class=\"dddefault\">".length()
            && s.substring(0, "<td class=\"dddefault\">".length()).equals("<td class=\"dddefault\">")) {
        splitSource3.add(s.substring("<td class=\"dddefault\">".length()));
    }
}
System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
    splitSource3.set(i, splitSource3.get(i).substring(0, splitSource3.get(i).length() - 5));
    splitSource3.set(i, splitSource3.get(i).trim());
    System.out.println(i + ": " + splitSource3.get(i));
}
UPDATE:
Calm down. I never said the fault lay with Java, and I never said it was a bug or broken or anything. I simply said I was having trouble with it and posted my code so you could help solve my issue. Note the phrase "my issue", not "Java's issue". I have actually had the code print out
System.out.println(i + ": " + splitSource3.get(i) + "*");
in a for each loop afterward.
This is how I knew I had a problem.
By the way, the problem has still not been fixed.
UPDATE:
Sample output (minus single quotes):
'0: Olin D. Kirkland                                          '
'1: Sophomore                                          '
'2: Someplace, Virginia  12345<br />VA SomeCity<br />'
'3: Undergraduate                                          '
EDIT: the OP rephrased his question at Query about the trim() method in Java, where the issue turned out to be Unicode whitespace characters that String.trim() does not remove.
It just occurred to me that I used to have this sort of issue when I worked on a screen-scraping project. The key is that downloaded HTML sources sometimes contain non-printable characters that are not whitespace either. These are very difficult to copy and paste into a browser. I assume this could have happened to you.
If my assumption is correct then you've got two choices:
Use a binary reader to figure out what those characters are, and delete them with String.replace(). E.g.:
private static String cutCharacters(String fromHtml) {
    String result = fromHtml;
    char[] problematicCharacters = {'\000', '\001', '\003'}; // this could be a private static final constant too
    for (char ch : problematicCharacters) {
        // String.replace has no (char, CharSequence) overload, so convert the char to a String first
        result = result.replace(String.valueOf(ch), "");
    }
    return result;
}
If you find some sort of recurring pattern in the HTML to be parsed, then you can use regexes and substrings to cut out the unwanted parts. E.g.:
private String getImportantParts(String fromHtml) {
    Pattern p = Pattern.compile("(\\w*\\s*)"); // this could be a private static final constant as well
    Matcher m = p.matcher(fromHtml);
    StringBuilder buff = new StringBuilder();
    while (m.find()) {
        buff.append(m.group(1));
    }
    return buff.toString().trim();
}
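Given the root cause noted in the edit above, a small hedged helper may be useful: String.trim() only removes characters with code points up to U+0020, so Unicode spaces such as the non-breaking space (U+00A0, HTML &nbsp;) survive it. A regex that also covers the Unicode separator category handles those cases; trimUnicode is just an illustrative name:
// trim() strips only chars <= U+0020; \p{Z} additionally matches Unicode space
// separators such as the non-breaking space that often shows up in scraped HTML
private static String trimUnicode(String s) {
    return s.replaceAll("^[\\s\\p{Z}]+|[\\s\\p{Z}]+$", "");
}
For example, trimUnicode("Bob the Builder\u00A0\u00A0") returns "Bob the Builder", whereas plain trim() would leave the two non-breaking spaces in place.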
Works without a problem for me.
Here is your code, refactored a bit and (maybe) more readable:
final String openingTag = "<td class=\"dddefault\">";
final String closingTag = "</td>";
List<String> splitSource2 = new ArrayList<String>();
splitSource2.add(openingTag + "Bob the Builder " + closingTag);
splitSource2.add(openingTag + "Sam the welder " + closingTag);
for (String string : splitSource2) {
    System.out.println("|" + string + "|");
}
List<String> splitSource3 = new ArrayList<String>();
for (String s : splitSource2) {
    if (s.length() > openingTag.length() && s.startsWith(openingTag)) {
        String nameWithoutOpeningTag = s.substring(openingTag.length());
        splitSource3.add(nameWithoutOpeningTag);
    }
}
System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
    String name = splitSource3.get(i);
    int closingTagBegin = splitSource3.get(i).length() - closingTag.length();
    String nameWithoutClosingTag = name.substring(0, closingTagBegin);
    String nameTrimmed = nameWithoutClosingTag.trim();
    splitSource3.set(i, nameTrimmed);
    System.out.println("|" + splitSource3.get(i) + "|");
}
I know that's not a real answer, but I cannot post comments and this code wouldn't fit in a comment, so I made it an answer so that Olin Kirkland can check his code.

Why Jsoup cannot select td element?

I made a little test (with Jsoup 1.6.1):
String s = "" +Jsoup.parse("<td></td>").select("td").size();
System.out.println("Selected elements count : " + s);
It outputs:
Selected elements count : 0
But it should return 1, because I parsed HTML containing a td element. What is wrong with my code, or is this a bug in Jsoup?
Because Jsoup is an HTML5-compliant parser and you fed it invalid HTML. A <td> has to go inside at least a <table>.
int size = Jsoup.parse("<table><td></td></table>").select("td").size();
System.out.println("Selected elements count : " + size);
String url = "http://foobar.com";
Document doc = Jsoup.connect(url).get();
Elements td = doc.select("td");
Jsoup 1.6.2 allows parsing with a different parser, and a simple XML parser is provided. With the following code I could solve my problem. You can later parse your fragment with the HTML parser to get valid HTML.
// Jsoup 1.6.2
String s = "" + Jsoup.parse("<td></td>", "", Parser.xmlParser()).select("td").size();
System.out.println("Selected elements count : " + s);
