Java regex, jsoup - java

How can I extract these values with regex or jsoup? 19040172b-1、SQL Server Develop、zheng、3-5,7-14、D-101
<div id="AE9D7F630640426F8457A661607D2B8E-5-2" style="display: none;" class="kbcontent">
19040172b-1
<br>SQL Server Develop
<br>
<font title="teacher">zheng</font>
<br>
<font title="week">3-5,7-14</font>
<br>
<font title="classroom">D-101</font>
<br>
</div>
I have tried the following two approaches, but both failed.
1. Pattern pattern = Pattern.compile(">(.*?)<br>");
2. Elements msg = doc.select(":matchesOwn([>.*?<br>])");

1) First, it's never a good idea to parse HTML with a regex. You can read more about that here.
2) You can just take all the text between the tags.
Document doc = Jsoup.parse(file, charsetName);
String text= doc.text();
System.out.println(text);

String html = "<div id=\"AE9D7F630640426F8457A661607D2B8E-5-2\" style=\"display: none;\" class=\"kbcontent\"> 19040172b-1 <br>SQL Server Develop <br> <font title=\"teacher\">zheng</font> <br> <font title=\"week\">3-5,7-14</font> <br> <font title=\"classroom\">D-101</font> <br> </div> ";
html = html.replaceAll("<br>", "#~#");
Document doc = Jsoup.parse(html);
String newHtml = doc.text();
String[] ary = newHtml.split("#~#");
This does the job, though there may be cleaner ways to handle the br tags. Note that the pieces in ary still carry surrounding whitespace, so trim them before use.
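The split result can be tidied with plain Java. The following is a minimal sketch, assuming the text returned by doc.text() looks like the sample string in main (that string is an assumption, not captured jsoup output); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class MarkerSplit {
    // Splits marker-delimited text, trims each piece and drops blank ones.
    static List<String> splitFields(String text, String marker) {
        List<String> fields = new ArrayList<>();
        // Pattern.quote keeps the marker from being interpreted as a regex.
        for (String piece : text.split(Pattern.quote(marker))) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                fields.add(trimmed);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Assumed shape of the text after the <br> tags were replaced by "#~#".
        String text = "19040172b-1 #~#SQL Server Develop #~# zheng #~# 3-5,7-14 #~# D-101 #~#";
        System.out.println(splitFields(text, "#~#"));
    }
}
```

Quoting the marker is not strictly needed for "#~#", but it keeps the helper safe if a marker containing regex metacharacters is chosen later.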

Related

Getting Russian input from the web into a Java application

I obviously am missing something here. I have a web app where the input for a form may be in English or, after a keyboard switch, Russian. The meta tag for the page is specifying that the page is UTF-8. That does not seem to matter.
If I type in "вв" (the Unicode character CYRILLIC SMALL LETTER VE, twice), what do I get?
A string. I call codePoints().toArray() and I get:
[208, 178, 208, 178]
If I call chars().toArray(), I get the same.
What the heck?
I am completely in control of the web page, but of course there will be different browsers. But how can I get something back from the web page that will let me get the proper cyrillic characters?
This is on java 1.8.0_312. I can upgrade some, but not all the way to the latest java.
The page is this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>Cards</title>
<link rel = "stylesheet" href = "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity = "sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin = "anonymous" />
<link rel = "stylesheet" href = "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap-theme.min.css" integrity = "sha384-rHyoN1iRsVXV4nD0JutlnGaslCJuC7uwjduW9SVrLvRYooPp2bWYgmgJQIXwl/Sp" crossorigin = "anonymous" />
<script src = "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity = "sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin = "anonymous">
</script>
<meta http-equiv = "Content-Type" content = "text/html; charset=UTF-8" />
<style>.table-nonfluid { width: auto !important; }</style>
</head>
<body>
<div style = "padding: 25px 25px 25px 25px;">
<h2 align = "center">Cards</h2>
<div style = "white-space: nowrap;">
Home
<div>
<form name="f_3_1" method="post" action="/cgi-bin/WebObjects/app.woa/wo/ee67KCNaHEiW1WdpdA8JIM/2.3.1">
<table class = "table" border = "1" style = "max-width: 50%; font-size: 300%; text-align: center;">
<tr>
<td>to go</td>
</tr>
<tr>
<td><input size="25" type="text" name="3.1.5.3.3" /></td>
</tr>
<td>
<input type="submit" value="Submit" name="3.1.5.3.5" /> Skip
</td>
</table>
<input type="hidden" name="wosid" value="ee67KCNaHEiW1WdpdA8JIM" />
</form>
</div>
</div>
</div>
</body>
</html>
Hm. Well, here is at least part of the story.
I have this code:
System.out.println("start: " + start);
int[] points = start.chars().toArray();
byte[] next = new byte[points.length];
int idx = 0;
System.out.print("fixed: ");
for (int p : points) {
next[idx] = (byte)(p & 0xff);
System.out.print(Integer.toHexString(next[idx]) + " ");
idx++;
}
System.out.println("");
The output is:
start: вв
fixed: ffffffd0 ffffffb2 ffffffd0 ffffffb2
And the UTF-8 encoding of "в" (U+0432), in hex, is d0 b2.
So, there it is. The question is, why is this not more easily accessible? Do I really have to put this together byte-pair by byte-pair?
If the string is already in UTF-8, as I think we can see it is, why does the codePoints() method not give us, you know, the codePoints?
Ok, so now I do:
new String(next, StandardCharsets.UTF_8);
and I get the proper string. But it still seems strange that codePoints() gives me an IntStream, but if you use these things as int values, it is broken.
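It turned out to be a problem with the frameworks I was using. I thought I was setting the request and response content type to UTF-8, but I was not. The UTF-8 bytes of the form input were apparently being decoded one byte per char (as ISO-8859-1 does), which is why codePoints() returned 208, 178 for each letter instead of 1074.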
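For reference, the byte-by-byte loop above can be written more compactly. This is only a workaround sketch, assuming the string was decoded as ISO-8859-1 when it should have been UTF-8; the proper fix is still to set the request encoding in the framework. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class CharsetRepair {
    // Recovers the raw bytes by re-encoding as ISO-8859-1, then decodes them as UTF-8.
    static String repair(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The chars 0xD0 0xB2 are what the UTF-8 bytes d0 b2 become after ISO-8859-1 decoding.
        String broken = "\u00d0\u00b2\u00d0\u00b2";
        String fixed = repair(broken);
        System.out.println(fixed.codePoints().toArray().length);       // 2
        System.out.println(Integer.toHexString(fixed.codePointAt(0))); // 432
    }
}
```

ISO-8859-1 maps every byte value to the char with the same number, so the round trip loses nothing. That is not true of every single-byte charset, so the trick only works when the wrong decoding really was ISO-8859-1.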

JSOUP - Help getting <IMG SRC> from <DIV CLASS>

I have the HTML snippet below. There are multiple divs with the class "teaser-img" throughout the document. I want to grab the "img src" value from every one of these "teaser-img" divs.
<div class="teaser-img">
<a href="/julien/blog/failure-consciousness-vs-success-consciousness-shifting-focus-become-badass-or-loser">
<img src="http://www.rsdnation.com/files/imagecache/blog_thumbnail/files/blog_thumbs/rsdnatonaustin.jpg" alt="" title=""/>
</a>
</div>
I have tried many things so I wouldn't know what code to share with you guys. Your help will be much appreciated.
final String html = "<div class=\"teaser-img\">\n"
+ " <a href=\"/julien/blog/failure-consciousness-vs-success-consciousness-shifting-focus-become-badass-or-loser\">\n"
+ " <img src=\"http://www.rsdnation.com/files/imagecache/blog_thumbnail/files/blog_thumbs/rsdnatonaustin.jpg\" alt=\"\" title=\"\"/>\n"
+ " </a>\n"
+ "</div>";
// Parse the html from string or eg. connect to a website using connect()
Document doc = Jsoup.parseBodyFragment(html);
for (Element element : doc.select("div.teaser-img img[src]")) {
// Prints the whole <img> element; use element.attr("src") to get only the URL
System.out.println(element);
}
Output:
<img src="http://www.rsdnation.com/files/imagecache/blog_thumbnail/files/blog_thumbs/rsdnatonaustin.jpg" alt="" title="">
See here for documentation about the selector syntax.

Add a new html tag to an html string in android

I have a string obtained from an EditText. The string contains html tags.
Spannable s = mainEditText.getText();
String webText = Html.toHtml(s);
The content of the string is:
<p dir="ltr">test</p>
<p dir="ltr"><img src="http://files.parsetfss.com/bcff7108-cbce-4ab8-b5d1-1f82827e6519/tfss-0de7a730-3fa9-4a1e-9f82-d34e4f6e2d31-file" /><br /></p>
<p dir="ltr"><img src="http://maps.google.com/maps/api/staticmap?center=22.572646,88.363895&zoom=15&size=960x540&sensor=false&markers=color:blue%7Clabel:!%7C22.572646,88.363895" /><br /> </p>
Now, wherever there is an img tag, I want to wrap it in a center tag.
What should I do to get the following output?
<p dir="ltr">test</p>
<p dir="ltr"><center><img src="http://files.parsetfss.com/bcff7108-cbce-4ab8-b5d1-1f82827e6519/tfss-0de7a730-3fa9-4a1e-9f82-d34e4f6e2d31-file" /></center><br /></p>
<p dir="ltr"><center><img src="http://maps.google.com/maps/api/staticmap?center=22.572646,88.363895&zoom=15&size=960x540&sensor=false&markers=color:blue%7Clabel:!%7C22.572646,88.363895" /></center><br /> </p>
Can a regex solve the issue or should it be done in a different way?
Can JSOUP help in any way? Is there any other type of HTML parser which can do the job?
(<img\s+[^>]*>)
You can try this. Replace with <center>$1</center>. See the demo.
http://regex101.com/r/sU3fA2/38
Something like
var re = /(<img\s+[^>]*>)/g;
var str = '<p dir="ltr">test</p> \n<p dir="ltr"><img src="http://files.parsetfss.com/bcff7108-cbce-4ab8-b5d1-1f82827e6519/tfss-0de7a730-3fa9-4a1e-9f82-d34e4f6e2d31-file" /><br /></p> \n<p dir="ltr"><img src="http://maps.google.com/maps/api/staticmap?center=22.572646,88.363895&zoom=15&size=960x540&sensor=false&markers=color:blue%7Clabel:!%7C22.572646,88.363895" /><br /> </p>';
var subst = '<center>$1</center>';
var result = str.replace(re, subst);
With Jsoup, you can use the wrap() method, which is available on both Element and Elements.
It would look like this :
public String wrapImgWithCenter(String html) {
Document doc = Jsoup.parse(html);
doc.getElementsByTag("img").wrap("<center></center>");
return doc.html();
}
I implemented the JSOUP solution suggested by mourphy, with one small change to the method, and it worked for me. The new method is:
public String wrapImgWithCenter(String html){
Document doc = Jsoup.parse(html);
doc.select("img").wrap("<center></center>");
return doc.html();
}
Thanks mourphy and vks for your help!
Using Regex, you could also do this in java:
String formatted = str.replaceAll("(<img\\s+[^>]*>)", "<center>$1</center>");
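As a runnable form of the one-liner above, here is a minimal sketch. The (?i) flag is an addition not in the original answers, so that <IMG ...> is matched as well; the class name, method name and sample markup are hypothetical. The pattern still assumes no attribute value contains a ">" character.

```java
final class CenterWrap {
    // Wraps every <img ...> tag in <center>...</center>; (?i) makes the match case-insensitive.
    static String wrapImages(String html) {
        return html.replaceAll("(?i)(<img\\s+[^>]*>)", "<center>$1</center>");
    }

    public static void main(String[] args) {
        String html = "<p dir=\"ltr\"><img src=\"a.png\" /><br /></p>";
        System.out.println(wrapImages(html));
    }
}
```

Calling the method twice on the same string wraps each image twice, so apply it only to HTML that has not been processed yet.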

Java-Jsoup, scrape html

I am using Jsoup with Java to parse an HTML file. How can I extract just the line that says "Hourly Rate: 23,016 orders"?
I am parsing a lot of files, so the number next to the Hourly Rate will change.
<html>
<head>
<title>Testing</title>
</head>
<body>
<p class=MsoNormal align=center style='background:#DEDEDF'>
<span style='font-size:18.0pt'><b>Testing</b></span></p>
Hourly Rate: 23,016 orders<br>
<table border=0 cellpadding=0>
<tr valign=top>
<td>
Thanks
I just added this code:
String HourlyRate = doc.body().ownText();
//String text = doc.body().text();
System.out.println(HourlyRate);
This printed:
Hourly Rate: 23,016 orders
Grab the element with the MsoNormal class, then use a regular expression to look for a number, e.g.:
Document doc = Jsoup.parse(htmlString);
Element msoNormal = doc.getElementsByClass("MsoNormal").first();
if (msoNormal != null) {
Pattern p = Pattern.compile("[0-9]+,[0-9]+");
// In the sample above the rate text follows the closing </p>,
// so doc.body().ownText() may be the text to search instead.
Matcher m = p.matcher(msoNormal.text());
if (m.find())
System.out.println(m.group());
}
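Once the line of text has been obtained (for example via doc.body().ownText() as shown earlier), the number can be pulled out with the standard library alone. This is a minimal sketch, assuming the text always has the form "Hourly Rate: <number> orders"; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HourlyRate {
    // Captures the digits (with optional thousands separators) between the label and "orders".
    private static final Pattern RATE =
            Pattern.compile("Hourly Rate:\\s*([0-9][0-9,]*)\\s*orders");

    // Returns the order count, or -1 if the text contains no hourly rate.
    static int parseRate(String text) {
        Matcher m = RATE.matcher(text);
        if (!m.find()) {
            return -1;
        }
        // Strip the thousands separators before parsing.
        return Integer.parseInt(m.group(1).replace(",", ""));
    }

    public static void main(String[] args) {
        System.out.println(parseRate("Hourly Rate: 23,016 orders")); // 23016
    }
}
```

Anchoring the pattern on the "Hourly Rate:" label avoids picking up an unrelated number elsewhere in the body text.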

java: regular expression

I have an HTML string that includes lots of image tags. I need to get each tag and change it. For example:
String imageRegex = "(<img.+(src=\".+\").+/>){1}";
String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />hello world<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
Matcher matcher = Pattern.compile(imageRegex, Pattern.CASE_INSENSITIVE).matcher(str);
int i = 0;
while (matcher.find()) {
i++;
Log.i("TAG", matcher.group());
}
The result is:
<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />hello world<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />
But that's not what I want. I want the result to be:
<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />
<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />
What's wrong with my regular expression?
Try (<img)(.*?)(/>), this should do the trick, although yes, you shouldn't use Regex for parsing HTML, as people will tell you over and over.
I don't have eclipse installed, but I have VS2010, and this works for me.
String imageRegex = "(<img)(.*?)(/>)";
String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />hello world<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
System.Text.RegularExpressions.MatchCollection match = System.Text.RegularExpressions.Regex.Matches(str, imageRegex, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
StringBuilder sb = new StringBuilder();
foreach (System.Text.RegularExpressions.Match m in match)
{
sb.AppendLine(m.Value);
}
System.Windows.MessageBox.Show(sb.ToString());
Result:
<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />
<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />
David M is correct, you really shouldn't try to do this, but your specific problem is that the + quantifier in your regex is greedy, so it will match the longest possible substring that could match.
See The regex tutorial for more details on the quantifiers.
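The difference between the greedy and the reluctant quantifier can be seen in Java directly. This is a minimal sketch using a shortened version of the string from the question; the class and method names are hypothetical, and the patterns are simplified to isolate the quantifier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyDemo {
    // Collects every match of the given regex in the input.
    static List<String> matches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String str = "<img src=\"9.gif\" alt=\"\" />hello world<img src=\"7.gif\" alt=\"\" />";
        // Greedy .+ runs to the last "/>", so both tags end up in one match.
        System.out.println(matches("<img.+/>", str).size());  // 1
        // Reluctant .+? stops at the first "/>", giving one match per tag.
        System.out.println(matches("<img.+?/>", str).size()); // 2
    }
}
```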
I would NOT recommend using regex for parsing HTML. Please consider Jsoup or a similar solution:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements images = doc.select("img");
Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp.
