I have this string
<div><img width="100px" src="http://www.mysite.com/Content/dataImages/news/small/some-pic.png" /><br />This is some text that I need to get.</div>
and i need to get the image link and the text This is some text that I need to get.from the string above in Java. Can anybody tell me how can I do this?
Use regex to get what you want.
If this is all you have to do there's no point in bringing in extra packages just use regex:
The pattern "(?<=src=\")(.*?)(?=\")" can be used to get the link, you can modify that to give you the text.
Try this, just change the patter if you must.
String str = "<div><img width=\"100px\" src=\"http://www.mysite.com/Content/dataImages/news/small/some-pic.png\" /><br />This is some text that I need to get.</div>";
Pattern p = Pattern.compile("src=\"(.*?)\" /><br />(.*?)</div>");
Matcher m = p.matcher(str);
if (m.find()) {
String link = m.group(1);
String text = m.group(2);
}
My solution was:
String tmp=xpp.nextText();
desc=android.text.Html.fromHtml(tmp).toString();
img=FindUrls.extractUrls(tmp);
for extracting the text from the string I used:
desc=android.text.Html.fromHtml(tmp).toString();
img=FindUrls.extractUrls(tmp);
and for the link inside the string I've used this function:
public static String extractUrls(String input) {
String result = null;
Pattern pattern = Pattern.compile(
"\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" +
"(\\w+:\\w+#)?(([-\\w]+\\.)+(com|org|net|gov" +
"|mil|biz|info|mobi|name|aero|jobs|museum" +
"|travel|[a-z]{2}))(:[\\d]{1,5})?" +
"(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +
"((\\?([-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
"([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +
"(&(?:[-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
"([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
"(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b");
Matcher matcher = pattern.matcher(input);
if (matcher.find()) {
result=matcher.group();
}
return result;
}
Hope It will help someone that has similar problem
Related
Im trying to retrieve data-product id from the String which goes like this:
<img class="lazy" src="/b/mp/img/svg/no_picture.svg" lazy-img="https://ecsmedia.pl/c/w-pustyni-i-w-puszczy-p-iext43240721.jpg" alt="">
The output should be
prod14290034
I tried to achieve this with a regular expression, but I'm beginner in it.
Is regular expression good for it? If so, how to do it?
/EDIT
According to Emma's comment.
I've made something like this:
String z = element.toString();
Pattern pattern = Pattern.compile("data-product-id=\"\\s*([^\\s\"]*?)\\s*\"");
Matcher matcher = pattern.matcher(z);
System.out.println(matcher.find());
if (matcher.find()) {
System.out.println(matcher.group());
}
it returns true, but dont print any value. Why?
You might use some HTML/XHTML/XML library which could transform your string data into document or at least Element and then you can easily obtain the attribute value from there. But if you want to use regex then you can try this snippet
#Test
public void productId() {
String src =
" <img class=\"lazy\" src=\"/b/mp/img/svg/no_picture.svg\" lazy-img=\"https://ecsmedia.pl/c/w-pustyni-i-w-puszczy-p-iext43240721.jpg\" alt=\"\"> ";
final Pattern pattern = Pattern.compile("(data-product-id=)\"(p[a-zA-Z]+[0-9]+)\"");
final Matcher matcher = pattern.matcher(src);
String prodId = null;
if (matcher.find()) {
System.out.println(matcher.groupCount());
prodId = matcher.group(2);
}
System.out.println(prodId);
Assert.assertNotNull(prodId);
Assert.assertEquals(prodId, "prod14290034");
}
You can use jsoup for Java - it is a library for parsing HTML pages. There are a lot of other libraries for different languages, beautifulSoup for python.
EDIT: Here is a snippet for jsoup, you can select any element with a tag, and then get needed attribute with attr method.
Document doc = Jsoup.parse(
"<a href=\"/w-pustyni-i-w-puszczy-sienkiewicz-henryk,prod14290034,ksiazka-p\" " +
"class=\"img seoImage\" " +
"title=\"W pustyni i w puszczy - Sienkiewicz Henryk\" " +
"rel=\"nofollow\" " +
"data-product-id=\"prod14290034\"> " +
"<img class=\"lazy\" src=\"/b/mp/img/svg/no_picture.svg\" lazy-img=\"https://ecsmedia.pl/c/w-pustyni-i-w-puszczy-p-iext43240721.jpg\" alt=\"\"> </a>\n"
);
String dataProductId = doc.select("a").first().attr("data-product-id");
I need to extract the values after :70: in the following text file using RegEx. Value may contain line breaks as well.
My current solution is to extract the string between :70: and : but this always returns only one match, the whole text between the first :70: and last :.
:32B:xxx,
:59:yyy
something
:70:ACK1
ACK2
:21:something
:71A:something
:23E:something
value
:70:ACK2
ACK3
:71A:something
How can I achive this using Java? Ideally I want to iterate through all values, i.e.
ACK1\nACK2,
ACK2\nACK3
Thanks :)
Edit: What I'm doing right now,
Pattern pattern = Pattern.compile("(?<=:70:)(.*)(?=\n)", Pattern.DOTALL);
Matcher matcher = pattern.matcher(data);
while (matcher.find()) {
System.out.println(matcher.group())
}
Try this.
String data = ""
+ ":32B:xxx,\n"
+ ":59:yyy\n"
+ "something\n"
+ ":70:ACK1\n"
+ "ACK2\n"
+ ":21:something\n"
+ ":71A:something\n"
+ ":23E:something\n"
+ "value\n"
+ ":70:ACK2\n"
+ "ACK3\n"
+ ":71A:something\n";
Pattern pattern = Pattern.compile(":70:(.*?)\\s*:", Pattern.DOTALL);
Matcher matcher = pattern.matcher(data);
while (matcher.find())
System.out.println("found="+ matcher.group(1));
result:
found=ACK1
ACK2
found=ACK2
ACK3
You need a loop to do this.
Pattern p = Pattern.compile(regexPattern);
List<String> list = new ArrayList<String>();
Matcher m = p.matches(input);
while (m.find()) {
list.add(m.group());
}
As seen here Create array of regex matches
How to edit this string and split it into two?
String asd = {RepositoryName: CodeCommitTest,RepositoryId: 425f5fc5-18d8-4ae5-b1a8-55eb9cf72bef};
I want to make two strings.
String reponame;
String RepoID;
reponame should be CodeCommitTest
repoID should be 425f5fc5-18d8-4ae5-b1a8-55eb9cf72bef
Can someone help me get it? Thanks
Here is Java code using a regular expression in case you can't use a JSON parsing library (which is what you probably should be using):
String pattern = "^\\{RepositoryName:\\s(.*?),RepositoryId:\\s(.*?)\\}$";
String asd = "{RepositoryName: CodeCommitTest,RepositoryId: 425f5fc5-18d8-4ae5-b1a8-55eb9cf72bef}";
String reponame = "";
String repoID = "";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(asd);
if (m.find()) {
reponame = m.group(1);
repoID = m.group(2);
System.out.println("Found reponame: " + reponame + " with repoID: " + repoID);
} else {
System.out.println("NO MATCH");
}
This code has been tested in IntelliJ and runs without error.
Output:
Found reponame: CodeCommitTest with repoID: 425f5fc5-18d8-4ae5-b1a8-55eb9cf72bef
Assuming there aren't quote marks in the input, and that the repository name and ID consist of letters, numbers, and dashes, then this should work to get the repository name:
Pattern repoNamePattern = Pattern.compile("RepositoryName: *([A-Za-z0-9\\-]+)");
Matcher matcher = repoNamePattern.matcher(asd);
if (matcher.find()) {
reponame = matcher.group(1);
}
and you can do something similar to get the ID. The above code just looks for RepositoryName:, possibly followed by spaces, followed by one or more letters, digits, or hyphen characters; then the group(1) method extracts the name, since it's the first (and only) group enclosed in () in the pattern.
I'm trying to extract the second url from Stings like these
submitted by thecrappycoder <br /> [link] [3 comments]
submitted by durdn <br /> [link] [1 comment]
by using regex. I tried this.
String regex = "\\(?\\b(http://|www[.])[-A-Za-z0-9+&##/%?=~_()|!:,.;]*[-A-Za-z0-9+&##/%=~_()|]";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(text);
while(m.find()) {
String urlStr = m.group();
urlStr = urlStr.substring(1, 3);
links.add(urlStr);
}
I also tried in that way
System.out.println(("http://"+text.split("http://")[1]).split("")[0]);
Unfortunately, I couldn't get it. Any help, thank you.
You can take the same approach with a simplified regex pattern:
String text = "submitted by thecrappycoder <br />" +
" [link] " +
"[3 comments]\n" +
" ";
String regex = "href=.(http.*?)\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(text);
m.find(); // ignore the 1st match
m.find(); // find the 2nd match
String urlStr = m.group(); // read the 2nd match
System.out.println("urlStr = " + urlStr); // prints: urlStr = http://blogs.msdn.com/b/bethmassi/archive/2015/02/25/understanding-net-2015.aspx
Can somebody tell me how to find E-Mail Adresses in a text?
Example text:
"Hey,
I just blahblah
E-Mail: lolcat#catinator.com
Another would be lolcat2#catinator.com"
So the output is:
lolcat#catinator.com
lolcat2#catinator.com
I tried Regex, but I got no idea how I can do this over an entire text...
Pattern pattern = Pattern.compile("[A-Z0-9._%+-]+#[A-Z0-9.-]+\\.[A-Z]{2,4}");
Matcher matcher = pattern.matcher("asd#asdasd.de".toUpperCase());
if(matcher.matches()){
System.out.println("Mail found!");
}else{
System.out.println("No Mail...");
}
Can somebody help me? :(
Greetings!
They're so many different types of email address formats that it is hard to match all of them. A simple (for your structured data) but no so effective approach would be the following:
String s = "Hey,\n" +
"I just blahblah\n" +
"E-Mail: lolcat#catinator.com\n" +
"Another would be lolcat2#catinator.com";
Pattern p = Pattern.compile("\\S+#\\S+");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group());
}
Output
lolcat#catinator.com
lolcat2#catinator.com
I am not sure about the regex expression you have provided. But if it is good and serves your purpose then you can use following to extract the string,
Matcher matcher = pattern.matcher("asd#asdasd.de".toUpperCase());
String result;
while (matcher.find()) {
// result now will contain the email address
result = matcher .group();
System.out.println(result);
}