Java Help Manipulating Anchor with Pattern - java

I'm having trouble accomplishing a few things with my program, and I'm hoping someone is able to help out.
I have a String containing the source code of an HTML page.
What I would like to do is extract all instances of the following HTML and place them in an array:
<img src="http://*" alt="*" style="max-width:460px;">
So I would then have an array of X size containing values similar to the above, obviously with the src and alt attributes updated.
Is this possible? I know there are XML parsers, but the formatting is ALWAYS the same.
Any help would be greatly appreciated.

I suggest using an ArrayList instead of a fixed-size array, since it looks like you don't know how many matches you are going to have.
It is also generally not a good idea to use regex on HTML, but if you are sure the tags always use the same format then I recommend:
Pattern pattern = Pattern.compile(".*<img src=\"http://(.*)\" alt=\"(.*)\"\\s+sty.*>", Pattern.MULTILINE);
Here is an example:
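import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImageTagDemo {
public static void main(String[] args) throws Exception {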
String web;
String result = "";
for (int i = 0; i < 10; i++) {
web = "<img src=\"http://image" + i +".jpg\" alt=\"Title of Image " + i + "\" style=\"max-width:460px;\">";
result += web + "\n";
}
System.out.println(result);
Pattern pattern = Pattern.compile(".*<img src=\"http://(.*)\" alt=\"(.*)\"\\s+sty.*>", Pattern.MULTILINE);
List<String> imageSources = new ArrayList<String>();
List<String> imageTitles = new ArrayList<String>();
Matcher matcher = pattern.matcher(result);
while (matcher.find()) {
String imageSource = matcher.group(1);
String imageTitle = matcher.group(2);
imageSources.add(imageSource);
imageTitles.add(imageTitle);
}
for(int i = 0; i < imageSources.size(); i++) {
System.out.println("url: " + imageSources.get(i));
System.out.println("title: " + imageTitles.get(i));
}
}
}
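One caveat with the pattern above: the .* parts are greedy, so if several img tags ever end up on the same line, only the last tag on that line is captured. Below is a minimal sketch of a stricter variant, assuming the tag format from the question is exact. The class and method names (ImgTagExtractor, extractTags) are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImgTagExtractor {
    // [^"]* stops at the closing quote, so tags that share a line stay separate.
    // Assumption: every tag has exactly the src/alt/style layout from the question.
    private static final Pattern IMG = Pattern.compile(
            "<img src=\"(http://[^\"]*)\" alt=\"([^\"]*)\"\\s+style=\"max-width:460px;\">");

    /** Returns each complete img tag that matches the fixed format, in document order. */
    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = IMG.matcher(html);
        while (m.find()) {
            tags.add(m.group()); // group(1) is the src, group(2) the alt text
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<p><img src=\"http://a.jpg\" alt=\"A\" style=\"max-width:460px;\">"
                + "<img src=\"http://b.jpg\" alt=\"B\" style=\"max-width:460px;\"></p>";
        for (String tag : extractTags(html)) {
            System.out.println(tag);
        }
    }
}
```

If an array is really needed afterwards, tags.toArray(new String[0]) converts the list.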

As you're getting an ArrayIndexOutOfBoundsException, it is most likely that the String array imageTitles is not big enough to hold all the alt values found by the regex search. In this case it is likely a zero-size array.

Related

trim the string when starts *Artist*

CurrentlyPlaying(context=null, timestamp=1610137729201, progress_ms=38105, is_playing=false, item=Track(name=Put Your Head on My Shoulder, artists=[ArtistSimplified(name=Paul Anka, externalUrls=ExternalUrl)
I want to turn it into Paul Anka, along with any other artist names that appear in this row.
This is what I tried:
String info = currentlyPlayingFuture.get().toString(); // returns first text
System.out.println("just: " + info);
char[] infoCh = info.toCharArray();
for (int i = 0; i < infoCh.length - 1; i++) {
if ((infoCh[i] + infoCh[i + 1]+"").equals("Ar")){
System.out.println(info.substring(i, i+10));
}
}
It doesn't work. How can I do this?
The problem is infoCh[i] + infoCh[i + 1]+"". That isn't concatenating characters. Adding two char values is integer arithmetic, so it sums the numeric codes of those characters and only then appends the empty string. One thing you could do is turn that into "" + infoCh[i] + infoCh[i + 1].
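To make the difference concrete, here is a small sketch. The class and method names are invented for the example; the character codes are the standard ones ('A' is 65, 'r' is 114).

```java
final class CharConcatDemo {
    // char + char is evaluated first as int addition; the "" is appended to the sum.
    static String sumThenAppend(char a, char b) {
        return a + b + "";
    }

    // Starting with "" makes every + a string concatenation.
    static String concat(char a, char b) {
        return "" + a + b;
    }

    public static void main(String[] args) {
        System.out.println(sumThenAppend('A', 'r')); // prints 179 (65 + 114)
        System.out.println(concat('A', 'r'));        // prints Ar
    }
}
```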
A regex would work better than what you are trying here. Something like
final Pattern pattern = Pattern.compile("ArtistSimplified\\(name=(.*?), ext");
final Matcher matcher = pattern.matcher(input);
while(matcher.find()) {
System.out.println(matcher.group(1));
}
The best solution would perhaps be to parse the string into some object structure though.

Retrieving concrete data from String

I'm trying to retrieve the data-product-id from a String which looks like this:
<img class="lazy" src="/b/mp/img/svg/no_picture.svg" lazy-img="https://ecsmedia.pl/c/w-pustyni-i-w-puszczy-p-iext43240721.jpg" alt="">
The output should be
prod14290034
I tried to achieve this with a regular expression, but I'm a beginner with them.
Is a regular expression a good fit for this? If so, how do I do it?
/EDIT
According to Emma's comment.
I've made something like this:
String z = element.toString();
Pattern pattern = Pattern.compile("data-product-id=\"\\s*([^\\s\"]*?)\\s*\"");
Matcher matcher = pattern.matcher(z);
System.out.println(matcher.find());
if (matcher.find()) {
System.out.println(matcher.group());
}
It returns true, but doesn't print any value. Why?
You might use some HTML/XHTML/XML library which could transform your string data into document or at least Element and then you can easily obtain the attribute value from there. But if you want to use regex then you can try this snippet
@Test
public void productId() {
String src =
" <img class=\"lazy\" src=\"/b/mp/img/svg/no_picture.svg\" lazy-img=\"https://ecsmedia.pl/c/w-pustyni-i-w-puszczy-p-iext43240721.jpg\" alt=\"\"> ";
final Pattern pattern = Pattern.compile("(data-product-id=)\"(p[a-zA-Z]+[0-9]+)\"");
final Matcher matcher = pattern.matcher(src);
String prodId = null;
if (matcher.find()) {
System.out.println(matcher.groupCount());
prodId = matcher.group(2);
}
System.out.println(prodId);
Assert.assertNotNull(prodId);
Assert.assertEquals("prod14290034", prodId); // JUnit expects (expected, actual)
}
You can use jsoup for Java; it is a library for parsing HTML pages. There are similar libraries for other languages, such as Beautiful Soup for Python.
EDIT: Here is a snippet for jsoup. You can select any element by tag and then get the needed attribute with the attr method.
Document doc = Jsoup.parse(
"<a href=\"/w-pustyni-i-w-puszczy-sienkiewicz-henryk,prod14290034,ksiazka-p\" " +
"class=\"img seoImage\" " +
"title=\"W pustyni i w puszczy - Sienkiewicz Henryk\" " +
"rel=\"nofollow\" " +
"data-product-id=\"prod14290034\"> " +
"<img class=\"lazy\" src=\"/b/mp/img/svg/no_picture.svg\" lazy-img=\"https://ecsmedia.pl/c/w-pustyni-i-w-puszczy-p-iext43240721.jpg\" alt=\"\"> </a>\n"
);
String dataProductId = doc.select("a").first().attr("data-product-id");
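As for why the snippet in the question prints true and then nothing: every call to find() advances the matcher to the next match, so the find() inside println consumes the only match and the find() in the if has nothing left to report. Below is a minimal sketch of this, assuming the input contains a single data-product-id attribute. FindOnceDemo and its method names are hypothetical; the pattern is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindOnceDemo {
    private static final Pattern ID =
            Pattern.compile("data-product-id=\"\\s*([^\\s\"]*?)\\s*\"");

    /** Calls find() exactly once and returns the captured id, or null if there is none. */
    static String firstProductId(String html) {
        Matcher m = ID.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /** Reproduces the question's mistake: the first find() uses up the only match. */
    static boolean secondFindSucceeds(String html) {
        Matcher m = ID.matcher(html);
        m.find();        // consumes the match, like the println(matcher.find()) call
        return m.find(); // looks for a further match after the first one
    }

    public static void main(String[] args) {
        String html = "<a data-product-id=\"prod14290034\">";
        System.out.println(firstProductId(html));
        System.out.println(secondFindSucceeds(html));
    }
}
```

Note that group(1) returns only the captured value, whereas group() with no argument returns the whole match including the attribute name.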

Regular Expression WildCard matching split with java split method

I know similar questions have been asked before, but I want to do a custom operation and I don't know how to go about it.
I want to split a string of data with a regular expression, but in this case I know the starting and ending characters, like:
String myString="Google is a great search engine<as:...s>";
The <as: and s> are the opening and closing markers.
The ... is dynamic; I can't predict its value.
I want to be able to split the string from the opening <as: to the closing s>,
with the dynamic string in it.
Like:
myString.split("<as:/*s>");
Something like that. I also want to get all the occurrences of <as:..s> in the string.
I know this can be done with regex, but I've never done it before. I need a simple and neat way to do this.
Thanks in advance
Rather than using a .split(), I would just extract using Pattern and Matcher. This approach finds everything between <as: and s> and extracts it to a capture group. Group 1 then has the text you would like.
public static void main(String[] args)
{
final String myString="Google is a great search engine<as:Some stuff heres>";
Pattern pat = Pattern.compile("^[^<]+<as:(.*)s>$");
Matcher m = pat.matcher(myString);
if (m.matches()) {
System.out.println(m.group(1));
}
}
Output:
Some stuff here
If you need the text at the beginning, you can put it in a capture group as well.
Edit: If there is more than one <as:...s> in the input, then the following will gather all of them.
Edit 2: extended the logic and added checks for emptiness.
public static List<String> multiEntry(final String myString)
{
String[] parts = myString.split("<as:");
List<String> col = new ArrayList<>();
if (! parts[0].trim().isEmpty()) {
col.add(parts[0]);
}
Pattern pat = Pattern.compile("^(.*?)s>(.*)?");
for (int i = 1; i < parts.length; ++i) {
Matcher m = pat.matcher(parts[i]);
if (m.matches()) {
for (int j = 1; j <= m.groupCount(); ++j) {
String s = m.group(j).trim();
if (! s.isEmpty()) {
col.add(s);
}
}
}
}
return col;
}
Output:
[Google is a great search engine, Some stuff heress, Here is Facebook, More Stuff, Something else at the end]
Edit 3: This approach uses find and looping to do the parsing. It uses optional capture groups as well.
public static void looping()
{
final String myString="Google is a great search engine"
+ "<as:Some stuff heresss>Here is Facebook<as:More Stuffs>"
+ "Something else at the end" +
"<as:Stuffs>" +
"<as:Yet More Stuffs>";
Pattern pat = Pattern.compile("([^<]+)?(<as:(.*?)s>)?");
Matcher m = pat.matcher(myString);
List<String> col = new ArrayList<>();
while (m.find()) {
String prefix = m.group(1);
String contents = m.group(3);
if (prefix != null) { col.add(prefix); }
if (contents != null) { col.add(contents); }
}
System.out.println(col);
}
Output:
[Google is a great search engine, Some stuff heress, Here is Facebook, More Stuff, Something else at the end, Stuff, Yet More Stuff]
Additional Edit: wrote some quick test cases (with super hacked helper class) to help validate. These all pass (updated) multiEntry:
public static void main(String[] args)
{
Input[] inputs = {
new Input("Google is a great search engine<as:Some stuff heres>", 2),
new Input("Google is a great search engine"
+ "<as:Some stuff heresss>Here is Facebook<as:More Stuffs>"
+ "Something else at the end" +
"<as:Stuffs>" +
"<as:Yet More Stuffs>" +
"ending", 8),
new Input("Google is a great search engine"
+ "<as:Some stuff heresss>Here is Facebook<as:More Stuffs>"
+ "Something else at the end" +
"<as:Stuffs>" +
"<as:Yet More Stuffs>", 7),
new Input("No as here", 1),
new Input("Here is angle < input", 1),
new Input("Angle < plus <as:Stuff in as:s><as:Other stuff in as:s>", 3),
new Input("Angle < plus <as:Stuff in as:s><as:Other stuff in as:s>blah", 4),
new Input("<as:To start with anglass>Some ending", 2),
};
List<String> res;
for (Input inp : inputs) {
res = multiEntry(inp.inp);
if (res.size() != inp.cnt) {
System.err.println("FAIL: " + res.size()
+ " did not match exp of " + inp.cnt
+ " on " + inp.inp);
System.err.println(res);
continue;
}
System.out.println(res);
}
}
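If the text outside the markers and the text inside them are wanted as two separate lists, a shorter variant is possible. This is only a sketch under the assumption that markers are never nested and that the first s> after an <as: closes it; AsMarkerDemo and its method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AsMarkerDemo {
    // Lazy .*? stops at the first "s>" that follows "<as:".
    private static final Pattern MARKER = Pattern.compile("<as:(.*?)s>");

    /** Text found between <as: and s>, one entry per marker. */
    static List<String> insideMarkers(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = MARKER.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    /** Text surrounding the markers, obtained with split as the question intended. */
    static List<String> outsideMarkers(String input) {
        return Arrays.asList(input.split("<as:.*?s>"));
    }

    public static void main(String[] args) {
        String s = "Google<as:ones>Facebook<as:twos>";
        System.out.println(insideMarkers(s));
        System.out.println(outsideMarkers(s));
    }
}
```

Keep in mind that split drops trailing empty strings but keeps a leading one, so an input that starts with a marker yields an empty first element.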

cant remove all occurrence of html tag in java using pattern matching

I have a very long HTML string which contains multiple
<dl id="divmap"> .... </dl>
blocks. I want to remove all the content between these tags.
I wrote this code in Java:
String triphtml= htmlString;
System.out.println("triphtml is "+triphtml);
System.out.println("test1 ");
final Pattern pattern = Pattern.compile("(<dl id=\""+selectedArray[i]+"\">)(.+?)(</dl>)",
Pattern.DOTALL);
final Matcher matcher = pattern.matcher(triphtml);
// matcher.find();
System.out.println("pattern of test1 is : "
+ pattern); // Prints
System.out.println("MATCHER of test1 is : "
+ matcher); // Prints
System.out.println("MATCH COUNT of test1 a: "
+ matcher.groupCount()); // Prints
System.out.println("MATCH COUNT of test1 a: "
+ matcher.find()); // Prints
while (matcher.find()) {
// System.out.println("MATCH GP 3: "+matcher.group(3).substring(1,10));
for (int z = 0; z <= matcher.groupCount(); z++) {
String extstr = matcher.group(z);
System.out.println("matcher group of "+z+" test1 is " + extstr);
System.out.println("ext a of test1 is " + extstr);
triphtml = triphtml.replaceAll(extstr, "");
System.out.println("Group found of test1 is :\n" + extstr);
}
}
But this code removes some of the dl elements while others remain in triphtml.
I don't know why this is happening.
Here triphtml is an HTML string which has multiple dl's. Please help me remove the content between all
<dl id="divmap">.
Thanks in advance.
I suggest NOT using regex for HTML. Just use any library made for traversing XML/HTML.
For example, JSoup.
Try using JSoup.
It uses selectors and syntax like jQuery, and it is very easy to use.
You can try this:
String triphtml = htmlString;
Document doc = Jsoup.parse(htmlString);
Elements divmaps = doc.select("#divmap");
then you can remove (or alter) the elements in the DOM.
divmaps.remove();
triphtml = doc.html();
By using regex you can do as follows:
String orgString = "<dl id=\"divmap\"> .... </dl>";
orgString = orgString.replaceAll("<[^>]*>", "");
//for removing html tag
orgString = orgString.replaceAll(orgString.replaceAll("<[^>]*>", ""),"");
//for removing content inside html tag
But it is better to use html parsing
Edit:
String htmlString = "<dl id=\"divmap\"> Content </dl>";
Pattern p = Pattern.compile("<[^>]*>");
Matcher m = p.matcher(htmlString);
while(m.find()){
htmlString = htmlString.replace(m.group(), ""); // replace() treats the tag as literal text, not as a regex
}
System.out.println("Ans"+htmlString);
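If you do stay with regex for this, one likely reason the question's loop leaves some blocks behind is that it calls find() an extra time inside println (which skips a match) and passes the matched HTML to replaceAll as if it were a regex. Letting the matcher do the replacement avoids both. This is a minimal sketch, assuming the dl blocks are not nested; DlRemover and removeDl are hypothetical names.

```java
import java.util.regex.Pattern;

final class DlRemover {
    /**
     * Removes every <dl id="..."> ... </dl> block with the given id.
     * Assumption: blocks are not nested, so the first </dl> closes the block.
     */
    static String removeDl(String html, String id) {
        // Pattern.quote keeps special characters in the id from being read as regex syntax.
        // DOTALL lets . match line breaks inside the block; .*? keeps each match minimal.
        Pattern p = Pattern.compile(
                "<dl id=\"" + Pattern.quote(id) + "\">.*?</dl>", Pattern.DOTALL);
        return p.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "a<dl id=\"divmap\">x\ny</dl>b<dl id=\"divmap\">z</dl>c";
        System.out.println(removeDl(html, "divmap"));
    }
}
```

To keep the tags and drop only the content between them, capture the opening and closing tags in groups and use replaceAll("$1$2") instead.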

Trim() in Java not working the way I expect? [duplicate]

Possible Duplicate:
Query about the trim() method in Java
I am parsing a site's usernames and other information, and each one has a bunch of spaces after it (and also spaces in between the words).
For example: "Bob the Builder " or "Sam the welder ". The number of spaces varies from name to name. I figured I'd just use .trim(), since I've used this before.
However, it's giving me trouble. My code looks like this:
for (int i = 0; i < splitSource3.size(); i++) {
splitSource3.set(i, splitSource3.get(i).trim());
}
The result is just the same; no spaces are removed at the end.
Thank you in advance for your excellent answers!
UPDATE:
The full code is a bit more complicated, since there are HTML tags that are parsed out first. It goes exactly like this:
for (String s : splitSource2) {
if (s.length() > "<td class=\"dddefault\">".length() && s.substring(0, "<td class=\"dddefault\">".length()).equals("<td class=\"dddefault\">")) {
splitSource3.add(s.substring("<td class=\"dddefault\">".length()));
}
}
System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
splitSource3.set(i, splitSource3.get(i).substring(0, splitSource3.get(i).length() - 5));
splitSource3.set(i, splitSource3.get(i).trim());
System.out.println(i + ": " + splitSource3.get(i));
}
}
UPDATE:
Calm down. I never said the fault lay with Java, and I never said it was a bug or broken or anything. I simply said I was having trouble with it and posted my code for you to collaborate on and help solve my issue. Note the phrase "my issue" and not "java's issue". I have actually had the code printing out
System.out.println(i + ": " + splitSource3.get(i) + "*");
in a for each loop afterward.
This is how I knew I had a problem.
By the way, the problem has still not been fixed.
UPDATE:
Sample output (minus single quotes):
'0: Olin D. Kirkland                                          '
'1: Sophomore                                          '
'2: Someplace, Virginia  12345<br />VA SomeCity<br />'
'3: Undergraduate                                          '
EDIT: the OP rephrased his question at Query about the trim() method in Java, where the issue was found to be Unicode whitespace characters which are not matched by String.trim().
It just occurred to me that I used to have this sort of issue when I worked on a screen-scraping project. The key is that sometimes the downloaded HTML sources contain non-printable characters which are non-whitespace characters too. These are very difficult to copy-paste to a browser. I assume that this could have happened to you.
If my assumption is correct then you've got two choices:
Use a binary reader and figure out what those characters are - and delete them with String.replace(); E.g.:
private static String cutCharacters(String fromHtml) {
String result = fromHtml;
char[] problematicCharacters = {'\000', '\001', '\003'}; //this could be a private static final constant too
for (char ch : problematicCharacters) {
result = result.replace(String.valueOf(ch), ""); //Strings are immutable, so fromHtml itself is left untouched
}
return result;
}
If you find some sort of reoccurring pattern in the HTML to be parsed then you can use regexes and substrings to cut the unwanted parts. E.g.:
private String getImportantParts(String fromHtml) {
Pattern p = Pattern.compile("(\\w*\\s*)"); //this could be a private static final constant as well.
Matcher m = p.matcher(fromHtml);
StringBuilder buff = new StringBuilder();
while (m.find()) {
buff.append(m.group(1));
}
return buff.toString().trim();
}
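For the Unicode whitespace case mentioned in the edit note above, trim() only strips characters up to U+0020, so something like a non-breaking space survives it. The sketch below trims by Unicode category instead. It assumes the stray characters are separators such as U+00A0; the source does not say which characters they actually were, and UnicodeTrimDemo and trimUnicode are made-up names.

```java
import java.util.regex.Pattern;

final class UnicodeTrimDemo {
    // \s covers ASCII whitespace, \p{Z} covers Unicode separators such as U+00A0.
    // The anchors restrict removal to the two ends, so inner spaces are kept.
    private static final Pattern EDGE_SPACE =
            Pattern.compile("^[\\s\\p{Z}]+|[\\s\\p{Z}]+$");

    static String trimUnicode(String s) {
        return EDGE_SPACE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String name = "Bob the Builder\u00A0\u00A0"; // assumed input with non-breaking spaces
        System.out.println("[" + name.trim() + "]");       // the padding is still there
        System.out.println("[" + trimUnicode(name) + "]"); // the padding is gone
    }
}
```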
Works without a problem for me.
Here is your code, refactored a bit and (maybe) more readable:
final String openingTag = "<td class=\"dddefault\">";
final String closingTag = "</td>";
List<String> splitSource2 = new ArrayList<String>();
splitSource2.add(openingTag + "Bob the Builder " + closingTag);
splitSource2.add(openingTag + "Sam the welder " + closingTag);
for (String string : splitSource2) {
System.out.println("|" + string + "|");
}
List<String> splitSource3 = new ArrayList<String>();
for (String s : splitSource2) {
if (s.length() > openingTag.length() && s.startsWith(openingTag)) {
String nameWithoutOpeningTag = s.substring(openingTag.length());
splitSource3.add(nameWithoutOpeningTag);
}
}
System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
String name = splitSource3.get(i);
int closingTagBegin = splitSource3.get(i).length() - closingTag.length();
String nameWithoutClosingTag = name.substring(0, closingTagBegin);
String nameTrimmed = nameWithoutClosingTag.trim();
splitSource3.set(i, nameTrimmed);
System.out.println("|" + splitSource3.get(i) + "|");
}
I know that's not a real answer, but I cannot post comments, and this code wouldn't fit in a comment anyway, so I made it an answer so that Olin Kirkland can check his code.
