I am trying to produce valid XML from a string.
The special characters are located inside the title tags.
This is what I did:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo {
    public static void main(String[] args) throws Exception {
        String data = "<title> \"sad\" <<dd> ><\n </title>";
        String pattern = "(<title>)(.+?)([<>'\"&])(.+?)(\n </title>)";
        Matcher m = Pattern.compile(pattern).matcher(data);
        while (m.find()) {
            String bugString = m.group(3) + m.group(4);
            // Escape & first so the entities added below are not escaped twice.
            String fixed = bugString.replaceAll("&", "&amp;");
            fixed = fixed.replaceAll("<", "&lt;");
            fixed = fixed.replaceAll(">", "&gt;");
            fixed = fixed.replaceAll("'", "&apos;");
            fixed = fixed.replaceAll("\"", "&quot;");
            data = data.replace(bugString, fixed);
        }
        System.out.println(data);
    }
}
But it looks a little ugly. How can I improve it without using an additional library?
If you can influence the string, you could put the title tag's text inside a CDATA section. Inside it you do not have to encode the special XML characters.
CDATA sections are explained, for example, here: http://en.m.wikipedia.org/wiki/CDATA
So your title could look like this:
<title> <![CDATA[ here comes my special title with "/<> ]]> </title>
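As a minimal sketch of that idea, the title text can be wrapped with plain string operations and no extra library. The class name CdataTitle and the helper wrapInCdata are made up for illustration. The one sequence that still needs care is ]]>, which would end the section early, so the helper splits it across two CDATA sections:

```java
class CdataTitle {
    // Hypothetical helper: wraps text in a CDATA section.
    // "]]>" would close the section early, so it is split across two sections.
    static String wrapInCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        String title = " \"sad\" <<dd> >< ";
        System.out.println("<title>" + wrapInCdata(title) + "</title>");
    }
}
```

This only works if you control how the string is produced; it does not repair a document that already contains unescaped characters.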
Related
I am trying to parse a string from a website using Jsoup and wrote the following test to verify the parsing.
This is my test:
@Test
public void extractBookData() throws Exception {
String bookLink = ""; //some address
Document doc = Jsoup.connect(bookLink).get(); // get() already returns the parsed Document
Book book = new Book();
assertEquals("Literatür Yayıncılık", book.getPublisher(doc));
}
This is the getPublisher(Element) method:
public String getPublisher(Element element){
String tableRowSelector = "tr:contains(Yayınevi)";
String tableColumnSelector = "td";
String tableRowData = "";
element = element.select(tableRowSelector).last();
if (element != null) {
element = element.select(tableColumnSelector).last();
if (element != null) {
tableRowData = element.text().replaceAll(tableRow.getRowName() + " ?:", "").replaceAll(tableRow.getRowName() + " :?", "").replaceAll(" ?: ?", "").trim();
}
}
return tableRowData;
}
The problem is that the actual and expected strings appear the same, even though JUnit says otherwise.
I am open to your suggestions.
I have had this same issue before: your text contains a non-breaking space (char 160) where you expect a space (char 32). In my case the text came from an HTML text input value; yours looks like it has also come from HTML.
The solution I used was simply to replace all non-breaking space chars with a space.
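A minimal sketch of that replacement, using only String.replace. The class and method names are hypothetical, and the sample string stands in for whatever element.text() returns:

```java
class NbspNormalizer {
    // Replaces every non-breaking space (U+00A0, char 160) with a regular space (char 32).
    static String normalizeSpaces(String text) {
        return text.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        String scraped = "Some\u00A0Publisher"; // stand-in for the scraped text
        System.out.println(scraped.equals("Some Publisher"));                  // false
        System.out.println(normalizeSpaces(scraped).equals("Some Publisher")); // true
    }
}
```

Applying the same call to the value before the assertEquals comparison should make the two strings match, assuming the non-breaking space is the only difference.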
I am reading the content from a web page and then parsing it with the Jsoup parser to get only the hyperlinks that exist in the body section. I am getting the output as:
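<a href="/sports/sports.asp"><font color="#0000FF">Sports</font></a>
<a href="/titanic/titanic.asp"><font color="#0000FF">Titanic</font></a>
<a href="gastheft.asp">license plates</a>
<a href="miracle.asp">miracle cars</a>
<a href="/crime/warnings/clear.asp">Clear</a>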
and even more hyperlinks.
From all of them, all I am interested in is data like
/sports/sports.asp
/titanic/titanic.asp
gastheft.asp
miracle.asp
/crime/warnings/clear.asp
How can I do this using Strings, or is there any other way to extract this information using the Jsoup parser itself?
You can try this; it works.
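import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class AttributeParsing {
    /**
     * @param args
     */
    public static void main(String[] args) {
        final String html = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>";
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        Element th = doc.select("a[href]").first();
        String href = th.attr("href");
        System.out.println("th : " + th);
        System.out.println("href : " + href);
    }
}
Output :
th : <a href="/sports/sports.asp"><font color="#0000FF">Sports</font></a>
href : /sports/sports.asp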
Try this; it may help:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
int nextIndex = linkHref.indexOf("\""); // -1: attr() returns the value without the surrounding quotes
This should be a basic bit of parsing using
String.indexOf
as in
index = jsoupOutput.indexOf ("href=\"");
and
nextIndex = jsoupOutput.indexOf ("\"", index);
with the necessary checks in place.
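The "necessary checks" mostly mean handling the -1 that indexOf returns when the marker or the closing quote is missing. A minimal sketch, with a hypothetical class and method name and the assumption that the attribute is written exactly as href="...":

```java
class HrefIndexOf {
    // Returns the href value of the first href="..." in the input, or null if none is found.
    static String extractHref(String anchor) {
        String marker = "href=\"";
        int index = anchor.indexOf(marker);
        if (index < 0) {
            return null; // no href attribute in this string
        }
        int begin = index + marker.length();
        int end = anchor.indexOf('"', begin);
        if (end < 0) {
            return null; // opening quote without a closing quote
        }
        return anchor.substring(begin, end);
    }

    public static void main(String[] args) {
        System.out.println(extractHref("<a href=\"/sports/sports.asp\">Sports</a>"));
        System.out.println(extractHref("<b>no link here</b>"));
    }
}
```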
Let's assume that the String anchor contains one of these links. The beginning index of the substring is then right after href=" and the end index is the first quotation mark after index 9:
String anchor = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>";
int beginIndex = anchor.indexOf("href=\"") + 6; //To start after <a href="
int endIndex = anchor.indexOf("\"", beginIndex);
String desiredPart = anchor.substring(beginIndex, endIndex);
And that's it, if the shape of the anchor is always going to be that way. Better options are regular expressions, and the best would be an XML parser.
Use this as a reference:
import java.util.regex.*;
public class HelloWorld{
public static void main(String []args){
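String s = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>"+
"<a href=\"/titanic/titanic.asp\"><font color=\"#0000FF\">Titanic</font></a>"+
"<a href=\"gastheft.asp\">license plates</a>"+
"<a href=\"miracle.asp\">miracle cars</a>"+
"<a href=\"/crime/warnings/clear.asp\">Clear</a>";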
Pattern p = Pattern.compile("href=\".+?\"");
Matcher m = p.matcher(s);
while(m.find())
{
System.out.println(m.group().split("=")[1].replace("\"",""));
}
}
}
Output
/sports/sports.asp
/titanic/titanic.asp
gastheft.asp
miracle.asp
/crime/warnings/clear.asp
You can do it in one line:
String[] paths = str.replaceAll("(?m)^.*?\"(.*?)\".*?$", "$1").split("(?ms)$.*?^");
The first method call removes everything except the target from each line, and the second splits on newlines (will work on all OS terminators).
FYI (?m) turns on "multiline mode" and (?ms) also turns on the "dotall" flag.
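Here is a small sketch of the one-liner in use. It assumes one link per line and that the first quoted string on each line is the href value; the class and method names are only for illustration:

```java
import java.util.Arrays;

class OneLinePaths {
    // Keeps the first quoted string of each line, then splits the result on line breaks.
    static String[] extractPaths(String str) {
        return str.replaceAll("(?m)^.*?\"(.*?)\".*?$", "$1").split("(?ms)$.*?^");
    }

    public static void main(String[] args) {
        String str = "<a href=\"/sports/sports.asp\">Sports</a>\n"
                   + "<a href=\"gastheft.asp\">license plates</a>";
        System.out.println(Arrays.toString(extractPaths(str)));
    }
}
```

A line with no quoted string at all is left unchanged by the first call, so it would show up in the array as-is.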
I want to generate an XPath from an HTML file. So far, I have succeeded in storing the HTML source in a String and generating a basic XPath using a regex Matcher as follows:
String text = "<html><body><table><tr id=\"x\"><td>abc</td><td></td><td>xyz</td></tr></table></body></html>";
//I want xpath till label "xyz"
String unwanted= "xyz";
//so splitting and storing needed String
String[] neededString=text.split(unwanted);
String a="";
//pattern for extracting tags
String patternString1 = "<(.+?)>";
Pattern pattern = Pattern.compile(patternString1);
Matcher matcher = pattern.matcher(neededString[0]);
while(matcher.find()) {
a=a.concat(matcher.group(1)+"/");
System.out.println(a);
}
This code works for a basic tag structure without multiple child nodes, such as multiple <td>'s in a <tr>. Can anyone improve my code above to generate XPaths for multiple children and also capture attributes like id, class, etc.?
Any help is much appreciated.
Thanks in advance.
Regex is not accurate enough for extracting HTML content.
Use the Jsoup HTML parser:
public static void main(String[] args){
String html = "<html><body><table><tr id=\"x\"><td>abc</td><td></td>" +
"<td>xyz</td></tr></table></body></html>";
Document doc = Jsoup.parse(html);
for (Element table : doc.select("table")) {
for (Element row : table.select("tr[id=x]")) {
Elements tds = row.select("td");
System.out.println(tds.get(2).text());
}
}
}
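The snippet above reads the cell text but does not build the path the question asks for. As a rough sketch of that part, a parsed tree can be walked from the target element up to the root, emitting one step per ancestor. This version uses the JDK's own DOM parser instead of Jsoup, so it assumes the markup is well-formed XML (true for the sample string, often not for real pages). The class and method names are hypothetical; an element with an id gets an [@id='...'] predicate, any other element gets its 1-based position among same-named siblings:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {
    // Walks from the element up to the root and prepends one step per ancestor.
    static String xpathOf(Element element) {
        StringBuilder path = new StringBuilder();
        for (Node node = element; node instanceof Element; node = node.getParentNode()) {
            Element current = (Element) node;
            String name = current.getNodeName();
            String step;
            if (current.hasAttribute("id")) {
                step = name + "[@id='" + current.getAttribute("id") + "']";
            } else {
                // Count preceding siblings with the same tag name to get the 1-based position.
                int position = 1;
                for (Node prev = node.getPreviousSibling(); prev != null; prev = prev.getPreviousSibling()) {
                    if (prev instanceof Element && prev.getNodeName().equals(name)) {
                        position++;
                    }
                }
                step = name + "[" + position + "]";
            }
            path.insert(0, "/" + step);
        }
        return path.toString();
    }

    // Returns the path of the first element whose only child is the given text, or null.
    static String xpathOfText(String xml, String label) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            Node child = e.getFirstChild();
            if (e.getChildNodes().getLength() == 1
                    && child.getNodeType() == Node.TEXT_NODE
                    && label.equals(child.getNodeValue())) {
                return xpathOf(e);
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String text = "<html><body><table><tr id=\"x\"><td>abc</td><td></td><td>xyz</td></tr></table></body></html>";
        System.out.println(xpathOfText(text, "xyz"));
    }
}
```

The same upward walk should carry over to Jsoup elements if the input is not well-formed, since the idea only needs a parent link and the list of siblings.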
I'm new to regular expressions, but I believe this is the method for my solution. I'm trying to take an arbitrary HTML snippet and customize the image tags. For example,
If I had this HTML code:
<><><><><img src="blah.jpg"><><><><><><><><img src="blah2.jpg"><><><>
I want to turn it into:
<><><><><img src="images/blah.jpg"><><><><><><><><img src="images/blah2.jpg"><><><>
The Code I have now is this:
Pattern p = Pattern.compile("<img.*src=\".*\\..*\"");
Matcher m = p.matcher(htmlString);
boolean b = m.find();
String imgPath = "src=\"images/";
while(b)
{
//Get file name.
String name="test.jpg\"";
//Assign new path.
m.group().replaceAll("src=\".*\"",imgPath+name);
}
Regular expressions are not the correct way to parse HTML. Don't do it. It's not possible to do correctly.
Use a proper parser.
Document doc = Jsoup.parse(someHtml);
Elements imgs = doc.select("img");
for (Element img : imgs) {
img.attr("src", "images/" + img.attr("src")); // or whatever
}
doc.outerHtml(); // returns the modified HTML
This code is almost perfect. It prints out a lot of info, so look for where it says "Final Result" and "Original" to see the result of customizing the IMG tags. There's a small flaw that I'm still not sure how to fix. "in10" is the variable holding the test input string. The rest are regexes.
I noticed problems occur when I use newline characters and when "src=" is left blank instead of "src=\"\"" or "src=''". The quotes seem to affect the results.
private static String r16 = "(?s)(<img.*?)(src\\s*?=\\s*?(?:\"|').*?(?:\"|'))";
private static String in10 = "<><><><><img width=1 height=888 src=\"bnm.jpg\"<><><><><img src=\"\"> <img src = \"\"><img src ='folder1/folder2/bnm.jpg'><><><img src =\"'>";
private static String r14 = "(?s)\\/|\\=";
String path="images/";
String name="";
Pattern p = Pattern.compile(r16);
Matcher m = p.matcher(in10);
StringBuffer sb = new StringBuffer();
int i=1;
while(m.find())
{
String g0 = m.group();
String g2 = m.group(2);
System.out.println("Main group"+i+":"+g0);
System.out.println("Inner group1:"+m.group(1));
System.out.println("Inner group2:"+g2);
String[] names=g2.split(r14);
printNames(names);
/*
* src="/folder1/folder2/blah.jpg" ---> blah.jpg
* src="bnm.jpg" ---> src="bnm.jp"
*/
if(names.length>=1)
{
name = names[names.length-1];
}
else
{
name = "";
}
//Name might be empty string.
name = name.replaceAll("\"|'","");
System.out.println("Retrieved Name:"+name);
m.appendReplacement(sb,"$1src=\""+path+name+"\"");
i++;
}
m.appendTail(sb);
INPUT=sb.toString();
System.out.println("Final Result:"+INPUT);
System.out.println("Original____:"+in10);
System.out.println("Count:"+m.groupCount());
}
You should not use regex for this. The approach josh3736 described is robust. But if you want to use regex, you should use:
String s = "<><><><><img src=\"blah.jpg\"><><><><><><><><img src=\"blah2.jpg\"><><><>";
s = s.replaceAll("(?<=img src=\")([^\"]+)(?=\">)","images/$1");
System.out.println(s);
output :
<><><><><img src="images/blah.jpg"><><><><><><><><img src="images/blah2.jpg"><><><>
Although I agree with the others that regular expressions are the wrong way to modify HTML fragments, here is a JUnit test case that shows how to rewrite src attributes with a Pattern in Java:
import static org.junit.Assert.*;
import static org.hamcrest.CoreMatchers.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import org.junit.Test;
public class ImgSrcReplace {
@Test
public void replaceWithRegex() {
String dir = "image/";
String htmlFragment = "<body>\n"+
"<img src=\"single-line.jpg\">"+
"<img src=\n"+
"\"multiline.jpg\">\n"+
"<img src='single-quote.jpg'><img src=\"broken.gif\'>"+
"<img class=\"before\" src=\"class-before.jpg\">"+
"<img src=\"class-after.gif\" class=\"after\">"+
"</body>";
Pattern replaceImgSrc =
Pattern.compile(
"(<img\\b[^>]*\\bsrc\\s*=\\s*)([\"\'])((?:(?!\\2)[^>])*)\\2(\\s*[^>]*>)",
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE); // flags are combined with |, not &
String result =
replaceImgSrc.matcher(htmlFragment)
.replaceAll("$1$2"+Matcher.quoteReplacement(dir)+"$3$2$4");
assertThat("the single line image tag was updated", result,
containsString("image/single-line.jpg"));
assertThat("the multiline image tag was updated", result,
containsString("image/multiline.jpg"));
assertThat("the single quote image tag was updated", result,
containsString("image/single-quote.jpg"));
assertThat("the broken gif was ignored.", result,
containsString("\"broken.gif'"));
assertThat("attributes before are preserved.", result,
containsString("<img class=\"before\" src=\"image/class-before.jpg\">"));
assertThat("attributes after are preserved.", result,
containsString("<img src=\"image/class-after.gif\" class=\"after\">"));
}
}
I'm having trouble accomplishing a few things with my program, and I'm hoping someone is able to help out.
I have a String containing the source code of a HTML page.
What I would like to do is extract all instances of the following HTML and place it in an array:
<img src="http://*" alt="*" style="max-width:460px;">
So I would then have an array of X size containing values similar to the above, obviously with the src and alt attributes updated.
Is this possible? I know there are XML parsers, but the formatting is ALWAYS the same.
Any help would be greatly appreciated.
I suggest using an ArrayList instead of a fixed-size array, since it looks like you don't know how many matches you are going to have.
Regex is also not a good idea for HTML, but if you are sure the tags always use the same format, then I recommend:
Pattern pattern = Pattern.compile(".*<img src=\"http://(.*)\" alt=\"(.*)\"\\s+sty.*>", Pattern.MULTILINE);
Here is an example:
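import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
public static void main(String[] args) throws Exception {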
String web;
String result = "";
for (int i = 0; i < 10; i++) {
web = "<img src=\"http://image" + i +".jpg\" alt=\"Title of Image " + i + "\" style=\"max-width:460px;\">";
result += web + "\n";
}
System.out.println(result);
Pattern pattern = Pattern.compile(".*<img src=\"http://(.*)\" alt=\"(.*)\"\\s+sty.*>", Pattern.MULTILINE);
List<String> imageSources = new ArrayList<String>();
List<String> imageTitles = new ArrayList<String>();
Matcher matcher = pattern.matcher(result);
while (matcher.find()) {
String imageSource = matcher.group(1);
String imageTitle = matcher.group(2);
imageSources.add(imageSource);
imageTitles.add(imageTitle);
}
for(int i = 0; i < imageSources.size(); i++) {
System.out.println("url: " + imageSources.get(i));
System.out.println("title: " + imageTitles.get(i));
}
}
}
As you're getting an ArrayIndexOutOfBoundsException, it is most likely that the String array imageTitles is not big enough to hold all the ALT values found by the regex search. In this case it is likely a zero-size array.
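One way to sidestep the sizing problem is to collect matches in a List and only convert to an array at the end, once the count is known. The sketch below assumes the exact tag shape from the question; the class and method names are made up, and [^"]* is used instead of .* so each attribute stops at its own closing quote even when several tags share a line:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgTagCollector {
    // Matches the fixed tag shape from the question.
    private static final Pattern IMG_TAG = Pattern.compile(
            "<img src=\"http://[^\"]*\" alt=\"[^\"]*\" style=\"max-width:460px;\">");

    // Collects every matching tag; the list grows as needed, so no size is guessed up front.
    static String[] collectImgTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = IMG_TAG.matcher(html);
        while (matcher.find()) {
            tags.add(matcher.group());
        }
        return tags.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String html = "<p><img src=\"http://a/1.jpg\" alt=\"One\" style=\"max-width:460px;\">"
                    + "<img src=\"http://a/2.jpg\" alt=\"Two\" style=\"max-width:460px;\"></p>";
        for (String tag : collectImgTags(html)) {
            System.out.println(tag);
        }
    }
}
```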