I am trying to parse a string from a website using Jsoup, and I wrote the following test to verify that the parsing works.
This is my test:
@Test
public void extractBookData() throws Exception {
String bookLink = ""; //some address
Document doc = Jsoup.connect(bookLink).get(); // get() already returns a Document; html() would turn it into a String
Book book = new Book();
assertEquals("Literatür Yayıncılık", book.getPublisher(doc));
}
This is the getPublisher(Element) method:
public String getPublisher(Element element){
String tableRowSelector = "tr:contains(Yayınevi)";
String tableColumnSelector = "td";
String tableRowData = "";
element = element.select(tableRowSelector).last();
if (element != null) {
element = element.select(tableColumnSelector).last();
if (element != null) {
tableRowData = element.text().replaceAll(tableRow.getRowName() + " ?:", "").replaceAll(tableRow.getRowName() + " :?", "").replaceAll(" ?: ?", "").trim();
}
}
return tableRowData;
}
The problem is that the actual and expected strings appear the same, even though JUnit reports otherwise.
I am open to your suggestions.
I have had this same issue before: your text contains a non-breaking space (char 160) instead of a regular space (char 32). In my case the text came from an HTML text input value; yours looks like it has also come from HTML.
The solution I used was simply to replace all non-breaking space chars with a regular space.
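A minimal sketch of that replacement, using only the standard library. The class name, method name and sample value are made up for illustration; the real text would come from the Jsoup element.

```java
final class NbspNormalizer {
    // U+00A0 is the non-breaking space (char 160); U+0020 is the regular space (char 32).
    private static final char NBSP = '\u00A0';

    /** Returns the text with every non-breaking space replaced by a regular space. */
    static String normalizeSpaces(String text) {
        return text.replace(NBSP, ' ');
    }

    public static void main(String[] args) {
        String scraped = "Some\u00A0Publisher"; // hypothetical scraped value
        String expected = "Some Publisher";
        System.out.println(scraped.equals(expected));                  // false
        System.out.println(normalizeSpaces(scraped).equals(expected)); // true
    }
}
```

Calling something like normalizeSpaces(element.text()) before trim() in getPublisher should make the comparison behave as expected, assuming the non-breaking space really is the culprit.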
I am getting a value from a Bluetooth device, with the values separated by a colon (:). I want to take the first value and append it to the same string:
public String process(String raw) {
if (raw != null) {
String[] str_array = raw.split(":");
String humid1 = str_array[0];
if (humid1 != null) {
return raw.add(humid1);
}
} else {
Log.w(TAG, "provided string was null");
}
}
There is no add() method in the String class. You can concatenate strings in Java with the + operator.
String s = "abc:def:ghi";
s = s + s.split(":")[0]; // or s += s.split(":")[0]
System.out.println(s);
The above snippet will print abc:def:ghiabc
Add null checks and index bounds checks as per your requirements.
Please check alternative ways at String Concatenation | StackOverflow
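As a rough sketch of what those checks could look like (the class and method names here are hypothetical, not from the question):

```java
final class FirstTokenAppender {
    /**
     * Appends the text before the first ':' to the end of raw.
     * Returns raw unchanged when it is null, empty, or has no first token.
     */
    static String appendFirstToken(String raw) {
        if (raw == null || raw.isEmpty()) {
            return raw;
        }
        String[] parts = raw.split(":");
        if (parts.length == 0) { // happens for input such as ":" or "::"
            return raw;
        }
        return raw + parts[0];
    }

    public static void main(String[] args) {
        System.out.println(appendFirstToken("abc:def:ghi")); // abc:def:ghiabc
    }
}
```

Returning the input unchanged for the edge cases is only one possible choice; throwing an exception may suit your code better.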
EDIT (As per OP's comment on question)
If you want changes to the input String object 'raw' to be reflected in the caller, that is not possible, because String is immutable in Java. The correct way to achieve this is to return the result from your method and assign it to the String in the caller.
public void myMethod() {
String s = "abc:def:ghi";
s = process(s);
System.out.println(s);
}
public String process(String raw) {
if (raw != null) {
String[] str_array = raw.split(":");
String humid1 = str_array[0];
if (humid1 != null) {
return humid1;
}
}
return null; // or throw an exception, as you prefer
}
The above snippet should print abc. For more details, see String Immutability in Java | StackOverflow.
I am reading the content from a web page and then I am parsing it with the help of Jsoup parser to get only the hyperlinks that exists in the body section. I am getting the output as:
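<a href="/sports/sports.asp"><font color="#0000FF">Sports</font></a>
<a href="/titanic/titanic.asp"><font color="#0000FF">Titanic</font></a>
<a href="gastheft.asp">license plates</a>
<a href="miracle.asp">miracle cars</a>
<a href="/crime/warnings/clear.asp">Clear</a>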
and even more hyperlinks.
From all of them, all I am interested in is data like
/sports/sports.asp
/titanic/titanic.asp
gastheft.asp
miracle.asp
/crime/warnings/clear.asp
How can I do this using Strings, or is there another way to extract this information using the Jsoup parser itself?
You can try this; it works.
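import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
public class AttributeParsing {
/**
* @param args
*/
public static void main(String[] args) {
final String html = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
Element th = doc.select("a[href]").first(); // first anchor that has an href attribute
String href = th.attr("href");
System.out.println("th : " + th);
System.out.println("href : " + href);
}
}
Output :
th : <a href="/sports/sports.asp"><font color="#0000FF">Sports</font></a>
href : /sports/sports.asp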
Try this; it may help:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String text = doc.body().text(); // "An example link"
String linkHref = link.attr("href"); // "http://example.com/"
int nextIndex = linkHref.indexOf("\""); // -1 here, because attr() returns the value without the surrounding quotes
This should be a basic bit of parsing using
String.indexOf
as in
index = jsoupOutput.indexOf ("href=\"");
and
nextIndex = jsoupOutput.indexOf ("\"", index);
with the necessary checks in place.
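Here is a small sketch of the indexOf approach with those checks added. The class and method names are made up, and it assumes the attribute is always written as href="..." with double quotes.

```java
final class HrefIndexOf {
    /** Returns the value of the first href="..." in html, or null when none is found. */
    static String firstHref(String html) {
        String marker = "href=\"";
        int index = html.indexOf(marker);
        if (index < 0) {
            return null; // no href attribute at all
        }
        int begin = index + marker.length(); // first character of the value
        int end = html.indexOf('"', begin);  // closing quote
        if (end < 0) {
            return null; // opening quote without a closing one
        }
        return html.substring(begin, end);
    }

    public static void main(String[] args) {
        System.out.println(firstHref("<a href=\"/sports/sports.asp\">Sports</a>")); // /sports/sports.asp
    }
}
```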
Let's assume that String anchor contains one of these links. The beginning index of the substring is then just after href=" and the end index is the first quotation mark at or after the beginning index (9 here), like this:
String anchor = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>";
int beginIndex = anchor.indexOf("href=\"") + 6; //To start after <a href="
int endIndex = anchor.indexOf("\"", beginIndex);
String desiredPart = anchor.substring(beginIndex, endIndex);
That's it, as long as the shape of the anchor is always the same. Better options are regular expressions, and best would be an XML parser.
Use this as reference
import java.util.regex.*;
public class HelloWorld{
public static void main(String []args){
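String s = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>"+
"<a href=\"/titanic/titanic.asp\"><font color=\"#0000FF\">Titanic</font></a>"+
"<a href=\"gastheft.asp\">license plates</a>"+
"<a href=\"miracle.asp\">miracle cars</a>"+
"<a href=\"/crime/warnings/clear.asp\">Clear</a>";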
Pattern p = Pattern.compile("href=\".+?\"");
Matcher m = p.matcher(s);
while(m.find())
{
System.out.println(m.group().split("=")[1].replace("\"",""));
}
}
}
Output
/sports/sports.asp
/titanic/titanic.asp
gastheft.asp
miracle.asp
/crime/warnings/clear.asp
You can do it in one line:
String[] paths = str.replaceAll("(?m)^.*?\"(.*?)\".*?$", "$1").split("(?ms)$.*?^");
The first method call removes everything except the target from each line, and the second splits on newlines (will work on all OS terminators).
FYI (?m) turns on "multiline mode" and (?ms) also turns on the "dotall" flag.
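To see the one-liner in action, here is a small sketch. The class name, method name and the two-line sample input are assumptions for illustration; the input is expected to have one link per line.

```java
import java.util.Arrays;

final class HrefOneLiner {
    /** Keeps the first double-quoted value of every line, then splits the result into one entry per line. */
    static String[] firstQuotedPerLine(String str) {
        return str.replaceAll("(?m)^.*?\"(.*?)\".*?$", "$1").split("(?ms)$.*?^");
    }

    public static void main(String[] args) {
        String str = "<a href=\"gastheft.asp\">license plates</a>\n"
                   + "<a href=\"miracle.asp\">miracle cars</a>";
        System.out.println(Arrays.toString(firstQuotedPerLine(str))); // [gastheft.asp, miracle.asp]
    }
}
```

Note that this picks the first quoted value on each line, whatever attribute it belongs to, and a line without any double quote is left as it is.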
I'm new to regular expressions, but I believe this is the method for my solution. I'm trying to take an arbitrary HTML snippet and customize the image tags. For example,
If I had this HTML code:
<><><><><img src="blah.jpg"><><><><><><><><img src="blah2.jpg"><><><>
I want to turn it into:
<><><><><img src="images/blah.jpg"><><><><><><><><img src="images/blah2.jpg"><><><>
The Code I have now is this:
Pattern p = Pattern.compile("<img.*src=\".*\\..*\"");
Matcher m = p.matcher(htmlString);
boolean b = m.find();
String imgPath = "src=\"images/";
while(b)
{
//Get file name.
String name="test.jpg\"";
//Assign new path.
m.group().replaceAll("src=\".*\"",imgPath+name);
}
Regular expressions are not the correct way to parse HTML. Don't do it. It's not possible to do correctly.
Use a proper parser.
Document doc = Jsoup.parse(someHtml);
Elements imgs = doc.select("img");
for (Element img : imgs) {
img.attr("src", "images/" + img.attr("src")); // or whatever
}
doc.outerHtml(); // returns the modified HTML
This code is almost perfect. It prints out a lot of info, so look for "Final Result" and "Original" to see the result of customizing the IMG tags. There's a small flaw that I'm still not sure how to fix. "in10" is the variable holding the test input string. The rest are regexes.
I noticed problems occur when I use newline characters and when "src=" is left blank instead of "src=\"\"" or "src=''". The quotes seem to affect the results.
private static String r16 = "(?s)(<img.*?)(src\\s*?=\\s*?(?:\"|').*?(?:\"|'))";
private static String in10 = "<><><><><img width=1 height=888 src=\"bnm.jpg\"<><><><><img src=\"\"> <img src = \"\"><img src ='folder1/folder2/bnm.jpg'><><><img src =\"'>";
private static String r14 = "(?s)\\/|\\=";
String path="images/";
String name="";
Pattern p = Pattern.compile(r16);
Matcher m = p.matcher(in10);
StringBuffer sb = new StringBuffer();
int i=1;
while(m.find())
{
String g0 = m.group();
String g2 = m.group(2);
System.out.println("Main group"+i+":"+g0);
System.out.println("Inner group1:"+m.group(1));
System.out.println("Inner group2:"+g2);
String[] names=g2.split(r14);
printNames(names);
/*
* src="/folder1/folder2/blah.jpg" ---> blah.jpg
* src="bnm.jpg" ---> src="bnm.jp"
*/
if(names.length>=1)
{
name = names[names.length-1];
}
else
{
name = "";
}
//Name might be empty string.
name = name.replaceAll("\"|'","");
System.out.println("Retrieved Name:"+name);
m.appendReplacement(sb,"$1src=\""+path+name+"\"");
i++;
}
m.appendTail(sb);
INPUT=sb.toString();
System.out.println("Final Result:"+INPUT);
System.out.println("Original____:"+in10);
System.out.println("Count:"+m.groupCount());
}
You should not use regex for this. The approach josh3736 described is robust. But if you want to use regex, you could use:
String s = "<><><><><img src=\"blah.jpg\"><><><><><><><><img src=\"blah2.jpg\"><><><>";
s = s.replaceAll("(?<=img src=\")([^\"]+)(?=\">)","images/$1");
System.out.println(s);
output :
<><><><><img src="images/blah.jpg"><><><><><><><><img src="images/blah2.jpg"><><><>
Although I agree with the others that regular expressions are the wrong way to modify HTML fragments, here is a JUnit test case that shows how to replace src attributes with a Pattern in Java:
import static org.junit.Assert.*;
import static org.hamcrest.CoreMatchers.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import org.junit.Test;
public class ImgSrcReplace {
@Test
public void replaceWithRegex() {
String dir = "image/";
String htmlFragment = "<body>\n"+
"<img src=\"single-line.jpg\">"+
"<img src=\n"+
"\"multiline.jpg\">\n"+
"<img src='single-quote.jpg'><img src=\"broken.gif\'>"+
"<img class=\"before\" src=\"class-before.jpg\">"+
"<img src=\"class-after.gif\" class=\"after\">"+
"</body>";
Pattern replaceImgSrc =
Pattern.compile(
"(<img\\b[^>]*\\bsrc\\s*=\\s*)([\"\'])((?:(?!\\2)[^>])*)\\2(\\s*[^>]*>)",
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE); // combine flags with |; & would yield 0 and switch both off
String result =
replaceImgSrc.matcher(htmlFragment)
.replaceAll("$1$2"+Matcher.quoteReplacement(dir)+"$3$2$4");
assertThat("the single line image tag was updated", result,
containsString("image/single-line.jpg"));
assertThat("the multiline image tag was updated", result,
containsString("image/multiline.jpg"));
assertThat("the single quote image tag was updated", result,
containsString("image/single-quote.jpg"));
assertThat("the broken gif was ignored.", result,
containsString("\"broken.gif'"));
assertThat("attributes before are preserved.", result,
containsString("<img class=\"before\" src=\"image/class-before.jpg\">"));
assertThat("attributes after are preserved.", result,
containsString("<img src=\"image/class-after.gif\" class=\"after\">"));
}
}
This is a part of a string
test="some text" test2="othertext"
It contains a lot more similar text with the same formatting. Each "statement" is separated by a space.
How do I search by name (test, test2) and replace its value (the text between the quotes)?
In Java.
I don't know if that is clear enough, but I don't know how else to explain it.
I want to search for "test" and replace its content with something else
replace
test="some text" test2="othertext"
with something else
Edit:
This is the content of a file:
test="some text" test2="othertext"
I read the content of that file into a string.
Now I want to replace some text with something else.
"some text" is not static; it can be anything.
You can use the replace() method of String, which comes in 3 types and 4 variants:
revStr.replace(oldChar, newChar)
revStr.replace(target, replacement)
revStr.replaceAll(regex, replacement)
revStr.replaceFirst(regex, replacement)
Eg:
String myString = "Here is the home of the home of the Stars";
myString = myString.replace("home","heaven");
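A short sketch of how the four variants differ on the same input. The class name, method name and sample string are made up for illustration. The main thing to notice is that replace() treats its target as literal text, while replaceAll() and replaceFirst() treat it as a regular expression.

```java
final class ReplaceVariants {
    /** Applies the four variants to s, in the order they are listed above. */
    static String[] applyAll(String s) {
        return new String[] {
            s.replace('a', 'A'),        // char for char
            s.replace("a.c", "X"),      // literal text: the dot is just a dot
            s.replaceAll("a.c", "X"),   // regex: the dot matches any character
            s.replaceFirst("a.c", "X")  // regex, first match only
        };
    }

    public static void main(String[] args) {
        for (String result : applyAll("a.c abc a.c")) {
            System.out.println(result);
        }
        // prints:
        // A.c Abc A.c
        // X abc X
        // X X X
        // X abc a.c
    }
}
```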
///////////////////// Edited Part //////////////////////////////////////
String s = "The quick brown fox test =\"jumped over\" the \"lazy\" dog";
// [^\"]* stops at the first closing quote, so later quoted text such as "lazy" is left alone
Pattern pat = Pattern.compile("test\\s*=\\s*\"[^\"]*\"");
Matcher mat = pat.matcher(s);
// Replace the whole name="value" pair with the new value
String t = mat.replaceAll("test=\"Hello\"");
System.out.println(t); // The quick brown fox test="Hello" the "lazy" dog
So you want to replace every instance of "test" with something else?
Let's say the string name is myString:
myString = myString.replace("test","something else");
Is this what you are looking to do?
I think you are asking how to replace values in data that you read from a file into a string.
Let's suppose your string is:
String s = "My name=\"sahil\" and my company=\"microsoft\", also i live in country=\"india\"";
Now you want to replace "sahil" with "mahajan" and "microsoft" with "google".
I experimented with the String methods to implement this, but didn't find a clean result. I can point you to some methods, though: regionMatches and indexOf("name=\"") will help you find where the value (sahil, say) is located. The replace method is harder to use here, because it replaces an exact character sequence, so you would need to know the old value in advance.
Experimenting with the String methods from there could help.
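If a regular expression is acceptable, one possible sketch is the following. The class and method names are hypothetical, and it assumes values are always wrapped in double quotes and never contain a double quote themselves.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttributeValueReplacer {
    /** Replaces the quoted value that follows name= in text; other names are left alone. */
    static String replaceValue(String text, String name, String newValue) {
        // (?<!\w) keeps "test" from matching inside "mytest";
        // requiring '=' right after the name keeps it from matching "test2".
        Pattern p = Pattern.compile(
                "(?<!\\w)(?<pre>" + Pattern.quote(name) + "\\s*=\\s*\")[^\"]*\"");
        // quoteReplacement protects '$' and '\' in the new value
        return p.matcher(text).replaceAll("${pre}" + Matcher.quoteReplacement(newValue) + "\"");
    }

    public static void main(String[] args) {
        String s = "My name=\"sahil\" and my company=\"microsoft\"";
        s = replaceValue(s, "name", "mahajan");
        s = replaceValue(s, "company", "google");
        System.out.println(s); // My name="mahajan" and my company="google"
    }
}
```

The old value does not need to be known in advance, since the pattern matches whatever sits between the quotes.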
I haven't tested this, but it should work:
String mFileContents;
private void replaceValue(String name, String newValue) {
    // Search for name=" so that "test" does not also match "test2"
    int nameIndex = mFileContents.indexOf(name + "=\"");
    if (nameIndex < 0) {
        return; // name not present
    }
    int valueStart = nameIndex + name.length() + 2; // skip the name, '=' and the opening quote
    int valueEnd = mFileContents.indexOf("\"", valueStart); // closing quote
    if (valueEnd < 0) {
        return; // malformed input: no closing quote
    }
    // Strings are immutable, so build the new contents from the parts around the old value
    mFileContents = mFileContents.substring(0, valueStart) + newValue + mFileContents.substring(valueEnd);
}
String a = "some text";
a = a.replace("text", "inserted value");
System.out.print(a);
Try this
I am parsing a site's usernames and other information, and each one has a bunch of spaces after it (as well as spaces in between the words).
For example: "Bob the Builder " or "Sam the welder ". The numbers of spaces vary from name to name. I figured I'd just use .trim(), since I've used this before.
However, it's giving me trouble. My code looks like this:
for (int i = 0; i < splitSource3.size(); i++) {
splitSource3.set(i, splitSource3.get(i).trim());
}
The result is just the same; no spaces are removed at the end.
Thank you in advance for your excellent answers!
UPDATE:
The full code is a bit more complicated, since there are HTML tags that are parsed out first. It goes exactly like this:
for (String s : splitSource2) {
if (s.length() > "<td class=\"dddefault\">".length() && s.substring(0, "<td class=\"dddefault\">".length()).equals("<td class=\"dddefault\">")) {
splitSource3.add(s.substring("<td class=\"dddefault\">".length()));
}
}
System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
splitSource3.set(i, splitSource3.get(i).substring(0, splitSource3.get(i).length() - 5));
splitSource3.set(i, splitSource3.get(i).trim());
System.out.println(i + ": " + splitSource3.get(i));
}
}
UPDATE:
Calm down. I never said the fault lay with Java, and I never said it was a bug or broken or anything. I simply said I was having trouble with it and posted my code for you to collaborate on and help solve my issue. Note the phrase "my issue" and not "java's issue". I have actually had the code printing out
System.out.println(i + ": " + splitSource3.get(i) + "*");
in a for each loop afterward.
This is how I knew I had a problem.
By the way, the problem has still not been fixed.
UPDATE:
Sample output (minus single quotes):
'0: Olin D. Kirkland '
'1: Sophomore '
'2: Someplace, Virginia 12345<br />VA SomeCity<br />'
'3: Undergraduate '
EDIT the OP rephrased his question at Query about the trim() method in Java, where the issue was found to be Unicode whitespace characters which are not matched by String.trim().
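For reference, String.trim() only removes characters up to U+0020, so something like a non-breaking space (U+00A0) survives it. A minimal sketch of a wider trim, with a made-up class name, method name and sample value:

```java
final class UnicodeTrim {
    /**
     * Removes leading and trailing whitespace, including Unicode separators
     * such as U+00A0 that String.trim() leaves in place.
     */
    static String stripUnicodeWhitespace(String s) {
        // \s covers ASCII whitespace; \p{Z} covers the Unicode separator categories
        return s.replaceAll("^[\\s\\p{Z}]+|[\\s\\p{Z}]+$", "");
    }

    public static void main(String[] args) {
        String scraped = "Olin D. Kirkland\u00A0\u00A0 "; // hypothetical scraped value
        System.out.println(scraped.trim().length());                    // 18: the two U+00A0 are still there
        System.out.println("|" + stripUnicodeWhitespace(scraped) + "|"); // |Olin D. Kirkland|
    }
}
```

This assumes the stray characters are separators of that kind; if they turn out to be something else, the character class would need adjusting.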
It just occurred to me that I used to have this sort of issue when I worked on a screen-scraping project. The key is that the downloaded HTML source sometimes contains non-printable characters that are not whitespace characters either. These are very difficult to copy and paste into a browser. I assume that this could have happened to you.
If my assumption is correct then you've got two choices:
Use a binary reader and figure out what those characters are - and delete them with String.replace(); E.g.:
private static String cutCharacters(String fromHtml) {
    String result = fromHtml;
    char[] problematicCharacters = {'\000', '\001', '\003'}; //this could be a private static final constant too
    for (char ch : problematicCharacters) {
        // String.replace has no (char, String) overload, so convert the char to a String first
        result = result.replace(String.valueOf(ch), "");
    }
    return result;
}
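If a binary reader is not at hand, one way to find out which characters are involved is to print the code point of each character. This is only a sketch; the class and method names are made up.

```java
final class CodePointDump {
    /** Lists every character of s as U+XXXX so that invisible characters become visible. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(dump("Bob\u00A0")); // U+0042 U+006F U+0062 U+00A0
    }
}
```

Anything in the output other than U+0020 at the end of a name is a candidate for the problematicCharacters array.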
If you find some sort of recurring pattern in the HTML to be parsed, then you can use regexes and substrings to cut the unwanted parts. E.g.:
private String getImportantParts(String fromHtml) {
Pattern p = Pattern.compile("(\\w*\\s*)"); //this could be a private static final constant as well.
Matcher m = p.matcher(fromHtml);
StringBuilder buff = new StringBuilder();
while (m.find()) {
buff.append(m.group(1));
}
return buff.toString().trim();
}
Works without a problem for me.
Here is your code, refactored a bit and (maybe) more readable:
final String openingTag = "<td class=\"dddefault\">";
final String closingTag = "</td>";
List<String> splitSource2 = new ArrayList<String>();
splitSource2.add(openingTag + "Bob the Builder " + closingTag);
splitSource2.add(openingTag + "Sam the welder " + closingTag);
for (String string : splitSource2) {
System.out.println("|" + string + "|");
}
List<String> splitSource3 = new ArrayList<String>();
for (String s : splitSource2) {
if (s.length() > openingTag.length() && s.startsWith(openingTag)) {
String nameWithoutOpeningTag = s.substring(openingTag.length());
splitSource3.add(nameWithoutOpeningTag);
}
}
System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
String name = splitSource3.get(i);
int closingTagBegin = splitSource3.get(i).length() - closingTag.length();
String nameWithoutClosingTag = name.substring(0, closingTagBegin);
String nameTrimmed = nameWithoutClosingTag.trim();
splitSource3.set(i, nameTrimmed);
System.out.println("|" + splitSource3.get(i) + "|");
}
I know that's not a real answer, but I cannot post comments and this code wouldn't fit in a comment, so I made it an answer so that Olin Kirkland can check his code.