Get an attribute value from HTML code in Java

I have an HTML string and I want to get the value of one attribute (id) from it.
Can you help me with how to do it?
String msHTMLFile = "<ABBR class='HighlightClass' id='highlight40001' style=\"BACKGROUND-COLOR: yellow\" >Fetal/Neonatal Morbidity and Mortality</ABBR>";
The result should be: highlight40001

Try using this regular expression pattern:
\bid='([^']*)'
Then extract the string captured by group 1. This is not foolproof; using regex to parse HTML never is. You can try to complicate the regex to make it more flexible, or you can just use an HTML parser. I recommend the latter.
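A minimal sketch of how the pattern above might be applied with java.util.regex. The class and method names (IdAttributeRegex, extractId) are made up for illustration, and the sketch assumes the id value is always wrapped in single quotes, as in the question's string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdAttributeRegex {
    // Same pattern as above: the word "id", an equals sign, and a single-quoted value.
    private static final Pattern ID_PATTERN = Pattern.compile("\\bid='([^']*)'");

    /** Returns the value captured by group 1, or null when the pattern does not match. */
    static String extractId(String html) {
        Matcher matcher = ID_PATTERN.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String msHTMLFile = "<ABBR class='HighlightClass' id='highlight40001' style=\"BACKGROUND-COLOR: yellow\" >Fetal/Neonatal Morbidity and Mortality</ABBR>";
        System.out.println(extractId(msHTMLFile)); // prints highlight40001
    }
}
```

An id written with double quotes or no quotes would not match, which is one reason the parser route is usually safer.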

Also not so clean, but this should work for you.
You can treat it as xml and parse it using JAXB:
ABBR.java:
import javax.xml.bind.annotation.XmlAttribute;
public class ABBR
{
@XmlAttribute public String id;
}
Main.java:
[..]
String msHTMLFile = "<ABBR class='HighlightClass' id='highlight40001' style=\"BACKGROUND-COLOR: yellow\" >Fetal/Neonatal Morbidity and Mortality</ABBR>";
ABBR obj = JAXB.unmarshal(new StringReader(msHTMLFile), ABBR.class);
System.out.println(obj.id);
[..]

If you're lucky and your HTML source produces XML-compliant HTML, JAXB or other XML parsers will do fine with it. Many people don't write particularly well-formed HTML (unclosed tags, etc.), though. For that case, some of my coworkers have gotten good results parsing HTML with HotSAX: http://sourceforge.net/projects/hotsax/
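For the XML-compliant case, the JDK's built-in DOM parser is another option. This may matter on JDK 11 and later, where javax.xml.bind is no longer bundled. The sketch below uses hypothetical names (IdAttributeDom, rootId) and assumes the snippet is well-formed XML with a single root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class IdAttributeDom {
    /** Parses a well-formed snippet and returns the root element's id attribute. */
    static String rootId(String markup) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)))
                    .getDocumentElement();
            return root.getAttribute("id"); // empty string when the attribute is absent
        } catch (Exception e) {
            // Unclosed tags and similar problems end up here.
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String msHTMLFile = "<ABBR class='HighlightClass' id='highlight40001' style=\"BACKGROUND-COLOR: yellow\" >Fetal/Neonatal Morbidity and Mortality</ABBR>";
        System.out.println(rootId(msHTMLFile)); // prints highlight40001
    }
}
```

Real-world HTML with unclosed tags would be rejected by this parser, so a lenient HTML parser is the better fit there.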

Related

Remove custom tag from a string then format its content

I need help parsing, modifying, and showing a string in an Android app (Java, max API level 22).
This is an example string I'm getting from an API; it contains only custom tags:
<BOLD> Something <RED> went wrong </RED> </BOLD> <NEWLINE> Server unreachable </NEWLINE>
I need to remove all these custom tags and then format the content according to the tags that wrapped each substring (so I'm expecting, for example, to get "went wrong" in red and bold). I have already looked for similar problems but can't get to the final result.
The string (cleaned and formatted) will then be used to set the text of a TextView inside a ListView.
One way of doing this is as follows:
String testString="<BOLD> Something <RED> went wrong </RED> </BOLD> <NEWLINE> Server unreachable </NEWLINE>";
testString=testString.replaceAll("<BOLD>","<font> <b>");
testString=testString.replaceAll("</BOLD>","</b> </font>");
testString=testString.replaceAll("<RED>","<font color =\"#FF0000\">"); //#FF0000 is hex code for red color
testString=testString.replaceAll("</RED>","</font> ");
testString=testString.replaceAll("<NEWLINE>","<br>");
testString=testString.replaceAll("</NEWLINE>","");
TextView textView=findViewById(R.id.text);
textView.setText(Html.fromHtml(testString));
Using Regex (Regular Expressions)
Pass your string through a regex replacement and it removes the tags for you. The character class has to include uppercase letters, because the tags in the question are uppercase.
Kotlin
This removes all simple tags (tags without attributes) from your string:
val result = yourString.replace(Regex("</?[A-Za-z]+>"), "")
Java
String result = yourString.replaceAll("</?[A-Za-z]+>", "");
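As a rough illustration, here is the tag-stripping idea applied to the sample string from the question. The class name TagStripper is invented for this sketch, and the pattern assumes tags consist of letters only, with no attributes. Because the tags in the question are uppercase, the character class accepts both cases. Stripping discards the formatting information, so this only covers the "remove the tags" half of the problem.

```java
final class TagStripper {
    /** Removes simple opening and closing tags, then tidies the leftover whitespace. */
    static String stripTags(String input) {
        return input.replaceAll("</?[A-Za-z]+>", "") // drop <TAG> and </TAG>
                    .replaceAll("\\s+", " ")          // collapse runs of whitespace
                    .trim();
    }

    public static void main(String[] args) {
        String testString = "<BOLD> Something <RED> went wrong </RED> </BOLD> <NEWLINE> Server unreachable </NEWLINE>";
        System.out.println(stripTags(testString)); // prints Something went wrong Server unreachable
    }
}
```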

How do i Jsoup query for the value of a html key/value pair

Sorry if my terms are off; I haven't done this before.
I'm using jsoup to scrape a single value off a web page.
I am trying to find the "serialno" value, which is set through this function (JavaScript?):
function set(obj, val)
{
document.getElementById(obj).innerHTML= val;
}
called by
{set("modelname", "NPort 5650-16");set("mac", "00:90:E8:22:76:F4");set("serialno", "2583");set("ver", "3.3 Build 08042219");setlabel("NPORT");uptime("264 days, 03h:31m:34s");}<
I am unsure how to use jsoup to extract/print the serialno value, which in this case happens to be 2583. I've tried basic commands using getElementById, but I've never used jsoup before. I am familiar with maps, but I'm not sure how to apply them with jsoup, and most of the tutorials online need the actual 'path' to the exact cell within the table (where this info is displayed).
You can't use Jsoup to do this. Jsoup can parse HTML, but JavaScript is out of its reach and is treated as plain text. Jsoup can't execute it, and selecting things inside JavaScript is not possible.
But if you already have the HTML parsed into a Document and you're looking for an alternative solution, you can try regular expressions to grab this value.
Document doc = Jsoup.parse...
String html = doc.toString();
Pattern p = Pattern.compile("set\\(\"serialno\", \"(\\d+)\"\\)");
Matcher m = p.matcher(html);
if (m.find()) {
String serialno = m.group(1);
System.out.println(serialno);
}
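Since the question mentions maps, the same regex idea can be stretched to collect every set("key", "value") call into a map. This is only a sketch: SetCallParser and parse are hypothetical names, and it assumes both arguments are double-quoted string literals without escaped quotes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SetCallParser {
    // Matches set("key", "value"); \b keeps names such as reset(...) from matching.
    private static final Pattern SET_CALL =
            Pattern.compile("\\bset\\(\"([^\"]*)\",\\s*\"([^\"]*)\"\\)");

    /** Collects the key/value pairs of all set(...) calls, in order of appearance. */
    static Map<String, String> parse(String script) {
        Map<String, String> values = new LinkedHashMap<>();
        Matcher m = SET_CALL.matcher(script);
        while (m.find()) {
            values.put(m.group(1), m.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        String script = "{set(\"modelname\", \"NPort 5650-16\");set(\"mac\", \"00:90:E8:22:76:F4\");"
                + "set(\"serialno\", \"2583\");set(\"ver\", \"3.3 Build 08042219\");setlabel(\"NPORT\");}";
        System.out.println(parse(script).get("serialno")); // prints 2583
    }
}
```

The input could be the doc.toString() result from the snippet above; setlabel(...) and uptime(...) are skipped because they are not set(...) calls with two arguments.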

How to replace xml empty tags using regex

I have a lot of empty XML tags which need to be removed from a string.
String dealData = dealDataWriter.toString();
someData = someData.replaceAll("<somerandomField1/>", "");
someData = someData.replaceAll("<somerandomField2/>", "");
someData = someData.replaceAll("<somerandomField3/>", "");
someData = someData.replaceAll("<somerandomField4/>", "");
This uses a lot of string operations, which is not efficient. What would be a better way to avoid these operations?
I would not suggest using regex when operating on HTML/XML, but for a simple case like yours it may be OK to use a rule like this one:
someData = someData.replaceAll("<\\w+?\\/>", "");
If you also want to allow optional whitespace before and after the tag name:
someData = someData.replaceAll("<\\s*\\w+?\\s*\\/>", "");
Try the following code. It removes every self-closing tag that has no whitespace in it.
someData = someData.replaceAll("<\\w+/>", "");
Alternatively to using regex or string matching, you can use an xml parser to find empty tags and remove them.
See the answers given over here: Java Remove empty XML tags
If you would like to remove <tagA></tagA> as well as <tagB/>, you can use the following regex. Note that \1 is a back-reference to the first capturing group.
// identifies an empty tag, i.e. <tag1></tag1> or <tag/>
// it also allows whitespace around or within the tag; however, tags whose value is whitespace will not match.
private static final String EMPTY_VALUED_TAG_REGEX = "\\s*<\\s*(\\w+)\\s*></\\s*\\1\\s*>|\\s*<\\s*\\w+\\s*/\\s*>";
Run the code on ideone
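A short sketch of how that constant might be used with String.replaceAll. The class name EmptyTagRemover and the sample XML are made up for illustration.

```java
final class EmptyTagRemover {
    // Regex from the answer above: <tag></tag> (via the \1 back-reference) or <tag/>.
    private static final String EMPTY_VALUED_TAG_REGEX =
            "\\s*<\\s*(\\w+)\\s*></\\s*\\1\\s*>|\\s*<\\s*\\w+\\s*/\\s*>";

    /** Removes empty elements in a single pass over the input. */
    static String removeEmptyTags(String xml) {
        return xml.replaceAll(EMPTY_VALUED_TAG_REGEX, "");
    }

    public static void main(String[] args) {
        String someData = "<root><a/><b></b><c>text</c><d /></root>";
        System.out.println(removeEmptyTags(someData)); // prints <root><c>text</c></root>
    }
}
```

A single pass does not catch elements that only become empty after their children are removed, for example <x><y/></x> turns into <x></x>. Repeating the replacement until the string stops changing is one way to handle that.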

Java: How to retrieve the text content of a custom JLabel without the HTML tags?

How do I retrieve the text from a JLabel without the HTML tags?
E.g.
CustomJLabel:
public class CustomJLabel extends JLabel {
    private String text;
    public CustomJLabel(String text) {
        super("<html><div style='text-align: center;'>" + text + "</div></html>");
        this.text = text;
    }
}
Main method:
CustomJLabel testCustomLbl = new CustomJLabel("Testing");
System.out.println(testCustomLbl.getText());
Output I got:
<html><div style='text-align: center;'>Testing</div></html>
Desired output:
Testing
There are three options:
1. Pick your favorite HTML parser and parse the HTML. This is by far the most robust and straightforward solution, but of course it is costly.
2. If you know the exact HTML content that goes into your labels, you could turn to regular expressions or other means of string parsing. The problem is that if you don't control those strings, coming up with your own custom "parsing" is hard, because any change to the HTML that goes in might break your little parser.
3. Rework your whole design: if having HTML text is such a core thing in your application, you might consider really "representing" that in your class. For example, create your own version of JLabel that takes some HtmlString input and simply remembers which parts are HTML and which are "pure text".
And whoops; the code you are showing is already suited for option 3. So if you want getText() to return the original text, you could add a simple
@Override
public String getText() {
    return this.text;
}
to your CustomJLabel class.
Edit: alternatively, you could simply add a new method like
public String getTextWithoutHtmlTags()
or something similar, since overriding the inherited method changes its "contract", which (depending on the context) might or might not be OK. Be aware that Swing's label UI also calls getText() when rendering, so the override may cause the label to display the plain text instead of the HTML.
There's no need for complex code or 3rd party JARS / Libraries.
Here's a simple solution using RegEx:
String htmlStr = "<html><h1>Heading</h1> ...... </html>";
String noHtmlStr = htmlStr.replaceAll("\\<.*?\\>", "");
Works great for me.
Hope this helps.
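Applied to the label text from the question, the regex approach might look like the sketch below. HtmlTagStripper is an invented name. The pattern only deletes anything between '<' and the next '>', so character entities such as &amp; are left as they are.

```java
final class HtmlTagStripper {
    /** Deletes everything from each '<' up to the nearest '>'. */
    static String stripTags(String html) {
        return html.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        String labelText = "<html><div style='text-align: center;'>Testing</div></html>";
        System.out.println(stripTags(labelText)); // prints Testing
    }
}
```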

Retrieve value of attribute using XPath

I am trying to retrieve the value of an attribute from an XML file using XPath, and I am not sure where I am going wrong.
This is the XML File
<soapenv:Envelope>
<soapenv:Header>
<common:TestInfo testID="PI1" />
</soapenv:Header>
</soapenv:Envelope>
And this is the code I am using to get the value. Both of these return nothing.
XPathBuilder getTestID = new XPathBuilder("local-name(/*[local-name(.)='Envelope']/*[local-name(.)='Header']/*[local-name(.)='TestInfo'])");
XPathBuilder getTestID2 = new XPathBuilder("Envelope/Header/TestInfo/@testID");
Object doc2 = getTestID.evaluate(context, sourceXML);
Object doc3 = getTestID2.evaluate(context, sourceXML);
How can I retrieve the value of testID?
However you're iterating within the Java code, your context node is probably not what you think, so remove the "." specifier in your local-name(.) like so:
/*[local-name()='Header']/*[local-name()='TestInfo']/@testID worked fine for me with your XML, although as akaIDIOT says, there isn't an <Envelope> tag to be seen.
The XML file you provided does not contain an <Envelope> element, so an expression that requires it will never match.
Post-edit edit
As can be seen from your XML snippet, the document uses a specific namespace for the elements you're trying to match. An XPath engine is namespace-aware, meaning you'll have to ask it for exactly what you need. Also keep in mind that a namespace is defined by its URI, not by its prefix (so /namespace:element doesn't do much unless you let the XPath engine know which URI the prefix namespace refers to).
Your first XPath has an extra local-name() wrapped around the whole thing:
local-name(/*[local-name(.)='Envelope']/*[local-name(.)='Header']
/*[local-name(.)='TestInfo'])
The result of this XPath will either be the string value "TestInfo" if the TestInfo node is found, or a blank string if it is not.
If your XML is structured like you say it is, then this should work:
/*[local-name()='Envelope']/*[local-name()='Header']/*[local-name()='TestInfo']/@testID
But preferably, you should be working with namespaces properly instead of (ab)using local-name(). I have a post here that shows how to do this in Java.
If you don't care about the namespaces and use an XPath 2.0 compatible engine, use * as the namespace wildcard.
//*:Header/*:TestInfo/@testID
will return the desired result.
It will probably be more elegant to register the needed namespaces (not covered here; it depends on your XPath engine) and query using these:
//soapenv:Header/common:TestInfo/@testID
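With the JDK's own javax.xml.xpath API (not the XPathBuilder used in the question), registering namespaces could look roughly like this. The sketch rests on several assumptions: the class name TestIdXPath is invented, the xmlns declarations are added by hand because the question's snippet omits them, the SOAP 1.1 envelope URI is assumed for soapenv, and the URI for common is a placeholder.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class TestIdXPath {
    // Assumed URIs: the question's snippet does not show its xmlns declarations.
    static final String SOAPENV_NS = "http://schemas.xmlsoap.org/soap/envelope/";
    static final String COMMON_NS = "http://example.com/common"; // hypothetical placeholder

    static final String XML =
            "<soapenv:Envelope xmlns:soapenv=\"" + SOAPENV_NS + "\" xmlns:common=\"" + COMMON_NS + "\">"
          + "<soapenv:Header><common:TestInfo testID=\"PI1\" /></soapenv:Header>"
          + "</soapenv:Envelope>";

    /** Evaluates an XPath expression against the sample document and returns its string value. */
    static String evaluate(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default; required for prefix-based matching
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Maps the prefixes used in the expression to namespace URIs.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    if ("soapenv".equals(prefix)) return SOAPENV_NS;
                    if ("common".equals(prefix)) return COMMON_NS;
                    return XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) { return null; }
                @Override public Iterator<String> getPrefixes(String uri) {
                    return Collections.emptyIterator();
                }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate("//soapenv:Header/common:TestInfo/@testID")); // prints PI1
    }
}
```

The prefixes in the expression only need to map to the same URIs as the document; they do not have to be spelled the same as the prefixes in the XML.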
