I'm trying to parse a error file from a system which is presented to me in HTML. Don't find it very pretty, but this is what I have to work with.
The errors are presented with codes which I can find a reference to in a catalog based on a set and a message id.
<HTML>
<BODY>
<h4>2020-07-16 10:24:22.614</h4>
<SPAN STYLE="color:black; font:bold;"> Set:<INPUT TYPE="text" VALUE="158" SIZE=3</INPUT> Id: <INPUT TYPE="text" VALUE="10420" SIZE=5</INPUT>
</SPAN>
</BODY>
</HTML>
I'm trying to parse the timestamp, and the two values in the input fields with JSoup. The timestamp is not a problem at all, but I don't seem to find a way to parse the Set and the Id of the message.
Document doc = Jsoup.parse(errorLog, "UTF-8", "");
Element body = doc.body();
Elements MessageTimestamps = doc.select("h4");
Elements MessageSets = doc.getElementsByAttributeValue("SIZE", "3");
Elements MessageID = doc.getElementsByAttributeValue("SIZE","5");
String[] timestampArray = new String[MessageTimestamps.size()];
System.out.println("Total: " + timestampArray.length);
for(int i = 0; i< MessageTimestamps.size(); i++) {
System.out.println("Timestamp: " + MessageTimestamps.get(i).text());
System.out.println("MessageSets: " + MessageSets.get(i).text());
}
Result:
Total: 6
Timestamp: 2020-07-16 10:24:22.614
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
Anyone an idea?
You could select the input fields having a SIZE attribute which contain the values 3 or 5 by doing something like:
public static void main(String[] args){
String html = "<HTML>\n"
+ "<BODY>\n"
+ "<h4>2020-07-16 10:24:22.614</h4>\n"
+ "<SPAN STYLE=\"color:black; font:bold;\"> Set:<INPUT TYPE=\"text\" VALUE=\"158\" SIZE=3</INPUT> Id: <INPUT TYPE=\"text\" VALUE=\"10420\" SIZE=5</INPUT>\n"
+ "</SPAN>\n"
+ "</BODY>\n"
+ "</HTML>";
Document doc = Jsoup.parse(html);
Element time = doc.selectFirst("h4");
Element set = doc.selectFirst("INPUT[SIZE*=3]");
Element id = doc.selectFirst("INPUT[SIZE*=5]");
System.out.println(time.text());
System.out.println(set.attr("value"));
System.out.println(id.attr("value"));
}
Basically, I am using Jsoup to parse a site, I want to get all the links from the following html:
<ul class="detail-main-list">
<li>
Dis Be the link
</li>
</ul>
Any idea how?
Straight from jsoup.org, right there, first thing you see:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
log("%s\n\t%s",
headline.attr("title"), headline.absUrl("href"));
}
Modifying this to what you need seems trivial:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
Elements anchorTags = doc.select("ul.detail-main-list a");
for (Element anchorTag : anchorTags) {
System.out.println("Links to: " + anchorTag.attr("href"));
System.out.println("In absolute form: " + anchorTag.absUrl("href"));
System.out.println("Text content: " + anchorTag.text());
}
The ul.detail-main-list a part is a so-called selector string. A real short tutorial on these:
foo means: Any HTML element with that tag name, i.e. <foo></foo>.
.bar means: Any HTML element with class bar, i.e. <foo class="bar baz"></foo>
#bar means: Any HTML element with id bar, i.e. <foo id="bar">
These can be combined: ul.detail-main-list matches any <ul> tags that have the string detail-main-list in their list of classes.
a b means: all things matching the 'b' selection that have something matching 'a' as a parent. So ul a matches all <a> tags that have a <ul> tag around them someplace.
The JSoup docs are excellent.
You can do a specific a href link in this way from any website.
public static void main(String[] args) {
String htmlString = "<html>\n" +
" <head></head>\n" +
" <body>\n" +
"<ul class=\"detail-main-list\">\n" +
" <li> \n" +
" Dis Be the link\n" +
" </li> \n" +
"</ul>" +
" </body>\n" +
"</html>"
+ "<head></head>";
Document html = Jsoup.parse(htmlString);
Elements elements = html.select("a");
for(Element element: elements){
System.out.println(element.attr("href"));
}
}
Output:
/manga/toki_wa/v01/c001/1.html
Hey does anyone know how to parse the "Light rain", " 7°C", and "Limited"? These are stored as #text so that's kind of throwing me off. For reference, to parse "Temperature:", it would be Element element5 = doc.select("strong").get(3);
Thanks!
The nodes from your example are called text nodes. In Jsoup, you can read the text nodes of a node by using the text() method. So given your example using Jsoup we'd select the td element and then use text() to get it's text value.
However, this would also output the text value from any child nodes, so in your case this would produce Weather: Light rain as a single string. Fortunately, Jsoup also has a ownText() method that only extracts the value from the text nodes that are a direct descendant of the element (and not all children). So given your example code, you could write it like this:
Element element5 = doc.select("td").get(3);
String value = element5.ownText()
You can use variuos ways to extract required text and one of them is td.childNode(1).toString() and complete solution is mentioned below:
public static void main(String[] args) {
// Parse HTML String using JSoup library
String HTMLSTring = "<html>\n" +
" <head></head>\n" +
" <body>\n" +
" <table class=\"table\"> \n" +
" <tbody>\n" +
" <tr> \n" +
" <td><strong>Weather: </strong>Light Rain</td> \n" +
" </tr> \n" +
" <tr> \n" +
" <td><strong>Tempratue: </strong>70 C</td> \n" +
" </tr> \n" +
" <tr> \n" +
" <td><strong>Visibility: </strong>Limited</td> \n" +
" </tr> \n" +
" <tr> \n" +
" <td><strong>Runs open: </strong>0</td> \n" +
" </tr>\n" +
" </tbody>\n" +
" </table>\n" +
" </body>\n" +
"</html>"
+ "<head></head>";
Document html = Jsoup.parse(HTMLSTring);
Elements tds = html.getElementsByTag("td");
for (Element td : tds) {
//String tdStrongText = td.childNode(0).childNodes().get(0).toString();
String tdStrongText = td.select("strong").text();
System.out.print(tdStrongText + " : ");
String tdText = td.childNode(1).toString();
System.out.println(tdText);
}
}
Check out code on github.
I receive a Http response after a call as Html String and I would like to scrape certain value stored inside the ReportViewer1 variable.
<html>
....................
...........
<script type="text/javascript">
var ReportViewer1 = new ReportViewer('ReportViewer1', 'ReportViewer1_ReportToolbar', 'ReportViewer1_ReportArea_WaitControl', 'ReportViewer1_ReportArea_ReportCell', 'ReportViewer1_ReportArea_PreviewFrame', 'ReportViewer1_ParametersAreaCell', 'ReportViewer1_ReportArea_ErrorControl', 'ReportViewer1_ReportArea_ErrorLabel', 'ReportViewer1_CP', '/app/Telerik.ReportViewer.axd', 'a90a0d41efa6429eadfefa42fc529de1', 'Percent', '100', '', 'ReportViewer1_EditorPlaceholder', 'ReportViewer1_CalendarFrame', 'ReportViewer1_ReportArea_DocumentMapCell', {
CurrentPageToolTip: 'STR_TELERIK_MSG_CUR_PAGE_TOOL_TIP',
ExportButtonText: 'Export',
ExportToolTip: 'Export',
ExportSelectFormatText: 'Export to the selected format',
FirstPageToolTip: 'First page',
LabelOf: 'of',
LastPageToolTip: 'Last Page',
ProcessingReportMessage: 'Generating report...',
NoPageToDisplay: 'No page to display.',
NextPageToolTip: 'Next page',
ParametersToolTip: 'Click to close parameters area|Click to open parameters area',
DocumentMapToolTip: 'Hide document map|Show document map',
PreviousPageToolTip: 'Previous page',
TogglePageLayoutToolTip: 'Switch to interactive view|Switch to print preview',
SessionHasExpiredError: 'Session has expired.',
SessionHasExpiredMessage: 'Please, refresh the page.',
PrintToolTip: 'Print',
RefreshToolTip: 'Refresh',
NavigateBackToolTip: 'Navigate back',
NavigateForwardToolTip: 'Navigate forward',
ReportParametersSelectAllText: '<select all>',
ReportParametersSelectAValueText: '<select a value>',
ReportParametersInvalidValueText: 'Invalid value.',
ReportParametersNoValueText: 'Value required.',
ReportParametersNullText: 'NULL',
ReportParametersPreviewButtonText: 'Preview',
ReportParametersFalseValueLabel: 'False',
ReportParametersInputDataError: 'Missing or invalid parameter value. Please input valid data for all parameters.',
ReportParametersTrueValueLabel: 'True',
MissingReportSource: 'The source of the report definition has not been specified.',
ZoomToPageWidth: 'Page Width',
ZoomToWholePage: 'Full Page'
}, 'ReportViewer1_ReportArea_ReportArea', 'ReportViewer1_ReportArea_SplitterCell', 'ReportViewer1_ReportArea_DocumentMapCell', true, true, 'PDF', 'ReportViewer1_RSID', true);
</script>
...................
...................
</html>
The value is a90a0d41efa6429eadfefa42fc529de1 and this is in the middle of this content:
'/app/Telerik.ReportViewer.axd', 'a90a0d41efa6429eadfefa42fc529de1', 'Percent', '100',
Whats the best way I can parse this value using Java?
Parse the HTML with String class
public class HtmlParser {
public static void main(String args[]){
String result = getValuesProp(html);
System.out.println("Result: "+ result);
}
static String PIVOT = "Telerik.ReportViewer.axd";
public static String getValuesProp(String json) {
String subString;
int i = json.indexOf(PIVOT);
i+= PIVOT.length();
//', chars
i+=2;
subString = json.substring(i);
i = subString.indexOf("'");
i++;
subString = subString.substring(i);
i = subString.indexOf("'");
subString = subString.substring(0,i);
return subString;
}
static String html ="<html>\n" +
"\n" +
"<script type=\"text/javascript\">\n" +
" var ReportViewer1 = new ReportViewer('ReportViewer1', 'ReportViewer1_ReportToolbar', 'ReportViewer1_ReportArea_WaitControl', 'ReportViewer1_ReportArea_ReportCell', 'ReportViewer1_ReportArea_PreviewFrame', 'ReportViewer1_ParametersAreaCell', 'ReportViewer1_ReportArea_ErrorControl', 'ReportViewer1_ReportArea_ErrorLabel', 'ReportViewer1_CP', '/app/Telerik.ReportViewer.axd', 'a90a0d41efa6429eadfefa42fc529de1', 'Percent', '100', '', 'ReportViewer1_EditorPlaceholder', 'ReportViewer1_CalendarFrame', 'ReportViewer1_ReportArea_DocumentMapCell', {\n" +
" CurrentPageToolTip: 'STR_TELERIK_MSG_CUR_PAGE_TOOL_TIP',\n" +
" ExportButtonText: 'Export',\n" +
" ExportToolTip: 'Export',\n" +
" ExportSelectFormatText: 'Export to the selected format',\n" +
" FirstPageToolTip: 'First page',\n" +
" LabelOf: 'of',\n" +
" LastPageToolTip: 'Last Page',\n" +
" ProcessingReportMessage: 'Generating report...',\n" +
" NoPageToDisplay: 'No page to display.',\n" +
" NextPageToolTip: 'Next page',\n" +
" ParametersToolTip: 'Click to close parameters area|Click to open parameters area',\n" +
" DocumentMapToolTip: 'Hide document map|Show document map',\n" +
" PreviousPageToolTip: 'Previous page',\n" +
" TogglePageLayoutToolTip: 'Switch to interactive view|Switch to print preview',\n" +
" SessionHasExpiredError: 'Session has expired.',\n" +
" SessionHasExpiredMessage: 'Please, refresh the page.',\n" +
" PrintToolTip: 'Print',\n" +
" RefreshToolTip: 'Refresh',\n" +
" NavigateBackToolTip: 'Navigate back',\n" +
" NavigateForwardToolTip: 'Navigate forward',\n" +
" ReportParametersSelectAllText: '<select all>',\n" +
" ReportParametersSelectAValueText: '<select a value>',\n" +
" ReportParametersInvalidValueText: 'Invalid value.',\n" +
" ReportParametersNoValueText: 'Value required.',\n" +
" ReportParametersNullText: 'NULL',\n" +
" ReportParametersPreviewButtonText: 'Preview',\n" +
" ReportParametersFalseValueLabel: 'False',\n" +
" ReportParametersInputDataError: 'Missing or invalid parameter value. Please input valid data for all parameters.',\n" +
" ReportParametersTrueValueLabel: 'True',\n" +
" MissingReportSource: 'The source of the report definition has not been specified.',\n" +
" ZoomToPageWidth: 'Page Width',\n" +
" ZoomToWholePage: 'Full Page'\n" +
" }, 'ReportViewer1_ReportArea_ReportArea', 'ReportViewer1_ReportArea_SplitterCell', 'ReportViewer1_ReportArea_DocumentMapCell', true, true, 'PDF', 'ReportViewer1_RSID', true);\n" +
" </script>\n" +
"\n" +
"</html>";
}
I would read the text a line at a time like how most files are read. Because the format will always be the same, you look for a line that begins with the characters "var ReportViewer1." Then you know you have found the line you want. You may need to strip some white space, although it will always be formatted with the same whitespace too (up to you really.)
When you have the line, use the String .split() method to split that line into an array. There are nice delimiters there to split on ... "," or " " or ", " ... again, see what works best for you.
Test the split up line parts for '/app/Telerik.ReportViewer.axd' ... the next member of your split array will be the value you are looking for.
Again, the formatting will always be the same, so you can rely on that to find your variable. Of course, study the html text to make sure it does always follow the same format within the line you are investigating, but looking at it, I assume it probably does.
Again, find your line ... split it on a delimiter ... and use some logic to find the element you are after in the split up line parts.
I have a span class like in the attached picture. I want to fetch all three values i.e. 0.413%, 0.012%, -- and --
When I traverse to this span class and get text then all three values stored in the string but i want them one by one.
'--' can be at anywhere. How to fetch these values.
<span class="text-light ng-binding" ng-show="calculatorStatus == 'COMPLETED'" style="font-size: 0.85em;">
0.413%
<br/>
0.012%
<br/>
--
</span>
Actual: 0.413% \n 0.012% \n --
Expected: 0.413%, 0.012%, --
Looks like homework. Hmmm. OK. I keep this in my kitbag:
public String getTextFromElementsTextNodes(WebDriver webDriver, WebElement element) throws IllegalArgumentException {
String text = "";
if (webDriver instanceof JavascriptExecutor) {
text = (String)((JavascriptExecutor) webDriver).executeScript(
"var nodes = arguments[0].childNodes;" +
"var text = '';" +
"for (var i = 0; i < nodes.length; i++) {" +
" if (nodes[i].nodeType == Node.TEXT_NODE) {" +
" text += nodes[i].textContent;" +
" }" +
"}" +
"return text;"
, element);
} else {
throw new IllegalArgumentException("driver is not an instance of JavascriptExecutor");
}
return text;
}
It returns all characters including non-ASCII line breaks. I usually just want the text so I add this
getTextFromElementsTextNodes(driver, anElement).replaceAll("[^\\x00-\\x7F]", " ");