Partial parse Comment tag in JSoup - java

I want to check whether HTML comment tags are inserted properly. If there is an opening tag but no closing tag, I want to display an error.
I referred to this link and was able to retrieve the valid comment nodes:
for(Element e : document.getAllElements()){
    for(Node n: e.childNodes()){
        if(n instanceof Comment){
            commentNodes++;
            System.out.println(n);
        }
    }
}
if(html.contains("<!--") && commentNodes == 0) {
    System.out.println("error");
} else if(html.contains("-->") && commentNodes == 0) {
    System.out.println("error1");
}
Is there a better way of doing this?

I'm not sure whether JSoup will parse an unclosed comment tag, but there is another way to achieve your goal:
// StringUtils is assumed here to be the Apache Commons Lang class
int openings = StringUtils.countMatches(html, "<!--");
int closures = StringUtils.countMatches(html, "-->");
if (openings > closures)
    System.out.println("Error: There are unclosed comment tags!");
This counts the occurrences of comment openings and closures; by comparing the two numbers you can tell whether a comment was left unclosed. To be more certain of the result, you can also compare these values with the number of comment nodes obtained by JSoup (commentNodes in your example).
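If pulling in a library just for the counting is not wanted, the same comparison can be sketched with plain String.indexOf. The class and method names below (CommentTagCheck, countOccurrences, hasUnclosedComment) are made up for illustration, and the check is only a heuristic: it compares counts and does not verify that each opening comes before its closure.

```java
class CommentTagCheck {
    // Counts non-overlapping occurrences of token in text.
    static int countOccurrences(String text, String token) {
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(token, from)) != -1) {
            count++;
            from += token.length();
        }
        return count;
    }

    // Heuristic: more "<!--" than "-->" means at least one comment is not closed.
    static boolean hasUnclosedComment(String html) {
        return countOccurrences(html, "<!--") > countOccurrences(html, "-->");
    }
}
```

For example, hasUnclosedComment("<p><!-- note</p>") reports an unclosed comment, while a string with one opening and one closure does not.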

Related

Find occurrences of a specific tag in an XML file in java using recursion

I need to return the number of occurrences of a given tag. For example, a user will provide a link to an XML file and the name of the tag to find, and it will return the number of occurrences of that specific tag. My code so far only works for the children of the parent node, whereas I need to check all the children of the child nodes as well, and I don't quite understand how to iterate through all of the elements of the XML file.
Modify your code to make use of recursion properly. You need to ALWAYS recurse, not only if a tag has the name you are looking for, because the children still might have the name you are looking for. Also, you need to add the result of the recursive call to the sum. Something like this:
private static int tagCount(XMLTree xml, String tag) {
    assert xml != null : "Violation of: xml is not null";
    assert tag != null : "Violation of: tag is not null";
    int count = 0;
    if (xml.isTag()) {
        for (int i = 0; i < xml.numberOfChildren(); i++) {
            if (xml.child(i).label().equals(tag)) {
                count++;
            }
            count = count + tagCount(xml.child(i), tag);
        }
    }
    return count;
}
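For readers who do not have the XMLTree class, the same always-recurse idea can be sketched with the standard org.w3c.dom API. This is a swapped-in illustration, not the code from the question: the class name DomTagCounter and the string-based helper are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DomTagCounter {
    // Counts element nodes named tag among all descendants of node.
    static int tagCount(Node node, String tag) {
        int count = 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && child.getNodeName().equals(tag)) {
                count++;
            }
            // Always recurse, whether or not this child matched.
            count += tagCount(child, tag);
        }
        return count;
    }

    // Convenience wrapper (hypothetical): parses an XML string and counts from the document node,
    // so the root element itself is included in the count.
    static int tagCount(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return tagCount(doc, tag);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Starting from the document node rather than the root element means the root is treated as a child like any other, which avoids a special case for it.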

Read a portion of ArrayList for n times of lines?

If you have an HTML page stored in a String ArrayList and you want to, for example, read the whole <div> tag of a certain class, how do you read the following lines until the end of the div tag is reached?
for (String l : line) {
    if (l.contains("<div class=\"somne_class\">")) {
        //read the next n strings in ArrayList until </div> tag is reached
    }
}
Generally, it's a bad idea to store an HTML file as a list of raw strings. Why do you store it that way?
Imagine you have a string like <div id="outer_div"><div id="inner_div">Hei!</div></div>. Here you have multiple nested HTML tags in a single line, so you won't easily find the matching closing tag.
Consider using an HTML parser; then you can get the desired tag(s) by type or attribute. There are plenty of HTML parsers implemented in Java. One of the most popular is jsoup.
I agree with Vladimir, you're probably looking for an HTML parser.
To answer the exact question in the post: to simply find the next </div> tag, you can use a for loop instead of a foreach loop.
for (int i = 0; i < line.size(); ++i) {
    String l = line.get(i);
    if (l.contains("<div class=\"somne_class\">")) {
        for (int j = i; j < line.size(); ++j) {
            String l2 = line.get(j);
            if (l2.contains("</div>")) {
                // l2 is the next line that contains a </div> tag
                break; // stop at the first one found
            }
        }
    }
}
Note that this might not be the matching closing tag for the opening tag, even if you assume that every tag is in a different line.
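To get closer to the matching closing tag without a parser, one option is to track the nesting depth of <div> tags while reading lines. The sketch below is only a rough heuristic under stated assumptions: it counts the literal substrings "<div" and "</div>", so it ignores comments, scripts, and unusual spacing, and the names DivBlockReader and readDivBlock are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class DivBlockReader {
    // Counts non-overlapping occurrences of token in text.
    static int count(String text, String token) {
        int n = 0;
        int from = 0;
        while ((from = text.indexOf(token, from)) != -1) {
            n++;
            from += token.length();
        }
        return n;
    }

    // Returns the lines from the one containing openingTag up to the line
    // where the div nesting depth drops back to zero.
    static List<String> readDivBlock(List<String> lines, String openingTag) {
        List<String> block = new ArrayList<>();
        int depth = 0;
        boolean inside = false;
        for (String line : lines) {
            String scanned = line;
            if (!inside) {
                int start = line.indexOf(openingTag);
                if (start < 0) {
                    continue; // still looking for the opening tag
                }
                inside = true;
                scanned = line.substring(start); // ignore anything before the opening tag
            }
            block.add(line);
            depth += count(scanned, "<div") - count(scanned, "</div>");
            if (depth <= 0) {
                break; // the opening div has been closed
            }
        }
        return block;
    }
}
```

An empty list is returned when the opening tag never appears. For anything beyond simple, regular markup an HTML parser remains the safer choice.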
I recommend using jsoup.
It is nice for parsing and writing HTML files. Although I haven't
dug into it much yet, here is an example of getting all the elements
with the tag div:
Document htmlFile = null;
// Read the html file
try {
    htmlFile = Jsoup.parse(new File("path"),"UTF-8");//path,encoding
} catch (IOException e) {
    e.printStackTrace();
}
Elements divs = htmlFile.getElementsByTag("div");
You can do much more; see the jsoup documentation.

Extracting Table Data with JSoup on Yahoo Finance

Trying to practice extracting data from tables using JSoup. Can't figure out why I can't pull the "Shares Outstanding" field from
https://finance.yahoo.com/q/ks?s=AAPL+Key+Statistics
Here's two attempts where 's' is AAPL:
public class YahooStatistics {
    String sharesOutstanding = "Shares Outstanding:";
    public YahooStatistics(String s) {
        String keyStatisticsURL = ("https://finance.yahoo.com/q/ks?s="+s+"+Key+Statistics");
        //Attempt 1
        try {
            Document doc = Jsoup.connect(keyStatisticsURL).get();
            for (Element table : doc.select("table.yfnc_datamodoutline1")) {
                for (Element row : table.select("tr")) {
                    Elements tds = row.select("td");
                    for (Element td : tds.select(sharesOutstanding)) {
                        System.out.println(td.ownText());
                    }
                }
            }
        }
        catch (IOException ex) {
            ex.printStackTrace();
        }
        //Attempt 2
        try {
            Document doc = Jsoup.connect(keyStatisticsURL).get();
            for (Element table : doc.select("table.yfnc_datamodoutline1")) {
                for (Element row : table.select("tr")) {
                    Elements tds = row.select("td");
                    for (int j = 0; j < tds.size() - 1; j++) {
                        Element td = tds.get(j);
                        if ((td.ownText()).equals(sharesOutstanding)) {
                            System.out.println(tds.get(j+1).ownText());
                        }
                    }
                }
            }
        }
        catch(IOException ex) {
            ex.printStackTrace();
        }
    }
}
The attempts return: BUILD SUCCESSFUL and nothing else.
I've disabled JavaScript on my browser and the table still shows, so I'm assuming this is not written in JavaScript but HTML.
Any suggestions are appreciated.
Notes about your source after the edit:
1. You should compare ownText() rather than text(). text() gives you the combined text of the element and all its sub-elements. In this case the element contains Shares Outstanding<font size="-1"><sup>5</sup></font>:, so its combined text is "Shares Outstanding5:". If you use ownText() it will just be "Shares Outstanding:".
2. Note the colon (:). Update the value in sharesOutstanding accordingly.
3. You are passing it the wrong URL. There should be a + following the AAPL.
4. Your current query (at least the second attempt) is returning the element twice, because there is a nested table, so it finds the TDs twice.
You can either break from your loops once you found a match, go back to your original version (with corrections as above) - see note - or you can try using a more sophisticated query which will only match once:
Elements elems = doc.select("td.yfnc_tablehead1:containsOwn("+sharesOutstanding+") + td.yfnc_tabledata1");
if ( ! elems.isEmpty() ) {
    System.out.println( elems.get(0).ownText() );
}
This selector gives you all the td elements whose class is yfnc_tabledata1, whose immediate preceding sibling is a td element whose class is yfnc_tablehead1 and whose own text contains the "Shares Outstanding:" string. This should basically select the exact TD you need.
Note: the previous version of this answer was a long rattle about the difference between Elements.select() and Element.select(). It turns out that I was dead wrong and your original version should have worked, if you had corrected the four points above. So to set the record straight: select() on an Elements actually does look inside each element, and the resulting list may contain descendants of any of the elements in the original list that match the selection. Sorry about that.

Multiple arguments for !somearray.contains

Is it possible to have multiple arguments for a .contains? I am searching an array to ensure that each string contains one of several characters. I've hunted all over the web, but found nothing useful.
for(String s : fileContents) {
if(!s.contains(syntax1) && !s.contains(syntax2)) {
found.add(s);
}
}
for (String s : found) {
System.out.println(s); // print array to cmd
JOptionPane.showMessageDialog(null, "Note: Syntax errors found.");
}
How can I do this with multiple arguments? I've also tried a bunch of ||s on their own, but that doesn't seem to work either.
No, it can't have multiple arguments, but the || should work.
!s.contains(syntax1+"") || !s.contains(syntax2+"") means s doesn't contain syntax1 or it doesn't contain syntax2.
This is just a guess but you might want s contains either of the two:
s.contains(syntax1+"") || s.contains(syntax2+"")
or maybe s contains both:
s.contains(syntax1+"") && s.contains(syntax2+"")
or maybe s contains neither of the two:
!s.contains(syntax1+"") && !s.contains(syntax2+"")
If syntax1 and syntax2 are already strings, you don't need the +""'s.
I believe s.contains("") should always return true, so you can remove it.
It seems that what you described can be done with a regular expression.
In a regular expression, the operator | means that one of several alternatives must match.
For example, the regex (a|b) means a or b.
The regex ".*(a|b).*" means a string that contains a or b, and other than that anything is allowed (it assumes a single-line string, but that can be dealt with easily as well if needed).
Code example:
String s = "abc";
System.out.println(s.matches(".*(a|d).*")); // prints true
s = "abcd";
System.out.println(s.matches(".*(a|d).*")); // prints true
s = "fgh";
System.out.println(s.matches(".*(a|d).*")); // prints false
Regular expressions are a powerful tool that I recommend learning. Have a look at a regular expression tutorial; you might find it helpful.
There is no such thing as a multiple contains.
If you need to validate that every string in a list is included in some other string, you must iterate through them all and check.
public static boolean containsAll(String input, String... items) {
    if(input == null) throw new IllegalArgumentException("Input must not be null"); // We validate the input
    if(input.length() == 0) {
        return items.length == 0; // an empty input is only accepted when there is nothing to look for
    }
    boolean result = true;
    for(String item : items) {
        result = result && input.contains(item);
    }
    return result;
}
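The question itself asks about strings that contain one of several markers, which is the "any" case rather than the "all" case. A minimal sketch of that variant follows; the class name MultiContains and the helper names are assumptions made for the example, not part of any library.

```java
class MultiContains {
    // True if input contains at least one of the given items.
    static boolean containsAny(String input, String... items) {
        for (String item : items) {
            if (input.contains(item)) {
                return true; // stop at the first match
            }
        }
        return false;
    }

    // True if input contains none of the given items.
    static boolean containsNone(String input, String... items) {
        return !containsAny(input, items);
    }
}
```

With helpers like these, the loop in the question could read if (MultiContains.containsNone(s, syntax1, syntax2)) found.add(s);, assuming syntax1 and syntax2 are strings.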

How can I determine if a HTML document is well formed or not in JAVA?

Heyy guys, I need to determine if a given HTML Document is well formed or not.
I just need a simple implementation using only Java core API classes i.e. no third party stuff like JTIDY or something. Thanks.
Actually, what is exactly needed is an algorithm that scans a list of TAGS. If it finds an open tag, and the next tag isn't its corresponding close tag, then it should be another open tag which in turn should have its close tag as the next tag, and if not it should be another open tag and then its corresponding close tag next, and the close tags of the previous open tags in reverse order coming next on the list. I've already written methods to convert a tag to a close tag. If the list conforms to this order then it returns true or else false.
Here is the skeleton code of what I've started working on already. It's not too neat, but it should give you guys a basic idea of what I'm trying to do.
public boolean validateHtml(){
ArrayList<String> tags = fetchTags();
//fetchTags returns this [<html>, <head>, <title>, </title>, </head>, <body>, <h1>, </h1>, </body>, </html>]
//I create another ArrayList to store tags that I haven't found its corresponding close tag yet
ArrayList<String> unclosedTags = new ArrayList<String>();
String temp;
for (int i = 0; i < tags.size(); i++) {
temp = tags.get(i);
if(!tags.get(i+1).equals(TagOperations.convertToCloseTag(tags.get(i)))){
unclosedTags.add(tags.get(i));
if(){
}
}else{
return true;//well formed html
}
}
return true;
}
Yeah, string manipulation can seem like a pickle sometimes.
You need to do something like the following.
First copy the HTML into a char array (for example with html.toCharArray()), then:
// Needs java.util.List and java.util.ArrayList
boolean tag = false;
StringBuilder str = new StringBuilder();
List<String> htmlTags = new ArrayList<>();
for (int i = 0; i < array.length; i++) {
    // Check for the start of a tag
    if (array[i] == '<') {
        tag = true;
    }
    // If the current char is part of a tag, keep copying it
    if (tag) {
        str.append(array[i]);
    }
    // When a tag ends, add the tag to your tag list
    if (tag && array[i] == '>') {
        htmlTags.add(str.toString());
        str.setLength(0);
        tag = false;
    }
}
Something like this should get you started; you should end up with a list of tags. This is only a rough fragment, not a complete program.
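Once the tags are in a list, the ordering rule described in the question (every close tag must match the most recently opened tag that is still unclosed) can be sketched with a stack. This is an illustration under strong assumptions: every entry looks like <name> or </name>, with no attributes, no self-closing or void elements such as <br>, and no comments or doctype. The class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class TagBalanceChecker {
    // Returns true if every open tag is closed in the reverse order of opening.
    static boolean isBalanced(List<String> tags) {
        Deque<String> open = new ArrayDeque<>();
        for (String tag : tags) {
            if (tag.startsWith("</")) {
                // Close tag: it must match the most recently opened tag.
                String name = tag.substring(2, tag.length() - 1);
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;
                }
            } else {
                // Open tag: remember its name until the matching close tag arrives.
                open.push(tag.substring(1, tag.length() - 1));
            }
        }
        return open.isEmpty(); // leftover names mean unclosed tags
    }
}
```

Passing the list shown in the question's comment ([<html>, <head>, <title>, </title>, ...]) would return true, while a list with crossed tags such as [<a>, <b>, </a>, </b>] would return false.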
I don't think you can do this without a huge amount of work; it would be much easier to use a third-party package.
Try validating against the HTML 4, HTML 4.01, or XHTML 1 DTDs:
"strict.dtd"
"loose.dtd"
"frameset.dtd"
That might help!
