Read in html table to java - java

I need to pull data from an HTML page using Java code. The Java part is required.
The page I am trying to pull info from is http://www.weather.gov/data/obhistory/KMCI.html.
I need to create a list of HashMaps, or some other kind of data object that I can reference in later code.
This is all I have so far:
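// Requires: import java.io.*; import java.net.URL;
URL weatherDataKC = new URL("http://www.weather.gov/data/obhistory/KMCI.html");
StringBuilder buffer = new StringBuilder();
// Read the page as characters rather than raw bytes (UTF-8 is assumed here)
// and close the stream when done.
try (Reader in = new InputStreamReader(weatherDataKC.openStream(), "UTF-8")) {
    int ch;
    while ((ch = in.read()) != -1) {
        buffer.append((char) ch);
    }
}
System.out.print(buffer);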
Any suggestions where to start?

There is a nice HTML parser called Neko:
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
More information here.

Use an HTML parser like CyberNeko

J2SE includes HTML parsing capabilities in the packages javax.swing.text.html and javax.swing.text.html.parser. HTMLEditorKit.ParserCallback receives events pushed by DocumentParser (which is best used through ParserDelegator). The framework is very similar to the SAX parsers for XML.
Beware, there are some bugs, and it does not handle bad HTML very well.
Handling colspan and rowspan is up to you.
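As a rough illustration of the callback approach described above, here is a minimal sketch that collects table cells into rows and then turns them into a list of maps keyed by the first row. The class name TableCallback, its methods, and the column names in the usage note are made up for this example; it assumes a single, non-nested table with one header row and no colspan or rowspan, which is simpler than the real weather page.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: gathers the text of every <td>/<th> cell, row by row.
class TableCallback extends HTMLEditorKit.ParserCallback {
    private final List<List<String>> rows = new ArrayList<>();
    private List<String> currentRow;
    private StringBuilder currentCell;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TR) {
            currentRow = new ArrayList<>();
        } else if (tag == HTML.Tag.TD || tag == HTML.Tag.TH) {
            currentCell = new StringBuilder();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Only keep text that appears inside a cell.
        if (currentCell != null) {
            currentCell.append(data);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if ((tag == HTML.Tag.TD || tag == HTML.Tag.TH) && currentCell != null) {
            if (currentRow != null) {
                currentRow.add(currentCell.toString().trim());
            }
            currentCell = null;
        } else if (tag == HTML.Tag.TR && currentRow != null) {
            rows.add(currentRow);
            currentRow = null;
        }
    }

    // Parses an HTML string and returns its table rows as lists of cell text.
    static List<List<String>> parse(String html) {
        TableCallback callback = new TableCallback();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.rows;
    }

    // Treats the first row as the header and maps each later row to it.
    static List<Map<String, String>> toMaps(List<List<String>> rows) {
        List<Map<String, String>> result = new ArrayList<>();
        if (rows.isEmpty()) {
            return result;
        }
        List<String> header = rows.get(0);
        for (List<String> row : rows.subList(1, rows.size())) {
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < header.size() && i < row.size(); i++) {
                record.put(header.get(i), row.get(i));
            }
            result.add(record);
        }
        return result;
    }
}
```

With a table whose header row is, say, Date and Temp, TableCallback.toMaps(TableCallback.parse(html)).get(0).get("Temp") would return the Temp cell of the first data row. Pages with several tables or multi-row headers need extra logic to pick the right table and build the column names.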

HTML scraping is notoriously difficult, unless you have a lot of "hooks" like unique IDs. For example, the table you want starts with this HTML:
<table cellspacing="3" cellpadding="2" border="0" width="670">
...which is very generic and may match several tables on the page. The other problem is, what happens if the HTML structure changes? You'll have to redefine all your parsing rules...

Related

Using JAVA, how can I parse .cshtml file and add parameters for the existing C# code in that file

I have some .CSHTML files that were incorrectly generated by a tool. I would like to modify the C# code in them to append additional parameters and remove incorrect parameters from method calls.
I've used JSoup to parse HTML and JSP files, and I can add or remove attributes in them via JSoup DOM iteration.
But the .CSHTML files contain C# code (I'm new to C#), and I can't get at that code through the JSoup library, so I am not able to append parameters to it. For example:
<td>
@Html.Label(Resource.Get("Label_Name"), new Dictionary<string,object>{{ "Class","label"},{ "name","Name"},{ "id","Name"}})
</td>
<td>
@Html.TextBoxFor(m=> m.TextBox1,new Dictionary<string,object>{{ "Class","txtfield controlWidth"},{ "name","TextBox1"},{ "id","TextBox1"}})
</td>
As shown above, the "@Html.xxxx" calls are treated as the text value of the 'tr' tag during JSoup DOM iteration. The only approach I could think of is if..else logic that adds or removes parameters, as in the snippet below. I don't know what the standard way of parsing such a .cshtml file is.
if (str.contains("@Html.")) {
    ctrlType = str.substring(str.indexOf('.') + 1, str.indexOf('('));
    if (ctrlType.equalsIgnoreCase("Label")) {
        // logic to add parameters.
    }
}
Using Java, is there a way to parse the .cshtml file and add or remove parameters in the C# code? Can you suggest an open, standard API for solving this problem?
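For what it is worth, the string-based idea in the question can be made somewhat less brittle by locating the closing parenthesis that actually balances the helper call, instead of relying on indexOf. The sketch below is only an illustration under stated assumptions: helper calls are written with Razor's @Html. prefix, and the class RazorHelperEditor and its methods are invented for this example. It is not a C# parser and does not understand character literals, verbatim strings, or comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: appends an argument to selected @Html.Xxx(...) calls.
class RazorHelperEditor {
    // Matches the start of a helper call and captures the helper name.
    private static final Pattern HELPER = Pattern.compile("@Html\\.(\\w+)\\(");

    static String appendArgument(String source, String helperName, String extraArg) {
        StringBuilder out = new StringBuilder();
        Matcher m = HELPER.matcher(source);
        int copied = 0; // everything before this index is already in 'out'
        int pos = 0;    // where the next search starts
        while (m.find(pos)) {
            int close = findClosingParen(source, m.end() - 1);
            if (close < 0) {
                break; // unbalanced call: leave the rest untouched
            }
            if (m.group(1).equalsIgnoreCase(helperName)) {
                boolean noArgs = source.substring(m.end(), close).trim().isEmpty();
                out.append(source, copied, close);
                out.append(noArgs ? "" : ", ").append(extraArg);
                copied = close;
            }
            pos = close + 1;
        }
        out.append(source.substring(copied));
        return out.toString();
    }

    // Returns the index of the ')' that balances the '(' at 'open', or -1.
    // Parentheses inside double-quoted strings are ignored.
    static int findClosingParen(String s, int open) {
        int depth = 0;
        boolean inString = false;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }
}
```

Calling RazorHelperEditor.appendArgument(text, "Label", "extra") would leave TextBoxFor calls alone and add ", extra" before the closing parenthesis of each Label call. Removing or rewriting individual parameters would need a real split of the argument list, which this sketch does not attempt.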

Java / Android HTML custom tag parser

I'm trying to figure out a way to parse an HTML file with custom tags in the form:
[custom tag="id"]
Here's an example of a file I'm working with:
<p>This is an <em>amazing</em> example. </p>
<p>Such amazement, <span>many wow.</span> </p>
<p>Oh look, a wild [custom tag="amaze"] appears.</p>
We need maor embeds <a href="http://youtu.be/F5nLu232KRo"> bro
What I would like (in an ideal world) is to get back a list of elements:
List foundElements = [text, custom tag, text, link, text]
where the elements in the above list contain:
Text:
<p>This is an <em>amazing</em> example. </p>
<p>Such amazement, <span>many wow.</span> </p>
<p>Oh look, a wild [custom tag="amaze"] appears.</p>
We need maor embeds
Custom tag:
[custom tag="amaze"]
Link:
<a href="http://youtu.be/F5nLu232KRo">
Text:
appears.</p>We need maor embeds
What I've tried:
Jsoup
Jsoup is great, it works perfectly for HTML. The issue is I can't define custom tags with opening "[" and closing "]". Correct me if I'm wrong?
Jericho
Again like Jsoup, Jericho works great..except for defining custom tags. You're required to use "<".
Java Regex
This is the option I really don't want to go for. It's not reliable and there's a lot of string manipulation that is brittle, especially when you're matching against a lot of regexes.
Last but not least, I'm looking for a performance-oriented solution, as this is done on an Android client.
All suggestions welcome!
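One regex-free possibility is a single left-to-right scan that cuts the input at each custom tag or link and emits the text in between. The following is only a sketch: the class CustomTagTokenizer, the Kind enum, and the Token type are invented for illustration, and it assumes custom tags always start with [custom tag=" and end at the next ], and that links start with <a followed by a space and end at the next >.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tokenizer: splits input into text, custom tag, and link tokens.
class CustomTagTokenizer {
    enum Kind { TEXT, CUSTOM_TAG, LINK }

    static final class Token {
        final Kind kind;
        final String value;

        Token(Kind kind, String value) {
            this.kind = kind;
            this.value = value;
        }
    }

    private static final String CUSTOM_OPEN = "[custom tag=\"";
    private static final String LINK_OPEN = "<a ";

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int textStart = 0; // start of the text run not yet emitted
        int i = 0;
        while (i < input.length()) {
            Kind kind = null;
            int end = -1;
            if (input.startsWith(CUSTOM_OPEN, i)) {
                kind = Kind.CUSTOM_TAG;
                end = input.indexOf(']', i);
            } else if (input.startsWith(LINK_OPEN, i)) {
                kind = Kind.LINK;
                end = input.indexOf('>', i);
            }
            if (kind == null || end < 0) {
                i++; // ordinary character, or an unterminated tag kept as text
                continue;
            }
            if (i > textStart) {
                tokens.add(new Token(Kind.TEXT, input.substring(textStart, i)));
            }
            tokens.add(new Token(kind, input.substring(i, end + 1)));
            i = end + 1;
            textStart = i;
        }
        if (textStart < input.length()) {
            tokens.add(new Token(Kind.TEXT, input.substring(textStart)));
        }
        return tokens;
    }
}
```

Each character is visited once and no intermediate documents are built, which should keep the cost modest on a mobile client. The text tokens still contain raw HTML, so they can be handed to Jsoup or Jericho afterwards if they need further parsing.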

Java : HTML Parsing

I have HTML content as given below. The items I am looking for are the "img src" value and the text of the span styled with "!important". Does Java provide any HTML parsing techniques?
<fieldset>
<table cellpadding='0'border='0'cellspacing='0'style="clear :both">
<tr valign='top' ><td width='35' >
<a href='http://mypage.rediff.com/android/32868898'class='space' onmousedown="return
enc(this,'http://track.rediff.com/clickurl=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F3 868898___&service=mypage_feeds&clientip=202.137.232.117&pos=0&feed_id=12942949154d255f839677925642&prc_id=32868898&rowid=2064549114')" >
<div style='width:25px;height:25px;overflow:hidden;'>
<img src='http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb' width='25' vspace='0' /></div></a></td> <td><span>
<a href='http://mypage.rediff.com/android/32868898' class="space" onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.117&pos=0&feed_id=12942949154d255f839677925642&prc_id=32868898&rowid=2064549114')" >Android </a> </span><span style='color:#000000
!important;'>android se updates...</span><div class='divtext'></div></td></tr><tr><td height='5' ></td></tr></table></fieldset><br/>
// Parse the file once with jsoup, then query the resulting document.
Document doc = Jsoup.parse(new File("d:\\1.html"), "UTF-8");
String value = doc.select("img").attr("src");
System.out.println(value); // http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb
System.out.println(doc.select("span[style$=important;]").first().text()); // android se updates...
See JSoup, and the related question "What are the pros and cons of the leading Java HTML parsers?"
Try NekoHtml. This is the HTML parsing library used by various higher-level testing frameworks such as HtmlUnit.
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
I used jsoup. This library has a nice selector syntax (http://jsoup.org/cookbook/extracting-data/selector-syntax), and for your problem you can use code like this:
File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements pngs = doc.select("img[src$=.png]");
I like using Jericho: http://jericho.htmlparser.net/docs/index.html
It tolerates badly formed HTML, links leading to unavailable locations, etc.
There are a lot of examples on their page; you just get all the IMG tags and analyze their attributes to extract those that meet your needs.
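If adding a library is not an option, the parser that ships with the JDK in javax.swing.text.html can also pull out img src values. This is a minimal sketch rather than a recommendation over the libraries above; the class name ImageSourceCollector and its collect method are made up for the example, and the JDK parser is known to be less forgiving of broken markup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: records the src attribute of every <img> element.
class ImageSourceCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> sources = new ArrayList<>();

    // Empty elements such as <img> are reported as "simple" tags.
    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.IMG) {
            Object src = attrs.getAttribute(HTML.Attribute.SRC);
            if (src != null) {
                sources.add(src.toString());
            }
        }
    }

    // Parses an HTML string and returns the src values in document order.
    static List<String> collect(String html) {
        ImageSourceCollector collector = new ImageSourceCollector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.sources;
    }
}
```

ImageSourceCollector.collect(html) returns the src values as plain strings, which can then be filtered, for example by file extension.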

Export HTML table to Excel - using jQuery or Java

I have an HTML table on my JSP page that I want to export to Excel on a button click.
What would be the best way of going about this?
(For example, how would I do this using, say, a jQuery function?)
Any code examples for demo purposes would be great.
I would recommend Apache POI; we've been using it for years and never had any problems.
There are a lot of examples online to get a good start, and the documentation on the site is also good: http://poi.apache.org/spreadsheet/quick-guide.html
Export to CSV format instead. It's supported by Excel as well. Exporting to a full-fledged XLS(X) format using, for example, Apache POI HSSF or JExcelAPI is slow and memory-hungry.
Exporting to CSV is relatively simple. You can find a complete code example in this answer.
As to exporting to files using JavaScript, this is not possible without the help of Flash or the server side. For Flash there is so far only the Downloadify library, which can export to simple txt files. Further, ExtJs seems to have a CSV export library, but I can't find any usable demo page.
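The CSV route mentioned above can be sketched in a few lines of Java. This is not the code from the linked answer; the class CsvExport and its methods are invented here, and the input is assumed to be table data already extracted as a list of rows of strings. Fields containing commas, quotes, or line breaks are quoted, with embedded quotes doubled, which is the convention Excel reads.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: turns rows of cell values into CSV text.
class CsvExport {
    // Quotes a field only when it contains a comma, a quote, or a line break.
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins cells with commas and ends every row with CRLF.
    static String toCsv(List<List<String>> rows) {
        StringBuilder csv = new StringBuilder();
        for (List<String> row : rows) {
            csv.append(row.stream().map(CsvExport::escape).collect(Collectors.joining(",")));
            csv.append("\r\n");
        }
        return csv.toString();
    }
}
```

In a servlet, the resulting string could be written to the response with the content type text/csv and a Content-Disposition header naming the file, so the browser offers it as a download that Excel can open.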
You can parse the table using a library like http://jsoup.org/
After you get the data, you can store it in an Excel-compatible format (CSV), use a Java Excel library such as POI, or use JDBC to write the data into an Excel sheet; see this example:
Password Protected Excel File
I also spent a lot of time converting HTML to Excel; after a lot of R&D, I found the following to be the easiest way.
Create a hidden field and use it to pass your HTML data to your servlet or controller, e.g.:
<form id="formexcel" action="" method="post" name="formexcel">
<input type="hidden" name="exceldata" id="exceldata" value="" />
</form>
On your button or link click, call the following function. It puts your HTML data into document.formexcel.exceldata.value and sets your servlet or controller URL in document.formexcel.action:
function exportDivToExcel() {
    document.formexcel.exceldata.value = $('#htmlbody').html();
    $("#lblReportForPrint").html("Procurement operation review report");
    // Submit the same form that holds the hidden exceldata field.
    document.formexcel.method = 'POST';
    document.formexcel.action = '${pageContext.servletContext.contextPath}/generateexcel';
    document.formexcel.submit();
}
Now, in your controller or servlet, write the following code:
StringBuilder exceldata = new StringBuilder();
exceldata.append(request.getParameter("exceldata"));
ServletOutputStream outputStream = response.getOutputStream();
response.setContentType("application/vnd.ms-excel");
response.setCharacterEncoding("UTF-8");
response.setHeader("Content-Disposition", "attachment;filename=\"exportexcel.xls\"");
// Encode explicitly so the bytes match the UTF-8 charset declared above.
outputStream.write(exceldata.toString().getBytes("UTF-8"));
Excel can load CSV (comma-separated value) files, which are basically just files with everything that would go into separate Excel cells separated by comma.
I don't know enough about how jQuery can handle pushing information into a file that you would download, but it seems a jQuery library has been written that at least transforms html tables to CSV format, and it is here:
http://www.kunalbabre.com/projects/table2CSV.php
Edit (February 29, 2016):
You can use the table2csv implementation above in conjunction with FileSaver.js (which is a wrapper for the HTML5 W3C saveAs() spec).
The usage will end up looking something like:
var resultFromTable2CSV = $('#table-id').table2CSV({delivery:'value'});
var blob = new Blob([resultFromTable2CSV], {type: "text/csv;charset=utf-8"});
saveAs(blob, 'desiredFileName.csv');
Exporting to Excel file format with JQuery is impossible.
You can try with Java. There are a lot of libraries to do that.
You would have to create something on the server-side (like a servlet) to read the html and create the excel file and serve it back to the user.
You could use this library to help you do the transformation.
I can suggest you to try http://code.google.com/p/gwt-table-to-excel/, at least the server part.
I have been using the jQuery plugin table2excel. It works very well, and no server-side coding is needed.
Using it is easy. Simply include jQuery
<script src="http://ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min.js"></script>
Now include the table2excel script (Remember to change the src destination to match yours)
<script src="dist/jquery.table2excel.min.js"></script>
Now simply call the script on the table you want exported.
$("#yourHtmTable").table2excel({
exclude: ".excludeThisClass",
name: "Worksheet Name",
filename: "SomeFile" //do not include extension
});
It's also easy to attach to a button like so:
$("button").click(function(){
$("#table2excel").table2excel({
// exclude CSS class
exclude: ".noExl",
name: "Excel Document Name"
});
});
All examples are taken directly from the author's GitHub page and from jqueryscript.net

Is it possible to extract SCRIPT tags using SiteMesh?

I have custom JSP tags that generate some HTML content, along with some javascript functions that get called by this HTML code. In the current implementation, the SCRIPT tags are created just above the HTML code.
To avoid modifying the existing code base, I want to pull up these scripts inside the HEAD section of the page using SiteMesh or some other decorator tool.
I know SiteMesh can extract content from <content tag="..."> elements, but I was wondering if it was possible also with other tags, such as SCRIPT.
Is this possible with SiteMesh, or do you know of any tools that would allow me to do that?
Thank you!
SiteMesh's HTMLPageParser is extensible, so you can add your own custom rule to extract <script> elements by extending HTMLPageParser and configuring SiteMesh to use your class instead of HTMLPageParser, something like this:
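import com.opensymphony.module.sitemesh.parser.HTMLPageParser;
// State and PageBuilder also need to be imported from the SiteMesh packages.
public class CustomPageParser extends HTMLPageParser {
    protected void addUserDefinedRules(State html, PageBuilder page) {
        super.addUserDefinedRules(html, page);
        // ScriptExtractingRule is a class you write yourself (see below).
        html.addRule(new ScriptExtractingRule(page));
    }
}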
I imagine your ScriptExtractingRule would be modeled after the standard SiteMesh ContentBlockExtractingRule, storing the content in the page context so your decorator can access the blocks as if they were <content> blocks.
