Read JSP page and write HTML file UTF-8 issues - java

I want to read a JSP page and write it to an HTML file. I have three methods in my parse class: first readHTMLBody(), second WriteNewHTML(), third ZipToEpub().
When I call these methods from the parse class itself, they all work. But when they are called from a JSP or a web service, UTF-8 characters come out as "?" in readHTMLBody(). How can I fix it?
public String readHTMLBody() {
    try {
        String url = "http://localhost:8080/Library/part.jsp";
        // parse the stream explicitly as UTF-8
        Document doc = Jsoup.parse((new URL(url)).openStream(), "utf-8", url);
        String body = doc.html();
        Elements title = doc.select("xxx");
        // linkURI and resultBody are fields of this class
        linkURI = title.toString();
        linkURI = linkURI.replaceAll("<xxx>", "");
        linkURI = linkURI.replaceAll("</xxx>", "");
        linkURI = linkURI.replaceAll("\\s", "");
        resultBody = body;
        resultBody = resultBody.replaceAll("part/" + linkURI + "/assets/", "assets/");
    } catch (IOException e) {
        e.printStackTrace(); // don't swallow the exception silently
    }
    return resultBody;
}
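Since readHTMLBody() already forces UTF-8 on the read side, the "?" characters are most likely introduced on the write side: FileWriter and similar classes use the platform default charset, which inside a servlet container is often not UTF-8. A minimal sketch of what WriteNewHTML() could look like with the output encoding pinned to UTF-8 (the file name and the resultBody field are assumptions, since that method isn't shown):

import java.io.*;
import java.nio.charset.StandardCharsets;

public void WriteNewHTML() {
    // write with an explicit UTF-8 encoder instead of FileWriter's default charset
    try (Writer out = new OutputStreamWriter(
            new FileOutputStream("part.html"), StandardCharsets.UTF_8)) {
        out.write(resultBody); // assumed field, produced by readHTMLBody()
    } catch (IOException e) {
        e.printStackTrace();
    }
}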

Related

java regex to retrieve link from text

I have an input String:
String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it";
I want to convert this text to:
Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it
So here:
1) I want to replace the link tag with the plain link. If the tag contains a label, it should go in parentheses after the URL.
2) If the URL is relative, I want to prefix the base URL (http://www.google.com).
3) I want to append a parameter to the URL. (&myParam=pqr)
I am having issues retrieving the tag with URL and label, and replacing it.
I wrote something like:
public static void main(String[] args) {
    String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it";
    // undo the HTML entity escaping first (&amp; must go last)
    text = text.replaceAll("&lt;", "<");
    text = text.replaceAll("&gt;", ">");
    text = text.replaceAll("&amp;", "&");
    // this is not working
    Pattern p = Pattern.compile("href=\"(.*?)\"");
    Matcher m = p.matcher(text);
    String url = null;
    if (m.find()) {
        url = m.group(1);
    }
}
// helper method to append new query params once I have the url
public static URI appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
    URI oldUri = new URI(uriToUpdate);
    String newQueryParams = oldUri.getQuery();
    if (newQueryParams == null) {
        newQueryParams = queryParamsToAppend;
    } else {
        newQueryParams += "&" + queryParamsToAppend;
    }
    URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
            oldUri.getPath(), newQueryParams, oldUri.getFragment());
    return newUri;
}
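For example (my own illustration of the helper above):

URI updated = appendQueryParams("http://www.google.com/fruit.cgi?param1=abc", "myParam=pqr");
System.out.println(updated); // http://www.google.com/fruit.cgi?param1=abc&myParam=pqr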
Edit1:
Pattern p = Pattern.compile("HREF=\"(.*?)\"");
This works. But then I want it to be capitalization agnostic. Href, HRef, href, hrEF, etc. all should work.
Also, how do I handle it if my text has several URLs?
Edit2:
Some progress.
Pattern p = Pattern.compile("href=\"(.*?)\"");
Matcher m = p.matcher(text);
String url = null;
while (m.find()) {
    url = m.group(1);
    System.out.println(url);
}
This handles the case of multiple URLs.
The last pending issue is: how do I get hold of the label and replace the href tags in the original text with URL and label?
Edit3:
By multiple URL cases, I mean there are multiple URLs present in the given text.
String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it and another link <A HREF=\"/relative-path/vegetables.cgi?param1=abc&param2=xyz\">URL2 Label</A> and some more text";
Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(text);
String url = null;
while (m.find()) {
    url = m.group(1); // this variable should contain the link URL
    url = appendBaseURI(url);
    url = appendQueryParams(url, "license=ABCXYZ");
    System.out.println(url);
}
public static void main(String[] args) throws URISyntaxException {
    String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it and another link <A HREF=\"/relative-path/vegetables.cgi?param1=abc&param2=xyz\">URL2 Label</A> and some more text";
    text = StringEscapeUtils.unescapeHtml4(text);
    // group(1) = URL, group(2) = label
    Pattern p = Pattern.compile("<a href=\"(.*?)\">(.*?)</a>", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(text);
    while (m.find()) {
        text = text.replace(m.group(0), cleanUrlPart(m.group(1), m.group(2)));
    }
    System.out.println(text);
}

private static String cleanUrlPart(String url, String label) throws URISyntaxException {
    if (!url.startsWith("http") && !url.startsWith("www")) {
        if (url.startsWith("/")) {
            url = "http://www.google.com" + url;
        } else {
            url = "http://www.google.com/" + url;
        }
    }
    url = appendQueryParams(url, "myParam=pqr").toString();
    if (label != null && !label.isEmpty()) url += " (" + label + ")";
    return url;
}
Output
Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it and another link http://www.google.com/relative-path/vegetables.cgi?param1=abc&param2=xyz&myParam=pqr (URL2 Label) and some more text
You can use Apache Commons Text's StringEscapeUtils to decode the HTML entities and then replaceAll, i.e.:
import org.apache.commons.text.StringEscapeUtils;
String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it";
String output = StringEscapeUtils.unescapeHtml4(text).replaceAll("([^<]+).+\"(.*?)\">(.*?)<[^>]+>(.*)", "$1https://google.com$2&your_param ($3)$4");
System.out.print(output);
// Some content which contains link as https://google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&your_param (URL Label) and some text after it
// this is not working
Because your regex is case-sensitive.
Try:
Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
Edit1:
To get the label, use Pattern.compile("(?<=>).*?(?=</a>)", Pattern.CASE_INSENSITIVE) and m.group(0).
Edit2:
To replace the tag (including label) with your final string, use:
text.replaceAll("(?i)<a href=\"(.*?)</a>", "new substring here")
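Since each match needs a different replacement (the cleaned-up URL plus its label), a fixed replacement string won't be enough; Matcher.appendReplacement handles per-match replacements. A sketch of that pattern, reusing the cleanUrlPart helper from above (assuming the surrounding method declares throws URISyntaxException):

Pattern p = Pattern.compile("<a href=\"(.*?)\">(.*?)</a>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(text);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    // quoteReplacement protects any '$' or '\' in the computed replacement
    m.appendReplacement(sb, Matcher.quoteReplacement(cleanUrlPart(m.group(1), m.group(2))));
}
m.appendTail(sb);
text = sb.toString();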
Almost there:
public static void main(String[] args) throws URISyntaxException {
    String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it and another link <A HREF=\"/relative-path/vegetables.cgi?param1=abc&param2=xyz\">URL2 Label</A> and some more text";
    text = StringEscapeUtils.unescapeHtml4(text);
    System.out.println(text);
    System.out.println("**************************************");
    Pattern patternTag = Pattern.compile("<a([^>]+)>(.+?)</a>", Pattern.CASE_INSENSITIVE);
    Pattern patternLink = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
    Matcher matcherTag = patternTag.matcher(text);
    while (matcherTag.find()) {
        String href = matcherTag.group(1); // the tag's attributes, e.g. HREF="..."
        String linkText = matcherTag.group(2); // link text (label)
        System.out.println("Href: " + href);
        System.out.println("Label: " + linkText);
        Matcher matcherLink = patternLink.matcher(href);
        String finalText = null;
        while (matcherLink.find()) {
            String link = matcherLink.group(1);
            System.out.println("Link: " + link);
            finalText = getFinalText(link, linkText);
            break;
        }
        System.out.println("***************************************");
        // replacing logic goes here
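        // one way to finish this (my sketch, not from the original post):
        // group(0) is the entire <a ...>label</a> match, so replace it literally
        if (finalText != null) {
            text = text.replace(matcherTag.group(0), finalText);
        }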
    }
    System.out.println(text);
}
public static String getFinalText(String link, String label) throws URISyntaxException {
    link = appendBaseURI(link);
    link = appendQueryParams(link, "myParam=ABCXYZ");
    return link + " (" + label + ")";
}

public static String appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
    URI oldUri = new URI(uriToUpdate);
    String newQueryParams = oldUri.getQuery();
    if (newQueryParams == null) {
        newQueryParams = queryParamsToAppend;
    } else {
        newQueryParams += "&" + queryParamsToAppend;
    }
    URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
            oldUri.getPath(), newQueryParams, oldUri.getFragment());
    return newUri.toString();
}

public static String appendBaseURI(String url) {
    String baseURI = "http://www.google.com/";
    if (url.startsWith("/")) {
        url = url.substring(1);
    }
    if (url.startsWith(baseURI)) {
        return url;
    } else {
        return baseURI + url;
    }
}

How to correctly parse HTML in Java

I'm trying to extract information from websites using Jsoup but I don't get the same HTML code as in my browser.
I tried to use .userAgent(), but it didn't work. I currently use the following function, which works for Amazon.com:
public static String getHTML(String urlToRead) throws Exception {
    StringBuilder result = new StringBuilder();
    URL url = new URL(urlToRead);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:19.0) Gecko/20100101 Firefox/19.0");
    conn.setRequestMethod("GET");
    BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
    String line;
    while ((line = rd.readLine()) != null) {
        result.append(line);
    }
    rd.close();
    return result.toString();
}
The website I'm trying to parse is http://www.asos.com/ but the price of the product is always missing.
I found this topic, which is pretty close to mine, but I would like to do it using only Java and no external app.
So after a little playing around with the site, I came up with a solution.
The site uses API responses to get the prices for each item; this is why you are not getting the prices in the HTML you receive from Jsoup. Unfortunately there's a little more code than first expected, and you'll have to work out how the program should know which product ID to use instead of the hardcoded value. Other than that, the following code should work in your case.
I've included comments that hopefully explain each step, and I recommend taking a look at the API response, as there may be other data you require; in fact, the same may be true of the product details and description, as further data will need to be parsed out of the elementById field.
Good luck, and let me know if you need any further help!
import org.json.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

import java.io.IOException;

public class Main
{
    final String productID = "8513070";
    final String productURL = "http://www.asos.com/prd/";
    final Product product = new Product();

    public static void main( String[] args )
    {
        new Main();
    }

    private Main()
    {
        getProductDetails( productURL, productID );
        System.out.println( "ID: " + product.productID + ", Name: " + product.productName + ", Price: " + product.productPrice );
    }

    private void getProductDetails( String url, String productID )
    {
        try
        {
            // Append the product url and the product id to retrieve the product HTML
            final String appendedURL = url + productID;
            // Using Jsoup we'll connect to the url and get the HTML
            Document document = Jsoup.connect( appendedURL ).get();
            // We parse the HTML only looking for the product section
            Element elementById = document.getElementById( "asos-product" );
            // To simply get the title we look for the H1 tag
            Elements h1 = elementById.getElementsByTag( "h1" );
            // Because more than one H1 tag is returned we only want the tag that isn't empty
            if ( !h1.text().isEmpty() )
            {
                // Add all data to Product object
                product.productID = productID;
                product.productName = h1.text().trim();
                product.productPrice = getProductPrice( productID );
            }
        }
        catch ( IOException e )
        {
            e.printStackTrace();
        }
    }

    private String getProductPrice( String productID )
    {
        try
        {
            // Append the api url and the product id to retrieve the product price JSON document
            final String apiURL = "http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=" + productID + "&store=COM";
            // Using Jsoup again we connect to the URL ignoring the content type and retrieve the body
            String jsonDoc = Jsoup.connect( apiURL ).ignoreContentType( true ).execute().body();
            // As it's JSON we want to parse the JSONArray until we get to the current price and return it
            JSONArray jsonArray = new JSONArray( jsonDoc );
            JSONObject currentProductPriceObj = jsonArray
                    .getJSONObject( 0 )
                    .getJSONObject( "productPrice" )
                    .getJSONObject( "current" );
            return currentProductPriceObj.getString( "text" );
        }
        catch ( IOException e )
        {
            e.printStackTrace();
        }
        return "";
    }

    // Simple Product object to store the data
    class Product
    {
        String productID;
        String productName;
        String productPrice;
    }
}
Oh, and you'll also need org.json to parse the JSON response from the API.
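If you build with Maven, pulling in both libraries looks something like this (the version numbers are illustrative; use whatever is current):

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
<dependency>
    <groupId>org.json</groupId>
    <artifactId>json</artifactId>
    <version>20180813</version>
</dependency>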

Java convert string into url title characters only [duplicate]

How do you encode a URL in Android?
I thought it was like this:
final String encodedURL = URLEncoder.encode(urlAsString, "UTF-8");
URL url = new URL(encodedURL);
If I do the above, the http:// in urlAsString is replaced by http%3A%2F%2F in encodedURL and then I get a java.net.MalformedURLException when I use the URL.
You don't encode the entire URL, only parts of it that come from "unreliable sources".
Java:
String query = URLEncoder.encode("apples oranges", StandardCharsets.UTF_8.name());
String url = "http://stackoverflow.com/search?q=" + query;
Kotlin:
val query: String = URLEncoder.encode("apples oranges", Charsets.UTF_8.name())
val url = "http://stackoverflow.com/search?q=$query"
Alternatively, you can use Strings.urlEncode(String str) of DroidParts that doesn't throw checked exceptions.
Or use something like
String uri = Uri.parse("http://...")
        .buildUpon()
        .appendQueryParameter("key", "val")
        .build().toString();
I'm going to add one suggestion here: you can do this in a way that avoids having to pull in any external libraries.
Give this a try:
String urlStr = "http://abc.dev.domain.com/0007AC/ads/800x480 15sec h.264.mp4";
URL url = new URL(urlStr);
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
url = uri.toURL();
You can see that in this particular URL, I need to have those spaces encoded so that I can use it for a request.
This takes advantage of a couple features available to you in Android classes. First, the URL class can break a url into its proper components so there is no need for you to do any string search/replace work. Secondly, this approach takes advantage of the URI class feature of properly escaping components when you construct a URI via components rather than from a single string.
The beauty of this approach is that you can take any valid url string and have it work without needing any special knowledge of it yourself.
For Android, I would use
String android.net.Uri.encode(String s)
Encodes characters in the given string as '%'-escaped octets using the UTF-8 scheme. Leaves letters ("A-Z", "a-z"), numbers ("0-9"), and unreserved characters ("_-!.~'()*") intact. Encodes all other characters.
E.g.:
String urlEncoded = "http://stackoverflow.com/search?q=" + Uri.encode(query);
You can also use this:
private static final String ALLOWED_URI_CHARS = "@#&=*+-_.,:!?()/~'%";
String urlEncoded = Uri.encode(path, ALLOWED_URI_CHARS);
It's the simplest method:

try {
    query = URLEncoder.encode(query, "utf-8");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
You can use the methods below:
public static String parseUrl(String surl) throws Exception
{
    URL u = new URL(surl);
    return new URI(u.getProtocol(), u.getAuthority(), u.getPath(), u.getQuery(), u.getRef()).toString();
}
or
public String parseURL(String url, Map<String, String> params)
{
    Builder builder = Uri.parse(url).buildUpon();
    for (String key : params.keySet())
    {
        builder.appendQueryParameter(key, params.get(key));
    }
    return builder.build().toString();
}
The second one is better than the first, since Uri.Builder.appendQueryParameter encodes each key and value for you.
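For example (a hypothetical call, assuming the parseURL helper above; needs java.util.Map and java.util.HashMap):

Map<String, String> params = new HashMap<>();
params.put("q", "apples oranges");
String url = parseURL("http://stackoverflow.com/search", params);
// url -> http://stackoverflow.com/search?q=apples%20oranges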
Find Arabic characters and replace them with their UTF-8 encoding. Something like this:

// URLEncoder.encode throws UnsupportedEncodingException, so wrap this in a try/catch or declare it
for (int i = 0; i < urlAsString.length(); i++) {
    if (urlAsString.charAt(i) > 255) { // anything outside Latin-1, e.g. Arabic
        urlAsString = urlAsString.substring(0, i) + URLEncoder.encode(urlAsString.charAt(i) + "", "UTF-8") + urlAsString.substring(i + 1);
    }
}
encodedURL = urlAsString;

parse html from a web page which uses infinite scroll

I would like to parse HTML from a web page which uses infinite scroll, such as pinterest.com, so as to get all items.
public List<String> popularTagsPinterest(String tag) throws Exception {
    List<String> results = new ArrayList<>();
    try {
        Document doc = Jsoup.connect(
                urlPinterest + tag + "&eq=%23" + tag + "&etslf=6622&term_meta[]=%23" + tag + "%7Cautocomplete%7C0")
                .timeout(90000).get();
        Elements img1 = doc.select("a.pinImageWrapper img.pinImg");
        for (Element e : img1) {
            results.add(e.attr("src"));
            System.out.println(e.attr("src"));
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return results;
}
Getting the base URL and replaying the AJAX call that loads the next batch of items will do the trick.
Check this page, it is a good example:
https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
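In Jsoup terms, that means finding the JSON endpoint the page calls as you scroll (visible in the browser's network tab) and paging through it yourself. A minimal sketch, assuming a hypothetical endpoint that takes a page parameter:

import org.jsoup.Jsoup;

public class InfiniteScrollFetcher {
    public static void main(String[] args) throws Exception {
        // hypothetical AJAX endpoint discovered in the browser's network tab
        String endpoint = "https://example.com/api/items?page=";
        for (int page = 1; page <= 5; page++) {
            // the endpoint returns JSON, so tell Jsoup to ignore the content type
            String json = Jsoup.connect(endpoint + page)
                    .ignoreContentType(true)
                    .execute()
                    .body();
            System.out.println(json); // parse with org.json, Gson, etc.
        }
    }
}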

Save file from a website with java

I'm trying to build a Jsoup-based Java app to automatically download English subtitles for films (I'm lazy, I know; it was inspired by a similar Python-based app). It's supposed to ask you the name of the film and then download an English subtitle for it from Subscene.
I can make it reach the download link, but I get an Unhandled content type error when I try to 'go' to that link. Here's my code:
public static void main(String[] args) {
    try {
        String videoName = JOptionPane.showInputDialog("Title: ");
        subscene(videoName);
    } catch (Exception e) {
        System.out.println(e.getMessage());
    }
}

public static void subscene(String videoName) {
    try {
        String siteName = "http://www.subscene.com";
        String[] splits = videoName.split("\\s+");
        String codeName = "";
        String text = "";
        if (splits.length > 1) {
            for (int i = 0; i < splits.length; i++) {
                codeName = codeName + splits[i] + "-";
            }
            // codeName is one char longer than videoName, so this drops the trailing '-'
            videoName = codeName.substring(0, videoName.length());
        }
        System.out.println("videoName is " + videoName);
        // String url = "http://www.subscene.com/subtitles/"+videoName+"/english";
        String url = "http://www.subscene.com/subtitles/title?q=" + videoName + "&l=";
        System.out.println("url is " + url);
        Document doc = Jsoup.connect(url).get();
        Element exact = doc.select("h2.exact").first();
        Element yuel = exact.nextElementSibling();
        Elements lis = yuel.children();
        System.out.println(lis.first().children().text());
        String hRef = lis.select("div.title > a").attr("href");
        hRef = siteName + hRef + "/english";
        System.out.println("hRef is " + hRef);
        doc = Jsoup.connect(hRef).get();
        Element nonHI = doc.select("td.a40").first();
        Element papa = nonHI.parent();
        Element link = papa.select("a").first();
        text = link.text();
        System.out.println("Subtitle is " + text);
        hRef = link.attr("href");
        hRef = siteName + hRef;
        Document subDownloadPage = Jsoup.connect(hRef).get();
        hRef = siteName + subDownloadPage.select("a#downloadButton").attr("href");
        Jsoup.connect(hRef).get(); // <-- Here's where the problem lies
    } catch (java.io.IOException e) {
        System.out.println(e.getMessage());
    }
}
Can someone please help me so I don't have to manually download subs?
I just found out that using
java.awt.Desktop.getDesktop().browse(java.net.URI.create(hRef));
instead of
Jsoup.connect(hRef).get();
downloads the file after prompting me to save it. But I don't want to be prompted because this way I won't be able to read the name of the downloaded zip file (I want to unzip it after saving using java).
Assuming that your files are small, you can do it like this. Note that you can tell Jsoup to ignore the content type.
// get the file content
Connection connection = Jsoup.connect(path);
connection.timeout(5000);
Connection.Response resultImageResponse = connection.ignoreContentType(true).execute();

// save to file (try-with-resources closes the stream even if the write fails)
try (FileOutputStream out = new FileOutputStream(localFile)) {
    out.write(resultImageResponse.bodyAsBytes());
}
I would recommend verifying the content before saving, because some servers will just return an HTML page when the file cannot be found, i.e. a broken hyperlink.
...
String body = resultImageResponse.body();
if (body == null || body.toLowerCase().contains("<body>")) {
    throw new IllegalStateException("invalid file content");
}
...
Here:
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
//specifically here
Jsoup.connect(hRef).get();
It looks like Jsoup expects the result of Jsoup.connect(hRef) to be HTML or some text that it's able to parse, which is why the message states:
Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml
I followed the execution of your code manually and the last URL you're trying to access returns a content type of application/x-zip-compressed, thus the cause of the exception.
In order to download this file, you should use a different approach. You could use the old but still useful URLConnection or URL, or a third-party library like Apache HttpComponents, to fire a GET request, retrieve the result as an InputStream, and write the bytes to a file on disk.
Here's an example of doing this using URL:

URL url = new URL(hRef);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\foo.zip"));
final int BUFFER_SIZE = 1024 * 4;
byte[] buffer = new byte[BUFFER_SIZE];
BufferedInputStream bis = new BufferedInputStream(in);
int length;
while ((length = bis.read(buffer)) > 0) {
    out.write(buffer, 0, length);
}
out.close();
in.close();
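Since you mentioned wanting to unzip the archive afterwards, here is a minimal sketch using java.util.zip (the zip path matches the example above; it assumes a flat archive, as subtitle zips usually are):

import java.io.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public static void unzip(String zipPath) throws IOException {
    try (ZipInputStream zis = new ZipInputStream(new FileInputStream(zipPath))) {
        ZipEntry entry;
        byte[] buf = new byte[4096];
        while ((entry = zis.getNextEntry()) != null) {
            System.out.println("Extracting " + entry.getName()); // the subtitle file name
            try (FileOutputStream fos = new FileOutputStream(entry.getName())) {
                int n;
                while ((n = zis.read(buf)) > 0) {
                    fos.write(buf, 0, n);
                }
            }
        }
    }
}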
