I want to extract full links from an HTML file. By full links I mean absolute links. I used Tika for this purpose. Here is my code:
URL url = new URL("http://www.domainname.com/");
InputStream input = url.openStream();
LinkContentHandler linkHandler = new LinkContentHandler();
ContentHandler textHandler = new BodyContentHandler();
ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
TeeContentHandler teeHandler = new TeeContentHandler(linkHandler,
textHandler, toHTMLHandler);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
HtmlParser parser = new HtmlParser();
parser.parse(input, teeHandler, metadata, parseContext);
System.out.println("title:\n" + metadata.get("title"));
for (Link link : linkHandler.getLinks()) {
System.out.println(link.getUri());
}
This gives me relative URLs like /index.html or documents/US/economicreport.html, but the absolute URL in this case is http://domainname.com/index.html.
How can I get all the links correctly, i.e. the full link including the domain name? How can I do that in Java?
If you have stored the base website URL in url, the following should work:
URL url = new URL("http://www.domainname.com/");
String givenUrl = link.getUri(); // the parsed address from Tika's LinkContentHandler, possibly relative
String absoluteUrl;
if (givenUrl.charAt(0) == '/') {
absoluteUrl = url + givenUrl; // note: this can produce a double slash after the host
} else {
absoluteUrl = givenUrl;
}
Slightly better than the previous, but only slightly, is
URL targetDocumentUrl = new URL("http://www.domainname.com/content.html");
String parsedUrl = link.getUri();
String absoluteLink = new URL(targetDocumentUrl, parsedUrl).toString();
However, it is still not a good solution, as it breaks when the HTML document contains a tag such as
<base href="/">
and the link being parsed is relative and starts with "../".
Of course you can get around this in a number of ways, but they involve a bit of work, such as implementing a custom ContentHandler. I have to think that for something so basic there must be a simpler way to do it with Tika's LinkContentHandler.
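In the meantime, here is a minimal sketch of the resolution step with java.net.URL, reusing the url and linkHandler from the question's code (note that this ignores any <base href> in the document):
// resolve each link returned by Tika's LinkContentHandler against the page URL
for (Link link : linkHandler.getLinks()) {
URL absolute = new URL(url, link.getUri());
System.out.println(absolute);
}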
Related
I have a textbox and a submit button in my JSP page. When the form is submitted with some URL in the textbox, I fetch the response of that URL using URLConnection:
String strUrl = request.getParameter("url");
URL url = new URL(strUrl);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
byte[] encodedBytes = Base64.encodeBase64("root:pass".getBytes());
String encoding = new String(encodedBytes);
connection.setRequestMethod("GET");
connection.connect();
InputStream content = (InputStream) connection.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(content));
try {
FileWriter fWriter = new FileWriter(new File("f:\\new.html"));
BufferedWriter writer = new BufferedWriter(fWriter);
String line;
while ((line = in.readLine()) != null) {
String s = line.toString();
writer.write(s);
}
writer.close();
} catch (Exception e) {
e.printStackTrace();
}
In the resulting HTML page, all the CSS, JS, and images were missing because they point to local (relative) paths.
For example, this is how the JS is referenced in my generated HTML page:
<script src="/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
But the actual src is as follows:
<script src="https://www.url.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
I know that there are many solutions for replacing all src and href values with the URL host; I found many answers related to that.
I used a solution as follows:
if (s.contains(("href="))) {
if (s.contains("\"../") || s.contains("\"/")) {
s = s.replace("\"../", "\"http://" + url.getHost() + "/");
s = s.replace("\"/", "\"http://" + url.getHost() + "/");
writer.write(s);
out.println(s);
}
}
Now I am able to get links, but it is not useful for all websites; it only helps for sites where the src and href values just need the host prefixed.
In some websites, links are defined as href="frmArticles.aspx". In this case it is not enough to prefix the href with the host, because the resolved address is still wrong. For example, the following page has href links that resolve differently from its own URL:
http://www.nakkheeran.in/Users/frmMagazine.aspx?M=2
<a href="frmArticles.aspx?...">தை தை தை</a>
If I add the host to this href, it becomes:
<a href="http://www.nakkheeran.in/frmArticles.aspx?...">தை தை தை</a>
And this page is not available, because the actual URL is:
<a href="http://www.nakkheeran.in/Users/frmArticles.aspx?...">தை தை தை</a>
There are essentially two ways to get the absolute URL:
Using Jsoup's abs:href attribute getter. It works like this:
Element a = myDoc.select("a").first(); //selects the first link on the page; replace with whatever selector you need to get your link (an a element)
String url = a.attr("abs:href"); //gets the absolute url of the link (href attribute)
Note that you need to provide Jsoup with the URL of the HTML document you are using so it can resolve the URL correctly. This is done automatically if you use Jsoup.connect(myHtmlUrl).get(); if you are parsing HTML from a String or from a file, you need to provide it yourself, using the appropriate Jsoup.parse() overload that accepts a base URL.
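For example, a minimal sketch of parsing from a String with a base URL (the HTML snippet and base URL are made up for illustration):
String html = "<a href=\"/index.html\">Home</a>";
Document myDoc = Jsoup.parse(html, "http://www.domainname.com/"); // the base URL used to resolve relative links
String absolute = myDoc.select("a").first().attr("abs:href");
System.out.println(absolute); // prints http://www.domainname.com/index.html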
The other way is with Java's built-in URL class, which is probably what you should use in your case. You can use it like this:
String absoluteUrl = new URL(new URL("http://example.com/example.html"), "script.js").toString();
System.out.println(absoluteUrl);
Which prints:
http://example.com/script.js
To clarify a bit, the first parameter (in this case http://example.com/example.html) is the URL your HTML document is from, and the second parameter ("script.js") is the URL found in your HTML.
In your case, you could use it like:
String absoluteUrl = new URL(new URL("https://www.url.com/"), "/ajax/libs/jquery/2.1.1/jquery.min.js").toString();
System.out.println(absoluteUrl);
Which will print:
https://www.url.com/ajax/libs/jquery/2.1.1/jquery.min.js
The URL class has a constructor URL(URL context, String spec) that does what you tried doing with string replacements.
Edit: In your case the context URL is the source URL of the parsed resource. Let's say you parse something from URL context = new URL("http://example.com/path/to/some.html#where?is+carmen+sandiego"). Then you just take the reference of any link and create a URL ref = new URL(context, src).
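A quick sketch of that resolution with made-up URLs, including a relative path that climbs with "..":
URL context = new URL("http://example.com/path/to/some.html");
System.out.println(new URL(context, "../other.html")); // prints http://example.com/path/other.html
System.out.println(new URL(context, "/ajax/libs/jquery/2.1.1/jquery.min.js")); // prints http://example.com/ajax/libs/jquery/2.1.1/jquery.min.js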
I'm trying to build a jsoup based java app to automatically download English subtitles for films (I'm lazy, I know. It was inspired from a similar python based app). It's supposed to ask you the name of the film and then download an English subtitle for it from subscene.
I can make it reach the download link but I get an Unhandled content type error when I try to 'go' to that link. Here's my code
public static void main(String[] args) {
try {
String videoName = JOptionPane.showInputDialog("Title: ");
subscene(videoName);
}
catch (Exception e) {
System.out.println(e.getMessage());
}
}
public static void subscene(String videoName){
try {
String siteName = "http://www.subscene.com";
String[] splits = videoName.split("\\s+");
String codeName = "";
String text = "";
if(splits.length>1){
for(int i=0;i<splits.length;i++){
codeName = codeName+splits[i]+"-";
}
videoName = codeName.substring(0, videoName.length());
}
System.out.println("videoName is "+videoName);
// String url = "http://www.subscene.com/subtitles/"+videoName+"/english";
String url = "http://www.subscene.com/subtitles/title?q="+videoName+"&l=";
System.out.println("url is "+url);
Document doc = Jsoup.connect(url).get();
Element exact = doc.select("h2.exact").first();
Element yuel = exact.nextElementSibling();
Elements lis = yuel.children();
System.out.println(lis.first().children().text());
String hRef = lis.select("div.title > a").attr("href");
hRef = siteName+hRef+"/english";
System.out.println("hRef is "+hRef);
doc = Jsoup.connect(hRef).get();
Element nonHI = doc.select("td.a40").first();
Element papa = nonHI.parent();
Element link = papa.select("a").first();
text = link.text();
System.out.println("Subtitle is "+text);
hRef = link.attr("href");
hRef = siteName+hRef;
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
Jsoup.connect(hRef).get(); //<-- Here's where the problem lies
}
catch (java.io.IOException e) {
System.out.println(e.getMessage());
}
}
Can someone please help me so I don't have to manually download subs?
I just found out that using
java.awt.Desktop.getDesktop().browse(java.net.URI.create(hRef));
instead of
Jsoup.connect(hRef).get();
downloads the file after prompting me to save it. But I don't want to be prompted because this way I won't be able to read the name of the downloaded zip file (I want to unzip it after saving using java).
Assuming that your files are small, you can do it like this. Note that you can tell Jsoup to ignore the content type.
// get the file content
Connection connection = Jsoup.connect(path);
connection.timeout(5000);
Connection.Response resultImageResponse = connection.ignoreContentType(true).execute();
// save to file
FileOutputStream out = new FileOutputStream(localFile);
out.write(resultImageResponse.bodyAsBytes());
out.close();
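If you also need the name of the zip file (the question mentions wanting to unzip it later), one option is to read it from the Content-Disposition response header; this is only a sketch and assumes the server actually sends that header with a filename:
// e.g. Content-Disposition: attachment; filename=some-subtitle.zip
String disposition = resultImageResponse.header("Content-Disposition");
String zipName = "subtitle.zip"; // fallback if the header is missing
if (disposition != null && disposition.contains("filename=")) {
zipName = disposition.substring(disposition.indexOf("filename=") + "filename=".length()).replace("\"", "").trim();
}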
I would recommend verifying the content before saving, because some servers will just return an HTML page when the file cannot be found, i.e. a broken hyperlink.
...
String body = resultImageResponse.body();
if (body == null || body.toLowerCase().contains("<body>"))
{
throw new IllegalStateException("invalid file content");
}
...
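Since the goal is to unzip the saved file afterwards, here is a minimal sketch using the JDK's java.util.zip (localFile is the file saved above; it assumes the zip entries are plain files with no directory structure):
try (ZipInputStream zip = new ZipInputStream(new FileInputStream(localFile))) {
ZipEntry entry;
while ((entry = zip.getNextEntry()) != null) {
System.out.println("Extracting " + entry.getName());
try (FileOutputStream entryOut = new FileOutputStream(entry.getName())) {
byte[] buffer = new byte[4096];
int read;
while ((read = zip.read(buffer)) > 0) {
entryOut.write(buffer, 0, read);
}
}
}
}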
Here:
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
//specifically here
Jsoup.connect(hRef).get();
It looks like Jsoup expects the result of Jsoup.connect(hRef) to be HTML or some text it is able to parse, which is why the message states:
Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml
I followed the execution of your code manually and the last URL you're trying to access returns a content type of application/x-zip-compressed, thus the cause of the exception.
In order to download this file, you should use a different approach. You could use the old but still useful URLConnection/URL classes, or a third-party library like Apache HttpComponents, to fire a GET request, retrieve the result as an InputStream, and write the file to your disk.
Here's an example about doing this using URL:
URL url = new URL(hRef);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\foo.zip"));
final int BUFFER_SIZE = 1024 * 4;
byte[] buffer = new byte[BUFFER_SIZE];
BufferedInputStream bis = new BufferedInputStream(in);
int length;
while ( (length = bis.read(buffer)) > 0 ) {
out.write(buffer, 0, length);
}
out.close();
in.close();
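On Java 7 and later, the same copy can be written more compactly with java.nio.file.Files (a sketch; the target path is just an example):
URL url = new URL(hRef);
try (InputStream in = url.openStream()) {
Files.copy(in, Paths.get("D:\\foo.zip"), StandardCopyOption.REPLACE_EXISTING);
}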
So I was attempting to use this String as a URL:
http://site-test.com/Meetings/IC/DownloadDocument?meetingId=c21c905c-8359-4bd6-b864-844709e05754&itemId=a4b724d1-282e-4b36-9d16-d619a807ba67&file=\\s604132shvw140\Test-Documents\c21c905c-8359-4bd6-b864-844709e05754_attachments\7e89c3cb-ce53-4a04-a9ee-1a584e157987\myDoc.pdf
In this code:
String fileToDownloadLocation = //The above string
URL fileToDownload = new URL(fileToDownloadLocation);
HttpGet httpget = new HttpGet(fileToDownload.toURI());
But at this point I get the error:
java.net.URISyntaxException: Illegal character in query at index 169:Blahblahblah
A bit of googling made me realise this was due to the characters in the URL (I'm guessing the &), so I added some code so it now looks like this:
String fileToDownloadLocation = //The above string
fileToDownloadLocation = URLEncoder.encode(fileToDownloadLocation, "UTF-8");
URL fileToDownload = new URL(fileToDownloadLocation);
HttpGet httpget = new HttpGet(fileToDownload.toURI());
However, when I try to run this I get an error when creating the URL; the error reads:
java.net.MalformedURLException: no protocol: http%3A%2F%2Fsite-test.testsite.com%2FMeetings%2FIC%2FDownloadDocument%3FmeetingId%3Dc21c905c-8359-4bd6-b864-844709e05754%26itemId%3Da4b724d1-282e-4b36-9d16-d619a807ba67%26file%3D%5C%5Cs604132shvw140%5CTest-Documents%5Cc21c905c-8359-4bd6-b864-844709e05754_attachments%5C7e89c3cb-ce53-4a04-a9ee-1a584e157987%myDoc.pdf
It looks like I can't do the encoding until after I've created the URL, or else it replaces the slashes and other characters it shouldn't, but I can't see how to create the URL from the string and then format it so it is suitable for use. I'm not particularly familiar with all this; can someone point out what I'm missing to turn the string into a suitably formatted URL with the correct characters replaced?
Any suggestions greatly appreciated!
You need to encode your parameter values before concatenating them to the URL.
A backslash \ is a special character which has to be escaped as %5C.
Escaping example:
String paramValue = "param\\with\\backslash";
String yourURLStr = "http://host.com?param=" + java.net.URLEncoder.encode(paramValue, "UTF-8");
java.net.URL url = new java.net.URL(yourURLStr);
The result is http://host.com?param=param%5Cwith%5Cbackslash, which is a properly formatted URL string.
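Applied to the URL from the question, a sketch that encodes only the parameter values (the values are taken from the question; URLEncoder does form encoding, which is close enough here because the values contain no spaces):
String base = "http://site-test.com/Meetings/IC/DownloadDocument";
String meetingId = "c21c905c-8359-4bd6-b864-844709e05754";
String itemId = "a4b724d1-282e-4b36-9d16-d619a807ba67";
String file = "\\\\s604132shvw140\\Test-Documents\\c21c905c-8359-4bd6-b864-844709e05754_attachments\\7e89c3cb-ce53-4a04-a9ee-1a584e157987\\myDoc.pdf";
String fileToDownloadLocation = base
+ "?meetingId=" + URLEncoder.encode(meetingId, "UTF-8")
+ "&itemId=" + URLEncoder.encode(itemId, "UTF-8")
+ "&file=" + URLEncoder.encode(file, "UTF-8");
HttpGet httpget = new HttpGet(new URL(fileToDownloadLocation).toURI());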
I had the same problem; I read the URL from a properties file:
String configFile = System.getenv("system.Environment");
if (configFile == null || "".equalsIgnoreCase(configFile.trim())) {
configFile = "dev.properties";
}
// Load properties
Properties properties = new Properties();
properties.load(getClass().getResourceAsStream("/" + configFile));
//read url from file
apiUrl = properties.getProperty("url").trim();
URL url = new URL(apiUrl);
//throw exception here
URLConnection conn = url.openConnection();
dev.properties:
url = "https://myDevServer.com/dev/api/gate"
It should be:
url = https://myDevServer.com/dev/api/gate
without the quotes, and then my problem is solved.
According to the Oracle documentation, MalformedURLException is:
Thrown to indicate that a malformed URL has occurred. Either no legal protocol could be found in a specification string or the string could not be parsed.
So with the quotes in place no legal protocol can be found, because the string literally starts with a quote character rather than with https:, and the URL cannot be parsed.
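If you cannot edit the properties file itself, a small defensive sketch is to strip surrounding quotes before building the URL (using the same property as above):
apiUrl = properties.getProperty("url").trim();
// remove surrounding double quotes if they were added in the properties file
if (apiUrl.startsWith("\"") && apiUrl.endsWith("\"")) {
apiUrl = apiUrl.substring(1, apiUrl.length() - 1);
}
URL url = new URL(apiUrl);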
You want to use URI templates. Look carefully at the README of this project: URLEncoder.encode() does NOT work for URIs.
Let us take your original URL:
http://site-test.test.com/Meetings/IC/DownloadDocument?meetingId=c21c905c-8359-4bd6-b864-844709e05754&itemId=a4b724d1-282e-4b36-9d16-d619a807ba67&file=\s604132shvw140\Test-Documents\c21c905c-8359-4bd6-b864-844709e05754_attachments\7e89c3cb-ce53-4a04-a9ee-1a584e157987\myDoc.pdf
and convert it to a URI template with three variables (on multiple lines for clarity):
http://site-test.test.com/Meetings/IC/DownloadDocument
?meetingId={meetingID}&itemId={itemID}&file={file}
Now let us build a variable map with these three variables using the library mentioned in the link:
final VariableMap vars = VariableMap.newBuilder()
.addScalarValue("meetingID", "c21c905c-8359-4bd6-b864-844709e05754")
.addScalarValue("itemID", "a4b724d1-282e-4b36-9d16-d619a807ba67")
.addScalarValue("file", "\\\\s604132shvw140\\Test-Documents"
+ "\\c21c905c-8359-4bd6-b864-844709e05754_attachments"
+ "\\7e89c3cb-ce53-4a04-a9ee-1a584e157987\\myDoc.pdf")
.build();
final URITemplate template
= new URITemplate("http://site-test.test.com/Meetings/IC/DownloadDocument"
+ "?meetingId={meetingID}&itemId={itemID}&file={file}");
// Generate URL as a String
final String theURL = template.expand(vars);
This is GUARANTEED to return a fully functional URL!
Thanks to Erhun's answer I finally realised that my JSON mapper was returning the quotation marks around my data too! I needed to use "asText()" instead of "toString()"
It's not an uncommon issue - one's brain doesn't see anything wrong with the correct data, surrounded by quotes!
discoveryJson.path("some_endpoint").toString(); // returns "https://what.the.com/heck" (with the quotes)
discoveryJson.path("some_endpoint").asText(); // returns https://what.the.com/heck
This code worked for me
public static void main(String[] args) {
try {
java.net.URL url = new java.net.URL("http://path");
System.out.println("Instantiated new URL: " + url);
}
catch (MalformedURLException e) {
e.printStackTrace();
}
}
Instantiated new URL: http://path
A very simple fix:
String encodedURL = UriUtils.encodePath(request.getUrl(), "UTF-8");
It works, with no extra functionality needed.
I am trying to extract some part of a page. I use the HtmlCleaner parser, and it removes all tags. Are there any settings to keep all HTML tags? Or maybe there is a better way to extract this part of the code, using something else?
My code:
static final String XPATH_STATS = "//div[@class='text']/p";
// config cleaner properties
HtmlCleaner htmlCleaner = new HtmlCleaner();
CleanerProperties props = htmlCleaner.getProperties();
props.setAllowHtmlInsideAttributes(false);
props.setAllowMultiWordAttributes(true);
props.setRecognizeUnicodeChars(true);
props.setOmitComments(true);
props.setTransSpecialEntitiesToNCR(true);
// create URL object
URL url = new URL(BLOG_URL);
// get HTML page root node
TagNode root = htmlCleaner.clean(url);
Object[] statsNode = root.evaluateXPath(XPATH_STATS);
for (Object tag : statsNode) {
stats = stats + tag.toString().trim();
}
return stats;
Thanks to nikhil.thakkar!
I did this with Jsoup.
The code may help someone:
URL url2 = new URL(BLOG_URL);
Document doc2 = Jsoup.parse(url2, 3000);
Element masthead = doc2.select("div.main_text").first();
String linkOuterH = masthead.outerHtml();
You can use jSoup parser.
More info here: http://jsoup.org/
I am trying to fetch the base URL using Java. I have used the JTidy parser in my code to fetch the title. I am getting the title properly using JTidy, but I am not getting the base URL from the given URL.
I have some URL as input:
String s1 = "http://staff.unak.is/andy/GameProgramming0910/new_page_2.htm";
String s2 = "http://www.complex.com/pop-culture/2011/04/10-hottest-women-in-fast-and-furious-movies";
From the first string, I want to fetch "http://staff.unak.is/andy/GameProgramming0910/" as a base URL and from the second string, I want "http://www.complex.com/" as a base URL.
I am using code:
URL url = new URL(s1);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
InputStream in = conn.getInputStream();
Document doc = new Tidy().parseDOM(in, null);
String titleText = doc.getElementsByTagName("title").item(0).getFirstChild()
.getNodeValue();
I am getting the title text, but can you please let me know how to get the base URL from the given URL?
Try to use the java.net.URL class; it will help you:
For the second case, which is easier, you could use new URL(s2).getHost();
For the first case, you could get the host and also use the getFile() method, and remove everything after the last slash ("/"). Something like (code not tested):
URL url = new URL(s1);
String path = url.getFile().substring(0, url.getFile().lastIndexOf('/') + 1); // keep the trailing slash
String base = url.getProtocol() + "://" + url.getHost() + path;
You use the java.net.URL class to resolve relative URLs.
For the first case: removing the filename from the path:
new URL(new URL(s1), ".").toString()
For the second case: setting the root path:
new URL(new URL(s2), "/").toString()
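For the two example strings from the question, a quick sketch of what this resolution yields:
String s1 = "http://staff.unak.is/andy/GameProgramming0910/new_page_2.htm";
String s2 = "http://www.complex.com/pop-culture/2011/04/10-hottest-women-in-fast-and-furious-movies";
System.out.println(new URL(new URL(s1), ".")); // http://staff.unak.is/andy/GameProgramming0910/
System.out.println(new URL(new URL(s2), "/")); // http://www.complex.com/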