I'm trying to build a jsoup-based Java app to automatically download English subtitles for films (I'm lazy, I know; it was inspired by a similar Python-based app). It's supposed to ask you for the name of the film and then download an English subtitle for it from subscene.
I can make it reach the download link, but I get an "Unhandled content type" error when I try to 'go' to that link. Here's my code:
public static void main(String[] args) {
    try {
        String videoName = JOptionPane.showInputDialog("Title: ");
        subscene(videoName);
    }
    catch (Exception e) {
        System.out.println(e.getMessage());
    }
}

public static void subscene(String videoName) {
    try {
        String siteName = "http://www.subscene.com";
        String[] splits = videoName.split("\\s+");
        String codeName = "";
        String text = "";
        if (splits.length > 1) {
            for (int i = 0; i < splits.length; i++) {
                codeName = codeName + splits[i] + "-";
            }
            videoName = codeName.substring(0, videoName.length());
        }
        System.out.println("videoName is " + videoName);
        // String url = "http://www.subscene.com/subtitles/"+videoName+"/english";
        String url = "http://www.subscene.com/subtitles/title?q=" + videoName + "&l=";
        System.out.println("url is " + url);
        Document doc = Jsoup.connect(url).get();
        Element exact = doc.select("h2.exact").first();
        Element yuel = exact.nextElementSibling();
        Elements lis = yuel.children();
        System.out.println(lis.first().children().text());
        String hRef = lis.select("div.title > a").attr("href");
        hRef = siteName + hRef + "/english";
        System.out.println("hRef is " + hRef);
        doc = Jsoup.connect(hRef).get();
        Element nonHI = doc.select("td.a40").first();
        Element papa = nonHI.parent();
        Element link = papa.select("a").first();
        text = link.text();
        System.out.println("Subtitle is " + text);
        hRef = link.attr("href");
        hRef = siteName + hRef;
        Document subDownloadPage = Jsoup.connect(hRef).get();
        hRef = siteName + subDownloadPage.select("a#downloadButton").attr("href");
        Jsoup.connect(hRef).get(); //<-- Here's where the problem lies
    }
    catch (java.io.IOException e) {
        System.out.println(e.getMessage());
    }
}
Can someone please help me so I don't have to manually download subs?
I just found out that using
java.awt.Desktop.getDesktop().browse(java.net.URI.create(hRef));
instead of
Jsoup.connect(hRef).get();
downloads the file after prompting me to save it. But I don't want to be prompted, because that way I won't be able to read the name of the downloaded zip file (I want to unzip it after saving, using Java).
Assuming that your files are small, you can do it like this. Note that you can tell Jsoup to ignore the content type.
// get the file content
Connection connection = Jsoup.connect(path);
connection.timeout(5000);
Connection.Response resultImageResponse = connection.ignoreContentType(true).execute();
// save to file
FileOutputStream out = new FileOutputStream(localFile);
out.write(resultImageResponse.bodyAsBytes());
out.close();
I would recommend verifying the content before saving, because some servers will just return an HTML page when the file cannot be found, i.e. a broken hyperlink.
...
String body = resultImageResponse.body();
if (body == null || body.toLowerCase().contains("<body>"))
{
throw new IllegalStateException("invalid file content");
}
...
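Since the question mentions wanting to unzip the archive afterwards with Java, here is a minimal sketch of that step using java.util.zip; localFile is the file written above, and the "subtitles" output directory is just an assumption:
// Requires java.io.* and java.util.zip.* imports.
// Sketch only: extracts every entry of the downloaded archive into an assumed "subtitles" folder.
File targetDir = new File("subtitles");
targetDir.mkdirs();
try (ZipInputStream zis = new ZipInputStream(new FileInputStream(localFile))) {
    ZipEntry entry;
    byte[] buf = new byte[4096];
    while ((entry = zis.getNextEntry()) != null) {
        File outFile = new File(targetDir, entry.getName());
        System.out.println("Extracting " + outFile.getName()); // this is where you can read the subtitle file's name
        try (FileOutputStream fos = new FileOutputStream(outFile)) {
            int len;
            while ((len = zis.read(buf)) > 0) {
                fos.write(buf, 0, len);
            }
        }
        zis.closeEntry();
    }
}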
Here:
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
//specifically here
Jsoup.connect(hRef).get();
It looks like jsoup expects the result of Jsoup.connect(hRef) to be HTML or some text that it's able to parse, which is why the message states:
Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml
I followed the execution of your code manually, and the last URL you're trying to access returns a content type of application/x-zip-compressed, which is the cause of the exception.
In order to download this file, you should use a different approach. You could use the old but still useful URLConnection/URL classes, or a third-party library like Apache HttpComponents, to fire a GET request, retrieve the result as an InputStream, wrap it in a proper writer, and write your file to disk.
Here's an example of doing this using URL:
URL url = new URL(hRef);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\foo.zip"));
final int BUFFER_SIZE = 1024 * 4;
byte[] buffer = new byte[BUFFER_SIZE];
BufferedInputStream bis = new BufferedInputStream(in);
int length;
while ((length = bis.read(buffer)) > 0) {
    out.write(buffer, 0, length);
}
out.close();
in.close();
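If you prefer the Apache HttpComponents route mentioned above, a roughly equivalent sketch (assuming HttpClient 4.3+ on the classpath; the target path is the same assumed D:\foo.zip) would be:
// Sketch only: download hRef with Apache HttpClient and copy the response body to a file.
try (CloseableHttpClient client = HttpClients.createDefault()) {
    HttpGet get = new HttpGet(hRef);
    try (CloseableHttpResponse response = client.execute(get);
         InputStream in = response.getEntity().getContent();
         OutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\foo.zip"))) {
        byte[] buffer = new byte[4096];
        int length;
        while ((length = in.read(buffer)) != -1) {
            out.write(buffer, 0, length);
        }
    }
}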
Related
Here's my problem: I have a txt file called "sites.txt" in which I type random internet sites. My goal is to save the first image of each site. I tried to filter the server response by the img tag, and it actually works for some sites, but not for others.
On the sites where it works, the img src starts with http://; on the sites where it doesn't work, the src starts with anything else.
I also tried to add http:// to the img src values which didn't have it, but I still get the same error:
Exception in thread "main" java.net.MalformedURLException: no protocol:
at java.net.URL.<init>(Unknown Source)
My current code is:
public static void main(String[] args) throws IOException {
    try {
        File file = new File("sites.txt");
        Scanner scanner = new Scanner(file);
        String url;
        int counter = 0;
        while (scanner.hasNext()) {
            url = scanner.nextLine();
            URL page = new URL(url);
            URLConnection yc = page.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
            String inputLine = in.readLine();
            while (!inputLine.toLowerCase().contains("img")) inputLine = in.readLine();
            in.close();
            String[] parts = inputLine.split(" ");
            int i = 0;
            while (!parts[i].contains("src")) i++;
            String destinationFile = "image" + (counter++) + ".jpg";
            saveImage(parts[i].substring(5, parts[i].length() - 1), destinationFile);
            String tmp = scanner.nextLine();
            System.out.println(url);
        }
        scanner.close();
    }
    catch (FileNotFoundException e) {
        System.out.println("File not found!");
        System.exit(0);
    }
}
public static void saveImage(String imageUrl, String destinationFile) throws IOException {
    URL url = new URL(imageUrl);
    String fileName = url.getFile();
    String destName = fileName.substring(fileName.lastIndexOf("/"));
    System.out.println(destName);
    InputStream is = url.openStream();
    OutputStream os = new FileOutputStream(destinationFile);
    byte[] b = new byte[2048];
    int length;
    while ((length = is.read(b)) != -1) {
        os.write(b, 0, length);
    }
    is.close();
    os.close();
}
I also got a tip to use the Apache Jakarta HTTP client libraries, but I have absolutely no idea how I could use those. I would appreciate any help.
A URL (a type of URI) requires a scheme in order to be valid. In this case, http.
When you type www.google.com into your browser, the browser infers that you mean http:// and automatically prepends it for you. Java doesn't do this, hence your exception.
Make sure you always have http://. You can easily fix this using regex:
String fixedUrl = stringUrl.replaceAll("^((?!http://).{7})", "http://$1");
or
if(!stringUrl.startsWith("http://"))
stringUrl = "http://" + stringUrl;
An alternative solution
Simply try ImageIO, which contains static convenience methods for locating ImageReaders and ImageWriters and performing simple encoding and decoding.
Sample code:
// read an image from the URL
// I used the URL that is your profile pic on Stack Overflow
BufferedImage image = ImageIO.read(new URL(
        "https://www.gravatar.com/avatar/3935223a285ab35a1b21f31248f1e721?s=32&d=identicon&r=PG&f=1"));
// save the image
ImageIO.write(image, "jpg", new File("resources/avatar.jpg"));
When you're scraping the site's HTML for image elements and their src attributes, you'll run into several different representations of URLs.
Some examples are:
resource = https://google.com/images/srpr/logo9w.png
resource = google.com/images/srpr/logo9w.png
resource = //google.com/images/srpr/logo9w.png
resource = /images/srpr/logo9w.png
resource = images/srpr/logo9w.png
For the second through fifth ones, you'll need to build the rest of the URL.
The second one may be more difficult to differentiate from the fourth and fifth ones, but I'm sure there are workarounds. The URL Standard leads me to believe you won't see it as often, because I don't think it's technically valid.
The third case is pretty simple. If the resource variable starts with //, then you just need to prepend the protocol/scheme to it. You can do this with the site object you have:
url = site.getProtocol() + ":" + resource
For the fourth and fifth cases, you'll need to prepend the resource with the entire site's URL.
Here's a sample application that uses jsoup to parse the HTML, and a simple utility method to build the resource URL. You're interested in the buildResourceUrl method. Also, it doesn't handle the second case; I'll leave that to you.
import java.io.*;
import java.net.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
public class SiteScraper {
public static void main(String[] args) throws IOException {
URL site = new URL("https://google.com/");
Document doc = Jsoup.connect(site.toString()).get();
Elements images = doc.select("img");
for (Element image : images) {
String src = image.attr("src");
System.out.println(buildResourceUrl(site, src));
}
}
static URL buildResourceUrl(URL site, String resource)
throws MalformedURLException {
if (!resource.matches("^(http|https|ftp)://.*$")) {
if (resource.startsWith("//")) {
return new URL(site.getProtocol() + ":" + resource);
} else {
return new URL(site.getProtocol() + "://" + site.getHost() + "/"
+ resource.replaceAll("^/", ""));
}
}
return new URL(resource);
}
}
This obviously won't cover everything, but it's a start. You may run into problems when the URL you're trying to access is in a subdirectory of the root of the site (i.e., http://some.place/under/the/rainbow.html). You may even encounter base64-encoded data URIs in the src attribute (see the sketch below)... It really depends on the individual case and how far you're willing to go.
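For the data URI case specifically, here is a minimal sketch (assuming Java 8's java.util.Base64; the "inline-image" output path is made up) that branches before building a URL at all:
// Hypothetical handling of a value like "data:image/png;base64,iVBORw0..." inside the loop above.
String src = image.attr("src");
if (src.startsWith("data:")) {
    int comma = src.indexOf(',');
    if (comma > 5 && src.substring(5, comma).endsWith(";base64")) {
        byte[] bytes = java.util.Base64.getDecoder().decode(src.substring(comma + 1));
        java.nio.file.Files.write(java.nio.file.Paths.get("inline-image"), bytes);
    }
} else {
    System.out.println(buildResourceUrl(site, src));
}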
I have an issue when displaying strings received from a server in a JTable. Some specific characters appear as little white squares instead of "é", "à", etc. I tried a lot of things, but none of them fixed my problem. I'm working with Eclipse under Windows. The server was developed using Visual Studio 2010.
The server sends an XML file using tinyXML2, and the client uses JDOM to read it. The font used is "Dialog". The server takes the strings from an Oracle database.
I assume this is an encoding problem, but I haven't been able to fix it yet.
Does anyone have an idea ?
Thx
Arnaud
EDIT: As requested, this is how I use JDOM:
public static Player fromXML(Element e) {
    Player result = new Player();
    String e_text = null;
    try {
        e_text = e.getChildText(XMLTags.XML_Player_playerId);
        if (e_text != null) result.setID(Integer.parseInt(e_text));

        e_text = e.getChildText(XMLTags.XML_Player_lastName);
        if (e_text != null) result.setName(e_text);

        e_text = e.getChildText(XMLTags.XML_Player_point_scored);
        if (e_text != null) result.addSpecial(STAT_SCORED, Double.parseDouble(e_text));

        e_text = e.getChildText(XMLTags.XML_Player_point_scored_last);
        if (e_text != null) result.addSpecial(STAT_SCORED_LAST, Double.parseDouble(e_text));
    }
    catch (Exception ex) {
        ex.printStackTrace();
    }
    return result;
}

public static Document load(String filename) {
    File XMLFile = new File(CLIENT_TO_SERVER, filename);
    SAXBuilder sxb = new SAXBuilder();
    Document document = new Document();
    try {
        document = sxb.build(new File(XMLFile.getPath()));
    } catch (Exception e) {
        e.printStackTrace();
    }
    return document;
}
Read the file using the correct encoding, something like:
document = sxb.build(new BufferedReader(new InputStreamReader(new FileInputStream(XMLFile.getPath()), "UTF8")));
Note: first determine which character encoding is used in that file, and specify that charset instead of UTF8 above.
In case the encoding is not known, or the files are generated by various systems with different encodings, you may use Mozilla's encoding detector library; see https://code.google.com/p/juniversalchardet/ and the sketch below.
You also need to handle UnsupportedEncodingException.
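A minimal sketch of that detector (org.mozilla.universalchardet.UniversalDetector, assuming the juniversalchardet jar is on the classpath and XMLFile is the file from the load method above):
// Detect the charset first, then build the SAX reader with it, falling back to UTF-8.
UniversalDetector detector = new UniversalDetector(null);
byte[] buf = new byte[4096];
try (FileInputStream fis = new FileInputStream(XMLFile)) {
    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }
}
detector.dataEnd();
String encoding = detector.getDetectedCharset();
if (encoding == null) encoding = "UTF-8";
document = sxb.build(new BufferedReader(new InputStreamReader(new FileInputStream(XMLFile), encoding)));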
I'm trying to parse a website. After all the links are collected into an ArrayList, I want to parse them again, but I have trouble with the initialization of them.
This is my ArrayList:
public ArrayList<String> linkList = new ArrayList<String>();
This is how I collect the links in doInBackground:
try {
    Document doc = Jsoup.connect("http://forurl.com/archive/").get();
    Element links = doc.select("a[href]");
    for (Element link : links) {
        linkList.add(link.attr("abs:href"));
    }
}
In "onPostExecute" showing what I get:
lk.setText("Collected: " +linkList.size()); // showing how much is collected
lj.setText("First link: " +linkList.get(0)); // showing first link
Try to parse child links:
public class imgTread extends AsyncTask<Void, Void, Void> {

    Bitmap bitmap;
    String[] url = {"http://forurl.com/link1/",
                    "http://forurl.com/link2/"}; // this way works well

    protected Void doInBackground(Void... params) {
        try {
            for (int i = 0; i < url.length; i++) {
                Document doc1 = Jsoup.connect(url[0]).get(); // connect to 1 link for example
                Elements img = doc1.select("#strip");
                String imgSrc = img.attr("src");
                InputStream input = new java.net.URL(imgSrc).openStream();
                bitmap = BitmapFactory.decodeStream(input);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
I'm trying to make a String[] from the ArrayList, but it doesn't work:
String[] url = linkList.toArray(new String[linkList.size()]);
The output this way is something like [Ljava.lang.String;@45ds364.
The idea is: 1) collect all links from the url; 2) connect to them one by one and get the information I need.
The first point works, and so does the second, but how do I tie them together?
Thanks for any advice.
Working code:
Document doc = Jsoup.connect(url).get(); // connect to site
Elements links = doc.select("a[href]"); // get all links
String link_addr = links.get(3).attr("abs:href"); // choose link number 3
Document link_doc = Jsoup.connect(link_addr).get(); // connect to it
Elements img = link_doc.select("#strip"); // select the element with id "strip"
String imgSrc = img.attr("src"); // get url
InputStream input = new java.net.URL(imgSrc).openStream();
bitmap = BitmapFactory.decodeStream(input);
I hope this helps someone.
You are taking many unnecessary steps. You have a perfectly fine collection of Element objects in your Elements links object. Why do you have to add them to an ArrayList?
If I have understood your question correctly, your thought process should be something like this.
Get all the links from a URL
Establish a new connection to each link
Download all the images on that page where the element id = "strip".
Get all the links:
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
doInBackground(links);
Call the doInBackground method with the links as a parameter:
public static void doInBackground(Elements links) throws IOException {
    try {
        for (Element element : links) {
            //Connect to the first link
            Document doc = Jsoup.connect(element.attr("abs:href")).get();
            //Select all the elements with ID 'strip'
            Elements img = doc.select("#strip");
            //Get the source of the image
            String imgSrc = img.attr("abs:src");
            //Open an InputStream
            InputStream input = new java.net.URL(imgSrc).openStream();
            //Get the image
            bitmap = BitmapFactory.decodeStream(input);
            ...
            //Perhaps save the image somewhere?
            //Close the InputStream
            input.close();
        }
    } catch (IllegalArgumentException e) {
        System.out.println(e.getMessage());
    } catch (MalformedURLException ex) {
        System.out.println(ex.getMessage());
    }
}
Of course, you will have to properly use AsyncTask and call the methods from preferred places, but this is the overall idea of how you can use Jsoup to do the job you want it to.
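For completeness, a rough AsyncTask skeleton for wiring this up (the class name ScrapeTask is made up, and lk/lj are the TextViews from the question) might look like this:
// Sketch only: collect the links off the UI thread, then update the UI in onPostExecute.
private class ScrapeTask extends AsyncTask<String, Void, List<String>> {
    @Override
    protected List<String> doInBackground(String... urls) {
        List<String> collected = new ArrayList<>();
        try {
            Document doc = Jsoup.connect(urls[0]).get();
            for (Element link : doc.select("a[href]")) {
                collected.add(link.attr("abs:href"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return collected;
    }

    @Override
    protected void onPostExecute(List<String> links) {
        lk.setText("Collected: " + links.size());
        lj.setText("First link: " + (links.isEmpty() ? "none" : links.get(0)));
    }
}
// usage: new ScrapeTask().execute("http://forurl.com/archive/");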
If you want to create an array instead of a list, you can do:
try {
    Document doc = Jsoup.connect("http://forurl.com/archive/").get();
    Elements links = doc.select("a[href]");
    String[] array = new String[links.size()];
    for (int i = 0; i < links.size(); i++) {
        array[i] = links.get(i).attr("abs:href");
    }
}
First of all, what the hell is that?
Why is links, which holds a whole collection of elements returned by doc.select, declared as a single Element? Isn't that confusing?
Second, keep to the Java naming convention and start variable names with a lowercase letter, so change LinkList to linkList. Even the syntax highlighter went crazy thanks to you.
Third
I'm trying to make a String[] from the ArrayList, but it doesn't work.
Where are you trying to do that, and how exactly does it not work? I don't see it anywhere in the code.
Fourth
To create an array out of a List you have to do something like this:
String[] links = linkList.toArray(new String[linkList.size()]);
Fifth
Change the title to a more appropriate one, as the present one is very misleading (you have no trouble with Jsoup here).
I am crawling a webpage, extracting all the links from it, and then trying to parse each URL using Apache Tika and BoilerPipe with the code below. For some URLs it parses very well, but for others I get the error shown below, which points at line number 102 of HTMLParser.java. This is line number 102 in HTMLParser.java:
String parsedText = tika.parseToString(htmlStream, md);
I have provided the HTMLParser code as well.
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser#67c28a6a
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:203)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
at org.apache.tika.Tika.parseToString(Tika.java:357)
at edu.uci.ics.crawler4j.crawler.HTMLParser.parse(HTMLParser.java:102)
at edu.uci.ics.crawler4j.crawler.WebCrawler.handleHtml(WebCrawler.java:227)
at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:299)
at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:118)
at java.lang.Thread.run(Unknown Source)
Caused by: java.util.zip.ZipException: invalid block type
at java.util.zip.InflaterInputStream.read(Unknown Source)
at java.util.zip.ZipInputStream.read(Unknown Source)
at java.io.FilterInputStream.read(Unknown Source)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource$FakeZipEntry.<init>(ZipInputStreamZipEntrySource.java:114)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:55)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:82)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:220)
at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
... 8 more
This is my HTMLParser.java file:
public void parse(String htmlContent, String contextURL) {
    InputStream htmlStream = null;
    text = null;
    title = null;
    metaData = new HashMap<String, String>();
    urls = new HashSet<String>();
    char[] chars = htmlContent.toCharArray();

    bulletParser.setCallback(textExtractor);
    bulletParser.parse(chars);

    try {
        text = articleExtractor.getText(htmlContent);
    } catch (BoilerpipeProcessingException e) {
        e.printStackTrace();
    }
    if (text == null) {
        text = textExtractor.text.toString().trim();
    }
    title = textExtractor.title.toString().trim();

    try {
        Metadata md = new Metadata();
        String utfHtmlContent = new String(htmlContent.getBytes(), "UTF-8");
        htmlStream = new ByteArrayInputStream(utfHtmlContent.getBytes());
        //The below line is at the line number 102 according to error above
        String parsedText = tika.parseToString(htmlStream, md);
        //very unlikely to happen
        if (text == null) {
            text = parsedText.trim();
        }
        processMetaData(md);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeQuietly(htmlStream);
    }

    bulletParser.setCallback(linkExtractor);
    bulletParser.parse(chars);
    Iterator<String> it = linkExtractor.urls.iterator();

    String baseURL = linkExtractor.base();
    if (baseURL != null) {
        contextURL = baseURL;
    }

    int urlCount = 0;
    while (it.hasNext()) {
        String href = it.next();
        href = href.trim();
        if (href.length() == 0) {
            continue;
        }
        String hrefWithoutProtocol = href.toLowerCase();
        if (href.startsWith("http://")) {
            hrefWithoutProtocol = href.substring(7);
        }
        if (hrefWithoutProtocol.indexOf("javascript:") < 0
                && hrefWithoutProtocol.indexOf("#") < 0) {
            URL url = URLCanonicalizer.getCanonicalURL(href, contextURL);
            if (url != null) {
                urls.add(url.toExternalForm());
                urlCount++;
                if (urlCount > MAX_OUT_LINKS) {
                    break;
                }
            }
        }
    }
}
Any suggestions will be appreciated.
Sounds like a malformed OOXML document (.docx, .xlsx, etc.). To check whether the problem still occurs with the latest Tika version, you can download the tika-app jar and run it like this:
java -jar tika-app-1.0.jar --text http://url.of.the/troublesome/document.docx
This should print out the text contained in the document. If it doesn't work, please file a bug report with the URL of the troublesome document (or attach the document if it's not publicly available).
I had the same issue. I found that the documents (docx files) I was trying to parse were not actually simple documents; they were forms developed in Microsoft Word, with text and input fields beside the label text.
I removed such files from the folder and posted the rest of the files to the Solr engine for parsing and indexing, and it worked.
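If you cannot filter such documents out up front, a defensive sketch is to catch TikaException around the parse call in HTMLParser and skip the offending document instead of letting the whole crawl fail (parsedText simply stays null in that case):
// Sketch only: tolerate individual documents that Tika/POI cannot read.
String parsedText = null;
try {
    parsedText = tika.parseToString(htmlStream, md);
} catch (org.apache.tika.exception.TikaException te) {
    // e.g. TIKA-198 wrapping a ZipException for a malformed OOXML file
    System.err.println("Skipping unparsable document: " + te.getMessage());
}
if (text == null && parsedText != null) {
    text = parsedText.trim();
}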
I am new to org.xhtmlrenderer.pdf.ITextRenderer and have this problem:
The PDFs that my test servlet streams to my Downloads folder are in fact empty files.
The relevant method, streamAndDeleteTheClob, is shown below.
The first try block is definitely not a problem.
The server spends a lot of time in the second try block. No exception is thrown.
Can anyone suggest a solution to this problem, or a good approach to debugging it?
Can anyone point me to essentially similar code that really works?
Any help would be much appreciated.
res.setContentType("application/pdf");
ServletOutputStream out = res.getOutputStream();
...
private boolean streamAndDeleteTheClob(int pageid,
                                       Connection con,
                                       ServletOutputStream out) throws IOException, ServletException {
    Statement statement;
    Clob htmlpage;
    StringBuffer pdfbuf = new StringBuffer();
    final String pageToSendQuery = "SELECT text FROM page WHERE pageid = " + pageid;

    // create xhtml file as a CLOB (Oracle large character object) and stream it into StringBuffer pdfbuf
    try { // definitely no problem in this block
        statement = con.createStatement();
        resultSet = statement.executeQuery(pageToSendQuery);
        if (resultSet.next()) {
            htmlpage = resultSet.getClob(1);
        } else {
            return true;
        }
        final Reader in = htmlpage.getCharacterStream();
        final char[] buffer = new char[4096];
        while ((in.read(buffer)) != -1) {
            pdfbuf.append(buffer);
        }
    } catch (Exception ex) {
        out.println("buffering CLOB failed: " + ex);
    }

    // create pdf from StringBuffer
    try {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(pdfbuf.toString())));
        ITextRenderer renderer = new ITextRenderer();
        renderer.setDocument(doc, null);
        renderer.layout();
        renderer.createPDF(out);
        out.close();
    } catch (Exception ex) {
        out.println("streaming of pdf failed: " + ex);
    }
    deleteClob(con, pageid);
    return false;
}
Using DocumentBuilder.parse this way will try to resolve the DTD referenced in the XHTML page, which takes a really long time. The easiest way to avoid that, if you are using Flying Saucer (xhtmlrenderer), is to create the document this way:
Document myDocument = XMLResource.load(myInputStream).getDocument();
Note that you can use XMLResource.load with a Reader too.
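Applied to the second try block of the question, that would look roughly like this (a sketch; XMLResource lives in org.xhtmlrenderer.resource):
// Build the DOM via Flying Saucer's XMLResource instead of DocumentBuilder,
// so the XHTML DTD is not fetched over the network.
Document doc = XMLResource.load(new StringReader(pdfbuf.toString())).getDocument();
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(doc, null);
renderer.layout();
renderer.createPDF(out);
out.close();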
Two things I can think of.
1) If the iText document is not closed, it'll be empty. Looks like renderer.finish() will work, but createPDF(out) should do that already.
2) If there are no pages, you could get an empty doc as well... so an empty input could result in a 0-byte PDF.
3) You might be getting a perfectly valid PDF that's not being streamed properly. Try writing to a ByteArrayOutputStream and checking the length there (see the sketch after this list).
4) An almost fanatical dedication to the Pope!
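For point 3, a small sketch (inside the existing second try block; variable names otherwise as in the question) that renders into memory first and checks the size before streaming:
// Render to a ByteArrayOutputStream, check the length, then stream the bytes to the client.
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(doc, null);
renderer.layout();
renderer.createPDF(baos);
System.out.println("PDF size in bytes: " + baos.size()); // 0 here would confirm the rendering step is the problem
baos.writeTo(out); // 'out' is the ServletOutputStream passed into the method
out.flush();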