How to get links from an ArrayList filled by Jsoup - java

I am trying to parse a website. After collecting all the links into an ArrayList, I want to parse each of them in turn, but I am having trouble with initializing them.
This is my ArrayList:
public ArrayList<String> linkList = new ArrayList<String>();
How I collect links in "doInBackground":
try {
Document doc = Jsoup.connect("http://forurl.com/archive/").get();
Elements links = doc.select("a[href]"); // select() returns Elements, not a single Element
for (Element link : links)
{
linkList.add(link.attr("abs:href"));
}
} catch (IOException e) {
e.printStackTrace();
}
In "onPostExecute" I show what I got:
lk.setText("Collected: " +linkList.size()); // showing how much is collected
lj.setText("First link: " +linkList.get(0)); // showing first link
My attempt to parse the child links:
public class imgTread extends AsyncTask<Void, Void, Void> {
Bitmap bitmap;
String[] url = {"http://forurl.com/link1/",
"http://forurl.com/link2/"}; // this way work well
protected Void doInBackground(Void... params) {
try {
for (int i = 0; i < url.length; i++){
Document doc1 = Jsoup.connect(url[0]).get(); // connect to 1 link for example
Elements img = doc1.select("#strip");
String imgSrc = img.attr("src");
InputStream input = new java.net.URL(imgSrc).openStream();
bitmap = BitmapFactory.decodeStream(input);
}
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
I tried to make String[] from ArrayList, but it doesn't work.
String[] url = linkList.toArray(new String[linkList.size()]);
Printed this way, the output is Ljava.lang.String;#45ds364
The idea is: 1) collect all links from the URL; 2) connect to them one by one and get the information I need.
The first step works, and so does the second, but how do I tie them together?
Thanks for any advice.
Working code:
Document doc = Jsoup.connect(url).get(); // connect to the site
Elements links = doc.select("a[href]"); // get all links
String link_addr = links.get(3).attr("abs:href"); // pick the link at index 3 (the fourth link)
Document link_doc = Jsoup.connect(link_addr).get(); // connect to it
Elements img = link_doc.select("#strip"); // select the element with id "strip"
String imgSrc = img.attr("src"); // get the image URL
InputStream input = new java.net.URL(imgSrc).openStream();
bitmap = BitmapFactory.decodeStream(input);
I hope this helps someone.

You are doing many unnecessary steps. You have a perfectly fine collection of Element objects in your Elements links object. Why do you have to add them to an ArrayList?
If I have understood your question correctly, your thought process should be something like this.
Get all the links from a URL
Establish a new connection to each link
Download all the images on that page where the element id = "strip".
Get all the links:
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
doInBackground(links);
Call the doInBackground method with the links as a parameter:
public static void doInBackground(Elements links) throws IOException {
try {
for (Element element : links) {
//Connect to the current link
Document doc = Jsoup.connect(element.attr("abs:href")).get();
//Select all the elements with ID 'strip'
Elements img = doc.select("#strip");
//Get the source of the image
String imgSrc = img.attr("abs:src");
//Open an InputStream
InputStream input = new java.net.URL(imgSrc).openStream();
//Get the image
bitmap = BitmapFactory.decodeStream(input);
...
//Perhaps save the image somewhere?
//Close the InputStream
input.close();
}
} catch (IllegalArgumentException e) {
System.out.println(e.getMessage());
} catch (MalformedURLException ex) {
System.out.println(ex.getMessage());
}
}
Of course, you will have to properly use AsyncTask and call the methods from preferred places, but this is the overall idea of how you can use Jsoup to do the job you want it to.

If you want to create an array instead of a list, you can do:
try {
Document doc = Jsoup.connect("http://forurl.com/archive/").get();
Elements links = doc.select("a[href]");
// one array slot per collected link
String[] array = new String[links.size()];
for (int i = 0; i < links.size(); i++)
{
array[i] = links.get(i).attr("abs:href");
}
} catch (IOException e) {
e.printStackTrace();
}

First of all, what the hell is that?
Why is the link variable a collection of Elements while links is a single member of that collection? Isn't that confusing?
Second, follow the Java naming conventions and start variable names with a lowercase letter, so change LinkList to linkList. Even the syntax highlighter went crazy thanks to you.
Third,
tried to make String[] from ArrayList, but it doesn't work.
Where are you trying to do that, and where does it fail? I don't see it anywhere in the code.
Fourth,
To create an array out of a List you have to do something like this:
String[] links = linkList.toArray(new String[linkList.size()]);
Fifth,
Change the title to something more appropriate, as the present one is very misleading (you have no trouble with Jsoup here).
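To illustrate the fourth point, and the Ljava.lang.String;... output mentioned in the question, here is a minimal JDK-only sketch. The class name LinkArrayDemo, the helper toArray and the two URLs are placeholders made up for this example. It assumes the odd output came from printing the array object itself rather than its elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; the URLs are placeholders, not real pages.
class LinkArrayDemo {

    // Copies the list into a new String[]; the list itself is left unchanged.
    static String[] toArray(List<String> linkList) {
        return linkList.toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> linkList = new ArrayList<>();
        linkList.add("http://forurl.com/link1/");
        linkList.add("http://forurl.com/link2/");

        String[] url = toArray(linkList);

        // Concatenating the array itself uses Object.toString(), which
        // yields a type tag plus a hash code, not the contents.
        System.out.println("Raw: " + url);

        // Arrays.toString prints the elements themselves.
        System.out.println("Contents: " + Arrays.toString(url));

        // Each element is an ordinary String and can be used one by one.
        for (String link : url) {
            System.out.println(link);
        }
    }
}
```

If that assumption holds, the conversion in the question was already correct and only the way the array was displayed was misleading. Each url[i] can be passed to Jsoup.connect as usual.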

Related

Get Google Search Result with Java using Jsoup

First of all, I searched for this problem on Stack Overflow and Google. Unfortunately, I couldn't find a solution.
I am trying to get Google search results for a keyword. Here's my code:
public static void main(String[] args) throws Exception {
Document doc;
try{
doc = Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=").userAgent("Mozilla").ignoreHttpErrors(true).timeout(0).get();
Elements links = (Elements) doc.select("li[class=g]");
for (Element link : links) {
Elements titles = link.select("h3[class=r]");
String title = titles.text();
Elements bodies = link.select("span[class=st]");
String body = bodies.text();
System.out.println("Title: "+title);
System.out.println("Body: "+body+"\n");
}
}
catch (IOException e) {
e.printStackTrace();
}
}
And here are the errors: https://prnt.sc/ro4ooi
It says: can only iterate over an array or an instance of java.lang.Iterable (at links).
When I delete the (Elements) cast: https://prnt.sc/ro4pa9
Thank you.

Error: org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage

I am trying to extract images from a PDF using pdfbox. I took help from this post. It worked for some of the PDFs, but for most of them it did not. For example, I am not able to extract the figures in this file.
After doing some research I found that PDResources.getImages is deprecated. So, I am using PDResources.getXObjects(). With this, I am not able to extract any image from the PDF and instead get this message at the console:
org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage
Now I am stuck and unable to find the solution. Please assist if anyone can.
//////UPDATE AS REPLY ON COMMENTS///
I am using pdfbox-1.8.10
Here is the code:
public void getimg ()throws Exception {
try {
String sourceDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/inputs/Yavaa.pdf";
String destinationDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/outputs/";
File oldFile = new File(sourceDir);
if (oldFile.exists()){
PDDocument document = PDDocument.load(sourceDir);
List<PDPage> list = document.getDocumentCatalog().getAllPages();
String fileName = oldFile.getName().replace(".pdf", "_cover");
int totalImages = 1;
for (PDPage page : list) {
PDResources pdResources = page.getResources();
Map pageImages = pdResources.getXObjects();
if (pageImages != null){
Iterator imageIter = pageImages.keySet().iterator();
while (imageIter.hasNext()){
String key = (String) imageIter.next();
Object obj = pageImages.get(key);
if(obj instanceof PDXObjectImage) {
PDXObjectImage pdxObjectImage = (PDXObjectImage) obj;
pdxObjectImage.write2file(destinationDir + fileName+ "_" + totalImages);
totalImages++;
}
}
}
}
} else {
System.err.println("File not exist");
}
}
catch (Exception e){
System.err.println(e.getMessage());
}
}
//// PARTIAL SOLUTION/////
I have solved the problem of the error message and have updated the code in the post accordingly. However, the underlying problem remains: I am still not able to extract the images from a few of the files, such as the one mentioned in this post. Is there any solution for that?
The first problem with the original code is that an XObject can be either a PDXObjectImage or a PDXObjectForm, so the instance type must be checked before casting. The second problem is that the code doesn't walk PDXObjectForm objects recursively; forms can have resources too. The third problem (only in 1.8) is that you used getResources() instead of findResources(); getResources() doesn't check higher levels.
Code for 1.8 can be found here:
https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractImages.java?view=markup
Code for 2.0 can be found here:
https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup&sortby=date
(Even these are not always perfect, see this answer)
The fourth problem is that your file doesn't have any XObjects at all. All the "graphics" are really vector drawings, which can't be "extracted" like embedded images. All you could do is convert the PDF pages to images, and then mark and cut out what you need.

Save file from a website with java

I'm trying to build a jsoup based java app to automatically download English subtitles for films (I'm lazy, I know. It was inspired from a similar python based app). It's supposed to ask you the name of the film and then download an English subtitle for it from subscene.
I can make it reach the download link but I get an Unhandled content type error when I try to 'go' to that link. Here's my code
public static void main(String[] args) {
try {
String videoName = JOptionPane.showInputDialog("Title: ");
subscene(videoName);
}
catch (Exception e) {
System.out.println(e.getMessage());
}
}
public static void subscene(String videoName){
try {
String siteName = "http://www.subscene.com";
String[] splits = videoName.split("\\s+");
String codeName = "";
String text = "";
if(splits.length>1){
for(int i=0;i<splits.length;i++){
codeName = codeName+splits[i]+"-";
}
videoName = codeName.substring(0, videoName.length());
}
System.out.println("videoName is "+videoName);
// String url = "http://www.subscene.com/subtitles/"+videoName+"/english";
String url = "http://www.subscene.com/subtitles/title?q="+videoName+"&l=";
System.out.println("url is "+url);
Document doc = Jsoup.connect(url).get();
Element exact = doc.select("h2.exact").first();
Element yuel = exact.nextElementSibling();
Elements lis = yuel.children();
System.out.println(lis.first().children().text());
String hRef = lis.select("div.title > a").attr("href");
hRef = siteName+hRef+"/english";
System.out.println("hRef is "+hRef);
doc = Jsoup.connect(hRef).get();
Element nonHI = doc.select("td.a40").first();
Element papa = nonHI.parent();
Element link = papa.select("a").first();
text = link.text();
System.out.println("Subtitle is "+text);
hRef = link.attr("href");
hRef = siteName+hRef;
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
Jsoup.connect(hRef).get(); //<-- Here's where the problem lies
}
catch (java.io.IOException e) {
System.out.println(e.getMessage());
}
}
Can someone please help me so I don't have to manually download subs?
I just found out that using
java.awt.Desktop.getDesktop().browse(java.net.URI.create(hRef));
instead of
Jsoup.connect(hRef).get();
downloads the file after prompting me to save it. But I don't want to be prompted because this way I won't be able to read the name of the downloaded zip file (I want to unzip it after saving using java).
Assuming that your files are small, you can do it like this. Note that you can tell Jsoup to ignore the content type.
// get the file content
Connection connection = Jsoup.connect(path);
connection.timeout(5000);
Connection.Response resultImageResponse = connection.ignoreContentType(true).execute();
// save to file
FileOutputStream out = new FileOutputStream(localFile);
out.write(resultImageResponse.bodyAsBytes());
out.close();
I would recommend verifying the content before saving, because some servers will just return an HTML page when the file cannot be found, e.g. for a broken hyperlink.
...
String body = resultImageResponse.body();
if (body == null || body.toLowerCase().contains("<body>"))
{
throw new IllegalStateException("invalid file content");
}
...
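Looking for "<body>" is a heuristic. Since the file in the question is a zip archive, another option is to check the first bytes of the payload against the ZIP local file header signature (the bytes 0x50 0x4B 0x03 0x04, i.e. "PK" followed by 3 and 4). The sketch below is JDK-only; ZipSniffer and looksLikeZip are hypothetical names, not part of Jsoup. It assumes the archive is not empty, because an empty archive starts with a different signature.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jsoup: inspects the first four bytes of a
// downloaded payload to see whether it starts like a non-empty ZIP archive.
class ZipSniffer {

    // ZIP local file header signature: 'P', 'K', 0x03, 0x04.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    static boolean looksLikeZip(byte[] data) {
        if (data == null || data.length < ZIP_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < ZIP_MAGIC.length; i++) {
            if (data[i] != ZIP_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] html = "<html><body>Not found</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] zip = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};

        System.out.println(looksLikeZip(html)); // false
        System.out.println(looksLikeZip(zip));  // true
    }
}
```

With the code above, the check could be applied to resultImageResponse.bodyAsBytes() before the bytes are written to the file.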
Here:
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
//specifically here
Jsoup.connect(hRef).get();
Looks like jsoup expects the result of Jsoup.connect(hRef).get() to be HTML or some other text that it's able to parse; that's why the message states:
Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml
I followed the execution of your code manually and the last URL you're trying to access returns a content type of application/x-zip-compressed, thus the cause of the exception.
In order to download this file, you should use a different approach. You could use the old but still useful URLConnection or URL classes, or a third-party library like Apache HttpComponents, to fire a GET request and retrieve the result as an InputStream, then copy that stream to an OutputStream that writes the file to disk.
Here's an example of doing this using URL:
URL url = new URL(hRef);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\foo.zip"));
final int BUFFER_SIZE = 1024 * 4;
byte[] buffer = new byte[BUFFER_SIZE];
BufferedInputStream bis = new BufferedInputStream(in);
int length;
while ( (length = bis.read(buffer)) > 0 ) {
out.write(buffer, 0, length);
}
out.close();
in.close();
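On Java 7 and later, the same copy can be written more compactly with try-with-resources and Files.copy, which also closes the stream if an exception is thrown. This is only a sketch: UrlDownloader and download are hypothetical names, and the source URL and target path are whatever the caller supplies.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class for this example.
class UrlDownloader {

    // Streams the resource at 'source' into 'target', replacing any existing
    // file, and returns the number of bytes written.
    static long download(URL source, Path target) throws IOException {
        try (InputStream in = source.openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: UrlDownloader <url> <target-file>");
            return;
        }
        long bytes = download(URI.create(args[0]).toURL(), Paths.get(args[1]));
        System.out.println("Saved " + bytes + " bytes to " + args[1]);
    }
}
```

Because the target path is chosen by the caller, the name of the saved zip file is known afterwards, which is what the question needs for unzipping it later.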

JSoup core web text extraction

I am new to JSoup, so sorry if my question is too trivial.
I am trying to extract article text from http://www.nytimes.com/, but when I print the parsed document I am not able to see any articles in the output.
public class App
{
public static void main( String[] args )
{
String url = "http://www.nytimes.com/";
Document document;
try {
document = Jsoup.connect(url).get();
System.out.println(document.html()); // Articles not getting printed
//System.out.println(document.toString()); // Same here
String title = document.title();
System.out.println("title : " + title); // Title is fine
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
OK, I have also tried to parse "http://en.wikipedia.org/wiki/Big_data" to retrieve the wiki data, and I have the same issue there: I am not getting the wiki data in the output.
Any help or hint will be much appreciated.
Thanks.
Here's how to get all <p class="summary"> text:
final String url = "http://www.nytimes.com/";
Document doc = Jsoup.connect(url).get();
for( Element element : doc.select("p.summary") )
{
if( element.hasText() ) // Skip those tags without text
{
System.out.println(element.text());
}
}
If you need all <p> tags, without any filtering, you can use doc.select("p") instead. But in most cases it's better to select only those you need (see here for Jsoup Selector documentation).

Jsoup display data to textview

I parsed an HTML web page with jsoup. Now I want to display my parsed data in my TextView.
Code:
String ID = loginpreferences.getString("ID", null);
String Type = loginpreferences.getString("Type", null);
String myURL = "http://roosters.gepro-osi.nl/roosters/rooster.php?leerling="+ID+"&type=Leerlingrooster&afdeling="+Type+"&tabblad=2&school=905";
Document doc = null;
try {
doc = Jsoup.connect(myURL).get();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Elements data = doc.select(".1nameheader");
}
}
I tried
Textview1.SetText(data);
But that didn't work.
Seems as if you want to print the text values from a list of Elements. To do so you need to iterate over the list of Elements and get the text out of them.
StringBuilder text = new StringBuilder();
for(Element e: data){
text.append(e.text());
}
Textview1.setText(text.toString());
Line
Textview1.SetText(data);
shouldn't even compile.
From Android TextView class reference:
final void setText(CharSequence text)
Sets the string value of the TextView.
You're giving Elements class instance to the method.
Element and Elements classes of JSoup provide you with html() and text() methods that you should use in that case.
Have you tried android.text.Html.fromHtml(String)?
This method takes HTML as input and returns a Spanned text that you can set on a TextView.
