Download all pdf files in website - java

I am trying to download all PDF files from a website, and my code is bad. I guess there is a better way out there. Anyway, here it is:
try {
System.out.println("Download started");
URL getURL = new URL("http://cs.lth.se/eda095/foerelaesningar/?no_cache=1");
URL pdf;
URLConnection urlC = getURL.openConnection();
InputStream is = urlC.getInputStream();
BufferedReader buffRead = new BufferedReader(new InputStreamReader(is));
FileOutputStream fos = null;
byte[] b = new byte[1024];
String line;
double i = 1;
int t = 1;
int length;
while((line = buffRead.readLine()) != null) {
while((length = is.read(b)) > -1) {
if(line.contains(".pdf")) {
pdf = new URL("http://fileadmin.cs.lth.se/cs/Education/EDA095/2015/lectures/"
+ "f" + i + "-" + t + "x" + t);
fos = new FileOutputStream(new File("fil" + i + "-" + t + "x" + t + ".pdf"));
fos.write(b, 0, line.length());
i += 0.5;
t += 1;
if(t > 2) {
t = 1;
}
}
}
}
is.close();
System.out.println("Download finished");
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
The files I get are damaged. Is there a better way to download the PDF files? On this site the files happen to be named f1-1x1, f1-2x2, f2-1x1, and so on, but what if they were named donalds.pdf, stack.pdf, etc.?
So the question is: how do I improve my code so that it downloads all the PDF files?

Basically you are asking: "How can I parse HTML reliably, to identify all download links that point to PDF files?"
Anything else (like what you have right now, which tries to anticipate what the links look like) will be a constant source of grief, because any update to the web site, or any attempt to run your code against a different web site, is very likely to break it. HTML is complex and comes in so many flavors that you should simply forget about "easy" solutions for analysing HTML content.
In that sense: learn how to use an HTML parser; a good starting point could be Which HTML Parser is the best?
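As a rough illustration of the parser-based approach, here is a minimal sketch that collects every link ending in .pdf using the HTML parser bundled with the JDK (javax.swing.text.html, part of the java.desktop module). The class name PdfLinkFinder, the method findPdfLinks and the example.com base URL are made up for this example. A dedicated third-party parser such as jsoup is likely to cope better with malformed markup.

```java
import java.io.IOException;
import java.io.Reader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the original question or answer.
final class PdfLinkFinder {

    /** Returns the absolute URIs of all <a href="..."> targets whose path ends in ".pdf". */
    static List<URI> findPdfLinks(Reader html, URI base) throws IOException {
        List<URI> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.A) {
                    return; // only anchors are of interest
                }
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href == null) {
                    return;
                }
                try {
                    // Resolve relative links against the page URL.
                    URI target = base.resolve(href.toString().trim());
                    String path = target.getPath();
                    if (path != null && path.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                        links.add(target);
                    }
                } catch (IllegalArgumentException ignored) {
                    // Skip href values that are not valid URIs.
                }
            }
        };
        // "true" tells the parser to ignore any charset declared inside the document.
        new ParserDelegator().parse(html, callback, true);
        return links;
    }
}
```

Each returned URI can then be downloaded under its own file name, so nothing has to be guessed about how the files are called.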

Related

Extracted ZIP file failed - No files inside, and having .zip25 extension after extract

I was trying to extract a ZIP file on my Linux machine. The extraction runs, but the output is wrong: the extracted folder contains no files and has a .zip25 extension. I searched for this, and some sources say it means the archive is corrupted. However, I don't think it is corrupted, because I am able to open and extract the ZIP file perfectly locally (in a Windows directory).
Example:
Zip file: FolderZip.zip
After extract: FolderZip.zip25 (Note: this is the extracted output, but it still has the .zip25 extension, and the files inside are missing.)
Below is my code. I've worked on this for almost a month but still can't figure it out. Can someone help me figure out what I did wrong?
public void unZipFolder(String zipFile, String outputFolder){
byte[] buffer = new byte[1024];
System.out.println("ZipFileLocation: " + zipFile);
LOG.info(" ZipFileLocation: " + zipFile);
File folder = new File(outputFolder);
if(!folder.exists())folder.mkdirs();
try{
FileInputStream fis = new FileInputStream(zipFile);
ZipInputStream zis = new ZipInputStream(fis);
ZipEntry ze = zis.getNextEntry();
while(ze != null) {
new File(folder.getParent()).mkdirs();
FileOutputStream fos = new FileOutputStream(folder);
File newFile = new File(outputFolder + FilenameUtils.indexOfLastSeparator(ze.getName()));
if (ze.isDirectory()) {
if (!newFile.isDirectory() && !newFile.mkdirs()) {
throw new IOException("Failed to create directory " + newFile);
}else if(ze.isDirectory()){
newFile.mkdirs();
continue;
}else{
int len;
while ((len = zis.read(buffer)) >= 0) {
fos.write(buffer, 0, len);
}
System.out.println("File Unzip: " + newFile);
LOG.info(" File Unzip: " + newFile);
newFile.mkdirs();
fos.close();
zis.closeEntry();
ze = zis.getNextEntry();
}
}
boolean result = Files.deleteIfExists(Paths.get(zipFile));
if (result) {
System.out.println("ZipFile is deleted....");
} else {
System.out.println("Unable to delete the file.....");
}
}
zis.closeEntry();
zis.close();
fis.close();
}catch(IOException ex){
ex.printStackTrace();
}
}
I'd love to be able to tell you exactly what's wrong with your code, but FileOutputStream fos = new FileOutputStream(folder); throws an exception because, well, folder is a directory, so you can't write to it.
I'm also scratching my head over what you're expecting new File(folder.getParent()).mkdirs(); to do.
I basically threw out your code and started again with...
public void unZipFolder(File zipFile, File outputFolder) throws IOException {
byte[] buffer = new byte[1024];
System.out.println("ZipFileLocation: " + zipFile);
System.out.println("outputFolder = " + outputFolder);
if (!outputFolder.exists() && !outputFolder.mkdirs()) {
throw new IOException("Unable to create output folder: " + outputFolder);
} else if (outputFolder.exists() && !outputFolder.isDirectory()) {
throw new IOException("Output is not a directory: " + outputFolder);
}
try (ZipFile zipper = new ZipFile(zipFile)) {
Enumeration<? extends ZipEntry> entries = zipper.entries();
while (entries.hasMoreElements()) {
ZipEntry ze = entries.nextElement();
File destination = new File(outputFolder, ze.getName());
if (ze.isDirectory()) {
if (!destination.exists() && !destination.mkdirs()) {
throw new IOException("Could not create directory: " + destination);
}
} else {
System.out.println("Writing " + destination);
try (InputStream is = zipper.getInputStream(ze); FileOutputStream fos = new FileOutputStream(destination)) {
// You could use is.transferTo(fos) here (Java 9+) but I'm a grumpy old coder
byte[] bytes = new byte[1024 * 4];
int bytesRead = -1;
while ((bytesRead = is.read(bytes)) != -1) {
fos.write(bytes, 0, bytesRead);
}
}
}
}
}
}
Now, what's important to know about this is, it expects the directory contents of the zip files to be relative (ie no root directory information). If your zip file does contain root directory information (ie C:/... or /...), then you're going to need to clean that up yourself.
Now, if you have trouble with this, I would suggest commenting out the "extraction" portion of the code and adding more System.out.println statements.
transferTo
After reading through the code for transferTo, it's basically doing the same thing as the code example above. So, if you want to reduce the code complexity (and the risk of bugs), you could use it. Being somewhat old school, I'd probably still do it the "old way" in order to support progress monitoring of some kind, but that's beyond the scope of the question.
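For comparison, a minimal sketch of the same per-entry copy written with InputStream.transferTo, which needs Java 9 or later. The class name EntryCopier and the method copyEntry are invented for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper showing the transferTo variant of the copy loop.
final class EntryCopier {

    /** Copies one zip entry to the destination file and returns the number of bytes written. */
    static long copyEntry(ZipFile zipper, ZipEntry entry, Path destination) throws IOException {
        try (InputStream is = zipper.getInputStream(entry);
             OutputStream os = Files.newOutputStream(destination)) {
            // transferTo reads until end of stream and writes everything to os.
            return is.transferTo(os);
        }
    }
}
```

The trade-off is the one described above: the loop disappears, but so does the natural place to report progress between chunks.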
"Security issues"
This one's a little harder to pin down, as no solution is 100% safe.
I modified the above code to use something like...
Path parent = outputFolder.toPath().toAbsolutePath();
String name = "../" + ze.getName();
Path child = parent.resolveSibling(new File(outputFolder, name).toPath());
And this ended up throwing a NoSuchFileException, so, at least you could "fail fast", assuming that's what you want.
You might also consider removing .., leading / or leading path specifications in an attempt to make the path "relative", but that could become complicated as something like somePath/../file could still be valid within your use case.
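One common way to "fail fast" here, sketched under the assumption that rejecting the entry is acceptable: resolve the entry name against the output folder, normalize the result, and refuse anything that ends up outside that folder. The names ZipEntryPaths and resolveInside are invented for this example. It does not cover every risk (symbolic links, for instance).

```java
import java.io.IOException;
import java.nio.file.Path;

// Hypothetical guard against entries such as "../evil.txt" or "/etc/passwd".
final class ZipEntryPaths {

    /** Returns the normalized target path, or throws if it would leave outputFolder. */
    static Path resolveInside(Path outputFolder, String entryName) throws IOException {
        Path root = outputFolder.toAbsolutePath().normalize();
        // normalize() collapses "a/../b" to "b", so harmless ".." segments still pass.
        Path target = root.resolve(entryName).normalize();
        if (!target.startsWith(root)) {
            throw new IOException("Entry escapes output folder: " + entryName);
        }
        return target;
    }
}
```

With this check, an entry named somePath/../file is still accepted because it stays inside the output folder, while ../file is rejected before anything is written.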

Java - Download file from URL with matching file name pattern

I want to download a few files from a URL. I know the start of each file name, but the rest differs; mostly it is a date, and it can be different for different files. Is there any way to download files matching a pattern from Java code?
If I open the URL below in Chrome, all the files are listed and I have to download the required files manually.
http://<ip_address>:<port>/MR/build/report/scan/daily/2021-12-13_120/data/
File names can be like the ones below. Each has a known name prefix and a date. The date can differ: it is either the same as in the URL or an older one.
scan_report_2021_12_13_120.txt
build_report_2021_12_10_110.txt
my_reportdata_2021_11_30_110.txt
As of now, my Java code is as shown below. I have to pass the complete URL with the exact file name to download a file. In most cases the date and number are the same as in the URL, so the program takes the date part from the URL, adds it to the file name, and passes that as the URL. But for some files it differs, and those I have to download manually.
private static void downloadFile(String remoteURLPath, String localPath) {
System.out.println("DownloadFileTest.downloadFile() Downloading from " + remoteURLPath + " to = " + localPath);
FileOutputStream fos = null;
try {
URL website = new URL(remoteURLPath);
ReadableByteChannel rbc = Channels.newChannel(website.openStream());
fos = new FileOutputStream(localPath);
fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
if (fos != null) {
try {
fos.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
The argument remoteURLPath is passed like http://<ip_address>:<port>/MR/build/report/scan/daily/2021-12-13_120/data/scan_report_2021_12_13_120.txt
And localPath is passed like C:\\MyDir\\MyData\\scan_report_2021_12_13_120.txt
The other files are requested the same way, with the date 2021_12_13_120. Files with a different date won't get downloaded; instead an empty file is created in the same directory, which I delete later since its size is 0.
Is there any way to pass a pattern here?
Like http://<ip_address>:<port>/MR/build/report/scan/daily/2021-12-13_120/data/scan_report_*.txt
And instead of passing the complete local path, is there any way to pass only the directory, so that the file is downloaded with exactly the same name as on the remote system?
On Linux I can use wget with pattern matching, but I am looking for a Java way that works on all platforms.
wget -r -np -nH --cut-dirs=10 -A "scan_report*.txt" "http://<ip_address>:<port>/MR/build/report/scan/daily/2021-12-13_120/data/"
Thanks to the comment from @FedericoklezCulloca. I modified my code using this answer.
My solution reads the whole HTML page and collects all href values, since they contain only the file names with extensions. I then filter that list against a second list of name prefixes to get the matching files, and download those using the code in the question.
Method to get the list of all href values from a URL. It could probably be optimised. I did not use any extra library.
private static List<String> getAllHREFListFromURL(String downloadURL) {
URL url;
InputStream is = null;
List<String> hrefListFromURL = new ArrayList<>();
try {
url = new URL(downloadURL);
is = url.openStream();
byte[] buffer = new byte[1024];
int bytesRead = -1;
StringBuilder page = new StringBuilder(1024);
while ((bytesRead = is.read(buffer)) != -1) {
String str = new String(buffer, 0, bytesRead);
page.append(str);
}
StringBuilder htmlPage = new StringBuilder(page);
String search_start = "href=\"";
String search_end = "\"";
while (!htmlPage.isEmpty()) {
int indexOf = htmlPage.indexOf(search_start);
if (indexOf != -1) {
String substring = htmlPage.substring(indexOf + search_start.length());
String linkName = substring.substring(0, substring.indexOf(search_end));
hrefListFromURL.add(linkName);
htmlPage = new StringBuilder(substring);
} else {
htmlPage = new StringBuilder();
}
}
} catch (MalformedURLException e1) {
e1.printStackTrace();
} catch (IOException ex) {
ex.printStackTrace();
} finally {
try {
is.close();
} catch (Exception e) {
}
}
return hrefListFromURL;
}
Method to get the list of files that I needed.
private static List<String> getDownloadList(List<String> allHREFListFromURL) {
List<String> filesList = getMyFilesList();
List<String> downloadList = new ArrayList<>();
for (String fileName : filesList) {
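// Match hrefs that start with the known prefix. A trailing "*" is a glob wildcard, not a regex one,
// so the prefix is quoted and anchored instead.
Predicate<String> fileFilter = Pattern.compile("^" + Pattern.quote(fileName)).asPredicate();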
List<String> collect = allHREFListFromURL.stream().filter(fileFilter).collect(Collectors.toList());
downloadList.addAll(collect);
}
return downloadList;
}
private static List<String> getMyFilesList() {
List<String> filesList = new ArrayList<>();
filesList.add("scan_report");
filesList.add("build_report");
filesList.add("my_reportdata");
return filesList;
}
I iterate over the downloadList and use my original download method to download each file.
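The question also asked about passing only a target directory and keeping the remote file name. A minimal sketch of that part, assuming the href values are plain file names as described above; PatternDownloader, matching and download are names invented for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: filter file names by prefix/suffix and download into a directory.
final class PatternDownloader {

    /** Keeps only the names that start with prefix and end with suffix, e.g. "scan_report" and ".txt". */
    static List<String> matching(List<String> hrefs, String prefix, String suffix) {
        Pattern p = Pattern.compile("^" + Pattern.quote(prefix) + ".*" + Pattern.quote(suffix) + "$");
        return hrefs.stream().filter(p.asPredicate()).collect(Collectors.toList());
    }

    /** Downloads directoryUrl + fileName into localDir, keeping the remote file name. */
    static Path download(URI directoryUrl, String fileName, Path localDir) throws IOException {
        Files.createDirectories(localDir);
        Path target = localDir.resolve(fileName);
        // directoryUrl is assumed to end with "/", so resolve() appends the file name.
        try (InputStream in = directoryUrl.resolve(fileName).toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

Calling matching(allHrefs, "scan_report", ".txt") plays roughly the role of the wget -A "scan_report*.txt" filter, and each result can be passed to download together with the listing URL and the local directory.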

Downloading an image in java

I have to download an image from the NASA website. The problem is that my code sometimes works, successfully downloading an image, while at other times it saves only 186 B (I don't know why exactly 186).
The problem is surely connected with the way NASA shares those photos. For instance, the image at https://mars.jpl.nasa.gov/msl-raw-images/msss/00001/mcam/0001ML0000001000I1_DXXX.jpg is saved successfully, while the one at https://mars.nasa.gov/mer/gallery/all/2/f/001/2F126468064EDN0000P1001L0M1-BR.JPG fails.
Here is my code:
public static void saveImage(String imageUrl, String destinationFile){
URL url;
try {
url = new URL(imageUrl);
System.out.println(url);
InputStream is = url.openStream();
OutputStream os = new FileOutputStream(destinationFile);
byte[] b = new byte[2048];
int length;
while ((length = is.read(b)) != -1) {
os.write(b, 0, length);
}
is.close();
os.close();
} catch (MalformedURLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Does someone have an idea why it doesn't work?
public boolean downloadPhotosSol(int i) throws JSONException, IOException {
String url0 = "https://api.nasa.gov/mars-photos/api/v1/rovers/spirit/photos?sol=" + this.chosenMarsDate + "&camera=" + this.chosenCamera + "&page=" + i + "&api_key=###";
JSONObject json = JsonReader.readJsonFromUrl(url0);
if(json.getJSONArray("photos").length() == 0) return true;
String workspace = new File(".").getCanonicalPath();
String pathToFolder = workspace+File.separator+this.getManifest().getName() + this.chosenMarsDate + this.chosenCamera +"Strona"+i;
new File(pathToFolder).mkdirs();
for(int j = 0;j<json.getJSONArray("photos").length();j++) {
String url = ((JSONObject) json.getJSONArray("photos").get(j)).getString("img_src");
SaveImage.saveImage(url, pathToFolder+File.separator+"img"+j+".jpg");
}
return false;
}
When you get a 186 byte file, open it with a text editor and see what is inside. It could contain an HTTP error message in HTML format. If instead you see the first 186 bytes of your image file, then something is not working right with your program.
EDIT: From your comments it looks like you are getting an HTTP 301 response, which is a redirect to another location. A web browser handles this automatically without you noticing. However, your Java program is not following the redirect to the new location. You need to use an HTTP Java library that handles redirects.
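One option that ships with the JDK (Java 11 and later) is java.net.http.HttpClient, which can be told to follow redirects. A minimal sketch; the class name RedirectAwareDownloader and the method download are invented for this example, and it buffers the whole response in memory, which is fine for single images but not for very large files.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper that follows 301/302 redirects before saving the body.
final class RedirectAwareDownloader {

    // Redirect.NORMAL follows redirects, except from HTTPS to plain HTTP.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    /** Downloads source to destination and returns the number of bytes written. */
    static long download(URI source, Path destination) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(source).GET().build();
        HttpResponse<byte[]> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofByteArray());
        if (response.statusCode() != 200) {
            // Fail loudly instead of saving an HTML error page as "image.jpg".
            throw new IOException("Unexpected HTTP status " + response.statusCode() + " for " + source);
        }
        Files.write(destination, response.body());
        return response.body().length;
    }
}
```

Checking the status code before writing also avoids the situation from the question, where a short error or redirect body silently ends up on disk under an image file name.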
The best and shortest way of doing it:
try(InputStream in = new URL("http://example.com/image.jpg").openStream()){
Files.copy(in, Paths.get("C:/File/To/Save/To/image.jpg"));
}

Java. Save File to Client side not working

I want to save a file on the client side. How can it be done?
When I start the server locally everything works and the files are saved in the expected place, but when it runs on the server the files are saved on the server side, because System.getProperty("user.home") returns /root.
The user selects a file from the system and wants to open it. Code example:
mylog.pl("Blob in use + stop counter:" + stop);
File file = new File(SU.userHome + "/" + fileName);
mylog.pl("File maked ! Path:" + file.getAbsolutePath());
in = blob.getBinaryStream();
out = new FileOutputStream(file);
byte[] buff = new byte[4096];
int len = 0;
while ((len = in.read(buff)) != -1) {
out.write(buff, 0, len);
}
try {
mylog.pl("Desktop Open!");
if (Desktop.isDesktopSupported())
{
Desktop.getDesktop().open(file);
}
else
{
mylog.pl("Desktop is not suported!");
//For other IS
DesktopApi.open(file);
}
}
catch (Exception e) {
mylog.pl("err # runtime" + e.getMessage());
}
Thanks! Correct answers guaranteed!
//From server to client
final FileResource res = new FileResource(file);
FileDownloader fd = new FileDownloader(res);
p.open(res, "MyWindow", false);
file.delete();

Download a file through the Internet with RandomAccessFile

I was browsing the Internet for random Java code, and I found this source code for a download manager. It uses RandomAccessFile to download the files. The one thing I could not figure out though, was where it would download to. Here is the method that downloads the file:
public void startDownload() {
System.out.println("Starting...");
RandomAccessFile file = null;
InputStream stream = null;
try {
URL downloadLink = new URL("http://www.website.com/file.txt");
// Open the connection to the URL
HttpURLConnection connection = (HttpURLConnection) downloadLink.openConnection();
// Specify what portion of file to download
connection.setRequestProperty("Range", "bytes=" + downloaded + "-");
// Connect to the server
connection.connect();
// Make sure the code is in the 200 range
if (connection.getResponseCode() / 100 != 2) {
error();
}
// Check for valid content length
int contentLength = connection.getContentLength();
if (contentLength < 1) {
error();
}
// Set the size for the download if it hasn't been already set
if (size == -1) {
size = contentLength;
stateChanged();
}
// Open file and seek to the end of it
file = new RandomAccessFile(getFileName(downloadLink), "rw");
// getFileName returns the name of the file mentioned in the URL
file.seek(downloaded);
stream = connection.getInputStream();
while (status == DOWNLOADING) {
System.out.println("Progress: " + getProgress() + "%");
// Size the buffer according to how much of the file is left to download
byte buffer[];
if (size - downloaded > MAX_BUFFER_SIZE) {
buffer = new byte[MAX_BUFFER_SIZE];
} else {
buffer = new byte[size - downloaded];
}
// Read from the server into the buffer
int read = stream.read(buffer);
if (read == -1) {
break;
}
// Write buffer to file
file.write(buffer, 0, read);
downloaded += read;
stateChanged();
}
if (status == DOWNLOADING) {
status = COMPLETE;
stateChanged();
}
} catch (Exception e) {
error();
} finally {
// Close the stream and RAF
}
System.out.println("Done!");
}
I am sorry if this is obvious. I am new to the RandomAccessFile class, as I just learned of it today.
It will download it in the current working directory (i.e. where you run your java command) and the name of the file will be given by getFileName(downloadLink).
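To see where the file would land, here is a small sketch. Since the original getFileName is not shown, the version below is only a guess at what such a method might do (take the last segment of the URL path); DownloadLocation and resolvedLocation are likewise names invented for this example.

```java
import java.io.File;
import java.net.URL;
import java.nio.file.Path;

// Hypothetical stand-ins: the real getFileName from the download manager is not shown.
final class DownloadLocation {

    /** Guess at getFileName: the part of the URL path after the last '/'. */
    static String getFileName(URL url) {
        String path = url.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** A relative name is resolved against the working directory (system property "user.dir"). */
    static Path resolvedLocation(String fileName) {
        return new File(fileName).getAbsoluteFile().toPath();
    }
}
```

Printing resolvedLocation(getFileName(downloadLink)) before opening the RandomAccessFile shows the exact path, provided getFileName really returns a relative name as assumed here.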
I am new to this too. getFileName appears to be a method within the same class and that code is missing.
