Java: Read in text files from a directory, from the internet - java

Does anybody know how to recursively read in files from a specific directory on the internet, in Java?
I want to read in all the text files from this web directory: http://www.cs.ucdavis.edu/~davidson/courses/170-S11/Female/
I know how to read in multiple files that are in a folder on my computer, and I know how to read in a single file from the internet. But how can I read in multiple files from the internet without hardcoding the URLs?
Stuff I tried:
// List the files on my Desktop
final File folder = new File("/Users/crystal/Desktop");
File[] listOfFiles = folder.listFiles();
for (int i = 0; i < listOfFiles.length; i++) {
    File fileEntry = listOfFiles[i];
    if (!fileEntry.isDirectory()) {
        System.out.println(fileEntry.getName());
    }
}
Another thing I tried:
// Reading data from the web
try
{
    // Create a URL object
    URL url = new URL("http://www.cs.ucdavis.edu/~davidson/courses/170-S11/Female/5_1_1.txt");
    // Read all of the text returned by the HTTP server
    BufferedReader in = new BufferedReader (new InputStreamReader(url.openStream()));
    String htmlText; // String that holds current file line
    // Read through file one line at a time. Print line
    while ((htmlText = in.readLine()) != null)
    {
        System.out.println(htmlText);
    }
    in.close();
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    // If another exception is generated, print a stack trace
    e.printStackTrace();
}
Thanks!

Since the URL you mentioned has indexes enabled, you're in luck.
You've got a few options here.
Parse the HTML and read the href attribute of each a tag, using SAX2 or any other XML parser (this only works if the page is well-formed). HtmlUnit would also work, I think.
Use a little regexp magic to match every string between <a href=" and "> and use those as the URLs to read from.
Once you've got a list of all the URLs you need, then the second piece of code should work just fine. Just iterate over your list, and construct your URL from that list.
Here's a sample regex that should match what you want. It does catch a few extra links, but you should be able to filter those out.
<a\ href="(.+?)">
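The regex approach could be sketched roughly like this. It is only a sketch, assuming the index page uses plain double-quoted href attributes; the class name IndexLinks and the method textFileUrls are made up for illustration, and the pattern is a slightly loosened variant of the one above. It keeps only links ending in .txt, which filters out the extra links (sort columns, parent directory).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexLinks {
    // Captures the href value of each anchor tag; the lazy group stops at the first closing quote.
    private static final Pattern HREF =
            Pattern.compile("<a\\s+href=\"(.+?)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the absolute URL of every link in the listing that ends in ".txt". */
    static List<String> textFileUrls(String indexHtml, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(indexHtml);
        while (m.find()) {
            String href = m.group(1);
            if (href.endsWith(".txt")) {
                // Relative links are resolved against the directory URL.
                urls.add(base.resolve(href).toString());
            }
        }
        return urls;
    }
}
```

You would download the directory page itself with the second snippet from the question, pass its text and the directory URL to textFileUrls, and then open each returned URL the same way.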

Related

PDF Box - Unable to renameTo or Delete files

I'm fairly new to programming and I've been trying to use PDFBox for a personal project. I'm basically trying to check whether a PDF contains specific keywords, and if it does, I want to move the file to an "approved" folder.
I know the code below is poorly written, but I'm not able to move or delete the file correctly:
try (Stream<Path> filePathStream = Files.walk(Paths.get("C://pdfbox_teste"))) {
    filePathStream.forEach(filePath -> {
        if (Files.isRegularFile(filePath)) {
            String arquivo = filePath.toString();
            File file = new File(arquivo);
            try {
                // Loading an existing document
                PDDocument document = PDDocument.load(file);
                // Instantiate PDFTextStripper class
                PDFTextStripper pdfStripper = new PDFTextStripper();
                String text = pdfStripper.getText(document);
                String[] words = text.split("\\.|,|\\s");
                for (String word : words) {
                    // System.out.println(word);
                    if (word.equals("Revisão") || word.equals("Desenvolvimento")) {
                        // System.out.println(word);
                        if(file.renameTo(new File("C://pdfbox_teste//Aprovados//" + file.getName()))){
                            document.close();
                            System.out.println("Arquivo transferido corretamente");
                            file.delete();
                        };
                    }
                }
                System.out.println("Fim do documento: " + arquivo);
                System.out.println("----------------------------");
                document.close();
            } catch (Exception e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
    });
}
I wanted the files to be moved into the new folder. Instead, sometimes they only get deleted and sometimes nothing happens. I imagine the error is probably in the forEach, but I can't seem to find a way to fix it.
You try to rename the file while it is still open, and only close it afterwards:
// your code, does not work
if(file.renameTo(new File("C://pdfbox_teste//Aprovados//" + file.getName()))){
    document.close();
    System.out.println("Arquivo transferido corretamente");
    file.delete();
};
Try to close the document first, so the file is no longer accessed by your process, and then it should be possible to rename it:
// fixed code:
document.close();
if(file.renameTo(new File("C://pdfbox_teste//Aprovados//" + file.getName()))){
    System.out.println("Arquivo transferido corretamente");
}
And as Mahesh K pointed out, you don't have to delete the (original) file after you have renamed it. Renaming does not create a duplicate that would leave the original to clean up; it just moves the file to its new name.
After calling renameTo, you shouldn't call delete. As I understand it, renameTo works like a move command.
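If you are on Java 7 or later, java.nio.file.Files.move may be easier to debug than File.renameTo, because it throws an IOException describing the failure instead of just returning false. Here is a minimal sketch, assuming the document has already been closed before the move; the class and method names (MoveToApproved, moveToApproved) are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class MoveToApproved {
    /** Moves the file into approvedDir (created if missing) and returns its new location. */
    static Path moveToApproved(Path file, Path approvedDir) throws IOException {
        Files.createDirectories(approvedDir);
        Path target = approvedDir.resolve(file.getFileName());
        // A move leaves no copy behind, so no delete call is needed afterwards.
        return Files.move(file, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

REPLACE_EXISTING is a choice made for this sketch; drop it if an existing file in the target folder should cause an error instead.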

reading an external file using TextIO

I don't understand how to use TextIO's readFile(String Filename)
Can someone please explain how can I read an external file?
public static void readFile(String fileName) {
    if (fileName == null) // Go back to reading standard input
        readStandardInput();
    else {
        BufferedReader newin;
        try {
            newin = new BufferedReader( new FileReader(fileName) );
        }
        catch (Exception e) {
            throw new IllegalArgumentException("Can't open file \"" + fileName + "\" for input.\n"
                    + "(Error :" + e + ")");
        }
        if (! readingStandardInput) { // close current input stream
            try {
                in.close();
            }
            catch (Exception e) {
            }
        }
        emptyBuffer(); // Added November 2007
        in = newin;
        readingStandardInput = false;
        inputErrorCount = 0;
        inputFileName = fileName;
    }
}
I had to use TextIO for a school assignment and I got stuck on it too. The problem I had was that using the Scanner class I could just pass the name of the file as long as the file was in the same folder as my class.
// Pass a File; new Scanner("data.txt") would scan the literal text "data.txt" instead of the file
Scanner fileScanner = new Scanner(new File("data.txt"));
That works fine. But with TextIO, this won't work:
TextIO.readFile("data.txt"); // can't find file
You have to include the path to the file like this:
TextIO.readFile("src/package/data.txt");
Not sure if there is a way to get it to work like the Scanner class or not, but this is what I've been doing in my course at school.
The above answer (about using the correct file name) is correct, however, as a clarification, make sure that you actually use the proper file path. The file path suggested above, i.e. src/package/ will not work in all circumstances. While this will be obvious to some, for those of you who need clarification, keep reading.
For example (and I use NetBeans), if you have already moved the file into NetBeans, and the file is already in the folder you want it to be in, then right click on the folder itself, and click 'properties'. Then expand the 'file path' section by clicking on the three dots next to the hidden file path. You will see the actual file path in its entirety.
For example, if the entire file path is:
C:\Users..\NetBeansProjects\IceCream\src\icecream\icecream.dat
Then, in the java code file itself, you can write:
TextIO.readFile("src/icecream/icecream.dat");
In other words, make sure you include the words 'src' but also everything that follows the src as well. If it's in the same folder as the rest of the files, you won't need anything prior to the 'src'.
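The reason the src prefix matters is that a relative file name is resolved against the JVM's working directory (the user.dir system property), which IDEs such as NetBeans typically set to the project root rather than to the folder holding your classes. If you are unsure where Java is looking, a small check like the following can help. The class name WhereIsMyFile is just an illustration, and the path is the example path from above.

```java
import java.io.File;

class WhereIsMyFile {
    /** Returns the absolute location that a relative file name is resolved to. */
    static String resolve(String relativeName) {
        return new File(relativeName).getAbsolutePath();
    }

    public static void main(String[] args) {
        // The working directory is the starting point for every relative path.
        System.out.println("Working directory: " + System.getProperty("user.dir"));
        System.out.println("Resolved to: " + resolve("src/icecream/icecream.dat"));
    }
}
```

If the printed location is not where the file actually lives, adjust the relative path until it is.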

How do I get a Java resource as a File?

I have to read a file containing a list of strings. I'm trying to follow the advice in this post. Both solutions require using FileUtils.readLines, but use a String, not a File as the parameter.
Set<String> lines = new HashSet<String>(FileUtils.readLines("foo.txt"));
I need a File.
This post would be my question, except the OP was dissuaded from using files entirely. I need a File if I want to use the Apache method, which is my preferred solution to my initial problem.
My file is small (a hundred lines or so) and a singleton per program instance, so I do not need to worry about having another copy of the file in memory. Therefore I could use more basic methods to read the file, but so far it looks like FileUtils.readLines could be much cleaner. How do I go from a resource to a File?
Apache Commons-IO has an IOUtils class as well as a FileUtils, which includes a readLines method similar to the one in FileUtils.
So you can use getResourceAsStream or getSystemResourceAsStream and pass the result of that to IOUtils.readLines to get a List<String> of the contents of your file:
List<String> myLines = IOUtils.readLines(ClassLoader.getSystemResourceAsStream("my_data_file.txt"));
I am assuming the file you want to read is a true resource on your classpath, and not simply some arbitrary file you could just access via new File("path_to_file");.
Try the following using ClassLoader, where resource is a String representation of the path to your resource file in your class path.
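Valid String values for resource can include:
"foo.txt"
"com/company/bar.txt"
Resource names use / as the separator on every platform, and a name passed to a ClassLoader does not start with a slash. Backslash-separated names such as "com\\company\\bar.txt" are not portable and will not be found inside a JAR.
The path is not limited to com.company.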
Relevant code to get a File not in a JAR:
File file = null;
try {
    URL url = null;
    ClassLoader classLoader = {YourClass}.class.getClassLoader();
    if (classLoader != null) {
        url = classLoader.getResource(resource);
    }
    if (url == null) {
        url = ClassLoader.getSystemResource(resource);
    }
    if (url != null) {
        try {
            file = new File(url.toURI());
        } catch (URISyntaxException e) {
            file = new File(url.getPath());
        }
    }
} catch (Exception ex) { /* handle it */ }
// file may be null
Alternately, if your resource is in a JAR, you will have to use Class.getResourceAsStream(resource); and cycle through the file using a BufferedReader to simulate the call to readLines().
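The BufferedReader loop mentioned for the JAR case might look something like this. It is a sketch using only the JDK; the class name ResourceLines and both method names are invented here, and UTF-8 is an assumption about the file's encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ResourceLines {
    /** Reads every line from the stream and closes it afterwards. */
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    /** Loads a classpath resource; with Class.getResourceAsStream a leading "/" means the classpath root. */
    static List<String> readResource(String name) throws IOException {
        InputStream in = ResourceLines.class.getResourceAsStream(name);
        if (in == null) {
            throw new IOException("Resource not found on classpath: " + name);
        }
        return readLines(in);
    }
}
```

A call such as new HashSet<>(ResourceLines.readResource("/foo.txt")) would then give the Set<String> from the question, whether or not the resource is packed in a JAR.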
Using a resource to read the file to a string:
String contents =
    FileUtils.readFileToString(
        new File(this.getClass().getResource("/myfile.log").toURI()));
Using an input stream:
List<String> listContents =
    IOUtils.readLines(
        this.getClass().getResourceAsStream("/myfile.log"));

Add docx file in resources and create executable jar

I want to add docx files to the resources folder and use those files in code written in a class located in another package of the same application.
Then I want to make an executable jar out of it that works on Windows.
I read it's not easy to make such a jar :( and there is no foolproof way...
I tried searching the net and found that I would have to create a URL, then a File, and then use it...
However, when I use the code below, I am not able to get the URL itself...
URL urlOfDraftInSamePackage = CreateDraft.class.getResource("Draft_in_same_package.docx");
System.out.println("urlOfDraftInSamePackage is "+urlOfDraftInSamePackage.toString());
//This prints : urlOfDraftInSamePackage is file:/D:/aditya_workspace/SampleDraftMaker/bin/draftProcessing/Draft_in_same_package.docx
URL urlOfDraftInResourceFolder = CreateDraft.class.getResource("resouces/Draft_Apartment.docx");
System.out.println("urlOfDraftInResourceFolder is "+urlOfDraftInResourceFolder.toString());
//this gives null pointer exception
URI uri = null;
try {
    uri = urlOfDraftInSamePackage.toURI();
    File file = new File(uri);
    System.out.println("file made");
} catch (URISyntaxException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
below is my folder structure:
Can anyone please help me create such an executable jar using Eclipse?
Thanks In Advance!!!
Following code works for me:
public static void testResource() throws IOException {
    InputStream stream = Deserializace.class.getResourceAsStream("resources/ser.log");
    BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
    String s;
    while ( (s = reader.readLine()) != null) {
        System.out.println(s);
    }
}
Build directory structure:
Test.class
resources/ser.log
You must ensure that your resources directory is copied to the correct place in the build output.
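Once the application runs from a jar, the resource is an entry inside the archive rather than a file on disk, so new File(url.toURI()) no longer works. If the library that processes the docx really needs a File, one workaround is to copy the stream to a temporary file first. This is only a sketch; ResourceToFile and copyToTempFile are hypothetical names, and the "resource-" prefix is arbitrary.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class ResourceToFile {
    /** Copies the stream into a new temporary file and returns that file. */
    static File copyToTempFile(InputStream in, String suffix) throws IOException {
        Path temp = Files.createTempFile("resource-", suffix);
        temp.toFile().deleteOnExit(); // ask the JVM to clean up on normal exit
        try (InputStream source = in) {
            // createTempFile already created an empty file, so it has to be replaced.
            Files.copy(source, temp, StandardCopyOption.REPLACE_EXISTING);
        }
        return temp.toFile();
    }
}
```

You would pass it the result of getResourceAsStream (after checking that it is not null) together with a suffix such as ".docx".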

File Delete and Rename in Java

I have the following Java code, which searches an XML file for a specific tag, adds some text to it, and saves the result to a temporary file. I couldn't find a way to rename the temporary file to the original file. Please suggest a solution.
import java.io.*;
class ModifyXML {
    public void readMyFile(String inputLine) throws Exception
    {
        String record = "";
        File outFile = new File("tempFile.tmp");
        FileInputStream fis = new FileInputStream("InfectiousDisease.xml");
        BufferedReader br = new BufferedReader(new InputStreamReader(fis));
        FileOutputStream fos = new FileOutputStream(outFile);
        PrintWriter out = new PrintWriter(fos);
        while ( (record=br.readLine()) != null )
        {
            if(record.endsWith("<add-info>"))
            {
                out.println(" "+"<add-info>");
                out.println(" "+inputLine);
            }
            else
            {
                out.println(record);
            }
        }
        out.flush();
        out.close();
        br.close();
        //Also we need to delete the original file
        //outFile.renameTo(InfectiousDisease.xml);//Not working
    }
    public static void main (String[] args) {
        try
        {
            ModifyXML f = new ModifyXML();
            f.readMyFile("This is infectious disease data");
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }
}
Thanks
First delete the original file and then rename the new file:
File inputFile = new File("InfectiousDisease.xml");
File outFile = new File("tempFile.tmp");
if(inputFile.delete()){
outFile.renameTo(inputFile);
}
A good way to rename files is:
File file = new File("path-here");
file.renameTo(new File("new path here"));
There are several issues in your code.
First, your description mentions renaming the original file and adding some text to it. Your code doesn't do that; it opens two files, one for reading and one for writing (with the additional text). That is the right way to do things, as adding text in place is not really feasible with the techniques you are using.
The second issue is the temporary file. A file created with new File("tempFile.tmp") is an ordinary file that stays on disk after it is closed, so your added text ends up only in tempFile.tmp until you delete the original and rename the temporary file over it.
The third issue is that you are modifying XML files as plain text. This sometimes works, as XML files are a subset of plain text files, but there is no indication that you attempted to ensure that the output file is still valid XML. Perhaps you know more about your input files than is mentioned, but if you want this to work correctly for 100% of the input cases, you probably want to create a SAX writer that writes out everything a SAX reader reads, with the additional information in the correct tag location.
You can use
outFile.renameTo(new File(newFileName));
You have to ensure these files are not open at the time.
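Putting those points together, the rewrite-then-replace step might be sketched with java.nio.file like this. It is an assumption-laden sketch, not the original poster's code: the class ReplaceOriginal and the method addInfo are invented names, UTF-8 is assumed, and unlike the question's code it keeps the matching line unchanged and writes the new text on the line after it.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class ReplaceOriginal {
    /** Copies xml to a temp file, adding inputLine after each line ending in the add-info tag. */
    static void addInfo(Path xml, String inputLine) throws IOException {
        // Create the temp file next to the original so the final move stays on one file system.
        Path temp = Files.createTempFile(xml.toAbsolutePath().getParent(), "modify", ".tmp");
        try (BufferedReader in = Files.newBufferedReader(xml, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String record;
            while ((record = in.readLine()) != null) {
                out.write(record);
                out.newLine();
                if (record.endsWith("<add-info>")) {
                    out.write(inputLine);
                    out.newLine();
                }
            }
        }
        // Both files are closed at this point, so the temp file can replace the original.
        Files.move(temp, xml, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

With REPLACE_EXISTING there is no separate delete step, and a failure shows up as an IOException rather than a false return value.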
