Groovy script xml parser for multiple files - java

Hi, my Groovy script strips out an XML tag from a file and writes its content to another file.
import org.apache.commons.lang.RandomStringUtils
import groovy.util.XmlSlurper
inputFile = 'C:\\sample.xml'
outputFile = 'C:\\ouput.txt'
XMLTag='Details'
fileContents = new File(inputFile).getText('UTF-8')
def xmlFile=new XmlSlurper().parseText(fileContents)
def myPayload= new String(xmlFile.'**'.find{node-> node.name() == XMLTag} *.text().toString())
file = new File(outputFile)
w = file.newWriter()
w << myPayload.substring(1, myPayload.length()-1)
w.close()
My question is: how do I write it so that it goes through an entire directory, performs this on multiple XML files, and creates multiple output files? At the moment the paths are hard-coded ('C:\sample.xml' and 'C:\ouput.txt').
Thanks
Leon

First, I would recommend that you take what you have and put it into a single function; it's good coding practice and improves readability.
Now, to execute the function on each XML file in a directory, you can use Groovy's File.eachFileMatch(). For example, if you want to run it on each XML file in the current directory, you could do:
import org.apache.commons.lang.RandomStringUtils
import groovy.util.XmlSlurper
import static groovy.io.FileType.*
void stripTag(File inputFile, String outputFile) {
    def XMLTag = 'Details'
    def fileContents = inputFile.getText('UTF-8')
    def xmlFile = new XmlSlurper().parseText(fileContents)
    def myPayload = new String(xmlFile.'**'.find { node -> node.name() == XMLTag }*.text().toString())
    def file = new File(outputFile)
    def w = file.newWriter()
    w << myPayload.substring(1, myPayload.length() - 1)
    w.close()
}
// This will match all files in the current directory with the file extension .xml
new File(".").eachFileMatch(FILES, ~/.*\.xml/) { File input ->
    // Set the output file name to be <file_name>_out.txt
    // Change this as you need
    stripTag(input, input.name + "_out.txt")
}
If you want to, you can add reading in the directory from the command line as well.
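For comparison, the same directory loop can be sketched in plain Java. This is only a sketch under assumptions, not the answer's Groovy code: the class name StripTagFromDirectory and its two helper methods are made up for illustration, the directory is taken from the first command-line argument (defaulting to the current directory), and each output file is written next to its input as <file_name>_out.txt.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class StripTagFromDirectory {

    // Returns the text of the first element named tagName, or "" if there is none.
    static String firstTagText(Path xmlFile, String tagName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xmlFile.toFile());
        NodeList nodes = doc.getElementsByTagName(tagName);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }

    // Writes <file_name>_out.txt for every *.xml file in dir; returns how many were processed.
    static int processDirectory(Path dir, String tagName) throws Exception {
        int count = 0;
        try (DirectoryStream<Path> xmlFiles = Files.newDirectoryStream(dir, "*.xml")) {
            for (Path xml : xmlFiles) {
                Path out = dir.resolve(xml.getFileName() + "_out.txt");
                Files.write(out, firstTagText(xml, tagName).getBytes(StandardCharsets.UTF_8));
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the directory is the first argument, "." when none is given.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(processDirectory(dir, "Details") + " file(s) processed");
    }
}
```

The glob "*.xml" is matched against the file name only, so the generated _out.txt files are never picked up as input.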

Related

Running multiple PDFs through a PDFBox program

Currently I am trying to use PDFBox in Eclipse to run multiple PDF files in a folder through a text reader that will extract certain terms and output them into a text file that I will then convert to an excel sheet. Currently I have the program and it works correctly for a single PDF file:
public static void main(String args[]) throws IOException {
//Loading an existing document
File file = new File("ADE_acetylfuranoside_120319_pfister.pdf");
PDDocument document = PDDocument.load(file);
//Instantiate PDFTextStripper class
PDFTextStripper pdfStripper = new PDFTextStripper();
//Retrieving text from PDF document
String text = pdfStripper.getText(document);
//..."Actual code that extracts text"...
PrintStream o = new PrintStream(new File("output.txt"));
PrintStream console = System.out;
System.setOut(o);
System.out.println(finalSheet);
My problem is that I want to run 500 PDFs in one folder through this program in Eclipse rather than putting in the name of each one individually. I also want it to output like:
Name1, Number1, ID1
Name2, Number2, ID2
But I think the way it is written now it will just overwrite line number one if I run multiple PDFs through it.
Thanks for the help!
For the first part, you could just use the File class with a FileFilter:
// directoryName could be as simple as "."
File folder = new File(directoryName);
File[] listOfFiles = folder.listFiles(new FileFilter() {
    @Override
    public boolean accept(File pathname) {
        return pathname.getName().toLowerCase().endsWith(".pdf");
    }
});
This gives you an array of File objects of all the files in a particular folder/directory. Now you can loop through it with pretty much the code you have.
On the output side, you'll likely want to correlate the output with the input. I'm a bit confused by your code and I'm guessing you'd just like an output file for each input file. So, perhaps, something like:
// index is the value you used to loop through the `listOfFiles` array
try (FileWriter fileWriter = new FileWriter(listOfFiles[index].getName() + ".output.txt")) {
    fileWriter.write(text); // the String text you want in the file
}
This creates a file named (as taken from your example) "ADE_acetylfuranoside_120319_pfister.pdf.output.txt". Obviously this could change. In this case a new file is created for each input file.
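If one combined file with a line per PDF is wanted instead (the "Name1, Number1, ID1" layout from the question), a minimal sketch could open the output once, outside the loop, so that rows accumulate rather than overwrite each other. The class PdfBatch and the method extractRow are hypothetical placeholders: the real extractRow would load the PDF with PDFBox and build the line from the extracted text, while here it only returns the file name and size so the sketch stays self-contained.

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PdfBatch {

    // Hypothetical placeholder for the PDFBox extraction step.
    static String extractRow(File pdf) throws IOException {
        return pdf.getName() + ", " + pdf.length();
    }

    // Writes one line per PDF found in folder to outputFile; returns the number of PDFs.
    static int writeRows(File folder, File outputFile) throws IOException {
        File[] pdfs = folder.listFiles(
                f -> f.isFile() && f.getName().toLowerCase().endsWith(".pdf"));
        if (pdfs == null) {
            throw new IOException("Not a readable directory: " + folder);
        }
        Arrays.sort(pdfs); // listFiles() makes no guarantee about ordering
        // The writer is opened once, so each println adds a new row.
        try (PrintWriter out = new PrintWriter(outputFile, StandardCharsets.UTF_8.name())) {
            for (File pdf : pdfs) {
                out.println(extractRow(pdf));
            }
        }
        return pdfs.length;
    }
}
```

Writing through a dedicated PrintWriter also avoids redirecting System.out with System.setOut, which affects every other print in the program.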

Provide log if the two files are identical and have the same contents in Java

I have the code below, which reads a file from a particular directory, processes it, and then moves it to an archive directory. This works fine. I receive a new file every day, and a Control-M scheduler job runs this process.
On the next run, the new file from that directory is compared with the file in the archive directory, and it is processed only if the content is different; otherwise nothing is done. A shell script does this comparison, and we don't see any log output for it.
Now I want my Java code to produce a log message, 'files are identical', when the file in that directory and the file in the archive directory are identical. But I don't know exactly how to do this. I don't want to write logic to process or move anything; I just need to check whether the files are equal and, if they are, produce the log message. The files I receive are not very big; the maximum size is about 10 MB.
Below is my code:
for(Path inputFile : pathsToProcess) {
// read in the file:
readFile(inputFile.toAbsolutePath().toString());
// move the file away into the archive:
Path archiveDir = Paths.get(applicationContext.getEnvironment().getProperty(".archive.dir"));
Files.move(inputFile, archiveDir.resolve(inputFile.getFileName()),StandardCopyOption.REPLACE_EXISTING);
}
return true;
}
private void readFile(String inputFile) throws IOException, FileNotFoundException {
log.info("Import " + inputFile);
try (InputStream is = new FileInputStream(inputFile);
Reader underlyingReader = inputFile.endsWith("gz")
? new InputStreamReader(new GZIPInputStream(is), DEFAULT_CHARSET)
: new InputStreamReader(is, DEFAULT_CHARSET);
BufferedReader reader = new BufferedReader(underlyingReader)) {
if (isPxFile(inputFile)) {
Importer.processField(reader, tablenameFromFilename(inputFile));
} else {
Importer.processFile(reader, tablenameFromFilename(inputFile));
}
}
log.info("Import Complete");
}
}
Based on the limited information about the size of file or performance needs, something like this can be done. This may not be 100% optimized, but just an example. You may also have to do some exception handling in the main method, since the new method might throw an IOException:
import org.apache.commons.io.FileUtils; // Add this import statement at the top
// Moved this statement outside the for loop, as it seems there is no need to fetch the archive directory path multiple times.
Path archiveDir = Paths.get(applicationContext.getEnvironment().getProperty("betl..archive.dir"));
for (Path inputFile : pathsToProcess) {
    // Added this code
    if (checkIfFileMatches(inputFile, archiveDir)) {
        // Add the logger here, for example:
        log.info("Files are identical: " + inputFile);
    }
    // Added the else condition, so that if the files do not match, only then you read, process in DB and move the file over to the archive.
    else {
        // read in the file:
        readFile(inputFile.toAbsolutePath().toString());
        Files.move(inputFile, archiveDir.resolve(inputFile.getFileName()), StandardCopyOption.REPLACE_EXISTING);
    }
}
//Added this method to check if the source file and the target file contents are same.
// This will need an import of the FileUtils class. You may change the approach to use any other utility file, or read the data byte by byte and compare. If the files are very large, probably better to use Buffered file reader.
private boolean checkIfFileMatches(Path sourceFilePath, Path targetDirectoryPath) throws IOException {
if (sourceFilePath != null) { // may not need this check
File sourceFile = sourceFilePath.toFile();
String fileName = sourceFile.getName();
File targetFile = new File(targetDirectoryPath + "/" + fileName);
if (targetFile.exists()) {
return FileUtils.contentEquals(sourceFile, targetFile);
}
}
return false;
}
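If adding Commons IO is not an option, the same check can be sketched with the standard library only. This assumes, as stated in the question, that the files are small (around 10 MB at most), so reading both fully into memory is acceptable; the class name FileCompare and the method sameContent are illustrative, not part of the original code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class FileCompare {

    // True only if target exists and holds exactly the same bytes as source.
    static boolean sameContent(Path source, Path target) throws IOException {
        if (!Files.isRegularFile(target)) {
            return false; // nothing archived under that name yet
        }
        if (Files.size(source) != Files.size(target)) {
            return false; // different lengths can never be equal, skip reading
        }
        return Arrays.equals(Files.readAllBytes(source), Files.readAllBytes(target));
    }
}
```

Note that this compares the bytes on disk, so for .gz inputs it compares the compressed data rather than the decompressed content.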

FileNotFoundException: Java cannot find a basic file while it's there

I'm trying to read a basic txt file that contains prices in euros. My program is supposed to loop through these prices and then create a new file with the other prices. Now, the problem is that java says it cannot find the first file.
It is in the exact same package like this:
Java already fails at the following code:
FileReader fr = new FileReader("prices_usd.txt");
Whole code :
import java.io.*;
public class DollarToEur {
public static void main(String[] arg) throws IOException, FileNotFoundException {
FileReader fr = new FileReader("prices_usd.txt");
BufferedReader br = new BufferedReader(fr);
FileWriter fw = new FileWriter("prices_eur");
PrintWriter pw = new PrintWriter(fw);
String regel = br.readLine();
while(regel != null) {
String[] values = regel.split(" : ");
String beschrijving = values[0];
String prijsString = values[1];
double prijs = Double.parseDouble(prijsString);
double newPrijs = prijs * 0.913;
pw.println(beschrijving + " : " + newPrijs);
regel = br.readLine();
}
pw.close();
br.close();
}
}
Your file looks to be named "prices_usd" and your code is looking for "prices_usd.txt"
There are a couple of things you need to do:
Put the file directly under the project folder in Eclipse. When you execute your code in Eclipse, the project folder is considered to be the working directory, so the file has to be there for Java to find it.
Rename the file so that it has the .txt extension. From your screenshot it looks like the file does not have an extension, or maybe it's just not visible.
Hope this helps!
It is bad practice to put resource files (like prices_usd.txt) in a package. Please put it under the resources/ directory. If you put it directly in the resources/ directory, you can access the file like this:
new FileReader(new File(this.getClass().getClassLoader().getResource("prices_usd.txt").getFile()));
But if you really have a good reason to put it in the package, you can access it like this:
new FileReader("src/main/java/week5/practicum13/prices_usd.txt");
But this will not work when you export your project (for example: as a jar).
EDIT 0: Also of course, your file's name needs to be "prices_usd.txt" and not just "prices_usd".
EDIT 1: In the first (recommended) solution, .getFile() returns a String that cannot be passed directly to the new File(...) constructor once the application is built, i.e. when it is not run from the IDE. Spring has a solution for this, though: org.springframework.core.io.ClassPathResource.
Simply use this code with Spring:
new FileReader(new ClassPathResource("prices_usd.txt").getFile());
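Without Spring, a common alternative is to read the resource as a stream instead of converting it to a File. The following is a minimal sketch, assuming prices_usd.txt sits at the root of the classpath (for example in the resources/ directory); the class ResourceReader and its open method are made-up names for illustration. ClassLoader.getResourceAsStream returns null when the resource is missing, so the sketch turns that into a FileNotFoundException.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class ResourceReader {

    // Opens a classpath resource for reading; works from the IDE and from a jar.
    static BufferedReader open(String resourceName) throws FileNotFoundException {
        InputStream in = ResourceReader.class.getClassLoader().getResourceAsStream(resourceName);
        if (in == null) {
            throw new FileNotFoundException("Not on the classpath: " + resourceName);
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}
```

In the question's program, `BufferedReader br = ResourceReader.open("prices_usd.txt");` would then take the place of the FileReader and BufferedReader lines.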

Creating single input stream for multiple files in a hadoop path - Java

I want to create a single input stream for multiple files in the HDFS path. The HDFS path contains many data files (e.g. data-1, data-2, ..., data-n). It also contains the _SUCCESS file. I want to create a single input stream for all the data files, excluding the _SUCCESS file.
Right now, I loop through the FileStatus array and create an input stream for each individual file. I then merge these input streams into a SequenceInputStream.
Below is the sample of my code:
Path filePath = new Path(hdfsPath);
hdfsFS = FileSystem.get(conf);
CharSequence fileNameFormat = "data";
FileStatus[] inputFiles = hdfsFS.listStatus(filePath);
Vector<InputStream> inputStreams = new Vector<InputStream>();
for (int i =0; i < inputFiles.length; i++)
{
System.out.println(inputFiles[i].getPath().getName());
if (inputFiles[i].getPath().getName().contains(fileNameFormat)) {
Path fileName = new Path(hdfsPath + inputFiles[i].getPath().getName());
fileInputStream = hdfsFS.open(fileName);
inputStreams.add(fileInputStream);
}
}
Enumeration<InputStream> enu = inputStreams.elements();
sequenceStream = new SequenceInputStream(enu);
I want to create the input stream directly; I do not want to copy any file to local storage first. Is there a more efficient way of doing this?
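One possible refinement of the loop above is to open each file only when the SequenceInputStream actually reaches it, instead of opening all of them up front. The sketch below illustrates that idea with the standard library only; it is not an HDFS-specific API. The class LazyConcat is a made-up name, and in-memory streams stand in for the HDFS files. In the question's setting, each opener would presumably be something like `() -> hdfsFS.open(status.getPath())` for every FileStatus whose name passes the filter.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;

class LazyConcat {

    // Each opener is invoked only when SequenceInputStream asks for the next part.
    static InputStream concat(List<Callable<InputStream>> openers) {
        Iterator<Callable<InputStream>> it = openers.iterator();
        return new SequenceInputStream(new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return it.hasNext();
            }

            @Override
            public InputStream nextElement() {
                try {
                    return it.next().call();
                } catch (Exception e) {
                    throw new IllegalStateException("Could not open next part", e);
                }
            }
        });
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for the HDFS data files.
        List<Callable<InputStream>> parts = new ArrayList<>();
        parts.add(() -> new ByteArrayInputStream("data-1\n".getBytes(StandardCharsets.UTF_8)));
        parts.add(() -> new ByteArrayInputStream("data-2\n".getBytes(StandardCharsets.UTF_8)));
        try (InputStream in = concat(parts)) {
            int b;
            while ((b = in.read()) != -1) {
                System.out.print((char) b);
            }
        }
    }
}
```

SequenceInputStream closes each part once it is exhausted, so with this arrangement only one underlying stream needs to be open at a time.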

I can't locate files which are created by the program

I created a desktop project in NetBeans. In the project folder I have three files: file.txt, file2.txt and file3.txt. When the program loads I want to open these three files, and this is the code I tried:
public void run() {
Path path = Paths.get("file.txt");
Path path2 = Paths.get("file2.txt");
Path path3 = Paths.get("file3.txt");
if(Files.exists(path) && Files.exists(path2) && Files.exists(path3)) {
lireFichiers();
}else{
JOptionPane.showConfirmDialog(null, "Files didn't found !");
}
}
But when I run my program I get the message "Files didn't found !", which means it didn't find those files.
Those files are created by this code:
File file = new File("Id.txt");
File file2 = new File("Pass.txt");
File file3 = new File("Remember.txt");
The following three lines only create File objects for your program to use; they do not create files on disk by themselves. Opening a writer on such an object creates the file, and closing the writer correctly makes sure the content is written out.
File file = new File("Id.txt");
File file2 = new File("Pass.txt");
File file3 = new File("Remember.txt");
So, a sample code will look like:
File file = new File("Id.txt");
FileWriter fw = new FileWriter(file);
try
{
// write to file
}
finally
{
fw.close();
}
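The difference can be checked directly with File.exists(). The following is a small sketch; the class FileCreation and the method demo are made-up names, and the directory is passed in so that the check does not depend on the working directory.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

class FileCreation {

    // Returns { exists right after "new File", exists after writing }.
    static boolean[] demo(File dir) throws IOException {
        File file = new File(dir, "Id.txt");
        boolean before = file.exists(); // new File(...) has not touched the disk
        try (FileWriter fw = new FileWriter(file)) {
            fw.write("42"); // the writer is closed automatically at the end of the block
        }
        return new boolean[] { before, file.exists() };
    }
}
```

When a relative name such as "Id.txt" is used without a directory, file.getAbsolutePath() shows where the program is really looking, which helps when the working directory is not the one expected.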
If the file is in the root of your project, this should work:
Path path = Paths.get("foo.txt");
System.out.println(Files.exists(path)); // true
Where exactly are the files you want to open in your project?
Please specify the language you use.
Generally, you could search for the files to see whether they are in the folder the program starts from. For web apps, pay attention to the difference between absolute and relative paths.
=========Edit============
If you are using Java, then the file has to be written out, ending with FileWriter.close(), before you can find it on your hard disk.
Thank you all for your help, I just tried this :
File file = new File("Id.txt");
File file2 = new File("Pass.txt");
File file3 = new File("Remember.txt");
if(file.exists() && file2.exists() && file3.exists()){
// manipulation
}
and it works
