I have files which consist of JSON elements in an array (several files; each file contains a JSON array of elements).
I have a process that takes each JSON element as a line from a file and processes it.
So I created a small program that reads the JSON array and then writes the elements, one per line, to another file.
The output of this utility will be the input of the other process.
I used Java 7 NIO (and Gson), and tried to use as much Java 7 NIO as possible.
Is there any improvement I can make?
What about the filter? Which approach is better?
Thanks,
public class TransformJsonsUsers {

    public TransformJsonsUsers() {
    }

    public static void main(String[] args) throws IOException {
        final Gson gson = new Gson();
        Path path = Paths.get("C:\\work\\data\\resources\\files");
        final Path outputDirectory = Paths
                .get("C:\\work\\data\\resources\\files\\output");

        DirectoryStream.Filter<Path> filter = new DirectoryStream.Filter<Path>() {
            @Override
            public boolean accept(Path entry) throws IOException {
                // which is better?
                // BasicFileAttributeView attView = Files.getFileAttributeView(entry, BasicFileAttributeView.class);
                // return attView.readAttributes().isRegularFile();
                return !Files.isDirectory(entry);
            }
        };

        DirectoryStream<Path> directoryStream = Files.newDirectoryStream(path, filter);
        directoryStream.forEach(new Consumer<Path>() {
            @Override
            public void accept(Path filePath) {
                String fileOutput = outputDirectory.toString() + File.separator + filePath.getFileName();
                Path fileOutputPath = Paths.get(fileOutput);
                try {
                    BufferedReader br = Files.newBufferedReader(filePath);
                    User[] users = gson.fromJson(br, User[].class);
                    BufferedWriter writer = Files.newBufferedWriter(fileOutputPath, Charset.defaultCharset());
                    for (User user : users) {
                        writer.append(gson.toJson(user));
                        writer.newLine();
                    }
                    writer.flush();
                } catch (IOException e) {
                    throw new RuntimeException(filePath.toString(), e);
                }
            }
        });
    }
}
There is no point in using a Filter if you want to read all the files from the directory. A Filter is primarily designed to apply some filter criteria and read a subset of files. Either way, the two approaches should not make any real difference in overall performance.
If you are looking to improve performance, you can try a couple of different approaches.
Multi-threading
Depending on how many files exist in the directory and how powerful your CPU is, you can apply multi-threading to process more than one file at a time (see the sketch below).
Queuing
Right now you are reading and writing to the other file synchronously. You could queue the content of each file using a Queue and create an asynchronous writer.
You can combine both of these approaches to improve performance further.
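To make the multi-threading suggestion concrete, here is a minimal sketch that submits one task per file to a fixed-size thread pool. The ParallelTransform class name, the pool size and the processFile placeholder are illustrative assumptions, not part of the original code:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelTransform {

    public static void main(String[] args) throws IOException, InterruptedException {
        Path inputDir = Paths.get("C:\\work\\data\\resources\\files");
        ExecutorService pool = Executors.newFixedThreadPool(4); // pool size chosen arbitrarily

        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inputDir)) {
            for (Path file : stream) {
                if (!Files.isDirectory(file)) {
                    pool.submit(() -> processFile(file)); // one task per file
                }
            }
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void processFile(Path file) {
        // hypothetical placeholder: read the JSON array and write one element
        // per line to the output directory, as in the question's accept(Path) body
    }
}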
Don't put the I/O into the filter. That's not what it's for. You should get the complete list of files first and then process it. For example, if the I/O creates another file in the directory, the behaviour is undefined: you might miss a file, or see the new file in the accept() method. A sketch of that approach follows.
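Here is a minimal sketch of that advice, assuming the same directory as in the question: collect the entries first, then do the per-file I/O on the snapshot (process is a hypothetical placeholder for the JSON transformation):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class CollectThenProcess {

    public static void main(String[] args) throws IOException {
        Path inputDir = Paths.get("C:\\work\\data\\resources\\files");

        // 1. Collect the complete list of regular files, doing no other I/O yet
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inputDir)) {
            for (Path entry : stream) {
                if (!Files.isDirectory(entry)) {
                    files.add(entry);
                }
            }
        }

        // 2. Process the snapshot; files created below this point are not picked up
        for (Path file : files) {
            process(file);
        }
    }

    private static void process(Path file) {
        // hypothetical placeholder: read the JSON array and write one element per line,
        // as in the question
    }
}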
I've been searching Google for some time now but can't seem to find any library that allows me to open password-protected RAR files (compressed files) using Java.
If anyone knows of one, please share it with me (if possible, one with a Maven dependency).
I've been looking at JUnRar and java-UnRar, but as far as I could discover, neither supports password-protected files.
WinRAR ships with two command-line utilities (unrar.exe and rar.exe). From PowerShell, you can unrar an archive by calling: unrar e .\my-archive.rar -p[your-password]
Now, you could place this call using the exec() method of Java's Runtime class:
public class UnArchiver {
    public static void main(String[] args) {
        try {
            // note the escaped backslash in the path
            String command = "unrar.exe e .\\my-archive.rar -pQWERT";
            Runtime.getRuntime().exec(command);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
// Code not tested
However, this option has some drawbacks:
The password is handled as a String (bad practice when handling passwords).
I do not know how exec() is implemented for Windows JVMs. I think there is a risk the password ends up in an unsafe place (a log file?) where it does not belong.
For me, exec() always has a smell to it, because it introduces coupling to the environment (in this case unrar.exe) that is not visible at first glance to later maintainers of your code.
You introduce a platform dependency (in this case on Windows), as unrar.exe can only run on Windows (thanks @SapuSeven).
Note: When searching on Stackoverflow.com, you probably stumbled over the Junrar library. It cannot be used to extract encrypted archives (see line 122 of this file).
The SevenZip library can extract many types of archive files, including RAR:
// sourceZipFile, destFolder and logger are assumed to be defined elsewhere in the class
randomAccessFile = new RandomAccessFile(sourceZipFile, "r");
inArchive = SevenZip.openInArchive(null, // autodetect archive type
        new RandomAccessFileInStream(randomAccessFile));
simpleInArchive = inArchive.getSimpleInterface();

for (int i = 0; i < inArchive.getNumberOfItems(); i++) {
    ISimpleInArchiveItem archiveItem = simpleInArchive.getArchiveItem(i);
    final File outFile = new File(destFolder, archiveItem.getPath());
    outFile.getParentFile().mkdirs();
    logger.debug(String.format("extract(%s) in progress: %s", sourceZipFile.getName(), archiveItem.getPath()));

    final BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(outFile));
    try {
        ExtractOperationResult result = archiveItem.extractSlow(new ISequentialOutStream() {
            public int write(byte[] data) throws SevenZipException {
                try {
                    out.write(data);
                } catch (IOException e) {
                    throw new SevenZipException(String.format(
                            "error in writing extracted data from: %s to: %s",
                            sourceZipFile.getName(), outFile.getName()), e);
                }
                return data.length; // return amount of consumed data
            }
        });
        if (result != ExtractOperationResult.OK) {
            throw new SevenZipException(String.format("%s error occurred in extracting item %s of file %s",
                    result.name(), archiveItem.getPath(), sourceZipFile.getName()));
        }
    } finally {
        // close the target stream only after the whole item has been extracted;
        // closing it inside write() would break items that are delivered in several chunks
        try { out.close(); } catch (Exception e) { }
    }
}
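Note that the snippet above never passes the archive password, even though the question is about password-protected RARs. If this is the sevenzipjbinding library, it also has (as far as I know) overloads that accept a password; treat the exact signatures below as an assumption and verify them against the Javadoc of the version you use:

// Assumed sevenzipjbinding overloads -- verify against your version's Javadoc.
// Open the archive with a password (needed when the headers are encrypted):
inArchive = SevenZip.openInArchive(null, // autodetect archive type
        new RandomAccessFileInStream(randomAccessFile),
        "your-password");

// Or supply the password per item when extracting:
ExtractOperationResult result = archiveItem.extractSlow(outStream, "your-password");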
I want to read data from an FTP server. I am providing the path of the file, which resides on the FTP server, in the format ftp://Username:Password@host/path.
When I use a plain map-reduce program to read data from the file, it works fine. Now I want to read data from the same file through the Cascading framework. I am using the Hfs tap of the Cascading framework to read the data, and it throws the following exception:
java.io.IOException: Stream closed
at org.apache.hadoop.fs.ftp.FTPInputStream.close(FTPInputStream.java:98)
at java.io.FilterInputStream.close(Unknown Source)
at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:168)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:254)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:440)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Below is the Cascading code with which I am reading the files:
public class FTPWithHadoopDemo {
    public static void main(String args[]) {
        Tap source = new Hfs(new TextLine(new Fields("line")), "ftp://user:pwd@xx.xx.xx.xx//input1");
        Tap sink = new Hfs(new TextLine(new Fields("line1")), "OP\\op", SinkMode.REPLACE);
        Pipe pipe = new Pipe("First");
        pipe = new Each(pipe, new RegexSplitGenerator("\\s+"));
        pipe = new GroupBy(pipe);
        Pipe tailpipe = new Every(pipe, new Count());
        FlowDef flowDef = FlowDef.flowDef().addSource(pipe, source).addTailSink(tailpipe, sink);
        new HadoopFlowConnector().connect(flowDef).complete();
    }
}
I looked in the Hadoop source code for the same exception. I found that in the MapTask class there is a method runOldMapper which deals with the stream, and in the same method there is a finally block where the stream gets closed (in.close()). When I remove that line from the finally block, it works fine. Below is the code:
private <INKEY, INVALUE, OUTKEY, OUTVALUE> void runOldMapper(final JobConf job, final TaskSplitIndex splitIndex,
        final TaskUmbilicalProtocol umbilical, TaskReporter reporter)
        throws IOException, InterruptedException, ClassNotFoundException {
    InputSplit inputSplit = getSplitDetails(new Path(splitIndex.getSplitLocation()), splitIndex.getStartOffset());
    updateJobWithSplit(job, inputSplit);
    reporter.setInputSplit(inputSplit);
    RecordReader<INKEY, INVALUE> in = isSkipping()
            ? new SkippingRecordReader<INKEY, INVALUE>(inputSplit, umbilical, reporter)
            : new TrackedRecordReader<INKEY, INVALUE>(inputSplit, job, reporter);
    job.setBoolean("mapred.skip.on", isSkipping());
    int numReduceTasks = conf.getNumReduceTasks();
    LOG.info("numReduceTasks: " + numReduceTasks);
    MapOutputCollector collector = null;
    if (numReduceTasks > 0) {
        collector = new MapOutputBuffer(umbilical, job, reporter);
    } else {
        collector = new DirectMapOutputCollector(umbilical, job, reporter);
    }
    MapRunnable<INKEY, INVALUE, OUTKEY, OUTVALUE> runner = ReflectionUtils.newInstance(job.getMapRunnerClass(), job);
    try {
        runner.run(in, new OldOutputCollector(collector, conf), reporter);
        collector.flush();
    } finally {
        // close
        in.close(); // close input
        collector.close();
    }
}
Please assist me in solving this problem.
Thanks,
Arshadali
After some effort I found out that Hadoop uses the org.apache.hadoop.fs.ftp.FTPFileSystem class for FTP.
This class doesn't support seek, i.e. seeking to a given offset from the start of the file. Data is read in one block and then the file system seeks to the next block to read. The default block size is 4 KB for FTPFileSystem. As seek is not supported, it can only read data of 4 KB or less.
This is my first question in this forum...
I'm making a data-mining application in Java with the WEKA API.
I first run a pre-processing stage, and when I save the ARFF file I would like to add a couple of lines (as comments) specifying the pre-processing tasks that I have applied to the file...
The problem is that I don't know how to add comments to an ARFF file from the Java WEKA API.
To save the file I use the class ArffSaver like this...
try {
    ArffSaver saver = new ArffSaver();
    saver.setInstances(dataPost);
    saver.setFile(arffFile);
    saver.writeBatch();
    return true;
} catch (IOException ex) {
    Logger.getLogger(Preprocesamiento.class.getName()).log(Level.SEVERE, null, ex);
    return false;
}
I would be really grateful if someone could give me some idea...
thanks!
You should AVOID writing comments into an .arff file, even more so when writing it from Java. These files are very "parser-sensitive". The Weka API for creating these files is restrictive for this particular reason.
Even so, you can always add your comments manually with the % symbol. That said, I wouldn't recommend writing anything more than instances, attributes and values into an .arff file. ;-)
I don't see a reason to not write comments into the header of an ARFF file. The specification clearly says:
Lines that begin with a % are comments.
So while it is technically valid, it can be difficult if you want to use the ArffSaver#setFile method. This method does a lot of (convenient, but somewhat arbitrary and unspecified) work internally, until it finally calls
setDestination(new FileOutputStream(m_outputFile));
If this is not required, the easiest option is to write directly to an OutputStream, which then can simply be set as the destination for the ArffSaver. This can be wrapped in a small helper method, for example, like this:
static void writeArff(
        Instances instances,
        List<String> commentLines,
        OutputStream outputStream) throws IOException
{
    ArffSaver saver = new ArffSaver();
    saver.setInstances(instances);
    if (commentLines != null && !commentLines.isEmpty())
    {
        BufferedWriter bw = new BufferedWriter(
                new OutputStreamWriter(outputStream));
        for (String commentLine : commentLines)
        {
            bw.write("% " + commentLine + "\n");
        }
        bw.write("\n");
        bw.flush();
    }
    saver.setDestination(outputStream);
    saver.writeBatch();
}
When calling it like this
List<String> comments = Arrays.asList("A comment", "Another one");
writeArff(instances, comments, outputStream);
then the given comments will be inserted at the top of the ARFF file.
I'm a newbie to Spring Batch, and I would appreciate some help in resolving this situation: I read some files with a MultiResourceItemReader and do some marshalling work; in the ItemProcessor I receive a String and return a Map<String, List<String>>. My problem is that in the ItemWriter I should iterate over the keys of the Map and, for each of them, generate a new file containing the values associated with that key. Can someone point me in the right direction for creating those files?
I'm also using a MultiResourceItemWriter because I need to generate files with a maximum number of lines.
Thanks in advance
Well, I finally got a solution. I'm not really excited about it, but it's working and I don't have much more time, so I've extended MultiResourceItemWriter and overridden the write method, processing the map's elements and writing the files myself.
In case anyone out there needs it, here it is.
@Override
public void write(List items) throws Exception {
    for (Object o : items) {
        // do some processing here
        writeFile(anotherObject);
    }
}

private void writeFile(AnotherObject anotherObject) throws IOException {
    File file = new File("name.xml");
    boolean restarted = file.exists();
    FileUtils.setUpOutputFile(file, restarted, true, true);
    StringBuffer sb = new StringBuffer();
    sb.append(xStream.toXML(anotherObject));
    FileOutputStream os = new FileOutputStream(file, true);
    BufferedWriter bufferedWriter = new BufferedWriter(new OutputStreamWriter(os, Charset.forName("UTF-8")));
    bufferedWriter.write(sb.toString());
    bufferedWriter.close();
}
And that's it. I want to believe that there is a better option that I don't know of, but for the moment this is my solution. If anyone knows how I can enhance my implementation, I'd like to hear it.
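For what it's worth, a different direction (not the solution above) would be a plain ItemWriter that receives the Map<String, List<String>> items from the processor and writes one file per key. The class name, output directory and file naming below are illustrative assumptions, and the write(List) signature is the one used before Spring Batch 5:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Map;

import org.springframework.batch.item.ItemWriter;

public class MapPerKeyFileWriter implements ItemWriter<Map<String, List<String>>> {

    private final Path outputDir = Paths.get("output"); // hypothetical output directory

    @Override
    public void write(List<? extends Map<String, List<String>>> items) throws Exception {
        Files.createDirectories(outputDir);
        for (Map<String, List<String>> map : items) {
            for (Map.Entry<String, List<String>> entry : map.entrySet()) {
                writeFileForKey(entry.getKey(), entry.getValue());
            }
        }
    }

    private void writeFileForKey(String key, List<String> lines) throws IOException {
        // one file per key; appends if the same key shows up again in a later chunk
        Path target = outputDir.resolve(key + ".txt");
        Files.write(target, lines, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}

Note that this sketch ignores the maximum-lines-per-file requirement that MultiResourceItemWriter handles; splitting by line count would still need to be added.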
Is it possible to write objects in Java to a binary file? The objects I want to write would be 2 arrays of String objects. The reason I want to do this is to save persistent data. If there is some easier way to do this let me know.
You could:
1. Serialize the arrays, or a class that contains the arrays.
2. Write the arrays, for example as two lines, in a formatted way such as JSON, XML or CSV.
Here is some code for the first one (the same approach works for any other Serializable object):
Serialize
public static void main(String args[]) {
    String[][] theData = new String[2][1];
    theData[0][0] = ("r0 c1");
    theData[1][0] = ("r1 c1");
    System.out.println(java.util.Arrays.deepToString(theData));
    // serialize the data
    System.out.println("serializing theData");
    try {
        FileOutputStream fout = new FileOutputStream("thedata.dat");
        ObjectOutputStream oos = new ObjectOutputStream(fout);
        oos.writeObject(theData);
        oos.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Deserialize
public static void main(String args[]) {
String[][] theData;
// unserialize the Queue
System.out.println("unserializing theQueue");
try {
FileInputStream fin = new FileInputStream("thedata.dat");
ObjectInputStream ois = new ObjectInputStream(fin);
theData = (Queue) ois.readObject();
ois.close();
}
catch (Exception e) { e.printStackTrace(); }
System.out.println(theData.toString());
}
The second one is more complicated, but has the benefit of being readable by humans as well as by other languages.
Read and Write as XML
import java.beans.XMLEncoder;
import java.beans.XMLDecoder;
import java.io.*;

public class XMLSerializer {

    public static void write(String[][] f, String filename) throws Exception {
        XMLEncoder encoder = new XMLEncoder(
                new BufferedOutputStream(new FileOutputStream(filename)));
        encoder.writeObject(f);
        encoder.close();
    }

    public static String[][] read(String filename) throws Exception {
        XMLDecoder decoder = new XMLDecoder(
                new BufferedInputStream(new FileInputStream(filename)));
        String[][] o = (String[][]) decoder.readObject();
        decoder.close();
        return o;
    }
}
To and From JSON
Google has a good library for converting to and from JSON at http://code.google.com/p/google-gson/ . You could simply write your object to JSON and then write that to a file. To read, do the opposite.
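For instance, a minimal sketch with Gson (the JsonSerializer class and file names are illustrative; it assumes the Gson jar from the link above is on the classpath):

import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

import com.google.gson.Gson;

public class JsonSerializer {

    public static void write(String[][] data, String filename) throws IOException {
        Gson gson = new Gson();
        try (FileWriter writer = new FileWriter(filename)) {
            gson.toJson(data, writer); // writes e.g. [["a0 r0","a0 r1"],["r1 c1"]]
        }
    }

    public static String[][] read(String filename) throws IOException {
        Gson gson = new Gson();
        try (FileReader reader = new FileReader(filename)) {
            return gson.fromJson(reader, String[][].class);
        }
    }
}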
You can do it using Java's serialization mechanism, but beware that serialization is not a good solution for long-term persistent storage of objects. The reason for this is that serialized objects are very tightly coupled to your Java code: if you change your program, then the serialized data files become unreadable, because they are not compatible anymore with your Java code. Serialization is good for temporary storage (for example for an on-disk cache) or for transferring objects over a network.
For long-term storage, you should use a standard and well-documented format (for example XML, JSON or something else) that is not tightly coupled to your Java code.
If, for some reason, you absolutely want to use a binary format, then there are several options available, for example Google protocol buffers or Hessian.
One possibility besides serialization is to write Objects to XML files to make them more human-readable. The XStream API is capable of this and uses an approach that is similar to serialization.
http://x-stream.github.io/
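For illustration, a minimal sketch of what that could look like with XStream (class name and data are made up; depending on the XStream version you may also need to configure its security framework before deserializing):

import com.thoughtworks.xstream.XStream;

public class XStreamExample {

    public static void main(String[] args) {
        String[][] theData = { { "a0 r0", "a0 r1" }, { "r1 c1" } };

        XStream xstream = new XStream();
        String xml = xstream.toXML(theData);        // serialize to an XML string
        System.out.println(xml);

        // depending on the XStream version, types may need to be explicitly allowed
        xstream.allowTypes(new Class[] { String[][].class });
        String[][] restored = (String[][]) xstream.fromXML(xml);
        System.out.println(restored[1][0]);         // prints "r1 c1"
    }
}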
If you want to write arrays of String, you may be better off with a text file. The advantage of using a text file is that it can be easily viewed and edited, and is usable by many other tools on your system, which means you don't have to write those tools yourself.
You can also find that a simple text format will be faster and more compact than using XML or JSON. Note: Those formats are more useful for complex data structures.
public static void writeArray(PrintStream ps, String... strings) {
    for (String string : strings) {
        assert !string.contains("\n") && string.length() > 0;
        ps.println(string);
    }
    ps.println(); // blank line marks the end of the array
}
public static String[] readArray(BufferedReader br) throws IOException {
    List<String> strings = new ArrayList<String>();
    String string;
    while ((string = br.readLine()) != null) {
        if (string.length() == 0)
            break;
        strings.add(string);
    }
    return strings.toArray(new String[strings.size()]);
}
If you start with
String[][] theData = { { "a0 r0", "a0 r1", "a0 r2" }, { "r1 c1" } };
this could result in
a0 r0
a0 r1
a0 r2

r1 c1

(each array is terminated by a blank line, which is what readArray looks for)
As you can see, this is easy to edit/view.
This makes some assumptions about what a string can contain (see the assert). If these assumptions are not valid, there are ways of working around this.
You need to write the objects, not the classes, right? (Classes are already compiled to binary .class files.)
Try ObjectOutputStream; there's an example at
http://java.sun.com/javase/6/docs/api/java/io/ObjectOutputStream.html