Java code hangs when trying to compare huge files

I am exploring a way to compare two files in Java and show the differences as HTML.
Below is the code I am using:
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;
import org.apache.commons.text.diff.CommandVisitor;
import org.apache.commons.text.diff.StringsComparator;
public class FileDiff {
public static void main(String[] args) throws IOException {
// Read both files with line iterator.
LineIterator file1 = FileUtils.lineIterator(new File("file-1.txt"), "utf-8");
LineIterator file2 = FileUtils.lineIterator(new File("file-2.txt"), "utf-8");
// Initialize visitor.
FileCommandsVisitor fileCommandsVisitor = new FileCommandsVisitor();
// Read file line by line so that comparison can be done line by line.
while (file1.hasNext() || file2.hasNext()) {
/*
* In case both files have different number of lines, fill in with empty
* strings. Also append newline char at end so next line comparison moves to
* next line.
*/
String left = (file1.hasNext() ? file1.nextLine() : "") + "\n";
String right = (file2.hasNext() ? file2.nextLine() : "") + "\n";
// Prepare diff comparator with lines from both files.
StringsComparator comparator = new StringsComparator(left, right);
if (comparator.getScript().getLCSLength() > (Integer.max(left.length(), right.length()) * 0.4)) {
/*
* If both lines have atleast 40% commonality then only compare with each other
* so that they are aligned with each other in final diff HTML.
*/
comparator.getScript().visit(fileCommandsVisitor);
} else {
/*
* If both lines do not have 40% commanlity then compare each with empty line so
* that they are not aligned to each other in final diff instead they show up on
* separate lines.
*/
StringsComparator leftComparator = new StringsComparator(left, "\n");
leftComparator.getScript().visit(fileCommandsVisitor);
StringsComparator rightComparator = new StringsComparator("\n", right);
rightComparator.getScript().visit(fileCommandsVisitor);
}
}
fileCommandsVisitor.generateHTML();
}
}
/*
* Custom visitor for file comparison which stores comparison & also generates
* HTML in the end.
*/
class FileCommandsVisitor implements CommandVisitor<Character> {
// Spans with red & green highlights to put highlighted characters in HTML
private static final String DELETION = "<span style=\"background-color: #FB504B\">${text}</span>";
private static final String INSERTION = "<span style=\"background-color: #45EA85\">${text}</span>";
private String left = "";
private String right = "";
@Override
public void visitKeepCommand(Character c) {
// For new line use <br/> so that in HTML also it shows on next line.
String toAppend = "\n".equals("" + c) ? "<br/>" : "" + c;
// KeepCommand means c present in both left & right. So add this to both without
// any
// highlight.
left = left + toAppend;
right = right + toAppend;
}
@Override
public void visitInsertCommand(Character c) {
// For new line use <br/> so that in HTML also it shows on next line.
String toAppend = "\n".equals("" + c) ? "<br/>" : "" + c;
// InsertCommand means character is present in right file but not in left. Show
// with green highlight on right.
right = right + INSERTION.replace("${text}", "" + toAppend);
}
@Override
public void visitDeleteCommand(Character c) {
// For new line use <br/> so that in HTML also it shows on next line.
String toAppend = "\n".equals("" + c) ? "<br/>" : "" + c;
// DeleteCommand means character is present in left file but not in right. Show
// with red highlight on left.
left = left + DELETION.replace("${text}", "" + toAppend);
}
public void generateHTML() throws IOException {
// Get template & replace placeholders with left & right variables with actual
// comparison
String template = FileUtils.readFileToString(new File("difftemplate.html"), "utf-8");
String out1 = template.replace("${left}", left);
String output = out1.replace("${right}", right);
// Write file to disk.
FileUtils.write(new File("finalDiff.html"), output, "utf-8");
System.out.println("HTML diff generated.");
}
}
For smaller files this works well and gives me good results on my laptop. But with larger files (200 MB, about half a million rows), IntelliJ seems to hang. My laptop has 16 GB of RAM.
How can I improve this to handle large files for comparison?
Thanks

The way you wrote FileCommandsVisitor might keep it from being optimized. You are concatenating strings for every character visited, for instance:
left = left + toAppend;
right = right + toAppend;
Each concatenation can create a new String instance, and by the end that string is nearly 200 MB long: a new one for every character you visit, with the old ones left for the garbage collector. If your class held StringBuilders instead and you used the append() method, it might speed up drastically. For more details, read "String concatenation: concat() vs "+" operator".
For clarity (since according to comments you missed the point twice now):
class FileCommandsVisitor implements CommandVisitor<Character> {
//StringBuilder as properties
private StringBuilder left = new StringBuilder();
private StringBuilder right = new StringBuilder();
@Override
public void visitKeepCommand(Character c) {
String toAppend = "\n".equals("" + c) ? "<br/>" : "" + c;
// append to the StringBuilders where you would concat strings
left.append(toAppend);
right.append(toAppend);
}
//same as above for other methods
..
public void generateHTML() throws IOException {
String template = FileUtils.readFileToString(new File("difftemplate.html"), "utf-8");
//turn StringBuilders into Strings only when you actually need a String.
String out1 = template.replace("${left}", left.toString());
String output = out1.replace("${right}", right.toString());
FileUtils.write(new File("finalDiff.html"), output, "utf-8");
System.out.println("HTML diff generated.");
}
}
If that doesn't help, and the concatenation was already being optimized at runtime, I don't see anything else fundamentally wrong with your approach. Comparing huge files is not a cheap operation; it cannot be faster than the speed at which you can read two files line by line from disk. You are also taking a shortcut (one that increases speed, not decreases it) by having FileCommandsVisitor hold both diffs in memory instead of writing them out as it goes, which means that at best your code can diff a file about half the size of your available RAM. Note, however, that you never mentioned how long it actually takes, so it is hard to say whether the time you are seeing is expected or an anomaly.
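The idea of writing the diff out as it goes, rather than holding it in memory, can be sketched with the standard library alone. This is not the Commons Text API: StreamingDiffSink and its keep/insert/delete methods are hypothetical names that mirror the three visitor callbacks, and the two Writers stand in for wherever the left and right HTML fragments would end up. IOExceptions are wrapped as unchecked, on the assumption that the sink would be called from visitor methods that cannot declare checked exceptions.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical sink that mirrors the three visitor callbacks but streams
// each character to a Writer instead of accumulating it in a String.
class StreamingDiffSink {
    private static final String DELETION_OPEN = "<span style=\"background-color: #FB504B\">";
    private static final String INSERTION_OPEN = "<span style=\"background-color: #45EA85\">";
    private static final String CLOSE = "</span>";

    private final Writer left;
    private final Writer right;

    StreamingDiffSink(Writer left, Writer right) {
        this.left = left;
        this.right = right;
    }

    // Newlines become <br/> so the HTML keeps the line structure.
    private static String render(char c) {
        return c == '\n' ? "<br/>" : String.valueOf(c);
    }

    // Wraps the checked IOException so callers need no throws clause.
    private static void write(Writer w, String s) {
        try {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Character present on both sides: no highlight.
    void keep(char c) {
        String s = render(c);
        write(left, s);
        write(right, s);
    }

    // Character only on the right: green highlight on the right.
    void insert(char c) {
        write(right, INSERTION_OPEN + render(c) + CLOSE);
    }

    // Character only on the left: red highlight on the left.
    void delete(char c) {
        write(left, DELETION_OPEN + render(c) + CLOSE);
    }

    public static void main(String[] args) {
        // StringWriter keeps the demo self-contained; a BufferedWriter over
        // a file would be the realistic choice for very large inputs.
        StringWriter l = new StringWriter();
        StringWriter r = new StringWriter();
        StreamingDiffSink sink = new StreamingDiffSink(l, r);
        sink.keep('a');
        sink.delete('b');
        sink.insert('c');
        System.out.println(l);
        System.out.println(r);
    }
}
```

With file-backed writers, memory use is bounded by the writers' buffers rather than by the size of the diff; the two fragments would then have to be stitched into the template in a separate pass.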

Related

Stop Bullet number to be updated automatically when merging word docs using docx4j

I am trying to merge two docx files, each with its own bullet numbering. After merging the Word docs, the numbering is updated automatically.
E.g.:
Doc A has 1 2 3
Doc B has 1 2 3
After merging, the bullet numbering becomes 1 2 3 4 5 6.
How can I stop this?
I am using the following code:
if(counter==1)
{
FirstFileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
FirstFileIS = new java.io.ByteArrayInputStream(FirstFileByteStream);
FirstWordFile = org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(FirstFileIS);
main = FirstWordFile.getMainDocumentPart();
//Add page break for Table of Content
main.addObject(objBr);
if (htmlCode != null) {
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html,htmlCode.toString().getBytes());
}
//Table of contents - End
}
else
{
FileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
FileIS = new java.io.ByteArrayInputStream(FileByteStream);
byte[] bytes = IOUtils.toByteArray(FileIS);
AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/part" + (chunkCount++) + ".docx"));
afiPart.setContentType(new ContentType(CONTENT_TYPE));
afiPart.setBinaryData(bytes);
Relationship altChunkRel = main.addTargetPart(afiPart);
CTAltChunk chunk = Context.getWmlObjectFactory().createCTAltChunk();
chunk.setId(altChunkRel.getId());
main.addObject(objBr);
htmlCode = new StringBuilder();
htmlCode.append("<html>");
htmlCode.append("<h2><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><p style=\"font-family:'Arial Black'; color: #f35b1c\">"+ReqName+"</p></h2>");
htmlCode.append("</html>");
if (htmlCode != null) {
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html,htmlCode.toString().getBytes());
}
//Add Page Break before new content
main.addObject(objBr);
//Add new content
main.addObject(chunk);
}

Looking at your code, you are adding HTML altChunks to your document.
For these to display in Word, the HTML is converted to normal docx content.
An altChunk is usually converted by Word when you open the docx.
(Alternatively, docx4j-ImportXHTML can do it for an altChunk of type XHTML)
The upshot is that what happens with the bullets (when Word converts your HTML) is largely outside your control. You could experiment with CSS but I think Word will mostly ignore it.
An alternative may be to use XHTML altChunks, and have docx4j-ImportXHTML convert them. main.convertAltChunks()
If the same problem occurs when you try that, well, at least we can address it.

I was able to fix my issue using the following code, which I found at (http://webapp.docx4java.org/OnlineDemo/forms/upload_MergeDocx.xhtml). You can also generate custom code there; they have a nice demo that generates code according to your requirements :).
public final static String DIR_IN = System.getProperty("user.dir")+ "/";
public final static String DIR_OUT = System.getProperty("user.dir")+ "/";
public static void main(String[] args) throws Exception
{
String[] files = {"part1docx_20200717t173750539gmt.docx", "part1docx_20200717t173750539gmt (1).docx", "part1docx_20200717t173750539gmt.docx"};
List<BlockRange> blockRanges = new ArrayList<BlockRange>();
for (int i=0 ; i< files.length; i++) {
BlockRange block = new BlockRange(WordprocessingMLPackage.load(new File(DIR_IN + files[i])));
blockRanges.add( block );
block.setStyleHandler(StyleHandler.RENAME_RETAIN);
block.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);
block.setRestartPageNumbering(false);
block.setHeaderBehaviour(HfBehaviour.DEFAULT);
block.setFooterBehaviour(HfBehaviour.DEFAULT);
block.setSectionBreakBefore(SectionBreakBefore.NEXT_PAGE);
}
// Perform the actual merge
DocumentBuilder documentBuilder = new DocumentBuilder();
WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);
// Save the result
SaveToZipFile saver = new SaveToZipFile(output);
saver.save(DIR_OUT+"OUT_MergeWholeDocumentsUsingBlockRange.docx");
}

Convert .prn file to csv file format in java

I need your help to convert a prn file to a csv file using Java.
Below is my prn file.
I would like to make it look like this.
Thank you so much.

In your example you have four entries as input, each in a row. In your result table they all are in one row. I assume the input describes a complete prn set. So if a file would contain n prn sets, it would have n * 4 rows.
To map the prn set to a csv file you have to:
1. read in the entries from the input file
2. write a header row (with eight titles)
3. extract the relevant values from each entry
4. combine the extracted values from four consecutive entries into one csv row
5. write the row
6. repeat steps 3 to 5 as long as there are further entries
Here is my suggestion:
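import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
public class PrnToCsv {
private static final String DILIM_PRN = " ";
private static final String DILIM_CSV = ",";
private static final Pattern PRN_SPLITTER = Pattern.compile(DILIM_PRN);
public static void main(String[] args) throws IOException {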
List<String> inputLines = Files.readAllLines(new File("C://Temp//csv/input.prn").toPath());
List<String[]> inputValuesInLines = inputLines.stream().map(l -> PRN_SPLITTER.split(l)).collect(Collectors.toList());
try (BufferedWriter bw = Files.newBufferedWriter(new File("C://Temp//csv//output.csv").toPath())) {
// header
bw.append("POL1").append(DILIM_CSV).append("POL1_Time").append(DILIM_CSV).append("OLV1").append(DILIM_CSV).append("OLV1_Time").append(DILIM_CSV);
bw.append("POL2").append(DILIM_CSV).append("POL2_Time").append(DILIM_CSV).append("OLV2").append(DILIM_CSV).append("OLV2_Time");
bw.newLine();
// data
for (int i = 0; i + 3 < inputValuesInLines.size(); i = i + 4) {
String[] firstValues = inputValuesInLines.get(i);
bw.append(getId(firstValues)).append(DILIM_CSV).append(getDateTime(firstValues)).append(DILIM_CSV);
String[] secondValues = inputValuesInLines.get(i + 1);
bw.append(getId(secondValues)).append(DILIM_CSV).append(getDateTime(secondValues)).append(DILIM_CSV);
String[] thirdValues = inputValuesInLines.get(i + 2);
bw.append(getId(thirdValues)).append(DILIM_CSV).append(getDateTime(thirdValues)).append(DILIM_CSV);
String[] fourthValues = inputValuesInLines.get(i + 3);
bw.append(getId(fourthValues)).append(DILIM_CSV).append(getDateTime(fourthValues));
bw.newLine();
}
}
}
public static String getId(String[] values) {
return values[1];
}
public static String getDateTime(String[] values) {
return values[2] + " " + values[3];
}
}
Some remarks on the code:
Using the NIO API, you can read the whole file with one line of code.
To extract the values of an entry line, I used a Pattern to split the line into an array with each word as a value.
Then it is easy to get the relevant values of an entry using the appropriate array indexes.
To write the csv file line by line (without additional libraries), you can use a BufferedWriter.
The file you're writing to is a resource. It is recommended to use resources with the try-with-resources statement.
I hope I could answer your question.

Generating custom text files in java

public class ScriptCreator {
public static void main(String[] args) throws IOException {
// Choose the CSV file that I am importing the data from
String fName = "C:\\Users\\MyUser\\Downloads\\CurrentApplications (1).csv";
String thisLine;
int count = 0;
FileInputStream fis = new FileInputStream(fName);
DataInputStream myInput = new DataInputStream(fis);
int i = 0;
// Prints the list of names in the CSV file
while((thisLine = myInput.readLine()) != null){
String strar[] = thisLine.split(",");
Printer(strar[0]);
}
}
public static void Printer(String arg) throws IOException{
// Want to pull from the String strar[0] from above
// Says that it cannot be resolved to a variable
String name = arg;
String direc = "C:/Users/MyUser/Documents/";
String path = "C:/Users/MyUser/Documents";
Iterable<String> lines = Arrays.asList("LOGIN -acceptssl ServerName","N " + name + " " + direc ,"cd " + name,"import " + path + "*.ppf" + " true","scan", "publishassessase -aseapplication " + name,"removeassess *","del " + name );
Path file = Paths.get(name + ".txt");
Files.write(file, lines, Charset.forName("UTF-8"));
}
}
Hello everyone and thank you in advance for any help that you may be able to give me. I am trying to create a java program that will pull names from a CSV file and take those names to generate custom outputs for text files. I am having a hard time being able to set a variable that I can use to grab the names that are being printed and using them to generate a text file by setting the name variable.
I am also going to need some help in making sure that it creates the amount of scripts for the amount of names in the CSV file. Ex. 7 names in CSV makes 7 custom .txt files, each with its appropriate name.
Any help is greatly appreciated!
Edit: I have updated my code to match the correction that was needed to make the code work.

It looks like you have some scoping issues. Whenever you declare a variable, it only exists within the boundaries of its closest set of braces. By declaring strar in your main method, the only place you can explicitly use it is within your main method. Your Printer() method doesn't have any previous mention of strar, and the only way it can know about it is by passing it as an argument to the function.
i.e.
Printer(String[] args)
Or, better yet:
Printer(String arg)
and then call it in your while loop with
Printer(strar[0]);
Also, your Printer method begins with a "for each" loop called on strar[0], which is not a valid target for a foreach loop anyway, because if I recall correctly, String isn't an Iterable object. If you implemented the Printer function in the way I recommended, you won't need a for each loop anyway, as there will only be one name passed at a time.
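The scoping point can be shown with a compact sketch: a value read in one method reaches another method only as an argument. The names ScopeDemo, firstColumn and scriptLines, the sample CSV lines, and the abbreviated script text are all illustrative and not taken from the question.

```java
import java.util.Arrays;
import java.util.List;

class ScopeDemo {
    // The caller's local variables are invisible here; only the argument is.
    static List<String> scriptLines(String name) {
        return Arrays.asList("cd " + name, "scan", "del " + name);
    }

    // Extracts the first comma-separated field of a CSV line.
    static String firstColumn(String csvLine) {
        return csvLine.split(",")[0];
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the CSV file's lines.
        String[] csv = {"AppOne,2020-01-01", "AppTwo,2020-01-02"};
        for (String line : csv) {
            String name = firstColumn(line); // exists only inside this loop body
            System.out.println(name + ".txt -> " + scriptLines(name));
        }
    }
}
```

Because the method is called once per line, one set of script lines is produced per name, which is also what gives one output file per CSV row in the real program.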

files.equal not returning true

Basically, I have two Strings that are fully qualified file names, and I want to check that the two files are the same. So I converted both Strings to File objects. I tried to compare them using Google Guava's Files.equal(File file, File file2) method, but the value returned was false. Wondering what was wrong, I converted both File objects to byte arrays and printed their lengths, which were the same number. So, does anyone know why Files.equal considers the files different?
I'm just curious why the method returns false, because according to the docs Files.equal compares the two files byte by byte.
Thanks.
Code:
public class WhenEncrypting {
private String[] args = new String[4];
/**
* encrypts a plain text file
*
* @throws IOException
* IOException could occur
*/
@Test()
public void normalEncryption() throws IOException {
this.args[0] = "-e";
this.args[1] = "./src/decoderwheel/tests/valid.map";
this.args[2] = "./src/decoderwheel/tests/input.txt";
this.args[3] = "./src/decoderwheel/tests/crypt.txt";
DecoderWheel.main(this.args);
File plainFile = new File("./src/decoderwheel/tests/input.txt");
File crypted = new File("./src/decoderwheel/tests/crypt.txt");
byte[] f1 = Files.toByteArray(plainFile);
byte[] f2 = Files.toByteArray(crypted);
int number = f1.length;
int size = f2.length;
Files.equal(crypted, plainFile);
System.out.println(number);
System.out.println(size);
System.out.println(Files.equal(crypted, plainFile));
assertTrue(Files.equal(crypted, plainFile));
}
}
Output:
360
360
false

Based on what you've shown us, I think that the problem is most likely to be that the two files' contents are NOT equal.
The fact that the two byte arrays (read from the files) have the same lengths does not mean that their contents (and hence the files' contents) are the same.
Add something like this:
for (int i = 0; i < f1.length; i++) {
if (f1[i] != f2[i]) {
System.out.println("File content mismatch at index " + i + ": " +
f1[i] + " != " + f2[i]);
}
}
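On Java 9 or later, the same check can be written with Arrays.mismatch, which returns the index of the first differing byte, or -1 when the arrays have identical content. The sketch below uses made-up byte arrays in place of the file contents; ByteCompare and describe are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteCompare {
    // Reports where two byte arrays first differ, if they differ at all.
    static String describe(byte[] a, byte[] b) {
        int i = Arrays.mismatch(a, b); // -1 means identical content
        if (i < 0) {
            return "identical";
        }
        return "first mismatch at index " + i;
    }

    public static void main(String[] args) {
        // Same length, different content, like the two files in the question.
        byte[] plain = "hello world".getBytes(StandardCharsets.UTF_8);
        byte[] crypt = "hellp world".getBytes(StandardCharsets.UTF_8);
        System.out.println(plain.length + " vs " + crypt.length);
        System.out.println(describe(plain, crypt));
    }
}
```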

Java: splitting the filename into a base and extension

Is there a better way to get file basename and extension than something like
File f = ...
String name = f.getName();
int dot = name.lastIndexOf('.');
String base = (dot == -1) ? name : name.substring(0, dot);
String extension = (dot == -1) ? "" : name.substring(dot+1);

I know others have mentioned String.split, but here is a variant that only yields two tokens (the base and the extension):
String[] tokens = fileName.split("\\.(?=[^\\.]+$)");
For example:
"test.cool.awesome.txt".split("\\.(?=[^\\.]+$)");
Yields:
["test.cool.awesome", "txt"]
The regular expression tells Java to split on any period that is followed by one or more non-periods and then the end of input. Only one period matches this definition (namely, the last one).
In regex terms, this technique is called a zero-width positive lookahead.
BTW, if you want to split a path into its directory part and the full filename (including any dot extension), for a path with forward slashes, add a lookbehind so that the split happens at the position right after the last slash without consuming anything:
String[] tokens = dir.split("(?<=/)(?=[^/]+$)");
For example:
String dir = "/foo/bar/bam/boozled";
String[] tokens = dir.split("(?<=/)(?=[^/]+$)");
// [ "/foo/bar/bam/", "boozled" ]

Old question but I usually use this solution:
import org.apache.commons.io.FilenameUtils;
String fileName = "/abc/defg/file.txt";
String basename = FilenameUtils.getBaseName(fileName);
String extension = FilenameUtils.getExtension(fileName);
System.out.println(basename); // file
System.out.println(extension); // txt (NOT ".txt" !)

Source: http://www.java2s.com/Code/Java/File-Input-Output/Getextensionpathandfilename.htm
A utility class such as this:
class Filename {
private String fullPath;
private char pathSeparator, extensionSeparator;
public Filename(String str, char sep, char ext) {
fullPath = str;
pathSeparator = sep;
extensionSeparator = ext;
}
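public String extension() {
int dot = fullPath.lastIndexOf(extensionSeparator);
int sep = fullPath.lastIndexOf(pathSeparator);
// Only a separator after the last path separator marks an extension.
return (dot > sep) ? fullPath.substring(dot + 1) : "";
}
public String filename() { // gets filename without extension
int dot = fullPath.lastIndexOf(extensionSeparator);
int sep = fullPath.lastIndexOf(pathSeparator);
// Without an extension, the filename runs to the end of the path.
return (dot > sep) ? fullPath.substring(sep + 1, dot) : fullPath.substring(sep + 1);
}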
public String path() {
int sep = fullPath.lastIndexOf(pathSeparator);
return fullPath.substring(0, sep);
}
}
usage:
public class FilenameDemo {
public static void main(String[] args) {
final String FPATH = "/home/mem/index.html";
Filename myHomePage = new Filename(FPATH, '/', '.');
System.out.println("Extension = " + myHomePage.extension());
System.out.println("Filename = " + myHomePage.filename());
System.out.println("Path = " + myHomePage.path());
}
}

http://docs.oracle.com/javase/6/docs/api/java/io/File.html#getName()
From http://www.xinotes.org/notes/note/774/ :
Java has built-in functions to get the basename and dirname for a given file path, but the function names are not so self-apparent.
import java.io.File;
public class JavaFileDirNameBaseName {
public static void main(String[] args) {
File theFile = new File("../foo/bar/baz.txt");
System.out.println("Dirname: " + theFile.getParent());
System.out.println("Basename: " + theFile.getName());
}
}

What's wrong with your code? Wrapped in a neat utility method it's fine.
What's more important is what to use as separator — the first or last dot. The first is bad for file names like "setup-2.5.1.exe", the last is bad for file names with multiple extensions like "mybundle.tar.gz".
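The trade-off is easy to see by splitting the two example names both ways. A small sketch follows; DotSplit, atLastDot and atFirstDot are illustrative names, not an existing API.

```java
import java.util.Arrays;

class DotSplit {
    // Splits at the last dot: "a.tar.gz" -> ["a.tar", "gz"].
    static String[] atLastDot(String name) {
        return splitAt(name, name.lastIndexOf('.'));
    }

    // Splits at the first dot: "a.tar.gz" -> ["a", "tar.gz"].
    static String[] atFirstDot(String name) {
        return splitAt(name, name.indexOf('.'));
    }

    // Returns {base, extension}; the extension is empty when there is no dot.
    private static String[] splitAt(String name, int dot) {
        if (dot == -1) {
            return new String[] {name, ""};
        }
        return new String[] {name.substring(0, dot), name.substring(dot + 1)};
    }

    public static void main(String[] args) {
        for (String name : new String[] {"setup-2.5.1.exe", "mybundle.tar.gz"}) {
            System.out.println(name
                    + " last dot: " + Arrays.toString(atLastDot(name))
                    + " first dot: " + Arrays.toString(atFirstDot(name)));
        }
    }
}
```

Splitting at the first dot turns "setup-2.5.1.exe" into base "setup-2" with extension "5.1.exe", while splitting at the last dot turns "mybundle.tar.gz" into base "mybundle.tar" with extension "gz", so neither rule suits both names.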

File extensions are a broken concept
And there exists no reliable function for it. Consider for example this filename:
archive.tar.gz
What is the extension? DOS users would have preferred the name archive.tgz. Sometimes you see stupid Windows applications that first decompress the file (yielding a .tar file), then you have to open it again to see the archive contents.
In this case, a more reasonable notion of file extension would have been .tar.gz. There are also .tar.bz2, .tar.xz, .tar.lz and .tar.lzma file "extensions" in use. But how would you decide, whether to split at the last dot, or the second-to-last dot?
Use mime-types instead.
The Java 7 function Files.probeContentType will likely detect file types much more reliably than trusting the file extension. Pretty much all of the Unix/Linux world, as well as your web browser and smartphone, already does it this way.
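A minimal sketch of calling Files.probeContentType is shown below. The result depends on the file type detectors installed on the platform, and the method may return null when the type cannot be determined, so the sketch falls back to "unknown"; ContentTypeProbe and probe are hypothetical names, and the temporary file exists only for the demo.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ContentTypeProbe {
    // Returns the probed MIME type, or "unknown" when no detector recognizes the file.
    static String probe(Path file) throws IOException {
        String type = Files.probeContentType(file);
        return type != null ? type : "unknown";
    }

    public static void main(String[] args) throws IOException {
        // A throwaway file so the demo does not depend on any existing path.
        Path tmp = Files.createTempFile("probe-demo", ".txt");
        try {
            Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
            // The printed value varies by platform and JDK.
            System.out.println(probe(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```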

You can also use Java regular expressions. String.split() also uses regular expressions internally. Refer to http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html

Maybe you could use String#split
To answer your comment:
I'm not sure if there can be more than one . in a filename, but whatever, even if there are more dots you can use the split. Consider e.g. that:
String input = "boo.and.foo";
// split() takes a regular expression, so the dot must be escaped
String[] result = input.split("\\.");
This will return an array containing:
{ "boo", "and", "foo" }
So you will know that the last index in the array is the extension and all others are the base.
