How should I process really really large amounts of text in Java?


I'm trying to tokenize a large amount of text in Java. When I say large, I mean entire chapters of books at a time. I wrote the first draft of my code by using a single page from a book and everything worked fine. Now that I'm trying to process entire chapters things aren't working. It processes part of the chapter correctly and then it just stops.
Below is all of the relevant code:
File folder = new File(Constants.rawFilePath("eng"));
FileHelper fileHelper = new FileHelper();
BPage firstChapter = new BPage();
BPage firstChapterSpanish = new BPage();
File[] allFiles = folder.listFiles();
//read the files into memory
ArrayList<ArrayList<String>> allPages = new ArrayList<ArrayList<String>>();
//for the english
for(int i=0;i<allFiles.length;i++)
{
String filePath = Constants.rawFilePath("/eng/metamorph_eng_"+String.valueOf(i)+".txt");
ArrayList<String> pageToAdd = fileHelper.readFileToMemory(filePath);
allPages.add(pageToAdd);
}
String allPagesAsString = "";
for(int i=0;i<allPages.size();i++)
{
allPagesAsString = allPagesAsString+fileHelper.turnListToString(allPages.get(i));
}
firstChapter.setUnTokenizedPage(allPagesAsString);
firstChapter.tokenize(Languages.ENGLISH);
folder = new File(Constants.rawFilePath("spa"));
allFiles = folder.listFiles();
//for the spanish
for(int i=0;i<allFiles.length;i++)
{
String filePath = Constants.rawFilePath("/eng/metamorph_eng_"+String.valueOf(i)+".txt");
ArrayList<String> pageToAdd = fileHelper.readFileToMemory(filePath);
allPages.add(pageToAdd);
}
allPagesAsString = "";
for(int i=0;i<allPages.size();i++)
{
allPagesAsString = allPagesAsString+fileHelper.turnListToString(allPages.get(i));
}
firstChapterSpanish.setUnTokenizedPage(allPagesAsString);
firstChapterSpanish.tokenize(Languages.SPANISH);
fileHelper.writeFile(firstChapter.getTokenizedPage(), Constants.partiallyprocessedFilePath("eng_ch_1.txt"));
fileHelper.writeFile(firstChapterSpanish.getTokenizedPage(), Constants.partiallyprocessedFilePath("spa_ch_1.txt"));
}
Even though I'm reading all of the files in the directory where I expect my text to be, only the first couple of files are added to the string that I'm processing. The code still runs, but it only adds characters to my string up to a certain point.
What do I have to change so that I can process all of my files at once?

This part
String allPagesAsString = "";
for(int i=0;i<allPages.size();i++)
{
allPagesAsString = allPagesAsString+
fileHelper.turnListToString(allPages.get(i));
}
will be really slow if you're copying larger strings, because every concatenation copies everything accumulated so far.
Using a StringBuilder will speed things up a bit:
int expectedBookSize = 10000;
StringBuilder allPagesAsString = new StringBuilder(expectedBookSize);
for(int i=0;i<allPages.size();i++)
{
allPagesAsString.append(fileHelper.turnListToString(allPages.get(i)));
}
Can't you process one page at a time? That would be the best solution.
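As a rough illustration of the page-at-a-time idea, here is a minimal sketch. The class and method names (PageAtATime, tokenize, tokenizeChapter) are hypothetical, and the whitespace splitter only stands in for BPage.tokenize(), whose implementation is not shown in the question. Separately, it may be worth checking that the Spanish loop in the question builds its paths from "/eng/metamorph_eng_" and appends to the same allPages list as the English loop.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class PageAtATime {

    // Hypothetical stand-in for BPage.tokenize(): splits the text on whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Reads and tokenizes each page file in turn, so only one page's
    // raw text is held as a String at any moment.
    static List<String> tokenizeChapter(List<Path> pageFiles) throws IOException {
        List<String> allTokens = new ArrayList<>();
        for (Path page : pageFiles) {
            String text = new String(Files.readAllBytes(page), StandardCharsets.UTF_8);
            allTokens.addAll(tokenize(text));
        }
        return allTokens;
    }
}
```

With this shape, no chapter-sized string is ever built; if the tokens themselves are too large to keep, they could be written out after each page instead of collected.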

Related

Stop Bullet number to be updated automatically when merging word docs using docx4j

I am trying to merge 2 docx files, each of which has its own bullet numbering. After merging the Word docs, the bullet numbers are updated automatically.
E.g:
Doc A has 1 2 3
Doc B has 1 2 3
After merging, the bullet numbering is updated to 1 2 3 4 5 6.
How can I stop this?
I am using the following code:
if(counter==1)
{
FirstFileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
FirstFileIS = new java.io.ByteArrayInputStream(FirstFileByteStream);
FirstWordFile = org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(FirstFileIS);
main = FirstWordFile.getMainDocumentPart();
//Add page break for Table of Content
main.addObject(objBr);
if (htmlCode != null) {
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html,htmlCode.toString().getBytes());
}
//Table of contents - End
}
else
{
FileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
FileIS = new java.io.ByteArrayInputStream(FileByteStream);
byte[] bytes = IOUtils.toByteArray(FileIS);
AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/part" + (chunkCount++) + ".docx"));
afiPart.setContentType(new ContentType(CONTENT_TYPE));
afiPart.setBinaryData(bytes);
Relationship altChunkRel = main.addTargetPart(afiPart);
CTAltChunk chunk = Context.getWmlObjectFactory().createCTAltChunk();
chunk.setId(altChunkRel.getId());
main.addObject(objBr);
htmlCode = new StringBuilder();
htmlCode.append("<html>");
htmlCode.append("<h2><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><p style=\"font-family:'Arial Black'; color: #f35b1c\">"+ReqName+"</p></h2>");
htmlCode.append("</html>");
if (htmlCode != null) {
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html,htmlCode.toString().getBytes());
}
//Add Page Break before new content
main.addObject(objBr);
//Add new content
main.addObject(chunk);
}

Looking at your code, you are adding HTML altChunks to your document.
For these to display in Word, the HTML is converted to normal docx content.
An altChunk is usually converted by Word when you open the docx.
(Alternatively, docx4j-ImportXHTML can do it for an altChunk of type XHTML)
The upshot is that what happens with the bullets (when Word converts your HTML) is largely outside your control. You could experiment with CSS but I think Word will mostly ignore it.
An alternative may be to use XHTML altChunks, and have docx4j-ImportXHTML convert them. main.convertAltChunks()
If the same problem occurs when you try that, well, at least we can address it.

I was able to fix my issue using the following code. I found it at http://webapp.docx4java.org/OnlineDemo/forms/upload_MergeDocx.xhtml. You can also generate your own custom code; they have a nice demo that generates code according to your requirements :).
public final static String DIR_IN = System.getProperty("user.dir")+ "/";
public final static String DIR_OUT = System.getProperty("user.dir")+ "/";
public static void main(String[] args) throws Exception
{
String[] files = {"part1docx_20200717t173750539gmt.docx", "part1docx_20200717t173750539gmt (1).docx", "part1docx_20200717t173750539gmt.docx"};
List blockRanges = new ArrayList();
for (int i=0 ; i< files.length; i++) {
BlockRange block = new BlockRange(WordprocessingMLPackage.load(new File(DIR_IN + files[i])));
blockRanges.add( block );
block.setStyleHandler(StyleHandler.RENAME_RETAIN);
block.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);
block.setRestartPageNumbering(false);
block.setHeaderBehaviour(HfBehaviour.DEFAULT);
block.setFooterBehaviour(HfBehaviour.DEFAULT);
block.setSectionBreakBefore(SectionBreakBefore.NEXT_PAGE);
}
// Perform the actual merge
DocumentBuilder documentBuilder = new DocumentBuilder();
WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);
// Save the result
SaveToZipFile saver = new SaveToZipFile(output);
saver.save(DIR_OUT+"OUT_MergeWholeDocumentsUsingBlockRange.docx");
}

Implementing language locales into array to be used in a loop

I'm trying to read every file in a directory, clean it up using java.util.Locale, then write it to a new directory. The reading and writing methods work; Locale.SPANISH might be the issue, as I have read in other posts.
I iterated through the available languages in java.util.Locale, and Spanish was in there.
First, the array issue: the extract of code below is the long way of entering the Locale.(LANGUAGE) values into the array. This seems to work fine. However, I can't understand why the 'short' way doesn't seem to work.
String[] languageLocale = new String[fileArray.length];
languageLocale[0] = "Locale.ENGLISH";
languageLocale[1] = "Locale.FRENCH";
languageLocale[2] = "Locale.GERMAN";
languageLocale[3] = "Locale.ITALIAN";
languageLocale[4] = "Locale.SPANISH";
The short way:
String[] languageLocale = new String[("Locale.ENGLISH" , "Locale.FRENCH" , "Locale.GERMAN" , "Locale.ITALIAN" , "Locale.SPANISH")];
I need to put the Locale.(language) values into strings so they can be used in the following:
File file = new File("\\LanguageGuessing5.0\\Learning\\");
File[] fileArray = file.listFiles();
ArrayList<String> words = new ArrayList<String>();
for (int i = 0; i < fileArray.length; i++) {
if (fileArray[i].isFile()) {
if (fileArray[i].isHidden()) {
continue;
} else {
String content = readUTF8File("\\LanguageGuessing5.0\\Learning\\"+fileArray[i].getName());
words = extractWords(content, languageLocale[i]);
outputWordsToUTF8File("\\LanguageGuessing5.0\\Model\\"+ fileArray[i].getName() + "out.txt", words);
}
} else if (fileArray[i].isDirectory()) {
System.out.println("Directory " + fileArray[i].getName());
}
}
The following method call:
words = extractWords(content, languageLocale[i]);
also presents the following error:
The method extractWords(String, Locale) in the type CleaningText(the class name) is not applicable for the arguments (String, String)
My understanding is that while the array argument is not a locale, the string holds the correct text to make it valid. I'm clearly incorrect, I'm hoping someone could explain how this works.
The input types of the methods are below for context:
public static String readUTF8File(String filePath)
public static ArrayList extractWords(String inputText, Locale currentLocale)
public static void outputWordsToUTF8File(String filePath, ArrayList wordList)
Many thanks in advance
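For illustration, here is a minimal sketch of what an array of actual Locale objects could look like. A String holding the text "Locale.ENGLISH" is only text; Java does not evaluate it as an expression, which is why extractWords(String, Locale) rejects it. Array initializers use braces rather than parentheses. java.util.Locale defines constants for ENGLISH, FRENCH, GERMAN and ITALIAN but has no SPANISH constant, so Spanish is built from its language tag here. The class and field names are hypothetical.

```java
import java.util.Locale;

class LocaleArrayExample {

    // An array of Locale objects, not Strings, so each element can be
    // passed directly to a method that expects a Locale parameter.
    static final Locale[] LANGUAGE_LOCALES = {
        Locale.ENGLISH,
        Locale.FRENCH,
        Locale.GERMAN,
        Locale.ITALIAN,
        Locale.forLanguageTag("es") // Spanish: there is no Locale.SPANISH constant
    };
}
```

Note that pairing LANGUAGE_LOCALES[i] with fileArray[i] assumes the files come back in a known order, which File.listFiles() does not guarantee; deriving the locale from each file's name may be more robust.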

ArrayList<String> in PDF from a new row

I want to send a survey as a PDF from Java, and I have tried different methods. I tried with and without StringBuffer, but the text in the PDF always appears on one row.
public void writePdf(OutputStream outputStream) throws Exception {
Paragraph paragraph = new Paragraph();
Document document = new Document();
PdfWriter.getInstance(document, outputStream);
document.open();
document.addTitle("Survey PDF");
ArrayList nameArrays = new ArrayList();
StringBuffer sb = new StringBuffer();
int i = -1;
for (String properties : textService.getAnswer()) {
nameArrays.add(properties);
i++;
}
for (int a= 0; a<=i; a++){
System.out.println("nameArrays.get(a) -"+nameArrays.get(a));
sb.append(nameArrays.get(a));
}
paragraph.add(sb.toString());
document.add(paragraph);
document.close();
}
textService.getAnswer() returns an ArrayList<String>.
Could you please advise how to separate the text so that each new sentence starts on a new row?
Now I see like this:

You forgot the newline character \n and your code seems a bit overcomplicated.
Try this:
StringBuffer sb = new StringBuffer();
for (String property : textService.getAnswer()) {
sb.append(property);
sb.append('\n');
}
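If the answers are already in a list, the same joining can be sketched with String.join from the standard library. The class and method names below are hypothetical, and unlike the loop above this version adds no newline after the last element.

```java
import java.util.List;

class JoinLines {

    // Joins the answers with a newline between consecutive elements.
    static String joinWithNewlines(List<String> answers) {
        return String.join("\n", answers);
    }
}
```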

What about:
nameArrays.add(properties+"\n");

You might be able to fix that by simply appending "\n" to the strings that you are collecting in your list, but I think that very much depends on the PDF library you are using.
You see, "newlines" or "paragraphs" are to a certain degree about formatting. It seems like a conceptual problem to add that "formatting" information to the data that you are processing.
Meaning: you might want to check if your library allows you to provide strings and then have the library do the formatting for you!
In other words: instead of supplying strings with newlines, check whether you can keep using strings without newlines and have the PDF library add line breaks where appropriate.
Side note on code quality: you are using raw types:
ArrayList nameArrays = new ArrayList();
would be better written as
ArrayList<String> names = new ArrayList<>();
[ I also changed the name - there is no point in putting the type of a collection into the variable name! ]

This method saves the values of an array list into a PDF document. In the mFilePath variable you can put a folder name in place of "/", for example "/example/".
For the mFileName variable you can use any name. I use the date and time at which the document is created. Don't use a static name, otherwise every run overwrites the same PDF.
private void savePDF()
{
com.itextpdf.text.Document mDoc = new com.itextpdf.text.Document();
String mFileName = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss", Locale.getDefault()).format(System.currentTimeMillis());
String mFilePath = Environment.getExternalStorageDirectory() + "/" + mFileName + ".pdf";
try
{
PdfWriter.getInstance(mDoc, new FileOutputStream(mFilePath));
mDoc.open();
for(int d = 0; d < g; d++)
{
String mtext = answers.get(d);
mDoc.add(new Paragraph(mtext));
}
mDoc.close();
}
catch (Exception e)
{
}
}

How can I work around a list resizing issue when an ArrayList is not an option to use?

We have a search tool that displays results in a table. The list I have to compose and return has to be something like this:
List[] res = new List[some_int_initializer];
return res;
The problem is that an array like this cannot be resized, and I do not know in advance how many rows it will need.
while(collection.iterator().hasNext())
{
StringBuilder sb = new StringBuilder();
sb.append("C:\\test\\sage\\data\\");
List<String> myList = collection.iterator().next();
// iterate through string list and compose file paths
for(String name: myList)
{
Matcher matcher = samplePattern.matcher(sampleName);
if(matcher.find())
{
sb.append(matcher.group(1));
sb.append(matcher.group(2));
sb.append(matcher.group(3));
sb.append("000\\nmr\\");
sb.append(name);
}
File file = new File(sb.toString());
int counter = 0;
List[] res = new List[myList.size()];
if(file.exists())
{
File[] dirs = file.listFiles();
for(int step=0;step<dirs.length;step++)
{
List row = new ArrayList();
row.add(name);
row.add(dirs[step].getAbsolutePath());
res[counter++] = row;
}
}
}
}
The name and path have to be displayed on a row, but a name can have more than one path associated with it. Also, even if the file path does not exist, the name must still show in the table. This makes it really difficult to size the array, especially when I have to add each ArrayList to 'res'.
Any thoughts or ideas appreciated.
UPDATE
Thanks to all who have responded. This is the solution that worked for me:
List[] results = allRows.toArray(new List[allRows.size()]);
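To show where allRows might come from, here is a minimal sketch: rows are collected in a growable list and converted to the required List[] only at the end. The class name, the buildRows method and the pathsByName map are hypothetical; the map stands in for the file system lookup in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class RowCollector {

    // Builds one row per (name, path) pair. A name with no paths still
    // gets a single row with an empty path, so it appears in the table.
    @SuppressWarnings("rawtypes")
    static List[] buildRows(List<String> names, Map<String, List<String>> pathsByName) {
        List<List<String>> allRows = new ArrayList<>();
        for (String name : names) {
            List<String> paths = pathsByName.getOrDefault(name, Collections.emptyList());
            if (paths.isEmpty()) {
                allRows.add(Arrays.asList(name, ""));
            } else {
                for (String path : paths) {
                    allRows.add(Arrays.asList(name, path));
                }
            }
        }
        // Convert to the fixed-size array type only once the final size is known.
        return allRows.toArray(new List[allRows.size()]);
    }
}
```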


Swt file dialog too much files selected?

The SWT file dialog gives me an empty result array if I select too many files (approx. more than 2500 files). The listing shows how I use this dialog. If I select too many sound files, the System.out.println at the end prints 0. Debugging tells me that the files array is empty in this case. Is there any way to get this to work?
FileDialog fileDialog = new FileDialog(mainView.getShell(), SWT.MULTI);
fileDialog.setText("Choose sound files");
fileDialog.setFilterExtensions(new String[] { new String("*.wav") });
Vector<String> result = new Vector<String>();
fileDialog.open();
String[] files = fileDialog.getFileNames();
for (int i = 0, n = files.length; i < n; i++) {
if( !files[i].contains(".wav")) {
System.out.println(files[i]);
}
StringBuffer stringBuffer = new StringBuffer();
stringBuffer.append(fileDialog.getFilterPath());
if (stringBuffer.charAt(stringBuffer.length() - 1) != File.separatorChar) {
stringBuffer.append(File.separatorChar);
}
stringBuffer.append(files[i]);
stringBuffer.append("");
String finalName = stringBuffer.toString();
if( !finalName.contains(".wav")) {
System.out.println(finalName);
}
result.add(finalName);
}
System.out.println(result.size());

I've looked at the FileDialog source code and I'm afraid there is an upper boundary: a 32kB byte buffer for all 0-terminated filenames (if I understood it correctly).
So, calculating with your values, if the average length of your filename strings is around 12 characters, then you've hit exactly that upper boundary.
So the only way out is to select the files in two or more steps.
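The estimate can be checked with a little arithmetic. The sketch below assumes one byte per character plus one terminating 0 byte per filename, which is an assumption about the buffer layout rather than something confirmed by the source; the class and method names are hypothetical.

```java
class FileDialogEstimate {

    // Approximate number of filenames that fit in the buffer, assuming
    // each name takes its length in bytes plus one terminating 0 byte.
    static int maxFileNames(int bufferBytes, int averageNameLength) {
        return bufferBytes / (averageNameLength + 1);
    }
}
```

For a 32 * 1024 byte buffer and 12-character names this gives 32768 / 13 = 2520, which is close to the roughly 2500 files reported in the question.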
