I am trying to make a program that will search the first 1 billion digits of pi for a user-specified number. The problem I am facing is how to use pi... I have a .txt file that contains pi (I also broke it into 96 different files because Java couldn't handle such a big file); all the digits are on the first line...
Code (only to read and save pi using the 96 files):
List<String> ar = new ArrayList<String>();
for (int i = 1; i <= 96; i++) {
    String filename = "";
    if (i <= 9) {
        filename = "res//t//output2_00000" + i + "(500001).txt";
    } else {
        filename = "res//t//output2_0000" + i + "(500001).txt";
    }
    Scanner inFile = new Scanner(new FileReader(filename));
    ar.add(inFile.nextLine());
}
List<String> pi = new ArrayList<String>();
for (int i = 0; i < ar.size(); i++) {
    System.out.println(i);
    for (String j : ar.get(i).split("")) {
        pi.add(j);
    }
}
This seems to work fine up to a point, where it crashes with the following error (the last index printed is 3):
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.String.substring(Unknown Source)
at java.lang.String.subSequence(Unknown Source)
at java.util.regex.Pattern.split(Unknown Source)
at java.lang.String.split(Unknown Source)
at java.lang.String.split(Unknown Source)
at main.Main.main(Main.java:29)
Is there a way to overcome that, and is there a way to make it go faster?
Thanks in advance.
You are not required to load the whole file into memory. With RandomAccessFile, you can open the file, place the cursor at the position you want, and read from there:
RandomAccessFile raf = new RandomAccessFile(
new File("/home/adenoyelle/dev/pi.txt"), "r");
raf.seek(1_000_000);
System.out.println(raf.read());
Note: raf.read() returns a single byte of data (as an int). You might need to reinterpret it depending on what you need.
Example:
for (int i = 0; i < 10; i++) {
    raf.seek(i);
    System.out.println((char) raf.read());
}
Output:
3
.
1
4
1
5
9
2
6
5
Note 2: As stated by SaviourSelf, if you need to read multiple bytes at a time, prefer read(byte[] b).
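Putting both notes together, a rough sketch of scanning the file in chunks with read(byte[]) might look like this (the target digits and the overlap handling are illustrative, not taken from the question; exception handling is omitted as in the snippets above):
RandomAccessFile raf = new RandomAccessFile(new File("/home/adenoyelle/dev/pi.txt"), "r");
String target = "141592";                     // the user-specified digits (example value)
byte[] chunk = new byte[1 << 20];             // 1 MB per read(byte[]) call
StringBuilder window = new StringBuilder();
long windowStart = 0;                         // absolute offset of window.charAt(0) in the file
int read;
while ((read = raf.read(chunk)) != -1) {
    window.append(new String(chunk, 0, read, StandardCharsets.US_ASCII));
    int hit = window.indexOf(target);
    if (hit >= 0) {
        System.out.println("Found at offset " + (windowStart + hit));
        break;
    }
    // keep a short tail so a match spanning two chunks is still found
    int drop = Math.max(0, window.length() - (target.length() - 1));
    window.delete(0, drop);
    windowStart += drop;
}
raf.close();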
Don't split up the text file: that's the wrong solution, and finding a number that's split across file boundaries will be a pain. Of course Java can handle large files: how else do you think databases written in Java work?!
Consider using the Apache Commons IO, which gives you a LineIterator:
LineIterator it = FileUtils.lineIterator(theFile, "UTF-8"/*probably*/);
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
} finally {
    LineIterator.closeQuietly(it);
}
You will likely run out of heap memory if you try to load over 1 GB of data into the heap. Just check each file for the search string and then close the file.
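A sketch of that per-file check, reusing the LineIterator from above (the file-name pattern follows the question; the target string is an example, and a match that spans two of the 96 files or two lines would still need extra handling):
String target = "141592";
for (int i = 1; i <= 96; i++) {
    String filename = String.format("res//t//output2_%06d(500001).txt", i);
    LineIterator it = FileUtils.lineIterator(new File(filename), "UTF-8");
    try {
        while (it.hasNext()) {
            int hit = it.nextLine().indexOf(target);
            if (hit >= 0) {
                System.out.println("Found in " + filename + " at column " + hit);
            }
        }
    } finally {
        LineIterator.closeQuietly(it);
    }
}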
I have a large XML file, 10 GB in size, and I want to create a new XML file generated from the first record of the large file. I tried to do this in Java and Python, but I got a memory error since I was loading the entire data.
In another post, someone suggested XSLT is the best solution for this. I'm new to XSLT and don't know how to do this in XSLT, so please suggest a stylesheet to do it...
Large XML file (10 GB) sample:
<MemberDataExport xmlns="http://www.payback.net/lmsglobal/batch/memberdataexport" xmlns:types="http://www.payback.net/lmsglobal/xsd/v1/types">
    <MembershipInfoListItem>
        <MembershipIdentifier>PB00000000001956044</MembershipIdentifier>
        <ParticipationStatus>1</ParticipationStatus>
        <DataSharing>1</DataSharing>
        <MasterInfo>
            <Gender>1</Gender>
            <Salutation>1</Salutation>
            <FirstName>Hazel</FirstName>
            <LastName>Sweetman</LastName>
            <DateOfBirth>1957-03-25</DateOfBirth>
        </MasterInfo>
    </MembershipInfoListItem>
    <Header>
        <BusinessPartner>CHILIS_US</BusinessPartner>
        <FileType>mde</FileType>
        <FileNumber>17</FileNumber>
        <FormatVariant>1</FormatVariant>
        <NumberOfRecords>22</NumberOfRecords>
        <CreationDate>2016-06-07T12:00:46-07:00</CreationDate>
    </Header>
    <MembershipInfoListItem>
        <MembershipIdentifier>PB00000000001956044</MembershipIdentifier>
        <ParticipationStatus>1</ParticipationStatus>
        <DataSharing>1</DataSharing>
        <MasterInfo>
            <Gender>1</Gender>
            <Salutation>1</Salutation>
            <FirstName>Hazel</FirstName>
            <LastName>Sweetman</LastName>
            <DateOfBirth>1957-03-25</DateOfBirth>
        </MasterInfo>
    </MembershipInfoListItem>
    .....
    .....
</MemberDataExport>
I want to create a file like this:
<MemberDataExport xmlns="http://www.payback.net/lmsglobal/batch/memberdataexport" xmlns:types="http://www.payback.net/lmsglobal/xsd/v1/types">
    <MembershipInfoListItem>
        <MembershipIdentifier>PB00000000001956044</MembershipIdentifier>
        <ParticipationStatus>1</ParticipationStatus>
        <DataSharing>1</DataSharing>
        <MasterInfo>
            <Gender>1</Gender>
            <Salutation>1</Salutation>
            <FirstName>Hazel</FirstName>
            <LastName>Sweetman</LastName>
            <DateOfBirth>1957-03-25</DateOfBirth>
        </MasterInfo>
    </MembershipInfoListItem>
</MemberDataExport>
Is there any other way I can do this without getting a memory error? Please suggest that too.
In Python (which you mentioned besides Java), you could use ElementTree.iterparse and then break out of the parse once you have found the element(s) you want to copy:
import xml.etree.ElementTree as ET

count = 0
copy = 1  # set this to the number of second-level (i.e. children of the root) elements you want to copy
level = -1

for event, elem in ET.iterparse('input1.xml', events=('start', 'end')):
    if event == 'start':
        level = level + 1
        if level == 0:
            result = ET.ElementTree(ET.Element(elem.tag))
    if event == 'end':
        level = level - 1
        if level == 0:
            count = count + 1
            if count <= copy:
                result.getroot().append(elem)
            else:
                break

result.write('result1.xml', 'UTF-8', True, 'http://www.payback.net/lmsglobal/batch/memberdataexport')
As for better namespace prefix preservation, I have had some success using the start-ns event and registering the collected namespaces on the ElementTree:
import xml.etree.ElementTree as ET

count = 0
copy = 1  # set this to the number of second-level (i.e. children of the root) elements you want to copy
level = -1

for event, elem in ET.iterparse('input1.xml', events=('start', 'end', 'start-ns')):
    if event == 'start':
        level = level + 1
        if level == 0:
            result = ET.ElementTree(ET.Element(elem.tag))
    if event == 'end':
        level = level - 1
        if level == 0:
            count = count + 1
            if count <= copy:
                result.getroot().append(elem)
            else:
                break
    if event == 'start-ns':
        ET.register_namespace(elem[0], elem[1])

result.write('result1.xml', 'UTF-8', True)
You didn't show your code, so we can't possibly know what you're doing right or wrong. However, I'd bet any parser that builds the whole document in memory would need to load the entire file just to check that the syntax is OK (no missing tags, etc.), and that will surely cause an OutOfMemoryError for a 10 GB file.
So, just in this case, my approach would be to read the file line by line using a BufferedReader (see How to read a large text file line by line using Java?) and simply stop when you reach a line that contains your closing tag, i.e. </MembershipInfoListItem>:
StringBuilder sb = new StringBuilder("<MemberDataExport xmlns=\"http://www.payback.net/lmsglobal/batch/memberdataexport\" xmlns:types=\"http://www.payback.net/lmsglobal/xsd/v1/types\">");
sb.append(System.lineSeparator());
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String line;
    while ((line = br.readLine()) != null) {
        // process the line
        sb.append(line);
        sb.append(System.lineSeparator());
        if (line.contains("</MembershipInfoListItem>")) {
            break;
        }
    }
    sb.append("</MemberDataExport>");
} catch (IOException | AnyOtherExceptionNeeded ex) {
    // log or rethrow
}
Now sb.toString() will return what you want.
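If you then want the trimmed document on disk rather than in memory, one way (a sketch; the output file name is a placeholder) is to write the builder's content out in one go:
try (BufferedWriter bw = new BufferedWriter(new FileWriter("first-record.xml"))) {
    bw.write(sb.toString());
} catch (IOException ex) {
    // log or rethrow
}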
I have code that reads a file using a BufferedReader and split(). The file was created by a method that automatically adds 4 KB of empty space at the beginning of the file, and as a result, the following happens when I read it.
First, the code:
BufferedReader metaRead = new BufferedReader(new FileReader(metaFile));
String metaLine = "";
String[] metaData = new String[100000];
while ((metaLine = metaRead.readLine()) != null) {
    metaData = metaLine.split(",");
    for (int i = 0; i < metaData.length; i++) {
        System.out.println(metaData[i]);
    }
}
This is the result (keep in mind the file already exists and contains the values):
//4096 spaces then the first actual word in the document which is --> testTable2
Name
java.lang.String
true
No Reference
Is there a way to skip the first 4096 spaces and get straight to the actual values within the file, so I can get the result normally? I'll be using the metaData array later in other operations, and I'm pretty sure the spaces will mess up the number of slots within the array. Any suggestions would be appreciated.
If you're using Eclipse, the auto-completion should help.
metaRead.skip(4096);
https://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
You could (as mentioned) simply do:
metaRead.skip(4096);
if the whitespace always occupies exactly that many characters. Alternatively, you could just skip lines that are empty after trimming:
while ((metaLine = metaRead.readLine()) != null) {
    if (metaLine.trim().length() > 0) {
        metaData = metaLine.split(",");
        for (int i = 0; i < metaData.length; i++) {
            System.out.println(metaData[i]);
        }
    }
}
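If the padding is not guaranteed to be exactly 4096 characters, here is a sketch of skipping whatever leading whitespace happens to be there (mark/reset makes sure the first real character is not lost):
BufferedReader metaRead = new BufferedReader(new FileReader(metaFile));
int c;
do {
    metaRead.mark(1);
    c = metaRead.read();
} while (c != -1 && Character.isWhitespace(c));
metaRead.reset(); // step back onto the first non-whitespace character, then read lines as usual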
I want to read a CSV file containing millions of rows and use the attributes in my decision tree algorithm. My code is below:
String csvFile = "myfile.csv";
List<String[]> rowList = new ArrayList<>();
String line = "";
String cvsSplitBy = ",";
String encoding = "UTF-8";
BufferedReader br2 = null;
try {
    int counterRow = 0;
    br2 = new BufferedReader(new InputStreamReader(new FileInputStream(csvFile), encoding));
    while ((line = br2.readLine()) != null) {
        line = line.replaceAll(",,", ",NA,");
        String[] object = line.split(cvsSplitBy);
        rowList.add(object);
        counterRow++;
    }
    System.out.println("counterRow is: " + counterRow);
    for (int i = 1; i < rowList.size(); i++) {
        try {
            // this method contains many if/else statements only.
            ImplementDecisionTreeRulesFor2012(rowList.get(i)[0], rowList.get(i)[1], rowList.get(i)[2],
                    rowList.get(i)[3], rowList.get(i)[4], rowList.get(i)[5], rowList.get(i)[6]);
        } catch (Exception ex) {
            System.out.println("Exception occurred");
        }
    }
} catch (Exception ex) {
    System.out.println("fix" + ex);
}
It works fine when the CSV file is not large. However, mine is large indeed, so I need a faster way to read the CSV. Is there any advice? Appreciated, thanks.
Just use uniVocity-parsers' CSV parser instead of trying to build your custom parser. Your implementation will probably not be fast or flexible enough to handle all corner cases.
It is extremely memory efficient and you can parse a million rows in less than a second. This link has a performance comparison of many Java CSV libraries, and univocity-parsers comes out on top.
Here's a simple example of how to use it:
CsvParserSettings settings = new CsvParserSettings(); // you'll find many options here, check the tutorial.
CsvParser parser = new CsvParser(settings);
// parses all rows in one go (you should probably use a RowProcessor or iterate row by row if there are many rows)
List<String[]> allRows = parser.parseAll(new File("/path/to/your.csv"));
BUT, that loads everything into memory. To stream all rows, you can do this:
String[] row;
parser.beginParsing(csvFile);
while ((row = parser.parseNext()) != null) {
    // process row here.
}
The faster approach is to use a RowProcessor; it also gives more flexibility:
settings.setRowProcessor(myChosenRowProcessor);
CsvParser parser = new CsvParser(settings);
parser.parse(csvFile);
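A minimal sketch of such a processor (the class and method names below are from memory of the univocity API, so check the tutorial for your version; the decision-tree call is the one from the question):
CsvParserSettings settings = new CsvParserSettings();
settings.setRowProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        // feed each row straight into your algorithm instead of collecting it in a list
        // ImplementDecisionTreeRulesFor2012(row[0], row[1], row[2], row[3], row[4], row[5], row[6]);
    }
});
CsvParser parser = new CsvParser(settings);
parser.parse(new File("myfile.csv"));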
Lastly, it has built-in routines that use the parser to perform some common tasks (iterate java beans, dump ResultSets, etc)
This should cover the basics, check the documentation to find the best approach for your case.
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
In this snippet I see two issues which will slow you down considerably:
while ((line = br2.readLine()) != null) {
    line = line.replaceAll(",,", ",NA,");
    String[] object = line.split(cvsSplitBy);
    rowList.add(object);
    counterRow++;
}
First, rowList starts with the default capacity and will have to be enlarged many times, each time copying the old underlying array to a new one.
Worse, however, is the excessive blow-up of the data into a String[] object. You need the columns/cells only when you call ImplementDecisionTreeRulesFor2012 for that row, not all the time while you read the file and process all the other rows. Move the split (or something better, as suggested in the comments) into the second loop.
(Creating many objects is bad, even if you can afford the memory.)
Perhaps it would be better to call ImplementDecisionTreeRulesFor2012 while you read the "millions"? It would avoid the rowList ArrayList altogether.
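A rough sketch of that (same reading and replaceAll as in the question, just without storing the rows; this assumes the rows really can be processed independently):
try (BufferedReader br2 = new BufferedReader(new InputStreamReader(new FileInputStream(csvFile), encoding))) {
    String line;
    boolean firstRow = true;
    while ((line = br2.readLine()) != null) {
        if (firstRow) {           // the original loop starts at index 1, so skip the first row
            firstRow = false;
            continue;
        }
        String[] cells = line.replaceAll(",,", ",NA,").split(",");
        ImplementDecisionTreeRulesFor2012(cells[0], cells[1], cells[2], cells[3], cells[4], cells[5], cells[6]);
    }
}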
Later
Postponing the split reduces the execution time for 10 million rows
from 1m8.262s (when the program ran out of heap space) to 13.067s.
If you aren't forced to read all rows before you can call Implp...2012, the time reduces to 4.902s.
Finally writing the split and replace by hand:
String[] object = new String[7];
//...read...
String x = line + ",";
int iPos = 0;
int iStr = 0;
int iNext = -1;
while( (iNext = x.indexOf( ',', iPos )) != -1 && iStr < 7 ){
if( iNext == iPos ){
object[iStr++] = "NA";
} else {
object[iStr++] = x.substring( iPos, iNext );
}
iPos = iNext + 1;
}
// add more "NA" if rows can have less than 7 cells
reduces the time to 1.983s. This is about 30 times faster than the original code, which runs into an OutOfMemoryError anyway.
On top of the aforementioned univocity, it's worth checking:
https://github.com/FasterXML/jackson-dataformat-csv
http://simpleflatmapper.org/0101-getting-started-csv.html, which also has a low-level API that bypasses String creation.
The three of them were, at the time of this comment, the fastest CSV parsers.
Chances are that writing your own parser would be slower and buggier.
If you're aiming for objects (i.e. data-binding), I've written a high-performance library, sesseltjonna-csv, that you might find interesting. A benchmark comparison with SimpleFlatMapper and uniVocity is here.
I'm trying to count the pages of a Word document with Java.
This is my current code; I'm using the Apache POI libraries:
String path1 = "E:/iugkh";
File f = new File(path1);
File[] files = f.listFiles();

int pagesCount = 0;
for (int i = 0; i < files.length; i++) {
    POIFSFileSystem fis = new POIFSFileSystem(new FileInputStream(files[i]));
    HWPFDocument wdDoc = new HWPFDocument(fis);
    int pagesNo = wdDoc.getSummaryInformation().getPageCount();
    pagesCount += pagesNo;
    System.out.println(files[i].getName() + ":\t" + pagesNo);
}
The output is:
ten.doc: 1
twelve.doc: 1
nine.doc: 1
one.doc: 1
eight.doc: 1
4teen.doc: 1
5teen.doc: 1
six.doc: 1
seven.doc: 1
And this is not what I expected, as the first three documents are 4 pages long and the others are from 1 to 5 pages long.
What am I missing?
Do I have to use another library to count the pages correctly?
Thanks in advance.
This may help you. It counts the number of form feeds (sometimes used to separate pages), but I'm not sure it will work for all documents (I suspect it won't).
WordExtractor extractor = new WordExtractor(document);
String[] paragraphs = extractor.getParagraphText();
int pageCount = 1;
for (int i = 0; i < paragraphs.length; ++i) {
    if (paragraphs[i].indexOf("\f") >= 0) {
        ++pageCount;
    }
}
System.out.println(pageCount);
This, alas, is a bug in some versions of Word (pre-2010 versions apparently, possibly just Word 9.0, a.k.a. 2000), or at least in some versions of the COM previewer that's used to count the pages. The Apache devs declined to implement a workaround for it: https://issues.apache.org/jira/browse/TIKA-1523
In fact, when you open the file in Word, it of course shows the real pages and also recalculates the count, but initially it too shows "1". Here, however, the metadata saved in the file is simply "1", or maybe nothing (see below). POI does not "reflow" the layout to calculate that information.
That is because the metadata is only updated by the word processing program when it opens and edits the file. If you instruct Word 2010 to open the file read-only (which it does when the file was downloaded from the internet), it shows "" in the page column (see the 2nd screenshot). So this is clearly a bug in the file, not TIKA's or POI's issue.
I also found there that the bug (for Word 9.0/2000) was confirmed by MS: http://support.microsoft.com/kb/212653/en-us
If opening and re-saving with a newer version of Word is not possible/available, another workaround would be to convert the documents to PDF (or even XPS) and count the pages of that.
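If you do go the PDF route, counting pages of the converted files is then trivial with a PDF library. A sketch using Apache PDFBox (not part of the original post; load/getNumberOfPages are PDFBox 2.x API, and the file name is a placeholder):
try (PDDocument pdf = PDDocument.load(new File("ten.pdf"))) {
    System.out.println("ten.pdf: " + pdf.getNumberOfPages());
}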
So I have large (around 4 GB each) txt files in pairs, and I need to create a third file consisting of the two files in shuffle mode. The following equation presents it best:
3rdfile = (4 lines from file 1) + (4 lines from file 2), and this is repeated until I hit the end of file 1 (both input files will have the same length - this is by definition). Here is the code I'm using now, but it doesn't scale very well on large files. I was wondering if there is a more efficient way to do this - would working with memory-mapped files help? All ideas are welcome.
public static void mergeFastq(String forwardFile, String reverseFile, String outputFile) {
    try {
        BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile));
        BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile));
        PrintWriter outputWriter = new PrintWriter(new FileWriter(outputFile, true));
        String forwardLine = null;
        System.out.println("Begin merging Fastq files");
        int readsMerge = 0;
        while ((forwardLine = inputReaderForward.readLine()) != null) {
            // append the forward file
            outputWriter.println(forwardLine);
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            // append the reverse file
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            readsMerge++;
            if (readsMerge % 10000 == 0) {
                System.out.println("[" + now() + "] Merged 10000");
                readsMerge = 0;
            }
        }
        inputReaderForward.close();
        inputReaderReverse.close();
        outputWriter.close();
    } catch (IOException ex) {
        Logger.getLogger(Utilities.class.getName()).log(Level.SEVERE, "Error while merging FastQ files", ex);
    }
}
Maybe you also want to try using a BufferedWriter to cut down on your file I/O operations.
http://download.oracle.com/javase/6/docs/api/java/io/BufferedWriter.html
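For example, wrapping the existing FileWriter should be enough (a sketch; the 1 MB buffer size is just a starting point to tune):
PrintWriter outputWriter = new PrintWriter(new BufferedWriter(new FileWriter(outputFile, true), 1 << 20));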
A simple answer is to use a bigger buffer, which helps reduce the total number of I/O calls being made.
Usually, memory-mapped I/O with FileChannel (see Java NIO) would be used for handling large file I/O. In this case, however, that approach doesn't apply, as you need to inspect the file content in order to determine the boundary of every 4 lines.
If performance was the main requirement, then I would code this function in C or C++ instead of Java.
But regardless of language used, what I would do is try to manage memory myself. I would create two large buffers, say 128MB or more each and fill them with data from the two text files. Then you need a 3rd buffer that is twice as big as the previous two. The algorithm will start moving characters one by one from input buffer #1 to destination buffer, and at the same time count EOLs. Once you reach the 4th line you store the current position on that buffer away and repeat the same process with the 2nd input buffer. You continue alternating between the two input buffers, replenishing the buffers when you consume all the data in them. Each time you have to refill the input buffers you can also write the destination buffer and empty it.
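The same idea can be approximated in plain Java without managing the buffers by hand, by giving the readers and the writer large buffers and alternating four lines at a time. A sketch under those assumptions (file names are placeholders and error handling is minimal):
import java.io.*;

public class ShuffleMerge {
    private static final int BUF = 32 * 1024 * 1024; // 32 MB per stream, tune as needed

    public static void main(String[] args) throws IOException {
        try (BufferedReader fwd = new BufferedReader(new FileReader("forward.txt"), BUF);
             BufferedReader rev = new BufferedReader(new FileReader("reverse.txt"), BUF);
             BufferedWriter out = new BufferedWriter(new FileWriter("merged.txt"), 2 * BUF)) {
            while (copyLines(fwd, out, 4)) {   // 4 lines from file 1
                copyLines(rev, out, 4);        // 4 lines from file 2
            }
        }
    }

    // copies up to n lines; returns false once the reader is exhausted
    private static boolean copyLines(BufferedReader in, BufferedWriter out, int n) throws IOException {
        for (int i = 0; i < n; i++) {
            String line = in.readLine();
            if (line == null) {
                return false;
            }
            out.write(line);
            out.newLine();
        }
        return true;
    }
}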
Buffer your read and write operations. The buffer needs to be large enough to minimize the read/write operations while still being memory efficient. This is really simple and it works.
void write(InputStream is, OutputStream os) throws IOException {
    byte[] buf = new byte[102400]; // optimize the size of the buffer to your needs
    int num;
    while ((num = is.read(buf)) != -1) {
        os.write(buf, 0, num);
    }
}
EDIT:
I just realized that you need to shuffle the lines, so this code will not work for you as is, but the concept still remains the same.