I have code that reads a file using a BufferedReader and split(). The file was created via a method that automatically adds 4KB of empty space at the beginning of the file, which results in the following when I read it:
First, the code:
BufferedReader metaRead = new BufferedReader(new FileReader(metaFile));
String metaLine = "";
String[] metaData = new String[100000];
while ((metaLine = metaRead.readLine()) != null) {
    metaData = metaLine.split(",");
    for (int i = 0; i < metaData.length; i++) {
        System.out.println(metaData[i]);
    }
}
This is the result; keep in mind the file already exists and contains these values:
//4096 spaces then the first actual word in the document which is --> testTable2
Name
java.lang.String
true
No Reference
Is there a way to skip the first 4096 spaces and get straight to the actual values in the file, so I can read the result normally? I'll be using the metaData array later in other operations, and I'm pretty sure the spaces will mess up the number of slots in the array. Any suggestions would be appreciated.
If you're using Eclipse, the auto-completion should help.
metaRead.skip(4096);
https://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
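For instance, applied to the question's code (a sketch, assuming the padding is always exactly 4096 characters):
BufferedReader metaRead = new BufferedReader(new FileReader(metaFile));
metaRead.skip(4096); // skip the 4KB of leading padding before reading lines
String metaLine;
while ((metaLine = metaRead.readLine()) != null) {
    for (String value : metaLine.split(",")) {
        System.out.println(value);
    }
}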
You could (as mentioned) simply do:
metaRead.skip(4096);
if the whitespace always occupies exactly that many characters, or you could just skip lines that are blank after trimming:
while ((metaLine = metaRead.readLine()) != null) {
    if (metaLine.trim().length() > 0) {
        metaData = metaLine.split(",");
        for (int i = 0; i < metaData.length; i++) {
            System.out.println(metaData[i]);
        }
    }
}
Hi, I'm working on a simple imitation of Pandas' fillna method, which requires me to replace a null/missing value in a CSV file with an input (passed as a parameter). Almost everything works fine, but I have one issue: my CSV reader can't recognize null/missing values at the beginning and at the end of a row. For example,
Name,Age,Class
John,20,CLass-1
,18,Class-1
,21,Class-3
It will return errors.
The same goes for this example:
Name,Age,Class
John,20,CLass-1
Mike,18,
Tyson,21,
But for this case (the end-of-row problem), I can solve it by adding another comma at the end of the row, like this:
Name,Age,Class
John,20,CLass-1
Mike,18,,
Tyson,21,,
However, for the beginning-of-row problem, I have no idea how to solve it.
Here's my code for the CSV file reader:
public void readCSV(String fileName) {
    fileLocation = fileName;
    File csvFile = new File(fileName);
    Scanner sfile;
    // noOfColumns = 0;
    // noOfRows = 0;
    data = new ArrayList<ArrayList>();
    int colCounter = 0;
    int rowCounter = 0;
    try {
        sfile = new Scanner(csvFile);
        while (sfile.hasNextLine()) {
            String aLine = sfile.nextLine();
            Scanner sline = new Scanner(aLine);
            sline.useDelimiter(",");
            colCounter = 0;
            while (sline.hasNext()) {
                if (rowCounter == 0)
                    data.add(new ArrayList<String>());
                data.get(colCounter).add(sline.next());
                colCounter++;
            }
            rowCounter++;
            sline.close();
        }
        // noOfColumns = colCounter;
        // noOfRows = rowCounter;
        sfile.close();
    } catch (FileNotFoundException e) {
        System.out.println("File to read " + csvFile + " not found!");
    }
}
Unless you write the CSV file yourself, the writer mechanism will never arbitrarily add delimiters to suit the needs of your application, so give up on that train of thought altogether; you shouldn't do it either. If you do have access to the CSV file creation process, then the simple solution would be to not allow null or empty values to enter the file in the first place. In other words, have defaults placed into empty elements as the CSV file is being written.
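As a minimal illustration of that idea (a sketch; the helper name and default value are mine, not from your code):
// Build one CSV row, substituting a default for any null or empty value.
static String toCsvRow(String[] values, String defaultValue) {
    StringBuilder row = new StringBuilder();
    for (int i = 0; i < values.length; i++) {
        String v = (values[i] == null || values[i].trim().isEmpty()) ? defaultValue : values[i];
        if (i > 0) { row.append(','); }
        row.append(v);
    }
    return row.toString();
}
For example, toCsvRow(new String[] { null, "18", "Class-1" }, "N/A") produces N/A,18,Class-1.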
The header line within a CSV file is there for a reason: it tells you the number of data columns and the names of those columns within each line (row) that makes up the file. From the header line and the actual data in the file, you can also form a pretty good idea of what each column's data type should be.
In my opinion, the first thing your readCSV() method should do is read this header line (if it exists) and gather some information about the file the method is about to iterate through. In your case the header line consists of:
Name,Age,Class
Right off the start we know that each line within the file consists of three (3) data columns. The first column is named Name, the second Age, and the third Class. Based on the information provided within the CSV file, we can quickly infer the data types:
Name (String)
Age (Integer)
Class (String)
I'm only pointing this out because, in my opinion, although not mandatory, I think it would be better to store the CSV data in an ArrayList (or List interface) of an object class, for example:
ArrayList<Student> studentData = new ArrayList<>();
// OR //
List<Student> studentData = new ArrayList<>();
where Student is an object class.
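For illustration, a minimal Student class matching the columns above could look like this (a sketch; the field and accessor names are assumptions based on the header line):
public class Student {
    private final String name;
    private final int age;
    private final String classRoom;

    public Student(String name, int age, String classRoom) {
        this.name = name;
        this.age = age;
        this.classRoom = classRoom;
    }

    public String getName() { return name; }
    public int getAge() { return age; }
    public String getClassRoom() { return classRoom; }
}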
You seem to want everything within a 2D ArrayList, so with that in mind, below is a method to read CSV files and place their contents into this 2D ArrayList. Any file column elements that contain the word null or nothing at all will have a default string applied. There are lots of comments within the code explaining what is going on, and I suggest you give them a read. This code can be easily modified to suit your needs. At the very least I hope it gives you an idea of what can be done to apply defaults to empty values within the CSV file:
/**
 * Reads a supplied CSV file with any number of columnar rows and returns
 * the data within a 2D ArrayList of String ({@code ArrayList<ArrayList<String>>}).
 * <br><br>File delimited data that contains 'null' or nothing at all (a Null
 * String ("")) will have a supplied common default applied to that column
 * element before it is stored within the 2D ArrayList.<br><br>
 *
 * Modify this code to suit your needs.<br>
 *
 * @param fileName (String) The CSV file to process.<br>
 *
 * @param csvDelimiterUsed (String) The delimiter used in the CSV file.<br>
 *
 * @param commonDefault (String) A default String value that can be common
 * to all columnar elements within the CSV file that contain the string
 * 'null' or nothing at all (a Null String ("")). Those empty elements will
 * end up containing this supplied string value postfixed with the name of
 * that column. As an example, if the CSV file header line is
 * 'Name,Age,Class Room', the string "Unknown " is supplied to the
 * commonDefault parameter, and during file parsing a specific data column
 * (let's say Age) contains the word 'null' or nothing at all (ex:
 * Bob,null,Class-Math OR Bob,,Class-Math), then that line will be stored
 * within the 2D ArrayList as:<pre>
 *
 * Bob, Unknown Age, Class-Math</pre>
 *
 * @return (2D ArrayList of String Type - {@code ArrayList<ArrayList<String>>})
 */
public ArrayList<ArrayList<String>> readCSV(final String fileName, final String csvDelimiterUsed,
        final String commonDefault) {
    String fileLocation = fileName;        // The student data file name to process.
    File csvFile = new File(fileLocation); // Create a File object (used by the Scanner reader).
    /* The 2D ArrayList that will be returned containing all the CSV row/column data.
       You should really consider creating a class to hold Student instances of this
       data; however, that can be accomplished by working with the ArrayList later on
       when it is received. */
    ArrayList<ArrayList<String>> fileData = new ArrayList<>();
    // Open the supplied data file using Scanner (as per OP).
    try (Scanner reader = new Scanner(csvFile)) {
        /* Read the header line and gather information... This array
           will ultimately be set up to hold default values should
           any file columnar data hold null OR a null-string (""). */
        String[] columnData = reader.nextLine().split("\\s*\\" + csvDelimiterUsed + "\\s*");
        /* How many columns of data will be expected per row.
           This will be used in the String#split() method later
           on as the limit when we parse each file data line.
           This limit value is rather important in this case
           since it ensures that a Null String ("") is in place
           where a valid array element should be when no data
           is available, instead of just producing an array of
           lesser length. */
        int csvValuesPerLineCount = columnData.length;
        // Copy the column names array: to just hold the column names.
        String[] columnName = new String[columnData.length];
        System.arraycopy(columnData, 0, columnName, 0, columnData.length);
        /* Create default data for columns based on the supplied
           commonDefault string. Here the supplied default prefixes
           the actual column name (see JavaDoc). */
        for (int i = 0; i < columnData.length; i++) {
            columnData[i] = commonDefault + columnData[i];
        }
        // An ArrayList to hold each row of columnar data.
        ArrayList<String> rowData;
        // Iterate through each row of file data...
        while (reader.hasNextLine()) {
            rowData = new ArrayList<>(); // Initialize a new ArrayList.
            // Read the file line and trim off any leading or trailing whitespace.
            String aLine = reader.nextLine().trim();
            // Only process lines that contain something (blank lines are ignored).
            if (!aLine.isEmpty()) {
                /* Split the read-in line based on the supplied CSV file
                   delimiter and the number of columns established from
                   the header line. We do this to determine if a default
                   value will be required for a specific column that
                   contains no value at all (null or Null String ("")). */
                String[] aLineParts = aLine.split("\\s*\\" + csvDelimiterUsed + "\\s*", csvValuesPerLineCount);
                /* Here we determine if default values will be required
                   and apply them. We then add the columnar row data to
                   the rowData ArrayList. */
                for (int i = 0; i < aLineParts.length; i++) {
                    rowData.add((aLineParts[i].isEmpty() || aLineParts[i].equalsIgnoreCase("null"))
                            ? columnData[i] : aLineParts[i]);
                }
                /* Add the rowData ArrayList to the fileData
                   ArrayList since we are now done with this
                   file row of data and will now iterate to
                   the next file line for processing. */
                fileData.add(rowData);
            }
        }
    }
    // Process the 'File Not Found' exception.
    catch (FileNotFoundException ex) {
        System.err.println("The CSV file to read (" + csvFile + ") can not be found!");
    }
    // Return the fileData ArrayList to the caller.
    return fileData;
}
And to use the method above you might do this:
ArrayList<ArrayList<String>> list = readCSV("MyStudentsData.txt", ",", "Unknown ");
if (list == null) { return; }
StringBuilder sb;
for (int i = 0; i < list.size(); i++) {
    sb = new StringBuilder("");
    for (int j = 0; j < list.get(i).size(); j++) {
        if (!sb.toString().isEmpty()) { sb.append(", "); }
        sb.append(list.get(i).get(j));
    }
    System.out.println(sb.toString());
}
i have a large xml file of size 10 gb and i want to create a new xml file which is generated from the first record of the large file.i tried to do this in java and python but i got memory error since i'm loading the entire data.
In another post,someone suggested XSLT is the best solution for this.I'm new to XSLT,i don't know how to do this in xslt,pls suggest some style sheet to do this...
Large XML file(10gb) sample:
<MemberDataExport xmlns="http://www.payback.net/lmsglobal/batch/memberdataexport" xmlns:types="http://www.payback.net/lmsglobal/xsd/v1/types">
    <MembershipInfoListItem>
        <MembershipIdentifier>PB00000000001956044</MembershipIdentifier>
        <ParticipationStatus>1</ParticipationStatus>
        <DataSharing>1</DataSharing>
        <MasterInfo>
            <Gender>1</Gender>
            <Salutation>1</Salutation>
            <FirstName>Hazel</FirstName>
            <LastName>Sweetman</LastName>
            <DateOfBirth>1957-03-25</DateOfBirth>
        </MasterInfo>
    </MembershipInfoListItem>
    <Header>
        <BusinessPartner>CHILIS_US</BusinessPartner>
        <FileType>mde</FileType>
        <FileNumber>17</FileNumber>
        <FormatVariant>1</FormatVariant>
        <NumberOfRecords>22</NumberOfRecords>
        <CreationDate>2016-06-07T12:00:46-07:00</CreationDate>
    </Header>
    <MembershipInfoListItem>
        <MembershipIdentifier>PB00000000001956044</MembershipIdentifier>
        <ParticipationStatus>1</ParticipationStatus>
        <DataSharing>1</DataSharing>
        <MasterInfo>
            <Gender>1</Gender>
            <Salutation>1</Salutation>
            <FirstName>Hazel</FirstName>
            <LastName>Sweetman</LastName>
            <DateOfBirth>1957-03-25</DateOfBirth>
        </MasterInfo>
    </MembershipInfoListItem>
    .....
    .....
</MemberDataExport>
I want to create a file like this:
<MemberDataExport xmlns="http://www.payback.net/lmsglobal/batch/memberdataexport" xmlns:types="http://www.payback.net/lmsglobal/xsd/v1/types">
    <MembershipInfoListItem>
        <MembershipIdentifier>PB00000000001956044</MembershipIdentifier>
        <ParticipationStatus>1</ParticipationStatus>
        <DataSharing>1</DataSharing>
        <MasterInfo>
            <Gender>1</Gender>
            <Salutation>1</Salutation>
            <FirstName>Hazel</FirstName>
            <LastName>Sweetman</LastName>
            <DateOfBirth>1957-03-25</DateOfBirth>
        </MasterInfo>
    </MembershipInfoListItem>
</MemberDataExport>
Is there any other way I can do this without getting a memory error? Please suggest that too.
In Python (which you mentioned besides Java) you could use ElementTree.iterparse and then break parsing when you have found the element(s) you want to copy:
import xml.etree.ElementTree as ET

count = 0
copy = 1  # set this to the number of second-level (i.e. children of the root) elements you want to copy
level = -1

for event, elem in ET.iterparse('input1.xml', events=('start', 'end')):
    if event == 'start':
        level = level + 1
        if level == 0:
            result = ET.ElementTree(ET.Element(elem.tag))
    if event == 'end':
        level = level - 1
        if level == 0:
            count = count + 1
            if count <= copy:
                result.getroot().append(elem)
            else:
                break

result.write('result1.xml', 'UTF-8', True, 'http://www.payback.net/lmsglobal/batch/memberdataexport')
As for better namespace prefix preservation, I have had some success using the start-ns event and registering the collected namespaces on the ElementTree:
import xml.etree.ElementTree as ET

count = 0
copy = 1  # set this to the number of second-level (i.e. children of the root) elements you want to copy
level = -1

for event, elem in ET.iterparse('input1.xml', events=('start', 'end', 'start-ns')):
    if event == 'start':
        level = level + 1
        if level == 0:
            result = ET.ElementTree(ET.Element(elem.tag))
    if event == 'end':
        level = level - 1
        if level == 0:
            count = count + 1
            if count <= copy:
                result.getroot().append(elem)
            else:
                break
    if event == 'start-ns':
        ET.register_namespace(elem[0], elem[1])

result.write('result1.xml', 'UTF-8', True)
You didn't show your code, so we can't possibly know what you're doing right or wrong. However, I'd bet any DOM-style parser would need to load the entire file just to check whether the syntax is OK (no missing tags, etc.), and that will surely cause an OutOfMemoryError for a 10 GB file.
So, just in this case, my approach would be to read the file line by line using a BufferedReader (see How to read a large text file line by line using Java?) and just stop when you reach a line that contains your closing tag, i.e. </MembershipInfoListItem>:
StringBuilder sb = new StringBuilder("<MemberDataExport xmlns=\"http://www.payback.net/lmsglobal/batch/memberdataexport\" xmlns:types=\"http://www.payback.net/lmsglobal/xsd/v1/types\">");
sb.append(System.lineSeparator());
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String line;
    while ((line = br.readLine()) != null) {
        // process the line
        sb.append(line);
        sb.append(System.lineSeparator());
        if (line.contains("</MembershipInfoListItem>")) {
            break;
        }
    }
    sb.append("</MemberDataExport>");
} catch (IOException | AnyOtherExceptionNeeded ex) {
    // log or rethrow
}
Now sb.toString() will return what you want.
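To persist the extracted record, you could then write the buffer to a new file, for example (a minimal sketch; result.xml is an assumed output name):
// Write the extracted XML record out to a new file.
try (java.io.PrintWriter out = new java.io.PrintWriter("result.xml", "UTF-8")) {
    out.print(sb.toString());
} catch (java.io.IOException ex) {
    // log or rethrow
}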
I want to read a CSV file containing millions of rows and use the attributes for my decision tree algorithm. My code is below:
String csvFile = "myfile.csv";
List<String[]> rowList = new ArrayList<>();
String line = "";
String cvsSplitBy = ",";
String encoding = "UTF-8";
BufferedReader br2 = null;
try {
    int counterRow = 0;
    br2 = new BufferedReader(new InputStreamReader(new FileInputStream(csvFile), encoding));
    while ((line = br2.readLine()) != null) {
        line = line.replaceAll(",,", ",NA,");
        String[] object = line.split(cvsSplitBy);
        rowList.add(object);
        counterRow++;
    }
    System.out.println("counterRow is: " + counterRow);
    for (int i = 1; i < rowList.size(); i++) {
        try {
            // this method contains many if-elses only.
            ImplementDecisionTreeRulesFor2012(rowList.get(i)[0], rowList.get(i)[1], rowList.get(i)[2],
                    rowList.get(i)[3], rowList.get(i)[4], rowList.get(i)[5], rowList.get(i)[6]);
        } catch (Exception ex) {
            System.out.println("Exception occurred");
        }
    }
} catch (Exception ex) {
    System.out.println("fix" + ex);
}
It works fine when the CSV file is not large, but mine is indeed large, so I need a faster way to read the CSV. Is there any advice? Appreciated, thanks.
Just use uniVocity-parsers' CSV parser instead of trying to build your own. Your implementation will probably not be fast or flexible enough to handle all corner cases.
It is extremely memory efficient and you can parse a million rows in less than a second. This link has a performance comparison of many Java CSV libraries, and univocity-parsers comes out on top.
Here's a simple example of how to use it:
CsvParserSettings settings = new CsvParserSettings(); // you'll find many options here, check the tutorial.
CsvParser parser = new CsvParser(settings);
// parses all rows in one go (you should probably use a RowProcessor or iterate row by row if there are many rows)
List<String[]> allRows = parser.parseAll(new File("/path/to/your.csv"));
BUT, that loads everything into memory. To stream all rows, you can do this:
String[] row;
parser.beginParsing(csvFile);
while ((row = parser.parseNext()) != null) {
    // process row here.
}
The faster approach is to use a RowProcessor; it also gives more flexibility:
settings.setRowProcessor(myChosenRowProcessor);
CsvParser parser = new CsvParser(settings);
parser.parse(csvFile);
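For completeness, a bare-bones myChosenRowProcessor might look like this (a sketch assuming univocity's RowProcessor interface and its three callbacks; check the tutorial for the exact types in your version):
// A custom RowProcessor that handles each row as it is parsed,
// instead of accumulating everything in memory.
RowProcessor myChosenRowProcessor = new RowProcessor() {
    @Override
    public void processStarted(ParsingContext context) {
        // called once, before parsing begins
    }

    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        // called for every parsed row; process it here instead of storing it
    }

    @Override
    public void processEnded(ParsingContext context) {
        // called once, after the last row is parsed
    }
};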
Lastly, it has built-in routines that use the parser to perform some common tasks (iterate over Java beans, dump ResultSets, etc.).
This should cover the basics; check the documentation to find the best approach for your case.
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
In this snippet I see two issues which will slow you down considerably:
while ((line = br2.readLine()) != null) {
    line = line.replaceAll(",,", ",NA,");
    String[] object = line.split(cvsSplitBy);
    rowList.add(object);
    counterRow++;
}
First, rowList starts with the default capacity and will have to be grown many times, each time copying the old underlying array into a new one.
Worse, however, is the excessive blow-up of the data into String[] objects. You'll need the columns/cells only when you call ImplementDecisionTreeRulesFor2012 for that row, not all the time while you read the file and process all the other rows. Move the split (or something better, as suggested in the comments) into the second loop.
(Creating many objects is bad, even if you can afford the memory.)
Perhaps it would be better to call ImplementDecisionTreeRulesFor2012 while you read the "millions" of rows? It would avoid the rowList ArrayList altogether, as in the sketch below.
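A minimal sketch of that idea (assuming ImplementDecisionTreeRulesFor2012 takes the seven column values as in the question; header handling is omitted):
// Process each row as it is read, avoiding the rowList entirely.
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream("myfile.csv"), "UTF-8"))) {
    String line;
    while ((line = br.readLine()) != null) {
        String[] cells = line.replaceAll(",,", ",NA,").split(",");
        if (cells.length >= 7) {
            ImplementDecisionTreeRulesFor2012(cells[0], cells[1], cells[2],
                    cells[3], cells[4], cells[5], cells[6]);
        }
    }
} catch (IOException ex) {
    System.out.println("fix" + ex);
}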
Later:
Postponing the split reduces the execution time for 10 million rows from 1m8.262s (when the program ran out of heap space) to 13.067s.
If you aren't forced to read all rows before you can call Implp...2012, the time reduces to 4.902s.
Finally, writing the split and replace by hand:
String[] object = new String[7];
// ...read...
String x = line + ",";
int iPos = 0;
int iStr = 0;
int iNext = -1;
while ((iNext = x.indexOf(',', iPos)) != -1 && iStr < 7) {
    if (iNext == iPos) {
        object[iStr++] = "NA";
    } else {
        object[iStr++] = x.substring(iPos, iNext);
    }
    iPos = iNext + 1;
}
// add more "NA" if rows can have less than 7 cells
reduces the time to 1.983s. This is about 30 times faster than the original code, which runs into OutOfMemory anyway.
On top of the aforementioned univocity, it's worth checking:
https://github.com/FasterXML/jackson-dataformat-csv
http://simpleflatmapper.org/0101-getting-started-csv.html, which also has a low-level API that bypasses String creation.
The three of them were, at the time of this comment, the fastest CSV parsers.
Chances are that writing your own parser would be slower and buggier.
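For instance, a rough sketch of the jackson-dataformat-csv route, reading each row as a String[] (based on Jackson's CSV module; treat the exact setup as an assumption and check its docs):
CsvMapper mapper = new CsvMapper();
// Jackson's CsvParser (not univocity's): read each line as a String[].
mapper.enable(CsvParser.Feature.WRAP_AS_ARRAY);
try (MappingIterator<String[]> it = mapper.readerFor(String[].class)
        .readValues(new File("myfile.csv"))) {
    while (it.hasNextValue()) {
        String[] row = it.nextValue();
        // process row here
    }
}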
If you're aiming for objects (i.e. data binding), I've written a high-performance library, sesseltjonna-csv, that you might find interesting. A benchmark comparison with SimpleFlatMapper and uniVocity is here.
Basically, I want to parse, line by line, a text file so that every line is in its own array value.
E.g.
Hi There,
My Name's Aiden,
Not Really.
Array[0] = "Hi There"
Array[1] = "My Name's Aiden"
Array[2] = "Not Really"
But all the examples I have already read just confuse me and leave me frustrated. Maybe it's the way I approach it.
I don't know how to go about it; a point in the right direction would be most satisfying.
My suggestion is to use List<String> instead of String[], as arrays have a fixed size, and that size is unknown before reading. Afterward one could make an array out of it, but to no real purpose (see the one-liner after the code below).
For reading, one has to know the encoding of the file.
Path path = Paths.get("C:/Users/Me/list.txt");
//Charset encoding = StandardCharsets.UTF_8;
Charset encoding = Charset.defaultCharset();
List<String> lines = Files.readAllLines(path, encoding);
for (String line : lines) {
    ...
}
for (int i = 0; i < lines.size(); ++i) {
    String line = lines.get(i);
    lines.set(i, "-- " + line);
}
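And if you really do need an array afterwards, one line suffices:
String[] array = lines.toArray(new String[0]); // copy the list into a String[]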
File input = new File("1727209867.htm");
Document doc = Jsoup.parse(input, "UTF-8","http://www.facebook.com/people/Alison-Vella/1727209867");
I am trying to parse this HTML file, which is saved locally on my system, but parsing doesn't process the whole HTML, so I can't reach the information I need. Parsing only works for 6k chars with this code, but the HTML file actually has 60k chars.
This is not possible in jsoup itself, but there is a workaround:
final File input = new File("example.html");
final int maxLength = 6000;                      // Limit of chars to read
InputStream is = new FileInputStream(input);     // Open file for reading
StringBuilder sb = new StringBuilder(maxLength); // Init the "buffer" with the size required
int count = 0;                                   // Count of chars read
int c;                                           // Char for reading
while ((c = is.read()) != -1 && count < maxLength) // Read a single char until limit is reached
{
    sb.append((char) c); // Save the char into the buffer
    count++;             // Increment the chars read
}
is.close(); // Close the stream when done
Document doc = Jsoup.parse(sb.toString()); // Parse the Html from buffer
Explained:
Read the file char by char into a buffer until you reach the limit.
Parse the text from the buffer and process it with jsoup.
Problem: this won't take care of closing tags etc.; it will stop reading exactly when you hit the limit.
(Possible) Solutions:
Ignore this and stop exactly where you are; parse it and "fix" or drop the dangling HTML.
If you are at the end, read until you reach the next closing tag or > char (see the sketch after this list).
If you are at the end, read until you reach the next block tag.
If you are at the end, read until a specific tag or comment.
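For example, the second option could be sketched like this, continuing from the loop above so the buffer does not stop mid-tag (an assumption-laden sketch, not a complete fix):
// After the main read loop: if we stopped at the limit (not EOF),
// keep reading until the current tag is closed by '>'.
if (c != -1) {
    sb.append((char) c); // the char that ended the loop was read but not yet appended
    while (c != '>' && (c = is.read()) != -1) {
        sb.append((char) c);
    }
}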