I want to read a property file which contains only values (i.e. not key-value pairs).
My property file will contain only a list of strings (more than 1000 words).
I am just using IOUtils to read the file, as below:
InputStream inputStream = ReadProperty.class.getClassLoader().getResourceAsStream(FILE_NAME);
keywords = IOUtils.toString(inputStream);
What would be the most efficient way to maintain the property file?
Maintaining the words as comma-separated values
E.g.:
Good,Bad,Better,Best,Could,Would
Maintaining the words one per line
E.g.:
Good
Bad
Better
Best
Could
Would
I feel the second option is more readable, but I want to understand whether there is any performance issue caused by the newline character (\n).
If you go with the newline representation, you can read the lines easily like this:
ArrayList<String> values = new ArrayList<>();
try (BufferedReader br = new BufferedReader(new FileReader(FILE_NAME))) { // FILE_NAME: path to your word-list file
    String value;
    while ((value = br.readLine()) != null) {
        values.add(value);
    }
} catch (IOException e) {
    e.printStackTrace();
}
Regarding performance:
The ',' character and the '\n' character take up the same space on disk, unless you write the lines with a file writer that is aware of the platform you're working on (it will write "\r\n" on Windows systems). Performance won't be influenced very much, especially if you only have about 1000 entries.
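If it helps to see the two layouts side by side, here is a rough sketch of parsing each one with the question's ReadProperty/FILE_NAME setup (the variable names are mine, not from the question); both come down to a single split() call, so readability can be the deciding factor:
// Read the whole resource into one String, as in the question.
InputStream inputStream = ReadProperty.class.getClassLoader().getResourceAsStream(FILE_NAME);
String content = IOUtils.toString(inputStream, "UTF-8");

// Comma-separated layout: one split on ','
String[] commaSeparated = content.split(",");

// One-word-per-line layout: split on '\n'; trim() drops a possible trailing '\r'
String[] lineSeparated = content.split("\n");
for (int i = 0; i < lineSeparated.length; i++) {
    lineSeparated[i] = lineSeparated[i].trim();
}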
I am using CSVReader to read a CSV file in Java. In my case, the CSV file will contain double quotes (") and single quotes ('). Something like this:
SL 12" WIR TREE ASST CD
Below is the code I am using to read the file:
CsvReader reader = null;
reader = readFile(fileName, delimiter, encoding);
while (reader.readRecord()) {
// Code Part
}
Whenever it reaches reader.readRecord(), it throws the exception: 'Maximum column length of 100,000 exceeded in column 0 in record 0. Set the SafetySwitch property to false if you're expecting column lengths greater than 100,000 characters to avoid this error.'
What I am trying to do, and what I need, is this:
Since I can't make any changes to the file, I am trying to replace the double quotes and single quotes with an empty string in Java. But it throws the exception mentioned above.
I don't know what CsvReader is (it is not part of the standard JDK), but the problem seems to occur in readRecord(), and thus long before you have a chance to replace any character. So CsvReader is not usable here, and you should use a less specialised reader such as java.io.BufferedReader.
Given that the delimiter is not a quote or double quote (for obvious reasons), this code snippet works:
File file = new File(fileName);
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), encoding))) {
    String line = reader.readLine();
    while (line != null) {
        // replace quotes
        line = line.replace("\"", "");
        line = line.replace("'", "");
        // split line according to given delimiter
        String[] items = line.split(delimiter);
        // handle items...
        line = reader.readLine();
    }
}
catch (IOException e) {
    // handle exception...
}
I want to read text files and convert each word to a number, then for each file write the sequence of numbers instead of words to a new file. I use a HashMap to assign just one number (identifier) to each word; for instance, the word apple is assigned the number 10, so whenever I see apple in a text file I write 10 in the sequence. I need just one HashMap to prevent more than one identifier from being assigned to a word. I wrote the following code, but it processes files slowly. For instance, converting a text file of 165.7 MB to a sequence file took 20 hours, and I need to convert 600 text files of the same size. Is there any way to improve the efficiency of my code? The following function is called for each text file.
public void ConvertTextToSequence(File file) {
    try {
        FileWriter filewriter = new FileWriter(path.keywordDocIdsSequence, true);
        BufferedWriter bufferedWriter = new BufferedWriter(filewriter);
        String sequence = "";
        FileReader fileReader = new FileReader(file);
        BufferedReader bufferedReader = new BufferedReader(fileReader);
        String line = bufferedReader.readLine();
        while (line != null) {
            StringTokenizer tokens = new StringTokenizer(line);
            String str;
            while (tokens.hasMoreTokens()) {
                str = tokens.nextToken();
                if (keywordsId.containsKey(str)) {
                    sequence = sequence + " " + keywordsId.get(str);
                } else {
                    keywordsId.put(str, id);
                    sequence = sequence + " " + id;
                    id++;
                }
                if (keywordsId.size() % 10000 == 0) {
                    bufferedWriter.append(sequence);
                    sequence = "";
                    start = id;
                }
            }
            line = bufferedReader.readLine();
        }
        if (start < id) {
            bufferedWriter.append(sequence);
        }
        bufferedReader.close();
        fileReader.close();
        bufferedWriter.close();
        filewriter.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The constructor of that class is:
public ConvertTextToKeywordIds() {
    path = new LocalPath();
    repository = new RepositorySQL();
    keywordsId = new HashMap<String, Integer>();
    id = 1;
    start = 1;
}
I suspect that the speed of your program is tied to the rehashing of the hash map as the number of words grows. Each rehash can incur a significant time penalty as the size of the hash map grows. You could try and estimate the number of unique words you expect and use that to initialize the hash map.
As mentioned by @JB Nizet, you may want to write directly to the buffered writer rather than waiting to accumulate a number of entries, since the buffered writer is already set up to write only when it has accumulated enough changes.
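A minimal sketch of those two suggestions; the 1,000,000-word estimate and the wordId variable are purely illustrative, not figures or names from the question:
// In the constructor: pre-size the map so it never has to rehash while the
// vocabulary grows (1_000_000 unique words is only an illustrative estimate).
keywordsId = new HashMap<String, Integer>(1_000_000 * 4 / 3 + 1);

// In ConvertTextToSequence: append straight to the BufferedWriter instead of
// concatenating into a String; the writer only touches the disk once its
// internal buffer is full, so no manual batching is needed.
while (tokens.hasMoreTokens()) {
    String str = tokens.nextToken();
    Integer wordId = keywordsId.get(str); // one lookup instead of containsKey + get
    if (wordId == null) {
        wordId = id++;
        keywordsId.put(str, wordId);
    }
    bufferedWriter.append(' ').append(String.valueOf(wordId));
}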
Your most effective performance boost is probably using StringBuilder instead of String for your sequence.
I would also write and flush the sequence each time it exceeds a certain length rather than whenever you've added 10000 words to your map.
This map could get pretty huge - have you considered improving that? If you hit millions of entries you may get better performance using a database.
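For the length-based flush, a rough sketch of how the accumulation could look (the threshold and the wordId variable are arbitrary illustrations, not part of the original code):
StringBuilder sequence = new StringBuilder();
final int FLUSH_THRESHOLD = 1 << 20; // roughly one million characters, an arbitrary choice

// ...inside the token loop, once wordId has been resolved:
sequence.append(' ').append(wordId);
if (sequence.length() > FLUSH_THRESHOLD) { // flush by accumulated length, not by map size
    bufferedWriter.append(sequence);
    sequence.setLength(0);
}

// ...after the whole file has been read, write whatever is left over:
bufferedWriter.append(sequence);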
I have noticed that using java.util.Scanner is very slow when reading large files (in my case, CSV files).
I want to change the way I am currently reading files, to improve performance. Below is what I have at the moment. Note that I am developing for Android:
InputStreamReader inputStreamReader;
try {
    inputStreamReader = new InputStreamReader(context.getAssets().open("MyFile.csv"));
    Scanner inputStream = new Scanner(inputStreamReader);
    inputStream.nextLine(); // Ignores the first line
    while (inputStream.hasNext()) {
        String data = inputStream.nextLine(); // Gets a whole line
        String[] line = data.split(","); // Splits the line up into a string array
        if (line.length > 1) {
            // Do stuff, e.g:
            String value = line[1];
        }
    }
    inputStream.close();
} catch (IOException e) {
    e.printStackTrace();
}
Using Traceview, I managed to find that the main performance issues are java.util.Scanner.nextLine() and java.util.Scanner.hasNext().
I've looked at other questions (such as this one), and I've come across some CSV readers, like the Apache Commons CSV, but they don't seem to have much information on how to use them, and I'm not sure how much faster they would be.
I have also heard about using FileReader and BufferedReader in answers like this one, but again, I do not know whether the improvements will be significant.
My file is about 30,000 lines in length, and using the code I have at the moment (above), it takes at least 1 minute to read values from about 600 lines down, so I have not timed how long it would take to read values from over 2,000 lines down, but sometimes, when reading information, the Android app becomes unresponsive and crashes.
Although I could simply change parts of my code and see for myself, I would like to know if there are any faster alternatives I have not mentioned, or if I should just use FileReader and BufferedReader. Would it be faster to split the huge file into smaller files, and choose which one to read depending on what information I want to retrieve? Preferably, I would also like to know why the fastest method is the fastest (i.e. what makes it fast).
uniVocity-parsers has the fastest CSV parser you'll find (2x faster than OpenCSV, 3x faster than Apache Commons CSV), with many unique features.
Here's a simple example on how to use it:
CsvParserSettings parserSettings = new CsvParserSettings(); // many options here, have a look at the tutorial
CsvParser parser = new CsvParser(parserSettings);
// parses all rows in one go
List<String[]> allRows = parser.parseAll(new FileReader(new File("your/file.csv")));
To make the process faster, you can select the columns you are interested in:
parserSettings.selectFields("Column X", "Column A", "Column Y");
Normally, you should be able to parse 4 million rows in around 2 seconds. With column selection, the speed will improve by roughly 30%.
It is even faster if you use a RowProcessor. There are many implementations out of the box for processing conversions to objects, POJOs, etc. The documentation explains all of the available features. It works like this:
// let's get the values of all columns using a column processor
ColumnProcessor rowProcessor = new ColumnProcessor();
parserSettings.setRowProcessor(rowProcessor);
//the parse() method will submit all rows to the row processor
parser.parse(new FileReader(new File("/examples/example.csv")));
//get the result from your row processor:
Map<String, List<String>> columnValues = rowProcessor.getColumnValuesAsMapOfNames();
We also built a simple speed comparison project here.
Your code is fine for loading big files. However, when an operation is going to take longer than expected, it is good practice to execute it in a task rather than on the UI thread, in order to keep the app responsive.
The AsyncTask class helps to do that:
private class LoadFilesTask extends AsyncTask<String, Integer, Long> {
    protected Long doInBackground(String... str) {
        long lineNumber = 0;
        InputStreamReader inputStreamReader;
        try {
            inputStreamReader = new InputStreamReader(context.getAssets().open(str[0]));
            Scanner inputStream = new Scanner(inputStreamReader);
            inputStream.nextLine(); // Ignores the first line
            while (inputStream.hasNext()) {
                lineNumber++;
                String data = inputStream.nextLine(); // Gets a whole line
                String[] line = data.split(","); // Splits the line up into a string array
                if (line.length > 1) {
                    // Do stuff, e.g:
                    String value = line[1];
                }
            }
            inputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return lineNumber;
    }

    // If you need to show the progress, use this method
    protected void onProgressUpdate(Integer... progress) {
        setYourCustomProgressPercent(progress[0]);
    }

    // This method is triggered at the end of the process, in your case when the loading has finished
    protected void onPostExecute(Long result) {
        showDialog("File Loaded: " + result + " lines");
    }
}
...and executing as:
new LoadFilesTask().execute("MyFile.csv");
You should use a BufferedReader instead:
BufferedReader reader = null;
try {
    reader = new BufferedReader(new InputStreamReader(context.getAssets().open("MyFile.csv")));
    reader.readLine(); // Ignores the first line
    String data;
    while ((data = reader.readLine()) != null) { // Gets a whole line
        String[] line = data.split(","); // Splits the line up into a string array
        if (line.length > 1) {
            // Do stuff, e.g:
            String value = line[1];
        }
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (reader != null) {
        try {
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I have the following code for compressing and decompressing a string.
public static byte[] compress(String str)
{
    try
    {
        ByteArrayOutputStream obj = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(obj);
        gzip.write(str.getBytes("UTF-8"));
        gzip.close();
        return obj.toByteArray();
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return null;
}

public static String decompress(byte[] bytes)
{
    try
    {
        GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(bytes));
        BufferedReader bf = new BufferedReader(new InputStreamReader(gis, "UTF-8"));
        StringBuilder outStr = new StringBuilder();
        String line;
        while ((line = bf.readLine()) != null)
        {
            outStr.append(line);
        }
        return outStr.toString();
    }
    catch (IOException e)
    {
        return e.getMessage();
    }
}
I compress into a byte array on Windows, then send the byte array through a socket to the Linux machine and decompress it there. However, upon decompression it seems that all my newline characters are gone.
So I thought the problem was the Linux-to-Windows relationship. However, I tried writing a simple program on Windows that uses the same code, and found that the newlines are still gone.
Can anyone shed any light on what causes this? I can't figure out any explanation.
I think the problem is here:
while ((line = bf.readLine()) != null)
{
outStr.append(line);
}
The readLine() call sees the newline character but doesn't include it in the returned value of line.
The problem is worse than you think, perhaps.
readLine() gets all the characters up to, but not including, a newline (or some variety of returns and linefeed characters) OR the end of file. So you don't know if the last line you get had a newline on the end or not.
This might not matter, and if so, you can just add this following the other append:
outStr.append('\n');
Some files might end up with an extra line ending at the end of file.
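In context, the patched loop from the question would look like this (a sketch of the suggestion above):
while ((line = bf.readLine()) != null)
{
    outStr.append(line);
    outStr.append('\n'); // re-add the line ending that readLine() stripped
}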
If it does matter, you will need to use read() and then output all the characters you receive. In that case, you might end up with the infamous "What's at the end of the line?" problem you allude to between Windows, Linux, and macOS, and the way they use different combinations of return and newline characters to end lines.
It is not GZIP that is "eating" newlines.
It is this code:
while ((line = bf.readLine()) != null)
{
outStr.append(line);
}
The readLine() method reads a line (up to and including a line termination sequence) and then returns it without a newline. You then append it to outStr ... without replacing the line termination that was stripped.
But even if you replaced the line termination, you can't guarantee to preserve the actual line termination sequence that was used ... if you do it that way.
I recommend that you replace the readLine() calls with read() calls; i.e. read and then buffer the data one character at a time. It solves two problems at once. It may even be faster, because you are avoiding the unnecessary overhead of assembling line Strings.
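A rough sketch of that read()-based variant, keeping the rest of the original method intact (the buffer size is an arbitrary choice):
public static String decompress(byte[] bytes)
{
    try
    {
        GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(bytes));
        InputStreamReader reader = new InputStreamReader(gis, "UTF-8");
        StringBuilder outStr = new StringBuilder();
        char[] buffer = new char[4096]; // arbitrary buffer size
        int read;
        while ((read = reader.read(buffer)) != -1)
        {
            // Copy exactly what was decompressed, line terminators included.
            outStr.append(buffer, 0, read);
        }
        reader.close();
        return outStr.toString();
    }
    catch (IOException e)
    {
        return e.getMessage();
    }
}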
I am having a problem reading files with BufferedReader... I am trying to read in a dictionary file where every word is on a new line. It works for one file I have, but when I tried adding a larger wordlist file (the enable wordlist), the first read, 'while ((currentLine=br.readLine()) != null)', causes an exception with no description... Please help!
try
{
    InputStream is = this.getResources().openRawResource(R.raw.enable1);
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    String currentLine = null;
    while ((currentLine = br.readLine()) != null)
    {
        dictionaryList.add(currentLine);
    }
    br.close();
}
catch (Exception e)
{
    //error here
}
*Looks like there is a file size limit of 1048576 bytes... otherwise it crashes.
So, like I said in the edit, the new wordlist was over 1048576 bytes and was causing an IO exception without any message... (I had a string set to e.getMessage() in the catch, but the message was null.)
What I did was divide the wordlist into separate files based on word size (by the way, there are 26 different files! message me if you want them).
Then, depending on the size of the word I have, I load the specific wordlist; all of the files are named in the format enable# (# is the word size). If anyone wants to know, I am doing that like this:
int wordListID=0;
String wordList="enable"+goodText.length();
try {
    Class res = R.raw.class;
    Field field = res.getField(wordList);
    wordListID = field.getInt(null);
}
catch (Exception e) {
    //something
}
I then pass that specific wordListID to:
InputStream is = this.getResources().openRawResource(wordListID);
and now I have a small enough file, which actually helps my performance too!
*This is my first application so I may not be doing things the correct way... just trying to get the hang of things