I've got a huge CSV file that keeps growing forever [although sometimes it gets reset]. I know that's not good, but unfortunately I can't change the design, since it's another application that keeps adding stuff there.
I have to split this file into new, smaller files, bearing in mind that new rows will keep appearing in that CSV file; for example, one CSV file for each 1000 values, or something like that.
I'm thinking about writing a small program to do it and run it periodically via Windows Scheduled Tasks. Is that the best way of fixing this problem? If so, can you help me with the code [Java, VB, C#...]? If it's not the best solution, which path should I follow?
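For the splitting itself, a rough sketch in Java could look like the following; the input/output file names and the 1000-row chunk size are just placeholders, and it assumes each CSV record fits on one line:

    import java.io.*;

    public class CsvSplitter {
        public static void main(String[] args) throws IOException {
            int chunkSize = 1000;   // rows per output file (placeholder value)
            int row = 0, chunk = 0;
            PrintWriter out = null;
            try (BufferedReader in = new BufferedReader(new FileReader("huge.csv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (row % chunkSize == 0) {          // start a new chunk file
                        if (out != null) out.close();
                        out = new PrintWriter(new FileWriter("chunk-" + (chunk++) + ".csv"));
                    }
                    out.println(line);
                    row++;
                }
            } finally {
                if (out != null) out.close();
            }
        }
    }

A scheduled run would also need to remember how far it got last time (for example by persisting the last row count), since the source file keeps growing.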
This question is a little complicated but I will do my best to make it simple.
I have a program which I want to run multithreaded.
This is what the program does:
initializes an executable (commandline utility)
loads a file into the executable (files are provided from a data provider method)
sends commands to that executable based on the file which was loaded
parses the received responses from executable
writes results to a csv file
All this takes place in a single method.
However, when running in multithreaded mode everything runs fine, except that all the results written to the CSV file are wrong and out of order.
When I add the keyword synchronized to the method declaration and run the program with multiple threads, the program works just fine:
public synchronized void run(Dataprovider data) {
...
}
However, the program then runs at the same speed as if it were running in single-threaded mode. How can I fix this? This is driving me nuts...
How can I run this program properly multithreaded?
I'm looking for ideas and/or guidance
Edit:
However when running in multithreaded mode, everything runs fine except that all the results written to the CSV file are wrong and out of order.
I load a file into the executable, run some calculations on that file, then save it. I then get the size in bytes (file.length) of that newly generated file. I compare the new file with the old file (the file which was loaded) and I see that the new file is smaller than the old one (which is totally wrong). The size of the new file is consistently 12263 bytes, which is incorrect.
Edit:
I had included a partial code example here which does the writing to the CSV file, but I have removed it for simplicity.
However when running in multithreaded mode, everything runs fine except that all the results written to the CSV file are wrong and out of order.
I can make some guesses as to what you mean by this statement, but it would help if it were more specific.
Is it the case that the results are wrong because outputs from different threads are being jumbled together into the same line or even the same token within a line?
In a CSV file, the records are typically separated by newline characters. Can you refactor your solution so that each thread produces a complete line before writing to the output, and writes that line to the output all in one go?
Does your solution already do it that way? (It's not clear... there is no code in the question.)
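If not, a rough sketch of that idea, assuming each thread can build its whole record in memory first (the class and method names are mine, not from the question):

    import java.io.*;

    public class SharedCsvWriter {
        private final Writer out;

        public SharedCsvWriter(File file) throws IOException {
            out = new BufferedWriter(new FileWriter(file, true)); // append mode
        }

        // Each worker thread builds a complete CSV line on its own first...
        public String buildLine(String[] fields) {
            return String.join(",", fields);
        }

        // ...and only this short append is synchronized, not the whole computation.
        public synchronized void writeLine(String line) throws IOException {
            out.write(line);
            out.write(System.lineSeparator());
            out.flush();
        }
    }

This way the expensive work (running the executable, parsing responses) stays parallel, and only the final write is serialized.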
I am writing an applet that I eventually want to put online so that my friends/family can use it. I have the applet running locally now, but in order to work properly it needs to read a .ser file when it opens, and update that same file when it closes. The file is quite large (~180 MB), though I am working on paring it down.
What would be the fastest/most effective way to read/write this file in Java? There is a lot of information out there on this, and I have never done anything like it before, so it's a bit overwhelming. The class HttpURLConnection seems like an option for reading it, but not for writing it. Any free web hosting that I have seen will not allow a file that big to be uploaded.
The size of the file should hopefully go down substantially; it is a list of 2.8 million musical artists, many of which I'm sure nobody using the program will ever encounter. But if this program is to be effective, many artists will have to be stored, so the problem most likely remains the same.
Thanks in advance for any help
It sounds like it would be wise to keep this large data, and the processing of it, on your server instead of making the applet operate on it, because you would avoid each user downloading and processing a large file. If you had a server-side piece that the applet could call to get useful information from, then only your server would have to load, write, and process the data. You could implement a Java servlet, or a PHP program, to respond to HTTP requests from your applet in a format that suits the data. I'm assuming that your server can handle either servlets or custom PHP (most can).
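As a very rough sketch of what the server-side piece could look like as a servlet, the file path, the parameter name, and the idea of answering a membership query are all assumptions for illustration:

    import java.io.*;
    import java.util.*;
    import javax.servlet.ServletException;
    import javax.servlet.http.*;

    public class ArtistLookupServlet extends HttpServlet {
        private Set<String> artists; // loaded once on startup, shared by all requests

        @Override
        public void init() throws ServletException {
            // "/data/artists.ser" is a placeholder path on the server
            try (ObjectInputStream in = new ObjectInputStream(
                    new BufferedInputStream(new FileInputStream("/data/artists.ser")))) {
                artists = (Set<String>) in.readObject(); // the big file stays server-side
            } catch (IOException | ClassNotFoundException e) {
                throw new ServletException(e);
            }
        }

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            String name = req.getParameter("artist");       // applet sends ?artist=...
            resp.setContentType("text/plain");
            resp.getWriter().print(artists.contains(name)); // tiny response, not 180 MB
        }
    }

The applet then only ever transfers a few bytes per lookup instead of the whole data set.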
So I'm putting together an RSS parser which will process an RSS feed, filter it, and then download the matched items. Assume that the files being downloaded are legal torrent files.
Now I need to keep a record of the files that I have already downloaded, so they aren't done again.
I've already got it working with SQLite (create database if not exists, insert row if a select statement returns nothing), but the resulting jar file is 2.5MB+ (due to the sqlite libs).
I'm thinking that if I use a text file, I could cut down the jar file to a few hundred kilobytes.
I could keep a list of the names of downloaded files, one per line, read the whole file into memory, search whether a file is already listed, etc.
A few questions that occur to me now:
Say 10 files are downloaded a day: would the text file method end up taking too many resources?
Overall, which one is faster?
Anyway, what do you guys think? I could use some advice here, as I'm still new to programming and doing this as a hobby thing :)
If you only need to keep track of a few pieces of information (like the name of the file), you can certainly use a simple text file.
Using a BufferedReader to read it, you should achieve good performance.
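Something along these lines, for example; the downloaded.txt file name and the one-name-per-line layout are assumptions:

    import java.io.*;
    import java.util.*;

    public class DownloadLog {
        private final File log = new File("downloaded.txt"); // one file name per line
        private final Set<String> seen = new HashSet<>();

        public DownloadLog() throws IOException {
            if (log.exists()) {
                try (BufferedReader in = new BufferedReader(new FileReader(log))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        seen.add(line.trim()); // load everything into memory once
                    }
                }
            }
        }

        public boolean alreadyDownloaded(String name) {
            return seen.contains(name);        // O(1) lookup in the set
        }

        public void record(String name) throws IOException {
            seen.add(name);
            try (Writer out = new FileWriter(log, true)) { // append the new entry
                out.write(name + System.lineSeparator());
            }
        }
    }

At ten files a day this stays tiny for years, so resource usage is a non-issue.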
Theoretically a DB (either relational or NoSQL) is better, but if the distribution size is critical for you, using the file system can be preferable.
The only problem here is the performance of data access (either for write or for read). Think about the following approach: do not use one single file; use a directory that contains several files instead. Each file name contains the key (or keys) that lets you access specific data, just like a key in a map. In this case you will be able to access data relatively easily and fast.
You might also take a look at XStream. They have an implementation of Map along these lines: it stores entries on disk, each entry in a separate file.
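The same idea in plain Java, without XStream (the class is my own illustration; note the key must be usable as a file name):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;

    // A crude map-like store: one file per key inside a directory.
    public class FileStore {
        private final Path dir;

        public FileStore(Path dir) throws IOException {
            this.dir = Files.createDirectories(dir);
        }

        public void put(String key, String value) throws IOException {
            // the key becomes the file name, so it must not contain path separators
            Files.write(dir.resolve(key), value.getBytes(StandardCharsets.UTF_8));
        }

        public String get(String key) throws IOException {
            Path file = dir.resolve(key);
            return Files.exists(file)
                    ? new String(Files.readAllBytes(file), StandardCharsets.UTF_8)
                    : null;
        }
    }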
I am generating a log file, and I want to read the data periodically without having to read from the beginning each time. Can anyone help?
Open the file and have a loop which:
gets the size and compares it with the size you have already read.
if the size has grown, reads that many bytes and no more; doing this means you can read more later.
if the size has shrunk, closes the file and starts again.
You can use FileInputStream or RandomAccessFile.
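A rough sketch of that loop with RandomAccessFile; the file name and the one-second polling interval are arbitrary choices:

    import java.io.*;

    public class LogTail {
        public static void main(String[] args) throws Exception {
            long readSoFar = 0;
            while (true) {
                RandomAccessFile raf = new RandomAccessFile("app.log", "r");
                long size = raf.length();
                if (size < readSoFar) {
                    readSoFar = 0;          // file shrank: it was reset, start again
                }
                if (size > readSoFar) {
                    raf.seek(readSoFar);    // jump past what was already processed
                    byte[] chunk = new byte[(int) (size - readSoFar)];
                    raf.readFully(chunk);   // read exactly the new bytes, no more
                    readSoFar = size;
                    System.out.print(new String(chunk, "UTF-8"));
                }
                raf.close();
                Thread.sleep(1000);         // poll once a second
            }
        }
    }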
Use the Unix command 'tail'; its '-f' and '-F' options are very handy for this as well.
See http://www.thegeekstuff.com/2009/08/10-awesome-examples-for-viewing-huge-log-files-in-unix/ for examples, or just google around for more.
If you want to run a program that reads your log file periodically, you can use a scheduler such as Quartz Scheduler to run it.
RandomAccessFile is a good option. If you leave the application, you will have to persist the position of your last read before exiting, in order to avoid rereading information.
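Persisting that position can be as simple as a small side file holding the byte offset; the .pos naming here is my own convention:

    import java.io.*;

    public class ReadPosition {
        private final File posFile = new File("app.log.pos"); // side file with the last offset

        public long load() throws IOException {
            if (!posFile.exists()) return 0;   // first run: start from the beginning
            try (BufferedReader in = new BufferedReader(new FileReader(posFile))) {
                return Long.parseLong(in.readLine().trim());
            }
        }

        public void save(long offset) throws IOException {
            try (Writer out = new FileWriter(posFile)) {   // overwrite with the new offset
                out.write(Long.toString(offset));
            }
        }
    }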
Log files, on the other hand, tend to become quite large under heavy event flow. Rotating log files will allow you to shift your problem a little towards file naming. You can configure your system to produce one log file per day, like this:
app_access.2011-11-28.log,
app_access.2011-11-29.log,
app_access.2011-11-30.log,
...
If the files you get are still very large, you may rotate them by date and time, so the hour is also part of the file name. Your files could then rotate, let's say, every three hours or even every hour. This will give you more log files to read, but they will be smaller and thus easier to process. The date and time range you want to seek will be part of the file name.
You could additionally rotate by file size. If you select a maximum file size you can deal with, you can avoid having to randomly access a huge file at all.
I'm trying to edit a configuration file in Java. What I really need to do is change a single line, so reading the whole file and writing it back would be a waste of time, since the configuration file can be big.
Is there a more efficient way to do this, other than reading in/editing/writing out the file? I thought of converting the entire file to a string, replacing the line I want, and writing it back.
I don't know how efficient that would be; can someone give me other suggestions, or are the ones I mentioned OK? Execution time is important.
I would recommend using the Preferences API instead. On the Windows platform your preferences are then stored in the registry; on other platforms the corresponding way to save application preferences is used. See also the Preferences API Overview.
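For example (the node and key names here are made up):

    import java.util.prefs.Preferences;

    public class Settings {
        public static void main(String[] args) {
            // Stored in the registry on Windows, in a per-user store elsewhere.
            Preferences prefs = Preferences.userNodeForPackage(Settings.class);
            prefs.put("server.host", "example.org");             // update one setting cheaply
            String host = prefs.get("server.host", "localhost"); // second arg is the default
            System.out.println(host);
        }
    }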
How big a configuration file are we talking here? 1k lines? 10k? 1m lines? If the line you want to edit is the last line, just seek to the start of that line, truncate the file there, and write the new one. If it's not... you will need to read the whole file and write it again.
Oh, and the two options you mention are actually the same (read/edit/write).
On the third hand, I think it's irrelevant given the sizes of most config files (unless you have weird constraints, like a flash storage device which takes too long to write and has limited write cycles).
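For what it's worth, the last-line trick from above could look like this with RandomAccessFile; it assumes the file ends with a newline, and the file name and replacement line are placeholders:

    import java.io.*;

    public class LastLineEdit {
        public static void main(String[] args) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile("app.conf", "rw")) {
                long pos = raf.length() - 2;      // skip the trailing newline
                while (pos > 0) {
                    raf.seek(pos);
                    if (raf.read() == '\n') {     // found the end of the previous line
                        pos++;                    // the last line starts right after it
                        break;
                    }
                    pos--;
                }
                if (pos < 0) pos = 0;             // tiny file: treat it as one line
                raf.setLength(pos);               // truncate the old last line
                raf.seek(pos);
                raf.write("key=newValue\n".getBytes("UTF-8")); // write the replacement
            }
        }
    }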