I have two files; assume both are already sorted.
This is just example data; in reality each file will have around 30-40 million records and be 7-10 GB in size, as the row length is large and fixed.
They are simple text files; once a matching record is found, I will do some updates and write the result to a file.
File A may contain 0 or more records matching an ID from File B.
The goal is to complete this processing in the least amount of time possible.
I am able to do it, but it is a time-consuming process.
Suggestions are welcome.
File A
1000000001,A
1000000002,B
1000000002,C
1000000002,D
1000000002,D
1000000003,E
1000000004,E
1000000004,E
1000000004,E
1000000004,E
1000000005,E
1000000006,A
1000000007,A
1000000008,B
1000000009,B
1000000010,C
1000000011,C
1000000012,C
File B
1000000002
1000000004
1000000006
1000000008
1000000010
1000000012
1000000014
1000000016
1000000018
// Not working as of now; the matching logic is wrong.
private static void readAndWriteFile() {
System.out.println("Read Write File Started.");
long time = System.currentTimeMillis();
try(
BufferedReader in = new BufferedReader(new FileReader(Commons.ROOT_PATH+"input.txt"));
BufferedReader search = new BufferedReader(new FileReader(Commons.ROOT_PATH+"search.txt"));
FileWriter myWriter = new FileWriter(Commons.ROOT_PATH+"output.txt");
) {
String inLine = in.readLine();
String searchLine = search.readLine();
boolean isLoopEnd = true;
while(isLoopEnd) {
if(searchLine == null || inLine == null) {
isLoopEnd = false;
break;
}
if(searchLine.substring(0, 10).equalsIgnoreCase(inLine.substring(0,10))) {
System.out.println("Record Found - " + inLine.substring(0, 10) + " | " + searchLine.substring(0, 10) );
myWriter.write(inLine + System.lineSeparator());
inLine = in.readLine();
}else {
inLine = in.readLine();
}
}
in.close();
myWriter.close();
search.close();
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("Read and Write to File done in - " + (System.currentTimeMillis() - time));
}
My suggestion would be to use a database, as described in this answer. Using txt files has a big disadvantage compared to DBs, mostly because of the lack of indexes and the other points mentioned there.
So what I would do is create a database (there are lots of good ones out there, such as MySQL, PostgreSQL, etc.), create the tables that are needed, and then read the file. Insert each line of the file into the DB and use the DB to search and update the records.
Maybe this is not an answer to your concrete question:
The goal is to complete this processing in the least amount of time possible.
But it is a worthwhile suggestion. Good luck.
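For illustration, a minimal sketch of that load-then-query idea using plain JDBC with an embedded H2 database, just as an example (any of the databases mentioned above would do, and the H2 driver has to be on the classpath); the table and column definitions and the batch size are placeholders, not anything taken from your files:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class LoadRecords {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:./records", "sa", "");
             BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
            con.createStatement().execute(
                    "CREATE TABLE IF NOT EXISTS record(id VARCHAR(10), payload VARCHAR(255))");
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement("INSERT INTO record VALUES (?, ?)")) {
                String line;
                int batch = 0;
                while ((line = in.readLine()) != null) {
                    ps.setString(1, line.substring(0, 10));  // fixed-width ID
                    ps.setString(2, line.substring(11));     // rest of the row
                    ps.addBatch();
                    if (++batch % 10_000 == 0) {
                        ps.executeBatch();                   // flush periodically
                    }
                }
                ps.executeBatch();
            }
            // The index is what makes the later lookups and updates fast.
            con.createStatement().execute("CREATE INDEX IF NOT EXISTS idx_record_id ON record(id)");
            con.commit();
        }
    }
}

After loading, one indexed SELECT or UPDATE per ID from File B replaces a linear scan over File A.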
With this approach I am able to process 50M records in 150 seconds on an i3 with 4 GB RAM and an SSD hard drive.
private static void readAndWriteFile() {
    System.out.println("Read Write File Started.");
    long time = System.currentTimeMillis();
    try (BufferedReader in = new BufferedReader(new FileReader(Commons.ROOT_PATH + "input.txt"));
         BufferedReader search = new BufferedReader(new FileReader(Commons.ROOT_PATH + "search.txt"));
         FileWriter myWriter = new FileWriter(Commons.ROOT_PATH + "output.txt")) {
        String inLine = in.readLine();
        String searchLine = search.readLine();
        while (inLine != null && searchLine != null) {
            // Both files are already sorted, so walk them like a merge join.
            long searchId = Long.parseLong(searchLine.substring(0, 10));
            long inId = Long.parseLong(inLine.substring(0, 10));
            if (searchId == inId) {
                System.out.println("Record Found - " + inLine.substring(0, 10) + " | " + searchLine.substring(0, 10));
                myWriter.write(inLine + System.lineSeparator());
            }
            // Advance whichever pointer is behind.
            if (searchId < inId) {
                searchLine = search.readLine();
            } else {
                inLine = in.readLine();
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    System.out.println("Read and Write to File done in - " + (System.currentTimeMillis() - time));
}
The function below copies all files with a given extension from rootDirectory into the given destination. It works well when the names of the files differ; however, when there are two files with the same name (see the recursive call - they can be in subdirectories), it does not do what it should. If there are several files with the same name, it should copy them all and rename the duplicates (adding _1, _2, ... to their names).
I see there might be a problem with the Map I am using - every time a file is copied, I want to save its name and increment a counter that counts how many times it has been copied (so the appropriate number can be added to its name). Could you please help me fix the problem?
void copy(File rootDirectory, String destination, String fileExtension) {
    File destFile = new File(destination);
    HashMap<String, Integer> counter = new HashMap<>();
    for (File file : rootDirectory.listFiles()) {
        try {
            if (file.isDirectory()) {
                copy(file, destination, fileExtension);
            } else if (getExtension(file.getPath().toLowerCase()).equals(fileExtension.toLowerCase())) {
                if (!destFile.exists()) {
                    destFile.mkdirs();
                }
                String fileName = file.getName();
                if (counter.containsKey(fileName)) { // <<-- IS NEVER TRUE
                    int count = counter.get(fileName);
                    count++;
                    counter.put(fileName, count);
                    int i = fileName.contains(".") ? fileName.lastIndexOf('.') : fileName.length();
                    fileName = fileName.substring(0, i) + "_" + count + fileName.substring(i);
                } else {
                    counter.put(fileName, 0);
                }
                Files.copy(file.toPath(), Paths.get(destination + "\\" + fileName), StandardCopyOption.REPLACE_EXISTING);
            }
        } catch (IOException e) {
            //...
        }
    }
}
You are using recursion, so every recursive call starts with a new, empty Map. Move the map outside of your method (or pass it along to the recursive calls) and that will solve your problem.
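For example, a sketch of passing a single counter map through the recursion; the extension check, counter update and rename logic stay exactly as in your original method:

void copy(File rootDirectory, String destination, String fileExtension) {
    copy(rootDirectory, destination, fileExtension, new HashMap<>());
}

private void copy(File rootDirectory, String destination, String fileExtension,
                  Map<String, Integer> counter) {
    for (File file : rootDirectory.listFiles()) {
        if (file.isDirectory()) {
            // The same map instance is reused at every level of the recursion.
            copy(file, destination, fileExtension, counter);
        } else {
            // ... existing extension check, counter update and Files.copy() call ...
        }
    }
}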
I have a performance problem when trying to create a CSV file starting from another CSV file.
This is how the original file looks:
country,state,co,olt,olu,splitter,ont,cpe,cpe.latitude,cpe.longitude,cpe.customer_class,cpe.phone,cpe.ip,cpe.subscriber_id
COUNTRY-0001,STATE-0001,CO-0001,OLT-0001,OLU0001,SPLITTER-0001,ONT-0001,CPE-0001,28.21487,77.451775,ALL,SIP:+674100002743#IMS.COMCAST.NET,SIP:E28EDADA06B2#IMS.COMCAST.NET,CPE_SUBSCRIBER_ID-QHLHW4
COUNTRY-0001,STATE-0002,CO-0002,OLT-0002,OLU0002,SPLITTER-0002,ONT-0002,CPE-0002,28.294018,77.068924,ALL,SIP:+796107443092#IMS.COMCAST.NET,SIP:58DD999D6466#IMS.COMCAST.NET,CPE_SUBSCRIBER_ID-AH8NJQ
Potentially there could be millions of lines like this; I have detected the problem with 1,280,000 lines.
This is the algorithm:
File csvInputFile = new File(csv_path);
int blockSize = 409600;
brCsvInputFile = new BufferedReader(frCsvInputFile, blockSize);
String line = null;
StringBuilder sbIntermediate = new StringBuilder();
skipFirstLine(brCsvInputFile);
while ((line = brCsvInputFile.readLine()) != null) {
createIntermediateStringBuffer(sbIntermediate, line.split(REGEX_COMMA));
}
private static void skipFirstLine(BufferedReader br) throws IOException {
String line = br.readLine();
String[] splitLine = line.split(REGEX_COMMA);
LOGGER.debug("First line detected! ");
createIndex(splitLine);
createIntermediateIndex(splitLine);
}
private static void createIndex(String[] splitLine) {
LOGGER.debug("START method createIndex.");
for (int i = 0; i < splitLine.length; i++)
headerIndex.put(splitLine[i], i);
printMap(headerIndex);
LOGGER.debug("COMPLETED method createIndex.");
}
private static void createIntermediateIndex(String[] splitLine) {
LOGGER.debug("START method createIntermediateIndex.");
com.tekcomms.c2d.xml.model.v2.Metadata_element[] metadata_element = null;
String[] servicePath = newTopology.getElement().getEntity().getService_path().getLevel();
if (newTopology.getElement().getMetadata() != null)
metadata_element = newTopology.getElement().getMetadata().getMetadata_element();
LOGGER.debug(servicePath.toString());
LOGGER.debug(metadata_element.toString());
headerIntermediateIndex.clear();
int indexIntermediateId = 0;
for (int i = 0; i < servicePath.length; i++) {
String level = servicePath[i];
LOGGER.debug("level is: " + level);
headerIntermediateIndex.put(level, indexIntermediateId);
indexIntermediateId++;
// its identifier will be placed at the next position
headerIntermediateIndex.put(level + "ID", indexIntermediateId);
indexIntermediateId++;
}
// adding cpe.latitude,cpe.longitude,cpe.customer_class, it could be
// better if it would be metadata as well.
String labelLatitude = newTopology.getElement().getEntity().getLatitude();
// indexIntermediateId++;
headerIntermediateIndex.put(labelLatitude, indexIntermediateId);
String labelLongitude = newTopology.getElement().getEntity().getLongitude();
indexIntermediateId++;
headerIntermediateIndex.put(labelLongitude, indexIntermediateId);
String labelCustomerClass = newTopology.getElement().getCustomer_class();
indexIntermediateId++;
headerIntermediateIndex.put(labelCustomerClass, indexIntermediateId);
// adding metadata
// cpe.phone,cpe.ip,cpe.subscriber_id,cpe.vendor,cpe.model,cpe.customer_status,cpe.contact_telephone,cpe.address,
// cpe.city,cpe.state,cpe.zip,cpe.bootfile,cpe.software_version,cpe.hardware_version
// now i need to iterate over each Metadata_element belonging to
// topology.element.metadata
// are there any metadata?
if (metadata_element != null && metadata_element.length != 0)
for (int j = 0; j < metadata_element.length; j++) {
String label = metadata_element[j].getLabel();
label = label.toLowerCase();
LOGGER.debug(" ==label: " + label + " index_pos: " + j);
indexIntermediateId++;
headerIntermediateIndex.put(label, indexIntermediateId);
}
printMap(headerIntermediateIndex);
LOGGER.debug("COMPLETED method createIntermediateIndex.");
}
Reading the entire dataset of 1,280,000 lines takes only 800 ms, so the problem is in this method:
private static void createIntermediateStringBuffer(StringBuilder sbIntermediate, String[] splitLine) throws ClassCastException,
NullPointerException {
LOGGER.debug("START method createIntermediateStringBuffer.");
long start, end;
start = System.currentTimeMillis();
ArrayList<String> hashes = new ArrayList<String>();
com.tekcomms.c2d.xml.model.v2.Metadata_element[] metadata_element = null;
String[] servicePath = newTopology.getElement().getEntity().getService_path().getLevel();
LOGGER.debug(servicePath.toString());
if (newTopology.getElement().getMetadata() != null) {
metadata_element = newTopology.getElement().getMetadata().getMetadata_element();
LOGGER.debug(metadata_element.toString());
}
for (int i = 0; i < servicePath.length; i++) {
String level = servicePath[i];
LOGGER.debug("level is: " + level);
if (splitLine.length > getPositionFromIndex(level)) {
String name = splitLine[getPositionFromIndex(level)];
sbIntermediate.append(name);
hashes.add(name);
sbIntermediate.append(REGEX_COMMA).append(HashUtils.calculateHash(hashes)).append(REGEX_COMMA);
LOGGER.debug(" ==sbIntermediate: " + sbIntermediate.toString());
}
}
// end=System.currentTimeMillis();
// LOGGER.info("COMPLETED adding name hash. " + (end - start) + " ms. " + (end - start) / 1000 + " seg.");
// adding cpe.latitude,cpe.longitude,cpe.customer_class, it should be
// better if it would be metadata as well.
String labelLatitude = newTopology.getElement().getEntity().getLatitude();
if (splitLine.length > getPositionFromIndex(labelLatitude)) {
String lat = splitLine[getPositionFromIndex(labelLatitude)];
sbIntermediate.append(lat).append(REGEX_COMMA);
}
String labelLongitude = newTopology.getElement().getEntity().getLongitude();
if (splitLine.length > getPositionFromIndex(labelLongitude)) {
String lon = splitLine[getPositionFromIndex(labelLongitude)];
sbIntermediate.append(lon).append(REGEX_COMMA);
}
String labelCustomerClass = newTopology.getElement().getCustomer_class();
if (splitLine.length > getPositionFromIndex(labelCustomerClass)) {
String customerClass = splitLine[getPositionFromIndex(labelCustomerClass)];
sbIntermediate.append(customerClass).append(REGEX_COMMA);
}
// end=System.currentTimeMillis();
// LOGGER.info("COMPLETED adding lat,lon,customer. " + (end - start) + " ms. " + (end - start) / 1000 + " seg.");
// watch out: metadata is optional, it can appear as an empty string!
if (metadata_element != null && metadata_element.length != 0)
for (int j = 0; j < metadata_element.length; j++) {
String label = metadata_element[j].getLabel();
LOGGER.debug(" ==label: " + label + " index_pos: " + j);
if (splitLine.length > getPositionFromIndex(label)) {
String actualValue = splitLine[getPositionFromIndex(label)];
if (!"".equals(actualValue))
sbIntermediate.append(actualValue).append(REGEX_COMMA);
else
sbIntermediate.append("").append(REGEX_COMMA);
} else
sbIntermediate.append("").append(REGEX_COMMA);
LOGGER.debug(" ==sbIntermediate: " + sbIntermediate.toString());
}//for
sbIntermediate.append("\n");
end = System.currentTimeMillis();
LOGGER.info("COMPLETED method createIntermediateStringBuffer. " + (end - start) + " ms. ");
}
As you can see, this method reads every line from the input CSV file, calculates new data from it, and appends the generated line to the StringBuilder, so that finally I can create the output file from that buffer.
I have run JConsole and I can see that there are no memory leaks; I can see the sawtooth pattern representing the creation of objects and the GC collecting garbage. It never crosses the heap threshold.
One thing I have noticed is that the time needed to append a new line to the buffer starts out in the range of a few ms (5, 6, 10), but rises over time to 100-200 ms, and I suspect it will keep growing, so this is probably the bottleneck.
I have tried to analyze the code. I know that there are 3 for loops, but they are very short; the first loop iterates over only 8 elements:
for (int i = 0; i < servicePath.length; i++) {
String level = servicePath[i];
LOGGER.debug("level is: " + level);
if (splitLine.length > getPositionFromIndex(level)) {
String name = splitLine[getPositionFromIndex(level)];
sbIntermediate.append(name);
hashes.add(name);
sbIntermediate.append(REGEX_COMMA).append(HashUtils.calculateHash(hashes)).append(REGEX_COMMA);
LOGGER.debug(" ==sbIntermediate: " + sbIntermediate.toString());
}
}
I have measured the time needed to get the name from the split line and it is negligible, 0 ms; the same goes for the calculateHash method, 0 ms.
The other loops are practically the same; they iterate from 0 to n, where n is a very small int, 3 to 10 for example. So I do not understand why the method takes more and more time to finish; the only thing I can find is that appending a new line to the buffer is slowing the process down.
I am thinking about a producer-consumer multithreaded strategy: a reader thread that reads every line and puts it into a circular buffer, other threads that take the lines one by one, process them, and append the precalculated line to the StringBuffer (which is thread safe); when the file is fully read, the reader thread sends a message to the other threads telling them to stop. Finally I have to save this buffer to a file. What do you think? Is this a good idea?
I am thinking about a producer-consumer multithreaded strategy: a reader thread that reads every line and puts it into a circular buffer, other threads that take the lines one by one, process them, and append the precalculated line to the StringBuffer (which is thread safe); when the file is fully read, the reader thread sends a message to the other threads telling them to stop. Finally I have to save this buffer to a file. What do you think? Is this a good idea?
Maybe, but it's quite a lot of work; I'd try something simpler first.
line.split(REGEX_COMMA)
Your REGEX_COMMA is a string which gets compiled into a regex a million times. It's a trivial cost, but I'd use a precompiled Pattern instead.
You're producing a lot of garbage with your split. Maybe you should avoid it by manually splitting the input into a reused ArrayList<String> (it's just a few lines).
If all you need is writing the result into a file, it might be better to avoid building one huge String. Maybe a List<String> or even a List<StringBuilder> would be better, maybe writing directly to a buffered stream would do.
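To make the first few points concrete, here is a rough sketch combining a precompiled Pattern with writing each generated line straight to a buffered writer instead of building one huge buffer; createOutputLine is only a placeholder for your existing per-line logic:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.regex.Pattern;

class CsvTransform {
    // Compiled once and reused for every line, instead of String.split(REGEX_COMMA).
    private static final Pattern COMMA = Pattern.compile(",");

    static void transform(String inPath, String outPath) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(inPath));
             BufferedWriter out = new BufferedWriter(new FileWriter(outPath))) {
            String header = in.readLine();            // handle the header as before
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = COMMA.split(line);
                out.write(createOutputLine(fields));  // placeholder for the per-line logic
                out.newLine();                        // nothing accumulates in memory
            }
        }
    }

    // Placeholder for building the intermediate line from the split fields.
    private static String createOutputLine(String[] fields) {
        return String.join(",", fields);
    }
}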
You seem to be working with ASCII only. Your encoding is platform dependent which may mean you're using UTF-8, which is possibly slow. Switching to a simpler encoding could help.
Working with byte[] instead of String would most probably help. Bytes are half as big as chars and there's no conversion needed when reading a file. All the operations you do can be done with bytes equally easily.
One thing I have noticed is that the time needed to append a new line to the buffer starts out in the range of a few ms (5, 6, 10), but rises over time to 100-200 ms, and I suspect it will keep growing, so this is probably the bottleneck.
That's resizing, which could be sped up by using the suggested ArrayList<String>, as the amount of data to be copied is much lower. Writing the data out when the buffer gets big would do as well.
I have measured the time needed to get the name from the split line and it is negligible, 0 ms; the same goes for the calculateHash method, 0 ms.
Never use currentTimeMillis for this, as nanoTime is strictly better. Use a profiler. The problem with a profiler is that it changes what it should measure. As a poor man's profiler, you can compute the sum of all the times spent inside the suspect method and compare it with the total time.
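As a rough illustration of that poor man's profiler idea (just one way the timing could be accumulated, not code from the question):

class PoorMansProfiler {
    private static long suspectNanos = 0;

    // Wrap each call of the suspect method; everything else in the program stays the same.
    static void timed(Runnable suspect) {
        long t0 = System.nanoTime();
        try {
            suspect.run();
        } finally {
            suspectNanos += System.nanoTime() - t0;
        }
    }

    // At the end of the run, compare the accumulated time with the total run time.
    static void report(long totalRunNanos) {
        System.out.printf("suspect method: %d ms of %d ms total%n",
                suspectNanos / 1_000_000, totalRunNanos / 1_000_000);
    }
}

Calling timed(() -> createIntermediateStringBuffer(sb, fields)) for every line and printing the report at the end shows how much of the total run time really goes into that method.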
What's the CPU load and what does GC do when running the program?
I used the Super CSV library in my project to handle a large set of lines. It is relatively fast compared to reading the lines manually. Reference
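For example, a minimal read loop with Super CSV's CsvListReader looks roughly like this (based on its standard API; adjust to the version of the library you use):

import java.io.FileReader;
import java.util.List;

import org.supercsv.io.CsvListReader;
import org.supercsv.prefs.CsvPreference;

class SuperCsvRead {
    static void readAll(String path) throws Exception {
        try (CsvListReader reader = new CsvListReader(new FileReader(path),
                CsvPreference.STANDARD_PREFERENCE)) {
            String[] header = reader.getHeader(true); // consumes the first line
            List<String> row;
            while ((row = reader.read()) != null) {
                // process one parsed row here
            }
        }
    }
}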
I was able to read files in a multi-process environment by using file locking,
and in the multithreaded (single-process) case I used a queue: I filled it with file names, started separate threads to read from it, and then waited until all the reading was over, after which I renamed the files. In this way I read files with multiple threads (in batches).
Now I want to read the files in a directory using both multiple processes and multiple threads. I tried merging my two approaches, but that didn't fare well: the log showed that a lot of files threw FileNotFoundException (because their names had been changed), some were never read (because the thread died), and sometimes locks were not released.
///////////////////////////////////////////////////////////////////////
//file filter inner class
class myfilter implements FileFilter{
@Override
public boolean accept(File pathname) {
// TODO Auto-generated method stub
Pattern pat = Pattern.compile("email[0-9]+$");
Matcher mat = pat.matcher(pathname.toString());
if(mat.find()) {
return true;
}
return false;
}
}
/////////////////////////////////////////////////////////////////////////
myfilter filter = new myfilter();
File alreadyread[] = new File[5];
Thread t[] = new Thread[5];
fileread filer[] = new fileread[5];
File file[] = directory.listFiles(filter);
FileChannel filechannel[] = new FileChannel[5];
FileLock lock[] = new FileLock[5];
tuple_json = new ArrayList();
//System.out.println("ayush");
while(true) {
//declare a queue
ConcurrentLinkedQueue filequeue = new ConcurrentLinkedQueue();
//addfilenames to queue and their renamed file names
try{
if(file.length!=0) {
//System.out.println(file.length);
for(int i=0;i<5 && i<file.length;i++) {
System.out.println("acquiring lock on file " + file[i].toString());
try{
filechannel[i] = new RandomAccessFile(file[i], "rw").getChannel();
lock[i] = filechannel[i].tryLock();
}
catch(Exception e) {
file[i] = null;
lock[i] = null;
System.out.println("cannot acquire lock");
}
if(lock[i]!=null){
System.out.println("lock acquired on file " + file[i].toString());
filequeue.add(file[i]);
alreadyread[i] = new File(file[i].toString() + "read");
System.out.println(file[i].toString() + "-----" + times);
}
else{
System.out.println("else condition of acquiring lock");
file[i] = null;
}
System.out.println("-----------------------------------");
}
//starting the thread to read the files
for(int i=0;i<5 && i<file.length && lock[i]!=null && file[i]!=null;i++){
filer[i] = new fileread(filequeue.toArray()[i].toString());
t[i] = new Thread(filer[i]);
System.out.println("starting a thread to read file" + file[i].toString());
t[i].start();
}
//read the text
for(int i=0;i<5 && i<file.length && lock[i]!=null && file[i]!=null;i++) {
try {
System.out.println("waiting to read " + file[i].toString() + " to be read completely");
t[i].join();
System.out.println(file[i] + " was read completetly");
//System.out.println(filer[i].getText());
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
//file has been read Now rename the file
for(int i=0;i<5 && i<file.length && lock[i]!=null && file[i]!=null;i++){
if(lock[i]!=null){
System.out.println("renaming file " + file[i].toString());
file[i].renameTo(alreadyread[i]);
System.out.println("releasing lock on file " + file[i].toString());
lock[i].release();
}
}
//rest of the processing
/////////////////////////////////////////////////////////////////////////////////////////////////////
Fileread class
class fileread implements Runnable{
//String loc = "/home/ayusun/workspace/Eclipse/fileread/bin";
String fileloc;
BufferedReader br;
String text = "";
public fileread(String filename) {
this.fileloc = filename;
}
@Override
public void run() {
try {
br = new BufferedReader(new FileReader(fileloc));
System.out.println("started reading file" + fileloc);
String currline;
while((( currline = br.readLine())!=null)){
if(text == "")
text += currline;
else
text += "\n" + currline;
}
System.out.println("Read" + fileloc + " completely");
br.close();
} catch ( IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public String getText() {
return text;
}
}
I would like to know if there is any other approach that I can adopt.
If you want to create exclusive access to a file, you cannot use file locking, as on most OSes file locking is advisory, not mandatory.
I'd suggest creating a common lock directory for all your processes; in this lock directory, you would create a directory per file you want to lock, right before you open a file.
The big advantage is that directory creation, unlike file creation, is atomic; as such, you can use Files.createDirectory() (or File's .mkdir() if you still use Java6 but then don't forget to check the return code) to grab a lock on the files you read. If this fails, you know someone else is using the file.
Of course, when you're done with a file, don't forget to remove the lock directory matching this file... (in a finally block)
(note: with Java 7 you can use Files.newBufferedReader(); there is even Files.readAllLines())
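A minimal sketch of this lock-directory approach (the common lock root and the ".lock" naming are just illustrative choices):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

class DirLock {
    // Atomic across processes: only one caller can create the directory.
    static boolean tryLock(Path lockRoot, String fileName) throws IOException {
        try {
            Files.createDirectory(lockRoot.resolve(fileName + ".lock"));
            return true;
        } catch (FileAlreadyExistsException e) {
            return false; // another process or thread owns this file
        }
    }

    static void unlock(Path lockRoot, String fileName) throws IOException {
        Files.deleteIfExists(lockRoot.resolve(fileName + ".lock"));
    }

    static void process(Path dataFile, Path lockRoot) throws IOException {
        String name = dataFile.getFileName().toString();
        if (!tryLock(lockRoot, name)) {
            return; // skip it, someone else is reading this file
        }
        try {
            for (String line : Files.readAllLines(dataFile, StandardCharsets.UTF_8)) {
                // handle one line
            }
        } finally {
            unlock(lockRoot, name); // always release, even on failure
        }
    }
}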
If you need to process a large number of files using multiple threads, you should probably first distribute the specific files to each thread before it starts.
For example, if you only want to process files whose names start with email and are followed by some digits, you could create 10 threads. The first thread would look for files with names starting with email0, the second thread could handle email1, etc.
This of course would be efficient only if the numbers are evenly distributed.
Another way could be to have the main thread run through and collect all the filenames to deal with. It could then divide the files across the number of available threads, and pass each thread an array of those file names.
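For example, a rough sketch of that divide-up-front idea with a fixed thread pool (the bucket assignment is a simple round-robin; the actual per-file reading and renaming is left out):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PartitionedReader {
    static void readAll(File directory, int threadCount) throws InterruptedException {
        File[] files = directory.listFiles();
        if (files == null) {
            return;
        }
        // One bucket of files per worker thread, filled round-robin.
        List<List<File>> buckets = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int i = 0; i < files.length; i++) {
            buckets.get(i % threadCount).add(files[i]);
        }
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        for (List<File> bucket : buckets) {
            pool.submit(() -> {
                for (File f : bucket) {
                    // read and rename f here; no other thread ever touches it
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}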
There could be other ways of dividing the system load which are relevant to your situation.
I'm using org.apache.commons.net.ftp.FTPClient and seeing behavior that is, well... perplexing.
The method below is meant to go through a list of FTPFiles, read each one in, and then do something with the contents. That's all working. What is not (really) working is that the FTPClient object does the following...
1) Properly retrieves and stores the FIRST file in the list
2) Evaluates the stream to NULL for x successive iterations of the loop (x varies on successive attempts)
3) Manages to retrieve exactly 1 more file in the list
4) Reports null for exactly 1 more file in the list
5) Hangs indefinitely, reporting no further activity.
public static String mergeXMLFiles(List<FTPFile> files, String rootElementNodeName, FTPClient ftp){
String ret = null;
String fileAsString = null;
//InputStream inStream;
int c;
if(files == null || rootElementNodeName == null)
return null;
try {
System.out.println("GETTING " + files.size() + " files");
for (FTPFile file : files) {
fileAsString = "";
InputStream inStream = ftp.retrieveFileStream(file.getName());
if(inStream == null){
System.out.println("FtpUtil.mergeXMLFiles() couldn't initialize inStream for file:" + file.getName());
continue;//THIS IS THE PART THAT I SEE FOR files [1 - arbitrary number (usually around 20)] and then 1 more time for [x + 2] after [x + 1] passes successfully.
}
while((c = inStream.read()) != -1){
fileAsString += Character.valueOf((char)c);
}
inStream.close();
System.out.println("FILE:" + file.getName() + "\n" + fileAsString);
}
} catch (Exception e) {
System.out.println("FtpUtil.mergeXMLFiles() failed:" + e);
}
return ret;
}
Has anyone seen anything like this? I'm new to FTPClient; am I doing something wrong with it?
According to the API for FTPClient.retrieveFileStream(), the method returns null when it cannot open the data connection, in which case you should check the reply code (e.g. getReplyCode(), getReplyString(), getReplyStrings()) to see why it failed. Also, you are supposed to finalize file transfers by calling completePendingCommand() and verifying that the transfer was indeed successful.
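For illustration, a rough sketch of that per-file sequence: check for null and inspect the reply, read and close the stream, then call completePendingCommand() before asking for the next file (error handling kept minimal):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.net.ftp.FTPClient;

class FtpDownload {
    static String download(FTPClient ftp, String remoteName) throws IOException {
        InputStream in = ftp.retrieveFileStream(remoteName);
        if (in == null) {
            // The data connection could not be opened; the reply explains why.
            System.out.println("retrieveFileStream failed: " + ftp.getReplyString());
            return null;
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int c;
        while ((c = in.read()) != -1) {
            buffer.write(c);
        }
        in.close();
        // Must be called after every retrieveFileStream() before issuing further commands.
        if (!ftp.completePendingCommand()) {
            System.out.println("Transfer failed: " + ftp.getReplyString());
            return null;
        }
        return buffer.toString();
    }
}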
It works OK when I add this after the "retrieve" command:
int response = client.getReply();
if (response != FTPReply.CLOSING_DATA_CONNECTION){
//TODO
}