how to delete a particular line from random access file - java

I am using a random access file from which I want to delete a line that satisfies some condition. For example, if I have the records
MCA 30
MBA 20
BCA 10
my requirement is that if I enter MBA, then the second line should be deleted.

Generally, removing an item from the middle of a file means rewriting every entry that follows it, so that the space the item occupied gets reused.
Some systems instead mark deleted items with an invalid value so the slot can be recognised as unused. Typically they don't even reuse deleted slots, as that is rather more management than you might imagine, and is basically implementing a heap-like allocator inside a file. They need a separate 'compacting' step to reclaim this dead space later. Microsoft Jet (as in Access) worked like this.
There's a very cool optimisation that's applicable in some cases:
If the rows are unordered, and equal length, you can overwrite the entry you want to 'delete' with the last entry, and truncate the file.
If the rows are unordered but not fixed length, you can use a more complicated variant of this approach: move an entry from near the end that is the same length as the item being removed, so that as few entries as possible need to be shuffled up.
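A minimal sketch of that trick with RandomAccessFile, assuming fixed-length records (the record length and file layout here are made up for illustration):

import java.io.IOException;
import java.io.RandomAccessFile;

public class FixedRecordDelete {
    static final int RECORD_LEN = 8;   // assumed fixed record size in bytes

    // Delete the record at 'index' by overwriting it with the last record and truncating.
    static void delete(RandomAccessFile raf, long index) throws IOException {
        long count = raf.length() / RECORD_LEN;
        byte[] last = new byte[RECORD_LEN];
        raf.seek((count - 1) * RECORD_LEN);        // read the last record
        raf.readFully(last);
        raf.seek(index * RECORD_LEN);              // overwrite the slot being deleted
        raf.write(last);
        raf.setLength((count - 1) * RECORD_LEN);   // drop the now-duplicate last record
    }
}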

If it is a simple text file, then you'll have to copy everything except the MBA line(s) to a new file (see the sketch below). Random access files don't really support delete or insert.
Optimization: move everything after the MBA line up (in the same file)
Alternative: use something more structured like a database.
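A minimal sketch of that copy-and-filter approach for the records in the question (the file names are placeholders):

import java.io.*;
import java.nio.file.*;

public class DeleteLine {
    public static void main(String[] args) throws IOException {
        String key = "MBA";                           // the record to delete
        File original = new File("records.txt");     // placeholder file names
        File temp = new File("records.tmp");
        try (BufferedReader in = new BufferedReader(new FileReader(original));
             PrintWriter out = new PrintWriter(new FileWriter(temp))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.startsWith(key + " ")) {    // keep every line except the matching one(s)
                    out.println(line);
                }
            }
        }
        Files.move(temp.toPath(), original.toPath(), StandardCopyOption.REPLACE_EXISTING);
    }
}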

It sounds to me like you're looking for a grep-like implementation. There's a Java implementation of GNU grep; you can find the documentation here and download it from a link on that page.

Related

Sorting text lines from hard drive files by partly loading to memory | Java

My task is to sort a file which is too large to fit in memory. The file contains text lines.
What I did:
read the original file in parts (of the allowed size)
sorted each part
saved each sorted part to a temp file
As I understand it, the next thing I should do is:
read the first line of each file
sort them against each other (using a local variable to store them temporarily, though I am not sure it will stay below the restricted size)
write the first line (the result of the sorting) to the final file
remove the line I just wrote from its temporary file
repeat steps 1-4 until all lines are sorted and "transferred" from the temp files to the final file
I am most unsure about step 4 - is there a class that can look for a value and then erase the line with this value (at that point I won't even know which file that line came from)? I think this is not a proper way to reach my goal at all, but I need to remove lines which are already sorted, and I can't operate on the files' data in memory.
Do you need to do this in Java (I'm assuming so from the tag)? Memory-wise it isn't going to be an efficient way to do it. The simplest option in my opinion would be to use sort and just sort the file directly at the OS level.
This article will give you a guide on how to use sort: https://www.geeksforgeeks.org/sort-command-linuxunix-examples/
Sort is available on Windows as well as unix/linux and can handle huge files.
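For completeness, here is a minimal sketch of the merge step described in the question above, assuming the sorted chunks already exist as temp files. It sidesteps step 4 entirely: instead of deleting lines from the temp files, it keeps one open reader per file and always writes the smallest current line, so only one line per chunk is held in memory at a time.

import java.io.*;
import java.util.*;

public class MergeSortedChunks {

    // Holds one sorted temp file's reader and the line waiting to be merged.
    static class Source {
        final BufferedReader reader;
        String line;
        Source(File f) throws IOException {
            reader = new BufferedReader(new FileReader(f));
            line = reader.readLine();
        }
    }

    public static void merge(List<File> chunks, File result) throws IOException {
        PriorityQueue<Source> queue = new PriorityQueue<>(Comparator.comparing((Source s) -> s.line));
        for (File f : chunks) {
            Source s = new Source(f);
            if (s.line != null) queue.add(s);
        }
        try (PrintWriter out = new PrintWriter(new FileWriter(result))) {
            while (!queue.isEmpty()) {
                Source s = queue.poll();            // source with the smallest current line
                out.println(s.line);
                s.line = s.reader.readLine();       // advance that source
                if (s.line != null) queue.add(s);   // re-queue it if lines remain
                else s.reader.close();
            }
        }
    }
}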

MarkLogic: Move document from one directory to another on some condition

I'm new to MarkLogic and trying to implement the following scenario with its Java API:
For each user I'll have two directories, something like:
1.1. user1/xmls/recent/
1.2. user1/xmls/archived/
When a user does something with his xml, it's put into the "recent" directory;
When a user does something with his next xml and the "recent" directory is full (e.g. it has some number of documents, let's say 20), the oldest document is moved to the "archived" directory;
User can request all documents from the "recent" directory and should get no more than 20 records;
User can remove something from the "recent" directory manually; In this case, if it had 20 documents, after deleting one it must have 19;
User can do something with his xmls simultaneously and "recent" directory should never become bigger than 20 entries.
Questions are:
In order to properly handle simultaneous additions of xmls to the "recent" directory, should I lock the whole "recent" directory when adding a new entry (add it, check whether there are more than 20 records after adding, select the oldest 21st one, move it to the "archived" directory, and do all these steps atomically)? How can I do that?
Any suggestions on how to implement this via Java API?
Is it possible to change a document's URI (e.g. replace "recent" with "archived" in my case)?
Should I consider using MarkLogic's collections here?
I'm open to any suggestions and comments (as I said I'm new to MarkLogic and maybe my thoughts on how to handle described scenario are completely wrong).
You can achieve atomicity across a sequence of operations using Multi-Statement Transactions (MST).
It is possible to use MSTs from the Java API: http://docs.marklogic.com/guide/java/transactions#id_79848
It's not possible to change a URI. However, it is possible to use an MST to delete the old document and reinsert a new one under the new URI in one atomic step. This would have the same effect.
Possibly, and judging from your use case, unless you must have the recent/archived information as part of the URI, it may be simpler to store this information in collections. However, you should read the documentation and evaluate for yourself: http://docs.marklogic.com/guide/search-dev/collections#chapter
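A rough sketch of that delete-and-reinsert-under-one-transaction idea with the Java Client API. The host, port, credentials and URIs are placeholders, and the exact factory call varies between client versions, so treat this as an outline rather than a definitive implementation:

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.Transaction;
import com.marklogic.client.document.XMLDocumentManager;
import com.marklogic.client.io.StringHandle;

public class MoveDocument {
    public static void main(String[] args) {
        // Placeholder connection details; the factory call shown is from a newer client version.
        DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8000,
                new DatabaseClientFactory.DigestAuthContext("user", "password"));
        XMLDocumentManager docMgr = client.newXMLDocumentManager();

        String oldUri = "user1/xmls/recent/doc1.xml";     // example URIs from the question
        String newUri = "user1/xmls/archived/doc1.xml";

        Transaction txn = client.openTransaction();
        try {
            // Read the document, write it under the new URI, delete the old one --
            // all inside the same transaction, so other requests never see a half-moved state.
            StringHandle content = docMgr.read(oldUri, new StringHandle(), txn);
            docMgr.write(newUri, content, txn);
            docMgr.delete(oldUri, txn);
            txn.commit();
        } catch (RuntimeException e) {
            txn.rollback();
            throw e;
        } finally {
            client.release();
        }
    }
}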
Personally I would skip all the hassle with separate directories as well as collections. You would endlessly have to move files around or change their properties. It would be much easier not to calculate anything up front, and simply use the lastModified property, or something similar, to determine the most recent items at run-time.
HTH!

Java - overwriting specific parts of a file

I would like to update a specific part of a text file using Java. I would like to be able to scan through the file and select specific lines to be updated, a bit like in a database. For instance, given the file:
ID Value
1 100
2 500
4 20
I would like to insert 3 and update 4, e.g.
ID Value
1 100
2 500
3 80
4 1000
Is there a way to achieve this (seemingly) easy task? I know you can append to a file, but I am more interested in random access.
You're trying to insert and delete bytes in the middle of a file. You can't do that. File systems simply don't (in general) support that. You can overwrite specific bytes, but you can't insert or delete them.
You could update specific records with random access if your records were fixed-length (in bytes) but it looks like that's not the case.
You could either load the whole file into memory, or read from the original file, writing to a new file with either the old data or the new data as appropriate on a per line basis.
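A minimal sketch of that read-and-rewrite approach for the example above (the file names, the "ID Value" layout, and the whitespace parsing are assumptions):

import java.io.*;
import java.nio.file.*;
import java.util.*;

public class UpdateRecords {
    public static void main(String[] args) throws IOException {
        // Desired changes: insert ID 3 with value 80, update ID 4 to 1000.
        TreeMap<Integer, Integer> updates = new TreeMap<>(Map.of(3, 80, 4, 1000));

        File original = new File("records.txt");     // placeholder file names
        File temp = new File("records.tmp");
        try (BufferedReader in = new BufferedReader(new FileReader(original));
             PrintWriter out = new PrintWriter(new FileWriter(temp))) {
            out.println(in.readLine());               // copy the "ID Value" header (assumed present)
            String line;
            while ((line = in.readLine()) != null) {
                int id = Integer.parseInt(line.split("\\s+")[0]);
                // Emit any pending new IDs that sort before the current row.
                while (!updates.isEmpty() && updates.firstKey() < id) {
                    out.println(updates.firstKey() + " " + updates.remove(updates.firstKey()));
                }
                if (updates.containsKey(id)) {
                    out.println(id + " " + updates.remove(id));   // overwrite the value of an existing row
                } else {
                    out.println(line);                            // keep the row unchanged
                }
            }
            for (Map.Entry<Integer, Integer> e : updates.entrySet()) {
                out.println(e.getKey() + " " + e.getValue());     // anything left goes at the end
            }
        }
        Files.move(temp.toPath(), original.toPath(), StandardCopyOption.REPLACE_EXISTING);
    }
}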
You can do so using RandomAccessFile in Java, which lets you position the current read and write pointer using the available methods. You can explore more here.
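A tiny sketch of what seek-and-write looks like. Note that it overwrites bytes in place rather than inserting them; the file name and offset here are made up:

import java.io.IOException;
import java.io.RandomAccessFile;

public class OverwriteInPlace {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("records.txt", "rw")) {
            raf.seek(12);                // jump to an arbitrary byte offset
            raf.writeBytes("9999");      // overwrites exactly 4 bytes; nothing shifts
        }
    }
}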
Load the file into memory, change your value, and then re-write the file
if there's a way to insert into a file without loading it, I haven't heard of it. You have to move the other data out of the way first.
Unless you're dealing with huge files frequently, performance isn't too much of a concern.
As said in the previous answers, it's not possible to do that simply using streams. You could try to use properties, which are key/value pairs that can be saved to and modified in a text file.
For example, you can add a new property to a file with
setProperty(String key, String value)
This method adds a new property or, if one already exists, modifies the value of the property with the chosen key.
Obviously, new properties are added at the end of the file, but the lack of ordering is not a problem for performance, because access goes through the getProperty method, which is backed by the underlying Hashtable.
See this tutorial for some examples:
http://docs.oracle.com/javase/tutorial/essential/environment/properties.html
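A minimal sketch of that approach (the file name and keys are made up):

import java.io.*;
import java.util.Properties;

public class PropertiesDemo {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        File file = new File("values.properties");       // placeholder file name
        if (file.exists()) {
            try (Reader in = new FileReader(file)) {
                props.load(in);                           // read existing key=value pairs
            }
        }
        props.setProperty("3", "80");                     // add a new entry
        props.setProperty("4", "1000");                   // or overwrite an existing one
        try (Writer out = new FileWriter(file)) {
            props.store(out, "updated values");           // rewrite the whole file
        }
    }
}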

Java: read groups of lines with same prefix from very large text file

I have a large (~100GB) text file structured like this:
A,foobar
A,barfoo
A,foobar
B,barfoo
B,barfoo
C,foobar
Each line is a comma-separated pair of values. The file is sorted by the first value in the pair. The lines are variable length. Define a group as being all lines with a common first value, i.e. with the example quoted above all lines starting with "A," would be a group, all lines starting with "B," would be another group.
The entire file is too large to fit into memory, but all the lines from any individual group will always fit into memory.
I have a routine for processing a single such group of lines and writing to a text file. My problem is that I don't know how best to read the file a group at a time. All the groups are of arbitrary, unknown size. I have considered two ways:
1) Scan the file using a BufferedReader, accumulating the lines from a group in a String or array. Whenever a line is encountered that belongs to a new group, hold that line in a temporary variable, process the previous group. Clear the accumulator, add the temporary and then continue reading the new group starting from the second line.
2) Scan the file using a BufferedReader, whenever a line is encountered that belongs to a new group, somehow reset the cursor so that when readLine() is next invoked it starts from the first line of the group instead of the second. I have looked into mark() and reset() but these require knowing the byte-position of the start of the line.
I'm going to go with (1) at the moment, but I would be very grateful if someone could suggest a method that smells less.
I think a PushbackReader would work:
if (lineBelongsToNewGroup) {
    reader.unread("\n".toCharArray());        // push the line terminator back first...
    reader.unread(lastLine.toCharArray());    // ...then the line, so it is read again from its start
}
// note: the PushbackReader must be constructed with a pushback buffer large enough to hold a whole line
I think option 1 is the simplest. I would parse the text yourself rather than use a BufferedReader, as it will take a long time to parse 100 GB.
The only option which is likely to be faster is to use a binary search, accessing the file with RandomAccessFile. You can memory-map 100 GB on a 64-bit JVM. This avoids the need to parse every line, which is pretty expensive. An advantage of this approach is that you can use multiple threads. It is far, far more complicated to implement, but should be much faster. Once you have each boundary, you can copy the raw data in bulk without having to parse all the lines.
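For reference, a minimal sketch of option (1) as described in the question, accumulating one group at a time (the file name and processGroup are placeholders):

import java.io.*;
import java.util.*;

public class GroupReader {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("big.csv"))) {   // placeholder name
            List<String> group = new ArrayList<>();
            String currentKey = null;
            String line;
            while ((line = in.readLine()) != null) {
                String key = line.substring(0, line.indexOf(','));    // first value of the pair
                if (currentKey != null && !key.equals(currentKey)) {
                    processGroup(currentKey, group);                   // previous group is complete
                    group.clear();
                }
                currentKey = key;
                group.add(line);
            }
            if (!group.isEmpty()) {
                processGroup(currentKey, group);                       // don't forget the last group
            }
        }
    }

    // Stand-in for the routine the question already has.
    static void processGroup(String key, List<String> lines) {
        System.out.println(key + ": " + lines.size() + " lines");
    }
}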

Need advice in Efficiency: Scanning 2 very large files worth of information

I have a relatively strange question.
I have a file that is 6 gigabytes long. What I need to do is scan the entire file, line by line, and determine all rows that match an id number of any other row in the file. Essentially, it's like analyzing a web log file where there are many session ids that are organized by the time of each click rather than by userID.
I tried to do the simple (dumb) thing, which was to create 2 file readers: one that scans the file line by line getting the userID, and a second that 1. verifies that the userID has not been processed already and 2. if it hasn't been processed, reads every line in the file that begins with that userID and stores (some value X related to the rows).
Any advice or tips on how I can make this process work more efficiently?
Import file into SQL database
Use SQL
Performance!
Seriously, that's it. Databases are optimized exactly for this kind of thing. Alternatively, if you have a machine with enough RAM, just put all the data into a HashMap for easy lookup.
Easiest: create a data model, import the file into a database, and take advantage of the power of JDBC and SQL (sketched below). If necessary (when the file format is pretty specific) you can write some Java which imports it line by line with the help of BufferedReader#readLine() and PreparedStatement#addBatch().
Hardest: write your Java code so that it doesn't unnecessarily keep large amounts of data in the memory. You're then basically reinventing what the average database already does.
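A minimal sketch of that import loop; the JDBC URL, table, column names, and line format are all assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class LogImporter {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:./logdb", "sa", "");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO clicks(user_id, line) VALUES (?, ?)");
             BufferedReader in = new BufferedReader(new FileReader("access.log"))) {
            String line;
            int pending = 0;
            while ((line = in.readLine()) != null) {
                ps.setString(1, line.split(",")[0]);    // assume the userID is the first field
                ps.setString(2, line);
                ps.addBatch();
                if (++pending % 10_000 == 0) {
                    ps.executeBatch();                  // flush in chunks to keep memory bounded
                }
            }
            ps.executeBatch();
        }
        // After the import, "rows sharing an id" becomes a simple GROUP BY or self-join in SQL.
    }
}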
For each row R in the file {
    Let N be the number that you need to extract from R.
    Check if there is a file called N. If not, create it.
    Append R to the file called N.
}
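A literal Java rendering of that pseudocode (the file name and the position of the id within each row are assumptions; in practice you would cache the writers rather than reopen one per line):

import java.io.*;

public class SplitById {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("big.log"))) {   // placeholder name
            String r;
            while ((r = in.readLine()) != null) {
                String n = r.split(",")[0];               // assume the id is the first field of R
                try (PrintWriter out = new PrintWriter(new FileWriter(n + ".txt", true))) {
                    out.println(r);                       // append mode creates the file if needed
                }
            }
        }
    }
}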
How much data are you storing about each line, compared with the size of the line? Do you have enough memory to maintain the state for each distinct ID (e.g. number of log lines seen, number of exceptions or whatever)? That's what I'd do if possible.
Otherwise, you'll either need to break the log file into separate chunks (e.g. split it based on the first character of the ID) and then parse each file separately, or perhaps have some way of pretending you have enough memory to maintain the state for each distinct ID: have an in-memory cache which dumps values to disk (or reads them back) only when it has to.
You don't mention whether or not this is a regular, ongoing thing or an occasional check.
Have you considered pre-processing the data? Not practical for dynamic data, but if you can sort it based on the field you're interested in, it makes solving the problem much easier. Extracting only the fields you care about may reduce the data volume to a more manageable size as well.
A lot of the other advice here is good, but it assumes you'll be able to load what you need into memory without running out of memory. If you can do that, it would be better than the 'worst case' solution I'm mentioning.
If you have large files you may end up needing to sort them first. In the past I've dealt with multiple large files where I needed to match them up based on a key (sometimes matches were in all files, sometimes only in a couple, etc). If this is the case the first thing you need to do is sort your files. Hopefully you're on a box where you can easily do this (for example there are many good Unix scripts for this). After you've sorted each file read each file until you get matching IDs then process.
I'd suggest:
1. Open both files and read the first record
2. See if you have matching IDs and process accordingly
3. Read the file(s) for the key just processed and do step 2 again until EOF.
For example if you had a key of 1,2,5,8 in FILE1 and 2,3,5,9 in FILE2 you'd:
1. Open and read both files (FILE1 has ID 1, FILE2 has ID 2).
2. Process 1.
3. Read FILE1 (FILE1 has ID 2)
4. Process 2.
5. Read FILE1 (ID 5) and FILE2 (ID 3)
6. Process 3.
7. Read FILE 2 (ID 5)
8. Process 5.
9. Read FILE1 (ID 8) and FILE2 (ID 9).
10. Process 8.
11. Read FILE1 (EOF....no more FILE1 processing).
12. Process 9.
13. Read FILE2 (EOF....no more FILE2 processing).
Make sense?
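A minimal sketch of that matching loop for two sorted files (the file names, the id parsing, and the process methods are placeholders; keys are compared lexicographically here, which is an assumption about how the files were sorted):

import java.io.*;

public class SortedMatch {
    public static void main(String[] args) throws IOException {
        try (BufferedReader f1 = new BufferedReader(new FileReader("file1.sorted"));
             BufferedReader f2 = new BufferedReader(new FileReader("file2.sorted"))) {
            String a = f1.readLine();
            String b = f2.readLine();
            while (a != null && b != null) {
                int cmp = id(a).compareTo(id(b));
                if (cmp == 0) {                        // matching IDs: process both, advance both
                    process(a, b);
                    a = f1.readLine();
                    b = f2.readLine();
                } else if (cmp < 0) {                  // FILE1 is behind: process it and advance it
                    process(a, null);
                    a = f1.readLine();
                } else {                               // FILE2 is behind
                    process(null, b);
                    b = f2.readLine();
                }
            }
            // Any leftover lines in either file have no match in the other.
        }
    }

    static String id(String line) { return line.split(",")[0]; }       // assumed record format

    static void process(String fromFile1, String fromFile2) {
        System.out.println(fromFile1 + " | " + fromFile2);             // placeholder processing
    }
}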
