I searched the web thoroughly for this issue but didn't find anything. Sorry for my bad English; it's also kind of difficult to explain.
I would like to know if there is a simple method or, perhaps, a library that lets me read, write, and identify data blocks (for example, strings) in a file without caring about low-level details such as "read until you find this delimiter, then return what you've read" or "write everything, then append this delimiter".
In particular, given a text file like this:
//Data Block 1
//Some kind of delimiter automatically generated
//Data Block 2
...
//Data Block n
The application must automatically:
Write another data block after the last one in the file;
Iterate over all the data blocks and return each one of them;
Access an arbitrary data block (e.g., I want to read the third data block currently in the file).
Maybe I could just use a data structure, add each piece of data to it separately, and then serialize the object to a file, but I wonder whether there is a much cheaper way like the one described above.
Thanks in advance and, again, sorry for the possibly confusing explanation.
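One possible way to get all three operations without scanning for delimiters is to prefix each block with its byte length. The sketch below is only an illustration under assumptions: it stores a binary file rather than the plain text file shown above, and the class `BlockFile` and its method names are hypothetical, not part of any library.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: each block is stored as a 4-byte length followed by UTF-8 bytes.
final class BlockFile {
    private final String path;

    BlockFile(String path) {
        this.path = path;
    }

    // Appends one block after the last one in the file.
    void append(String block) throws IOException {
        byte[] bytes = block.getBytes(StandardCharsets.UTF_8);
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path, true))) {
            out.writeInt(bytes.length);
            out.write(bytes);
        }
    }

    // Reads every block in order, from the start of the file to its end.
    List<String> readAll() throws IOException {
        List<String> blocks = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            while (true) {
                int length;
                try {
                    length = in.readInt();
                } catch (EOFException endOfFile) {
                    break; // no further length prefix: all blocks have been read
                }
                byte[] bytes = new byte[length];
                in.readFully(bytes);
                blocks.add(new String(bytes, StandardCharsets.UTF_8));
            }
        }
        return blocks;
    }

    // Reads the block at a zero-based index by skipping over the preceding blocks.
    String read(int index) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            for (int i = 0; i < index; i++) {
                int length = file.readInt();
                file.seek(file.getFilePointer() + length);
            }
            byte[] bytes = new byte[file.readInt()];
            file.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }
}
```

For example, after appending three blocks, `read(2)` returns the third one. Random access here still walks the length prefixes of the preceding blocks; keeping a separate index of offsets would avoid that at the cost of extra bookkeeping.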
I was looking back at some older class assignments, and for one of them the user had to provide a text file that would be encoded according to an encryption key the user also gave. I essentially solved the problem by placing the content of the text file in a string, retrieving each letter from the string and encrypting it, and then printing the encrypted character back into the same text file. The problem is that my professor docked 5% for storing the whole file content in a string, writing something like: "What if the file contents were very large?" I even recall the few people I talked to after the project was graded saying they lost points for the same reason.
At the time I thought he made sense, and I was too overburdened by my workload to see whether I could fix it, since the criticism seemed reasonable and the fix simple enough. However, now I can't understand how one would edit the text file directly, or write to the same file, without storing the entire content in a string (because one would otherwise lose its content). How would someone even go about this? Thank you!
Edit: whoever marked my thread as a duplicate of the one above clearly did not understand my question. I am asking how to manipulate the same file without using an absurd amount of memory, as the solution I described would. The other thread clearly asks what the quickest way to read from a file is, which is not at all the same thing. Joop had the right idea of what I meant, so I'll try just that. Thank you, Joop.
I'm a decent C++ programmer, good enough to do what I want. But I'm working on my first Android App (obviously not C++ related), and I'm having an issue where I'd like to translate what I know from C++ over to the XML/Java used in Android Studio.
Basically I have (in C++) an array of structures. Maybe I didn't search perfectly, but I sure as heck looked around for the answer and didn't come up with anything.
How would I go about placing an array of structures inside the XML file and utilizing it in Java?
As a bit of a buffer, let me say that I'm not really looking for code, just verification that this is possible, and a method on how to go about it. I don't mind researching to learn what I want, but I haven't come up with anything. Like I said, I probably haven't googled it properly because I'm unsure of exactly how to ask it.
EDIT: So it appears that XML doesn't have a structure (or anything similar? not sure). But I can utilize a Java class with public variables. Now my question is more or less: What would be the best way to go about inserting all the information into the array/class/variables?
In C++ terms, I could neatly place all the info into a text file and then read from it, using a for loop to place all the info in the structures. Or, if I don't want to use an outside source/file, I could hardcode the information into each variable. Tedious, but it'd work. I'm not sure whether, in Android terms, I could use the same method: pack a text file with the app and read from it using a for loop to insert the information into the array/class/variables.
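// Plain data holder for one answer (a Java stand-in for a C++ struct).
class answerStruct
{
    public String a;
    public boolean status;
}
// One question together with its four possible answers.
class questionStruct
{
    public String q;
    // Allocates four slots only; each element is null until assigned.
    answerStruct[] answer = new answerStruct[4];
}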
I'm not placing this here to brag about my super high-tech program, but to give a visual, and frankly it saves me writing it all out. This is the method I plan on going with. But, this being Java, I'm open to possibly better options. My question still stands as far as getting information into the variables. Hard-code it? Or does Android/Java allow me to ship a text file with my app and read from it into the variables?
XML is just a markup language for tree-structured data, and imposes no restrictions on how you name or structure your tree nodes.
What I think you're looking for is an XML object serialiser: a way to serialise your in-memory structure into XML for more permanent storage and then, on a later run, deserialise it back into memory. There are many XML serialisers for Java, each with its own XML format.
I've used Simple XML in the past, and found it easy and flexible.
I need to parse and validate a file whose format is a little bit tricky.
Basically the file comes in this format:
\n -- just to make clear it may have empty lines
CLIENT_ID
A_NUMERIC_VALUE
ONE_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT
ANOTHER_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT
\n
\n
CLIENT_ID_2
A_NUMERIC_VALUE_2
ONE_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT_2
ANOTHER_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT_2
OHH_THIS_ONE_HAS_THREE_LINES_OF_COMMENTS
The file will rarely be big (10 MB is probably the biggest file I've ever seen; usually they are around 900 KB to 1 MB).
So I have two problems:
1) How can I effectively validate the format of the file? Using regex + scanner? (I see this as a very feasible option if I can transform each client entry into a single string, so that I can apply the regex to it.)
2) I need to transform each of the entries in the file into Client objects. Should I validate the whole file before transforming it into Java objects? Or should I validate the file as I transform its entries into Java objects? (Bear in mind that if any client entry is invalid, the processing halts immediately and an exception is thrown, hence any object that was created will be discarded.)
I'm really keen to see your suggestions about question #1. Question #2 is more a curiosity of mine on how you would handle this situation. Ignore #2 if you will, but please answer #1 =)
By the way, does anyone know of a framework that would help me handle the file?
Thanks.
Update:
I saw this question and the problem is very similar to mine, but I'm not sure whether regex is the best way to solve it. There might be quite a lot of "\n" throughout the file, a varying number of comments for each client entry, and an optional ID, hence the regex would have to be quite complex. That's why I mentioned transforming each entry into one row in question #1: it would then be much easier to create a regex to validate it. Nevertheless, this solution does not sound very elegant to my ears :(
Cheers.
If you intend to fail the batch if any part is found invalid, then validate the file first.
There are several advantages. One is that validation and processing need not be synchronous. If, for example, you process batches daily but receive files throughout the day, you can validate them as they arrive and send notifications so that problems are corrected before your scheduled processing. Another is that checking whether a file is well-formed is very fast.
A short, simple Perl script would certainly do the job. There is no need to transform the data, if I understand the pattern correctly, and it's all read-forward.
read past any newlines
read and validate a client id
read and validate a numeric value
read and validate one or more comments until a blank line is found
repeat the above four steps until EOF or invalid data detected
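The steps above could be sketched in Java (the language the question targets, rather than the Perl suggested here) roughly as follows. This is a minimal sketch under assumptions: the accepted formats for the client ID and the numeric value are guesses, since the question does not specify them, and `ClientFileValidator` is a hypothetical name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.regex.Pattern;

// Hypothetical read-forward validator following the four steps listed above.
final class ClientFileValidator {
    // Assumed formats; adjust both patterns to the real rules for the file.
    private static final Pattern CLIENT_ID = Pattern.compile("[A-Za-z0-9_]+");
    private static final Pattern NUMERIC = Pattern.compile("-?\\d+(\\.\\d+)?");

    // Returns the number of valid entries, or throws on the first invalid line.
    static int validate(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source);
        int entries = 0;
        int lineNo = 0;
        String line;
        while ((line = in.readLine()) != null) {
            lineNo++;
            // Step 1: read past any blank lines between entries.
            if (line.trim().isEmpty()) {
                continue;
            }
            // Step 2: read and validate a client id.
            if (!CLIENT_ID.matcher(line.trim()).matches()) {
                throw new IllegalArgumentException("line " + lineNo + ": invalid client id");
            }
            // Step 3: read and validate a numeric value.
            line = in.readLine();
            lineNo++;
            if (line == null || !NUMERIC.matcher(line.trim()).matches()) {
                throw new IllegalArgumentException("line " + lineNo + ": invalid numeric value");
            }
            // Step 4: read one or more comment lines until a blank line or EOF.
            int comments = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (line.trim().isEmpty()) {
                    break;
                }
                comments++;
            }
            if (comments == 0) {
                throw new IllegalArgumentException("line " + lineNo + ": entry has no comment");
            }
            entries++;
        }
        return entries;
    }
}
```

Passing a `FileReader` for the batch file to `validate` checks the whole file in a single forward pass and keeps only one line in memory at a time.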
The Problem:
I have numerous files that contain Apache web server log entries. Those entries are not in date time order and are scattered across the files. I am trying to use Pig to read a day's worth of files, group and order the log entries by date time, then write them to files named for the day and hour of the entries it contains.
Setup:
Once I have imported my files, I use a regex to extract the date field and then truncate it to the hour. This produces a set that has the record in one field and the date truncated to the hour in another. From here I group on the date-hour field.
First Attempt:
My first thought was to use the STORE command while iterating through my groups using a FOREACH and quickly found out that is not cool with Pig.
Second Attempt:
My second try was to use the MultiStorage() method in the piggybank, which worked great until I looked at the file. The problem is that MultiStorage wants to write all fields to the file, including the field I used to group on. What I really want is just the original record written to the file.
The Question:
So...am I using Pig for something it is not intended for, or is there a better way for me to approach this problem using Pig? Now that I have this question out there, I will work on a simple code example to further explain my problem. Once I have it, I will post it here. Thanks in advance.
Out of the box, Pig doesn't have a lot of functionality. It does the basic stuff, but more often than not I find myself having to write custom UDFs or load/store funcs to get from 95% of the way there to 100% of the way there. I usually find it worth it, since writing a small store function is a lot less Java than a whole MapReduce program.
Your second attempt is really close to what I would do. You should either copy/paste the source code for MultiStorage or use inheritance as a starting point. Then, modify the putNext method to strip out the group value, but still write to that file. Unfortunately, Tuple doesn't have a remove or delete method, so you'll have to rewrite the entire tuple. Or, if all you have is the original string, just pull that out and output that wrapped in a Tuple.
Some general documentation on writing Load/Store functions in case you need a bit more help: http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions
I am trying to download a file from a server in a user-specified number of parts (n). So there is a file of x bytes divided into n parts, with each part being downloaded at the same time. I am using threads to implement this, but I have not worked with HTTP before and do not really understand how downloading a file works. I have read up on it and it seems the "Range" header needs to be used, but I do not know how to download the different parts and append them without corrupting the data.
(Since it's a homework assignment I will only give you a hint)
Appending to a single file will not help you at all, since this will mess up the data. You have two alternatives:
Download from each thread to a separate temporary file and then merge the temporary files in the right order to create the final file. This is probably easier to conceive, but a rather ugly and inefficient approach.
Do not stick to the usual stream-style semantics - use random access (1, 2) to write data from each thread straight to the right location within the output file.
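To illustrate only the random-access part of the second alternative, here is a minimal sketch that leaves the HTTP side out entirely. It assumes an in-memory byte array standing in for the data each thread would receive, and `PartWriter` and `writeInParts` are hypothetical names.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: each thread writes its own slice at its own offset.
final class PartWriter {
    // Writes source to path using the given number of threads (parts must be > 0).
    static void writeInParts(byte[] source, String path, int parts)
            throws InterruptedException {
        int chunk = (source.length + parts - 1) / parts; // slice size, rounded up
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            final int offset = Math.min(i * chunk, source.length);
            final int length = Math.min(chunk, source.length - offset);
            Thread t = new Thread(() -> {
                // Each thread opens its own handle, so file positions are not shared.
                try (RandomAccessFile out = new RandomAccessFile(path, "rw")) {
                    out.seek(offset);                  // jump to this part's position
                    out.write(source, offset, length); // write only this part's bytes
                } catch (IOException e) {
                    // A real program would report this failure back to the caller.
                    throw new UncheckedIOException(e);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join(); // wait until every part has been written
        }
    }
}
```

Because every part lands at a fixed offset, the order in which the threads finish does not matter, which is what plain appending cannot guarantee.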