Validate a Tricky File Format Using Java - java

I need to parse and validate a file whose format is a little bit tricky.
Basically the file comes in this format:
\n -- just to make clear it may have empty lines
CLIENT_ID
A_NUMERIC_VALUE
ONE_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT
ANOTHER_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT
\n
\n
CLIENT_ID_2
A_NUMERIC_VALUE_2
ONE_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT_2
ANOTHER_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT_2
OHH_THIS_ONE_HAS_THREE_LINES_OF_COMMENTS
The file will seldom be big (10 MB is probably the biggest file I've ever seen; usually they are around 900 KB-1 MB).
So I have two problems:
1) How can I effectively validate the format of the file? Using regex + Scanner? (I see this as a very feasible option if I can transform each client entry into a single string, so I can apply the regex to it.)
2) I need to transform each of the entries in the file into Client objects. Should I validate the whole file before transforming it into Java objects? Or should I validate the file as I go on transforming its entry into Java objects? (Bear in mind that if any client entry is invalid, the processing halts immediately and an exception is thrown - hence any object that was created will be discarded).
I'm really keen to see your suggestions about question #1. Question #2 is more a curiosity of mine on how you would handle this situation. Ignore #2 if you will, but please answer #1 =)
By the way, does anyone know of any framework that could help me handle the file?
Thanks.
Update:
I saw this question and the problem is very similar to mine, but I'm not sure whether regex is the best way to solve it. There may be quite a lot of "\n" characters throughout the file, a varying number of comments for each client entry, and an optional ID, so the regex would have to be quite complex. That's why I mentioned transforming each entry into one row in question #1: that way it would be much easier to create a regex to validate it... nevertheless, this solution does not sound very elegant to my ears :(
Cheers.

If you intend to fail the batch if any part is found invalid, then validate the file first.
There are several advantages. One is that validation and processing need not be synchronous. If, for example, you process batches daily but receive files throughout the day, you can validate them as they arrive and notify someone to correct problems before your scheduled processing. Another is that validating whether a file is well-formed is very fast.
A short, simple Perl script would certainly do the job. No need to transform the data, if I understand the pattern correctly, and it's all read-forward (a Java sketch of the same loop follows the steps below):
read past any newlines
read and validate a client id
read and validate a numeric value
read and validate one or more comments until a blank line is found
repeat the above four steps until EOF or invalid data detected
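
A minimal Java sketch of that loop, assuming the client id matches \w+ and the numeric value is a simple decimal (both assumptions; adjust the patterns to the real format):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ClientFileValidator {

    public static void validate(String path) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line = in.readLine();
            while (line != null) {
                // 1. read past any blank lines between entries
                while (line != null && line.trim().isEmpty()) {
                    line = in.readLine();
                }
                if (line == null) {
                    break; // trailing blank lines before EOF are fine
                }
                // 2. read and validate a client id (pattern is an assumption)
                if (!line.matches("\\w+")) {
                    throw new IOException("bad client id: " + line);
                }
                // 3. read and validate a numeric value (pattern is an assumption)
                line = in.readLine();
                if (line == null || !line.matches("\\d+(\\.\\d+)?")) {
                    throw new IOException("bad numeric value: " + line);
                }
                // 4. read one or more comment lines until a blank line or EOF
                line = in.readLine();
                if (line == null || line.trim().isEmpty()) {
                    throw new IOException("expected at least one comment line");
                }
                while (line != null && !line.trim().isEmpty()) {
                    line = in.readLine();
                }
            }
        }
    }
}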

Related

Is it possible to remove tags (or sequences) and relate or remember them as indexes?

I'm working with HTML tags, and I need to interpret HTML documents. Here's what I need to achieve:
I have to recognize and remove HTML tags without removing the original content.
I have to store the indexes of the previously existing markup.
So here's an example. Imagine that I have the following markup:
This <strong>is a</strong> message.
In this example, we have a string of 35 characters, marked up with a strong tag. As we know, HTML markup has a start and an end, and if we interpret the start and end tags as sequences of characters, each also has a start and an end (a character index).
Again, in the previous example, the beginning index of the open/start tag is 5 (counting from index 0), and the end index is 13. The same logic applies to the close tag.
Now, once we remove the markup, we end up with the following:
This is a message.
The question:
How can I remember, for this sequence, the places where I should insert the markup again?
For example, once the markup has been removed, how do I know that I have to insert the opening tag at position/index X, and the closing tag at position/index Y... like so:
This is a message.
     5   9
index 5 = <strong>
index 9 = </strong>
I must remember that it is possible to find the following situation:
<a>T<b attribute="value">h<c>i<d>s</a> <g>i<h>s</h></g> </b>a</c> <e>t</e>e<f>s</d>t</f>.
I need to implement this in Java. I've figured out how to get the start and end index of each tag in a document. For this, I'm using regular expressions (Pattern and Matcher), but I still do not know how to insert the tags again properly (as described). I would like a working example (if possible). It does not have to be the best solution in the world; it only has to work correctly in any situation.
If anyone has not understood my question, please comment and I will try to explain it better.
Thanks in advance.
EDIT
People in the comments are saying that I should not use regular expressions to work with HTML. I do not insist on using regular expressions to solve this problem; I just want to solve it, no matter how (but, of course, in the most appropriate way).
I mentioned that I'm using regular expressions, but I do not mind using another approach that reaches the same solution. I read that an XML parser could be the solution. Is that correct? Is there an XML parser capable of doing everything I need?
Again, Thanks in advance.
EDIT 2
I'm making this edit to explain the applicability of my problem (as asked). Before I start, I want to say that what I'm trying to do is something I've never done before; it's not my area, so this may not be the most appropriate way to do it. Anyway...
I'm developing a site where users are allowed to read content but cannot edit it (edit or remove text). However, users can still mark/highlight excerpts (ranges) of the content (with some stylization). That is the summary.
Now the problem is how to do this (in Java). On the client side, for now, I was thinking of using TinyMCE to enable styling of content without text editing. I could save the stylized text to a database, but this would take up a lot of space, since every client is allowed to do this and there are many clients. So if a client marks snippets of a paragraph, saving the whole paragraph back to the database for each client in the system is somewhat costly in terms of storage.
So I thought of just saving the ranges (indexes) of the markups made by users in a database. It is much easier to save a few numbers than all the text with the styling required. In that case, for example, I could save a row/record in a table that says:
In paragraph X, from index Y to index Z, user P applied stylization ABC.
This would require a translation/conversion from database to HTML and from HTML to database. Writing a converter may be easy (I guess), but I do not know how to get the indexes (following this logic). And then we are back at the beginning of my question.
Just to make it clear:
If someone offers a solution that will cost money, such as a paid API, tool, or something similar, unfortunately this option is not feasible for me. I'm sorry :/
Similarly, I know it would be ideal to do this processing in JavaScript (client-side). It turns out that I do not have a specialized JavaScript team, so this needs to be done on the server side (unfortunately), which is written in Java. I can only use a JavaScript solution if it is ready-made, easy, and quick to use. Do you know of any ready-made, easy-to-use library that can do this in a simple way? Does it exist?
You can't use a regular expression to parse HTML. See this question (which includes this rather epic answer as well as several other interesting answers) for more information, but HTML isn't a regular language because it has a recursive structure.
Any language that allows recursion isn't regular by definition, so you can't parse it with a regex.
Keep in mind that HTML is a context-free language (or, at least, pretty close to context-free). See also the Chomsky hierarchy.
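That said, an HTML parser can do what the question asks. Below is a minimal sketch using the Jsoup library (my choice, not something the question mandates): it walks the parsed tree, builds the plain text, and records the plain-text index where each tag would be re-inserted. Note that a real parser normalizes malformed, overlapping markup like the <a>/<b> example above, so invalid nesting cannot be round-tripped exactly, and reconstructing attributes would additionally need Element#attributes().

import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class TagOffsets {
    public static void main(String[] args) {
        String html = "This <strong>is a</strong> message.";
        StringBuilder plain = new StringBuilder();
        List<String> insertions = new ArrayList<>();

        Document doc = Jsoup.parseBodyFragment(html);
        NodeTraversor.traverse(new NodeVisitor() {
            @Override public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // accumulate the plain text as we pass each text node
                    plain.append(((TextNode) node).getWholeText());
                } else if (node instanceof Element && !"body".equals(((Element) node).tagName())) {
                    // record where the opening tag belongs in the plain text
                    insertions.add("index " + plain.length() + " = <" + ((Element) node).tagName() + ">");
                }
            }
            @Override public void tail(Node node, int depth) {
                if (node instanceof Element && !"body".equals(((Element) node).tagName())) {
                    // record where the closing tag belongs in the plain text
                    insertions.add("index " + plain.length() + " = </" + ((Element) node).tagName() + ">");
                }
            }
        }, doc.body());

        System.out.println(plain);           // This is a message.
        insertions.forEach(System.out::println);
        // index 5 = <strong>
        // index 9 = </strong>
    }
}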

Security of uploading and parsing Named Binary Tag files (NBT) via PHP

I'm building an application that deals with uploading/downloading Named Binary Tag (NBT) files.
After they're uploaded I need to parse them and get some information.
I'm a bit concerned security-wise, as I don't have the necessary knowledge to properly understand how they're built or what kind of data to expect from them.
What are some sanity checks that I can perform, when the files are uploaded, to make sure that they are indeed NBT files?
Should I be concerned when parsing them?
If there's anything else I should be concerned with, please, do tell.
I realize these are vague questions. There aren't a lot of answers on Google, else I wouldn't be here.
The file format for NBT is really simple and compact. It's a binary stream (uncompressed or gzipped), which was specified by Notch.
One "problem" comes with specially crafted NBT files, which contain a lot of empty lists and lists of lists... the memory overhead of parsing these may result in service failure (mostly because the objects created for each entry simply fill your memory).
One solution could be to limit the number of entries you read and, when that limit is reached, simply drop the parsed file.
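As a starting point, here is a hedged sketch of the cheap up-front checks (in Java; the same logic ports to PHP). It relies on two assumptions I'm fairly confident of: gzipped NBT starts with the gzip magic bytes 0x1f 0x8b, and the root tag of the raw stream is TAG_Compound (0x0a). A full parse should additionally count tags and abort once your entry limit is exceeded.

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.zip.GZIPInputStream;

public class NbtSanityCheck {

    /** Returns a stream positioned at the raw NBT bytes, un-gzipping if needed. */
    static InputStream openNbt(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 2);
        int b1 = pb.read();
        int b2 = pb.read();
        if (b1 < 0 || b2 < 0) {
            throw new IOException("file too short to be NBT");
        }
        pb.unread(b2);
        pb.unread(b1);
        // gzip magic is 0x1f 0x8b
        return (b1 == 0x1f && b2 == 0x8b) ? new GZIPInputStream(pb) : pb;
    }

    /** Cheap structural check: a valid NBT file starts with TAG_Compound (0x0a). */
    static boolean looksLikeNbt(InputStream upload) throws IOException {
        return openNbt(upload).read() == 0x0a;
    }
}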
I recently published a java-library for reading nbt-files (though without such a limit); maybe it helps you understand the file format.
Edit: I forgot to share this article about the "exploit": http://arstechnica.com/security/2015/04/just-released-minecraft-exploit-makes-it-easy-to-crash-game-servers/

What to do when a huge XML document is not well formed (Java)

I am using the Java SAX parser to parse XML data sent from a third-party source that is around 3 GB. I am getting an error resulting from the XML document not being well formed: The processing instruction target matching "[xX][mM][lL]" is not allowed.
As far as I understand, this is normally due to a character being somewhere it should not be.
Main problem: Cannot manually edit these files due to their very large size.
I was wondering if there is a workaround for files that are too large to be opened and edited manually, and if there is a way to code it so that any problematic characters are removed automatically.
I would think the most likely explanation is that the file contains a concatenation of several XML documents, or perhaps an embedded XML document: either way, an XML declaration that isn't at the start of the file.
A lot now depends on your relationship with the supplier of the bad data. If they sent you faulty equipment or buggy software, you would presumably complain and ask them to fix it. But if you don't have a service relationship with the third party, you either have to change supplier or do the best you can with the faulty input, which means repairing the fault yourself. In general, you can't repair faulty XML unless you know what kind of fault you are looking for, and that can be very difficult to determine if the files are large (or if the failures are very rare).
The data isn't XML, so don't try to use XML tools to process it. Use text processing tools such as sed or awk. The first step is to search the file for occurrences of <?xml and see if that gives any hints.
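For example, a small read-forward scanner along these lines (a sketch; it assumes the declarations are plain ASCII) will report the byte offset of every occurrence of <?xml without loading the 3 GB file into memory:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class FindXmlDecls {
    public static void main(String[] args) throws IOException {
        byte[] needle = "<?xml".getBytes(StandardCharsets.US_ASCII);
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            long offset = 0; // bytes consumed so far
            int matched = 0; // how many needle bytes matched in a row
            int b;
            while ((b = in.read()) != -1) {
                if (b == needle[matched]) {
                    if (++matched == needle.length) {
                        System.out.println("'<?xml' at byte offset " + (offset - needle.length + 1));
                        matched = 0;
                    }
                } else {
                    // safe reset: '<' only occurs at position 0 of the needle
                    matched = (b == needle[0]) ? 1 : 0;
                }
                offset++;
            }
        }
    }
}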
This error occurs if the XML declaration is anywhere but at the beginning of the document. The reason might be:
Whitespace before the XML declaration
Any hidden character before the XML declaration
The XML declaration appears anywhere else in the document
You should start by checking case #2; see here: http://www.w3.org/International/questions/qa-byte-order-mark#remove
If that doesn't help, you should remove leading whitespace from the document. You could do that by wrapping the original InputStream in another InputStream and using that to remove the whitespace (a sketch follows below).
The same can be done if you are facing case #3, but the implementation would be a bit more complex.
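A minimal sketch of such a wrapper, assuming UTF-8 input; it skips a BOM and leading whitespace before the first real byte (cases #1 and #2), while case #3 would need actual scanning:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class LeadingJunkSkippingInputStream extends FilterInputStream {
    private boolean started = false;

    public LeadingJunkSkippingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (started) {
            return in.read();
        }
        started = true;
        int b = in.read();
        if (b == 0xEF) { // skip a UTF-8 BOM (EF BB BF); assumes it is well formed
            in.read();   // 0xBB
            in.read();   // 0xBF
            b = in.read();
        }
        while (b == ' ' || b == '\t' || b == '\r' || b == '\n') {
            b = in.read(); // drop leading whitespace
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (started) {
            return in.read(buf, off, len);
        }
        if (len == 0) {
            return 0;
        }
        int first = read(); // performs the skip above
        if (first == -1) {
            return -1;
        }
        buf[off] = (byte) first;
        int rest = in.read(buf, off + 1, len - 1);
        return rest == -1 ? 1 : rest + 1;
    }
}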

Writing one file per group in Pig Latin

The Problem:
I have numerous files that contain Apache web server log entries. Those entries are not in datetime order and are scattered across the files. I am trying to use Pig to read a day's worth of files, group and order the log entries by datetime, then write them to files named for the day and hour of the entries they contain.
Setup:
Once I have imported my files, I am using a regex to get the date field, then I am truncating it to the hour. This produces a set that has the record in one field and the date truncated to the hour in another. From here I am grouping on the date-hour field.
First Attempt:
My first thought was to use the STORE command while iterating through my groups using a FOREACH, and I quickly found out that Pig does not allow that.
Second Attempt:
My second try was to use the MultiStorage() method in the piggybank, which worked great until I looked at the file. The problem is that MultiStorage wants to write all fields to the file, including the field I used to group on. What I really want is just the original record written to the file.
The Question:
So...am I using Pig for something it is not intended for, or is there a better way for me to approach this problem using Pig? Now that I have this question out there, I will work on a simple code example to further explain my problem. Once I have it, I will post it here. Thanks in advance.
Out of the box, Pig doesn't have a lot of functionality. It does the basic stuff, but more often than not I find myself having to write custom UDFs or load/store funcs to get from 95% of the way there to 100%. I usually find it worth it, since just writing a small store function is a lot less Java than a whole MapReduce program.
Your second attempt is really close to what I would do. You should either copy/paste the source code for MultiStorage or use inheritance as a starting point. Then, modify the putNext method to strip out the group value but still write to that file. Unfortunately, Tuple doesn't have a remove or delete method, so you'll have to rebuild the entire tuple (a sketch of that rebuild follows). Or, if all you have is the original string, just pull that out and output that wrapped in a Tuple.
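Since Tuple has no remove, the rebuild inside your modified putNext could look like this hypothetical helper (dropField is my name, not a Pig API); note the routing on the split field still has to happen before the group value is dropped:

import java.io.IOException;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class TupleUtil {

    /** Rebuilds a tuple without the field at dropIndex, e.g. the group field. */
    public static Tuple dropField(Tuple t, int dropIndex) throws IOException {
        Tuple out = TupleFactory.getInstance().newTuple(t.size() - 1);
        int j = 0;
        for (int i = 0; i < t.size(); i++) {
            if (i != dropIndex) {
                out.set(j++, t.get(i));
            }
        }
        return out;
    }
}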
Some general documentation on writing Load/Store functions in case you need a bit more help: http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions

JAVA: gathering byte offsets of xml tags using an XmlStreamReader

Is there a way to accurately gather the byte offsets of xml tags using the XMLStreamReader?
I have a large xml file that I require random access to. Rather than writing the whole thing to a database, I would like to run through it once with an XMLStreamReader to gather the byte offsets of significant tags, and then be able to use a RandomAccessFile to retrieve the tag content later.
XMLStreamReader doesn't seem to have a way to track character offsets. Instead, people recommend attaching the XMLStreamReader to a stream that tracks how many bytes have been read (the CountingInputStream provided by Apache commons-io, for example), e.g.:
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.io.input.CountingInputStream;

CountingInputStream countingReader =
        new CountingInputStream(new FileInputStream(xmlFile));
XMLStreamReader xmlStreamReader =
        XMLInputFactory.newInstance().createXMLStreamReader(countingReader, "UTF-8");
while (xmlStreamReader.hasNext()) {
    int eventCode = xmlStreamReader.next();
    switch (eventCode) {
        case XMLStreamReader.END_ELEMENT:
            System.out.println(xmlStreamReader.getLocalName()
                    + " #" + countingReader.getByteCount());
    }
}
xmlStreamReader.close();
Unfortunately there must be some buffering going on, because the above code prints out the same byte offsets for several tags. Is there a more accurate way of tracking byte offsets in xml files (ideally without resorting to abandoning proper xml parsing)?
You could use getLocation() on the XMLStreamReader (or XMLEvent.getLocation() if you use XMLEventReader), but I remember reading somewhere that it is not reliable or precise. And it looks like it gives the endpoint of the tag, not the starting location.
I have a similar need to precisely know the location of tags within a file, and I'm looking at other parsers to see if there is one that guarantees to give the necessary level of location precision.
You could use a wrapper input stream around the actual input stream, simply deferring to the wrapped stream for actual I/O operations but keeping an internal counting mechanism with assorted code to retrieve current offset?
Unfortunately, Aalto doesn't implement the LocationInfo interface.
The latest Java VTD-XML implementation from XimpleWare, currently 2.11 (on SourceForge or on GitHub), provides some code maintaining a byte offset after each call to the getChar() method of its IReader implementations. IReader implementations for various character encodings are available inside VTDGen.java and VTDGenHuge.java. IReader implementations are provided for the following encodings:
ASCII
ISO_8859_1
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
UTF_16BE
UTF_16LE
UTF8
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258
Updating IReader with a getCharOffset() method, implementing it by adding a charCount member alongside the offset member of the VTDGen and VTDGenHuge classes, and incrementing it upon each getChar() and skipChar() call of each IReader implementation should give you the start of a solution.
I think I've found another option. If you replace your switch block with the following, it will dump the position immediately after the end element tag.
switch (eventCode) {
    case XMLStreamReader.END_ELEMENT:
        System.out.println(xmlStreamReader.getLocalName() + " end#"
                + xmlStreamReader.getLocation().getCharacterOffset());
}
This solution also requires that the actual start position of the end tags be calculated manually, but it has the advantage of not needing an external JAR file.
I was not able to track down the cause of some minor inconsistencies in the reported offsets (I think it has to do with how I initialized my XMLStreamReader), but I always saw the location increase consistently as the reader moved through the content.
Hope this helps!
I recently worked out a solution for a similar question, How to find character offsets in big XML files using java?. I think it provides a good solution based on an ANTLR-generated XML parser.
I just burned a day-long weekend on this, and arrived at the solution partially thanks to some clues here. Remarkably, I don't think this has gotten much easier in the 10 years since the OP posted this question.
TL;DR Use Woodstox and char offsets
The first problem to contend with is that most XMLStreamReader implementations seem to provide inaccurate results when you ask them for their current offsets. Woodstox however seems to be rock-solid in this regard.
The second problem is the actual type of offset you use. Unfortunately, it seems that you have to use char offsets if you need to work with a multi-byte charset, which means the random-access retrieval from the file is not going to be very efficient - you can't just set a pointer into the file at your offset and start reading; you have to read through until you get to the offset, then start extracting. There may be a more efficient way to do this that I haven't thought of, but the performance is acceptable for my case. 500MB files are pretty snappy.
[edit] So this turned into one of those splinter-in-my-mind things, and I ended up writing a FilterReader that keeps a buffer of byte offset to char offset mappings as the file is read. When we need to get the byte offset, we first ask Woodstox for the char offset, then get the custom reader to tell us the actual byte offset for the char offset. We can get the byte offset from the beginning and end of the element, giving us what we need to go in and surgically extract the element from the file by opening it as a RandomAccessFile.
I created a library for this, it's on GitHub and Maven Central. If you just want to get the important bits, the party trick is in the ByteTrackingReader.
[/edit]
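For illustration only (this is my own sketch of the idea, not the author's ByteTrackingReader): a FilterReader that records, for every char offset, the byte offset at which that char starts, assuming UTF-8 input. A real implementation would bound the mapping's memory use and handle skip()/mark(); here each surrogate half counts as 2 bytes, so a 4-byte code point sums correctly across its pair.

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class ByteTrackingReaderSketch extends FilterReader {
    private final List<Long> charToByte = new ArrayList<>(); // index = char offset
    private long byteOffset = 0;

    public ByteTrackingReaderSketch(Reader in) {
        super(in);
    }

    private static int utf8Bytes(char c) {
        if (c < 0x80) return 1;
        if (c < 0x800) return 2;
        if (Character.isSurrogate(c)) return 2; // half of a 4-byte pair
        return 3;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        for (int i = 0; i < n; i++) {
            charToByte.add(byteOffset); // byte offset where this char starts
            byteOffset += utf8Bytes(cbuf[off + i]);
        }
        return n;
    }

    @Override
    public int read() throws IOException {
        char[] one = new char[1];
        return read(one, 0, 1) == -1 ? -1 : one[0];
    }

    /** Byte offset corresponding to a char offset reported by the parser. */
    public long byteOffsetOf(int charOffset) {
        return charToByte.get(charOffset);
    }
}

You would wrap your InputStreamReader in this before handing it to Woodstox, ask the parser for getLocation().getCharacterOffset(), and translate with byteOffsetOf().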
There is another similar question on SO about this (the accepted answer frightened and confused me), and some people commented that this whole thing is a bad idea and asked why you would want to do it: XML is a transport mechanism; you should just import it into a DB and work with the data using more appropriate tools. For most cases this is true, but if you're building applications or integrations that communicate via XML (still going strong in 2020), you need tooling to analyze and operate on the files that are exchanged. I get daily requests to verify feed contents, and having the ability to quickly extract a specific set of items from a massive file and verify not only the contents but the format itself is essential.
Anyhow, hopefully this can save someone a few hours, or at least get them closer to a solution. God help you if you're finding this in 2030, trying to solve the same problem.
