I am working with some legacy code where I use a column value and its earlier history to make comparisons and highlight the differences. However, the column is stored in an arbitrary delimited fashion, and the comparison code is, well, very hard to understand.
My initial thought was to refactor the code, but then it occurred to me: why not fix the original source of the issue, which in this case is the data? I have some limitations around the table structure, so splitting this single column into multiple columns is not an option.
With that in mind, I wondered whether it's worth converting this arbitrary delimited data into a standardized format like JSON. My approach would be to export the data to a file, apply some regular expressions to convert it to JSON, and then re-import it.
I wanted to check with this group whether I am approaching this problem the right way and whether there are other ideas I should consider. Is JSON the right format in the first place? I'd like to know how you approached a similar problem. Thanks!
However, the column is stored in an arbitrary delimited fashion, and the comparison code is, well, very hard to understand.
So it seems the problem is the maintainability and evolvability of the code when adding or modifying functionality.
My initial thought was to refactor the code, but then it occurred to me: why not fix the original source of the issue, which in this case is the data?
My opinion is: why? I think your initial thought was right. You should refactor (i.e., "...restructuring an existing body of code, altering its internal structure without changing its external behavior") to gain what you want. If you change the data format, you could still end up with messy code sitting on top of the new, nicer data format.
In other words, all the combinations are possible because the two are not strictly related: you can have a nice data format with messy code or with fantastic code, and you can have an ugly data format with fantastic code or with messy code.
It seems you are in the last case, but moving from there to "nice data format + messy code" and then to "nice data format + fantastic code" is less straightforward than moving to "fantastic code + ugly data format" first and then, optionally, to "nice data format + fantastic code".
So my advice is to approach your goal in a straightforward way:
1) Write tests (if you haven't already) for the current functionality and verify them against the current code; this is the first step of any refactoring (see the sketch after this list).
2) Change the code, driven by the tests written in step 1.
3) If you still want to, change the data format, again guided by your tests.
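For illustration, here is a minimal characterization-test sketch in JUnit 4. The names (LegacyComparator, diff) are hypothetical stand-ins for your comparison code; the point is to pin down whatever the legacy code currently returns for real samples of the delimited column data before touching it.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class LegacyComparatorTest {
    @Test
    public void pinsDownCurrentBehaviourForDelimitedValues() {
        // Hypothetical names: feed the legacy comparison code real samples of
        // the delimited column value and its history, and assert whatever it
        // returns today. These tests guard the refactoring, not correctness.
        String current  = "a|b|c";
        String previous = "a|x|c";
        assertEquals("b<->x", LegacyComparator.diff(current, previous));
    }
}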
Regards
I'm trying to solve the following problem: I have a text file where the columns are separated by ",", and I need to be able to search by column. Example of the data in the file:
I will parse this file by columns and place all the data in another data structure (or structures). My question is: what's the best data structure to use in a case like this, and what's the best algorithm for searching it? I also need to count all matching entries. For example, if I choose the last column and type "4" to search, it should show the last two lines and count 2 entries. I was thinking of something like a list, but the file is pretty big and the search would take too long, and I need a solution that doesn't depend much on the data length. I was also thinking about a binary search tree, but I'm not quite sure how to use it here.
This is kind of a learning task, so I don't need just a solution (like grep), because I'm trying to implement all of this in Java. I thought maybe this problem has a common solution that experienced programmers know, or maybe I need to figure it out myself. I'm not asking for the solution or the code, just a hint about which data structure/algorithm is better in a situation like this; some keywords to look up.
Honestly, a database. If that's not an option, it really depends on how you're going to query it. Generally a tree is good for simple things like lookups and comparisons, but you'll want a balanced tree such as an AVL or red/black tree, and you'll need one tree per column you want to index. That's basically how low-complexity databases do things.
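If a database really is off the table, a minimal sketch of the tree-per-column idea in Java could look like the following; TreeMap is a red/black tree under the hood, so lookups are O(log n) regardless of file length. The names here (ColumnIndex, add, search) are made up for illustration.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

public class ColumnIndex {
    // One sorted index per searchable column: value -> line numbers holding it.
    private final TreeMap<String, List<Integer>> index = new TreeMap<>();

    public void add(String value, int lineNo) {
        index.computeIfAbsent(value, k -> new ArrayList<>()).add(lineNo);
    }

    // Matching line numbers; the count the question asks for is just size().
    public List<Integer> search(String value) {
        return index.getOrDefault(value, Collections.emptyList());
    }
}

Building it is one pass over the file: split each line on ",", then call add(fields[col], lineNo) on the index of each column you want searchable.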
I am using Hibernate to store and retrieve data from a MySQL database. I was using a byte array but came across the SerialBlob class. I can use the class successfully, but I can't seem to find any difference between using a SerialBlob and a byte array. Does anyone know the basic differences, or situations where you would want to use a SerialBlob in lieu of a byte[]?
You are right that the SerialBlob is just a thin abstraction around a byte[], but:
Are you working in a team?
Do you sometimes make mistakes?
Are you lazy with writing comments?
Do you sometimes forget what your code from a year ago actually does?
If you answered any of the above questions with a yes, you should probably use SerialBlob.
It's basically the same with any other abstraction around a simple data structure (think ByteBuffer, for example) or another class. You want to use it over byte[], because:
It's more descriptive. A byte[] could be some sort of cache, it could be a circular buffer, it could be some sort of integrity checking mechanism gone wrong. But if you use SerialBlob, it's obvious that this is just a blob of binary data from the database / to be stored in the database.
Instead of manual array handling, you use methods on the class, which is, again, easier to read if you don't know the code. Even trivial array manipulation has to be comprehended by the reader of your code, while a method with a good name is self-descriptive (see the sketch after these points).
This is helpful for your teammates, and also for you when you read this code in a year.
It's more error-proof. Every time you write any new code, there's a good chance you've made a bug in it. It may not be visible at first, but it is probably there. SerialBlob's code has been tested by thousands of people around the world, and it's safe to say you won't run into any bugs caused by it.
Even if you're sure you got your byte array handling right, because it's so straightforward, what if somebody else finds your code in half a year and starts "optimizing" things? What if they reuse an old blob, or mess with your magic array padding? Every single off-by-one error in index manipulation will corrupt your data, and that might not be detected right away (you are writing unit tests, aren't you?).
It restricts you to only a handful of possible interactions. This might actually look like a demerit, but it's not! It ensures you won't be using your blob as a local temporary variable after you're done with it. It ensures you won't try to make a String out of it or anything silly. It makes sure you'll only use it as a blob. Again, clarity and safety.
It's already written and always looks the same. You don't have to write a new implementation for every project, or read ten different implementations in ten different projects. If you ever see a SerialBlob in anyone's project, the usage will be clear to you. Everyone uses the same one.
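As a small illustration of the point about named methods, here is a minimal sketch of SerialBlob usage; it only shows the standard javax.sql.rowset.serial API, not any Hibernate mapping:

import javax.sql.rowset.serial.SerialBlob;
import java.sql.SQLException;

public class BlobDemo {
    public static void main(String[] args) throws SQLException {
        byte[] raw = {1, 2, 3, 4};
        SerialBlob blob = new SerialBlob(raw);   // wraps a copy of the array
        long size = blob.length();               // intent is obvious from the name
        byte[] firstTwo = blob.getBytes(1, 2);   // note: positions are 1-based
        System.out.println(size + " bytes, first = " + firstTwo[0]);
    }
}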
TL;DR: A few years ago (or maybe still in C), using a byte[] would be OK. In Java (and OOP in general), try to use a specific class designed for the job instead of a primitive (low-level) structure, as it more clearly describes your intent, produces fewer errors, and reduces the length of your code in the long run.
I'm learning Java and find myself sending methods around while asking for help, but my problem is that I have many methods and the data is modified by each one. I often have to send large files when only one area is relevant (it makes my SO questions excessively long as well).
But for some of the stuff I do, I can't get the data output as a string that I can feed back in later. For example, if I add data to a list of Points (like this: new Point(0, 0)), then when I print the results I get something like this (with sample data):
[java.awt.Point[x=970,y=10], java.awt.Point[x=65,y=10], java.awt.Point[x=729,y=10]]
I get errors when I assign this to a variable and pass it to the method I want to test/show. I basically have two goals:
If I want help on a single method (that's part of a much larger class), I want to be able to send the least amount of code to the person helping me (ideally just the method itself and its inputs, which I'm unable to capture exactly right now).
When I test my code, I would like a way to isolate a method so I don't have to run a large file when all I care about is improving one method.
I am pretty sure I'm not the first person to come across this problem. How can I approach it?
UPDATE: Here's an example:
double[] data = new double[] {.05, .02, -.03, .04, .01};
System.out.println(data); //output is: [D@13fcf0ce
If I make a new variable of this output and send it to a method, I get errors. I have 30 methods in a class and want a friend to help me with one; I'm trying to avoid sending the 29 methods that are irrelevant to them. So I need a way to capture the data, but the printout doesn't capture it in a form I can pass to methods.
Java outputs variables in a way that is human-readable (although it depends on the object's toString method). The output of toString is (unsurprisingly) a String. Unless you have a parsing mechanism to turn a string back into the original object, it's a one-way operation.
There should be no need to turn it back into the original object, however. If you're trying to isolate a function and its sample data, the easiest thing to do is encapsulate both in a test; there are many different ways to do this and to communicate it to someone else.
I'm still unclear on your use case, however. If it's an SO question, all you should need to do is show the code in question, provide a minimal amount of data that demonstrates the problem, and you're done. This could be done in a self-contained example where you simply create the data in code, as a unit test, or by showing the string output as you've already done.
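To make that concrete, a minimal self-contained sketch might look like this; sum is a made-up stand-in for whichever of your 30 methods you want help with, and Arrays.toString fixes the unreadable [D@13fcf0ce output:

import java.util.Arrays;

public class MethodUnderTestDemo {
    // Stand-in for the one method you want reviewed, copied out of the big class.
    static double sum(double[] data) {
        double total = 0;
        for (double d : data) total += d;
        return total;
    }

    public static void main(String[] args) {
        // Recreate the sample data in code instead of pasting toString output.
        double[] data = {.05, .02, -.03, .04, .01};
        System.out.println(Arrays.toString(data) + " -> " + sum(data));
    }
}

That one file, plus the sample data, is all the helper needs in order to run and improve the method.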
If you're trying to communicate the issue to a tech support tier, then the best mechanism depends entirely on what they're equipped to handle; they'll tell you if you didn't do it right, believe me.
You can use debuggers and step through your code. You can 'watch' variables to see their actual values rather than their toString representations. Debuggers are part and parcel of all the major IDEs, such as Eclipse, NetBeans and IntelliJ.
As to your questions about isolation and testing, this is much more of a design problem. Ideally your methods should be self-contained, reducing coupling. What you can do is learn to break your problem down into smaller chunks until it can't be broken down further. Once you do this, you start building methods that tackle each part of the problem separately.
Once you have a method, you test it on its own (thus reducing the number of things that can go wrong, as opposed to testing tons of code at once). If you are satisfied, you integrate the method with your code and test again. If something then goes wrong, you will know that the newest module is the problem, since it's what broke the system.
I need to store easily parsable data in a file as an alternative to a database-backed solution (not up for debate). Since it's going to store lots of data, a lightweight syntax would be preferable. It does not necessarily need to be human-readable, but it should be parsable. Note that there will be multiple types of fields/columns, some of which might be used and some of which won't.
From my limited experience without a database, I see several options, all with issues:
CSV - I could technically do this, and it is very light. However, the parsing would be an issue, and it would suck if I wanted to add a column. Multi-language support is iffy, mostly people's own custom parsers.
XML - This is the perfect solution on many fronts except parsing and overhead. That's a lot of tags, it would generate a giant file, and parsing would be very resource-consuming. However, virtually every language supports XML.
JSON - This is the middle ground, but I don't really want to use it, as it has an awkward syntax and parsing is non-trivial. Language support is iffy.
So they all have disadvantages. What would be best when aiming for both language support AND a somewhat small file size?
How about SQLite? It would allow you to basically embed the "DB" in your application without requiring a separate DB backend.
Also, if you end up using a DB backend later, it should be fairly easy to switch over.
If that's not suitable, I'd suggest one of the DBM-like stores for key-value lookups, such as Berkeley DB or tdb.
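A minimal sketch of the embedded-SQLite route in Java, assuming the Xerial sqlite-jdbc driver is on the classpath (the table and file names here are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SqliteDemo {
    public static void main(String[] args) throws SQLException {
        // "jdbc:sqlite:data.db" creates/opens a single file; no server needed.
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:data.db");
             Statement s = c.createStatement()) {
            s.execute("CREATE TABLE IF NOT EXISTS rows (a TEXT, b INTEGER)");
            s.execute("INSERT INTO rows VALUES ('hi', 4)");
            try (ResultSet r = s.executeQuery("SELECT COUNT(*) FROM rows WHERE b = 4")) {
                r.next();
                System.out.println("matches: " + r.getLong(1));
            }
        }
    }
}

Switching to a full DB backend later is then mostly a change of connection string.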
If you're just using the basics of all these formats, all of the parsers are trivial. If CSV is an option, then for XML and JSON you're talking about blocks of name/value pairs, so there's not even a recursive structure involved. json.org has parsers for pretty much any language.
That said.
I don't see what the problem is with CSV. If people write bad parsers, too bad. If you're concerned about compatibility, adopt the default CSV dialect from Excel. Anyone who can't parse CSV from Excel isn't going to get far in this world. The weakest support you'll find in CSV is for embedded newlines and carriage returns; if your data doesn't have those, it's not a problem. The only other issue is embedded quotation marks, and those are escaped in CSV by doubling them. If you don't have those either, it's even more trivial.
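For reference, a tiny, purely illustrative sample in Excel's CSV dialect, showing the two escapes just mentioned (quotes are doubled, and newlines are legal inside a quoted field):

name,comment
Bob,"He said ""hello"" and left"
Sue,"line one
line two"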
As for "adding a column", you have that problem with all of these. If you add a column, you get to rewrite the entire file. I don't see this being a big issue either.
If space is your concern, CSV is the most compact, followed by JSON, followed by XML. None of the resulting files can be easily updated; they pretty much all need to be rewritten for any change in the data. CSV has the advantage that it's easily appended to, as there's no closing element (unlike JSON and XML).
JSON is probably your best bet (it's lightish, faster to parse, and self-descriptive, so you can add your new columns as time goes by). You've said parsable - do you mean using Java? There are JSON libraries for Java to take the pain out of most of the work. There are also various lightweight in-memory databases that can persist to a file (in case "not an option" means you don't want a big separate database).
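For example, with the json.org reference library for Java (org.json), one self-describing record per line is only a few calls; the field names here are invented:

import org.json.JSONObject;

public class JsonDemo {
    public static void main(String[] args) {
        JSONObject row = new JSONObject()
                .put("name", "Bob")
                .put("score", 4);          // new "columns" can simply appear later
        String line = row.toString();      // store one record per line
        JSONObject back = new JSONObject(line);
        System.out.println(back.getInt("score"));
    }
}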
If this is just for logging some data quickly to a file, I find tab-delimited files easier to parse than CSV, so if it's a flat text file you're looking for, I'd go with that, so long as you don't have tabs in the data, of course (see the sketch after these cases). If you have fixed-size columns, you could use fixed-length fields; that's even quicker because you can seek straight to a record.
If it's unstructured data that might need some analysis, I'd go for JSON.
If it's structured data and you envision ever doing any querying on it... I'd go with SQLite.
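A minimal sketch of the tab-delimited case in Java, in the spirit of the snippets above (the sample record is invented):

String line = "970\t10\tHome";            // one record per line, tab-separated
String[] fields = line.split("\t", -1);   // -1 keeps empty trailing fields
System.out.println(fields[2]);            // prints: Home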
When I needed a solution like this, I wrote up a simple representation of data prefixed with its length. For example, "Hi" is represented (in hex) as 02 48 69.
To form rows, just nest this operation (the first number is the number of fields, followed by the fields). For example, if field 0 contains "Hi" and field 1 contains "abc", the row is:
Num of fields   Field length   Data    Field length   Data
02              02             48 69   03             61 62 63
You can also use the first row for the column names.
(I have to say this is kind of a DB backend).
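A minimal Java sketch of writing one such row (the names are made up; note that a single length byte caps a field at 255 bytes, which is one of the trade-offs of this format):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LengthPrefixedDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        String[] row = {"Hi", "abc"};      // field 0 and field 1 from the example
        out.writeByte(row.length);         // number of fields first
        for (String field : row) {
            byte[] b = field.getBytes(StandardCharsets.UTF_8);
            out.writeByte(b.length);       // length prefix
            out.write(b);                  // then the data
        }
        out.flush();
        // buf now holds: 02 02 48 69 03 61 62 63, matching the row above
        System.out.println(buf.size() + " bytes written");
    }
}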
You can use CSV, and if you only ever add columns at the end, this is simple to handle: if a row has fewer columns than you expect, use default values for the "missing" fields.
If you want to be able to change the order or use of fields, you can add a heading row, i.e. the first row holds the names of the columns. This can be useful when you are trying to read the data back.
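A minimal sketch of reading such a file in Java (hypothetical names; the heading row drives the column mapping, and missing trailing fields fall back to a default):

import java.util.HashMap;
import java.util.Map;

public class HeaderCsvDemo {
    public static void main(String[] args) {
        String[] lines = {"num,type", "1234567890,Home"};  // stand-in for file contents
        String[] names = lines[0].split(",");              // heading row: column names
        String[] values = lines[1].split(",", -1);
        Map<String, String> row = new HashMap<>();
        for (int i = 0; i < names.length; i++) {
            // Fewer values than columns? Use a default for the "missing" fields.
            row.put(names[i], i < values.length ? values[i] : "");
        }
        System.out.println(row.get("type"));               // prints: Home
    }
}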
If you are forced to use a flat file, why not develop your own format? You should be able to tweak overhead and customize as much as you want (which is good if you are parsing lots of data).
Data entries will be either fixed or variable length; there are advantages to forcing some entries to a fixed length, but you will need a way of delimiting both. If you have different "types" of rows, write all the rows of each type in a chunk. Each chunk of rows gets a header: one header line to describe the type of the chunk, and another to describe the columns and their sizes. Decide how you will use the headers to describe each chunk.
e.g. (H is a header, C is the column descriptions, and D is a data entry):
H Phone Numbers
C num(10) type
D 1234567890 Home
D 2223334444 Cell
H Addresses
C house(5) street postal(6) province
D 1234_ "some street" N1G5K6 Ontario
I'd say that if you want to store rows and columns, you've got to use a DB. The reason is simple: modifying the structure with any approach other than an RDBMS will require significant effort, and you mentioned that you want to change the structure in the future.
I have a piece of XML that contains optional, non-enumerated elements, so schema validation does not catch invalid values. However, this XML is transformed into a different format after validation and then handed off to a system that tries to store the information in a database. At that point, some of the values that were optional in the previous format are coded values in the database that will throw a foreign key constraint exception if we try to store them. So, I need to build a process in a J2EE app that checks the values at a set of XPaths against the set of values allowable at those spots, and if they are not valid, either removes them, replaces them, or removes them and their parents, depending on the schema restrictions.
I have a couple of options that will work, but neither of them seems like a very elegant or intuitive solution.
Option #1 would involve doing the work in XSLT 1.0: before sending the XML through the stylesheet, query up the acceptable values and pass the lists in as parameters, then place tests at the appropriate locations that compare each incoming value against the acceptable ones and generate the XML accordingly.
This option doesn't seem very reusable, but it'd be very quick to implement.
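A minimal sketch of the surrounding Java for option #1, using the standard javax.xml.transform API; the file names and the "allowed" parameter name are invented, and the stylesheet would need a matching xsl:param declaration:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;

public class FilterWithXslt {
    public static void main(String[] args) throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("filter.xsl")));
        // Query the acceptable values first, then hand them to the stylesheet
        // as a delimited list it can test against at each location.
        t.setParameter("allowed", "HOME|CELL|WORK");
        t.transform(new StreamSource(new File("in.xml")),
                    new StreamResult(new File("out.xml")));
    }
}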
Option #2 would involve Java code and an XML config file. The config file would lay out the XPaths of the needed tests, the acceptable values, the default values (if applicable), and what to take out of the document if the tests fail.
This option is much more reusable, but would probably double the time needed to build it.
So, which one of these would you pick? Or do you have another idea altogether? I'm open to all suggestions and would love to hear how you would handle this.
Sounds to me like option 2 is over-engineering. Do you have a clear idea of when you will want to reuse this functionality? If not, YAGNI: go for the simpler and easier solution.
Both options are acceptable. Depending on your skills and the complexity of your XML, I would say they will require about the same amount of time.
Option 1 would, in my opinion, be more flexible and easier to maintain in the long run.
Option 2 could be tricky in some cases: how do you define the config file itself for complex rules, and how do you parse it without writing complex code? One could say, "I'll use a dom4j visitor and be done with it." However, option 2 could become unnecessarily complicated, IMHO, if you deal with a complex XML structure.
I agree here. It felt borderline over-engineered, but I was afraid that someone hearing this had been done would assume it was reusable and try to design something that used it in the future. However, I have since been reassured that this is a one-time deal, and thus will be going with the XSLT approach.
Thanks all for your comments/answers!