I have two JSON files, each containing a large number of records (objects). One file has about 1200 records and the other about 600. I can't post the full files here, but I want to compare the two and get back the records that are common to both. The catch is that I can't simply iterate through them, as there are too many records and the tool I'm using doesn't support that. A sample of my JSON is below:
{"xyz":{"string":"hello"},"abc:{"string":"rts","event":"file","value":"100"}}
{"xyz":{"string":"hello"},"thg{"Integer":"rts","event":"file","value":"100"}}
My question is whether any libraries are available that let me directly compare two JSON files using predefined methods. If no such libraries exist, can you suggest an efficient way to find the common records, such as "xyz" in the example above?
I'm not supposed to use GSON as it is incompatible with the tool.
I don't know of a library for this, but the algorithm would be to sort both files by record key first and then walk through them together, comparing record by record. The overall complexity is O(n log n).
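For illustration, here is a minimal sketch of the comparison step, assuming each file has already been parsed into a map of record name to serialized record content (the class and method names are made up, and which parser you can use depends on your tool). With map lookups the explicit sort isn't even needed; a sort-and-merge over the keys would give the same result in O(n log n):

import java.util.Map;
import java.util.TreeMap;

public class CommonRecords {

    // Assumes each JSON file has already been parsed into a map of
    // record name -> serialized record content,
    // e.g. "xyz" -> {"string":"hello"}.
    static Map<String, String> commonRecords(Map<String, String> first,
                                             Map<String, String> second) {
        Map<String, String> common = new TreeMap<>(); // keeps the result sorted by key
        for (Map.Entry<String, String> e : first.entrySet()) {
            String other = second.get(e.getKey());
            // A record is common when both its name and its content match.
            if (other != null && other.equals(e.getValue())) {
                common.put(e.getKey(), e.getValue());
            }
        }
        return common;
    }
}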
Take a look at the link below.
http://tlrobinson.net/projects/javascript-fun/jsondiff/
It compares two JSON documents.
Hope this is what you were looking for.
Related
I'm trying to solve the following problem: I have a text file whose columns are separated by ",", and I need to be able to search by column. Example of the data in the file:
I will parse this file by column and place all the data into another data structure (or structures). My question is: what is the best data structure to use in a case like this, and what is the best algorithm for searching it? I also need to count the matching entries. For example, if I chose the last column and searched for "4", it should show the last two rows and report a count of 2. I was thinking of something like a list, but the file is pretty big and the search would take too long; I need a solution whose cost doesn't depend much on the data length. I also thought about a binary search tree, but I'm not quite sure how to use it here.
This is a learning task, so I don't need just a tool (like grep), because I'm trying to implement all of this in Java. Maybe this problem has a common solution that experienced programmers know, or maybe I need to work it out myself. I'm not asking for the full solution or the code, just a hint about which data structure/algorithm fits a situation like this, some keywords.
Honestly, a database. If that's not an option, it really depends on how you're going to query the data. Generally a B-tree is good for simple lookups and range comparisons, but you'll want a self-balancing tree such as an AVL or red-black tree, and you'll need one tree per column you want to index. That is essentially how simple databases do it.
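As a rough sketch of that per-column approach (using the JDK's TreeMap, which is a red-black tree, instead of hand-rolling an AVL tree; the class and method names are just illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// One TreeMap per column, mapping a cell value to the row numbers where it
// occurs. Lookups are O(log n), and size() of the result gives the count.
public class ColumnIndex {

    private final List<TreeMap<String, List<Integer>>> columns = new ArrayList<>();

    public ColumnIndex(int columnCount) {
        for (int i = 0; i < columnCount; i++) {
            columns.add(new TreeMap<>());
        }
    }

    // Call once per parsed row while loading the file.
    public void addRow(int rowNumber, String[] cells) {
        for (int col = 0; col < cells.length; col++) {
            columns.get(col)
                   .computeIfAbsent(cells[col], k -> new ArrayList<>())
                   .add(rowNumber);
        }
    }

    // Returns the matching row numbers for one column/value pair.
    public List<Integer> search(int column, String value) {
        return columns.get(column).getOrDefault(value, new ArrayList<>());
    }
}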
I've profiled my application and it seems like one of my biggest bottlenecks at the moment is the String.split method. It's taking up 21% of my runtime, and the other big contributors aren't parts that I can streamline any more than they already are. It also seems like all of the newly created String objects are causing issues with the garbage collector, although I'm less clear on whether that's actually the case.
I'm reading in a gzipped file of comma-separated values containing financial data. The number of fields in each row varies depending on the record type, and the size of each field varies too. What's the fastest way to read the data in while creating the fewest intermediate objects?
I saw this thread but none of the answers give any evidence that OpenCSV is any faster than String.split, and they all seem to focus on using an external library rather than writing new code. I'm also very concerned about memory overhead, because I spend another 20% or so of the total runtime doing garbage collection. I would like to just return views of the string in question, but it looks like that's not possible anymore.
A quicker way is to use a plain StringTokenizer. It doesn't have the regex overhead of split(), and it's already in the JDK.
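For example, a rough drop-in replacement for line.split(",") might look like this (note that, unlike split, StringTokenizer silently skips empty fields, which may or may not matter for your data):

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class Fields {
    // Roughly equivalent to line.split(","), without the regex machinery.
    static List<String> tokenize(String line) {
        List<String> fields = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            fields.add(st.nextToken());
        }
        return fields;
    }
}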
If you do not want to use a library, then an alternative to StringTokenizer would be to write a simple state machine to parse your CSV. Tokenizers can have problems with commas embedded in fields. CSV is a reasonably simple format, so it is not difficult to build a state machine to handle it. If you know exactly what the format of the input file is, then you can simplify it even further since you will not have to deal with any possibilities not present in your specific file.
Numeric data could potentially be converted directly to int on the fly, without having to hold a large number of strings simultaneously.
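As a sketch of what such a state machine might look like (this minimal version only tracks whether it is inside a quoted field; it ignores escaped quotes and embedded newlines, so extend it if your file actually contains those):

import java.util.ArrayList;
import java.util.List;

public class CsvLine {
    // Splits one CSV line, respecting commas inside double-quoted fields.
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;           // toggle quoted state
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString()); // field boundary
                current.setLength(0);           // reuse the builder, no extra objects
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }
}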
Use uniVocity-parsers to parse your CSV file. It is a suite of parsers for tabular text formats, and its CSV parser is the fastest among the CSV parsers available for Java (as you can see here and here). Disclosure: I am the author of this library. It's open source and free (Apache 2.0 license).
We used the architecture provided by this framework to build a custom parser for MySQL dump files for this project. We managed to parse a 42GB dump file in 15 minutes (1+ billion rows).
It should solve your problem.
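For reference, streaming the gzipped file row by row with uniVocity-parsers might look roughly like this (the file name is invented, and the exact settings API may differ slightly between versions):

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class UnivocityExample {
    public static void main(String[] args) throws Exception {
        CsvParserSettings settings = new CsvParserSettings();
        CsvParser parser = new CsvParser(settings);

        // Read straight out of the gzipped file, one row at a time,
        // so the whole file never sits in memory.
        try (Reader reader = new InputStreamReader(
                new GZIPInputStream(new FileInputStream("financial-data.csv.gz")),
                StandardCharsets.UTF_8)) {
            parser.beginParsing(reader);
            String[] row;
            while ((row = parser.parseNext()) != null) {
                // process the fields of this row here
            }
        }
    }
}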
I need to search a large number of files (i.e. 600 files of 0.5 MB each) for a specific string.
I'm using Java, so I'd prefer the answer to be a Java library or in the worst case a library in a different language which I could call from Java.
I need the search to return the exact position of the found string in a file (so it seems Lucene for example is out of the question).
I need the search to be as fast as possible.
EDIT START:
The files might have different format (i.e. EDI, XML, CSV) and contain sometimes pretty random data (i.e. numerical IDs etc.). This is why I preliminarily ruled out an index-based searching engine.
The files will be searched multiple times for similar but different strings (i.e. for IDs which might have similar length and format, but they will usually be different).
EDIT END
Any ideas?
600 files of 0.5 MB each is about 300MB - that can hardly be considered big nowadays, let alone large. A simple string search on any modern computer should actually be more I/O-bound than CPU-bound - a single thread on my system can search 300MB for a relatively simple regular expression in under 1.5 seconds - which goes down to 0.2 if the files are already present in the OS cache.
With that in mind, if your purpose is to perform such a search infrequently, then using some sort of index may result in an overengineered solution. Start by iterating over all files, reading each block-by-block or line-by-line and searching - this is simple enough that it barely merits its own library.
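A brute-force pass really is only a few lines; this sketch reports file, line and column for every match (an absolute character offset could be tracked the same way), with invented names:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class BruteForceSearch {
    // Reads each file line by line and prints file:line:column for every occurrence.
    static void search(List<Path> files, String needle) throws IOException {
        for (Path file : files) {
            List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
            for (int lineNo = 0; lineNo < lines.size(); lineNo++) {
                int col = lines.get(lineNo).indexOf(needle);
                while (col >= 0) {
                    System.out.printf("%s:%d:%d%n", file, lineNo + 1, col);
                    col = lines.get(lineNo).indexOf(needle, col + 1);
                }
            }
        }
    }
}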
Set down your performance requirements, profile your code, verify that the actual string search is the bottleneck and then decide whether a more complex solution is warranted. If you do need something faster, you should first consider the following solutions, in order of complexity:
Use an existing indexing engine, such as Lucene, to filter out the bulk of the files for each query and then explicitly search in the (hopefully few) remaining files for your string.
If your files are not really text, so that word-based indexing would not work, preprocess the files to extract a term list from each file and use a DB to build your own indexing system - I doubt you will find an FTS engine that indexes on anything other than words.
If you really want to reduce the search time to the minimum, extract term/position pairs from your files, and enter those in your DB. You may still have to verify by looking at the actual file, but it would be significantly faster.
PS: You do not mention at all what kind of strings we are discussing. Do they contain delimited terms, e.g. words, or do your files contain random characters? Can the search string be broken into substrings in a meaningful manner, or is it just a bunch of letters? Is your search string fixed, or could it also be a regular expression? The answer to each of these questions could significantly limit what is and what is not actually feasible - for example, indexing random strings may not be possible at all.
EDIT:
From the question update, it seems that the concept of a term/token is generally applicable, as opposed to e.g. searching for totally random sequences in a binary file. That means that you can index those terms. By searching the index for any tokens that exist in your search string, you can significantly reduce the cases where a look at the actual file is needed.
You could keep a term->file index. If most terms are unique to each file, this approach might offer a good complexity/performance trade-off. Essentially you would narrow down your search to one or two files and then perform a full search on those files only.
You could keep a term->file:position index. For example, if your search string is "Alan Turing", you would first search the index for the tokens "Alan" and "Turing". You would get two lists of files and positions that you could cross-reference. By requiring, say, that the positions of the token "Alan" precede those of the token "Turing" by at most 30 characters, you would get a list of candidate positions in your files that you could verify explicitly.
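As an illustration of such a term->file:position index (all names are invented; the cross-referencing of the candidate lists, e.g. requiring "Alan" to precede "Turing" within 30 characters, is left to the caller):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PositionIndex {

    public static final class Posting {
        final String file;
        final int offset;   // character offset of the term within the file
        Posting(String file, int offset) { this.file = file; this.offset = offset; }
    }

    private final Map<String, List<Posting>> index = new HashMap<>();

    // Called once per token while indexing the files.
    public void add(String term, String file, int offset) {
        index.computeIfAbsent(term, k -> new ArrayList<>())
             .add(new Posting(file, offset));
    }

    // Returns candidate positions; the caller still verifies them in the actual file.
    public List<Posting> lookup(String term) {
        return index.getOrDefault(term, new ArrayList<>());
    }
}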
I am not sure to what degree existing indexing libraries would help. Most are targeted towards text indexing and may mishandle other types of tokens, such as numbers or dates. On the other hand, your case is not fundamentally different either, so you might be able to use them - if necessary, by preprocessing the files you feed them to make them more palatable. Building an indexing system of your own, tailored to your needs, does not seem too difficult either.
You still haven't mentioned whether there is any kind of flexibility in your search string. Do you expect to be able to search for regular expressions? Is the search string expected to be found verbatim, or do you just need to find the terms in it? Does whitespace matter? Does the order of the terms matter?
And more importantly, you haven't mentioned if there is any kind of structure in your files that should be considered while searching. For example, do you want to be able to limit the search to specific elements of an XML file?
Unless you have an SSD, your main bottleneck will be the file accesses. It's going to take about 10 seconds just to read the files, regardless of what you do in Java.
If you have an SSD, reading the files won't be a problem, and the CPU speed in Java will matter more.
If you can create an index for the files this will help enormously.
Is there any Java library with a TreeMap-like data structure that also supports all of these:
lookup by value (like Guava's BiMap)
possibility of non-unique keys as well as non-unique values (like Guava's Multimap)
keeps track of sorted values as well as sorted keys
If it exists, it would probably be called SortedBiTreeMultimap, or something similar :)
This can be produced by combining a few data structures, but I never took the time to unite them into one nice class, so I was wondering if someone else has already done it.
I think you are looking for a "Graph". You might be interested in this slightly similar question asked a while ago, as well as this discussion thread on BiMultimaps / Graphs. Google has a BiMultimap in its internal code base, but they haven't yet decided whether to open source it.
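Until something like that is published, a rough stand-in can be assembled from two TreeMultimaps kept in sync, one forward and one inverse (a sketch only; it assumes a reasonably recent Guava and comparable keys and values, and omits removal):

import com.google.common.collect.TreeMultimap;

import java.util.NavigableSet;

public class SortedBiMultimap<K extends Comparable<K>, V extends Comparable<V>> {

    private final TreeMultimap<K, V> forward = TreeMultimap.create();
    private final TreeMultimap<V, K> inverse = TreeMultimap.create();

    public void put(K key, V value) {
        forward.put(key, value);   // non-unique keys are fine: Multimap semantics
        inverse.put(value, key);   // non-unique values are fine too
    }

    public NavigableSet<V> valuesForKey(K key)   { return forward.get(key); }

    public NavigableSet<K> keysForValue(V value) { return inverse.get(value); }

    public NavigableSet<K> sortedKeys()          { return forward.keySet(); }

    public NavigableSet<V> sortedValues()        { return inverse.keySet(); }
}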
I'm doing a project in which I have to search for a word in a dictionary efficiently.
Can anyone provide Java code for implementing this search with indexing?
Can I use a B+ tree for the implementation?
Check out this answer.
The best way I know of (personally) to efficiently map from strings to other values is with a Trie. The answer I provided includes links to several already implemented versions.
An alternative is to intern all your strings and index based on yourString.intern().hashCode().
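For the trie option, a bare-bones version that maps words to values looks something like this (lookup cost is proportional to the word length, independent of the dictionary size):

import java.util.HashMap;
import java.util.Map;

public class Trie<V> {

    private static final class Node<V> {
        final Map<Character, Node<V>> children = new HashMap<>();
        V value;   // non-null only at nodes that end a complete word
    }

    private final Node<V> root = new Node<>();

    public void put(String word, V value) {
        Node<V> node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node<>());
        }
        node.value = value;
    }

    public V get(String word) {
        Node<V> node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;   // word not in the dictionary
            }
        }
        return node.value;
    }
}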
This sounds like homework. If it is, please tag it as such.
Is "using an index" an external requirement, or one you've invented because you think it's part of the solution?
I would consider using a data structure called a "Trie" for this kind of requirement (assuming the use of an index isn't actually mandated - although even then, you could argue that the Trie IS the index...)