Java string comparison

I am comparing substrings in two large text files. Very simple: tokenizing into two token containers and comparing with two nested for loops. Performance is disastrous! Does anybody have advice or an idea how to improve it?
for (int s = 0; s < txtA.TokenContainer.size(); s++) {
    String strTxtA = txtA.getSubStr(s);
    strLengthA = txtA.getNumToken(s);
    if (strLengthA >= dp.getMinStrLength()) {
        int tokenFileB = 1;
        for (int t = 0; t < txtB.TokenContainer.size(); t++) {
            String strTxtB = txtB.getSubStr(t);
            strLengthB = txtB.getNumToken(t);
            if (strTxtA.equalsIgnoreCase(strTxtB)) {
                try {
                    subStrTemp = new SubStrTemp(
                            txtA.ID, txtB.ID, tokenFileA, tokenFileB,
                            (tokenFileA + strLengthA - 1),
                            (tokenFileB + strLengthB - 1));
                    if (subStrContainer.contains(subStrTemp) == false) {
                        subStrContainer.addElement(subStrTemp);
                    }
                } catch (Exception ex) {
                    logger.error("error");
                }
            }
            tokenFileB += strLengthB;
        }
        tokenFileA += strLengthA;
    }
}
Generally, my code reads two large Strings with a Java tokenizer into containers A and B, and then tries to compare the substrings. The positions of substrings that exist in both strings are to be stored in a Vector. But performance is awful, and I don't really know how to solve this with a HashMap.

Your main problem is that you go through all of txtB for each token in txtA.
You should store the information about the tokens from txtA (in a HashMap, for instance) and then, in a second loop (but not a nested one), compare the tokens from txtB against the ones already in the map.
On the same topic:
term frequency using java program
How to count words in java

You are doing a join with nested loops? Yes, that is O(n^2). What about doing a hash join instead? That is, create a map from the (lowercased) token text to its index t and do lookups with this map rather than iterating over the token container.
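A minimal sketch of that hash-join idea (tokensA/tokensB and the int[]{s, t} pairs are placeholders for the question's token containers and SubStrTemp entries):
import java.util.*;

// Hash join: index one side once, then probe it with the other side.
static List<int[]> findCommonTokens(List<String> tokensA, List<String> tokensB) {
    // Index txtB once: lowercased token -> all positions where it occurs in tokensB.
    Map<String, List<Integer>> indexB = new HashMap<>();
    for (int t = 0; t < tokensB.size(); t++) {
        indexB.computeIfAbsent(tokensB.get(t).toLowerCase(), k -> new ArrayList<>()).add(t);
    }
    // One (non-nested) pass over tokensA: each token costs one hash lookup
    // instead of a scan over all of tokensB.
    List<int[]> matches = new ArrayList<>();
    for (int s = 0; s < tokensA.size(); s++) {
        for (int t : indexB.getOrDefault(tokensA.get(s).toLowerCase(), Collections.emptyList())) {
            matches.add(new int[] { s, t }); // build the SubStrTemp-style entry here instead
        }
    }
    return matches;
}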

Put the tokens of fileA into a trie data structure. Then when tokenising fileB you can check quite quickly if these tokens are in the trie. A few code comments would help.
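For reference, a minimal character-trie sketch (class and method names are made up, not from the question; a HashSet of tokens behaves much the same for whole-token lookups):
// Insert every token of fileA once; a membership test for a token of fileB then
// costs O(token length), independent of how many tokens fileA contains.
class TokenTrie {
    private final java.util.Map<Character, TokenTrie> children = new java.util.HashMap<>();
    private boolean isToken;

    void insert(String token) {
        TokenTrie node = this;
        for (char c : token.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TokenTrie());
        }
        node.isToken = true;
    }

    boolean contains(String token) {
        TokenTrie node = this;
        for (char c : token.toLowerCase().toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.isToken;
    }
}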

As said, this is an issue of complexity: your algorithm runs in O(n^2), whereas with a hash it runs in roughly O(n).
For second-order improvements, try to make fewer method calls; for example, you can get the size once:
sizeB = txtB.TokenContainer.size();
Depending on the size, you could also ask the container once for an array of strings, to save the repeated getSubStr calls.
Roni

Related

Ektorp CouchDb: Query for pattern with multiple contains

I want to query multiple candidates for a search string which could look like "My sear foo".
Now I want to look for documents which have a field that contains one (or more) of the entered strings (split by whitespace).
I found some code which allows me to do a search by pattern:
@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
    String trim = serialNumber.trim();
    if (StringUtils.isEmpty(trim)) {
        return new ArrayList<>();
    }
    ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(trim).endKey(trim + "\u9999");
    return db.queryView(viewQuery, DeviceEntityCouch.class);
}
which works quite nicely for looking for just one pattern. But how do I have to modify my code to get a multiple contains on doc.serialNumber?
EDIT:
This is the current workaround, but I guess there must be a better way.
Also, it only implements OR logic: an entry only has to match term1 or term2 to end up in the list.
@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
    String trim = serialNumber.trim();
    if (StringUtils.isEmpty(trim)) {
        return new ArrayList<>();
    }
    String[] split = trim.split(" ");
    List<DeviceEntityCouch> list = new ArrayList<>();
    for (String s : split) {
        ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(s).endKey(s + "\u9999");
        list.addAll(db.queryView(viewQuery, DeviceEntityCouch.class));
    }
    return list;
}
Looks like you are implementing a full-text search here. That's not going to be very efficient in CouchDB (and I guess the same applies to other databases).
Correct me if I am wrong, but from your code it looks like you are trying to search a list of serial numbers for a pattern. CouchDB (or any other database) is quite efficient if you can somehow index the data you will be searching for.
Otherwise you must fetch every single record and perform a string comparison on it.
The only way I can think of to optimize this in CouchDB would be something like the following (with assumptions):
Your serial numbers are not very long (say 20 chars?)
You force the search to always be 5 characters
Generate a view that emits every 5-character substring of your serial number - more or less like this (it could be optimized, and I am not sure I got it exactly right):
...
for (var i = 0; doc.serialNo.length >= 5 && i <= doc.serialNo.length - 5; i++) {
    emit([doc.serialNo.substring(i, i + 5), doc._id]);
}
...
Use the _count reduce function.
Now the following URL:
http://localhost:5984/test/_design/serial/_view/complex-key?startkey=["01234"]&endkey=["01234",{}]&group=true
will return a list of rows with a hit count for the key 01234.
If you don't group and set the reduce option to false, you will get a list of all matches, including duplicates if a single doc has multiple hits.
Refer to http://ryankirkman.com/2011/03/30/advanced-filtering-with-couchdb-views.html for the information about complex keys lookups.
I am not sure how efficient CouchDB is in terms of updating that view. It depends on how many records you have and how many new entries appear between queries of the view (I understand CouchDB rebuilds the view's B-tree on demand).
I have generated a view like that which splits doc ids into 5-character keys. Out of just over 1K docs it generated over 30K rows - the ids being 32 chars long, it's simple maths really: (serialNo.length - key.length + 1) * doc count.
Generating the view took a while, but the lookups were fast.
You could generate keys of multiple lengths, etc. It all comes down to your record count vs. the speed of lookups.
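If AND semantics are needed across several search terms, one option is to run the per-term queries as in the workaround and intersect the results client-side. A rough sketch, assuming DeviceEntityCouch exposes a getId() accessor (adjust to your entity); createQuery, db and StringUtils are the same as in the question's code:
public List<DeviceEntityCouch> findBySerialPatternAll(String serialNumber) {
    String trim = serialNumber.trim();
    if (StringUtils.isEmpty(trim)) {
        return new ArrayList<>();
    }
    Map<String, DeviceEntityCouch> matches = null;
    for (String s : trim.split("\\s+")) {
        Map<String, DeviceEntityCouch> hits = new HashMap<>();
        ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(s).endKey(s + "\u9999");
        for (DeviceEntityCouch e : db.queryView(viewQuery, DeviceEntityCouch.class)) {
            hits.put(e.getId(), e);
        }
        if (matches == null) {
            matches = hits;                                // first term: take all of its hits
        } else {
            matches.keySet().retainAll(hits.keySet());     // later terms: keep only common ids
        }
    }
    return matches == null ? new ArrayList<>() : new ArrayList<>(matches.values());
}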

Storing and comparing a large quantity of Strings in Java

My application stores a large number (about 700,000) of strings in an ArrayList. The strings are loaded from a text file like this:
List<String> stringList = new ArrayList<String>(750_000);
// there's a try catch here but I omitted it for this example
Scanner fileIn = new Scanner(new FileInputStream(listPath), "UTF-8");
while (fileIn.hasNext()) {
    String s = fileIn.nextLine().trim();
    if (s.isEmpty()) continue;
    if (s.startsWith("#")) continue; // ignore comments
    stringList.add(s);
}
fileIn.close();
Later on, other strings are compared to this list, using this code:
String example = "Something";
if (stringList.contains(example))
doSomething();
This comparison will happen many hundreds (thousands?) of times.
This all works, but I want to know if there's anything I can do to make it better. I notice that the JVM increases in size from about 100MB to 600MB when it loads the 700K Strings. The strings are mainly about this size:
Blackened Recordings
Divergent Series: Insurgent
Google
Pixels Movie Money
X Ambassadors
Power Path Pro Advanced
CYRFZQ
Is there anything I can do to reduce the memory, or is that to be expected? Any suggestions in general?
ArrayList is memory-efficient. Your issue is probably caused by java.util.Scanner. Scanner creates a lot of temporary objects during parsing (Patterns, Matchers, etc.) and is not well suited to big files.
Try to replace it with java.io.BufferedReader:
List<String> stringList = new ArrayList<String>();
// FileReader does not take a charset; wrap a FileInputStream in an InputStreamReader instead
BufferedReader fileIn = new BufferedReader(new InputStreamReader(new FileInputStream(listPath), "UTF-8"));
String line = null;
while ((line = fileIn.readLine()) != null) {
    line = line.trim();
    if (line.isEmpty()) continue;
    if (line.startsWith("#")) continue; // ignore comments
    stringList.add(line);
}
fileIn.close();
See the java.util.Scanner source code.
To pinpoint the memory issue, attach a memory profiler to your JVM, for example VisualVM from the JDK tools.
Added:
Let's make a few assumptions:
you have 700,000 strings of 20 characters each;
an object reference is 32 bits, an object header 24, an array header 16, a char 16, an int 32.
Then every string will consume 24 + 32*2 + 32 + (16 + 20*16) = 456 bits.
The whole ArrayList with the string objects will consume about 700,000 * (32*2 + 456) = 364,000,000 bits = 43.4 MB (very roughly).
Not quite an answer, but:
Your scenario uses around 70mb on my machine:
long usedMemory = -(Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory());
{
    String[] strings = new String[700_000];
    for (int i = 0; i < strings.length; i++) {
        strings[i] = new String(new char[20]);
    }
}
usedMemory += Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
System.out.println(usedMemory / 1_000_000d + " mb");
How did you reach 500 MB there? As far as I know, a String internally holds a char[], and each char takes 16 bits. Taking the Object and String overhead into account, 500 MB still seems like a lot for the strings alone. You may want to run some benchmarks on your machine.
As others have already mentioned, you should change the data structure used for the element look-ups/comparisons.
You're likely going to be better off using a HashSet instead of an ArrayList as both add and contains are constant time operations in a HashSet.
However, it does assume that your object's hashCode implementation (which is part of Object, but can be overridden) is evenly distributed.
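For illustration, a sketch of the loading code from the question rewritten against a HashSet (listPath and doSomething() are the names used in the question):
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

// Same loading logic as in the question, but into a HashSet so that contains()
// is an O(1) hash lookup instead of a linear scan over ~700,000 entries.
static Set<String> loadStrings(String listPath) throws IOException {
    Set<String> strings = new HashSet<>(1_000_000);
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(listPath), StandardCharsets.UTF_8))) {
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // skip blanks and comments
            strings.add(line);
        }
    }
    return strings;
}

// The lookup then stays a one-liner:
// if (strings.contains("Something")) doSomething();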
There is a Trie data structure which can be used as a dictionary; with so many strings, some of them may occur multiple times. https://en.wikipedia.org/wiki/Trie . It seems to fit your case.
UPDATE:
An alternative can be a HashSet, or a HashMap of string -> something if, for example, you want the number of occurrences of each string. A hashed collection will certainly be faster than a list.
I would start with HashSet.
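If you do want occurrence counts, a rough sketch with a HashMap (variable names are illustrative; stringList is the list from the question):
import java.util.*;

// Count how often each string occurs while keeping O(1) lookups.
Map<String, Integer> counts = new HashMap<>();
for (String s : stringList) {
    counts.merge(s, 1, Integer::sum);   // insert 1, or add 1 to the existing count
}

boolean present = counts.containsKey("Something");    // membership: one hash lookup
int howOften = counts.getOrDefault("Something", 0);   // occurrence count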
Using an ArrayList is a very bad idea for your use case, because it is not sorted, and hence you cannot search it efficiently.
The best built-in type for your case is a TreeSet<String>. It guarantees O(log(n)) performance for add() and contains().
Be aware that TreeSet is not thread-safe in the basic implementation. Use a thread-safe wrapper (see the JavaDocs of TreeSet for this).
Here is a Java 8 approach. It uses the Files.lines() method, which takes advantage of the Stream API. This method reads all lines from a file as a Stream.
As a consequence, no collection of String objects is built up before the terminal operation, which here is the static method MyExecutor.doSomething(String).
/**
 * Process lines from a file.
 * Uses the Files.lines() method, which takes advantage of the Stream API introduced in Java 8.
 */
private static void processStringsFromFile(final Path file) {
    try (Stream<String> lines = Files.lines(file)) {
        lines.map(s -> s.trim())
             .filter(s -> !s.isEmpty())
             .filter(s -> !s.startsWith("#"))
             .filter(s -> s.contains("Something"))
             .forEach(MyExecutor::doSomething);
    } catch (IOException ex) {
        logProcessStringsFailed(ex);
    }
}
I ran an analysis of memory usage in NetBeans; here are the memory results for an empty implementation of doSomething():
public static void doSomething(final String s) {
}
Live Bytes = 6702720 ≈ 6.4MB.

A more efficient way of finding English words that are one letter off from each other

I wrote a little program that tries to find a connection between two equal-length English words. Word A will transform into Word B by changing one letter at a time; each newly created word has to be an English word.
For example:
Word A = BANG
Word B = DUST
Result:
BANG -> BUNG -> BUNT -> DUNT -> DUST
My process:
1. Load an English word list (consisting of 109582 words) into a Map<Integer, List<String>> _wordMap = new HashMap();, keyed by word length.
2. The user puts in 2 words.
3. createGraph creates a graph.
4. Calculate the shortest path between those 2 nodes.
5. Print out the result.
Everything works perfectly fine, but I am not satisfied with the time it took in step 3.
See:
Completely loaded 109582 words!
CreateMap took: 30 milsecs
CreateGraph took: 17417 milsecs
(HOISE : HORSE)
(HOISE : POISE)
(POISE : PRISE)
(ARISE : PRISE)
(ANISE : ARISE)
(ANILE : ANISE)
(ANILE : ANKLE)
The wholething took: 17866 milsecs
I am not satisfied with the time it takes to create the graph in step 3. Here's my code for it (I am using JGraphT for the graph):
private List<String> _wordList = new ArrayList(); // list of all 109582 English words
private Map<Integer, List<String>> _wordMap = new HashMap(); // Map grouping all the words by their length()
private UndirectedGraph<String, DefaultEdge> _wordGraph =
        new SimpleGraph<String, DefaultEdge>(DefaultEdge.class); // Graph used to calculate the shortest path from one node to the other.

private void createGraph(int wordLength) {
    long before = System.currentTimeMillis();
    List<String> words = _wordMap.get(wordLength);
    for (String word : words) {
        _wordGraph.addVertex(word); // adds a node
        for (String wordToTest : _wordList) {
            if (isSimilar(word, wordToTest)) {
                _wordGraph.addVertex(wordToTest); // adds another node
                _wordGraph.addEdge(word, wordToTest); // connects 2 nodes if they are one letter off from each other
            }
        }
    }
    System.out.println("CreateGraph took: " + (System.currentTimeMillis() - before) + " milsecs");
}
private boolean isSimilar(String wordA, String wordB) {
    if (wordA.length() != wordB.length()) {
        return false;
    }
    int matchingLetters = 0;
    if (wordA.equalsIgnoreCase(wordB)) {
        return false;
    }
    for (int i = 0; i < wordA.length(); i++) {
        if (wordA.charAt(i) == wordB.charAt(i)) {
            matchingLetters++;
        }
    }
    if (matchingLetters == wordA.length() - 1) {
        return true;
    }
    return false;
}
My question:
How can I improve my algorithm in order to speed up the process?
For any redditors that are reading this, yes I created this after seeing the thread from /r/askreddit yesterday.
Here's a starting thought:
Create a Map<String, List<String>> (or a Multimap<String, String> if you're using Guava), and for each word, "blank out" one letter at a time, and add the original word to the list for that blanked-out word. So you'd end up with:
.ORSE => NORSE, HORSE, GORSE (etc)
H.RSE => HORSE
HO.SE => HORSE, HOUSE (etc)
At that point, given a word, you can very easily find all the words it's similar to - just go through the same process again, but instead of adding to the map, just fetch all the values for each "blanked out" version.
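A rough sketch of that index (wordsOfSameLength stands in for the relevant _wordMap bucket; the edge-adding part is left as a comment):
import java.util.*;

// Build the "blanked out" index once; a Guava Multimap would read a little nicer.
Map<String, List<String>> patterns = new HashMap<>();
for (String word : wordsOfSameLength) {
    for (int i = 0; i < word.length(); i++) {
        String key = word.substring(0, i) + "." + word.substring(i + 1);
        patterns.computeIfAbsent(key, k -> new ArrayList<>()).add(word);
    }
}

// Neighbours of a word are the union of its buckets, minus the word itself;
// two distinct words share a bucket exactly when they differ only at the blanked position.
for (String word : wordsOfSameLength) {
    for (int i = 0; i < word.length(); i++) {
        String key = word.substring(0, i) + "." + word.substring(i + 1);
        for (String neighbour : patterns.get(key)) {
            if (!neighbour.equals(word)) {
                // add the edge word <-> neighbour to the graph here
            }
        }
    }
}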
You probably need to run it through a profiler to see where most of the time is taken, especially since you are using library classes - otherwise you might put in a lot of effort but see no significant improvement.
You could lowercase all the words before you start, to avoid the equalsIgnoreCase() on every comparison. In fact, this is an inconsistency in your code - you use equalsIgnoreCase() initially, but then compare chars in a case-sensitive way: if (wordA.charAt(i) == wordB.charAt(i)). It might be worth eliminating the equalsIgnoreCase() check entirely, since this is doing essentially the same thing as the following charAt loop.
You could change the comparison loop so it finishes early when it finds more than one different letter, rather than comparing all the letters and only then checking how many are matching or different.
(Update: this answer is about optimizing your current code. I realize, reading your question again, that you may be asking about alternative algorithms!)
You can sort the list of words of the same length and then use a loop nesting of the kind for (int i = 0; i < n; ++i) for (int j = i + 1; j < n; ++j) { ... }, so that each pair is only compared once.
And in isSimilar, count the differences and return false as soon as you reach 2, as sketched below.
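Putting both suggestions together, a sketch of the early-exit comparison and the i/j pair loop (words stands in for the same-length word list):
// Early-exit variant: stop as soon as a second differing position shows up.
// Assumes both words are already lower-cased and of equal length.
private static boolean isOneLetterOff(String a, String b) {
    int differences = 0;
    for (int i = 0; i < a.length(); i++) {
        if (a.charAt(i) != b.charAt(i) && ++differences > 1) {
            return false; // more than one differing letter: no edge
        }
    }
    return differences == 1;
}

// Pair loop that visits each unordered pair only once:
for (int i = 0; i < words.size(); i++) {
    for (int j = i + 1; j < words.size(); j++) {
        if (isOneLetterOff(words.get(i), words.get(j))) {
            // add the edge words.get(i) <-> words.get(j) here
        }
    }
}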

Java - Sort an Array Twice?

I am working on a program that displays zip codes and house numbers. I need to sort the zip codes in ascending order in the first column then sort the house numbers from left to right, keeping them with the same zip code.
For instance:
Looks like this:
90153 | 9810 6037 8761 1126 9792 4070
90361 | 2274 6800 2196 3158 9614 9086
I want it to look like this:
90153 | 1126 4070 6037 8761 9792 9810
90361 | 2196 2274 3158 6800 9086 9614
I used the following code to sort the zip codes but how do I sort the house numbers? Do I need to add a loop to sort the numbers to this code? If so, where? So sorry I couldn't make the code indent correctly.
void DoubleArraySort()
{
    int k, m, Hide;
    boolean DidISwap;
    DidISwap = true;
    while (DidISwap)
    {
        DidISwap = false;
        for (k = 0; k < Row - 1; k++)
        {
            if (Numbers[k][0] > Numbers[k+1][0])
            {
                for (m = 0; m < Col; m++)
                {
                    Hide = Numbers[k][m];
                    Numbers[k][m] = Numbers[k+1][m];
                    Numbers[k+1][m] = Hide;
                    DidISwap = true;
                }
            }
        }
    }
}
Use an object ZipCode like this:
public class ZipCode implements Comparable<ZipCode> {
    private String zipcode;
    private ArrayList<String> adds;

    public ZipCode(String zip) {
        zipcode = zip;
        adds = new ArrayList<String>();
    }

    public void addAddress(String address) {
        adds.add(address);
        Collections.sort(adds);
    }

    @Override
    public int compareTo(ZipCode other) {
        return zipcode.compareTo(other.zipcode); // needed so Arrays.sort(zips) below works
    }
}
Keep an array of ZipCodes, sorting them as necessary:
ZipCode[] zips = . . .
.
.
.
Arrays.sort(zips);
First of all, are you aware that Java provides a more efficient sorting mechanism out of the box? Check the Arrays class.
Secondly you have to be very careful with your approach. What you are doing here is swapping all the elements of one row with the other. But you are not doing the same thing within each row. So you need a separate nested loop outside the current while (before or after, doesn't make a difference), which checks the houses themselves and sorts them:
for (k = 0; k < Row; k++)
{
    do
    {
        DidISwap = false;
        // start at column 1 so the zip code in column 0 stays in place
        for (m = 1; m < Col - 1; m++)
        {
            if (Numbers[k][m] > Numbers[k][m+1])
            {
                Hide = Numbers[k][m];
                Numbers[k][m] = Numbers[k][m+1];
                Numbers[k][m+1] = Hide;
                DidISwap = true;
            }
        }
    }
    while (DidISwap);
}
However, your approach is very inefficient. Why don't you put the list of houses in a SortedSet, and then create a SortedMap which maps from your postcodes to your Sorted Sets of houses? Everything will be sorted automatically and much more efficiently.
You can use the TreeMap for your SortedMap implementation and the TreeSet for your SortedSet implementation.
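A sketch of that idea, assuming as in the question that Numbers[row][0] holds the zip code and columns 1 .. Col-1 hold the house numbers for that row:
import java.util.*;

// Zip code -> automatically sorted set of house numbers.
SortedMap<Integer, SortedSet<Integer>> byZip = new TreeMap<>();
for (int row = 0; row < Row; row++) {
    SortedSet<Integer> houses = byZip.computeIfAbsent(Numbers[row][0], z -> new TreeSet<>());
    for (int col = 1; col < Col; col++) {
        houses.add(Numbers[row][col]);
    }
}

// Both levels come out in ascending order with no explicit sorting code.
for (Map.Entry<Integer, SortedSet<Integer>> entry : byZip.entrySet()) {
    System.out.println(entry.getKey() + " | " + entry.getValue());
}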
I / we could try to tell you how to fix (sort of) your code to do what you want, but it would be counter-productive. Instead, I'm going to explain "the Java way" of doing these things, which (if you follow it) will make you more productive, and make your code more maintainable.
Follow the Java style conventions. In particular, the identifier conventions. Method names and variable names should always start with a lower case character. (And try to use class, method and variable names that hint as to the meaning of the class/method/variable.)
Learn the Java APIs and use existing standard library classes and methods in preference to reinventing the wheel. For instance:
The Arrays and Collections classes have standard methods for sorting arrays and collections.
There are collection types that implement sets and mappings and the like that can take care of "boring" things like keeping elements in order.
If you have a complicated data structure, build it out of existing collection types and custom classes. Don't try and represent it as arrays of numbers. Successful Java programmers use high-level design and implementation abstractions. Your approach is like trying to build a multi-storey car-park from hand-made bricks.
My advice would be to get a text book on object-oriented programming (in Java) and get your head around the right way to design and write Java programs. Investing the effort now will make you more productive.

Tips optimizing Java code

So, I've written a spellchecker in Java and things work as they should. The only problem is that if I use a word where the max allowed distance of edits is too large (like say, 9) then my code runs out of memory. I've profiled my code and dumped the heap into a file, but I don't know how to use it to optimize my code.
Can anyone offer any help? I'm more than willing to put up the file/use any other approach that people might have.
-Edit-
Many people asked for more details in the comments. I figured that other people would find them useful, and they might get buried in the comments. Here they are:
I'm using a Trie to store the words themselves.
In order to improve time efficiency, I don't compute the Levenshtein Distance upfront, but I calculate it as I go. What I mean by this is that I keep only two rows of the LD table in memory. Since a Trie is a prefix tree, it means that every time I recurse down a node, the previous letters of the word (and therefore the distance for those words) remains the same. Therefore, I only calculate the distance with that new letter included, with the previous row remaining unchanged.
The suggestions that I generate are stored in a HashMap. The rows of the LD table are stored in ArrayLists.
Here's the code of the function in the Trie that leads to the problem. Building the Trie is pretty straightforward, and I haven't included the code for it here.
/*
 * @param letter: the letter that is currently being looked at in the trie
 * word: the word that we are trying to find matches for
 * previousRow: the previous row of the Levenshtein Distance table
 * suggestions: all the suggestions for the given word
 * maxd: max distance a word can be from the query and still be returned as a suggestion
 * suggestion: the current suggestion being constructed
 */
public void get(char letter, ArrayList<Character> word, ArrayList<Integer> previousRow, HashSet<String> suggestions, int maxd, String suggestion) {
    // the new row of the LD table that is to be computed
    ArrayList<Integer> currentRow = new ArrayList<Integer>(word.size() + 1);
    currentRow.add(previousRow.get(0) + 1);
    int insert = 0;
    int delete = 0;
    int swap = 0;
    int d = 0;
    for (int i = 1; i < word.size() + 1; i++) {
        delete = currentRow.get(i - 1) + 1;
        insert = previousRow.get(i) + 1;
        if (word.get(i - 1) == letter)
            swap = previousRow.get(i - 1);
        else
            swap = previousRow.get(i - 1) + 1;
        d = Math.min(delete, Math.min(insert, swap));
        currentRow.add(d);
    }
    // if this node represents a word and the distance so far is <= maxd, then add this word as a suggestion
    if (isWord == true && d <= maxd) {
        suggestions.add(suggestion);
    }
    // if any of the entries in the current row are <= maxd, it means we can still find possible solutions.
    // recursively search all the branches of the trie
    for (int i = 0; i < currentRow.size(); i++) {
        if (currentRow.get(i) <= maxd) {
            for (int j = 0; j < 26; j++) {
                if (children[j] != null) {
                    children[j].get((char) (j + 97), word, currentRow, suggestions, maxd, suggestion + String.valueOf((char) (j + 97)));
                }
            }
            break;
        }
    }
}
Here's some code I quickly crafted showing one way to generate the candidates and then "rank" them.
The trick is: you never "test" a non-valid candidate.
To me, your "I run out of memory when I've got an edit distance of 9" screams "combinatorial explosion".
Of course, to dodge a combinatorial explosion you don't do things like trying to generate all the words that are at a distance of 9 from your misspelled word. You start from the misspelled word and generate (quite a lot of) possible candidates, but you refrain from creating too many candidates, because then you'd run into trouble.
(Also note that it doesn't make much sense to compute up to a Levenshtein edit distance of 9, because technically any word of fewer than 10 letters can be transformed into any other word of fewer than 10 letters in at most 9 transformations.)
Here's why you simply cannot test all words up to a distance of 9 without either getting an OutOfMemory error or a program that simply never terminates:
generating all the variations up to LED 1 for the word "ptmizing", by only adding one letter (from a to z), already gives 9*26 variations (i.e. 234 variations); there are 9 positions where you can insert one of 26 letters.
generating all the variations up to LED 2, by only adding one letter to what we now have, already gives 10*26*234 variations (60 840).
generating all the variations up to LED 3 gives 17 400 240 variations.
And that is only considering the case where we add one, two or three letters (we're not counting deletions, swaps, etc.). And that is for a misspelled word that is only eight characters long. On "real" words, it explodes even faster.
Sure, you could get "smart" and generate these in a way that avoids too many dupes etc., but the point stands: it's a combinatorial explosion, and it explodes quickly.
Anyway... Here's an example. I'm simply passing the dictionary of valid words (containing only four words in this case) to the corresponding method to keep this short.
You'll obviously want to replace the call to the LED with your own LED implementation.
The double-metaphone is just an example: in a real spellchecker, words that "sound alike"
despite a larger LED should be considered "more correct" and hence often be suggested first. For example, "optimizing" and "aupteemising" are quite far apart from a LED point of view, but using the double-metaphone you should get "optimizing" as one of the first suggestions.
(disclaimer: the following was cranked out in a few minutes; it doesn't take into account uppercase, non-English words, etc.: it's not a real spell-checker, just an example)
@Test
public void spellCheck() {
    final String src = "misspeled";
    final Set<String> validWords = new HashSet<String>();
    validWords.add("boing");
    validWords.add("Yahoo!");
    validWords.add("misspelled");
    validWords.add("stackoverflow");
    final List<String> candidates = findNonSortedCandidates(src, validWords);
    final SortedMap<Integer, String> res = computeLevenhsteinEditDistanceForEveryCandidate(candidates, src);
    for (final Map.Entry<Integer, String> entry : res.entrySet()) {
        System.out.println(entry.getValue() + " # LED: " + entry.getKey());
    }
}
private SortedMap<Integer, String> computeLevenhsteinEditDistanceForEveryCandidate(
        final List<String> candidates,
        final String mispelledWord
) {
    final SortedMap<Integer, String> res = new TreeMap<Integer, String>();
    for (final String candidate : candidates) {
        res.put(dynamicProgrammingLED(candidate, mispelledWord), candidate);
    }
    return res;
}

private int dynamicProgrammingLED(final String candidate, final String misspelledWord) {
    return Levenhstein.getLevenshteinDistance(candidate, misspelledWord);
}
Here you generate all possible candidates using several methods. I've only implemented one such method (and quickly so it may be bogus but that's not the point ; )
private List<String> findNonSortedCandidates(final String src, final Set<String> validWords) {
    final List<String> res = new ArrayList<String>();
    res.addAll(allCombinationAddingOneLetter(src, validWords));
    // res.addAll(allCombinationRemovingOneLetter(src));
    // res.addAll(allCombinationInvertingLetters(src));
    return res;
}

private List<String> allCombinationAddingOneLetter(final String src, final Set<String> validWords) {
    final List<String> res = new ArrayList<String>();
    for (char c = 'a'; c <= 'z'; c++) {   // <= so that 'z' is included as well
        for (int i = 0; i < src.length(); i++) {
            final String candidate = src.substring(0, i) + c + src.substring(i, src.length());
            if (validWords.contains(candidate)) {
                res.add(candidate); // only adding candidates we know are valid words
            }
        }
        if (validWords.contains(src + c)) {
            res.add(src + c);
        }
    }
    return res;
}
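As an aside, a double-metaphone implementation is available, for example, in Apache Commons Codec; a small illustrative snippet (the dependency and the ranking policy are assumptions, not part of the code above):
import org.apache.commons.codec.language.DoubleMetaphone;

// Illustrative only: check whether a candidate "sounds like" the misspelled word,
// so such candidates can be ranked ahead of ones that merely have a small LED.
DoubleMetaphone metaphone = new DoubleMetaphone();
String misspelled = "aupteemising";
for (String candidate : new String[] { "optimizing", "stackoverflow" }) {
    boolean soundsAlike = metaphone.isDoubleMetaphoneEqual(misspelled, candidate);
    System.out.println(candidate + " -> sounds alike: " + soundsAlike);
}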
One thing you could try is increasing Java's heap size in order to get past the "out of memory" error.
The following article explains how to increase the heap size of the JVM:
http://viralpatel.net/blogs/2009/01/jvm-java-increase-heap-size-setting-heap-size-jvm-heap.html
But I think the better approach to your problem is to find a better algorithm than the current one.
Well, without more information on the topic there is not much the community can do for you... You can start with the following:
Look at what your profiler says (after it has run for a little while): does anything pile up? Are there a lot of objects? This should normally give you a hint about what is wrong with your code.
Publish your saved dump somewhere and link it in your question, so someone else can take a look at it.
Tell us which profiler you are using; then somebody can give you hints on where to look for valuable information.
After you have narrowed down your problem to a specific part of your code and you cannot figure out why there are so many objects of $FOO in memory, post a snippet of the relevant part.
