I have an application which accesses about 2 million tweets from a MySQL database. Specifically, one of the fields holds the tweet text (with a maximum length of 140 characters). I am splitting every tweet into word n-grams, where 1 <= n <= 3. For example, consider the sentence:
I am a boring sentence.
The corresponding nGrams are:
I am
I am a
am a
am a boring
a boring
a boring sentence
boring sentence
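For illustration, the splitting shown above can be sketched in plain Java roughly as follows. This is only a minimal sketch: the class and method names are hypothetical, it tokenizes on non-word characters instead of using Lucene, and it is called with n from 2 to 3 because the list above contains only 2-grams and 3-grams.

```java
import java.util.ArrayList;
import java.util.List;

class WordNGrams {
    // Returns all word n-grams with minN <= n <= maxN, grouped by start position.
    static List<String> wordNGrams(String text, int minN, int maxN) {
        List<String> words = new ArrayList<>();
        for (String w : text.split("\\W+")) { // crude tokenizer: split on non-word characters
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < words.size(); start++) {
            for (int n = minN; n <= maxN && start + n <= words.size(); n++) {
                grams.add(String.join(" ", words.subList(start, start + n)));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        for (String gram : wordNGrams("I am a boring sentence.", 2, 3)) {
            System.out.println(gram);
        }
    }
}
```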
With about 2 million tweets, I am generating a lot of data. Regardless, I am surprised to get a heap error from Java:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2145)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1922)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3423)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:483)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3118)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2288)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2709)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2728)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2678)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1612)
at twittertest.NGramFrequencyCounter.FreqCount(NGramFrequencyCounter.java:49)
at twittertest.Global.main(Global.java:40)
Here is the problem code statement (line 49) as given by the above output from Netbeans:
results = stmt.executeQuery("select * from tweets");
So, if I am running out of memory it must be that it is trying to return all the results at once and then storing them in memory. What is the best way to solve this problem? Specifically I have the following questions:
How can I process pieces of results rather than the whole set?
How would I increase the heap size? (If this is possible)
Feel free to include any suggestions, and let me know if you need more information.
Instead of select * from tweets I partitioned the table into equally sized subsets of about 10% of the total size. Then I tried running the program. It looked like it was working fine, but it eventually gave me the same heap error. This is strange to me because I have run the same program successfully in the past with 610,000 tweets. Now I have about 2,000,000 tweets, or roughly 3 times as much data. So if I split the data into thirds it should work, but I went further and split it into subsets of 10%.
Is some memory not being freed? Here is the rest of the code:
results = stmt.executeQuery("select COUNT(*) from tweets");
int num_tweets = 0;
if(results.next()) { //move to the single row holding the count
    num_tweets = results.getInt(1);
}
int num_intervals = 10; //split into equally sized subsets
int interval_size = num_tweets/num_intervals;
for(int i = 0; i < num_intervals-1; i++) { //process 10% of the data at a time
    results = stmt.executeQuery( String.format("select * from tweets limit %s, %s", i*interval_size, (i+1)*interval_size));
    while(results.next()) { //for each row in the tweets database
        tweetID = results.getLong("tweet_id");
        curTweet = results.getString("tweet");
        int colPos = curTweet.indexOf(":");
        curTweet = curTweet.substring(colPos + 1); //trim off the RT and retweeted
        if(curTweet != null) {
            curTweet = removeStopWords(curTweet);
        }
        if(curTweet == null) {
            continue;
        }
        reader = new StringReader(curTweet);
        tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
        //tokenizer = new StandardFilter(Version.LUCENE_36, tokenizer);
        //Set stopSet = StopFilter.makeStopSet(Version.LUCENE_36, stopWords, true);
        //tokenizer = new StopFilter(Version.LUCENE_36, tokenizer, stopSet);
        tokenizer = new ShingleFilter(tokenizer, 2, 3);
        charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
        while(tokenizer.incrementToken()) { //insert each nGram from each tweet into the DB
            insertNGram.setInt(1, nGramID++);
            insertNGram.setString(2, charTermAttribute.toString());
            insertNGram.setLong(3, tweetID);
        }
    }
}
Don't fetch all rows from the table at once. Select the data in parts by adding a limit to the query. Since you are using MySQL, the query would be select * from tweets limit 0,10. Here 0 is the offset of the first row to return and 10 is the maximum number of rows to return.
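As a rough sketch of how such paging could be driven from Java (the class and method names are made up for illustration), the helper below builds one query per page. The point it tries to show is that the second LIMIT argument is a row count, so it stays fixed while only the offset advances. In the loop from the question the second argument is (i+1)*interval_size, which grows on every iteration.

```java
import java.util.ArrayList;
import java.util.List;

class TweetPager {
    // Builds one query per page. MySQL's LIMIT takes (offset, rowCount),
    // so pageSize stays constant and only the offset moves forward.
    static List<String> pageQueries(int totalRows, int pageSize) {
        List<String> queries = new ArrayList<>();
        for (int offset = 0; offset < totalRows; offset += pageSize) {
            queries.add(String.format("select * from tweets limit %d, %d", offset, pageSize));
        }
        return queries;
    }

    public static void main(String[] args) {
        for (String query : pageQueries(25, 10)) {
            System.out.println(query);
        }
    }
}
```

Without an ORDER BY on a unique column the order of rows is not guaranteed between queries, so a real implementation would probably want to add one.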
You can always increase the heap size available to your JVM using the -Xmx argument. You should read up on all the knobs available to you (e.g. perm gen size). Google for other options or read this SO answer.
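To check that an -Xmx setting is actually being picked up, something like the following small program can help. It is only a sketch with a hypothetical class name; run it with, for example, java -Xmx1g HeapInfo and compare the printed value.

```java
class HeapInfo {
    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```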
You probably can't do this kind of problem with a 32-bit machine. You'll want 64 bits and lots of RAM.
Another option would be to treat it as a map-reduce problem. Solve it on a cluster using Hadoop and Mahout.
Have you considered streaming the result set? Halfway down the page is a section on result set, and it addresses your problem (I think?) Write the n grams to a file, then process the next row? Or, am I misunderstanding your problem?
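A minimal sketch of what streaming could look like with MySQL Connector/J is below. The driver documentation describes a forward-only, read-only statement with a fetch size of Integer.MIN_VALUE as the signal to stream rows one at a time. The class name, method name, and the handler callback are assumptions for illustration; the table and column names are taken from the question.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.function.Consumer;

class TweetStreamer {
    // Connector/J treats this fetch size as "stream rows one by one".
    static final int STREAM_ROW_BY_ROW = Integer.MIN_VALUE;

    // Passes each tweet text to the handler without holding the whole result in memory.
    static long forEachTweet(Connection conn, Consumer<String> handler) throws SQLException {
        long rows = 0;
        try (Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                   ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(STREAM_ROW_BY_ROW);
            try (ResultSet rs = stmt.executeQuery("select tweet_id, tweet from tweets")) {
                while (rs.next()) {
                    handler.accept(rs.getString("tweet"));
                    rows++;
                }
            }
        }
        return rows;
    }
}
```

While a streaming result set is open, the same connection cannot be used for other statements, so the n-gram inserts would probably need a second connection.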
To be clear, for everyone who rushes to mark this kind of post as a duplicate without reading it: I am not asking what null is or how to handle these exceptions. I am asking why Twitter's API returns a null object to my method, seemingly at random.
I am creating a Java application that interacts with the Twitter API using the Twitter4J library. I want to download a large number of tweets and then compute statistics on the offline data. Tweets are saved in a NoSQL database (Elasticsearch).
My code worked fine while it was only printing the tweets to the console for testing. When my program hit the maximum number of tweets, it slept until Twitter's rate limit reset (more than 1,000,000 tweets were printed with zero errors). The problem appeared after I started saving the tweets in my database: after some loops, I get a java.lang.NullPointerException in this exact statement: if (searchTweetsRateLimit.getRemaining() == 0). Any suggestions?
public static void main(String[] args) throws TwitterException, InterruptedException {
    int totalTweets = 0;
    long maxID = -1;
    twitter4j.Twitter twitter = getTwitter();
    RestClient restclient = RestClient.builder(
            new HttpHost("localhost",9200,"http"),
            new HttpHost("localhost",9201,"http")).build();
    Map<String, RateLimitStatus> rateLimitStatus = twitter.getRateLimitStatus("search");
    // This finds the rate limit specifically for doing the search API call we use in this program
    RateLimitStatus searchTweetsRateLimit = rateLimitStatus.get("/search/tweets");
    System.out.printf("You have %d calls remaining out of %d, Limit resets in %d seconds\n",
            searchTweetsRateLimit.getRemaining(),
            searchTweetsRateLimit.getLimit(),
            searchTweetsRateLimit.getSecondsUntilReset());
    int i = 10;
    // This is the loop that retrieves multiple blocks of tweets from Twitter
    for (int queryNumber=0;queryNumber < MAX_QUERIES; queryNumber++) {
        System.out.printf("\n\n!!! Starting loop %d\n\n", queryNumber);
        // Do we need to delay because we've already hit our rate limits?
        if (searchTweetsRateLimit.getRemaining() == 0) {
            // Yes we do, unfortunately ...
            System.out.printf("!!! Sleeping for %d seconds due to rate limits\n", searchTweetsRateLimit.getSecondsUntilReset());
            // If you sleep exactly the number of seconds, you can make your query a bit too early
            // and still get an error for exceeding rate limitations
            Thread.sleep((searchTweetsRateLimit.getSecondsUntilReset()+2) * 1000L);
        }
        Query q = new Query(SEARCH_TERM); // Search for tweets that contain this term
        q.setCount(TWEETS_PER_QUERY); // How many tweets, max, to retrieve
        q.resultType(null); // Get all tweets
        q.setLang("en"); // English language tweets, please
    }
}
I want to query multiple candidates for a search string which could look like "My sear foo".
Now I want to look for documents which have a field that contains one (or more) of the entered strings (the input split on whitespace).
I found some code which allows me to do a search by pattern:
@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
    String trim = serialNumber.trim();
    if (StringUtils.isEmpty(trim)) {
        return new ArrayList<>();
    }
    ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(trim).endKey(trim + "\u9999");
    return db.queryView(viewQuery, DeviceEntityCouch.class);
}
This works quite nicely when looking for just one pattern. But how do I have to modify my code to get a multiple "contains" match on doc.serialNumber?
This is the current workaround, but I guess there must be a better way.
Also, it only supports OR logic: an entry is in the list if it matches term1 or term2.
@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
    String trim = serialNumber.trim();
    if (StringUtils.isEmpty(trim)) {
        return new ArrayList<>();
    }
    String[] split = trim.split(" ");
    List<DeviceEntityCouch> list = new ArrayList<>();
    for (String s : split) {
        ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(s).endKey(s + "\u9999");
        list.addAll(db.queryView(viewQuery, DeviceEntityCouch.class));
    }
    return list;
}
Looks like you are implementing a full text search here. That's not going to be very efficient in CouchDB (I guess same applies to other databases).
Correct me if I am wrong but from looking at your code looks like you are trying to search a list of serial numbers for a pattern. CouchDB (or any other database) is quite efficient if you can somehow index the data you will be searching for.
Otherwise you must fetch every single record and perform a string comparison on it.
The only way I can think of to optimize this in CouchDB would be something like the following (with assumptions):
Your serial numbers are not very long (say 20 chars?)
You force the search to be always 5 characters
Generate a view that emits every 5-character substring of your serial number, more or less like this (it could be optimized, and I am not sure I got the indexes right):
for (var i = 0; i + 5 <= doc.serialNo.length; i++) {
    emit([doc.serialNo.substring(i, i + 5), doc._id]);
}
Use _count reduce function
Now the following url:
Will return a list of documents with a hit count for a key of 01234.
If you don't group and set the reduce option to be false, you will get a list of all matches, including duplicates if a single doc has multiple hits.
Refer to http://ryankirkman.com/2011/03/30/advanced-filtering-with-couchdb-views.html for the information about complex keys lookups.
I am not sure how efficient CouchDB is at updating that view. It depends on how many records you have and how many new entries appear between queries of the view (as I understand it, CouchDB rebuilds the view's B-tree on demand).
I have generated a view like that which splits doc ids into 5-character keys. Out of over 1K docs it generated over 30K results, the id being 32 characters long. Simple maths really: (serialNo.length - searchablekey.length + 1) * docscount.
Generating the view took a while, but the lookups were fast.
You could generate keys of multiple lengths, etc. It all comes down to your record count vs. the speed of lookups.
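To make the key count concrete, here is a small Java sketch of the same substring generation (not CouchDB code; the class and method names are hypothetical). It produces length - keyLength + 1 keys per value, which is the per-document factor in the formula above.

```java
import java.util.ArrayList;
import java.util.List;

class SubstringKeys {
    // Returns every substring of the given length, in order of start position.
    static List<String> keys(String serial, int keyLength) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + keyLength <= serial.length(); i++) {
            out.add(serial.substring(i, i + keyLength));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(keys("0123456789", 5));
    }
}
```

For a 32-character id and 5-character keys that is 28 keys per document, so the index grows quickly with the number of documents.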
My usual procedure for getting the dimensions of a CSV file is as follows:
Get how many rows it has:
I use a while loop to read every line and increment a counter on each successful read. The downside is that it takes time to read the whole file just to count how many rows it has.
Then get how many columns it has:
I use String[] temp = lineOfText.split(","); and then take the length of temp.
Is there any smarter method? Like:
file1 = read.csv;
xDimention = file1.xDimention;
yDimention = file1.yDimention;
I guess it depends on how regular the structure is, and whether you need an exact answer or not.
I could imagine looking at the first few rows (or randomly skipping through the file), and then dividing the file size by average row size to determine a rough row count.
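A rough sketch of that estimate is below. The names are hypothetical, it samples only the first lines of the file, and it assumes UTF-8 text with one-byte line terminators, so the result is an approximation that is only as good as the sample is representative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class RowEstimator {
    // Estimates the row count as fileSize / averageSampledRowSize.
    static long estimateRows(long fileSizeBytes, List<String> sampleLines) {
        if (sampleLines.isEmpty()) {
            return 0;
        }
        long sampledBytes = 0;
        for (String line : sampleLines) {
            // +1 assumes a one-byte '\n' terminator per row
            sampledBytes += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        double averageRowBytes = (double) sampledBytes / sampleLines.size();
        return Math.round(fileSizeBytes / averageRowBytes);
    }

    // Reads up to sampleSize lines from the start of the file.
    static List<String> firstLines(Path file, int sampleSize) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while (lines.size() < sampleSize && (line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        System.out.println("Estimated rows: " + estimateRows(Files.size(file), firstLines(file, 100)));
    }
}
```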
If you control how these files get written, you could potentially tag them or add a metadata file next to them containing row counts.
Strictly speaking, the way you're splitting the line doesn't cover all possible cases. "hello, world", 4, 5 should read as having 3 columns, not 4.
Your approach won't work with multi-line values (you'll get an invalid number of rows) or with quoted values that happen to contain the delimiter (you'll get an invalid number of columns).
You should use a CSV parser such as the one provided by univocity-parsers.
Using the uniVocity CSV parser, the fastest way to determine the dimensions would be with the following code. It parses a 150MB file to give its dimensions in 1.2 seconds:
// Let's create our own RowProcessor to analyze the rows
static class CsvDimension extends AbstractRowProcessor {
    int lastColumn = -1;
    long rowCount = 0;

    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        rowCount++;
        if (lastColumn < row.length) {
            lastColumn = row.length;
        }
    }
}

public static void main(String... args) throws FileNotFoundException {
    // let's measure the time roughly
    long start = System.currentTimeMillis();
    //Creates an instance of our own custom RowProcessor, defined above.
    CsvDimension myDimensionProcessor = new CsvDimension();
    CsvParserSettings settings = new CsvParserSettings();
    //This tells the parser that no row should have more than 2,000,000 columns
    settings.setMaxColumns(2000000);
    //Here you can select the column indexes you are interested in reading.
    //The parser will return values for the columns you selected, in the order you defined
    //By selecting no indexes here, no String objects will be created
    settings.selectIndexes(/*nothing here*/);
    //When you select indexes, the columns are reordered so they come in the order you defined.
    //By disabling column reordering, you will get the original row, with nulls in the columns you didn't select
    settings.setColumnReorderingEnabled(false);
    //We instruct the parser to send all rows parsed to your custom RowProcessor.
    settings.setRowProcessor(myDimensionProcessor);
    //Finally, we create a parser
    CsvParser parser = new CsvParser(settings);
    //And parse! All rows are sent to your custom RowProcessor (CsvDimension)
    //I'm using a 150MB CSV file with 1.3 million rows.
    parser.parse(new FileReader(new File("c:/tmp/worldcitiespop.txt")));
    //Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
    System.out.println("Columns: " + myDimensionProcessor.lastColumn);
    System.out.println("Rows: " + myDimensionProcessor.rowCount);
    System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");
}
The output will be:
Columns: 7
Rows: 3173959
Time taken: 1279 ms
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
IMO, What you are doing is an acceptable way to do it. But here are some ways you could make it faster:
Rather than reading lines, which creates a new String Object for each line, just use String.indexOf to find the bounds of your lines
Rather than using line.split, again use indexOf to count the number of commas
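The second suggestion could look something like this minimal sketch (hypothetical names). It counts commas with indexOf instead of building an array with split, and like split(",") it does not understand quoted fields that contain commas.

```java
class CommaCounter {
    // Counts columns as (number of commas + 1) without allocating any substrings.
    static int countColumns(String line) {
        int commas = 0;
        int from = 0;
        while ((from = line.indexOf(',', from)) != -1) {
            commas++;
            from++; // continue searching after the comma just found
        }
        return commas + 1;
    }

    public static void main(String[] args) {
        System.out.println(countColumns("a,b,c"));
    }
}
```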
I guess these are the options, depending on how you use the data:
Store the dimensions of your CSV file when writing the file (in the first row or in an additional file)
Use a more efficient way to count lines - maybe http://docs.oracle.com/javase/6/docs/api/java/io/LineNumberReader.html
Instead of creating arrays of fixed size (assuming that's what you need the line count for), use array lists - this may or may not be more efficient depending on the size of the file.
To find the number of rows you have to read the whole file. There is nothing you can do about that. However, your method of finding the number of columns is a bit inefficient. Instead of using split, just count how many times "," appears in the line. You might also add a special case for fields enclosed in quotes, as mentioned by @Vlad.
The String.split method creates an array of strings as its result and splits using a regexp, which is not very efficient.
I found this short but interesting solution here:
LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")));
lnr.skip(Long.MAX_VALUE); //Read to the end of the file so that every line is counted
System.out.println(lnr.getLineNumber() + 1); //Add 1 because line index starts at 0
lnr.close();
My solution is simple and correctly processes CSV files with multiline cells or quoted values.
For example, say we have this CSV file:
And my solution snippet is:
import java.io.*;

public class CsvDimension {
    public void parse(Reader reader) throws IOException {
        long cells = 0;
        int lines = 0;
        int c;
        boolean quoted = false;
        while ((c = reader.read()) != -1) {
            if (c == '"') {
                quoted = !quoted;
            }
            if (!quoted) {
                if (c == '\n') {
                    lines++;
                    cells++;
                }
                if (c == ',') {
                    cells++;
                }
            }
        }
        System.out.printf("lines : %d\n cells %d\n cols: %d\n", lines, cells, cells / lines);
    }

    public static void main(String args[]) throws IOException {
        new CsvDimension().parse(new BufferedReader(new FileReader(new File("test.csv"))));
    }
}
I have solved a simple problem on CodeEval in various ways; its specification can be found here (only a few lines long).
I have made 3 working versions (one of them in Scala), and I don't understand the performance difference for my last Java version, which I expected to be the best in both time and memory.
I also compared this to a code found on Github. Here are the performance stats returned by CodeEval :
. Version 1 is the version found on Github
. Version 2 is my Scala solution :
object Main extends App {
val p = Pattern.compile("\\d+")
.map(line => {
val dists = new TreeSet[Int]
val m = p.matcher(line)
while (m.find) dists += m.group.toInt
val list = dists.toList
list.zip(0 +: list).map { case (x,y) => x - y }.mkString(",")
. Version 3 is my Java solution which I expected to be the best :
import java.io.*;
import java.util.*;
import java.util.regex.*;

public class Main {
    public static void main(String[] args) throws IOException {
        Pattern p = Pattern.compile("\\d+");
        File file = new File(args[0]);
        BufferedReader br = new BufferedReader(new FileReader(file));
        String line;
        while ((line = br.readLine()) != null) {
            Set<Integer> dists = new TreeSet<Integer>();
            Matcher m = p.matcher(line);
            while (m.find()) dists.add(Integer.parseInt(m.group()));
            Iterator<Integer> it = dists.iterator();
            int prev = 0;
            StringBuilder sb = new StringBuilder();
            while (it.hasNext()) {
                int curr = it.next();
                sb.append(curr - prev);
                sb.append(it.hasNext() ? "," : "");
                prev = curr;
            }
            System.out.println(sb);
        }
    }
}
Version 4 is the same as version 3, except that I don't use a StringBuilder to build the output and instead print as in version 1.
Here is how I interpreted those results:
version 1 is too slow because of the high number of System.out.print calls. Moreover, using split on very large lines (which is the case in the tests performed) uses a lot of memory.
version 2 seems slow too, but that is mainly because of an "overhead" of running Scala code on CodeEval; even very efficient code runs slowly on it
version 2 uses unnecessary memory to build a list from the set, which also takes some time but should not be too significant. Writing more efficient Scala would probably be like writing it in Java, so I preferred elegance to performance
version 3 should not use that much memory in my opinion. The use of a StringBuilder has the same impact on memory as calling mkString in version 2
version 4 proves that the calls to System.out.println are slowing down the program
Does anyone see an explanation for these results?
I conducted some tests.
There is a baseline for every type of language. I code in java and javascript. For javascript here are my test results:
Rev 1: Default empty boilerplate for JS with a message to standard output
Rev 2: Same without file reading
Rev 3: Just a message to the standard output
You can see that no matter what, there will be at least 200 ms of runtime and about 5 MB of memory usage. This baseline depends on the load of the servers as well! There was a time when CodeEval was heavily overloaded, making it impossible to run anything within the max time (10 s).
Check this out, a totally different challenge than the previous:
Rev4: My solution
Rev5: The same code submitted again now. It scored 8000 more ranking points. :D
Conclusion: I would not worry too much about CPU and memory usage and rank. It is clearly not reliable.
Your scala solution is slow, not because of "overhead on CodeEval", but because you are building an immutable TreeSet, adding elements to it one by one. Replacing it with something like
val regex = """\d+""".r // in the beginning, instead of your Pattern.compile
.map { line =>
val dists = regex.findAllIn(line).map(_.toInt).toIndexedSeq.sorted
Should shave about 30-40% off your execution time.
Same approach (build a list, then sort) will, probably, help your memory utilization in "version 3" (java sets are real memory hogs). It is also a good idea to give your list an initial size while you are at it (otherwise, it'll grow by 50% every time it runs out of capacity, which is wasteful in both memory and performance). 600 sounds like a good number, since that's the upper bound for the number of cities from the problem description.
Now, since we know the upper bound, an even faster and slimmer approach is to do away with lists and boxed Integers, and just use int dists[] = new int[600];.
If you wanted to get really fancy, you'd also make use of the "route length" range that's mentioned in the description. For example, instead of throwing ints into an array and sorting (or keeping a treeset), make an array of 20,000 bits (or even 20K bytes for speed), and set those that you see in input as you read it ... That would be both faster and more memory efficient than any of your solutions.
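A minimal sketch of that last idea, using java.util.BitSet, might look like the following. The class name, the 20,000 bound, and the sample input format ("Name,distance;" pairs with no digits in the names) are assumptions based on the numbers mentioned in this answer and the regex used in the question, not on the problem statement itself.

```java
import java.util.BitSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RouteDistances {
    private static final Pattern NUMBER = Pattern.compile("\\d+");
    private static final int MAX_DISTANCE = 20000; // assumed upper bound on a route length

    // Marks every distance seen as a bit, then walks the set bits in ascending
    // order and emits the gap to the previous one. No sorting or boxing needed.
    static String gaps(String line) {
        BitSet seen = new BitSet(MAX_DISTANCE + 1);
        Matcher m = NUMBER.matcher(line);
        while (m.find()) {
            seen.set(Integer.parseInt(m.group()));
        }
        StringBuilder sb = new StringBuilder();
        int prev = 0;
        for (int d = seen.nextSetBit(0); d >= 0; d = seen.nextSetBit(d + 1)) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(d - prev);
            prev = d;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(gaps("Rkbs,5453; Wdqiz,1245; Rwds,3890; Ujma,5589; Tbzmo,1303;"));
    }
}
```

Like the TreeSet in the question, the bit set collapses duplicate distances into one entry.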
I tried solving this question and figured that you don't need the names of the cities, just the distances in a sorted array.
It has much better runtime of 738ms, and memory of 4513792 with this.
Although this may not help improve your piece of code, it seems like a better way to approach the question. Any suggestions to improve the code further are welcome.
import java.io.*;
import java.util.*;

public class Main {
    public static void main (String[] args) throws IOException {
        File file = new File(args[0]);
        BufferedReader buffer = new BufferedReader(new FileReader(file));
        String line;
        while ((line = buffer.readLine()) != null) {
            line = line.trim();
            String out = new Main().getDistances(line);
            System.out.println(out);
        }
    }

    public String getDistances(String s){
        //split the string
        String[] arr = s.split(";");
        //create an array to hold the distances as integers
        int[] distances = new int[arr.length];
        for(int i=0; i<arr.length; i++){
            //find the index of , - get the characters after that - convert to integer - add to distances array
            distances[i] = Integer.parseInt(arr[i].substring(arr[i].lastIndexOf(",")+1));
        }
        //sort the array
        Arrays.sort(distances);
        String output = "";
        output += distances[0]; //append the distance to the closest city to the string
        for(int i=0; i<arr.length-1; i++){
            //get distance between current element(city) and next
            int distance_between = distances[i+1] - distances[i];
            //append the distance to the string
            output += "," + distance_between;
        }
        return output;
    }
}
My query returns with 31,000 results with 12 columns for each row, and each row contains roughly 8,000 characters (8KB per row). Here is how I processed:
public List<MyTableObj> getRecords(Connection con) {
    List<MyTableObj> list = new ArrayList<MyTableObj>();
    String sql = "my query...";
    ResultSet rs = null;
    Statement st = null;
    try {
        st = con.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        rs = st.executeQuery(sql);
        System.out.println("Before MemoryFreeSize = " + (double)Runtime.getRuntime().freeMemory() / 1024 / 1024 + " MB");
        while ( rs.next() ) {
            MyTableObj item = new MyTableObj();
            item.setColumn1( rs.getString("column1") );
            ... ...
            item.setColumn12( rs.getString("column12") );
            list.add( item );
        } // end loop
        // try to release some memory, but it's not working at all
        if ( rs != null ) rs.close();
        if ( st != null ) st.close();
        st = null; rs = null;
    }
    catch ( Exception e ) { /* do something */ }
    System.out.println("After MemoryFreeSize = " + (double)Runtime.getRuntime().freeMemory() / 1024 / 1024 + " MB");
    return list;
} // end getRecords
If each row takes 8 KB of memory, 31,000 rows should take about 242 MB. After looping over the query result, my remaining memory is only 142 MB, which is not enough to finish the rest of my processing.
I searched for solutions and tried setting my heap memory to 512 MB (-Xmx512m -Xms512m); I also set the fetch size with setFetchSize(50).
I suspect the ResultSet occupies too much memory; the results may be stored in a client-side cache. However, even after I closed some objects (st.close() and rs.close()) and manually called the garbage collector with System.gc(), the free memory after the loop never increased (why?).
Let's just assume I cannot change the database design, and I need all the query results. How can I free more memory after processing?
P.S.: I also tried replacing ResultSet.getString() with a hardcoded String, and after looping I had 450 MB of free memory.
I found that, if I do:
// + counter to make the value different for each row, for testing purpose
item.setColumn1( "Constant String from Column1" + counter );
... ...
item.setColumn12( "Constant String from Column12" + counter );
It used only around 60MB memory.
But if I do:
item.setColumn1( rs.getString("column1") );
... ...
item.setColumn12( rs.getString("column12") );
It used up to 380MB memory.
I already called rs.close() and set rs = null (rs is the ResultSet instance), but this does not seem to help. Why is there so much difference in memory usage between these two approaches? In both approaches I only pass in a String.
You should narrow down your queries: try to be more specific and, if necessary, add a limit to them. Your Java process can't handle results that are too large.
If you need all the data in memory at the same time (so you can't process it in chunks), then you'll need to have enough memory for it. Try it with 1 GB of memory.
Forget calling System.gc(); that's not going to help (the JVM runs a garbage collection before it throws an OutOfMemoryError anyway).
I also noticed you're not closing the connection. You should probably do that as well (if you don't have a connection pool yet, set one up).
And of course you can use a profiler to see where the memory is actually going to. What you're doing now is pure guesswork.
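Short of a profiler, a slightly less misleading quick check than freeMemory() alone is to look at used heap, i.e. totalMemory() minus freeMemory(), because the total can change while the program runs. This is only a rough sketch with hypothetical names; the numbers still include garbage that has not been collected yet, so it does not replace a profiler.

```java
class HeapUsage {
    // Heap currently in use: what the JVM has reserved minus what is still free in it.
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedBytes();
        byte[] block = new byte[50 * 1024 * 1024]; // stand-in for the loaded rows
        long after = usedBytes();
        System.out.println("Used before: " + before / (1024 * 1024) + " MB");
        System.out.println("Used after:  " + after / (1024 * 1024) + " MB (block of " + block.length + " bytes)");
    }
}
```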
I don't think many people will encounter this issue, but I would still like to post my solution for reference.
Before in my code,
The query is:
String sql = "SELECT column1, column2 ... FROM mytable";
and the setter for MyTableObj is:
public void setColumn1(String columnStr) {
    this._columnStr = columnStr == null ? "" : columnStr.trim();
}
After update:
The only change is to do the trim in the query instead of in the Java code:
String sql = "SELECT TRIM(column1), TRIM(column2) ... FROM mytable";
public void setColumn1(String columnStr) {
    this._columnStr = columnStr == null ? "" : columnStr;
}
With this updated code, it takes only roughly 100 MB of memory, which is a lot less than the previous usage (380 MB).
I still cannot give a valid reason why the Java trim consumes more memory than the SQL trim. If anyone knows the reason, please explain it; I would appreciate it a lot.
After many tests, I found that the cause is the data. Each row takes 8 KB, and 31,000 rows take about 240 MB of memory. TRIM in the query only helps for short data.
Since the data is large and memory is limited, I can only limit my query result for now.