I asked a question here. Simply speaking, my algorithm needs a four-dimensional array, and its size could reach 32 GB, so I plan to store it in MongoDB. I have implemented this in my own way, but as I have never used MongoDB before, my implementation is too slow. How should I store this four-dimensional array in MongoDB?
Some stats:
It would take hours (more than ten, I guess; I didn't wait) to update the whole array, as my array size is about 12*7000*100*500. My server is Windows Server 2008 R2 Standard with 16.0 GB RAM and an Intel(R) Xeon(R) CPU at 2.67 GHz. My MongoDB version is 2.4.5.
Let me explain my implementation a bit.
My array has four dimensions; call them z, d, wt, and wv respectively.
First, I construct a string key for each array element. Take the element p_z_d_wt_wv[1][2][3][4] for instance: since z is 1, d is 2, wt is 3, and wv is 4, I get the string "1_2_3_4", which stands for p_z_d_wt_wv[1][2][3][4]. Then I store the value of p_z_d_wt_wv[1][2][3][4] under that key in the database.
So my data looks like this:
{ "_id" : { "$oid" : "51e0c6f15a66ea5c32a99773"} , "key" : "1_2_3_4" , "value" : 113.1232}
{ "_id" : { "$oid" : "51e0c6f15a66ea5c32a99774"} , "key" : "1_2_3_5" , "value" : 11.1243}
Any advice would be appreciated!
Thanks in advance!
Below is my code
import java.net.UnknownHostException;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.Mongo;
import com.mongodb.MongoException;

public class MongoTest {

    private Mongo mongo = null;
    private DB mmplsa;
    private DBCollection p_z_d_wt_wv;
    private DBCollection p_z_d_wt_wv_test;

    public void init() {
        try {
            mongo = new Mongo();
        } catch (UnknownHostException e) {
            e.printStackTrace();
        } catch (MongoException e) {
            e.printStackTrace();
        }
        mmplsa = mongo.getDB("mmplsa");
        p_z_d_wt_wv = mmplsa.getCollection("p_z_d_wt_wv");
    }

    // unique index on the "key" field
    public void createIndex() {
        BasicDBObject query = new BasicDBObject("key", 1);
        p_z_d_wt_wv.ensureIndex(query, null, true);
    }

    public void add(String key, double value) {
        DBObject element = new BasicDBObject();
        element.put("key", key);
        element.put("value", value);
        p_z_d_wt_wv.insert(element);
    }

    public Double query(String key) {
        BasicDBObject specific_key = new BasicDBObject("value", 1).append("_id", false);
        DBObject obj = p_z_d_wt_wv.findOne(new BasicDBObject("key", key), specific_key);
        return (Double) obj.get("value");
    }

    public void update(boolean ifTrainset, String key, double new_value) {
        BasicDBObject query = new BasicDBObject().append("key", key);
        BasicDBObject updated_element = new BasicDBObject();
        updated_element.append("$set", new BasicDBObject().append("value", new_value));
        p_z_d_wt_wv.update(query, updated_element);
    }
}
A few suggestions:
Since your database size has exceeded (is actually about 2x) the size of your RAM, perhaps you should look at sharding. MongoDB works well when you can fit your working set in memory.
Storing the key field as a String not only consumes more memory, but string comparisons are also slower than numeric ones. We can easily store this field as a NumberLong (MongoDB's long data type), since you already know the maximum size of your array is 12*7000*100*500.
I assume the size of any dimension cannot grow beyond 10,000, and consequently the total number of elements in your collection is less than 10000 ** 4.
So if you want the element at p_z_d_wt_wv[1][2][3][4],
you calculate the index as
(10000 ** 0 * 4) + (10000 ** 1 * 3) + (10000 ** 2 * 2) + (10000 ** 3 * 1)
You go right to left, increasing the power of the base, multiply it by whatever index value is in that position, and finally take the sum.
Index this field and we should expect better performance.
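A minimal sketch of that encoding in Java; the helper name encodeKey and the argument order are my own, and it assumes no dimension index ever reaches 10,000:

// Hypothetical helper: pack the four indices into one long, base 10,000.
// Assumes every index is in the range [0, 9999].
static long encodeKey(int z, int d, int wt, int wv) {
    final long BASE = 10_000L;
    return wv                           // BASE^0 * wv
            + wt * BASE                 // BASE^1 * wt
            + d * BASE * BASE           // BASE^2 * d
            + z * BASE * BASE * BASE;   // BASE^3 * z
}

// For example, encodeKey(1, 2, 3, 4) gives 1000200030004,
// which can be stored and indexed as a NumberLong instead of the string "1_2_3_4".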
Since you just have a massive array, I suggest you use a memory-mapped file. This will use about 32 GB of disk space and be much more efficient. Even so, randomly accessing a data set larger than main memory is always going to be slow unless you have a fast SSD (buying more memory would be cheaper).
I would be very surprised if MongoDB performs fast enough for you. If it takes ten hours to update, it is likely to take ten hours to scan once as well. With an SSD, a memory-mapped file could take about three minutes. If the data were all in memory, e.g. you had 48 GB (you would need 32+ GB free, not total), this would drop to seconds.
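For illustration, here is a rough sketch of a memory-mapped 4D array in Java using the dimensions from the question; the class name, chunking scheme, and file path are my own assumptions, not a tested implementation:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Rough sketch of a disk-backed 4D double array using memory-mapped files.
// The dimension sizes come from the question; everything else is an assumption.
public class MappedArray4D implements AutoCloseable {

    private static final long CHUNK = 1L << 30; // map 1 GiB at a time; a single buffer cannot exceed 2 GiB
    private static final long D_Z = 12, D_D = 7000, D_WT = 100, D_WV = 500;

    private final RandomAccessFile file;
    private final FileChannel channel;
    private final MappedByteBuffer[] buffers;

    public MappedArray4D(String path) throws IOException {
        long totalBytes = D_Z * D_D * D_WT * D_WV * 8L; // ~33.6 GB of doubles
        file = new RandomAccessFile(path, "rw");
        file.setLength(totalBytes);
        channel = file.getChannel();
        int chunks = (int) ((totalBytes + CHUNK - 1) / CHUNK);
        buffers = new MappedByteBuffer[chunks];
        for (int i = 0; i < chunks; i++) {
            long offset = (long) i * CHUNK;
            long size = Math.min(CHUNK, totalBytes - offset);
            buffers[i] = channel.map(FileChannel.MapMode.READ_WRITE, offset, size);
        }
    }

    // Row-major byte offset of element [z][d][wt][wv].
    private long offset(int z, int d, int wt, int wv) {
        return (((z * D_D + d) * D_WT + wt) * D_WV + wv) * 8L;
    }

    public double get(int z, int d, int wt, int wv) {
        long off = offset(z, d, wt, wv);
        return buffers[(int) (off / CHUNK)].getDouble((int) (off % CHUNK));
    }

    public void set(int z, int d, int wt, int wv, double value) {
        long off = offset(z, d, wt, wv);
        buffers[(int) (off / CHUNK)].putDouble((int) (off % CHUNK), value);
    }

    @Override
    public void close() throws IOException {
        channel.close();
        file.close();
    }
}

The chunking is needed because a single MappedByteBuffer is limited to 2 GiB; since each element is 8 bytes and the chunk size is a multiple of 8, no element straddles a chunk boundary.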
You cannot beat the limitations of your hardware. ;)
Related
I want to query multiple candidates for a search string which could look like "My sear foo".
Now I want to look for documents which have a field that contains one (or more) of the entered strings (split on whitespace).
I found some code which allows me to do a search by pattern:
@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
String trim = serialNumber.trim();
if (StringUtils.isEmpty(trim)) {
return new ArrayList<>();
}
ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(trim).endKey(trim + "\u9999");
return db.queryView(viewQuery, DeviceEntityCouch.class);
}
which works quite nicely when looking for just one pattern. But how do I have to modify my code to get a multiple "contains" on doc.serialNumber?
EDIT:
This is the current workaround, but there must be a better way, I guess.
Also, this only implements OR logic, so an entry only has to match term1 or term2 to be in the list.
@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
String trim = serialNumber.trim();
if (StringUtils.isEmpty(trim)) {
return new ArrayList<>();
}
String[] split = trim.split(" ");
List<DeviceEntityCouch> list = new ArrayList<>();
for (String s : split) {
ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(s).endKey(s + "\u9999");
list.addAll(db.queryView(viewQuery, DeviceEntityCouch.class));
}
return list;
}
Looks like you are implementing a full text search here. That's not going to be very efficient in CouchDB (I guess the same applies to other databases).
Correct me if I am wrong, but from looking at your code it looks like you are trying to search a list of serial numbers for a pattern. CouchDB (or any other database) is quite efficient if you can somehow index the data you will be searching for.
Otherwise you must fetch every single record and perform a string comparison on it.
The only way I can think of to optimize this in CouchDB would be something like the following (with assumptions):
Your serial numbers are not very long (say 20 chars?)
You force the search to be always 5 characters
Generate a view that emits every single 5-char-long substring of your serial number, more or less like this (could be optimized, and I'm not sure I got the indexing exactly right):
...
for (var i = 0; doc.serialNo.length > 5 && i < doc.serialNo.length - 5; i++) {
emit([doc.serialNo.substring(i, i + 5), doc._id]);
}
...
Use the _count reduce function.
Now the following URL:
http://localhost:5984/test/_design/serial/_view/complex-key?startkey=["01234"]&endkey=["01234",{}]&group=true
Will return a list of documents with a hit count for a key of 01234.
If you don't group and set the reduce option to be false, you will get a list of all matches, including duplicates if a single doc has multiple hits.
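Assuming you are using Ektorp (your createQuery/queryView calls suggest it), the equivalent query in Java might look roughly like this; the design document and view names are taken from the URL below and may need adjusting:

import org.ektorp.ComplexKey;
import org.ektorp.ViewQuery;

public class SerialFragmentQuery {

    // Hedged sketch: build a grouped complex-key lookup for a 5-char fragment such as "01234".
    public static ViewQuery forFragment(String fragment) {
        return new ViewQuery()
                .designDocId("_design/serial")
                .viewName("complex-key")
                .startKey(ComplexKey.of(fragment))
                .endKey(ComplexKey.of(fragment, ComplexKey.emptyObject()))
                .group(true);
        // To list every match instead (including duplicates per document),
        // use .group(false).reduce(false) and read the rows from
        // db.queryView(viewQuery).getRows() rather than mapping them to DeviceEntityCouch.
    }
}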
Refer to http://ryankirkman.com/2011/03/30/advanced-filtering-with-couchdb-views.html for information about complex-key lookups.
I am not sure how efficient CouchDB is in terms of updating that view. It depends on how many records you will have and how many new entries appear between queries of the view (I understand CouchDB rebuilds the view's B-tree on demand).
I have generated a view like that, splitting doc ids into 5-char-long keys. Out of over 1K docs it generated over 30K results (the id being 32 chars long, it's simple maths really: (serialNo.length - searchableKey.length + 1) * docsCount).
Generating the view took a while, but the lookups were fast.
You could generate keys of multiple lengths, etc. All comes down to your records count vs speed of lookups.
Currently I have about 12 csv files, each having about 1.5 million records.
I'm using univocity-parsers as my csv reader/parser library.
Using univocity-parsers, I read each file and add all the records to an arraylist with the addAll() method. When all 12 files are parsed and added to the array list, my code prints the size of the arraylist at the end.
for (int i = 0; i < 12; i++) {
myList.addAll(parser.parseAll(getReader("file-" + i + ".csv")));
}
It works fine at first, but when I reach my 6th consecutive file it seems to take forever; the IntelliJ IDE output window never prints the ArrayList size even after an hour, whereas before the 6th file it was rather fast.
If it helps I'm running on a macbook pro (mid 2014) OSX Yosemite.
I'm the creator of this library. If you just want to count rows, use a RowProcessor. You don't even need to count the rows yourself, as the parser does that for you:
// Let's create our own RowProcessor to analyze the rows
static class RowCount extends AbstractRowProcessor {
long rowCount = 0;
@Override
public void processEnded(ParsingContext context) {
// this returns the number of the last valid record.
rowCount = context.currentRecord();
}
}
public static void main(String... args) throws FileNotFoundException {
// let's measure the time roughly
long start = System.currentTimeMillis();
//Creates an instance of our own custom RowProcessor, defined above.
RowCount myRowCountProcessor = new RowCount();
CsvParserSettings settings = new CsvParserSettings();
//Here you can select the column indexes you are interested in reading.
//The parser will return values for the columns you selected, in the order you defined
//By selecting no indexes here, no String objects will be created
settings.selectIndexes(/*nothing here*/);
//When you select indexes, the columns are reordered so they come in the order you defined.
//By disabling column reordering, you will get the original row, with nulls in the columns you didn't select
settings.setColumnReorderingEnabled(false);
//We instruct the parser to send all rows parsed to your custom RowProcessor.
settings.setRowProcessor(myRowCountProcessor);
//Finally, we create a parser
CsvParser parser = new CsvParser(settings);
//And parse! All rows are sent to your custom RowProcessor (RowCount, in this case)
//I'm using a 150MB CSV file with 3.1 million rows.
parser.parse(new File("c:/tmp/worldcitiespop.txt"));
//Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
System.out.println("Rows: " + myRowCountProcessor.rowCount);
System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");
}
Output
Rows: 3173959
Time taken: 1062 ms
Edit: I saw your comment regarding your need to use the actual data in the rows. In this case, process the rows in the rowProcessed() method of the RowProcessor class; that's the most efficient way to handle this.
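For example, a minimal sketch of that approach (what you do inside the callback is up to you; the class name here is a placeholder):

// Process each row as it is parsed, instead of collecting all rows in a list.
static class RowHandler extends AbstractRowProcessor {
    long rowCount = 0;

    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        rowCount++;
        // work with `row` here (aggregate, write out, or index its values),
        // then let it go so it can be garbage collected
    }
}

Register it via settings.setRowProcessor(new RowHandler()) just like the RowCount example above.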
Edit 2:
If you want to just count rows use getInputDimension from CsvRoutines:
CsvRoutines csvRoutines = new CsvRoutines();
InputDimension d = csvRoutines.getInputDimension(new File("/path/to/your.csv"));
System.out.println(d.rowCount());
System.out.println(d.columnCount());
In parseAll, they preallocate space for 10,000 elements:
/**
* Parses all records from the input and returns them in a list.
*
* @param reader the input to be parsed
* @return the list of all records parsed from the input.
*/
public final List<String[]> parseAll(Reader reader) {
List<String[]> out = new ArrayList<String[]>(10000);
beginParsing(reader);
String[] row;
while ((row = parseNext()) != null) {
out.add(row);
}
return out;
}
If you have millions of records (lines in the file, I guess), this is not good for performance or memory allocation, because the list will repeatedly double in size and copy its contents whenever it needs to allocate more space.
You could try to implement your own parseAll method like this:
public List<String[]> parseAll(Reader reader, int numberOfLines) {
List<String[]> out = new ArrayList<String[]>(numberOfLines);
parser.beginParsing(reader);
String[] row;
while ((row = parser.parseNext()) != null) {
out.add(row);
}
return out;
}
And check if it helps.
The problem is that you are running out of memory. When this happens, the computer begins to crawl, since it starts to swap memory to disk, and vice versa.
Reading the whole contents into memory is definitely not the best strategy to follow. And since you are only interested in calculating some statistics, you do not even need to use addAll() at all.
The objective in computer science is always to strike a balance between memory spent and execution speed: you can trade memory for more speed, or speed for memory savings.
So, loading whole files into memory is comfortable for you, but it is not a solution, not even in the future when computers include terabytes of memory.
public int getNumRecords(CsvParser parser, Reader reader, int start) {
    int toret = start;
    parser.beginParsing(reader);
    while (parser.parseNext() != null) {
        ++toret;
    }
    return toret;
}
As you can see, there is no memory spent in this function (except each single row); you can use it inside a loop for your CSV files, and finish with the total count of rows. The next step is to create a class for all your statistics, substituting that int start with your object.
class Statistics {
    public Statistics() {
        numRows = 0;
        numComedies = 0;
    }
    public void countRow() {
        ++numRows;
    }
    public void countComedies() {
        ++numComedies;
    }
    // more things...
    private int numRows;
    private int numComedies;
}
public int calculateStatistics(CsvParser parser, Reader reader, Statistics stats) {
    int toret = 0;
    parser.beginParsing(reader);
    while (parser.parseNext() != null) {
        stats.countRow();
        ++toret;
    }
    return toret;
}
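A hypothetical way to drive this over the 12 files from the question (the file names match the question's loop; the enclosing method is assumed to declare the IOException):

// Assumes an enclosing method that declares throws IOException.
Statistics stats = new Statistics();
CsvParser parser = new CsvParser(new CsvParserSettings());
int totalRows = 0;
for (int i = 0; i < 12; i++) {
    totalRows += calculateStatistics(parser, new FileReader("file-" + i + ".csv"), stats);
}
System.out.println("Total rows: " + totalRows);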
Hope this helps.
I am getting battery values from a drone and am able to display the new battery value on a JLabel. However, when I try to store these battery values into an int array, it only stores the very first battery value; the subsequent array slots are filled with that same first value.
I show the output below so you can see what is happening. The first value is the reading from the drone, while the second value indicates the array index. The output clearly shows that the array does not receive new data, for an unknown reason.
P.S.: I have no idea what the best array size is, since I am getting values from the drone every second, so I have declared an int array with a size of 9999999. Any idea how I can size an array to cater for continuously incoming battery values from the drone? Those values will be used for drawing a graph later.
My code:
public class arDroneFrame extends javax.swing.JFrame implements Runnable, DroneStatusChangeListener, NavDataListener {
private String text; // string for speech
private static final long CONNECT_TIMEOUT = 10000;
public ARDrone drone;
public NavData data;
public Timer timer = new Timer();
public int batteryGraphic=0;
public int [] arrayBatt = new int[9999999];
public arDroneFrame(String text) {
this.text=text;
}
public arDroneFrame() {
initComponents();
initDrone();
}
private void initDrone() {
try {
drone = new ARDrone();
data = new NavData();
} catch (UnknownHostException ex) {
return;
}
videoDrone.setDrone(drone);
drone.addNavDataListener(this);
}
public void navDataReceived(NavData nd) {
getNavData(nd);
int battery = nd.getBattery();
cmdListOK.jlblBatteryLevelValue.setText(battery + " %");
//JLabel can get updated & always display new battery values
}
public void getNavData(NavData nd){
for(int i=0;i<arrayBatt.length;i++){
batteryGraphic= nd.getBattery();
arrayBatt[i] = batteryGraphic;
System.err.println("This is stored battery values : " + arrayBatt[i] + " " + i + "\n");
}
}
}
public static void main(String args[]) {
java.awt.EventQueue.invokeLater(new Runnable() {
public void run() {
String text = "Welcome!";
arDroneFrame freeTTS = new arDroneFrame(text);
freeTTS.speak();
new arDroneFrame().setVisible(true);
}
});
}
Result:
This is stored battery values : 39 0
This is stored battery values : 39 1
This is stored battery values : 39 2
This is stored battery values : 39 3
This is stored battery values : 39 4
This is stored battery values : 39 5
The problem lies in this method:
public void getNavData(NavData nd){
    for(int i = 0; i < arrayBatt.length; i++){
        batteryGraphic = nd.getBattery();
        arrayBatt[i] = batteryGraphic;
        System.err.println("This is stored battery values : " + arrayBatt[i] + " " + i + "\n");
    }
}
You call this method by passing it a NavData instance. This means that whatever value nd contains for nd.getBattery() is being assigned to every index in your array as the loop iterates over your battery array.
What you should do is move the loop outside of the getNavData(NavData nd) method and pass it a new instance of NavData for each call. When you couple this with the ArrayList suggestion below, you should have a dynamic array of distinct battery values (see the sketch after that suggestion).
Side solution
The way that you have declared this array is REALLY SCARY.
You should only use the space you need and NOTHING more.
I know that you are unsure of what size is actually required, but don't go over-board on it.
You should initialize your array with something smaller:
public int [] arrayBatt = new int[10000];
As a side note: having your class members as public is generally not recommended. You should make them private and create getter/setter methods to retrieve and modify the data, respectively.
Then, have a method that checks to see if your array is full. If it is full, then increase the array size by n/2, where n is the initial size of your array.
The down-side to this approach is that as your array becomes larger, you are going to spend a lot of time copying the old array to the new array, which is pretty undesirable.
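A sketch of that grow-when-full check (the `used` counter is hypothetical; the question's code doesn't track one, and java.util.Arrays must be imported):

// Hypothetical: grow the array by half its current size when it fills up.
if (used == arrayBatt.length) {
    arrayBatt = Arrays.copyOf(arrayBatt, arrayBatt.length + arrayBatt.length / 2);
}
arrayBatt[used++] = nd.getBattery();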
A better solution
Would be to use the built-in ArrayList library, then just append items to your list and let Java do the heavy lifting.
ArrayList<Integer> batteryArray = new ArrayList<Integer>();
You can add items to your list by simply calling:
batteryArray.add(item);
The upside to this solution is that:
The batteryArray size is handled behind-the-scenes
The size of the array is easily retrievable, as well as the elements
ArrayList is a very fast storage structure.
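Putting the two suggestions together, a hedged sketch of what the listener could look like (batteryHistory is my own name, and System.err has been swapped for System.out, as discussed just below):

private final ArrayList<Integer> batteryHistory = new ArrayList<Integer>();

public void navDataReceived(NavData nd) {
    int battery = nd.getBattery();
    cmdListOK.jlblBatteryLevelValue.setText(battery + " %");
    batteryHistory.add(battery);  // one reading stored per notification, no fixed-size array needed
    System.out.println("Stored battery value " + battery + " at index " + (batteryHistory.size() - 1));
}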
In your loop to print out battery values, you could make it a lot cleaner by implementing a for-each loop.
Why are you using System.err to print out dialogs for the battery? This isn't what System.err is meant to be used for, and it violates the Principle of Least Astonishment.
public void getNavData(NavData nd){
    for(int i = 0; i < arrayBatt.length; i++){
        batteryGraphic = nd.getBattery();
        arrayBatt[i] = batteryGraphic;
        System.err.println("This is stored battery values : " + arrayBatt[i] + " " + i + "\n");
    }
}
I assume that there is some event that is triggered by the drone's hardware.
Your loop runs too fast, probably thousands of iterations per second, so there is no time for any battery change and nd.getBattery() returns the same value.
It seems that this is the reason why the values are repeated.
On the other hand, I suspect that navDataReceived is called only when the hardware detects a change and this is why it displays the new value. When getNavData is called you are running a tight loop that locks the execution and prevents your application from receiving this event while the loop is executing.
You should only store a value when you are notified of some change.
I see your implementation of getNavData as fundamentally wrong.
Your 10 million int array is useless in this situation.
I don't know how your application interacts with the drone's hardware but the interface names DroneStatusChangeListener and NavDataListener suggest that you receive some notification when a change occurs.
I have an application which accesses about 2 million tweets from a MySQL database. Specifically, one of the fields holds the tweet text (with a maximum length of 140 characters). I am splitting every tweet into word n-grams, where 1 <= n <= 3. For example, consider the sentence:
I am a boring sentence.
The corresponding nGrams are:
I
I am
I am a
am
am a
am a boring
a
a boring
a boring sentence
boring
boring sentence
sentence
With about 2 million tweets, I am generating a lot of data. Regardless, I am surprised to get a heap error from Java:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2145)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1922)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3423)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:483)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3118)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2288)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2709)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2728)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2678)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1612)
at twittertest.NGramFrequencyCounter.FreqCount(NGramFrequencyCounter.java:49)
at twittertest.Global.main(Global.java:40)
Here is the problem code statement (line 49) as given by the above output from Netbeans:
results = stmt.executeQuery("select * from tweets");
So, if I am running out of memory it must be that it is trying to return all the results at once and then storing them in memory. What is the best way to solve this problem? Specifically I have the following questions:
How can I process pieces of results rather than the whole set?
How would I increase the heap size? (If this is possible)
Feel free to include any suggestions, and let me know if you need more information.
EDIT
Instead of select * from tweets, I partitioned the table into equally sized subsets of about 10% of the total size, then tried running the program. It looked like it was working fine, but it eventually gave me the same heap error. This is strange to me because I have run the same program successfully in the past with 610,000 tweets. Now I have about 2,000,000 tweets, or roughly 3 times as much data, so if I split the data into thirds it should work; yet I went further and split it into subsets of 10%.
Is some memory not being freed? Here is the rest of the code:
results = stmt.executeQuery("select COUNT(*) from tweets");
int num_tweets = 0;
if(results.next())
{
num_tweets = results.getInt(1);
}
int num_intervals = 10; //split into equally sized subets
int interval_size = num_tweets/num_intervals;
for(int i = 0; i < num_intervals-1; i++) //process 10% of the data at a time
{
results = stmt.executeQuery( String.format("select * from tweets limit %s, %s", i*interval_size, (i+1)*interval_size));
while(results.next()) //for each row in the tweets database
{
tweetID = results.getLong("tweet_id");
curTweet = results.getString("tweet");
int colPos = curTweet.indexOf(":");
curTweet = curTweet.substring(colPos + 1); //trim off the RT and retweeted
if(curTweet != null)
{
curTweet = removeStopWords(curTweet);
}
if(curTweet == null)
{
continue;
}
reader = new StringReader(curTweet);
tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
//tokenizer = new StandardFilter(Version.LUCENE_36, tokenizer);
//Set stopSet = StopFilter.makeStopSet(Version.LUCENE_36, stopWords, true);
//tokenizer = new StopFilter(Version.LUCENE_36, tokenizer, stopSet);
tokenizer = new ShingleFilter(tokenizer, 2, 3);
charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
while(tokenizer.incrementToken()) //insert each nGram from each tweet into the DB
{
insertNGram.setInt(1, nGramID++);
insertNGram.setString(2, charTermAttribute.toString().toString());
insertNGram.setLong(3, tweetID);
insertNGram.executeUpdate();
}
}
}
Don't get all the rows from the table. Try to select partial
data based on your requirement by setting limits on the query. Since you are using a MySQL database, your query would be select * from tweets limit 0,10. Here 0 is the starting row and 10 means 10 rows from that start.
You can always increase the heap size available to your JVM using the -Xmx argument. You should read up on all the knobs available to you (e.g. perm gen size). Google for other options or read this SO answer.
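For example (the 4 GB figure is only an illustration; pick a value that fits your machine, and twittertest.Global is the main class from your stack trace):

java -Xmx4g -cp <your-classpath> twittertest.Global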
You probably can't do this kind of problem with a 32-bit machine. You'll want 64 bits and lots of RAM.
Another option would be to treat it as a map-reduce problem. Solve it on a cluster using Hadoop and Mahout.
Have you considered streaming the result set? Halfway down the page linked below is a section on ResultSet, and it addresses your problem (I think?). Write the n-grams to a file, then process the next row? Or am I misunderstanding your problem?
http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html
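Concretely, with MySQL Connector/J you enable streaming by creating a forward-only, read-only statement and setting the fetch size to Integer.MIN_VALUE, as described on that page. The sketch below is only an illustration, with connection details as placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch of Connector/J result-set streaming; the JDBC URL and credentials are placeholders.
public class StreamTweets {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/twitter", "user", "password");
             Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(Integer.MIN_VALUE); // tells Connector/J to stream rows one at a time
            try (ResultSet rs = stmt.executeQuery("select tweet_id, tweet from tweets")) {
                while (rs.next()) {
                    long tweetID = rs.getLong("tweet_id");
                    String tweet = rs.getString("tweet");
                    // build and store the n-grams for this row here, then move on;
                    // only one row is held in memory at a time
                }
            }
        }
    }
}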
In some part of my application, I am parsing a 17 MB log file into a list structure, one LogEntry per line. There are approximately 100K lines/log entries, meaning approx. 170 bytes per line. What surprised me is that I run out of heap space, even when I specify 128 MB (256 MB seems sufficient). How can 10 MB of text, turned into a list of objects, cause a tenfold increase in space?
I understand that String objects use at least twice the amount of space compared to ANSI text (Unicode, one char=2 bytes), but this consumes at least four times that.
What I am looking for is an approximation for how much an ArrayList of n LogEntries will consume, or how my method might create extraneous objects that aggravate the situation (see comment below on String.trim())
This is the data part of my LogEntry class
public class LogEntry {
private Long id;
private String system, version, environment, hostName, userId, clientIP, wsdlName, methodName;
private Date timestamp;
private Long milliSeconds;
private Map<String, String> otherProperties;
This is the part doing the reading
public List<LogEntry> readLogEntriesFromFile(File f) throws LogImporterException {
CSVReader reader;
final String ISO_8601_DATE_PATTERN = "yyyy-MM-dd HH:mm:ss,SSS";
List<LogEntry> logEntries = new ArrayList<LogEntry>();
String[] tmp;
try {
int lineNumber = 0;
final char DELIM = ';';
reader = new CSVReader(new InputStreamReader(new FileInputStream(f)), DELIM);
while ((tmp = reader.readNext()) != null) {
lineNumber++;
if (tmp.length < LogEntry.getRequiredNumberOfAttributes()) {
String tmpString = concat(tmp);
if (tmpString.trim().isEmpty()) {
logger.debug("Empty string");
} else {
logger.error(String.format(
"Invalid log format in %s:L%s. Not enough attributes (%d/%d). Was %s . Continuing ...",
f.getAbsolutePath(), lineNumber, tmp.length, LogEntry.getRequiredNumberOfAttributes(), tmpString)
);
}
continue;
}
List<String> values = new ArrayList<String>(Arrays.asList(tmp));
String system, version, environment, hostName, userId, wsdlName, methodName;
Date timestamp;
Long milliSeconds;
Map<String, String> otherProperties;
system = values.remove(0);
version = values.remove(0);
environment = values.remove(0);
hostName = values.remove(0);
userId = values.remove(0);
String clientIP = values.remove(0);
wsdlName = cleanLogString(values.remove(0));
methodName = cleanLogString(stripNormalPrefixes(values.remove(0)));
timestamp = new SimpleDateFormat(ISO_8601_DATE_PATTERN).parse(values.remove(0));
milliSeconds = Long.parseLong(values.remove(0));
/* remaining properties are the key-value pairs */
otherProperties = parseOtherProperties(values);
logEntries.add(new LogEntry(system, version, environment, hostName, userId, clientIP,
wsdlName, methodName, timestamp, milliSeconds, otherProperties));
}
reader.close();
} catch (IOException e) {
throw new LogImporterException("Error reading log file: " + e.getMessage());
} catch (ParseException e) {
throw new LogImporterException("Error parsing logfile: " + e.getMessage(), e);
}
return logEntries;
}
Utility function used for populating the map
private Map<String, String> parseOtherProperties(List<String> values) throws ParseException {
HashMap<String, String> map = new HashMap<String, String>();
String[] tmp;
for (String s : values) {
if (s.trim().isEmpty()) {
continue;
}
tmp = s.split(":");
if (tmp.length != 2) {
throw new ParseException("Could not split string into key:value :\"" + s + "\"", s.length());
}
map.put(tmp[0], tmp[1]);
}
return map;
}
You also have a Map there, where you store other properties. Your code doesn't show how this Map is populated, but keep in mind that Maps may have a hefty memory overhead compared to the memory needed for the entries themselves.
The size of the array that backs the Map (at least 16 entries * 4 bytes) + one key/value pair per entry + the size of the data themselves. Two map entries, each using 10 chars for key and 10 chars for value, would consume 16*4 + 2*2*4 + 2*10*2 + 2*10*2 + 2*2*8 = 64+16+40+40+32 = 192 bytes (1 char = 2 bytes; a String object consumes a minimum of 8 bytes). That alone would almost double the space requirements for the entire log string.
Add to this that the LogEntry contains 12 Objects, i.e. at least 96 bytes. Hence the log objects alone would need around 100 bytes, give or take some, without the Map and without actual string data. Plus all the pointers for the references (4B each). I count at least 18 with the Map, meaning 72 bytes.
Adding the data (-object references and object "headers" mentioned in the last paragraph):
2 longs = 16B, 1 date stored as long = 8B, the map = 192B. In addition comes the string content, say 90 chars = 180 bytes. Perhaps a byte or two at each end of the list item when put in the list, so in total somewhere around 100+72+16+8+192+180 = 568 ~ 600 bytes per log line.
So around 600 bytes per log line, meaning 100K lines would consume around 60 MB minimum. This would place it at least in the same order of magnitude as the heap size that was set aside. In addition comes the fact that tmpString.trim() in a loop might be creating copies of strings. Similarly, String.format() may also be creating copies. The rest of the application must also fit within this heap space, and might explain where the rest of the memory is going.
Don't forget that each String object consumes space (24 bytes?) for the actual Object definition, plus the reference to the char array, the offset (for substring() usage), etc. So representing a line as 'n' strings will add that additional storage requirement. Can you lazily evaluate these instead within your LogEntry class?
(Re. the String offset usage: prior to Java 7u6, String.substring() acts as a window onto an existing char array, and consequently you need an offset. This has recently changed, and it may be worth determining whether a later JDK build is more memory efficient.)
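A rough sketch of what that lazy evaluation could look like (the class name, delimiter, and column index are assumptions about the log layout, not from the question):

// Hypothetical "lazy" log entry: keep only the raw line and derive fields on demand,
// so each line costs roughly one String instead of a dozen objects.
public class LazyLogEntry {
    private final String rawLine;

    public LazyLogEntry(String rawLine) {
        this.rawLine = rawLine;
    }

    // Derived on demand; nothing besides the raw line is retained.
    public String getSystem() {
        return rawLine.split(";", -1)[0];
    }

    public long getMilliSeconds() {
        // the column index is an assumption about the log layout
        return Long.parseLong(rawLine.split(";", -1)[9]);
    }
}

The trade-off is CPU: every accessor re-parses the line, so this only pays off if most entries are stored but rarely inspected.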