I am trying to convert an input byte array into a collection of another data structure. The input byte array contains many bytes that correspond to a data structure I call Record, which consists of 8 bytes for the data (a long) and 8 bytes for the key (a double) per record. I want to define a function that does this conversion automatically because I will do this process very frequently. This is the function as I have it right now:
public Record[] bytesToRecord(byte[] byteArray) {
    Record[] arrayRecords = new Record[(byteArray.length/16)];
    for (int i = 0; i <= (byteArray.length); i+=16) {
        arrayRecords[i] = new Record(Arrays.copyOfRange(byteArray, i, i + 16));
    }
    return arrayRecords;
}
As you can see above, the function takes an array of bytes and loops through it in steps of 16 bytes, creating a new Record object for each chunk and storing it in arrayRecords, the Record array that collects the results. The problem is that something seems to go wrong and my function is not taking exactly 16 bytes per record, so when a Record object is created I get a NullPointerException in the Record class because it cannot properly slice the 16-byte subarray into the long and double values for data and key. Here is the Record class with its constructor:
public class Record {
    private byte[] record;
    long idData = ByteBuffer.wrap(Arrays.copyOfRange(record, 0,9)).getLong();
    double key = ByteBuffer.wrap(Arrays.copyOfRange(record, 9,16)).getDouble();

    public Record(byte[] recordArray) {
        this.record = recordArray;
    }
}
I hope somebody can help me to fix this function or suggest another method to do what I am trying to do.
I identified four issues in your code that lead to the failures you are seeing. I will go over each of them separately.
Problem 1: The problem giving you the NullPointerException is in your Record class. When a new instance is created, the field initializers run before the constructor body executes. In your case this means ByteBuffer.wrap(Arrays.copyOfRange(record, 0,9)).getLong(); runs before this.record = recordArray;, so the first statement operates on an array that has not been assigned yet and is still null.
Problem 2: The index you use to mark the start of the double in the array is off by one. The long occupies the 8 bytes at indices 0 to 7 and the double the remaining 8 bytes at indices 8 to 15. Remember that the end index passed to Arrays.copyOfRange is exclusive, i.e. one past the last index you want to copy.
So, to solve these two issues you have to change your Record to something like this:
public class Record {
    long idData;
    double key;

    public Record(byte[] recordArray) {
        this.idData = ByteBuffer.wrap(Arrays.copyOfRange(recordArray, 0, 8)).getLong();
        this.key = ByteBuffer.wrap(Arrays.copyOfRange(recordArray, 8, 16)).getDouble();
    }
}
Problem 3: In the function bytesToRecord you have a for loop starting at index zero that runs while the index is less than or equal to the length of the array. The condition has to be changed to strictly less than, otherwise the loop runs one iteration too many and reads past the end of the array.
Problem 4: You use the same loop index to address entries in both the Record[] and the byte[], which have completely different lengths. The simplest solution is to iterate over the Record[] and calculate the corresponding index into the byte[].
Something like this should solve these two issues:
public static Record[] bytesToRecord(byte[] byteArray) {
    Record[] arrayRecords = new Record[(byteArray.length / 16)];
    for (int i = 0; i < arrayRecords.length; i++) {
        arrayRecords[i] = new Record(Arrays.copyOfRange(byteArray, i * 16, i * 16 + 16));
    }
    return arrayRecords;
}
Edit: In my opinion, an enhancement to your approach would be to limit the Record class to only storing the values and to do all the computations inside bytesToRecord. This way you don't have to copy parts of the array, which saves memory. The code would then look like this:
public class Record {
    long idData;
    double key;

    public Record(long idData, double key) {
        this.idData = idData;
        this.key = key;
    }
}

public static Record[] bytesToRecord(byte[] byteArray) {
    Record[] arrayRecords = new Record[(byteArray.length / 16)];
    for (int i = 0; i < arrayRecords.length; i++) {
        long id = ByteBuffer.wrap(byteArray, i * 16, 8).getLong();
        double key = ByteBuffer.wrap(byteArray, i * 16 + 8, 8).getDouble();
        arrayRecords[i] = new Record(id, key);
    }
    return arrayRecords;
}
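To illustrate, here is a quick usage sketch (the sample values are made up; it assumes the Record class and the static bytesToRecord above are in scope):

// Build 32 bytes = 2 records, each an 8-byte long followed by an 8-byte double
ByteBuffer input = ByteBuffer.allocate(32);
input.putLong(42L).putDouble(3.14);
input.putLong(7L).putDouble(2.71);

Record[] records = bytesToRecord(input.array());
System.out.println(records.length);      // 2
System.out.println(records[0].idData);   // 42
System.out.println(records[1].key);      // 2.71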
I am getting a timeout error for code I wrote using HashMap and Java 8. When I submitted my answer on the HackerRank platform, 5 out of 14 test cases failed due to the timeout.
Below is the question
You are given a number of queries. Each query consists of two integers and is one of the operations described below:
1 x : Insert x into your data structure.
2 y : Delete one occurrence of y from your data structure, if present.
3 z : Check if any integer is present whose frequency is exactly z. If yes, print 1, else print 0.
The queries are given in the form of a 2-D array where queries[i][0] contains the operation and queries[i][1] contains the data element.
How should I optimize this code further?
static HashMap<Integer,Integer> buffer = new HashMap<Integer,Integer>();

// Complete the freqQuery function below.
static List<Integer> freqQuery(List<List<Integer>> queries) {
    List<Integer> output = new ArrayList<>();
    output = queries.stream()
            .map(query -> performQuery(query))
            .filter(v -> v != -1)
            .collect(Collectors.toList());
    // iterate over each query, perform the operation and collect the output
    return output;
}
private static Integer performQuery(List<Integer> query) {
    if (query.get(0) == 1) {
        buffer.put(query.get(1), buffer.getOrDefault(query.get(1), 0) + 1);
    } else if (query.get(0) == 2) {
        if (buffer.containsKey(query.get(1)) && buffer.get(query.get(1)) > 0) {
            buffer.put(query.get(1), buffer.get(query.get(1)) - 1);
        }
    } else {
        if (buffer.containsValue(query.get(1))) {
            return 1;
        } else {
            return 0;
        }
    }
    return -1;
}
public static void main(String[] args) {
    List<List<Integer>> queries = Arrays.asList(
            Arrays.asList(1, 5),
            Arrays.asList(1, 6),
            Arrays.asList(3, 2),
            Arrays.asList(1, 10),
            Arrays.asList(1, 10),
            Arrays.asList(1, 6),
            Arrays.asList(2, 5),
            Arrays.asList(3, 2)
    );
    long start = System.currentTimeMillis();
    System.out.println(freqQuery(queries));
    long end = System.currentTimeMillis();
    // finding the time difference and converting it into seconds
    float sec = (end - start) / 1000F;
    System.out.println("FreqQuery function Took " + sec + " s");
}
}
The problem with your code is the z operation (operation 3). Specifically, the method containsValue has linear time complexity, which makes the whole algorithm O(n*n). Here is a hint: add another HashMap on top of the one you have, which counts the occurrences of the occurrences held in the first map. That way you can query the second map directly by frequency (because the frequency is the key in that map).
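To make the hint concrete, here is a minimal sketch of the two-map idea (the class and field names are my own, and this is just one way to wire it up):

import java.util.HashMap;
import java.util.Map;

class FrequencyQueries {
    private final Map<Integer, Integer> countByValue = new HashMap<>(); // value -> how often it occurs
    private final Map<Integer, Integer> freqOfFreq = new HashMap<>();   // frequency -> how many values have it

    void insert(int x) {                       // operation 1
        int old = countByValue.getOrDefault(x, 0);
        countByValue.put(x, old + 1);
        decrement(old);
        freqOfFreq.merge(old + 1, 1, Integer::sum);
    }

    void delete(int y) {                       // operation 2
        int old = countByValue.getOrDefault(y, 0);
        if (old == 0) return;                  // nothing to delete
        countByValue.put(y, old - 1);
        decrement(old);
        if (old - 1 > 0) freqOfFreq.merge(old - 1, 1, Integer::sum);
    }

    int hasFrequency(int z) {                  // operation 3: O(1) instead of containsValue's O(n)
        return freqOfFreq.getOrDefault(z, 0) > 0 ? 1 : 0;
    }

    private void decrement(int freq) {
        if (freq > 0) freqOfFreq.merge(freq, -1, Integer::sum);
    }
}

Every operation is now constant time, so the whole query list is processed in O(n).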
I have a method which takes a Partition enum as a parameter. This method will be called by multiple background threads (15 max) around the same time, each passing a different partition value. Here dataHoldersByPartition is a map of Partition to ConcurrentLinkedQueue<DataHolder>.
private final ImmutableMap<Partition, ConcurrentLinkedQueue<DataHolder>> dataHoldersByPartition;
//... some code to populate entry in `dataHoldersByPartition`

private void validateAndSend(final Partition partition) {
    ConcurrentLinkedQueue<DataHolder> dataHolders = dataHoldersByPartition.get(partition);
    Map<byte[], byte[]> clientKeyBytesAndProcessBytesHolder = new HashMap<>();
    int totalSize = 0;
    DataHolder dataHolder;
    while ((dataHolder = dataHolders.poll()) != null) {
        byte[] clientKeyBytes = dataHolder.getClientKey().getBytes(StandardCharsets.UTF_8);
        if (clientKeyBytes.length > 255)
            continue;
        byte[] processBytes = dataHolder.getProcessBytes();
        int clientKeyLength = clientKeyBytes.length;
        int processBytesLength = processBytes.length;
        int additionalLength = clientKeyLength + processBytesLength;
        if (totalSize + additionalLength > 50000) {
            Message message = new Message(clientKeyBytesAndProcessBytesHolder, partition);
            // here size of `message.serialize()` byte array should always be less than 50k at all cost
            sendToDatabase(message.getAddress(), message.serialize());
            clientKeyBytesAndProcessBytesHolder = new HashMap<>();
            totalSize = 0;
        }
        clientKeyBytesAndProcessBytesHolder.put(clientKeyBytes, processBytes);
        totalSize += additionalLength;
    }
    // calling again with remaining values only if clientKeyBytesAndProcessBytesHolder is not empty
    if (!clientKeyBytesAndProcessBytesHolder.isEmpty()) {
        Message message = new Message(partition, clientKeyBytesAndProcessBytesHolder);
        // here size of `message.serialize()` byte array should always be less than 50k at all cost
        sendToDatabase(message.getAddress(), message.serialize());
    }
}
And below is my Message class:
public final class Message {

    private final byte dataCenter;
    private final byte recordVersion;
    private final Map<byte[], byte[]> clientKeyBytesAndProcessBytesHolder;
    private final long address;
    private final long addressFrom;
    private final long addressOrigin;
    private final byte recordsPartition;
    private final byte replicated;

    public Message(Map<byte[], byte[]> clientKeyBytesAndProcessBytesHolder, Partition recordPartition) {
        this.clientKeyBytesAndProcessBytesHolder = clientKeyBytesAndProcessBytesHolder;
        this.recordsPartition = (byte) recordPartition.getPartition();
        this.dataCenter = Utils.CURRENT_LOCATION.get().datacenter();
        this.recordVersion = 1;
        this.replicated = 0;
        long packedAddress = new Data().packAddress();
        this.address = packedAddress;
        this.addressFrom = 0L;
        this.addressOrigin = packedAddress;
    }

    // Output of this method should always be less than 50k always
    public byte[] serialize() {
        int bufferCapacity = getBufferCapacity(clientKeyBytesAndProcessBytesHolder); // 36 + dataSize + 1 + 1 + keyLength + 8 + 2;
        ByteBuffer byteBuffer = ByteBuffer.allocate(bufferCapacity).order(ByteOrder.BIG_ENDIAN);
        // header layout
        byteBuffer.put(dataCenter).put(recordVersion).putInt(clientKeyBytesAndProcessBytesHolder.size())
                .putInt(bufferCapacity).putLong(address).putLong(addressFrom).putLong(addressOrigin)
                .put(recordsPartition).put(replicated);
        // now the data layout
        for (Map.Entry<byte[], byte[]> entry : clientKeyBytesAndProcessBytesHolder.entrySet()) {
            byte keyType = 0;
            byte[] key = entry.getKey();
            byte[] value = entry.getValue();
            byte keyLength = (byte) key.length;
            short valueLength = (short) value.length;
            ByteBuffer dataBuffer = ByteBuffer.wrap(value);
            long timestamp = valueLength > 10 ? dataBuffer.getLong(2) : System.currentTimeMillis();
            byteBuffer.put(keyType).put(keyLength).put(key).putLong(timestamp).putShort(valueLength)
                    .put(value);
        }
        return byteBuffer.array();
    }

    private int getBufferCapacity(Map<byte[], byte[]> clientKeyBytesAndProcessBytesHolder) {
        int size = 36;
        for (Entry<byte[], byte[]> entry : clientKeyBytesAndProcessBytesHolder.entrySet()) {
            size += 1 + 1 + 8 + 2;
            size += entry.getKey().length;
            size += entry.getValue().length;
        }
        return size;
    }

    // getters and to string method here
}
Basically, what I have to make sure of is that whenever the sendToDatabase method is called, the size of the message.serialize() byte array is always less than 50k, at all cost. My sendToDatabase method sends the byte array coming out of the serialize method. Because of that condition I am doing the validation below, plus a few other things. In the method, I iterate over the dataHolders queue and extract clientKeyBytes and processBytes from each element. Here is the validation I am doing:
If the clientKeyBytes length is greater than 255, I skip it and continue iterating.
I keep incrementing the totalSize variable, which is the sum of clientKeyLength and processBytesLength, and this totalSize should always stay below 50000 bytes.
As soon as it reaches the 50000 limit, I send the clientKeyBytesAndProcessBytesHolder map to the sendToDatabase method, clear out the map, reset totalSize to 0 and start populating again.
If it doesn't reach that limit and dataHolders runs empty, it sends whatever it has.
I believe there is some bug in my current code because of which some records may not be sent properly or get dropped somewhere due to my condition, and I am not able to figure it out. It also looks like, to properly enforce this 50k condition, I may have to use the getBufferCapacity method to figure out the serialized size correctly before calling sendToDatabase?
I checked your code and it looks fine as per your logic. You said it should always store information that is less than 50K, but it actually stores up to 50K. To make it strictly less than 50K you have to change the if condition to if (totalSize + additionalLength >= 50000).
If your code is still not fulfilling your requirement, i.e. it stores information when totalSize + additionalLength is greater than 50K, I can advise a few things.
Since multiple threads call this method, you need to consider making two sections of your code synchronized.
One is the global variable, the container object dataHoldersByPartition. If multiple concurrent and parallel lookups happen on this container object, the outcome might not be correct. Check whether the container type is synchronized or not. If not, make that access a block like below:
synchronized(this) {
    ConcurrentLinkedQueue<DataHolder> dataHolders = dataHoldersByPartition.get(partition);
}
Now, I can give only two suggestions to fix this issue. One is, instead of checking if (totalSize + additionalLength > 50000), check the actual size of the clientKeyBytesAndProcessBytesHolder object against the 50000 limit (Java has no sizeof, so you would have to compute that size yourself, for example the same way your getBufferCapacity method does). The second one is to narrow down the area of the code and check whether the problem is a side effect of multithreading or not. Both suggestions are meant to find out where exactly the problem is; the fix has to come from your end.
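A rough sketch of that first suggestion, reusing the per-entry accounting from the question's own getBufferCapacity method (the 36-byte header and the 12 bytes of per-entry overhead are taken from that method; verify them against what serialize actually writes):

// estimate what serialize() will produce for the entries accumulated so far
private static int estimatedSerializedSize(Map<byte[], byte[]> holder) {
    int size = 36; // header, as in getBufferCapacity
    for (Map.Entry<byte[], byte[]> entry : holder.entrySet()) {
        size += 1 + 1 + 8 + 2;            // keyType + keyLength + timestamp + valueLength
        size += entry.getKey().length;
        size += entry.getValue().length;
    }
    return size;
}

// inside the while loop, before adding the next entry:
int nextEntrySize = 1 + 1 + 8 + 2 + clientKeyBytes.length + processBytes.length;
if (estimatedSerializedSize(clientKeyBytesAndProcessBytesHolder) + nextEntrySize >= 50000) {
    Message message = new Message(clientKeyBytesAndProcessBytesHolder, partition);
    sendToDatabase(message.getAddress(), message.serialize());
    clientKeyBytesAndProcessBytesHolder = new HashMap<>();
}
clientKeyBytesAndProcessBytesHolder.put(clientKeyBytes, processBytes);

This flushes before adding the entry that would push the serialized size to 50000 or more, so each serialize() call should stay under the limit.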
First, check whether your method validateAndSend exactly satisfies your requirement or not. To do that, synchronize the whole validateAndSend method and check whether everything is fine or you still get the same result. If you still get the same result, it is not a multithreading problem; your code simply does not meet the requirement. If it works fine, it is a multithreading problem. If synchronizing the whole method fixes the issue but degrades performance, remove the synchronization from the method and instead look at each small block of code that might cause the issue, making it a synchronized block one at a time and removing it again if it doesn't help. That way you finally locate the block of code that is actually causing the issue, and you leave that one synchronized.
For example, first attempt:
`private synchronized void validateAndSend`
Second attempt: remove the synchronized keyword from the method and do the step below:
synchronized(this) {
    Message message = new Message(clientKeyBytesAndProcessBytesHolder, partition);
    sendToDatabase(message.getAddress(), message.serialize());
}
If you think that I did not correctly understand you please let me know.
In your validateAndSend I would put the whole data onto a queue and do the whole processing in a separate thread. Please consider the command model: that way all threads just put their load on the queue, and the consumer thread has all the data and all the information in place and can process it quite effectively. The only complicated part is sending the response/result back to the calling thread; since in your case that is not a problem, all the better. There are some more benefits to this pattern; please look at netflix/hystrix.
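A minimal sketch of that producer/consumer idea, assuming the Partition and DataHolder types from the question and leaving the existing batching logic inside the consumer (class and method names here are made up for illustration):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// each producer thread wraps its work as a "command" and enqueues it
final class PartitionWork {
    final Partition partition;
    final DataHolder dataHolder;
    PartitionWork(Partition partition, DataHolder dataHolder) {
        this.partition = partition;
        this.dataHolder = dataHolder;
    }
}

final class SendWorker implements Runnable {
    private final BlockingQueue<PartitionWork> queue = new LinkedBlockingQueue<>();

    void submit(Partition partition, DataHolder holder) {   // called by the background threads
        queue.offer(new PartitionWork(partition, holder));
    }

    @Override
    public void run() {                                      // single consumer thread
        try {
            while (!Thread.currentThread().isInterrupted()) {
                PartitionWork work = queue.take();
                // batch, validate and call sendToDatabase here, exactly as in validateAndSend,
                // but with no other thread touching the buffers
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

Because only the single consumer thread builds the batches, the 50k accounting can no longer be disturbed by concurrent callers.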
My requirement is to generate 1000 unique email IDs in Java. I have already generated random text, and using a for loop I am limiting the number of email IDs generated. The problem is that when I execute it, 10 email IDs are generated but they are all the same.
Below is the code and output:
public static void main() {
    first fr = new first();
    String n = fr.genText() + "@mail.com";
    for (int i = 0; i <= 9; i++) {
        System.out.println(n);
    }
}
public String genText() {
    String randomText = "abcdefghijklmnopqrstuvwxyz";
    int length = 4;
    String temp = RandomStringUtils.random(length, randomText);
    return temp;
}
and output is:
myqo@mail.com
myqo@mail.com
...
myqo@mail.com
When I execute the same program again I get another set of mail IDs; for example, instead of 'myqo' it will be 'bfta'. But my requirement is to generate different, unique IDs.
For Example:
myqo@mail.com
bfta@mail.com
kjuy@mail.com
Put your String initialization in the for statement:
for (int i = 0; i <= 9; i++) {
    String n = fr.genText() + "@mail.com";
    System.out.println(n);
}
I would like to rewrite your method a little bit:
public String generateEmail(String domain, int length) {
    return RandomStringUtils.random(length, "abcdefghijklmnopqrstuvwxyz") + "@" + domain;
}
And it would be possible to call like:
generateEmail("gmail.com", 4);
As I understand it, you want to generate 1000 unique emails; you could do this conveniently with the Stream API:
Stream.generate(() -> generateEmail("gmail.com", 4))
        .limit(1000)
        .collect(Collectors.toSet());
But the problem still exists. I purposely collected the Stream<String> into a Set<String> (which removes duplicates) to find out its size(). As you can see, the size is not always equal to 1000:
999
1000
997
which means your algorithm returns duplicated values even for such a small range.
Therefore, you'd better look into existing email generators for Java or improve your own (for example, by adding numbers and some special characters, which in turn gives you many more possible combinations).
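One hedged way to "improve your own" along those lines: widen the alphabet and keep generating until the set actually holds 1000 distinct addresses (the alphabet and domain below are just examples, and RandomStringUtils is the same Apache Commons helper used above):

Set<String> emails = new HashSet<>();
String alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"; // letters plus digits -> far more combinations
while (emails.size() < 1000) {
    emails.add(RandomStringUtils.random(6, alphabet) + "@gmail.com");
}
// emails now contains exactly 1000 unique addresses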
If you are planning to use MockNeat, support for generating email strings is already implemented.
Example 1:
String corpEmail = mock.emails().domain("startup.io").val();
// Possible Output: tiptoplunge@startup.io
Example 2:
String domsEmail = mock.emails().domains("abc.com", "corp.org").val();
// Possible Output: funjulius@corp.org
Note: mock is the default "mocking" object.
To guarantee uniqueness you could use a counter as part of the email address:
myqo0000@mail.com
bfta0001@mail.com
kjuy0002@mail.com
If you want to stick to letters only then convert the counter to base 26 representation using 'a' to 'z' as the digits.
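A small sketch of that idea, letters only, with the counter encoded in base 26 using 'a' to 'z' as digits (the fixed suffix width of 4 letters is an assumption; it covers 26^4 = 456,976 addresses):

// turn a counter into a fixed-width base-26 suffix using 'a'..'z' as digits
static String base26(int counter, int width) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < width; i++) {
        sb.append((char) ('a' + counter % 26));
        counter /= 26;
    }
    return sb.reverse().toString();
}

// usage: random prefix + unique counter suffix guarantees uniqueness
for (int i = 0; i < 1000; i++) {
    String email = RandomStringUtils.random(4, "abcdefghijklmnopqrstuvwxyz")
            + base26(i, 4) + "@mail.com";
    System.out.println(email);
}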
Currently I have about 12 csv files, each having about 1.5 million records.
I'm using univocity-parsers as my csv reader/parser library.
Using univocity-parsers, I read each file and add all the records to an arraylist with the addAll() method. When all 12 files are parsed and added to the array list, my code prints the size of the arraylist at the end.
for (int i = 0; i < 12; i++) {
    myList.addAll(parser.parseAll(getReader("file-" + i + ".csv")));
}
It works fine at first, but when I reach my 6th consecutive file it seems to take forever in my IntelliJ IDE output window, never printing the ArrayList size even after an hour, whereas before the 6th file it was rather fast.
If it helps I'm running on a macbook pro (mid 2014) OSX Yosemite.
I'm the creator of this library. If you want to just count rows, use a RowProcessor. You don't even need to count the rows yourself, as the parser does that for you:
// Let's create our own RowProcessor to analyze the rows
static class RowCount extends AbstractRowProcessor {

    long rowCount = 0;

    @Override
    public void processEnded(ParsingContext context) {
        // this returns the number of the last valid record.
        rowCount = context.currentRecord();
    }
}

public static void main(String... args) throws FileNotFoundException {
    // let's measure the time roughly
    long start = System.currentTimeMillis();

    //Creates an instance of our own custom RowProcessor, defined above.
    RowCount myRowCountProcessor = new RowCount();

    CsvParserSettings settings = new CsvParserSettings();

    //Here you can select the column indexes you are interested in reading.
    //The parser will return values for the columns you selected, in the order you defined
    //By selecting no indexes here, no String objects will be created
    settings.selectIndexes(/*nothing here*/);

    //When you select indexes, the columns are reordered so they come in the order you defined.
    //By disabling column reordering, you will get the original row, with nulls in the columns you didn't select
    settings.setColumnReorderingEnabled(false);

    //We instruct the parser to send all rows parsed to your custom RowProcessor.
    settings.setRowProcessor(myRowCountProcessor);

    //Finally, we create a parser
    CsvParser parser = new CsvParser(settings);

    //And parse! All rows are sent to your custom RowProcessor (CsvDimension)
    //I'm using a 150MB CSV file with 3.1 million rows.
    parser.parse(new File("c:/tmp/worldcitiespop.txt"));

    //Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
    System.out.println("Rows: " + myRowCountProcessor.rowCount);
    System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");
}
Output
Rows: 3173959
Time taken: 1062 ms
Edit: I saw your comment regarding your need to use the actual data in the rows. In that case, process the rows in the rowProcessed() method of your RowProcessor class; that's the most efficient way to handle this.
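For illustration, a minimal sketch of that approach might look like the following (the statistics collected here are made up, and the code assumes column 2 holds a numeric value; adapt the body of rowProcessed to whatever you actually need from each row):

static class MyRowProcessor extends AbstractRowProcessor {

    long rowCount = 0;
    double total = 0;

    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        rowCount++;
        // example only: accumulate a numeric column instead of keeping the row in memory
        if (row.length > 2 && row[2] != null) {
            total += Double.parseDouble(row[2]);
        }
    }
}

You would register it with settings.setRowProcessor(new MyRowProcessor()) just like the RowCount example above; the rows are never accumulated in a list, so memory use stays flat.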
Edit 2:
If you want to just count rows use getInputDimension from CsvRoutines:
CsvRoutines csvRoutines = new CsvRoutines();
InputDimension d = csvRoutines.getInputDimension(new File("/path/to/your.csv"));
System.out.println(d.rowCount());
System.out.println(d.columnCount());
In parseAll they use 10000 elements for preallocation.
/**
 * Parses all records from the input and returns them in a list.
 *
 * @param reader the input to be parsed
 * @return the list of all records parsed from the input.
 */
public final List<String[]> parseAll(Reader reader) {
    List<String[]> out = new ArrayList<String[]>(10000);
    beginParsing(reader);
    String[] row;
    while ((row = parseNext()) != null) {
        out.add(row);
    }
    return out;
}
If you have millions of records (lines in the file, I guess), this is not good for performance and memory allocation, because the ArrayList will repeatedly double its size and copy its contents every time it needs to allocate new space.
You could try to implement your own parseAll method like this:
public List<String[]> parseAll(Reader reader, int numberOfLines) {
    List<String[]> out = new ArrayList<String[]>(numberOfLines);
    parser.beginParsing(reader);
    String[] row;
    while ((row = parser.parseNext()) != null) {
        out.add(row);
    }
    return out;
}
And check if it helps.
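For example, with the roughly 1.5 million records per file mentioned in the question, a call might look like this (getReader and the file naming come from the question's own loop):

// pre-size for the ~1.5 million records expected per file
List<String[]> rows = parseAll(getReader("file-0.csv"), 1_500_000);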
The problem is that you are running out of memory. When this happens, the computer begins to crawl, since it starts swapping memory to disk and back.
Reading the whole contents into memory is definitely not the best strategy to follow. And since you are only interested in calculating some statistics, you do not even need to use addAll() at all.
The objective in computer science is always to find an equilibrium between memory spent and execution speed; you can always trade memory for more speed, or speed for memory savings.
So, loading the whole files into memory is comfortable, but it is not a solution, not even in the future when computers include terabytes of memory. Something like the following counts the records without keeping them in memory:
public int getNumRecords(CsvParser parser, Reader reader, int start) {
    int toret = start;
    parser.beginParsing(reader);
    while (parser.parseNext() != null) {
        ++toret;
    }
    return toret;
}
As you can see, there is no memory spent in this function (except each single row); you can use it inside a loop for your CSV files, and finish with the total count of rows. The next step is to create a class for all your statistics, substituting that int start with your object.
class Statistics {
    public Statistics() {
        numRows = 0;
        numComedies = 0;
    }

    public void countRow() {
        ++numRows;
    }

    public void countComedies() {
        ++numComedies;
    }

    // more things...

    private int numRows;
    private int numComedies;
}
public int calculateStatistics(CsvParser parser, Reader reader, Statistics stats) {
    int toret = 0;
    parser.beginParsing(reader);
    while (parser.parseNext() != null) {
        stats.countRow();
        ++toret;
    }
    return toret;
}
Hope this helps.
I am trying to write a query such as this:
select {r: referrers(f), count:count(referrers(f))}
from com.a.b.myClass f
However, the output doesn't show the actual objects:
{
count = 3.0,
r = [object Object]
}
Removing the JavaScript object notation once again shows the referrers normally, but they are no longer compartmentalized. Is there a way to format them inside the object notation?
I see that you asked this question a year ago, so I don't know if you still need the answer, but since I was searching around for something similar I can answer it. The problem is that referrers(f) returns an enumeration, so it doesn't translate well when you try to put it into your hashmap. I was doing a similar type of analysis where I was trying to find unique char arrays (counting the unique combinations of char arrays up to the first 50 characters). What I came up with was this:
var counts = {};
filter(
    map(
        unique(
            map(
                filter(heap.objects('char[]'), "it.length > 50"), // filter out strings less than 50 chars in length
                function(charArray) { // chop the string at 50 chars and then count the unique combos
                    var subs = charArray.toString().substr(0, 50);
                    if (!counts[subs]) {
                        counts[subs] = 1;
                    } else {
                        counts[subs] = counts[subs] + 1;
                    }
                    return subs;
                }
            ) // map
        ), // unique
        function(subs) { // map the strings into an array that has the string and the counts of that string
            return { string: subs, count: counts[subs] };
        }
    ), // map
    "it.count > 5000"); // filter out strings that have counts < 5000
This essentially shows how to take an enumeration (heap.objects('char[]') in this case) and filter it and map it so that you can compute statistics on it. Hope this helps someone.