I am exploring Impala for a POC, but I can't get any significant insert performance out of it. I can't insert 5,000 records/sec; at best I was able to insert a mere 200/sec. That is really slow for any database.
I tried two different methods but both are slow:
Using Cloudera
First, I installed Cloudera on my system and added the latest CDH 6.2 cluster. I created a Java client to insert data using the ImpalaJDBC41 driver. I am able to insert records, but the speed is terrible. I tried tuning Impala by increasing the Impala daemon memory limit and my system RAM, but it didn't help. Finally, I thought there was something wrong with my installation, so I switched to another method.
Using Cloudera VM
Cloudera also ships a ready-made VM for test purposes. I tried it to see whether it gives better performance, but there is no big improvement. I still can't insert data at 5k/sec.
I don't know where I need to improve. I have pasted my code below in case any improvement can be made there.
What is the ideal Impala configuration to achieve a speed of 5k-10k inserts/sec? This is still well below what Impala is capable of.
private static Connection connectViaDS() throws Exception {
    Class.forName("com.cloudera.impala.jdbc41.Driver");
    Connection connection = DriverManager.getConnection(CONNECTION_URL);
    return connection;
}
private static void writeInABatchWithCompiledQuery(int records) {
    int protocol_no = 233, s_port = 20, d_port = 34, packet = 46, volume = 58, duration = 39, pps = 76,
        bps = 65, bpp = 89, i_vol = 465, e_vol = 345, i_pkt = 5, e_pkt = 54, s_i_ix = 654, d_i_ix = 444, _time = 1000, flow = 989;
    String s_city = "Mumbai", s_country = "India", s_latt = "12.165.34c", s_long = "39.56.32d",
        s_host = "motadata", d_latt = "29.25.43c", d_long = "49.15.26c", d_city = "Damouli", d_country = "Nepal";
    long e_date = 1275822966, e_time = 1370517366;

    PreparedStatement preparedStatement;
    int total = 1000 * 1000;
    int counter = 0;
    Connection connection = null;
    try {
        connection = connectViaDS();
        preparedStatement = connection.prepareStatement(sqlCompiledQuery);
        Timestamp ed = new Timestamp(e_date);
        Timestamp et = new Timestamp(e_time);
        while (counter < total) {
            for (int index = 1; index <= 5000; index++) {
                counter++;
                preparedStatement.setString(1, "s_ip" + String.valueOf(index));
                preparedStatement.setString(2, "d_ip" + String.valueOf(index));
                preparedStatement.setInt(3, protocol_no + index);
                preparedStatement.setInt(4, s_port + index);
                preparedStatement.setInt(5, d_port + index);
                preparedStatement.setInt(6, packet + index);
                preparedStatement.setInt(7, volume + index);
                preparedStatement.setInt(8, duration + index);
                preparedStatement.setInt(9, pps + index);
                preparedStatement.setInt(10, bps + index);
                preparedStatement.setInt(11, bpp + index);
                preparedStatement.setString(12, s_latt + String.valueOf(index));
                preparedStatement.setString(13, s_long + String.valueOf(index));
                preparedStatement.setString(14, s_city + String.valueOf(index));
                preparedStatement.setString(15, s_country + String.valueOf(index));
                preparedStatement.setString(16, d_latt + String.valueOf(index));
                preparedStatement.setString(17, d_long + String.valueOf(index));
                preparedStatement.setString(18, d_city + String.valueOf(index));
                preparedStatement.setString(19, d_country + String.valueOf(index));
                preparedStatement.setInt(20, i_vol + index);
                preparedStatement.setInt(21, e_vol + index);
                preparedStatement.setInt(22, i_pkt + index);
                preparedStatement.setInt(23, e_pkt + index);
                preparedStatement.setInt(24, s_i_ix + index);
                preparedStatement.setInt(25, d_i_ix + index);
                preparedStatement.setString(26, s_host + String.valueOf(index));
                preparedStatement.setTimestamp(27, ed);
                preparedStatement.setTimestamp(28, et);
                preparedStatement.setInt(29, _time);
                preparedStatement.setInt(30, flow + index);
                preparedStatement.addBatch();
            }
            preparedStatement.executeBatch();
            preparedStatement.clearBatch();
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            connection.close();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
Data is updating at a snail's pace. I tried increasing the batch size, but that decreased the speed further. I don't know whether my code is wrong or whether I need to tune Impala for better performance. Please guide me.
I am using a VM for testing; here are the other details:
System:
OS - Ubuntu 16
RAM - 12 GB
Cloudera - CDH 6.2
Impala Daemon Memory Limit - 2 GB
Impala Daemon Java Heap Size - 500 MB
NameNode Java Heap Size (HDFS) - 500 MB
Please let me know if more details are required.
You can't benchmark on a VM with 12 GB. Look at Impala's hardware requirements and you'll see you need 128 GB of memory at a minimum.
Memory
128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query processing on a particular node exceed the amount of memory available to Impala on that node, the query writes temporary work data to disk, which can lead to long query times. Note that because the work is parallelized, and intermediate results for aggregate queries are typically smaller than the original data, Impala can query and join tables that are much larger than the memory available on an individual node.
Also, the VM is meant for familiarizing yourself with the toolset; it is not powerful enough to serve even as a development environment.
References
Impala Requirements: Hardware Requirements
Tuning Impala for Performance
Related
I have a use case where I am writing to a Kafka topic in batches using a Spark job (no streaming). Initially I push, say, 10 records to the Kafka topic and run the Spark job, which does some processing and finally writes to another Kafka topic.
The next time, when I push another 5 records and run the Spark job, my requirement is to process only these 5 records, not to start again from the beginning offset. I need to maintain the committed offset so that the Spark job starts processing from the next offset position.
Here is the code from the Kafka side to fetch the offsets:
private static List<TopicPartition> getPartitions(KafkaConsumer consumer, String topic) {
    List<PartitionInfo> partitionInfoList = consumer.partitionsFor(topic);
    return partitionInfoList.stream().map(x -> new TopicPartition(topic, x.partition())).collect(Collectors.toList());
}

public static void getOffSet(KafkaConsumer consumer) {
    List<TopicPartition> topicPartitions = getPartitions(consumer, topic);
    consumer.assign(topicPartitions);
    consumer.seekToBeginning(topicPartitions);
    topicPartitions.forEach(x -> {
        System.out.println("Partition-> " + x + " startingOffSet-> " + consumer.position(x));
    });
    consumer.assign(topicPartitions);
    consumer.seekToEnd(topicPartitions);
    topicPartitions.forEach(x -> {
        System.out.println("Partition-> " + x + " endingOffSet-> " + consumer.position(x));
    });
    topicPartitions.forEach(x -> {
        consumer.poll(1000);
        OffsetAndMetadata offsetAndMetadata = consumer.committed(x);
        long position = consumer.position(x);
        System.out.printf("Committed: %s, current position %s%n",
                offsetAndMetadata == null ? null : offsetAndMetadata.offset(), position);
    });
}
The code below is meant to have Spark load the messages from the topic, but it is not working as expected:
Dataset<Row> kafkaDataset = session.read().format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", topic)
        .option("group.id", "test-consumer-group")
        .option("startingOffsets", "{\"Topic1\":{\"0\":2}}")
        .option("endingOffsets", "{\"Topic1\":{\"0\":3}}")
        .option("enable.auto.commit", "true")
        .load();
After the above code executes, I again try to get the offset by calling
getOffSet(consumer)
on the topic, but it always reads from offset 0, and the committed offset fetched initially keeps on increasing. I am new to Kafka and still figuring out how to handle such a scenario. Please help here.
Initially I had 10 records in my topic; I published another 2 records, and here is the output:
Output after the getOffSet method executes:
Partition-> Topic00-0 startingOffSet-> 0
Partition-> Topic00-0 endingOffSet-> 12
Committed: 12, current position 12
Output after the Spark code executes to load the messages:
Partition-> Topic00-0 startingOffSet-> 0
Partition-> Topic00-0 endingOffSet-> 12
Committed: 12, current position 12
I see no difference. Please take a look and suggest a resolution for this scenario.
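In case it helps to show the direction I am exploring, here is a rough sketch of what I have in mind (the helper name and the single-partition assumption are mine, and I am not sure this is the right approach): read the committed offset for the consumer group, build the startingOffsets JSON from it, and commit the new end position once the batch has been processed.

// Sketch only: assumes topic "Topic1" with a single partition 0, and that this
// consumer group's committed offset marks where the next batch should start.
private static String buildStartingOffsetsJson(KafkaConsumer<String, String> consumer, String topic) {
    TopicPartition tp = new TopicPartition(topic, 0);
    consumer.assign(Collections.singletonList(tp));
    OffsetAndMetadata committed = consumer.committed(tp);
    // In the Spark Kafka source's offset JSON, -2 means "earliest"; use it when nothing has been committed yet.
    long start = (committed == null) ? -2L : committed.offset();
    return "{\"" + topic + "\":{\"0\":" + start + "}}";
}

The returned string would go into .option("startingOffsets", ...), with "endingOffsets" set to "latest". As far as I understand, the Spark Kafka source does not commit offsets itself, so after a successful batch the consumer would have to commit the new position explicitly, e.g. consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(endOffset))).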
Can we add a PowerModel for a virtual machine in CloudSim (the simulation tool), as we do for a host, so that we can track the power consumption of each virtual machine?
Using CloudSim Plus, you can compute the CPU usage and power consumption of a VM by adding the following code to your example:
private void printVmsCpuUtilizationAndPowerConsumption() {
    for (Vm vm : vmList) {
        System.out.println("Vm " + vm.getId() + " at Host " + vm.getHost().getId() + " CPU Usage and Power Consumption");
        double vmPower; //watt-sec
        double utilizationHistoryTimeInterval, prevTime = 0;
        final UtilizationHistory history = vm.getUtilizationHistory();
        for (final double time : history.getHistory().keySet()) {
            utilizationHistoryTimeInterval = time - prevTime;
            vmPower = history.powerConsumption(time);
            final double wattsPerInterval = vmPower * utilizationHistoryTimeInterval;
            System.out.printf(
                "\tTime %8.1f | Host CPU Usage: %6.1f%% | Power Consumption: %8.0f Watt-Sec * %6.0f Secs = %10.2f Watt-Sec\n",
                time, history.vmCpuUsageFromHostCapacity(time) * 100, vmPower, utilizationHistoryTimeInterval, wattsPerInterval);
            prevTime = time;
        }
        System.out.println();
    }
}
You don't implement a specific PowerModel for VMs. A VM's power consumption is determined by its CPU utilization and its Host's PowerModel.
You can get the complete example here.
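To make that concrete, here is a minimal sketch of attaching a power model to the Host rather than the VM. It assumes the same CloudSim Plus version as the example above (where PowerModelLinear and Host#setPowerModel are available); the 200 W / 20% figures and the host sizes are arbitrary, and imports are omitted because exact package names vary between versions.

private Host createHostWithPowerModel() {
    final List<Pe> peList = new ArrayList<>();
    peList.add(new PeSimple(1000, new PeProvisionerSimple())); // one 1000-MIPS core
    final Host host = new HostSimple(2048, 10000, 100000, peList); // RAM, bandwidth, storage
    // Linear model: 200 W at full CPU load, 20% of that drawn even when idle.
    host.setPowerModel(new PowerModelLinear(200, 0.2));
    return host;
}

If I recall correctly, you also need to call vm.getUtilizationHistory().enable() on each VM so that the utilization history used in the code above is actually recorded.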
I'm trying to get the total RAM in Java using the Sigar library. I do the following:
return String.valueOf(sigar.getMem().getRam());
My total RAM is 4 GB, so I was expecting 4 or 4.00, but the result is 4008.
I tried the following:
long x = sigar.getMem().getTotal();
final String[] units = new String[] { "B", "KB", "MB", "GB", "TB" };
int digitGroups = (int) (Math.log10(x) / Math.log10(1024));
return new DecimalFormat("#,##0.##")
.format(x / Math.pow(1024, digitGroups)) + " " + units[digitGroups];
but I still get a result of 3.91 GB.
Any idea?
You probably have 3.91 GB of physically available memory, and the math from Sigar is correct.
Check if your BIOS is reserving some RAM for the integrated graphics device or some other devilry.
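If you want to sanity-check the number Sigar reports, one option (assuming an Oracle/OpenJDK runtime that exposes com.sun.management, and reusing the same sigar instance as above) is to compare it with the JVM's own view of physical memory:

com.sun.management.OperatingSystemMXBean os =
        (com.sun.management.OperatingSystemMXBean) java.lang.management.ManagementFactory.getOperatingSystemMXBean();
// Both values are in bytes; a gap of a few hundred MB is normal when the BIOS or devices reserve RAM.
System.out.println("JVM view : " + os.getTotalPhysicalMemorySize() + " bytes");
System.out.println("Sigar view: " + sigar.getMem().getTotal() + " bytes");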
I'm running a Spark application (Spark 1.6.3 cluster), which does some calculations on 2 small data sets, and writes the result into an S3 Parquet file.
Here is my code:
public void doWork(JavaSparkContext sc, Date writeStartDate, Date writeEndDate, String[] extraArgs) throws Exception {
    SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    S3Client s3Client = new S3Client(ConfigTestingUtils.getBasicAWSCredentials());

    boolean clearOutputBeforeSaving = false;
    if (extraArgs != null && extraArgs.length > 0) {
        if (extraArgs[0].equals("clearOutput")) {
            clearOutputBeforeSaving = true;
        } else {
            logger.warn("Unknown param " + extraArgs[0]);
        }
    }

    Date currRunDate = new Date(writeStartDate.getTime());
    while (currRunDate.getTime() < writeEndDate.getTime()) {
        try {
            SparkReader<FirstData> sparkReader = new SparkReader<>(sc);
            JavaRDD<FirstData> data1 = sparkReader.readDataPoints(
                    inputDir,
                    currRunDate,
                    getMinOfEndDateAndNextDay(currRunDate, writeEndDate));
            // Normalize to 1 hours & 0.25 degrees
            JavaRDD<FirstData> distinctData1 = data1.distinct();
            // Floor all (distinct) values to 6 hour windows
            JavaRDD<FirstData> basicData1BySixHours = distinctData1.map(d1 -> new FirstData(
                    d1.getId(),
                    TimeUtils.floorTimePerSixHourWindow(d1.getTimeStamp()),
                    d1.getLatitude(),
                    d1.getLongitude()));

            // Convert Data1 to Dataframes
            DataFrame data1DF = sqlContext.createDataFrame(basicData1BySixHours, FirstData.class);
            data1DF.registerTempTable("data1");

            // Read Data2 DataFrame
            String currDateString = TimeUtils.getSimpleDailyStringFromDate(currRunDate);
            String inputS3Path = basedirInput + "/dt=" + currDateString;
            DataFrame data2DF = sqlContext.read().parquet(inputS3Path);
            data2DF.registerTempTable("data2");

            // Join data1 and data2
            DataFrame mergedDataDF = sqlContext.sql("SELECT D1.Id,D2.beaufort,COUNT(1) AS hours " +
                    "FROM data1 as D1,data2 as D2 " +
                    "WHERE D1.latitude=D2.latitude AND D1.longitude=D2.longitude AND D1.timeStamp=D2.dataTimestamp " +
                    "GROUP BY D1.Id,D1.timeStamp,D1.longitude,D1.latitude,D2.beaufort");

            // Create histogram per ID
            JavaPairRDD<String, Iterable<Row>> mergedDataRows = mergedDataDF.toJavaRDD().groupBy(md -> md.getAs("Id"));
            JavaRDD<MergedHistogram> mergedHistogram = mergedDataRows.map(new MergedHistogramCreator());

            logger.info("Number of data1 results: " + data1DF.select("lId").distinct().count());
            logger.info("Number of coordinates with data: " + data1DF.select("longitude", "latitude").distinct().count());
            logger.info("Number of results with beaufort histograms: " + mergedDataDF.select("Id").distinct().count());

            // Save to parquet
            String outputS3Path = basedirOutput + "/dt=" + TimeUtils.getSimpleDailyStringFromDate(currRunDate);
            if (clearOutputBeforeSaving) {
                writeWithCleanup(outputS3Path, mergedHistogram, MergedHistogram.class, sqlContext, s3Client);
            } else {
                write(outputS3Path, mergedHistogram, MergedHistogram.class, sqlContext);
            }
        } finally {
            TimeUtils.progressToNextDay(currRunDate);
        }
    }
}
public void write(String outputS3Path, JavaRDD<MergedHistogram> outputRDD, Class outputClass, SQLContext sqlContext) {
    // Apply a schema to an RDD of JavaBeans and save it as Parquet.
    DataFrame fullDataDF = sqlContext.createDataFrame(outputRDD, outputClass);
    fullDataDF.write().parquet(outputS3Path);
}
public void writeWithCleanup(String outputS3Path, JavaRDD<MergedHistogram> outputRDD, Class outputClass,
                             SQLContext sqlContext, S3Client s3Client) {
    String fileKey = S3Utils.getS3Key(outputS3Path);
    String bucket = S3Utils.getS3Bucket(outputS3Path);
    logger.info("Deleting existing dir: " + outputS3Path);
    s3Client.deleteAll(bucket, fileKey);

    write(outputS3Path, outputRDD, outputClass, sqlContext);
}
public Date getMinOfEndDateAndNextDay(Date startTime, Date proposedEndTime) {
    long endOfDay = startTime.getTime() - startTime.getTime() % MILLIS_PER_DAY + MILLIS_PER_DAY;
    if (endOfDay < proposedEndTime.getTime()) {
        return new Date(endOfDay);
    }
    return proposedEndTime;
}
The size of data1 is around 150,000 records and data2 around 500,000.
What my code basically does is some data manipulation, merging the 2 data sets, a bit more manipulation, printing some statistics, and saving to Parquet.
Spark has 25 GB of memory per server, and the code runs fine.
Each iteration takes about 2-3 minutes.
The problem starts when I run it on a large set of dates.
After a while, I get an OutOfMemoryError:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.List.$colon$colon$colon(List.scala:127)
at org.json4s.JsonDSL$JsonListAssoc.$tilde(JsonDSL.scala:98)
at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:139)
at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:72)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:164)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:38)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:87)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:71)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:70)
Last time it ran, it crashed after 233 iterations.
The line it crashed on was this:
logger.info("Number of coordinates with data: " + data1DF.select("longitude","latitude").distinct().count());
Can anyone please tell me what could be the reason for these eventual crashes?
I'm not sure that everyone will find this solution viable, but upgrading the Spark cluster to 2.2.0 seems to have resolved the issue.
I have run my application for several days now and have had no crashes yet.
This error occurs when GC takes up over 98% of the total execution time of the process. You can monitor the GC time in the Spark Web UI by going to the Stages tab at http://master:4040.
Try increasing the driver/executor memory (whichever is generating this error) using spark.{driver/executor}.memory via --conf while submitting the Spark application.
Another thing to try is to change the garbage collector that the JVM is using. Read this article for that: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html. It very clearly explains why the GC overhead error occurs and which garbage collector is best for your application.
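For illustration, a spark-submit invocation combining both suggestions might look like this (the memory sizes, class name, and jar are placeholders, not recommendations):

spark-submit \
  --driver-memory 8g \
  --executor-memory 8g \
  --driver-java-options "-XX:+UseG1GC" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  --class com.example.MyApp \
  my-app.jar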
How do I get the RAM size and hard disk size of a PC using Java? And is it possible to get the currently logged-in user name through Java?
Disk size:
long diskSize = new File("/").getTotalSpace();
User name:
String userName = System.getProperty("user.name");
I'm not aware of a reliable way to determine total system memory in Java. On a Unix system you could parse /proc/meminfo. You can of course find the maximum memory available to the JVM:
long maxMemory = Runtime.getRuntime().maxMemory();
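For the /proc/meminfo route mentioned above, a rough sketch (Linux only; it assumes the usual "MemTotal: <value> kB" line format):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

static long totalMemoryKiB() throws IOException {
    // /proc/meminfo contains a line such as "MemTotal:       12204960 kB"
    for (String line : Files.readAllLines(Paths.get("/proc/meminfo"))) {
        if (line.startsWith("MemTotal:")) {
            return Long.parseLong(line.replaceAll("\\D", "")); // keep only the digits
        }
    }
    throw new IOException("MemTotal not found in /proc/meminfo");
}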
Edit: for completeness (thanks Suresh S), here's a way to get total memory with the Oracle JVM only:
long memorySize = ((com.sun.management.OperatingSystemMXBean) ManagementFactory
.getOperatingSystemMXBean()).getTotalPhysicalMemorySize();
For RAM size, if you are using Java 1.5 or later, you can use the java.lang.management package:
com.sun.management.OperatingSystemMXBean mxbean = (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
System.out.println(mxbean.getTotalPhysicalMemorySize() + " Bytes ");
import java.lang.management.*;
import java.io.*;

class max
{
    public static void main(String... a)
    {
        long diskSize = new File("/").getTotalSpace();
        String userName = System.getProperty("user.name");
        long maxMemory = Runtime.getRuntime().maxMemory();
        long memorySize = ((com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean()).getTotalPhysicalMemorySize();

        System.out.println("Size of C:=" + diskSize + " Bytes");
        System.out.println("User Name=" + userName);
        System.out.println("RAM Size=" + memorySize + " Bytes");
    }
}
Have a look at this topic, which goes into detail on how to get OS information such as this.
For RAM capacity:
// This step gets the RAM capacity in bytes.
long ram = ((com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean()).getTotalPhysicalMemorySize();
long sizekb = ram / 1000;    // decimal kilobytes
long sizemb = sizekb / 1000; // decimal megabytes
long sizegb = sizemb / 1000; // decimal gigabytes
System.out.println("System Ram = " + sizegb + " gb");