I am wondering what the best approach is to collect all the different exceptions from a log file.
An entry looks like this:
/var/log/hadoop-hdfs/hadoop-cmf-hdfs-DATANODE-aaa.log.out.5
2017-08-30 13:54:44,561 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode{data=FSDataset{dirpath='[/var/hadoop/sdc/dn/current, /var/hadoop/sdd/dn/current]'}, localName='host.tld:50010', datanodeUuid='aaaaaa-6828-44dd-xxx-bbbbb', xmitsInProgress=0}:Exception transfering block BP-111111-172.16.9.110-1471873778315:blk_1086251547_12532682 to mirror 172.16.9.8:50010: org.apache.hadoop.hdfs.protocol.datatransfer.InvalidEncryptionKeyException: Can't re-compute encryption key for nonce, since the required block key (keyID=-111) doesn't exist. Current key: 123
Or this:
2016-08-22 15:50:09,706 ERROR org.apache.hadoop.hdfs.server.datanode.DiskBalancer: Disk Balancer is not enabled.
I would like to print the exception, or, if there is no exception, the remaining fields after $4.
Current code:
awk '/ERROR/{print $3" "$4}' /var/log/hadoop-hdfs/*.log.out | sort | uniq -c
Is there an easy way to look through all of the fields after $4 and, if there is an exception, print the field that contains it; if there isn't, just print everything?
The output is like this right now:
93 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
8403 ERROR org.apache.hadoop.hdfs.server.datanode.DiskBalancer:
The expected output is:
xx ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Broken pipe
yy ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.net.SocketTimeoutException
8403 ERROR org.apache.hadoop.hdfs.server.datanode.DiskBalancer: Disk Balancer is not enabled.
Sample input:
2016-08-22 16:35:42,502 ERROR org.apache.hadoop.hdfs.server.datanode.DiskBalancer: Disk Balancer is not enabled.
2016-08-22 16:36:42,506 ERROR org.apache.hadoop.hdfs.server.datanode.DiskBalancer: Disk Balancer is not enabled.
2016-08-22 16:37:29,515 ERROR org.apache.hadoop.hdfs.server.datanode.DiskBalancer: Disk Balancer is not enabled.
2016-08-22 16:37:29,530 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM
2018-01-06 13:45:18,899 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hostname:50010:DataXceiver error processing WRITE_BLOCK operation src: /172.16.9.68:53477 dst: /172.16.9.6:50010
2018-01-06 14:04:05,176 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode{data=FSDataset{dirpath='[/var/hadoop/sdc/dn/current, /var/hadoop/sdd/dn/current]'}, localName='hostname:50010', datanodeUuid='uuid', xmitsInProgress=11}:Exception transfering block BP-1301709078-172.16.9.110-1471873778315:blk_1095601056_21903280 to mirror 172.16.9.34:50010: java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.16.9.6:37439 remote=/172.16.9.34:50010]
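For illustration, one possible approach is to scan the fields after $4 for something that looks like an exception class and fall back to printing the whole message otherwise. This is only a minimal sketch; the exception-matching pattern is an assumption based on the sample lines above:
awk '/ERROR/ {
    found = ""
    # look through the fields after $4 for something that looks like an exception class
    for (i = 5; i <= NF; i++) {
        if ($i ~ /(Exception|Error):?$/) { found = $i; break }
    }
    if (found != "") {
        print $3, $4, found
    } else {
        # no exception found: print everything after $4
        msg = ""
        for (i = 5; i <= NF; i++) msg = msg $i " "
        print $3, $4, msg
    }
}' /var/log/hadoop-hdfs/*.log.out | sort | uniq -c
To match the expected output more closely, the first branch could also append the words that follow the exception class (e.g. "Broken pipe").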
In my app I can upload files (the max size is 10 MB). I created an exception handler for oversized files, but the console still shows a warning that there was an attempt to upload a file that is too big:
2020-09-30 01:38:59.306 WARN 2476 --- [nio-8080-exec-3] .m.m.a.ExceptionHandlerExceptionResolver : Resolved [org.springframework.web.multipart.MaxUploadSizeExceededException: Maximum upload size exceeded; nested exception is java.lang.IllegalStateException: org.apache.tomcat.util.http.fileupload.impl.SizeLimitExceededException: the request was rejected because its size (26937892) exceeds the configured maximum (10485760)]
Exception handler:
@ExceptionHandler(MaxUploadSizeExceededException.class)
public void oversizedFilesHandler(MaxUploadSizeExceededException e) {
    accountService.writeExceptionToFile(e);
}
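For context, the 10 MB limit mentioned above is presumably configured with something like the following in application.properties (property names assume Spring Boot 2.x; the exact values are an assumption):
# maximum size of a single uploaded file and of the whole multipart request (assumed values)
spring.servlet.multipart.max-file-size=10MB
spring.servlet.multipart.max-request-size=10MB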
Is it possible to disable these warnings?
You can achieve that by adding a log level to your properties file:
RULE : logging.level.xxxx=LEVEL
where:
LEVEL is one of TRACE, DEBUG, INFO, WARN, ERROR, FATAL, OFF.
xxxx is a package/class.
Applying the rule to your case:
logging.level.org.springframework.web=ERROR
Or, more narrowly:
logging.level.org.springframework.web.multipart=ERROR
Hence, only ERROR and FATAL messages will be logged to your console.
I have deployed a Spring Boot application that has a database-based queue of jobs on App Service.
Yesterday I performed a few scale-out and scale-in operations while the application was running, to see how it would behave.
At some point (not necessarily related to the scaling operations) the application started to throw Hikari errors.
com.zaxxer.hikari.pool.PoolBase : HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection@1ae66f34 (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
com.zaxxer.hikari.pool.ProxyConnection : HikariPool-1 - Connection org.postgresql.jdbc.PgConnection@1ef85079 marked as broken because of SQLSTATE(08006), ErrorCode(0)
The following are stack traces from my scheduled job in Spring, along with other information:
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
Caused by: javax.net.ssl.SSLException: Connection reset by peer (Write failed)
Suppressed: java.net.SocketException: Broken pipe (Write failed)
Caused by: java.net.SocketException: Connection reset by peer (Write failed)
Next, the following errors appeared:
WARN 1 --- [ scheduling-1] com.zaxxer.hikari.pool.PoolBase : HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection@48d0d6da (This connection has been closed.).
Possibly consider using a shorter maxLifetime value.
org.springframework.jdbc.support.MetaDataAccessException: Error while extracting DatabaseMetaData; nested exception is java.sql.SQLException: Connection is closed
Caused by: java.sql.SQLException: Connection is closed
The code, which is invoked periodically (every 500 milliseconds), is here:
@Scheduled(fixedDelayString = "${worker.delay}")
@Transactional
public void execute() {
    jobManager.next(jobClass).ifPresent(this::handleJob);
}
Update.
The above code does nothing almost all of the time, since there was no traffic on the website.
Update 2: I've checked the Postgres logs and found this:
2020-07-11 22:48:09 UTC-5f0866f0.f0-LOG: checkpoint starting: immediate force wait
2020-07-11 22:48:10 UTC-5f0866f0.f0-LOG: checkpoint complete (240): wrote 30 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.046 s, sync=0.046 s, total=0.437 s; sync files=13, longest=0.009 s, average=0.003 s; distance=163 kB, estimate=13180 kB
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: received immediate shutdown request
2020-07-11 22:48:10 UTC-5f0a3f41.8914-WARNING: terminating connection because of crash of another server process
2020-07-11 22:48:10 UTC-5f0a3f41.8914-DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
// Same text about 10 times
2020-07-11 22:48:10 UTC-5f0866f2.7c-HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: src/port/kill.c(84): Process (272) exited OOB of pgkill.
2020-07-11 22:48:10 UTC-5f0866f1.fc-WARNING: terminating connection because of crash of another server process
2020-07-11 22:48:10 UTC-5f0866f1.fc-DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-07-11 22:48:10 UTC-5f0866f1.fc-HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: archiver process (PID 256) exited with exit code 1
2020-07-11 22:48:11 UTC-5f0866ee.68-LOG: database system is shut down
It looks like it is a problem with the Azure PostgreSQL server, and it shut itself down. Am I reading this right?
As mentioned in your logs, have you tried setting the maxLifetime property for the Hikari CP? I think that after setting that property this issue should be resolved.
From the HikariCP documentation (https://github.com/brettwooldridge/HikariCP):
maxLifetime
This property controls the maximum lifetime of a connection in the pool. An in-use connection will never be retired, only when it is closed will it then be removed. On a connection-by-connection basis, minor negative attenuation is applied to avoid mass-extinction in the pool. We strongly recommend setting this value, and it should be several seconds shorter than any database or infrastructure imposed connection time limit. A value of 0 indicates no maximum lifetime (infinite lifetime), subject of course to the idleTimeout setting. The minimum allowed value is 30000ms (30 seconds). Default: 1800000 (30 minutes)
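In a Spring Boot application, setting it could look roughly like this in application.properties (a sketch; the values are assumptions and should be kept a few seconds shorter than any database- or Azure-imposed connection limit):
# retire pooled connections after 10 minutes (milliseconds; assumed value)
spring.datasource.hikari.max-lifetime=600000
# optionally also drop idle connections earlier (milliseconds; assumed value)
spring.datasource.hikari.idle-timeout=300000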
I have a Kafka Streams application with kafka-streams and kafka-clients, both version 2.4.0, with the following configs:
properties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
properties.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
brokers = ip1:port1,ip2:port2,ip3:port3
topic partitions: 3
topic replication factor: 3
Scenario 1: I start only 2 brokers (the stream app still contains all three broker IPs in its broker setting), and when I start my stream app the following error occurs:
2020-02-13 13:28:19.711 WARN 18756 --- [-1-0_0-producer] org.apache.kafka.clients.NetworkClient : [Producer clientId=my-app1-a4c8867f-b914-49bb-bc58-203349700828-StreamThread-1-0_0-producer, transactionalId=my-app1-0_0] Connection to node -2 (/ip2:port2) could not be established. Broker may not be available.
and later, after 1 minute:
org.apache.kafka.streams.errors.StreamsException: stream-thread [my-app1-a4c8867f-b914-49bb-bc58-203349700828-StreamThread-1] Failed to rebalance.
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:852)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:743)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:698)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:671)
Caused by: org.apache.kafka.streams.errors.StreamsException: stream-thread [my-app1-a4c8867f-b914-49bb-bc58-203349700828-StreamThread-1] task [0_0] Failed to initialize task 0_0 due to timeout.
at org.apache.kafka.streams.processor.internals.StreamTask.initializeTransactions(StreamTask.java:966)
at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:254)
at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:176)
at org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:355)
at org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:313)
at org.apache.kafka.streams.processor.internals.StreamThread$AbstractTaskCreator.createTasks(StreamThread.java:298)
at org.apache.kafka.streams.processor.internals.TaskManager.addNewActiveTasks(TaskManager.java:160)
at org.apache.kafka.streams.processor.internals.TaskManager.createTasks(TaskManager.java:120)
at org.apache.kafka.streams.processor.internals.StreamsRebalanceListener.onPartitionsAssigned(StreamsRebalanceListener.java:77)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.invokePartitionsAssigned(ConsumerCoordinator.java:272)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:400)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:421)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:340)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:471)
at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1267)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:843)
... 3 common frames omitted
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
I was testing high-availability scenarios. I think Kafka should still work, since the replicas are properly present on the two brokers (I have checked using a Kafka GUI tool).
Scenario 2: Today I noticed that when I start only 2 brokers and give only the IPs of these two brokers (i.e. the stream app only has the IPs of the two working brokers), the following happens:
2020-02-16 16:18:24.818 INFO 5741 --- [-StreamThread-1] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=my-app-0a357371-525b-46cf-9fe1-34ee94fa4158-StreamThread-1-consumer, groupId=my-app] Group coordinator ip2:port2 (id: 2147483644 rack: null) is unavailable or invalid, will attempt rediscovery
2020-02-16 16:18:24.818 ERROR 5741 --- [-StreamThread-1] o.a.k.s.p.internals.StreamThread : stream-thread [my-app-0a357371-525b-46cf-9fe1-34ee94fa4158-StreamThread-1] Encountered the following unexpected Kafka exception during processing, this usually indicate Streams internal errors:
org.apache.kafka.streams.errors.StreamsException: stream-thread [my-app-0a357371-525b-46cf-9fe1-34ee94fa4158-StreamThread-1] Failed to rebalance.
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:852)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:743)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:698)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:671)
Caused by: org.apache.kafka.streams.errors.StreamsException: stream-thread [my-app-0a357371-525b-46cf-9fe1-34ee94fa4158-StreamThread-1] task [0_0] Failed to initialize task 0_0 due to timeout.
at org.apache.kafka.streams.processor.internals.StreamTask.initializeTransactions(StreamTask.java:966)
at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:254)
at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:176)
at org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:355)
at org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:313)
at org.apache.kafka.streams.processor.internals.StreamThread$AbstractTaskCreator.createTasks(StreamThread.java:298)
at org.apache.kafka.streams.processor.internals.TaskManager.addNewActiveTasks(TaskManager.java:160)
at org.apache.kafka.streams.processor.internals.TaskManager.createTasks(TaskManager.java:120)
at org.apache.kafka.streams.processor.internals.StreamsRebalanceListener.onPartitionsAssigned(StreamsRebalanceListener.java:77)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.invokePartitionsAssigned(ConsumerCoordinator.java:272)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:400)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:421)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:340)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:471)
at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1267)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:843)
... 3 common frames omitted
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
Note: this is not the case if I don't set EXACTLY_ONCE in the properties. Then it works as intended.
I tried increasing retries and the retry backoff ms, but it didn't help.
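For reference, a sketch of how those retry settings were applied, alongside the configs shown earlier (the retry values shown are assumptions, not the exact ones used):
Properties properties = new Properties();
properties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
properties.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
// retry settings that were increased while troubleshooting (assumed values)
properties.put(StreamsConfig.RETRIES_CONFIG, 20);
properties.put(StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 1000L);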
Can anyone explain what I am missing?
Logs of broker 2 when broker 1 is down:
[2020-02-17 02:29:00,302] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition __consumer_offsets-36 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
The Kafka logs are filled with the above line.
One major observation:
When I turn off broker 2 (i.e. broker 1 and broker 3 are running), my stream application runs fine.
My app shuts down only when broker 1 is down. I'm guessing that some critical information that should be distributed among all brokers is only saved on broker 1.
I have an issue getting reconnect with camel-netty/netty4 working. A reconnect is triggered successfully when the connection is lost, but the attempt to reconnect fails with Netty detecting a potential deadlock (see the stack trace below). After that exception, no further reconnect attempts are made.
Is this a Netty/Camel bug, or did I miss anything?
2017-02-27 18:23:18,076 WARN | Camel Thread #33 - NettyServerTCPWorker | o.a.c.c.netty4.ClientModeTCPNettyServerBootstrapFactory | Error during re-connect to x.x.x.x:yyyy. Will attempt again in 2000 millis. This exception is ignored.
io.netty.util.concurrent.BlockingOperationException: DefaultChannelPromise@3fde00eb(incomplete)
at io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:390)
at io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157)
at io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:283)
at io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:135)
at io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:28)
at org.apache.camel.component.netty4.ClientModeTCPNettyServerBootstrapFactory.openChannel(ClientModeTCPNettyServerBootstrapFactory.java:175)
at org.apache.camel.component.netty4.ClientModeTCPNettyServerBootstrapFactory.doReconnectIfNeeded(ClientModeTCPNettyServerBootstrapFactory.java:164)
at org.apache.camel.component.netty4.ClientModeTCPNettyServerBootstrapFactory$2.run(ClientModeTCPNettyServerBootstrapFactory.java:216)
at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
at java.lang.Thread.run(Thread.java:745)
I've tested with camel 2.16.2 and 2.18.1 using netty 4.0.33.Final and 4.0.41.Final, respectively.
EDIT:
I've just verified that this only happens if workerCount=1 is set. Is this intended?
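For context, the consumer endpoint is configured roughly like this (host, port and the other option values are placeholders/assumptions; workerCount=1 is the part in question):
netty4:tcp://x.x.x.x:yyyy?clientMode=true&reconnect=true&reconnectInterval=2000&workerCount=1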
My Hadoop version is 0.20.203.0. The namenode running on my Hadoop cluster was shut down. I checked the logs and found the error message only in the secondary namenode logs:
2014-09-27 22:18:54,930 WARN org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done. New Image Size: 29552383
2014-09-27 22:19:42,792 INFO org.mortbay.log: org.mortbay.io.nio.SelectorManager$SelectSet@8135daf JVM BUG(s) - injecting delay2 times
2014-09-27 22:19:42,792 INFO org.mortbay.log: org.mortbay.io.nio.SelectorManager$SelectSet@8135daf JVM BUG(s) - recreating selector 2 times, canceled keys 38 times
2014-09-27 23:18:55,508 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 0 Total time for transactions(ms): 0Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0
2014-09-27 23:18:55,508 FATAL org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All storage directories are inaccessible.
2014-09-27 23:18:55,509 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG:
Another error message appeared in one of my datanodes:
2014-09-27 01:03:58,535 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.75.6.51:50010, storageID=DS-532990984-10.75.6.51-50010-1343295370699, infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: No space left on device
at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:770)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:475)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:528)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:397)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
at java.lang.Thread.run(Thread.java:662)
I am not sure whether this is the root cause of the namenode shutdown issue.
A new error was raised when I was trying to restart the namenode:
2014-09-28 11:25:06,202 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.IOException: Incorrect data format. logVersion is -31 but writables.length is 0.
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:542)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1009)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:827)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:365)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:379)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:353)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:254)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:434)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1153)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1162)
Does anyone know about this? Is it possible to fix the image and edits files, as I don't want to lose the data?