I've dockerized Graphite and am working with this library to get metrics from an Apache Storm topology. I'm getting metrics data, but no matter what I do I only get one data point per minute, whereas I really need one per second.
As per this SO post I've set the retention policy to grab data every second. I've also set
conf.put("topology.builtin.metrics.bucket.size.secs", 1);
void initMetrics(TopologyContext context) {
    messageCountMetric = new CountMetric();
    // The third argument is the time bucket size (reporting interval) in seconds.
    context.registerMetric("digest_count", messageCountMetric, 1);
}
in the class that's setting up the topology and the bolt itself, respectively. To my understanding this should cause metrics to be reported every second. What am I missing here? How can I get metrics to be reported every second?
Thanks in advance, and happy holidays all!
Update 1
Here is my storage-schemas.conf file:
root@cdd13a16103a:/etc/carbon# cat storage-schemas.conf
# Schema definitions for Whisper files. Entries are scanned in order,
# and first match wins. This file is scanned for changes every 60 seconds.
# [name]
# pattern = regex
# retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...
# Carbon's internal metrics. This entry should match what is specified in
pattern = ^carbon\.
retentions = 1s:6h,1min:7d,10min:5y
pattern = .*
retentions = 1s:6h,1min:7d,10min:5y
pattern = ^test.
retentions = 1s:6h,1min:7d,10min:5y
pattern = ^storm.
retentions = 1s:6h,1min:7d,10min:5y
Here is my config setup:
Config conf = new Config();
conf.put("topology.builtin.metrics.bucket.size.secs", 1);
conf.registerMetricsConsumer(GraphiteMetricsConsumer.class, 4);
conf.put("metrics.reporter.name", "com.verisign.storm.metrics.reporters.graphite.GraphiteReporter");
conf.put("metrics.graphite.host", "");
conf.put("metrics.graphite.port", "2003");
conf.put("metrics.graphite.prefix", "storm.test");
In order to apply changes in storage-schemas.conf you have to:
restart carbons
delete old *.wsp or use whisper-resize.py to apply scheme
restart carbon-cache
make sure that DEFAULT_CACHE_DURATION in webapp's local_settings.py is set to 1
make sure nginx/apache2/uwsgi cache is set up correctly as well, if any
There are more whisper-* tools shipped with Graphite. The next one you may be interested in is whisper-info.py:
bash$ whisper-info.py /graphite/whisper/prod/some/metric.wsp
maxRetention: 1296000
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 142600
Archive 0
retention: 691200
>> secondsPerPoint: 1
points: 11520
size: 138240
offset: 40
Archive 1
retention: 1296000
secondsPerPoint: 3600
points: 360
size: 4320
offset: 138280
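To compare a retentions line from storage-schemas.conf with what whisper-info.py reports, it can help to work out the expected secondsPerPoint and points values by hand. The following is a minimal sketch, not Whisper's own parser: the function names are made up for illustration, only the units shown are handled, and a year is assumed to be 365 days.

```python
import re

# Unit multipliers in seconds. Assumption: a year counts as 365 days.
UNIT_SECONDS = {"s": 1, "min": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def parse_duration(text):
    """Convert a duration such as '1s', '10min' or '5y' into seconds."""
    match = re.fullmatch(r"(\d+)(s|min|h|d|w|y)", text.strip())
    if match is None:
        raise ValueError("unsupported duration: %r" % text)
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def describe_retentions(retentions):
    """Return one (seconds_per_point, points, retention_seconds) tuple per archive."""
    archives = []
    for archive in retentions.split(","):
        per_point, to_store = archive.split(":")
        seconds_per_point = parse_duration(per_point)
        retention = parse_duration(to_store)
        archives.append((seconds_per_point, retention // seconds_per_point, retention))
    return archives


for spp, points, retention in describe_retentions("1s:6h,1min:7d,10min:5y"):
    print("secondsPerPoint: %d  points: %d  retention: %d" % (spp, points, retention))
```

If the first archive of an existing .wsp file does not show secondsPerPoint: 1 after the schema change, the file was most likely created under the old schema and still needs to be deleted or resized.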
I am using the Stanford POS tagger toolkit to tag lists of words from academic papers. Here is the relevant part of my code:
st = StanfordPOSTagger(stanford_tagger_path, stanford_jar_path, encoding = 'utf8', java_options = '-mx2048m')
word_tuples = st.tag(document)
document is a list of words produced by nltk.word_tokenize. The words come from normal academic papers, so there are usually several thousand of them (mostly 3000 - 4000). I need to process over 10000 files, so I keep calling these functions. My program works fine on a small test set of 270 files, but when the number of files gets bigger, the program raises this error (Java heap space set to 2G):
raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed
Note that this error does not occur right at startup; it happens after the program has been running for some time. I really don't know the reason. Is it because 3000 - 4000 words are too many? Thank you very much for your help! (Sorry for the poor formatting; the full error output is too long to include.)
Here is my solution, after I faced the same error. Basically, increasing the Java heap size solved it.
import os
java_path = "C:\\Program Files\\Java\\jdk1.8.0_102\\bin\\java.exe"
os.environ['JAVAHOME'] = java_path
from nltk.tag.stanford import StanfordPOSTagger
path_to_model = "stanford-postagger-2015-12-09/models/english-bidirectional-distsim.tagger"
path_to_jar = "stanford-postagger-2015-12-09/stanford-postagger.jar"
tagger=StanfordPOSTagger(path_to_model, path_to_jar)
tagger.java_options='-mx4096m' ### Setting higher memory limit for long sentences
sentence = 'This is testing'
print(tagger.tag(sentence.split()))
I assume you have tried increasing the Java heap size via the tagger settings, like so:
stanford.POSTagger([...], java_options="-mxSIZEm")
Cf. the docs; the default is 1000 MB:
def __init__(self, [...], java_options='-mx1000m')
In order to test whether the size of the dataset is the problem, you can tokenize your text into sentences, e.g. using the Punkt tokenizer, and output them right after tagging.
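The sentence-by-sentence approach could be sketched roughly as follows. This is only an illustration: batches, tag_in_batches and fake_tag_sents are hypothetical helpers, and the stand-in tagger exists so the snippet runs without NLTK or Java. With NLTK you would pass the tagger's tag_sents method instead, and build the sentence list with a sentence tokenizer followed by nltk.word_tokenize.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def tag_in_batches(sentences, tag_sents, batch_size=50):
    """Tag tokenized sentences batch by batch and yield each result as it arrives.

    `tag_sents` is any callable that maps a list of token lists to a list of
    tagged sentences (assumption: e.g. the tag_sents method of an NLTK tagger).
    """
    for batch in batches(sentences, batch_size):
        for tagged in tag_sents(batch):
            yield tagged


def fake_tag_sents(sentences):
    """Stand-in tagger for demonstration only: labels every token 'X'."""
    return [[(token, "X") for token in sentence] for sentence in sentences]


sentences = [["This", "is", "testing"], ["Another", "sentence"], ["Last", "one"]]
for tagged in tag_in_batches(sentences, fake_tag_sents, batch_size=2):
    print(tagged)
```

Because results are yielded per sentence, they can be written to disk immediately, and a failure points at a specific batch rather than at the whole document.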
I have the following job, and many others like it, launched like this
String jobName = MR.class
Configuration config = HBaseConfiguration.create();
config.set("hbase.master", HBASE_MASTER);
config.set("hbase.zookeeper.quorum", HBASE_ZOOKEEPERS);
config.set("hbase.zookeeper.property.clientPort", ZK_CLIENT_PORT);
Job job = Job.getInstance(config, jobName);
Scan scan = new Scan();
//gets the raw band data cells
scan.addColumn("gc".getBytes(), "s".getBytes());
System.out.println("Job=" + jobName);
System.out.println("\tHBASE_MASTER=" + HBASE_MASTER);
// TODO: make caching size configurable
// null for output key/value since we're not sending anything to reduce
TableMapReduceUtil.initTableMapperJob(INPUT_TABLE, scan,
"PIXELS", // output table
null, // reducer class
// at least one, adjust as required
boolean b = job.waitForCompletion(true);
but I keep getting this error and can't even read the data into the mapper
Error: java.io.BufferedReader.lines()Ljava/util/stream/Stream;
14/09/22 22:11:13 INFO mapreduce.Job: Task Id : attempt_1410880772411_0045_m_000009_2, Status : FAILED
Error: java.io.BufferedReader.lines()Ljava/util/stream/Stream;
14/09/22 22:11:13 INFO mapreduce.Job: Task Id : attempt_1410880772411_0045_m_000018_2, Status : FAILED
Error: java.io.BufferedReader.lines()Ljava/util/stream/Stream;
14/09/22 22:11:13 INFO mapreduce.Job: Task Id : attempt_1410880772411_0045_m_000002_2, Status : FAILED
I've never seen this before, and I've been using HBase for a while.
Here is what I've tried:
I can scan the table via the hbase shell no problem in the same way this job does
I can use the java api to scan in the same way no problem
I built and rebuilt my jar thinking some files were corrupt... not the problem
I have at least 5 other jobs setup the same way
I disabled, enabled, compacted, rebuilt everything you can imagine with the table
I am using maven shade to uber jar
I am running HBase 0.98.1-cdh5.1.0
any help or ideas greatly appreciated.
Looks like I should have waited a bit before asking this question. The problem was that I actually was getting to the data, but one of my external jars had been built with Java 8, and .lines() does not exist on BufferedReader in Java 7. I'm running Java 7, hence the error. It had nothing to do with HBase or MapReduce.
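One rough way to look for jars built for a newer Java release is to read the class file version from each .class entry. The sketch below uses hypothetical helper names and a fake in-memory jar for the demonstration. It only catches classes whose bytecode targets a newer release (major version 52 is Java 8, 51 is Java 7); a jar compiled on JDK 8 with a Java 7 target that still calls Java 8-only APIs such as BufferedReader.lines() would pass this check, so treat it as a first filter rather than a full diagnosis.

```python
import io
import struct
import zipfile


def class_major_versions(jar_file):
    """Map each .class entry in a jar to the major version stored in its header."""
    versions = {}
    with zipfile.ZipFile(jar_file) as jar:
        for name in jar.namelist():
            if not name.endswith(".class"):
                continue
            header = jar.read(name)[:8]
            if len(header) < 8:
                continue
            # Class file header: u4 magic, u2 minor_version, u2 major_version (big-endian).
            magic, _minor, major = struct.unpack(">IHH", header)
            if magic == 0xCAFEBABE:
                versions[name] = major
    return versions


def newer_than(jar_file, max_major=51):
    """List classes whose major version exceeds max_major (51 = Java 7)."""
    return sorted(n for n, v in class_major_versions(jar_file).items() if v > max_major)


# Demonstration with an in-memory jar that holds two fake class headers.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as jar:
    jar.writestr("old/A.class", struct.pack(">IHH", 0xCAFEBABE, 0, 51))
    jar.writestr("new/B.class", struct.pack(">IHH", 0xCAFEBABE, 0, 52))
demo.seek(0)
print(newer_than(demo))
```

For a real check you would pass the path of the shaded jar instead of the in-memory demo object.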
I'm getting an out of memory exception due to lack of Java heap space when I try and download tweets using Flume and pipe them into Hadoop.
I have set the heap space currently to 4GB in the mapred-site.xml of Hadoop, like so:
I am hoping to download tweets continually for two days but can't get past 45 minutes without errors.
Since I do have the disk space to hold all of this, I am assuming the error is coming from Java having to handle so many things at once. Is there a way for me to slow down the speed at which these tweets are downloaded, or do something else to solve this problem?
Edit: flume.conf included
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <required>
TwitterAgent.sources.Twitter.consumerSecret = <required>
TwitterAgent.sources.Twitter.accessToken = <required>
TwitterAgent.sources.Twitter.accessTokenSecret = <required>
TwitterAgent.sources.Twitter.keywords = manchester united, man united, man utd, man u
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:50070/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
Edit 2
I've tried increasing the memory to 8GB which still doesn't help. I am assuming I am placing too many tweets in Hadoop at once and need to write them to disk and release the space again (or something to that effect). Is there a guide anywhere on how to do this?
Set the JAVA_OPTS value in flume-env.sh and then start the Flume agent.
It appears the problem had to do with the batch size and transactionCapacity. I changed them to the following:
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
This works without me even needing to change the JAVA_OPTS value.
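A small sanity check for these settings might look like the sketch below. It rests on an assumption worth verifying against the Flume user guide: that the sink takes up to hdfs.batchSize events in one channel transaction, so batchSize should not exceed the channel's transactionCapacity, which in turn should not exceed its capacity. The helper names are made up, and the parser only handles plain key = value lines.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, ignoring blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_sizes(props, agent="TwitterAgent", sink="HDFS", channel="MemChannel"):
    """Return warnings about batch and transaction sizes (assumed rule, see above)."""
    batch = int(props["%s.sinks.%s.hdfs.batchSize" % (agent, sink)])
    tx = int(props["%s.channels.%s.transactionCapacity" % (agent, channel)])
    cap = int(props["%s.channels.%s.capacity" % (agent, channel)])
    warnings = []
    if batch > tx:
        warnings.append("batchSize %d exceeds transactionCapacity %d" % (batch, tx))
    if tx > cap:
        warnings.append("transactionCapacity %d exceeds capacity %d" % (tx, cap))
    return warnings


original = """
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
"""
fixed = original.replace("batchSize = 1000", "batchSize = 100").replace(
    "transactionCapacity = 100", "transactionCapacity = 1000")

print(check_sizes(parse_properties(original)))
print(check_sizes(parse_properties(fixed)))
```

The original values from the question trigger the first warning, while the changed values pass both checks.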
I have a long log file generated with log4j, with 10 threads writing to the log.
I am looking for a log analyzer tool that can find lines where the user waited for a long time (i.e. where the difference between consecutive log entries for the same thread is more than a minute).
P.S. I am trying to use OtrosLogViewer, but it only filters by certain values (for example, by thread ID) and does not compare between lines.
The new version of OtrosLogViewer has a "Delta" column that calculates the difference between adjacent log lines (in ms).
Thank you.
This simple Python script may be enough. For testing, I analyzed my local Apache log, which BTW uses the Common Log Format, so you may even reuse it as-is. I simply compute the difference between two subsequent requests and print the request line for deltas exceeding a certain threshold (1 second in my test). You may want to encapsulate the code in a function that also accepts a parameter with the thread ID, so you can filter further.
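#!/usr/bin/env python
import re
from datetime import datetime

THRESHOLD = 1  # seconds; report requests that follow a gap longer than this

last = None
for line in open("/var/log/apache2/access.log"):
    # You may insert here something like
    # if not re.match(THREAD_ID, line):
    #     continue
    # Drop the trailing timezone offset such as " +0000" (hence the [:-6])
    # so that strptime does not need to handle %z.
    current = datetime.strptime(
        re.search(r"\[([^]]+)]", line).group(1)[:-6],
        "%d/%b/%Y:%H:%M:%S")
    if last is not None and (current - last).total_seconds() > THRESHOLD:
        # Print the quoted request line of the entry that followed the gap.
        print(re.search('"([^"]+)"', line).group(1))
    last = current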
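The per-thread variant suggested above could be sketched as follows, keeping the last timestamp seen for each thread in a dictionary. The log layout is an assumption (a timestamp with milliseconds followed by the thread name in square brackets); LINE_RE and find_gaps are illustrative names, and the regular expression needs adapting to the actual log4j pattern in use.

```python
import re
from datetime import datetime

# Hypothetical layout: "2014-09-22 22:11:13,123 [thread-name] LEVEL message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} \[([^\]]+)\]")


def find_gaps(lines, threshold_seconds=60):
    """Yield (thread, gap_seconds, previous_line, line) for long per-thread gaps."""
    last_seen = {}  # thread name -> (timestamp, line) of its previous entry
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip continuation lines such as stack traces
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        thread = match.group(2)
        if thread in last_seen:
            previous_stamp, previous_line = last_seen[thread]
            gap = (stamp - previous_stamp).total_seconds()
            if gap > threshold_seconds:
                yield thread, gap, previous_line, line
        last_seen[thread] = (stamp, line)


sample = [
    "2014-09-22 22:11:13,000 [worker-1] INFO start",
    "2014-09-22 22:11:14,000 [worker-2] INFO start",
    "2014-09-22 22:11:20,000 [worker-2] INFO done",
    "2014-09-22 22:13:00,000 [worker-1] INFO done",
]
for thread, gap, previous_line, line in find_gaps(sample):
    print("%s waited %.0f s before: %s" % (thread, gap, line))
```

To run it against a real file, pass the open file object instead of the sample list. Milliseconds are ignored here, which is fine for a one-minute threshold.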
Based on @Raffaele's answer, I made some fixes so that it works on any log file (skipping lines that don't begin with the expected date, e.g. in a Jenkins console log).
In addition, I added Max / Min thresholds to filter lines based on duration limits.
#!/usr/bin/env python
import re
from datetime import datetime

# Matches a word followed by whitespace and a "YYYY-MM-DD HH:MM:SS" timestamp.
regCompile = r"\w+\s+(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d).*"
filePath = "C:/Users/user/Desktop/temp/jenkins.log"
MIN_THRESHOLD = 60    # seconds; example value, adjust as needed
MAX_THRESHOLD = 3600  # seconds; example value, adjust as needed

lastTime = None
lastLine = ""
with open(filePath, 'r') as f:
    for line in f:
        regexp = re.search(regCompile, line)
        if regexp:
            currentTime = datetime.strptime(regexp.group(1), "%Y-%m-%d %H:%M:%S")
            if lastTime is not None:
                duration = (currentTime - lastTime).total_seconds()
                if MIN_THRESHOLD <= duration <= MAX_THRESHOLD:
                    print ("#######################################################################################################################################")
                    print (lastLine)
                    print (line)
            lastTime = currentTime
            lastLine = line
Apache Chainsaw has a time delta column.
I am receiving Ehcache logging output when using it as the Hibernate second-level cache. I do not understand the output or what it might mean. It is printed to the logs a lot.
DEBUG [net.sf.ehcache.store.disk.Segment] put added 0 on heap
DEBUG [net.sf.ehcache.store.disk.Segment] put updated, deleted 0 on heap
Could anyone shed some light on what this might mean? My Second level cache appears to be working, according to a print of the statistics...
INFO [com.atlaschase.falcon.commands.domain.AircraftCommandResolutionService] [ name = aircraftCache cacheHits = 824 onDiskHits = 0 offHeapHits = 0 inMemoryHits = 824 misses = 182 onDiskMisses = 182 offHeapMisses = 0 inMemoryMisses = 182 size = 91 averageGetTime = 1.0745527 evictionCount = 0 ]
Any help would be appreciated ..
This output is generated by the DiskStore, which IIRC is enabled by default in Ehcache. Basically, Ehcache overflows cached data from memory to disk. If you want to disable this functionality, set the overflowToDisk attribute to false:
<cache name="..." overflowToDisk="false"
Oh - can someone also confirm that the 'averageGetTime' is in milliseconds and not seconds?
Confirmed, milliseconds. Although the JavaDoc of Statistics.getAverageGetTime() is slightly confusing:
[...] Because ehcache support JDK1.4.2, each get time uses System.currentTimeMilis, rather than nanoseconds. The accuracy is thus limited.
I found the following code in LiveCacheStatisticsImpl:
public float getAverageGetTimeMillis() {
    return (float) totalGetTimeTakenMillis.get() / hitCount;
}