Seemingly random Tika performance issues - Java

I use RxJava to pipe documents from a source to Tika and from Tika to Elasticsearch.
At some point Tika takes about 5 minutes to index a single document and then continues normally afterwards.
I am unable to pin down the cause. If I restart the application, the behaviour repeats exactly: say it took 5 minutes at the 301st document last time, it will take 5 minutes at the 301st document again. But if I change the order of the documents, it happens neither at the same index (301) nor with the same document (the previous 301st).
Here are the relevant parts of the application:
public Indexable analyze(Indexable indexable) {
    Timestamp from = new Timestamp(System.currentTimeMillis());
    if (indexable instanceof NCFile) {
        // there is some code here that has no effect
        Metadata md = this.generateMetadata(((NCFile) indexable).getPath());
        ((NCFile) indexable).setType(md.get("Content-Type"));
        if (((NCFile) indexable).getType().startsWith("text/")) {
            ((NCFile) indexable).setContent(this.parseContent(((NCFile) indexable).getPath())); // TODO: a lot more could be done here
        } else {
            ((NCFile) indexable).setContent("");
        }
        ((NCFile) indexable).setType(this.detectMediaType(((NCFile) indexable).getPath()));
    }
    Timestamp to = new Timestamp(System.currentTimeMillis());
    System.out.println("the file " + ((NCFile) indexable).getPath() + " took " + (to.getTime() - from.getTime()) + " ms to parse");
    return indexable;
}
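To narrow down which of the three Tika calls is the one that stalls, one option is to time each step separately rather than the whole method. A minimal sketch that reuses the methods from the snippet above (the timed() helper is hypothetical, not part of the original code):

import java.util.function.Supplier;

// Hypothetical helper: runs a call, logs how long it took and on which thread, returns its result.
private <T> T timed(String label, String path, Supplier<T> call) {
    long start = System.currentTimeMillis();
    T result = call.get();
    System.out.println(label + " for " + path + " took "
            + (System.currentTimeMillis() - start) + " ms on thread "
            + Thread.currentThread().getName());
    return result;
}

// Usage inside analyze(), assuming the same methods as above:
// String path = ((NCFile) indexable).getPath();
// Metadata md = timed("generateMetadata", path, () -> this.generateMetadata(path));
// String text = timed("parseContent",     path, () -> this.parseContent(path));
// String type = timed("detectMediaType",  path, () -> this.detectMediaType(path));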
and the pipeline that is feeding the code above:
nc.filter(action -> action.getOperation() == Operation.INSERT)
  .map(IndexingAction::getIndexable)
  .subscribeOn(Schedulers.computation())
  .map(indexableAction -> metadataAnalyzer.analyze(indexableAction))
  .map(indexable -> {
      indexer.insert(indexable);
      return indexable;
  })
  .subscribeOn(Schedulers.io())
  .map(indexable -> "The indexable " + indexable.getIdentifier() + " of class " + indexable.getClass().getName() + " has been inserted.")
  .subscribe(message -> this.logger.log(Level.INFO, message));
My guess would be that the problem is memory- or thread-related; but as far as I can see, the code should work perfectly fine.
the file Workspace/xx/xx-server/.test-documents/testRFC822_base64 took 5 ms to index
the file Workspace/xx/xx-server/.test-documents/testPagesHeadersFootersAlphaLower.pages took 306889 ms to index
the file Workspace/xx/xx-server/.test-documents/testFontAfterBufferedText.rtf took 2 ms to index
the file Workspace/xx/xx-server/.test-documents/testOPUS.opus took 7 ms to index
The funny thing is, these are the Tika test files provided in their repo.
EDIT:
After a request I looked into it using Java Flight Recorder, but I am not sure what exactly I have to look at:
Right after the plateau it stops working, even though neither the RAM nor the CPU limit is reached.
EDIT 2:
Is it the PipedReader that is blocking all of these threads? Do I understand that correctly?
EDIT 3:
Here is a 1min flight recording:
Note: The flight recording seems wrong. In my system monitor application I do not see such a big memory consumption (apparent 16GB?!)...
What am I doing wrong?

Related

BigQuery Pagination through large result set with cloud library

I am working on accessing data from Google BigQuery; the data is 500 MB, which I need to transform as part of the requirement. I am setting Allow Large Results, setting a destination table, etc.
I have written a Java job using Google's new cloud library, since that is what is recommended now: com.google.cloud:google-cloud-bigquery:0.21.1-beta (I have tried 0.20-beta as well, without any fruitful results).
I am having a problem with the pagination of this data; the library is inconsistent in fetching results page-wise. Here is my code snippet.
Code Snippet
System.out.println("Accessing Handle of Response");
QueryResponse response = bigquery.getQueryResults(jobId, QueryResultsOption.pageSize(10000));
System.out.println("Got Handle of Response");
System.out.println("Accessing results");
QueryResult result = response.getResult();
System.out.println("Got handle of Result. Total Rows: "+result.getTotalRows());
System.out.println("Reading the results");
int pageIndex = 0;
int rowId = 0;
while (result != null) {
System.out.println("Reading Page: "+ pageIndex);
if(result.hasNextPage())
{
System.out.println("There is Next Page");
}
else
{
System.out.println("No Next Page");
}
for (List<FieldValue> row : result.iterateAll()) {
System.out.println("Row: " + rowId);
rowId++;
}
System.out.println("Getting Next Page: ");
pageIndex++;
result = result.getNextPage();
}
Output print statements
Accessing Handle of Response
Got Handle of Response
Accessing results
Got handle of Result. Total Rows: 9617008
Reading the results
Reading Page: 0
There is Next Page
Row: 0
Row: 1
Row: 2
Row: 3
:
:
Row: 9999
Row: 10000
Row: 10001
:
:
Row: 19999
:
:
Please note that it never hits/prints - "Getting Next Page: ".
My expectation was that I would get data in chunks of 10000 rows at a time. Please note that if I run the same code on a query which returns 10-15K rows and set the pageSize to be 100 records, I do get the "Getting Next Page:" after every 100 rows. Is this a known issue with this beta library?
This looks very close to a problem I have been struggling with for hours. And I just found the solution, so I will share it here, even though you probably found a solution yourself a long time ago.
I did exactly what the documentation and tutorials said, but my page size was not respected and I kept getting all the rows every time, no matter what I did. Eventually I found another example, official I think, right here.
What I learned from that example is that you should only use iterateAll() to get the rest of the rows. To get the current page rows you need to use getValues() instead.
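In other words, the per-page loop would look roughly like this (a sketch against the same beta API as above; getValues() returns only the current page's rows, while iterateAll() silently walks every following page as well):

QueryResult result = response.getResult();
int pageIndex = 0;
long rowId = 0;
while (result != null) {
    System.out.println("Reading Page: " + pageIndex);
    // Only the rows of the current page, not the whole result set.
    for (List<FieldValue> row : result.getValues()) {
        rowId++;
    }
    System.out.println("Rows read so far: " + rowId);
    pageIndex++;
    result = result.getNextPage();
}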

Extracting avg time spent at a place from Google

I'm trying to use Jsoup to extract the average time spent at a place straight from Google's search results, as the Google API does not support fetching that info at the moment.
For example,
the URL is "https://www.google.com/search?q=vivocity" and the text to extract is "15 min to 2 hr".
I've tried the following code
try {
    String url = "https://www.google.com.sg/search?q=vivocity";
    Document doc = Jsoup.connect(url).userAgent("mozilla/17.0").get();
    Elements ele = doc.select("div._B1k");
    for (Element qwer : ele) {
        temp += "Avg time spent: " + qwer.getElementsByTag("b").first().text() + "\n";
    }
} catch (IOException e) {
    e.printStackTrace();
}
I have also tried just outputting doc.text() and searching through the output; it doesn't seem to contain anything to do with the average time spent either.
The strange thing is that other URLs and divs work perfectly fine.
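For what it's worth, a quick way to check whether the element exists at all in the HTML that Google serves to Jsoup (as opposed to the JavaScript-rendered page seen in a browser) is to dump the raw markup and search it. A small sketch, reusing the URL and selector from the snippet above (it can go inside the same try/catch):

Document doc = Jsoup.connect("https://www.google.com.sg/search?q=vivocity")
        .userAgent("mozilla/17.0")
        .get();
String html = doc.outerHtml();
// Does the class used in the selector appear anywhere in the served HTML?
System.out.println("Contains _B1k: " + html.contains("_B1k"));
System.out.println("Matching divs: " + doc.select("div._B1k").size());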
Any help will be appreciated, thank you.

File last access time and last modified time in java?

In my application I read a file using the following method:
public void readFIleData(String path) {
    BufferedReader br = null;
    try {
        String sCurrentLine;
        br = new BufferedReader(new FileReader(path));
        while ((sCurrentLine = br.readLine()) != null) {
            System.out.println("Data : " + sCurrentLine);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (br != null) br.close();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
I also get the last access time and last modified time of the file using the following method:
public void getFIleInfo(String path) {
    Path file = Paths.get(path);
    try {
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        FileTime accessTime = attrs.lastAccessTime();
        System.out.println("accessTime : " + accessTime.toMillis());
        FileTime modifiedTime = attrs.lastModifiedTime();
        System.out.println("modifiedTime : " + modifiedTime.toMillis());
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I ran the above methods in the following order:
1-> getFIleInfo()
2-> readFIleData()
3-> getFIleInfo()
I got the following output:
accessTime : 1462943491685
modifiedTime : 1462943925846
Data : erteuyuittdgfdfghjkhw5643rtrr66664fdghf
accessTime : 1462943491685
modifiedTime : 1462943925846
Here are the output times in string format:
accessTime : 2016-05-11T05:11:31.685881Z
modifiedTime : 2016-05-11T07:39:28.237884Z
Data : erteuyuittdgfdfghjkhw5643rtrr66LE229F1HBQ664fdghf
accessTime : 2016-05-11T05:11:31.685881Z
modifiedTime : 2016-05-11T07:39:28.237884Z
I have a doubt about this output, because the access time remains the same as before reading the file's data. Can somebody please explain to me what is actually meant by last access time and last modified time in Java?
First, let's focus on what these things mean.
Access - the last time the file was read, i.e., the last time the file data was accessed.
Modify - the last time the file was modified (content has been modified), i.e., time when file data last modified.
Change - the last time meta data of the file was changed (e.g. permissions), i.e., time when file status was last changed.
Edit:
The access time IS changing. I suggest you add a Thread.sleep(100) or something in between and then see if the problem persists.
If it does, the culprit would have to be the OS you are running, since Java simply reads from the filesystem. @Serge Ballesta's comments explain that Windows NTFS has an option to disable writing every change made to the file attributes back to the hard drive, for performance reasons. There is actually more to this.
From the docs:
NTFS delays updates to the last access time for a file by up to one hour after the last access. NTFS also permits last access time updates to be disabled. Last access time is not updated on NTFS volumes by default.
Here is some data from running the script on Mac OS X:
calling getFileInfo() at: 11.4.2016 3:13:08:738
accessTime : 11.4.2016 3:12:53:0
modifiedTime : 29.10.2015 1:49:14:0
--------------------
sleeping for 100ms
--------------------
calling readFIleData() at: 11.4.2016 3:13:08:873
--------------------
sleeping for 100ms
--------------------
re-calling getFileInfo() at: 11.4.2016 3:13:08:977
accessTime : 11.4.2016 3:13:08:0 <---- READING FILE CHANGES ACCESS TIME
modifiedTime : 29.10.2015 1:49:14:0
--------------------
sleeping for 100ms
--------------------
re-calling getFileInfo() at: 11.4.2016 3:13:09:81
accessTime : 11.4.2016 3:13:08:0 <---- READING FILE ATTRIBUTES DOES NOT CHANGE ACCESS TIME
modifiedTime : 29.10.2015 1:49:14:0
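For reference, here is a sketch of the kind of driver that produced the output above. It reuses the readFIleData() and getFIleInfo() methods from the question (the FileTimes wrapper class and the file path are placeholders):

public static void main(String[] args) throws Exception {
    FileTimes ft = new FileTimes();   // hypothetical class holding the question's two methods
    String path = "/tmp/test.txt";    // any existing file

    ft.getFIleInfo(path);             // initial access/modified time
    Thread.sleep(100);
    ft.readFIleData(path);            // read the file contents
    Thread.sleep(100);
    ft.getFIleInfo(path);             // access time should now have moved (if the OS updates it)
    Thread.sleep(100);
    ft.getFIleInfo(path);             // reading attributes alone should not change it again
}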
To enhance clarity, you can convert the milliseconds you have into something more readable. The following code snippet shows how.
long accessTimeSinceEpoch = Files.readAttributes(file, BasicFileAttributes.class).lastAccessTime().toMillis();
Calendar calendar = Calendar.getInstance();
calendar.setTimeInMillis(accessTimeSinceEpoch);
int mYear = calendar.get(Calendar.YEAR);
int mMonth = calendar.get(Calendar.MONTH);       // note: Calendar.MONTH is zero-based (0 = January)
int mDay = calendar.get(Calendar.DAY_OF_MONTH);
int mHour = calendar.get(Calendar.HOUR);         // 12-hour clock; use Calendar.HOUR_OF_DAY for 24-hour
int mMin = calendar.get(Calendar.MINUTE);
int mSec = calendar.get(Calendar.SECOND);
int mMilisec = calendar.get(Calendar.MILLISECOND);
String st = mDay + "." + mMonth + "." + mYear + " " + mHour + ":" + mMin + ":" + mSec + ":" + mMilisec;
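Alternatively, on Java 8+ you can skip Calendar entirely and format the FileTime through java.time; a minimal sketch:

import java.nio.file.attribute.FileTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Formats a FileTime as e.g. "2016-05-11 05:11:31.685" in the system time zone.
static String format(FileTime time) {
    DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
            .withZone(ZoneId.systemDefault());
    return fmt.format(time.toInstant());
}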
If you look into the API, you will find this:
If the file system implementation does not support a time stamp
to indicate the time of last access then this method returns
an implementation specific default value, typically the last-modified-time or a FileTime
representing the epoch (1970-01-01T00:00:00Z).
It looks pretty much like the "problem" is related to your file system and your operating system. I don't think there is anything wrong with your code.
For example, on Windows, the NtfsDisableLastAccessUpdate option was enabled by default starting with Vista and Windows 7, but you can disable it by using the following command line:
fsutil behavior set disablelastaccess 0
As I said in the comment to your question, I was able to solve this problem on Windows on a real machine, but not on a virtual one. If you are still struggling with this issue, then issue this command before anything else to see what's going on in the registry:
fsutil behavior query disablelastaccess
On a last note, I did not have to restart Windows or IntelliJ (where I ran my tests). The results were immediate: with value 1 the last-access timestamp does not change, and with value 0 it does.

DataFrames are slow to parse through a small amount of data

I have 2 classes doing a similar task in Apache Spark, but the one using DataFrames is many times slower (about 30x) than the "regular" one using RDDs.
I would like to use DataFrames, since that would eliminate a lot of the code and classes we have, but obviously I can't have it be that much slower.
The data set is nothing big. We have 30-some files, each containing JSON data about events triggered by activities in another piece of software. There are between 0 and 100 events in each file.
A data set with 82 events takes about 5 minutes to process with DataFrames.
Sample code:
public static void main(String[] args) throws ParseException, IOException {
    SparkConf sc = new SparkConf().setAppName("POC");
    JavaSparkContext jsc = new JavaSparkContext(sc);
    SQLContext sqlContext = new SQLContext(jsc);
    conf = new ConfImpl();
    HashSet<String> siteSet = new HashSet<>();

    // last month
    Date yesterday = monthDate(DateUtils.addDays(new Date(), -1)); // method that returns the date on the first of the month
    Date startTime = startofYear(new Date(yesterday.getTime()));   // method that returns the date on the first of the year

    // list all the sites with a metric file
    JavaPairRDD<String, String> allMetricFiles = jsc.wholeTextFiles("hdfs:///somePath/*/poc.json");
    for (Tuple2<String, String> each : allMetricFiles.toArray()) {
        logger.info("Reading from " + each._1);
        DataFrame metric = sqlContext.read().format("json").load(each._1).cache();
        metric.count();
        boolean siteNameDisplayed = false;
        boolean dateDisplayed = false;
        do {
            Date endTime = DateUtils.addMonths(startTime, 1);
            HashSet<Row> totalUsersForThisMonth = new HashSet<>();
            for (String dataPoint : Conf.DataPoints) { // This is a String[] with 4 elements for this specific case
                try {
                    if (siteNameDisplayed == false) {
                        String siteName = parseSiteFromPath(each._1); // method returning a parsed String
                        logger.info("Data for site: " + siteName);
                        siteSet.add(siteName);
                        siteNameDisplayed = true;
                    }
                    if (dateDisplayed == false) {
                        logger.info("Month: " + formatDate(startTime)); // SimpleFormatDate("yyyy-MM-dd")
                        dateDisplayed = true;
                    }
                    DataFrame lastMonth = metric.filter("event.eventId=\"" + dataPoint + "\"")
                            .filter("creationDate >= " + startTime.getTime())
                            .filter("creationDate < " + endTime.getTime())
                            .select("event.data.UserId").distinct();
                    logger.info("Distinct for last month for " + dataPoint + ": " + lastMonth.count());
                    totalUsersForThisMonth.addAll(lastMonth.collectAsList());
                } catch (Exception e) {
                    // data does not fit the expected model so there is nothing to print
                }
            }
            logger.info("Total Unique for the month: " + totalUsersForThisMonth.size());
            startTime = DateUtils.addMonths(startTime, 1);
            dateDisplayed = false;
        } while (startTime.getTime() < commonTmsMetric.monthDate(yesterday).getTime());
        // reset startTime for the next site
        startTime = commonTmsMetric.StartofYear(new Date(yesterday.getTime()));
    }
}
There are a few things in this code that are not efficient, but when I look at the logs they only add a few seconds to the whole processing.
I must be missing something big.
I have run this with 2 executors and with 1 executor, and the difference is 20 seconds out of 5 minutes.
This is running with Java 1.7 and Spark 1.4.1 on Hadoop 2.5.0.
Thank you!
So, there are a few things, but it's hard to say without seeing the breakdown of the different tasks and their times. The short version is that you are doing way too much work in the driver and not taking advantage of Spark's distributed capabilities.
For example, you are collecting all of the data back to the driver program (toArray() and your for loop). Instead you should just point Spark SQL at the files it needs to load.
As for the operators, it seems like you're doing many aggregations in the driver; instead you could use the driver to build up the aggregations and have Spark SQL execute them, as sketched below.
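Roughly along these lines (a sketch against the Spark 1.4 API, with column names taken from the filters in the question; the per-month and per-site grouping is left out for brevity):

import static org.apache.spark.sql.functions.countDistinct;

// Load every poc.json in one go instead of looping over files in the driver.
DataFrame metrics = sqlContext.read().format("json").load("hdfs:///somePath/*/poc.json").cache();

// Let Spark SQL compute the distinct user count per event id in a distributed way,
// instead of one filter()/count()/collectAsList() round trip per data point and month.
DataFrame distinctUsers = metrics
        .groupBy(metrics.col("event.eventId"))
        .agg(countDistinct(metrics.col("event.data.UserId")).alias("uniqueUsers"));

distinctUsers.show();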
Another big difference between your in-house code and the DataFrame code is going to be Schema inference. Since you've already created classes to represent your data, it seems likely that you know the schema of your JSON data. You can likely speed up your code by adding the schema information at read time so Spark SQL can skip inference.
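Along the same lines, here is a sketch of supplying an explicit schema at read time (the field names and types are guesses based on the columns used in the question, so adjust them to your actual JSON):

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical schema matching the fields referenced in the question's filters.
StructType dataSchema = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("UserId", DataTypes.StringType, true)
});
StructType eventSchema = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("eventId", DataTypes.StringType, true),
        DataTypes.createStructField("data", dataSchema, true)
});
StructType schema = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("event", eventSchema, true),
        DataTypes.createStructField("creationDate", DataTypes.LongType, true)
});

// With an explicit schema, Spark SQL does not have to scan the JSON files to infer one.
DataFrame metrics = sqlContext.read().format("json").schema(schema).load("hdfs:///somePath/*/poc.json");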
I'd suggest re-visiting this approach and trying to build something using Spark SQL's distributed operators.

Offset missing from Kafka logs - Simple Consumer unable to proceed

I have a 3-node Kafka cluster set up. I am using Storm to read messages from Kafka. Each topic in my system has 7 partitions.
Now I am facing a weird problem. Until 3 days ago, everything was working fine. However, now it seems my Storm topology is unable to read from 2 specific partitions: #1 and #4.
I tried to drill down into the problem and found that in my Kafka logs, for both of these partitions, one offset is missing, i.e. after 5964511 the next offset is 5964513 and not 5964512.
Due to the missing offset, the SimpleConsumer is not able to proceed to the next offsets. Am I doing something wrong, or is this a known bug?
What could possibly be the reason for such behaviour?
I am using the following code to read the window of valid offsets:
public static long getLastOffset(SimpleConsumer consumer, String topic, int partition,
                                 long whichTime, String clientName) {
    TopicAndPartition topicAndPartition = new TopicAndPartition(topic, partition);
    Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfoMap = new HashMap<TopicAndPartition, PartitionOffsetRequestInfo>();
    requestInfoMap.put(topicAndPartition, new PartitionOffsetRequestInfo(kafka.api.OffsetRequest.LatestTime(), 100));
    OffsetRequest request = new OffsetRequest(requestInfoMap, kafka.api.OffsetRequest.CurrentVersion(), clientName);
    OffsetResponse response = consumer.getOffsetsBefore(request);

    long[] validOffsets = response.offsets(topic, partition);
    for (long validOffset : validOffsets) {
        System.out.println(validOffset + " : ");
    }

    long largestOffset = validOffsets[0];
    long smallestOffset = validOffsets[validOffsets.length - 1];
    System.out.println(smallestOffset + " : " + largestOffset);
    return largestOffset;
}
This gives me the following output:
4529948 : 6000878
So, the offset I am providing is well within the offset range.
Sorry for the late answer, but...
I handle this case by keeping a Long instance variable that holds the next offset to read, and by checking after the fetch whether the returned FetchResponse hasError(). If there was an error, I change the next-offset value to something reasonable (it could be the next offset or the last available offset) and try again.
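Roughly like this (a sketch against the old SimpleConsumer API; nextOffset is the instance variable mentioned above, and the fetch size of 100000 bytes is just a placeholder):

FetchRequest req = new FetchRequestBuilder()
        .clientId(clientName)
        .addFetch(topic, partition, nextOffset, 100000)
        .build();
FetchResponse fetchResponse = consumer.fetch(req);

if (fetchResponse.hasError()) {
    short code = fetchResponse.errorCode(topic, partition);
    if (code == ErrorMapping.OffsetOutOfRangeCode()) {
        // Jump to a reasonable offset and retry; alternatively nextOffset + 1
        // would just skip over the missing offset.
        nextOffset = getLastOffset(consumer, topic, partition,
                kafka.api.OffsetRequest.LatestTime(), clientName);
    }
} else {
    for (MessageAndOffset messageAndOffset : fetchResponse.messageSet(topic, partition)) {
        // process messageAndOffset.message() ...
        nextOffset = messageAndOffset.nextOffset();
    }
}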
