I am doing some performance tests of an HTML stripper (written in Java): I pass a string (HTML content) to a method of the HTML stripper, and it returns plain text (without HTML tags and meta information).
Here is the concrete implementation:
public void performanceTest() throws IOException {
    long totalTime = 0;
    File file = new File("/directory/to/ten/different/htmlFiles");
    for (int i = 0; i < 200; ++i) {
        for (File fileEntry : file.listFiles()) {
            HtmlStripper stripper = new HtmlStripper();
            URL url = fileEntry.toURI().toURL();
            InputStream inputStream = url.openStream();
            String html = IOUtils.toString(inputStream, "UTF-8");
            long start = System.currentTimeMillis();
            String text = stripper.getText(html);
            long end = System.currentTimeMillis();
            totalTime = totalTime + (end - start);
            // The duration for stripping each file is printed here (200 times per file).
            // That duration decreases over the iterations and then becomes constant.
            // IMHO the duration for the same file should always remain the same.
            // Or is some caching technique used by the JVM?
            System.out.println("time needed for stripping current file: " + (end - start));
        }
    }
    System.out.println("Average time for one document: " + (totalTime / 2000));
}
But the duration for stripping each file is measured 200 times, and the values keep decreasing before leveling off. IMHO the duration for one and the same file X should always remain the same. Or is some caching technique used by the JVM?
Any help would be appreciated.
Thanks in advance
Horace
N.B.:
- I am doing the tests locally (NO remote, NO HTTP) on my machine.
- I am using Java 6 on Ubuntu 10.04.
This is totally normal. The JIT compiles methods to native code and optimizes them more heavily as they're more and more heavily used. (The "constant" your benchmark eventually converges to is the peak of the JIT's optimization capabilities.)
You cannot get good benchmarks in Java without running the method many times before you start timing at all.
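For illustration, a minimal warm-up sketch (reusing the html and HtmlStripper names from the question; the iteration counts are arbitrary): run the stripper many times first, throw those timings away, and only measure afterwards.
// Warm-up sketch: let the JIT compile and optimize getText() before measuring.
// 'html' and 'HtmlStripper' are the same names as in the question; the counts are arbitrary.
HtmlStripper stripper = new HtmlStripper();
int sink = 0;
for (int i = 0; i < 10000; i++) {
    sink += stripper.getText(html).length(); // warm-up phase: timings discarded
}
long start = System.nanoTime();
for (int i = 0; i < 1000; i++) {
    sink += stripper.getText(html).length(); // measured phase, only after warm-up
}
long avgNanos = (System.nanoTime() - start) / 1000;
System.out.println("average per call: " + avgNanos + " ns (sink=" + sink + ")");
Accumulating into sink keeps the calls from being eliminated as dead code by the optimizer.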
IMHO if the duration for one and the same file X should always remain the same
Not in the presence of an optimizing just-in-time compiler. Among other things it keeps track of how many times a particular method/branch is used, and selectively compiles Java byte codes into native code.
I believe my method is leaking memory, since in the profiler the number of "Surviving generations" keeps increasing.
In production I get "OOM heap space" errors after a while, and I now think my method is the culprit.
As background, my method's goal is to retrieve the documents that already exist in an index. The resulting list is then used to decide whether each document can remain in the index or should be removed (e.g. because the corresponding document has been deleted from disk):
public final List<MyDocument> getListOfMyDocumentsAlreadyIndexed() throws SolrServerException, HttpSolrClient.RemoteSolrException, IOException {
    final SolrQuery query = new SolrQuery("*:*");
    query.addField("id");
    query.setRows(Integer.MAX_VALUE); // we want ALL documents in the index, not only the first ones
    SolrDocumentList results = this.getSolrClient().query(query).getResults();
    listOfMyDocumentsAlreadyIndexed = results.parallelStream() // tried to replace with stream() with the same behaviour
            .map((doc) -> {
                MyDocument tmpDoc = new MyDocument();
                tmpDoc.setId((String) doc.getFirstValue("id"));
                // Usually there are things done here to set some boolean fields
                // that I have removed for the test and this question
                return tmpDoc;
            })
            .collect(Collectors.toList());
    return listOfMyDocumentsAlreadyIndexed;
}
The test for this method makes the following call in a for loop 300 times (this simulates the indexing loops, since my program indexes one index after the other):
List<MyDocument> listOfExistingDocsInIndex = index.getListOfMyDocumentsAlreadyIndexed();
I tried setting it to null after use (in the test the list is not actually used; this was just to see if it has any effect), without any noticeable change:
listOfExistingDocsInIndex = null;
This is the call tree I get from the NetBeans profiler (I've just started using the profiler):
What can I change / improve to avoid this memory leak (it is actually a memory leak, isn't it)?
Any help appreciated :-),
So far I've found that to avoid memory leaks while retrieving all documents from an index, one has to avoid using:
query.setRows(Integer.MAX_VALUE);
Instead, documents have to be retrieved chunk by chunk, with a chunk size between 100 and 200 documents, as described in Solr's wiki:
SolrQuery q = (new SolrQuery(some_query)).setRows(r).setSort(SortClause.asc("id"));
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = solrServer.query(q);
    String nextCursorMark = rsp.getNextCursorMark();
    doCustomProcessingOfResults(rsp);
    if (cursorMark.equals(nextCursorMark)) {
        done = true;
    }
    cursorMark = nextCursorMark;
}
Now the number of surviving generations remains stable over time.
The drawback is that the garbage collector is much more active and retrieval is much slower (I did not benchmark it, so I have no metrics to show).
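For reference, a minimal sketch of the question's method rewritten to page with cursorMark (assuming the same MyDocument class and getSolrClient() accessor as above; the chunk size of 200 and the extra SolrJ imports such as QueryResponse, SolrDocument and CursorMarkParams are the only additions):
public final List<MyDocument> getListOfMyDocumentsAlreadyIndexed()
        throws SolrServerException, IOException {
    final List<MyDocument> docs = new ArrayList<>();
    final SolrQuery query = new SolrQuery("*:*");
    query.addField("id");
    query.setRows(200);                            // small chunks instead of Integer.MAX_VALUE
    query.setSort(SolrQuery.SortClause.asc("id")); // cursorMark requires a stable sort
    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    boolean done = false;
    while (!done) {
        query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
        QueryResponse rsp = this.getSolrClient().query(query);
        for (SolrDocument doc : rsp.getResults()) {
            MyDocument tmpDoc = new MyDocument();
            tmpDoc.setId((String) doc.getFirstValue("id"));
            docs.add(tmpDoc);
        }
        String nextCursorMark = rsp.getNextCursorMark();
        done = cursorMark.equals(nextCursorMark);  // an unchanged cursor means everything has been read
        cursorMark = nextCursorMark;
    }
    return docs;
}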
I'm trying to write a program that adds every single file and folder name on my C: drive to an ArrayList. The code works fine, but because of the massive amount of recursion, it gets painfully slow. Here is the code:
public static void updateFileDataBase()
{
    ArrayList<String> currentFiles = new ArrayList<String>();
    addEverythingUnder("C:/", currentFiles, new String[]{"SteamApps", "AppData"});
    for(String name : currentFiles)
        System.out.println(name);
}

private static void addEverythingUnder(String path, ArrayList<String> list, String[] exceptions)
{
    System.gc();
    System.out.println("searching " + path);
    File search = new File(path);
    try
    {
        for(int i = 0; i < search.list().length; i++)
        {
            boolean include = true;
            for(String exception : exceptions)
                if(search.list()[i].contains(exception))
                    include = false;
            if(include)
            {
                list.add(search.list()[i]);
                if(new File(path + "/" + search.list()[i]).isDirectory())
                {
                    addEverythingUnder(path + "/" + search.list()[i], list, exceptions);
                }
            }
        }
    }
    catch(Exception error)
    {
        System.out.println("ACCESS DENIED");
    }
}
I was wondering if there was anything at all that I could do to speed up the process. Thanks in advance :)
Program slowing down due to recursion
No it isn't. Recursion doesn't make things slow. Poor algorithms and bad coding make things slow.
For example, you are calling File.list() four times for every file you process, as well as once per directory. You can save a factor of O(N) by calling it only once per directory:
for(File file : search.listFiles())
{
    String name = file.getName();
    boolean include = true;
    for(String exception : exceptions)
        if(name.contains(exception))
            include = false;
    if(include)
    {
        list.add(name);
        if(file.isDirectory())
        {
            addEverythingUnder(file, list, exceptions);
        }
    }
}
There is (as of Java 7) a built-in way to do this, Files.walkFileTree, which is much more efficient and removes the need to reinvent the wheel. It calls into a FileVisitor for every entry it finds. There are a couple of examples on the FileVisitor page to get you started.
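For illustration, a minimal Files.walkFileTree sketch (Java 7+). The root path and the exclusion names mirror the question; everything else (class name, what gets collected) is just an assumption for the example:
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

public class ListEverything {
    public static void main(String[] args) throws IOException {
        final List<String> names = new ArrayList<String>();
        final String[] exceptions = {"SteamApps", "AppData"}; // same exclusions as in the question
        Files.walkFileTree(Paths.get("C:/"), new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                String name = dir.getFileName() == null ? "" : dir.getFileName().toString();
                for (String ex : exceptions) {
                    if (name.contains(ex)) {
                        return FileVisitResult.SKIP_SUBTREE; // skip excluded directories entirely
                    }
                }
                if (!name.isEmpty()) {
                    names.add(name);
                }
                return FileVisitResult.CONTINUE;
            }
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                names.add(file.getFileName().toString());
                return FileVisitResult.CONTINUE;
            }
            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                return FileVisitResult.CONTINUE; // e.g. access denied: keep walking instead of aborting
            }
        });
        System.out.println(names.size() + " entries collected");
    }
}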
Is there a particular reason for reinventing the wheel?
If you don't mind, please use FileUtils.listFiles:
http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/FileUtils.html#listFiles(java.io.File, java.lang.String[], boolean)
because of the massive amount of recursion, it gets painfully slow
While your code is very inefficient, as EJP suggests, I suspect the problem is even more basic. Accessing a large number of files takes time because of disk reads (only the first time; reading the same files again and again is much quicker because they are cached). Opening files is also pretty slow on an HDD.
A typical HDD has a seek time of about 8 ms; if finding and opening a file takes two operations, you are looking at 16 ms per file. Say you have 10,000 files: this will take at least 160 seconds, no matter how efficient you make the code. BTW, if you use a decent SSD, this will take about 1 second.
In short, you are likely hitting a hardware limit that has nothing to do with how you wrote your software. Shorter: don't have large numbers of files if you want performance.
What is a more efficient way to retrieve the hard link count of a file, one that scales to high numbers of files?
I'm writing an application that scans all files on a volume to draw a graph. It is similar to a freeware program inconveniently called Scanner, which does take hard links into account and scans really fast, faster than I can achieve in Java even without checking hard links.
I already tried checking the hard link count in the following (slow) ways:
(Examples are very simplified for readability)
Via the command-line program stat (on Windows):
process = Runtime.getRuntime().exec(new String[]{"stat", "--printf=%h", "\"" + filePath + "\""});
in = new BufferedReader(new InputStreamReader(process.getInputStream()));
String inpMsg = in.readLine();
linkCount = Integer.parseInt(inpMsg);
and using a JNI call to GetFileInformationByHandle:
String lpFileName = filePath;
int dwShareMode = Kernel32.FILE_SHARE_READ | Kernel32.FILE_SHARE_WRITE;
Pointer lpSecurityAttributes = null;
int dwCreationDisposition = Kernel32.OPEN_EXISTING;
int dwFlagsAndAttributes = 0;
int hTemplateFile = 0;
hFile = Kernel32.INSTANCE.CreateFile(lpFileName, dwDesiredAccess, dwShareMode, lpSecurityAttributes, dwCreationDisposition, dwFlagsAndAttributes, hTemplateFile);
Memory lpFileInformation = new Memory(56);
Kernel32.INSTANCE.GetFileInformationByHandle(hFile, lpFileInformation);
linkCount = lpFileInformation.getInt(40);
To give an idea of why I want a faster method, here is a list of how fast different processes can iterate over all files on my C: drive (170000 files):
Alt+Enter on C:\: 19000 files per second (9 seconds)
Scanner (mentioned above): 7800 files per second (22 seconds)
Java (no Hard links): 1750 files per second (98 seconds)
Java (with JNI): 40 files per second (1:10 hours (projected))
Java (with STAT): 8 files per second (5:50 hours (projected))
The fact that Java is slower than Scanner might have to do with the fact that I'm using File.listFiles() instead of the new FileVisitor, but I won't accept a speed of 40 files/second which is 43 times slower than without hard links.
(I ran these tests after already scanning several times before. The first scan always takes several times longer)
OrientDB's official site says:
"On common hardware stores up to 150.000 documents per second, 10 billions of documents per day. Big Graphs are loaded in few milliseconds without executing costly JOIN such as the Relational DBMSs."
But executing the following code shows that it takes ~17000 ms to insert 150,000 simple documents:
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;

public final class OrientDBTrial {
    public static void main(String[] args) {
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("remote:localhost/foo");
        try {
            db.open("admin", "admin");
            long a = System.currentTimeMillis();
            for (int i = 1; i < 150000; ++i) {
                final ODocument foo = new ODocument("Foo");
                foo.field("code", i);
                foo.save();
            }
            long b = System.currentTimeMillis();
            System.out.println(b - a + "ms");
            for (ODocument doc : db.browseClass("Foo")) {
                doc.delete();
            }
        } finally {
            db.close();
        }
    }
}
My hardware:
Dell Optiplex 780
Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93 GHz
8 GB RAM
Windows 7 64-bit
What am I doing wrong?
Splitting the saves across 10 concurrent threads to minimize Java's overhead made it run in ~13000 ms, which is still far slower than what OrientDB's front page claims.
You can achieve that by using a 'Flat Database' and OrientDB as an embedded library in Java; see more explained here:
http://code.google.com/p/orient/wiki/JavaAPI
What you are using is server mode, which sends many requests to the OrientDB server. Judging by your benchmark, you got ~10,000 inserts per second, which is not bad; I think 10,000 requests/s is very good performance for any web server (and the OrientDB server actually is a web server; you can query it through HTTP, but I think Java uses the binary mode).
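For illustration, a minimal sketch of the embedded approach (the database path and credentials are placeholders; the URL prefix is "plocal:" on recent OrientDB releases and "local:" on older ones, so adjust it to your version; the imports are the same as in the question's code):
// Same insert loop as the question, but against an embedded (local) database instead of remote:
ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/data/databases/foo");
if (db.exists()) {
    db.open("admin", "admin");
} else {
    db.create();                             // first run: create the embedded database
}
try {
    long start = System.currentTimeMillis();
    for (int i = 1; i < 150000; ++i) {
        ODocument foo = new ODocument("Foo");
        foo.field("code", i);
        foo.save();
    }
    System.out.println((System.currentTimeMillis() - start) + "ms");
} finally {
    db.close();
}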
The numbers from the OrientDB site are benchmarked for a local database (with no network overhead), so if you use a remote protocol, expect some delays.
As Krisztian pointed out, reuse objects if possible.
Read the documentation first on how to achieve the best performance!
A few tips:
-> Do NOT instantiate a new ODocument every time; create it once and reset it:
final ODocument doc = new ODocument();   // created once, reused across iterations
for (...) {
    doc.reset();
    doc.setClassName("Class");
    // Put data to fields
    doc.save();
}
-> Do NOT rely on System.currentTimeMillis() alone; use perf4j or a similar tool to measure times, because currentTimeMillis() measures global wall-clock time and hence includes the execution time of all the other programs running on your system!
I have a problem with millis being set and read on Android 2.3.4 on a Nexus One. This is the code:
File fileFolder = new File(Environment.getExternalStorageDirectory(), appName + "/"
        + URLDecoder.decode(folder.getUrl()));
if (fileFolder != null && !fileFolder.exists()) {
    fileFolder.setLastModified(1310198774);
    fileFolder.mkdirs();
    fileFolder.setLastModified(1310198774);
}
if (fileFolder != null && fileFolder.exists()) {
    long l = fileFolder.lastModified();
}
In this small test I write 1310198774 but the result that is returned from lastModified() is 1310199771000.
Even if I cut the trailing "000" there's a difference of several minutes.
I need to sync files between a web service and the Android device. The last-modification millis are part of the data sent by this service. I set the millis on the created/copied files and folders in order to check whether a file/folder needs to be overwritten.
Everything is working BUT the millis that are returned from the filesystem are different from the values that were set.
I'm pretty sure there's something wrong with my code - but I can't find it.
Many thanks in advance.
HJW
On Jelly Bean+, it's different (so far mostly on Nexus devices, and others that use the new FUSE layer for /mnt/shell/emulated sdcard emulation):
It's a VFS permission problem: the syscall utimensat() fails with EPERM due to inappropriate permissions (e.g. ownership).
In platform/system/core/sdcard/sdcard.c:
/* all files owned by root.sdcard */
attr->uid = 0;
attr->gid = AID_SDCARD_RW;
From utimensat()'s syscall man page:
2. the caller's effective user ID must match the owner of the file; or
3. the caller must have appropriate privileges.
To make any change other than setting both timestamps to the current time
(i.e., times is not NULL, and both tv_nsec fields are not UTIME_NOW and both
tv_nsec fields are not UTIME_OMIT), either condition 2 or 3 above must apply.
Old FAT offers an override of the iattr->valid flag via a mount option to allow anyone to change timestamps; FUSE and Android's sdcard-FUSE don't do this at the moment (so the inode_change_ok() call fails) and the attempt gets rejected with -EPERM. Here's FAT's ./fs/fat/file.c:
/* Check for setting the inode time. */
ia_valid = attr->ia_valid;
if (ia_valid & TIMES_SET_FLAGS) {
    if (fat_allow_set_time(sbi, inode))
        attr->ia_valid &= ~TIMES_SET_FLAGS;
}
error = inode_change_ok(inode, attr);
I also added this info to this open bug.
Maybe I'm missing something, but I see some problems with your code above. Your specific problem may be due (as #JB mentioned) to Android issues, but for posterity I thought I'd provide an answer.
First off, File.setLastModified() takes the time in milliseconds. Here are the javadocs. You seem to be trying to set it in seconds. So your code should be something like:
fileFolder.setLastModified(1310198774000L);
As mentioned in the javadocs, many filesystems only support seconds granularity for last-modification time. So if you need to see the same modification time in a file then you should do something like the following:
private void changeModificationFile(File file, long time) {
// round the value down to the nearest second
file.setLastModified((time / 1000) * 1000);
}
If all of this doesn't work, try this (ugly) workaround, quoted from https://code.google.com/p/android/issues/detail?id=18624:
//As a workaround, this ugly hack will set the last modified date to now:
RandomAccessFile raf = new RandomAccessFile(file, "rw");
long length = raf.length();
raf.setLength(length + 1);
raf.setLength(length);
raf.close();
Works on some devices but not on others. Do not design a solution that relies on it working. See https://code.google.com/p/android/issues/detail?id=18624#c29
Here is a simple test to see if it works.
public void testSetLastModified() throws IOException {
    long time = 1316137362000L;
    File file = new File("/mnt/sdcard/foo");
    file.createNewFile();
    file.setLastModified(time);
    assertEquals(time, file.lastModified());
}
If you only want to change the date/time of a directory to the current date/time (i.e., "now"), you can create a temporary file inside that directory, write something into it, and then immediately delete it. This has the effect of changing the lastModified() date/time of the directory to the present date/time. This won't work, though, if you want to set the directory's date/time to some other arbitrary value, and obviously it can't be applied to a file.
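For illustration, a minimal sketch of that temporary-file trick (the class and method names are made up for the example):
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public final class DirTouch {
    /** Bumps a directory's lastModified() to "now" by creating and deleting a temp file inside it. */
    public static void touchDirectory(File dir) throws IOException {
        File tmp = File.createTempFile("touch", ".tmp", dir); // prefix/suffix are arbitrary
        FileOutputStream out = new FileOutputStream(tmp);
        try {
            out.write(0); // write a byte so the directory entry is definitely updated
        } finally {
            out.close();
        }
        if (!tmp.delete()) {
            tmp.deleteOnExit(); // best-effort cleanup
        }
    }
}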