I am wondering if it is possible to get load-speed information when using the Jena Java API.
The code I use to load "large" files (a few GB) is this:
try (InputStream in = new FileInputStream(arguments.input)) {
    RDFParser.create()
        .source(in)
        .lang(lang)
        .errorHandler(ErrorHandlerFactory.errorHandlerStrict)
        .base("http://example.com/")
        .parse(model);
}
The loading seems to work, but I have no clue about the speed, the number of triples parsed, etc. Is there a way to get such statistics every n triples or n seconds? I am not using tdbloader2, as this code is part of a bigger program.
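One way to get such numbers is to wrap the parser's destination in a StreamRDF that counts triples as they stream past and prints a rate. A minimal sketch, assuming the standard Jena RIOT classes and reusing the variables from the code above; the CountingStream name and the 100,000-triple reporting interval are arbitrary:

import org.apache.jena.graph.Triple;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.system.ErrorHandlerFactory;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFLib;
import org.apache.jena.riot.system.StreamRDFWrapper;

// Wraps the real destination and prints a progress line every 'reportEvery' triples.
class CountingStream extends StreamRDFWrapper {
    private final long reportEvery;
    private long count = 0;
    private long startMillis;

    CountingStream(StreamRDF destination, long reportEvery) {
        super(destination);
        this.reportEvery = reportEvery;
    }

    @Override public void start() {
        startMillis = System.currentTimeMillis();
        super.start();
    }

    @Override public void triple(Triple triple) {
        super.triple(triple);
        if (++count % reportEvery == 0) {
            double seconds = (System.currentTimeMillis() - startMillis) / 1000.0;
            System.out.printf("%,d triples in %.1fs (%.0f triples/s)%n",
                    count, seconds, count / seconds);
        }
    }
}

// Usage: send the parser output through the counter into the model's graph.
StreamRDF dest = new CountingStream(StreamRDFLib.graph(model.getGraph()), 100_000);
RDFParser.create()
    .source(in)
    .lang(lang)
    .errorHandler(ErrorHandlerFactory.errorHandlerStrict)
    .base("http://example.com/")
    .parse(dest);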
I have been working recently with Jena TDB. My goal is to store an RDF file, which is a representation of an RDF graph.
Everything works fine with my code, and I am able to query what I have stored as well. But I am still not sure whether my data was completely stored or not.
I know that Jena TDB indexes the content of the file and that several indexes are built for one file, which are stored in a specified folder. But how do I check that the database is created and that all the RDF files I provide to TDB are stored alongside the previous ones?
Is there any way to do so, online maybe or in Java? And is my code enough to work with a big amount of data or not?
public static void main(String[] args) {
    String directory = "/*location*/";
    Dataset dataset = TDBFactory.createDataset(directory);
    Model tdb = dataset.getNamedModel("RDFData");

    // read the input file
    String source = "/*location*/rdfstorage.rdf";
    FileManager.get().readModel(tdb, source);

    tdb.close();
    dataset.close();
}
Check the location and see if the files have been updated.
It is better to use a transaction: your code is OK, but if it is interrupted, the store may be corrupted.
https://jena.apache.org/documentation/tdb/tdb_transactions.html
If the source is large, use the bulkloader from the command line.
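For reference, a minimal sketch of the transactional version of the load, assuming the TDB1 API under org.apache.jena and using RDFDataMgr in place of the deprecated FileManager; the paths are the question's placeholders:

import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.tdb.TDBFactory;

public class TdbLoad {
    public static void main(String[] args) {
        Dataset dataset = TDBFactory.createDataset("/*location*/");
        dataset.begin(ReadWrite.WRITE);                      // open a write transaction
        try {
            Model tdb = dataset.getNamedModel("RDFData");
            RDFDataMgr.read(tdb, "/*location*/rdfstorage.rdf");
            dataset.commit();                                // changes become durable here
        } finally {
            dataset.end();                                   // release the transaction even on failure
        }
        dataset.close();
    }
}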
I know that Apache Spark was primarily developed to analyze unstructured data. However, I have to read and process a huge XML file (greater than 1 GB), and I have to use Apache Spark as a requirement.
Googling a little, I found how an XML file can be read by a Spark process with proper partitioning. As described here, the hadoop-streaming library can be used, like this:
val jobConf = new JobConf()
jobConf.set("stream.recordreader.class",
            "org.apache.hadoop.streaming.StreamXmlRecordReader")
jobConf.set("stream.recordreader.begin", "<page")
jobConf.set("stream.recordreader.end", "</page>")
org.apache.hadoop.mapred.FileInputFormat.addInputPaths(jobConf, s"hdfs://$master:9000/data.xml")

// Load documents, splitting with respect to the <page> tag.
val documents = sparkContext.hadoopRDD(
  jobConf,
  classOf[org.apache.hadoop.streaming.StreamInputFormat],
  classOf[org.apache.hadoop.io.Text],
  classOf[org.apache.hadoop.io.Text])
Each chunk of information can then be turned into a Scala/Java object using dom4j or JAXB (more complex).
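For example, the dom4j route could look roughly like this; parsePage is a hypothetical helper and assumes each record value holds one complete <page> element:

import java.io.StringReader;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.io.SAXReader;

public class PageParser {
    // Hypothetical helper: parse the text of one <page>...</page> record into a dom4j Document.
    public static Document parsePage(String pageXml) throws DocumentException {
        return new SAXReader().read(new StringReader(pageXml));
    }
}

Each Text value in the documents RDD could then be mapped through a helper like this before further processing.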
Now, the problem is the following: the XML file should be validated before processing it. How can I do this in a way that fits Spark? As far as I know, the StreamXmlRecordReader used to split the file does not perform any validation.
I'm using Weka for a sentiment analysis project I'm working on. I'm using the Weka CSV Loader to load the training instances from a CSV file, but for some reason, if I want to load more than 70 instances, the program throws a "java.lang.ArrayIndexOutOfBoundsException: 2" exception. I found that you can pass options to the Weka CSV Loader:
-B
The size of the in memory buffer (in rows).
(default: 100)
This may be the one I need to set to get rid of the error, but I'm not sure how to do that from a Java project. If anyone can help me with this, I would appreciate it greatly.
UPDATE: The buffer size change didn't help; the problem comes from somewhere else.
How I'm using the loader:
private void getTrainingDataset(final String INPUT_FILENAME)
{
    try {
        // reading the training dataset from a CSV file
        CSVLoader trainingLoader = new CSVLoader();
        trainingLoader.setSource(new File(INPUT_FILENAME));
        inputDataset = trainingLoader.getDataSet();
    } catch (IOException ex) {
        System.out.println("Exception in getTrainingDataset Method");
    }
}
UPDATE: For those who want to know where the exception occurs:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
at weka.core.converters.CSVLoader.getInstance(CSVLoader.java:1251)
at weka.core.converters.CSVLoader.readData(CSVLoader.java:866)
at weka.core.converters.CSVLoader.readHeader(CSVLoader.java:1150)
at weka.core.converters.CSVLoader.getStructure(CSVLoader.java:924)
at weka.core.converters.CSVLoader.getDataSet(CSVLoader.java:836)
at sentimentanalysis.SentimentAnalysis.getTrainingDataset(SentimentAnalysis.java:209)
at sentimentanalysis.SentimentAnalysis.trainClassifier(SentimentAnalysis.java:134)
at sentimentanalysis.SentimentAnalysis.main(SentimentAnalysis.java:282)
UPDATE: Even for under 70 instances, the classifier also gives an error after a while. Everything works fine for around 10-20 instances, but it all falls apart beyond that :)
Weka reads the CSV twice: the first pass, limited to the buffer size (in rows), extracts the classes of the nominal attributes; the second pass reads the entire file.
The classes of each nominal attribute must match the classes of the training set (no more, no less).
Increase the buffer size to more than the number of rows.
If an error still occurs, look for a class that is not present in both files.
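A minimal sketch of raising that buffer size from Java, assuming CSVLoader accepts the -B option shown above through Weka's usual option handling; the file name and the value 1000 are arbitrary:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class LoadWithBiggerBuffer {
    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setOptions(new String[] {"-B", "1000"});  // buffer more rows than the file contains
        loader.setSource(new File("training.csv"));      // hypothetical CSV file
        Instances data = loader.getDataSet();
        System.out.println(data.numInstances() + " instances loaded");
    }
}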
I have successfully integrated the LibSVM API into my Java code. I need to convert a large document collection into a numerical representation and feed it to the LibSVM classifier. As far as I know, Weka has the ability to convert documents into feature vectors. Can anyone please tell me how to do that?
You can do it like this:
DataSource source = new DataSource("mycsvinputfile");
System.out.println(source.getStructure());
Instances data = source.getDataSet();

// Set the class attribute if the data format does not provide this information.
// (For example, the XRFF format saves the class attribute information as well.)
if (data.classIndex() == -1)
    data.setClassIndex(data.numAttributes() - 1);

// initialize the SVM classifier
LibSVM svm = new LibSVM();
svm.buildClassifier(data);
Don't forget to include weka.jar, libsvm.jar, and wlsvm.jar (the LibSVM wrapper) in your project, i.e. add all three jars to your build path or classpath.
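Since the question is specifically about turning documents into feature vectors, here is a minimal sketch using Weka's StringToWordVector filter; the file name documents.arff and its layout (one string attribute holding the document text plus a nominal class attribute) are assumptions:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class DocumentsToVectors {
    public static void main(String[] args) throws Exception {
        // Load the raw documents and mark the last attribute as the class.
        Instances raw = new DataSource("documents.arff").getDataSet();
        raw.setClassIndex(raw.numAttributes() - 1);

        // Turn the string attribute into bag-of-words features.
        StringToWordVector filter = new StringToWordVector();
        filter.setLowerCaseTokens(true);   // normalize case before tokenizing
        filter.setTFTransform(true);       // term-frequency weighting
        filter.setIDFTransform(true);      // inverse-document-frequency weighting
        filter.setInputFormat(raw);
        Instances vectorized = Filter.useFilter(raw, filter);

        System.out.println(vectorized.numAttributes() + " features generated");
        // 'vectorized' can then be passed to LibSVM's buildClassifier as in the snippet above.
    }
}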
I have a Google App Engine app that converts XML to CSV files. It works fine for small XML inputs, but refuses to finalize the file for larger input XML. The XML is read from, and the resulting CSV files are written to, many times before finalization, over a long-running (multi-day) task.

My problem is different from this FileServiceFactory getBlobKey throws IllegalArgumentException, since my code works fine in both production and development with small input files, so it's not that I'm neglecting to write to the file before closing/finalizing. The failure appears only when I attempt to read from a larger XML file. The input XML file is ~150 MB, and each of the resulting 5 CSV files is much smaller (perhaps 10 MB). I persisted the file URLs for the new CSV files, and even tried to close them with some static code, but I just reproduce the same error, which is:
java.lang.IllegalArgumentException: creation_handle: String properties must be 500 characters or less. Instead, use com.google.appengine.api.datastore.Text, which can store strings of any length.
at com.google.appengine.api.datastore.DataTypeUtils.checkSupportedSingleValue(DataTypeUtils.java:242)
at com.google.appengine.api.datastore.DataTypeUtils.checkSupportedValue(DataTypeUtils.java:207)
at com.google.appengine.api.datastore.DataTypeUtils.checkSupportedValue(DataTypeUtils.java:173)
at com.google.appengine.api.datastore.Query$FilterPredicate.<init>(Query.java:900)
at com.google.appengine.api.datastore.Query$FilterOperator.of(Query.java:75)
at com.google.appengine.api.datastore.Query.addFilter(Query.java:351)
at com.google.appengine.api.files.FileServiceImpl.getBlobKey(FileServiceImpl.java:329)
But I know that it's not a String/Text data type issue, since I am already using file service URLs of similar length for the previous successful attempts with smaller files. It also wasn't an issue for the other Stack Overflow post I linked above. I also tried putting one last meaningless write before finalizing, just in case it would help as it did for the other post, but it made no difference. So there's really no way for me to debug this... Here is my file-closing code that is not working. It's pretty similar to the Google how-to example at http://developers.google.com/appengine/docs/java/blobstore/overview#Writing_Files_to_the_Blobstore.
log.info("closing out file 1");
try {
//locked set to true
FileWriteChannel fwc1 = fileService.openWriteChannel(csvFile1, true);
fwc1.closeFinally();
} catch (IOException ioe) {ioe.printStackTrace();}
// You can't get the blob key until the file is finalized
BlobKey blobKeyCSV1 = fileService.getBlobKey(csvFile1);
log.info("csv blob storage key is:" + blobKeyCSV1.getKeyString());
csvUrls[i-1] = blobKeyCSV1.getKeyString();
break;
At this point, I just want to finalize my new blob files, for which I have the URLs, but cannot. How can I get around this issue, and what may be the cause? Again, my code works for small files (~60 kB), but the ~150 MB input file fails. Thank you for any advice on what is causing this or how to get around it! Also, how long will my unfinalized files stick around before being deleted?
This issue was a bug in the Java MapReduce and Files API, which was recently fixed by Google. Read the announcement here: groups.google.com/forum/#!topic/google-appengine/NmjYYLuSizo