Apache Beam get source File Name

Apache Beam get source File Name - java

EDIT: RESOLVED!
I have multiple Text files from multiple languages. I want to add a language tag to each line using Apache Beam.
Example:
File text_en:
This is a sentence.
File text_de: Dies ist ein Satz.
What I want is this:
en: This is a sentence.
de: Dies ist ein Satz.
What I've tried:
I initially tried to just use one TextIO.Read.From(dataSetDirectory+"/*") and look for an option that looks something like .getSource(). However, this doesn't seem to exist.
Next I tried to read every File one by one like this:
File[] files = new File(datasetDirectory).listFiles();
PCollectionList<String> dataSet=null;
for (File f: files) {
String language = f.getName();
logger.debug(language);
PCollection<String> newPCollection = p.apply(
TextIO.Read.from(f.getAbsolutePath()))
.apply(ParDo.of(new LanguageTagAdder(language)));
if (dataSet==null) {
dataSet=PCollectionList.of(newPCollection);
} else {
dataSet.and(newPCollection);
}
}
PCollection<String> completeDataset= dataSet.apply(Flatten.<String>pCollections())
Reading the Files this way works perfectly fine, however my DoFn LanguageTagAdder is only initialized with the first language, thus all Files have the same added language.
LanguageTagAdder looks like this:
public class LanguageTagAdder
extends DoFn<String,String> {
private String language;
public LanguageTagAdder(String language) {
this.language=language;
}
#ProcessElement
public void processElement(ProcessContext c) {
c.output(language+c.element());
}
}
I realize this behavior is intended and needed so that the data can be parrallelized, but how would I go about solving my Problem? Is there a Beam-way to solve it?
PS: I get following warnings when creating a new LanguageTagAdder for the second time (with a second language):
DEBUG 2016-12-05 17:09:55,070 [main] de.kdld16.hpi.FusionDataset - en
DEBUG 2016-12-05 17:09:56,216 [main] de.kdld16.hpi.FusionDataset - de
WARN 2016-12-05 17:09:56,219 [main] org.apache.beam.sdk.Pipeline - Transform TextIO.Read2 does not have a stable unique name. This will prevent updating of pipelines.
EDIT:
The Problem was the line
dataSet.and(newPCollection);
It needed to be rewritten as:
dataSet=dataSet.and(newPCollection);
The way it was, dataSet only contained the first File.... No wonder they all had the same language Tag!

The Problem was the line
dataSet.and(newPCollection);
It needed to be rewritten as:
dataSet=dataSet.and(newPCollection);
The way it was, dataSet only contained the first File.

Related

JFlex Scanner ArrayIndexOutOfBoundsException: 769

I am trying to create a Scanner with JFlex. I created my .jflex file and it compiles and everything. The problem is that when I try to prove it, some times it gives me and error of ArrayIndexOutOfBoundsException: 769 in the .java class that JFlex created.
I am using Cup Parser generator too. I don't know if the problem can be related with the part of Cup Analysis, but here is how I called my analyzers.
ScannerLexico lexico = new ScannerLexico(new BufferedReader(new StringReader( jTextPane1.getText())));
ParserSintactico sintaxis = new ParserSintactico(lexico);
I don't know how to fix it. Please help me.
Here are the links to my code:
JFlex File "ScannerFranklin.jflex"
Java Class generated "ScannerLexico.java"
The part where I have the problem in the .java class created by JFlex, in next_token() function (Line 899 in java file).
int zzNext = zzTransL[ zzRowMapL[zzState] + zzCMapL[zzInput] ];
if (zzNext == -1) break zzForAction;
zzState = zzNext;
Thanks.

According to its documentation, JFlex throws ArrayIndexOutOfBounds exceptions whenever it encounters Unicode characters using the %7bit or %8bit/%full encoding options. It recommends to always use the %unicode option instead, which is the default.
You are using the %unicode option, but you're also using %full. Apparently when you have both options, %full takes precedence. So remove %full and the error should go away.

WEKA - Multi Class Classification - Can't find class called: weka.classifiers.functions.supportVector.RegSMOImproved

I'm trying to train a MultiClassClassifiermodel in Weka with the base algorithm set to weka.classifiers.functions.supportVector.RegSMOImproved class, with the following options:
MultiClassClassifier cModel = new MultiClassClassifier();
String options[] = {
"weka.classifiers.meta.MultiClassClassifier",
"-M","0",
"-R","2.0",
"-S","1",
"-W","weka.classifiers.functions.supportVector.RegSMOImproved",
"-P","1.0e-12",
"-L","1.0e-3",
"-W","1"
};
try {
cModel.setOptions(options);
} catch (Exception e) {
e.printStackTrace();
}
When I run my code I get the following error:
java.lang.Exception: Can't find class called: weka.classifiers.functions.supportVector.RegSMOImproved
at weka.core.Utils.forName(Utils.java:1073)
at weka.classifiers.AbstractClassifier.forName(AbstractClassifier.java:90)
at weka.classifiers.SingleClassifierEnhancer.setOptions(SingleClassifierEnhancer.java:108)
at weka.classifiers.RandomizableSingleClassifierEnhancer.setOptions(RandomizableSingleClassifierEnhancer.java:93)
at weka.classifiers.meta.MultiClassClassifier.setOptions(MultiClassClassifier.java:802)
at myApp.Main.trainMultiClassClassifier(Main.java:983)
at myApp.Main.createSets(Main.java:903)
at myApp.Main.main(Main.java:387)
What is the correct class path for use of RegSMOImproved algorithm if not weka.classifiers.functions.supportVector.RegSMOImproved?
Am I missing something else here, an additional setting perhaps, or some kind of a parent class?
I'm using Weka developer-branch from here. If there is anything I left out unintentionally please let me know and I'll make an edit asap.
Thank You in advance.
EDIT 1:
I'm trying to accomplish multi class classification where I would train my model/models as one class vs. the rest. My data is balanced (100 samples per class). This is what I've found so far:
http://weka.8497.n7.nabble.com/meta-multi-class-classifier-with-the-option-smo-td26548.html
EDIT 2:
So I've changed my options object to:
String options[] = {
"-M","0",
"-R","2.0",
"-S","1",
"-W","weka.classifiers.functions.SMO",
"--",
"-C","1",
"-L","0.001",
"-P","1.0e-12",
"-M",
"-N", "0",
"-V","-1",
"-W","1",
"-K", "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"
};
This seems to go through setOptions(), so I've clearly mixed the two SMO classes from supportVector and functions packages. I've also read that I need to set the -M and -V properties for SMO in order for my MultiClassClassifier to work correctly. So I've turned on "fitting calibration models to SVM outputs" with the -M property and I've set the number of folds for cross validation to -1 (default) with the -V property.
I assume the number of folds property for cross validation has to be set for testing purposes. Will have to check out posts on cross validation from this point.
Thank You again!

A) you probably shouldn't be using the developer branch unless you have a specific need. For all we know they are moving stuff around and its potentially broken
B) RegSMOImproved is for Regression , not classification. So some of your issues could be that miss match between the MultiClassClassifier and a regression algorithm.

Warning: File for type '[Insert class here]' created in the last round will not be subject to annotation processing

I switched an existing code base to Java 7 and I keep getting this warning:
warning: File for type '[Insert class here]' created in the last round
will not be subject to annotation processing.
A quick search reveals that no one has hit this warning.
It's not documented in the javac compiler source either:
From OpenJDK\langtools\src\share\classes\com\sun\tools\javac\processing\JavacFiler.java
private JavaFileObject createSourceOrClassFile(boolean isSourceFile, String name) throws IOException {
checkNameAndExistence(name, isSourceFile);
Location loc = (isSourceFile ? SOURCE_OUTPUT : CLASS_OUTPUT);
JavaFileObject.Kind kind = (isSourceFile ?
JavaFileObject.Kind.SOURCE :
JavaFileObject.Kind.CLASS);
JavaFileObject fileObject =
fileManager.getJavaFileForOutput(loc, name, kind, null);
checkFileReopening(fileObject, true);
if (lastRound) // <-------------------------------TRIGGERS WARNING
log.warning("proc.file.create.last.round", name);
if (isSourceFile)
aggregateGeneratedSourceNames.add(name);
else
aggregateGeneratedClassNames.add(name);
openTypeNames.add(name);
return new FilerOutputJavaFileObject(name, fileObject);
}
What does this mean and what steps can I take to clear this warning?
Thanks.

The warning
warning: File for type '[Insert class here]' created in the last round
will not be subject to annotation processing
means that your were running an annotation processor creating a new class or source file using a javax.annotation.processing.Filer implementation (provided through the javax.annotation.processing.ProcessingEnvironment) although the processing tool already decided its "in the last round".
This may be problem (and thus the warning) because the generated file itself may contain annotations being ignored by the annotation processor (because it is not going to do a further round).
The above ought to answer the first part of your question
What does this mean and what steps can I take to clear this warning?
(you figured this out already by yourself, didn't you :-))
What possible steps to take? Check your annotation processors:
1) Do you really have to use filer.createClassFile / filer.createSourceFile on the very last round of the annotaion processor? Usually one uses the filer object inside of a code block like
for (TypeElement annotation : annotations) {
...
}
(in method process). This ensures that the annotation processor will not be in its last round (the last round always being the one having an empty set of annotations).
2) If you really can't avoid writing your generated files in the last round and these files are source files, trick the annotation processor and use the method "createResource" of the filer object (take "SOURCE_OUTPUT" as location).

In OpenJDK test case this warning produced because processor uses "processingOver()" to write new file exactly at last round.
public boolean process(Set<? extends TypeElement> elems, RoundEnvironment renv) {
if (renv.processingOver()) { // Write only at last round
Filer filer = processingEnv.getFiler();
Messager messager = processingEnv.getMessager();
try {
JavaFileObject fo = filer.createSourceFile("Gen");
Writer out = fo.openWriter();
out.write("class Gen { }");
out.close();
messager.printMessage(Diagnostic.Kind.NOTE, "File 'Gen' created");
} catch (IOException e) {
messager.printMessage(Diagnostic.Kind.ERROR, e.toString());
}
}
return false;
}
I modified original example code a bit. Added diagnostic note "File 'Gen' created", replaced "*" mask with "org.junit.runner.RunWith" and set return value to "true". Produced compiler log was:
Round 1:
input files: {ProcFileCreateLastRound}
annotations: [org.junit.runner.RunWith]
last round: false
Processor AnnoProc matches [org.junit.runner.RunWith] and returns true.
Round 2:
input files: {}
annotations: []
last round: true
Note: File 'Gen' created
Compilation completed successfully with 1 warning
0 errors
1 warning
Warning: File for type 'Gen' created in the last round will not be subject to annotation processing.
If we remove my custom note from log, it's hard to tell that file 'Gen' was actually created on 'Round 2' - last round. So, basic advice applies: if in doubt - add more logs.
Where is also a little bit of useful info on this page:
http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/javac.html
Read section about "ANNOTATION PROCESSING" and try to get more info with compiler options:
-XprintProcessorInfo
Print information about which annotations a processor is asked to process.
-XprintRounds Print information about initial and subsequent annotation processing rounds.

I poked around the java 7 compiler options and I found this:
-implicit:{class,none}
Controls the generation of class files for implicitly loaded source files. To automatically generate class files, use -implicit:class. To suppress class file generation, use -implicit:none. If this option is not specified, the default is to automatically generate class files. In this case, the compiler will issue a warning if any such class files are generated when also doing annotation processing. The warning will not be issued if this option is set explicitly. See Searching For Types.
Source
Can you try and implicitly declare the class file.

Human editable JSON-like or YAML-like program configuration in Java

Is there a Java library similar to libconfig for C++, where the config file is stored in a JSON-like format that can be edited by humans, and later read from the program?
I don't want to use Spring or any of the larger frameworks. What I'm looking for is a small, fast, self-contained library. I looked at java.util.Properties, but it doesn't seem to support hierarchical/nested config data.

I think https://github.com/typesafehub/config is exactly what you are looking for. The format is called HOCON for Human-Optimized Config Object Notation and it a super-set of JSON.
Examples of HOCON:
HOCON that is also valid JSON:
{
"foo" : {
"bar" : 10,
"baz" : 12
}
}
HOCON also supports standard properties format, so the following is valid as well:
foo.bar=10
foo.baz=12
One of the features I find very useful is inheritance, this allows you to layer configurations. For instance a library would have a reference.conf, and the application using the library would have an application.conf. The settings in the application.conf will override the defaults in reference.conf.
Standard Behavior for loading configs:
The convenience method ConfigFactory.load() loads the following
(first-listed are higher priority):
system properties application.conf (all resources on classpath with
this name)
application.json (all resources on classpath with this
name)
application.properties (all resources on classpath with this
name)
reference.conf (all resources on classpath with this name)

I found this HOCON example:
my.organization {
project {
name = "DeathStar"
description = ${my.organization.project.name} "is a tool to take control over whole world. By world I mean couch, computer and fridge ;)"
}
team {
members = [
"Aneta"
"Kamil"
"Lukasz"
"Marcin"
]
}
}
my.organization.team.avgAge = 26
to read values:
val config = ConfigFactory.load()
config.getString("my.organization.project.name") // => DeathStar
config.getString("my.organization.project.description") // => DeathStar is a tool to take control over whole world. By world I mean couch, computer and fridge ;)
config.getInt("my.organization.team.avgAge") // => 26
config.getStringList("my.organization.team.members") // => [Aneta, Kamil, Lukasz, Marcin]
Reference: marcinkubala.wordpress.com

Apache Commons Configuration API and Constretto seem to be somewhat popular and support multiple formats (no JSON mentioned, though). I've personally never tried either, so YMMV.

There's a Java library to handle JSON files if that's what you're looking for:
http://www.json.org/java/index.html
Check out other tools on the main page:
http://json.org/

Problem validating against an XSD with Java5

I'm trying to validate an Atom feed with Java 5 (JRE 1.5.0 update 11). The code I have works without problem in Java 6, but fails when running in Java 5 with a
org.xml.sax.SAXParseException: src-resolve: Cannot resolve the name 'xml:base' to a(n) 'attribute declaration' component.
I think I remember reading something about the version of Xerces bundled with Java 5 having some problems with some schemas, but i cant find the workaround. Is it a known problem ? Do I have some error in my code ?
public static void validate() throws SAXException, IOException {
List<Source> schemas = new ArrayList<Source>();
schemas.add(new StreamSource(AtomValidator.class.getResourceAsStream("/atom.xsd")));
schemas.add(new StreamSource(AtomValidator.class.getResourceAsStream("/dc.xsd")));
// Lookup a factory for the W3C XML Schema language
SchemaFactory factory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
// Compile the schemas.
Schema schema = factory.newSchema(schemas.toArray(new Source[schemas.size()]));
Validator validator = schema.newValidator();
// load the file to validate
Source source = new StreamSource(AtomValidator.class.getResourceAsStream("/sample-feed.xml"));
// check the document
validator.validate(source);
}
Update : I tried the method below, but I still have the same problem if I use Xerces 2.9.0. I also tried adding xml.xsd to the list of schemas (as xml:base is defined in xml.xsd) but this time I have
Exception in thread "main" org.xml.sax.SAXParseException: schema_reference.4: Failed to read schema document 'null', because 1) could not find the document; 2) the document could not be read; 3) the root element of the document is not <xsd:schema>.
Update 2: I tried to configure a proxy with the VM arguments -Dhttp.proxyHost=<proxy.host.com> -Dhttp.proxyPort=8080 and now it works. I'll try to post a "real answer" from home.
and sorry, I cant reply as a comment : because of security reasons XHR is disabled from work ...

Indeed, people have been mentioning the Java 5 Sun provided SchemaFactory is giving troubles.
So: did you include Xerces in your project yourself?
After including Xerces, you need to ensure it is being used. If you like to hardcode it (well, as a minimal requirement you'd probably use some application properties file to enable and populate the following code):
String schemaFactoryProperty =
"javax.xml.validation.SchemaFactory:" + XMLConstants.W3C_XML_SCHEMA_NS_URI;
System.setProperty(schemaFactoryProperty,
"org.apache.xerces.jaxp.validation.XMLSchemaFactory");
SchemaFactory factory =
SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Or, if you don't want to hardcode, or when your troublesome code would be in some 3rd party library that you cannot change, set it on the java command line or environment options. For example (on one line of course):
set JAVA_OPTS =
"-Djavax.xml.validation.SchemaFactory:http://www.w3.org/2001/XMLSchema
=org.apache.xerces.jaxp.validation.XMLSchemaFactory"
By the way: apart from the Sun included SchemaFactory implementation giving trouble (something like com.sun.org.apache.xerces.internal.jaxp.validation.xs.schemaFactoryImpl), it also seems that the "discovery" of non-JDK implementations fails in that version. If I understand correctly than, normally, just including Xerces would in fact make SchemaFactory#newInstance find that included library, and give it precedence over the Sun implementation. To my knowledge, that fails as well in Java 5, making the above configuration required.

I tried to configure a proxy with the VM arguments -Dhttp.proxyHost=<proxy.host.com> -Dhttp.proxyPort=8080 and now it works.
Ah, I didn't realize that xml.xsd is in fact the one referenced as http://www.w3.org/2001/xml.xsd or something like that. That should teach us to always show some XML and XSD fragments as well. ;-)
So, am I correct to assume that 1.) to fix the Java 5 issue, you still needed to include Xerces and set the system property, and that 2.) you did not have xml.xsd available locally?
Before you found your solution, did you happen to try using getResource rather than getResourceAsStream, to see if the exception would then have showed you some more details?
If you actually did have xml.xsd available (so: if getResource did in fact yield a URL) then I wonder what Xerces was trying to fetch from the internet then. Or maybe you did not add that schema to the list prior to adding your own schemas? The order is important: dependencies must be added first.
For whoever gets tot his question using the search: maybe using a custom EntityResolver could have indicated the source of the problem as well (if only writing something to the log and just returning null to tell Xerces to use the default behavior).

Hmmm, just read your "comment" -- editing does not alert people for new replies, so time to ask your boss for some iPhone or some other gadget that is connected to the net directly ;-)
Well, I assume you added:
schemas.add(
new StreamSource(AtomValidator.class.getResourceAsStream("/xml.xsd")));
If so, is xml.xsd actually to be found on the classpath then? I wonder if the getResourceAsStream did not yield null in your case, and how new StreamSource(null) would act then.
Even if getResourceAsStream did not yield null, the resulting StreamSource would still not know where it was loaded from, which may be a problem when trying to include references. So, what if you use the constructor StreamSource(String systemId) instead:
schemas.add(new StreamSource(AtomValidator.class.getResource("/atom.xsd")));
schemas.add(new StreamSource(AtomValidator.class.getResource("/dc.xsd")));
You might also use StreamSource(InputStream inputStream, String systemId), but I don't see any advantage over the above two lines. However, the documentation explains why passing the systemId in either of the 2 constructors seems good:
This constructor allows the systemID to be set in addition to the input stream, which allows relative URIs to be processed.
Likewise, setSystemId(String systemId) explains a bit:
The system identifier is optional if there is a byte stream or a character stream, but it is still useful to provide one, since the application can use it to resolve relative URIs and can include it in error messages and warnings (the parser will attempt to open a connection to the URI only if there is no byte stream or character stream specified).
If this doesn't work out, then maybe some custom error handler can give you more details:
ErrorHandlerImpl errorHandler = new ErrorHandlerImpl();
validator.setErrorHandler(errorHandler);
:
:
validator.validate(source);
if(errorHandler.hasErrors()){
LOG.error(errorHandler.getMessages());
throw new [..];
}
if(errorHandler.hasWarnings()){
LOG.warn(errorHandler.getMessages());
}
...using the following ErrorHandler to capture the validation errors and continue parsing as far as possible:
import org.xml.sax.helpers.DefaultHandler;
private class ErrorHandlerImpl extends DefaultHandler{
private String messages = "";
private boolean validationError = false;
private boolean validationWarning = false;
public void error(SAXParseException exception) throws SAXException{
messages += "Error: " + exception.getMessage() + "\n";
validationError = true;
}
public void fatalError(SAXParseException exception) throws SAXException{
messages += "Fatal: " + exception.getMessage();
validationError = true;
}
public void warning(SAXParseException exception) throws SAXException{
messages += "Warn: " + exception.getMessage();
validationWarning = true;
}
public boolean hasErrors(){
return validationError;
}
public boolean hasWarnings(){
return validationWarning;
}
public String getMessages(){
return messages;
}
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Apache Beam get source File Name - java

The Problem was the line dataSet.and(newPCollection); It needed to be rewritten as: dataSet=dataSet.and(newPCollection); The way it was, dataSet only contained the first File.

Related

JFlex Scanner ArrayIndexOutOfBoundsException: 769

WEKA - Multi Class Classification - Can't find class called: weka.classifiers.functions.supportVector.RegSMOImproved

Warning: File for type '[Insert class here]' created in the last round will not be subject to annotation processing

Human editable JSON-like or YAML-like program configuration in Java

Problem validating against an XSD with Java5

Categories

Resources