Reuse normalizer in ND4J/DL4J

I wonder what's the proper way to reuse a normalizer in ND4J/DL4J. Currently, I save it as follows:
final DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit( trainingData );
normalizer.transform( trainingData );
normalizer.transform( testData );

try {
    final NormalizerSerializer normalizerSerializer = new NormalizerSerializer();
    normalizerSerializer.addStrategy( new StandardizeSerializerStrategy() );
    normalizerSerializer.write( normalizer, path );
} catch ( final IOException e ) {
    // ...
}
And load it via:
try {
    final NormalizerSerializer normalizerSerializer = new NormalizerSerializer();
    normalizerSerializer.addStrategy( new StandardizeSerializerStrategy() );
    final DataNormalization normalizer = normalizerSerializer.restore( path );
} catch ( final Exception e ) { // Throws Exception instead of IOException.
    // ...
}
Is that OK? Unfortunately, I wasn't able to find more information in the docs.

This is what I do...
DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(trainingData);
normalizer.transform(trainingData);
Save it:
NormalizerSerializer saver = NormalizerSerializer.getDefaults();
File normalsFile = new File("fileName");
saver.write(normalizer, normalsFile);
Restore it:
NormalizerSerializer loader = NormalizerSerializer.getDefaults();
DataNormalization restoredNormalizer = loader.restore(normalsFile);
restoredNormalizer.transform(testData);
The ND4J Javadocs say that getDefaults() returns a serializer configured with strategies for the built-in normalizer implementations. Since you are using NormalizerStandardize, getDefaults() offers a shorthand way of achieving the same end without explicitly adding the strategy.
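For completeness, here is a minimal round-trip sketch of the shorthand approach (a sketch only: the file name normalizer.bin is a placeholder, and the imports reflect how I believe recent ND4J versions lay these classes out, so adjust to your version):
import java.io.File;

import org.nd4j.linalg.dataset.api.preprocessor.DataNormalization;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerStandardize;
import org.nd4j.linalg.dataset.api.preprocessor.serializer.NormalizerSerializer;

// Fit on the training data once, then persist the fitted statistics.
DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(trainingData);
normalizer.transform(trainingData);
File normalsFile = new File("normalizer.bin"); // placeholder path
NormalizerSerializer.getDefaults().write(normalizer, normalsFile);

// Later, e.g. in a separate inference process, restore and reuse the same statistics.
DataNormalization restored = NormalizerSerializer.getDefaults().restore(normalsFile);
restored.transform(testData);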


How to use Wordnet Synonyms with Hibernate Search?

I've been trying to figure out how to use WordNet synonyms with a search function I'm developing which uses Hibernate Search 5.6.1. At first, I thought about using Hibernate Search annotations:
@TokenFilterDef(factory = SynonymFilterFactory.class, params = {
    @Parameter(name = "ignoreCase", value = "true"),
    @Parameter(name = "expand", value = "true"),
    @Parameter(name = "synonyms", value = "synonymsfile") })
However, this requires an actual file populated with synonyms. From WordNet I was only able to get ".pl" files. So I tried manually making a SynonymAnalyzer class which would read from the ".pl" file:
public class SynonymAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final Tokenizer source = new StandardTokenizer();
        TokenStream result = new StandardFilter(source);
        result = new LowerCaseFilter(result);
        SynonymMap wordnetSynonyms = null;
        try {
            wordnetSynonyms = loadSynonyms();
        } catch (IOException e) {
            e.printStackTrace();
        }
        result = new SynonymFilter(result, wordnetSynonyms, false);
        result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(source, result);
    }

    private SynonymMap loadSynonyms() throws IOException {
        File file = new File("synonyms\\wn_s.pl");
        InputStream stream = new FileInputStream(file);
        Reader reader = new InputStreamReader(stream);
        SynonymMap.Builder parser = null;
        parser = new WordnetSynonymParser(true, true, new StandardAnalyzer(CharArraySet.EMPTY_SET));
        try {
            ((WordnetSynonymParser) parser).parse(reader);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return parser.build();
    }
}
The problem with this method is that I'm getting java.lang.OutOfMemoryError, which I'm assuming is because there are too many synonyms or something. What is the proper way to do this? Everywhere I've looked online suggests using WordNet, but I can't seem to find an example with Hibernate Search annotations. Any help is appreciated, thanks!
The wordnet format is actually supported by SynonymFilterFactory. You're simply missing the "format" parameter in your annotation configuration; by default, the factory uses the Solr format.
Change your annotation to this:
@TokenFilterDef(
    factory = SynonymFilterFactory.class,
    params = {
        @Parameter(name = "ignoreCase", value = "true"),
        @Parameter(name = "expand", value = "true"),
        @Parameter(name = "synonyms", value = "synonymsfile"),
        @Parameter(name = "format", value = "wordnet") // Add this
    }
)
Also, make sure that the value of the "synonyms" parameter is the path of a file in your classpath (e.g. "com/acme/synonyms.pl", or just "synonyms.pl" if the file is at the root of your "resources" directory).
In general, when you have an issue with the parameters of a Lucene filter/tokenizer factory, your best bet is to have a look at the source code of that factory, or at this page.
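Put together with Hibernate Search 5.x's analyzer-definition annotations, the whole thing could look roughly like the sketch below. The analyzer name, the entity class, and the classpath location synonyms/wn_s.pl are all placeholders, not taken from your project:
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.apache.lucene.analysis.synonym.SynonymFilterFactory;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.Parameter;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

@AnalyzerDef(
    name = "wordnetSynonyms", // referenced from fields via @Analyzer(definition = "wordnetSynonyms")
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(
            factory = SynonymFilterFactory.class,
            params = {
                @Parameter(name = "ignoreCase", value = "true"),
                @Parameter(name = "expand", value = "true"),
                @Parameter(name = "synonyms", value = "synonyms/wn_s.pl"), // file on the classpath
                @Parameter(name = "format", value = "wordnet")
            })
    })
public class Book { // hypothetical indexed entity
    // ...
}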

How to create separate change list when using the API?

I am trying to create a Groovy script that takes a list of changelists from our trunk and merges them one at a time into a release branch. I would like to have all the changelists locally because I want to run a test build before submitting upstream. However, whenever I run the script and look in P4V, all the merges have been placed in the default changelist. How can I keep them separate?
My code (in Groovy but using the Java API) is as follows:
final changeListNumbers = [ 579807, 579916, 579936 ]
final targetBranch = "1.0.7"

changeListNumbers.each { changeListNumber ->
    final existingCl = server.getChangelist( changeListNumber )
    final cl = new Changelist(
        IChangelist.UNKNOWN,
        client.getName(),
        server.userName,
        ChangelistStatus.NEW,
        new Date(),
        "${existingCl.id} - ${existingCl.description}",
        false,
        server
    );
    cl.fileSpecs = mergeChangeListToBranch( client, cl, changeListNumber, targetBranch )
}

def List<IFileSpec> mergeChangeListToBranch( final IClient client, final IChangelist changeList, final srcChangeListNumber, final String branchVersion ){
    final projBase = '//Clients/Initech'
    final trunkBasePath = "$projBase/trunk"
    final branchBasePath = "$projBase/release"
    final revisionedTrunkPath = "$trunkBasePath/...#$srcChangeListNumber,$srcChangeListNumber"
    final branchPath = "$branchBasePath/$branchVersion/..."
    println "trunk path: $revisionedTrunkPath\nbranch path is: $branchPath"
    mergeFromTo( client, changeList, revisionedTrunkPath, branchPath )
}

def List<IFileSpec> mergeFromTo( final IClient client, final IChangelist changeList, final String sourceFile, final String destFile ){
    mergeFromTo(
        client,
        changeList,
        new FileSpec( new FilePath( FilePath.PathType.DEPOT, sourceFile ) ),
        new FileSpec( new FilePath( FilePath.PathType.DEPOT, destFile ) )
    )
}

def List<IFileSpec> mergeFromTo( final IClient client, final IChangelist changeList, final FileSpec sourceFile, final FileSpec destFile ){
    final resolveOptions = new ResolveFilesAutoOptions()
    resolveOptions.safeMerge = true
    client.resolveFilesAuto(
        client.integrateFiles( sourceFile, destFile, null, null ),
        // client.integrateFiles( changeList.id, false, null, null, sourceFile, destFile ),
        resolveOptions
    )
}
If I try IChangelist.update(), I get the following error:
Caught: com.perforce.p4java.exception.RequestException: Error in change specification.
Error detected at line 7.
Invalid status 'new'.
If, instead of IChangelist.UNKNOWN, I use existingCl.id + 10000 (which is larger than any existing changelist number currently in use), then I get:
Caught: com.perforce.p4java.exception.RequestException: Tried to update new or default changelist
To create the changelist on the server, call IClient.createChangelist():
final existingCl = server.getChangelist( changeListNumber )
cl = new Changelist(
    IChangelist.UNKNOWN,
    ... snip ...
);
cl = client.createChangelist(cl);
cl.fileSpecs = mergeChangeListToBranch( client, cl, ...
Then to integrate into this particular change:
IntegrateFilesOptions intOpts = new IntegrateFilesOptions()
intOpts.setChangelistId( cl.getId())
client.integrateFiles( sourceFile, destFile, null, intOpts)
That integrateFiles() returns the integrated file(s), so check that the returned IFileSpec.getOpStatus() is FileSpecOpStatus.VALID.
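In plain Java, the whole sequence might look like the sketch below. It makes the same assumptions as the question's code (an existing IServer server, IClient client, and source/dest file specs); the description string is made up, the import paths reflect how I believe p4java is packaged, and error handling is omitted:
import java.util.Date;
import java.util.List;

import com.perforce.p4java.core.ChangelistStatus;
import com.perforce.p4java.core.IChangelist;
import com.perforce.p4java.core.file.FileSpecOpStatus;
import com.perforce.p4java.core.file.IFileSpec;
import com.perforce.p4java.impl.generic.core.Changelist;
import com.perforce.p4java.impl.mapbased.server.Server;
import com.perforce.p4java.option.client.IntegrateFilesOptions;
import com.perforce.p4java.option.client.ResolveFilesAutoOptions;

// Create a numbered pending changelist on the server.
IChangelist pending = client.createChangelist( new Changelist(
    IChangelist.UNKNOWN,
    client.getName(),
    server.getUserName(),
    ChangelistStatus.NEW,
    new Date(),
    "Merge of CL 579807 into release/1.0.7", // made-up description
    false,
    (Server) server // the Changelist constructor wants the concrete Server impl
) );

// Integrate into that changelist instead of the default one.
IntegrateFilesOptions intOpts = new IntegrateFilesOptions();
intOpts.setChangelistId( pending.getId() );
List<IFileSpec> integrated = client.integrateFiles( sourceFile, destFile, null, intOpts );

// Check each returned spec before resolving.
for ( IFileSpec spec : integrated ) {
    if ( spec.getOpStatus() != FileSpecOpStatus.VALID ) {
        System.err.println( spec.getStatusMessage() );
    }
}

ResolveFilesAutoOptions resolveOptions = new ResolveFilesAutoOptions();
resolveOptions.setSafeMerge( true );
client.resolveFilesAuto( integrated, resolveOptions );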

How to read Nutch content from Java/Scala?

I'm using Nutch to crawl some websites (as a process that runs separately from everything else), and I want to use a Java (Scala) program to analyse the HTML data of the websites using Jsoup.
I got Nutch to work by following the tutorial (the script didn't work for me; only executing the individual instructions did), and I think it's saving the websites' HTML in the crawl/segments/<time>/content/part-00000 directory.
The problem is that I cannot figure out how to actually read the website data (URLs and HTML) in a Java/Scala program. I read this document, but found it a bit overwhelming since I've never used Hadoop.
I tried to adapt the example code to my environment, and this is what I arrived at (mostly by guesswork):
val reader = new MapFile.Reader(FileSystem.getLocal(new Configuration()), ".../apache-nutch-1.8/crawl/segments/20140711115438/content/part-00000", new Configuration())
var key = null
var value = null
reader.next(key, value) // test for a single value
println(key)
println(value)
However, I am getting this exception when I run it:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1873)
at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
I am not sure how to work with a MapFile.Reader, specifically, what constructor parameters I am supposed to pass to it. What Configuration objects am I supposed to pass in? Is that the correct FileSystem? And is that the data file I'm interested in?
Scala:
val conf = NutchConfiguration.create()
val fs = FileSystem.get(conf)
val file = new Path(".../part-00000/data")
val reader = new SequenceFile.Reader(fs, file, conf)

val webdata = Stream.continually {
  val key = new Text()
  val content = new Content()
  // next() fills in key/content and returns false once the end of the file is reached
  reader.next(key, content)
  (key, content)
}

println(webdata.head)
Java:
public class ContentReader {
    public static void main(String[] args) throws IOException {
        Configuration conf = NutchConfiguration.create();
        Options opts = new Options();
        GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
        String[] remainingArgs = parser.getRemainingArgs();
        FileSystem fs = FileSystem.get(conf);
        String segment = remainingArgs[0];
        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        Text key = new Text();
        Content content = new Content();
        // Loop through the entries of the sequence file
        while (reader.next(key, content)) {
            try {
                System.out.write(content.getContent(), 0, content.getContent().length);
            } catch (Exception e) {
                // Skip records that cannot be written and keep going
            }
        }
    }
}
Alternatively, you can use org.apache.nutch.segment.SegmentReader (example).
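If you just want to eyeball a segment rather than process it programmatically, you can also drive SegmentReader's command-line entry point from Java. This is a sketch: the segment path matches the question, dump_out is a placeholder output directory, and I'm assuming the -dump mode of Nutch 1.x's readseg tool:
import org.apache.nutch.segment.SegmentReader;

public class DumpSegment {
    public static void main(String[] args) throws Exception {
        // Equivalent to: bin/nutch readseg -dump <segment_dir> <output_dir>
        SegmentReader.main(new String[] {
            "-dump", "crawl/segments/20140711115438", "dump_out"
        });
    }
}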

SecureASTCustomizer: how to restrict loops?

I'm trying to restrict the use of loops (for and while statements) in a Groovy script.
I tried http://groovy-sandbox.kohsuke.org/, but it seems it is not possible to restrict loops with this library.
Code:
final String script = "while(true){}";
final ImportCustomizer imports = new ImportCustomizer();
imports.addStaticStars("java.lang.Math");
imports.addStarImports("groovyx.net.http");
imports.addStaticStars("groovyx.net.http.ContentType", "groovyx.net.http.Method");
final SecureASTCustomizer secure = new SecureASTCustomizer();
secure.setClosuresAllowed(true);
List<Integer> tokensBlacklist = new ArrayList<>();
tokensBlacklist.add(Types.KEYWORD_WHILE);
secure.setTokensBlacklist(tokensBlacklist);
final CompilerConfiguration config = new CompilerConfiguration();
config.addCompilationCustomizers(imports, secure);
Binding intBinding = new Binding();
GroovyShell shell = new GroovyShell(intBinding, config);
final Object eval = shell.evaluate(script);
What's wrong with my code? Or perhaps someone knows another way I can restrict loops or other operators?
while and for are statements, not tokens. You should add them to the statementsBlacklist instead of the tokensBlacklist:
List<Class> statementBlacklist = new ArrayList<>();
statementBlacklist.add( org.codehaus.groovy.ast.stmt.WhileStatement.class );
secure.setStatementsBlacklist( statementBlacklist );
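A fuller sketch in Java that blacklists while, do-while, and for (ForStatement should also cover Groovy's for-in form), wired into the shell the same way as in your code. The class name NoLoopsShell is made up:
import java.util.ArrayList;
import java.util.List;

import org.codehaus.groovy.ast.stmt.DoWhileStatement;
import org.codehaus.groovy.ast.stmt.ForStatement;
import org.codehaus.groovy.ast.stmt.Statement;
import org.codehaus.groovy.ast.stmt.WhileStatement;
import org.codehaus.groovy.control.CompilerConfiguration;
import org.codehaus.groovy.control.customizers.SecureASTCustomizer;

import groovy.lang.GroovyShell;

public class NoLoopsShell {
    public static void main(String[] args) {
        SecureASTCustomizer secure = new SecureASTCustomizer();
        List<Class<? extends Statement>> statementBlacklist = new ArrayList<>();
        statementBlacklist.add(WhileStatement.class);
        statementBlacklist.add(DoWhileStatement.class);
        statementBlacklist.add(ForStatement.class);
        secure.setStatementsBlacklist(statementBlacklist);

        CompilerConfiguration config = new CompilerConfiguration();
        config.addCompilationCustomizers(secure);

        // Compiling "while(true){}" should now fail with a compilation
        // error instead of hanging forever at run time.
        new GroovyShell(config).evaluate("while(true){}");
    }
}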

How to get the output of a Groovy script in Java

I am executing a Groovy script in Java:
final GroovyClassLoader classLoader = new GroovyClassLoader();
Class groovy = classLoader.parseClass(new File("script.groovy"));
GroovyObject groovyObj = (GroovyObject) groovy.newInstance();
groovyObj.invokeMethod("main", null);
This main method prints some information that I want to save in a variable. How can I do that?
You would have to redirect System.out to something else. Of course, if this is multi-threaded, you're going to hit issues:
final GroovyClassLoader classLoader = new GroovyClassLoader();
Class groovy = classLoader.parseClass(new File("script.groovy"));
GroovyObject groovyObj = (GroovyObject) groovy.newInstance();

// Swap System.out for an in-memory buffer while the script runs
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
PrintStream saveSystemOut = System.out;
System.setOut(new PrintStream(buffer));

groovyObj.invokeMethod("main", null);

// Restore the original stream and read what the script printed
System.setOut(saveSystemOut);
String output = buffer.toString().trim();
It's probably better (if you can) to write your scripts so they return something rather than dumping to System.out.
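For example, if you can make the script return its value, GroovyShell hands it straight back and no stream juggling is needed (a sketch; script.groovy is assumed to end in an expression or a return statement):
import java.io.File;
import groovy.lang.GroovyShell;

// script.groovy might end with e.g.:  return "some information"
Object result = new GroovyShell().evaluate(new File("script.groovy"));
System.out.println(result);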
