Extract Wikipedia Infobox data

Extract Wikipedia Infobox data - java

I want to extract the data from Wikipedia infoboxes and came upon the code in "Wikipedia infobox extraction in Java", which suggests a way to do so in Java. I am not as handy with Java as I am with Python, so I am using wikixmlj-r43.jar in Eclipse with this code:
import edu.jhu.nlp.wikipedia.*;
public class InfoboxParser {
public static void main(String[] args) throws Exception{
WikiXMLParser parser = WikiXMLParserFactory.getSAXParser("/home/siddhartha/Documents/wiki/enwiki-latest-pages-articles.xml");
parser.setPageCallback(new PageCallbackHandler() {
public void process(WikiPage page) {
InfoBox infobox=page.getInfoBox();
//do something with info box
}
});
parser.parse();
}
}
I am getting the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/tools/bzip2/CBZip2InputStream
at edu.jhu.nlp.wikipedia.WikiXMLParserFactory.getSAXParser(WikiXMLParserFactory.java:15)
at parser.InfoboxParser.main(InfoboxParser.java:7)
Caused by: java.lang.ClassNotFoundException: org.apache.tools.bzip2.CBZip2InputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 2 more
I have added the JAR in Eclipse under Properties > Java Build Path > Libraries. As far as I can tell, it is not able to find the CBZip2InputStream class.
Please help.

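import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// Fetch the article and print the first table whose class is "infobox bordered"
Response res = Jsoup.connect("http://en.wikipedia.org/wiki/Carbon")
.execute();
String html = res.body();
Document doc = Jsoup.parse(html); // the response body is a full HTML page, not a fragment
Element body = doc.body();
Elements tables = body.getElementsByTag("table");
for (Element table : tables) {
if (table.className().equalsIgnoreCase("infobox bordered")) {
System.out.println(table.outerHtml());
break;
}
}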

This might help you:
https://code.google.com/p/wikixmlj/source/browse/trunk/tests/InfoBoxTest.java?spec=svn40&r=40
Replace the file path in that code (data/newton.xml) with this URL, putting your page title in place of your_title:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles=your_title&rvsection=0
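As a rough illustration of that last step, here is a minimal sketch that builds the API URL for an arbitrary page title. The class and method names (InfoboxUrlBuilder, buildUrl) are made up for this example, and replacing spaces with underscores is an assumption about how titles are usually written; only the endpoint and the query parameters come from the answer above. It needs Java 10 or later for URLEncoder.encode(String, Charset).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class InfoboxUrlBuilder {
    // Endpoint taken from the answer above.
    private static final String API = "http://en.wikipedia.org/w/api.php";

    /** Builds the query URL that returns the wikitext of section 0 for the given title. */
    public static String buildUrl(String title) {
        // Assumption: spaces in titles are written as underscores.
        String encoded = URLEncoder.encode(title.replace(' ', '_'), StandardCharsets.UTF_8);
        return API + "?action=query&prop=revisions&rvprop=content&format=xml"
                + "&titles=" + encoded + "&rvsection=0";
    }

    public static void main(String[] args) {
        // Hypothetical title, used only to show the resulting URL.
        System.out.println(buildUrl("Isaac Newton"));
    }
}
```

The returned string can then be handed to whatever code fetches and parses the XML.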

Related

Read multiple csv file in apache beam using java

This code works well with a single input file, but when I pass
D://beam//csv//*.csv
or D://beam//csv//20*.csv as the parameter, it throws:
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.nio.file.InvalidPathException: Illegal char <*> at index 17: D:\\beam\\csv\\20*.csv
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:332)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:302)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:197)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:64)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:313)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:299)
at beam.wordcount.TestCsv.main(TestCsv.java:60)
Caused by: java.nio.file.InvalidPathException: Illegal char <*> at index 17: D:\\beam\\csv\\20*.csv
at sun.nio.fs.WindowsPathParser.normalize(Unknown Source)
at sun.nio.fs.WindowsPathParser.parse(Unknown Source)
at sun.nio.fs.WindowsPathParser.parse(Unknown Source)
at sun.nio.fs.WindowsPath.parse(Unknown Source)
at sun.nio.fs.WindowsFileSystem.getPath(Unknown Source)
at java.nio.file.Paths.get(Unknown Source)
at org.apache.beam.sdk.io.LocalFileSystem.matchOne(LocalFileSystem.java:217)
at org.apache.beam.sdk.io.LocalFileSystem.match(LocalFileSystem.java:90)
at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:119)
at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:140)
at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:152)
at org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn.process(FileIO.java:636)
I don't know why it throws this error; * is supposed to match multiple files of a similar type.
Code:
public interface BatchOptions extends PipelineOptions {
@Description("Path to the data file(s) containing game data.")
@Default.String("D:\\beam\\csv\\2020.csv")
String getInput();
void setInput(String value);
}
public static void main(String[] args) {
BatchOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(BatchOptions.class);
Pipeline pipeline = Pipeline.create(options);
PCollection<FileIO.ReadableFile> lines = pipeline
.apply(FileIO.match().filepattern(options.getInput()))
.apply(FileIO.readMatches());
pipeline.run().waitUntilFinish();
}

On Windows, the file system does not expand *; as the stack trace shows, Paths.get rejects it as an illegal character.
I would recommend passing the complete directory instead, like
D://beam//csv//
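If you would rather keep a wildcard than pass the whole directory, one possible workaround is to expand it yourself before building the pipeline. The sketch below is plain java.nio, not a Beam API, and the names CsvGlobExpander and expand are made up for illustration. The glob is applied to file names by the directory stream, so the * never goes through Paths.get.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class CsvGlobExpander {
    /** Returns the regular files in dir whose names match the glob, sorted by path. */
    public static List<String> expand(Path dir, String glob) throws IOException {
        List<String> matches = new ArrayList<>();
        // The directory stream filters entries by file name using the glob.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    matches.add(p.toString());
                }
            }
        }
        Collections.sort(matches);
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical defaults; pass your own directory and pattern as arguments.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        String glob = args.length > 1 ? args[1] : "*.csv";
        for (String file : expand(dir, glob)) {
            System.out.println(file);
        }
    }
}
```

For the case in the question this would be called with the directory D:\beam\csv and the pattern 20*.csv, and each concrete path it returns could then be given to the pipeline as a separate input.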

Ignite Scan Query Throws class org.apache.ignite.binary.BinaryInvalidTypeException

Following Ignite Readme page https://apacheignite.readme.io/docs/cache-queries#section-scan-queries I am trying to run the discussed example code.
IgniteCache<Long, Person> cache = ignite.cache("mycache");
// Find only persons earning more than 1,000.
try (QueryCursor<Cache.Entry<Long, Person>> cursor =
cache.query(new ScanQuery<Long, Person>((k, p) -> p.getSalary() > 1000))) {
for (Cache.Entry<Long, Person> entry : cursor)
System.out.println("Key = " + entry.getKey() + ", Value = " + entry.getValue());
}
I am getting the following exception
Caused by: class org.apache.ignite.binary.BinaryInvalidTypeException: examples.model.Person
at org.apache.ignite.internal.binary.BinaryContext.descriptorForTypeId(BinaryContext.java:707)
at org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1757)
at org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1716)
at org.apache.ignite.internal.binary.BinaryObjectImpl.deserializeValue(BinaryObjectImpl.java:798)
at org.apache.ignite.internal.binary.BinaryObjectImpl.value(BinaryObjectImpl.java:143)
at org.apache.ignite.internal.processors.cache.CacheObjectUtils.unwrapBinary(CacheObjectUtils.java:177)
at org.apache.ignite.internal.processors.cache.CacheObjectUtils.unwrapBinaryIfNeeded(CacheObjectUtils.java:39)
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager$ScanQueryIterator.advance(GridCacheQueryManager.java:3063)
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager$ScanQueryIterator.onHasNext(GridCacheQueryManager.java:2965)
at org.apache.ignite.internal.util.GridCloseableIteratorAdapter.hasNextX(GridCloseableIteratorAdapter.java:53)
at org.apache.ignite.internal.util.lang.GridIteratorAdapter.hasNext(GridIteratorAdapter.java:45)
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.runQuery(GridCacheQueryManager.java:1141)
at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager.processQueryRequest(GridCacheDistributedQueryManager.java:234)
at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager$2.apply(GridCacheDistributedQueryManager.java:109)
at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager$2.apply(GridCacheDistributedQueryManager.java:107)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295)
at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:127)
at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1093)
... 3 more
Caused by: java.lang.ClassNotFoundException: examples.model.Person
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.ignite.internal.util.IgniteUtils.forName(IgniteUtils.java:8771)
at org.apache.ignite.internal.MarshallerContextImpl.getClass(MarshallerContextImpl.java:349)
at org.apache.ignite.internal.binary.BinaryContext.descriptorForTypeId(BinaryContext.java:698)
The Person class is taken from the Ignite examples, as is most of the code (available on GitHub at https://gist.github.com/alexterman/075d7e12f470ce873f99d59478260250). I am running it on a Mac.

This means the Person class is not on the classpath of your server node(s).
Ignite does not peer-class-load key and value classes, so you need to deploy them to all nodes before running any distributed operation that uses those types.
Alternatively, you can use withKeepBinary() and work with BinaryObject instances. Something along the lines of the following (with the type argument matching the actual type of the salary field):
cache.withKeepBinary().query(new ScanQuery<Long, BinaryObject>(
(k, p) -> p.<Integer>field("salary") > 1000))
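To confirm the diagnosis, a small check like the one below can be run with the same classpath as the server node. It is only a sketch using the standard Class.forName call; ClasspathCheck and isOnClasspath are hypothetical names, and the default class name is the one from the stack trace above.

```java
class ClasspathCheck {
    /** Returns true if the named class can be located by this class's class loader. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name from the stack trace; pass a different one as the first argument.
        String name = args.length > 0 ? args[0] : "examples.model.Person";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
    }
}
```

If this prints false on the server side, the JAR containing Person still has to be added to that node's classpath.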

Javassist CannotCompileException when trying to add a line to create a Map

I'm trying to instrument a method to do the following task.
Task: create a Map and insert values into it.
Adding System.out.println lines doesn't cause any exception. But when I add the line that creates the Map, it throws a CannotCompileException due to a missing ;. When I print the final string, it doesn't seem to be missing one. What am I doing wrong here?
public void createInsertAt(CtMethod method, int lineNo, Map<String,String> parameterMap)
throws CannotCompileException {
StringBuilder atBuilder = new StringBuilder();
atBuilder.append("System.out.println(\"" + method.getName() + " is running\");");
atBuilder.append("java.util.Map<String,String> arbitraryMap = new java.util.HashMap<String,String>();");
for (Map.Entry<String,String> entry : parameterMap.entrySet()) {
}
System.out.println(atBuilder.toString());
method.insertAt(1, atBuilder.toString());
}
String obtained by printing the output of string builder is,
System.out.println("prepareStatement is running");java.util.Map<String,String> arbitraryMap = new java.util.HashMap<String,String>();
Exception received is,
javassist.CannotCompileException: [source error] ; is missing
at javassist.CtBehavior.insertAt(CtBehavior.java:1207)
at javassist.CtBehavior.insertAt(CtBehavior.java:1134)
at org.wso2.das.javaagent.instrumentation.InstrumentationClassTransformer.createInsertAt(InstrumentationClassTransformer.java:126)
at org.wso2.das.javaagent.instrumentation.InstrumentationClassTransformer.instrumentMethod(InstrumentationClassTransformer.java:100)
at org.wso2.das.javaagent.instrumentation.InstrumentationClassTransformer.transform(InstrumentationClassTransformer.java:37)
at sun.instrument.TransformerManager.transform(TransformerManager.java:188)
at sun.instrument.InstrumentationImpl.transform(InstrumentationImpl.java:424)
at sun.instrument.InstrumentationImpl.retransformClasses0(Native Method)
at sun.instrument.InstrumentationImpl.retransformClasses(InstrumentationImpl.java:144)
at org.wso2.das.javaagent.instrumentation.Agent.premain(Agent.java:39)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:382)
at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:397)
Caused by: compile error: ; is missing
at javassist.compiler.Parser.parseDeclarationOrExpression(Parser.java:594)
at javassist.compiler.Parser.parseStatement(Parser.java:277)
at javassist.compiler.Javac.compileStmnt(Javac.java:567)
at javassist.CtBehavior.insertAt(CtBehavior.java:1186)
... 15 more
Is there any way to debug this kind of issue? Some help, please.

Javassist's compiler doesn't support generics. Either remove or comment them out:
.append("java.util.Map arbitraryMap = new java.util.HashMap();")
or
.append("java.util.Map/*<String,String>*/ arbitraryMap = new java.util.HashMap/*<String,String>*/();")
The latter is useful only as a comment for yourself, of course; it has no special meaning for Javassist.
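Putting that together, a minimal sketch of building the source string with raw types only might look like this. It uses no Javassist calls, only a StringBuilder. RawTypeSnippetBuilder, build and escape are hypothetical names, and the put(...) statements are my guess at what the empty loop in the question was meant to generate. The escaping covers backslashes and double quotes only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RawTypeSnippetBuilder {
    /** Escapes backslashes and double quotes so the text can sit inside a Java string literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds source text that declares a raw java.util.Map and fills it from parameterMap. */
    public static String build(String methodName, Map<String, String> parameterMap) {
        StringBuilder sb = new StringBuilder();
        sb.append("System.out.println(\"").append(escape(methodName)).append(" is running\");");
        // Raw types only: no <String,String> anywhere in the generated source.
        sb.append("java.util.Map arbitraryMap = new java.util.HashMap();");
        for (Map.Entry<String, String> e : parameterMap.entrySet()) {
            sb.append("arbitraryMap.put(\"").append(escape(e.getKey()))
              .append("\", \"").append(escape(e.getValue())).append("\");");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("key", "value"); // hypothetical entry
        System.out.println(build("prepareStatement", params));
    }
}
```

The resulting string is what would be passed to method.insertAt(...) in place of the generic version.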

Stanford NER Error: Loading distsim lexicon Failed

In my project I need NER annotation, so I used NERDemo.java.
It works fine when I create a new project that contains only this code, but when I add it to my own project I keep getting errors. I have edited the paths in my code to point to the location of the classifiers.
I added the JAR files.
This is the code:
String serializedClassifier = "/Users/ha/stanford-ner-2014-10-26/classifiers/english.all.3class.distsim.crf.ser.gz";
String serializedClassifier2 = "/Users/ha/stanford-ner-2014-10-26/classifiers/english.muc.7class.distsim.crf.ser.gz";
if (args.length > 0) {
serializedClassifier = args[0];
}
NERClassifierCombiner classifier = new NERClassifierCombiner(false, false, serializedClassifier, serializedClassifier2);
String fileContents = IOUtils.slurpFile("/Users/ha/NetBeansProjects/StanfordPOSCode/src/stanfordposcode/input.txt");
List<List<CoreLabel>> out = classifier.classify(fileContents);
int i = 0;
for (List<CoreLabel> lcl : out) {
i++;
int j = 0;
for (CoreLabel cl : lcl) {
j++;
System.out.printf("%d:%d: %s%n", i, j,
cl.toShorterString("Text", "CharacterOffsetBegin", "CharacterOffsetEnd", "NamedEntityTag"));
}
}
But I got this error:
run:
Loading classifier from /Users/ha/stanford-ner-2014-10-26/classifiers/english.all.3class.distsim.crf.ser.gz ... Loading distsim lexicon from /u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters ... java.lang.RuntimeException: java.io.FileNotFoundException: /u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters (No such file or directory)
at edu.stanford.nlp.objectbank.ReaderIteratorFactory$ReaderIterator.setNextObject(ReaderIteratorFactory.java:225)
at edu.stanford.nlp.objectbank.ReaderIteratorFactory$ReaderIterator.<init>(ReaderIteratorFactory.java:161)
at edu.stanford.nlp.objectbank.ReaderIteratorFactory.iterator(ReaderIteratorFactory.java:98)
at edu.stanford.nlp.objectbank.ObjectBank$OBIterator.<init>(ObjectBank.java:404)
at edu.stanford.nlp.objectbank.ObjectBank.iterator(ObjectBank.java:242)
at edu.stanford.nlp.ie.NERFeatureFactory.initLexicon(NERFeatureFactory.java:471)
at edu.stanford.nlp.ie.NERFeatureFactory.init(NERFeatureFactory.java:379)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.reinit(AbstractSequenceClassifier.java:171)
at edu.stanford.nlp.ie.crf.CRFClassifier.loadClassifier(CRFClassifier.java:2630)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1620)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1736)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1679)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1662)
at edu.stanford.nlp.ie.crf.CRFClassifier.getClassifier(CRFClassifier.java:2851)
at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifierFromPath(ClassifierCombiner.java:189)
at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifiers(ClassifierCombiner.java:173)
at edu.stanford.nlp.ie.ClassifierCombiner.<init>(ClassifierCombiner.java:125)
at edu.stanford.nlp.ie.NERClassifierCombiner.<init>(NERClassifierCombiner.java:52)
at stanfordposcode.MultipleNERs.main(MultipleNERs.java:24)
Caused by: java.io.FileNotFoundException: /u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:131)
at edu.stanford.nlp.io.EncodingFileReader.<init>(EncodingFileReader.java:78)
at edu.stanford.nlp.objectbank.ReaderIteratorFactory$ReaderIterator.setNextObject(ReaderIteratorFactory.java:192)
... 18 more
Loading classifier from /Users/ha/stanford-ner-2014-10-26/classifiers/english.all.3class.distsim.crf.ser.gz ... Exception in thread "main" java.io.FileNotFoundException
at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifierFromPath(ClassifierCombiner.java:199)
at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifiers(ClassifierCombiner.java:173)
at edu.stanford.nlp.ie.ClassifierCombiner.<init>(ClassifierCombiner.java:125)
at edu.stanford.nlp.ie.NERClassifierCombiner.<init>(NERClassifierCombiner.java:52)
at stanfordposcode.MultipleNERs.main(MultipleNERs.java:24)
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to edu.stanford.nlp.classify.LinearClassifier
at edu.stanford.nlp.ie.ner.CMMClassifier.loadClassifier(CMMClassifier.java:1070)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1620)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1736)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1679)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1662)
at edu.stanford.nlp.ie.ner.CMMClassifier.getClassifier(CMMClassifier.java:1116)
at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifierFromPath(ClassifierCombiner.java:195)
... 4 more
Java Result: 1
BUILD SUCCESSFUL (total time: 1 second)

You are mixing and matching the code from version 3.4 and the models from version 3.5. I suggest upgrading everything to the latest version.

Java jList NullPointerException error

Error:
Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException
at test.factory.MainWindow.setFuncList(MainWindow.java:160)
at test.factory.MainWindow.<init>(MainWindow.java:22)
at test.factory.MainWindow$2.run(MainWindow.java:151)
at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:251)
at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:721)
at java.awt.EventQueue.access$200(EventQueue.java:103)
at java.awt.EventQueue$3.run(EventQueue.java:682)
at java.awt.EventQueue$3.run(EventQueue.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76)
at java.awt.EventQueue.dispatchEvent(EventQueue.java:691)
at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:244)
at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:163)
at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:151)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:147)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:139)
at java.awt.EventDispatchThread.run(EventDispatchThread.java:97)
Code:
TestFactory tf = new TestFactory();
ArrayList<Function> fList = tf.getFunctions();
DefaultListModel<Function> dFuncList = new DefaultListModel<>();
fListPane.setModel(dFuncList);
for(Function f : fList) {
dFuncList.addElement(f);
}
Question:
Now, if you find the error, that's great, but my question is: how do I read the error text to find where my error originated? I'm used to messages like "missing ';' at line 24 of C:\filename".
Update: fList has two elements, so it is not null.

The error dump is a stack trace, so I tend to find it's always best to start at the top and work down. In this case it looks like your setFuncList at line 160 of MainWindow.java is trying to work with an object that is null (maybe not yet initialised?).
UPDATE: Example of code that works
class Function {
int i;
public Function(int myI) {
this.i = myI;
}
@Override
public String toString() {
return "i=" + this.i;
}
}
Used with:
ArrayList<Function> fList = new ArrayList<>();
fList.add(new Function(1));
fList.add(new Function(2));
DefaultListModel<Function> dFuncList = new DefaultListModel<>();
jList2.setModel(dFuncList);
for(Function f : fList) {
dFuncList.addElement(f);
}

So basically, read the stack trace from the top: it lists the calls that led to the error you received. Look carefully at the lines of your own code that are listed. If you can't see any obvious errors, add some extra checks based on the error, i.e. check that the objects used on the failing line are not null just before it. I find printouts a simple approach. You can also use a debugger; I use JSwat, but only break it out when I really need to.
Hope that was what you were after.
@orangegoat gave a good breakdown of how to interpret the stack trace, if that's what you wanted.
Also, a link to JSwat:
http://code.google.com/p/jswat/
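The same "start at the top, find your own code" reading can be done in code. The sketch below walks Throwable.getStackTrace() and returns the topmost frame that belongs to a given package or class prefix. FirstOwnFrame, firstFrameIn and lengthOf are names invented for this example; for the trace in the question the prefix would be test.factory.

```java
class FirstOwnFrame {
    /** Returns the topmost frame whose class name starts with the prefix, or null if none does. */
    public static StackTraceElement firstFrameIn(Throwable t, String prefix) {
        for (StackTraceElement frame : t.getStackTrace()) {
            if (frame.getClassName().startsWith(prefix)) {
                return frame;
            }
        }
        return null;
    }

    // Deliberately dereferences its argument so that passing null throws a NullPointerException.
    static int lengthOf(String s) {
        return s.length();
    }

    public static void main(String[] args) {
        try {
            lengthOf(null);
        } catch (NullPointerException e) {
            StackTraceElement frame = firstFrameIn(e, FirstOwnFrame.class.getName());
            // A StackTraceElement prints as class.method(file:line).
            System.out.println("Error originated at: " + frame);
        }
    }
}
```

The frame it reports is the line to inspect first; whatever is dereferenced there is the candidate for being null.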
