I am trying to get all the noun phrases using the edu.stanford.nlp.* package. I can get all the subtrees whose label value is "NP", but I cannot get their text as a plain string (rather than in Penn Treebank format).
E.g. subtree.toString() gives (NP (ND all) (NSS times)), but I want the string "all times". Can anyone please help me? Thanks in advance.
I believe what you want is something like:
final StringBuilder sb = new StringBuilder();
for ( final Tree t : tree.getLeaves() ) {
sb.append(t.toString()).append(" ");
}
While I'm not 100% sure, I seem to recall this being the solution used for some software I worked on a few years back.
This can be accomplished by calling the yield() method on the subtree, instead of creating a separate StringBuilder object.
if (subtree.label().value().equals("NP")) {
out.println(subtree); //print subtree
out.println(Sentence.listToString(subtree.yield())); //print phrase
break;
}
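To make it clearer what the yield of a subtree is (its leaf words, read left to right), here is a library-free sketch that pulls the words out of a Penn Treebank bracket string. The class and method names (PennYield, leafWords) are made up for illustration and are not part of the Stanford API; in real code you would call yield() on the Tree as shown above.

```java
import java.util.ArrayList;
import java.util.List;

/** Library-free illustration of a tree "yield": the leaf words, left to right. */
final class PennYield {

    /** Returns the leaf words of a Penn Treebank bracket string, joined by single spaces. */
    static String leafWords(String penn) {
        List<String> words = new ArrayList<>();
        // Split on whitespace while keeping each parenthesis as its own token.
        String[] tokens = penn.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            String tok = tokens[i];
            boolean isParen = tok.equals("(") || tok.equals(")");
            // The token right after "(" is a label (NP, NN, ...); any other non-paren token is a word.
            if (!isParen && !tokens[i - 1].equals("(")) {
                words.add(tok);
            }
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(leafWords("(NP (ND all) (NSS times))")); // prints "all times"
    }
}
```

This assumes every word sits directly after its tag inside its own pair of brackets, which is how the tree in the question is printed.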
Related
First off, I am not asking for code. I am just asking for suggestions or ideas on how to start this project, so please help me; I want to learn.
INPUT TEXT FILE:
Five little monkeys jumping on the bed
One fell off and bumped his head
Mama called the doctor and the doctor said:
"No more monkeys jumping on the bed!"
OUTPUT:
Noun:
Monkeys
Doctor
(and other parts of speech)
If I remove one word from the text file, it should also be gone from the output. Is this program possible without downloading anything like the Stanford parser? I'm using Java. I don't know how to start without ideas :(
Question:
What method am I going to use.
EDIT:
public static void main(String[] args) throws IOException {
BufferedReader br = new BufferedReader(new FileReader("C:/Users/xxxx/Desktop/lyrics.txt"));
String line = null;
while ((line = br.readLine()) != null)
{
System.out.println("Noun: ");
}
}
}
Now it already reads my text file. I thought I could just look for specific words in the text file and print them out as "Noun: Monkeys", but without user input. What I'm talking about is something like this:
Type word to find: ExampleWord
Output:
Sys.out.print( word + "Found!");
Can I do something like this without asking the user, so that it just automatically prints out every word?
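One way this idea could look without any external library is a lookup against a fixed word list. This is only a sketch: the NOUNS set below is a tiny hand-written list made up for the example, and the names NounLookup and findNouns are hypothetical. A word list cannot tell whether a word is used as a noun in a particular sentence, which is what taggers such as Stanford's are for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class NounLookup {
    // Hypothetical, hand-written word list; a real program would need a much larger dictionary.
    static final Set<String> NOUNS =
            new HashSet<>(Arrays.asList("monkeys", "bed", "head", "mama", "doctor"));

    /** Returns the known nouns in the order they first appear in the given lines. */
    static List<String> findNouns(List<String> lines) {
        List<String> found = new ArrayList<>();
        for (String line : lines) {
            // Split on anything that is not a letter, so punctuation and quotes are dropped.
            for (String word : line.toLowerCase().split("[^a-z]+")) {
                if (NOUNS.contains(word) && !found.contains(word)) {
                    found.add(word);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Five little monkeys jumping on the bed",
                "Mama called the doctor and the doctor said:");
        System.out.println("Noun: " + findNouns(lines));
    }
}
```

Because the output is computed from the lines that were read, removing a word from the text file also removes it from the output.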
I recently discovered the Stanford NLP parser and it seems quite amazing. I currently have a working instance of it running in our project, but I am facing the two problems below.
How can I parse text and then extract only specific part-of-speech labels from the parsed data? For example, how can I extract only NNPS and PRP from the sentence?
Our platform works in both English and German, so the text can always be in either English or German. How can I accommodate this scenario? Thank you.
Code :
private final String PCG_MODEL = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";
private final TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "invertible=true");
public void testParser() {
LexicalizedParser lp = LexicalizedParser.loadModel(PCG_MODEL);
String sent="Complete Howto guide to install EC2 Linux server in Amazon Web services cloud.";
Tree parse;
parse = lp.parse(sent);
List<TaggedWord> taggedWords = parse.taggedYield();
System.out.println(taggedWords);
}
The above example works, but as you can see I am loading the English data. Thank you.
Try this:
for (Tree subTree : parse) { // traverse every node of the sentence's parse tree
    if (subTree.label().value().equals("NNPS")) { // the node's label is NNPS
        // Do what you want
    }
}
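If you would rather work from the tagged words than walk the tree, the filtering step itself is simple. The sketch below is library-free and assumes each token is a string of the form word/TAG, which is how the tagged list in the question appears when printed; the names TagFilter and wordsWithTags are made up, and the sample input is hand-written rather than real parser output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TagFilter {

    /** Keeps the words whose tag (the part after the last '/') is in the wanted set. */
    static List<String> wordsWithTags(List<String> tagged, Set<String> wanted) {
        List<String> result = new ArrayList<>();
        for (String token : tagged) {
            int slash = token.lastIndexOf('/');
            if (slash > 0 && wanted.contains(token.substring(slash + 1))) {
                result.add(token.substring(0, slash));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-written sample input, not real parser output.
        List<String> tagged = Arrays.asList("They/PRP", "visited/VBD", "the/DT", "Netherlands/NNPS");
        Set<String> wanted = new HashSet<>(Arrays.asList("NNPS", "PRP"));
        System.out.println(wordsWithTags(tagged, wanted)); // prints [They, Netherlands]
    }
}
```

With the real library you would compare the tag of each TaggedWord against the wanted set in the same way instead of splitting strings.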
For query 1, I don't think stanford-nlp has an option to extract specific POS tags.
However, the same can be achieved using custom trained models. I tried something similar for NER (named entity recognition) with custom models.
I've some tags in a word file which looks like <tag>.
Now I get the content of the Word file with docx4j, loop through every line and search for this tag. When I find one, I replace it with a String. But the code I tried doesn't work, and now I really don't know how to make it work!
Here's the code I've already tried:
WordprocessingMLPackage wpml = WordprocessingMLPackage.load(new File(path));
MainDocumentPart mdp = wpml.getMainDocumentPart();
List<Object> content = mdp.getContent();
String line;
for (Object object : content) {
line = object.toString();
if (line.contains("<tag>")) {
line.replace("<tag>", "<newTag>");
}
}
Any tips or solutions on how I can achieve this?
One of your problems is that you modify the String line, which has no effect on anything: the result of line.replace("<tag>", "<newTag>"); is ignored. You would definitely want to do something with that result, right?
Also, if object in your loop is not an instance of String, then line and object point to different objects.
You need to modify contents but not the way you're doing this. Please read getting started
Also there are lots of examples (sample code) in source code download section
If you have any concrete problems after reading the getting started, we'll be happy to help you.
The things in your List will be org.docx4j.wml.P (paragraphs), or Tbl (tables) or other block level content.
Paragraphs contain runs, which in turn contain the actual text.
For the suggested way of doing what you want to do, see the VariableReplace sample.
Better yet, consider content control data binding.
Otherwise, if you want to roll your own approach, see the Getting Started guide for how to traverse, or use JAXB-level XPath.
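First, note that replace() already replaces every occurrence of a literal string; you only need replaceAll() if the search pattern is a regular expression.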
Then you should store this String into an object you can serialize back after modification to the Word file.
Furthermore, I think it would also be good to handle closing tags (if there are any).
A String (line) is immutable, so replace("<tag>", "<newTag>") does not modify your line; it creates a new, modified one.
Your code should do something like this:
for (Object object : content) {
line = object.toString();
if (line.contains("<tag>")) {
line= line.replaceAll("<tag>", "<newTag>");
}
writeLineToNewFile(line);
}
or shorter:
for (Object object : content) {
writeLineToNewFile(object.toString().replace("<tag>", "<newTag>"));
}
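The immutability point can be checked in isolation with a few lines of plain Java. This is just a sketch; ReplaceDemo and renameTag are names invented for the example and have nothing to do with docx4j.

```java
final class ReplaceDemo {

    /** Returns a copy of the line with every literal "<tag>" replaced by "<newTag>". */
    static String renameTag(String line) {
        // replace() returns a new String; the argument itself is never changed.
        return line.replace("<tag>", "<newTag>");
    }

    public static void main(String[] args) {
        String line = "<tag> and <tag>";
        line.replace("<tag>", "<newTag>");   // result discarded, line is unchanged
        System.out.println(line);            // prints "<tag> and <tag>"
        String updated = renameTag(line);    // keep the returned value instead
        System.out.println(updated);         // prints "<newTag> and <newTag>"
    }
}
```

Both occurrences are replaced by replace(), so the only thing missing in the original code is using the returned value.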
This link: http://www.otc.edu/GEN/schedule/all_classes_fall.txt contains classes for my college, and I am trying to take all of this data and store it in a ClassInformationFall object I have created. Basically, the classes begin at the class title in the format like this : "ABR-100-101" and have class instructor, days it occurs, start/end time, etc.
I have written some regex to pick out the class title, and some of the easier things like start and ending time, but I have been struggling on trying to get the rest of it out. I was thinking about setting up some code where anytime another class title is encountered, it adds the following text to a new ClassInformationFall object, which I am storing in a list of that type. Even if I had that, though, I still haven't been able to successfully extract all of the data for all of the things that make up the class.
What would be the regex to pick this information out, or is regex even the way to go?
Thanks for any help, this has stumped me for a while.
PS - I am developing the application using this in Java.
If the fields are always in the same order, you can just split each line by tabs and deal with the resulting array.
String line = bufferedReader.readLine();
while (line != null) {
String[] data = line.split("\\t+");
String name = data[0];
String credits = data[2];
String description = data[3];
String professor = data[11];
ClassInfo ci = new ClassInfo(name, credits, description, professor);
classInfoList.add(ci);
line = bufferedReader.readLine();
}
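One thing to watch with this approach: splitting on "\\t+" treats a run of tabs as a single separator, so an empty column shifts every later index. If the file can contain empty fields (an assumption, I have not checked the actual data), splitting on a single tab with a negative limit keeps the positions fixed. The row below is made up for illustration, and TabSplit and fields are hypothetical names.

```java
import java.util.Arrays;

final class TabSplit {

    /** Splits on every single tab and keeps empty fields, so column positions stay fixed. */
    static String[] fields(String line) {
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        String line = "ABR-100-101\t\t3\tIntro";   // made-up row with an empty second column
        System.out.println(Arrays.toString(fields(line)));         // 4 fields, one of them empty
        System.out.println(Arrays.toString(line.split("\\t+")));   // 3 fields, indices shifted
    }
}
```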
I am trying to remove all HTML elements from a String. Unfortunately, I cannot use regular expressions because I am developing on the Blackberry platform and regular expressions are not yet supported.
Is there any other way that I can remove HTML from a string? I read somewhere that you can use a DOM Parser, but I couldn't find much on it.
Text with HTML:
<![CDATA[As a massive asteroid hurtles toward Earth, NASA head honcho Dan Truman (Billy Bob Thornton) hatches a plan to split the deadly rock in two before it annihilates the entire planet, calling on Harry Stamper (Bruce Willis) -- the world's finest oil driller -- to head up the mission. With time rapidly running out, Stamper assembles a crack team and blasts off into space to attempt the treacherous task. Ben Affleck and Liv Tyler co-star.]]>
Text without HTML:
As a massive asteroid hurtles toward Earth, NASA head honcho Dan Truman (Billy Bob Thornton) hatches a plan to split the deadly rock in two before it annihilates the entire planet, calling on Harry Stamper (Bruce Willis) -- the world's finest oil driller -- to head up the mission. With time rapidly running out, Stamper assembles a crack team and blasts off into space to attempt the treacherous task.Ben Affleck and Liv Tyler co-star.
Thanks!
There are a lot of nuances to parsing HTML in the wild, one of the funnier ones being that many pages out there do not follow any standard. This said, if all your HTML is going to be as simple as your example, something like this is more than enough:
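char[] cs = s.toCharArray();
StringBuilder sb = new StringBuilder();
boolean tag = false; // true while we are inside <...>
for (int i = 0; i < cs.length; i++) {
    switch (cs[i]) {
        case '<': tag = true; break;
        // a '>' outside a tag is ordinary text, so keep it
        case '>': if (tag) { tag = false; } else { sb.append('>'); } break;
        // only decode escapes that appear in the text, not inside a tag
        case '&': if (!tag) { i += interpretEscape(cs, i, sb); } break;
        default: if (!tag) sb.append(cs[i]);
    }
}
System.err.println(sb);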
Here interpretEscape() is supposed to know how to convert HTML escapes such as &gt; to their character counterparts, and to skip all characters up to the ending ;.
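For completeness, here is one way interpretEscape() might be written. It is a sketch under a few assumptions: it is called with cs[i] == '&', it returns how many characters after the '&' it consumed (so that the loop's own i++ steps past the ';'), and it only knows a handful of named escapes. It is written against standard Java (StringBuilder, switch on strings) and may need adapting for BlackBerry. The class name HtmlEscapes and the strip() wrapper are made up for the example.

```java
final class HtmlEscapes {

    /**
     * cs[i] is '&'. Appends the decoded character to sb and returns how many
     * characters after the '&' were consumed (so the caller's i lands on ';').
     */
    static int interpretEscape(char[] cs, int i, StringBuilder sb) {
        int end = i + 1;
        // Look for the closing ';' within a short window after the '&'.
        while (end < cs.length && end - i <= 8 && cs[end] != ';') {
            end++;
        }
        if (end < cs.length && cs[end] == ';') {
            String name = new String(cs, i + 1, end - i - 1);
            char decoded;
            switch (name) {
                case "lt":   decoded = '<'; break;
                case "gt":   decoded = '>'; break;
                case "amp":  decoded = '&'; break;
                case "quot": decoded = '"'; break;
                case "nbsp": decoded = ' '; break;
                default:     decoded = 0;   // unknown escape
            }
            if (decoded != 0) {
                sb.append(decoded);
                return end - i;
            }
        }
        sb.append('&'); // not a known escape: keep the '&' literally
        return 0;
    }

    /** Drops everything between '<' and '>' and decodes the escapes known above. */
    static String strip(String s) {
        char[] cs = s.toCharArray();
        StringBuilder sb = new StringBuilder();
        boolean tag = false;
        for (int i = 0; i < cs.length; i++) {
            char c = cs[i];
            if (c == '<') {
                tag = true;
            } else if (c == '>' && tag) {
                tag = false;
            } else if (c == '&' && !tag) {
                i += interpretEscape(cs, i, sb);
            } else if (!tag) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("<b>1 &lt; 2</b> &amp; more")); // prints "1 < 2 & more"
    }
}
```

Numeric escapes such as &#39; are not handled here and are left in the output as they are.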
"I cannot use regular expressions because I am developing on the Blackberry platform"
You cannot use regular expressions because HTML is a recursive language and regular expressions can't handle those.
You need a parser.
If you can add external jars you can try with those two small libs:
tagsoup, it's a sax parser
jericho html, another small html parser
they both allow you to strip everything.
I have used Jericho many times; to strip content, you define an extractor as you like:
class HTMLStripExtractor extends TextExtractor
{
    public HTMLStripExtractor(Source src)
    {
        super(src);
        src.setLogger(null); // silence the parser's logging
    }
    public boolean excludeElement(StartTag startTag)
    {
        // compare names with equals() rather than by reference
        return !HTMLElementName.A.equals(startTag.getName());
    }
}
I'd try to tackle this the other way around, create a DOM tree from the HTML and then extract the string from the tree:
Use a library like TagSoup to parse in the HTML while cleaning it up to be close to XHTML.
As you're streaming the cleaned up XHTML, extract the text you want.