I'm working on migrating a Textile plugin for a Java blogging platform from one library (textile4j) to Mylyn's WikiText. So far it's very promising, but I have some unit tests that are failing:
public void testLinksSyntax44() {
    String in = "\"link text(with title)\":http://example.com/";
    String out = "<p>link text</p>";
    textile.parse(in);
    String content = writer.toString();
    assertEquals(out, content);
}

public void testLinksSyntax46() {
    String in = "\"(link)link text(with title)\":http://example.com/";
    String out = "<p>link text</p>";
    textile.parse(in);
    String content = writer.toString();
    assertEquals(out, content);
}
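For context, the textile parser and writer used in these tests are not shown; presumably they are wired up roughly like this (a sketch only: the actual setup code is not in the question, and WikiText package names differ between releases):

import java.io.StringWriter;

import org.eclipse.mylyn.wikitext.core.parser.MarkupParser;
import org.eclipse.mylyn.wikitext.core.parser.builder.HtmlDocumentBuilder;
import org.eclipse.mylyn.wikitext.textile.core.TextileLanguage;

private StringWriter writer;
private MarkupParser textile;

protected void setUp() throws Exception {
    writer = new StringWriter();
    HtmlDocumentBuilder builder = new HtmlDocumentBuilder(writer);
    builder.setEmitAsDocument(false);   // emit only the body markup, not a full HTML document
    textile = new MarkupParser(new TextileLanguage());
    textile.setBuilder(builder);
}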
Basically, the output shows a problem with how WikiText parses the title syntax. The output for each test is as follows:
In #44, the output is: <p>link text(with title)</p>
In #46, the output is: <p>link text(with title)</p>
The Textpattern Textile web widget correctly parses the link with class and title ("(link)link text(with title)":http://www.example.com/) and the link with title ("link text(with title)":http://www.example.com/) short forms.
Am I doing something wrong, or did I find a bug? I'm still grokking the library, but someone familiar with it may already know the problem, be able to spot the error, or correct me.
Much thanks!
Tim
I found that the bug has already been reported:
Eclipse Mylyn WikiText Bugzilla
I recently discovered the Stanford NLP parser and it seems quite amazing. I currently have a working instance of it running in our project, but I am facing the two problems below.
How can I parse text and then extract only specific part-of-speech labels from the parsed data? For example, how can I extract only NNPS and PRP from a sentence?
Our platform works in both English and German, so the text may be in either language. How can I accommodate this scenario? Thank you.
Code:
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.TokenizerFactory;
import edu.stanford.nlp.trees.Tree;

private final String PCG_MODEL = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";
private final TokenizerFactory<CoreLabel> tokenizerFactory =
        PTBTokenizer.factory(new CoreLabelTokenFactory(), "invertible=true");

public void testParser() {
    LexicalizedParser lp = LexicalizedParser.loadModel(PCG_MODEL);
    String sent = "Complete Howto guide to install EC2 Linux server in Amazon Web services cloud.";
    Tree parse = lp.parse(sent);
    List<TaggedWord> taggedWords = parse.taggedYield();
    System.out.println(taggedWords);
}
The above example works, but as you can see I am loading the English data. Thank you.
Try this:
for (Tree subTree : parse) {                       // traverse every subtree of the sentence's parse tree
    if (subTree.label().value().equals("NNPS")) {  // the node's label is the NNPS POS tag
        // do what you want with subTree
    }
}
For Query 1, I don't think stanford-nlp has a built-in option to extract only specific POS tags.
However, using custom-trained models we can achieve the same thing. I tried a similar requirement with custom NER (named entity recognition) models.
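To make both points concrete, here is a rough sketch (not from either answer) that filters the parser's tagged yield for the tags the question asks about. The germanPCFG model path is an assumption based on the standard Stanford parser models jar, and detecting the input language is left out entirely:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;

public class PosFilterSketch {

    // English model as in the question; the German path is assumed from the models jar layout.
    private static final String ENGLISH_MODEL = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";
    private static final String GERMAN_MODEL  = "edu/stanford/nlp/models/lexparser/germanPCFG.ser.gz";

    // Returns only the words whose POS tag is in wantedTags (e.g. NNPS, PRP).
    static List<String> extractByTag(Tree parse, Set<String> wantedTags) {
        List<String> words = new ArrayList<>();
        for (TaggedWord tw : parse.taggedYield()) {
            if (wantedTags.contains(tw.tag())) {
                words.add(tw.word());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Load the model that matches the language of the text being parsed.
        LexicalizedParser lp = LexicalizedParser.loadModel(ENGLISH_MODEL);
        Tree parse = lp.parse("Complete Howto guide to install EC2 Linux server in Amazon Web services cloud.");
        System.out.println(extractByTag(parse, new HashSet<>(Arrays.asList("NNPS", "PRP"))));
    }
}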
How do I retrieve the text from a JLabel without the HTML tags?
E.g.
CustomJLabel:
public class CustomJLabel extends JLabel {
    private String text;

    public CustomJLabel(String text) {
        super("<html><div style='text-align: center;'>" + text + "</div></html>");
        this.text = text;
    }
}
Main method:
testCustomLbl = new CustomJLabel("Testing");
System.out.println(testCustomLbl.getText());
Output I got:
<html><div style='text-align: center;'>Testing</div></html>
Desired output:
Testing
There are three options:
You pick your favorite HTML parser and parse the HTML; see here for some inspiration, and the sketch after this list. This is by far the most robust and straightforward solution; but of course: costly.
If you are well aware of the exact HTML content that goes into your labels, then you could turn to regular expressions or other means of string parsing. The problem is: if you don't control those strings, then coming up with your own custom "parsing" is hard, because each and every change to the HTML that goes in might break your little parser.
You rework your whole design: if having HTML text is such a core thing in your application, you might consider really "representing" that in your class, for example by creating your own version of JLabel that takes some HtmlString input and simply remembers which parts are HTML and which are "pure text".
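To give option 1 some shape, here is a minimal sketch with Jsoup as the parser of choice (my pick for illustration, not something the answer prescribes); it simply feeds the label's current text through the parser and keeps only the text nodes:

import javax.swing.JLabel;
import org.jsoup.Jsoup;

public class LabelTextExtractor {

    // Strips any HTML markup from the label's text, returning only the visible characters.
    static String plainText(JLabel label) {
        return Jsoup.parse(label.getText()).text();
    }

    public static void main(String[] args) {
        JLabel label = new JLabel("<html><div style='text-align: center;'>Testing</div></html>");
        System.out.println(plainText(label));   // prints: Testing
    }
}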
And whoops: the code you are showing is already suited for option 3. So if you want getText() to return the original text, you could add a simple
@Override
public String getText() {
    return this.text;
}
to your CustomJLabel class.
Edit: alternatively, you could simply add a new method like
public String getTextWithoutHtmlTags()
or something alike, as overriding that inherited method somehow changes the "contract" of that method, which (depending on the context) might be OK, or not so OK.
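Spelled out, that alternative is just the same one-liner under a different name, so JLabel's own getText() contract stays untouched:

public String getTextWithoutHtmlTags() {
    return this.text;   // the raw text saved by the constructor, without the HTML wrapper
}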
There's no need for complex code or third-party JARs / libraries.
Here's a simple solution using a regex:
String htmlStr = "<html><h1>Heading</h1> ...... </html>";
String noHtmlStr = htmlStr.replaceAll("\\<.*?\\>", "");
Works great for me.
Hope this helps.
When I try to call a method with a parameter containing Polish characters, e.g.
node.call("ąćęasdasdęczć")
I get these characters as the input:
Ä?Ä?Ä?asdasdÄ?czÄ
I don't know where to set the correct encoding: in the Maven pom.xml, or in my IDE? I tried to change UTF-8 to ISO-8859-2 in my IDE settings, but it didn't work. I searched similar questions, but I didn't find the answer.
#Edit 1
Sample code:
public void findAndSendKeys(String vToSet, By vLocator) {
    WebElement element = webDriverWait.until(ExpectedConditions.presenceOfElementLocated(vLocator));
    element.sendKeys(vToSet);
}
By nameLoc = By.id("First_Name");
findAndSendKeys("ąćęasdasdęczć", nameLoc);
Then in the input field I get Ä?Ä?Ä?asdasdÄ?czÄ. Converting the string to Basic Latin in my IDE helps, but it's not the solution I need.
I also have problems with fields in classes. For example, I have a class in which I have to convert the String to Basic Latin:
public class Contacts {
    private static final By LOC_ADDRESS_BTN = By.xpath("//button[contains(@aria-label,'Wybór adresu')]");
    // it doesn't work; I have to use Basic Latin and replace "ó" with "\u00f3" in my IDE
}
#Edit 2 - Changed encoding, but problem still exists
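As an editorial aside (not part of the original post): the Ä? pattern in place of ą/ć/ę is the classic sign of UTF-8 bytes being decoded with a single-byte charset, so checking what the JVM running the tests actually uses can tell you whether the IDE, the Maven compiler plugin, or the Surefire test runner is the layer that still needs its encoding set:

import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) {
        // If either of these is not UTF-8 while the source files are UTF-8,
        // Polish characters in string literals will be mangled exactly as shown above.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
    }
}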
I have been having trouble trying to get proxies from hidemyass. I was wondering if anybody could either tell me what I'm doing wrong or show me a way of fixing the following:
public void loadProxies() {
    proxies.clear();
    String html = null;
    String url = "http://hidemyass.com/proxy-list/";
    int page = 1;
    Pattern REPLACECRAP = Pattern.compile("<(span|div) style=\"display:none\">[\\s\\d\\s]*</(span|div)>");
    while (page <= this.pages) {
        status = "Scraping Proxies " + page + "/40";
        try {
            html = Jsoup.connect(url + page).get().html();
            org.jsoup.select.Elements ele = Jsoup.parse(html).getElementsByAttributeValueMatching("class", "altshade");
            for (org.jsoup.nodes.Element s : ele) {
                org.jsoup.select.Elements ele1 = Jsoup.parse(s.toString()).children();
                String text = ele1.toString().substring(ele1.toString().indexOf("</span>"),
                        ele1.toString().indexOf("<span class=\"country\""));
                org.jsoup.select.Elements ele2 = Jsoup.parse(text).children();
                Matcher matcher = REPLACECRAP.matcher(ele2.toString());
                String better = matcher.replaceAll("");
                ele2 = Jsoup.parse(better).children();
                String done = ele2.text();
                String port = done.substring(done.lastIndexOf(" ") + 1);
                String ip = done.substring(0, done.lastIndexOf(" ")).replaceAll(" ", "");
                proxies.add(ip + ":" + port);
            }
            page++;
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
This does get some part of the proxy from the website although it seems to be mixing bits together like this:
PROXY:98210.285995154180237.6396219.54:3128
PROXY:58129158250.246.179237.4682139176:1080
PROXY:5373992110205212248.8199175.88107.15141185249:8080
PROXY:34596887144221.4.2449100134138186248.231:9000
Those are some of the results I get when running the above code, when what I would want is something like PROXY:210:197:182:294:8080.
Any help with this would be greatly appreciated.
Unless you really want to do it this way, consider taking a look at http://import.io which provides a tool to parse anything you want and to export it as an API.
If you're using Java, you can try http://thuzhen.github.io/facilitator/ which will help you get your data very quickly.
Parsing this website is going to take more than running a regex over the source.
It has been designed to make scraping difficult, mixing random data with display:none in with data that you're looking for.
If you're going to try and parse this correctly, you'll need to pick out the data marked as display:inline as well as parsing the inline CSS before each row which marks elements with certain ids as inline or none as appropriate.
Also, since the website is designed to make scraping as difficult as possible, I'd expect them to regularly change the source in ways that will break scrapers that currently work.
HideMyAss uses a variety of tactics. And despite what people always say about "you can't do that with regex!", yes you can; well, with the help of regex, at least: I wrote a scraper for HideMyAss that relies on it heavily.
In addition to what you've taken out, you need to check for inline css like:
.HE8g{display:none}
.rI6a{display:inline}
.aHd-{display:none}
.Ln16{display:inline}
and remove any elements whose class the inline CSS sets to display:none, for example:
<span class="HE8g">48</span>
which are interjected throughout the IP addresses,
as well as any empty spans that are left behind. As far as I remember there are no empty divs to worry about, but it wouldn't hurt to check for them.
There are a few gotchas, but the obfuscated HTML is very predictable and has been for years.
It was easiest for me to solve by running against the same HTML source and removing the obfuscations in a step-by-step fashion.
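For readers who want a concrete starting point, here is a rough sketch of those steps using Jsoup (my illustration, not the answerer's actual scraper; the CSS rule pattern and method names are assumptions based on the samples above):

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HmaRowCleaner {

    // Matches inline CSS rules like ".HE8g{display:none}" and captures the class name.
    private static final Pattern HIDDEN_RULE =
            Pattern.compile("\\.([\\w-]+)\\s*\\{\\s*display\\s*:\\s*none");

    // rowHtml: the HTML of one proxy row; inlineCss: the style block that precedes it.
    static String visibleText(String rowHtml, String inlineCss) {
        // 1. Collect the class names that the inline CSS hides.
        Set<String> hiddenClasses = new HashSet<>();
        Matcher m = HIDDEN_RULE.matcher(inlineCss);
        while (m.find()) {
            hiddenClasses.add(m.group(1));
        }

        Document doc = Jsoup.parseBodyFragment(rowHtml);

        // 2. Drop elements hidden via style="display:none" or via one of the hidden classes.
        for (Element el : doc.select("[style]")) {
            if (el.attr("style").replace(" ", "").contains("display:none")) {
                el.remove();
            }
        }
        for (String cls : hiddenClasses) {
            doc.select("." + cls).remove();
        }

        // 3. Drop empty spans left over after the purge.
        for (Element span : doc.select("span")) {
            if (span.text().trim().isEmpty()) {
                span.remove();
            }
        }

        // What remains should be the visible digits of the IP address.
        return doc.body().text().replace(" ", "");
    }
}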
I know this is an old question, but good luck to anyone reading.
My overall goal is to return only clean sentences from a Wikipedia article without any markup. Obviously, there are ways to return JSON, XML, etc., but these are full of markup. My best approach so far is to return what Wikipedia calls raw. For example, the following link returns the raw format for the page "Iron Man":
http://en.wikipedia.org/w/index.php?title=Iron%20Man&action=raw
Here is a snippet of what is returned:
...//I am truncating some markup at the beginning here.
|creative_team_month =
|creative_team_year =
|creators_series =
|TPB =
|ISBN =
|TPB# =
|ISBN# =
|nonUS =
}}
'''Iron Man''' is a fictional character, a [[superhero]] that appears in\\
[[comic book]]s published by [[Marvel Comics]].
...//I am truncating here everything until the end.
I have stuck to the raw format because I have found it the easiest to clean up. Although what I have written so far in Java cleans up this pretty well, there are a lot of cases that slip by. These cases include markup for Wikipedia timelines, Wikipedia pictures, and other Wikipedia properties which do not appear on all articles. Again, I am working in Java (in particular, I am working on a Tomcat web application).
Question: Is there a better way to get clean, human-readable sentences from Wikipedia articles? Maybe someone already built a library for this which I just can't find?
I will be happy to edit my question to provide details about what I mean by clean and human-readable if it is not clear.
My current Java method which cleans up the raw formatted text is as follows:
public String cleanRaw(String input) {
    // Next three lines attempt to get rid of references.
    input = input.replaceAll("<ref>.*?</ref>", "");
    input = input.replaceAll("<ref .*?</ref>", "");
    input = input.replaceAll("<ref .*?/>", "");

    // Remove section headings.
    input = input.replaceAll("==[^=]*==", "");

    // I found that anything between curly braces is not needed.
    while (input.indexOf("{{") >= 0) {
        int prevLength = input.length();
        input = input.replaceAll("\\{\\{[^{}]*\\}\\}", "");
        if (prevLength == input.length()) {
            break;
        }
    }

    // Next line gets rid of links to other Wikipedia pages.
    input = input.replaceAll("\\[\\[([^]]*[|])?([^]]*?)\\]\\]", "$2");

    // Strip HTML comments, then everything except basic characters.
    input = input.replaceAll("<!--.*?-->", "");
    input = input.replaceAll("[^A-Za-z0-9., ]", "");
    return input;
}
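For completeness, a throwaway harness for this method against the raw endpoint from the question might look like this; WikiCleaner is a made-up class name standing in for wherever cleanRaw actually lives, and there is no error handling:

import java.net.URL;
import java.util.Scanner;

public class WikiCleanerDemo {
    public static void main(String[] args) throws Exception {
        String rawUrl = "http://en.wikipedia.org/w/index.php?title=Iron%20Man&action=raw";
        try (Scanner sc = new Scanner(new URL(rawUrl).openStream(), "UTF-8")) {
            String raw = sc.useDelimiter("\\A").next();   // read the whole response
            System.out.println(new WikiCleaner().cleanRaw(raw));
        }
    }
}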
I found a couple of projects that might help. You might be able to run the first one by including a JavaScript engine in your Java code; there is a rough sketch of that at the end of this answer.
txtwiki.js
A javascript library to convert MediaWiki markup to plaintext.
https://github.com/joaomsa/txtwiki.js
WikiExtractor
A Python script that extracts and cleans text from a Wikipedia database dump
http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
Source:
http://www.mediawiki.org/wiki/Alternative_parsers
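That "JavaScript engine" idea could look roughly like this with the JDK's javax.script API; note that the global txtwiki object and its parseWikitext function are assumptions on my part, so check the project's README for the real entry point:

import java.io.FileReader;
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class TxtWikiBridge {
    public static void main(String[] args) throws Exception {
        ScriptEngine js = new ScriptEngineManager().getEngineByName("JavaScript"); // Rhino or Nashorn, depending on the JDK
        js.eval(new FileReader("txtwiki.js"));   // the file downloaded from the GitHub repository

        // ASSUMPTION: the library exposes a global 'txtwiki' object with a parseWikitext(markup) function.
        Object txtwiki = js.get("txtwiki");
        Object plain = ((Invocable) js).invokeMethod(txtwiki, "parseWikitext",
                "'''Iron Man''' is a fictional character, a [[superhero]].");
        System.out.println(plain);
    }
}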