Grabbing proxies from website in Java? - java

I have been having trouble trying to get proxies from hidemyass. I was wondering if anybody could either tell me what I'm doing wrong or show me a way of fixing the following:
public void loadProxies()
{
proxies.clear();
String html = null;
String url = "http://hidemyass.com/proxy-list/";
int page = 1;
Pattern REPLACECRAP = Pattern.compile("<(span|div) style=\"display:none\">[\\s\\d\\s]*</(span|div)>");
while (page <= this.pages) {
status = "Scraping Proxies " + page + "/40";
try {
html = Jsoup.connect(url + page).get().html();
org.jsoup.select.Elements ele = Jsoup.parse(html).getElementsByAttributeValueMatching("class", "altshade");
for (Iterator localIterator = ele.iterator(); localIterator.hasNext();) {
Object s = localIterator.next();
org.jsoup.select.Elements ele1 = Jsoup.parse(s.toString()).children();
String text = ele1.toString().substring(ele1.toString().indexOf("</span>"), ele1.toString().indexOf("<span class=\"country\""));
org.jsoup.select.Elements ele2 = Jsoup.parse(text).children();
Matcher matcher = REPLACECRAP.matcher(ele2.toString());
String better = matcher.replaceAll("");
ele2 = Jsoup.parse(better).children();
String done = ele2.text();
String port = done.substring(done.lastIndexOf(" ") + 1);
String ip = done.substring(0, done.lastIndexOf(" ")).replaceAll(" ", "");
proxies.add(ip + ":" + port);
}
page++;
} catch (Exception e) {
e.printStackTrace();
}
}
}
This does get some part of the proxy from the website although it seems to be mixing bits together like this:
PROXY:98210.285995154180237.6396219.54:3128
PROXY:58129158250.246.179237.4682139176:1080
PROXY:5373992110205212248.8199175.88107.15141185249:8080
PROXY:34596887144221.4.2449100134138186248.231:9000
Those are some of the results I get when running the above code, whereas I want something like PROXY:210:197:182:294:8080
Any help with this would be greatly appreciated.

Unless you really want to do it this way, consider taking a look at http://import.io, which provides a tool to parse anything you want and export it as an API.
If you're using Java, you can try http://thuzhen.github.io/facilitator/, which will help you get your data quickly.

Parsing this website is going to take more than running a regex over the source.
It has been designed to make scraping difficult, mixing random data with display:none in with data that you're looking for.
If you're going to try and parse this correctly, you'll need to pick out the data marked as display:inline as well as parsing the inline CSS before each row which marks elements with certain ids as inline or none as appropriate.
Also, when the website is designed to make scraping as difficult as possible, I'd expect them to regularly change the source in ways that will break scrapers that currently work.

HideMyAss uses a variety of tactics. And despite what people always say about "you can't do that with regex!", yes you can. Well, with the help of regex at least: I wrote a scraper for HideMyAss that relies on it heavily.
In addition to what you've taken out, you need to check for inline css like:
.HE8g{display:none}
.rI6a{display:inline}
.aHd-{display:none}
.Ln16{display:inline}
and remove any elements matching display none in the inline css:
<span class="HE8g">48</span>
These are interjected throughout the IP addresses, as are empty spans:
<span></span>
As far as I remember there are no empty divs that you need to worry about, but it wouldn't hurt to check for them.
There are a few gotchas but the obfuscated html is very predictable and has been for years.
It was easiest for me to solve by running against the same html source and to remove the obfuscations in a step by step fashion.
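That step-by-step removal might look roughly like the sketch below. It is only an illustration under assumptions: the class name ProxyCellCleaner, the method extractIp, and the sample cell are all hypothetical, and the sample is modeled on the description above rather than on the site's real markup, which may differ or change.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and sample markup are assumptions, not the site's real layout.
class ProxyCellCleaner {
    // Inline CSS rules such as ".HE8g{display:none}".
    private static final Pattern HIDDEN_RULE =
            Pattern.compile("\\.([\\w-]+)\\s*\\{\\s*display\\s*:\\s*none\\s*\\}");
    // Elements hidden directly through a style attribute.
    private static final Pattern HIDDEN_INLINE =
            Pattern.compile("<(span|div)\\s+style=\"display:\\s*none\">[^<]*</\\1>");

    static String extractIp(String cellHtml) {
        // Step 1: collect the class names that the inline CSS marks as hidden.
        Set<String> hidden = new HashSet<>();
        Matcher m = HIDDEN_RULE.matcher(cellHtml);
        while (m.find()) {
            hidden.add(m.group(1));
        }
        // Step 2: drop the <style> block itself.
        String s = cellHtml.replaceAll("(?s)<style[^>]*>.*?</style>", "");
        // Step 3: drop elements hidden by an inline style attribute.
        s = HIDDEN_INLINE.matcher(s).replaceAll("");
        // Step 4: drop elements whose class was marked display:none.
        for (String cls : hidden) {
            s = s.replaceAll(
                    "<(span|div)\\s+class=\"" + Pattern.quote(cls) + "\">[^<]*</\\1>", "");
        }
        // Step 5: strip the remaining tags (including empty spans) and whitespace.
        return s.replaceAll("<[^>]+>", "").replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        String cell = "<style>.HE8g{display:none}.rI6a{display:inline}</style>"
                + "<span class=\"HE8g\">48</span><span class=\"rI6a\">210</span>."
                + "<span style=\"display:none\">12</span>197<span></span>."
                + "<div style=\"display:none\">7</div>"
                + "<span style=\"display: inline\">182</span>.94";
        System.out.println(extractIp(cell));
    }
}
```

Keeping each removal as its own step makes it easier to run the same saved HTML through the method and see which obfuscation is still leaking digits into the result.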
I know this is an old question, but good luck to anyone reading.

Related

When I parsed with jsoup, the contents of the tag disappeared

I am studying the jsoup library and have run into a difficulty. A tag that I can see in the Chrome developer tools disappears when the page is parsed. Please help.
The contents of the div tag with an id called cbox_module are missing. Tell me how to get the contents of this tag.
This is my code:
public static void main(String[] args) throws Exception {
// TODO Auto-generated method stub
String url = "https://comic.naver.com/webtoon/detail.nhn?titleId=597447&no=364&weekday=sat";
String address = "https://comic.naver.com/comment/comment.nhn?titleId=651673&no=514";
Document doc = Jsoup.connect(address).get();
Elements el = doc.select("#cbox_module");
System.out.println(doc);
System.out.println(el);
}
I am sorry if my English was poor. I'm a foreigner and I'm using a translator.
Not entirely sure what you're looking to extract here but you're taking the cbox_module which is the very first element after the <body> tags.
Looking through the network tab in chrome tools, I can see a request to:
https://apis.naver.com/commentBox/cbox/web_naver_list_jsonp.json?ticket=comic&templateId=webtoon&pool=cbox3&_callback=jQuery112408278558406808354_1605604312744&lang=ko&country=KR&objectId=651673_514&categoryId=&pageSize=15&indexSize=10&groupId=&listType=OBJECT&pageType=default&page=1&initialize=true&userType=&useAltSort=true&replyPageSize=10&_=1605604312745
That returns the Json that populates the comments in the page which gives you more direct access to the data you probably want.
Most of the query params are in the initial html response received however one in particular is not... _callback=jQuery112408278558406808354_1605604312744
1605604312744 - is a Unix timestamp in milliseconds (13 digits, so it does not fit in an int) and is easy enough to obtain using long now = System.currentTimeMillis();
jQuery112408278558406808354 - it was tough to work out how this is computed, but from the scripts:
...
n.extend({
expando: "jQuery" + (m + Math.random()).replace(/\D/g, ""),
...
.replace(/\D/g, "") - remove all non-digits - so a random number.
m is "1.12.4" at the top of the script (possibly the jQuery version in use), so a random value such as 0.9999999... is appended to 1.12.4, the non-digits (including the dots) are removed, and we get the random _callback parameter value: 1124 followed by the digits of the random number (16 decimal places).
This comes from: https://comic.naver.com/aggregate/javascript/release/comic_comment_20201113103349.js
I tried with this but it did not work, so I'm a little stumped as to how it's working. I saw a .toLowerCase somewhere, and when I do that on the URL + query params I get a "partner not recognised" error... something is clearly not quite right.
I'd try with HtmlUnit as it has support for Javascript - it may make this much simpler.
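For reference, the callback name described above could be built in Java along these lines. This is only a sketch of the naming scheme: the class JsonpCallbackName and its methods are hypothetical, the suffix is taken to be epoch milliseconds because the value in the URL above has 13 digits, and Java may format some doubles differently from JavaScript. As noted above, generating the name alone was not enough to make the request succeed.

```java
// Hypothetical helper that mimics: "jQuery" + (m + Math.random()).replace(/\D/g, "")
class JsonpCallbackName {
    // "m" in the script; the answer above reads it as the jQuery version string.
    private static final String JQUERY_VERSION = "1.12.4";

    // Deterministic variant so the construction can be checked with fixed inputs.
    static String build(double random, long epochMillis) {
        // Concatenate version and random value, then keep digits only.
        String expando = "jQuery" + (JQUERY_VERSION + random).replaceAll("\\D", "");
        return expando + "_" + epochMillis;
    }

    // Convenience variant using a fresh random value and the current time.
    static String build() {
        return build(Math.random(), System.currentTimeMillis());
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```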

How can I get a real page count in the InDesign Java document

I am using Adobe InDesign CS5 Server Java. For setting the desired preferences, I am using the following code:
Document myDocument = myApp.addDocument(OptArg.noDocumentPreset());
DocumentPreference docPrefs = myDocument.getDocumentPreferences();
docPrefs.setPageHeight(UnitUtils.createString("800pt"));
docPrefs.setPageWidth(UnitUtils.createString("600pt"));
docPrefs.setPageOrientation(kPageOrientationLandscape.value);
docPrefs.setPagesPerDocument(16);
I would like to know if it is somehow possible to find out the real document page count in java, without setting setPagesPerDocument? Thank you in advance for any help.
You can simply find out the number of pages like this:
var pageCount = myDocument.pages.length
$.writeln("The document has " + pageCount + " pages.");
Btw. the InDesign scripting is done in JavaScript (or more precisely in ExtendScript which is a JavaScript dialect) which is a very different language than Java.
Edit: Ok, answering your comment, I have no idea what InDesignServerAPI.jar is, but looking at your code it looks like the InDesign ExtendScript language is just sort of wrapped into Java code. So my guess would be, that you can get the page count like this:
int pageCount = myDocument.pages.length;
Just in case. Sorry, I don't know how it works in Java. But in Python on Windows it can be done this way:
from win32com.client import Dispatch
app = Dispatch('InDesign.Application.CS6')
doc = app.Open(r"d:\sample.indd")
pages = doc.pages
pages_length = len(pages)
doc.Close()
print(pages_length)

HTML break tag is visible in the text instead of breaking the line

I have trouble using the break tag to break the text in new line.
Here is what I get in my Web-Browser:
The text is read from a database table.
Here is the entry:
The database entry is created with Java.
Here is the Java code:
StringBuilder sb = new StringBuilder();
sb.append("in Folgender Stückelung : <br />");
while (itKassette.hasNext()) {
KassetteBefuellungZuweisung kassetteBef = (KassetteBefuellungZuweisung) itKassette.next();
int Anzahl = kassetteBef.getAnzahl();
double inhaltNennwert = kassetteBef.getWert() / Anzahl;
if (Anzahl != 0) {
sb.append("<br />Inhalt (Nennwert): " + inhaltNennwert + " Anzahl: " + Anzahl);
}
}
I also tried to figure out what is wrong by looking in the chrome console.
Here:
But I didn't find anything wrong there, so I searched the HTML source code of that page and found something suspicious.
Here is the HTML:
For some reason the browser converted the encoding!
This is a Struts project and I'm pretty new to Struts, so I can't say where or why this happens, but hopefully I'll get a few answers here. Thanks in advance.
In the Java code it may not work if you entered tags.
Try '\n' instead of <br />.
Try using Tag Builder. TagBuilder is a class that specially designed for creating html tags and their content.
This is a pre-formatting issue.
You can wrap your content with <pre> tag.
You should use CSS white-space:pre applied to the appropriate <td>.
Here, is a reference link related to this issue.
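If you go the newline route suggested above, the string building could look like the following sketch. It assumes the cell that shows the text is styled with white-space: pre (or wrapped in <pre>) so the browser keeps the line breaks; the class BreakdownText and the two arrays are hypothetical stand-ins for the iterator over KassetteBefuellungZuweisung, with counts[i] playing the role of getAnzahl() and values[i] the role of getWert().

```java
// Hypothetical stand-in for the loop in the question, using plain arrays.
class BreakdownText {
    static String build(int[] counts, double[] values) {
        // Plain newlines instead of <br />, so HTML escaping cannot expose any tags.
        StringBuilder sb = new StringBuilder("in Folgender St\u00fcckelung :\n");
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] == 0) {
                continue; // skip empty entries before dividing by the count
            }
            double nominal = values[i] / counts[i];
            sb.append("\nInhalt (Nennwert): ").append(nominal)
              .append(" Anzahl: ").append(counts[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(new int[] {2, 0}, new double[] {100.0, 0.0}));
    }
}
```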

Any way to return only (clean) text from a Wikipedia article?

My overall goal is to return only clean sentences from a Wikipedia article without any markup. Obviously, there are ways to return JSON, XML, etc., but these are full of markup. My best approach so far is to return what Wikipedia calls raw. For example, the following link returns the raw format for the page "Iron Man":
http://en.wikipedia.org/w/index.php?title=Iron%20Man&action=raw
Here is a snippet of what is returned:
...//I am truncating some markup at the beginning here.
|creative_team_month =
|creative_team_year =
|creators_series =
|TPB =
|ISBN =
|TPB# =
|ISBN# =
|nonUS =
}}
'''Iron Man''' is a fictional character, a [[superhero]] that appears in\\
[[comic book]]s published by [[Marvel Comics]].
...//I am truncating here everything until the end.
I have stuck to the raw format because I have found it the easiest to clean up. Although what I have written so far in Java cleans up this pretty well, there are a lot of cases that slip by. These cases include markup for Wikipedia timelines, Wikipedia pictures, and other Wikipedia properties which do not appear on all articles. Again, I am working in Java (in particular, I am working on a Tomcat web application).
Question: Is there a better way to get clean, human-readable sentences from Wikipedia articles? Maybe someone already built a library for this which I just can't find?
I will be happy to edit my question to provide details about what I mean by clean and human-readable if it is not clear.
My current Java method which cleans up the raw formatted text is as follows:
public String cleanRaw(String input){
//Next three lines attempt to get rid of references.
input= input.replaceAll("<ref>.*?</ref>","");
input= input.replaceAll("<ref .*?</ref>","");
input= input.replaceAll("<ref .*?/>","");
input= input.replaceAll("==[^=]*==", "");
//I found that anything between curly braces is not needed.
while (input.indexOf("{{") >= 0){
int prevLength= input.length();
input= input.replaceAll("\\{\\{[^{}]*\\}\\}", "");
if (prevLength == input.length()){
break;
}
}
//Next line gets rid of links to other Wikipedia pages.
input= input.replaceAll("\\[\\[([^]]*[|])?([^]]*?)\\]\\]", "$2");
input= input.replaceAll("<!--.*?-->","");
input= input.replaceAll("[^A-Za-z0-9., ]", "");
return input;
}
I found a couple of projects that might help. You might be able to run the first one by including a Javascript engine in your Java code.
txtwiki.js
A javascript library to convert MediaWiki markup to plaintext.
https://github.com/joaomsa/txtwiki.js
WikiExtractor
A Python script that extracts and cleans text from a Wikipedia database dump
http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
Source:
http://www.mediawiki.org/wiki/Alternative_parsers
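If you stay with the regex approach from the question, a minimal sketch of three of its steps is shown below, split into separate methods so each can be checked on its own. The class name WikiTextCleaner and the method names are hypothetical. The main differences from cleanRaw are the (?s) flag, which lets references and comments span several lines, and removing self-closing references before paired ones so that a self-closing tag is not treated as an opener. It is still a heuristic and will miss tables, timelines, and other markup.

```java
// Hypothetical helper covering three of the cleanup steps from the question.
class WikiTextCleaner {
    static String stripTemplates(String input) {
        // Remove innermost {{...}} repeatedly so nested templates disappear too.
        String previous;
        do {
            previous = input;
            input = input.replaceAll("\\{\\{[^{}]*\\}\\}", "");
        } while (!input.equals(previous));
        return input;
    }

    static String unwrapLinks(String input) {
        // [[target|label]] becomes label, [[target]] becomes target.
        return input.replaceAll("\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]", "$1");
    }

    static String stripRefsAndComments(String input) {
        return input
                // Self-closing references first, e.g. <ref name="a"/>.
                .replaceAll("<ref\\b[^>]*/>", "")
                // (?s) lets '.' match line breaks inside paired references.
                .replaceAll("(?s)<ref\\b[^>]*>.*?</ref>", "")
                // HTML comments may also span lines.
                .replaceAll("(?s)<!--.*?-->", "");
    }

    public static void main(String[] args) {
        String raw = "{{Infobox|a={{nested|x}}}}Iron Man is a [[superhero]] in "
                + "[[comic book]]s published by [[Marvel Comics]].<ref>multi\nline</ref>";
        System.out.println(unwrapLinks(stripRefsAndComments(stripTemplates(raw))));
    }
}
```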

How to set text link as the actual URL and not the text tag from HTML

I've been trying to figure this one out for a bit using patterns or other utils but haven't gotten it to work just yet.
Say I have an HTML link:
<a href="http://www.google.com">Google, Inc.</a>
I want to make a link into a text view BUT set the text of the link as the actual URL not Google, Inc.
So for example if the data I received is:
--Hey if you want to try a search go to <a href="http://www.google.com">Google, Inc.</a> and it's easy as that.
I want it to display as:
--Hey if you want to try a search go to http://www.google.com and it's easy as that.
Instead of:
--Hey if you want to try a search go to Google, Inc. and it's easy as that.
Html.fromHtml() makes it show as "Google, Inc." automatically, but isn't the result that I want.
Also, I don't need this to work for specifically this example, I need it to work for all html links as I don't know what links I will get as data.
It's actually pretty tricky, but I have found a way to do so, thanks to SO.
Here is the answer:
TextView tvYourTextView = (TextView) findViewById(R.id.yourTextViewThatShowsALink);
tvYourTextView.setMovementMethod(LinkMovementMethod.getInstance()); //that will make your links work.
PS:
Don't forget to use Html.fromHtml("your content as html")
I suggest using Linkify and parsing the Strings yourself to create the links you want.
Set Linkify to look for raw web addresses and turn them into links in your text:
<TextView
android:layout_width="wrap_content"
android:layout_height="wrap_content"
android:autoLink="web"
/>
This function will pull the raw address out the html tags:
public String parseLinks(String raw) {
String openTag = "<a href=\"";
String closeTag = "</a>";
String result = "";
int start = 0;
int middle = raw.indexOf(openTag);
int end;
while(middle > -1) {
result += raw.substring(start, middle);
end = raw.indexOf("\">", middle);
result += raw.substring(middle + openTag.length(), end);
start = raw.indexOf(closeTag, end) + closeTag.length();
middle = raw.indexOf(openTag, start);
}
result += raw.substring(start, raw.length());
return result;
}
Note that this function does no error checking; I recommend adding that yourself.
Now simply pass the String returned from parseLinks() to your TextView, like this:
textView.setText(parseLinks(rawHTML));
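As an alternative to the index-based parseLinks(), the same replacement can be sketched with a regular expression, which also tolerates extra attributes and single-quoted href values. The class AnchorToUrl and the method replaceAnchorsWithUrls are hypothetical names, and a regex like this is only a rough fit for well-formed anchors, not a general HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical regex-based variant of parseLinks() from the answer above.
class AnchorToUrl {
    // Group 1 is the quote character, group 2 the href value.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1[^>]*>.*?</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String replaceAnchorsWithUrls(String html) {
        // Replace each whole anchor element with the URL it points to.
        return ANCHOR.matcher(html).replaceAll("$2");
    }

    public static void main(String[] args) {
        String raw = "Hey if you want to try a search go to "
                + "<a href=\"http://www.google.com\">Google, Inc.</a> and it's easy as that.";
        System.out.println(replaceAnchorsWithUrls(raw));
    }
}
```

Text without any anchors passes through unchanged, so the result can be handed to the TextView in the same way as the output of parseLinks().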
Hope that helps!
