url harvester string manipulation

url harvester string manipulation - java

I'm doing a recursive URL harvest. When I find a link in the source that doesn't start with "http", I append it to the current URL. The problem is that on a dynamic site, a link without "http" is usually a new parameter for the current URL. For example, the current URL might be http://www.somewebapp.com/default.aspx?pageid=4088, and the source for that page contains the link default.aspx?pageid=2111. In this case I need to do some string manipulation; this is where I need help.
Pseudocode:
if the link found contains a substring of the current URL
    save the substring
    save the unique part of the link found
    replace whatever comes after the substring in the current URL with the saved unique part
What would this look like in Java? Any ideas for doing this differently? Thanks.
As per the comment, here's what I've tried:
if (!matched.startsWith("http")) {
    String[] splitted = url.toString().split("/");
    String endOfURL = splitted[splitted.length-1];
    boolean b = false;
    while (!b && endOfURL.length() > 5) { // f.bar shortest val
        endOfURL = endOfURL.substring(0, endOfURL.length()-2);
        if (matched.contains(endOfURL)) {
            matched = matched.substring(endOfURL.length()-1);
            matched = url.toString().substring(url.toString().length() - matched.length()) + matched;
            b = true;
        }
    }
}
It's not working well.

I think you are going about this the wrong way. Java has two classes, URL and URI, which can parse URL/URI strings much more accurately than a "string bashing" solution. For example, the URL constructor URL(URL, String) creates a new URL object in the context of an existing one, so you don't need to worry about whether the String is an absolute URL or a relative one. You would use it something like this:
URL currentPageUrl = ...
String linkUrlString = ...
// (Exception handling not included ...)
URL linkUrl = new URL(currentPageUrl, linkUrlString);
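The same idea can be sketched with java.net.URI.resolve, which is a substitute for the URL constructor above rather than the code from this answer. The class name LinkResolver and the method resolveLink are made up for illustration; the URLs are the ones from the question plus a hypothetical absolute link.

```java
import java.net.URI;

final class LinkResolver {
    // Resolves a link found in a page against that page's URL.
    // Absolute links come back unchanged; relative ones are merged with the base.
    static String resolveLink(String currentPageUrl, String link) {
        return URI.create(currentPageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String current = "http://www.somewebapp.com/default.aspx?pageid=4088";
        // Relative link: only the file name and query are replaced.
        System.out.println(resolveLink(current, "default.aspx?pageid=2111"));
        // Absolute link (hypothetical): returned as-is.
        System.out.println(resolveLink(current, "http://example.org/other"));
    }
}
```

URI.create throws IllegalArgumentException for strings that are not valid URIs, so a harvester would probably want to catch that and skip the offending link.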

Related

Extract part of Dynamic Url?

I have the following URL: http://127.0.0.1/?code=AQABAAIAAAAGV_bv21
I need to capture the characters after code=, but every time the URL is loaded that code is different.
I had something like this, but since the code is dynamic I cannot rely on it:
String url = "http://127.0.0.1/?code=AQABAAIAAAAGV_bv21";
String code = url.substring(url.length() - 10);

You can use something like the following. Note that split takes a regular expression, so the ? must be escaped:
String code = url.split("\\?code=")[1];

If you are on Android:
String url = "http://127.0.0.1/?code=AQABAAIAAAAGV_bv21";
Uri uri = Uri.parse(url);
String code = uri.getQueryParameter("code");
Or try the following regexp:
(\?|\&)([^=]+)\=([^&]+)

Try this.
String url = "http://127.0.0.1/?code=AQABAAIAAAAGV_bv21";
String codeValue = url.replaceAll(".*code=([^&]*).*", "$1");
System.out.println(codeValue);
output:
AQABAAIAAAAGV_bv21
This method works even if other parameters are added. For example http://127.0.0.1/?id=123&code=AQABAAIAAAAGV_bv21&opt=yes
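For plain Java without Android's Uri class, one possible sketch parses the query string with java.net.URI and splits it by hand. The class QueryParams and its get method are hypothetical names, not part of any library, and URLDecoder.decode(String, Charset) assumes Java 10 or later.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class QueryParams {
    // Returns the first value of the named query parameter, if present.
    static Optional<String> get(String url, String name) {
        String rawQuery = URI.create(url).getRawQuery();
        if (rawQuery == null) {
            return Optional.empty(); // URL has no query string at all
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq >= 0 ? pair.substring(0, eq) : pair;
            if (decode(key).equals(name)) {
                // A parameter without '=' is treated as having an empty value.
                return Optional.of(eq >= 0 ? decode(pair.substring(eq + 1)) : "");
            }
        }
        return Optional.empty();
    }

    private static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String url = "http://127.0.0.1/?id=123&code=AQABAAIAAAAGV_bv21&opt=yes";
        System.out.println(get(url, "code").orElse("(not found)"));
    }
}
```

Unlike a regex on the whole URL, this compares complete parameter names, so a parameter such as zipcode would not be mistaken for code.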

Cannot get '#' symbol in Controller using Spring @RequestParam

I have the following request URL: /search?charset=UTF-8&q=C%23C%2B%2B.
My controller looks like this:
@RequestMapping(method = RequestMethod.GET, params = "q")
public String refineSearch(@RequestParam("q") final String searchQuery,....
Here searchQuery is 'CC++', even though '#' is encoded as '%23' and '+' as '%2B'.
Why does searchQuery not contain '#'?

I resolved a similar problem by URL-encoding the hash part. We have a Spring web server and a client that mixes plain JS and VueJS. This fixed my problem:
const location = window.location;
const redirect = location.pathname + encodeURIComponent(location.hash);

The main cause is what is known as the "fragment identifier". The usual definition says:
The fragment identifier introduced by a hash mark # is the optional last part of a URL for a document. It is typically used to identify a portion of that document.
Everything after a # sign is information for the client side: the browser keeps the fragment to itself and does not send it to the server, so put only what the browser needs there. You can run into the same kind of problem with other reserved URI characters; look up percent-encoding for details. In my opinion the simplest solution is character replacement, which you could try on the server side.
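To illustrate the difference between a raw # and an encoded %23, here is a small sketch using java.net.URI. The host example.com and the class FragmentDemo are placeholders chosen for the example; only the query value comes from the question.

```java
import java.net.URI;

final class FragmentDemo {
    // Decoded query component of the URL, or null if there is none.
    static String queryOf(String url) {
        return URI.create(url).getQuery();
    }

    // Decoded fragment component of the URL, or null if there is none.
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    public static void main(String[] args) {
        // A raw '#' ends the query and starts the fragment.
        String raw = "http://example.com/search?q=C#C++";
        System.out.println(queryOf(raw) + " | " + fragmentOf(raw));

        // An encoded '#' (%23) stays inside the query value.
        String encoded = "http://example.com/search?q=C%23C%2B%2B";
        System.out.println(queryOf(encoded) + " | " + fragmentOf(encoded));
    }
}
```

Since the request in the question already uses %23, the fragment rule alone would not remove the '#' there; it matters when the URL is built without encoding.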

Finally I found the problem. In the filter chain, the ServletRequest is wrapped in an XSSRequestWrapper with a DefaultXSSValueTranslator. Its method String stripXSS(String value) iterates through a list of patterns and deletes every part of the value that matches one of them.
The pattern list contains the pattern "\u0023", so '#' is replaced with "".
DefaultXSSValueTranslator:
private String stripXSS(String value) {
    Pattern scriptPattern;
    if (value != null && value.length() > 0) {
        for(Iterator var3 = this.patterns.iterator(); var3.hasNext(); value = scriptPattern.matcher(value).replaceAll("")) {
            scriptPattern = (Pattern)var3.next();
        }
    }
    return value;
}
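The effect can be reproduced in isolation with a short sketch. It is not the filter's real code: the class HashStrippingDemo, its method names, and the one-element pattern list are assumptions made for illustration, since the actual pattern list is not shown here.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.regex.Pattern;

final class HashStrippingDemo {
    // Percent-decodes a raw query value, as the servlet container would.
    static String decode(String raw) {
        return URLDecoder.decode(raw, StandardCharsets.UTF_8);
    }

    // Removes every match of every pattern, mirroring the loop in stripXSS.
    static String strip(String value, List<Pattern> patterns) {
        for (Pattern pattern : patterns) {
            value = pattern.matcher(value).replaceAll("");
        }
        return value;
    }

    public static void main(String[] args) {
        // Hypothetical pattern list containing only the '#' pattern.
        List<Pattern> patterns = List.of(Pattern.compile("#"));
        String decoded = decode("C%23C%2B%2B");
        System.out.println(decoded);                  // value after decoding
        System.out.println(strip(decoded, patterns)); // value the controller sees
    }
}
```

Decoding works as expected and yields the '#'; it is the later stripping step that turns the value into the one seen in the controller.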

How to program a script that changes url

I want to make a Tampermonkey script that changes the URL of the page. It should check whether the URL has "youtube.com" in it and, if it doesn't, add /youtube.com to the URL.
An example of this is:
The starting website: www.website.com/watch8dzjad8
The changed website: www.website.com/youtube.com/watch8dzjad8
The script is meant to run in Tampermonkey, so that on a specific website it scans the URL and adds /youtube.com if it can't find it, since the page won't work otherwise. It would save me from copying and pasting /youtube.com 10 times a day, and help me learn how to work with URLs in JavaScript. Thanks in advance.

// Test the whole URL rather than only the host: the host never changes,
// so a host check would redirect again and again after /youtube.com is added.
if (!location.href.match(/youtube\.com/))
    location = "/youtube.com" + location.pathname;
Rather than applying this to every page whose URL lacks youtube.com, you should restrict the behaviour to a specific site, for example:
if (location.href.match(/website\.com\/watch/))
    location = "/youtube.com" + location.pathname;
Explanations
location.href.match(/website\.com\/watch/)
location.host is the domain of the page (www.website.com)
location.href is the complete URL of the page (http://www.website.com/watch8dzjad8)
match tests whether the string follows the given pattern
location = "/youtube.com" + location.pathname
setting location opens the given URL
location.pathname gives the path of the URL (/watch8dzjad8)
So if the URL (http://www.website.com/watch8dzjad8) of the visited page contains the string "website.com/watch", the script opens "/youtube.com" + "/watch8dzjad8".
As the domain is the same, a relative URL is enough; the browser knows it is on the same domain as the current page.
https://developer.mozilla.org/en-US/docs/Web/API/Window/location
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/match

Try this:
// Returns the value of the query parameter `name` from the current URL, or "" if absent
function getQueryValue(name) {
    // Escape square brackets so the name can be embedded in a regular expression
    name = name.replace(/[\[]/g, "\\[").replace(/[\]]/g, "\\]");
    var regex = new RegExp("[\\?&]" + name + "=([^&#]*)");
    var results = regex.exec(location.href);
    return results === null ? "" : results[1];
}
// new URL, read from the "curUrl" query parameter of the current page
var newUrl = getQueryValue("curUrl");
// redirect to the new page, but only if the parameter was present
if (newUrl !== "")
    location.href = newUrl;

How to read the public URL in GWT?

I'm new to GWT and I'm building a web application in which I have to create a public URL.
In this public URL I have to pass a hashtag (#) and some parameters.
I am having difficulty with two tasks:
Extracting the hashtag from the URL.
Extracting the userid from the URL.
My public URL example is: http://www.xyz.com/#profile?userid=10003

To access the URL in GWT you can use the History.getToken() method. It gives you the entire string that follows the hash ("#").
In your case (http://www.xyz.com/#profile?userid=10003) it returns the string "profile?userid=10003". Once you have this you can parse it however you want: check whether it contains "?" and split it on "?", or take a substring. How you get the information out of it is really up to you.
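One way to do that parsing is sketched below in plain Java. The token is hard-coded so the example runs outside GWT; in the application it would come from History.getToken(). TokenParser, place and parameter are hypothetical names, not GWT API.

```java
final class TokenParser {
    // Part of the token before '?', e.g. "profile".
    static String place(String token) {
        int q = token.indexOf('?');
        return q < 0 ? token : token.substring(0, q);
    }

    // Value of the named parameter after '?', or null if it is absent.
    static String parameter(String token, String name) {
        int q = token.indexOf('?');
        if (q < 0) {
            return null; // token carries no parameters
        }
        for (String pair : token.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals(name)) {
                return pair.substring(eq + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String token = "profile?userid=10003"; // stand-in for History.getToken()
        System.out.println(place(token));
        System.out.println(parameter(token, "userid"));
    }
}
```

Using indexOf and substring avoids the regex pitfall of String.split, where a bare "?" would have to be escaped.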

I guess you already have the URL. I'm not that good at regex, but this should work as long as the user ID is the only run of digits in the URL:
String yourURL = "http://www.xyz.com/#profile?userid=10003";
// Split on every letter and punctuation character, leaving only runs of digits
String[] array = yourURL.split("[\\p{Lower}\\p{Upper}\\p{Punct}]");
int userID = 0;
for (String string : array) {
    if (!string.isEmpty()) {
        userID = Integer.parseInt(string);
    }
}
System.out.println(userID);

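To get the parameters:
String userId = Window.Location.getParameter("userid");
Note that this reads the query string, so it only finds userid when the parameter comes before the # (for example http://www.xyz.com/?userid=10003#profile).
To get the anchor / hash tag:
String hash = Window.Location.getHash();
For anything else you can parse the URL yourself: look at the methods provided by Window.Location.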

jericho Html parser error in jsp page

I have written the following code:
String sourceUrlString="http://some url";
Source source=new Source(new URL(sourceUrlString));
Element INFORM = source.getElementById("main").getAllElementsByClass("game").get(i-1);
String INFORM = INFORM.replaceAll("\\s",""); //shows error here
sendResponse(resp,+INFORM);
Now I want the text fetched from Element INFORM with the white space removed. How can I do that? The String INFORM line above shows the error "Duplicate local variable INFORM".
For example, the text fetched by Element INFORM is "my name is satish"
but it must be sent in the response as
"mynameissatish"

You have used the name INFORM twice, and that's not possible! Also, replaceAll is a String method, so extract the element's text first:
String sourceUrlString = "http://some url";
Source source = new Source(new URL(sourceUrlString));
Element INFORM = source.getElementById("main").getAllElementsByClass("game").get(i-1);
String response = INFORM.getTextExtractor().toString().replaceAll("\\s", ""); // use another name here
sendResponse(resp, response); // or use '+'; not sure if sendResponse takes 1 or 2 args
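The whitespace removal itself can be tried without Jericho. The sketch below assumes the element's text has already been extracted into a String; WhitespaceStripper and stripWhitespace are made-up names for the example.

```java
final class WhitespaceStripper {
    // Removes every whitespace character (spaces, tabs, line breaks) from the text.
    static String stripWhitespace(String text) {
        return text.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        String extracted = "my name is satish"; // stand-in for the element's text
        System.out.println(stripWhitespace(extracted));
    }
}
```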
