How to detect URL to different page (also in the same domain) - java

I have a question about detecting URLs in a page, and I'm looking for the best way to solve it. I use Jsoup to download the page.
URI uri = new URI("http://www.niocchi.com/");
Document doc = Jsoup.connect(uri.toString()).get();
Elements links = doc.select("a");
This page gives me several links, for example:
http://www.niocchi.com/#Package organization
http://www.niocchi.com/#Architecture
http://www.linkedin.com/in/ivanprado
http://www.niocchi.com/examples/
I need only the links to different pages, without references to sections within a page.
From the example above, I would like to get:
http://www.linkedin.com/in/ivanprado
http://www.niocchi.com/examples/

It looks like you want to select only those <a> elements whose href attribute value contains no # character. In that case you can use
doc.select("a[href~=^[^#]+$]")
attribute~=regex is the syntax for checking whether part of an attribute's value can be matched by the regex.
A regex accepting one or more non-# characters can look like this: [^#]+
To accept only the entire string (not just part of it), the regex needs to be surrounded by the ^ and $ anchors, which represent
^ - start of the string,
$ - end of the string.

You could convert them to strings and then split them on the # mark.
For example:
public void stringSplitter() {
    String result = null;
    // example
    String[] stringURL = {"http://www.niocchi.com/#Package organization", "http://www.niocchi.com/#Architecture",
            "http://www.linkedin.com/in/ivanprado", "http://www.niocchi.com/examples/"};
    try {
        for (int i = 0; i < stringURL.length; i++) {
            // Everything before the first # is the page address
            String[] parts = stringURL[i].split("#");
            result = parts[0];
            System.out.println(result);
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
The output is:
http://www.niocchi.com/
http://www.niocchi.com/
http://www.linkedin.com/in/ivanprado
http://www.niocchi.com/examples/
I would even consider having part of the method return only unique URLs.
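The splitting idea and the "unique URLs" suggestion can be combined in one small helper. This is only a sketch: the class name UniquePages, the method otherPages and its currentPage parameter are made up for illustration, and it assumes the links are already absolute URL strings (with Jsoup, link.absUrl("href") would be one way to obtain them).

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class UniquePages {
    // Hypothetical helper: strips the #fragment from each URL, drops links that
    // point back to the current page, and removes duplicates (first-seen order).
    static Set<String> otherPages(String currentPage, List<String> urls) {
        Set<String> pages = new LinkedHashSet<>();
        for (String url : urls) {
            int hash = url.indexOf('#');
            String page = (hash == -1 ? url : url.substring(0, hash)).trim();
            if (!page.isEmpty() && !page.equals(currentPage)) {
                pages.add(page);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://www.niocchi.com/#Package organization",
                "http://www.niocchi.com/#Architecture",
                "http://www.linkedin.com/in/ivanprado",
                "http://www.niocchi.com/examples/");
        for (String page : otherPages("http://www.niocchi.com/", links)) {
            System.out.println(page);
        }
    }
}
```

A LinkedHashSet is used so that duplicates disappear while the order of the links on the page is preserved.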

Related

Java won't replace all strings, because there is text next to the tags (post improved)

I'm working on a program that formats HTML code extracted from a PDF file.
I have a list of Strings, split up by paragraph.
As the PDF has hyperlinks, I decided to replace them with a footnote number such as "[1]".
This will be used for citing sources. Eventually I plan to put it at the end of a paragraph or sentence, so you can look up the sources like you would in a book.
My Problem
For some reason not all the hyperlinks are replaced.
The reason is most likely that there is text directly next to the tag.
Hell<a href="http://www.example.com">o old chap!
Specifically, the "o" part and the "Hell" part are blocking the Java .replaceAll function from doing its job.
Expected Result
Hello [1] old chap!
EDIT:
If I just added a space before and after the URL, it might split some words, such as "help" into "hel p", which is also not an option.
My code would have to replace the URL tag (without the ) and create no new extra spaces.
This is some of my code, where the problem occurs:
for (int i = 0; i < EN.length; i++) {
    Pattern pattern_URL = Pattern.compile("<a(.+?)\">", Pattern.DOTALL);
    Matcher matcher_URL = pattern_URL.matcher(EN[i]); //Checks the current array element.
    if (matcher_URL.find() == true) {
        source_number++;
        String extractedURL = matcher_URL.group(0);
        //System.out.println(extractedURL);
        String extractedURL_fully = extractedURL.replaceAll("href=\"", ""); //Quotation marks
        //System.out.println(extractedURL_fully);
        String nobracketURL = extractedURL.replaceAll("\\)", ""); //Remove round brackets from URL
        EN[i] = EN[i].replaceAll("\\)\"", "\""); /*Replace round brackets from URL in Array. (For some reason there have been href URLs with a bracket at the end. This was already in the PDF. They were causing massive problems, because it didn't comment them out, so the entire replaceAll command didn't function.)*/
        EN[i] = EN[i].replaceAll(nobracketURL, "[" + source_number + "]"); //Replace URL tags with number and square brackets
    }
    else {
        //System.out.println("FALSE: " + "[" + i + "]");
    }
}
The whole idea is that it loops through the array and replaces all the URLs, from the start of the opening tag <a to the end of the opening tag "> (which can also be seen in the pattern regex).
Correct me if I'm wrong, but what you need is to eliminate all the <a> tags from a given string, right? If that's the case, all you need is code like the following:
final String string = "<a href=\"http://www.example.com\">Sen";
final Pattern pattern = Pattern.compile("<a(.+?)>", Pattern.DOTALL);
final Matcher matcher = pattern.matcher(string);
final String result = matcher.replaceAll("");
System.out.println(result); // prints "Sen"
Notice I didn't use replaceAll from the String object, but from the Matcher object. This replaces every match with the empty string "".
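If the goal is a running footnote number rather than plain removal, the same Matcher-based approach can be extended with appendReplacement. The following is a rough sketch, not the asker's code: the class FootnoteLinks, the method numberAnchors and the slightly stricter pattern <a\b[^>]*> are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FootnoteLinks {
    // Matches an opening anchor tag such as <a href="...">.
    // The \b keeps tags like <abbr> from being treated as links.
    private static final Pattern OPEN_ANCHOR =
            Pattern.compile("<a\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Replaces every opening <a ...> tag with [1], [2], ... in order of appearance.
    static String numberAnchors(String html) {
        Matcher m = OPEN_ANCHOR.matcher(html);
        StringBuffer out = new StringBuffer();
        int sourceNumber = 0;
        while (m.find()) {
            sourceNumber++;
            // quoteReplacement keeps the replacement text literal
            m.appendReplacement(out, Matcher.quoteReplacement("[" + sourceNumber + "]"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(numberAnchors("Hell<a href=\"http://www.example.com\">o old chap!"));
        // prints "Hell[1]o old chap!"
    }
}
```

Because the tag itself is matched by the pattern, the URL never has to be passed to replaceAll as a regex, so characters such as . or ) inside the URL cannot interfere. The marker lands exactly where the tag was; moving it to the end of the word or sentence would be a separate step that this sketch does not attempt.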

Cleaning a file name in Java

I want to write a script that will clean up the names of my .mp3 files.
I was able to write a few lines that change the name, but I want an automatic script that removes all the undesired characters ($%_!?7 etc.) and renames the file to the format Artist space dash Song.
File file = new File("C://Users//nikita//Desktop//$%#Artis8t_-_35&Son5g.mp3");
String Original = file.toString();
String New = "Code to change 'Original' to 'Artist - Song'";
File file2 = new File("C://Users//nikita//Desktop//" + New + ".mp3");
file.renameTo(file2);
I feel like I should make a list of all possible characters, then run the String through this list and erase all of the listed characters, but I am not sure how to do it.
String test = "$%$#Arti56st_-_54^So65ng.mp3";
Edit 1:
When I try using the replace method, it still doesn't change the name.
String test = "$%$#Arti56st_-_54^So65ng.mp3";
System.out.println("Original: " + test);
test.replace( "[0-9]%#&\\$", "");
System.out.println("New: " + test);
The code above returns the following output
Original: $%$#Arti56st_-_54^So65ng.mp3
New: $%$#Arti56st_-_54^So65ng.mp3
I'd suggest something like this:
public static String santizeFilename(String original) {
    Pattern p = Pattern.compile("(.*)-(.*)\\.mp3");
    Matcher m = p.matcher(original);
    if (m.matches()) {
        String artist = m.group(1).replaceAll("[^a-zA-Z ]", "");
        String song = m.group(2).replaceAll("[^a-zA-Z ]", "");
        return String.format("%s - %s", artist, song);
    }
    else {
        throw new IllegalArgumentException("Failed to match filename : " + original);
    }
}
(Edit - changed whitelist regex to exclude digits and underscores)
Two points in particular: when sanitizing strings, it's a good idea to whitelist permitted characters rather than blacklisting the ones you want to exclude, so you won't be surprised by edge cases later. (You may want a less restrictive whitelist than I've used here, but it's easy to vary.)
It's also a good idea to handle the case where the filename doesn't match the expected pattern. If your code comes across something other than an MP3, how would you like it to respond? Here I've thrown an exception, so the calling code can catch and handle that appropriately.
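String cleaned = original.replaceAll("[0-9%#&$]", "");
This should remove almost all the characters you don't want. Note that replace treats its argument as literal text, so a regex needs replaceAll, and that Strings are immutable, so the result has to be assigned. Apply it to the name without the .mp3 extension, because [0-9] would also strip the 3.
Or you can come up with your own regex: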
https://docs.oracle.com/javase/tutorial/essential/regex/
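To make the difference between replace and replaceAll concrete, here is a small sketch using the test string from the question. The class CleanName and the method clean are hypothetical names, and the list of unwanted characters is just the one that happens to occur in the example.

```java
class CleanName {
    // Hypothetical helper: cleans the part before the extension and leaves
    // the extension alone, so the digit in ".mp3" is not removed.
    static String clean(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String base = dot == -1 ? fileName : fileName.substring(0, dot);
        String ext = dot == -1 ? "" : fileName.substring(dot);
        // replaceAll takes a regex: one character class listing the unwanted characters
        String cleaned = base.replaceAll("[0-9$%#&^]", "");
        // replace takes literal text: turn "_-_" into " - "
        return cleaned.replace("_-_", " - ") + ext;
    }

    public static void main(String[] args) {
        String test = "$%$#Arti56st_-_54^So65ng.mp3";
        // No effect: the literal text "[0-9]%#&\$" does not occur in the name,
        // and the returned String is discarded anyway.
        test.replace("[0-9]%#&\\$", "");
        System.out.println("Original: " + test);
        System.out.println("New: " + clean(test)); // prints "New: Artist - Song.mp3"
    }
}
```

A blacklist like this only covers the characters it names; the whitelist approach shown earlier is the safer choice when the input is not known in advance.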

Dynamically replace part in URL using Regex

I tried searching for something similar, and couldn't find anything. I'm having difficulty trying to replace a few characters after a specific part in a URL.
Here is the URL: https://scontent-b.xx.fbcdn.net/hphotos-xpf1/v/t1.0-9/s130x130/10390064_10152552351881633_355852593677844144_n.jpg?oh=479fa99a88adea07f6660e1c23724e42&oe=5519DE4B
I want to remove the /v/ part, leave the t1.0-9, and also remove the /s130x130/. I cannot just replace s130x130, because those values may differ. How do I go about doing that?
I have a previous URL where I am using this code:
if (pictureUri.indexOf("&url=") != -1)
{
    String replacement = "";
    String url = pictureUri.replaceAll("&", "/");
    String result = url.replaceAll("().*?(/url=)",
            "$1" + replacement + "$2");
    String pictureUrl = null;
    if (result.startsWith("/url="))
    {
        pictureUrl = result.replace("/url=", "");
    }
}
Can I do something similar with the above URL?
With the regex
/v/|/s\d+x\d+/
replaced with
/
It turns the string from
https://scontent-b.xx.fbcdn.net/hphotos-xpf1/v/t1.0-9/s130x130/10390064_10152552351881633_355852593677844144_n.jpg?oh=479fa99a88adea07f6660e1c23724e42&oe=5519DE4B
to
https://scontent-b.xx.fbcdn.net/hphotos-xpf1/t1.0-9/10390064_10152552351881633_355852593677844144_n.jpg?oh=479fa99a88adea07f6660e1c23724e42&oe=5519DE4B
Is this what you're trying to do?
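In Java, that regex can be applied with String.replaceAll. A minimal sketch follows; the class name FbUrlCleaner and the method clean are made up for illustration.

```java
class FbUrlCleaner {
    // Replaces "/v/" and any "/s<width>x<height>/" segment with a single "/".
    static String clean(String url) {
        return url.replaceAll("/v/|/s\\d+x\\d+/", "/");
    }

    public static void main(String[] args) {
        String url = "https://scontent-b.xx.fbcdn.net/hphotos-xpf1/v/t1.0-9/s130x130/"
                + "10390064_10152552351881633_355852593677844144_n.jpg"
                + "?oh=479fa99a88adea07f6660e1c23724e42&oe=5519DE4B";
        System.out.println(clean(url));
    }
}
```

Because the size segment is matched as s, digits, x, digits, other dimensions such as s320x320 are handled the same way.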

If a string contains a letter, return the entire String

Weird one but:
Let's say you have a huge HTML page, and if the page contains an email address (looking for an @ sign) you want to return that email.
So far I know I need something like this:
String email;
if (myString.contains("@")) {
    email = myString.substring("@")
}
I know how to get to the @, but how do I go back in the string to find what's before it, etc.?
If myString is the email string you received from the HTML page, then
you can return the same string if it contains @, right? Something like below:
String email;
if (myString.contains("@")) {
    email = myString;
}
What's the challenge here? Can you explain it, if there is one?
This method will give you a list of all the email addresses contained in a string.
static ArrayList<String> getEmailAdresses(String str) {
    ArrayList<String> result = new ArrayList<>();
    Matcher m = Pattern.compile("\\S+?@[^. ]+(\\.[^. ]+)*").matcher(str.replaceAll("\\s", " "));
    while (m.find()) {
        result.add(m.group());
    }
    return result;
}
String email;
if (myString.contains("@")) {
    // Locate the @
    int atLocation = myString.indexOf("@");
    // Get the string before the @
    String start = myString.substring(0, atLocation);
    // Keep only what follows the last space before the @
    start = start.substring(start.lastIndexOf(" ") + 1);
    // Get the string after the @
    String end = myString.substring(atLocation + 1);
    // Cut at the first space after the @, if there is one
    int space = end.indexOf(" ");
    if (space != -1) {
        end = end.substring(0, space);
    }
    // Stick it all together
    email = start + "@" + end;
}
This may be a little off as I've been writing javascript all day. :)
Rather than exact code, I would like to give you an approach.
Checking only for the @ symbol might not be appropriate, as it can occur in other contexts as well.
Search the internet for a regex pattern that matches an email address, or create your own
(if you want, you can add a check for email providers as well). Here is a link: http://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/
Use the regex to find the position of the match in the string and extract the substring (the email, in your case).
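The approach described above could be sketched roughly as follows. It assumes the sign in question is the usual @ of an email address; the class EmailFinder, the method findEmails and the pattern itself are illustrative assumptions, and the pattern is deliberately simple rather than a complete email grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailFinder {
    // Simplified pattern (an assumption): local part, @, domain, dot, top-level domain.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Returns every substring of the text that looks like an email address.
    static List<String> findEmails(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            // m.start() and m.end() give the position of the match if it is needed
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p>Contact <b>jane.doe@example.com</b> or bob@test.org.</p>";
        System.out.println(findEmails(html));
    }
}
```

Matching on a pattern instead of the bare @ avoids false hits such as a lone "@" in running text, at the cost of missing unusual but valid addresses.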

LucidWorks: Java Regular Expressions & GNU Regular Expressions

I am trying to create regular expressions so that I can crawl and index certain URLs on my web site with LucidWorks.
Example URL: http://www.example.com/reviews/assassins-creed-revelations/24475/reviews/
Example URL: http://www.example.com/reviews/super-mario-3d-land/64303/reviews/
Basically, I want LucidWorks to search my entire site and index only URLs that have /reviews/ at the end.
Could anyone help me construct an expression to do that please? :)
Updated:
URL: http://www.example.com/
Include paths: //*/reviews/*
That kind of worked, but it only crawls the first page; it won't go to the next page with more reviews (1, 2, 3 etc.).
If I also add: ///reviews/.*
I get a load of unwanted pages indexed, such as http://www.example.com/?page=2
Check with this function
public boolean canAcceptURL(String url, String endsWith) {
    boolean canAccept = false;
    String regex = "";
    try {
        // If the end string is empty, fall back to the default value.
        if (endsWith.equals("")) {
            endsWith = "/reviews/";
        }
        // Check that the URL you passed ends with the given end string.
        regex = "[\\x20-\\x7E]*" + endsWith + "$";
        canAccept = url.matches(regex);
    } catch (PatternSyntaxException pe) {
        pe.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
    System.out.println("String matches : " + canAccept);
    return canAccept;
}
Sample output:
Calling function: canAcceptURL("http://www.example.com/reviews/super-mario-3d-land/64303/reviews/","/reviews/");
String matches : true
If you want to accept URLs that contain '/reviews/' anywhere, just change the regex string to
String regex = "[\\x20-\\x7E]*/reviews/[\\x20-\\x7E]*"; // this will accept a string with white space and special characters.
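One thing to watch when concatenating the end string into a regex is that characters such as . or ? in it would be interpreted as regex syntax. A possible variant, sketched here under that assumption, wraps the suffix in Pattern.quote. The class ReviewUrlFilter and the method endsWithPath are hypothetical names, and whether LucidWorks accepts the same expression in its include paths is not something this sketch can confirm.

```java
import java.util.regex.Pattern;

class ReviewUrlFilter {
    // True if the URL ends with the given suffix, treated as literal text.
    static boolean endsWithPath(String url, String suffix) {
        // Pattern.quote turns the suffix into a literal, so "." or "?" match only themselves
        Pattern p = Pattern.compile(".*" + Pattern.quote(suffix) + "$");
        return p.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(endsWithPath(
                "http://www.example.com/reviews/super-mario-3d-land/64303/reviews/", "/reviews/"));
        System.out.println(endsWithPath("http://www.example.com/?page=2", "/reviews/"));
    }
}
```

For a plain suffix check without any regex features, String.endsWith would give the same answer; the regex form is only useful when the expression has to be handed to a tool that expects a pattern.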
