How to split URL? - java

This is my code for extracting links from a web page, but it has a problem: every link comes out with part of its path doubled, for example www.utem.edu.my/portal/portal. The /portal/portal duplication shows up in every link. Any suggestions on how to extract the links from the page correctly?
public String crawlURL(String strUrl) {
String results = ""; // For return
String protocol = "http://";
// Assigns the input to the inURL variable and checks to add http
String inURL = strUrl;
if (!inURL.toLowerCase().contains("http://".toLowerCase()) &&
!inURL.toLowerCase().contains("https://".toLowerCase())) {
inURL = protocol + inURL;
}
// Pulls URL contents from the web
String contectURL = pullURL(inURL);
if ("".equals(contectURL)) { // If it fails, then try with https (compare content, not references)
protocol = "https://";
inURL = protocol + inURL.split("http://")[1];
contectURL = pullURL(inURL);
}
// Declares some variables to be used inside loop
String aTagAttr = "";
String href = "";
String msg = "";
// Finds A tag and stores its href value into output var
String bodyTag = contectURL.split("<body")[1]; // Find 1st <body>
String[] aTags = bodyTag.split(">"); // Splits on every tag
//To show link different from one another
int index = 0;
for (String s: aTags) {
// Process only if it is A tag and contains href
if (s.toLowerCase().contains("<a") && s.toLowerCase().contains("href")) {
aTagAttr = s.split("href")[1]; // Split on href
// Split on space if it contains it
if (aTagAttr.toLowerCase().contains("\\s"))
aTagAttr = aTagAttr.split("\\s")[2];
// Splits on the link and deals with " or ' quotes
href = aTagAttr.split(((aTagAttr.toLowerCase().contains("\""))? "\"" : "\'"))[1];
if (!results.toLowerCase().contains(href))
//results += "~~~ " + href + "\r\n";
/*
* Last touches to URl before display
* Adds http(s):// if not exist
* Adds base url if not exist
*/
if(results.toLowerCase().indexOf("about") != -1) {
//Contains 'about'
}
if (!href.toLowerCase().contains("http://") && !href.toLowerCase().contains("https://")) {
// http:// + baseURL + href
if (!href.toLowerCase().contains(inURL.split("://")[1]))
href = protocol + inURL.split("://")[1] + href;
else
href = protocol + href;
}
System.out.println(href);//debug

Consider using the java.net.URL class, as suggested by the documentation:
public static void main(String[] args) throws Exception {
URL aURL = new URL("http://example.com:80/docs/books/tutorial"
+ "/index.html?name=networking#DOWNLOADING");
System.out.println("protocol = " + aURL.getProtocol());
System.out.println("authority = " + aURL.getAuthority());
System.out.println("host = " + aURL.getHost());
System.out.println("port = " + aURL.getPort());
System.out.println("path = " + aURL.getPath());
System.out.println("query = " + aURL.getQuery());
System.out.println("filename = " + aURL.getFile());
System.out.println("ref = " + aURL.getRef());
}
}
The output:
protocol = http
authority = example.com:80
host = example.com
port = 80
etc
After this, you can take the components you need and build a new string or URL from them.
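For the duplicated /portal/portal in the question, one option (not part of the answer above, so treat it as a sketch) is to let java.net.URI resolve each href against the page URL instead of concatenating strings. The class name ResolveLinks and the sample hrefs are made up for illustration; only the utem.edu.my host comes from the question.

```java
import java.net.URI;

class ResolveLinks {
    // Resolves an href found on a page against that page's URL.
    // Absolute hrefs are returned unchanged; relative ones are merged
    // with the base according to the URI resolution rules.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.utem.edu.my/portal/"; // assumed page URL
        // Root-relative href: replaces the whole path, so /portal is not doubled
        System.out.println(resolve(page, "/portal/index.html"));
        // Path-relative href: appended to the base directory
        System.out.println(resolve(page, "news.html"));
        // Absolute href: left as it is
        System.out.println(resolve(page, "http://example.com/a"));
    }
}
```

Note that URI.create throws IllegalArgumentException for hrefs that are not valid URIs (for example, ones containing spaces), so real page content may need cleaning or a try/catch first.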

Related

I can't scrape the google search result with Chinese character keywords

I can't search with Chinese keywords here; English keywords work fine.
String search = "大學";
I tried both UTF-8 and Big5 as the charset, but neither works.
Here is my code:
public static void main(String[] args) throws UnsupportedEncodingException, IOException {
String[] line = new String[100];
final int[] score = { 0};
String google = "http://www.google.com/search?q=";
String search = "大學";
String charset = "UTF-8"; // UTF-8 does not work either
String news="&tbm=nws";
String string = google + URLEncoder.encode(search , charset) + news+"&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2016%2Ccd_max%3A12%2F31%2F2016";
String userAgent ="Chrome/57.0.2987.133";
int numberOfResultpages = 10; // number of result pages to fetch
int idx = 0;
for (int i = 0; i < numberOfResultpages; i++) {
Document document = Jsoup.connect(string).userAgent(userAgent) .data("start",""+i).get();
Elements links = document.select( ".r>a");
for (Element link : links) {
String title = link.text();
String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
if (!url.startsWith("http")) {
continue; // Ads/news/etc.
}
System.out.println("Title: " + title);
System.out.println("URL: " + url);
line[idx++]=title;
// }
}
}
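One thing that may be worth ruling out (it is only a guess at a contributing factor, not a confirmed cause) is the encoding of the Java source file itself: if the compiler reads the file with a different charset than the one it was saved in, the literal "大學" is already garbled before URLEncoder sees it. The sketch below writes the keyword as Unicode escapes so that the encoded query value can be checked independently of the file encoding. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class EncodeKeyword {
    // Percent-encodes a search keyword as UTF-8 for use in a query string.
    static String encodeUtf8(String keyword) {
        try {
            return URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The keyword from the question, written as Unicode escapes so the
        // result does not depend on how the source file is encoded.
        String keyword = "\u5927\u5B78";
        System.out.println(encodeUtf8(keyword)); // prints %E5%A4%A7%E5%AD%B8
    }
}
```

If the program prints something other than %E5%A4%A7%E5%AD%B8 when the literal "大學" is used instead of the escapes, the source file encoding is the likely culprit.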

Can't get text with Selenium

I have a problem with Selenium WebDriver in Java. When I use this code (without using element.click();) it works:
public static void main(String[] args) {
try {
File salida= new File("salidas/Salida.txt");
FileWriter fw = new FileWriter(salida);
PrintWriter volcado = new PrintWriter(fw);
System.setProperty("webdriver.chrome.driver", "path to\\chromedriver.exe");
WebDriver driver = new ChromeDriver();
driver.get("http://ranking-empresas.eleconomista.es/REPSOL-PETROLEO.html");
String name = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[1]/td[2]")).getText();
String obj_soc = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[2]/td[2]")).getText();
String direcc = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[3]/td[2]")).getText();
String loc = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[4]/td[2]")).getText();
String tel = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[5]/td[2]")).getText();
String url = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[8]/td[2]")).getText();
String actividad = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[9]/td[2]")).getText();
volcado.print(name + " " + obj_soc + " " + direcc + " " + loc + " " + tel + " " + url + " " + actividad);
volcado.close();
driver.close();
}
catch(Exception e) {
e.printStackTrace();
}
}
But the problem appears when I reach the same page from the previous page with element.click(), like this:
System.setProperty("webdriver.chrome.driver", "path to\\chromedriver.exe");
WebDriver driver = new ChromeDriver();
driver.get("http://ranking-empresas.eleconomista.es/ranking_empresas_nacional.html");
WebElement element = driver.findElement(By.xpath("//*[@id=\"tabla-ranking\"]/table/tbody/tr[1]/td[7]/a"));
element.click();
String name = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[1]/td[2]")).getText();
String obj_soc = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[2]/td[2]")).getText();
String direcc = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[3]/td[2]")).getText();
String loc = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[4]/td[2]")).getText();
String tel = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[5]/td[2]")).getText();
String url = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[8]/td[2]")).getText();
String actividad = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[9]/td[2]")).getText();
volcado.print(name+" "+obj_soc+" "+direcc+" "+loc+" "+tel+" "+url+" "+actividad);
volcado.close();
driver.close();
}
catch(Exception e){
e.printStackTrace();
}}
Selenium opens the browser and the pages, but my variables don’t get the text of the XPath expression.
The data is not yet present on the page at the time you are trying to get the text. Wait for the data before reading it, and it should be fine:
WebDriver driver = new ChromeDriver();
WebDriverWait wait = new WebDriverWait(driver, 10);
driver.get("http://ranking-empresas.eleconomista.es/ranking_empresas_nacional.html");
driver.findElement(By.xpath("//*[@id=\"tabla-ranking\"]/table/tbody/tr[1]/td[7]/a")).click();
// Wait for the data to be present
wait.until(ExpectedConditions.visibilityOfElementLocated(By.id("business-profile")));
String name = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[1]/td[2]")).getText();
String obj_soc = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[2]/td[2]")).getText();
String direcc = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[3]/td[2]")).getText();
String loc = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[4]/td[2]")).getText();
String tel = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[5]/td[2]")).getText();
String url = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[8]/td[2]")).getText();
String actividad = driver.findElement(By.xpath("//*[@id=\"business-profile\"]/div[17]/div[1]/div[2]/table/tbody/tr[9]/td[2]")).getText();
volcado.print(name + " " + obj_soc + " " + direcc + " " + loc + " " + tel + " " + url + " "+actividad);
volcado.close();
However, a much cleaner alternative would be to get all the cells with a single XPath expression:
driver.get("http://ranking-empresas.eleconomista.es/ranking_empresas_nacional.html");
driver.findElement(By.xpath("//*[@id=\"tabla-ranking\"]/table/tbody/tr[1]/td[7]/a")).click();
// Wait for the data to be present
List<WebElement> cells = wait.until(
ExpectedConditions.presenceOfAllElementsLocatedBy(
By.xpath("//h3[.='Datos comerciales de REPSOL PETROLEO SA']/following::tbody[1]/tr/td[2]")));
String name = cells.get(0).getText();
String obj_soc = cells.get(1).getText();
String direcc = cells.get(2).getText();
String loc = cells.get(3).getText();
String tel = cells.get(4).getText();
String url = cells.get(7).getText();
String actividad = cells.get(8).getText();
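The explicit wait used in this answer boils down to polling a condition until it yields a value or a timeout expires. The sketch below illustrates that idea with the standard library only; it is not Selenium's implementation, and PollingWait, waitUntil and the simulated "page load" are hypothetical names made up for this example.

```java
import java.util.function.Supplier;

class PollingWait {
    // Calls the supplier repeatedly until it returns a non-null value,
    // sleeping pollMillis between attempts. Gives up after timeoutMillis.
    static <T> T waitUntil(Supplier<T> condition, long timeoutMillis, long pollMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            T value = condition.get();
            if (value != null) {
                return value;
            }
            if (System.currentTimeMillis() >= deadline) {
                throw new IllegalStateException(
                        "Condition not met within " + timeoutMillis + " ms");
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        // Simulates content that only becomes available 200 ms after the "click"
        long readyAt = System.currentTimeMillis() + 200;
        String text = waitUntil(
                () -> System.currentTimeMillis() >= readyAt ? "REPSOL PETROLEO SA" : null,
                2000, 50);
        System.out.println(text);
    }
}
```

Reading the text immediately after click(), as in the question, corresponds to calling the supplier once without any retry, which is why the lookup fails while the new page is still loading.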

return array from inside an if,for statement

I am building a tag reader for inventory purposes. I use a for loop to iterate through the tags and count the IDs. On my return line I get the error "tagsFound cannot be resolved to a variable". How do I use the variable inside the for loop and then access it outside the loop?
public String[] getTags(AlienClass1Reader reader)throws AlienReaderException{
int coneCount = 0;
int drumCount = 0;
// Open a connection to the reader
reader.open();
// Ask the reader to read tags and print them
Tag tagList[] = reader.getTagList();
if (tagList == null) {
System.out.println("No Tags Found");
} else {
System.out.println("Tag(s) found: " + tagList.length);
for (int i=0; i<tagList.length; i++) {
Tag tag = tagList[i];
System.out.println("ID:" + tag.getTagID() +
", Discovered:" + tag.getDiscoverTime() +
", Last Seen:" + tag.getRenewTime() +
", Antenna:" + tag.getAntenna() +
", Reads:" + tag.getRenewCount()
);
//tagFound[i]= "" + tag.getTagID();
String phrase = tag.getTagID();
tagFound[i] = phrase;
String delims = "[ ]+";
String[] tokens = phrase.split(delims);
if (tokens[0].equals("0CCE") && tokens[3].equals("1001")){drumCount++;}
if (tokens[0].equals("0CCE") && tokens[3].equals("1004")){coneCount++;}
String[] tagsFound;
tagsFound[i] = tag.getTagID();
}
System.out.println("Cones= " + coneCount);
System.out.println("Drums= " + drumCount);
// Close the connection
reader.close();
return tagsFound;
}
}
Declare tagsFound before the loop so that it is still in scope at the return statement:
public String[] getTags(AlienClass1Reader reader) throws AlienReaderException {
int coneCount = 0;
int drumCount = 0;
// Declared at method level so that every branch can return it
String[] tagsFound = new String[0];
// Open a connection to the reader
reader.open();
// Ask the reader to read tags and print them
Tag tagList[] = reader.getTagList();
if (tagList == null) {
System.out.println("No Tags Found");
} else {
System.out.println("Tag(s) found: " + tagList.length);
tagsFound = new String[tagList.length];
for (int i=0; i<tagList.length; i++) {
tagsFound[i] = ""; // default value for this position
Tag tag = tagList[i];
System.out.println("ID:" + tag.getTagID() +
", Discovered:" + tag.getDiscoverTime() +
", Last Seen:" + tag.getRenewTime() +
", Antenna:" + tag.getAntenna() +
", Reads:" + tag.getRenewCount()
);
String phrase = tag.getTagID();
String delims = "[ ]+";
String[] tokens = phrase.split(delims);
if (tokens[0].equals("0CCE") && tokens[3].equals("1001")){drumCount++;}
if (tokens[0].equals("0CCE") && tokens[3].equals("1004")){coneCount++;}
tagsFound[i] = phrase;
}
System.out.println("Cones= " + coneCount);
System.out.println("Drums= " + drumCount);
}
// Close the connection and return on every path, including "No Tags Found"
reader.close();
return tagsFound;
}
If tagsFound[i] is assigned only when a tag satisfies the criteria, the returned array will have empty strings in the positions where the tag does not.
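If empty placeholders in the result are not wanted, a variation is to collect only the matching IDs in a list that is declared before the loop and convert it to an array at the end. This is a sketch under assumptions: it works on plain strings rather than the reader's Tag objects, the sample IDs are invented, and it treats a tag as matching when it would have counted as a drum or a cone in the question.

```java
import java.util.ArrayList;
import java.util.List;

class TagFilter {
    // Returns only the IDs whose first token is 0CCE and whose fourth
    // token is 1001 or 1004 (the drum and cone codes from the question).
    static String[] matchingTags(String[] tagIds) {
        // Declared before the loop, so it is still in scope at the return
        List<String> found = new ArrayList<>();
        if (tagIds != null) {
            for (String id : tagIds) {
                String[] tokens = id.trim().split("[ ]+");
                // Length check avoids an out-of-bounds access on short IDs
                if (tokens.length > 3 && tokens[0].equals("0CCE")
                        && (tokens[3].equals("1001") || tokens[3].equals("1004"))) {
                    found.add(id);
                }
            }
        }
        return found.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Hypothetical tag IDs in the space-separated format the question implies
        String[] ids = {
            "0CCE 0000 0000 1001",
            "AAAA 0000 0000 1001",
            "0CCE 0000 0000 1004"
        };
        for (String id : matchingTags(ids)) {
            System.out.println(id);
        }
    }
}
```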

java.lang.ArrayIndexOutOfBoundsException :

I have the string "abc model 123 abcd1862893007509396 abcd2862893007509404". If I put a space between abcd1 and the number (e.g. abcd1 862893007509396), my code works fine, but without the space (abcd1862893007509396) I get java.lang.ArrayIndexOutOfBoundsException. Please help.
Here is the code:
String text = "";
final String suppliedKeyword = "abc model 123 abcd1862893007509396 abcd2862893007509404";
String[] keywordarray = null;
String[] keywordarray2 = null;
String modelname = "";
String[] strIMEI = null;
if ( StringUtils.containsIgnoreCase( suppliedKeyword,"model")) {
keywordarray = suppliedKeyword.split("(?i)model");
if (StringUtils.containsIgnoreCase(keywordarray[1], "abcd")) {
keywordarray2 = keywordarray[1].split("(?i)abcd");
modelname = keywordarray2[0].trim();
if (keywordarray[1].trim().contains(" ")) {
strIMEI = keywordarray[1].split(" ");
for (int i = 0; i < strIMEI.length; i++) {
if (StringUtils.containsIgnoreCase(strIMEI[i],"abcd")) {
text = text + " " + strIMEI[i] + " "
+ strIMEI[i + 1];
System.out.println(text);
}
}
} else {
text = keywordarray2[1];
}
}
}
Looking at your code, the only likely cause of the error I can see is:
if (StringUtils.containsIgnoreCase(strIMEI[i],"abcd")) {
text = text + " " + strIMEI[i] + " "
+ strIMEI[i + 1];
System.out.println(text);
}
You are accessing strIMEI[i + 1], which throws an ArrayIndexOutOfBoundsException when the last element of strIMEI contains "abcd".
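One way to sidestep the i + 1 access altogether is to match each key and its number with a regular expression, which handles both the spaced and the unspaced form. This is only a sketch: it assumes that a key is always "abcd" followed by exactly one digit and that the number is whatever run of digits comes next, which the question suggests but does not state. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImeiExtractor {
    // Group 1: the key (abcd plus one digit, case-insensitive).
    // Group 2: the digits that follow, with or without whitespace in between.
    private static final Pattern ENTRY = Pattern.compile("(?i)(abcd\\d)\\s*(\\d+)");

    static List<String> extract(String keyword) {
        List<String> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(keyword);
        while (m.find()) {
            entries.add(m.group(1) + " " + m.group(2));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Both inputs yield [abcd1 862893007509396, abcd2 862893007509404]
        System.out.println(extract("abc model 123 abcd1862893007509396 abcd2862893007509404"));
        System.out.println(extract("abc model 123 abcd1 862893007509396 abcd2 862893007509404"));
    }
}
```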

How to manipulate a List into Strings and print each on a separate row

I am trying to export four columns with the code below. The last column, Organization, is a List.
String appname = "abc";
String path = "//home/exportfile//";
String filename = path + "ApplicationExport-" + appname + ".txt";
String ret = "false";
QueryOptions ops = new QueryOptions();
Filter[] filters = new Filter[1];
filters[0] = Filter.eq("application.name", appname);
ops.add(filters);
List props = new ArrayList();
props.add("identity.name");
// Do search
Iterator it = context.search(Link.class, ops, props);
// Build file and export header row
BufferedWriter out = new BufferedWriter(new FileWriter(filename));
out.write("IdentityName,UserName,WorkforceID,Organization");
out.newLine();
// Iterate Search Results
if (it != null) {
while (it.hasNext()) {
// Get link and create object
Object[] record = it.next();
String identityName = (String) record[0];
Identity user = (Identity) context.getObject(Identity.class, identityName);
// Get Identity attributes for export
String workforceid = (String) user.getAttribute("workforceID");
// Get application attributes for export
String userid = "";
List links = user.getLinks();
if (links != null) {
Iterator lit = links.iterator();
while (lit.hasNext()) {
Link l = lit.next();
String lname = l.getApplicationName();
if (lname.equalsIgnoreCase(appname)) {
userid = (String) l.getAttribute("User Name");
List organizations = l.getAttribute("Organization");
StringBuilder sb = new StringBuilder();
String listItemsSeparator = ",";
for (Object organization : organizations) {
sb.append(organization.toString());
sb.append(listItemsSeparator);
}
org = sb.toString().trim();
}
}
}
// Output file
out.write(identityName + "," + userid + "," + workforceid + "," + org);
out.newLine();
out.flush();
}
ret = "true";
}
// Close file and return
out.close();
return ret;
The output of the above code looks like this, for example:
IdentityName,UserName,WorkforceID,Organization
dthomas,dthomas001,12345,Finance,HR
How do I get the output in the following form instead?
IdentityName,UserName,WorkforceID,Organization
dthomas,dthomas001,12345,Finance
dthomas,dthomas001,12345,HR
What do I need to change in the code, and where?
You'll have to write one line to the file for each organization. So, basically, do not concatenate all organizations for a user with the string builder and move the output statements into the for loop that iterates through the organizations.
But it's difficult to provide a working example, because the code you've shown doesn't compile yet.
This should bring you somewhat closer to the solution:
if (links != null) {
Iterator lit = links.iterator();
while (lit.hasNext()) {
Link l = lit.next();
String lname = l.getApplicationName();
if (lname.equalsIgnoreCase(appname)) {
userid = (String) l.getAttribute("User Name");
List organizations = l.getAttribute("Organization");
for (Object organization : organizations) {
// Output file
out.write(identityName + "," + userid + "," + workforceid + "," + organization);
out.newLine();
out.flush();
}
}
}
}
Remove this innermost for block and associated variables:
StringBuilder sb = new StringBuilder();
for (Object organization : organizations)
{
sb.append(organization.toString());
sb.append(listItemsSeparator);
}
org = sb.toString().trim();
Move the declaration of organizations outside the if (links != null) { block:
// Get application attributes for export
String userid = "";
List organizations = null;
List links = user.getLinks();
if (links != null)
{
Iterator lit = links.iterator();
while (lit.hasNext())
{
Link l = lit.next();
String lname = l.getApplicationName();
if (lname.equalsIgnoreCase(appname))
{
userid = (String) l.getAttribute("User Name");
organizations = l.getAttribute("Organization");
And then change this file output code:
// Output file
out.write(identityName + "," + userid + "," + workforceid + "," + org);
out.newLine();
To this:
// Output file
for (Object organization : organizations)
{
out.write(identityName + "," + userid + "," + workforceid + "," + organization.toString());
out.newLine();
}
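The idea shared by both answers, one output row per organization instead of one concatenated cell, can be tried in isolation from the identity API. The sketch below uses only the standard library; OrgRows and rowsFor are hypothetical names, and emitting a row with an empty last column when there are no organizations is an assumption about the desired behavior, not something the question specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OrgRows {
    // Builds one CSV row per organization, repeating the first three columns.
    static List<String> rowsFor(String identityName, String userid,
                                String workforceid, List<?> organizations) {
        List<String> rows = new ArrayList<>();
        String prefix = identityName + "," + userid + "," + workforceid + ",";
        if (organizations == null || organizations.isEmpty()) {
            // Assumption: keep the user in the export with an empty last column
            rows.add(prefix);
            return rows;
        }
        for (Object organization : organizations) {
            rows.add(prefix + organization);
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println("IdentityName,UserName,WorkforceID,Organization");
        // Sample values taken from the question
        for (String row : rowsFor("dthomas", "dthomas001", "12345",
                Arrays.asList("Finance", "HR"))) {
            System.out.println(row);
        }
    }
}
```

The null check also matters for the loop over organizations in the answer above: if no link matches the application name, organizations stays null and iterating over it would throw a NullPointerException.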
