I have this code
public void descargarURL() {
try{
URL url = new URL("https://www.amazon.es/MSI-Titan-GT73EVR-7RD-1027XES-Ordenador/dp/B078ZYX4R5/ref=sr_1_1?ie=UTF8&qid=1524239679&sr=8-1");
BufferedReader lectura = new BufferedReader(new InputStreamReader(url.openStream()));
File archivo = new File("descarga2.txt");
BufferedWriter escritura = new BufferedWriter(new FileWriter(archivo));
BufferedWriter ficheroNuevo = new BufferedWriter(new FileWriter("nuevoFichero.txt"));
String texto;
while ((texto = lectura.readLine()) != null) {
escritura.write(texto);
}
lectura.close();
escritura.close();
ficheroNuevo.close();
System.out.println("Archivo creado!");
//}
}
catch(Exception ex) {
ex.printStackTrace();
}
}
public static void main(String[] args) throws FileNotFoundException, IOException {
Paginaweb2 pg = new Paginaweb2();
pg.descargarURL();
}
}
I want to remove the reference part of the URL, B078ZYX4R5, together with its / separator.
The HTML saved to the text file contains the fragment *"<div id =" cerberus-data-metrics "style =" display: none; "data-asin =" B078ZYX4R5 "data-as-price = "1479.00" data-asin-shipping = "0" data-asin-currency-code = "EUR" data-substitute-count = "0" data-device-type = "WEB" data-display-code = "Asin is not eligible because it has a retail offer "> </ div>"*, and I only want to get the price, 1479.00, which is the value of the "data-as-price" attribute.
I don't want to use external libraries; I know it can be done with split, indexOf, and substring.
Thanks!!!!
You could solve both tasks with regular expressions. For the second task (extracting the price from the HTML), however, you could use jsoup, which is much better suited to extracting content from HTML.
Here are some possible solutions based on regular expressions for your tasks:
1. Change URL
private static String modifyUrl(String str) {
return str.replaceFirst("/[^/]+(?=/ref)", "");
}
This is a plain replacement based on a regular expression with a positive look-ahead, (?=/ref), so the /ref part itself is not removed (see https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html)
2. Extract Price
private static Optional<String> extractPrice(String html) {
Pattern pat = Pattern.compile("data-as-price\\s*=\\s*[\"'](?<price>.+?)[\"']", Pattern.MULTILINE);
Matcher m = pat.matcher(html);
if(m.find()) {
String price = m.group("price");
return Optional.of(price);
}
return Optional.empty();
}
Here you can also use a regular expression (data-as-price\s*=\s*["'](?<price>.+?)["']) to locate the price. With a named group ((?<price>.+?)) you can then extract the price.
I am returning an Optional here so that you can deal with the case that the price was not found.
Here is a simple test case for the two methods:
public static void main(String[] args) throws IOException {
String str = "https://www.amazon.es/MSI-Titan-GT73EVR-7RD-1027XES-Ordenador/dp/B078ZYX4R5/ref=sr_1_1?ie=UTF8&qid=1524239679&sr=8-1";
System.out.println(modifyUrl(str));
String html = "<div id =\" cerberus-data-metrics \"style =\" display: none; \"data-asin =\" B078ZYX4R5 \"data-as-price = \"1479.00\" data-asin-shipping = \"0\" data-asin-currency-code = \"EUR\" data-substitute-count = \"0\" data-device-type = \"WEB\" data-display-code = \"Asin is not eligible because it has a retail offer \"> </ div>";
extractPrice(html).ifPresent(System.out::println);
}
If you run this simple test case you will see on the console this output:
https://www.amazon.es/MSI-Titan-GT73EVR-7RD-1027XES-Ordenador/dp/ref=sr_1_1?ie=UTF8&qid=1524239679&sr=8-1
1479.00
Update
If you want to extract the reference from the URL, you can do it with code similar to the code used to extract the price. Here is a method which extracts a specific named group from a pattern:
private static Optional<String> extractNamedGroup(String str, Pattern pat, String reference) {
Matcher m = pat.matcher(str);
if (m.find()) {
return Optional.of(m.group(reference));
}
return Optional.empty();
}
Then you can use this method for extracting the reference and price:
private static Optional<String> extractReference(String str) {
Pattern pat = Pattern.compile("/(?<reference>[^/]+)(?=/ref)");
return extractNamedGroup(str, pat, "reference");
}
private static Optional<String> extractPrice(String html) {
Pattern pat = Pattern.compile("data-as-price\\s*=\\s*[\"'](?<price>.+?)[\"']", Pattern.MULTILINE);
return extractNamedGroup(html, pat, "price");
}
You can test the above methods with:
public static void main(String[] args) throws IOException {
String str = "https://www.amazon.es/MSI-Titan-GT73EVR-7RD-1027XES-Ordenador/dp/B078ZYX4R5/ref=sr_1_1?ie=UTF8&qid=1524239679&sr=8-1";
extractReference(str).ifPresent(System.out::println);
String html = "<div id =\" cerberus-data-metrics \"style =\" display: none; \"data-asin =\" B078ZYX4R5 \"data-as-price = \"1479.00\" data-asin-shipping = \"0\" data-asin-currency-code = \"EUR\" data-substitute-count = \"0\" data-device-type = \"WEB\" data-display-code = \"Asin is not eligible because it has a retail offer \"> </ div>";
extractPrice(html).ifPresent(System.out::println);
}
This will print:
B078ZYX4R5
1479.00
Update 2: Using URL
If you want to use the java.net.URL class to help you narrow down the search scope you can do it. But you cannot use this class to do the full extraction.
Since the token you want to extract is in the URL path you can extract the path and then apply the regular expression explained above to do the extraction.
Here is the sample code you can use to narrow down the search scope:
public static void main(String[] args) throws IOException {
String str = "https://www.amazon.es/MSI-Titan-GT73EVR-7RD-1027XES-Ordenador/dp/B078ZYX4R5/ref=sr_1_1?ie=UTF8&qid=1524239679&sr=8-1";
URL url = new URL(str);
extractReference(url.getPath() /* narrowing the search scope here */).ifPresent(System.out::println);
String html = "<div id =\" cerberus-data-metrics \"style =\" display: none; \"data-asin =\" B078ZYX4R5 \"data-as-price = \"1479.00\" data-asin-shipping = \"0\" data-asin-currency-code = \"EUR\" data-substitute-count = \"0\" data-device-type = \"WEB\" data-display-code = \"Asin is not eligible because it has a retail offer \"> </ div>";
extractPrice(html).ifPresent(System.out::println);
}
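The question mentions that the extraction can also be done with indexOf and substring instead of regular expressions. Here is a minimal sketch of that approach, assuming the attribute value is always wrapped in double quotes. The class PriceWithoutRegex and the helper attributeValue are hypothetical names used only for illustration:

```java
class PriceWithoutRegex {

    // Hypothetical helper: returns the double-quoted value that follows the
    // first occurrence of the attribute name, or null if it cannot be found.
    static String attributeValue(String html, String attribute) {
        int name = html.indexOf(attribute);
        if (name < 0) {
            return null;
        }
        // Opening quote of the value, searched after the attribute name.
        int open = html.indexOf('"', name + attribute.length());
        if (open < 0) {
            return null;
        }
        // Closing quote of the value.
        int close = html.indexOf('"', open + 1);
        if (close < 0) {
            return null;
        }
        return html.substring(open + 1, close).trim();
    }

    public static void main(String[] args) {
        String html = "<div data-asin =\" B078ZYX4R5 \"data-as-price = \"1479.00\" data-asin-shipping = \"0\">";
        System.out.println(attributeValue(html, "data-as-price"));
    }
}
```

Unlike the regular expression, this sketch does not handle single-quoted values, so it is only suitable when the markup is as regular as in the sample.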
Related
I am trying to create a web scraper program that takes tables from a website and converts them into ".csv" files.
I'm using Jsoup to pull the data down into a document and read it from doc.html(), as shown below. As it stands, the reader picks up 18 tables at my test site but no table data tags.
Do you have any idea what could be going wrong?
ArrayList<Data_Log> container = new ArrayList<Data_Log>();
ArrayList<ListData_Log> containerList = new ArrayList<ListData_Log>();
ArrayList<String> tableNames = new ArrayList<String>();// Stores native names of tables
ArrayList<Double> meanStorage = new ArrayList<Double>();// Stores data mean per table
ArrayList<String> processlog = new ArrayList<String>();// Keeps a record of all actions taken per iteration
ArrayList<Double> modeStorage = new ArrayList<Double>();
Calendar cal;
private static final long serialVersionUID = -8174362940798098542L;
public void takeData() throws IOException {
if (testModeActive == true) {
System.out.println("Initializing Data Cruncher with developer logs");
System.out.println("Taking data from: " + dataSource); }
int irow = 0;
int icolumn = 0;
int iTable = 0;
// int iListno = 0;
// int iListLevel;
String u = null;
boolean recording = false;
boolean duplicate = false;
Document doc = Jsoup.connect(dataSource).get();
Webtitle = doc.title();
Pattern tb = Pattern.compile("<table");
Matcher tB = tb.matcher(doc.html());
Pattern ttl = Pattern.compile("<title>(//s+)</title>");
Matcher ttl2= ttl.matcher(doc.html());
Pattern tr = Pattern.compile("<tr");
Matcher tR = tr.matcher(doc.html());
Pattern td = Pattern.compile("<td(//s+)</td>");
Matcher tD = td.matcher(doc.html());
Pattern tdc = Pattern.compile("<td class=(//s+)>(//s+)</td>");
Matcher tDC = tdc.matcher(doc.html());
Pattern tb2 = Pattern.compile("</table>");
Matcher tB2 = tb2.matcher(doc.html());
Pattern th = Pattern.compile("<th");
Matcher tH = th.matcher(doc.html());
while (tB.find()) {
iTable++;
while(ttl2.find()) {
tableNames.add(ttl2.group(1));
}
while (tR.find()) {
while (tD.find()||tH.find()) {
u = tD.group(1);
Data_Log v = new Data_Log();
v.setTable(iTable);
v.dataSort(u);
v.setRow(irow);
v.setColumn(icolumn);
container.add(v);
icolumn++;
}
while(tDC.find()) {
u = tDC.group(2);
Data_Log v = new Data_Log();
v.setTable(iTable);
v.dataSort(u);
v.setRow(irow);
v.setColumn(icolumn);
container.add(v);
icolumn++;
}
irow++;
}
if (tB2.find()) {
irow=0;
icolumn=0;
}
}
Expected results:
table# logged + "td"s logged
Actual result:
table# logged "td"s omitted
Since you're using jsoup, use it:
var url = "<your url>";
var doc = Jsoup.connect(url).get();
var tables = doc.body().getElementsByTag("table");
tables.forEach(table -> {
System.out.println(table.id());
System.out.println(table.className());
System.out.println(table.getElementsByTag("td"));
});
As for your attempts to parse HTML with regex, here is some suggested reading:
Using regular expressions to parse HTML: why not?
Why is it such a bad idea to parse XML with regex?
RegEx match open tags except XHTML self-contained tags
I'd like to retrieve data from string based on params from template.
For example:
given string -> "some text, var=20 another part param=45"
template -> "some text, var=${var1} another part param=${var2}"
result -> var1 = 20; var2 = 45
How could I achieve that result in Java? Are there libraries for this, or do I need to use regex?
I tried different template processors, but they don't have the functionality I need; I need something like their inverse.
I hope the sample below serves your purpose:
String strValue = "some text, var=20 another part param=45";
String strTemplate = "some text, var=${var1} another part param=${var2}";
ArrayList<String> wildcards = new ArrayList<String>();
StringBuffer outputBuffer = new StringBuffer();
Pattern pat1 = Pattern.compile("(\\$\\{\\w*\\})");
Matcher mat1 = pat1.matcher(strTemplate);
while (mat1.find())
{
wildcards.add(mat1.group(1).replaceAll("\\$", "").replaceAll("\\{", "").replaceAll("\\}", ""));
strTemplate = strTemplate.replace(mat1.group(1), "(\\w*)");
}
if(wildcards!= null && wildcards.size() > 0)
{
Pattern pat2 = Pattern.compile(strTemplate);
Matcher mat2 = pat2.matcher(strValue);
if (mat2.find())
{
for(int i=0;i<wildcards.size();i++)
{
outputBuffer.append(wildcards.get(i)).append(" = ");
outputBuffer.append(mat2.group(i+1));
if(i != wildcards.size()-1)
{
outputBuffer.append("; ");
}
}
}
}
System.out.println(outputBuffer.toString());
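The sample above inserts the literal parts of the template into the pattern unescaped, so a template containing regex metacharacters (for example a dot or parentheses) could match incorrectly. Below is a sketch of the same idea that quotes the literal parts with Pattern.quote. The class TemplateExtractor and the method extract are names invented for this example, and placeholder values are assumed to consist of word characters only, as in the original:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemplateExtractor {

    // Matches placeholders of the form ${name}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    // Returns placeholder name -> extracted value, or an empty map if the
    // input does not fit the template.
    static Map<String, String> extract(String template, String input) {
        List<String> names = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Literal text between placeholders is quoted so it is matched verbatim.
            regex.append(Pattern.quote(template.substring(last, m.start())));
            regex.append("(\\w*)");
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));

        Map<String, String> result = new LinkedHashMap<>();
        Matcher values = Pattern.compile(regex.toString()).matcher(input);
        if (values.matches()) {
            for (int i = 0; i < names.size(); i++) {
                result.put(names.get(i), values.group(i + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String template = "some text, var=${var1} another part param=${var2}";
        String input = "some text, var=20 another part param=45";
        System.out.println(extract(template, input));
    }
}
```

Returning a map rather than a formatted string leaves it to the caller to decide how to present the values.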
How can I print welcome using the following System.out.println?
As written, generated prints ac. How can I make it print welcome (that is, the value of the variable ac, not the string "ac")?
public class BrowserSample {
public static void main(String[] args) {
String generated = "ac";
String ac = "welcome";
System.out.println("value from generated is = " + generated);
}
}
OK, after a brief experiment, here is the solution you want:
String generated = "ac";
String ac = "welcome"; // declare as member of class
String s = (String) getClass().getDeclaredField(generated).get(this);
s will contain welcome
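To make the reflection approach easier to try out, here is a self-contained sketch. It assumes ac is declared as an instance field, as the comment above says; the class FieldLookup and the method valueOf are hypothetical names chosen for this example:

```java
import java.lang.reflect.Field;

class FieldLookup {

    // The value must live in a field; local variables cannot be looked up by name.
    private String ac = "welcome";

    // Reads the String field whose name is passed in.
    String valueOf(String fieldName) {
        try {
            Field field = getClass().getDeclaredField(fieldName);
            return (String) field.get(this);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("No readable field named " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        String generated = "ac";
        System.out.println("value from generated is = " + new FieldLookup().valueOf(generated));
    }
}
```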
What about a map of key-value pairs?
Map<String,String> map = new HashMap<>();
String generated = "ac";
map.put("ac", "welcome");
System.out.println("value from generated is = " + map.get(generated));
What you are expecting is not possible with local variables, and it would not be meaningful either.
I am extracting a YouTube video ID from a YouTube link. The links look like this:
http://www.youtube.com/watch?v=mmmc&feature=plcp
I want to get only the mmmc part.
Should I use .replaceAll?
Three ways:
Url parsing:
http://download.oracle.com/javase/6/docs/api/java/net/URL.html
URL url = new URL("http://www.youtube.com/watch?v=mmmc&feature=plcp");
url.getQuery(); // return query string.
Regular Expression
Examples here http://www.vogella.com/articles/JavaRegularExpressions/article.html
Tokenize
String s = "http://www.youtube.com/watch?v=mmmc&feature=plcp";
String arr[] = s.split("=");
String arr1[] = arr[1].split("&");
System.out.println(arr1[0]);
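The URL-parsing option can be combined with the tokenizing one so that the result does not depend on the position of the parameter. A rough sketch, where the class QueryParam and the method get are made-up names and percent-decoding of the value is deliberately left out:

```java
import java.net.URI;

class QueryParam {

    // Returns the raw (not percent-decoded) value of the named query
    // parameter, or null if the link has no such parameter.
    static String get(String link, String name) {
        String query = URI.create(link).getRawQuery();
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            if (key.equals(name)) {
                return eq < 0 ? "" : pair.substring(eq + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(get("http://www.youtube.com/watch?v=mmmc&feature=plcp", "v"));
    }
}
```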
If you'd like to use regular expressions, this could be a solution:
Pattern p = Pattern
.compile("http://www.youtube.com/watch\\?v=([\\s\\S]*?)\\&feature=plcp");
Matcher m = p.matcher(youtubeLink);
if (m.find()) {
return m.group(1);
}
else{
throw new IllegalArgumentException("invalid youtube link");
}
Of course, this will only work if the feature parameter is always plcp. If it is not, you could simply remove that part of the pattern or replace it with a wildcard, as I did with mmmc.
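Edit: now I hope I know what you are looking for:
String url = "http://www.youtube.com/watch?v=mmmc&feature=plcp";
String search = "v=";
int index = url.indexOf(search);
int index2 = url.indexOf("&", index);
// If v is the last parameter there is no following "&", so read to the end of the string.
if (index2 == -1) index2 = url.length();
String found = url.substring(index + search.length(), index2);
System.out.println(found);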
Here's a generic solution (using Guava MapSplitter):
public final class UrlUtil {
/**
* Query string splitter.
*/
private static final MapSplitter PARAMS_SPLITTER = Splitter.on('&').withKeyValueSeparator("=");
/**
* Get param value in provided url for provided param.
*
* @param url Url to use
* @param param Param to use
* @return param value or null.
*/
public static String getParamVal(String url, String param)
{
if (url.contains("?")) {
final String query = url.substring(url.indexOf('?') + 1);
return PARAMS_SPLITTER.split(query).get(param);
}
return null;
}
public static void main(final String[] args)
{
final String url = "http://www.youtube.com/watch?v=mmmc&feature=plcp";
System.out.println(getParamVal(url, "v"));
System.out.println(getParamVal(url, "feature"));
}
}
Outputs:
mmmc
plcp
URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs are equivalent.
Strategies include adding trailing slashes, https => http, etc. The Wikipedia page lists many.
Got a favorite method of doing this in Java? Perhaps a library (Nutch?), but I'm open. Smaller and fewer dependencies is better.
I'll handcode something for now and keep an eye on this question.
EDIT: I want to aggressively normalize to count URLs as the same if they refer to the same content. For example, I ignore the parameters utm_source, utm_medium, utm_campaign. For example, I ignore subdomain if the title is the same.
Have you taken a look at the URI class?
http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()
I found this question last night, but there wasn't the answer I was looking for, so I made my own. Here it is in case somebody wants it in the future:
/**
* - Convert the scheme and host to lowercase (done by java.net.URL)
* - Normalize the path (done by java.net.URI)
* - Add the port number.
* - Remove the fragment (the part after the #).
* - Remove trailing slash.
* - Sort the query string params.
* - Remove some query string params like "utm_*" and "*session*".
*/
public class NormalizeURL
{
public static String normalize(final String taintedURL) throws MalformedURLException
{
final URL url;
try
{
url = new URI(taintedURL).normalize().toURL();
}
catch (URISyntaxException e) {
throw new MalformedURLException(e.getMessage());
}
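// replaceAll treats "/$" as a regex (trailing slash); String.replace would search for the literal text "/$".
final String path = url.getPath().replaceAll("/$", "");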
final SortedMap<String, String> params = createParameterMap(url.getQuery());
final int port = url.getPort();
final String queryString;
if (params != null)
{
// Some params are only relevant for user tracking, so remove the most commons ones.
for (Iterator<String> i = params.keySet().iterator(); i.hasNext();)
{
final String key = i.next();
if (key.startsWith("utm_") || key.contains("session"))
{
i.remove();
}
}
queryString = "?" + canonicalize(params);
}
else
{
queryString = "";
}
return url.getProtocol() + "://" + url.getHost()
+ (port != -1 && port != 80 ? ":" + port : "")
+ path + queryString;
}
/**
* Takes a query string, separates the constituent name-value pairs, and
* stores them in a SortedMap ordered by lexicographical order.
* @return Null if there is no query string.
*/
private static SortedMap<String, String> createParameterMap(final String queryString)
{
if (queryString == null || queryString.isEmpty())
{
return null;
}
final String[] pairs = queryString.split("&");
final Map<String, String> params = new HashMap<String, String>(pairs.length);
for (final String pair : pairs)
{
if (pair.length() < 1)
{
continue;
}
String[] tokens = pair.split("=", 2);
for (int j = 0; j < tokens.length; j++)
{
try
{
tokens[j] = URLDecoder.decode(tokens[j], "UTF-8");
}
catch (UnsupportedEncodingException ex)
{
ex.printStackTrace();
}
}
switch (tokens.length)
{
case 1:
{
if (pair.charAt(0) == '=')
{
params.put("", tokens[0]);
}
else
{
params.put(tokens[0], "");
}
break;
}
case 2:
{
params.put(tokens[0], tokens[1]);
break;
}
}
}
return new TreeMap<String, String>(params);
}
/**
* Canonicalize the query string.
*
* @param sortedParamMap Parameter name-value pairs in lexicographical order.
* @return Canonical form of query string.
*/
private static String canonicalize(final SortedMap<String, String> sortedParamMap)
{
if (sortedParamMap == null || sortedParamMap.isEmpty())
{
return "";
}
final StringBuffer sb = new StringBuffer(350);
final Iterator<Map.Entry<String, String>> iter = sortedParamMap.entrySet().iterator();
while (iter.hasNext())
{
final Map.Entry<String, String> pair = iter.next();
sb.append(percentEncodeRfc3986(pair.getKey()));
sb.append('=');
sb.append(percentEncodeRfc3986(pair.getValue()));
if (iter.hasNext())
{
sb.append('&');
}
}
return sb.toString();
}
/**
* Percent-encode values according to RFC 3986. The built-in Java URLEncoder does not encode
* according to the RFC, so we make the extra replacements.
*
* @param string Decoded string.
* @return Encoded string per RFC 3986.
*/
private static String percentEncodeRfc3986(final String string)
{
try
{
return URLEncoder.encode(string, "UTF-8").replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
}
catch (UnsupportedEncodingException e)
{
return string;
}
}
}
Because you also want to identify URLs which refer to the same content, I found this paper from the WWW2007 pretty interesting: Do Not Crawl in the DUST: Different URLs with Similar Text. It provides you with a nice theoretical approach.
No, there is nothing in the standard libraries to do this. Canonicalization includes things like decoding unnecessarily encoded characters, converting hostnames to lowercase, etc.
e.g. http://ACME.com/./foo%26bar becomes:
http://acme.com/foo&bar
URI's normalize() does not do this.
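A quick way to check what URI.normalize() does to the example is to print the host and the raw path after normalizing. In the sketch below (NormalizeDemo and normalized are illustrative names only), the "." path segment is removed, while the host keeps its case and the %26 escape stays encoded:

```java
import java.net.URI;

class NormalizeDemo {

    // Thin wrapper around the standard java.net.URI normalization.
    static URI normalized(String url) {
        return URI.create(url).normalize();
    }

    public static void main(String[] args) {
        URI uri = normalized("http://ACME.com/./foo%26bar");
        System.out.println(uri.getHost());    // host is not lowercased
        System.out.println(uri.getRawPath()); // "." segment removed, %26 kept
    }
}
```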
The RL library:
https://github.com/backchatio/rl
goes quite a way beyond java.net.URI.normalize().
It's in Scala, but I imagine it should be usable from Java.
You can do this with the Restlet framework using Reference.normalize(). You should also be able to remove the elements you don't need quite conveniently with this class.
In Java, normalize parts of a URL
Example of a URL: https://i0.wp.com:55/lplresearch.com/wp-content/feb.png?ssl=1&myvar=2#myfragment
protocol: https
domain name: i0.wp.com
subdomain: i0
port: 55
path: /lplresearch.com/wp-content/feb.png
query: ?ssl=1
parameters: &myvar=2
fragment: #myfragment
Code to do the URL parsing:
import java.util.*;
import java.util.regex.*;
public class regex {
public static String getProtocol(String the_url){
Pattern p = Pattern.compile("^(http|https|smtp|ftp|file|pop)://.*");
Matcher m = p.matcher(the_url);
// matches() must run before group(); otherwise group() throws IllegalStateException
return m.matches() ? m.group(1) : null;
}
public static String getParameters(String the_url){
// '#' is left out of the character class so the fragment is not captured with the query
Pattern p = Pattern.compile(".*(\\?[-a-zA-Z0-9_.!$&'()*+,;=]+)(#.*)*$");
Matcher m = p.matcher(the_url);
return m.matches() ? m.group(1) : null;
}
public static String getFragment(String the_url){
Pattern p = Pattern.compile(".*(#.*)$");
Matcher m = p.matcher(the_url);
return m.matches() ? m.group(1) : null;
}
public static void main(String[] args){
String the_url =
"https://i0.wp.com:55/lplresearch.com/" +
"wp-content/feb.png?ssl=1&myvar=2#myfragment";
System.out.println(getProtocol(the_url));
System.out.println(getFragment(the_url));
System.out.println(getParameters(the_url));
}
}
Prints
https
#myfragment
?ssl=1&myvar=2
You can then push and pull on the parts of the URL until they are up to muster.
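As an alternative to hand-written patterns, the standard java.net.URI class already exposes these parts through getters. A small sketch for comparison; the class UrlParts and the method summary are names made up for this example:

```java
import java.net.URI;

class UrlParts {

    // Joins the individual parts with " | " so they are easy to inspect.
    static String summary(String url) {
        URI uri = URI.create(url);
        return String.join(" | ",
                uri.getScheme(),
                uri.getHost(),
                String.valueOf(uri.getPort()),
                uri.getPath(),
                uri.getQuery(),
                uri.getFragment());
    }

    public static void main(String[] args) {
        System.out.println(summary(
                "https://i0.wp.com:55/lplresearch.com/wp-content/feb.png?ssl=1&myvar=2#myfragment"));
        // https | i0.wp.com | 55 | /lplresearch.com/wp-content/feb.png | ssl=1&myvar=2 | myfragment
    }
}
```

Note that the getters return the query and the fragment without the leading "?" and "#", unlike the regular expressions above.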
I have a simple way to solve it. Here is my code:
public static String normalizeURL(String oldLink)
{
int pos=oldLink.indexOf("://");
String newLink="http"+oldLink.substring(pos);
return newLink;
}