I am trying to write a web scraper that takes tables from a website and converts them into ".csv" files.
I'm using Jsoup to pull the page down into a Document and then read from doc.html(), as shown below. As it stands, the reader picks up 18 tables at my test site but none of the table data (td) tags.
Do you have any idea what could be going wrong?
ArrayList<Data_Log> container = new ArrayList<Data_Log>();
ArrayList<ListData_Log> containerList = new ArrayList<ListData_Log>();
ArrayList<String> tableNames = new ArrayList<String>();// Stores native names of tables
ArrayList<Double> meanStorage = new ArrayList<Double>();// Stores data mean per table
ArrayList<String> processlog = new ArrayList<String>();// Keeps a record of all actions taken per iteration
ArrayList<Double> modeStorage = new ArrayList<Double>();
Calendar cal;
private static final long serialVersionUID = -8174362940798098542L;
public void takeData() throws IOException {
if (testModeActive == true) {
System.out.println("Initializing Data Cruncher with developer logs");
System.out.println("Taking data from: " + dataSource); }
int irow = 0;
int icolumn = 0;
int iTable = 0;
// int iListno = 0;
// int iListLevel;
String u = null;
boolean recording = false;
boolean duplicate = false;
Document doc = Jsoup.connect(dataSource).get();
Webtitle = doc.title();
Pattern tb = Pattern.compile("<table");
Matcher tB = tb.matcher(doc.html());
Pattern ttl = Pattern.compile("<title>(//s+)</title>");
Matcher ttl2= ttl.matcher(doc.html());
Pattern tr = Pattern.compile("<tr");
Matcher tR = tr.matcher(doc.html());
Pattern td = Pattern.compile("<td(//s+)</td>");
Matcher tD = td.matcher(doc.html());
Pattern tdc = Pattern.compile("<td class=(//s+)>(//s+)</td>");
Matcher tDC = tdc.matcher(doc.html());
Pattern tb2 = Pattern.compile("</table>");
Matcher tB2 = tb2.matcher(doc.html());
Pattern th = Pattern.compile("<th");
Matcher tH = th.matcher(doc.html());
while (tB.find()) {
iTable++;
while(ttl2.find()) {
tableNames.add(ttl2.group(1));
}
while (tR.find()) {
while (tD.find()||tH.find()) {
u = tD.group(1);
Data_Log v = new Data_Log();
v.setTable(iTable);
v.dataSort(u);
v.setRow(irow);
v.setColumn(icolumn);
container.add(v);
icolumn++;
}
while(tDC.find()) {
u = tDC.group(2);
Data_Log v = new Data_Log();
v.setTable(iTable);
v.dataSort(u);
v.setRow(irow);
v.setColumn(icolumn);
container.add(v);
icolumn++;
}
irow++;
}
if (tB2.find()) {
irow=0;
icolumn=0;
}
}
Expected results:
table# logged + "td"s logged
Actual result:
table# logged "td"s omitted
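Since you're already using jsoup, use its DOM API to select the tables and cells instead of matching the raw HTML with regular expressions: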
var url = "<your url>";
var doc = Jsoup.connect(url).get();
var tables = doc.body().getElementsByTag("table");
tables.forEach(table -> {
    System.out.println(table.id());
    System.out.println(table.className());
    System.out.println(table.getElementsByTag("td"));
});
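Once the cell text has been pulled out of each row, the remaining step toward the ".csv" goal is mostly quoting. The following is a minimal sketch, assuming each row has already been collected as a list of strings; the class and method names (CsvRows, escape, toCsvLine) are hypothetical and not part of jsoup.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CsvRows {
    // Wraps a cell in double quotes when it contains a comma, a quote, or a line break,
    // doubling any embedded quotes (the usual CSV convention).
    static String escape(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n") || cell.contains("\r")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    // Joins the escaped cells of one table row into a single CSV line.
    static String toCsvLine(List<String> cells) {
        return cells.stream().map(CsvRows::escape).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Hypothetical row data standing in for the text of the td elements in one tr.
        System.out.println(toCsvLine(Arrays.asList("Name", "Smith, J.", "say \"hi\"")));
        // prints: Name,"Smith, J.","say ""hi"""
    }
}
```

Each tr would produce one such line, and each table could be written to its own file.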
As for your attempts to parse HTML with regular expressions, here is some suggested reading:
Using regular expressions to parse HTML: why not?
Why is it such a bad idea to parse XML with regex?
RegEx match open tags except XHTML self-contained tags
Scenario: I need to append an id to a URL.
What I have done:
I took the last id from the table and stored it in a list.
Then I read the text of that id and stored it in a String.
List<WebElement> id = driver.findElements(By.xpath("(//table[contains(@class,'mat-table')]//tr/td[1])[last()]"));
int rowsize = id.size();
for(int i=0;i<rowsize;i++)
{
String text = id.get(i).getText();
System.out.println("Get the id:"+text);
Then I use that text and append it to the URL:
String confirmationURL = "https://test-websites.net/#/email?type=confirm";
String newurl = confirmationURL+"&id=text"; // This is the part that is wrong: it appends the literal word "text", but I need to append the id I got from the list.
driver.get(newurl);
So basically the URL should look like: https://test-websites.net/#/email?type=confirm&id=47474
Can someone please advise on what should be done?
You can create a new list of URLs and use the add method to append each id's text to the base URL.
List<WebElement> id = driver.findElements(By.xpath("(//table[contains(@class,'mat-table')]//tr/td[1])[last()]"));
String confirmationURL = "https://test-websites.net/#/email?type=confirm";
List<String> newurls = new ArrayList<String>();
int rowsize = id.size();
for(int i = 0; i < rowsize; i++) {
    String text = id.get(i).getText();
    System.out.println("Get the id:"+text);
    newurls.add(confirmationURL + "&id=" + text);
}
After this code runs successfully, newurls holds the URLs ending with the ids matched by the (//table[contains(@class,'mat-table')]//tr/td[1])[last()] XPath.
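The underlying fix is to concatenate the variable's value rather than the literal word "text". Here is a minimal sketch of just that step without Selenium, assuming the id strings have already been read from the page; ConfirmationUrls and buildUrls are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConfirmationUrls {
    // Builds one URL per id by appending the value of each id, not a string literal.
    static List<String> buildUrls(String baseUrl, List<String> ids) {
        List<String> urls = new ArrayList<>();
        for (String id : ids) {
            urls.add(baseUrl + "&id=" + id.trim());
        }
        return urls;
    }

    public static void main(String[] args) {
        String base = "https://test-websites.net/#/email?type=confirm";
        // "47474" stands in for the text returned by WebElement.getText().
        System.out.println(buildUrls(base, Arrays.asList("47474")));
        // prints: [https://test-websites.net/#/email?type=confirm&id=47474]
    }
}
```

Each resulting string can then be passed to driver.get(...).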
I'm trying to run the following code:
for (java.util.Iterator<Row> iter = dataframe1.toLocalIterator(); iter.hasNext();) {
Row it = (iter.next());
String item = it.get(2).toString();
String rayon = it.get(6).toString();
Double d = Double.parseDouble(rayon)/100000;
String geomType = it.get(14).toString();
Dataset<Row> res_f = null;
if(geomType.equalsIgnoreCase("Polygon")) {
res_f= dataframe2.withColumn("ST_WITHIN",expr("ST_WITHIN(ST_GeomFromText(CONCAT('POINT(',longitude,' ',latitude,')',4326)),ST_GeomFromWKT('"+item+"'))"));
} else {
res_f = dataframe2.withColumn("ST_BUFFERR",expr("ST_Buffer(ST_GeomFromWKT('"+item+"'),"+d+")")).withColumn("ST_WITHIN",expr("ST_WITHIN(ST_GeomFromText(CONCAT('POINT(',longitude,' ',latitude,')',4326)),ST_BUFFERR)"));
}
res_f.show();
}
But res_f returns nothing and is always null.
I'm using Spark with Java.
EDIT
I solved the problem by changing the line Dataset<Row> res_f = null; to Dataset<Row> res_f;
I need your help.
My Android app reads paragraphs and some of their properties from an MS Word document using the Aspose.Words for Android library. It gets the paragraph text, the style name, and the "is style separator" value. Some words in a paragraph are hyperlinks. How can I get the start and end boundaries of the hyperlinked words? For example:
This is an inline hyperlink paragraph example that the start bound is 18 and end bound is 27.
public static ArrayList<String[]> GetBookLinesByTag(String file) {
ArrayList<String[]> bookLines = new ArrayList<>();
try {
Document doc = new Document(file);
ParagraphCollection paras = doc.getFirstSection().getBody().getParagraphs();
for(int i = 0; i < paras.getCount(); i++){
String styleName = paras.get(i).getParagraphFormat().getStyleName().trim();
String isStyleSeparator = Integer.toString(paras.get(i).getBreakIsStyleSeparator() ? 1 : 0);
String content = paras.get(i).toString(SaveFormat.TEXT).trim();
bookLines.add(new String[]{content, styleName, isStyleSeparator});
}
} catch (Exception e){}
return bookLines;
}
Edit:
Thanks, Alexey Noskov; solved with your help.
public static ArrayList<String[]> GetBookLinesByTag(String file) {
ArrayList<String[]> bookLines = new ArrayList<>();
try {
Document doc = new Document(file);
ParagraphCollection paras = doc.getFirstSection().getBody().getParagraphs();
for(int i = 0; i < paras.getCount(); i++){
String styleName = paras.get(i).getParagraphFormat().getStyleName().trim();
String isStyleSeparator = Integer.toString(paras.get(i).getBreakIsStyleSeparator() ? 1 : 0);
String content = paras.get(i).toString(SaveFormat.TEXT).trim();
for (Field field : paras.get(i).getRange().getFields()) {
if (field.getType() == FieldType.FIELD_HYPERLINK) {
FieldHyperlink hyperlink = (FieldHyperlink) field;
String urlId = hyperlink.getSubAddress();
String urlText = hyperlink.getResult();
// Reformat linked text: urlText:urlId
content = urlText + ":" + urlId;
}
}
bookLines.add(new String[]{content, styleName, isStyleSeparator});
}
} catch (Exception e){}
return bookLines;
}
Hyperlinks in MS Word documents are represented as fields. If you press Alt+F9 in MS Word, you will see something like this:
{ HYPERLINK "https://aspose.com" }
Follow the link to learn more about fields in the Aspose.Words document model and in MS Word:
https://docs.aspose.com/display/wordsjava/Introduction+to+Fields
In your case, you need to locate the position of the FieldStart node; this is the start position. Then measure the length of the content between FieldSeparator and FieldEnd; the start position plus that length is the end position.
Disclosure: I work on the Aspose.Words team.
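The "start plus length" arithmetic can be illustrated with plain strings. This is only a sketch and not Aspose API: it assumes the displayed link text (for example the value returned by FieldHyperlink.getResult()) occurs in the paragraph's plain text, and it locates the first occurrence only. HyperlinkBounds and bounds are hypothetical names.

```java
public class HyperlinkBounds {
    // Returns {start, end} of the first occurrence of linkText in paragraphText,
    // where end = start + linkText.length(); returns null when the text is not found.
    static int[] bounds(String paragraphText, String linkText) {
        int start = paragraphText.indexOf(linkText);
        if (start < 0) {
            return null;
        }
        return new int[] { start, start + linkText.length() };
    }

    public static void main(String[] args) {
        String paragraph = "This is an inline hyperlink paragraph example";
        int[] b = bounds(paragraph, "hyperlink");
        System.out.println(b[0] + " " + b[1]);
        // prints: 18 27
    }
}
```

If the same words can appear more than once in a paragraph, walking the nodes as described above is the safer route, because a text search cannot tell the linked occurrence from the others.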
I'd like to retrieve data from a string based on the parameters in a template.
For example:
given string -> "some text, var=20 another part param=45"
template -> "some text, var=${var1} another part param=${var2}"
result -> var1 = 20; var2 = 45
How could I achieve that result in Java? Are there libraries for this, or do I need to use regex?
I tried different template processors, but they don't have the functionality I need; I need something like the inverse of them.
I hope the sample below serves your purpose:
String strValue = "some text, var=20 another part param=45";
String strTemplate = "some text, var=${var1} another part param=${var2}";
ArrayList<String> wildcards = new ArrayList<String>();
StringBuffer outputBuffer = new StringBuffer();
Pattern pat1 = Pattern.compile("(\\$\\{\\w*\\})");
Matcher mat1 = pat1.matcher(strTemplate);
while (mat1.find())
{
    wildcards.add(mat1.group(1).replaceAll("\\$", "").replaceAll("\\{", "").replaceAll("\\}", ""));
    strTemplate = strTemplate.replace(mat1.group(1), "(\\w*)");
}
if(wildcards!= null && wildcards.size() > 0)
{
    Pattern pat2 = Pattern.compile(strTemplate);
    Matcher mat2 = pat2.matcher(strValue);
    if (mat2.find())
    {
        for(int i=0;i<wildcards.size();i++)
        {
            outputBuffer.append(wildcards.get(i)).append(" = ");
            outputBuffer.append(mat2.group(i+1));
            if(i != wildcards.size()-1)
            {
                outputBuffer.append("; ");
            }
        }
    }
}
System.out.println(outputBuffer.toString());
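The sample above compiles the template's literal text as part of the regex, so a template containing characters such as "." or "(" would be interpreted as regex syntax. A possible variation, sketched here under the assumption that placeholder values consist of word characters only, quotes the literal parts with Pattern.quote and returns the values in a map. TemplateExtractor and extract are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateExtractor {
    // Matches ${name} placeholders and captures the name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    // Turns the template into a regex (literal parts quoted, placeholders as capture groups)
    // and returns name -> value pairs; the map is empty when the value does not fit the template.
    static Map<String, String> extract(String template, String value) {
        List<String> names = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            regex.append(Pattern.quote(template.substring(last, m.start())));
            regex.append("(\\w*)"); // assumption: values are word characters only
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));

        Map<String, String> result = new LinkedHashMap<>();
        Matcher vm = Pattern.compile(regex.toString()).matcher(value);
        if (vm.matches()) {
            for (int i = 0; i < names.size(); i++) {
                result.put(names.get(i), vm.group(i + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "some text, var=${var1} another part param=${var2}",
                "some text, var=20 another part param=45"));
        // prints: {var1=20, var2=45}
    }
}
```

If values may contain spaces or punctuation, the capture group would need a different pattern, for example a reluctant (.*?).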
With Microsoft CRM 2011 Online and its web services, I am using the method below in my Main.java, built on the OrganizationServiceStub class generated from the web service. The number of records it reports is -1. Can someone help me see where I am going wrong? I want to retrieve the accounts whose name begins with "Tel" without supplying the accountid. I can see that the data exists in CRM.
Thanks
public static void getAccountDetails(OrganizationServiceStub service, ArrayOfstring fields)
{
try{
ArrayOfanyType aa = new ArrayOfanyType();
aa.setAnyType(new String[] {"Tel"});
ConditionExpression condition1 = new ConditionExpression();
condition1.setAttributeName("name");
condition1.setOperator(ConditionOperator.BeginsWith);
condition1.setValues(aa);
ArrayOfConditionExpression ss = new ArrayOfConditionExpression();
ss.setConditionExpression(new ConditionExpression[] {condition1});
FilterExpression filter1 = new FilterExpression();
filter1.setConditions(ss);
QueryExpression query = new QueryExpression();
query.setEntityName("account");
ColumnSet cols = new ColumnSet();
cols.setColumns(fields);
query.setColumnSet(cols);
query.setCriteria(filter1);
RetrieveMultiple ll = new RetrieveMultiple();
ll.setQuery(query);
RetrieveMultipleResponse result1 = service.retrieveMultiple(ll);
EntityCollection accounts = result1.getRetrieveMultipleResult();
System.out.println(accounts.getTotalRecordCount());
}
catch (IOrganizationService_RetrieveMultiple_OrganizationServiceFaultFault_FaultMessage e) {
logger.error(e.getMessage());
e.printStackTrace();
}
catch (RemoteException e) {
logger.error(e.getMessage());
e.printStackTrace();
}
}
For Java, the following code snippet works for the above issue:
ArrayOfanyType aa = new ArrayOfanyType();
aa.setAnyType(new String[] {"555"});
ConditionExpression condition1 = new ConditionExpression();
condition1.setAttributeName("telephone1");
condition1.setOperator(ConditionOperator.BeginsWith);
condition1.setValues(aa);
ArrayOfConditionExpression ss = new ArrayOfConditionExpression();
ss.setConditionExpression(new ConditionExpression[] {condition1});
FilterExpression filter1 = new FilterExpression();
filter1.setConditions(ss);
QueryExpression query = new QueryExpression();
query.setEntityName("account");
PagingInfo pagingInfo = new PagingInfo();
pagingInfo.setReturnTotalRecordCount(true);
query.setPageInfo(pagingInfo);
OrganizationServiceStub.ColumnSet colSet = new OrganizationServiceStub.ColumnSet();
OrganizationServiceStub.ArrayOfstring cols = new OrganizationServiceStub.ArrayOfstring();
cols.setString(new String[]{"name", "telephone1", "address1_city"});
colSet.setColumns(cols);
query.setColumnSet(colSet);
query.setCriteria(filter1);
RetrieveMultiple ll = new RetrieveMultiple();
ll.setQuery(query);
OrganizationServiceStub.RetrieveMultipleResponse response = serviceStub.retrieveMultiple(ll);
EntityCollection result = response.getRetrieveMultipleResult();
ArrayOfEntity attributes = result.getEntities();
Entity[] keyValuePairs = attributes.getEntity();
for (int i = 0; i < keyValuePairs.length; i++) {
OrganizationServiceStub.KeyValuePairOfstringanyType[] keyValuePairss = keyValuePairs[i].getAttributes().getKeyValuePairOfstringanyType();
for (int j = 0; j < keyValuePairss.length; j++) {
System.out.print(keyValuePairss[j].getKey() + ": ");
System.out.println(keyValuePairss[j].getValue());
}
}
I'm not sure how similar your EntityCollection object is to the .NET version in the SDK. In .NET, however, you need to set ReturnTotalRecordCount in the query's PagingInfo for the TotalRecordCount property to have a value. Could you check accounts.Entities.Count instead?
Note: I'm not a Java guy either...