Extract Data From Multiple Files - java

I have exactly 278 HTML files of essays from different students. Every file contains the student ID, first name, and last name in the following format:
<p>Student ID: 000000</p>
<p>First Name: John</p>
<p>Last Name: Doe</p>
I'm trying to extract the student IDs from all these files. Is there a way to extract data between X and Y, X being "<p>Student ID: " and Y being "</p>", which should leave us with the ID?
What method, language, concept, or software would you recommend to get this done?

Using Java:
import java.io.File;
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class StudentIDsRetriever {

    public static void main(String[] args) throws IOException {
        File dir = new File("htmldir");
        String[] htmlFiles = dir.list();
        List<String> studentIds = new ArrayList<>();
        List<String> emailIds = new ArrayList<>();
        for (String htmlFile : htmlFiles) {
            Path path = FileSystems.getDefault().getPath("htmldir", htmlFile);
            List<String> lines = Files.readAllLines(path);
            for (String str : lines) {
                if (str.contains("<p>Student ID:")) {
                    // Cut from the marker to the closing tag; trim the space after the colon.
                    String idTag = str.substring(str.indexOf("<p>Student ID:"));
                    String id = idTag.substring("<p>Student ID:".length(), idTag.indexOf("</p>")).trim();
                    System.out.println("Id is " + id);
                    studentIds.add(id);
                }
                if (str.contains("@") && (str.contains(".com") || str.contains(".co.in"))) {
                    String[] words = str.split(" ");
                    for (String word : words)
                        if (word.contains("@") && (word.contains(".com") || word.contains(".co.in")))
                            emailIds.add(word);
                }
            }
        }
        System.out.println("Student list is " + studentIds);
        System.out.println("Student email list is " + emailIds);
    }
}
P.S.: This works on Java 7+.
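If you prefer a regular expression over indexOf/substring for the "between X and Y" slicing, here is a minimal sketch of the same extraction (the same file-reading loop as above would feed it line by line):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdRegexSketch {
    public static void main(String[] args) {
        // X = "<p>Student ID: ", Y = "</p>"; group(1) captures whatever sits between them.
        Pattern pattern = Pattern.compile("<p>Student ID:\\s*(\\d+)</p>");
        Matcher matcher = pattern.matcher("<p>Student ID: 000000</p>");
        if (matcher.find()) {
            System.out.println("Id is " + matcher.group(1));
        }
    }
}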

I recommend a Python script. If this is your first time using Python, that's OK: Python is an easy scripting language and has plenty of references on Google.
1) Language: Python (version 2.7)
2) Library: Beautiful Soup (you can install it with pip; pip is a package manager that ships with the Python installer)
Traverse the files one by one and open each local file, then parse the HTML content using Beautiful Soup (see this part: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag).
Then extract the content from the <p> tag. That returns "Student ID: 000000".
Split this string on ":". That returns str[0] and str[1].
str[1] is the student number you want (you may need to strip the surrounding whitespace: ' 000000 '.strip() -> '000000').
If you need help, reply.

Related

Reading data from websites using Jsoup

I found a bit of code by BalusC, which was edited by another user, Pisek, and I was wondering how to read data from another website.
I understand how to find the new class name to read different parts of data, but I'm not sure how to read the quantity of the product.
Here's my code so far:
package internalAssessment;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class practiceArea {

    public static void main(String[] args) throws Exception {
        String url = "https://www.tesco.com/groceries/product/details/?id=265485175";
        Document document = Jsoup.connect(url).get();
        String price = document.select(".linePrice").text();
        System.out.println("Price: " + price);
        String quantity = document.select("").text();
        System.out.println("Quantity: " + quantity);
    }
}
The way you get the price is by using its class:
String price = document.select(".linePrice").text();
You can also get the quantity with its class (or by its id):
document.select(".quantity").attr("value"); // by class
document.select("#qty-265485175-1").attr("value"); // by id
The thing that differs is how you read the number: here it is a value attribute, so you use .attr("value").
As I said in a comment: this opens a new connection to the website, so there is no reason the value you get will be anything other than the default of 1.
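Putting both snippets together, a minimal runnable sketch (assuming the .linePrice and .quantity class names from above are still present on the page; they may change at any time):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PriceAndQuantity {
    public static void main(String[] args) throws Exception {
        Document document = Jsoup.connect(
                "https://www.tesco.com/groceries/product/details/?id=265485175").get();
        // Text content for the price, value attribute for the quantity input.
        String price = document.select(".linePrice").text();
        String quantity = document.select(".quantity").attr("value");
        System.out.println("Price: " + price);
        System.out.println("Quantity: " + quantity);
    }
}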

Migrating a file or a folder from one repository to another in Documentum

I am working on a JavaFX project connected to Documentum data storage, and I am trying to work out how to move a file (let's call it file1) placed in a folder (let's call it Folder1) into another folder (let's call it Folder2). It's worth mentioning that both folders are in the same cabinet. I have implemented the following class:
package application;

import com.documentum.com.DfClientX;
import com.documentum.com.IDfClientX;
import com.documentum.fc.client.IDfDocument;
import com.documentum.fc.client.IDfFolder;
import com.documentum.fc.client.IDfSession;
import com.documentum.fc.common.DfException;
import com.documentum.fc.common.DfId;
import com.documentum.operations.IDfMoveNode;
import com.documentum.operations.IDfMoveOperation;

public class Migrate {

    public Migrate() {}

    public String move(IDfSession mySession, String docId, String destination) {
        String str = "";
        try {
            IDfClientX clientx = new DfClientX();
            IDfMoveOperation mo = clientx.getMoveOperation();
            IDfFolder destinationDirectory = mySession.getFolderByPath(destination);
            // Here is the line that causes the error
            mo.setDestinationFolderId(destinationDirectory.getObjectId());
            IDfDocument doc = (IDfDocument) mySession.getObject(new DfId(docId));
            IDfMoveNode node = (IDfMoveNode) mo.add(doc);
            if (mo.execute()) {
                str = "Move operation successful.";
            } else {
                str = "Move operation failed.";
            }
        } catch (DfException e) {
            System.out.println(e.getLocalizedMessage());
        }
        return str;
    }
}
Instead of docId I am passing in the r_object_id of the file I wish to move, but I get the following error:
com.documentum.fc.client.DfFolder___PROXY cannot be cast to com.documentum.fc.client.IDfDocument
Does anyone know where my mistake is, or what I am doing wrong?
It's obvious: in the line
IDfDocument doc = (IDfDocument) mySession.getObject(new DfId(docId));
the docId parameter represents a folder object, not a document object. Do a type check first to be sure, and then use either IDfFolder or IDfDocument. If you're sure that you're moving a folder into another folder, then just change IDfDocument to IDfFolder.
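A sketch of that type check, meant to replace the cast inside the try block of the Migrate class above (the variable names come from that class; IDfPersistentObject is com.documentum.fc.client.IDfPersistentObject):

IDfPersistentObject obj = mySession.getObject(new DfId(docId));
if (obj instanceof IDfDocument) {
    // The id resolves to a document: move it as before.
    IDfMoveNode node = (IDfMoveNode) mo.add((IDfDocument) obj);
} else if (obj instanceof IDfFolder) {
    // The id resolves to a folder: add it as a folder instead of casting to IDfDocument.
    IDfMoveNode node = (IDfMoveNode) mo.add((IDfFolder) obj);
} else {
    str = "Object " + docId + " is neither a document nor a folder.";
}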

Parsing Information from URL Using Jsoup

I need help with my Java project using Jsoup (if you think there is a more efficient way to achieve the purpose, please let me know). The purpose of my program is to parse certain useful information from different URLs and put it in a text file. I am not an expert in HTML or JavaScript, so it has been difficult for me to code in Java exactly what I want to parse.
On the website that you see in the code below as one of the examples, the information I am interested in parsing with Jsoup is everything you can see in the table under "Routing" (Route, Location, Vessel/Voyage, Container Arrival Date, Container Departure Date; i.e. Origin, Seattle SSA Terminal T18, 26 Jun 15 A, 26 Jun 15 A, and so on).
So far, with Jsoup we have only been able to parse the title of the website; we have been unsuccessful in getting any of the body.
Here is the code that I used, which I got from an online source:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Jsouptest71115 {

    public static void main(String[] args) throws Exception {
        String url = "http://google.com/gentrack/trackingMain.do "
                + "?trackInput01=999061985";
        Document document = Jsoup.connect(url).get();
        String title = document.title();
        System.out.println("title : " + title);
        String body = document.select("body").text();
        System.out.println("Body: " + body);
    }
}
Working code:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;

public class Sample {

    public static void main(String[] args) {
        String url = "http://homeport8.apl.com/gentrack/blRoutingPopup.do";
        try {
            Connection.Response response = Jsoup.connect(url)
                    .data("blNbr", "999061985") // tracking number
                    .method(Connection.Method.POST)
                    .execute();
            Element tableElement = response.parse().getElementsByTag("table")
                    .get(2).getElementsByTag("table")
                    .get(2);
            Elements trElements = tableElement.getElementsByTag("tr");
            ArrayList<ArrayList<String>> tableArrayList = new ArrayList<>();
            for (Element trElement : trElements) {
                ArrayList<String> columnList = new ArrayList<>();
                for (int i = 0; i < 5; i++) {
                    columnList.add(i, trElement.children().get(i).text());
                }
                tableArrayList.add(columnList);
            }
            System.out.println("Origin/Location: "
                    + tableArrayList.get(1).get(1)); // row and column number
            System.out.println("Discharge Port/Container Arrival Date: "
                    + tableArrayList.get(5).get(3));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output:
Origin/Location: SEATTLE SSA TERMINAL (T18), WA  
Discharge Port/Container Arrival Date: 23 Jul 15  E
You need to utilize document.select("body"); the input to the select method is a CSS selector. To learn more about CSS selectors just Google them, or read this. Using CSS selectors you can easily identify parts of the web page body.
In your particular case you will have a different problem, though: the table you are after is inside an iframe, and if you look at the HTML of the web page you are visiting, its (the iframe's) URL is "http://homeport8.apl.com/gentrack/blRoutingFrame.do". If you visit this URL directly to access its content, you will get an exception, which is perhaps some restriction from the server. To get the content properly you need to visit two URLs via Jsoup: 1. http://homeport8.apl.com/gentrack/trackingMain.do?trackInput01=999061985 and 2. http://homeport8.apl.com/gentrack/blRoutingFrame.do?trackInput01=999061985
For the first URL you'll get nothing useful, but for the second URL you'll get the tables of your interest. Try using document.select("table"), which will give you a list of tables; iterate over this list and find the table of your interest. Once you have the table, use Element.select("tr") to get the table rows, and then for each "tr" use Element.select("td") to get the table cell data.
The web page you are visiting doesn't use CSS class and id selectors, which would have made reading it with Jsoup a lot easier, so I'm afraid iterating over document.select("table") is your best and easiest option.
Good luck.
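For illustration, a rough sketch of that iteration (assuming the second URL above answers a plain GET; the tab-separated printing is only for inspecting the rows):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableIterationSketch {
    public static void main(String[] args) throws Exception {
        Document document = Jsoup.connect(
                "http://homeport8.apl.com/gentrack/blRoutingFrame.do?trackInput01=999061985").get();
        // Walk every table, row by row, cell by cell, as described above.
        for (Element table : document.select("table")) {
            for (Element row : table.select("tr")) {
                StringBuilder line = new StringBuilder();
                for (Element cell : row.select("td")) {
                    line.append(cell.text()).append('\t');
                }
                System.out.println(line.toString().trim());
            }
        }
    }
}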

How to get an App category from play store by its package name in Android?

I want to fetch the app category from the Play Store through its unique identifier, i.e. the package name. I am using the following code, but it does not return any data. I also tried using AppsRequest.newBuilder().setAppId(query); still no help.
Thanks.
String AndroidId = "dead000beef";
MarketSession session = new MarketSession();
session.login("email", "passwd");
session.getContext().setAndroidId(AndroidId);

String query = "package:com.king.candycrushsaga";
AppsRequest appsRequest = AppsRequest.newBuilder().setQuery(query).setStartIndex(0)
        .setEntriesCount(10).setWithExtendedInfo(true).build();

session.append(appsRequest, new Callback<AppsResponse>() {
    @Override
    public void onResult(ResponseContext context, AppsResponse response) {
        String response1 = response.toString();
        Log.e("response", response1);
    }
});
session.flush();
Use this script:
######## Fetch app names and genres of apps from the Play Store URL, using package names #############
"""
Requirements for running this script:
1. requests library
   Note: run this command to avoid an InsecurePlatform warning: pip install --upgrade ndg-httpsclient
2. bs4
pip install requests
pip install bs4
"""
import requests
import csv
from bs4 import BeautifulSoup

# url to be used for package
APP_LINK = "https://play.google.com/store/apps/details?id="
output_list = []; input_list = []

# get input file path
print "Need input CSV file (absolute) path \nEnsure csv is of format: <package_name>, <id>\n\nEnter Path:"
input_file_path = str(raw_input())

# store package names and ids in list of tuples
with open(input_file_path, 'rb') as csvfile:
    for line in csvfile.readlines():
        (p, i) = line.strip().split(',')
        input_list.append((p, i))

print "\n\nSit back and relax, this might take a while!\n\n"

for package in input_list:
    # generate url, get html
    url = APP_LINK + package[0]
    r = requests.get(url)
    if not (r.status_code == 404):
        data = r.text
        soup = BeautifulSoup(data, 'html.parser')
        # parse result
        x = ""; y = ""
        try:
            x = soup.find('div', {'class': 'id-app-title'})
            x = x.text
        except:
            print "Package name not found for: %s" % package[0]
        try:
            y = soup.find('span', {'itemprop': 'genre'})
            y = y.text
        except:
            print "ID not found for: %s" % package[0]
        output_list.append([x, y])
    else:
        print "App not found: %s" % package[0]

# write to csv file
with open('results.csv', 'w') as fp:
    a = csv.writer(fp, delimiter=",")
    a.writerows(output_list)
This is what I did; the best and easiest solution:
https://androidquery.appspot.com/api/market?app=your.unique.package.name
Otherwise, you can get the source HTML and extract the string from it:
https://play.google.com/store/apps/details?id=your.unique.package.name
Get this string out of it, using split or substring methods:
<span itemprop="genre">Sports</span>
In this case, Sports is your category.
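A minimal sketch of that substring approach (the extractGenre helper and its hard-coded sample input are mine, for illustration; in practice the argument would be the page source fetched from the details URL above):

public class CategoryExtractor {

    // Slice out whatever sits between the genre span's opening and closing tags.
    static String extractGenre(String html) {
        String marker = "<span itemprop=\"genre\">";
        int start = html.indexOf(marker);
        if (start < 0) {
            return null; // tag not present in this page
        }
        start += marker.length();
        int end = html.indexOf("</span>", start);
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(extractGenre("<span itemprop=\"genre\">Sports</span>")); // prints "Sports"
    }
}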
Alternatively, use android-market-api; it will give you all the information about the application.

How to parse data in Talend with Java (coming from a previously produced .txt file)?

I have a process in Talend which gets the search result of a page, saves the HTML, and writes it into files, as seen here:
Initially I had a two-step process that parsed the data out of the HTML files in Java and wrote it to a MySQL database. Here is the code which basically does exactly that. (I'm a beginner, sorry for the lack of elegance.)
package org.jsoup.examples;

import java.io.*;

import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

public class parse2 {

    static parse2 parseIt2 = new parse2();
    String companyName = "Platzhalter";
    String jobTitle = "Platzhalter";
    String location = "Platzhalter";
    String timeAdded = "Platzhalter";

    public static void main(String[] args) throws IOException {
        parseIt2.getData();
    }

    public void getData() throws IOException {
        Document document = Jsoup.parse(new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
        Elements elements = document.select(".joblisting");
        for (Element element : elements) {
            // Parse data into elements
            Elements jobTitleElement = element.select(".job_title span");
            Elements companyNameElement = element.select(".company_name span[itemprop=name]");
            Elements locationElement = element.select(".locality span[itemprop=addressLocality]");
            Elements dateElement = element.select(".job_date_added [datetime]");
            // Strip data from unnecessary tags
            String companyName = companyNameElement.text();
            String jobTitle = jobTitleElement.text();
            String location = locationElement.text();
            String timeAdded = dateElement.attr("datetime");
            System.out.println("Firma:\t" + companyName + "\t" + jobTitle + "\t in:\t" + location + " \t Erstellt am \t" + timeAdded);
        }
    }
}
Now I want to do the process end-to-end in Talend, and I was assured this works.
I tried this (which looks quite shady to me):
Basically, I put all the imports in the "advanced settings" and the code in the "basic settings" section. The importLibrary is supposed to load the jsoup parsing library, as well as the MySQL connector (I might do the connection with Talend tools, though).
Obviously this isn't working. I tried to strip the base code of classes and such, and it was even worse. Can you help me get the generated .txt files parsed with Java here?
EDIT: Here is the link to the Talend job: http://www.share-online.biz/dl/8M5MD99NR1
EDIT2: I changed the code to the one I tried in JavaFlex, but it didn't work (the import part in the "start" part of the code, the rest in "body/main", and nothing in "end").
This is a problem related to Talend: in your code, use the complete method names, including their packages. For your document parsing, for example, you can use:
Document document = org.jsoup.Jsoup.parse(new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
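Extending that idea, a rough sketch of what the rest of the parsing could look like with every type fully qualified (the selectors are copied from the parse2 class above; this is untested inside Talend):

org.jsoup.nodes.Document document = org.jsoup.Jsoup.parse(
        new java.io.File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
for (org.jsoup.nodes.Element element : document.select(".joblisting")) {
    // Same fields as the standalone class, without any import statements.
    String companyName = element.select(".company_name span[itemprop=name]").text();
    String jobTitle = element.select(".job_title span").text();
    String location = element.select(".locality span[itemprop=addressLocality]").text();
    String timeAdded = element.select(".job_date_added [datetime]").attr("datetime");
    System.out.println("Firma:\t" + companyName + "\t" + jobTitle + "\t in:\t" + location + "\tErstellt am\t" + timeAdded);
}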
