Basically I intend to extract the entire category tree in Wikipedia under the root node "Economics" using the Wikipedia API sandbox. I don't need the content of the articles; I just need a few basic details like pageid, title, and (at some later stage of my work) revision history. As of now I can extract it level by level, but what I want is a recursive/iterative function which does it.
Each category contains sub-categories and articles (like each root contains nodes and leaves).
I wrote code to extract the first level into files: one file contains the articles, the second contains the names of the categories (daughters of the root, which can be further sub-classified).
Then I went a level deeper and extracted their categories, articles, and sub-categories using similar code.
The code remains similar at each level, but the problem is scalability. I need to reach the lowest leaves of all nodes, so I need a recursion which continuously checks until the end.
I labelled the files which contain categories with 'c_', so I can provide the condition while extracting the different levels.
Now for some reason it has entered an endless loop and keeps adding the same things again and again. I need a way out of it.
package wikiCrawl;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.Scanner;

import org.json.JSONArray;
import org.json.JSONException;

public class SubCrawl
{
    public static void main(String[] args) throws IOException, InterruptedException, JSONException
    {
        File file = new File("C:/Users/User/Desktop/Root/Economics_2.txt");
        crawlfile(file);
    }

    public static void crawlfile(File food) throws JSONException, IOException, InterruptedException
    {
        // Read the category names out of the input file (format: pageid;Category:Name)
        ArrayList<String> cat_list = new ArrayList<String>();
        Scanner scanner_cat = new Scanner(food);
        scanner_cat.useDelimiter("\n");
        while (scanner_cat.hasNext())
        {
            String scan_n = scanner_cat.next();
            if (scan_n.indexOf(":") > -1)
                cat_list.add(scan_n.substring(scan_n.indexOf(":") + 1));
        }
        scanner_cat.close();
        System.out.println(cat_list);

        // Fetch the members of each category
        URL category_json;
        for (int i_cat = 0; i_cat < cat_list.size(); i_cat++)
        {
            // .trim() removes leading and trailing whitespace
            category_json = new URL("https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3A" + cat_list.get(i_cat).replaceAll(" ", "%20").trim() + "&cmlimit=500");
            System.out.println(category_json);

            // Open the connection and read the whole JSON response into one string
            HttpURLConnection urlConnection = (HttpURLConnection) category_json.openConnection();
            BufferedReader reader = new BufferedReader(new InputStreamReader(urlConnection.getInputStream()));
            String line;
            String diff = "";
            while ((line = reader.readLine()) != null)
            {
                System.out.println(line);
                diff = diff + line;
            }
            reader.close();
            urlConnection.disconnect();

            JSONArray jsonarray_cat = new JSONArray(diff.substring(diff.indexOf("[{\"pageid\"")));
            System.out.println(jsonarray_cat);

            // Loop over the category members (each element of the JSON array is one member)
            for (int i_url = 0; i_url < jsonarray_cat.length(); i_url++)
            {
                int pageid = jsonarray_cat.getJSONObject(i_url).getInt("pageid"); // pageid is numeric in the JSON
                System.out.println(pageid);
                String title = jsonarray_cat.getJSONObject(i_url).getString("title");
                System.out.println(title);

                File food_year = new File("C:/Users/User/Desktop/Root/" + cat_list.get(i_cat).replaceAll(" ", "_").trim() + ".txt");
                File food_year2 = new File("C:/Users/User/Desktop/Root/c_" + cat_list.get(i_cat).replaceAll(" ", "_").trim() + ".txt");
                food_year.createNewFile();
                food_year2.createNewFile();

                BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(food_year, true)));
                BufferedWriter writer2 = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(food_year2, true)));

                if (title.contains("Category:"))
                {
                    writer2.write(pageid + ";" + title);
                    writer2.newLine();
                    writer2.flush();
                    // This recursion re-reads the file we are still appending to, and the
                    // category graph contains cycles, so the crawl never terminates.
                    crawlfile(food_year2);
                }
                else
                {
                    writer.write(pageid + ";" + title);
                    writer.newLine();
                    writer.flush();
                }
                writer.close();
                writer2.close();
            }
        }
    }
}
For starters, this might be too big a demand on the Wikimedia servers. There are over a million categories (1), and you should read Wikipedia:Database download - Why not just retrieve data from wikipedia.org at runtime. You would need to throttle your requests to about one per second or risk getting blocked, which means it would take about 11 days to fetch the full tree.
It would be much better to use the standard dumps at https://dumps.wikimedia.org/enwiki/; these will be easier to read and process, and you don't need to put a big load on the server.
Better still is to get a Wikimedia Labs account, which allows you to run queries on a replica of the database servers, or scripts on the dumps, without having to download some very big files.
To get just the economics categories, it's easiest to go via https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Economics, which has 1242 categories. You may find it easier to take the list of categories there and build the tree from that.
This will be better than a recursive approach. The problem with the Wikipedia category system is that it is not really a tree: it has plenty of loops. I would not be surprised if, by just following categories, you end up with most of Wikipedia.
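If you do crawl the API directly, the usual fix for the endless loop is to keep a set of categories you have already visited and skip anything you have seen before. A minimal sketch, where fetchMembers is a hypothetical placeholder standing in for the JSON-fetching code from the question:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CategoryCrawler
{
    public static void crawl(String rootCategory) throws Exception
    {
        Set<String> visited = new HashSet<>();   // categories we have already expanded
        Deque<String> queue = new ArrayDeque<>();
        queue.add(rootCategory);

        while (!queue.isEmpty())
        {
            String category = queue.poll();
            if (!visited.add(category))          // add() returns false if already present
                continue;                        // cycle in the category graph: skip it

            Thread.sleep(1000);                  // throttle to roughly one request per second

            for (String title : fetchMembers(category)) // hypothetical: the API call from the question
            {
                if (title.startsWith("Category:"))
                    queue.add(title);            // sub-category: expand later
                else
                    System.out.println(title);   // article: record it
            }
        }
    }

    // Placeholder for the categorymembers query shown in the question.
    private static List<String> fetchMembers(String category) throws Exception
    {
        throw new UnsupportedOperationException("plug in the API call here");
    }
}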
Related
I was able to connect Java to AWS S3 and perform basic operations like listing buckets. Now I need a way to read a CSV file from a bucket without downloading it. I am attaching my current code here.
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.Bucket;
import java.io.IOException;
import java.util.List;

public class test {

    public static void main(String args[]) throws IOException {
        AWSCredentials credentials = new BasicAWSCredentials("----", "----");
        AmazonS3 s3client = AmazonS3ClientBuilder
                .standard()
                .withCredentials(new AWSStaticCredentialsProvider(credentials))
                .withRegion(Regions.US_EAST_2)
                .build();
        List<Bucket> buckets = s3client.listBuckets();
        for (Bucket bucket : buckets) {
            System.out.println(bucket.getName());
        }
    }
}
There is a way, with code like this. I get the file we want to read into an S3Object obj, then pass its content stream to an InputStreamReader():
S3Object obj = s3client.getObject("<Bucket Name>", "File Name");
BufferedReader reader = new BufferedReader(new InputStreamReader(obj.getObjectContent()));
// read the header row first
String line = reader.readLine();
// this will store the fields of the first row in an array
String row[] = line.split(",");
// this will fetch the number of columns
int length = row.length;
while ((line = reader.readLine()) != null) {
    // store the fields of the current line in an array
    String value[] = line.split(",");
    for (int i = 0; i < length; i++) {
        System.out.print(value[i] + " ");
    }
    System.out.println();
}
The answers by @jay and @Elikill58 are super helpful! This just adds a bit of clarity and accessibility to them.
The way to get an object from an S3 bucket, once you have done all the authentication work, is the .getObject(String bucketName, String fileName) method. Note what the documentation says about file names:
An Amazon S3 bucket has no directory hierarchy such as you would find in a typical computer file system. You can, however, create a logical hierarchy by using object key names that imply a folder structure. For example, instead of naming an object sample.jpg, you can name it photos/2006/February/sample.jpg.
To get an object from such a logical hierarchy, specify the full key name for the object in the GET operation. For a virtual hosted-style request example, if you have the object photos/2006/February/sample.jpg, specify the resource as /photos/2006/February/sample.jpg. For a path-style request example, if you have the object photos/2006/February/sample.jpg in the bucket named examplebucket, specify the resource as /examplebucket/photos/2006/February/sample.jpg.
Once you have the S3Object that is returned, just pass it into the function below (which is a modified version of @jay's that fixes a few errors)!
private static void parseCSVS3Object(S3Object data) {
    // try-with-resources so the underlying S3 stream is closed when we are done
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(data.getObjectContent()))) {
        // Get all the csv headers
        String line = reader.readLine();
        String[] headers = line.split(",");
        // Get number of columns and print headers
        int length = headers.length;
        for (String header : headers) {
            System.out.print(header + " ");
        }
        while ((line = reader.readLine()) != null) {
            System.out.println();
            // get and print the next line (row)
            String[] row = line.split(",");
            for (String value : row) {
                System.out.print(value + " ");
            }
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
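For example (the bucket and key names here are placeholders):

S3Object obj = s3client.getObject("my-bucket", "data/input.csv"); // hypothetical bucket and key
parseCSVS3Object(obj);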
For your code to read the whole file, it needs the contents, and that means streaming them to the local system. However, you can use a ranged GET to read just a part.
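A minimal sketch of a ranged read, assuming the v1 Java SDK (the bucket and key names are placeholders):

import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

// Fetch only the first 1024 bytes of the object instead of the whole file.
GetObjectRequest request = new GetObjectRequest("my-bucket", "data/input.csv") // hypothetical names
        .withRange(0, 1023); // inclusive byte range
S3Object partial = s3client.getObject(request);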
I have one Java file named Employee.java. I have to read a CSV file where the employee details are stored, i.e.
(Employee_ID, Employee_Name, Employee_Designation, Employee_Mobile, Employee_Year_Of_Joining),
and convert it into an ArrayList.
In another file, Project.java, I have to read another CSV file where the project details are stored, i.e.
(Project_ID, Project_Name, Project_Description, Project_Start_Date, Project_End_Date, Employee_ID),
and convert this also into an ArrayList.
Now I have to send the Employee_ID, extract the Employee_Name and Employee_Designation, and produce this output:
Output:
Project_Name
Project_Description
Employee_Name
Employee_Designation
I presume you are learning Java, so writing all the code for you would be a disservice. And Stack Overflow is not intended to be a "write my code for me" service.
So here is a narrative to walk you through the steps to take in writing your code. Tip: learn to use these four steps to programming: write requirements, as you did in the Question; write a plain-language description to walk through the logic, as I did here; then write pseudo-code, without worrying about the exact syntax of any particular programming language, so you can work out the nuances and details; and finally write your code.
Write classes (or not)
If you will be doing further work with this data, write a class for each type, an Employee class and a Project class. Then read each input file, parsing each row to instantiate an object of the class you just wrote.
If not doing further work, you could skip writing the classes. In my description of looping below, substitute a row of input data for where I say Employee or Project object.
Tip: Use any of the several libraries for reading and parsing the CSV files. For example, Apache Commons CSV.
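A minimal sketch with Apache Commons CSV, using the column headers from the question (the file path and the choice to hold each row as a String array are assumptions; you would normally build an Employee object instead):

import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

static List<String[]> readEmployees(String path) throws IOException {
    List<String[]> employees = new ArrayList<>();
    try (Reader in = Files.newBufferedReader(Paths.get(path))) {
        // the first record is the header row: Employee_ID, Employee_Name, ...
        for (CSVRecord record : CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(in)) {
            employees.add(new String[] {
                    record.get("Employee_ID"),
                    record.get("Employee_Name"),
                    record.get("Employee_Designation")
            });
        }
    }
    return employees;
}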
Nested loop
How would you do this if you were a clerk rather than a programmer, with two written lists and a typewriter in front of you? The same logic applies here in programming.
Write a loop, going through the Project list.
For each Project element, get the project’s employee ID number.
Write a second loop, going through the Employee list. Extract from each Employee object the employee ID number. Compare this employee ID number to the employee ID number we have in hand for the project. If they match, write your result. And if they match, you can break out of this inner loop as there is no reason to continue examining further employee objects. If you exhaust the employee list without a match, you have an error condition to resolve however you see fit.
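A skeletal sketch of that nested loop, assuming Employee and Project classes with the obvious (hypothetical) getters:

for (Project project : projects) {
    String projectEmployeeId = project.getEmployeeId();   // hypothetical getter
    boolean found = false;
    for (Employee employee : employees) {
        if (employee.getId().equals(projectEmployeeId)) { // hypothetical getter
            System.out.println(project.getName());
            System.out.println(project.getDescription());
            System.out.println(employee.getName());
            System.out.println(employee.getDesignation());
            found = true;
            break;  // match found, no need to examine further employees
        }
    }
    if (!found) {
        // error condition: the project references an unknown employee ID
    }
}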
Maybe I have some code to help you. It reads an Excel file and saves the rows into an ArrayList; you only need to duplicate the code to read the second CSV.
The code is below. You may need to add the necessary libraries to run it; you can find the libraries I used in the POM.
package com.examen.excel;

import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

/**
 * @author wilso
 */
public class Excel {

    public static void main(String[] args) {
        try {
            ArrayList<String> dta = new ArrayList<>();
            try (Workbook workbook = WorkbookFactory.create(new File("C:\\Users\\wilso\\OneDrive\\Documents\\NetBeansProjects\\Excel\\src\\resources\\Employment.xlsx"))) {
                Sheet firstSheet = workbook.getSheetAt(0);
                Iterator<Row> iterator = firstSheet.iterator();
                StringBuilder result;
                while (iterator.hasNext()) {
                    result = new StringBuilder();
                    Row nextRow = iterator.next();
                    Iterator<Cell> cellIterator = nextRow.cellIterator();
                    while (cellIterator.hasNext()) {
                        Cell cell = cellIterator.next();
                        switch (cell.getCellType()) {
                            case Cell.CELL_TYPE_STRING:
                                result.append(cell.getStringCellValue());
                                break;
                            case Cell.CELL_TYPE_NUMERIC:
                                result.append(cell.getNumericCellValue());
                                break;
                        }
                        result.append(":");
                    }
                    dta.add(result.toString());
                }
                System.out.println("Resultado");
                System.out.println(dta);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The second option is this:
package com.generic;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;

/**
 * @author wilso
 */
public class prueba {

    public static void main(String[] args) {
        try {
            ArrayList<String> data = new ArrayList<>();
            BufferedReader bufferedReader = new BufferedReader(new FileReader(new File("C:\\Users\\wilso\\Desktop\\PRUEBA.csv")));
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                data.add(line);
            }
            bufferedReader.close();
            // Read the second file and do the matching here
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The image below shows the format of my settings file for a web bot I'm developing. If you look at line 31 in the image, you will see it says chromeVersion; this is so the program knows which version of the chromedriver to use. If the user enters an invalid response or leaves the field blank, the program will detect that, determine the version itself, and save the version it determines to a string called chromeVersion. After this is done I want to replace line 31 of that file with
"(31) chromeVersion(76/77/78), if you don't know this field will be filled automatically upon the first run of the bot): " + chromeVersion
To be clear, I do not want to rewrite the whole file; I just want to either change the value assigned to chromeVersion in the text file or rewrite that line with the version included.
Any suggestions or ways to do this would be much appreciated.
[image: the settings file, with the chromeVersion field on line 31]
You will need to rewrite the whole file, unless the byte length of the file remains the same after your modification. Since that is not guaranteed to be the case, and finding out is too cumbersome, here is a simple procedure:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class Lab1 {

    public static void main(String[] args) {
        String chromeVersion = "myChromeVersion";
        try {
            Path path = Paths.get("C:\\whatever\\path\\toYourFile.txt");
            List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
            int lineToModify = 30; // line 31 of the file is index 30 (the list is zero-based)
            lines.set(lineToModify, lines.get(lineToModify) + chromeVersion);
            Files.write(path, lines, StandardCharsets.UTF_8);
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
Note that this is not the best way to go for very large files, since it reads the whole file into memory. But for the small file you have, it is not an issue.
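If the file ever did get large, here is a sketch of a streaming alternative, assuming you are happy to write to a temporary file and then replace the original:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Stream the file line by line, rewriting only line 31, so nothing
// larger than a single line is held in memory.
static void appendToLine31(Path path, String chromeVersion) throws IOException {
    Path tmp = Files.createTempFile("settings", ".txt");
    try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8);
         BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
        String line;
        int lineNo = 0;
        while ((line = in.readLine()) != null) {
            lineNo++;
            out.write(lineNo == 31 ? line + chromeVersion : line);
            out.newLine();
        }
    }
    Files.move(tmp, path, StandardCopyOption.REPLACE_EXISTING);
}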
I've been looking at word stemming algorithms such as the porter algorithm, but everything I've found so far has dealt with files as input.
Are there any existing algorithms which would let me simply pass the stemmer a string, and have it return the stemmed string?
Something like:
String toBeStemmed = "The man worked tirelessly";
Stemmer s = new Stemmer();
String stemmed = s.stem(toBeStemmed);
The algorithms themselves don't take files. The code you found probably takes the file and reads it in as a series of Strings, which are fed to the algorithm. You just need to look at the part of the code that reads the Strings in from the file, and pass the Strings in yourself in a similar way.
In your example, toBeStemmed is a sentence that you want to tokenize first. Then you would stem the individual tokens/words, like 'worked' or 'tirelessly'.
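As a sketch of that, assuming a recent Lucene is on the classpath: its EnglishAnalyzer tokenizes a plain String and Porter-stems each token in one pass:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StemDemo {
    public static void main(String[] args) throws Exception {
        String toBeStemmed = "The man worked tirelessly";
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream ts = analyzer.tokenStream("field", new StringReader(toBeStemmed))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // prints the stemmed tokens, e.g. "work" for "worked"
            }
            ts.end();
        }
    }
}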
Here's a fine morphological analyzer that I use as a stemmer in some of my projects.
stemmer JAR: https://code.google.com/p/hunglish-webapp/source/browse/trunk/#trunk%2Flib%2Fnet%2Fsf%2Fjhunlang%2Fjmorph%2F1.0
stemmer source: https://code.google.com/p/j-morph/source/checkout
language resource files: https://code.google.com/p/hunglish-webapp/source/browse/trunk/#trunk%2Fsrc%2Fmain%2Fresources%2Fresources-lang%2Fjmorph
How I use it with Lucene: https://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/jmorph/
properties file: https://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/resources/META-INF/spring/stemmer.properties
Example usage:
import java.util.List;

import net.sf.jhunlang.jmorph.lemma.Lemma;
import net.sf.jhunlang.jmorph.lemma.Lemmatizer;
import net.sf.jhunlang.jmorph.analysis.Analyser;
import net.sf.jhunlang.jmorph.analysis.AnalyserContext;
import net.sf.jhunlang.jmorph.analysis.AnalyserControl;
import net.sf.jhunlang.jmorph.factory.Definition;
import net.sf.jhunlang.jmorph.factory.JMorphFactory;
import net.sf.jhunlang.jmorph.sample.AnalyserConfig;
import net.sf.jhunlang.jmorph.sword.parser.EnglishAffixReader;
import net.sf.jhunlang.jmorph.sword.parser.EnglishReader;

AnalyserConfig acEn = new AnalyserConfig();
// TODO: set the path to the English affix file
String enAff = "src/main/resources/resources-lang/jmorph/en.aff";
Definition affixDef = acEn.createDefinition(enAff, "utf-8", EnglishAffixReader.class);
// TODO: set the path to the English dict file
String enDic = "src/main/resources/resources-lang/jmorph/en.dic";
Definition dicDef = acEn.createDefinition(enDic, "utf-8", EnglishReader.class);

int enRecursionDepth = 3;
acEn.setRecursionDepth(affixDef, enRecursionDepth);

JMorphFactory jf = new JMorphFactory();
Analyser enAnalyser = jf.build(new Definition[] { affixDef, dicDef });

// renamed from acEn to avoid clashing with the AnalyserConfig variable above
AnalyserControl ctrlEn = new AnalyserControl(AnalyserControl.ALL_COMPOUNDS);
AnalyserContext analyserContextEn = new AnalyserContext(ctrlEn);

boolean enStripDerivates = true;
Lemmatizer enLemmatizer = new net.sf.jhunlang.jmorph.lemma.LemmatizerImpl(enAnalyser, enStripDerivates, analyserContextEn);

// After the somewhat complex initialization, here we go:
List<Lemma> lemmas = enLemmatizer.lemmatize("worked");
for (Lemma lemma : lemmas) {
    System.out.println(lemma.getWord());
}
I need to extract data from some PDF documents (using Java). I need to know what would be the easiest way to do it.
I tried iText. It's fairly complicated for my needs. Besides, I guess it is not available for free for commercial projects, so it is not an option. I also gave PDFBox a try, and ran into various NoClassDefFoundError errors.
I googled and came across several other options such as PDF Clown and jPod, but I do not have time to experiment with all of these libraries. I am relying on the community's experience with PDF reading through Java.
Note that I do not need to create or manipulate PDF documents; I just need to extract textual data from PDF documents with a moderate level of layout complexity.
Please suggest the quickest and easiest way to extract text from PDF documents. Thanks.
I recommend trying Apache Tika. Apache Tika is basically a toolkit that extracts data from many types of documents, including PDFs.
The benefit of Tika (besides being free) is that it used to be a subproject of Apache Lucene, which is a very robust open-source search engine. Tika includes a built-in PDF parser that uses a SAX content handler to pass PDF data to your application. It can also extract data from encrypted PDFs, and it allows you to create or subclass an existing parser to customize the behavior.
The code is simple. To extract the data from a PDF, all you need to do is create a Parser class that implements the Parser interface and define a parse() method:
public void parse(
        InputStream stream, ContentHandler handler,
        Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {
    metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
    metadata.set("Hello", "World");
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    xhtml.endDocument();
}
Then, to run the parser, you could do something like this:
InputStream input = new FileInputStream(new File(resourceLocation));
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
parser.parse(input, textHandler, metadata);
input.close();
System.out.println("Title: " + metadata.get("title"));
System.out.println("Author: " + metadata.get("Author"));
System.out.println("content: " + textHandler.toString());
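More recent Tika versions also ship a one-call facade that wraps all of this; a sketch, assuming tika-parsers is on the classpath and using a placeholder file name:

import java.io.File;
import org.apache.tika.Tika;

// Auto-detects the file type and returns the extracted plain text.
Tika tika = new Tika();
String content = tika.parseToString(new File("document.pdf"));
System.out.println(content);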
I am using JPedal and I'm really happy with the results. It isn't free, but it's high quality, and the output for image generation from PDFs or text extraction is really nice.
And as a paid library, the support is always there to answer questions.
I have used PDFBox to extract text for Lucene indexing without too many issues. Its error/warning logging is quite verbose, if I remember right. What was the cause of the errors you received?
I understand this post is pretty old, but I would recommend using iText from here:
http://sourceforge.net/projects/itext/
If you are using Maven, you can pull the JARs in from Maven Central:
http://mvnrepository.com/artifact/com.itextpdf/itextpdf
I can't understand how using it can be difficult:
PdfReader pdf = new PdfReader("path to your pdf file");
// getTextFromPage is a static method on PdfTextExtractor in iText 5
String output = PdfTextExtractor.getTextFromPage(pdf, pageNumber);
assert output.contains("whatever you want to validate on that page");
Import these classes and add the JAR file pdfbox-app-2.0.x.
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.FindBy;
import org.testng.Assert;
import org.testng.annotations.Test;
import java.io.File;
import java.io.IOException;
import java.text.ParseException;
import java.util.List;
import org.apache.log4j.Logger;
import org.apache.log4j.PropertyConfigurator;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.chrome.ChromeDriver;
import com.coencorp.selenium.framework.BasePage;
import com.coencorp.selenium.framework.ExcelReadWrite;
import com.relevantcodes.extentreports.LogStatus;
Add this code inside the class.
public void showList() throws InterruptedException, IOException {
    showInspectionsLink.click();
    waitForElement(hideInspectionsLink);
    printButton.click();
    Thread.sleep(10000);
    String downloadPath = "C:\\Users\\Updoer\\Downloads";
    File getLatestFile = getLatestFilefromDir(downloadPath);
    String fileName = getLatestFile.getName();
    Assert.assertTrue(fileName.equals("Inspections.pdf"), "Downloaded file name is not matching with expected file name");
    Thread.sleep(10000);
    //testVerifyPDFInURL();
    PDDocument pd;
    pd = PDDocument.load(new File("C:\\Users\\Updoer\\Downloads\\Inspections.pdf"));
    System.out.println("Total Pages: " + pd.getNumberOfPages());
    PDFTextStripper pdf = new PDFTextStripper();
    System.out.println(pdf.getText(pd));
    pd.close(); // release the PDF document
}
Add these methods to the same class.
public void testVerifyPDFInURL() {
    WebDriver driver = new ChromeDriver();
    driver.get("C:\\Users\\Updoer\\Downloads\\Inspections.pdf");
    driver.findElement(By.linkText("Adeeb Khan")).click();
    String getURL = driver.getCurrentUrl();
    Assert.assertTrue(getURL.contains(".pdf"));
}

private File getLatestFilefromDir(String dirPath) {
    File dir = new File(dirPath);
    File[] files = dir.listFiles();
    if (files == null || files.length == 0) {
        return null;
    }
    File lastModifiedFile = files[0];
    for (int i = 1; i < files.length; i++) {
        if (lastModifiedFile.lastModified() < files[i].lastModified()) {
            lastModifiedFile = files[i];
        }
    }
    return lastModifiedFile;
}