I have the following piece of code that uses Java 7 features such as java.nio.file.Files and java.nio.file.Paths:
import java.io.File;
import java.io.IOException;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;
import com.fasterxml.jackson.databind.node.ObjectNode;
public class JacksonObjectMapper {

    public static void main(String[] args) throws IOException {
        byte[] jsonData = Files.readAllBytes(Paths.get("employee.txt"));

        ObjectMapper objectMapper = new ObjectMapper();

        Employee emp = objectMapper.readValue(jsonData, Employee.class);
        System.out.println("Employee Object\n" + emp);

        Employee emp1 = createEmployee();
        objectMapper.configure(SerializationFeature.INDENT_OUTPUT, true);
        StringWriter stringEmp = new StringWriter();
        objectMapper.writeValue(stringEmp, emp1);
        System.out.println("Employee JSON is\n" + stringEmp);
    }
}
Now I have to run the same code on Java 6. What are the best possible alternatives other than using FileReader?
In the Files class source you can see that readAllBytes reads the bytes from an InputStream:
public static byte[] readAllBytes(Path path) throws IOException {
    long size = size(path);
    if (size > (long) Integer.MAX_VALUE)
        throw new OutOfMemoryError("Required array size too large");

    try (InputStream in = newInputStream(path)) {
        return read(in, (int) size);
    }
}
In return read(in, (int)size) it uses a buffer to read the data from the InputStream.
So you can do it the same way yourself, or just use Guava or Apache Commons IO (http://commons.apache.org/io/).
Alternatives are the classes from java.io or Apache Commons IO; Guava IO can also help.
Guava is the most modern, so I think it is the best solution for you.
Read more: Guava's I/O package utilities, explained.
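For example, here is a minimal sketch of reading a whole file into a byte array on Java 6 with Guava (a Commons IO equivalent is shown in a comment); the class name is arbitrary and the file name is just the one from the question:
import java.io.File;
import java.io.IOException;
import com.google.common.io.Files;            // Guava
// import org.apache.commons.io.FileUtils;    // Commons IO alternative

public class ReadBytesJava6 {
    public static void main(String[] args) throws IOException {
        // Guava: reads the whole file into a byte array
        byte[] jsonData = Files.toByteArray(new File("employee.txt"));
        // Commons IO equivalent:
        // byte[] jsonData = FileUtils.readFileToByteArray(new File("employee.txt"));
        System.out.println("Read " + jsonData.length + " bytes");
    }
}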
If you really don't want to use FileReader (though I didn't understand why), you can go for FileInputStream.
Syntax:
InputStream inputStream = new FileInputStream("path/to/your/file");
Reader reader = new InputStreamReader(inputStream, "UTF-8"); // specify the file's encoding explicitly
You are right to avoid FileReader as that always uses the default character encoding for the platform it is running on, which may not be the same as the encoding of the JSON file.
ObjectMapper has an overload of readValue that can read directly from a File; there's no need to buffer the content in a temporary byte[]:
Employee emp = objectMapper.readValue(new File("employee.txt"), Employee.class);
You can read all bytes of a file into a byte array even in Java 6, as described in an answer to a related question:
import java.io.IOException;
import java.io.RandomAccessFile;

RandomAccessFile f = new RandomAccessFile(fileName, "r");
try {
    if (f.length() > Integer.MAX_VALUE)
        throw new IOException("File is too large");
    byte[] b = new byte[(int) f.length()];
    f.readFully(b);
    if (f.getFilePointer() != f.length())
        throw new IOException("File length changed while reading");
} finally {
    f.close();
}
I added the checks leading to exceptions and the change from read to readFully, which was proposed in comments under the original answer.
I have a large JSON file (2.5MB) containing about 80000 lines.
It looks like this:
{
    "a": 123,
    "b": 0.26,
    "c": [HUGE irrelevant object],
    "d": 32
}
I only want the integer values stored for keys a, b and d and ignore the rest of the JSON (i.e. ignore whatever is there in the c value).
I cannot modify the original JSON as it is created by a 3rd party service, which I download from its server.
How do I do this without loading the entire file in memory?
I tried using the Gson library and created the bean like this:
import com.google.gson.annotations.Expose;
import com.google.gson.annotations.SerializedName;

public class MyJsonBean {

    @SerializedName("a")
    @Expose
    public Integer a;

    @SerializedName("b")
    @Expose
    public Double b;

    @SerializedName("d")
    @Expose
    public Integer d;
}
but even then, in order to deserialize it using Gson, I need to download and read the whole file into memory first and then pass it as a string to Gson?
File myFile = new File(<FILENAME>);
myFile.createNewFile();

URL url = new URL(<URL>);
OutputStream out = new BufferedOutputStream(new FileOutputStream(myFile));
URLConnection conn = url.openConnection();
HttpURLConnection httpConn = (HttpURLConnection) conn;
InputStream in = conn.getInputStream();

byte[] buffer = new byte[1024];
int numRead;
while ((numRead = in.read(buffer)) != -1) {
    out.write(buffer, 0, numRead);
}
in.close();
out.close(); // close the output before reading the file back

FileInputStream fis = new FileInputStream(myFile);
byte[] data = new byte[(int) myFile.length()];
fis.read(data);
fis.close();
String str = new String(data, "UTF-8");

Gson gson = new Gson();
MyJsonBean response = gson.fromJson(str, MyJsonBean.class);
System.out.println("a: " + response.a + "" + response.b + "" + response.d);
Is there any way to avoid loading the whole file and just get the relevant values that I need?
You should definitely check different approaches and libraries. If you really care about performance, check the Gson, Jackson and JsonPath libraries and choose the fastest one. You will still have to download the whole JSON file to local disk (probably a TMP folder) and parse it after that.
A simple JsonPath solution could look like this:
import com.jayway.jsonpath.DocumentContext;
import com.jayway.jsonpath.JsonPath;

import java.io.File;

public class JsonPathApp {
    public static void main(String[] args) throws Exception {
        File jsonFile = new File("./resource/test.json").getAbsoluteFile();
        DocumentContext documentContext = JsonPath.parse(jsonFile);

        System.out.println("" + documentContext.read("$.a"));
        System.out.println("" + documentContext.read("$.b"));
        System.out.println("" + documentContext.read("$.d"));
    }
}
Notice that I do not create any POJO here; I just read the given values using a JsonPath expression, similar to XPath. You can do the same with Jackson:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;

public class JsonPathApp {
    public static void main(String[] args) throws Exception {
        File jsonFile = new File("./resource/test.json").getAbsoluteFile();
        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(jsonFile);

        System.out.println(root.get("a"));
        System.out.println(root.get("b"));
        System.out.println(root.get("d"));
    }
}
We do not need JsonPath expressions here because the values we need are directly in the root node. As you can see, the API looks almost the same. We can also create a POJO structure:
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.math.BigDecimal;

public class JsonPathApp {
    public static void main(String[] args) throws Exception {
        File jsonFile = new File("./resource/test.json").getAbsoluteFile();
        ObjectMapper mapper = new ObjectMapper();
        Pojo pojo = mapper.readValue(jsonFile, Pojo.class);

        System.out.println(pojo);
    }
}

@JsonIgnoreProperties(ignoreUnknown = true)
class Pojo {

    private Integer a;
    private BigDecimal b;
    private Integer d;

    // getters, setters, toString
}
Even though both libraries allow reading the JSON payload directly from a URL, I suggest downloading it in a separate step using the best approach you can find. For more info, read this article: Download a File From an URL in Java.
There are some excellent libraries for parsing large JSON files with minimal resources. One is the popular Gson library, whose streaming API gives you the effect of parsing the file as both a stream and an object: it handles each record as it passes and then discards it, keeping memory usage low.
If you're interested in the Gson streaming approach, there's a detailed tutorial for it here.
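For reference, here is a minimal sketch of that streaming approach, assuming the file has been saved locally as big.json in UTF-8 and has the flat structure shown above; skipValue() lets you skip the huge "c" value without ever buffering it:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import com.google.gson.stream.JsonReader;

public class GsonStreamingExample {
    public static void main(String[] args) throws IOException {
        JsonReader reader = new JsonReader(
                new InputStreamReader(new FileInputStream("big.json"), "UTF-8"));
        try {
            reader.beginObject();
            while (reader.hasNext()) {
                String name = reader.nextName();
                if (name.equals("a")) {
                    System.out.println("a: " + reader.nextInt());
                } else if (name.equals("b")) {
                    System.out.println("b: " + reader.nextDouble());
                } else if (name.equals("d")) {
                    System.out.println("d: " + reader.nextInt());
                } else {
                    reader.skipValue(); // skips the huge "c" value without loading it
                }
            }
            reader.endObject();
        } finally {
            reader.close();
        }
    }
}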
I only want the integer values stored for keys a, b and d and ignore the rest of the JSON (i.e. ignore whatever is there in the c value). ... How do I do this without loading the entire file in memory?
One way would be to use jq's so-called streaming parser, invoked with the --stream option. This does exactly what you want, but there is a trade-off between space and time, and using the streaming parser is usually more difficult.
In the present case, for example, using the non-streaming (i.e., default) parser, one could simply write:
jq '.a, .b, .d' big.json
Using the streaming parser, you would have to write something like:
jq --stream 'select(length==2 and .[0][-1] == ("a","b","d"))[1]' big.json
or if you prefer:
jq -c --stream '["a","b","d"] as $keys | select(length==2 and (.[0][-1] | IN($keys[])))[1]' big.json
In certain cases, you could achieve significant speedup by wrapping the filter in a call to limit, e.g.
["a","b","d"] as $keys
| limit($keys|length;
    select(length==2 and .[0][-1] == ("a","b","d"))[1])
Note on Java and jq
Although there are Java bindings for jq (see e.g. "Q: What language bindings are available for Java?" in the jq FAQ), I do not know of any that work with the --stream option.
However, since 2.5MB is tiny for jq, you could use one of the available Java-jq bindings without bothering with the streaming parser.
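If you would rather not depend on a binding at all, here is a minimal sketch (my own addition, not from the jq FAQ) of invoking the jq command-line tool from Java with ProcessBuilder, assuming a jq binary is installed and on the PATH:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class JqRunner {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Runs: jq '.a, .b, .d' big.json
        ProcessBuilder pb = new ProcessBuilder("jq", ".a, .b, .d", "big.json");
        pb.redirectErrorStream(true);
        Process process = pb.start();

        BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line); // prints the values of a, b and d, one per line
        }
        process.waitFor();
    }
}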
Basically I intend to extract the entire category tree in Wikipedia under the root node "Economics", using the Wikipedia API sandbox. I don't need the content of the articles; I just need a few basic details like pageid, title and revision history (at some later stage of my work). As of now I can extract it level by level, but what I want is a recursive/iterative function which does it.
Each category contains categories and articles (like each root contains nodes and leaves).
I wrote code to extract the first level into files: one file contains the articles, a second file contains the names of the categories (daughters of the root, which can be further sub-classified).
Then I went one level down and extracted their categories, articles and sub-categories using similar code.
The code remains similar in each case, but the problem is scalability. I need to reach the lowest leaves of all nodes, so I need a recursion which continues checking until the end.
I labelled files which contain categories with a 'c_' prefix, so I can use that as the condition while extracting the different levels.
Now for some reason it has entered an endless loop and keeps adding the same things again and again. I need a way out of it.
package wikiCrawl;

import java.awt.List;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Scanner;
import org.apache.commons.io.FileUtils;
import org.json.CDL;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

public class SubCrawl
{
    public static void main(String[] args) throws IOException, InterruptedException, JSONException
    {
        File file = new File("C:/Users/User/Desktop/Root/Economics_2.txt");
        crawlfile(file);
    }

    public static void crawlfile(File food) throws JSONException, IOException, InterruptedException
    {
        ArrayList<String> cat_list = new ArrayList<String>();
        Scanner scanner_cat = new Scanner(food);
        scanner_cat.useDelimiter("\n");

        while (scanner_cat.hasNext())
        {
            String scan_n = scanner_cat.next();
            if (scan_n.indexOf(":") > -1)
                cat_list.add(scan_n.substring(scan_n.indexOf(":") + 1));
        }

        System.out.println(cat_list);

        //get the categories in different languages
        URL category_json;
        for (int i_cat = 0; i_cat < cat_list.size(); i_cat++)
        {
            category_json = new URL("https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3A" + cat_list.get(i_cat).replaceAll(" ", "%20").trim() + "&cmlimit=500"); //.trim() removes leading and trailing whitespace
            System.out.println(category_json);

            HttpURLConnection urlConnection = (HttpURLConnection) category_json.openConnection(); //Opens the connection to the URL so clients can communicate with the resources.
            BufferedReader reader = new BufferedReader(new InputStreamReader(category_json.openStream()));

            String line;
            String diff = "";
            while ((line = reader.readLine()) != null)
            {
                System.out.println(line);
                diff = diff + line;
            }

            urlConnection.disconnect();
            reader.close();

            JSONArray jsonarray_cat = new JSONArray(diff.substring(diff.indexOf("[{\"pageid\"")));
            System.out.println(jsonarray_cat);

            //Loop categories
            for (int i_url = 0; i_url < jsonarray_cat.length(); i_url++) //jsonarray_cat is an array of JSON objects; we are looping through each object
            {
                //Get the URL part (Categorie isn't correct)
                int pageid = Integer.parseInt(jsonarray_cat.getJSONObject(i_url).getString("pageid")); //this can be written in a much better way
                System.out.println(pageid);

                String title = jsonarray_cat.getJSONObject(i_url).getString("title");
                System.out.println(title);

                File food_year = new File("C:/Users/User/Desktop/Root/" + cat_list.get(i_cat).replaceAll(" ", "_").trim() + ".txt");
                File food_year2 = new File("C:/Users/User/Desktop/Root/c_" + cat_list.get(i_cat).replaceAll(" ", "_").trim() + ".txt");
                food_year.createNewFile();
                food_year2.createNewFile();

                BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(food_year, true)));
                BufferedWriter writer2 = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(food_year2, true)));

                if (title.contains("Category:"))
                {
                    writer2.write(pageid + ";" + title);
                    writer2.newLine();
                    writer2.flush();
                    crawlfile(food_year2);
                }
                else
                {
                    writer.write(pageid + ";" + title);
                    writer.newLine();
                    writer.flush();
                }
            }
        }
    }
}
For starters this might be too big a demand on the Wikimedia servers. There are over a million categories (1), and you need to read Wikipedia:Database download - Why not just retrieve data from wikipedia.org at runtime. You would need to throttle your requests to about one per second or risk getting blocked, which means it would take about 11 days to get the full tree.
It would be much better to use the standard dumps at https://dumps.wikimedia.org/enwiki/; these will be easier to read and process, and you don't need to put a big load on the server.
Still better is to get a Wikimedia Labs account, which allows you to run queries on a replica of the database servers, or scripts on the dumps, without having to download some very big files.
To get just the economics categories, it's easiest to go via https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Economics which has 1242 categories. You may find it easier to use the list of categories there and build the tree from that.
This will be better than a recursive approach. The problem with the Wikipedia category system is that it is not really a tree; it has plenty of loops. I would not be surprised if, by following categories, you end up covering most of Wikipedia.
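One way to keep the recursion from revisiting the same branches, whichever data source you use, is to remember which category titles you have already expanded. Here is a minimal sketch; the fetchSubcategories helper is hypothetical and stands in for whatever API call or dump query you use:
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CategoryWalker {

    private final Set<String> visited = new HashSet<String>();

    public void crawl(String categoryTitle) {
        // Skip categories we have already expanded; the category graph contains cycles
        if (!visited.add(categoryTitle)) {
            return;
        }
        // fetchSubcategories is a hypothetical helper wrapping the categorymembers lookup
        for (String sub : fetchSubcategories(categoryTitle)) {
            crawl(sub);
        }
    }

    private List<String> fetchSubcategories(String categoryTitle) {
        // ... call the MediaWiki API (or query a dump) and return the "Category:" members ...
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}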
1 byte = 8 bits. How can I create and store 11001100 in those 8 bits,
so that the file is 1 byte in size?
What should be the file format?
All this in Java.
To write bytes to a file, you can use FileOutputStream.
See the Basic I/O lesson in Oracle's Java Tutorials and the API documentation.
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class Example {
    public static void main(String[] args) throws IOException {
        try (OutputStream out = new FileOutputStream("example.bin")) {
            out.write(0b11001100); // writes the single byte 11001100 (0xCC)
        }
    }
}
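To confirm that the file really contains that single byte, you could read it back; a minimal sketch:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadBack {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream("example.bin")) {
            int value = in.read(); // reads one byte as an int in the range 0..255
            System.out.println(Integer.toBinaryString(value)); // prints 11001100
            System.out.println(in.read()); // prints -1: end of the 1-byte file
        }
    }
}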
In C, when I call open() to open a file descriptor, I have to explicitly pass the O_SYNC flag to ensure that writes to this file will be persisted to disk by the time write() returns. If I want to, I can not supply O_SYNC to open(), and then my writes will return much more quickly because they only have to make it into a filesystem cache before returning. If I want to, later on I can force all outstanding writes to this file to be written to disk by calling fsync(), which blocks until that operation has finished. (More details are available on all this in the Linux man page.)
Is there any way to do this in Java? The most similar thing I could find was using a BufferedOutputStream and calling .flush() on it, but if I'm doing writes to randomized file offsets I believe this would mean the internal buffer for the output stream could end up consuming a lot of memory.
Using the Java 7 NIO FileChannel#force method:
RandomAccessFile aFile = new RandomAccessFile("file.txt", "rw");
FileChannel channel = aFile.getChannel();
// .....................
// flushes all unwritten data from the channel to the disk
channel.force(true);
An important detail:
If the file does not reside on a local device then no such guarantee is made.
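If you are already going through RandomAccessFile, another option (not shown above, just a sketch) is to open it in "rwd" or "rws" mode; the javadoc documents these modes as requiring every update to the file's content (and, for "rws", also its metadata) to be written synchronously to the underlying storage device, which is close to O_DSYNC/O_SYNC semantics:
import java.io.IOException;
import java.io.RandomAccessFile;

public class SyncWrite {
    public static void main(String[] args) throws IOException {
        // "rwd": content updates are written synchronously; "rws" also syncs metadata
        RandomAccessFile file = new RandomAccessFile("file.txt", "rwd");
        try {
            file.write("hello".getBytes("UTF-8"));
        } finally {
            file.close();
        }
    }
}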
Based on Sergey Tachenov's comment, I found that you can use FileChannel for this. Here's some sample code that I think does the trick:
import java.nio.*;
import java.nio.channels.*;
import java.nio.file.*;
import java.nio.file.attribute.*;
import java.io.*;
import java.util.*;
import java.util.concurrent.*;
import static java.nio.file.StandardOpenOption.*;

public class Main {
    public static void main(String[] args) throws Exception {
        // Open the file as a FileChannel.
        Set<OpenOption> options = new HashSet<>();
        options.add(WRITE);
        // options.add(SYNC); <------- This would force O_SYNC semantics.
        try (FileChannel channel = FileChannel.open(Paths.get("./test.txt"), options)) {
            // Generate a bit of data to write.
            ByteBuffer buffer = ByteBuffer.allocate(4096);
            for (int i = 0; i < 10; i++) {
                buffer.put(i, (byte) i);
            }

            // Choose a random offset between 0 and 1023 and write to it.
            long offset = ThreadLocalRandom.current().nextLong(0, 1024);
            channel.write(buffer, offset);
        }
    }
}
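If you are writing through a plain FileOutputStream rather than a channel, the closest analogue to an explicit fsync() that I know of is FileDescriptor.sync(); a minimal sketch:
import java.io.FileOutputStream;
import java.io.IOException;

public class FsyncExample {
    public static void main(String[] args) throws IOException {
        FileOutputStream out = new FileOutputStream("test.txt");
        try {
            out.write("some data".getBytes("UTF-8"));
            out.flush();        // pushes any buffered data out of the stream
            out.getFD().sync(); // asks the OS to flush it to the storage device, like fsync()
        } finally {
            out.close();
        }
    }
}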
I need to extract data from some PDF documents (using Java). I need to know what would be the easiest way to do it.
I tried iText. It's fairly complicated for my needs. Besides, I guess it is not available for free for commercial projects, so it is not an option. I also gave PDFBox a try and ran into various NoClassDefFoundError errors.
I googled and came across several other options such as PDF Clown, jPod, but I do not have time to experiment with all of these libraries. I am relying on community's experience with PDF reading thru Java.
Note that I do not need to create or manipulate PDF documents. I just need to extract textual data from PDF documents with a moderate level of layout complexity.
Please suggest the quickest and easiest way to extract text from PDF documents. Thanks.
I recommend trying Apache Tika. Apache Tika is basically a toolkit that extracts data from many types of documents, including PDFs.
The benefit of Tika (besides being free) is that it used to be a subproject of Apache Lucene, which is a very robust open-source search engine. Tika includes a built-in PDF parser that uses a SAX content handler to pass PDF data to your application. It can also extract data from encrypted PDFs, and it allows you to create or subclass an existing parser to customize the behavior.
The code is simple. To extract the data from a PDF, all you need to do is create a Parser class that implements the Parser interface and define a parse() method:
public void parse(
        InputStream stream, ContentHandler handler,
        Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {

    metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
    metadata.set("Hello", "World");

    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    xhtml.endDocument();
}
Then, to run the parser, you could do something like this:
InputStream input = new FileInputStream(new File(resourceLocation));
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext(); // required by the parse() signature in recent Tika versions
PDFParser parser = new PDFParser();
parser.parse(input, textHandler, metadata, context);
input.close();

System.out.println("Title: " + metadata.get("title"));
System.out.println("Author: " + metadata.get("Author"));
System.out.println("content: " + textHandler.toString());
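If you only need the plain text and not a custom parser, the Tika facade class is even shorter; a minimal sketch (the file name is just an example):
import java.io.File;
import org.apache.tika.Tika;

public class TikaFacadeExample {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Detects the document type (PDF here) and returns the extracted plain text
        String text = tika.parseToString(new File("document.pdf"));
        System.out.println(text);
    }
}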
I am using JPedal and I'm really happy with the results. It isn't free, but it's high quality and the output for image generation from PDFs or text extraction is really nice.
And as a paid library, the support is always there to answer questions.
I have used PDFBox to extract text for Lucene indexing without too many issues. Its error/warning logging is quite verbose if I remember right - what was the cause for those errors you received?
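For reference, a minimal PDFBox text-extraction sketch along those lines, assuming the 2.x API and an example file name:
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfBoxExample {
    public static void main(String[] args) throws IOException {
        PDDocument document = PDDocument.load(new File("document.pdf"));
        try {
            // Extracts the text of all pages as a single string
            String text = new PDFTextStripper().getText(document);
            System.out.println(text);
        } finally {
            document.close();
        }
    }
}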
I understand this post is pretty old, but I would recommend using iText from here:
http://sourceforge.net/projects/itext/
If you are using maven you can pull the jars in from maven central:
http://mvnrepository.com/artifact/com.itextpdf/itextpdf
I can't understand how using it can be difficult:
PdfReader pdf = new PdfReader("path to your pdf file");
// In iText 5, getTextFromPage is a static method on PdfTextExtractor
String output = PdfTextExtractor.getTextFromPage(pdf, pageNumber);
assert output.contains("whatever you want to validate on that page");
Import these classes and add the pdfbox-app-2.0 jar file.
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.FindBy;
import org.testng.Assert;
import org.testng.annotations.Test;
import java.io.File;
import java.io.IOException;
import java.text.ParseException;
import java.util.List;
import org.apache.log4j.Logger;
import org.apache.log4j.PropertyConfigurator;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.chrome.ChromeDriver;
import com.coencorp.selenium.framework.BasePage;
import com.coencorp.selenium.framework.ExcelReadWrite;
import com.relevantcodes.extentreports.LogStatus;
Add this code inside the class.
public void showList() throws InterruptedException, IOException {
    showInspectionsLink.click();
    waitForElement(hideInspectionsLink);
    printButton.click();
    Thread.sleep(10000);

    String downloadPath = "C:\\Users\\Updoer\\Downloads";
    File getLatestFile = getLatestFilefromDir(downloadPath);
    String fileName = getLatestFile.getName();
    Assert.assertTrue(fileName.equals("Inspections.pdf"), "Downloaded file name is not matching with expected file name");
    Thread.sleep(10000);

    //testVerifyPDFInURL();

    PDDocument pd = PDDocument.load(new File("C:\\Users\\Updoer\\Downloads\\Inspections.pdf"));
    System.out.println("Total Pages: " + pd.getNumberOfPages());

    PDFTextStripper pdf = new PDFTextStripper();
    System.out.println(pdf.getText(pd));
    pd.close();
}
Add this method in the same class.
public void testVerifyPDFInURL() {
    WebDriver driver = new ChromeDriver();
    driver.get("C:\\Users\\Updoer\\Downloads\\Inspections.pdf");
    driver.findElement(By.linkText("Adeeb Khan")).click();
    String getURL = driver.getCurrentUrl();
    Assert.assertTrue(getURL.contains(".pdf"));
}

private File getLatestFilefromDir(String dirPath) {
    File dir = new File(dirPath);
    File[] files = dir.listFiles();
    if (files == null || files.length == 0) {
        return null;
    }
    File lastModifiedFile = files[0];
    for (int i = 1; i < files.length; i++) {
        if (lastModifiedFile.lastModified() < files[i].lastModified()) {
            lastModifiedFile = files[i];
        }
    }
    return lastModifiedFile;
}