I have a pdf file that needs to be parsed and written to a database.
It looks something like this:
Some report name Data for 10.10.2022
_____________________________________________________________________________________________
Name Currency1 Currency2 Percent1 Percent2
Some name1(IO/C)|1.2% 1'01.12 USD 1'021.2 USD 0.11% 1.12%
Some name2(IO/C)|1.2% 1'01.12 USD 1'021.2 USD 0.11% 1.12%
I used the Apache PDFBox library:
// Imports assume PDFBox 2.x package names.
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfParser {
    private PDFParser parser;
    private PDFTextStripper textStripper;
    private PDDocument document;
    private COSDocument cosDocument;
    private String text;
    private String filePath;
    private File file;

    public PdfParser() {}

    public String parsePdf() throws IOException {
        this.textStripper = null;
        this.document = null;
        this.cosDocument = null;
        file = new File(filePath);
        // Read-only access is enough for text extraction.
        parser = new PDFParser(new RandomAccessFile(file, "r"));
        parser.parse();
        cosDocument = parser.getDocument();
        textStripper = new PDFTextStripper();
        document = new PDDocument(cosDocument);
        document.getNumberOfPages();
        // Page numbers in PDFTextStripper are 1-based.
        textStripper.setStartPage(1);
        textStripper.setEndPage(2);
        text = textStripper.getText(document);
        return text;
    }

    public void setFilePath(String filePath) {
        this.filePath = filePath;
    }

    public PDDocument getDocument() {
        return document;
    }

    public void showDocument() {
        List<String> list = new ArrayList<>();
        PdfParser pdfParser = new PdfParser();
        pdfParser.setFilePath("C:\\......pdf");
        try {
            String text = pdfParser.parsePdf();
            // Split the extracted text into lines before printing.
            list.addAll(List.of(text.split("\\R")));
            for (String line : list) {
                System.out.println(line);
            }
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
I have not parsed PDF documents before, and the output of my code is not what I expected. As far as I can tell, the text is extracted line by line, but the table layout is lost and some values come out in the wrong order.
It comes out like this:
Some report name
Data for 10.10.2022
Name
Currency1
Currency2
Percent1
Percent2
1'01.12Some name1(IO/C)|1.2% 1'021.2 USD 0.11% 1.12%USD
1'01.12Some name2(IO/C)|1.2% 1'021.2 USD 0.11% 1.12%USD
Could anyone who has worked with PDF parsing advise? How can I at least print the document correctly in the console, so that I can then map the extracted data to an object and save it to the database?
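Once the extracted lines come out in the intended order (as in the sample at the top of the question), one possible way to map a row to an object is a regular expression. The sketch below is only an illustration using the standard library. The class `ReportRowParser`, the holder `ReportRow`, its field names, and the assumed row shape (amounts with `'` as a thousands separator, each followed by a three-letter currency code, then two percentages) are all hypothetical and inferred from the two sample rows.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReportRowParser {

    /** Hypothetical holder for one table row; field names are assumptions. */
    static final class ReportRow {
        final String name;
        final BigDecimal amount1;
        final String currency1;
        final BigDecimal amount2;
        final String currency2;
        final BigDecimal percent1;
        final BigDecimal percent2;

        ReportRow(String name, BigDecimal amount1, String currency1,
                  BigDecimal amount2, String currency2,
                  BigDecimal percent1, BigDecimal percent2) {
            this.name = name;
            this.amount1 = amount1;
            this.currency1 = currency1;
            this.amount2 = amount2;
            this.currency2 = currency2;
            this.percent1 = percent1;
            this.percent2 = percent2;
        }
    }

    // Assumed shape: <name> <amount> <CUR> <amount> <CUR> <pct>% <pct>%
    private static final Pattern ROW = Pattern.compile(
            "^(.+?)\\s+([\\d']+(?:\\.\\d+)?)\\s+([A-Z]{3})"
            + "\\s+([\\d']+(?:\\.\\d+)?)\\s+([A-Z]{3})"
            + "\\s+(\\d+(?:\\.\\d+)?)%\\s+(\\d+(?:\\.\\d+)?)%$");

    /** Returns the parsed row, or empty if the line does not look like a data row. */
    static Optional<ReportRow> parse(String line) {
        Matcher m = ROW.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new ReportRow(
                m.group(1),
                amount(m.group(2)), m.group(3),
                amount(m.group(4)), m.group(5),
                new BigDecimal(m.group(6)),
                new BigDecimal(m.group(7))));
    }

    // Drops the apostrophe separators before converting to a number.
    private static BigDecimal amount(String raw) {
        return new BigDecimal(raw.replace("'", ""));
    }

    public static void main(String[] args) {
        String line = "Some name1(IO/C)|1.2% 1'01.12 USD 1'021.2 USD 0.11% 1.12%";
        parse(line).ifPresent(r ->
                System.out.println(r.name + " | " + r.amount1 + " " + r.currency1));
    }
}
```

Lines that do not match, such as the header row, yield an empty `Optional` and can be skipped. On the extraction side, `PDFTextStripper.setSortByPosition(true)` may be worth trying to get the text in visual reading order, though this sketch cannot confirm that it fixes this particular PDF.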
When I tried to extract text from PDF using Apache PDFBox 2.0.18, the output looked like below,
How can I avoid those question mark symbols?
Below is my PDF extraction method:
public static String getPDFContent(File pdfFile) throws IOException {
    PDDocument doc = null;
    String text = null;
    try {
        doc = PDDocument.load(pdfFile);
        text = new PDFTextStripper().getText(doc);
    }
    catch (Exception e) {
        // Pass the exception to the logger so the cause is not lost.
        logger.error("An exception occurred while extracting text from pdf using Apache PDFBox.", e);
        return null;
    }
    finally {
        if (doc != null) {
            doc.close();
        }
    }
    return text;
}
I am converting a .docx document to HTML using the following code:
private static final String docName = "This is a test page.docx";
private static final String outputlFolderPath = "C://";
String htmlNamePath = "docHtml1.html";
String zipName = "_tmp.zip";
static File docFile = new File(outputlFolderPath + docName);
File zipFile = new File(zipName);

public void ConvertWordToHtml() {
    try {
        InputStream doc = new FileInputStream(new File(outputlFolderPath + docName));
        System.out.println("InputStream" + doc);
        XWPFDocument document = new XWPFDocument(doc);
        XHTMLOptions options = XHTMLOptions.create(); //.URIResolver(new FileURIResolver(new File("word/media")));;
        String root = "target";
        // Note: doc is the InputStream here, so its toString() ends up in the folder path.
        File imageFolder = new File(root + "/images/" + doc);
        options.setExtractor(new FileImageExtractor(imageFolder));
        options.URIResolver(new FileURIResolver(imageFolder));
        OutputStream out = new FileOutputStream(new File(htmlPath()));
        XHTMLConverter.getInstance().convert(document, out, options);
    } catch (Exception ex) {
        // Print the failure instead of silently swallowing it.
        ex.printStackTrace();
    }
}

public static void main(String[] args) throws IOException, ParserConfigurationException, Exception {
    Convertion cwoWord = new Convertion();
    cwoWord.ConvertWordToHtml();
}

public String htmlPath() {
    return outputlFolderPath + htmlNamePath;
}

public String zipPath() {
    // d:/_tmp.zip
    return outputlFolderPath + zipName;
}
The code above converts the document to HTML correctly. The problem arises when I convert a document that contains graphics, such as a circle (shown in the screenshot): in that case, the graphics do not appear in the HTML file.
How can I preserve the graphics when converting from doc to HTML? Thanks in advance.
You can embed the images in the html by using the following code:
Base64ImageExtractor imageExtractor = new Base64ImageExtractor();
options.setExtractor(imageExtractor);
options.URIResolver(imageExtractor);
where Base64ImageExtractor looks like:
public class Base64ImageExtractor implements IImageExtractor, IURIResolver {
    private byte[] picture;

    public void extract(String imagePath, byte[] imageData) throws IOException {
        this.picture = imageData;
    }

    private static final String EMBED_IMG_SRC_PREFIX = "data:;base64,";

    public String resolve(String uri) {
        StringBuilder sb = new StringBuilder(picture.length + EMBED_IMG_SRC_PREFIX.length())
                .append(EMBED_IMG_SRC_PREFIX)
                .append(Base64Utility.encode(picture));
        return sb.toString();
    }
}
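For reference, the string returned by `resolve` is a data URI: a `data:` prefix, an optional MIME type, the `;base64,` marker, and the encoded bytes. The sketch below shows the same construction using only `java.util.Base64` from the standard library. The class name `DataUriSketch`, the helper `toDataUri`, and the explicit `mimeType` parameter are assumptions for illustration and are not part of the converter's API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class DataUriSketch {

    /** Builds "data:<mimeType>;base64,<payload>"; an empty mimeType gives "data:;base64,...". */
    static String toDataUri(String mimeType, byte[] data) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for real image data.
        byte[] placeholder = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toDataUri("image/png", placeholder));
    }
}
```

Passing an empty MIME type reproduces the `data:;base64,` prefix used in the answer. Supplying the real type (for example `image/png`) is a possible refinement if the image format is known.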
I'm working on a servlet for a web project, and this is my code.
I am using version 2.0.0 of the PDFBox library, and my code works in a plain Java application.
pdfManager.java:
public class pdfManager {
    private PDFParser parser;
    private PDFTextStripper pdfStripper;
    private PDDocument pdDoc;
    private COSDocument cosDoc;
    private String Text;
    private String filePath;
    private File file;

    public pdfManager() {
    }

    public String ToText() throws IOException
    {
        this.pdfStripper = null;
        this.pdDoc = null;
        this.cosDoc = null;
        file = new File(filePath);
        parser = new PDFParser(new RandomAccessFile(file, "r")); // update for PDFBox V 2.0
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdDoc.getNumberOfPages();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(10);
        // reading text from page 1 to 10
        // if you want to get text from full pdf file use this code
        // pdfStripper.setEndPage(pdDoc.getNumberOfPages());
        Text = pdfStripper.getText(pdDoc);
        return Text;
    }

    public void setFilePath(String filePath) {
        this.filePath = filePath;
    }
}
The servlet file:
PrintWriter out = response.getWriter();
out.println("\ndata we gottoo : ");
pdfManager pdfManager = new pdfManager();
pdfManager.setFilePath("/Users/rami/Desktop/pdf2.pdf");
System.out.println(pdfManager.ToText());
This code is called in the doGet method.
The library you need is probably not on the classpath, or some other problem occurs when the classloader tries to load the library's classes. If you are running on a server, make sure the library is added to the server's classpath folder. This can be done by hand, or your application has to ship the library itself. Since it is not clear how your app is deployed or delivered, there can be many reasons.
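One way to narrow this down is to check at runtime whether a class from the library can be loaded by the same classloader that loaded your own code. The sketch below uses only the standard library. The class name `ClasspathCheck` and the helper `isClassAvailable` are hypothetical, and `org.apache.pdfbox.pdmodel.PDDocument` is used as the probe class on the assumption that it is the one your code needs.

```java
final class ClasspathCheck {

    /** Returns true if the named class can be located by this class's classloader. */
    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String probe = "org.apache.pdfbox.pdmodel.PDDocument";
        System.out.println(probe + " available: " + isClassAvailable(probe));
    }
}
```

Calling the helper from inside the servlet (and logging the result) would show whether the web application's classloader can see the library, which may differ from what a standalone Java application sees.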
I am trying to upload a CSV file, and I intend to write "success" to the uploaded CSV if the upload succeeds. Below is my code (it is not working at the moment):
@RequestMapping(value = "/submitUploadCarForm")
public String uploadCar(CarUploadForm uploadform, HttpServletRequest request, Model model, BindingResult result) {
    try {
        CommonsMultipartFile file = uploadform.getFileData();
        // parse csv file to list
        csvList = CarCSVUtil.getCSVInputList(file.getInputStream());
        for (CarCSVFileInputBean inputRecord : csvList) {
            Car carentity = new Car();
            carentity.setId(inputRecord.getId());
            carentity.setName(inputRecord.getcarName());
            carentity.setShortName(inputRecord.getcarShortName());
            carentity.setEnvironment(inputRecord.getEnvironment());
            carentity = this.Service.saveCar(carentity);
            CarConfig compcfg = new CarConfig();
            compcfg.setCar(carentity);
            compcfg.setCarType(pickCarType(carTypes, inputRecord.getcarType()));
            this.Service.saveCompCfg(compcfg);
            inputRecord.setuploadstatus("success"); // <--- this is where I need help
        }
    }
    catch (Exception e) {
        e.printStackTrace();
        result.rejectValue("name", "failureMsg", "Error while uploading ");
        model.addAttribute("failureMsg", "Error while uploading ");
    }
    return "view";
}
I am using this code to import a CSV file and store the data in the database. Add the opencsv JAR file to your build path.
public String importCSV(@RequestParam("file") MultipartFile file,
        HttpServletRequest req) throws IOException {
    CsvToBean csv = new CsvToBean();
    // Comma separator, double-quote quoting, skip the first (header) line.
    CSVReader reader = new CSVReader(new InputStreamReader(
            file.getInputStream()), ',', '\"', 1);
    ColumnPositionMappingStrategy maping = new ColumnPositionMappingStrategy();
    maping.setType(tbBank.class);
    String[] column = { "bankid", "bankname", "bankbranch" };
    maping.setColumnMapping(column);
    List banklist = csv.parse(maping, reader);
    for (Object obj : banklist) {
        tbBank bank = (tbBank) obj;
        projectservice.insertBank(bank);
    }
    return "redirect:/bankview";
}
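For the original goal of recording a per-row upload status, one possible approach is to collect a status for each input row while saving, and then write the rows back out with an extra column. The sketch below shows only that last step with the standard library. The class `CsvStatusSketch` and the method `appendStatus` are hypothetical, and the sketch assumes plain comma-separated rows with no quoted fields or embedded commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CsvStatusSketch {

    /** Appends statuses.get(i) as a new last column of rows.get(i). */
    static List<String> appendStatus(List<String> rows, List<String> statuses) {
        if (rows.size() != statuses.size()) {
            throw new IllegalArgumentException("rows and statuses must have the same size");
        }
        List<String> out = new ArrayList<>(rows.size());
        for (int i = 0; i < rows.size(); i++) {
            out.add(rows.get(i) + "," + statuses.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows and statuses, for illustration only.
        List<String> rows = Arrays.asList("1,Car A,CA", "2,Car B,CB");
        List<String> statuses = Arrays.asList("success", "failed");
        for (String line : appendStatus(rows, statuses)) {
            System.out.println(line);
        }
    }
}
```

The resulting lines could then be written to a new file or returned to the user. For real data with quoting, a CSV library's writer would be a safer choice than string concatenation.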
I am trying to create a plugin for Nutch. I am using Nutch 1.7 and Solr, and I have followed a lot of different tutorials. I want to write a plugin that returns raw HTML data. I used the standard Nutch wiki and the following tutorial: http://sujitpal.blogspot.nl/2009/07/nutch-custom-plugin-to-parse-and-add.html
I created two files, getDivinfohtml.java and getDivinfo.java.
getDivinfohtml.java needs to read the content and then return the complete source code, or at least a part of it:
package org.apache.nutch.indexer;

public class getDivInfohtml implements HtmlParseFilter
{
    private static final Log LOG = LogFactory.getLog(getDivInfohtml.class);
    private Configuration conf;
    public static final String TAG_KEY = "source";
    // Logger logger = Logger.getLogger("mylog");
    // FileHandler fh;
    //FileSystem fs = FileSystem.get(conf);
    //Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
    //SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    //Text key = new Text();
    // Content content = new Content();
    // fh = new FileHandler("/root/JulienKulkerNutch/mylogfile.log");
    // logger.addHandler(fh);
    // SimpleFormatter formatter = new SimpleFormatter();
    //fh.setFormatter(formatter);

    public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
    {
        try
        {
            LOG.info("Parsing Url:" + content.getUrl());
            LOG.info("Julien: " + content.toString().substring(content.toString().indexOf("<!DOCTYPE html")));
            Parse parse = parseResult.get(content.getUrl());
            Metadata metadata = parse.getData().getParseMeta();
            String fullContent = metadata.get("fullcontent");
            Document document = Jsoup.parse(fullContent);
            Element contentwrapper = document.select("div#jobBodyContent").first();
            String source = contentwrapper.text();
            metadata.add("SOURCE", source);
            return parseResult;
        }
        catch (Exception e)
        {
            LOG.info(e);
        }
        return parseResult;
    }

    public Configuration getConf()
    {
        return conf;
    }

    public void setConf(Configuration conf)
    {
        this.conf = conf;
    }
}
Right now it reads the complete content and then extracts the text in jobBodyContent.
Then we have the parser that needs to put the data into the fields.
getDivinfo (parser):
public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
{
    // LOG.info("Julien is sukkel");
    try
    {
        fh = new FileHandler("/root/JulienKulkerNutch/mylogfile2.log");
        SimpleFormatter formatter = new SimpleFormatter();
        fh.setFormatter(formatter);
        logger.info("Julien is sukkel");
        Metadata metadata = parse.getData().getParseMeta();
        logger.info("julien is gek:");
        String fullContent = metadata.get("SOURCE");
        logger.info("Output:" + metadata);
        logger.info(fullContent);
        String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);
        logger.info(fullSource);
        doc.add("divcontent", fullContent);
    }
    catch (Exception e)
    {
        //LOG.info(e);
    }
    return doc;
}
The error is in getDivinfo, on this line: String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);
[javac] /root/JulienKulkerNutch/apache-nutch-1.8/src/plugin/myDivSelector/src/java/org/apache/nutch/indexer/getDivInfo.java:58: error: cannot find symbol
[javac] String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);
You may need to implement HTMLParser. In your getFields implementation,
private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();

static {
    FIELDS.add(WebPage.Field.CONTENT);
    FIELDS.add(WebPage.Field.OUTLINKS);
}

public Collection<Field> getFields() {
    return FIELDS;
}