Can we extract texts between tables in PDF using Tabula in Java?

I was able to extract the tables using Tabula. I looked for a way to also output the text between the tables, but Tabula seems to handle only tables. Any idea how to do this?
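import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.PageIterator;
import technology.tabula.Rectangle;
import technology.tabula.Table;
import technology.tabula.detectors.NurminenDetectionAlgorithm;
import technology.tabula.extractors.BasicExtractionAlgorithm;
import technology.tabula.extractors.ExtractionAlgorithm;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

public static List<Table> extractTablesFromPDF(PDDocument document) {
    NurminenDetectionAlgorithm detectionAlgorithm = new NurminenDetectionAlgorithm();
    ExtractionAlgorithm algExtractor;
    // Used both to test whether a page is ruled (tabular) and to extract from it
    SpreadsheetExtractionAlgorithm spreadsheetExtractor = new SpreadsheetExtractionAlgorithm();
    ObjectExtractor objectExtractor = new ObjectExtractor(document);
    PageIterator pages = objectExtractor.extract();
    List<Table> tables = new ArrayList<Table>();
    while (pages.hasNext()) {
        Page page = pages.next();
        // Pick the extraction algorithm per page
        if (spreadsheetExtractor.isTabular(page)) {
            algExtractor = spreadsheetExtractor;
        } else {
            algExtractor = new BasicExtractionAlgorithm();
        }
        // Detect candidate table areas, then extract each one
        List<Rectangle> tablesOnPage = detectionAlgorithm.detect(page);
        for (Rectangle guessRect : tablesOnPage) {
            Page guess = page.getArea(guessRect);
            tables.addAll((List<Table>) algExtractor.extract(guess));
        }
    }
    return tables;
}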
Thank you in advance for your help!

Maintainer of Tabula here.
There are no public methods in Tabula to do so, but you can resort to PDFBox's PDFTextStripper.
Looking at one of the command line tools included with PDFBox might be useful: https://github.com/apache/pdfbox/blob/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractText.java
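
One way to approach this, as a rough sketch rather than anything Tabula provides: since the detection step already yields the table rectangles for a page, the areas not covered by a table can be computed and then handed to a region-based stripper such as PDFBox's PDFTextStripperByArea (addRegion, extractRegions, getTextForRegion). The class below only does the geometry and uses nothing but the JDK. The name TableGaps and the method gapsBetweenTables are made up for illustration, and it assumes the rectangles use a top-left origin with y growing downward, which should be checked against the PDFBox version in use.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: finds the full-width bands of a page not covered by any table. */
class TableGaps {

    /**
     * @param pageWidth  page width in points
     * @param pageHeight page height in points
     * @param tables     table bounding boxes (assumed: y grows downward from the page top)
     * @return bands above, between and below the tables, ordered top to bottom
     */
    static List<Rectangle2D> gapsBetweenTables(double pageWidth, double pageHeight,
                                               List<? extends Rectangle2D> tables) {
        List<Rectangle2D> sorted = new ArrayList<>(tables);
        sorted.sort(Comparator.comparingDouble(Rectangle2D::getMinY));

        List<Rectangle2D> gaps = new ArrayList<>();
        double cursor = 0; // lowest y already covered by a table (or the page top)
        for (Rectangle2D table : sorted) {
            if (table.getMinY() > cursor) {
                gaps.add(new Rectangle2D.Double(0, cursor, pageWidth, table.getMinY() - cursor));
            }
            // max() handles tables that overlap vertically
            cursor = Math.max(cursor, table.getMaxY());
        }
        if (cursor < pageHeight) {
            gaps.add(new Rectangle2D.Double(0, cursor, pageWidth, pageHeight - cursor));
        }
        return gaps;
    }

    public static void main(String[] args) {
        List<Rectangle2D> tables = List.of(
                new Rectangle2D.Double(50, 100, 400, 150),
                new Rectangle2D.Double(50, 400, 400, 200));
        for (Rectangle2D gap : gapsBetweenTables(612, 792, tables)) {
            System.out.println(gap);
        }
    }
}
```

Tabula's Rectangle is a java.awt.geom rectangle, so the list returned by detectionAlgorithm.detect(page) could be passed in directly. Each resulting band would then be registered as a named region on the stripper and read back per page.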

Related

Dropbox SDK Java - write search query to get all files

I'm working on a simple project to download all files with certain extensions, and I'm running a search like this:
public void findFile(String query) {
    try {
        SearchV2Builder searchBuilder = client.files().searchV2Builder(query);
        List<String> fileExtensions = Arrays.asList(extensions);
        SearchOptions searchOptions = SearchOptions.newBuilder().withFileExtensions(fileExtensions).build();
        SearchV2Result searchResult = searchBuilder.withOptions(searchOptions).start();
        List<SearchMatchV2> searchMatches = searchResult.getMatches();
        System.out.println(searchMatches.size());
        for (SearchMatchV2 s : searchMatches) {
            System.out.println(s.getMetadata());
        }
    } catch (DbxException e) {
        e.printStackTrace();
    }
}
I don't know how to write a query that returns all files. I tried "*" and "", and neither worked. How do I write the correct query for that?

In my experience, you have to include in your initial search string something you are specifically looking for - there does not seem to be the notion of a "wildcard" search in this string.
Also, there is no way to limit the query by date.
So for my usage, I always have directories that start with a known string ("p_"), and also include a YYYYMMDD string in them.
To return all directories, I query "p_" only.
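
Since the date cannot be part of the query, the date restriction has to happen on the client after the search returns. Here is a small sketch of that step using only the JDK. It assumes the names have already been collected from the search results as plain strings, and DatedNameFilter and namesInRange are made-up names for illustration, not part of the Dropbox SDK.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: keeps names with a known prefix whose embedded YYYYMMDD date is in range. */
class DatedNameFilter {

    // First run of eight digits in the name is treated as the date
    private static final Pattern DATE = Pattern.compile("(\\d{8})");

    static List<String> namesInRange(List<String> names, String prefix,
                                     LocalDate from, LocalDate to) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (!name.startsWith(prefix)) {
                continue;
            }
            Matcher m = DATE.matcher(name);
            if (!m.find()) {
                continue; // no date in the name
            }
            try {
                LocalDate date = LocalDate.parse(m.group(1), DateTimeFormatter.BASIC_ISO_DATE);
                // Inclusive range check on both ends
                if (!date.isBefore(from) && !date.isAfter(to)) {
                    kept.add(name);
                }
            } catch (DateTimeParseException e) {
                // Eight digits that are not a valid calendar date: skip the name
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> names = List.of("p_20240115_a", "p_20240301_b", "x_20240116_c", "p_nodate");
        System.out.println(namesInRange(names, "p_",
                LocalDate.of(2024, 1, 1), LocalDate.of(2024, 1, 31)));
    }
}
```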

ResearchStack InformedConsent ConsentVisualStep deprecated method problem

I am currently building an Android app using ResearchStack to conduct studies. It is the Android version of ResearchKit. Maybe someone with experience in ResearchKit can also help me. I've used the following example to adapt it to my needs.
https://www.raywenderlich.com/637-researchstack-tutorial-getting-started
The following method has since been deprecated:
visualStep.setNextButtonString(getString(R.string.rsb_next));
You can find this line commented out in the example below.
If I don't use this line, no text is shown in the bottom bar, where you would usually find a "next" label indicating where to click to continue. Clicking the bottom right corner still advances to the next page.
Can anyone help me add text there?
Thanks!
private List<Step> createConsentSteps(ConsentDocument document) {
    List<Step> steps = new ArrayList<>();
    for (ConsentSection section : document.getSections()) {
        ConsentVisualStep visualStep = new ConsentVisualStep(section.getType().toString());
        visualStep.setSection(section);
        //visualStep.setNextButtonString(getString(R.string.rsb_next)); //--> deprecated
        steps.add(visualStep);
    }
    ConsentDocumentStep documentStep = new ConsentDocumentStep("consent_doc");
    documentStep.setConsentHTML(document.getHtmlReviewContent());
    documentStep.setConfirmMessage(getString(R.string.rsb_consent_review_reason));
    steps.add(documentStep);
    ConsentSignature signature = document.getSignature(0);
    if (signature.requiresName()) {
        TextAnswerFormat format = new TextAnswerFormat();
        format.setIsMultipleLines(false);
        QuestionStep fullName = new QuestionStep("consent_name_step", "Please enter your full name",
                format);
        fullName.setPlaceholder("Full name");
        fullName.setOptional(false);
        steps.add(fullName);
    }
    if (signature.requiresSignatureImage()) {
        ConsentSignatureStep signatureStep = new ConsentSignatureStep("signature_step");
        signatureStep.setTitle(getString(R.string.rsb_consent_signature_title));
        signatureStep.setText(getString(R.string.rsb_consent_signature_instruction));
        signatureStep.setOptional(false);
        signatureStep.setStepLayoutClass(ConsentSignatureStepLayout.class);
        steps.add(signatureStep);
    }
    return steps;
}

How to restart page number from 1 in different group of BIRT report

Background:
I use Java + BIRT to generate the report.
The report is generated in the viewer, and the user can choose to export it to different formats (pdf, xls, word...).
All scripting is in "Layout"; there is none in "Master Page".
There is one "Data Set". The fields in "Layout" refer to this data set.
There is a group in "Layout", grouped by one field.
In the "Group Header", I created one cell to use as the page number: "Page : MyPageNumber".
"MyPageNumber" is a field I defined that is incremented by 1 in the Group Header.
Problem:
When I generate the report with this approach, "MyPageNumber" is not shown correctly. Because the group header is loaded only once for each group, it always shows 1.
Question:
As far as I know, Crystal Reports has a "restart page number in group" option. How do I restart page numbering in BIRT?
I want to show the data of different groups in one report file, with the page number starting from 1 for each group.

You can do it with BIRT reports using page variables. For example:
Add 2 page variables... Group_page, Group_name.
Add 1 report variable... Group_total_page.
In the report beforeFactory add the script:
prevGroupKey = "";
groupPageNumber = 1;
reportContext.setGlobalVariable("gGROUP_NAME", "");
reportContext.setGlobalVariable("gGROUP_PAGE", 1);
In the report onPageEnd add the script:
var groupKey = currGroup;
var prevGroupKey = reportContext.getGlobalVariable("gGROUP_NAME");
var groupPageNumber = reportContext.getGlobalVariable("gGROUP_PAGE");
if (prevGroupKey == null) {
    prevGroupKey = "";
}
if (prevGroupKey == groupKey) {
    if (groupPageNumber != null) {
        groupPageNumber = parseInt(groupPageNumber) + 1;
    } else {
        groupPageNumber = 1;
    }
} else {
    groupPageNumber = 1;
    prevGroupKey = groupKey;
}
reportContext.setPageVariable("GROUP_NAME", groupKey);
reportContext.setPageVariable("GROUP_PAGE", groupPageNumber);
reportContext.setGlobalVariable("gGROUP_NAME", groupKey);
reportContext.setGlobalVariable("gGROUP_PAGE", groupPageNumber);
var groupTotalPage = reportContext.getPageVariable("GROUP_TOTAL_PAGE");
if (groupTotalPage == null) {
    groupTotalPage = new java.util.HashMap();
    reportContext.setPageVariable("GROUP_TOTAL_PAGE", groupTotalPage);
}
groupTotalPage.put(groupKey, groupPageNumber);
In a master page onRender script add the following script:
var totalPage = reportContext.getPageVariable("GROUP_TOTAL_PAGE");
var groupName = reportContext.getPageVariable("GROUP_NAME");
if (totalPage != null) {
    this.text = java.lang.Integer.toString(totalPage.get(groupName));
}
In the table group header onCreate event, add the following script, replacing 'COUNTRY' with the name of the column that you are grouping on:
currGroup = this.getRowData().getColumnValue("COUNTRY");
In the master page add a grid to the header or footer and add an autotext variable for Group_page and Group_total_page. Optionally add the page variable for the Group_name as well.
Check out these links for more information about BIRT page variables:
https://books.google.ch/books?id=aIjZ4FYJOQkC&pg=PA85&lpg=PA85&dq=birt+change+autotext&source=bl&ots=K0nCmF2hrD&sig=CBOr_otRW0B72sZoFS7LC_1Mrz4&hl=en&sa=X&ei=ZKNAVcnuLYLHsAXRmIHoCw&ved=0CEoQ6AEwBQ#v=onepage&q=birt%20change%20autotext&f=false
https://www.youtube.com/watch?v=lw_k1qHY_gU
http://www.eclipse.org/birt/phoenix/project/notable2.5.php#jump_4
https://bugs.eclipse.org/bugs/show_bug.cgi?id=316173
http://www.eclipse.org/forums/index.php/t/575172/
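
The counting rule in the onPageEnd script above can be tried outside BIRT. The following is a plain-Java sketch of that rule only: it does not call any BIRT API, and the class and method names (GroupPageCounter, onPageEnd, totalPages) are made up for illustration. It restarts the counter whenever the group key changes between two consecutive pages and remembers the last page number seen per group as that group's total.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-alone model of the per-group page counter. */
class GroupPageCounter {

    private String previousGroup = "";
    private int groupPage = 0;
    // Last page number seen for each group, which ends up being the group's page total
    private final Map<String, Integer> totalPages = new LinkedHashMap<>();

    /** Call once for every finished page; returns the page number within the current group. */
    int onPageEnd(String groupKey) {
        if (groupKey.equals(previousGroup)) {
            groupPage++;      // same group as the previous page: keep counting
        } else {
            groupPage = 1;    // group changed: restart at 1
            previousGroup = groupKey;
        }
        totalPages.put(groupKey, groupPage);
        return groupPage;
    }

    Map<String, Integer> totalPages() {
        return totalPages;
    }

    public static void main(String[] args) {
        GroupPageCounter counter = new GroupPageCounter();
        for (String group : new String[] {"DE", "DE", "FR", "FR", "FR", "US"}) {
            System.out.println(group + " page " + counter.onPageEnd(group));
        }
        System.out.println(counter.totalPages());
    }
}
```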

Alas, this is not supported with BIRT.
That's probably not the answer you've hoped for, but it's the truth.
This is one of the very few aspects where BIRT is way behind other report generator tools.
However, depending on how you have BIRT integrated into your environment, a workaround approach is possible for PDF export that we use in our solution with great success.
The idea is to let BIRT generate a PDF outline based on the grouping.
And the BIRT report creates information in the ReportContext about where and how it wants the page numbers to be displayed.
After BIRT generated the PDF, a custom PDFPostProcessor uses the PDF outline and the information from the ReportContext to add the page numbers with iText.
If this work-around is viable for you, feel free to contact me.

Docx4j - Images in the document

How can we remove an image from a document using docx4j?
Say I have 10 images, and I want to replace 8 of them with my own byte array/binary data and delete the remaining 2.
I am also having trouble locating the images.
Is it somehow possible to replace text placeholders in the document with images?

Refer to this post: http://vixmemon.blogspot.com/2013/04/docx4j-replace-text-placeholders-with.html
for (Object obj : elemetns) {
    if (obj instanceof Tbl) {
        Tbl table = (Tbl) obj;
        List rows = getAllElementFromObject(table, Tr.class);
        for (Object trObj : rows) {
            Tr tr = (Tr) trObj;
            List cols = getAllElementFromObject(tr, Tc.class);
            for (Object tcObj : cols) {
                Tc tc = (Tc) tcObj;
                List texts = getAllElementFromObject(tc, Text.class);
                for (Object textObj : texts) {
                    Text text = (Text) textObj;
                    if (text.getValue().equalsIgnoreCase("${MY_PLACE_HOLDER}")) {
                        File file = new File("C:\\image.jpeg");
                        P paragraphWithImage = addInlineImageToParagraph(createInlineImage(file));
                        tc.getContent().remove(0);
                        tc.getContent().add(paragraphWithImage);
                    }
                }
                System.out.println("here");
            }
        }
        System.out.println("here");
    }
}
wordMLPackage.save(new java.io.File("C:\\result.docx"));

See docx4j checking checkboxes for the 2 approaches to finding stuff (XPath, or non XPath traversal).
VariableReplace allows you to replace text placeholders, but not with images. I think there may be code floating around (in the docx4j forums?) which extends it to do that.
But I'd suggest you use content control databinding instead. See how to create a new word from template with docx4j
You can use base64 encoded images in your XML data, and docx4j and/or Word will do the rest.

Extracting Values From an XML File Either using XPath, SAX or DOM for this Specific Scenario

I am currently working on an academic project, developing in Java and XML. Actual task is to parse XML, passing required values preferably in HashMap for further processing. Here is the short snippet of actual XML.
<root>
  <BugReport ID = "1">
    <Title>"(495584) Firefox - search suggestions passes wrong previous result to form history"</Title>
    <Turn>
      <Date>'2009-06-14 18:55:25'</Date>
      <From>'Justin Dolske'</From>
      <Text>
        <Sentence ID = "3.1"> Created an attachment (id=383211) [details] Patch v.2</Sentence>
        <Sentence ID = "3.2"> Ah. So, there's a ._formHistoryResult in the....</Sentence>
        <Sentence ID = "3.3"> The simple fix it to just discard the service's form history result.</Sentence>
        <Sentence ID = "3.4"> Otherwise it's trying to use a old form history result that no longer applies for the search string.</Sentence>
      </Text>
    </Turn>
    <Turn>
      <Date>'2009-06-19 12:07:34'</Date>
      <From>'Gavin Sharp'</From>
      <Text>
        <Sentence ID = "4.1"> (From update of attachment 383211 [details])</Sentence>
        <Sentence ID = "4.2"> Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
      </Text>
    </Turn>
    <Turn>
      <Date>'2009-06-19 13:17:56'</Date>
      <From>'Justin Dolske'</From>
      <Text>
        <Sentence ID = "5.1"> (In reply to comment #3)</Sentence>
        <Sentence ID = "5.2"> &gt; (From update of attachment 383211 [details] [details])</Sentence>
        <Sentence ID = "5.3"> &gt; Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
        <Sentence ID = "5.4"> Good point.</Sentence>
        <Sentence ID = "5.5"> I renamed the one in the wrapper to _formHistResult. </Sentence>
        <Sentence ID = "5.6"> fhResult seemed maybe a bit too short.</Sentence>
      </Text>
    </Turn>
    .....
    and so on
  </BugReport>
Many commenters, like 'Justin Dolske', have commented on this report. What I am actually looking for is the list of commenters and all the sentences each of them has written in the whole XML file, something like if(from == justin dolske) getHisAllSentences(), and likewise for every other commenter. I have tried many different ways to get the sentences only for 'Justin Dolske' or other commenters, and even a generic form for all of them, using XPath, SAX and DOM, but failed. I am quite new to these technologies, including Java, and don't know how to achieve this.
Can anyone guide me specifically on how to get this with any of the above technologies, or is there a better strategy?
(Note: Later I want to put the result in a HashMap (key, value), where key = name of the commenter (justin dolske) and value = all of that commenter's sentences.)
Urgent help will be highly appreciated.

There are several ways to achieve this.
One way would be to use JAXB. There are several tutorials on it available on the web, so feel free to refer to them.
You can also build a DOM, extract the data from it, and put it into your HashMap.
A reference implementation would look something like this:
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XMLReader {
    private HashMap<String, ArrayList<String>> namesSentencesMap;

    public XMLReader() {
        namesSentencesMap = new HashMap<String, ArrayList<String>>();
    }

    private Document getDocument(String fileName) {
        Document document = null;
        try {
            document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File(fileName));
        } catch (Exception exe) {
            //handle exception
        }
        return document;
    }

    private void buildNamesSentencesMap(Document document) {
        if (document == null) {
            return;
        }
        //Get each Turn block
        NodeList turnList = document.getElementsByTagName("Turn");
        String fromName = null;
        NodeList sentenceNodeList = null;
        for (int turnIndex = 0; turnIndex < turnList.getLength(); turnIndex++) {
            Element turnElement = (Element) turnList.item(turnIndex);
            //Assumption: <From> element
            Element fromElement = (Element) turnElement.getElementsByTagName("From").item(0);
            fromName = fromElement.getTextContent();
            //Extracting sentences - First check whether the map contains
            //an ArrayList corresponding to the name. If yes, then use that,
            //else create a new one
            ArrayList<String> sentenceList = namesSentencesMap.get(fromName);
            if (sentenceList == null) {
                sentenceList = new ArrayList<String>();
            }
            //Extract sentences from the Turn node
            try {
                sentenceNodeList = turnElement.getElementsByTagName("Sentence");
                for (int sentenceIndex = 0; sentenceIndex < sentenceNodeList.getLength(); sentenceIndex++) {
                    sentenceList.add(((Element) sentenceNodeList.item(sentenceIndex)).getTextContent());
                }
            } finally {
                sentenceNodeList = null;
            }
            //Put the list back in the map
            namesSentencesMap.put(fromName, sentenceList);
        }
    }

    public static void main(String[] args) {
        XMLReader reader = new XMLReader();
        reader.buildNamesSentencesMap(reader.getDocument("<your_xml_file>"));
        for (String names : reader.namesSentencesMap.keySet()) {
            System.out.println("Name: " + names + "\tTotal Sentences: " + reader.namesSentencesMap.get(names).size());
        }
    }
}
Note: This is just a demonstration and you would need to modify it to suit your need. I've created it based on your XML to show one way of doing it.

I suggest using JAXB to create a data model reflecting your XML structure.
Once that is done, you can load the XML into Java instances.
Put each 'Turn' into a Map< String, List< Turn >>, using Turn.From as the key.
Once that is done, you can write:
List< Turn > justinsTurn = allTurns.get( "'Justin Dolske'" );
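
For the XPath route mentioned in the question, here is a minimal sketch using only the JDK's javax.xml.xpath package. It is an illustration under assumptions, not a drop-in solution: it assumes the structure shown in the question (Turn elements containing one From and a Text with Sentence children), the class name SentencesByCommenter and the method group are made up, and it takes the XML as a string to stay self-contained. The keys keep the single quotes that surround the names in the data.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: groups all sentences by the commenter named in each Turn. */
class SentencesByCommenter {

    static Map<String, List<String>> group(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            Map<String, List<String>> result = new LinkedHashMap<>();
            // Every Turn anywhere in the document
            NodeList turns = (NodeList) xpath.evaluate("//Turn", doc, XPathConstants.NODESET);
            for (int i = 0; i < turns.getLength(); i++) {
                Node turn = turns.item(i);
                // Relative paths are evaluated against the current Turn node
                String from = xpath.evaluate("From", turn).trim();
                NodeList sentences =
                        (NodeList) xpath.evaluate("Text/Sentence", turn, XPathConstants.NODESET);
                List<String> list = result.computeIfAbsent(from, k -> new ArrayList<>());
                for (int j = 0; j < sentences.getLength(); j++) {
                    list.add(sentences.item(j).getTextContent().trim());
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><BugReport ID=\"1\">"
                + "<Turn><From>'Justin Dolske'</From><Text><Sentence ID=\"3.1\"> Patch v.2</Sentence></Text></Turn>"
                + "<Turn><From>'Gavin Sharp'</From><Text><Sentence ID=\"4.1\"> Rename it?</Sentence></Text></Turn>"
                + "<Turn><From>'Justin Dolske'</From><Text><Sentence ID=\"5.4\"> Good point.</Sentence></Text></Turn>"
                + "</BugReport></root>";
        group(xml).forEach((name, sentences) -> System.out.println(name + " -> " + sentences));
    }
}
```

If only one commenter is needed, the grouping loop can be replaced by a single expression with a predicate on From, for example //Turn[From="'Justin Dolske'"]/Text/Sentence.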
