I am trying to take a text file which has a list of peoples first and last names with age and rearrange it so the console output would go from 46 Richman, Mary A. to Mary A. Richman 46. However, in my attempt to do so I have ran into issues (shown below) and I don't understand exactly why they're occurring (it was much worse earlier).
I'd really appreciate the assistance!
Text File:
75 Fresco, Al
67 Dwyer, Barb
55 Turner, Paige
108 Peace, Warren
46 Richman, Mary A.
37 Ware, Crystal
83 Carr, Dusty
15 Sledd, Bob
64 Sutton, Oliver
70 Mellow, Marsha
29 Case, Justin
35 Time, Justin
8 Shorts, Jim
20 Morris, Hugh
25 Vader, Ella
76 Bird, Earl E.
My Code:
import java.io.*;
import java.util.*;
public class Ex2 {
public static void main(String[] args) throws FileNotFoundException {
Scanner input = new Scanner(new File("people.txt"));
while (input.hasNext()) { // Input == people.txt
String line = input.next().replace(",", "");
String firstName = input.next();
String lastName = input.next();
int age = input.nextInt();
System.out.println(firstName + lastName + age);
}
}
}
Bad Console Output: (How is it throwing an Unknown Source Error?)
Fresco,Al67
Exception in thread "main" java.util.InputMismatchException
at java.util.Scanner.throwFor(Unknown Source)
at java.util.Scanner.next(Unknown Source)
at java.util.Scanner.nextInt(Unknown Source)
at java.util.Scanner.nextInt(Unknown Source)
at Ex2.main(Ex2.java:11)
Target Console Output:
Al Fresco 75
Barb Dwyer 67
Paige Turner 55
Warren Peace 108
Mary A. Richman 46
Crystal Ware 37
Dusty Carr 83
Bob Sledd 15
Oliver Sutton 64
Marsha Mellow 70
Justin Case 29
Justin Time 35
Jim Shorts 8
Hugh Morris 20
Ella Vader 25
Earl E. Bird 76
This will make sure the first name includes the middle initial
while (input.hasNext())
{
String[] line = input.nextLine().replace(",", "").split("\\s+");
String age = line[0];
String lastName = line[1];
String firstName = "";
//take the rest of the input and add it to the last name
for(int i = 2; 2 < line.length && i < line.length; i++)
firstName += line[i] + " ";
System.out.println(firstName + lastName + " " + age);
}
You can avoid the issue and simplify the logic by actually reading with input.nextLine() as shown in the below code with comments:
while (input.hasNextLine()) {
String line = input.nextLine();//read next line
line = line.replace(",", "");//replace ,
line = line.replace(".", "");//replace .
String[] data = line.split(" ");//split with space and collect to array
//now, write the output derived from the split array
System.out.println(data[2] + " " + data[1] + " " + data[0]);
}
I created a program to read and extract text from PDF files... But it producing this exception during execution..
java.io.IOException: Error: Expected a long type, actual='930[299'
at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1669)
at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:100)
at org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:632)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1205)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1172)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1097)
at PatentAdder.main(PatentAdder.java:60)
This is my code :
import java.awt.Rectangle;
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.TrueFileFilter;
import org.apache.commons.io.filefilter.WildcardFileFilter;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFTextStripperByArea;
public class PatentAdder {
/**
* #param args
*/
public static String patno,patit,patdate,patfilled,appno;
private static int File;
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
int cnt=0;
if( args.length == 1 )
{
// usage();
}
else
{
PDDocument document = null;
try
{
File dataDir = new File("F:/patents/test/tittest/USP2002w17/06/378/pdfs");
File[] files = dataDir.listFiles();
// String[] files = dataDir.list();
int count=0;
// System.out.println ("Satrt1");
for (File file : files) {
// System.out.println ("Satrt2");
File f = file;
if (!f.isDirectory()) {
document = PDDocument.load(f.getAbsolutePath());
if( document.isEncrypted() )
{
try
{
document.decrypt( "" );
}
catch( InvalidPasswordException e )
{
System.err.println( "Error: Document is encrypted with a password." );
System.exit( 1 );
}
} }
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition( true );
// Rectangle rectt = new Rectangle( 590, 108, 600, 100 ); // enlarge title
Rectangle rectt = new Rectangle( 288, 60, 222, 40 );
Rectangle rect = new Rectangle( 55, 108, 230, 600 ); // US-Patent title h40
// Rectangle rect = new Rectangle( 108, 210, 480, 499 ); //full enlarge
stripper.addRegion( "class1", rect );
stripper.addRegion("class2", rectt);
List allPages = document.getDocumentCatalog().getAllPages();
PDPage firstPage = (PDPage)allPages.get( 0 );
stripper.extractRegions( firstPage );
String title = "(?s)\\(54\\)\\s*([\\w\\s,-]+)|(?s)\\[54\\]\\s*([\\w\\s,-]+)";
String in ="((?s)\\(\\d\\d\\)\\s+Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=\\(\\d*\\)\\s+Assignee:))|((?s)\\[\\d\\d\\)\\s+Inventor:\\s*([\\-\\w\\d\\s,\\.\\(\\)-]+)*[\\w\\']*(?=\\n))|(Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=Assignee:))|((?s)\\(\\d\\d\\)\\s+Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=\\(\\d*\\)\\s+Assignee:))|((?s)\\(\\d\\d\\)\\s+Inventor:\\s*([\\-\\w\\d\\s,\\.\\(\\)-]+)*[\\w\\']*(?=\\n))|(Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=Assignee:))";
String as ="((?s)\\(\\d\\d\\)\\s+Assignee\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=\\(\\d*\\)\\s+Notice:))|((?s)\\(\\d\\d\\)\\s+Assignee:\\s*([\\-\\w\\d\\s,\\.\\(\\)-]+)*[\\w\\']*(?=\\n))|(Assignee\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=Notice:))|(Assignee\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+)(?=Notice:))";
String app_no ="(?s)\\(21\\)\\s*([\\w\\s,.://-]+)|(?s)\\[21\\]\\s*([\\w\\s,.://-]+)";
String filed ="((?s)\\(22\\)\\s*([\\w\\s,.://-]+))|((?s)\\(22\\)\\s*([\\w\\s,.://-]+)(?=\\s*\\n\\s*Related))|((?s)\\[22\\]\\s*([\\w\\s,.://-]+))|((?s)\\[22\\]\\s*([\\w\\s,.://-]+)(?=\\s*\\n\\s*Related))";
String term ="((?s)\\s*Term\\s*([\\w\\s,.://-]+))|((?s)\\s*Term\\s*([\\w\\s,.://-]+))";
String pat_no = "(?s)\\s*Patent No\\.\\:\\s*([\\w\\d\\s,.://-]+)|(?s)\\s*Patent Number\\:\\s*([\\w\\d\\s,.://-]+)";
String pat_dt = "(?s)\\(45\\)\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\(\\d*\\)\\s+Inventor:)|(?s)\\(45\\)\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\(\\d*\\)\\s+Inventors:)|(?s)\\(45\\)\\s*Date([\\*\\w\\d\\s,.://-]+)|(?s)\\[45\\]\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\[\\d*\\]\\s+Inventor:)|(?s)\\[45\\]\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\(\\d*\\)\\s+Inventors:)|(?s)\\[45\\]\\s*Date([\\*\\w\\d\\s,.://-]+)";
// System.out.println(rg);
String region = stripper.getTextForRegion( "class1" );
// System.out.println(region);
String regiont = stripper.getTextForRegion( "class2" );
Pattern p = Pattern.compile(in);
Matcher m = p.matcher(region);
Pattern p2 = Pattern.compile(as);
Matcher m2 = p2.matcher(region);
Pattern p3 = Pattern.compile(title);
Matcher m3 = p3.matcher(region);
Pattern p4 = Pattern.compile(pat_no);
Matcher m4 = p4.matcher(regiont);
Pattern p5 = Pattern.compile(app_no);
Matcher m5 = p5.matcher(region);
Pattern p6 = Pattern.compile(filed);
Matcher m6 = p6.matcher(region);
Pattern p7 = Pattern.compile(pat_dt);
Matcher m7 = p7.matcher(regiont);
while(m.find())
{
// System.out.println(m.group());
}
while(m2.find())
{
// System.out.println(m2.group());
}
while(m3.find())
{
// System.out.println(m3.group());
patit = m3.group().replace("(54)", " ");
patit = patit.trim();
}
while(m4.find())
{
// System.out.println(m4.group());
patno = m4.group().replace("Patent No.: ", " ");
patno = patno.replace("Patent No: ", " ");
patno = patno.replace("Patent", " ");
patno = patno.replace("No.:", " ");
patno = patno.replace("No:", " ");
patno = patno.replace("Number: ", " ");
patno = patno.replace("Number.: ", " ");
patno = patno.trim();
}
while(m5.find())
{
// System.out.println(m5.group());
appno = m5.group().replace("(21)", " ");
appno = appno.replace("Appl. No.: ", " ");
appno = appno.replace("Appl.", " ");
appno = appno.replace("No.", " ");
appno = appno.replace(":"," ");
appno = appno.trim();
}
while(m6.find())
{
// System.out.println(m6.group());
patfilled = m6.group().replace("(22)", " ");
patfilled = patfilled.replace("Filed", " ");
patfilled= patfilled.replace("PCT", " ");
patfilled = patfilled.replace(":", " ");
patfilled = patfilled.replace("\n", "");
patfilled= patfilled.trim();
}
while (m7.find())
{
patdate = m7.group().replace("(45) Date of Patent: ", " ");
patdate = patdate.replace("(45) Date of Patent.: ", " ");
patdate = patdate.replace("(45)", " ");
patdate = patdate.replace("Date", " ");
patdate = patdate.replace("of", " ");
patdate = patdate.replace("Patent.: ", " ");
patdate = patdate.replace("Patent: ", " ");
patdate = patdate.replace("Reissued", " ");
patdate = patdate.replace(":", " ");
patdate = patdate.replace("Patent", " ");
patdate = patdate.replace("*", " ");
patdate = patdate.trim();
}
System.out.println("File name:"+f.getName());
System.out.println(patno +"\n"+patit+"\n"+patdate+"\n"+patfilled+"\n"+appno+"\n-------");
// boolean st = addPatent (patno,patit,patdate,patfilled,appno);
// if ( st == true ) System.out.println(patno+" added");
// else System.out.println(patno+" not added");
count++;
}
System.out.print("-----Finised "+count+" Files------ \n");
}
finally
{
if( document != null )
{
document.close();
}
}
}
}
catch (Exception e)
{
System.out.println(e.getStackTrace());
//System.out.println(e.getLocalizedMessage());
System.out.println(e.getMessage());
System.out.println(e.getCause());
//System.out.println(e.getClass());
e.printStackTrace();
}
}
static boolean addPatent(String pno,String ptitle,String pat_date ,String filed_date , String appl_no )
{
int i=0;
boolean status =false;
try {
Class.forName("com.mysql.jdbc.Driver").newInstance();
Connection con = DriverManager.getConnection("jdbc:mysql://localhost:3306/patent", "root","ragesh");
PreparedStatement st = con.prepareStatement("insert into patents_info values (?,?,?,?,?,?)");
st.setString(1, pno);
st.setString(2, ptitle);
st.setString(3,pat_date);
st.setString(4,filed_date);
st.setString(5,appl_no);
st.setInt(6,0);
i=st.executeUpdate();
if (i > 0) status= true;
}
catch (Exception e)
{
e.printStackTrace();
}
return status;
}
public static List<File> getAllChildFiles(File[] dir)
{
List<File> result = new ArrayList<File>();
for (File file : dir)
{
if (file.isDirectory())
{
File[] children = file.listFiles();
List<File> grandChildren = getAllChildFiles(children);
result.addAll(grandChildren);
}
else
{
result.add(file);
}
}
return result;
}
}
This programs gives output up to some iterations , but halts and thorw exception like above specified ..
Sample output with Exception :
File name:06019327.pdf
Number: 6,019,327
[54] INSTALLATION STRUCTURE OF OUTDOOR
COMMUNICATION DRIVE
[45] Feb. 1, 2000
[22] Aug. 30, 1996
Related U.S. Application Data
[21] 08/704,920
-------
File name:06019328.pdf
Number: 6,019,328
[54] STAY-PUT PEGBOARD ACCESSORY
[45] Feb. 1, 2000
[22] Jan. 27, 1999
[21] 09/238,242
-------
File name:06019329.pdf
Number: 6,019,329
[54] CLAMPS
[45] Feb. 1, 2000
[22] Oct. 30, 1997
[21] 08/961,310
-------
File name:06019330.pdf
Number: 6,019,330
[54] ROOF GUARD DEVICE FOR LIFTING
OBJECTS ON TO A ROOF
[45] Feb. 1, 2000
[22] Nov. 20, 1997
[21] 08/974,866
-------
File name:06019331.pdf
Number: 6,019,331
[54] CANTILEVER BRACKET ASSEMBLY
[45] Feb. 1, 2000
[22] May 28, 1997
Related U.S. Application Data
[21] 08/865,587
-------
[Ljava.lang.StackTraceElement;#43a6684f
Error: Expected a long type, actual='930[299'
java.io.IOException: Error: Expected a long type, actual='930[299'
at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1669)
at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:100)
at org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:632)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1205)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1172)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1097)
at PatentAdder.main(PatentAdder.java:60)
2nd Problem
Sometimes the execution freezes.. That is it just showing the blinking cursor after some more iterations .... Why... ?
File name:06019329.pdf
Number: 6,019,329
[54] CLAMPS
[45] Feb. 1, 2000
[22] Oct. 30, 1997
[21] 08/961,310
-------
File name:06019330.pdf
Number: 6,019,330
[54] ROOF GUARD DEVICE FOR LIFTING
OBJECTS ON TO A ROOF
[45] Feb. 1, 2000
[22] Nov. 20, 1997
[21] 08/974,866
-------
File name:06019331.pdf
Number: 6,019,331
[54] CANTILEVER BRACKET ASSEMBLY
[45] Feb. 1, 2000
[22] May 28, 1997
Related U.S. Application Data
[21] 08/865,587
-------
(__ cursor blinks on... and execution freezes )
Please help me to resolve this 2 issues:
JDK version : 1.6
PDF Box 1.8.3
This is caused by PDFBox not following the PDF Reference to the letter :)
Tokens in a PDF token stream may be delimited by white space (as usual for most programming language), but also implicitly: because the next character is a delimiter of its own, since it introduces a special function. Therefore, it's totally valid -- and certainly not unusual -- to encounter constructions such as
/A[123/B(C)]
which is entirely equivalent to the slightly longer
/A [ 123 /B (C) ]
From ISO "PDF 32000-1:2008", 7.2.2 Character Set:
The PDF character set is divided into three classes, called regular, delimiter, and white-space characters. This classification determines the grouping of characters into tokens. The rules defined in this sub-clause apply to all characters in the file except within strings, streams, and comments.
The White-space characters shown [...]
The delimiter characters (, ), <, >, [, ], {, }, /, and % are special [..]
The original code shows the current implementation (taken from http://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/BaseParser.java):
/**
1681 * This method is used to read a token by the {#linkplain #readInt()} method and the {#linkplain #readLong()} method.
1682 *
1683 * #return the token to parse as integer or long by the calling method.
1684 * #throws IOException throws by the {#link #pdfSource} methods.
1685 */
1686 protected final StringBuilder readStringNumber() throws IOException
1687 {
1688 int lastByte = 0;
1689 StringBuilder buffer = new StringBuilder();
1690 while( (lastByte = pdfSource.read() ) != 32 &&
1691 lastByte != 10 &&
1692 lastByte != 13 &&
1693 lastByte != 60 && //see sourceforge bug 1714707
1694 lastByte != 0 && //See sourceforge bug 853328
1695 lastByte != -1 )
1696 {
1697 buffer.append( (char)lastByte );
1698 }
1699 if( lastByte != -1 )
1700 {
1701 pdfSource.unread( lastByte );
1702 }
1703 return buffer;
1704 }
The 'next character' is tested against the whitespace characters from Table 1 in 7.2.2 (top to bottom, "Space", "Line Feed", "Carriage Return", and the Nul character -- though they are still missing the "Form Feed" code 0x0C and, very odd, the common "Tab" 0x09. They do test, however, for an end-of-file (the -1) and < (60), the latter probably because someone ran into a similar bug before. (I could not locate the original bug report #1714707 but I can infer it must have been similar to your issue.)
This list must be completed by adding the following characters, copied verbatim from Table 2 in 7.2.2:
Table 2 – Delimiter characters
Glyph Decimal Hexadecimal Octal Name
( 40 28 50 LEFT PARENTHESIS
) 41 29 51 RIGHT PARENTHESIS [1]
< 60 3C 60 LESS-THAN SIGN
> 62 3E 62 GREATER-THAN SIGN
[ 91 5B 133 LEFT SQUARE BRACKET
] 93 5D 135 RIGHT SQUARE BRACKET
{ 123 7B 173 LEFT CURLY BRACKET
} 125 7D 175 RIGHT CURLY BRACKET
/ 47 2F 57 SOLIDUS
% 37 25 45 PERCENT SIGN
The odd ones out are { and } since, currently, they only appear inside PostScript snippets, and those are not base objects but contained inside a stream. But perhaps they were historically "reserved for future expansion" (which should no longer be an issue, now the PDF format has been frozen as an ISO specification).
Also, the character % in itself is a delimiter, but it needs some special handling as well as it introduces a comment:
The comment consists of all characters after the PERCENT SIGN and up to but not including the end of the line [...] (7.2.3 Comments)
(Note there is a little ambiguity there:
A conforming reader shall ignore comments, and treat them as single white-space characters. That is, a comment separates the token preceding it from the one following it.
which should not be necessary, because the previous line already says the comment ends before the end-of-line; and so the end-of-line itself ought to remain in the input stream and thus act as a separator. Perhaps nothing more than a case of a belt-and-suspenders approach.)
[1] On reviewing: actually, the closing parenthesis is redundant. It can only occur after a matching opening parentheses, and that introduces a string. Viewed one token at a time, you should never encounter a stray ) -- if you do, that indicates a malformed PDF stream.
The readLong method reads a long from the underlaying stream. As the PDFBox API states that method is throwing an IOException that has been generated by the PushBackInputStream used as input source (pdfSource).
In your case the log is pretty self-explanatory, it seems there's a square bracket '[' in your stream, which make the long conversion impossible.
You have two options:
check you input and your parser logic (or perform a sanity check before using PDDocument.load)
narrow the scope of your try and catch block to line 60 of your class to handle the specific IOException and react accordingly (if possible in your method logic)
About the freeze issues
Are you sure the code is not stuck in one of your:
while(mX.find())
{
...
}
blocks? I find the design pretty error prone, especially for X = 1 and 2. I have no time to go into the logic but you may want to refactor the while condition as follow:
long TIMEOUT = 15000l; // 15 seconds
long now = System.currentTimeMillis(); // init the long just above the while
while(mX.find() && (System.currentTimeMillis() - now) < TIMEOUT)
{
...
}
I deal with a file problem.
IBM 7918 Ayse Durlanik 7600 Computer
------------------------------------
Gama 2342 Mehmet Guzel 8300 Civil
------------------------------------
Lafarge 3242 Ahmet Bilir 4700 Chemical
------------------------------------
Intel 3255 Serhan Atmaca 9200 Electrical
------------------------------------
Bilkent 3452 Fatma Guler 2500 Computer
------------------------------------
Public 1020 Aysen Durmaz 1500 Mechanical
------------------------------------
Havelsan 2454 Sule Dilbaz 2800 Electrical
------------------------------------
Tai 3473 Fethi Oktam 3600 Computer
------------------------------------
Nurol 4973 Ayhan Ak 4100 Civil
------------------------------------
Pfizer 3000 Fusun Ot 2650 Chemical
------------------------------------
This is the text file and I don't want to read this =
"------------------------------------ "
Here is the method:
Scanner scn = null;
File fp = new File("C:/Users/Efe/Desktop/engineers.txt");
try {
scn = new Scanner(fp);
while (scn.hasNextLine()) {
{
if (!scn.next().equals("------------------------------------")) {
String comp = scn.next();
int id = Integer.parseInt(scn.next());
String name = scn.next();
String surname = scn.next();
double sal = Double.parseDouble(scn.next());
String area = scn.next();
Engineer e = new Engineer(comp, id, name, surname, sal, area);
list.add(e);
}
}
scn.close();
}
This is the code where I get an exception at run-time:
Exception in thread "AWT-EventQueue-0" java.lang.NumberFormatException:
For input string: "Ayse" at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
What is wrong with the code?
You're off by one...in the line
if (!scn.next().equals("------------------------------------")) {
if the next token is not the dashed line, then it is lost. Consider assigning it to a temporary variable.
In your case, "IBM" is lost, comp == 7918, and parseInt is called with an argument of "Ayse", leading to the runtime exception.
This is when application trying to convert string to one of the numeric types, but that string does have the appropriate format to convert.
Can you show further "IBM 7918 Ayse Durlanik 7600 Computer"