RSS FEED - data parsing

RSS FEED - data parsing - java

How can I retrieve the location from the following parsed data?
<description>Origin date/time: Mon, 29 Mar 2021 04:23:32 ; Location: BLACKFORD,PERTH/KINROSS ; Lat/long: 56.284,-3.759 ; Depth: 7 km ; Magnitude: 1.0</description>
These details are inside the description tag, and the description has already been parsed into an ArrayList. How do I get just the location out of it?

You can use the regex, (?<=Location: ).*?(?= ;) to find and extract the required match.
Solution using the Stream API (Matcher.results() requires Java 9 or later):
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
public class Main {
public static void main(String[] args) {
String str = "<description>Origin date/time: Mon, 29 Mar 2021 04:23:32 ; Location: BLACKFORD,PERTH/KINROSS ; Lat/long: 56.284,-3.759 ; Depth: 7 km ; Magnitude: 1.0</description>";
List<String> list = Pattern.compile("(?<=Location: ).*?(?= ;)")
.matcher(str)
.results()
.map(MatchResult::group)
.collect(Collectors.toList());
System.out.println(list);
}
}
Output:
[BLACKFORD,PERTH/KINROSS]
Non-Stream solution:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String str = "<description>Origin date/time: Mon, 29 Mar 2021 04:23:32 ; Location: BLACKFORD,PERTH/KINROSS ; Lat/long: 56.284,-3.759 ; Depth: 7 km ; Magnitude: 1.0</description>";
Matcher matcher = Pattern.compile("(?<=Location: ).*?(?= ;)").matcher(str);
List<String> list = new ArrayList<>();
while (matcher.find()) {
list.add(matcher.group());
}
System.out.println(list);
}
}
Output:
[BLACKFORD,PERTH/KINROSS]

If all you get is
Origin date/time: Mon, 29 Mar 2021 04:23:32 ; Location: BLACKFORD,PERTH/KINROSS ; Lat/long: 56.284,-3.759 ; Depth: 7 km ; Magnitude: 1.0
You're going to have to either (a) determine the standard that dictates this format, if any, or (b) do it yourself, i.e. look at the structure and decide how to parse based on that.
Simple way with split()
It seems you can use the split() method on a String using separator " ; ". That should give you an array of length 5.
You could then assume Location is always in the second position or simply iterate over the array until you find the string that starts with Location.
Example
public class Location {
public static void main(String[] args) {
String rawData = "Origin date/time: Mon, 29 Mar 2021 04:23:32 ; Location: BLACKFORD,PERTH/KINROSS ; Lat/long: 56.284,-3.759 ; Depth: 7 km ; Magnitude: 1.0\r\n";
String[] dataArray = rawData.split(" ; ");
System.out.println(dataArray[1]);
}
}
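The second variant described above, iterating until a field starts with Location, might look like the following sketch. The class and method names (LocationByName, findLocation) are made up for illustration, and the sketch assumes the fields are always separated by ";" and that the label is spelled exactly "Location:".

```java
import java.util.Optional;

public class LocationByName {
    // Hypothetical helper: returns the text after "Location:", or empty if no such field exists.
    static Optional<String> findLocation(String rawData) {
        // Split on ';' and swallow the whitespace around it, so no trimming of each field is needed.
        for (String field : rawData.trim().split("\\s*;\\s*")) {
            if (field.startsWith("Location:")) {
                return Optional.of(field.substring("Location:".length()).trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String rawData = "Origin date/time: Mon, 29 Mar 2021 04:23:32 ; Location: BLACKFORD,PERTH/KINROSS ; Lat/long: 56.284,-3.759 ; Depth: 7 km ; Magnitude: 1.0\r\n";
        System.out.println(findLocation(rawData).orElse("(no location)"));
    }
}
```

Looking the field up by name rather than by position keeps working if the feed ever adds or reorders fields.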
The Regular Expression Way
Alternatively, you can use a regular expression that gives you the value outright, without going through the steps I just described. The value you are looking for is always preceded by "Location: " and followed by " ;", so a lookbehind and a lookahead keep both delimiters out of the match.
Pattern pattern = Pattern.compile("(?<=Location: ).*?(?= ;)", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(rawData);
boolean matchFound = matcher.find();
if (matchFound) {
    System.out.println("Match found: " + matcher.group());
} else {
    System.out.println("Match not found");
}

Use a dictionary along with Regex (C#):
string pattern = @"(?'key'[^:]+):\s+(?'value'.*)";
string input = "Origin date/time: Mon, 29 Mar 2021 04:23:32 ; Location: BLACKFORD,PERTH/KINROSS ; Lat/long: 56.284,-3.759 ; Depth: 7 km ; Magnitude: 1.0";
string[] splitArray = input.Split(new char[] { ';' });
Dictionary<string, string> dict = splitArray.Select(x => Regex.Match(x, pattern))
.GroupBy(x => x.Groups["key"].Value.Trim(), y => y.Groups["value"].Value.Trim())
.ToDictionary(x => x.Key, y => y.FirstOrDefault());
string location = dict["Location"];
Or this, which matches against the whole input and needs no split:
string pattern = @"(?'key'[^:]+):\s+(?'value'[^;]+);?";
string input = "Origin date/time: Mon, 29 Mar 2021 04:23:32 ; Location: BLACKFORD,PERTH/KINROSS ; Lat/long: 56.284,-3.759 ; Depth: 7 km ; Magnitude: 1.0";
MatchCollection matches = Regex.Matches(input, pattern);
Dictionary<string, string> dict = matches.Cast<Match>()
.GroupBy(x => x.Groups["key"].Value.Trim(), y => y.Groups["value"].Value.Trim())
.ToDictionary(x => x.Key, y => y.FirstOrDefault());
string location = dict["Location"];
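Since the question is about Java, the same key/value idea can be sketched with a Map. This is an illustration under assumptions, not a translation of the C# above: the class and method names (DescriptionFields, parse) are hypothetical, and each field is assumed to have the form "key: value" with the first colon ending the key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DescriptionFields {
    // Hypothetical helper: turns "key: value ; key: value ; ..." into an ordered map.
    static Map<String, String> parse(String description) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String part : description.split(";")) {
            int colon = part.indexOf(':');   // only the first colon separates key from value
            if (colon < 0) {
                continue;                    // skip fragments without a key
            }
            fields.put(part.substring(0, colon).trim(), part.substring(colon + 1).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String input = "Origin date/time: Mon, 29 Mar 2021 04:23:32 ; Location: BLACKFORD,PERTH/KINROSS ; Lat/long: 56.284,-3.759 ; Depth: 7 km ; Magnitude: 1.0";
        System.out.println(parse(input).get("Location"));
    }
}
```

Splitting on the first colon only matters here because the timestamp value itself contains colons.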

Try
String desc = "Origin date/time: Mon, 29 Mar 2021 04:23:32 ; Location: BLACKFORD,PERTH/KINROSS ; Lat/long: 56.284,-3.759 ; Depth: 7 km ; Magnitude: 1.0";
String[] parts = desc.split(";");
for ( String part : parts )
{
if ( part.contains("Location") )
{
parts = part.split(":");
System.out.println("***************** Location is: '" + parts[1].trim() + "'");
break;
}
}

Related

Split the right substring in a list

I am trying to store the tokens of a line in a list so that I can process them. In the current state only the first element is removed. I want to remove the whole leading text (the words before the first time) from the line before processing it. How can I fix that?
I appreciate any help.
Sample:
stop 04:48 05:18 05:46 06:16 06:46 07:16 07:46 16:46 17:16 17:46 18:16 18:46 19:16
Apple chair car 04:52 05:22 05:50 06:20 06:50 07:20 07:50 16:50 17:20 17:50 18:20 18:50 19:20
Result:
[04:48, 05:18, 05:46, 06:16, 06:46, 07:16, 07:46, 16:46, 17:16, 17:46, 18:16, 18:46, 19:16]
[04:52, 05:22, 05:50, 06:20, 06:50, 07:20, 07:50, 16:50, 17:20, 17:50, 18:20, 18:50, 19:20]
Code:
if (line.contains(":")) {
String delims = " ";
String[] tokens = line.split(delims);
List<String> list = new ArrayList<String>(
Arrays.asList(tokens));
list.remove(0);
System.out.println(tokens);
}

First replace and then split.
string.replaceFirst("(?m)^.*?(?=\\d+:\\d+)", "").split("\\s+");
string.replaceFirst("(?m)^.*?(?=\\d+:\\d+)", "") replaces the leading letters and spaces, i.e. everything before the first time, with an empty string.
Splitting the resulting string on whitespace then gives you the desired output.
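A small runnable sketch of the replace-then-split idea follows. The class and method names (TimeTokens, times) are illustrative only, and it assumes every line contains at least one time of the form digits:digits; if it does not, the lookahead never succeeds and the line is split unchanged.

```java
import java.util.Arrays;
import java.util.List;

public class TimeTokens {
    // Hypothetical helper: drops everything before the first hh:mm token, then splits on whitespace.
    static List<String> times(String line) {
        String withoutLabel = line.replaceFirst("^.*?(?=\\d+:\\d+)", "");
        return Arrays.asList(withoutLabel.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(times("stop 04:48 05:18 05:46"));
        System.out.println(times("Apple chair car 04:52 05:22 05:50"));
    }
}
```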

Here is an alternative without regex; the end result is a string that you can split on spaces.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
public class StringReplace {
    public static void main(String[] args) {
        String output = replace("Apple chair car 04:52 05:22 05:50 06:20 06:50 07:20 07:50 16:50 17:20 17:50 18:20 18:50 19:20");
        List<String> tokens = new ArrayList<>();
        Collections.addAll(tokens, output.split(" "));
    }
    private static String replace(String input) {
        StringBuilder builder = new StringBuilder();
        for (char character : input.toCharArray()) {
            // keep only the ASCII range '0' to ':' (the digits and the colon) and spaces
            if ((character >= '0' && character <= ':') || character == ' ') {
                builder.append(character);
            }
        }
        // the spaces that followed the removed words are left at the front, so trim them
        return builder.toString().trim();
    }
}
Result >> 04:52 05:22 05:50 06:20 06:50 07:20 07:50 16:50 17:20 17:50 18:20 18:50 19:20

Regex to match a number or nothing

I need a regex that can match something like this:
1234 <CIRCLE> 12 12 12 </CIRCLE>
1234 <RECTANGLE> 12 12 12 12 </RECTANGLE>
I have come up with this regex:
(\\d+?) <([A-Z]+?)> (\\d+?) (\\d+?) (\\d+?) (\\d*)? (</[A-Z]+?>)
It works fine when I try to match the rectangle, but it doesn't work for the circle.
The problem is that my fifth group is not capturing, though it should be.

Try
(\\d+?) <([A-Z]+?)> (\\d+?) (\\d+?) (\\d+?) (\\d+ )?(</[A-Z]+?>)
(I changed the last "\d" group to make the space optional too.)

That is because only the (\\d*)? part is optional, while the spaces before and after it are mandatory, so when the last (\\d*) matches nothing the regex still requires two spaces before the closing tag. Try something like
(\\d+?) <([A-Z]+?)> (?:(\\d+?) ){3,4}(</[A-Z]+?>)
Oh, and if you want to make sure that the closing tag is the same as the opening one, you can use a backreference: \\2 stands for whatever group 2 matched. So maybe update your regex to something like
(\\d+?) <([A-Z]+?)> (?:(\\d+?) ){3,4}(</\\2>)
//       ^^^^^^^^^----------------------^^^
//       group 2                        must equal the text captured by group 2
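To check the repeated-group idea against both sample lines, here is a minimal sketch. The names (ShapeLine, SHAPE, values) are hypothetical, and the pattern is a slight variation that captures the whole run of numbers in one group, because a capturing group inside {3,4} would only retain its last repetition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShapeLine {
    // group 1: id, group 2: tag name, group 3: the three or four numbers, \2: same tag again
    private static final Pattern SHAPE =
            Pattern.compile("(\\d+) <([A-Z]+)> ((?:\\d+ ){3,4})</\\2>");

    // Hypothetical helper: returns the numbers between the tags, or an empty list if the line does not match.
    static List<Integer> values(String line) {
        List<Integer> values = new ArrayList<>();
        Matcher m = SHAPE.matcher(line);
        if (m.matches()) {
            for (String number : m.group(3).trim().split(" ")) {
                values.add(Integer.valueOf(number));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(values("1234 <CIRCLE> 12 12 12 </CIRCLE>"));
        System.out.println(values("1234 <RECTANGLE> 12 12 12 12 </RECTANGLE>"));
        System.out.println(values("1234 <CIRCLE> 12 12 12 </RECTANGLE>"));
    }
}
```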

Solution for just the numbers:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.annotation.Nonnull;
public class Q26005150
{
private static final Pattern P = Pattern.compile("(\\d+)");
public static void main(String[] args)
{
final String s1 = "1234 <CIRCLE> 12 12 12 </CIRCLE>";
final String s2 = "1234 <RECTANGLE> 12 12 12 12 </RECTANGLE>";
final List<Integer> l1 = getAllMatches(s1);
final List<Integer> l2 = getAllMatches(s2);
System.out.println("l1 = " + l1);
System.out.println("l2 = " + l2);
}
private static List<Integer> getAllMatches(@Nonnull final String s)
{
final Matcher m = P.matcher(s);
final List<Integer> matches = new ArrayList<Integer>();
while(m.find())
{
matches.add(Integer.valueOf(m.group(1)));
}
return matches;
}
}
Outputs:
l1 = [1234, 12, 12, 12]
l2 = [1234, 12, 12, 12, 12]

Solution for the Numbers and the Tags
private static final Pattern P = Pattern.compile("(</?(\\w+)>|(\\d+))");
public static void main(String[] args)
{
final String s1 = "1234 <CIRCLE> 12 12 12 </CIRCLE>";
final String s2 = "1234 <RECTANGLE> 12 12 12 12 </RECTANGLE>";
final List<String> l1 = getAllMatches(s1);
final List<String> l2 = getAllMatches(s2);
System.out.println("l1 = " + l1);
System.out.println("l2 = " + l2);
}
private static List<String> getAllMatches(@Nonnull final String s)
{
final Matcher m = P.matcher(s);
final List<String> matches = new ArrayList<String>();
while(m.find())
{
final String match = m.group(1);
matches.add(match);
}
return matches;
}
Outputs:
l1 = [1234, <CIRCLE>, 12, 12, 12, </CIRCLE>]
l2 = [1234, <RECTANGLE>, 12, 12, 12, 12, </RECTANGLE>]

Assuming the labels between "<" and ">" have to match and the numbers in between are identical, use this pattern:
^\d+\s<([A-Z]+)>\s(\d+\s)(\2)+<\/(\1)>$
Or, if the numbers in the middle do not have to be identical and may be absent:
^\d+\s<([A-Z]+)>\s(\d+\s)*<\/(\1)>$

What does java.io.IOException: Error: Expected a long type, actual='930[299' mean?

I created a program to read and extract text from PDF files, but it throws this exception during execution:
java.io.IOException: Error: Expected a long type, actual='930[299'
at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1669)
at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:100)
at org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:632)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1205)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1172)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1097)
at PatentAdder.main(PatentAdder.java:60)
This is my code :
import java.awt.Rectangle;
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.TrueFileFilter;
import org.apache.commons.io.filefilter.WildcardFileFilter;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFTextStripperByArea;
public class PatentAdder {
/**
* @param args
*/
public static String patno,patit,patdate,patfilled,appno;
private static int File;
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
int cnt=0;
if( args.length == 1 )
{
// usage();
}
else
{
PDDocument document = null;
try
{
File dataDir = new File("F:/patents/test/tittest/USP2002w17/06/378/pdfs");
File[] files = dataDir.listFiles();
// String[] files = dataDir.list();
int count=0;
// System.out.println ("Satrt1");
for (File file : files) {
// System.out.println ("Satrt2");
File f = file;
if (!f.isDirectory()) {
document = PDDocument.load(f.getAbsolutePath());
if( document.isEncrypted() )
{
try
{
document.decrypt( "" );
}
catch( InvalidPasswordException e )
{
System.err.println( "Error: Document is encrypted with a password." );
System.exit( 1 );
}
} }
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition( true );
// Rectangle rectt = new Rectangle( 590, 108, 600, 100 ); // enlarge title
Rectangle rectt = new Rectangle( 288, 60, 222, 40 );
Rectangle rect = new Rectangle( 55, 108, 230, 600 ); // US-Patent title h40
// Rectangle rect = new Rectangle( 108, 210, 480, 499 ); //full enlarge
stripper.addRegion( "class1", rect );
stripper.addRegion("class2", rectt);
List allPages = document.getDocumentCatalog().getAllPages();
PDPage firstPage = (PDPage)allPages.get( 0 );
stripper.extractRegions( firstPage );
String title = "(?s)\\(54\\)\\s*([\\w\\s,-]+)|(?s)\\[54\\]\\s*([\\w\\s,-]+)";
String in ="((?s)\\(\\d\\d\\)\\s+Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=\\(\\d*\\)\\s+Assignee:))|((?s)\\[\\d\\d\\)\\s+Inventor:\\s*([\\-\\w\\d\\s,\\.\\(\\)-]+)*[\\w\\']*(?=\\n))|(Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=Assignee:))|((?s)\\(\\d\\d\\)\\s+Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=\\(\\d*\\)\\s+Assignee:))|((?s)\\(\\d\\d\\)\\s+Inventor:\\s*([\\-\\w\\d\\s,\\.\\(\\)-]+)*[\\w\\']*(?=\\n))|(Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=Assignee:))";
String as ="((?s)\\(\\d\\d\\)\\s+Assignee\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=\\(\\d*\\)\\s+Notice:))|((?s)\\(\\d\\d\\)\\s+Assignee:\\s*([\\-\\w\\d\\s,\\.\\(\\)-]+)*[\\w\\']*(?=\\n))|(Assignee\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=Notice:))|(Assignee\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+)(?=Notice:))";
String app_no ="(?s)\\(21\\)\\s*([\\w\\s,.://-]+)|(?s)\\[21\\]\\s*([\\w\\s,.://-]+)";
String filed ="((?s)\\(22\\)\\s*([\\w\\s,.://-]+))|((?s)\\(22\\)\\s*([\\w\\s,.://-]+)(?=\\s*\\n\\s*Related))|((?s)\\[22\\]\\s*([\\w\\s,.://-]+))|((?s)\\[22\\]\\s*([\\w\\s,.://-]+)(?=\\s*\\n\\s*Related))";
String term ="((?s)\\s*Term\\s*([\\w\\s,.://-]+))|((?s)\\s*Term\\s*([\\w\\s,.://-]+))";
String pat_no = "(?s)\\s*Patent No\\.\\:\\s*([\\w\\d\\s,.://-]+)|(?s)\\s*Patent Number\\:\\s*([\\w\\d\\s,.://-]+)";
String pat_dt = "(?s)\\(45\\)\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\(\\d*\\)\\s+Inventor:)|(?s)\\(45\\)\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\(\\d*\\)\\s+Inventors:)|(?s)\\(45\\)\\s*Date([\\*\\w\\d\\s,.://-]+)|(?s)\\[45\\]\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\[\\d*\\]\\s+Inventor:)|(?s)\\[45\\]\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\(\\d*\\)\\s+Inventors:)|(?s)\\[45\\]\\s*Date([\\*\\w\\d\\s,.://-]+)";
// System.out.println(rg);
String region = stripper.getTextForRegion( "class1" );
// System.out.println(region);
String regiont = stripper.getTextForRegion( "class2" );
Pattern p = Pattern.compile(in);
Matcher m = p.matcher(region);
Pattern p2 = Pattern.compile(as);
Matcher m2 = p2.matcher(region);
Pattern p3 = Pattern.compile(title);
Matcher m3 = p3.matcher(region);
Pattern p4 = Pattern.compile(pat_no);
Matcher m4 = p4.matcher(regiont);
Pattern p5 = Pattern.compile(app_no);
Matcher m5 = p5.matcher(region);
Pattern p6 = Pattern.compile(filed);
Matcher m6 = p6.matcher(region);
Pattern p7 = Pattern.compile(pat_dt);
Matcher m7 = p7.matcher(regiont);
while(m.find())
{
// System.out.println(m.group());
}
while(m2.find())
{
// System.out.println(m2.group());
}
while(m3.find())
{
// System.out.println(m3.group());
patit = m3.group().replace("(54)", " ");
patit = patit.trim();
}
while(m4.find())
{
// System.out.println(m4.group());
patno = m4.group().replace("Patent No.: ", " ");
patno = patno.replace("Patent No: ", " ");
patno = patno.replace("Patent", " ");
patno = patno.replace("No.:", " ");
patno = patno.replace("No:", " ");
patno = patno.replace("Number: ", " ");
patno = patno.replace("Number.: ", " ");
patno = patno.trim();
}
while(m5.find())
{
// System.out.println(m5.group());
appno = m5.group().replace("(21)", " ");
appno = appno.replace("Appl. No.: ", " ");
appno = appno.replace("Appl.", " ");
appno = appno.replace("No.", " ");
appno = appno.replace(":"," ");
appno = appno.trim();
}
while(m6.find())
{
// System.out.println(m6.group());
patfilled = m6.group().replace("(22)", " ");
patfilled = patfilled.replace("Filed", " ");
patfilled= patfilled.replace("PCT", " ");
patfilled = patfilled.replace(":", " ");
patfilled = patfilled.replace("\n", "");
patfilled= patfilled.trim();
}
while (m7.find())
{
patdate = m7.group().replace("(45) Date of Patent: ", " ");
patdate = patdate.replace("(45) Date of Patent.: ", " ");
patdate = patdate.replace("(45)", " ");
patdate = patdate.replace("Date", " ");
patdate = patdate.replace("of", " ");
patdate = patdate.replace("Patent.: ", " ");
patdate = patdate.replace("Patent: ", " ");
patdate = patdate.replace("Reissued", " ");
patdate = patdate.replace(":", " ");
patdate = patdate.replace("Patent", " ");
patdate = patdate.replace("*", " ");
patdate = patdate.trim();
}
System.out.println("File name:"+f.getName());
System.out.println(patno +"\n"+patit+"\n"+patdate+"\n"+patfilled+"\n"+appno+"\n-------");
// boolean st = addPatent (patno,patit,patdate,patfilled,appno);
// if ( st == true ) System.out.println(patno+" added");
// else System.out.println(patno+" not added");
count++;
}
System.out.print("-----Finised "+count+" Files------ \n");
}
finally
{
if( document != null )
{
document.close();
}
}
}
}
catch (Exception e)
{
System.out.println(e.getStackTrace());
//System.out.println(e.getLocalizedMessage());
System.out.println(e.getMessage());
System.out.println(e.getCause());
//System.out.println(e.getClass());
e.printStackTrace();
}
}
static boolean addPatent(String pno,String ptitle,String pat_date ,String filed_date , String appl_no )
{
int i=0;
boolean status =false;
try {
Class.forName("com.mysql.jdbc.Driver").newInstance();
Connection con = DriverManager.getConnection("jdbc:mysql://localhost:3306/patent", "root","ragesh");
PreparedStatement st = con.prepareStatement("insert into patents_info values (?,?,?,?,?,?)");
st.setString(1, pno);
st.setString(2, ptitle);
st.setString(3,pat_date);
st.setString(4,filed_date);
st.setString(5,appl_no);
st.setInt(6,0);
i=st.executeUpdate();
if (i > 0) status= true;
}
catch (Exception e)
{
e.printStackTrace();
}
return status;
}
public static List<File> getAllChildFiles(File[] dir)
{
List<File> result = new ArrayList<File>();
for (File file : dir)
{
if (file.isDirectory())
{
File[] children = file.listFiles();
List<File> grandChildren = getAllChildFiles(children);
result.addAll(grandChildren);
}
else
{
result.add(file);
}
}
return result;
}
}
This program produces output for some iterations, but then halts and throws the exception shown above.
Sample output with the exception:
File name:06019327.pdf
Number: 6,019,327
[54] INSTALLATION STRUCTURE OF OUTDOOR
COMMUNICATION DRIVE
[45] Feb. 1, 2000
[22] Aug. 30, 1996
Related U.S. Application Data
[21] 08/704,920
-------
File name:06019328.pdf
Number: 6,019,328
[54] STAY-PUT PEGBOARD ACCESSORY
[45] Feb. 1, 2000
[22] Jan. 27, 1999
[21] 09/238,242
-------
File name:06019329.pdf
Number: 6,019,329
[54] CLAMPS
[45] Feb. 1, 2000
[22] Oct. 30, 1997
[21] 08/961,310
-------
File name:06019330.pdf
Number: 6,019,330
[54] ROOF GUARD DEVICE FOR LIFTING
OBJECTS ON TO A ROOF
[45] Feb. 1, 2000
[22] Nov. 20, 1997
[21] 08/974,866
-------
File name:06019331.pdf
Number: 6,019,331
[54] CANTILEVER BRACKET ASSEMBLY
[45] Feb. 1, 2000
[22] May 28, 1997
Related U.S. Application Data
[21] 08/865,587
-------
[Ljava.lang.StackTraceElement;@43a6684f
Error: Expected a long type, actual='930[299'
java.io.IOException: Error: Expected a long type, actual='930[299'
at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1669)
at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:100)
at org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:632)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1205)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1172)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1097)
at PatentAdder.main(PatentAdder.java:60)
2nd Problem
Sometimes the execution freezes: after some more iterations it just shows the blinking cursor. Why?
File name:06019329.pdf
Number: 6,019,329
[54] CLAMPS
[45] Feb. 1, 2000
[22] Oct. 30, 1997
[21] 08/961,310
-------
File name:06019330.pdf
Number: 6,019,330
[54] ROOF GUARD DEVICE FOR LIFTING
OBJECTS ON TO A ROOF
[45] Feb. 1, 2000
[22] Nov. 20, 1997
[21] 08/974,866
-------
File name:06019331.pdf
Number: 6,019,331
[54] CANTILEVER BRACKET ASSEMBLY
[45] Feb. 1, 2000
[22] May 28, 1997
Related U.S. Application Data
[21] 08/865,587
-------
(the cursor keeps blinking and execution freezes)
Please help me to resolve these two issues.
JDK version : 1.6
PDF Box 1.8.3

This is caused by PDFBox not following the PDF Reference to the letter :)
Tokens in a PDF token stream may be delimited by white space (as usual for most programming languages), but also implicitly, when the next character is a delimiter in its own right because it introduces a special function. Therefore, it's totally valid -- and certainly not unusual -- to encounter constructions such as
/A[123/B(C)]
which is entirely equivalent to the slightly longer
/A [ 123 /B (C) ]
From ISO "PDF 32000-1:2008", 7.2.2 Character Set:
The PDF character set is divided into three classes, called regular, delimiter, and white-space characters. This classification determines the grouping of characters into tokens. The rules defined in this sub-clause apply to all characters in the file except within strings, streams, and comments.
The White-space characters shown [...]
The delimiter characters (, ), <, >, [, ], {, }, /, and % are special [..]
This is the current implementation (taken from http://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/BaseParser.java):
/**
 * This method is used to read a token by the {@linkplain #readInt()} method and the {@linkplain #readLong()} method.
 *
 * @return the token to parse as integer or long by the calling method.
 * @throws IOException throws by the {@link #pdfSource} methods.
 */
protected final StringBuilder readStringNumber() throws IOException
{
    int lastByte = 0;
    StringBuilder buffer = new StringBuilder();
    while( (lastByte = pdfSource.read() ) != 32 &&
            lastByte != 10 &&
            lastByte != 13 &&
            lastByte != 60 && //see sourceforge bug 1714707
            lastByte != 0 && //See sourceforge bug 853328
            lastByte != -1 )
    {
        buffer.append( (char)lastByte );
    }
    if( lastByte != -1 )
    {
        pdfSource.unread( lastByte );
    }
    return buffer;
}
The 'next character' is tested against the white-space characters from Table 1 in 7.2.2 (top to bottom: "Space", "Line Feed", "Carriage Return", and the Nul character), though the "Form Feed" code 0x0C and, very oddly, the common "Tab" 0x09 are still missing. They do test, however, for an end-of-file (the -1) and < (60), the latter probably because someone ran into a similar bug before. (I could not locate the original bug report #1714707 but I can infer it must have been similar to your issue.)
This list must be completed by adding the following characters, taken from Table 2 in 7.2.2:
Table 2 – Delimiter characters
Glyph  Decimal  Hexadecimal  Octal  Name
(      40       28           50     LEFT PARENTHESIS
)      41       29           51     RIGHT PARENTHESIS [1]
<      60       3C           74     LESS-THAN SIGN
>      62       3E           76     GREATER-THAN SIGN
[      91       5B           133    LEFT SQUARE BRACKET
]      93       5D           135    RIGHT SQUARE BRACKET
{      123      7B           173    LEFT CURLY BRACKET
}      125      7D           175    RIGHT CURLY BRACKET
/      47       2F           57     SOLIDUS
%      37       25           45     PERCENT SIGN
The odd ones out are { and } since, currently, they only appear inside PostScript snippets, and those are not base objects but contained inside a stream. But perhaps they were historically "reserved for future expansion" (which should no longer be an issue, now the PDF format has been frozen as an ISO specification).
Also, the character % in itself is a delimiter, but it needs some special handling as well as it introduces a comment:
The comment consists of all characters after the PERCENT SIGN and up to but not including the end of the line [...] (7.2.3 Comments)
(Note there is a little ambiguity there:
A conforming reader shall ignore comments, and treat them as single white-space characters. That is, a comment separates the token preceding it from the one following it.
which should not be necessary, because the previous line already says the comment ends before the end-of-line; and so the end-of-line itself ought to remain in the input stream and thus act as a separator. Perhaps nothing more than a case of a belt-and-suspenders approach.)
[1] On reviewing: actually, the closing parenthesis is redundant. It can only occur after a matching opening parenthesis, and that introduces a string. Viewed one token at a time, you should never encounter a stray ) -- if you do, that indicates a malformed PDF stream.
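To make the rule concrete, here is a minimal standalone sketch of a number-token reader that stops at every white-space and delimiter character listed in 7.2.2. It is not PDFBox code and not the project's actual fix; the class and method names (PdfTokenSketch, endsToken, readNumberToken, firstToken) are hypothetical, and comment handling for % is deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class PdfTokenSketch {
    // True for end of input, the white-space characters (Table 1) and the delimiters (Table 2).
    static boolean endsToken(int c) {
        switch (c) {
            case -1:                                            // end of input
            case 0: case 9: case 10: case 12: case 13: case 32: // Nul, Tab, LF, FF, CR, Space
            case '(': case ')': case '<': case '>':
            case '[': case ']': case '{': case '}':
            case '/': case '%':                                 // delimiters
                return true;
            default:
                return false;
        }
    }

    // Reads characters up to, but not including, the next white-space or delimiter character.
    static String readNumberToken(PushbackInputStream in) throws IOException {
        StringBuilder buffer = new StringBuilder();
        int c;
        while (!endsToken(c = in.read())) {
            buffer.append((char) c);
        }
        if (c != -1) {
            in.unread(c); // leave the terminating character for the next token
        }
        return buffer.toString();
    }

    // Convenience wrapper for experiments: first token of a string.
    static String firstToken(String data) {
        byte[] bytes = data.getBytes(StandardCharsets.ISO_8859_1);
        try (PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(bytes))) {
            return readNumberToken(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // With '[' treated as a delimiter, the token from the error message splits cleanly.
        System.out.println(Long.parseLong(firstToken("930[299")));
    }
}
```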

The readLong method reads a long from the underlying stream. As the PDFBox API states, that method throws an IOException generated by the PushBackInputStream used as input source (pdfSource).
In your case the log is pretty self-explanatory: there is a square bracket '[' in your stream, which makes the conversion to long impossible.
You have two options:
- check your input and your parser logic (or perform a sanity check before using PDDocument.load)
- narrow the scope of your try/catch block to line 60 of your class, so that you handle this specific IOException and react accordingly (if possible in your method logic)
About the freeze issues
Are you sure the code is not stuck in one of your:
while(mX.find())
{
...
}
blocks? I find the design pretty error-prone, especially for the first two loops (m and m2). I have no time to go into the logic, but you may want to refactor the while condition as follows:
long TIMEOUT = 15000L; // 15 seconds
long now = System.currentTimeMillis(); // init the long just above the while
while(mX.find() && (System.currentTimeMillis() - now) < TIMEOUT)
{
...
}
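Note that a condition like the one above is only evaluated between calls to find(). If a single find() call is what stalls, for example through heavy backtracking, one commonly used workaround is to hand the matcher a CharSequence that checks a deadline on every character access. The sketch below shows that idea under stated assumptions: the names (TimedRegexSketch, DeadlineCharSequence, findWithTimeout) are hypothetical, and the exception type is an arbitrary choice.

```java
import java.util.regex.Pattern;

public class TimedRegexSketch {
    // Hypothetical wrapper: every character access first checks whether the deadline has passed.
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new IllegalStateException("regex matching timed out");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    // Runs find() on the input; throws IllegalStateException once the time budget is used up.
    static boolean findWithTimeout(Pattern pattern, String input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        return pattern.matcher(new DeadlineCharSequence(input, deadline)).find();
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("\\d+");
        System.out.println(findWithTimeout(digits, "[21] 08/865,587", 15_000));
    }
}
```

A caller would catch the IllegalStateException per file, log the file name, and move on to the next PDF instead of hanging.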

Reformat the String after removing a word from it

I have a String "AASS MON 01 2013 365.00 HJJ Call".
I need to remove the substring HJJ from it and get this output:
AASS MON 01 2013 365.00 Call
I tried the following:
if(symbol.contains("HJJ"))
{
symbol = symbol.replace("HJJ","");
}
But with this I get
AASS MON 01 2013 365.00  Call (one extra space before Call)
whereas I want
AASS MON 01 2013 365.00 Call

Here is what I usually use:
public static String removeExtraSpaces(String input) {
return input.trim().replaceAll(" +", " ");
}
trim removes leading and trailing whitespace, while replaceAll replaces any run of spaces with one single space.
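Applied to the string from the question, the two steps can be combined as in this sketch. The class and method names (RemoveWord, removeWord) are illustrative; note that String.replace removes the text wherever it occurs, including inside longer words.

```java
public class RemoveWord {
    // Hypothetical helper: deletes every occurrence of word, then collapses the leftover spaces.
    static String removeWord(String input, String word) {
        return input.replace(word, "").trim().replaceAll(" +", " ");
    }

    public static void main(String[] args) {
        System.out.println(removeWord("AASS MON 01 2013 365.00 HJJ Call", "HJJ"));
    }
}
```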

public class Trimimg
{
public static void main(String[]args)
{
String str = "AASS MON 01 2013 365.00 HJJ Call";
String newStr = str.replace(" HJJ", "");
System.out.println(newStr);
}
}

Regular expressions: 100 errors

I'm trying to learn about regular expressions but am not doing so well after reading through the Java tutorial.
This program is supposed to take an input in one of these formats:
a) add dd dd together
b) subtract 05 from 13
c) add 02 to 03
and print dd (+ or -) dd = answer.
The (wrong) way I set this up is to have the program try to find any of the 3 matches, and continue to do so until the user inputs "bye". If there isn't a match it should just prompt the user for an input again.
Here's my code! With exactly 100 errors. :/
If anyone can help me with the syntax, it'd really be appreciated!
import java.util.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Calculator {
public static void main(String[] args){
Scanner imp = new Scanner(System.in);
System.out.println("yes> ");
String s = imp.nextLine();
if (s.equals("bye")) {
System.exit(0);
}
while (true) {
Pattern p = Pattern.compile(s); //compile string, check for formats
Matcher x = p.matcher(\badd\b\s\d\d\s\d\d\s\btogether\b); //format add 12 12 together
Matcher y = p.matcher(\bsubtract\b\s\d\d\s\d\d\s\bfrom\b); //format subtract 05 from 13
Matcjer z = p.matcher(\badd\b\s\d\d\s\bto\b\s\d\d); //format add 02 to 03
boolean b = p.matches;
boolean l = x.matches;
boolean i = y.matches;
boolean g = z.matches;
if (l.equals(true))
return (\d\d " + " \d\d " = " \d\d+\d\d);
else if (i.equals(true))
return (\d\d " + " \d\d " = " \d\d-\d\d);
else if (g.equals(true))
return (\d\d " + " \d\d " = " \d\d+\d\d);
}
}
}

Have you tried looking at your code in an IDE such as Eclipse or IntelliJ IDEA? They'll highlight the errors for you. The main one I'm seeing is that you're not putting the regular expressions in strings. Java doesn't have native regexes, so you'll need to supply them as strings. Here's an example:
Matcher x = p.matcher("\\badd\\b\\s\\d\\d\\s\\d\\d\\s\\btogether\\b"); //format add 12 12 together
Notice how I've doubled up the backslashes. This is because it's the escape character in Java as well as in regexes. The compiler will interpret the above string as \badd\b\s\d\d\s\d\d\s\btogether\b, and then the regular expression parser will interpret the escape characters properly.
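The two levels of escaping can be observed directly with a short sketch. The names (EscapeDemo, ADD_TOGETHER, isAddTogether) are made up for this illustration.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Java source: doubled backslashes. The regex engine receives single backslashes.
    static final String ADD_TOGETHER = "\\badd\\b\\s\\d\\d\\s\\d\\d\\s\\btogether\\b";

    // True if the whole input has the form "add dd dd together".
    static boolean isAddTogether(String input) {
        return Pattern.matches(ADD_TOGETHER, input);
    }

    public static void main(String[] args) {
        System.out.println(ADD_TOGETHER);                        // the text the regex engine sees
        System.out.println("\\d".length());                      // backslash plus 'd': two characters
        System.out.println(isAddTogether("add 12 12 together"));
    }
}
```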

Ugh, where to begin...
- First off, Pattern.compile() expects the regex (the format strings), while matcher() expects the string to test against.
- @Samir has shown you what was wrong with the regexes in the code itself (I edited them a bit for more clarity).
- matches is a method, so it needs (): x.matches(), not x.matches.
- You cannot call methods on primitive boolean variables; if (b) is sufficient to test whether it is true or not.
- To get specific submatches you need to use capturing groups.
- To concatenate strings you can use +.
- To output something to the console, System.out.println should be used, not return.
With the most obvious errors solved:
import java.util.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Calculator {
public static void main(String[] args){
Scanner imp = new Scanner(System.in);
System.out.println("yes> ");
while (true) {
String s = imp.nextLine();//put getting the input inside the loop or it's infinite
if (s.equals("bye")) {
System.exit(0);
}
Matcher x = Pattern.compile("add\\s(\\d+)\\s(\\d+)\\stogether").matcher(s); //format add 12 12 together
Matcher y = Pattern.compile("subtract\\s(\\d+)\\sfrom\\s(\\d+)").matcher(s); //format subtract 05 from 13
Matcher z = Pattern.compile("add\\s(\\d+)\\sto\\s(\\d+)").matcher(s); //format add 02 to 03
boolean l = x.matches();
boolean i = y.matches();
boolean g = z.matches();
if (l){
System.out.println(x.group(1) + " + " + x.group(2) + " = " +
(Integer.parseInt(x.group(1))+Integer.parseInt(x.group(2))) );
}else if (i){
// "subtract 05 from 13" means 13 - 05, so group 2 comes first
System.out.println(y.group(2) + " - " + y.group(1) + " = " +
(Integer.parseInt(y.group(2))-Integer.parseInt(y.group(1))) );
}else if (g){
System.out.println(z.group(1) + " + " + z.group(2) + " = " +
(Integer.parseInt(z.group(1))+Integer.parseInt(z.group(2))) );
}
}
}
}
