I am trying to parse PDF files with Apache Tika, passing the binary data in through a ByteArrayInputStream. Some PDF files parse fine, but for others I get an error. Earlier I was able to parse the same PDF files with Tika, and the errors only started once I switched to ByteArrayInputStream, so I think there is some problem with the byte array. This is the error I am getting:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@652489c0
And this is my code...
if (page.isBinary()) {
handleBinary(page, curURL);
}
public int handleBinary(Page page, WebURL curURL) {
try {
binaryParser.parse(page.getBinaryData());
page.setText(binaryParser.getText());
handleMetaData(page, binaryParser.getMetaData());
//System.out.println(" pdf url " +page.getWebURL().getURL());
//System.out.println("Text" +page.getText());
} catch (Exception e) {
// TODO: handle exception
}
return PROCESS_OK;
}
public class BinaryParser {
private String text;
private Map<String, String> metaData;
private Tika tika;
public BinaryParser() {
tika = new Tika();
}
public void parse(byte[] data) {
InputStream is = null;
try {
is = new ByteArrayInputStream(data);
text = null;
Metadata md = new Metadata();
metaData = new HashMap<String, String>();
text = tika.parseToString(is, md).trim();
processMetaData(md);
} catch (Exception e) {
e.printStackTrace();
} finally {
IOUtils.closeQuietly(is);
}
}
public String getText() {
return text;
}
public void setText(String text) {
this.text = text;
}
private void processMetaData(Metadata md){
if ((getMetaData() == null) || (!getMetaData().isEmpty())) {
setMetaData(new HashMap<String, String>());
}
for (String name : md.names()){
getMetaData().put(name.toLowerCase(), md.get(name));
}
}
public Map<String, String> getMetaData() {
return metaData;
}
public void setMetaData(Map<String, String> metaData) {
this.metaData = metaData;
}
}
public class Page {
private WebURL url;
private String html;
// Data for textual content
private String text;
private String title;
private String keywords;
private String authors;
private String description;
private String contentType;
private String contentEncoding;
private byte[] binaryData;
private List<WebURL> urls;
private ByteBuffer bBuf;
private final static String defaultEncoding = Configurations
.getStringProperty("crawler.default_encoding", "UTF-8");
public boolean load(final InputStream in, final int totalsize,
final boolean isBinary) {
if (totalsize > 0) {
this.bBuf = ByteBuffer.allocate(totalsize + 1024);
} else {
this.bBuf = ByteBuffer.allocate(PageFetcher.MAX_DOWNLOAD_SIZE);
}
final byte[] b = new byte[1024];
int len;
double finished = 0;
try {
while ((len = in.read(b)) != -1) {
if (finished + b.length > this.bBuf.capacity()) {
break;
}
this.bBuf.put(b, 0, len);
finished += len;
}
} catch (final BufferOverflowException boe) {
System.out.println("Page size exceeds maximum allowed.");
return false;
} catch (final Exception e) {
System.err.println(e.getMessage());
return false;
}
this.bBuf.flip();
if (isBinary) {
binaryData = new byte[bBuf.limit()];
bBuf.get(binaryData);
} else {
this.html = "";
this.html += Charset.forName(defaultEncoding).decode(this.bBuf);
this.bBuf.clear();
if (this.html.length() == 0) {
return false;
}
}
return true;
}
public boolean isBinary() {
return binaryData != null;
}
public byte[] getBinaryData() {
return binaryData;
}
Any suggestions about what I am doing wrong?
Update:
After upgrading to PDFBox 1.6.0, I started getting this error for some PDFs:
Parsing Error, Skipping Object
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@70dbdc4b
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1088)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053)
And for some PDFs I get this error:
Did not found XRef object at specified startxref position 0
Invalid dictionary, found: '' but expected: '/'
WARN [Crawler 2] Did not found XRef object at specified startxref position 0
This is a known bug in PDFBox 1.4.0. Updating to PDFBox 1.5.0 or later should fix it.
See the release notes:
[PDFBOX-578] NPE NullPointerException in PDPageNode.getCount
And this JIRA ticket.
I am uploading my app to the Play Store, but I get the error below:
Zip Path Traversal Your app contains an unsafe unzipping pattern that
may lead to a Path Traversal vulnerability. Please see this Google
Help Center article to learn how to fix the issue.
org.apache.cordova.Zip.unzipSync
I edited my source code like this LINK, but I still get the error.
Here is my changed source code:
public class Zip extends CordovaPlugin {
private static final String LOG_TAG = "Zip";
// Can't use DataInputStream because it has the wrong endian-ness.
private static int readInt(InputStream is) throws IOException {
int a = is.read();
int b = is.read();
int c = is.read();
int d = is.read();
return a | b << 8 | c << 16 | d << 24;
}
@Override
public boolean execute(String action, CordovaArgs args, final CallbackContext callbackContext) throws JSONException {
if ("unzip".equals(action)) {
unzip(args, callbackContext);
return true;
}
return false;
}
private void unzip(final CordovaArgs args, final CallbackContext callbackContext) {
this.cordova.getThreadPool().execute(new Runnable() {
public void run() {
unzipSync(args, callbackContext);
}
});
}
private void unzipSync(CordovaArgs args, CallbackContext callbackContext) {
InputStream inputStream = null;
try {
String zipFileName = args.getString(0);
String outputDirectory = args.getString(1);
// Since Cordova 3.3.0 and release of File plugins, files are accessed via cdvfile://
// Accept a path or a URI for the source zip.
Uri zipUri = getUriForArg(zipFileName);
Uri outputUri = getUriForArg(outputDirectory);
CordovaResourceApi resourceApi = webView.getResourceApi();
File tempFile = resourceApi.mapUriToFile(zipUri);
if (tempFile == null || !tempFile.exists()) {
String errorMessage = "Zip file does not exist";
callbackContext.error(errorMessage);
Log.e(LOG_TAG, errorMessage);
return;
}
File outputDir = resourceApi.mapUriToFile(outputUri);
outputDirectory = outputDir.getAbsolutePath();
outputDirectory += outputDirectory.endsWith(File.separator) ? "" : File.separator;
if (outputDir == null || (!outputDir.exists() && !outputDir.mkdirs())) {
String errorMessage = "Could not create output directory";
callbackContext.error(errorMessage);
Log.e(LOG_TAG, errorMessage);
return;
}
OpenForReadResult zipFile = resourceApi.openForRead(zipUri);
ProgressEvent progress = new ProgressEvent();
progress.setTotal(zipFile.length);
inputStream = new BufferedInputStream(zipFile.inputStream);
inputStream.mark(10);
int magic = readInt(inputStream);
if (magic != 875721283) { // CRX identifier
inputStream.reset();
} else {
// CRX files contain a header. This header consists of:
// * 4 bytes of magic number
// * 4 bytes of CRX format version,
// * 4 bytes of public key length
// * 4 bytes of signature length
// * the public key
// * the signature
// and then the ordinary zip data follows. We skip over the header before creating the ZipInputStream.
readInt(inputStream); // version == 2.
int pubkeyLength = readInt(inputStream);
int signatureLength = readInt(inputStream);
inputStream.skip(pubkeyLength + signatureLength);
progress.setLoaded(16 + pubkeyLength + signatureLength);
}
// The inputstream is now pointing at the start of the actual zip file content.
ZipInputStream zis = new ZipInputStream(inputStream);
inputStream = zis;
ZipEntry ze;
byte[] buffer = new byte[32 * 1024];
boolean anyEntries = false;
while ((ze = zis.getNextEntry()) != null) {
try {
anyEntries = true;
String compressedName = ze.getName();
if (ze.isDirectory()) {
try {
File dir = new File(outputDirectory + compressedName);
File f = new File(dir, ze.getName());
String canonicalPath = f.getCanonicalPath();
if (!canonicalPath.startsWith(dir.toString())){
dir.mkdirs();
}else {
if (inputStream != null) {
try {
inputStream.close();
} catch (IOException e) {
}
}
}
} catch (Exception e) {
String errorMessage = "An error occurred while unzipping.";
callbackContext.error(errorMessage);
Log.e(LOG_TAG, errorMessage, e);
}
} else {
File file = new File(outputDirectory + compressedName);
File f = new File(file, ze.getName());
String canonicalPath = f.getCanonicalPath();
if (!canonicalPath.startsWith(file.toString())) {
file.getParentFile().mkdirs();
if (file.exists() || file.createNewFile()) {
try {
Log.w("Zip", "extracting: " + file.getPath());
FileOutputStream fout = new FileOutputStream(file);
int count;
while ((count = zis.read(buffer)) != -1) {
fout.write(buffer, 0, count);
}
fout.close();
} catch (Exception e) {
String errorMessage = "An error occurred while unzipping.";
callbackContext.error(errorMessage);
Log.e(LOG_TAG, errorMessage, e);
}
}
}else {
if (inputStream != null) {
try {
inputStream.close();
} catch (IOException e) {
}
}
}
}
progress.addLoaded(ze.getCompressedSize());
updateProgress(callbackContext, progress);
zis.closeEntry();
} catch (Exception e) {
String errorMessage = "An error occurred while unzipping.";
callbackContext.error(errorMessage);
Log.e(LOG_TAG, errorMessage, e);
}
}
// final progress = 100%
progress.setLoaded(progress.getTotal());
updateProgress(callbackContext, progress);
if (anyEntries)
callbackContext.success();
else
callbackContext.error("Bad zip file");
} catch (Exception e) {
String errorMessage = "An error occurred while unzipping.";
callbackContext.error(errorMessage);
Log.e(LOG_TAG, errorMessage, e);
} finally {
if (inputStream != null) {
try {
inputStream.close();
} catch (IOException e) {
}
}
}
}
private void updateProgress(CallbackContext callbackContext, ProgressEvent progress) throws JSONException {
PluginResult pluginResult = new PluginResult(PluginResult.Status.OK, progress.toJSONObject());
pluginResult.setKeepCallback(true);
callbackContext.sendPluginResult(pluginResult);
}
private Uri getUriForArg(String arg) {
CordovaResourceApi resourceApi = webView.getResourceApi();
Uri tmpTarget = Uri.parse(arg);
return resourceApi.remapUri(
tmpTarget.getScheme() != null ? tmpTarget : Uri.fromFile(new File(arg)));
}
private static class ProgressEvent {
private long loaded;
private long total;
public long getLoaded() {
return loaded;
}
public void setLoaded(long loaded) {
this.loaded = loaded;
}
public void addLoaded(long add) {
this.loaded += add;
}
public long getTotal() {
return total;
}
public void setTotal(long total) {
this.total = total;
}
public JSONObject toJSONObject() throws JSONException {
return new JSONObject(
"{loaded:" + loaded +
",total:" + total + "}");
}
}
}
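The check that the Play Store warning refers to is usually written the other way round from the code above: an entry is extracted only when its canonical path stays inside the canonical path of the output directory, and it is rejected otherwise. A minimal sketch using plain java.io follows. The class ZipSlipCheckSketch and the method resolveEntry are hypothetical names for illustration and are not part of the Cordova plugin.

```java
import java.io.File;
import java.io.IOException;

final class ZipSlipCheckSketch {

    /**
     * Resolves a zip entry name against the output directory and rejects
     * names (for example "../outside.txt") that would land outside of it.
     */
    static File resolveEntry(File outputDir, String entryName) throws IOException {
        File target = new File(outputDir, entryName);
        String dirPath = outputDir.getCanonicalPath();
        String targetPath = target.getCanonicalPath();
        // Appending the separator keeps "/data/out-evil" from passing as "/data/out".
        if (!targetPath.equals(dirPath)
                && !targetPath.startsWith(dirPath + File.separator)) {
            throw new SecurityException("Zip entry escapes output directory: " + entryName);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        File out = new File(System.getProperty("java.io.tmpdir"), "unzip-out");
        System.out.println(resolveEntry(out, "images/logo.png"));
        try {
            resolveEntry(out, "../outside.txt");
        } catch (SecurityException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In the unzip loop this would be called once per entry with ze.getName(), before any directory or file is created.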
Here is the definition of a class called Note (with most of the functions omitted).
public class Note
{
private String text;
String fileName = "";
NoteManager noteManager = null;
List<String> hyperlinks = new ArrayList<String>();
public static final int BUFFER_SIZE = 512;
public Note(NoteManager noteManager) {
this.noteManager = noteManager;
this.text = "";
}
public Note(NoteManager noteManager, String content) {
this(noteManager);
if (content == null)
setText("");
else
setText(content);
}
public Note(NoteManager noteManager, CharSequence content) {
this(noteManager, content.toString());
}
....some functions....
public static Note newFromFile(NoteManager noteManager, Context context,
String filename) throws IOException
{
FileInputStream inputFileStream = context.openFileInput(filename);
StringBuilder stringBuilder = new StringBuilder();
byte[] buffer = new byte[BUFFER_SIZE];
int len;
while ((len = inputFileStream.read(buffer)) > 0)
{
String line = new String(buffer, 0, len);
stringBuilder.append(line);
buffer = new byte[Note.BUFFER_SIZE];
}
Note n = new Note(noteManager, stringBuilder.toString().trim());
n.fileName = filename;
inputFileStream.close();
return n;
}
.... some functions attributed to this class
}
These notes are managed by a class called NoteManager.java, which I have abbreviated below:
public class NoteManager
{
Context context=null;
ArrayList<Note> notes = new ArrayList<Note>();
..... some functions...
public void addNote(Note note)
{
if (note == null || note.noteManager != this || notes.contains(note)) return;
note.noteManager = this;
notes.add(note);
try
{
note.saveToFile(context);
} catch (IOException e)
{
e.printStackTrace();
}
}
....some functions....
public void loadNotes()
{
String[] files = context.fileList();
notes.clear();
for (String fname:files)
{
try
{
notes.add(Note.newFromFile(this, context, fname));
} catch (IOException e)
{
e.printStackTrace();
}
}
}
}
public void addNote(Note note)
{
if (note == null || notes.contains(note)) return;
note.noteManager = this;
notes.add(note);
try
{
note.saveToFile(context);
} catch (IOException e)
{
e.printStackTrace();
}
}
I am trying to work out why this notepad app creates random new notes when the app is fully shut down and then reopened, but I just cannot see what the problem is. I have cut out all the functions which didn't seem to relate to the problem, so the logical error must be here somewhere.
How does one go about finding what I am guessing is some kind of circular reference or missing check?
Android typically uses UTF-8, which has multi-byte characters. Creating a new String from an arbitrary byte sub-array can cause problems at the beginning and end of the sub-array as soon as the text deviates from ASCII.
public static Note newFromFile(NoteManager noteManager, Context context,
String filename) throws IOException
{
Path path = Paths.get(filename);
byte[] bytes = Files.readAllBytes(path);
String content = new String(bytes, "UTF-8");
Note n = new Note(noteManager, content.trim());
n.fileName = filename;
noteManager.addNote(n); // One registration?
return n;
}
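The multi-byte issue can be reproduced without Android. A minimal sketch that decodes the same bytes once per chunk, as the original read loop does, and once as a whole. The class Utf8ChunkSketch and its method names are hypothetical and only serve this illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ChunkSketch {

    /** Decodes each chunk separately, like new String(buffer, 0, len) in a read loop. */
    static String decodePerChunk(byte[] bytes, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < bytes.length; off += chunkSize) {
            int len = Math.min(chunkSize, bytes.length - off);
            sb.append(new String(bytes, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Decodes all bytes in one go, so no character is cut in half. */
    static String decodeWhole(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "a\u00e9"; // U+00E9 takes two bytes in UTF-8
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        System.out.println("whole:     " + decodeWhole(bytes).equals(original));
        // With a chunk size of 2 the two-byte character is split across chunks.
        System.out.println("per chunk: " + decodePerChunk(bytes, 2).equals(original));
    }
}
```

With a 512-byte buffer the same split happens whenever a multi-byte character straddles a buffer boundary, which is why reading all bytes first and decoding once is the safer order.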
The problem of having multiple instances of a note might need the addition within newFromFile, or maybe an extra check:
public void addNote(Note note)
{
if (note == null || note.noteManager != this || notes.contains(note)) {
return;
}
note.noteManager = this;
notes.add(note);
And finally a Note must be well defined.
public class Note implements Comparable<Note> {
private NoteManager noteManager;
private final String content; // Immutable.
public Note(NoteManager noteManager, String content) {
this.noteManager = noteManager;
this.content = content;
}
... compare on the immutable content
... hashCode on content
Because the content cannot be changed and notes are compared on their string content, notes cannot be duplicated, and they cannot change while inside a set, which would otherwise mix up the set's ordering.
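A sketch of what "compare on the immutable content" and "hashCode on content" could look like, assuming the content alone identifies a note. NoteSketch is a hypothetical stand-in for Note, and the NoteManager field is left out to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

final class NoteSketch implements Comparable<NoteSketch> {
    private final String content; // immutable, so equals/hashCode never change

    NoteSketch(String content) {
        this.content = content == null ? "" : content;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof NoteSketch
                && content.equals(((NoteSketch) other).content);
    }

    @Override
    public int hashCode() {
        return content.hashCode();
    }

    @Override
    public int compareTo(NoteSketch other) {
        return content.compareTo(other.content);
    }

    public static void main(String[] args) {
        List<NoteSketch> notes = new ArrayList<>();
        notes.add(new NoteSketch("buy milk"));
        // contains() relies on equals(), so a second instance with the
        // same text is recognised as already present.
        System.out.println(notes.contains(new NoteSketch("buy milk")));
    }
}
```

Without an equals override, notes.contains(note) falls back to reference identity, so a note re-read from its file is never recognised as one that is already in the list.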
I am building an app with a self-made LruDiskCache to reduce loading times. The LruDiskCache downloads a file if it is not already present and returns it via a callback. I'm having a very weird issue where the first image isn't loaded properly. This only happens the first time an activity is started. If you open the same activity a second time, the image is loaded properly.
After debugging I found that I get the following debug message: --- SkImageDecoder Factory returned null. I've read multiple issues regarding this problem, but they all involve downloading an image through an InputStream, while I'm first downloading the image to persistent storage.
My cache-class:
public class LruDiskCache {
private static final String LOGTAG = "LruDiskCache";
//Cache size
private final long cacheMaxSize;
private volatile long currentCacheSize;
public static final int DEFAULT_CACHE_SIZE_KB = 1024;
public static final int MINIMUM_CACHE_SIZE_KB = 128;
//Preferences
private final String cachePreferencesName;
private final String chachePreferencesSize = "cacheSize";
private final SharedPreferences cachePreferences;
//Lock
private final Object mDiskCacheLock = new Object();
//Cache Path
private final File cachePath;
//Initialisation
private boolean openingCache;
//Static variables
private static final long BYTES_IN_KB = 1024;
public LruDiskCache (Context c, String cacheName, int cacheMaxSizeKB){
//Preferences
cachePreferencesName = cacheName + "CachePreferences";
cachePreferences = c.getSharedPreferences(cachePreferencesName, Context.MODE_PRIVATE);
//Paths
cachePath = new File(c.getFilesDir().getPath()+"/"+cacheName);
//Cache size
if(cacheMaxSizeKB < MINIMUM_CACHE_SIZE_KB){
throw new IllegalArgumentException
("Invalid cache size, size must be bigger than "
+ MINIMUM_CACHE_SIZE_KB + "kb.");
}
cacheMaxSize = cacheMaxSizeKB * BYTES_IN_KB;
//Initialize
openingCache = true;
new InitializeCache().execute();
}
//PUBLIC METHODS
public synchronized void getRemoteFile(URL remoteFile, Callback c){
if(openingCache){
try {
mDiskCacheLock.wait(1000);
} catch (InterruptedException e) {
c.onRequestedFileRetrieved(null, false);
}
}
String path = createFilePath(remoteFile);
File f = new File(path);
if(f.exists() && f.isFile()){
f.setLastModified(System.currentTimeMillis());
c.onRequestedFileRetrieved(f, true);
} else {
new RemoteFileDownloadTask(remoteFile, f, c).execute();
}
}
//PRIVATE METHODS
private String createFilePath(URL key){
return cachePath + "/" + key.getHost() + key.getPath();
}
private synchronized void cleanUpCache(){
long cacheSize = getDirSize(cachePath);
int failedToDeleteCounter = 0;
while(cacheSize > cacheMaxSize){
File toDelete = getFirstRequestedFile(cachePath);
Log.w("LruDiskCache", "File to be deleted: " + toDelete.getName() + ".");
long toDeleteSize = toDelete.length();
if(!toDelete.delete()){
failedToDeleteCounter++;
} else {
Log.w("LruDiskCache", "Deleted file to clean cache.");
cacheSize -= toDeleteSize;
}
if(failedToDeleteCounter > 100){
Log.w("LruDiskCache", "Failed to clean cache, could not delete files.");
break;
}
}
}
private File getFirstRequestedFile(File directory){
File first = null;
for(File f : directory.listFiles()){
if(f.isFile()){
if(first == null){
first = f;
} else {
if(f.lastModified() < first.lastModified()){
first = f;
}
}
} else if (f.isDirectory()) {
File firstReqInDir = getFirstRequestedFile(f);
if(first == null){
first = firstReqInDir;
} else if (firstReqInDir.lastModified() < first.lastModified()){
first = firstReqInDir;
}
}
}
return first;
}
private long getDirSize(File dir) {
long bytes = 0;
for (File f : dir.listFiles()) {
if (f.isDirectory()) {
bytes += getDirSize(f);
} else {
bytes += f.length();
}
}
return bytes;
}
//ASYNC TASKS
private class InitializeCache implements Runnable{
public void execute(){
new Thread(this).start();
}
@Override
public void run() {
synchronized (mDiskCacheLock) {
Log.d("Initialize cache", "Starting init...");
if (!cachePath.exists()) {
if (!cachePath.mkdirs()) {
mDiskCacheLock.notifyAll();
openingCache = false;
}
}
currentCacheSize = getDirSize(cachePath);
Log.d("LruDiskCache", "Cache size: " + currentCacheSize / BYTES_IN_KB + "kb.");
if(currentCacheSize > cacheMaxSize){
cleanUpCache();
}
Log.d("Initialize cache", "Did init...");
mDiskCacheLock.notifyAll();
openingCache = false;
}
}
}
private class RemoteFileDownloadTask extends AsyncTask<Void, Void, File> {
private final Callback c;
private URL remoteFile;
private File localFile;
protected RemoteFileDownloadTask(URL remoteFile, File localFile, Callback c){
this.c = c;
this.remoteFile = remoteFile;
this.localFile = localFile;
}
@Override
protected File doInBackground(Void... params) {
if(retrieveRemoteFile(remoteFile, localFile)){
return localFile;
} else {
return null;
}
}
@Override
protected void onPostExecute(File file) {
super.onPostExecute(file);
currentCacheSize += file.length();
if(currentCacheSize > cacheMaxSize){
cleanUpCache();
}
c.onRequestedFileRetrieved(file, true);
}
public boolean retrieveRemoteFile(URL remote, File filePath) {
try {
if (filePath.exists() && filePath.isFile()) {
return true;
} else {
if (!filePath.getParentFile().exists()) {
if (!filePath.getParentFile().mkdirs()) {
return false;
}
}
}
} catch (Exception e){
return false;
}
Log.w("LruDiskCache", "Created empty file: " + filePath.getPath());
Log.w("LruDiskCache", "Downloading source from: " + remote.toString());
try {
FileOutputStream fos;
InputStream is;
BufferedInputStream bis;
fos = new FileOutputStream(filePath);
HttpURLConnection con = (HttpURLConnection) remote.openConnection();
if(con.getResponseCode() != HttpURLConnection.HTTP_OK){
Log.e("Receiver", "HTTP Response code is not OK");
return false;
}
is = con.getInputStream();
bis = new BufferedInputStream(is);
while(bis.available() > 0){
fos.write(bis.read());
}
bis.close();
is.close();
fos.close();
} catch (Exception e) {
return false;
}
Log.d("LruDiskCacheReceiver", filePath.getPath() + " received, " + filePath.length() + " bytes.");
return true;
}
}
//CALLBACK INTERFACE
public interface Callback{
public abstract void onRequestedFileRetrieved(File f, boolean success);
}
}
And the implementation is shown here:
URL emblemURL = new URL(data[position].getEmblemUrl());
cache.getRemoteFile(emblemURL, new LruDiskCache.Callback() {
@Override
public void onRequestedFileRetrieved(File f, boolean success) {
if (success && f.exists()) {
Drawable bitmap = Drawable.createFromPath(f.getAbsolutePath());
emblem.setImageDrawable(bitmap);
}
}
});
Any help is appreciated.
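One detail in retrieveRemoteFile that may be worth checking: InputStream.available() only reports how many bytes can be read without blocking, so a loop on available() > 0 can end before a network download is complete and leave a partial file behind. A minimal sketch of a copy loop that runs until read() returns -1 follows. The class StreamCopySketch and the method copy are hypothetical names, and whether this is the cause of the decoder message is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class StreamCopySketch {

    /** Copies until read() signals end of stream with -1 and returns the byte count. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8 * 1024];
        long total = 0;
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
            total += len;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for the HTTP input and the cache file.
        InputStream in = new ByteArrayInputStream(new byte[20000]);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        System.out.println(copy(in, out) + " bytes copied");
    }
}
```

In the cache class the arguments would be con.getInputStream() and the FileOutputStream for the local file.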
I am writing an app for Android that grabs meta data from SHOUTcast mp3 streams. I am using a pretty nifty class I found online that I slightly modified, but I am still having 2 problems.
1) I have to continuously ping the server to update the metadata using a TimerTask. I am not fond of this approach but it was all I could think of.
2) There is a metric tonne of garbage collection while my app is running. Removing the TimerTask got rid of the garbage collection issue so I am not sure if I am just doing it wrong or if this is normal.
Here is the class I am using:
public class IcyStreamMeta {
protected URL streamUrl;
private Map<String, String> metadata;
private boolean isError;
public IcyStreamMeta(URL streamUrl) {
setStreamUrl(streamUrl);
isError = false;
}
/**
* Get artist using stream's title
*
* @return String
* @throws IOException
*/
public String getArtist() throws IOException {
Map<String, String> data = getMetadata();
if (!data.containsKey("StreamTitle"))
return "";
try {
String streamTitle = data.get("StreamTitle");
String title = streamTitle.substring(0, streamTitle.indexOf("-"));
return title.trim();
}catch (StringIndexOutOfBoundsException e) {
return "";
}
}
/**
* Get title using stream's title
*
* @return String
* @throws IOException
*/
public String getTitle() throws IOException {
Map<String, String> data = getMetadata();
if (!data.containsKey("StreamTitle"))
return "";
try {
String streamTitle = data.get("StreamTitle");
String artist = streamTitle.substring(streamTitle.indexOf("-")+1);
return artist.trim();
} catch (StringIndexOutOfBoundsException e) {
return "";
}
}
public Map<String, String> getMetadata() throws IOException {
if (metadata == null) {
refreshMeta();
}
return metadata;
}
public void refreshMeta() throws IOException {
retreiveMetadata();
}
private void retreiveMetadata() throws IOException {
URLConnection con = streamUrl.openConnection();
con.setRequestProperty("Icy-MetaData", "1");
con.setRequestProperty("Connection", "close");
//con.setRequestProperty("Accept", null);
con.connect();
int metaDataOffset = 0;
Map<String, List<String>> headers = con.getHeaderFields();
InputStream stream = con.getInputStream();
if (headers.containsKey("icy-metaint")) {
// Headers are sent via HTTP
metaDataOffset = Integer.parseInt(headers.get("icy-metaint").get(0));
} else {
// Headers are sent within a stream
StringBuilder strHeaders = new StringBuilder();
// Read into an int first: a char can never equal -1, so the end of stream would be missed
int c;
while ((c = stream.read()) != -1) {
strHeaders.append((char) c);
if (strHeaders.length() > 5 && (strHeaders.substring((strHeaders.length() - 4), strHeaders.length()).equals("\r\n\r\n"))) {
// end of headers
break;
}
}
// Match headers to get metadata offset within a stream
Pattern p = Pattern.compile("\\r\\n(icy-metaint):\\s*(.*)\\r\\n");
Matcher m = p.matcher(strHeaders.toString());
if (m.find()) {
metaDataOffset = Integer.parseInt(m.group(2));
}
}
// In case no data was sent
if (metaDataOffset == 0) {
isError = true;
return;
}
// Read metadata
int b;
int count = 0;
int metaDataLength = 4080; // 4080 is the max length
boolean inData = false;
StringBuilder metaData = new StringBuilder();
// Stream position should be either at the beginning or right after headers
while ((b = stream.read()) != -1) {
count++;
// Length of the metadata
if (count == metaDataOffset + 1) {
metaDataLength = b * 16;
}
if (count > metaDataOffset + 1 && count < (metaDataOffset + metaDataLength)) {
inData = true;
} else {
inData = false;
}
if (inData) {
if (b != 0) {
metaData.append((char)b);
}
}
if (count > (metaDataOffset + metaDataLength)) {
break;
}
}
// Set the data
metadata = IcyStreamMeta.parseMetadata(metaData.toString());
// Close
stream.close();
}
public boolean isError() {
return isError;
}
public URL getStreamUrl() {
return streamUrl;
}
public void setStreamUrl(URL streamUrl) {
this.metadata = null;
this.streamUrl = streamUrl;
this.isError = false;
}
public static Map<String, String> parseMetadata(String metaString) {
Map<String, String> metadata = new HashMap<String, String>();
String[] metaParts = metaString.split(";");
Pattern p = Pattern.compile("^([a-zA-Z]+)=\\'([^\\']*)\\'$");
Matcher m;
for (int i = 0; i < metaParts.length; i++) {
m = p.matcher(metaParts[i]);
if (m.find()) {
metadata.put((String)m.group(1), (String)m.group(2));
}
}
return metadata;
}
}
And here is my timer:
private void getMeta() {
timer.schedule(new TimerTask() {
public void run() {
try {
icy = new IcyStreamMeta(new URL(stationUrl));
runOnUiThread(new Runnable() {
public void run() {
try {
artist.setText(icy.getArtist());
title.setText(icy.getTitle());
} catch (IOException e) {
e.printStackTrace();
} catch (StringIndexOutOfBoundsException e) {
e.printStackTrace();
}
}
});
} catch (MalformedURLException e) {
e.printStackTrace();
}
}
},0,5000);
}
Much appreciation for any assistance!
I've replaced the IcyStreamMeta class in my program and am getting the meta data from the 7.html file that is a part of the SHOUTcast spec. Far less data usage and all that so I feel it is a better option.
I am still using the TimerTask, which is acceptable. There is practically no GC any more and I am happy with using 7.html and a little regex. :)
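A minimal sketch of the 7.html approach. It assumes the page body is one comma-separated line of the form listeners,status,peak,max,unique,bitrate,song title wrapped in body tags. That layout is an assumption about SHOUTcast v1 servers and should be checked against the actual station. The class SevenHtmlSketch and the method parseSongTitle are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SevenHtmlSketch {
    // Six numeric fields, then everything up to </body> is taken as the song title.
    private static final Pattern BODY = Pattern.compile(
            "<body>\\s*(?:\\d+,){6}(.*?)\\s*</body>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the song title field, or "" when the page does not match. */
    static String parseSongTitle(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // Sample page content made up for this illustration.
        String page = "<html><body>12,1,40,100,11,128,Some Artist - Some Title</body></html>";
        System.out.println(parseSongTitle(page));
    }
}
```

Only the first six commas are treated as separators, so a title that itself contains commas is kept whole.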
I am trying to parse PDF files with Apache Tika after upgrading PDFBox to version 1.6.0, and I started getting this error for a few PDF files.
Any suggestions?
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@3a72d4e5
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1088)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
at org.apache.tika.Tika.parseToString(Tika.java:357)
at edu.uci.ics.crawler4j.crawler.BinaryParser.parse(BinaryParser.java:37)
at edu.uci.ics.crawler4j.crawler.WebCrawler.handleBinary(WebCrawler.java:223)
at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:461)
at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:129)
at java.lang.Thread.run(Thread.java:662)
WARN [Crawler 2] Did not found XRef object at specified startxref position 0
And this is my code.
if (page.isBinary()) {
handleBinary(page, curURL);
}
-------------------------------------------------------------------------------
public int handleBinary(Page page, WebURL curURL) {
try {
binaryParser.parse(page.getBinaryData());
page.setText(binaryParser.getText());
handleMetaData(page, binaryParser.getMetaData());
//System.out.println(" pdf url " +page.getWebURL().getURL());
//System.out.println("Text" +page.getText());
} catch (Exception e) {
// TODO: handle exception
}
return PROCESS_OK;
}
public class BinaryParser {
private String text;
private Map<String, String> metaData;
private Tika tika;
public BinaryParser() {
tika = new Tika();
}
public void parse(byte[] data) {
InputStream is = null;
try {
is = new ByteArrayInputStream(data);
text = null;
Metadata md = new Metadata();
metaData = new HashMap<String, String>();
text = tika.parseToString(is, md).trim();
processMetaData(md);
} catch (Exception e) {
e.printStackTrace();
} finally {
IOUtils.closeQuietly(is);
}
}
public String getText() {
return text;
}
public void setText(String text) {
this.text = text;
}
private void processMetaData(Metadata md){
if ((getMetaData() == null) || (!getMetaData().isEmpty())) {
setMetaData(new HashMap<String, String>());
}
for (String name : md.names()){
getMetaData().put(name.toLowerCase(), md.get(name));
}
}
public Map<String, String> getMetaData() {
return metaData;
}
public void setMetaData(Map<String, String> metaData) {
this.metaData = metaData;
}
}
public class Page {
private WebURL url;
private String html;
// Data for textual content
private String text;
private String title;
private String keywords;
private String authors;
private String description;
private String contentType;
private String contentEncoding;
// binary data (e.g, image content)
// It's null for html pages
private byte[] binaryData;
private List<WebURL> urls;
private ByteBuffer bBuf;
private final static String defaultEncoding = Configurations
.getStringProperty("crawler.default_encoding", "UTF-8");
public boolean load(final InputStream in, final int totalsize,
final boolean isBinary) {
if (totalsize > 0) {
this.bBuf = ByteBuffer.allocate(totalsize + 1024);
} else {
this.bBuf = ByteBuffer.allocate(PageFetcher.MAX_DOWNLOAD_SIZE);
}
final byte[] b = new byte[1024];
int len;
double finished = 0;
try {
while ((len = in.read(b)) != -1) {
if (finished + b.length > this.bBuf.capacity()) {
break;
}
this.bBuf.put(b, 0, len);
finished += len;
}
} catch (final BufferOverflowException boe) {
System.out.println("Page size exceeds maximum allowed.");
return false;
} catch (final Exception e) {
System.err.println(e.getMessage());
return false;
}
this.bBuf.flip();
if (isBinary) {
binaryData = new byte[bBuf.limit()];
bBuf.get(binaryData);
} else {
this.html = "";
this.html += Charset.forName(defaultEncoding).decode(this.bBuf);
this.bBuf.clear();
if (this.html.length() == 0) {
return false;
}
}
return true;
}
public boolean isBinary() {
return binaryData != null;
}
public byte[] getBinaryData() {
return binaryData;
}
Are you sure that you don't accidentally truncate the PDF document when you load it into the binary buffer in the Page class?
There are multiple potential problems in your Page.load() method. To start with, the finished + b.length > this.bBuf.capacity() should be finished + len > this.bBuf.capacity() since the read() method could have returned fewer than b.length bytes. Also, are you sure that the totalsize argument you give is accurate? Finally, it could be that the given document is larger than the MAX_DOWNLOAD_SIZE limit.
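A minimal sketch of a load loop that avoids the comparison problem, assuming the goal is to read at most a fixed number of bytes and to report an oversized document instead of truncating it silently. The class BoundedReadSketch, the method readUpTo and the maxBytes parameter are hypothetical and not part of crawler4j.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class BoundedReadSketch {

    /**
     * Reads the whole stream into a byte array.
     * Returns null if the stream holds more than maxBytes bytes, so the
     * caller never hands a silently truncated document to a parser.
     */
    static byte[] readUpTo(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int total = 0;
        int len;
        while ((len = in.read(chunk)) != -1) {
            // Compare with the number of bytes actually read (len),
            // not with the size of the chunk array.
            if (total + len > maxBytes) {
                return null;
            }
            out.write(chunk, 0, len);
            total += len;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[3000];
        byte[] copy = readUpTo(new ByteArrayInputStream(data), 4096);
        System.out.println(copy == null ? "too large" : copy.length + " bytes read");
    }
}
```

Because the output buffer grows as needed, the result no longer depends on whether the totalsize value reported for the page is accurate.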