How can I add all elements in HTML with Jsoup? - java

File input = new File("1727209867.htm");
Document doc = Jsoup.parse(input, "UTF-8","http://www.facebook.com/people/Alison-Vella/1727209867");
I am trying to parse this HTML file, which is saved locally on my system, but the parse does not cover the whole document, so I can't reach the information I need. With this code only about 6k characters are parsed, while the actual HTML file has about 60k characters.

This is not possible in Jsoup directly, but here is a workaround:
final File input = new File("example.html");
final int maxLength = 6000; // limit of chars to read
InputStream is = new FileInputStream(input); // open the file for reading
StringBuilder sb = new StringBuilder(maxLength); // init the buffer with the required size
int count = 0; // number of chars read so far
int c; // char for reading
while( ( c = is.read() ) != -1 && count < maxLength ) // read single chars until the limit is reached
{
    sb.append((char) c); // save the char into the buffer
    count++; // increment the number of chars read
}
is.close(); // release the file handle
Document doc = Jsoup.parse(sb.toString()); // parse the HTML from the buffer
Explained:
Read the file char by char into a buffer until you reach the limit
Parse the text from the buffer and process it with Jsoup
Problem: this won't take care of closing tags etc. - it stops reading exactly at the limit, even if that is in the middle of a tag.
(Possible) solutions (one of them is sketched below):
ignore this, stop exactly where you are, parse the result, and fix or drop the dangling HTML
if you are at the limit, read on until you reach the next closing tag or > char
if you are at the limit, read on until you reach the next block-level tag
if you are at the limit, read on until a specific tag or comment
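For example, the second option (read on until the next > once the limit is hit) could look roughly like this. This is only a sketch: the class and method names are made up, and reading single bytes as chars assumes a single-byte encoding, just like the snippet above.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PartialParse {
    // Sketch of option 2: once the char limit is reached, keep reading until
    // the next '>' so the buffer does not end in the middle of a tag.
    public static Document parsePartial(File input, int maxLength) throws IOException {
        StringBuilder sb = new StringBuilder(maxLength);
        try (InputStream is = new FileInputStream(input)) {
            int c;
            while ((c = is.read()) != -1) {
                sb.append((char) c);
                if (sb.length() >= maxLength && c == '>') {
                    break; // stop at the first tag boundary after the limit
                }
            }
        }
        return Jsoup.parse(sb.toString());
    }
}
Even with this, the buffer can still end inside an unclosed element, so Jsoup's lenient parser will have to fix up whatever is left dangling.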

Related

How to modify a given String (from CSV)

I need to write a program for a university project that should cut specific parts out of a given CSV file. I've started already, but I don't know how to keep only the content (sentence and vote values) or, at a minimum, how to remove the date part.
PARENT,"Lorem ipsum...","3","0","Town","09:17, 29/11/2016"
REPLY,"Loren ipsum...”,"2","0","Town","09:18, 29/11/2016"
After the program ran I want to have it like this:
Lorem ipsum... (String) 3 (int) 0 (int)
Loren ipsum... (String) 2 (int) 0 (int)
I have no problem with writing a parser (read in, remove separators), but I don't know how to realize this.
You can create your own data structure that holds a string and two integers, and then do the following while reading from the CSV file. Only include the columns you want, based on the column number, which is the index into the String array returned by split().
Scanner reader = new Scanner(new File("path to your CSV File"));
ArrayList<DataStructure> csvData = new ArrayList<>();
while (reader.hasNextLine())
{
    String[] csvLine = reader.nextLine().split(",");
    // the sample data wraps every field in double quotes, so strip them
    // before storing the sentence and parsing the vote counts
    DataStructure data = new DataStructure(
            csvLine[1].replace("\"", ""),
            Integer.parseInt(csvLine[2].replace("\"", "")),
            Integer.parseInt(csvLine[3].replace("\"", "")));
    csvData.add(data);
}
reader.close();
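The DataStructure class used above is not defined in the question or the answer; a minimal, hypothetical version that holds the sentence and the two vote counts could look like this (the field names are assumptions):
// Hypothetical holder class assumed by the snippet above: one sentence
// plus the two vote values parsed from the CSV line.
public class DataStructure {
    private final String sentence;
    private final int upVotes;
    private final int downVotes;

    public DataStructure(String sentence, int upVotes, int downVotes) {
        this.sentence = sentence;
        this.upVotes = upVotes;
        this.downVotes = downVotes;
    }

    @Override
    public String toString() {
        return sentence + " " + upVotes + " " + downVotes;
    }
}
With that in place, printing each element of csvData gives output in the desired "Lorem ipsum... 3 0" form.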

Splitting binary file on tags?

I have a ModSecurity log file that contains parts holding either text or binary data. I need to split this file according to the tags noted at the start of each part, so I can filter the data for permanent storage.
So for example I have:
--tag1--
<text>
--tag2--
<binary data>
--tag3--
<text>
At first I thought it was all text, so I made a parser that handles the different pieces by reading each line and using a pattern to check whether it marks a new part. But now I need to read the file as binary. So what would be the best way to achieve this?
So far I've made a test that extracts a specific part by keeping the last several characters in a String buffer to check for the start tag, and then starts printing once the buffer contains that string. The same is done to stop. However, since the buffer needs to fill up before it can check the end tag, the end tag will already have been added to the byte array, so once the part is complete I remove the final bytes from the array to get the part I need.
public byte[] binaryDataReader(String startTag, String endTag) throws IOException {
    File file = new File("20160926-161148-V#ksog7ZjVRfyQUPtAdOmgAAAAM");
    try (FileInputStream fis = new FileInputStream(file);
         ByteArrayOutputStream buffer = new ByteArrayOutputStream()) {
        System.out.println("Total file size to read (in bytes) : " + fis.available());
        int content;
        String lastChars = "";
        String status = "nok";
        while ((content = fis.read()) != -1) {
            if (lastChars.length() > 14) {
                lastChars = lastChars.substring(lastChars.length() - 14, lastChars.length()) + (char) content;
            } else {
                lastChars += (char) content;
            }
            if (status.equals("ok")) {
                buffer.write(content);
            }
            if (lastChars.equals(startTag)) {
                status = "ok";
            } else if (lastChars.equals(endTag)) {
                status = "nok";
            }
        }
        buffer.flush();
        byte[] data = buffer.toByteArray();
        data = Arrays.copyOf(data, data.length - 15);
        return data;
    } catch (IOException e) {
        //log
        throw e;
    }
}
Now I need to make this a general solution for many more tags by including patterns. But I was wondering: is this a decent way of splitting a binary file, or is there a better/easier way to achieve this?
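One possible way to generalize this is sketched below. It is a rough, untested sketch that reads the whole file into memory; the delimiter layout (--tagN--), the class name LogSplitter, and the method name split are assumptions based on the example above rather than real ModSecurity part headers.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class LogSplitter {

    // Reads the whole log into memory and cuts out the section that follows
    // each "--tag--" delimiter. Parts are returned as raw bytes, so text and
    // binary sections are handled the same way.
    public static Map<String, byte[]> split(String path, String... tags) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(path));
        Map<String, byte[]> parts = new LinkedHashMap<>();
        byte[] dashes = "--".getBytes(StandardCharsets.ISO_8859_1);
        for (String tag : tags) {
            byte[] delim = ("--" + tag + "--").getBytes(StandardCharsets.ISO_8859_1);
            int start = indexOf(data, delim, 0);
            if (start < 0) {
                continue; // this tag does not occur in the file
            }
            start += delim.length;
            // A part ends where the next delimiter begins, or at end of file.
            // Caution: binary payloads may contain "--" as well, see note below.
            int end = indexOf(data, dashes, start);
            if (end < 0) {
                end = data.length;
            }
            parts.put(tag, Arrays.copyOfRange(data, start, end));
        }
        return parts;
    }

    // Naive byte-sequence search; fine for a sketch, slow for huge files.
    private static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = from; i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }
}
In a real log you would probably want to end each part at the next full --tagN-- delimiter instead of at a bare --, since binary payloads can legitimately contain dashes.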

How to skip a part of the file then read a line?

I have code that reads a file using a BufferedReader and split(). The file was created via a method that automatically adds 4KB of empty space at the beginning, which results in the following when I read it:
First, the code:
BufferedReader metaRead = new BufferedReader(new FileReader(metaFile));
String metaLine = "";
String[] metaData = new String[100000];
while ((metaLine = metaRead.readLine()) != null){
    metaData = metaLine.split(",");
    for (int i = 0; i < metaData.length; i++){
        System.out.println(metaData[i]);
    }
}
This is the result; keep in mind this file already exists and contains the values:
//4096 spaces then the first actual word in the document which is --> testTable2
Name
java.lang.String
true
No Reference
Is there a way to skip the first 4096 spaces and get straight to the actual values in the file, so I can get the result normally? I'll be using the metaData array later in other operations, and I'm pretty sure the spaces will mess up the number of slots in the array. Any suggestions would be appreciated.
If you're using Eclipse, the auto-completion should help.
metaRead.skip(4096);
https://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
You could (as mentioned) simply do:
metaRead.skip(4096);
if the whitespace always occupies exactly that many characters. Alternatively, you could simply skip lines that are empty:
while ((metaLine = metaRead.readLine()) != null){
    if (metaLine.trim().length() > 0){
        metaData = metaLine.split(",");
        for (int i = 0; i < metaData.length; i++){
            System.out.println(metaData[i]);
        }
    }
}
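If the padding is whitespace but its length is not guaranteed to be exactly 4096 characters, another option (a small sketch of my own, not taken from the answers above) is to skip leading whitespace with mark()/reset() before the read loop:
import java.io.BufferedReader;
import java.io.IOException;

// Skips any leading whitespace so the first readLine() starts at real content.
// Assumes the reader supports mark/reset, which BufferedReader does.
static void skipLeadingWhitespace(BufferedReader reader) throws IOException {
    int c;
    reader.mark(1);                     // remember the position before each read
    while ((c = reader.read()) != -1 && Character.isWhitespace(c)) {
        reader.mark(1);                 // still inside the padding, move the mark forward
    }
    reader.reset();                     // step back so the first real char is not consumed
}
Calling skipLeadingWhitespace(metaRead) right after opening the reader would then let the existing readLine() loop start at testTable2.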

Java - PDFBox - ReplaceString - Issues with parsed tokens (possibly encoding?)

I've been struggling with an issue related to PDFBox and PDF editing. I have been assigned the task of editing a couple of strings in a given PDF file and outputting a mirrored version of the file with the edited strings in it. I've been told that the problem has been solved in the past using this tool, so I have been told to do the same. The function I am using is this:
public void doIt( String inputFile, String outputFile, String strToFind, String message)
    throws IOException, COSVisitorException
{
    // the document
    PDDocument doc = null;
    try
    {
        doc = PDDocument.load( inputFile );
        List pages = doc.getDocumentCatalog().getAllPages();
        for( int i=0; i<pages.size(); i++ )
        {
            PDPage page = (PDPage)pages.get( i );
            PDStream contents = page.getContents();
            PDFStreamParser parser = new PDFStreamParser(contents.getStream() );
            parser.parse();
            List tokens = parser.getTokens();
            for( int j=0; j<tokens.size(); j++ )
            {
                Object next = tokens.get( j );
                if( next instanceof PDFOperator )
                {
                    PDFOperator op = (PDFOperator)next;
                    //Tj and TJ are the two operators that display
                    //strings in a PDF
                    if( op.getOperation().equals( "Tj" ) )
                    {
                        //Tj takes one operand and that is the string
                        //to display, so let's update that operand
                        COSString previous = (COSString)tokens.get( j-1 );
                        String string = previous.getString();
                        string = string.replaceFirst( strToFind, message );
                        previous.reset();
                        previous.append( string.getBytes("ISO-8859-1") );
                    }
                    else if( op.getOperation().equals( "TJ" ) )
                    {
                        COSArray previous = (COSArray)tokens.get( j-1 );
                        for( int k=0; k<previous.size(); k++ )
                        {
                            Object arrElement = previous.getObject( k );
                            if( arrElement instanceof COSString )
                            {
                                COSString cosString = (COSString)arrElement;
                                String string = cosString.getString();
                                string = string.replaceFirst( strToFind, message );
                                cosString.reset();
                                cosString.append( string.getBytes("ISO-8859-1") );
                            }
                        }
                    }
                }
            }
            //now that the tokens are updated we will replace the
            //page content stream.
            PDStream updatedStream = new PDStream(doc);
            OutputStream out = updatedStream.createOutputStream();
            ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
            tokenWriter.writeTokens( tokens );
            page.setContents( updatedStream );
        }
        doc.save( outputFile );
    }
    finally
    {
        if( doc != null )
        {
            doc.close();
        }
    }
}
Which is the code that is being used in a file contained into the PDFBox examples (https://svn.apache.org/repos/asf/pdfbox/tags/1.5.0/pdfbox/src/main/java/org/apache/pdfbox/examples/pdmodel/ReplaceString.java).
The file I have been given, however, is not being modified at all by this function. Nothing happens. Upon further inspection, I decided to analyze the sequence of tokens produced by the parser. The file is being parsed correctly in everything other than the COSString elements, which contain gibberish characters that look like they have been wrongly encoded (a bunch of random symbols and numbers). I tried parsing other documents, and the function works with some of them, but not with everything I passed as input (a LaTeX output file was modified correctly and had correctly encoded COSStrings, whereas other automatically generated PDFs produced no results and gibberish COSString content). I am also fairly sure the rest of the structure is being read correctly, since I rebuild the output into a different file, and the output file looks exactly the same as the input, which seems to mean the file structure is being analyzed correctly. The file contains Identity-H encoded fonts.
I tried parsing the very same file using the PDFTextStripper (which extracts text from PDFs), and the parsing output from there returns the correct text output, using this:
PDFTextStripper pdfStripper = new PDFTextStripper("UTF-8");
String result = pdfStripper.getText(doc);
System.out.println(result);
Could it be an encoding issue? Can I tell the PDFStreamParser (or whoever holds the responsibility) to force an encoding on read? Is it even an encoding issue, given that the text extraction works correctly?
Thanks in advance for the help.
Some files use font subsets. Let's say that the subset uses only the characters E, G, L, and O. So GOOGLE would appear in the file as the hex byte values 2, 4, 4, 2, 3, and 1.
Now if you want to change GOOGLE into APPLE you'll have three problems:
1) your subset doesn't contain the characters A and P
2) the size will be different
3) it is quite possible that the string you're searching for is split into several parts.
Btw the current version is 1.8.10. The ReplaceString utility has been removed in the upcoming 2.0 version to avoid giving the illusion that characters can easily be replaced.
This answer is somewhat speculative, because you haven't linked to a PDF.
Inside a PDF, text can be stored in two places:
Content stream
XObject inside the page resources
Inside the content stream, text is mostly associated with the TJ or Tj operator. But the text associated with Tj or TJ is not always in ASCII format; it may be raw byte values. We can extract text from these byte values by mapping character codes to Unicode values using the proper encoding and mapping. While extracting text we use that mapping and encoding, but we do not have a reverse mapping to check which character code a given glyph belongs to. So basically we should replace the character codes of the string to be replaced with the character codes of the new string.
Example:
1. (Text) Tj
2. (12 45 5 3)Tj
Also, we should replace the string in the content stream as well as in XObjects (if present) inside the resources.
So I think this might be helpful.
Good luck!

Generating a .ov2 file with Java

I am trying to figure out how to create a .ov2 file to add POI data to a TomTom GPS device. The format of the data needs to be as follows:
An OV2 file consists of POI records. Each record has the following data format.
1 BYTE, char, POI status ('0' or '2')
4 BYTES, long, denotes length of the POI record.
4 BYTES, long, longitude * 100000
4 BYTES, long, latitude * 100000
x BYTES, string, label for POI, x == total length - (1 + 3 * 4)
Terminating null byte.
I found the following PHP code that is supposed to take a .csv file, go through it line by line, split each record, and then write it into a new file in the proper format. I was hoping someone would be able to help me translate this to Java. I really only need the line I marked with the '--->' arrow. I do not know PHP at all, but everything other than that one line is basic enough that I can look at it and translate it; I just do not know what the PHP functions on that one line are doing. Even an explanation might be enough for me to figure it out in Java. If you can translate it directly, please do, but even an explanation would be helpful. Thanks.
<?php
$csv = file("File.csv");
$nbcsv = count($csv);
$file = "POI.ov2";
$fp = fopen($file, "w");
for ($i = 0; $i < $nbcsv; $i++) {
    $table = split(",", chop($csv[$i]));
    $lon = $table[0];
    $lat = $table[1];
    $des = $table[2];
    --->$TT = chr(0x02).pack("V",strlen($des)+14).pack("V",round($lon*100000)).pack("V",round($lat*100000)).$des.chr(0x00);
    #fwrite($fp, "$TT");
}
fclose($fp);
?>
Load a file into an array, where each element is a line from the file.
$csv = file("File.csv");
Count the number of elements in the array.
$nbcsv = count($csv);
Open output file for writing.
$file = "POI.ov2";
$fp = fopen($file, "w");
Loop while $i is less than the number of array items, incrementing $i each time.
for ($i = 0; $i < $nbcsv; $i++) {
Right-trim the line (remove trailing whitespace), and split the string on ','. $table is an array of values from the CSV line.
$table = split(",", chop($csv[$i]));
Assign component parts of the table to their own variables by numeric index.
$lon = $table[0];
$lat = $table[1];
$des = $table[2];
The tricky bit.
chr(02) is literally character code number 2.
pack is a binary processing function. It takes a format and some data.
V = unsigned long (always 32 bit, little endian byte order).
I'm sure you can work out the maths bits, but you need to convert them into little endian order 32 bit values.
. is a string concat operator.
Finally it is terminated with chr(0). Null char.
$TT = chr(0x02).
pack("V",strlen($des)+14).
pack("V",round($lon*100000)).
pack("V",round($lat*100000)).
$des.chr(0x00);
Write it out and close the file.
#fwrite($fp, "$TT");
}
fclose($fp);
The key in Java is to apply the proper byte order, ByteOrder.LITTLE_ENDIAN, to the ByteBuffer.
The whole function:
private static boolean getWaypoints(ArrayList<Waypoint> geopoints, File f)
{
    try {
        FileOutputStream fs = new FileOutputStream(f);
        for (int i = 0; i < geopoints.size(); i++)
        {
            fs.write((byte) 0x02);
            String desc = geopoints.get(i).getName();
            int poiLength = desc.length() + 14;
            fs.write(ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(poiLength).array());
            int lon = (int) Math.round((geopoints.get(i).getLongitudeE6() / 1E6) * 100000);
            fs.write(ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(lon).array());
            int lat = (int) Math.round((geopoints.get(i).getLatitudeE6() / 1E6) * 100000);
            fs.write(ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(lat).array());
            fs.write(desc.getBytes());
            fs.write((byte) 0x00);
        }
        fs.close();
        return true;
    }
    catch (Exception e)
    {
        return false;
    }
}
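The Waypoint type used above is not shown in the answer; a minimal, hypothetical version that provides the getters the code relies on, with coordinates stored in microdegrees as the E6 names suggest, might look like this:
// Hypothetical minimal Waypoint matching the getLatitudeE6()/getLongitudeE6()
// calls used above; coordinates are stored as microdegrees (degrees * 1E6).
class Waypoint {
    private final String name;
    private final int latitudeE6;
    private final int longitudeE6;

    Waypoint(String name, double latitude, double longitude) {
        this.name = name;
        this.latitudeE6 = (int) Math.round(latitude * 1E6);
        this.longitudeE6 = (int) Math.round(longitude * 1E6);
    }

    String getName() { return name; }
    int getLatitudeE6() { return latitudeE6; }
    int getLongitudeE6() { return longitudeE6; }
}

// Example call (file name and coordinates are made up):
// ArrayList<Waypoint> pois = new ArrayList<>();
// pois.add(new Waypoint("Coffee Shop", 52.370216, 4.895168));
// getWaypoints(pois, new File("POI.ov2"));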
