Search for a string in html file using Jsoup


Can anyone help me search for a particular string in an HTML file, using Jsoup or any other method? The built-in methods extract the title or the text inside specific tags, not an arbitrary string.
In the code below I use one such built-in method to extract the title from the HTML page.
I want to search for a string instead.
package dynamic_tester;
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class tester {
public static void main(String args[])
{
Document htmlFile = null;
{
try {
htmlFile = Jsoup.parse(new File("x.html"), "ISO-8859-1");
}
catch (IOException e)
{
e.printStackTrace();
}
String title = htmlFile.title();
System.out.println("Title = "+title);
}
}
}

Here's a sample. It reads the HTML file into a string and then searches that string.
package com.example;
import java.io.FileInputStream;
import java.nio.charset.Charset;
public class SearchTest {
public static void main(String[] args) throws Exception {
StringBuffer htmlStr = getStringFromFile("test.html", "ISO-8859-1");
boolean isPresent = htmlStr.indexOf("hello") != -1;
System.out.println("is Present ? : " + isPresent);
}
private static StringBuffer getStringFromFile(String fileName, String charSetOfFile) {
StringBuffer strBuffer = new StringBuffer();
try(FileInputStream fis = new FileInputStream(fileName)) {
byte[] buffer = new byte[10240]; //10K buffer;
int readLen = -1;
while( (readLen = fis.read(buffer)) != -1) {
strBuffer.append( new String(buffer, 0, readLen, Charset.forName(charSetOfFile)));
}
} catch(Exception ex) {
ex.printStackTrace();
strBuffer = new StringBuffer();
}
return strBuffer;
}
}
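The same idea can be sketched more compactly by decoding the whole file in one step, which also avoids splitting a multi-byte character across two buffer reads when the charset is not single-byte. This is only a sketch: the class name `HtmlStringSearch`, both method names, and the default file name are assumptions for illustration. Note that it searches the raw markup, so tag and attribute names match too; with Jsoup, searching the result of `Document.text()` instead would restrict the match to the page's text.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

public class HtmlStringSearch {
    // Case-insensitive substring check on already-decoded content.
    static boolean containsIgnoreCase(String content, String needle) {
        return content.toLowerCase(Locale.ROOT).contains(needle.toLowerCase(Locale.ROOT));
    }

    // Reads and decodes the whole file at once, then searches it.
    static boolean fileContains(Path file, Charset charset, String needle) throws IOException {
        String content = new String(Files.readAllBytes(file), charset);
        return containsIgnoreCase(content, needle);
    }

    public static void main(String[] args) throws IOException {
        // "test.html" is a hypothetical default; pass a real path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "test.html");
        boolean present = fileContains(file, StandardCharsets.ISO_8859_1, "hello");
        System.out.println("is present? " + present);
    }
}
```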

Related

How can I compile a Java class stored in a string, or using its path with a given input?

I'm trying to make my own pretty printer for Java files, similar to JDoodle. How can I compile a Java class, given either its location or its content as a string, run it with a text file as standard input, and record the output as a separate string? Sorry if this seems troublesome. Any help is appreciated!
EDIT: I do know about javax.tools.ToolProvider and Tool, but even if that is the solution, I don't know what to do with it, as the documentation is too confusing for me, or too sparse.

OK, I got an answer. I used Eclipse's compiler (because I don't have a JDK on my school laptop) to compile, used ProcessBuilder to run the produced .class file, and redirected the output with redirectOutput to a file, which I then read to get the output. Here is the code.
/*PRETTYPRINT*/
/*
* Code to HTML
* Uses highlightjs in order to create a html form for your code, you can also give inputs and outputs
* */
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
public class PrettyPrint {
public static void main(String[] args) throws FileNotFoundException{
String javaFile = readFile(args[0]);
String commandLine = readFile(args[1]);
String output = readFile(args[2]);
String html = "<!DOCTYPE html>\n"
+"<html>\n"
+"<head>"
+"<link rel=\"stylesheet\" href=\"highlightjs/styles/a11y-dark.css\" media= \"all\">\r\n"
+"<script src=\"highlightjs/highlight.pack.js\"></script>\r\n"
+"<script>hljs.initHighlightingOnLoad();</script>"
+"<script src=\"https://cdnjs.cloudflare.com/ajax/libs/jspdf/1.5.3/jspdf.debug.js\" integrity=\"sha384-NaWTHo/8YCBYJ59830LTz/P4aQZK1sS0SneOgAvhsIl3zBu8r9RevNg5lHCHAuQ/\" crossorigin=\"anonymous\"></script>\r\n"
+"<script src=\"https://cdn.jsdelivr.net/npm/html2canvas@1.0.0-rc.5/dist/html2canvas.min.js\"></script>"
+"<meta charset=\"utf-8\">"
+"<style>code{overflow-x: visible;}body{background-color:#888888;color:#444444;}h1{text-align:center;color:#444444;}</style>"
+"</head>"
+"<body style=\"font-family: 'Consolas';\">\n"
+"<h1 style=\"text-align: center\">Java Code</h1>"
+"<pre><code class=\"java\" style=\"overflow-x:visible\">"
+toHTML(javaFile)
+"</code></pre>"
+"<br>\n"
+"<h1>Inputs</h1>"
+"<pre><code class = \"nohighlight hljs\" style=\"overflow-x:visible\">"
+toHTML(commandLine)
+"</code></pre>"
+"<br>\n"
+"<h1>Output</h1>"
+"<pre><code class = \"nohighlight hljs\" style=\"overflow-x:visible\">"
+toHTML(output)
+"</code></pre>"
+"</body>\n"
+"<script>"
+"console.log(document.body.innerHTML);"
//+String.format("function print(){const filename='%s';html2canvas(document.body).then(canvas=>{let pdf = new jsPDF('p','mm', 'a4');pdf.addImage(canvas.toDataURL('image/png'), 'PNG', 0, 0, 1000, 1000);pdf.save(filename);});}print();",args[3].substring(args[3].lastIndexOf('/')+1, args[3].length()-4)+"pdf")
+ "</script>"
+"</html>\n";
//System.out.println(html);
try {
File file = new File("output.html");
PrintWriter fileWriter = new PrintWriter(file);
fileWriter.print(html);
fileWriter.close();
} catch(IOException e) {
e.printStackTrace();
}
}
public static String toHTML(String str) {
String html = str;
html = html.replace("&","&amp;");
html = html.replace("\"", "&quot;");
html = html.replace("\'", "&apos;");
html = html.replace("<", "&lt;");
html = html.replace(">", "&gt;");
//html = html.replace("\n", "<br>");
html = html.replace("\t", "  ");
html+= "<br>";
return html;
}
public static String readFile(String filePath)
{
String content = "";
try
{
content = new String ( Files.readAllBytes( Paths.get(filePath) ) );
}
catch (IOException e)
{
e.printStackTrace();
}
return content;
}
}
/**PROCESSBUILDEREXAMPLE**/
import java.io.*;
import org.eclipse.jdt.core.compiler.CompilationProgress;
import org.eclipse.jdt.core.compiler.batch.BatchCompiler;
public class ProcessBuilderExample {
private static String JAVA_FILE_LOCATION;
public static void main(String args[]) throws IOException{
JAVA_FILE_LOCATION = args[0];
CompilationProgress progress = null;
BatchCompiler.compile(String.format("-classpath rt.jar %s",args[0]), new PrintWriter(System.out), new PrintWriter(System.err), progress);
Process process = new ProcessBuilder("java", "-cp",
JAVA_FILE_LOCATION.substring(0,JAVA_FILE_LOCATION.lastIndexOf("\\")),
JAVA_FILE_LOCATION.substring(JAVA_FILE_LOCATION.lastIndexOf("\\")+1,JAVA_FILE_LOCATION.length()-5))
.redirectInput(new File(args[1]))
.redirectOutput(new File(args[2])).start();
try {
process.waitFor();
PrettyPrint.main(args);
} catch(Exception e) {
e.printStackTrace();
}
}
}
Keep these two classes in the same folder and run ProcessBuilderExample with three arguments: the path of the Java source file, the path of the input file, and the path of the output file to write to.
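When a JDK is available, the `javax.tools` API mentioned in the question can replace the Eclipse compiler. The following is a minimal sketch under that assumption, not the approach used above: the class name `CompileAndRun`, the method `compileAndRun`, and the temporary-directory layout are all hypothetical. It writes the source to a temporary directory, compiles it, runs it in a child JVM with the given text as standard input, and returns the captured standard output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class CompileAndRun {
    // Compiles `source` as class `className`, runs it with `stdin` as standard
    // input, and returns what the program wrote to standard output.
    static String compileAndRun(String className, String source, String stdin) {
        try {
            Path dir = Files.createTempDirectory("compile-and-run");
            Path sourceFile = dir.resolve(className + ".java");
            Path inputFile = dir.resolve("stdin.txt");
            Files.write(sourceFile, source.getBytes(StandardCharsets.UTF_8));
            Files.write(inputFile, stdin.getBytes(StandardCharsets.UTF_8));

            // Returns null when no compiler is bundled with the runtime (JRE only).
            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            if (compiler == null) {
                throw new IllegalStateException("No system Java compiler; a JDK is required");
            }
            int status = compiler.run(null, null, null,
                    "-d", dir.toString(), sourceFile.toString());
            if (status != 0) {
                throw new IllegalStateException("Compilation failed with status " + status);
            }

            // Launch the same JVM that is running this program.
            String javaBin = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
            Process process = new ProcessBuilder(javaBin, "-cp", dir.toString(), className)
                    .redirectInput(inputFile.toFile())
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            byte[] output = process.getInputStream().readAllBytes();
            process.waitFor();
            return new String(output, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String source = "import java.util.Scanner;\n"
                + "public class Echo {\n"
                + "  public static void main(String[] a) {\n"
                + "    System.out.print(\"Hello, \" + new Scanner(System.in).nextLine());\n"
                + "  }\n"
                + "}\n";
        System.out.println(compileAndRun("Echo", source, "world\n"));
    }
}
```

The child's standard error is inherited rather than captured, so runtime errors from the compiled program appear on the console of the calling program.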

Encoding with UTF-16 in Java

I'm trying to read/write a .txt file in UTF-16 so that I can input/output Japanese characters into/from my program. I have read many similar questions, articles and the Java Docs, virtually copied their code and still can't figure out where I am going wrong. If I output it to the console, or whenever I check the contents of the file (using the correct encoding) all I see is a '?' in place of 'あ'.
Application class:
public class App {
public static void main(String[] args) {
String[] s = {"あ"}; //A test String array
FileReader.write("unicode.txt", "UTF-16", s, false);
System.out.println("File: " + FileReader.read("unicode.txt", "UTF-16") + " Hard-coded example: あ");
}
}
FileReader class:
import java.io.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.nio.charset.Charset;
public class FileReader {
public static String[] read(String fileName, String encoding) {
ArrayList<String> content = new ArrayList<String>();
try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(fileName), Charset.forName(encoding).newDecoder()))) {
for(String s = reader.readLine(); s != null; s = reader.readLine()) {
content.add(s);
}
reader.close();
} catch(IOException e) {
System.out.println("An IOException(Input) has been thrown.");
e.printStackTrace();
}
return convertToStringArray(content);
}
public static void write(String fileName, String encoding, String[] content, boolean append) {
try(BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fileName, append), Charset.forName(encoding).newEncoder()))) {
for(String s : content) {
writer.write(s);
writer.newLine();
}
writer.close();
} catch (IOException e) {
System.out.println("An IOException(appending=" + append + ") has been thrown.");
e.printStackTrace();
}
}
private static String[] convertToStringArray(ArrayList<String> list) {
String[] array = new String[list.size()];
list.toArray(array);
return array;
}
}
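The post has no confirmed answer, so the following is only a sketch for narrowing the problem down. It checks a UTF-16 round trip without relying on the source file's encoding (the character is written as the escape `\u3042`) or on the console's encoding (the result is reported as a code point). If this prints U+3042, the file I/O is fine and the '?' may instead come from how the source file is compiled or how the console renders output; that explanation is an assumption, not something the post establishes. The class name `Utf16RoundTrip` and its helper methods are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

public class Utf16RoundTrip {
    // Encodes text as UTF-16; Java's "UTF-16" charset writes a byte-order mark first.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_16);
    }

    // Decodes UTF-16 bytes, honouring a leading byte-order mark if present.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) throws IOException {
        String hiraganaA = "\u3042"; // the character from the question, as an escape
        Path file = Files.createTempFile("unicode", ".txt");

        // Write one line and read it back with the same charset.
        Files.write(file, Collections.singletonList(hiraganaA), StandardCharsets.UTF_16);
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_16);

        // Report the code point instead of the glyph, so console encoding cannot interfere.
        System.out.printf("U+%04X%n", (int) lines.get(0).charAt(0));
    }
}
```

Separately, the `App` class in the question concatenates the `String[]` returned by `read` directly into the printed message, which prints the array reference rather than its contents; `Arrays.toString` would show the lines.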

Extracting links of a facebook page

How can I extract all the links of a Facebook page? Can I extract them using jsoup and pass the "like" link as a parameter to extract the info of all the users who liked that particular page?
private static String readAll(Reader rd) throws IOException
{
StringBuilder sb = new StringBuilder();
int cp;
while ((cp = rd.read()) != -1)
{
sb.append((char) cp);
}
return sb.toString();
}
public static JSONObject readurl(String url) throws IOException, JSONException
{
InputStream is = new URL(url).openStream();
try
{
BufferedReader rd = new BufferedReader
(new InputStreamReader(is, Charset.forName("UTF-8")));
String jsonText = readAll(rd);
JSONObject json = new JSONObject(jsonText);
return json;
}
finally
{
is.close();
}
}
public static void main(String[] args) throws IOException,
JSONException, FacebookException
{
try
{
System.out.println("\nEnter the search string:");
@SuppressWarnings("resource")
Scanner sc=new Scanner(System.in);
String s=sc.nextLine();
JSONObject json = readurl("https://graph.facebook.com/"+s);
System.out.println(json);
}}
Can I modify this code and integrate it with the code below, which extracts all the links of a particular page? I tried it with the code above, but it's not working.
String url = "http://www.firstpost.com/tag/crime-in-india";
Document doc = Jsoup.connect(url).get();
Elements links = doc.getElementsByTag("a");
System.out.println(links.size());
for (Element link : links)
{
System.out.println(link.absUrl("href") +trim(link.text(), 35));
}
}
public static String trim(String s, int width) {
if (s.length() > width)
return s.substring(0, width-1) + ".";
else
return s;
}
}

You can also try an alternative approach like this:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
public class URLExtractor {
private static class HTMLPaserCallBack extends HTMLEditorKit.ParserCallback {
private Set<String> urls;
public HTMLPaserCallBack() {
urls = new LinkedHashSet<String>();
}
public Set<String> getUrls() {
return urls;
}
@Override
public void handleSimpleTag(Tag t, MutableAttributeSet a, int pos) {
handleTag(t, a, pos);
}
@Override
public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
handleTag(t, a, pos);
}
private void handleTag(Tag t, MutableAttributeSet a, int pos) {
if (t == Tag.A) {
Object href = a.getAttribute(HTML.Attribute.HREF);
if (href != null) {
String url = href.toString();
if (!urls.contains(url)) {
urls.add(url);
}
}
}
}
}
public static void main(String[] args) throws IOException {
InputStream is = null;
try {
String u = "https://www.facebook.com/";
URL url = new URL(u);
is = url.openStream(); // throws an IOException
HTMLPaserCallBack cb = new HTMLPaserCallBack();
new ParserDelegator().parse(new BufferedReader(new InputStreamReader(is)), cb, true);
for (String aUrl : cb.getUrls()) {
System.out.println("Found URL: " + aUrl);
}
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
try {
if (is != null) { // openStream() may have failed before is was assigned
is.close();
}
} catch (IOException ioe) {
// nothing to see here
}
}
}
}

This kind of works, but I'm not sure jsoup is the right tool for this. I would rather look into CasperJS or PhantomJS.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class getFaceBookLinks {
public static Elements getElementsByTag_then_FilterBySelector (String tag, String httplink, String selector){
Document doc = null;
try {
doc = Jsoup.connect(httplink).get();
} catch (IOException e) {
e.printStackTrace();
}
Elements links = doc.getElementsByTag(tag);
return links.select(selector);
}
//Test functionality
public static void main(String[] args){
// The class name for the like links on facebook is UFILikeLink
Elements likeLinks = getElementsByTag_then_FilterBySelector("a", "http://www.facebook.com", ".UFILikeLink");
System.out.println(likeLinks);
}
}

itext Converting PDF to csv

I am trying to use the iText framework to convert a PDF file into a CSV file for import into Excel.
The output is garbled. I presume I am missing a step in the format conversion, but I can't find the information on the iText site and am looking for assistance.
My current code is below.
package com.pdf.convert;
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
public class ThirdPDF {
private static String INPUTFILE = "/location/test.pdf";
private static String OUTPUTFILE = "/location/test.csv";
public static void main(String[] args) throws DocumentException,
IOException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document,
new FileOutputStream(OUTPUTFILE));
document.open();
PdfReader reader = new PdfReader(INPUTFILE);
int n = reader.getNumberOfPages();
PdfImportedPage page;
// Go through all pages
for (int i = 1; i <= n; i++) {
// Only page number 2 will be included
if (i == 2) {
page = writer.getImportedPage(reader, i);
Image instance = Image.getInstance(page);
document.add(instance);
}
}
document.close();
}
}

Converting a PDF file to a CSV file.
The directory and file creation here is based on the Android framework.
Change the path and directory to suit your own framework.
private void convertPDFToCSV(String pdfFilePath) {
String myfolder = Environment.getExternalStorageDirectory() + "/Mycsv";
if (createFolder(myfolder)) {
try {
FileOutputStream fos = new FileOutputStream(myfolder + "/MyCSVFile.csv");
StringBuilder parsedText = new StringBuilder();
PdfReader reader1 = new PdfReader(pdfFilePath);
int n = reader1.getNumberOfPages();
for (int i = 0; i < n; i++) {
// Extract the text of each page (iText page numbers start at 1).
// Append only the new page; appending parsedText to itself would duplicate earlier pages.
parsedText.append(PdfTextExtractor.getTextFromPage(reader1, i + 1).trim()).append("\n");
}
// Write the extracted text as UTF-8 bytes so non-ASCII characters are not truncated
fos.write(parsedText.toString().getBytes("UTF-8"));
fos.close();
reader1.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
private boolean createFolder(String myfolder) {
File f = new File(myfolder);
if (!f.exists()) {
if (!f.mkdir()) {
return false;
} else {
return true;
}
}else{
return true;
}
}
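The text extracted above is still plain text, not comma-separated values. One way to turn each extracted line into a CSV row is sketched below. It rests on an assumption that may not hold for a given PDF: that columns in the extracted text are separated by runs of two or more whitespace characters. The class name `TextToCsv` and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class TextToCsv {
    // Quotes a field when it contains a comma, a double quote, or a line break,
    // doubling any embedded double quotes.
    static String quote(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Assumption: cells are separated by two or more whitespace characters.
    static String toCsvLine(String textLine) {
        List<String> fields = new ArrayList<>();
        for (String cell : textLine.trim().split("\\s{2,}")) {
            fields.add(quote(cell));
        }
        return String.join(",", fields);
    }

    public static void main(String[] args) {
        // Prints: "Smith, J",3,4.50
        System.out.println(toCsvLine("Smith, J   3   4.50"));
    }
}
```

Real PDFs often need layout-aware extraction to recover columns reliably, so treat the splitting rule as a starting point to adapt.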

FileReader and BufferedReader

I have three methods:
one to open the file,
one to read the file,
one to return what was read by the read method.
This is my code:
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/
package javaapplication56;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.rmi.RemoteException;
import java.util.logging.Level;
import java.util.logging.Logger;
/**
*
* @author x
*/
public class RemoteFileObjectImpl extends java.rmi.server.UnicastRemoteObject implements RemoteFileObject
{
public RemoteFileObjectImpl() throws java.rmi.RemoteException {
super();
}
File f = null;
FileReader r = null;
BufferedReader bfr = null;
String output = "";
public void open(String fileName) {
//To read file passWord
f = new File(fileName);
}
public String readLine() {
try {
String temp = "";
String newLine = System.getProperty("line.separator");
r = new FileReader(f);
while ((temp = bfr.readLine()) != null) {
output += temp + newLine;
bfr.close();
}
}
catch (IOException ex) {
ex.printStackTrace();
}
return output;
}
public void close() {
try {
bfr.close();
} catch (IOException ex) {
}
}
public static void main(String[]args) throws RemoteException{
RemoteFileObjectImpl m = new RemoteFileObjectImpl();
m.open("C:\\Users\\x\\Documents\\txt.txt");
m.readLine();
m.close();
}
}
But it does not work.

What do you expect it to do? You are not doing anything with the line you read; you just call
m.readLine();
Instead:
String result = m.readLine();
or use the output variable that you saved.
Do you want to save it to a variable, print it, write it to another file?
Update: after your update in the comments:
Your variable bfr is never created/initialized. You are only doing this:
r = new FileReader(f);
so bfr is still null.
You should do something like this instead:
bfr = new BufferedReader(new FileReader(f));
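Putting the fix together, a read method might look like the sketch below. The class name `ReadLinesSketch` and the `readAll` method are hypothetical and not part of the question's RMI class. It wraps the source in a BufferedReader before reading, and it closes the reader once, after the loop, through try-with-resources; the question's code calls `bfr.close()` inside the loop, which would fail on the second iteration even after `bfr` is initialized.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReadLinesSketch {
    // Reads every line from the source and returns them joined with '\n'.
    static String readAll(Reader source) {
        StringBuilder output = new StringBuilder();
        // try-with-resources closes the reader once, after the loop has finished.
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return output.toString();
    }

    public static void main(String[] args) {
        // For a file, pass new FileReader(fileName) instead of the StringReader.
        String result = readAll(new StringReader("first\nsecond"));
        System.out.print(result);
    }
}
```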

