Trying to extract content from a URL in Java

I am trying to extract the content of a webpage from a URL. I have already written the code, but I think I have made a mistake in the regex part. When I run the code, only the first line appears in the console. I am using NetBeans. This is the code I have so far:
private static String text;
public static void main(String[]args){
URL u;
InputStream is = null;
DataInputStream dis;
String s;
try {
u = new URL("http://ghr.nlm.nih.gov/gene/AKT1 ");
is = u.openStream();
dis = new DataInputStream(new BufferedInputStream(is));
text="";
while ((s = dis.readLine()) != null) {
text+=s;
}
} catch (MalformedURLException mue) {
System.out.println("Ouch - a MalformedURLException happened.");
mue.printStackTrace();
System.exit(1);
} catch (IOException ioe) {
System.out.println("Oops- an IOException happened.");
ioe.printStackTrace();
System.exit(1);
} finally {
String pattern = "(?i)(<P>)(.+?)";
System.out.println(text.split(pattern)[1]);
try {
is.close();
} catch (IOException ioe) {
}
}
}
}

Consider extracting your webpage information through a dedicated HTML parsing API such as jsoup. A simple example that uses your URL to extract all the <p> elements would be:
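import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void main(String[] args) {
    try {
        // Fetch the page and parse it into a DOM.
        Document doc = Jsoup.connect("http://ghr.nlm.nih.gov/gene/AKT1")
                .get();
        // Select every <p> element and print its text content.
        Elements els = doc.select("p");
        for (Element el : els) {
            System.out.println(el.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}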
Console:
On this page:
The official name of this gene is “v-akt murine thymoma viral oncogene homolog 1.”
AKT1 is the gene's official symbol. The AKT1 gene is also known by other names, listed below.
Read more about gene names and symbols on the About page.
The AKT1 gene provides instructions for making a protein called AKT1 kinase. This protein is found in various cell types throughout the body, where it plays a critical role in many signaling pathways. For example, AKT1 kinase helps regulate cell growth and division (proliferation), the process by which cells mature to carry out specific functions (differentiation), and cell survival. AKT1 kinase also helps control apoptosis, which is the self-destruction of cells when they become damaged or are no longer needed.
...

You are missing a newline character during string concatenation.
Append a newline character to the text after each line is read.
Change:
while ((s = dis.readLine()) != null) {
    text+=s;
}
To:
while ((s = dis.readLine()) != null) {
    text += s + "\n";
}
I also suggest using StringBuilder rather than String to build the final text.
StringBuilder text = new StringBuilder(1024);
...
while ((s = dis.readLine()) != null) {
    text.append(s).append("\n");
}
...
System.out.println(text.toString());
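As a side note, DataInputStream.readLine() is deprecated. A minimal sketch of the same loop with BufferedReader and try-with-resources might look like the following. The class name PageReader and the helper readAll are made up for illustration, and UTF-8 is an assumption about the page encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageReader {

    // Reads every line from the stream and joins the lines with '\n'.
    // UTF-8 is an assumption; the real page may declare another charset.
    static String readAll(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder(1024);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } // the reader and the underlying stream are closed here
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // URL taken from the question, without the trailing space.
        URL url = new URL("http://ghr.nlm.nih.gov/gene/AKT1");
        System.out.println(readAll(url.openStream()));
    }
}
```

Because the stream is opened inside try-with-resources, no finally block is needed to close it.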

Related

How to write a binary file from a String and retrieve it again to a String?

I have a string and want to persist it into a file and be able to retrieve it again into a String.
Something is wrong with my code: I expected it to write unreadable binary data, but when I open the file I can read this:
Original string:
[{"name":"asdasd","etName":"111","members":[]}]
Stored string in binary file:
[ { " n a m e " : " a s d a s d " , " e t N a m e " : " 1 1 1 " , " m e m b e r s " : [ ] } ]
I see two problems:
It is not stored in binary! I expected unreadable binary content, but I can read it.
When I retrieve it, it comes back with those strange spaces between the characters, so it does not work.
This is my code for storing the string:
public static void storeStringInBinary(String string, String path) {
DataOutputStream os = null;
try {
os = new DataOutputStream(new FileOutputStream(path));
os.writeChars(string);
os.flush();
} catch (Exception e) {
e.printStackTrace();
} finally {
if (os != null) {
try {
os.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}
}
And this is my code for reading it from binary to a String:
public static String retrieveStringFromBinary(String file) {
String string = null;
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader (file));
String line = null;
StringBuilder stringBuilder = new StringBuilder();
while((line = reader.readLine()) != null) {
stringBuilder.append(line);
}
return stringBuilder.toString();
} catch (Exception e){
e.printStackTrace();
} finally {
if (reader != null) {
try {
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
return string;
}
First, there isn't really a distinction between a text file and a binary file. A text file is just a file whose content falls in the range of byte values that correspond to characters.
If you want to encrypt the content of the file so that it is unreadable when you simply cat the file, you will need to choose an appropriate encryption method.
Second, mixing Readers/Writers and Streams in Java is never a good idea; pick one style and stick to it.
The problem with your function that saves the string to a file is that you are using the writeChars() method, which from the doc does the following:
Writes a char to the underlying output stream as a 2-byte value, high byte first. If no exception is thrown, the counter written is incremented by 2.
Since your string is made up of single-byte characters, this pads each character with a null byte. When the file is read back as text, those null bytes show up as the gaps between the characters. If you change this to writeBytes(), you should get output without the extra null bytes.
The null bytes also explain why your read function returns a garbled string: FileReader decodes every 0x00 as a character of its own, so readLine() hands back the padded text rather than the original string.
Try this out:
public static void storeStringInBinary(String string, String path) {
    try (ObjectOutputStream os = new ObjectOutputStream(new FileOutputStream(path))) {
        os.writeObject(string);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
public static String retrieveStringFromBinary(String file) {
    String string = null;
    try (ObjectInputStream reader = new ObjectInputStream(new FileInputStream(file))) {
        string = (String) reader.readObject();
    } catch (ClassNotFoundException | IOException e) {
        e.printStackTrace();
    }
    return string;
}
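To see the padding described above for yourself, here is a small sketch that writes the same string with writeChars() and writeBytes() into an in-memory buffer and compares the results. The class name WriteCharsDemo and its helper methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class WriteCharsDemo {

    // Writes each char as two bytes, high byte first.
    static byte[] withWriteChars(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeChars(s);
        }
        return buf.toByteArray();
    }

    // Writes only the low byte of each char (safe for ASCII text only).
    static byte[] withWriteBytes(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeBytes(s);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(withWriteChars("abc").length); // prints 6
        System.out.println(withWriteBytes("abc").length); // prints 3
    }
}
```

For ASCII input, every second byte produced by writeChars() is 0x00, which is where the gaps in the stored file come from.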

Java too many open files exception

I have a problem with my code. Basically, I have an array containing some keys:
String[] ComputerScience = { "A", "B", "C", "D" };
And so on, for 40 entries in total.
My code reads 900 PDFs from each of 40 folders, one folder per element of ComputerScience, manipulates the extracted text, and stores the output in files named A.txt, B.txt, etc.
Each folder ("A", "B", etc.) contains 900 PDFs.
After processing many documents, a "Too many open files" exception is thrown.
I believe I am closing the file handles correctly.
static boolean writeOccurencesFile(String WORDLIST,String categoria, TreeMap<String,Integer> map) {
File dizionario = new File(WORDLIST);
FileReader fileReader = null;
FileWriter fileWriter = null;
try {
File cat_out = new File("files/" + categoria + ".txt");
fileWriter = new FileWriter(cat_out, true);
} catch (IOException e) {
e.printStackTrace();
}
try {
fileReader = new FileReader(dizionario);
} catch (FileNotFoundException e) { }
try {
BufferedReader bufferedReader = new BufferedReader(fileReader);
if (dizionario.exists()) {
StringBuffer stringBuffer = new StringBuffer();
String parola;
StringBuffer line = new StringBuffer();
int contatore_index_parola = 1;
while ((parola = bufferedReader.readLine()) != null) {
if (map.containsKey(parola) && !parola.isEmpty()) {
line.append(contatore_index_parola + ":" + map.get(parola).intValue() + " ");
map.remove(parola);
}
contatore_index_parola++;
}
if (! line.toString().isEmpty()) {
fileWriter.append(getCategoryID(categoria) + " " + line + "\n"); // print riga completa documento N x1:y x2:a ...
}
} else { System.err.println("Dictionary file not found."); }
bufferedReader.close();
fileReader.close();
fileWriter.close();
} catch (IOException e) { return false;}
catch (NullPointerException ex ) { return false;}
finally {
try {
fileReader.close();
fileWriter.close();
} catch (IOException e) {
e.printStackTrace();
}
}
return true;
}
But the error still occurs. It is thrown at:
try {
File cat_out = new File("files/" + categoria + ".txt");
fileWriter = new FileWriter(cat_out, true);
} catch (IOException e) {
e.printStackTrace();
}
Thank you.
EDIT: SOLVED
I found the solution: the main function that calls writeOccurencesFile also calls another function that creates a RandomAccessFile and never closes it.
The debugger says the exception is thrown in writeOccurencesFile, but using Java Leak Detector I found out that the PDFs were already open and were not closed after being parsed to plain text.
Thank you!
Try using File Leak Detector, a utility designed specifically for this purpose.
This Java agent is a utility that keeps track of where/when/who opened files in your JVM. You can have the agent trace these operations to find out about the access pattern or handle leaks, and dump the list of currently open files and where/when/who opened them.
When the exception occurs, this agent will dump the list, allowing you to find out where a large number of file descriptors are in use.
I have tried using try-with-resources, but the problem remains.
Also, when I run it in the built-in macOS console, a FileNotFound exception is printed at the line FileWriter fileWriter = ...
static boolean writeOccurencesFile(String WORDLIST,String categoria, TreeMap<String,Integer> map) {
File dizionario = new File(WORDLIST);
try (FileWriter fileWriter = new FileWriter( "files/" + categoria + ".txt" , true)) {
try (FileReader fileReader = new FileReader(dizionario)) {
try (BufferedReader bufferedReader = new BufferedReader(fileReader)) {
if (dizionario.exists()) {
StringBuffer stringBuffer = new StringBuffer();
String parola;
StringBuffer line = new StringBuffer();
int contatore_index_parola = 1;
while ((parola = bufferedReader.readLine()) != null) {
if (map.containsKey(parola) && !parola.isEmpty()) {
line.append(contatore_index_parola + ":" + map.get(parola).intValue() + " ");
map.remove(parola);
}
contatore_index_parola++;
}
if (!line.toString().isEmpty()) {
fileWriter.append(getCategoryID(categoria) + " " + line + "\n"); // print riga completa documento N x1:y x2:a ...
}
} else {
System.err.println("Dictionary file not found.");
}
}
}
} catch (IOException e) {
e.printStackTrace();
}
return true;
}
This is the code I am using now. Setting aside the poor exception handling, why do the files still seem not to be closed?
I am now running a test with File Leak Detector.
Maybe your code raises another exception that you are not handling. Try adding catch (Exception e) before the finally block.
You can also move the BufferedReader declaration out of the try block and close it in finally.
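Since the leak in this thread turned out to be a RandomAccessFile that was never closed, here is a minimal sketch of how try-with-resources covers that case too. The class ResourceDemo and both methods are hypothetical stand-ins, not the asker's real PDF parsing code.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.RandomAccessFile;

class ResourceDemo {

    // Hypothetical stand-in for the parsing step: reads the first byte only.
    // The descriptor is released even if read() throws.
    static int firstByte(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            return raf.read();
        }
    }

    // Appends every line of the dictionary file to the output file.
    // Both resources are declared in one try header and are closed
    // in reverse order when the block exits.
    static int appendLines(String dictionaryPath, String outPath) throws IOException {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(dictionaryPath));
             FileWriter writer = new FileWriter(outPath, true)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.append(line).append('\n');
                count++;
            }
        }
        return count;
    }
}
```

The point is that every class that opens a file handle needs this treatment, not only the reader and writer in the method where the exception happens to surface.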

Append a string in front of line in java?

I am creating a pattern-lock-based project on Android.
I have a file called category.txt
The content of the file is as below
Sports:Race:Arcade:
Now, what I want is that whenever the user draws a pattern for a specific game category, the pattern is appended in front of that category.
For example:
Sports:Race:"string/pattern string to be appended here for race"Arcade:
I have used the following code, but it is not working.
private void writefile(String getpattern,String category)
{
String str1;
try {
file = new RandomAccessFile(filewrite, "rw");
while((str1 = file.readLine()) != null)
{
String line[] = str1.split(":");
if(line[0].toLowerCase().equals(category.toLowerCase()))
{
String colon=":";
file.write(category.getBytes());
file.write(colon.getBytes());
file.write(getpattern.getBytes());
file.close();
Toast.makeText(getActivity(),"In Writefile",Toast.LENGTH_LONG).show();
}
}
}
catch (FileNotFoundException e)
{
e.printStackTrace();
}
catch(IOException io)
{
io.printStackTrace();
}
}
Please help!
With RandomAccessFile you have to calculate the write position yourself. I think it is much easier to just replace the file content with a little help from FileUtils in Apache Commons IO. This might not be the best idea if you have a very large file, but it is quite simple.
String givenCategory = "Sports";
String pattern = "stringToAppend";
final String colon = ":";
try {
    List<String> lines = FileUtils.readLines(new File("someFile.txt"));
    String modifiedLine = null;
    int index = 0;
    for (String line : lines) {
        String[] categoryFromLine = line.split(colon);
        if (givenCategory.equalsIgnoreCase(categoryFromLine[0])) {
            modifiedLine = new StringBuilder().append(pattern).append(colon).append(givenCategory).append(colon).toString();
            break;
        }
        index++;
    }
    if (modifiedLine != null) {
        lines.set(index, modifiedLine);
        FileUtils.writeLines(new File("someFile.txt"), lines);
    }
} catch (IOException e1) {
    // do something
}
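If all categories sit on a single line, as in the question's Sports:Race:Arcade: example, the line itself has to be rebuilt rather than replaced. A rough sketch of that step, using only the standard library, could look like this. The class CategoryLine and the method insertAfterCategory are hypothetical names, and the output layout follows the example in the question, where the pattern is placed directly after the category's colon.

```java
class CategoryLine {

    // Rebuilds a colon-separated line and inserts the pattern right after
    // the matching category, e.g. "Sports:Race:<pattern>Arcade:".
    static String insertAfterCategory(String line, String category, String pattern) {
        if (line.isEmpty()) {
            return line; // nothing to rebuild
        }
        StringBuilder out = new StringBuilder();
        for (String token : line.split(":")) {
            out.append(token).append(':');
            if (token.equalsIgnoreCase(category)) {
                out.append(pattern);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints Sports:Race:xyzArcade:
        System.out.println(insertAfterCategory("Sports:Race:Arcade:", "Race", "xyz"));
    }
}
```

Note that with this layout nothing separates the pattern from the next category, so you may want to add a delimiter of your own if the file has to be parsed again later.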

I need to write all matches of a regex to a text file; I'm new to Java programming

I'm trying to write all the matches found to a text document. I have been banging my head on my desk for the past 3 hours and figured it was time to ask for help.
My current issue is with the List<String>, and I'm not sure whether the data I add to it is wrong or whether my file-writing code is. Nothing is written to the file. With other ways of printing, such as writer.println(returnvalue), only one of the matches is written instead of all of them. I do print the matches to the console to make sure they are found, and they are.
Edit2: Sorry, this is my first question on Stack Overflow. I guess my question is: how would you print all the data from a list to a text file?
Edit3: My newest problem is printing all the matches; I am currently stuck printing only the last match. Any advice?
public static void RegexChecker(String TheRegex, String line){
String Result= "";
List<String> returnvalue = new ArrayList<String>();
Pattern checkRegex = Pattern.compile(TheRegex);
Matcher regexMatcher = checkRegex.matcher(line);
int count = 0 ;
FileWriter writer = null;
try {
writer = new FileWriter("output.txt");
} catch (IOException e1) {
e1.printStackTrace();
}
while ( regexMatcher.find() ){
if (regexMatcher.group().length() != 0){
returnvalue.add(regexMatcher.group());
System.out.println( regexMatcher.group().trim() );
}
for(String str: returnvalue) {
try {
out.write(String.valueOf(returnvalue.get(i)));
writer.write(str);
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
Move the for loop out of the while loop. You want to write to the file only after all matches have been added to the list. The for-each block needs some changes as well.
The for-each construct already gives you each value as it iterates over the collection, so you do not need to fetch the values again by index.
Try this:
// First collect every non-empty match.
while (regexMatcher.find()) {
    if (regexMatcher.group().length() != 0) {
        returnvalue.add(regexMatcher.group());
        System.out.println(regexMatcher.group().trim());
    }
}
// Then write the whole list to the file, one match per line.
try {
    for (String str : returnvalue) {
        writer.write(str + "\n");
    }
    writer.flush();
    writer.close();
} catch (IOException e) {
    e.printStackTrace();
}
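If you are on Java 7 or later, the collect-then-write idea can also be sketched with java.nio.file.Files, which writes a whole list in one call and closes the file for you. The class MatchWriter and its two methods are made-up names for this example, and UTF-8 is an assumed output encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchWriter {

    // Collects every non-empty match of the regex in the input.
    static List<String> findMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            if (!matcher.group().isEmpty()) {
                matches.add(matcher.group());
            }
        }
        return matches;
    }

    // Writes one match per line; Files.write opens and closes the file itself.
    static void writeMatches(String regex, String input, Path out) throws IOException {
        Files.write(out, findMatches(regex, input), StandardCharsets.UTF_8);
    }
}
```

A call such as MatchWriter.writeMatches("\\d+", "a1b22c333", Paths.get("output.txt")) would then write the three numbers on separate lines.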

filtering files

I want to check the file type of a file. I thought about magic numbers, but how do I use them with Java?
I want to allow only text files in my program and filter out files like JPGs.
Any ideas on what I can do?
private String path;
private String fileText;
private String textLine;
public LoadModel(String path) {
this.path = path;
this.fileText = "";
FileReader read = null;
BufferedReader bufRead = null;
if (path != null && new File(path).exists()
&& !(new File(path).isDirectory())) {
try {
read = new FileReader(path);
bufRead = new BufferedReader(read);
do {
try {
this.textLine = bufRead.readLine();
} catch (IOException ex) {
Logger.getLogger(LoadModel.class.getName()).log(Level.SEVERE, null, ex);
}
if (this.textLine != null) {
this.fileText = this.fileText + this.textLine + "\n";
}
} while (this.textLine != null);
} catch (FileNotFoundException ex) {
Logger.getLogger(LoadModel.class.getName()).log(Level.SEVERE, null, ex);
}
} else {
HinweisDialogController.hinweisDialogOK("Die angegebene Datei existiert nicht");
}
}
Several APIs are available for identifying a file's MIME type in Java.
Java 7 also offers Files.probeContentType(path).
You can try java.nio.file.Files.probeContentType which is designed to determine a file content type. For example this test
System.out.println(Files.probeContentType(Paths.get("1.xml")));
System.out.println(Files.probeContentType(Paths.get("1.txt")));
prints
text/xml
text/plain
See the Files API documentation for more details.
If you need your code to work on JDK versions earlier than 7, you can use Apache Tika's MimeType detector, which has a MimeType#detect() method.
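Since the question mentions magic numbers, here is a minimal sketch of that approach: it compares the first bytes of a file against the well-known JPEG and PNG signatures. The class MagicNumbers and its methods are hypothetical, and only these two formats are covered, so treat it as a starting point rather than a complete filter.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class MagicNumbers {

    // JPEG files start with FF D8 FF.
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};
    // PNG files start with 89 'P' 'N' 'G' 0D 0A 1A 0A.
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    static boolean startsWith(byte[] data, byte[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, signature.length), signature);
    }

    static boolean looksLikeImage(byte[] header) {
        return startsWith(header, JPEG) || startsWith(header, PNG);
    }

    // Reads up to 8 leading bytes of the file and checks them.
    // A single read() is assumed to be enough for a local file.
    static boolean looksLikeImage(Path file) throws IOException {
        byte[] header = new byte[8];
        int read;
        try (InputStream in = Files.newInputStream(file)) {
            read = in.read(header);
        }
        return looksLikeImage(Arrays.copyOf(header, Math.max(read, 0)));
    }
}
```

You could call looksLikeImage(Paths.get(path)) before opening the file with FileReader and show the error dialog when it returns true.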
