How to extract a substring from a string in java - java

I am validating URLs from my code. I have a file with URLs in it and I want to check whether each one exists. If a URL exists, the web page contains XML in which there is an email address I want to extract.
I go round a while loop and, on each pass, if the URL exists, its XML is appended to a string. This one big string contains all the XML. What I want to do is extract the email addresses from this string. I can't use the methods in the String API because they require you to specify the starting index, which I don't know as it varies each time.
What I was hoping to do was search the string for a substring starting with (e.g. "<email id>") and ending with (e.g. "</email id>") and add the text between these markers to a separate string.
Does anyone know if this is possible to do or if there is an easier/different way of doing what I want to do?
Thanks.

If you know the structure of the XML document well, I recommend using XPath.
For example, with emails contained in <email>a#b.com</email>, the XPath expression would be something like /root/email (depending on your XML structure).
Executing this XPath query on your XML returns all the <email> elements (Node objects) as a node list. Once you have an element, you can read its content (#getNodeValue on its text child, or #getTextContent on the element itself).
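A minimal sketch of the XPath approach using only the JDK's javax.xml.xpath API. The class and method names are made up for illustration, and the /root/email path assumes the emails sit directly under a root element called root, as in the example above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the text of every /root/email element.
class XPathEmailExtractor {
    static List<String> extractEmails(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Evaluate the expression against the XML held in the string.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "/root/email",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> emails = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                emails.add(nodes.item(i).getTextContent().trim());
            }
            return emails;
        } catch (Exception e) {
            // Malformed XML or a bad expression ends up here.
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }
}
```

If the element names differ in your documents, only the expression string needs to change.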

To answer your subject question: .indexOf, or, regular expressions.
But after a brief review of your question, you should really be processing the XML document properly.
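For the literal subject question, here is a rough sketch of the indexOf route. You do not need to know the starting index in advance, because indexOf finds it for you. The class and method names are invented for this example, and the markers are assumed to be plain strings such as "<email>" and "</email>".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for pulling out text between two known markers.
class SubstringBetween {
    /** Returns every piece of text found between startTag and endTag, in order. */
    static List<String> extractBetween(String source, String startTag, String endTag) {
        List<String> results = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = source.indexOf(startTag, from);
            if (start < 0) {
                break; // no more opening markers
            }
            int contentStart = start + startTag.length();
            int end = source.indexOf(endTag, contentStart);
            if (end < 0) {
                break; // opening marker without a closing one
            }
            results.add(source.substring(contentStart, end));
            from = end + endTag.length(); // continue after this match
        }
        return results;
    }
}
```

This treats the XML as plain text, so it will not cope with attributes on the tag, comments or CDATA sections; a real parser handles those.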

A regular expression that will find and return strings between two " characters:
import java.util.regex.Pattern;
import java.util.regex.Matcher;

private static final Pattern pattern = Pattern.compile("\"(.*?)\"");

private void doStuffWithStringsBetweenQuotes(String source) {
    Matcher matcher = pattern.matcher(source);
    while (matcher.find()) {
        String match = matcher.group(1); // the text between the two quotes
    }
}

Have you tried using a regex? A sample document would be very useful for this kind of question.

Check out the org.xml.sax API. It is very easy to use and allows you to parse through XML and do whatever you want with the contents whenever you come across anything of interest. So you could easily add some logic to look for < email > start elements then save the contents (characters) which will contain your email address.
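As a sketch of what that SAX logic could look like: a handler that starts buffering at each email start element and stores the buffered characters at the matching end element. The class name is invented, and the element is assumed to be called email with no namespace prefix.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects the text of every <email> element.
class SaxEmailCollector extends DefaultHandler {
    private final List<String> emails = new ArrayList<>();
    private StringBuilder current; // non-null only while inside <email>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("email".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element, so append.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("email".equals(qName) && current != null) {
            emails.add(current.toString().trim());
            current = null;
        }
    }

    static List<String> collect(String xml) {
        try {
            SaxEmailCollector handler = new SaxEmailCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.emails;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```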

If I understand your question correctly you are extracting pieces of XML from multiple web pages and concatenating them into a big 'xml' string,
something that looks like
"<somedata>blah</somedata>
<email>a.b#c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.c#c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.d#c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
"
I'd advise making that a somewhat valid xml document by including a root element.
"
<?xml version="1.0" encoding="ISO-8859-1"?>
<newRoot>
<somedata>blah</somedata>
<email>a.b#c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.c#c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.d#c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
</newRoot>"
Then you could load that into an XML Document object and use XPath expressions to extract the email nodes and their values.
If you don't want to do that, you could use the indexOf(String str, int fromIndex) method to find the positions of <email> and </email> (or whatever they are called), and then take substrings based on those. That's not a particularly clean or easy-to-read way of doing it, though.
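A small sketch of the root-element idea, using the JDK's DOM parser. The class and method names are hypothetical, and it assumes the concatenated fragments do not carry their own <?xml ...?> declarations, since a declaration is not allowed in the middle of a document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: wraps loose fragments in a root element, then parses.
class WrappedFragmentParser {
    static List<String> emailsFromFragments(String fragments) {
        // A well-formed document needs exactly one root element.
        String xml = "<newRoot>" + fragments + "</newRoot>";
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("email");
            List<String> emails = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                emails.add(nodes.item(i).getTextContent().trim());
            }
            return emails;
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragments are not well-formed XML", e);
        }
    }
}
```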

Related

java get next few words in string

I am trying to search a .txt file that contains HTML in it. I need to search the file for specific HTML tags, then grab the following next few characters of code. I am new to java, but am willing to learn what I need to.
For example: Say I have the code: <span class="date">Apr 13</span> and all I need is the date(Apr 13). How do I go about doing this?
Thanks a lot!
Have a look at String class docs and try to find the method to search the string.
Since you said you are getting it from an HTML file, have a look at jsoup, an HTML parser that makes searching for strings in HTML documents a lot easier.
With jsoup, you can do it like this
File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements spans = doc.select("span");
for (Element element : spans) {
    System.out.println(element.html());
}
try this
Matcher m = Pattern.compile(">(.*?)<").matcher(s);
while (m.find()) {
    String text = m.group(1); // the text between '>' and '<'
}
If you want something basic (I thought it would be good as you are new), you can use this:
if (s.indexOf("span class=\"date\"") != -1)
    s = s.substring(s.indexOf(">") + 1, s.lastIndexOf("<"));
But this answer is specific to your example rather than a general solution.
String yourString = "<span class=\"date\">Apr 13</span>";
String date = yourString.split("class=\"date\">")[1].split("</sp")[0];
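For a file with many tags, a regex tied to the specific span may be a bit safer than matching any ">...<" pair. The following is only a sketch: the class name is invented, and it assumes the tag is written exactly as <span class="date"> with no other attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the text inside every <span class="date"> element.
class DateSpanFinder {
    // Reluctant (.*?) stops at the first closing </span>.
    private static final Pattern DATE_SPAN =
            Pattern.compile("<span\\s+class=\"date\"\\s*>(.*?)</span>", Pattern.DOTALL);

    static List<String> findDates(String html) {
        List<String> dates = new ArrayList<>();
        Matcher m = DATE_SPAN.matcher(html);
        while (m.find()) {
            dates.add(m.group(1).trim());
        }
        return dates;
    }
}
```

If the markup varies (extra attributes, single quotes, nested tags), an HTML parser such as jsoup is the more robust choice.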

Parse html string in java servlet

I am trying to parse below string in java servlet.
<con>
<status>OK</status>
<session>12312332432</session>
</con>
I want value of <session> element.
Ideas ?
If it's not XML, you can use regex:
import java.util.regex.*;
String aParser="<con><status>OK</status><session>12312332432</session></con>";
Pattern p=Pattern.compile("<session>(.*)</session>");
Matcher m=p.matcher(aParser);
while(m.find())
{
System.out.println(m.group(1));
}
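One caveat with the pattern above: (.*) is greedy, so if the input ever contains more than one <session> element, the capture runs from the first opening tag to the last closing tag. A reluctant (.*?) stops at the first closing tag. A small sketch to compare the two (the class and method names are invented):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo helper: returns group 1 of the first match, or null.
class GreedyVsReluctant {
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }
}
```

With the input "<session>1</session><session>2</session>", the greedy pattern "<session>(.*)</session>" captures "1</session><session>2", while "<session>(.*?)</session>" captures "1".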
For XML you should use an XML parser.
Look at SAXParser; it will do the job.
SAXParser is very good when dealing with big files, because the whole document isn't held in memory.
If you just need the session value and not the entire XML data, just split() the string as below:
xmlString.split(starttag)[1].split(endtag)[0];

Regex Email addresses out of xml

My question: What's a good way to parse the information below?
I have a Java program that gets its input from XML. I have a feature which will send an error email if there was any problem in the processing. Because parsing the XML could itself be the problem, I want a feature that can regex the emails out of the XML (if parsing was the problem, I couldn't get the error emails out of the XML normally).
Requirements:
I want to be able to parse the to, cc, and bcc attributes separately
There are other elements which have to, cc, and bcc attributes
Whitespace does not matter, so my example may show the attributes on a newline, but that's not always the case.
The order of the attributes does not matter.
Here's an example of the xml:
<error_options
to="your_email#your_server.com"
cc="cc_error#your_server.com"
bcc="bcc_error#your_server.com"
reply_to="someone_else#their_server.com"
from="bo_error#some_server.org"
subject="Error running System at ##TIMESTAMP##"
force_send="false"
max_email_size="10485760"
oversized_email_action="zip;split_all"
>
I tried this error_options.{0,100}?to="(.*?)", but that matched me down to reply_to. That made me think there are probably some cases I might miss, which is why I'm posting this as a question.
This piece will put all attributes from your String s="<error_options..." into a map:
Pattern p = Pattern.compile("\\s+?(.+?)=\"(.+?)\\s*?\"", Pattern.DOTALL);
Map<String, String> a = new HashMap<>();
Matcher m = p.matcher(s);
while (m.find()) {
    String key = m.group(1).trim();
    String val = m.group(2).trim();
    a.put(key, val);
}
...then you can extract the values that you're interested in from that map.
This question is similar to "RegEx match open tags except XHTML self-contained tags". Never parse XML or HTML with regular expressions. There are many XML parser implementations in Java that do this task properly. Read the document and parse the attributes one by one.
Don't worry if the user's XML is not well-formed; the parsers can handle a lot of sloppiness.
/<error_options(?=\s)[^>]*?(?<=\n)\s*to="([^"]*)"/s;
/<error_options(?=\s)[^>]*?(?<=\n)\s*cc="([^"]*)"/s;
/<error_options(?=\s)[^>]*?(?<=\n)\s*bcc="([^"]*)"/s;
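If you do go the regex route in Java, one possible sketch is to work in two steps: first isolate the error_options tag, then look for to, cc and bcc only inside it. Requiring whitespace directly before the attribute name keeps reply_to from matching as to. The class name is invented, and the sketch assumes double-quoted attribute values that contain no > character.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical fallback extractor for the to/cc/bcc attributes of <error_options>.
class ErrorOptionsEmails {
    // Step 1: capture everything between "<error_options" and the closing '>'.
    private static final Pattern TAG = Pattern.compile("<error_options(\\s[^>]*)>");
    // Step 2: an attribute name preceded by whitespace, so "reply_to" is skipped.
    private static final Pattern ATTR =
            Pattern.compile("\\s(to|cc|bcc)\\s*=\\s*\"([^\"]*)\"");

    static Map<String, String> extract(String xml) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = TAG.matcher(xml);
        if (tag.find()) {
            Matcher attr = ATTR.matcher(tag.group(1));
            while (attr.find()) {
                result.put(attr.group(1), attr.group(2).trim());
            }
        }
        return result;
    }
}
```

Because only the error_options tag is searched, to, cc and bcc attributes on other elements are ignored, and attribute order and line breaks do not matter.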

get a set of data from a chat log by using regex and java

I'm writing a Java program in which I have to extract some data from a chat log file for further processing using regex (I am new to regular expressions, by the way). The chat log schema is defined as follows: [hh:mm:ss] string.
But the specific lines I would like to extract data are in the form of
[hh:mm:ss] <data1> data2. The data I would like to extract are hh:mm:ss, data1 and data2.
At first, I have tried to extract the time which was easier using
Pattern.compile("(\d{2}:\d{2}:\d{2}).
I have even been able to extract the data1 separately using
Pattern p1=Pattern.compile("<(.*)>"); and it was fine.
But when I try to get "hh:mm:ss",data1 and data2 by using the following regex
Pattern p=Pattern.compile("(\d{2}:\d{2}:\d{2}) <(.*)> (.*)") I have no match found.
So does anyone have an idea how I can proceed to achieve my goal?
Well, your regex matches exactly the pattern you wrote, so lines in that form would have matched fine. You forgot about the brackets around the time: [hh:mm:ss]. See here:
String text = "22:44:55 <data quite much> data 2";
text = text.replaceAll("(\\d{2}:\\d{2}:\\d{2}) <(.*)> (.*)", "replacement");
System.out.println(text);
text = "[22:44:55] <data quite much> data 2";
text = text.replaceAll("(\\d{2}:\\d{2}:\\d{2}) <(.*)> (.*)", "replacement");
System.out.println(text);
This produces:
replacement
[22:44:55] <data quite much> data 2
So first string was matched and second one - not. Just as expected.
Probably you will just need to change your pattern to \\[(\\d{2}:\\d{2}:\\d{2})\\] <(.*)> (.*).
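Putting the corrected pattern to work with Pattern and Matcher might look like the sketch below. The class name is invented, and one thing differs from the pattern above: data1 uses a reluctant (.*?) so that a '>' appearing later in data2 does not get pulled into data1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for lines of the form "[hh:mm:ss] <data1> data2".
class ChatLineParser {
    private static final Pattern LINE =
            Pattern.compile("\\[(\\d{2}:\\d{2}:\\d{2})\\] <(.*?)> (.*)");

    /** Returns {time, data1, data2}, or null if the line has a different shape. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }
}
```

Lines that only follow the general "[hh:mm:ss] string" schema return null, so they are easy to skip while reading the log.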

Convert HTML symbols and HTML names to HTML number using Java

I have an XML which contains many special symbols like ® (HTML number &#174) etc.
and HTML names like &atilde (HTML number &#227) etc.
I am trying to replace these HTML symbols and HTML names with corresponding HTML number using Java. For this, I first converted XML file to string and then used replaceAll method as:
File fn = new File("myxmlfile.xml");
String content = FileUtils.readFileToString(fn);
content = content.replaceAll("®", "&\#174");
FileUtils.writeStringToFile(fn, content);
But this is not working.
Can anyone please tell how to do it.
Thanks !!!
The signature for the replaceAll method is:
public String replaceAll(String regex, String replacement)
You have to be careful that your first parameter is a valid regular expression. The Java Pattern class describes the constructs used in a Java regular expression.
Based on what I see in the Pattern class description, I don't see what's wrong with:
content = content.replaceAll("®", "&\#174");
You could try:
content = content.replaceAll("\\p(®)", "&\#174");
and see if that works better.
I don't think that \# is a valid escape sequence.
BTW, what's wrong with "&#174" ?
If you want HTML numbers, try escaping for XML first.
Use StringEscapeUtils from Apache Commons Lang.
Java may have trouble dealing with it, so I prefer to escape for Java first, and after that for XML or HTML.
// Method names as in Commons Lang 2.x
String escapedStr = StringEscapeUtils.escapeJava(yourString);
escapedStr = StringEscapeUtils.escapeXml(yourString);
escapedStr = StringEscapeUtils.escapeHtml(yourString);
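If you would rather not pull in a library, a plain-JDK sketch could look like this. It replaces named entities using a small lookup table and turns every non-ASCII character into its numeric reference. The class name is invented and the table holds only the two entities mentioned in the question, so it would need extending for real data. String.replace is used because it takes literal text, not a regex.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical converter to numeric character references.
class NumericEntities {
    // Illustrative and deliberately incomplete table of named entities.
    private static final Map<String, String> NAMED = new LinkedHashMap<>();
    static {
        NAMED.put("&reg;", "&#174;");
        NAMED.put("&atilde;", "&#227;");
    }

    static String toNumericEntities(String text) {
        String result = text;
        // Literal replacement of known named entities.
        for (Map.Entry<String, String> e : NAMED.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        // Replace each non-ASCII code point with &#N;.
        StringBuilder out = new StringBuilder(result.length());
        for (int i = 0; i < result.length(); ) {
            int cp = result.codePointAt(i);
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

When reading and writing the file, pass an explicit encoding such as UTF-8, otherwise the ® character may already be damaged before the replacement runs.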
