How to extract a file's full name from a URL in Java

I need a library to extract a file's full name from its URL (direct download link). I want a powerful library. I use FilenameUtils from Apache Commons, but that class does not support a lot of these URLs.
I want a library that supports URLs like these:
https://example.cdn.com/mp4/7/9/5/file_795f32460d111df334849ee8336e56ca.mp4?e=1535545105&h=4772d27a70cd9b1c665b712f62592c47&download=1
name : file_795f32460d111df334849ee8336e56ca.mp4
http://example.cdn.comr/post/93/3/Jozve-Kamele-arbi.abp.zip
name : Jozve-Kamele-arbi.abp.zip
http://cdl.example.com/?b=dl-software&f=Windows.8.1.Enterprise.x86.Aug.2018_n.part1.rar
name : dl-software&f=Windows.8.1.Enterprise.x86.Aug.2018_n.part1.rar
https://www.google.com/url?sa=t&source=web&rct=j&url=http://www.pdf995.com/samples/pdf.pdf&ved=2ahUKEwjV096X-ZHdAhVQzlkKHTpUBV4QFjAAegQIARAB&usg=AOvVaw3HFvAQ7GNf5QjsUo05ot-j
name: pdf.pdf
Can anyone help me? Thanks.
I apologize in advance if my grammar is not correct; I can't speak English well.

You could actually also try to solve this problem with regular expressions (e.g. (?i)([^=/&?]+\\.(" + EXTENSIONS + "))\\b), if you have a list of the file extensions you are interested in.
Here is an example of such a method which extracts a file name from a URL:
private static final String EXTENSIONS = "ez|aw|atom|atomcat|atomsvc|ccxml|cdmia|cdmic|cdmid|cdmio|cdmiq|cu|davmount|dbk|dssc|xdssc|ecma|emma|epub|exi|pfr|gml|gpx|gxf|stk|ipfix|jar|ser|class|js|json|jsonml|lostxml|hqx|cpt|mads|mrc|mrcx|mathml|mbox|mscml|metalink|meta4|mets|mods|mp4s|mp4|mxf|oda|opf|ogx|omdoc|oxps|xer|pdf|pgp|prf|p10|p7s|p8|ac|cer|crl|pkipath|pki|pls|cww|pskcxml|rdf|rif|rnc|rl|rld|rs|gbr|mft|roa|rsd|rss|rtf|sbml|scq|scs|spq|spp|sdp|setpay|setreg|shf|rq|srx|gram|grxml|sru|ssdl|ssml|tfi|tsd|plb|psb|pvb|tcap|pwn|aso|imp|acu|air|fcdt|xdp|xfdf|ahead|azf|azs|azw|acc|ami|apk|cii|fti|atx|mpkg|m3u8|swi|iota|aep|mpm|bmi|rep|cdxml|mmd|cdy|cla|rp9|c11amc|c11amz|csp|cdbcmsg|cmc|clkx|clkk|clkp|clkt|clkw|wbs|pml|ppd|car|pcurl|dart|rdz|fe_launch|dna|mlp|dpg|dfac|kpxx|ait|svc|geo|mag|nml|esf|msf|qam|slt|ssf|ez2|ez3|fdf|mseed|gph|ftc|fnc|ltf|fsc|oas|oa2|oa3|fg5|bh2|ddd|xdw|xbd|fzs|txd|ggb|ggt|gxt|g2w|g3w|gmx|kml|kmz|gac|ghf|gim|grv|gtm|tpl|vcg|hal|zmm|hbci|les|hpgl|hpid|hps|jlt|pcl|pclxl|sfd-hdstx|mpy|irm|sc|igl|ivp|ivu|igm|i2g|qbo|qfx|rcprofile|irp|xpr|fcs|jam|rms|jisp|joda|karbon|chrt|kfo|flw|kon|ksp|htke|kia|sse|lasxml|lbd|lbe|123|apr|pre|nsf|org|scm|lwp|portpkg|mcd|mc1|cdkey|mwf|mfm|flo|igx|mif|daf|dis|mbk|mqy|msl|plc|txf|mpn|mpc|xul|cil|cab|xlam|xlsb|xlsm|xltm|eot|chm|ims|lrm|thmx|cat|stl|ppam|pptm|sldm|ppsm|potm|docm|dotm|wpl|xps|mseq|mus|msty|taglet|nlu|nnd|nns|nnw|ngdat|n-gage|rpst|rpss|edm|edx|ext|odc|otc|odb|odf|odft|odg|otg|odi|oti|odp|otp|ods|ots|odt|odm|ott|oth|xo|dd2|oxt|pptx|sldx|ppsx|potx|xlsx|xltx|docx|dotx|mgp|dp|esa|paw|str|ei6|efif|wg|plf|pbd|box|mgz|qps|ptid|bed|mxl|musicxml|cryptonote|cod|rm|rmvb|link66|st|see|sema|semd|semf|ifm|itp|iif|ipk|mmf|teacher|dxp|sfs|sdc|sda|sdd|smf|sgl|smzip|sm|sxc|stc|sxd|std|sxi|sti|sxm|sxw|sxg|stw|svd|xsm|bdm|xdm|tao|tmo|tpt|mxs|tra|utz|umj|unityweb|uoml|vcx|vis|vsf|wbxml|wmlc|wmlsc|wtb|nbp|wpd|wqd|stf|xar|xfdl|hvd|hvs|hvp|osf|osfpvg|saf|spf|cmp|zaz|vxml|wgt|hlp|wsdl|wspolicy|7z|abw|ace|dmg|aam|aas|bcpio|torrent|bz|vcd|cfs|chat|pgn|nsc|cpio|csh|dgc|wad|ncx|dtb|res|dvi|evy|eva|bdf|gsf|psf|pcf|snf|arc|spl|gca|ulx|gnumeric|gramps|gtar|hdf|install|iso|jnlp|latex|mie|application|lnk|wmd|wmz|xbap|mdb|obd|crd|clp|mny|pub|scd|trm|wri|nzb|p7r|rar|ris|sh|shar|swf|xap|sql|sit|sitx|srt|sv4cpio|sv4crc|t3|gam|tar|tcl|tex|tfm|obj|ustar|src|fig|xlf|xpi|xz|xaml|xdf|xenc|dtd|xop|xpl|xslt|xspf|yang|yin|zip|adp|s3m|sil|eol|dra|dts|dtshd|lvp|pya|ecelp4800|ecelp7470|ecelp9600|rip|weba|aac|caf|flac|mka|m3u|wax|wma|rmp|wav|xm|cdx|cif|cmdf|cml|csml|xyz|ttc|otf|ttf|woff|woff2|bmp|cgm|g3|gif|ief|ktx|png|btif|sgi|psd|sub|dwg|dxf|fbs|fpx|fst|mmr|rlc|mdi|wdp|npx|wbmp|xif|webp|3ds|ras|cmx|ico|sid|pcx|pnm|pbm|pgm|ppm|rgb|tga|xbm|xpm|xwd|dae|dwf|gdl|gtw|mts|vtu|appcache|css|csv|n3|dsc|rtx|tsv|ttl|vcard|curl|dcurl|mcurl|scurl|sub|fly|flx|gv|3dml|spot|jad|wml|wmls|java|nfo|opml|etx|sfv|uu|vcs|vcf|3gp|3g2|h261|h263|h264|jpgv|ogv|dvb|fvt|pyv|viv|webm|f4v|fli|flv|m4v|mng|vob|wm|wmv|wmx|wvx|avi|movie|smv|ice";
private static final Pattern FILE_DETECT = Pattern.compile("(?i)([^=/&?]+\\.(" + EXTENSIONS + "))\\b");
public static Optional<String> extractFileFrom(String url) {
Matcher matcher = FILE_DETECT.matcher(url);
return (matcher.find()) ? Optional.of(matcher.group(1)) : Optional.empty();
}
And here is a test which demonstrates how to use the method above:
public static void main(String[] args) throws ParseException {
List<String> strings = Arrays.asList(
"https://example.cdn.com/mp4/7/9/5/file_795f32460d111df334849ee8336e56ca.mp4?e=1535545105&h=4772d27a70cd9b1c665b712f62592c47&download=1",
"http://example.cdn.comr/post/93/3/Jozve-Kamele-arbi.abp.zip",
"http://cdl.example.com/?b=dl-software&f=Windows.8.1.Enterprise.x86.Aug.2018_n.part1.rar",
"https://www.google.com/url?sa=t&source=web&rct=j&url=http://www.pdf995.com/samples/pdf.pdf&ved=2ahUKEwjV096X-ZHdAhVQzlkKHTpUBV4QFjAAegQIARAB&usg=AOvVaw3HFvAQ7GNf5QjsUo05ot-j",
"https://www.google.com/url?sa=t&source=web&rct=j&url=http://www.pdf995.com/samples/pdf.PDF&ved=2ahUKEwjV096X-ZHdAhVQzlkKHTpUBV4QFjAAegQIARAB&usg=AOvVaw3HFvAQ7GNf5QjsUo05ot-j");
strings.stream().map(s -> extractFileFrom(s)).collect(Collectors.toList())
.forEach(System.out::println);
}
If you execute the main method you will see this on the console:
Optional[file_795f32460d111df334849ee8336e56ca.mp4]
Optional[Jozve-Kamele-arbi.abp.zip]
Optional[Windows.8.1.Enterprise.x86.Aug.2018_n.part1.rar]
Optional[pdf.pdf]
Optional[pdf.PDF]

I use this method; hope it helps you too. It handles query strings (?) and fragments (#) as well.
public static String parseFileNameFromUrl(String url) {
if (url == null) {
return "";
}
try {
URL res = new URL(url);
String resHost = res.getHost();
if (resHost.length() > 0 && url.endsWith(resHost)) {
// handle ...example.com
return "";
}
} catch (MalformedURLException e) {
e.printStackTrace();
return "";
}
int startIndex = url.lastIndexOf('/') + 1;
int length = url.length();
// find end index for ?
int lastQuestionMarkPos = url.lastIndexOf('?');
if (lastQuestionMarkPos == -1) {
lastQuestionMarkPos = length;
}
// find end index for #
int lastHashPos = url.lastIndexOf('#');
if (lastHashPos == -1) {
lastHashPos = length;
}
// calculate the end index
int endIndex = Math.min(lastQuestionMarkPos, lastHashPos);
return url.substring(startIndex, endIndex);
}
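A quick usage sketch of the method above (my own, assuming the method is in scope) for the first two URLs from the question; URLs where the file name only appears inside the query string are not really handled by this substring approach:
String a = parseFileNameFromUrl("https://example.cdn.com/mp4/7/9/5/file_795f32460d111df334849ee8336e56ca.mp4?e=1535545105&h=4772d27a70cd9b1c665b712f62592c47&download=1");
String b = parseFileNameFromUrl("http://example.cdn.comr/post/93/3/Jozve-Kamele-arbi.abp.zip");
System.out.println(a); // file_795f32460d111df334849ee8336e56ca.mp4
System.out.println(b); // Jozve-Kamele-arbi.abp.zip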


Is there a good way to remove HTML from a Java string? A simple regex like
replaceAll("\\<.*?>", "")
will work, but some things like &amp; won't be converted correctly and non-HTML between the two angle brackets will be removed (i.e. the .*? in the regex will disappear).
Use an HTML parser instead of regex. This is dead simple with Jsoup.
public static String html2text(String html) {
return Jsoup.parse(html).text();
}
Jsoup also supports removing HTML tags against a customizable whitelist, which is very useful if you want to allow only e.g. <b>, <i> and <u>.
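For instance, a minimal sketch of that whitelist-based cleaning (the tag choice is just the example above; newer jsoup versions name the class Safelist instead of Whitelist):
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanDemo {
    public static void main(String[] args) {
        String html = "<p>Keep <b>bold</b>, drop the rest</p>";
        // Only b, i and u tags are allowed through; other tags are stripped.
        String cleaned = Jsoup.clean(html, Whitelist.none().addTags("b", "i", "u"));
        System.out.println(cleaned);
    }
}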
See also:
RegEx match open tags except XHTML self-contained tags
What are the pros and cons of the leading Java HTML parsers?
XSS prevention in JSP/Servlet web application
If you're writing for Android you can do this...
androidx.core.text.HtmlCompat.fromHtml(instruction,HtmlCompat.FROM_HTML_MODE_LEGACY).toString()
If the user enters <b>hey!</b>, do you want to display the literal <b>hey!</b> markup or just hey!? If the first, escape less-thans and HTML-encode ampersands (and optionally quotes) and you're fine. A modification to your code to implement the second option would be:
replaceAll("\\<[^>]*>","")
but you will run into issues if the user enters something malformed, like <bhey!</b>.
You can also check out JTidy which will parse "dirty" html input, and should give you a way to remove the tags, keeping the text.
The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you will find, so even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you will still need to encode any remaining HTML special characters to keep your output safe.
Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.
import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class Html2Text extends HTMLEditorKit.ParserCallback {
StringBuffer s;
public Html2Text() {
}
public void parse(Reader in) throws IOException {
s = new StringBuffer();
ParserDelegator delegator = new ParserDelegator();
// the third parameter is TRUE to ignore charset directive
delegator.parse(in, this, Boolean.TRUE);
}
public void handleText(char[] text, int pos) {
s.append(text);
}
public String getText() {
return s.toString();
}
public static void main(String[] args) {
try {
// the HTML to convert
FileReader in = new FileReader("java-new.html");
Html2Text parser = new Html2Text();
parser.parse(in);
in.close();
System.out.println(parser.getText());
} catch (Exception e) {
e.printStackTrace();
}
}
}
ref : Remove HTML tags from a file to extract only the TEXT
I think the simplest way to filter out the HTML tags is:
private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
public static String removeTags(String string) {
if (string == null || string.length() == 0) {
return string;
}
Matcher m = REMOVE_TAGS.matcher(string);
return m.replaceAll("");
}
Also very simple using Jericho, and you can retain some of the formatting (line breaks and links, for example).
Source htmlSource = new Source(htmlText);
Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());
Renderer htmlRend = new Renderer(htmlSeg);
System.out.println(htmlRend.toString());
On Android, try this:
String result = Html.fromHtml(html).toString();
The accepted answer of doing simply Jsoup.parse(html).text() has 2 potential issues (with JSoup 1.7.3):
It removes line breaks from the text
It converts the text &lt;script&gt; into <script>
If you use this to protect against XSS, this is a bit annoying. Here is my best shot at an improved solution, using both JSoup and Apache StringEscapeUtils:
// breaks multi-level escaping, preventing &amp;lt;script&amp;gt; from being rendered as <script>
String replace = input.replace("&amp;", "");
// decode any encoded html, preventing &lt;script&gt; from being rendered as <script>
String html = StringEscapeUtils.unescapeHtml(replace);
// remove all html tags, but maintain line breaks
String clean = Jsoup.clean(html, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
// decode html again to convert character entities back into text
return StringEscapeUtils.unescapeHtml(clean);
Note that the last step is because I need to use the output as plain text. If you need only HTML output then you should be able to remove it.
And here is a bunch of test cases (input to output):
{"regular string", "regular string"},
{"A link", "A link"},
{"<script src=\"http://evil.url.com\"/>", ""},
{"<script>", ""},
{"&lt;script&gt;", "lt;scriptgt;"}, // best effort
{"\" ' > < \n \\ é å à ü and & preserved", "\" ' > < \n \\ é å à ü and & preserved"}
If you find a way to make it better, please let me know.
HTML escaping is really hard to do right - I'd definitely suggest using library code to do this, as it's a lot more subtle than you'd think. Check out Apache's StringEscapeUtils for a pretty good library for handling this in Java.
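For illustration, a minimal sketch with Apache Commons Lang's StringEscapeUtils (commons-lang 2.x method names; commons-lang3 uses escapeHtml4/unescapeHtml4):
import org.apache.commons.lang.StringEscapeUtils;

public class EscapeDemo {
    public static void main(String[] args) {
        String raw = "<b>fish & chips</b>";
        String escaped = StringEscapeUtils.escapeHtml(raw);    // &lt;b&gt;fish &amp; chips&lt;/b&gt;
        String back = StringEscapeUtils.unescapeHtml(escaped); // back to the original string
        System.out.println(escaped);
        System.out.println(back);
    }
}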
This should work:
use this
text.replaceAll("<.*?>", " ") -> this will replace all the HTML tags with a space,
and this
text.replaceAll("&.*?;", "") -> this will remove all the entities which start with "&" and end with ";", like &nbsp;, &amp;, &gt;, etc.
You can simply use Android's default HTML filter:
public String htmlToStringFilter(String textToFilter){
return Html.fromHtml(textToFilter).toString();
}
The above method will return the HTML filtered string for your input.
You might want to replace <br/> and </p> tags with newlines before stripping the HTML to prevent it becoming an illegible mess as Tim suggests.
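A rough sketch of that pre-processing step (my own; the regex only covers <br> variants and closing </p> tags and is an assumption, not part of the original answer):
// Hypothetical helper: turn <br>, <br/> and </p> into newlines, then strip the remaining tags.
public static String htmlToTextKeepingBreaks(String html) {
    String withBreaks = html.replaceAll("(?i)<br\\s*/?>|</p\\s*>", "\n");
    return withBreaks.replaceAll("<[^>]*>", "");
}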
The only way I can think of to remove HTML tags but leave non-HTML between angle brackets would be to check against a list of HTML tags. Something along these lines...
replaceAll("\\<[\\s]*tag[^>]*>", "")
Then HTML-decode special characters such as &amp;. The result should not be considered to be sanitized.
One more way is to use the com.google.gdata.util.common.html.HtmlToText class, like this:
MyWriter.toConsole(HtmlToText.htmlToPlainText(htmlResponse));
This is not bulletproof code though, and when I run it on Wikipedia entries I get style info as well. However, I believe it would be effective for small/simple jobs.
The accepted answer did not work for me for the test case I indicated: the result of "a < b or b > c" is "a b or b > c".
So, I used TagSoup instead. Here's a shot that worked for my test case (and a couple of others):
import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
/**
* Take HTML and give back the text part while dropping the HTML tags.
*
* There is some risk that using TagSoup means we'll permute non-HTML text.
* However, it seems to work the best so far in test cases.
*
* @author dan
* @see TagSoup
*/
public class Html2Text2 implements ContentHandler {
private StringBuffer sb;
public Html2Text2() {
}
public void parse(String str) throws IOException, SAXException {
XMLReader reader = new Parser();
reader.setContentHandler(this);
sb = new StringBuffer();
reader.parse(new InputSource(new StringReader(str)));
}
public String getText() {
return sb.toString();
}
@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
for (int idx = 0; idx < length; idx++) {
sb.append(ch[idx+start]);
}
}
@Override
public void ignorableWhitespace(char[] ch, int start, int length)
throws SAXException {
sb.append(ch, start, length); // append only the reported slice
}
// The methods below do not contribute to the text
@Override
public void endDocument() throws SAXException {
}
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
}
@Override
public void endPrefixMapping(String prefix) throws SAXException {
}
@Override
public void processingInstruction(String target, String data)
throws SAXException {
}
@Override
public void setDocumentLocator(Locator locator) {
}
@Override
public void skippedEntity(String name) throws SAXException {
}
@Override
public void startDocument() throws SAXException {
}
@Override
public void startElement(String uri, String localName, String qName,
Attributes atts) throws SAXException {
}
@Override
public void startPrefixMapping(String prefix, String uri)
throws SAXException {
}
}
Here's a slightly more fleshed-out update that tries to handle some formatting for breaks and lists. I used Amaya's output as a guide.
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Stack;
import java.util.logging.Logger;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
public class HTML2Text extends HTMLEditorKit.ParserCallback {
private static final Logger log = Logger
.getLogger(Logger.GLOBAL_LOGGER_NAME);
private StringBuffer stringBuffer;
private Stack<IndexType> indentStack;
public static class IndexType {
public String type;
public int counter; // used for ordered lists
public IndexType(String type) {
this.type = type;
counter = 0;
}
}
public HTML2Text() {
stringBuffer = new StringBuffer();
indentStack = new Stack<IndexType>();
}
public static String convert(String html) {
HTML2Text parser = new HTML2Text();
Reader in = new StringReader(html);
try {
// the HTML to convert
parser.parse(in);
} catch (Exception e) {
log.severe(e.getMessage());
} finally {
try {
in.close();
} catch (IOException ioe) {
// this should never happen
}
}
return parser.getText();
}
public void parse(Reader in) throws IOException {
ParserDelegator delegator = new ParserDelegator();
// the third parameter is TRUE to ignore charset directive
delegator.parse(in, this, Boolean.TRUE);
}
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
log.info("StartTag:" + t.toString());
if (t.toString().equals("p")) {
if (stringBuffer.length() > 0
&& !stringBuffer.substring(stringBuffer.length() - 1)
.equals("\n")) {
newLine();
}
newLine();
} else if (t.toString().equals("ol")) {
indentStack.push(new IndexType("ol"));
newLine();
} else if (t.toString().equals("ul")) {
indentStack.push(new IndexType("ul"));
newLine();
} else if (t.toString().equals("li")) {
IndexType parent = indentStack.peek();
if (parent.type.equals("ol")) {
String numberString = "" + (++parent.counter) + ".";
stringBuffer.append(numberString);
for (int i = 0; i < (4 - numberString.length()); i++) {
stringBuffer.append(" ");
}
} else {
stringBuffer.append("* ");
}
indentStack.push(new IndexType("li"));
} else if (t.toString().equals("dl")) {
newLine();
} else if (t.toString().equals("dt")) {
newLine();
} else if (t.toString().equals("dd")) {
indentStack.push(new IndexType("dd"));
newLine();
}
}
private void newLine() {
stringBuffer.append("\n");
for (int i = 0; i < indentStack.size(); i++) {
stringBuffer.append(" ");
}
}
public void handleEndTag(HTML.Tag t, int pos) {
log.info("EndTag:" + t.toString());
if (t.toString().equals("p")) {
newLine();
} else if (t.toString().equals("ol")) {
indentStack.pop();
newLine();
} else if (t.toString().equals("ul")) {
indentStack.pop();
newLine();
} else if (t.toString().equals("li")) {
indentStack.pop();
newLine();
} else if (t.toString().equals("dd")) {
indentStack.pop();
}
}
public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
log.info("SimpleTag:" + t.toString());
if (t.toString().equals("br")) {
newLine();
}
}
public void handleText(char[] text, int pos) {
log.info("Text:" + new String(text));
stringBuffer.append(text);
}
public String getText() {
return stringBuffer.toString();
}
public static void main(String args[]) {
String html = "<html><body><p>paragraph at start</p>hello<br />What is happening?<p>this is a<br />mutiline paragraph</p><ol> <li>This</li> <li>is</li> <li>an</li> <li>ordered</li> <li>list <p>with</p> <ul> <li>another</li> <li>list <dl> <dt>This</dt> <dt>is</dt> <dd>sdasd</dd> <dd>sdasda</dd> <dd>asda <p>aasdas</p> </dd> <dd>sdada</dd> <dt>fsdfsdfsd</dt> </dl> <dl> <dt>vbcvcvbcvb</dt> <dt>cvbcvbc</dt> <dd>vbcbcvbcvb</dd> <dt>cvbcv</dt> <dt></dt> </dl> <dl> <dt></dt> </dl></li> <li>cool</li> </ul> <p>stuff</p> </li> <li>cool</li></ol><p></p></body></html>";
System.out.println(convert(html));
}
}
Alternatively, one can use HtmlCleaner:
private CharSequence removeHtmlFrom(String html) {
return new HtmlCleaner().clean(html).getText();
}
Use Html.fromHtml
HTML Tags are
<a href=”…”> <b>, <big>, <blockquote>, <br>, <cite>, <dfn>
<div align=”…”>, <em>, <font size=”…” color=”…” face=”…”>
<h1>, <h2>, <h3>, <h4>, <h5>, <h6>
<i>, <p>, <small>
<strike>, <strong>, <sub>, <sup>, <tt>, <u>
As per Android's official documentation, any tags in the HTML will display as a generic replacement string which your program can then go through and replace with real strings.
The Html.fromHtml method takes an Html.TagHandler and an Html.ImageGetter as arguments, as well as the text to parse.
Example
String Str_Html=" <p>This is about me text that the user can put into their profile</p> ";
Then
Your_TextView_Obj.setText(Html.fromHtml(Str_Html).toString());
Output
This is about me text that the user can put into their profile
Here is one more variant that replaces everything at once (HTML tags, HTML entities, and runs of spaces) in HTML content:
content.replaceAll("(<.*?>)|(&.*?;)|([ ]{2,})", ""); where content is a String.
I know this is old, but I was just working on a project that required me to filter HTML and this worked fine:
noHTMLString.replaceAll("\\&.*?\\;", "");
instead of this:
html = html.replaceAll("&nbsp;", "");
html = html.replaceAll("&amp;", "");
It sounds like you want to go from HTML to plain text.
If that is the case look at www.htmlparser.org. Here is an example that strips all the tags out from the html file found at a URL.
It makes use of org.htmlparser.beans.StringBean.
static public String getUrlContentsAsText(String url) {
String content = "";
StringBean stringBean = new StringBean();
stringBean.setURL(url);
content = stringBean.getStrings();
return content;
}
Here is another way to do it:
public static String removeHTML(String input) {
int i = 0;
String[] str = input.split("");
String s = "";
boolean inTag = false;
for (i = input.indexOf("<"); i < input.indexOf(">"); i++) {
inTag = true;
}
if (!inTag) {
for (i = 0; i < str.length; i++) {
s = s + str[i];
}
}
return s;
}
One could also use Apache Tika for this purpose. By default it preserves whitespaces from the stripped html, which may be desired in certain situations:
InputStream htmlInputStream = ..
HtmlParser htmlParser = new HtmlParser();
HtmlContentHandler htmlContentHandler = new HtmlContentHandler();
htmlParser.parse(htmlInputStream, htmlContentHandler, new Metadata());
System.out.println(htmlContentHandler.getBodyText().trim());
One way to retain new-line info with JSoup is to precede all new line tags with some dummy string, execute JSoup and replace dummy string with "\n".
String html = "<p>Line one</p><p>Line two</p>Line three<br/>etc.";
String NEW_LINE_MARK = "NEWLINESTART1234567890NEWLINEEND";
for (String tag: new String[]{"</p>","<br/>","</h1>","</h2>","</h3>","</h4>","</h5>","</h6>","</li>"}) {
html = html.replace(tag, NEW_LINE_MARK+tag);
}
String text = Jsoup.parse(html).text();
text = text.replace(NEW_LINE_MARK + " ", "\n\n");
text = text.replace(NEW_LINE_MARK, "\n\n");
classeString.replaceAll("\\<(/?[^\\>]+)\\>", "\\ ").replaceAll("\\s+", " ").trim()
Sometimes the HTML string comes from XML with escapes such as &lt;. When using Jsoup we need to parse it first and then clean it.
Document doc = Jsoup.parse(htmlstrl);
Whitelist wl = Whitelist.none();
String plain = Jsoup.clean(doc.text(), wl);
Using only Jsoup.parse(htmlstrl).text() can't remove the tags in that case.
Try this for javascript:
const strippedString = htmlString.replace(/(<([^>]+)>)/gi, "");
console.log(strippedString);
You can use this method to remove the HTML tags from the String,
public static String stripHtmlTags(String html) {
return html.replaceAll("<.*?>", "");
}
My 5 cents:
String[] temp = yourString.split("&amp;");
String tmp = "";
if (temp.length > 1) {
for (int i = 0; i < temp.length; i++) {
tmp += temp[i] + "&";
}
yourString = tmp.substring(0, tmp.length() - 1);
}
To get formatted plain HTML text you can do this:
String BR_ESCAPED = "<br/>";
Element el=Jsoup.parse(html).select("body");
el.select("br").append(BR_ESCAPED);
el.select("p").append(BR_ESCAPED+BR_ESCAPED);
el.select("h1").append(BR_ESCAPED+BR_ESCAPED);
el.select("h2").append(BR_ESCAPED+BR_ESCAPED);
el.select("h3").append(BR_ESCAPED+BR_ESCAPED);
el.select("h4").append(BR_ESCAPED+BR_ESCAPED);
el.select("h5").append(BR_ESCAPED+BR_ESCAPED);
String nodeValue=el.text();
nodeValue=nodeValue.replaceAll(BR_ESCAPED, "<br/>");
nodeValue=nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");
To get formatted plain text, change <br/> to \n and change the last line to:
nodeValue=nodeValue.replaceAll("(\\s*\n){3,}", "<br/><br/>");
I know it has been a while since this question was asked, but I found another solution; this is what worked for me:
Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
Source source = new Source(htmlAsString);
Matcher m = REMOVE_TAGS.matcher(source.getTextExtractor().toString());
String clearedHtml = m.replaceAll("");


Glpk java and .mod file

I've got a .mod file and I can run it in Java (using NetBeans).
The file gets its data from another .dat file, because the guy who developed it used GUSEK. Now we need to implement it in Java, but I don't know how to put data into the K constant in the .mod file.
The way doesn't matter; it can be through database queries or file reading.
I don't know anything about mathematical programming; I just need to add values to the already-made GLPK model.
Here's the .mod function:
# OPRE
set K;
param mc {k in K};
param phi {k in K};
param cman {k in K};
param ni {k in K};
param cesp;
param mf;
var x {k in K} binary;
minimize custo: sum {k in K} (mc[k]*phi[k]*(1-x[k]) + cman[k]*phi[k]*x[k]);
s.t. recursos: sum {k in K} (cman[k]*phi[k]*x[k]) - cesp <= 0;
s.t. ocorrencias: sum {k in K} (ni[k] + (1-x[k])*phi[k]) - mf <= 0;
end;
And here's the java code:
package br.com.genera.service.otimi;
import org.gnu.glpk.*;
public class Gmpl implements GlpkCallbackListener, GlpkTerminalListener {
private boolean hookUsed = false;
public static void main(String[] arg) {
String[] nomeArquivo = new String[2];
nomeArquivo[0] = "C:\\PodaEquipamento.mod";
System.out.println(nomeArquivo[0]);
GLPK.glp_java_set_numeric_locale("C");
System.out.println(nomeArquivo[0]);
new Gmpl().solve(nomeArquivo);
}
public void solve(String[] arg) {
glp_prob lp = null;
glp_tran tran;
glp_iocp iocp;
String fname;
int skip = 0;
int ret;
// listen to callbacks
GlpkCallback.addListener(this);
// listen to terminal output
GlpkTerminal.addListener(this);
fname = arg[0];
lp = GLPK.glp_create_prob();
System.out.println("Problem created");
tran = GLPK.glp_mpl_alloc_wksp();
ret = GLPK.glp_mpl_read_model(tran, fname, skip);
if (ret != 0) {
GLPK.glp_mpl_free_wksp(tran);
GLPK.glp_delete_prob(lp);
throw new RuntimeException("Model file not found: " + fname);
}
// generate model
GLPK.glp_mpl_generate(tran, null);
// build model
GLPK.glp_mpl_build_prob(tran, lp);
// set solver parameters
iocp = new glp_iocp();
GLPK.glp_init_iocp(iocp);
iocp.setPresolve(GLPKConstants.GLP_ON);
// do not listen to output anymore
GlpkTerminal.removeListener(this);
// solve model
ret = GLPK.glp_intopt(lp, iocp);
// postsolve model
if (ret == 0) {
GLPK.glp_mpl_postsolve(tran, lp, GLPKConstants.GLP_MIP);
}
// free memory
GLPK.glp_mpl_free_wksp(tran);
GLPK.glp_delete_prob(lp);
// do not listen for callbacks anymore
GlpkCallback.removeListener(this);
// check that the hook function has been used for terminal output.
if (!hookUsed) {
System.out.println("Error: The terminal output hook was not used.");
System.exit(1);
}
}
@Override
public boolean output(String str) {
hookUsed = true;
System.out.print(str);
return false;
}
@Override
public void callback(glp_tree tree) {
int reason = GLPK.glp_ios_reason(tree);
if (reason == GLPKConstants.GLP_IBINGO) {
System.out.println("Better solution found");
}
}
}
And I'm getting this in the console:
Reading model section from C:\PodaEquipamento.mod...
33 lines were read
Generating custo...
C:\PodaEquipamento.mod:24: no value for K
glp_mpl_build_prob: invalid call sequence
Hope someone can help, thanks.
The best way would be to read the data file the same way you read the model file.
ret = GLPK.glp_mpl_read_data(tran, fname_data, skip);
if (ret != 0) {
GLPK.glp_mpl_free_wksp(tran);
GLPK.glp_delete_prob(lp);
throw new RuntimeException("Data file not found: " + fname_data);
}
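For context, GLPK reads the data section after the model and before the model is generated, so in the solve() method above the calls would be ordered roughly like this (the .dat path is an assumption, and the read-data call is taken from the snippet above):
ret = GLPK.glp_mpl_read_model(tran, fname, skip);
// ... error handling as in the original solve() ...
ret = GLPK.glp_mpl_read_data(tran, "C:\\PodaEquipamento.dat", skip); // assumed data file path
if (ret != 0) {
    GLPK.glp_mpl_free_wksp(tran);
    GLPK.glp_delete_prob(lp);
    throw new RuntimeException("Data file could not be read");
}
// only generate and build the problem once both model and data have been read
GLPK.glp_mpl_generate(tran, null);
GLPK.glp_mpl_build_prob(tran, lp);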
I resolved it by just copying the data block from the .dat file into the .mod file.
Anyway, thanks puhgee.

Filter (search and replace) array of bytes in an InputStream

I have an InputStream which takes an HTML file as its input parameter. I have to get the bytes from the input stream.
I have a string: "XYZ". I'd like to convert this string to byte format and check if there is a match for it in the byte sequence I obtained from the InputStream. If there is, I have to replace the match with the byte sequence for some other string.
Is there anyone who could help me with this? I have used regex to find and replace; however, I am unaware of how to find and replace in a byte stream.
Previously, I used jsoup to parse the HTML and replace the string, but due to some UTF encoding problems the file appeared corrupted when I did that.
TL;DR: My question is:
Is there a way to find and replace a string in byte format in a raw InputStream in Java?
Not sure you have chosen the best approach to solve your problem.
That said, I don't like to (and have a policy not to) answer questions with "don't", so here goes...
Have a look at FilterInputStream.
From the documentation:
A FilterInputStream contains some other input stream, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality.
It was a fun exercise to write it up. Here's a complete example for you:
import java.io.*;
import java.util.*;
class ReplacingInputStream extends FilterInputStream {
LinkedList<Integer> inQueue = new LinkedList<Integer>();
LinkedList<Integer> outQueue = new LinkedList<Integer>();
final byte[] search, replacement;
protected ReplacingInputStream(InputStream in,
byte[] search,
byte[] replacement) {
super(in);
this.search = search;
this.replacement = replacement;
}
private boolean isMatchFound() {
Iterator<Integer> inIter = inQueue.iterator();
for (int i = 0; i < search.length; i++)
if (!inIter.hasNext() || search[i] != inIter.next())
return false;
return true;
}
private void readAhead() throws IOException {
// Work up some look-ahead.
while (inQueue.size() < search.length) {
int next = super.read();
inQueue.offer(next);
if (next == -1)
break;
}
}
@Override
public int read() throws IOException {
// Next byte already determined.
if (outQueue.isEmpty()) {
readAhead();
if (isMatchFound()) {
for (int i = 0; i < search.length; i++)
inQueue.remove();
for (byte b : replacement)
outQueue.offer((int) b);
} else
outQueue.add(inQueue.remove());
}
return outQueue.remove();
}
// TODO: Override the other read methods.
}
Example Usage
class Test {
public static void main(String[] args) throws Exception {
byte[] bytes = "hello xyz world.".getBytes("UTF-8");
ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
byte[] search = "xyz".getBytes("UTF-8");
byte[] replacement = "abc".getBytes("UTF-8");
InputStream ris = new ReplacingInputStream(bis, search, replacement);
ByteArrayOutputStream bos = new ByteArrayOutputStream();
int b;
while (-1 != (b = ris.read()))
bos.write(b);
System.out.println(new String(bos.toByteArray()));
}
}
Given the bytes for the string "hello xyz world." it prints:
hello abc world.
The following approach will work, but I don't know how big the impact on performance is.
Wrap the InputStream with an InputStreamReader,
wrap the InputStreamReader with a FilterReader that replaces the strings, then
wrap the FilterReader with a ReaderInputStream.
It is crucial to choose the appropriate encoding, otherwise the content of the stream will become corrupted.
If you want to use regular expressions to replace the strings, then you can use Streamflyer, a tool of mine, which is a convenient alternative to FilterReader. You will find an example for byte streams on the webpage of Streamflyer. Hope this helps.
I needed something like this as well and decided to roll my own solution instead of using the example above by @aioobe. Have a look at the code. You can pull the library from Maven Central, or just copy the source code.
This is how you use it. In this case, I'm using a nested instance to replace two patterns to fix DOS and Mac line endings.
new ReplacingInputStream(new ReplacingInputStream(is, "\n\r", "\n"), "\r", "\n");
Here's the full source code:
/**
* Simple FilterInputStream that can replace occurrences of bytes with something else.
*/
public class ReplacingInputStream extends FilterInputStream {
// while matching, this is where the bytes go.
int[] buf=null;
int matchedIndex=0;
int unbufferIndex=0;
int replacedIndex=0;
private final byte[] pattern;
private final byte[] replacement;
private State state=State.NOT_MATCHED;
// simple state machine for keeping track of what we are doing
private enum State {
NOT_MATCHED,
MATCHING,
REPLACING,
UNBUFFER
}
/**
* @param is input
* @return nested replacing stream that replaces \n\r (DOS) and \r (MAC) line endings with UNIX ones "\n".
*/
public static InputStream newLineNormalizingInputStream(InputStream is) {
return new ReplacingInputStream(new ReplacingInputStream(is, "\n\r", "\n"), "\r", "\n");
}
/**
* Replace occurrences of pattern in the input. Note: input is assumed to be UTF-8 encoded. If not the case use byte[] based pattern and replacement.
* @param in input
* @param pattern pattern to replace.
* @param replacement the replacement or null
*/
public ReplacingInputStream(InputStream in, String pattern, String replacement) {
this(in,pattern.getBytes(StandardCharsets.UTF_8), replacement==null ? null : replacement.getBytes(StandardCharsets.UTF_8));
}
/**
* Replace occurrences of pattern in the input.
* @param in input
* @param pattern pattern to replace
* @param replacement the replacement or null
*/
public ReplacingInputStream(InputStream in, byte[] pattern, byte[] replacement) {
super(in);
Validate.notNull(pattern);
Validate.isTrue(pattern.length>0, "pattern length should be > 0", pattern.length);
this.pattern = pattern;
this.replacement = replacement;
// we will never match more than the pattern length
buf = new int[pattern.length];
}
@Override
public int read(byte[] b, int off, int len) throws IOException {
// copy of parent logic; we need to call our own read() instead of super.read(), which delegates instead of calling our read
if (b == null) {
throw new NullPointerException();
} else if (off < 0 || len < 0 || len > b.length - off) {
throw new IndexOutOfBoundsException();
} else if (len == 0) {
return 0;
}
int c = read();
if (c == -1) {
return -1;
}
b[off] = (byte)c;
int i = 1;
try {
for (; i < len ; i++) {
c = read();
if (c == -1) {
break;
}
b[off + i] = (byte)c;
}
} catch (IOException ee) {
}
return i;
}
@Override
public int read(byte[] b) throws IOException {
// call our own read
return read(b, 0, b.length);
}
@Override
public int read() throws IOException {
// use a simple state machine to figure out what we are doing
int next;
switch (state) {
case NOT_MATCHED:
// we are not currently matching, replacing, or unbuffering
next=super.read();
if(pattern[0] == next) {
// clear whatever was there
buf=new int[pattern.length]; // clear whatever was there
// make sure we start at 0
matchedIndex=0;
buf[matchedIndex++]=next;
if(pattern.length == 1) {
// edgecase when the pattern length is 1 we go straight to replacing
state=State.REPLACING;
// reset replace counter
replacedIndex=0;
} else {
// pattern is longer than 1, keep matching
state=State.MATCHING;
}
// recurse to continue matching
return read();
} else {
return next;
}
case MATCHING:
// the previous bytes matched part of the pattern
next=super.read();
if(pattern[matchedIndex]==next) {
buf[matchedIndex++]=next;
if(matchedIndex==pattern.length) {
// we've found a full match!
if(replacement==null || replacement.length==0) {
// the replacement is empty, go straight to NOT_MATCHED
state=State.NOT_MATCHED;
matchedIndex=0;
} else {
// start replacing
state=State.REPLACING;
replacedIndex=0;
}
}
} else {
// mismatch -> unbuffer
buf[matchedIndex++]=next;
state=State.UNBUFFER;
unbufferIndex=0;
}
return read();
case REPLACING:
// we've fully matched the pattern and are returning bytes from the replacement
next=replacement[replacedIndex++];
if(replacedIndex==replacement.length) {
state=State.NOT_MATCHED;
replacedIndex=0;
}
return next;
case UNBUFFER:
// we partially matched the pattern before encountering a non matching byte
// we need to serve up the buffered bytes before we go back to NOT_MATCHED
next=buf[unbufferIndex++];
if(unbufferIndex==matchedIndex) {
state=State.NOT_MATCHED;
matchedIndex=0;
}
return next;
default:
throw new IllegalStateException("no such state " + state);
}
}
@Override
public String toString() {
return state.name() + " " + matchedIndex + " " + replacedIndex + " " + unbufferIndex;
}
}
There isn't any built-in functionality for search-and-replace on byte streams (InputStream).
And, a method for completing this task efficiently and correctly is not immediately obvious. I have implemented the Boyer-Moore algorithm for streams, and it works well, but it took some time. Without an algorithm like this, you have to resort to a brute-force approach where you look for the pattern starting at every position in the stream, which can be slow.
Even if you decode the HTML as text, using a regular expression to match patterns might be a bad idea, since HTML is not a "regular" language.
So, even though you've run into some difficulties, I suggest you pursue your original approach of parsing the HTML as a document. While you are having trouble with the character encoding, it will probably be easier, in the long run, to fix the right solution than it will be to jury-rig the wrong solution.
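As an illustration of that route, here is a minimal sketch (my own) where the charset is passed explicitly to jsoup on both the parse and the output side; the "XYZ" token and UTF-8 charset are assumptions taken from the question:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class JsoupReplaceSketch {
    static byte[] replaceInHtml(InputStream htmlIn) throws Exception {
        // Parse with an explicit charset so the text is decoded correctly.
        Document doc = Jsoup.parse(htmlIn, "UTF-8", "");
        // Emit with the same charset to avoid corrupting the output bytes.
        doc.outputSettings().charset(StandardCharsets.UTF_8);
        String replaced = doc.outerHtml().replace("XYZ", "some other string");
        return replaced.getBytes(StandardCharsets.UTF_8);
    }
}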
I needed a solution to this, but found the answers here incurred too much memory and/or CPU overhead. The below solution significantly outperforms the others here in these terms based on simple benchmarking.
This solution is especially memory-efficient, incurring no measurable cost even with >GB streams.
That said, this is not a zero-CPU-cost solution. The CPU/processing-time overhead is probably reasonable for all but the most demanding/resource-sensitive scenarios, but the overhead is real and should be considered when evaluating the worthiness of employing this solution in a given context.
In my case, our max real-world file size that we are processing is about 6MB, where we see added latency of about 170ms with 44 URL replacements. This is for a Zuul-based reverse-proxy running on AWS ECS with a single CPU share (1024). For most of the files (under 100KB), the added latency is sub-millisecond. Under high-concurrency (and thus CPU contention), the added latency could increase, however we are currently able to process hundreds of the files concurrently on a single node with no humanly-noticeable latency impact.
The solution we are using:
import java.io.IOException;
import java.io.InputStream;
public class TokenReplacingStream extends InputStream {
private final InputStream source;
private final byte[] oldBytes;
private final byte[] newBytes;
private int tokenMatchIndex = 0;
private int bytesIndex = 0;
private boolean unwinding;
private int mismatch;
private int numberOfTokensReplaced = 0;
public TokenReplacingStream(InputStream source, byte[] oldBytes, byte[] newBytes) {
assert oldBytes.length > 0;
this.source = source;
this.oldBytes = oldBytes;
this.newBytes = newBytes;
}
@Override
public int read() throws IOException {
if (unwinding) {
if (bytesIndex < tokenMatchIndex) {
return oldBytes[bytesIndex++];
} else {
bytesIndex = 0;
tokenMatchIndex = 0;
unwinding = false;
return mismatch;
}
} else if (tokenMatchIndex == oldBytes.length) {
if (bytesIndex == newBytes.length) {
bytesIndex = 0;
tokenMatchIndex = 0;
numberOfTokensReplaced++;
} else {
return newBytes[bytesIndex++];
}
}
int b = source.read();
if (b == oldBytes[tokenMatchIndex]) {
tokenMatchIndex++;
} else if (tokenMatchIndex > 0) {
mismatch = b;
unwinding = true;
} else {
return b;
}
return read();
}
@Override
public void close() throws IOException {
source.close();
}
public int getNumberOfTokensReplaced() {
return numberOfTokensReplaced;
}
}
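A quick usage sketch of the class above (my own; the sample strings are made up):
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class TokenReplacingStreamDemo {
    public static void main(String[] args) throws Exception {
        byte[] oldBytes = "xyz".getBytes(StandardCharsets.UTF_8);
        byte[] newBytes = "abc".getBytes(StandardCharsets.UTF_8);
        InputStream source = new ByteArrayInputStream("hello xyz world".getBytes(StandardCharsets.UTF_8));
        try (TokenReplacingStream in = new TokenReplacingStream(source, oldBytes, newBytes);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
            System.out.println(out.toString("UTF-8")); // prints: hello abc world
        }
    }
}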
I came up with this simple piece of code when I needed to serve a template file in a servlet, replacing a certain keyword with a value. It should be pretty fast and low on memory. Using piped streams, I guess you can use it for all sorts of things.
/JC
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

public static void replaceStream(InputStream in, OutputStream out, String search, String replace) throws IOException
{
    // Note: the reader and writer use the platform default charset.
    Writer writer = new OutputStreamWriter(out);
    replaceStream(new InputStreamReader(in), writer, search, replace);
    writer.flush(); // flush the wrapping writer so buffered characters reach the underlying stream
}

public static void replaceStream(Reader in, Writer out, String search, String replace) throws IOException
{
    char[] searchChars = search.toCharArray();
    int[] buffer = new int[searchChars.length];
    int x, r, si = 0, sm = searchChars.length;

    while ((r = in.read()) != -1) {
        if (searchChars[si] == r) {
            // The char matches our pattern
            buffer[si++] = r;
            if (si == sm) {
                // We have reached a matching string
                out.write(replace);
                si = 0;
            }
        } else if (si > 0) {
            // No match and buffered char(s), empty buffer and pass the char forward
            for (x = 0; x < si; x++) {
                out.write(buffer[x]);
            }
            si = 0;
            out.write(r);
        } else {
            // No match and nothing buffered, just pass the char forward
            out.write(r);
        }
    }

    // Empty buffer
    for (x = 0; x < si; x++) {
        out.write(buffer[x]);
    }
}
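A minimal usage sketch, assuming the two replaceStream methods above live in the same class as this main method; the template text is made up:
public static void main(String[] args) throws IOException {
    byte[] template = "Dear ${user}, your order has shipped."
            .getBytes(java.nio.charset.StandardCharsets.UTF_8);
    java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();

    replaceStream(new java.io.ByteArrayInputStream(template), out, "${user}", "Alice");

    // Prints: Dear Alice, your order has shipped.
    System.out.println(out.toString("UTF-8"));
}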

How to deal with the URISyntaxException

I got this error message:
java.net.URISyntaxException: Illegal character in query at index 31: http://finance.yahoo.com/q/h?s=^IXIC
My_Url = http://finance.yahoo.com/q/h?s=^IXIC
When I copy it into a browser address field it shows the correct page, so it is a valid URL, but I can't parse it with new URI(My_Url).
I tried My_Url = My_Url.replace("^", "\\^"), but:
it won't be the URL I need, and
it doesn't work either.
How should I handle this?
Frank
You need to encode the URI to replace illegal characters with legal encoded characters. If you first make a URL (so you don't have to do the parsing yourself) and then make a URI using the five-argument constructor, then the constructor will do the encoding for you.
import java.net.*;
public class Test {
public static void main(String[] args) {
String myURL = "http://finance.yahoo.com/q/h?s=^IXIC";
try {
URL url = new URL(myURL);
String nullFragment = null;
URI uri = new URI(url.getProtocol(), url.getHost(), url.getPath(), url.getQuery(), nullFragment);
System.out.println("URI " + uri.toString() + " is OK");
} catch (MalformedURLException e) {
System.out.println("URL " + myURL + " is a malformed URL");
} catch (URISyntaxException e) {
System.out.println("URI " + myURL + " is a malformed URI");
}
}
}
Use % encoding for the ^ character, viz. http://finance.yahoo.com/q/h?s=%5EIXIC
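A minimal sketch showing that the percent-encoded form parses where the raw ^ does not:
import java.net.URI;
import java.net.URISyntaxException;

public class PercentEncodingDemo {
    public static void main(String[] args) throws URISyntaxException {
        // %5E is the percent-encoded form of '^', so this parses without a URISyntaxException.
        URI uri = new URI("http://finance.yahoo.com/q/h?s=%5EIXIC");
        System.out.println(uri.getQuery()); // prints the decoded query: s=^IXIC
    }
}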
You have to encode your parameters.
Something like this will do:
import java.net.*;
import java.io.*;
public class EncodeParameter {
public static void main( String [] args ) throws URISyntaxException ,
UnsupportedEncodingException {
String myQuery = "^IXIC";
URI uri = new URI( String.format(
"http://finance.yahoo.com/q/h?s=%s",
URLEncoder.encode( myQuery , "UTF8" ) ) );
System.out.println( uri );
}
}
http://java.sun.com/javase/6/docs/api/java/net/URLEncoder.html
Rather than encoding the URL beforehand you can do the following
String link = "http://example.com";
URL url = null;
URI uri = null;
try {
    url = new URL(link);
} catch (MalformedURLException e) {
    e.printStackTrace();
}
try {
    uri = new URI(url.toString());
} catch (URISyntaxException e) {
    try {
        uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(),
                url.getPort(), url.getPath(), url.getQuery(),
                url.getRef());
    } catch (URISyntaxException e1) {
        e1.printStackTrace();
    }
}
try {
    url = uri.toURL();
} catch (MalformedURLException e) {
    e.printStackTrace();
}
String encodedLink = url.toString();
A general solution requires parsing the URL into an RFC 2396-compliant URI (note that this is an old version of the URI standard, which java.net.URI uses).
I have written a Java URL parsing library that makes this possible: galimatias. With this library, you can achieve your desired behaviour with this code:
String urlString = //...
URLParsingSettings settings = URLParsingSettings.create()
.withStandard(URLParsingSettings.Standard.RFC_2396);
URL url = URL.parse(settings, urlString);
Note that galimatias is in a very early stage and some features are experimental, but it is already quite solid for this use case.
A space is encoded to %20 in URLs, and to + in form-submitted data (content type application/x-www-form-urlencoded). You need the former.
Using Guava:
dependencies {
compile 'com.google.guava:guava:28.1-jre'
}
You can use UrlEscapers:
String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString);
Don't use String.replace; it would only encode the space. Use a library instead.
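A minimal sketch of the difference, using a value containing a space (expected output in the comments):
import com.google.common.net.UrlEscapers;

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SpaceEncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String value = "Incontinence Liners";

        // Form encoding (application/x-www-form-urlencoded): the space becomes '+'.
        System.out.println(URLEncoder.encode(value, "UTF-8"));              // Incontinence+Liners

        // Guava's URL escapers percent-encode the space instead.
        System.out.println(UrlEscapers.urlFragmentEscaper().escape(value)); // Incontinence%20Liners
    }
}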
Couldn't come up with anything better for
http://server.ru:8080/template/get?type=mail&format=html&key=ecm_task_assignment&label=Согласовать с контрагентом&descr=Описание&objectid=2231
than this:
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Pattern;

// Returns true if the string contains any non-ASCII character.
public static boolean checkForExternal(String str) {
    int length = str.length();
    for (int i = 0; i < length; i++) {
        if (str.charAt(i) > 0x7F) {
            return true;
        }
    }
    return false;
}

private static final Pattern COLON = Pattern.compile("%3A", Pattern.LITERAL);
private static final Pattern SLASH = Pattern.compile("%2F", Pattern.LITERAL);
private static final Pattern QUEST_MARK = Pattern.compile("%3F", Pattern.LITERAL);
private static final Pattern EQUAL = Pattern.compile("%3D", Pattern.LITERAL);
private static final Pattern AMP = Pattern.compile("%26", Pattern.LITERAL);

// Percent-encodes the whole URL, then restores the structural characters
// (: / ? = &) so the URL keeps its shape.
public static String encodeUrl(String url) {
    if (checkForExternal(url)) {
        try {
            String value = URLEncoder.encode(url, "UTF-8");
            value = COLON.matcher(value).replaceAll(":");
            value = SLASH.matcher(value).replaceAll("/");
            value = QUEST_MARK.matcher(value).replaceAll("?");
            value = EQUAL.matcher(value).replaceAll("=");
            return AMP.matcher(value).replaceAll("&");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    } else {
        return url;
    }
}
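A minimal usage sketch, assuming checkForExternal and encodeUrl above live in the same class as this main method; it uses the URL from this answer:
public static void main(String[] args) throws java.net.URISyntaxException {
    String raw = "http://server.ru:8080/template/get?type=mail&format=html"
            + "&key=ecm_task_assignment&label=Согласовать с контрагентом"
            + "&descr=Описание&objectid=2231";
    String encoded = encodeUrl(raw);
    // The Cyrillic characters become %XX escapes and the spaces become '+', while
    // : / ? = & are restored, so the result now parses without a URISyntaxException:
    System.out.println(new java.net.URI(encoded));
}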
I had this exception in a test that checks URLs actually accessed by users.
The URLs sometimes contain an illegal character and fail with this error, so I made a function that encodes only the illegal characters in the URL string, like this:
String encodeIllegalChar(String uriStr,String enc)
throws URISyntaxException,UnsupportedEncodingException {
String _uriStr = uriStr;
int retryCount = 17;
while(true){
try{
new URI(_uriStr);
break;
}catch(URISyntaxException e){
String reason = e.getReason();
if(reason == null ||
!(
reason.contains("in path") ||
reason.contains("in query") ||
reason.contains("in fragment")
)
){
throw e;
}
if(0 > retryCount--){
throw e;
}
String input = e.getInput();
int idx = e.getIndex();
String illChar = String.valueOf(input.charAt(idx));
_uriStr = input.replace(illChar,URLEncoder.encode(illChar,enc));
}
}
return _uriStr;
}
test:
String q = "\\'|&`^\"<>)(}{][";
String url = "http://test.com/?q=" + q + "#" + q;
String eic = encodeIllegalChar(url, "UTF-8");
System.out.println(String.format(" original:%s",url));
System.out.println(String.format(" encoded:%s",eic));
System.out.println(String.format(" uri-obj:%s",new URI(eic)));
System.out.println(String.format("re-decoded:%s",URLDecoder.decode(eic, "UTF-8")));
If you're using RestangularV2 to post to a Spring controller in Java, you can get this exception if you use RestangularV2.one() instead of RestangularV2.all().
Replace spaces in the URL with +. For example, if the URL contains dimension1=Incontinence Liners, then replace it with dimension1=Incontinence+Liners.
