Logical help - java regex - java

I have a string which contains image tags, more than 1. Now I need to regex out the alt= tag. I tried it like this:
while (m3.find()) {
Pattern p4 = Pattern.compile("<!\\[CDATA\\[(.*?)\\]\\]>");
Matcher m4 = p4.matcher(m3.group());
while (m4.find()) {
if(m4.group().contains("<img src")) {
Pattern p6 = Pattern.compile("<img src(.*?)/>");
Matcher m6 = p6.matcher(m4.group());
while (m6.find()) {
Pattern p7 = Pattern.compile("alt=\"(.*?)\"");
Matcher m7 = p7.matcher(m6.group());
while (m7.find()) {
messages.add(m4.group().replace(m6.group(), m7.group().replace("alt=", "").replace("\"", "")).replace("<![CDATA[", "").replace("]]>", ""));
}
}
} else {
messages.add(m4.group().replace("<![CDATA[", "").replace("]]>", ""));
}
}
}
The problem is: there is more than 1 image tag. messages is an ArrayList. I need just 1 messages.add for ALL images in the actual message. The code as it is does something very different and I don't have any idea how to fix it or where my mistakes are :/ I just want replace the whole with the content of alt="...", but every the actual message contains. Can anyone help me?

Maybe you should use a third party library like jsoup, which allows parsing, extracting and modifying html documents in a jquery fashion. Using this library, changing the attributes of certain html elements (as explained here) should work like this:
Document doc = Jsoup.parse(html);
doc.select("div.comments a").attr("rel", "nofollow");

Related

Formatting string content xtext 2.14

Given a grammar (simplified version below) where I can enter arbitrary text in a section of the grammar, is it possible to format the content of the arbitrary text? I understand how to format the position of the arbitrary text in relation to the rest of the grammar, but not whether it is possible to format the content string itself?
Sample grammar
Model:
'content' content=RT
terminal RT: // (returns ecore::EString:)
'RT>>' -> '<<RT';
Sample content
content RT>>
# Some sample arbitrary text
which I would like to format
<<RT
you can add custom ITextReplacer to the region of the string.
assuming you have a grammar like
Model:
greetings+=Greeting*;
Greeting:
'Hello' name=STRING '!';
you can do something like the follow in the formatter
def dispatch void format(Greeting model, extension IFormattableDocument document) {
model.prepend[newLine]
val region = model.regionFor.feature(MyDslPackage.Literals.GREETING__NAME)
val r = new AbstractTextReplacer(document, region) {
override createReplacements(ITextReplacerContext it) {
val text = region.text
var int index = text.indexOf(SPACE);
val offset = region.offset
while (index >=0){
it.addReplacement(region.textRegionAccess.rewriter.createReplacement(offset+index, SPACE.length, "\n"))
index = text.indexOf(SPACE, index+SPACE.length()) ;
}
it
}
}
addReplacer(r)
}
this will turn this model
Hello "A B C"!
into
Hello "A
B
C"!
of course you need to come up with a more sophisticated formatter logic.
see How to define different indentation levels in the same document with Xtext formatter too

Regex for Facebook Photo ID from a URL in Java

I’m trying to figure out a regex expression in Java to obtain the photo's id for this facebook url:
https://www.facebook.com/566162426788268/photos/a.566209603450217.1073741828.566162426788268/1214828765254961/?type=3&theater
I have come up with the solution
\w++(?=/\?)
but it does not work.
Help appreciated!
In Java you could look for "\w+(?=\/\?)" and since there is no match groups extracted you get group 0.
Example snippet:
String photoIdPatternAsString = "\\w+(?=\\/\\?)";
Matcher postIdMatcher = Pattern.compile(photoIdPatternAsString).matcher(postUrl);
if (postIdMatcher.find()) {
postId = postIdMatcher.group(0);
} else throw new IOException();

get a substring with regex [duplicate]

I need a regex pattern for finding web page links in HTML.
I first use #"(<a.*?>.*?</a>)" to extract links (<a>), but I can't fetch href from that.
My strings are:
<a href="www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="http://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="https://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="www.example.com/page.php/404" ....></a>
1, 2 and 3 are valid and I need them, but number 4 is not valid for me
(? and = is essential)
Thanks everyone, but I don't need parsing <a>. I have a list of links in href="abcdef" format.
I need to fetch href of the links and filter it, my favorite urls must be contain ? and = like page.php?id=5
Thanks!
I'd recommend using an HTML parser over a regex, but still here's a regex that will create a capturing group over the value of the href attribute of each links. It will match whether double or single quotes are used.
<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1
You can view a full explanation of this regex at here.
Snippet playground:
const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/;
const textToMatchInput = document.querySelector('[name=textToMatch]');
document.querySelector('button').addEventListener('click', () => {
console.log(textToMatchInput.value.match(linkRx));
});
<label>
Text to match:
<input type="text" name="textToMatch" value='<a href="google.com"'>
<button>Match</button>
</label>
Using regex to parse html is not recommended
regex is used for regularly occurring patterns.html is not regular with it's format(except xhtml).For example html files are valid even if you don't have a closing tag!This could break your code.
Use an html parser like htmlagilitypack
You can use this code to retrieve all href's in anchor tag using HtmlAgilityPack
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
var hrefList = doc.DocumentNode.SelectNodes("//a")
.Select(p => p.GetAttributeValue("href", "not found"))
.ToList();
hrefList contains all href`s
Thanks everyone (specially #plalx)
I find it quite overkill enforce the validity of the href attribute with such a complex and cryptic pattern while a simple expression such as
<a\s+(?:[^>]*?\s+)?href="([^"]*)"
would suffice to capture all URLs. If you want to make sure they contain at least a query string, you could just use
<a\s+(?:[^>]*?\s+)?href="([^"]+\?[^"]+)"
My final regex string:
First use one of this:
st = #"((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+ \w\d:##%/;$()~_?\+-=\\\.&]*)";
st = #"<a href[^>]*>(.*?)</a>";
st = #"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))?)";
st = #"((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:##%/;$()~_?\+,\-=\\.&]+)";
st = #"(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)";
st = #"(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.)[\w\d:##%/;$()~_?\+-=\\\.&]*)";
st = #"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
st = #"(<a.*?>.*?</a>)";
st = #"(?:hrefs*=)(?:[s""']*)(?!#|mailto|location.|javascript|.*css|.*this.)(?.*?)(?:[s>""'])";
st = #"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = #"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
st = #"(http|https)://([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = #"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)";
st = #"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = #"http(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$";
st = #"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*";
my choice is
#"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*"
Second Use this:
st = "(.*)?(.*)=(.*)";
Problem Solved. Thanks every one :)
Try this :
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void Form1_Load(object sender, EventArgs e)
{
var res = Find(html);
}
public static List<LinkItem> Find(string file)
{
List<LinkItem> list = new List<LinkItem>();
// 1.
// Find all matches in file.
MatchCollection m1 = Regex.Matches(file, #"(<a.*?>.*?</a>)",
RegexOptions.Singleline);
// 2.
// Loop over each match.
foreach (Match m in m1)
{
string value = m.Groups[1].Value;
LinkItem i = new LinkItem();
// 3.
// Get href attribute.
Match m2 = Regex.Match(value, #"href=\""(.*?)\""",
RegexOptions.Singleline);
if (m2.Success)
{
i.Href = m2.Groups[1].Value;
}
// 4.
// Remove inner tags from text.
string t = Regex.Replace(value, #"\s*<.*?>\s*", "",
RegexOptions.Singleline);
i.Text = t;
list.Add(i);
}
return list;
}
public struct LinkItem
{
public string Href;
public string Text;
public override string ToString()
{
return Href + "\n\t" + Text;
}
}
}
Input:
string html = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> 2.<a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> ";
Result:
[0] = {www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
[1] = {http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
C# Scraping HTML Links
Scraping HTML extracts important page elements. It has many legal uses
for webmasters and ASP.NET developers. With the Regex type and
WebClient, we implement screen scraping for HTML.
Edited
Another easy way:you can use a web browser control for getting href from tag a,like this:(see my example)
public Form1()
{
InitializeComponent();
webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
}
private void Form1_Load(object sender, EventArgs e)
{
webBrowser1.DocumentText = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"https://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"www.aaa.xx/xx.zz/xxx\" ....></a>";
}
void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
List<string> href = new List<string>();
foreach (HtmlElement el in webBrowser1.Document.GetElementsByTagName("a"))
{
href.Add(el.GetAttribute("href"));
}
}
Try this regex:
"href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))"
You will get more help from discussions over:
Regular expression to extract URL from an HTML link
and
Regex to get the link in href. [asp.net]
Hope its helpful.
HTMLDocument DOC = this.MySuperBrowser.Document as HTMLDocument;
public IHTMLAnchorElement imageElementHref;
imageElementHref = DOC.getElementById("idfirsticonhref") as IHTMLAnchorElement;
Simply try this code
I came up with this one, that supports anchor and image tags, and supports single and double quotes.
<[a|img]+\\s+(?:[^>]*?\\s+)?[src|href]+=[\"']([^\"']*)['\"]
So
click here
Will match:
Match 1: /something.ext
And
<a href='/something.ext'>click here</a>
Will match:
Match 1: /something.ext
Same goes for img src attributes
I took a much simpler approach. This one simply looks for href attributes, and captures the value (between apostrophes) trailing it into a group named url:
href=['"](?<url>.*?)['"]
I think in this case it is one of the simplest pregmatches
/<a\s*(.*?id[^"]*")/g
gets links with the variable id in the address
starts from href including it, gets all characters/signs (. - excluding new line signs)
until first id occur, including it, and next all signs to nearest next " sign ([^"]*)

Parser JSoup change the tags to lower case letter

I did some research and it seems that is standard Jsoup make this change. I wonder if there is a way to configure this or is there some other Parser I can be converted to a document of Jsoup, or some way to fix this?
Unfortunately not, the constructor of Tag class changes the name to lower case:
private Tag(String tagName) {
this.tagName = tagName.toLowerCase();
}
But there are two ways to change this behavour:
If you want a clean solution, you can clone / download the JSoup Git and change this line.
If you want a dirty solution, you can use reflection.
Example for #2:
Field tagName = Tag.class.getDeclaredField("tagName"); // Get the field which contains the tagname
tagName.setAccessible(true); // Set accessible to allow changes
for( Element element : doc.select("*") ) // Iterate over all tags
{
Tag tag = element.tag(); // Get the tag of the element
String value = tagName.get(tag).toString(); // Get the value (= name) of the tag
if( !value.startsWith("#") ) // You can ignore all tags starting with a '#'
{
tagName.set(tag, value.toUpperCase()); // Set the tagname to the uppercase
}
}
tagName.setAccessible(false); // Revert to false
Here is a code sample (version >= 1.11.x):
Parser parser = Parser.htmlParser();
parser.settings(new ParseSettings(true, true));
Document doc = parser.parseInput(html, baseUrl);
There is ParseSettings class introduced in version 1.9.3.
It comes with options to preserve case for tags and attributes.
You must use xmlParser instead of htmlParser and the tags will remain unchanged. One line does the trick:
String html = "<camelCaseTag>some text</camelCaseTag>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
I am using 1.11.1-SNAPSHOT version which does not have this piece of code.
private Tag(String tagName) {
this.tagName = tagName.toLowerCase();
}
So I checked ParseSettings as suggested above and changed this piece of code from:
static {
htmlDefault = new ParseSettings(false, false);
preserveCase = new ParseSettings(true, true);
}
to:
static {
htmlDefault = new ParseSettings(true, true);
preserveCase = new ParseSettings(true, true);
}
and skipped test cases while building JAR.

url harvester string manipulation

I'm doing a recursive url harvest.. when I find an link in the source that doesn't start with "http" then I append it to the current url. Problem is when I run into a dynamic site the link without an http is usually a new parameter for the current url. For example if the current url is something like http://www.somewebapp.com/default.aspx?pageid=4088 and in the source for that page there is a link which is default.aspx?pageid=2111. In this case I need do some string manipulation; this is where I need help.
pseudocode:
if part of the link found is a contains a substring of the current url
save the substring
save the unique part of the link found
replace whatever is after the substring in the current url with the unique saved part
What would this look like in java? Any ideas for doing this differently? Thanks.
As per comment, here's what I've tried:
if (!matched.startsWith("http")) {
String[] splitted = url.toString().split("/");
java.lang.String endOfURL = splitted[splitted.length-1];
boolean b = false;
while (!b && endOfURL.length() > 5) { // f.bar shortest val
endOfURL = endOfURL.substring(0, endOfURL.length()-2);
if (matched.contains(endOfURL)) {
matched = matched.substring(endOfURL.length()-1);
matched = url.toString().substring(url.toString().length() - matched.length()) + matched;
b = true;
}
}
it's not working well..
I think you are doing this the wrong way. Java has two classes URL and URI which are capable of parsing URL/URL strings much more accurately than a "string bashing" solution. For example the URL constructor URL(URL, String) will create a new URL object in the context of an existing one, without you needing to worry whether the String is an absolute URL or a relative one. You would use it something like this:
URL currentPageUrl = ...
String linkUrlString = ...
// (Exception handling not included ...)
URL linkUrl = new URL(currentPageUrl, linkUrlString);

Categories

Resources