Copying parts of a string several times in Java/Android - java

I have the source code of a website where text messages start with "<h2>" and end with "</h2>". In my app, I read the source code into a string. Now I want to read only the messages, and I have tried this:
returned = get.getInternetData("http://blablabla.com");
int start = returned.indexOf("<h2>") + 4;
int end = returned.indexOf("</h2>");
String message = returned.substring(start, end);
The problem is that I only get the very first message! My idea was to use a Scanner object and do something like:
while (scan.hasNext("<h2>")) {
}
But there are no get methods on the Scanner. How can I read all the messages from the source code?

You should do something like this:
int lastIndex = 0;
while (returned.indexOf("<h2>", lastIndex) != -1) {
    int start = returned.indexOf("<h2>", lastIndex) + 4;
    int end = returned.indexOf("</h2>", start);
    String message = returned.substring(start, end);
    // do your thing with message
    lastIndex = end + "</h2>".length();
}

Using Jsoup you can do this:
Document doc = Jsoup.connect("http://blablabla.com").get();
Elements h2Tag = doc.select("h2");
ArrayList<String> messages = new ArrayList<String>();
for (Element mess : h2Tag) {
    messages.add(mess.text()); // plain text inside each <h2>
}

Related

JSOUP - Extract Dynamic Web Data

I tried to extract the price. Can anyone please help me? There is no output for the price or its weight. I've tried several ways but got no results.
Document doc = Jsoup.connect("https://www.jakmall.com/tokocamzone/mi-travel-charger-20a-output-fast-charging#9730928979371").get();
Elements rows = doc.getElementsByAttributeValue("class", "div[dp__price dp__price--2 format__money]");
System.out.println("rows.size() = " + rows.size());
String index = "";
for (Element span : rows) {
index = span.text();
}
System.out.println("index = " + index);
I've tried another way but did not get the result. I was very curious but did not find the right way.
If you run the two lines of code below, you will discover that there is no price or div[dp__price dp__price--2 format__money] in the DOM. There is only JavaScript.
String d = doc.getElementsByClass("dp__header__info").outerHtml();
System.out.println(d);
Jsoup is not able to fetch the price because the content is loaded dynamically after the page loads. Consider using Selenium, which is more powerful and supports JavaScript-driven websites.
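For illustration (not part of the original answer), a minimal Selenium sketch in Java; the CSS class dp__price is an assumption taken from the class names quoted in the question and may not match the live page, and a chromedriver binary must be available on the PATH:
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class PriceScraper {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://www.jakmall.com/tokocamzone/mi-travel-charger-20a-output-fast-charging");
            // The selector is an assumption; an explicit WebDriverWait may be needed
            // if the price is rendered late by JavaScript.
            String price = driver.findElement(By.cssSelector(".dp__price")).getText();
            System.out.println("price = " + price);
        } finally {
            driver.quit();
        }
    }
}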

Java appendChild (CSV to XML conversion) doesn't work for ONE node

So I need to convert a CSV file to separate XML files (one XML file per line in the CSV file). This all works fine, except for the fact that it refuses to add one value. It adds the others without a problem, but for some mysterious reason it refuses to create a tag for one of my nodes.
Document newDoc = documentBuilder.newDocument();
Element rootElement = newDoc.createElement("XMLoutput");
newDoc.appendChild(rootElement);
String header = headers.get(col);
String value = null;
String value2 = null;
if (col < rowValues.length) {
if(header.equals("delay")) {
value = rowValues[col];
Thread.sleep(Long.parseLong(value));
}
Element shipidElement = newDoc.createElement("shipID");
shipidElement.appendChild(newDoc.createTextNode(FilenameUtils.getBaseName(csvFileName)));
rootElement.appendChild(shipidElement);
if(header.equals("centraleID")) {
value = rowValues[col];
System.out.println(value); //to check if the if condition works, it does
Element centralElement = newDoc.createElement(header);
Text child = newDoc.createTextNode(value);
centralElement.appendChild(child);
rootElement.appendChild(centralElement);
}
else if(header.equals("afstandTotKade")) {
value2 = rowValues[col];
Element curElement = newDoc.createElement(header);
curElement.appendChild(newDoc.createTextNode(value2));
rootElement.appendChild(curElement);
}
String timeStamp = new SimpleDateFormat("HH:mm:ss").format(new Date());
Element timeElement = newDoc.createElement("Timestamp");
timeElement.appendChild(newDoc.createTextNode(timeStamp));
rootElement.appendChild(timeElement);
}
So in the above code, the if branch checking for centraleID actually works, because it prints out the values; however, the XML file does not get that tag, not even if I just insert a literal string instead of the header value. It does, however, insert the afstandTotKade node and the timestamp node. I am dumbfounded.
PS: this is only part of the code, of course, but the problem is so minute that it seemed superfluous to add all of it.
PPS: the code was originally the same as the others; I've just been playing around.
This is the resulting XML file, by the way, so the other if does add its nodes, and I know the spelling in the header check is correct because it does print out the values (the System.out.println) when I run it:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<XMLoutput>
<shipID>1546312</shipID>
<afstandTotKade>1000</afstandTotKade>
<Timestamp>22:06:33</Timestamp>
</XMLoutput>
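As a sanity check (not from the question), here is a minimal, self-contained DOM sketch that creates a centraleID element with the same createElement/appendChild calls and prints the serialized document. If this prints the tag, the DOM calls themselves are fine and the problem lies in the surrounding loop or in how the document is written out; the sample value is made up.
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class CentraleIdCheck {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("XMLoutput");
        doc.appendChild(root);

        // Same pattern as the question: element + text node, appended to the root.
        Element central = doc.createElement("centraleID");
        central.appendChild(doc.createTextNode("12345"));
        root.appendChild(central);

        // Serialize the DOM to a string so the result is visible immediately.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        System.out.println(out);
    }
}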

Creating a poll system using PircBot

I'm new to Java and I'm trying to create a poll system using PircBot.
So far my code is this:
if (message.startsWith("!poll")) {
String polly = message.substring(6);
String[] vote = polly.split(" ");
String vote1 = vote[0];
String vote2 = vote[1];
}
This splits the string so that someone can type, for example, !poll option1 option2, and it will be split into vote1 = option1 and vote2 = option2.
I'm kind of lost from here. Am I even heading in the right direction for creating a voting system?
I figure that I'd have a separate statement as follows.
if (message.equalsIgnoreCase("!vote " + option1))
But I'm not sure where to go with that either.
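A minimal, hedged sketch of one way to continue (not from the thread): keep the current options and vote counts in fields of your bot class and handle the commands inside PircBot's onMessage callback. The !results command and the field names are my own additions for illustration, and nothing here stops the same user from voting twice.
import java.util.HashMap;
import java.util.Map;
import org.jibble.pircbot.PircBot;

public class PollBot extends PircBot {
    private String option1, option2;
    private final Map<String, Integer> votes = new HashMap<String, Integer>();

    @Override
    protected void onMessage(String channel, String sender, String login, String hostname, String message) {
        if (message.startsWith("!poll ")) {
            // Start a new poll with two options, as in the question's split.
            String[] options = message.substring(6).split(" ");
            option1 = options[0];
            option2 = options[1];
            votes.clear();
            votes.put(option1, 0);
            votes.put(option2, 0);
            sendMessage(channel, "Poll started: !vote " + option1 + " or !vote " + option2);
        } else if (message.startsWith("!vote ")) {
            // Count a vote only if it names one of the current options.
            String choice = message.substring(6).trim();
            if (votes.containsKey(choice)) {
                votes.put(choice, votes.get(choice) + 1);
            }
        } else if (message.equals("!results")) {
            sendMessage(channel, option1 + ": " + votes.get(option1) + ", " + option2 + ": " + votes.get(option2));
        }
    }
}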

get a substring with regex [duplicate]

I need a regex pattern for finding web page links in HTML.
I first used @"(<a.*?>.*?</a>)" to extract the links (<a>), but I can't fetch the href from that.
My strings are:
<a href="www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="http://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="https://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="www.example.com/page.php/404" ....></a>
1, 2 and 3 are valid and I need them, but number 4 is not valid for me
(? and = are essential)
Thanks everyone, but I don't need to parse <a>. I have a list of links in href="abcdef" format.
I need to fetch the href of the links and filter them; the URLs I want must contain ? and =, like page.php?id=5.
Thanks!
I'd recommend using an HTML parser over a regex, but still, here's a regex that will create a capturing group over the value of the href attribute of each link. It will match whether double or single quotes are used.
<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1
You can view a full explanation of this regex here.
Snippet playground:
const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/;
const textToMatchInput = document.querySelector('[name=textToMatch]');
document.querySelector('button').addEventListener('click', () => {
console.log(textToMatchInput.value.match(linkRx));
});
<label>
Text to match:
<input type="text" name="textToMatch" value='<a href="google.com"'>
<button>Match</button>
</label>
Using regex to parse HTML is not recommended.
Regex is for regularly occurring patterns; HTML is not regular in its format (except XHTML). For example, HTML files are valid even if you don't have a closing tag! This could break your code.
Use an HTML parser like HtmlAgilityPack.
You can use this code to retrieve all hrefs in anchor tags using HtmlAgilityPack:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
var hrefList = doc.DocumentNode.SelectNodes("//a")
.Select(p => p.GetAttributeValue("href", "not found"))
.ToList();
hrefList contains all the hrefs.
Thanks everyone (especially @plalx).
I find it quite overkill to enforce the validity of the href attribute with such a complex and cryptic pattern when a simple expression such as
<a\s+(?:[^>]*?\s+)?href="([^"]*)"
would suffice to capture all the URLs. If you want to make sure they contain at least a query string, you could just use
<a\s+(?:[^>]*?\s+)?href="([^"]+\?[^"]+)"
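For completeness, a small sketch applying that simpler query-string pattern in Java (the thread mixes C# and JavaScript; this is just the same idea with java.util.regex, and the sample HTML is taken from the question):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    public static void main(String[] args) {
        String html = "<a href=\"www.example.com/page.php?id=xxxx&name=yyyy\"></a>"
                + "<a href=\"www.example.com/page.php/404\"></a>";
        // Captures only hrefs that contain a query string (a '?'), so the /404 link is skipped.
        Pattern p = Pattern.compile("<a\\s+(?:[^>]*?\\s+)?href=\"([^\"]+\\?[^\"]+)\"");
        Matcher m = p.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1)); // www.example.com/page.php?id=xxxx&name=yyyy
        }
    }
}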
My final regex string:
First, use one of these:
st = @"((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+ \w\d:##%/;$()~_?\+-=\\\.&]*)";
st = @"<a href[^>]*>(.*?)</a>";
st = @"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))?)";
st = @"((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:##%/;$()~_?\+,\-=\\.&]+)";
st = @"(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)";
st = @"(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.)[\w\d:##%/;$()~_?\+-=\\\.&]*)";
st = @"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
st = @"(<a.*?>.*?</a>)";
st = @"(?:hrefs*=)(?:[s""']*)(?!#|mailto|location.|javascript|.*css|.*this.)(?.*?)(?:[s>""'])";
st = @"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
st = @"(http|https)://([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = @"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)";
st = @"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = @"http(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$";
st = @"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*";
my choice is
@"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*"
Second, use this:
st = "(.*)?(.*)=(.*)";
Problem solved. Thanks everyone :)
Try this :
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void Form1_Load(object sender, EventArgs e)
{
var res = Find(html);
}
public static List<LinkItem> Find(string file)
{
List<LinkItem> list = new List<LinkItem>();
// 1.
// Find all matches in file.
MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
RegexOptions.Singleline);
// 2.
// Loop over each match.
foreach (Match m in m1)
{
string value = m.Groups[1].Value;
LinkItem i = new LinkItem();
// 3.
// Get href attribute.
Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
RegexOptions.Singleline);
if (m2.Success)
{
i.Href = m2.Groups[1].Value;
}
// 4.
// Remove inner tags from text.
string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
RegexOptions.Singleline);
i.Text = t;
list.Add(i);
}
return list;
}
public struct LinkItem
{
public string Href;
public string Text;
public override string ToString()
{
return Href + "\n\t" + Text;
}
}
}
Input:
string html = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> 2.<a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> ";
Result:
[0] = {www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
[1] = {http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
C# Scraping HTML Links
Scraping HTML extracts important page elements. It has many legal uses
for webmasters and ASP.NET developers. With the Regex type and
WebClient, we implement screen scraping for HTML.
Edited
Another easy way: you can use a WebBrowser control to get the href from an a tag, like this (see my example):
public Form1()
{
InitializeComponent();
webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
}
private void Form1_Load(object sender, EventArgs e)
{
webBrowser1.DocumentText = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"https://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"www.aaa.xx/xx.zz/xxx\" ....></a>";
}
void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
List<string> href = new List<string>();
foreach (HtmlElement el in webBrowser1.Document.GetElementsByTagName("a"))
{
href.Add(el.GetAttribute("href"));
}
}
Try this regex:
"href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))"
You will find more help in these discussions:
Regular expression to extract URL from an HTML link
and
Regex to get the link in href. [asp.net]
Hope it's helpful.
Simply try this code:
HTMLDocument DOC = this.MySuperBrowser.Document as HTMLDocument;
public IHTMLAnchorElement imageElementHref;
imageElementHref = DOC.getElementById("idfirsticonhref") as IHTMLAnchorElement;
I came up with this one, which supports anchor and image tags, and supports single and double quotes.
<[a|img]+\\s+(?:[^>]*?\\s+)?[src|href]+=[\"']([^\"']*)['\"]
So
<a href="/something.ext">click here</a>
Will match:
Match 1: /something.ext
And
<a href='/something.ext'>click here</a>
Will match:
Match 1: /something.ext
Same goes for img src attributes
I took a much simpler approach. This one simply looks for href attributes, and captures the value (between apostrophes) trailing it into a group named url:
href=['"](?<url>.*?)['"]
I think in this case it is one of the simplest preg_match patterns:
/<a\s*(.*?id[^"]*")/g
It gets links with the variable id in the address:
it starts from the href, including it, and takes all characters/signs (the . excludes newline characters)
until the first occurrence of id, including it, and then all signs up to the nearest following " sign ([^"]*).

ROME API to parse RSS/Atom

I'm trying to parse RSS/Atom feeds with the ROME library. I am new to Java, so I am not in tune with many of its intricacies.
Does ROME automatically use its modules to handle different feeds as it comes across them, or do I have to tell it to use them? If so, any direction on this would be appreciated.
How do I get to the correct 'source'? I was trying to use item.getSource(), but it is giving me fits. I guess I am using the wrong interface. Some direction would be much appreciated.
Here is the meat of what I have for collecting my data.
I noted two areas where I am having problems, both revolving around getting the source information of the feed. And by source, I mean CNN, or Fox News, or whoever, not the author.
Judging from my reading, .getSource() is the correct method.
List<String> feedList = theFeeds.getFeeds();
List<FeedData> feedOutput = new ArrayList<FeedData>();
for (String sites : feedList ) {
URL feedUrl = new URL(sites);
SyndFeedInput input = new SyndFeedInput();
SyndFeed feed = input.build(new XmlReader(feedUrl));
List<SyndEntry> entries = feed.getEntries();
for (SyndEntry item : entries){
String title = item.getTitle();
String link = item.getUri();
Date date = item.getPublishedDate();
SyndEntry source = item.getSource(); // <-- Problem here
String description;
if (item.getDescription()== null){
description = "";
} else {
description = item.getDescription().getValue();
}
String cleanDescription = description.replaceAll("\\<.*?>","").replaceAll("\\s+", " ");
FeedData feedData = new FeedData();
feedData.setTitle(title);
feedData.setLink(link);
feedData.setSource(link); // <-- And here
feedData.setDate(date);
feedData.setDescription(cleanDescription);
String preview =createPreview(cleanDescription);
feedData.setPreview(preview);
feedOutput.add(feedData);
// lets print out my pieces.
System.out.println("Title: " + title);
System.out.println("Date: " + date);
System.out.println("Text: " + cleanDescription);
System.out.println("Preview: " + preview);
System.out.println("*****");
}
}
getSource() is definitely wrong - it returns the SyndFeed to which the entry in question belongs. Perhaps what you want is getContributors()?
As far as modules go, they should be selected automatically. You can even write your own and plug it in, as described here.
What about trying to regex the source out of the URL, without using the API?
That was my first thought. Anyway, I checked the standardized RSS format itself to get an idea of whether this option is actually available at that level, and then tried to trace its implementation upwards...
In RSS 2.0 I found the source element; however, it appears that it doesn't exist in previous versions of the spec - not good news for us!
<source> is an optional sub-element of <item>.
Its value is the name of the RSS channel that the item came from, derived from its <title>. It has one required attribute, url, which links to the XMLization of the source.
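Tying the answers above together, a hedged sketch for the inner loop of the question's code, meant to replace the two marked problem lines. item.getSource() returns a SyndFeed; whether ROME populates it (for example from the optional <source> element just quoted) depends on the feed, so the null check and the fallback to the title of the feed currently being read are my own assumptions for getting a name like CNN:
// Prefer the entry's source feed when present; otherwise fall back to the enclosing feed's title.
SyndFeed sourceFeed = item.getSource();      // often null when the entry carries no source information
String sourceName = (sourceFeed != null && sourceFeed.getTitle() != null)
        ? sourceFeed.getTitle()
        : feed.getTitle();                   // e.g. the channel title of the feed being read
feedData.setSource(sourceName);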
