Now, how can I screen-scrape such a html line (using java)?


I am trying to screen-scrape an HTML page so I can extract data from it into a text file. It was going well until I came across this in the page:
<td> <b>In inventory</b>: 0.3 kg<br /><b>Equipped</b>: -4.5 kg
The content of this line varies from page to page, so I need a way to scan the line (regardless of what it contains) for the weights (in this case 0.3 and -4.5) and store them in two separate doubles, like this:
double inventoryWeight = 0.3;
double equippedWeight = -4.5;
I would like to do this in pure Java. If that is not practical, please point me to any third-party library that I can call from my Java application (and explain clearly how to use it).
Thank you a bunch!

A regular expression is usually a good solution for scraping text. Parentheses denote "capturing groups", whose matched text is stored and can be read back with Matcher.group(). [-.\d]+ matches one or more characters that are digits (0-9), periods, or hyphens. .* matches any run of characters (excluding line terminators unless Pattern.DOTALL is set); here it simply "throws away" everything you don't care about.
import java.util.regex.*;

public class Foo {
    public static void main(String[] args) {
        String regex = ".*inventory<\\/b>: ([-.\\d]+).*Equipped<\\/b>: ([-.\\d]+).*";
        String text = "<td> <b>In inventory</b>: 0.3 kg<br /><b>Equipped</b>: -4.5 kg";
        // Look for a match
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(text);
        // Get the matched text
        if (matcher.matches()) {
            String inventoryWeight = matcher.group(1);
            String equippedWeight = matcher.group(2);
            System.out.println("Inventory weight: " + inventoryWeight);
            System.out.println("Equipped weight: " + equippedWeight);
        } else {
            System.out.println("No match!");
        }
    }
}

Do you have this piece of HTML as a String? If so, just search for <b>Equipped</b>, take the position just past the end of that label, and then build a new string by appending one character at a time until you reach a character that cannot be part of a number (not a digit, dot, or minus sign).
Once you have those numbers in String variables, you can convert them to doubles with double aDouble = Double.parseDouble(aString);
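The character-by-character scan described above could look roughly like the sketch below. The class name WeightScan and the helper weightAfter are made up for illustration, and the sketch assumes the number follows the label after at most a colon and some spaces, as in the sample line.

```java
public class WeightScan {
    // Returns the number that follows the first occurrence of label in html.
    // Hypothetical helper: assumes only ':' and spaces sit between label and number.
    public static double weightAfter(String html, String label) {
        int start = html.indexOf(label);
        if (start < 0) {
            throw new IllegalArgumentException("Label not found: " + label);
        }
        int i = start + label.length();
        // Skip the ':' and any spaces between the label and the number.
        while (i < html.length() && (html.charAt(i) == ':' || html.charAt(i) == ' ')) {
            i++;
        }
        StringBuilder number = new StringBuilder();
        // Collect characters while they can belong to a signed decimal number.
        while (i < html.length()) {
            char c = html.charAt(i);
            if (Character.isDigit(c) || c == '.' || c == '-') {
                number.append(c);
                i++;
            } else {
                break;
            }
        }
        return Double.parseDouble(number.toString());
    }

    public static void main(String[] args) {
        String html = "<td> <b>In inventory</b>: 0.3 kg<br /><b>Equipped</b>: -4.5 kg";
        double inventoryWeight = weightAfter(html, "<b>In inventory</b>");
        double equippedWeight = weightAfter(html, "<b>Equipped</b>");
        System.out.println("Inventory weight: " + inventoryWeight);
        System.out.println("Equipped weight: " + equippedWeight);
    }
}
```

Double.parseDouble throws a NumberFormatException if no number follows the label, so real pages with missing values would need a guard around that call.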

Related

Reading Polynomials from Input File (in Java)

I need to read a polynomial from a file. The problem is that if a coefficient has more than one digit or a negative sign, I don't know how to make the program recognize multiple digits as a single integer. Doing this manually would really complicate my code, since the program would need to be "aware" of its surroundings, such as where a number starts and stops. So, are there any built-in libraries that can parse this? (While researching, I found the Pattern and Matcher classes, but I'm not sure whether they can be used in this context.)
For example, I know how to code for this 2x^2 + 4x + 8 but not for something like -22x^2 - 652x + 898.
Any help is appreciated.

I think you are on the right trail using Pattern and Matcher Class. I found this post: Splitting a string using Regex in Java
I am not completely sure what you are looking to do with this, or what the file looks like exactly, but pivoting off the above post you could do something like:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Polynomial
{
    public static void main (String[] args)
    {
        final String poly = "-2123x^-232-22x^2-652x+898";
        // Does not handle a missing + or - on the first term.
        // The exponent is only captured when it follows a '^', so the sign
        // of the next term is never mistaken for an exponent.
        Pattern p = Pattern.compile("([-+]\\d+)x?(?:\\^(-?\\d+))?");
        Matcher m = p.matcher(poly);
        while (m.find())
        {
            System.out.println("Entire match:" + m.group(0));
            // Showing the option of parsing to an int
            System.out.println("Coefficient:" + Integer.parseInt(m.group(1)));
            // Possibly null, since the last few terms have no exponent or no x
            if (null != m.group(2))
            {
                System.out.println("Exponent:" + m.group(2));
            }
        }
    }
}

Filtering string between double or single quotations with varying spaces

I have these two variations of this string
name='Anything can go here'
name="Anything can go here"
where name= can have spaces like so
name=(text)
name =(text)
name = (text)
I need to extract the text between the quotes, and I'm not sure of the best approach. Should I cut the string at the quotes (do you have an example that avoids a lot of case handling?), or should I use a regex?

I'm not sure I understand the question exactly but I'll give it my best shot:
If you just want to assign the string inside the quotation marks to a variable name2, you can simply do:
String name = "'Anything can go here'";
String name2= name.replace("'","");
name2 = name2.replace("\"","");

You're wanting to get Anything can go here whether it's in between single quotes or double quotes. Regex has the capabilities of doing this regardless of the spaces before or after the "=" by using the following pattern:
"[\"'](.+)[\"']"
Breakdown:
[\"'] - Character class consisting of a double or single quote
(.+) - One or more of any character (may or may not match line terminators, depending on flags), stored in capture group 1
[\"'] - Character class consisting of a double or single quote
In short, we are trying to capture anything between single or double quotes.
Example:
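import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteExample {
    public static void main(String[] args) {
        List<String> data = Arrays.asList(
                "name='Anything can go here'",
                "name = \"Really! Anything can go here\""
        );
        for (String d : data) {
            // Capture everything between the first and the last quote on the line
            Matcher matcher = Pattern.compile("[\"'](.+)[\"']").matcher(d);
            if (matcher.find()) {
                System.out.println(matcher.group(1));
            }
        }
    }
}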
Results:
Anything can go here
Really! Anything can go here
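If the value itself may contain the other kind of quote, a stricter variant is possible. The sketch below is not part of the answer above: it assumes the key is literally name, and it uses a backreference (\1) so that the closing quote must be the same character as the opening one. The class name QuotedValue and the method extract are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedValue {
    // name, optional spaces, '=', optional spaces, an opening quote (group 1),
    // the value (group 2), then the same quote character again (\1).
    private static final Pattern NAME_VALUE =
            Pattern.compile("name\\s*=\\s*([\"'])(.*?)\\1");

    // Returns the quoted value, or null when the line does not match.
    public static String extract(String line) {
        Matcher m = NAME_VALUE.matcher(line);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("name='Anything can go here'"));
        System.out.println(extract("name = \"It's quoted\""));
    }
}
```

Because the value group is lazy, it stops at the first matching closing quote, so an apostrophe inside a double-quoted value is kept as part of the value.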

Word not preceded by a regular expression

There are plenty of questions like this, but they all deal with just a couple of preceding characters.
In a text file I have TXX and txx, and I need to find them. But the file also contains Base64-encoded pictures.
Meaning I have
"picture":"/9j/4AAQSkTXX . . .
Basically TXX, txx can appear randomly in Base64-encoded pictures.
I used the following regular expression:
(?<!"picture":")(?:(\w|\/|\+)+)(TXX|txx)
I also realized it should probably be changed into:
(?<!"picture":")(?:(\d|\w|\/|\+|\=)+)(TXX|txx)
But I get a catastrophic backtracking error, and it still doesn't work even without the non-capturing group (?:). It just skips "picture":" plus the first character and matches everything after that.
I cannot use a quantifier inside a negative look-behind, as in
(?<!"picture":".+)TXX|txx
How should I write the regular expression so that these match
"something-txx": "somerandomstring"
value not picture: "some other stringtxxsome string"
But this doesn't
"picture":"txxl5l71JGwnxMXAmJGOt8ZPwN24JNgtZpYHPBQLTViqVatk4ZoZhY+husj7Pgv3ag4NmpJ4CBlXudzydA5c+5QecmgaPz9vLrSbzRa+tNns0GjUfD+NSa5ZHo9KRf2nCWLl7360x2Kx8zA6dquNqubjoElpVRo2Dq0GOmZ8HMycktxxH08veKg84OPlCZvdDqvNxkPhOB0sn5wly+vdgx1Di82KzMxMlAoJQZkSJdGjZ0+UrlCJi/Xysc5GCPETtxxgUAgEAieNoQQLygg/P8K8VLaFCVVez+/SfMmPo74sNyxGz+/0YI8QKBQCAQCP4DPG6MeLrZcQvihFar46L6govdPE69movlMhIPh0NYaRJTtu2e+FQWyPkqDSsLqker0fKJVR0Oe5ap1RqoWD+pfuo7hefhbVJcfA8VlK42ycudJlIlMd1iMrnakePok5BPDyoUSvnhBMsEs9XMQ+PYrDQRqwd0Oj2vh/eVleXj5OMF7BSqhq2YjEa2TQ83nNDrPeHp5YWQEmXg4+vPPeLzIoR4gUAgEAcvvgETxtCiBcI/ifY2Y2aA57eWu7lJBAIBAKBQCB4eP62EC/JYWmoPBnFeieRnGKnk7e3yWTiYjN5fZPYLId5kcV67sHtcLBt+vZG4VzIu93lVe8SqUmsdzpsrDz7jse2tZrs+O/kxc7z5oGE/PtB+XOWs7tCtpB4z9NIkGf9YU3JeSmb0yV422np5AI8eaTXX"
Sample input is on :
http://pastebin.com/5XJVNqGS
(I know Pastebin is a poor choice because pastes expire, but the page hangs when I try to paste that much text here.)
And the results should be:
Result1: "some-txx": value
Result2: hereisTXX: "1235"
Result3: "GROUPDATA" : "{DATA1: sample, TXX-value:12312 ,DATA2: sample2}"

I believe you can use a rather useful Java "to-some-extent" variable-width look-behind:
(?<!"picture":"[^"]{0,10000})(?i:txx)
You can adjust the 10000 value in case you have longer Base64-encoded strings.
Tested on RegexPlanet
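A small, self-contained check of that bounded look-behind might look like the following. The class name TxxOutsidePicture and the method countMatches are made up for this sketch, and the short picture value stands in for a real Base64 string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TxxOutsidePicture {
    // txx in any letter case, unless it is preceded by "picture":" and up to
    // 10000 non-quote characters (that is, unless it sits inside the picture value).
    private static final Pattern TXX =
            Pattern.compile("(?<!\"picture\":\"[^\"]{0,10000})(?i:txx)");

    // Counts the occurrences of txx that are not inside a picture value.
    public static int countMatches(String text) {
        Matcher m = TXX.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("\"something-txx\": \"somerandomstring\""));
        System.out.println(countMatches("\"picture\":\"/9j/4AAQSkTXXabc\""));
    }
}
```

The look-behind is re-evaluated at every candidate position, so a large upper bound can become slow on very long inputs.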
In case you have very large images, use a reverse-string trick with a reversed regex (look-aheads can be of undefined variable size):
String rx = "(?i)\"[^\"]*\"\\s*:\\s*\"[^\"]*xxt[^\"]*\"(?![^\"]*\":\"erutcip\")";
Sample Java program on Ideone:
import java.util.regex.*;

class HelloWorld{
    public static void main(String []args){
        String str = "THE_HUIGE_STRING_THAT_CAUSED_Body is limited to 30000 characters;you entered 53501_ISSUE";
        str = new StringBuilder(str).reverse().toString();
        String rx = "\"?[^\"]*\"?\\s*\"?[^\"\\n\\r]*(?:xxt|XXT)[^\"\\n\\r]*(?![^\"]*\":\"erutcip\")";
        Pattern ptrn = Pattern.compile(rx);
        Matcher m = ptrn.matcher(str);
        while (m.find()) {
            System.out.println(new StringBuilder(m.group(0)).reverse().toString());
        }
        m = ptrn.matcher(new StringBuilder("\"something-txx\": \"somerandomstring\"").reverse().toString());
        while (m.find()) {
            System.out.println(new StringBuilder(m.group(0)).reverse().toString());
        }
    }
}

How to best strip out certain strings in a file?

If I have a file with the following content:
11:17 GET this is my content #2013
11:18 GET this is my content #2014
11:19 GET this is my content #2015
How can I use a Scanner and ignore certain parts of a String line = scanner.nextLine();?
The result that I like to have would be:
this is my content
this is my content
this is my content
So I'd like to strip everything from the start up to and including GET, and then take everything up to the # character.
How could this easily be done?

You can use the String.indexOf(String str) and String.indexOf(char ch) methods. For example:
String line = scanner.nextLine();
int start = line.indexOf("GET");
int end = line.indexOf('#');
String result = line.substring(start + 4, end).trim(); // start + 4 skips "GET "; trim() drops the space before '#'

One way might be
String strippedStart = scanner.nextLine().split(" ", 3)[2];
String result = strippedStart.substring(0, strippedStart.lastIndexOf("#")).trim();
This assumes there are always two space-separated tokens at the beginning (11:22 GET or 11:33 POST, for example).

You could do something like this:
String line ="11:17 GET this is my content #2013";
int startIndex = line.indexOf("GET ");
int endIndex = line.indexOf("#");
line = line.substring(startIndex+4, endIndex-1);
System.out.println(line);

In my opinion, the best solution for your problem is a Java regex. With a regex you can define which group or groups of text you want to retrieve and what kind of text surrounds them. I haven't worked with Java in a long time, so this is off the top of my head, but it should point you in the right direction.
First off, compile a pattern:
Pattern pattern = Pattern.compile("^\\d{1,2}:\\d{1,2} GET (.*?) #\\d+$", Pattern.MULTILINE);
First part of the regex says that you expect one or two digits followed by a colon followed by one or two digits again. After that comes the GET (you can use GET|POST if you expect those words or \w+? if you expect any word). Then you define the group you want with the parentheses. Lastly, you put the hash and any number of digits with at least one digit. You might consider putting flags DOTALL and CASE_INSENSITIVE, although I don't think you'll be needing them.
Then you continue with the matcher:
Matcher matcher = pattern.matcher(textToParse);
while (matcher.find())
{
//extract groups here
String group = matcher.group(1);
}
In the while loop you can use matcher.group(1) to find the text in the group you selected with the parentheses (the text you'd like extracted). matcher.group(0) gives the entire find, which is not what you're currently looking for (I guess).
Sorry for any errors in the code, it has not been tested. Hope this puts you on the right track.
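For reference, here is a compilable sketch of the same idea, with the backslashes doubled as Java string literals require. The class name LogContent and the method extract are invented for the example, and the input is the sample from the question joined with newlines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogContent {
    // time, a space, GET, a space, the content (group 1), then " #" and digits.
    // MULTILINE makes ^ and $ match at the start and end of every line.
    private static final Pattern LINE =
            Pattern.compile("^\\d{1,2}:\\d{1,2} GET (.*?) #\\d+$", Pattern.MULTILINE);

    // Returns the content part of every line that matches the pattern.
    public static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = LINE.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "11:17 GET this is my content #2013\n"
                + "11:18 GET this is my content #2014\n"
                + "11:19 GET this is my content #2015";
        for (String content : extract(text)) {
            System.out.println(content);
        }
    }
}
```

Lines that do not fit the pattern are skipped silently, which may or may not be what you want for malformed input.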

You can try this rather flexible solution:
Scanner s = new Scanner(new File("data"));
Pattern p = Pattern.compile("^(.+?)\\s+(.+?)\\s+(.*)\\s+(.+?)$");
Matcher m;
while (s.hasNextLine()) {
m = p.matcher(s.nextLine());
if (m.find()) {
System.out.println(m.group(3));
}
}
This piece of code skips the first, second, and last words of every line and prints what remains.
The advantage is that it relies on whitespace rather than on specific string literals to perform the stripping.

Regular expression, value in between quotes

I'm having a little trouble constructing a regular expression in Java.
The constraint is that I need to split a string separated by !, where each of the two parts is enclosed in double quotes.
For example:
"value"!"value"
If I performed a java split() on the string above, I want to get:
value
value
However, the catch is that value can contain any characters: punctuation, digits, spaces, and so on.
So here's a more concrete example. Input:
""he! "l0"!"wor!"d1"
Java's split() should return:
"he! "l0
wor!"d1
Any help is much appreciated. Thanks!

Try this expression: (".*")\s*!\s*(".*")
Although it would not work with split, it should work with Pattern and Matcher and return the 2 strings as groups.
String input = "\" \"he\"\"\"\"! \"l0\" ! \"wor!\"d1\"";
Pattern p = Pattern.compile("(\".*\")\\s*!\\s*(\".*\")");
Matcher m = p.matcher(input);
if(m.matches())
{
String s1 = m.group(1); //" "he""""! "l0"
String s2 = m.group(2); //"wor!"d1"
}
Edit:
This would not work for all cases. For example, "he"!"llo" ! "w" ! "orld" would produce the wrong groups, and it would be really hard to determine which ! is meant to be the separator. That's why rarely used characters are often chosen to separate parts of a string, like @ in email addresses :)

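Split the value on "!" (quote, exclamation mark, quote) instead of on ! alone:
String REGEX = "\"!\"";
String INPUT = "\"\"he! \"l0\"!\"wor!\"d1\"";
Pattern p = Pattern.compile(REGEX);
// The first item keeps the line's opening quote and the last item its closing quote
String[] items = p.split(INPUT);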

It feels like you need to parse on:
DOUBLEQUOTE = "
OTHER = anything that isn't a double quote
EXCLAMATION = !
ITEM = DOUBLEQUOTE (OTHER | (DOUBLEQUOTE OTHER DOUBLEQUOTE))* DOUBLEQUOTE
LINE = ITEM (EXCLAMATION ITEM)*
It feels like it's possible to create a regular expression for the above (assuming the double quotes in an ITEM can't be nested even further), BUT it might be better served by a very simple grammar.
This might work... excusing missing escapes and the like
^"([^"]*|"[^"]*")*"(!"([^"]*|"[^"]*")*")*$
Another option would be to match against the first part and then, if there's a ! and more, prune off the ! and keep matching (excuse the no-particular-language pseudocode, I'm just trying to illustrate the idea):
resultList = []
while(string matches \^"([^"]*|"[^"]*")*(.*)$" => match(1)) {
resultList += match
string = match(2)
if(string.beginsWith("!")) {
string = string[1:end]
} elseif(string.length > 0) {
// throw an error, since there was no exclamation and the string isn't done
}
}
if(string.length > 0) {
// throw an exception since the string isn't done
}
// resultList now holds the list of items in the string
EDIT: I realized that my answer doesn't really work. You can have a single doublequote inside the strings, as well as exclamation marks. As such, you really CAN'T have "!" inside one of the strings. As such, the idea of 1) pull quotes off the ends, 2) split on '"!"' is really the right way to go.
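The two steps from that conclusion (pull the quotes off the ends, then split on "!") can be sketched as follows. QuotedSplit and splitItems are hypothetical names, and the sketch assumes the whole line starts and ends with a double quote.

```java
import java.util.regex.Pattern;

public class QuotedSplit {
    // Strips the outer quotes of the line, then splits on the literal "!" separator.
    public static String[] splitItems(String line) {
        if (line.length() < 2 || !line.startsWith("\"") || !line.endsWith("\"")) {
            throw new IllegalArgumentException("Line must start and end with a double quote");
        }
        // Step 1: drop the opening and closing double quote of the whole line.
        String inner = line.substring(1, line.length() - 1);
        // Step 2: split on the three-character separator: quote, '!', quote.
        return inner.split(Pattern.quote("\"!\""));
    }

    public static void main(String[] args) {
        for (String item : splitItems("\"\"he! \"l0\"!\"wor!\"d1\"")) {
            System.out.println(item);
        }
    }
}
```

As the edit above notes, this only works as long as the sequence "!" never appears inside one of the values.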
