I'm working on a project where I have to screen-scrape a website and extract a string. This is part of the text:
a href = "/dashboard/index/2971"
title="Project1:Project1">Project1
I need to get the "/dashboard/index/2971" part of the whole text using regex. Currently I have this:
String wholeText;
// Read each line exactly once; readLine() returns null at end of stream
while ((wholeText = buff.readLine()) != null) {
    System.out.println(wholeText.contains("title=Project1"));
    htmlCode += wholeText + "\n";
}
This just identifies the "title=Project1" String. I need to get the "/dashboard/index/2971" part and put it in a string.
<?php
$str = 'a href = "/dashboard/index/2971" title="Project1:Project1">Projeca...';
preg_match_all('#href\s*=\s*"(.*?)"#', $str, $matches, PREG_SET_ORDER);
$foundURLs = array();
foreach ($matches as $match) {
$foundURLs[] = $match[1];
}
var_dump($foundURLs);
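Since the question is about Java, the same pattern can be used with java.util.regex. This is only a minimal sketch, assuming the href attribute sits on a single line of the scraped text; the class name HrefExtractor and the method extractHrefs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // Same idea as the PHP pattern: href, optional spaces around '=', then a quoted value
    private static final Pattern HREF = Pattern.compile("href\\s*=\\s*\"(.*?)\"");

    // Hypothetical helper: returns every quoted href value found in the given text
    static List<String> extractHrefs(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            found.add(m.group(1)); // group 1 is the text between the quotes
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "a href = \"/dashboard/index/2971\" title=\"Project1:Project1\">Project1";
        System.out.println(extractHrefs(line)); // prints [/dashboard/index/2971]
    }
}
```

In the reading loop from the question, extractHrefs could be called on each line returned by readLine(), or once on the accumulated htmlCode string.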
For given plain JSON data, apply the following formatting:
replace all the special characters in key with underscore
remove the key double quote
replace the : with =
Example:
JSON Data: {"no/me": "139.82", "gc.pp": "\u0000\u000", ...}
After formatting: no_me="139.82", gc_pp="\u0000\u000"
Is this possible with a regular expression, or with any other single command?
A single regex for all of these changes may be overkill. I think you could code something similar to this:
(NOTE: Since I do not code in Java, my example is in JavaScript, just to give you the idea.)
var json_data = '{"no/me": "139.82", "gc.pp": "0000000", "foo":"bar"}';
console.log(json_data);
var data = JSON.parse(json_data);
var out = '';
for (var x in data) {
var clean_x = x.replace(/[^a-zA-Z0-9]/g, "_");
if (out != '') out += ', ';
out += clean_x + '="' + data[x] + '"';
}
console.log(out);
Basically, you loop through the keys and clean them (replacing unwanted characters); then, from the new key and the original value, you build a new string in the format you like.
Important: bear in mind overlapping ids. For example, both no/me and no#me will collapse into the same id, no_me. This may not matter since you are not outputting JSON after all; I mention it just in case.
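If overlapping ids do matter, the collision can be detected while cleaning the keys. A rough sketch in Java, assuming the keys have already been pulled out of the JSON by some parser; the class name KeyCollisionCheck and the method cleanKeys are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class KeyCollisionCheck {
    // Maps each cleaned key to the original key that produced it;
    // throws if two different original keys clean to the same name.
    static Map<String, String> cleanKeys(List<String> keys) {
        Map<String, String> cleanedToOriginal = new LinkedHashMap<>();
        for (String key : keys) {
            String cleaned = key.replaceAll("[^a-zA-Z0-9]", "_");
            String previous = cleanedToOriginal.putIfAbsent(cleaned, key);
            if (previous != null && !previous.equals(key)) {
                throw new IllegalArgumentException(
                    "'" + previous + "' and '" + key + "' both become '" + cleaned + "'");
            }
        }
        return cleanedToOriginal;
    }

    public static void main(String[] args) {
        System.out.println(cleanKeys(List.of("no/me", "gc.pp")).keySet()); // prints [no_me, gc_pp]
        try {
            cleanKeys(List.of("no/me", "no#me")); // both clean to no_me
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```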
I haven't done Java in a long time, but I think you need something like this.
I'm assuming you mean 'all Non-Word characters' by specialchars here.
import java.util.regex.*;
// Sample data from the question, shortened
String jsonData = "{\"no/me\": \"139.82\", \"gc.pp\": \"\\u0000\\u000\"}";
// remove the surrounding { and }
jsonData = jsonData.substring(1, jsonData.length() - 1);
// find each quoted key together with the colon that follows it
Pattern regex = Pattern.compile("\"([^\"]+)\"\\s*:\\s*");
Matcher regexMatcher = regex.matcher(jsonData);
StringBuffer result = new StringBuffer();
while (regexMatcher.find()) {
    // group(1) is the key without quotes, e.g. no/me becomes no_me=
    String key = regexMatcher.group(1).replaceAll("\\W", "_") + "=";
    // Strings are immutable, so build the output instead of calling replaceAll on jsonData
    regexMatcher.appendReplacement(result, Matcher.quoteReplacement(key));
}
regexMatcher.appendTail(result);
System.out.println(result); // no_me="139.82", gc_pp="\u0000\u000"
I am crawling websites using crawler4j. I am using jsoup to extract content and save it in a text format file. Then I use omegaT to find the number of words in those text files.
The problem I am having is with text extraction. I am using the following function to extract the text from html.
public static String cleanTagPerservingLineBreaks(String html) {
String result = "";
if (html == null)
return html;
Document document = Jsoup.parse(html);
document.outputSettings(new Document.OutputSettings()
.prettyPrint(false));
document.select("br").append("\\n");
document.select("p").prepend("\\n\\n");
result = document.html().replaceAll("\\\\n", "\n");
result = result.replaceAll("&nbsp;", " ");
result = result.trim();
result = Jsoup.clean(result, "", Whitelist.none(),
new Document.OutputSettings().prettyPrint(false));
return result;
}
In the line result = document.html().replaceAll("\\\\n", "\n"); when I use document.text() instead, it gives me well-formatted text with appropriate spaces. But when I do the word count in omegaT, the unique words are not shown properly. If I keep using document.html(), I get a proper word count, but there are no spaces between some of the text (e.g. WomenNew ArrivalsTops & BlousesPants & DenimDresses & SkirtsMenView All MenNew), and tags like strong and em are not removed by Jsoup.
Is there a way to put spaces between all the tags and strip the content properly? An explanation of why the word count fluctuates would also be appreciated, if possible.
I am coding a TextFormatter that replaces special characters with HTML tags.
"_" = "< i >" and "< /i >"
"*" = "< b >" and "< /b >"
My code is as follows:
public String convertBold() {
    // An odd number of markers cannot be paired, so return the line unchanged
    if (countStrings("*") % 2 == 1)
        return line;
    String tag = "<b>";
    String result = "";
    int psn = 0;
    // Assumes findString returns -1 when the marker is not found
    while (findString("*", psn) >= 0) {
        int newPsn = findString("*", psn);
        // Copy the text before the "*" into the result
        result = result + line.substring(psn, newPsn);
        // Add the tag, then toggle between the opening and closing tag
        result = result + tag;
        if (tag.equals("<b>"))
            tag = "</b>";
        else
            tag = "<b>";
        // Move past the marker
        psn = newPsn + 1;
    }
    // Copy the rest of the string
    result = result + line.substring(psn);
    return result;
}
What I need help with is nesting: nested tags in HTML can cause errors. I don't understand how to nest tags properly, since not closing a tag before inserting a new one causes an error. I know my phrasing may be slightly confusing, but I would appreciate any help, and I'm happy to answer any questions to clear up confusion.
Thank you in advance! - Vexial
Suppose that your markup text is correct:
/**
 * @param s string to convert to HTML
 */
String convert(String s){
while(s.indexOf("_")!= -1 ||s.indexOf("*") != -1){
if(s.indexOf("_") != -1){
s = s.replaceFirst("\\_", "<i>");
s = s.substring(0, s.lastIndexOf("_"))+"</i>"+s.substring(s.lastIndexOf("_")+1);
}
if(s.indexOf("*") != -1){
s = s.replaceFirst("\\*", "<b>");
s = s.substring(0, s.lastIndexOf("*"))+"</b>"
+s.substring(s.lastIndexOf("*")+1);
}
}//end while
return s;
}
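An alternative is a single pass that toggles between the opening and closing tag each time a marker is seen, which also handles several separate pairs on one line. This is only a sketch, assuming the markers in the input are balanced and properly nested (for example *a _b_ c*, not *a _b* c_); the class name MarkupConverter is made up for illustration.

```java
class MarkupConverter {
    // Replaces each '_' or '*' with an opening or closing tag,
    // alternating independently for each marker type.
    static String convert(String s) {
        StringBuilder out = new StringBuilder();
        boolean italicOpen = false;
        boolean boldOpen = false;
        for (char c : s.toCharArray()) {
            if (c == '_') {
                out.append(italicOpen ? "</i>" : "<i>");
                italicOpen = !italicOpen;
            } else if (c == '*') {
                out.append(boldOpen ? "</b>" : "<b>");
                boldOpen = !boldOpen;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <i>one</i> and <b>two <i>three</i></b>
        System.out.println(convert("_one_ and *two _three_*"));
    }
}
```

The output is only as well nested as the input: if the markers overlap, the tags will overlap too, so a check for that would have to be added separately.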
I tried searching for something similar, and couldn't find anything. I'm having difficulty trying to replace a few characters after a specific part in a URL.
Here is the URL: https://scontent-b.xx.fbcdn.net/hphotos-xpf1/v/t1.0-9/s130x130/10390064_10152552351881633_355852593677844144_n.jpg?oh=479fa99a88adea07f6660e1c23724e42&oe=5519DE4B
I want to remove the /v/ part, leave the t1.0-9, and also remove the /s130x130/. I cannot just replace s130x130, because those values may differ from URL to URL. How do I go about doing that?
I have a previous URL where I am using this code:
if (pictureUri.indexOf("&url=") != -1)
{
String replacement = "";
String url = pictureUri.replaceAll("&", "/");
String result = url.replaceAll("().*?(/url=)",
"$1" + replacement + "$2");
String pictureUrl = null;
if (result.startsWith("/url="))
{
pictureUrl = result.replace("/url=", "");
}
}
Can I do something similar with the above URL?
With the regex
/v/|/s\d+x\d+/
replaced with
/
It turns the string from
https://scontent-b.xx.fbcdn.net/hphotos-xpf1/v/t1.0-9/s130x130/10390064_10152552351881633_355852593677844144_n.jpg?oh=479fa99a88adea07f6660e1c23724e42&oe=5519DE4B
to
https://scontent-b.xx.fbcdn.net/hphotos-xpf1/t1.0-9/10390064_10152552351881633_355852593677844144_n.jpg?oh=479fa99a88adea07f6660e1c23724e42&oe=5519DE4B
Is this what you're trying to do?
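In Java, that regex can be applied with String.replaceAll. A minimal sketch, assuming the size segment always has the form s<width>x<height>; the class name PictureUrlCleaner and the method clean are hypothetical.

```java
class PictureUrlCleaner {
    // Collapses "/v/" and "/s<digits>x<digits>/" into a single "/"
    static String clean(String pictureUri) {
        return pictureUri.replaceAll("/v/|/s\\d+x\\d+/", "/");
    }

    public static void main(String[] args) {
        String url = "https://scontent-b.xx.fbcdn.net/hphotos-xpf1/v/t1.0-9/s130x130/"
                + "10390064_10152552351881633_355852593677844144_n.jpg"
                + "?oh=479fa99a88adea07f6660e1c23724e42&oe=5519DE4B";
        System.out.println(clean(url));
    }
}
```

Note that the backslashes in \d have to be doubled inside a Java string literal.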
I want to do the following: extract certain strings from a given paragraph, like
String str= "Hello this is paragraph , Ali#yahoo.com . i am entering random email here as this one AHmar#gmail.com " ;
What I have to do is parse the whole paragraph, find the email addresses, and print their server names. I have tried a for loop with substring and indexOf, but my logic is probably not good enough to get it working. Can someone help me with this, please?
You need to use a regular expression for this.
Try the regex below:
String str= "Hello this is paragraph , Ali#yahoo.com . i am " +
"entering random email here as this one AHmar#gmail.com " ;
Pattern pattern = Pattern.compile("#(\\S+)\\.\\w+");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
OUTPUT: -
yahoo
gmail
UPDATE: -
Here's the code with substring and indexOf: -
String str= "Hello this is paragraph , Ali#yahoo.com . i am " +
"entering random email here as this one AHmar#gmail.com " ;
while (str.contains("#") && str.contains(".")) {
int index1 = str.lastIndexOf("#"); // Get last index of `#`
int index2 = str.indexOf(".", index1); // Get index of first `.` after #
// Substring from index of # to index of .
String serverName = str.substring(index1 + 1, index2);
System.out.println(serverName);
// Replace string by removing till the last #,
// so as not to consider it next time
str = str.substring(0, index1);
}
You need to use a regular expression to extract the email. Start off with this test harness code. Next, construct your regular expression and you should be able to extract the email address.
Try this:-
String text = "Hello this is paragraph , Ali#yahoo.com . i am entering random email here as this one AHmar#gmail.comm";
text = text.trim();
String[] parts = text.split("\\s+");
// The loop variable needs its own name; reusing the outer variable's name does not compile
for (String part : parts)
{
    if (part.indexOf('#') != -1)
    {
        String temp = part.substring(part.indexOf("#") + 1);
        String serverName = temp.substring(0, temp.indexOf("."));
        System.out.println(serverName);
    }
}