Would regex be a good choice to parse SMTP received lines - java

I want to parse elements of RFC822 (SMTP) "Received" lines, which are defined formally in the spec, e.g.:
atom = 1*
[...]
received = "Received" ":" ; one per relay
["from" domain] ; sending host
["by" domain] ; receiving host
["via" atom] ; physical path
*("with" atom) ; link/mail protocol
["id" msg-id] ; receiver msg id
["for" addr-spec] ; initial form
";" date-time ; time received
[...]
msg-id = "" ; Unique message id
[...]
addr-spec = local-part "#" domain ; global address
etc. for domain, date-time, etc.
Here's a real example:
Received: from ll-194.132.162.89.kv.sovam.net.ua (ll-194.132.162.89.kv.sovam.net.ua [83.170.243.194] (may be forged)) by raq2073.uk2.net (8.10.2/8.10.2) with ESMTP id lASHDDE10765 for <johnsmithsvt#matts.co.uk>; Wed, 28 Nov 2007 17:13:13 GMT
Would regex be a good strategy to capture the parts of a received line?
I realize that many SMTP servers don't format received lines properly (in real life).
Otherwise, does anyone know of a library in Java that does this well?
Edit Here's a fiddle showing a regex and tests that I've banged on for a while, which seems to work.
Received:\s+(?:from\s+(.+?))?(?:\(qmail (.+?)\))?(?:\s+by\s+(.+?))?(?:\\s+via\s+(.+?))?(?:\s+with\s+(.+?))?(?:\;?\s+id\s+(.+?))?(?:\s+for\s+(.+?))?(?:;\s*(?!.*\;.*)(.+))?$

The choice really depends on exactly what you want to achieve.
For capturing specific parts of a Receiver-line (e.g. 'give me the From-part'), regexes are awesome.
If you need a full-fledged parser for this grammar, then regexes alone will not suffice. Especially the addr-spec has so many special cases that a regex cannot hope to handle each one correctly (explanation). Regexes are not parsers.
Last time I needed an actual parser, I wrote my own using JavaCC. I would only recommend going down that road if you know a thing or two about grammars and parsing.

Related

How to remove extra Escape characters from a text column in spark dataframe

My Data in a json looks like -
{"text": "\"I have recently taken out a 12 month mobile phone contract with Virgin but despite two calls to customer help I still am getting a message on my phone indicating \\\"No Service\\\" although intermittently I do get connected.\"", "created_at": "\"2018-08-27 16:58:30\"", "service_id": "51870", "category_id": "249"}
I read this JSON Using -
val complaintsSourceRaw = spark.read.json("file:///complaints.jsonl")
When i read the data in dataframe, it looks like
|249 |"2018-08-27 16:58:30"|51870 |"I have recently taken out a 12 month mobile phone contract with Virgin but despite two calls to customer help I still am getting a message on my phone indicating **\"No Service\"** although intermittently I do get connected."
Issue is
**\"No Service\"** need to be like **"No Service"**
How i am trying -
complaintsSourceRaw.withColumn("text_cleaned", functions.regexp_replace(complaintsSourceRaw.col("text"), "\", ""));
However \ character excapes my " and code breaks. Any idea how to acieve this?
You need to escape the "\" character, so in your regexp_replace you should look for two backslash ("\\") characters, not one.

How to enable `relevance-trace` using MarkLogic Java API?

I'm implementing quite a complex search using MarkLogic Java API. I would like to enable relevance-trace (Relavance trace) to see how my results are scored. Unfortunately, I don't know how to enable it in Java API. I have tried something like:
DatabaseClient client = initClient();
var qmo = client.newServerConfigManager().newQueryOptionsManager();
var searchOptions = "<search:options xmlns=\"http://marklogic.com/appservices/search\">\n"
+ " <search-option>relevance-trace</search-option>\n"
+ " </search:options>";
qmo.writeOptions("searchOptions", new StringHandle(searchOptions).withFormat(Format.XML));
QueryManager qm = client.newQueryManager();
StructuredQueryBuilder qb = qm.newStructuredQueryBuilder("searchOptions");
// query definition
qm.search(query, new SearchHandle())
Unfortunately it ends up with following error:
"Local message: /config/query write failed: Internal Server Error. Server Message: XDMP-DOCNONSBIND:
xdmp:get-request-body(\"xml\") -- No namespace binding for prefix search at line 1 . See the
MarkLogic server error log for further detail."
My question is how to use search options in MarkLogic API, especially I'm interested in relevance-trace and simple-score
Update 1
As suggested by #Jamess Kerr I have change my options to
var searchOptions = "<options xmlns=\"http://marklogic.com/appservices/search\">\n"
+ " <search-option>relevance-trace</search-option>\n"
+ " </options>";
but unfortunately, it still doesn't work. After that change I get error:
Local message: /config/query write failed: Internal Server Error. Server Message: XDMP-UPDATEFUNCTIONFROMQUERY: xdmp:apply(function() as item()*) -- Cannot apply an update function from a query . See the MarkLogic server error log for further detail.
Your search options XML uses the search: namespace prefix but you don't define that prefix. Since you are setting the default namespace, just drop the search: prefix from the search:options open and close tags.
The original Java Query contains both syntactical and semantic issues:
First of all, it is an invalid MarkLogic XQuery in the sense that it has only query option(s) portion. Bypassing the namespace binding prefix is another wrong end of the stick.
To tweak your original query, please replace a search text in between the search:qtext tag ( the pink line ) and run the query.
Result:
Matched and Listing 2 documents:
Matched 1 locations in /medals/coin_1333113127296.xml with 94720 score:
73. …pulsating maple leaf coin another world-first, the [Royal Canadian Mint]is proud to launch a numismatic breakthrough from its ambitious and creative R&D team...
Matched 1 locations in /medals/coin_1333078361643.xml with 94720 score:
71. ...the [Royal Canadian Mint]and Royal Australian Mint have put an end to the dispute relating to the circulation coin colouring process...
Without a semantic criterion, to put it into context, your original query will be an equivalent of removing the search:qtext and performing a fuzzy search.
Note:
If you use serialised term search or search constraints instead of text search, you should get higher score results.
MarkLogic Java API operates in unfiltered mode by default, while the cts:search operates in filtered mode by default. Just be mindful of how you construct the query and the expected score in Java API.
And it is really intended for bulk data write/extract/transformation. qconsole is, in my opinion, more befitting to tune specific query and gather search score, relevance and computation details.

How do I findout domain expiry date of a .org and .in website in java

I'm using WhoisClient (org.apache.commons.net.whois.WhoisClient) to retrieve my website domain expiry date. It is working for the domain with .com extension. When I try to check the expiry date for one of my .org domain the result says No match for domain.org. How do I find out the expiry date of a .org and .in extension domain?
I'm using the following code for getting the expiry date of the domain
String domainName = mydomain.replaceFirst("^(http[s]?://www\\.|http[s]?://|www\\.)","");
WhoisClient whois = new WhoisClient();
whois.connect(WhoisClient.DEFAULT_HOST);
String whoisData1 = whois.query("=" + domainName);
whois.disconnect();
Do not bother with the whois protocol.
Now (since August 26th, 2019) per ICANN requirements, all gTLDs need to have an RDAP server. RDAP is like the successor of whois: kind of the same content exchanged but this time on top of HTTPS with some fixed JSON format. Hence, trivial to parse.
The expiry date will be in the "events" array with an action called "expiration".
You can go to https://data.iana.org/rdap/dns.json to find out the .ORG RDAP server, it is at URL https://rdap.publicinterestregistry.net/rdap/org/
You need to learn a little more about RDAP to understand how to use it (structure of the query and the reply), you can find some introduction at https://about.rdap.org/
But in short your case, this emulates what you need to do:
$ wget -qO - https://rdap.publicinterestregistry.net/rdap/org/domain/slashdot.org | jq '.events[] | select(.eventAction | contains("expiration")) | .eventDate'
"2019-10-04T04:00:00.000Z"
PS1: if you get no match from a whois query normally it really means that the domain does not exist; it could also be because of rate limiting
PS2: .IN may not have an RDAP server yet, since it is a ccTLD it is not bound by ICANN rules.

Get documents between a range of dates in IBM Notes

I'm going crazy with all possible query syntaxes used on IBM Notes (old Lotus) databases for searching documents.
I just need all documents (i.e. emails) created (or delivered, which seems to be the same) between a given range of dates, using lotus.domino.Database.search(query) method in Java package for IBM Notes. Consider that I already know the dates format in my system ("dd/MM/yyyy").
Which should be the query?
First of all: To find out about the syntax just create a view in Domino Designer or check the views that are there (e.g. in your own mail database) and check the "Selection"- formula. Then remove the "SELECT" statement in front of it and use that as query.
Your query would be quite simple:
Form = "Memo" : "Reply" & #Date(#Created) >= [2018/01/01] & #Date(#Created) <= [2018/05/04]
if you are not sure, which date format your server uses, then just use this query instead:
Form = "Memo" : "Reply" &
#Date(#Created) >= #Date( 2018 ; 1 ; 1 ) &
#Date(#Created) <= #Date( 2018 ; 5 ; 4 )
This is the right formula for all mail- types. If you need alle calendar- type- documents, then use Form = "Appointment" : "Notice".
As a rule of thumb: Just go to the items- tab in the properties of any document you want to return and examine all items in the left hand site. Then simply use the item name in your formula as variable (except Body: That one would need special treatment).

Extract the results of a GWT service

Messy, complicated question, but here goes. I'm working on an integration project with Google Checkout, and there is a Google Checkout GWT service that returns the currency conversion rates used by the Checkout web interface to convert USD into local currencies. This endpoint is hosted at https://market.android.com/publish/gwt/, and staring at Firebug I see this going to the server:
7|0|6|https://market.android.com/publish/gwt/|FCCA4108CB89BFC2FEC78BA7363D4AF6|com.google.wireless.android.vending.developer.
shared.MerchantService|getCurrencyExchangeRates|com.google.common.money.CurrencyCode/112449834|java.util.ArrayList/4159755760
|1|2|3|4|2|5|6|5|235|6|13|5|18|5|81|5|53|5|72|5|102|5|121|5|177|5|175|5|205|5|204|5|55|5|86|-1|
and this being returned
//OK[235,3,'D0JA',2,86,3,'CXXg',2,55,3,'DW2A',2,204,3,'X9NA',2,205,3,'EuvA',2,175,3,'VIig',2,177,3,'E2Dw',2,121,3,'E4ziA',2,1
02,3,'do$Q',2,72,3,'T82w',2,53,3,'Ds0Q',2,81,3,'Cq5g',2,18,3,'Dlfg',2,13,1,["com.google.common.collect.RegularImmutableList/4
40499227","com.google.common.money.SimpleMoney/627983206","com.google.common.money.CurrencyCode/112449834"],0,7]
Forgive the odd formatting: can't quite get the code block to format right.
Wandering the web for hours on end I was able to determine that the RegularImmutableList class is in the Guava libraries (at http://code.google.com/p/guava-libraries/). What I'm looking for is:
I can't find the com.google.common.money.SimpleMoney or com.google.common.money.CurrencyCode classes anywhere: anyone seen them?
The GWT wire format appears to be an odd JSON string. I see various references to Google Groups messages talking about descriptions of the wire format, but can't find the underlying messages or any coherent reference that would let me reverse this: anyone have a handle on a handy reference? If I can at least understand WHAT the encoding is I might be able to get away without the class files from question 1 above.
I started wandering through the Android Market api library at http://code.google.com/p/android-market-api/, figuring they have to have done SOME of the Android Market communication integration, and they appear to have done so using protobufs. Is there any decent reference for the GWT/protobufs communication bits?
The underlying reason for this craziness is that I need to be able to take regular exchange rate values from Google Checkout so when I'm importing sales transactions in foreign currencies I can do the conversion at the prevailing rate at the time of the transaction. The current Checkout reporting formats do NOT provide this, so most folks end up using alternative sources of exchange rates that don't match what Google uses. It is clearly a shortcoming on the part of Google Checkout's integration interface, but if we got started on shortcomings of Google Checkout's interface we'd be here all week. My intention is to poll the Checkout interface for newly fulfilled orders and then request the appropriate exchange rate table so I can figure out in near real-time what the incoming payments are. I've got the polling bit down pat but can't quite get past the exchange rate bit.
While trying to create a script to bulk upload in-app products for my application (CSV upload constantly failed with obscure error messages), I have managed to understand the GWT AJAX protocol.
It's actually pretty simple, except it requires you to know structure of all used classes. Or guess it, as is the case with internal classes used by Google. :)
I'll use examples from the question to explain the protocol in detail.
Request format
7|0|6|https://market.android.com/publish/gwt/|FCCA4108CB89BFC2FEC78BA7363D4AF6|com.google.wireless.android.vending.developer.shared.MerchantService|getCurrencyExchangeRates|com.google.common.money.CurrencyCode/112449834|java.util.ArrayList/4159755760|1|2|3|4|2|5|6|5|235|6|13|5|18|5|81|5|53|5|72|5|102|5|121|5|177|5|175|5|205|5|204|5|55|5|86|-1|
The request is pipe-delimited list of tokens with the following meaning:
7 - protocol version
0 - flags. 1 is FLAG_ELIDE_TYPE_NAMES, 2 is FLAG_RPC_TOKEN_INCLUDED
6 - string token count
6 string tokens:
https://market.android.com/publish/gwt/
FCCA4108CB89BFC2FEC78BA7363D4AF6
com.google.wireless.android.vending.developer.shared.MerchantService
getCurrencyExchangeRates
com.google.common.money.CurrencyCode/112449834
java.util.ArrayList/4159755760
The actual encoded request, which references strings from the list above using 1-based indices:
1 - https://market.android.com/publish/gwt/ - base URL
2 - FCCA4108CB89BFC2FEC78BA7363D4AF6 - some hash, which is references as serializationPolicyStrongName in GWT sources.
3 - com.google.wireless.android.vending.developer.shared.MerchantService - service name
4 - getCurrencyExchangeRates - method name
2 - parameter count. Parameter types follow:
5 - com.google.common.money.CurrencyCode/112449834
6 - java.util.ArrayList/4159755760
Serialized parameters. Each object is represented either by its classname and list of serialized fields or by negative integer back-reference to previously encountered object. In our case we have two objects:
5 - com.google.common.money.CurrencyCode/112449834, which only has one integer field: 235
6 - java.util.ArrayList/4159755760, which has one integer length field 13, followed by 13 serialized list items. Note that 12 of them are CurrencyCode objects serialized just as the above one, and the last one is a backreference (-1) to the very first object we encountered while (de-)serializing this request, i.e. CurrencyCode(235)
Response format
//OK[235,3,'D0JA',2,86,3,'CXXg',2,55,3,'DW2A',2,204,3,'X9NA',2,205,3,'EuvA',2,175,3,'VIig',2,177,3,'E2Dw',2,121,3,'E4ziA',2,102,3,'do$Q',2,72,3,'T82w',2,53,3,'Ds0Q',2,81,3,'Cq5g',2,18,3,'Dlfg',2,13,1,["com.google.common.collect.RegularImmutableList/440499227","com.google.common.money.SimpleMoney/627983206","com.google.common.money.CurrencyCode/112449834"],0,7]
The response is very similar in format to the request except it's JS-formatted array (though not JSON, as it uses invalid single quotes), and it's in reverse order.
The field meaning is as follows:
7 - protocol version
0 - flags, same as for request
Array of string tokens:
com.google.common.collect.RegularImmutableList/440499227
com.google.common.money.SimpleMoney/627983206
com.google.common.money.CurrencyCode/112449834
And then goes one serialized object of type 1 - com.google.common.collect.RegularImmutableList/440499227 with one integer length field 13, followed by 13 serialized objects of class 2 - com.google.common.money.SimpleMoney/627983206. Each SimpleMoney object has two fields, for example:
'Dlfg' - long integer field encoded as base64 number. This particular one is 940000
3, 18 - CurrencyCode object with integer field 18
What you are looking at is GWT-RPC serialization format. Unfortunatelly it is not publicly documented. Fortunatelly GWT is open-source so you could look at the source to see how it is produced.
Note: This format might change between GWT versions (I known it did in 2.2). This is most likelly also a reason why Google does not document it - if they did they'd need to keep it backward compatible.
Class names that you see are Java classes that Google Checkout uses internally. When GWT is compiled to JS the names get mangled so you don't see them any more.
As noted this is GWT-RPC.
What you are trying to do is reverse-engineer Google internal APIs. I wouldn't do that because, a. It might change without notice, breaking your app and, b. I'm sure Goog wouldn't like it and it probably violates the service agreement (have you read it?).
I have some code made in VB that may be useful for you to realize how to parse GWT Serialized strings. "Datos" contains the string you received.
aAux = Split(Datos, ",[")
aAux(1) = Replace(aAux(1), "],0,7]", "")
aAux(0) = Replace(aAux(0), "//OK[", "")
aAux(0) = Replace(aAux(0), "'", "")
aDescripcion = Split(aAux(1), """,""")
aValor = Split(aAux(0), ",")
InvertirArray aValor
For X = 0 To UBound(aValor)
If Not IsNumeric(aValor(X)) Then
Exit For
End If
If adescripcion(Int(aValor(X))-1) = "gov.senasa.embalajemadera.shared.domain.Pais/3238585366" Then
For Y = X + 1 To UBound(aValor)
If Int(aValor(Y)) = "" Then '- Do what you want
end if
If adescripcion(Int(aValor(Y))) = "java.lang.Integer/3438268394" Then
'- Do what you want
Next Y
End If
Next X
Of course you have to adapt it to your needs and you will have to play a little bit with the arrays...
InvertirArray:
Public Sub InvertirArray(ByRef Arr() As String)
'- el array va tiene que empezar en 0
Dim X As Long
Dim Hasta As Long
Dim Tmp As String
If UBound(Arr) Mod 2 = 0 Then
'- Es impar
Hasta = UBound(Arr) + 1
Else
Hasta = UBound(Arr)
End If
For X = LBound(Arr) To UBound(Arr) \ 2
Tmp = Arr(X)
Arr(X) = Arr(UBound(Arr) - X)
Arr(UBound(Arr) - X) = Tmp
Next X
end sub
And of course you need to decode and encode Long Numbers and dates. So:
Public Function EncodeDateGwt(Numero As Double, Optional isDate As Boolean = False) As String
Dim s As String
Dim a As Double
Dim i As Integer
Dim u As Integer
Dim Base As String
Numero = IIf(isDate, Numero * 1000, Numero)
Base = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_$"
Do While Val(Numero) <> 0
a = Numero
i = 0
Do While a >= 64
i = i + 1
a = a / 64
Loop
If i <> u - 1 And u <> 0 Then EncodeDateGwt = EncodeDateGwt & String(u - i - 1, Left(Base, 1))
a = Int(a)
EncodeDateGwt = EncodeDateGwt + Mid(Base, a + 1, 1)
Numero = Numero - a * (64 ^ i)
u = i
Loop
EncodeDateGwt = EncodeDateGwt & String(i, Left(Base, 1))
End Function
Public Function DecodeDateGwt(Texto As String, Optional isDate As Boolean = False) As Long
Dim Base As String
Dim a As Integer
Base = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_$"
For a = 1 To Len(Texto)
DecodeDateGwt = DecodeDateGwt + (InStr(Base, Mid(Texto, a, 1)) - 1) * (Len(Base) ^ ((Len(Texto) - (a))))
Next
DecodeDateGwt = IIf(isDate, DecodeDateGwt / 1000, DecodeDateGwt)
'devuelve timestamp
End Function
If what you need to encode/decode is a date, then you need to do this before:
Call encodegwtdate(date2unix("20/02/2016"),true)
Public Function Date2Unix(ByVal vDate As Date) As Long
Date2Unix = DateDiff("s", Unix1970, vDate)
End Function
Public Function Unix2Date(vUnixDate As Long) As Date
Unix2Date = DateAdd("s", vUnixDate, Unix1970)
End Function
Hope you solve it. By the way, does anyone knows what negative numbers means?????

Categories

Resources