How to parse XML element nodes using Pig script? - java

I am using Pig Latin for a large XML dump. I am trying to get the values of XML nodes like location and temp_c in Pig Latin. The file looks like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet href="latest_ob.xsl" type="text/xsl"?>
<current_observation version="1.0"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="http://www.weather.gov/view/current_observation.xsd">
<credit>NOAA's National Weather Service</credit>
<credit_URL>http://weather.gov/</credit_URL>
<image>
<url>http://weather.gov/images/xml_logo.gif</url>
<title>NOAA's National Weather Service</title>
<link>http://weather.gov</link>
</image>
<suggested_pickup>15 minutes after the hour</suggested_pickup>
<suggested_pickup_period>60</suggested_pickup_period>
<location>Unknown Station</location>
<station_id>51WH0</station_id>
<observation_time>Last Updated on Dec 23 2014, 11:00 pm LST</observation_time>
<observation_time_rfc822>Tue, 23 Dec 2014 23:00:00 +1000</observation_time_rfc822>
<temperature_string>71.4 F (21.9 C)</temperature_string>
<temp_f>71.4</temp_f>
<temp_c>21.9</temp_c>
<water_temp_f>75.9</water_temp_f>
<water_temp_c>24.4</water_temp_c>
<wind_string>North at 24.6 MPH (21.38 KT)</wind_string>
<wind_dir>North</wind_dir>
<wind_degrees>20</wind_degrees>
<wind_mph>24.6</wind_mph>
<wind_gust_mph>0.0</wind_gust_mph>
<wind_kt>21.38</wind_kt>
<pressure_string>1015.0 mb</pressure_string>
<pressure_mb>1015.0</pressure_mb>
<dewpoint_string>58.1 F (14.5 C)</dewpoint_string>
<dewpoint_f>58.1</dewpoint_f>
<dewpoint_c>14.5</dewpoint_c>
</current_observation>

Maybe it will help you; try this out.
REGISTER piggybank.jar
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
A = LOAD 'xmls/your_file.xml' using org.apache.pig.piggybank.storage.XMLLoader('current_observation') as (x:chararray);
B = FOREACH A GENERATE XPath(x, 'current_observation/location'), XPath(x, 'current_observation/temp_c');
dump B;

Use this:
data = LOAD '/path/your_file.xml'
USING org.apache.pig.piggybank.storage.StreamingXMLLoader(
'current_observation',
'credit, credit_URL, image, suggested_pickup, suggested_pickup_period, location, station_id, observation_time,temp_f, temp_c, water_temp_f, water_temp_c, wind_string, wind_dir, wind_degrees, wind_mph, wind_gust_mph, wind_kt, pressure_string, pressure_mb, dewpoint_string, dewpoint_f, dewpoint_c'
) AS (
credit: {(attr:map[], content:chararray)},
credit_URL: {(attr:map[], content:chararray)},
...
);
dump data;
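If you want to sanity-check the two XPath expressions outside of Pig, the same values can be pulled out with Java's built-in javax.xml.xpath API. A minimal sketch, assuming the observation file is available locally as your_file.xml (not part of the original question):
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
public class WeatherXPathCheck {
    public static void main(String[] args) throws Exception {
        // Parse the observation file (path is assumed for illustration).
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("your_file.xml"));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Same expressions as in the Pig XPath() calls above.
        String location = xpath.evaluate("current_observation/location", doc);
        String tempC = xpath.evaluate("current_observation/temp_c", doc);
        System.out.println(location + " / " + tempC);
    }
}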

Related

VTD-XML reading gives no results

I am trying to read RSS content using VTD-XML. Below is a sample of the RSS.
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<?xml-stylesheet type="text/xsl" href="rss.xsl"?>
<channel>
<title>MyRSS</title>
<atom:link href="http://www.example.com/rss.php" rel="self" type="application/rss+xml" />
<link>http://www.example.com/rss.php</link>
<description>MyRSS</description>
<language>en-us</language>
<pubDate>Tue, 22 May 2018 13:15:15 +0530</pubDate>
<item>
<title>Title 1</title>
<pubDate>Tue, 22 May 2018 13:14:40 +0530</pubDate>
<link>http://www.example.com/news.php?nid=47610</link>
<guid>http://www.example.com/news.php?nid=47610</guid>
<description>bla bla bla</description>
</item>
</channel>
</rss>
Anyway, as you know, some RSS feeds can contain more styling info, etc. However, in every RSS feed the <channel> and <item> elements will be common, at least for the ones I need to use.
I tried VTD-XML to read this as quickly as possible. Below is the code:
VTDGen vg = new VTDGen();
if (vg.parseHttpUrl(appDataBean.getUrl(), true)) {
VTDNav vn = vg.getNav();
AutoPilot ap = new AutoPilot(vn);
ap.selectXPath("/channel/item");
int result = -1;
while ((result = ap.evalXPath()) != -1) {
if (vn.matchElement("item")) {
do {
//do something with the contents in the item node
Log.d("VTD", vn.toString(vn.getText()));
} while (vn.toElement(VTDNav.NEXT_SIBLING));
}
}
}
Unfortunately this did not print anything. What am I doing wrong here? Also, none of the RSS feeds are very big, so I need to read them in a couple of milliseconds. This code is on Android.
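For what it's worth, absolute XPaths in AutoPilot are evaluated from the document root, and the root element of this feed is <rss>, not <channel>, so "/channel/item" selects nothing. A hedged sketch of the same loop (same classes and variables as above) with the path rooted at /rss/channel/item and explicit child navigation; exception handling is omitted:
VTDGen vg = new VTDGen();
if (vg.parseHttpUrl(appDataBean.getUrl(), true)) {
    VTDNav vn = vg.getNav();
    AutoPilot ap = new AutoPilot(vn);
    ap.selectXPath("/rss/channel/item"); // absolute paths start at the root element <rss>
    while (ap.evalXPath() != -1) {
        // the cursor now sits on an <item>; step into its <title> child
        if (vn.toElement(VTDNav.FIRST_CHILD, "title")) {
            int t = vn.getText();
            if (t != -1) {
                Log.d("VTD", vn.toString(t));
            }
            vn.toElement(VTDNav.PARENT); // back to <item> before the next XPath match
        }
    }
}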

How to import text file data to database using spring and hibernate

Here is my Tesseract-OCR-converted .txt file:
ATITENDANCE SHEET
*Department: #ELECTRONICS *Date: #19/08/ 2017
*Year: #FIRST *Division: #A
*Subject Code: #TM404 *Teacher Code: #10447
#001 #002 #003 #004
#005- #006 #007 #008-
#009 #010 #011- #012
#013 Ieflll- #015 #016
«#01-7- #018 #019 mae-
#021 #022 #023 #024
#025 -#-0%6- #027 #028
#029 #030 I903! #032
#033- #034 #035 #036-
#037 #036- #039 #040
#041 #042 #043 #044
#045 #046 #047 .6048-
'#'0|19' #050 #051- #052
#053 #054 #055 #056
-#O§i- #058 #059 #060
#061 I096? #063 m
IGOGE #066 #067 #068
#069 #070 #071 #072
#073 #074 i375- #076
#077 #078 #079 #080
EE—Tahoma-20~B—44
So, how can I store this into the database in the following form:
Department:ELECTRONICS Date: 19/08/ 2017
Year: FIRST Division: A
Subject Code: TM404 Teacher Code: 10447
Roll no should get auto-incremented.
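One way to get from the raw OCR text to something you can persist is to first pull the *Label: #Value pairs out of the header lines with a regex. A minimal sketch of that header parsing only (class name and regex are assumptions, not part of the question):
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class AttendanceHeaderParser {
    // Matches pairs such as "*Department: #ELECTRONICS" in one OCR'd line.
    private static final Pattern FIELD =
            Pattern.compile("\\*\\s*([A-Za-z ]+?)\\s*:\\s*#\\s*([^*]+)");
    public static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            fields.put(m.group(1).trim(), m.group(2).trim());
        }
        return fields;
    }
}
The resulting map could then be copied onto a JPA entity (with an auto-generated roll-number column) and saved through a Spring Data repository.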

SoapUI: response with CDATA is giving null. Searched various articles but no luck

Hello, I have the below SOAP response.
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetWeatherResponse xmlns="http://www.webserviceX.NET">
<GetWeatherResult><![CDATA[<?xml version="1.0" encoding="utf-16"?>
<CurrentWeather>
<Location>Cape Town, Cape Town International Airport, South Africa (FACT) 33-59S 018-36E 0M</Location>
<Time>Jun 04, 2016 - 05:00 AM EDT / 2016.06.04 0900 UTC</Time>
<Wind> from the SE (130 degrees) at 21 MPH (18 KT):0</Wind>
<Visibility> greater than 7 mile(s):0</Visibility>
<SkyConditions> mostly clear</SkyConditions>
<Temperature> 60 F (16 C)</Temperature>
<DewPoint> 44 F (7 C)</DewPoint>
<RelativeHumidity> 55%</RelativeHumidity>
<Pressure> 30.39 in. Hg (1029 hPa)</Pressure>
<Status>Success</Status>
</CurrentWeather>]]></GetWeatherResult>
</GetWeatherResponse>
</soap:Body>
</soap:Envelope>
for the following request: http://www.webservicex.net/globalweather.asmx.
I want to read the XML data in the above response using a Groovy script, but it is giving null.
I have tried the script below:
def groovyUtils = new com.eviware.soapui.support.GroovyUtils(context)
def holder = groovyUtils.getXmlHolder(messageExchange.responseContent)
holder.namespaces["ns"] = "http://www.webserviceX.NET/" def
weatherinfo= holder.getNodeValue("//ns:GetWeatherResult/text()")
log.info weatherinfo
But instead of getting the above response I am getting NULL. I have read the SoapUI documentation on CDATA but it's not working.
I got the answer: it was just one extra slash in the namespace that was giving me null.
Both of the scripts below are working now.
def respXmlHolder = new com.eviware.soapui.support.XmlHolder(messageExchange.getResponseContentAsXml())
respXmlHolder.namespaces["ns1"] = "http://www.webserviceX.NET"
def CDATAXml = respXmlHolder.getNodeValue("//ns1:GetWeatherResult/text()")
log.info CDATAXml
def groovyUtils = new com.eviware.soapui.support.GroovyUtils(context)
def holder = groovyUtils.getXmlHolder(messageExchange.responseContent)
holder.namespaces["ns"] = "http://www.webserviceX.NET"
def weatherinfo= holder.getNodeValue("//ns:GetWeatherResult/text()")
log.info weatherinfo
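Note that GetWeatherResult only hands you the CDATA payload as a string; that string is itself a complete XML document and has to be parsed a second time before fields such as Location or Temperature can be read. Groovy can call the standard Java XML APIs directly, so a minimal sketch of that second parse (class and method names are assumptions) looks like this:
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
public class CdataWeatherParser {
    public static String temperature(String cdataPayload) throws Exception {
        // cdataPayload is the text returned for //ns:GetWeatherResult/text()
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(cdataPayload)));
        return XPathFactory.newInstance().newXPath()
                .evaluate("/CurrentWeather/Temperature", doc);
    }
}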

OutputStream.write is too slow

I am encountering a scenario like this:
My project has a servlet that catches a request from Perl. The request is to download a file, and it is a multipart request.
@RequestMapping(value = "/*", method = RequestMethod.POST)
public void tdRequest(@RequestHeader("Authorization") String authenticate,
HttpServletResponse response,
HttpServletRequest request) throws Exception
{
if (ServletFileUpload.isMultipartContent(request))
{
ServletFileUpload sfu = new ServletFileUpload();
FileItemIterator items = sfu.getItemIterator(request);
while (items.hasNext())
{
FileItemStream item = items.next();
if (("action").equals(item.getFieldName()))
{
InputStream stream = item.openStream();
String value = Streams.asString(stream);
if (("upload").equals(value))
{
uploadRequest(items, response);
return;
}
else if (("download").equals(value))
{
downloadRequest(items, response);
return;
}
The problem is not here; it appears in the downloadRequest() function.
void downloadRequest(FileItemIterator items,
HttpServletResponse response) throws Exception
{
log.info("Start downloadRequest.......");
OutputStream os = response.getOutputStream();
File file = new File("D:\\clip.mp4");
FileInputStream fileIn = new FileInputStream(file);
//while ((datablock = dataOutputStreamServiceImpl.readBlock()) != null)
byte[] outputByte = new byte[ONE_MEGABYE];
while (fileIn.read(outputByte) != -1)
{
System.out.println("--------" + (i = i + 1) + "--------");
System.out.println(new Date());
//dataContent = datablock.getContent();
System.out.println("Start write " + new Date());
os.write(outputByte, 0,outputByte.length);
System.out.println("End write " + new Date());
//System.out.println("----------------------");
}
os.close();
}
}
I try to read and write blocks of 1 MB from the file. However, it takes too long to download the whole file (in my case, 20 minutes for a file of 100 MB).
I added sysout logging and saw results like this.
The first few blocks read and write data really fast:
--------1--------
Mon Dec 07 16:24:20 ICT 2015
Start write Mon Dec 07 16:24:20 ICT 2015
End write Mon Dec 07 16:24:21 ICT 2015
--------2--------
Mon Dec 07 16:24:21 ICT 2015
Start write Mon Dec 07 16:24:21 ICT 2015
End write Mon Dec 07 16:24:21 ICT 2015
--------3--------
Mon Dec 07 16:24:21 ICT 2015
Start write Mon Dec 07 16:24:21 ICT 2015
End write Mon Dec 07 16:24:21 ICT 2015
But each subsequent block is slower than the previous one:
--------72--------
Mon Dec 07 16:29:22 ICT 2015
Start write Mon Dec 07 16:29:22 ICT 2015
End write Mon Dec 07 16:29:29 ICT 2015
--------73--------
Mon Dec 07 16:29:29 ICT 2015
Start write Mon Dec 07 16:29:29 ICT 2015
End write Mon Dec 07 16:29:37 ICT 2015
--------124--------
Mon Dec 07 16:38:22 ICT 2015
Start write Mon Dec 07 16:38:22 ICT 2015
End write Mon Dec 07 16:38:35 ICT 2015
--------125--------
Mon Dec 07 16:38:35 ICT 2015
Start write Mon Dec 07 16:38:35 ICT 2015
End write Mon Dec 07 16:38:48 ICT 2015
The problem is in os.write().
I really cannot understand how the OutputStream write works: why does it take such a long time, or have I made some mistake?
Sorry for my bad English. I really need your support. Thanks in advance!
This is the Perl code from the client side:
# ----- get connected to download the file
#
$Response = $ua->request(POST $remoteHost ,
Content_Type => 'form-data',
Authorization => $Authorization,
'Proxy-Authorization' => $Proxy_Authorization ,
Content => [ DOS => 1 ,
action => 'download' ,
first_run => 0 ,
dl_filename => $dl_filename ,
delivery_dir => $delivery_dir ,
verbose => $Verbose ,
debug => $debug ,
version => $VERSION
]
);
unless ($Response->is_success) {
my $Msg = $Response->error_as_HTML;
# Remove HTML tags - we're in a DOS shell!
$Msg =~ s/<[^>]+>//g;
print "ERROR! SERVER RESPONSE:\n$Msg\n";
print "$remoteHost\n\n" if $Options{'v'};
Error "Could not connect to " . $remoteHost ;
}
my $Result2 = $Response->content();
Error "Abnormal termination...\n$Result2" if $Result2 =~ /_APP_ERROR_/;
open(F, ">$dl_filename") or Error "Could not open '$dl_filename'!";
binmode F; # unless $dl_filename =~ /\.txt$|\.htm$/;
print F $Result2;
close F;
print "received.\n";
}
One problem is that fileIn.read(outputByte) can read fewer bytes than the full outputByte buffer. You read a few KB, but then you write the entire 1 MB buffer, so the client receives padding it never asked for and the downloaded file quickly grows far larger than the original. Try this, and notice the readed variable.
void downloadRequest(FileItemIterator items,
HttpServletResponse response) throws Exception
{
log.info("Start downloadRequest.......");
OutputStream os = response.getOutputStream();
File file = new File("D:\\clip.mp4");
FileInputStream fileIn = new FileInputStream(file);
//while ((datablock = dataOutputStreamServiceImpl.readBlock()) != null)
byte[] outputByte = new byte[ONE_MEGABYE];
int readed =0;
while ((readed =fileIn.read(outputByte)) != -1)
{
System.out.println("--------" + (i = i + 1) + "--------");
System.out.println(new Date());
//dataContent = datablock.getContent();
System.out.println("Start write " + new Date());
os.write(outputByte, 0,readed );
System.out.println("End write " + new Date());
//System.out.println("----------------------");
}
os.close();
}
}
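As an aside, on Java 9 or later the manual buffer loop can be replaced with InputStream.transferTo, which handles the partial-read bookkeeping internally. A sketch under that assumption (the method name is made up; same file and response as above):
void downloadViaTransferTo(HttpServletResponse response) throws IOException {
    try (FileInputStream fileIn = new FileInputStream(new File("D:\\clip.mp4"))) {
        // transferTo copies the stream in chunks and only writes the bytes actually read
        fileIn.transferTo(response.getOutputStream());
    }
}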
It looks like your download performance gets slower and slower the further you get into the download. You start out at one second or less per block; by block 72 it is 7+ seconds per block, and by block 124 it is 13 seconds per block.
There is nothing on the server side to explain this. Rather, it has the "smell" of the client side doing something wrong. My guess is that the client side is reading the data from the socket into an in-memory data structure, and that data structure (maybe just a String or StringBuffer or StringBuilder) is getting larger and larger. Either the time taken to expand it is getting larger, or your memory footprint is growing and the GC is taking longer and longer. (Or both.)
If you showed us the client-side code .....
UPDATE
As I suspected, this line of code will be reading the entire content into the Perl equivalent of a string builder before turning it into a string.
my $Result2 = $Response->content();
Depending on how it is implemented under the hood, this will lead to repeated copying of the data as the builder runs out of buffer space and needs to be expanded. Depending on the buffer expansion strategy that Perl employs for this, it could give O(N^2) behavior, where N is the size of the file you are transferring. (The evidence is that you are not getting O(N) behavior ...)
If you want faster downloads, you need to stream the data on the client side: read the response content in chunks and write them to the output file. (I'm not a Perl expert, so I can't offer you code.) This will also reduce the memory footprint on the client side, which could be important if your file sizes increase.

Java UDF Date Regex Extractor for Pig?

I am trying to create a UDF for importing into Pig that matches a Regex pattern on a date. The Regex has been tested and works accordingly, but I am having trouble with the following code:
package com.date.format;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class DATERANGE extends EvalFunc<String> {
@Override
public String exec(Tuple arg0) throws IOException {
try
{
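// Matches e.g. "Oct 15 09:26:09": days Oct 15-23, times 09:00:00-10:59:59 or exactly 11:00:00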
String pattern = "(Oct\\W(?:1[5-9]|2[0-3])\\W(?:(?:0?9|10):\\d{2}:\\d{2}|11:00:00))";
Pattern pat = Pattern.compile(pattern);
Matcher match = pat.matcher((String) arg0.get(0));
if(match.find())
{
return match.group(0);
}
else return "none";
}
catch(Exception e)
{
throw new IOException("Caught exception processing input row ", e);
}
}
}
After compiling the above Java code, exporting it as a jar, and running it inside Hadoop with the following Pig script:
register 'DATEFormat.jar';
ld = LOAD 'dates/date_data_three' AS (date:chararray);
loop = foreach ld generate com.date.format.DATERANGE(date) as d:chararray;
dump loop;
I get the following error:
ERROR 2078: Caught error from UDF: com.date.format.DATERANGE [Caught exception
processing input row ]
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator
for alias loop
at org.apache.pig.PigServer.openIterator(PigServer.java:912)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:752)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.loadScript(GruntParser.java:566)
at org.apache.pig.tools.grunt.GruntParser.processScript(GruntParser.java:513)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.Script(PigScriptParser.java:1014)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:550)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:228)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:203)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:542)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias loop
at org.apache.pig.PigServer.storeEx(PigServer.java:1015)
at org.apache.pig.PigServer.store(PigServer.java:974)
at org.apache.pig.PigServer.openIterator(PigServer.java:887)
... 16 more
The data file contains dates as shown below:
Wed Oct 15 09:26:09 BST 2014
Wed Oct 15 19:26:09 BST 2014
Wed Oct 18 08:26:09 BST 2014
Wed Oct 23 10:26:09 BST 2014
Sun Oct 05 09:26:09 BST 2014
Wed Nov 20 19:26:09 BST 2014
Does anybody know the correct way to implement a Java UDF for Pig that would work with the Regex I have provided?
Thanks
I recommend you use the built-in REGEX_EXTRACT function; this is much easier than writing a UDF.
ld = LOAD 'input.txt' AS (date:chararray);
loop = foreach ld generate REGEX_EXTRACT(date,'(Oct\\W(?:1[5-9]|2[0-3])\\W(?:(?:0?9|10):\\d{2}:\\d{2}|11:00:00))',1) as d:chararray;
C = FILTER loop by d is not null;
D = FOREACH C GENERATE $0;
DUMP D;
Output:
(Oct 15 09:26:09)
(Oct 23 10:26:09)
Your regex UDF is also working fine for me. I just copied your input and Java code and executed it locally, and it works perfectly. Please see the output below that I got from your UDF code. I guess you may need to check whether your classpath is set properly.
(Oct 15 09:26:09)
(none)
(none)
(Oct 23 10:26:09)
(none)
(none)
Even better, you could use ToDate.
Load your data into filtered_raw_financings_csvs with close_date as a chararray:
financings_csvs = FOREACH filtered_raw_financings_csvs
GENERATE name,
city,
state,
(close_date==''?NULL:ToDate(close_date, 'dd-MMM-yy')) AS close_date
;
Build your date format string as described here:
http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
This snippet is shown in context here:
http://nathan.vertile.com/blog/2015/04/17/handling-dates-in-hadoop-pig/
