CSV Files in Python (Sliding Window) - java

I am a beginner to Python, I need a help with manipulating csv files in Python.
I am trying to do sliding window mechanism for each row in dataset.
for an example if the dataset is this
timestamp | temperature | windspeed
965068200 9.61883 60.262
965069100 9.47203 60.1664
965070000 9.31125 60.0145
965070900 9.13649 59.8064
and if user specified window size is 3,the result should be something like
timestamp | temperature-2 | temperature-1 |temperature-0 | windspeed-2 | windspeed-1 | windspeed-0
965070000 9.61883 9.47203 9.31125 60.262 60.1664 60.0145
965070900 9.47203 9.31125 9.13649 60.1664 60.0145 59.8064
I could do this by using List of ObjectsArray in Java.Reading CSV and generate new CSV which it contains transformed dataset.
Here is the code
http://pastebin.com/cQnTBg8d #researh
I need to do this in Python , please help me to solve this.
Thank you

This answer assumes you are using Python 3.x - for Python 2.x some changes are required (some obvious places are commented)
For the data format in the question, this could be a starting point in Python:
import collections
def slide(infile,outfile,window_size):
queue=collections.deque(maxlen=window_size)
line=infile.readline()
headers=[s.strip() for s in line.split("|")]
row=[headers[0]]
for h in headers[1:]
for i in reversed(range(window_size)):
row.append("%s-%i"%(h,i))
outfile.write(" | ".join(row))
outfile.write("\n")
for line in infile:
queue.append(line.split())
if len(queue)==window_size:
row=[queue[-1][0]]
for j in range(1,len(headers)):
for old in queue:
row.append(old[j])
outfile.write("\t".join(row))
outfile.write("\n")
ws=3
with open("infile.csv","r") as inf:
with open("outfile.csv","w") as outf:
slide(inf,outf,ws)
actually this code is all about using a queue to keep the input rows for the window and not more - everything else is text-to-list-to-text.
With actual csv-data (see comment)
import csv
import collections
def slide(infile,outfile,window_size):
r=csv.reader(infile)
w=csv.writer(outfile)
queue=collections.deque(maxlen=window_size)
headers=next(r) # r.next() on python 2
l=[headers[0]]
for h in headers[1:]
for i in reversed(range(window_size)):
l.append("%s-%i"%(h,i))
w.writerow(l)
hrange=range(1,len(headers))
for row in r:
queue.append(row)
if len(queue)==window_size:
l=[queue[-1][0]]
for j in hrange:
for old in queue:
l.append(old[j])
w.writerow(l)
ws=3
with open("infile.csv","r") as inf: # rb and no newline param on python 2
with open("outfile.csv","w") as outf: # wb and no newline param on python 2
slide(inf,outf,ws)

Related

Weka Experimenter mod : Problem creating experiment copy to run

In the experimenter mod in weka I have this configuration :
Results destination : Ex1_1.arff
Experiment type : cross-validation | number of folds : 10 | Classification
Dataset : soybean.arff.gz
Iteration Control : Number of repetitions : 10 | Data sets first
Algorithms : J48 -C 0.25 -M 2
This experience is saved as Ex1_1.xml (saving with .exp give the following error : Couldn't save experiment file : Ex1_1.exp Reason : java.awt.Component$AccessibleAWTComponent$AccessibleAWTComponentHandler)
And when I try to run this experience I have the following error : Problem creating experiment copy to run: java.awt.Component$AccessibleAWTComponent$AccessibleAWTComponentHandler
So it seem I have a problem with something like AWT in java... Do somebody have an idea ?
Thank you !

Data Truncation on SBS2008

I am trying to read events from the event log of two windows machines. One is Windows 10 Enterprise (works perfectly) and one is Small Business Server 2008. On the SBS2008 machine output is truncated at the 78th character but I can’t see why.
This is the powershell command that I am running:
get-eventlog Application -After (Get-Date).AddDays(-10) |
Where-Object {$_.EntryType -like 'Error' -or $_.EntryType -like
'Information'} | Format-Table -Property TimeGenerated, Index,
EventID, EntryType, Source, Message -autosize | Out-String -Width 4000
It performs fine in a powershell edit window on both machines, no truncation.
If I run it in a command window using powershell -file GetEvents.ps1 the output is truncated:
C:\BITS>powershell -file GetEvents.ps1
I cannot work out what is doing this. This is the Java code:
String command = "powershell.exe " + Cmd;
Process powerShellProcess = Runtime.getRuntime().exec(command);
powerShellProcess.getOutputStream().close();
String line;
BufferedReader stdout = new BufferedReader(new InputStreamReader(powerShellProcess.getInputStream()));
while ((line = stdout.readLine()) != null) {
if (line.length() > 0) {
System.out.println(line);
}
}
Can anyone suggest a better way to get the events out of the log or suggest how I read the output from the powershell script without it being broken up by carriage returns? This was quite disappointing as I had tested it extensively on my machine (the W10 one) only to find it fails on the SBS2008 customer machine!
I have checked the libraries and java versions used on the different machines and they are the same. It’s not the println statement because I do some parsing of the string within the final ‘While’ block and the incoming line is definitely truncated.
I have subsequently tried using Get-winevent but when I put a filterhashtable flag on the command it fails when run on SBS2008 saying 'the parameter is incorrect' (works fine on W10).
My ideal would be able to get all the events from a specific logfile since a given eventID (because I can store that) but it seems to be virtually impossible to get the same output across all windows operating systems in the same format. Any suggestions welcome.
The answer for SBS2008 is here; http://www.mcbsys.com/blog/2011/04/powershell-get-winevent-vs-get-eventlog/
SBS2008 cannot use a hashtable, must use filterxml. Unfortunately when I use filterxml on SBS2008 it does not return the error message, everything else, just no the message. This is using the prescribed method of cutting and pasting the XML query from Event Viewer (https://blogs.msdn.microsoft.com/powershell/2011/04/14/using-get-winevent-filterxml-to-process-windows-events/).
After more research I have come up with a script which does (sort of) what I want. It lacks the eventindex (which is a shame) but it consistently returns the events from the System & Application eventlogs:
$fx = '<QueryList>
<Query Id="0" Path="Application">
<Select Path="Application">*[System[(Level=1 or Level=2 or Level=3) and TimeCreated[timediff(#SystemTime) <= 43200000]]]</Select>
</Query>
</QueryList>'
function Set-Culture([System.Globalization.CultureInfo] $culture) { [System.Threading.Thread]::CurrentThread.CurrentUICulture = $culture ; [System.Threading.Thread]::CurrentThread.CurrentCulture = $culture } ; Set-Culture en-US ; get-winevent -FilterXml $fx | out-string -width 470
$fx = '<QueryList>
<Query Id="0" Path="System">
<Select Path="System">*[System[(Level=1 or Level=2 or Level=3) and TimeCreated[timediff(#SystemTime) <= 43200000]]]</Select>
</Query>
</QueryList>'
Set-Culture en-US ; get-winevent -FilterXml $fx | out-string -width 470
I hope that this is useful to someone else!

How to extract one integer value from multiple integers present

I have the following string, output of df command.
$ df |grep data
/dev/block/dm-1 11066964 2103848 8946732 20% /data
I want to extract value
8946732
using java regex, I have tried
(.*?\s{3}\d+.\d+)
but it is not working fine.
If your expected output is always on fixed index, then you can use this :
>>> s = "/dev/block/dm-1 11066964 2103848 8946732 20% /data"
>>> s.split()[3]
'8946732'
ok, so as requested with using regex, here it is
>>> import re
>>> re.search(r'(?is)(.*?\s+\d+\s+\d+\s+)(\d+)',s).group(2)
'8946732'
using regex as below, it match non space follow by 2 digits then capture next digits
import re
re.match('\S+\s+(?:\d+\s+){2}(\d+)', "/dev/block/dm-1 11066964 2103848 8946732 20% /data").group(1)
# '8946372'
refer demo
You can ask df to print a specific column with the --output option:
df --output=avail /dev/block/dm-1 | tail -n1
The tail command is just for stripping the header Avail out.

python stanford posttager, java command failed after running for some time

I am using stanford posttager toolkit to tag list of words from academic papers. Here is my codes of this part:
st = StanfordPOSTagger(stanford_tagger_path, stanford_jar_path, encoding = 'utf8', java_options = '-mx2048m')
word_tuples = st.tag(document)
document is a list of words derived from nltk.word_tokenize, they come from mormal academic papers so usually there are several thousand of words (mostly 3000 - 4000). I need to process over 10000 files so I keep calling these functions. My program words fine on a small test set with 270 files, but when the number of file gets bigger, the program gives out this error (Java heap space 2G):
raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed
Note that this error does not occur immediately after the execution, it happens after some time of running. I really don't know the reason. Is this because my 3000 - 4000 words are too much ? Thank you very much for help !(Sorry for the bad edition, the error information is too long)
Here is my solution to the code,after I too faced the error.Basically increasing JAVA heapsize solved it.
import os
java_path = "C:\\Program Files\\Java\\jdk1.8.0_102\\bin\\java.exe"
os.environ['JAVAHOME'] = java_path
from nltk.tag.stanford import StanfordPOSTagger
path_to_model = "stanford-postagger-2015-12-09/models/english-bidirectional-distsim.tagger"
path_to_jar = "stanford-postagger-2015-12-09/stanford-postagger.jar"
tagger=StanfordPOSTagger(path_to_model, path_to_jar)
tagger.java_options='-mx4096m' ### Setting higher memory limit for long sentences
sentence = 'This is testing'
print tagger.tag(sentence.split())
I assume you have tried increasing the Java stack via the Tagger settings like so
stanford.POSTagger([...], java_options="-mxSIZEm")
Cf the docs, default is 1000:
def __init__(self, [...], java_options='-mx1000m')
In order to test if it is a problem with the size of the dataset, you can tokenize your text into sentences, e.g. using the Punkt Tokenizer and output them right after tagging.

Value of IPC_CREAT when used in JNA

I need to communicate a java application and a c process via posix message queue, and I am using JNA in java application.
In C process, when create message queue I am using:
key_t key = 112233;
int msgflg = IPC_CREAT | 0666;
msqid = msgget(key, msgflg )
What is the value of IPC_CREAT to use in Java application?
I found in ipc.h :
/usr/include/sys/ipc.h:#define IPC_CREAT 0001000 /* create entry if key doesn't exist */
May I safety assume that I can use 512? (decimal) ?
Thanks.
I would suggest you use 950, because
final int IPC_CREAT = 0001000;
int msgflg = IPC_CREAT | 0666;
System.out.println(msgflg);
outputs
950
I may not understand your question, because
printf("%i\n", 0001000 | 0666);
also outputs
950
Edit
Yes.
final int IPC_CREAT = 0001000;
System.out.printf("%d%n", IPC_CREAT);
Output is 512. And,
printf("%i\n", 0001000);
Output is 512. So you could use decimal 512. Or the binary version like C.

Categories

Resources