Calculating the average of multiple time-series with random sampling interval

I've tried searching for answers, but couldn't find one that exactly matches my problem.
I'm writing a stochastic simulator of biological systems, where the outcome is a "scatter-plot" time series with concentration levels at random points in time. Now I would like to take the average time series of multiple simulation runs, and I am in doubt about how to proceed, as up to 500 simulation runs, each with several thousand measurements, can be expected.
Naturally, I could "bucket" the intervals, probably losing some precision, or try to interpolate the missing measurements. But what is the preferred method in my case?
This has to be implemented in Java, and I would prefer a citation to a paper that explains the method.
Thanks!

If you want a book, Simulation Modeling & Analysis by Law or Discrete Event System Simulation by Banks, Carson, Nelson & Nicol both devote several chapters to time series output analysis. For "breaking news", there are several analysis tracks that have papers on recent developments in the field in the Paper Archives section at WinterSim.org. For a flow-chart of how to decide what type of analysis may be appropriate, see Figure 4 on p.60 of this tutorial paper from WinterSim 2007.
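To make the interpolation option from the question concrete, here is a minimal sketch that resamples every run onto a shared time grid by linear interpolation and averages across runs at each grid point. The class and method names are made up for illustration, and the choice of linear interpolation is an assumption; if the concentration only changes at the recorded events, holding the previous value may be the more faithful replacement.

```java
import java.util.List;

/** Sketch: average several irregularly sampled runs on a shared time grid. */
class SeriesAverager {

    /** Linearly interpolates one run (times sorted ascending) at time t, clamping at both ends. */
    static double interpolate(double[] times, double[] values, double t) {
        if (t <= times[0]) return values[0];
        int last = times.length - 1;
        if (t >= times[last]) return values[last];
        int hi = 1;
        while (times[hi] < t) hi++;              // first sample at or after t
        int lo = hi - 1;
        double w = (t - times[lo]) / (times[hi] - times[lo]);
        return values[lo] + w * (values[hi] - values[lo]);
    }

    /** Mean across runs at each grid time; each run is {times, values}. */
    static double[] average(List<double[][]> runs, double[] grid) {
        double[] mean = new double[grid.length];
        for (double[][] run : runs) {
            for (int g = 0; g < grid.length; g++) {
                mean[g] += interpolate(run[0], run[1], grid[g]);
            }
        }
        for (int g = 0; g < grid.length; g++) {
            mean[g] /= runs.size();
        }
        return mean;
    }
}
```

The linear scan inside interpolate keeps the sketch short; with several thousand measurements per run, a binary search (or a single merged pass over the sorted grid) would be the obvious refinement.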

Related

How to speed up the model creation process of OpenNLP

I am using the OpenNLP Token Name Finder to parse unstructured data. I have created a corpus (training set) of 4MM records, but building a model from this corpus with the OpenNLP APIs in Eclipse takes around 3 hours, which is very time consuming. The model is built with the default parameters, that is, 100 iterations and a cutoff of 5.
So my question is: how can I speed up this process and reduce the time it takes to build the model?
The size of the corpus could be the reason, but I wanted to know whether anyone has come across this kind of problem and, if so, how they solved it.
Please provide some clue.
Thanks in advance!

Usually the first approach to such issues is to split the training data into several chunks and let each one create a model of its own. Afterwards you merge the models. I am not sure that this is valid in this case (I'm not an OpenNLP expert), so there's another solution below. Also, as it seems that the OpenNLP API provides only single-threaded train() methods, I would file an issue requesting a multi-threaded option.
For a slow single-threaded operation the two main slowing factors are IO and CPU, and both can be handled separately:
IO - which hard drive do you use? Regular (magnetic) or SSD? Moving to an SSD should help.
CPU - which CPU are you using? Moving to a faster CPU will help. Don't pay attention to the number of cores, as here you want raw single-thread speed.
An option you may want to consider is to get a high-CPU server from Amazon Web Services or Google Compute Engine and run the training there; you can download the model afterwards. Both give you high-CPU servers with Xeon (Sandy Bridge or Ivy Bridge) CPUs and local SSD storage.

I think you should make algorithm related changes before upgrading the hardware.
Reducing the sentence size
Make sure you don't have unnecessarily long sentences in the training sample. Such sentences don't increase the performance but have a huge impact on computation (I'm not sure of the order). I generally put a cutoff at 200 words/sentence. Also look at the features closely. These are the default feature generators:
two kinds of WindowFeatureGenerator with a default window size of only two
OutcomePriorFeatureGenerator
PreviousMapFeatureGenerator
BigramNameFeatureGenerator
SentenceFeatureGenerator
These feature generators generate the following features in the given sentence for the word Robert.
Sentence: Robert, creeley authored many books such as Life and Death, Echoes and Windows.
Features:
w=robert
n1w=creeley
n2w=authored
wc=ic
w&c=robert,ic
n1wc=lc
n1w&c=creeley,lc
n2wc=lc
n2w&c=authored,lc
def
pd=null
w,nw=Robert,creeley
wc,nc=ic,lc
S=begin
ic is Initial Capital, lc is lower case
Of these features, S=begin is the only sentence-dependent feature; it marks that Robert occurred at the start of the sentence.
My point is to explain the role of a complete sentence in training. You can actually drop the SentenceFeatureGenerator and reduce the sentence size further to accommodate only a few words in the window around the desired entity. This will work just as well.
I am sure this will have a huge impact on complexity and very little on performance.
Have you considered sampling?
As I have described above, the features are a very sparse representation of the context. Maybe you have many sentences that are duplicates as seen by the feature generators. Try to detect these and sample in a way that represents sentences with diverse patterns, i.e. it should be impossible to write only a few regular expressions that match them all. In my experience, training samples with diverse patterns did better than those that represent only a few patterns, even though the former had a much smaller number of sentences. Sampling this way should not affect the model performance at all.
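As a rough illustration of the two suggestions above (the length cutoff and pattern-based sampling), the sketch below drops over-long sentences and keeps only a limited number of sentences per coarse token pattern. It uses plain Java and no OpenNLP calls. The "ic"/"lc" shapes mirror the feature listing above, but reducing a sentence to its sequence of token shapes is an assumption made for this example, not how OpenNLP itself detects duplicates, and TrainingSampler is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: pre-filter training sentences by length and by coarse token pattern. */
class TrainingSampler {

    /** Coarse token shape: "ic" = initial capital, "lc" = lower case, "other" = anything else. */
    static String shape(String token) {
        if (token.isEmpty()) return "other";
        char first = token.charAt(0);
        if (Character.isUpperCase(first)) return "ic";
        if (Character.isLetter(first) && token.equals(token.toLowerCase())) return "lc";
        return "other";
    }

    /** Keeps sentences of at most maxTokens tokens, and at most maxPerPattern per shape pattern. */
    static List<String> sample(List<String> sentences, int maxTokens, int maxPerPattern) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String sentence : sentences) {
            String[] tokens = sentence.trim().split("\\s+");
            if (tokens.length > maxTokens) continue;          // length cutoff, e.g. 200
            StringBuilder pattern = new StringBuilder();
            for (String token : tokens) {
                pattern.append(shape(token)).append(' ');
            }
            int count = seen.merge(pattern.toString(), 1, Integer::sum);
            if (count <= maxPerPattern) kept.add(sentence);   // skip near-duplicate patterns
        }
        return kept;
    }
}
```

A real training file would also contain the entity annotations, which this sketch treats as ordinary tokens; a more careful version would build the pattern only from the window around each annotated entity.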
Thank you.

Beat Matching Algorithm

I've recently begun trying to create a mobile app (iOS/Android) that will automatically beat match (http://en.wikipedia.org/wiki/Beatmatching) two songs.
I know that this exists out there, and there have been others who have had some success, but I'm running into issues related to the accuracy of the players.
Specifically, I run into "sync" issues where the "beats" don't line up. The various methods used to date are:
Calculating the BPM in advance, identifying a "beat" (using something like sonicapi.com), trying to line the tracks up appropriately, and beginning a mix-in with the playback rate adjusted (tempo adjustment)
Utilizing a bunch of meta data to trigger specific starts and stops
What does NOT work:
Leveraging echonest's API (it beat matches on the server, we want to do it on the client)
Something like pydub (does not do it in realtime)
Who uses this algorithm today:
iwebdj
Traktor
Does anyone have any suggestions on how to solve this problem? I've seen lots of people do it, but doing it in real time on a mobile device seems to be an issue.

There are lots of methods for solving this problem, some of which work better than others. Matthew Davies has published several papers on the matter, among many others. At a glance, this article seems to break down some of the steps necessary for doing this. I built a beat tracker in Matlab (unfortunately...) with a fellow student, and our goal was to create an outro/intro between 2 songs so that the tempo was seamless between them. We wanted to do this for songs that varied in BPM by a small amount (±7 or so BPM between the two). Our method went sort of like this:
Find two songs in our database that had overlapping 'key center'. So let's say 2 songs, both in Am.
Find this particular overlap of key centers between the two. Say 30 seconds into song 1 and 60 seconds into song 2.
Now create a beat map, using an onset-detection algorithm with peak picking. Also, this was helpful for us.
Pick the first 'beat' for each track, and overlap the two tracks at that point. Now, since they are slightly different BPM from each other, the beats won't really line up with each other.
From this, we created a sort of map that gave us the sample offsets between beats of song A and beats of song B. From this, we wanted to be able to time-stretch the fade-in region of song B so that each one of its onsets (beats in this case) lined up at the correct sample index as the onsets from song A, over ITS fade-out region. So for example, if onset 2 from song B was shown as 5,000 samples ahead of onset 2 from song A, we simply stretched that 5,000 sample region so that onset 2 matched exactly between both songs.
This seems like it would sound weird, but it actually sounded pretty good. Although this was done entirely offline in Matlab, I am also looking for a way to do this in real-time in a mobile app. Not entirely sure about libraries you can use for this in Android world, but I imagine that it would be most efficient in C++.
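The offset map and per-beat stretching described in the steps above can be sketched as follows. Given beat positions (as sample indices) for both songs, it computes how far each beat of song B drifts from the matching beat of song A once the first beats are aligned, and the factor by which each inter-beat region of song B would have to be time-stretched to cancel that drift. The names are illustrative, and the time-stretching itself (which needs a proper audio algorithm) is not shown.

```java
/** Sketch: beat offset map and per-region stretch factors between two beat lists. */
class BeatStretch {

    /** Offset in samples of each beat of B relative to A, after aligning the first beats. */
    static long[] offsets(long[] beatsA, long[] beatsB) {
        int n = Math.min(beatsA.length, beatsB.length);
        long[] out = new long[n];
        for (int i = 0; i < n; i++) {
            out[i] = (beatsB[i] - beatsB[0]) - (beatsA[i] - beatsA[0]);
        }
        return out;
    }

    /** Factor by which to stretch each inter-beat region of B so its length matches A's. */
    static double[] stretchFactors(long[] beatsA, long[] beatsB) {
        int regions = Math.max(Math.min(beatsA.length, beatsB.length) - 1, 0);
        double[] factors = new double[regions];
        for (int i = 0; i < regions; i++) {
            long lengthA = beatsA[i + 1] - beatsA[i];
            long lengthB = beatsB[i + 1] - beatsB[i];
            factors[i] = (double) lengthA / lengthB;   // below 1 shortens B, above 1 lengthens it
        }
        return factors;
    }
}
```

For example, at 44.1 kHz a 120 BPM track has beats 22,050 samples apart, so a song B with beats 25,000 samples apart gets a factor of 0.882 for every region.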
A couple of libraries I have come across would be good for prototyping something, or at least studying the source code to get a better understanding of how you could do this in a mobile app:
Essentia (great community, open-source)
Aubio (also seems to be maintained pretty well, open-source)
Additional things to read up on for doing this kind of stuff in iOS land:
vDSP Programming guide
This article may also help
I came across this project that is doing some beat detection. Although it seems pretty out-dated unfortunately, it may offer some additional insights.
Unfortunately it isn't as simple as just 'pressing play' at the same time to align beats, unless you are assuming very specific aspects about them (exact tempos, etc.).
If you reallllly have some time on your hands, you should check out Tristan Jehan's (founder of Echonest) thesis; it is jam packed with algorithms and methods for beat detection, etc.

Detect specific sound in audio file [duplicate]

How can the tempo/BPM of a song be determined programmatically? What algorithms are commonly used, and what considerations must be made?

This is challenging to explain in a single StackOverflow post. In general, the simplest beat-detection algorithms work by locating peaks in sound energy, which are easy to detect. More sophisticated methods use comb filters and other statistical/waveform methods. For a detailed explanation including code samples, check this GameDev article out.

The keywords to search for are "Beat Detection", "Beat Tracking" and "Music Information Retrieval". There is lots of information here: http://www.music-ir.org/
There is a (maybe) annual contest called MIREX where different algorithms are tested on their beat detection performance.
http://nema.lis.illinois.edu/nema_out/mirex2010/results/abt/mck/
That should give you a list of algorithms to test.
A classic algorithm is Beatroot (google it), which is nice and easy to understand. It works like this:
Short-time FFT the music to get a sonogram.
Sum the increases in magnitude over all frequencies for each time step (ignore the decreases). This gives you a 1D time-varying function called the "spectral flux".
Find the peaks using any old peak detection algorithm. These are called "onsets" and correspond to the start of sounds in the music (starts of notes, drum hits, etc).
Construct a histogram of inter-onset-intervals (IOIs). This can be used to find likely tempos.
Initialise a set of "agents" or "hypotheses" for the beat-tracking result. Feed these agents the onsets one at a time in order. Each agent tracks the list of onsets that are also beats, and the current tempo estimate. The agents can either accept the onsets, if they fit closely with their last tracked beat and tempo, ignore them if they are wildly different, or spawn a new agent if they are in-between. Not every beat requires an onset - agents can interpolate.
Each agent is given a score according to how neat its hypothesis is - if all its beat onsets are loud it gets a higher score. If they are all regular it gets a higher score.
The highest scoring agent is the answer.
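The first stages of the list above (spectral flux, peak picking, and the inter-onset-interval histogram) can be sketched in a few lines. This is not BeatRoot's actual code: it assumes the short-time FFT magnitudes are already available as a frames-by-bins array, uses a plain fixed threshold for peak picking, and leaves out the agent stage entirely. All names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: spectral flux, simple onset picking, and an inter-onset-interval histogram. */
class OnsetSketch {

    /** Spectral flux: per frame, the sum of magnitude increases over all bins (decreases ignored). */
    static double[] spectralFlux(double[][] magnitudes) {
        double[] flux = new double[magnitudes.length];
        for (int t = 1; t < magnitudes.length; t++) {
            for (int k = 0; k < magnitudes[t].length; k++) {
                double diff = magnitudes[t][k] - magnitudes[t - 1][k];
                if (diff > 0) flux[t] += diff;
            }
        }
        return flux;
    }

    /** Onsets: frames where the flux is a local maximum above a fixed threshold. */
    static List<Integer> pickPeaks(double[] flux, double threshold) {
        List<Integer> onsets = new ArrayList<>();
        for (int t = 1; t < flux.length - 1; t++) {
            if (flux[t] > threshold && flux[t] > flux[t - 1] && flux[t] >= flux[t + 1]) {
                onsets.add(t);
            }
        }
        return onsets;
    }

    /** Counts intervals (in frames) between all onset pairs, up to maxInterval. */
    static int[] ioiHistogram(List<Integer> onsets, int maxInterval) {
        int[] histogram = new int[maxInterval + 1];
        for (int i = 0; i < onsets.size(); i++) {
            for (int j = i + 1; j < onsets.size(); j++) {
                int interval = onsets.get(j) - onsets.get(i);
                if (interval <= maxInterval) histogram[interval]++;
            }
        }
        return histogram;
    }
}
```

The tallest histogram bins suggest candidate beat periods in frames; converting them to BPM requires the hop size and sample rate of the FFT stage.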
Downsides to this algorithm in my experience:
The peak-detection is rather ad-hoc and sensitive to threshold parameters and whatnot.
Some music doesn't have obvious onsets on the beats. Obviously it won't work with those.
Difficult to know how to resolve the 60bpm-vs-120bpm issue, especially with live tracking!
Throws away a lot of information by only using a 1D spectral flux. I reckon you can do much better by having a few band-limited spectral fluxes (and maybe one broadband one for drums).
Here is a demo of a live version of this algorithm, showing the spectral flux (black line at the bottom) and onsets (green circles). It's worth considering the fact that the beat is extracted from only the green circles. I've played back the onsets just as clicks, and to be honest I don't think I could hear the beat from them, so in some ways this algorithm is better than people at beat detection. I think the reduction to such a low-dimensional signal is its weak step though.
Annoyingly I did find a very good site with many algorithms and code for beat detection a few years ago. I've totally failed to refind it though.
Edit: Found it!
Here are some great links that should get you started:
http://marsyasweb.appspot.com/
http://www.vamp-plugins.org/download.html

Beat extraction involves the identification of cognitive metric structures in music. Very often these do not correspond to physical sound energy - for example, in most music there is a level of syncopation, which means that the "foot-tapping" beat that we perceive does not correspond to the presence of a physical sound. This means that this is a quite different field to onset detection, which is the detection of the physical sounds, and is performed in a different way.
You could try the Aubio library, which is a plain C library offering both onset and beat extraction tools.
There is also the online Echonest API, although this involves uploading an MP3 to a website and retrieving XML, so it might not be so suitable.
EDIT: I came across this last night - a very promising looking C/C++ library, although I haven't used it myself. Vamp Plugins

The general area of research you are interested in is called MUSIC INFORMATION RETRIEVAL
There are many different algorithms that do this but they all are fundamentally centered around ONSET DETECTION.
Onset detection measures the start of an event; the event in this case is a note being played. You can look for changes in the weighted Fourier transform (High Frequency Content), or you can look for large changes in spectral content (Spectral Difference). (There are a couple of papers further down that I recommend you look into.) Once you apply an onset detection algorithm, you pick off where the beats are via thresholding.
There are various algorithms that you can use once you've gotten that time localization of the beat. You can turn it into a pulse train (create a signal that is zero for all time and 1 only when your beat happens), then apply an FFT to that, and BAM, now you have the frequency of onsets at the largest peak.
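A small sketch of the pulse-train idea: for a train of unit pulses, the Fourier transform at a given frequency is just a sum of one complex exponential per onset, so it can be evaluated directly from the onset times without building the signal or running a full FFT. The code scans a range of candidate tempos and returns the one with the largest magnitude. The class name and the restricted BPM search range are assumptions for this example; the range is there because multiples of the true tempo produce equally strong peaks.

```java
/** Sketch: tempo estimate from onset times via the spectrum of a unit pulse train. */
class PulseTrainTempo {

    /** Magnitude of the Fourier transform of the pulse train at frequency hz. */
    static double magnitude(double[] onsetSeconds, double hz) {
        double re = 0.0;
        double im = 0.0;
        for (double t : onsetSeconds) {
            double phase = 2.0 * Math.PI * hz * t;
            re += Math.cos(phase);
            im -= Math.sin(phase);
        }
        return Math.hypot(re, im);
    }

    /** Scans whole-number tempos in [minBpm, maxBpm] and returns the strongest one. */
    static int estimateBpm(double[] onsetSeconds, int minBpm, int maxBpm) {
        int best = minBpm;
        double bestMagnitude = -1.0;
        for (int bpm = minBpm; bpm <= maxBpm; bpm++) {
            double m = magnitude(onsetSeconds, bpm / 60.0);
            if (m > bestMagnitude) {
                bestMagnitude = m;
                best = bpm;
            }
        }
        return best;
    }
}
```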
Here are some papers to lead you in the right direction:
https://web.archive.org/web/20120310151026/http://www.elec.qmul.ac.uk/people/juan/Documents/Bello-TSAP-2005.pdf
https://adamhess.github.io/Onset_Detection_Nov302011.pdf
Here is an extension to what some people are discussing:
Someone mentioned looking into applying a machine learning algorithm: Basically collect a bunch of features from the onset detection functions (mentioned above) and combine them with the raw signal in a neural network/logistic regression and learn what makes a beat a beat.
Look into Dr Andrew Ng; he has free machine learning lectures from Stanford University online (not the long-winded video lectures, there is actually an online distance course).

If you can manage to interface with python code in your project, Echo Nest Remix API is a pretty slick API for python:
There's a method analysis.tempo which will give you the BPM. It can do a whole lot more than simple BPM, as you can see from the API docs or this tutorial

Perform a Fourier transform, and find peaks in the power spectrum. You're looking for peaks below the 20 Hz cutoff for human hearing. I'd guess typically in the 0.1-5ish Hz range to be generous.
SO question that might help: Bpm audio detection Library
Also, here is one of several "peak finding" questions on SO: Peak detection of measured signal
Edit: Not that I do audio processing. It's just a guess based on the fact that you're looking for a frequency domain property of the file...
another edit: It is worth noting that lossy compression formats like mp3, store Fourier domain data rather than time domain data in the first place. With a little cleverness, you can save yourself some heavy computation...but see the thoughtful comment by cobbal.

To repost my answer: The easy way to do it is to have the user tap a button in rhythm with the beat, and count the number of taps divided by the time.
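A minimal sketch of the tap approach, assuming the tap timestamps are collected in milliseconds by whatever UI the app uses (the class name is hypothetical). Note that N taps span N - 1 beat intervals, so the count is reduced by one before dividing by the elapsed time.

```java
/** Sketch: tempo from a sequence of user taps. */
class TapTempo {

    /** Returns beats per minute from tap timestamps in milliseconds (ascending). */
    static double bpm(long[] tapMillis) {
        if (tapMillis.length < 2) {
            throw new IllegalArgumentException("need at least two taps");
        }
        double elapsedMinutes = (tapMillis[tapMillis.length - 1] - tapMillis[0]) / 60000.0;
        return (tapMillis.length - 1) / elapsedMinutes;   // intervals per minute
    }
}
```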

Others have already described some beat-detection methods. I want to add that there are some libraries available that provide techniques and algorithms for this sort of task.
Aubio is one of them. It has a good reputation and is written in C with a C++ wrapper, so you can integrate it easily with a Cocoa application (all the audio stuff in Apple's frameworks is also written in C/C++).

There are several methods to get the BPM but the one I find the most effective is the "beat spectrum" (described here).
This algorithm computes a similarity matrix by comparing each short sample of the music with every other. Once the similarity matrix is computed, it is possible to get the average similarity between all pairs of samples {S(t); S(t+T)} separated by each time interval T: this is the beat spectrum. The first high peak in the beat spectrum is most of the time the beat duration. The best part is you can also do things like music structure or rhythm analysis.
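A minimal sketch of this computation, assuming each short sample of the music has already been reduced to a feature vector (for example, spectrum magnitudes) and using cosine similarity as the comparison. Both choices, and the class name, are assumptions for the example. The value at lag T is the average similarity over all frame pairs that are T frames apart.

```java
/** Sketch: beat spectrum from a self-similarity matrix of per-frame feature vectors. */
class BeatSpectrum {

    /** Cosine similarity between two feature vectors (0 if either is all zeros). */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** spectrum[lag] = mean similarity of all frame pairs separated by lag frames. */
    static double[] compute(double[][] frames, int maxLag) {
        int n = frames.length;
        double[][] similarity = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                similarity[i][j] = similarity[j][i] = cosine(frames[i], frames[j]);
            }
        }
        double[] spectrum = new double[maxLag + 1];
        for (int lag = 0; lag <= maxLag && lag < n; lag++) {
            double sum = 0.0;
            for (int i = 0; i + lag < n; i++) {
                sum += similarity[i][i + lag];
            }
            spectrum[lag] = sum / (n - lag);
        }
        return spectrum;
    }
}
```

The full matrix costs quadratic time and memory in the number of frames, so for long tracks one would compute only the diagonals up to maxLag.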

I'd imagine this will be easiest in 4-4 dance music, as there should be a single low frequency thud about twice a second.

Peak detection in accelerometer data

I am trying to detect peaks in accelerometer data so I can count the number of steps. I have it polling at the "game" rate, which I think should be fast enough to give me useful data without too many data points. Are there any algorithms you recommend for finding the peaks? I currently have the data in Excel and tried graphing it, but there are way too many little jumps up and down.

I once used a peak detection algorithm for a computer vision application. There are many sophisticated techniques, but I wrote a very crude approach myself:
I used a windowed averaging filter, which smoothed out all the local ups and downs (if your peaks are narrow, use a smaller window size).
Then I took the discrete derivative and found all points where its sign changed from positive to negative. Finally, I took the average of the values around all those points and applied a threshold of around 1/3 of that average.
It's not the best approach, but it worked out well for me. You can play around with different discrete filters in either MATLAB or Python. There is a very good Python library called SciPy that can make your life easy.
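The three steps above can be sketched as follows. The answer leaves some details open, so this example makes its own assumptions: a centred moving average, candidates taken where the smoothed signal stops rising, and a threshold of one third of the mean candidate height. It also assumes non-negative input such as acceleration magnitude. The class name is made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: smooth, find derivative sign changes, then threshold the candidate peaks. */
class StepPeaks {

    /** Centred moving average; the window shrinks at the edges of the signal. */
    static double[] smooth(double[] x, int window) {
        double[] out = new double[x.length];
        int half = window / 2;
        for (int i = 0; i < x.length; i++) {
            int from = Math.max(0, i - half);
            int to = Math.min(x.length - 1, i + half);
            double sum = 0.0;
            for (int j = from; j <= to; j++) sum += x[j];
            out[i] = sum / (to - from + 1);
        }
        return out;
    }

    /** Indices where the smoothed slope turns from positive to non-positive and the value is large enough. */
    static List<Integer> peaks(double[] x, int window) {
        double[] s = smooth(x, window);
        List<Integer> candidates = new ArrayList<>();
        double sum = 0.0;
        for (int i = 1; i < s.length - 1; i++) {
            if (s[i] - s[i - 1] > 0 && s[i + 1] - s[i] <= 0) {
                candidates.add(i);
                sum += s[i];
            }
        }
        List<Integer> result = new ArrayList<>();
        if (candidates.isEmpty()) return result;
        double threshold = (sum / candidates.size()) / 3.0;   // one third of the mean peak height
        for (int i : candidates) {
            if (s[i] >= threshold) result.add(i);
        }
        return result;
    }
}
```

The number of steps is then the size of the returned list; the window size is the main knob to tune against the sampling rate.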

Simulation in java [closed]

I am a novice in the simulation world, and want to learn how programmers develop real simulation projects in Java. I would use Eclipse.
Could anyone point to other things that I need to know (e.g. other packages, software etc. and their purposes)?
I am afraid the question might seem a bit vague, as it is not clear which type of project I am talking about. But being a novice, let me say that I want to learn how to begin coding a simulation project.

If you are building a Monte-Carlo model for a discrete event simulation or simulation model for pricing derivatives you should find there is a body of framework code already out there. If you are doing a numerical simulation such as a finite-element model you will be basing your simulation on a matrix computation library. Other types of simulation exist, but these are the two most likely cases.
I've never written a finite element model and know next to nothing about these, although I did have occasion to port one to DEC Visual FORTRAN at one point. Although the program (SAFIR, if anyone cares) was commented in French, the porting exercise consisted of modifying two date functions for a total of 6 lines of FORTRAN code - and writing a makefile.
Monte-Carlo models consist of measuring some base population to get distributions of one or more variables of interest. Then you take a Pseudo-Random number generator with good statistical and geometric properties (the Mersenne Twister algorithm is widely used for this) and write a function to convert the output of this to a random variable with the appropriate distribution. You will probably be able to find library functions that do this unless your variables have a really unusual distribution.
Then you build or obtain a simulation framework and write a routine that takes the random variables and does whatever computation you want to do for the model. You run it, storing the results of each simulation, until the error is within some desired tolerance level. After that, you calculate statistics (means, distributions etc.) from all of the runs of the simulation model.
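As a small illustration of the last two paragraphs, the sketch below converts uniform pseudo-random numbers into exponentially distributed ones by inverse-transform sampling, then keeps running until the standard error of the running mean drops below a tolerance. The exponential distribution and the trivial "model" (just the sampled value) are placeholders, the names are made up, and java.util.Random is used only for brevity; it is not a Mersenne Twister.

```java
import java.util.Random;

/** Sketch: inverse-transform sampling plus a run-until-tolerance Monte-Carlo loop. */
class MonteCarloSketch {

    /** Maps a uniform u in [0, 1) to an exponential variate with the given rate. */
    static double exponential(double u, double rate) {
        return -Math.log(1.0 - u) / rate;
    }

    /** Averages simulated values until the standard error is below tolerance or maxRuns is reached. */
    static double estimateMean(Random rng, double rate, double tolerance, int maxRuns) {
        double mean = 0.0;
        double m2 = 0.0;     // running sum of squared deviations (Welford's method)
        int n = 0;
        while (n < maxRuns) {
            double x = exponential(rng.nextDouble(), rate);   // one "run" of the model
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
            if (n >= 30 && Math.sqrt(m2 / (n - 1) / n) < tolerance) {
                break;       // standard error of the mean is small enough
            }
        }
        return mean;
    }
}
```

In a real model the line marked as one "run" would be replaced by the full simulation, and more statistics than the mean would be collected.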
There are quite a lot of resources on the web, and many books on simulation modelling, particularly in the area of derivatives pricing. You should hunt around and see what you can find.
As an aside, the random module in Python has conversion functions for quite a few distributions. If you want one, you could take that and port the appropriate conversion function to Java. You could use the output of the Python one with the same random number seed to test the correctness of the Java one.

Discrete-event simulation is a good option for problems that can be modeled as individual events that take place at specific times. Key activities are:
randomly generating times and durations based on empirical data, and
accumulating statistics as the simulation runs.
For example, you could simulate the activity in a parking garage as the entries and departures of cars and the loss of customers who can't enter because the garage is full. This can be done with two model classes, a Car and the Garage, and three infrastructure classes, an Event class (described below), a Schedule to manage events, and a Monitor to accumulate data.
Here's a brief sketch of how it could work.
Event
An Event has a time, and represents calling a specific method on an object of a specific class.
Schedule
The Schedule keeps a queue of Events, ordered by Event time. The Schedule drives the overall simulation with a simple loop. As long as there are remaining Events (or until the Event that marks the end of the simulation run):
take the earliest Event from the queue,
set the "world clock" to the time of that event, and
invoke whatever action the Event specifies.
Car
The Car class holds the inter-arrival and length-of-stay statistics.
When a Car arrives, it:
logs its arrival with the Monitor,
consults the world clock, determines how long before the next Car should arrive, and posts that arrival Event on the Schedule,
asks the Garage whether it is full:
if full, the Car logs its departure as a lost customer with the Monitor.
if not full, the Car:
logs its entry with the Monitor,
tells the Garage it has entered (so that the Garage can decrease its available capacity),
determines how long it will stay, and posts its departure Event with the Schedule.
When a Car departs, it:
tells the Garage (so the Garage can increase available capacity), and
logs its departure with the Monitor.
Garage
The Garage keeps track of the Cars that are currently inside, and knows about its available capacity.
Monitor
The Monitor keeps track of the statistics in which you're interested: number of customers (successfully-arriving Cars), number of lost customers (who arrived when the lot was full), average length of stay, revenue (based on rate charged for parking), etc.
A simulation run
Start the simulation by putting two Events into the schedule:
the arrival of the first Car (modeled by instantiating a Car object and calling its "arrive" event) and
the end of the simulation.
Repeat the basic simulation loop until the end-of-simulation event is encountered. At that point, ask the Garage to report on its current occupants, and ask the Monitor to report the overall statistics for the session.
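The design above can be condensed into a short runnable sketch. To stay brief it folds the Car, Garage, Monitor and Schedule roles into one class, and it assumes exponentially distributed inter-arrival and stay times rather than empirical data. Every name in it is illustrative.

```java
import java.util.PriorityQueue;
import java.util.Random;

/** Minimal discrete-event sketch of the garage model; all names are illustrative. */
class GarageSim {

    /** An Event pairs a simulation time with the action to invoke at that time. */
    static final class Event implements Comparable<Event> {
        final double time;
        final Runnable action;
        Event(double time, Runnable action) { this.time = time; this.action = action; }
        @Override public int compareTo(Event other) { return Double.compare(time, other.time); }
    }

    // Schedule: events ordered by time, plus the world clock.
    private final PriorityQueue<Event> schedule = new PriorityQueue<>();
    private double clock = 0.0;
    private boolean running = true;

    // Garage state and Car statistics (exponential times are an assumption).
    private final int capacity;
    private final double meanInterArrival;
    private final double meanStay;
    private final Random rng;
    int occupied = 0;

    // Monitor counters.
    int arrivals = 0, entered = 0, lost = 0, departed = 0;

    GarageSim(int capacity, double meanInterArrival, double meanStay, long seed) {
        this.capacity = capacity;
        this.meanInterArrival = meanInterArrival;
        this.meanStay = meanStay;
        this.rng = new Random(seed);
    }

    private double exponential(double mean) {
        return -mean * Math.log(1.0 - rng.nextDouble());
    }

    private void arrive() {
        arrivals++;
        // Post the next Car's arrival before handling this one.
        schedule.add(new Event(clock + exponential(meanInterArrival), this::arrive));
        if (occupied >= capacity) {
            lost++;                      // garage full: lost customer
            return;
        }
        occupied++;
        entered++;
        schedule.add(new Event(clock + exponential(meanStay), this::depart));
    }

    private void depart() {
        occupied--;
        departed++;
    }

    /** Seeds the first arrival and the end-of-simulation event, then runs the event loop. */
    void run(double endTime) {
        schedule.add(new Event(0.0, this::arrive));
        schedule.add(new Event(endTime, () -> running = false));
        while (running && !schedule.isEmpty()) {
            Event next = schedule.poll();
            clock = next.time;           // advance the world clock
            next.action.run();
        }
    }
}
```

A run such as new GarageSim(10, 1.0, 5.0, 42L).run(1000.0) leaves the totals in the counter fields; splitting the class back into Car, Garage, Schedule and Monitor follows the description above directly.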

The short answer is that it depends.
Unless you can make the question more specific, there is no way to give an answer.
What do you want to simulate?
For example, if you want to simulate adding two numbers, you can do it using something like:
a = b + c;
If you want to simulate the bouncing of a ball, you can do that using a little bit of math and a graphics library.
If you want to simulate a web browser, you can do that too.
So the exact answer depends on what simulation you want to do.

Come up with a problem first.
There's no such thing as a generic "simulation". There are lots of techniques out there.
If you're just a gamer who wants to have pseudo-physics, maybe something like this would be what you had in mind.

Have a look at Repast Simphony: http://repast.sourceforge.net/repast_simphony.html
"Repast Simphony 2.0 Beta, released on 12/3/2010, is a tightly integrated, richly interactive, cross platform Java-based modeling system that runs under Microsoft Windows, Apple Mac OS X, and Linux. It supports the development of extremely flexible models of interacting agents for use on workstations and small computing clusters.
Repast Simphony models can be developed in several different forms including the ReLogo dialect of Logo, point-and-click flowcharts, Groovy, or Java, all of which can be fluidly interleaved. NetLogo models can also be imported.
Repast Simphony has been successfully used in many application domains including social science, consumer products, supply chains, possible future hydrogen infrastructures, and ancient pedestrian traffic to name a few."

This is an old question, but for simulation in Java I just installed and tested JavaSim by Mark Little, University of Newcastle upon Tyne. As far as I can tell, it works very well if you have a model you can convert into a discrete event simulation. See Mark's site http://markclittle.blogspot.com.au/2008/03/csimjavasim.html. I also attempted to use Desmo-J, which is very extensive and has a 2-D graphical mode, but could not get it going under JDK 1.6 on a Mac.
