I want to extract raw data using libpcap and WinPcap. Since I will be testing it against a neural network trained with the NSL-KDD dataset, I want to know how to get those 41 attributes from raw data. Or, even if that is not possible, is it possible to obtain features like src_bytes, dst_host_same_srv_rate, diff_srv_rate, count, dst_host_serror_rate and wrong_fragment from raw live-captured packets with pcap?
If someone would like to experiment with KDD '99 features despite the bad reputation of the dataset, I created a tool named kdd99extractor to extract a subset of KDD features from live traffic or a .pcap file.
This tool was created as part of a university project. I haven't found detailed documentation of the KDD '99 features, so the resulting values may be a bit different compared to the original KDD. Some sources used are mentioned in the README. The implementation is also not complete; for example, the content features dealing with payload are not implemented.
It is available in my GitHub repository.
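For a feel of what the derived features involve, here is a rough, hypothetical sketch (not taken from kdd99extractor) of how two of the time-based features, count and an serror-rate-style counter over a 2-second window, could be approximated once packets have already been parsed into per-connection records. The ConnectionRecord class and its fields are assumptions, not part of any real library:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical parsed connection record; the actual extraction from pcap
    // (via libpcap/WinPcap or a Java wrapper) is assumed to happen elsewhere.
    class ConnectionRecord {
        final long timestampMillis;
        final String dstHost;
        final String service;      // e.g. "http", "smtp"
        final boolean synError;    // connection ended in an S0/REJ-like state

        ConnectionRecord(long ts, String dstHost, String service, boolean synError) {
            this.timestampMillis = ts;
            this.dstHost = dstHost;
            this.service = service;
            this.synError = synError;
        }
    }

    // Approximates KDD-style "count" and "serror_rate" over a 2-second window.
    class TimeWindowFeatures {
        private static final long WINDOW_MILLIS = 2000;
        private final Deque<ConnectionRecord> window = new ArrayDeque<>();

        // Call once per finished connection, in timestamp order.
        double[] featuresFor(ConnectionRecord current) {
            // Drop records that fell out of the 2-second window.
            while (!window.isEmpty()
                    && current.timestampMillis - window.peekFirst().timestampMillis > WINDOW_MILLIS) {
                window.pollFirst();
            }
            int count = 0;      // connections to the same destination host
            int serrors = 0;    // of those, how many had SYN errors
            for (ConnectionRecord r : window) {
                if (r.dstHost.equals(current.dstHost)) {
                    count++;
                    if (r.synError) serrors++;
                }
            }
            double serrorRate = count == 0 ? 0.0 : (double) serrors / count;
            window.addLast(current);
            return new double[] { count, serrorRate };
        }
    }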
The 1999 KDD Cup Data is flawed and should not be used anymore
Even this "cleaned up" version (NSL KDD) is not realistic.
Furthermore, many of the "cleanups" they did are not sensible. Real data has duplicates, and the frequencies of such records is important. By removing duplicates, you bias your data towards the more rare observations. You must not do this blindly "just because", or even worse: to reduce the data set size.
The biggest issue, however, remains:
KDD99 is not realistic in any way
It wasn't realistic even in 1999, and the internet has changed a lot since then.
It's not reasonable to use this data set for machine learning. The attacks in it are best detected by simple packet-inspection firewall rules. The attacks are well understood, and appropriate detectors, highly efficient, with a 100% detection rate and 0% false positives, should in many cases be available on modern routers. The detectors are so omnipresent that these attacks have been virtually nonexistent since 1998 or so.
If you want real attacks, look for SQL injections and the like. But these won't show up meaningfully in pcap files, and on top of that the way the KDD Cup '99 features were extracted from the raw traffic is largely undocumented...
Stop using this data set.
Seriously, it's useless data. Labeled, large, often used, but useless.
It seems that I am late to reply, but as other people have already answered, the KDD99 dataset is outdated.
I don't know about the usefulness of the NSL-KDD dataset. However, there are a couple of things:
When getting information from network traffic, the best you can do is collect statistical information (content-based information is usually encrypted). What you can do is create your own dataset to describe the behaviors you want to consider "normal", and then train the neural network to detect deviations from that "normal" behavior.
Be careful: even the definition of "normal" behavior changes from network to network and from time to time.
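As a very rough illustration of "train on normal, flag deviations", here is a minimal sketch using a z-score threshold as a stand-in for the neural network; the feature choice and the 3-sigma threshold are assumptions:

    // Learn a baseline for one statistical feature (e.g. connections per minute)
    // from your own "normal" capture, then flag values that deviate too far.
    class BaselineDetector {
        private final double mean;
        private final double std;

        BaselineDetector(double[] normalSamples) {
            double sum = 0;
            for (double v : normalSamples) sum += v;
            this.mean = sum / normalSamples.length;
            double sq = 0;
            for (double v : normalSamples) sq += (v - mean) * (v - mean);
            this.std = Math.sqrt(sq / normalSamples.length);
        }

        // Flag anything more than 3 standard deviations from the baseline mean.
        boolean isAnomalous(double observed) {
            return Math.abs(observed - mean) > 3 * std;
        }
    }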
You can have a look at this work, which I was involved in; besides taking the statistical features of the original KDD, it takes additional features from a real network environment.
The software is available on request and is free for academic purposes! Here are two links to publications:
http://link.springer.com/chapter/10.1007/978-94-007-6818-5_30
http://www.iaeng.org/publication/WCECS2012/WCECS2012_pp30-35.pdf
Thanks!
We use an XBRL processor to ingest filings from the SEC. Often, a company declares a metric in different filings with different concepts, with or without exactly matching values, that should nevertheless be regarded as the same financial metric. Essentially, when you want to create a stitched view of all the filings, these numbers should appear on the same row. I'll provide an example to make it clear:
ASGN's 2020 10-K filing uses us-gaap:IncomeLossFromContinuingOperationsBeforeIncomeTaxesMinorityInterestAndIncomeLossFromEquityMethodInvestments to report EBT.
ASGN's 2021 10-K filing uses us-gaap:IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest to report EBT.
If you notice, even the figures for 2020 and 2019 do not match between the two filings. My question is: how do you reconcile these cases in code to create a stitched/continuous view? Is this a solved problem, or is it more of a process where you need to make manual interventions? Are there libraries that help with this? Is there mapping information available from the SEC that can be used, even when the data do not agree? It would be great if anyone could help with this. Thanks.
From personal experience I can give you a list of considerations when it comes to non-program-development people who work in the financial sector and submit standardized information:
The level of respect they have for the "you have to do things this way" paradigm is effectively 0.
The expectation that filings aren't filled out properly/correctly should be at 100%.
Even though SEC filings are meant to consolidate data in a standardized, meaningful, readily available and transparent format, the financial sector is plagued with ambiguity and interchangeable terms which may differ from corporate entity to corporate entity.
... or in short ... in their point of view "ILFCOBITEINI and ILFCOBITMIAILFEMI look pretty similar, so they pretty much mean the same thing."
As far as I know, there is no support from the SEC or other federal entities in charge of controlling SEC filing accuracy, since the idea is "you file it wrong... you pay a fine", meaning that due to the interchangeability of forms, that "wrong" level is pretty ambiguous.
As such, the problem is that you must account for unexpected pseudo-failures when it comes to filings, meaning that you should probably write some code which does structural-to-content identity matches across different entries.
I'd advise using a reasoning-logic subsystem (that you'll have to write) instead of a simple switch-case statement operating on an "if-this-exists-else" basis... and always consider that the level of incompetence in the financial sector is disgustingly high.
It depends ...
Why is the data for the same "row" (e.g. revenues), for the same time period (e.g. the 12 months ending Dec 31, 2020), different? (Merger or acquisition? Accounting restatement? Something else?)
How might you handle this example, if you were manually "by hand" creating a financial model for this company in a spreadsheet?
Possible approaches:
"Most recent": For each row for each time period, use the most recently reported data.
"As first reported": For each row and each time period, use the "as first reported" data.
These are only two of several ways to present the data.
Neither of the above is "correct" or "better". Each has pros and cons.
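As a sketch of the "most recent" approach, assuming the facts have already been mapped onto canonical rows (which is the hard, partly manual step), later filings simply override earlier ones for the same row and period. The Fact record and its field names are illustrative only:

    import java.time.LocalDate;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical normalized fact: (canonical row, fiscal period, value, filing date).
    record Fact(String row, String period, double value, LocalDate filingDate) {}

    class MostRecentStitcher {
        // For each (row, period) cell, keep the value from the latest filing.
        static Map<String, Fact> stitch(List<Fact> facts) {
            Map<String, Fact> byCell = new HashMap<>();
            for (Fact f : facts) {
                String key = f.row() + "|" + f.period();
                Fact existing = byCell.get(key);
                if (existing == null || f.filingDate().isAfter(existing.filingDate())) {
                    byCell.put(key, f);   // later filing overrides earlier ones
                }
            }
            return byCell;
        }
    }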
Thoughts? Questions?
Point 1: differences aren't unusual, as companies make restatements and corrections from one year to the next. You will find them anywhere, not only with XBRL.
Point 2: they are using labels that look the same for two distinct concepts. At first glance, that should not happen in this case, as it leads to errors if one is just downloading the labeled tables from the SEC. However, the FASB may have changed that from one year to the next. Did you check? There are other reasons for this kind of error, which are actually the subject of an ongoing research project of mine. They involve error and fraud. So be careful; there could be more to it.
To answer your question, there is no way to make sure you are doing your work correctly given those discrepancies other than getting an accountant/lawyer to check them. You could also get an intern ;)
I have about a year of experience coding in Java. To hone my skills, I'm trying to write a calendar/journal-entry desktop app in Java. I've realized that I still have no experience with data persistence and still don't really understand what the data persistence options would be for this program, so perhaps I'm jumping the gun, and the design choices that I'm hoping to implement aren't even applicable once I get into the nitty-gritty.
I mainly want to write a calendar app that allows you to log daily journal entries, with associated activity logs for time spent on daily tasks. In terms of adding, editing and viewing the journal entries, using a hash table with the dates of the entries as keys and the entries themselves as the values seems most Big-O efficient (O(1) average case for each operation).
However, I'm also hoping to implement a feature that could, given a certain range of dates, provide a simple analysis of the average amount of time spent on certain tasks per day. If this is one of the main features I'm interested in, am I wrong in thinking that a sorted array would be more Big-O efficient? Especially considering that the entries are generally expected to be added date by date anyway.
Or perhaps there's another option I'm unaware of?
The reason I'm asking is because of the answer provided to the following question: Why not use hashing/hash tables for everything?
And the reason I'm unsure whether I'm even asking the right question is because of the answer to the following question: What's the best data structure for a calendar / day planner?
If so, I would really appreciate being directed to other resources on data persistence in Java.
Thank you for the help!
Use the NavigableMap interface (implemented by TreeMap, a red-black tree).
This allows you to easily and efficiently select date ranges and traverse over events in key order.
As an aside, if you consider time or date intervals to be "half-open" it will make many problems easier. That is, when selecting events, include the lower bound in results, but exclude the upper. The methods of NavigableMap, like subMap(), are designed to work this way, and it's a good practice when you are working with intervals of any quantity, as it's easy to define a sequence of intervals without overlap or gaps.
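A minimal sketch of that approach, assuming entries are keyed by LocalDate; the Entry record is just a placeholder:

    import java.time.LocalDate;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    class JournalIndex {
        // Placeholder entry type for illustration.
        record Entry(String text, int minutesSpentOnTasks) {}

        private final NavigableMap<LocalDate, Entry> entriesByDate = new TreeMap<>();

        void put(LocalDate date, Entry entry) {
            entriesByDate.put(date, entry);
        }

        // Half-open range [from, to): includes 'from', excludes 'to'.
        double averageMinutesPerDay(LocalDate from, LocalDate to) {
            NavigableMap<LocalDate, Entry> range = entriesByDate.subMap(from, true, to, false);
            if (range.isEmpty()) return 0.0;
            double total = 0;
            for (Entry e : range.values()) total += e.minutesSpentOnTasks();
            return total / range.size();
        }
    }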
It depends on how serious you want your project to be. In all cases, be careful of premature optimization. This is when you try too hard to make your code "efficient" and sacrifice readability/maintainability in the process. For example, there is likely a way of doing manual memory management with native code to make a more efficient implementation of a data structure for your calendar, but it likely does not outweigh the benefits of using familiar APIs, etc. It might do, but you only know when you run your code.
Write readable code
Run it, test for performance issues
Use a profiler (e.g. JProfiler) to identify the code that is responsible for poor performance
Optimise that code
Repeat
For code that will "work", but will not be very scalable, a simple List will usually do fine. You can use JSON to store your objects, and a library such as Jackson Databind to map between the List and JSON. You could then simply save it to a file for persistence.
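For instance, something along these lines, assuming Jackson Databind is on the classpath; the JournalEntry class and the file name are placeholders:

    import com.fasterxml.jackson.core.type.TypeReference;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.File;
    import java.io.IOException;
    import java.util.List;

    class JsonFileStore {
        // Placeholder data class for illustration.
        public static class JournalEntry {
            public String date;   // kept as a String to avoid extra Jackson modules
            public String text;
        }

        private final ObjectMapper mapper = new ObjectMapper();
        private final File file = new File("journal.json");

        void save(List<JournalEntry> entries) throws IOException {
            mapper.writerWithDefaultPrettyPrinter().writeValue(file, entries);
        }

        List<JournalEntry> load() throws IOException {
            if (!file.exists()) return List.of();
            return mapper.readValue(file, new TypeReference<List<JournalEntry>>() {});
        }
    }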
For an application that you want to be more robust and protected against data corruption, a database is probably better. With this, you can guarantee that, for example, data is not partially written, that concurrent access to the same data will not result in corruption, and a whole host of other benefits. However, you will need to have a database server running alongside your application. You can use JDBC and a suitable driver for your database vendor (e.g. MySQL) to connect to, read from and write to the database.
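A rough sketch of the JDBC route, assuming a MySQL driver on the classpath and a journal_entry table you would create yourself; the URL, credentials and schema here are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    class JdbcJournalStore {
        // Placeholder connection details.
        private static final String URL = "jdbc:mysql://localhost:3306/journal";
        private static final String USER = "app";
        private static final String PASSWORD = "secret";

        void saveEntry(String isoDate, String text) throws SQLException {
            String sql = "INSERT INTO journal_entry (entry_date, body) VALUES (?, ?)";
            try (Connection conn = DriverManager.getConnection(URL, USER, PASSWORD);
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, isoDate);
                ps.setString(2, text);
                ps.executeUpdate();
            }
        }
    }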
For a serious application, you will probably want to create an API for your persistence. A framework like Spring is very helpful for this, as it allows you to declare REST endpoints using annotations, and introduces useful programming concepts, such as containers, IoC/Dependency Injection, Testing (unit tests and integration tests), JPA/ORM systems and more.
Like I say, this is all context dependent, but above all else, avoid premature optimization.
This thread might give you some ideas about what data structure to use for range queries.
Data structure for range query
And it might even be easier to use a database and an API to query for the desired range.
If you are using (or are able to use) Guava, you might consider using RangeMap (*).
This would allow you to use, say, a RangeMap<Instant, Event>, which you could then query to say "what event is occurring at time T".
One drawback is that you wouldn't be able to model concurrent events (e.g. when you are double-booked in two meetings).
(*) I work for Google, Guava is Google's open-sourced Java library. This is the library I would use, but others with similar range map offerings are available.
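Something along these lines, assuming Guava is on the classpath; the Event type is a placeholder:

    import com.google.common.collect.Range;
    import com.google.common.collect.RangeMap;
    import com.google.common.collect.TreeRangeMap;

    import java.time.Instant;

    class Schedule {
        record Event(String title) {}   // placeholder event type

        private final RangeMap<Instant, Event> events = TreeRangeMap.create();

        void add(Instant start, Instant end, Event event) {
            // Half-open range: includes start, excludes end.
            events.put(Range.closedOpen(start, end), event);
        }

        Event eventAt(Instant t) {
            return events.get(t);   // null if nothing is scheduled at t
        }
    }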
We have market data handlers which publish quotes to a KDB+ tickerplant. We use the exxeleron qJava library for this purpose. Unfortunately, latency is quite high: hundreds of milliseconds when we try to insert a batch of records. Can you suggest some latency tips for KDB+ with the Java binding, as we need to publish quite fast?
There's not enough information in this message to give a fully qualified response, but having done the same with Java + KDB+, it really comes down to eliminating the possibilities. This is common sense, really, nothing super technical.
Make sure you're inserting asynchronously.
Verify it's exxeleron qJava that is causing the latency. I don't think there's hundreds of millis of overhead there.
Verify the CPU that your tickerplant is on isn't overloaded. Consider re-nicing, core binding, etc.
Analyse your network latencies. Also, if you're using Linux, there are a few TCP tweaks you can try, e.g. TCP_QUICKACK.
As you're using Java, be smarter about garbage collection. It's highly configurable, although not directly controllable.
If you find out the tickerplant is the source of latency, you could either recode it to not write to disk, or get a faster local disk.
There's so many more suggestions, but the question is a bit too ambiguous.
EDIT
Back in 2007, with old(ish) servers and a very old version of KDB+ we were managing an insertion rate of 90k rows per second using the vanilla c.java. That was after many rounds of the above points. I'm sure you can achieve way more now, it's a matter of finding where the bottlenecks are and fixing them one by one.
Make sure the data published to the tickerplant is batched: wait a little and insert, say, a few rows of data as a batch, rather than inserting row by row as soon as any new record comes in.
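To illustrate the batching advice, here is a rough buffer that flushes either when it is full or on a short timer; the publishBatch callback stands in for whatever asynchronous publish call your handler already makes to the tickerplant and is not a real API:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.function.Consumer;

    // Buffers quote rows and sends them to the tickerplant in batches
    // instead of one row at a time.
    class BatchingPublisher<T> {
        private final int maxBatchSize;
        private final Consumer<List<T>> publishBatch;   // placeholder for the real async publish call
        private final List<T> buffer = new ArrayList<>();
        private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

        BatchingPublisher(int maxBatchSize, long flushIntervalMillis, Consumer<List<T>> publishBatch) {
            this.maxBatchSize = maxBatchSize;
            this.publishBatch = publishBatch;
            // Also flush on a timer so quiet periods don't delay rows indefinitely.
            flusher.scheduleAtFixedRate(this::flush, flushIntervalMillis, flushIntervalMillis, TimeUnit.MILLISECONDS);
        }

        synchronized void add(T row) {
            buffer.add(row);
            if (buffer.size() >= maxBatchSize) flush();
        }

        synchronized void flush() {
            if (buffer.isEmpty()) return;
            publishBatch.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }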
I have an application that lets users publish unstructured keywords. Simultaneously, other users can publish items that must be matched to one or more specified keywords. There is no restriction on the keywords either set of users may use, so simply hoping for a collision is likely to yield very few matches, when the reality is that users might have used different keywords for the same thing, or keywords that are close enough (e.g. 'bicycles' and 'cycling', or 'meat' and 'food').
I need this to work on mobile devices (Android), so I'm happy to sacrifice matching accuracy for efficiency and a small footprint. I know about S-Match, but this relies on a backing dictionary of 15 MB, so it isn't ideal.
What other ideas/approaches/frameworks might help with this?
Your example of 'bicycles' and 'cycling' could be addressed by a take on the Levenshtein edit-distance algorithm, since the two words are lexically related. But your example of 'meat' and 'food' would indeed require a sizable backing dictionary, unless of course the concept set or target audience is limited to, say, foodies.
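For reference, the plain dynamic-programming Levenshtein distance (the starting point you would tune, e.g. by stemming first or adjusting the similarity threshold) looks roughly like this:

    class EditDistance {
        // Standard dynamic-programming Levenshtein distance.
        static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }

        // Normalized similarity in [0, 1]; 'bicycles' vs 'cycling' scores ~0.38,
        // so stemming or a tuned threshold would still be needed.
        static double similarity(String a, String b) {
            int maxLen = Math.max(a.length(), b.length());
            return maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLen;
        }
    }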
Have you considered hosting the dictionary as a web service and accessing the data as needed? The drawback of course is that your app would only work while in network coverage.
I'm working on a Java application; one of its functions is to show detailed information in graph form, with the odd statistic and "top 10" list here and there.
The data is being generated live by the application; consider it an internet "honeypot". The data is the result of external attacks, and the graphs will need to be of varying forms, such as:
Overall Statistics (Charts showing frequency of attacks per minute/hour/day, No. of attacks today, No. of attack-type attacks, Top 10 attackers)
Per Sensor (Charts showing frequency of attacks per minute/hour/day, Sensor 1 attacks today, No. of attack-type attacks, Top 10 attackers)
Per Attack-Type (Pie Chart)
The information for each attack type can vary quite a bit, and there will be other information that some have and some don't (e.g. a DoS will have an attacker address, whereas a remote exploit to upload a file will have an attacker address and a file name).
Initially I approached this by creating classes: there is a DoS data structure within which all the details of that attack can be stored, and these are stored inside a Vector, but this ended up becoming a serious headache very fast.
The obvious solution to me is to create a database (MySQL?) with a table for each attack type; from this, getting all the 1., 2. and 3. information is merely an SQL query away.
However, I can't help but feel that my database solution is a tad nasty and that I'm missing something here, so after hitting my head against the problem I'm asking here.
Any pointers greatly appreciated!
I'd lean towards building the entire concept of an 'attack' out as a class composed of all of the potential objects and fields necessary to describe any type of attack. You could specify interfaces as necessary to define the contract of each particular attack type (for factory creation, etc.), but then persist the entire object to a database with a schema pretty much identical to your implementation's class structure. This should give you a pretty good ability to do the reporting that you want, and I think implementation would be reasonably straightforward.
Without knowing just how large your attack tree is, it's a little difficult to be sure my approach is correct, but maybe this will be useful.
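As a rough sketch of that idea (class and field names are illustrative, not taken from your code):

    import java.time.Instant;

    // Common contract every attack type fulfils; reporting code only needs this.
    interface Attack {
        Instant timestamp();
        String attackerAddress();
        String sensorId();
        String attackType();      // used for per-type charts and the pie chart
    }

    // Denial of service: only the common fields.
    record DosAttack(Instant timestamp, String attackerAddress, String sensorId) implements Attack {
        public String attackType() { return "DoS"; }
    }

    // Remote exploit: common fields plus the uploaded file name.
    record RemoteExploitAttack(Instant timestamp, String attackerAddress, String sensorId,
                               String uploadedFileName) implements Attack {
        public String attackType() { return "RemoteExploit"; }
    }

Persisting these to a single attacks table (with nullable columns for the type-specific fields, or a small side table per type) would then keep the SQL for your overall, per-sensor and per-type reports straightforward.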
Not sure, but what you're describing looks like an OLAP cube, so maybe consider using a star schema or a snowflake schema, and have a look at something like Pentaho:
A complete Business Intelligence platform that includes reporting, analysis (OLAP), dashboards, data mining and data integration (ETL).