We use an XBRL processor to ingest filings from the SEC. Often, a company reports a metric in different filings under different concepts - with or without exactly matching values - even though they should be regarded as the same financial metric. Essentially, when you want to create a stitched view of all the filings, these numbers should appear on the same row. Here is an example to make it clear:
ASGN's 2020 10-K filing uses us-gaap:IncomeLossFromContinuingOperationsBeforeIncomeTaxesMinorityInterestAndIncomeLossFromEquityMethodInvestments to report EBT.
ASGN's 2021 10-K filing uses us-gaap:IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest to report EBT.
Notice that even the figures for 2020 and 2019 do not match between the two filings. My question is: how do you reconcile these cases in code to create a stitched/continuous view? Is this a solved problem, or is it more of a process where you need to make manual interventions? Are there libraries that help with this? Is there mapping information available from the SEC that can be used, even when the data do not agree? Would be great if anyone can help with this. Thanks.
From personal experience I can give you a list of considerations when it comes to people in the financial sector who are not developers and who submit standardized information:
the level of respect they have for the "you have to do things this way" paradigm is effectively 0.
your expectation that filings aren't filled out properly/correctly should be at 100%.
Even though SEC filings are meant to consolidate data in a standardized, meaningful, readily available and transparent format, the financial sector is plagued with ambiguity and interchangeable terms which may differ from corporate entity to corporate entity.
... or in short ... in their point of view "ILFCOBITEINI and ILFCOBITMIAILFEMI look pretty similar, so they pretty much mean the same thing."
As far as I know, there is no support from the SEC or any other federal entity in charge of controlling SEC filing accuracy, since the idea is "you file it wrong... you pay a fine" - meaning that, given how interchangeable the forms are, that "wrong" level is pretty ambiguous.
As such, the problem is that you must account for unexpected inconsistencies when it comes to filings, meaning that you should probably write some code which does structural-to-content identity matches across different entries.
I'd advise using a reasoning logic subsystem (that you'll have to write) instead of a simple switch-case statement operating on a "if-this-exists-else" basis. ... and always consider that the level of incompetence in the financial sector is disgustingly high.
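To make that concrete, here's a minimal Java sketch of what such a matching layer could look like: a curated table that folds the different us-gaap concepts from the question onto one canonical row key before stitching. The mapping itself is an assumption you'd have to maintain (or derive from taxonomy metadata you trust); it is not an official SEC or library API.

    import java.util.Map;

    // Hypothetical sketch: fold concepts a filer uses interchangeably onto one canonical metric key.
    public class ConceptNormalizer {

        // Assumption: this table is curated by you; the SEC does not publish such a mapping.
        private static final Map<String, String> CANONICAL = Map.of(
            "us-gaap:IncomeLossFromContinuingOperationsBeforeIncomeTaxesMinorityInterestAndIncomeLossFromEquityMethodInvestments", "EBT",
            "us-gaap:IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest", "EBT");

        /** Returns the canonical row key for a reported concept, or the concept itself if unmapped. */
        public static String canonicalKey(String concept) {
            return CANONICAL.getOrDefault(concept, concept);
        }
    }

Facts whose concepts normalize to the same key then land on the same stitched row, and anything unmapped falls through unchanged so you can review it manually.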
It depends ...
Why is the data for the same "row" (e.g. revenues), for the same time period (e.g. the 12 months ending Dec 31, 2020), different? (Merger or acquisition? Accounting restatement? Something else?)
How might you handle this example, if you were manually "by hand" creating a financial model for this company in a spreadsheet?
Possible approaches:
"Most recent": For each row for each time period, use the most recently reported data.
"As first reported": For each row and each time period, use the "as first reported" data.
These are only two of several ways to present the data.
Neither of the above is "correct" or "better". Each has pros and cons.
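To illustrate, here is a small hedged Java sketch of both policies, assuming each reported fact carries the metric, the period it covers, the filing date it came from and the value (all names are made up for illustration):

    import java.time.LocalDate;
    import java.util.*;

    // One reported number: which metric, which period it covers, which filing reported it.
    record Fact(String metric, String period, LocalDate filingDate, double value) {}

    class Stitcher {
        /** Builds one stitched row: period -> value, keeping either the latest or the earliest filing. */
        static Map<String, Double> stitch(List<Fact> facts, boolean mostRecent) {
            Map<String, Fact> chosen = new HashMap<>();
            for (Fact f : facts) {
                chosen.merge(f.period(), f, (existing, incoming) -> {
                    boolean keepExisting = mostRecent
                            ? existing.filingDate().isAfter(incoming.filingDate())
                            : existing.filingDate().isBefore(incoming.filingDate());
                    return keepExisting ? existing : incoming;
                });
            }
            Map<String, Double> row = new TreeMap<>();
            chosen.forEach((period, f) -> row.put(period, f.value()));
            return row;
        }
    }

Calling stitch(facts, true) gives the "most recent" view and stitch(facts, false) the "as first reported" view; the point is that the policy becomes an explicit parameter rather than something buried in the ingestion code.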
Thoughts? Questions?
Point 1: differences aren't unusual, as companies make restatements and corrections from one year to the next. You will find them anywhere, not only with XBRL.
Point 2: they are using labels that look the same for two distinct concepts. In principle that should not happen, as it invites error if one is just downloading the labeled tables from the SEC. However, the FASB may have changed the concept from one year to the next. Did you check? There are other reasons for this kind of error, which are actually the subject of an ongoing research project of mine. They involve error and fraud. So be careful - there could be more to it.
To answer your question, there is no way to make sure you are doing your work correctly given those discrepancies other than getting an accountant/lawyer to check them. You could also get an intern ;)
I have about a year of experience in coding in Java. To hone my skills I'm trying to write a Calendar/journal entry desktop app in Java. I've realized that I still have no experience in data persistence and still don't really understand what the data persistence options would be for this program -- so perhaps I'm jumping the gun, and the design choices that I'm hoping to implement aren't even applicable once I get into the nitty gritty.
I mainly want to write a calendar app that allows you to log daily journal entries with associated activity logs for time spent on daily tasks. In terms of adding, editing and viewing the journal entries, using a hash table with the dates of the entries as keys and the entries themselves as the values seems most Big-Oh efficient (O(1) average case for each operation using a hash table).
However, I'm also hoping to implement a feature that could, given a certain range of dates, provide a simple analysis of average amount of time spent on certain tasks per day. If this is one of the main features I'm interested in, am I wrong in thinking that perhaps a sorted array would be more Big-Oh efficient? Especially considering that the data entries are generally expected to already be added date by date.
Or perhaps there's another option I'm unaware of?
The reason I'm asking is because of the answer to the following question: Why not use hashing/hash tables for everything?
And the reason I'm unsure if I'm even asking the right question is because of the answer to the following question: What's the best data structure for a calendar / day planner?
If so, I would really appreciate being directed to other resources on data persistence in Java.
Thank you for the help!
Use a NavigableMap interface (implemented by TreeMap, a red-black tree).
This allows you to easily and efficiently select date ranges and traverse over events in key order.
As an aside, if you consider time or date intervals to be "half-open" it will make many problems easier. That is, when selecting events, include the lower bound in results, but exclude the upper. The methods of NavigableMap, like subMap(), are designed to work this way, and it's a good practice when you are working with intervals of any quantity, as it's easy to define a sequence of intervals without overlap or gaps.
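For instance, a minimal sketch of a journal keyed by date with a half-open range query (the entry type and method names are illustrative):

    import java.time.LocalDate;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    public class Journal {
        private final NavigableMap<LocalDate, String> entries = new TreeMap<>();

        public void put(LocalDate date, String entry) {
            entries.put(date, entry);
        }

        /** Entries in [from, to): lower bound inclusive, upper bound exclusive. */
        public NavigableMap<LocalDate, String> range(LocalDate from, LocalDate to) {
            return entries.subMap(from, true, to, false);
        }
    }

Averaging time spent over a date range is then just a traversal of range(from, to), which the red-black tree serves in O(log n + k) time for k entries in the range.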
Depends on how serious you want your project to be. In all cases, be careful of premature optimization. This is when you try too hard to make your code "efficient" and sacrifice readability/maintainability in the process. For example, there is likely a way of doing manual memory management with native code to make a more efficient implementation of a data structure for your calendar, but it likely does not outweigh the benefits of using familiar APIs etc. It might, but you only know when you run your code.
Write readable code
Run it, test for performance issues
Use a profiler (e.g. JProfiler) to identify the code that is responsible for poor performance
Optimise that code
Repeat
For code that will "work" but will not be very scalable, a simple List will usually do fine. You can use JSON to store your objects, and a library such as Jackson Databind to map between the List and JSON. You could then simply save it to a file for persistence.
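A rough sketch of that file-based approach with Jackson (the JournalEntry shape and its fields are assumptions, not a prescribed design):

    import com.fasterxml.jackson.core.type.TypeReference;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.File;
    import java.io.IOException;
    import java.util.List;

    public class JsonStore {
        // Hypothetical POJO; Jackson picks up the public fields by default.
        public static class JournalEntry {
            public String date; // ISO-8601 string, e.g. "2024-01-15"
            public String text;
            public JournalEntry() {}
        }

        private static final ObjectMapper MAPPER = new ObjectMapper();

        public static void save(List<JournalEntry> entries, File file) throws IOException {
            MAPPER.writerWithDefaultPrettyPrinter().writeValue(file, entries);
        }

        public static List<JournalEntry> load(File file) throws IOException {
            return MAPPER.readValue(file, new TypeReference<List<JournalEntry>>() {});
        }
    }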
For an application that you want to be more robust and protected against data corruption, a database is probably better. With this, you get guarantees that, for example, data is not partially written, that concurrent access to the same data will not result in corruption, and a whole host of other benefits. However, you will need to have a database server running alongside your application. You can use JDBC and a suitable driver for your database vendor (e.g. MySQL) to connect to, read from and write to the database.
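A bare-bones JDBC sketch, assuming a MySQL server and a table created as CREATE TABLE entry (entry_date DATE PRIMARY KEY, text TEXT); the connection details are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.time.LocalDate;

    public class JdbcStore {
        private static final String URL = "jdbc:mysql://localhost:3306/journal"; // placeholder
        private static final String USER = "journal";
        private static final String PASSWORD = "secret";

        public void save(LocalDate date, String text) throws SQLException {
            String sql = "REPLACE INTO entry (entry_date, text) VALUES (?, ?)"; // MySQL-specific upsert
            try (Connection conn = DriverManager.getConnection(URL, USER, PASSWORD);
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setDate(1, java.sql.Date.valueOf(date));
                ps.setString(2, text);
                ps.executeUpdate();
            }
        }

        public String load(LocalDate date) throws SQLException {
            String sql = "SELECT text FROM entry WHERE entry_date = ?";
            try (Connection conn = DriverManager.getConnection(URL, USER, PASSWORD);
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setDate(1, java.sql.Date.valueOf(date));
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString(1) : null;
                }
            }
        }
    }

Opening a connection per call keeps the sketch short; in practice you would use a connection pool.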
For a serious application, you will probably want to create an API for your persistence. A framework like Spring is very helpful for this, as it allows you to declare REST endpoints using annotations, and introduces useful programming concepts, such as containers, IoC/Dependency Injection, Testing (unit tests and integration tests), JPA/ORM systems and more.
Like I say, this is all context dependent, but above all else, avoid premature optimization.
This thread might give you some ideas about what data structure to use for range queries.
Data structure for range query
It might even be easier to use a database and an API to query for the desired range.
If you are using (or are able to use) Guava, you might consider using RangeMap (*).
This would allow you to use, say, a RangeMap<Instant, Event>, which you could then query to say "what event is occurring at time T".
One drawback is that you wouldn't be able to model concurrent events (e.g. when you are double-booked in two meetings).
(*) I work for Google, Guava is Google's open-sourced Java library. This is the library I would use, but others with similar range map offerings are available.
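For illustration, a small sketch of that idea (the event type and method names are mine, not part of Guava):

    import com.google.common.collect.Range;
    import com.google.common.collect.RangeMap;
    import com.google.common.collect.TreeRangeMap;
    import java.time.Instant;

    public class EventCalendar {
        private final RangeMap<Instant, String> events = TreeRangeMap.create();

        public void add(Instant start, Instant end, String title) {
            // Half-open range: includes start, excludes end, so back-to-back events don't overlap.
            events.put(Range.closedOpen(start, end), title);
        }

        /** Returns the event occurring at time t, or null if there is none. */
        public String eventAt(Instant t) {
            return events.get(t);
        }
    }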
One of the most famous ways to measure an information retrieval system is to compute its precision and recall. For both, we need to know the total number of relevant documents and compare it with the documents that the system has returned. My question is: how can we find this superset of relevant documents in the following scenario?
Consider an academic search engine whose job is to accept the full title of an academic paper and, based on some algorithms, return a list of relevant papers. Here, to judge whether or not the system has good accuracy, we wish to calculate its precision and recall. But we do not know how to produce the set of relevant papers - the ones the search engine should return for a given user's query - and accordingly how to compute the precision and recall.
Most relevant-document sets used for system evaluation are built by showing documents to a user (a real person).
Non-human evaluation:
You could probably come up with a "fake" evaluation in your particular instance. I would expect that the highest ranked paper for the paper "Variations in relevance judgments and the measurement of retrieval effectiveness" [1] would be that paper itself. So you could take your data and create an automatic evaluation. It wouldn't tell you if your system was really finding new things (which you care about), but it would tell you if your system was terrible or not.
e.g. If you were at a McDonald's, and you asked a mapping system where the nearest McDonald's was, and it didn't find the one you were at, you would know this to be some kind of system failure.
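A sketch of such an automatic sanity check, with a placeholder SearchEngine interface standing in for whatever your system exposes:

    import java.util.List;

    // Placeholder for your system's search API; not a real library.
    interface SearchEngine {
        List<String> search(String query); // returns paper ids, best match first
    }

    class SanityCheck {
        /** Fraction of papers returned at rank 1 when queried by their own title. */
        static double selfRetrievalRate(SearchEngine engine, List<String> ids, List<String> titles) {
            int hits = 0;
            for (int i = 0; i < ids.size(); i++) {
                List<String> results = engine.search(titles.get(i));
                if (!results.isEmpty() && results.get(0).equals(ids.get(i))) {
                    hits++;
                }
            }
            return ids.isEmpty() ? 0.0 : (double) hits / ids.size();
        }
    }

A high self-retrieval rate doesn't tell you the system finds interesting new papers, but a low one is a clear red flag.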
For a real evaluation:
You come up with a set of queries, and you judge the top K results from your systems for each of them. In practice, you can't look at all the millions of papers for each query - so you approximate the true recall set by the recall set you currently know about. This is why it's important to have some diversity in the systems you're pooling. Relevance is tricky; people disagree a lot on which documents are relevant to a query.
In your case: people will disagree about what papers are relevant to another paper. But that's largely okay, since they will mostly agree on the obvious ones.
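Once you have pooled judgments per query, the metrics themselves are straightforward; here is a sketch (the judged-relevant set comes from your assessors, and recall is only approximate because the pool never covers the whole collection):

    import java.util.List;
    import java.util.Set;

    class Metrics {
        /** Precision@K: fraction of the top K results that are judged relevant. */
        static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
            int hits = 0;
            for (int i = 0; i < Math.min(k, ranked.size()); i++) {
                if (relevant.contains(ranked.get(i))) hits++;
            }
            return (double) hits / k;
        }

        /** Recall@K against the pooled judgments (an approximation of true recall). */
        static double recallAtK(List<String> ranked, Set<String> relevant, int k) {
            if (relevant.isEmpty()) return 0.0;
            int hits = 0;
            for (int i = 0; i < Math.min(k, ranked.size()); i++) {
                if (relevant.contains(ranked.get(i))) hits++;
            }
            return (double) hits / relevant.size();
        }
    }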
Disagreement is okay if you're comparing systems:
This paradigm only makes sense if you're comparing different information retrieval systems. It doesn't help you understand how good a single system is, but it helps you reliably determine whether one system is better than another [1].
[1] Voorhees, Ellen M. "Variations in relevance judgments and the measurement of retrieval effectiveness." Information processing & management 36.5 (2000): 697-716.
We have market data handlers which publish quotes to a KDB+ tickerplant. We use the exxeleron qJava library for this purpose. Unfortunately, latency is quite high: hundreds of milliseconds when we try to insert a batch of records. Can you suggest some latency tips for KDB+ with the Java binding, as we need to publish quite fast?
There's not enough information in this message to give a fully qualified response, but having done the same with Java+KDB it really comes down to eliminating the possibilities. This is common sense, really, nothing super technical.
make sure you're inserting asynchronously
Verify it's exxeleron qJava that is causing the latency. I don't think there's hundreds of milliseconds of overhead there.
Verify the CPU that your tickerplant is on isn't overloaded. Consider re-nicing, core binding, etc
Analyse your network latencies. Also, if you're using Linux, there's a few tcp tweaks you can try, e.g. TCP_QUICKACK
As you're using Java, be smarter about garbage collection. It's highly configurable, although not directly controllable.
if you find out the tickerplant is the source of latency, you could either recode it to not write to disk - or get a faster local disk.
There's so many more suggestions, but the question is a bit too ambiguous.
EDIT
Back in 2007, with old(ish) servers and a very old version of KDB+ we were managing an insertion rate of 90k rows per second using the vanilla c.java. That was after many rounds of the above points. I'm sure you can achieve way more now, it's a matter of finding where the bottlenecks are and fixing them one by one.
Make sure the data published to the tickerplant is batched: wait a little while and insert, say, a few rows of data as one batch, rather than inserting row by row as each new record arrives.
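To illustrate the async-plus-batching points, here is a rough sketch using KX's vanilla c.java (the exxeleron qJava API has an analogous asynchronous call); the table layout, column names and batch size are assumptions for illustration only:

    import java.util.ArrayList;
    import java.util.List;
    import kx.c;

    public class QuotePublisher {
        private final c conn;
        private final int batchSize = 500; // tune for your latency/throughput trade-off
        private final List<String> syms = new ArrayList<>();
        private final List<Double> bids = new ArrayList<>();
        private final List<Double> asks = new ArrayList<>();

        public QuotePublisher(String host, int port) throws Exception {
            conn = new c(host, port);
        }

        public synchronized void onQuote(String sym, double bid, double ask) throws java.io.IOException {
            syms.add(sym);
            bids.add(bid);
            asks.add(ask);
            if (syms.size() >= batchSize) {
                flush();
            }
        }

        public synchronized void flush() throws java.io.IOException {
            if (syms.isEmpty()) return;
            Object[] cols = {
                syms.toArray(new String[0]),
                bids.stream().mapToDouble(Double::doubleValue).toArray(),
                asks.stream().mapToDouble(Double::doubleValue).toArray()
            };
            // ks() sends asynchronously, so the publisher doesn't block waiting for the tickerplant.
            conn.ks(".u.upd", "quote", cols);
            syms.clear(); bids.clear(); asks.clear();
        }
    }

You would also want a timer that calls flush() every few milliseconds so quiet periods don't delay data indefinitely.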
I want to extract raw data using pcap and WinPcap. Since I will be testing it against a neural network trained with the NSL-KDD dataset, I want to know how to get those 41 attributes from raw data. Or, even if that is not possible, is it possible to obtain features like src_bytes, dst_host_same_srv_rate, diff_srv_rate, count, dst_host_serror_rate and wrong_fragment from raw, live-captured packets with pcap?
If someone would like to experiment with KDD '99 features despite the bad reputation of the dataset, I created a tool named kdd99extractor to extract a subset of KDD features from live traffic or a .pcap file.
This tool was created as part of a university project. I haven't found detailed documentation of the KDD '99 features, so the resulting values may be a bit different compared to the original KDD. Some sources used are mentioned in the README. The implementation is also not complete; for example, the content features dealing with payload are not implemented.
It is available in my github repository.
The 1999 KDD Cup Data is flawed and should not be used anymore
Even this "cleaned up" version (NSL KDD) is not realistic.
Furthermore, many of the "cleanups" they did are not sensible. Real data has duplicates, and the frequencies of such records is important. By removing duplicates, you bias your data towards the more rare observations. You must not do this blindly "just because", or even worse: to reduce the data set size.
The biggest issue however remains:
KDD99 is not realistic in any way
It wasn't realistic even in 1999, but the internet has changed a lot since back then.
It's not reasonable to use this data set for machine learning. The attacks in it are best detected by simple packet-inspection firewall rules. The attacks are well understood, and appropriate detectors - highly efficient, with 100% detection rate and 0% false positives - should be available in many cases on modern routers. Those detectors are so omnipresent that these attacks have been virtually nonexistent since 1998 or so.
If you want real attacks, look for SQL injections and similar. But these won't show up in pcap files - and the way the KDDCup'99 features were extracted from the raw traffic is largely undocumented anyway...
Stop using this data set.
Seriously, it's useless data. Labeled, large, often used, but useless.
It seems that I am late to reply. But, as other people have already answered, the KDD99 dataset is outdated.
I don't know about the usefulness of the NSL-KDD dataset. However, there are a couple of things:
When getting information from network traffic, the best you can do is to get statistical information (content-based information is usually encrypted). What you can do is create your own dataset to describe the behaviors you want to consider "normal", and then train the neural network to detect deviations from that "normal" behavior (see the sketch below).
Be careful: even the definition of "normal" behavior changes from network to network and from time to time.
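As a toy illustration of that idea (not the NSL-KDD features, just the general shape of a baseline-and-deviation detector over one statistical feature such as bytes per flow; the threshold is an assumption):

    public class BaselineDetector {
        private double mean;
        private double std;

        /** Learn the "normal" baseline from traffic you trust to be benign. */
        public void fit(double[] normalTraffic) {
            double sum = 0;
            for (double v : normalTraffic) sum += v;
            mean = sum / normalTraffic.length;
            double var = 0;
            for (double v : normalTraffic) var += (v - mean) * (v - mean);
            std = Math.sqrt(var / normalTraffic.length);
        }

        /** Flag an observation whose z-score exceeds the threshold. */
        public boolean isAnomalous(double value, double threshold) {
            if (std == 0) return value != mean;
            return Math.abs(value - mean) / std > threshold;
        }
    }

A real detector (neural network or otherwise) would work over many features at once, but the workflow is the same: fit on traffic you consider normal, then score deviations.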
You can have a look at this work, which I was involved in; besides using the statistical features of the original KDD, it takes additional features from a real network environment.
The software is available on request and is free for academic purposes! Here are two links to publications:
http://link.springer.com/chapter/10.1007/978-94-007-6818-5_30
http://www.iaeng.org/publication/WCECS2012/WCECS2012_pp30-35.pdf
Thanks!
For my university's debate club, I was asked to create an application to assign debate sessions, and I'm having some difficulty coming up with a good design for it. I will do it in Java. Here's what's needed:
What you need to know about BP debates: there are four teams of 2 debaters each and a judge. The four teams are assigned specific positions: gov1, gov2, op1, op2. There is no significance to the order within a team.
The goal of the application is to take as input the debaters who are present (for example, if there are 20 people, we will hold 2 debates) and assign them to teams and roles with regard to the history of each debater, so that:
Each debater should debate with (be on the same team as) as many people as possible.
Each debater should uniformly debate in different positions.
The debate should be fair - debaters have different levels of experience and this should be as even as possible - i.e., there shouldn't be a team of two very experienced debaters and a team of junior debaters.
There should be an option for the user to restrict the assignment in various ways, such as:
Specifying that two people should debate together, in a specific position or not.
Specifying that a single debater should be in a specific position, regardless of the partner.
If anyone can try to give me some pointers for a design for this application, I'll be so thankful!
Also, I've never implemented a GUI before, so I'd appreciate some pointers on that as well, but it's not the major issue right now.
Also, there is the issue of keeping debater information in a file, which I have also never implemented in Java, and I would like some tips on that as well.
This seems like a textbook constraint problem. GUI notwithstanding, it'd be perfect for a technology like Prolog (ECLiPSe Prolog has a couple of different Java integration libraries that ship with it).
But since you want this in Java, why not store the debaters' history in a SQL database and use SQL to express the constraints? You can then wrap those SQL queries in Java methods.
There are two parts (three if you count entering and/or saving the data), the underlying algorithm and the UI.
For the UI, I'm weird. I use this technique (there is a link to my SourceForge project). A Java version would have to be done, which would not be too hard. It's weird because very few people have ever used it, but it saves an order of magnitude of coding effort.
For the algorithm, the problem looks small enough that I would approach it with a simple tree search. I would have a scoring algorithm and just report the schedule with the best score.
That's a bird's-eye overview of how I would approach it.
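To make the scoring-plus-search idea concrete, here is a rough sketch for a single room of 8 debaters. The history fields, the penalty weights and the random-restart search (a lighter-weight stand-in for a proper tree search or constraint solver) are all assumptions for illustration:

    import java.util.*;

    class Debater {
        String name;
        int experience; // e.g. number of past debates
        Map<String, Integer> timesInPosition = new HashMap<>(); // "gov1", "gov2", "op1", "op2" -> count
        Set<String> pastPartners = new HashSet<>();
    }

    class Scheduler {
        static final String[] POSITIONS = {"gov1", "gov2", "op1", "op2"};

        /** Lower is better: penalise repeated partners, repeated positions and uneven experience. */
        static double score(List<Debater> room) { // 8 debaters, 2 per team, teams in position order
            double penalty = 0;
            double[] teamExperience = new double[4];
            for (int t = 0; t < 4; t++) {
                Debater a = room.get(2 * t), b = room.get(2 * t + 1);
                if (a.pastPartners.contains(b.name)) penalty += 10; // they have debated together before
                penalty += a.timesInPosition.getOrDefault(POSITIONS[t], 0);
                penalty += b.timesInPosition.getOrDefault(POSITIONS[t], 0);
                teamExperience[t] = a.experience + b.experience;
            }
            // fairness: spread between the strongest and weakest team in the room
            penalty += Arrays.stream(teamExperience).max().getAsDouble()
                     - Arrays.stream(teamExperience).min().getAsDouble();
            return penalty;
        }

        /** Very simple search: try random arrangements and keep the best-scoring one. */
        static List<Debater> bestOf(List<Debater> eight, int tries) {
            List<Debater> best = null;
            double bestScore = Double.MAX_VALUE;
            Random rnd = new Random();
            for (int i = 0; i < tries; i++) {
                List<Debater> candidate = new ArrayList<>(eight);
                Collections.shuffle(candidate, rnd);
                double s = score(candidate);
                if (s < bestScore) {
                    bestScore = s;
                    best = candidate;
                }
            }
            return best;
        }
    }

The user-supplied restrictions (fixed partners or positions) would become either hard filters on candidate arrangements or very large penalties in the score.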