Methodologies or algorithms for filling in missing data

I am dealing with data sets that have missing data, and I need to be able to fill forward, fill backward, and fill gaps. For example, if I have data from January 1, 2000 to December 31, 2010 with some days missing, and a user requests a time interval that starts earlier, ends later, or spans missing data points, I need to "fill in" the missing values.

Is there an established term for this kind of data filling? "Imputation" is one candidate, but I don't know whether it is the accepted term.

I assume there are many algorithms and methodologies for filling in missing data (reusing the last measured value, taking the average or a moving average between two known values, etc.).

Does anyone know the correct term for this problem, any online resources on the topic, or, ideally, open source implementations of some of these algorithms (preferably C#, but any language will be useful)?

3 answers

The term you are looking for is interpolation (obligatory wiki link).

You are asking for a C# solution with datasets, but you should also think about handling this at the database level, like this.

A simple brute-force approach in C# could be to create an array of consecutive dates, using your start and end values as the min and max. Then use this array to merge "interpolated" rows into your dataset, inserting a row wherever the dataset has no entry for a date in the array.
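The brute-force idea above can be sketched in a few lines. This is a minimal illustration in Python rather than C#, using only the standard library; the function name `fill_daily`, the dictionary input, and the choice to copy the nearest earlier value (or the first known value when nothing earlier exists) are assumptions made for the example, not something the answer prescribes.

```python
import bisect
from datetime import date, timedelta


def fill_daily(observed, start, end):
    """Return (day, value, was_filled) for every day from start to end.

    `observed` maps date -> value (hypothetical input shape). Missing days
    copy the latest earlier observation (forward fill); days before the
    first observation copy the first observation (backward fill).
    """
    if not observed:
        raise ValueError("need at least one observation")
    known_days = sorted(observed)
    rows = []
    day = start
    while day <= end:
        if day in observed:
            rows.append((day, observed[day], False))
        else:
            # Number of known days strictly before `day`.
            i = bisect.bisect_left(known_days, day)
            source = known_days[i - 1] if i > 0 else known_days[0]
            rows.append((day, observed[source], True))
        day += timedelta(days=1)
    return rows


observed = {date(2000, 1, 2): 10.0, date(2000, 1, 4): 20.0}
for row in fill_daily(observed, date(2000, 1, 1), date(2000, 1, 6)):
    print(row)
```

Keeping a `was_filled` flag on each row lets callers tell measured values apart from filled ones.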

Here is a post, fooobar.com/questions/1041272 / ..., that comes close to what you need: interpolating missing dates using C#. It has no accepted answer, but reading the question and the attempted answers can give you an idea of what to do next. For instance, represent the DateTime data as Ticks (a long value), apply an interpolation scheme to those values, and then convert the interpolated long values back to DateTime values.
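One possible reading of this suggestion, sketched in Python: convert dates to integers, interpolate along that numeric axis, and convert back to dates. Here `date.toordinal()` plays the role that Ticks would play in C#; the function name `fill_gap` and the choice of linear interpolation are assumptions for illustration.

```python
from datetime import date


def fill_gap(before, after):
    """Linearly interpolate values for each missing day between two points.

    `before` and `after` are (date, value) pairs. Dates are mapped to
    integers with toordinal(), interpolated, and mapped back with
    date.fromordinal().
    """
    (d0, v0), (d1, v1) = before, after
    t0, t1 = d0.toordinal(), d1.toordinal()
    if t1 <= t0:
        raise ValueError("`after` must be later than `before`")
    filled = []
    for t in range(t0 + 1, t1):
        mu = (t - t0) / (t1 - t0)  # fractional position inside the gap
        filled.append((date.fromordinal(t), v0 + mu * (v1 - v0)))
    return filled


for row in fill_gap((date(2000, 1, 1), 10.0), (date(2000, 1, 5), 18.0)):
    print(row)
```

For data sampled more finely than once a day, the same pattern would apply with a finer integer unit, such as seconds or ticks.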


The algorithm you use will depend heavily on the data itself, the size of the gaps relative to the available data, and how predictable the missing values are from the existing data. It may also draw on other information you have about what is missing, as is common in statistics when your actual data does not reflect the same distribution as the population in certain categories.

Linear and cubic interpolation are typical algorithms that are easy to implement; try searching for them on Google.

Here is a good primer with some code:

http://paulbourke.net/miscellaneous/interpolation/

The context of the discussion at this link is graphics, but the concepts are universal.
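As a rough sketch of the two techniques named above, the Python functions below interpolate between known values, with `mu` running from 0 to 1 across the gap. The cubic variant shown is a Catmull-Rom spline, chosen here as one common cubic scheme; it is not necessarily the exact formulation used on the linked page, and the function names are made up for the example.

```python
def linear_interpolate(y1, y2, mu):
    """Straight line between y1 (mu=0) and y2 (mu=1)."""
    return y1 + mu * (y2 - y1)


def catmull_rom_interpolate(y0, y1, y2, y3, mu):
    """Cubic (Catmull-Rom) interpolation between y1 and y2.

    y0 and y3 are the neighbours on either side; they shape the curve
    so that it stays smooth across consecutive segments.
    """
    a0 = -0.5 * y0 + 1.5 * y1 - 1.5 * y2 + 0.5 * y3
    a1 = y0 - 2.5 * y1 + 2.0 * y2 - 0.5 * y3
    a2 = -0.5 * y0 + 0.5 * y2
    a3 = y1
    return a0 * mu ** 3 + a1 * mu ** 2 + a2 * mu + a3


# Value halfway between two known samples, 10.0 and 20.0.
print(linear_interpolate(10.0, 20.0, 0.5))
print(catmull_rom_interpolate(5.0, 10.0, 20.0, 22.0, 0.5))
```

Linear interpolation needs only the two samples around the gap, while the cubic form needs two more neighbours, so gaps near the start or end of a series need special handling (for example, repeating the end value).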


If the filled data will feed statistical tests, a good search term is imputation, for example http://en.wikipedia.org/wiki/Imputation_%28statistics%29


Source: https://habr.com/ru/post/1341317/

