How to get scientific results from non-experimental data (data mining?)

  • I want to get maximum performance from a process with many variables, many of which cannot be controlled.
  • I cannot run thousands of experiments, so it would be nice if I could run hundreds of experiments and
    • change many managed parameters
    • collect data on many performance indicators
    • "correct", as far as possible, for those parameters that I could not control.
    • identify the "best" values for the things I can control, and start all over again.

It sounds like this would be called data mining: looking at lots of data that does not appear to be directly connected, but reveals correlations after some effort.

So... where do I start looking for algorithms, concepts, and theory on this kind of thing? Even useful search terms would be helpful.

Background: I enjoy ultra-marathon cycling and keep a journal of each ride. I would like to record more data, and after hundreds of rides be able to pull out information about how I perform.

However, everything varies: routes, environment (pace, pressure, noise, solar load, wind, drafting, etc.), fuel, ratio, weight, water load, and so on. I can control a few things, but riding the same route 20 times to test a new fueling regimen would be depressing, and it would take years to complete all the experiments I would like to do. I can, however, record all of these things and much more (telemetry on a bike, FTW).

3 answers

It looks like you want to do a regression analysis. You probably have a lot of data!


Regression analysis is an extremely common modeling method in statistics and science. (One could argue that statistics is the art and science of regression analysis.) There are many statistics packages to do the calculations you need. (I would recommend one, but I am out of date.)

Data mining has a bad name because too often people assume correlation equals causation. I have found that a good technique is to start with the variables you already know have an effect, and build a statistical model around them first. You know that wind, weight and climbing affect how fast you can travel, and statistical software can take your data set and calculate the correlations between those factors. This gives you a statistical model, or linear equation:

speed = x*weight + y*wind + z*climb + constant 
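In Python, fitting such a linear model is an ordinary least-squares problem. The sketch below is not from the original answer: the ride log is synthetic and the coefficients are invented, purely so the fit can be checked against known values.

```python
import numpy as np

# Synthetic ride log (all numbers invented for illustration):
# weight in kg, wind in m/s (headwind positive), climb in metres gained.
rng = np.random.default_rng(0)
n = 200
weight = rng.uniform(75, 85, n)
wind = rng.uniform(-5, 5, n)
climb = rng.uniform(0, 2000, n)

# A "true" relationship plus noise, so the fit can be verified.
speed = 40 - 0.1 * weight - 0.8 * wind - 0.005 * climb + rng.normal(0, 0.5, n)

# Design matrix with a column of ones for the constant term.
X = np.column_stack([weight, wind, climb, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, speed, rcond=None)
x, y, z, constant = coef
print(f"speed = {x:.3f}*weight + {y:.3f}*wind + {z:.4f}*climb + {constant:.1f}")
```

A dedicated statistics package (statsmodels, for example) would additionally report standard errors and R-squared for the fit rather than just the coefficients.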

When you investigate new variables, you can see whether the model improves or not by comparing a goodness-of-fit measure such as R-squared. This way you can check whether, say, temperature or time of day adds anything to the model.
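As a sketch of that comparison (again on synthetic data, where the candidate extra variable deliberately has no real effect), the snippet below fits a base model and an extended model and compares their R-squared values:

```python
import numpy as np

def r_squared(X, y):
    """Ordinary least-squares fit; return the R-squared of the fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

rng = np.random.default_rng(1)
n = 200
wind = rng.uniform(-5, 5, n)
temperature = rng.uniform(0, 35, n)              # candidate extra variable
speed = 30 - 0.8 * wind + rng.normal(0, 1.0, n)  # temperature has no real effect

base = np.column_stack([wind, np.ones(n)])
extended = np.column_stack([wind, temperature, np.ones(n)])

r2_base = r_squared(base, speed)
r2_ext = r_squared(extended, speed)
print(f"R^2 base: {r2_base:.3f}, extended: {r2_ext:.3f}")
```

One caveat: plain R-squared never decreases when a variable is added, so in practice adjusted R-squared or performance on held-out data is a fairer way to judge whether the new variable earns its place.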

You can also apply transformations to your data. For example, you may find that you perform better on colder days, but really cold days and really hot days both hurt your performance. In that case, you can put temperatures into bins or segments: below 0 °C; 0 °C to 40 °C; above 40 °C; or something along those lines. The key is to transform the data in a way that corresponds to a rational model of what is happening in the real world, not just to the data itself.
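A minimal sketch of that binning transformation, using the three temperature segments from the answer (the readings themselves are invented):

```python
import numpy as np

# Invented temperature readings (deg C) from a set of rides.
temps = np.array([-3.0, 5.0, 18.0, 27.0, 42.0, 45.0])

# Bin edges matching the answer: below 0, 0 to 40, above 40.
edges = [0.0, 40.0]
labels = ["cold", "moderate", "hot"]
categories = [labels[np.searchsorted(edges, t, side="right")] for t in temps]
print(categories)
```

The binned variable can then enter the regression as categories (e.g. one indicator column per bin) instead of the raw temperature.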


If someone thinks this is not a programming topic, note that you can use the same methods to analyze system performance.


Given that many variables mean too many dimensions, you could look at principal component analysis (PCA). Regression analysis requires some "art", whereas PCA lets the data speak for itself. Some software for this kind of analysis is listed at the bottom of the linked page.
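As an illustrative sketch (not from the answer), PCA can be computed by centering the data and taking its SVD. Here two synthetic telemetry channels are strongly correlated, so the first principal component captures almost all of the variance:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
# Two hypothetical telemetry channels that move together (climb rate and
# heart rate), plus one independent low-variance channel.
climb_rate = rng.normal(0.0, 1.0, n)
heart_rate = 2.0 * climb_rate + rng.normal(0.0, 0.1, n)
cadence = rng.normal(0.0, 0.1, n)
X = np.column_stack([climb_rate, heart_rate, cadence])

# PCA: centre the data, take the SVD; squared singular values give the
# variance explained by each principal component.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)
print("variance explained per component:", np.round(explained, 3))
```

Keeping only the leading components gives a lower-dimensional data set to feed into a regression.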


I have used the Perl Statistics::Regression module for some similar problems in the past. Be warned, however, that regression analysis is indeed an art. As the warning in the Perl module itself states, it will not make sense to you unless you have learned the appropriate math.


Source: https://habr.com/ru/post/1276964/
