I have a large number of huge CSV files (100M+ lines each) from different sources, and I need a fast snippet or library that automatically guesses the date format and converts the dates to broken-down time or a Unix timestamp. Once the format has been guessed, the snippet must also be able to check subsequent occurrences of the date field for validity, since the date format is likely to change within a file.
The set of date formats to test for should be configurable, but compiling an optimal decision tree (or something similar) for a given set of date formats is fine.
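To make the requirement concrete, here is a rough sketch in C of the guess-then-validate loop I have in mind. It is only an illustration under simplifying assumptions: the pattern table, the pattern syntax (`Y`, `M`, `D` for one digit each) and the function names `match_pattern`, `guess_format` and `parse_date` are all made up for this example and do not come from any existing library. It tries the patterns linearly rather than through a decision tree, and it only range-checks month and day.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical pattern table: Y, M and D each stand for one digit of the
 * year, month or day; every other character must match literally. */
static const char *const patterns[] = {
    "YYYY-MM-DD",
    "DD.MM.YYYY",
    "MM/DD/YYYY",
};
enum { N_PATTERNS = sizeof patterns / sizeof patterns[0] };

/* Match `field` against one pattern. On success fill `out` (date fields
 * only, everything else zeroed) and return 1; otherwise return 0. */
int match_pattern(const char *field, const char *pat, struct tm *out)
{
    assert(field != NULL && pat != NULL && out != NULL);
    int y = 0, m = 0, d = 0;
    for (; *pat != '\0'; pat++, field++) {
        int is_digit = (*field >= '0' && *field <= '9');
        int digit = *field - '0';
        switch (*pat) {
        case 'Y': if (!is_digit) return 0; y = y * 10 + digit; break;
        case 'M': if (!is_digit) return 0; m = m * 10 + digit; break;
        case 'D': if (!is_digit) return 0; d = d * 10 + digit; break;
        default:  if (*field != *pat) return 0; break;
        }
    }
    /* The whole field must be consumed; only a coarse range check here. */
    if (*field != '\0' || m < 1 || m > 12 || d < 1 || d > 31)
        return 0;
    memset(out, 0, sizeof *out);
    out->tm_year = y - 1900;
    out->tm_mon = m - 1;
    out->tm_mday = d;
    return 1;
}

/* Return the index of the first matching pattern, or -1 if none matches. */
int guess_format(const char *field, struct tm *out)
{
    for (int i = 0; i < N_PATTERNS; i++)
        if (match_pattern(field, patterns[i], out))
            return i;
    return -1;
}

/* Validate against the cached guess first; re-guess if the format changed.
 * `*cached` should start at -1. Returns 1 on success, 0 on failure. */
int parse_date(const char *field, int *cached, struct tm *out)
{
    if (*cached >= 0 && match_pattern(field, patterns[*cached], out))
        return 1;
    *cached = guess_format(field, out);
    return *cached >= 0;
}
```

The caller would keep one `cached` index per date column, so the common case costs a single pattern match per line. Note that an ambiguous field such as `01/02/2019` is resolved purely by table order in this sketch, which is exactly the kind of thing a real solution would have to handle better.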
I have come to the conclusion that nothing of the kind exists, but I still need to do the "market research", hence my question.
My first attempt was to mimic getdate() for the 23 different date formats I have seen so far, and to replace the number parsers with optimized versions that take date characteristics into account (no '4' to '9' in the tens digit of the day of month, no '3' to '9' in the tens digit of the month, etc.).
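The idea of a date-aware number parser can be sketched roughly as follows. This is an illustrative sketch, not the actual code: the function names `parse_day2` and `parse_month2` are hypothetical, and it assumes fixed-width two-digit fields. The month check is slightly tighter than described above, since the tens digit of a two-digit month can only be '0' or '1'.

```c
#include <assert.h>
#include <stddef.h>

/* Parse a two-digit day of month at s[0..1].
 * Returns 1..31 on success, -1 if the field cannot be a day. */
int parse_day2(const char *s)
{
    assert(s != NULL);
    /* The unsigned cast turns anything below '0' (including '\0') into a
     * large value, so one comparison rejects non-digits and '4'..'9'. */
    unsigned tens = (unsigned)(s[0] - '0');
    if (tens > 3)
        return -1;
    unsigned ones = (unsigned)(s[1] - '0');
    if (ones > 9)
        return -1;
    int day = (int)(tens * 10 + ones);
    return (day >= 1 && day <= 31) ? day : -1;
}

/* Parse a two-digit month at s[0..1].
 * Returns 1..12 on success, -1 if the field cannot be a month. */
int parse_month2(const char *s)
{
    assert(s != NULL);
    unsigned tens = (unsigned)(s[0] - '0');
    if (tens > 1)               /* tens digit of a month is 0 or 1 */
        return -1;
    unsigned ones = (unsigned)(s[1] - '0');
    if (ones > 9)
        return -1;
    int month = (int)(tens * 10 + ones);
    return (month >= 1 && month <= 12) ? month : -1;
}
```

Because the first digit is checked before the second one is read, a mismatching format is rejected after a single comparison, which is where the speed-up over a generic integer parser would come from.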
Has anyone encountered a similar problem or even created code of this kind?