AI for learning patterns in invalid data?

I work in a public health department that receives and stores a lot of medical data every day. I wrote a program that uses regular expressions to determine whether certain fields in the input are valid or invalid. Example: DOBs come in as YYYYMMDD, so they must match the regex ^[0-9]{8}$

I want to analyze the "invalid" data to help identify problems in our system (we get too much data to go through every "bad" record row by row). Can anyone suggest AI or machine learning methods that can “monitor” the bad data and find patterns in what's wrong? I figure that coming up with a bunch of regular expressions for the possible ways the data can be invalid (for example, too few or too many characters) and then tallying the results might work. But instead of thinking up the ways the data might be invalid myself, I'm curious how to “learn” patterns from the bad data using AI.

Are there any known methods that do this?

+4
4 answers

I figure that coming up with a bunch of regular expressions for the possible ways the data can be invalid (for example, too few or too many characters) and then tallying the results might work. But instead of thinking up the ways the data might be invalid myself, I'm curious how to “learn” patterns from the bad data using AI.

That reminds me of a funny quote, usually attributed to Jamie Zawinski:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Except, in this case, I think the manual regex route is actually your best bet!

Irony of ironies.

Anyway.

The thing is, people tend to over-engineer their solutions. Here, regular expressions are actually a fairly simple solution to your problem, whereas building a learner would take you far longer than I think you realize.

There are far fewer ways for this very limited piece of data (a date) to be expressed correctly than there are ways for it to be expressed incorrectly. In fact, there are endless ways for data to be bad. Do you want to train a learner to discover all of them? That is a rabbit hole. Think of this AI learner instead as a colleague or friend: how would you describe to them all the ways a date could be malformed?

Your intention is to save yourself work in the long run, and that is a good quality. But the effort of figuring out how to develop a learner, not to mention training and testing it, not to mention maintaining it, outweighs any benefit the learner could give you in such a narrow use case.
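To make the manual route concrete, here is a minimal Python sketch of the "bunch of regexes, then tally" idea. The category names and patterns in FAILURE_PATTERNS are my own assumptions for illustration, not anything from your system; you would replace them with the failure modes you actually see.

```python
import re
from collections import Counter

# The validity rule from the question: exactly eight digits.
VALID_DOB = re.compile(r"[0-9]{8}")

# Hypothetical failure categories (assumed for illustration).
# They are tried in order and the first full match wins.
FAILURE_PATTERNS = [
    ("empty", re.compile(r"\s*")),
    ("too_short", re.compile(r"[0-9]{1,7}")),
    ("too_long", re.compile(r"[0-9]{9,}")),
    ("has_separators", re.compile(r"[0-9]{1,4}[-/.][0-9]{1,2}[-/.][0-9]{1,4}")),
    ("has_letters", re.compile(r".*[A-Za-z].*")),
]


def classify(value):
    """Return 'valid', the first matching failure label, or 'other'."""
    if VALID_DOB.fullmatch(value):
        return "valid"
    for label, pattern in FAILURE_PATTERNS:
        if pattern.fullmatch(value):
            return label
    return "other"


def tally(values):
    """Count how many values fall into each category."""
    return Counter(classify(v) for v in values)
```

Running tally over a day's worth of DOB values gives you a count per failure category. Anything that lands in "other" is a cue to look at a few samples by hand and add one more pattern.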

+3

Bayesian filtering may be what you are looking for.
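One way to read this suggestion, sketched in Python: a small naive Bayes classifier, the same idea spam filters use, trained on a handful of bad values that someone has labeled by hand. The features function, the NaiveBayes class and the labels below are all hypothetical names I made up for illustration, and the approach assumes you are willing to label some examples first.

```python
import math
from collections import Counter, defaultdict


def features(value):
    """Coarse tokens: the length, plus which character classes occur."""
    classes = {"D" if c.isdigit() else "A" if c.isalpha() else "P" for c in value}
    return [f"len={len(value)}"] + sorted(f"has={c}" for c in classes)


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, value, label):
        self.label_counts[label] += 1
        for tok in features(value):
            self.token_counts[label][tok] += 1
            self.vocab.add(tok)

    def predict(self, value):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n / total)
            # The extra +1 reserves probability mass for unseen tokens.
            denom = sum(self.token_counts[label].values()) + len(self.vocab) + 1
            for tok in features(value):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

You would call train(value, label) on the labeled samples and then predict(value) on the rest. With features this coarse it will only separate broad kinds of breakage, so treat it as a starting point for triage.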

+2

It sounds like you want to apply supervised learning to regular expressions. These guys seem to be doing something like that.

+1

Are you looking for outlier detection methods?
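If so, a very simple first pass (a frequency profile rather than a formal outlier-detection algorithm) is to collapse every bad value to its "shape" and count the shapes. The sketch below is in Python, and the names shape, shape_report and rare_shapes are hypothetical. Frequent shapes point at systematic problems upstream, while rare shapes are the outliers worth a manual look.

```python
from collections import Counter


def shape(value):
    """Replace digits with 9 and letters with A; keep everything else."""
    return "".join(
        "9" if c.isdigit() else "A" if c.isalpha() else c for c in value
    )


def shape_report(values):
    """Return (shape, count) pairs, most frequent first."""
    return Counter(shape(v) for v in values).most_common()


def rare_shapes(values, max_count=1):
    """Shapes seen at most max_count times: candidates for manual review."""
    return [s for s, n in shape_report(values) if n <= max_count]
```

For example, shape("1980-01-01") gives "9999-99-99", so every hyphenated date ends up in the same bucket no matter which date it is.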

0

Source: https://habr.com/ru/post/1392165/

