How do you feel about batch processing of poorly formatted text files?

People complain a lot about XML, but compared to EDI and some of the proprietary file formats I've worked with in my career, XML is bliss. The work I did importing data files from Automotive Comparative Raters, each with its own creative and nightmarish file format, still gives me nightmares.

That said, I'm curious how other programmers approach the automated parsing of poorly formatted text files. Do you have any language preferences? Are there any automation tools you consider invaluable? How do you make your code reusable?

+3
4 answers

A solution I discovered recently is to use a standalone lexer. You get structured regular expressions, and you avoid the limitations of a full-blown parser generator.

Here are some examples with ocamllex (the OCaml lexer generator):

  • ocamllex tutorial with some examples.
  • processing GenBank-formatted text files (another link that illustrates the point better, but beware of the JavaScript dialog box).

Lexer generators are, of course, also available for other languages if using OCaml is a problem for you.
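
For readers who do not use OCaml, here is a rough sketch of the same idea in Python: one named regular expression per token kind, combined into a single scanner. This is not ocamllex, just a hand-rolled stand-in, and the token kinds (NUMBER, IDENT, STRING, and so on) are made up for a hypothetical "key = value" style file.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    line: int


# Hypothetical token set for a "key = value" style record file.
# Order matters: earlier patterns win when several could match.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("STRING", r'"[^"\n]*"'),
    ("EQUALS", r"="),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t\r]+"),
    ("JUNK", r"."),  # anything unexpected: surface it instead of crashing
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text: str) -> Iterator[Token]:
    """Yield tokens with line numbers; whitespace and newlines are dropped."""
    line = 1
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        if kind == "NEWLINE":
            line += 1
        elif kind != "SKIP":
            yield Token(kind, match.group(), line)


if __name__ == "__main__":
    for token in tokenize('rate = 2.5\nname = "ABC" #'):
        print(token)
```

The JUNK rule is the part that pays off with badly formatted input: malformed bytes come out as tokens with a line number, so the caller decides whether to skip, log, or abort.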

+2

Perl/Python.

+1

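In Perl there is Marpa, a parser that takes a grammar written in BNF. Structural rules are written as

pattern_name ::= pattern_symbol1 pattern_symbol2 ...

and lexical rules as

lexeme ~ lexeme_symbol1 lexeme_symbol2 ...

You pass the BNF to Marpa, which parses the input and can return an AST.

Some examples in Perl with Marpa, from SO: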

Parse values from a block of text based on specific keys

Problem Category = "Human Endeavors "
Problem Subcategory = "Space Exploration"
Problem Type = "Failure to Launch"
Software Version = "9.8.77.omni.3"
Problem Details = "Issue with signal barrier chamber."

extracted from:

Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.
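
For comparison, here is a minimal Python sketch of the same extraction. It uses a plain regex over a list of known keys instead of a Marpa grammar, so it is a different technique from the linked answer. The function name `extract_fields` and the `KEYS` list are my own; the sketch assumes every key is known in advance and is always followed by a colon.

```python
import re

# Assumed: the full set of keys is known up front.
KEYS = [
    "Problem Category",
    "Problem Subcategory",
    "Problem Type",
    "Software Version",
    "Problem Details",
]

SAMPLE = (
    "Problem Category: Human Endeavors Problem Subcategory: Space Exploration"
    "Problem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3"
    "Problem Details: Issue with signal barrier chamber."
)


def extract_fields(text, keys=KEYS):
    """Map each key to the text between it and the next key (or end of input)."""
    # Longest keys first, so a key that is a prefix of another cannot shadow it.
    alternation = "|".join(
        re.escape(key) for key in sorted(keys, key=len, reverse=True)
    )
    key_re = re.compile(rf"({alternation}):\s*")
    matches = list(key_re.finditer(text))
    fields = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        fields[current.group(1)] = text[current.end():end]
    return fields


if __name__ == "__main__":
    for key, value in extract_fields(SAMPLE).items():
        print(f'{key} = "{value}"')
```

Values are deliberately not stripped, so the trailing space in "Human Endeavors " survives, as in the output listed above.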

Parsing brackets with sed using regex

key1
key2
key3
key4
key5
key6
key7

extracted from

dummy
(key1)
(key2)dummy(key3)
dummy(key4)dummy
dummy(key5)dummy))))dummy
dummy(key6)dummy))(key7)dummy))))
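
A quick Python sketch of this one as well, again with a regex rather than Marpa or sed. The name `bracketed_keys` is mine, and the sketch assumes keys are never nested and never contain parentheses themselves.

```python
import re

SAMPLE = """dummy
(key1)
(key2)dummy(key3)
dummy(key4)dummy
dummy(key5)dummy))))dummy
dummy(key6)dummy))(key7)dummy))))
"""


def bracketed_keys(text):
    """Return every run of non-parenthesis characters enclosed in ( and )."""
    # "(", then one or more characters that are not parentheses, then ")".
    # Stray closing parentheses have no opening partner, so they never match.
    return re.findall(r"\(([^()]+)\)", text)


if __name__ == "__main__":
    for key in bracketed_keys(SAMPLE):
        print(key)
```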

How to extract corporate bond information using machine learning

ABC 2.5 19
XYZ 6.5 15

extracted from

<[/] Trading 10mm ABC 2.5 19   05/06 mkt  can use 50mm>
<XYZ 6.5   15 10-2B    106-107                B3   AAA- 1.646MM 2x2>
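
The two sample lines can also be handled with a plain regex in Python, with no grammar and no machine learning. This is only a sketch built on my own guesses about the format: the ticker is 2 to 5 uppercase letters, followed by a coupon (integer or decimal) and a two-digit maturity year, separated by whitespace. Real quotes will almost certainly need more rules. The names `BOND_RE` and `extract_bond` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed shape: TICKER COUPON YY, e.g. "ABC 2.5 19".
BOND_RE = re.compile(r"\b([A-Z]{2,5})\s+(\d+(?:\.\d+)?)\s+(\d{2})\b")

SAMPLES = [
    "<[/] Trading 10mm ABC 2.5 19   05/06 mkt  can use 50mm>",
    "<XYZ 6.5   15 10-2B    106-107                B3   AAA- 1.646MM 2x2>",
]


def extract_bond(line: str) -> Optional[Tuple[str, float, int]]:
    """Return (ticker, coupon, maturity) for the first match, or None."""
    match = BOND_RE.search(line)
    if match is None:
        return None
    return match.group(1), float(match.group(2)), int(match.group(3))


if __name__ == "__main__":
    for sample in SAMPLES:
        print(extract_bond(sample))
```

Returning None for lines that do not match keeps the unparseable ones easy to collect and inspect later.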

Hope this helps.

+1

I know I'll get criticized for this, but I like Java as a general-purpose language. For parsing files, plain regexes (I know, now I have two problems...) work well for me.

0

Source: https://habr.com/ru/post/1777704/

