Understanding this CSV header

Question

I need to parse a CSV file that has this header:

Company;Registered office;Notifying party;Domicile or Registered office;Holdings of voting rights;;;;;;Publication
;;;;directly held;;additionally counted;;total;;in Germany;;in foreign countries
;;;;percentage;single rights;percentage;single rights;percentage;single rights;Official stock exchange

I was wondering whether this is a standard header format. I expected all the fields to be listed one after another on the first line, for example "Holdings of voting rights - directly held - percentage, Holdings of voting rights - directly held - single rights", but instead the information is spread over three lines.

My file currently has 6 header lines (three are shown; the other three are in a different language). How can I detect it if they add more header lines one day? After the header, the file continues with the first data line and so on. The first line of real data is not always the same.

 BBS Kraftfahrzeugtechnik AG;Schiltach;Baumgartner, Heinrich;Deutschland;62,5;;37,5;;100,0;;Börsenzeitung;04.04.2002

I am also looking for Java libraries that can parse CSV files.

+4

java parsing csv

cdarwin Jan 6 '11 at 13:55

7 answers

I disagree with anyone who claims that only a comma is allowed. Wikipedia, for example, gives an example of a German CSV that uses semicolons to separate values (since commas are used as the decimal separator). I think MS Excel is also quite flexible about which delimiters to use. It is only programmers' minds that try to reduce everything to the most simplistic case.
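To illustrate why the decimal comma matters once the fields are split: a minimal sketch, using only the JDK, of reading values such as 62,5 from the sample row. The class name GermanNumbers is made up for this example, and it assumes the file follows German number formatting throughout.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

class GermanNumbers {
    /** Parses a German-formatted number such as "62,5" or "1.234,5". */
    static double parse(String text) {
        String trimmed = text.trim();
        NumberFormat format = NumberFormat.getInstance(Locale.GERMANY);
        ParsePosition position = new ParsePosition(0);
        Number value = format.parse(trimmed, position);
        // Reject input that was not consumed completely (e.g. "62,5abc").
        if (value == null || position.getIndex() != trimmed.length()) {
            throw new IllegalArgumentException("Not a German-style number: " + text);
        }
        return value.doubleValue();
    }

    public static void main(String[] args) {
        // Percentage columns taken from the sample data row in the question.
        String[] fields = "62,5;;37,5;;100,0".split(";", -1);
        System.out.println(parse(fields[0]) + parse(fields[2]));
    }
}
```

Splitting on the semicolon first and converting the number afterwards keeps the two concerns separate, which is the reason the semicolon is used as delimiter in the first place.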

For CSV parsing, I recommend Ostermiller Utils.

Q> How can I detect it if they add more header lines one day?
A> You cannot. The only thing you can rely on is either a dynamic layout (where you know the column names in advance) or a static layout (where you assume that a given column is always the nth).
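A rough sketch of the "dynamic layout" idea, assuming the column names are known in advance and appear in a single header line. The class ColumnIndex and its method are hypothetical names for this example, and the data row is a shortened copy of the one in the question.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class ColumnIndex {
    /** Maps each non-empty header name to its zero-based column position. */
    static Map<String, Integer> fromHeader(String headerLine, String delimiter) {
        Map<String, Integer> index = new HashMap<>();
        // Limit -1 keeps trailing empty fields, so positions stay aligned.
        String[] names = headerLine.split(Pattern.quote(delimiter), -1);
        for (int i = 0; i < names.length; i++) {
            String name = names[i].trim();
            if (!name.isEmpty()) {
                index.putIfAbsent(name, i); // the first occurrence of a name wins
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String header = "Company;Registered office;Notifying party;"
                + "Domicile or Registered office;Holdings of voting rights;;;;;;Publication";
        String row = "BBS Kraftfahrzeugtechnik AG;Schiltach;Baumgartner, Heinrich;"
                + "Deutschland;62,5;;37,5;;100,0";
        Map<String, Integer> index = fromHeader(header, ";");
        String[] fields = row.split(";", -1);
        // Look the column up by name instead of hard-coding its position.
        System.out.println(fields[index.get("Notifying party")]);
    }
}
```

With the static layout you would write fields[2] directly; the lookup by name survives reordered columns, but neither approach tells you how many header lines there are.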

+3

mindas Jan 6 '11 at 14:10

Even though CSV (Comma-Separated Values) files have the word "comma" in their name, I have seen some very strange things in the corporate world.

I would suggest creating your own representation of the data. It looks like you are reading several differently formatted files?

I would approach the problem in a modular way. Write importers for the different formats, have each one fill a normalized representation of the data, and then do whatever you want with it.

All this assumes that these files contain the same data type and that you do not have control over the files you receive.

Even if this is not the case, abstracting the data from its presentation and sticking it in a separate project would be useful.
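One way the modular approach could look, as a sketch only. Holding, HoldingImporter and SemicolonHoldingImporter are invented names, the choice of fields is illustrative, and the column positions (0, 2 and 8) are read off the single sample row in the question, so they are assumptions about the real file.

```java
import java.util.ArrayList;
import java.util.List;

/** Normalized record that every importer produces (fields are illustrative). */
class Holding {
    final String company;
    final String notifyingParty;
    final double totalPercentage;

    Holding(String company, String notifyingParty, double totalPercentage) {
        this.company = company;
        this.notifyingParty = notifyingParty;
        this.totalPercentage = totalPercentage;
    }
}

/** One implementation per file format; callers only ever see Holding objects. */
interface HoldingImporter {
    List<Holding> importLines(List<String> lines);
}

/** Importer for the semicolon-separated layout shown in the question. */
class SemicolonHoldingImporter implements HoldingImporter {
    private final int headerLines;

    SemicolonHoldingImporter(int headerLines) {
        this.headerLines = headerLines;
    }

    @Override
    public List<Holding> importLines(List<String> lines) {
        List<Holding> result = new ArrayList<>();
        int start = Math.min(headerLines, lines.size());
        for (String line : lines.subList(start, lines.size())) {
            String[] f = line.trim().split(";", -1);
            if (f.length < 9 || f[8].trim().isEmpty()) {
                continue; // skip rows that do not carry a total percentage
            }
            // German decimal comma, e.g. "100,0" becomes 100.0.
            double total = Double.parseDouble(f[8].trim().replace(',', '.'));
            result.add(new Holding(f[0].trim(), f[2].trim(), total));
        }
        return result;
    }
}
```

A second format would get its own HoldingImporter implementation, and the rest of the program would not need to change.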

I would also recommend using OpenCSV.

+3

Casey Jan 6 '11 at 14:17

This is not a CSV file. You need to get the file specification from whoever generates it.

CSV files have comma-separated values with one record per line. There is a loose specification for how to escape commas and escape characters. Excel wraps values in double quotes and doubles any double quote inside a value.
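A small sketch of the quoting convention described above: a field wrapped in double quotes may contain the delimiter, and a doubled double quote inside it stands for one literal quote. QuotedLineSplitter is a made-up name, and the sketch handles a single line only, so quoted fields containing line breaks are out of scope.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedLineSplitter {
    /** Splits one line into fields, honoring "..." quoting and "" escapes. */
    static List<String> split(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    current.append(c); // delimiters inside quotes are plain text
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("a,\"b,c\",\"say \"\"hi\"\"\"", ','));
    }
}
```

The delimiter is a parameter, so the same routine works for the semicolon-separated file in the question.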

+2

Lou franco Jan 6 '11 at 14:00

There is no standard header format. It is merely a convention that the first row is a list of values, separated by commas, that represent the column headers.

In your case, your table has three header lines (my guess is based on cell counting and comparison with the contents of your sample data).

This is still CSV, but you cannot know in advance which row is the first one containing actual data. The format itself has no notion of that.

+1

Andreas_D Jan 6 '11 at 14:01

As for CSV headers, there is no standard format. Usually we assume that the first line is the header. If the header spans multiple lines (which I am seeing here for the first time), you will need to know the number of columns in the header before you start parsing the file. At least that is a start.

The next assumption about CSV files is usually that one line is one row or record, so headers and data are normally separated by a newline. In your case, I'm not sure how the file is generated or how it is meant to be used.

+1

Sachin Shanbhag Jan 6 '11 at 14:03

Regarding CSV parsing libraries, I would highly recommend OpenCSV.

Also see: Can you recommend a Java library for reading (and possibly writing) CSV files?

+1

dogbane Jan 6 '11 at 2:04

rajah9 · Accepted Answer · 2011-01-06T14:12:28+0000

Yes, you have a legitimate CSV file. I read it into Excel successfully and suspect OpenOffice would have no problems with it either. For Excel, I saved it as a .txt file, and then had to tell Excel in the import dialog that it was semicolon-delimited.

This is "standard" in the sense that columns are separated by a delimiter (semicolons are fine, as are tabs and, of course, commas) and rows by newlines.

What makes this format confusing is that the second and third header lines do not line up one-to-one under the first. "Holdings of voting rights" spans 6 columns. Beneath it, in the second header row, "directly held" spans 2 columns, as do "additionally counted" and "total". The third header row splits each of those into "percentage" and "single rights".
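The spanning structure can be turned into the one-name-per-column form the question expected. The following is a sketch under two assumptions: the number of header lines is already known, and an empty cell in any row but the last continues the span to its left. HeaderFlattener is a hypothetical name, and the header lines in main are shortened to the first ten columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HeaderFlattener {
    /** Combines stacked header rows into one composite name per column. */
    static List<String> flatten(List<String> headerLines, String delimiterRegex) {
        List<String[]> rows = new ArrayList<>();
        int width = 0;
        for (String line : headerLines) {
            String[] cells = line.split(delimiterRegex, -1); // keep trailing empties
            rows.add(cells);
            width = Math.max(width, cells.length);
        }
        String[] carried = new String[rows.size()]; // last label seen in each row
        Arrays.fill(carried, "");
        List<String> names = new ArrayList<>();
        for (int col = 0; col < width; col++) {
            StringBuilder name = new StringBuilder();
            for (int r = 0; r < rows.size(); r++) {
                String[] cells = rows.get(r);
                String cell = col < cells.length ? cells[col].trim() : "";
                boolean isLastRow = r == rows.size() - 1;
                if (!cell.isEmpty()) {
                    carried[r] = cell;
                } else if (!isLastRow) {
                    cell = carried[r]; // empty cell continues the span to its left
                }
                if (!cell.isEmpty()) {
                    if (name.length() > 0) {
                        name.append(" - ");
                    }
                    name.append(cell);
                }
            }
            names.add(name.toString());
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList(
            "Company;Registered office;Notifying party;Domicile or Registered office;Holdings of voting rights;;;;;",
            ";;;;directly held;;additionally counted;;total;",
            ";;;;percentage;single rights;percentage;single rights;percentage;single rights");
        for (String name : flatten(header, ";")) {
            System.out.println(name);
        }
    }
}
```

The fill-to-the-right rule is a guess about intent: a column that is genuinely unlabelled in an upper row would wrongly inherit its left neighbour's label, so the result is worth checking by eye against the real file.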

I don't think you can easily detect where the headers stop and the data starts. That is a semantic problem, a question of meaning. It is easy for a human, though!
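If a guess is good enough, a heuristic can stand in for the human. The sketch below assumes, based only on the one sample row in the question, that every data row ends with a publication date in dd.MM.yyyy form and that no header row does. DataStartDetector is an invented name, and the rule will break if the file does not follow that pattern.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;

class DataStartDetector {
    // Assumed date layout of the last column, e.g. 04.04.2002.
    private static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("dd.MM.uuuu");

    /** Heuristic: a data row is one whose last field parses as a date. */
    static boolean looksLikeData(String line) {
        String[] fields = line.trim().split(";", -1);
        String last = fields[fields.length - 1].trim();
        try {
            LocalDate.parse(last, DATE);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    /** Returns the index of the first line that looks like data, or -1 if none does. */
    static int firstDataLine(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            if (looksLikeData(lines.get(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

Calling firstDataLine on the lines of the file gives the number of header lines to skip; comparing that number with the expected 6 would at least flag the day the header grows.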
