Flat file normalization with dynamic number of columns

Question

I have a flat file with, unfortunately, a dynamic column structure. One value belongs to a hierarchy, and each tier of the hierarchy gets its own column. For example, my flat file might resemble this:

    StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Tier3ObjectId|Status
    1234|7890|abcd|efgh|ijkl|mnop|Pending
    ...

The same feed the next day may look like this:

    StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Status
    1234|7890|abcd|efgh|ijkl|Complete
    ...

The thing is, I don't care about all the tiers; I'm only interested in the ID of the last (lowest) tier and in the other row data that is not part of the tier columns. I need to normalize the feed to something resembling this in order to insert it into a relational database:

    StatisticID|FileId|ObjectId|Status
    1234|7890|ijkl|Complete
    ...

What would be an efficient, easy-to-read mechanism for determining the last tier's object ID and organizing the data as described? Every attempt I've made feels kludgy to me.

Some things I've tried:

I tried examining the column names with regex patterns, identifying the columns that are tiered, ordering them by name descending, and selecting the first one... but I lose the ordinal column number that way, so that didn't look good.
I placed the columns I want in an IDictionary<string, int> object for reference, but again, reliably collecting the ordinal of the dynamic columns is a problem, and it seems like this would perform rather poorly.

c# formatting text parsing flat-file

Jeremy Holovacs Mar 13 '13 at 16:16

3 answers

Personally, I would not reformat your file. I think the easiest approach would be to parse each line from the front and from the back. For instance:

    string[] itemArray = getMyItems(); // e.g. the current line split on '|'
    string statisticId = itemArray[0];
    string fileId = itemArray[1];
    // ...and so on for the rest of your pre-tier columns

    // Then get the second-to-last column, which will be the last tier
    // (the very last column is Status)
    string lastTierId = itemArray[itemArray.Length - 2];

Since you know that the last tier will always be second from the end, you can just start at the end and work your way forward. This seems a lot easier than trying to reformat the data file.

If you really want to create a new file, you can use this approach to get the data you want to write.

Abe Miessler Mar 13 '13 at 16:21

I don't know the C# syntax, but something like this:

split the line into parts using | as the separator
get parts [0], [1], [length - 2] and [length - 1]
pass those parts to the database processing code
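
The steps above could be sketched in C# roughly as follows. This is only an illustration: `TierRowNormalizer` and `Normalize` are hypothetical names, and the code assumes every data row has two fixed leading columns, at least one tier column, and a trailing Status column, as in the sample feed.

```csharp
using System;

public static class TierRowNormalizer
{
    // Hypothetical helper: reduces one pipe-delimited data row to
    // StatisticID, FileId, last-tier ObjectId, Status.
    // Assumes the layout from the sample feed: two fixed leading columns,
    // one or more tier columns, then Status as the final column.
    public static string[] Normalize(string line)
    {
        string[] parts = line.Split('|');
        if (parts.Length < 4)
        {
            throw new FormatException("Expected at least 4 columns: " + line);
        }

        return new[]
        {
            parts[0],                 // StatisticID
            parts[1],                 // FileId
            parts[parts.Length - 2],  // ObjectId of the last tier
            parts[parts.Length - 1]   // Status
        };
    }
}
```

For the second sample row, `string.Join("|", TierRowNormalizer.Normalize("1234|7890|abcd|efgh|ijkl|Complete"))` gives `1234|7890|ijkl|Complete`. Because the indices are counted from the end, the number of tier columns does not matter, but the approach breaks if further non-tier columns are ever added after the tiers.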

Kwebble Mar 13 '13 at 16:34

Diederik Koerselman · Accepted Answer · 2013-03-14T07:30:53+0000

I ran into a similar problem a few years ago. I used a dictionary to map the columns; it was ugly, but it worked.

First create a dictionary:

    // Requires: using System.Collections.Generic; using System.Linq;

    // Maps each output position (0 = StatisticID, 1 = FileId, 2 = last tier ObjectId,
    // 3 = Status) to the index of that column in the source file.
    private Dictionary<int, int> GetColumnDictionary(string headerLine)
    {
        Dictionary<int, int> columnDictionary = new Dictionary<int, int>();
        List<string> columnNames = headerLine.Split('|').ToList();
        string maxTierObjectColumnName = GetMaxTierObjectColumnName(columnNames);
        for (int index = 0; index < columnNames.Count; index++)
        {
            if (columnNames[index] == "StatisticID")
            {
                columnDictionary.Add(0, index);
            }
            if (columnNames[index] == "FileId")
            {
                columnDictionary.Add(1, index);
            }
            if (columnNames[index] == maxTierObjectColumnName)
            {
                columnDictionary.Add(2, index);
            }
            if (columnNames[index] == "Status")
            {
                columnDictionary.Add(3, index);
            }
        }
        return columnDictionary;
    }

    private string GetMaxTierObjectColumnName(List<string> columnNames)
    {
        // Edit this function if Tier ObjectId is greater than 9:
        // the names are compared as strings, so "Tier10..." sorts before "Tier9...".
        var maxTierObjectColumnName = columnNames
            .Where(c => c.Contains("Tier") && c.Contains("Object"))
            .OrderBy(c => c)
            .Last();
        return maxTierObjectColumnName;
    }
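
The comment in GetMaxTierObjectColumnName warns that ordering by name stops working once a tier number goes past 9. One possible way around that, shown here only as a sketch, is to pull the tier number out of each header name with a regular expression and compare it as an integer. `TierColumns` and `IndexOfMaxTier` are hypothetical names, and the `Tier<N>ObjectId` pattern is an assumption based on the sample headers in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TierColumns
{
    // Assumed naming pattern, taken from the sample header: Tier<N>ObjectId.
    private static readonly Regex TierPattern = new Regex("^Tier([0-9]+)ObjectId$");

    // Hypothetical helper: returns the index of the highest-numbered tier
    // column, comparing tier numbers as integers rather than as strings.
    public static int IndexOfMaxTier(IList<string> columnNames)
    {
        int bestIndex = -1;
        int bestTier = -1;
        for (int i = 0; i < columnNames.Count; i++)
        {
            Match match = TierPattern.Match(columnNames[i]);
            if (!match.Success)
            {
                continue; // not a tier column
            }

            int tier = int.Parse(match.Groups[1].Value);
            if (tier > bestTier)
            {
                bestTier = tier;
                bestIndex = i;
            }
        }

        if (bestIndex < 0)
        {
            throw new FormatException("No Tier<N>ObjectId column found in header.");
        }
        return bestIndex;
    }
}
```

Since this returns the column's position directly, the ordinal is never lost, and the result could stand in for the index stored under key 2 in the dictionary.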

After that, just run through the file:

    // Requires: using System.Collections.Generic; using System.IO;
    // DataObject is assumed to be a simple class with four string properties.
    private List<DataObject> ParseFile(string fileName)
    {
        List<DataObject> dataObjects = new List<DataObject>();

        // The using block closes the file even if parsing throws.
        using (StreamReader streamReader = new StreamReader(fileName))
        {
            string headerLine = streamReader.ReadLine();
            Dictionary<int, int> columnDictionary = this.GetColumnDictionary(headerLine);

            string line;
            while ((line = streamReader.ReadLine()) != null)
            {
                var lineValues = line.Split('|');
                dataObjects.Add(
                    new DataObject()
                    {
                        StatisticId = lineValues[columnDictionary[0]],
                        FileId = lineValues[columnDictionary[1]],
                        ObjectId = lineValues[columnDictionary[2]],
                        Status = lineValues[columnDictionary[3]]
                    });
            }
        }

        return dataObjects;
    }

Hope this helps (at least a little).
