Flat file normalization with a dynamic number of columns

I have a flat file with, unfortunately, a dynamic column structure. One value belongs to a hierarchy, and each level (tier) of that hierarchy gets its own column. For example, my flat file might resemble this:

    StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Tier3ObjectId|Status
    1234|7890|abcd|efgh|ijkl|mnop|Pending
    ...

The same feed the next day may look like this:

    StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Status
    1234|7890|abcd|efgh|ijkl|Complete
    ...

The fact is that I do not care about all the tiers; I am only interested in the identifier of the last (lowest) tier and in the other row data that is not part of the tier columns. I need to normalize the feed to something resembling this in order to insert it into a relational database:

    StatisticID|FileId|ObjectId|Status
    1234|7890|ijkl|Complete
    ...

What would be an effective, easily readable mechanism for determining the identifier of the last-tier object and organizing the data as described? Every attempt I have made feels kludgy to me.

Some things I have tried:

  • I tried matching the column names against a regex pattern to identify the tier columns, sorting them by name in descending order, and taking the first one ... but I lose the column index that way, so it doesn't look good.
  • I put the columns I want into an IDictionary<string, int> for reference, but again, reliably collecting the index of the dynamic columns is a problem, and it seems like this would be pretty inefficient.
3 answers

I ran into a similar problem a few years ago. I used a dictionary to map the columns; it was ugly, but it worked.

First create a dictionary:

    // Requires: using System.Collections.Generic; using System.Linq;

    // Maps a fixed output position (0..3) to the column index in this file's header.
    private Dictionary<int, int> GetColumnDictionary(string headerLine)
    {
        Dictionary<int, int> columnDictionary = new Dictionary<int, int>();
        List<string> columnNames = headerLine.Split('|').ToList();
        string maxTierObjectColumnName = GetMaxTierObjectColumnName(columnNames);

        for (int index = 0; index < columnNames.Count; index++)
        {
            if (columnNames[index] == "StatisticID") { columnDictionary.Add(0, index); }
            if (columnNames[index] == "FileId") { columnDictionary.Add(1, index); }
            if (columnNames[index] == maxTierObjectColumnName) { columnDictionary.Add(2, index); }
            if (columnNames[index] == "Status") { columnDictionary.Add(3, index); }
        }
        return columnDictionary;
    }

    private string GetMaxTierObjectColumnName(List<string> columnNames)
    {
        // This is a string sort, so it only works for Tier0..Tier9.
        // Edit this function if the tier number can be greater than 9.
        var maxTierObjectColumnName = columnNames
            .Where(c => c.Contains("Tier") && c.Contains("Object"))
            .OrderBy(c => c)
            .Last();
        return maxTierObjectColumnName;
    }

After that, just run through the file:

    // Requires: using System.Collections.Generic; using System.IO;

    private List<DataObject> ParseFile(string fileName)
    {
        List<DataObject> dataObjects = new List<DataObject>();

        // "using" closes the file even if parsing throws.
        using (StreamReader streamReader = new StreamReader(fileName))
        {
            // The first line is the header; it tells us where each column lives today.
            string headerLine = streamReader.ReadLine();
            Dictionary<int, int> columnDictionary = this.GetColumnDictionary(headerLine);

            string line;
            while ((line = streamReader.ReadLine()) != null)
            {
                var lineValues = line.Split('|');
                dataObjects.Add(
                    new DataObject()
                    {
                        StatisticId = lineValues[columnDictionary[0]],
                        FileId = lineValues[columnDictionary[1]],
                        ObjectId = lineValues[columnDictionary[2]],
                        Status = lineValues[columnDictionary[3]]
                    }
                );
            }
        }
        return dataObjects;
    }
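The name-based sort in GetMaxTierObjectColumnName stops working once there is a Tier10, because "Tier10ObjectId" sorts before "Tier9ObjectId" as a string. A rough sketch of one way around that, assuming the tier columns are always named Tier<N>ObjectId (the TierColumns class and FindLastTierIndex method are names I made up for illustration): compare the tier numbers numerically and return the column index directly.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TierColumns
{
    // Assumed naming convention: "Tier<N>ObjectId", where N is a non-negative integer.
    private static readonly Regex TierPattern = new Regex(@"^Tier([0-9]+)ObjectId$");

    // Returns the index of the deepest tier column in the header, or -1 if there is none.
    public static int FindLastTierIndex(IReadOnlyList<string> columnNames)
    {
        int bestIndex = -1;
        int bestTier = -1;
        for (int i = 0; i < columnNames.Count; i++)
        {
            Match match = TierPattern.Match(columnNames[i]);
            if (!match.Success)
            {
                continue; // not a tier column
            }

            // Compare tier numbers as integers, so Tier10 ranks above Tier9.
            int tier = int.Parse(match.Groups[1].Value);
            if (tier > bestTier)
            {
                bestTier = tier;
                bestIndex = i;
            }
        }
        return bestIndex;
    }
}
```

You would call it as TierColumns.FindLastTierIndex(headerLine.Split('|')) and store the result where the dictionary currently keeps position 2. Since it hands back the index rather than the name, it also avoids the second lookup.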

Hope this helps (at least a little).


Personally, I would not reformat your file. I think the easiest approach would be to parse each line from the front and from the back. For instance:

    string[] itemArray = getMyItems();  // e.g. line.Split('|')
    string statisticId = itemArray[0];
    string fileId = itemArray[1];
    // ...and so on for the rest of your pre-tier columns

    // Then get the second-to-last column, which is the last tier
    // (the last column, at Length - 1, is Status).
    string lastTierId = itemArray[itemArray.Length - 2];

Since you know that the last tier will always be second from the very end, you can just start at the end and work backward. This seems a lot easier than trying to reformat the data file.

If you really want to create a new file, you can use this approach to get the data you want to write.


I don't know the C# syntax, but something like this:

  • split the line into parts, with | as the separator
  • take parts [0], [1], [length - 2] and [length - 1]
  • pass those values to the database processing code
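In C#, the steps above might look roughly like this. It is only a sketch, and it assumes the layout from the question: StatisticID and FileId come first, Status comes last, and the deepest tier id sits immediately before Status. The LineNormalizer class and Normalize method are illustrative names, not anything from the original post.

```csharp
using System;

public static class LineNormalizer
{
    // Turns one data row with any number of tier columns into
    // "StatisticID|FileId|ObjectId|Status".
    public static string Normalize(string line)
    {
        string[] parts = line.Split('|');
        if (parts.Length < 4)
        {
            // Fewer than four fields means there is no tier column to pick.
            throw new FormatException("Expected at least 4 '|'-separated fields: " + line);
        }

        string statisticId = parts[0];
        string fileId = parts[1];
        string objectId = parts[parts.Length - 2]; // deepest tier, just before Status
        string status = parts[parts.Length - 1];

        return string.Join("|", statisticId, fileId, objectId, status);
    }
}
```

Skip the header line before calling this on the data rows, otherwise the header's last tier name (for example Tier2ObjectId) ends up in the ObjectId position. This only holds as long as no other columns are ever added between the tiers and Status.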

Source: https://habr.com/ru/post/1469025/

