The existing answers work well with the sample input:
- Wiktor Stribiżew's helpful answer, which uses a regular expression to identify double-quoted fields that contain no , characters, first loads the entire input file into memory, which allows replacing the input file with the results in a single pipeline. While this is convenient (and faster than line-by-line processing), the caveat is that it may not be an option for large input files.
- markg's useful answer, which splits the lines into fields by ", characters, is an alternative for large input files, since it uses the pipeline to process the input lines one at a time. (As a result, the input file cannot be directly updated with the results in the same pipeline.)
If we generalize the OP's requirement to also cover fields with embedded " characters, a different approach is needed:
That is, the following fields should retain their enclosing double quotes:
- (optionally) double-quoted fields with embedded , characters; e.g., "1234 Main St, New York, NY"
- (optionally) double-quoted fields with embedded " characters, which must be escaped as "" per RFC 4180, i.e., doubled; e.g., "Nat ""King"" Cole" (see the sketch after this list)
Note:
- Fields that may contain embedded line breaks are not handled, since they would require a fundamentally different approach: line-by-line processing would no longer be possible.
- A tip of the hat to Wiktor Stribiżew, who came up with a regular expression that robustly matches a double-quoted field containing any number of embedded double quotes escaped as "" : "([^"]*(?:""[^"]*)*)" (demonstrated in isolation below)
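To see that regex at work in isolation, here is a minimal demonstration (the sample line is one of the test lines used below):

```powershell
# Extract the content of each double-quoted field on a sample line;
# note that embedded quotes remain doubled in the captured content.
$line = 'nat2,"Nat ""King"" Cole Lane, NY","cool singer"'
[regex]::Matches($line, '"([^"]*(?:""[^"]*)*)"') |
  ForEach-Object { $_.Groups[1].Value }
# Output:
#   Nat ""King"" Cole Lane, NY
#   cool singer
```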
```powershell
# Create a sample CSV file with double-quoted fields that contain
# just ',', just embedded double quotes ('""'), and both.
@'
bob,"1234 Main St, New York, NY","cool guy"
nat,"Nat ""King"" Cole Lane","cool singer"
nat2,"Nat ""King"" Cole Lane, NY","cool singer"
'@ | Set-Content ./test.csv

Get-Content ./test.csv | ForEach-Object {
  # Match all double-quoted fields on the line, and replace those that
  # contain neither commas nor embedded double quotes with just their content,
  # i.e., with the enclosing double quotes removed.
  ([regex] '"([^"]*(?:""[^"]*)*)"').Replace($_, {
    param($match)
    $fieldContent = $match.Groups[1]
    if ($fieldContent -match '[,"]') { $match } else { $fieldContent }
  })
}
```
This gives:
bob,"1234 Main St, New York, NY",cool guy nat,"Nat ""King"" Cole Lane",cool singer nat2,"Nat ""King"" Cole Lane, NY",cool singer
Updating the input file:
As in markg's answer, due to the line-by-line processing, you cannot directly update the input file with the output in the same pipeline.
To update the input file afterward, write the results to a temporary output file first and then replace the input file with it ( ... represents the Get-Content pipeline from above, except with $csvFile in place of ./test.csv ):
```powershell
$csvFile = 'c:\path\to\some.csv'
$tmpFile = "$env:TEMP\tmp.$PID.csv"
... | Set-Content $tmpFile
if ($?) { Move-Item -Force $tmpFile $csvFile }
```
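Put together, the complete update then looks like this (a sketch only, splicing in the quote-stripping pipeline from above; the input path is a placeholder):

```powershell
$csvFile = 'c:\path\to\some.csv'   # placeholder input path
$tmpFile = "$env:TEMP\tmp.$PID.csv"

Get-Content $csvFile | ForEach-Object {
  # Strip the enclosing double quotes from fields that need no quoting.
  ([regex] '"([^"]*(?:""[^"]*)*)"').Replace($_, {
    param($match)
    $fieldContent = $match.Groups[1]
    if ($fieldContent -match '[,"]') { $match } else { $fieldContent }
  })
} | Set-Content $tmpFile

# Replace the original file only if the pipeline succeeded.
if ($?) { Move-Item -Force $tmpFile $csvFile }
```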
Note that Set-Content uses the system's single-byte, extended-ASCII character encoding by default (even though the help topic falsely claims it is ASCII).
You can specify a different encoding with the -Encoding parameter, but note that UTF-16LE, which Out-File / > use by default, causes the CSV file not to be recognized properly by Excel, for example.
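For instance, to request UTF-8 explicitly (a minimal sketch with a made-up file name; in Windows PowerShell, -Encoding utf8 writes a UTF-8 BOM, which helps Excel recognize the encoding, whereas PowerShell (Core) v6+ defaults to BOM-less UTF-8 anyway):

```powershell
# Minimal demonstration: write a line with explicit UTF-8 encoding.
'bob,"1234 Main St, New York, NY",cool guy' |
  Set-Content -Encoding utf8 ./utf8-test.csv
```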