How to split a text file using PowerShell?

I need to split a large (500 MB) text file (a log4net exception log) into manageable chunks, e.g. 100 files of 5 MB each would be fine.

I would have thought this would be a walk in the park for PowerShell. How can I do this?

+42
powershell
Jun 16 '09 at 14:15
14 answers

This is a fairly simple task for PowerShell, complicated by the fact that the standard Get-Content cmdlet handles very large files poorly. What I would suggest is to use the .NET StreamReader class to read the file line by line in your PowerShell script, and use the Add-Content cmdlet to write each line to a file with an ever-increasing index in the filename. Something like this:

    $reader = New-Object System.IO.StreamReader("C:\Exceptions.log")
    $count = 1
    $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
    while (($line = $reader.ReadLine()) -ne $null) {
        Add-Content -Path $fileName -Value $line
        if ((Get-ChildItem -Path $fileName).Length -ge $upperBound) {
            ++$count
            $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
        }
    }
    $reader.Close()
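The snippet assumes $rootName, $ext, and $upperBound are defined beforehand; a minimal setup, with placeholder values of my own, might look like this:

    # hypothetical values -- adjust to suit
    $rootName = "C:\Logs\Exceptions_"   # prefix for the output files
    $ext = "log"                        # output extension (the format string adds the dot)
    $upperBound = 5MB                   # roll over to a new file at this size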
+33
Jun 16 '09 at 17:00

A word of warning about some of the existing answers: they will run very slowly for very large files. For a 1.6 GB log file I gave up after a couple of hours, realizing it would not finish before I returned to work the next day.

Two issues: first, the Add-Content call opens, seeks to the end of, and then closes the current destination file for every line in the source file. Reading a little of the source file at a time and searching for the newlines will also slow things down, but my guess is that Add-Content is the main culprit.
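A rough way to see that overhead for yourself (the paths here are placeholders of my own, not from the answer):

    # time per-line Add-Content against a single StreamWriter held open
    Measure-Command {
        Get-Content C:\temp\sample.log | ForEach-Object { Add-Content C:\temp\out1.log $_ }
    }
    Measure-Command {
        $writer = New-Object System.IO.StreamWriter("C:\temp\out2.log")
        Get-Content C:\temp\sample.log | ForEach-Object { $writer.WriteLine($_) }
        $writer.Close()
    }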

The following variant produces slightly less pleasant output: it will split files in the middle of lines. But it chops up my 1.6 GB log in under a minute:

    $from = "C:\temp\large_log.txt"
    $rootName = "C:\temp\large_log_chunk"
    $ext = "txt"
    $upperBound = 100MB

    $fromFile = [io.file]::OpenRead($from)
    $buff = new-object byte[] $upperBound
    $count = $idx = 0
    try {
        do {
            "Reading $upperBound"
            $count = $fromFile.Read($buff, 0, $buff.Length)
            if ($count -gt 0) {
                $to = "{0}.{1}.{2}" -f ($rootName, $idx, $ext)
                $toFile = [io.file]::OpenWrite($to)
                try {
                    "Writing $count to $to"
                    $toFile.Write($buff, 0, $count)
                } finally {
                    $toFile.Close()
                }
            }
            $idx++
        } while ($count -gt 0)
    } finally {
        $fromFile.Close()
    }
+30
Jun 13 '18

A simple one-liner to split based on the number of lines (100 in this case):

 $i=0; Get-Content .....log -ReadCount 100 | %{$i++; $_ | Out-File out_$i.txt} 
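A small variation of my own pads the counter, so the output files sort correctly past out_9.txt (.\input.log stands in for the elided path above):

    $i=0; Get-Content .\input.log -ReadCount 100 | %{ $i++; $_ | Out-File ("out_{0:D4}.txt" -f $i) }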
+21
Apr 14 '14 at 13:22

Same as all the answers here, but using StreamReader/StreamWriter to split on newlines (line by line, instead of trying to read the whole file into memory at once). This is the fastest way I know of to split large files.

Note: I do very little error checking, so I can't guarantee this will work smoothly for your case. It did for mine (a 1.7 GB TXT file of 4 million lines, split at 100,000 lines per file, in 95 seconds).

    # split test
    $sw = New-Object System.Diagnostics.Stopwatch
    $sw.Start()

    $filename = "C:\Users\Vincent\Desktop\test.txt"
    $rootName = "C:\Users\Vincent\Desktop\result"
    $ext = "txt"            # no leading dot; the name format below adds it
    $linesperFile = 100000  # 100k
    $filecount = 1
    $reader = $null

    try {
        $reader = [io.file]::OpenText($filename)
        try {
            "Creating file number $filecount"
            $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName, $filecount.ToString("000"), $ext))
            $filecount++
            $linecount = 0

            while ($reader.EndOfStream -ne $true) {
                "Reading $linesperFile"
                while (($linecount -lt $linesperFile) -and ($reader.EndOfStream -ne $true)) {
                    $writer.WriteLine($reader.ReadLine())
                    $linecount++
                }

                if ($reader.EndOfStream -ne $true) {
                    "Closing file"
                    $writer.Dispose()

                    "Creating file number $filecount"
                    $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName, $filecount.ToString("000"), $ext))
                    $filecount++
                    $linecount = 0
                }
            }
        } finally {
            $writer.Dispose()
        }
    } finally {
        $reader.Dispose()
    }

    $sw.Stop()
    Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"

Output from splitting a 1.7 GB file:

    ...
    Creating file number 45
    Reading 100000
    Closing file
    Creating file number 46
    Reading 100000
    Closing file
    Creating file number 47
    Reading 100000
    Closing file
    Creating file number 48
    Reading 100000
    Split complete in 95.6308289 seconds
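As a sanity check (my own addition, reusing the paths from the script above), the pieces can be stitched back together and compared with the source:

    # concatenate the parts in order; the zero-padded names keep the sort correct
    Get-ChildItem "C:\Users\Vincent\Desktop\result*.txt" | Sort-Object Name |
        Get-Content | Set-Content "C:\Users\Vincent\Desktop\rejoined.txt"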
+19
Feb 10 '15 at 13:13

I often need to do the same thing. The trick is getting the header repeated into each of the split chunks. I wrote the following cmdlet (PowerShell v2 CTP 3), and it does the trick.

    ##############################################################################
    #.SYNOPSIS
    # Breaks a text file into multiple text files in a destination, where each
    # file contains a maximum number of lines.
    #
    #.DESCRIPTION
    # When working with files that have a header, it is often desirable to have
    # the header information repeated in all of the split files. Split-File
    # supports this functionality with the -rc (RepeatCount) parameter.
    #
    #.PARAMETER Path
    # Specifies the path to an item. Wildcards are permitted.
    #
    #.PARAMETER LiteralPath
    # Specifies the path to an item. Unlike Path, the value of LiteralPath is
    # used exactly as it is typed. No characters are interpreted as wildcards.
    # If the path includes escape characters, enclose it in single quotation marks.
    # Single quotation marks tell Windows PowerShell not to interpret any
    # characters as escape sequences.
    #
    #.PARAMETER Destination
    # (Or -d) The location in which to place the chunked output files.
    #
    #.PARAMETER Count
    # (Or -c) The maximum number of lines in each file.
    #
    #.PARAMETER RepeatCount
    # (Or -rc) Specifies the number of "header" lines from the input file that will
    # be repeated in each output file. Typically this is 0 or 1 but it can be any
    # number of lines.
    #
    #.EXAMPLE
    # Split-File bigfile.csv 3000 -rc 1
    #
    #.LINK
    # Out-TempFile
    ##############################################################################
    function Split-File {

        [CmdletBinding(DefaultParameterSetName='Path')]
        param(
            [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true,
                ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
            [String[]]$Path,

            [Alias("PSPath")]
            [Parameter(ParameterSetName='LiteralPath', Mandatory=$true,
                ValueFromPipelineByPropertyName=$true)]
            [String[]]$LiteralPath,

            [Alias('c')]
            [Parameter(Position=2, Mandatory=$true)]
            [Int32]$Count,

            [Alias('d')]
            [Parameter(Position=3)]
            [String]$Destination='.',

            [Alias('rc')]
            [Parameter()]
            [Int32]$RepeatCount
        )

        process {
            # yeah! the cmdlet supports wildcards
            if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
            elseif ($Path) { $ResolveArgs = @{Path=$Path} }

            Resolve-Path @ResolveArgs | %{
                $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
                $InputExt  = [IO.Path]::GetExtension($_)

                if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

                # get the input file in manageable chunks
                $Part = 1
                Get-Content $_ -ReadCount:$Count | %{
                    # make an output filename with a suffix
                    $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName, $Part, $InputExt))

                    # In the first iteration the header will be
                    # copied to the output file as usual
                    # on subsequent iterations we have to do it
                    if ($RepeatCount -and $Part -gt 1) {
                        Set-Content $OutputFile $Header
                    }

                    # write this chunk to the output file
                    Write-Host "Writing $OutputFile"
                    Add-Content $OutputFile $_
                    $Part += 1
                }
            }
        }
    }
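Hypothetical usage, splitting a CSV into 3000-line pieces and carrying the header row into each piece (note that the destination folder must already exist, since Join-Path does not create it):

    # paths are placeholders of my own
    Split-File .\bigfile.csv 3000 -d .\chunks -rc 1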
+15
Jun 16 '09 at 20:47

I found this question while trying to split a single vCard VCF file with multiple contacts into separate files. Here's what I did, based on Lee's code. I had to look up how to create a new StreamReader object, and I changed null to $null.

    $reader = New-Object System.IO.StreamReader("C:\Contacts.vcf")
    $count = 1
    $fileName = "C:\Contacts\{0}.vcf" -f ($count)
    while (($line = $reader.ReadLine()) -ne $null) {
        Add-Content -Path $fileName -Value $line
        if ($line -eq "END:VCARD") {
            ++$count
            $fileName = "C:\Contacts\{0}.vcf" -f ($count)
        }
    }
    $reader.Close()
+14
Apr 15 '10 at 14:26

Many of these answers were too slow for my source files, which were SQL files between 10 MB and 800 MB that needed to be split into files with roughly equal numbers of lines.

I found some of the earlier answers that use Add-Content to be pretty slow; waiting many hours for a split to finish was not uncommon.

I didn't try Typhlosaurus's answer, but it looks like it only splits by file size, not by number of lines.

The following suited my purposes.

    $sw = new-object System.Diagnostics.Stopwatch
    $sw.Start()

    Write-Host "Reading source file..."
    $lines = [System.IO.File]::ReadAllLines("C:\Temp\SplitTest\source.sql")
    $totalLines = $lines.Length
    Write-Host "Total Lines :" $totalLines

    $skip = 0
    $count = 100000  # Number of lines per file

    # File counter, with sort friendly name
    $fileNumber = 1
    $fileNumberString = $fileNumber.ToString("000")

    while ($skip -le $totalLines) {
        $upper = $skip + $count - 1
        if ($upper -gt ($lines.Length - 1)) {
            $upper = $lines.Length - 1
        }

        # Write the lines
        [System.IO.File]::WriteAllLines("C:\Temp\SplitTest\result$fileNumberString.txt", $lines[($skip..$upper)])

        # Increment counters
        $skip += $count
        $fileNumber++
        $fileNumberString = $fileNumber.ToString("000")
    }

    $sw.Stop()
    Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"

For a 54 MB file, I get this output:

    Reading source file...
    Total Lines : 910030
    Split complete in 1.7056578 seconds

Hopefully others looking for a simple line-based splitting script will find this useful.
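One caveat: ReadAllLines holds the whole file in memory, which is fine at 54 MB but may pinch at the top of the 800 MB range. A streaming sketch of the same idea (my own, reusing the answer's paths; [System.IO.File]::ReadLines needs .NET 4 or later):

    # stream the source instead of loading it all at once
    $count = 100000
    $fileNumber = 1
    $writer = $null
    $i = 0
    foreach ($line in [System.IO.File]::ReadLines("C:\Temp\SplitTest\source.sql")) {
        if ($i % $count -eq 0) {
            if ($writer) { $writer.Dispose() }
            $name = "C:\Temp\SplitTest\result{0:000}.txt" -f $fileNumber
            $writer = [System.IO.File]::CreateText($name)
            $fileNumber++
        }
        $writer.WriteLine($line)
        $i++
    }
    if ($writer) { $writer.Dispose() }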

+5
Dec 08 '14 at 17:44

There's also this quick (and somewhat dirty) one-liner:

    $linecount = 0; $i = 0; Get-Content .\BIG_LOG_FILE.txt | %{ Add-Content OUT$i.log "$_"; $linecount++; if ($linecount -eq 3000) { $i++; $linecount = 0 } }

You can adjust the number of lines per batch by changing the hard-coded value of 3000.

+3
Feb 18 '13 at 2:53

I made a small modification of the cmdlet above to split the files based on the size of each piece.

    ##############################################################################
    #.SYNOPSIS
    # Breaks a text file into multiple text files in a destination, where each
    # file has a maximum size.
    #
    #.DESCRIPTION
    # When working with files that have a header, it is often desirable to have
    # the header information repeated in all of the split files. Split-File
    # supports this functionality with the -rc (RepeatCount) parameter.
    #
    #.PARAMETER Path
    # Specifies the path to an item. Wildcards are permitted.
    #
    #.PARAMETER LiteralPath
    # Specifies the path to an item. Unlike Path, the value of LiteralPath is
    # used exactly as it is typed. No characters are interpreted as wildcards.
    # If the path includes escape characters, enclose it in single quotation marks.
    # Single quotation marks tell Windows PowerShell not to interpret any
    # characters as escape sequences.
    #
    #.PARAMETER Destination
    # (Or -d) The location in which to place the chunked output files.
    #
    #.PARAMETER Size
    # (Or -s) The maximum size of each file. Size must be expressed in MB.
    #
    #.PARAMETER RepeatCount
    # (Or -rc) Specifies the number of "header" lines from the input file that will
    # be repeated in each output file. Typically this is 0 or 1 but it can be any
    # number of lines.
    #
    #.EXAMPLE
    # Split-File bigfile.csv -s 20 -rc 1
    #
    #.LINK
    # Out-TempFile
    ##############################################################################
    function Split-File {

        [CmdletBinding(DefaultParameterSetName='Path')]
        param(
            [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true,
                ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
            [String[]]$Path,

            [Alias("PSPath")]
            [Parameter(ParameterSetName='LiteralPath', Mandatory=$true,
                ValueFromPipelineByPropertyName=$true)]
            [String[]]$LiteralPath,

            [Alias('s')]
            [Parameter(Position=2, Mandatory=$true)]
            [Int32]$Size,

            [Alias('d')]
            [Parameter(Position=3)]
            [String]$Destination='.',

            [Alias('rc')]
            [Parameter()]
            [Int32]$RepeatCount
        )

        process {
            # the cmdlet supports wildcards
            if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
            elseif ($Path) { $ResolveArgs = @{Path=$Path} }

            Resolve-Path @ResolveArgs | %{
                $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
                $InputExt  = [IO.Path]::GetExtension($_)

                if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

                # read the input line by line, buffer until the buffer passes the
                # requested size, then dump the buffer to the next part file
                $Part = 1
                $buffer = ""
                Get-Content $_ -ReadCount:1 | %{
                    $buffer += $_ + "`r"
                    if ($buffer.Length -gt ($Size * 1MB)) {
                        $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName, $Part, $InputExt))
                        if ($RepeatCount -and $Part -gt 1) { Set-Content $OutputFile $Header }
                        Write-Host "Writing $OutputFile"
                        Add-Content $OutputFile $buffer
                        $Part += 1
                        $buffer = ""
                    }
                }

                # flush whatever is left after the last full chunk, so the tail is not lost
                if ($buffer) {
                    $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName, $Part, $InputExt))
                    if ($RepeatCount -and $Part -gt 1) { Set-Content $OutputFile $Header }
                    Write-Host "Writing $OutputFile"
                    Add-Content $OutputFile $buffer
                }
            }
        }
    }
+2
01 Oct '09 at 17:22

Do it like this, selecting successive 5000-line slices of the source file:

FILE 1

 Get-Content C:\TEMP\DATA\split\splitme.txt | Select -First 5000 | out-File C:\temp\file1.txt -Encoding ASCII 

FILE 2

 Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 5000 | Select -First 5000 | out-File C:\temp\file2.txt -Encoding ASCII 

FILE 3

 Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 10000 | Select -First 5000 | out-File C:\temp\file3.txt -Encoding ASCII 

etc...
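Rather than writing each slice out by hand, a loop (a sketch of my own, reusing the same paths and 5000-line chunk) generalizes the pattern:

    # note: like the snippets above, this re-reads the source file once per chunk
    $chunk = 5000
    $i = 0
    do {
        $i++
        $part = Get-Content C:\TEMP\DATA\split\splitme.txt |
            Select -Skip (($i - 1) * $chunk) | Select -First $chunk
        if ($part) {
            $part | Out-File "C:\temp\file$i.txt" -Encoding ASCII
        }
    } while ($part)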

+2
Aug 02

My requirements were a little different. I often work with comma-delimited and tab-delimited ASCII files where a single line is a single record of data, and they are really big, so I need to split them into manageable parts (while preserving the header row).

So I fell back to my classic VBScript approach and put together a small .vbs script that can be run on any Windows computer (it gets executed automatically by the WScript.exe script host on Windows).

The benefit of this method is that it uses text streams, so the underlying data isn't loaded into memory (or at least not all at once). As a result it runs exceptionally fast and doesn't need much memory. The test file I just split using this script on my i7 was about 1 GB in size, had about 12 million lines of text, and was split into 25 part files (each with about 500k lines). Processing took about 2 minutes, and it never went over 3 MB of memory used at any point.

The caveat is that it relies on the text file consisting of "lines" (meaning each record is delimited with a CRLF), since the TextStream object uses the ReadLine function to process a single line at a time. But hey, if you're working with TSV or CSV files, it's perfect.

    Option Explicit

    Private Const INPUT_TEXT_FILE = "c:\bigtextfile.txt"
    Private Const REPEAT_HEADER_ROW = True
    Private Const LINES_PER_PART = 500000

    Dim oFileSystem, oInputFile, oOutputFile, iOutputFile, iLineCounter, sHeaderLine, sLine, sFileExt, sStart

    sStart = Now()
    sFileExt = Right(INPUT_TEXT_FILE, Len(INPUT_TEXT_FILE) - InstrRev(INPUT_TEXT_FILE, ".") + 1)
    iLineCounter = 0
    iOutputFile = 1

    Set oFileSystem = CreateObject("Scripting.FileSystemObject")
    Set oInputFile = oFileSystem.OpenTextFile(INPUT_TEXT_FILE, 1, False)
    Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)

    If REPEAT_HEADER_ROW Then
        iLineCounter = 1
        sHeaderLine = oInputFile.ReadLine()
        Call oOutputFile.WriteLine(sHeaderLine)
    End If

    Do While Not oInputFile.AtEndOfStream
        sLine = oInputFile.ReadLine()
        Call oOutputFile.WriteLine(sLine)
        iLineCounter = iLineCounter + 1
        If iLineCounter Mod LINES_PER_PART = 0 Then
            iOutputFile = iOutputFile + 1
            Call oOutputFile.Close()
            Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
            If REPEAT_HEADER_ROW Then
                Call oOutputFile.WriteLine(sHeaderLine)
            End If
        End If
    Loop

    Call oInputFile.Close()
    Call oOutputFile.Close()
    Set oFileSystem = Nothing

    Call MsgBox("Done" & vbCrLf & "Lines Processed:" & iLineCounter & vbCrLf & "Part Files: " & iOutputFile & vbCrLf & "Start Time: " & sStart & vbCrLf & "Finish Time: " & Now())
0
Oct 27 '15 at 18:20

Sounds like a job for the UNIX split command:

 split MyBigFile.csv 

It just split my 55 GB CSV file into 21k pieces in less than 10 minutes.

It's not native to PowerShell, but it ships with, for example, the Git for Windows package: https://git-scm.com/download/win
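By default split emits 1000-line pieces named xaa, xab, and so on; its standard flags give more control (GNU split shown here, though BSD split supports -l and -b too):

    # split by line count: 100,000 lines per piece, prefix "chunk_"
    split -l 100000 MyBigFile.csv chunk_

    # or split by size: roughly 100 MB per piece
    split -b 100m MyBigFile.csv chunk_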

0
Sep 21 '16 at 18:11

Since line length can vary in logs, I thought it best to go with a number of lines per file. The following code snippet processed a 4-million-line log file in under 19 seconds (18.83 seconds), splitting it into 500,000-line chunks:

    $sourceFile = "c:\myfolder\mylargeTextyFile.csv"
    $partNumber = 1
    $batchSize = 500000
    $pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"

    [System.Text.Encoding]$enc = [System.Text.Encoding]::GetEncoding(65001)  # utf8 this one
    $fs = New-Object System.IO.FileStream($sourceFile, "OpenOrCreate", "Read", "ReadWrite", 8, "None")
    $streamIn = New-Object System.IO.StreamReader($fs, $enc)
    $streamOut = New-Object System.IO.StreamWriter($pathAndFilename)

    $line = $streamIn.ReadLine()
    $counter = 0
    while ($line -ne $null) {
        $streamOut.WriteLine($line)
        $counter += 1
        if ($counter -eq $batchSize) {
            $partNumber += 1
            $counter = 0
            $streamOut.Close()
            $pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"
            $streamOut = New-Object System.IO.StreamWriter($pathAndFilename)
        }
        $line = $streamIn.ReadLine()
    }
    $streamIn.Close()
    $streamOut.Close()

This can easily be turned into a function or a script file with parameters to make it more versatile. It uses a StreamReader and a StreamWriter to achieve its speed and small memory footprint.
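A sketch of that parameterization (the function and parameter names are my own, not from the answer):

    function Split-TextFile {
        # hypothetical wrapper around the same StreamReader/StreamWriter loop
        param(
            [Parameter(Mandatory = $true)][string]$SourceFile,
            [Parameter(Mandatory = $true)][string]$DestinationTemplate,  # e.g. "c:\out\part{0}.csv"
            [int]$BatchSize = 500000
        )
        $reader = New-Object System.IO.StreamReader($SourceFile)
        $part = 1
        $writer = New-Object System.IO.StreamWriter(($DestinationTemplate -f $part))
        $counter = 0
        while (($line = $reader.ReadLine()) -ne $null) {
            $writer.WriteLine($line)
            $counter++
            if ($counter -eq $BatchSize) {
                $writer.Close()
                $part++
                $writer = New-Object System.IO.StreamWriter(($DestinationTemplate -f $part))
                $counter = 0
            }
        }
        $reader.Close()
        $writer.Close()
    }

    # usage, with hypothetical paths:
    Split-TextFile -SourceFile "c:\myfolder\mylargeTextyFile.csv" -DestinationTemplate "c:\myfolder\part{0}.csv"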

0
Sep 23 '16 at 14:17

Here is my solution for splitting the patch6.txt file (about 32,000 lines) into separate files with 1,000 lines each. It is not fast, but it does the job.

    $infile = "D:\Malcolm\Test\patch6.txt"
    $path = "D:\Malcolm\Test\"
    $lineCount = 0
    $fileCount = 1

    foreach ($computername in Get-Content $infile) {
        # note: the original wrote $path_$fileCount".txt", which PowerShell parses as an
        # (undefined) variable named $path_; braces keep the two variable names intact
        Write-Output $computername | Out-File -Append "${path}_${fileCount}.txt"
        $lineCount++
        if ($lineCount -eq 1000) {
            $fileCount++
            $lineCount = 0   # counting from 0 so each part really gets 1,000 lines
        }
    }
0
Nov 09 '17 at 23:11


