I have hundreds of large (6 GB) gzipped log files that I read using GZIPInputStream and that I want to parse. Suppose each of them has the format:
    Start of log entry 1
        ...some log details
        ...some log details
        ...some log details
    Start of log entry 2
        ...some log details
        ...some log details
        ...some log details
    Start of log entry 3
        ...some log details
        ...some log details
        ...some log details
I stream the contents of the gzipped file line by line through BufferedReader.lines(). The stream looks like this:
    [
        "Start of log entry 1",
        " ...some log details",
        " ...some log details",
        " ...some log details",
        "Start of log entry 2",
        " ...some log details",
        " ...some log details",
        " ...some log details",
        "Start of log entry 3",
        " ...some log details",
        " ...some log details",
        " ...some log details",
    ]
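For context, the reading step could be sketched roughly as follows. This is only an illustration under assumptions: the class name `GzipLines`, the helper `readLines`, and the UTF-8 charset are hypothetical and not taken from my real code. The `main` method writes a tiny gzipped file so the sketch can run on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipLines {
    /** Hypothetical helper: opens a gzipped text file as a lazy stream of lines. */
    static Stream<String> readLines(Path gzFile) throws IOException {
        InputStream in = Files.newInputStream(gzFile);
        try {
            // Assumption: the logs are UTF-8 encoded.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new GZIPInputStream(in), StandardCharsets.UTF_8));
            // Closing the returned stream closes the reader and the underlying file.
            return reader.lines().onClose(() -> {
                try {
                    reader.close();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException | RuntimeException e) {
            in.close(); // do not leak the file handle if the gzip header is invalid
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Create a small gzipped sample file to read back.
        Path tmp = Files.createTempFile("sample", ".log.gz");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(tmp))) {
            out.write("Start of log entry 1\n ...some log details\n"
                    .getBytes(StandardCharsets.UTF_8));
        }
        try (Stream<String> lines = readLines(tmp)) {
            lines.forEach(System.out::println);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because the lines are decompressed on demand, only the reader's buffer is held in memory at any time, not the whole file.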
The beginning of each log entry can be determined by the predicate: line -> line.startsWith("Start of log entry") . I would like to convert this Stream<String> to Stream<Stream<String>> according to this predicate. Each "substream" should begin when the predicate is true, and collect lines while the predicate is false, until the next time the predicate is true, which means the end of this substream and the beginning of the next. The result will look like this:
    [
        [
            "Start of log entry 1",
            " ...some log details",
            " ...some log details",
            " ...some log details",
        ],
        [
            "Start of log entry 2",
            " ...some log details",
            " ...some log details",
            " ...some log details",
        ],
        [
            "Start of log entry 3",
            " ...some log details",
            " ...some log details",
            " ...some log details",
        ],
    ]
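To make the desired behaviour concrete, here is a minimal sketch of one possible way to do the grouping lazily. It is not a known-good answer, just an illustration: `EntrySplitter` and `splitByPredicate` are hypothetical names, and it yields a `Stream<List<String>>` rather than a `Stream<Stream<String>>`, on the assumption that a single log entry (unlike a whole file) fits comfortably in memory. Any lines that appear before the first start line end up in a group of their own.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Predicate;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class EntrySplitter {
    /**
     * Hypothetical helper: lazily groups a sequential stream of lines into one
     * list per entry. A new group starts whenever isStart matches a line.
     */
    static Stream<List<String>> splitByPredicate(Stream<String> lines,
                                                 Predicate<String> isStart) {
        Iterator<String> source = lines.iterator();
        Iterator<List<String>> groups = new Iterator<List<String>>() {
            // Start line of the next group, read ahead while finishing the previous one.
            private String pending;

            @Override
            public boolean hasNext() {
                return pending != null || source.hasNext();
            }

            @Override
            public List<String> next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                List<String> group = new ArrayList<>();
                if (pending != null) {
                    group.add(pending);
                    pending = null;
                } else {
                    group.add(source.next());
                }
                // Collect lines until the next start line (or the end of input).
                while (source.hasNext()) {
                    String line = source.next();
                    if (isStart.test(line)) {
                        pending = line;
                        break;
                    }
                    group.add(line);
                }
                return group;
            }
        };
        return StreamSupport.stream(
                        Spliterators.spliteratorUnknownSize(
                                groups, Spliterator.ORDERED | Spliterator.NONNULL),
                        false) // sequential: one pass per file
                .onClose(lines::close);
    }

    public static void main(String[] args) {
        Stream<String> lines = Stream.of(
                "Start of log entry 1", " ...some log details",
                "Start of log entry 2", " ...some log details");
        splitByPredicate(lines, l -> l.startsWith("Start of log entry"))
                .forEach(System.out::println);
    }
}
```

Only one group is buffered at a time, so memory use is bounded by the largest single entry. Each group can be turned back into a stream with `group.stream()` if a `Stream<String>` is really needed downstream.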
From there, I can take each substream and map it using new LogEntry(Stream<String> logLines) to aggregate the associated log lines into LogEntry objects.
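For illustration only, a `LogEntry` with that constructor might look something like the sketch below. The fields `header` and `details` and their accessors are assumptions made up for this example; the real class may aggregate the lines quite differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LogEntry {
    private final String header;        // hypothetical: the "Start of log entry" line
    private final List<String> details; // hypothetical: the lines that follow it

    LogEntry(Stream<String> logLines) {
        // Consumes the substream; a single entry is assumed to be small.
        List<String> all = logLines.collect(Collectors.toList());
        if (all.isEmpty()) {
            throw new IllegalArgumentException("a log entry needs at least one line");
        }
        this.header = all.get(0);
        this.details = Collections.unmodifiableList(
                new ArrayList<>(all.subList(1, all.size())));
    }

    String header() {
        return header;
    }

    List<String> details() {
        return details;
    }

    @Override
    public String toString() {
        return "LogEntry{header='" + header + "', detailLines=" + details.size() + "}";
    }

    public static void main(String[] args) {
        LogEntry entry = new LogEntry(Stream.of(
                "Start of log entry 1", " ...some log details", " ...some log details"));
        System.out.println(entry);
    }
}
```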
Here's a rough idea of how it would look:
    import java.io.*;
    import java.nio.charset.*;
    import java.util.*;
    import java.util.function.*;
    import java.util.stream.*;

    import static java.lang.System.out;

    class Untitled {
        static final String input =
            "Start of log entry 1\n" +
            " ...some log details\n" +
            " ...some log details\n" +
            " ...some log details\n" +
            "Start of log entry 2\n" +
            " ...some log details\n" +
            " ...some log details\n" +
            " ...some log details\n" +
            "Start of log entry 3\n" +
            " ...some log details\n" +
            " ...some log details\n" +
            " ...some log details";

        static final Predicate<String> isLogEntryStart =
            line -> line.startsWith("Start of log entry");

        public static void main(String[] args) throws Exception {
            // The byte array stands in for the real GZIPInputStream over a log file.
            try (ByteArrayInputStream gzipInputStream =
                     new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8));
                 BufferedReader reader = new BufferedReader(
                     new InputStreamReader(gzipInputStream, StandardCharsets.UTF_8))) {
                // Placeholder: this is where the lines should be split by isLogEntryStart
                // and each group mapped to a LogEntry. For now the lines are printed as-is.
                reader.lines().forEach(out::println);
            }
        }
    }
Constraint: I have hundreds of these large files to process in parallel (but only a single sequential stream per file), so loading any of them entirely into memory (for example, by storing them as a List<String> lines) is not feasible.
Any help appreciated!