Reading and writing in chunks on Linux using C

I have an ASCII file where each line contains a variable-length record. For instance:

 Record-1: 15 characters
 Record-2: 200 characters
 Record-3: 500 characters
 ...
 Record-n: X characters

Since the file size is about 10 GB, I would like to read the records in chunks. After reading, I need to convert them and write them to another file in binary format.

So, for reading, my first reaction was to create a char array, such as

 FILE *stream;
 char buffer[104857600]; // 100 MB char array
 fread(buffer, 1, sizeof(buffer), stream);
  • Is it correct to assume that Linux issues one system call and retrieves all 100 MB?
  • When records are separated by a newline, I search character by character for a newline in the buffer and reconstruct each record (a sketch of this scan follows below).
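
Roughly, the scan I have in mind looks like this (handle_record is a placeholder; nread is what fread returned), with the partial record at the end of the chunk carried over to the next read:

 #include <string.h>

 /* scan the nread bytes just read into buffer */
 size_t start = 0, i;
 for (i = 0; i < nread; i++) {
     if (buffer[i] == '\n') {
         handle_record(buffer + start, i - start); /* placeholder handler */
         start = i + 1;
     }
 }
 /* a record cut off at the chunk boundary: move it to the front
    so the next fread can complete it */
 size_t leftover = nread - start;
 memmove(buffer, buffer + start, leftover);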

My question is: should I read in chunks, or is there a better alternative for reading data in chunks and reconstructing each record? Is there an alternative way to read X variable-sized lines from an ASCII file in one call?

Further, for writing, I do the same. I have a char write buffer that I pass to fwrite to write an entire set of records in one call.

 fwrite(buffer, 1, sizeof(buffer), stream);

UPDATE: if I setbuf(stream, buffer), where buffer is my 100 MB char buffer, will fgets be served from the buffer, or will it trigger actual disk I/O?
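
As I understand it, setbuf assumes a buffer of BUFSIZ bytes, so a 100 MB buffer would actually have to be installed with setvbuf, something like:

 #include <stdio.h>
 #include <stdlib.h>

 FILE *stream = fopen("records.txt", "r"); /* placeholder file name */
 size_t bufsize = 104857600;               /* 100 MB */
 char *iobuf = malloc(bufsize);

 /* must be called after fopen, before any other I/O on the stream */
 if (stream && iobuf)
     setvbuf(stream, iobuf, _IOFBF, bufsize);
 /* fgets then refills from iobuf; a read() happens only when it runs dry */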

3 answers
  • Yes, fread will retrieve the whole thing at once (assuming this is a regular file). But it will not read 100 MB if the file itself is not that big, and if you do not check the return value, you have no way of knowing how much data was actually read, or whether there was an error. (See the sketch after this list.)

  • Use fgets (see man fgets) instead of fread. It will find the line breaks for you.

     char linebuf[1000];
     FILE *file = ...;
     while (fgets(linebuf, sizeof(linebuf), file)) {
         // decode one line
     }
  • There is a problem with your code.

     char buffer[104857600]; // too big 

    If you try to allocate a large buffer (and 100 MB is certainly large) on the stack, the allocation will likely fail and your program will crash. If you need a buffer that large, allocate it on the heap with malloc or similar. I would keep stack usage for a single function to tens of kilobytes at most, although you can get away with a few megabytes on most stock Linux systems.
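
Putting those two points together, here is a minimal sketch of a safer read, with the file name as a placeholder: the buffer lives on the heap, and the return value of fread tells you how many bytes are actually valid:

 #include <stdio.h>
 #include <stdlib.h>

 int main(void) {
     FILE *stream = fopen("records.txt", "r"); /* placeholder file name */
     if (!stream) return 1;

     size_t bufsize = 104857600; /* 100 MB, on the heap */
     char *buffer = malloc(bufsize);
     if (!buffer) { fclose(stream); return 1; }

     size_t nread = fread(buffer, 1, bufsize, stream); /* bytes actually read */
     if (nread < bufsize && ferror(stream)) {
         /* read error; the first nread bytes are still valid */
     }
     /* ... scan buffer[0..nread) for newlines and decode the records ... */

     free(buffer);
     fclose(stream);
     return 0;
 }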

Alternatively, you can simply mmap the entire file into memory. This will not improve or degrade performance in most cases, but it is easier to work with.

 #include <fcntl.h>
 #include <stdlib.h>
 #include <sys/mman.h>
 #include <sys/stat.h>
 #include <unistd.h>

 int r, fdes;
 struct stat st;
 void *ptr;
 size_t sz;

 fdes = open(filename, O_RDONLY);
 if (fdes < 0) abort();
 r = fstat(fdes, &st);
 if (r) abort();
 if (st.st_size > (size_t) -1) abort(); // too big to map
 sz = st.st_size;
 ptr = mmap(NULL, sz, PROT_READ, MAP_SHARED, fdes, 0);
 if (ptr == MAP_FAILED) abort();
 close(fdes); // file no longer needed

 // now, ptr has the data, sz has the data length
 // you can use ordinary string functions

The advantage of using mmap is that your program will not run out of memory. On a 64-bit system you can put the entire file in your address space at once (even a 10 GB file), and the system will automatically page in new chunks as your program accesses that memory. Old pages will be automatically discarded and reread if your program needs them again.

This is a very good way to plow through large files.
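
For instance, a minimal sketch of walking the mapped data record by record, reusing ptr and sz from the snippet above (process_record is a placeholder):

 #include <string.h>

 /* walk the mapped file record by record */
 const char *p = ptr, *end = (const char *) ptr + sz;
 while (p < end) {
     const char *nl = memchr(p, '\n', end - p);      /* end of this record */
     size_t reclen = nl ? (size_t) (nl - p) : (size_t) (end - p);
     process_record(p, reclen);                      /* placeholder handler */
     p += reclen + 1;                                /* step past the newline */
 }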


If possible, you may find that mmap-ing the file is easiest. mmap maps (part of) a file into memory, so the file can be accessed essentially as an array of bytes. In your case, you may not be able to map the whole file at once, in which case the code would look something like this:

 #include <stdio.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <unistd.h>
 #include <sys/mman.h>

 /* ... */

 struct stat stat_buf;
 long pagesz = sysconf(_SC_PAGESIZE);
 int fd = fileno(stream);
 off_t line_start = 0;
 char *file_chunk = NULL;
 char *input_line;
 off_t cur_off = 0;
 off_t map_offset = 0;
 /* map 16M plus pagesize to ensure any record <= 16M will always fit in the mapped area */
 size_t map_size = 16*1024*1024 + pagesz;

 fstat(fd, &stat_buf);
 /* limit mapped region to size of file */
 if (map_offset + map_size > stat_buf.st_size) {
     map_size = stat_buf.st_size - map_offset;
 }

 /* map the first chunk of the file */
 file_chunk = mmap(NULL, map_size, PROT_READ, MAP_SHARED, fd, map_offset);
 input_line = file_chunk;

 /* until we reach the end of the file */
 while (cur_off < stat_buf.st_size) {
     /* check if we're about to read outside the current chunk */
     if (!(cur_off - map_offset < map_size)) {
         /* destroy the previous mapping */
         munmap(file_chunk, map_size);
         /* round down to the page before line_start */
         map_offset = (line_start / pagesz) * pagesz;
         /* limit mapped region to size of file */
         if (map_offset + map_size > stat_buf.st_size) {
             map_size = stat_buf.st_size - map_offset;
         }
         /* map the next chunk */
         file_chunk = mmap(NULL, map_size, PROT_READ, MAP_SHARED, fd, map_offset);
         /* adjust the line start for the new mapping */
         input_line = &file_chunk[line_start - map_offset];
     }
     if (file_chunk[cur_off - map_offset] == '\n') {
         /* found a newline, process the current line */
         process_line(input_line, cur_off - line_start);
         /* set up for the next one */
         line_start = cur_off + 1;
         input_line = &file_chunk[line_start - map_offset];
     }
     cur_off++;
 }

Most of the complication is avoiding making the mapping too large. You might be able to map the whole file with

 char *file_data = mmap(NULL, stat_buf.st_size, PROT_READ, MAP_SHARED, fd, 0); 

My suggestion is to use fgets(buff) to detect each new line automatically,

and then use strlen(buff) to track how much has accumulated in the write buffer:

 if ((total + strlen(buff)) > 104857600)

then flush the buffer and start a new chunk.

But a chunk will rarely come out to exactly 104857600 bytes.
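
A rough sketch of that accumulate-and-flush loop, with the file names and the conversion step as placeholders:

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>

 #define CHUNK 104857600 /* 100 MB output buffer */

 FILE *in = fopen("records.txt", "r");   /* placeholder input file */
 FILE *out = fopen("records.bin", "wb"); /* placeholder binary output */
 char *chunk = malloc(CHUNK);
 char buff[1024];                        /* assumes each record is < 1 KB */
 size_t total = 0;

 while (fgets(buff, sizeof(buff), in)) {
     size_t len = strlen(buff);
     if (total + len > CHUNK) {          /* buffer full: flush it */
         fwrite(chunk, 1, total, out);
         total = 0;
     }
     /* convert_line(buff) would translate the record to binary here */
     memcpy(chunk + total, buff, len);
     total += len;
 }
 if (total > 0)
     fwrite(chunk, 1, total, out);       /* flush the final partial chunk */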

CMIIW


Source: https://habr.com/ru/post/915349/

