Java text file size (before closing the file)

I am collecting full HTML from a service that provides access to a very large collection of blogs and news websites. I test the HTML as it arrives (in real time) to see whether it contains any of several keywords. If it contains one of the keywords, I write the HTML to a text file to save it.

I want to run this for a week, so I will collect a large amount of data. A 3-minute test run produced a 100 MB text file. I have 4 TB of space and cannot use more than that.

In addition, I do not want the text files to become too large, because I assume they would become impractical to open.

My plan is to open a text file and write the HTML to it, checking its size frequently. If it grows beyond, say, 200 MB, I close the text file and open another. I also need to keep a running log of how much space I have used in total, so that I can make sure I am not approaching 4 TB.

My question at this point is how to check the size of the text file before it has been closed (using FileWriter.close()). Is there a function for this, or should I count the number of characters written to the file and use that to estimate the file size?

A separate question: are there ways to minimize the amount of space that my text files occupy? I am working in Java.

+4
7 answers

Create a writer that counts the number of characters written and use it to wrap your OutputStreamWriter.

[EDIT] Note: The correct way to save text in a file:

 new BufferedWriter( new OutputStreamWriter( new FileOutputStream( file ), encoding ) );

The encoding is important; it is usually "UTF-8".

This chain gives you two places where you can insert your wrapper: you can wrap the writer to get the number of characters, or the inner OutputStream to get the number of bytes.
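As a rough illustration of the byte-counting option, here is a minimal sketch of a wrapper around the inner OutputStream. The class name ByteCountingStream and its getCount() method are made up for this example; they are not part of the JDK.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical helper: counts every byte passed on to the wrapped stream. */
class ByteCountingStream extends FilterOutputStream {
    private long count = 0;

    ByteCountingStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Forward the whole chunk instead of FilterOutputStream's byte-by-byte loop.
        out.write(b, off, len);
        count += len;
    }

    /** Bytes that have reached this stream so far. */
    long getCount() {
        return count;
    }
}
```

To use it, keep a reference to the counting stream, place it between the FileOutputStream and the OutputStreamWriter, and call getCount() after flushing the writer. Characters still sitting in the writer's buffer have not been counted yet.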

+5

To minimize space, you could zip your text files with Java. Why not add each file to a zip archive after closing it? After zipping, you can check the size of the zip file to see your cumulative disk usage.
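A minimal sketch of the zipping step, assuming one archive per closed text file (the answer does not say whether to use one archive or many). ZipSketch and zipFile are names invented for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class ZipSketch {
    /**
     * Compresses an already closed file into its own zip archive
     * and returns the size of that archive in bytes.
     */
    static long zipFile(Path source, Path zip) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zip))) {
            zos.putNextEntry(new ZipEntry(source.getFileName().toString()));
            Files.copy(source, zos); // stream the file contents into the entry
            zos.closeEntry();
        }
        return Files.size(zip);
    }
}
```

Adding up the returned sizes gives the running total to compare against the disk budget. Whether to delete the original text file afterwards is left to the caller.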

+3

HTML compresses very well. Consider using GZIPOutputStream to “minimize the amount of space” your text files occupy.
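A minimal sketch of writing text through GZIPOutputStream. It writes into a ByteArrayOutputStream so that it stays self-contained; in the real program a FileOutputStream would presumably take its place. GzipSketch and gzip are names invented for this example.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class GzipSketch {
    /** Encodes the text as UTF-8, gzips it, and returns the compressed bytes. */
    static byte[] gzip(String text) throws IOException {
        // Stand-in for a FileOutputStream (assumption made for this sketch).
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(sink), StandardCharsets.UTF_8))) {
            w.write(text);
        } // closing the writer also finishes the gzip stream
        return sink.toByteArray();
    }
}
```

Note that the compressed size is only final once the stream has been closed or finished, so a size check on the file made mid-stream will lag behind the amount of HTML written.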

+3

Following up on Aaron's answer: you can use CountingOutputStream. Just wrap your FileOutputStream in a CountingOutputStream and you can find out how many bytes you have written so far.
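Once a byte count is available, the rollover described in the question could look roughly like the sketch below. To stay self-contained it counts the encoded UTF-8 bytes directly instead of using a CountingOutputStream. The class RollingFileSink, its methods, and the part-N.txt file names are all invented for this example.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: starts a new file before the current one would exceed maxBytes. */
class RollingFileSink implements AutoCloseable {
    private final Path dir;
    private final long maxBytes;
    private OutputStream out;
    private long bytesInFile = 0;
    private long totalBytes = 0;
    private int fileIndex = 0;

    RollingFileSink(Path dir, long maxBytes) throws IOException {
        this.dir = dir;
        this.maxBytes = maxBytes;
        openNext();
    }

    /** Closes the current file (if any) and opens part-N.txt for the next N. */
    private void openNext() throws IOException {
        if (out != null) {
            out.close();
        }
        Path next = dir.resolve("part-" + fileIndex++ + ".txt");
        out = new BufferedOutputStream(Files.newOutputStream(next));
        bytesInFile = 0;
    }

    void write(String html) throws IOException {
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        // Roll over first if this chunk would push a non-empty file past the limit.
        if (bytesInFile > 0 && bytesInFile + bytes.length > maxBytes) {
            openNext();
        }
        out.write(bytes);
        bytesInFile += bytes.length;
        totalBytes += bytes.length;
    }

    /** Total bytes written across all files, for checking against the disk budget. */
    long totalBytes() {
        return totalBytes;
    }

    int filesOpened() {
        return fileIndex;
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

A single document larger than maxBytes still goes into one file here; splitting documents across files was not part of the question.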

+3

Have you considered keeping a count of how many bytes you write to the file?

+2
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;

    public class TestFileWriter {

        public static void main(String[] args) throws IOException {
            File file = new File("test.txt");
            FileWriter fileWriter = new FileWriter(file);
            for (int i = 0; i < 1000; i++) {
                fileWriter.write("a very long string, a very long string, a very long string, a very long string, a very long string\n");
                if ((i % 100) == 0) {
                    // Flush first: length() only sees data already handed to the OS,
                    // not what is still sitting in the writer's buffer.
                    fileWriter.flush();
                    System.out.println("file size=" + file.length());
                }
            }
            fileWriter.close();
            // Size after closing the writer.
            System.out.println("file size=" + file.length());
        }
    }

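This example shows that when you use a FileWriter, you can get the file's size in real time, both while writing and while the file is still open. Note that File.length() only reflects data that has already been flushed to disk, so flush the writer before checking. If you want to save space, you can compress the stream.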

+1

Apologies for being slightly off topic:

Does it need to be in Java? Depending on how you get your feed data, this sounds to me like a job for a fairly simple shell script (grep or fgrep to check for keywords, gzip to compress, ...).

0

Source: https://habr.com/ru/post/1382307/

