I am still a relatively new programmer, and a problem I keep running into in Java is out-of-memory errors. I don't want to increase the heap with -Xmx, because I feel the error is due to poor programming, and I want to improve my coding rather than rely on more memory.
The work I do involves processing a lot of text files, each about 1 GB compressed. The code I have loops over a directory where new compressed text files are dropped. It opens the second most recent file (not the most recent, since that one is still being written) and uses the Jsoup library to parse certain fields in it (fields are separated by custom delimiters: "|nTa|" denotes a new column and "|nLa|" denotes a new line).
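To make the file format concrete, here is a minimal sketch of how one small chunk of data breaks down into rows and columns. The class name `DelimiterSketch`, the method `parse`, and the sample data in the comment are made up for illustration; my real code streams the file with `Scanner` instead of holding it in a `String`.

```java
import java.util.regex.Pattern;

public class DelimiterSketch {

    /** Splits a small chunk of text into rows ("|nLa|") and columns ("|nTa|"). */
    static String[][] parse(String data) {
        // Pattern.quote escapes the pipes so they are matched literally.
        String[] lines = data.split(Pattern.quote("|nLa|"));
        String[][] rows = new String[lines.length][];
        for (int i = 0; i < lines.length; i++) {
            // Limit -1 keeps trailing empty columns instead of dropping them.
            rows[i] = lines[i].split(Pattern.quote("|nTa|"), -1);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample: two rows with two columns each.
        String[][] rows = parse("id1|nTa|<p>hi</p>|nLa|id2|nTa|<b>yo</b>|nLa|");
        for (String[] row : rows) {
            System.out.println(String.join(" / ", row));
        }
    }
}
```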
I believe there should be no reason for this to use a large amount of memory. I open the file, read through it, parse the relevant fields, write the parsed version to another file, close the file and move on to the next one. I do not need to hold the entire file in memory, and I certainly do not need to keep already-processed files in memory.
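One way I could check that assumption is to log heap usage after each file is finished. This is only a rough sketch: `HeapCheck` and `usedHeapMb` are names I made up, and the number includes garbage that has not been collected yet, so it is an upper bound on what is actually still referenced.

```java
public class HeapCheck {

    /** Returns the heap currently in use, in megabytes (includes uncollected garbage). */
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args) {
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("max heap MB:  " + maxMb);
        System.out.println("used heap MB: " + usedHeapMb());
    }
}
```

If the figure printed after the first file stays low, then leftovers from earlier files are probably not the problem, and a single very large record or field would be the next thing to look at.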
I get the errors when I start parsing the second file, which makes me think I am not handling garbage collection properly. Please take a look at the code and see if you can spot what I am doing that uses more memory than it should. I want to learn how to do this right so that I stop getting out-of-memory errors!
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Scanner;
    import java.util.TreeMap;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;
    import org.jsoup.Jsoup;

    public class ParseHTML {

        public static int commentExtractField = 3;
        public static int contentExtractField = 4;
        public static int descriptionField = 5;

        public static void main(String[] args) throws Exception {

            File directoryCompleted = null;
            File filesCompleted[] = null;

            while(true) {

                // find second most recent file in completed directory
                directoryCompleted = new File(args[0]);
                filesCompleted = directoryCompleted.listFiles();

                if (filesCompleted.length > 1) {

                    TreeMap<Long, File> timeStamps = new TreeMap<Long, File>(Collections.reverseOrder());
                    for (File f : filesCompleted) {
                        timeStamps.put(getTimestamp(f), f);
                    }

                    File fileToProcess = null;
                    int counter = 0;
                    for (Long l : timeStamps.keySet()) {
                        fileToProcess = timeStamps.get(l);
                        if (counter == 1) {
                            break;
                        }
                        counter++;
                    }

                    // start processing file
                    GZIPInputStream gzipInputStream = null;
                    if (fileToProcess != null) {
                        gzipInputStream = new GZIPInputStream(new FileInputStream(fileToProcess));
                    } else {
                        System.err.println("No file to process!");
                        System.exit(1);
                    }

                    Scanner scanner = new Scanner(gzipInputStream);
                    scanner.useDelimiter("\\|nLa\\|");

                    GZIPOutputStream output = new GZIPOutputStream(new FileOutputStream("parsed/" + fileToProcess.getName()));

                    while (scanner.hasNext()) {
                        Scanner scanner2 = new Scanner(scanner.next());
                        scanner2.useDelimiter("\\|nTa\\|");

                        ArrayList<String> row = new ArrayList<String>();
                        while(scanner2.hasNext()) {
                            row.add(scanner2.next());
                        }

                        for (int index = 0; index < row.size(); index++) {
                            if (index == commentExtractField ||
                                index == contentExtractField ||
                                index == descriptionField) {
                                output.write(jsoupParse(row.get(index)).getBytes("UTF-8"));
                            } else {
                                output.write(row.get(index).getBytes("UTF-8"));
                            }

                            String delimiter = "";
                            if (index == row.size() - 1) {
                                delimiter = "|nLa|";
                            } else {
                                delimiter = "|nTa|";
                            }
                            output.write(delimiter.getBytes("UTF-8"));
                        }
                    }

                    output.finish();
                    output.close();
                    scanner.close();
                    gzipInputStream.close();
                }
            }
        }

        public static Long getTimestamp(File f) {
            String name = f.getName();
            String removeExt = name.substring(0, name.length() - 3);
            String timestamp = removeExt.substring(7, removeExt.length());
            return Long.parseLong(timestamp);
        }

        public static String jsoupParse(String s) {
            if (s.length() == 4) {
                return s;
            } else {
                return Jsoup.parse(s).text();
            }
        }
    }
How can I make sure that when I am finished with objects, they are destroyed and no longer hold any resources? For example, each time I close the GZIPInputStream, the GZIPOutputStream and the Scanner, how can I be sure they are completely released?
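As far as I understand, Java has no way to destroy an object explicitly: `close()` releases the underlying file handle and compression buffers, and the object itself is reclaimed by the garbage collector once nothing references it. What I can control is that `close()` always runs. A minimal sketch using try-with-resources, assuming Java 7 or later (`CloseSketch` and `copyRecords` are names I made up, and it simply copies records without the Jsoup step):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Scanner;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CloseSketch {

    /** Copies "|nLa|"-terminated records between gzip files; returns the record count. */
    static int copyRecords(File in, File out) throws IOException {
        int records = 0;
        // Every resource declared here is closed automatically, in reverse
        // order, even if an exception is thrown inside the block.
        try (FileInputStream fis = new FileInputStream(in);
             GZIPInputStream gzipIn = new GZIPInputStream(fis);
             Scanner scanner = new Scanner(gzipIn, "UTF-8");
             FileOutputStream fos = new FileOutputStream(out);
             GZIPOutputStream gzipOut = new GZIPOutputStream(fos)) {
            scanner.useDelimiter("\\|nLa\\|");
            while (scanner.hasNext()) {
                gzipOut.write(scanner.next().getBytes("UTF-8"));
                gzipOut.write("|nLa|".getBytes("UTF-8"));
                records++;
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CloseSketch input.gz output.gz
        System.out.println(copyRecords(new File(args[0]), new File(args[1])) + " records copied");
    }
}
```

Closing the `GZIPOutputStream` also finishes the compressed stream, so a separate `finish()` call is not needed in this form.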
For the record, the error I get is:
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
        at java.lang.StringBuilder.append(StringBuilder.java:203)
        at org.jsoup.parser.TokeniserState$47.read(TokeniserState.java:1171)
        at org.jsoup.parser.Tokeniser.read(Tokeniser.java:42)
        at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:101)
        at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:53)
        at org.jsoup.parser.Parser.parse(Parser.java:24)
        at org.jsoup.Jsoup.parse(Jsoup.java:44)
        at ParseHTML.jsoupParse(ParseHTML.java:125)
        at ParseHTML.main(ParseHTML.java:81)