Debugging a Java out-of-memory error

I am still a relatively new programmer, and the problem I keep running into in Java is out-of-memory errors. I don't want to increase the heap with -Xmx, because I feel the error is caused by poor programming, and I want to improve my coding rather than rely on more memory.

The work I do involves processing a lot of text files, each about 1 GB compressed. The code I have loops over the directory where new compressed text files are dropped. It opens the second most recent file (not the most recent one, since that is still being written) and uses the Jsoup library to parse specific fields in it (fields are separated by custom delimiters: "|nTa|" denotes a new column and "|nLa|" denotes a new line).

I believe there should be no reason to use a large amount of memory. I open the file, scan through it, parse the relevant bits, write the parsed version to another file, close the file and move on to the next one. I do not need to hold the entire file in memory, and I certainly do not need to keep files that have already been processed in memory.

I get the errors when I start parsing the second file, which makes me think I am not handling garbage collection properly. Please take a look at the code and see if you can spot what I am doing that makes it use more memory than it should. I want to learn how to do this right so that I stop getting memory errors!

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Scanner;
    import java.util.TreeMap;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;
    import org.jsoup.Jsoup;

    public class ParseHTML {

        public static int commentExtractField = 3;
        public static int contentExtractField = 4;
        public static int descriptionField = 5;

        public static void main(String[] args) throws Exception {
            File directoryCompleted = null;
            File filesCompleted[] = null;

            while(true) {
                // find second most recent file in completed directory
                directoryCompleted = new File(args[0]);
                filesCompleted = directoryCompleted.listFiles();

                if (filesCompleted.length > 1) {
                    TreeMap<Long, File> timeStamps = new TreeMap<Long, File>(Collections.reverseOrder());
                    for (File f : filesCompleted) {
                        timeStamps.put(getTimestamp(f), f);
                    }

                    File fileToProcess = null;
                    int counter = 0;
                    for (Long l : timeStamps.keySet()) {
                        fileToProcess = timeStamps.get(l);
                        if (counter == 1) {
                            break;
                        }
                        counter++;
                    }

                    // start processing file
                    GZIPInputStream gzipInputStream = null;
                    if (fileToProcess != null) {
                        gzipInputStream = new GZIPInputStream(new FileInputStream(fileToProcess));
                    } else {
                        System.err.println("No file to process!");
                        System.exit(1);
                    }

                    Scanner scanner = new Scanner(gzipInputStream);
                    scanner.useDelimiter("\\|nLa\\|");
                    GZIPOutputStream output = new GZIPOutputStream(new FileOutputStream("parsed/" + fileToProcess.getName()));

                    while (scanner.hasNext()) {
                        Scanner scanner2 = new Scanner(scanner.next());
                        scanner2.useDelimiter("\\|nTa\\|");

                        ArrayList<String> row = new ArrayList<String>();
                        while(scanner2.hasNext()) {
                            row.add(scanner2.next());
                        }

                        for (int index = 0; index < row.size(); index++) {
                            if (index == commentExtractField || index == contentExtractField || index == descriptionField) {
                                output.write(jsoupParse(row.get(index)).getBytes("UTF-8"));
                            } else {
                                output.write(row.get(index).getBytes("UTF-8"));
                            }

                            String delimiter = "";
                            if (index == row.size() - 1) {
                                delimiter = "|nLa|";
                            } else {
                                delimiter = "|nTa|";
                            }
                            output.write(delimiter.getBytes("UTF-8"));
                        }
                    }

                    output.finish();
                    output.close();
                    scanner.close();
                    gzipInputStream.close();
                }
            }
        }

        public static Long getTimestamp(File f) {
            String name = f.getName();
            String removeExt = name.substring(0, name.length() - 3);
            String timestamp = removeExt.substring(7, removeExt.length());
            return Long.parseLong(timestamp);
        }

        public static String jsoupParse(String s) {
            if (s.length() == 4) {
                return s;
            } else {
                return Jsoup.parse(s).text();
            }
        }
    }

How can I make sure that once I am done with an object, it is destroyed and no longer holds any resources? For example, each time I close the GZIPInputStream, the GZIPOutputStream and the Scanner, how can I be sure they are completely released?

For the record, the error I get is:

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
        at java.lang.StringBuilder.append(StringBuilder.java:203)
        at org.jsoup.parser.TokeniserState$47.read(TokeniserState.java:1171)
        at org.jsoup.parser.Tokeniser.read(Tokeniser.java:42)
        at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:101)
        at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:53)
        at org.jsoup.parser.Parser.parse(Parser.java:24)
        at org.jsoup.Jsoup.parse(Jsoup.java:44)
        at ParseHTML.jsoupParse(ParseHTML.java:125)
        at ParseHTML.main(ParseHTML.java:81)
+4
5 answers

Update: this issue has been fixed in JSoup 1.6.2

It seems to me that this is probably a bug in the Jsoup parser you are using. The documentation for Jsoup.parse() currently carries a BETA warning asking users to file a bug if they get an exception or a bad parse tree. That suggests the authors are not yet fully confident that it is safe for production code.

I also found several bug reports that mention out-of-memory exceptions. One of them suggests that the cause is parse-error objects being held statically by Jsoup, and that downgrading from Jsoup 1.6.1 to 1.5.2 may be a workaround.

+2

I have not studied your code for very long (nothing stands out), but a good general-purpose step would be to get familiar with the free VisualVM. There are reasonable guides to its use, and many more articles besides.

In my opinion there are better commercial profilers, JProfiler for one, but VisualVM will at least show you which objects/classes account for most of the memory, and possibly the stack traces of the methods that allocated them. More simply, it shows heap usage over time, and you can use that to judge whether you are failing to release something or whether it is an unavoidable spike.

I suggest this rather than looking at the specifics of your code because it is a useful diagnostic skill.
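When attaching a profiler is not convenient, a rough picture of heap usage over time can also be logged from inside the program, for example once per processed file. The following is only a minimal sketch built on the standard Runtime API; the class name HeapLogger and its methods are hypothetical and not part of the code in the question.

```java
public class HeapLogger {

    /** Heap currently in use, in bytes (allocated heap minus free space in it). */
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** One-line report; label is any text identifying the checkpoint. */
    public static String report(String label) {
        long mb = 1024L * 1024L;
        long maxMb = Runtime.getRuntime().maxMemory() / mb;
        return String.format("[%s] used=%d MB, max=%d MB", label, usedHeapBytes() / mb, maxMb);
    }

    public static void main(String[] args) {
        System.out.println(report("start"));
        // Simulate a large allocation so the second report differs from the first.
        byte[] block = new byte[32 * 1024 * 1024];
        System.out.println(report("after allocating " + block.length + " bytes"));
    }
}
```

If the "used" figure keeps climbing from file to file, something is being retained; if it only jumps while one particular record is being parsed, the input is the more likely culprit. The numbers are approximate, since garbage that has not been collected yet still counts as used.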

+3

I wonder whether the parsing fails because the HTML itself is malformed (e.g. unclosed tags, unpaired quotes, or something similar)? You could log or println to see how far into the document you get, if at all. The library may fail to recognize the end of the document before it runs out of memory.
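One way to act on the logging suggestion is to report any field that is unexpectedly large before it is handed to the parser. This is only a sketch: the class FieldSizeCheck, the method describeIfOversized and the threshold of 1000 characters are all made up for illustration, and a sensible limit depends on your data.

```java
public class FieldSizeCheck {

    /**
     * Returns a one-line warning if the field is longer than maxChars, otherwise null.
     * recordNumber and fieldIndex only identify where the field came from.
     */
    public static String describeIfOversized(long recordNumber, int fieldIndex,
                                             String field, int maxChars) {
        if (field.length() <= maxChars) {
            return null;
        }
        // Show only the beginning of the field so the log itself stays small.
        String preview = field.substring(0, Math.min(60, field.length()))
                              .replaceAll("\\s+", " ");
        return "record " + recordNumber + ", field " + fieldIndex + ": "
                + field.length() + " chars, starts with: " + preview;
    }

    public static void main(String[] args) {
        String normal = "<p>short comment</p>";
        // Stand-in for a field that swallowed far more input than expected.
        String runaway = new String(new char[5000]).replace('\0', 'x');

        System.out.println(describeIfOversized(1, 3, normal, 1000));
        System.out.println(describeIfOversized(2, 4, runaway, 1000));
    }
}
```

Calling such a check just before jsoupParse would show which record and field the program was working on when memory ran out, and whether that field is plausible HTML at all.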

parse

public static Document parse(String html)

Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.

http://jsoup.org/apidocs/org/jsoup/Jsoup.html#parse(java.lang.String)

+1

It's a little hard to say what's going on, but two things come to mind.

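1) In some unusual circumstances (depending on the input file), the following loop may load the entire file into memory. For example, if the "|nLa|" delimiter never occurs in the input, scanner.next() returns the whole file as a single record:

    while(scanner2.hasNext()) {
        row.add(scanner2.next());
    }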

2) Looking at the stack trace, jsoupParse appears to be where the problem shows up. I believe the line Jsoup.parse(s).text(); loads s into memory first, and depending on the size of the string (which again depends on the particular input file) this may cause the OutOfMemoryError.

Perhaps a combination of the two points above is the problem. Again, it's hard to say just from looking at the code.

Does this always happen with the same file? Have you checked the input content and the custom delimiters in it?
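To guard against the first point, records can be read with an upper bound on their size, so that a missing delimiter produces a clear error instead of an OutOfMemoryError. The sketch below is an illustration under assumptions, not a drop-in replacement: the class BoundedRecordReader and the maxChars limit are hypothetical, and in real use the Reader should be a BufferedReader wrapped around the GZIPInputStream, since the sketch reads one character at a time.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class BoundedRecordReader {
    private final Reader in;
    private final String delimiter;
    private final int maxChars;

    public BoundedRecordReader(Reader in, String delimiter, int maxChars) {
        this.in = in;
        this.delimiter = delimiter;
        this.maxChars = maxChars;
    }

    /** Returns the next record without its delimiter, or null at end of input. */
    public String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            sb.append((char) c);
            if (endsWithDelimiter(sb)) {
                sb.setLength(sb.length() - delimiter.length());
                return sb.toString();
            }
            // No delimiter yet and the buffer is already past the limit: give up early.
            if (sb.length() >= maxChars + delimiter.length()) {
                throw new IOException("Record longer than " + maxChars
                        + " chars; is the delimiter missing?");
            }
        }
        if (sb.length() > maxChars) {
            throw new IOException("Trailing record longer than " + maxChars + " chars");
        }
        return sb.length() == 0 ? null : sb.toString();
    }

    private boolean endsWithDelimiter(StringBuilder sb) {
        int start = sb.length() - delimiter.length();
        if (start < 0) {
            return false;
        }
        for (int i = 0; i < delimiter.length(); i++) {
            if (sb.charAt(start + i) != delimiter.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Reader sample = new StringReader("a|nTa|b|nLa|c|nTa|d|nLa|");
        BoundedRecordReader reader = new BoundedRecordReader(sample, "|nLa|", 1000);
        String record;
        while ((record = reader.next()) != null) {
            System.out.println(record);
        }
    }
}
```

With a limit in place, the answer to "does this always happen with the same file?" comes with the exception message rather than a heap dump.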

+1

Assuming the problem is not in the Jsoup code, we can do some memory optimization. In the example, the ArrayList<String> row can be removed: it holds all the parsed fields of a line in memory, while only one field is needed at a time.

Inner loop with row removed:

    // Caution! May contain obvious bugs!
    while (scanner.hasNext()) {
        String scanStr = scanner.next();

        // count the fields manually to replace 'row.size()'
        int rowCount = 0;
        int offset = 0;
        while ((offset = scanStr.indexOf("|nTa|", offset)) >= 0) {
            rowCount++;
            offset++;
        }
        rowCount++;

        Scanner scanner2 = new Scanner(scanStr);
        scanner2.useDelimiter("\\|nTa\\|");

        int index = 0;
        while (scanner2.hasNext()) {
            String curRow = scanner2.next();
            if (index == commentExtractField || index == contentExtractField || index == descriptionField) {
                output.write(jsoupParse(curRow).getBytes("UTF-8"));
            } else {
                output.write(curRow.getBytes("UTF-8"));
            }

            String delimiter = "";
            if (index == rowCount - 1) {
                delimiter = "|nLa|";
            } else {
                delimiter = "|nTa|";
            }
            output.write(delimiter.getBytes("UTF-8"));

            index++; // advance to the next field
        }
    }
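The same idea can be combined with try-with-resources (Java 7 and later), which also addresses the question about releasing the streams: every resource declared in the try header is closed when the block exits, even if parsing throws. The sketch below makes several assumptions. StreamingRewrite, writeRecord, rewrite and isHtmlField are hypothetical names, the htmlToText parameter stands in for Jsoup.parse(s).text() so that the example needs only the standard library, and the lambdas require Java 8.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class StreamingRewrite {
    static final String FIELD_SEP = "|nTa|";
    static final String RECORD_SEP = "|nLa|";

    /** Mirrors the question's field indices 3, 4 and 5 (comment, content, description). */
    public static boolean isHtmlField(int index) {
        return index >= 3 && index <= 5;
    }

    /** Writes one record field by field; no list of fields is kept. */
    public static void writeRecord(String record, UnaryOperator<String> htmlToText, Writer out)
            throws IOException {
        int index = 0;
        int start = 0;
        while (true) {
            int end = record.indexOf(FIELD_SEP, start);
            String field = end < 0 ? record.substring(start) : record.substring(start, end);
            out.write(isHtmlField(index) ? htmlToText.apply(field) : field);
            if (end < 0) {
                break; // last field of the record
            }
            out.write(FIELD_SEP);
            start = end + FIELD_SEP.length();
            index++;
        }
        out.write(RECORD_SEP);
    }

    /** Streams a gzipped input file into a gzipped output file, one record at a time. */
    public static void rewrite(File in, File out, UnaryOperator<String> htmlToText)
            throws IOException {
        // Resources are closed automatically in reverse order of declaration.
        try (InputStream rawIn = new FileInputStream(in);
             InputStream gzIn = new GZIPInputStream(rawIn);
             Scanner records = new Scanner(gzIn, "UTF-8");
             OutputStream rawOut = new FileOutputStream(out);
             OutputStream gzOut = new GZIPOutputStream(rawOut);
             Writer writer = new OutputStreamWriter(gzOut, StandardCharsets.UTF_8)) {
            records.useDelimiter(Pattern.quote(RECORD_SEP));
            while (records.hasNext()) {
                writeRecord(records.next(), htmlToText, writer);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File in = File.createTempFile("sample", ".gz");
        File out = File.createTempFile("parsed", ".gz");
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(in)), StandardCharsets.UTF_8)) {
            w.write("1|nTa|x|nTa|y|nTa|<b>hi</b>|nLa|2|nTa|p|nTa|q|nTa|<i>yo</i>|nLa|");
        }
        // Crude stand-in for Jsoup.parse(s).text(): drops anything that looks like a tag.
        rewrite(in, out, s -> s.replaceAll("<[^>]*>", ""));
        try (Scanner s = new Scanner(new GZIPInputStream(new FileInputStream(out)), "UTF-8")) {
            s.useDelimiter("\\A"); // read the whole (small) result at once
            System.out.println(s.hasNext() ? s.next() : "");
        }
    }
}
```

Closing a stream releases its file handle and native buffers; the objects themselves are reclaimed by the garbage collector once nothing references them, so there is no need (and no way) to destroy them explicitly. Note that each record is still held in memory as one String, so this does not help if a single record is itself huge.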
+1

Source: https://habr.com/ru/post/1395488/

