A review of the code and/or documentation is probably your best bet, but if you want to test, you can. It seems that sufficiency is the goal here and that minimizing the test is secondary. It is hard to say what a sufficient test is based only on assumptions about what the threat is, but here is my suggestion: all codepoints, including U+0000, plus correct handling of "combining characters".
The method you want to test takes a Java String as a parameter. Java does not have "UTF-8 encoded strings": Java's native text datatype uses the UTF-16 encoding of the Unicode character set. This is common for in-memory text representation; it is used by Java, .NET, JavaScript, VB6, VBA, .... UTF-8 is typically used for streams and storage, so it makes sense that you would ask about it in the context of "storage and retrieval". Databases typically offer one or more of UTF-8, a 3-byte-limited form of UTF-8, or UTF-16 (NVARCHAR) as storage encodings, along with the corresponding conversions.
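To make the distinction concrete, here is a small sketch (the class name is illustrative, nothing here is specific to your code): in memory a String is UTF-16 code units, and UTF-8 only appears at the serialization boundary.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        // In memory, a Java String is a sequence of UTF-16 code units.
        String s = "\uD83D\uDE00"; // U+1F600: one codepoint, two UTF-16 code units
        System.out.println(s.length());                      // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (codepoint)

        // UTF-8 only enters the picture when the text is serialized.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);                     // 4 (UTF-8 code units)

        // Round-trip: decoding the stored bytes restores the same string.
        String restored = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(s.equals(restored));              // true
    }
}
```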
Encoding is an implementation detail. If a component accepts a Java String, it should either throw an exception for data it does not want to deal with, or handle the data properly.
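A sketch of that contract (the component and its rejection rule are hypothetical; an unpaired surrogate is one example of String content a component might reasonably refuse):

```java
import java.nio.charset.StandardCharsets;

public class StrictStore {
    // Hypothetical component: handles any well-formed String, throws for the rest.
    public static byte[] store(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // A Java String may contain unpaired surrogates, which do not
            // represent any codepoint; reject rather than silently corrupt.
            if (Character.isHighSurrogate(c)
                    && (i + 1 == s.length() || !Character.isLowSurrogate(s.charAt(i + 1)))
                    || Character.isLowSurrogate(c)
                    && (i == 0 || !Character.isHighSurrogate(s.charAt(i - 1)))) {
                throw new IllegalArgumentException("unpaired surrogate at index " + i);
            }
        }
        return s.getBytes(StandardCharsets.UTF_8); // handle properly
    }

    public static void main(String[] args) {
        System.out.println(store("A\u20DD").length); // 4: 'A' (1 byte) + U+20DD (3 bytes)
        try {
            store("\uD83D"); // lone high surrogate: not valid text
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```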
"Characters" is a rather vague term. Unicode codepoints range from 0x0 to 0x10FFFF (21 bits). Some codepoints are unassigned (also called "reserved"), depending on the revision of the Unicode Standard. Java datatypes can hold any codepoint, but information about codepoints is limited by version: for Java 8, character information is based on the Unicode Standard, version 6.2.0. You can limit the test to "defined" codepoints, or run it over all possible codepoints.
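The range and the version-dependent "defined" check can be inspected directly (an illustrative sketch; the count it prints depends on the Unicode version your JRE implements):

```java
import java.util.stream.IntStream;

public class CodepointRange {
    public static void main(String[] args) {
        System.out.println(Integer.toHexString(Character.MIN_CODE_POINT)); // 0
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT)); // 10ffff

        // The number of assigned ("defined") codepoints depends on the
        // Unicode version the running JRE implements (6.2.0 for Java 8).
        long defined = IntStream
                .rangeClosed(Character.MIN_CODE_POINT, Character.MAX_CODE_POINT)
                .filter(Character::isDefined)
                .count();
        System.out.println(defined);
    }
}
```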
A codepoint is either a base "character" or a "combining character". Also, each codepoint falls into exactly one Unicode category; two of the categories are for combining characters. To form a grapheme, a base character is followed by zero or more combining characters. Rendering such a sequence graphically can get out of hand (see Zalgo), but for storing text, all that is necessary is not to mangle the sequence of codepoints (and the byte order, if applicable).
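As a concrete illustration, a base character followed by a combining character forms one grapheme but remains two codepoints:

```java
public class GraphemeDemo {
    public static void main(String[] args) {
        // U+0041 LATIN CAPITAL LETTER A followed by
        // U+20DD COMBINING ENCLOSING CIRCLE: one grapheme, two codepoints.
        String grapheme = "A\u20DD";
        System.out.println(grapheme.codePointCount(0, grapheme.length())); // 2

        // U+20DD is in the ENCLOSING_MARK category, one of the
        // combining categories.
        System.out.println(Character.getType('\u20DD') == Character.ENCLOSING_MARK); // true

        // Storage only has to preserve this two-codepoint sequence;
        // whether a font can actually draw a circled "A" is a rendering concern.
    }
}
```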
So, here is a not-minimal, somewhat comprehensive test:
```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

import static org.junit.Assert.assertEquals;

final Stream<Integer> codepoints = IntStream
        .rangeClosed(Character.MIN_CODE_POINT, Character.MAX_CODE_POINT)
        .filter(cp -> Character.isDefined(cp)) // optional filtering
        .boxed();
// Must be in ascending order because Arrays.binarySearch is used below
// (ENCLOSING_MARK is 7, COMBINING_SPACING_MARK is 8).
final int[] combiningCategories = {
        Character.ENCLOSING_MARK,
        Character.COMBINING_SPACING_MARK
};
final Map<Boolean, List<Integer>> partitionedCodepoints = codepoints
        .collect(Collectors.partitioningBy(cp ->
                Arrays.binarySearch(combiningCategories, Character.getType(cp)) < 0));
final Integer[] baseCodepoints = partitionedCodepoints.get(true)
        .toArray(new Integer[0]);
final Integer[] combiningCodepoints = partitionedCodepoints.get(false)
        .toArray(new Integer[0]);
final int baseLength = baseCodepoints.length;
final int combiningLength = combiningCodepoints.length;

// Pair every base codepoint with a combining codepoint, cycling through
// the combining ones, so every codepoint appears in the test string.
final StringBuilder graphemes = new StringBuilder();
for (int i = 0; i < baseLength; i++) {
    graphemes.append(Character.toChars(baseCodepoints[i]));
    graphemes.append(Character.toChars(combiningCodepoints[i % combiningLength]));
}
final String test = graphemes.toString();
// Note: Charset.encode(...).array() can include unused trailing buffer
// capacity, so use String.getBytes to get exactly the encoded bytes.
final byte[] testUTF8 = test.getBytes(StandardCharsets.UTF_8);

// Java 8 counts for when filtering by Character.isDefined
assertEquals(736681, test.length());    // number of UTF-16 code units
assertEquals(3241399, testUTF8.length); // number of UTF-8 code units
```