What is the minimum test to verify that a component can save / retrieve UTF8 encoded strings

I am testing component integration. The component allows storing and retrieving strings.

I want to make sure that the component handles UTF-8 characters correctly. What is the minimum test that is required to verify this?

I think doing something like this is a good start:

```java
// This is the ☺ character
String toSave = "\u263A";
int id = 123;

// Save to the database
myComponent.save( id, toSave );

// Retrieve from the database
String fromComponent = myComponent.retrieve( id );

// Verify they are the same
org.junit.Assert.assertEquals( toSave, fromComponent );
```

One mistake I made in the past: I set `String toSave = "è"`. My test passed because the row was saved and correctly restored to/from the database. Unfortunately, the application still did not work correctly, because the application uses the ISO 8859-1 encoding. This meant that è worked, but other characters like ☺ did not.
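To illustrate the failure mode I ran into (this is a standalone sketch, not my actual component): è survives a round trip through ISO 8859-1 because it exists in that character set, but ☺ does not, so it gets silently replaced:

```java
import java.nio.charset.StandardCharsets;

public class EncodingPitfall {
    public static void main(String[] args) {
        String e = "\u00E8";      // è, present in ISO 8859-1
        String smiley = "\u263A"; // ☺, NOT present in ISO 8859-1

        // Simulate a component that internally re-encodes via ISO 8859-1
        String eRoundTrip = new String(
                e.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1);
        String smileyRoundTrip = new String(
                smiley.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1);

        System.out.println(e.equals(eRoundTrip));           // true
        System.out.println(smiley.equals(smileyRoundTrip)); // false: ☺ became '?'
    }
}
```

So a test that only uses è cannot catch this bug; the test character must lie outside the Latin-1 range.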

The question is repeated: what is the minimum test (or tests) to verify that I can save UTF-8 encoded strings?

+5
3 answers

If your component's job is just storing and retrieving strings, all you have to do is make sure that nothing is lost in the round trip: converting Java's Unicode strings to the UTF-8 bytes stored in the component, and back again.

This involves checking with at least one character from each UTF-8 code point length. So, I would suggest checking with:

  • One character from US-ASCII (1-byte code point), then

  • One character from Greek (2-byte code point), and

  • One character from Chinese (3-byte code point).

  • Ideally, also an emoji (4-byte code point). These cannot be represented as a single Java `char`; in a Java string they become a surrogate pair, so they exercise an extra code path and are worth including.

A useful additional test would be to try a string combining at least one character from each of the above cases to make sure that characters of different lengths of the code point can coexist within the same string.
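The checklist above can be sketched as a round-trip test. `save`/`retrieve` here are a trivial in-memory stand-in for your component, since its real API is not shown; swap in the actual calls:

```java
import java.util.HashMap;
import java.util.Map;

public class Utf8RoundTripTest {
    // Stand-in for the component under test; replace with the real one.
    static final Map<Integer, String> store = new HashMap<>();
    static void save(int id, String s) { store.put(id, s); }
    static String retrieve(int id) { return store.get(id); }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041: 1-byte UTF-8 (US-ASCII)
            "\u03A9",       // Ω U+03A9: 2-byte UTF-8 (Greek)
            "\u4E2D",       // 中 U+4E2D: 3-byte UTF-8 (Chinese)
            "\uD83D\uDE00", // 😀 U+1F600: 4-byte UTF-8, surrogate pair in Java
            "A\u03A9\u4E2D\uD83D\uDE00" // all code point lengths in one string
        };
        for (int id = 0; id < samples.length; id++) {
            save(id, samples[id]);
            if (!samples[id].equals(retrieve(id))) {
                throw new AssertionError("round trip failed for id " + id);
            }
        }
        System.out.println("all round trips ok");
    }
}
```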

(If your component does anything more than storing and retrieving strings, such as searching within them, the situation may be a little more complicated, but it seems to me that you deliberately avoided asking about that.)

I firmly believe that black-box testing is the only kind of testing that makes sense, so I would not recommend polluting the interface of your component with methods that reveal knowledge of its internals. However, there are two things you can do to increase the testability of the component without ruining its interface:

  • Introduce additional functions into the interface that help with testing without revealing anything about the internal implementation and without requiring the test code to know it.

  • Introduce functionality useful for testing via the constructor of your component. The code that constructs the component knows exactly which implementation it is building, so it is legitimately familiar with the nature of that implementation and may pass it implementation-specific configuration.

An example of what you could do with either of the above methods is to artificially limit the number of bytes the internal representation is allowed to occupy, so you can verify that a specific string you plan to store will fit. For instance, you could limit the internal size to 9 bytes and then make sure that a Java Unicode string containing 3 Chinese characters (3 UTF-8 bytes each) is correctly stored and retrieved.
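A hypothetical sketch of the constructor approach: the `maxBytes` limit exists only so that a black-box test can pin down the size of the internal (UTF-8) representation without the interface exposing it otherwise. `StringStore` and its methods are illustrative names, not the asker's real component:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class StringStore {
    private final Map<Integer, byte[]> rows = new HashMap<>();
    private final int maxBytes; // test-only knob passed by the constructing code

    public StringStore(int maxBytes) { this.maxBytes = maxBytes; }

    public void save(int id, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > maxBytes) {
            throw new IllegalArgumentException(
                "needs " + utf8.length + " bytes, limit is " + maxBytes);
        }
        rows.put(id, utf8);
    }

    public String retrieve(int id) {
        return new String(rows.get(id), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        StringStore store = new StringStore(9);  // 9-byte budget
        store.save(1, "\u4E2D\u6587\u5B57");     // 3 Chinese chars = 9 UTF-8 bytes
        System.out.println(store.retrieve(1).equals("\u4E2D\u6587\u5B57")); // true
    }
}
```

If the component silently re-encoded to something other than UTF-8, the 3 Chinese characters would either overflow the 9-byte budget or come back mangled, and the test would catch it.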

+1

A review of the code and/or documentation is probably your best bet. But if you want, you can test. The goal seems to be a *sufficient* test; minimizing it is less important. It is hard to know what a sufficient test is based only on assumptions about what the threat is, but here is my suggestion: cover all code points, including U+0000, and the correct handling of combining characters.

The method you want to test takes a Java String as a parameter. Java does not have "UTF-8 encoded strings": Java's native text types use the UTF-16 encoding of the Unicode character set. This is common for in-memory text representation: it is used by Java, .NET, JavaScript, VB6, VBA, and others. UTF-8 is usually used for streams and storage, so it makes sense that you ask about it in the context of "save and retrieve". Databases typically offer one or more of UTF-8, a 3-byte-limited UTF-8 variant, or UTF-16 (NVARCHAR) column types, and conversions between them.
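The UTF-16 vs UTF-8 distinction is easy to see directly in Java, where `length()` counts UTF-16 code units, not code points or bytes:

```java
import java.nio.charset.StandardCharsets;

public class Utf16VsUtf8 {
    public static void main(String[] args) {
        String smiley = "\u263A";      // ☺ U+263A
        String emoji = "\uD83D\uDE00"; // 😀 U+1F600

        System.out.println(smiley.length());                                // 1 UTF-16 code unit
        System.out.println(smiley.getBytes(StandardCharsets.UTF_8).length); // 3 UTF-8 bytes

        System.out.println(emoji.length());                                // 2 UTF-16 code units (surrogate pair)
        System.out.println(emoji.codePointCount(0, emoji.length()));       // 1 code point
        System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length); // 4 UTF-8 bytes
    }
}
```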

Encoding is an implementation detail. If a component accepts a Java String, it must either process the data properly or throw an exception for data it is unwilling to process.

"Characters" is a rather vague term. Unicode code numbers range from 0x0 to 0x10FFFFFF-21 bits. Some code points are not assigned (otherwise called ")", depending on the revision of the Unicode standard. Java data types can process any code, but information about them is limited by version. For Java 8, Character Information is based on the Unicode standard, version 6.2.0. . You can limit the test to “specific” code points or go through all possible code points.

A code point is either a base "character" or a "combining character". Also, each code point belongs to exactly one Unicode category, and two categories are reserved for combining characters. To form a grapheme, a base character is followed by zero or more combining characters. Rendering arbitrary graphemes graphically can be a challenge (see Zalgo text), but for text storage all that is needed is to avoid mangling the sequence of code points (and the byte order, if applicable).
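One consequence worth spelling out: a faithful store must preserve the exact code point sequence and not silently normalize it. For example, é can be stored precomposed (U+00E9) or as e plus a combining acute accent (U+0301); the two render identically but are different strings:

```java
import java.text.Normalizer;

public class CombiningMarks {
    public static void main(String[] args) {
        String precomposed = "\u00E9"; // é as one code point (NFC form)
        String decomposed = "e\u0301"; // e + COMBINING ACUTE ACCENT (NFD form)

        // Same grapheme on screen, different code point sequences:
        System.out.println(precomposed.equals(decomposed)); // false

        // They are only equal after normalization:
        System.out.println(Normalizer.normalize(decomposed,
                Normalizer.Form.NFC).equals(precomposed)); // true
    }
}
```

A component that quietly converts one form to the other would pass a naive visual check while corrupting the stored data, which is exactly what the test below is designed to catch.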

So, here is a not-quite-minimal but somewhat comprehensive test:

```java
final Stream<Integer> codepoints = IntStream
    .rangeClosed(Character.MIN_CODE_POINT, Character.MAX_CODE_POINT)
    .filter(cp -> Character.isDefined(cp)) // optional filtering
    .boxed();
final int[] combiningCategories = {
    Character.ENCLOSING_MARK,         // array must be sorted for binarySearch
    Character.COMBINING_SPACING_MARK
};
final Map<Boolean, List<Integer>> partitionedCodepoints = codepoints
    .collect(Collectors.partitioningBy(cp ->
        Arrays.binarySearch(combiningCategories, Character.getType(cp)) < 0));
final Integer[] baseCodepoints = partitionedCodepoints.get(true)
    .toArray(new Integer[0]);
final Integer[] combiningCodepoints = partitionedCodepoints.get(false)
    .toArray(new Integer[0]);
final int baseLength = baseCodepoints.length;
final int combiningLength = combiningCodepoints.length;

// Pair every base code point with a combining code point
final StringBuilder graphemes = new StringBuilder();
for (int i = 0; i < baseLength; i++) {
    graphemes.append(Character.toChars(baseCodepoints[i]));
    graphemes.append(Character.toChars(combiningCodepoints[i % combiningLength]));
}
final String test = graphemes.toString();
final byte[] testUTF8 = test.getBytes(StandardCharsets.UTF_8);

// Java 8 counts, when filtering by Character.isDefined
assertEquals(736681, test.length());    // number of UTF-16 code units
assertEquals(3241399, testUTF8.length); // number of UTF-8 code units
```
+3

String instances use a predefined and immutable internal encoding (UTF-16, 16-bit code units).
Thus, getting only a String back from your service is probably not enough for this check.
You should try to return the byte representation of the stored string (for example, a byte array) and compare the contents of that array with the `"\u263A"` String encoded as UTF-8 bytes.

```java
String toSave = "\u263A";
int id = 123;

// Save to the database
myComponent.save(id, toSave);

// Retrieve the raw bytes from the database
byte[] actualBytes = myComponent.retrieve(id);

// Assertion
byte[] expectedBytes = toSave.getBytes(StandardCharsets.UTF_8);
Assert.assertArrayEquals(expectedBytes, actualBytes);
```
0

Source: https://habr.com/ru/post/1267036/

