Code Format
A code is a sequence of 1 to 256 bytes. Every byte must fall in the range 0x21-0x7E, 0xA1-0xAC, or 0xAE-0xFF. Every bank and intermediary must properly handle any code that is in this format. Here is the grammar definition, in [http://en.wikipedia.org/wiki/Augmented_Backus-Naur_form ABNF]:
ValidChar = %x21-7E / %xA1-AC / %xAE-FF
ValidCode = 1*256ValidChar
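As a minimal sketch, the grammar can be checked byte by byte. The function names `is_valid_char` and `is_valid_code` are illustrative and not part of the protocol.

```python
def is_valid_char(b: int) -> bool:
    """Return True if the byte value b matches the ValidChar rule."""
    return 0x21 <= b <= 0x7E or 0xA1 <= b <= 0xAC or 0xAE <= b <= 0xFF


def is_valid_code(code: bytes) -> bool:
    """Return True if code matches ValidCode: 1 to 256 ValidChar bytes."""
    return 1 <= len(code) <= 256 and all(is_valid_char(b) for b in code)
```

The rule excludes 0x20 (space), 0xA0 (no-break space), and 0xAD (soft hyphen), along with all control bytes.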
Code Display Recommendation
One goal of the code format is to allow people to use codes that have meaning in their own language. The reasons for this are:
- Users may like codes in their own language
- Users are comfortable entering their own language into their computer and mobile phone
- They enter their own language faster than English characters
- They will make fewer mistakes entering their own language
Local languages may have many characters, allowing shorter codes while maintaining sufficient entropy. See PasswordEntropy.
Therefore the code format should support various languages. Rather than support various character encodings, the protocol will use ISO-8859-15 with UNICODE entities. Codes should be displayed using the [http://en.wikipedia.org/wiki/ISO_8859-15 ISO/IEC 8859-15 character encoding]. Substrings matching the following [http://en.wikipedia.org/wiki/Augmented_Backus-Naur_form ABNF] grammar are handled differently:
HexDigit = %x30-39 / %x41-46 / %x61-66 ; [0-9a-fA-F]
HexEntity = "&" "#" "x" 1*HexDigit ";"
Substrings matching HexEntity are converted to a [http://www.unicode.org/ UNICODE] code point, as in [http://www.w3.org/TR/REC-xml/#sec-references XML]. Entities encoding characters that don't need entity encoding should be displayed as-is.
For example, in the string "abc&#x41;", the entity "&#x41;" encodes the ISO-8859-15 character 'A'. If the software displays the code as "abcA", the user might write this down and later enter the code as "abcA", an error. The software should display the entity as-is, "abc&#x41;". The software should avoid displaying ambiguous representations of the code.
Software may choose to show an entity as-is, if doing so will eliminate ambiguity. Ambiguity may arise when a code contains a character that appears multiple times in Unicode. Some similar characters appear in the Chinese and Japanese portions of Unicode. When displaying a code containing a Chinese character on a Japanese computer, the entity should be displayed as-is.
To be safe, the software may display the code twice: once as-is and once with entities converted to their corresponding characters.
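The display rules above can be sketched as follows. This is one possible reading, not a reference implementation: the name `display_forms` is hypothetical, and leaving unprintable or out-of-range code points as-is is an assumption that the text does not state.

```python
import re

HEX_ENTITY = re.compile(r"&#x([0-9A-Fa-f]+);")


def _is_valid_char(b: int) -> bool:
    return 0x21 <= b <= 0x7E or 0xA1 <= b <= 0xAC or 0xAE <= b <= 0xFF


def _needs_entity(ch: str) -> bool:
    """True if ch cannot be written directly as a single ValidChar byte."""
    try:
        raw = ch.encode("iso-8859-15")
    except UnicodeEncodeError:
        return True
    return not _is_valid_char(raw[0])


def _expand(match: re.Match) -> str:
    value = int(match.group(1), 16)
    if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
        return match.group(0)  # not a usable code point: show as-is
    ch = chr(value)
    if not ch.isprintable():
        return match.group(0)  # assumption: never render invisible characters
    # An entity for a character that needs no entity stays as-is,
    # so "abc&#x41;" is never shown as the ambiguous "abcA".
    return ch if _needs_entity(ch) else match.group(0)


def display_forms(code: bytes) -> tuple[str, str]:
    """Return (as-is form, form with entities converted to characters)."""
    as_is = code.decode("iso-8859-15")
    return as_is, HEX_ENTITY.sub(_expand, as_is)
```

Showing both returned strings side by side corresponds to the "display the code twice" suggestion. The sketch does not handle the Chinese/Japanese case, which would need knowledge of the user's locale and fonts.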
Code Entry
Software must allow the user to enter all codes that match ValidCode. The software may allow the user to enter characters not matching ValidChar. Such software must convert those characters to Unicode code points and then to a string matching HexEntity. Website software accepting input via a web browser may receive Unicode characters encoded in decimal notation:
Digit = %x30-39 ; 0-9
DecimalEntity = "&" "#" 1*Digit ";"
The software must convert substrings matching DecimalEntity into substrings matching HexEntity. Software displaying codes containing DecimalEntity must display them as-is.
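A minimal sketch of the entry rules, assuming the input arrives as a Unicode string. The name `normalize_entry` is hypothetical, and rejecting results outside the 1 to 256 byte limit with `ValueError` is an assumption.

```python
import re

DECIMAL_ENTITY = re.compile(r"&#([0-9]+);")


def _is_valid_char(b: int) -> bool:
    return 0x21 <= b <= 0x7E or 0xA1 <= b <= 0xAC or 0xAE <= b <= 0xFF


def normalize_entry(text: str) -> bytes:
    """Convert user-entered text into the bytes of a code."""
    # Rewrite decimal entities such as "&#20013;" as hex entities.
    text = DECIMAL_ENTITY.sub(lambda m: "&#x%X;" % int(m.group(1)), text)
    out = bytearray()
    for ch in text:
        try:
            raw = ch.encode("iso-8859-15")
        except UnicodeEncodeError:
            raw = b""
        if len(raw) == 1 and _is_valid_char(raw[0]):
            out += raw  # the character is itself a ValidChar
        else:
            # Anything else becomes a hex entity for its code point.
            out += ("&#x%X;" % ord(ch)).encode("ascii")
    if not 1 <= len(out) <= 256:
        raise ValueError("code must be 1 to 256 bytes long")
    return bytes(out)
```

For example, `normalize_entry("&#20013;")` and `normalize_entry("\u4e2d")` both return `b"&#x4E2D;"`. Taken literally, the rule also turns a typed space into `&#x20;`; real software might prefer to strip whitespace first.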
Recommendation for Code Creation
When a client or bank generates a code, it should use characters from only one language. This minimizes the chances for the user to enter a character in the wrong language.
The code generator algorithm should use the output of a random number generator to select components of the code. The security of user accounts may be compromised if a malicious person is able to predict and guess codes. See PasswordEntropy.
Some codes are created for pre-printed cards, like credit card numbers. Such codes are intended for use by people. They should be:
- Easy to read, lacking easily confused characters such as 'O' and '0'
- Easy to tell over the telephone, excluding letters that are hard to tell apart, such as 'f' and 's', 'm' and 'n', etc.
- As short as possible, yet hard enough to guess
English Language Codes
You may generate English language codes using any letter, number, or symbol that appears in the English language. But let's apply some usability analysis and pare down that set:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
- Users may be most comfortable with letters and digits:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
- It's unwieldy to say 'capital A', 'small B', 'capital C', and so on. So we can eliminate the lower-case letters, leaving:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
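- Some are difficult to tell apart on paper: (O,0), (5,S), (T,7), (1,7), (G,6), (I,1). Let's cut enough of them that no confusable pair remains. The characters 7 and G can stay, because T, 1, and 6 are removed.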
234789ABCDEFGHJKLMNPQRUVWXYZ
- A few are hard to tell apart over the phone: (B,D), (F,S), (C,G), (M,N), (C,Z).
234789AEHJKLPQRUVWXY
This leaves us with twenty characters. A 64-bit random number may be encoded into a 15-digit base-20 number. This is quite manageable.
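The paring-down above and the base-20 encoding can be sketched as follows. The names `encode_base20`, `decode_base20`, and `new_code` are illustrative, and using Python's `secrets` module as the random number generator is an assumption.

```python
import secrets

START = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
HARD_ON_PAPER = "0156OIST"      # characters dropped from the paper pairs
HARD_ON_PHONE = "BDFCGMNZ"      # characters dropped from the phone pairs
ALPHABET = "".join(c for c in START
                   if c not in HARD_ON_PAPER + HARD_ON_PHONE)
assert ALPHABET == "234789AEHJKLPQRUVWXY"

CODE_LENGTH = 15                # 20**14 < 2**64 <= 20**15


def encode_base20(value: int, length: int = CODE_LENGTH) -> str:
    """Write value as a fixed-length base-20 string over ALPHABET."""
    if not 0 <= value < len(ALPHABET) ** length:
        raise ValueError("value does not fit in the requested length")
    digits = []
    for _ in range(length):
        value, remainder = divmod(value, len(ALPHABET))
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_base20(code: str) -> int:
    """Inverse of encode_base20."""
    value = 0
    for ch in code:
        value = value * len(ALPHABET) + ALPHABET.index(ch)
    return value


def new_code() -> str:
    """Generate a code from 64 bits of cryptographically strong randomness."""
    return encode_base20(secrets.randbits(64))
```

Fixed-length output keeps every code at 15 characters, padding small values with the alphabet's first character, '2'.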
For comparison, a VISA credit card has a 16 digit account number, 4 digit expiration date, 3 digit CVV code, and the owner's name. Credit card information is often given over the telephone, with the caller spelling her name. Our proposed 15-digit codes contain letters that require spelling when relayed over the telephone. But I expect that relaying a debit code over the phone will require about the same amount of time as a credit card number. I also expect errors to be about as common. Experiments are needed to determine the best code composition and length.
Codes for Software
Some codes are generated and handled by software. Such codes are rarely displayed to a user, so they can be longer and more complicated. Such applications could use an alphabet of all 188 valid characters, for very compact representation. A 128-bit random number fits into a 17-digit base-188 number.
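The base-188 figures can be checked with a short sketch. The alphabet order (ascending byte value) and the names `digits_needed` and `encode_base188` are assumptions; the text fixes neither.

```python
import secrets

# All bytes matching ValidChar, in ascending order.
VALID_BYTES = bytes(
    b for b in range(256)
    if 0x21 <= b <= 0x7E or 0xA1 <= b <= 0xAC or 0xAE <= b <= 0xFF
)


def digits_needed(bits: int, base: int) -> int:
    """Smallest n such that base**n >= 2**bits."""
    n = 0
    while base ** n < 2 ** bits:
        n += 1
    return n


def encode_base188(value: int, length: int = 17) -> bytes:
    """Write value as a fixed-length base-188 byte string."""
    base = len(VALID_BYTES)
    if not 0 <= value < base ** length:
        raise ValueError("value does not fit in the requested length")
    out = bytearray()
    for _ in range(length):
        value, remainder = divmod(value, base)
        out.append(VALID_BYTES[remainder])
    out.reverse()
    return bytes(out)


software_code = encode_base188(secrets.randbits(128))
```

Note that this alphabet includes '&', so a random code could by chance contain a substring matching HexEntity.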
TODO: decide whether or not to remove '&' from the ValidChar range: ValidChar = %x21-25 / %x27-7E / %xA1-AC / %xAE-FF
Discussion
KenLeonhard: Another thought is to only use letters and numbers that are easily distinguishable with simple voice recognition software. That might reduce errors as people could, for example, simply read off the code into their cell phone instead of having to type it in. Although it would result in longer codes.
- ["MisterN"]: If you don't use only codes of size 15, but also codes with sizes 1-15, or say only sizes 3, 6, 9, 12, 15, you could encode more information with fewer bytes. Who doesn't have trouble with long codes?
MichaelLeonhard: This is a good idea. The question is, "How much more information can we encode?"
20^15 = 3.2768 x 10^19 = 2^64.829
20^1 + 20^2 + ... + 20^15 = SUM i=1..15 20^i = 3.4493 x 10^19 = 2^64.903
- 64.903 - 64.829 = 0.074
- So by including the codes that are less than 15 digits long, we gain only 0.074 bits. I guess the gain is so small because the base is large. For base-2 numbers, one gains almost exactly one bit by including shorter numbers.
- ["MisterN"] says: Of course, not all codes should be valid at any point of time. But why not take it further and create honeypot codes, which, when someone tries to use them, are logged and analysed and the user can be blocked for some time.
MichaelLeonhard responds: A very small percentage of codes will be valid at any bank, perhaps 1 / 2^40. I expect that the bank would log every transfer attempt, no matter what code is provided. So every invalid code would be a 'honeypot code'.
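For anyone who wants to re-run the variable-length comparison from earlier in this discussion (codes of exactly 15 digits versus codes of 1 to 15 digits), here is a small sketch. The function name `bits_gained` is illustrative.

```python
import math


def bits_gained(base: int, length: int) -> float:
    """Extra bits from allowing codes of 1..length digits instead of exactly length."""
    exact = base ** length
    up_to = sum(base ** i for i in range(1, length + 1))
    return math.log2(up_to) - math.log2(exact)


gain_base20 = bits_gained(20, 15)   # roughly 0.074 bits
gain_base2 = bits_gained(2, 15)     # just under 1 bit
```

The gain approaches log2(base / (base - 1)) as the length grows, which is why it shrinks for larger bases.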
