Character-level recurrent neural networks are attractive for modelling text specifically because of their low input and output dimensionality. You have only so many chars to represent - lowercase letters, uppercase letters, digits and various auxillary characters, so you end up with 50-100 dimensions (each char is represented in one-hot encoding).
Still, it’s a drag to model upper and lower case separately. It adds to dimensionality, and perhaps more importantly, a network gets no clue that ‘a’ and ‘A’ actually represent pretty much the same thing.
The simplest solution is to discard uppercase and just use lowercase. We propose a more elegant way to deal with the two problems mentioned above: inserting special markers before each uppercase letter.
Hello World -> ^hello ^world
The resulting text is still quite readable. Of course you need to make sure there are no carets in your input to start with, but this is a minor matter: one could use any character as a marker, or invent one. Remember that a char is just a sparse vector: we can make it longer by one element, and that abstract element can be our marker.
Gents witnessing the emergence of R33, the very first char-RNN, back in the day
Here’s how to convert mixed-case text:
s = 'Hello World'
re.sub( '([A-Z])', '^\\1', s ).lower()
What we do is insert a caret befor each uppercase letter and then turn the whole string to lowercase (\1
is a backreference to a subgroup marked by parens in the first pattern; we need to quote the backslash, hence \\1
). An alternative is to perform both operations inside sub()
using a function to modify the match and return a replacement:
re.sub( '([A-Z])', lambda match: "^" + match.group( 1 ).lower(), s )
Should we need to convert stuff back, we’d use a similar construct:
s = '^hello ^world'
re.sub( '\^(.)', lambda match: match.group( 1 ).upper(), s )
A caret means “start of the line” in a regular expression, so we need to quote it with a backslash.
Does it work? It does. The network is especially quick to learn . ^
combo, representing the end of a sentence and an uppercase letter at the beginning of the next one.
The trick described above is meant for text. People have used char-RNNs for modelling other stuff. It is conceivable to use a similar gimmick for source code, or music, for example to insert bar markers - that might help a network to learn the rhytm.