Thursday, September 26, 2013

Base93 - Integer Shortening in C#

base93

This post will describe a technique of shortening an integer by approximately 50%.  For example, the approximate coordinate extent of continental United States in web Mercator can be expressed in the following JSON representation.
{"xmin":-15438951,"ymin":1852835,"xmax":-5420194,"ymax":7485635},

…or simplified to:
-15438951,1852835,-5420194,7485635

…or shortened to:
-ji5l,2sk},cGyR,70o#

The principle being employed here is that numbers are normally expressed using the decimal or base ten numerical system.  This system requires ten characters (or “digits”), 0 to 9, to formulate a number.  However, because ASCII has 95 printable characters we can encode numbers using base 93 which can result in printed strings that are 50% shorter than their decimal cousins.  For example:

-15438951 to –ji5l, or
1852835 to 2sk}

Why not use all 128 ASCII characters?  Well. 33 are non-printable and the characters “,” and “-“ are reserved for delimiter and negation respectively.

Lastly, to reduce the overall length of extents, the xmax is expressed as a width and ymax as a height.  At small scales the benefit of this technique is negligible (if any) but significant at large scales.

Please see below for source code and test method.

using System;
using System.Diagnostics;
using System.Globalization;
using System.Linq;

namespace ESRI.PrototypeLab.Base93 {
    public static class Base93Extension {
        private const string CHRS =
"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" +
" !?\"'`^#$%@&*+=/.:;|\\_<>[]{}()~"; public static string ToShortenedString(this int num) { string result = string.Empty; if (num < 0) { result += "-"; num = Math.Abs(num); } int len = CHRS.Length; int index = 0; while (num >= Math.Pow(len, index)) { index++; } index--; while (index >= 0) { int pow = (int)Math.Pow(len, index); int div = num / pow; result += CHRS[div]; num -= pow * div; index--; } return result; } public static int ToUncompressedInteger(this string s) { int result = 0; int len = CHRS.Length; int index = 0; var chars = s.ToCharArray(); foreach (char c in chars.Reverse().Where(x => x != '-')) { int pow = (int)Math.Pow(len, index); int ind = CHRS.IndexOf(c); result += pow * ind; index++; } if (chars.First() == '-') { result *= -1; } return result; } internal static void Test() { int max = int.MaxValue; string com = max.ToShortenedString(); int unc = com.ToUncompressedInteger(); double x = (double)com.Length /
(double)max.ToString(CultureInfo.InvariantCulture).Length; Debug.WriteLine("Max Integer: {0}", max); Debug.WriteLine("Compressed: {0}", com); Debug.WriteLine("Uncompressed: {0}", unc); Debug.WriteLine("Compression: {0:P}", 1 - x); } } }

2 comments:

  1. Great piece of code! But why, when using 95 characters, is it called Base93?

    ReplyDelete
  2. It is called base93 because each "digit" can be comprised of any one of 93 characters. As described in the post, two of the visible 95 characters are reserved for negation ("-") and a decimal (".") indicator.

    If the goal was to represent ONLY positive integers then it would be possible to use all 95 ASCII character, and hence, you could encode in base 95.

    Thanks for the comment!

    ReplyDelete