Sunday, January 27, 2008

Helium Knight

Helium wasn't first discovered on Earth. The first place we found it was on the Sun. It was first "discovered" by Joseph Lockyer, who didn't so much discover it as propose that some as-yet-unknown mystery element Helium was responsible for some unexplained absorption lines in the visible spectrum of our sun. Twenty five years later it was discovered on Earth, as well.

He was knighted for his discovery in 1897. From what I have read he was a good scientist with many contributions to knowledge from which I have no doubt personally benefited. His knighthood should remind us to be humble with regard to that which we think we know. He was honoured for (now read this slowly) finding the second most abundant element in the whole universe. In all of recorded human history, we hadn't noticed the second most abundant thing in the whole freaking universe. Wow that's pretty embarrassing to any technological hubris we might have had. It's like living at the beach and not noticing there's sand until years later when somebody says "hey, look what I found under my feet". I'm pretty sure that's not the last time something like that is going to happen, either.

Thanks, Sir Lockyer, for striking a massive outward dent into the limits of the opaque shell of human knowledge. It turns out that ignorance is the most abundant thing in the universe, and you banished some of it.

Burton MacKenZie www.burtonmackenzie.com

Friday, January 25, 2008

C/C++/LISP to DNA Compiler

Today J. Craig Venter Institute scientists announced they have built the first bacterial genome from the raw chemical components of DNA.

If this process can be done entirely by machine instruction, we could build a programming interface to assembly of DNA. In other words, we could make a C/C++ to DNA Compiler. (also from the other language, LISP) We could literally design a new functioning organism from the ground up.

I'm not looking forward to the bugs in the code. (eg. "remember that bacteria we created to eat oilspills? it turns out it also attacks the oily animal fat in mammals and now we're all going to die")

Somebody's going to write a literal software virus that targets biologicals, and it will knock out the population. It will be the work of a script kiddie.

Maybe somebody will write a really useful organism, like something that targets amyloid beta plaques in the brains of Alzheimer's patients. Then, the helpful djinni will mutate to attack our immune cells and destroy us all.

If we ever actually create artificial intelligence, we're not going to do it first in silicon. We need the efficient raw processing power of a biological system to get the computational complexity we need. My money's on the first artificial intelligence arising, if anywhere, in artificially designed flesh and blood.

It is probably a muchos bad idea to design an unrestrained intelligence that lives in our ecosystem and could outcompete us. It would be like unleashing a new kind of cockroach into our ecosystem, except the cockroaches would be smarter than us.

All in all, as much as I'd love to have a C/C++ to DNA compiler, Pandora's Box seems mighty big. It's probably inevitable, but even still, I have hope.

Burton MacKenZie www.burtonmackenzie.com

Update: Here's some related informational links:

  • From the J.Craig Venter Institute overview on Synthetic Genomics, "Synthetic genomics combines methods for the chemical synthesis of DNA with computational techniques to design it."
  • DNA 'Fabricator' constructs walking DNA, "Just as computer languages let programmers create any number of applications, the researchers behind the approach predict that biochemical programming "languages" inspired by their work could let bioengineers create any number of desired molecular products and processes". (At time of this update, the newscientist original was down, but you should try it anyway)
  • From Nature 451, 318-322 (17 January 2008), Programming biomolecular self-assembly pathways.
  • Start Hacking Life: Desktop DNA Synthesizers, "your new DNA compiler ... fits on your desk and can take genetic code that looks like this, gatcctccat, and turn it into an actual DNA sequence. "

Wednesday, January 23, 2008

Did English Wine Merchants develop and use a binary number system in the 13th century?

Knuth [1] states the first known appearance of pure binary notation was about 1605 with Thomas Harriot [2], but he suggests another intriguing idea. He lists the common units of liquid measure used by English Wine Merchants as late as the 13th century. Specifically,

2 gills = 1 chopin
2 chopins = 1 pint
2 pints = 1 quart
2 quarts = 1 pottle
2 pottles = 1 gallon
2 gallons = 1 peck
2 pecks = 1 demibushel
2 demibushels = 1 bushel (or firkin)
2 firkins = 1 kilderkin
2 kilderkins = 1 barrel
2 barrels = 1 hogshead
2 hogsheads = 1 pipe
2 pipes = 1 tun

This is effectively a binary counting system using a 14 bit word. Each measure, twice as large as the last, has its own name. If you represent each named unit as a single bit and combine them all, you get a binary number. For instance, if I had 1 hogshead, a firkin, a gallon, and a pint of wine, I could write the measured amount of wine as 0b00100100100100 gills, where "gills" is the LSB (least significant bit) and "tuns" is the MSB (most significant bit). ("0b" at the front is a common programmer way to say "the following are digits in binary")

Compare this with our system where we might write 1 litre and 13 millilitres of volume in base 10 as 1013 millilitres. The wine merchants had names for each power of 2 of liquid measure, whereas in the metric system we have prefixes for each power of 10, allowing us to extend their use to all units of measure.

If this is true, I am impressed. The English wine merchants would have been cogently representing measures in the 2nd most optimal integer base as well as competent in "everyday" binary calculations. I may have found a new respect for the imperial system of measure which I always thought was somewhat arbitrary.

Is this only legendary? I don't know where Knuth [1] got his conversion factors. All the modern references I find now generally don't agree with his conversions. Standards change, but I could find no record of it. According to google, here's some the current conversion factors

2 firkin = 1 kilderkin
1.4340494 kilderkin = 1 barrel
2.03225806 barrels = 1 hogshead

As you can see, they're not quite factors of 2. Were they never factors of 2 (and Knuth was misinformed or fudging) or did the standards drift? I don't know for sure, but when examining the conversion of two units that are supposed to be equal (firkin and bushels)

1 firkin = 1.16106426 US bushels

it seems that there is a disparity between UK and US designations, which suggests drift. It is believable that Knuth was reporting historically and the online references are modern standards that drifted from the original measures. Unfortunately Knuth doesn't state any sources for his unit equivalences. If anybody has better information, please leave a comment, preferably with a link. Until there is a solid verifiable reference, this idea unfortunately can only be legendary. I hope it's not; the idea of 13th century wine merchants conducting business in binary is just too cool.

Burton MacKenZie www.burtonmackenzie.com



[1] Donald E. Knuth, "The Art of Computer Programming, Volume 2: "Seminumerical Algorithms", Second Edition, Addison-Wesley, 1981, p. 183

[2] and the first published discussion was in 1670 by the bishop Juan Caramuel Lobkowitz.

Wednesday, January 16, 2008

James Woods Frankenstein

While flipping through an old copy of How to Care for Your Monster, I spotted this picture.

"Wow", I thought, "James Woods modeled as the Frankenstein Monster back in the 70s! He sure is a multi-faceted actor."



I really enjoyed How to Care for Your Monster when I was a kid, and it's just as enjoyable now. I also know two little ones that are absolutely in love with my decades old copy of it, insisting on readings from it nightly. You can still get them (as per my links to amazon), but only used copies (It's out of print). The author is Norman Bridwell, also known for his much more famous books about Clifford the Big Red Dog.

If James Woods ever comes through town again, maybe I could get him to sign it. How many people have pictures of the Frankenstein Monster signed by James Woods? Not many, I bet.

Burton MacKenZie www.burtonmackenzie.com

Sunday, January 13, 2008

Adsense advertisers bid on keywords for preferred content-aware ad placements. When people click on my google ads on the right, I get paid a few pennies. (Sometimes a lot of pennies) It's in everybody's (google, the advertisers, the webhost, and the readers) best interests for these ads to be as relevant and intriguing as possible.

Last year I put up an allegedly funny image called "Spiderman Masturbates!". Check out the screen grab of the advertising for the relevant keyword-related product:

If you're shooting off knucklebabies like Spiderman, maybe you need facial tissue by the "case pack".

Burton MacKenZie www.burtonmackenzie.com

Sunday, January 06, 2008

Our species has 10 fingers, and we (mostly) count in base 10. It's easy to learn. Base 10 seems natural, but it hasn't always been the favoured base. The Mayans used base 20 (vigesimal), the Babylonians used base 60 (sexagesimal), and our computers use base 2 (binary). What makes our base 10 so special, and more importantly, is 10 the optimal base we could use? Is there a better one?

First, we have to define what is meant by optimal and better in the context of a numeric base (or radix). The only definition of optimal numeric base that I know is the base at which the area A = (#digits in a given number) * (#unique symbols in base) is minimized for all numbers. That is, the optimal base is where, given all numbers that can be represented, the least "space" to represent them is the least amount of area given by the product of the number of digits in a number and the number of symbols required in that base system. [1]

Using least "area" as our metric for most optimal, the area is given by A = D * B. B = the base, which is also the number of symbols used in the given base; D = the number of digits which can be modeled with the function D(B,N) = logB(N) (or more exactly, it is the index of the leading digit needed to represent the number). Since over all numbers we also consider numbers approaching infinity, this equation for area doesn't tell us much in its present form, so instead we can work with a derivative of it (pun intended). As any intro calculus student will tell you, the maximum or minimum point of a concave surface (in this case, A = D * B) is where d2A/(dD*dB) = 0. If this can give us a solution independent of N (the value of any given number), then there is an optimal base for representing all numbers.

d2A/(dD*dB) = d/dD ( d/dB ( D(B,N) * B ) )
= d/dD ( D(B,N) + B * d/dB ( D(B,N) )
= d/dD [ D(B,N) + (-1)*D(B,N)/ln(B) ]
= 1 + (-1)/ln(B)

This is the slope of a concave up [2] curve, where the optimal base occurs at the minimum point and the slope equals zero (0).

1 + (-1)/ln(B) = 0

or

B = e

It turns out e is the optimal numeric base! I don't find this particularly surprising; e is everywhere. (Not to mention lack of surprise from having previously read the final answer) The problem with this is that we humans tend to use integer bases. Having an irrational base isn't unheard of, but we humans don't have much intuitive use for it. [3]

Optimal Integer Base
Just as Aiken and Tropp discussed [1], three (3) is thought to be the most optimal integer numeric base because it is closer to e than two (2).

Is three really the best integer base? We can do better than just saying "it's closer to e than 2". (but to be fair, that's not an unreasonable thing to say) The function is non-linear, which makes the immediate obviousness/truth of the statement questionable. [4] The original equation for the symbolic-spatial area used by representing any number is a function of the base and the value of the number, so it's hard to use it to study a trend for all numbers. The first derivative (with respect to #digits and base), however, is only a function of the base, and is a lot easier to work with.

A'(B) = 1 - 1/ln(B)

Plotting this function over the domain (2:3) shows not only the optimal base to use (where B = e, at the zero crossing), but just as importantly, it shows the rate of change in the representational area as a function of (change of) Base.

To decide whether 2 or 3 is closer to the optimal base in a more quantitative way, we can integrate A'(B) from 2 to e, and again from e to 3, and compare which of the two areas between the curve and the B-axis is smaller in magnitude. It should be obvious from the curve of A'(B), above, that the area between the curve and the B-axis is considerably smaller between e and 3. [5] This means that if we use a base of 3, there is less change of the total area required to represent numbers. So, three really is the optimal integer numeric base, assuming the original metric for the definition of optimal is reasonable.

This has implications on more than just representing numbers. Three as an optimal base also gives insight into optimal menus. It provides insight into tree traversal/creation. If we switch to a tertiary number system implemented in silicon, we may be able to implement an equal size of integer processing (e.g. the equivalent of 32 bits, or a max of 4 294 967 296, can be represented within 21 tits of ternary, with a max of 10 460 353 203) in a slightly less amount of silicon. Alternately, you might implement a larger numeric representation in a roughly equivalent silicon area. (I haven't done a formal analysis, myself, but I have seen some papers published on it) All the opinions I recall reading on this, though, have suggested that we're currently too deeply ingrained in binary physical implementations such that switching would probably be a bit of a setback given our 50ish years of optimizing for binary implementations. I suspect that if a ternary implementation really is better, it will only get explored if we hit a wall with Moore's Law for our current setup.

The Second most optimal integer base
We still don't know what the second most optimal integer numeric base is, but it's probably 2 or 4. Again, refer to the same expression of A'(B), but now over the domain (2:4), and do the same integration on either side of the optimal point of B = e.

Visually, it looks rather close, but numerically it's no contest. The area from 2->e is 0.132 and from e->4 is 0.209, so 2 is less of an area change from e and is a more optimal integer base than 4. Four (4) still gets the bronze medal for integer bases, though, and still deserves its props. Not only is base 4 (quaternary) the third most optimal integer numeric base, but it is strongly related to the second most optimal integer base, binary, like octal and hexadecimal. Base 4 is the native numeric base of DNA (it has four characters, A G C T, to represent Adenine, Guanine, Cytosine, and Thymine). Base 4 is still more optimal than the one currently in use, base 10.

Where does base 10 (decimal) rank
Where does decimal (base 10) rank in this scheme? Once more, here is the same curve of A'(B) plotted this time over the domain (2:10).

After finding the order of the first few, the rest fall in line. If you were to integrate in the same way, the ranking for optimal positive integer numeric bases are:

  1. base 3 (ternary)
  2. base 2 (binary)
  3. base 4 (quaternary)
  4. base 5 (quinary)
  5. base 6 (senary)
  6. base 7 (septenary)
  7. base 8 (octal)
  8. base 9 (nonary)
  9. base 10 (decimal)
I don't know if base 1 (unary) should get a mention here; it smells like a special case. However, based on the same evaluative criterion as above, it still doesn't rate in the top 10 - it rates between base 14 (tetradecimal) and base 15 (pentadecimal). Most don't realize that when you are counting on your fingers, you are actually representing the count value in base 1, not base 10. A bonus of this is that the base 1 symbol is space rotatable. Place-value systems (like decimal) are not space rotatable; you cannot have the digits in a nonspecific order. With unary you can hold up any consistent number of fingers in any (physically possible) order and still mean the same value. (i.e. it has non-unique representations)

For those interested in irrational bases [3], based on these calculations base π is more efficient that base 2, as well, but only by about half as much as base 3.

Burton MacKenZie http://www.burtonmackenzie.com/

[1] This is not my definition. I'm having a hard time finding where I originally found it. I thought it was in Hardy's "A Mathematician's apology", but I've been combing through that book and come up dry. At the time I read the definition (not the date it was published), the author stated "and this has been known for at least a hundred years". I'm still looking for a real reference. Please let me know if you find one. In the "Computer Oral History Collection, 1969-1973, 1977" from the Smithsonian National Museum of American History on page 140, Howard Aiken says: "...this is a much more efficient system than the binary number system...", to which Henry Tropp replies "Well, theoretically, e is the optimal phase and three is a little bit closer approximation to e than two is".

[2] On an unrelated tangent, "Concave Up" by Jesse Reklaw is a great collection of Dream comix.

[3] From Mathworld, "Bergman (1957/58) considered an irrational base, and Knuth (1998) considered transcendental bases. This leads to some rather unfamiliar results, such as equating π to 1 in base π, π = 10π.

[4] For instance, y(x) = A*x is a linear function. If x=4b, y(x)=y(4b)=A*4b = 4*(A*b) = 4*y(b). Four times its argument results in the function being four times as big.
y(x) = x*x*x is a nonlinear function. If x=4b, y(x)=y(4b)=4^3*b^3 = 64*b^3 = 64*y(b). Four times its argument results in the function being sixty-four times as big! It does not increase linearly.

[5] I couldn't be bothered to integrate A'(B) over B, as it turned out to be longer than a five minute problem. Luckily engineers are more pragmatic and we don't require an exact value. Inspection of the relative areas under the curve is sufficient. A quick numerical integration (rectangular, 1000 slices per side) gives the results of the area from 2->e is 0.132 and from e->3 is 0.0132. So, there is an order of magnitude less change using base 3 than there is using base 2.