Stacoscimus Blagoblog

Communicating Numerical Precision

At Nomi, we've built an inference engine that accepts real-time data from several different sources to provide a single best estimate of the traffic in a store. This estimate is produced as a floating point value in the world of real numbers, not as an integer value in the world of natural numbers, like counts really ought to be. Our inference engine operates on a continuous number line using probability distributions, so it's perfectly normal for the expected value of a visit count to be something like 43.2476784927658.

But, of course, in no world is it possible to have, say, 43.2476784927658 visitors enter a store in a single day.

It's been an ever-present challenge to effectively communicate our level of confidence in the estimates. The goal, of course, is to be completely transparent and earn trust. However, different audiences are likely to have completely different ideas of what appears trustworthy and what appears suspicious.

Really, there are two goals to work toward:

  1. Providing maximal benefit for customers.
  2. Instilling maximal confidence in the data.

These two goals are complicated to meet simultaneously because the consumers of the numbers have different styles of sophistication. I'll describe four perspectives.

Statistical Perspective

Statisticians treat measured or calculated numbers as point estimates, and often describe precision in terms of statistical variance (or equivalently, a standard deviation). Both the point estimate and its calculated variance can be carried around with arbitrarily high precision, but inference is only made when combining the two in a formal statistical test. A good statistical way to present estimated visit counts might be as a 95% confidence interval, wherein we expect 95% of intervals so constructed in experimental replications to contain the true value. That is, perhaps we would want to say the visits fall within (249, 334) with 95% confidence.
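For a concrete feel, here is a minimal sketch of how such an interval could be built from a point estimate and a standard deviation. It assumes a normal approximation (the 1.96 multiplier), and the inputs 291.5 and 21.7 are made-up numbers chosen to land on the interval above; none of this is a description of how our engine actually computes its bounds.

```javascript
// Sketch only: a normal-approximation interval around a point estimate.
// `estimate` and `sd` are hypothetical inputs, not output from a real engine.
function confidenceInterval(estimate, sd, z = 1.96) {
  const halfWidth = z * sd; // z = 1.96 corresponds to roughly 95% coverage
  return { lower: estimate - halfWidth, upper: estimate + halfWidth };
}

const interval = confidenceInterval(291.5, 21.7);
console.log(`(${Math.round(interval.lower)}, ${Math.round(interval.upper)})`);
// prints "(249, 334)"
```

Note that the rounding happens only at display time; the estimate and its standard deviation stay at full precision until then.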

Statisticians distrust people who present obviously estimated numbers without standard deviations or confidence bounds.

Scientific Perspective

Scientists developed the notion of significant figures to deal with measurement precision. They know that when weighing a chemical, small air currents cause a scale's last few digits to fluctuate. A scientist will record the mass by writing down all the stable "trusted" numbers, as well as the first unstable "untrustworthy" number. This is a heuristic, base-ten way of thinking about precision in measurements.

A chemist would distrust people who present estimated visits as 24,543 since it doesn't communicate measurement precision. Instead, a chemist might trust a number like 24,500 -- with three significant figures.
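Rounding to a fixed number of significant figures is easy to sketch in Javascript with the built-in Number.prototype.toPrecision. The helper name roundToSignificantFigures is my own invention for illustration.

```javascript
// Round a number to a given count of significant figures.
// toPrecision returns a string (for example "2.45e+4"), so Number()
// converts it back to a numeric value.
function roundToSignificantFigures(value, figures) {
  return Number(value.toPrecision(figures));
}

console.log(roundToSignificantFigures(24543, 3));            // 24500
console.log(roundToSignificantFigures(43.2476784927658, 3)); // 43.2
```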

Savvy Intuitive Perspective

An analytics-savvy person without statistical or scientific training is still aware of the idea of measurement error, and probably knows about sampling error for polls. They might trust somebody who provides numbers with confidence intervals disguised as "+/- 5%", and may or may not care whether the point estimates are rounded, or just what the confidence level of the interval was.

They may or may not get suspicious if numbers are presented without these sampling errors.
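A "+/- 5%" style presentation might be produced along these lines. This is only a sketch: formatWithMargin is a hypothetical helper, and it assumes the half-width of the interval is already known and is simply re-expressed as a percentage of the estimate.

```javascript
// Present an estimate with its margin expressed as a rounded percentage.
// `halfWidth` is assumed to be half the width of some interval around
// `estimate`; the confidence level behind it is deliberately not shown.
function formatWithMargin(estimate, halfWidth) {
  const percent = Math.round((100 * halfWidth) / estimate);
  return `${Math.round(estimate)} +/- ${percent}%`;
}

console.log(formatWithMargin(1000, 50)); // prints "1000 +/- 5%"
```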

Intuitive Perspective

An intuitive person with little experience interpreting measurements might be unfamiliar with concepts like precision, estimation, or measurement error. Instead, they might expect all numbers to be exact if they are counts and rounded if they're percentages. In this case, trust comes from producing a plausible value on first principles --- a natural number for a count, and nothing with decimal places at all.

A naive intuitive consumer of data might get suspicious if they see visit counts like 24,500 because having an exact multiple of 100 is unlikely. They might also distrust a set of percentages that do not add to 100% merely because of rounding error.
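The percentage problem is easy to reproduce. In this small sketch (the three equal segments are invented data), each share is rounded independently and the total comes up short of 100:

```javascript
// Convert counts to whole-number percentages, rounding each one on its own.
function roundedPercentages(counts) {
  const total = counts.reduce((sum, count) => sum + count, 0);
  return counts.map((count) => Math.round((100 * count) / total));
}

// Hypothetical data: three segments with identical visit counts.
const percentages = roundedPercentages([1, 1, 1]);
const percentTotal = percentages.reduce((sum, p) => sum + p, 0);

console.log(percentages);  // [ 33, 33, 33 ]
console.log(percentTotal); // 99
```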

Javascript Thought of the Day

Now, Javascript surely doesn't have the best reputation around for being a well-designed language, but I often feel like being an apologist for it, much in the same way that I truly enjoy the ethos and aesthetics of Bash as a programming language in its own right. They might have warts, but they're not arbitrary. Although Gary Bernhardt's now-classic WAT highlights some ridiculous consequences of Javascript's type conversion system, they are nonetheless explainable in a consistent way. In the JSC interpreter:

> [] + []

> [] + {}
[object Object]
> {} + []
0
> {} + {}
NaN

So while this is certainly WAT, it also can be explained relatively easily. The + operator can add numbers, concatenate strings, or, as a unary prefix, coerce its operand to a number. So, [] + [] is really the empty string, and [] + {} is really the empty string plus the string representation of {}, which is '[object Object]'. In fact, the Node interpreter makes this clear:

> [] + []
''
> [] + {}
'[object Object]'

Then, the final two are interpreted by JSC as empty code blocks followed by something that must be coerced into a numeric value via its string representation. Roughly:

> + Number([].toString())
0
> + Number({}.toString())
NaN
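The same reasoning can be checked piece by piece in a script, where nothing is mistaken for a code block. This is just a sketch of the individual coercions; the variable names are mine.

```javascript
// Binary +: both operands end up as strings via toString().
const emptyArrayAsString = String([]);   // ""
const objectAsString = String({});       // "[object Object]"
const arrayPlusArray = [] + [];          // ""
const arrayPlusObject = [] + {};         // "[object Object]"

// Inside parentheses, {} can only be an object literal, never a block,
// so these are ordinary string concatenations.
const objectPlusArray = ({} + []);       // "[object Object]"
const objectPlusObject = ({} + {});      // "[object Object][object Object]"

// Unary +: the operand is converted to a number via its string form.
const unaryPlusArray = +[];              // 0, because Number("") is 0
const unaryPlusObject = +{};             // NaN, because Number("[object Object]") is NaN

console.log([arrayPlusArray, arrayPlusObject, objectPlusArray, objectPlusObject]);
console.log(unaryPlusArray, unaryPlusObject);
```

So the surprising 0 and NaN only appear when the leading {} is read as an empty block and the + that follows is read as a unary operator.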

So these at least have some straightforward logical explanations, based on understanding Javascript's parsing and type coercion. But how on earth could we allow this to happen?

> Boolean([0])
true
> [0] == true
false

That the path to a boolean using Boolean is separate from the path to a boolean using == might seem downright nuts, except when one realizes that the == operator is used for far more than truth comparisons. The two use entirely different algorithms for determining the final evaluation.
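Here is a rough trace of the two paths, written out as separate steps. It is a simplified sketch of the loose-equality rules for this one case, not a full implementation of the algorithm.

```javascript
// Path 1, Boolean(): every object is truthy, and an array is an object.
const viaBoolean = Boolean([0]);                    // true

// Path 2, ==: the boolean operand is converted to a number first...
const rightSide = Number(true);                     // 1
// ...then the array is converted to a primitive (its string form)...
const leftAsString = String([0]);                   // "0"
// ...and that string is converted to a number for the final comparison.
const leftSide = Number(leftAsString);              // 0
const viaLooseEquality = leftSide === rightSide;    // false, same as [0] == true

console.log(viaBoolean, viaLooseEquality);          // true false
console.log([0] == false);                          // true, since 0 == 0
```

In other words, [0] is truthy, yet it is loosely equal to false.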

Posting code in WordPress

Looks like I'll be using this blagoblog as a platform for messing around a bit with code. Particularly, I suspect I'll try to start sharing a few things here or there that I learn as I try to hack into NES chips, and demonstrate what's going on with the synthesizers inside.

This is what a Github Gist looks like when embedded, simply by pasting the script tag directly in:

Another option would be to use HTML code and pre tags as formatted using the Crayon plugin. Here's the result of "pre" tags, after removing all of the overzealous eye candy that Crayon comes in with.

Or pre tags:

import derp

print derp.blat()

Yup, it looks like using code and pre will be the best option! Three cheers for editing in raw text mode! Now all that remains is to figure out how to get the WordPress editor to ignore hard line breaks when I put them in paragraph tags. Ooof!

UPDATE: Adding remove_filter( 'the_content', 'wpautop' ); to the functions.php file for a theme will take care of the auto-insertion of br and p tags. Huzzah!

Nomi!

I'm happy to say that my customer engagement startup, Nomi, has secured Series A funding! Here's looking forward to ever-improving products and fascinating analytics.

Crowd-sourcing or Expert-sourcing?

Crowd-sourcing is sometimes pitched, deliberately or not, as a sort of cure-all artificial intelligence, when in fact it is essentially averaging. The problem is, of course, that averaging eliminates individual differences and homogenizes instead of specifying. Naive "crowdsourcing" would lead every single shoe to be size 9.

As always, useful/interesting inference comes from intelligently specifying sources of variance.

Machine Learning or Statistical Learning?

After spending plenty of time jumping back and forth between "machine learning" folks and statisticians, I finally am learning how to translate between these two communities, who do essentially identical things.

Machine Learning is a specialty of engineers, and an outgrowth of artificial intelligence research. People with machine learning expertise talk about prediction and learning using "features." This learning can be supervised or unsupervised. Machine learning folks tend to find brute-force solutions to problems with effective algorithmic optimizations. Even when Monte Carlo techniques are the enlightened machine learner's best friend, sampling methodologies don't often enter into the equation.

Statisticians are applied mathematicians, and many are methodological philosophers. Statisticians see the world in terms of central tendency and dispersion instead of hits and misses, and we use variables instead of features. We infer and predict, using regression and estimation (supervised) or clustering (unsupervised). While using the same basic techniques and technologies as machine learners, we live at the population level instead of the unit level. That's why sampling methods are so important to us, why we allocate variance, and why we lean on mathematical theorems.

The frequentist/Bayesian distinction is trivial compared to the statistician/engineer divide, at least in terms of tradition. But in the end, the tools and techniques overlap hugely. Learning both perspectives is always the best.

iRb Corpus Released

After a solid year of conversion, editing, and tagging, the iRb corpus is available in **jazz format at the Cognitive and Systematic Musicology Laboratory website. This corpus represents some 1186 individual songs, each encoded into **jazz, a new Humdrum-like specification.

Using tools from Craig Sapp's Humdrum extras as well as the bundled jazzparser.sh script, you can convert individual song files to the following sort of flat representation:

**jazz    **kern  **exten **solfa **mint  **quals **dur
*thru     *thru   *thru   *thru   *thru   *thru   *thru
*M4/4     *M4/4   *M4/4   *M4/4   *M4/4   *M4     *M4/4
*D-:      *D-:    *D-:    *D-:    *D-:    *D-:    *D-:
2E-:min7  E-      min7    re      [E-]    min7    2.0000
2B-7b13   B-      7b13    la      P5      dom     2.0000
=         =       =       =       =       =       =
2E-:min7  E-      min7    re      P4      min7    2.0000
2A-7      A-      7       so      P4      dom     2.0000
=         =       =       =       =       =       =
2D-:maj7  D-      maj7    do      P4      maj     2.0000
2G-7      G-      7       fa      P4      dom     2.0000
=         =       =       =       =       =       =
*-        *-      *-      *-      *-      *-      *-
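Once a song is in this flat form, it is straightforward to pull into a script. The following is a minimal sketch, not part of the corpus tooling: parseFlatJazz is a hypothetical helper, and it assumes whitespace-separated columns with no spaces inside a token, a header line beginning with "**", interpretation lines beginning with "*", barlines beginning with "=", and comment lines beginning with "!".

```javascript
// Parse the flat multi-column representation into one record per chord.
function parseFlatJazz(text) {
  const lines = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "");

  // The header names the columns, e.g. "**jazz" becomes "jazz".
  const header = lines.find((line) => line.startsWith("**"));
  if (!header) return [];
  const columns = header.split(/\s+/).map((name) => name.slice(2));

  return lines
    // Skip headers/interpretations (*), barlines (=), and comments (!).
    .filter((line) => !/^[*=!]/.test(line))
    .map((line) => {
      const fields = line.split(/\s+/);
      const record = {};
      columns.forEach((name, i) => {
        record[name] = fields[i];
      });
      return record;
    });
}

// A few lines copied from the example above.
const sample = [
  "**jazz    **kern  **exten **solfa **mint  **quals **dur",
  "*M4/4     *M4/4   *M4/4   *M4/4   *M4/4   *M4     *M4/4",
  "2E-:min7  E-      min7    re      [E-]    min7    2.0000",
  "2B-7b13   B-      7b13    la      P5      dom     2.0000",
  "=         =       =       =       =       =       =",
  "*-        *-      *-      *-      *-      *-      *-",
].join("\n");

const chords = parseFlatJazz(sample);
console.log(chords.map((chord) => chord.quals)); // [ 'min7', 'dom' ]
```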

Because date information is available, large diachronic studies are possible. For a taste, here is a plot of changes in jazz chord usage over time:

A complete writeup will soon be available in Music Perception.

Panksepp's Affective Neuroscience

Jaak Panksepp's most excellent book was a stunningly thorough explication of a particular view of emotions---that they originate in evolutionarily ancient neural pathways which are profitably studied using animal models.  The conclusion is immediate: animals such as dogs, cats, rats, and apes most likely do experience emotional states, and moreover homologous systems mediate human emotions.

The systems Panksepp details in fine neurochemical, neuroanatomical, and behavioral detail are SEEKING, RAGE, FEAR, LUST, CARE, PANIC, and PLAY, capitalized to distinguish the theoretical construct from the common use of these terms.  Uniting experimental results from animal research with common psychopathologies, Panksepp self-consciously makes the necessary conceptual steps to paint a fascinating picture of the neurobiological basis of human affect.

Panksepp's Figure 3.5, "The Major Emotional Systems."

Best read in a long time; very highly recommended.  More here.

MS Statistics Complete

As of today, I've earned my Master of Science in Statistics.  It's been a fantastic journey these last two years---I couldn't have foreseen how much statistics would speak to me.  My advisor, David Huron, had merely suggested I take some statistics courses.  Wanting to learn the theory underlying the practice, I enrolled in the first graduate statistical theory course without even knowing what a cdf was.  Special thanks to the excellent instructors I've had the pleasure of working with and learning from: Steve MacEachern, Noel Cressie, Mario Peruggia, Michael Fligner, Douglas Critchlow, Chris Hans.

Statistics isn't only a broadly-applicable toolkit for solving problems, it's also a way of looking at the world, and choosing which joints should be sharply carved and which must remain fuzzy.  I'm eager to begin applying this technology to important and interesting problems!

ABD in Music Theory

I've now passed my candidacy exam for my doctoral studies in Music Theory, making me officially All-But-Dissertation (ABD)!  This means I'll be able to spend my fellowship year working on a dissertation and shore up several papers for publication.  Special thanks to David Huron, Gregory Proctor, David Clampitt, and Robert Ward for sitting on my committee and wading through 150 pages of Yuri-style purple prose!
