Boosting, Information Geometry, and the Bayesian Framework
Boosting
Boosting is a classification algorithm based on a majority vote by a committee \( G \) of size \( m \) (members \( g_{m} \)), over training data of size \( n \).

sequential additive model
For each sequential member \( g_{m} \), incorrectly classified points are given increased weights, which increases their influence on the next member \( g_{m+1} \). Conversely, weights for correctly classified points are decreased, reducing their influence. This adaptive reweighting in effect produces a different distribution over the data at each step \( m \); examples that are easy to learn are given less emphasis through decreased weight, and vice versa.
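In AdaBoost, for example, this reweighting has a closed form. With \( \epsilon_{m} \) the weighted error of \( g_{m} \):

\[ \alpha_{m} = \frac{1}{2}\ln\frac{1-\epsilon_{m}}{\epsilon_{m}}, \qquad w_{i} \leftarrow \frac{w_{i}\,\exp(-\alpha_{m}\, y_{i}\, g_{m}(x_{i}))}{Z_{m}} \]

where \( Z_{m} \) renormalizes the weights to sum to 1: misclassified points (\( y_{i} g_{m}(x_{i}) = -1 \)) have their weight multiplied by \( e^{\alpha_{m}} \), correctly classified points by \( e^{-\alpha_{m}} \).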

weak learners
Under the PAC-learning (probably approximately correct) framework, a weak learner is defined as having accuracy only slightly better than random guessing (1/2).

A weak learner comes with the guarantee that it will weakly learn regardless of the distribution over the data.

Further, a problem can be weakly learned iff it can be strongly learned (defined as close-to-perfect accuracy).
Boosting transforms \( M \) weak learners into a strong learner, where the final hypothesis is a weighted majority vote of all the learners.
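The weighted-majority construction can be sketched as a minimal AdaBoost over decision stumps; the stump learner, its brute-force search, and the choice of M are illustrative assumptions, not from the notes:

```python
import numpy as np

def stump(X, y, w):
    """Best threshold stump (feature j, threshold t, sign s) under weights w."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] <= t, s, -s)
                err = w[pred != y].sum()          # weighted error
                if err < best_err:
                    best, best_err = (j, t, s), err
    return best, best_err

def adaboost(X, y, M=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # uniform starting distribution
    committee = []
    for _ in range(M):
        (j, t, s), err = stump(X, y, w)
        err = max(err, 1e-12)                    # guard against a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, j] <= t, s, -s)
        w *= np.exp(-alpha * y * pred)           # up-weight mistakes
        w /= w.sum()                             # renormalize to a distribution
        committee.append((alpha, j, t, s))
    return committee

def predict(committee, X):
    score = sum(a * np.where(X[:, j] <= t, s, -s) for a, j, t, s in committee)
    return np.sign(score)                        # weighted majority vote
```

For example, on six points along a line labeled (−, −, +, +, −, −), no single stump classifies all of them, but a small committee of stumps does.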


exponential loss function

The graph (omitted) shows the negative binomial log-likelihood (orange) and the exponential loss (cyan) as functions of the classification margin, where \( f(x)=0 \) is the decision boundary.
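As a quick numeric comparison of the two curves (the deviance scaling \( \log(1+e^{-2m}) \) over the margin \( m = y f(x) \) is an assumption on my part), both losses decrease monotonically in the margin, but the exponential loss explodes for badly misclassified points:

```python
import math

def exp_loss(margin):
    # exponential loss exp(-y f(x)) as a function of the margin m = y f(x)
    return math.exp(-margin)

def binomial_deviance(margin):
    # negative binomial log-likelihood, log(1 + exp(-2m)) in this scaling
    return math.log(1.0 + math.exp(-2.0 * margin))

# Both decrease monotonically in the margin; the exponential loss grows
# much faster for negative margins (badly misclassified points):
penalties = [(m, exp_loss(m), binomial_deviance(m)) for m in (-2, -1, 0, 1, 2)]
```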

monotone continuous approximation

TODO: does the high variance of the loss on incorrectly weighted points prevent overfitting? (??)

stumps, trees, feature learning
def boost_trees(data, n, k, M):
    '''
    Boosted trees over n data points, k features, M rounds.
    Greedy, top-down recursive partitioning:
    1. find the regions R_j (normally the Gini index grows the tree; here the
       splitting criterion is the weighted exponential loss, with the tree
       grown by weighted least-squares error / gradient descent)
    2. fit the mean of y_j in each region R_j to minimize the loss function
    3. the boosted tree is a forward stagewise additive model over trees,
       minimizing the loss; each tree is scaled with a weight beta_m
    4. the optimal tree size is between 4 and 8 terminal nodes
    5. subsampling a fraction theta of the data (without replacement)
       improves performance and decreases run time
    '''
    for m in range(M):
        ...
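The numbered notes above can be run as a minimal forward-stagewise loop over one-feature regression stumps (squared-error loss stands in for the exponential loss; the shrinkage value, subsample fraction, and toy step data are my choices):

```python
import numpy as np

def fit_stump(x, r):
    """Best single-split regression stump minimizing squared error on residuals r."""
    best = (np.inf, None)
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, (t, left.mean(), right.mean()))
    return best[1]

def boost(x, y, M=300, shrink=0.1, theta=0.8, seed=0):
    """Forward stagewise additive model over stumps, with subsampling."""
    rng = np.random.default_rng(seed)
    trees, pred = [], np.zeros_like(y, dtype=float)
    for _ in range(M):
        # subsample a fraction theta without replacement (point 5)
        idx = rng.choice(len(x), size=int(theta * len(x)), replace=False)
        t, lo, hi = fit_stump(x[idx], y[idx] - pred[idx])   # fit the residuals
        trees.append((t, lo, hi))
        pred += shrink * np.where(x <= t, lo, hi)           # scaled tree added (point 3)
    return trees, pred
```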
 generalized margin
There exists a proof bounding the training error, provided each weak learner has weighted error \( \epsilon_{t} < 1/2 \); the bound depends on the per-round normalization factors (the weights sum to 1 after each round).
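In AdaBoost's notation this is the standard Freund–Schapire bound. Writing \( \epsilon_{t} = 1/2 - \gamma_{t} \) for the weighted error in round \( t \), and \( Z_{t} \) for the weight-normalization factor:

\[ \frac{1}{n}\sum_{i=1}^{n} 1[H(x_{i}) \neq y_{i}] \le \prod_{t=1}^{T} Z_{t} = \prod_{t=1}^{T} 2\sqrt{\epsilon_{t}(1-\epsilon_{t})} \le \exp\left(-2\sum_{t=1}^{T}\gamma_{t}^{2}\right) \]

so as long as each weak learner beats random guessing by some margin \( \gamma_{t} \), the training error decays exponentially in the number of rounds.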
 Sequential Probability Ratio Test
Information geometry
Specifically, I am interested in the Kullback-Leibler divergence and how it relates to Bayes updating. However, I ran across a set of interesting Information Geometry notes that derive the Fisher Information Metric. It provides a nice way to think about maximum entropy as Gibbs ensembles, how that relates to information geometry (ellipsoids relative to a unit sphere), and a little bit of differentiation of logs over exponentials to derive the FIM as the covariance-matrix equivalent of a second-order differentiated partition function…

First, maximum entropy is described by an ellipsoid, which describes the fluctuations from an average observed value. For a set of observables, the ellipsoid fluctuations of each state are interdependent, and the stretch of each ellipsoid's principal axes is described by the eigenvalues of the covariance matrix.
But how can the maximum entropy be found?
Max entropy describes a state at thermodynamic equilibrium, for the chosen values of that system (see the box-volume, gas-concentration example). Mixed states are described by a density operator \( \rho \), with trace (sum of diagonals) equal to 1.
The entropy of the mixed state is defined by
\[ S = -tr(\rho \ln \rho) \]
and the Gibbs state (which has the max entropy) equals \( \rho = \frac{1}{Z}\exp(-\lambda^{i} X_i) \), where \( \lambda^{i} \) is the Lagrange multiplier, the conjugate variable of the observable \( X_i \).
The partition function: \( Z = tr(\exp(-\lambda^{i} X_i)) \)

Taking the derivative of \( \ln Z \) with respect to \( \lambda^{i} \) gives the mean values of the observables:
\[ \langle X_i \rangle = -\frac{\partial}{\partial \lambda^{i}} \ln Z \]
Taking the log of the density operator: \[ \ln \rho = -\lambda^{i} X_i - \ln Z \]
and differentiating both sides with respect to \( \lambda^{i} \): \[ \frac{\partial}{\partial \lambda^{i}} \ln \rho = -X_i + \langle X_i \rangle \]
\[ X_i - \langle X_i \rangle = -\frac{\partial \ln \rho}{\partial \lambda^{i}} \]

For observables \( X_i \), an \( n \times n \) covariance matrix can be defined: \[ \langle (X_i - \langle X_i \rangle)(X_j - \langle X_j \rangle) \rangle \]

which equals the second derivative of the log partition function: \[ g_{ij} = \frac{\partial^{2}}{\partial \lambda^{i} \partial \lambda^{j}} \ln Z \]


This equals the Fisher information metric; it is symmetric positive definite and can be thought of as an inner product on the tangent space. When a manifold parameterizes Gibbs states (maximum entropy), the observables are random variables which fluctuate around their mean values; to a first approximation, the fluctuations form an ellipsoid. The information geometry maps the ellipsoid to a unit ball (sphere): a standard for measuring the length of any vector sticking out from that point.
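A numeric sanity check of the identity (second derivative of \( \ln Z \) equals the covariance), for a classical three-state Gibbs distribution with a single observable; the states and \( \lambda \) are arbitrary choices:

```python
import math

def gibbs_probs(lam, xs):
    """Classical Gibbs distribution p_i = exp(-lam * x_i) / Z; returns (probs, Z)."""
    ws = [math.exp(-lam * x) for x in xs]
    Z = sum(ws)
    return [w / Z for w in ws], Z

def variance(lam, xs):
    """<(X - <X>)^2> under the Gibbs distribution."""
    ps, _ = gibbs_probs(lam, xs)
    mean = sum(p * x for p, x in zip(ps, xs))
    return sum(p * (x - mean) ** 2 for p, x in zip(ps, xs))

def d2_logZ(lam, xs, h=1e-4):
    """Second derivative of ln Z by central finite difference."""
    f = lambda l: math.log(gibbs_probs(l, xs)[1])
    return (f(lam + h) - 2.0 * f(lam) + f(lam - h)) / (h * h)
```

With one observable the metric is 1x1: the variance, which the finite difference reproduces.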
The lecture series also describes replicator equations for evolutionary population dynamics, entropy as learning from the environment, and mixed populations of replicators as game-theoretic mixed strategies… (write up your notes on this)
Bayesian Framework

P-values determine the significance of the frequency of a random event \( x \) from observation; under the null hypothesis, p-values themselves are uniformly distributed over [0,1] (so identical trials will generate different p-values). If the p-value falls within the 5% tail, the null hypothesis is rejected. Sample size is needed for proper interpretation of p-values. If the underlying data follow a normal distribution (or approximately so, via the central limit theorem), test statistics such as the z-test, t-test, or chi-square can be applied.
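A small simulation of the uniformity claim (the sample size, trial count, and choice of a z-test are illustrative): repeatedly testing data generated under the null gives p-values spread roughly uniformly, with about 5% below the 0.05 cutoff.

```python
import math, random

def z_test_pvalue(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test p-value for H0: mean == mu0, with known sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))  # standard normal CDF
    return 2.0 * (1.0 - phi)

random.seed(0)
# 2000 identical trials, each drawing 30 points from the null N(0, 1)
pvals = [z_test_pvalue([random.gauss(0, 1) for _ in range(30)])
         for _ in range(2000)]
# Under H0 the p-values are ~Uniform[0,1]: about 5% fall below 0.05
frac_rejected = sum(p < 0.05 for p in pvals) / len(pvals)
```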

The Bayesian approach is not limited to a single null hypothesis (reductio ad absurdum); rather, it updates the underlying distribution via a likelihood over all possible hypotheses. The update skews the prior distribution linearly (pointwise multiplication by the likelihood) based on new data. This posterior distribution, rather than being accepted or rejected at a 5% cutoff, can be incorporated into a loss function, from which model parameters can be assessed. If provided with wildly astonishing results, such as smoking preventing cancer, this new data can be weighed against prior beliefs.
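A minimal example of such an update, using the conjugate Beta prior with a binomial likelihood (the prior pseudo-counts and data are made up for illustration): the posterior is just the prior reweighted by the observed counts.

```python
from fractions import Fraction

def beta_update(alpha, beta, successes, failures):
    """Posterior Beta(alpha', beta') after observing binomial data."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    return Fraction(alpha, alpha + beta)

# Prior belief: rate around 1/2, encoded as Beta(2, 2);
# then observe 8 successes and 2 failures.
a, b = beta_update(2, 2, 8, 2)
posterior_mean = beta_mean(a, b)   # pulled from the prior 1/2 toward the data's 8/10
```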
Statistical Modeling: I have been working on an alert fatigue problem.
I'd like to assess how well alerts thresholded as hard, based on boosting weight dynamics, relate to a model of alert generation. This model is based on expert priors, alert counts, and inter-arrival times.
Finally, I'd like to take a closer look at the boosting exponential loss function; specifically, fit its derivative to a logistic function and compare it to the Bayesian model in a separation plot.
Monte Carlo Method
Monte Carlo methods rely on random number generators to simulate the problem space and converge toward a solution. Interestingly, other event data can also be used to achieve approximations.
 Buffon's Needle:
 given wood planks of width t=1 and a needle of length l=1, approximate π
 the probability a needle crosses a plank boundary is the ratio of the area of the blue semicircular region to the total area
 frequency count = the number of times a dropped needle overlaps a plank boundary
 generalized (any t, l): x = t/l
 case 1: a short needle (l < t) either falls in a gap or crosses one plank boundary
 case 2: a long needle (l > t) either falls in a gap, crosses one boundary, or crosses more than one
 result: for short needles the crossing probability is linear in the lengths, \( P = \frac{2l}{\pi t} \); for long needles it asymptotes to near certainty (convergence)
 Buffon's noodle (a bent needle):
 bending has no effect on the expected result
 a bent needle is effectively shorter => it crosses less often
 but when it does cross, there is an increased chance it crosses more than once
 a circle with diameter equal to the plank width t
 (circumference \( \pi t \))
 always intersects the boundaries exactly twice
 Monte Carlo (i.e., no needle):
 use a random generator to select points in a square with an inscribed circle
 estimate the ratio between the two areas from the point counts
Monte Carlo methods allow approximation of the probability ratio between factors (in n-dimensional space), which can be formulated to bound or update posterior probabilities.
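The circle-inscribed-in-square bullet above, as a runnable sketch (the sample count and seed are arbitrary choices):

```python
import random

def estimate_pi(n, seed=0):
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)   # point in the square
        if x * x + y * y <= 1.0:                        # inside the inscribed circle
            inside += 1
    # area ratio: circle / square = pi r^2 / (2r)^2 = pi / 4
    return 4.0 * inside / n

pi_hat = estimate_pi(100_000)
```

With 100,000 points the estimate lands within a few hundredths of π; the error shrinks like \( 1/\sqrt{n} \).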
Data Generation, Goodness-of-fit, MCMC
Posteriors returned over unknown variables should not be mixed, because the variable parameters are dependent in some way (i.e., a large std in one leads to a smaller one in the other). MCMC.sample() convergence relies on the starting value; alternatively, a MAP fit (maximum a posteriori, MAP(model).fit()) can be used with good results.
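The starting-value dependence is easy to see with a hand-rolled sampler. This is a generic random-walk Metropolis sketch, not the pymc implementation; the target density, step size, and burn-in length are illustrative choices:

```python
import math, random

def metropolis(log_post, x0, n, step=0.5, seed=0):
    """Random-walk Metropolis sampler over a 1-D log-posterior."""
    rng = random.Random(seed)
    x, chain = x0, []
    lp = log_post(x)
    for _ in range(n):
        prop = x + rng.gauss(0, step)
        lp_prop = log_post(prop)
        if math.log(rng.random()) < lp_prop - lp:   # accept/reject
            x, lp = prop, lp_prop
        chain.append(x)
    return chain

# Target: standard normal log-density (up to an additive constant).
log_post = lambda x: -0.5 * x * x
chain = metropolis(log_post, x0=10.0, n=20_000)     # deliberately bad start
burned = chain[2000:]                               # discard burn-in
```

Starting far from the mode, the early samples drift toward it; after discarding the burn-in, the chain's mean and variance match the target.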
# unsupervised model
import pymc as pm
# what random variable describes this data?
#   @pm.stochastic    -- custom random variable
#   @pm.deterministic -- variable derived from its parents
#   pm.Categorical    -- e.g. discrete cluster assignments
# what are the parameters of this distribution? i.e., different sets?
# can this be modeled with a distribution?
# .. can this parameter be modeled over different sets?
# if not, set this parameter to fit the observed data
priors

Bayesian Truth Serum
A survey method that provides an incentive to tell the truth by asking each respondent \( r \), for an m-choice question:
 their own response \( k \): \( x_{k}^{r} \in \{0,1\} \)
 their prediction of the responses of others: \( y^{r} = (y_{1}^{r}, .., y_{m}^{r}) \)
The idea is based on the premise that if a respondent truly holds a belief, they are more likely to believe that others agree: the overall share of one's belief is actually surprisingly common but underestimated.
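Prelec's scoring makes "surprisingly common" precise: an answer scores positively when its actual frequency exceeds the geometric mean of the predicted frequencies. A sketch of that information-score term (treat the exact scoring details as an assumption here):

```python
import math

def information_scores(responses, predictions):
    """responses: chosen option index per respondent;
    predictions: per-respondent predicted frequency vectors (one per option)."""
    m = len(predictions[0])
    n = len(responses)
    actual = [sum(r == k for r in responses) / n for k in range(m)]
    # geometric mean of predicted frequencies for each option
    geo = [math.exp(sum(math.log(p[k]) for p in predictions) / n) for k in range(m)]
    # information score log(actual / predicted); positive => surprisingly common
    return [math.log(actual[k] / geo[k]) if actual[k] > 0 else float('-inf')
            for k in range(m)]
```

For instance, if everyone predicts option 1 is rare (20%) but a third of respondents actually choose it, option 1 is surprisingly common and scores positively.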
Boosting is a sequentially additive process over a committee of \( m \) members (weak learners), each adaptively sampling over the data; it minimizes an exponential loss function, \( L(y, f(x)) = \exp(-y f(x)) \). This loss function is equivalent to the binomial deviance in terms of its population minimizer, \( f^{*}(x) = \frac{1}{2}\log\frac{\Pr(Y=1 \mid x)}{\Pr(Y=-1 \mid x)} \), and thus estimates class probabilities. However, the exponential loss grows faster on large negative margins, so it is more sensitive to noisy points than the binomial deviance.