Boosting

Boosting is a classification algorithm based on a weighted majority vote by a committee \( G \) of size \( M \) (members \( g_{m} \)) over training data of size \( n \).

  • sequential additive model

    For each sequential member \( g_{m} \), incorrectly classified points are given increased weights, which increases their influence on the next member \( g_{m+1} \). Conversely, weights for correctly classified points are decreased, reducing their influence. This adaptive reweighting in effect produces a different distribution over the data at each of the \( M \) steps; examples that are easy to learn receive less emphasis through decreased weight, and vice versa (see the sketch after the weak-learner bullets below).

  • weak learners

    Under the PAC-learning (probably approximately correct) framework, a weak learner is defined as having accuracy only slightly better than random guessing, i.e. just above 1/2.

    • A weak learner comes with the guarantee that it will weakly learn regardless of the distribution over the data.

    • Further, a problem can be weakly learned iff it can be strongly learned (strong learning being defined as accuracy arbitrarily close to perfect).

    • Boosting transforms the \( M \) weak learners into a strong learner, where the final hypothesis is a weighted majority vote over all the learners.
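
A minimal AdaBoost-style sketch of the reweighting and weighted vote described above (illustrative only; it assumes scikit-learn decision stumps as the weak learners, and the names adaboost/vote are not from these notes):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=50):
    '''y takes values in {-1, +1}; returns M stumps and their vote weights'''
    n = len(y)
    w = np.full(n, 1.0 / n)                            # uniform starting distribution over the data
    members, alphas = [], []
    for _ in range(M):
        g = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = g.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)      # weighted training error of g_m
        alpha = 0.5 * np.log((1 - err) / err)          # this member's vote weight
        w *= np.exp(-alpha * y * pred)                 # up-weight mistakes, down-weight correct points
        w /= w.sum()                                   # renormalize: a new distribution for g_{m+1}
        members.append(g)
        alphas.append(alpha)
    return members, alphas

def vote(members, alphas, X):
    '''final hypothesis: weighted majority vote G(x)'''
    return np.sign(sum(a * g.predict(X) for a, g in zip(alphas, members)))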


""
  • exponential loss function

    • The (referenced) graph shows the binomial negative log-likelihood (deviance, orange) and the exponential loss (cyan) plotted against the classification margin \( y f(x) \), where \( f(x)=0 \) is the decision boundary.

    • both are monotone, continuous approximations to misclassification (0-1) loss; see the margin-form expressions after these bullets

    • TODO: clarify the claim that this prevents overfitting; the exponential loss concentrates weight on points with large negative margins, which raises variance when weights land on mislabeled points
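
    For reference, the two curves in margin form (assuming the usual convention in which \( f \) estimates half the log-odds; that convention is not stated above):

    \[ L_{\exp}(y, f(x)) = e^{-y f(x)}, \qquad L_{\mathrm{dev}}(y, f(x)) = \log\!\left(1 + e^{-2 y f(x)}\right) \]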



  • stumps, trees, feature learning
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_trees(X, y, M=100, max_leaf_nodes=6, subsample=0.5, beta=0.1):
    '''
    Boosted trees by greedy, top-down recursive partitioning (X: n x k, M rounds).
    1. find regions R_j by growing a tree on the current pseudo-residuals
       (splitting criterion: a weighted loss; here weighted least squares
       stands in for the weighted exponential loss / gradient step)
    2. set each leaf to the mean of y_j in R_j, the value minimizing the loss
    3. the boosted tree is forward stagewise additive over trees: scale each
       tree's contribution by the weight beta and add it to the model
    4. tree size: between 4 and 8 terminal nodes is typically optimal
    5. subsampling a fraction of the data (without replacement) improves
       performance and decreases run time
    '''
    n = len(y)
    f = np.zeros(n)                                   # current additive model F_m(x)
    trees, rng = [], np.random.default_rng(0)
    for _ in range(M):
        idx = rng.choice(n, int(subsample * n), replace=False)       # step 5
        residual = y - f                              # descent direction for squared loss
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)  # steps 1-2, 4
        tree.fit(X[idx], residual[idx])
        f += beta * tree.predict(X)                   # step 3
        trees.append(tree)
    return trees
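
A hypothetical usage of the sketch above on synthetic data (prediction must reuse the same beta used during fitting):

X = np.random.rand(200, 3)
y = np.sin(3 * X[:, 0]) + 0.1 * np.random.randn(200)
trees = boost_trees(X, y, M=50, beta=0.1)
pred = 0.1 * sum(t.predict(X) for t in trees)      # reassemble the additive model F_M(x)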

  • generalized margin

There exists a proof of a training-error bound: if each weak learner's weighted error is at most \( 1/2 - \gamma \), the training error of the final vote decays exponentially in the number of rounds. The proof depends on the per-round normalization factor that keeps the weights summing to 1 (see the bound below).
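
A sketch of the standard bound (filling in the symbols the note above leaves out; this is the textbook Freund-Schapire result, not worked out in these notes), assuming each round's weighted error satisfies \( \epsilon_{m} \le 1/2 - \gamma_{m} \) and \( Z_{m} \) is the round-\( m \) normalization factor:

\[ \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{G(x_i) \neq y_i\} \;\le\; \prod_{m=1}^{M} Z_{m} \;=\; \prod_{m=1}^{M} 2\sqrt{\epsilon_{m}(1-\epsilon_{m})} \;\le\; \exp\!\Big(-2\sum_{m=1}^{M}\gamma_{m}^{2}\Big) \]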

  • Sequential Probability Ratio Test

Information geometry

Specifically, I am interested in the Kullback-Leibler divergence and how it relates to Bayes updating. However, I ran across a set of interesting Information Geometry notes that derive the Fisher Information Metric (FIM). They provide a nice way to think about maximum entropy in terms of Gibbs ensembles, how that relates to information geometry (ellipsoids measured against a unit sphere), and a little differentiation of logs of exponentials to derive the FIM as the covariance matrix, equivalently the second derivative of the log partition function…

  • First, maximum entropy is described by an ellipsoid, which characterizes the fluctuations about an average observed value. For a set of observables, the fluctuations are inter-dependent, and the stretch of the ellipsoid along each principal axis is given by the eigenvalues of the covariance matrix.

  • But how can the maximum-entropy state be found? Maximum entropy describes a state at thermodynamic equilibrium, for the chosen values of that system (see the box-volume, gas-concentration example). Mixed states are described by a density operator \( \rho \), with trace (sum of diagonals) equal to 1.
    • the entropy of the mixed state is defined by \( S = -\mathrm{tr}(\rho \ln \rho) \)

    • and the Gibbs state (which has maximum entropy) equals \( \rho = \frac{1}{Z} e^{-\lambda^{i} X_{i}} \), where \( \lambda^{i} \) is the Lagrange multiplier, the conjugate variable of the observable \( X_{i} \)

    • The partition function: \( Z = \mathrm{tr}\big(e^{-\lambda^{i} X_{i}}\big) \)

    • taking the derivative of \( \ln Z \) with respect to \( \lambda^{i} \) gives the mean values of the observables:

    \[ \langle X_i \rangle = - \frac{\partial}{\partial \lambda^{i}} \ln Z \]

    • taking the log of the density operator: \[ \ln \rho = -\lambda^{i} X_{i} - \ln Z \]
    • differentiating both sides with respect to \( \lambda^{i} \): \[ \frac{\partial}{\partial \lambda^{i}} \ln \rho = - X_i + \langle X_i \rangle \]

    \[ X_i - \langle X_i \rangle = -\frac{\partial\, \ln \rho}{\partial \lambda^{i}} \]

    • For observables \( X_i \), an \( n \times n \) covariance matrix can be defined: \[\langle (X_i - \langle X_i \rangle)(X_j - \langle X_j \rangle)\rangle \]

    • which equals the second derivative of the log partition function: \[ g_{ij} = \frac{\partial^{2}}{\partial \lambda^{i} \partial \lambda^{j}} \ln Z \]

  • This is the Fisher information metric: it is symmetric positive definite and can be thought of as an inner product on the tangent space of a manifold whose points parameterize Gibbs (maximum-entropy) states, where the observables are random variables fluctuating about their mean values. To a first approximation the fluctuations form an ellipsoid. The information geometry rescales that ellipsoid to a unit ball, a standard against which to measure the length of any vector sticking out from the point. (A small numerical check follows this list.)

  • The lecture series also describes replicator equations for evolutionary population dynamics, entropy as learning from the environment, and mixed populations of replicators as game-theoretic mixed strategies… (write up your notes on this)
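
A small numerical check of the derivation above (a sketch; the toy states and observables are invented for illustration): for a discrete Gibbs distribution the covariance of the observables matches the Hessian of \( \ln Z \).

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))            # 10 states, values of 2 observables X_i per state
lam = np.array([0.3, -0.7])             # Lagrange multipliers lambda^i

def lnZ(lam):
    return np.log(np.sum(np.exp(-X @ lam)))

p = np.exp(-X @ lam - lnZ(lam))         # Gibbs probabilities over the states
mean = p @ X                            # <X_i>
cov = (X - mean).T @ ((X - mean) * p[:, None])     # covariance matrix of the observables

eps = 1e-5                              # Hessian of ln Z by central finite differences
g = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei, ej = np.eye(2)[i] * eps, np.eye(2)[j] * eps
        g[i, j] = (lnZ(lam + ei + ej) - lnZ(lam + ei - ej)
                   - lnZ(lam - ei + ej) + lnZ(lam - ei - ej)) / (4 * eps**2)

print(np.allclose(cov, g, atol=1e-4))   # True: g_ij equals the second derivative of ln Z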

Bayesian Framework

  • P-values measure the significance of an observed event frequency under the null hypothesis; under the null, p-values are uniformly distributed over [0, 1] (so identical trials will generate different p-values). If the p-value falls within the 5% tail, the null hypothesis is rejected. Sample size is needed for proper interpretation of p-values. Test statistics such as the z-test, t-test, or chi-square assume (approximate) normality, which the central limit theorem supplies even when the underlying data distribution is not normal.

  • The Bayesian approach is not limited to a single null hypothesis and reductio ad absurdum; rather, it updates the underlying distribution via a likelihood over all possible hypotheses. The update reweights (skews) the prior distribution by how well each hypothesis explains the new data. The resulting posterior distribution, rather than being accepted or rejected at a 5% cutoff, can be incorporated into a loss function, from which model parameters can be assessed. If provided with wildly astonishing results, such as "smoking prevents cancer", the new data are weighed against prior beliefs rather than taken in isolation. (A minimal update sketch follows.)
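
A minimal sketch of such an update (an assumed beta-binomial example, not from the text): the likelihood of the new data reweights the prior into a posterior, which can then be fed into whatever loss function is appropriate.

import numpy as np
from scipy import stats

theta = np.linspace(0, 1, 501)                    # grid over an unknown rate parameter
prior = stats.beta(2, 2).pdf(theta)               # prior belief about the rate
likelihood = stats.binom.pmf(7, n=20, p=theta)    # likelihood of seeing 7 successes in 20 trials
posterior = prior * likelihood                    # Bayes: prior reweighted by the likelihood
posterior /= np.trapz(posterior, theta)           # normalize to a proper density

# decisions come from a loss over the posterior, not from a 5% cutoff
posterior_mean = np.trapz(theta * posterior, theta)   # minimizes squared-error loss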

  • Statistical Modeling: I have been working on an alert fatigue problem.

    • I'd like to assess how well alerts thresholded as "hard" (based on boosting weight dynamics) relate to a model of alert generation.
    • This model is based on expert priors, the number of alerts, and inter-arrival times.
    • Finally, I'd like to take a closer look at the boosting exponential loss function; specifically, fit its derivative to a logistic function and compare it to the Bayesian model in a separation plot.

Monte Carlo Method

Monte Carlo methods rely on random number generators to simulate the problem space and converge toward a solution. Interestingly, other sources of random event data can also be used to obtain approximations.

$$ \mathrm{Area} = \int_{0}^{\pi} \tfrac{1}{2} \sin\theta \; d\theta $$
  • Buffon's Needle:
    • Given wood planks of width \( t = 1 \) and a needle of length \( l = 1 \), approximate \( \pi \).
    • The probability a dropped needle crosses a plank line is the ratio of the area under the half-sine curve (blue, the integral above) to the total area, which works out to \( 2/\pi \).
    • frequency count = the number of times a dropped needle overlaps a plank line.
  • Generalized (any \( t, l \)), in terms of the ratio \( x = t/l \):
    • case 1: a short needle (\( l < t \)) either falls in a gap or crosses one plank line
    • case 2: a long needle (\( l > t \)) either falls in a gap or crosses at least one plank line, possibly several
    • result: for short needles the crossing probability is linear in \( l/t \); for long needles it asymptotes toward near-certainty (convergence)
  • Buffon's noodle (bent needle):
    • bending has no effect on the expected number of crossings
    • a bent needle spans a shorter distance, so it crosses less often
    • but when it does cross, there is an increased chance it crosses more than once
  • a noodle bent into a circle with diameter equal to the plank width \( t \):
    • the circle has length (circumference) \( \pi t \)
    • it always intersects the plank lines exactly twice
  • Monte Carlo (i.e. no needle):
    • use a random number generator to select points in a square with an inscribed circle
    • estimate \( \pi \) from the ratio of point counts falling in the two areas (see the sketch after this list)
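
A minimal sketch of both estimators (illustrative code, with t = l = 1 for the needle case):

import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

# 1) area ratio: random points in a unit square with an inscribed quarter circle
x, y = rng.uniform(size=N), rng.uniform(size=N)
pi_area = 4 * np.mean(x**2 + y**2 <= 1)       # fraction inside the circle -> pi/4

# 2) Buffon's needle with t = l = 1: P(cross) = 2/pi
center = rng.uniform(0, 0.5, size=N)          # distance of needle center to the nearest line
theta = rng.uniform(0, np.pi, size=N)         # needle angle
crossed = center <= 0.5 * np.sin(theta)       # crossing condition
pi_needle = 2 / np.mean(crossed)

print(pi_area, pi_needle)                     # both converge toward 3.14159...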

Monte Carlo methods allow approximation of the probability ratio between factors (in n-dimensional space), which can be formulated to bound or update posterior probabilities.



  • Data Generation, Goodness-of-fit, MCMC

pyMC

Posteriors returned over the unknown variables should not be mixed independently, because the variables' parameters are dependent in some way (i.e. a large standard deviation in one leads to a smaller one in the other). Convergence of MCMC sampling depends on the starting values; alternatively, a MAP (maximum a posteriori) fit can be run first to supply good starting values, with good results (see the sampling sketch after the model below).

# unsupervised mixture model (sketch): which random variables describe this data?
import numpy as np
import pymc as pm

# synthetic stand-in for the observed data (unknown grouping)
data = np.concatenate([np.random.normal(120, 10, 100),
                       np.random.normal(190, 10, 100)])

# can this be modeled with a distribution? which parameters, over which sets?
p = pm.Uniform("p", 0, 1)                        # mixture weight between the two sets
assignment = pm.Categorical("assignment", [p, 1 - p],
                            size=data.shape[0])  # latent set label per data point

# these parameters are modeled separately over the two sets
centers = pm.Normal("centers", [120, 190], [0.01, 0.01], size=2)
taus = 1.0 / pm.Uniform("stds", 0, 100, size=2) ** 2

@pm.deterministic
def center_i(assignment=assignment, centers=centers):
    return centers[assignment]                   # each point's center, chosen by its label

@pm.deterministic
def tau_i(assignment=assignment, taus=taus):
    return taus[assignment]

# this variable is fixed to the observed data
observations = pm.Normal("obs", mu=center_i, tau=tau_i,
                         value=data, observed=True)

model = pm.Model([p, assignment, centers, taus, observations])
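
Sampling sketch for the model above (illustrative; a MAP fit first gives good starting values, as noted above):

pm.MAP(model).fit()                        # maximum a posteriori fit for starting values
mcmc = pm.MCMC(model)
mcmc.sample(50000, burn=10000)             # draw posterior samples, discarding burn-in
centers_trace = mcmc.trace("centers")[:]   # posterior samples of the set centers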
  • priors

  • Bayesian Truth Serum: a survey method that provides an incentive to tell the truth by asking each respondent for:

    1. respondent \( r \)'s own answer to an m-choice question, response \( k \): \( x_{k}^{r} \in \{0,1\} \)
    2. respondent \( r \)'s prediction of the distribution of others' responses: \( y^{r} = (y_{1}^{r}, \ldots, y_{m}^{r}) \)

The idea is based on the premise that if a respondent truly holds a belief, then they are more likely to believe that others agree; a truthfully held answer therefore tends to be surprisingly common relative to what respondents collectively predict.


Each member \( g_{m} \) of the committee of size \( M \) casts a vote. Each weak learner adaptively samples over the reweighted training data in a sequentially additive process, which reduces an exponential loss function \( L(y, f(x)) = e^{-y f(x)} \), with each new member's contribution scaled by \( \beta_{m} \).
In terms of the margin: Boosting is a sequentially additive process that minimizes an exponential loss function. This loss is equivalent to the binomial deviance in the sense of sharing the population minimizer \( f^{*}(x) = \tfrac{1}{2}\log\frac{\Pr(Y=1\mid x)}{\Pr(Y=-1\mid x)} \), so it effectively estimates class probabilities via the log-odds. However, the exponential loss grows faster for large negative margins, so it is more variable and less robust to mislabeled points than the deviance.
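
For reference, the standard derivation of that population minimizer (a textbook result, not worked out in these notes): setting the derivative of the conditional expected loss to zero,

\[ \frac{\partial}{\partial f(x)}\,\mathbb{E}_{Y \mid x}\!\left[e^{-Y f(x)}\right] = -\Pr(Y=1\mid x)\,e^{-f(x)} + \Pr(Y=-1\mid x)\,e^{f(x)} = 0 \;\Rightarrow\; f^{*}(x) = \frac{1}{2}\log\frac{\Pr(Y=1\mid x)}{\Pr(Y=-1\mid x)} \]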