Wednesday, August 17, 2016

The Master Algorithm

Really impressed with the clarity of thought in the book The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World by Pedro Domingos

Notes from the book (partial):
Symbolists --> learning as the inverse of deduction and take ideas from philosophy, psychology and logic  --> master algorithm --> inverse deduction

Connectionists --> reverse engineer the brain and are inspired by neuroscience and physics  --> master algorithm --> backpropagation

Evolutionaries --> simulate evolution on the computer and draw on genetics and evolutionary biology  --> master algorithm --> genetic programming

Bayesians --> believe learning is a form of probabilistic inference and have their roots in statistics  --> master algorithm --> Bayesian inference

Analogizers --> learn by extrapolating from similarity judgements and are influenced by psychology and mathematical optimization  --> master algorithm --> support vector machine

* Every algorithm, no matter how complex, can be reduced to just three operations: AND, OR, and NOT. Simple algorithms can be represented by diagrams, using different symbols for the AND, OR and NOT operations.
* an algorithm is not just any set of instructions: the instructions have to be precise and unambiguous enough to be executed by a computer.
* algorithms are an exacting standard. Equations are really just a special kind of algorithm.
* designing an algorithm is not easy; it is full of pitfalls, and you have to find every error and fix it.
* algorithms run up against complexity in several forms: space, time, and human complexity (an algorithm can be too intricate for human brains to understand).
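
The reduction to AND, OR, and NOT can be sketched in a few lines; XOR here is my own example, not one from the book:

```python
# Building a tiny "algorithm" out of the three primitive operations.
def AND(a, b): return a and b
def OR(a, b):  return a or b
def NOT(a):    return not a

def XOR(a, b):
    # "a or b, but not both" - composed purely from AND, OR, NOT
    return AND(OR(a, b), NOT(AND(a, b)))

for a in (False, True):
    for b in (False, True):
        print(a, b, XOR(a, b))
```

Any more complex boolean function can be stacked up from these same three pieces in the same way.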

In the learners' world, computers can write their own programs (machine learning); even so, humans retain the ability to control them.


* machine learning is like farming: learning algorithms are the seeds, data is the soil, and the learned programs are the grown plants. The machine-learning expert is like a farmer, sowing the seeds, irrigating and fertilizing the soil, and keeping an eye on the health of the crop, but otherwise staying out of the way.
* some learners learn knowledge and some learn skills.

* machine learning takes many different forms and goes by many different names: pattern recognition, statistical modeling, data mining, knowledge discovery, predictive analytics, data science, adaptive systems, self-organizing systems.

machine learning is a subfield of AI. The goal of AI is to teach computers to do what humans currently do better, and learning is arguably the most important of those things.

being statistically correct is enough; you cannot have deterministically correct solutions all the time.

** The industrial revolution automated manual work and the information revolution did the same for mental work, but machine learning automates automation itself.

learning algorithms are the matchmakers: they find producers and consumers for each other, cutting through the information overload.

*** the best way for a company to ensure that learners like its products is to run them itself. whoever has the best algorithms and the most data wins. a new type of network effect takes hold: whoever has the most customers accumulates the most data, learns the best models, wins the most new customers, and so on in a virtuous circle.

** "Data is the new oil": you have to refine it properly.

* machine learning is the scientific method on steroids: it follows the same process of generating, testing, and discarding or refining hypotheses.

The biggest challenge is assembling all this information into a coherent whole.

once a learned program is deployed, the bad guys change their behavior to defeat it. This contrasts with the natural world, which always works the same way. The solution is to marry machine learning with game theory: don't just learn to defeat what your opponent does now; learn to parry what he might do against the learner. Factoring in the costs and benefits of different actions, as game theory does, can also help strike the right balance between privacy and security.

Chapter 2: The Master Algorithm / Pg23 (46/353)
The Master Algorithm must be one and the same algorithm for doing all of the different things (not two different programs). For example, the same learner should help play chess as well as process credit-card applications.

Bayes' theorem -- diagnose a condition based on a database of patient records (their symptoms and test results). The same algorithm can be used to learn spam filters.
Nearest-neighbor algorithm -- handwriting recognition, controlling robot hands, recommending books and movies.
Decision tree learners -- credit-card application acceptance, finding splice junctions in DNA, choosing the next move in a game of chess.
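
As a sketch of that "one learner, many tasks" idea, here is a toy nearest-neighbor classifier (my own illustration; the feature vectors and labels are made up): the identical code works whether the rows describe credit applications, handwriting, or movie ratings.

```python
import math

def nearest_neighbor(train, query):
    # train: list of (feature_vector, label); predict the label of the closest point
    _, label = min(train, key=lambda pair: math.dist(pair[0], query))
    return label

# hypothetical credit-application features (e.g. normalized income, repayment score)
train = [((1.0, 1.0), "approve"), ((0.0, 0.0), "reject")]
print(nearest_neighbor(train, (0.9, 0.8)))  # closest to (1.0, 1.0)
```

Only the data changes from task to task; the learner itself stays the same.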

data alone is never enough: learning from finite data requires making assumptions, and different learners make different assumptions, which makes them good for some things but not others.

All knowledge - past, present, and future - can be derived from data by a single, universal learning algorithm ==> Master Algorithm. This will be the last thing we will ever have to invent, because once we let it loose, it will go on to invent everything else that can be invented.

Evolution is an iterative search, solving a problem by trying many candidate solutions, selecting and modifying the best ones, and repeating these steps as many times as necessary.
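
That loop (generate candidates, select the best, modify, repeat) can be sketched as a toy genetic search; the target string, population size, and mutation rate below are my own arbitrary choices, not anything from the book:

```python
import random

TARGET = "MASTER"
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def fitness(s):
    # number of characters matching the target
    return sum(a == b for a, b in zip(s, TARGET))

def mutate(s, rate=0.2):
    # randomly flip each character with a small probability
    return "".join(random.choice(ALPHABET) if random.random() < rate else c
                   for c in s)

random.seed(0)
population = ["".join(random.choice(ALPHABET) for _ in range(len(TARGET)))
              for _ in range(50)]
for generation in range(300):
    population.sort(key=fitness, reverse=True)
    if fitness(population[0]) == len(TARGET):
        break
    survivors = population[:10]                          # selection
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(40)]        # variation
print(generation, population[0])
```

Keeping the survivors unchanged (elitism) guarantees the best candidate never gets worse between iterations.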

Monday, September 7, 2009

Sensors for capturing speech signal

As my research focused on the use of the Physiological Microphone (PMIC) for stress and speaker verification, I thought it would be a good idea to identify the different sensors that have been used for speech signal analysis/recording.

Any sensor with the ability to measure displacement would seem a good option for recording speech, since it can capture jaw movement, cheek movement, tongue movement, and so on (the movement of any articulator), or any vibration caused by speech production (bone vibrations), at locations other than the standard mic-near-the-mouth approach.

So, the study focuses on understanding displacement sensors and the principles behind measuring displacement. The sensors used to measure displacement are strain gauges, piezoelectric crystals, and accelerometers.

Microphones - the acoustic ones usually record the pressure variations caused by sound traveling in air (acoustic vibrations). The simplest form is diaphragm-based, measuring the difference in pressure on either side: pressure-gradient microphones. There can also be a pressure microphone, with the diaphragm open to atmospheric pressure on one side, where any pressure variation is recorded as a displacement; such a microphone can be treated as omni-directional.

Examples of a few regular microphones.

1. Crystal Microphone - a piezoelectric crystal on one side of the diaphragm; frequency response 80 Hz to 6000 Hz; needs an amplifier, as the signal is weak. Positive points: good dynamic range, sensitivity, and lack of distortion; good stability if used outside of humid conditions.
2. Dynamic Microphone - a moving coil wound around a permanent magnet. Sensitivity is lower than the crystal microphone's because of the weight of the coil and magnet, and it may not perform faithfully in the presence of other electromagnetic sources.

3. Condenser Microphone - the diaphragm is one capacitor plate; together with a parallel back plate it acts as a capacitor. Requires a power supply (capacitor polarization voltage). Flat response over the speech/music range; expensive and high-impedance, hence requires a pre-amplifier for impedance matching and signal transfer.

4. Electret Microphone - a condenser microphone with a permanent charge on the plates. Similar flat response over the frequency range; can be made smaller in size, but requires battery power.

5. Probe Tube Microphone - useful for measuring sound pressure levels within a closed or nearly closed cavity. Consists of a usual transducer, but with the diaphragm coupled to a length of tubing of very small diameter; the tubing is inserted into the cavity to be measured. The length of the tubing defines the resonance characteristics, hence the probe tube microphone is calibrated to yield an appropriately flat frequency response.

Monday, April 6, 2009

Fractal Dimension - PMIC

Analyzing PMIC acquired signal using nonlinear dynamical framework

Tuesday, March 17, 2009

Fractal Dimensions - a review and study on sound captured through skin vibrations

Chapter 5. Invariants of the motion [book - analysis of observed chaotic data - Abarbanel H.D.I.]


Chapter 2/pg 21 - Fractal Dimension: A Quantitative Measure of Self-Similarity and Scaling
[book - Fractal Physiology - James Bassingthwaighte, Larry Liebovitch, Bruce West]

Self-Similarity Dimension-
describes how many new pieces geometrically similar to the whole object are observed as the resolution is made finer. If we change the scale by a factor F and find that there are N pieces similar to the original, then the self-similarity dimension D(self-similarity) is given by
N = F ^ D(self-similarity)
This dimension can be fractional: fractals (from "fragments") usually take on non-integer values, which therefore lie between the Euclidean dimensions.
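
Solving N = F ^ D for D gives D = log N / log F, which is easy to check on standard fractals (the examples below are my own, not from the book):

```python
import math

def self_similarity_dimension(N, F):
    # from N = F ** D: D = log N / log F
    return math.log(N) / math.log(F)

print(self_similarity_dimension(3, 2))  # Sierpinski triangle: 3 copies at scale 2 -> ~1.585
print(self_similarity_dimension(4, 3))  # Koch curve: 4 copies at scale 3 -> ~1.262
print(self_similarity_dimension(4, 2))  # filled square: 4 copies at scale 2 -> 2.0 (Euclidean)
```

The square comes out as an integer dimension, while the two fractals land strictly between 1 and 2.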

Hausdorff-Besicovitch Dimension and Capacity Dimension-
Necessary for finding the fractional dimensions of irregularly shaped objects; these two dimensions are technically similar.
This is explained in terms of the number of balls required to cover all the points within a space: in 1-D the ball is a line segment, in 2-D a circle, in 3-D a sphere, and so on. Let the radius of the ball be r, and let N(r) be the minimum number of balls of size r needed to cover the object. The capacity dimension Dcap tells us how the number of balls needed to cover the object changes as the size of the balls is decreased.
Dcap = lim (r -> 0) log N(r) / log (1/r)

If the capacity dimension is determined by using balls that are contiguous non-overlapping boxes of a rectangular coordinate grid, then the procedure is called BOX COUNTING.
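
A rough sketch of box counting, assuming we have points sampled from the object (here a straight line segment, whose capacity dimension should come out near 1); the box sizes and the least-squares fit are my own choices:

```python
import math, random

random.seed(1)
# sample points on a diagonal line segment: true dimension is 1
points = [(t, t) for t in (random.random() for _ in range(5000))]

def box_count(pts, r):
    # count occupied grid boxes of side r
    return len({(int(x / r), int(y / r)) for x, y in pts})

sizes = [0.1, 0.05, 0.025, 0.0125]
xs = [math.log(1 / r) for r in sizes]
ys = [math.log(box_count(points, r)) for r in sizes]

# least-squares slope of log N(r) vs log(1/r) approximates Dcap
n = len(xs)
slope = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / \
        (n * sum(x * x for x in xs) - sum(xs) ** 2)
print(round(slope, 2))
```

The slope plays the role of the r -> 0 limit in the Dcap definition, estimated over the scales the data can support.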

Thursday, March 12, 2009

nonlinear time series analysis (Kantz, Schreiber)

Comments on "Nonlinear Time Series Analysis," Holger Kantz and Thomas Schreiber, 2nd edn, Cambridge

Chapter 3 Phase Space Methods
Consider a dynamical system whose trajectory can be studied in a finite-dimensional vector space R^m, so the state is specified by some vector x belonging to R^m. The dynamical system can be described as x_{n+1} = F(x_n) for a discrete variable, and d/dt x(t) = f(x(t)) for the continuous case. Given an initial condition x0, the sequence of points xn portrays a trajectory of the dynamical system, which will either run off to infinity or stay within some bounds.

the set of initial conditions leading to the same asymptotic behaviour of the trajectory is called the basin of attraction for this particular motion.
Since the dynamical equations (or the equations of motion) are defined in phase space, it is also most natural to use a phase space description for approximations.

3.2 delay reconstruction pg 35
s_n = s(x(n Δt)) - a sequence of scalar measurements of some quantity taken at multiples of a fixed sampling time.
A delay reconstruction in m dimensions is then formed by the vectors s_n,
s_n = (s_{n-(m-1)tau}, s_{n-(m-2)tau}, ..., s_n)
The attractor formed by the above reconstruction is equivalent to the attractor in the unknown space in which the original system lives if the dimension m of the delay coordinate space is sufficiently large. Precisely, this is guaranteed if m is larger than twice the box-counting dimension Df of the attractor.
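
The delay reconstruction above can be sketched in a few lines; the signal, m, and tau here are illustrative choices, not values the book prescribes:

```python
import math

def delay_embed(s, m, tau):
    # vectors s_n = (s_{n-(m-1)tau}, ..., s_{n-tau}, s_n)
    start = (m - 1) * tau
    return [tuple(s[n - (m - 1 - k) * tau] for k in range(m))
            for n in range(start, len(s))]

signal = [math.sin(0.1 * n) for n in range(100)]
vectors = delay_embed(signal, m=3, tau=5)
print(len(vectors))   # 100 - (3 - 1) * 5 = 90 vectors
print(vectors[0])     # (s_0, s_5, s_10)
```

Each reconstructed vector stacks the current sample with its own past, so the scalar series becomes a trajectory in R^m.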

pg 36
for many practical purposes, the most important embedding parameter is the product m·tau of the embedding dimension and the delay time, rather than the embedding dimension m or the delay time tau alone.

Wednesday, March 11, 2009

Analysis of observed chaotic data (Abarbanel, H. D.I.)

Comments on "Analysis of observed chaotic data," Abarbanel H.D.I. Institute of Nonlinear Science Springer Study Edition 1995


pg 4 1.1.2 Correlations among data points
The dynamics takes place in a space of vectors y(t) of larger dimension, and we view it projected down onto the axis of the observed variable. We can identify a space formally equivalent to the original space of variables by using coordinates made out of the observed variable and its time-delayed copies - phase space reconstruction.


The time delay for the embedding is determined in terms of a nonlinear correlation function: it is chosen where the average mutual information has its first minimum.


pg 5 1.1.3 number of coordinates

the number of coordinates to use is determined by asking when the projection of the geometrical structure has been completely unfolded. This implies that points lying close to one another in the space of the y(t) vectors do so because of the dynamics and not because of the projection. This can be decided on the basis of the False Nearest Neighbors algorithm. [False neighbors are connected with unpredictability, as phase space points will be near each other for non-dynamical reasons, namely just because of the projection. The dimension at which the percentage of false nearest neighbors drops to ZERO is the necessary global dimension to use to examine the observed data.] pg 7
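
A coarse sketch of the false-nearest-neighbors idea (my own simplified version, with a brute-force neighbor search and an arbitrary threshold; real implementations refine the distance criterion): a neighbor is flagged "false" when adding the next delay coordinate separates the pair far more than their distance in the current dimension. The Hénon map's x-series stands in for observed data.

```python
import math

def embed(s, m, tau):
    # forward delay vectors (s(n), s(n+tau), ..., s(n+(m-1)tau))
    return [tuple(s[n + k * tau] for k in range(m)) for n in range(len(s) - m * tau)]

def false_neighbor_fraction(s, m, tau, threshold=10.0):
    pts = embed(s, m, tau)
    false = 0
    for i, p in enumerate(pts):
        # brute-force nearest neighbor in dimension m
        j = min((k for k in range(len(pts)) if k != i),
                key=lambda k: math.dist(p, pts[k]))
        d = math.dist(p, pts[j]) or 1e-12
        # separation gained by the extra coordinate of dimension m+1
        gained = abs(s[i + m * tau] - s[j + m * tau])
        if gained / d > threshold:
            false += 1
    return false / len(pts)

# Henon map x-series as a stand-in for measured data (true state is 2-D)
x, y = 0.1, 0.0
series = []
for _ in range(600):
    x, y = 1 - 1.4 * x * x + y, 0.3 * x
    series.append(x)
series = series[100:]  # drop the transient

print(false_neighbor_fraction(series, 1, 1))  # fraction in 1-D
print(false_neighbor_fraction(series, 2, 1))  # fraction in 2-D
```

Since the Hénon state is two-dimensional, one expects the 1-D fraction to be much larger than the 2-D one, which is the signal that m = 2 unfolds the attractor.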

The reconstructed series replaces the scalar data measurements with data vectors in a d-dimensional Euclidean space in which the invariant aspects of the sequence of points s(n) are captured with no loss of information about the properties of the original system. The new space is related to the original space of the s(n) by smooth, differentiable transformations.

What time lag Ts to use and what dimension d to use are the central issues of this reconstruction:
the theorem notes that if the motion lies on a set of dimension Da, which could be fractional, then choosing the integer dimension d of the unfolding space so that d > 2Da is sufficient to undo all overlaps and make the orbit unambiguous.

pg25
Time delays
it must be some multiple of the sampling time tau_s, since we only have data at those times; an interpolation scheme to get more data is just as uncertain as estimating the time derivatives of s(t).
if the time delay is too short, the coordinates s(n) and s(n+T) which we wish to use in our reconstructed data vector y(n) will not be independent enough.
finally, since chaotic systems are intrinsically unstable, if T is too large, any connection between the measurements is numerically tantamount to their being random with respect to each other.

Average mutual information -
the actual prescription suggested is to take the T at which the first minimum of the average mutual information I(T) occurs as the value to use in the time delay reconstruction of phase space. Average mutual information is invariant under smooth changes of coordinate system; thus the quantity I(T) evaluated in time delay coordinates and in the original, but unknown, coordinates takes on the same values. Also, I(T) is more robust against noise than many other quantities.
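
A histogram-based sketch of I(T) and the first-minimum prescription (my own minimal estimator; serious analyses use finer binning or kernel estimates, and the sine signal and bin count are arbitrary choices):

```python
import math
from collections import Counter

def mutual_information(s, T, bins=16):
    lo, hi = min(s), max(s)
    def b(v):  # histogram bin index of value v
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)
    pairs = [(b(s[n]), b(s[n + T])) for n in range(len(s) - T)]
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(i for i, _ in pairs)
    py = Counter(j for _, j in pairs)
    # plug-in estimate of I(T) = sum p(i,j) log[p(i,j) / (p(i) p(j))]
    return sum(c / n * math.log((c / n) / ((px[i] / n) * (py[j] / n)))
               for (i, j), c in pxy.items())

s = [math.sin(0.05 * k) for k in range(2000)]
ami = [mutual_information(s, T) for T in range(1, 61)]  # ami[T-1] = I(T)
first_min = next(T for T in range(2, 60)
                 if ami[T - 1] < ami[T - 2] and ami[T - 1] < ami[T])
print(first_min)  # ideally near a quarter of the sine's period (~31 here)
```

I(T) starts high (adjacent samples are nearly redundant) and the first dip marks a delay giving reasonably independent coordinates.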

choosing the dimension of reconstructed phase space
De is a global dimension and may well be different from the local dimension of the underlying dynamics.
using global false nearest neighbours --
since the attractors of real physical systems are quite compact in phase space, each phase space point will have numerous neighbours once the number of data points becomes large enough to populate state space well.

The basic geometric idea in the embedding theorem is that we have achieved an acceptable unfolding of the attractor from its values as seen projected on the observation axis when the orbits composing the attractor no longer cross one another in the reconstructed phase space.

filtering out spatial variations on small scales, perhaps even those associated with high frequencies as well, can lead to a practical system of lower dynamical dimension.



Friday, March 6, 2009

The Dripping Faucet as a Model Chaotic System

Comments on "The dripping faucet as a model chaotic system," Shaw, Robert, Publisher - Science Frontier Express Series (Aerial Press, Inc.) 1984.

This is a technical story of how a dripping faucet can make you lose your sleep :-)

The major question to be addressed is: How do we construct a model from a stream of experimental data which we have not seen before? How do we use the model to make predictions? What are the limits of our predictive ability?

Chaotic dynamics - short-term predictability but long-term unpredictability. The system state at one instant of time is causally disconnected with its state far enough into the future.

The author wants to find answers to:
1. the amount of information a system is capable of storing, or transmitting from one instant of time to the next,
2. the rate of loss of this information.

If we consider any system to be a black box which is producing a stream of numbers - the above questions along with the "predictability" can be posed.

One method is to construct a "return map" or Poincaré cross-section by choosing an embedding dimension for a time-delayed phase space reconstruction.
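
A minimal sketch of building a return map from a stream of numbers; the logistic map stands in here for measured drip intervals (my substitution, not the book's data):

```python
# iterate the logistic map to get a "stream of numbers"
x = 0.4
stream = []
for _ in range(200):
    x = 3.9 * x * (1 - x)  # chaotic for this parameter value
    stream.append(x)

# pair each value with its successor: plotting these pairs is the return map
return_map = list(zip(stream[:-1], stream[1:]))
print(len(return_map), return_map[0])
```

If the dynamics is low-dimensional, the (T_n, T_{n+1}) pairs collapse onto a thin curve rather than filling the plane, which is exactly what the dripping-faucet experiments revealed.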

The most important attribute of a "characteristic of a system" is that quantities like "information stored in a system" must be properties of the system, and not of the type of measurement. This implies invariance under coordinate transformations, a property which appropriately defined measures of information possess.

The dripping faucet has different patterns of behavior - periodic or chaotic - depending on initial conditions, indicating multiple "basins of attraction"; changes in behavior are sudden and hysteretic.

A behavior which exemplifies a mixture of periodic and chaotic aspects can be called noisy periodicity (May) or semiperiodicity (Lorenz).

then there is the possibility of "mixed dimensionality": in some regions of the state space the state of the system can be optimally specified by the value of a single coordinate, but in other regions more coordinates are required.

Interesting conclusions (page 18-19) made by the author are:
1. a model of only a few dimensions can sometimes adequately describe the chaotic behavior of a continuum system; the high dimensionality of the system is not required.
2. there exist fundamental geometrical structures which reappear in many different nonlinear systems.

Pg 30 -
Poincare realized that even in celestial mechanics determinism did not imply unlimited predictability. For purely deterministic dynamical systems, predictability may be extremely limited.
The degree of unpredictability --> entropy of the given system.

pg 32 ->
Kolmogorov, Sinai and others were able to show that, if one considers the set of all possible partitions, and selects the one which gives the largest numerical value for the entropy, the number one obtains is a topological invariant of the system, independent of the coordinate system used to describe the original continuous variables v.

34:
the entropy describes the rate at which information flows from the microscopic variables up to the macroscopic. But in the presence of noise, the partition definition of entropy does not help, as the partition becomes finer than the length scale set by the noise.
In mechanics the transmission of information requires the possibility of change.

The channel capacity is defined as the maximum rate one can find by varying the statistics of the input message over all possible input ensembles and using the optimum.
For dynamical systems, a particular input ensemble is selected by the properties of the system itself. Thus, the information stored in the system is quantified as the average increase in our ability to predict the future of a system when we learn its past; it is the difference in the expected randomness of a system with and without knowledge of its past. [pg 40]

[48-49]
our information about a system cannot increase in the absence of observation.
sharper distributions representing greater stored information will spread faster than broad distributions.
The maximum entropy, or loss of predictability, will thus occur when the stored information is at its maximum. Under this definition, both the stored information and the entropy of a purely stochastic map will be zero.

pg 55 the entropy in the limit of "pure determinism"
In the deterministic case (zero noise), the information storage capacity of the map diverges. But it becomes finite with the addition of noise, and the information stored by higher iterates of the map shows a concave behavior.

one task of an experimenter studying a new system is to characterize the determinism of the system, which can be described as the information stored through time in the dynamical variables. if the system has a positive entropy, some or all of this causal connection will be lost in the passage of time.

pg69
in a real chaotic system with noise present, initial data has predictive value for only a limited time; events which are too far apart are causally disconnected. Correlations extend over only a finite length of the string of symbols, so old experience ceases to have predictive value.

[pg 93]
from a classical perspective, it is clear that an experimenter's claim that he is "in the noise", and that there is no point in attempting to increase resolution, is an assumption. if the observation rate is less than the entropy of the system at some resolution level, the system will appear stochastic, even if it were "completely deterministic".

pg 103
The dimension of an object should describe how its volume scales with increasing linear size.

pg 109:
the usefulness of a "dimension" number as a scaling exponent depends on the number of orders of magnitude between the smallest and largest length scales. Robert Shaw comments that "the more low-dimensional structure there is in the reconstructed attractor, the less well-defined is a single "dimension" number. Thus, there is little reason at this point to prefer one dimension algorithm over another as "more fundamental".

In short:
Comments on the book -
The book is vividly written with little mathematical probing (the probing is kept minimal and used only where necessary).

Technically - it attacks the fundamental problem of how information, and the flow of information, relate to the "determinism" of a chaotic system.