MindFilled* Machine Intelligence    |    Cyber Security    |    Startups



We are drowning in information
but starved for knowledge.
- John Naisbitt


There are wavelengths that people cannot see,
There are sounds that people cannot hear,
And maybe computers have thoughts that people cannot think.
- Richard Hamming


Fathers send their sons to college either because they went to college or because they didn't.
- L. L. Henderson

=======================

the basic stuff.

what are you looking for?

what are you trying to make?

======================
If you put complex data into a simple model, it will underfit.
if you put simple data into a complex model, it will overfit.

usually, if you build the test set correctly, you will avoid overfitting.
but, if the data is simple enough, try to use a simpler model. it is faster, and easier to fix.
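a tiny sketch of what that looks like in practice, on made-up data: a degree-1 polynomial underfits a curvy signal, a degree-15 polynomial overfits the noise, and a simple held-out test set catches both.

import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 40))
y = np.sin(3 * x) + rng.normal(0, 0.2, 40)                 # curvy signal + noise
x_tr, y_tr, x_te, y_te = x[::2], y[::2], x[1::2], y[1::2]  # crude holdout split

def fit_eval(degree):
    coefs = np.polyfit(x_tr, y_tr, degree)                 # fit on the training half only
    mse = lambda xs, ys: np.mean((np.polyval(coefs, xs) - ys) ** 2)
    return mse(x_tr, y_tr), mse(x_te, y_te)

for d in (1, 3, 15):
    tr, te = fit_eval(d)
    print(f"degree {d:2d}  train mse {tr:.3f}  test mse {te:.3f}")
# expected: degree 1 is bad on both halves (underfit); degree 15 looks great on
# train but falls apart on test (overfit); degree 3 sits in the middle.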


=====================
if it is about shapes,


if it is about words,


if it is about music,




These are the so-called "patterns".
and, we study patterns.
i dont want to call it "studying" or "researching".
we just "like" patterns. something similar to how bees "like" yellow flowers.
we all have a little bit of OCD in us.


I put funny examples here,
but I kid you not.
This is actually a very serious business.
Everything you have ever known and will ever know is based on science backed by real data.

===========================
https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness_of_Mathematics_in_the_Natural_Sciences

There is some deep truth in the UEOMITNS.
it works better than it should because of correlation, meaning everything is somehow linked up in the underlying layer.


=========================
Data Science is easy as 1, 2, 3.

1(once) is Chance.
2(twice) is Coincidence.
3(thrice) is Pattern.

(In the absence of data, I sometimes use coincidence as pattern as well. with large enough / sparse data, sometimes you need to take more risk, and overall it is a positive return.)

-------------------
imagine you are trying to locate something.
in one dimension, you get a line.
in two dimensions, you get an area, and a dot.
in three dimensions, you get a space, and a specific location. more specific. from 3-D on, you just call it a pattern, as it occurs in that area.

in four dimensions, you get a hyperspace, and a hyper-dot? or like a movement in 3D space, over time. if you are chasing down a plane in a dog-fight, that is 4D. by looking at the path, if it is curved, you could calculate a faster path. but, sometimes, that curved path is required, as the optimal path.

so, this is a long way of saying, three times is a pattern. and, also, three different media types for advertisement is the best method.
========================
These days, I think about techniques that are a mixture of parametric (linear/SVR) as the core/generalized case, then using non-parametric (kNN/RF) case by case to pick off the outliers, using it as a diff against the parametric model's prediction. blending them in an exhaustive way. ratio blend, trigger blend, log ratio blend.

or, use non-parametric hotwiring in each layer of DL. That makes DL more blackbox-ish, but it is already a blackbox to begin with. something like an S-Box. a NN is basically an S-Box. but, locking the function down into a linear function is unnecessary, when the lookup-table time is fast enough. This should require a much shallower DNN.
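a minimal sketch of the trigger-blend idea, on made-up data and with an arbitrary threshold: a linear model carries the generalized case, a kNN model learns the residual (the diff against the parametric prediction), and the correction only kicks in when it is large.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(0, 0.3, 500)
y[:25] += 5                                    # a pocket of outliers the linear core will miss

base = LinearRegression().fit(X, y)            # parametric core / generalized case
resid = y - base.predict(X)                    # diff against the parametric prediction
local = KNeighborsRegressor(n_neighbors=5).fit(X, resid)   # non-parametric, case by case

def blended_predict(X_new, trigger=1.0):
    p = base.predict(X_new)
    r = local.predict(X_new)
    # trigger blend: apply the local correction only where it is big enough to matter
    return np.where(np.abs(r) > trigger, p + r, p)

# a ratio blend would instead be something like 0.7 * p + 0.3 * (p + r).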
==========================
I think ontology(categorization) is important.
I spent way too much time thinking about how to divide the data, in a similar way that a butcher thinks about how to cut the meat.
category of good and bad.
category of accepted and rejected.
category of to buy and not to buy.
category of planet and not a planet. (Pluto)
to be or not to be.

basically, this is all our brain does.

=====================
the strange thing is that "machine learning" in depth, becomes more philosophical than you want it to be.
=================
I dont care what algorithm you use.
I dont care which machine you use.
I dont care which degree you have.

i have one question.
if you were to design your own algorithm, how would you do it? and, what would it achieve?

what does the data look like? and which method would describe it well?
after doing this exercise with many, many different types of data, then let's generalize to find a universal, or semi-universal, algorithm.

====================

* Stuff I learned from doing Machine Learning.
theory vs practice.
insights, generated by machine intelligence. DIKW.
computers are really good at adding things a billion times in a second.
but, what about quantity-quality tradeoff??
what about processing-memory tradeoff??
what about algorithm-data tradeoff??
what if you use the maximum computing power, at the same time, you try to do some meaningful calculation on each cycle?
i am super-excited and also scared shitless.
i am really fascinated by how machines are learning stuff. and humans are just like machines...

==========================
*What pisses me off.

when machine learning scientists are focusing on one single aspect of data science.
they are doing "local optimization"
dont they see it?
if they like to do local optimization, how can i trust what they make??
this is a meta-problem.

a good data scientist should know where the most signal is getting dropped, and attack that problem.
not the problem they are most familiar with...
not the problem that fattens their wallet the most...


=========================

everything is true, by itself.
if you have unlimited resources, doing what is right, is so simple.
it is when you try to plug that into the whole system, and still keep it coherent, that you will be able to see the flaw. differential diagnostics.

every opening statement is convincing.
it is when the counter-examination kicks in that the truth is revealed.


=======================
There are things humans do well.
there are things machines do well.

we should find a way to use the best of both worlds.

============================

for me, it is really about information flow.
how information flows through data points. data objects. an object by itself does not have meaning. the meaning actually flows through the data points.

=========================


intelligent machine = not limited by human limitations = large data, fast processing power, accurate calculation.

machine intelligence = easily scalable.

The human-friendly versions are wikipedia, github. (the modern version of the Library of Alexandria) and, to some extent, youtube + facebook. some central repository of info/data. or, netflix + pandora + credit card data.

think about wikipedia. we could argue that wikipedia is a super intelligence. it is not strictly AI, but a very smart "thing", that knows about many different things. it is a brute force way of creating intelligent system.

we could argue that, the internet as a whole is a super intelligence. since it is noisy, we use google to filter the noise, or rank them, using some algorithm and lots of machines.  working together of machine intelligence and human intelligence. something close to singularity.

taking that cue, all you need to do is build a system to process existing knowledge into a format that machines are comfortable with.

we call this "transformation", nothing fancy. tf-idf is one way. frequency counting is one way. belief propagation is one way. just some stats and math.
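a toy sketch of that kind of transformation, just frequency counting and tf-idf by hand, on a tiny made-up corpus:

import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]
tokened = [d.split() for d in docs]

df = Counter(w for doc in tokened for w in set(doc))   # document frequency per word
N = len(docs)

def tfidf(doc):
    tf = Counter(doc)
    return {w: (tf[w] / len(doc)) * math.log(N / df[w]) for w in tf}

for d in tokened:
    print(tfidf(d))
# words shared across documents get discounted; distinctive words carry the signal.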

"simple models and a lot of data trump more elaborate models based on less data."


we are not just aiming at the level of "human intelligence". we are aiming higher. the super-system that could replace congress. airport control tower. HR process. budgets. GDP. the machines that could "answer the ultimate question of life, the universe, and everything". (which is also the reason the logo for this site is an "asterisk".)


================
compared to security, data science is more mathy. to some degree, cryptanalysis or stats, or lots of number crunching and parameter tweaking.



the particle theory:
The simplest (or the most obvious, as it could get difficult later on) is classification, or regression, or some sort of form fitting. I see this as component analysis.
you could try to pry out some signal in correlated variables. but, mostly, they can be viewed as independent.
even if they are not independent, the values themselves are already correlated, so you dont really worry about them. they are already correlation-adjusted variables.
that is why you can treat each variable as independent. it is not independent, but it is not anti-independent. you dont need to reverse the signal.

the string theory:
Another kind is relationship-based. or the link layer. the first derivatives. the differentials. Speech, NLP, stock prices: all of these can be viewed as n-grams (sliding windows). and tfidf or its variants can be used for all of them, in a similar manner.
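a sketch of the string view: the same sliding-window n-gram transform applied to words, characters, or a price-like series (made-up inputs):

def ngrams(seq, n=3):
    # slide a window of size n over any sequence: characters, words, price moves
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

print(ngrams("to be or not to be".split(), 2))   # word bigrams
print(ngrams(list("hello"), 3))                  # character trigrams

prices = [10.0, 10.5, 10.4, 11.0, 10.9]
moves = ["up" if b > a else "down" for a, b in zip(prices, prices[1:])]
print(ngrams(moves, 2))                          # the same trick on a stock-like series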


a hybrid method alternates between these two to get the maximum signal out of the data.


==================
really, this is not about how powerful the machines are.
your laptop is fast enough (for most tasks).
machine speed has made it easier, so we can use brute force.
but, that speed threshold was reached about 5 years ago, when we stopped caring about how fast the CPU is.
=================
Graph theory and ngrams.

in reality, graph theory is about the sequence. the relationship between items.
and, ngrams is also about the sequence.

so, graphical models and ngrams are very closely related.

==================
Machine Intelligence.

no matter how you cut it, the center of AI or big data is the search engine.
maybe, the term "search engine" isn't quite right.
it is more like "info engine", or "data filter".
in any case, it is all about the same thing.
finding a needle in a haystack.
or finding a pattern.
or making something with data, in more refined format, or more valuable format.

making a robot out of lego blocks. or from the junk yard, where the pieces dont quite fit together, but you can make a workable model with them. it is never the lack of tools. it is the lack of imagination and lack of thought.

=========================

signal vs noise

In DM/ML, the whole process is centered around 2 words: SIGNAL and NOISE. The whole data mining process is basically REMOVING NOISE, WHILE KEEPING THE SIGNAL UP. Data mining is not a fix-all solution, but as long as the data contains MORE SIGNAL THAN NOISE, data-mining can pull out some patterns/"meanings" from the data. the difficult part of the real world situation is deciding what is noise and what is signal.

a good pictorial analogy: a pinhole camera makes everything in the image crisp. a DSLR can focus on an object, and blur out the rest, which makes a "better" picture. the central object is signal, and the blurred-out background is noise, in a cognitive sense. so, this ability to focus on the desired object is good data-mining. bad data-mining would be filtering out the color red, or focusing on the wrong object.


=======================
How to play with Data. how to operate on Data.
how to tweak on the data.


every dataset has 2 main personality traits.
location and distribution.

location is just the average of the data points. that tells you where the data is located.

then, distribution. some data has a smooth distribution. these are easy to work with.
some have a sharp distribution, where the signal decays quickly as you move away from the known coordinate.
you can also view this as: the space is sparse. it is not that the signal decays quickly, but that the space is very very large.
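a small numpy sketch of the two personalities, on made-up samples: the mean gives the location, and the spread tells you whether the signal decays slowly (smooth) or quickly (sharp/sparse).

import numpy as np

rng = np.random.default_rng(2)
smooth = rng.normal(loc=50, scale=10, size=10_000)    # wide, smooth distribution
sharp = rng.normal(loc=50, scale=0.5, size=10_000)    # sharp peak: signal dies out fast away from 50

for name, data in [("smooth", smooth), ("sharp", sharp)]:
    print(name, "location:", round(data.mean(), 2), "spread:", round(data.std(), 2))
# same location, very different distribution; the sharp one behaves like a sparse space,
# where a query point a few units away from 50 finds almost nothing nearby.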


------------------------

Intelligence is a merge of contents/nodes and ranking/edges.


without a ranking system, the internet would collapse under its own weight, because 90% of everything is crap.
more exactly, 90% of everything is irrelevant, or just average. something you already know. something that is not remarkable.


So, if we dont rank things, we are stuck in a rut. (which goes back to, money as the final measuring stick)


--------------------
brain - 100 billion neurons, and 10k synapses for each neuron.

===========================
the making of the social brain. the making of an expert system. Once machines can make good decisions, which they can do at large scale, with precision, what is left for humans to do? maybe some intuition? is intuition quantifiable?

=========================
Data Science

Find the Essence of it. Knowing what is important, and what is not.
95% of everything is crap.
For that 95%, let's use compression.
(lossless is math. lossy is stats.)


Finding the Essence. simplicity is desired!
simplicity in 2 folds.
simplicity to remove all the flares.
simplicity to use a simple algorithm to remove all the flares. if your algorithm to remove the flares is also complex, then the overall complexity is increased, not decreased. that is not simplification overall, but actually an increase in entropy. that is why this is called art. because every situation requires a different level of safekeeping of data.

-------------------
the 1st stage of data science is Understanding. Pattern Recognition.
*Knowledge Engineering* :  Pattern Recognition.  all under the general theme of "Finding the Truth"
-Anomaly detection = based on the past data, what is normal, and what is abnormal?
-Predictive Analytics = based on the past, looking at the future. This is no different from regular pattern recognition, except there is a temporal element to it.

------------------
the 2nd stage of data science is DECISION. to take an action based on the analysis.
def: decide = to kill off the other choice.

======================
General Strategy

this is the usual thought process: I stare at the data before reading the challenge objective. I don't like to disturb my general understanding of the data. After I understand what the data looks like, then I look at the mission objectives. Then, I go back to the data, and stare at it for another while. poking around. make a basic working model, submit the result. Then, I can gauge where I stand. Is it good? or just baseline? or in the wrong area? Then, look at the error metrics, and optimize on those. Blend methods if necessary. Stratify/clean up the data set if necessary. If stuck, then go read the forum. Try to make some friends there. By this time, I usually run out of time. So, I make a few "hail mary" algorithms, then hit the deadline. If the competition is about supervised learning, I spend some time dividing the data into training and validation sets. usually just 2-3 folds, not the usual 10 folds. If the memory usage goes over 2GB, I filter out more in the first pass, so everything is kept relatively small. I just upgraded my local box to 8GB, so I might want to try to rely more on machines/brute force. I still like to hand-code, rather than use a library. These days, I am focusing on teamwork.

-------------------------


2) Model Building.
  - Archetype: centroid finding. : linear. (SVR, k-means clustering). Baseline/Generalization. if data point clusters have meanings. SVM/hyperplane to separate them. For clean, easily separated data, this is better. this works well on a simple convex problem. if it has many rough contours, meaning the answer (meaning) is in the combinations of indicators, then it is not a simple convex problem, but has many local maxima and minima. sure, the kernel trick works, but that is like using n-to-n matching, creating some manifold structure that is linear. if you have more data, using n-to-n might be simpler, unless you need the archetype, the centroid, as some role model.

  - Case by Case: (kNN) N-to-N. sniping. : nonlinear. atomic. NLP (star and start are different things.). if it is difficult to find a centroid, when the "meaning" of the data is not within the dataset, so that you use the data more like a hashkey, then I will just do case by case. I pick the closest case and the easy case. This is not even considered model building. case by case. patchwork. a good thing about picking off the easy ones is that the data can be messy and self-contradictory. RF and kNN behave in a similar way. a hyperplane will suffer from confusion. n-to-n point picking does not care. this is more like narrow vision. for dirty data, or contradicting data, sniping is good. good for un-patternful data. bad for smooth-curve data. short trees. lots of trees.
  In NLP,
it is usually an unlimited number of features, and you approach it differently. lots of sparse matrices.


  (RF is a good mix of case by case, with some linear modeling. but it does not have to be, if the feature is atomic)

  Usually do RF/kNN first (pick or drop), then use linear/Bayesian to get the best approximation, tagged with lower certainty. More weight/certainty for RF. This method is better than straight-up averaging between them. (a rough sketch of this follows after this list.)

  Why does the Pick and Drop method work well? due to the power law. there is the 5% of data that is high signal. and the 95% of data that is a moosh. use linear on the moosh, and non-linear on the 5% high signal.


  When you have different grades of detectors, from n-to-n (narrow selection) to SVM (broad coherent data) models, then you can cover almost everything.
  keep it a very simple algo (so simple, any mathematically reasonable person would agree). Usually, the choice of algorithm matters less in practice, since there is inherent noise in the data, especially in people data. that is why simple bayesian would get there 90% of the way. scientific data is more accurate, but in that case, you dont need much statistics either. just plug into a formula.
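a rough sketch of that order of operations, on made-up data and with an arbitrary confidence cutoff: let RF "pick" the cases it is sure about, and "drop" the rest to a simple generalized model.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 1, 1000) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)   # case-by-case sniper
lin = LogisticRegression(max_iter=1000).fit(X, y)                         # generalized baseline

def pick_or_drop(X_new, cutoff=0.8):
    p_rf = rf.predict_proba(X_new)[:, 1]
    confident = (p_rf > cutoff) | (p_rf < 1 - cutoff)   # "pick" the easy, high-signal cases
    p_lin = lin.predict_proba(X_new)[:, 1]
    return np.where(confident, p_rf, p_lin)             # "drop" the rest to the smoother model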


In NN, it is a stacked "linear model", where the model making is outsourced to the machines. (i feel uneasy when people say they use an ensemble of neural networks. a neural network already has ensemble-ness in it. it is just dividing the networks, like CNN. or something like the corpus callosum.) the machines are dumb in how they construct the model. but we humans are not that smart either. and, the data itself does not require super complex modeling. if it is human data, it should be simple enough, carrying the collective thought. if it is scientific data, there is always some simple beauty in it. it is an interesting field, but only a part of the solution.
this is a bit of a graphical model.


In graph model,
belief propagation. looks like a neural network, where activation spreads.
it is not about truth, but belief.
it is such a well made name. any belief will propagate. it is just piping, and any fluid, water, crude oil, seawater, compressed natural gas, will flow through it.
it is a very interesting model for human society, or internet documents.
you can still add hidden nodes. or use them as "helper nodes".
these days, i am making up some "RBM" type of hidden nodes to help out in the graphical model. something similar to the "imaginary number" in complex number coordinates.


==========================================
What is bayesian? What is Bayes' theorem?
This is really important, and THE central piece for almost everything in Data Mining.

I will try to use the simplest language possible.

A simple example would explain it well.
http://www.quora.com/What-does-it-mean-when-a-girl-smiles-at-you-every-time-she-sees-you

in the world, 50% are men, and 50% are women.
20% of men have long hair, and 80% of men have short hair.
70% of women have long hair, and 30% of women have short hair.

then, if you see a person with short hair, what is the chance that person is a woman?


on a cursory look, it is just simple math.
but, it is deducing the genotype with the phenotype.
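working that example out with Bayes' theorem:

# P(woman | short) = P(short | woman) * P(woman) / P(short)
p_woman, p_man = 0.5, 0.5
p_short_given_woman, p_short_given_man = 0.3, 0.8

p_short = p_short_given_woman * p_woman + p_short_given_man * p_man   # 0.55
p_woman_given_short = p_short_given_woman * p_woman / p_short
print(round(p_woman_given_short, 3))   # ~0.273: short hair pulls "woman" below the 50% prior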

so, how is this useful in everyday life?

think of the "Restricted Boltzmann Machine".
genotype is the hidden state.
phenotype is the visible state.

and, layer them into a deep learning structure. or, any structure that you like.

-----------------------
what concept is distilled in bayesian?

what is the PURPOSE of the bayesian inference?
why are we so crazy about bayesian??

The main reason that bayesian is important is this:
it is trying to figure out the genotype, from looking at the phenotype.
it is trying to figure out what you are thinking about, from the things you say.

it is trying to figure out the "inner hidden state" from observing that is happening on the outside.

in a way, this has a lot to do with the restricted Boltzmann machine (RBM).
=======================================
correlation  ==?  causation

there is a history to AI and data mining.

1)
the old-school AI was about building rules by hand.
it was made by "systems engineers" building rules, such as the physics engines used in video game simulations.
Most of computer (system) engineering, most simply put, is a function that takes in some data, processes it, and outputs a value. It's like a civil engineer laying underground pipes.


these are the "causation" fans.


2)
after the AI winter, a new school of thoughts came to life.
data driven methods.

data-mining is more of
transforming a given object into multiple parts, taking pictures from multiple angles. taking samples from multiple places.
a function that "collapses/compresses" multi-dimensional vectors onto a single dimensional value, usually normalized to 0 to 1, for each category/label. It's slightly different in the thinking process. It's more like a cook trying to make a new sauce.

DM is similar to the SQL language rather than an OOP language, in that it is centered around the data rather than the process.


this data-driven modeling is such a powerful concept.
with minimal code, it lets us build a fairly large and complex system.
With large enough data, the margin of error decreases with the size of the data.
(we are big fans of correlation. and it lets us be lazy/efficient/smart.)

this probabilistic model doesnt fit everything.
in courtroom, correlation is not good enough. (until we have pre-cog)
in advertisement, correlation is more than good enough.


(even in the courtroom, a very strong/extreme correlation would work. but, there is always a chance of coincidence. then again, everything could be a coincidence..)


---------------------------
when we take a look at pavlov's dog, we start to see that correlation gets blurred with causation in the brain.
when the dog hears the bell, the bell has no direct causal link to salivation.
but, it is the correlation that gets ingrained into the brain, and we act in reflex to it.

The interesting part is that a dog's brain is biologically no different from a human brain, meaning the human brain is susceptible to correlation == causation. Our brain thinks this way as well, as shown in Neural Networks/Auto Encoders.

=================================
new tricks/techniques that I am recently interested in.

- progressive word boundaries (dynamic word boundaries.. still searching for a more succinct term..)
- the difference between the Log|Sigmoid function (perception) and the power curve (most social reality). these days, rectifiers are better. curve fitting. instead of finding a linear or logit function, i will just order the results from greatest to smallest, and fit them to the existing curve shown in the training set. works if we dont need to dig deep, and the centrality actually prevents over-fitting. and it is quick. (a sketch of this follows below.)
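a sketch of that rank-and-remap trick (sometimes called quantile or histogram matching), with made-up scores: keep the model's ordering, but force the output values onto the curve observed in the training set.

import numpy as np

def remap_to_training_curve(scores, train_targets):
    # keep the ranking from the model, but take the *values* from the training distribution
    ranks = np.argsort(np.argsort(scores))                      # 0 = smallest score
    curve = np.sort(train_targets)                              # the empirical curve seen in training
    idx = np.round(ranks * (len(curve) - 1) / (len(scores) - 1)).astype(int)
    return curve[idx]

train_targets = np.random.default_rng(4).pareto(2.0, 1000)      # power-curve-ish reality
raw_scores = np.array([0.1, 0.9, 0.4, 0.7])                     # whatever the model spits out
print(remap_to_training_curve(raw_scores, train_targets))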

============================
How to read the news these days.

data-driven => machine learning.

======================
The grand movement: Hardware -> Software -> Data
(anti-grand movement of Bitcoin mining: CPU -> GPU -> ASIC. when we really really need fast performance. )

What I mean by "data" is not just numbers, but what they mean.
You stop caring about which hardware it runs on.
you stop caring about whether your car is gasoline, diesel, or electric.
you stop caring about whether you run Windows or MacOS.

Companies want you to care about these things, since these are "branding".
But, these are more the "color" of your tools. does the color of your screwdriver matter? a little bit. but not much.

We start to care less and less about what things are called, and care more about the fundamental level.


=========================
When you think of a scientist experimenting on stuff,
you dont imagine them knowing how everything will end up in the result.
you have a hypothesis. and you test it.

this is data science. we come up with a new algorithm. and test whether it works.

================================
DataMining is, more or less, about making the brain.

data science (data science is a more recent term.)
it is about the "decision making engine". expert system. AI. data mining, machine learning, statistics. signal and noise. numbers. consumer behavior. patterns. av engine. information theory. kaggle. bayesian. brain. skynet. all the data in the world. search engine. the 3 circles of data science = 1. math/data/modeling/algorithms 2. code/technology/implementation/computers 3. subject knowledge/diverse knowledge. (this is the other side of privacy) while security is about system building, data science is closer to a pure science. about experimentation. data science is, in a way, devaluing the brain. devaluing human intelligence. the world's smartest person would not be as smart as the smartest machine. can we really live with that?? visually, data science is OCR.

==========================
Data Science is a new way of using computer.
traditionally, we used computer to make pipelines. (look at security)

Data Science is more about solving a math problem. it is mostly guessing game.
and, to get it as correct as possible, you want to collect as much data as possible, as more data will bring up the accuracy. up to a certain point. then the accuracy gain wanes off. in that case, you dont need to collect more data. you still want to collect more data, because you want to say, i have 2000 exabytes of data. that is like saying, i have 50 billion dollars. and cant spend it all.

that is what is happening these days. since many people dont know what they are talking about, they substitute the size of their disk storage.


if you can, well, you can collect more data. but, i would rather spend that resource on working on something more interesting...

today's machines are very fast. your laptop is very fast. your phone is very fast.

and, this way of solving a problem is a different approach.
it is like solving a theorem by exhaustion with computing power.
it is like proving by brute force.

instead of tweaking by hand, it is like tweaking by natural selection.
you propose a hypothesis, and let the computer tell you whether it is a good thing, or bad thing.


this is the closest thing to human + computer symbiosis, so far.

===================================
data-driven means statistical. basically, there is no single feature.
instead, take multiple factors into account.
then, even if one specific feature is lacking, there is a spread of information.

after all, this is the basis of democracy.


===========================
"Data Science" is the new word for "intelligence". I still like to use the old term, because there are some elements in there, still considered arts, even though computer is kicking out the "arts" part out of it.
========================
where is a good place to learn about data mining and machine learning.

- wikipedia - start with any data mining topic, and follow the links.
- quora - a good overview from people who are in the industry.
- kaggle - just dive in, and figure it out. nothing is better than working with actual data. it is like learning how to ride a bicycle. forum people are usually helpful.
- mit open lectures - i found the traditional university material to be more basic and more useful than the new online courses popping up. the harvard data science course was also good. very well covered.
- slideshares or youtube

======================

i tend to think in this dualism, as in Descartes.
physical things and meta-physical things.

And, I also imagine them as an RBM. restricted Boltzmann machine.

there are visible states, the outer values.
and there are inner hidden state.

where the visible things are nouns,
and the invisible things are adjectives. a tag.
or, if it is exclusive, then folders.


if it is simple, the hidden state can be a single layer.
most language stuff is simple enough that it can be done in a single layer.

if it is more complex, the hidden state does not have to be a single layer, but a multilayer. or even a network hidden behind it. it does not have to have a strict format. and, it can be whatever the data says.
that is mostly what deep learning is about.




=====================================
machine learning is just data mining with a feedback loop.
the feedback loop phase is similar to systems engineering,
yet a large portion of it is just data mining.

=======================================
dont lose the signal: from folder model to tag model

data-mining is pegged on the single idea of removing the noise, without losing the signal.
meaning, all the signal is already distilled inside the data. the data-miner is not creating something new. that is a job for fiction writers.

An actual datamining process is no different from how people would find a pattern.
We see similar things and different things.
we put the similar things in one place. and separate out the different things into different places.

psychologically/sociologically, people like to "stereotype" things.
it is a mental shortcut, and it helps us deal with things. from the times when we had 7 channels on tv.
bucketizing into 7 +- 2 buckets. (or we feel confused and go crazy)
bucketizing is okay when the data is cleanly divisible, like mondays and tuesdays.
when it is a complex matter, such as human characteristics, it doesnt fit into a few buckets.
If you model something that is not mutually exclusive as a mutually exclusive model, a great amount of signal is lost.
once you are throwing different things into the same bucket, that bucket starts to lose its meaning. it becomes a bucket of "random stuff".
Even if bucketizing works really well, with people data, people start to game the system by focusing on the bucket.
Any critical system with a feedback loop that focuses on a few variables will get gamed, abused, and broken down.


the biggest advantage of recent years is that we have powerful computers that do not have human-y restrictions.
machines dont have the 7+-2 limitation. their memory is "limitless".
so, the best strategy with computers is to avoid premature bucketizing as much as possible.
Instead of putting them into a folder, just put a tag on them. a tag system works wonderfully with (seemingly) self-contradicting data.
What works better than a regular tag model, is a probabilistic tag model. (in fancy terms, this graphical model with probability is called a bayesian network, markov network, belief network.)
it is like having an infinite number of buckets in a multi-dimensional space.
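a tiny sketch of the folder-vs-tag bookkeeping (hypothetical items and tags):

# folder model: mutually exclusive, one bucket per item -> signal gets thrown away
folders = {"alice": "smart", "bob": "stupid"}

# tag model: many tags per item, nothing is thrown away
tags = {"alice": {"smart", "stubborn"}, "bob": {"smart", "lazy"}}

# probabilistic tag model: each tag carries a weight/belief, so self-contradicting
# evidence becomes a distribution instead of a fight over one bucket
prob_tags = {
    "alice": {"smart": 0.9, "stubborn": 0.6},
    "bob": {"smart": 0.7, "lazy": 0.4, "stupid": 0.2},
}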

*When it comes to model construction, this gets a bit philosophical, dealing with ontology. why do we call a thing what it is.. kinds of stuff. is it black or white, or a gradient of it. can a person be a good person and an evil person? can a person be smart and stupid? whether dual citizenship is allowed. religiously, whether there is one god or many gods, and so on...

=======================================
limitation of Neural Network/ Perceptron

we think bayesian and neural networks have a lot in common. so, we are here to defend some NN ideas.

it is known/debated that a neural network cant learn "xor".
i suspect this statement is generally true.
minsky says it is only capable of learning linearly separable patterns.
this was a big deal, since NN is how the brain works, and it shouldn't have such an obvious limitation.
here is my take on it.


"xor" is one of the favorite things used by computer programmers, since it is a fast operation.
the reason why it is fast can shine a light on why NNs are having a hard time learning it.

at the low low level, XOR is a 1-bit ADD, with no carry.
meaning, adding A and B, and taking only the one's bit.
so, it is taking the least significant bit, and throwing out the most significant bit.
it is like a 1-bit integer overflow.

the neural network is natural, meaning it came into being because it is evolutionarily favorable. we pay special attention to things that are natural, because it implies they stood the test of time as a good method.

the nature is honest, that it always cares about the most significant bit(signal) over the least significant bit(noise). it is not random, or deceiving.

in other words, nature uses a floating point system, not a fixed-bit integer system.
so, the neural network is just not designed to collect noise over signal. (this is not a restriction, rather a great feature ;) this is similar to saying gravity is not a restriction, but a great feature of the universe.)

on the other hand, XOR (or mod) is great as a noise generator. in cryptosystems, especially in PRNGs, xor (and mod, used along with prime numbers) gets used. (crypto and AI are both fields where math meets cs. while AI tries to be as close to human as possible, crypto tries to be as far from human as possible.)
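a minimal numpy sketch of the classic result behind that debate: a single linear unit cant produce XOR (its best least-squares fit outputs 0.5 everywhere), while one small hidden layer, even with hand-picked weights, gets it exactly.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# single layer: the best linear fit (with bias) of XOR is a flat 0.5 -- not linearly separable
A = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(A @ w)                                   # -> [0.5 0.5 0.5 0.5]

# one hidden layer with hand-picked weights: h1 = OR, h2 = AND, out = h1 AND NOT h2
step = lambda z: (z > 0).astype(float)
h = step(X @ np.array([[1.0, 1.0], [1.0, 1.0]]).T + np.array([-0.5, -1.5]))
out = step(h @ np.array([1.0, -1.0]) - 0.5)
print(out)                                     # -> [0. 1. 1. 0.]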


=======================================
*computational sociology: Data Mining meets Social Science

What is an inside joke? inside information?

While looking at pagerank, which is "belief propagation" in a bayesian network, we saw lots of analogies to the neural structure. neurons are the webpages, and synapses are the hyperlinks. But, pushing the analogy further, we viewed a society as a big brain. each individual is a neuron, and each interaction is a synapse. Modeling society in that way, we could take lots of research done on web search engines, and apply those techniques to social network analysis to make a "collective brain" of a group of people. (This is a big deal.) it is not just a knowledgebase to measure popularity and other cultural implications, but also a cooperative working framework. more on this on /blog.

-----------------------------------
Many people say "Why do we need this? human editing is good enough."
I agree that machines cant beat top-tier news outlets and wikipedia for depth and writing style.
Why we need "social data mining" is not because humans cant do better. (Machine Intelligence is always competing against Mechanical Turks.)
it is because "when the information is too valuable, if left to a few 'chosen'/'pre-selected' people, it is prone to be corrupted."
This is the same argument as: why dont a selected few smart people pick the president of the US? I am pretty sure that smart people could pick a better president for the nation, instead of wasting time and money on campaigns and trying hard to look pleasant, when the needed skill is making smart decisions, not looking good.
But, we still choose popular election, because it is less corruptible. less manipulable.
by collecting multiple weak signals, the system is more reliable.


=======================================
Data Structure of Social Data

Social data is deeply based on the human language.
In linguistics, we can break down a sentence into this generalized form:
"Subject Verb Object"

With social data, we use a graphical model.
most simply put,
(User) - action - [Thing].

in a graphical model, a node is an object, and an edge is a relationship.

The graphical model is a really simple and intuitive model, and well suited to representing this complex data.
it is frequently used to represent complex structures, such as the internet, and the brain's neural structure.

*it is no accident that NoSQL's tuple system works well with web data and social data.
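a minimal sketch of that (User) - action - [Thing] tuple shape, with hypothetical records:

from collections import defaultdict

# each record is a (subject, verb, object) tuple -- a node-edge-node triple
triples = [
    ("alice", "checked_in", "cafe_luna"),
    ("alice", "photographed", "cafe_luna"),
    ("bob", "checked_in", "cafe_luna"),
    ("bob", "liked", "pizza_roma"),
]

# the graph view: node -> list of (edge label, neighbor)
graph = defaultdict(list)
for user, action, thing in triples:
    graph[user].append((action, thing))    # outgoing edge
    graph[thing].append((action, user))    # reverse edge, kept for easy traversal

print(graph["cafe_luna"])   # who is connected to this node, and by which actions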

=======================================
Social Data: "it's complicated"

Natural data, like SETI data or gene data, is complex, and has noise in it.
But, social data is complex in a different way.
social data is about society. and, society is about how people live together.

in a way, people data is extremely simple. Social data is usually a short snippet. 15 seconds or less.
It does not have the complex argument structure of a PhD dissertation, or Windows internals.
It is just a large amount of shallow data. an EXTREMELY large quantity of it. I usually call it a "heap of blahs".

People are complex. sometimes irrational, self-contradictory.
it's not black or white, but a gradient of color, and a mixture of colors.
and the relationships between these complex/irrational people are mind-bogglingly complex, especially with miscommunication, exaggeration, and lies. a society, in effect, is some sort of complex gaming environment of truths and misleading truths.



-----------------------------------------
Social Data Noise

Mainly, there are 2 classes of noise in social data.
"unintentional noise": communicating with a language is actually a constant interpretation. Listener is always trying to interpret what the speaker is saying. This is analogous to a natural noise occuring in the wire. however, high flexivity of language tolerates/enables a high level of noise.
"intentional noise": basically lies and exaggerations. people do not always tell the truth, whether it is due to social contract, or self-advertising. People pretend to like something they dont like really.  People pretend to be nonchalant when they are passionate about it.

These noises are actually good for society in that they avoid conflicts, and they are essential building blocks of culture. However, for data-mining purposes, noise is noise.

The good news is that lies are unnatural, and truths are natural. it takes lots of effort/energy to fabricate a lie. and, it is all economically driven. and, the market economy tells us that a large dataset always tells the truth.

In the grand scheme of things, there is more truth than lie.



-----------------------------------------
"action" speaks the truth
the difference between what you say and what you do.

even though DM/data scrubing can detect lies/noise, and filter out them, it is best to start with the "right" kinds of data.


here, we take some radically different concept from similar product, yet the concept is widely used practice in search engines.
Instead of asking you to rate on a 5-star rating system (which become 3.5 stars. small difference matter alot, yet noise takes over. you could do lower bound on Bernoulli trials, but not my cup of tea.)
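(for reference, that "lower bound on Bernoulli trials" is usually done with the Wilson score interval: rank by a pessimistic estimate of the positive rate instead of the raw average. a sketch:)

import math

def wilson_lower_bound(pos, n, z=1.96):
    # lower bound of the Wilson score interval for a Bernoulli proportion (95% by default)
    if n == 0:
        return 0.0
    p = pos / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)

print(wilson_lower_bound(9, 10))     # 9/10 positive ratings -> ~0.60
print(wilson_lower_bound(90, 100))   # same average, more evidence -> ~0.83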

We focused on "action" data, "implicit" data, meaning check-ins and photos.

Your actions speak much more clearly than your words. mostly because the reward-to-cost ratio of lying with words is usually much higher than of lying with actions. actions cost too much time and effort.

There are 2 main advantages of using "implicit data":
1) it is plentiful, since it is easier to post. (lots of small info packets are more reliable than a few large info packets, when it comes to common knowledge.)
2) it is truthful, since you don't really think about the consequences. there is less social contract attached to it. thus, it is less noisy. the truth is in the unconscious.


*on a cursory exploratory survey of check-in data, it drew a perfect power curve.





========================================
What is belief propagation/message passing/Bayesian networks/Markov random fields?
Why are there so many names?

how does information spread in a network?
how does a flu spread in a network?
how does secret spread in a network?

they all spread in a similar fashion.
or, a virus is information. in communication theory, this is true.
anything that can propagate in a network follows this belief propagation scheme.
and, that is why it is used in "search engine ranking".

not only is belief propagation useful in computer science, it has many philosophical implications. how you learn about the world is from your surroundings. as is advertising. they are all in the realm of information spreading.
=======================================
Web 1.0: Popularity : backrub, vouching, belief propagation.

google is amazing at finding the answer.
according to google's pagerank, "importance = popularity".
we are not talking about what is good or what is bad.
we are in the realm of super-philosophy, and there is no such thing as good or bad.
what is popular is what is worth paying attention.

These all stem from the "popular election". Voting might seem like a rudimentary idea, but it is a remarkably elegant algorithm for collecting weak signals into a strong signal, and it is the basis of our civilization.

And, we used a good chunk of the algorithm from this "popular voting".

a great overview of PR: http://computationalculture.net/article/what_is_in_pagerank
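a bare-bones sketch of that vouching math, plain power iteration with a damping factor over a tiny hypothetical link graph:

import numpy as np

# tiny web: who links to whom
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pages = sorted(links)
idx = {p: i for i, p in enumerate(pages)}

# column-stochastic transition matrix: each page splits its vote among its out-links
M = np.zeros((len(pages), len(pages)))
for src, outs in links.items():
    for dst in outs:
        M[idx[dst], idx[src]] = 1 / len(outs)

d = 0.85                                   # damping factor
rank = np.full(len(pages), 1 / len(pages))
for _ in range(50):                        # power iteration until it settles
    rank = (1 - d) / len(pages) + d * M @ rank

print(dict(zip(pages, rank.round(3))))     # "c" collects the most vouching, so it ranks highest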

=======================================
Web 2.0: Relevancy : collaborative filtering

Web 2.0 has been about the real life.
it is not about 1+1=2, but, about your favorite color.
it is not about "objectivity", but about "subjectivity". in another words, opinions.
to put in more bluntly, the dramas among your friends are much more important than the war in the middle east, or the ground breaking scientific experiment done with LHC.
to put in more philosophiclly, there is no one single truth. objectivity assumes that there is a such thing, that one belief speaks the truth. there might be, but we dont know which. in that case, it is always, in some degrees, "subjective".
you are the center of your universe.


(in a way, everyone lives inside their own brain. the "homunculus in your brain". and, using language always makes communication a translation of meaning. we are never 100% clear when we use a language.
if this is pushed to the extreme, this is also the "mental hospital".)

to find what fits your standard.
to find what is SIMILAR to you.
we have modified pagerank, turned it upside down. a "selfish pagerank".


(we were actually approaching this problem from spam filtering. to remove fraudulent data, and to remove irrelevant data. since, unlike scientific data, there can be a wide variety of "truths" in social data. what we ended up with is that collaborative filtering works for both cases, and almost all cases.)


it gets more interesting...
that Peer Pressure is, in effect, no different from "culture".


we are also social animals, in that we form a society. and the society defines the culture it likes.
you become the average of the 5 people you most interact with.
you are one of the 5 close people affecting that person.
This is called "peer pressure", or "group dynamics", or a variation of it.
in mathy terms, it is Collaborative Filtering, UOUO correlation.


===================================
Mix them all up

an interesting thing about social network data is that,
while working with these methods, we realized that,
as we bring the reference point closer to subjectivity, rather than objectivity,
the line between IMPORTANCE and RELEVANCY starts to blur.


Our ultimate algorithm is a mix of popularity and collaborative filtering, under one continuous computational model.

Since pagerank is well known, we will skip the description on that. Also, to apply to social data, we have dumbed it down significantly.

(*strictly speaking, pagerank is also collaborative filtering. any method that utilizes the link layer, instead of the innate object, is collaborative filtering. but, we usually reserve the term "collaborative filtering" for personalized collaborative filtering, not for global collaborative filtering. and, pagerank fits better with belief_propagation than collaborative_filtering.)


For collaborative filtering, on super sparse data,
we designed an "n-hop generalized belief propagation".
it is simplified for scalability, and fitted for fast calculation.

usually, "belief propagation" cant propagate through inanimate objects. but, if we let it "flow" via shared objects, the formula gets super simplified. thus "generalized". the details can be tweaked with weights.

[diagram of social network]
1-hop: I like that   O------[]
2-hop: My friends like that  O------O-----[]
3-hop: Similar taste people like that O---[]-----O----[]
4-hop: people similar to my friends like that O---O---[]----O----[]
       Similar taste people's Friends like that  O--[]---O---O---[]
n-hop: you can just go on forever, until the signal decays out.


*the best way to visualize it: you are pumping water into Node_you, and the amount of water at node_x is the link strength.
*To simplify: for intermediate nodes, the more eccentric/unpopular the node is, the stronger the link.
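a rough sketch of that n-hop pumping-water picture, on a tiny made-up user-item graph: start all the water at you, let it flow across shared objects, and whatever pools on an item you haven't touched is the recommendation signal.

from collections import defaultdict

# bipartite graph: users -- items (undirected edges)
edges = [("you", "coffee"), ("you", "ramen"),
         ("ann", "coffee"), ("ann", "jazz_bar"),
         ("bob", "ramen"), ("bob", "jazz_bar"), ("bob", "museum")]
graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

def n_hop_scores(start, hops=3, decay=0.5):
    water = {start: 1.0}
    for _ in range(hops):
        nxt = defaultdict(float)
        for node, amount in water.items():
            for nb in graph[node]:
                nxt[nb] += decay * amount / len(graph[node])   # split the flow, decay each hop
        water = nxt
    return water

scores = n_hop_scores("you")
print({k: round(v, 3) for k, v in scores.items() if k not in graph["you"] and k != "you"})
# items reached only through similar-taste people (jazz_bar, museum) pick up some of the flow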










remember, 90% of the code is in the data.
and, what you are doing is merely transforming the data.
think of it as debugging a massive codebase, where the code is not in code format, but in piles of data.

and, the actual code is tiny, relative to the data size.
you have 2GB of data, yet the code to process it is 100kb of text.

it is more like which photo-editing tool is being used to edit the photo or video.

==========================================


Let the machine do the dirty work.

- Kernighan and Plauger


====================================
simple model vs complex model.

when you look at the data model in the wrong way, you make epicycles. http://en.wikipedia.org/wiki/Deferent_and_epicycle
things get complicated. and it seems like all noise. so, as a good DMer, you disregard that feature. (this is the right thing to do. well, the semi-right thing to do.)
But, when you move the model to heliocentric, everything gets so straightforward.

that is the magic of a good model representation.
not everything you look is what you think it is.

================================
the difference between using kNN or SVM is like the difference between using a regular screwdriver or a flathead driver. it is about the same.

==================================
Data Science has lots to do with Odds and luck.

We need to define what it means to be lucky, because the definition has been tainted by marketing campaigns.

you can get lucky in the short term. meaning you have beaten the odds. your sports team has made an upset against the odds. it means it has deviated from the norm, in a significant way. unlucky means the deviation was against your favor.
but, nobody can maintain their luck over time. if they can maintain the odds, deviating from the norm for a long period of time, it is either skill or cheating.
and, if you can cheat over time without getting caught, that is also a skill.


=====================================

Heroes in this field are:
Claude Shannon
Richard Hamming
Norbert Wiener
Leonhard Euler
Richard Feynman
Stephen Wolfram
Ray Kurzweil




================================
When you are coding, sometimes you copy and paste code in, and dont really understand it.
but, just enough to know it works, by unit testing it.

much of machine learning is like that.
you don't really understand, yet you sometimes trust the distilled logic inside the data.

and, sometimes, you dont understand the algorithm either. this is a bad case.

knowing either the data or the algo will save you.
if you dont know either well, it is more like a shot in the dark..... it could work... but not against competitors who use things correctly.

===================
In the scheme of graphical models, why add a "hidden node"?

i usually prefer to do it without hidden nodes, since it is just simpler that way.
adding hidden nodes munges signals together.
aside from reducing the number of links; in these days of cheap computing, the computational complexity really doesn't matter significantly.
if you can get a better prediction, that matters much more. the computing time gets solved by time. just wait 1 more year.

adding multiple layers builds more structure into the model.
========================
very carefully, I want to proclaim that human slavery is over.
i am saying this with a bit of doubt; it is a complex issue.

we dont need humans for repetitive tasks anymore.
which sounds good, on the surface.
but, it means that the "property class"/people with money dont need to hire that many people.

with the education system, we have made a large "worker class", good at doing what they are told to do. we no longer need this "worker class".

it is good for the society as a whole.
but, this worker class, who used to get paid, can no longer find a task that someone is willing to pay for. a robot can do it at a fraction of the cost.

in that case, it becomes like watching the "special olympics". (with all due respect to special people)

can taxi drivers really compete in the market against self-driving car?
can pharmacies really compete in the market against pharmacy dispensing robot?
can poker player really compete in the market against the poker bot?
can chess player really compete in the market against the chess bot?

Govt would start to create jobs, and protect these people. but, that is socialism. and that didnt work well in the USSR.
it could work, as in the northern european nations. but, i think natural resources played a big part in that.

I worry about these things.
but, at the same time, humans are very adaptive animals.
=========================
Whenever I look at a text that talks about using an ultrafast 333mhz processor, somehow i feel both lucky and smug.
i feel lucky that i have 10x what the supercomputer had back then.
and also, that i dont have to worry about making the algorithm so lean.
and can focus more on the algorithm itself, rather than trying to fit into a hardware requirement.

and, also try to leverage the machinery more, since I can get access to super super computing power.
at the same time, trying to focus on the algorithm, since in 3 years, i would get 4x the performance automatically.

===============
"the simplest answer is usually the correct answer"


==========================
When i do data-mining, i see lots of similarity with cryptanalysis.
if the system is complex, you need more data.
if the system is simple enough, then you dont need that much data. more data will just validate it.
but, once you have found the answer, the rest of the data is just validation.
===========================
 AdaBoost
 Autocorrelation
 Belief propagation
 Boltzmann distribution
 Boltzmann machine
 Centrality
 Cepstrum
 Claude Shannon
 Computational sociology
 Conjugate gradient method
 Connectionism
 Dempster-Shafer theory
 Dmitri Mendeleev
 Elastic net regularization
 Entropy (information theory)
 Expert system
 Factor analysis
 Fenwick tree
 Gamma distribution
 Generalized linear model
 Genetic algorithm
 German tank problem
 Gradient boosting
 Halting problem
 Homomorphic filtering
 Hopfield network
 Information theory
 John von Neumann
 Kalman filter
 Kelly criterion
 Kernel trick
 Kullback-Leibler divergence
 Latent Dirichlet allocation
 Latent variable
 Linear discriminant analysis
 List of probability distributions
 Locality-sensitive hashing
 Low-density parity-check code
 Maslow's hierarchy of needs
 Mathematical beauty
 Michael I. Jordan
 Minimum description length
 Network topology
 Norbert Wiener
 Null hypothesis
 Outline of artificial intelligence
 Poisson distribution
 Principal component analysis
 Probit model
 R/K selection theory
 Random forest
 Rasch model
 Receiver operating characteristic
 Reed-Solomon error correction
 Restricted Boltzmann machine
 Self-organizing map
 Sigmoid function
 Singular value decomposition
 Social choice theory
 Social media
 Spline (mathematics)
 Standard deviation
 Three-phase
 Turbo code
 VC dimension
 Viterbi algorithm
 Window function

===================
(i would like to suggest power-curves, and a tag structure. after some training, the msgs get bucketed into the most frequently occurring combos.)



=========================
Is this the faster method to reach immortality? or is biology the faster method?

If we can slow down the time. and maybe even reverse it, what would happen?

=========================
while doing data mining, you learn about lots of things.
one thing is "dont tell me what to do. show me what to do."
in the old age, we built individual rules. = telling the machine what to do.
in the new age, we make the machine observe what is happening. which is showing it what to do.
this new method is much more accurate. and easy to scale.


=====================================
what people really mean when they say "we are going to make data-driven decisions" is that
we will no longer listen to bullshit, and politics, and rhetoric.
We dont care who you are, or what you have done.
if it doesnt work for this case, it does not work. period.
Our CYA will be data, not reputation.

this could lead to short-term profit maximization. so, you need to be careful about what metrics to use.


========================


If you do data science, you see many people coming over from Physics. and Physics people are really good at data science.
something about how they see the data.

so, this is how i imagine what is going on inside an atom.

KE = (1/2) m v^2.

And, the famous equation is E = m c^2.

that means, if you can imagine a ball flying into the earth at the speed of light, and on impact, the ball is stopped by the ground, that kinetic energy will be dissipated to the surroundings. if you multiply that by 2, that is the power of the atomic bomb.
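just checking the factor-of-2 hand-wave as arithmetic (the classical KE formula taken at v = c, ignoring relativity):

m = 1.0        # kg
c = 3.0e8      # m/s
ke = 0.5 * m * c ** 2     # classical kinetic energy at the speed of light
e = m * c ** 2            # rest energy
print(ke, e, e / ke)      # 4.5e16 J, 9e16 J -- the rest energy is exactly twice the classical KE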



then, lets get into some wonky science. (when physics gets theoretical, it is more math and symbols than reality.)
the question is, what makes a mass a mass? why does the mass-to-energy conversion formula look similar to the kinetic energy formula??
if mass is a vibration in string theory, and if the vibration is stopped, the mass no longer exists, then the energy dissipated can be visualized as vibration energy. (the vibration is not in the 3D world, but in a hyper-dimensional world. and to get a better grip on hyper-dimensions, try to understand how the "imaginary number" really works. it is an orthogonal dimension to the visible/easy dimension, constructed to help understand how the world really works, and what is visible to us.) if you imagine the vibration moving at the speed of light, this kinda makes more and more sense. except the concept of mass breaks down. but, if you can imagine mass and vibration as a dual form, then it still makes some sense to me.

I used to think of string theory as vibration. like a guitar string vibration. Now, I think it is more of a helix. if you look at it from the side, it looks like a sine wave. but in reality, it could be a helix. we are already at an elevated number of dimensions. that way, the momentum is conserved more easily.


(once you understand the imaginary number as kicking up a dimension to help solve a problem, now try to imagine the imaginary (orthogonal) to that imaginary number. it is like x, y, z: real, imaginary, imaginary-to-imaginary. then, apply this back to the existing world's 3-D space, where each dimension can have these hidden dimensions.) you have a 9-dimensional space. (once comfortable with that, try to do one more imaginary dimension. the 4th imaginary dimension. this has always been tough for me. some say it is curled up. but, not really. it is just orthogonal. Can you imagine infinitely many imaginary dimensions? this is a little bit easier than thinking in terms of the real world, as you are already in the imaginary world. and I dont think 'time' is a good example for the 4th dimension. time is a special dimension.) (once done that, try to jump from the y axis' imaginary dimension to the z axis' imaginary's imaginary. in the hyperspace, these are just 2 axes.) (once done that, try to imagine that the x axis' 6th imaginary dimension could be overlapping with the z axis' 24th imaginary dimension. they dont need to overlap exactly, but come very close, since/if the total hyperspace limit is something like 20 dimensions (just a made-up number), and over some limit, these axes can no longer be orthogonal, only close to it.)
=================================
Tags = [Data. Machine. People. Complex. Network. Human. Model. Algorithm. Meaning. Simple. Correlation.]