Recently I was given a task in which I needed to implement a GLM to be used directly in C#, without calling out to R. While fitting a GLM in R is trivial, implementing one natively in C# is not, as it requires quite a bit of knowledge in linear algebra and inferential statistics. After much trial and error, and after going through a number of online courses on linear algebra and inferential statistics, I was finally able to implement the required algorithms. This page documents some of the resources I went through along the way, so that anyone given a similar task can save some time.

0. Quick way

If you are already familiar with linear algebra and inferential statistics, you can go directly to http://bwlewis.github.io/GLM/ and start implementing your own GLM in any programming language. But if you find the terminology and symbols there difficult to understand, as I once did, then read on.

1. Learn SVD, eigendecomposition, and the Cholesky and QR factorizations

The first step is to learn the various matrix decomposition techniques. This may seem unnecessary if you are already using a matrix library such as MathNet.Numerics, but you may find yourself extremely frustrated when the library's performance at solving the linear equations inside IRLS (the algorithm used to fit a GLM) does not match your expectations and you need to implement your own (which was unfortunately the case for me).

Assuming you have forgotten most of the linear algebra you learned in school, I recommend starting with the Coursera course http://czreading.blogspot.sg/2015/05/online-course-coding-matrix-linear.html. The course teaches you how to implement a sparse matrix as well as matrix factorizations such as the QR factorization.
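To give a feel for what implementing such a factorization involves, here is a minimal sketch of QR via classical Gram-Schmidt in Python (pure stdlib, for readability; my actual implementation was in C#, and the function name `qr_gram_schmidt` is my own):

```python
import math

def qr_gram_schmidt(A):
    """QR factorization of an m x n matrix (list of rows, m >= n) via
    classical Gram-Schmidt: A = Q R, where Q has orthonormal columns
    and R is upper triangular."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]  # columns of A
    Q_cols = []                       # orthonormal columns, built one by one
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for i, q in enumerate(Q_cols):
            # project column j onto earlier q's and subtract the projection
            R[i][j] = sum(q[k] * cols[j][k] for k in range(m))
            v = [v[k] - R[i][j] * q[k] for k in range(m)]
        R[j][j] = math.sqrt(sum(x * x for x in v))  # norm of the remainder
        Q_cols.append([x / R[j][j] for x in v])
    # return Q as a list of rows, matching the input convention
    return [[Q_cols[j][i] for j in range(n)] for i in range(m)], R
```

Note that classical Gram-Schmidt is the easiest variant to write but the least numerically stable; for production use, Householder reflections (which the course also touches on) are a better choice.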

While the course does not cover SVD or the Cholesky decomposition, you should find the following materials on those topics very easy to understand after taking it:

http://czreading.blogspot.sg/2015/05/applied-numerical-computing.html

http://czreading.blogspot.sg/2015/05/numerical-methods-for-solving-large.html

With this knowledge, you should be able to implement your own linear algebra routines and should have no problem with the matrix computations in GLM.
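As a second example of how little code these factorizations take once the theory is clear, here is a sketch of the Cholesky decomposition in Python (again illustrative, not my C# code; `cholesky` is my own naming):

```python
import math

def cholesky(A):
    """Cholesky factorization of a symmetric positive-definite matrix
    (list of rows): A = L L^T, with L lower triangular."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # sum of products of already-computed entries in rows i and j
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)   # diagonal entry
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]  # below-diagonal entry
    return L
```

Cholesky is the natural choice for the normal-equations form of the weighted least-squares step in IRLS, since X^T W X is symmetric positive definite whenever X has full column rank and the weights are positive.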

2. Learn inferential statistics

Another important prerequisite is inferential statistics, since fitting a GLM is all about finding the regression model coefficients that minimize the variance-covariance of the error terms in the regression model.

GLM is built on the concepts of a link function and a distribution family; therefore, you must be familiar with concepts such as the logit and logistic functions, the binomial, Bernoulli, normal, and categorical distributions, as well as linear and logistic regression. In addition, you must be comfortable with concepts such as explained variation (R^2), residuals and RSS, standard error, and the variance-covariance matrix. If you lack this background, it is better to go through a data analytics and inferential statistics course such as:

http://czreading.blogspot.sg/2015/04/online-course-data-analytics-and.html

To compute the derivative of the link function, you will also need some calculus, though only the basics.
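For instance, the logit link used by the binomial family is g(mu) = ln(mu / (1 - mu)), and its derivative works out to 1 / (mu (1 - mu)). A quick sketch, with a finite-difference check (function names are my own):

```python
import math

def logit(mu):
    # logit link for the binomial family: g(mu) = ln(mu / (1 - mu))
    return math.log(mu / (1.0 - mu))

def logit_deriv(mu):
    # d/dmu ln(mu/(1-mu)) = 1/mu + 1/(1-mu) = 1 / (mu * (1 - mu))
    return 1.0 / (mu * (1.0 - mu))

# sanity check against a central finite difference at mu = 0.3
h = 1e-6
fd = (logit(0.3 + h) - logit(0.3 - h)) / (2 * h)
```

This derivative shows up directly in IRLS, both in the working response and in the weights.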

3. Start learning about IRLS

Once you have a sufficient understanding of the topics above and are comfortable with local search algorithms, you should have no problem understanding the IRLS implementation described at http://bwlewis.github.io/GLM/

With the above knowledge of linear algebra and inferential statistics, it should take you less than an afternoon to come up with an implementation of GLM, including IRLS and its QR and SVD Newton variants.
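To make the shape of the algorithm concrete, here is a minimal Python sketch of IRLS for the binomial family with a single predictor plus intercept, following the formulation at http://bwlewis.github.io/GLM/: each iteration solves a weighted least-squares problem with working response z = eta + (y - mu) g'(mu) and weights 1 / (g'(mu)^2 V(mu)), which for the logit link simplify to mu(1 - mu). The 2x2 system is solved by Cramer's rule to keep the sketch dependency-free; the function name and fixed iteration count are my own choices:

```python
import math

def irls_logistic(x, y, iters=25):
    """Fit y ~ 1 + x by logistic regression using IRLS."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # accumulate the 2x2 normal equations (X^T W X) b = X^T W z
        s00 = s01 = s11 = t0 = t1 = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi                  # linear predictor
            mu = 1.0 / (1.0 + math.exp(-eta))   # inverse link
            w = mu * (1.0 - mu)                 # weight: 1/(g'(mu)^2 V(mu))
            z = eta + (yi - mu) / w             # working response
            s00 += w
            s01 += w * xi
            s11 += w * xi * xi
            t0 += w * z
            t1 += w * xi * z
        det = s00 * s11 - s01 * s01
        b0 = (s11 * t0 - s01 * t1) / det        # Cramer's rule
        b1 = (s00 * t1 - s01 * t0) / det
    return b0, b1
```

In a real implementation the hand-rolled 2x2 solve is replaced by one of the factorizations from step 1 (Cholesky on the normal equations, or the QR/SVD Newton variants), and the loop stops on a convergence test rather than a fixed count.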

One trick I learned is that the R code given at http://bwlewis.github.io/GLM/ is very useful for reverse engineering. For example, I initially had difficulty implementing the variance function used in the denominator of W's diagonal entries, so I just went into R and typed:

binomial()$variance

R prints the implementation; I repeated this for the rest of the distributions.
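For binomial(), R reports the variance function as mu * (1 - mu). The same trick for the other families yields a small lookup of variance functions V(mu); here is a sketch of how they translate (in Python for brevity — the dictionary layout is my own, not R's or my C# code):

```python
# Variance functions V(mu) for common GLM families, as recovered by
# typing e.g. binomial()$variance or poisson()$variance in R:
variance = {
    "gaussian": lambda mu: 1.0,              # constant variance
    "binomial": lambda mu: mu * (1.0 - mu),
    "poisson":  lambda mu: mu,
    "gamma":    lambda mu: mu * mu,
}
```

Each entry is what goes into the denominator of the corresponding IRLS weight, so transcribing them correctly is what makes the fit agree with R's glm().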
