Updates on Parallelizing the Greedy Coordinate Descent Method

I have added parallelization of the main “for” loop in the Greedy Coordinate Descent method for kernel SVM (the L2 empirical risk minimization problem). The gradient array is randomly partitioned into omp_get_num_threads() subarrays, each thread chooses the best variable from its own subarray to update, and the gradient updates are guarded with an atomic mechanism to avoid conflicting writes and loss of information. However, the parallelized DCD (dual coordinate descent) does not converge as fast as I expected in the multi-threaded setting: with 4 threads it is only about 0.0001 seconds faster than a single thread, and it does not converge to the same optimum on every run.
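For reference, here is a minimal OpenMP sketch of the per-iteration step described above. It is not my actual implementation: the box constraint C, the precomputed kernel matrix Q, and the selection rule are placeholders, and the exact coordinate update depends on the SVM formulation. Note that reads of the shared gradient can race with the atomic updates from other threads, which is one plausible source of the run-to-run differences mentioned above.

    // Sketch only: placeholder names, generic box-constrained coordinate update.
    #include <omp.h>
    #include <algorithm>
    #include <cmath>
    #include <numeric>
    #include <random>
    #include <vector>

    void greedy_cd_step(std::vector<double>& alpha,
                        std::vector<double>& grad,
                        const std::vector<std::vector<double>>& Q,  // kernel matrix (placeholder)
                        double C)
    {
        const int n = static_cast<int>(alpha.size());

        // Random partition: shuffle the coordinate indices once per iteration.
        std::vector<int> idx(n);
        std::iota(idx.begin(), idx.end(), 0);
        static std::mt19937 rng(0);
        std::shuffle(idx.begin(), idx.end(), rng);

        #pragma omp parallel
        {
            const int tid = omp_get_thread_num();
            const int nthreads = omp_get_num_threads();
            const int chunk = (n + nthreads - 1) / nthreads;
            const int begin = tid * chunk;
            const int end = std::min(n, begin + chunk);

            // 1) Each thread greedily picks the most violating coordinate in its subarray.
            int best = -1;
            double best_score = 0.0;
            for (int k = begin; k < end; ++k) {
                const int i = idx[k];
                const double score = std::fabs(grad[i]);   // placeholder selection rule
                if (score > best_score) { best_score = score; best = i; }
            }

            if (best >= 0) {
                // 2) Coordinate update, projected onto the box [0, C] (generic placeholder step).
                const double delta = std::max(0.0, std::min(C,
                    alpha[best] - grad[best] / Q[best][best])) - alpha[best];
                alpha[best] += delta;

                // 3) Gradient maintenance; atomics avoid conflicting writes.
                for (int j = 0; j < n; ++j) {
                    #pragma omp atomic
                    grad[j] += delta * Q[best][j];
                }
            }
        }
    }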

My diagnosis is that something is wrong with the way I compute the kernel matrix in the multi-core setting. Hopefully I will fix this by the end of the week.

Cheers.


Gaussian Kernel

I found an intuitive post on the Gaussian kernel on Jesse Johnson’s blog. Enjoy!

Gaussian kernels

In order to give a proper introduction to Gaussian kernels, this week’s post is going to start out a little bit more abstract than usual. This level of abstraction isn’t strictly necessary to understand how Gaussian kernels work, but the abstract perspective can be extremely useful as a source of intuition when trying to understand probability distributions in general. So here’s the deal: I’ll try to build up the abstraction slowly, but if you ever get hopelessly lost, or just can’t take it any more, you can skip down to the heading that says “The practical part” in bold – That’s where I’ll switch to a more concrete description of the Gaussian kernel algorithm. Also, if you’re still having trouble, don’t worry too much – Most of the later posts on this blog won’t require that you understand Gaussian kernels, so you can just wait for next week’s post (or skip to it if you’re reading this later on).

Recall that a kernel is a way of placing a data space into a higher dimensional vector space so that the intersections of the data space with hyperplanes in the higher dimensional space determine more complicated, curved decision boundaries in the data space. The main example that we looked at was the kernel that sends a two-dimensional data space to a five-dimensional space by sending each point with coordinates (x,y) to the five-dimensional point with coordinates (x,y,x^2, y^2, xy). If we wanted to give ourselves even more flexibility, we could pick an even higher dimensional kernel, for example by sending the point (x,y) to the point (x,y,x^2, y^2, xy, x^3, x^2y, xy^2, y^3) in a nine-dimensional space.
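As a quick illustration (my own sketch, not part of the original post), these two feature maps are just a few lines of code; a linear decision rule applied to the mapped coordinates then corresponds to a degree-2 or degree-3 polynomial boundary in the original (x,y) plane.

    #include <array>

    // Degree-2 feature map: (x, y) -> (x, y, x^2, y^2, xy).
    std::array<double, 5> phi5(double x, double y) {
        return {x, y, x * x, y * y, x * y};
    }

    // Degree-3 extension: (x, y) -> (x, y, x^2, y^2, xy, x^3, x^2 y, x y^2, y^3).
    std::array<double, 9> phi9(double x, double y) {
        return {x, y, x * x, y * y, x * y,
                x * x * x, x * x * y, x * y * y, y * y * y};
    }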

This week, we’re going to go beyond higher dimensional vector spaces to infinite-dimensional vector spaces. You can see how the nine-dimensional kernel above is an extension of the five-dimensional kernel – we’ve essentially just tacked on four more dimensions at the end. If we keep tacking on more dimensions in this way, we’ll get higher and higher dimensional kernels. If we were to keep doing this “forever”, we would end up with infinitely many dimensions. Note that we can only do this in the abstract. Computers can only deal with finite things, so they can’t store and process computations in infinite dimensional vector spaces. But we’ll pretend for a minute that we can, just to see what happens. Then we’ll translate things back into the finite world.

In this hypothetical infinite-dimensional vector space, we can add vectors the same way that we do with regular vectors, by just adding corresponding coordinates. However, in this case, we have to add infinitely many coordinates. Similarly, we can multiply by scalars, by multiplying each of the (infinitely many) coordinates by a given number. We’ll define the infinite polynomial kernel by sending each point (x,y) to the infinite vector (x,y,x^2, y^2, xy, x^3, x^2y, xy^2, y^3, x^4,\ldots). In particular, every monomial in the variables x and y, such as x^7y^{42} or y^{10,000}, will appear in one of the entries of this kernel, possibly very far down the sequence.

In order to get back to the computational world, we can recover our original five-dimensional kernel by just forgetting all but the first five of the entries. In fact, the original five-dimensional space is contained in this infinite dimensional space. (The original five-dimensional kernel is what we get by projecting the infinite polynomial kernel into this five-dimensional space.)

Now take a deep breath, because we’re going to take this one step further. Consider, for a moment, what a vector is. If you ever took a mathematical linear algebra class, you may remember that vectors are officially defined in terms of their addition and multiplication properties. But I’m going to temporarily ignore that (with apologies to any mathematicians who are reading this.) In the computing world, we usually think of a vector as being a list of numbers. If you’ve read this far, you may be willing to let that list be infinite. But I want you to think of a vector as being a collection of numbers in which each number is assigned to a particular thing. For example, each number in our usual type of vector is assigned to one of the coordinates/features. In one of our infinite vectors, each number is assigned to a spot in our infinitely long list.

But how about this: What would happen if we defined a vector by assigning a number to each point in our (finite dimensional) data space? Such a vector doesn’t pick out a single point in the data space; rather, once you pick this vector, if you point to any point in the data space, the vector tells you a number. Well, actually, we already have a name for that: Something that assigns a number to each point in the data space is a function. In fact, we’ve been looking at functions a lot on this blog, in the form of density functions that define probability distributions. But the point is, we can think of these density functions as vectors in an infinite-dimensional vector space.

How can a function be a vector? Well, we can add two functions by just adding their values at each point. This was the first scheme we discussed for combining distributions in last week’s post on mixture models. The density functions for two vectors (Gaussian blobs) and the result of adding them are shown in the Figure below. We can multiply a function by a number in a similar way,  which would result in making the overall density lighter or darker. In fact, these are both operations that you’ve probably had lots of practice with in algebra class and calculus. So we’re not doing anything new yet, we’re just thinking about things in a different way.

[Figure: the density functions of two Gaussian blobs and the result of adding them.]

The next step is to define a kernel from our original data space into this infinite dimensional space, and here we have a lot of choices. One of the most common choices is the Gaussian blob function which we’ve seen a few times in past posts. For this kernel, we’ll choose a standard size for the Gaussian blobs, i.e. a fixed value for the deviation \sigma. Then we’ll send each data point to the Gaussian function centered at that point. Remember we’re thinking of each of these functions as a vector, so this kernel does what all kernels do: It places each point in our original data space into a higher (in fact, infinite) dimensional vector space.

Now, here’s the problem: In order to bring things back to the computational world, we need to pick out a finite dimensional vector space sitting in this infinite dimensional vector space and “project” the infinite dimensional space into the finite dimensional subspace. We’ll choose a finite-dimensional space by choosing a (finite) number of points in the data space, then taking the vector space spanned by the Gaussian blobs centered at those points. This is the equivalent of the vector space defined by the first five coordinates of the infinite polynomial kernel, as above. The choice of these points is important, but we’ll return to that later. For now, the question is: how do we project?

For finite dimensional vectors, the most common way to define a projection is by using the dot product: This is the number that we get by multiplying corresponding coordinates of two vectors, then adding them all together. So, for example, the dot product of the three-dimensional vectors (1,2,3) and (2,.5,4) is 1 \cdot 2 + 2 \cdot .5 + 3 \cdot 4 = 15.

We could do something similar with functions, by multiplying the values that they take at corresponding points in the data space. (In other words, we multiply the two functions together.) But we can’t then add all these numbers together because there are infinitely many of them. Instead, we will take an integral! (Note that I’m glossing over a ton of details here, and I again apologize to any mathematicians who are reading this.) The nice thing here is that if we multiply two Gaussian functions and integrate, the number is equal to a Gaussian function of the distance between the center points. (Though the new Gaussian function will have a different deviation value.)
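To make that claim concrete, here is the one-dimensional identity behind it (my own addition, using unit-height Gaussians; the original post skips the calculation):

\int_{-\infty}^{\infty} e^{-(t-a)^2/2\sigma^2}\, e^{-(t-b)^2/2\sigma^2}\, dt = \sigma\sqrt{\pi}\, e^{-(a-b)^2/4\sigma^2},

which is a Gaussian function of the distance between the centers a and b, with deviation \sqrt{2}\sigma rather than \sigma. In higher dimensions the integral factors over coordinates, so the same statement holds with |a-b| replaced by the distance between the center points.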

In other words, the Gaussian kernel transforms the dot product in the infinite dimensional space into the Gaussian function of the distance between points in the data space: If two points in the data space are nearby then the angle between the vectors that represent them in the kernel space will be small. If the points are far apart then the corresponding vectors will be close to “perpendicular”.

The practical part

So, let’s review what we have so far: To define an N-dimensional Gaussian kernel, we first choose N points in the data space. We can then calculate the kernel coordinates of any point in the data space by calculating its distance to each of these chosen data points and taking the Gaussian function of the distances.
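Here is a short sketch of that computation (my own illustration; the chosen centers and the deviation sigma are assumed to be given):

    #include <cmath>
    #include <vector>

    // N-dimensional Gaussian kernel coordinates of a point p, given N chosen
    // centers and a fixed deviation sigma: x_i = exp(-||p - c_i||^2 / (2 sigma^2)).
    std::vector<double> gaussian_kernel_coords(const std::vector<double>& p,
                                               const std::vector<std::vector<double>>& centers,
                                               double sigma)
    {
        std::vector<double> coords;
        coords.reserve(centers.size());
        for (const auto& c : centers) {
            double sq_dist = 0.0;
            for (std::size_t d = 0; d < p.size(); ++d) {
                const double diff = p[d] - c[d];
                sq_dist += diff * diff;
            }
            coords.push_back(std::exp(-sq_dist / (2.0 * sigma * sigma)));
        }
        return coords;
    }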

To better understand how this kernel works, let’s figure out what the intersection of a hyperplane with the data space looks like. (This is what is done with kernels most of the time, anyway.) Recall that a hyperplane is defined by an equation of the form a_1 x_1 + a_2 x_2 + \cdots + a_N x_N = b, where (x_1,\ldots,x_N) are the coordinates of the point (in the higher dimensional kernel space) and a_1,\ldots,a_N are parameters that define the hyperplane. If we’re using a Gaussian kernel then, thanks to our version of the dot product, the values (x_1,\ldots,x_N) are the Gaussian functions of the distances to our N chosen points. The decision boundary is thus the set of points for which the Gaussian functions of the distances to these N points satisfy this equation.

That’s still pretty hard to unpack, so let’s look at an example where each of the values a_1,\ldots,a_N is either 1 or -1. Then near each chosen point with coefficient a_i = 1, the value x_i will be very close to 1, while the other values x_j will be small, so the sum a_1 x_1 + a_2 x_2 + \cdots + a_N x_N will be positive. Similarly, near a point with a_i = -1, the sum will be negative. Thus if b = 0 then the decision boundary will separate the positive points from the negative points. In fact, it will carve out a region reminiscent of the Gaussian balls that define the kernel. One example is indicated on the left in the Figure below, where the colors indicate whether the coefficients are positive or negative. As you can see, the result looks something like a smooth version of the nearest neighbors algorithm.
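Continuing the example with a small sketch of my own (it reuses the gaussian_kernel_coords helper sketched earlier): with coefficients a_i = \pm 1 and b = 0, classification is just the sign of a weighted sum of Gaussian bumps, which is why the boundary resembles a smoothed nearest-neighbors rule.

    #include <vector>

    // From the previous sketch.
    std::vector<double> gaussian_kernel_coords(const std::vector<double>& p,
                                               const std::vector<std::vector<double>>& centers,
                                               double sigma);

    // Sign of a_1 x_1 + ... + a_N x_N - b, where x_i is the Gaussian kernel
    // coordinate of p with respect to the i-th chosen center. With a_i = +/-1
    // and b = 0, the nearest centers dominate the sum.
    int classify(const std::vector<double>& p,
                 const std::vector<std::vector<double>>& centers,
                 const std::vector<double>& a,  // one coefficient per center
                 double b,
                 double sigma)
    {
        const std::vector<double> x = gaussian_kernel_coords(p, centers, sigma);
        double sum = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) sum += a[i] * x[i];
        return (sum >= b) ? +1 : -1;
    }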

[Figure: Gaussian kernel decision boundaries; left: coefficients colored by sign, right: the boundary after adjusting the coefficients.]

If we adjust the parameters a_1,\ldots,a_N, this has the effect of changing the sizes of the Gaussian balls around the points, and thus moves the decision boundary towards or away from them, as on the right of the Figure. If a coefficient switches from positive to negative, the decision boundary will move from one side of a point to the other. If we have a labeled data set (which may or may not coincide with the N points that define the Gaussian kernel) then training a linear classification algorithm (such as SVM or logistic regression) in the kernel space corresponds to moving this decision boundary around, within the constraints defined above, to maximize how many of the data points are on the correct side.

So, this gives us more flexibility for choosing the decision boundary (or, at least, a different kind of flexibility) but the final result will be very dependent on the N vectors that we choose. If we choose too many (for example, if we let the N points that define the kernel be the same as the data points) then we will risk overfitting, similar to how the nearest neighbor algorithm tends to lead to overfitting. What we really want is a small number of points that are evenly distributed throughout the set, ideally such that each of the N points is close mostly to points in the same class.

Finding such a collection of points is a very different problem from what we’ve been focusing on in the posts so far on this blog, and falls under the category of unsupervised learning/descriptive analytics. (In the context of kernels, it can also be thought of as feature selection/engineering.) In the next few posts, we’ll switch gears and start to explore ideas along these lines.

Living in the Office

I moved into my cubicle in the GDC CS department last night. I slept in the conference room, as it is the only open room with the lights off. I find this very efficient in the sense that it sets a regular schedule for me: I am able to go to bed (in my sleeping bag) on time, since the main lights in the Dell complex are turned off every night at midnight, and I get up very early in the morning and quickly get back to my cubicle to be immersed in the working and learning environment.


[Summer Reading]: Matrix Computations and Numerical Stability/Accuracy

During my short stay at Stanford University last week, I visited Jack Poulson, a young assistant professor known for innovative algorithmic design and expertise in fast algorithms. He has recently added support for distributed dense and sparse-direct Linear, Quadratic, and Second-Order Cone Programs to Elemental and begun integrating it into more user-friendly tools (e.g., CVXPY).

During our conversations, he kindly recommended great books on linear algebra and numerical analysis. They are as follows:

1) Trefethen and Bau, “Numerical Linear Algebra” (a friendly advanced undergraduate level book on numerical linear algebra)

2) Higham, “Accuracy and Stability of Numerical Algorithms” (a graduate-level book on numerical linear algebra; I recommend reading this *after* Trefethen and Bau)

3) Golub and van Loan, “Matrix Computations” (*the* reference for numerical linear algebra; it’s a bit too terse for an introduction)

4) Horn and Johnson, “Matrix Analysis” (a great reference for non-numerical linear algebra; it would be okay to read this in tandem with Trefethen and Bau)

Today I bought “Matrix Computations” by Gene Golub and “Accuracy and Stability of Numerical Algorithms” by Nicholas Higham. I plan to start reading the book on numerical algorithms first, as soon as it arrives. (Hopefully I can earn some scholarship money next semester to cover the expense!)

Happy summer reading!


How to Wake up Every Morning on Top of The World

I have been trying to pack a lot into this summer, and in order to get all of it done efficiently and tangibly, I trumpet here that I will be a morning person starting tomorrow morning. Here I share a helpful blog post with you all.

“You get peace of mind not by thinking about it or imagining it, but by quietening and relaxing the restless mind.” ~Remez Sasson

What’s the first thought that goes through your head when you wake up in the morning? Is it deliberate or is it the default “Oh shi#$, it’s six!”?

If that’s how you start your day, then it’s likely your day will be filled with anxiety and stress. It’s not exactly the most productive mechanism for getting things done.

Questions are quite powerful if used in the right way.

How to Use Morning Power Questions

When you wake up in the morning, you are always asking yourself questions, whether you realize it or not. As you brush your teeth, drink your coffee, or eat your breakfast, thoughts are running through your head. You might be thinking, “Why am I so tired? Why didn’t I sleep earlier? What am I going to eat?”

These things generally don’t serve any useful purpose, and in some cases, as you can see, are even hurting you. The idea behind using questions is to take conscious control of the direction of your day. So, let me give you a few examples of things that you could ask yourself first thing in the morning:

  • What do I have to look forward to today?
  • What’s absolutely perfect about my life?
  • How can I make today absolutely awesome?
  • What’s the best thing that could happen today?

By asking yourself these kinds of questions, you start to shift the focus of your mind toward all of the things you want to have happen.

One interesting thing to note is that your questions don’t need to have any basis in reality because your brain will answer anything you ask it quite literally. So if you’re going to be delusional, you might as well make your delusions extremely empowering.

The key to using this effectively, however, is to do it for thirty days in a row. What happens when you do this is that your brain will create a link, known as a neuro-association, between the empowering states you create with your questions and being awake in the morning.

One Question to Ask Yourself Every Morning

For about two weeks now, I’ve been asking myself one question from the moment I wake up: “What am I grateful for?” You’ve heard before that you should start every single day with an attitude of gratitude. This is probably the simplest way to actually do that.


If you ask yourself that question enough days in a row, you will wake up feeling on top of the world every single day. As you start to view your life and the world around you as full of things to be grateful for, you’re going to bring more and more of that into your life.

We all have lots to be grateful for, but we often get caught up in all the things that are wrong with our lives. Hopefully this will enable you to shift your focus.

Ways to Change Your Morning Routine

I want you to give some consideration to changing up how you start your day. In addition to power questions, I encourage you to start your day in a more peaceful, quiet way than you have in the past. I think you’ll find that the impact this has on you, both physically and mentally, will be quite powerful.

Don’t turn on the computer or TV.

As a blogger, for the last year or so, the first thing I would do every single morning was turn on the computer. Even if you are not a blogger, you may have a tendency to turn on the computer right when you wake up. Starting your brain off with so much information overload right when you wake up can’t possibly be healthy.

I encourage you to just enjoy your coffee or breakfast for about twenty minutes. Turning on the TV is one of the most insidious things you can do. The news can have such a negative impact on you that you might not even realize it. The news is generally about everything that’s wrong in the world and this is the first thing you become exposed to in the morning.

One thing that we know from years of self-help is that our minds tend to be extremely receptive in the morning. That’s why I encourage you not to turn on the TV if you’ve been doing it.

Listen to music/something uplifting.

I love listening to music, and when possible I even use an alarm that actually plays music. I try to find uplifting songs or ones that have peaceful melodies. One of the best times to listen to a self-help tape or program is right when you wake up. Think about the effect this will have on you if you do this for about thirty days.

If you listen to inspirational/uplifting material right when you wake up, then you will eventually condition that message into your mind and connect it with waking up in the morning.

Meditate.

I think one of the most challenging things about meditating is to free yourself from thought. As somebody with a mind that moves at what feels like a million miles a minute, this isn’t something I’m great at myself. Early in the morning your mind is in a fairly quiet state and even five to ten minutes of deep-centered relaxation/meditation can make a huge difference in your day.

How do you start your morning routine? Is there anything else you’d add to this list?

SIAM Conference on Mathematical and Computational Issues in the Geosciences

I arrived at SFO last Saturday, June 27th, and took Caltrain to Palo Alto for the SIAM GS Conference on Mathematical and Computational Issues in the Geosciences.

Today is the second day of the conference. Before it started, on Sunday, June 28th, I attended three short courses: Interactive Data Visualization, GPU CUDA C++, and Intro to Machine Learning. I find GPU CUDA acceleration extremely useful and applicable to my current undergraduate research on parallelizing the greedy coordinate descent method for QP. Instead of CPU parallelism with OpenMP, I may think about using a basic SAXPY-style kernel to parallelize my “for” loops (see the sketch below). In the interactive “Qwik Lab” I learned routines and APIs including error checking of GPU code, data management with Unified Memory, and querying GPU devices for their capabilities…
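For my own notes, here is roughly what the SAXPY pattern from the lab looks like in standard CUDA C++ (a generic textbook sketch with Unified Memory and basic error checking, not my research code): each GPU thread handles one index of what would otherwise be a CPU “for” loop.

    #include <cstdio>
    #include <cuda_runtime.h>

    // SAXPY: y[i] = a * x[i] + y[i]. One GPU thread per loop index replaces the
    // CPU "for" loop that OpenMP would otherwise parallelize.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        // Unified Memory: accessible from both host and device.
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        const int block = 256;
        const int grid = (n + block - 1) / block;
        saxpy<<<grid, block>>>(n, 2.0f, x, y);

        // Basic error checking, as practiced in the Qwik Lab.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) printf("kernel launch failed: %s\n", cudaGetErrorString(err));
        cudaDeviceSynchronize();

        printf("y[0] = %f (expected 4.0)\n", y[0]);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }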

[TO BE CONTINUED]