Tuesday, June 20, 2017

Java Comparable and Comparator

In this tutorial, I will go over two similar but very different interfaces in Java: comparable and comparator. For official documentations on these, refer to here and here.

Comparable and Comparator are both Java interfaces to compare two or more objects. The difference is that comparable is the default comparator of the object, whereas comparator is a custom comparison implementation.

For example, assume you are entering personal info. For each personal info object, you have one's name, year, month, and date of birth. One way to sort a list of personal info objects is by their names, but another way is to sort by the date of birth. In Java, one will be able to sort by name or date of birth by providing corresponding comparator to the class. On the other hand, if one is to simply sort the objects without any comparator provided, then the list will be sorted by the class' comparable interface.

Let's take a look at an example.

To run it, simply type in
$ javac Personal.java && java Personal

Sorting by default
Bob 1999, 1, 2
Charles 1955, 7, 4
Mike 2011, 4, 1
Tom 2011, 4, 4
Tony 2000, 5, 5
Tony 1987, 12, 11

Sorting by date of birth
Charles 1955, 7, 4
Tony 1987, 12, 11
Bob 1999, 1, 2
Tony 2000, 5, 5
Mike 2011, 4, 1
Tom 2011, 4, 4

Sorting by name
Bob 1999, 1, 2
Charles 1955, 7, 4
Mike 2011, 4, 1
Tom 2011, 4, 4
Tony 1987, 12, 11
Tony 2000, 5, 5

In the first sorting, no comparator is provided, so the list is sorted by Personal's comparable interface, namely compareTo method. For the second and third sorting, DobComparator and NameComparator is provided, so the list is sorted by the date of birth and name, respectively. The comparator classes provide compare method which specifies the order in which the list is to be sorted.

Sunday, June 18, 2017

Analysis of Quicksort with Duplicate Keys

In this post, I will present two slightly different versions of quicksort, and examine why one is better than the other.

Let's dig in the code first.

As you can see, the difference between two versions is very subtle. One partitions into two where the left contains values less or equal to pivot key and the right contains values strictly larger. The second version partitions into two where the left contains values less or equal to pivot key and the right contains values greater or equal to the pivot key.

What difference will this make? For an array with distinct keys, the difference is not as significant. However, when we have lots of distinct keys, the story changes. The first one performs extremely poorly when there are many identical keys, whereas the second one performs faster as the number of duplicate keys increases.

$ g++ qsort_analysis.cpp -g -ltictoc -O0 && ./a.out 10000000 10000000
Time elapsed: 2.067320s
Time elapsed: 1.842697s

$ g++ qsort_analysis.cpp -g -ltictoc -O0 && ./a.out 10000000 10000
Time elapsed: 10.985325s
Time elapsed: 1.441884s

How does such subtle difference make such large difference? We can analyze this by looking at an array with all the same keys and observe how many calls each quicksort is called in total.

$ g++ qsort_analysis.cpp -g -ltictoc -O3 -DDEBUG && ./a.out 16 1
#1 quicksort1 with 0,15
#2 quicksort1 with 0,14
#3 quicksort1 with 0,13
#4 quicksort1 with 0,12
#5 quicksort1 with 0,11
#6 quicksort1 with 0,10
#7 quicksort1 with 0,9
#8 quicksort1 with 0,8
#9 quicksort1 with 0,7
#10 quicksort1 with 0,6
#11 quicksort1 with 0,5
#12 quicksort1 with 0,4
#13 quicksort1 with 0,3
#14 quicksort1 with 0,2
#15 quicksort1 with 0,1
#16 quicksort1 with 0,0
#17 quicksort1 with 2,1
#18 quicksort1 with 3,2
#19 quicksort1 with 4,3
#20 quicksort1 with 5,4
#21 quicksort1 with 6,5
#22 quicksort1 with 7,6
#23 quicksort1 with 8,7
#24 quicksort1 with 9,8
#25 quicksort1 with 10,9
#26 quicksort1 with 11,10
#27 quicksort1 with 12,11
#28 quicksort1 with 13,12
#29 quicksort1 with 14,13
#30 quicksort1 with 15,14
#31 quicksort1 with 16,15
Time elapsed: 0.000057s
#1 quicksort2 with 0,15
#2 quicksort2 with 0,7
#3 quicksort2 with 0,3
#4 quicksort2 with 0,1
#5 quicksort2 with 0,0
#6 quicksort2 with 2,1
#7 quicksort2 with 3,3
#8 quicksort2 with 5,7
#9 quicksort2 with 5,5
#10 quicksort2 with 7,7
#11 quicksort2 with 9,15
#12 quicksort2 with 9,11
#13 quicksort2 with 9,9
#14 quicksort2 with 11,11
#15 quicksort2 with 13,15
#16 quicksort2 with 13,13
#17 quicksort2 with 15,15
Time elapsed: 0.000020s

As you can see above, the first version of quicksort is invoked about 2*N times where the second version is invoked about N times for given array of N elements all of equal keys. The reason is that for the first version, each partitioning will be extremely skewed over to the right, whereas for the second version the partition is always at the center, thus leading to much more efficiency.


Saturday, June 17, 2017

Variations in Quicksort

In this post, I would like to go over three different versions of quicksort and compare their performance.

Let's dig into code first.

There are three different version of quicksort. The first is regular binary quicksort, the second is ternary quicksort, and the last is 3-way quicksort. Note that tictoc.h is my custom library; for more details, see this post.

The binary quicksort is what everyone is most familiar with. The left partition contains those less or equal to pivot value, whereas the right partition contains those who are greater or equal to the pivot value.

The ternary quicksort is similar in that it divides into three partitions and two pivot values. The left-most partition contains those less or equal to the first pivot value, the middle partition contains those in between two pivot values, and the right-most partition contains those greater than the second pivot value. Note that I am assuming the first pivot value is less or equal to the second pivot value.

Lastly, the 3-way quicksort partitions the array into 3, where the first is strictly less, the middle is equal to, and the last is strictly greater.

Let's examine their performance. First, when the array values are mostly distinct:
$ g++ -g quicksort.cpp -ltictoc -O3 && ./a.out 1000000000
Time elapsed: 0.983398s
Time elapsed: 0.883001s
Time elapsed: 1.151062s

It seems like ternary quicksort is the fastest among these.

When the array values are less distinct, however, we observe that 3-way quicksort performance starts to take off, while ternary quicksort slows downs greatly.

$ g++ -g quicksort.cpp -ltictoc -O3 && ./a.out 100000
Time elapsed: 0.808114s
Time elapsed: 1.043392s
Time elapsed: 0.829712s

$ g++ -g quicksort.cpp -ltictoc -O3 && ./a.out 1000
Time elapsed: 0.612664s
Time elapsed: 27.944554s
Time elapsed: 0.427633s

We will examine what is going on as we introduce more duplicate key values in the next post.

Assertion: Great Practice for Software Development

In this post, I will discuss recommended usage of assertion in programming. For both C/C++/Java, assert is a great way to alert a programmer that there is something against his expectation.

Let's do a quick example. Say you are implementing quick sort. In particular, you are trying to validate your partition method. You can insert an assertion to make sure that your partition method is functioning as expected, as below:

Well, you must agree that assertion is a very great tool for debugging, as it tells you exactly where it could have gone wrong. Unfortunately, assertion will slow down the performance and is not needed for production code, so do we need to comment them out when releasing the code for production?

In fact, both C/C++ and Java has a nice way to disable and enable the entire assertion statements. For C/C++, you simply need to define a macro NDEBUG, which will disable all assertions. Otherwise, assertions are enabled by default, unless of course you define NDEBUG macro in your source files explicitly. Therefore, you need disable assertions at compile time for production binary:

$ g++ *.cpp -DNDEBUG

Here, -D is the option for implicit #define macro, so the option -DNDEBUG will define NDEBUG macro for all files.

For Java, there is -da and -ea options which will disable and enable assertions, respectively at the runtime. In fact, assertions are by default disabled in Java, so you need to enable with -ea option when debugging, as below:

$ java -ea SomeClass

For more info, please refer to this and this official documentations.

Happy coding!

Monday, June 12, 2017

Setting up Environment for Coursera's Algorithm Class

Yes, I am taking the very basic algorithm class from Coursera. In fact, this class is very good. It is actually a course from Princeton, and I really like the course, except that they use Java, which I am not as familiar with compared to C/C++.

On top of that, the author provides custom jar library which I must use for compiling and running Java projects. Although the instructions are written very clearly on the official website, I still want to go over this bit, because I spent quite some time figuring this out.

In order to compile Java with a custom library, one must specify the library's path with -classpath flag as below:
$ javac -classpath /path/to/custom/library.jar SomeClass.java -g

Also, it is handy to include -g flag so that one can debug it later. If you are compiling a program that uses algs4.jar library file provided by the author, you must download the file and specify it with -classpath flag.

In order to run the program, you need to do the same thing, but with the path to source files as well:
$ java -classpath /path/to/custom/library.jar:/directory/to/source/files SomeClass 

Finally, to debug, you need to do the same thing:
$ jdb -classpath /path/to/custom/library.jar:/directory/to/source/files SomeClass

Let's do an example. Let's assume that you have downloaded algs4.jar file to ~/Downloads folder. Say you have created Queue.java file in ~/Documents folder.

You will need to run the following commands to compile, run, and debug Queue.java, respectively:
$ javac -classpath ~/Downloads/algs4.jar ~/Documents/Queue.java -g
$ java -classpath ~/Downloads/algs4.jar:~/Documents Queue
$ jdb -classpath ~/Downloads/algs4.jar:~/Documents Queue

Happy coding algorithms!

Sunday, June 11, 2017

GDB on Mac OS X (Safe Method)

The officially supported debugging package for Mac OS X is lldb, which is a fine debugger. However, many people including me still prefer to use gdb, the GNU debugger. In fact, I use cgdb, which provides nice color interface to gdb, and that is why I must install gdb on my Mac. Unfortunately, it is not so easy to install and enable gdb to debug on Mac. In the previous post, I presented a method to do this, but I figure now that it was quite unsafe way to do so. For more details, please take a look at the official documentation.

Here is a better way of enabling gdb on your Mac. First, if you haven't installed gdb, do so.
$ brew install gdb

If you are using Sierra or above, run the following as well:
$ echo "set startup-with-shell off" >> ~/.gdbinit

Next, this is the part where we give gdb debug other processes. Open up Keychain Access application and on the menu select Keychain Access -> Certificate Assistant -> Create a certificate.

Enter gdb-cert for the name, select Self Signed Root for Identity Type, select Code Signing for Certificate Type, and check the box Let me override defaults.

Click Continue several times until you see Specify a Location For the Certificate window. Select System for the Keychain and proceed.

Once the certificate is created, double click on it, and in the Trust section locate Code Signing item. Select Always Trust and exit. You will be prompted with your admin password to make the change.

Finally, you can close Keychain Access app and type in the following in terminal:
$ sudo killall taskgated
$ codesign -fs gdb-cert $(which gdb)

That's it! You will be able to use gdb on your Mac!

Friday, May 5, 2017

Handwriting Recognition with Deep Learning

In this post, we will go over an example to apply deep learning, in particular convolutional neural network, to recognize handwritten numbers from 0 to 9 using TensorFlow backend Keras. If you don't have deep learning environment setup yet, checkout this post.

Thankfully, there is publicly available handwritten digits and its labels on the web that we can use. For more info, check out MNIST website. Even better, Keras has MNIST data module, so we will use  it.

Use your favorite editor and create mnist.py file with the following:

Upon running the file with
$ python mnist.py

you will see convolutional network being trained with data and tested with unseen data with accuracy of about 99%!

Thursday, May 4, 2017

Setup Deep Learning Development Environment in Ubuntu

In this tutorial, I will go through step by step instructions to setup deep learning development environment for Ubuntu. We will install the following python packages on fresh Ubuntu 16.04:
opencv
tensorflow
keras
matplotlib
numpy
scipy
sklearn
tk

Let's dig it! First, we need to install pip, which helps us install these python packages with ease.
$ sudo apt-get install python-pip -y

You may want to upgrade pip to the latest version:
$ pip install --upgrade pip

Next, let's install python packages:
$ sudo -H pip install tensorflow keras numpy scipy matplotlib sklearn

For OpenCV and TK, we need to install it from apt-get:
$ sudo apt-get install libopencv-dev python-opencv python-tk -y

That's it! Now you are ready to develop your neural network with tensorflow backend keras! If you want to test out if your environment is successfully setup, check out this post.

Tuesday, May 2, 2017

Displaying Exact Values for Digital Color Meter on Mac OS X

Mac OS X's built-in app Digital Color Meter is an extremely useful tool for me, and I love it. However, the only problem is that it is not exact.

I manually created an image with RGB values of arbitrary numbers, say (38, 205, 86) using OpenCV. I then opened up the image with Preview, and measured the color with Digital Color Meter. Unfortunately, its output was close but not exact.

I also noticed that there is an drop down menu, such as
Display in native values
Display in P3
...

After playing around with all of these options, I am very happy to report that the option Display in sRGB produces exact match with the pixel!


Saturday, April 22, 2017

Hosting Multiple Websites on a Single IP Address

It is not too difficult to host multiple websites on a single server and a single IP address. Here is how to do it with Apache server, based on article1, article2, and article3. I am assuming that you already have a server setup with apache2. If not, please refer to this tutorial for instructions.

The first step is to create virtual hosts where each virtual host serves each different website. For example, assume you want to serve domain names: hi.com and hello.com. You will need to create two virtual hosts, so that one will serve hi.com while the other will serve hello.com.

On Ubuntu or Debian, apache default server configuration files are in /etc/apache2 directory, while web server directory is /var/www.

First, copy the default configuration file and create two virtual host config files:
$ sudo cp /etc/apache2/sites-available/000-default.conf /etc/apache2/sites-available/hi.conf
$ sudo cp /etc/apache2/sites-available/000-default.conf /etc/apache2/sites-available/hello.conf

Next, edit /etc/apache2/sites-available/hi.conf similar to below:
 <VirtualHost *:80>
    ServerName hi.com
    ServerAlias www.hi.com
    DocumentRoot /var/www/hi
    ErrorLog ${APACHE_LOG_DIR}/error.log
    CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>


Note that you must create /var/www/hi directory that contains files to serve for hi.com.

Similarly, edit /etc/apache2/sites-available/hello.conf in the same manner.
 <VirtualHost *:80>
    ServerName hello.com
    ServerAlias www.hello.com
    DocumentRoot /var/www/hello
    ErrorLog ${APACHE_LOG_DIR}/error.log
    CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>


Again, you will need to create /var/www/hello directory that will serve visitors to hello.com.

Next, enable the new virtual host configuration files:
$ sudo a2ensite hi.conf 
$ sudo a2ensite hello.conf 

Next, reload apache2 so that the change takes effect
$ sudo service apache2 reload

If you want to test these out, refer to this excellent article for more details.


These steps up to here will complete the setup for the server side. Now, it is time to setup your domain name configurations.

To direct any visitor who enters hi.com or hello.com to your virtual hosts, you will need to add A Record. Take a look at this article for more details.

Essentially, create A Record for hi.com and hello.com to direct to your server's IP address, and apache server will then take care of directing visitors of hi.com to your hi.com virtual host, and visitors of hello.com to the virtual host of hello.com that you have set up above.

*** Note: make sure not to enable forwarding of your domain name to your server. That was my first attempt, and it did not work. You will need to set up A Record instead in order for your apache server to point to appropriate virtual host.