BIDMach library for GPU computing with Intel parallel studio XE: simply amazing [install on MacOS and EC2]
Using Intel Parallel Studio,
there is a very great BIDMach library with very competitive results(see the benchmarks): on some tasks, one can achieve on a single GPU instance the speed of a cluster of a few hundred instances, at a cost 10 to 1000 times lower.
We’ll see in this article how to install the library on Mac OS, or run an EC2 instance with the library, as well as a few first operations. For a deeper tutorial : my next article.
Install on Mac OS
Having the library installed locally presents the great advantage to develop and test on small datasets directly on the local computer, before renting an GPU-enabled instance in the cloud.
For a CPU-only install
git clone https://github.com/BIDData/BIDMach.git
cd BIDMach
./getdevlibs.sh
./sbt clean package
For a GPU install, it requires an iMac with a NVIDIA GPU and its CUDA library installed.
Since I’m using CUDA 7.5 instead of 7.0, I had to recompile JCuda, BIDMat, and BIDMach.
First download Intel parallel studio (if you need to uninstall it, run command for i in rpm -qa |
grep intel ; do sudo rpm -e $i ; done a few times). |
mkdir ~/technologies/JCuda
cd ~/technologies/JCuda
git clone https://github.com/jcuda/jcuda-common.git
git clone https://github.com/jcuda/jcuda-main.git
git clone https://github.com/jcuda/jcuda.git
git clone https://github.com/jcuda/jcublas.git
git clone https://github.com/jcuda/jcufft.git
git clone https://github.com/jcuda/jcusparse.git
git clone https://github.com/jcuda/jcurand.git
git clone https://github.com/jcuda/jcusolver.git
cmake jcuda-main
make
cd jcuda-main
mvn install
git clone https://github.com/BIDData/BIDMach.git
#compiling for GPU
export PATH=$PATH:/usr/local/cuda/bin/
cd ~/technologies/BIDMach
cd jni/src
./configure
make
make install
cd ../..
#compiling for CPU
cd src/main/C/newparse
./configure
make
make install
cd ../../../..
./getdevlibs.sh
rm lib/IScala.jar
cp ../JCuda/jcuda-main/target/* lib/
rm lib/jcu*0.7.0a.jar
cp ../BIDMat/lib/libbidmatcuda-apple-x86_64.dylib lib/
sbt compile
sbt package
In bidmach file, change the CUDA version to the current one JCUDA_VERSION="0.7.5"
, augment the memory
JCUDA_VERSION="0.7.5" # Fix if needed
MEMSIZE="-Xmx12G"
and start ./bidmach
command which gives :
Loading /Users/christopher5106/technologies/BIDMach/lib/bidmach_init.scala...
import BIDMat.{CMat, CSMat, DMat, Dict, FMat, FND, GMat, GDMat, GIMat, GLMat, GSMat, GSDMat, GND, HMat, IDict, Image, IMat, LMat, Mat, SMat, SBMat, SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{Click, FM, GLM, KMeans, KMeansw, LDA, LDAgibbs, Model, NMF, SFA, RandomForest, SVD}
import BIDMach.networks.DNN
import BIDMach.datasources.{DataSource, MatSource, FileSource, SFileSource}
import BIDMach.datasinks.{DataSink, MatSink}
import BIDMach.mixins.{CosineSim, Perplexity, Top, L1Regularizer, L2Regularizer}
import BIDMach.updaters.{ADAGrad, Batch, BatchNorm, IncMult, IncNorm, Telescoping}
import BIDMach.causal.IPTW
1 CUDA device found, CUDA version 7.5
Welcome to Scala version 2.11.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51).
Type in expressions to have them evaluated.
Type :help for more information.
Everything works well, my GPU is found correctly.
./scripts/getdata.sh
./bidmach
val a = loadSMat("data/rcv1/docs.smat.lz4")
returns
a: BIDMat.SMat =
( 33, 0) 1
( 47, 0) 1
( 94, 0) 1
( 104, 0) 1
( 112, 0) 3
( 118, 0) 1
( 141, 0) 2
( 165, 0) 2
... ... ...
Let’s continue on the Quickstart tutorial :
val c = loadFMat("data/rcv1/cats.fmat.lz4")
val (mm, mopts) = GLM.learner(a, c, 1)
mm.train
To clear the cache :
resetGPU; Mat.clearCaches
EC2 launch
To launch an EC2 G2 (GPU-enabled) instance with BIDMach, there exists AMI with BIDMach pre-installed :
-
in the US west zone (Oregon) : EC2 AMI created by BIDMach team
-
in EU west zone (Irland), I created an AMI with BIDMach and an AMI to launch a cluster of GPU with BIDMach
First, add an EC2 permission policy to your user :
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "StmtXXX",
"Effect": "Allow",
"Action": [
"ec2:DescribeAvailabilityZones",
"ec2:RunInstances",
"ec2:TerminateInstances",
"ec2:CreateSecurityGroup",
"ec2:CreateKeyPair",
"ec2:DescribeInstances"
],
"Resource": [
"*"
]
}
]
}
Create a EC2 security group and a keypair and start the instance from an AMI, all in the zone where the AMI lives.
In the US west zone (Oregon) :
#Create a security group
aws ec2 create-security-group --group-name bidmach --description bidmach \
--region us-west-2
#Create key pair
aws ec2 create-key-pair --key-name us-west2-keypair --region us-west-2
# Save the keypair to us-west2-keypair.pem and change its mode
sudo chmod 600 us-west2-keypair.pem
#Launch instance
aws ec2 run-instances --image-id ami-71280941 --key-name us-west2-keypair \
--security-groups bidmach --instance-type g2.2xlarge \
--placement AvailabilityZone=us-west-2b --region us-west-2
#Get your instance public DNS with
aws ec2 describe-instances --region us-west-2
#Connect to the instance
ssh -i us-west2-keypair.pem ec2-user@ec2-XXX_DNS.us-west-2.compute.amazonaws.com
In the EU west zone (Irland) :
aws ec2 run-instances --image-id ami-e2f74491 --key-name sparkclusterkey \
--instance-type g2.2xlarge --region eu-west-1 --security-groups bidmach
Let’s download the data :
/opt/BIDMach/scripts/getdata.sh
/opt/BIDMach/bidmach
Start BIDMach with bidmach
command and you get :
Loading /opt/BIDMach/lib/bidmach_init.scala...
import BIDMat.{CMat, CSMat, DMat, Dict, FMat, FND, GMat, GDMat, GIMat, GLMat, GSMat, GSDMat, HMat, IDict, Image, IMat, LMat, Mat, SMat, SBMat, SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{DNN, FM, GLM, KMeans, KMeansw, LDA, LDAgibbs, Model, NMF, SFA, RandomForest}
import BIDMach.datasources.{DataSource, MatDS, FilesDS, SFilesDS}
import BIDMach.mixins.{CosineSim, Perplexity, Top, L1Regularizer, L2Regularizer}
import BIDMach.updaters.{ADAGrad, Batch, BatchNorm, IncMult, IncNorm, Telescoping}
import BIDMach.causal.IPTW
1 CUDA device found, CUDA version 6.5
Data should be available in /opt/BIDMach/data/. Let’s load the data, partition it between train and test, train the model, predict on the test set and compute the accuracy :
val a = loadSMat("/opt/BIDMach/data/rcv1/docs.smat.lz4")
val c = loadFMat("/opt/BIDMach/data/rcv1/cats.fmat.lz4")
val inds = randperm(a.ncols)
val atest = a(?, inds(0->100000))
val atrain = a(?, inds(100000->a.ncols))
val ctest = c(?, inds(0->100000))
val ctrain = c(?, inds(100000->a.ncols))
val cx = zeros(ctest.nrows, ctest.ncols)
val (mm, mopts, nn, nopts) = GLM.learner(atrain, ctrain, atest, cx, 1)
mm.train
nn.predict
val p = ctest *@ cx + (1 - ctest) *@ (1 - cx)
mean(p, 2)
During training, you get
- the percentage of consumed train data,
- the negative log likelyhood,
- the gigaflops,
- the times,
- the consumed data gigabytes,
- the megabytes per seconds, and
- the occupied GPU memory
as here :
corpus perplexity=14737,915077
Predicting
3,00%, ll=-0,00783, gf=9,558, secs=0,0, GB=0,00, MB/s=436,75, GPUmem=0,70
6,00%, ll=-0,00806, gf=9,610, secs=0,0, GB=0,01, MB/s=439,78, GPUmem=0,70
10,00%, ll=-0,00804, gf=10,101, secs=0,0, GB=0,01, MB/s=462,40, GPUmem=0,70
13,00%, ll=-0,00802, gf=10,380, secs=0,0, GB=0,01, MB/s=475,39, GPUmem=0,70
16,00%, ll=-0,00813, gf=10,550, secs=0,0, GB=0,02, MB/s=483,29, GPUmem=0,70
20,00%, ll=-0,00804, gf=10,605, secs=0,0, GB=0,02, MB/s=485,10, GPUmem=0,70
23,00%, ll=-0,00793, gf=10,444, secs=0,0, GB=0,02, MB/s=477,68, GPUmem=0,70
26,00%, ll=-0,00820, gf=10,548, secs=0,1, GB=0,02, MB/s=482,65, GPUmem=0,70
30,00%, ll=-0,00797, gf=10,625, secs=0,1, GB=0,03, MB/s=486,27, GPUmem=0,70
33,00%, ll=-0,00798, gf=10,685, secs=0,1, GB=0,03, MB/s=489,04, GPUmem=0,70
36,00%, ll=-0,00795, gf=10,750, secs=0,1, GB=0,03, MB/s=492,29, GPUmem=0,70
40,00%, ll=-0,00769, gf=10,813, secs=0,1, GB=0,04, MB/s=495,43, GPUmem=0,70
43,00%, ll=-0,00811, gf=10,718, secs=0,1, GB=0,04, MB/s=491,17, GPUmem=0,70
46,00%, ll=-0,00824, gf=10,746, secs=0,1, GB=0,04, MB/s=492,30, GPUmem=0,70
50,00%, ll=-0,00798, gf=10,786, secs=0,1, GB=0,05, MB/s=494,21, GPUmem=0,70
53,00%, ll=-0,00784, gf=10,802, secs=0,1, GB=0,05, MB/s=494,82, GPUmem=0,70
56,00%, ll=-0,00809, gf=10,832, secs=0,1, GB=0,05, MB/s=496,25, GPUmem=0,70
60,00%, ll=-0,00817, gf=9,144, secs=0,1, GB=0,06, MB/s=418,94, GPUmem=0,70
63,00%, ll=-0,00765, gf=9,239, secs=0,1, GB=0,06, MB/s=423,33, GPUmem=0,70
66,00%, ll=-0,00818, gf=9,323, secs=0,1, GB=0,06, MB/s=427,19, GPUmem=0,70
70,00%, ll=-0,00779, gf=9,346, secs=0,2, GB=0,07, MB/s=428,33, GPUmem=0,70
73,00%, ll=-0,00782, gf=9,418, secs=0,2, GB=0,07, MB/s=431,64, GPUmem=0,70
76,00%, ll=-0,00761, gf=9,494, secs=0,2, GB=0,07, MB/s=435,24, GPUmem=0,70
80,00%, ll=-0,00806, gf=9,555, secs=0,2, GB=0,07, MB/s=438,00, GPUmem=0,70
83,00%, ll=-0,00791, gf=9,559, secs=0,2, GB=0,08, MB/s=438,16, GPUmem=0,70
86,00%, ll=-0,00812, gf=9,616, secs=0,2, GB=0,08, MB/s=440,77, GPUmem=0,70
90,00%, ll=-0,00817, gf=9,666, secs=0,2, GB=0,08, MB/s=443,01, GPUmem=0,70
93,00%, ll=-0,00797, gf=9,711, secs=0,2, GB=0,09, MB/s=445,04, GPUmem=0,70
96,00%, ll=-0,00817, gf=9,757, secs=0,2, GB=0,09, MB/s=447,12, GPUmem=0,70
100,00%, ll=-0,00799, gf=9,705, secs=0,2, GB=0,09, MB/s=444,77, GPUmem=0,70
Time=0,2090 secs, gflops=9,71
The accuracies are :
0,99035
0,92883
0,98513
0,98612
0,95681
0,96348
...
To get the training options :
mopts.what
Command GPUmem
gives you percentage of used memory, free memory and memory capacity :
(Float, Long, Long) = (0.69568384,2987802624,4294770688)
Stop the instance :
aws ec2 terminate-instances --region us-west-2 --instance-ids i-XXX
To get an updated AMI with the new version of BIDMach and Cuda 7.5, have a look at my article about new AMI.
Well done!