//Cloud notes from my desk -Maheshk

"Fortunate are those who take the first steps." - Paulo Coelho

[k8s] Kubernetes dashboard access warnings

When you access the Kubernetes dashboard through the proxy, you might see a warning like this.

Sample text: configmaps is forbidden: User “system:serviceaccount:kube-system:kubernetes-dashboard” cannot list configmaps in the namespace “default”

Resolution: The message shows that the dashboard’s service account is not allowed to read cluster resources. The fix is to add the required role binding as below.

There are two ways to do it: create the binding with a one-liner from the CLI, or with a YAML file.

$ kubectl create clusterrolebinding kubernetes-dashboard --clusterrole=cluster-admin --serviceaccount=kube-system:kubernetes-dashboard

Or go the YAML way to create the role binding. Create the YAML file below with a name such as “dashboard-rolebinding.yaml” and submit it for creation with kubectl.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubernetes-dashboard
  labels:
    k8s-app: kubernetes-dashboard
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: kubernetes-dashboard
  namespace: kube-system

$ kubectl create -f dashboard-rolebinding.yaml
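
If you prefer to generate the manifest instead of typing it, here is a minimal Python sketch that builds the same ClusterRoleBinding as a dict and prints it as JSON, which kubectl also accepts with -f. The function name dashboard_binding and its defaults are my own for illustration, and it assumes the rbac.authorization.k8s.io/v1 API version.

```python
import json


def dashboard_binding(service_account="kubernetes-dashboard",
                      namespace="kube-system",
                      cluster_role="cluster-admin"):
    """Build the ClusterRoleBinding from this post as a plain dict.

    Hypothetical helper: names and defaults mirror the one-liner above.
    """
    return {
        # Assumption: the stable RBAC API version is available on the cluster.
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {
            "name": service_account,
            "labels": {"k8s-app": service_account},
        },
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": cluster_role,
        },
        "subjects": [
            {
                "kind": "ServiceAccount",
                "name": service_account,
                "namespace": namespace,
            }
        ],
    }


# Print the manifest; redirect it to a .json file and pass that to kubectl.
print(json.dumps(dashboard_binding(), indent=2))
```

Passing a narrower role than cluster-admin as cluster_role is a one-argument change, which is the main reason to wrap the manifest in a function.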

PS: I ran into this while accessing my AKS cluster, so at the time of writing I am not sure whether other providers or distributions behave the same way.


Posted 2019-01-26 | Kubernetes, Uncategorized

[LFCS] Managing Software RAID

mdadm is a handy Linux command for managing MD devices, aka Linux software RAID. Before we jump in, let’s see what RAID is.
– RAID stands for Redundant Array of Independent Disks
– without redundancy, a corrupted disk means data loss
– with a redundant RAID level, if one disk fails, the others take over

The man page says this:

RAID devices are virtual devices created from two or more real block devices. This allows multiple devices (typically disk drives or partitions thereof) to be combined into a single device to hold (for example) a single filesystem. Some RAID levels include redundancy and so can survive some degree of device failure.

Understanding the RAID levels:
RAID 0 - striping { one big device built from multiple disks, no redundancy or easy recovery }
RAID 1 - mirroring { 2 disks, identical }
RAID 5 - striping with distributed parity { data is written along with parity info, so if one disk fails the data can be restored }
RAID 6 - striping with dual distributed parity { a second parity block is written, an advancement of RAID 5 that survives two disk failures }
RAID 10 - mirrored and striped { minimum of 4 disks: mirrored pairs that are then striped }
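
To make the trade-off between the levels concrete, here is a small Python sketch that estimates the usable capacity of an array of equal-sized disks. The function name usable_capacity is hypothetical, the minimum disk counts are the conventional textbook ones, and RAID 10 is assumed to use two-way mirrors with an even number of disks; mdadm itself is more flexible than this model.

```python
# Conventional minimum number of disks per RAID level (textbook values).
MIN_DISKS = {0: 2, 1: 2, 5: 3, 6: 4, 10: 4}


def usable_capacity(level, disks, disk_size):
    """Return the usable capacity, in the same unit as disk_size.

    Simplified model: all disks have the same size and spares are not counted.
    """
    if level not in MIN_DISKS:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < MIN_DISKS[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DISKS[level]} disks")
    if level == 0:
        return disks * disk_size          # striping only, nothing lost to redundancy
    if level == 1:
        return disk_size                  # every disk holds the same data
    if level == 5:
        return (disks - 1) * disk_size    # one disk's worth of parity
    if level == 6:
        return (disks - 2) * disk_size    # two disks' worth of parity
    # RAID 10, assuming two-way mirrors: half of the disks hold copies.
    if disks % 2:
        raise ValueError("this sketch assumes an even disk count for RAID 10")
    return (disks // 2) * disk_size


# The RAID 5 array in this post: 3 active devices of 975872 KiB each.
print(usable_capacity(5, 3, 975872))
```

The RAID 5 case lines up with the mdadm output further down, where a Used Dev Size of 975872 gives an Array Size of 1951744.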

Sample question: How do you create a RAID 5 device using 3 disk devices of 1 GB each, and also allocate an additional device as a spare?
– Put a file system on it and mount it on /raid
– Fail one of the devices and monitor what happens
– Replace the failed device with the spare device


$ cat /proc/partitions
$ fdisk -l { list the partition tables for the given device; if no device is specified, list the partitions of all devices on the system }
$ fdisk /dev/sdc { interactive; use the commands below at the fdisk prompt }
  n { create a new partition, size as +1G }
  m { help }
  t, then L { change the partition type; L lists the types, enter “fd” for Linux raid autodetect }
  w { write the entries to persist }
$ partprobe { inform the OS of the partition table changes }
$ vim /etc/fstab { before we proceed, verify that the disks are not mounted anywhere. In my case, one was in use as a swap device, so I got a “device is busy” error. Remove the entry and reboot }
$ mdadm --create /dev/md1 -l 5 -x 1 --raid-devices=3 /dev/sdc1 /dev/sdc2 /dev/sdc3 /dev/sdc4 --verbose --auto=yes
$ mdadm --detail /dev/md1 { list details after creation, should see 3 devices + 1 spare device }

$ mdadm --fail /dev/md1 /dev/sdc1 { to simulate the failure }
$ mdadm --remove /dev/md1 /dev/sdc1 { remove the faulty one }
$ mdadm --add /dev/md1 /dev/sdc1 { add the device back to the pool as a spare device if healthy }

Other disk-related commands:
$ blkid, blkid /dev/sdc
$ df -h, df -h -T, df -hT /home
$ du -h /home, du -sh /home/mydir
$ mount /dev/sdc5 /mnt, cd /mnt, touch file1 { after mounting, make an entry in /etc/fstab to persist }
$ mount -a { to mount all file systems mentioned in fstab }
$ mkfs.ext4 /dev/sda4 { format a partition as ext4, after creating the partition }

Command output:

root@mikky100:~# mdadm --fail /dev/md1 /dev/sdc1 { simulate the failure }
mdadm: set /dev/sdc1 faulty in /dev/md1

root@mikky100:~# mdadm --detail /dev/md1 { view the detail after the failure, we should see the spare disk being rebuilt }
Version : 1.2
Creation Time : Mon Jun 11 06:10:34 2018
Raid Level : raid5
Array Size : 1951744 (1906.32 MiB 1998.59 MB)
Used Dev Size : 975872 (953.16 MiB 999.29 MB)
Raid Devices : 3
Total Devices : 4
Persistence : Superblock is persistent

Update Time : Mon Jun 11 17:06:09 2018
State : clean, degraded, recovering
Active Devices : 2
Working Devices : 3
Failed Devices : 1
Spare Devices : 1

Layout : left-symmetric
Chunk Size : 512K

Rebuild Status : 3% complete

Name : mikky100:1 (local to host mikky100)
UUID : 772f743c:b1209727:6910411d:690d6294
Events : 20

Number Major Minor RaidDevice State
3 8 36 0 spare rebuilding /dev/sdc4
1 8 34 1 active sync /dev/sdc2
4 8 35 2 active sync /dev/sdc3

0 8 33 - faulty /dev/sdc1

root@mikky100:~# mdadm --detail /dev/md1
Version : 1.2
Creation Time : Mon Jun 11 06:10:34 2018
Raid Level : raid5
Array Size : 1951744 (1906.32 MiB 1998.59 MB)
Used Dev Size : 975872 (953.16 MiB 999.29 MB)
Raid Devices : 3
Total Devices : 4
Persistence : Superblock is persistent

Update Time : Mon Jun 11 17:08:13 2018
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 1
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 512K

Name : mikky100:1 (local to host mikky100)
UUID : 772f743c:b1209727:6910411d:690d6294
Events : 37

Number Major Minor RaidDevice State
3 8 36 0 active sync /dev/sdc4
1 8 34 1 active sync /dev/sdc2
4 8 35 2 active sync /dev/sdc3

0 8 33 - faulty /dev/sdc1

root@mikky100:~# mdadm --add /dev/md1 /dev/sdc1 { add the disk back as spare }

root@mikky100:~# mdadm --detail /dev/md1
Version : 1.2
Creation Time : Mon Jun 11 06:10:34 2018
Raid Level : raid5
Array Size : 1951744 (1906.32 MiB 1998.59 MB)
Used Dev Size : 975872 (953.16 MiB 999.29 MB)
Raid Devices : 3
Total Devices : 4
Persistence : Superblock is persistent

Update Time : Mon Jun 11 17:12:21 2018
State : clean
Active Devices : 3
Working Devices : 4
Failed Devices : 0
Spare Devices : 1

Layout : left-symmetric
Chunk Size : 512K

Name : mikky100:1 (local to host mikky100)
UUID : 772f743c:b1209727:6910411d:690d6294
Events : 39

Number Major Minor RaidDevice State
3 8 36 0 active sync /dev/sdc4
1 8 34 1 active sync /dev/sdc2
4 8 35 2 active sync /dev/sdc3

5 8 33 - spare /dev/sdc1
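
If you want to watch the array from a script rather than by eye, the “Key : Value” lines of the output above are easy to pick apart. Below is a rough Python sketch; parse_mdadm_detail and needs_attention are hypothetical helper names, and the rule used (degraded state or any failed device) is my own simplification, not something mdadm defines.

```python
def parse_mdadm_detail(text):
    """Collect the 'Key : Value' lines of `mdadm --detail` output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first ' : ' only, so times like 06:10:34 stay intact.
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def needs_attention(fields):
    """Assumed rule of thumb: flag a degraded array or any failed device."""
    states = {s.strip() for s in fields.get("State", "").split(",")}
    failed = int(fields.get("Failed Devices", "0"))
    return "degraded" in states or failed > 0


# Excerpt taken from the first --detail output above (right after the failure).
sample = """\
Raid Level : raid5
State : clean, degraded, recovering
Active Devices : 2
Failed Devices : 1
Rebuild Status : 3% complete
"""

info = parse_mdadm_detail(sample)
print(info["State"], "| needs attention:", needs_attention(info))
```

In practice you would feed it the text captured from running mdadm --detail /dev/md1 as root.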

Posted 2018-06-11 | LFCS, Linux, OSS

AI Revolution and resources

Before 1990, it was all CLI. Every piece of software had commands and parameters; that was the standard of software development. After that we had GUI-based interfaces with buttons and mouse clicks. Around 1995, the Internet made the Web UI the standard. After the success of smartphones in 2008, we saw more responsive UIs developed using jQuery and Bootstrap, and now we live in the era of the conversational UI. Bots like Cortana, Siri and Alexa are around the block to help with our modern needs.

Below is a snippet I borrowed from this ebook covering “how the enterprise information technology has transformed over the last few decades”. Thanks to the author, who laid out the classification beautifully with examples.

Client-Server Revolution → Systems of records: It was the client-server revolution that first enabled broad use of information technology to manage business. Organizations first built systems of records: Customer Relationship Management (CRM) systems; Human Capital Management (HCM) systems for HR; and Enterprise Resource Planning (ERP) systems for financials and key assets.

Internet Revolution → System of engagement: The rise of the internet, mobile, and chat allowed us to create systems of engagement that interfaced between the systems of records and interacted directly with customers and suppliers.

AI Revolution → System of intelligence: What is emerging now are systems of intelligence that integrate data across all systems of record, connect directly to systems of engagement, and build systems that understand and reason with the data. These systems can drive workflows and management processes, optimize operations, and drive intelligent interactions with customers, employees, suppliers, and stakeholders.

Below is our Microsoft AI platform story covered in one slide deck.


If you are wondering how your organization can start the AI journey today, below are some of the key resources for learning:

  1. Azure AI -> a page where our Microsoft AI story is well articulated (on the page, scroll down to ‘AI Services’).
  2. Intelligent KIOSK -> a must-try Windows app that demonstrates pre-built AI (Cognitive APIs) & store url
  3. Seeing AI -> a great demo app on iOS using our Cognitive APIs. Good to install and check the capability it brings in
  4. Conference Buddy -> the ingredients needed to develop an intelligent chatbot, with sample code to try one.
  5. Microsoft AI School -> another great learning resource for our Services and ML offerings
  6. The JFK Files -> Cognitive Search – An AI-first approach to content understanding – code

update: 6/26

update 12/28


Happy Learning !!

Posted 2018-06-02 | AI, Azure, ML

[LFCS] terminal multiplexer commands $ tmux

tmux, htop and bash-completion are 3 useful tools for the exam. You can install them using $ apt install tmux htop bash-completion. Let’s see the commands for tmux:

Session management:-
$ apt install tmux -y
$ tmux new -s <session-name>
$ tmux attach -t <session-name> { re-attach to a session; it continues to run in the bg after we detach }
$ tmux list-sessions

Session commands:-
<Ctrl-b> % -> split the window vertically (panes side by side)
<Ctrl-b> " -> split the window horizontally (panes stacked)
<Ctrl-b> x -> kill (close) the current pane
<Ctrl-b> Up, Down, Right, Left cursor -> switch from one pane to the other
<Ctrl-b> d -> detach from the session
<Ctrl-b> [ -> scroll within a pane (use q to exit)

Posted 2018-05-28 | LFCS, Linux

[LFCS] Commands to manage and configure containers in Linux

– LXC (Linux Containers) is OS-level virtualization for running multiple isolated Linux systems (containers) on a single kernel
– LXC combines the kernel’s cgroups (control groups) with namespace isolation to provide an isolated space for our application
– the Linux kernel provides the cgroups functionality + namespace isolation –> cgroups is the brain behind the virtualization
– cgroups provides { resource limiting, prioritization, accounting and control }
– various projects use cgroups as their basis, including Docker, CoreOS, RH, Hadoop, libvirt, LXC, Open Grid/Grid Engine, Kubernetes, systemd, Mesos and Mesosphere
– the initial version of Docker had LXC as its execution environment, but it was later replaced with libcontainer, written in Go
– both Docker and LMCTFY have taken over the container space and are used by many companies


>>>LXC – Linux container commands

$ sudo -i {switch to root account}
$ apt update
$ free -m { check your memory availability, -m for MB, -g for GB }
$ apt install lxc  { Linux containers; early Docker was built on this }
$ systemctl status lxc.service
$ systemctl enable lxc
$ lxc- <Tab> <Tab>  { to see all the lxc- commands }
$ cd /usr/share/lxc/templates
$ ls { should see the list of templates }
$ lxc-create -n mylxcontainer -t ubuntu { should create ubuntu container based on the specified template}
$ lxc-ls { list the local container, ubuntu should appear with name mylxcontainer }
$ lxc-info -n mylxcontainer { should see the status as STOPPED }
$ lxc-start -n mylxcontainer
$ lxc-info -n mylxcontainer { should see the state as RUNNING }
$ lxc-console -n mylxcontainer { console login to the container, username: ubuntu, password: ubuntu }
ubuntu@mylxcontainer:~$ { upon login, the prompt changes to the container’s ubuntu user }
$ uname -a or hostname { to confirm you are within the container }
Type <Ctrl+a q> to exit the console
$ lxc-stop -n mylxcontainer
$ lxc-destroy -n mylxcontainer

>>>Docker container commands
$ apt update
$ free -m
$ apt install docker.io
$ systemctl enable docker
$ systemctl start docker
$ systemctl status docker
$ docker info
$ docker version
$ docker run hello-world { to pull the hello-world for testing }
$ docker ps
$ docker ps -la or $ docker ps -a { list all the containers }
$ docker search apache { search images by name, e.g. apache or microsoft }
$ docker images { to list all the images on the local host }
$ docker pull ubuntu
$ docker run -it --rm -p 8080:80 nginx  { for nginx; -it for interactive, --rm removes the container on exit, -p maps host port 8080 to container port 80 }
$ docker ps -la { list the containers and look for the container_id; typing its first 3 characters is enough }
$ docker start <container_id or name> { say efe }
$ docker stop efe
$ docker run -it ubuntu bash
root@efe34sdsdsds:/# { takes you to the container’s bash prompt }
<type Ctrl+p then Ctrl+q> to detach and switch back to the terminal
$ docker save debian -o mydebian.tar
$ docker load -i mydebian.tar
$ docker export web-container -o xyz.tar
$ docker import xyz.tar
$ docker logs containername or id
$ docker logs -f containername or id { live logs or streaming logs }
$ docker stats
$ docker top container_id
$ docker build -t my-image dockerfiles/ or $ docker build -t aspnet5 .  { the dot at the end makes the current directory the build context, where the Dockerfile is picked up }

>>>for working with Azure Container

$ az acr login --name myregistry
$ docker login myregistry.azurecr.io -u xxxxxxxx -p myPassword
$ docker pull nginx
$ docker run -it --rm -p 8080:80 nginx { Browse to http://localhost:8080  }
{To stop and remove the container, press Control+C.}
$ docker tag nginx myregistry.azurecr.io/samples/nginx
$ docker push myregistry.azurecr.io/samples/nginx
$ docker pull myregistry.azurecr.io/samples/nginx
$ docker run -it --rm -p 8080:80 myregistry.azurecr.io/samples/nginx
$ docker rmi myregistry.azurecr.io/samples/nginx
$ docker inspect -f "{{ .NetworkSettings.Networks.nat.IPAddress }}" nginx
$ az acr repository delete --name myregistry --repository samples/nginx --tag latest --manifest
$ docker run -d redis (By default, Docker will run a command in the fg. To run in the bg, the option -d needs to be specified.)
$ docker run -d redis:latest
$ docker start $(docker ps -a -q)
$ docker rm -f $(docker ps -a -q)

>>>docker minified version
$ docker pull docker.io/httpd
$ docker images
$ docker run httpd
$ docker ps [-a | -l]
$ docker info
$ docker run httpd
$ curl  <ctrl+c>
$ docker stop httpd
$ docker rmi -f docker.io/httpd
$ systemctl stop docker

Happy learning !

Posted 2018-05-27 | LFCS, Linux, Microservices, Open Source

[Azure HPC] Intro to HPC and steps to setup CycleCloud in Azure


On Aug 31, Azure CycleCloud reached general availability. It's a tool for creating, managing, operating, and optimizing HPC clusters of any scale in Azure. Azure CycleCloud is available in the Microsoft Download Center, Azure Marketplace, and Azure Container Registry:
Azure CycleCloud announcement
Azure CycleCloud product page
Azure CycleCloud download
Azure Marketplace offering for Azure CycleCloud
Azure Container Registry container

The following key scenarios are met by CycleCloud:
• Ability to run Linux & Windows HPC Clusters with traditional schedulers, including Slurm, PBS Pro, HPC Pack, Spectrum LSF and Symphony, Grid Engine, or HTCondor.
• Easily managing HPC clusters with multiple VM families and sizes to get capacity for critical runs
• Customizable workload templates that serve as best-practice starting points for Azure deployments
• Active directory integration for access to and management of compute environments


As part of the Microsoft internal MOOC course “Big Compute: Uncovering and Landing Hyperscale Solutions in Azure”, I was introduced to CycleCloud and learned how to set it up in my Azure subscription. I would like to blog about some of my HPC learning and the steps I followed to set one up.

What is HPC? High-performance computing (HPC) is a parallel processing technique for solving complex computational problems. HPC applications can scale to thousands of compute cores. We can run these workloads on premises by setting up clusters, burst the extra volume to the cloud, or run them as a 100% cloud-native solution.


Where is Big Compute used? Compute-intensive operations are usually the best fit for this kind of workload.


How HPC can be achieved in Microsoft Azure?

1) Azure Batch –> a managed service, “cluster” as a service for running jobs; developers can write applications that submit jobs using the SDK; cloud native, HPC as a service, pay-as-you-go billing

2) CycleCloud –> acquired by MS; “cluster” management software aka orchestration software; supports hybrid clusters and multi cloud; manages and runs clusters; one-time license; you have complete control of the cluster and nodes

3) CrayComputer –> partnership with CrayComputer, famous for weather forecasting

4) HPC Pack in Azure Infra –> Marketplace offerings { HPC applications, HPC VM images, HPC storage }

Azure Batch doesn’t need an intro as it has been around for quite some time, and setting up Batch is very easy. Tools like Batch Labs help us monitor and control Batch jobs effortlessly. The Batch SDK makes it easy to integrate with an existing legacy application, either to submit jobs or to manage the entire Batch operation from a custom-developed application. End users need not log in to the Azure portal to submit jobs.

What is CycleCloud? CycleCloud provides a simple, secure, and scalable way to manage compute and storage resources for HPC and Big Compute/Data workloads in the cloud. CycleCloud enables users to create environments in Azure. It supports workloads ranging from distributed jobs and parallel workloads to tightly coupled applications such as MPI jobs on InfiniBand/RDMA. By managing resource provisioning, configuration, and monitoring, CycleCloud allows users and IT staff to focus on business needs instead of infrastructure.



How to set it up in Azure? The steps are already documented here; I am putting the same steps down with screenshots for easy reference.

1) Download the json files to your local drive, say c:\temp

2) Generate the Service Principal

3) Generate an SSH public and private key pair

4) Clone the repo to your local drive, say c:\temp

git clone https://github.com/azurebigcompute/Labs.git

5) Edit the vms-params.json file to set the rsaPublicKey parameter to the public key generated in step 3. The cycleDownloadUri and cycleLicenseSas parameters have been pre-configured, but if you procure a license then you need to update these two params as well. For now, I am leaving them as they are.


6) Now log in with the Azure CLI, create the resource group and storage account, run the VNET deployment and finally create the VMs

C:\temp>az login

C:\temp>az group create --name "cycle-rg" --location "southeastasia"

C:\temp>az storage account create --name "mikkyccstorage" --resource-group "cycle-rg" --location "southeastasia" --sku "Standard_LRS"

C:\temp>az group deployment create --name "vnet_deployment" --resource-group "cycle-rg" --template-uri https://raw.githubusercontent.com/azurebigcompute/Labs/master/CycleCloud/deploy-vnet.json --parameters vnet-params.json

C:\temp>az group deployment create --name "vms_deployment" --resource-group "cycle-rg" --template-uri https://raw.githubusercontent.com/azurebigcompute/Labs/master/CycleCloud/deploy-vms.json --parameters vms-params.json


7) After the deployment, you will find the new set of resources in our resource group, “cycle-rg”. Select the CycleServer VM and copy its IP address to check that you can browse to the CycleCloud setup page.

8) Please note that the installation uses a self-signed SSL certificate, which may show up as a warning in your browser. For this lab setup it is safe to ignore the warning and add an exception to reach the page. Then set up the server (refer to the configure “CycleCloud Server” section from this page). Once you get the CycleCloud page after all the setup, we are ready to create a new cluster and submit jobs.

9) Follow section 5.1, “Creating a Grid Engine Cluster”, as it is from here

10) After the cluster is created, we need to start it and check that it is running.

11) Now our Grid Engine cluster is ready for job submission. For security reasons, the CycleCloud VM (CycleServer) is behind a jump box/bastion host. To access CycleServer, we must first log onto the jump box, and then ssh onto the CycleServer instance. To do this, we add a second host to jump through to the ssh command.

From the Azure portal, retrieve the admin box DNS name and construct the SSH command as below. The idea is to “ssh -J” to our CycleServer through the CycleAdmin box. For security, one cannot ssh into CycleServer directly.

$ ssh -J cycleadmin@{JUMPBOX PUBLIC HOSTNAME} cycleadmin@cycleserver -i {SSH PRIVATE KEY}
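
If you script this step, the same command can be assembled as an argument list, which avoids quoting mistakes in host names or key paths. This is only a sketch: jump_ssh_command is a hypothetical helper, and the host name and key path in the example are placeholders, not values from this setup.

```python
import shlex


def jump_ssh_command(jumpbox_host, private_key,
                     user="cycleadmin", target="cycleserver"):
    """Build the `ssh -J` argument list for reaching CycleServer via the jump box."""
    return [
        "ssh",
        "-i", private_key,                # SSH private key generated earlier
        "-J", f"{user}@{jumpbox_host}",   # first hop: the jump box / bastion host
        f"{user}@{target}",               # final destination behind the jump box
    ]


# Placeholder values for illustration only.
cmd = jump_ssh_command("jumpbox.example.com", "~/.ssh/id_rsa")
print(shlex.join(cmd))
```

The list form can be passed straight to subprocess.run without going through a shell.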


12) Once we are logged in as cycleadmin@cycleserver, first switch to the root user and run the CycleCloud initialize command. You need to enter the username and password for that machine.


13) Connect to the Grid Engine master:

[root@cycleserver ~]$ cyclecloud connect master -c <clustername>


14) Now we are ready to submit our first job. qstat queries the status of Grid Engine jobs and queues, and qsub submits batch jobs.


15) On successful submission, we should see the job start executing on our nodes.


The master takes the batch job, and it gets executed on 3 nodes spun up from the execute node template.


By the way, if we log in to the Azure portal and navigate to the RG, we will see that a VMSS was created for the execute worker nodes.


We can also set up autoscaling from the CycleCloud cluster settings, so that Azure VMs come up when needed and go away once the job is completed. The command below submits 100 jobs, so it will request 100 cores. Based on the cluster core limit, CycleCloud decides whether to scale up to that or not. Let’s say we have set 100 cores as the cluster scale limit; then we would see many more VMs getting created to complete the tasks in parallel.

[cyclecloud@ip-0A000404 ~]$ qsub -t 1:100 -V -b y -cwd hostname
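
To get a feel for how the core limit caps the scale-out, here is a back-of-the-envelope Python sketch. It is a simplified model of my own, not CycleCloud’s actual autoscaling logic: nodes_to_request is a hypothetical name, each task is assumed to need one core, and all execute nodes are assumed to have the same core count.

```python
import math


def nodes_to_request(tasks, cores_per_node, core_limit, cores_per_task=1):
    """Estimate how many execute nodes a job array could scale out to.

    Simplified model: request enough whole nodes to cover the wanted cores,
    but never more nodes than fit under the cluster core limit.
    """
    wanted_cores = tasks * cores_per_task
    nodes_needed = math.ceil(wanted_cores / cores_per_node)
    nodes_allowed = core_limit // cores_per_node
    return min(nodes_needed, nodes_allowed)


# 100 single-core tasks (qsub -t 1:100) on hypothetical 4-core nodes.
print(nodes_to_request(100, cores_per_node=4, core_limit=100))
print(nodes_to_request(100, cores_per_node=4, core_limit=32))
```

With a lower core limit the remaining tasks simply wait in the queue until cores free up.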

Once the job is completed, we can terminate the cluster and, as our last step, delete the RG if we don’t want to retain it. I know it is a bit confusing when you start for the first time, but once you have done it hands-on it is easy to set up whenever you need it and to dispose of after the jobs complete.

Happy learning !

Posted 2018-03-31 | Azure, HPC
