...
In the following instructions the node cld-np-19, an INFN node, is the one to be reinstalled
Take note of the IP addresses and the interfaces used by the host
First of all, take note of the IP addresses and the interface names used by the host for the management and data networks (192.168.60.x and 192.168.61.x):
...
Code Block |
---|
[root@cld-np-19 ~]# nmcli
eno1: connected to System eno1
"Intel X710"
ethernet (i40e), 34:E6:D7:F5:53:0D, hw, port e4434b3a8890, mtu 1500
ip4 default
inet4 192.168.60.129/24
route4 192.168.60.0/24 metric 100
route4 default via 192.168.60.254 metric 100
inet6 fe80::36e6:d7ff:fef5:530d/64
route6 fe80::/64 metric 1024
eno3: connected to eno3
"Intel X710"
ethernet (i40e), 34:E6:D7:F5:53:0F, hw, port e4434b3a8892, mtu 9000
inet4 192.168.61.129/24
route4 192.168.61.0/24 metric 101
inet6 fe80::65:ec7c:7325:3bec/64
route6 fe80::/64 metric 1024
...
... |
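It can help to save these values in a compact, machine-readable form before wiping the node. A minimal sketch (the awk logic and the saved-values approach are my own, not part of the official procedure) that pulls interface name, IPv4 address and MTU out of `nmcli` output like the one above:

```shell
# Sketch: extract interface, IPv4 address and MTU from nmcli output so the
# values can be restored after the reinstall. Fed here from a here-doc with
# the cld-np-19 lines shown above; on the real node use: nmcli > saved.txt
summary=$(awk '
  /^[a-z]/ { iface = $1; sub(":", "", iface) }   # device header, e.g. "eno3: connected ..."
  / mtu /  { mtu = $NF }                         # "..., hw, port ..., mtu 9000"
  /inet4/  { print iface, $2, "mtu", mtu }       # "inet4 192.168.61.129/24"
' <<'EOF'
eno1: connected to System eno1
        ethernet (i40e), 34:E6:D7:F5:53:0D, hw, port e4434b3a8890, mtu 1500
        inet4 192.168.60.129/24
eno3: connected to eno3
        ethernet (i40e), 34:E6:D7:F5:53:0F, hw, port e4434b3a8892, mtu 9000
        inet4 192.168.61.129/24
EOF
)
echo "$summary"
```

One line per interface (device, address, MTU) is exactly what is needed later in the "Configure the data network" step.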
...
Enable the checks in Nagios for this host
...
Wait until all checks are OK (in particular the VM network and volume checks: remember to enable the node just long enough to force the run from Nagios; otherwise the check will fail).
...
Stop puppet on the 2 controller nodes:
Code Block | ||
---|---|---|
| ||
systemctl stop puppet |
...
Add the new compute node in cld-config:/var/puppet/puppet_yoga/controller_yoga/templates/aai_settings.py.erb
Run puppet on the first controller node (this will trigger a restart of httpd):
Code Block | ||
---|---|---|
| ||
puppet agent -t |
Run puppet on the second controller node (this will trigger a restart of httpd):
Code Block | ||
---|---|---|
| ||
puppet agent -t
|
Start puppet on the two controller nodes:
Code Block | ||
---|---|---|
| ||
systemctl start puppet
|
Enable the host:
Code Block | ||
---|---|---|
| ||
openstack compute service set --enable cld-np-19.cloud.pd.infn.it nova-compute
|
...
Take note of the disk configuration
Check whether the host has one disk with multiple partitions or two disks (usually the first disk, sda, is used for the Operating System and the second, sdb, for /var/lib/nova/instances). Check also whether there is a BIOS boot partition used for EFI. This will be useful for choosing a partition table in the Foreman web interface.
Use the fdisk and df commands; here are some examples:
Code Block |
---|
1) In this first example the cld-nl-24 host has two disks: sda with the OS and sdb with /var/lib/nova/instances; on sda there isn't a BIOS Boot partition (so no EFI);
[root@cld-nl-24 ~]# fdisk -l
Disk /dev/sda: 893.8 GiB, 959656755200 bytes, 1874329600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disklabel type: dos
Disk identifier: 0x1d6bbd3d
Device Boot Start End Sectors Size Id Type
/dev/sda1 * 2048 2099199 2097152 1G 83 Linux
/dev/sda2 2099200 264243199 262144000 125G 82 Linux swap / Solaris
/dev/sda3 264243200 1874329599 1610086400 767.8G 83 Linux
Disk /dev/sdb: 2.2 TiB, 2399276105728 bytes, 4686086144 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 524288 bytes
[root@cld-nl-24 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 252G 0 252G 0% /dev
tmpfs 252G 4.0K 252G 1% /dev/shm
tmpfs 252G 4.0G 248G 2% /run
tmpfs 252G 0 252G 0% /sys/fs/cgroup
/dev/sda3 755G 5.2G 712G 1% /
/dev/sda1 976M 269M 641M 30% /boot
/dev/sdb 2.2T 281G 2.0T 13% /var/lib/nova/instances
tmpfs 51G 0 51G 0% /run/user/0
2) In this second example cld-dfa-gpu-01 has 2 disks: sda used for the OS with a BIOS boot partition (EFI), and nvme0n1p1 with /var/lib/nova/instances mounted on it;
[root@cld-dfa-gpu-01 ~]# fdisk -l
Disk /dev/nvme0n1: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xc6f9ace9
Device Boot Start End Sectors Size Id Type
/dev/nvme0n1p1 2048 3907029167 3907027120 1.8T 83 Linux
Disk /dev/sda: 3.7 TiB, 4000225165312 bytes, 7812939776 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 65536 bytes / 65536 bytes
Disklabel type: gpt
Disk identifier: 0A02235E-3DA4-4F30-942E-EE3AC02107C4
Device Start End Sectors Size Type
/dev/sda1 2048 4095 2048 1M BIOS boot
/dev/sda2 4096 2101247 2097152 1G Linux filesystem
/dev/sda3 2101248 264245247 262144000 125G Linux swap
/dev/sda4 264245248 7812937727 7548692480 3.5T Linux filesystem
[root@cld-dfa-gpu-01 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 79G 0 79G 0% /dev
tmpfs 79G 0 79G 0% /dev/shm
tmpfs 79G 4.1G 75G 6% /run
tmpfs 79G 0 79G 0% /sys/fs/cgroup
/dev/sda4 3.5T 5.1G 3.3T 1% /
/dev/nvme0n1p1 1.8T 422G 1.3T 25% /var/lib/nova/instances
/dev/sda2 976M 231M 679M 26% /boot
tmpfs 16G 0 16G 0% /run/user/0
3) In this third example cld-np-15 has just one disk with multiple partitions, without a BIOS boot partition (so no EFI);
[root@cld-np-15 ~]# fdisk -l
Disk /dev/sda: 1.1 TiB, 1199638052864 bytes, 2343043072 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xc572f99b
Device Boot Start End Sectors Size Id Type
/dev/sda1 * 2048 2099199 2097152 1G 83 Linux
/dev/sda2 2099200 264243199 262144000 125G 82 Linux swap / Solaris
/dev/sda3 264243200 327157759 62914560 30G 83 Linux
/dev/sda4 327157760 2343043071 2015885312 961.3G 5 Extended
/dev/sda5 327159808 2343043071 2015883264 961.3G 83 Linux
[root@cld-np-15 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 63G 0 63G 0% /dev
tmpfs 63G 4.0K 63G 1% /dev/shm
tmpfs 63G 4.0G 59G 7% /run
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/sda3 30G 4.9G 24G 18% /
/dev/sda1 976M 253M 657M 28% /boot
/dev/sda5 946G 175G 723G 20% /var/lib/nova/instances
tmpfs 13G 0 13G 0% /run/user/0
|
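The three questions above (which device backs /var/lib/nova/instances, is there a BIOS boot partition, EFI or legacy boot) can be answered in one go. A small helper sketch; the function name and approach are assumptions of mine, not part of the official procedure:

```shell
# Hypothetical helper: summarize the disk-layout facts needed to pick a
# partition table in Foreman. $1 is the instances path.
disk_summary() {
  local path="${1:-/var/lib/nova/instances}"
  # which filesystem/device backs the instances directory?
  df --output=source "$path" | tail -1
  # is there a "BIOS boot" partition in any partition table?
  fdisk -l 2>/dev/null | grep -i 'BIOS boot' || echo "no BIOS boot partition found"
  # did the running system boot via EFI?
  [ -d /sys/firmware/efi ] && echo "booted via EFI" || echo "legacy BIOS boot"
}
# usage: disk_summary /var/lib/nova/instances
```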
Disable the compute node
Then from one controller node (cld-ctrl-01 or cld-ctrl-02) disable the compute node, so that no new VMs will be instantiated on this compute node:
Code Block |
---|
[root@cld-ctrl-01 ~]# source admin-openrc.sh
[root@cld-ctrl-01 ~]# openstack compute service set --disable cld-np-19.cloud.pd.infn.it nova-compute |
To check if the compute node was actually disabled:
Code Block |
---|
[root@cld-ctrl-01 ~]# openstack compute service list |
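The full service list is long; the single node can also be checked directly with the `--host` and `--service` filters of `openstack compute service list`. A sketch (the helper name is mine; it assumes admin credentials are sourced):

```shell
# Hypothetical helper: succeed only when the node's nova-compute service
# is reported as disabled.
compute_disabled() {
  openstack compute service list --host "$1" --service nova-compute -f value -c Status \
    | grep -qx disabled
}
# usage: compute_disabled cld-np-19.cloud.pd.infn.it && echo "safe to proceed"
```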
Find the VMs instantiated on this compute node
Then find all the virtual machines instantiated on this compute node:
Code Block |
---|
openstack server list --all --host cld-np-19.cloud.pd.infn.it |
Migrate the VMs instantiated on this compute node
Then migrate each VM instantiated on this compute node to other hypervisors.
To avoid migrating the same virtual machine multiple times, please migrate each VM to a compute node that has already been migrated to AlmaLinux9
The migration of a virtual machine usually requires a downtime (~ 15') of that VM, unless the VM was created using a volume. So before migrating the VM you must agree on this operation with the relevant owner.
E.g. to find the owner of a VM with UUID be832766-7996-4ccf-a9c1-5eec2e5f8fd3:
Code Block |
---|
[root@cld-ctrl-01 ~]# openstack server show be832766-7996-4ccf-a9c1-5eec2e5f8fd3 | grep user_id
| user_id | 7681252c430040e7bb96c7c4f7c88464 |
[root@cld-ctrl-01 ~]# openstack user show 7681252c430040e7bb96c7c4f7c88464
+---------------------+----------------------------------+
| Field | Value |
+---------------------+----------------------------------+
| domain_id | default |
| email | Ysabella.Ong@lnl.infn.it |
| enabled | True |
| id | 7681252c430040e7bb96c7c4f7c88464 |
| name | ysaong@infn.it |
| options | {} |
| password_expires_at | None |
+---------------------+----------------------------------+
|
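The two lookups above can be combined into one step. A sketch (the function name is mine; it assumes admin credentials are sourced, and uses `-f value -c` to select single fields) that prints just the owner's e-mail for a given VM UUID:

```shell
# Hypothetical helper: print the e-mail of the owner of a VM, given its UUID.
vm_owner_email() {
  local user_id
  user_id=$(openstack server show "$1" -f value -c user_id)
  openstack user show "$user_id" -f value -c email
}
# usage: vm_owner_email be832766-7996-4ccf-a9c1-5eec2e5f8fd3
```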
To migrate a virtual machine to another compute node, there are two possible procedures:
- Procedure to be used if the VM was created using a volume (the migration procedure doesn't require a downtime)
- Procedure to be used if the VM was created using ephemeral storage (the migration procedure requires a downtime)
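As a sketch only, the per-VM loop could look like the helper below. The name and flow are assumptions of mine: `openstack server migrate --wait` performs a cold migration, so it maps to the ephemeral-storage case; volume-backed VMs would instead follow the no-downtime procedure above.

```shell
# Hypothetical helper: cold-migrate every VM still on a host, one at a time.
# Run it only after agreeing on the downtime with each VM owner.
drain_host() {
  local host="$1"
  for vm in $(openstack server list --all --host "$host" -f value -c ID); do
    echo "migrating $vm"
    openstack server migrate --wait "$vm"   # blocks until the migration completes
  done
}
# usage: drain_host cld-np-19.cloud.pd.infn.it
```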
Check the compute node for orphan VMs
Check whether there are any VMs no longer known to Nova, using the libvirt client:
Code Block |
---|
virsh list --all |
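The comparison can also be done mechanically: for Nova-created domains the libvirt domain UUID coincides with the Nova server UUID, so any domain whose UUID Nova does not list is an orphan. A sketch (the function name and file handling are my own):

```shell
# Hypothetical helper: list libvirt domains whose UUID Nova no longer knows.
# $1 is a file with the Nova instance UUIDs for this host, produced e.g. with:
#   openstack server list --all --host cld-np-19.cloud.pd.infn.it -f value -c ID
find_orphans() {
  local nova_list="$1"
  virsh list --all --name | while read -r dom; do
    [ -n "$dom" ] || continue                        # virsh prints a trailing blank line
    uuid=$(virsh domuuid "$dom" | tr -d '[:space:]')
    grep -q "$uuid" "$nova_list" || echo "orphan: $dom ($uuid)"
  done
}
```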
Reinstall the node with AlmaLinux9
Once there are no more VMs on that compute node, reinstall the node with AlmaLinux9 using foreman (https://cld-config.cloud.pd.infn.it/users/login)
Go to Hosts → All hosts → cld-np-19.cloud.pd.infn.it
Edit the host in this way:
- The hostgroup must be hosts_all
- The Operating system has to be set to AlmaLinux 9.2 (if the Operating system form doesn't appear, check that the "manage host" option is selected. If not, select it)
- The media (dispositivo) has to be set to AlmaLinux
- The Partition table to be used must be one with LVM thin matching the disk structure you noted above. It could be:
- Kickstart default - swap 128GB - LVM thin (one disk with more partitions)
- Kickstart default - swap 128GB - 2 DISKS - LVM thin (two disks: sda with OS and sdb with /var/lib/nova/instances)
- Kickstart default - swap 128GB - 2 DISKS NVME - LVM thin (two disks sda with OS and nvme0n1p1 with /var/lib/nova/instances)
- Kickstart default - swap 128GB - 2 DISKS EFI NVME - LVM thin (two disks sda with OS and a biosboot partition for EFI and nvme0n1p1 with /var/lib/nova/instances)
- Kickstart default - swap 128GB - EFI- LVM thin (one disk with more partitions and a biosboot partition for EFI)
- ....
If the required partition table is missing, a new one can be created in the Foreman web interface: clone the most similar one and change the disk names or sizes, or add an EFI biosboot partition, ....
- Check whether the root password is already set; otherwise set it.
- Check that under "Interfaces" the interfaces eno1 and eno2 have the IPs (management and data)
Save the changes and then build the node. Open a remote console (via https://blade-cld-rmc.lan/cgi-bin/webcgi/login) to reboot the compute node.
When the node restarts after the update, make sure that SELinux is disabled:
Code Block |
---|
[root@cld-np-19 ~]# getenforce
Disabled |
Then do an update of the packages (probably only puppet will be updated):
Code Block |
---|
[root@cld-np-19 ~]# yum clean all
[root@cld-np-19 ~]# yum update -y |
Configure the data network
Once the node has been reinstalled with AlmaLinux9, configure the data network
If the compute node has a dedicated interface for the data LAN (like eno3 on cld-np-19)
Use the correct IP and interface name in the commands below, replacing "eno3" and "192.168.61.129" with yours:
Code Block |
---|
[root@cld-np-19 ~]# nmcli con add type ethernet ifname eno3   # replace eno3 with your interface name
[root@cld-np-19 ~]# nmcli con mod eno3 ipv4.method manual ipv4.addr "192.168.61.129/24"   # use the correct data-network IP
[root@cld-np-19 ~]# nmcli con mod eno3 connection.autoconnect true
[root@cld-np-19 ~]# nmcli con mod eno3 802-3-ethernet.mtu 9000
[root@cld-np-19 ~]# nmcli con up eno3
[root@cld-np-19 ~]# ip link set eno3 mtu 9000 |
If the interface uses tagged networks (VLAN 302 and VLAN 301 on the same interface), use these commands.
Use the correct IP and interface name in the commands below (replace "enp2s0f0.302" and "192.168.61.129"):
Code Block |
---|
[root@cld-nl-24 ~]# nmcli con add type vlan ifname enp2s0f0.302 dev enp2s0f0 id 302
[root@cld-nl-24 ~]# nmcli con mod vlan-enp2s0f0.302 ipv4.method manual ipv4.addr "192.168.61.129/24"
[root@cld-nl-24 ~]# nmcli con up vlan-enp2s0f0.302
[root@cld-nl-24 ~]# nmcli con mod vlan-enp2s0f0.302 802-3-ethernet.mtu 9000
[root@cld-nl-24 ~]# ip link set enp2s0f0.302 mtu 9000 |
The address to be used must be the original one
MTU must be 9000
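To verify that jumbo frames really work end to end on the data network, ping with the don't-fragment flag and a maximum-size payload. The sketch below only prints the command to run (the peer 192.168.61.254 is a placeholder of mine: use another host on the data network); the payload is the MTU minus the 20-byte IP header and 8-byte ICMP header:

```shell
# Largest ICMP payload that fits in a 9000-byte MTU without fragmentation.
mtu=9000
payload=$((mtu - 20 - 8))    # 8972 = 9000 - IP header (20) - ICMP header (8)
echo "ping -M do -s $payload -c 3 192.168.61.254"
```

If the ping fails with "message too long", the MTU is not 9000 somewhere along the path.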
Reinstall OpenManage
If this is a DELL host, reinstall DELL OpenManage, as explained here
Check on Nagios that the relevant checks get green
Configure the node as Openstack compute node using puppet
Then configure the node as Openstack compute node using puppet
Stop puppet:
Code Block | ||
---|---|---|
| ||
systemctl stop puppet |
In foreman move the host under the ComputeNode-Prod_Yoga.el9 hostgroup
Run puppet manually:
Code Block | ||
---|---|---|
| ||
puppet agent -t
|
If the configuration fails reporting
Code Block | ||
---|---|---|
| ||
error: "net.bridge.bridge-nf-call-ip6tables" is an unknown key
error: "net.bridge.bridge-nf-call-iptables" is an unknown key
error: "net.bridge.bridge-nf-call-arptables" is an unknown key |
then please issue:
Code Block | ||
---|---|---|
| ||
modprobe br_netfilter |
and then rerun puppet
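Note that a manual modprobe does not survive a reboot; if puppet does not already manage the module, it can be made persistent via systemd's modules-load.d mechanism. A sketch (the helper name and config path parameter are mine; the default path follows the modules-load.d convention):

```shell
# Hypothetical helper: load br_netfilter now and at every boot
# (systemd reads /etc/modules-load.d/ at startup).
persist_brnf() {
  local conf="${1:-/etc/modules-load.d/br_netfilter.conf}"
  modprobe br_netfilter || true        # may already be loaded
  echo br_netfilter > "$conf"
}
```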
If the procedure terminated without error, enable puppet and reboot the host
Code Block | ||
---|---|---|
| ||
systemctl enable puppet
shutdown -r now |
Enable the nagios check for lv
Log on cld-nagios.cloud.pd.infn.it
Code Block |
---|
cd /etc/nagios/objects/ |
Edit cloudcomputenodes.cfg (or the other file where this compute node is defined) and add a passive check:
Code Block |
---|
define service{
use LVS ; Name of service template to use
host_name cld-np-19
service_description LVS
freshness_threshold 28800
} |
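A passive check like the one above receives its results through Nagios' external command file. As a sketch of the expected format (only printed here; the command-file path /var/spool/nagios/cmd/nagios.cmd is the usual default, verify it in your nagios.cfg):

```shell
# Build a passive service-check result for the LVS service on cld-np-19
# (return code 0 = OK). On the Nagios host you would append it to the
# external command file instead of just printing it.
now=$(date +%s)
result="[$now] PROCESS_SERVICE_CHECK_RESULT;cld-np-19;LVS;0;LVS OK"
echo "$result"       # send with: echo "$result" > /var/spool/nagios/cmd/nagios.cmd
```

The freshness_threshold of 28800 seconds means Nagios expects such a result at least every 8 hours.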
Make sure there are no problems in the configuration:
Code Block |
---|
[root@cld-nagios objects]# nagios -v /etc/nagios/nagios.cfg |
and restart nagios to apply the new check:
Code Block |
---|
[root@cld-nagios objects]# systemctl restart nagios |
Some Checks
- Verify that the checks on Nagios get green (only the "VM network" one should stay in error)
- Verify that ssh from root@cld-log to root@cld-np-19 and to root@cld-np-19.cloud.pd.infn.it works without requiring a password
- Verify that ssh from nagios@cld-nagios to nagios@cld-np-19 and to nagios@cld-np-19.cloud.pd.infn.it works without requiring a password
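The ssh checks above can be run non-interactively: with BatchMode, ssh fails instead of prompting for a password. A sketch (the function name is mine; run it as root on cld-log and as nagios on cld-nagios):

```shell
# Hypothetical helper: verify passwordless ssh to each given host.
check_ssh() {
  for h in "$@"; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" true; then
      echo "$h: ok"
    else
      echo "$h: FAILED"
    fi
  done
}
# usage: check_ssh cld-np-19 cld-np-19.cloud.pd.infn.it
```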
Re-enable the compute node
Code Block | ||
---|---|---|
| ||
openstack compute service set --enable cld-np-19.cloud.pd.infn.it nova-compute
|
Within 24 hours, the "VM network" nagios check should get green
Run the aggregate_manage.sh script to add the node in the relevant hostgroups, e.g.:
Code Block | ||
---|---|---|
| ||
/usr/local/bin/aggregate_manage.sh add cld-np-19.cloud.pd.infn.it INFN |
...
...