Commits · ffde165d215ff2c628354d8b79048e83bc0b9400 · Sovereign Cloud Stack / openstack-health-monitor

Oct 26, 2023

Add user systemd unit and tmux start scripts. (#142) · ffde165d

Kurt Garloff authored 1 year ago


* Add user systemd unit and tmux start scripts.
  Even with some docu.
* Improve docu.
* Improve shutdown logic.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

ffde165d

Oct 07, 2023

Feat/add az wave (#140) · 13bc802b

Kurt Garloff authored 1 year ago


* Delete keypair by name if creation failed.

An already existing keypair by that name is the most common reason for
the failure, so cleaning up just in case is a good idea.
Of course, there should not have been a left-over key ...

* Also look for left-over keypairs in regular cleanups.

* Better output for KEYPAIR creation.

* Use all AZs for wavestack (default).

* 1.92: Use az hints for VM networks.

This should allow neutron to optimize things a bit.
The VMs in a network are all in one AZ, so let neutron create that
network in the right AZ. Can be disabled by passing NAZS=" ".

Also call the one and only NET_JH and not NET_JH0.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

13bc802b

Jul 04, 2023

Report VM that LB could not create a member for. (#136) · 79bfa6d3
Kurt Garloff authored 1 year ago
```
Signed-off-by: Kurt Garloff <kurt@garloff.de>
```
Tag: v6.0.0

79bfa6d3

Fix/use timeout (#135) · 338df07b

Kurt Garloff authored 1 year ago


* Use timeout binary rather than self-coded mech.

* Also use diskless on gxscs.

* Wrapper mytimeout for timeout.

This is needed as myopenstack is a shell function and cannot
be called by timeout.

* Also fix timeout detection ($? >= 124).

timeout returns 124 for a normal timeout.
Previously, we expected > 128 (128+SIGNUM).

Signed-off-by: Kurt Garloff <kurt@garloff.de>

338df07b
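
The timeout handling described in #135 could look roughly like the sketch below. It is an illustration only, not the code from the repository: `mytimeout` is reimplemented here under the assumption of bash and GNU coreutils `timeout`, and `slowcall` and `is_timeout` are hypothetical names (`slowcall` stands in for a shell function such as myopenstack).

```shell
#!/bin/bash
# timeout(1) can only execute binaries, not shell functions. This sketch
# exports the function and runs it through a child bash instead.
mytimeout() {
  local secs="$1" fn="$2"
  shift 2
  export -f "$fn"
  timeout "$secs" bash -c '"$@"' _ "$fn" "$@"
}

# timeout exits with 124 when the time limit was hit; a command killed by a
# signal is reported as 128+SIGNUM. Checking >= 124 covers both cases
# (and, as a side effect, also 125..127).
is_timeout() {
  [ "$1" -ge 124 ]
}

# Hypothetical stand-in for a slow shell function.
slowcall() {
  sleep "$1"
  echo done
}
```

Calling `mytimeout 1 slowcall 5` should return 124 after about one second, which `is_timeout` then classifies as a timeout.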

Fix/upd flavors (#134) · 5853a198

Kurt Garloff authored 1 year ago


* Use diskless flavors in drivers for PCO, GXSCS, WAVE.


Signed-off-by: Kurt Garloff <kurt@garloff.de>

5853a198

Jul 03, 2023

Feat/diskless flavor support (#133) · a443b3a8

Kurt Garloff authored 1 year ago


* Add logic to patch openstackclient to support --block-device.
* Determine whether we NEED_BLKDEV and use it.
* 1.90. Fix syntax errors.
* Translate --block-device args from nova to openstack.
* Use nova boot workaround if openstackclient is very old.
* Default to Ubuntu 22.04 img and diskless flavors.
* Avoid sed script to patch more than a line.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

a443b3a8

Jun 29, 2023

Feat/avoid keygen (#132) · 7c7dfb65

Kurt Garloff authored 1 year ago


* Generate ssh key with ssh-keygen (ed25519).
   Generation of key pairs in openstack is deprecated; so we do it
   locally and upload the pubkey.
* Use new filename for keypair everywhere.
* Make keypair type configurable, default rsa.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

7c7dfb65
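
A minimal sketch of the approach from #132, assuming bash, OpenSSH `ssh-keygen` and the `openstack` CLI. The function names, the key comment and the file paths are hypothetical and not taken from the repository.

```shell
#!/bin/bash
# Generate a key pair locally. The key type is a parameter, since the
# commit makes it configurable; paths and the comment are placeholders.
gen_keypair() {
  local keyfile="$1" keytype="${2:-ed25519}"
  rm -f "$keyfile" "$keyfile.pub"
  ssh-keygen -q -t "$keytype" -N "" -C "healthmon-sketch" -f "$keyfile"
}

# Upload only the public half; the private key stays on the local host.
upload_pubkey() {
  local keyfile="$1" name="$2"
  openstack keypair create --public-key "$keyfile.pub" "$name"
}
```

A run would then be something like `gen_keypair ./testkey rsa && upload_pubkey ./testkey my-keypair`, where both names are made up for this example.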

alignment with new plusserver brand guidelines (#131) · 23699f07
Ralf Heiringhoff authored 1 year ago
```
Signed-off-by: Ralf Heiringhoff <ralf.heiringhoff@plusserver.com>
```
23699f07

Mar 14, 2023

Allow setting a CA certificate. (#130) · 01bccae8

Kurt Garloff authored 2 years ago


* Allow setting a CA certificate or passing other params.

export OS_EXTRA_PARAMS="--os-cacert FILE.crt" can be set if needed.
(This allows running against a CiaB system.)

We only use it when talking to the endpoints directly.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

01bccae8

Mar 07, 2023

Fix container image release workflow (#129) · 28f1086d

Christian Berendt authored 2 years ago


* replace redundant Containerfile with a symlink
* use OPENSTACK_VERSION instead of VERSION for the
  used OpenStack version
* use Zed instead of Xena

Signed-off-by: Christian Berendt <berendt@osism.tech>

28f1086d

Bump version to 1.88. · 3ba96eb3
Kurt Garloff authored 2 years ago
```
Signed-off-by: Kurt Garloff <kurt@garloff.de>
```
Tag: v5.0.pre1

3ba96eb3
Use new v2 flavor names as default flavors. (#126) · efb98ea9
Kurt Garloff authored 2 years ago
```
Signed-off-by: Kurt Garloff <kurt@garloff.de>
```
efb98ea9

Feat/allow select extnet2 (#125) · b0d855b9

Kurt Garloff authored 2 years ago


* Allow selecting external net (there may be several).

This is done by exporting EXTSEARCH to have a way to select
a working external network.

* Document existence of EXTSEARCH.

* Improve comment as per @berendt's suggestion.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

b0d855b9
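
How EXTSEARCH is evaluated inside the script is not shown in this log. The sketch below only illustrates the general idea of selecting one external network by a search pattern and falling back to the first one; `pick_extnet` is a hypothetical helper.

```shell
#!/bin/bash
# Reads one network name per line on stdin and prints the first match.
# An empty pattern matches every line, so the first network is chosen.
pick_extnet() {
  local pattern="${1:-}"
  grep -- "$pattern" | head -n 1
}

# Possible use (not executed here):
#   openstack network list --external -f value -c Name | pick_extnet "$EXTSEARCH"
```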

Cleanup LBs when Octavia times out. (#124) · fb2e4f8a

Kurt Garloff authored 2 years ago


In gx-scs currently, the Loadbalancer service is extremely
slow. When creating the LB instance (with two amphorae),
we time out in the creation API call (which does NOT wait
for them to become active). So we end up testing without
the LB test. However, the LBs eventually come up; we do not
expect this and fail to clean them up. So more and more LBs
are hanging around until the 200 iterations are over.
As a consequence, we also fail to clean up networks.

Change this: In cleanup, if LBs are enabled at all, but
there were errors, look for LBs and try to delete them.
So we stop collecting them and also avoid failed network
and router cleanup.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

fb2e4f8a

Feb 17, 2023

Feat/lb provider flavor (#118) · baaa6edb

Kurt Garloff authored 2 years ago


* Support other LB providers (ovn) with -LP.

* Allow passing loadbalancer provider via -LP <PROVIDER>
* This results in a TCP loadbalancer with algo SOURCE_IP_PORT
  (which is the only algo that the ovn provider supports)
* When configuring members, we actually had omitted the --subnet-id
  parameter previously, as it was unneeded and led to a failure
  with token authentication.
  This needed changing, as the member subnets need to be set
  explicitly when using the OVN provider.

Still: It does not (yet?) work.
Maybe having the VIP (for the listener) and the member
ports in different networks is not supported?

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* LB member subnet is JHSUBNET[0].

Explicitly setting the --subnet-id for amphorae to the backend
member actually breaks the LB's connection to the backend members.
So we set it to the subnet from the VIP; as we access the LB via
the floating IP, the request does indeed originate from the LB's
VIP address in the VIP subnet (which we set to the JHSUBNET[0]).

When using ovn provider, this makes the LB work IF the requests
come from a host in the JHSUBNET, but not from the floating IP.
So currently, this breaks.

Sidenote: health-monitor is not supported for ovn provider pools.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* SG in VM needs to allow port 80 from 0/0 for OVN.

The requests come from the real client IP (which can be the internet,
i.e. 0/0) and not the LB's virtual IP. Thus adjust the security
group to allow for it.

The OVN provider does not support the health monitor, unfortunately.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* Adjust max cycle time if we don't kill LB members.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* Improve warning when LB backend kills are skipped.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* Set the member's subnet to the backend server subnets.

This causes the health-monitor to detect members with error operating
status.
It does correctly take them out of the rotation when accessed via the
VIP, unfortunately not via the Floating IP though.
https://bugs.launchpad.net/neutron/+bug/1956035



The api_monitor.sh does thus report a few errors on each iteration.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

---------

Signed-off-by: Kurt Garloff <kurt@garloff.de>

baaa6edb

Jan 10, 2023

If there are multiple external nets, use the first. (#117) · b88340f3

Kurt Garloff authored 2 years ago


Previously, we tried to use several, confusing numerous followup
commands.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

Signed-off-by: Kurt Garloff <kurt@garloff.de>

b88340f3

Dec 09, 2022

Rearrange dashboards in errors and bench sections. (#115) · e3abf038
Kurt Garloff authored 2 years ago
```
Signed-off-by: Kurt Garloff <kurt@garloff.de>
```
e3abf038

Feat/python stats (#114) · 3288b8fe

Kurt Garloff authored 2 years ago


* Use python to calc percentiles.

This is a performance issue -- doing math in bash can easily take
several seconds when we have API calls from a day. For longer runtimes
things get worse ...
So pass the stats to a little python script.

* Bump version to 1.86.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

3288b8fe
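
The idea of handing the math to Python can be sketched as below. This is not the script added in #114; `percentile` is a hypothetical helper, python3 is assumed to be installed, and the index rule is a simple floor-based choice (other percentile definitions interpolate between samples).

```shell
#!/bin/bash
# Reads whitespace-separated numbers on stdin and prints the value at the
# requested percentile (first argument, 0..100).
percentile() {
  python3 -c '
import sys
vals = sorted(float(x) for x in sys.stdin.read().split())
if not vals:
    sys.exit(1)
pct = float(sys.argv[1])
# Floor-based index, clamped to the last element.
idx = min(len(vals) - 1, int(len(vals) * pct / 100))
print(vals[idx])
' "$1"
}
```

Sorting and indexing a day's worth of API timings this way is a single process start, instead of a loop of arithmetic in bash.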

Re-add executable permissions. (#113) · dcfde4f3

Kurt Garloff authored 2 years ago


No idea why it got lost in commit b971b193 (from PR #107).

Signed-off-by: Kurt Garloff <kurt@garloff.de>

dcfde4f3

When outputting codes for cols, we need echo -e. (#112) · eaf62499
Kurt Garloff authored 2 years ago
```
Cosmetic. This was missing from ed808bba.

Signed-off-by: Kurt Garloff <kurt@garloff.de>
```
eaf62499

Fix/obj upload name (#110) · 959543dc

Kurt Garloff authored 2 years ago


* Upload files to swift containers without path in name.

* Fix APIMonitor log file upload trigger.

We had the wrong filenames in mind, as we use a separate $DATADIR now,
so the check failed.
Also be more verbose when uploading.

* Fix double DATADIR.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

959543dc

Dec 01, 2022

Fix/lb port cleanup (#108) · 2f573ac8

Kurt Garloff authored 2 years ago


* Fix port cleanup for leftover load-balancers.

The grep expression previously did not match, causing port deletion
for octavia-lb ports in our subnet to be skipped.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* Also cleanup leftover octavia ports in cleanup.

There, we also need to shift the security group deletion until after we
have cleaned up load balancer ports, as LB ports could have security
groups assigned.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* Hit all octavia ports not just -vrrp.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* Enable LBs in CLEANUP call.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* Do octavia port cleanup unconditionally.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

Signed-off-by: Kurt Garloff <kurt@garloff.de>

2f573ac8

Medium old OSC reports Networks in a diff format. (#109) · 9a952c55

Kurt Garloff authored 2 years ago


* Medium old OSC reports Networks in a diff format.

So OSC-5.5.0 reports (openstack server list -f json):
[
  {
    "ID": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "Name": "HealthMon-Host",
    "Networks": {
      "kd500924-SCS-healthmonitor-network": [
        "192.168.0.17",
        "78.138.66.252"
      ]
    }
  }
]
whereas older version OSC-5.3.1 reports
[
  {
    "ID": "XXXXXXXXXXXXXXXXXXXXXXXXX",
    "Name": "testkg2",
    "Networks": "testkg2=192.168.33.95, 31.172.117.55"
  }
]

Adjust IP parser to handle both ...

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* A bit more debugging info (commented out).

Signed-off-by: Kurt Garloff <kurt@garloff.de>

Signed-off-by: Kurt Garloff <kurt@garloff.de>

9a952c55
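
A parser that accepts both Networks layouts could look like the following sketch. It is not the parser from the script: `server_ips` is a hypothetical helper, python3 is assumed to be available, and the assumption that several networks are separated by ";" in the string layout is mine.

```shell
#!/bin/bash
# Reads `openstack server list -f json` output on stdin and prints
# "<server name> <ip> <ip> ..." per server, for both Networks layouts.
server_ips() {
  python3 -c '
import json, sys
for srv in json.load(sys.stdin):
    nets = srv["Networks"]
    if isinstance(nets, dict):
        # Dict layout (OSC-5.5.0): {"netname": ["ip", ...]}
        ips = [ip for lst in nets.values() for ip in lst]
    else:
        # String layout (OSC-5.3.1): "netname=ip, ip"
        ips = [ip.strip() for part in nets.split(";")
               for ip in part.split("=", 1)[-1].split(",")]
    print(srv["Name"], *ips)
'
}
```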

Fix/assign extnet earlier (#107) · b971b193

Kurt Garloff authored 2 years ago


* Handle neutron not reporting device_id.

We currently match ports by looking for the VM UUID in the port's
device_id. This no longer seems to work reliably on PCO; the field
is not returned by the port list command.
So look for the IP address reported from server list and match it
with the ports to determine port ID. Sort of ugly, but there is no
straightforward way to get port <-> VM relationships without looking
at port details or matching both subnet and IP address.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* Simplify old code path.

We already know that we only go here if we're using the legacy tooling.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* Assign external net to router before creating VMs.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* Revert "Assign external net to router before creating VMs."

This reverts commit b0a1991f.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* Assign external net to router before creating VMs.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* Clean up router when external-net-list fails.

Also, ignore when setting external gateway fails.
(The reason is that this is not required on OTC and the error is
thus harmless.)

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* Bump version number to 1.85.

The change (assigning the external net early) is significant enough
that we want to see from the version number whether or not we have it.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

Signed-off-by: Kurt Garloff <kurt@garloff.de>

b971b193

Nov 30, 2022

Feat/new collect port (#106) · fb63fb23

Kurt Garloff authored 2 years ago


* Handle neutron not reporting device_id.

We currently match ports by looking for the VM UUID in the port's
device_id. This no longer seems to work reliably on PCO; the field
is not returned by the port list command.
So look for the IP address reported from server list and match it
with the ports to determine port ID. Sort of ugly, but there is no
straightforward way to get port <-> VM relationships without looking
at port details or matching both subnet and IP address.

* Simplify old code path.

We already know that we only go here if we're using the legacy tooling.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

fb63fb23
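
Matching a port by IP address instead of device_id can be sketched as follows. The input format is simplified for illustration (one "<port-id> <ip>" pair per line), and `port_for_ip` is a hypothetical helper, not a function from the script.

```shell
#!/bin/bash
# Input on stdin: one "<port-id> <ip>" pair per line.
# Prints the ID of the first port whose IP equals $1 exactly.
port_for_ip() {
  awk -v ip="$1" '$2 == ip { print $1; exit }'
}
```

The exact string comparison matters: a plain substring match for 192.168.0.1 would also hit 192.168.0.17.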

Feat/add lbconn grafana (#102) · ed808bba

Kurt Garloff authored 2 years ago


* Report timing and errors of LB connections to grafana.

* 3 digits LBconn time measurement.

* Add LBconn to dashboard.

* Change range for bench graph to include LBConn vals.

* Multiply LB dur by ten, so graphs in grafana align better.

* Range for bench chart 0.5--32.

We may have iperf measurements below 1 (Gbps), so extend scale a bit
downwards.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

ed808bba

Only wait for JHPORT b/f assigning FIP if needed. (#101) · 6b38ab4d

Kurt Garloff authored 2 years ago


* Only wait for JHPORT b/f assigning FIP if needed.

This was required on OTC, and we can still force the waiting by setting
the FIPWAITPORTDEVOWNER environment.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

* Fix syntax for new FIPWAITPORTDEVOWNER. Set for OTC.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

6b38ab4d

Nov 28, 2022

Improve dashboard setup documentation. (#104) · f03233cf

Kurt Garloff authored 2 years ago


* Improve dashboard setup documentation.
* Some additional hints for setting up the monitoring VM.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

f03233cf

Nov 23, 2022

Use small openSUSE image, using SCS-1L:1:5 flavor. (#105) · 3c2bdce6

Kurt Garloff authored 2 years ago

My own 4GiB openSUSE image nicely fits the smallest standardized
SCS flavor: SCS-1L:1:5. This reduces the resource consumption
for the OpenStack health monitor in the PCO environment.

Link to the image:
https://kfg.images.obs-website.eu-de.otc.t-systems.com/



Signed-off-by: Kurt Garloff <kurt@garloff.de>

Signed-off-by: Kurt Garloff <kurt@garloff.de>

3c2bdce6

Nov 20, 2022
- Error reporting had an echo command missing in error case. (#103) · 6ae102a5
  Kurt Garloff authored 2 years ago
  
  Signed-off-by: Kurt Garloff <kurt@garloff.de>
  6ae102a5
Nov 16, 2022

Assign Floating IPs earlier after JHVM creation. (#100) · b7e9491a

Kurt Garloff authored 2 years ago


This is an attempt to work around a situation where the FIP assignment
that happens during the JumpHost VM booting and downloading packages
(triggered by cloud-init userdata) does disrupt communication for a
while.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

Signed-off-by: Kurt Garloff <kurt@garloff.de>

b7e9491a

Oct 24, 2022

Fix/redelete lb (#97) · 8acba405

Kurt Garloff authored 2 years ago


* Retry removing LB after waitdel.

When JH creation fails, we clean up things in reverse order.
The LB is then typically still in PENDING_CREATE, which means
we can't delete it yet. Waiting for it to vanish then won't work
either.
If that happens, we should try to delete it again and wait
again.

This addresses the issue observed in #96.

Signed-off-by: Kurt Garloff <scs@garloff.de>

* Wait shorter b/f retrying LB delete.

Also for the implementation to work, we need to unset LBDSTATS
before deleting again (otherwise the array grows larger than the
list of resources, causing the wait loop to not break once all
resources are gone).
When retrying, we wait 5s rather than 2s now, increasing the chances
that re-deleting actually works.

This fixes #96.

Signed-off-by: Kurt Garloff <scs@garloff.de>

* Bump version to 1.83.

Note to previous commit: Restoring LBAASS from DELLBAASS was
key to make second calls to deleteLBs and waitdelLBs actually
do something.

This belongs to #96.

Signed-off-by: Kurt Garloff <scs@garloff.de>

Signed-off-by: Kurt Garloff <scs@garloff.de>
Co-authored-by: chschilling <c.schilling@gmx.net>

8acba405
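
The delete, wait, retry pattern from #97 can be sketched generically as below. This is not the deleteLBs/waitdelLBs code from the script; `delete_with_retry` and its parameters are hypothetical, and the caller supplies the names of the delete and wait functions.

```shell
#!/bin/bash
# delete_fn and wait_fn are caller-supplied function names.
# wait_fn must return 0 once the resources are gone.
delete_with_retry() {
  local delete_fn="$1" wait_fn="$2" retries="${3:-2}" pause="${4:-5}"
  local attempt
  for ((attempt = 1; attempt <= retries; attempt++)); do
    # The delete may fail while the LB is still in PENDING_CREATE.
    "$delete_fn" || true
    if "$wait_fn"; then
      return 0
    fi
    # Pause before the next attempt, but not after the last one.
    if [ "$attempt" -lt "$retries" ]; then
      sleep "$pause"
    fi
  done
  return 1
}
```

Any state that the wait function accumulates between calls (the commit mentions an array that has to be unset) needs to be reset by the caller before each new attempt.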

Fix wait time reporting in waitResources. (#99) · 985fca42

Kurt Garloff authored 2 years ago


Only the resources waited for in waitlistResources were properly
reported to telegraf/influx/grafana; there was a typo in the ones waited
for in waitResources. This caused waitJHPORT not to be visible in
grafana. Fixed.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

Signed-off-by: Kurt Garloff <kurt@garloff.de>

985fca42

Oct 21, 2022

Allow NOT supplying nameservers to subnets. (#98) · 4cedc174

Kurt Garloff authored 2 years ago


This results in defaults from your cloud being used, which often is
a good choice. Use this for the wavestack cloud.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

Signed-off-by: Kurt Garloff <kurt@garloff.de>

4cedc174

Sep 27, 2022

Feat/new dashboard screenshots (#95) · 9a517b00

Kurt Garloff authored 2 years ago


* Use current screenshot in dashboard/README.md

Screenshot courtesy of gx-scs environment from PlusServer.

* Document how to interpret the dashboards.

... using an example from gx-scs.

* Mention no filters.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

9a517b00

Sep 19, 2022

Feat/update usage (#94) · 6181d4b1

Kurt Garloff authored 2 years ago


* Add pointer to dashboard dir, update usage output.

* Fixup typos.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

6181d4b1

Sep 15, 2022

Avoid removing lb-vrrp ports from wrong subnets. (#92) · dde03df1

Kurt Garloff authored 2 years ago


Our logic to filter only for our own subnets was broken.
It now works.

Signed-off-by: Kurt Garloff <kurt@garloff.de>

Signed-off-by: Kurt Garloff <kurt@garloff.de>

dde03df1

Add iperf3 gauge in top line. (#93) · 5cf80999

Kurt Garloff authored 2 years ago


We might otherwise overlook things that are significantly wrong ...

Signed-off-by: Kurt Garloff <kurt@garloff.de>

Signed-off-by: Kurt Garloff <kurt@garloff.de>

5cf80999

Sep 09, 2022
- Finetune panel sizes and ranges. · 1eec7bb9
  Kurt Garloff authored 2 years ago
  
  Signed-off-by: Kurt Garloff <kurt@garloff.de>
  1eec7bb9
Sep 05, 2022
- Only clean FIPs once. · 40729b36
  Kurt Garloff authored 2 years ago
  
  Signed-off-by: Kurt Garloff <kurt@garloff.de>
  40729b36
