Web Interface Control Panel Failing When Upgrading OpenShift 3.7 to 3.9

I’ve been running through upgrades to OpenShift, starting from 3.5 and working up to 3.9. OpenShift only allows upgrading one minor release at a time, so we’ve had to go from 3.5 to 3.6, then from 3.6 to 3.7, and so on. Each of these upgrades has been plagued by various bugs or problems along the way.

Going from OpenShift 3.7 to 3.9 (technically, the upgrade installs 3.8 first, then 3.9 – oh joy!) resulted in a problem while the Ansible upgrade playbook was running against the master nodes:

ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade_control_plane.yml


Change in Web Interface Console in OpenShift 3.9

To back up for a quick second: OpenShift 3.9 moves the web console out of the master itself and into its own pods. These pods live in the newly created openshift-web-console namespace.
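If you want to poke around the new setup yourself, everything lives in that namespace. A quick sketch, assuming your user has at least view access there:

oc -n openshift-web-console get deployment,svc,pods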

Our web interface disappeared, and the upgrade started failing with the following error on the task named:
“TASK [openshift_web_console : Verify that the web console is running]”

FAILED - RETRYING: Verify that the web console is running (58 retries left).Result was: {
    "attempts": 3,
    "changed": false,
    "cmd": [
        "curl",
        "-k",
        "https://webconsole.openshift-web-console.svc/healthz"
    ],
    "delta": "0:00:01.016992",
    "end": "2018-04-26 14:15:53.303304",
    "invocation": {
        "module_args": {
            "_raw_params": "curl -k https://webconsole.openshift-web-console.svc/healthz",
            "_uses_shell": false,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": false
        }
    },
    "msg": "non-zero return code",
    "rc": 7,
    "retries": 61,
    "start": "2018-04-26 14:15:52.286312",
    "stderr": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed connect to webconsole.openshift-web-console.svc:443; Connection refused",
    "stderr_lines": [
        "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current",
        "                                 Dload  Upload   Total   Spent    Left  Speed",
        "",
        "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed connect to webconsole.openshift-web-console.svc:443; Connection refused"
    ],
    "stdout": "",
    "stdout_lines": []
}
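The failing task is simply curling the console’s health endpoint over the cluster’s service network, so you can reproduce the check by hand from a master node. curl exit code 7 means it couldn’t connect at all:

# Same check the Ansible task performs; run from a master node
curl -k https://webconsole.openshift-web-console.svc/healthz
# curl: (7) Failed connect to webconsole.openshift-web-console.svc:443; Connection refused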

Basically, this is being caused by a TLS problem on the relevant pods. Let’s trace it down, starting with the hostname in that error.

webconsole.openshift-web-console.svc means we should be looking at the “webconsole” Service found in the “openshift-web-console” namespace/project.
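That hostname follows the standard Kubernetes service DNS convention of <service>.<namespace>.svc. A quick sketch for confirming which ClusterIP the health check is actually hitting, assuming your current context can see the namespace:

oc -n openshift-web-console get svc webconsole -o jsonpath='{.spec.clusterIP}{"\n"}'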

[root@osm104 ~]# oc get projects
NAME                                DISPLAY NAME         STATUS
......
openshift-web-console                                    Active
......

There’s the new namespace just created by the Ansible installer script. Let’s check out the service details.

[root@osm104 ~]# oc project openshift-web-console 
Now using project "openshift-web-console" on server "https://somewhere:8443".

[root@osm104 ~]# oc describe svc webconsole 
Name:              webconsole
Namespace:         openshift-web-console
Labels:            app=openshift-web-console
Annotations:       kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"prometheus.io/scheme":"https","prometheus.io/scrape":"true","service.alpha.openshift.io...
                   prometheus.io/scheme=https
                   prometheus.io/scrape=true
                   service.alpha.openshift.io/serving-cert-secret-name=webconsole-serving-cert
                   service.alpha.openshift.io/serving-cert-signed-by=openshift-service-serving-signer@1481905612
Selector:          webconsole=true
Type:              ClusterIP
IP:                172.30.153.175
Port:              https  443/TCP
TargetPort:        8443/TCP
Endpoints:         
Session Affinity:  None
Events:            
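Note the empty Endpoints line above: nothing is backing the service yet, which is exactly why curl gets connection refused. You can check that directly as well; a quick sketch, and the endpoints will stay empty until a pod passes its readiness probe:

oc -n openshift-web-console get endpoints webconsole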

Ok, not too much else to see there. What about the pods themselves?

[root@osm104 ~]# oc get pods
NAME                          READY     STATUS             RESTARTS   AGE
webconsole-56c6745c85-9vfzr   0/1       CrashLoopBackOff   10         8m
webconsole-56c6745c85-kjtqj   0/1       CrashLoopBackOff   10         8m
webconsole-56c6745c85-xxqf9   0/1       CrashLoopBackOff   9          9m

Well, that doesn’t look good. Let’s take a peek at their logs.

[root@osm104 ~]# oc logs webconsole-56c6745c85-9vfzr 
Error: unable to load server certificate: open /var/serving-cert/tls.crt: permission denied
Usage:
  origin-web-console [flags]
Flags:
      --alsologtostderr                                log to standard error as well as files
      --audit-log-format string                        Format of saved audits. "legacy" indicates 1-line text format for each event. "json" indicates structured json format. Requires the 'AdvancedAuditing' feature gate. Known formats are legacy,json. (default "json")
      --audit-log-maxage int                           The maximum number of days to retain old audit log files based on the timestamp encoded in their filename.
      --audit-log-maxbackup int                        The maximum number of old audit log files to retain.
      --audit-log-maxsize int                          The maximum size in megabytes of the audit log file before it gets rotated.
      --audit-log-path string                          If set, all requests coming to the apiserver will be logged to this file.  '-' means standard out.
      --audit-policy-file string                       Path to the file that defines the audit policy configuration. Requires the 'AdvancedAuditing' feature gate. With AdvancedAuditing, a profile is required to enable auditing.
      --audit-webhook-batch-buffer-size int            The size of the buffer to store events before batching and sending to the webhook. Only used in batch mode. (default 10000)
      --audit-webhook-batch-initial-backoff duration   The amount of time to wait before retrying the first failed requests. Only used in batch mode. (default 10s)
      --audit-webhook-batch-max-size int               The maximum size of a batch sent to the webhook. Only used in batch mode. (default 400)
      --audit-webhook-batch-max-wait duration          The amount of time to wait before force sending the batch that hadn't reached the max size. Only used in batch mode. (default 30s)
      --audit-webhook-batch-throttle-burst int         Maximum number of requests sent at the same moment if ThrottleQPS was not utilized before. Only used in batch mode. (default 15)
      --audit-webhook-batch-throttle-qps float32       Maximum average number of requests per second. Only used in batch mode. (default 10)
      --audit-webhook-config-file string               Path to a kubeconfig formatted file that defines the audit webhook configuration. Requires the 'AdvancedAuditing' feature gate.
      --audit-webhook-mode string                      Strategy for sending audit events. Blocking indicates sending events should block server responses. Batch causes the webhook to buffer and send events asynchronously. Known modes are batch,blocking. (default "batch")
      --config string                                  filename containing the WebConsoleConfig
      --contention-profiling                           Enable lock contention profiling, if profiling is enabled
      --enable-swagger-ui                              Enables swagger ui on the apiserver at /swagger-ui
      --log-flush-frequency duration                   Maximum number of seconds between log flushes (default 5s)
      --log_backtrace_at traceLocation                 when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                                 If non-empty, write log files in this directory
      --logtostderr                                    log to standard error instead of files (default true)
      --profiling                                      Enable profiling via web interface host:port/debug/pprof/ (default true)
      --stderrthreshold severity                       logs at or above this threshold go to stderr (default 2)
  -v, --v Level                                        log level for V logs
      --vmodule moduleSpec                             comma-separated list of pattern=N settings for file-filtered logging
F0426 18:24:10.372121       1 console.go:35] unable to load server certificate: open /var/serving-cert/tls.crt: permission denied

Permission denied? That doesn’t seem right. The path /var/serving-cert/tls.crt probably comes from a Secret.

[root@osm104 ~]# oc get secrets 
NAME                         TYPE                                  DATA      AGE
.........
webconsole-serving-cert      kubernetes.io/tls                     2         11m
............

It does! Ok, so what’s likely happening here is that the secret is being mounted into the pods, but with the wrong permissions set in the webconsole Deployment.

[root@osm104 ~]# oc get deployment -o yaml
skip-a-bunch-of-stuff
        volumes:
        - name: serving-cert
          secret:
            defaultMode: 400
            secretName: webconsole-serving-cert
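For the record, you can also pull just that field without wading through the full YAML. A sketch using jsonpath:

oc -n openshift-web-console get deployment webconsole \
  -o jsonpath='{.spec.template.spec.volumes[?(@.name=="serving-cert")].secret.defaultMode}{"\n"}'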

Yup, the defaultMode is set to 400. Why it shipped that way I don’t know, but 440 makes more sense: the web console container doesn’t run as the owner of those files, so the root group needs read access to the certificate. So change that to this…

        volumes:
        - name: serving-cert
          secret:
            defaultMode: 440
            secretName: webconsole-serving-cert
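As an aside, and as far as I can tell, the confusion here is that defaultMode written without a leading zero is a plain decimal integer: 400 is really file mode 0620 (no group read at all), while 440 is really 0670, which restores group read. A quick way to sanity-check the conversion:

# Decimal-to-octal conversion of the two defaultMode values
printf '%o\n' 400 440
# 620
# 670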

With that change in place, the pods should restart and the console should come back up.
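To confirm, watch the pods cycle back to Running and re-run the same health check the playbook uses; it should now answer instead of refusing the connection:

# Watch the webconsole pods come back up
oc -n openshift-web-console get pods -w

# Re-run the health check the upgrade playbook performs
curl -k https://webconsole.openshift-web-console.svc/healthz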

Caveat… of course

Every time you run the master node installer script, it’s going to rewrite the webconsole Deployment configuration. You’ll need to perform the edit while Ansible is stuck on TASK [openshift_web_console : Verify that the web console is running]. It retries 60 times, which is enough time if you’re quick. Once it sees the web interface is OK, your install will continue.

Update

If you want to let your Ansible installer run and don’t want to sit there waiting for it to get to the task so you can apply the change, try running this from another terminal window while the Ansible script runs:

while(true); do kubectl patch deployment webconsole --patch '{"spec": {"template": {"spec": {"volumes": [{"name": "serving-cert","secret": {"defaultMode":440}}]}}}}'; sleep 30; done
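A couple of notes on that loop: the strategic merge patch matches the serving-cert volume by name, so only defaultMode gets touched, and re-applying it once the value is already 440 is harmless. If your kubeconfig context isn’t already pointed at the openshift-web-console project, a variant with the namespace spelled out (a sketch) avoids patching the wrong place; kill the loop once the installer moves past the verify task:

while true; do
  oc -n openshift-web-console patch deployment webconsole \
    --patch '{"spec": {"template": {"spec": {"volumes": [{"name": "serving-cert","secret": {"defaultMode":440}}]}}}}'
  sleep 30
done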