GitLab cloud-native Helm chart migration
Scheduled Maintenance Report for renkulab
Postmortem

Post Mortem Report: GitLab Migration from Omnibus Docker Image to Helm Chart in Kubernetes

Date: 16.08.2023

Prepared by: Wes Johnson

Introduction

This report details the GitLab migration process, which involved transitioning from the GitLab Omnibus Docker image to the GitLab cloud-native Helm chart on the renkulab.io Kubernetes cluster. The migration also moved the data to a new PostgreSQL instance that is shared by GitLab and Renku but deployed independently of both.

Background

The initial setup used the GitLab Omnibus Docker image, deployed as part of the renkulab.io Renku instance. As the platform grew, the need for a more scalable and manageable GitLab deployment became apparent, leading to the decision to migrate to the GitLab Helm chart.

Objectives

  1. Migrate GitLab from its Omnibus Docker image to the GitLab cloud-native Helm chart.
  2. Transition to a new shared PostgreSQL database instance.

Methodology

The chosen method for this migration was to back up the current GitLab instance onto a Kubernetes volume and then restore it. Items in S3 object storage - such as LFS, registry, uploads, and artifacts - were not backed up, as the new instance could be reconnected to the existing S3 buckets.

Execution

Backup

  1. The command gitlab-backup create SKIP=lfs,uploads,artifacts,registry was used to back up the running GitLab Omnibus instance, skipping the data already held in S3 object storage.
  2. The backup files, including configurations, database, repositories, and other related data, were then stored in a dedicated Kubernetes volume.
  3. The entire database instance was also backed up and stored on a dedicated Kubernetes volume.
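
The backup steps above can be sketched as a small dry-run script. This is a minimal sketch under stated assumptions, not the procedure the team ran: the namespace renku, the pod names gitlab-omnibus-0 and postgresql-0, the /backup mount path, and the use of pg_dumpall for the instance-level dump are all hypothetical. Only the gitlab-backup create invocation is taken from this report.

```shell
#!/bin/sh
# Dry-run sketch of the backup steps. Every name below is hypothetical
# except the gitlab-backup command line, which is quoted from the report.
set -eu

NAMESPACE="renku"              # assumed namespace
GITLAB_POD="gitlab-omnibus-0"  # assumed Omnibus pod name
PG_POD="postgresql-0"          # assumed PostgreSQL pod name
BACKUP_DIR="/backup"           # assumed mount point of the backup volume
DRY_RUN="${DRY_RUN:-1}"        # 1 = only print commands (default)

# Print the command in dry-run mode, otherwise execute it.
# Note: the printed form loses shell quoting; it is for review only.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Application-level backup, skipping data that already lives in S3.
backup_gitlab() {
  run kubectl exec -n "$NAMESPACE" "$GITLAB_POD" -- \
    gitlab-backup create SKIP=lfs,uploads,artifacts,registry
}

# Instance-level dump of every database (assumes pg_dumpall is suitable).
backup_database_instance() {
  run kubectl exec -n "$NAMESPACE" "$PG_POD" -- \
    sh -c "pg_dumpall -U postgres > $BACKUP_DIR/instance.sql"
}

backup_gitlab
backup_database_instance
```

With the default DRY_RUN=1 the script only prints the two commands, so they can be reviewed before a maintenance window; setting DRY_RUN=0 would execute them.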

Setup New Infrastructure

  1. A new PostgreSQL instance was set up and configured to be shared with other applications. The backup of the entire database instance was restored onto the new instance, and the gitlabhq_production database was dropped and recreated to ensure the new GitLab Helm chart deployment had a fresh database.
  2. The GitLab Helm chart was deployed on the Kubernetes cluster.

Restoration

  1. The command backup-utility --restore -f file:///backup-restore/1690792282_2023_07_31_14.10.5_gitlab_backup.tar was used to restore the backup file from the Kubernetes volume into the new GitLab Helm chart deployment.
  2. renkulab.io was redeployed, pointing to the newly migrated GitLab instance.
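
The restore step can be sketched in the same dry-run style. The namespace gitlab and the pod name gitlab-toolbox-0 are hypothetical, and the report does not say how the command was invoked. The backup-utility flags and the backup file path are the ones quoted above.

```shell
#!/bin/sh
# Dry-run sketch of the restore step. Namespace and pod name are
# hypothetical; the backup-utility flags and path come from the report.
set -eu

NAMESPACE="gitlab"              # assumed namespace
TOOLBOX_POD="gitlab-toolbox-0"  # assumed name of the pod that runs backup-utility
DRY_RUN="${DRY_RUN:-1}"         # 1 = only print commands (default)

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Turn an absolute path into the file:// URL form passed to -f.
backup_url() {
  case "$1" in
    /*) printf 'file://%s\n' "$1" ;;
    *)  echo "backup path must be absolute: $1" >&2; return 1 ;;
  esac
}

# Restore a backup tarball that is already present inside the pod.
restore_gitlab() {
  url="$(backup_url "$1")"
  run kubectl exec -n "$NAMESPACE" "$TOOLBOX_POD" -- \
    backup-utility --restore -f "$url"
}

restore_gitlab /backup-restore/1690792282_2023_07_31_14.10.5_gitlab_backup.tar
```

Rejecting relative paths up front avoids starting a long restore job that would fail only once it tries to open the file.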

Challenges Faced

  • Size of data: Because of the size of the renkulab.io GitLab instance, backing up and restoring it took longer than expected.
  • Maintenance window communication: The maintenance window had only been publicised on Discourse, and it was not announced on the Renkulab Statuspage. This meant that visitors to the site before the migration had no warning of the extended maintenance window.
  • Maintenance page: For extended periods during the maintenance window, anyone who navigated to renkulab.io was presented with a 404 error that gave no indication that a maintenance window was active.
  • Interrupted restore: A GitLab restore attempt was interrupted while the maintenance page was being published, so the environment had to be cleaned and the restore started again. This cost valuable time and could have caused more severe issues if the interruption had hit a more critical phase of the restore job.
  • LFS objects modification: CI jobs modified LFS objects during the maintenance window. This led to the creation of dangling files, making some files unavailable in certain projects.
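
The data-size challenge is largely an arithmetic one, and a rough estimate can be made before the window is announced. The sketch below is illustrative only: the function name estimate_minutes and the example figures (500 GiB, 50 MiB/s, 50% safety margin) are hypothetical and are not measurements from this migration.

```shell
#!/bin/sh
# Rough duration estimate for copying or restoring a data set.
# All figures used with this function are hypothetical inputs.

# estimate_minutes SIZE_GIB THROUGHPUT_MIB_PER_S SAFETY_PERCENT
# Prints the estimated duration in whole minutes, rounded up.
estimate_minutes() {
  size_gib="$1"
  mib_per_s="$2"
  safety_pct="$3"
  if [ "$mib_per_s" -le 0 ]; then
    echo "throughput must be positive" >&2
    return 1
  fi
  # Raw transfer time in seconds (integer arithmetic).
  seconds=$(( size_gib * 1024 / mib_per_s ))
  # Add the safety margin, e.g. 50 means 1.5x the raw time.
  padded=$(( seconds * (100 + safety_pct) / 100 ))
  # Round up to whole minutes.
  echo $(( (padded + 59) / 60 ))
}

estimate_minutes 500 50 50   # prints 256
```

A backup followed by a restore needs at least two such passes, so the window should cover the sum of both estimates.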

Outcome

Post-migration, the GitLab UI and API on the renkulab.io Kubernetes cluster were noticeably more responsive, giving users a smoother experience.

Lessons Learned

  1. Anticipate Larger Data Sets: When dealing with a migration of a significant scale, it's essential to factor in additional time for backing up and restoring data.
  2. Communication is Key: Ensuring all platform users are made aware of planned maintenance is crucial. Using multiple channels, such as the Renkulab Statuspage and Discourse, helps reach a wider audience and ensures no one is caught off-guard.
  3. Maintain Clear User Notifications During Downtime: When a service is down for maintenance, it's essential to provide clear messaging for users. The presence of a 404 error without context can be confusing and alarming for users. A dedicated maintenance page with information about the ongoing work and expected duration can greatly improve user experience during downtimes.
  4. Ensure Uninterrupted Migration Phases: When executing critical operations, such as data restoration, ensure that no other tasks (like updating maintenance pages) are being performed simultaneously to prevent interruptions that could compromise the integrity of the migration.
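
One way to apply lesson 4 is a simple lock that other operator tasks must take before they touch the environment. The sketch below is an assumption about how this could be done, not what was used during the migration: it only helps if every task is started through the hypothetical with_lock wrapper on the same host, and the lock path is arbitrary.

```shell
#!/bin/sh
# Minimal mutual-exclusion sketch for maintenance tasks.
# Assumes all tasks run on one host and go through with_lock.

LOCK_DIR="${LOCK_DIR:-${TMPDIR:-/tmp}/gitlab-restore.lock}"  # hypothetical path

# mkdir is atomic: it fails if the directory already exists.
acquire_lock() {
  if mkdir "$LOCK_DIR" 2>/dev/null; then
    return 0
  fi
  echo "another task holds $LOCK_DIR; refusing to run" >&2
  return 1
}

release_lock() {
  rmdir "$LOCK_DIR"
}

# with_lock CMD [ARGS...]: run CMD only while holding the lock,
# release the lock afterwards, and return CMD's exit status.
with_lock() {
  acquire_lock || return 1
  status=0
  "$@" || status=$?
  release_lock
  return "$status"
}

with_lock echo "restore would run here"
```

A task such as publishing the maintenance page would be wrapped in the same way, so it fails fast with a clear message instead of interrupting a restore that is already running.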

Conclusion

The transition of GitLab from its Omnibus Docker image to the Helm chart on the renkulab.io Kubernetes cluster was completed successfully. The challenges that arose from data size and communication gaps provided lessons for future maintenance windows and other infrastructure projects. The updated setup offers better scalability and manageability and makes fuller use of Kubernetes. The more responsive GitLab UI and API further underline the success of this migration.

Posted Aug 16, 2023 - 15:08 CEST

Completed
The scheduled maintenance has been completed.
Posted Aug 06, 2023 - 01:33 CEST
In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted Aug 03, 2023 - 10:31 CEST
Scheduled
We will migrate the GitLab instance to a different Helm chart that enables better performance and maintainability.
During that time, user sessions will be blocked from starting and any running user session will be stopped.
Posted Aug 03, 2023 - 10:25 CEST
This scheduled maintenance affected: Renkulab web UI, Knowledge Graph, GitLab, Renkulab sessions, and Loud.