
2023-07-11: AI weights, Learning eBPF workshop, tracepusher for CI/CD Observability, Cost efficiency with Kepler and krr, KubeShark, Flux 2.0.0

Thanks for reading the web version. You can subscribe to the Ops In Dev newsletter to receive it in your mail inbox.

👋 Hey, lovely to see you again

Hope you are having a great time wherever you are reading this newsletter, and finding time to relax and take a vacation. In Germany, summer has reached 34 degrees Celsius (93.2 Fahrenheit), making my brain melt. Things have been quite busy in June and July - Cloudland 2023 was a fantastic community event where we learned about AI, eBPF, and Observability together in birds-of-a-feather sessions, workshops, and talks. My new talk "Observability for Efficient DevSecOps Pipelines" uses a slow CI/CD pipeline that can be analyzed and optimized step by step. tracepusher is one of many great tools mentioned for pipeline tracing, next to discussions about better cost efficiency, optimizing resource usage, leveraging cloud-native technology such as eBPF, and how AI can help - a culmination of my learnings, shared in this newsletter with you :)

🌱 The Inner Dev learning ...

This month's learning list:

🐝 The Inner Dev learning eBPF

For Cloudland 2023, I created a new workshop, "Learning eBPF for better Observability", with new exercises. Building on Liz Rice's great book "Learning eBPF", the workshop adds more challenges to the exercises. The workshop project provides a Lima VM (macOS), a Vagrant environment, or a standalone Ansible role to provision an Ubuntu 23 VM with the required build tools and Git repository clones. The slides guide you through the workshop environment and are structured into the following learning path:

  1. eBPF first steps (Learning strategy, bcc: trace execve, bcc discussion, bpftrace, bpftrace: open calls, bpftrace use cases)
  2. eBPF development (languages, architecture, CO-RE, eBPF Verifier, Kernel code: count packets with XDP, maps, load kernel programs)
  3. Development libraries (XDP with Go, XDP with Rust)
  4. eBPF for Observability (eBPF exporter for Prometheus, Prometheus metrics, OpenTelemetry, Profiling, Kubernetes with Coroot)
  5. Additional content: eBPF for SRE&DevOps, eBPF for Security, eBPF for Everyone; Automation, CI/CD, DevSecOps

All learning resources are documented on o11y.love.
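
If you want a first taste before setting up the full workshop environment, here is a minimal sketch in the spirit of the bcc "trace execve" exercise from step 1. It is not taken from the workshop itself; it assumes a Linux host with bcc and its Python bindings installed (for example python3-bpfcc on Ubuntu) and root privileges.

```python
#!/usr/bin/env python3
# Minimal bcc sketch: print a trace line whenever any process calls execve().
# The trace_pipe output already shows the calling task name and PID.
from bcc import BPF

prog = r"""
int trace_execve(void *ctx) {
    bpf_trace_printk("execve called\n");
    return 0;
}
"""

b = BPF(text=prog)
# Resolve the architecture-specific syscall symbol and attach a kprobe to it.
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_execve")

print("Tracing execve() calls... Ctrl-C to stop.")
b.trace_print()
```

The workshop exercises go further, for example comparing this bcc approach with an equivalent bpftrace one-liner and discussing the trade-offs.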

Quick wins:

🤖 The Inner Dev learning AI/ML

Stack Overflow said "Self-healing code is the future of software development", and I could not agree more - not only because code and architecture complexity keeps increasing, but also because the entire DevOps lifecycle has evolved fast. Dependabot/Renovate help with dependency automation, and drift detection in GitOps workflows helps prevent deployment problems. A little help from AI with self-healing in every stage of the DevOps lifecycle is the next step.

OpenAI announced the Code Interpreter plugin alpha, an experimental ChatGPT-4 model with a working Python interpreter in a sandboxed execution environment and ephemeral disk space. It will be interesting to see how code execution helps with calculations, generating graph plots from data arrays, or even creating ephemeral remote development workspaces in some way with AI ... This Twitter thread dives into more examples of Code Interpreter usage.

My colleague Gabriel Engel asked me to review a new blog post about GPU-enabled GitLab SaaS Runners to empower ModelOps and HPC workloads. At first, I did not understand how GitLab Runners would be able to access graphics processing units (GPUs) for training machine learning models. After some research, I found this great article about "How Docker Runs Machine Learning on NVIDIA GPUs, AWS Inferentia, and Other Hardware AI Accelerators". It helped me understand that NVIDIA GPU hardware on a Linux host can be accessed by installing the CUDA drivers (NVIDIA Container Toolkit) into a Docker image, which can then be used in GitLab CI/CD pipeline jobs. The blog post provides hands-on examples for training the first models in CI/CD, which is fascinating at its best - looking behind the curtain of how technology is built (similar to how I want to understand eBPF for innovation and risks).
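
As a trivial illustration of what such a CI/CD job could run before training starts, here is a hedged Python sketch that checks whether a CUDA device is actually visible inside the container. It assumes a CUDA-enabled image with PyTorch installed; the blog post itself goes much further and trains real models.

```python
# gpu_check.py - fail fast if no CUDA device is visible inside the CI job's
# container. Assumes a CUDA-enabled base image with PyTorch installed.
import sys

import torch


def main() -> int:
    if not torch.cuda.is_available():
        print("No CUDA device visible - check the runner and image configuration.")
        return 1

    count = torch.cuda.device_count()
    print(f"CUDA is available, {count} device(s) found:")
    for idx in range(count):
        print(f"  [{idx}] {torch.cuda.get_device_name(idx)}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Running a check like this as the first step of a GPU job surfaces runner or image misconfigurations early, before any expensive training time is spent.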

I'm also investing time into learning with AI-assisted MR summaries when someone sends a review request, and into learning a new programming language using Code Suggestions, both available with GitLab Duo. My other AI research topics include cost efficiency and observability with AI, and pipeline efficiency to fix failing jobs fast with a little help from AI.

Quick wins:

πŸ‘οΈ Observability

tracepusher is a new Open Source project that provides a Python script to send traces to OpenTelemetry. The script can be run standalone on the command line or integrated into GitLab CI/CD pipelines. For the latter, the CI/CD configuration needs to be extended with a preparation job that generates a unique trace ID for the pipeline run, before_script and after_script default definitions for all jobs that generate span IDs and emit the trace spans to OpenTelemetry with the tracepusher script, and a post-pipeline job that finalizes the trace.
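
To demystify what "emit the trace spans to OpenTelemetry" means on the wire, here is a rough Python sketch that sends a single span to a collector via OTLP/HTTP JSON, which is the general idea behind tracepusher. It is not tracepusher's actual code; the collector endpoint, service name, span name, and timings are illustrative assumptions.

```python
# Sketch: send one span to an OpenTelemetry collector via OTLP/HTTP JSON.
# Not tracepusher's implementation - just the same underlying idea.
import json
import secrets
import time
import urllib.request

# Assumption: a collector with the OTLP HTTP receiver listening on port 4318.
OTLP_ENDPOINT = "http://localhost:4318/v1/traces"

trace_id = secrets.token_hex(16)  # 32 hex chars, shared by all spans of a pipeline
span_id = secrets.token_hex(8)    # 16 hex chars, unique per job/span

end_ns = time.time_ns()
start_ns = end_ns - 30 * 1_000_000_000  # pretend the job ran for 30 seconds

payload = {
    "resourceSpans": [{
        "resource": {"attributes": [{
            "key": "service.name",
            "value": {"stringValue": "slow-pipeline"},  # illustrative name
        }]},
        "scopeSpans": [{
            "spans": [{
                "traceId": trace_id,
                "spanId": span_id,
                "name": "build-job",  # illustrative span/job name
                "kind": 1,            # SPAN_KIND_INTERNAL
                "startTimeUnixNano": str(start_ns),
                "endTimeUnixNano": str(end_ns),
            }]
        }],
    }]
}

req = urllib.request.Request(
    OTLP_ENDPOINT,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print("Collector responded with HTTP", resp.status)
```

In the real integration, the preparation job generates the trace ID once and hands it to all jobs (for example via a dotenv artifact), while the before_script/after_script hooks capture each job's start and end times for the spans.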

You can create a quick demo setup with an OpenTelemetry collector that accepts HTTP messages using the docker-compose setup in this project. You can then follow my public learning curve implementing tracepusher into the slow pipeline project in this MR. I also created upstream pull requests and issues, including the emphasis on the CI-tagged image, jobs that specify a custom image, workarounds for the before_script/after_script override problem, and jobs with needs not downloading dotenv artifacts by default, which caused new trace IDs to be generated.

Tracepusher from GitLab CI/CD to OpenTelemetry, slow pipeline analysis

"How Observability Changed My (Developer) Life" is a great read about the expansion of metrics and logs, and how traces and spans help tackle the complexity of microservices architectures. It also highlights observability data costs and the need for sampling with the OpenTelemetry Collector or Honeycomb Refinery. The article "Monitoring is a Pain" takes a different view and sheds light on the lack of helpful standards in logging, the complexity of metrics architectures, and the expectation that tracing should be enabled and work out of the box. A great reminder to increase project adoption and contributions.

Julius Volz continued his amazing learning series "Prometheus Monitoring with Julius":

Quick wins:

🛡️ DevSecOps

After Red Hat announced that the RHEL source code will only be accessible through the Red Hat Customer Portal, responses in the community came quickly from binary-compatible RHEL clones such as RockyLinux and AlmaLinux.

Some additional context: Access to the exact same source RPM versions is required to provide binary-compatible rebuilds. With Red Hat providing a source code copy and build files (.spec files, etc.) in the public CentOS Git repository, automated rebuilds were much easier than in earlier years, when one had to download the SRPM files from public FTP mirrors. CentOS historically has been the open-source distribution compatible with RHEL. Some years ago, Red Hat acquired CentOS and changed its release process into a rolling release. This allows Red Hat to build RHEL from CentOS Stream instead of a frozen Fedora upstream version. Community concerns inspired RockyLinux and AlmaLinux to step into CentOS's place. With access to the RHEL source code removed, rebuilding becomes harder, and whether it remains legal is something lawyers will have to analyze.

The source code access change also breaks the software supply chain, especially for security fixes that are developed for or by Red Hat and may reach the public community later (or too late). This week, Oracle added concerns about compatibility issues with Oracle Linux after 9.2. SuSE went one step further and announced a fork of the publicly available RHEL sources, planning to invest $10+ million while continuing its investments in SuSE Linux Enterprise (SLE). Hopefully folks find a way to collaborate, and drive the open source spirit again.

Quick wins:

🌀️ Cloud Native

Cost efficiency and sustainable resource usage are important topics. The maintainers of Kepler, a CNCF sandbox project, thought about sustainable environments and power consumption observability for Kubernetes clusters. They started a Prometheus metrics exporter that uses eBPF to probe performance counters and other system stats, and processes the data with ML models to estimate workload energy consumption. These projects will help with sustainable computing efforts to reduce CO2 emissions and can also help with pipeline efficiency strategies:

  • Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe energy-related system stats and exports them as Prometheus metrics.
  • PEAKS (Power Efficiency Aware Kubernetes Scheduler) uses metrics exported by Kepler to help the Kubernetes scheduler improve energy efficiency by placing Pods on optimal nodes.
  • CLEVER (Container Level Energy-efficient VPA Recommender) uses metrics exported by Kepler to recommend resource profiles to the Vertical Pod Autoscaler, improving the energy efficiency of running workloads.

I suggest subscribing to updates and trying the projects in your production environments.
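
If you already scrape Kepler with Prometheus, a small script against the Prometheus HTTP API is a quick way to get a first feel for the energy data. A hedged sketch follows: the Prometheus URL is an assumption, and the metric and label names are based on Kepler's container energy counters, so verify them against your Kepler version.

```python
# Sketch: ask a Prometheus server that scrapes Kepler for the namespaces that
# consumed the most energy over the last hour.
# Assumptions: Prometheus at localhost:9090, Kepler exposing the counter
# kepler_container_joules_total with a container_namespace label (verify both).
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # adjust to your environment
QUERY = "sum by (container_namespace) (increase(kepler_container_joules_total[1h]))"

url = f"{PROMETHEUS_URL}/api/v1/query?{urllib.parse.urlencode({'query': QUERY})}"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

results = data["data"]["result"]
for item in sorted(results, key=lambda r: float(r["value"][1]), reverse=True):
    namespace = item["metric"].get("container_namespace", "<unknown>")
    joules = float(item["value"][1])
    print(f"{namespace:30s} {joules:12.1f} J over the last hour")
```

The same query works as a starting point for Grafana panels or alerting rules once you trust the numbers.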

Quick wins:

  • Robusta released a new tool to find over/under-provisioned resources in a Kubernetes cluster: krr. It describes itself as Prometheus-based Kubernetes Resource Recommendations.
  • Daniele Polencic shared 20 Kubernetes learning threads in 20 weeks - fantastic content to learn new topics, or verify existing knowledge. Daniele also provides more Kubernetes learning content on learnk8s.io.
  • Oldschool published a great read about GitLab Runners on Kubernetes, diving into observability and scalability requirements, how they managed to avoid the deprecated docker-machine autoscaling mechanism, how upgrades and rollbacks avoid duplicating worker pools, and debugging strategies for incidents.
  • Jan Schürmann spontaneously shared a KubeShark demo at Cloudland 2023 and showed how they use it in production for analyzing traffic in Kubernetes.

📚 Tools and tips for your daily use

🔖 Book'mark

🎯 Release speed-run

Flux v2.0.0

Flux v2.0.0 comes with the promotion of the GitOps-related APIs to v1 and adds horizontal scaling & sharding capabilities to Flux controllers. The Git bootstrap capabilities provided by the Flux CLI and by the Flux Terraform Provider are now considered stable and production-ready. Starting with this version, the build, release, and provenance portions of the Flux project supply chain provisionally meet SLSA Build Level 3. Flux GA is fully integrated with Kubernetes Workload Identity for AWS, Azure, and Google Cloud to facilitate passwordless authentication to OCI sources (container images, OCI artifacts, Helm charts). The Flux alerting capabilities have been extended with PagerDuty and Google Pub/Sub support. The improved Alert v1beta2 API provides better control over event filtering and allows users to enrich alerts with custom metadata.

Parca v0.20.0 improves support for short-lived processes, adds support for unwinding mixed stacks, and brings many more performance improvements. Jaeger Tracing v1.47.0 supports Prometheus normalization for OpenTelemetry metrics. Grafana 10 helps visualize real-time dashboards with a new trend panel, while time region support is provided with the canvas panel and time series panel. Dashboards as code got improvements with a revamped version of the Grafonnet library and a Terraform provider for composing dashboards.

Trivy v0.43.0 adds scanning for development dependencies, the ability to skip services in cloud scanning, and support for Kubernetes bills of materials, Kubernetes private registries, and Yarn workspaces. GitLab 16.1 brings a new navigation experience, Kubernetes resource visualization, service account authentication, job artifact management, and more asynchronous CI/CD workflows with needs in rules. Hurl 4.0.0 improves error reporting in CI/CD, adds new filters for decode and xpath, and adds a request waterfall timeline to the HTML reports.

Docker Desktop 4.21 adds support for new Wasm runtimes, including Fermyon Spin. Go 1.21 RC added Wasm and WASI support (example project).

🎥 Events and CFPs

👋 CFPs due soon

Looking for more CfPs?

🎤 Shoutouts

Julia Evans for creating a fun exercise to implement DNS in a weekend.

🌐

Thanks for reading! If you are viewing the website archive, make sure to subscribe to stay in the loop!

See you next month - let me know what you think on LinkedIn, Twitter, Mastodon.

Cheers,

Michael

PS: If you want to share items for the next newsletter, please check out the contributing guide - tag me in the comments, send me a DM or submit this form. Thanks!