
Edit: The below didn't work. Jump to the edit to see the current attempt.

I'm experimenting with where to put these types of blog posts. I've been putting them on my home server at gotosocial.michaeldileo.org, but I'm thinking of moving them over here instead of keeping them on a micro-blogging platform.

Longhorn, the system that manages storage for Keyboard Vagabond, performs regular backups and disaster recovery management. I noticed that over the last few billing cycles, the cost for S3 cloud storage with Backblaze was about $25 higher than expected, and since the last two bills were like this, it's not a fluke.

The cost comes from s3_list_objects: over 5 million calls last month. It turns out this is a common problem that has been discussed on GitHub, Reddit, Stack Overflow, etc. The fix seems to be to simply turn backupstore polling off. It doesn't appear to be required for backups or disaster recovery to work, and Longhorn seems to be doing something very wrong to make this many calls. The relevant part of the config:

...
data:
    default-resource.yaml: |-
        ...
        "backupstore-poll-interval": "0"

My expectation is that future billing cycles should be well under $10/month for storage. The current daily average storage size is 563 GB, which works out to about $3.38/month at Backblaze's $0.006/GB storage rate.

#kubernetes #longhorn #s3 #programming #selfhosting #cloudnative #keyboardvagabond

Edit – the above didn't work (new solution below)

OK, so the network policy did block the external traffic, but it also blocked some internal traffic, which left the pods stuck in a not-ready state. I've been playing around with variations of different ports, but I haven't found a full solution yet. I'll update if I get it resolved. Update: I got it; I had to switch to a CiliumNetworkPolicy.

I also tried changing the polling interval from 0 to 86400, though I think the issue is ultimately how Longhorn makes the calls, so bear this in mind if you use it. Right now I'm toying with the idea of setting a usage cap: since my backups happen after midnight, the gamble is that the cap resets, the backup runs, and then at some point the cap gets hit and further calls fail until the next morning. This might be a bad idea, but it would at least limit my daily expenditure.

One thing to note from what I read in various docs: in Longhorn v1.10.0 the polling configuration variable was removed, since you can set it in the UI. Ultimately, though, this still didn't solve the issue.

I see that yesterday Longhorn made 145,000 Class C requests (s3-list-objects). I found a GitHub issue where someone solved this by setting a network policy that blocks egress outside of backup hours. I had Claude draw up some policies, configurations, and test scripts to monitor and observe the different system states. The catch, though, is that I use FluxCD to maintain state and configuration, so this policy can't be managed by Flux; it would just get reconciled back into place after the CronJob deletes it.

The gist is that a blocking network policy is created manually, then there are two cron jobs: one to delete the policy 5 minutes before backup, and another to recreate it 3 hours later. I'm hoping this will be a solution.
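
For context on the timing, the backup window itself comes from a Longhorn RecurringJob. I'm not pasting my real one here, but a nightly backup job that lands inside the 12:55 AM to 4:00 AM open window would look roughly like this (the name, group, retention, and exact cron line are placeholders, not my actual values):

apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: backup-nightly          # placeholder name
  namespace: longhorn-system
spec:
  task: backup                  # write a backup to the backup target (vs. a local snapshot)
  cron: "0 1 * * *"             # 1:00 AM, after the blocking policy is removed at 12:55 AM
  groups:
  - default                     # apply to volumes in the default group
  retain: 7                     # how many backups to keep (placeholder)
  concurrency: 2                # volumes backed up in parallel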

Edit: I think I finally got it. I had to switch from a NetworkPolicy to a CiliumNetworkPolicy, since Cilium is what I'm using (duh?). Using toEntities: kube-apiserver fixed a lot of issues. Here's what I have below: the blocking network policy and the CronJobs that remove and re-create it. I still have a billing cap in place for now; all volumes backed up after the daily reset, so I'm going to keep the cap for a few days and then decide whether to remove it. I now at least feel better about being a good citizen and not hammering APIs unnecessarily.

---
# NetworkPolicy: Blocks S3 access by default
# This is applied initially, then managed by CronJobs below
# Using CiliumNetworkPolicy for better API server support via toEntities
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: longhorn-block-s3-access
  namespace: longhorn-system
  labels:
    app: longhorn
    purpose: s3-access-control
spec:
  description: "Block external S3 access while allowing internal cluster communication"
  endpointSelector:
    matchLabels:
      app: longhorn-manager
  egress:
    # Allow DNS to kube-system namespace
    - toEndpoints:
      - matchLabels:
          k8s-app: kube-dns
      toPorts:
      - ports:
        - port: "53"
          protocol: UDP
        - port: "53"
          protocol: TCP
    # Explicitly allow Kubernetes API server (critical for Longhorn)
    # Cilium handles this specially - kube-apiserver entity is required
    - toEntities:
      - kube-apiserver
    # Allow all internal cluster traffic (10.0.0.0/8)
    # This includes:
    # - Pod CIDR: 10.244.0.0/16
    # - Service CIDR: 10.96.0.0/12 (API server already covered above)
    # - VLAN Network: 10.132.0.0/24
    # - All other internal 10.x.x.x addresses
    - toCIDR:
      - 10.0.0.0/8
    # Allow pod-to-pod communication within the cluster
    # Cilium CIDR rules generally don't match Cilium-managed (in-cluster) destinations,
    # so this entity rule is what actually keeps instance-manager pods reachable
    - toEntities:
      - cluster
    # Block all other egress (including external S3 like Backblaze B2)
---
# RBAC for CronJobs that manage the NetworkPolicy
apiVersion: v1
kind: ServiceAccount
metadata:
  name: longhorn-netpol-manager
  namespace: longhorn-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: longhorn-netpol-manager
  namespace: longhorn-system
rules:
- apiGroups: ["cilium.io"]
  resources: ["ciliumnetworkpolicies"]
  # kubectl apply needs patch as well, in case the policy still exists when the job runs
  verbs: ["get", "create", "delete", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: longhorn-netpol-manager
  namespace: longhorn-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: longhorn-netpol-manager
subjects:
- kind: ServiceAccount
  name: longhorn-netpol-manager
  namespace: longhorn-system
---
# CronJob: Remove NetworkPolicy before backups (12:55 AM daily)
# This allows S3 access during the backup window
apiVersion: batch/v1
kind: CronJob
metadata:
  name: longhorn-enable-s3-access
  namespace: longhorn-system
  labels:
    app: longhorn
    purpose: s3-access-control
spec:
  # Run at 12:55 AM daily (5 minutes before earliest backup at 1:00 AM Sunday weekly)
  schedule: "55 0 * * *"
  successfulJobsHistoryLimit: 2
  failedJobsHistoryLimit: 2
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: longhorn-netpol-manager
        spec:
          serviceAccountName: longhorn-netpol-manager
          restartPolicy: OnFailure
          containers:
          - name: delete-netpol
            image: bitnami/kubectl:latest
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - |
              echo "Removing CiliumNetworkPolicy to allow S3 access for backups..."
              kubectl delete ciliumnetworkpolicy longhorn-block-s3-access -n longhorn-system --ignore-not-found=true
              echo "S3 access enabled. Backups can proceed."
---
# CronJob: Re-apply NetworkPolicy after backups (4:00 AM daily)
# This blocks S3 access after the backup window closes
apiVersion: batch/v1
kind: CronJob
metadata:
  name: longhorn-disable-s3-access
  namespace: longhorn-system
  labels:
    app: longhorn
    purpose: s3-access-control
spec:
  # Run at 4:00 AM daily (gives 3 hours 5 minutes for backups to complete)
  schedule: "0 4 * * *"
  successfulJobsHistoryLimit: 2
  failedJobsHistoryLimit: 2
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: longhorn-netpol-manager
        spec:
          serviceAccountName: longhorn-netpol-manager
          restartPolicy: OnFailure
          containers:
          - name: create-netpol
            image: bitnami/kubectl:latest
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - |
              echo "Re-applying CiliumNetworkPolicy to block S3 access..."
              kubectl apply -f - <<EOF
              apiVersion: cilium.io/v2
              kind: CiliumNetworkPolicy
              metadata:
                name: longhorn-block-s3-access
                namespace: longhorn-system
                labels:
                  app: longhorn
                  purpose: s3-access-control
              spec:
                description: "Block external S3 access while allowing internal cluster communication"
                endpointSelector:
                  matchLabels:
                    app: longhorn-manager
                egress:
                # Allow DNS to kube-system namespace
                - toEndpoints:
                  - matchLabels:
                      k8s-app: kube-dns
                  toPorts:
                  - ports:
                    - port: "53"
                      protocol: UDP
                    - port: "53"
                      protocol: TCP
                # Explicitly allow Kubernetes API server (critical for Longhorn)
                - toEntities:
                  - kube-apiserver
                # Allow all internal cluster traffic (10.0.0.0/8)
                - toCIDR:
                  - 10.0.0.0/8
                # Allow pod-to-pod communication within the cluster
                # (entity rule needed; Cilium CIDR rules don't cover in-cluster destinations)
                - toEntities:
                  - cluster
                # Block all other egress (including external S3)
              EOF
              echo "S3 access blocked. Polling stopped until next backup window."
 