Programming

Software development in general

Thoughts of a functionally oriented CSharp dev creating an FSharp web app. Feedback appreciated.

I am currently writing a web site/application in FSharp. I could absolutely do it more quickly in a website builder, most likely, but I've been wanting to write FSharp for a while. I've done bits and pieces in coding challenges and I've been writing C# in a functional style ever since getting familiar with FSharp and functional programming. As such, I'm using this as an opportunity to learn. A lot of things look great on their promo pages, but how is it really to use in an IDE or editor, with authorization, logging, metrics, timeouts, etc?

Right now the portion that I've been spending too much time in is the data access layer of a media file upload. The basic idea so far, which I don't know is absolutely correct but how I did it years ago, was a request is made for an item or a batch that will generate metadata and ids, which are then used on the file upload. The file upload will update the Media db entry, save to disk, publish an event, etc.

But on the data access layer, I've been using both raw Dapper and DbFun. I tried a few other libraries and currently have my eyes on SqlHydra. The two FSharp projects have certain features that I like, namely:

Compile time type checking
They use TypeProviders, like FSharp's version of SourceGenerators, to read a schema and validate your runtime query. Thus, in a test, you can run any part of a module and the whole file will be read and fail to build. FSharp's tooling isn't very great, unfortunately. In Rider, you can set up your IDE with a connection string to the db and it will check your sql for you. It's not compile time, but it is something. This doesn't work with FSharp.

Separation of the Query and Execution
I'm not really a fan of the Repository Pattern. In a lot of ways, it makes sense, but it isn't really OO like it may appear at first, and it gets bloated very quickly with a new function for almost every variation. Then, what if you need multiple methods run within a transaction, but you're already creating a context in the method, then you need to change everything.

The functional way of doing this is to separate the data for the DbQuery and the running of the query. So, DbFun does this via partially applied function (each parameter can be applied later), like so (with explicit typing).

Note: the QueryBuilder is something that you can conifigure and register in DI on startup with your type mappings and such.

let findByUserId (userId:UserId) (query:QueryBuilder) : IConnection -> Async<User option> = 
    query.Sql<UserId, User option>("SELECT * FROM users WHERE id = @id",
	          "id",
		   Results.Optional) userId // assuming the value of userId can be mapped. The result mapping name may not be quite right, but that's the idea

So this function is a few things. The .Sql is typed and the types could be inferred, but it takes your query text, parameters, result mapping, and returns a function that takes the connection to execute it, or the runner.

The downside for my use case is that Async is the version of FSharp tasks that predates async/await in C#, but the current HttpHandlers on the api only take Tasks and there is some overhead in converting between them. Not the end of the world, but if it's unnecessary, I'd prefer to try something else first.

I'd like a way to get the generated DbQuery object and put that through a runner that returns a Task, but that is not the way this works.

So, for now I have a repository with some DbFun methods and some Dapper methods, but that signature is already hella bloated. The repository also handles instrumentation and other things with DI, but some things have gotten repetitive.

type IMediaRepository =
    abstract member FindAsync: id:IdentityId * fileName:MediaSystemFileName -> Async<MediaResponse option>
    abstract member FindTask : id:IdentityId * fileName:MediaSystemFileName * ct:CancellationToken -> Task<MediaResponse option>
    abstract member FindTask : slug:Slug * mediaId:MediaId * ct:CancellationToken -> Task<MediaResponse option>
    abstract member FindTask : id:IdentityId * mediaId:MediaId * ct:CancellationToken -> Task<MediaResponse option>
    abstract member FindTask: slug:Slug * filename:MediaSystemFileName * ct:CancellationToken -> Task<MediaResponse option>
    abstract member ListTask: slug:Slug * ct:CancellationToken -> Task<MediaResponse System.Collections.Generic.List>
    abstract member ListTask: id:IdentityId * ct:CancellationToken -> Task<MediaResponse System.Collections.Generic.List>
    abstract member ListTask: id:IdentityId * mediaIds:MediaId seq * ct:CancellationToken -> Task<MediaResponse System.Collections.Generic.List>
    abstract member UpsertTask: MediaUpsertRequest * ct:CancellationToken -> Task<MediaResponse>
    abstract member UpsertManyTask: mediaUpserts:MediaUpsertRequest seq * ct:CancellationToken -> Task<MediaResponse System.Collections.Generic.List>
    abstract member FindByOriginalFileNameTask : id:IdentityId * originalFileName:MediaOriginalFileName * ct:CancellationToken -> Task<MediaResponse option>
    abstract member SetUploadStatusTask: mediaId:MediaId * status:MediaUploadStatus * ct:CancellationToken -> Task<bool>

Every function has this structure:

member this.FindByOriginalFileNameTask(id:IdentityId, originalFileName:string, ct) =
            // withDbActivity is a helper function that I wrote
            let instrument = withDbActivity logger (nameof(findByIdentityIdAndOriginalFileNameQuery)) (Some findByIdentityIdAndOriginalFileNameSql)
            instrument (fun () -> task {
                let conn = connectionFactory()
                let guidValue = match id with | IdentityId guid -> guid
                let! res = conn.QuerySingleOrDefaultAsync<MediaResponse>(
                    CommandDefinition(findByIdentityIdAndOriginalFileNameSql, {| id = guidValue.ToString(); originalFileName = originalFileName |}, cancellationToken = ct))
                return Option.ofObj res
            })

My preference would be a bit more separation and to have an instrumented runner to handle the execution. But also, right now I'm doing the Dapper way of writing the Sql. The pros are that it's really fast to run since there's nothing the generate. The cons are maintainability. So, if a generated query could be cached and applied with new parameters, that'd be great, but it isn't always so easy when working with frameworks.

Another thing with generation is that I prefer to use upsert methods where I can, which in Postgresql gets really long:

INSERT INTO media.media(
    "id", "user_id", "slug", "system_file_name", "original_file_name", "full_uri", "is_deleted", "media_type", "sort_order", "note", "upload_status", "version")
VALUES (@Id, @UserId, @Slug, @SystemFileName, @OriginalFileName, @FullUri, @IsDeleted, @MediaType, @SortOrder, @Note, @DefaultStatus, @DefaultVersion)
ON CONFLICT (id) DO UPDATE SET
    "user_id" = @UserId,
    "slug" = @Slug,
    "original_file_name" = @OriginalFileName,
    "full_uri" = @FullUri,
    "is_deleted" = @IsDeleted,
    "media_type" = @MediaType,
    "sort_order" = @SortOrder,
    "note" = @Note,
    "version" = ... --weird version logic that I don't like  
RETURNING *; -- return the whole record's current state

The ON CONFLICT (UNIQUE KEY) DO ... isn't always supported in generators.

But anyway, SqlHydra, which I haven't tried implementing yet, looks like it generates the query and performs the execution, so I'll see how that works with metrics:

let getExpensiveProducts (db: QueryContextFactory) minPrice =
    selectTask db {
        for p in SalesLT.Product do
        where (p.ListPrice > minPrice)
        select p
    }

How would the separation look in OO?

I really don't want to re-invent the wheel and a lot of the libraries generate some type of DbQuery record that I'd love to be able to pipe into Dapper or something, but it might look something like the below. The goals of this approach would be for type generics on the query to carry through to the runner so that the Dapper implementation of things like 1 record or multiple records get handled for you.

One way that might look could be something like

// Note that I make changes to the structure later on in the post
record DbQuery<T> {
    public abstract string QueryText { get; }
	public abstract object? Parameter { get; }
	
	public virtual CommandDefinition ToCommandDefinition(option args like timeouts and cancellation token) =>
		new CommandDefinition(QueryText, Parameter, other stuff);

	public abstract Task<T> Run(IDbConnection connection);
}

record OptionQuery<T> : DbQuery<Option<T>> {
	public override async Task<T> Run(IDbConnection connection, CommandDefinition command) {
		// the command definition can carry Transaction information, so having it as an argument allows a query to be paired with others
		var record = await connection.QuerySingleOrDefaultAsync(command);
		return Optional(record);
	}
}

record MultipleQuery<T> : DbQuery<List<T>> {
	public override async Task<T> Run(IDbConnection connection, CommandDefinition command) {
		var result = await connection.QueryAsync<List<T>>(command);
		return result.ToList();
	}
}

record GetByUserId(UserId userId) : OptionQuery<User> {
	public override string QueryText = "SELECT * FROM users WHERE id = @id";
	public override object? Parameter => { id = userId.Value };
}

record ListThingByUserId(UserId userId) : MultipleQuery<T> {
	public override string QueryText = "SELECT * FROM thing WHERE id = @id";
	public override object? Parameter => { id = userId.Value };
}

class DapperRunner(Func<IDbConnection> connFactory) {
	public Task<T> Run<TQuery, T>(TQuery query, CommandDefinition? command) where TQuery : DbQuery<T> =>
	    query.Run(connFactory(), command ?? query.ToCommandDefinition());

	// run in transaction

	// execute multiple
}

class MetricsRunner(DapperRunner runner, SomeMetricsStuff metrics) {
	public Task<T> Run<TQuery, T>(TQuery query, CommandDefinition? command) where TQuery : DbQuery<T>
	{
		var queryName = typeof(TQuery.Name); // benefits of strongly typed queries
		var queryText = query.SqlText;
		// log metrics with this info
		return runner.Run(query, command);
	}
}

Then, instead of a Repository, you could have things be more functional, like a “module” for the queries.

static class UserThingQueries {
	public record GetByUserId(UserId userId) : OptionQuery<User> {
	    public override string QueryText = "SELECT * FROM users WHERE id = @id";
	    public override object? Parameter => { id = userId.Value };
    }

    public record ListThingByUserId(UserId userId) : MultipleQuery<T> {
	    public override string QueryText = "SELECT * FROM thing WHERE id = @id";
	    public override object? Parameter => { id = userId.Value };
    }
}

Now, as far as the consumer goes, it would need to have a runner injected, or you could again put it behind a repository, but that would defeat the purpose a bit, though testing might be easier since you could make an interface for just those particular queries that need to be run. There are lots of options with programming.

But I don't know, going back to my FSharp repo, one thing I would like to have done is to have my instrumentation not need to be copy-pasted per method. It's not a big deal, but if I'm not using strongly typed query objects, then metrics and logging becomes more difficult. It's nice to have a name for the query so that you can see easily in the code where it's getting used. The libraries that I have don't really have that, which is why my withDbActivity helper takes a QueryName and QueryText option as parameters.

I still don't know where I want to go with things like SqlHydra, DbFun (I may have to drop it if I want to use tasks natively), or the currently implementation of writing Dapper in a repository. I don't like how big the repo is getting already, and trying to implement this same thing in FSharp feels clunky and I keep getting stuck on various syntax elements. Or I could copy-paste what I wrote into my local llm running on a Radeon 6800XT just because.

What I'm not wild about in the repository is the tight coupling of a query itself and the running of it. Separating those makes it easier to combine queries in a transaction, such as something like below.

let updateOneThing = UpdateOneThingQuery(blah)
let updateAnotherThing = UpdateAnotherThingCommand(blah)
runner.RunInTransaction(fun connection transaction -> task {
    let cmd1 = updateOneThing.ToCommandDefition(transaction = transaction)
	let cmd2 = updateOneThing.ToCommandDefinition(transaction = transaction)

	let! result1 = runner.Run(updateOneThing, cmd1)
	let! result2 = runner.Run(updateOneThing, cmd2)
	return result1, result2
})

But writing this out is a bit clunky, so perhaps a different approach to reaching across things everywhere would be to put the transaction and other parameters. Things are crossing weirdly, so maybe if it was restructured to the below. This feels a lot more natural.

let updateOneThing = UpdateOneThingQuery(blah)
let updateAnotherThing = UpdateAnotherThingCommand(blah)

runner.RunInTransaction(fun connection transaction -> task {
    let cmd1 = updateOneThing { with Transaction = transaction }
	let cmd2 = updateOneThing { with Transaction = transaction }

	let! result1 = runner.Run(cmd1)
	let! result2 = runner.Run(cmd2)
	return result1, result2
})

so the CSharp version of the base would look something like below. This feels a bit more natural, and is what I've seen in the source code for DbFun and other libs.

record DbQuery<T> {
    public abstract string QueryText { get; }
	public abstract object? Parameter { get; }

	public DbTransaction? Transaction { get; } = default;
	public CancellationToken CancellationToken { get; }
	
	public virtual CommandDefinition ToCommandDefinition() =>
		new CommandDefinition(QueryText, Parameter, transaction: Transaction, cancellationToken: CancellationToken);

	public abstract Task<T> Run(IDbConnection connection);
}

record OptionQuery<T> : DbQuery<Option<T>> {
	public override async Task<T> Run(IDbConnection connection) {
		var record = await connection.QuerySingleOrDefaultAsync(ToCommandDefinition());
		return Optional(record);
	}
}

// consumer
var getByUserIdQuery = new GetByUserId(new UserId(12345));
var listThingByUserIdQuery = new ListThingByUserId(new UserId(12345));

var (getResultOption, listResult) = await runner.RunInTransactionAsync(async (conn, txn) => {
	var getQuery = getByUserIdQuery with { Transaction = tnx };
	var listQuery = listThingByUserIdQuery with { Transaction = txn };

	Option<User> getResult = await getQuery.Run(conn); // the run function only exists to handle dapper's Query, QueryMultiple, QuerySingle, etc and could just be done here
	List<Thing> listThings = await listQuery.Run(conn);

	return (getResutl, listThings)
})

Most FSharp libs use immutable records and such can also set the SqlText and Parameter using with, but you'd have to make your own QueryName if you want to try and log something like that. The custom classes provide really just 2 things: A name for the query for logging and an out-of-the-way mapping from Dapper's result to a List or Option or whatever that type is.

How and whether to incorporate into FSharp

I could do something similar and port this structure over to FSharp syntax, but then I wonder how much it's worth it to wrap a SqlHydra query generation into a query type, such as

// normal way:
let getProducts (db: QueryContextFactory)  =
    selectTask db {
        for p in SalesLT.Product do
        select p
    }

//wrapped way
type ListProductsQuery(storeId: StoreId, db: QueryContextFactory) =
  // inherit a base type
  member _.Run(already doesn't work with dapper connection - bad abstraction?) = 
	  selectTask db {
	        for p in SalesLT.Product do
			where p.StoreId = storeId
	        select p
	    }
  
let getProductsByStore (storeId: StoreId) (db: QueryContextFactory) = ListProductsQuery(storeId, db)

This is how SqlHydra does transactions:

let completeOrder (db: QueryContextFactory) orderId = task {
    use! shared = db.CreateContextAsync()
    shared.BeginTransaction()        

    // Update status for order
    do! updateTask shared {
            for o in dbo.Orders do
            set o.Status "Complete"
            where (o.Id = orderId)
        } : Task

    // Write to audit log
    do! insertTask shared {
            into dbo.AuditLog
            entity { Message = $"Completed order {orderId}"; Timestamp = DateTime.UtcNow }
        } : Task

    shared.CommitTransaction()
}

As for logging named query metrics, I see that SqlHydra's QueryContextFactory can take a custom logging function, but I wouldn't be able to use those typed query names. I could probably get over it haha. I'm looking at 3 different ways of doing things and trying to merge them in my head.

At least with my helper function, I have the ability to put it anywhere, it doesn't have to fit into a specific style, so I could do

withDbActivity logger "Set Order Status Complete" (sql:None) (fun () -> 
    do! updateTask shared {
            for o in dbo.Orders do
            set o.Status "Complete"
            where (o.Id = orderId)
        } : Task
) 

But this means that the particular block for update status and having that be attached to a query name won't be shared across other uses. It'll work in that I could copy-paste to find the part where it's slow, but I wouldn't necessarily see other usages of the same query.

I'm not really too sure what the best approach is. I do really like Dapper and I don't mind writing SQL queries. However, in larger projects with a lot of area and changes, keeping things up to date really only happens with integration tests, and assuming people on other projects find them or that nothing gets missed, which is why microservices are usually split by teams, communicate via message bus, and have their own DBs. It's all primarily, yes for scaling, but to make sure teams don't step on each other.

But that's not the scope for this project. I could probably use Dapper, maybe Dapper.FSharp or SqlHydra for some things where the sql generation makes sense, and use raw Dapper where it makes sense. If I have two flows for doing metrics, then so be it.

I may wind up having a few different ways for each method, but try to make it somewhat invisible at the consumer level? Or I could just pick one (Dapper) and just have one thing. KISS and all that.

But that's part of the learning. I'm currently fumbling around for what feels “right” or “natural” in this current environment. Even in a CSharp repo, I'd want to steer away from Repositories because the typically become God classes, primarily due to pairing the DbQuery and its execution.

I am very open to feedback. WriteFreely blog posts are the easiest to find on the fediverse, but the mastodon handle for this blog is @programming@blog.keyboardvagabond.com and the KeyboardVagabond mastodon link is https://mastodon.keyboardvagabond.com/@programming@blog.keyboardvagabond.com/116525648156410356

#fsharp #csharp #softwaredevelopment #dotnet #programming

I recently caused myself a bit of a minor issue by installing some updates on the Keyboard Vagabond cluster. It wasn't a big deal, just some version number updates from a project called renovate that automatically creates pull requests when package versions that you use get updated. Doing this did trigger a restart on the redis cluster, which means that different services may need to be restarted because their redis connection strings get stale. I had restarted the piefed-worker pod, but the update didn't seem to stick and I didn't realize it.

I noticed the next morning that I wasn't seeing any new posts, so I figured the worker was stuck and, sure enough, I checked the redis queue and saw it stuck at ~53k items.

image

Piefed will stop publishing items to the queue when the redis queue reaches 200MB in size and return 429 rate limit http responses.

Solution: restart and then processing started, but I was wondering about pod scaling.

The thing about scaling the worker is that piefed scales internally from 1-5 workers, so vertical scaling is preferred over horizontal, especially since redis doesn't ensure processing order like Kafka does, so by adding a new pod, I could create a situation where one pod pulls a post create, the next pulls an upvote, but the upvote gets processed before creating the post. So normally, you wouldn't want to scale horizontally, but there is a use case for doing it: something gets stuck.

In the past, the queue had blown up due to one or more lemmy servers going down and message processing stalling. I solved that at the time with multiple parallel worker pods so that at least some of the workers would likely not get stuck. Doing something similar could help in this current case, where the first worker wasn't processing queues. Now, the ultimate item on the to-do list is that I should make that pod return redis connectivity as part of the health check so that it'll get restarted if redis fails. (I'll be doing that after this blog post)

My up until today current version of horizontal scaling was on cpu and memory usage, but I never hit those limits, so it never triggered. I was working with Claude on it when it introduced me to KEDA, Kubernetes Event Driven Autoscaling. https://keda.sh/. This looks like what I need.

Installation was pretty simple, https://keda.sh/docs/2.18/deploy/, you can use a helm chart or run kubectl apply --server-side -f https://github.com/kedacore/keda/releases/download/v2.18.3/keda-2.18.3.yaml and it takes care of it. I had Claude create a kustomization file:

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: keda-system

resources:
  - https://github.com/kedacore/keda/releases/download/v2.18.3/keda-2.18.3.yaml

patches:
  # Custom patches to change the namespace to keda-system to be consistent with my other namespace patterns
  - path: patches/clusterrolebinding-keda-operator-namespace.yaml
  - path: patches/clusterrolebinding-keda-system-auth-delegator-namespace.yaml
  - path: patches/rolebinding-keda-auth-reader-namespace.yaml
  - path: patches/apiservice-external-metrics-namespace.yaml
  - path: patches/validatingwebhook-namespace.yaml

And the patches aren't necessary, but they look like the below just because I want that namespace.

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  service:
    namespace: keda-system

After that, there's a scaledobject in Kubernetes that you can configure:

---
# KEDA ScaledObject for PieFed Worker - Queue-Based Autoscaling

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: piefed-worker-scaledobject
  namespace: piefed-application
  labels:
    app.kubernetes.io/name: piefed
    app.kubernetes.io/component: worker
spec:
  scaleTargetRef:
    name: piefed-worker
  minReplicaCount: 1
  maxReplicaCount: 2 
  cooldownPeriod: 600  # 10 minutes before scaling down (conservative)
  pollingInterval: 30  # Check queue every 30 seconds
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 600  # Wait 10 min before scaling down
          policies:
          - type: Percent
            value: 50
            periodSeconds: 60
          selectPolicy: Max
        scaleUp:
          stabilizationWindowSeconds: 120  # Wait 2 min before scaling up
          policies:
          - type: Pods
            value: 1
            periodSeconds: 60
          selectPolicy: Max
  triggers:
  - type: redis
    metadata:
      address: redis-ha-haproxy.redis-system.svc.cluster.local:6379
      listName: celery  # Main Celery queue
      listLength: '40000'  # Scale up when queue exceeds 40k tasks per pod. Piefed stops pushing to redis at 200MB, 53k messages the last time it got blocked.
      databaseIndex: "0"  # Redis database number (0 for PieFed Celery broker)
    authenticationRef:
      name: keda-redis-trigger-auth-piefed

This will scale when 40k messages are in the queue, which should only happen when something isn't getting processed, and will scale up to a second pod only. So, in the event that a pod gets stuck, at least things should gradually be kept moving.

When I got to this point, I decided to implement my restart idea, but Claude gave a different suggestion to use the Celery worker's retries, so it added

- name: CELERY_BROKER_CONNECTION_MAX_RETRIES
  value: "10"  # Exit worker after 10 failed reconnects → pod restart
- name: CELERY_BROKER_TRANSPORT_OPTIONS
  value: '{"socket_timeout": 10, "socket_connect_timeout": 5, "health_check_interval": 30}'

A new startup probe, sure why not

startupProbe:
          exec:
            command:
            - python
            - -c
            - "import os,redis,urllib.parse; u=urllib.parse.urlparse(os.environ['CELERY_BROKER_URL']); r=redis.Redis(host=u.hostname, port=u.port, password=u.password, db=int(u.path[1:]) if u.path else 0); r.ping()"
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 30

and it changed a few thresholds for checking liveliness, which I thought looked fine.

The current state of things is that once the number of records started going down, other servers started federating, which is the spike you see in the graph. There are now 3 web pods and 2 worker pods, vs the typical 2 web pods and 1 worker pod.

The good news is that after scaling out, the total max processed gradually rose from ~1.5k per minute to just under 3k per minute. Once the records fall below 40k and other servers are back to normal federation, things will go back to more normal levels, as a single worker is fine unless things stop and get backed up.

Good job on piefed for returning 429s to keep things from getting too crazy!

Here are the requests coming in. You can see big spikes once we stopped returning 429's. I do have some nginx rate limiting set up as well to keep things sane. image

Edit: I just ran into a fun thing while doing all of this. I ran out of WAL (Write Ahead Log) space on the storage volume. I gave it 10GB with expansion, so the primary db node started failing at 20.6GB in size. I just doubled the size of the WAL PVC and that resolved. lol.

Edit 2: Fun waves as it hovers around the 40k threshold

#selfhosting #kubernetes #fediverse #yaml #keda #autoscaling #piefed #lemmy #programming #softwaredevelopment #k8s

It started with a perfectly good and running kubernetes cluster hosting fediverse applications at keyboardvagabond with all the infrastructure and observability that comes with it. I've worked in kubernetes environments for a while, but lacked being able to see how everything comes together and what it means; I also wanted to host some fediverse software for the digital nomad community.

I followed a guide on bare metal kubernetes setup with hetzner (though you should definitely NOT change cluster.local like it says) with some changes, adjustments, and modifications over time to suite my scenario. While I was getting up and running with my 3 cluster VPS servers, I became nervous about resource usage. The applications that I host are currently more ram needy than cpu and the nodes with all of the applications were using ~12GB out of the 16GB available. I decided to make 2 of the 3 nodes worker nodes and have one control plane node. The control plane is the one that determines what the other nodes are doing and hosting. Put a pin in this, it'll come back later.

I also was able to migrate from DNS entries on exposed ports to Cloudflare tunnels and Tailscale for VPN access. This means that no one can try to input commands on the Talos or Kubernetes ports, as they're no longer exposed. You'd need to figure out the encryption key to be able to do it, but now it's even safer. Put a pin in that.

This has been very much a learning process for me in a lot of ways, and I hope that I haven't forgotten too much – it's funny how memory is. I've been taking a lot of notes and having claude/cursor draw up summaries that I leave lying around. It's funny how much sense your documentation makes until you come back 3 months later.

One of the issues that was in the back of my mind was that I had configured the Talos configuration launch kubernetes with the port number specified and I was using the external IP. This was a mistake, because it meant that the nodes were primarily communicating with each other externally rather than over the VLAN, or the internal network. Internal traffic still happened, as I believe that service to service communication would go via kubernetes to a local IP. However, I eventually got a broken dashboard working that showed me the network traffic by device, but it was all on eth0, the external ethernet, not the VLAN. I then checked the dashboards on the provider and it showed 1.8TB of internet usage. That's within my budget, thankfully, but way too much for a single-user cluster, as I had not yet announced the services as open to the public.

I wanted to get this working before going live, so I figured that I would start with n3, one of the workers. I have an encrypted copy of the Talos machine config, but couldn't decrypt it, so I copied n2, changed the IP to the internal 10.132.0.30, and applied...... I forgot to change the host name from n2 to n3.

No biggie, I'll change it and apply....timeouts. Tailscale is no longer connected to the cluster. I spent an hour trying to get access, working with Claude for ideas and work-arounds. No dice. I believe what happened was that in the confusion of 2 nodes with the same name, Tailscale was likely running on n3 and was no longer accessible and the weird state of things caused it to not be spun up on the other nodes. If it wasn't a weird state it was because at my scaling with redundant services and two nodes don't have the RAM available to handle everything from a failed node. But either way, I had to get back in to the cluster.

I went into the VPS dashboard and rebooted the server into recovery mode, wiped the drive, re-installed, and tried to re-join the cluster. This should have been fine as I ensure that there are 2 copies of all storage volume across the nodes in addition to nightly s3 backups. In hind-sight, I might have been better rebooting talos into maintenance mode. But it didn't rejoin the cluster. It turns out that I was missing a particular network configuration that would allow a foreign node to join. That doesn't happen automatically, there's allow-listing for the IP address and some other network policies that need to exist to allow it and I was missing one for one of the talos ports.

I need to get to the control plane node, n1. I rebooted into Talos maintenance mode and apply the new configuration, but it's logging that it can't join a cluster and that I need to bootstrap it to join. I guess that makes sense, it was the only control plane. I get it up and running and progressively add n3 and n2 and they re-join. I reinstall the basic infrastructure to get running and then let FluxCD restart all of the services. The majority boot up, but I notice that a couple of services are blank. No existing data.

I check the longhorn UI, which is what I use to manage storage, and I don't see a lot of volumes, but I see about 50 orphans.... Crap. All volumes were orphaned. When I put n1 into maintenance mode and then bootstrapped, I thought that longhorn would see the volumes and put them back with the services that they belonged to. However, when I redid n1, etcd, the part that manages cluster resources, was cleared and all that storage information lost who and what it belonged to. Learning is painful sometimes.

I tried to take a look at the volumes, but Talos is pretty minimal, so Claude made a pod with alpine and XFS (my file-system) tools that would attach a specific orphan volume, mount it, and try to look at the contents to see what it belonged to. Some things were fairly easy to identify, such as the WriteFreely blog, which is one of the first services that I loaded and uses its own SQLite database. I got that up and running. I also use harbor registry to be a mirror proxy and allow me to privately push my own builds – it was all 0s, or at least the first 100MB were. That's not a huge deal. The database volumes were intact, but I couldn't really get those running, so I'd have to re-create it.

I gradually got these services running and re-configured. Once Harbor is up, images should start getting pulled and cached. But redis failed to pull. That's weird.

But first let me get the database running with CloudNative Postgres. I got it up, but the database was empty, so back to looking at orphans. The tricky thing here is that a few applications have their own postgres databases, such as Harbor Registry. So instead of looking at the file structure I also had to find out what tables were there, but even when I found them, I didn't know which orphan belonged to the primary rather than a replica. In the end, I decided to restore the latest nightly backup and then had Claude arrange a “swap” where it replaces the current “volume claim” with a pinned “volume” name. Essentially, the database pod has a PVC (persistence volume claim) and I want to have the claim that is used be pointed to the recovered volume. So I had claude execute those steps, which unfortunately can leave you with a PVC in your source code that has a volume reference, which you can get rid of, but may or may not be immediately worth it. I restarted and postgres shows all of the databases that I expect.

Next is to fix redis. It turns out that not only Harbor was using Bitnami helm charts (pre-made configurations for kubernetes), but so was the redis cluster. I run with a main and 2 replicas on the 3 nodes. It was failing because Bitnami no longer wants to provide free charts, so they moved everything to bitnamilegacy. No biggie, I'll just change the image and repository that's used and it'll load. Redis loaded, but then there was another component called “redis-exporter” for metrics that seemed to ignore the image override. I then spent the next few hours trying to get it to work and experimenting with other helm charts that provide a cluster arrangement. I settled on one and got redis working. I did lose some data as some applications like piefed started running and publishing messages that it received to do work from the 3 days of being off-line. I decided not to try to recover that. Oh well, it's only social media. Once I go live there will be more current things to look at. It was a pain, though.

After this, I spent quite a few hours fixing small issues with getting FluxCD to reconcile the state of things, especially since I had made changes to PVCs, which are immutable. That took quite a few more hours to either recreate or undo changes so that FluxCD was happy. Eventually everything came online despite me hitting Docker rate limits. I rebuilt the rest of the various fediverse apps, as I have custom builds for Bookwyrm (books), Piefed (reddit), and Pixelfed (instagram) for my kubernetes cluster.

I then began to rebuild the dashboards that I had lost. I still don't have all of them, but at least now that networking tab show a LOT of devices, including the VLAN. Mission accomplished? I did do one extra and got a log view of long-running queries from different apps that I could annoy the developers with, but they look like some easy fixes with some indexes and light code changes, hopefully.

I still need to rebuild the redis dashboards, as I had some metrics for the different event queues that the apps use, which I could use to monitor is something bad happened. On ocassion, if another server fails to respond, it could cause a queue backup, as I don't believe the varioius apps are “grouping” by domain name, which is a feature with the redis XGROUP command.

Here's a funny thing, though. After getting the services up and running for a couple of days, the RAM usage is the same with 3 control plane nodes as it was with just one, so my worries were for nothing and cost me the cluster.

As part of the recovery, I took the opportunity to create a VIP for talos. This is a static IP address that the different control planes vote on for who is managing. So I changed the talos host from a domain name, such as api.mycluster.com to that IP of 10.132.0.5. I also took the time to migrate from Tailscale's subnet route setup to their operator helm chart. This should let me expose different services over the VPN with a domain name using their MagicDNS system and a meta attribute on the service. I haven't done that yet, though.

This disaster was avoidable and could have been a few minute upgrade if I did everything right, but I was able to take the opportunity to fix some other networking and service issues that I was too afraid to do on a running environment. Now all of my services are communicating over the VLAN, I have a VIP for Talos, Tailscale is upgraded, I've migrated more off of Bitnami, and I can now properly handle a node failure except for full service restarts. I would still have to scale down some things manually for that fail-over. But nobody is making or losing money off of this, except for me and my VPS provider, so good enough.

In the end, I got up and running, and the AI was actually quite helpful for debugging issues and quickly generating commands and templates for volume recovery. It was nice being able to let it either work or run a script to examine the orphan volumes for me. I did have to play around with getting it to create notes to go to new contexts as they would get full quickly once I ran out of Claude usage with my plan. I'm glad I didn't have to type a bunch of stuff myself. Of course, AI is still “that looks about right”, which is a thing that I'm aware of, but it wound up being a useful tool for this recovery.

The other thing that helped a good bit was I was actually in another town to visit an old travel friend. Normally I'm the type of person to obsess about a problem until it's solved, but I was there to visit a friend and nobody's livelihood depends on this. So I pulled myself away to go hang and even after just 15 minutes away from the keyboard I'd start getting new ideas or realizing something new. That's one reason the recovery took several days, because I was still living (and obsessing). The mandatory breaks were probably the most helpful things that I could have done – I just don't know how to replicate those.

#talos #kubernetes #selfhosting #fediverse #keyboardvagabond #whybitnamiwhy #cluster #vps #failover #distasterRecovery

Edit: The below didn't work. Jump to the edit to see the current attempt.

I'm experimenting with where to put these types of blog posts. I have been putting them on my home server, at gotosocial.michaeldileo.org, but I'm thinking of moving over here instead of a micro-blogging platform.

Longhorn, the system that is used to manage storage for Keyboard Vagabond, performs regular backups and disaster recovery management. I noticed that on the last few billing cycles, the costs for S3 cloud storage with BackBlaze was about $25 higher than expected, and given that the last two bills were like this, it's not a fluke.

The costs are from s3_list_objects, over 5M calls last month. It turns out this is a common thing that has been mentioned in github, reddit, Stack Overflow, etc. The solution seems to be just to turn it off. It doesn't seem to be required for backups and disaster recovery to work and Longhorn seems to be doing something very incorrectly to be making all of these calls.

...
data:
    default-resource.yaml: |-
        ...
        "backupstore-poll-interval": "0"

My expectation is that future billing cycles should be well under $10/month for storage. The current daily average storage size is 563GB, or $3.38 per month.

#kubernetes #longhorn #s3 #programming #selfhosting #cloudnative #keyboardvagabond

Edit – the above didn't work (new solution below)

Ok, so the network policy did block the external traffic, but it also blocked some internet traffic that caused the pods to not be in a ready state. I've been playing around with variations of different ports, but I haven't found a full solution yet. I'll update if I get it resolved. I got it. I had to switch to a CiliumNetworkPolicy

I also tried changing the polling interval from 0 to 86400, though I think the issue is ultimately how they do the calls, so bear this in mind if you use Longhorn. Right now I'm toying around with the idea of setting a cap, since my backups happen after midnight, so maybe gamble on the cap getting reset and then a backup happening, then at some point the cap gets hit and further calls fail until the morning? This might be a bad idea, but I think that I could at least limit my daily expenditure.

One thing to note from what I read in various docs is that in Longhorn v1.10.0, they removed the polling configuration variable since you can set it in the UI. I still haven't solved the issue, ultimately.

I see that yesterday longhorn made 145,000 Class C requests (s3-list-objects). I found on a github issue that someone solved the issue be setting a network policy to block egress outside of those hours. I had Claude draw up some policies, configurations, and test scripts to monitor/observe the different system states. The catch, though, is that I use FluxCD to maintain state and configuration, so this policy cannot be managed by flux.

The gist is that a blocking network policy is created manually, then there are two cron jobs: one to delete the policy 5 minutes before backup, and another to recreate it 3 hours later. I'm hoping this will be a solution.

Edit: I think that I finally got it. I had to switch from a NetworkPolicy to CiliumNetworkPolicy, since that's what I'm using (duh?). using toEntities: kube-apiserver fixed a lot of issues. Here's what I have below. It's the blocking network configuration and the cron jobs to remove and re-create it. I still have a billing cap in place for now. I found that all volumes backed up after the daily reset. I'm going to keep it for a few days and then consider whether to remove it or not. I now at least feel better about being a good citizen and not hammering APIs unnecessarily.

---
# NetworkPolicy: Blocks S3 access by default
# This is applied initially, then managed by CronJobs below
# Using CiliumNetworkPolicy for better API server support via toEntities
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: longhorn-block-s3-access
  namespace: longhorn-system
  labels:
    app: longhorn
    purpose: s3-access-control
spec:
  description: "Block external S3 access while allowing internal cluster communication"
  endpointSelector:
    matchLabels:
      app: longhorn-manager
  egress:
    # Allow DNS to kube-system namespace
    - toEndpoints:
      - matchLabels:
          k8s-app: kube-dns
      toPorts:
      - ports:
        - port: "53"
          protocol: UDP
        - port: "53"
          protocol: TCP
    # Explicitly allow Kubernetes API server (critical for Longhorn)
    # Cilium handles this specially - kube-apiserver entity is required
    - toEntities:
      - kube-apiserver
    # Allow all internal cluster traffic (10.0.0.0/8)
    # This includes:
    # - Pod CIDR: 10.244.0.0/16
    # - Service CIDR: 10.96.0.0/12 (API server already covered above)
    # - VLAN Network: 10.132.0.0/24
    # - All other internal 10.x.x.x addresses
    - toCIDR:
      - 10.0.0.0/8
    # Allow pod-to-pod communication within cluster
    # The 10.0.0.0/8 CIDR block above covers all pod-to-pod communication
    # This explicit rule ensures instance-manager pods are reachable
    - toEntities:
      - cluster
    # Block all other egress (including external S3 like Backblaze B2)
---
# RBAC for CronJobs that manage the NetworkPolicy
apiVersion: v1
kind: ServiceAccount
metadata:
  name: longhorn-netpol-manager
  namespace: longhorn-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: longhorn-netpol-manager
  namespace: longhorn-system
rules:
- apiGroups: ["cilium.io"]
  resources: ["ciliumnetworkpolicies"]
  verbs: ["get", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: longhorn-netpol-manager
  namespace: longhorn-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: longhorn-netpol-manager
subjects:
- kind: ServiceAccount
  name: longhorn-netpol-manager
  namespace: longhorn-system
---
# CronJob: Remove NetworkPolicy before backups (12:55 AM daily)
# This allows S3 access during the backup window
apiVersion: batch/v1
kind: CronJob
metadata:
  name: longhorn-enable-s3-access
  namespace: longhorn-system
  labels:
    app: longhorn
    purpose: s3-access-control
spec:
  # Run at 12:55 AM daily (5 minutes before earliest backup at 1:00 AM Sunday weekly)
  schedule: "55 0 * * *"
  successfulJobsHistoryLimit: 2
  failedJobsHistoryLimit: 2
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: longhorn-netpol-manager
        spec:
          serviceAccountName: longhorn-netpol-manager
          restartPolicy: OnFailure
          containers:
          - name: delete-netpol
            image: bitnami/kubectl:latest
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - |
              echo "Removing CiliumNetworkPolicy to allow S3 access for backups..."
              kubectl delete ciliumnetworkpolicy longhorn-block-s3-access -n longhorn-system --ignore-not-found=true
              echo "S3 access enabled. Backups can proceed."
---
# CronJob: Re-apply NetworkPolicy after backups (4:00 AM daily)
# This blocks S3 access after the backup window closes
apiVersion: batch/v1
kind: CronJob
metadata:
  name: longhorn-disable-s3-access
  namespace: longhorn-system
  labels:
    app: longhorn
    purpose: s3-access-control
spec:
  # Run at 4:00 AM daily (gives 3 hours 5 minutes for backups to complete)
  schedule: "0 4 * * *"
  successfulJobsHistoryLimit: 2
  failedJobsHistoryLimit: 2
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: longhorn-netpol-manager
        spec:
          serviceAccountName: longhorn-netpol-manager
          restartPolicy: OnFailure
          containers:
          - name: create-netpol
            image: bitnami/kubectl:latest
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - |
              echo "Re-applying CiliumNetworkPolicy to block S3 access..."
              kubectl apply -f - <<EOF
              apiVersion: cilium.io/v2
              kind: CiliumNetworkPolicy
              metadata:
                name: longhorn-block-s3-access
                namespace: longhorn-system
                labels:
                  app: longhorn
                  purpose: s3-access-control
              spec:
                description: "Block external S3 access while allowing internal cluster communication"
                endpointSelector:
                  matchLabels:
                    app: longhorn-manager
                egress:
                # Allow DNS to kube-system namespace
                - toEndpoints:
                  - matchLabels:
                      k8s-app: kube-dns
                  toPorts:
                  - ports:
                    - port: "53"
                      protocol: UDP
                    - port: "53"
                      protocol: TCP
                # Explicitly allow Kubernetes API server (critical for Longhorn)
                - toEntities:
                  - kube-apiserver
                # Allow all internal cluster traffic (10.0.0.0/8)
                - toCIDR:
                  - 10.0.0.0/8
                # Allow pod-to-pod communication within cluster
                # The 10.0.0.0/8 CIDR block above covers all pod-to-pod communication
                - toEntities:
                  - cluster
                # Block all other egress (including external S3)
              EOF
              echo "S3 access blocked. Polling stopped until next backup window."