Deployment cdk script fails with database 404 #15

Open
kittyandrew opened this issue Aug 1, 2024 · 9 comments

Comments

@kittyandrew

After configuring everything according to the readme and more (bootstrapping the AWS CDK environment, creating the engineering group, etc.), I'm stuck with a new error:
[screenshot of the deployment error]

This seems to be related to some database setup, but it isn't described anywhere in the readme, and it doesn't appear to be related to the existing database setup section, which comes much later than npm run deploy.

From my limited AWS experience, and after reading the readme, it's also strange that it failed to find the "default" group or subnet, because as far as I understand that is a reserved name that should already exist (?).

Please let me know if I'm missing something trivial, or doing something wrong.

@dmmiller
Collaborator

dmmiller commented Aug 6, 2024

@tyrtel, was this what you were seeing as well? Were you able to get past it?

@dmmiller
Collaborator

dmmiller commented Aug 6, 2024

@flooey , did you run into this at all? And if so, what was the fix/workaround?

@yukigesho

I got the same error and have been stuck trying to figure it out. Have you found any solution or workaround for this? @kittyandrew

@kittyandrew
Author

nope

@yukigesho

I discovered that in /ops/aws/src/radical-stack/rds/ProdReplica.ts the default subnet group is hardcoded. However, it cannot simply be 'default'; it should be 'default-vpc-xxxxxxxxxxxxx', which you can verify under RDS > Subnet groups. After updating this value, another error surfaced. To resolve it, create a security group in your VPC and update the values in /ops/aws/src/radical-stack/rds/ProdReplica.ts (a rough sketch of both changes is at the end of this comment).

It's also important to note that ops/aws/src/radical-stack/ec2/vpc.ts contains hardcoded IPs and subnet masks, so make sure everything is configured correctly there too. Currently, I am struggling with this issue:

[screenshot of the next error]

I hope it's just a matter of time.
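For anyone hitting the same thing, here's a minimal sketch of the two changes described above, assuming the replica uses the standard aws-cdk-lib RDS/EC2 constructs. The stack setup, construct IDs, subnet group name, and security group ID are placeholders to replace with your own values, not the actual ProdReplica.ts code:

import { App, Stack, aws_ec2 as EC2, aws_rds as RDS } from 'aws-cdk-lib';

const app = new App();
const stack = new Stack(app, 'RadicalStack'); // stand-in for radicalStack()

// Look up the subnet group by the exact name shown under RDS > Subnet groups
// ('default-vpc-xxxxxxxxxxxxx' is a placeholder, not a literal value).
const subnetGroup = RDS.SubnetGroup.fromSubnetGroupName(
  stack,
  'ReplicaSubnetGroup',
  'default-vpc-xxxxxxxxxxxxx',
);

// Reference the security group created in your VPC by its ID.
const replicaSecurityGroup = EC2.SecurityGroup.fromSecurityGroupId(
  stack,
  'ReplicaSecurityGroup',
  'sg-0123456789abcdef0',
);

// Pass these to the read replica's props (subnetGroup / securityGroups)
// instead of the hardcoded 'default' values.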

@yukigesho

Okay, now I received:
[screenshot of the new error]
It's progress, but this error message doesn't help me much.
Maybe someone knows the root cause of it?

@dmmiller
Collaborator

Someone else ran into that and they just removed monitoring from the deployment to see if they could get it to work. That got them past the monitoring issue.

@yukigesho

@dmmiller I'm experiencing a freeze with no error messages, and the screen has been unchanged for about two hours. Here's the screen I'm stuck on:

[screenshot of the stalled deployment output]

I minimized instance sizes to see if that would enable deployment (I'm using the free tier). Could this adjustment be causing the freeze? I'm unsure whether this is a minor issue or indicates a deeper problem.

For example, in ops/aws/src/radical-stack/ec2/autoScalingGroup.ts, I changed EC2.InstanceSize.XLARGE to EC2.InstanceSize.LARGE.

const asg = new autoScaling.AutoScalingGroup(
  radicalStack(),
  `${tier}ServerLondonASG`,
  {
    autoScalingGroupName,
    instanceType: EC2.InstanceType.of(
      // m5.xlarge: general purpose instance type
      // 4 vCPUs, 16 GiB of RAM
      EC2.InstanceClass.M5,
      EC2.InstanceSize.LARGE
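      // changed from EC2.InstanceSize.XLARGE; m5.large is 2 vCPUs, 8 GiB of RAM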
    ...

@jwatzman
Copy link

Driving by: if anyone wants to actually fix the issue with monitoring, I have a potential lead from my memory of the issues it had at Cord. It's something like this: there is a persistent drive that monitoring uses so that historical data isn't lost when the EC2 instance is rebuilt, and the IAM permissions are set so that only the monitoring instance can mount that drive.

At least in steady state, this caused a chicken-and-egg problem when the EC2 instance was rebuilt: CF wouldn't assign the new IAM role (which is what allows it to mount the drive) to the new monitoring instance until the instance was up/healthy, but it wouldn't be considered up/healthy until it had mounted the drive. Or something like that -- I didn't 100% pin it down, but that's my recollection of what I strongly suspected was going on.

This used to (somehow!) work and then it broke earlier this year. We rebuilt monitoring infrequently enough that, the two or three times I hit it, I just attached the drive by hand in the EC2 console during the window while CF was waiting on it (you have about a 10m window, though you need to be watching to find the right time...).

I'd be unsurprised if an extremely similar issue affected creating monitoring for the first time.

If you want to at least try to have a monitoring instance, you can try removing that persistent drive logic and just having monitoring write to the raw root drive (which you might want to increase in size a little bit). You'll lose all data on rebuild, but it will at least give you something.
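For what it's worth, here's a minimal sketch of that fallback, assuming the monitoring host is a plain aws-cdk-lib EC2 instance. The VPC, instance type, AMI, construct IDs, and the 100 GiB size are placeholders, and it deliberately has no separate persistent volume, so monitoring just writes to the enlarged root drive:

import { App, Stack, aws_ec2 as EC2 } from 'aws-cdk-lib';

const app = new App();
const stack = new Stack(app, 'MonitoringStack'); // stand-in for the real stack

// Placeholder VPC; the real stack would reuse the VPC from ec2/vpc.ts.
const vpc = new EC2.Vpc(stack, 'Vpc');

// No separate persistent data volume: just enlarge the root EBS volume so
// monitoring writes its historical data there. All data is lost on rebuild.
new EC2.Instance(stack, 'MonitoringInstance', {
  vpc,
  instanceType: EC2.InstanceType.of(EC2.InstanceClass.T3, EC2.InstanceSize.MEDIUM),
  machineImage: EC2.MachineImage.latestAmazonLinux2(),
  blockDevices: [
    {
      deviceName: '/dev/xvda', // root device name on Amazon Linux 2 AMIs
      volume: EC2.BlockDeviceVolume.ebs(100), // 100 GiB root volume (placeholder size)
    },
  ],
});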

