Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cleanup of subscriptions and connections doesn't seem to be working #83

Closed
fridaystreet opened this issue Jun 9, 2020 · 32 comments
Closed

Comments

@fridaystreet
Copy link

Not entirely sure if this is an issue or just working as expected. I'm just trying to get my head around how the cleanup should work as I seem to have lots of old subscriptions. Looking through the cloudwatch logs it looks like all the old subscriptions for a particular event are getting pulled down each time, then discarded. But essentially it's making the lambda function run longer and longer each time.

My understanding was that subscriptions and subscription operations are removed when they are stale like connections. That doesn't seem to be the case, well at least in our implementation.

looking at the code base I can see that unsubscribeAllByConnectionId is called in unregisterConnection, which is called

  1. when sending the message to the client fails
  2. when the $disconnect is received
  3. when stop message is received

I think there might be a couple of gaps there. Maybe it should be called when the Events processor is iterating through the events and executing the event for all the subscribers. If there is an error in that execution for a subscriber, I think maybe that should cause the subscription to be removed and the client informed so they can resubscribe.

Possibly in closeConnection in the connection manager, although my assumption here is that triggering the apigateway to close the connection is triggering the $disconnect, which in turn should be cleaning up.

So anyway, I'm not sure if there are maybe circumstances where relying on the 2 hour $disconnect from API gateway is failing through the cracks sometimes and subscriptions are left and never cleaned up. Possibly a check to see if the connection in the subscription is valid before executing it in the event processor? Or maybe that should be the role of the subscription manager, it shouldn't return any subscriptions to the event processor if they don't have valid connections?

@fridaystreet
Copy link
Author

been thinking this through a bit and would I be correct in thinking that a ttl on the subscriptions and operations table would be just as well placed as the events table wouldn't it?

I mean the subscriptions and operations are null and void after 2 hours if AWS is going to ditch the connections anyway. Or maybe I missed something as to why they are needed to be kept?

Cheers
Paul

@michalkvasnicak
Copy link
Owner

@fridaystreet you're right, they should be removed but this library relied on AWS AG to call $disconnect event, but it seems isn't guaranteed by AWS.

  • TTL idea: Does AWS create a new connection (new id) after connection reaches its lifetime (2hours) ?
  • Unsubscribing during event processing: This one seems to be good idea too because I don't think it should be a responsibility of SubscriptionManager to return only live connections, because that'd mean we need to fetch the status over and over again.

Maybe the combination of both is way to go. Do you see any parts where it could be breaking change? TTL is not a breaking change but requires you as a developer to set it up. Unsubscribing during event processing could help everyone just by updating the version of a package.

@IslamWahid
Copy link
Contributor

same issue here!

@IslamWahid
Copy link
Contributor

I see that when the page with subscriptions is open, every 10 mins there's a new connection opened with its subscriptions and the old one is not used anymore and that keep just adding to the subscriptions table with no real connection on the connections table.
Screen Shot 2020-06-29 at 19 20 19

@fridaystreet
Copy link
Author

fridaystreet commented Jun 29, 2020

@michalkvasnicak good point on the id being reused. I'm not entirely sure. However, we have been running with just the TTL since I originally posted this and haven't experienced the issue again. I haven't had a chance to do any really in depth tracing or load testing, so possibly it could still be worth looking into a clean up during event processing as well, but certainly as a quick fix, the TTL is working well.

I don't think either TTL or the cleanup would cause a breaking change. Well for us it hasn't. Using Apollo client it just handles it and resubscribe to everything automatically.

Probably also worth noting, as per a previous conversation we were having about dynamo keys, we have made sure to use very specific subscription key names which include the ids of relevant objects (eg new_message_room_some_id), so as to ensure there are minimal reads from dynamo, rather than broad names like 'new_message' and then relying on subscription filtering post pulling all the subscriptions. This seems to scale and perform well so far.

@michalkvasnicak
Copy link
Owner

@fridaystreet thanks for the info. If I find a time on the weekend I'll try to implement removal of dead connections stored in table (ones that were not removed by AWS AG automatically by invoking $disconnect event). TTL is simplest solution so maybe it needs to be documented too.

Probably also worth noting, as per a previous conversation we were having about dynamo keys, we have made sure to use very specific subscription key names which include the ids of relevant objects (eg new_message_room_some_id), so as to ensure there are minimal reads from dynamo, rather than broad names like 'new_message' and then relying on subscription filtering post pulling all the subscriptions. This seems to scale and perform well so far.

Yes this is something we were discussing with @AlpacaGoesCrazy in #86 and also about using SQS as an event source.

@michalkvasnicak
Copy link
Owner

@fridaystreet I took a look and when sending to connection fails, the connection should be deleted and all of its subscriptions. So the problem (as you wrote) is in the event processing when processing of an event fails, then we never try to send anything to connection, so we don't know whether it's stale or not. We can't remove the connection in this case, because the error doesn't necessarily need to be caused by stale connection. So in case of an error we should try to check if connection is stale or not.

@michalkvasnicak
Copy link
Owner

@fridaystreet I'm not sure whether to check connection after event processing fails or before event is processed.

After event processing fails:

We will check the state of specific connection only if event processing failed.

Before an event is processed:

We will check the state of specific connection and only then we will process the event.

I think that for developers it's easier to think about event processing only on active connections. But on the other hand, checking the state of connection on each event can result in API throttling from AWS side.

@fridaystreet
Copy link
Author

@michalkvasnicak yeah I agree with you and after reading your comments, it would make sense to only check on failure I think. However, that said, if the connection is canned (that is killed by telling api gateway to drop it which I think is what unregisterConnection is doing from memory) then the client will just resubscribe. Or at least in the case of apollo it will.

It's been a while since I was looking at this, but I think where I was going was that, if processing fails, just tell api gateway to kill the connection and delete it, then leave it up to the client to resubscribe.

But I can see where you're going with 'only check connection if it fails' I'm assuming somehow it would attempt to send a message and if it fails then delete it.

@michalkvasnicak
Copy link
Owner

@fridaystreet

But I can see where you're going with 'only check connection if it fails' I'm assuming somehow it would attempt to send a message and if it fails then delete it.

There is an API getConnection() on ApiGatewayManagementApi. So I was thinking that on failure we'd just use this API to check if the connection exists and unregister it. sendToConnection() already removes the connection and its subscriptions, see https://github.com/michalkvasnicak/aws-lambda-graphql/blob/master/packages/aws-lambda-graphql/src/DynamoDBConnectionManager.ts#L145.

@fridaystreet
Copy link
Author

and yes agree also, I think for developers there would be an assumption that it wouldn't be processing stale connections. Again I guess with that I was thinking that if the stale connections were filtered out first, then you aren't processing all those connections only to find out they failed at the end. There could be all sorts of db lookups and stuff going on to process an event, it sort of feels like the lookup, in the beginning, could be a bit outweighed by the overhead of processing an event for stale connections only to determine at the end that the processing was in vain.

Possibly it's even a case by case thing based on how much resources it takes to process an event. Maybe it's best to only process valid connections, or maybe it's lightweight enough that stale connections aren't an issue.

Dare I say... could it do both based on dynamic configuration? :-)

@fridaystreet
Copy link
Author

fridaystreet commented Jul 3, 2020

yep that's exactly what I was thinking, use unregisterConnection via send. Sorry just couldn't remember the exact line off top of my head.

@fridaystreet
Copy link
Author

I'm a bit balls deep in trying to get v2 of our product out the door at the mo, but that is close to completion. I can't really do much at the mo, but I'd be happy to commit some time to your project after that to help out on anything you need. The project has really been quite pivotal in what we wanted to achieve.

@michalkvasnicak
Copy link
Owner

Dare I say... could it do both based on dynamic configuration? :-)

Yes I think this one can be based on configuration, it'll need some nice explanation in docs (maybe I should create some documentation website because README.md is not really a good place to point out everything).

I'm leaning to make the check before processing as a default, even if it breaks how it works now, so maybe if someone uses this on high volume of events there will be some throttling from AWS :(. But I'll make some research based on AWS limits.

I'm a bit balls deep in trying to get v2 of our product out the door at the mo, but that is close to completion. I can't really do much at the mo, but I'd be happy to commit some time to your project after that to help out on anything you need. The project has really been quite pivotal in what we wanted to achieve.

Would you mind sharing the product (or after finish)? :) Do you use this library or have own solution inspired by it?

@fridaystreet
Copy link
Author

fridaystreet commented Jul 3, 2020

sorry for all the comments, it's provoking ideas. We've been running the ttl and it's made a massive improvement.

I reckon with the ttl on subscriptions etc and an attempted to send on event processing failure (effectively invoking unregisterConnection on failure), coupled with a filter on the dynamo call to get subscriptions to discard connections past the ttl it would be bang on the money.

I mention the filter because as you probably know, dynamo can take up to 48 hours to delete items past their ttl

@fridaystreet
Copy link
Author

Would you mind sharing the product (or after finish)? :) Do you use this library or have own solution inspired by it?

yeah anything that we do in relation to the library I'm happy to share and contribute. Our product isn't opensource, so I can't share IP, but as I say, if it touches your library or is related I obviously consider those opensource contributions. There's also a fair few things we've had to consider in terms of how to handle authentication and how to structure the event names to get the best performance from dynamo, so happy to share those as well.

@fridaystreet
Copy link
Author

fridaystreet commented Jul 3, 2020

oh sorry I just realised you were meaning just tell you what the product is. Yeah of course. it's called ALEiGN you can find it at http://aleign.com

all the info there references the current v1. It was an express based application built for an mvp. It wasn't massively scalable or consistent. v2 is a complete rewrite, based on serverless, graphql, and a new event-driven architecture. Along with the frontend being completely rebuilt.

The sales/marketing guys are working through updating all the screenshots and reference guides now

@michalkvasnicak
Copy link
Owner

michalkvasnicak commented Jul 3, 2020

yeah anything that we do in relation to the library I'm happy to share and contribute. Our product isn't opensource, so I can't share IP, but as I say, if it touches your library or is related I obviously consider those opensource contributions. There's also a fair few things we've had to consider in terms of how to handle authentication and how to structure the event names to get the best performance from dynamo, so happy to share those as well.

I'd very appreciate that because I wasn't really using this library (I've changed a job so at the moment I'm working on something completely different).

oh sorry I just realised you were meaning just tell you what the product is. Yeah of course. it's called ALEiGN you can find it at http://aleign.com

all the info there references the current v1. It was an express based application built for an mvp. It wasn't massively scalable or consistent. v2 is a complete rewrite, based on serverless, graphql, and a new event-driven architecture. Along with the frontend being completely rebuilt.

The sales/marketing guys are working through updating all the screenshots and reference guides now

Thank you for sharing! It's good to know that someone uses or is inspired by this library. 👍

@IslamWahid
Copy link
Contributor

@michalkvasnicak @fridaystreet guys can you please explain it more

  • if the connections table is being cleaned correctly, so why its related subscriptions aren't being deleted in this case?

  • also, you're talking about 2 hours of AWS AG $disconnect event I experience something different I'm not sure if it's the same or not every 10 mins I see on the apollo client it tries to start a new connection and re-establish the subscriptions again I thought this is causing the increased number of subscriptions, can you please confirm if that's the same?

Screen Shot 2020-07-03 at 13 43 14

Screen Shot 2020-07-03 at 13 43 27

@fridaystreet
Copy link
Author

fridaystreet commented Jul 3, 2020

@IslamWahid yeah I see the same thing every 10 mins, I think you're right that the connections are re-established every 10 mins through api gateway, but that is invoking the $disconnect which is causing this library to clean up the connections and subscriptions automatically.

I think it might be the case that inactive connections are reset after 10 mins, but if there is activity the inactive counter is reset and it keeps alive. But 2 hours is the hard limit maybe? regardless of activity after 2 hours, connections are reset. This is just an assumption, you could create a ping that sends a message every minute and see if it stays alive past the 10 minutes?

I only ever see the number of connections currently active in the db and only subscriptions for those connections from what I can see. My main problem is that if other things fail like event processing like we've been discussing, there seems to be some gaps where $disconnect isn't invoked by AWS, meaning connections and subcriptions aren't cleaned up, hence adding the ttl.

@IslamWahid
Copy link
Contributor

@fridaystreet thanks for your reply and it would be great if you share your TTL quick solution.

@michalkvasnicak
Copy link
Owner

Yes there is 10 minutes idle connection timeout https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html.

@michalkvasnicak
Copy link
Owner

@IslamWahid I think that TTL index can be created on createdAt column in connection table.

@michalkvasnicak
Copy link
Owner

@fridaystreet do you apply TTL on subscriptions as well?

@fridaystreet
Copy link
Author

fridaystreet commented Jul 3, 2020

@IslamWahid because I haven't had time to submit a PR to @michalkvasnicak I've basically just extended his classes to add the extra functionality for now and created our own custom versions of the sub manager and conn manager

Remembering you'll also need to add the config to serverless file to create the tables with a ttl.

class CustomDynamoSubscriptionManager extends DynamoDBSubscriptionManager {

  constructor(props) {
    super(props);

    this.subscribe = async (
      names,
      connection,
      operation
    ) => {
      const subscriptionId = this.generateSubscriptionId(
        connection.id,
        operation.operationId
      );
  
      // we can only subscribe to one subscription in GQL document
      if (names.length !== 1) {
        throw new Error('Only one active operation per event name is allowed');
      }
      const [name] = names;
  
      await this.db
        .batchWrite({
          RequestItems: {
            [this.subscriptionsTableName]: [
              {
                PutRequest: {
                  Item: {
                    connection,
                    operation,
                    event: name,
                    subscriptionId,
                    operationId: operation.operationId,
                    ttl: moment().add(2, 'hours').unix()
                  },
                },
              },
            ],
            [this.subscriptionOperationsTableName]: [
              {
                PutRequest: {
                  Item: {
                    subscriptionId,
                    event: name,
                    ttl: moment().add(2, 'hours').unix()
                  },
                },
              },
            ],
          },
        })
        .promise();
    };    
  }

}


const subscriptionManager = new CustomDynamoSubscriptionManager({
  dynamoDbClient,
  subscriptionsTableName: config.graphql.subscriptions.subscriptionsTableName,
  subscriptionOperationsTableName: config.graphql.subscriptions.subscriptionOperationsTableName
});

class CustomDynamoDBConnectionManager extends DynamoDBConnectionManager {

  constructor(props) {
    super(props);

    this.registerConnection = async ({
      connectionId,
      endpoint
    }) => {
      const connection = {
        id: connectionId,
        data: { endpoint, context: {}, isInitialized: false },
      };
  
      await this.db
        .put({
          TableName: this.connectionsTable,
          Item: {
            createdAt: new Date().toString(),
            id: connection.id,
            data: connection.data,
            ttl: moment().add(2, 'hours').unix()
          },
        })
        .promise();
  
      return connection;
    };
  }
}

const connectionManager = new CustomDynamoDBConnectionManager({
  connectionsTable: config.graphql.subscriptions.connectionsTable,
  apiGatewayManager: process.env.IS_OFFLINE
    ? new ApiGatewayManagementApi({
        endpoint: 'http://localhost:4001',
      })
    : undefined,
  dynamoDbClient,
  subscriptions: subscriptionManager,
});

@fridaystreet
Copy link
Author

@IslamWahid I think that TTL index can be created on createdAt column in connection table.

would be nice, but the ttl has to be a unix timestamp

@IslamWahid
Copy link
Contributor

@fridaystreet @michalkvasnicak thanks for your help, guys!
I would really like to get that in the library so if you don't have much time I may prepare a PR for it.

@IslamWahid
Copy link
Contributor

also, it would be great if we could prepare some guidelines and best practices like what @fridaystreet mentioned above to be documented.

@michalkvasnicak
Copy link
Owner

michalkvasnicak commented Jul 3, 2020

@IslamWahid

I would really like to get that in the library so if you don't have much time I may prepare a PR for it.

I'd really appreciate that. Could you please make it configurable in DynamoDBConnectionManager and DynamoDBSubscriptionManager?

Now I'm really thinking about some simple website for documentation. Something like react-hook-form has, because keeping this in README is not really maintainable and developer friendly.

@fridaystreet
Copy link
Author

@michalkvasnicak yeah react-hook-form is nice site hey, just found that the other week

@IslamWahid
Copy link
Contributor

@michalkvasnicak @fridaystreet may you please review the PR, so we can get it published ASAP

@michalkvasnicak
Copy link
Owner

Marking this as closed by #90 for now.

@IslamWahid great work, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants