
Adding Metadata To Embeddings #75

Open

david-saeger opened this issue Sep 6, 2024 · 7 comments

Comments

@david-saeger

I'd like to add S3 metadata to my embeddings during the embedding creation process, and I realized I wasn't sure of the best place to do that. I wasn't sure whether forking the project and adding to the file processing would be ideal, or if there was something I could do by defining a ragLambdaLayer as described here:

# ragLayerPath: /path/to/rag_layer.zip
In truth, I think I am just a little uncertain what these Lambda layers do or how to use them. Do they replace the current RAG API or add to it?

@petermuller
Contributor

Hi David! For the layer zip files, those are just a way to pre-package the layer as we have it defined here and then move the source into a region of your choice, ideally for network-isolated environments where we can't pull dependencies on the fly.

As for what you'd like to do, it sounds like we may need to update the RAG API itself (and we welcome pull requests against the develop branch 🎉 )

If you are willing to hack on LISA to add this, my first guess would be around this area: https://github.com/awslabs/LISA/blob/develop/lambda/repository/lambda_functions.py

Specifically this function is what we call to generate the initial embeddings: https://github.com/awslabs/LISA/blob/develop/lambda/repository/lambda_functions.py#L125-L151

And then similaritySearch does the embedding call for the prompt text: https://github.com/awslabs/LISA/blob/develop/lambda/repository/lambda_functions.py#L80-L107

We're using LangChain under the hood, and we've created a form of langchain-compatible openai binding for embeddings specifically over here: https://github.com/awslabs/LISA/blob/develop/lisa-sdk/lisapy/langchain.py#L102-L153 (ignore other things in the file, there are some unused clients that we need to clean up 😬 )
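For orientation, a minimal LangChain-compatible embeddings binding against an OpenAI-style /embeddings endpoint tends to look something like the sketch below. The class name, URL layout, and auth header here are assumptions for illustration, not LISA's actual implementation (see the linked langchain.py for that):

from typing import List

import requests
from langchain_core.embeddings import Embeddings


class OpenAICompatibleEmbeddings(Embeddings):
    """Illustrative binding for an OpenAI-style /embeddings endpoint.

    Hypothetical sketch; LISA's real binding lives in lisa-sdk/lisapy/langchain.py.
    """

    def __init__(self, base_url: str, api_key: str, model: str) -> None:
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key
        self.model = model

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # POST the batch of texts and return one vector per input.
        resp = requests.post(
            f"{self.base_url}/embeddings",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"model": self.model, "input": texts},
            timeout=60,
        )
        resp.raise_for_status()
        return [item["embedding"] for item in resp.json()["data"]]

    def embed_query(self, text: str) -> List[float]:
        # A query is just a single-element batch.
        return self.embed_documents([text])[0]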

So if there's a solution you had in mind or could point us in a direction to help with, I think these would be the best starting points. I'm not sure if this answers your question or helps guide in a direction, so please let me know!

@david-saeger
Author

david-saeger commented Sep 9, 2024 via email

@petermuller
Contributor

I could see the possibility of adding some fields to the related APIs to accept another map in the requests, so that those contain the additional metadata. The metadata is attached at the Document level, so we could possibly make it part of the API per document. Alternatively, if we assume a list of files already in S3, there's the possibility of editing the processing function to add more metadata than just the document location, over here: https://github.com/awslabs/LISA/blob/develop/lambda/utilities/file_processing.py#L146

So for your suggestion, would the LISA prefix be related to the metadata already on the S3 object? As in something along the lines of:

  1. Upload file to S3 with Object metadata attached
  2. Use LISA ingestion to consume / embed files
  3. Per file, check if there's S3 metadata (optionally: and check if the metadata is prefixed with a LISA-known prefix)
  4. Add metadata to metadata dictionary that is processed along with the Document object
  5. Metadata is now returned with the document text for requested vectors

Is this the workflow you're thinking of?
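For illustration, steps 3 and 4 could look roughly like the sketch below; the lisa- metadata prefix and both helper names are hypothetical, not existing LISA conventions:

import boto3
from langchain_core.documents import Document

s3 = boto3.client("s3")
LISA_METADATA_PREFIX = "lisa-"  # hypothetical prefix, purely for illustration


def collect_lisa_metadata(bucket: str, key: str) -> dict:
    # boto3 exposes user-defined x-amz-meta-* headers, lowercased, under "Metadata".
    head = s3.head_object(Bucket=bucket, Key=key)
    return {
        name[len(LISA_METADATA_PREFIX):]: value
        for name, value in head.get("Metadata", {}).items()
        if name.startswith(LISA_METADATA_PREFIX)
    }


def build_document(bucket: str, key: str, extracted_text: str) -> Document:
    # Step 4: merge the S3 object metadata into the Document's metadata dict,
    # alongside the source location that is already recorded today.
    metadata = {"source": f"s3://{bucket}/{key}"}
    metadata.update(collect_lisa_metadata(bucket, key))
    return Document(page_content=extracted_text, metadata=metadata)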

@david-saeger
Author

david-saeger commented Sep 10, 2024 via email

@david-saeger
Author

david-saeger commented Sep 12, 2024

Had to step away for the past couple of days, but I'm kind of circling back to where I was originally: looking through this and trying to figure out a path to get the RAG functionality I need. I understand that the layer zip files are included in the config optionally for network isolation. Could they not also serve to replace the RAG functionality, if that is my end goal?

Something I am thinking through: what I need out of RAG is pretty boutique, and I am doubtful it will be useful to other LISA users (it would likely include custom embedding generation logic specific to the shape of particular documents). So I figure any contribution I make here would end up looking like: place custom functionality somewhere (likely in the form of a Lambda) and use it to replace some part or all of the RAG API. I am questioning whether this already exists in plain sight or if I am missing something.

@petermuller
Contributor

No worries at all!

I've been thinking on this one for a little bit too, and I think the main issue in our way is that our implementation of the RAG feature is fairly limited from the UI. Direct invocation via a curl command or similar isn't really documented, but as I'm staring at it, I can see that it is possible to upload a custom list of keys to the RAG store, so long as they exist in the LISA-provided document bucket (which is also something we could make user-provided). And with that, we could then provide additional metadata as part of the ingest_documents request. There are several routes to go from here, but possible ones:

  1. any metadata given in a call is added to all docs that are processed (keys: [docname1, docname2], metadata: something_applied_to_all)
  2. metadata is applied to objects whose keys match, in a key:val relationship (keys: [docname1, docname2], metadata: {docname1: metadata1, docname2: metadata2})
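To make those two shapes concrete, the request bodies might look like this (field values made up for illustration):

# Option 1: one metadata dict applied to every key in the request
request_shared = {
    "keys": ["docname1", "docname2"],
    "metadata": {"department": "policy", "classification": "internal"},
}

# Option 2: metadata keyed per document, applied only to matching keys
request_per_key = {
    "keys": ["docname1", "docname2"],
    "metadata": {
        "docname1": {"title": "Travel Policy"},
        "docname2": {"title": "Expense Policy"},
    },
}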

And to answer your question: yes, the RAG layer could be used that way, but then it's a lot harder for us to support it or improve on our existing functionality. Even based on all of this, we would still welcome a pull request with your ideas in it, and we can work to find the best path forward. If the goal for now is just to make a utility outside of the Chat UI to ingest documents with metadata, I think backwards-compatible changes to the repository API would be fine (as long as it doesn't break the current functionality, I'm good 👍)

Some points of interest for that:

  • docs = process_record(s3_keys=body["keys"], chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    • this line takes a list of S3 keys (associated with the LISA-generated RAG documents bucket), chunks them, and adds hardcoded metadata of the S3 location for the text
  • docs = [Document(page_content=extracted_text, metadata=_get_metadata(s3_uri=s3_uri))]
    • this line is what associates the metadata with the langchain Document, and the metadata is just a hardcoded dict right here:

      def _get_metadata(s3_uri: str) -> dict:
          return {"source": s3_uri}

  • I think what we could do is accept a "metadata" parameter in the APIGW API and then pass that into the file_processing file, where we append it to the metadata dictionary that's already there (see the sketch below)
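A minimal sketch of that last bullet, assuming a hypothetical extra_metadata parameter threaded down from a new "metadata" field in the APIGW request body (both names are made up):

from typing import Optional


def _get_metadata(s3_uri: str, extra_metadata: Optional[dict] = None) -> dict:
    # Existing hardcoded source metadata, plus any caller-supplied fields.
    metadata = {"source": s3_uri}
    if extra_metadata:
        # Keep "source" authoritative by filtering it out of caller input.
        metadata.update({k: v for k, v in extra_metadata.items() if k != "source"})
    return metadata


# In the repository lambda handler, this could be wired up roughly as:
#   extra = body.get("metadata")  # hypothetical new request field
#   docs = process_record(s3_keys=body["keys"], chunk_size=chunk_size,
#                         chunk_overlap=chunk_overlap, extra_metadata=extra)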

just some ideas and totally not prescriptive by any means!

@david-saeger
Author

Great ideas, Peter! I think these are great ways to get to the goal I expressed of adding metadata to vector embeddings. I think I may have convoluted the thread with a second, related goal, which I am having a harder time thinking through in terms of how to add it in a way that could be useful to the broader LISA community; that is what motivated this comment: #75 (comment)

I'll leave it here in case you have thoughts, but I recognize it should be in another ticket, and I think I have the information I was seeking about metadata creation.

Basically, I would like to be able to use boutique embedding creation logic so that I can parse a document and include some a priori knowledge about its shape in the embedding creation process, so that I can, for instance, inject a title and subheading into each chunk generated from a section in a policy document.

Looking through the codebase, I believe that would require replacing the routine here:

def _generate_chunks(docs: List[Document], chunk_size: Optional[int], chunk_overlap: Optional[int]) -> List[Document]:
on a one-off basis for a particular document, which is hard for me to think through how to implement in a way that is useful to anybody else. As I am writing this, it strikes me that the answer may just be to generate boutique embeddings locally and send them directly to pgvector. Do you see any issues with that?
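As a purely illustrative sketch of that kind of replacement, assuming a prior parsing step has already stored hypothetical "title" and "subheading" keys in each Document's metadata (this is not LISA's actual chunking code):

from typing import List, Optional

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter


def _generate_chunks_with_headings(
    docs: List[Document],
    chunk_size: Optional[int],
    chunk_overlap: Optional[int],
) -> List[Document]:
    # Prepend each chunk with the section title/subheading recorded in the
    # parent Document's metadata, so the heading is embedded with the text.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size or 512,
        chunk_overlap=chunk_overlap or 51,
    )
    chunks: List[Document] = []
    for doc in docs:
        heading = " / ".join(
            filter(None, (doc.metadata.get("title"), doc.metadata.get("subheading")))
        )
        for chunk in splitter.split_documents([doc]):
            if heading:
                chunk.page_content = f"{heading}\n\n{chunk.page_content}"
            chunks.append(chunk)
    return chunks

Injecting the heading into page_content means it influences the vectors themselves, which is the effect described above; keeping it only in metadata would leave the embeddings unchanged.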
