queryMessages field added & query generation optimization #653
base: main
Conversation
@pamelafox
```python
# STEP 1: Generate an optimized keyword search query based on the chat history and the last question
messages = self.get_messages_from_history(
    self.query_prompt_template,
    self.chatgpt_model,
    history,
    query_history_input,
```
Our prompt says "Below is a history of the conversation so far, and a new question asked by the user that needs to be answered by searching in a knowledge base about employee healthcare plans and the employee handbook."
I'm surprised you got good results by passing in the query history since it would seem to be in disagreement with the prompt. You didn't need to alter the prompt at all?
Hmm, yeah, our prompt is different; it's customizable for every use case. But is it in disagreement with this prompt? I don't think so, since the structure contains history, but only in the form of query messages and bot-generated queries. This is basically a list of few-shot examples.
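As a rough illustration of what such a few-shot query-message list might look like (the questions and generated queries below are invented, and the role/content shape follows the standard chat-message format, not this repo's actual prompt data):

```python
# Hypothetical few-shot list pairing past user questions with the short
# keyword queries the bot generated for them; contents are illustrative only.
query_messages = [
    {"role": "user", "content": "What is included in my Northwind Health Plus plan that is not in standard?"},
    {"role": "assistant", "content": "Northwind Health Plus plan coverage comparison"},
    {"role": "user", "content": "Does my plan cover annual eye exams?"},
    {"role": "assistant", "content": "annual eye exam coverage"},
]
```

The query-generation call would then assemble: system prompt + these few-shot pairs + the new user question.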
Could you give examples of query generations before and after this change? I'm looking into adding evaluation metrics to this repository so we can measure changes like this, but it's difficult to evaluate without good test data.
Most test queries in our benchmarks are more stable after this change was implemented, but our use cases are a bit different: different teams can alter queries for their needs. We perform single-query and conversation tests with OpenAI rating / similarity evaluation. I'm not sure which examples you would like, but I can't paste mine due to internal company policies (these are internal data sets).
Okay, thanks for the additional information! I think your code change looks good, but I want to evaluate it using a new evaluation pipeline I'm working on in another branch. I'll add multi-turn evaluation to it soon, which will enable me to test out this change. Sorry for the delay, but this is a great opportunity to try that out. Also, if you can share anything about how you run evaluations, I would love to hear more, as we're trying to figure out good developer flows for evaluation locally and in CI/CD.
Sure, I will wait until you try to run it; then you can ping me and I can rebase/merge the latest changes into my branch so there are no conflicts.

About evaluation, there are many possibilities, but GPT is pretty good at such tasks. You can perform standard evaluation by, e.g., calculating embeddings of the ground truth and the bot answer and then comparing them with a basic similarity metric (cosine, Euclidean), AND you can leverage a GPT model and ask it to compare ground truth vs. bot answer on a scale of your choosing (just add some few-shot examples so it knows what to do). Pretty sure you have similar ideas in mind already, so you can use a few scores and blend them into a 'final one', or just use a single metric.

For us, stability of the solution is the most important thing. We all know that when you screw up even one single part of the conversation, it is still kept in history and may break things later on, so in our tests we focus mostly on that. Also, I want to mention that our tests are nowhere near perfect or complete. We are still evolving them and adjusting them to our needs, so I am also waiting to see your approach :)
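A minimal sketch of the embedding-similarity idea described above (assuming the embeddings have already been computed by some embeddings API; the short vectors here are placeholders, not real model output):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# In practice these would come from an embeddings model; the values below
# are placeholders chosen to represent "nearly identical" answers.
ground_truth_vec = [0.1, 0.3, 0.5]
bot_answer_vec = [0.1, 0.29, 0.52]

score = cosine_similarity(ground_truth_vec, bot_answer_vec)
```

A score near 1.0 would indicate the bot answer is semantically close to the ground truth; a GPT-graded score could be blended with this one for the 'final' metric.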
This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed.
Closes #641
Purpose
- queryMessages field added to generate optimized search query
- history will be used as before

Does this introduce a breaking change?
Pull Request Type
What kind of change does this Pull Request introduce?
How to Test
What to Check
Verify that the following are valid:

- queryMessages field should be present in request and response JSON body

Other Information
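The request/response check from "What to Check" could be sketched like this (the payload shape below is assumed for illustration; the app's actual schema may differ):

```python
import json

def has_query_messages(body: str) -> bool:
    # Return True if the JSON body carries the new queryMessages field.
    data = json.loads(body)
    return "queryMessages" in data

# Illustrative response body; field names other than queryMessages are invented.
sample_response = json.dumps({
    "answer": "...",
    "queryMessages": [{"role": "user", "content": "annual eye exam coverage"}],
})
```

The same check would be run against the request body sent by the frontend.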