LangChain Self Query With Dates

Self-querying by date with LangChain doesn’t work well out of the box. The default schema that LangChain uses to parse natural language into its internal query representation mishandles dates: it represents a date as a dict, but a vector store can only filter on integers or strings.
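Because filters are limited to integers and strings, one workable representation is to store each date as an integer Unix timestamp in the document metadata. The following is a minimal sketch of that conversion, not code from LangChain: `to_unix_seconds` is a hypothetical helper, the metadata dict is illustrative, and naive datetimes are assumed to already be in UTC.

```python
from datetime import datetime, timezone


def to_unix_seconds(dt: datetime) -> int:
    """Convert a datetime to whole seconds since the Unix epoch."""
    if dt.tzinfo is None:
        # Assumption: a naive datetime is already expressed in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


# Hypothetical metadata for one document; the date is stored as an int
# so that a vector store can compare it with gte/lte style filters.
metadata = {
    "title": "Team meeting notes",
    "created_date": to_unix_seconds(datetime(2024, 1, 15, 9, 30)),
}
```

Storing seconds rather than a formatted date string keeps range comparisons numeric, which is what the filter operators expect.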

The fix is to create your own “structured request” schema, one that tells the model to express dates as integer timestamps, for use with a vector DB (like ChromaDB).

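from langchain.chains.query_constructor.base import (
    StructuredQueryOutputParser,
    get_query_constructor_prompt,
)
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain_core.prompts import PromptTemplate

# Metadata attributes the LLM is allowed to filter on.
TASK_METADATA = [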
    AttributeInfo(
        name="title",
        description="The title of the task, meeting, or heading",
        type="string",
    ),
    AttributeInfo(
        name="created_date",
        description="The timestamp the task or meeting was created formatted as an integer",
        type="int",
    ),
]

TASKS_SCHEMA = """\
<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{{{{
    "query": string \\ text string to compare to document contents
    "filter": string \\ logical condition statement for filtering documents
}}}}
```

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` ({allowed_comparators}): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` ({allowed_operators}): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use timestamps in seconds as an integer when handling date data typed values. If you need to convert them, translate the date into a unix epoch timestamp with UTC timezone and double check that it is the correct year. NEVER use the `eq` operator for timestamps. If the requested date is a single day, use the `gte` operator for the requested date AND a `lte` operator for the requested date plus one day. Be very careful with dates.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.\
"""
TASKS_SCHEMA_PROMPT = PromptTemplate.from_template(TASKS_SCHEMA)

prompt = get_query_constructor_prompt(
    document_contents="My tasks",
    attribute_info=TASK_METADATA,
    schema_prompt=TASKS_SCHEMA_PROMPT,
)
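output_parser = StructuredQueryOutputParser.from_components()
# TASKS_LLM is assumed to be a LangChain chat model defined elsewhere;
# its definition is not shown in this post.
query_constructor = prompt | TASKS_LLM | output_parser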

Unfortunately, this only works in a very limited number of cases, because OpenAI’s models don’t handle dates reliably, even with the explicit timestamp instructions in the schema.
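Since the model’s date handling is the weak point, it helps to know what a correct filter looks like for a given day so the generated one can be checked against it. The sketch below mirrors the single-day rule from the schema (`gte` on the requested date, `lte` on the requested date plus one day). `day_bounds` is a hypothetical helper, and the filter string is only an illustration of the expected output in LangChain’s filter syntax.

```python
from datetime import date, datetime, timedelta, timezone


def day_bounds(day: date) -> tuple[int, int]:
    """Return (start, end) Unix timestamps in UTC for a single calendar day.

    `end` is midnight at the start of the following day, matching the
    "requested date plus one day" rule in the schema prompt.
    """
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return int(start.timestamp()), int(end.timestamp())


start_ts, end_ts = day_bounds(date(2024, 1, 15))

# The filter the model should ideally produce for "tasks created on 2024-01-15".
expected_filter = (
    f'and(gte("created_date", {start_ts}), lte("created_date", {end_ts}))'
)
```

Comparing the model’s filter against `expected_filter` for a handful of test dates is a cheap way to see how often the timestamps come out wrong, including the wrong-year case the prompt warns about.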

  • Sorting Vector Store Results

    Many vector databases can find the top k most similar results to a query but cannot sort by other document metadata. This is a pretty severe limitation for building LLM applications, especially ones where time is a dimension (meetings, calendars, task lists, etc.). For example, you can retrieve the 10 results most similar to the phrase “team meeting notes”, but you cannot retrieve the team meeting notes from the last month.
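One partial workaround is to filter and sort on the client after retrieval. The following is a minimal sketch under stated assumptions: each result is a plain dict carrying an integer `created_date`, and `newest_first` is a hypothetical helper, not part of any vector store API.

```python
def newest_first(results: list[dict], since_ts: int) -> list[dict]:
    """Keep results created at or after since_ts, ordered newest first."""
    recent = [r for r in results if r["created_date"] >= since_ts]
    return sorted(recent, key=lambda r: r["created_date"], reverse=True)


# Hypothetical similarity-search results with integer Unix timestamps.
results = [
    {"title": "Team meeting notes (Jan)", "created_date": 1705276800},
    {"title": "Team meeting notes (Nov)", "created_date": 1699833600},
    {"title": "Team meeting notes (Feb)", "created_date": 1707955200},
]

# Keep only results from 2024-01-01 00:00 UTC onward.
recent = newest_first(results, since_ts=1704067200)
```

This only reorders the top k results that were already retrieved, so anything recent that fell outside the top k is still missed. That is why the limitation matters.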