{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text-to-SQL\n", "\n", "In this example, we adapt the [code](https://github.com/ShayanTalaei/CHESS/tree/fc6f0b7ef34ccb573d764be8fba52b4afdd20ff5) from the paper [CHESS: Contextual Harnessing for Efficient SQL Synthesis](https://arxiv.org/abs/2405.16755). \n", "\n", "The workflow is as follows:\n", "\n", "\n", "![text-to-sql](../imgs/text_to_sql.png)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To try this example, you should create and activate a new python virtual environment and then run the following commands:\n", "\n", "```console\n", " pip install -r requirements.txt\n", " pip install pysqlite3-binary\n", " pip install -U cognify-ai\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, run the pre-processing script in `./run/run_preprocess.sh` to create the databases. This should generate a `data` folder. Ensure your `.env` file contains the following keys:\n", "- `OPENAI_API_KEY`\n", "- `DB_ROOT_PATH`, which should be set to the path of `data/dev`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data loader\n", "\n", "The original repository expects command line arguments passed into its data-loader. We can preserve the original parser function and just set the arguments in the script itself. This dataset does not contain a ground truth, so we pass in an empty dictionary `{}` as the second value in the tuple.\n", "\n", "Then, we use the data files generated by the pre-processing step in the `data` folder." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "import json\n", "import cognify \n", "from src.utils import parse_arguments\n", "\n", "import cognify\n", "import numpy as np\n", "\n", "import dotenv\n", "dotenv.load_dotenv()\n", "\n", "@cognify.register_data_loader\n", "def load_data():\n", " args = parse_arguments() # \n", "\n", " def read_from_file(data_path, args):\n", " with open(data_path, \"r\") as file:\n", " dataset = json.load(file)\n", "\n", " inputs = []\n", " for data in dataset:\n", " inputs.append(\n", " {\n", " 'args': args,\n", " 'dataset': [data],\n", " }\n", " )\n", " eval_data = [(input, {}) for input in inputs] # no ground truth in this case, set to empty dictionary\n", " return eval_data\n", "\n", " all_train = read_from_file('data/dev/other_sub_sampled.json', args)\n", " test_set = read_from_file('data/dev/sub_sampled_bird_dev_set.json', args)\n", " \n", " # shuffle the data\n", " all_train = np.random.permutation(all_train).tolist()\n", " return all_train[:100], all_train[100:], test_set[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluator\n", "\n", "In this case, the SQL code is executed *during* the workflow in a sandbox environment. Hence, our evaluator does not need to re-execute the code. Instead, it can just return whether the result was correct as a numerical value." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "@cognify.register_evaluator\n", "def eval_text_to_sql(counts):\n", " \"\"\"\n", " Evaluate the statistics of the run.\n", " \"\"\"\n", " correct = any(vs['correct'] == 1 for vs in counts.values())\n", " return correct" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configuring the Optimizer\n", "\n", "We've created a search option for text-to-sql that searches over the following:\n", "- Chain-of-Thought reasoning\n", "- Planning before acting\n", "- 2 few-shot examples\n", "- An ensemble of 3 agents for a task\n", "\n", "Let's use these search settings to conduct the optimization." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "from cognify.hub.search import text_to_sql\n", "search_settings = text_to_sql.create_search()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start the optimization\n", "\n", "We've provided the 3 code blocks above in `configy.py`. With the Cognify command line interface (CLI), you can start the optimization like this:\n", "\n", "```console\n", "$ cognify optimize workflow.py\n", "```\n", "\n", "Alternatively, you can run the following cell (*warning*: this workflow may run for quite some time):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train, val, dev = load_data()\n", "\n", "opt_cost, pareto_frontier, opt_logs = cognify.optimize(\n", " script_path=\"workflow.py\",\n", " control_param=search_settings,\n", " train_set=train,\n", " val_set=val,\n", " eval_fn=eval_text_to_sql,\n", " force=True, # This will overwrite the existing results\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optimization Results\n", "\n", "Cognfiy will output each optimized workflow to a `.cog` file. For this workflow, the optimizer applies the following optimizations:\n", "- use GPT-4o-mini for keyword extraction, table selection and column selection\n", "- use Llama 3.1-8B along with few-shot examples for candidate generation\n", "- ensemble revision\n", " - use chain-of-thought, few-shot examples, and GPT-4o-mini for two of the ensembled models\n", " - use GPT-4o-mini for the third ensembled model\n", " - use Llama 3.1-8B along with chain-of-thought reasoning for the aggregator\n", "\n", "The final optimized workflow is depicted below, with optimizations highlighted in green.\n", "\n", "![t2sql-opt](../imgs/text_to_sql_optimized.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For candidate generation, the few-shot examples used resemble the following:\n", "\n", "> **Demonstration 1:** \n", "> Input (question): \"Which of these players performs the best in crossing actions, Alexis, Ariel Borysiuk or Arouna Kone?\"\n", "> \n", "> Input (schema): \"`CREATE TABLE Player\\n(\\n\\t'id' INTEGER PRIMARY KEY AUTOINCREMENT,\\n\\t'player_api_id' INTEGER UNIQUE...`\"\n", "> \n", "> Input (hint): \"player who perform best in crossing actions refers to `MAX(crossing);`\"\n", ">\n", "> Reasoning: \"To find the player who performs the best in crossing actions, we need to identify the player with the highest `crossing` score in the `Player_Attributes` table. Let's start by writing a query that selects the `player_name` and `crossing` score...\"\n", ">\n", "> Output (query): `SELECT p.player_name, pa.crossing FROM Player p JOIN Player_Attributes pa ON p.player_api_id = pa.player_api_id WHERE p.player_name IN ('Alexis', 'Ariel Borysiuk', 'Arouna Kone') ORDER BY pa.crossing DESC LIMIT 1`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For revision, there are a few more input fields needed:\n", "\n", "> **Demonstration 1** \n", "> Input (question): \"List out the atom id that belongs to the TR346 molecule and how many bond type can be created by this molecule?\" \n", "> \n", "> Input (SQL): \"`SELECT T1.atom_id, COUNT(DISTINCT T2.bond_type) AS bond_type_count FROM atom T1 LEFT JOIN bond T2 ON T1.molecule_id = T2.molecule_id WHERE T1.molecule_id = 'TR346' GROUP BY T1.atom_id;`\" \n", "> \n", "> Input (schema): \"`CREATE TABLE atom\\n(\\n\\t'atom_id' TEXT NOT NULL, -- 'atom id' description: the unique id of atoms\\n\\t'molecule_id' TEXT DEFAULT NULL, -- 'molecule id' description: identifying the molecule`\"\n", "> \n", "> Input (query result): \"[('TR346_1', 1), ('TR346_2', 1), ('TR346_3', 1), ('TR346_4', 1), ('TR346_5', 1), ('TR346_6', 1), ('TR346_7', 1), ('TR346_8', 1)]\" \n", "> \n", "> Reasoning: \"We need to list the `atom_id` that belongs to the molecule with `molecule_id` 'TR346'. We also need to count how many distinct bond types can be created\" \n", "> \n", "> Output (query): \"`SELECT T1.atom_id, (SELECT COUNT(DISTINCT T2.bond_type) FROM bond T2 WHERE T2.molecule_id = 'TR346') AS bond_type_count FROM atom T1 WHERE T1.molecule_id = 'TR346';`\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check out more details on [how to interpret optimization results](https://cognify-ai.readthedocs.io/en/latest/user_guide/tutorials/interpret.html#detailed-transformation-trace)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "cog-source", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 2 }