{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text-to-SQL\n",
    "\n",
    "In this example, we adapt the [code](https://github.com/ShayanTalaei/CHESS/tree/fc6f0b7ef34ccb573d764be8fba52b4afdd20ff5) from the paper [CHESS: Contextual Harnessing for Efficient SQL Synthesis](https://arxiv.org/abs/2405.16755).  \n",
    "\n",
    "The workflow is as follows:\n",
    "\n",
    "\n",
    "![text-to-sql](../imgs/text_to_sql.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To try this example, you should create and activate a new python virtual environment and then run the following commands:\n",
    "\n",
    "```console\n",
    "    pip install -r requirements.txt\n",
    "    pip install pysqlite3-binary\n",
    "    pip install -U cognify-ai\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, run the pre-processing script in `./run/run_preprocess.sh` to create the databases. This should generate a `data` folder. Ensure your `.env` file contains the following keys:\n",
    "- `OPENAI_API_KEY`\n",
    "- `DB_ROOT_PATH`, which should be set to the path of `data/dev`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data loader\n",
    "\n",
    "The original repository expects command line arguments passed into its data-loader. We can preserve the original parser function and just set the arguments in the script itself. This dataset does not contain a ground truth, so we pass in an empty dictionary `{}` as the second value in the tuple.\n",
    "\n",
    "Then, we use the data files generated by the pre-processing step in the `data` folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import cognify \n",
    "from src.utils import parse_arguments\n",
    "\n",
    "import cognify\n",
    "import numpy as np\n",
    "\n",
    "import dotenv\n",
    "dotenv.load_dotenv()\n",
    "\n",
    "@cognify.register_data_loader\n",
    "def load_data():\n",
    "    args = parse_arguments() # \n",
    "\n",
    "    def read_from_file(data_path, args):\n",
    "        with open(data_path, \"r\") as file:\n",
    "            dataset = json.load(file)\n",
    "\n",
    "        inputs = []\n",
    "        for data in dataset:\n",
    "            inputs.append(\n",
    "                {\n",
    "                    'args': args,\n",
    "                    'dataset': [data],\n",
    "                }\n",
    "            )\n",
    "        eval_data = [(input, {}) for input in inputs] # no ground truth in this case, set to empty dictionary\n",
    "        return eval_data\n",
    "\n",
    "    all_train = read_from_file('data/dev/other_sub_sampled.json', args)\n",
    "    test_set = read_from_file('data/dev/sub_sampled_bird_dev_set.json', args)\n",
    "    \n",
    "    # shuffle the data\n",
    "    all_train = np.random.permutation(all_train).tolist()\n",
    "    return all_train[:100], all_train[100:], test_set[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluator\n",
    "\n",
    "In this case, the SQL code is executed *during* the workflow in a sandbox environment. Hence, our evaluator does not need to re-execute the code. Instead, it can just return whether the result was correct as a numerical value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "@cognify.register_evaluator\n",
    "def eval_text_to_sql(counts):\n",
    "    \"\"\"\n",
    "    Evaluate the statistics of the run.\n",
    "    \"\"\"\n",
    "    correct = any(vs['correct'] == 1 for vs in counts.values())\n",
    "    return correct"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configuring the Optimizer\n",
    "\n",
    "We've created a search option for text-to-sql that searches over the following:\n",
    "- Chain-of-Thought reasoning\n",
    "- Planning before acting\n",
    "- 2 few-shot examples\n",
    "- An ensemble of 3 agents for a task\n",
    "\n",
    "Let's use these search settings to conduct the optimization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cognify.hub.search import text_to_sql\n",
    "search_settings = text_to_sql.create_search()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Start the optimization\n",
    "\n",
    "We've provided the 3 code blocks above in `configy.py`. With the Cognify command line interface (CLI), you can start the optimization like this:\n",
    "\n",
    "```console\n",
    "$ cognify optimize workflow.py\n",
    "```\n",
    "\n",
    "Alternatively, you can run the following cell (*warning*: this workflow may run for quite some time):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train, val, dev = load_data()\n",
    "\n",
    "opt_cost, pareto_frontier, opt_logs = cognify.optimize(\n",
    "    script_path=\"workflow.py\",\n",
    "    control_param=search_settings,\n",
    "    train_set=train,\n",
    "    val_set=val,\n",
    "    eval_fn=eval_text_to_sql,\n",
    "    force=True, # This will overwrite the existing results\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Optimization Results\n",
    "\n",
    "Cognfiy will output each optimized workflow to a `.cog` file. For this workflow, the optimizer applies the following optimizations:\n",
    "- use GPT-4o-mini for keyword extraction, table selection and column selection\n",
    "- use Llama 3.1-8B along with few-shot examples for candidate generation\n",
    "- ensemble revision\n",
    "  - use chain-of-thought, few-shot examples, and GPT-4o-mini for two of the ensembled models\n",
    "  - use GPT-4o-mini for the third ensembled model\n",
    "  - use Llama 3.1-8B along with chain-of-thought reasoning for the aggregator\n",
    "\n",
    "The final optimized workflow is depicted below, with optimizations highlighted in green.\n",
    "\n",
    "![t2sql-opt](../imgs/text_to_sql_optimized.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For candidate generation, the few-shot examples used resemble the following:\n",
    "\n",
    "> **Demonstration 1:**  \n",
    "> Input (question): \"Which of these players performs the best in crossing actions, Alexis, Ariel Borysiuk or Arouna Kone?\"\n",
    "> \n",
    "> Input (schema): \"`CREATE TABLE Player\\n(\\n\\t'id' INTEGER PRIMARY KEY AUTOINCREMENT,\\n\\t'player_api_id' INTEGER UNIQUE...`\"\n",
    "> \n",
    "> Input (hint): \"player who perform best in crossing actions refers to `MAX(crossing);`\"\n",
    ">\n",
    "> Reasoning: \"To find the player who performs the best in crossing actions, we need to identify the player with the highest `crossing` score in the `Player_Attributes` table. Let's start by writing a query that selects the `player_name` and `crossing` score...\"\n",
    ">\n",
    "> Output (query): `SELECT p.player_name, pa.crossing FROM Player p JOIN Player_Attributes pa ON p.player_api_id = pa.player_api_id WHERE p.player_name IN ('Alexis', 'Ariel Borysiuk', 'Arouna Kone') ORDER BY pa.crossing DESC LIMIT 1`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For revision, there are a few more input fields needed:\n",
    "\n",
    "> **Demonstration 1**  \n",
    "> Input (question): \"List out the atom id that belongs to the TR346 molecule and how many bond type can be created by this molecule?\"  \n",
    "> \n",
    "> Input (SQL): \"`SELECT T1.atom_id, COUNT(DISTINCT T2.bond_type) AS bond_type_count FROM atom T1 LEFT JOIN bond T2 ON T1.molecule_id = T2.molecule_id WHERE T1.molecule_id = 'TR346' GROUP BY T1.atom_id;`\"  \n",
    "> \n",
    "> Input (schema): \"`CREATE TABLE atom\\n(\\n\\t'atom_id' TEXT NOT NULL, -- 'atom id' description: the unique id of atoms\\n\\t'molecule_id' TEXT DEFAULT NULL, -- 'molecule id' description: identifying the molecule`\"\n",
    ">   \n",
    "> Input (query result): \"[('TR346_1', 1), ('TR346_2', 1), ('TR346_3', 1), ('TR346_4', 1), ('TR346_5', 1), ('TR346_6', 1), ('TR346_7', 1), ('TR346_8', 1)]\"  \n",
    "> \n",
    "> Reasoning: \"We need to list the `atom_id` that belongs to the molecule with `molecule_id` 'TR346'. We also need to count how many distinct bond types can be created\" \n",
    ">  \n",
    "> Output (query): \"`SELECT T1.atom_id, (SELECT COUNT(DISTINCT T2.bond_type) FROM bond T2 WHERE T2.molecule_id = 'TR346') AS bond_type_count FROM atom T1 WHERE T1.molecule_id = 'TR346';`\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check out more details on [how to interpret optimization results](https://cognify-ai.readthedocs.io/en/latest/user_guide/tutorials/interpret.html#detailed-transformation-trace)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cog-source",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}