Math Problem Solving#

In this example, we build a simple workflow for solving MATH problems.

Example Problem and Solution:

P: The Smith family has 4 sons and 3 daughters. In how many ways can they be seated in a row of 7 chairs such that at least 2 boys are next to each other?

S: This problem is a perfect candidate for complementary counting. It will be fairly difficult to try to count this directly, since there are lots of possible cases (just two are BBBBGGG and BGGBBGB, where B is a boy and G is a girl). But there is only one way to assign genders to the seating so that no two boys are next to each other, and that is BGBGBGB. If we seat the children as BGBGBGB, then there are 4 factorial orderings for the 4 boys, and 3 factorial orderings for the 3 girls, giving a total of 4! x 3! = 144 seatings for the 7 children. These are the seatings that we don’t want, so to count the seatings that we do want, we need to subtract these seatings from the total number of seatings without any restrictions. Since there are 7 kids, there are 7 factorial ways to seat them. So the answer is

7! - (4! x 3!) = 5040-144 = 4896
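The count can be double-checked by brute force. The sketch below is not part of the workflow; it labels the children with made-up names (B0 to B3 for the boys, G0 to G2 for the girls), enumerates all 7! seatings, and compares the direct count with the complementary count from the solution.

```python
from itertools import permutations
from math import factorial

# Hypothetical labels: four boys and three girls.
children = ["B0", "B1", "B2", "B3", "G0", "G1", "G2"]

def has_adjacent_boys(seating):
    """Return True if at least two boys sit next to each other."""
    return any(a[0] == "B" and b[0] == "B" for a, b in zip(seating, seating[1:]))

# Direct count over all 7! = 5040 seatings.
direct = sum(has_adjacent_boys(p) for p in permutations(children))

# Complementary count from the solution above.
complementary = factorial(7) - factorial(4) * factorial(3)

print(direct, complementary)
```

Both counts come out to 4896, matching the solution.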

The workflow involves 2 agents:

Modeling (or interpreter) agent: analyzes the problem and models the problem with equations.
Solver agent: focuses on solving the generated model to get the answer.
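
The division of labor can be pictured as a two-stage pipeline. The following is a minimal sketch, not the code in workflow.py: `run_workflow`, `toy_modeler`, and `toy_solver` are hypothetical names, and the agents are plain Python callables standing in for LLM calls so that the sketch runs without an API key.

```python
from math import factorial
from typing import Callable

Agent = Callable[[str], str]  # hypothetical: an agent maps text to text

def run_workflow(problem: str, modeler: Agent, solver: Agent) -> dict:
    """Chain the two agents: model the problem first, then solve the model."""
    model = modeler(problem)   # stage 1: problem -> equations
    answer = solver(model)     # stage 2: equations -> answer
    return {"answer": answer}

# Toy stand-ins for the two LLM agents (hard-coded for the example problem).
def toy_modeler(problem: str) -> str:
    return "7! - 4! * 3!"

def toy_solver(model: str) -> str:
    return str(factorial(7) - factorial(4) * factorial(3))

result = run_workflow("Smith family seating problem", toy_modeler, toy_solver)
print(result)
```

Keeping the two stages separate means each agent can be tuned on its own, for example with a different model or prompt.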

[Figure: math workflow]

1) Setup#

First, let’s set up the environment for workflow execution. We use an OpenAI model in this example, so please set your key in a .env file as:

OPENAI_API_KEY="your-openai-key"

2) Check Math Workflow#

The implementation is based on langchain and is available in workflow.py.

Try it out with:

%run workflow.py

{'answer': "To solve the problem, we need to analyze the pattern of the student's actions as he opens the lockers.\n\n1. **Initial Setup**: There are 1024 lockers, all initially closed.\n\n2. **First Pass**: The student opens locker 1, then skips locker 2, opens locker 3, skips locker 4, and so on. This means he opens all odd-numbered lockers on the first pass:\n   - Opened lockers: 1, 3, 5, ..., 1023 (total of 512 lockers).\n\n3. **Turning Around**: When he reaches locker 1024, he turns around and starts back. The first closed locker he encounters is locker 2 (since all odd-numbered lockers are open). He opens locker 2, then skips locker 4, opens locker 6, skips locker 8, and continues this pattern:\n   - Opened lockers: 2, 6, 10, ..., 1022 (total of 256 lockers).\n\n4. **Subsequent Passes**: The student continues this process, alternating between opening the first closed locker he encounters and then skipping the next one. Each time he turns around, he opens lockers in a specific pattern:\n   - On the third pass, he will open lockers 4, 12, 20, ..., and so on.\n   - On the fourth pass, he will open lockers 8, 24, 40, ..., and so on.\n\n5. **General Pattern**: Each pass can be described as opening lockers that are multiples of \\(2^n\\) where \\(n\\) is the pass number (starting from 0). The number of lockers opened in each pass decreases as follows:\n   - 1st pass: \\(2^0 = 1\\) (opened 512 lockers)\n   - 2nd pass: \\(2^1 = 2\\) (opened 256 lockers)\n   - 3rd pass: \\(2^2 = 4\\) (opened 128 lockers)\n   - 4th pass: \\(2^3 = 8\\) (opened 64 lockers)\n   - 5th pass: \\(2^4 = 16\\) (opened 32 lockers)\n   - 6th pass: \\(2^5 = 32\\) (opened 16 lockers)\n   - 7th pass: \\(2^6 = 64\\) (opened 8 lockers)\n   - 8th pass: \\(2^7 = 128\\) (opened 4 lockers)\n   - 9th pass: \\(2^8 = 256\\) (opened 2 lockers)\n   - 10th pass: \\(2^9 = 512\\) (opened 1 locker)\n\n6. **Final Pass**: The last locker opened will be the last locker he encounters on the final pass. 
Since he opens lockers in the order of their numbers, the last locker opened will be locker 1024.\n\nThus, the number of the last locker he opens is:\n\n\\[\n\\boxed{1024}\n\\]"}
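
The sample answer can be sanity-checked by simulation. The sketch below assumes the puzzle works the way the answer describes it: 1024 closed lockers, and on each pass the student opens the first closed locker he meets, then alternately skips and opens closed lockers, turning around at each end of the hall. `last_locker_opened` is a hypothetical helper, not part of workflow.py.

```python
def last_locker_opened(n: int) -> int:
    """Simulate the back-and-forth walk and return the last locker opened."""
    closed = list(range(1, n + 1))  # closed lockers in walking order
    last = None
    while closed:
        # Open the 1st, 3rd, 5th, ... closed locker met on this pass.
        last = closed[::2][-1]
        # The skipped lockers stay closed; reverse them for the walk back.
        closed = closed[1::2][::-1]
    return last

print(last_locker_opened(1024))
```

Under these assumptions the simulation ends at locker 342 rather than 1024, which is the kind of gap that the evaluation and optimization steps in section 3 are meant to measure and reduce.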

3) Optimize The Workflow#

The workflow entry point is already registered with the cognify.register_workflow decorator.

Here we configure the optimization pipeline:

Define the evaluation method
Define the data loader
Configure the optimizer

3.1 Use LLM-as-judge#

As you can see, the standard solution includes both the result and the detailed steps required to reach it. We use an LLM agent to evaluate the generated output for completeness and correctness.

The agent assigns a score on a scale of 0 to 10, accounting for partially correct answers.

We implement the scoring agent with langchain as follows:

import cognify

from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Initialize the model
import dotenv
dotenv.load_dotenv()
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Force agent to respond with a score
from langchain.output_parsers import PydanticOutputParser
class Assessment(BaseModel):
    score: int
    
parser = PydanticOutputParser(pydantic_object=Assessment)

@cognify.register_evaluator
def llm_judge(workflow_input, workflow_output, ground_truth):
    evaluator_prompt = """
You are a math problem evaluator. Your task is to grade the answer to a math problem by assessing its correctness and completeness.

You should not solve the problem by yourself; a standard solution will be provided.

Please rate the answer with a score between 0 and 10.
    """
    evaluator_template = ChatPromptTemplate.from_messages(
        [
            ("system", evaluator_prompt),
            ("human", "problem:\n{problem}\n\nstandard solution:\n{solution}\n\nanswer:\n{answer}\n\nYour response format:\n{format_instructions}\n"),
        ]
    )
    evaluator_agent = evaluator_template | model | parser
    assess = evaluator_agent.invoke(
        {
            "problem": workflow_input, 
            "answer": workflow_output, 
            "solution": ground_truth, 
            "format_instructions": parser.get_format_instructions()
        }
    )
    return assess.score
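
The `Assessment` schema above accepts any integer, so a judge reply outside the 0 to 10 range would pass through unchanged. One possible guard, sketched here with a hypothetical helper `clamp_score` that is not part of Cognify, is to clamp the value before returning it.

```python
def clamp_score(score: int, low: int = 0, high: int = 10) -> int:
    """Clamp a judge score into the inclusive range [low, high]."""
    return max(low, min(high, score))

print(clamp_score(7), clamp_score(-3), clamp_score(42))
```

In `llm_judge`, this would mean ending the function with `return clamp_score(assess.score)`.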

3.2 Load the Data#

We provide subsampled math data in the data._json file for you to start with.

The data should be formatted to align with the function signature of both the workflow entry point and the evaluator.

Signatures are:

workflow (workflow_input) -> {'workflow_output': …}
evaluator (workflow_input, workflow_output, ground_truth) -> int

The workflow entry point takes the workflow_input field, and its workflow_output is forwarded to the evaluator.

Additionally, the data loader needs to provide ground_truth to match the evaluator signature. Note: All of these variable names are customizable as long as they are consistent with each other.

Following the rules above, each data item should be formatted as a tuple of (input, ground truth), each being a dictionary with the required fields:

(
    {'workflow_input': ...},
    {'ground_truth': ...}
)

The complete data loader code is provided below.

import json
import random

@cognify.register_data_loader
def load_data():
    with open("data._json", "r") as f:
        data = json.load(f)
        
    random.seed(42)
    random.shuffle(data) 
    # format to (input, ground truth) pairs
    new_data = []
    for d in data:
        input_sample = {
            'workflow_input': d["problem"],
        }
        ground_truth = {
            'ground_truth': d["solution"],
        }
        new_data.append((input_sample, ground_truth))
    # train, val, test split
    return new_data[:30], None, new_data[30:]
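
To see the format and the split in isolation, the following sketch repeats the same steps on made-up records instead of data._json. `to_pairs` and `check_item` are hypothetical helpers introduced only for this illustration.

```python
import random

def to_pairs(records):
    """Convert raw records to (input, ground_truth) tuples, as in load_data."""
    return [
        ({"workflow_input": r["problem"]}, {"ground_truth": r["solution"]})
        for r in records
    ]

def check_item(item):
    """Return True if item has the ({'workflow_input': ...}, {'ground_truth': ...}) shape."""
    if not (isinstance(item, tuple) and len(item) == 2):
        return False
    inp, truth = item
    return "workflow_input" in inp and "ground_truth" in truth

# Made-up records standing in for the contents of data._json.
records = [{"problem": f"problem {i}", "solution": f"solution {i}"} for i in range(40)]

random.seed(42)
random.shuffle(records)
pairs = to_pairs(records)

# Same split as load_data: first 30 items for training, no validation set, rest for testing.
train, val, test = pairs[:30], None, pairs[30:]

print(all(check_item(p) for p in pairs), len(train), len(test))
```

A check like `check_item` can catch a mismatch between the data fields and the workflow or evaluator signatures before an optimization run is started.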

3.3 Configure the optimizer#

Let’s use the default configuration to optimize this workflow. It decides whether to add few-shot examples from the training data and whether to apply chain-of-thought prompting to each agent.

Additionally, the original workflow uses gpt-4o for both agents, so we also tune the model selection to save cost.

The final search space:

2 few-shot examples to add for each agent
whether to apply chain-of-thought to each agent
whether to select gpt-4o or gpt-4o-mini for each agent

from cognify.hub.search import default

model_configs = [
    # OpenAI models
    cognify.LMConfig(model='gpt-4o-mini', kwargs={'temperature': 0, 'max_tokens': 300}),
    cognify.LMConfig(model='gpt-4o', kwargs={'temperature': 0, 'max_tokens': 300}),
]

search_settings = default.create_search(
    model_selection_cog=model_configs
)
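
To get a feel for the size of this search space, the options can be enumerated. This is only a rough sketch: it treats few-shot examples as a simple on/off choice (in practice the optimizer also selects which examples to use), and the agent labels `modeler` and `solver` are hypothetical.

```python
from itertools import product

few_shot = [False, True]           # add few-shot examples or not (simplified)
chain_of_thought = [False, True]   # apply chain-of-thought or not
models = ["gpt-4o-mini", "gpt-4o"]

# All option combinations for a single agent.
per_agent = list(product(few_shot, chain_of_thought, models))

# Each of the two agents is configured independently.
agents = ["modeler", "solver"]  # hypothetical labels
workflow_configs = list(product(per_agent, repeat=len(agents)))

print(len(per_agent), len(workflow_configs))
```

Even in this simplified view there are 8 combinations per agent and 64 for the whole workflow, which is why the search is automated rather than done by hand.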

4) Start the Optimization#

You can save the above configs in a config.py file and use Cognify’s CLI to start the optimization with:

$ cognify optimize workflow.py

Alternatively, you can run the following:

train, val, test = load_data()

opt_cost, pareto_frontier, opt_logs = cognify.optimize(
    script_path="workflow.py",
    control_param=search_settings,
    train_set=train,
    val_set=val,
    eval_fn=llm_judge,
    force=True, # This will overwrite the existing results
)
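
The name `pareto_frontier` suggests that the optimizer keeps the configurations that are not dominated on quality and cost. As an illustration of that idea only (this is not Cognify's implementation, and the `(score, cost)` pairs are made up), a Pareto front can be computed as follows:

```python
def pareto_front(points):
    """Return the (score, cost) points not dominated by any other point.

    A point dominates another if its score is at least as high and its cost
    at least as low, with at least one of the two strictly better.
    """
    front = []
    for score, cost in points:
        dominated = any(
            (s >= score and c <= cost) and (s > score or c < cost)
            for s, c in points
        )
        if not dominated:
            front.append((score, cost))
    return front

# Made-up (score, cost) pairs for four candidate configurations.
candidates = [(8.1, 0.020), (7.9, 0.004), (6.5, 0.006), (8.1, 0.030)]
print(pareto_front(candidates))
```

Here the first two candidates survive: one has the best score, the other the lowest cost, and neither beats the other on both axes.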

5) Optimization Results#

Cognify will output each optimized workflow to a .cog file. For this workflow, the optimizer applies the following optimizations:

use few-shot examples with GPT-4o-mini for the problem modeling (or interpreter) agent
use few-shot examples for the model solver agent

The final optimized workflow is depicted below, with optimizations highlighted in green.

[Figure: optimized math workflow]

For the modeling agent, the following few-shot examples are selected (outputs truncated for brevity):

Demonstration 1:
Input (problem): “A paperboy delivers newspapers to 10 houses along Main Street. Wishing to save effort, he doesn’t always deliver to every house, but to avoid being fired he never misses three consecutive houses…”

Reasoning: “To solve this problem, we can use a combinatorial approach with dynamic programming. Let’s define a sequence where each term represents the number of ways the paperboy can deliver newspapers…”

Output (response): “To solve the problem, we define a_n as the number of ways the paperboy can deliver newspapers to n houses such that he never misses three consecutive houses. The base cases are …”

Demonstration 2:
Input (problem): “A drawer in a darkened room contains $100$ red socks, $80$ green socks, $60$ blue socks and $40$ black socks. A youngster selects socks one at a time from the drawer but is unable to see the color…”

Reasoning: “To solve this problem, we need to determine the minimum number of socks that must be selected to ensure that there are at least 10 pairs of socks. A pair is defined as two socks of the same color…”

Output (response): “To ensure at least 10 pairs of socks, we consider the worst-case scenario where we form 9 pairs and have additional single socks of other colors. We can have 9 pairs of one color (18 socks)…”

This example shows that, even without explicitly adding reasoning prompts, the selected few-shot examples can carry reasoning learned over the entire optimization process.

Check out more details on how to interpret optimization results.