How to dramatically improve the reasoning ability of GPT-3

Following up on my recent post about prompt engineering, in this post I explore a technique that dramatically improves GPT-3’s reasoning ability and output quality. Presumably, this will work on other large language models (LLMs) as well.

Somewhat unsurprisingly, this technique is the same one you’d ask a student to use when solving a problem: show their work.

Asking GPT-3 to show its work

Let’s look at an example prompt:

For each example, decide if it's a good product to sell to a startup:

Product: Global office space management platform.
Good for startups: No

Product: Ticket management platform for in-house legal team.
Good for startups: No

Product: Recruiting and interviewing CRM platform.
Good for startups: Yes

Product: Website copy optimization tool.
Good for startups: Yes

Product: Legal compliance tool suite.
Good for startups:

Davinci (the largest GPT-3 model), run with a low temperature, generates Yes, which I personally think is incorrect. Looking at the logprobs, it predicted Yes at 53.25% and No at 42.20%.
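
If you want to inspect the logprobs yourself, here is a minimal sketch using the legacy openai Python library and its Completions endpoint; the API key and the PROMPT placeholder are things you'd fill in yourself (the prompt being the few-shot text above), and the exact parameter choices are my assumptions rather than the settings behind the numbers quoted here.

import math
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Paste the few-shot prompt shown above, ending with the unanswered
# "Good for startups:" line.
PROMPT = "...paste the prompt from above here..."

response = openai.Completion.create(
    model="davinci",   # the largest base GPT-3 model
    prompt=PROMPT,
    max_tokens=1,      # we only need the single Yes/No token
    temperature=0,     # low temperature for a near-deterministic answer
    logprobs=5,        # also return the top token log-probabilities
)

answer = response["choices"][0]["text"].strip()
top_logprobs = response["choices"][0]["logprobs"]["top_logprobs"][0]

print("Answer:", answer)
for token, logprob in sorted(top_logprobs.items(), key=lambda kv: -kv[1]):
    print(f"{token!r}: {math.exp(logprob) * 100:.2f}%")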

Now, let’s try the same few-shot prompt, but this time asking the model to show its work:

For each example, decide if it's a good product to sell to a startup:

Product: Global office space management platform.
Reasoning: Startups are small, they don't have global office space to manage.
Good for startups: No

Product: Ticket management platform for in-house legal team.
Reasoning: Startups are usually too small to have an in-house legal team.
Good for startups: No

Product: Recruiting and interviewing CRM platform.
Reasoning: Startups do a lot of recruiting and interviewing.
Good for startups: Yes

Product: Website copy optimization tool.
Reasoning: Startups are focused on growth and need to optimize their websites.
Good for startups: Yes

Product: Legal compliance tool suite.
Reasoning:

GPT-3 outputs the reasoning “Startups are small and don't have a legal department.” and a conclusion of No. Nice!

This time the logprobs were No at 88.19% and Yes at 9.68%.
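
Mechanically, the show-your-work variant is the same call with a larger token budget and a stop sequence, so the model writes one Reasoning line and one answer line and then halts. The sketch below, including how the completion is split, is one plausible way to consume the output, not the exact setup behind the numbers above.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Paste the reasoning-augmented few-shot prompt shown above, ending with
# the blank "Reasoning:" line.
PROMPT = "...paste the show-your-work prompt here..."

response = openai.Completion.create(
    model="davinci",
    prompt=PROMPT,
    max_tokens=64,     # room for a reasoning sentence plus the answer line
    temperature=0,
    stop=["\n\n"],     # stop before the model starts inventing a new example
)

completion = response["choices"][0]["text"]
# Expected shape: " Startups are small ...\nGood for startups: No"
reasoning, _, answer = completion.partition("\nGood for startups:")
print("Reasoning:", reasoning.strip())
print("Answer:", answer.strip())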

Another example

Was this a fluke? Let’s try another example.

Classify each sentence as logically correct or incorrect.

Sentence: Dogs are a type of plant.
Result: Incorrect

Sentence: Generally, a roof is above a floor.
Result: Correct

Sentence: The sky on Earth is blue.
Result: Correct

Sentence: Some dogs are smaller than cats.
Result:

It incorrectly responds with Incorrect.

But when we ask it to show its work:

Classify each sentence as logically correct or incorrect.

Sentence: Dogs are a type of plant.
Analysis: Dogs are animals, not plants.
Result: Incorrect

Sentence: Generally, a roof is above a floor.
Analysis: Ignoring oddly-structured buildings, roofs go above floors.
Result: Correct

Sentence: The sky on Earth is blue.
Analysis: Due to Rayleigh scattering, the sky on Earth appears blue.
Result: Correct

Sentence: Some dogs are smaller than cats.
Analysis:

We get back:

Some dogs are smaller than some cats.
Result: Correct

That’s it. Ask the model to show its work and its reasoning is much better.
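
If you want to reuse the pattern, it wraps easily in a small helper that builds a show-your-work prompt from labeled examples and splits the completion back into reasoning and answer. Everything below (the function name, the field labels, the parameter choices) is illustrative, not something taken from the experiments above.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def show_your_work(instruction, examples, query,
                   item_label="Sentence", reason_label="Analysis",
                   answer_label="Result"):
    """Build a few-shot prompt where every example includes its reasoning,
    then ask the model to reason about `query` before answering."""
    lines = [instruction, ""]
    for ex_item, ex_reasoning, ex_answer in examples:
        lines += [f"{item_label}: {ex_item}",
                  f"{reason_label}: {ex_reasoning}",
                  f"{answer_label}: {ex_answer}",
                  ""]
    lines += [f"{item_label}: {query}", f"{reason_label}:"]

    response = openai.Completion.create(
        model="davinci",
        prompt="\n".join(lines),
        max_tokens=64,
        temperature=0,
        stop=["\n\n"],
    )
    completion = response["choices"][0]["text"]
    reasoning, _, answer = completion.partition(f"\n{answer_label}:")
    return reasoning.strip(), answer.strip()

examples = [
    ("Dogs are a type of plant.", "Dogs are animals, not plants.", "Incorrect"),
    ("Generally, a roof is above a floor.",
     "Ignoring oddly-structured buildings, roofs go above floors.", "Correct"),
]
print(show_your_work("Classify each sentence as logically correct or incorrect.",
                     examples, "Some dogs are smaller than cats."))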

Matt Brockman has explored this idea using the WiC benchmark.

Do you have more examples of this phenomenon? Please let me know on Twitter!
