Week 13 - Game theory

Game theory

Game theory is the study of systems in which there is more than one rational player.

2-player zero-sum finite game of perfect information

This is a lot of words, but these are games like tic-tac-toe (noughts and crosses). Whilst these are fun, they all boil down to a payoff matrix for the different strategies you can take, like the following.

To read this table: if the first player, at the top, picks one action and the second player, on the rows, picks another, then player one gets 1 point whereas player two gets -1 point. This is the zero-sum aspect of the game. (Note the game could be non-deterministic, but we just take expectations to find the values in the grid.)
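
For illustration, here is one such payoff matrix (the entries are player one's payoffs; they are made up for this note rather than taken from the lecture):

$$\begin{array}{c|cc}
 & a_1 & a_2 \\
\hline
b_1 & 1 & -1 \\
b_2 & -1 & 1
\end{array}$$

Each entry is what player one wins and, simultaneously, what player two loses, so every outcome sums to zero.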

Zero-sum game

A game is zero-sum if the total gains and losses across all participants sum to zero.

We call the available options strategies.

Minmax decision

If you choose to minmax, you want to minimise the maximum loss. In terms of gains, you want to maximise the minimum gain, which is why it is sometimes referred to as maximin.
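
A minimal sketch of computing a maximin pure strategy from a payoff matrix (the matrix here is a made-up example, not one from the lecture):

```python
# Maximin: pick the row whose worst-case (minimum) payoff is largest.
payoff = [
    [3, -1, 2],   # payoffs to the row player for each column reply
    [1,  0, 1],
    [4, -3, 5],
]

best_row = max(range(len(payoff)), key=lambda r: min(payoff[r]))
value = min(payoff[best_row])
print(f"maximin row: {best_row}, guaranteed payoff: {value}")  # row 1, payoff 0
```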

Optimum play exists for 2-player zero-sum games with perfect information

Statement

Lemma

In a 2-player zero-sum game with perfect information there exists an optimal pure strategy for each player.

Proof


This means that, with both players following their best strategy, we can work out the value of the game.

Warning

I don’t really understand what happens in the case

What would be the optimal pure strategy?

No more perfect information

Perfect information means that the full state of the game is known by all players at all times. A good example where this is not the case is poker.

Mini-poker

Suppose we are in a game where the following happens:

  1. Player one is dealt a card; 50% of the time it is red and the other 50% it is black.
  2. If player one is dealt a black card they can resign or keep playing. If they are dealt a red card they have to keep playing. If they resign they score -20.
  3. Player two then gets to choose either to resign or for player one to show them their card.
  4. If player two resigns, player one gets 10 points.
  5. If player one shows their card and it is black, they get -40 points.
  6. If player one shows their card and it is red, they get 30 points.

What is the optimum play for either party?

In this case we can still calculate the payoff matrix. Note that if player one chooses a resign strategy but gets dealt a red card, they have to keep playing.
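
Working from the rules above, player one's expected payoffs come out as follows (a reconstruction; the row/column labels are mine). For example, "resign on black" against "player two sees" gives $\frac{1}{2}(30) + \frac{1}{2}(-20) = 5$:

$$\begin{array}{c|cc}
 & \mbox{P2 resigns} & \mbox{P2 sees} \\
\hline
\mbox{P1 resigns on black} & \frac{1}{2}(10) + \frac{1}{2}(-20) = -5 & \frac{1}{2}(30) + \frac{1}{2}(-20) = 5 \\
\mbox{P1 holds on black} & 10 & \frac{1}{2}(30) + \frac{1}{2}(-40) = -5
\end{array}$$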

Don't understand

I don’t really get how this situation is any different from the ones above. The lecturers say the value of the game now differs for the two players, but it still boils down to a matrix that you could have gotten from a normal game?

Mixed strategies

Pure strategy

In Game theory a pure strategy is deterministic, i.e. it is a map from states of the game to actions.

However, you can have non-deterministic strategies.

Mixed strategy

In Game theory a mixed strategy is non-deterministic, i.e. you have access to a random variable, and your strategy is a map from states of the game together with the result of the random variable to actions.

In the game above, if player one holds (on a black card) with probability $p$, then from the payoff matrix they get expected return $15p - 5$ if player two resigns, and $5 - 10p$ if player two sees the card. Therefore if we choose $p = \frac{2}{5}$ we get an expected profit of $1$ no matter what player two does.

If you do the same for player two, picking to see with probability $q$, then player one gets expected return $10q - 5$ if they resign but $10 - 15q$ if they hold. Then we get the optimum $q = \frac{3}{5}$, which also gives player one a profit of 1.
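
A quick numerical check of these mixed strategies, using the reconstructed payoff matrix above (the variable names are mine):

```python
# Player one's expected payoffs: rows = (resign on black, hold on black),
# columns = (player two resigns, player two sees).
payoff = [[-5, 5],
          [10, -5]]

p = 2 / 5   # probability player one holds on black
q = 3 / 5   # probability player two sees

# Player one's expected return against each pure reply of player two.
vs_resign = (1 - p) * payoff[0][0] + p * payoff[1][0]
vs_see    = (1 - p) * payoff[0][1] + p * payoff[1][1]
print(vs_resign, vs_see)  # 1.0 1.0 -- the value of the game

# Player two's mix makes player one indifferent between resigning and holding.
if_p1_resigns = (1 - q) * payoff[0][0] + q * payoff[0][1]
if_p1_holds   = (1 - q) * payoff[1][0] + q * payoff[1][1]
print(if_p1_resigns, if_p1_holds)  # 1.0 1.0
```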

Removing zero-sum

Prisoner's dilemma

Two people are caught mid-crime and put in separate cells where they can’t talk to each other. They are both offered a plea deal if they testify against their companion, where:

  • If they both cooperate with each other and stay silent, they both get one month in jail,
  • If one of them defects and testifies while the other doesn’t, whoever testifies gets off free whilst the other stays in jail for 3 months, however
  • If they both defect and testify, they both get sent away for 2 months.

This can be summarised in the following matrix.
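
Reconstructing the matrix of jail time (in months) from the numbers above, with entries written as (row player's cost, column player's cost):

$$\begin{array}{c|cc}
 & \mbox{silent} & \mbox{testify} \\
\hline
\mbox{silent} & (1, 1) & (3, 0) \\
\mbox{testify} & (0, 3) & (2, 2)
\end{array}$$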

If a player stays silent they pay an expected cost of 2, whereas if they testify they pay an expected cost of 1 (averaging over the other's choice). So the optimum strategy is to testify. However, with both testifying, together they pay the largest total cost.


In the Prisoner’s dilemma the dominating strategy is for them both to testify. Even if either player could change their decision after finding out the outcome, they wouldn’t want to.

Nash equilibrium

In a game with $n$ players where player $i$ has a choice of strategies in $S_i$, the state $s = (s_1, \ldots, s_n)$ is in Nash equilibrium if for all $i$ and all $s_i' \in S_i$ we have $$u_i(s_1, \ldots, s_i, \ldots, s_n) \geq u_i(s_1, \ldots, s_i', \ldots, s_n),$$ where $u_i$ is the utility function for player $i$.

This works for both pure or mixed strategies.

Dominating strategies

Strictly dominated strategy

In a game with $n$ players, where player $i$ has a choice of strategies in $S_i$ and has utility function $u_i$: for player $i$, strategy $s_i$ strictly dominates $s_i'$ if and only if for all strategy choices $s_{-i}$ of the other players we have $$u_i(s_i, s_{-i}) > u_i(s_i', s_{-i}).$$

Iteratively eliminating strictly dominated strategies can be used to find a Nash equilibrium.

Elimination and Nash Equilibrium

Statement

Lemma

In an $n$-player pure strategy game, if iterated elimination of Strictly dominated strategies eliminates all but one combination of strategies, then that combination is the unique Nash equilibrium. Similarly, a Nash equilibrium can never be strictly dominated.

Proof


Though this procedure isn’t guaranteed to find one, there is always a Nash equilibrium in finite games.
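
A minimal sketch of iterated elimination of strictly dominated strategies for a two-player game in normal form, run here on the prisoner's dilemma (function and variable names are mine):

```python
import numpy as np

def eliminate(u1, u2):
    """Iteratively remove strictly dominated pure strategies.

    u1[i, j] / u2[i, j]: utilities of the row / column player when the
    row player plays i and the column player plays j.
    Returns the indices of the surviving rows and columns.
    """
    rows = list(range(u1.shape[0]))
    cols = list(range(u1.shape[1]))
    changed = True
    while changed:
        changed = False
        # Remove any row strictly worse than another row against all columns.
        for r in rows[:]:
            if any(np.all(u1[o, cols] > u1[r, cols]) for o in rows if o != r):
                rows.remove(r)
                changed = True
        # Same for columns, using the column player's utilities.
        for c in cols[:]:
            if any(np.all(u2[rows, o] > u2[rows, c]) for o in cols if o != c):
                cols.remove(c)
                changed = True
    return rows, cols

# Prisoner's dilemma as utilities (negative months in jail).
u1 = np.array([[-1, -3], [0, -2]])   # row player: silent, testify
u2 = np.array([[-1, 0], [-3, -2]])   # column player
print(eliminate(u1, u2))  # ([1], [1]) -- only (testify, testify) survives
```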

Existence of Nash equilibrium

Statement

Lemma

In an $n$-player game where $n$ is finite and each player has finitely many choices of pure strategy, there exists a (possibly mixed) Nash equilibrium.

Proof


Note

Normally the best way to get out of the prisoner’s dilemma is to change the structure of the game.

Multiple rounds

Suppose now you are in a situation where you are playing repeated rounds of a game against the same player.

Known number of rounds

One approach to solving this is to treat each possible sequence of strategies as a single strategy in a new, composite game, and to investigate that. Though an easier solution can be found by thinking of the rounds separately.

If the number of rounds is known beforehand, then consider the last round. As there are no games afterwards, there is no incentive not to behave rationally, so each player will use a Nash equilibrium strategy.

Given that this last game is already decided, we might as well consider the game with one fewer round. Inductively, this gives that each player will play a Nash equilibrium strategy at each step of the game.

However, which Nash equilibrium they settle on is an interesting problem!

Unknown number of rounds

Suppose that at the end of a round there is a probability $\gamma$ that you play again and a probability $1-\gamma$ that you end. This stays constant across rounds, and you use discounted rewards to evaluate performance.

Note

Discounted rewards are perfectly built for this. Suppose $V$ is the random variable for the value of playing this game, and let $r_t$ be the reward for round $t$; then $$E[V] = \sum_{t=0}^{\infty} \gamma^t E[r_t].$$ Therefore if we pick a strategy that would pay off $r_t$ in round $t$ if it were played, then the total expected value of this strategy would be $$\sum_{t=0}^{\infty} \gamma^t r_t.$$

The expected number of rounds in this set-up is $\frac{1}{1-\gamma}$.
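
To see this, note that round $t$ is reached with probability $\gamma^t$, so the expected number of rounds played is the geometric series

$$\sum_{t=0}^{\infty} \gamma^t = \frac{1}{1-\gamma}.$$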

Tit for Tat

When in the above situation there are some meta-strategies you can adopt. For example, you can always choose one option, like in the known-number-of-rounds example. However, you can now also adopt strategies that depend on the opponent’s moves.

Tit for Tat

In a symmetric two-player game, the tit for tat strategy just copies the opponent’s previous move. In the first round it will do something random.

Suppose we do this with the prisoner’s dilemma. Now let’s consider 4 strategies: always stay silent (silent), always testify (testify), or tit for tat (tft) starting either silent or testifying. Let’s consider the strategy matrix (to simplify things, a common factor has been taken out).

Let’s consider the preferred strategy when your opponent has already picked theirs; this depends on the discount factor $\gamma$, with the three columns below corresponding to increasing values of $\gamma$.

$$\begin{array}{c|ccc}
\mbox{Opponent} & \mbox{small } \gamma & \mbox{intermediate } \gamma & \mbox{large } \gamma \\
\hline
\mbox{silent} & \mbox{testify} & \mbox{testify} & \mbox{testify} \\
\mbox{testify} & \mbox{testify, tft-testify} & \mbox{testify, tft-testify} & \mbox{testify, tft-testify} \\
\mbox{tft-silent} & \mbox{testify} & \mbox{any} & \mbox{tft-silent, silent} \\
\mbox{tft-testify} & \mbox{testify, tft-testify} & \mbox{any} & \mbox{silent}
\end{array}$$

Here, no matter what, both players testifying is a Nash equilibrium, though we have additional equilibria depending on $\gamma$.
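
A sketch of how such a strategy matrix can be computed: simulate each pair of meta-strategies on the repeated prisoner's dilemma and accumulate $\gamma$-discounted jail costs, truncating the infinite sum (all names here are mine):

```python
# Costs (months in jail) for (my action, their action); 'S' silent, 'T' testify.
COST = {('S', 'S'): 1, ('S', 'T'): 3, ('T', 'S'): 0, ('T', 'T'): 2}

def make_strategy(kind):
    """Return f(my_last, their_last) -> action. None means first round."""
    if kind in ('silent', 'testify'):
        const = 'S' if kind == 'silent' else 'T'
        return lambda me, them: const
    start = 'S' if kind == 'tft-silent' else 'T'
    return lambda me, them: start if them is None else them  # tit for tat

def discounted_cost(kind_a, kind_b, gamma=0.9, rounds=500):
    a, b = make_strategy(kind_a), make_strategy(kind_b)
    last_a = last_b = None
    total = 0.0
    for t in range(rounds):
        act_a, act_b = a(last_a, last_b), b(last_b, last_a)
        total += gamma ** t * COST[(act_a, act_b)]
        last_a, last_b = act_a, act_b
    return total  # discounted cost to the first player (lower is better)

kinds = ('silent', 'testify', 'tft-silent', 'tft-testify')
for s1 in kinds:
    print(s1, [round(discounted_cost(s1, s2), 1) for s2 in kinds])
```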

Picturing Mixed strategies for repeated games

Now that we are repeating games, we could adopt a Mixed strategy over these meta-strategies, i.e. decide a probability distribution over the strategies and pick between them using a random variable.

Minmax profile

For a game, the minmax profile gives the expected score each player can guarantee assuming their opponents only want to harm them. The opponents are allowed to take any strategy; this can be a Pure strategy or a Mixed strategy.

Battle of the sexes

Suppose two friends $A$ and $B$ are going to a town to see a concert. There are two concerts on, $x$ and $y$, but unfortunately they forgot to agree which one to go to and can’t communicate. They will both be unhappy if they go to different concerts, in which case both get a score of $0$. $A$ slightly prefers concert $x$: if they go there together $A$ gets a score of $2$ whereas $B$ gets a score of $1$, and it is vice versa for concert $y$. In summary this can be represented in a payoff table as follows.
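
Reconstructing the table from the scores above, with entries written as ($A$'s score, $B$'s score):

$$\begin{array}{c|cc}
 & B \mbox{ picks } x & B \mbox{ picks } y \\
\hline
A \mbox{ picks } x & (2, 1) & (0, 0) \\
A \mbox{ picks } y & (0, 0) & (1, 2)
\end{array}$$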

To calculate the Minmax profile of the Battle of the sexes we have to think about how one player could punish the other the most. (This game is simpler as it is symmetric.) Using a Pure strategy, if $B$ wanted to punish $A$ they would go to $y$, and so the best $A$ could do would be 1 point.

However, this is not the worst player $B$ could do. Suppose instead they use a mixed strategy and go to $x$ with probability $q$ and $y$ with probability $1-q$. Then player $A$ choosing $x$ would give them $2q$, whereas choosing $y$ gives them $1-q$. Setting $q = \frac{1}{3}$ makes the expected return of either decision $\frac{2}{3}$.

This gives the following Minmax profile.

$$\begin{array}{c|c} \mbox{A} & \mbox{B} \\ \hline \frac{2}{3} & \frac{2}{3} \end{array}$$

Feasible region

In the Battle of the sexes we saw that one player could, in expectation, guarantee another player a lower score by adopting a mixed strategy. Therefore they have made the payoff a point that isn’t in $\{(0,0), (2,1), (1,2)\}$. So what is the feasible region of expected scores using mixed strategies? If player $A$ picks $x$ with probability $p$ and player $B$ picks $x$ with probability $q$ then we get the expected score $$(2pq + (1-p)(1-q), \; pq + 2(1-p)(1-q)).$$ This gives us all the points in the convex hull of the points $\{(0,0), (2,1), (1,2)\}$.

![[feasible_region.excalidraw]]

For example, to get the line at the bottom set $p = 1$, giving $\{(2q, q) \mid q \in [0,1]\}$; for the line at the top set $p = 0$, giving $\{((1-q), 2(1-q)) \mid q \in [0,1]\}$. For lines in between, let $p$ vary.

Security region

![[Security region]]

In the Battle of the sexes example we have plotted the security region below.

![[feasible_region_with_security_region.excalidraw]]

Folk Theorem

We would like to think we could get any point which is feasible and in the security region with sufficient cooperation. That cooperation will have to be enforced with a penalty, such as forcing the opposing player into the Minmax profile. This will look as follows.

![[Grim trigger strategy]]

![[Folk Theorem]]

If we follow through with the threat in the Grim trigger strategy, this might actually be sub-optimal play for us, and we might not want to do it. So this threat is somewhat implausible.

![[Subgame perfect]]

![[Plausible threat]]

Pavlov

![[Pavlov strategy]]

This strategy feels counterintuitive, like a "punch me twice and it's all OK" mentality. The strategy is in Nash equilibrium with itself, as both players start out cooperating and will continue forever. However, it is also Subgame perfect and therefore a plausible threat. To see this, let’s start a game in every possible previous state. We will represent turns as $(*,*)$ where the first position is what player $A$ did and the second is what player $B$ did. The two possible moves are cooperate $C$ and defect $D$.

$$\begin{array}{c|cc} \mbox{Last} & 1 & 2 \\ \hline (C,C) & (C,C) & (C,C) \\ (C,D) & (D,D) & (C,C) \\ (D,C) & (D,D) & (C,C) \\ (D,D) & (C,C) & (C,C) \end{array}$$

We see from this that, no matter what happened last round, the Pavlov strategy rights itself into the cooperate-cooperate state within two rounds.
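
A tiny simulation of this self-righting behaviour, treating Pavlov as "win-stay, lose-shift": cooperate whenever both players did the same thing last round (the code and names are a sketch of mine):

```python
def pavlov(mine, theirs):
    """Cooperate if we agreed last round, otherwise defect."""
    return 'C' if mine == theirs else 'D'

# Play Pavlov against Pavlov starting from every possible previous state.
for last in [('C', 'C'), ('C', 'D'), ('D', 'C'), ('D', 'D')]:
    history = [last]
    for _ in range(2):  # two further rounds
        a, b = history[-1]
        history.append((pavlov(a, b), pavlov(b, a)))
    print(history[0], '->', history[1:])  # always ends in ('C', 'C')
```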

Computational Folk theorem

Statement

Lemma

For any 2-player discounted-reward repeated game it is possible to build a Pavlov-like machine which is Subgame perfect and can achieve a Nash equilibrium in polynomial time. This is done in one of three ways:

  • Pavlov, if it is possible,
  • Make it zero-sum (like) and solve using a linear programme, or
  • At most one player can improve.

Proof


Stochastic games

Stochastic games are the multi-agent analogue of a Markov decision process.

Stochastic games

A Stochastic game is defined by the following data:

  • A number of players $n$,
  • A set of states $S$ the game can be in (normally thought of as tuples),
  • A set of actions $A_i$ for each player $i$ (this may depend on the state); we refer to the joint set of actions as $A$,
  • A model $T$ which determines, for each pair of states $s, s'$ and joint action $a$, the transition probability from $s$ to $s'$ given action $a$, i.e. $T(s, a, s')$,
  • A reward function $R_i$ for each player $i$ for being at a given state, and
  • A discount factor $\gamma$ for discounted rewards.

When provided with this, we are looking for a strategy for each player that determines what action they take in each state.
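
A minimal sketch of this data in code (the field names mirror the definition above; they are my own naming, not from the lecture):

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

State = Hashable
JointAction = tuple  # one action per player

@dataclass
class StochasticGame:
    n_players: int
    states: Sequence[State]
    actions: Callable[[State, int], Sequence]                 # legal actions for player i in state s
    transition: Callable[[State, JointAction, State], float]  # T(s, a, s')
    reward: Callable[[State, int], float]                     # R_i(s) for player i
    gamma: float                                              # discount factor
```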

Semi-wall stochastic game

The semi-wall game is a Gridworld game played on a 3x3 grid, with one player starting in the bottom left corner and the other starting in the bottom right. The goal of the game is to reach the money in the centre top; whoever does so first gets a point. If the two players try to enter the same spot, with even odds one player moves in and the other stays where they are. There is a semi-wall above each player's starting square, which has a 50% chance of letting them through; otherwise they stay where they are. The game looks as follows: (figure: semi-wall-game)

Generalisations for zero-sum Stochastic games

In the zero-sum setting we assume only two players, so $n = 2$. However, we leave the terminology general for syntactic ease.

Minimax-Q

This is a generalisation of Q-learning to Stochastic games, with a Q-function defined for each player, whose value we incrementally learn.

This has some nice properties

  • Value iteration works,
  • It converges,
  • The Q-functions have a unique solution,
  • Policies can be computed independently from one another,
  • We can update it in polynomial time, and
  • Q-functions are sufficient to specify the policy.
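
The core computational step in Minimax-Q is solving the stage matrix game at each state: find the mixed strategy maximising the worst-case Q-value. A sketch using scipy's linear programming, with names of my own choosing:

```python
import numpy as np
from scipy.optimize import linprog

def minimax_value(Q):
    """Solve max_pi min_o sum_a pi[a] * Q[a, o] by linear programming.

    Q[a, o]: our Q-value for our action a against opponent action o.
    Returns (value, pi).
    """
    n_a, n_o = Q.shape
    # Variables x = (pi_0, ..., pi_{n_a-1}, v); maximise v == minimise -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every opponent action o: v - sum_a pi[a] * Q[a, o] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # Probabilities sum to 1.
    A_eq = np.append(np.ones(n_a), 0.0).reshape(1, -1)
    b_eq = [1.0]
    bounds = [(0, 1)] * n_a + [(None, None)]  # v is unconstrained
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:-1]

# Matching pennies: the optimal mix is (1/2, 1/2) with game value 0.
value, pi = minimax_value(np.array([[1.0, -1.0], [-1.0, 1.0]]))
print(round(value, 3), pi.round(3))
```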

Generalisation to general-sum Stochastic games

You can do the same for non-zero-sum games, using a Nash equilibrium instead of minmax, but it doesn’t have the nice properties.

Instead it has the following properties:

  • Value iteration doesn’t work,
  • It doesn’t converge,
  • The Q-functions have multiple solutions (as there is no unique Nash equilibrium),
  • Policies cannot be computed independently from one another,
  • Updates require computing a Nash equilibrium, which is PPAD-complete, and
  • Q-functions are not sufficient to specify the policy.

Though there are ideas to correct for this:

  • We can use repeated Stochastic games,
  • We can let players talk to each other,
  • We can limit players’ computational power, and
  • We can allow players to pay each other some of their reward.