Lokee
kindly drew my attention to an interesting article recently – "Generosity leads to evolutionary success" – which is largely based on a paper by Alexander
Stewart and Joshua Plotkin, "From extortion to generosity, the evolution of zero-determinant strategies in the prisoner’s dilemma". The Stewart&Plotkin paper was, again
largely, a response to what strikes me as a somewhat more technical paper by
William Press and Freeman Dyson, “Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent” (this
latter paper certainly seems less accessible to a lay reader such as myself,
irrespective of how technical it is).
I
spent a bit of time mulling over these papers (and a couple more that are
referenced by one or both of them) and thought I might share the outcome of
those ponderings.
-----------------------------
My
first reaction was to think that, based on the popular article at the
Archaeology News Network (ANN), there might be scientific vindication of some
of the ideas raised in the series of articles Morality as Playing Games. In that series, I primarily suggested that
self-interest might lie behind our morality (and more specifically the
avoidance of loss). I also suggested
that successful ethical systems would involve overt generosity and kindness
during good times combined with an ability to act less generously either during
bad times, or when no-one is watching. I
further suggested that cooperation might arise when you compete against a third
party (for example, in the prisoner’s dilemma,
the prisoners can compete against each other, which is the standard assumption,
or they can cooperate in order to compete against the prosecutor).
The
key evidence that seems to vindicate the idea that bad times promote less
generous behaviour is in Figure 3 of Stewart&Plotkin. They found that in an evolving population
“generous” strategies were more successful than “extortionate” strategies,
basically because “extortionate” strategies don’t do well against themselves so
they are in effect self-limiting – but when the populations were small (for
example when it comes down to a population consisting of just you and me, or my
family and your family), “extortionate” strategies prevail.
But
what, you might ask, is an “extortionate” strategy?
Let’s
return to our prisoners, Larry and Wally.
I’ll have to modify the scenario slightly to introduce the idea of T, R,
P and S which are “maximum payoff”, “mutual cooperation payoff”, “mutual
defection payoff” and “minimum payoff” and are conventionally set to values of
T=5, R=3, P=1 and S=0 (note however that Stewart&Plotkin use a “donation
game” variant in which T=B, R=B-C, P=0 and S=-C, where B>C so that R>P).
-----------------------------------
In Ethical Prisoners, I
explained that Larry and Wally are faced with a dilemma in which they can
choose to either cooperate with each other (by remaining silent with respect to
a crime they are accused of committing) or defect (by confessing).
I
used a table to explain how these options play out, which I’ve updated below:
| | Wally defects (confesses) | Wally cooperates (remains silent) |
| Larry defects (confesses) | payoffLARRY=1 (P), payoffWALLY=1 (P) | payoffLARRY=5 (T), payoffWALLY=0 (S) |
| Larry cooperates (remains silent) | payoffLARRY=0 (S), payoffWALLY=5 (T) | payoffLARRY=3 (R), payoffWALLY=3 (R) |
The
results are more traditionally represented in terms of cooperation (c) and
defection (d) like this (where Larry is the focal player):
| | cWALLY | dWALLY |
| cLARRY | R=3 | S=0 |
| dLARRY | T=5 | P=1 |
Larry
and Wally can choose different strategies depending on not only what sort of
outcome they want but also what sort of overall scenario they find themselves
in. As originally framed, the prisoners’
dilemma (PD) is a one-shot affair, meaning that Larry and Wally face off
against each other once and make a single decision with no historical context
or potential for future consequences. We can, however, consider an iterated
prisoners’ dilemma (IPD) in which Larry and Wally face off many times, making
equivalent decisions repeatedly, and, presumably, can learn about each other’s
behaviour and react accordingly.
If
we ignore the prosecutor and any pre-existing moral imperative to not abandon
your partner in crime (as discussed in Ethical Prisoners),
in the one-shot PD Larry and Wally are likely to choose a simple defection
strategy since defection not only opens up the possibility of a maximum payoff,
but also avoids the possibility of the minimum payoff.
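As a quick check on that reasoning, here is a minimal Python sketch that looks up Larry's best reply to each of Wally's possible moves, using the conventional payoffs. The names PAYOFF and best_reply are purely illustrative and come from neither paper.

```python
# Conventional payoff values, as given in the text above.
T, R, P, S = 5, 3, 1, 0

# Payoff to the focal player, keyed by (own move, opponent's move).
PAYOFF = {("c", "c"): R, ("c", "d"): S, ("d", "c"): T, ("d", "d"): P}

def best_reply(opponent_move):
    """Return the move that maximises the focal player's one-shot payoff."""
    return max("cd", key=lambda my_move: PAYOFF[(my_move, opponent_move)])

for wally_move in "cd":
    print(f"If Wally plays {wally_move}, Larry's best reply is {best_reply(wally_move)}")
```

Whichever move Wally makes, defection comes out as Larry's best reply, which is why the one-shot game settles on mutual defection.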
Things
get more interesting with the IPD.
Strategies can now involve a consideration of previous moves and there
has been a wealth of evidence to show that the most successful strategy in
terms of overall payoff is what is known as Tit-For-Tat (TFT).
(In
terms of head-to-head battles, the most successful strategy is (and remains)
the simple defection strategy Always Defect (ALLD), since it either wins or
draws.)
We
can represent the TFT strategy as a probability table like this:
| previous round | cWALLY | dWALLY |
| cLARRY | p1=1 | p2=0 |
| dLARRY | p3=1 | p4=0 |
where,
· p1 is the probability of Larry cooperating if both cooperated last round,
· p2 is the probability of Larry cooperating if Wally defected while Larry cooperated last round,
· p3 is the probability of Larry cooperating if Wally cooperated while Larry defected last round, and
· p4 is the probability of Larry cooperating if both defected last round.
In
short, if Wally defected in the previous round, then Larry will defect this
round but if Wally cooperated, then Larry will cooperate.
Different
strategies can be tested against each other using PD-bots in tournaments. With simple strategies such as TFT and ALLD, the
whole outcome of the tournament is predicated on the first round. Assuming that a TFT-bot starts off with
cooperation, two TFT-bots will cooperate forever, obtaining an average score of
3 each, two ALLD-bots will defect forever obtaining an average score of 1 each
and a TFT-ALLD pairing will result in an initial win for ALLD followed by
mutual defection forever obtaining average scores that approach 1 (down from 5
for ALLD and up from 0 for TFT).
(If
we don’t assume that TFT-bots start off with cooperation, but instead with
defection, then all results default to an average score of 1.)
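The match-ups described above are easy to reproduce. The following is a rough sketch of two such PD-bots, assuming a cooperative first move for TFT and an arbitrary match length of 1000 iterations; the function names are illustrative only.

```python
T, R, P, S = 5, 3, 1, 0
# Payoff to a player, keyed by (own move, opponent's move).
PAYOFF = {("c", "c"): R, ("c", "d"): S, ("d", "c"): T, ("d", "d"): P}

def tft(my_last, their_last):
    """Tit-For-Tat: cooperate first, then copy the opponent's last move."""
    return "c" if their_last in (None, "c") else "d"

def alld(my_last, their_last):
    """Always Defect."""
    return "d"

def average_scores(strategy_a, strategy_b, rounds=1000):
    """Play two bots against each other and return their average payoffs."""
    a_last = b_last = None
    total_a = total_b = 0
    for _ in range(rounds):
        a_move = strategy_a(a_last, b_last)
        b_move = strategy_b(b_last, a_last)
        total_a += PAYOFF[(a_move, b_move)]
        total_b += PAYOFF[(b_move, a_move)]
        a_last, b_last = a_move, b_move
    return total_a / rounds, total_b / rounds

print("TFT  v TFT :", average_scores(tft, tft))
print("ALLD v ALLD:", average_scores(alld, alld))
print("TFT  v ALLD:", average_scores(tft, alld))
```

Over 1000 iterations the TFT-ALLD pairing gives averages of 0.999 for TFT and 1.004 for ALLD, so ALLD's single early win is all that separates them.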
When
assessing the average score of strategies over many rounds against a range of
opponents and many iterations in each round, TFT has been shown to be a clear
winner (with a cooperative start) despite losing one iteration per round in any
match-up against an ALLD.
What Press&Dyson discovered is that there are more complex strategies, being a
subset of “Zero-Determinant” or ZD strategies, in which an “extortionate”
player can drive a self-interested, evolutionary opponent to always cooperate. A “concrete” example of an extortionate
strategy on the part of Larry (per Press&Dyson) is:
| previous round | cWALLY | dWALLY |
| cLARRY | p1=11/13 | p2=1/2 |
| dLARRY | p3=7/26 | p4=0 |
Even
before going into more detail, it might be pretty easy to see that Wally’s best
option is to always cooperate in order to maximise the frequency with which
Larry cooperates. The likelihood of
Larry cooperating is always higher if Wally has just cooperated, and if also
Larry has just cooperated. The flip side
of this is that the likelihood of Larry defecting is higher if either of them
has just defected. These facts combine
to drive Wally towards unilateral cooperation in order to maximise his score.
What
Press&Dyson showed is that if Wally does unilaterally cooperate to maximise
his score against the extortionate Larry, he does so at the cost of maximising Larry’s
score above his own. What I interpret
out of this is that since Wally can’t win against Larry in the long term, his
obvious choices are to:
· maximise his own score, thereby ensuring that Larry wins, or
· minimise both scores by locking into mutual defection (and thereby secure a draw).
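Wally's two choices can be put into numbers with a small sketch. It treats the game as a chain over the four outcomes of the previous round (cc, cd, dc, dd, with Larry's move first) and pushes the outcome probabilities forward until they settle. The function name, the step count and the pretend starting point of mutual cooperation are all assumptions made for illustration, and the approach only works for pairings whose long-run behaviour does not depend on how the match starts.

```python
T, R, P, S = 5, 3, 1, 0
LARRY_PAYOFF = (R, S, T, P)   # outcomes cc, cd, dc, dd (Larry's move first)
WALLY_PAYOFF = (R, T, S, P)
MIRROR = (0, 2, 1, 3)         # index of the same outcome seen from Wally's side

def long_run_scores(p, q, steps=2000):
    """Approximate long-run average payoffs for Larry (p) and Wally (q)."""
    v = [1.0, 0.0, 0.0, 0.0]  # assumption: start as if both had just cooperated
    for _ in range(steps):
        nxt = [0.0] * 4
        for i, weight in enumerate(v):
            a = p[i]           # chance Larry cooperates after outcome i
            b = q[MIRROR[i]]   # chance Wally cooperates after outcome i
            nxt[0] += weight * a * b
            nxt[1] += weight * a * (1 - b)
            nxt[2] += weight * (1 - a) * b
            nxt[3] += weight * (1 - a) * (1 - b)
        v = nxt
    s_larry = sum(w * x for w, x in zip(v, LARRY_PAYOFF))
    s_wally = sum(w * x for w, x in zip(v, WALLY_PAYOFF))
    return s_larry, s_wally

EXTORT = (11 / 13, 1 / 2, 7 / 26, 0)   # Larry's strategy from the table above
ALLC = (1, 1, 1, 1)                    # Wally always cooperates
ALLD = (0, 0, 0, 0)                    # Wally always defects

for name, wally in (("ALLC", ALLC), ("ALLD", ALLD)):
    s_larry, s_wally = long_run_scores(EXTORT, wally)
    print(f"Wally plays {name}: sLARRY={s_larry:.3f}, sWALLY={s_wally:.3f}")
```

If Wally always cooperates, he averages 21/11 (about 1.91) while Larry averages 41/11 (about 3.73). If Wally always defects, both end up on 1. Wally does better by cooperating, but Larry does better still.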
----------------------
Stewart&Plotkin
were interested in ZD strategies in general but somewhat more intrigued by a different
subset of them, not the “extortionate” strategies but rather what they called
“generous” ZD strategies.
The
generosity in question can be interpreted in two ways. Firstly, and I think possibly most
importantly, a generous ZD strategy is more forgiving and this can be brought
about by having a non-zero value of p4
(which avoids being locked eternally into mutual defection). Secondly, a generous ZD strategy tends to
maximise average payoffs for both players, which can be done by using
strategies which have a low value of χ and a value of κ that approaches R,
where R is the value of mutual cooperation and, in combination, κ and χ constitute
an indication of how “extortionate” the strategy is. The latter term, χ, is referred to by
Press&Dyson as an “extortion factor” where χ=1 is “fairness” and higher
values are increasingly extortionate.
(This term might otherwise be referred to as “leverage”.) One can obtain an indication of how wide the
gulf is between the average payoffs for each player (sLARRY and sWALLY) using an equation that
applies for ZD strategies:
sLARRY - κ = χ . (sWALLY - κ)
If Larry’s
strategy produces χ=1, then the payoffs are equal for both players,
irrespective of the value of κ for that strategy.
--------------------------
Note that determining precisely what the parameters χ and
κ relate to in reality is a little complex.
So far I’ve seen no simple method that one can use to select values of χ
and κ and from them generate a strategy.
It is possible however to fiddle with the values of p1, p2,
p3 and p4 in a coordinated way to raise or lower
the value of the parameters.
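One candidate recipe for that coordinated fiddling, offered as a sketch rather than as the method either paper uses, follows from the linear form with which Press&Dyson define ZD strategies. It needs a third number, a scale factor called phi here, which is an assumption of this sketch and must be small enough to keep all four values between 0 and 1. The function name zd_strategy is hypothetical.

```python
from fractions import Fraction

def zd_strategy(chi, kappa, phi, T=5, R=3, P=1, S=0):
    """Return (p1, p2, p3, p4) for Larry, or raise if they are not probabilities.

    The resulting strategy is intended to enforce
    sLARRY - kappa = chi * (sWALLY - kappa).
    """
    p1 = 1 - phi * (chi - 1) * (R - kappa)
    p2 = 1 - phi * ((kappa - S) + chi * (T - kappa))
    p3 = phi * ((T - kappa) + chi * (kappa - S))
    p4 = phi * (chi - 1) * (kappa - P)
    p = (p1, p2, p3, p4)
    if any(x < 0 or x > 1 for x in p):
        raise ValueError(f"phi={phi} does not give valid probabilities: {p}")
    return p

# chi=3, kappa=P=1 and phi=1/26 give the extortionate example from Press&Dyson.
print(zd_strategy(3, 1, Fraction(1, 26)))

# A generous variant with kappa=R=3; phi=1/10 is an arbitrary valid choice.
print(zd_strategy(Fraction(3, 2), 3, Fraction(1, 10)))
```

With χ=3, κ=1 and phi=1/26 this gives 11/13, 1/2, 7/26 and 0, the same values as the extortionate example above. Raising or lowering χ and κ then moves all four probabilities together.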
Note also that with conventional values of T, R, P and S, there is a limitation
to the value of sLARRY and sWALLY, such that
2P <= sLARRY + sWALLY <= 2R,
due to the typical assumption that 2R > T + S.
------------------------------
Extortionate
strategies are defined as those for which κ=P (the payoff for mutual defection)
and χ>1. In such strategies, the equation
sLARRY - κ = χ . (sWALLY - κ)
shows quantitatively that the other player can only increase their payoff by
simultaneously increasing the extortionist’s payoff, for example for a strategy
with χ=4 and κ=P=1:
| sLARRY | sWALLY |
| 1.000 | 1.00 |
| 1.250 | 1.06 |
| 1.500 | 1.13 |
| 1.750 | 1.19 |
| 2.000 | 1.25 |
| 2.250 | 1.31 |
| 2.500 | 1.38 |
| 2.750 | 1.44 |
| 3.000 | 1.50 |
| 3.250 | 1.56 |
| 3.500 | 1.63 |
| 3.750 | 1.69 |
| 4.000 | 1.75 |
| 4.200 | 1.80 |
where sLARRY + sWALLY = 6.0 is a limiting value.
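Rows like these can be generated directly from the ZD equation. The sketch below rearranges it to give sWALLY for a chosen sLARRY and drops any pair falling outside the limits 2P <= sLARRY + sWALLY <= 2R mentioned earlier. The function name and the small rounding tolerance are illustrative assumptions.

```python
def payoff_rows(chi, kappa, larry_scores, P=1, R=3):
    """Return (sLARRY, sWALLY) pairs satisfying
    sLARRY - kappa = chi * (sWALLY - kappa) within the stated limits."""
    tolerance = 1e-9  # guards against floating-point noise at the limits
    rows = []
    for s_larry in larry_scores:
        s_wally = kappa + (s_larry - kappa) / chi
        total = s_larry + s_wally
        if 2 * P - tolerance <= total <= 2 * R + tolerance:
            rows.append((s_larry, round(s_wally, 2)))
    return rows

# Extortionate case: chi=4, kappa=P=1.
for s_larry, s_wally in payoff_rows(4, 1, [1.0, 2.0, 3.0, 4.0, 4.2, 5.0]):
    print(f"{s_larry:.3f} | {s_wally:.2f}")
```

The final candidate, sLARRY=5.0, is left out because it would push the combined score past 2R=6.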
Strategies
that are more generous have a higher value of κ, for example for a strategy with
χ=4 and κ=R=3:
| sLARRY | sWALLY |
| 0.000 | 2.25 |
| 0.250 | 2.31 |
| 0.500 | 2.38 |
| 0.750 | 2.44 |
| 1.000 | 2.50 |
| 1.250 | 2.56 |
| 1.500 | 2.63 |
| 1.750 | 2.69 |
| 2.000 | 2.75 |
| 2.250 | 2.81 |
| 2.500 | 2.88 |
| 2.750 | 2.94 |
| 3.000 | 3.00 |
This
strategy is still manipulative in a way, since Larry is still encouraging Wally
to increase Larry’s score while Wally tries to increase his own, but it’s more
generous since until parity is reached (where sLARRY = sWALLY), Wally’s
score will be higher than Larry’s.
What the strategy does do, however, is open Larry up to sabotage by Wally.
Wally could, at the cost of a fraction of a point, drive Larry’s score to 1.0.
Wally might not obtain any benefit from increasing his score above 2.5 so, even
without malice on his part, lack of incentive to cooperate further might damage
Larry catastrophically.
Larry can mitigate this vulnerability to sabotage by choosing a strategy with
a lower value of χ, that is, by lowering the “extortion factor” which, in a
generous strategy, works against you (hence the suggestion that the term
“leverage” be used). For example if Larry’s
strategy produces χ=1.5 and κ=R=3, he will get the following results:
| sLARRY | sWALLY |
| 0.600 | 1.40 |
| 0.750 | 1.50 |
| 1.000 | 1.67 |
| 1.250 | 1.83 |
| 1.500 | 2.00 |
| 1.750 | 2.17 |
| 2.000 | 2.33 |
| 2.250 | 2.50 |
| 2.500 | 2.67 |
| 2.750 | 2.83 |
| 3.000 | 3.00 |
where sLARRY + sWALLY = 2.0 is a limiting value.
This
strategy narrows the gap in scores between yourself and the other (thus limiting
the effectiveness of any attempts to sabotage you), increases the self-harm
done by a saboteur and makes it impossible for an opponent to drive your score
to zero – while still being generous.
If χ
is reduced even further, to a value below 1, a potential saboteur would do more
damage to themselves than they would to their opponent, for example with χ=0.75
and κ=R=3:
| sLARRY | sWALLY |
| ~1.285 | ~0.715 |
| 1.500 | 1.00 |
| 1.750 | 1.33 |
| 2.000 | 1.67 |
| 2.250 | 2.00 |
| 2.500 | 2.33 |
| 2.750 | 2.67 |
| 3.000 | 3.00 |
where sLARRY + sWALLY = 2.0 is a limiting value.
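The trade-off a saboteur faces in the last three tables (χ=4, χ=1.5 and χ=0.75, all with κ=R=3) can be put side by side with a small sketch. It finds the lowest point Wally can push the scores to and reports how far each player falls below parity at κ. The function name is hypothetical, and the floor on Larry's average at the minimum payoff S is an assumption that matches the χ=4 table starting at zero.

```python
def sabotage_costs(chi, kappa, P=1, S=0):
    """Return (Larry's loss, Wally's loss) relative to parity at kappa
    when Wally pushes both scores as low as the stated limits allow."""
    # Point where the line sLARRY - kappa = chi * (sWALLY - kappa)
    # meets the lower limit sLARRY + sWALLY = 2P.
    s_larry = (2 * P * chi - kappa * chi + kappa) / (chi + 1)
    # Assumption: Larry's average cannot fall below the minimum payoff S.
    s_larry = max(S, s_larry)
    s_wally = kappa + (s_larry - kappa) / chi
    return kappa - s_larry, kappa - s_wally

for chi in (4, 1.5, 0.75):
    larry_loss, wally_loss = sabotage_costs(chi, kappa=3)
    print(f"chi={chi}: Larry loses {larry_loss:.2f}, Wally loses {wally_loss:.2f}")
```

With χ=4 Wally gives up 0.75 to cost Larry 3.00, with χ=1.5 he gives up 1.60 to cost Larry 2.40, and with χ=0.75 he gives up about 2.29 to cost Larry only about 1.71.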
Another
potential way to limit sabotage is to set κ to a lower value – within the range
P < κ < R. Doing so again
increases the harm a saboteur does to themselves in their efforts to harm you. For example for a strategy with χ=1.5 and
κ=(P+R)/2=2:
| sLARRY | sWALLY |
| 0.800 | 1.20 |
| 1.000 | 1.33 |
| 1.250 | 1.50 |
| 1.500 | 1.67 |
| 1.750 | 1.83 |
| 2.000 | 2.00 |
| 2.250 | 2.17 |
| 2.500 | 2.33 |
| 2.750 | 2.50 |
| 3.000 | 2.67 |
| 3.200 | 2.80 |
where sLARRY + sWALLY = 6.0 and sLARRY + sWALLY = 2.0 are limiting values.
Note
however that the payoffs for sLARRY and
sWALLY always reach parity
at κ, no matter what value κ is set to, so reducing κ would be a self-defeating
strategy on the part of a generous Larry, since a malign Wally can still force him
into an inferior position by reducing his payoff to below whatever level Larry
sets for κ. Therefore, employing a
strategy with a lower value of κ would just decrease Larry’s score. On the other hand, against a more reasonable,
evolutionary player who aims only to increase their own score, without worrying
about Larry’s, this strategy is slightly superior giving Larry a marginally
higher payoff than a strategy with a higher value for κ.
A better result, however, can be obtained using a lower value of χ and a higher
value of κ, for example with χ=0.11 and κ=2.875:
| sLARRY | sWALLY |
| 2.625 | 0.60 |
| 2.750 | 1.74 |
| 2.875 | 2.875 |
| 2.900 | 3.10 |
where sLARRY + sWALLY = 6.0 is a limiting value.
Here,
Wally is highly encouraged to score well, even slightly higher than Larry and is
punished severely if he acts malignly without damaging Larry in any significant
way, particularly since Larry has signalled that he is uninterested in a maximal
score. I’m not totally convinced that
this strategy is better than one in which κ=R and χ is set low, but there may
be situations in which it has its benefits (specifically thinking of situations
in which success brings with it some other risk, so that being a dog that’s
doing nicely without actually being top dog is preferable).
As
in Ethical Prisoners , the
strategies that Larry and Wally select will depend at least partially on who they
believe they are playing against – in terms of this scenario, their strategies
will depend on whether they want to maximise their own scores or to beat the other.
Perhaps
the best strategy is somewhat more complex, one in which Larry identifies what
sort of opponent he has and what the playing environment is like, and then
chooses an appropriate level of generosity or extortion to match.
----------------
There’s
something else to unpack from this. If
this mathematical modelling can be applied to real life situations, and there
are indications that it can, then something that we can take away is that if
the “players” consider that either times are poor or the population is small,
then they will tend to act less generously, since generosity tends to prevail
only in good times with larger populations.
With
creatures as intellectually complex as humans, we need to be careful about how
we consider population size. Some people
might consider that there is only a population size of two – people like me (i.e.
primarily me) and everyone else. Very
few will, in practical terms, consider the population to be a little over 7
billion.
These
findings could be considered support for the notion that we should encourage
people to regard themselves as part of a larger inclusive society because by
doing so we would encourage more cooperative and generous behaviour.
Similarly,
we could consider that a sense that times are poor can dissuade people from
engaging in generous behaviour (examples abound in dystopian-themed tales
in which people turn on each other during tougher times) – so constant proclamations
of doom from the media and politicians (especially those in opposition) can become
self-fulfilling. If we take the opposite
approach by highlighting the positive, we might be able to generate a virtuous
circle in which a feeling that things are getting better motivates people to
act more generously, resulting in evidence that things are in fact getting
better, thus locking in further generosity.
----------------
In
the next part, I’ll look at what evolution means in Press&Dyson and
Stewart&Plotkin and share the results of my own very simple modelling,
modelling which indicates that generosity may in fact be the best strategy for
evolving populations.