Quick start guide

This packages implements methods described in the paper:

Exact decoding of the sequentially Markov coalescent
Caleb Ki, Jonathan Terhorst
bioRxiv 2020.09.21.307355; doi:10.1101/2020.09.21.307355

The methods are implemented in a Python package called xsmc. The package has two main uses: drawing exact samples from the posterior distribution, and fast estimation of the maximum a posteriori (MAP) path. Convenience methods are also included for plotting.

We will demonstrate how the method works on simulated data using msprime.

[3]:
def simulate(recombination_rate, length, sample_size=2):
    return msp.simulate(
        sample_size=sample_size,
        mutation_rate=1e-8,
        recombination_rate=recombination_rate,
        Ne=1e4,
        length=length,
        random_seed=np.random.randint(1, int(4e9)),
        demographic_events=[
            msp.PopulationParametersChange(time=0.0, initial_size=1e5),
            msp.PopulationParametersChange(time=1000.0, initial_size=1e4),
        ],
    )

All methods are accessed by creating an instance of the xsmc.XSMC class. The class has three required arguments:

  • ts, a tskit.TreeSequence instance containing the data.

  • focal, the node id of the focal haplotype in ts.

  • panel, a list of node ids in ts which are potential targets for focal to copy from.

First, we focus on the simplest type of analysis, where we have two chromosomes sampled from a singled diploid individual.

[4]:
from xsmc import XSMC, Segmentation

data = simulate(recombination_rate=1e-9, length=1e6, sample_size=2)
xs = XSMC(data, focal=0, panel=[1], rho_over_theta=0.1)

Exact sampling

To sample paths from the posterior, use the XSMC.sample().

[5]:
paths = xs.sample(k=100, seed=1)

The returned object is a list of length k of Segmentation objects. These objects can be easily plotted to visualize the posterior:

[6]:
for p in paths:
    p.draw(alpha=0.1)
plt.yscale("log")
_images/quickstart_10_0.png

To compare the posterior with the truth, we can use built-in convenience methods to turn data into a Segmentation object and plot it.

[7]:
truth = Segmentation.from_ts(data, focal=0, panel=[1])
truth.draw()
plt.yscale("log")
_images/quickstart_12_0.png

Rescaling the output

You may notice that the \(y\) scale of the above plots is different. By default, Segmentations returned by XSMC are in units of coalescent time. This matches the scaling of the Li & Durbin’s PSMC.

To convert these times to generations, you must know the biological mutation rate \(\mu\), and then estimate \(N_0\) by the formula \(N_0 = \theta / (4 \mu)\):

[8]:
N0 = xs.theta / (4 * 1e-8)

Then, you can convert time to generations by multiplying by \(2N_0\).

[9]:
for p in paths:
    p.rescale(2 * N0).draw(
        color="tab:blue", alpha=0.1
    )  # divide by the mutation rate used in the simulation, convert to haploid
truth.draw(color="tab:red")
plt.yscale("log")
# plt.ylim(0, 1e4)
_images/quickstart_16_0.png

Computing functionals of the posterior

The Segmentation.to_pp() method can be used to get access to the raw stepwise function it represents (returned as a piecewise polynomial, whence the name). By calling this method on paths returned from XSMC.sample_paths(), this you can compute functionals of the posterior distribution.

[10]:
x = np.linspace(0, data.get_sequence_length(), 1000)
y = np.array([p.to_pp()(x) for p in paths])
q = (
    np.quantile(y, [0.25, 0.5, 0.75], axis=0) * 2 * N0
)  # faster than calling .rescale() on each path
plt.plot(x, q[1])
plt.fill_between(x, q[0], q[2], alpha=0.3)
truth.draw()
plt.yscale("log")
_images/quickstart_18_0.png

Multiple samples

We can analyze multiple samples using a “trunk approximation” described in the paper. In this case, the returned segmentation approximates the earliest time to coalescence with any member of the panel.

[11]:
data = simulate(recombination_rate=1e-8, length=1e6, sample_size=20)
focal = 0
panel = list(range(1, 20))
xs = XSMC(data, focal=focal, panel=panel, rho_over_theta=1.0)
paths = xs.sample(k=100, seed=1)
[12]:
N0 = xs.theta / (4 * 1e-8)
for p in paths:
    p.rescale(2 * N0).draw(color="tab:blue", alpha=0.05)
truth = Segmentation.from_ts(data, focal=focal, panel=panel)
truth.draw(color="tab:red")
plt.yscale("log")
_images/quickstart_21_0.png

Viterbi Decoding

The other function of the XSMC class is to compute the Viterbi/most probable path under the posterior:

[13]:
data = simulate(recombination_rate=1e-9, length=10e6, sample_size=10)
xs = XSMC(data, 0, [1, 2], rho_over_theta=1e-9 / 1.4e-8)
N0 = xs.theta / (4 * 1e-8)
map_path = xs.viterbi()
map_path.rescale(2 * N0).draw(label="viterbi")
Segmentation.from_ts(data, 0, [1, 2]).draw(label="truth")
plt.yscale("log")
plt.legend()
[13]:
<matplotlib.legend.Legend at 0x7f0493c47190>
_images/quickstart_23_1.png

Note that each path also includes a haplotype component which is not plotted:

[14]:
map_path.segments[0]
[14]:
Segment(hap=2, interval=array([     0, 476205]), height=0.08770837623933274, mutations=21)

This says that the first segment in the MAP path had height 0.09 (in coalecent units) and coalesced with panel haplotype 2.

[ ]: