Probability to find the sequence “Rar!” in a random (uniform) bytes stream at a position $\le n$

I have random (uniform distribution) stream of bytes (integers from $0$ upto $255$). What probability $p(n)$ to find the sequence “Rar!” (ASCII) at a position $\le n$ from the start of the stream?

Solutions Collecting From Web of "Probability to find the sequence “Rar!” in a random (uniform) bytes stream at a position $\le n$"

There’s a simple approximate answer and a more complicated exact answer.

For the approximate answer, we can neglect the fact that the probability of finding the string at some point isn’t exactly independent of the probability of having found it before that point. In this approximation, at every byte beginning with the fourth you have a $(1/256)^4$ chance of encountering your string, so the probability of having found it by $n$-th byte is approximately

$$1-\left(1-\left(\frac1{256}\right)^4\right)^{n-3}\;.$$

for $n\ge4$. (Of course it’s exactly $0$ for $n\lt4$.)

To get the exact answer, you can keep track of your state according to the number of consecutive matches you’ve just encountered. You start in state $0$. The probability to move to state $n+1$ from state $n$ is $1/256$, and the probability to return to state $0$ is $255/256$. Once you get to state $4$ you stay there. Thus you have a Markov process with transition matrix

$$\pmatrix{\frac{255}{256}&\frac1{256}&0&0&0\\\frac{255}{256}&0&\frac{1}{256}&0&0\\\frac{255}{256}&0&0&\frac{1}{256}&0\\\frac{255}{256}&0&0&0&\frac{1}{256}\\0&0&0&0&1}\;.$$

The initial probability distribution is $1$ in state $0$ and $0$ everywhere else, i.e. $(1,0,0,0,0)$. In each step, the transition matrix is applied to this vector from the right, so after $n$ steps the distribution is the initial distribution times the $n$-th power of the transition matrix. The powers of a matrix can be expressed succinctly using its eigenvectors and eigenvalues. Wolfram|Alpha will calculate the eigensystem of the matrix for you. (I don’t know how to get it to calculate the left eigenvectors, so I transposed the matrix.) Note that three eigenvalues are very small, so the corresponding components of the distribution will decay very quickly and will be negligible in practice. The initial distribution is approximately given by $\nu_1-\nu_2$, so the distribution after the $n$-th step is approximately $\nu_1-\lambda_2^n\nu_2$ (since $\lambda_1$ is exactly $1$), so the probability you want is about $1-\lambda_2^n$. If you click on “More Digits”, you’ll find that $\lambda_2\approx0.9999999997681$ is very close to but not exactly equal to $1-(1/256)^4\approx0.9999999997672$, so the first approximation was quite a good one.

Consider the Markov chain with states corresponding to $0$,”R”,”Ra”,”Rar” and “Rar!” and transition matrix
$$ P = \pmatrix{\frac{255}{256} & \frac{1}{256} & 0 & 0 & 0\cr
\frac{254}{256} & \frac{1}{256} & \frac{1}{256} & 0 & 0\cr
\frac{254}{256} & \frac{1}{256} & 0 & \frac{1}{256} & 0 \cr
\frac{254}{256} & \frac{1}{256} & 0 & 0 & \frac{1}{256}\cr
0 & 0 & 0 & 0 & 1\cr} $$
Then what you’re looking for is $(P^n)_{15}$.
The solution will be of the form $p(n) = 1 + \sum_r c_r r^n$ where the sum is over the roots of the polynomial ${t}^{4}-t^3 + 1/256^4$, and the $c_r$ are chosen so that $p(0) = p(1) = p(2) = p(3) = 0$.

EDIT: using Maple’s rsolve to solve the recurrence $-{s}^{4}p \left( n \right) +{s}^{4}p \left( n+1 \right) +p \left( n+3
\right) -2\,p \left( n+4 \right) +p \left( n+5 \right) =0$ with $p(0)=p(1)=p(2)=p(3)=0, p(4) = s^4$ (where $s = 1/256$) I get
$$ p(n) = 1 + \sum_r \frac{1}{(4 r^3 s^4 – 1) r^{n+1}}$$
where the sum is over the roots $r$ of $s^4 z^4 – z + 1$. Three
of the roots are quite large (giving very rapidly-decaying terms),
one is very near 1: in fact it is
$$ r_4 = 1+s^4+4 s^8+22 s^{12}+140s^{16}+\ldots = {}_3F_2(1/4,1/2,3/4;\,2/3,4/3;\,{\frac {256}{27}}\,{s}^{4})$$.
which for $s = 1/256$ is approximately $1.00000000023283064387071006368$.
Thus for all but very small $n$, $p(n)$ is very close to
$$1- 1.00000000093132257613336155988\,{( 1.00000000023283064387071006368)}
^{-n- 1}$$

Set $p=1/256^4$. Then the inclusion-exclusion principle gives the exact formula
$$p(n)=\sum_{m \geq1}(-1)^{m+1}{n-3m\choose m} p^m.$$
Only finitely many terms in the sum are non-zero, but there are about $n/4$ of them,
so the full formula may not be too practical. On the other hand, partial sums alternate
between upper and lower bounds on the true value $p(n)$.

In practice, the Poisson approximation may be your best bet.
It is easy to understand and work with:
$$p(n)\approx 1-\exp(-np).$$

EDIT: As pointed out in the comments below, this answer is incorrect. However, it is a good approximation, to 16 or 17 decimal places (before exponentiation by n).


The probability of finding a given character at any position is $1/256 = 2^{-8}$, so the probability of finding “Rar!” at any position is $(2^{-8})^4 = 2^{-32}.$

The (unconditional) probability of finding “Rar!” at position 0 is $2^{-32}$.
The (unconditional) probability of finding “Rar!” at position 1 is $2^{-32}$.
$\vdots$
The (unconditional) probability of finding “Rar!” at position n is $2^{-32}$.

These are all unconditional (i.e., independent) probabilities. But if “Rar!” were to occur at position $i$, it can’t occur at position $i+1$, $i+2$, or $i+3$, so we can’t really use these unconditional probabilities, and it’s actually kind of tricky to calculate directly the probability that “Rar!” will occur at least once in the first $n$ positions.

It’s much easier to calculate the probability that “Rar!” won’t occur in the first $n$ positions, because the act of “Rar!” not occurring does not affect the future probability of “Rar!” ocurring. The probability that it won’t occur in a particular position is $(1-2^{-32})$. The probability that it won’t occur in the first $n$ positions is $(1-2^{-32})^n$.

So, to answer your question, the probability that “Rar!” will occur in the first $n$ positions is
$$ p(n) = 1 – (1 – 2^{-32})^n.$$

Finally, I would just like to remark that although bytes range from 0 to 255, the ASCII encoding only ranges from 0 to 127, and a good chunk of those are control characters like “delete” or “start of header”, that don’t display an actual symbol/letter/digit. If it’s up to you, perhaps you could choose a narrower range, if you’re looking for strings in randomly generated text.

Happy number crunching!