I'd say you're the one being pedantic ;)
That's what I mean: I'm being very, very pedantic about the definition of "Markov" :)
I'm also not going to claim this is in any way useful for reasoning about LLMs.
It's a pretty serious contortion to call an LLM a Markov Chain.
Is it? Let's say your LLM has a maximum context of 1000 tokens. Your state at any time is 1000 tokens, plus 1000 bools indicating the presence or absence of a token.
Your state transition function is the transformer network.
Now you can generate state N+1 from state N without reference to state N-1!
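Just to make that concrete, here's a toy sketch of the framing (nothing real here: dummy_transformer_logits, VOCAB and MAX_LEN are placeholders standing in for the actual transformer and its context window). The state is a fixed-length token buffer plus occupancy bools, and the transition only ever reads the current state:

    import numpy as np

    VOCAB = 50_000     # placeholder vocabulary size
    MAX_LEN = 1000     # the "maximum of 1000 tokens" from above

    def dummy_transformer_logits(tokens, occupied, rng):
        # Stand-in for the transformer: maps the current state to next-token logits.
        # A real model would attend over tokens[occupied]; here we just return noise.
        return rng.normal(size=VOCAB)

    def transition(state, rng):
        # Markov transition: state N -> state N+1, with no reference to state N-1.
        tokens, occupied = state
        logits = dummy_transformer_logits(tokens, occupied, rng)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        next_token = rng.choice(VOCAB, p=probs)

        tokens = tokens.copy()
        occupied = occupied.copy()
        i = occupied.sum()              # first free slot
        if i < MAX_LEN:
            tokens[i] = next_token
            occupied[i] = True
        return tokens, occupied

    rng = np.random.default_rng(0)
    state = (np.zeros(MAX_LEN, dtype=np.int64), np.zeros(MAX_LEN, dtype=bool))
    for _ in range(5):
        state = transition(state, rng)  # each step depends only on the current state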
There is some sleight of hand here. Implementations of Markov processes do have a variety of formulations: for example, producing a probability distribution from a state, then sampling the distribution and blammo! New state.
But that's a specific case of [new state] = f([old state],[noise])
That function could equally well be a transformer run on the old state with some noise injected. There's no explicit modelling of the probability distribution, but does there need to be?
Rather trivially, the transformer perfectly models itself, so while we don't have an explicit formulation of the distribution that we can sample, we do have a perfect representation of the generate-and-sample function.
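To be concrete about f([old state], [noise]): the standard Gumbel-max trick turns categorical sampling into exactly that kind of deterministic function of the logits plus injected noise (not claiming any particular LLM samples this way; it's just an illustration):

    import numpy as np

    def sample_next_token(logits, noise_uniform):
        # Deterministic given the noise: argmax over logits + Gumbel noise
        # is an exact sample from softmax(logits).
        gumbel = -np.log(-np.log(noise_uniform))
        return int(np.argmax(logits + gumbel))

    rng = np.random.default_rng(1)
    logits = rng.normal(size=10)             # would come from the transformer
    noise = rng.uniform(size=logits.shape)   # the [noise] argument
    token = sample_next_token(logits, noise) # part of the new state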
We're talking hundreds of millions of significant digits for the token IDs.
I'm not sure I get why. I think all you need is simply the old tokens with the newly generated one concatenated on. Or, more pedantically, the equivalent of that implemented with a fixed-length buffer and a set of occupancy bools.
With that you never need a previous state because all the previous state is in the current state, which is possible because the state is fixed in size and not all that big.
Is it cheating to simply fold the old state into the new state? Isn't that what velocity and acceleration are with respect to position in a Kalman filter?
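For what it's worth, the standard constant-acceleration model is exactly that trick: position alone would need history, but fold velocity and acceleration into the state and the transition only needs the current state (numbers below are made up):

    import numpy as np

    dt = 0.1
    F = np.array([[1, dt, 0.5 * dt**2],   # position update
                  [0, 1,  dt],            # velocity update
                  [0, 0,  1]])            # acceleration carried forward

    state = np.array([0.0, 2.0, -0.5])    # [position, velocity, acceleration]
    for _ in range(10):
        state = F @ state                 # next state from current state only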