The model incorporates a Sliding Parallel Residual Transformer Module (SPRT) that splits the standard Transformer encoder into two parts based on a windowing scheme: one part serves as a vision-text ...