Abstract: Image-sentence matching is a critical task for bridging the discrepancy between visual and textual data, which arises from their heterogeneous modalities. Great progress has been made by exploring the coarse-grained ...