Our cross-modal alignment datasets focus on precise matching and temporal-semantic calibration across image-text, video-text, audio-text, and multi-modal combinations. Each sample is manually calibrated and semantically annotated so that the timeline, content logic, and scene correlation are consistent across modalities. The datasets cover daily life, games, advertising, education, and other scenarios, and include both short matched pairs and long-sequence cross-modal content.
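
One way to picture the timeline unification described above is as annotated spans on a shared clock, where two spans from different modalities count as matched when they overlap enough in time. The sketch below is illustrative only: the `Segment` fields, the `temporal_iou` and `is_aligned` helpers, the sample values, and the 0.5 threshold are all assumptions made for this example, not the datasets' actual schema or annotation tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A time span on the shared timeline, in seconds (hypothetical schema)."""
    modality: str  # e.g. "video", "audio", "text"
    start: float
    end: float
    label: str     # semantic annotation for this span


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two spans on the shared timeline."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union if union > 0 else 0.0


def is_aligned(a: Segment, b: Segment, threshold: float = 0.5) -> bool:
    """Treat two spans as aligned when their temporal IoU meets the threshold.

    The default threshold is an arbitrary value chosen for illustration.
    """
    return temporal_iou(a, b) >= threshold


# Made-up example: a video span and the caption annotated against it.
video = Segment("video", 2.0, 6.0, "player opens a chest")
caption = Segment("text", 2.0, 5.0, "The player opens the chest.")

print(temporal_iou(video, caption))  # 0.75
print(is_aligned(video, caption))    # True
```

Temporal overlap only captures the timeline part of alignment. Content logic and scene correlation are semantic judgments that the description assigns to manual annotation, so they are not modeled here.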