Our multimodal understanding datasets provide accurate semantic alignment across image, video, text, and audio, and are built specifically for training multimodal large language models (MLLMs). They cover everyday scenes, game content, industrial scenarios, panoramic spaces, and real-life interaction samples. Each group of multimodal data is manually matched and annotated for cross-modal semantic correlation.