Automatically describing visual content with natural language is a fundamental yet emerging challenge for computer vision. In particular, thanks to the recent development of Recurrent Neural Networks (RNNs), there has been tremendous interest in the task of image or video captioning, where each image or video is described with a single natural sentence. This task is also one of the focuses of our lab. To better explore technologies in this direction, we collected the original videos of the 2nd Microsoft Video to Language Challenge for our research.
In the 2nd MSR Video to Language Challenge, the training set, validation set, and test set of the 1st MSR Video to Language Challenge are combined into the new training data. An additional test set of around 3K video clips is released as the final evaluation set. As such, there are 10K video clips for training and 3K for testing this year. Each video is annotated with 20 natural sentences.
* In the MSR-VTT dataset, category information is provided for each video clip, and each clip contains audio information as well.
All video information and caption sentences are formatted in a JSON file as
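As a rough sketch of how such a caption file can be consumed, the snippet below parses an in-memory JSON sample and groups the caption sentences by clip. The layout assumed here (a `"videos"` list of per-clip metadata and a `"sentences"` list keyed by `"video_id"`) and all key names are assumptions for illustration, not a guaranteed match to the released file.

```python
import json
from collections import defaultdict

# Minimal in-memory sample mimicking the ASSUMED schema; the key names
# ("videos", "sentences", "video_id", "caption", "category") are
# placeholders and may differ from the actual challenge JSON.
sample = json.loads("""
{
  "videos": [
    {"video_id": "video0", "category": 9},
    {"video_id": "video1", "category": 3}
  ],
  "sentences": [
    {"video_id": "video0", "caption": "a man is singing"},
    {"video_id": "video0", "caption": "a person performs a song"},
    {"video_id": "video1", "caption": "a dog runs in a park"}
  ]
}
""")

# Group caption sentences by their video id.
captions = defaultdict(list)
for sent in sample["sentences"]:
    captions[sent["video_id"]].append(sent["caption"])

# Each clip's metadata can then be paired with its caption list.
for video in sample["videos"]:
    vid = video["video_id"]
    print(vid, video["category"], len(captions[vid]))
```

For the real dataset one would read the file with `json.load(open(path))` instead of the inline sample; the grouping step stays the same.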