Human attention in image captioning
The dataset consists of two parts:

- capgaze1: 1,000 images with raw data (eye fixations and verbal descriptions) from 5 native English speakers. This part was used for the analysis. For data privacy reasons, the voices in the verbal descriptions were masked by pitch modulation; the spoken content was preserved.
- capgaze2: 3,000 images with processed data: for each image, we combined the eye fixations from all viewers into a single fixation map. This part was used to develop a saliency prediction model for the image captioning task.
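
The sketch below illustrates one common way such per-viewer fixations could be aggregated into a fixation map (accumulate fixation counts, then Gaussian-blur and normalize). It is not the authors' code; the input format (a list of (x, y) pixel coordinates per viewer), the function name `build_fixation_map`, and the blur width `sigma` are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_fixation_map(fixations_per_viewer, height, width, sigma=25.0):
    """Combine fixations from all viewers into one smooth fixation map.

    fixations_per_viewer: one sequence of (x, y) pixel coordinates per viewer.
    sigma: standard deviation of the Gaussian blur, in pixels (assumed value).
    """
    fix_map = np.zeros((height, width), dtype=np.float32)
    for fixations in fixations_per_viewer:
        for x, y in fixations:
            # Count each in-bounds fixation at its pixel location
            if 0 <= int(y) < height and 0 <= int(x) < width:
                fix_map[int(y), int(x)] += 1.0
    # Smooth the discrete counts into a continuous saliency-style map
    fix_map = gaussian_filter(fix_map, sigma=sigma)
    if fix_map.max() > 0:
        fix_map /= fix_map.max()  # normalize to [0, 1]
    return fix_map
```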