Dataset Description

Visually impaired and blind users are usually able to broadly recognize objects by touch and sound. It is much harder for them to identify specific brands, models, or flavors. In this paper we propose a system designed to provide automatic visual feedback on objects of interest to visually impaired users, helping them identify a specific instance within a given category. Our visual recognition module is implemented as an image retrieval procedure, which is highly scalable and effective for this purpose.

Glassense-Vision is a dataset we acquired and annotated for the purpose of providing a quantitative and repeatable assessment of the proposed method. The dataset includes 7 different use cases, i.e. different object categories; for each of them we provide training images (reference images, also used to build dictionaries) and test images. All images in the dataset are manually annotated. The use cases (object categories) can be grouped into three main geometrical types:

  • Flat objects - the banknote use case, a particularly meaningful flat, non-rigid, degradable type of object.
  • Parallelepiped objects (boxes) - the cereal-box and medicine-box use cases.
  • Cylindrical objects (bottles and jars) - the water bottle, bean can, tomato sauce jar, and deodorant use cases.

The following table summarizes the size of the dataset and the heterogeneity of the different object categories:

Gallery images were acquired in a relatively uniform setting, with the object lying on a surface. Most query data, instead, were acquired while the user held the object in their hands, and in a variety of scenarios, in order to test the system against different variations: partial information, rotations, and small viewpoint and scale changes. Illumination changes can also be observed across the images in the query datasets. All images are stored at a resolution of 665x1182 pixels. Overall, 3 users participated in the acquisition.

Sample images from the gallery: each image shows a different use case, acquired on a uniform background.

Some additional information is needed for the banknote case. The banknote gallery is built on the Euro currency, where the five classes are EUR 5 (first and second series), 10 (first and second series), 20, 50, and 100.

The dataset can be downloaded from this page, see details below. The material given for each object category includes:

  • the training images (gallery), manually annotated with the class and the object instance.
  • the dictionaries for each object category, which can be used to build image descriptors.
  • the test images (query), manually annotated with the class and the object instance.

In our paper, we also divided the query dataset according to the different transformations; this subdivision is detailed in the readme files.


Click on the links below to download zip files (about 2GB each) of the seven object categories.

Two binary file formats are used:

.mat format

This format is used to store the GMM dictionaries for Fisher vectors (Fisher encoding uses a Gaussian Mixture Model as a visual word dictionary). A GMM is a collection of K Gaussian distributions, each representing a cluster of data points. To build a Fisher vector of fixed dimension, three matrices representing the GMM are needed: means, covariances, and priors. These matrices are stored in individual .mat files, where the rows correspond to the SURF descriptor components (64) and the columns to the K Gaussian distributions.
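As a concrete illustration, a simplified Fisher-vector encoding from the three GMM matrices can be sketched in numpy as follows. This is a minimal sketch, not the paper's implementation: it assumes diagonal covariances and the 64 x K matrix layout described above, and the function name and synthetic values are our own.

```python
import numpy as np

def fisher_vector(X, means, covars, priors):
    """Encode local descriptors X (N x D) with a GMM given as
    means (D x K), covars (D x K, diagonal), priors (K,).
    Returns a 2*K*D vector of mean and variance gradient statistics."""
    N, D = X.shape
    K = means.shape[1]
    # Log-likelihood of each descriptor under each Gaussian component
    log_prob = np.empty((N, K))
    for k in range(K):
        diff = (X - means[:, k]) / np.sqrt(covars[:, k])
        log_prob[:, k] = (np.log(priors[k])
                          - 0.5 * np.sum(np.log(2 * np.pi * covars[:, k]))
                          - 0.5 * np.sum(diff ** 2, axis=1))
    # Posterior responsibilities (soft assignments), N x K
    log_prob -= log_prob.max(axis=1, keepdims=True)
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Accumulate gradients w.r.t. means (u) and variances (v)
    parts = []
    for k in range(K):
        diff = (X - means[:, k]) / np.sqrt(covars[:, k])   # N x D
        g = gamma[:, k][:, None]
        u = (g * diff).sum(axis=0) / (N * np.sqrt(priors[k]))
        v = (g * (diff ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * priors[k]))
        parts.extend([u, v])
    return np.concatenate(parts)
```

For D = 64 SURF components and K Gaussians, the resulting descriptor has the fixed dimension 2*K*64, which is what makes the encoding independent of the number of local descriptors per image.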

.fvecs format

This format is used to store visual dictionaries. An .fvecs file stores a set of centroids representing a visual dictionary. There is no header: the centroids are stored raw, one after the other. Each centroid takes 260 bytes, as detailed below.

Field        Field type     Description
desdim       int            descriptor dimension
components   float*desdim   the centroid components
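The record layout above (a 4-byte int32 dimension followed by desdim 4-byte float32 components, i.e. 4 + 64*4 = 260 bytes per centroid) can be parsed with a few lines of numpy. The sketch below writes a small synthetic .fvecs file and reads it back; the file name and centroid values are illustrative, not part of the dataset.

```python
import numpy as np

def read_fvecs(path):
    """Read an .fvecs file: a headerless sequence of records, each an
    int32 dimension followed by that many float32 components."""
    raw = np.fromfile(path, dtype=np.int32)
    dim = raw[0]                   # desdim field of the first record
    record = dim + 1               # 1 int32 + dim float32 values
    assert raw.size % record == 0, "truncated .fvecs file"
    mat = raw.reshape(-1, record)
    assert (mat[:, 0] == dim).all(), "inconsistent descriptor dimensions"
    # Reinterpret the component columns as float32 without changing bytes
    return mat[:, 1:].copy().view(np.float32)

# Round-trip demo with synthetic 64-dimensional centroids (260 bytes each)
centroids = np.arange(3 * 64, dtype=np.float32).reshape(3, 64)
with open("demo.fvecs", "wb") as f:
    for c in centroids:
        np.array([c.size], dtype=np.int32).tofile(f)
        c.tofile(f)
loaded = read_fvecs("demo.fvecs")
```

Reading the whole file as int32 and then reinterpreting the component columns as float32 works because both types are 4 bytes wide, matching the fixed 260-byte record size.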

Click on the link below to download the Matlab scripts to handle the data and a simple demo:

Evaluation Protocol

We have used the following protocol to evaluate our object-instance recognition system. Given an object category and the query dataset associated with it, performance is measured by the recognition rate over all query images, i.e. the percentage of correctly recognized images in the query dataset.
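The metric is straightforward to compute; a minimal sketch follows (the function name and the example labels are illustrative, not taken from the dataset annotations):

```python
def recognition_rate(predicted, ground_truth):
    """Percentage of query images whose predicted instance label
    matches the annotated ground-truth label."""
    assert len(predicted) == len(ground_truth) and len(ground_truth) > 0
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return 100.0 * correct / len(ground_truth)

# e.g. 3 of 4 queries recognized correctly -> 75.0
rate = recognition_rate(["eur5", "eur10", "eur20", "eur50"],
                        ["eur5", "eur10", "eur20", "eur100"])
```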

Copyright Notice

SLIPGURU is the copyright holder of all the images included in the dataset. If you use this dataset, please cite the following paper: Joan Sosa-Garcia and Francesca Odone. "Hands on recognition: adding a vision touch to tactile and sound perception for visually impaired users" (submitted to the IEEE Transactions on Human-Machine Systems). This data set is provided "as is" and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose.

Contacts and Acknowledgements

This dataset has been created in the context of the GLASSENSE project, a regional project developed within the SI4Life Ligurian Regional Hub - Research and Innovation - Life Sciences.

For more information: Joan.Sosa.Garcia <at>; Francesca.Odone <at>