Octopi

Object Property Reasoning with Large Tactile-Language Models

RSS 2024

1National University of Singapore, 2University of Washington

Summary: Physical property prediction and scenario reasoning with a large vision-language model (VLM) grounded in GelSight tactile videos.



Robot demonstration.


Overview

We introduce PhysiCLeAR, a new dataset containing GelSight tactile videos of everyday household objects. The videos are collected by hand with two exploratory procedures, pressing and rotation, and annotated for three useful physical properties: hardness, roughness and bumpiness. PhysiCLeAR leverages the videos and annotations to create five language-driven physical description and understanding tasks. We train and evaluate Octopi, a large VLM, on PhysiCLeAR for tactile-grounded physical understanding and scenario reasoning.

Model pipeline.

Our experiments show that Octopi accurately predicts physical properties from tactile videos and uses those properties to reason about and resolve scenarios.



PhysiCLeAR: Tactile Dataset

PhysiCLeAR is a new GelSight tactile dataset offering property, object, and material diversity across three useful physical properties: hardness, roughness and bumpiness. It contains 74 everyday household objects and 408 tactile videos, each annotated by three annotators for the physical properties.

Dataset items.
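
To make the dataset description concrete, here is a minimal sketch of iterating over a PhysiCLeAR-style download. The directory layout, the physiclear root, and the .mp4 extension used below are assumptions for illustration; consult the released dataset for the actual format.

      from pathlib import Path

      import cv2  # OpenCV, for decoding the tactile videos

      DATA_ROOT = Path("physiclear")  # hypothetical dataset root

      def load_video_frames(path: Path) -> list:
          """Read all frames of one GelSight tactile video."""
          cap = cv2.VideoCapture(str(path))
          frames = []
          while True:
              ok, frame = cap.read()
              if not ok:
                  break
              frames.append(frame)
          cap.release()
          return frames

      for video_path in sorted(DATA_ROOT.glob("**/*.mp4")):  # extension assumed
          frames = load_video_frames(video_path)
          print(video_path.name, len(frames), "frames")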

Tactile Image Samples



Physical Property Details

The selected physical object properties, along with their descriptions and semantic categories.

Dataset features.


GelSight Dataset Comparisons

PhysiCLeAR provides physical property labels that support tactile description and physical reasoning across three properties. We further compare it against existing GelSight datasets on three diversity measures. Property diversity refers to whether objects in the dataset vary across the three selected properties: hardness, roughness and bumpiness. Object diversity indicates whether the dataset contains more than one type of object. Material diversity indicates the number of different materials in the dataset.

Dataset features.


LLM Training & Evaluation Suite

PhysiCLeAR also contains five physical description and understanding tasks. We give each task's motivation and indicate whether it is used for Octopi's training and/or evaluation. Specific details about the prompt setup of each task can be found in our paper.

LLM training and evaluation suite.


Octopi: Tactile VLM

Octopi comprises three trained components, similar to prior large vision-language models: 1) a tactile input encoder, 2) a projection module, and 3) an LLM. We use the CLIP (ViT-L/14) visual encoder to extract feature representations from input tactile videos. The encoder's output is mapped into the LLM's word embedding space by a projection module (two linear layers). Finally, the LLM forms the language understanding component of Octopi; we use Vicuna v1.5, an open-source LLaMA-based LLM recognized for its dialogue capabilities. Language embeddings are produced by tokenization followed by Vicuna's word embedding layer, with <tact_start> and <tact_end> as newly trained word embeddings that mark the start and end of a tactile frame sequence from a single tactile sensor.

Model diagram.
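
As a concrete but non-authoritative illustration of this pipeline, the sketch below wires a CLIP ViT-L/14 visual encoder to a two-layer projection and Vicuna v1.5 using PyTorch and Hugging Face transformers. The class name TactileProjector, the GELU between the two linear layers, and the eight-frame input are assumptions for illustration, not the released implementation.

      import torch
      import torch.nn as nn
      from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPVisionModel

      class TactileProjector(nn.Module):
          """Maps CLIP visual features into the LLM's word embedding space."""
          def __init__(self, clip_dim: int = 1024, llm_dim: int = 4096):
              super().__init__()
              # Two linear layers, per the description above; the GELU in
              # between is an assumption, not confirmed by the paper text.
              self.proj = nn.Sequential(
                  nn.Linear(clip_dim, llm_dim),
                  nn.GELU(),
                  nn.Linear(llm_dim, llm_dim),
              )

          def forward(self, feats: torch.Tensor) -> torch.Tensor:
              return self.proj(feats)

      encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
      tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
      llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

      # Newly trained special tokens marking a tactile frame sequence.
      tokenizer.add_special_tokens(
          {"additional_special_tokens": ["<tact_start>", "<tact_end>"]}
      )
      llm.resize_token_embeddings(len(tokenizer))

      projector = TactileProjector()

      frames = torch.randn(8, 3, 224, 224)  # one tactile video as 8 GelSight frames
      feats = encoder(pixel_values=frames).pooler_output  # (8, 1024)
      tact_embeds = projector(feats)  # (8, 4096); spliced between the embeddings
                                      # of <tact_start> and <tact_end>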

Training Methodology

Octopi is trained in three steps. The fire emoji indicates that a component is trained in that step, while the snowflake emoji indicates that it is frozen.

Model diagram.
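
A minimal sketch of how such staged freezing might be implemented in PyTorch, reusing the encoder, projector, and llm objects from the architecture sketch above. The example stage shown is a placeholder; the actual per-step assignment of trained versus frozen components follows the figure and paper.

      import torch.nn as nn

      def set_trainable(module: nn.Module, trainable: bool) -> None:
          for p in module.parameters():
              p.requires_grad = trainable

      def configure_stage(stage: dict[str, bool]) -> None:
          """stage maps component name -> True (trained) / False (frozen)."""
          set_trainable(encoder, stage["encoder"])
          set_trainable(projector, stage["projector"])
          set_trainable(llm, stage["llm"])

      # Example stage (placeholder): train only the projection module,
      # keeping the encoder and LLM frozen.
      configure_stage({"encoder": False, "projector": True, "llm": False})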


Results

Physical Property Prediction

Octopi-7b and Octopi-13b perform above the random baseline on object property prediction and perform similarly to the Fine-tuned CLIP + Classification baseline, indicating that Octopi can be used for object property prediction. Octopi-13b has a higher combined accuracy (i.e., all three physical properties are correctly predicted for a given object) than Octopi-7b, suggesting that larger LLMs yield performance gains for tactile signal grounding.

Overall results.
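
For clarity, combined accuracy can be computed as below. The list-of-dicts layout and label strings are assumptions for illustration; only the "all three properties correct" criterion comes from the text above.

      # Combined accuracy: a sample counts as correct only when hardness,
      # roughness, and bumpiness are all predicted correctly.
      PROPERTIES = ("hardness", "roughness", "bumpiness")

      def combined_accuracy(preds: list[dict], labels: list[dict]) -> float:
          correct = sum(
              all(pred[k] == label[k] for k in PROPERTIES)
              for pred, label in zip(preds, labels)
          )
          return correct / len(labels)

      # Example: one sample with all three properties correct -> 1.0
      print(combined_accuracy(
          [{"hardness": "hard", "roughness": "rough", "bumpiness": "bumpy"}],
          [{"hardness": "hard", "roughness": "rough", "bumpiness": "bumpy"}],
      ))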


Scenario Reasoning

During scenario reasoning, we do not provide ground-truth property descriptions. Our experiments show that Octopi-7b and Octopi-13b perform above the random baseline, indicating that Octopi can be used for scenario reasoning. Furthermore, leveraging predicted object properties (i.e., the Object Property Description setting) significantly improves scenario reasoning for Octopi, supporting our overall hypothesis that grounding reasoning in physical properties helps on these tasks. Interestingly, the 7b model marginally outperforms the 13b model here. More details about the scenarios can be found in our paper.

Overall results.
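
A hedged sketch of the two-turn flow behind the Object Property Description setting: first elicit property descriptions for each tactile video, then reason over them. The chat callable and the prompt wording are illustrative, not the paper's exact prompts; in the real model, projected tactile embeddings are spliced between <tact_start> and <tact_end> rather than passed as text.

      # Two-turn prompting for scenario reasoning with property descriptions.
      # `chat` stands in for a call into the Octopi chat interface.
      def resolve_scenario(chat, object_names: list[str], scenario: str) -> str:
          descriptions = []
          for name in object_names:
              # Turn 1: ground each tactile signal in property words first.
              desc = chat(
                  f"<tact_start>[tactile video of {name}]<tact_end> "
                  "Describe this object's hardness, roughness, and bumpiness."
              )
              descriptions.append(f"{name}: {desc}")
          # Turn 2: reason over the descriptions to resolve the scenario.
          prompt = "\n".join(descriptions) + (
              f"\nScenario: {scenario}\nWhich object is most suitable, and why?"
          )
          return chat(prompt)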


Scenario Reasoning Examples


Rice (Cooked vs. Uncooked) Reasoning
Rice reasoning interaction.
Toothbrush Part Reasoning
Toothbrush part reasoning interaction.

Citation

If you use this work or find it helpful, please consider citing it.


      @article{yu2024octopi,
        title={Octopi: Object Property Reasoning with Large Tactile-Language Models},
        author={Yu, Samson and Lin, Kelvin and Xiao, Anxing and Duan, Jiafei and Soh, Harold},
        journal={arXiv preprint arXiv:2405.02794},
        year={2024}
      }