distilbert-base-uncased-finetuned-imdb

This model is a fine-tuned version of distilbert-base-uncased on the IMDb dataset.

Model description

This model uses DistilBERT (a smaller, faster, and cheaper version of BERT) and applies domain adaptation to movie reviews through continued masked language modeling.

While the original DistilBERT was pre-trained on English Wikipedia and BookCorpus (factual and literary data), this version is fine-tuned on the IMDb dataset to better understand the specific vocabulary, sentiment nuances, and context of movie reviews. This process allows the model to predict masked tokens that are contextually relevant to the film industry and subjective opinions.

Intended uses & limitations

Intended Uses:

  • Masked Language Modeling: The model can be used to fill in the blank ([MASK]) in sentences related to movies or reviews (see the sketch after this list).
  • Domain Adaptation Base: The model can serve as a better starting point (backbone) than vanilla DistilBERT for training a downstream classifier (e.g., a sentiment analyzer) on movie reviews.
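
A minimal fill-mask sketch using the Transformers pipeline API (the example sentence is arbitrary):

```python
from transformers import pipeline

# Load this checkpoint as a fill-mask pipeline
mask_filler = pipeline(
    "fill-mask",
    model="rajaykumar12959/distilbert-base-uncased-finetuned-imdb",
)

# Top predictions for the [MASK] token in a movie-review-style sentence
for pred in mask_filler("This is a great [MASK]."):
    print(f"{pred['token_str']}: {pred['score']:.3f}")
```

To reuse the adapted encoder as a classification backbone, it can be loaded with a freshly initialized classification head (the head weights are not part of this checkpoint and still need to be trained):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "rajaykumar12959/distilbert-base-uncased-finetuned-imdb"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# num_labels=2 for binary sentiment (positive/negative); the new head is randomly initialized
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```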

Limitations:

  • The model is trained on a downsampled version of the IMDb dataset (10,000 samples) for demonstration purposes, so it may not be as robust as a model trained on the full corpus.
  • It is biased towards the specific vocabulary found in internet movie reviews (which can be highly polarized).

Training and evaluation data

The model was trained on the Large Movie Review Dataset (IMDb).

Preprocessing (see the code sketch after this list):

  • Tokenization: Used the DistilBERT tokenizer with a chunk size of 128 tokens.
  • Masking: Random masking with a probability of 0.15 (15%).
  • Sampling: To speed up training for the tutorial, the dataset was downsampled:
    • Training set: 10,000 examples.
    • Test set: 1,000 examples (10% of training).
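
A minimal sketch of this preprocessing, following the standard Transformers masked-LM recipe (function names and the exact split call are illustrative assumptions, not taken from the original training script):

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
imdb = load_dataset("imdb")
chunk_size = 128

def tokenize_function(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all token sequences, then split them into fixed 128-token chunks
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

tokenized = imdb.map(tokenize_function, batched=True, remove_columns=["text", "label"])
lm_datasets = tokenized.map(group_texts, batched=True)

# Downsample: 10,000 training examples and 1,000 (10%) held out for evaluation
downsampled = lm_datasets["train"].train_test_split(
    train_size=10_000, test_size=1_000, seed=42
)

# Randomly mask 15% of tokens in each batch
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```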

Training procedure

Training hyperparameters

The following hyperparameters were used during training (see the TrainingArguments sketch after this list):

  • learning_rate: 2e-05
  • train_batch_size: 64
  • eval_batch_size: 64
  • seed: 42
  • optimizer: AdamW
  • weight_decay: 0.01
  • lr_scheduler_type: linear
  • num_epochs: 3.0
  • mixed_precision_training: Native AMP (fp16=True)
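
Expressed as a Transformers TrainingArguments configuration, this roughly corresponds to the sketch below (the output directory name is an assumption; AdamW is the Trainer's default optimizer):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilbert-base-uncased-finetuned-imdb",  # assumed name
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    num_train_epochs=3.0,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,  # Native AMP mixed precision
)
```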

Training results

The model achieved a significant reduction in perplexity compared to the pre-trained base model, indicating successful adaptation to the movie review domain.

Epoch   Perplexity
0       11.40
1       10.90
2       10.73

Note: The base model started with a perplexity of ~21.75 on this dataset before fine-tuning.
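
Perplexity here is the exponential of the evaluation cross-entropy loss. Assuming a Trainer instance named `trainer` built from the pieces sketched above, it can be computed as:

```python
import math

# trainer is assumed to be a transformers.Trainer wrapping the model,
# training_args, downsampled datasets, and data_collator shown earlier
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
```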

Framework versions

  • Transformers 4.57.2
  • Pytorch 2.9.0+cu126
  • Datasets 4.0.0
  • Tokenizers 0.22.1