distilbert-base-uncased-finetuned-imdb

This model is a fine-tuned version of distilbert-base-uncased on the IMDb dataset.

Model description

This model uses DistilBERT (a smaller, faster, and cheaper version of BERT) and applies domain adaptation to movie reviews through continued masked language modeling.

While the original DistilBERT was pre-trained on English Wikipedia and BookCorpus (factual and literary data), this version is fine-tuned on the IMDb dataset to better understand the specific vocabulary, sentiment nuances, and context of movie reviews. This process allows the model to predict masked tokens that are contextually relevant to the film industry and subjective opinions.

Intended uses & limitations

Intended Uses:

  • Masked Language Modeling: The model can be used to fill in the blank ([MASK]) in sentences related to movies or reviews (see the sketch after this list).
  • Domain Adaptation Base: The model can serve as a better starting point (backbone) than vanilla DistilBERT for training a downstream classifier (e.g., a sentiment analyzer) on movie reviews.
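
A minimal fill-mask sketch using the Transformers pipeline API (the example sentence is arbitrary):

```python
from transformers import pipeline

# Load this checkpoint as a fill-mask pipeline
mask_filler = pipeline(
    "fill-mask",
    model="rajaykumar12959/distilbert-base-uncased-finetuned-imdb",
)

# Top predictions for the [MASK] token in a movie-review-style sentence
for pred in mask_filler("This is a great [MASK]."):
    print(f"{pred['token_str']}: {pred['score']:.3f}")
```

To reuse the adapted encoder as a classification backbone, it can be loaded with a freshly initialized classification head (the head weights are not part of this checkpoint and still need to be trained):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "rajaykumar12959/distilbert-base-uncased-finetuned-imdb"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# num_labels=2 for binary sentiment (positive/negative); the new head is randomly initialized
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```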

Limitations:

  • The model is trained on a downsampled version of the IMDb dataset (10,000 samples) for demonstration purposes, so it may not be as robust as a model trained on the full corpus.
  • It is biased towards the specific vocabulary found in internet movie reviews (which can be highly polarized).

Training and evaluation data

The model was trained on the Large Movie Review Dataset (IMDb).

Preprocessing (see the code sketch after this list):

  • Tokenization: Used the DistilBERT tokenizer with a chunk size of 128 tokens.
  • Masking: Random masking with a probability of 0.15 (15%).
  • Sampling: To speed up training for the tutorial, the dataset was downsampled:
    • Training set: 10,000 examples.
    • Test set: 1,000 examples (10% of training).
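
A minimal sketch of this preprocessing, following the standard Transformers masked-LM recipe (function names and the exact split call are illustrative assumptions, not taken from the original training script):

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
imdb = load_dataset("imdb")
chunk_size = 128

def tokenize_function(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all token sequences, then split them into fixed 128-token chunks
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

tokenized = imdb.map(tokenize_function, batched=True, remove_columns=["text", "label"])
lm_datasets = tokenized.map(group_texts, batched=True)

# Downsample: 10,000 training examples and 1,000 (10%) held out for evaluation
downsampled = lm_datasets["train"].train_test_split(
    train_size=10_000, test_size=1_000, seed=42
)

# Randomly mask 15% of tokens in each batch
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```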

Training procedure

Training hyperparameters

The following hyperparameters were used during training (see the TrainingArguments sketch after this list):

  • learning_rate: 2e-05
  • train_batch_size: 64
  • eval_batch_size: 64
  • seed: 42
  • optimizer: AdamW
  • weight_decay: 0.01
  • lr_scheduler_type: linear
  • num_epochs: 3.0
  • mixed_precision_training: Native AMP (fp16=True)
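
Expressed as a Transformers TrainingArguments configuration, this roughly corresponds to the sketch below (the output directory name is an assumption; AdamW is the Trainer's default optimizer):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilbert-base-uncased-finetuned-imdb",  # assumed name
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    num_train_epochs=3.0,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,  # Native AMP mixed precision
)
```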

Training results

The model achieved a significant reduction in perplexity compared to the pre-trained base model, indicating successful adaptation to the movie review domain.

Epoch   Perplexity
0       11.40
1       10.90
2       10.73

Note: The base model started with a perplexity of ~21.75 on this dataset before fine-tuning.
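
Perplexity here is the exponential of the evaluation cross-entropy loss. Assuming a Trainer instance named `trainer` built from the pieces sketched above, it can be computed as:

```python
import math

# trainer is assumed to be a transformers.Trainer wrapping the model,
# training_args, downsampled datasets, and data_collator shown earlier
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
```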

Framework versions

  • Transformers 4.57.2
  • Pytorch 2.9.0+cu126
  • Datasets 4.0.0
  • Tokenizers 0.22.1