NVIDIA
NemoCurator Domain Classifier

A text classification model to classify documents into one of 26 domain classes.

Domain Classifier

Model Overview

This text classification model assigns each document to one of 26 domain classes:

'Adult', 'Arts_and_Entertainment', 'Autos_and_Vehicles', 'Beauty_and_Fitness', 'Books_and_Literature', 'Business_and_Industrial', 'Computers_and_Electronics', 'Finance', 'Food_and_Drink', 'Games', 'Health', 'Hobbies_and_Leisure', 'Home_and_Garden', 'Internet_and_Telecom', 'Jobs_and_Education', 'Law_and_Government', 'News', 'Online_Communities', 'People_and_Society', 'Pets_and_Animals', 'Real_Estate', 'Science', 'Sensitive_Subjects', 'Shopping', 'Sports', 'Travel_and_Transportation'

Model Architecture

The model architecture is DeBERTa V3 Base.
The context length is 512 tokens.
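
The 512-token limit means longer documents have to be shortened before they reach the model. The sketch below shows one plausible way to do that on a list of token IDs, assuming the usual encoder layout of one start token and one end token around the content. The function name `fit_to_context` and the ID values 1 and 2 are hypothetical placeholders, not taken from this model's tokenizer; in practice a tokenizer's own truncation option would normally handle this.

```python
MAX_LEN = 512  # context length stated in the model card


def fit_to_context(token_ids, cls_id, sep_id, max_len=MAX_LEN):
    """Keep only as many content tokens as fit between the two special tokens."""
    budget = max_len - 2  # reserve room for the start and end tokens
    return [cls_id] + list(token_ids[:budget]) + [sep_id]


# Hypothetical 700-token document; the IDs are placeholders.
long_doc = list(range(1000, 1700))
ids = fit_to_context(long_doc, cls_id=1, sep_id=2)
print(len(ids))
```

Truncating from the end keeps the opening of the document, which is only one possible policy; the source does not say how over-length inputs were handled during training or evaluation.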

Training (details)

Training data:

Training steps:

The model was trained in multiple rounds on Wikipedia and Common Crawl data, labeled using a combination of pseudo labels and the Google Cloud API.

How To Use This Model

Input

The model takes one or several paragraphs of text as input.
Example input:

Directions
1. Mix 2 flours and baking powder together
2. Mix water and egg in a separate bowl. Add dry to wet little by little
3. Heat frying pan on medium
4. Pour batter into pan and then put blueberries on top before flipping
5. Top with desired toppings!

Output

The model outputs one of the 26 domain classes as the predicted domain for each input sample.
Example output:

Food_and_Drink
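
A classifier of this kind typically produces one score per class, and the predicted domain is the class with the highest score. The following is a minimal sketch of that last step, not the model's actual inference code. It assumes the class indices follow the order in which the 26 domains are listed above; the real index-to-label mapping ships with the model and should be used instead. The helper names `softmax` and `predict_domain` are illustrative.

```python
import math

# Assumption: index order matches the list in the Model Overview.
DOMAINS = [
    "Adult", "Arts_and_Entertainment", "Autos_and_Vehicles",
    "Beauty_and_Fitness", "Books_and_Literature", "Business_and_Industrial",
    "Computers_and_Electronics", "Finance", "Food_and_Drink", "Games",
    "Health", "Hobbies_and_Leisure", "Home_and_Garden",
    "Internet_and_Telecom", "Jobs_and_Education", "Law_and_Government",
    "News", "Online_Communities", "People_and_Society", "Pets_and_Animals",
    "Real_Estate", "Science", "Sensitive_Subjects", "Shopping", "Sports",
    "Travel_and_Transportation",
]


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_domain(logits, labels=DOMAINS):
    """Return the highest-scoring label and its probability."""
    if len(logits) != len(labels):
        raise ValueError("expected one score per class")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up scores for illustration: one class clearly ahead of the rest.
logits = [0.0] * len(DOMAINS)
logits[DOMAINS.index("Food_and_Drink")] = 5.0
label, confidence = predict_domain(logits)
print(label)  # Food_and_Drink
```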

How to use in NeMo Curator

The inference code is available on NeMo Curator's GitHub repository. Download the model.pth file and check out this example notebook to get started.

Evaluation Benchmarks

Evaluation Metric: PR-AUC
PR-AUC score on the evaluation set of 105k samples: 0.9873
PR-AUC score for each domain:

Domain                     PR-AUC
Adult                      0.999
Arts_and_Entertainment     0.997
Autos_and_Vehicles         0.997
Beauty_and_Fitness         0.997
Books_and_Literature       0.995
Business_and_Industrial    0.982
Computers_and_Electronics  0.992
Finance                    0.989
Food_and_Drink             0.998
Games                      0.997
Health                     0.997
Hobbies_and_Leisure        0.984
Home_and_Garden            0.997
Internet_and_Telecom       0.982
Jobs_and_Education         0.993
Law_and_Government         0.967
News                       0.918
Online_Communities         0.983
People_and_Society         0.975
Pets_and_Animals           0.997
Real_Estate                0.997
Science                    0.988
Sensitive_Subjects         0.982
Shopping                   0.995
Sports                     0.995
Travel_and_Transportation  0.996
Mean                       0.9873
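
For readers unfamiliar with per-domain PR-AUC, the sketch below shows one common way to compute it: treat each domain as a one-vs-rest problem and take the average precision, a step-wise estimate of the area under the precision-recall curve. The source does not state which estimator or averaging scheme produced the numbers above, so this is an illustration of the metric rather than a reproduction of the evaluation. The function names and the toy data are hypothetical.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve for one class."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive sample")
    # Rank samples from highest to lowest score (ties keep input order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            ap += hits / rank  # precision at the point recall increases
    return ap / total_pos


def per_class_pr_auc(labels, probs, classes):
    """One-vs-rest average precision for every class."""
    result = {}
    for k, name in enumerate(classes):
        y_true = [1 if lab == name else 0 for lab in labels]
        result[name] = average_precision(y_true, [p[k] for p in probs])
    return result


# Toy data: four documents, two classes, made-up probabilities.
classes = ["Food_and_Drink", "Sports"]
labels = ["Food_and_Drink", "Sports", "Food_and_Drink", "Sports"]
probs = [[0.9, 0.1], [0.2, 0.8], [0.25, 0.75], [0.3, 0.7]]

pr_auc = per_class_pr_auc(labels, probs, classes)
mean_pr_auc = sum(pr_auc.values()) / len(pr_auc)
print(pr_auc, mean_pr_auc)
```

The mean here is an unweighted average over classes; a sample-weighted average would give a different figure when class sizes differ.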

License

Use of this model is covered by the Apache License 2.0. By downloading the public release version of the model, you accept the terms and conditions of the Apache License 2.0.
This repository contains the code for the domain classifier model.

Publisher: NVIDIA
Latest Version: 1.0
Updated: February 3, 2025 UTC
Compressed Size: 701.43 MB