A text classification model to classify documents into one of 26 domain classes.
Domain Classifier
Model Overview
This is a text classification model to classify documents into one of 26 domain classes:
'Adult', 'Arts_and_Entertainment', 'Autos_and_Vehicles', 'Beauty_and_Fitness', 'Books_and_Literature', 'Business_and_Industrial', 'Computers_and_Electronics', 'Finance', 'Food_and_Drink', 'Games', 'Health', 'Hobbies_and_Leisure', 'Home_and_Garden', 'Internet_and_Telecom', 'Jobs_and_Education', 'Law_and_Government', 'News', 'Online_Communities', 'People_and_Society', 'Pets_and_Animals', 'Real_Estate', 'Science', 'Sensitive_Subjects', 'Shopping', 'Sports', 'Travel_and_Transportation'
Model Architecture
The model architecture is DeBERTa V3 Base.
The context length is 512 tokens.
Training (details)
Training data:
- 1 million Common Crawl samples, labeled using Google Cloud’s Natural Language API: https://cloud.google.com/natural-language/docs/classifying-text
- 500k Wikipedia articles, curated using Wikipedia-API: https://pypi.org/project/Wikipedia-API/
Training steps:
The model was trained in multiple rounds on the Wikipedia and Common Crawl data, with labels drawn from a combination of pseudo-labels and the Google Cloud Natural Language API.
How To Use This Model
Input
The model takes one or several paragraphs of text as input.
Example input:
Output
The model outputs one of the 26 domain classes as the predicted domain for each input sample.
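As a minimal sketch of how a single predicted domain can be derived from a 26-way classifier's raw scores, the snippet below applies a softmax and picks the highest-probability class. The `DOMAINS` ordering, the `predict_domain` helper, and the sample logits are illustrative assumptions, not part of the released inference code; the real index-to-label mapping should be taken from the model's own configuration.

```python
import math

# Assumption: class indices follow the order in which this card lists the domains.
DOMAINS = [
    "Adult", "Arts_and_Entertainment", "Autos_and_Vehicles",
    "Beauty_and_Fitness", "Books_and_Literature", "Business_and_Industrial",
    "Computers_and_Electronics", "Finance", "Food_and_Drink", "Games",
    "Health", "Hobbies_and_Leisure", "Home_and_Garden",
    "Internet_and_Telecom", "Jobs_and_Education", "Law_and_Government",
    "News", "Online_Communities", "People_and_Society", "Pets_and_Animals",
    "Real_Estate", "Science", "Sensitive_Subjects", "Shopping", "Sports",
    "Travel_and_Transportation",
]


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_domain(logits, labels=DOMAINS):
    """Hypothetical helper: return (label, probability) of the top-scoring class."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per domain class")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits for one document: every class scores 0 except "Science".
logits = [0.0] * len(DOMAINS)
logits[DOMAINS.index("Science")] = 5.0

label, confidence = predict_domain(logits)
print(label, round(confidence, 3))
```

Returning the probability alongside the label makes it possible to filter out low-confidence predictions downstream, if that is useful for a given curation pipeline.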
Example output:
How to use in NeMo Curator
The inference code is available on NeMo Curator's GitHub repository. Download the model.pth file and check out this example notebook to get started.
Evaluation Benchmarks
Evaluation Metric: PR-AUC
PR-AUC score on an evaluation set of 105k samples: 0.9873
PR-AUC score for each domain:
| Domain | PR-AUC |
|---|---|
| Adult | 0.999 |
| Arts_and_Entertainment | 0.997 |
| Autos_and_Vehicles | 0.997 |
| Beauty_and_Fitness | 0.997 |
| Books_and_Literature | 0.995 |
| Business_and_Industrial | 0.982 |
| Computers_and_Electronics | 0.992 |
| Finance | 0.989 |
| Food_and_Drink | 0.998 |
| Games | 0.997 |
| Health | 0.997 |
| Hobbies_and_Leisure | 0.984 |
| Home_and_Garden | 0.997 |
| Internet_and_Telecom | 0.982 |
| Jobs_and_Education | 0.993 |
| Law_and_Government | 0.967 |
| News | 0.918 |
| Online_Communities | 0.983 |
| People_and_Society | 0.975 |
| Pets_and_Animals | 0.997 |
| Real_Estate | 0.997 |
| Science | 0.988 |
| Sensitive_Subjects | 0.982 |
| Shopping | 0.995 |
| Sports | 0.995 |
| Travel_and_Transportation | 0.996 |
| Mean | 0.9873 |
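This card does not state how the per-domain PR-AUC values were computed. One common approach, sketched below as an assumption, treats each domain as a one-vs-rest binary problem and summarizes its precision-recall curve with average precision. The function names and the toy labels and scores are hypothetical and are not the evaluation data.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve for one class.

    y_true holds 0/1 relevance flags, scores holds the class scores.
    Assumes scores are distinct; tied scores are not grouped here.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank samples from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            true_pos += 1
            # Precision at this rank; recall grows by 1 / n_pos at each hit.
            total += true_pos / rank
    return total / n_pos


def per_class_pr_auc(true_labels, score_rows, classes):
    """One-vs-rest average precision for every class name in `classes`."""
    result = {}
    for col, name in enumerate(classes):
        y = [1 if t == name else 0 for t in true_labels]
        s = [row[col] for row in score_rows]
        result[name] = average_precision(y, s)
    return result


# Toy example with three of the 26 domains and made-up scores.
classes = ["News", "Science", "Sports"]
true_labels = ["News", "Science", "Sports", "News"]
score_rows = [
    [0.70, 0.20, 0.10],
    [0.15, 0.80, 0.05],
    [0.25, 0.05, 0.70],
    [0.20, 0.68, 0.12],
]

per_class = per_class_pr_auc(true_labels, score_rows, classes)
macro_mean = sum(per_class.values()) / len(per_class)
for name, value in per_class.items():
    print(f"{name}: {value:.3f}")
print(f"Mean: {macro_mean:.4f}")
```

The unweighted mean used here is only one way to aggregate per-class scores; a support-weighted average would give a different overall figure.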
License
Use of this model is covered by the Apache License 2.0. By downloading the public and release version of the model, you accept the terms and conditions of the Apache License 2.0.
This repository contains the code for the domain classifier model.