What is a dataset?

A dataset is a collection of files used in the fine-tuning process. It consists of a mandatory training file and an optional validation file.
  • Training File (Required): Contains the data used to teach the model. The model learns from these examples and adjusts its internal weights accordingly.
  • Validation File (Optional): Contains data not present in the training set. It is used to gauge how well the new model performs on unseen data, which helps detect overfitting. A validation file is highly recommended for robust model evaluation.
If you do not provide a validation file, the fine-tuning service randomly selects 1% of your training data, uses it to evaluate the model at the end of the fine-tuning process, and reports the resulting evaluation metrics.
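If you would rather choose the validation rows yourself, you can split a single CSV into a training file and a validation file before uploading. The following is a minimal sketch, not part of the service: the function name split_csv, the 10% default, and the file names are illustrative assumptions. It also assumes a header on the first line, one record per line (no newlines inside quoted fields), and that shuf from GNU coreutils is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split one CSV into training and validation files.
# Assumes a header row and exactly one record per line.
split_csv() {
  local src="$1" train_out="$2" valid_out="$3" percent="${4:-10}"
  local header rows n_valid tmp

  header="$(head -n 1 "$src")"
  tmp="$(mktemp)"

  # Shuffle every data row (everything after the header).
  tail -n +2 "$src" | shuf > "$tmp"
  rows="$(grep -c '' "$tmp" || true)"

  # Integer share of rows for validation; keep at least one row
  # whenever there is more than one data row to split.
  n_valid=$(( rows * percent / 100 ))
  if [ "$n_valid" -lt 1 ] && [ "$rows" -gt 1 ]; then
    n_valid=1
  fi

  # Both output files get their own copy of the header.
  { printf '%s\n' "$header"; head -n "$n_valid" "$tmp"; } > "$valid_out"
  { printf '%s\n' "$header"; tail -n +"$(( n_valid + 1 ))" "$tmp"; } > "$train_out"
  rm -f "$tmp"
}

# Example usage (file names are placeholders):
# split_csv all_examples.csv train.csv validation.csv 10
```

Because the rows are shuffled before the split, the two output files never share a row, which is the property a validation file needs.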

Data Formatting Requirements

1. File Format
The fine-tuning service accepts files only in CSV (Comma-Separated Values) format.

2. Column Structure
Your CSV files must contain the following columns:
  • prompt (Optional): This column should contain the input text, instruction, or question for the model.
  • answer (Required): This column must contain the desired output or response from the model.
Here is an example of a valid CSV file for fine-tuning:
question,answer
"What is the capital of France?","The capital of France is Paris."
"Who wrote “To Kill a Mockingbird”?","Harper Lee wrote “To Kill a Mockingbird”."
"Explain the theory of relativity in simple terms.","The theory of relativity, developed by Albert Einstein, describes how gravity is a property of spacetime, and how space and time are linked."
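Since the answer column is required, a quick local check of the header row can catch a malformed file before you upload it. The sketch below is one possible check; the function name has_answer_column is hypothetical, and it assumes the header is the first line, is comma-delimited, and that column names may be wrapped in double quotes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the CSV header contains a column
# named exactly "answer".
has_answer_column() {
  head -n 1 "$1" \
    | tr -d '\r"' \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -qx 'answer'
}

# Example usage (file name is a placeholder):
# has_answer_column train.csv || echo "train.csv has no answer column" >&2
```

The check looks only at column names. It does not verify that every row has a value in the answer column.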

Dataset Overview

A dataset is the artifact used by a fine-tuning job. It consists of a required training file and an optional validation file. The diagram below illustrates the relationship between a dataset, its component files, and the required format.

Dataset Management Workflow

Once you have prepared your training and validation files in the required CSV format, you can create a dataset to start fine-tuning your model.

Prerequisites

Before you begin, ensure you have the following:
  • Service Token (JWT): A valid JWT is required to authenticate your requests. Please see our guide on how to create a service token.
  • Organization ID: You can find your Organization ID by navigating to Settings → Organisation in the Nscale platform.
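The commands in this guide read these two values from the environment variables $NSCALE_API_TOKEN and $ORGANIZATION_ID. As a small convenience, a script can confirm that both are set before it sends any request. The helper below is a sketch; the name require_env is hypothetical, and it relies on bash indirect expansion, so it will not work in plain sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report every named environment variable that is
# unset or empty, and return non-zero if any is missing.
require_env() {
  local name missing=0
  for name in "$@"; do
    # ${!name} expands to the value of the variable whose name is in $name.
    if [ -z "${!name:-}" ]; then
      echo "Missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage at the top of a script:
# require_env ORGANIZATION_ID NSCALE_API_TOKEN || exit 1
```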

Create Dataset

Step 1: Upload Your Files

Your training and optional validation CSV files must be uploaded individually. Each successful upload returns a response containing a unique id. Save the id of each uploaded file; you will need these ids in the next step to create your dataset.
curl -X POST "https://fine-tuning.api.nscale.com/api/v1/organizations/$ORGANIZATION_ID/files" \
  -H "Authorization: Bearer $NSCALE_API_TOKEN" \
  -H 'Content-Type: multipart/form-data' \
  -H 'Accept: application/json' \
  -F 'file=@"<PATH_TO_FILE>"'
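To avoid copying ids by hand, the upload can be wrapped in a function that prints only the returned id. This is a sketch under stated assumptions: the response is taken to be JSON with a string field named "id", and the helper names extract_id and upload_file are hypothetical. The id is pulled out with grep and sed to keep the example dependency-free; a real JSON parser would be more robust.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first "id" string value found in the
# JSON document read from stdin. Assumes the id is a quoted string.
extract_id() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | head -n 1 \
    | sed 's/^"id"[[:space:]]*:[[:space:]]*"//; s/"$//'
}

# Hypothetical helper: upload one file and print its id on success.
upload_file() {
  local response id
  # With -F, curl sets the multipart Content-Type (including the
  # boundary) itself, so no explicit Content-Type header is passed here.
  response="$(curl -sS --fail -X POST \
    "https://fine-tuning.api.nscale.com/api/v1/organizations/$ORGANIZATION_ID/files" \
    -H "Authorization: Bearer $NSCALE_API_TOKEN" \
    -H 'Accept: application/json' \
    -F "file=@\"$1\"")" || return 1

  id="$(printf '%s' "$response" | extract_id)"
  if [ -z "$id" ]; then
    echo "No id found in upload response" >&2
    return 1
  fi
  printf '%s\n' "$id"
}

# Example usage (file names are placeholders):
# TRAINING_FILE_ID="$(upload_file train.csv)"
# VALIDATION_FILE_ID="$(upload_file validation.csv)"
```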

Step 2: Create a New Dataset

Once you have the file id for your training file and, if you uploaded one, your validation file, you can create a dataset. A dataset groups these files under a single ID that you’ll use to start a fine-tuning job. To create a dataset, provide a name, the file id for your training file, and optionally, the file id for your validation file.

curl -X POST "https://fine-tuning.api.nscale.com/api/v1/organizations/$ORGANIZATION_ID/datasets" \
  -H "Authorization: Bearer $NSCALE_API_TOKEN" \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -d '{
    "name": "example_dataset",
    "training_file_id": "682d47e8-6d65-4c9a-a9fe-0d695c610366",
    "validation_file_id": "4df01235-360e-4b7c-816e-da3e370de6c2"
  }'
The validation_file_id field is optional; omit it if you did not upload a validation file. A successful request creates the dataset artifact and returns its details, including the new dataset id. With your new dataset created, you’re ready to start fine-tuning. See the Fine-Tuning guide for the next steps.
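The request above can also be scripted so that validation_file_id is sent only when you actually have a validation file. The following is a minimal sketch: the helper names dataset_payload and create_dataset are hypothetical, and the payload builder does no JSON escaping, so it assumes the name and ids contain no double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the JSON body for a create-dataset request.
# The third argument (validation file id) is optional.
dataset_payload() {
  local name="$1" training_id="$2" validation_id="${3:-}"
  if [ -n "$validation_id" ]; then
    printf '{"name":"%s","training_file_id":"%s","validation_file_id":"%s"}' \
      "$name" "$training_id" "$validation_id"
  else
    printf '{"name":"%s","training_file_id":"%s"}' "$name" "$training_id"
  fi
}

# Hypothetical helper: send the create-dataset request and print the
# raw response. Takes the same arguments as dataset_payload.
create_dataset() {
  curl -sS --fail -X POST \
    "https://fine-tuning.api.nscale.com/api/v1/organizations/$ORGANIZATION_ID/datasets" \
    -H "Authorization: Bearer $NSCALE_API_TOKEN" \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json' \
    -d "$(dataset_payload "$@")"
}

# Example usage (ids are placeholders from the upload step):
# create_dataset example_dataset "$TRAINING_FILE_ID" "$VALIDATION_FILE_ID"
# create_dataset example_dataset "$TRAINING_FILE_ID"
```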

List all Datasets

To retrieve a list of all datasets, use:
curl -X GET "https://fine-tuning.api.nscale.com/api/v1/organizations/$ORGANIZATION_ID/datasets" \
-H "Authorization: Bearer $NSCALE_API_TOKEN"

Get a Dataset

To get a particular dataset, use:
curl -X GET "https://fine-tuning.api.nscale.com/api/v1/organizations/$ORGANIZATION_ID/datasets/$DATASET_ID" \
-H "Authorization: Bearer $NSCALE_API_TOKEN"

Delete a Dataset

To delete a dataset, use:
curl -X DELETE "https://fine-tuning.api.nscale.com/api/v1/organizations/$ORGANIZATION_ID/datasets/$DATASET_ID" \
-H "Authorization: Bearer $NSCALE_API_TOKEN" \
-H 'Content-Type: application/json' \
-H 'Accept: application/json'