medical custom dataset for fine-tuning llama2

Published 2023-11-29 09:14:19 by Daze_Lu

data preparation

We use the HuggingFace dataset shibing624/medical to fine-tune llama2. Please note that this dataset consists of both English (en) and Chinese (cn) data; here we use only the English data.

dataset structure

Now we download and load the dataset, then save the splits as train.csv, validation.csv and test.csv.

from datasets import load_dataset
import os

# download the "finetune" config of shibing624/medical from the HuggingFace hub
dataset = load_dataset("shibing624/medical", "finetune")

# directory to hold the exported CSV files
save_path = "../medical"
os.makedirs(save_path, exist_ok=True)

# write each split to its own CSV file
dataset['train'].to_csv(os.path.join(save_path, 'train.csv'), index=False)
dataset['validation'].to_csv(os.path.join(save_path, 'validation.csv'), index=False)
dataset['test'].to_csv(os.path.join(save_path, 'test.csv'), index=False)

Then we extract the English portion into train_en.csv, validation_en.csv and test_en.csv (the English rows can be identified by their source file names); a sketch of this step is shown below.
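
One way to produce these files is to load only the English source files of the finetune config and export them directly. Below is a minimal sketch; the file names train_en_1.json, valid_en_1.json and test_en_1.json are assumptions, so verify them in the finetune folder of the dataset repository.

from datasets import load_dataset
import os

save_path = "../medical"
base_url = "https://huggingface.co/datasets/shibing624/medical/resolve/main/finetune"

# file names below are assumptions; check the dataset repo's finetune/ folder
en_files = {
    "train": f"{base_url}/train_en_1.json",
    "validation": f"{base_url}/valid_en_1.json",
    "test": f"{base_url}/test_en_1.json",
}

en_dataset = load_dataset("json", data_files=en_files)

# export each English split with the _en suffix that dataset.py expects
for split, ds in en_dataset.items():
    ds.to_csv(os.path.join(save_path, f"{split}_en.csv"), index=False)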


change code 

In the llama2-tutorial repository, replace the contents of dataset.py with the following code:

import datasets


def get_preprocessed_medical(dataset_config, tokenizer, split):
    # map the requested split to its CSV file
    if split == "train":
        data_path = "../dataset/medical/train_en.csv"
    elif split == "validation":
        data_path = "../dataset/medical/validation_en.csv"
    elif split == "test":
        data_path = "../dataset/medical/test_en.csv"
    else:
        raise ValueError(f"unknown split: {split}")

    # load the CSV file for the requested split
    dataset = datasets.load_dataset(
        "csv",
        data_files={split: data_path}
    )[split]

    prompt = (
        f"answer the question in instruction:\n{{instruction}}\n---\noutput:\n"
    )

    def apply_prompt_template(sample):
        return {
            "prompt": prompt.format(instruction=sample["instruction"]),
            "output": sample["output"],
        }

    dataset = dataset.map(apply_prompt_template, remove_columns=list(dataset.features))

    def tokenize_add_label(sample):
        # prepend BOS to the prompt and append EOS to the answer
        prompt = tokenizer.encode(tokenizer.bos_token + sample["prompt"], add_special_tokens=False)
        answer = tokenizer.encode(sample["output"] + tokenizer.eos_token, add_special_tokens=False)

        sample = {
            "input_ids": prompt + answer,
            "attention_mask": [1] * (len(prompt) + len(answer)),
            # mask prompt tokens with -100 so the loss is computed only on the answer
            "labels": [-100] * len(prompt) + answer,
        }

        return sample

    dataset = dataset.map(tokenize_add_label, remove_columns=list(dataset.features))

    return dataset
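
As a quick sanity check, you can load a tokenizer and inspect one preprocessed sample. This is a sketch assuming the converted HuggingFace model (with its tokenizer files) lives at ./llama/models_hf/7B, matching the fine-tuning command below.

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("./llama/models_hf/7B")
sample_ds = get_preprocessed_medical(None, tokenizer, "train")  # dataset_config is unused above

sample = sample_ds[0]
print(len(sample["input_ids"]))  # total sequence length (prompt + answer)
print(sample["labels"][:10])     # leading -100s confirm the prompt is masked out of the loss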

 

Clone the llama-recipes repository to use alongside llama2-tutorial. It does not matter where you put your data, but the location must match the paths specified in your dataset.py code; a sketch of the directory structure is shown below.
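
One possible layout consistent with the relative paths used above (an assumption; adjust the paths in dataset.py and in the command below if your layout differs):

parent/
├── llama2-tutorial/                # run the fine-tuning command from here
│   ├── dataset.py
│   └── llama/models_hf/7B/         # model_name path (./llama/models_hf/7B)
├── dataset/
│   └── medical/
│       ├── train_en.csv
│       ├── validation_en.csv
│       └── test_en.csv
└── llama/
    └── fine-tuning/medical/        # output_dir (../llama/fine-tuning/medical)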

fine-tuning

Run the following command from the llama2-tutorial folder.

python -m llama_recipes.finetuning \
	--use_peft \
	--peft_method lora \
	--quantization \
	--model_name ./llama/models_hf/7B \
	--dataset custom_dataset \
	--custom_dataset.file "dataset.py:get_preprocessed_medical" \
	--output_dir ../llama/fine-tuning/medical \
	--batch_size_training 1 \
	--num_epochs 3 
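
After training, output_dir contains only the LoRA adapter weights, not a full model. Below is a minimal inference sketch, assuming peft, transformers and accelerate are installed and the paths match the command above (the question is just an example):

import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = "./llama/models_hf/7B"
adapter_path = "../llama/fine-tuning/medical"  # output_dir from the command above

tokenizer = LlamaTokenizer.from_pretrained(model_path)
base_model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

# attach the LoRA adapter saved by fine-tuning
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

# build a prompt in the same format used during training
prompt = "answer the question in instruction:\nWhat are the symptoms of anemia?\n---\noutput:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))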


references

1. llama2-tutorial: https://github.com/mmdatong/llama2-tutorials/tree/v1.0
2. llama-recipes: https://github.com/facebookresearch/llama-recipes/tree/main
3. llama: https://github.com/facebookresearch/llama