# Data Preparation ## Recommended Data Format The recommended way to prepare the data is through yaml file and prepare your sft data in jsonlines/parquet/arrow. You need to prepare a YAML file to specify the data path and data type. The YAML file should look like this: ```yaml datasets: - path: data_folder: data_type: json/jsonl - path: data_folder: data_type: json/jsonl ... ``` The actual dataset format can refer to the debug dataset we provide on [huggingface](https://huggingface.co/datasets/kcz358/lmms_engine_test) or refer to the protocol files in `src/lmms_engine/protocol/data_proto.py` ### Cloud Data Access With the data scaling, it might be very redundant to download and extract all the data to your local storage (and unrealistic). A way to cope with this is through object storage. The training framework now supports using `google cloud storage` and `azure blob storage` to access the data file directly. To use it, you should specify in your training config that ```json { "dataset_config": { ... "object_storage": "azure", # Or gcs "bucket_name": "llava", ... } } ``` Then the data folder should be the path to the data folder on the cloud storage. You should export the credentials before running the application ```bash export GOOGLE_APPLICATION_CREDENTIALS="" export AZURE_STORAGE_SAS_URL="" ``` Please contact the administrator to get your credential ## HF Format In our initial code design, we also integrated the huggingface format. But since we believe it is currently relatively hard to scale using this format. This format has mainly been deprecated and not under maintenance.