Learning fairseq: RoBERTa

git clone https://github.com/pytorch/fairseq.git

cd fairseq

pip install --editable ./

For MacOS:

CFLAGS="-stdlib=libc++" pip install --editable ./
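A quick way to confirm the editable install worked (optional, not part of the official steps):

python -c "import fairseq; print(fairseq.__version__)"
fairseq-preprocess --help > /dev/null && echo "fairseq CLI is on PATH"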

Next, follow the official pretraining guide: https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md

Download the WikiText-103 dataset:

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
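The archive unpacks into a wikitext-103-raw/ directory with one raw text file per split, which is what the commands below expect:

ls wikitext-103-raw/
# wiki.test.raw  wiki.train.raw  wiki.valid.raw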

Encode the dataset with the GPT-2 BPE:

mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
for SPLIT in train valid test; do \
python -m examples.roberta.multiprocessing_bpe_encoder \
--encoder-json gpt2_bpe/encoder.json \
--vocab-bpe gpt2_bpe/vocab.bpe \
--inputs wikitext-103-raw/wiki.${SPLIT}.raw \
--outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
--keep-empty \
--workers 60; \
done
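After encoding, each line of the .bpe files should be a sequence of space-separated GPT-2 BPE token ids rather than raw text; a quick spot check on one split:

head -n 3 wikitext-103-raw/wiki.valid.bpe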

Next, binarize the data using the GPT-2 fairseq dictionary:

wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
fairseq-preprocess \
--only-source \
--srcdict gpt2_bpe/dict.txt \
--trainpref wikitext-103-raw/wiki.train.bpe \
--validpref wikitext-103-raw/wiki.valid.bpe \
--testpref wikitext-103-raw/wiki.test.bpe \
--destdir data-bin/wikitext-103 \
--workers 60

This produces the data-bin/wikitext-103 directory, which contains the binarized dataset.
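Typical contents of the output directory (binarized splits plus the dictionary and a preprocessing log):

ls data-bin/wikitext-103/
# dict.txt  preprocess.log  test.bin  test.idx  train.bin  train.idx  valid.bin  valid.idx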

Training can now begin; this step requires GPUs.

TOTAL_UPDATES=125000    # Total number of training steps
WARMUP_UPDATES=10000 # Warmup the learning rate over this many updates
PEAK_LR=0.0005 # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512 # Max sequence length
MAX_POSITIONS=512 # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16 # Number of sequences per batch (batch size)
UPDATE_FREQ=16 # Increase the batch size 16x

DATA_DIR=data-bin/wikitext-103

fairseq-train --fp16 $DATA_DIR \
--task masked_lm --criterion masked_lm \
--arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
--optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--batch-size $MAX_SENTENCES --update-freq $UPDATE_FREQ \
--max-update $TOTAL_UPDATES --log-format simple --log-interval 1
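For reference, the effective batch size per optimizer step is batch size per GPU × gradient-accumulation steps × number of GPUs. On the 8-GPU run shown below this works out to 16 × 16 × 8 = 2048 sequences, i.e. roughly one million tokens at 512 tokens per sample:

echo $((MAX_SENTENCES * UPDATE_FREQ * 8))   # 2048 sequences per update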

If the following error appears:

Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.

Adding export MKL_THREADING_LAYER=GNU to *~/.bashrc* fixes it:
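# append to ~/.bashrc, then open a new shell or run: source ~/.bashrc
export MKL_THREADING_LAYER=GNU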

You may also see the following warning; it comes from a deprecated DistributedDataParallel argument and appears to be harmless:
/home/User/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:397: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.

Log output from a run on an 8-GPU server:

2020-12-28 15:40:38 | INFO | fairseq.distributed_utils | distributed init (rank 0): tcp://localhost:12737

2020-12-28 15:40:38 | INFO | fairseq.distributed_utils | distributed init (rank 1): tcp://localhost:12737

2020-12-28 15:40:38 | INFO | fairseq.distributed_utils | distributed init (rank 2): tcp://localhost:12737

2020-12-28 15:40:38 | INFO | fairseq.distributed_utils | distributed init (rank 6): tcp://localhost:12737

2020-12-28 15:40:38 | INFO | fairseq.distributed_utils | distributed init (rank 5): tcp://localhost:12737

2020-12-28 15:40:38 | INFO | fairseq.distributed_utils | distributed init (rank 3): tcp://localhost:12737

2020-12-28 15:40:38 | INFO | fairseq.distributed_utils | distributed init (rank 4): tcp://localhost:12737

2020-12-28 15:40:38 | INFO | fairseq.distributed_utils | distributed init (rank 7): tcp://localhost:12737

2020-12-28 15:40:45 | INFO | fairseq.distributed_utils | initialized host song1-SYS-4029GP-TRT as rank 5

2020-12-28 15:40:45 | INFO | fairseq.distributed_utils | initialized host song1-SYS-4029GP-TRT as rank 6

2020-12-28 15:40:45 | INFO | fairseq.distributed_utils | initialized host song1-SYS-4029GP-TRT as rank 0

2020-12-28 15:40:45 | INFO | fairseq.distributed_utils | initialized host song1-SYS-4029GP-TRT as rank 7

2020-12-28 15:40:45 | INFO | fairseq.distributed_utils | initialized host song1-SYS-4029GP-TRT as rank 3

2020-12-28 15:40:45 | INFO | fairseq.distributed_utils | initialized host song1-SYS-4029GP-TRT as rank 2

2020-12-28 15:40:45 | INFO | fairseq.distributed_utils | initialized host song1-SYS-4029GP-TRT as rank 4

2020-12-28 15:40:45 | INFO | fairseq.distributed_utils | initialized host song1-SYS-4029GP-TRT as rank 1

2020-12-28 15:40:45 | INFO | fairseq_cli.train | {‘_name’: None, ‘common’: {‘_name’: None, ‘no_progress_bar’: False, ‘log_interval’: 1, ‘log_format’: ‘simple’, ‘tensorboard_logdir’: None, ‘wandb_project’: None, ‘azureml_logging’: False, ‘seed’: 1, ‘cpu’: False, ‘tpu’: False, ‘bf16’: False, ‘memory_efficient_bf16’: False, ‘fp16’: True, ‘memory_efficient_fp16’: False, ‘fp16_no_flatten_grads’: False, ‘fp16_init_scale’: 128, ‘fp16_scale_window’: None, ‘fp16_scale_tolerance’: 0.0, ‘min_loss_scale’: 0.0001, ‘threshold_loss_scale’: None, ‘user_dir’: None, ‘empty_cache_freq’: 0, ‘all_gather_list_size’: 16384, ‘model_parallel_size’: 1, ‘quantization_config_path’: None, ‘profile’: False, ‘reset_logging’: True}, ‘common_eval’: {‘_name’: None, ‘path’: None, ‘post_process’: None, ‘quiet’: False, ‘model_overrides’: ‘{}’, ‘results_path’: None}, ‘distributed_training’: {‘_name’: None, ‘distributed_world_size’: 8, ‘distributed_rank’: 0, ‘distributed_backend’: ‘nccl’, ‘distributed_init_method’: ‘tcp://localhost:12737’, ‘distributed_port’: -1, ‘device_id’: 0, ‘distributed_no_spawn’: False, ‘ddp_backend’: ‘c10d’, ‘bucket_cap_mb’: 25, ‘fix_batches_to_gpus’: False, ‘find_unused_parameters’: False, ‘fast_stat_sync’: False, ‘heartbeat_timeout’: -1, ‘broadcast_buffers’: False, ‘distributed_wrapper’: ‘DDP’, ‘slowmo_momentum’: None, ‘slowmo_algorithm’: ‘LocalSGD’, ‘localsgd_frequency’: 3, ‘nprocs_per_node’: 8, ‘pipeline_model_parallel’: False, ‘pipeline_balance’: None, ‘pipeline_devices’: None, ‘pipeline_chunks’: 0, ‘pipeline_encoder_balance’: None, ‘pipeline_encoder_devices’: None, ‘pipeline_decoder_balance’: None, ‘pipeline_decoder_devices’: None, ‘pipeline_checkpoint’: ‘never’, ‘zero_sharding’: ‘none’, ‘tpu’: False, ‘distributed_num_procs’: 8}, ‘dataset’: {‘_name’: None, ‘num_workers’: 1, ‘skip_invalid_size_inputs_valid_test’: False, ‘max_tokens’: None, ‘batch_size’: 16, ‘required_batch_size_multiple’: 8, ‘required_seq_len_multiple’: 1, ‘dataset_impl’: None, ‘data_buffer_size’: 10, ‘train_subset’: ‘train’, ‘valid_subset’: ‘valid’, ‘validate_interval’: 1, ‘validate_interval_updates’: 0, ‘validate_after_updates’: 0, ‘fixed_validation_seed’: None, ‘disable_validation’: False, ‘max_tokens_valid’: None, ‘batch_size_valid’: 16, ‘curriculum’: 0, ‘gen_subset’: ‘test’, ‘num_shards’: 1, ‘shard_id’: 0}, ‘optimization’: {‘_name’: None, ‘max_epoch’: 0, ‘max_update’: 125000, ‘stop_time_hours’: 0.0, ‘clip_norm’: 0.0, ‘sentence_avg’: False, ‘update_freq’: [16], ‘lr’: [0.0005], ‘stop_min_lr’: -1.0, ‘use_bmuf’: False}, ‘checkpoint’: {‘_name’: None, ‘save_dir’: ‘checkpoints’, ‘restore_file’: ‘checkpoint_last.pt’, ‘finetune_from_model’: None, ‘reset_dataloader’: False, ‘reset_lr_scheduler’: False, ‘reset_meters’: False, ‘reset_optimizer’: False, ‘optimizer_overrides’: ‘{}’, ‘save_interval’: 1, ‘save_interval_updates’: 0, ‘keep_interval_updates’: -1, ‘keep_last_epochs’: -1, ‘keep_best_checkpoints’: -1, ‘no_save’: False, ‘no_epoch_checkpoints’: False, ‘no_last_checkpoints’: False, ‘no_save_optimizer_state’: False, ‘best_checkpoint_metric’: ‘loss’, ‘maximize_best_checkpoint_metric’: False, ‘patience’: -1, ‘checkpoint_suffix’: ‘’, ‘checkpoint_shard_count’: 1, ‘load_checkpoint_on_all_dp_ranks’: False, ‘model_parallel_size’: 1, ‘distributed_rank’: 0}, ‘bmuf’: {‘_name’: None, ‘block_lr’: 1.0, ‘block_momentum’: 0.875, ‘global_sync_iter’: 50, ‘warmup_iterations’: 500, ‘use_nbm’: False, ‘average_sync’: False, ‘distributed_world_size’: 8}, ‘generation’: {‘_name’: None, ‘beam’: 5, ‘nbest’: 1, ‘max_len_a’: 0.0, ‘max_len_b’: 200, ‘min_len’: 1, 
‘match_source_len’: False, ‘unnormalized’: False, ‘no_early_stop’: False, ‘no_beamable_mm’: False, ‘lenpen’: 1.0, ‘unkpen’: 0.0, ‘replace_unk’: None, ‘sacrebleu’: False, ‘score_reference’: False, ‘prefix_size’: 0, ‘no_repeat_ngram_size’: 0, ‘sampling’: False, ‘sampling_topk’: -1, ‘sampling_topp’: -1.0, ‘constraints’: None, ‘temperature’: 1.0, ‘diverse_beam_groups’: -1, ‘diverse_beam_strength’: 0.5, ‘diversity_rate’: -1.0, ‘print_alignment’: None, ‘print_step’: False, ‘lm_path’: None, ‘lm_weight’: 0.0, ‘iter_decode_eos_penalty’: 0.0, ‘iter_decode_max_iter’: 10, ‘iter_decode_force_max_iter’: False, ‘iter_decode_with_beam’: 1, ‘iter_decode_with_external_reranker’: False, ‘retain_iter_history’: False, ‘retain_dropout’: False, ‘retain_dropout_modules’: None, ‘decoding_format’: None, ‘no_seed_provided’: False}, ‘eval_lm’: {‘_name’: None, ‘output_word_probs’: False, ‘output_word_stats’: False, ‘context_window’: 0, ‘softmax_batch’: 9223372036854775807}, ‘interactive’: {‘_name’: None, ‘buffer_size’: 0, ‘input’: ‘-‘}, ‘model’: Namespace(_name=’roberta_base’, activation_dropout=0.0, activation_fn=’gelu’, adam_betas=’(0.9,0.98)’, adam_eps=1e-06, all_gather_list_size=16384, arch=’roberta_base’, attention_dropout=0.1, azureml_logging=False, batch_size=16, batch_size_valid=16, best_checkpoint_metric=’loss’, bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix=’’, clip_norm=0.0, cpu=False, criterion=’masked_lm’, curriculum=0, data=’data-bin/wikitext-103’, data_buffer_size=10, dataset_impl=None, ddp_backend=’c10d’, device_id=0, disable_validation=False, distributed_backend=’nccl’, distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, distributed_wrapper=’DDP’, dropout=0.1, empty_cache_freq=0, encoder_attention_heads=12, encoder_embed_dim=768, encoder_ffn_embed_dim=3072, encoder_layerdrop=0, encoder_layers=12, encoder_layers_to_keep=None, end_learning_rate=0.0, eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, freq_weighted_replacement=False, gen_subset=’test’, heartbeat_timeout=-1, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, leave_unmasked_prob=0.1, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_format=’simple’, log_interval=1, lr=[0.0005], lr_scheduler=’polynomial_decay’, mask_multiple_length=1, mask_prob=0.15, mask_stdev=0.0, mask_whole_words=False, max_epoch=0, max_tokens=None, max_tokens_valid=None, max_update=125000, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, nprocs_per_node=8, num_shards=1, num_workers=1, optimizer=’adam’, optimizer_overrides=’{}’, pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint=’never’, pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, pooler_activation_fn=’tanh’, pooler_dropout=0.0, power=1.0, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, 
quantization_config_path=None, random_token_prob=0.1, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file=’checkpoint_last.pt’, sample_break_mode=’complete’, save_dir=’checkpoints’, save_interval=1, save_interval_updates=0, scoring=’bleu’, seed=1, sentence_avg=False, shard_id=0, shorten_data_split_list=’’, shorten_method=’none’, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm=’LocalSGD’, slowmo_momentum=None, spectral_norm_classification_head=False, stop_min_lr=-1.0, stop_time_hours=0, task=’masked_lm’, tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tokens_per_sample=512, total_num_update=’125000’, tpu=False, train_subset=’train’, unk=3, untie_weights_roberta=False, update_freq=[16], use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset=’valid’, validate_after_updates=0, validate_interval=1, validate_interval_updates=0, wandb_project=None, warmup_updates=10000, weight_decay=0.01, zero_sharding=’none’), ‘task’: Namespace(_name=’masked_lm’, activation_dropout=0.0, activation_fn=’gelu’, adam_betas=’(0.9,0.98)’, adam_eps=1e-06, all_gather_list_size=16384, arch=’roberta_base’, attention_dropout=0.1, azureml_logging=False, batch_size=16, batch_size_valid=16, best_checkpoint_metric=’loss’, bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix=’’, clip_norm=0.0, cpu=False, criterion=’masked_lm’, curriculum=0, data=’data-bin/wikitext-103’, data_buffer_size=10, dataset_impl=None, ddp_backend=’c10d’, device_id=0, disable_validation=False, distributed_backend=’nccl’, distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, distributed_wrapper=’DDP’, dropout=0.1, empty_cache_freq=0, encoder_attention_heads=12, encoder_embed_dim=768, encoder_ffn_embed_dim=3072, encoder_layerdrop=0, encoder_layers=12, encoder_layers_to_keep=None, end_learning_rate=0.0, eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, freq_weighted_replacement=False, gen_subset=’test’, heartbeat_timeout=-1, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, leave_unmasked_prob=0.1, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_format=’simple’, log_interval=1, lr=[0.0005], lr_scheduler=’polynomial_decay’, mask_multiple_length=1, mask_prob=0.15, mask_stdev=0.0, mask_whole_words=False, max_epoch=0, max_tokens=None, max_tokens_valid=None, max_update=125000, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, nprocs_per_node=8, num_shards=1, num_workers=1, optimizer=’adam’, optimizer_overrides=’{}’, pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint=’never’, pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, pooler_activation_fn=’tanh’, pooler_dropout=0.0, power=1.0, profile=False, quant_noise_pq=0, 
quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, random_token_prob=0.1, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file=’checkpoint_last.pt’, sample_break_mode=’complete’, save_dir=’checkpoints’, save_interval=1, save_interval_updates=0, scoring=’bleu’, seed=1, sentence_avg=False, shard_id=0, shorten_data_split_list=’’, shorten_method=’none’, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm=’LocalSGD’, slowmo_momentum=None, spectral_norm_classification_head=False, stop_min_lr=-1.0, stop_time_hours=0, task=’masked_lm’, tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tokens_per_sample=512, total_num_update=’125000’, tpu=False, train_subset=’train’, unk=3, untie_weights_roberta=False, update_freq=[16], use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset=’valid’, validate_after_updates=0, validate_interval=1, validate_interval_updates=0, wandb_project=None, warmup_updates=10000, weight_decay=0.01, zero_sharding=’none’), ‘criterion’: Namespace(_name=’masked_lm’, activation_dropout=0.0, activation_fn=’gelu’, adam_betas=’(0.9,0.98)’, adam_eps=1e-06, all_gather_list_size=16384, arch=’roberta_base’, attention_dropout=0.1, azureml_logging=False, batch_size=16, batch_size_valid=16, best_checkpoint_metric=’loss’, bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix=’’, clip_norm=0.0, cpu=False, criterion=’masked_lm’, curriculum=0, data=’data-bin/wikitext-103’, data_buffer_size=10, dataset_impl=None, ddp_backend=’c10d’, device_id=0, disable_validation=False, distributed_backend=’nccl’, distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, distributed_wrapper=’DDP’, dropout=0.1, empty_cache_freq=0, encoder_attention_heads=12, encoder_embed_dim=768, encoder_ffn_embed_dim=3072, encoder_layerdrop=0, encoder_layers=12, encoder_layers_to_keep=None, end_learning_rate=0.0, eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, freq_weighted_replacement=False, gen_subset=’test’, heartbeat_timeout=-1, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, leave_unmasked_prob=0.1, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_format=’simple’, log_interval=1, lr=[0.0005], lr_scheduler=’polynomial_decay’, mask_multiple_length=1, mask_prob=0.15, mask_stdev=0.0, mask_whole_words=False, max_epoch=0, max_tokens=None, max_tokens_valid=None, max_update=125000, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, nprocs_per_node=8, num_shards=1, num_workers=1, optimizer=’adam’, optimizer_overrides=’{}’, pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint=’never’, pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, pooler_activation_fn=’tanh’, 
pooler_dropout=0.0, power=1.0, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, random_token_prob=0.1, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file=’checkpoint_last.pt’, sample_break_mode=’complete’, save_dir=’checkpoints’, save_interval=1, save_interval_updates=0, scoring=’bleu’, seed=1, sentence_avg=False, shard_id=0, shorten_data_split_list=’’, shorten_method=’none’, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm=’LocalSGD’, slowmo_momentum=None, spectral_norm_classification_head=False, stop_min_lr=-1.0, stop_time_hours=0, task=’masked_lm’, tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tokens_per_sample=512, total_num_update=’125000’, tpu=False, train_subset=’train’, unk=3, untie_weights_roberta=False, update_freq=[16], use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset=’valid’, validate_after_updates=0, validate_interval=1, validate_interval_updates=0, wandb_project=None, warmup_updates=10000, weight_decay=0.01, zero_sharding=’none’), ‘optimizer’: {‘_name’: ‘adam’, ‘adam_betas’: ‘(0.9,0.98)’, ‘adam_eps’: 1e-06, ‘weight_decay’: 0.01, ‘use_old_adam’: False, ‘tpu’: False, ‘lr’: [0.0005]}, ‘lr_scheduler’: {‘_name’: ‘polynomial_decay’, ‘warmup_updates’: 10000, ‘force_anneal’: None, ‘end_learning_rate’: 0.0, ‘power’: 1.0, ‘total_num_update’: 125000.0, ‘lr’: [0.0005]}, ‘scoring’: {‘_name’: ‘bleu’, ‘pad’: 1, ‘eos’: 2, ‘unk’: 3}, ‘bpe’: None, ‘tokenizer’: None}

2020-12-28 15:40:45 | INFO | fairseq.tasks.masked_lm | dictionary: 50264 types

2020-12-28 15:40:45 | INFO | fairseq.data.data_utils | loaded 3760 examples from: data-bin/wikitext-103/valid

2020-12-28 15:40:45 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: data-bin/wikitext-103/valid

2020-12-28 15:40:51 | INFO | fairseq_cli.train | RobertaModel(
  (encoder): RobertaEncoder(
    (sentence_encoder): TransformerSentenceEncoder(
      (dropout_module): FairseqDropout()
      (embed_tokens): Embedding(50265, 768, padding_idx=1)
      (embed_positions): LearnedPositionalEmbedding(514, 768, padding_idx=1)
      (layers): ModuleList(
        (0): TransformerSentenceEncoderLayer(
          (dropout_module): FairseqDropout()
          (activation_dropout_module): FairseqDropout()
          (self_attn): MultiheadAttention(
            (dropout_module): FairseqDropout()
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
        ... layers (1) through (11) are identical to layer (0) and are omitted here ...
      )
      (emb_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
    (lm_head): RobertaLMHead(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
  )
  (classification_heads): ModuleDict()
)

2020-12-28 15:40:51 | INFO | fairseq_cli.train | task: MaskedLMTask

2020-12-28 15:40:51 | INFO | fairseq_cli.train | model: RobertaModel

2020-12-28 15:40:51 | INFO | fairseq_cli.train | criterion: MaskedLmLoss

2020-12-28 15:40:51 | INFO | fairseq_cli.train | num. model params: 124696665 (num. trained: 124696665)
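The reported parameter count can be reproduced from the architecture printed above. A rough sanity check (not part of the guide; it assumes the output projection weight is tied to embed_tokens, which the "detected shared parameter" line below confirms):

# roberta_base: vocab 50265, 514 positions, dim 768, FFN 3072, 12 layers
V, P, D, F, L = 50265, 514, 768, 3072, 12
embeddings = V * D + P * D + 2 * D                                       # token + position embeddings + emb_layer_norm
per_layer = 4 * (D * D + D) + 2 * D + (D * F + F) + (F * D + D) + 2 * D  # q/k/v/out projections, 2 layer norms, fc1, fc2
lm_head = (D * D + D) + 2 * D + V                                        # dense + layer_norm + output bias (weight is tied)
print(embeddings + L * per_layer + lm_head)                              # 124696665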

2020-12-28 15:40:56 | INFO | fairseq.trainer | detected shared parameter: encoder.sentence_encoder.embed_tokens.weight <- encoder.lm_head.weight

2020-12-28 15:40:57 | INFO | fairseq.utils | CUDA enviroments for all 8 workers

2020-12-28 15:40:57 | INFO | fairseq.utils | rank 0: capabilities = 7.5 ; total memory = 23.653 GB ; name = TITAN RTX

2020-12-28 15:40:57 | INFO | fairseq.utils | rank 1: capabilities = 7.5 ; total memory = 23.653 GB ; name = TITAN RTX

2020-12-28 15:40:57 | INFO | fairseq.utils | rank 2: capabilities = 7.5 ; total memory = 23.653 GB ; name = TITAN RTX

2020-12-28 15:40:57 | INFO | fairseq.utils | rank 3: capabilities = 7.5 ; total memory = 23.653 GB ; name = TITAN RTX

2020-12-28 15:40:57 | INFO | fairseq.utils | rank 4: capabilities = 7.5 ; total memory = 23.653 GB ; name = TITAN RTX

2020-12-28 15:40:57 | INFO | fairseq.utils | rank 5: capabilities = 7.5 ; total memory = 23.653 GB ; name = TITAN RTX

2020-12-28 15:40:57 | INFO | fairseq.utils | rank 6: capabilities = 7.5 ; total memory = 23.653 GB ; name = TITAN RTX

2020-12-28 15:40:57 | INFO | fairseq.utils | rank 7: capabilities = 7.5 ; total memory = 23.653 GB ; name = TITAN RTX

2020-12-28 15:40:57 | INFO | fairseq.utils | CUDA enviroments for all 8 workers

2020-12-28 15:40:57 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)

2020-12-28 15:40:57 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 16

2020-12-28 15:40:57 | INFO | fairseq.trainer | Preparing to load checkpoint checkpoints/checkpoint_last.pt

2020-12-28 15:40:57 | INFO | fairseq.trainer | No existing checkpoint found checkpoints/checkpoint_last.pt

2020-12-28 15:40:57 | INFO | fairseq.trainer | loading train data for epoch 1

2020-12-28 15:40:57 | INFO | fairseq.data.data_utils | loaded 1801350 examples from: data-bin/wikitext-103/train

2020-12-28 15:40:57 | INFO | fairseq.tasks.masked_lm | loaded 280678 blocks from: data-bin/wikitext-103/train
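Once training has written a checkpoint under checkpoints/, a minimal sanity check is to load it through fairseq's hub interface. A sketch under the paths used above (checkpoint_best.pt and the masked sentence are just placeholders):

from fairseq.models.roberta import RobertaModel

# load the pretrained checkpoint together with the dictionary it was trained on
roberta = RobertaModel.from_pretrained(
    'checkpoints',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin/wikitext-103',
    bpe='gpt2',
)
roberta.eval()
# fill in the <mask> token and print the top-3 completions
print(roberta.fill_mask('The capital of France is <mask>.', topk=3))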

For an analysis of the fairseq model code, see https://zhuanlan.zhihu.com/p/141210591