LysandreJik released this
Oct 11, 2019
Two new models have been added since release 2.0.
Several updates have been made to the distillation script, including the possibility to distill GPT-2 and to distill on the SQuAD task. By @VictorSanh.
The run_glue.py example script can now run on a TPU with PyTorch.
Several example scripts have been improved and refactored to use the full potential of the new tokenizer functions.
Enhancements have been made to the tokenizers. Two new methods have been added: get_special_tokens_mask and truncate_sequences.
The former returns a mask indicating which tokens in a token list are special tokens and which come from the initial sequences. The latter truncates sequences according to a strategy.
Both of these methods are called by the encode_plus method, which itself is called by the encode method. encode_plus now returns a larger dictionary which holds information about the special tokens, as well as the overflowing tokens.
Thanks to @julien-c, @thomwolf, and @LysandreJik for these additions.
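A minimal sketch of these methods (assuming the transformers 2.1-era API; the exact dictionary keys returned by encode_plus may differ slightly):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# encode_plus builds the final input ids (special tokens included) and returns
# a dictionary with additional information about them.
encoded = tokenizer.encode_plus("First sequence.", "Second sequence.",
                                add_special_tokens=True)
print(encoded["input_ids"])

# get_special_tokens_mask marks special-token positions with 1 and tokens
# coming from the original sequences with 0.
ids = tokenizer.encode("First sequence.", add_special_tokens=False)
print(tokenizer.get_special_tokens_mask(ids))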
The two methods add_special_tokens_single_sequence and add_special_tokens_sequence_pair have been removed. They have been replaced by the single method build_inputs_with_special_tokens which has a more comprehensible name and manages both sequence singletons and pairs.
The boolean parameter truncate_first_sequence has been removed from the tokenizers' encode and encode_plus methods, replaced by a strategy passed as a string: 'longest_first', 'only_first', 'only_second' or 'do_not_truncate' are the accepted strategies.
When the encode or encode_plus methods are called with a specified max_length, the sequences will now always be truncated, or an error will be raised if they overflow.
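As an illustration, a hedged sketch of the new truncation controls (assuming the keyword argument is named truncation_strategy in this release):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Truncate the pair down to max_length, removing tokens from the longest
# sequence first; per the note above, overflowing sequences are either
# truncated or raise an error depending on the chosen strategy.
encoded = tokenizer.encode_plus(
    "A fairly long first sequence that will need trimming.",
    "A second sequence.",
    max_length=16,
    truncation_strategy='longest_first',
)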
New contributing guidelines have been added, alongside library development requirements by @rlouf, the newest member of the HuggingFace team.
thomwolf released this
Sep 26, 2019
Following the extension to TensorFlow 2.0, pytorch-transformers => transformers
Install with:
pip install transformers
Also, note that PyTorch is no longer in the requirements so don't forget to install TensorFlow 2.0 and/or PyTorch to be able to use (and load) the models.
All the PyTorch nn.Module classes now have their counterpart in TensorFlow 2.0 as tf.keras.Model classes. TensorFlow 2.0 classes have the same name as their PyTorch counterparts prefixed with TF.
The interoperability between TensorFlow 2.0 and PyTorch goes a lot deeper than what is usually meant when talking about libraries with multiple backends: models can be saved with one framework and loaded back directly with the other.
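A minimal sketch of this interoperability (a hedged example; the save directory is illustrative):

from transformers import BertModel, TFBertModel

# A PyTorch model saved with save_pretrained()...
pt_model = BertModel.from_pretrained('bert-base-uncased')
pt_model.save_pretrained('./bert-pt/')

# ...can be reloaded directly as its TensorFlow 2.0 counterpart.
tf_model = TFBertModel.from_pretrained('./bert-pt/', from_pt=True)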
Training on TPU using the free TPUs provided in the TensorFlow Research Cloud (TFRC) program is possible, but requires implementing a custom training loop (not possible with keras.fit at the moment). We will add an example of such a custom training loop soon.
Tokenizers have been improved to provide an extended encoding method, encode_plus, and additional arguments to encode. Please refer to the doc for detailed usage of the new options.
To make better use of Torchscript on both CPU and GPU (see #1010, #1204 and #1195), the specific order of some models' keyword inputs (attention_mask, token_type_ids, ...) has been changed.
If you used to call the models with keyword arguments, e.g. model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids), this should not cause any breaking change.
If you used to call the models with positional inputs for keyword arguments, e.g. model(input_ids, attention_mask, token_type_ids), you should double-check the exact order of the input arguments.
PyTorch is no longer in the requirements so don't forget to install TensorFlow 2.0 and/or PyTorch to be able to use (and load) the models.
The method add_special_tokens_sentence_pair has been renamed to the more appropriate name add_special_tokens_sequence_pair. The same holds true for the method add_special_tokens_single_sentence which has been changed to add_special_tokens_single_sequence.
LysandreJik released this
Sep 4, 2019
HuggingFace's new transformer architecture, DistilBERT, is described in Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT, by Victor Sanh, Lysandre Debut and Thomas Wolf.
This new model architecture comes with two pretrained checkpoints: distilbert-base-uncased and distilbert-base-uncased-distilled-squad.
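A minimal sketch of loading the base checkpoint (a hedged example assuming the pytorch-transformers 1.2 API; DistilBERT reuses BERT's uncased wordpiece vocabulary):

import torch
from pytorch_transformers import BertTokenizer, DistilBertModel

# DistilBERT uses the same vocabulary as bert-base-uncased.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

input_ids = torch.tensor([tokenizer.encode("DistilBERT is smaller and faster.")])
last_hidden_state = model(input_ids)[0]  # models return tuples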
The third OpenAI GPT-2 checkpoint is available in the library: 774M parameters, 36 layers, and 20 heads.
We have added two new XLM models in 17 and 100 languages which obtain better performance than multilingual BERT on the XNLI cross-lingual classification task.
The Pytorch-Transformers torch.hub interface is based on Auto-Models, which are generic classes designed to be instantiated using from_pretrained() with a model architecture guessed from the pretrained checkpoint name (e.g. AutoModel.from_pretrained('bert-base-uncased') will instantiate a BertModel and load the 'bert-base-uncased' checkpoint into it). There are currently 4 classes of Auto-Models: AutoModel, AutoModelWithLMHead, AutoModelForSequenceClassification and AutoModelForQuestionAnswering.
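A hedged sketch of the torch.hub interface (the entry-point names 'tokenizer' and 'model' are assumptions based on the hub configuration of this period):

import torch

# Both calls pull the hubconf from the huggingface/pytorch-transformers
# repository; the architecture is guessed from the checkpoint name.
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-uncased')
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased')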
Support for XLM is improved by carefully reproducing the original tokenization workflow (work by @shijie-wu in #1092). We now rely on sacremoses, a python port of Moses tokenizer, truecaser and normalizer by @alvations, for XLM word tokenization.
In a few languages (Thai, Japanese and Chinese) XLM tokenizer will require additional dependencies. These additional dependencies are optional at the library level. Using XLM tokenizer in these languages without the additional dependency will raise an error message with installation instructions. The additional optional dependencies are:
* XLM used the Stanford Segmenter. However, the NLTK wrapper (nltk.tokenize.stanford_segmenter) is slow due to JVM overhead and will be deprecated. Jieba is a lot faster and pip-installable, but there is some mismatch with the Stanford Segmenter. A workaround could be an argument that lets users segment the sentence themselves and bypass the segmenter. As a reference, nltk.tokenize.stanford_segmenter is also included in the PR.
LysandreJik released this
Aug 15, 2019
RoBERTa (from Facebook), a Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.
Thanks to Myle Ott from Facebook for his help.
Tokenizers get two new methods: add_special_tokens_single_sentence and add_special_tokens_sentence_pair.
These methods add the model-specific special tokens to sequences. The sentence-pair method creates a list of tokens with the cls and sep tokens placed according to the way the model was trained, for example:
[CLS] SEQUENCE_0 [SEP] SEQUENCE_1 [SEP]
[CLS] SEQUENCE_0 [SEP] [SEP] SEQUENCE_1 [SEP]
The tokenizer encode function gets two new arguments:
tokenizer.encode(text, text_pair=None, add_special_tokens=False)
If text_pair is specified, encode will return a tuple of encoded sequences. If add_special_tokens is set to True, the sequences will be built with the model's respective special tokens using the previously described methods.
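For example, a hedged sketch (pytorch-transformers 1.1-era API assumed):

from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Encode a sequence pair and let the tokenizer place the model-specific
# special tokens ([CLS]/[SEP] for BERT) around the two sequences.
ids = tokenizer.encode("First sequence.",
                       text_pair="Second sequence.",
                       add_special_tokens=True)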
There are three new classes with this release that instantiate one of the base model classes of the library from a pre-trained model configuration: AutoConfig, AutoModel, and AutoTokenizer.
Those classes take as input a pre-trained model name or path and instantiate one of the corresponding classes. The input string indicates to the class which architecture should be instantiated. If the string contains "bert", AutoConfig instantiates a BertConfig, AutoModel instantiates a BertModel and AutoTokenizer instantiates a BertTokenizer.
The same can be done for all the library's base models. The Auto classes check for the associated strings: "openai-gpt", "gpt2", "transfo-xl", "xlnet", "xlm" and "roberta". The documentation associated with this change can be found here.
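A minimal sketch of the Auto classes (the checkpoint name is illustrative):

from pytorch_transformers import AutoConfig, AutoModel, AutoTokenizer

# The "bert" substring in the checkpoint name routes each Auto class to its
# BERT counterpart: BertConfig, BertModel and BertTokenizer respectively.
config = AutoConfig.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')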
Some examples have been refactored to better reflect the current library. Those are: simple_lm_finetuning.py, finetune_on_pregenerated.py, as well as run_glue.py, which has been adapted to the RoBERTa model. The run_squad.py and run_glue.py examples now have better dataset processing with caching.
thomwolf released this
Jul 16, 2019
pytorch-pretrained-bert => pytorch-transformers
Install with:
pip install pytorch-transformers
We went from ten (in pytorch-pretrained-bert 0.6.2) to twenty-seven (in pytorch-transformers 1.0) pretrained model weights.
The newly added model weights are, in summary:
The documentation lists all the models with the shortcut names and we are currently adding full details of the associated pretraining/fine-tuning parameters.
New documentation is currently being created at https://huggingface.co/pytorch-transformers/ and should be finalized over the coming days.
See the readme for a quick tour of the API.
import torch
from pytorch_transformers import BertModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True,
                                  output_attentions=True)
input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
all_hidden_states, all_attentions = model(input_ids)[-2:]
Using tokenizer.add_tokens() and tokenizer.add_special_tokens(), one can now easily add tokens to each model vocabulary. The model's input embeddings can be resized accordingly to add associated word embeddings (to be trained) using model.resize_token_embeddings(len(tokenizer))
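A minimal sketch (the added tokens are purely illustrative):

from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Extend the vocabulary, then resize the input embedding matrix so the new
# (randomly initialized) embeddings can be trained.
num_added_tokens = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
model.resize_token_embeddings(len(tokenizer))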
The serialization methods have been standardized and you probably should switch to the new method save_pretrained(save_directory) if you were using any other serialization method before.
model.save_pretrained('./my_saved_model_directory/')
tokenizer.save_pretrained('./my_saved_model_directory/')

### Reload the model and the tokenizer
model = BertForSequenceClassification.from_pretrained('./my_saved_model_directory/')
tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')
All models are now compatible with Torchscript.
model = model_class.from_pretrained(pretrained_weights, torchscript=True)
traced_model = torch.jit.trace(model, (input_ids,))
The example scripts have been refactored and gathered in three main examples (run_glue.py, run_squad.py and run_generation.py) which are common to several models and are designed to offer SOTA performance on the respective tasks while being a clean starting point to design your own scripts.
Other example scripts (like run_bertology.py) will be added in the coming weeks.
The migration section of the readme lists the breaking changes when switching from pytorch-pretrained-bert to pytorch-transformers.
The main breaking change is that all models now return a tuple of results.
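For instance, a hedged sketch of the new convention (model, input_ids and labels are assumed to come from a BertForSequenceClassification setup like the one above):

# pytorch-pretrained-bert returned a single loss here; pytorch-transformers
# models always return a tuple, so unpack the elements you need.
outputs = model(input_ids, labels=labels)
loss, logits = outputs[:2]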