Ludwig is a toolbox built on TensorFlow that lets you train and test deep learning models without writing code.
w4nderlust released this
Changelog
Additions
- Added feature identification logic (#957)
- Added Backend interface for abstracting DataFrame preprocessing steps (#1014)
- Add support for transforming numeric predictions that were normalized (#1015)
- Added Kaggle API integration and Titanic dataset (#1021)
- Add Korean translation for the README (#1022)
- Added cast_columns function to preprocessing and cast_column function to all feature mixin classes (#1027)
- Added custom encoder / decoder registration decorator (#1017)
- Add titles to Hyperopt Report visualization (#1026)
- Added label-wise probability to binary feature predictions (#1033)
- Add support for num_layers in sequence generator decoder (#1050)
- Added Flickr8k dataset (#1053)
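The item on transforming normalized numeric predictions (#1015) is about mapping model outputs back to the original scale of the column. A minimal sketch of that idea follows, assuming z-score and min-max normalization; the function names and the layout of the `stats` dictionary are hypothetical and this is not Ludwig's actual implementation.

```python
def normalize(values, strategy, stats):
    """Forward transform applied to a numeric column before training."""
    if strategy == "zscore":
        return [(v - stats["mean"]) / stats["std"] for v in values]
    if strategy == "minmax":
        span = stats["max"] - stats["min"]
        return [(v - stats["min"]) / span for v in values]
    raise ValueError(f"unknown normalization strategy: {strategy}")


def denormalize(values, strategy, stats):
    """Inverse transform applied to predictions so they use the original scale."""
    if strategy == "zscore":
        return [v * stats["std"] + stats["mean"] for v in values]
    if strategy == "minmax":
        span = stats["max"] - stats["min"]
        return [v * span + stats["min"] for v in values]
    raise ValueError(f"unknown normalization strategy: {strategy}")
```

The statistics have to be the ones computed on the training data, otherwise the inverse transform does not undo the forward one.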
Improvements
- Improved triggering of cache re-creation (now it depends also on changes in feature types)
- Improved legend and add tight_layout param to compare predictions plot (#1037)
- Improved postprocessing for binary features so prediction vocab matches inputs (#1038)
- Bump TensorFlow and tfa-nightly for 2.4.0 release (#1058)
- Updated Dockerfiles to TensorFlow 2.4.0 (#1059)
Bugfixes
- Fix missing yaml files for datasets in pip package
- Fix hdf5 preprocessing error
- Fix calculation of the metric score for hyperopt (#1031)
- Fix wrong argument in visualize.py from -f to -ofn (#1032)
- Fix fill NaN by adding selected conversion of columns to string when computing metadata (#1042)
- Fix: inconsistent seq length for probabilities (#1043)
- Fix issues with changes in xlrd package (#1056)
w4nderlust released this
Additions
- Added dataset module (#949) containing MNIST, SST-2, SST-5, REUTERS, OHSUMED, FEVER and GoEmotions datasets
- Add Ludwig Model Serve Example (#947)
- Add checksum mechanism for HDF5 and Meta JSON cache file (#1006)
Improvements
- Updated run_experiment to use new skip parameters and returns (#955)
- Several improvements to testing (more coverage, with faster tests)
- Changed default value of HF encoder trainable parameter to True (for performance reasons) (#996)
- Improved and slightly modified visualization functions API
Bugfixes
- Changed not to is None in dataset checks in hyperopt.run.hyperopt() (#956)
- Fix LudwigModel.predict() when skip_save_predictions = False (#962)
- Fix #963: Convert materialized tensors to numpy arrays up front to avoid repeated conversion ()
- Fix errors with DataFrame truth checks in hyperopt (#956)
- Added truncation to HF tokenizer (#978)
- Reimplemented Jaccard Metric for the Set Feature (#979)
- Fix learning rate computation with decay and warmup (#982)
- Fix CLI logger typos (#998, #999)
- Fix loading of split from hdf5 (#1003)
- Fix visualization unit tests (#981)
- Fix concatenate_csv to work with arbitrary read functions and renamed concatenate_datasets
- Fix compatibility issue with matplotlib 3.3.3
- Limit numpy and h5py max versions due to tensorflow 2.3.1 max supported versions (#990)
- Fixed usage of model_load_path with Horovod (#1011)
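The two #956 items in this list (replacing `not` with `is None`, and the DataFrame truth-check errors) share one cause: a pandas DataFrame raises `ValueError` when it is used in a boolean context. The sketch below reproduces the problem with a stand-in class so it runs without pandas; `FrameLike` and both helper functions are hypothetical names, not code from Ludwig.

```python
class FrameLike:
    """Stand-in that mimics how a pandas DataFrame refuses truth-value tests."""

    def __init__(self, rows):
        self.rows = rows

    def __bool__(self):
        raise ValueError("The truth value of a FrameLike is ambiguous.")


def dataset_missing_buggy(dataset):
    # `not dataset` calls bool(dataset), which raises for DataFrame-like objects.
    return not dataset


def dataset_missing_fixed(dataset):
    # An identity check never evaluates the object's truth value.
    return dataset is None
```

With the identity check, a non-empty frame, an empty frame and `None` are all handled without triggering the ambiguity error.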
w4nderlust released this
Improvements
- Full porting to TensorFlow 2.
- New hyperparameter optimization functionality through the hyperopt command.
- Integration with HuggingFace Transformers for pre-trained text encoders.
- Refactored preprocessing with new supported data formats: auto, csv, df, dict, excel, feather, fwf, hdf5 (cache file produced during previous training), html (file containing a single HTML <table>), json, jsonl, parquet, pickle (pickled Pandas DataFrame), sas, spss, stata, tsv.
- Improved validation logic.
- New Transformer encoders for sequential data types (sequence, text, audio, timeseries).
- New batch_predict functionality in the REST API.
- New export command to export to SavedModel and Neuropod.
- New collect_summary command to print out a model summary with layer names.
- Modified the predict command and split it into predict and evaluate. The first only produces predictions, the second evaluates those predictions against ground truth.
- Two new hyperopt-related visualizations: hyperopt_report and hyperopt_hiplot.
- Improved tracking of metrics in TensorBoard.
- Greatly improved test suite.
- Various documentation improvements.
Bugfixes
This release includes a fundamental rewrite of the internals, so many bugs were fixed while rewriting.
This list includes only the ones that have a specific issue associated with them, but many others were addressed.
- Fix #649: Replaced SPLIT with 'split' in example code.
- Fix documentation, wrong parameter name (#684)
- Fix #702: Fixed setting defaults in binary output feature.
- Fix #729: Reduce output was not passed to the sequence encoder inside the sequence combiner.
- Fix #742: Renamed self._learning_rate in Progresstracker.
- Fix #799: Added tf_version to description.json.
- Fix #840: Better messaging for plateau logic.
- Fix #850: Switch from ValueError to Warning to make stratify work on non-output features.
- Fix #844: Load LudwigModel in test_savedmodel before creating saved model.
- Fix #833: loads the model after training and before predicting if the model was saved on disk.
- Fix #933: Added NumpyDecoder before returning JSON response from server.
- Fix #935: Multiple categorical features with different vocabs now work.
Breaking changes
Because of the change in the underlying tensor computation library (TensorFlow 1 to TensorFlow 2) and the internal reworking it required, models trained with v0.2 don't work on v0.3.
We suggest retraining such models. In most cases the same model definition can be used, although one impactful breaking change is that model_definition is now called config, because it contains not only information about the model, but also training, preprocessing, and a newly added hyperopt section.
There have been some changes in the parameters inside the config too.
In particular, one main change is that dropout is now a float value that can be specified for each encoder / combiner / decoder / layer, whereas before it was a boolean parameter.
As a consequence, the dropout_rate parameter in the training section has been removed.
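To make the dropout change concrete, here is a minimal sketch of how an old definition could be rewritten by hand. `migrate_dropout` is a hypothetical helper, not part of Ludwig, and it assumes that the old boolean flags sit directly on the feature and combiner dictionaries and that a missing dropout_rate means 0.0.

```python
import copy


def migrate_dropout(model_definition):
    """Rewrite boolean dropout flags as float rates (illustrative sketch only)."""
    config = copy.deepcopy(model_definition)
    training = config.get("training", {})
    # v0.2 kept one global rate in the training section; 0.0 is an assumed fallback.
    rate = training.pop("dropout_rate", 0.0)

    sections = config.get("input_features", []) + config.get("output_features", [])
    if "combiner" in config:
        sections.append(config["combiner"])

    for section in sections:
        # Only booleans are converted, so values that are already floats are kept.
        if isinstance(section.get("dropout"), bool):
            section["dropout"] = rate if section["dropout"] else 0.0
    return config


old_definition = {
    "input_features": [{"name": "review", "type": "text", "dropout": True}],
    "combiner": {"type": "concat", "dropout": False},
    "output_features": [{"name": "label", "type": "category"}],
    "training": {"dropout_rate": 0.3, "epochs": 10},
}
new_config = migrate_dropout(old_definition)
```

After the call, `new_config` carries `dropout: 0.3` on the text feature, `dropout: 0.0` on the combiner, and no dropout_rate in the training section; the original dictionary is left untouched.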
Another change in training parameters is the set of available optimizers.
TensorFlow 2 doesn't ship with some of the ones that were exposed in Ludwig (adagradda, proximalgd, proximaladagrad), and the momentum optimizer has been removed because momentum is now a parameter of the sgd optimizer.
Newly added optimizers are nadam and adamax.
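The optimizer changes can be checked mechanically before retraining. The sketch below is a hypothetical helper, not Ludwig code; it assumes the optimizer is described by a dictionary with a `type` key and optional parameters such as `momentum`.

```python
REMOVED_OPTIMIZERS = {"adagradda", "proximalgd", "proximaladagrad"}


def migrate_optimizer(optimizer):
    """Return an optimizer section adjusted for the changes described above."""
    migrated = dict(optimizer)  # shallow copy so the caller's dict is not modified
    kind = migrated.get("type")
    if kind in REMOVED_OPTIMIZERS:
        raise ValueError(f"optimizer '{kind}' is no longer available; pick another one")
    if kind == "momentum":
        # momentum is now a parameter of sgd, so only the type needs to change
        migrated["type"] = "sgd"
    return migrated
```

For example, `migrate_optimizer({"type": "momentum", "momentum": 0.9})` yields an sgd section that keeps the momentum value, while a removed optimizer fails early with a clear error instead of at training time.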
Note that the accuracy metric for the combined feature has been removed because it was misleading in some scenarios when multiple features of different types were trained.
In most cases, encoders, combiners and decoders now have an increased number of exposed parameters to play with for increased flexibility.
One notable change is that the previous BERT encoder has been replaced by a HuggingFace-based one with different parameters, and it is now available only for text features.
Please refer to the User Guide for details on each encoder.
Tokenizers also changed substantially, with new parameters supported; refer to the User Guide for more details.
Other major changes are related to the CLI.
The predict command has been split into a simplified predict and a new evaluate. The first only produces predictions, the second evaluates those predictions against ground truth.
Some parameters of all CLI commands changed.
All the different types of data_* parameters have been replaced by generic dataset, training_set, validation_set and test_set parameters. The data format is determined automatically, but can also be set manually with the data_format argument.
There is no gpu_fraction any more, but users can now specify gpu_limit to manage VRAM usage.
For all additional minor changes to the CLI please refer to the User Guide.
The programmatic API changed too, as a consequence.
Now all the parameters closely match the ones of the CLI, including the new dataset and gpu parameters.
Also in this case the predict function has been split into predict and evaluate.
Finally, the return values of most functions changed to include some intermediate processing values, for instance the preprocessed and split data when calling train, the output experiment directory and so on.
Notably, there is now an experiment function in the API too, together with a new hyperopt one.
For more details, refer to the API reference.
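The API changes described in this section can be sketched as follows. The shape of the returned tuples reflects my reading of the v0.3 API and should be treated as an assumption to verify against the API reference; the feature names in `EXAMPLE_CONFIG` and the function `run_v03_workflow` are purely illustrative.

```python
# Illustrative config; the column names "review" and "label" are made up.
EXAMPLE_CONFIG = {
    "input_features": [{"name": "review", "type": "text"}],
    "output_features": [{"name": "label", "type": "category"}],
}


def run_v03_workflow(config, train_data, test_data):
    """Train, predict and evaluate with the v0.3-style API (requires Ludwig 0.3+)."""
    # Imported lazily so this sketch can be loaded without Ludwig installed.
    from ludwig.api import LudwigModel

    model = LudwigModel(config)

    # train returns intermediate values next to the statistics (assumed order).
    train_stats, preprocessed_data, output_dir = model.train(dataset=train_data)

    # predict only produces predictions.
    predictions, _ = model.predict(dataset=test_data)

    # evaluate compares predictions against the ground truth columns.
    eval_stats, _, _ = model.evaluate(dataset=test_data)

    return train_stats, predictions, eval_stats
```

The `dataset` argument accepts a file path or an in-memory object, with the format detected automatically as described for the CLI.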
Contributors
@jimthompson5802 @tgaddair @kaushikb11 @ANarayan @calio @dme65 @ydudin3 @carlogrisetti @ifokeev @flozi00 @soovam123 @KushalP1 @JiByungKyu @stremlau @adiov @martinremy @dsblank @jakobt @vkuzmin-uber @mbzhu1 @moritzebeling @lnxpy
w4nderlust released this
Improvements
Added integration with Weights and Biases.
Added K-Fold cross validation.
Added 4 examples with their respective code and Jupyter Notebooks: Hyper-parameter optimization, K-Fold Cross Validation, MNIST, Titanic.
Greatly improved the measures tracked on the TensorBoard.
Added auto-detect function for field separator when reading CSVs.
Added CI tooling.
Class weights can be specified as a dictionary #615.
Removed deprecation warning from h5py.
Removed most deprecation warnings from TensorFlow.
Bypass multiprocessing.Pool.map for faster execution.
Updated TensorFlow dependency to 1.15.2.
Various documentation improvements.
Bugfixes
Fix cudnn error on RTX GPUs.
Fix inverted confusion_matrix axis.
Fix #201: Removed whitespace as a separator option.
Fix #540: Fixed default text parameters for sampled loss.
Fix #541: Docker image improvements (removed libgmp and spacy model download).
Fix #554: Fix audio input test case in docker container.
Fix #570: Temporary solution for in_memory flag usage in API.
Fix #574: Setting intra and inter op parallelism to 0 so that TF determines them automatically.
Fix #329 and #575: Fixed use of SavedModel and added an integration test.
Fix #609: When predicting, if a split is in the CSV, data is split correctly.
Fix #616: Change preprocessing in siamese network example.
Fix #620: Failure in unit tests for 1 vs all calibration plots.
Fix #632: Setting minimum version requirements for six.
Fix #636: CLI output table column ordering preserved when resuming.
Fix #641: Added multi-task learning section specifying the weight for each output feature in the User Guide.
Fix #642: Fixing horovod use when loading a model as initialization.
Contributors
@jimthompson5802 @calz1 @pingsutw @vanpelt @carlogrisetti @anttisaukko @dsblank @borisdayma @flozi00 @jshah02
w4nderlust released this
Improvements
Add Filter Bank features for audio.
Added two more parameters skip_save_test_predictions and skip_save_test_statistics to train and experiment CLI commands and API.
Updated to spaCy 2.2 with support for Norwegian and Lithuanian tokenizers.
Reorganized dependencies, now the defaults are barebone and there are several extra ones.
Added fc_layers to H3 embed encoder.
Added get_preprocessing_params in preprocessing.
Refactored image features preprocessing to use multiprocessing.
Refactored preprocessing with strategy pattern.
Bugfixes
Fix #452: Removed dependency on gmpy.
Fix #465: Adds capability to set the vocabulary from a Glove file.
Fix #480: Adds a health check to ludwig serve.
Fix #481: Added some examples of visualization commands.
Fix #491: Improved skip parameters, now no directories are created if not needed.
Fix #492: Adds skip saving unprocessed output api.py.
Fix #493: Added parameters for the vocabulary file and the UNK and PAD symbols in sequence feature call to create_vocabulary in the calculation of metadata.
Fix #500: Fixed learning_curves() when the training statistics file does not contain validation.
Fix #509: Fixes in_memory issues in image features.
Fix #525: Adding check is_on_master() before creating save_path directory.
Fix #510: Fixed version of pydantic.
Fix #532: Improved speed of add_sequence_feature_column().
Potentially breaking changes
Fix #520: Renamed field parameter in visualization to output_feature_name for clarity and improved documentation. Please make sure to rename your function calls if you were using this parameter by name (the order stays the same).
Contributors
@sriki18 @carlogrisetti @areeves87 @naresh-bhandari @revolunet @patrickvonplaten @Athanaziz @dsblank @tgaddair @Mechachleopteryx @AlexeyGy @yu-iskw
w4nderlust released this
Improvements
New BERT encoder with its BPE tokenizer
Added Audio features that can be used also for speech data (with appropriate preprocessing feature extraction)
Added H3 feature, together with 3 encoders to deal with spatial information
Added Date feature and two encoders to deal with temporal information
Improved Comet.ml integration
Refactored visualization.py to make individual functions usable from API
Added capability of saving visualization graph in the visualization command and visualizations_utils.py
Added a serve command that allows for spawning a prediction server using FastAPI
Added a test command (that requires output columns in the data) to avoid confusion with predict (which does not require output columns)
Added pixel normalization and pixel standardization scaling options for image features
Added greyscaling of images if specified channels = 1 and img channels is 3 or 4
Added normalization strategies for numerical features (#367)
Added experiment name parameter in the API (#357)
Refactored text tokenizers
Several improvements in logging
Added a method for saving models as SavedModel in model.py and exposed it in the API with a save_for_serving() function (#329) (#425)
Upgraded to the latest version of TensorFlow 1.14 (#429)
Added learning rate warmup for non distributed settings
Bugfixes
Fix #321: Removed the 6n+2 check for ResNet size
Fix #328: adds missing UPDATE_OPS to the optimization operation
Fix #336: GloVe embeddings loading now reads utf-8 encoded files
Fix #336: Addresses the malformed lines issue in embeddings loading
Fix #346: added a parameter indicating if the session should be closed after training in full_train
Fix #351: values in categorical columns are now stripped before being compared to the vocabulary
Fix #364: associate the right function to non english text format functions
Fix #372: set evaluate performance parameter to false in predict.py
Fix #394: Improved error explanation when image dimensions don't match and improved documentation accordingly
Fix #411: Images in HDF5 are now correctly saved as uint8 instead of int8
Fix #431: missing libgmp3-dev dependency in docker (#428)
Fix fixed image resizing
Fix model load path (#424)
Fix batch norm in convolutional layers (now uses tf internal layer and not the one in contrib)
Several additional minor fixes
Contributors
@carlogrisetti @jaipradeesh @glongh @dsblank @danicattaneob @gogasca @lordeddard @IgorWilbert @patrickvonplaten @ojus1 @jimthompson5802 @johnwahba @revolunet @gogasca
w4nderlust released this
Improvements
- Improved import speed by ~50%
- Improved Comet.ml integration
- Replaced only_predict with evaluate_performance (and flipped the logic) in all predict commands and functions
- Refactored preprocessing functions for improved testability, understandability and extensibility
- Added data_dict to the train method in LudwigModel
- Improved tests speed
Bugfixes
- Fix issue #283: word_format in text features is now properly used
- Fix issue #286: avoid using signal when not on main thread
- Fix issue where the order of operations when preprocessing images between resizing and changing channels was inverted
- Fix safety issues: now using yaml.safe_load instead of yaml.load and replaced pickling of the progress tracker with a JSON equivalent
- Fix minor bug with missing tied_weights key in some features
- Fixed a few minor issues discovered with deepsource.io
Other Changes
- Previously LudwigModel was imported from ludwig; it should now be imported from ludwig.api. This change was needed to speed up imports.
Contributors
Watchers:189 |
Star:7583 |
Fork:890 |
Created: 2018-12-28 07:58:12 |
Last commit: 4 days ago |
License: Apache-2.0 |